The predictive power of the CluSTr database

Cited by: 34
Authors
Petryszak, R [1]
Kretschmann, E [1]
Wieser, D [1]
Apweiler, R [1]
Affiliations
[1] EMBL, Outstn Hinxton, EBI, Hinxton CB10 1SD, Cambs, England
Funding
US National Institutes of Health (NIH)
DOI
10.1093/bioinformatics/bti542
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
The CluSTr database employs a fully automatic single-linkage hierarchical clustering method based on a similarity matrix. To compute the matrix, all-against-all pairwise comparisons between protein sequences are first performed using the Smith-Waterman algorithm. The statistical significance of the similarity scores is then assessed by Monte Carlo analysis, yielding Z-values that populate the matrix. This paper describes automated annotation experiments that quantify the predictive power, and hence the biological relevance, of the CluSTr data. The experiments used the UniProt data-mining framework to derive annotation predictions from combinations of InterPro and CluSTr data. We show that combining these data sources greatly increases the precision of the framework's predictions compared with using InterPro data alone. We conclude that the CluSTr approach to clustering proteins makes a valuable contribution to traditional protein classifications.
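The pipeline summarized in the abstract (Smith-Waterman scores, Monte Carlo Z-values, single-linkage clustering) can be illustrated with a minimal Python sketch. This is not the CluSTr implementation: the match/mismatch/gap scores, the number of shuffles, the handling of zero variance, and all function names (`sw_score`, `z_value`, `single_linkage_clusters`) are assumptions chosen for illustration, and a real system would use a substitution matrix and affine gap penalties.

```python
import random
from statistics import mean, pstdev


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_value(a, b, n_shuffles=100, rng=None):
    """Monte Carlo Z-value: (observed - mean) / std over shuffles of b."""
    rng = rng or random.Random(0)  # fixed seed, assumed for reproducibility
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, "".join(letters)))
    sigma = pstdev(shuffled_scores)
    if sigma == 0:
        return 0.0  # simplification: no spread, so no significance estimate
    return (observed - mean(shuffled_scores)) / sigma


def single_linkage_clusters(names, z, threshold):
    """Single-linkage clusters at one Z threshold.

    Two proteins share a cluster if a chain of pairs with Z >= threshold
    connects them, i.e. clusters are connected components (union-find).
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (p, q), value in z.items():
        if value >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())
```

Calling `single_linkage_clusters` repeatedly with decreasing thresholds merges clusters step by step, which is one way to read the hierarchy that single-linkage clustering produces; the thresholds themselves are left to the caller here because the abstract does not state them.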
Pages: 3604-3609
Page count: 6