A protein classification benchmark collection for machine learning

Cited by: 34
Authors
Sonego, Paolo
Pacurar, Mircea
Dhir, Somdutta
Kertesz-Farkas, Attila
Kocsor, Andras
Gaspari, Zoltan
Leunissen, Jack A. M.
Pongor, Sandor
Affiliations
[1] Int Ctr Genet Engn & Biotechnol, Prot Struct & Bioinformat Grp, I-34012 Trieste, Italy
[2] Univ Szeged, H-6720 Szeged, Hungary
[3] Hungarian Acad Sci, Res Grp Artificial Intelligence, H-6720 Szeged, Hungary
[4] Eotvos Lorand Univ, Inst Chem, H-1117 Budapest, Hungary
[5] Hungarian Acad Sci, Biol Res Ctr, Bioinformat Grp, H-6701 Szeged, Hungary
[6] Univ Wageningen & Res Ctr, Lab Bioinformat, NL-6700 ET Wageningen, Netherlands
DOI
10.1093/nar/gkl812
Chinese Library Classification (CLC)
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Protein classification by machine learning algorithms is now widely used in the structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative and training/test sets in several ways. There are 6405 classification tasks in total: 3297 on protein sequences, 3095 on protein structures and 13 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds and superfamilies). For each dataset, distance matrices are available that contain all vs. all comparisons of the data, based on various sequence or structure comparison methods, together with a set of classification performance measures computed with various classifier algorithms.
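To make the abstract's description concrete, the following is a minimal sketch of how one classification task of this kind could be evaluated from a precomputed all vs. all distance matrix and a positive/negative, training/test split. Everything in it is an assumption for illustration: the toy data, the function names `nearest_neighbor_scores` and `roc_auc`, the choice of a nearest-neighbour scorer and the choice of ROC AUC as the performance measure. It does not reproduce the collection's file formats or its actual classifiers.

```python
import numpy as np


def nearest_neighbor_scores(dist, train_idx, train_labels, test_idx):
    """Score test items from a precomputed distance matrix.

    The score is (distance to nearest negative training item) minus
    (distance to nearest positive training item), so a higher score
    means the item looks more like the positive class.
    """
    dist = np.asarray(dist, dtype=float)
    train_idx = np.asarray(train_idx)
    train_labels = np.asarray(train_labels, dtype=bool)
    test_idx = np.asarray(test_idx)
    # Rows: test items, columns: training items.
    sub = dist[np.ix_(test_idx, train_idx)]
    nearest_pos = sub[:, train_labels].min(axis=1)
    nearest_neg = sub[:, ~train_labels].min(axis=1)
    return nearest_neg - nearest_pos


def roc_auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs ranked correctly, with ties counted as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative item")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Hypothetical toy data: six items on a line, two well-separated groups.
points = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
dist = np.abs(points[:, None] - points[None, :])  # all vs. all distances

train_idx = [0, 1, 3, 4]
train_labels = [True, True, False, False]  # positive / negative training set
test_idx = [2, 5]
test_labels = [True, False]                # positive / negative test set

scores = nearest_neighbor_scores(dist, train_idx, train_labels, test_idx)
auc = roc_auc(scores, test_labels)
print(f"ROC AUC on the toy task: {auc:.2f}")
```

Because the scorer works only on the distance matrix, the same sketch would apply unchanged to sequence-based or structure-based distances, which is the practical reason for distributing precomputed matrices with each dataset.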
Pages: D232-D236
Page count: 5