Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

Cited by: 14
Authors
Brown, J. B. [1]
Akutsu, Tatsuya [1]
Affiliations
[1] Kyoto Univ, Inst Chem Res, Bioinformat Ctr, Kyoto 6110011, Japan
Source
BMC BIOINFORMATICS | 2009 / Vol. 10
Keywords
SUBCELLULAR LOCATION PREDICTION; FUSION CLASSIFIER; PHOSPHODIESTERASE; LOCALIZATION; RESOURCE; HISTONES; DATABASE; CELLS; MPLOC; PLOC;
DOI
10.1186/1471-2105-10-25
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: DNA repair is the general term for the collection of critical mechanisms that repair many forms of DNA damage, such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair, and identifying and classifying repair proteins with a single optimal method is not straightforward. We study the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, and we scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM).
Results: We find that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when data is filtered by some clustering technique. Applying these methodologies to the scanning of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. Analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes.
Conclusion: Despite complexity due to a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods, in addition to existing homology searches, for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.
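The abstract names "primary sequence polypeptide frequency" as one of the SVM input features but does not define it in this record. The following is a minimal sketch of one plausible reading, assuming the feature is the normalized frequency of every length-k residue string over the 20 standard amino acids. The function name `polypeptide_frequencies`, the parameter `k`, and the example sequence are hypothetical and are not taken from the paper.

```python
from itertools import product

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def polypeptide_frequencies(sequence, k=2):
    """Return normalized frequencies of all length-k residue strings.

    Assumed definition (not confirmed by the source): slide a window of
    width k over the sequence, count each k-mer, and divide by the number
    of counted windows. The result has 20**k entries in a fixed order.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing non-standard residues such as X or B.
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Hypothetical short peptide used only to exercise the function.
features = polypeptide_frequencies("MKVLAAGIVKK", k=2)
```

A vector like `features` could then be concatenated with secondary structure and homology features before being passed to an SVM; that step is omitted here because the record does not describe how the paper combines them.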
Pages: 22