Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Cited by: 111

Authors:

Wu, Jiansheng [1]

Liu, Hongde [1]

Duan, Xueye [1]

Ding, Yan [1]

Wu, Hongtao [1]

Bai, Yunfei [1]

Sun, Xiao [1]

Affiliations:

[1] Southeast Univ, Sch Biol Sci & Med Engn, State Key Lab Bioelect, Nanjing 210096, Peoples R China

Source:

BIOINFORMATICS | 2009 / Vol. 25 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

SECONDARY STRUCTURE; SITES;

DOI:

10.1093/bioinformatics/btn583

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information, which reflects the characteristics of the 20 kinds of amino acids for two physical-chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class.

Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein-DNA interactions.
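The abstract mentions two generic techniques that can be illustrated briefly: downsizing the majority class to balance a dataset, and scoring predictions with the Matthews correlation coefficient. The sketch below is a minimal illustration only. The abstract does not describe the authors' actual downsizing scheme, so plain random undersampling is swapped in as an assumption, and the function names `downsample_majority` and `matthews_corrcoef` are hypothetical helpers, not code from the paper.

```python
import math
import random


def downsample_majority(samples, labels, seed=0):
    """Randomly drop majority-class items until both classes have equal size.

    Assumption: plain random undersampling, not the paper's own scheme.
    Labels are 1 (binding residue) or 0 (non-binding residue).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority item plus an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: 2 binding residues among 8, standing in for real features.
    residues = list(range(8))
    labels = [1, 0, 0, 0, 0, 1, 0, 0]
    balanced_x, balanced_y = downsample_majority(residues, labels)
    print(balanced_x, balanced_y)
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

The Matthews correlation coefficient is a natural companion to overall accuracy here because, unlike accuracy, it stays informative when one class dominates the data.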

Citation

Pages: 30-35

Page count: 6

