Predicting protein-protein interactions in unbalanced data using the primary structure of proteins

Cited by: 58
Authors
Yu, Chi-Yuan [2]
Chou, Lih-Ching [2]
Chang, Darby Tien-Hao [1]
Affiliations
[1] Natl Cheng Kung Univ, Dept Elect Engn, Tainan 70101, Taiwan
[2] Natl Taiwan Univ, Grad Inst Biomed Elect & Bioinformat, Taipei 106, Taiwan
Source
BMC BIOINFORMATICS | 2010 / Vol. 11
Keywords
INTERACTION NETWORKS; RESOURCE; COMPLEXES;
DOI
10.1186/1471-2105-11-167
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and to understanding the general principles of biological systems. Previous studies have shown that interacting protein pairs can be predicted from their primary structure. Most of these approaches achieve satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. In nature, however, this ratio is highly unbalanced, and these techniques have not been comprehensively evaluated with respect to the large number of non-interacting pairs in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are needed for such challenging tasks.
Results: This study presents a method for PPI prediction based only on sequence information and makes three contributions. First, we propose a probability-based mechanism for transforming protein sequences into feature vectors. Second, the proposed predictor is built on an efficient classification algorithm, since efficiency is essential for handling highly unbalanced datasets. Third, the proposed PPI predictor is assessed on several datasets with different positive-to-negative ratios (from 1:1 to 1:15). This analysis provides solid evidence that the degree of dataset imbalance has an important effect on PPI predictors.
Conclusions: Dealing with data imbalance is a key issue in PPI prediction, since there are far fewer interacting protein pairs than non-interacting ones. This article provides a comprehensive study of this issue and develops a practical tool that achieves both good prediction performance and efficiency using only protein sequence information.
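To illustrate why the positive-to-negative ratio matters when evaluating a predictor, the following minimal sketch computes the precision of a hypothetical classifier whose sensitivity and specificity stay fixed while the dataset ratio moves from 1:1 to 1:15. The function name and the 0.9 values are assumptions chosen for illustration; they are not taken from the paper and do not reproduce its method or results.

```python
def precision_at_ratio(sensitivity: float, specificity: float,
                       negatives_per_positive: float) -> float:
    """Precision of a classifier with fixed sensitivity and specificity
    on a dataset holding `negatives_per_positive` negatives per positive."""
    # Expected true positives per positive pair in the dataset.
    true_pos = sensitivity * 1.0
    # Expected false positives contributed by the negatives.
    false_pos = (1.0 - specificity) * negatives_per_positive
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical operating point (assumption, not a reported result).
    sens, spec = 0.9, 0.9
    for k in (1, 3, 5, 10, 15):
        p = precision_at_ratio(sens, spec, k)
        print(f"ratio 1:{k:<2d} precision = {p:.3f}")
```

Under these assumed values, precision falls from 0.9 at a 1:1 ratio to 0.375 at 1:15, even though the classifier itself has not changed. This is one reason a predictor tuned on balanced data can look much weaker on realistic, unbalanced data.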
Pages: 10