Clustering evolving proteins into homologous families

Cited by: 9
Authors
Chan, Cheong Xin [1, 2]
Mahbob, Maisarah [3]
Ragan, Mark A. [1, 2]
Affiliations
[1] Univ Queensland, Inst Mol Biosci, Brisbane, Qld 4072, Australia
[2] Australian Res Council Ctr Excellence Bioinformat, Brisbane, Qld 4072, Australia
[3] Univ Queensland, Sch Chem & Mol Biosci, Brisbane, Qld 4072, Australia
Source
BMC BIOINFORMATICS | 2013 / Vol. 14
Funding
Australian Research Council
Keywords
MAXIMUM-LIKELIHOOD; MICROBIAL GENOMES; EVOLUTION; ALGORITHMS; SIMILARITY;
DOI
10.1186/1471-2105-14-120
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches.
Results: Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment.
Conclusions: Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
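For readers unfamiliar with the graph-based Markov clustering mentioned in the abstract, the following is a minimal sketch of the general MCL idea (alternating expansion and inflation of a column-stochastic matrix). It is not the `mcl` program or the pipeline used in the study. The function name `mcl`, its parameters, the default inflation of 2.0, and the toy similarity scores are all assumptions made for illustration.

```python
def _normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _matmul(a, b):
    """Plain square-matrix product, adequate for toy-sized inputs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a weighted, symmetric similarity graph (sketch only).

    similarity: square list of lists of non-negative pairwise scores
                (hypothetical values, e.g. derived from alignment scores).
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(similarity)
    # Add self-loops, then convert the weights to transition probabilities.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = _normalize_columns(m)
    for _ in range(max_iter):
        expanded = _matmul(m, m)  # expansion: flow spreads along paths
        inflated = _normalize_columns(  # inflation: strong flow is boosted
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow is an attractor; the columns it
    # reaches form one cluster. Identical sets are merged via the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: two tightly linked groups of three proteins, joined by one
# weak similarity between protein 2 and protein 3 (made-up scores).
similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
families = mcl(similarity)
```

In this sketch a larger `inflation` value tends to produce smaller, tighter clusters, which is the kind of parameter setting the abstract refers to when it says results are comparable to the reference "at specific parameter settings".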
Pages: 11