Improving the specificity of high-throughput ortholog prediction

Cited by: 69
Authors
Fulton, Debra L.
Li, Yvonne Y.
Laird, Matthew R.
Horsman, Benjamin G. S.
Roche, Fiona M.
Brinkman, Fiona S. L. [1]
Affiliations
[1] Simon Fraser Univ, Dept Mol Biol & Biochem, Burnaby, BC V5A 1S6, Canada
[2] Univ British Columbia, Dept Med Genet, Vancouver, BC, Canada
[3] Canada Michael Smith Genome Sci Ctr, Vancouver, BC, Canada
DOI
10.1186/1471-2105-7-270
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) - identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida-Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http://www.pathogenomics.ca/ortholuge/ (software under GNU General Public License).
Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
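The two steps described in the abstract (RBH prediction, then a distance-ratio check against an outgroup) can be illustrated with a minimal Python sketch. It is not the Ortholuge implementation. The abstract does not state the exact ratios or cutoffs used, so the ratio definitions here (ingroup-to-ingroup distance divided by each ingroup-to-outgroup distance), the function names, and the `cutoff` parameter are all assumptions made for illustration. Input scores and distances are assumed to have been computed elsewhere.

```python
from typing import Dict, Set, Tuple

Hits = Dict[str, Dict[str, float]]  # query gene -> {subject gene: similarity score}


def reciprocal_best_hits(a_to_b: Hits, b_to_a: Hits) -> Set[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's top-scoring hit.

    Ties are broken by dictionary order, which is a simplification.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in a_to_b.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in b_to_a.items() if hits}
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def distance_ratios(d_in1_in2: float, d_in1_out: float, d_in2_out: float) -> Tuple[float, float]:
    """Assumed ratio definitions: ingroup distance relative to each outgroup distance.

    For genes that diverged with the species, the ingroup distance should be
    small compared with the distance to the outgroup gene, giving low ratios.
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("outgroup distances must be positive")
    return d_in1_in2 / d_in1_out, d_in1_in2 / d_in2_out


def flag_atypical(
    ratios: Dict[Tuple[str, str], Tuple[float, float]], cutoff: float
) -> Set[Tuple[str, str]]:
    """Flag predicted pairs whose ratios exceed a hypothetical, user-chosen cutoff."""
    return {pair for pair, (r1, r2) in ratios.items() if r1 > cutoff or r2 > cutoff}


if __name__ == "__main__":
    a_to_b = {"a1": {"b1": 200.0, "b2": 150.0}, "a2": {"b1": 180.0, "b2": 90.0}}
    b_to_a = {"b1": {"a1": 210.0, "a2": 170.0}, "b2": {"a1": 140.0, "a2": 95.0}}
    print(reciprocal_best_hits(a_to_b, b_to_a))
    print(distance_ratios(1.0, 4.0, 5.0))
```

A pair in which a paralog was picked up because the true ortholog is missing would be expected to show an ingroup distance close to, or larger than, the outgroup distances, which is the kind of atypical relative divergence the abstract refers to. In practice a cutoff would have to be derived from the observed ratio distribution for the species being compared, not fixed in advance.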
Pages: 16