Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies

Cited by: 51
Authors
Florea, Liliana [1]
Souvorov, Alexander [2]
Kalbfleisch, Theodore S. [3]
Salzberg, Steven L. [1]
Affiliations
[1] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
[2] NIH, Natl Ctr Biotechnol Informat, Bethesda, MD 20892 USA
[3] Univ Louisville, Ctr Genet & Mol Med, Louisville, KY 40292 USA
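Funding
US National Institute of Food and Agriculture; US National Institutes of Health;
Keywords
SEQUENCE; PROGRAM; NUMBER; BLAST;
DOI
10.1371/journal.pone.0021400
Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
070301 [Inorganic chemistry]; 070403 [Astrophysics]; 070507 [Natural resources and territorial spatial planning]; 090105 [Crop production systems and ecological engineering];
Abstract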
Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12-20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6-15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.
Pages: 10