Discovery and revision of Arabidopsis genes by proteogenomics

Cited by: 207
Authors
Castellana, Natalie E. [1]
Payne, Samuel H. [2]
Shen, Zhouxin [3]
Stanke, Mario [4]
Bafna, Vineet [1]
Briggs, Steven P. [3]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[2] Univ Calif San Diego, Bioinformat Program, La Jolla, CA 92093 USA
[3] Univ Calif San Diego, Div Biol, La Jolla, CA 92093 USA
[4] Inst Microbiol & Genet, D-37077 Gottingen, Germany
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
annotation; genomics; proteomics;
DOI
10.1073/pnas.0811066106
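Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract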
Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
Pages: 21034-21038
Number of pages: 5