Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes

Cited by: 315
Authors
Ezkurdia, Iakes [1]
Juan, David [3]
Manuel Rodriguez, Jose [4]
Frankish, Adam [5]
Diekhans, Mark [6]
Harrow, Jennifer [5]
Vazquez, Jesus [2]
Valencia, Alfonso [3, 4]
Tress, Michael L. [3]
Affiliations
[1] CNIC, Unidad Prote, Madrid 28029, Spain
[2] CNIC, Lab Prote Cardiovasc, Madrid 28029, Spain
[3] Spanish Natl Canc Res Ctr CNIO, Struct Biol & Bioinformat Programme, Madrid 28029, Spain
[4] Spanish Natl Canc Res Ctr CNIO, INB, Madrid 28029, Spain
[5] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
[6] Univ Calif Santa Cruz, Sch Engn, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
Funding
US National Institutes of Health;
Keywords
PREDICTION; PROTEOMICS; DATABASE; GENOME; ANNOTATION; SEQUENCES; TOPOLOGY; ALGORITHM;
DOI
10.1093/hmg/ddu309
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Pages: 5866-5878
Number of pages: 13