Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes

Cited by: 315
Authors
Ezkurdia, Iakes [1]
Juan, David [3]
Manuel Rodriguez, Jose [4]
Frankish, Adam [5]
Diekhans, Mark [6]
Harrow, Jennifer [5]
Vazquez, Jesus [2]
Valencia, Alfonso [3, 4]
Tress, Michael L. [3]
Affiliations
[1] CNIC, Unidad Prote, Madrid 28029, Spain
[2] CNIC, Lab Prote Cardiovasc, Madrid 28029, Spain
[3] Spanish Natl Canc Res Ctr CNIO, Struct Biol & Bioinformat Programme, Madrid 28029, Spain
[4] Spanish Natl Canc Res Ctr CNIO, INB, Madrid 28029, Spain
[5] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
[6] Univ Calif Santa Cruz, Sch Engn, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
Funding
US National Institutes of Health;
Keywords
PREDICTION; PROTEOMICS; DATABASE; GENOME; ANNOTATION; SEQUENCES; TOPOLOGY; ALGORITHM;
DOI
10.1093/hmg/ddu309
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Pages: 5866-5878
Number of pages: 13