Distinguishing protein-coding and noncoding genes in the human genome

Cited by: 363
Authors
Clamp, Michele [1]
Fry, Ben
Kamal, Mike
Xie, Xiaohui
Cuff, James
Lin, Michael F.
Kellis, Manolis
Lindblad-Toh, Kerstin
Lander, Eric S.
Affiliations
[1] MIT, Broad Inst, Cambridge, MA 02142 USA
[2] Harvard, Cambridge Ctr 7, Cambridge, MA USA
[3] MIT, Dept Biol, Cambridge, MA 02139 USA
[4] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[5] Whitehead Inst Biomed Res, Cambridge Ctr 9, Cambridge, MA 02142 USA
[6] Harvard Univ, Sch Med, Dept Syst Biol, Boston, MA 02115 USA
Keywords
comparative genomics;
DOI
10.1073/pnas.0709013104
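Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code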
07 ; 0710 ; 09 ;
Abstract
Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of ≈24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to ≈20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.
Pages: 19428-19433
Number of pages: 6