GENCODE: The reference human genome annotation for The ENCODE Project

Cited by: 3292
Authors
Harrow, Jennifer [1]
Frankish, Adam [1]
Gonzalez, Jose M. [1]
Tapanari, Electra [1]
Diekhans, Mark [2]
Kokocinski, Felix [1]
Aken, Bronwen L. [1]
Barrell, Daniel [1]
Zadissa, Amonida [1]
Searle, Stephen [1]
Barnes, If [1]
Bignell, Alexandra [1]
Boychenko, Veronika [1]
Hunt, Toby [1]
Kay, Mike [1]
Mukherjee, Gaurab [1]
Rajan, Jeena [1]
Despacio-Reyes, Gloria [1]
Saunders, Gary [1]
Steward, Charles [1]
Harte, Rachel [2]
Lin, Michael [3]
Howald, Cedric [4]
Tanzer, Andrea [5, 6]
Derrien, Thomas [4]
Chrast, Jacqueline [4]
Walters, Nathalie [4]
Balasubramanian, Suganthi [7]
Pei, Baikang [7]
Tress, Michael [8]
Manuel Rodriguez, Jose [8]
Ezkurdia, Iakes [8]
van Baren, Jeltje [9]
Brent, Michael [9]
Haussler, David [2]
Kellis, Manolis [3]
Valencia, Alfonso [8]
Reymond, Alexandre [4]
Gerstein, Mark [7]
Guigo, Roderic [5, 6]
Hubbard, Tim J. [1]
Affiliations
[1] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
[2] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[3] MIT, Cambridge, MA 02139 USA
[4] Univ Lausanne, Ctr Integrat Genom, CH-1015 Lausanne, Switzerland
[5] Ctr Genom Regulat CRG, Barcelona 08003, Catalonia, Spain
[6] UPF, Barcelona 08003, Catalonia, Spain
[7] Yale Univ, New Haven, CT 06520 USA
[8] Spanish Natl Canc Res Ctr CNIO, E-28029 Madrid, Spain
[9] Ctr Genome Sci & Syst Biol, St Louis, MO 63130 USA
Funding
Wellcome Trust (UK); US National Institutes of Health; US National Science Foundation
Keywords
GENE-EXPRESSION; NONCODING RNAS; IDENTIFICATION; SEQUENCES; REVEALS; PSEUDOGENE; PREDICTION; TOPOLOGY; TRANSCRIPTION; COMPLEXITY
DOI
10.1101/gr.135350.111
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available, with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Pages: 1760-1774
Page count: 15