Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome

Cited by: 149
Authors
Margulies, Elliott H. [1]
Cooper, Gregory M.
Asimenos, George
Thomas, Daryl J.
Dewey, Colin N.
Siepel, Adam
Birney, Ewan
Keefe, Damian
Schwartz, Ariel S.
Hou, Minmei
Taylor, James
Nikolaev, Sergey
Montoya-Burgos, Juan I.
Loytynoja, Ari
Whelan, Simon
Pardi, Fabio
Massingham, Tim
Brown, James B.
Bickel, Peter
Holmes, Ian
Mullikin, James C.
Ureta-Vidal, Abel
Paten, Benedict
Stone, Eric A.
Rosenbloom, Kate R.
Kent, W. James
Antonarakis, Stylianos E.
Batzoglou, Serafim
Goldman, Nick
Hardison, Ross
Haussler, David
Miller, Webb
Pachter, Lior
Green, Eric D.
Sidow, Arend
Affiliations
[1] Natl Human Genome Res Inst, Genome Technol Branch, NIH, Bethesda, MD 20892 USA
[2] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[4] Univ Calif Santa Cruz, Dept Biomol Engn, Santa Cruz, CA 95064 USA
[5] Univ Calif Santa Cruz, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[6] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[7] European Bioinformat Inst, Hinxton CB10 1SA, England
[8] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[9] Univ Geneva, Sch Med, Dept Genet Med & Dev, CH-1211 Geneva, Switzerland
[10] Univ Geneva, Fac Sci, Dept Zool & Anim Biol, CH-1211 Geneva, Switzerland
[11] Univ Calif Berkeley, Dept Appl Sci & Engn, Berkeley, CA 94720 USA
[12] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[13] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA
[14] Natl Human Genome Res Inst, NIH Intramural Sequencing Ctr, NIH, Bethesda, MD 20892 USA
[15] Penn State Univ, Huck Inst Life Sci, Ctr Comparat Genom & Bioinformat, University Pk, PA 16802 USA
[16] Univ Calif Santa Cruz, Howard Hughes Med Inst, Santa Cruz, CA 95064 USA
[17] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
[18] Stanford Univ, Dept Pathol, Stanford, CA 94305 USA
[19] Baylor Coll Med, Human Genome Sequencing Ctr, Houston, TX 77030 USA
[20] Baylor Coll Med, Dept Mol & Human Genet, Houston, TX 77030 USA
[21] Washington Univ, Sch Med, Genome Sequencing Ctr, St Louis, MO 63108 USA
[22] Harvard Univ, Broad Inst, Cambridge, MA 02141 USA
[23] MIT, Cambridge, MA 02141 USA
[24] Whitehead Inst Biomed Res, Cambridge, MA 02142 USA
[25] BC Canc Res Ctr, BC Canc Agcy, Canadas Michael Smith Genome Sci Ctr, Vancouver, BC V5Z 4S6, Canada
DOI
10.1101/gr.6034307
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Pages: 760-774
Number of pages: 15