Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome

Cited by: 149
Authors
Margulies, Elliott H. [1]
Cooper, Gregory M.
Asimenos, George
Thomas, Daryl J.
Dewey, Colin N.
Siepel, Adam
Birney, Ewan
Keefe, Damian
Schwartz, Ariel S.
Hou, Minmei
Taylor, James
Nikolaev, Sergey
Montoya-Burgos, Juan I.
Loytynoja, Ari
Whelan, Simon
Pardi, Fabio
Massingham, Tim
Brown, James B.
Bickel, Peter
Holmes, Ian
Mullikin, James C.
Ureta-Vidal, Abel
Paten, Benedict
Stone, Eric A.
Rosenbloom, Kate R.
Kent, W. James
Antonarakis, Stylianos E.
Batzoglou, Serafim
Goldman, Nick
Hardison, Ross
Haussler, David
Miller, Webb
Pachter, Lior
Green, Eric D.
Sidow, Arend
Affiliations
[1] Natl Human Genome Res Inst, Genome Technol Branch, NIH, Bethesda, MD 20892 USA
[2] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[4] Univ Calif Santa Cruz, Dept Biomol Engn, Santa Cruz, CA 95064 USA
[5] Univ Calif Santa Cruz, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[6] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[7] European Bioinformat Inst, Hinxton CB10 1SA, England
[8] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[9] Univ Geneva, Sch Med, Dept Genet Med & Dev, CH-1211 Geneva, Switzerland
[10] Univ Geneva, Fac Sci, Dept Zool & Anim Biol, CH-1211 Geneva, Switzerland
[11] Univ Calif Berkeley, Dept Appl Sci & Engn, Berkeley, CA 94720 USA
[12] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[13] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA
[14] Natl Human Genome Res Inst, NIH Intramural Sequencing Ctr, NIH, Bethesda, MD 20892 USA
[15] Penn State Univ, Huck Inst Life Sci, Ctr Comparat Genom & Bioinformat, University Pk, PA 16802 USA
[16] Univ Calif Santa Cruz, Howard Hughes Med Inst, Santa Cruz, CA 95064 USA
[17] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
[18] Stanford Univ, Dept Pathol, Stanford, CA 94305 USA
[19] Baylor Coll Med, Human Genome Sequencing Ctr, Houston, TX 77030 USA
[20] Baylor Coll Med, Dept Mol & Human Genet, Houston, TX 77030 USA
[21] Washington Univ, Sch Med, Genome Sequencing Ctr, St Louis, MO 63108 USA
[22] Harvard Univ, Broad Inst, Cambridge, MA 02141 USA
[23] MIT, Cambridge, MA 02141 USA
[24] Whitehead Inst Biomed Res, Cambridge, MA 02142 USA
[25] BC Canc Res Ctr, BC Canc Agcy, Canadas Michael Smith Genome Sci Ctr, Vancouver, BC V5Z 4S6, Canada
DOI
10.1101/gr.6034307
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
Pages: 760-774
Page count: 15