Genome-wide identification of human functional DNA using a neutral indel model

Cited by: 135
Authors
Lunter, Gerton
Ponting, Chris P.
Hein, Jotun
Affiliations
[1] Univ Oxford, MRC, Funct Genet Unit, Dept Human Anat & Genet, Oxford, England
[2] Univ Oxford, Dept Stat, Bioinformat Grp, Oxford, England
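Funding
Biotechnology and Biological Sciences Research Council (UK); Engineering and Physical Sciences Research Council (UK); Medical Research Council (UK)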
DOI
10.1371/journal.pcbi.0020005
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
It has become clear that a large proportion of functional DNA in the human genome does not code for protein. Identification of this non-coding functional sequence using comparative approaches is proving difficult and has previously been thought to require deep sequencing of multiple vertebrates. Here we introduce a new model and comparative method that, instead of nucleotide substitutions, uses the evolutionary imprint of insertions and deletions (indels) to infer the past consequences of selection. The model predicts the distribution of indels under neutrality, and shows an excellent fit to human-mouse ancestral repeat data. Across the genome, many unusually long ungapped regions are detected that are unaccounted for by the neutral model, and which we predict to be highly enriched in functional DNA that has been subject to purifying selection with respect to indels. We use the model to determine the proportion under indel-purifying selection to be between 2.56% and 3.25% of human euchromatin. Since annotated protein-coding genes comprise only 1.2% of euchromatin, these results lend further weight to the proposition that more than half the functional complement of the human genome is non-protein-coding. The method is surprisingly powerful at identifying selected sequence using only two or three mammalian genomes. Applying the method to the human, mouse, and dog genomes, we identify 90 Mb of human sequence under indel-purifying selection, at a predicted 10% false-discovery rate and 75% sensitivity. As expected, most of the identified sequence represents unannotated material, while the recovered proportions of known protein-coding and microRNA genes closely match the predicted sensitivity of the method. The method's high sensitivity to functional sequence such as microRNAs suggests that as yet unannotated microRNA genes are enriched among the sequences identified.
Furthermore, its independence of substitutions allowed us to identify sequence that has been subject to heterogeneous selection, that is, sequence subject to both positive selection with respect to substitutions and purifying selection with respect to indels. The ability to identify elements under heterogeneous selection enables, for the first time, the genome-wide investigation of positive selection on functional elements other than protein-coding genes.
Pages: 2-12
Page count: 11