Repetitive Elements May Comprise Over Two-Thirds of the Human Genome

Cited by: 756
Authors
de Koning, A. P. Jason [1]
Gu, Wanjun [1]
Castoe, Todd A. [1]
Batzer, Mark A. [2]
Pollock, David D. [1]
Affiliations
[1] Univ Colorado, Sch Med, Dept Biochem & Mol Genet, Aurora, CO USA
[2] Louisiana State Univ, Dept Biol Sci, Baton Rouge, LA 70803 USA
Source
PLOS GENETICS | 2011 / Vol. 7 / Issue 12
Funding
US National Science Foundation; US National Institutes of Health;
Keywords
DE-NOVO IDENTIFICATION; TRANSPOSABLE ELEMENTS; DNA-SEQUENCES; REPEATS; EVOLUTION; DATABASE; CLASSIFICATION; LANDSCAPES; FAMILIES; FEATURES;
DOI
10.1371/journal.pgen.1002384
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo "clouds"). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%-69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (~25 bp). Although short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed "element-specific" P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using it we identified ~100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
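The oligo "cloud" idea mentioned in the abstract can be illustrated with a toy sketch. This is not the P-clouds implementation and makes no claim about how that software works internally. It only shows the general notion of counting oligonucleotides (k-mers), seeding clusters from the most abundant ones, and adding moderately abundant neighbors in sequence space. The function names, the use of Hamming distance 1 as the "related" criterion, and the thresholds `core_min` and `member_min` are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(kmer):
    """Yield all k-mers at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def build_clouds(counts, core_min, member_min):
    """Group abundant k-mers into clouds.

    Hypothetical rule: a k-mer seen at least core_min times seeds a cloud,
    and unassigned Hamming-1 neighbors seen at least member_min times join it.
    """
    assigned = {}  # k-mer -> index of the cloud it belongs to
    clouds = []
    for kmer, n in counts.most_common():  # most abundant seeds first
        if n < core_min:
            break
        if kmer in assigned:
            continue
        cloud_id = len(clouds)
        cloud = {kmer}
        assigned[kmer] = cloud_id
        for nb in neighbors(kmer):
            if nb not in assigned and counts.get(nb, 0) >= member_min:
                cloud.add(nb)
                assigned[nb] = cloud_id
        clouds.append(cloud)
    return clouds, assigned


def mask_repeats(seq, k, assigned):
    """Mark each base that is covered by at least one cloud k-mer."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in assigned:
            for j in range(i, i + k):
                covered[j] = True
    return covered


if __name__ == "__main__":
    toy = "ACGT" * 5 + "GATTACA"  # a short tandem repeat followed by unique sequence
    counts = count_kmers(toy, 4)
    clouds, assigned = build_clouds(counts, core_min=4, member_min=2)
    covered = mask_repeats(toy, 4, assigned)
    print(f"{sum(covered)} of {len(toy)} bases flagged as repetitive")
```

On real genomes the oligo length and abundance cutoffs would need to be chosen against the counts expected by chance, which is why the abstract pairs short-fragment detection with a probabilistic annotation.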
Pages: 12