Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome

Cited by: 22
Authors
Becher, Veronica [1]
Deymonnaz, Alejandro [1]
Heiber, Pablo [1]
Affiliations
[1] Univ Buenos Aires, Dept Computat, Fac Ciencias Exactas & Nat, RA-1053 Buenos Aires, DF, Argentina
Keywords
ULTRACONSERVED ELEMENTS; TRANSPOSABLE ELEMENTS; REPUTER;
DOI
10.1093/bioinformatics/btp321
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: There is significant ongoing research to identify the number and types of repetitive DNA sequences. As more genomes are sequenced, efficiency and scalability in computational tools become mandatory. Existing tools fail to find distant repeats because they cannot accommodate whole chromosomes, only segments. Also, a quantitative framework for repetitive elements inside a genome or across genomes is still missing. Results: We present a new efficient algorithm and its implementation as a software tool to compute all perfect repeats in inputs of up to 500 million nucleotide bases, possibly containing many genomes. Our algorithm is based on a suffix array construction and a novel procedure that extracts all perfect repeats in the entire input; the repeats can be arbitrarily distant, and there is no bound on the repeat length. We tested the software on the Homo sapiens DNA genome NCBI 36.49. We computed all perfect repeats of at least 40 bases occurring in any two chromosomes with exact matching. We found that each H. sapiens chromosome shares approximately 10% of its full sequence with every other human chromosome, distributed more or less evenly among the chromosome surfaces. We give statistics including a quantification of repeats by diversity, length and number of occurrences. We compared the computed repeats against all biological repeats currently obtainable from Ensembl enlarged with the output of the dust program and all elements identified by TRF and RepeatMasker (ftp://ftp.ebi.ac.uk/pub/databases/ensembl/jherrero/.repeats/all_repeats.txt.bz2). We report novel repeats as well as new occurrences of repeats matching with known biological elements.
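The abstract's general idea (a suffix array, followed by a pass that extracts exact repeats) can be illustrated with a toy sketch. This is not the authors' algorithm or tool: the function names `suffix_array`, `lcp_array` and `exact_repeats` are hypothetical, the construction is a naive sort that would not scale to genome-sized inputs, and the extraction step only reports shared prefixes of adjacent suffixes without the maximality filtering a real repeat finder would need.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] = length of the longest common prefix of the suffixes
    # starting at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    def common(i, j):
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def exact_repeats(text, min_len):
    # Report (substring, pos_a, pos_b) for every pair of suffixes that are
    # adjacent in the suffix array and share at least min_len characters.
    # Assumption: this is a simplification; it does not enumerate every
    # occurrence and does not check left-maximality.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    found = []
    for k in range(1, len(sa)):
        if lcp[k] >= min_len:
            a, b = sorted((sa[k - 1], sa[k]))
            found.append((text[a:a + lcp[k]], a, b))
    return sorted(found)


if __name__ == "__main__":
    print(exact_repeats("ACGTACGTTT", 3))
```

On the toy input `ACGTACGTTT` with `min_len=3`, the sketch reports `ACGT` at positions 0 and 4 and `CGT` at positions 1 and 5. The second result is contained in the first, which shows why a practical tool needs an additional maximality check.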
Pages: 1746-1753
Page count: 8
Related papers
27 in total
[1]   Abajian C., 1994, Sputnik
[2]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[3]   Ultraconserved elements in the human genome [J].
Bejerano, G ;
Pheasant, M ;
Makunin, I ;
Stephen, S ;
Kent, WJ ;
Mattick, JS ;
Haussler, D .
SCIENCE, 2004, 304 (5675) :1321-1325
[4]   Tandem repeats finder: a program to analyze DNA sequences [J].
Benson, G .
NUCLEIC ACIDS RESEARCH, 1999, 27 (02) :573-580
[5]   Discovering and detecting transposable elements in genome sequences [J].
Bergman, Casey M. ;
Quesneville, Hadi .
BRIEFINGS IN BIOINFORMATICS, 2007, 8 (06) :382-392
[6]   Identification of transposable elements using multiple alignments of related genomes [J].
Caspi, A ;
Pachter, L .
GENOME RESEARCH, 2006, 16 (02) :260-270
[7]   TROLL-Tandem Repeat Occurrence Locator [J].
Castelo, AT ;
Martins, W ;
Gao, GR .
BIOINFORMATICS, 2002, 18 (04) :634-636
[8]   DNA repeats in the human genome [J].
Catasti, P ;
Chen, X ;
Mariappan, SVS ;
Bradbury, EM ;
Gupta, G .
GENETICA, 1999, 106 (1-2) :15-36
[9]   Ultraconserved Elements: Analyses of Dosage Sensitivity, Motifs and Boundaries [J].
Chiang, Charleston W. K. ;
Derti, Adnan ;
Schwartz, Daniel ;
Chou, Michael F. ;
Hirschhorn, Joel N. ;
Wu, C. -ting .
GENETICS, 2008, 180 (04) :2277-2293
[10]   Retrotransposons revisited: The restraint and rehabilitation of parasites [J].
Goodier, John L. ;
Kazazian, Haig H., Jr. .
CELL, 2008, 135 (01) :23-35