Dfam: a database of repetitive DNA based on profile hidden Markov models

Cited by: 220
Authors
Wheeler, Travis J.
Clements, Jody
Eddy, Sean R.
Hubley, Robert [1]
Jones, Thomas A.
Jurka, Jerzy [2]
Smit, Arian F. A. [1]
Finn, Robert D.
Affiliations
[1] Inst Syst Biol, Seattle, WA 98109 USA
[2] Genet Informat Res Inst, Mountain View, CA 94043 USA
Funding
National Institutes of Health (USA);
Keywords
PROGRAM;
DOI
10.1093/nar/gks1265
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross_match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps.
Pages: D70-D82
Number of pages: 13