A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Cited by: 179

Authors:

Kurtz, Stefan ^{[2]}

Narechania, Apurva ^{[1,3]}

Stein, Joshua C. ^{[1]}

Ware, Doreen ^{[1]}

Affiliations:

[1] Cold Spring Harbor Lab, Cold Spring Harbor, NY 11724 USA

[2] Univ Hamburg, Ctr Bioinformat, D-20146 Hamburg, Germany

[3] Amer Museum Nat Hist, Sackler Inst Comparat Genom, New York, NY 10024 USA

Source:

BMC GENOMICS | 2008 / Volume 9 / Issue 1

Funding:

US National Science Foundation;

DOI:

10.1186/1471-2164-9-517

Chinese Library Classification (CLC) number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification codes:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

Background: The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks.

Results: Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives much greater flexibility in the choice of the k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^{9} bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.

Conclusion: The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.
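To make the idea of k-mer frequency annotation concrete, the following is a minimal Python sketch of counting k-mers in a set of reads and then measuring what fraction of a query sequence is covered by high-copy k-mers. It is not Tallymer's method: Tallymer is based on enhanced suffix arrays, whereas this sketch uses a plain in-memory hash table, which would not scale to billions of bases. The function names `count_kmers` and `repetitive_fraction`, the `min_count` threshold, and the toy reads are all assumptions introduced for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring over an iterable of DNA strings.

    Illustrative hash-based counter (an assumption of this sketch);
    it does not use suffix arrays as Tallymer does.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def repetitive_fraction(seq, counts, k, min_count):
    """Fraction of bases in seq covered by a k-mer seen at least min_count times.

    min_count is a hypothetical threshold separating high-copy from
    low-copy k-mers; the abstract does not specify one.
    """
    seq = seq.upper()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts.get(seq[i:i + k], 0) >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(seq) if seq else 0.0


# Toy data: the first read is internally repetitive, the second is not.
reads = ["ACGTACGTAC", "TTTTGGGG"]
index = count_kmers(reads, k=4)
frac_repeat = repetitive_fraction(reads[0], index, k=4, min_count=2)
frac_unique = repetitive_fraction(reads[1], index, k=4, min_count=2)
```

In this toy setting, a query region whose k-mers recur often in the index would be flagged as likely repeat-derived, while a region made of single-copy k-mers would be treated as low-copy. Real analyses of the kind described in the abstract would build the index once from a large read set and query many sequences against it.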

Pages: 18
