A space-efficient construction of the Burrows-Wheeler transform for genomic data

被引：17

作者：

Lippert, RA

Mobarry, CM

Walenz, BP

机构：

[1] MIT, Dept Math, Cambridge, MA 02139 USA

[2] White Oak Technol, Silver Spring, MD 20910 USA

[3] J Craig Venter Inst, Rockville, MD 20850 USA

来源：

JOURNAL OF COMPUTATIONAL BIOLOGY | 2005年 / 12卷 / 07期

关键词：

Burrows-Wheeler; suffix array; string matching; comparative genomics;

D O I：

10.1089/cmb.2005.12.943

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows - Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used ( 1+ 2 log(2) |Sigma|)( 1+ epsilon) bits per character ( where epsilon is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Sigma| n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 ( 1+|Sigma|)( 1+ epsilon) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.

引用

页码：943 / 951

页数：9

共 14 条

[1] Abouelhoda MI, 2002, LECT NOTES COMPUT SC, V2452, P449
[2] [Anonymous], 1990, MIT ELECT ENG COMPUT
[3] Burrows M, 1994, 124 DIG SRC
[4] Opportunistic data structures with applications
Ferragina, P
Manzini, G
[J]. 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, : 390 - 398
[5] Grossi R., 2000, Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing, P397, DOI 10.1145/335305.335351
[6] Annotating large genomes with exact word matches
Healy, J
Thomas, EE
Schwartz, JT
Wigler, M
[J]. GENOME RESEARCH, 2003, 13 (10) : 2306 - 2315
[7] Hon WK, 2003, LECT NOTES COMPUT SC, V2906, P240
[8] Kärkkäinen J, 2003, LECT NOTES COMPUT SC, V2719, P943
[9] Kurtz S, 1999, SOFTWARE PRACT EXPER, V29, P1149, DOI 10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO
[10] 2-O

← 1 2 →