Using quality scores and longer reads improves accuracy of Solexa read mapping

Cited by: 184

Authors:

Smith, Andrew D. [1]

Xuan, Zhenyu [1]

Zhang, Michael Q. [1]

Affiliations:

[1] Cold Spring Harbor Lab, Cold Spring Harbor, NY 11724 USA

Source:

BMC BIOINFORMATICS | 2008 / Vol. 9 / Issue 1

Keywords:

DOI:

10.1186/1471-2105-9-128

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g., ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform the associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25 to 50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/.

Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
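The idea of letting base-call quality scores decide which read positions matter can be illustrated with a minimal sketch. This is not RMAP's implementation: the function names, the quality cutoff of 20, and the brute-force scan over the reference are all assumptions made for illustration. Positions whose quality falls below the cutoff are simply ignored when counting mismatches.

```python
def count_mismatches(read, quals, window, min_quality=20):
    """Count mismatches between a read and a same-length reference window,
    skipping read positions whose quality is below min_quality.
    (Hypothetical helper; the cutoff value is an assumption.)"""
    if not (len(read) == len(quals) == len(window)):
        raise ValueError("read, quals and window must have equal length")
    return sum(
        1
        for base, q, ref in zip(read, quals, window)
        if q >= min_quality and base != ref
    )


def map_read(read, quals, reference, max_mismatches=2, min_quality=20):
    """Return every ungapped start position that ties for the fewest
    quality-aware mismatches, provided that count is <= max_mismatches.
    An empty list means the read could not be placed."""
    n = len(read)
    scored = [
        (count_mismatches(read, quals, reference[pos:pos + n], min_quality), pos)
        for pos in range(len(reference) - n + 1)
    ]
    eligible = [(score, pos) for score, pos in scored if score <= max_mismatches]
    if not eligible:
        return []
    best = min(score for score, _ in eligible)
    return [pos for score, pos in eligible if score == best]


# Toy example: the last base of the read is a low-quality miscall.
reference = "AAAACCCCGGGGTTTT"
read = "CCGGGGTA"
quals = [30, 30, 30, 30, 30, 30, 30, 5]

# Ignoring the low-quality 3' base lets the read map with zero mismatches.
hits_with_quality = map_read(read, quals, reference, max_mismatches=0)
# Counting every position (cutoff 0) leaves no exact placement.
hits_without_quality = map_read(read, quals, reference, max_mismatches=0, min_quality=0)
```

In this toy case the miscalled 3' base blocks an exact match unless its low quality score is taken into account, which mirrors the kind of gain the abstract describes for longer reads.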

References

Pages: 8

15 entries in total

[1] BASIC LOCAL ALIGNMENT SEARCH TOOL [J].

ALTSCHUL, SF ;

GISH, W ;

MILLER, W ;

MYERS, EW ;

LIPMAN, DJ .

JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410

[2]

[Anonymous], MAQ MAPPING ASSEMBLY

[3] Fast and practical approximate string matching [J].

Baeza-Yates, RA ;

Perleberg, CH .

INFORMATION PROCESSING LETTERS, 1996, 59 (01) :21-27

[4] High-resolution profiling of histone methylations in the human genome [J].

Barski, Artem ;

Cuddapah, Suresh ;

Cui, Kairong ;

Roh, Tae-Young ;

Schones, Dustin E. ;

Wang, Zhibin ;

Wei, Gang ;

Chepelev, Iouri ;

Zhao, Keji .

CELL, 2007, 129 (04) :823-837

[5] Whole-genome re-sequencing [J].

Bentley, David R. .

CURRENT OPINION IN GENETICS & DEVELOPMENT, 2006, 16 (06) :545-552

[6] Base-calling of automated sequencer traces using phred. II. Error probabilities [J].

Ewing, B ;

Green, P .

GENOME RESEARCH, 1998, 8 (03) :186-194

[7] Base-calling of automated sequencer traces using phred. I. Accuracy assessment [J].

Ewing, B ;

Hillier, L ;

Wendl, MC ;

Green, P .

GENOME RESEARCH, 1998, 8 (03) :175-185

[8]

Gusfield D., 1997, ALGORITHMS STRINGS T

[9] PatternHunter: faster and more sensitive homology search [J].

Ma, B ;

Tromp, J ;

Li, M .

BIOINFORMATICS, 2002, 18 (03) :440-445

[10]

MARGULIES M, 2005, NATURE, V437, P376
