Using quality scores and longer reads improves accuracy of Solexa read mapping

Cited by: 184
Authors
Smith, Andrew D. [1]
Xuan, Zhenyu [1]
Zhang, Michael Q. [1]
Affiliations
[1] Cold Spring Harbor Lab, Cold Spring Harbor, NY 11724 USA
DOI
10.1186/1471-2105-9-128
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25 to 50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.
Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at http://rulai.cshl.edu/rmap/.
Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
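As a rough illustration of the idea in the abstract, that quality scores can decide which read positions count when mapping, the sketch below ignores mismatches at low-quality positions during an ungapped scan of a reference. It is not RMAP's implementation: the function names, the `min_quality` cutoff, the brute-force scan, and the assumption that scores are integers where higher means more reliable are all illustrative choices made here.

```python
def quality_aware_mismatches(read, quals, window, min_quality=20):
    """Count mismatches, but only at positions whose quality is >= min_quality.

    Hypothetical helper: low-quality positions are treated as wildcards.
    """
    if not (len(read) == len(quals) == len(window)):
        raise ValueError("read, quals and window must have equal length")
    return sum(
        1
        for base, q, ref_base in zip(read, quals, window)
        if q >= min_quality and base != ref_base
    )


def map_read(read, quals, reference, max_mismatches=1, min_quality=20):
    """Return (position, mismatches) for the unique best ungapped placement.

    Returns None when no placement is within max_mismatches or when the
    best score is shared by several positions (ambiguous mapping).
    """
    best_pos, best_mm, ambiguous = None, None, False
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = quality_aware_mismatches(read, quals, window, min_quality)
        if mm > max_mismatches:
            continue
        if best_mm is None or mm < best_mm:
            best_pos, best_mm, ambiguous = pos, mm, False
        elif mm == best_mm:
            ambiguous = True
    if best_pos is None or ambiguous:
        return None
    return best_pos, best_mm


# Toy data (made up for illustration): the last base of the read has low quality.
reference = "GGGACGTACGTCCC"
read = "ACGAACGA"
quals = [30, 30, 30, 30, 30, 30, 30, 5]

print(map_read(read, quals, reference, max_mismatches=1, min_quality=20))
print(map_read(read, quals, reference, max_mismatches=1, min_quality=0))
```

In this toy case the read maps only when the low-quality 3' base is discounted. With `min_quality=0` every position counts and the read exceeds the mismatch limit, which is the kind of trade-off the abstract describes for the error-prone 3' ends of longer reads.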
Pages: 8