Basecalling with LifeTrace

Cited by: 15
Authors
Walther, D [1]
Bartha, G [1]
Morris, M [1]
Affiliations
[1] Incyte Genom Inc, Palo Alto, CA 94304 USA
DOI
10.1101/gr.177901
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
A pivotal step in electrophoresis sequencing is the conversion of the raw, continuous chromatogram data into the actual sequence of discrete nucleotides, a process referred to as basecalling. We describe a novel algorithm for basecalling implemented in the program LifeTrace. Like Phred, currently the most widely used basecalling software program, LifeTrace takes processed trace data as input. It was designed to be tolerant of variable peak spacing by means of an improved peak-detection algorithm that emphasizes local chromatogram information over global properties. LifeTrace is shown to generate high-quality basecalls and reliable quality scores. It proved particularly effective when applied to data from MegaBACE capillary sequencing machines. In a benchmark test of 8372 dye-primer MegaBACE chromatograms, LifeTrace generated 17% fewer substitution errors, 16% fewer insertion/deletion errors, and 2.4% more bases aligned to the finished sequence than did Phred. For two sets totaling 6624 dye-terminator chromatograms, the performance improvement was 15% fewer substitution errors, 10% fewer insertion/deletion errors, and 2.1% more aligned bases. The processing time required by LifeTrace is comparable to that of Phred. The predicted quality scores were in line with observed quality scores, permitting direct use for quality clipping and in silico single nucleotide polymorphism (SNP) detection. Furthermore, we introduce a new type of quality score associated with every basecall: the gap-quality. It estimates the probability of a deletion error between the current and the following basecall. This additional quality score improves detection of single-base-pair deletions when used for locating potential basecalling errors during the alignment. We also describe a new protocol for benchmarking that we believe better discerns basecaller performance differences than methods previously published.
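The abstract does not say how quality scores are encoded or how clipping is performed, so the following is only a minimal sketch under stated assumptions. It assumes the conventional Phred scale, Q = -10 * log10(P_error), for both the per-base quality and the gap-quality. The function names, the sliding-window clipping rule, and the thresholds are hypothetical illustrations, not LifeTrace's actual implementation.

```python
import math


def error_prob_to_quality(p_error: float) -> int:
    """Convert an error probability to a Phred-scale quality (assumed scale)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return round(-10.0 * math.log10(p_error))


def quality_to_error_prob(quality: float) -> float:
    """Inverse of the Phred-scale conversion."""
    return 10.0 ** (-quality / 10.0)


def clip_by_quality(quals, min_q=20.0, window=5):
    """Return a half-open (start, end) range spanning every sliding window
    whose mean quality is at least min_q; (0, 0) if no window qualifies.
    Hypothetical clipping rule, chosen only for illustration."""
    n = len(quals)
    if n < window:
        return (0, 0)
    good = [
        i for i in range(n - window + 1)
        if sum(quals[i:i + window]) / window >= min_q
    ]
    if not good:
        return (0, 0)
    return (good[0], good[-1] + window)


def likely_deletion_sites(gap_quals, max_gap_q=10.0):
    """Indices i where the gap-quality between basecall i and i + 1 is low,
    i.e. where a deleted base is comparatively likely (threshold is arbitrary)."""
    return [i for i, q in enumerate(gap_quals) if q < max_gap_q]


if __name__ == "__main__":
    base_quals = [4, 4, 4, 30, 30, 30, 30, 4, 4, 4]
    gap_quals = [40, 35, 8, 40, 3]
    print(clip_by_quality(base_quals, min_q=20, window=2))
    print(likely_deletion_sites(gap_quals))
```

In an alignment step, positions returned by `likely_deletion_sites` could be given a lower gap penalty, which is one way a gap-quality might help locate single-base deletions; how LifeTrace's downstream tools actually use it is not stated in the abstract.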
Pages: 875-888
Page count: 14