Correction of sequence-based artifacts in serial analysis of gene expression

Cited by: 31
Authors
Akmaev, VR [1]
Wang, CJ [1]
Affiliations
[1] Genzyme Corp, Framingham, MA 01701 USA
Keywords
DOI
10.1093/bioinformatics/bth077
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections provide abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These latter applications are facilitated by the enhanced method of Long SAGE. A characteristic of sequencing-based methods such as SAGE and Long SAGE is the unavoidable occurrence of artifact sequences resulting from sequencing errors. Because such tag errors occur at low frequency and at random, they have minimal impact on differential expression analysis. However, to fully exploit the value of large SAGE tag datasets, it is desirable to account for and correct tag artifacts.
Results: We present estimates for the occurrence of tag errors and an efficient error correction algorithm. Error rate estimates are based on a stochastic model that includes the contributions of polymerase chain reaction (PCR) and sequencing errors. The correction algorithm, SAGEScreen, is a multi-step procedure that addresses ditag processing, estimation of empirical error rates from highly abundant tags, grouping of similar-sequence tags, and statistical testing of observed counts. We apply SAGEScreen to Long SAGE libraries and compare error rates for several processing scenarios. Results with simulated tag collections indicate that SAGEScreen corrects 78% of recoverable tag errors and reduces the occurrence of singleton tags.
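The last two steps named in the abstract, grouping similar-sequence tags and testing observed counts, can be illustrated with a minimal Python sketch. This is not the published SAGEScreen procedure: the function names, the fixed `per_base_error` value, the uniform-substitution assumption (error rate divided by 3), the restriction to one-mismatch neighbours, and the Poisson tail test are all assumptions chosen for illustration.

```python
from math import exp


def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by summing the lower tail."""
    term = exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def screen_tags(counts, per_base_error=0.01, alpha=0.05):
    """Fold a tag into a more abundant one-mismatch neighbour when its count
    is no larger than sequencing error alone could plausibly produce.

    per_base_error and alpha are illustrative assumptions, not values
    taken from the paper.
    """
    corrected = dict(counts)
    tags = sorted(counts, key=counts.get, reverse=True)
    for parent in tags:
        if parent not in corrected:
            continue  # already merged into a more abundant tag
        for child in tags:
            if child == parent or child not in corrected:
                continue
            if counts[child] >= counts[parent] or len(child) != len(parent):
                continue
            if hamming(parent, child) != 1:
                continue
            # Expected copies of this specific single-base variant,
            # assuming the three possible substitutions are equally likely.
            lam = counts[parent] * per_base_error / 3
            # A large tail probability means the count looks like error.
            if poisson_sf(counts[child], lam) > alpha:
                corrected[parent] += corrected.pop(child)
    return corrected


example = {
    "ACGTACGTAC": 3000,  # abundant tag
    "ACGTACGTAA": 2,     # one mismatch, count consistent with error
    "ACGTACGTAG": 400,   # one mismatch, far too abundant to be error
    "TTTTTTTTTT": 1,     # unrelated singleton, left untouched
}
result = screen_tags(example)
```

In this toy input the two-count neighbour is absorbed into the abundant tag, while the 400-count neighbour is retained as a distinct transcript because its count far exceeds what the assumed error rate would generate. An empirical error rate estimated from highly abundant tags, as the abstract describes, would replace the fixed `per_base_error` used here.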
Pages: 1254-1263
Page count: 10
Related papers
18 in total
[1]  
Athreya K.B., 1972, BRANCHING PROCESS
[2]   ESTIMATING THE NUMBER OF SPECIES - A REVIEW [J].
BUNGE, J ;
FITZPATRICK, M .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (421) :364-373
[3]   Detecting the impact of sequencing errors on SAGE data [J].
Colinge, J ;
Feger, G .
BIOINFORMATICS, 2001, 17 (09) :840-842
[4]   Base-calling of automated sequencer traces using phred. II. Error probabilities [J].
Ewing, B ;
Green, P .
GENOME RESEARCH, 1998, 8 (03) :186-194
[5]   Base-calling of automated sequencer traces using phred. I. Accuracy assessment [J].
Ewing, B ;
Hillier, L ;
Wendl, MC ;
Green, P .
GENOME RESEARCH, 1998, 8 (03) :175-185
[6]   AN ADAPTIVE, OBJECT-ORIENTED STRATEGY FOR BASE CALLING IN DNA-SEQUENCE ANALYSIS [J].
GIDDINGS, MC ;
BRUMLEY, RL ;
HAKER, M ;
SMITH, LM .
NUCLEIC ACIDS RESEARCH, 1993, 21 (19) :4530-4540
[7]  
HAYASHI K, 1990, Technique (Philadelphia), V2, P216
[8]   FIDELITY OF DNA-POLYMERASES IN DNA AMPLIFICATION [J].
KEOHAVONG, P ;
THILLY, WG .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1989, 86 (23) :9253-9257
[9]   SAGEmap: A public gene expression resource [J].
Lash, AE ;
Tolstoshev, CM ;
Wagner, L ;
Schuler, GD ;
Strausberg, RL ;
Riggins, GJ ;
Altschul, SF .
GENOME RESEARCH, 2000, 10 (07) :1051-1060
[10]  
MADDEN SL, 2004, UNPUB AM J PATHOL