Statistical modeling of sequencing errors in SAGE libraries

Cited by: 83
Authors
Beissbarth, Tim [1]
Hyde, Lavinia [1]
Smyth, Gordon K. [1]
Job, Chris [2]
Boon, Wee-Ming [2]
Tan, Seong-Seng [2]
Scott, Hamish S. [1]
Speed, Terence P. [1]
Affiliations
[1] Walter & Eliza Hall Inst Med Res Genet & Bioinfor, Parkville, Vic 3050, Australia
[2] Univ Melbourne, Brain Dev Lab, Howard Florey Inst, Parkville, Vic 3010, Australia
DOI
10.1093/bioinformatics/bth924
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Sequencing errors may bias the gene expression measurements made by Serial Analysis of Gene Expression (SAGE). They may introduce non-existent tags at low abundance and decrease the real abundance of other tags. These effects are amplified in the longer tags generated in Long-SAGE libraries. Current sequencing technology generates quite accurate estimates of sequencing error rates. Here we use the sequence neighborhood of SAGE tags and error estimates from the base-calling software to correct for such errors.
Results: We introduce a statistical model for the propagation of sequencing errors in SAGE and propose an Expectation-Maximization (EM) algorithm to correct for them, given the observed sequences in a library and base-calling error estimates. We tested our method on simulated and experimental SAGE libraries. When comparing SAGE libraries, we found that sequencing errors can introduce considerable bias. High-abundance tags may be falsely called significantly differentially expressed, especially when comparing libraries with different levels of sequencing error and/or of different sizes. Truly differentially expressed tags have decreased significance, as true tag counts are generally underestimated. This may change whether tags near the threshold of differential expression are called significant. Moreover, the number of different transcripts present in a library is overestimated, as false tags are introduced at low abundance. Our correction method adjusts the tag counts to be closer to the true counts and is able to partly correct for the biases introduced by sequencing errors.
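The abstract does not spell out the model, so the following is only a rough illustration of the general idea of EM-based tag-count correction, not the authors' implementation. It assumes a single uniform per-base substitution error rate `eps` (the paper instead uses per-base estimates from the base-calling software), substitution errors only, and a candidate set of true tags restricted to the observed tags. The function names `emission` and `em_correct` are hypothetical.

```python
def emission(observed, true, eps):
    """P(observed tag | true tag) under an assumed uniform substitution model."""
    p = 1.0
    for a, b in zip(observed, true):
        # A base is read correctly with prob. 1 - eps, else as one of 3 other bases.
        p *= (1.0 - eps) if a == b else eps / 3.0
    return p


def em_correct(counts, eps=0.01, n_iter=50):
    """Estimate corrected tag counts from observed counts with a simple EM loop."""
    tags = list(counts)
    total = sum(counts.values())
    # Start from the observed relative abundances.
    pi = {t: counts[t] / total for t in tags}
    # Precompute P(observed s | true t) for all pairs of observed tags.
    like = {s: {t: emission(s, t, eps) for t in tags} for s in tags}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in tags}
        for s in tags:
            # E-step: posterior probability that observed tag s came from true tag t.
            weights = {t: pi[t] * like[s][t] for t in tags}
            norm = sum(weights.values())
            for t in tags:
                expected[t] += counts[s] * weights[t] / norm
        # M-step: update abundances from the expected true-tag counts.
        pi = {t: expected[t] / total for t in tags}
    return {t: pi[t] * total for t in tags}


if __name__ == "__main__":
    observed = {"AAAAAAAAAA": 99, "AAAAAAAAAC": 1, "GGGGGGGGGG": 50}
    for tag, value in em_correct(observed).items():
        print(tag, round(value, 3))
```

In this toy library the singleton tag differs from an abundant tag at one base, so part of its count is reassigned to that neighbor, while the unrelated tag is left essentially unchanged. The total count is preserved by construction.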
Pages: 31-39
Number of pages: 9