EST clustering error evaluation and correction

Cited by: 57
Authors
Wang, JPZ [1]
Lindsay, BG
Leebens-Mack, J
Cui, LY
Wall, K
Miller, WC
dePamphilis, CW
Affiliations
[1] Northwestern Univ, Dept Stat, Evanston, IL 60208 USA
[2] Penn State Univ, Dept Stat, University Pk, PA 16802 USA
[3] Penn State Univ, Dept Biol, University Pk, PA 16802 USA
Funding
National Science Foundation (USA)
DOI
10.1093/bioinformatics/bth342
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in these analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated.
Results: We identify and quantify two types of EST clustering error, Type I and Type II, in EST clustering using the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate in the 5' EST case is ~10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g. P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that ~80% of the Type I error in 5' EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and provide more accurate estimates of the true gene cluster profile.
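The two error types defined in the abstract can be illustrated with a minimal sketch. It assumes each EST has a known true gene label and an assigned cluster ID, and it counts a Type I error per gene (a gene with at least two ESTs whose ESTs land in more than one cluster) and a Type II error per cluster (a cluster that mixes ESTs from more than one gene). The function name `clustering_error_rates` and these per-gene and per-cluster rate definitions are illustrative assumptions, not the paper's exact estimators.

```python
from collections import defaultdict


def clustering_error_rates(assignments):
    """Return (type1_rate, type2_rate) for an EST clustering.

    assignments: iterable of (true_gene, cluster_id) pairs, one per EST.
    The rate definitions below are assumptions made for illustration.
    """
    clusters_per_gene = defaultdict(set)   # gene -> clusters holding its ESTs
    genes_per_cluster = defaultdict(set)   # cluster -> genes it contains
    ests_per_gene = defaultdict(int)       # gene -> number of ESTs sampled

    for gene, cluster in assignments:
        clusters_per_gene[gene].add(cluster)
        genes_per_cluster[cluster].add(gene)
        ests_per_gene[gene] += 1

    # Type I: only genes with >= 2 ESTs can fail to form a single cluster.
    multi_est_genes = [g for g, n in ests_per_gene.items() if n >= 2]
    split_genes = sum(len(clusters_per_gene[g]) > 1 for g in multi_est_genes)
    type1 = split_genes / len(multi_est_genes) if multi_est_genes else 0.0

    # Type II: clusters that falsely merge ESTs from distinct genes.
    mixed_clusters = sum(len(gs) > 1 for gs in genes_per_cluster.values())
    type2 = mixed_clusters / len(genes_per_cluster) if genes_per_cluster else 0.0

    return type1, type2


# Toy data (hypothetical): gene B is split across clusters 2 and 3,
# and cluster 3 mixes genes B and C.
toy = [("A", 1), ("A", 1), ("B", 2), ("B", 3), ("C", 3), ("D", 4)]
type1_rate, type2_rate = clustering_error_rates(toy)
```

In this toy input, splitting gene B raises the cluster count from the true four genes' worth to an apparent extra fragment, which is the kind of inflation in unique-gene estimates that the abstract attributes to Type I error.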
Pages: 2973-2984
Page count: 12