Cleaning the GenBank Arabidopsis thaliana data set

Cited by: 42
Authors
Korning, PG
Hebsgaard, SM
Rouze, P
Brunak, S
Affiliations
[1] TECH UNIV DENMARK, CTR BIOL SEQUENCE ANAL, DK-2800 LYNGBY, DENMARK
[2] STATE UNIV GHENT VIB, LAB INRA, B-9000 GHENT, BELGIUM
DOI
10.1093/nar/24.2.316
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
Pages: 316-320
Page count: 5