A clean data set of EST-confirmed splice sites from Homo sapiens and standards for clean-up procedures

Cited by: 33
Authors
Thanaraj, TA [1]
Affiliations
[1] European Bioinformat Inst, Hinxton CB10 1SD, Cambs, England
Keywords
DOI
10.1093/nar/27.13.2627
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A clean data set of verified splice sites from Homo sapiens is reported, together with the standards used for the clean-up procedure. The sites were validated by: (i) standard cleaning procedures such as requiring consistency in the annotation of the gene structural elements, completeness of the coding regions and elimination of redundant sequences; (ii) clustering by decision trees coupled with analysis of ClustalW alignments of the translated protein sequence with homologous proteins from SWISS-PROT; (iii) matching against human EST sequences. The sites are categorised as: (i) donor sites, a set of 619 EST-confirmed donor sites, of which 138 have either the site or the region around it involved in alternative splice events; (ii) acceptor sites, a set of 623 EST-confirmed acceptor sites, of which 144 have either the site or the region around it involved in alternative splice events; (iii) genuine splice sites, a set of 392 splice sites in which both the donor and acceptor sites had EST confirmation and were not involved in any alternative splicing; (iv) alternative splice sites, a set of 209 splice sites in which both the donor and acceptor sites had EST confirmation and the sites or the regions around them were involved in alternative splicing. A set of nucleotide regions that can be used to generate a control set of false splice sites with a high confidence of being nonfunctional is also reported.
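The four categories in the abstract can be read as a simple rule over per-site flags. The following is a minimal sketch of that reading, not the author's procedure: the `Intron` record, its field names and the `categories` function are hypothetical, and it assumes each donor/acceptor pair carries an EST-confirmation flag and an alternative-splicing flag per site.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Intron:
    """Hypothetical record for one annotated donor/acceptor pair."""
    donor_est: bool      # donor site matched by a human EST
    acceptor_est: bool   # acceptor site matched by a human EST
    donor_alt: bool      # donor site or its surrounding region is alternatively spliced
    acceptor_alt: bool   # acceptor site or its surrounding region is alternatively spliced


def categories(intron: Intron) -> Set[str]:
    """Return the abstract's category names that this pair falls into."""
    cats: Set[str] = set()
    if intron.donor_est:
        cats.add("donor")
    if intron.acceptor_est:
        cats.add("acceptor")
    # "genuine" and "alternative" both require EST confirmation at both ends.
    if intron.donor_est and intron.acceptor_est:
        if intron.donor_alt or intron.acceptor_alt:
            cats.add("alternative")
        else:
            cats.add("genuine")
    return cats
```

Under this reading a pair confirmed at only one end contributes to the donor or acceptor set but to neither the genuine nor the alternative set, which is one possible explanation for why the donor and acceptor counts exceed the sum of the last two categories.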
Pages: 2627-2637
Page count: 11
Related papers
9 entries in total
[1]   The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998 [J].
Bairoch, A ;
Apweiler, R .
NUCLEIC ACIDS RESEARCH, 1998, 26 (01) :38-42
[2]   Finding the genes in genomic DNA [J].
Burge, CB ;
Karlin, S .
CURRENT OPINION IN STRUCTURAL BIOLOGY, 1998, 8 (03) :346-354
[3]   Evaluation of gene structure prediction programs [J].
Burset, M ;
Guigo, R .
GENOMICS, 1996, 34 (03) :353-367
[4]   Cleaning the GenBank Arabidopsis thaliana data set [J].
Korning, PG ;
Hebsgaard, SM ;
Rouze, P ;
Brunak, S .
NUCLEIC ACIDS RESEARCH, 1996, 24 (02) :316-320
[5]  
MIRONOV AA, 1998, P 1 INT C BIOINF GEN, V2, P249
[6]   Improved tools for biological sequence comparison [J].
Pearson, WR ;
Lipman, DJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1988, 85 (08) :2444-2448
[7]   The EMBL Nucleotide Sequence Database [J].
Stoesser, G ;
Tuli, MA ;
Lopez, R ;
Sterk, P .
NUCLEIC ACIDS RESEARCH, 1999, 27 (01) :18-24
[8]   CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice [J].
Thompson, JD ;
Higgins, DG ;
Gibson, TJ .
NUCLEIC ACIDS RESEARCH, 1994, 22 (22) :4673-4680
[9]   A comparison of expressed sequence tags (ESTs) to human genomic sequences [J].
Wolfsberg, TG ;
Landsman, D .
NUCLEIC ACIDS RESEARCH, 1997, 25 (08) :1626-1632