Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery

Cited by: 26
Authors
Larsson, TP [1 ]
Murray, CG [1 ]
Hill, T [1 ]
Fredriksson, R [1 ]
Schiöth, HB [1 ]
Affiliations
[1] Uppsala Univ, Dept Neurosci, S-75124 Uppsala, Sweden
Keywords
expressed sequence tag; RefSeq; Ensembl; databases; Genscan
DOI
10.1016/j.febslet.2004.12.046
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25000 unique genes in the union of RefSeq and Ensembl with 12-18% and 8-13% of the genes in each set unique to the other set, respectively. About 20% of all genes had splice variants. There are a considerable number of ESTs (2200000) that do not match the identified genes and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides an insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to come to a conclusive gene count. © 2004 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.
Pages: 690-698
Number of pages: 9