A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base

Cited by: 143
Authors
Miller, RT
Christoffels, AG
Gopalakrishnan, C
Burke, J
Ptitsyn, AA
Broveak, TR
Hide, WA
Affiliations
[1] S African Natl Bioinformat Inst, ZA-7535 Bellville, South Africa
[2] Univ Western Cape, ZA-7535 Bellville, South Africa
[3] Elect Genet Observ, ZA-7925 Cape Town, South Africa
DOI
10.1101/gr.9.11.1143
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable-quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue-level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole-body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole-body index build of STACK v.2.0 are presented.
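The abstract mentions that tissue-level clusters are joined into larger assemblies by linking through clone-ID annotations. As an illustration only, the following minimal sketch shows one plausible way such linking could work, using a union-find structure to merge clusters that share a clone ID. The function name, the input layout, and the cluster and clone names are all hypothetical; this is not the STACK_PACK implementation, whose details are not given in this record.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_clones):
    """Merge clusters that share at least one clone ID (union-find).

    cluster_clones: dict mapping a cluster name to the set of clone IDs
    annotated on its member ESTs (hypothetical input layout).
    Returns a sorted list of sorted cluster-name groups.
    """
    parent = {name: name for name in cluster_clones}

    def find(name):
        # Walk to the root, halving the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first cluster seen for each clone ID and
    # join every later cluster carrying the same clone ID to it.
    first_seen = {}
    for name, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                union(first_seen[clone], name)
            else:
                first_seen[clone] = name

    groups = defaultdict(list)
    for name in cluster_clones:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Invented example data: three clusters chained by shared clones, one isolated.
example = {
    "brain_1": {"clone_100"},
    "liver_7": {"clone_100", "clone_200"},
    "heart_3": {"clone_200"},
    "lung_2": {"clone_999"},
}
linked = link_clusters_by_clone(example)
```

In this toy input, `brain_1`, `liver_7`, and `heart_3` end up in one linked group because each shares a clone ID with another member, while `lung_2` stays on its own. Linking is transitive, which is why the number of linked assemblies can be much smaller than the number of input clusters.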
Pages: 1143-1155
Number of pages: 13