Extending traditional query-based integration approaches for functional characterization of post-genomic data

Cited by: 25
Authors
Eckman, BA [1]
Kosky, AS
Laroco, LA
Affiliations
[1] GlaxoSmithKline, Dept Bioinformat, King Of Prussia, PA USA
[2] Gene Log Inc, Data Management Syst, Berkeley, CA USA
DOI
10.1093/bioinformatics/17.7.587
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: To identify and characterize regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and its format (traditional database, flat file, web site, results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, requiring non-traditional approaches to exclude extraneous data.

Results: Target Informatics Net (TINet) is a readily extensible data integration system developed at GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM) multidatabase middleware system of Gene Logic Inc. Data sources currently integrated include: the Mouse Genome Database (MGD) and Gene Expression Database (GXD), GenBank, SwissProt, PubMed, GeneCards, the results of runtime BLAST and PROSITE searches, and GSK proprietary relational databases. Special-purpose class methods used to filter and augment query results include regular expression pattern-matching over BLAST HSP alignments and retrieving partial sequences derived from primary structure annotations. All data sources and methods are accessible through an SQL-like query language or a GUI, so that when new investigations arise no additional programming beyond query specification is required. The power and flexibility of this approach are illustrated in such integrated queries as: (1) 'find homologs in genomic sequence to all novel genes cloned and reported in the scientific literature within the past three months that are linked to the MeSH term "neoplasms"'; (2) 'using a neuropeptide precursor query sequence, return only HSPs where the target genomic sequences conserve the G[KR][KR] motif at the appropriate points in the HSP alignment'; and (3) 'of the human genomic sequences annotated with exon boundaries in GenBank, return only those with valid putative donor/acceptor sites and start/stop codons'.
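The filter behind example query (2) can be illustrated with a short Python sketch. This is not TINet's or OPM's implementation, which the abstract does not describe. The names `HSP` and `conserves_motif` are hypothetical. The sketch assumes each HSP is available as two equal-length aligned rows of uppercase residues with `-` for gaps, and it reads "the appropriate points" as the alignment columns where the query row itself contains the motif, uninterrupted by gaps.

```python
import re
from typing import NamedTuple

# Motif from the abstract: G followed by two basic residues (K or R).
MOTIF = re.compile(r"G[KR][KR]")


class HSP(NamedTuple):
    """Hypothetical container for one aligned high-scoring segment pair."""
    query: str    # aligned query row, '-' marks a gap
    subject: str  # aligned subject (target) row, same length as query


def conserves_motif(hsp: HSP, motif: re.Pattern = MOTIF) -> bool:
    """Return True if every motif site in the query row is also a motif
    site in the same alignment columns of the subject row.

    An HSP whose query row contains no motif site is rejected, since
    there is nothing to conserve.
    """
    if len(hsp.query) != len(hsp.subject):
        raise ValueError("aligned rows must have equal length")
    # Column ranges where the query row carries the motif.
    sites = [(m.start(), m.end()) for m in motif.finditer(hsp.query)]
    return bool(sites) and all(
        motif.fullmatch(hsp.subject[start:end]) is not None
        for start, end in sites
    )


def filter_hsps(hsps: list[HSP]) -> list[HSP]:
    """Keep only the HSPs that conserve the motif."""
    return [h for h in hsps if conserves_motif(h)]
```

A list of HSPs parsed from BLAST output could then be narrowed with `filter_hsps(hsps)`. A fuller treatment would also need to handle motifs split by gaps in the query row and translated (nucleotide-to-protein) alignments.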
Pages: 587-601
Page count: 15