A semiautomated approach to gene discovery through expressed sequence tag data mining: Discovery of new human transporter genes

Cited by: 13
Authors
Brown, S
Chang, JL
Sadee, W
Babbitt, PC
Affiliations
[1] Univ Calif San Francisco, Sch Pharm, Dept Pharmaceut Chem, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Sch Pharm, Dept Biopharmaceut Sci, San Francisco, CA 94143 USA
[3] MIT, Whitehead Inst, Ctr Genome Res, Cambridge, MA 02141 USA
[4] Ohio State Univ, Med Ctr, Columbus, OH 43210 USA
Source
AAPS PHARMSCI | 2003 / Vol. 5 / Issue 01
Keywords
major facilitator superfamily; transporters; superfamily analysis; expressed sequence tags; data mining;
DOI
10.1208/ps050101
Chinese Library Classification number
R9 [Pharmacy];
Discipline classification code
1007 ;
Abstract
Identification and functional characterization of the genes in the human genome remain a major challenge. A principal source of publicly available information used for this purpose is the National Center for Biotechnology Information database of expressed sequence tags (dbEST), which contains over 4 million human ESTs. To extract the information buried in this data more effectively, we have developed a semiautomated method to mine dbEST for uncharacterized human genes. Starting with a single protein input sequence, a family of related proteins from all species is compiled. This entire family is then used to mine the human EST database for new gene candidates. Evaluation of putative new gene candidates in the context of a family of characterized proteins provides a framework for inference of the structure and function of the new genes. When applied to a test data set of 28 families within the major facilitator superfamily (MFS) of membrane transporters, our protocol found 73 previously characterized human MFS genes and 43 new MFS gene candidates. Development of this approach provided insights into the problems and pitfalls of automated data mining using public databases.
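The workflow summarized in the abstract (one seed protein, then a compiled family, then a family-wide search of human ESTs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy sequences, the thresholds, and especially the shared k-mer score, which is a simple stand-in for a real sequence-similarity search and is not the authors' method. ESTs are also assumed to be already translated into peptide fragments.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy score: shared k-mers divided by the smaller k-mer set.

    Hypothetical stand-in for an alignment-based search score.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def compile_family(seed: str, proteins: Dict[str, str],
                   threshold: float = 0.5) -> Set[str]:
    """Collect proteins similar to the seed or to any member already found."""
    family: Set[str] = set()
    frontier = [seed]
    while frontier:
        query = frontier.pop()
        for name, seq in proteins.items():
            if name not in family and similarity(query, seq) >= threshold:
                family.add(name)
                frontier.append(seq)  # new members become queries too
    return family


def mine_ests(family: Set[str], proteins: Dict[str, str],
              ests: Dict[str, str], known_human: Set[str],
              threshold: float = 0.5,
              identity: float = 0.9) -> Tuple[Set[str], Set[str]]:
    """Split family-matching ESTs into known-gene hits and new candidates."""
    known_hits: Set[str] = set()
    candidates: Set[str] = set()
    for est_id, peptide in ests.items():
        # Keep only ESTs that resemble at least one family member.
        if not any(similarity(peptide, proteins[m]) >= threshold for m in family):
            continue
        # Near-identical to a characterized human gene: already known.
        if any(similarity(peptide, proteins[h]) >= identity for h in known_human):
            known_hits.add(est_id)
        else:
            candidates.add(est_id)
    return known_hits, candidates


# Toy data (invented sequences, not real transporters).
proteins = {
    "human_A": "MKTLLVAGGWFR",
    "mouse_A": "MKTLLVAGGYYP",
    "yeast_X": "QQQQPPPPDDDD",
}
ests = {
    "est1": "TLLVAGGWFR",   # fragment of the known human gene
    "est2": "KTLLVAGGYYP",  # resembles the mouse member only
    "est3": "QQQQPPPP",     # unrelated to the family
}

family = compile_family(proteins["human_A"], proteins)
known_hits, candidates = mine_ests(family, proteins, ests, {"human_A"})
print(sorted(family), sorted(known_hits), sorted(candidates))
```

Searching with every family member rather than the seed alone is what lets `est2` surface as a candidate: it is too far from the characterized human sequence to count as a known hit, but it is clearly related to another member of the family.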
Pages: 18