Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families

Cited by: 126

Authors:

Andrade, MA [1]

Valencia, A [1]

Affiliations:

[1] CSIC, CNB, Prot Design Grp, E-28049 Madrid, Spain

Source:

BIOINFORMATICS | 1998 / Vol. 14 / No. 7

Keywords:

DOI:

10.1093/bioinformatics/14.7.600

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Annotation of the biological function of different protein sequences is a time-consuming process currently performed by human experts. Genome analysis tools encounter great difficulty in performing this task. Database curators, developers of genome analysis tools and biologists in general could benefit from access to tools able to suggest functional annotations and facilitate access to functional information.

Approach: We present here the first prototype of a system for the automatic annotation of protein function. The system is triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain-specific background distribution. Simultaneously, the most representative sentences and MEDLINE abstracts are selected and presented to the end-user. Evolutionary information is considered as a predominant characteristic in the domain of protein function. Our system consequently extracts domain-specific information from the analysis of a set of protein families.

Results: The system has been tested with different protein families, of which three examples are discussed in detail here: 'ataxia-telangiectasia associated protein', 'ran GTPase' and 'carbonic anhydrase'. We found generally good correlation between the amount of information provided to the system and the quality of the annotations. Finally, the current limitations and future developments of the system are discussed.

Availability: The current system can be considered as a prototype system. As such, it can be accessed as a server at http://columba.ebi.ac.uk:8765/andrade/abx. The system accepts text related to the protein or proteins to be evaluated (optimally, the result of a MEDLINE search by keyword) and the results are returned in the form of Web pages for keywords, sentences and abstracts.

Supplementary information: Web pages containing full information on the examples mentioned in the text are available at: http://www.cnb.uam.es/~cnbprot/keywords/

Contact: valencia@cnb.uam.es.
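The abstract states that keywords are selected by their relative accumulation against a domain-specific background built from a set of protein families, but it does not give the scoring formula. The Python sketch below shows one plausible reading: a word's frequency in the target family (the fraction of abstracts containing it) is compared with the mean and standard deviation of its frequency across background families, giving a z-score-like value. The function names, the tokenizer, the `sigma_floor` parameter and the toy abstracts are all assumptions made for illustration; they are not taken from the paper.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Return the set of lowercase word tokens (2+ characters) in one abstract."""
    return set(re.findall(r"[a-z][a-z0-9-]+", text.lower()))


def doc_frequency(abstracts):
    """Map each word to the fraction of abstracts in which it appears."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    total = len(abstracts)
    return {word: n / total for word, n in counts.items()}


def keyword_scores(target, background_families, sigma_floor=0.05):
    """Score each word of the target family against the background families.

    sigma_floor is a hypothetical safeguard against division by zero when a
    word has the same frequency in every background family.
    """
    target_freq = doc_frequency(target)
    background_freqs = [doc_frequency(family) for family in background_families]
    scores = {}
    for word, freq in target_freq.items():
        values = [bf.get(word, 0.0) for bf in background_freqs]
        mu = mean(values)
        sigma = max(pstdev(values), sigma_floor)
        scores[word] = (freq - mu) / sigma
    return scores


# Toy data, invented for this example only.
background = [
    ["the protein binds zinc", "this protein is an enzyme"],
    ["the protein kinase is activated", "a kinase protein in the cell"],
]
target = [
    "the ran protein controls nuclear transport",
    "nuclear import needs the ran protein",
]

scores = keyword_scores(target, background)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

With this toy input, words that occur in every target abstract but in no background family ("ran", "nuclear") rank above words that are common everywhere ("protein"), which is the behaviour the abstract describes. A real background would need many families for the mean and deviation to be meaningful.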

Pages: 600-607

Page count: 8
