Terminologies for text-mining; an experiment in the lipoprotein metabolism domain

Times cited: 13
Authors
Alexopoulou, Dimitra [1]
Waechter, Thomas [1]
Pickersgill, Laura [2]
Eyre, Cecilia [3]
Schroeder, Michael [1]
Affiliations
[1] Tech Univ Dresden, Ctr Biotechnol BIOTEC, D-01062 Dresden, Germany
[2] Unilever Corp Res, Colworth MK44 1LQ, England
[3] Unilever Safety & Environm Assurance Ctr, Colworth MK44 1LQ, England
DOI
10.1186/1471-2105-9-S4-S2
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: The engineering of ontologies, especially with a view to text-mining use, is still a new research field. There is not yet a well-defined theory or technology for ontology construction, and many of the ontology design steps remain manual, based on personal experience and intuition. However, a few efforts exist to construct ontologies automatically in the form of extracted lists of terms and the relations between them.
Results: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the terminology derived automatically by four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms, the best method still generates 51% relevant terms. In a corpus of 3066 documents, 53% of LMO terms are contained and 38% can be generated with one of the methods.
Conclusions: Given high precision, automatic methods can help decrease development time and provide significant support for identifying domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking ATR results as input and following the guidelines we described.
Availability: The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl.
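Illustrative sketch (not part of the original record): the abstract reports the share of relevant terms among the top 50 and top 1000 candidates, which is precision at rank k over a ranked term list. The Python below shows one way such a figure could be computed, together with a plain textbook TF-IDF ranking. The function names, the toy corpus, and the exact TF-IDF weighting are assumptions made for illustration; they are not taken from the paper and need not match the variant used by the authors' web service.

```python
import math
from collections import Counter


def tfidf_scores(tokenized_docs):
    """Sum a textbook TF-IDF weight per term over a tokenized corpus.

    Assumption: raw term frequency times log(N / document frequency).
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = Counter()
    for doc in tokenized_docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / doc_freq[term])
    return scores


def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked terms that are judged relevant."""
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for term in top_k if term in relevant_terms)
    return hits / len(top_k)


# Hypothetical toy corpus; real input would be tokenized abstracts.
docs = [["ldl", "cholesterol", "the"], ["hdl", "the"], ["ldl", "the"]]
scores = tfidf_scores(docs)

# Hypothetical ranking and relevance judgments for the evaluation step.
ranked = ["ldl", "hdl", "the", "apolipoprotein"]
relevant = {"ldl", "hdl", "apolipoprotein"}
p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_4 = precision_at_k(ranked, relevant, 4)
```

In this toy corpus a term that occurs in every document ("the") receives a score of zero, which is why TF-IDF style weighting tends to push general vocabulary down the candidate list.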
Pages: 12