Terminologies for text-mining: an experiment in the lipoprotein metabolism domain

Cited by: 13
Authors
Alexopoulou, Dimitra [1]
Waechter, Thomas [1]
Pickersgill, Laura [2]
Eyre, Cecilia [3]
Schroeder, Michael [1]
Affiliations
[1] Tech Univ Dresden, Ctr Biotechnol BIOTEC, D-01062 Dresden, Germany
[2] Unilever Corp Res, Colworth MK44 1LQ, England
[3] Unilever Safety & Environm Assurance Ctr, Colworth MK44 1LQ, England
DOI
10.1186/1471-2105-9-S4-S2
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The engineering of ontologies, especially with a view to text-mining use, is still a new research field. A well-defined theory and technology for ontology construction does not yet exist. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there are a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them.

Results: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the terminology derived automatically by four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms, the best method still generates 51% relevant terms. A corpus of 3066 documents contains 53% of the LMO terms, and 38% can be generated with one of the methods.

Conclusions: Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking ATR results as input and following the guidelines we described.

Availability: The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl.
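The figures in the Results (share of relevant terms among the top 50 or top 1000 predictions, and share of LMO terms that a method can generate) read like precision-at-k and coverage measures. The sketch below shows one plausible way to compute such numbers. It is not the authors' evaluation code: the function names, the case-insensitive exact matching, and the toy term lists are all assumptions made for illustration.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked candidate terms judged relevant.

    Matching is case-insensitive exact string comparison (an assumption).
    If fewer than k candidates exist, the available ones are scored.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    relevant = {t.lower() for t in relevant_terms}
    hits = sum(1 for t in top_k if t.lower() in relevant)
    return hits / len(top_k)


def coverage(candidate_terms, reference_terms):
    """Fraction of reference (e.g. ontology) terms found among the candidates."""
    reference = {t.lower() for t in reference_terms}
    if not reference:
        return 0.0
    candidates = {t.lower() for t in candidate_terms}
    return len(reference & candidates) / len(reference)


# Hypothetical toy data, not taken from the paper.
ranked = ["lipoprotein", "cholesterol", "patient", "LDL receptor", "study"]
reference = ["lipoprotein", "cholesterol", "LDL receptor", "apolipoprotein"]

p_at_4 = precision_at_k(ranked, reference, 4)  # 3 of the top 4 are relevant
cov = coverage(ranked, reference)              # 3 of 4 reference terms found
```

In practice relevance would more likely be judged by domain experts than by exact string matching, so a lookup against a fixed reference list should be read as a lower bound on precision.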
Pages: 12