Disambiguating the species of biomedical named entities using natural language parsers

Cited by: 34
Authors
Wang, Xinglong [1, 2]
Tsujii, Jun'ichi [1, 2, 3]
Ananiadou, Sophia [1, 2]
Affiliations
[1] Univ Manchester, Natl Ctr Text Min, Manchester, Lancs, England
[2] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
[3] Univ Tokyo, Dept Comp Sci, Tokyo, Japan
Funding
UK Biotechnology and Biological Sciences Research Council (BBSRC)
Keywords
SYSTEMS
DOI
10.1093/bioinformatics/btq002
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
070307 [Chemical Biology]
Abstract
Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods utilizing natural language parsers.

Results: We build a corpus for organism disambiguation in which every occurrence of a protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method, and the results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification.
Pages: 661-667
Number of pages: 7