Automatic document classification of biological literature

Cited by: 31
Authors
Chen, David
Muller, Hans-Michael [1]
Sternberg, Paul W.
Affiliations
[1] CALTECH, Div Biol, Pasadena, CA 91125 USA
[2] CALTECH, Howard Hughes Med Inst, Pasadena, CA 91125 USA
DOI
10.1186/1471-2105-7-370
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters-21578) than previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Conclusion: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
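Illustrative note: the abstract describes a clustering step that groups documents by shared phrases and uses those phrases as human-readable labels. The Python sketch below shows that general idea only. It is not the authors' algorithm: the use of word bigrams as "phrases", the function names `bigrams` and `cluster_by_phrase`, the `min_docs` threshold, and the three toy sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of lowercase word bigrams in a text."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


def cluster_by_phrase(docs, min_docs=2):
    """Group document ids under each bigram shared by at least min_docs documents.

    The shared phrase itself serves as the cluster label (an assumption of
    this sketch, not a detail taken from the paper).
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in bigrams(text):
            phrase_to_docs[phrase].add(doc_id)
    return {
        phrase: sorted(ids)
        for phrase, ids in phrase_to_docs.items()
        if len(ids) >= min_docs
    }


# Hypothetical toy corpus; real input would be full-text papers.
docs = {
    "d1": "vulval induction requires ras signaling",
    "d2": "ras signaling controls vulval induction",
    "d3": "dauer formation depends on insulin signaling",
}
clusters = cluster_by_phrase(docs)
for label, members in sorted(clusters.items()):
    print(label, members)
```

In this toy corpus, documents d1 and d2 end up together under the labels "ras signaling" and "vulval induction", while d3 shares no bigram with the others and stays unclustered. A document may belong to several clusters, which is typical of phrase-based approaches.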
Pages: 11