Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Cited by: 48
Authors
Akhondi, Saber A. [1]
Klenner, Alexander G. [2]
Tyrchan, Christian [3]
Manchala, Anil K. [4]
Boppana, Kiran [4]
Lowe, Daniel [5]
Zimmermann, Marc [2]
Jagarlapudi, Sarma A. R. P. [4]
Sayle, Roger [5]
Kors, Jan A. [1]
Muresan, Sorel [6]
Affiliations
[1] Erasmus Univ, Med Ctr, Dept Med Informat, Rotterdam, Netherlands
[2] Fraunhofer Gesell, Fraunhofer Inst Algorithms & Sci Comp SCAI, St Augustin, Germany
[3] AstraZeneca R&D, RIA Med Chem, Molndal, Sweden
[4] GVK Biosci Private Ltd, Hyderabad, Andhra Pradesh, India
[5] NextMove Software Ltd, Cambridge, England
[6] AstraZeneca R&D, Chem Innovat Ctr, Molndal, Sweden
Keywords
INFORMATION; ENTITIES; IDENTIFICATION; CHEMISTRY;
DOI
10.1371/journal.pone.0107477
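Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes
070301 [Inorganic Chemistry]; 070403 [Astrophysics]; 070507 [Natural Resources and Territorial Spatial Planning]; 090105 [Crop Production Systems and Ecological Engineering];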
Abstract
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide an understanding of compound prior art and support novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Pages: 8