GeoCorpora: building a corpus to test and train microblog geoparsers

Cited by: 63
Authors
Wallgruen, Jan Oliver [1, 2]
Karimzadeh, Morteza [3]
MacEachren, Alan M. [3]
Pezanowski, Scott [3]
Affiliations
[1] Penn State Univ, GeoVISTA Ctr, University Pk, PA 16802 USA
[2] Penn State Univ, ChoroPhronesis, University Pk, PA 16802 USA
[3] Penn State Univ, Dept Geog, GeoVISTA Ctr, University Pk, PA 16802 USA
Keywords
Geoparsing; corpus building; microblogs; Twitter; geo-annotation; named entity recognition
DOI
10.1080/13658816.2017.1368523
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In this article, we present the GeoCorpora corpus building framework and software tools as well as a geo-annotated Twitter corpus built with these tools to foster research and development in the areas of microblog/Twitter geoparsing and geographic information retrieval. The developed framework employs crowdsourcing and geovisual analytics to support the construction of large corpora of text in which the mentioned location entities are identified and geolocated to toponyms in existing geographical gazetteers. We describe how the approach has been applied to build a corpus of geo-annotated tweets that will be made freely available to the research community alongside this article to support the evaluation, comparison and training of geoparsers. Additionally, we report lessons learned related to corpus construction for geoparsing as well as insights about the notions of place and natural spatial language that we derive from application of the framework to building this corpus.
Pages: 1-29
Number of pages: 29