A fast algorithm for bottom-up document layout analysis

Cited by: 90
Authors
Simon, A
Pret, JC
Johnson, AP
Affiliations
[1] Institute for Computer Applications in Molecular Sciences, School of Chemistry, University of Leeds, Leeds
Keywords
document analysis; physical page layout; bottom-up layout analysis; Kruskal's algorithm; spanning tree; chemical documents
DOI
10.1109/34.584106
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system (http://chem.leeds.ac.uk/ICAMS/CLiDE.html), but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure. The method has all the major advantages of bottom-up systems: independence from different text spacing and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.
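
The abstract names Kruskal's algorithm and path compression but does not give the distance metric or the heuristics, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes components are axis-aligned bounding boxes `(x0, y0, x1, y1)`, uses the Euclidean gap between boxes as a stand-in metric, and stops merging at a hypothetical `max_gap` threshold. The names `box_gap`, `find`, `cluster_boxes`, and `max_gap` are illustrative. The sketch also enumerates all pairs of boxes, so it does not reach the linear complexity the paper reports.

```python
from itertools import combinations


def box_gap(a, b):
    """Euclidean gap between two boxes (x0, y0, x1, y1); 0 if they touch or overlap.

    This metric is an assumption; the paper uses its own special distance metric.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def find(parent, i):
    """Return the root of i's set, compressing the path on the way back."""
    root = i
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every node on the walked path directly at the root.
    while parent[i] != root:
        nxt = parent[i]
        parent[i] = root
        i = nxt
    return root


def cluster_boxes(boxes, max_gap):
    """Kruskal-style merging: join the closest components first, stop at max_gap."""
    parent = list(range(len(boxes)))
    # All-pairs edge list, sorted by distance (quadratic in the number of boxes).
    edges = sorted(
        (box_gap(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    tree = []  # accepted spanning-forest edges as (i, j, gap)
    for gap, i, j in edges:
        if gap > max_gap:
            break  # remaining edges are at least as long
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # the edge joins two different components
            parent[rj] = ri
            tree.append((i, j, gap))
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values()), tree


if __name__ == "__main__":
    # Two rows of character-sized boxes, far apart vertically.
    boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10),
             (0, 100, 10, 110), (12, 100, 22, 110)]
    groups, tree = cluster_boxes(boxes, max_gap=5)
    print(groups)
```

Each accepted edge records which two components were joined and at what distance, so the resulting forest can in principle be read as a hierarchy of the page, with closely spaced components grouped before widely spaced ones.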
Pages: 273-277
Number of pages: 5