A fast algorithm for bottom-up document layout analysis

Cited by: 90
Authors
Simon, A
Pret, JC
Johnson, AP
Affiliation
[1] Institute for Computer Applications in Molecular Sciences, School of Chemistry, University of Leeds, Leeds
Keywords
document analysis; physical page layout; bottom-up layout analysis; Kruskal's algorithm; spanning tree; chemical documents;
DOI
10.1109/34.584106
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system (http://chem.leeds.ac.uk/ICAMS/CLiDE.html), but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure. The method has all the major advantages of bottom-up systems: independence from different text spacing and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.
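The abstract names two ingredients, Kruskal's algorithm and path compression, which can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `Box` type, the `gap_distance` metric (a plain Euclidean gap between bounding boxes, standing in for the paper's special distance metric, which the abstract does not define), and the `max_gap` stopping threshold. The sketch also enumerates all component pairs, so it is quadratic; it does not reproduce the heuristics the paper uses to reach linear complexity.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Bounding box of one page component (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def gap_distance(a: Box, b: Box) -> float:
    """Euclidean gap between two boxes, 0 if they touch or overlap.

    Placeholder metric (assumption), not the paper's distance metric.
    """
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def find(parent: list[int], i: int) -> int:
    """Return the root of i's set, compressing the path along the way."""
    root = i
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def cluster(boxes: list[Box], max_gap: float) -> list[list[int]]:
    """Kruskal-style merging: take edges by increasing distance and
    join the two sets they connect, stopping once edges exceed max_gap."""
    parent = list(range(len(boxes)))
    edges = sorted(
        (gap_distance(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for dist, i, j in edges:
        if dist > max_gap:
            break
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:  # edge joins two different sets: keep it
            parent[root_j] = root_i
    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


# Two words on one line, two words on a line far below.
boxes = [Box(0, 0, 10, 5), Box(12, 0, 20, 5),
         Box(0, 50, 10, 55), Box(13, 50, 22, 55)]
print(cluster(boxes, max_gap=5.0))  # [[0, 1], [2, 3]]
```

With a threshold of 5.0 the horizontal gaps (2 and 3 units) are merged while the 45-unit vertical gap is not, so the two lines stay separate blocks. Raising the threshold above 45 would merge all four components into one block.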
Pages: 273-277
Number of pages: 5