HYBRID CONTEXTUAL TEXT RECOGNITION WITH STRING-MATCHING

Cited by: 26
Authors
SINHA, RMK
PRASADA, B
HOULE, GF
SABOURIN, M
Affiliations
[1] BELL NO RES LTD, MONTREAL, PQ, CANADA
[2] UNIV QUEBEC, INRS TELECOMMUNICAT, MONTREAL H3C 3P8, QUEBEC, CANADA
[3] ARTHUR D LITTLE INC, WASHINGTON, DC USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
CONTEXT; MODIFIED VITERBI ALGORITHM; OPTICAL CHARACTER RECOGNITION; POST PROCESSING; STRING MATCHING; TEXT RECOGNITION;
DOI
10.1109/34.232077
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The hybrid contextual algorithm presented in this paper is designed to read real-life documents printed in varying fonts of any size. The text is typically composed of valid English words, proper nouns, acronyms, abbreviations, and numerals. In addition, word spelling may be corrupted by character misclassifications (due to fragmented, mutilated, or touching characters) or by errors in computing the word boundaries (due to word segmentation errors or word delimiter misclassifications). In this paper, text is recognized progressively in three passes. The first pass is used to generate character hypotheses, the second to generate word hypotheses, and the third to verify the word hypotheses. During the first pass, isolated characters are recognized using a dynamic contour warping classifier. Transient statistical information is collected to accelerate the recognition process and to verify hypotheses in later processing. A transient dictionary consisting of high-confidence nondictionary words is constructed in this pass. During the second pass, word-level hypotheses are generated using hybrid contextual text processing. Nondictionary words are recognized using 1) a modified Viterbi algorithm (MVA), 2) a string matching algorithm (SMA) utilizing n-grams, 3) special handlers for touching characters, and 4) pragmatic handlers for numerals, punctuation, hyphens, apostrophes, and a prefix/suffix handler. This processing usually generates several word hypotheses. During the third pass, word-level verification occurs. The word hypotheses generated during the second pass are verified using a cost criterion based on statistics (data driven or bottom up) and language heuristics (language driven or top down). The word with minimum cost is adopted. If no word hypothesis is generated in the second pass, the word is corrected using positional n-gram information.
The hybrid contextual algorithm was tested on a set of 22 multifont documents of varying quality scanned at 200 dots/in using a facsimile scanner. A character recognition rate of 98% was observed.
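The second and third passes described in the abstract (generate several word hypotheses, then adopt the one with minimum cost) can be illustrated with a minimal Python sketch. This is not the paper's SMA or its cost criterion: the bigram Dice similarity, the Levenshtein cost, the `min_similarity` threshold, and all function names below are assumptions chosen only to show the general shape of n-gram string matching followed by minimum-cost selection.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the multiset of character n-grams of a word, padded with '#'."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=2):
    """Dice coefficient over shared n-grams (hypothetical similarity measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((ga & gb).values())  # multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2.0 * shared / total if total else 0.0


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for the cost criterion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def best_word(ocr_word, dictionary, min_similarity=0.4):
    """Generate hypotheses by n-gram similarity, then keep the minimum-cost one.

    Returns None when no dictionary word passes the (assumed) threshold.
    """
    hypotheses = [w for w in dictionary
                  if ngram_similarity(ocr_word, w) >= min_similarity]
    if not hypotheses:
        return None
    return min(hypotheses,
               key=lambda w: (edit_distance(ocr_word.lower(), w.lower()), w))


if __name__ == "__main__":
    words = ["recognition", "recognize", "reception", "context"]
    print(best_word("rec0gniti0n", words))
```

In this sketch a corrupted token such as `rec0gniti0n` yields several candidate words from the n-gram step, and the edit-distance cost then selects among them. A real system along the lines of the abstract would combine data-driven statistics with language heuristics in the cost rather than rely on edit distance alone.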
Pages: 915-925
Page count: 11