HYBRID CONTEXTUAL TEXT RECOGNITION WITH STRING-MATCHING

Cited by: 26

Authors:

SINHA, RMK

PRASADA, B

HOULE, GF

SABOURIN, M

Affiliations:

[1] BELL NO RES LTD, MONTREAL, PQ, CANADA

[2] UNIV QUEBEC, INRS TELECOMMUNICAT, MONTREAL H3C 3P8, QUEBEC, CANADA

[3] ARTHUR D LITTLE INC, WASHINGTON, DC USA

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 1993 / Vol. 15 / No. 9

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

CONTEXT; MODIFIED VITERBI ALGORITHM; OPTICAL CHARACTER RECOGNITION; POST PROCESSING; STRING MATCHING; TEXT RECOGNITION;

DOI:

10.1109/34.232077

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The hybrid contextual algorithm presented in this paper is designed to read real-life documents printed in varying fonts of any size. The text is typically composed of valid English words, proper nouns, acronyms, abbreviations, and numerals. In addition, word spelling may be corrupted by character misclassifications (due to fragmented, mutilated, or touching characters) or by errors in computing the word boundaries (due to word segmentation errors or word delimiter misclassifications). In this paper, text is recognized progressively in three passes. The first pass generates character hypotheses, the second generates word hypotheses, and the third verifies the word hypotheses. During the first pass, isolated characters are recognized using a dynamic contour warping classifier. Transient statistical information is collected to accelerate the recognition process and to verify hypotheses in later processing. A transient dictionary consisting of high-confidence nondictionary words is constructed in this pass. During the second pass, word-level hypotheses are generated using hybrid contextual text processing. Nondictionary words are recognized using 1) a modified Viterbi algorithm (MVA), 2) a string matching algorithm (SMA) utilizing n-grams, 3) special handlers for touching characters, and 4) pragmatic handlers for numerals, punctuation, hyphens, apostrophes, and prefixes/suffixes. This processing usually generates several word hypotheses. During the third pass, word-level verification occurs. The word hypotheses generated during the second pass are verified using a cost criterion based on statistics (data-driven or bottom-up) and language heuristics (language-driven or top-down). The word with minimum cost is adopted. If no word hypothesis is generated in the second pass, the word is corrected using positional n-gram information.
The hybrid contextual algorithm was tested on a set of 22 multifont documents of varying quality scanned at 200 dots/in using a facsimile scanner. A character recognition rate of 98% was observed.
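
The second and third passes described in the abstract (generate several word hypotheses by n-gram string matching, then adopt the one with minimum cost) can be illustrated with a minimal Python sketch. The abstract does not specify the paper's actual string matching algorithm or cost criterion, so everything here is an assumption: the function names (`ngrams`, `ngram_similarity`, `word_hypotheses`, `verify`) are hypothetical, the Dice coefficient over character bigrams stands in for the SMA, and `1 - similarity` stands in for the cost criterion.

```python
from collections import Counter


def ngrams(word, n=2):
    """Return the multiset of character n-grams of a word, padded with '#'."""
    padded = "#" + word.lower() + "#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets (1.0 = identical)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    overlap = sum((ga & gb).values())  # multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2.0 * overlap / total if total else 0.0


def word_hypotheses(ocr_word, dictionary, threshold=0.5):
    """Second-pass stand-in: dictionary words whose n-grams resemble the OCR string."""
    scored = [(w, ngram_similarity(ocr_word, w)) for w in dictionary]
    return [(w, s) for w, s in scored if s >= threshold]


def verify(ocr_word, dictionary, threshold=0.5):
    """Third-pass stand-in: adopt the hypothesis with minimum cost (1 - similarity)."""
    hyps = word_hypotheses(ocr_word, dictionary, threshold)
    if not hyps:
        # The abstract says the paper falls back to positional n-gram
        # correction in this case; that step is not sketched here.
        return None
    return min(hyps, key=lambda ws: 1.0 - ws[1])[0]
```

For example, `verify("rec0gnition", ["recognition", "recreation", "cognition"])` returns `"recognition"`: all three dictionary words pass the 0.5 threshold, and the one sharing the most bigrams with the corrupted string has the lowest cost. A string with no similar dictionary entry yields `None`, which marks the point where a fallback correction step would take over.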


Pages: 915-925

Page count: 11
