An omnifont open-vocabulary OCR system for English and Arabic

Cited by: 144
Authors
Bazzi, I [1]
Schwartz, R [1]
Makhoul, J [1]
Affiliation
[1] GTE Internetworking, BBN Technol, Cambridge, MA 02138 USA
Keywords
optical character recognition; speech recognition; hidden Markov models; omnifont OCR; language modeling; Arabic OCR; segmentation-free recognition
DOI
10.1109/34.771314
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on Hidden Markov Models (HMMs), an approach that has proven to be very successful in the area of automatic speech recognition. In this paper, we focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
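Illustrative note (not part of the original record): the abstract mentions a trigram language model on character sequences. The sketch below shows one minimal way such a model could score candidate strings. It is an assumption-laden illustration, not the authors' implementation: the add-one smoothing, the padding symbols `BOS` and `EOS`, and the function names `train_char_trigrams` and `log_prob` are all hypothetical choices made here.

```python
import math
from collections import defaultdict

# Hypothetical padding symbols; the paper does not specify any.
BOS = "\x02"  # start-of-text marker
EOS = "\x03"  # end-of-text marker


def train_char_trigrams(lines):
    """Count character trigrams and their two-character contexts."""
    tri = defaultdict(int)   # (c1, c2, c3) -> count
    ctx = defaultdict(int)   # (c1, c2) -> count
    vocab = {EOS}            # symbols that can be predicted
    for line in lines:
        chars = [BOS, BOS] + list(line) + [EOS]
        vocab.update(line)
        for i in range(2, len(chars)):
            tri[tuple(chars[i - 2:i + 1])] += 1
            ctx[tuple(chars[i - 2:i])] += 1
    return tri, ctx, vocab


def log_prob(text, tri, ctx, vocab):
    """Log P(text) under the trigram model with add-one smoothing.

    Add-one smoothing is an assumption made for this sketch only.
    """
    chars = [BOS, BOS] + list(text) + [EOS]
    v = len(vocab)
    total = 0.0
    for i in range(2, len(chars)):
        context = tuple(chars[i - 2:i])
        trigram = tuple(chars[i - 2:i + 1])
        total += math.log((tri.get(trigram, 0) + 1) / (ctx.get(context, 0) + v))
    return total


if __name__ == "__main__":
    tri, ctx, vocab = train_char_trigrams(["the cat", "the hat", "then that"])
    # A character order seen in training should score higher than a scrambled one.
    print(log_prob("the", tri, ctx, vocab) > log_prob("hte", tri, ctx, vocab))
```

In a recognizer, a score of this kind would typically be combined with the HMM's score for the image data so that implausible character sequences are penalized; how the paper combines the two is not stated in this record.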
Pages: 495-504
Number of pages: 10