A probabilistic framework for segment-based speech recognition

Cited by: 133
Authors
Glass, JR [1]
Affiliations
[1] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
DOI
10.1016/S0885-2308(03)00006-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most current speech recognizers use an observation space based on a temporal sequence of measurements extracted from fixed-length "frames" (e.g., Mel-cepstra). Given a hypothetical word or sub-word sequence, the acoustic likelihood computation always involves all observation frames, though the mapping between individual frames and internal recognizer states will depend on the hypothesized segmentation. There is another type of recognizer whose observation space is better represented as a network, or graph, where each arc in the graph corresponds to a hypothesized variable-length segment that is represented by a fixed-dimensional "feature". In such feature-based recognizers, each hypothesized segmentation will correspond to a segment sequence, or path, through the overall segment-graph that is associated with a subset of all possible feature vectors in the total observation space. In this work we examine a maximum a posteriori decoding strategy for feature-based recognizers and develop a normalization criterion useful for a segment-based Viterbi or A* search. Experiments are reported for both phonetic and word recognition tasks. (C) 2003 Elsevier Science Ltd. All rights reserved.
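The segment-graph search described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the arc format, the function name `best_path`, and the example scores are all hypothetical. Each arc is a hypothesized segment between two time boundaries with a label and a log score. Because different paths use different numbers of segments, raw segment scores are not directly comparable across paths, which is the motivation the abstract gives for a normalization criterion. That criterion is not reproduced here; the sketch simply assumes the arc scores have already been made comparable.

```python
from typing import Dict, List, Tuple

# Hypothetical arc format: (start_boundary, end_boundary, label, log_score).
# Assumption: boundaries are time-ordered integers and start < end on every arc.
Arc = Tuple[int, int, str, float]


def best_path(arcs: List[Arc], start: int, end: int) -> Tuple[float, List[str]]:
    """Return (score, labels) of the highest-scoring path from start to end."""
    # best[b] holds the best partial path found so far that ends at boundary b.
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    # Visiting arcs by increasing start boundary is a topological order,
    # so best[s] is final by the time any arc leaving s is processed.
    for s, e, label, score in sorted(arcs, key=lambda a: a[0]):
        if s not in best:
            continue  # boundary s is not reachable from the start
        candidate = best[s][0] + score
        if e not in best or candidate > best[e][0]:
            best[e] = (candidate, best[s][1] + [label])
    if end not in best:
        raise ValueError("no path from start to end")
    return best[end]


# Made-up example: three boundaries' worth of competing segmentations.
example_arcs: List[Arc] = [
    (0, 1, "k", -1.0),
    (1, 2, "ae", -1.5),
    (2, 3, "t", -0.5),
    (0, 2, "k", -3.0),
    (1, 3, "ae", -1.0),
]
score, labels = best_path(example_arcs, 0, 3)
print(score, labels)
```

In this toy graph the two-segment path through boundaries 0, 1, 3 wins over the three-segment path, which shows how the choice of segmentation and the choice of labels are made jointly by a single search.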
Pages: 137-152
Page count: 16