Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics

Cited by: 61

Authors:

Deng, L [1]

Ma, J [1]

Affiliations:

[1] Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada

Source:

JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA | 2000 / Vol. 108 / No. 6

DOI:

10.1121/1.1315288

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

A statistical coarticulatory model is presented for spontaneous speech recognition, where knowledge of the dynamic, target-directed behavior in the vocal tract resonance is incorporated into the model design, training, and likelihood computation. The principal advantage of the new model over the conventional HMM is the use of a compact, internal structure that parsimoniously represents long-span context dependence in the observable domain of speech acoustics without using additional, context-dependent model parameters. The new model is formulated mathematically as a constrained, nonstationary, and nonlinear dynamic system, for which a version of the generalized EM algorithm is developed and implemented for automatically learning the compact set of model parameters. A series of experiments for speech recognition and model synthesis using spontaneous speech data from the Switchboard corpus are reported. The promise of the new model is demonstrated by showing its consistently superior performance over a state-of-the-art benchmark HMM system under controlled experimental conditions. Experiments on model synthesis and analysis provide insight into the mechanism underlying such superiority in terms of the target-directed behavior and of the long-span context-dependence property, both inherent in the designed structure of the new dynamic model of speech. © 2000 Acoustical Society of America. [S0001-4966(00)02911-8].

Pages: 3036-3048

Number of pages: 13
