Speaker adaptation with all-pass transforms

Cited by: 11

Authors:

McDonough, J [1]

Schaaf, T [1]

Waibel, A [1]

Affiliations:

[1] Univ Karlsruhe, Integrated Syst Lab, Inst Log Komplexitat & Dedukt Syst, D-76128 Karlsruhe, Germany

Source:

SPEECH COMMUNICATION | 2004 / Vol. 42 / Issue 01

Keywords:

speaker adaptation; speech recognition;

DOI:

10.1016/j.specom.2003.09.005

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Modern speech recognition systems are based on the hidden Markov model (HMM) and employ cepstral features to represent input speech. In speaker normalization, the cepstral features of speech from a given speaker are transformed to match the speaker-independent HMM. In speaker adaptation, the means of the HMM are transformed to match the input speech. Vocal tract length normalization (VTLN) is a popular normalization scheme wherein the frequency axis of the short-time spectrum is rescaled prior to the extraction of cepstral features. In this work, we develop novel speaker adaptation schemes by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We describe two classes of such maps: rational all-pass transforms (RAPTs), which are well known in the signal processing literature, and sine-log all-pass transforms (SLAPTs), which are novel in this work. For both classes of maps, we develop the relations necessary to perform maximum likelihood estimation of the relevant transform parameters using enrollment data from a new speaker. We also propose the means by which an HMM may be trained specifically for use with this type of adaptation. Finally, in a set of recognition experiments conducted on conversational speech material from the Switchboard Corpus as well as the English Spontaneous Scheduling Task, we demonstrate the capacity of APT-based speaker adaptation to achieve word error rate reductions superior to those obtained with other popular adaptation techniques, and moreover, reductions that are additive with those provided by VTLN. (C) 2003 Elsevier B.V. All rights reserved.


Pages: 75-91

Page count: 17

