Comparison of Speaker Adaptation Methods as Feature Extraction for SVM-Based Speaker Recognition

Cited by: 28
Authors
Ferras, Marc [1 ]
Leung, Cheung-Chi [2 ]
Barras, Claude [1 ]
Gauvain, Jean-Luc [1 ]
Affiliations
[1] LIMSI CNRS, F-91403 Orsay, France
[2] Inst Infocomm Res I2R, Human Language Technol Dept, Singapore 138632, Singapore
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010 / Vol. 18 / No. 6
Keywords
Constrained MLLR (CMLLR); Gaussian supervectors; Gaussian mixture model (GMM); maximum-likelihood linear regression (MLLR); speaker recognition; support vector machine (SVM); VERIFICATION; KERNEL;
DOI
10.1109/TASL.2009.2034187
CLC number
O42 [Acoustics]
Subject classification
070206 [Acoustics]
Abstract
In recent years, the speaker recognition field has made extensive use of speaker adaptation techniques. Adaptation allows speaker model parameters to be estimated using less speech data than needed for maximum-likelihood (ML) training. The maximum a posteriori (MAP) and maximum-likelihood linear regression (MLLR) techniques have typically been used for adaptation. Recently, MAP and MLLR adaptation have been incorporated in the feature extraction stage of support vector machine (SVM)-based speaker recognition systems. Two approaches to feature extraction use an SVM to classify either the MAP-adapted Gaussian mean vector parameters (GSV-SVM) or the MLLR transform coefficients (MLLR-SVM). In this paper, we provide an experimental analysis of the GSV-SVM and MLLR-SVM approaches. We largely focus on the latter by exploring constrained and unconstrained transforms and different choices of the acoustic model. A channel-compensated front-end is used to prevent the MLLR transforms from adapting to channel components in the speech data. Additional acoustic models were trained using speaker adaptive training (SAT) to better estimate the speaker MLLR transforms. We provide results on the NIST 2005 and 2006 Speaker Recognition Evaluation (SRE) data and fusion results on the SRE 2006 data. The results show that the compensated front-end, SAT models, and multiple regression classes bring major performance improvements.
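The GSV-SVM idea described in the abstract can be illustrated with a minimal toy sketch: MAP-adapt the means of a universal background GMM to each utterance, stack the adapted means into a "supervector", and train a linear SVM on those vectors. This is an illustrative simplification, not the paper's actual system; the relevance factor, toy data, and scikit-learn models are assumptions for demonstration only.

```python
# Hypothetical GSV-SVM sketch: relevance-MAP adaptation of UBM means,
# supervector stacking, and a linear SVM. Illustrative only; not the
# configuration evaluated in the paper.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def map_adapt_means(ubm, frames, relevance=16.0):
    """Relevance-MAP adaptation of the UBM component means (means only)."""
    post = ubm.predict_proba(frames)              # (T, C) responsibilities
    n_c = post.sum(axis=0)                        # soft counts per component
    f_c = post.T @ frames                         # first-order stats, (C, D)
    alpha = (n_c / (n_c + relevance))[:, None]    # adaptation coefficients
    ml_means = f_c / np.maximum(n_c, 1e-8)[:, None]
    return alpha * ml_means + (1.0 - alpha) * ubm.means_

def supervector(ubm, frames):
    """Stack the MAP-adapted means into one feature vector."""
    return map_adapt_means(ubm, frames).ravel()

# Toy data: 2-D "cepstral" frames; two speakers differ by a mean shift.
background = rng.normal(size=(500, 2))
ubm = GaussianMixture(n_components=4, random_state=0).fit(background)

def utterance(shift):
    return rng.normal(loc=shift, size=(100, 2))

X = np.array([supervector(ubm, utterance(s))
              for s in ([0.8] * 10 + [-0.8] * 10)])
y = np.array([1] * 10 + [0] * 10)

svm = LinearSVC().fit(X, y)                       # linear SVM on supervectors
print(svm.score(X, y))
```

In the paper's setting the frames would be channel-compensated cepstral features and the UBM a large GMM; the same adapt-stack-classify pipeline applies.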
Pages: 1366-1378
Page count: 13