Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis

Cited by: 23
Authors
Yu, Kai [1]
Zen, Heiga [2]
Mairesse, Francois [1]
Young, Steve [1]
Affiliations
[1] Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England
[2] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
HMM-based speech synthesis; Context adaptive training; Factorized decision tree; State clustering;
DOI
10.1016/j.specom.2011.03.003
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
To achieve natural, high-quality synthesized speech in HMM-based speech synthesis, effective modelling of complex acoustic and linguistic contexts is critical. Traditional approaches use context-dependent HMMs with decision-tree-based parameter clustering to model the full combination of contexts. However, weak contexts, such as word-level emphasis in natural speech, are difficult to capture using this approach. Also, due to combinatorial explosion, incorporating new contexts within the traditional framework may easily lead to insufficient data coverage. To model weak contexts effectively and reduce data sparsity, different types of contexts should be treated independently. Context adaptive training provides a structured framework for this, whereby standard HMMs represent normal contexts and transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training in speech recognition, separate decision trees have to be built for different types of context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach, with the MLLR-based system achieving the best performance. (C) 2011 Elsevier B.V. All rights reserved.
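The factorization described in the abstract can be illustrated with a minimal sketch, assuming an MLLR-style affine mean transform: one decision tree picks the base Gaussian for the normal contexts, a separate tree picks the transform for the weak context, and the two are combined. All names, leaf labels, and numeric values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical leaves of the "normal context" decision tree:
# each leaf holds the mean of a base Gaussian.
base_means = {
    "leaf_0": np.array([1.0, 2.0]),
    "leaf_1": np.array([0.0, -1.0]),
}

# Hypothetical leaves of the separate "weak context" tree (here, emphasis):
# each leaf holds an affine transform (A, b). Values are made up.
transforms = {
    "neutral": (np.eye(2), np.zeros(2)),
    "emphasis": (np.array([[1.5, 0.0], [0.0, 1.0]]), np.array([0.5, 0.0])),
}


def adapt_mean(base_mean, transform):
    """Apply an MLLR-style affine transform: A @ mu + b."""
    A, b = transform
    return A @ base_mean + b


def context_mean(normal_leaf, weak_leaf):
    """Combine the leaves selected independently by the two trees."""
    return adapt_mean(base_means[normal_leaf], transforms[weak_leaf])


if __name__ == "__main__":
    print(context_mean("leaf_0", "neutral"))
    print(context_mean("leaf_0", "emphasis"))
```

In this sketch the number of parameter sets grows with the sum of the two trees' leaf counts rather than their product, which is the intuition behind treating the context types independently.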
Pages: 914-923
Number of pages: 10