Deep Learning for Acoustic Modeling in Parametric Speech Generation

Cited by: 170
Authors
Ling, Zhen-Hua [1,2,3,4]
Kang, Shi-Yin [5]
Zen, Heiga [6,7,8]
Senior, Andrew [6,7]
Schuster, Mike [9]
Qian, Xiao-Jun [5,10]
Meng, Helen [5,11]
Deng, Li [12,13,14]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland
[2] Univ Sci & Technol China & iFLYTEK Co Ltd, Hefei, Peoples R China
[3] Univ Sci & Technol China, Hefei, Peoples R China
[4] Univ Washington, Seattle, WA 98195 USA
[5] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[6] Google, Mountain View, CA USA
[7] IBM TJ Watson Res Ctr, Yorktown Hts, NY USA
[8] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge, England
[9] Adv Telecommun Res Labs, Kyoto, Japan
[10] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
[11] Chinese Univ Hong Kong, Fac Engn, Hong Kong, Hong Kong, Peoples R China
[12] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[13] Microsoft Res, Deep Learning Technol Ctr, Redmond, WA USA
[14] Univ Washington, Seattle, WA 98195 USA
Keywords
Neural networks; Conversion; Representations; Algorithm
DOI
10.1109/MSP.2014.2359987
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809
Abstract
Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models are limited in their ability to represent the complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. © 1991-2012 IEEE.
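To make the modeling task concrete, the following is a minimal illustrative sketch, not the system described in the article: a single-hidden-layer feed-forward network, trained by full-batch gradient descent on synthetic data, that regresses frame-level acoustic feature vectors directly from frame-level linguistic feature vectors. This direct regression is the basic formulation of DNN-based statistical parametric speech synthesis; all dimensions, data, and hyperparameters below are invented placeholders.

# Minimal sketch (not the authors' system): a feed-forward DNN mapping
# frame-level linguistic features to acoustic features. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_ling, n_hidden, n_acoustic = 500, 100, 64, 40

# Placeholder stand-ins: binary linguistic features per frame and
# continuous acoustic targets (e.g., mel-cepstral coefficients).
X = rng.integers(0, 2, size=(n_frames, n_ling)).astype(np.float64)
Y = rng.normal(size=(n_frames, n_acoustic))

# One tanh hidden layer, linear output, trained with full-batch
# gradient descent on mean squared error.
W1 = rng.normal(scale=0.05, size=(n_ling, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, n_acoustic)); b2 = np.zeros(n_acoustic)
lr = 0.01

for step in range(200):
    H = np.tanh(X @ W1 + b1)      # hidden activations
    Y_hat = H @ W2 + b2           # predicted acoustic features
    err = Y_hat - Y
    loss = np.mean(err ** 2)

    # Backpropagation of the MSE loss.
    dY = 2.0 * err / n_frames
    dW2 = H.T @ dY; db2 = dY.sum(axis=0)
    dH = (dY @ W2.T) * (1.0 - H ** 2)
    dW1 = X.T @ dH; db1 = dH.sum(axis=0)

    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"final MSE: {loss:.4f}")

In a real system, the inputs would encode linguistic context (for example, phoneme identity, stress, and positional features), and the outputs would be vocoder parameters such as mel-cepstral coefficients, log F0, and aperiodicity measures; the trained network takes the place of the decision-tree-clustered HMM/GMM acoustic model.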
Pages: 35-52
Page count: 18
Related papers
86 in total
[1] Abe M., Journal of the Acoustical Society of Japan (E), vol. 11, p. 71, 1990. DOI: 10.1250/ast.11.71
[2] [Anonymous], A Field Guide to Dynamical Recurrent Networks, 2001.
[3] [Anonymous], Proc. ICASSP, 2014.
[4] [Anonymous], MATH FDN SPEECH LANG.
[5] [Anonymous], P NIPS 2011 WORKSH D.
[6] [Anonymous], Learning Deep Generative Models, 2009.
[7] Bengio Y., Advances in Neural Information Processing Systems 19, vol. 19, p. 153, 2006.
[8] Chen L.-H., Interspeech, p. 3052, 2013.
[9] Chen L.-H., Ling Z.-H., Liu L.-J., Dai L.-R., "Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859-1872, 2014.
[10] Dahl G. E., Yu D., Deng L., Acero A., "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.