Optimization of Deep Neural Network Target Values in Speech Recognition

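Cited by: 4
Authors
Chen Mengzhe (陈梦喆)
Zhang Qingqing (张晴晴)
Pan Jielin (潘接林)
Yan Yonghong (颜永红)
Affiliation
[1] Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences
Keywords
speech recognition; deep neural network; forward-backward algorithm; target value optimization
DOI
10.15961/j.jsuese.2016.01.025
Chinese Library Classification number
TN912.34 [Speech recognition and equipment]
Subject classification code
0711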
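Abstract
When training deep neural network (DNN) acoustic models, the target values obtained by forced alignment cannot accurately represent the actual speech. To address this problem, a method is proposed that uses the forward-backward algorithm to obtain target values that do not follow a 0-1 distribution. The targets from forced alignment are unreasonable in two respects: the model used for forced alignment may not fully match the utterances being processed, and the continuity of pronunciation makes transition boundaries hard to separate. The new targets express, for each frame, the probabilities with which that frame belongs to each of the neighboring states. They describe the transitions between modeling units in more detail, reproduce the original speech more faithfully, and improve model robustness. To balance model robustness against the discriminability of the modeling units, a window is applied to the targets produced by the algorithm. Experiments were carried out in the Chinese customer-service question-answering domain. Experiments on a small amount of data confirmed that the targets have a large effect on training and were used to select the window length. The training data was then increased to 60, 80, and 100 h. The results show that models trained with the new target optimization method achieve better recognition performance, with relative character error rate reductions of 1.10% to 3.65%. Multiple sets of experiments verify that the new target optimization method benefits model training and remains effective as the amount of training data grows.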
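The two steps named in the abstract, forward-backward soft targets followed by windowing, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The strict left-to-right topology, the shared transition probabilities `log_self` and `log_next`, the function names, and the windowing rule (keep posterior mass within `width` states of each frame's most likely state, then renormalize) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def soft_targets(log_obs, log_self, log_next):
    """State posteriors from forward-backward on a left-to-right chain.

    log_obs: (T, S) log-likelihood of frame t under the s-th state of the
             utterance's state sequence (requires T >= S).
    log_self, log_next: log transition probabilities, shared by all states
             (a simplifying assumption of this sketch).
    """
    T, S = log_obs.shape
    alpha = np.full((T, S), -np.inf)
    beta = np.full((T, S), -np.inf)

    alpha[0, 0] = log_obs[0, 0]              # path must start in the first state
    for t in range(1, T):
        stay = alpha[t - 1] + log_self
        move = np.full(S, -np.inf)
        move[1:] = alpha[t - 1, :-1] + log_next
        alpha[t] = np.logaddexp(stay, move) + log_obs[t]

    beta[T - 1, S - 1] = 0.0                 # path must end in the last state
    for t in range(T - 2, -1, -1):
        nxt = beta[t + 1] + log_obs[t + 1]
        stay = nxt + log_self
        move = np.full(S, -np.inf)
        move[:-1] = nxt[1:] + log_next
        beta[t] = np.logaddexp(stay, move)

    log_gamma = alpha + beta
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)                 # each row sums to 1

def window_targets(gamma, width):
    """Zero out mass farther than `width` states from the per-frame argmax
    and renormalize (hypothetical windowing rule)."""
    S = gamma.shape[1]
    centre = gamma.argmax(axis=1)[:, None]
    keep = np.abs(np.arange(S)[None, :] - centre) <= width
    out = np.where(keep, gamma, 0.0)
    return out / out.sum(axis=1, keepdims=True)
```

With `width=0` each row collapses to a one-hot vector on its most likely state, which resembles a hard alignment; larger widths keep more of the soft distribution over neighboring states. The rows are indexed by position in the utterance's state sequence, so mapping them onto the DNN output units would be a further step not shown here.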
Pages: 166-172
Page count: 7