Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

Cited by: 169
Authors
Sainath, Tara N. [1]
Weiss, Ron J. [1]
Wilson, Kevin W. [1]
Li, Bo [2]
Narayanan, Arun [2]
Variani, Ehsan [2]
Bacchiani, Michiel [1]
Shafran, Izhak [2]
Senior, Andrew [1]
Chin, Kean [2]
Misra, Ananya [2]
Kim, Chanwoo [2]
Affiliations
[1] Google, New York, NY 10011 USA
[2] Google Inc, Mountain View, CA 94043 USA
Keywords
Beamforming; deep learning; noise-robust speech recognition; ROBUST;
DOI
10.1109/TASLP.2017.2672401
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture, which performs multichannel filtering in the first layer of the network, and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filter bank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
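The multichannel filtering the abstract refers to can be illustrated with a plain filter-and-sum operation: each microphone channel is convolved with its own filter and the results are summed. The sketch below is not the authors' network. It uses fixed random filters in place of learned ones, and the channel count, signal length, and filter length are arbitrary assumptions. It only shows that the same filter-and-sum output can be computed in the time domain or, equivalently, by per-bin multiplication in the frequency domain, which is the equivalence behind the abstract's remark about frequency-domain implementations.

```python
import numpy as np


def filter_and_sum_time(x, h):
    """Time-domain filter-and-sum.

    x: array of shape (channels, samples), one row per microphone.
    h: array of shape (channels, taps), one FIR filter per microphone.
    Returns the summed output of length samples + taps - 1.
    """
    return sum(np.convolve(x_c, h_c) for x_c, h_c in zip(x, h))


def filter_and_sum_freq(x, h):
    """Same operation via the FFT: multiply per frequency bin, sum channels."""
    n = x.shape[1] + h.shape[1] - 1  # zero-pad so circular == linear convolution
    X = np.fft.rfft(x, n=n, axis=1)
    H = np.fft.rfft(h, n=n, axis=1)
    return np.fft.irfft((X * H).sum(axis=0), n=n)


# Illustrative sizes only (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 400))  # 2 channels, 400 samples
h = rng.standard_normal((2, 25))   # one 25-tap filter per channel

y_time = filter_and_sum_time(x, h)
y_freq = filter_and_sum_freq(x, h)
max_abs_diff = float(np.max(np.abs(y_time - y_freq)))
```

In this sketch `max_abs_diff` should be at the level of floating-point rounding error. In a trained model the filters `h` would be network parameters learned jointly with the acoustic model rather than random values.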
Pages: 965-979
Page count: 15