Noise-Robust Voice Conversion Based on Sparse Spectral Mapping Using Non-negative Matrix Factorization

Cited by: 15

Authors:

Aihara, Ryo [1]

Takashima, Ryoichi [1]

Takiguchi, Tetsuya [2]

Ariki, Yasuo [2]

Affiliations:

[1] Kobe Univ, Grad Sch Syst Informat, Kobe, Hyogo 6578501, Japan

[2] Kobe Univ, Org Adv Sci & Technol, Kobe, Hyogo 6578501, Japan

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2014 / Vol. E97D / No. 06

Keywords:

voice conversion; sparse representation; non-negative matrix factorization; noise robustness;

DOI:

10.1587/transinf.E97.D.1411

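Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

080201 [Mechanical Manufacturing and Automation];

Abstract: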

This paper presents a voice conversion (VC) technique for noisy environments based on a sparse representation of speech. Sparse representation-based VC using non-negative matrix factorization (NMF) is employed for noise-added spectral conversion between different speakers. In our previous exemplar-based VC method, source and target exemplars are extracted from parallel training data, in which the source and target speakers utter the same texts. The input source signal is represented using the source exemplars and their weights. The converted speech is then constructed from the target exemplars and the weights estimated for the source exemplars. However, this exemplar-based approach must hold all training exemplars (frames), and computing the weights of the source exemplars is time-consuming. In this paper, we propose a framework that trains the basis matrices of the source and target exemplars so that they share a common weight matrix. By using the basis matrices instead of the exemplars, VC is performed with less computation time than with the exemplar-based method. The effectiveness of this method was confirmed in speaker conversion experiments using noise-added speech data, in which it was compared with an exemplar-based method and a conventional Gaussian mixture model (GMM)-based method.
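The idea of source and target basis matrices that share a common weight matrix can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`train_coupled_bases`, `nmf_activations`, `convert`), the choice of stacking the parallel source and target frames into one matrix, the Euclidean-cost multiplicative updates, and the synthetic data. The abstract does not specify the actual cost function, sparsity constraint, or noise handling, so none of those are modeled here.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in the multiplicative updates


def nmf_activations(X, W, n_iter=200, seed=0):
    """Estimate non-negative weights H such that X ~= W @ H, with W held fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + EPS
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; keeps H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
    return H


def train_coupled_bases(X_src, X_tgt, n_bases, n_iter=200, seed=0):
    """Learn source and target bases that share one weight matrix.

    Assumption: the parallel (frame-aligned) spectra are stacked vertically,
    so a single H must explain both speakers at once.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([X_src, X_tgt])
    W = rng.random((X.shape[0], n_bases)) + EPS
    H = rng.random((n_bases, X.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ H @ H.T + EPS)
    d_src = X_src.shape[0]
    return W[:d_src], W[d_src:]  # (source basis, target basis)


def convert(X_in, W_src, W_tgt, n_iter=200):
    """Estimate weights against the source basis, then rebuild with the target basis."""
    H = nmf_activations(X_in, W_src, n_iter=n_iter)
    return W_tgt @ H


# Synthetic stand-in for aligned magnitude spectra (features x frames).
rng = np.random.default_rng(42)
H_shared = rng.random((4, 50))
X_src = rng.random((8, 4)) @ H_shared
X_tgt = rng.random((8, 4)) @ H_shared

W_src, W_tgt = train_coupled_bases(X_src, X_tgt, n_bases=4)
Y = convert(X_src, W_src, W_tgt)
print(Y.shape)
```

In this sketch the conversion step only solves for a weight matrix against a basis with a handful of columns, rather than against every training frame. That is the source of the computational saving described in the abstract.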

Citation

Pages: 1411-1418

Page count: 8

