An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Cited by: 98
Authors
Chen, SH [1]
Hwang, SH [1]
Wang, YR [1]
Affiliations
[1] Natl Chiao Tung Univ, Dept Engn, Hsinchu 300, Taiwan
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 1998 / Vol. 6 / No. 3
Keywords
Mandarin; pitch contour; prosodic information synthesizer; recurrent neural network; text-to-speech;
DOI
10.1109/89.668817
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, as well as inter-syllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use outputs from the preceding layers, along with additional syllable-level inputs fed directly to the second hidden layer, to generate desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by associated texts, and can automatically learn many human-prosody phonologic rules, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that all synthesized prosodic parameter sequences matched quite well with their original counterparts, and a pitch-synchronous-overlap-add-based (PSOLA-based) Mandarin TTS system was also used for testing of our approach. While subjective tests are difficult to perform and remain to be done in the future, we have carried out informal listening tests by a significant number of native Chinese speakers and the results confirmed that all synthesized speech sounded quite natural.
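The two-clock layering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TwoClockRNN`, the layer sizes, the Elman-style self-recurrence on both hidden layers, the `tanh` activations, the random untrained weights, and the five-value output frame are all assumptions made for illustration. What it does take from the abstract is the control flow: the first hidden layer updates once per word, and the second hidden layer and output layer update once per syllable, combining the current word state with syllable-level inputs.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained; for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def affine_tanh(weights, vector):
    """tanh(W @ vector) written out with plain lists."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]


def linear(weights, vector):
    """W @ vector, used for the output layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class TwoClockRNN:
    """Hypothetical sketch of a word-clocked layer feeding a syllable-clocked layer."""

    def __init__(self, word_dim, syl_dim, h1_dim, h2_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.h1_dim = h1_dim
        self.h2_dim = h2_dim
        # Assumption: each hidden layer feeds its previous state back into itself.
        self.w1 = make_matrix(h1_dim, word_dim + h1_dim, rng)
        self.w2 = make_matrix(h2_dim, h1_dim + syl_dim + h2_dim, rng)
        self.w_out = make_matrix(out_dim, h2_dim, rng)

    def forward(self, words):
        """words: list of (word_features, [syllable_features, ...]) pairs."""
        h1 = [0.0] * self.h1_dim
        h2 = [0.0] * self.h2_dim
        frames = []
        for word_feats, syllables in words:
            # Word-synchronized clock: one update per word.
            h1 = affine_tanh(self.w1, word_feats + h1)
            for syl_feats in syllables:
                # Syllable-synchronized clock: one update per syllable, using
                # the current word state plus direct syllable-level inputs.
                h2 = affine_tanh(self.w2, h1 + syl_feats + h2)
                frames.append(linear(self.w_out, h2))
        return frames


# Hypothetical utterance: a two-syllable word followed by a one-syllable word.
utterance = [
    ([1.0, 0.0, 0.0], [[0.1, 0.2], [0.3, 0.4]]),
    ([0.0, 1.0, 0.0], [[0.5, 0.6]]),
]
# Assumed output layout per syllable: pitch, energy, initial duration,
# final duration, pause duration (a single pitch value stands in for the contour).
model = TwoClockRNN(word_dim=3, syl_dim=2, h1_dim=4, h2_dim=4, out_dim=5, seed=1)
frames = model.forward(utterance)
print(len(frames))  # one output frame per syllable: 3
```

The nested loop is the point of the sketch: the word-level state changes only at word boundaries, so every syllable within a word sees the same word context while the syllable-level state continues to evolve.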
Pages: 226-239
Page count: 14