Can we Generate Emotional Pronunciations for Expressive Speech Synthesis?,IEEE Transactions on Affective Computing

IEEE Transactions on Affective Computing (IF 11.2), Pub Date: 2020-10-01, DOI: 10.1109/taffc.2018.2828429
Marie Tahon, Gwenole Lecorve, Damien Lolive

In the field of expressive speech synthesis, much work has been conducted on suprasegmental prosodic features, while little has been done on pronunciation variants. However, prosody is highly related to the sequence of phonemes to be expressed. This article raises two issues in the generation of emotional pronunciations for TTS systems. The first issue is the design of an automatic method for generating pronunciations from text, while the second addresses the very existence of emotional pronunciations through experiments conducted on emotional speech. To this end, an innovative pronunciation adaptation method is presented, which automatically adapts canonical phonemes first to those labeled in the corpus used to create a synthetic voice, and then to those labeled in an expressive corpus. This method consists of training conditional random field (CRF) pronunciation models with prosodic, linguistic, phonological and articulatory features. The analysis of emotional pronunciations reveals strong dependencies between prosody and phoneme assimilations or elisions. According to perceptual tests, the double adaptation makes it possible to synthesize expressive speech samples of good quality, but emotion-specific pronunciations are too subtle to be perceived by testers.
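The abstract does not give the model internals, so the following is only a minimal sketch, under stated assumptions, of what a linear-chain CRF pronunciation model and a two-stage ("double") adaptation could look like. Assumptions: the only edits are keeping or eliding a canonical phoneme (no substitutions or insertions), the features (bias, phoneme context, a prosodic `rate` attribute and its conjunction with the phoneme) are illustrative, the weights are hand-set rather than trained, and every name (`viterbi`, `adapt`, `VOICE_MODEL`, `EXPRESSIVE_MODEL`, `double_adaptation`) is hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

DELETED = "_"  # hypothetical label for an elided phoneme
Feats = Dict[str, str]                  # per-phoneme attributes, e.g. {"rate": "fast"}
Weights = Dict[Tuple[str, str], float]  # (feature or previous edit, edit) -> weight
Model = Tuple[Weights, Weights]         # (emission weights, transition weights)


def edit(label: str) -> str:
    """Name the edit a label represents: keep the canonical phoneme or elide it."""
    return "del" if label == DELETED else "keep"


def node_features(phones: List[str], feats: List[Feats], t: int) -> List[str]:
    """Hypothetical binary observation features for position t."""
    p = phones[t]
    names = [
        "bias",
        f"phone={p}",
        f"prev={phones[t - 1] if t > 0 else '<s>'}",
        f"next={phones[t + 1] if t + 1 < len(phones) else '</s>'}",
    ]
    for name, value in feats[t].items():
        names.append(f"{name}={value}")            # attribute alone
        names.append(f"phone={p}&{name}={value}")  # attribute conjoined with the phoneme
    return names


def viterbi(phones: List[str], feats: List[Feats], model: Model) -> List[str]:
    """Linear-chain CRF decoding: best label (phoneme or DELETED) per position."""
    emit_w, trans_w = model
    if not phones:
        return []
    # lattice[t][label] = (best path score ending in label, previous label)
    lattice: List[Dict[str, Tuple[float, Optional[str]]]] = []
    for t, p in enumerate(phones):
        obs = node_features(phones, feats, t)
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for y in (p, DELETED):  # candidate labels: keep or elide
            emit = sum(emit_w.get((f, edit(y)), 0.0) for f in obs)
            if t == 0:
                column[y] = (emit, None)
                continue
            best_prev, best = None, float("-inf")
            for y_prev, (s_prev, _) in lattice[-1].items():
                s = s_prev + trans_w.get((edit(y_prev), edit(y)), 0.0)
                if s > best:
                    best_prev, best = y_prev, s
            column[y] = (best + emit, best_prev)
        lattice.append(column)
    # Backtrack from the best-scoring final label.
    y = max(lattice[-1], key=lambda k: lattice[-1][k][0])
    path = [y]
    for t in range(len(phones) - 1, 0, -1):
        y = lattice[t][y][1]
        path.append(y)
    return path[::-1]


def adapt(phones: List[str], feats: List[Feats], model: Model) -> Tuple[List[str], List[Feats]]:
    """One adaptation stage: decode, then drop elided phonemes and their attributes."""
    labels = viterbi(phones, feats, model)
    kept = [i for i, y in enumerate(labels) if y != DELETED]
    return [labels[i] for i in kept], [feats[i] for i in kept]


# Hypothetical hand-set weights standing in for models trained on labeled corpora.
# Stage 1 (voice corpus): elide a schwa right after /Z/. Stage 2 (expressive corpus):
# elide any schwa in fast speech. Both penalize two consecutive elisions.
VOICE_MODEL: Model = ({("bias", "keep"): 0.5, ("prev=Z", "del"): 1.0}, {("del", "del"): -2.0})
EXPRESSIVE_MODEL: Model = (
    {("bias", "keep"): 0.5, ("phone=@&rate=fast", "del"): 1.0},
    {("del", "del"): -2.0},
)


def double_adaptation(phones: List[str], feats: List[Feats]) -> List[str]:
    """Canonical -> voice-corpus pronunciation -> expressive pronunciation."""
    phones, feats = adapt(phones, feats, VOICE_MODEL)
    phones, feats = adapt(phones, feats, EXPRESSIVE_MODEL)
    return phones


canonical = "Z @ n @ s E p a".split()  # "je ne sais pas" in SAMPA-like notation
print(double_adaptation(canonical, [{"rate": "fast"}] * len(canonical)))    # ['Z', 'n', 's', 'E', 'p', 'a']
print(double_adaptation(canonical, [{"rate": "normal"}] * len(canonical)))  # ['Z', 'n', '@', 's', 'E', 'p', 'a']
```

In this sketch the second stage consumes the first stage's output, so the expressive model only scores deviations from the voice corpus pronunciation rather than from the canonical one; with trained weights, the same decoding would apply unchanged.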

Updated: 2020-10-01