Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features
Speech Communication (IF 2.4). Pub Date: 2020-08-12, DOI: 10.1016/j.specom.2020.08.003
Linda Gerlach, Kirsty McDougall, Finnian Kelly, Anil Alexander, Francis Nolan

The present study investigates relationships between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that includes phonetic, perceptually relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers in an accent-controlled database (DyViS, Standard Southern British English), together with the comparison scores from an i-vector-based automatic speaker recognition system using ‘auto-phonetic’ (automatically extracted phonetic) features. The voice similarity ratings were obtained from 106 listeners, each of whom rated the voice similarity of pairings of ten speakers on a Likert scale via an online test.

Correlation analysis and multidimensional scaling (MDS) showed a positive relationship between listeners’ judgements and the automatic comparison scores. A separate analysis of the subsets of listener responses from the English and German native-speaker groups showed that a positive relationship was present for both groups, but that the correlation was higher for the English listener group.
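
As a rough illustration of the two analyses named above, the sketch below computes a Pearson correlation between per-pair listener ratings and system comparison scores, and embeds a speaker-by-speaker dissimilarity matrix with classical (Torgerson) MDS. It is not the authors' code: the choice of Pearson's r, the classical MDS variant, the helper names `pearson_r` and `classical_mds`, the similarity-to-dissimilarity conversion and every number in the usage section are hypothetical assumptions for illustration only.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def classical_mds(d, k=2):
    """Classical (Torgerson) MDS: embed an n x n dissimilarity matrix in k dimensions."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n    # centring matrix
    b = -0.5 * j @ (d ** 2) @ j             # double-centred squared dissimilarities
    vals, vecs = np.linalg.eigh(b)          # symmetric eigendecomposition, ascending order
    top = np.argsort(vals)[::-1][:k]        # indices of the k largest eigenvalues
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


if __name__ == "__main__":
    # HYPOTHETICAL data for 4 speakers (6 unordered pairs); not taken from the study.
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    mean_listener_rating = [6.1, 3.2, 2.5, 4.0, 3.8, 5.5]  # mean Likert similarity per pair
    system_score = [2.3, -0.4, -1.1, 0.9, 0.5, 1.8]         # automatic comparison score per pair

    print("Pearson r:", round(pearson_r(mean_listener_rating, system_score), 3))

    # Convert similarities to dissimilarities (assumed: top of scale minus rating),
    # then place the speakers in a two-dimensional "listener space".
    max_rating = 7.0  # assumed top of a hypothetical 7-point scale
    dissim = np.zeros((4, 4))
    for (p, q), rating in zip(pairs, mean_listener_rating):
        dissim[p, q] = dissim[q, p] = max_rating - rating
    print("Listener-space coordinates:\n", classical_mds(dissim, k=2).round(2))
```

Classical MDS treats the dissimilarities as Euclidean distances; for ordinal Likert data a non-metric MDS may be the more appropriate choice, and the abstract does not state which variant the study used.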

This work has key implications for forensic phonetics by highlighting the potential to automate part of the process of selecting foil voices when constructing a voice parade, a process that currently requires collecting and processing human judgements. Further, showing that automatic voice comparisons based on phonetic features can be used to select similar-sounding voices has important applications in ‘voice casting’ (finding voices that are similar to a given voice) and ‘voice banking’ (saving one's voice for future synthesis in case of an operation or degenerative disease).

Updated: 2020-10-02