Hidden Bawls, Whispers, and Yelps: Can Text Convey the Sound of Speech, Beyond Words?
IEEE Transactions on Affective Computing (IF 9.6) Pub Date: 5-12-2022, DOI: 10.1109/taffc.2022.3174721
Caluã de Lacerda Pataca, Paula Dornhofer Paro Costa
Whether a word was bawled, whispered, or yelped, captions will typically represent it in the same way. If captions are your only means of access to what is being said, the subjective nuances expressed in the voice are lost. Since so much of communication is carried by these nuances, we posit that if captions are to serve as an accurate representation of speech, embedding visual representations of paralinguistic qualities into them could help readers understand speech beyond its mere textual content. This paper presents a model for processing vocal prosody (loudness, pitch, and duration) and mapping it onto visual dimensions of typography (respectively, font-weight, baseline shift, and letter-spacing), creating a visual representation of these otherwise lost vocal subtleties that can be embedded directly into the typographical form of text. In an evaluation, participants were shown this speech-modulated typography and asked to match it to its originating audio, presented alongside similar alternatives. Participants (n=117) correctly identified the original audio clips with an average accuracy of 65%, with no significant difference whether the modulations were shown as animated or static text. Additionally, participants' comments showed that their mental models of speech-modulated typography varied widely.
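The feature-to-dimension pairing described in the abstract (loudness → font-weight, pitch → baseline shift, duration → letter-spacing) can be sketched as a simple per-word mapping. The numeric ranges below are illustrative assumptions for the sketch, not the authors' calibration, and the function name and output format are hypothetical:

```python
def modulate_typography(words):
    """Map per-word prosodic features onto typographic parameters.

    Each input word is a tuple (text, loudness, pitch, duration), with all
    three features assumed to be pre-normalized to [0, 1]. The pairing of
    feature to visual dimension follows the paper; the concrete numeric
    ranges (e.g. font-weight 300-800) are illustrative choices only.
    """
    styled = []
    for text, loudness, pitch, duration in words:
        styled.append({
            "text": text,
            # louder speech -> heavier weight (assumed range 300..800)
            "font_weight": round(300 + 500 * loudness),
            # higher pitch -> raised baseline (assumed range -0.2..0.2 em)
            "baseline_shift_em": round(0.4 * (pitch - 0.5), 2),
            # longer duration -> wider tracking (assumed range 0..0.25 em)
            "letter_spacing_em": round(0.25 * duration, 2),
        })
    return styled
```

Each output dictionary could then be rendered, for example, as inline CSS (`font-weight`, `vertical-align`, `letter-spacing`) on a per-word `<span>`.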

Updated: 2024-08-28