Leveraging Linguistic Context in Dyadic Interactions to Improve Automatic Speech Recognition for Children.
Computer Speech & Language (IF 3.1), Pub Date: 2020-04-16, DOI: 10.1016/j.csl.2020.101101
Manoj Kumar 1 , So Hyun Kim 2 , Catherine Lord 3 , Thomas D Lyon 4 , Shrikanth Narayanan 1

Automatic speech recognition for child speech has long been considered a more challenging problem than for adult speech. Various contributing factors have been identified, such as greater acoustic variability (including mispronunciations arising from ongoing biological growth and developing vocabulary and linguistic skills) and the scarcity of training corpora. A further challenge arises with the spontaneous speech of children engaged in conversational interaction, especially when the child has limited or impaired communication ability. This includes health applications, one of the motivating domains of this paper, that involve goal-oriented dyadic interactions between a child and a clinician or adult social partner as part of a behavioral assessment. In this work, we use linguistic context information from the interaction to adapt speech recognition models for child speech. Specifically, the interacting adult's spoken language provides the context for the child's speech. We propose two methods to exploit this context: lexical repetition and semantic response generation. For the latter, we use sequence-to-sequence models that learn to predict the target child utterance given context adult utterances. Long-term context is incorporated into the model by propagating the cell state across the duration of the conversation. We use interpolation techniques to adapt language models at the utterance level, and analyze the effect of the length and direction (forward and backward) of context. Two different domains are used in our experiments to demonstrate the generality of our methods: interactions between a child with ASD and an adult social partner in a play-based, naturalistic setting, and forensic interviews between a child and a trained interviewer.
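The utterance-level language model interpolation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the interpolation weight, the toy unigram tables, and the choice to build the context model from the adult's preceding utterances are all illustrative assumptions.

```python
from collections import Counter

def context_unigram_lm(adult_utterances):
    """Build a unigram LM from the adult's preceding utterances
    (captures lexical repetition between interlocutors)."""
    words = [w for utt in adult_utterances for w in utt.split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_background, p_context, lam=0.3):
    """Utterance-level linear interpolation:
    P(w) = (1 - lam) * P_bg(w) + lam * P_ctx(w)."""
    vocab = set(p_background) | set(p_context)
    return {w: (1 - lam) * p_background.get(w, 0.0)
               + lam * p_context.get(w, 0.0)
            for w in vocab}

# Toy background LM and two context (adult) utterances.
p_bg = {"the": 0.4, "ball": 0.1, "dog": 0.5}
p_ctx = context_unigram_lm(["where is the ball", "the red ball"])
p_adapted = interpolate(p_bg, p_ctx, lam=0.3)
# "ball" gains probability mass because the adult just said it.
```

In a real system the background model would be a full n-gram or neural LM and the weight would be tuned on held-out data; the key idea, shared with the method above, is re-weighting the LM toward words the adult interlocutor just used.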
In both cases, context-adapted models yield significant improvements (up to 10.71% absolute word error rate) over the baseline and perform consistently across context windows and directions. Using statistical analysis, we investigate the effect of source-based (adult) and target-based (child) factors on the adaptation methods. Our results demonstrate the applicability of our modeling approach to improving child speech recognition through information transfer from the adult interlocutor.
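Word error rate, the metric reported above, is computed from a standard Levenshtein alignment between reference and hypothesis word sequences. The sketch below is a generic implementation of that metric, not the paper's evaluation code.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, via edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> WER 0.25.
print(wer("the cat sat down", "the cat sat town"))  # 0.25
```

An "absolute" improvement of 10.71%, as reported, means the WER value itself drops by 0.1071 (e.g. from 0.40 to 0.2929), as opposed to a relative reduction.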

Updated: 2020-04-16