Improving user verification in human-robot interaction from audio or image inputs through sample quality assessment
Pattern Recognition Letters (IF 5.1) Pub Date: 2021-07-03, DOI: 10.1016/j.patrec.2021.06.014
David Freire-Obregón, Kevin Rosales-Santana, Pedro A. Marín-Reyes, Adrian Penate-Sanchez, Javier Lorenzo-Navarro, Modesto Castrillón-Santana

In this paper, we tackle the task of improving biometric verification in the context of Human-Robot Interaction (HRI). A robot that wants to identify a specific person to provide a service can do so either by image verification or, if lighting conditions are not favourable, through voice verification. In our approach, we take advantage of the possibility a robot has of acquiring further data until it is sure of the identity of the person. The key contribution is that we select, from both the image and audio signals, the parts that are of higher confidence. For images, we use a system that looks at the face of each person and selects frames in which the confidence is high, while keeping those frames separated in time to avoid using very similar facial appearances. For audio, our approach segments the signal to find the parts that contain a person talking, avoiding those in which noise is present. Once the parts of interest are found, each input is described with an independent deep learning architecture that obtains a descriptor for each kind of input (face/voice). We also present fusion methods that improve performance by combining the features from both face and voice; results validating this are shown for each independent input and for the fusion methods.
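No code accompanies this listing, so the sketch below is only an illustrative reconstruction, in Python, of the two ideas the abstract describes: greedy selection of high-confidence face frames kept apart in time, and a simple weighted score fusion of the face and voice outputs. Every name, threshold, and the greedy strategy itself are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the listing publishes no code, so the structure,
# names, and thresholds below are assumptions, not the authors' method.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    timestamp: float   # seconds into the video stream
    confidence: float  # face-detection/quality score in [0, 1]


def select_frames(frames: List[Frame],
                  min_confidence: float = 0.8,
                  min_gap_s: float = 1.0,
                  max_frames: int = 5) -> List[Frame]:
    """Greedily keep the highest-confidence frames while forcing a minimum
    temporal gap, so near-identical facial appearances are not reused."""
    candidates = sorted((f for f in frames if f.confidence >= min_confidence),
                        key=lambda f: f.confidence, reverse=True)
    selected: List[Frame] = []
    for frame in candidates:
        if all(abs(frame.timestamp - s.timestamp) >= min_gap_s for s in selected):
            selected.append(frame)
            if len(selected) == max_frames:
                break
    return sorted(selected, key=lambda f: f.timestamp)


def fuse_scores(face_score: float, voice_score: float,
                w_face: float = 0.6) -> float:
    """Weighted-sum score fusion, a common baseline; the paper's actual
    fusion methods may operate on features rather than scores."""
    return w_face * face_score + (1.0 - w_face) * voice_score
```

Under these assumptions, a robot's verification loop could call select_frames on its buffered face detections and keep acquiring data until fuse_scores exceeds a verification threshold.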



Updated: 2021-07-13