Investigating topics, audio representations and attention for multimodal scene-aware dialog
Computer Speech & Language (IF 4.3) Pub Date: 2020-05-06, DOI: 10.1016/j.csl.2020.101102
Shachi H. Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman

With recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVAs) such as Alexa and Google Home have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech, and dialog system technologies that is enabling IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities, and events surrounding them. As part of the Audio Visual Scene-Aware Dialog (AVSD) track of the 7th Dialog System Technology Challenge (DSTC7), we explore three main techniques for multimodal dialog: 1) exploring 'topics' of the dialog as an important contextual feature for scene-aware conversations, 2) investigating several multimodal attention mechanisms during response generation, and 3) incorporating an end-to-end audio classification sub-network (AclNet) into our architecture. We present a detailed analysis of our experiments and show that our model variations outperform the baseline system presented for this task.
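The multimodal attention idea from point 2 can be illustrated with a minimal sketch: a decoder query vector attends separately over the feature sequences of each modality (video, audio, dialog history), and the per-modality context vectors are fused for response generation. The dimensions, the dot-product scoring, and fusion by simple concatenation here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multimodal_attention(query, modality_feats):
    """Attend over each modality's (T, d) feature matrix with a shared
    (d,) decoder query, then fuse the per-modality contexts by
    concatenation. A sketch, not the authors' exact model."""
    contexts = []
    for name, feats in modality_feats.items():
        scores = feats @ query            # (T,) alignment scores
        weights = softmax(scores)         # attention distribution over steps
        contexts.append(weights @ feats)  # (d,) weighted context vector
    return np.concatenate(contexts)       # fused multimodal context

rng = np.random.default_rng(0)
d = 8
query = rng.standard_normal(d)
feats = {
    "video": rng.standard_normal((5, d)),    # 5 video frames
    "audio": rng.standard_normal((7, d)),    # 7 audio segments
    "history": rng.standard_normal((3, d)),  # 3 dialog turns
}
ctx = multimodal_attention(query, feats)
print(ctx.shape)  # (24,) — three d=8 contexts concatenated
```

In a real system the fused context would condition the response decoder at each generation step; hierarchical variants instead learn a second attention over the per-modality contexts rather than concatenating them.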




Updated: 2020-05-06