Aspect-Aware Response Generation for Multimodal Dialogue System, ACM Transactions on Intelligent Systems and Technology

Aspect-Aware Response Generation for Multimodal Dialogue System
ACM Transactions on Intelligent Systems and Technology (IF 5), Pub Date: 2021-02-04, DOI: 10.1145/3430752
Mauajama Firdaus, Nidhi Thakur, Asif Ekbal

Multimodality in dialogue systems has opened up new frontiers for the creation of robust conversational agents. Any multimodal system aims at bridging the gap between language and vision by leveraging diverse and often complementary information from image, audio, and video, as well as text. For every task-oriented dialogue system, different aspects of the product or service are crucial for satisfying the user’s demands, and the user selects a product or service based on these aspects. The ability to generate responses with the specified aspects in a goal-oriented dialogue setup facilitates user satisfaction by fulfilling the user’s goals. Therefore, in our current work, we propose the task of aspect-controlled response generation in a multimodal task-oriented dialogue system. We employ a multimodal hierarchical memory network for generating responses that utilize information from both text and images. As there was no readily available data for building such multimodal systems, we create a Multi-Domain Multi-Modal Dialog (MDMMD++) dataset. The dataset comprises conversations containing both text and images from four different domains: hotels, restaurants, electronics, and furniture. Quantitative and qualitative analysis on the newly created MDMMD++ dataset shows that the proposed methodology outperforms the baseline models for the proposed task of aspect-controlled response generation.

Updated: 2021-02-04
