Grounded Sequence to Sequence Transduction, IEEE Journal of Selected Topics in Signal Processing



Grounded Sequence to Sequence Transduction
IEEE Journal of Selected Topics in Signal Processing (IF 8.7), Pub Date: 2020-03-01, DOI: 10.1109/jstsp.2020.2998415
Lucia Specia, Loic Barrault, Ozan Caglayan, Amanda Duarte, Desmond Elliott, Spandana Gella, Nils Holzenberger, Chiraag Lala, Sun Jae Lee, Jindrich Libovicky, Pranava Madhyastha, Florian Metze, Karl Mulligan, Alissa Ostapenko, Shruti Palaskar, Ramon Sanabria, Josiah Wang, Raman Arora

Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one language sequence to another. Although multiple modalities such as sound and video are becoming increasingly available, state-of-the-art systems are inherently unimodal, in the sense that they take a single modality (either speech or text) as input. Evidence from human learning suggests that additional modalities can provide disambiguating signals crucial for many language tasks. In this article, we describe the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations. We then show how this single dataset can be used to develop systems for a variety of language tasks and present a number of models meant as starting points. Across tasks, we find that building multimodal architectures that perform better than their unimodal counterparts remains a challenge. This leaves plenty of room for the exploration of more advanced solutions that fully exploit the multimodal nature of the How2 dataset, and for the general direction of multimodal learning with other datasets as well.


Updated: 2020-03-01
