Audiovisual speech recognition: A review and forecast
International Journal of Advanced Robotic Systems (IF 2.3), Pub Date: 2020-12-23, DOI: 10.1177/1729881420976082
Linlin Xia, Gang Chen, Xun Xu, Jiashuo Cui, Yiping Gao

Audiovisual speech recognition is a promising solution for multimodal human–computer interaction. For a long time, it has been very difficult to build machines capable of generating or understanding even fragments of natural language; fused sensory channels such as sight, smell, and touch give machines additional mediums through which to perceive and understand. This article presents a detailed review of recent advances in the audiovisual speech recognition field. After outlining the development of audiovisual speech recognition along a timeline, we describe the typical audiovisual speech databases, both single-view and multi-view, since general-purpose public databases should be the first concern for any audiovisual speech recognition task. For the subsequent challenges, which are inseparable from feature extraction and dynamic audiovisual fusion, the principal value of deep learning-based tools such as deep fully convolutional neural networks, bidirectional long short-term memory networks, and 3D convolutional neural networks lies in the fact that they offer comparatively simple solutions to these problems. Having analyzed and compared well-developed audiovisual speech recognition frameworks in terms of computational load, accuracy, and applicability, we further present our insights into future audiovisual speech recognition architecture design. We argue that end-to-end audiovisual speech recognition models and deep learning-based feature extractors will lead multimodal human–computer interaction directly to a solution.
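To make the architectural pattern the abstract names concrete, below is a minimal PyTorch sketch of such an end-to-end pipeline: a 3D convolutional visual front-end over the mouth region, a simple audio front-end, and a bidirectional LSTM that fuses both streams over time. This is not the authors' model; the class name `AVSRSketch`, every layer size, the 40-dimensional filterbank input, and the 500-class output are illustrative assumptions, and a real system would typically attach a CTC or attention-based decoder to the per-frame logits.

```python
# A minimal sketch (not the paper's model) of an end-to-end audiovisual
# speech recognition pipeline of the kind the review surveys:
# 3D CNN visual front-end + audio front-end + BiLSTM fusion.
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class AVSRSketch(nn.Module):
    def __init__(self, n_classes: int = 500):
        super().__init__()
        # Visual front-end: 3D convolution over (time, height, width),
        # then global spatial pooling, yielding one feature vector per frame.
        self.visual = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        # Audio front-end: project per-frame filterbank features (40 dims assumed).
        self.audio = nn.Linear(40, 64)
        # Fusion: concatenate both streams and model temporal context with a
        # bidirectional LSTM, as in many of the surveyed frameworks.
        self.fusion = nn.LSTM(128, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, lips: torch.Tensor, fbank: torch.Tensor) -> torch.Tensor:
        # lips:  (batch, 1, T, H, W) grayscale mouth-region crops
        # fbank: (batch, T, 40) audio filterbank frames, aligned to video
        v = self.visual(lips).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        a = self.audio(fbank)                                          # (B, T, 64)
        h, _ = self.fusion(torch.cat([v, a], dim=-1))                  # (B, T, 256)
        return self.classifier(h)                                      # per-frame logits


# Shape check on random data: 16 video frames of 96x96 crops.
model = AVSRSketch()
out = model(torch.randn(2, 1, 16, 96, 96), torch.randn(2, 16, 40))
print(out.shape)  # torch.Size([2, 16, 500])
```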



Updated: 2020-12-23