Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, International Journal of Computer Vision


Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
International Journal of Computer Vision (IF 11.6), Pub Date: 2021-05-07, DOI: 10.1007/s11263-021-01457-9
Ben Saunders, Necati Cihan Camgoz, Richard Bowden

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign language production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a mixture density network (MDN) formulation to produce realistic and expressive sign pose sequences. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions.
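
The counter decoding idea described in the abstract (tracking production progress and predicting the end of the sequence) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `counter_decode` and `toy_step`, the convention that the counter runs from 0.0 to 1.0, the stopping threshold, and the `max_frames` safety cap are all assumptions made for illustration, and `toy_step` merely stands in for a trained decoder.

```python
from typing import Callable, List, Tuple

Pose = List[float]  # one frame of joint coordinates, flattened
# Hypothetical decoder step: sees all poses and counters produced so far,
# returns the next pose together with its progress counter.
StepFn = Callable[[List[Pose], List[float]], Tuple[Pose, float]]


def counter_decode(step: StepFn, first_pose: Pose, max_frames: int = 300) -> List[Pose]:
    """Generate poses one frame at a time until the counter signals the end.

    Assumption: the counter is a progress value in [0, 1], and a value
    of 1.0 or more marks the end of the sequence.
    """
    poses: List[Pose] = [first_pose]
    counters: List[float] = [0.0]
    while len(poses) < max_frames:  # safety cap in case the counter never finishes
        pose, counter = step(poses, counters)
        poses.append(pose)
        counters.append(counter)
        if counter >= 1.0:  # predicted end of sequence
            break
    return poses


def toy_step(poses: List[Pose], counters: List[float]) -> Tuple[Pose, float]:
    """Stand-in for a trained model: nudges the pose and advances progress by 0.25."""
    next_pose = [x + 0.1 for x in poses[-1]]
    next_counter = min(1.0, counters[-1] + 0.25)
    return next_pose, next_counter


sequence = counter_decode(toy_step, first_pose=[0.0, 0.0, 0.0])
print(len(sequence))  # start frame plus four generated frames
```

Because the stopping condition depends on a predicted continuous value rather than a discrete end-of-sequence token, the output length is decided by the model at generation time, which is what makes variable-length continuous output possible in this kind of scheme.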


Last updated: 2021-05-07
