Lipreading with DenseNet and resBi-LSTM



Signal, Image and Video Processing (IF 2.0), Pub Date: 2020-01-24, DOI: 10.1007/s11760-019-01630-1
Xuejuan Chen, Jixiang Du, Hongbo Zhang

Lipreading is the task of recognizing what a speaker says from lip movements alone. Most previous work addresses lipreading in English; there has been little research on Mandarin lipreading because datasets are lacking. For that reason, we introduce a simple method for building a sentence-level Mandarin lipreading dataset from programs such as news, speeches, and talk shows. We use Hanyu Pinyin (a phonemic transcription of Chinese) as the labels, giving 349 classes in total, whereas the number of Chinese characters in our dataset is 1705. Lipreading therefore proceeds in two steps. The first step is to obtain the Hanyu Pinyin sequence, for which we propose a model composed of a 3D convolutional layer with DenseNet and residual bidirectional long short-term memory (resBi-LSTM). In the second step, to obtain the final Chinese characters, a model with a stack of multi-head attention converts the Hanyu Pinyin into Chinese characters.
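The two-step design can be illustrated with a small sketch. The abstract gives the label counts (349 Pinyin classes against 1705 characters) but not the lexicon, so the table below is a hypothetical toy lexicon of common characters, and toneless Pinyin is assumed purely for illustration. It shows why Pinyin labels shrink the class space of the first step, and why the second step needs sentence context: several characters share one syllable, so a Pinyin sequence alone does not determine the characters.

```python
# Toy illustration only: this tiny lexicon is hypothetical and is NOT taken
# from the paper's dataset. Toneless Pinyin is an assumption of this sketch.
from collections import defaultdict

# A few common characters mapped to their toneless Hanyu Pinyin syllable.
CHAR_TO_PINYIN = {
    "是": "shi", "时": "shi", "事": "shi", "十": "shi",
    "妈": "ma", "马": "ma", "吗": "ma",
    "他": "ta", "她": "ta",
    "我": "wo",
}


def invert(char_to_pinyin):
    """Group characters by the Pinyin syllable they share."""
    table = defaultdict(list)
    for char, syllable in char_to_pinyin.items():
        table[syllable].append(char)
    return dict(table)


def candidate_count(pinyin_sequence, table):
    """Count the character sequences consistent with a Pinyin sequence."""
    total = 1
    for syllable in pinyin_sequence:
        total *= len(table[syllable])
    return total


PINYIN_TO_CHARS = invert(CHAR_TO_PINYIN)

n_chars = len(CHAR_TO_PINYIN)       # size of the character label space
n_syllables = len(PINYIN_TO_CHARS)  # size of the (smaller) Pinyin label space

# Step 1 would output a Pinyin sequence such as this one; step 2 must then
# choose one character sequence among all consistent candidates.
ambiguity = candidate_count(["wo", "shi", "ta"], PINYIN_TO_CHARS)
```

In this toy lexicon, ten characters collapse to four syllables, and the three-syllable sequence above is consistent with eight character sequences. A context-aware sequence model, such as the multi-head attention stack the abstract describes, is one way to resolve that ambiguity.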


Updated: 2020-01-24
