Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency
arXiv - CS - Sound | Pub Date: 2021-04-08 | DOI: arxiv-2105.00812
Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou

Transformer-based self-supervised models are trained as feature extractors and have enabled many downstream speech tasks to achieve state-of-the-art performance. However, both training and inference with these models can incur prohibitively high computational cost and a large parameter budget. Although the Parameter Sharing Strategy (PSS) proposed in ALBERT paves the way for parameter reduction, the required computation remains unchanged. Interestingly, we found in experiments that the distributions of feature embeddings from different Transformer layers are similar when PSS is applied, a property we term Layer Consistency (LC) in this paper. Given this similarity of feature distributions, we assume that feature embeddings from different layers have similar representational power. In this work, Layer Consistency lets us use Transformer-based models more efficiently: the number of Conformer layers in each training iteration can be uniformly sampled, and Shallow Layer Inference (SLI) can be applied to reduce the number of layers at inference time. In experiments, our models are trained on the LibriSpeech dataset and then evaluated on both phone classification and speech recognition tasks. We experimentally achieve a 7.8x parameter reduction, a 41.9% training speedup, and a 37.7% inference speedup while maintaining performance comparable to conventional BERT-like self-supervised methods.
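To make the layer-sampling idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a single shared set of weights stands in for every layer position (parameter sharing), the block is applied a uniformly sampled number of times per training step, and fewer repetitions are used at inference to mimic Shallow Layer Inference. A plain nn.TransformerEncoderLayer is used as an illustrative substitute for a real Conformer block, and all class and parameter names here are hypothetical.

```python
# Illustrative sketch only (assumptions noted above); not the paper's code.
import random
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, max_layers=12):
        super().__init__()
        # Parameter sharing: one set of weights reused at every layer position.
        # A standard TransformerEncoderLayer stands in for a Conformer block.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.max_layers = max_layers

    def forward(self, x, num_layers=None):
        if num_layers is None:
            # Training: uniformly sample how many times to apply the block
            # this iteration; otherwise default to the full depth.
            num_layers = (
                random.randint(1, self.max_layers) if self.training
                else self.max_layers
            )
        for _ in range(num_layers):
            x = self.shared_block(x)
        return x


encoder = SharedLayerEncoder()
feats = torch.randn(8, 100, 256)             # (batch, frames, feature_dim)

encoder.train()
out_train = encoder(feats)                   # randomly sampled depth

encoder.eval()
with torch.no_grad():
    out_sli = encoder(feats, num_layers=4)   # SLI: fewer layers at inference
```

Because all layer positions share the same weights, the sampled depth only changes how many times the block is applied, which is what makes truncating the stack at inference a plausible trade-off when the layer-wise feature distributions are consistent.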

Updated: 2021-05-04