Optimizing Inference Performance of Transformers on CPUs
arXiv - CS - Mathematical Software · Pub Date: 2021-02-12 · DOI: arxiv-2102.06621
Dave Dice, Alex Kogan

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question answering. While enormous research attention is paid to the training of these models, relatively little effort has gone into improving their inference performance. This paper addresses that gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify the key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace and are shown to achieve speedups of up to 2.36x. The considered optimizations require no changes to the implementation of the models and do not affect their accuracy.
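For context, the kind of measurement the abstract refers to can be reproduced with a minimal CPU inference timing loop using the HuggingFace transformers library. The sketch below is not the authors' benchmark; the model name, sequence length, and iteration counts are illustrative assumptions.

```python
# Minimal sketch: timing BERT inference on CPU with HuggingFace transformers.
# Model name, sequence length, and iteration counts are assumptions, not the
# paper's exact benchmark configuration.
import time

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # assumed checkpoint; the paper focuses on BERT
SEQ_LEN = 128                      # assumed fixed sequence length
N_ITERS = 50                       # assumed number of timed iterations

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Build a fixed-length dummy input so every iteration performs the same work.
text = "an example sentence " * 16
inputs = tokenizer(text, max_length=SEQ_LEN, padding="max_length",
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    # Warm-up iterations let thread pools and caches settle before timing.
    for _ in range(5):
        model(**inputs)

    start = time.perf_counter()
    for _ in range(N_ITERS):
        model(**inputs)
    elapsed = time.perf_counter() - start

print(f"average CPU inference latency: {elapsed / N_ITERS * 1000:.1f} ms")
```

A loop like this makes the effect of CPU-side optimizations directly visible as a change in average per-inference latency.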

Updated: 2021-02-15