The Two-Pass Softmax Algorithm
arXiv - CS - Performance. Pub Date: 2020-01-13, DOI: arxiv-2001.04438
Marat Dukhan and Artsiom Ablavatski

The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that, in a well-optimized implementation on HPC-class processors, the performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the "mantissa" and another representing the "exponent". Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in the AVX512 implementation, and by up to 18% in the AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors. To foster reproducibility, we released an open-source implementation of the new Two-Pass Softmax algorithm and the other experiments in this paper as part of the XNNPACK library at GitHub.com/google/XNNPACK.
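For concreteness, below is a minimal scalar C sketch of both algorithms. It is illustrative only: the function names are hypothetical, the actual XNNPACK kernels are vectorized (AVX2/AVX512) and replace the libm calls with polynomial approximations, and details such as how the second pass obtains the per-element exponents are assumptions of this sketch rather than a restatement of the paper's kernels.

#include <math.h>
#include <stddef.h>

/* Conventional Three-Pass softmax (reference sketch): pass 1 computes the
 * maximum (the normalization constant), pass 2 exponentiates the shifted
 * inputs and accumulates their sum, pass 3 divides by the sum. */
void softmax_three_pass(const float* x, float* y, size_t n) {
  float max = -INFINITY;
  for (size_t i = 0; i < n; i++) {      /* pass 1: max reduction */
    if (x[i] > max) max = x[i];
  }
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++) {      /* pass 2: exponentiate and sum */
    y[i] = expf(x[i] - max);            /* shift by max to avoid overflow */
    sum += y[i];
  }
  const float inv_sum = 1.0f / sum;
  for (size_t i = 0; i < n; i++) {      /* pass 3: normalize */
    y[i] *= inv_sum;
  }
}

/* Two-Pass softmax sketch: each exp(x) is kept as a pair (m, e) with
 * exp(x) = m * 2^e and m roughly in [0.71, 1.41], so no max reduction is
 * needed to avoid overflow. The running sum is held in the same
 * (mantissa, exponent) representation. */
void softmax_two_pass(const float* x, float* y, size_t n) {
  const float log2e = 1.44269504f;      /* log2(e) */
  float sum_m = 0.0f;                   /* sum, mantissa part */
  float sum_e = -INFINITY;              /* sum, exponent part */

  /* Pass 1: accumulate sum(exp(x[i])) as (sum_m, sum_e), rescaling the
   * running mantissa whenever a larger exponent is encountered. */
  for (size_t i = 0; i < n; i++) {
    const float t = x[i] * log2e;       /* exp(x) = 2^t */
    const float e = rintf(t);           /* integer part: power of two */
    const float m = exp2f(t - e);       /* mantissa, t - e in [-0.5, 0.5] */
    if (e > sum_e) {
      sum_m = sum_m * exp2f(sum_e - e) + m;
      sum_e = e;
    } else {
      sum_m += m * exp2f(e - sum_e);
    }
  }

  /* Pass 2: recompute each exponential from the input and scale it by the
   * reciprocal of the sum: y[i] = m_i * 2^(e_i - sum_e) / sum_m. */
  const float inv_sum_m = 1.0f / sum_m;
  for (size_t i = 0; i < n; i++) {
    const float t = x[i] * log2e;
    const float e = rintf(t);
    y[i] = exp2f(t - e) * exp2f(e - sum_e) * inv_sum_m;
  }
}

The sketch recomputes the exponentials in the second pass rather than storing intermediates, which is what reduces the memory traffic from three streamed passes to two: recomputation is cheap relative to memory bandwidth on the processors the paper targets.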

Last updated: 2020-01-14