Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework
ACM Transactions on Mathematical Software (IF 2.7), Pub Date: 2021-04-20, DOI: 10.1145/3402225
Field G. Van Zee, Devangi N. Parikh, Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
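To make the typecasting idea concrete, the following is a minimal sketch in C, not the BLIS implementation. It assumes one particular combination: A, B, and C are stored as single-precision real, column-major matrices, while the product and accumulation are carried out in double precision. The function names (`pack_cast_s2d`, `kernel_d`, `gemm_sss_d`), the caller-supplied workspace, and the plain triple loop standing in for an optimized microkernel are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: "pack" a column-major float matrix into a contiguous
   double buffer, typecasting each element as it is copied. */
static void pack_cast_s2d(size_t rows, size_t cols,
                          const float *src, size_t ld_src, double *dst)
{
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            dst[i + j * rows] = (double)src[i + j * ld_src];
}

/* Stand-in for a double-precision microkernel: ab := a * b, where a is
   m x k, b is k x n, and all buffers are contiguous and column-major. */
static void kernel_d(size_t m, size_t n, size_t k,
                     const double *a, const double *b, double *ab)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i + p * m] * b[p + j * k];
            ab[i + j * m] = sum;
        }
}

/* C := C + A * B with float storage and double computation.
   work must hold at least m*k + k*n + m*n doubles (an assumption of this
   sketch; a real framework manages its own packing buffers). */
static void gemm_sss_d(size_t m, size_t n, size_t k,
                       const float *A, size_t lda,
                       const float *B, size_t ldb,
                       float *C, size_t ldc, double *work)
{
    assert(lda >= m && ldb >= k && ldc >= m);

    double *a_pack = work;
    double *b_pack = a_pack + m * k;
    double *ab     = b_pack + k * n;

    /* Typecast during packing: storage precision -> computation precision. */
    pack_cast_s2d(m, k, A, lda, a_pack);
    pack_cast_s2d(k, n, B, ldb, b_pack);

    /* Only one computation-precision kernel is needed for this case. */
    kernel_d(m, n, k, a_pack, b_pack, ab);

    /* Typecast during accumulation: add in double, then store as float. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            C[i + j * ldc] = (float)((double)C[i + j * ldc] + ab[i + j * m]);
}
```

Because the casts happen only at the packing and accumulation boundaries, the inner kernel never needs to know the storage precisions. This is one plausible reading of how a small number of kernels can serve many datatype combinations.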

Updated: 2021-04-20