Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra
arXiv - CS - Hardware Architecture Pub Date : 2020-11-16 , DOI: arxiv-2011.08070
Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel.
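To illustrate the access pattern in question, the sketch below shows a plain CSR sparse-matrix times dense-vector product (this is textbook CSR SpMV, not the paper's kernel): every nonzero requires an indirect load `x[col_idx[j]]`, which is exactly the kind of indirection the proposed streaming hardware turns into a hardware-managed stream.

```c
#include <assert.h>

/* Textbook CSR sparse-matrix x dense-vector product, y = A*x.
 * row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros; col_idx[j]
 * holds the column of the j-th nonzero. The load x[col_idx[j]] is
 * the indirect memory lookup that limits SpMV on conventional cores. */
static void csr_spmv(int nrows, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *x, double *y) {
    for (int i = 0; i < nrows; ++i) {
        double acc = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            acc += vals[j] * x[col_idx[j]]; /* indirect lookup */
        }
        y[i] = acc;
    }
}
```

On an in-order core without support like the paper's extension, each such indirect load occupies issue slots and stalls the FPU; streaming the `col_idx`/`x` indirection from hardware is what enables the reported FPU utilizations.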

Updated: 2020-11-17