SMASH: Sparse Matrix Atomic Scratchpad Hashing
arXiv - CS - Hardware Architecture. Pub Date: 2021-05-29, DOI: arXiv-2105.14156
Kaustubh Shivdikar

Sparse matrices, more specifically SpGEMM kernels, are commonly found in a wide range of applications, spanning graph-based path-finding to machine learning algorithms (e.g., neural networks). A particular challenge in implementing SpGEMM kernels has been the pressure placed on DRAM memory. One approach to tackle this problem is to use an inner product method for the SpGEMM kernel implementation. While the inner product produces fewer intermediate results, it can end up saturating the memory bandwidth, given the high number of redundant fetches of the input matrix elements. Using an outer product-based SpGEMM kernel can reduce redundant fetches, but at the cost of increased overhead due to extra computation and memory accesses for producing/managing partial products. In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach. We leverage atomic instructions to merge intermediate partial products as they are generated. The use of atomic instructions eliminates the need to create partial product matrices. To evaluate our row-wise product approach, we map an optimized SpGEMM kernel to a custom accelerator designed to accelerate graph-based applications. The targeted accelerator is an experimental system named PIUMA, being developed by Intel. PIUMA provides several attractive features, including fast context switching, user-configurable caches, globally addressable memory, non-coherent caches, and asynchronous pipelines. We tailor our SpGEMM kernel to exploit many of the features of the PIUMA fabric. This thesis compares our SpGEMM implementation against prior solutions, all mapped to the PIUMA framework. We briefly describe some of the PIUMA architecture features and then delve into the details of our optimized SpGEMM kernel. Our SpGEMM kernel can achieve 9.4x speedup as compared to competing approaches.
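The row-wise product approach the abstract describes (often called Gustavson's algorithm) can be illustrated with a minimal sketch: for each nonzero A[i,k], the corresponding row k of B is scaled and its partial products are merged into a per-row hash-based accumulator as they are generated, rather than materializing a partial-product matrix. This is an illustrative Python analogue, not the thesis's PIUMA kernel; the dict-of-dicts representation and all names are assumptions for clarity (the hash table plays the role of the atomic scratchpad merge).

```python
def spgemm_rowwise(a_rows, b_rows):
    """Row-wise product SpGEMM sketch.

    a_rows, b_rows: sparse matrices as dict-of-dicts, {row: {col: value}}.
    Returns C = A @ B in the same format.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = {}  # hash-based accumulator: partials merge on insertion,
                  # so no intermediate partial-product matrix is created
        for k, a_ik in a_row.items():
            # scale row k of B by A[i,k] and merge into row i of C
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            c_rows[i] = acc
    return c_rows
```

Note that each input row of B is fetched at most once per nonzero of A's row, avoiding the redundant input fetches of the inner-product method, while the in-place merge avoids the partial-product management overhead of the outer-product method; in the thesis, the merge step is performed with atomic instructions so that parallel workers can update shared output rows safely.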

Updated: 2021-06-01