An Efficient Parallel Processor for Dense Tensor Computation
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8), Pub Date: 2021-05-27, DOI: 10.1109/tvlsi.2021.3080318
Wei-Pei Huang, Ray C. C. Cheung, Hong Yan

Much of today's data is multidimensional; such data are called tensors. Tensor computations have been applied in many fields, and various software libraries have been developed for them. However, hardware architectures that accelerate tensor computations have received little attention. In this article, an efficient and unified processing element (PE) array for 3-D tensor computation is demonstrated. The PE array is optimized for thin and tall tensor–matrix multiplication and for two types of tensor-times-matrix chain (TTMc) operations. The design is evaluated in three case studies and compared with the state-of-the-art design. By using computation partitioning and rearrangement, data movement between the field-programmable gate array (FPGA) and off-chip DDR memory can be reduced by $O(I^{2})$, where $I$ is the largest size among all dimensions of the data tensor. For the TTMc implementation, the clock frequency is 18% higher than that of the state-of-the-art implementation on the same FPGA chip. An experiment on rendering a 3-D volumetric data set by the tensor approximation method is conducted for demonstration. For the brick reconstruction process, the runtime on the FPGA implementation is 50% lower (that is, twice as fast) than on a GPU. In CANDECOMP/PARAFAC decomposition, the runtime of one iteration is reduced by up to 93% compared with programs implemented in TensorLy, a Python library.
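For readers unfamiliar with the operations named in the abstract, the following is a minimal software sketch of a tensor-times-matrix (TTM) product and a TTM chain on a 3-D tensor, written with NumPy. It is only a reference for what the operations compute, not the authors' PE array or partitioning scheme. The function names, the `skip` option, and the tensor sizes are illustrative assumptions, and the abstract does not say which two TTMc variants the hardware supports.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode` (one TTM step)."""
    # Contract the matrix's second axis with the chosen tensor axis.
    # tensordot puts the new axis first, so move it back to `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

def ttm_chain(tensor, matrices, skip=None):
    """Apply a TTM along each mode in turn, optionally skipping one mode.

    `skip` is a hypothetical convenience for this sketch, not taken
    from the article.
    """
    out = tensor
    for mode, matrix in enumerate(matrices):
        if mode == skip:
            continue
        out = mode_n_product(out, matrix, mode)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5, 4))  # illustrative 3-D data tensor

# One matrix per mode, stored as (R, I_n) with R = 2, i.e. the transpose
# of a tall and thin (I_n, R) factor matrix.
factors = [rng.standard_normal((2, dim)) for dim in X.shape]

core = ttm_chain(X, factors)             # all three modes contracted
partial = ttm_chain(X, factors, skip=0)  # mode 0 left untouched

print(core.shape, partial.shape)
```

Each TTM step shrinks one dimension from $I_n$ to $R$, which is why the order of the steps and the amount of intermediate data kept on chip matter for an accelerator.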

Updated: 2021-06-29