Performance of low-rank approximations in tensor train format (TT-SVD) for large dense tensors
arXiv - CS - Mathematical Software. Pub Date: 2021-01-29, DOI: arxiv-2102.00104
Melven Röhrig-Zöllner, Jonas Thies, Achim Basermann

There are several factorizations of multi-dimensional tensors into lower-dimensional components, known as 'tensor networks'. We consider the popular 'tensor-train' (TT) format and ask how efficiently a low-rank approximation can be computed from a full tensor on current multi-core CPUs. Compared to sparse and dense linear algebra, there are far fewer well-optimized kernel libraries for multi-linear algebra. Linear algebra libraries such as BLAS and LAPACK may provide the required operations in principle, but often at the cost of additional data movements for rearranging memory layouts. Furthermore, these libraries are typically optimized for the compute-bound case (e.g., square matrix operations), whereas low-rank tensor decompositions lead to memory-bandwidth-limited operations. We propose a 'tensor-train singular value decomposition' (TT-SVD) algorithm based on two building blocks: a 'Q-less tall-skinny QR' factorization, and a fused tall-skinny matrix-matrix multiplication and reshape operation. We analyze the performance of the resulting TT-SVD algorithm using the Roofline performance model. In addition, we present performance results for different algorithmic variants on shared-memory as well as distributed-memory architectures. Our experiments show that commonly used TT-SVD implementations suffer severe performance penalties. We conclude that a dedicated library of tensor factorization kernels would benefit the community: computing a low-rank approximation can be as cheap as reading the data twice from main memory. As a consequence, an implementation that achieves realistic performance will shift the threshold at which one has to resort to randomized methods that only process part of the data.
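For orientation, the sketch below shows the classical sequential-SVD form of TT-SVD (successive reshapes plus truncated SVDs of tall matricizations) that the abstract refers to as the baseline. It is a minimal NumPy illustration under assumed conventions: the function name tt_svd, the max_rank/tol parameters, and the truncation rule are illustrative choices, and it deliberately does not implement the authors' optimized Q-less TSQR-based variant.

```python
import numpy as np

def tt_svd(tensor, max_rank, tol=0.0):
    """Illustrative textbook TT-SVD: decompose a dense tensor into TT cores
    by repeated reshape + truncated SVD (not the paper's TSQR-based kernel)."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    r_prev = 1
    # Start with the tensor unfolded as a (dims[0] x rest) matrix.
    unfolding = tensor.reshape(r_prev * dims[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(unfolding, full_matrices=False)
        # Truncate to the requested rank, dropping relatively tiny singular values.
        r = max(1, min(max_rank, int(np.sum(s > tol * s[0]))))
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        # Carry the remaining factor (diag(s) @ Vt) forward and reshape for the next mode.
        unfolding = (s[:r, None] * vt[:r, :]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(unfolding.reshape(r_prev, dims[-1], 1))
    return cores

# Usage sketch: approximate a random 6-way tensor, then contract the cores
# back together and report the relative approximation error.
X = np.random.rand(4, 4, 4, 4, 4, 4)
cores = tt_svd(X, max_rank=8)
approx = cores[0]
for c in cores[1:]:
    approx = np.tensordot(approx, c, axes=([-1], [0]))
approx = approx.reshape(X.shape)
print(np.linalg.norm(X - approx) / np.linalg.norm(X))
```

The point of the paper is that each step above touches a tall-skinny matricization of the (large) remaining data, so a naive implementation built on general SVD/GEMM calls becomes memory-bandwidth bound and pays for extra reshapes; the proposed Q-less TSQR and fused multiply-reshape kernels avoid much of that data movement.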

Updated: 2021-02-02