• arXiv.cs.MS Pub Date : 2019-06-25
Wayne B. Mitchell; Robert Strzodka; Robert D. Falgout

Algebraic multigrid (AMG) is a widely used scalable solver and preconditioner for large-scale linear systems resulting from the discretization of a wide class of elliptic PDEs. While AMG has optimal computational complexity, the cost of communication has become a significant bottleneck that limits its scalability as processor counts continue to grow on modern machines. This paper examines the design, implementation, and parallel performance of a novel algorithm, Algebraic Multigrid Domain Decomposition (AMG-DD), designed specifically to limit communication. The goal of AMG-DD is to provide a low-communication alternative to standard AMG V-cycles by trading some additional computational overhead for a significant reduction in communication cost. Numerical results show that AMG-DD achieves superior accuracy per communication cost compared to AMG, and speedup over AMG is demonstrated on a large GPU cluster.

Updated: 2020-01-22
• arXiv.cs.MS Pub Date : 2019-04-23
Francesco Torsello

We present $\mathtt{bimEX}$, a Mathematica package for exact computations in 3$+$1 bimetric relativity. It is based on the $\mathtt{xAct}$ bundle, which can handle computations involving both abstract tensors and their components. In this communication, we refer to the latter case as concrete computations. The package consists of two main parts. The first part involves the abstract tensors, and focuses on how to deal with multiple metrics in $\mathtt{xAct}$. The second part takes an ansatz for the primary variables in a chart as the input, and returns the covariant BSSN bimetric equations in components in that chart. Several functions are implemented to make this process as fast and user-friendly as possible. The package has been used and tested extensively in spherical symmetry and was the workhorse in obtaining the bimetric covariant BSSN equations and reproducing the bimetric 3$+$1 equations in the spherical polar chart.

Updated: 2020-01-16
• arXiv.cs.MS Pub Date : 2020-01-08
Pascal Fua; Krzysztof Lis

Python is currently the dominant language in the field of Machine Learning but is often criticized for being slow at certain tasks. In this report, we use the well-known $N$-queens puzzle as a benchmark to show that, once compiled with the Numba compiler, Python becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping. This is true of both sequential and parallel programs. In most cases that arise in an academic environment, it therefore makes sense to develop in ordinary Python, identify computational bottlenecks, and use Numba to remove them.
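
For orientation, the kind of benchmark kernel the abstract refers to can be sketched as a backtracking counter for the $N$-queens puzzle. This is a plain-Python illustration only, not the authors' benchmark code: the function name `count_queens` is hypothetical, and the sketch has not been checked against Numba, so it may need restructuring before a compiler such as Numba could accept it.

```python
def count_queens(n: int) -> int:
    """Count the ways to place n non-attacking queens on an n x n board."""
    cols = [False] * n                 # column already occupied?
    diag_sum = [False] * (2 * n - 1)   # diagonal indexed by row + col
    diag_diff = [False] * (2 * n - 1)  # diagonal indexed by row - col + n - 1

    def place(row: int) -> int:
        # All rows filled: one complete solution found.
        if row == n:
            return 1
        total = 0
        for col in range(n):
            d1, d2 = row + col, row - col + n - 1
            if cols[col] or diag_sum[d1] or diag_diff[d2]:
                continue  # square is attacked, try the next column
            cols[col] = diag_sum[d1] = diag_diff[d2] = True
            total += place(row + 1)
            cols[col] = diag_sum[d1] = diag_diff[d2] = False  # backtrack
        return total

    return place(0)


if __name__ == "__main__":
    for size in (4, 6, 8):
        print(size, count_queens(size))
```

The running time grows very quickly with `n`, which is what makes this loop a natural candidate for the "identify the bottleneck, then compile it" workflow the abstract recommends.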

Updated: 2020-01-09
• arXiv.cs.MS Pub Date : 2020-01-08
Stephen Chou; Fredrik Kjolstad; Saman Amarasinghe

This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) like CSR, DIA, ELL, and many others. We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor's nonzeros in memory. This enables a compiler to emit code that performs complex reorderings (remappings) of nonzeros when converting between formats. We additionally develop a query language that can extract complex statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of a wide range of formats without hard-coding for any specific one. Our evaluation shows that our technique generates sparse tensor conversion routines with performance between 0.99 and 2.2$\times$ that of hand-optimized implementations in two widely used sparse linear algebra libraries, SPARSKIT and Intel MKL. By emitting code that avoids materializing temporaries, our technique also outperforms both libraries by between 1.4 and 3.4$\times$ for CSC/COO to DIA/ELL conversion.
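
As an illustration of the phase decomposition described above, the following hand-written sketch converts a COO matrix to CSR. It is not the code the authors' compiler emits; the function name `coo_to_csr` and the array names `pos` and `crd` are assumptions made for this example. For CSR the coordinate remapping phase is the identity, so only the analysis and assembly phases appear.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (pos, crd, vals)."""
    # Analysis phase: gather the one statistic CSR assembly needs,
    # the number of nonzeros in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1

    # Assembly, step 1: a prefix sum over the counts gives the row pointers.
    pos = [0] * (n_rows + 1)
    for r in range(n_rows):
        pos[r + 1] = pos[r] + counts[r]

    # Assembly, step 2: scatter each nonzero into the next free slot of its row.
    next_free = pos[:-1]          # slicing copies, so pos itself is not modified
    crd = [0] * len(vals)
    out = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        p = next_free[r]
        crd[p] = c
        out[p] = v
        next_free[r] += 1
    return pos, crd, out


if __name__ == "__main__":
    pos, crd, out = coo_to_csr(3, [2, 0, 2, 0], [1, 0, 0, 2], [5.0, 1.0, 4.0, 2.0])
    print(pos, crd, out)
```

Within each row the nonzeros keep their input order here; a format that requires sorted column coordinates would need an extra ordering step.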

Updated: 2020-01-09
• arXiv.cs.MS Pub Date : 2020-01-06
Mantas Mikaitis

We describe various issues caused by the lack of rounding in the gcc compiler implementation of the fixed-point arithmetic data types and operations. We demonstrate that there is no rounding in the conversion of constants, in the conversion from one numerical type to a less precise type, or in the results of multiplications. Furthermore, we show that mixed-precision fixed-point operations lose precision on their arguments even before the arithmetic is carried out. The ISO 18037:2008 standard was created to standardize C language extensions, including fixed-point arithmetic, for embedded systems. Embedded systems are usually based on ARM processors, of which approximately 100 billion have been manufactured to date. The numerical issues shown in this paper can therefore be rather dangerous and are important to address, given the wide range of applications that these embedded systems run.
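
The effect of missing rounding can be illustrated with a small emulation. The sketch below is an assumption-laden model, not gcc's implementation: it uses Python integers to mimic a fixed-point type with 15 fractional bits, treats "no rounding" as simply dropping the low bits, and all function names are hypothetical.

```python
import math

FRAC_BITS = 15            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS    # 32768


def to_fixed_trunc(x: float) -> int:
    """Convert to fixed point by dropping the bits that do not fit."""
    return math.floor(x * SCALE)


def to_fixed_round(x: float) -> int:
    """Convert to fixed point with round-to-nearest."""
    return math.floor(x * SCALE + 0.5)


def mul_trunc(a: int, b: int) -> int:
    """Multiply two fixed-point values and drop the low product bits."""
    return (a * b) >> FRAC_BITS


def mul_round(a: int, b: int) -> int:
    """Multiply two fixed-point values, adding half an ulp before the shift."""
    return (a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


def to_float(a: int) -> float:
    """Interpret a fixed-point integer as a real number."""
    return a / SCALE


if __name__ == "__main__":
    t, r = to_fixed_trunc(0.1), to_fixed_round(0.1)
    print("constant 0.1:", t, r, abs(to_float(t) - 0.1), abs(to_float(r) - 0.1))
    print("0.1 * 0.1   :", mul_trunc(r, r), mul_round(r, r))
```

In this model the truncated constant is off by 0.8 of a unit in the last place while the rounded one is off by 0.2, and the same one-unit discrepancy shows up again in the product.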

Updated: 2020-01-07
• arXiv.cs.MS Pub Date : 2020-01-01
Sheng-Chun Yang; Yong-Lei Wang

The nonequispaced discrete Fourier transform (NDFT) is widely applied across computational science and engineering. Its computational efficiency and accuracy have long been critical issues that hinder broader and more intensive use in scientific computing. In our previous work (2018, S.-C. Yang et al., Appl. Comput. Harmon. Anal. 44, 273), a CUNFFT method was proposed, and it showed outstanding performance in handling NDFT at intermediate scale based on CUDA (Compute Unified Device Architecture) technology. In the current work, we further improve the computational efficiency of the CUNFFT method using an efficient MPI-CUDA hybrid parallelization (HP) scheme of NFFT to achieve a cutting-edge treatment of NDFT at super-extended scale. Within this HP-NFFT method, the spatial domain of NDFT is decomposed into several parts according to the accumulative feature of NDFT and the number of CPU and GPU nodes. These decomposed NDFT subcells are calculated independently on different CPU nodes using an MPI process-level parallelization mode, and on different GPU nodes using a CUDA thread-level parallelization mode and the CUNFFT algorithm. Extensive benchmarking of the HP-NFFT method indicates that it exhibits a dramatic improvement in computational efficiency for handling NDFT at super-extended scale without loss of computational precision. Furthermore, the HP-NFFT method is validated via the calculation of the Madelung constant of the fluorite crystal structure, and is thereafter verified to be robust for the calculation of electrostatic interactions between charged ions in molecular dynamics simulation systems.
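
To make the object of study concrete, the sketch below evaluates an NDFT by direct summation and then repeats the evaluation with the nodes split into independent chunks. It is a minimal illustration under stated assumptions, not the HP-NFFT or CUNFFT algorithm: the sign convention $f_j = \sum_k \hat f_k e^{-2\pi i k x_j}$ with $k$ running from $-N/2$ upward, the chunking scheme, and both function names are choices made for this example.

```python
import cmath


def ndft(fhat, nodes):
    """Directly evaluate f_j = sum_k fhat_k * exp(-2*pi*i*k*x_j) at each node x_j."""
    n = len(fhat)
    k_min = -(n // 2)                     # frequencies k_min, k_min + 1, ...
    out = []
    for x in nodes:
        acc = 0j
        for i, coeff in enumerate(fhat):
            acc += coeff * cmath.exp(-2j * cmath.pi * (k_min + i) * x)
        out.append(acc)
    return out


def ndft_chunked(fhat, nodes, n_chunks):
    """Evaluate the same sums chunk by chunk; each chunk is independent work."""
    size = max(1, -(-len(nodes) // n_chunks))   # ceiling division, at least 1
    out = []
    for start in range(0, len(nodes), size):
        # In a parallel setting each slice could go to a different process.
        out.extend(ndft(fhat, nodes[start:start + size]))
    return out


if __name__ == "__main__":
    fhat = [0.5, 1.0, -0.25, 2.0]
    nodes = [0.0, 0.13, 0.37, 0.5, 0.81]
    print(ndft(fhat, nodes) == ndft_chunked(fhat, nodes, 2))
```

Direct summation costs on the order of (number of nodes) × (number of frequencies) operations, which is the cost that NFFT-type methods are designed to avoid.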

Updated: 2020-01-07
• arXiv.cs.MS Pub Date : 2019-12-28
Ryan Senanayake; Fredrik Kjolstad; Changwan Hong; Shoaib Kamil; Saman Amarasinghe

We address the problem of optimizing mixed sparse and dense tensor algebra in a compiler. We show that standard loop transformations, such as strip-mining, tiling, collapsing, parallelization and vectorization, can be applied to irregular loops over sparse iteration spaces. We also show how these transformations can be applied to the contiguous value arrays of sparse tensor data structures, which we call their position space, to unlock load-balanced tiling and parallelism. We have prototyped these concepts in the open-source TACO system, where they are exposed as a scheduling API similar to the Halide domain-specific language for dense computations. Using this scheduling API, we show how to optimize mixed sparse/dense tensor algebra expressions, how to generate load-balanced code by scheduling sparse tensor algebra in position space, and how to generate sparse tensor algebra GPU code. Our evaluation shows that our transformations let us generate good code that is competitive with many hand-optimized implementations from the literature.
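
The difference between splitting work over rows and splitting it over the contiguous value array (the position space) can be sketched with a sparse matrix-vector product on CSR data. This is a hand-written illustration, not TACO's scheduling API or its generated code; the function names and the chunking scheme are assumptions made for this example.

```python
from bisect import bisect_right


def row_chunk_loads(pos, n_chunks):
    """Nonzeros handled per chunk when rows are split evenly (coordinate space)."""
    n_rows = len(pos) - 1
    size = max(1, -(-n_rows // n_chunks))       # ceiling division
    return [pos[min(start + size, n_rows)] - pos[start]
            for start in range(0, n_rows, size)]


def spmv_position_chunks(pos, crd, vals, x, n_chunks):
    """Compute y = A @ x by splitting the value array into equal-sized chunks."""
    nnz = len(vals)
    y = [0.0] * (len(pos) - 1)
    size = max(1, -(-nnz // n_chunks))
    loads = []
    for start in range(0, nnz, size):
        end = min(start + size, nnz)
        # Locate the row that owns position `start` by searching the row pointers.
        row = bisect_right(pos, start) - 1
        for p in range(start, end):
            while p >= pos[row + 1]:            # crossed into the next non-empty row
                row += 1
            # Chunks may share a row, so a parallel version would need a
            # reduction or atomic update here.
            y[row] += vals[p] * x[crd[p]]
        loads.append(end - start)
    return y, loads


if __name__ == "__main__":
    # One heavy row followed by two light rows.
    pos = [0, 6, 7, 8]
    crd = [0, 1, 2, 3, 4, 5, 0, 1]
    vals = [1.0] * 8
    x = [1.0] * 6
    print("row split loads     :", row_chunk_loads(pos, 2))
    print("position split loads:", spmv_position_chunks(pos, crd, vals, x, 2)[1])
```

With one heavy row, the even row split assigns 7 nonzeros to one chunk and 1 to the other, whereas the position-space split gives each chunk 4, at the price of a search for the starting row and of chunks that may share a row.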

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
