-
Analyzing the impact of CUDA versions on GPU applications Parallel Comput. (IF 1.4) Pub Date : 2024-02-29 Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda
CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, results from CPU compilers suggest that this is not always true. In this paper, we thoroughly evaluate
-
Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture Parallel Comput. (IF 1.4) Pub Date : 2024-02-28 Jianjiang Li, Lin Li, Qingwei Wang, Wei Xue, Jiabi Liang, Jinliang Shi
Large-scale sparse linear equation solvers play an important role in both numerical simulation and artificial intelligence, and the sparse triangular solve is a key step in solving sparse linear systems. Its parallel optimization can effectively improve the efficiency of solving sparse linear systems. In this paper, we design and implement a parallel algorithm for solving sparse
-
Integrating FPGA-based hardware acceleration with relational databases Parallel Comput. (IF 1.4) Pub Date : 2024-02-06 Ke Liu, Haonan Tong, Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang
The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because
-
Fast data-dependence profiling through prior static analysis Parallel Comput. (IF 1.4) Pub Date : 2024-01-11 Mohammad Norouzi, Nicolas Morew, Qamar Ilias, Lukas Rothenberger, Ali Jannesari, Felix Wolf
Data-dependence profiling is a program-analysis technique for detecting parallelism opportunities in sequential programs. It captures data dependences that actually occur during program execution, filtering out parallelism-preventing dependences that purely static methods assume only because they lack critical runtime information, such as the values of pointers and array indices. Profiling, however, suffers
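The runtime idea behind data-dependence profiling can be illustrated with a toy sketch: record which memory locations each statement reads and writes during execution, and report read-after-write dependences. The function name and trace format below are invented for illustration; this is not the authors' profiler.

```python
def profile_dependences(trace):
    """trace: list of (statement_id, reads, writes) tuples observed at runtime.
    Returns the set of (src, dst) pairs where dst reads a location
    last written by src (a read-after-write data dependence)."""
    last_writer = {}
    deps = set()
    for stmt, reads, writes in trace:
        for loc in reads:
            if loc in last_writer and last_writer[loc] != stmt:
                deps.add((last_writer[loc], stmt))
        for loc in writes:
            last_writer[loc] = stmt
    return deps
```

Because the trace reflects the actual addresses touched at runtime, dependences that a static analyzer must conservatively assume (e.g. through aliased pointers) are only reported if they really occur.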
-
A GPU-based hydrodynamic simulator with boid interactions Parallel Comput. (IF 1.4) Pub Date : 2023-12-21 Xi Liu, Gizem Kayar, Ken Perlin
We present a hydrodynamic simulation system using the GPU compute shaders of DirectX for simulating virtual agent behaviors and navigation inside a smoothed particle hydrodynamical (SPH) fluid environment with real-time water mesh surface reconstruction. The current SPH literature includes interactions between SPH and heterogeneous meshes but seldom involves interactions between SPH and virtual boid
-
Program partitioning and deadlock analysis for MPI based on logical clocks Parallel Comput. (IF 1.4) Pub Date : 2023-12-04 Shushan Li, Meng Wang, Hong Zhang, Yao Liu
The message passing interface (MPI) has become a standard for programming models in the field of high performance computing. It is of great importance to ensure the reliability of MPI programs by detecting whether they contain errors. However, as one of the most common errors in MPI programs, deadlock is difficult to detect due to the non-determinism and the asynchronous communication supported
-
Low consumption automatic discovery protocol for DDS-based large-scale distributed parallel computing Parallel Comput. (IF 1.4) Pub Date : 2023-11-09 Zhexu Liu, Shaofeng Liu, Zhiyong Fan, Zhen Zhao
DDS (Data Distribution Service) is an efficient communication specification for distributed parallel computing. However, as the scale of computation expands, high network load and memory consumption consistently limit its performance. This paper proposes a low consumption automatic discovery protocol to improve DDS in large-scale distributed parallel computing. Firstly, an improved Bloom Filter called
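The abstract introduces an improved Bloom Filter; as background, a minimal classic Bloom filter can be sketched as follows. The class name, sizes, and salted-hash scheme are illustrative choices for this sketch, not the paper's variant.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false
    negatives and a tunable false-positive rate."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # bit array stored as one big integer

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Discovery protocols use this structure because membership queries touch a fixed-size bit array instead of a full endpoint table, trading a small false-positive probability for large memory savings.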
-
OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning Parallel Comput. (IF 1.4) Pub Date : 2023-11-09 Yunqi Gao, Zechao Zhang, Bing Hu, A-Long Jin, Chunming Wu
The communication bottleneck has severely restricted the scalability of distributed deep learning. Tensor fusion improves the scalability of data parallelism by overlapping computation and communication tasks. However, existing tensor fusion schemes only result in suboptimal training performance. In this paper, we propose an efficient communication mechanism (OF-WFBP) to find the optimal tensor fusion
-
Targeting performance and user-friendliness: GPU-accelerated finite element computation with automated code generation in FEniCS Parallel Comput. (IF 1.4) Pub Date : 2023-10-06 James D. Trotter, Johannes Langguth, Xing Cai
This paper studies the use of automated code generation to provide user-friendly GPU acceleration for solving partial differential equations (PDEs) with finite element methods. By extending the FEniCS framework and its automated compiler, we enable a high-level description of finite element computations written in the Unified Form Language to be automatically translated into parallelised CUDA C++ code
-
Task graph-based performance analysis of parallel-in-time methods Parallel Comput. (IF 1.4) Pub Date : 2023-09-14 Matthias Bolten, Stephanie Friedhoff, Jens Hahne
In this paper, we present a performance model based on task graphs for various iterative parallel-in-time (PinT) methods. PinT methods have been developed to speed up the simulation time of time-dependent problems using modern parallel supercomputers. The performance model is based on a data-driven notation of the methods, from which a task graph is generated. Based on this task graph and a distribution
-
ESA: An efficient sequence alignment algorithm for biological database search on Sunway TaihuLight Parallel Comput. (IF 1.4) Pub Date : 2023-08-22 Hao Zhang, Zhiyi Huang, Yawen Chen, Jianguo Liang, Xiran Gao
In computational biology, biological database search plays a very important role. Since the COVID-19 outbreak, it has provided significant help in identifying common characteristics of viruses and developing vaccines and drugs. Sequence alignment, a method for finding similarity, homology and other relationships between gene/protein sequences, is the usual tool in database search. With the
-
Finding inputs that trigger floating-point exceptions in heterogeneous computing via Bayesian optimization Parallel Comput. (IF 1.4) Pub Date : 2023-08-02 Ignacio Laguna, Anh Tran, Ganesh Gopalakrishnan
Testing code for floating-point exceptions is crucial as exceptions can quickly propagate and produce unreliable numerical answers. The state-of-the-art to test for floating-point exceptions in heterogeneous systems is quite limited and solutions require the application’s source code, which precludes their use in accelerated libraries where the source is not publicly available. We present an approach
-
Distributed software defined network-based fog to fog collaboration scheme Parallel Comput. (IF 1.4) Pub Date : 2023-07-29 Muhammad Kabeer, Ibrahim Yusuf, Nasir Ahmad Sufi
Fog computing was created to supplement the cloud in bridging the communication delay gap by deploying fog nodes nearer to Internet of Things (IoT) devices. Depending on the geographical location, computational resources and rate of IoT requests, fog nodes can be idle or saturated. The latter requires a special mechanism to enable collaboration with other nodes through service offloading to improve resource
-
An optimal scheduling algorithm considering the transactions worst-case delay for multi-channel hyperledger fabric network Parallel Comput. (IF 1.4) Pub Date : 2023-07-27 Ou Wu, Shanshan Li, He Zhang, Liwen Liu, Haoming Li, Yanze Wang, Ziyi Zhang
As the most popular consortium blockchain platform, Hyperledger Fabric (Fabric for short) has released multiple versions that support different consensus protocols to address the risks faced in current and future network transactions. For example, Fabric v1.4 and v2.0 use Kafka and Raft mechanisms to complete consensus and ensure that the system can withstand failures such as crashes, network partitions
-
A flexible sparse matrix data format and parallel algorithms for the assembly of finite element matrices on shared memory systems Parallel Comput. (IF 1.4) Pub Date : 2023-07-22 Adam Sky, César Polindara, Ingo Muench, Carolin Birk
Finite element methods require the composition of the global stiffness matrix from local finite element contributions. The composition process combines the computation of element stiffness matrices and their assembly into the global stiffness matrix, which is commonly sparse. In this paper we focus on the assembly process of the global stiffness matrix and explore different algorithms and their efficiency
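The composition process described above can be sketched with a scalar-valued toy assembler: each element contributes a small dense matrix that is scattered into the sparse global matrix through its degree-of-freedom map. The dict-of-keys representation and function name are illustrative, not the paper's data format.

```python
from collections import defaultdict

def assemble_global(elements):
    """Assemble a global stiffness matrix (as a dict-of-keys sparse map)
    from local element contributions. `elements` is a list of
    (dof_map, K_local) where K_local[i][j] is accumulated at global
    position (dof_map[i], dof_map[j])."""
    K = defaultdict(float)
    for dof_map, k_local in elements:
        for i, gi in enumerate(dof_map):
            for j, gj in enumerate(dof_map):
                K[(gi, gj)] += k_local[i][j]
    return dict(K)
```

The `+=` accumulation at shared degrees of freedom is exactly the step that needs care (atomics, coloring, or per-thread buffers) when the assembly loop is parallelized on shared memory.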
-
New YARN sharing GPU based on graphics memory granularity scheduling Parallel Comput. (IF 1.4) Pub Date : 2023-07-20 Jinliang Shi, Dewu Chen, Jiabi Liang, Lin Li, Yue Lin, Jianjiang Li
As one of the most widely used cluster scheduling frameworks, Hadoop YARN historically supported only CPU and memory scheduling. Furthermore, due to the widespread use of AI, the demand for GPUs is also increasing. So Hadoop YARN v3.0 adds GPU scheduling, but the granularity is still the whole card rather than finer-grained graphics memory scheduling. However, during daily training, although the graphics
-
Editorial on Advances in High Performance Programming Parallel Comput. (IF 1.4) Pub Date : 2023-07-14
Abstract not available
-
Optimizing massively parallel sparse matrix computing on ARM many-core processor Parallel Comput. (IF 1.4) Pub Date : 2023-06-26 Jiang Zheng, Jiazhi Jiang, Jiangsu Du, Dan Huang, Yutong Lu
Sparse matrix multiplication is ubiquitous in many applications such as graph processing and numerical simulation. In recent years, numerous efficient sparse matrix multiplication algorithms and computational libraries have been proposed. However, most of them are oriented to x86 or GPU platforms, while the optimization on ARM many-core platforms has not been well investigated. Our experiments show
-
Parallelizable efficient large order multiple recursive generators Parallel Comput. (IF 1.4) Pub Date : 2023-06-26 Lih-Yuan Deng, Bryan R. Winter, Jyh-Jen Horng Shiau, Henry Horng-Shing Lu, Nirman Kumar, Ching-Chi Yang
The general multiple recursive generator (MRG) of maximum period has been thought of as an excellent source of pseudo random numbers. Defined by a kth-order linear recurrence modulo p, this generator produces the next pseudo random number as a linear combination of the previous k numbers. General maximum period MRGs of order k have excellent empirical performance, and their strong mathematical
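The recurrence x_i = (a_1·x_{i-1} + … + a_k·x_{i-k}) mod p can be sketched directly; the order, coefficients, and modulus below are tiny illustrative values, not the large-order, large-modulus parameters the paper studies.

```python
def mrg_stream(coeffs, seed, p, n):
    """Generate n values from a k-th order multiple recursive generator:
    x_i = (a_1*x_{i-1} + ... + a_k*x_{i-k}) mod p.
    coeffs = [a_1, ..., a_k]; seed = the initial k state values."""
    state = list(seed)  # most recent value last
    out = []
    for _ in range(n):
        # Pair a_1 with x_{i-1}, a_2 with x_{i-2}, ...
        x = sum(a * s for a, s in zip(coeffs, reversed(state))) % p
        out.append(x)
        state = state[1:] + [x]
    return out
```

With coefficients (1, 1) this degenerates to a Fibonacci sequence mod p, which makes the recurrence easy to check by hand.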
-
Using heterogeneous GPU nodes with a Cabana-based implementation of MPCD Parallel Comput. (IF 1.4) Pub Date : 2023-06-15 Rene Halver, Christoph Junghans, Godehard Sutmann
The Kokkos based library Cabana, which has been developed in the Co-design Center for Particle Applications (CoPA), is used for the implementation of Multi-Particle Collision Dynamics (MPCD), a particle-based description of hydrodynamic interactions. Cabana allows for a function portable implementation, which has been used to study the interplay between CPU and GPU usage on a multi-node system as well
-
Adaptively parallel runtime verification based on distributed network for temporal properties Parallel Comput. (IF 1.4) Pub Date : 2023-06-14 Bin Yu, Xu Lu, Cong Tian, Meng Wang, Chu Chen, Ming Lei, Zhenhua Duan
Runtime verification is a lightweight verification technique that verifies whether a monitored program execution satisfies a desired property. Online runtime verification faces challenges regarding efficiency and property expressiveness, which limit its widespread adoption, yet there is a lack of research that addresses both of these issues. Building on a distributed network, we propose
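To make the "monitored execution satisfies a property" idea concrete, here is a deliberately tiny monitor for one temporal property, "every trigger is eventually followed by a response", over a finite trace. The paper targets far richer temporal properties and distributed, parallel monitors; this sketch only shows the basic shape of online checking.

```python
def monitor(trace, trigger, response):
    """Tiny runtime-verification monitor: returns True iff every `trigger`
    event in the trace is eventually matched by a later `response`."""
    pending = 0
    for event in trace:
        if event == trigger:
            pending += 1
        elif event == response and pending:
            pending -= 1
    return pending == 0
```

A real online monitor consumes events as they happen and must bound its own overhead, which is exactly where efficiency and expressiveness pull in opposite directions.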
-
Big data BPMN workflow resource optimization in the cloud Parallel Comput. (IF 1.4) Pub Date : 2023-06-02 Srđan Daniel Simić, Nikola Tanković, Darko Etinger
Cloud computing is one of the critical technologies that meet the demand of various businesses for the high-capacity computational processing power needed to gain knowledge from their ever-growing business data. When utilizing cloud computing resources to deal with Big Data processing, companies face the challenge of determining the optimal use of resources within their business processes. The miscalculation
-
A lightweight semi-centralized strategy for the massive parallelization of branching algorithms Parallel Comput. (IF 1.4) Pub Date : 2023-04-29 Andres Pastrana-Cruz, Manuel Lafond
Several NP-hard problems are solved exactly using exponential-time branching strategies, whether it be branch-and-bound algorithms, or bounded search trees in fixed-parameter algorithms. The number of tractable instances that can be handled by sequential algorithms is usually small, whereas massive parallelization has been shown to significantly increase the space of instances that can be solved exactly
-
Segment based power-efficient scheduling for real-time DAG tasks on edge devices Parallel Comput. (IF 1.4) Pub Date : 2023-04-14 Lei Yu, Tianqi Zhong, Peng Bi, Lan Wang, Fei Teng
Smart Mobile Devices (SMDs) are crucial for the edge computing paradigm’s real-world sensing, which is typically replicated by real-time applications that are computationally intensive and periodic with strict time constraints. Such applications call for increased processing speed, memory capacity, and battery life on SMDs, which are typically resource-constrained due to physical
-
A survey of software techniques to emulate heterogeneous memory systems in high-performance computing Parallel Comput. (IF 1.4) Pub Date : 2023-04-18 Clément Foyer, Brice Goglin, Andrès Rubio Proaño
Heterogeneous memory will be involved in several upcoming platforms on the way to exascale. Combining technologies such as HBM, DRAM and/or NVDIMM makes it possible to tackle the needs of different applications in terms of bandwidth, latency or capacity, and new memory interconnects such as CXL bring easy ways to attach these technologies to the processors. High-performance computing developers must prepare their
-
Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures Parallel Comput. (IF 1.4) Pub Date : 2023-04-14 Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson
Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on
-
Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clusters Parallel Comput. (IF 1.4) Pub Date : 2023-04-06 Lukas Reitz, Kai Hardenbicker, Tobias Werner, Claudia Fohry
A popular approach to program scalable irregular applications is Asynchronous Many-Task (AMT) Programming. Here, programs define tasks according to task models such as dynamic independent tasks (DIT) or nested fork-join (NFJ). We consider cluster AMTs, in which a runtime system maps the tasks to worker threads in multiple processes. Thereby, dynamic load balancing can be achieved via cooperative work
-
GPU acceleration of Levenshtein distance computation between long strings Parallel Comput. (IF 1.4) Pub Date : 2023-04-03 David Castells-Rufas
Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance
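For reference, the classic dynamic-programming edit distance — the O(n·m) baseline whose quadratic-in-length cost motivates WFA-style algorithms — fits in a few lines. This is a minimal CPU sketch, not the paper's GPU implementation.

```python
def levenshtein(a, b):
    """Classic DP edit distance: O(len(a)*len(b)) time, O(len(b)) space.
    WFA-style algorithms are instead quadratic in the edit distance itself,
    which is far cheaper for long, similar strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]
```

For two 100-megabase strings this table has 10^16 cells, which is why reducing the complexity to a function of the (often small) edit distance matters.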
-
Efficient checkpoint/Restart of CUDA applications Parallel Comput. (IF 1.4) Pub Date : 2023-03-09 Akira Nukada, Taichiro Suzuki, Satoshi Matsuoka
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP; it employs a proxy process, and applications access GPU devices via the proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communications, NVCR efficiently uses
-
A heterogeneous processing-in-memory approach to accelerate quantum chemistry simulation Parallel Comput. (IF 1.4) Pub Date : 2023-03-01 Zeshi Liu, Zhen Xie, Wenqian Dong, Mengting Yuan, Haihang You, Dong Li
The “memory wall” is an architectural property introducing high memory access latency that can limit application performance, and this wall becomes even taller in the context of big data. Although the use of GPU-based systems can achieve high performance, it is difficult to improve the utilization of GPU systems due to the “memory wall”. The intensive data exchange and computation remain a challenge
-
NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers Parallel Comput. (IF 1.4) Pub Date : 2023-02-23 Marek Palkowski, Wlodzimierz Bielecki
The paper presents a benchmark suite of ten non-serial polyadic dynamic programming (NPDP) kernels, which are designed to test the efficiency of tiled code generated by polyhedral optimization compilers. These kernels are mainly derived from bioinformatics algorithms, which pose a significant challenge for automatic loop nest tiling transformations. The paper describes algorithms implemented with examined
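A canonical NPDP kernel of the kind such suites draw from bioinformatics is Nussinov RNA base-pair maximization, shown below as a plain Python sketch (this is an illustrative example of the kernel class, not code from the benchmark suite). The inner k-loop, which combines two previously computed sub-results, is the "non-serial polyadic" dependence pattern that makes loop tiling hard.

```python
def nussinov(seq, pair=frozenset({("A", "U"), ("U", "A"),
                                  ("G", "C"), ("C", "G")})):
    """Maximum number of non-crossing complementary base pairs in seq."""
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j],                 # i unpaired
                       N[i][j - 1],                 # j unpaired
                       N[i + 1][j - 1] + ((seq[i], seq[j]) in pair))
            for k in range(i + 1, j):               # non-serial polyadic term:
                best = max(best, N[i][k] + N[k + 1][j])  # combine two cells
            best and None  # (no-op; keeps best as computed)
            N[i][j] = best
    return N[0][n - 1]
```

Because N[i][j] depends on a whole row segment and column segment rather than a fixed stencil, polyhedral compilers need non-trivial tiling transformations to parallelize it well.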
-
A parallel non-convex approximation framework for risk parity portfolio design Parallel Comput. (IF 1.4) Pub Date : 2023-02-01 Yidong Chen, Chen Li, Yonghong Hu, Zhonghua Lu
In this paper, we propose a parallel non-convex approximation framework (NCAQ) for optimization problems whose objective is to minimize a convex function plus the sum of non-convex functions. Based on the structure of the objective function, our framework transforms the non-convex constraints to the logarithmic barrier function and approximates the non-convex problem by a parallel quadratic approximation
-
Efficient parallel reduction of bandwidth for symmetric matrices Parallel Comput. (IF 1.4) Pub Date : 2023-01-21 Valeriy Manin, Bruno Lang
Bandwidth reduction can be a first step in the computation of eigenvalues and eigenvectors for a wide-banded complex Hermitian (or real symmetric) matrix. We present algorithms for this reduction and the corresponding back-transformation of the eigenvectors. These algorithms rely on blocked Householder transformations, thus enabling level 3 BLAS performance, and they feature two levels of parallelism
-
Heterogeneous sparse matrix–vector multiplication via compressed sparse row format Parallel Comput. (IF 1.4) Pub Date : 2023-01-20 Phillip Allen Lane, Joshua Dennis Booth
Sparse matrix–vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from poor performance on many devices. Because of this, SpMV normally requires special care to store and tune for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs
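The compressed sparse row (CSR) format named in the title stores only the nonzeros (`values`), their column indices (`col_idx`), and per-row offsets (`row_ptr`); SpMV then reduces to one dot product per row. A minimal reference sketch:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR format: row r's nonzeros occupy
    positions row_ptr[r] .. row_ptr[r+1]-1 of values/col_idx."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The irregular, data-dependent access `x[col_idx[k]]` is the reason SpMV performs poorly on many devices and needs per-device tuning.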
-
ParVoro++: A scalable parallel algorithm for constructing 3D Voronoi tessellations based on kd-tree decomposition Parallel Comput. (IF 1.4) Pub Date : 2023-01-18 Guoqing Wu, Hongyun Tian, Guo Lu, Wei Wang
The Voronoi tessellation is a fundamental geometric data structure which has numerous applications in various scientific and technological fields. For large particle datasets, computing Voronoi tessellations must be conducted in parallel on a distributed-memory supercomputer in order to satisfy time and memory-size constraints. However, due to load balance and communication, the parallelization of
-
Multi-level parallel multi-layer block reproducible summation algorithm Parallel Comput. (IF 1.4) Pub Date : 2023-01-18 Kuan Li, Kang He, Stef Graillat, Hao Jiang, Tongxiang Gu, Jie Liu
Reproducibility means getting bitwise-identical floating point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). However, in parallel computing environments, the combination of dynamic scheduling of parallel computing resources and floating point non-associativity leads to non-reproducible
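The non-associativity at the root of the problem is easy to demonstrate: the same three summands produce different results depending on reduction order, so any scheduler-dependent summation order breaks bitwise reproducibility.

```python
# Floating-point addition is not associative, so the reduction order
# chosen by a dynamic scheduler changes the bitwise result.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # cancellation first: 0.0 + 1.0 == 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 first, then cancels: 0.0
```

Reproducible summation algorithms make the result independent of this order, e.g. by accumulating into fixed-point-like bins before rounding once.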
-
Uphill resampling for particle filter and its implementation on graphics processing unit Parallel Comput. (IF 1.4) Pub Date : 2023-01-06 Özcan Dülger, Halit Oğuztüzün, Mübeccel Demirekler
We introduce a new resampling method, named Uphill, that is free from numerical instability and suitable for parallel implementation on graphics processing unit (GPU). Common resampling algorithms such as Systematic suffer from numerical instability when single precision floating point numbers are used. This is due to cumulative summation over the weights of particles when the weights differ widely
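The instability mentioned above can be reproduced without a particle filter: emulate single precision with `struct`, and the running cumulative sum over the weights stops changing once it dwarfs the remaining weights, so the cumulative distribution is no longer strictly increasing. The weight values here are illustrative.

```python
import struct

def f32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Cumulative sum of particle weights in single precision: after the large
# weight, each tiny weight is absorbed and the cumsum plateaus.
weights = [1.0] + [1e-8] * 4
cumsum, cdf = 0.0, []
for w in weights:
    cumsum = f32(cumsum + w)
    cdf.append(cumsum)
```

Systematic resampling inverts this cumulative distribution, so a plateau means several particles become indistinguishable; a resampling scheme that avoids the global cumulative sum sidesteps the problem.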
-
Accelerating the scheduling of the network resources of the next-generation optical data centers Parallel Comput. (IF 1.4) Pub Date : 2022-12-07 G. Patronas, N. Vlassopoulos, Ph. Bellos, D. Reisis
Data centers (DCs) play a key role in the evolving IT applications and they rely heavily on the optical interconnects to improve their performance and scalability. Optically switched DCs most often exploit the slotted Time Division Multiplexing Access (TDMA) operation and the Wavelength Division Multiplexing (WDM) technology and rely on the effective scheduling of the TDMA frames to decide in real
-
Spatial-aware data partition for distributed memory parallelization of ANN search in multimedia retrieval Parallel Comput. (IF 1.4) Pub Date : 2022-11-24 Guilherme Andrade, Renato Ferreira, George Teodoro
Content-based multimedia retrieval (CBMR) applications are becoming very popular in several online services that handle large volumes of data and are subject to high query rates. While these applications may be complex, finding the nearest neighboring objects (multimedia descriptors) is typically their most time-consuming operation. In order to address this problem, several recent works have proposed
-
Efficient parallel branch-and-bound approaches for exact graph edit distance problem Parallel Comput. (IF 1.4) Pub Date : 2022-11-03 Adel Dabah, Ibrahim Chegrane, Saïd Yahiaoui, Ahcene Bendjoudi, Nadia Nouali-Taboudjemat
Graph Edit Distance (GED) is a well-known measure used in graph matching to quantify the similarity/dissimilarity between two graphs by computing the minimum cost of the edit operations needed to transform one graph into another. This process, which appears simple, is known to be NP-hard and time-consuming, since the search space grows exponentially. One way to optimally solve this problem is
-
Graph optimization algorithm using symmetry and host bias for low-latency indirect network Parallel Comput. (IF 1.4) Pub Date : 2022-10-19 Masahiro Nakao, Masaki Tsukamoto, Yoshiko Hanada, Keiji Yamamoto
It is known that an indirect network with a small host-to-host Average Shortest Path Length (h-ASPL) improves overall system performance in a parallel computer system. As a means to discuss such indirect networks in graph theory, the Order/Radix Problem (ORP) has been proposed. ORP involves finding a graph with a minimum h-ASPL that satisfies a given number of hosts and radix. A graph in ORP represents
-
NekRS, a GPU-accelerated spectral element Navier–Stokes solver Parallel Comput. (IF 1.4) Pub Date : 2022-10-18 Paul Fischer, Stefan Kerkemeier, Misun Min, Yu-Hsiang Lan, Malachi Phillips, Thilina Rathnayake, Elia Merzari, Ananias Tomboulides, Ali Karakus, Noel Chalmers, Tim Warburton
The development of NekRS, a GPU-oriented thermal-fluids simulation code based on the spectral element method (SEM), is described. For performance portability, the code is based on the Open Concurrent Compute Abstraction (OCCA) and leverages scalable developments in the SEM code Nek5000 and in libParanumal, which is a library of high-performance kernels for high-order discretizations and PDE-based miniapps
-
SGPM: A coroutine framework for transaction processing Parallel Comput. (IF 1.4) Pub Date : 2022-10-06 Xinyuan Wang, Hejiao Huang
Coroutines can increase program concurrency and processor core utilization. However, for adapting the coroutine-to-transaction model, existing coroutine packages have the following disadvantages: (1) Additional scheduler threads incur synchronization overhead when the load between scheduler threads and worker threads is unbalanced. (2) Coroutines are swapped out periodically to prevent deadlocks
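The coroutine-to-transaction model can be sketched with Python generators: each transaction yields at every point where it would otherwise block, and a single worker loop switches between transactions with no extra scheduler thread. This is an illustrative sketch of the general idea, not the SGPM framework.

```python
def transaction(name, steps):
    """A transaction as a coroutine: yields at each would-block point so
    the worker can resume another transaction instead of idling."""
    for step in steps:
        yield f"{name}:{step}"

def run_round_robin(txns):
    """Cooperative scheduler on one worker: no scheduler threads, hence
    no cross-thread synchronization for handing off work."""
    log, queue = [], list(txns)
    while queue:
        t = queue.pop(0)
        try:
            log.append(next(t))   # resume until the next yield point
            queue.append(t)       # still running: requeue
        except StopIteration:
            pass                  # transaction committed
    return log

log = run_round_robin([transaction("T1", ["read", "write"]),
                       transaction("T2", ["read"])])
```

Because switching is cooperative and in-process, the cost per switch is a function call rather than a thread context switch plus lock traffic.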
-
Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA Parallel Comput. (IF 1.4) Pub Date : 2022-09-23 Lukas Spies, Amanda Bienz, David Moulton, Luke Olson, Andrew Reisner
Exchanging halo data is a common task in modern scientific computing applications, and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports not only simple CPU-only systems but also more complex heterogeneous systems with both
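What a halo exchange actually moves can be shown with 1D subdomains: each domain's ghost cells receive the neighboring domain's boundary interior values. The plain in-process copies below stand in for the MPI/OpenCL/CUDA transfers a library like Tausch performs; the function and layout are illustrative.

```python
def halo_exchange(domains, halo=1):
    """One halo-exchange step over a row of 1D subdomains. Each list has
    `halo` ghost cells at both ends; ghosts are overwritten with the
    adjacent neighbor's interior boundary values."""
    for i, d in enumerate(domains):
        if i > 0:                      # left ghosts <- left neighbor's interior
            d[:halo] = domains[i - 1][-2 * halo:-halo]
        if i < len(domains) - 1:       # right ghosts <- right neighbor's interior
            d[-halo:] = domains[i + 1][halo:2 * halo]
    return domains

# Two subdomains, one ghost cell per side (0 marks uninitialized ghosts):
domains = halo_exchange([[0, 1, 2, 0], [0, 3, 4, 0]])
```

Only ghost cells are written and only interior cells are read, so the in-place update order does not matter — the same property that lets real halo exchanges overlap with interior computation.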
-
A method for efficient radio astronomical data gridding on multi-core vector processor Parallel Comput. (IF 1.4) Pub Date : 2022-08-30 Hao Wang, Ce Yu, Jian Xiao, Shanjiang Tang, Yu Lu, Hao Fu, Bo Kang, Gang Zheng, Chenzhou Cui
Gridding is the performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create the correct sky images for further analysis. Like the 2D stencil computation, gridding iteratively updates the output cells by convolution, where the value at each output cell in the space is computed as a weighted sum of neighboring point values. Existing state-of-the-art
-
Fast calculation of isostatic compensation correction using the GPU-parallel prism method Parallel Comput. (IF 1.4) Pub Date : 2022-08-11 Yan Huang, Qingbin Wang, Minghao Lv, Xingguang Song, Jinkai Feng, Xuli Tan, Ziyan Huang, Chuyuan Zhou
Isostatic compensation is a crucial component of crustal structure analysis and geoid calculations in cases of gravity reduction. However, large-scale and high-precision calculations are limited by the inefficiencies of the strict prism method and the low accuracy of the approximate calculation formula. In this study, we propose a new method of terrain grid re-encoding and an eight-component strict
-
Accelerating communication for parallel programming models on GPU systems Parallel Comput. (IF 1.4) Pub Date : 2022-08-04 Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, Laxmikant V. Kale
As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable
-
parGeMSLR: A parallel multilevel Schur complement low-rank preconditioning and solution package for general sparse matrices Parallel Comput. (IF 1.4) Pub Date : 2022-07-25 Tianshi Xu, Vassilis Kalantzis, Ruipeng Li, Yuanzhe Xi, Geoffrey Dillon, Yousef Saad
This paper discusses parGeMSLR, a C++/MPI software library for the solution of sparse systems of linear algebraic equations via preconditioned Krylov subspace methods in distributed-memory computing environments. The preconditioner implemented in parGeMSLR is based on algebraic domain decomposition and partitions the symmetrized adjacency graph recursively into several non-overlapping partitions via
-
Optimizing small channel 3D convolution on GPU with tensor core Parallel Comput. (IF 1.4) Pub Date : 2022-07-22 Jiazhi Jiang, Dan Huang, Jiangsu Du, Yutong Lu, Xiangke Liao
In many scenarios, particularly scientific AI applications, algorithm engineers widely adopt more complex convolutions, e.g. 3D CNNs, to improve accuracy. Scientific AI applications with 3D CNNs, which tend to train on volumetric datasets, substantially increase the size of the input, which in turn potentially restricts the channel sizes (e.g. less than 64) under the constraints of limited device
-
SVM-SMO-SGD: A hybrid-parallel support vector machine algorithm using sequential minimal optimization with stochastic gradient descent Parallel Comput. (IF 1.4) Pub Date : 2022-07-16 Gizen Mutlu, Çiğdem İnan Acı
The Support Vector Machine (SVM) method is one of the most popular machine learning algorithms, as it gives high accuracy. However, like most machine learning algorithms, the resource consumption of the SVM algorithm in terms of time and memory increases linearly as the dataset grows. In this study, a parallel-hybrid algorithm that combines SVM, Sequential Minimal Optimization (SMO) with Stochastic Gradient
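The SGD half of such a hybrid can be sketched as a linear SVM trained on the hinge loss (SMO, which solves the dual in pairs of variables, is not reproduced here). The hyperparameters and the toy 1D dataset are illustrative.

```python
import random

def svm_sgd(data, labels, epochs=20, lr=0.1, lam=0.01, seed=0):
    """Linear SVM via stochastic gradient descent on the regularized
    hinge loss: min lam/2*|w|^2 + mean(max(0, 1 - y*(w.x + b)))."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    b = 0.0
    for _ in range(epochs):
        for i in rng.sample(range(len(data)), len(data)):  # shuffled pass
            x, y = data[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point violates the margin: hinge gradient
                w = [wj - lr * (lam * wj - y * xj) for wj, xj in zip(w, x)]
                b += lr * y
            else:           # only the regularizer contributes
                w = [wj - lr * lam * wj for wj in w]
    return w, b

# Toy linearly separable data: class +1 at x > 0, class -1 at x < 0.
w, b = svm_sgd([[2.0], [3.0], [-2.0], [-3.0]], [1, 1, -1, -1])
```

Because each SGD step touches one sample, independent mini-batches parallelize naturally, which is what makes the hybrid amenable to the parallel implementation the paper describes.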
-
QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU Parallel Comput. (IF 1.4) Pub Date : 2022-07-21 Qingxiao Sun, Liu Yi, Hailong Yang, Mingzhen Li, Zhongzhi Luan, Depei Qian
Although GPUs have been indispensable in data centers, meeting the Quality of Service (QoS) under task consolidation on GPU is extremely challenging. Previous works mostly rely on the static task or resource scheduling and cannot handle the QoS violation during runtime. In addition, existing works fail to exploit the computing characteristics of batch tasks, and thus waste the opportunities to reduce
-
Routing brain traffic through the von Neumann bottleneck: Efficient cache usage in spiking neural network simulation code on general purpose computers Parallel Comput. (IF 1.4) Pub Date : 2022-07-13 J. Pronold, J. Jordan, B.J.N. Wylie, I. Kitayama, M. Diesmann, S. Kunkel
Simulation is a third pillar next to experiment and theory in the study of complex dynamic systems such as biological neural networks. Contemporary brain-scale networks correspond to directed random graphs of a few million nodes, each with an in-degree and out-degree of several thousands of edges, where nodes and edges correspond to the fundamental biological units, neurons and synapses, respectively
-
Operational Data Analytics in practice: Experiences from design to deployment in production HPC environments Parallel Comput. (IF 1.4) Pub Date : 2022-07-04 Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz
As HPC systems continue to grow in scale and complexity, efficient and manageable operation is increasingly critical. For this reason, many centers are starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from the massive amounts of data produced by monitoring systems and use it for enacting control over system knobs, or for aiding administrators through
-
Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores Parallel Comput. (IF 1.4) Pub Date : 2022-06-26 Lena Oden, Jörg Keller
We investigate cryptanalytic applications comprised of many independent tasks that exhibit a stochastic runtime distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that for four different distributions, multiple problem sizes, and three platforms the best strategy varies. We support our analytic results by extensive