-
A new scalable distributed k-means algorithm based on Cloud micro-services for High-performance computing Parallel Comput. (IF 1.119) Pub Date : 2020-12-15 Fatéma Zahra Benchara; Mohamed Youssfi
This paper proposes a distributed clustering method for high-performance computing (HPC) models and its application to medical image processing. Communication cost is one of the major challenges limiting the scalability of parallel and distributed computing models. Indeed, it significantly reduces the performance of the HPC systems on which these models are implemented.
-
Multiscale modeling and cinematic visualization of photosynthetic energy conversion processes from electronic to cell scales Parallel Comput. (IF 1.119) Pub Date : 2020-12-15 Melih Sener; Stuart Levy; John E. Stone; AJ Christensen; Barry Isralewitz; Robert Patterson; Kalina Borkiewicz; Jeffrey Carpenter; C. Neil Hunter; Zaida Luthey-Schulten; Donna Cox
Conversion of sunlight into chemical energy, namely photosynthesis, is the primary energy source of life on Earth. A visualization depicting this process, based on multiscale computational models from electronic to cell scales, is presented in the form of an excerpt from the fulldome show Birth of Planet Earth. This accessible visual narrative shows a lay audience, including children, how the energy
-
Parallel branch and bound algorithm for solving integer linear programming models derived from behavioral synthesis Parallel Comput. (IF 1.119) Pub Date : 2020-11-13 Mohammad K Fallah; Mahmood Fazlali
The Integer Linear Programming (ILP) formulation of behavioral synthesis allows hardware designers to implement efficient circuits considering resource and timing constraints. However, finding the optimal solution of ILP models is an NP-hard problem and remains a computational challenge. In this paper, we address this challenge by developing two exact parallel branch and bound algorithms which are capable
-
HBPFP-DC: A parallel frequent itemset mining using Spark Parallel Comput. (IF 1.119) Pub Date : 2020-11-30 Yaling Xun; Jifu Zhang; Haifeng Yang; Xiao Qin
Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. For big data applications, parallel and distributed solutions are widely studied. However, frequent itemset mining is a continuously iterative process. As an in-memory parallel execution model in which all data will be loaded into memory, Spark is especially
-
Parallelization of network motif discovery using star contraction Parallel Comput. (IF 1.119) Pub Date : 2020-11-21 Esra Ruzgar Ateskan; Kayhan Erciyes; Mehmet Emin Dalkilic
Network motifs are widely used to uncover structural design principles of complex networks. Current sequential network motif discovery algorithms become inefficient as motif size grows, so parallelization methods have been proposed in the literature. In this study, we use the star contraction algorithm to partition complex networks efficiently for parallel discovery of network motifs. We propose two
-
A thread-adaptive sparse approximate inverse preconditioning algorithm on multi-GPUs Parallel Comput. (IF 1.119) Pub Date : 2020-11-19 Jiaquan Gao; Qi Chen; Guixia He
In this study, we present an efficient thread-adaptive sparse approximate inverse preconditioning algorithm on multiple GPUs, called GSPAI-Adaptive. GSPAI-Adaptive offers the following novelties: (1) a thread-adaptive allocation strategy is presented for each column of the preconditioner, and (2) a parallel framework for constructing the sparse approximate inverse preconditioner
-
Asynchronous parallel stochastic Quasi-Newton methods Parallel Comput. (IF 1.119) Pub Date : 2020-11-04 Qianqian Tong; Guannan Liang; Xingyu Cai; Chunjiang Zhu; Jinbo Bi
Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force for scaling up machine learning models, such as deep neural nets, second-order quasi-Newton methods are starting to draw attention due to their effectiveness in dealing with ill-conditioned optimization problems. The L-BFGS method is one of the most widely used quasi-Newton methods. We propose an asynchronous
-
Improved probabilistic I/O scheduling for limited-size Burst-Buffers deployed HPC Parallel Comput. (IF 1.119) Pub Date : 2020-10-25 Benbo Zha; Hong Shen
The I/O bottleneck is a critical problem in current High Performance Computing (HPC) systems that hinders the performance scalability of a system. Some techniques, such as I/O scheduling and Burst-Buffering, have been proposed to accelerate data exchange between the compute and storage components on HPC platforms. Probabilistic I/O scheduling, a Markov-chain-based hybrid method combining the above-mentioned
-
CCF: An efficient SpMV storage format for AVX512 platforms Parallel Comput. (IF 1.119) Pub Date : 2020-10-21 Mohammad Almasri; Walid Abu-Sufah
We present a sparse matrix-vector multiplication (SpMV) kernel that uses a novel sparse matrix storage format and delivers superior performance for unstructured matrices on Intel x86 processors. Our kernel exploits the properties of our storage format to enhance load balancing, SIMD efficiency, and data locality. We evaluate the performance of our kernel on a dual 24-core Skylake Xeon Platinum 8160
-
Scalable line and plane relaxation in a parallel structured multigrid solver Parallel Comput. (IF 1.119) Pub Date : 2020-10-20 Andrew Reisner; Markus Berndt; J. David Moulton; Luke N. Olson
The efficient solution of sparse linear systems that arise through the discretization of partial differential equations remains a key challenge for a range of high performance scientific simulations. One approach to reducing data movement and improving performance is to expose and exploit structure in a problem through the use of robust structured multilevel solvers. By choosing coarsening that
-
Robust parallel eigenvector computation for the non-symmetric eigenvalue problem Parallel Comput. (IF 1.119) Pub Date : 2020-10-20 Angelika Schwarz; Carl Christian Kjelgaard Mikkelsen; Lars Karlsson
A standard approach for computing eigenvectors of a non-symmetric matrix reduced to real Schur form relies on a variant of backward substitution. Backward substitution is prone to overflow. To avoid overflow, the LAPACK eigenvector routine DTREVC3 associates every eigenvector with a scaling factor and dynamically rescales an entire eigenvector during the backward substitution such that overflow cannot
-
Exploring GPU acceleration of Deep Neural Networks using Block Circulant Matrices Parallel Comput. (IF 1.119) Pub Date : 2020-10-16 Shi Dong; Pu Zhao; Xue Lin; David Kaeli
Training a Deep Neural Network (DNN) is a significant computing task since it places high demands on computing resources and memory bandwidth. Many approaches have been proposed to compress the network, while maintaining high model accuracy, reducing the computational demands associated with large-scale DNN training. One attractive approach is to leverage Block Circulant Matrices (BCM), compressing
-
A parallel strategy for density functional theory computations on accelerated nodes Parallel Comput. (IF 1.119) Pub Date : 2020-10-15 Massimiliano Lupo Pasini; Bruno Turcksin; Wenjun Ge; Jean-Luc Fattebert
Using the Löwdin orthonormalization of tall-skinny matrices as a proxy-app for wavefunction-based Density Functional Theory solvers, we investigate a distributed memory parallel strategy focusing on Graphics Processing Unit (GPU)-accelerated nodes as available on some of the top ranked supercomputers at the present time. We present numerical results in the strong limit regime, as it is particularly
-
ImRP: A Predictive Partition Method for Data Skew Alleviation in Spark Streaming Environment Parallel Comput. (IF 1.119) Pub Date : 2020-10-02 Zhongming Fu; Zhuo Tang; Li Yang; Kenli Li; Keqin Li
Spark Streaming is an extension of the core Spark engine that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It treats a stream as a series of deterministic batches and handles them as regular jobs. However, for a stream job responsible for a batch, data skew (i.e., the imbalance in the amount of data allocated to each reduce task) can degrade the job performance
-
ThermoBench: A thermal efficiency benchmark for clusters in data centers Parallel Comput. (IF 1.119) Pub Date : 2020-08-03 Yi Zhou; Yuanqi Chen; Shubbhi Taneja; Ajit Chavan; Xiao Qin; Jifu Zhang
The energy efficiency of a data center depends on the cooling cost of clusters in the data center. Enhancing thermal efficiency of clusters is a practical approach to reducing energy consumption cost, optimizing scalability, and improving reliability. In this paper, we propose ThermoBench to evaluate the thermal efficiency of computing and storage clusters deployed in data centers. We shed light on
-
Delaunay triangulation of large-scale datasets using two-level parallelism Parallel Comput. (IF 1.119) Pub Date : 2020-07-29 Cuong M. Nguyen; Philip J. Rhodes
Because of the importance of Delaunay Triangulation in science and engineering, researchers have devoted extensive attention to parallelizing this fundamental algorithm. However, generating unstructured meshes for extremely large point sets remains a barrier for scientists working with large scale or high resolution datasets. In our previous paper, we introduced a novel algorithm – Triangulation of
-
Performance optimization of non-equilibrium ionization simulations from MapReduce and GPU acceleration Parallel Comput. (IF 1.119) Pub Date : 2020-08-12 Jian Xiao; Min Long; Ce Yu; Xin Zhou; Li Ji
We propose a two-stage optimization strategy to accelerate the non-equilibrium ionization (NEI) calculations that are crucial to various high-energy astrophysical phenomena, using MapReduce modeling and GPU acceleration. First, we construct a parallel pipeline based on the MapReduce model that processes massive particle trajectories on a separate mesh decoupled from that has been used by other
-
Dynamic power management for value-oriented schedulers in power-constrained HPC system Parallel Comput. (IF 1.119) Pub Date : 2020-08-27 Nirmal Kumbhare; Ali Akoglu; Aniruddha Marathe; Salim Hariri; Ghaleb Abdulla
High performance computing (HPC) systems are confronting the challenge of improving their productivity under a system-wide power constraint in the exascale era. To measure the productivity of an HPC job, researchers have proposed to assign a monotonically decreasing time-dependent value function, called job-value, to that job. These job-value functions are used by the value-based scheduling algorithms
-
Collectives in hybrid MPI+MPI code: Design, practice and performance Parallel Comput. (IF 1.119) Pub Date : 2020-07-24 Huan Zhou; José Gracia; Naweiluo Zhou; Ralf Schneider
The use of a hybrid scheme combining message-passing programming models for inter-node parallelism with shared-memory programming models for node-level parallelism is widespread. Extensive existing practice with hybrid Message Passing Interface (MPI) plus Open Multi-Processing (OpenMP) programming accounts for its popularity. Nevertheless, strong programming efforts are required to gain performance
-
Accelerated molecular dynamics simulation of Silicon Crystals on TaihuLight using OpenACC Parallel Comput. (IF 1.119) Pub Date : 2020-07-11 Jianguo Liang; Rong Hua; Hao Zhang; Wenqiang Zhu; You Fu
The Sunway TaihuLight, with a theoretical peak performance of 125 PFlop/s, is now ranked third in the TOP500 list. It provides a high-level programming model named OpenACC, which extends the OpenACC 2.0 standard with some customized extensions. We assess the performance of the extended programming model and the SW26010 heterogeneous many-core processor for running molecular dynamics (MD) simulation
-
GPU-accelerated Lagrangian heuristic for multidimensional assignment problems with decomposable costs Parallel Comput. (IF 1.119) Pub Date : 2020-06-15 Shardul Natu; Ketan Date; Rakesh Nagi
In this paper, we describe a GPU-accelerated parallel algorithm for the axial Multidimensional Assignment Problem with Decomposable Costs (MDADC), which is one of the most fundamental formulations for data association. MDADC is known to be NP-hard and is large-dimensioned in most realistic cases; hence, heuristic solutions with qualified optimality gaps are the best one can hope for, given the state-of-knowledge
-
Asynchronous runtime with distributed manager for task-based programming models Parallel Comput. (IF 1.119) Pub Date : 2020-06-07 Jaume Bosch; Carlos Álvarez; Daniel Jiménez-González; Xavier Martorell; Eduard Ayguadé
Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The OpenMP 4.0 standard introduced the possibility of describing a set of data dependences per task that the runtime uses to order task execution. This order is calculated using shared graphs, which are updated by all threads in exclusive access using
-
A novel method of grouping target paths for parallel programs Parallel Comput. (IF 1.119) Pub Date : 2020-06-06 Dunwei Gong; Tian Tian; Jinxin Wang; Ying Du; Zheng Li
Genetic algorithms can be employed to automatically generate desired test data, with the advantage of freeing up manpower. For the path coverage criterion, the problem of test data generation needs to be transformed into an optimization problem before applying genetic algorithms. However, when the number of paths to be covered is large, the transformed optimization problem will be very complicated
-
A multi-improvement local search using dataflow and GPU to solve the minimum latency problem Parallel Comput. (IF 1.119) Pub Date : 2020-06-01 Rodolfo Pereira Araujo; Igor Machado Coelho; Leandro Augusto Justen Marzulo
Optimization problems have great importance in the industrial field, especially for supply chain management and transportation of goods. Many of these problems are classified as NP-hard, so there is no known algorithm to find their exact (globally optimal) solutions in polynomial time. Therefore, fast heuristic strategies are generally employed, especially those with the ability to escape from poor quality
-
AIR: Iterative refinement acceleration using arbitrary dynamic precision Parallel Comput. (IF 1.119) Pub Date : 2020-06-01 JunKyu Lee; Gregory D. Peterson; Dimitrios S. Nikolopoulos; Hans Vandierendonck
The increased degree of concurrent operations offered by lower precision arithmetic enables high performance for iterative refinement. Most related work presents statically defined mixed precision arithmetic approaches, while adapting the level of arithmetic precision dynamically in a loop with one-bit granularity can further improve the performance. This paper presents Arbitrary Dynamic Precision Iterative
-
A domain partitioning method using a multi-phase-field model for block-based AMR applications Parallel Comput. (IF 1.119) Pub Date : 2020-05-17 Seiya Watanabe; Takayuki Aoki; Tomohiro Takaki
In distributed implementations of memory-bound stencil AMR applications, the inter-node communication time often represents a major performance bottleneck. Thus minimizing communication is an objective as important as maintaining a good load balance. We propose a new domain partitioning method for block-based AMR applications based on the multi-phase-field (MPF) model. The MPF model for polycrystalline
-
The allscale framework architecture Parallel Comput. (IF 1.119) Pub Date : 2020-05-13 Herbert Jordan; Philipp Gschwandtner; Peter Thoman; Peter Zangerl; Alexander Hirsch; Thomas Fahringer; Thomas Heller; Dietmar Fey
The tremendous challenge of developing applications efficiently utilizing the hardware provided by contemporary parallel systems of all scales is among the most limiting factors for the continuous growth of high performance computing. In this article, we present a novel architecture taking on this challenge by providing an infrastructure for the effective development of such applications. Our design
-
QMPI: A next generation MPI profiling interface for modern HPC platforms Parallel Comput. (IF 1.119) Pub Date : 2020-05-12 Bengisu Elis; Dai Yang; Olga Pearce; Kathryn Mohror; Martin Schulz
As modern HPC applications and systems advance to exascale, their complexity and the need for more efficient resource utilization increase. This demands more advanced monitoring, analysis and optimization approaches. Therefore, the Message Passing Interface (MPI), which is the most common parallel programming system for HPC applications, must enable these advanced approaches. Even if the existing
-
An improved exact algorithm and an NP-completeness proof for sparse matrix bipartitioning Parallel Comput. (IF 1.119) Pub Date : 2020-05-12 Timon E. Knigge; Rob H. Bisseling
We investigate sparse matrix bipartitioning – a problem where we minimize the communication volume in parallel sparse matrix-vector multiplication. We prove, by reduction from graph bisection, that this problem is NP-complete in the case where each side of the bipartitioning must contain a linear fraction of the nonzeros. We present an improved exact branch-and-bound algorithm which finds the minimum
-
Minimizing the usage of hardware counters for collective communication using triggered operations Parallel Comput. (IF 1.119) Pub Date : 2020-05-05 Nusrat Sharmin Islam; Gengbin Zheng; Sayantan Sur; Akhil Langer; Maria Garzaran
Triggered operations and counting events or counters are building blocks used by communication libraries, such as MPI, to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). Triggered operations can be used to schedule a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold. On completion of the operation
-
High performance solution of skew-symmetric eigenvalue problems with applications in solving the Bethe-Salpeter eigenvalue problem Parallel Comput. (IF 1.119) Pub Date : 2020-05-01 Carolin Penke; Andreas Marek; Christian Vorwerk; Claudia Draxl; Peter Benner
We present a high-performance solver for dense skew-symmetric matrix eigenvalue problems. Our work is motivated by applications in computational quantum physics, where one approach to solving the Bethe-Salpeter equation involves the solution of a large, dense, skew-symmetric eigenvalue problem. The computed eigenpairs can be used to compute the optical absorption spectrum of molecules and crystalline
-
Design and evaluation of efficient global data movement in partitioned global address space Parallel Comput. (IF 1.119) Pub Date : 2020-04-30 Hitoshi Murai; Mitsuhisa Sato
Global data movement is the most general, and therefore important, function of inter-node communication in the partitioned global address space programming models, such as XcalableMP. Our implementation of it consists of compile-time and run-time optimization for specific cases and run-time processing based on the calculus of common-stride section descriptors for general cases, which allows efficient
-
QTMS: A quadratic time complexity topology-aware process mapping method for large-scale parallel applications on shared HPC system Parallel Comput. (IF 1.119) Pub Date : 2020-04-29 Baicheng Yan; Limin Xiao; Guangjun Qin; Zhang Yang; Bin Dong; Haonan Yu; Hongyu Wu
Communication degrades the performance of parallel applications with thousands of CPU cores and large quantities of data to exchange. The high communication cost is usually attributed to the mismatch between the communication patterns of parallel applications and the physical topology graphs of the computing resources (or the underlying network topologies). The topology-aware process mapping method can
-
On the scalability of CFD tool for supersonic jet flow configurations Parallel Comput. (IF 1.119) Pub Date : 2020-03-09 Carlos Junqueira-Junior; João Luiz F. Azevedo; Jairo Panetta; William R. Wolf; Sami Yamouni
New regulations are imposing noise emission limits on the aviation industry, which is pushing researchers and engineers to invest effort in studying aeroacoustic phenomena. Following this trend, an in-house computational fluid dynamics tool is built to reproduce high-fidelity results of supersonic jet flows for aeroacoustic analogy applications. The solver is written using the large eddy
-
Comparison of selected FETI coarse space projector implementation strategies Parallel Comput. (IF 1.119) Pub Date : 2020-01-30 Jakub Kruzik; David Horak; Vaclav Hapla; Martin Cermak
This paper deals with scalability improvements of the FETI (Finite Element Tearing and Interconnecting) domain decomposition method solving elliptic PDEs. The main bottleneck of FETI is the solution of a coarse problem that is part of the projector onto the natural coarse space. This paper introduces and compares two strategies for the FETI coarse problem solution. The first one is a classical solution
-
An on-node scalable sparse incomplete LU factorization for a many-core iterative solver with Javelin Parallel Comput. (IF 1.119) Pub Date : 2020-03-23 Joshua Dennis Booth; Gregory Bolet
We present a scalable incomplete LU factorization to be used as a preconditioner for solving sparse linear systems with iterative methods in the package called Javelin. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high performance sparse matrix-vector multiplication and sparse triangular solves
-
Analysis of energy efficiency of a parallel AES algorithm for CPU-GPU heterogeneous platforms Parallel Comput. (IF 1.119) Pub Date : 2020-03-20 Xiongwei Fei; Kenli Li; Wangdong Yang; Keqin Li
Encryption plays an important role in protecting data, especially data transferred on the Internet. However, encryption is computationally expensive, and this leads to high energy costs. Parallel encryption solutions using more CPU/GPU cores can achieve high performance. If we also take energy efficiency into account when using parallel encryption solutions, this problem can be alleviated
-
Visualizing multiphysics, fluid-structure interaction phenomena in intracranial aneurysms. Parallel Comput. (IF 1.119) Pub Date : 2016-07-01 Paris Perdikaris; Joseph A Insley; Leopold Grinberg; Yue Yu; Michael E Papka; George Em Karniadakis
This work presents recent advances in visualizing multi-physics, fluid-structure interaction (FSI) phenomena in cerebral aneurysms. Realistic FSI simulations produce very large and complex data sets, yielding the need for parallel data processing and visualization. Here we present our efforts to develop an interactive visualization tool which enables the visualization of such FSI simulation data. Specifically
-
Atomic Detail Visualization of Photosynthetic Membranes with GPU-Accelerated Ray Tracing. Parallel Comput. (IF 1.119) Pub Date : 2016-06-09 John E Stone; Melih Sener; Kirby L Vandivort; Angela Barragan; Abhishek Singharoy; Ivan Teo; João V Ribeiro; Barry Isralewitz; Bo Liu; Boon Chong Goh; James C Phillips; Craig MacGregor-Chatwin; Matthew P Johnson; Lena F Kourkoutis; C Neil Hunter; Klaus Schulten
The cellular process responsible for providing energy for most life on Earth, namely photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological
-
Parallel Simulated Annealing Using an Adaptive Resampling Interval. Parallel Comput. (IF 1.119) Pub Date : 2016-03-05 Zhihao Lou; John Reinitz
This paper presents a parallel simulated annealing algorithm that is able to achieve 90% parallel efficiency in iteration on up to 192 processors and up to 40% parallel efficiency in time when applied to a 5000-dimension Rastrigin function. Our algorithm breaks scalability barriers in the method of Chu et al. (1999) by abandoning adaptive cooling based on variance. The resulting gains in parallel efficiency
Contents have been reproduced by permission of the publishers.