Current journal: Journal of Parallel and Distributed Computing
  • On the performance difference between theory and practice for parallel algorithms
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-15
    Ami Marowka

    The performance of parallel algorithms is often inconsistent with their preliminary theoretical analyses. Indeed, the gap between the theoretically predicted performance of a parallel algorithm and the results measured in practice continues to widen. This is mainly due to the accelerated development of advanced parallel architectures in the absence of an agreed-upon model of parallel computation, which has implications for the design of parallel algorithms and for the manner in which parallel programming should be taught. In this study, we examined the practical performance of Cormen’s parallel Quicksort algorithm. We measured the performance of the algorithm under different parallel programming approaches and examined how well theoretical performance analyses of the algorithm predict its actual performance. This algorithm is used for teaching theoretical and practical aspects of parallel programming to undergraduate students, so we also considered the pedagogic implications that may arise when the algorithm is used as a learning resource for teaching parallel programming.

    Updated: 2020-01-15
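The entry above concerns Cormen's parallel Quicksort. As a rough illustration of the fork-join structure such an algorithm uses — not the paper's exact formulation; the thread-based tasks, depth cutoff, and middle-element pivot below are illustrative choices — a minimal Python sketch:

```python
# A fork-join quicksort sketch: partition around a pivot, then sort the two
# halves concurrently. The thread-based tasks, depth cutoff, and middle-element
# pivot are illustrative choices, not the formulation studied in the paper.
import threading

def parallel_quicksort(a, depth=2):
    """Return a sorted copy of `a`; spawn threads for recursion while depth > 0."""
    if len(a) <= 1:
        return list(a)
    pivot = a[len(a) // 2]
    lo = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    hi = [x for x in a if x > pivot]
    if depth > 0:
        results = {}
        def work(key, part):
            results[key] = parallel_quicksort(part, depth - 1)
        threads = [threading.Thread(target=work, args=("lo", lo)),
                   threading.Thread(target=work, args=("hi", hi))]
        for t in threads: t.start()
        for t in threads: t.join()
        return results["lo"] + eq + results["hi"]
    return parallel_quicksort(lo, 0) + eq + parallel_quicksort(hi, 0)
```

In CPython the threads illustrate the task structure rather than deliver real speedup, since the GIL serializes pure-Python work; the implementations measured in the paper use genuinely parallel runtimes.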
  • Subgraph fault tolerance of distance optimally edge connected hypercubes and folded hypercubes
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-08
    Litao Guo; Chengfu Qin; Liqiong Xu

    The hypercube and folded hypercube are among the most fundamental interconnection networks owing to their attractive topological properties. For any distinct vertices u,v∈V, the local connectivity κ(u,v) is defined as the maximum number of independent (u,v)-paths in G; similarly, λ(u,v) is the local edge connectivity of u and v. For some t∈[1,D(G)], if for all u,v∈V with u≠v and d(u,v)=t we have κ(u,v) (or λ(u,v)) = min{d(u),d(v)}, then G is t-distance optimally (edge) connected, where D(G) is the diameter of G and d(u) is the degree of u. For all integers 0

    Updated: 2020-01-08
  • A semantic-based methodology for digital forensics analysis
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-08
    Flora Amato; Aniello Castiglione; Giovanni Cozzolino; Fabio Narducci

    Nowadays, more than ever, digital forensics activities are involved in any criminal, civil or military investigation and represent a fundamental tool to support cyber-security. Investigators use a variety of techniques and proprietary forensics software to examine copies of digital devices, searching for hidden, deleted, encrypted, or damaged files or folders. Any evidence found is carefully analysed and documented in a “finding report” in preparation for legal proceedings that involve discovery, depositions, or actual litigation. The aim is to discover and analyse patterns of fraudulent activities. In this work, a new methodology is proposed to support investigators during the analysis process by correlating evidence found through different forensics tools. The methodology was implemented in a system able to add semantic assertions to data generated by forensics tools during extraction processes. These assertions enable more effective access to relevant information and enhanced retrieval and reasoning capabilities.

    Updated: 2020-01-08
  • Efficient convolution pooling on the GPU
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-07
    Shunsuke Suita; Takahiro Nishimura; Hiroki Tokura; Koji Nakano; Yasuaki Ito; Akihiko Kasagi; Tsuguchika Tabaru

    The main contribution of this paper is to show efficient GPU implementations of convolution-pooling, in which pooling follows multiple convolutions. Since multiple convolution and pooling operations are performed alternately in the earlier stages of many Convolutional Neural Networks (CNNs), accelerating convolution-pooling is very important. Our new GPU implementation uses two techniques: (1) convolution interchange with a direct sum, and (2) conversion to matrix multiplication. These techniques reduce the computational and memory access costs; further, the interchanged convolution is converted to a matrix multiplication, which cuBLAS computes very efficiently. Experimental results on a Tesla V100 GPU show that our new cuDNN-compatible GPU implementation of convolution-pooling is 2.90 times (fp32) and 1.43 times (fp16) faster than performing the multiple convolution followed by pooling with cuDNN, the most popular library of primitives for implementing CNNs on the GPU.

    Updated: 2020-01-07
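One of the two techniques named above, conversion of convolution to matrix multiplication, is commonly done via an im2col transform. A minimal NumPy sketch of that idea for a single channel and a 'valid' correlation — the function name and loop-based layout are illustrative, not the paper's GPU kernels:

```python
import numpy as np

def im2col_conv2d(img, kern):
    """2D 'valid' correlation expressed as a single matrix-vector product:
    each output position becomes one row of the im2col matrix."""
    H, W = img.shape
    kh, kw = kern.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = img[i:i + kh, j:j + kw].ravel()
    # the whole convolution is now one GEMM-style multiply (cuBLAS territory)
    return (cols @ kern.ravel()).reshape(oh, ow)
```

For a batch of filters, `kern.ravel()` generalizes to a matrix with one column per filter, turning the whole layer into a single GEMM.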
  • Scheduling directed acyclic graphs with optimal duplication strategy on homogeneous multiprocessor systems
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-07
    Qi Tang; Li-Hua Zhu; Li Zhou; Jun Xiong; Ji-Bo Wei

    Modern applications generally require a large volume of computation and communication. These applications are often implemented on multiprocessor systems to meet their requirements in computing capacity and communication bandwidth; however, obtaining good or even optimal performance on such systems remains a challenge. When tasks of an application are mapped onto different processors for execution, inter-processor communication becomes inevitable, which delays the execution of some tasks and degrades schedule performance. To mitigate the overhead incurred by inter-processor communication and improve schedule performance, task duplication has been employed in scheduling. Most available techniques for the duplication-based scheduling problem rely on heuristics that produce sub-optimal solutions; how to find the optimal duplication-based solution with the minimal makespan has remained unsolved. To fill this gap, this paper proposes a novel Mixed Integer Linear Programming (MILP) formulation for the problem, together with a set of key theorems that enable and simplify the formulation. The proposed MILP formulation can optimize the duplication strategy, serialize the execution of task instances on each processor, and determine data precedences among different task instances, thus producing the optimal solution. The proposed method is tested on a set of synthesized applications and platforms and compared with a well-known algorithm. The experimental results demonstrate the effectiveness of the proposed method.

    Updated: 2020-01-07
  • Scalable energy-efficient parallel sorting on a fine-grained many-core processor array
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-26
    Aaron Stillmaker; Brent Bohnenstiehl; Lucas Stillmaker; Bevan Baas

    Three parallel sorting applications and two list output protocols for the first phase of an external sort are executed on a fine-grained many-core processor array that contains no algorithm-specific hardware and acts as a co-processor, across a variety of array sizes. Results are generated using a cycle-accurate model based on measured data from a fabricated many-core chip and simulated for different processor array sizes. The data show that the most energy-efficient first-phase many-core sort requires over 65× less energy than the GNU C++ standard library sort performed on an Intel laptop-class processor and over 105× less energy than a radix sort running on an Nvidia GPU. In addition, the highest-throughput first-phase many-core sort is over 9.8× faster than std::sort and over 14× faster than the radix sort. Both phases of a 10 GB external sort require a 6.2× lower energy-delay (energy×time) product than std::sort and over 13× lower than the radix sort.

    Updated: 2020-01-04
  • DQPFS: Distributed quadratic programming based feature selection for big data
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-20
    Majid Soheili; Amir Masoud Eftekhari-Moghadam

    With the advent of Big Data, the scalability of machine learning algorithms has become more crucial than ever. Feature selection, an essential preprocessing technique, can improve the performance of learning algorithms confronted with large-scale datasets by removing irrelevant and redundant features. Owing to their lack of scalability, most classical feature selection algorithms are ill-suited to the voluminous data of the Big Data era. QPFS is a traditional feature weighting algorithm that has been used in many feature selection applications. Inspired by the classical QPFS, this paper proposes a scalable algorithm called DQPFS, built on the Apache Spark cluster-computing model. The experimental study is performed on three big datasets that have large numbers of instances and features at the same time, measuring assessment criteria such as accuracy, execution time, speed-up and scale-out. Moreover, for a deeper study, the results of the proposed algorithm are compared with the classical QPFS and with DiRelief, a distributed feature selection algorithm proposed recently. The empirical results illustrate that the proposed method has (a) better scale-out than DiRelief, (b) significantly lower execution time than DiRelief, (c) lower execution time than QPFS, and (d) better accuracy of the Naïve Bayes classifier on two of the three datasets than DiRelief.

    Updated: 2020-01-04
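QPFS weighs features by trading off relevance against redundancy via a quadratic program. The sketch below is only a scoring surrogate of that trade-off — it ranks features instead of solving the QP, and `alpha` and the scoring rule are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def feature_scores(X, y, alpha=0.5):
    """Toy relevance-vs-redundancy scores in the spirit of QPFS:
    score_j = |corr(x_j, y)| - alpha * mean_{k != j} |corr(x_j, x_k)|.
    Assumes no constant columns (so correlations are defined)."""
    p = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))       # feature-feature |correlation|
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
    red = (C.sum(axis=0) - 1.0) / (p - 1)          # mean redundancy, self term excluded
    return rel - alpha * red
```

A distributed version like DQPFS would compute the correlation blocks over data partitions (e.g. Spark RDDs) and combine them before weighting.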
  • On demand clock synchronization for live VM migration in distributed cloud data centers
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-19
    Yashwant Singh Patel; Aditi Page; Manvi Nagdev; Anurag Choubey; Rajiv Misra; Sajal K. Das

    Live migration of virtual machines (VMs) has become an extremely powerful tool for cloud data center management, providing the significant benefit of seamless VM mobility among physical hosts within a data center or across multiple data centers without interrupting the running service. However, even with all the enhanced techniques that ensure a smooth and flexible migration, the down-time of a VM during live migration can still range from a few milliseconds to seconds. Many time-sensitive applications and services cannot afford this extended down-time, and their clocks must be perfectly synchronized to ensure no loss of events or information. In such a virtualized environment, clock synchronization with fine precision and bounded error is one of the most complex and tedious tasks affecting system performance. In this paper, we propose enhanced DTP- and wireless-PTP-based clock synchronization algorithms to achieve high precision in intra- and inter-cloud data center networks. We thoroughly analyze the performance of the proposed algorithms using different clock measurements. Through simulation and real-time experiments, we also show the effect of various performance parameters on the data center networking architectures.

    Updated: 2020-01-04
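PTP-style protocols, which the entry above builds on, estimate clock offset and path delay from one two-way timestamp exchange, assuming a symmetric path. A minimal sketch of that textbook computation:

```python
def ptp_offset_delay(t1, t2, t3, t4):
    """Classic two-way exchange: master sends at t1, slave receives at t2,
    slave replies at t3, master receives at t4 (each in its own clock).
    Assuming symmetric path delay d and slave offset o:
        t2 = t1 + d + o   and   t4 = t3 + d - o."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay
```

Asymmetric paths bias the offset estimate by half the asymmetry, which is one reason the paper's enhanced schemes go beyond this basic exchange.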
  • Extending the limits for big data RSA cracking: Towards cache-oblivious TU decomposition
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-03
    Fatima K. Abu Salem; Mira Al Arab; Laurence T. Yang

    Nowadays, Big Data security processes require mining large amounts of content that traditionally was not used for security analysis. The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet. RSA takes its security from the hardness of the Integer Factorisation Problem. As the size of the modulus of an RSA key grows with the number of bytes to be encrypted, the corresponding linear system to be solved in the adversary’s integer factorisation algorithm also grows. In the age of big data, this makes it compelling to redesign linear solvers over finite fields so that they exploit the memory hierarchy. To this end, we examine several matrix layouts based on space-filling curves that allow a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of Dumas and Roche (2002) requires index conversion routines whose cost to encode and decode the chosen curve is significant. Using a detailed analysis of the number of bit operations required by the encoding and decoding procedures, and accounting for the cost of the lookup tables that represent the recursive decomposition of the Hilbert curve, we show that the Morton-hybrid order incurs the lowest cost for the index conversion routines required throughout the matrix decomposition, compared with the Hilbert, Peano, or Morton orders. The motivation is that cache-efficient parallel adaptations whose natural sequential evaluation order exhibits a lower cache miss rate result in overall faster performance on parallel machines with private or shared caches and on GPUs.

    Updated: 2020-01-04
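For the plain Morton order, the index conversion cost analyzed above comes down to interleaving and de-interleaving coordinate bits. A straightforward (not bit-trick-optimized) 2D sketch:

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of (x, y) into one Morton (Z-order) index:
    bit i of x lands at position 2i, bit i of y at position 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def morton_decode(z, bits=16):
    """Inverse of morton_encode: de-interleave the bits back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

Production encoders replace the loops with magic-mask bit tricks or lookup tables; the Hilbert curve needs a state machine on top of this, which is exactly the extra conversion cost the paper quantifies.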
  • sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-03
    Pedro Valero-Lara; Sandra Catalán; Xavier Martorell; Tetsuzo Usui; Jesús Labarta

    In this work we have implemented a novel linear algebra library on top of the task-based runtime OmpSs-2. We used some of the most advanced OmpSs-2 features, weak dependencies and regions, together with the final clause, to implement auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library of auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features yields an improvement in execution time over reference libraries such as the original LASs library, PLASMA, ATLAS and Intel MKL. These codes reduce execution time by about 18% on big matrices, by increasing the IPC of gemm and reducing task-instantiation time. For a few medium matrices, benefits are also seen. For small matrices and a subset of medium matrices, specific optimizations that increase the degree of parallelism in both the gemm and trsm tasks are applied. This strategy achieves a performance increase of up to 40%.

    Updated: 2020-01-04
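A blocked trsm such as the one auto-tuned above is typically built from a solve on each diagonal block followed by a gemm-style update of the remaining block rows. A serial NumPy sketch of that blocked structure — the block size and names are illustrative; the library in the paper expresses each step as an OmpSs-2 task with dependencies:

```python
import numpy as np

def blocked_trsm_lower(L, B, bs=2):
    """Solve L @ X = B for X, with L nonsingular lower-triangular, by block
    forward substitution: solve the diagonal block, then apply a gemm-style
    update to the trailing block rows. Each step maps to one runtime task."""
    n = L.shape[0]
    X = B.astype(float).copy()
    for k in range(0, n, bs):
        kk = slice(k, min(k + bs, n))
        X[kk] = np.linalg.solve(L[kk, kk], X[kk])   # diagonal-block solve (trsm task)
        rest = slice(min(k + bs, n), n)
        X[rest] -= L[rest, kk] @ X[kk]              # trailing update (gemm task)
    return X
```

In the task-parallel version, the gemm updates for different trailing block rows are independent and can run concurrently, which is where the extra parallelism for small matrices comes from.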
  • Blockchain 3.0 applications survey
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-03
    Damiano Di Francesco Maesa; Paolo Mori

    In this paper we survey a number of interesting applications of blockchain technology not related to cryptocurrencies. After an initial period of application to cryptocurrencies and the financial world, blockchain technology has been successfully exploited in many other scenarios, where its unique features have allowed the definition of innovative and sometimes disruptive solutions. In particular, this paper considers the following application scenarios: end-to-end verifiable electronic voting, healthcare records management, identity management systems, access control systems, decentralized notary services (with a focus on intellectual property protection), and supply chain management. For each of these, we first analyse the problem, the related requirements, and the advantages the adoption of blockchain technology might bring. We then present a number of relevant solutions proposed in the literature by both academia and companies.

    Updated: 2020-01-04
  • Efficient AES implementation on Sunway TaihuLight supercomputer: A systematic approach
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2020-01-02
    Liandeng Li; Jiarui Fang; Jinlei Jiang; Lin Gan; Weijie Zheng; Haohuan Fu; Guangwen Yang

    Encryption is an important technique to improve information security for many real-world applications. The Advanced Encryption Standard (AES) is a widely used, efficient cryptographic algorithm. Although AES is fast in both software and hardware, data encryption is still time-consuming, especially for large amounts of data, so accelerating AES operations is a continuing effort. This paper presents SW-AES, a parallel AES implementation on the Sunway TaihuLight, one of the fastest supercomputers in the world, which takes the SW26010 processor as its basic building block. According to the architectural features of the SW26010, SW-AES exploits parallelism at different levels, including (1) inter-CPE (Computing Processing Element) data parallelism that distributes tasks among the 256 on-chip CPEs, (2) intra-CPE data parallelism enabled by the Single-Instruction Multiple-Data (SIMD) instructions inside each CPE, and (3) instruction-level parallelism that pipelines memory access and computation. In addition, corresponding to two application scenarios, SW-AES provides scalable ways to run AES efficiently on many nodes. As a result, SW-AES achieves a maximum throughput of 13.50 GB/s on a single SW26010 node, which is 216.23× higher than the latest parallel AES implementation on the Sunway TaihuLight and about 37.3% higher than the latest AES implementation on the GTX 480 GPU. When running on 1024 computing nodes with each one processing 1 GB of data, SW-AES achieves a throughput of 13819.25 GB/s; by contrast, the latest related work on the Sunway TaihuLight achieves only 63.91 GB/s.

    Updated: 2020-01-04
  • A Parallel Multilevel Feature Selection algorithm for improved cancer classification
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-28
    Lokeswari Venkataramana; Shomona Gracia Jacob; Rajavel Ramadoss

    Biological data tends to grow exponentially, consuming more resources, time and manpower; parallelization of algorithms can reduce overall execution time. There are two main challenges in parallelizing computational methods: (1) biological data is multi-dimensional in nature, and (2) parallel algorithms reduce execution time but at the cost of reduced prediction accuracy. This paper targets these two issues and proposes the following approaches: (1) vertical partitioning of data along the feature space and horizontal partitioning along samples, to ease the task of data parallelism, and (2) a Parallel Multilevel Feature Selection (M-FS) algorithm to select optimal and important features for improved classification of cancer sub-types. The selected features are evaluated using parallel Random Forest on Spark and compared with previously reported results as well as with the results of sequential execution of the same algorithms. The proposed parallel M-FS algorithm was also compared with existing parallel feature selection algorithms in terms of accuracy and execution time. The results reveal that the parallel multilevel feature selection algorithm improved cancer classification, yielding prediction accuracies ranging from ∼85% to ∼99% with very high speedup, completing in seconds, whereas existing sequential algorithms yielded prediction accuracies of ∼65% to ∼99% with execution times of more than 24 h.

    Updated: 2020-01-04
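The vertical and horizontal partitioning proposed above amounts to slicing the samples×features matrix along both axes. A minimal NumPy sketch — the function name and block counts are illustrative:

```python
import numpy as np

def partition_grid(X, n_row_parts, n_col_parts):
    """Split a samples-by-features matrix horizontally (sample blocks) and
    vertically (feature blocks) — the two axes of data parallelism: each
    resulting block could be handled by a different worker."""
    row_blocks = np.array_split(X, n_row_parts, axis=0)
    return [np.array_split(rb, n_col_parts, axis=1) for rb in row_blocks]
```

`np.array_split` tolerates uneven divisions, so the grid also works when the sample or feature count is not a multiple of the partition count.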
  • Structured multi-block grid partitioning using balanced cut trees
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-27
    Georg Geiser; Wolfgang Schröder

    An algorithm is presented that partitions structured multi-block hexahedral grids for load-balanced assignment of the partitions to a given number of bins. It uses a balanced hierarchical cut-tree data structure to partition the structured blocks into structured partitions. The refinement of the cut tree attempts to generate equally shaped partitions with a small amount of additional surface. A multi-block load balancing approach is presented that guarantees an upper bound on load imbalance. The partition quality of the algorithm is compared with established recursive edge bisection approaches and an unstructured partitioning using METIS. Two generic and two turbomachinery test cases demonstrate the superior quality and fast runtime of the present algorithm at generating load-balanced structured partitions.

    Updated: 2020-01-04
  • Kokkos implementation of an Ewald Coulomb solver and analysis of performance portability
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-17
    Rene Halver; Jan H. Meinke; Godehard Sutmann

    We have implemented the computation of Coulomb interactions in particle systems using the performance-portable C++ framework Kokkos, employing an Ewald summation for the electrostatic interactions. We consider this implementation a basis for a performance portability study. As target architectures we used Intel CPUs, including the Intel Xeon Phi, as well as Nvidia GPUs. To provide a measure of performance portability, we compute the number of required operations and cycles, i.e. the expected runtime, and compare these with the measured runtime. Results indicate a similar quality of performance portability on all investigated architectures.

    Updated: 2020-01-04
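An Ewald summation splits the Coulomb sum into a screened real-space part and a reciprocal-space part. A toy, non-periodic sketch of just the real-space term — units, the reciprocal and self-energy terms, and the parameter values are omitted or illustrative:

```python
import math

def ewald_real_space(positions, charges, alpha, r_cut):
    """Real-space Ewald term for a toy open system: sum over pairs i < j of
    q_i * q_j * erfc(alpha * r_ij) / r_ij, skipping pairs beyond r_cut.
    erfc screens the interaction so the sum converges quickly with distance."""
    n = len(positions)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            if r < r_cut:
                energy += charges[i] * charges[j] * math.erfc(alpha * r) / r
    return energy
```

A periodic implementation, as in the paper, would also loop over image cells and add the reciprocal-space sum; this double loop is the part that parallelizes naturally over particle pairs.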
  • CHAMELEON: Reactive load balancing for hybrid MPI+OpenMP task-parallel applications
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-16
    Jannis Klinkenberg; Philipp Samfass; Michael Bader; Christian Terboven; Matthias S. Müller

    Many applications in high performance computing are designed based on underlying performance and execution models. While these models could successfully be employed in the past for balancing load within and between compute nodes, modern software and hardware increasingly make performance predictability difficult, if not impossible; consequently, balancing computational load becomes much harder. Aiming to tackle these challenges in search of a general solution, we present a novel library for fine-granular task-based reactive load balancing in distributed memory based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank; the actual executing rank is determined at run time from online performance data. We evaluate our approach under an enforced power cap and under enforced clock frequency changes for a synthetic benchmark, and show its robustness to work-induced imbalances for a realistic application. Our experiments demonstrate speedups of up to 1.31×.

    Updated: 2020-01-04
  • High level programming abstractions for leveraging hierarchical memories with micro-core architectures
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-12-11
    Maurice Jamieson; Nick Brown

    Micro-core architectures combine many low-memory, low-power computing cores in a single package. These are attractive for use as accelerators, but due to limited on-chip memory and multiple levels of memory hierarchy, the way in which programmers offload kernels needs to be carefully considered. In this paper we use Python as a vehicle for exploring the semantics and abstractions of higher-level programming languages to support the offloading of computational kernels to these devices. By moving to a pass-by-reference model, along with leveraging memory kinds, we demonstrate the ability to easily and efficiently take advantage of multiple levels in the memory hierarchy, even ones that are not directly accessible to the micro-cores. Using a machine learning benchmark, we perform experiments on both Epiphany-III and MicroBlaze based micro-cores, demonstrating the ability to compute with data sets of arbitrarily large size. To put our results in context, we explore the performance and power efficiency of these technologies, demonstrating that whilst these two micro-core technologies are competitive within their own embedded class of hardware, there is still a way to go to reach HPC-class GPUs.

    Updated: 2020-01-04
  • Practical concurrent unrolled linked lists using lazy synchronization
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-11-13
    Kenneth Platz; Neeraj Mittal; S. Venkatesan

    Linked lists and other list-based sets are some of the most ubiquitous data structures in computer science. They are useful in their own right and are frequently used as building blocks in other data structures. A linked list can be “unrolled” to combine multiple keys in each node; this improves storage density and overall performance. This organization also allows an operation to skip over nodes which cannot contain a key of interest. This work introduces a new high-performance concurrent unrolled linked list with a lazy synchronization strategy. Most write operations under this strategy can complete by locking a single node. Experiments show up to 300% improvement over other concurrent list-based sets.

    Updated: 2020-01-04
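An unrolled list node bundles several sorted keys, so a search can skip any node whose largest key is still too small — the traversal shortcut described above. A minimal sequential sketch, with no concurrency or lazy locking; the names and capacity are illustrative:

```python
class UnrolledNode:
    """One node of an unrolled list: a small sorted bundle of up to `cap` keys."""
    def __init__(self, keys, cap=4):
        self.keys = sorted(keys)[:cap]
        self.next = None

def build_unrolled(keys, cap=4):
    """Chunk globally sorted keys into a chain of unrolled nodes."""
    keys = sorted(keys)
    head = prev = None
    for i in range(0, len(keys), cap):
        node = UnrolledNode(keys[i:i + cap], cap)
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head

def contains(head, key):
    """Membership test that skips whole nodes whose largest key is too small —
    valid because keys are globally ordered across the chain."""
    node = head
    while node is not None:
        if node.keys and key <= node.keys[-1]:
            return key in node.keys
        node = node.next
    return False
```

The concurrent version in the paper adds per-node locks and lazy (logical-then-physical) deletion so most writes lock only the single node they modify.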
  • Towards High Performance Data Analytic on Heterogeneous Many-core Systems: A Study on Bayesian Sequential Partitioning.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2019-03-16
    Bo-Cheng Lai; Tung-Yu Wu; Tsou-Han Chiu; Kun-Chun Li; Chia-Ying Lee; Wei-Chen Chien; Wing Hung Wong

    Bayesian Sequential Partitioning (BSP) is a statistically effective density estimation method for understanding the characteristics of a high-dimensional data space. The intensive computation of the statistical model and the counting over enormous data volumes pose serious design challenges for BSP in handling the growing volume of data. This paper proposes a high-performance design of BSP that leverages a heterogeneous CPU/GPGPU system consisting of a host CPU and a K80 GPGPU. A series of techniques, covering both data structures and execution management policies, is implemented to extensively exploit the computation capability of the heterogeneous many-core system and alleviate system bottlenecks. Compared with a parallel design on a high-end CPU, the proposed techniques achieve a 48× average runtime improvement, with a maximum speedup of 78.76×.

    Updated: 2019-11-01
  • High Performance Multiple Sequence Alignment System for Pyrosequencing Reads from Multiple Reference Genomes.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2012-11-06
    Fahad Saeed; Alan Perez-Rathke; Jaroslaw Gwarnicki; Tanya Berger-Wolf; Ashfaq Khokhar

    Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping reads from multiple reference genomes is not possible with a pairwise mapping algorithm. To align the reads with respect to each other and to the reference genomes, existing multiple sequence alignment (MSA) methods cannot be used because they do not take into account the position of these short reads with respect to the genome and are highly inefficient for large numbers of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large numbers of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns erroneous reads and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing numbers of processors.

    Updated: 2019-11-01
  • Parallel Algorithms for Switching Edges in Heterogeneous Graphs.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2017-08-02
    Hasanuzzaman Bhuiyan; Maleq Khan; Jiangzhuo Chen; Madhav Marathe

    An edge switch is an operation on a graph (or network) in which two edges are selected randomly and one end vertex of each is swapped with the other. Edge switch operations have important applications in graph theory and network analysis, such as generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm: addressing them requires complex synchronization and communication among the processors, making a good speedup difficult to achieve through parallelization. In this paper, we present distributed-memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors; a harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.

    Updated: 2019-11-01
  • Accelerating Advanced MRI Reconstructions on GPUs.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2008-10-01
    S S Stone; J P Haldar; S C Tsao; W-M W Hwu; B P Sutton; Z-P Liang

    Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128³ voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur an error of 42%.

    Updated: 2019-11-01
  • Multi-heuristic dynamic task allocation using genetic algorithms in a heterogeneous distributed system.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2010-09-24
    Andrew J Page; Thomas M Keane; Thomas J Naughton

    We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms.

    Updated: 2019-11-01
  • Distributed Computation of the knn Graph for Large High-Dimensional Point Sets.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2007-03-01
    Erion Plaku; Lydia E Kavraki

    High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.

Updated: 2019-11-01
  • Scalable isosurface visualization of massive datasets on commodity off-the-shelf clusters.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2009-09-17
    Xiaoyu Zhang,Chandrajit Bajaj

Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by using parallel isosurface extraction and rendering in conjunction with a new specialized piece of image compositing hardware called Metabuffer. In this paper, we focus on the back-end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data across parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations needed to load large data from disk. We also describe an isosurface compression scheme that is efficient for progressive extraction, transmission and storage of isosurfaces.
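The interval trees mentioned above answer one core query: which cells can intersect the isosurface at a given isovalue? A cell contributes triangles only if its scalar range spans the isovalue. The sketch below shows that span test as a plain linear scan (the external interval tree answers the same query without touching every cell); the cell values are invented for the example.

```python
def active_cells(cells, isovalue):
    """Return indices of cells whose scalar range spans the isovalue.

    Each cell is a tuple of its 8 corner scalar values; only these
    'active' cells can contribute triangles to the isosurface.
    """
    return [
        i for i, corners in enumerate(cells)
        if min(corners) <= isovalue <= max(corners)
    ]

cells = [
    (0.0, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2),  # range [0.0, 0.4]
    (0.6, 0.7, 0.9, 0.8, 0.7, 0.6, 0.9, 1.0),  # range [0.6, 1.0]
    (0.3, 0.5, 0.4, 0.6, 0.5, 0.4, 0.7, 0.5),  # range [0.3, 0.7]
]
active = active_cells(cells, isovalue=0.5)
```

Storing the (min, max) intervals in an I/O-optimal external interval tree lets the extraction load only the disk blocks containing active cells, which is where the claimed reduction in I/O operations comes from.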

Updated: 2019-11-01
  • More IMPATIENT: A Gridding-Accelerated Toeplitz-based Strategy for Non-Cartesian High-Resolution 3D MRI on GPUs.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2013-05-18
    Jiading Gai,Nady Obeid,Joseph L Holtrop,Xiao-Long Wu,Fan Lam,Maojing Fu,Justin P Haldar,Wen-Mei W Hwu,Zhi-Pei Liang,Bradley P Sutton

Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks with a gridding approach that accelerates the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. Through the implementation of optimized gridding in our iterative reconstruction scheme, the improved GPU implementation provides speed-ups of more than a factor of 200 over the previous accelerated GPU code.
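Gridding, in essence, deposits non-Cartesian k-space samples onto a regular Cartesian grid so that FFT-based machinery can be used. The sketch below shows the simplest possible variant, nearest-neighbor binning; real gridding codes (including, presumably, IMPATIENT's) convolve each sample with a kernel such as a Kaiser-Bessel window instead. The sample values are invented for the example.

```python
def grid_samples(samples, grid_size):
    """Deposit non-Cartesian samples onto a Cartesian grid by
    nearest-neighbor binning. Coordinates lie in [0, 1).

    samples: list of (kx, ky, value) tuples.
    Returns a grid_size x grid_size list of accumulated values.
    """
    grid = [[0.0] * grid_size for _ in range(grid_size)]
    for kx, ky, value in samples:
        # Map the continuous coordinate to the nearest grid bin.
        ix = min(int(kx * grid_size), grid_size - 1)
        iy = min(int(ky * grid_size), grid_size - 1)
        grid[iy][ix] += value
    return grid

samples = [(0.10, 0.10, 1.0), (0.12, 0.10, 0.5), (0.90, 0.90, 2.0)]
grid = grid_samples(samples, grid_size=4)
```

Because each sample's contribution is independent, this accumulation parallelizes naturally on a GPU (with atomic adds or a sort-then-reduce pass to handle bin collisions).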

Updated: 2019-11-01
  • Efficient Out of Core Sorting Algorithms for the Parallel Disks Model.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2011-10-29
    Vamsi Kundeti,Sanguthevar Rajasekaran

In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However, many of these merge-based algorithms have large underlying constants in the time bounds, because they suffer from a lack of read parallelism on PDM. The irregular consumption of the runs during the merge hurts the read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input close to the lower bound of [Formula: see text]. We experimentally verify our dirty-sequence idea with the standard R-way merge and show that it can significantly reduce the number of parallel I/Os needed to sort on PDM.
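The "irregular consumption of runs" the abstract refers to can be seen in a plain R-way merge: the run each output item is drawn from depends on the data, so prefetching blocks from parallel disks is hard to schedule. This sketch (not the paper's algorithm) merges R sorted runs with a heap and records the consumption order; the runs are invented for the example.

```python
import heapq

def r_way_merge(runs):
    """Merge R sorted runs; also record which run each output item
    came from. On PDM, this consumption order is data-dependent and
    irregular in general, which is what limits read parallelism
    during the merge."""
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged, order = [], []
    while heap:
        value, r, i = heapq.heappop(heap)
        merged.append(value)
        order.append(r)  # run index this item was consumed from
        if i + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return merged, order

runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
merged, order = r_way_merge(runs)
```

With these (deliberately interleaved) runs the consumption order happens to be perfectly round-robin; adversarial inputs can drain one run entirely before touching another, forcing serialized reads from a single disk.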

Updated: 2019-11-01
  • A New Augmentation Based Algorithm for Extracting Maximal Chordal Subgraphs.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2015-03-15
    Sanjukta Bhowmick,Tzu-Yi Chen,Mahantesh Halappanavar

    A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms' parallelizability. In this paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. We experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
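The augmentation idea, start from a spanning chordal subgraph (any spanning tree is chordal) and grow it, can be sketched naively: test each remaining edge and keep it only if the subgraph stays chordal. The chordality test below uses maximum cardinality search, whose reverse order is a perfect elimination ordering iff the graph is chordal; the from-scratch re-test per edge is far cruder than the paper's algorithm and is only meant to show the structure of the approach. The example graph is a 4-cycle.

```python
def is_chordal(adj):
    """Chordality test: run maximum cardinality search (MCS) and
    check that the reverse MCS order is a perfect elimination
    ordering. adj is a list of neighbor sets."""
    n = len(adj)
    weight, order, numbered = [0] * n, [], set()
    for _ in range(n):
        v = max((u for u in range(n) if u not in numbered),
                key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for w in adj[v]:
            if w not in numbered:
                weight[w] += 1
    elim = list(reversed(order))
    pos = {v: i for i, v in enumerate(elim)}
    for v in elim:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if later:
            # The earliest later neighbor must be adjacent to the rest.
            u = min(later, key=lambda x: pos[x])
            if any(w != u and w not in adj[u] for w in later):
                return False
    return True

def maximal_chordal_subgraph(n, edges, base):
    """Grow a chordal subgraph of (n, edges): start from the chordal
    base (e.g. a spanning tree) and add each remaining edge that
    keeps the subgraph chordal."""
    kept = set(base)
    for e in edges:
        if e in kept:
            continue
        trial = kept | {e}
        adj = [set() for _ in range(n)]
        for a, b in trial:
            adj[a].add(b)
            adj[b].add(a)
        if is_chordal(adj):
            kept = trial
    return kept

cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]   # C4: chordless 4-cycle
tree = [(0, 1), (1, 2), (2, 3)]            # spanning tree, chordal
sub = maximal_chordal_subgraph(4, cycle, tree)
```

On the 4-cycle, adding the fourth edge would create a chordless cycle of length four, so it is rejected and the spanning tree is already the maximal chordal subgraph over these candidates. The paper's point is precisely that this greedy growth, unlike dynamic vertex ordering, decomposes into parallelizable augmentation steps.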

Updated: 2019-11-01
  • A uniform approach for programming distributed heterogeneous computing systems.
    J. Parallel Distrib. Comput. (IF 1.819) Pub Date : 2015-04-07
    Ivan Grasso,Simone Pellegrini,Biagio Cosenza,Thomas Fahringer

Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are becoming increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms, making application development very challenging. In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks the dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal. We assess libWater's performance on three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations.
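Building a DAG of commands from event dependencies, as the libWater runtime is described as doing, amounts to recording which commands wait on which completion events and then scheduling in a dependency-respecting order. The sketch below (not libWater's API; the command names are invented) builds such a DAG and produces a valid execution order with Kahn's algorithm.

```python
from collections import defaultdict, deque

def build_dag_order(commands):
    """Build a command DAG from event dependencies and return a valid
    execution order (Kahn's topological sort). Each command is
    (name, deps), where deps names the commands whose completion
    events it waits on."""
    indeg = {name: len(deps) for name, deps in commands}
    children = defaultdict(list)
    for name, deps in commands:
        for d in deps:
            children[d].append(name)
    ready = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(indeg):
        raise ValueError("cycle in command dependencies")
    return order

commands = [
    ("write_A", []),                        # host-to-device copy
    ("kernel_1", ["write_A"]),              # waits on write_A's event
    ("kernel_2", ["write_A"]),              # independent of kernel_1
    ("read_B", ["kernel_1", "kernel_2"]),   # device-to-host copy
]
order = build_dag_order(commands)
```

Once the runtime holds this explicit DAG, whole-graph optimizations such as the collective-pattern detection and redundant device-host-device copy removal described above become graph rewrites rather than per-command decisions.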

    更新日期:2019-11-01
Contents have been reproduced by permission of the publishers.