• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-08-30
Xiaohu Wu; Patrick Loiseau; Esa Hyytiä

Many businesses possess a small infrastructure that they can use for their computing tasks, but also often buy extra computing resources from clouds. Cloud vendors such as Amazon EC2 offer two types of purchase options: on-demand and spot instances. As tenants have limited budgets to satisfy their computing needs, it is crucial for them to determine how to purchase different options and utilize them (in addition to possible self-owned instances) in a cost-effective manner while respecting their response-time targets. In this paper, we propose a framework to design policies to allocate self-owned, on-demand and spot instances to arriving jobs. In particular, we propose a near-optimal policy to determine the number of self-owned instances and an optimal policy to determine the number of on-demand instances to buy and the number of spot instances to bid for at each time unit. Our policies rely on a small number of parameters and we use an online learning technique to infer their optimal values. Through numerical simulations, we show the effectiveness of our proposed policies, in particular that they achieve a cost reduction of up to 64.51 percent when spot and on-demand instances are considered and of up to 43.74 percent when self-owned instances are considered, compared to previously proposed or intuitive policies.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-03
Qiang He; Guangming Cui; Xuyun Zhang; Feifei Chen; Shuiguang Deng; Hai Jin; Yanhui Li; Yun Yang

Edge Computing provides mobile and Internet-of-Things (IoT) app vendors with a new distributed computing paradigm which allows an app vendor to deploy its app at hired edge servers distributed near app users at the edge of the cloud. This way, app users can be allocated to hired edge servers nearby to minimize network latency and energy consumption. A cost-effective edge user allocation (EUA) requires maximum app users to be served with minimum overall system cost. Finding a centralized optimal solution to this EUA problem is NP-hard. Thus, we propose EUAGame, a game-theoretic approach that formulates the EUA problem as a potential game. We analyze the game and show that it admits a Nash equilibrium. Then, we design a novel decentralized algorithm for finding a Nash equilibrium in the game as a solution to the EUA problem. The performance of this algorithm is theoretically analyzed and experimentally evaluated. The results show that the EUA problem can be solved effectively and efficiently.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-03
Behrooz Zarebavani; Foad Jafarinejad; Matin Hashemi; Saber Salehkaleybar

The main goal in many fields in the empirical sciences is to discover causal relationships among a set of variables from observational data. PC algorithm is one of the promising solutions to learn underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called cuPC, to execute an order-independent version of PC. The proposed solution has two variants, cuPC-E and cuPC-S, which parallelize PC in two different ways for multivariate normal distribution. Experimental results show the scalability of the proposed algorithms with respect to the number of variables, the number of samples, and different graph densities. For instance, in one of the most challenging datasets, the runtime is reduced from more than 11 hours to about 4 seconds. On average, cuPC-E and cuPC-S achieve 500X and 1300X speedup, respectively, compared to serial implementation on CPU.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-03
Thomas Schneider; Amos Treiber

Privacy-preserving scalar product (PPSP) protocols are an important building block for secure computation tasks in various applications. Lu et al. (TPDS'13) introduced a PPSP protocol that does not rely on cryptographic assumptions and that is used in a wide range of publications to date. In this comment paper, we show that Lu et al.'s protocol is insecure and should not be used. We describe specific attacks against it and, using impossibility results of Impagliazzo and Rudich (STOC'89), show that it is inherently insecure and cannot be fixed without relying on at least some cryptographic assumptions.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-03
Pengxing Guo; Weigang Hou; Lei Guo; Wei Sun; Chuang Liu; Hainan Bao; Luan H. K. Duong; Weichen Liu

The three-dimensional Network-on-Chips (3D NoCs) has become a mature multi-core interconnection architecture in recent years. However, the traditional electrical lines have very limited bandwidth and high energy consumption, making the photonic interconnection promising for future 3D Optical NoCs (ONoCs). Since existing solutions cannot well guarantee the fault-tolerant ability of 3D ONoCs, in this paper, we propose a reliable optical router (OR) structure which sacrifices less redundancy to obtain more restore paths. Moreover, by using our fault-tolerant routing algorithm, the restore path can be found inside the disabled OR under the deadlock-free condition, i.e., fault-node reuse. Experimental results show that the proposed approach outperforms the previous related works by maximum 81.1 percent and 33.0 percent on average for throughput performance under different synthetic and real traffic patterns. It can improve the system average optical signal to noise ratio (OSNR) performance by maximum 26.92 percent and 12.57 percent on average, and it can improve the average energy consumption performance by 0.3 percent to 15.2 percent under different topology types/sizes, failure rates, OR structures, and payload packet sizes.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-05
Shungeng Zhang; Qingyang Wang; Yasuhiko Kanemasa; Huasong Shan; Liting Hu

Asynchronous event-driven server architecture has been considered as a superior alternative to the thread-based counterpart due to reduced multithreading overhead. In this paper, we conduct empirical research on the efficiency of asynchronous Internet servers, showing that an asynchronous server may perform significantly worse than a thread-based one due to two design deficiencies. The first one is the widely adopted one-event-one-handler event processing model in current asynchronous Internet servers, which could generate frequent unnecessary context switches between event handlers, leading to significant CPU overhead of the server. The second one is a write-spin problem (i.e., repeatedly making unnecessary I/O system calls) in asynchronous servers due to some specific runtime workload and network conditions (e.g., large response size and non-trivial network latency). To address these two design deficiencies, we present a hybrid solution by exploiting the merits of different asynchronous architectures so that the server is able to adapt to dynamic runtime workload and network conditions in the cloud. Concretely, our hybrid solution applies a lightweight runtime request checking and seeks for the most efficient path to process each request from clients. Our results show that the hybrid solution can achieve from 10 to 90 percent higher throughput than all the other types of servers under the various realistic workload and network conditions in the cloud.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-06
Haidong Lan; Jintao Meng; Christian Hundt; Bertil Schmidt; Minwen Deng; Xiaoning Wang; Weiguo Liu; Yu Qiao; Shengzhong Feng

Deep Learning is ubiquitous in a wide field of applications ranging from research to industry. In comparison to time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computation efficiency are crucial to allow for complex models and prolonged battery life. Addressing the aforementioned challenges, we propose FeatherCNN – a fast inference library for ARM CPUs – targeting the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) A highly efficient TensorGEMM (generalized matrix multiplication) routine is applied to accelerate Winograd convolution on ARM CPUs, 2) General layer optimization based on custom high performance kernels improves both the computational efficiency and locality of memory access patterns for non-Winograd layers. 3) The framework design emphasizes joint layer-wise optimization using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 plus and Eigen on a Samsung Galaxy S8, respectively. The source code of FeatherCNN library is publicly available at https://github.com/tencent/feathercnn .

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-10
Tao Zhang; Xiao-Yang Liu; Xiaodong Wang; Anwar Walid

Tensors are the cornerstone data structures in high-performance computing, big data analysis and machine learning. However, tensor computations are compute-intensive and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for the efficiency of ever growing data processing demands. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfer, memory accesses, and design the batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse and t-normalization, cuTensor-tubal achieves maximum $16.91 \times, 27.03 \times, 38.97 \times, 22.36 \times, 15.43 \times$16.91×,27.03×,38.97×,22.36×,15.43× speedups respectively over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely, t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum $9.80 \times$9.80× and $269.26 \times$269.26× speedups over multi-core CPU implementations.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-10
Liang Wang; Leibo Liu; Jie Han; Xiaohang Wang; Shouyi Yin; Shaojun Wei

The communication behaviors in NoCs of chip-multiprocessors exhibit great spatial and temporal variations, which introduce significant challenges for the reconfiguration in NoCs. Existing reconfigurable NoCs are still far from ideal reconfiguration scenarios, in which globally reconfigurable interconnects can be immediately reconfigured to provide bandwidths on demand for varying traffic flows. In this paper, we propose a hybrid NoC architecture that globally reconfigures the ring-based interconnect to adapt to the varying traffic flows with a high flexibility. The ring-based interconnect has the following advantages. First, it includes horizontal rings and vertical rings, which can be dynamically combined or split to provide low-latency channels for heavy traffic flows. Second, each combined ring connects a number of nodes, thereby improving both the utilization of each ring and the probability to reuse previous reconfigurable interconnects. Finally, the reconfiguration algorithm has a linear-time complexity and can be implemented using a low-overhead hardware design, making it possible to achieve a fast reconfiguration in NoCs. The experimental results show that compared to recent reconfigurable NoCs, the proposed NoC architecture can greatly improve the saturation throughput for synthetic traffic patterns, and reduce the packet latency over 40 percent for realistic benchmarks without incurring significant area and power overhead.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-10
Mohsen Ansari; Javad Saber-Latibari; Mostafa Pasandideh; Alireza Ejlali

Analysis of reliability, power, and performance at hardware and software levels due to heterogeneity is a crucial requirement for heterogeneous multicore embedded systems. Escalating power densities have led to thermal issues for heterogeneous multicore embedded systems. This paper proposes a peak-power-aware reliability management scheme to meet power constraints through distributing power density on the whole chip such that reliability targets are satisfied. In this paper, we consider peak power consumption as a system-level power constraint to prevent system failure. To balance the power consumption, we also employ a Dynamic Frequency Scaling (DFS) method to further reduce peak power consumption and satisfy thermal constraints on the chip. We illustrate the benefits of our scheme by comparing it with state-of-the-art schemes, resulting in average in 26.5 percent less peak power consumption (up to 54.3 percent).

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-12
Xu Yuan; Xingliang Yuan; Yihe Zhang; Baochun Li; Cong Wang

The persistent growth of big data applications has being raising new challenges in managing large volumes of datasets with high scalability, confidentiality protection, and flexible types of search queries. In this paper, we propose a secure design to disassemble the private dataset with the aim to store them across geographically distributed servers while supporting secure multi-client Boolean queries. In this design, the data owner encrypts the private database with the searchable index attributes. The encrypted dataset will be disassembled and distributed evenly across multiple servers by leveraging the property of a distributed index framework. By constructing an encryption structure, generating search tokens, and enabling parallel query, we show how the proposed design performs the secure while efficient Boolean search. These queries are not only limited to those initiated by the data owner but also can be extended to support multiple authorized clients, where each client is allowed to access a necessary part of the private database. In this stage, we advocate a non-interactive authorization scheme where data owner is not required to stay online to process the query request. Moreover, the query operation can be executed in parallel, which significantly improves the search efficiency. We formally characterize the leakage profile, which allow us to follow the existing security analysis method to demonstrate that our system can guarantee data confidentiality and query privacy. To validate our protocol, we implement a system prototype and evaluate the efficiency of our construction. Through experimental results, we demonstrate the effectiveness of our protocol in terms of data outsourcing time and Boolean query time.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-13
Guijuan Wang; Cheng-Kuan Lin; Jianxi Fan; Baolei Cheng; Xiaohua Jia

The generalized hypercube (GH) is one key interconnection network with excellent topological properties. It contains many other interconnection topologies, such as the hypercube network, the complete graph, the mesh network, and the $k$k -ary $n$n -cube network. It can also be used to construct some data center networks, such as HyperX, BCube, FBFLY, and SWCube. However, the construction cost of GH is high since it contains too many links. In this paper, we propose a novel low cost interconnection architecture called the exchanged generalized hypercube (EGH). We study the properties of EGH, such as the number of edges, the degree of vertices, connectivity, diameter, and diagnosability. Then, we give a routing algorithm to find the shortest path between any two distinct vertices of EGH. Furthermore, we design an algorithm to give disjoint paths between any two distinct vertices of EGH. In addition, we propose two local diagnosis algorithms: LDT $_{EGH}$EGH and LDWB $_{EGH}$EGH in EGH under PMC model and MM model, respectively. Simulation results demonstrate that even if the proportion of faulty vertices in EGH is up to 25 percent, the probability that these two diagnosis algorithms can successfully determine the status of vertices is more than 90 percent. As far as the number of edges is concerned, the analysis shows that the construction cost of EGH is much less than that of GH. We could regard this work as the basis for proposing future new high performance topologies.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-19
Yi-Wei Ci; Michael R. Lyu; Zhan Zhang; De-Cheng Zuo; Xiao-Zong Yang

Shared memory is widely used for inter-process communication. The shared memory abstraction allows computation to be decoupled from communication, which offers benefits, including portability and ease of programming. To enable shared memory access by processes that are on different machines, distributed shared memory (DSM) can be employed. However, DSM systems can suffer from thrashing: while different processes update certain hot data items, the largest amount of effort is spent on data synchronization, and little progress is made by each process. To avoid interference between processes during data updating while providing shared memory at page granularity, more time is reserved for a writer to hold a page in a traditional manner. In this paper, we report on complex thrashing, which can explain why extending the time of holding a page might not be sufficient to control thrashing. To increase the throughput, we propose a thrashing control mechanism that allows each process to update a set of pages during a period of time, where the pages compose a logical area. Because of the isolation of areas, updates on different areas can be performed concurrently. To allow the areas to be fairly well used, each process is assigned with a random priority for thrashing control. The thrashing control mechanism is implemented on a Linux-based DSM system. Performance results show that the execution time of the applications that are apt to cause system thrashing can be significantly reduced by our approach.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-20
Salma Hesham; Diana Goehringer; Mohamed A. Abd El Ghany

Networks-on-chip (NoCs) acquired substantial advancements as the typical solution for a modular, flexible and high performance communication infrastructure coping with the scalable Multi-/Manycores technology. However, the increasing chip complexity heading towards thousand cores, together with the approaching dark-silicon era, puts energy efficiency as an integral design key for future NoC-based multicores, where NoCs are significantly contributing to the total chip power. In this paper, we propose HPPT-NoC, a dark-silicon inspired energy-efficient hierarchical TDM NoC with online distributed setup-scheme. The proposed network makes use of the dim silicon parts of the chip to hierarchically connect quad-routers units. Normal routers operate at full-chip-frequency at high supply level, and hierarchical routers operate at half-chip-frequency and lower supply voltage with adequate synchronization. Routers follow a proposed TDM architecture that separates the datapath from the control-setup planes. This allows separate clocking and operating supplies between data and control and to keep the control-setup as a single-slot-cycle design independent of the datapath slot size. The proposed NoC architecture is evaluated versus a base NoC from the state-of-the-art in terms of performance and hardware results using Synopsys VCS and Synopsys Design Compiler for SAED90nm and SAED32nm technologies. The obtained results highlight the power-frequency-trading feature supported by the proposed hierarchical NoC through the configurable data-control clock relation and maintained over the different technology nodes. With the same power budget of the base NoC, the proposed architecture provides up to 74% setup latency enhancement, 32% increased NoC saturation load, and 21% higher success rates, offering up to 78% improved power delay product. On the other hand, with 38% power savings, the proposed NoC provides up to 37% enhanced latency and 15% higher success rates, with 72% enhanced power delay product. The proposed design consumes almost double the area of the base NoC, however with an average of 56% under-clocked (dim) silicon area operating at half to quarter the maximum chip frequency. This results in reduced power density as a main concern in the dark-silicon era down to 24% of the base NoC.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-20
Zhi Li; Hai Jin; Deqing Zou; Bin Yuan

DDoS attacks are rampant in cloud environments and continually evolve into more sophisticated and intelligent modalities, such as low-rate DDoS attacks. But meanwhile, the cloud environment is also developing in constant. Now container technology and microservice architecture are widely applied in cloud environment and compose container-based cloud environment. Comparing with traditional cloud environments, the container-based cloud environment is more lightweight in virtualization and more flexible in scaling service. Naturally, a question that arises is whether these new features of container-based cloud environment will bring new possibilities to defeat DDoS attacks. In this paper, we establish a mathematical model based on queueing theory to analyze the strengths and weaknesses of the container-based cloud environment in defeating low-rate DDoS attack. Based on this, we propose a dynamic DDoS mitigation strategy, which can dynamically regulate the number of container instances serving for different users and coordinate the resource allocation for these instances to maximize the quality of service. And extensive simulations and testbed-based experiments demonstrate our strategy can make the limited system resources be utilized sufficiently to maintain the quality of service acceptable and defeat DDoS attack effectively in the container-based cloud environment.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-23
Weihua Zhang; Zhaofeng Yan; Yuzhe Lin; Chuanlei Zhao; Lu Peng

B+tree is one of the most important data structures and has been widely used in different fields. With the increase of concurrent queries and data-scale in storage, designing an efficient B+tree structure has become critical. Due to abundant computation resources, SIMD architectures provide potential opportunities to achieve high query throughput for B+tree. However, prior methods cannot achieve satisfactory performance results due to low resource utilization and poor memory performance. In this paper, we first identify the gaps between B+tree and SIMD architectures. Concurrent B+tree queries involve many global memory accesses and different divergences, which mismatch with SIMD architecture features. Based on this observation, we propose Harmonia, a novel B+tree structure to bridge the gaps. In Harmonia, a B+tree structure is divided into a key region and a child region. The key region stores the nodes with its keys in a breadth-first order. The child region is organized as a prefix-sum array, which only stores each node's first child index in the key region. Since the prefix-sum child region is small and the children's index can be retrieved through index computations, most of it can be stored in on-chip caches, which can achieve good cache locality. To make it more efficient, Harmonia also includes two optimizations: partially-sorted aggregation and narrowed thread-group traversal, which can mitigate memory and execution divergence and improve resource utilization. Evaluations on a 28-core INTEL CPU show that Harmonia can achieve up to 207 million queries per second, which is about 1.7X faster than that of CPU-based HB+Tree [1] , a recent state-of-the-art solution. And on a Volta TITAN V GPU, it can achieve up to 3.6 billion queries per second, which is about 3.4X faster than that of GPU-based HB+Tree.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-23
Louis-Claude Canon; Loris Marchal; Bertrand Simon; Frédéric Vivien

Modern computing platforms commonly include accelerators. We target the problem of scheduling applications modeled as task graphs on hybrid platforms made of two types of resources, such as CPUs and GPUs. We consider that task graphs are uncovered dynamically, and that the scheduler has information only on the available tasks, i.e., tasks whose predecessors have all been completed. Each task can be processed by either a CPU or a GPU, and the corresponding processing times are known. Our study extends a previous $4\sqrt{m/k}$4m/k -competitive online algorithm by Amaris et al. [1] , where $m$m is the number of CPUs and $k$k the number of GPUs ( $m\geq k$m≥k ). We prove that no online algorithm can have a competitive ratio smaller than $\sqrt{m/k}$m/k . We also study how adding flexibility on task processing, such as task migration or spoliation, or increasing the knowledge of the scheduler by providing it with information on the task graph, influences the lower bound. We provide a $(2\sqrt{m/k}+1)$(2m/k+1) -competitive algorithm as well as a tunable combination of a system-oriented heuristic and a competitive algorithm; this combination performs well in practice and has a competitive ratio in $\Theta (\sqrt{m/k})$Θ(m/k) . We also adapt all our results to the case of multiple types of processors. Finally, simulations on different sets of task graphs illustrate how the instance properties impact the performance of the studied algorithms and show that our proposed tunable algorithm performs the best among the online algorithms in almost all cases and has even performance close to an offline algorithm.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-09-24
Hongwu Lv; Jane Hillston; Paul Piho; Huiqiang Wang

High availability is one of the core properties of Infrastructure as a Service (IaaS) and ensures that users have anytime access to on-demand cloud services. However, significant variations of workflow and the presence of super-tasks, mean that heterogeneous workload can severely impact the availability of IaaS clouds. Although previous work has investigated global queues, VM deployment, and failure of PMs, two aspects are yet to be fully explored: one is the impact of task size and the other is the differing features across PMs such as the variable execution rate and capacity. To address these challenges we propose an attribute-based availability model of large scale IaaS developed in the formal modeling language CARMA. The size of tasks in our model can be a fixed integer value or follow the normal, uniform or log-normal distribution. Additionally, our model also provides an easy approach to investigating how to arrange the slack and normal resources in order to achieve availability levels. The two goals of our work are providing an analysis of the availability of IaaS and showing that the use of CARMA allows us to easily model complex phenomena that were not readily captured by other existing approaches.

更新日期：2020-01-17
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-11-26
Xiaotong Li; Ruiting Zhou; Lei Jiao; Chuan Wu; Yuhang Deng; Zongpeng Li

Geo-distributed machine learning (ML) often uses large geo-dispersed data collections produced over time to train global models, without consolidating the data to a central site. In the parameter server architecture, “workers” and “parameter servers” for a geo-distributed ML job should be strategically deployed and adjusted on the fly, to allow easy access to the datasets and fast exchange of the model parameters at any time. Despite many cloud platforms now provide volume discounts to encourage the usage of their ML resources, different geo-distributed ML jobs that run in the clouds often rent cloud resources separately and respectively, thus rarely enjoying the benefit of discounts. We study an ML broker service that aggregates geo-distributed ML jobs into cloud data centers for volume discounts via dynamic online placement and scaling of workers and parameter servers in individual jobs for long-term cost minimization. To decide the number and the placement of workers and parameter servers, we propose an efficient online algorithm which first decomposes the online problem into a series of one-shot optimization problems solvable at each individual time slot by the technique of regularization, and afterwards round the fractional decisions to the integer ones via a carefully-designed dependent rounding method. We prove a parameterized-constant competitive ratio for our online algorithm as the theoretical performance analysis, and also conduct extensive simulation studies to exhibit its close-to-offline-optimum practical performance in realistic settings.

更新日期：2020-01-14
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-12-05
Xuesong Li; Wenxue Cheng; Tong Zhang; Fengyuan Ren; Bailong Yang

Recently, high performance packet I/O frameworks continue to flourish for their ability to process packets from high-speed links. To achieve high throughput and low latency, high performance packet I/O frameworks usually employ busy polling. As busy polling will burn all CPU cycles even if there's no packet to process, these frameworks are quite power inefficient. However, exploiting power management techniques such as DVFS and LPI in the frameworks is challenging, because neither the OS nor the frameworks can provide information (e.g., actual CPU utilization, available idle period, or the target frequency) required by these techniques. In this article, we establish a model that can formulate the packet processing flow of high performance packet I/O to help and address the above challenges. From the model, we can deduce the information needed for power management techniques, and gain the insights to balance the power and latency. After suggesting to use pause instruction to reduce CPU power within short idle period, we propose two approaches to conduct power conservation for high performance packet I/O: one with the aid of traffic information and the other without. Experiments with Intel DPDK show that both approaches can achieve significant power reduction with little latency increase.

更新日期：2020-01-14
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2019-12-20

This index covers all technical items - papers, correspondence, reviews, etc. - that appeared in this periodical during the year, and items from previous years that were commented upon or corrected in this year. Departments and other items may also be covered if they have been judged to have archival value. The Author Index contains the primary entry for each item, listed under the first author's name. The primary entry includes the co-authors' names, the title of the paper or other item, and its location, specified by the publication abbreviation, year, month, and inclusive pagination. The Subject Index contains entries describing the item under all appropriate subject headings, plus the first author's name, the publication abbreviation, month, and year, and inclusive pages. Note that the item title is found only under the primary entry in the Author Index.

更新日期：2020-01-04
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2015-03-06
Zhaoyang Zhang,Honggang Wang,Chonggang Wang,Hua Fang

Increasing population density, closer social contact and interactions make epidemic control difficult. Traditional offline epidemic control methods (e.g., using medical survey or medical records) or model-based approach are not effective due to its inability to gather health data and social contact information simultaneously or impractical statistical assumption about the dynamics of social contact networks, respectively. In addition, it is challenging to find optimal sets of people to be quarantined to contain the spread of epidemics for large populations due to high computational complexity. Unlike these approaches, in this paper, a novel cluster-based epidemic control scheme is proposed based on Smartphone-based body area networks. The proposed scheme divides the populations into multiple clusters based on their physical location and social contact information. The proposed control schemes are applied within the cluster or between clusters. Further, we develop a computational efficient approach called UGP to enable an effective cluster-based quarantine strategy using graph theory for large scale networks (i.e., populations). The effectiveness of the proposed methods is demonstrated through both simulations and experiments on real social contact networks.

更新日期：2019-11-01
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2014-11-25
George Teodoro,Tahsin Kurc,Jun Kong,Lee Cooper,Joel Saltz

We study and characterize the performance of operations in an important class of applications on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy scanners. Common operations in these applications involve the detection and extraction of objects (object segmentation), the computation of features of each extracted object (feature computation), and characterization of objects based on these features (object classification). In this work, we have identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that the performance on a MIC of operations that perform regular data access is comparable or sometimes better than that on a GPU. On the other hand, GPUs are significantly more efficient than MICs for operations that access data irregularly. This is a result of the low performance of MICs when it comes to random data access. We also have examined the coordinated use of MICs and CPUs. Our experiments show that using a performance aware task strategy for scheduling application operations improves performance about 1.29× over a first-come-first-served strategy. This allows applications to obtain high performance efficiency on CPU-MIC systems - the example application attained an efficiency of 84% on 192 nodes (3072 CPU cores and 192 MICs).

更新日期：2019-11-01
• IEEE Trans. Parallel Distrib. Syst. (IF 3.402) Pub Date : 2015-07-17
Javier Cabezas,Isaac Gelado,John E Stone,Nacho Navarro,David B Kirk,Wen-Mei Hwu

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models force programmers to resort to multiple code versions, complex data copy steps and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems as well as adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that support key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in a exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.

更新日期：2019-11-01
Contents have been reproduced by permission of the publishers.

down
wechat
bug