
GPU Tensor Cores for fast Arithmetic Reductions arXiv.cs.DC Pub Date : 2020-01-15
Cristóbal A. Navarro; Roberto Carrasco; Ricardo J. Barrientos; Javier A. Riquelme; Raimundo Vega
This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2} n$ and its speedup is $S = \dfrac{4}{5}\log_{2} m^2$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and preserves numerical precision because the subresults of each chain of $R$ MMAs are kept as 32-bit floating-point values before all being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread blocks; small thread blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4,5$ MMAs per block, while large thread blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
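The core identity behind the approach can be sketched without tensor cores: for an all-ones $m \times m$ matrix $P$, every entry of $PAP$ equals the sum of all $m^2$ entries of $A$, so one pair of matrix multiplies reduces $m^2$ numbers at once. The NumPy sketch below only illustrates that identity; the tile size `m`, the zero padding, and the final host-side sum of partials are our choices, not the paper's CUDA implementation:

```python
import numpy as np

def mma_block_sum(A, P):
    # For an all-ones P, every entry of P @ A @ P equals the sum of all
    # entries of A, so one multiply pair reduces m*m numbers to one value.
    return (P @ A @ P)[0, 0]

def tensor_core_style_sum(x, m=4):
    # Pad x to a multiple of m*m, view it as a stack of m-by-m tiles,
    # and reduce each tile with the all-ones MMA identity. In the paper
    # the partials feed further chained MMAs on tensor cores; here we
    # simply add them on the host.
    P = np.ones((m, m), dtype=x.dtype)
    pad = (-len(x)) % (m * m)
    x = np.concatenate([x, np.zeros(pad, dtype=x.dtype)])
    tiles = x.reshape(-1, m, m)
    return float(sum(mma_block_sum(A, P) for A in tiles))

vals = np.arange(1, 101, dtype=np.float32)
print(tensor_core_style_sum(vals))  # 5050.0 (sum of 1..100)
```
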

One-Bit Over-the-Air Aggregation for Communication-Efficient Federated Edge Learning: Design and Convergence Analysis arXiv.cs.DC Pub Date : 2020-01-16
Guangxu Zhu; Yuqing Du; Deniz Gunduz; Kaibin Huang
Federated edge learning (FEEL) is a popular framework for model training at an edge server using data distributed at edge devices (e.g., smartphones and sensors) without compromising their privacy. In the FEEL framework, edge devices periodically transmit high-dimensional stochastic gradients to the edge server, where these gradients are aggregated and used to update a global model. When the edge devices share the same communication medium, the multiple-access channel from the devices to the edge server induces a communication bottleneck. To overcome this bottleneck, an efficient broadband analog transmission scheme has recently been proposed, featuring the aggregation of analog-modulated gradients (or local models) via the waveform-superposition property of the wireless medium. However, the assumed linear analog modulation makes it difficult to deploy this technique in modern wireless systems that exclusively use digital modulation. To address this issue, we propose in this work a novel digital version of broadband over-the-air aggregation, called one-bit broadband digital aggregation (OBDA). The new scheme features one-bit gradient quantization followed by digital modulation at the edge devices, and majority-voting-based decoding at the edge server. We develop a comprehensive analysis framework for quantifying the effects of wireless channel hostilities (channel noise, fading, and channel-estimation errors) on the convergence rate. The analysis shows that the hostilities slow down the convergence of the learning process by introducing a scaling factor and a bias term into the gradient norm. However, we show that all the negative effects vanish as the number of participating devices grows, but at a different rate for each type of channel hostility.
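The encode/decode idea can be illustrated in a few lines: each device transmits only the signs of its gradient, and the server takes a coordinate-wise majority vote. The sketch below is a simplified, noiseless-channel illustration; the device count, the gradient values, and the noise level are made up, and the actual OBDA scheme operates over a fading multiple-access channel:

```python
import numpy as np

rng = np.random.default_rng(0)

def obda_round(gradients):
    # Devices send only sign(g) -- one bit per coordinate -- and the
    # server decodes the aggregate by a coordinate-wise majority vote.
    signs = np.sign(gradients)         # shape: (devices, dim)
    return np.sign(signs.sum(axis=0))  # majority vote per coordinate

# Toy example: 5 devices holding slightly noisy copies of a common gradient.
true_grad = np.array([0.8, -1.2, 0.3, -0.5])
local_grads = true_grad + 0.05 * rng.standard_normal((5, 4))
decoded = obda_round(local_grads)
print(decoded)  # recovers the signs of the common gradient
```
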

Smart Data based Ensemble for Imbalanced Big Data Classification arXiv.cs.DC Pub Date : 2020-01-16
Diego García-Gil; Johan Holmberg; Salvador García; Ning Xiong; Francisco Herrera
Big Data scenarios pose a new challenge to traditional data mining algorithms, since they are not prepared to work with such an amount of data. Smart Data refers to data of enough quality to improve the outcome of a data mining algorithm. The inability of existing data mining algorithms to handle Big Datasets prevents the transition from Big to Smart Data. The automation in data acquisition that characterizes Big Data also brings some problems, such as differences in data size per class, which lead classifiers to lean towards the most represented classes. This problem is known as imbalanced data distribution, where one class is underrepresented in the dataset. Ensembles of classifiers are machine learning methods that improve the performance of a single base classifier by combining several of them. Ensembles are not exempt from the imbalanced classification problem, and to deal with this issue, the ensemble method has to be designed specifically. In this paper, a data preprocessing ensemble for imbalanced Big Data classification is presented, with a focus on two-class problems. Experiments carried out on 21 Big Datasets show that our ensemble classifier outperforms classic machine learning models with an added data balancing method, such as Random Forests.
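One common building block for such ensembles is training each base classifier on a rebalanced resample of the data and combining them by majority vote. The sketch below illustrates that generic pattern only (random undersampling plus voting); it is not the paper's specific preprocessing ensemble:

```python
import random
from collections import Counter

def undersample(data):
    # Balance a two-class dataset by randomly dropping majority-class
    # examples until both classes have the same size.
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    n_min = min(len(v) for v in by_label.values())
    balanced = []
    for v in by_label.values():
        balanced.extend(random.sample(v, n_min))
    return balanced

def ensemble_predict(models, x):
    # Majority vote over base classifiers (each would be trained on
    # its own balanced resample).
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# 90 majority-class vs 10 minority-class examples.
data = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(10)]
balanced = undersample(data)
print(Counter(y for _, y in balanced))  # both classes end up with 10 examples
```
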

Duet Benchmarking: Improving Measurement Accuracy in the Cloud arXiv.cs.DC Pub Date : 2020-01-16
Lubomír Bulej; Vojtěch Horký; Petr Tůma; François Farquet; Aleksandar Prokopec
We investigate the duet measurement procedure, which helps improve the accuracy of performance-comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3% to 12.5% (5.03% on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8% to 82.4% (37.4% on average) for the SPEC CPU 2017 workloads.

Runtime Deep Model Multiplexing arXiv.cs.DC Pub Date : 2020-01-14
Amir Erfan Eshratifar; Massoud Pedram
We propose a framework to design a lightweight neural multiplexer that, given the input and resource budgets, decides upon the appropriate model to be called for the inference. Mobile devices can use this framework to offload the hard inputs to the cloud while inferring the easy ones locally. Besides, in large-scale cloud-based intelligent applications, instead of replicating the most accurate model, a range of small and large models can be multiplexed, depending on the input's complexity and the resource budget. Our experimental results demonstrate the effectiveness of our framework, benefiting both mobile users and cloud providers.
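A minimal stand-in for such a multiplexer uses the small model's own confidence as a hardness proxy: answer locally when confident, otherwise pay for the large model. This threshold rule is our illustrative simplification, not the learned neural multiplexer the paper proposes, and the two toy "models" below are invented:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multiplex(x, small_model, large_model, threshold=0.8):
    # Route "easy" inputs to the small local model and "hard" ones to
    # the large (cloud-side) model, using the small model's confidence
    # as a cheap proxy for input hardness.
    probs = softmax(small_model(x))
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"
    return int(softmax(large_model(x)).argmax()), "large"

# Hypothetical models returning class logits.
small = lambda x: np.array([5.0, 0.0]) if x > 0 else np.array([0.1, 0.0])
large = lambda x: np.array([0.0, 5.0])
print(*multiplex(1.0, small, large))   # 0 small  (confident -> answered locally)
print(*multiplex(-1.0, small, large))  # 1 large  (unsure -> offloaded)
```
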

Robust Massively Parallel Sorting arXiv.cs.DC Pub Date : 2016-06-28
Michael Axtmann; Peter Sanders
We investigate distributed-memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements. The main outcome is that four sorting algorithms cover the entire range of possible input sizes. For three algorithms we devise new low-overhead mechanisms to make them robust with respect to duplicate keys and skewed input distributions. One of these, designed for medium-sized inputs, is a new variant of quicksort with fast, high-quality pivot selection. At the same time, asymptotic analysis provides performance guarantees and guides the selection and configuration of the algorithms. We validate these hypotheses using extensive experiments on 7 algorithms, 10 input distributions, up to 262144 cores, and input sizes varying over 9 orders of magnitude. For difficult input distributions, our algorithms are the only ones that work at all. For all but the largest input sizes, ours are the first experiments performed on such large machines, and our algorithms significantly outperform the ones one would conventionally have considered.

Improved Parallel Construction of Wavelet Trees and Rank/Select Structures arXiv.cs.DC Pub Date : 2016-10-11
Julian Shun
Existing parallel algorithms for wavelet tree construction have a work complexity of $O(n\log\sigma)$. This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based on parallel integer sorting and has either $O(n\log\log n\lceil\log\sigma/\sqrt{\log n\log\log n}\rceil)$ work and polylogarithmic depth, or $O(n\lceil\log\sigma/\sqrt{\log n}\rceil)$ work and sublinear depth. We also describe another algorithm that has $O(n\lceil\log\sigma/\sqrt{\log n}\rceil)$ work and $O(\sigma+\log n)$ depth. We then show how to use similar ideas to construct variants of wavelet trees (arbitrary-shaped binary trees and multiary trees) as well as wavelet matrices in parallel, with lower work complexity than prior algorithms. Finally, we show that the rank and select structures on binary and multiary sequences, which are stored in wavelet tree nodes, can be constructed in parallel with improved work bounds, matching those of the best existing sequential algorithms for constructing rank and select structures.

Decrypting Distributed Ledger Design: Taxonomy, Classification and Blockchain Community Evaluation arXiv.cs.DC Pub Date : 2018-10-30
Mark C. Ballandies; Marcus M. Dapp; Evangelos Pournaras
More than 1000 distributed ledger technology (DLT) systems, raising $600 billion in investment in 2016, feature the unprecedented and disruptive potential of blockchain technology. A systematic and data-driven analysis, comparison, and rigorous evaluation of the different design choices of distributed ledgers and their implications is a challenge. The rapidly evolving nature of the blockchain landscape hinders reaching a common understanding of the techno-socio-economic design space of distributed ledgers and the crypto-economies they support. To fill this gap, this paper makes the following contributions: (i) a conceptual architecture of DLT systems, with which (ii) a taxonomy is designed and (iii) a rigorous classification of DLT systems is made using real-world data and the wisdom of the crowd. (iv) A DLT design guideline is the end result of applying machine learning methodologies to the classification data. Compared to related work, and as defined in earlier taxonomy theory, the proposed taxonomy is highly comprehensive, robust, explanatory, and extensible. The findings of this paper can provide new insights and a better understanding of the key design choices driving the modeling complexity of DLT systems, while identifying opportunities for new research contributions and business innovation.

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array arXiv.cs.DC Pub Date : 2019-01-07
Linghao Song; Jiachen Mao; Youwei Zhuo; Xuehai Qian; Hai Li; Yiran Chen
With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration of DNNs (especially inference) is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well explored in recent accelerator designs. To truly provide high-throughput and energy-efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to exploit coarse-grain parallelism, as opposed to the fine-grain parallelism inside a layer considered in most existing architectures. This poses the key research question of seeking the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution, HyPar, to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search for a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communication. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.

Unsupervised Segmentation Algorithms' Implementation in ITK for Tissue Classification via Human Head MRI Scans arXiv.cs.DC Pub Date : 2019-02-26
Shadman Sakib; Md. Abu Bakr Siddique
Tissue classification is one of the significant tasks in the field of biomedical image analysis. Magnetic Resonance Imaging (MRI) is of great importance in tissue classification, especially brain tissue classification, which supports applications such as surgical planning, monitoring therapy, clinical drug trials, image registration, stereotactic neurosurgery, and radiotherapy. The task of this paper is to implement different unsupervised classification algorithms in ITK and perform tissue classification (white matter, gray matter, cerebrospinal fluid (CSF), and background of the human brain). For this purpose, 5 grayscale head MRI scans are provided. To classify the brain tissues, three algorithms are used: Otsu thresholding, Bayesian classification, and Bayesian classification with Gaussian smoothing. The obtained classification results are analyzed in the results and discussion section.
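Of the three algorithms, Otsu thresholding is the easiest to show compactly: it exhaustively picks the histogram threshold that maximizes between-class variance. The sketch below is a generic NumPy version, not ITK's implementation; the bin count and the synthetic two-level "image" are illustrative:

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    # Pick the threshold maximizing the between-class variance of the
    # grayscale histogram (Otsu's method, exhaustive over bin cuts).
    hist, edges = np.histogram(image, bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, nbins):
        w0, w1 = hist[:k].sum(), hist[k:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[:k] * centers[:k]).sum() / w0  # class means
        mu1 = (hist[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t

# Synthetic "image" with two well-separated intensity populations:
# the chosen threshold falls strictly between them.
img = np.concatenate([np.full(500, 50.0), np.full(500, 200.0)])
print(50 < otsu_threshold(img) < 200)  # True
```
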

Incentive Analysis of Bitcoin-NG, Revisited arXiv.cs.DC Pub Date : 2020-01-14
Jianyu Niu; Ziyu Wang; Fangyu Gai; Chen Feng
Bitcoin-NG is among the first blockchain protocols to approach the \emph{near-optimal} throughput by decoupling blockchain operation into two planes: leader election and transaction serialization. Its decoupling idea has inspired a new generation of high-performance blockchain protocols. However, the existing incentive analysis of Bitcoin-NG has several limitations. First, the impact of network capacity is ignored. Second, an integrated incentive analysis that jointly considers both key blocks and microblocks is still missing. In this paper, we aim to address these two limitations. First, we propose a new incentive analysis that takes the network capacity into account, showing that Bitcoin-NG can achieve better incentive compatibility against the microblock mining attack under limited network capacity. Second, we leverage a Markov decision process (MDP) to jointly analyze the incentives of both key blocks and microblocks, showing that Bitcoin-NG is as secure as Bitcoin when the adversary controls less than 35% of the computation power. We hope that our in-depth incentive analysis of Bitcoin-NG can shed some light on the mechanism design and incentive analysis of next-generation blockchain protocols.

Entangled Polynomial Codes for Secure, Private, and Batch Distributed Matrix Multiplication: Breaking the "Cubic" Barrier arXiv.cs.DC Pub Date : 2020-01-15
Qian Yu; A. Salman Avestimehr
In distributed matrix multiplication, a common scenario is to assign each worker a fraction of the multiplication task by partitioning the input matrices into smaller submatrices. In particular, by dividing the two input matrices into $m$-by-$p$ and $p$-by-$n$ subblocks, a single multiplication task can be viewed as computing linear combinations of $pmn$ submatrix products, which can be assigned to $pmn$ workers. Such block-partitioning-based designs have been widely studied under the topics of secure, private, and batch computation, where the state of the art all requires computing at least a "cubic" ($pmn$) number of submatrix multiplications. Entangled polynomial codes, first presented for straggler mitigation, provide a powerful method for breaking the cubic barrier. They achieve a subcubic recovery threshold, meaning that the final product can be recovered from \emph{any} subset of multiplication results with a size order-wise smaller than $pmn$. In this work, we show that entangled polynomial codes can be further extended to also cover these three important settings, and provide a unified framework that order-wise reduces the total computational cost upon the state of the art by achieving subcubic recovery thresholds.
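The "cubic" baseline the abstract refers to is easy to make concrete: with an $m$-by-$p$ / $p$-by-$n$ block partition there are $pmn$ submatrix products, one per worker, and each output block is the sum of $p$ of them. The sketch below only illustrates that baseline task decomposition, not the entangled polynomial coding that breaks it:

```python
import numpy as np

def block_partition_products(A, B, m, p, n):
    # Partition A into m x p and B into p x n sub-blocks; each of the
    # p*m*n sub-products is the unit of work handed to one worker in
    # the "cubic" baseline that entangled polynomial codes improve on.
    ar, ac = A.shape[0] // m, A.shape[1] // p
    br, bc = B.shape[0] // p, B.shape[1] // n
    return [((i, j, k),
             A[i*ar:(i+1)*ar, j*ac:(j+1)*ac] @ B[j*br:(j+1)*br, k*bc:(k+1)*bc])
            for i in range(m) for j in range(p) for k in range(n)]

def assemble(products, out_shape, m, n):
    # Block (i, k) of A @ B is the sum over j of the p partial products.
    C = np.zeros(out_shape)
    cr, cc = out_shape[0] // m, out_shape[1] // n
    for (i, j, k), prod in products:
        C[i*cr:(i+1)*cr, k*cc:(k+1)*cc] += prod
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((6, 8))
tasks = block_partition_products(A, B, m=2, p=3, n=2)
C = assemble(tasks, (4, 8), m=2, n=2)
print(len(tasks), np.allclose(C, A @ B))  # 12 True
```
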

Optimized implementation of the conjugate gradient algorithm for FPGA-based platforms using the Dirac-Wilson operator as an example arXiv.cs.DC Pub Date : 2020-01-15
G. Korcyl; P. Korcyl
It is now a noticeable trend in High Performance Computing that systems are becoming more and more heterogeneous. Compute nodes with a host CPU are being equipped with accelerators, the latter being GPU cards, FPGA cards, or both. In many cases, iterative linear solvers are at the heart of scientific applications running on such systems. In this work we present a software package which includes an FPGA implementation of the Conjugate Gradient algorithm, using the particular problem of the Dirac-Wilson operator as encountered in numerical simulations of Quantum Chromodynamics. The software is written in OpenCL and C++ and is optimized for maximal performance. Our framework allows for a simple implementation of other linear operators, while keeping the data transport mechanisms unaltered. Hence, our software can serve as a backbone for many applications which are expected to gain a significant boost factor on FPGA accelerators. As such systems are expected to become more and more widespread, the need for highly performant FPGA implementations of the Conjugate Gradient algorithm and its variants will certainly increase, and the porting investment can be greatly facilitated by the attached code.
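The solver itself is the textbook Conjugate Gradient iteration, written here against an abstract operator so that another linear operator can be dropped in, mirroring the package's stated design. The sketch below is a plain NumPy version for a generic symmetric positive-definite system; it is not the OpenCL/FPGA code, and the random test matrix is ours, not a lattice-QCD operator:

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    # Textbook CG for a symmetric positive-definite operator, written
    # against an abstract apply_A so other linear operators (e.g. a
    # Dirac-Wilson matrix) could be dropped in.
    x = np.zeros_like(b)
    r = b - apply_A(x)          # initial residual
    p = r.copy()                # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)   # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate direction update
        rs = rs_new
    return x

# Generic SPD test problem (A = M^T M + I).
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M.T @ M + np.eye(20)
b = rng.standard_normal(20)
x = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(A @ x, b))  # True
```
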

An n/2 Byzantine node tolerant Blockchain Sharding approach arXiv.cs.DC Pub Date : 2020-01-15
Yibin Xu; Yangyu Huang
Traditional Blockchain Sharding approaches can only tolerate up to n/3 of nodes being adversarial, because they rely on the hypergeometric distribution to make a failure (an adversary that does not control n/3 of nodes globally but can manipulate the consensus of a Shard) unlikely. The system must maintain a large Shard size (the number of nodes inside a Shard) to sustain a low failure probability, so that only a small number of Shards may exist. In this paper, we present a new approach to Blockchain Sharding that can withstand up to n/2 of nodes being bad. We categorise the nodes into different classes, and every Shard has a fixed number of nodes from different classes. We prove that this design is much more secure than the traditional models (which have only one class), and that the Shard size can be reduced significantly. In this way, many more Shards can exist, and the transaction throughput can be increased substantially. The improved Blockchain Sharding approach is promising to serve as the foundation for decentralised autonomous organisations and decentralised databases.

Lazy object copy as a platform for population-based probabilistic programming arXiv.cs.DC Pub Date : 2020-01-09
Lawrence M. Murray
This work considers dynamic memory management for population-based probabilistic programs, such as those using particle methods for inference. Such programs exhibit a pattern of allocating, copying, potentially mutating, and deallocating collections of similar objects through successive generations. These objects may assemble data structures such as stacks, queues, lists, ragged arrays, and trees, which may be of random, and possibly unbounded, size. For the simple case of $N$ particles, $T$ generations, $D$ objects, and resampling at each generation, dense representation requires $O(DNT)$ memory, while sparse representation requires only $O(DT+DN\log DN)$ memory, based on existing theoretical results. This work describes an object copy-on-write platform to automate this saving for the programmer. The core idea is formalized using labeled directed multigraphs, where vertices represent objects, edges the pointers between them, and labels the necessary bookkeeping. A specific labeling scheme is proposed for high performance under the motivating pattern. The platform is implemented for the Birch probabilistic programming language, using smart pointers, hash tables, and reference-counting garbage collection. It is tested empirically on a number of realistic probabilistic programs, and shown to significantly reduce memory use and execution time in a manner consistent with theoretical expectations. This enables copy-on-write for the imperative programmer, lazy deep copies for the object-oriented programmer, and in-place write optimizations for the functional programmer.
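The essence of the platform, lazy copy-on-write with reference counting, can be shown with a tiny handle class: clones share the payload, and the deep copy is deferred until a write finds the payload shared. This is a toy illustration; the `Cow` class and its `clone`/`read`/`write` API are invented here, not Birch's smart-pointer machinery:

```python
import copy

class Cow:
    # Minimal copy-on-write handle: clone() shares the underlying object,
    # and the actual deep copy is deferred until someone writes.
    def __init__(self, value):
        self._box = [value, 1]  # [payload, reference count]

    def clone(self):
        other = Cow.__new__(Cow)
        other._box = self._box   # share the payload
        self._box[1] += 1
        return other

    def read(self):
        return self._box[0]

    def write(self, mutate):
        if self._box[1] > 1:          # shared: take a private copy first
            self._box[1] -= 1
            self._box = [copy.deepcopy(self._box[0]), 1]
        mutate(self._box[0])

a = Cow([1, 2, 3])
b = a.clone()                      # no copy yet
b.write(lambda xs: xs.append(4))   # b copies on write; a is untouched
print(a.read())  # [1, 2, 3]
print(b.read())  # [1, 2, 3, 4]
```
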

Throughput Optimal Routing in Blockchain Based Payment Systems arXiv.cs.DC Pub Date : 2019-12-12
Sushil Mahavir Varma; Siva Theja Maguluri
Cryptocurrency networks such as Bitcoin have emerged as a distributed alternative to traditional centralized financial transaction networks. However, there are major challenges in scaling up the throughput of such networks. The Lightning network and Spider network are alternatives that build bidirectional payment channels on top of cryptocurrency networks using smart contracts, to enable fast transactions that bypass the Blockchain. In this paper, we study the problem of routing transactions in such a payment processing network. We first propose a stochastic model to study such a system, as opposed to the fluid model studied in the literature. Each link in such a model is a two-sided queue, and unlike classical queues, such queues are not stable unless there is external control. We propose a notion of stability for a payment processing network consisting of such two-sided queues using the notion of on-chain rebalancing. We then characterize the capacity region and propose a throughput-optimal algorithm that stabilizes the system under any load within the capacity region. The stochastic model enables us to study closed-loop policies, which typically have better queuing/delay performance than the open-loop policies (or static split rules) studied in the literature. We investigate this through simulations.

A novel countermeasure technique to protect WSN against denial-of-sleep attacks using firefly and Hopfield neural network (HNN) algorithms arXiv.cs.DC Pub Date : 2020-01-15
Reza Fotohi; Somayyeh Firoozi Bari
Wireless sensor networks (WSNs) contain numerous nodes whose main goals are to monitor and control environments; sensor nodes are distributed according to network usage. One of the most significant issues in this type of network is the energy consumption of the sensor nodes. In fixed-sink networks, nodes near the sink act as an interface that relays the data of other nodes to the sink, which rapidly depletes the energy of those sensors and thus shortens the lifetime of the network. Owing to their weaknesses, sensor nodes are susceptible to several threats, one of which is the denial-of-sleep attack (DoSA) threatening WSNs: the DoSA drains the energy of these nodes by preventing them from entering the energy-saving sleep mode. In this paper, a hybrid approach (WSN-FAHN) is proposed based on a mobile sink, a firefly algorithm based on LEACH, and a Hopfield neural network. The mobile sink is applied to both improve energy consumption and increase network lifetime. The firefly algorithm clusters the nodes and performs two-level authentication to prevent the DoSA. In addition, the Hopfield neural network determines the route of the sink movement along which the cluster heads (CHs) send their data. The WSN-FAHN technique is assessed through extensive simulations performed in the NS-2 environment. Simulation outcomes demonstrate the superiority of the WSN-FAHN procedure over contemporary schemes on performance metrics such as packet delivery ratio (PDR), average throughput, detection ratio, network lifetime, and average residual energy.

The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits arXiv.cs.DC Pub Date : 2020-01-15
Ronshee Chawla; Abishek Sankararaman; Ayalvadi Ganesh; Sanjay Shakkottai
We consider a decentralized multi-agent Multi-Armed Bandit (MAB) setup consisting of $N$ agents, solving the same MAB instance to minimize individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip-style communications. We develop two novel algorithms, where each agent only plays from a subset of all the arms. Agents use the communication medium to recommend only arm IDs (not samples), and thus update the set of arms from which they play. We establish that, if agents communicate $\Omega(\log(T))$ times through any connected pairwise gossip mechanism, then every agent's regret is a factor of order $N$ smaller compared to the case of no collaboration. Furthermore, we show that the communication constraints have only a second-order effect on the regret of our algorithm. We then analyze this second-order term of the regret to derive bounds on the regret-communication trade-offs. Finally, we empirically evaluate our algorithm and conclude that the insights are fundamental and not artifacts of our bounds. We also show a lower bound implying that the regret scaling obtained by our algorithm cannot be improved even in the absence of any communication constraints. Our results demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.

Model Pruning Enables Efficient Federated Learning on Edge Devices arXiv.cs.DC Pub Date : 2019-09-26
Yuang Jiang; Shiqiang Wang; Bong Jun Ko; Wei-Han Lee; Leandros Tassiulas
Federated learning is a recent approach for distributed model training without sharing the raw data of clients. It allows model training using the large amount of user data collected by edge and mobile devices, while preserving data privacy. A challenge in federated learning is that the devices usually have much lower computational power and communication bandwidth than machines in data centers, so training large deep neural networks in such a federated setting can consume a large amount of time and resources. To overcome this challenge, we propose in this paper a method that integrates model pruning with federated learning, which includes initial model pruning at the server, further model pruning as part of the federated learning process, followed by the regular federated learning procedure. Our proposed approach can save computation, communication, and storage costs compared to standard federated learning approaches. Extensive experiments on real edge devices validate the benefit of our proposed method.

Decomposing Collectives for Exploiting Multi-lane Communication arXiv.cs.DC Pub Date : 2019-10-29
Jesper Larsson Träff
Many modern, high-performance systems increase the cumulated node bandwidth by offering more than a single communication network and/or by having multiple connections to the network. Efficient algorithms and implementations for collective operations, as found in, e.g., MPI, must be explicitly designed for such multi-lane capabilities. We discuss a model for the design of multi-lane algorithms, and in particular give a recipe for converting any standard, one-ported, (pipelined) communication tree algorithm into a multi-lane algorithm that can effectively use $k$ lanes simultaneously. We first examine the problem from the perspective of \emph{self-consistent performance guidelines}, and give simple, \emph{full-lane, mock-up implementations} of the MPI broadcast, reduction, scan, gather, scatter, allgather, and alltoall operations using only similar operations of the given MPI library itself, in such a way that multi-lane capabilities can be exploited. These implementations, which rely on a decomposition of the communication domain into communicators for nodes and lanes, are full-fledged and readily usable implementations of the MPI collectives. Contrary to expectation, the mock-up implementations in many cases show surprising performance improvements with different MPI libraries on a small 36-node, dual-socket, dual-lane Intel Omni-Path cluster, indicating severe problems with the native MPI library implementations. Our full-lane implementations are in many cases considerably more than a factor of two faster than the corresponding MPI collectives. We see similar results on the larger Vienna Scientific Cluster, VSC-3. These experiments indicate considerable room for improvement of the MPI collectives in current libraries, including more efficient use of multi-lane communication.

Live Exploration with Mobile Robots in a Dynamic Ring, Revisited arXiv.cs.DC Pub Date : 2020-01-13
Subhrangsu Mandal; Anisur Rahaman Molla; William K. Moses Jr.
The graph exploration problem requires a group of mobile robots, initially placed arbitrarily on the nodes of a graph, to work collaboratively to explore the graph such that each node is eventually visited by at least one robot. One important requirement of exploration is the {\em termination} condition, i.e., the robots must know that exploration is completed. The problem of live exploration of a dynamic ring using mobile robots was recently introduced in [Di Luna et al., ICDCS 2016]. That work proposed multiple algorithms to solve exploration in fully synchronous and semi-synchronous settings, with various guarantees, when $2$ robots were involved. It also showed that, under certain assumptions, exploration of the ring using two robots was impossible. An important question left open was how the presence of $3$ robots would affect these results. In this paper, we try to settle this question in a fully synchronous setting and also show how to extend our results to a semi-synchronous setting. In particular, we present algorithms for exploration with explicit termination using $3$ robots in conjunction with either (i) unique IDs of the robots and edge-crossing detection capability (i.e., two robots moving in opposite directions through an edge in the same round can detect each other), or (ii) access to randomness. The time complexity of our deterministic algorithm is asymptotically optimal. We also provide complementary impossibility results showing that there does not exist any explicit-termination algorithm for $2$ robots. The theoretical analysis and comprehensive simulations of our algorithm show its effectiveness and efficiency in dynamic rings. We also present an algorithm to achieve exploration with partial termination using $3$ robots in the semi-synchronous setting.

Cloudburst: Stateful Functions-as-a-Service arXiv.cs.DC Pub Date : 2020-01-14
Vikram Sreekanti; Chenggang Wu; Xiayue Charles Lin; Jose M. Faleiro; Joseph E. Gonzalez; Joseph M. Hellerstein; Alexey Tumanov
Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing, combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.

Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach arXiv.cs.DC Pub Date : 2020-01-14
Pengchao Han; Shiqiang Wang; Kin K. Leung
Federated learning (FL) is an emerging technique for training machine learning models using geographically dispersed data collected by local entities. It includes local computation and synchronization steps. To reduce the communication overhead and improve the overall efficiency of FL, gradient sparsification (GS) can be applied, where instead of the full gradient, only a small subset of important elements of the gradient is communicated. Existing work on GS uses a fixed degree of gradient sparsity for i.i.d. data within a datacenter. In this paper, we consider an adaptive degree of sparsity and non-i.i.d. local datasets. We first present a fairness-aware GS method which ensures that different clients provide a similar amount of updates. Then, with the goal of minimizing the overall training time, we propose a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off that is controlled by the degree of gradient sparsity. The online learning algorithm uses an estimated sign of the derivative of the objective function, which gives a regret bound that is asymptotically equal to the case where the exact derivative is available. Experiments with real datasets confirm the benefits of our proposed approaches, showing up to $40\%$ improvement in model accuracy for a finite training time.
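The sparsification primitive itself is simple: transmit only the $k$ largest-magnitude gradient entries and zero the rest, with $k$ (the degree of sparsity) being the knob the paper tunes online. A minimal top-$k$ sketch follows; the example gradient is made up, and real systems also accumulate the dropped residuals locally:

```python
import numpy as np

def sparsify(grad, k):
    # Keep only the k largest-magnitude entries; zero the rest.
    # (In practice the zeroed residuals are accumulated locally and
    # added back into later rounds.)
    idx = np.argsort(np.abs(grad))[-k:]
    out = np.zeros_like(grad)
    out[idx] = grad[idx]
    return out

g = np.array([0.1, -3.0, 0.05, 2.0, -0.2])
print(sparsify(g, 2))  # keeps -3.0 and 2.0, zeroes the other three entries
```
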

What's Live? Understanding Distributed Consensus arXiv.cs.DC Pub Date : 2020-01-14
Saksham Chand; Yanhong A. Liu
Distributed consensus algorithms such as Paxos have been studied extensively. They all use the same definition of safety. Liveness is especially important in practice despite well-known theoretical impossibility results. However, many different liveness properties and assumptions have been stated, and there are no systematic comparisons for better understanding of these properties. This paper studies and compares different liveness properties stated for over 30 well-known consensus algorithms and variants. We build a lattice of liveness properties combining a lattice of the assumptions used and a lattice of the assertions made, and we compare the strengths and weaknesses of algorithms that ensure these properties. Our precise specifications and systematic comparisons led to the discovery of a range of problems in various stated liveness properties, from lacking assumptions or too-weak assumptions for which no liveness assertions can hold, to too-strong assumptions that make it trivial or uninteresting to achieve the assertions. We also developed TLA+ specifications of these liveness properties. We show that model checking execution steps using TLC can illustrate liveness patterns for single-valued Paxos on up to 4 proposers and 4 acceptors in a few hours, but becomes too expensive for multi-valued Paxos or more processes.

s-Step Orthomin and GMRES implemented on parallel computers arXiv.cs.DC Pub Date : 2020-01-14
A. T. Chronopoulos; S. K. Kim
The Orthomin (Omin) and the Generalized Minimal Residual method (GMRES) are commonly used iterative methods for approximating the solution of nonsymmetric linear systems. The s-step generalizations of these methods enhance their data locality and parallel properties by forming s simultaneous search direction vectors. Good data locality is key to achieving near-peak rates on memory-hierarchical supercomputers. The theoretical derivation of the s-step Arnoldi and Omin methods has been published in the past. Here we derive the s-step GMRES method. We then implement s-step Omin and GMRES on a Cray-2 hierarchical-memory supercomputer.
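The data-locality gain comes from performing s matrix-vector products back-to-back and then orthogonalizing the resulting block of Krylov vectors in one pass, instead of alternating one matvec with one orthogonalization per iteration. A toy pure-Python sketch of that idea (our illustration on a 3x3 system, not the paper's Cray-2 implementation):

```python
# Toy sketch of the s-step idea: generate s Krylov basis vectors
# v, Av, A^2 v, ... in one pass, then orthonormalize the whole block,
# rather than interleaving one matvec with each orthogonalization.
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def mgs(vectors):
    """Modified Gram-Schmidt on the block of s vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
v = [1.0, 0.0, 0.0]
s = 3
krylov = [v]
for _ in range(s - 1):          # s matvecs back-to-back: good locality
    krylov.append(matvec(A, krylov[-1]))
Q = mgs(krylov)                 # one block orthogonalization
assert all(abs(dot(Q[i], Q[j])) < 1e-9
           for i in range(len(Q)) for j in range(i))
```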

Processing Distribution and Architecture Tradeoff for Large Intelligent Surface Implementation arXiv.cs.DC Pub Date : 2020-01-14
Jesus Rodriguez Sanchez; Ove Edfors; Fredrik Rusek; Liang Liu
The Large Intelligent Surface (LIS) concept has emerged recently as a new paradigm for wireless communication, remote sensing and positioning. Despite its potential, there are many challenges from an implementation point of view, with the interconnection data-rate and computational complexity being the most relevant. Distributed processing techniques and hierarchical architectures are expected to play a vital role in addressing these challenges. In this paper we perform algorithm-architecture co-design and analyze the hardware requirements and architecture trade-offs for a discrete LIS to perform uplink detection. By doing this, we expect to give concrete case studies and guidelines for the efficient implementation of LIS systems.

An Informal Method arXiv.cs.DC Pub Date : 2016-08-04
Victor Yodaiken
A method for specifying the behavior and architecture of discrete state systems such as digital electronic devices and software. The method draws on state machine theory, automata products, and recursive functions, and is ordinary working mathematics, not involving formal methods or any foundational or metamathematical techniques. Systems in which there are levels of components that may operate in parallel or concurrently are specified in terms of function composition. Illustrative examples include real-time systems, distributed consensus, a Java producer/consumer solution, and digital circuits.

Who started this rumor? Quantifying the natural differential privacy guarantees of gossip protocols arXiv.cs.DC Pub Date : 2019-02-19
Aurélien Bellet; Rachid Guerraoui; Hadrien Hendrikx
Gossip protocols (also called rumor spreading or epidemic protocols) are widely used to disseminate information in massive peer-to-peer networks. These protocols are often claimed to guarantee privacy because of the uncertainty they introduce on the node that started the dissemination. But is that claim really true? Can one indeed start a gossip and safely hide in the crowd? This paper studies, for the first time, gossip protocols using a rigorous mathematical framework based on differential privacy to determine the extent to which the source of a gossip can be traced. Considering the case of a complete graph in which a subset of the nodes are curious, we derive matching lower and upper bounds on differential privacy, showing that some gossip protocols achieve strong privacy guarantees. Our results reveal an interesting tension between privacy and dissemination speed: the standard "push" gossip protocol has very weak privacy guarantees, while the optimal guarantees are attained at the cost of a drastic increase in spreading time. Yet, we show that it is possible to leverage the inherent randomness and partial observability of gossip protocols to achieve both fast dissemination and near-optimal privacy. These theoretical results are supported by numerical experiments.
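The "push" protocol at the heart of the privacy/speed tension is simple to simulate. The sketch below is our illustration on a complete graph, not the paper's analysis: every informed node pushes the rumor to one uniformly random node per round, which spreads the rumor in logarithmically many rounds but exposes the source's activity from round one.

```python
# Minimal simulation of "push" gossip on a complete graph (illustrative
# only): fast O(log n) spreading is exactly what makes push weak on
# privacy, since the source pushes actively from the very first round.
import random

def push_gossip_rounds(n, seed=0):
    """Rounds until all n nodes are informed; node 0 starts the rumor."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Each informed node pushes to one uniformly random node.
        informed |= {rng.randrange(n) for _ in informed}
        rounds += 1
    return rounds

# The informed set can at most double per round, so at least
# log2(n) rounds are needed.
assert push_gossip_rounds(128) >= 7
```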

Latency, Capacity, and Distributed MST arXiv.cs.DC Pub Date : 2019-02-24
John Augustine; Seth Gilbert; Fabian Kuhn; Peter Robinson; Suman Sourav
We study the cost of distributed MST construction in the setting where each edge has a latency and a capacity, along with the weight. Edge latencies capture the delay on the links of the communication network, while capacity captures their throughput (in this case, the rate at which messages can be sent). Depending on how the edge latencies relate to the edge weights, we provide several tight bounds on the time and messages required to construct an MST. When edge weights exactly correspond with the latencies, we show that, perhaps interestingly, the bottleneck parameter in determining the running time of an algorithm is the total weight $W$ of the MST (rather than the total number of nodes $n$, as in the standard CONGEST model). That is, we show a tight bound of $\tilde{\Theta}(D + \sqrt{W/c})$ rounds, where $D$ refers to the latency diameter of the graph, $W$ refers to the total weight of the constructed MST and edges have capacity $c$. The proposed algorithm sends $\tilde{O}(m+W)$ messages, where $m$, the total number of edges in the network graph under consideration, is a known lower bound on message complexity for MST construction. We also show that $\Omega(W)$ is a lower bound for fast MST constructions. When the edge latencies and the corresponding edge weights are unrelated, and either can take arbitrary values, we show that (unlike the sublinear time algorithms in the standard CONGEST model, on small diameter graphs), the best time complexity that can be achieved is $\tilde{\Theta}(D+n/c)$. However, if we restrict all edges to have equal latency $\ell$ and capacity $c$ while having possibly different weights (weights could deviate arbitrarily from $\ell$), we give an algorithm that constructs an MST in $\tilde{O}(D + \sqrt{n\ell/c})$ time. In each case, we provide nearly matching upper and lower bounds.

The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters arXiv.cs.DC Pub Date : 2019-10-25
Daning Cheng; Hanping Zhang; Fen Xia; Shigang Li; Yunquan Zhang
To gain better performance, many researchers put more computing resources into an application. However, in the AI area, there is still a lack of successful large-scale machine learning training applications: the scalability and performance reproducibility of parallel machine learning training algorithms are limited, and few research efforts explain the underlying reasons. In this paper, we propose that the sample difference in a dataset plays a prominent role in the scalability of parallel machine learning algorithms. Dataset characteristics can measure sample difference; these characteristics include the variance of the samples in a dataset, sparsity, sample diversity, and similarity in the sampling sequence. To test our proposal, we choose four kinds of parallel machine learning training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model-average SGD algorithm (minibatch SGD algorithm), (3) a decentralized optimization algorithm, and (4) dual coordinate optimization (DADM algorithm). These algorithms cover different types of machine learning optimization algorithms. We present an analysis of their convergence proofs and design experiments. Our results show that the characteristics of datasets decide the scalability of the machine learning algorithm. What is more, there is an upper bound on the parallel scalability of machine learning algorithms.

Verifiable and Auditable Digital Interchange Framework arXiv.cs.DC Pub Date : 2020-01-11
Prabal Banarjee; Dushyant Behl; Palanivel Kodeswaran; Chaitanya Kumar; Sushmita Ruj; Sayandeep Sen
We address the problem of fairness and transparency in online marketplaces selling digital content, where not all parties are actively participating in the trade. We present the design, implementation and evaluation of VADER, a highly scalable solution for multi-party fair digital exchange that combines the trusted execution of blockchains with intelligent protocol design and incentivization schemes. We prototype VADER on Hyperledger Fabric and extensively evaluate our system on a realistic testbed spanning five public cloud datacenters spread across four continents. Our results demonstrate that VADER adds only a minimal overhead of 16% in the median case compared to a baseline solution, while significantly outperforming a naive blockchain-based solution that adds an overhead of 764%.

Permissioned Blockchain Revisited: A Byzantine Game-Theoretical Perspective arXiv.cs.DC Pub Date : 2020-01-12
Dongfang Zhao
Despite the popularity and practical applicability of blockchains, there is very limited work on their theoretical foundation: the lack of rigorous theory and analysis behind the curtain of blockchains has severely hindered their broader application. This paper attempts to lay out a theoretical foundation for a specific type of blockchain: the ones requiring basic authenticity from the participants, also called \textit{permissioned blockchains}. We formulate permissioned blockchain systems and operations into a game-theoretical problem by incorporating constraints implied by the wisdom from distributed computing and Byzantine systems. We show that in a non-cooperative blockchain game (NBG), a Nash equilibrium can be efficiently found in closed form even though the game involves more than two players. Somewhat surprisingly, the simulation results of the Nash equilibrium imply that the game can reach a stable status regardless of the number of Byzantine nodes and trustworthy players. We then study a harder problem where players are allowed to form coalitions: the coalitional blockchain game (CBG). We show that although the Shapley value for a CBG can be expressed in a more succinct form, its core is empty.
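For readers unfamiliar with the Shapley value used in the coalitional analysis, it can be computed directly for small games by averaging each player's marginal contribution over all orderings. The toy characteristic function below is our own, not the paper's CBG payoff.

```python
# Hedged illustration of the Shapley value (brute force over player
# orderings, for a toy 3-player game; not the paper's CBG model).
from itertools import permutations

def shapley(players, v):
    """Shapley value of each player for characteristic function v."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            # Marginal contribution of p when joining this coalition.
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: x / len(perms) for p, x in phi.items()}

# Toy game: any coalition of at least 2 players earns 1, otherwise 0.
v = lambda S: 1.0 if len(S) >= 2 else 0.0
phi = shapley(["a", "b", "c"], v)
assert abs(sum(phi.values()) - 1.0) < 1e-9   # efficiency
assert abs(phi["a"] - 1 / 3) < 1e-9          # symmetry
```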

Private and Communication-Efficient Edge Learning: A Sparse Differential Gaussian-Masking Distributed SGD Approach arXiv.cs.DC Pub Date : 2020-01-12
Xin Zhang; Minghong Fang; Jia Liu; Zhengyuan Zhu
With the rise of machine learning (ML) and the proliferation of smart mobile devices, recent years have witnessed a surge of interest in performing ML in wireless edge networks. In this paper, we consider the problem of jointly improving data privacy and communication efficiency of distributed edge learning, both of which are critical performance metrics in wireless edge network computing. Toward this end, we propose a new decentralized stochastic gradient method with sparse differential Gaussian-masked stochastic gradients (SDM-DSGD) for nonconvex distributed edge learning. Our main contributions are threefold: i) we theoretically establish the privacy and communication efficiency performance guarantees of our SDM-DSGD method, which outperform all existing works; ii) we show that SDM-DSGD improves the fundamental training-privacy tradeoff by {\em two orders of magnitude} compared with the state of the art; iii) we reveal theoretical insights and offer practical design guidelines for the interactions between privacy preservation and communication efficiency, two conflicting performance goals. We conduct extensive experiments with a variety of learning models on the MNIST and CIFAR-10 datasets to verify our theoretical findings. Collectively, our results contribute to the theory and algorithm design for distributed edge learning.
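The two ingredients in the method's name, sparsification and Gaussian masking, compose naturally. The sketch below is our conceptual paraphrase (names, k, and sigma are ours, not the paper's parameters): each worker keeps only the largest gradient entries and perturbs them with Gaussian noise before transmission.

```python
# Conceptual sketch (our paraphrase, not the paper's code): send a
# sparsified gradient whose surviving coordinates carry Gaussian noise,
# trading a little accuracy for privacy and bandwidth.
import random

def sparse_gaussian_mask(grad, k, sigma, seed=0):
    """Keep the k largest-|.| entries and add N(0, sigma^2) noise to them."""
    rng = random.Random(seed)
    top = set(sorted(range(len(grad)), key=lambda i: abs(grad[i]),
                     reverse=True)[:k])
    return [(grad[i] + rng.gauss(0.0, sigma)) if i in top else 0.0
            for i in range(len(grad))]

g = [0.05, -2.0, 1.5, 0.01]
masked = sparse_gaussian_mask(g, k=2, sigma=0.1)
assert masked[0] == 0.0 and masked[3] == 0.0   # dropped coordinates
assert masked[1] != 0.0 and masked[2] != 0.0   # kept, noised
```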

Hierarchical Multi-Agent Optimization for Resource Allocation in Cloud Computing arXiv.cs.DC Pub Date : 2020-01-12
Xiangqiang Gao, Senior Member, IEEE; Rongke Liu, Senior Member, IEEE; Aryan Kaushik
In cloud computing, an important concern is to allocate the available resources of service nodes to the requested tasks on demand and to optimize the objective function, i.e., maximizing resource utilization, payoffs, and available bandwidth. This paper proposes a hierarchical multi-agent optimization (HMAO) algorithm to maximize resource utilization and minimize the bandwidth cost for cloud computing. The proposed HMAO algorithm is a combination of the genetic algorithm (GA) and a multi-agent optimization (MAO) algorithm. To maximize resource utilization, an improved GA is implemented to find a set of service nodes that are used to deploy the requested tasks. A decentralized MAO algorithm is presented to minimize the bandwidth cost. We study the effect of key parameters of the HMAO algorithm by the Taguchi method and evaluate the performance results. When compared with the genetic algorithm (GA) and the fast elitist non-dominated sorting genetic algorithm (NSGA-II), the simulation results demonstrate that the HMAO algorithm is more effective than the existing solutions in solving the problem of resource allocation with a large number of requested tasks. Furthermore, we provide a performance comparison of the HMAO algorithm with the first-fit greedy approach in online resource allocation.
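The GA half of the hybrid can be sketched generically. The code below is a minimal genetic algorithm in the spirit of the task-to-node assignment step; the demands, capacities, fitness function, and operators are our simplifications, not the authors' improved GA.

```python
# Minimal GA sketch for task-to-node assignment (our simplification,
# not HMAO's operators): a chromosome maps each task to a service node;
# fitness rewards packing tasks onto fewer nodes within capacity.
import random

DEMANDS = [2, 3, 1, 2, 4]   # hypothetical resource demand per task
CAPACITY = 6                 # hypothetical capacity per node
NODES = 5

def fitness(chrom):
    load = [0] * NODES
    for task, node in enumerate(chrom):
        load[node] += DEMANDS[task]
    if any(l > CAPACITY for l in load):
        return -1.0                      # infeasible assignment
    used = sum(1 for l in load if l > 0)
    return 1.0 / used                    # fewer nodes, higher utilization

def evolve(pop_size=30, gens=60, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randrange(NODES) for _ in DEMANDS] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(DEMANDS))
            child = a[:cut] + b[cut:]            # one-point crossover
            if rng.random() < 0.2:               # mutation
                child[rng.randrange(len(DEMANDS))] = rng.randrange(NODES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
assert fitness(best) > 0     # a feasible assignment was found
```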

Competitive Broadcast against Adaptive Adversary in Multichannel Radio Networks arXiv.cs.DC Pub Date : 2020-01-12
Haimin Chen; Chaodong Zheng
Wireless networks are vulnerable to adversarial jamming due to the open nature of the communication medium. To thwart such malicious behavior, researchers have proposed resource competitive analysis. In this framework, sending, listening, or jamming on one channel for one time slot costs one unit of energy. The adversary can employ an arbitrary jamming strategy to disrupt communication, but has a limited energy budget $T$. The honest nodes, on the other hand, aim to accomplish the distributed computing task in concern with a spending of $o(T)$. In this paper, we focus on solving the broadcast problem, in which a single source node wants to disseminate a message to all other $n-1$ nodes. Previous work has shown that, in the single-hop single-channel scenario, each node can receive the message in $\tilde{O}(T+n)$ time, while spending only $\tilde{O}(\sqrt{T/n}+1)$ energy. If $C$ channels are available, then the time complexity can be reduced by a factor of $C$, without increasing nodes' cost. However, these multi-channel algorithms only work for certain values of $n$ and $C$, and can only tolerate an oblivious adversary. We develop two new resource competitive algorithms for the broadcast problem. They work for arbitrary $n,C$ values, require minimal prior knowledge, and can tolerate a powerful adaptive adversary. In both algorithms, each node's runtime is dominated by the term $O(T/C)$, and each node's energy cost is dominated by the term $\tilde{O}(\sqrt{T/n})$. The time complexity is asymptotically optimal, while the energy complexity is near optimal in some cases. We use "epidemic broadcast" to achieve time efficiency and resource competitiveness, and employ the coupling technique in the analysis to handle the adaptivity of the adversary. These tools might be of independent interest, and can potentially be applied in the design and analysis of other resource competitive algorithms.

Heterogeneous Computation Assignments in Coded Elastic Computing arXiv.cs.DC Pub Date : 2020-01-12
Nicholas Woolsey; RongRong Chen; Mingyue Ji
We study the optimal design of a heterogeneous coded elastic computing (CEC) network where machines have varying relative computation speeds. CEC, introduced by Yang {\it et al.}, is a framework which mitigates the impact of elastic events, where machines join and leave the network. A set of data is distributed among storage-constrained machines using a Maximum Distance Separable (MDS) code such that any subset of machines of a specific size can perform the desired computations. This design eliminates the need to redistribute the data after each elastic event. In this work, we develop a process for an arbitrary heterogeneous computing network to minimize the overall computation time by defining an optimal computation load, or number of computations assigned to each machine. We then present an algorithm to define a specific computation assignment among the machines that makes use of the MDS code and meets the optimal computation load.

Reliable and interoperable computational molecular engineering: 2. Semantic interoperability based on the European Materials and Modelling Ontology arXiv.cs.DC Pub Date : 2020-01-13
Martin Thomas Horsch; Silvia Chiacchiera; Youness Bami; Georg J. Schmitz; Gabriele Mogni; Gerhard Goldbeck; Emanuele Ghedini
The European Materials and Modelling Ontology (EMMO) is a top-level ontology designed by the European Materials Modelling Council to facilitate semantic interoperability between platforms, models, and tools in computational molecular engineering, integrated computational materials engineering, and related applications of materials modelling and characterization. Additionally, domain ontologies exist based on data technology developments from specific platforms. The present work discusses the ongoing work on establishing a European Virtual Marketplace Framework, into which diverse platforms can be integrated. It addresses common challenges that arise when marketplace-level domain ontologies are combined with a top-level ontology like the EMMO by ontology alignment.

Towards High Performance Java-based Deep Learning Frameworks arXiv.cs.DC Pub Date : 2020-01-13
Athanasios Stratikopoulos; Juan Fumero; Zoran Sevarac; Christos Kotselidis
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has set the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. Prior research has focused on employing hardware accelerators as a means to overcome this inefficiency. This trend has driven software development to target heterogeneous execution, and several modern computing systems have incorporated a mixture of diverse computing components, including GPUs and FPGAs. However, the specialization of application code for heterogeneous execution is not a trivial task, as it requires developers to have hardware expertise in order to obtain high performance. The vast majority of the existing deep learning frameworks that support heterogeneous acceleration rely on the implementation of wrapper calls from a high-level programming language to a low-level accelerator backend, such as OpenCL, CUDA or HLS. In this paper we have employed TornadoVM, a state-of-the-art heterogeneous programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework. Our initial results demonstrate up to 8x performance speedup when executing the back-propagation process of the network's training on AMD GPUs against the sequential execution of the original Deep Netts framework.

Resource Sharing in the Edge: A Distributed Bargaining-Theoretic Approach arXiv.cs.DC Pub Date : 2020-01-13
Faheem Zafari; Prithwish Basu; Kin K. Leung; Jian Li; Ananthram Swami; Don Towsley
The growing demand for edge computing resources, particularly due to the increasing popularity of the Internet of Things (IoT) and distributed machine/deep learning applications, poses a significant challenge. On the one hand, certain edge service providers (ESPs) may not have sufficient resources to satisfy their applications according to the associated service-level agreements. On the other hand, some ESPs may have additional unused resources. In this paper, we propose a resource-sharing framework that allows different ESPs to optimally utilize their resources and improve the satisfaction level of applications subject to constraints such as the communication cost of sharing resources across ESPs. Our framework considers that different ESPs have their own objectives for utilizing their resources, thus resulting in a multi-objective optimization problem. We present an $N$-person \emph{Nash Bargaining Solution} (NBS) for resource allocation and sharing among ESPs with a \emph{Pareto} optimality guarantee. Furthermore, we propose a \emph{distributed}, primal-dual algorithm to obtain the NBS by proving that the strong duality property holds for the resultant resource sharing optimization problem. Using synthetic and real-world data traces, we show numerically that the proposed NBS-based framework not only enhances the ability to satisfy applications' resource demands, but also improves the utilities of different ESPs.
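The Nash Bargaining Solution maximizes the product of the players' utility gains over their disagreement points. A toy two-ESP version (our illustration with linear utilities and grid search; the paper solves the general N-player case with a distributed primal-dual algorithm):

```python
# Toy two-ESP Nash Bargaining Solution (illustrative only): the NBS
# maximizes the Nash product of utility gains over the disagreement
# point, here by grid search over feasible splits of one resource.
def nbs_split(total=10.0, d1=0.0, d2=0.0, steps=1000):
    """Split `total` units between two ESPs with utilities u_i(x) = x."""
    best, best_x = -1.0, None
    for i in range(steps + 1):
        x = total * i / steps
        gain = (x - d1) * ((total - x) - d2)   # Nash product
        if gain > best:
            best, best_x = gain, x
    return best_x

# With symmetric linear utilities and zero disagreement utilities,
# the NBS is the equal split.
assert abs(nbs_split() - 5.0) < 1e-6
```

With asymmetric disagreement points the split shifts toward the player with the stronger outside option, which is the fairness notion the framework builds on.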

Notes on Theory of Distributed Systems arXiv.cs.DC Pub Date : 2020-01-10
James Aspnes
Notes for the Yale course CPSC 465/565 Theory of Distributed Systems.

Fast-Fourier-Forecasting Resource Utilisation in Distributed Systems arXiv.cs.DC Pub Date : 2020-01-13
Paul J. Pritz; Daniel Perez; Kin K. Leung
Distributed computing systems often consist of hundreds of nodes executing tasks with different resource requirements. Efficient resource provisioning and task scheduling in such systems are non-trivial and require close monitoring and accurate forecasting of the state of the system, specifically resource utilisation at its constituent machines. Two challenges present themselves towards these objectives. First, collecting monitoring data entails substantial communication overhead. This overhead can be prohibitively high, especially in networks where bandwidth is limited. Second, forecasting models to predict resource utilisation should be accurate and need to exhibit high inference speed. Mission-critical scheduling and resource allocation algorithms use these predictions and rely on their immediate availability. To address the first challenge, we present a communication-efficient data collection mechanism. Resource utilisation data is collected at the individual machines in the system and transmitted to a central controller in batches. Each batch is processed by an adaptive data-reduction algorithm based on Fourier transforms and truncation in the frequency domain. We show that the proposed mechanism leads to a significant reduction in communication overhead while incurring only minimal error and adhering to accuracy guarantees. To address the second challenge, we propose a deep learning architecture using complex Gated Recurrent Units to forecast resource utilisation. This architecture is directly integrated with the above data collection mechanism to improve the inference speed of our forecasting model. Using two real-world datasets, we demonstrate the effectiveness of our approach, both in terms of forecasting accuracy and inference speed. Our approach resolves challenges encountered in resource provisioning frameworks and can be applied to other forecasting problems.
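The frequency-domain truncation step can be demonstrated with a stdlib-only DFT. This is our sketch, not the paper's adaptive algorithm: a slowly varying utilisation trace is transformed, all but the lowest frequencies are dropped, and the batch is reconstructed with negligible error.

```python
# Sketch of Fourier-based data reduction (stdlib-only DFT, not the
# paper's adaptive implementation): keep only the K lowest-frequency
# coefficients of a batch of utilisation samples, then reconstruct.
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def truncate(X, K):
    """Zero all but the K lowest frequencies (kept symmetrically)."""
    n = len(X)
    return [X[k] if (k < K or n - k < K) else 0 for k in range(n)]

# A slow-varying utilisation trace: a mean plus one low frequency.
signal = [5.0 + 2.0 * cmath.cos(2 * cmath.pi * t / 16).real
          for t in range(16)]
approx = idft(truncate(dft(signal), K=2))
assert max(abs(a - b) for a, b in zip(signal, approx)) < 1e-9
```

Transmitting only the kept coefficients (3 of 16 here) is the source of the communication savings; the error grows as the trace gains high-frequency content.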

Domination in Signed Petri Net arXiv.cs.DC Pub Date : 2020-01-13
Payal; Sangita Kansal
In this paper, domination in Signed Petri nets (SPN) is introduced. We identify some Petri net structures in which a dominating set can exist. Applications to the producer-consumer problem, the search for food by bees, and finding similarity among research papers are given to illustrate the areas where the proposed theory can be used.

Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems arXiv.cs.DC Pub Date : 2018-03-01
Georgios C. Chasparis; Vladimir Janjic; Michael Rossbory
Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds resource-awareness to any parallel application. In particular, we introduce a learning-based framework for dynamic placement of parallel threads to Non-Uniform Memory Access (NUMA) architectures. Decisions are taken independently by each thread in a decentralized fashion that significantly reduces computational complexity. The advantage of the proposed learning scheme is the ability to easily incorporate any multi-objective criterion and easily adapt to performance variations during runtime. Under the multi-objective criterion of maximizing total completed instructions per second (i.e., both computational and memory-access instructions), we provide analytical guarantees with respect to the expected performance of the parallel application. We also compare the performance of the proposed scheme with the Linux operating system scheduler in an extensive set of applications, including both computationally and memory intensive ones. We have observed that the performance improvement can be significant, especially under limited availability of resources and under irregular memory-access patterns.

The Fog Development Kit: A Development Platform for SDN-based Edge-Fog Systems arXiv.cs.DC Pub Date : 2019-07-06
Colton Powell; Christopher Desiniotis; Behnam Dezfouli
With the rise of the Internet of Things (IoT), fog computing has emerged to help traditional cloud computing in meeting scalability demands. Fog computing makes it possible to fulfill real-time requirements of applications by bringing more processing, storage, and control power geographically closer to end-devices. However, since fog computing is a relatively new field, there is no standard platform for research and development in a realistic environment, and this dramatically inhibits innovation and development of fog-based applications. In response to these challenges, we propose the Fog Development Kit (FDK). By providing high-level interfaces for allocating computing and networking resources, the FDK abstracts the complexities of fog computing from developers and enables the rapid development of fog systems. In addition to supporting application development on a physical deployment, the FDK supports the use of emulation tools (e.g., GNS3 and Mininet) to create realistic environments, allowing fog application prototypes to be built with zero additional costs and enabling seamless portability to a physical infrastructure. Using a physical testbed and various kinds of applications running on it, we verify the operation and study the performance of the FDK. Specifically, we demonstrate that resource allocations are appropriately enforced and guaranteed, even amidst extreme network congestion. We also present a simulation-based scalability analysis of the FDK versus the number of switches, the number of end-devices, and the number of fog-devices.

Succinct Population Protocols for Presburger Arithmetic arXiv.cs.DC Pub Date : 2019-10-10
Michael Blondin; Javier Esparza; Blaise Genest; Martin Helfrich; Stefan Jaax
Angluin et al. proved that population protocols compute exactly the predicates definable in Presburger arithmetic (PA), the first-order theory of addition. As part of this result, they presented a procedure that translates any formula $\varphi$ of quantifier-free PA with remainder predicates (which has the same expressive power as full PA) into a population protocol with $2^{O(\text{poly}(\varphi))}$ states that computes $\varphi$. More precisely, the number of states of the protocol is exponential in both the bit length of the largest coefficient in the formula and the number of nodes of its syntax tree. In this paper, we prove that every formula $\varphi$ of quantifier-free PA with remainder predicates is computable by a leaderless population protocol with $O(\text{poly}(\varphi))$ states. Our proof is based on several new constructions, which may be of independent interest. Given a formula $\varphi$ of quantifier-free PA with remainder predicates, a first construction produces a succinct protocol (with $O(\varphi^3)$ leaders) that computes $\varphi$; this completes the work initiated in [STACS'18], where we constructed such protocols for a fragment of PA. For large enough inputs, we can get rid of these leaders. If the input is not large enough, then it is small, and we design another construction producing a succinct protocol with one leader that computes $\varphi$. Our last construction gets rid of this leader for small inputs.
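To make the model concrete, here is a tiny simulated population protocol (our illustration, far simpler than the paper's constructions) computing the Presburger predicate "at least c agents started with input 1": agents pool their counts, capping at c, and once any agent reaches c the "true" output spreads epidemically.

```python
# Tiny population protocol simulation (illustrative, not the paper's
# construction) for the threshold predicate "sum of inputs >= c".
import random

def run_protocol(inputs, c, steps=50000, seed=0):
    rng = random.Random(seed)
    states = list(inputs)                   # each agent holds a count
    for _ in range(steps):
        i, j = rng.sample(range(len(states)), 2)   # random interaction
        if states[i] >= c or states[j] >= c:
            states[i] = states[j] = c       # output "true" spreads
        else:
            # Pool the two counts into one agent, capped at c.
            states[i], states[j] = min(states[i] + states[j], c), 0
    return all(s >= c for s in states)      # consensus on "true"?

assert run_protocol([1] * 5 + [0] * 5, c=3) is True
assert run_protocol([1] * 2 + [0] * 8, c=3) is False
```

Counting states per agent (0 through c) is exactly where coefficient sizes blow up the state count that the paper's constructions reduce to polynomial.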

Similarity Driven Approximation for Text Analytics arXiv.cs.DC Pub Date : 2019-10-16
Guangyan Hu; Yongfeng Zhang; Sandro Rigo; Thu D. Nguyen
Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of highly compressed vectors representing words and subcollections of documents. Then, at query processing time, EmApprox uses the index to guide sampling of the data set, with the probability of selecting each subcollection of documents being proportional to its {\em similarity} to the query as computed using the vector representations. We have implemented a prototype of EmApprox as an extension of the Apache Spark system, and used it to approximate three types of queries: aggregation, information retrieval, and recommendation. Experimental results show that EmApprox's similarity-guided sampling achieves much better accuracy than random sampling. Further, EmApprox can achieve significant speedups if users can tolerate small amounts of inaccuracy. For example, when sampling at 10\%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22\% for 90\% of the queries.
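The sampling step itself is easy to illustrate. The similarity scores below are hypothetical placeholders, not values from the learned vector index: subcollections are drawn with probability proportional to their similarity to the query.

```python
# Sketch of similarity-guided sampling (hypothetical scores, not
# EmApprox's learned index): subcollections are sampled with
# probability proportional to their similarity to the query.
import random

def sample_subcollections(similarities, k, seed=0):
    """Draw k subcollection ids with probability proportional to score."""
    rng = random.Random(seed)
    ids = list(similarities)
    weights = [similarities[i] for i in ids]
    return rng.choices(ids, weights=weights, k=k)

sims = {"docs-a": 0.9, "docs-b": 0.05, "docs-c": 0.05}
picks = sample_subcollections(sims, k=1000)
# The most query-similar subcollection dominates the sample.
assert picks.count("docs-a") > picks.count("docs-b")
```

In a real approximate-query system, per-sample results would then be reweighted by the inverse selection probabilities to keep estimates unbiased.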

Self-stabilizing Uniform Reliable Broadcast arXiv.cs.DC Pub Date : 2020-01-09
Oskar Lundström; Michel Raynal; Elad M. Schiller
We study a well-known communication abstraction called Uniform Reliable Broadcast (URB). URB is central in the design and implementation of fault-tolerant distributed systems, as many non-trivial fault-tolerant distributed applications require communication with provable guarantees on message deliveries. Our study focuses on fault-tolerant implementations for time-free message-passing systems that are prone to node failures. Moreover, we aim at the design of an even more robust communication abstraction. We do so through the lenses of self-stabilization, a very strong notion of fault-tolerance. In addition to node and communication failures, self-stabilizing algorithms can recover after the occurrence of arbitrary transient faults; these faults represent any violation of the assumptions according to which the system was designed to operate (as long as the algorithm code stays intact). This work proposes the first self-stabilizing URB solution for time-free message-passing systems that are prone to node failures. The proposed algorithm has an O(bufferUnitSize) stabilization time (in terms of asynchronous cycles) from arbitrary transient faults, where bufferUnitSize is a predefined constant that can be set according to the available memory. Moreover, the communication costs of our algorithm are similar to the ones of the non-self-stabilizing state of the art. The main differences are that our proposal considers repeated gossiping of O(1)-bit messages and deals with bounded space (which is a prerequisite for self-stabilization). Specifically, each node needs to store up to bufferUnitSize · n records, and each record is of size O(v + n log n) bits, where n is the number of nodes in the system and v is the number of bits needed to encode a single URB instance.

RMWPaxos: Fault-Tolerant In-Place Consensus Sequences arXiv.cs.DC Pub Date : 2020-01-10
Jan Skrzypczak; Florian Schintke; Thorsten Schütt
Building consensus sequences based on distributed, fault-tolerant consensus, as used for replicated state machines, typically requires a separate distributed state for every new consensus instance. Allocating and maintaining this state causes significant overhead. In particular, freeing the distributed, outdated states in a fault-tolerant way is not trivial and adds further complexity and overhead to the system. In this paper, we propose an extension to the single-decree Paxos protocol that can learn a sequence of consensus decisions 'in-place', i.e., with a single set of distributed states. Our protocol does not require dynamic log structures and hence has no need for distributed log pruning, snapshotting, compaction, or dynamic resource allocation. The protocol builds a fault-tolerant atomic register that supports arbitrary read-modify-write operations. We use the concept of consistent quorums to detect whether the previous consensus still needs to be consolidated or is already finished so that the next consensus value can be safely proposed. Reading a consolidated consensus is done without state modification and is thereby free of concurrency control and demand for serialisation. A proposer that is not interrupted reaches agreement on consecutive consensuses within a single message round-trip per consensus decision by preparing the acceptors eagerly with the preceding request.

Demo: Light-Weight Programming Language for Blockchain arXiv.cs.DC Pub Date : 2020-01-10
Junhui Kim; Joongheon Kim
This demo abstract introduces a new lightweight programming language, koa, which is suitable for blockchain system design and implementation. In this abstract, the basic features of koa are introduced, including the working system (with playground), the architecture, and virtual machine operations. Runtime execution of software implemented in koa will be presented during the session.

Decentralized Optimization of Vehicle Route Planning: A Cross-City Comparative Study arXiv.cs.DC Pub Date : 2020-01-10
Brionna Davis; Grace Jennings; Taylor Pothast; Ilias Gerostathopoulos; Evangelos Pournaras; Raphael E. Stern
New mobility concepts are at the forefront of research and innovation in smart cities. The introduction of connected and autonomous vehicles enables new possibilities in vehicle routing. Specifically, knowing the origin and destination of each agent in the network can allow for real-time routing of the vehicles to optimize network performance. However, this relies on individual vehicles being "altruistic", i.e., willing to accept an alternative, non-preferred route in order to achieve a network-level performance goal. In this work, we conduct a study to compare different levels of agent altruism and the resulting effect on network-level traffic performance. Specifically, this study compares the effects of different underlying urban structures on the overall network performance, and investigates which characteristics of the network make it possible to realize routing improvements using a decentralized optimization router. The main finding is that, with increased vehicle altruism, it is possible to balance traffic flow among the links of the network. We show evidence that the decentralized optimization router is more effective in networks under high load, and we study the influence of city characteristics; in particular, networks with a higher number of nodes (intersections) or edges (roads) per unit area allow for more possible alternate routes, and thus have a higher potential to improve network performance.

Real-Time RFI Mitigation for the Apertif Radio Transient System arXiv.cs.DC Pub Date : 2020-01-10
Alessio Sclocco; Dany Vohl; Rob V. van Nieuwpoort
Current and upcoming radio telescopes are being designed with increasing sensitivity to detect new and mysterious radio sources of astrophysical origin. While this increased sensitivity improves the likelihood of discoveries, it also makes these instruments more susceptible to the deleterious effects of Radio Frequency Interference (RFI). The challenge posed by RFI is exacerbated by the high data rates achieved by modern radio telescopes, which require real-time processing to keep up with the data. Furthermore, the high data rates do not allow for permanent storage of observations at high resolution; offline RFI mitigation is therefore no longer possible. The real-time requirement makes RFI mitigation even more challenging because, on the one side, the techniques used for mitigation need to be fast and simple, while on the other side they also need to be robust enough to cope with only a partial view of the data. The Apertif Radio Transient System (ARTS) is the real-time, time-domain, transient detection instrument of the Westerbork Synthesis Radio Telescope (WSRT), processing 73 Gb of data per second. Even with a deep learning classifier, the ARTS pipeline requires state-of-the-art real-time RFI mitigation to reduce the number of false-positive detections. Our solution to this challenge is RFIm, a high-performance, open-source, tuned, and extensible RFI mitigation library. The goal of this library is to provide users with RFI mitigation routines that are designed to run in real-time on many-core accelerators, such as Graphics Processing Units, and that can be highly tuned to achieve code and performance portability across different hardware platforms and scientific use cases. Results on the ARTS show that we can achieve real-time RFI mitigation, with a minimal impact on the total execution time of the search pipeline, while considerably reducing the number of false positives.

An Efficient Universal Construction for Large Objects arXiv.cs.DC Pub Date : 2020-01-10
Panagiota Fatourou; Nikolaos D. Kallimanis; Eleni Kanellou
This paper presents LUC, a universal construction that efficiently implements dynamic objects of large state in a wait-free manner. The step complexity of LUC is O(n + kw), where n is the number of processes, k is the interval contention (i.e., the maximum number of active processes during the execution interval of an operation), and w is the worst-case time complexity to perform an operation on the sequential implementation of the simulated object. LUC efficiently implements objects whose size can change dynamically. It improves upon previous universal constructions either by efficiently handling objects whose state is large and can change dynamically, or by achieving better step complexity.

AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs arXiv.cs.DC Pub Date : 2020-01-06
Pengfei Xu; Xiaofan Zhang; Cong Hao; Yang Zhao; Yongan Zhang; Yue Wang; Chaojian Li; Zetong Guan; Deming Chen; Yingyan Lin
Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for DNN chips. However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the design space is large due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs to meet the application specifications. Therefore, DNN chips take a long time to design and require cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip, a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementations given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.), optimize chip design via the Chip Predictor, and then generate optimized synthesizable RTL to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured performance by < 10% when validated using 15 DNN models and 4 platforms (edge FPGA/TPU/GPU and ASIC). Furthermore, accelerators generated by our AutoDNNchip can achieve better performance (up to 3.86X improvement) than that of expert-crafted state-of-the-art accelerators.

OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems arXiv.cs.DC Pub Date : 2020-01-08
Chenhao Xie; Xin Fu; Mingsong Chen; Shuaiwen Leon Song
With strong computation capability, a NUMA-based multi-GPU system is a promising candidate to provide sustainable and scalable performance for Virtual Reality (VR). However, the entire multi-GPU system is viewed as a single GPU, which ignores the data locality in VR rendering during workload distribution, leading to tremendous remote memory accesses among the GPU modules (GPMs). By conducting comprehensive characterizations of different kinds of parallel rendering frameworks, we observe that distributing each rendering object along with its required data to a GPM can reduce the inter-GPM memory accesses. However, this object-level rendering still faces two major challenges in a NUMA-based multi-GPU system: (1) the large data locality between the left and right views of the same object and the data sharing among different objects, and (2) the unbalanced workloads induced by the software-level distribution and composition mechanisms. To tackle these challenges, we propose an object-oriented VR rendering framework (OO-VR) that conducts software and hardware co-optimization to provide a NUMA-friendly solution for VR multi-view rendering in NUMA-based multi-GPU systems. We first propose an object-oriented VR programming model to exploit the data sharing between the two views of the same object and group objects into batches based on their texture sharing levels. Then, we design an object-aware runtime batch distribution engine and a distributed hardware composition unit to achieve balanced workloads among the GPMs. Finally, evaluations on our VR-featured simulator show that OO-VR provides a 1.58x overall performance improvement and a 76% inter-GPM memory traffic reduction over state-of-the-art multi-GPU systems. In addition, OO-VR provides NUMA-friendly performance scalability for future larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.

Fine-Grained Complexity of Safety Verification arXiv.cs.DC Pub Date : 2018-02-15
Peter Chini; Roland Meyer; Prakash Saivasan
We study the fine-grained complexity of Leader Contributor Reachability (LCR) and Bounded-Stage Reachability (BSR), two variants of the safety verification problem for shared memory concurrent programs. For both problems, the memory is a single variable over a finite data domain. Our contributions are new verification algorithms and lower bounds. The latter are based on the Exponential Time Hypothesis (ETH), the problem Set Cover, and cross-compositions. LCR is the question whether a designated leader thread can reach an unsafe state when interacting with a certain number of equal contributor threads. We suggest two parameterizations: (1) by the size of the data domain D and the size of the leader L, and (2) by the size of the contributors C. We present algorithms for both cases. The key techniques are compact witnesses and dynamic programming. The algorithms run in O*((L(D+1))^(LD) * D^D) and O*(2^C) time, showing that both parameterizations are fixed-parameter tractable. We complement the upper bounds by (matching) lower bounds based on ETH and Set Cover. Moreover, we prove the absence of polynomial kernels. For BSR, we consider programs involving t different threads. We restrict the analysis to computations where the write permission changes s times between the threads. BSR asks whether a given configuration is reachable via such an s-stage computation. When parameterized by P, the maximum size of a thread, and t, the interesting observation is that the problem has a large number of difficult instances. Formally, we show that there is no polynomial kernel, no compression algorithm that reduces the size of the data domain D or the number of stages s to a polynomial dependence on P and t. This indicates that symbolic methods may be harder to find for this problem.

Nakamoto Consensus with Verifiable Delay Puzzle arXiv.cs.DC Pub Date : 2019-08-18
Jieyi Long; Ribao Wei
This paper summarizes our work-in-progress on a new consensus protocol based on verifiable delay functions. First, we introduce the concept of a verifiable delay puzzle (VDP), which resembles the hashing puzzle used in the PoW mechanism but can only be solved sequentially. We then present a VDP implementation based on Pietrzak's verifiable delay function. Further, we show that VDP can be combined with the Nakamoto consensus in a proof-of-stake/proof-of-delay hybrid protocol. We analyze the persistence and liveness of the protocol, and show that, compared to PoW, our proposal consumes much less energy; compared to BFT-based consensus algorithms, which usually place an upper limit on the number of consensus nodes, our proposal is much more scalable and can thus achieve a higher level of decentralization.
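The inherently sequential nature of such a puzzle can be illustrated with the repeated-squaring construction that Pietrzak's VDF builds on. This is a minimal sketch under our own assumptions: the tiny modulus and the trapdoor-based check are for illustration only; the actual protocol uses a modulus of unknown factorization and Pietrzak's interactive halving argument for efficient verification.

```python
def eval_vdp(x: int, T: int, N: int) -> int:
    """Solve the puzzle y = x^(2^T) mod N by T sequential squarings.
    Each squaring depends on the previous result, so the work cannot
    be parallelized the way hashing-puzzle attempts can."""
    y = x % N
    for _ in range(T):
        y = (y * y) % N
    return y

def eval_with_trapdoor(x: int, T: int, N: int, phi: int) -> int:
    """With the trapdoor (knowledge of phi(N)) the same value is computed
    in a single fast modular exponentiation, since
    x^(2^T) = x^(2^T mod phi(N)) mod N for x coprime to N."""
    return pow(x, pow(2, T, phi), N)
```

For example, with the toy modulus N = 187 = 11 * 17 (so phi(N) = 160), `eval_vdp(3, 10, 187)` and `eval_with_trapdoor(3, 10, 187, 160)` agree, while an evaluator without the trapdoor must perform all T squarings in order.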

H2OCloud: A Resource and Quality-of-Service-Aware Task Scheduling Framework for Warehouse-Scale Data Centers - A Hierarchical Hybrid DRL (Deep Reinforcement Learning) Based Approach arXiv.cs.DC Pub Date : 2019-12-20
Mingxi Cheng; Ji Li; Paul Bogdan; Shahin Nazarian
Cloud computing has attracted both end-users and Cloud Service Providers (CSPs) in recent years. Improving the resource utilization rate (RUtR), such as CPU and memory usage on servers, while maintaining Quality-of-Service (QoS) is one key challenge faced by CSPs with warehouse-scale data centers. Prior works proposed various algorithms to reduce energy cost or to improve RUtR, which either lack fine-grained task scheduling capabilities or fail to take a comprehensive system model into consideration. This article presents H2OCloud, a Hierarchical and Hybrid Online task scheduling framework for warehouse-scale CSPs, to improve resource usage effectiveness while maintaining QoS. H2OCloud is highly scalable and considers comprehensive information such as various workload scenarios, cloud platform configurations, user request information, and a dynamic pricing model. The hierarchy and hybridity of the framework, combined with its deep reinforcement learning (DRL) engines, enable H2OCloud to efficiently start on-the-go scheduling and learning in an unpredictable environment without pre-training. Our experiments confirm the high efficiency of the proposed H2OCloud when compared to baseline approaches, in terms of energy and cost, while maintaining QoS. Compared with a state-of-the-art DRL-based algorithm, H2OCloud achieves up to 201.17% energy cost efficiency improvement, 47.88% energy efficiency improvement, and 551.76% reward rate improvement.

DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference arXiv.cs.DC Pub Date : 2020-01-08
Udit Gupta; Samuel Hsia; Vikram Saraph; Xiaodong Wang; Brandon Reagen; Gu-Yeon Wei; Hsien-Hsin S. Lee; David Brooks; Carole-Jean Wu
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter shows over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

Architecture and Security of SCADA Systems: A Review arXiv.cs.DC Pub Date : 2020-01-09
Geeta Yadav; Kolin Paul
Pipeline bursts, production line shutdowns, traffic chaos, train collisions, nuclear reactor shutdowns, disrupted electric supply, interrupted oxygen supply in ICUs: these catastrophic events could result from an erroneous SCADA (Supervisory Control and Data Acquisition) system or Industrial Control System (ICS). SCADA systems have become an essential part of the automated control and monitoring of many Critical Infrastructures (CI). Modern SCADA systems have evolved from standalone systems into sophisticated, complex, open systems connected to the Internet. These geographically distributed modern SCADA systems are vulnerable to threats and cyber attacks. In this paper, we first review the SCADA system architectures that have been proposed or implemented, followed by attacks on such systems, to understand and highlight the evolving security needs of SCADA systems. A short investigation of the current state of intrusion detection techniques in SCADA systems is done, followed by a brief study of testbeds for SCADA systems. Cloud and Internet of Things (IoT) based SCADA systems are studied by analysing the architecture of modern SCADA systems. This review paper ends by highlighting the critical research problems that need to be resolved to close the gaps in the security of SCADA systems.

LibreSocial: A Peer-to-Peer Framework for Online Social Networks arXiv.cs.DC Pub Date : 2020-01-09
Kalman Graffi; Newton Masinde
Distributed online social networks (DOSNs) were first proposed to solve the problems of privacy, security, and scalability. A significant amount of research was undertaken to offer viable DOSN solutions capable of competing with existing centralized OSN applications such as Facebook, LinkedIn, and Instagram. This research led to the emergence of peer-to-peer (P2P) networks as a possible solution, upon which several OSNs such as LifeSocial.KOM, Safebook, and PeerSoN, among others, were based. In this paper, we define the basic requirements for a P2P OSN. We then revisit one of the first P2P-based OSNs, LifeSocial.KOM, now called LibreSocial, which has evolved in the past years to address the challenges of running a completely decentralized social network. Over the course of time, several essential new technologies have been incorporated into LibreSocial for better functionality. We describe the architecture and each individual component of LibreSocial and point out how LibreSocial meets the basic requirements for a fully functional distributed OSN.