Current journal: arXiv - CS - Distributed, Parallel, and Cluster Computing
  • GPU Tensor Cores for fast Arithmetic Reductions
    arXiv.cs.DC Pub Date : 2020-01-15
    Cristóbal A. Navarro; Roberto Carrasco; Ricardo J. Barrientos; Javier A. Riquelme; Raimundo Vega

    This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n)=5\log_{m^2}{n}$ and its speedup is $S=\dfrac{4}{5} \log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2 \times$ faster than a conventional GPU reduction implementation, and preserves the numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values before all of them are reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R=4,5$ MMAs per block, while large thread-blocks work best with $R=1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
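
    The encoding idea can be illustrated with a small NumPy sketch in which the all-ones matrix plays the role of one MMA operand; the block size, chain length and use of FP32 throughout are illustrative stand-ins for the paper's FP16-input/FP32-accumulator tensor-core kernel.

```python
import numpy as np

def mma_chain_reduce(x, m=4, R=4):
    """Sum the values of x using only m x m matrix multiply-accumulates.

    Each step of a chain folds one m*m block A into an accumulator via
    C = ones @ A + C (rows of C then hold per-column partial sums); one more
    multiply by the all-ones matrix collapses C, so P[0, 0] is the chain sum.
    """
    ones = np.ones((m, m), dtype=np.float32)
    pad = (-x.size) % (m * m)
    blocks = np.pad(x.astype(np.float32), (0, pad)).reshape(-1, m, m)

    total = np.float32(0.0)
    for start in range(0, len(blocks), R):
        C = np.zeros((m, m), dtype=np.float32)
        for A in blocks[start:start + R]:      # chain of up to R MMAs
            C = ones @ A + C                   # matrix multiply-accumulate
        P = C @ ones                           # final MMA of the chain
        total += P[0, 0]                       # 32-bit partial result per chain
    return total

if __name__ == "__main__":
    x = np.random.rand(1000).astype(np.float32)
    print(mma_chain_reduce(x), x.sum())
```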

    Updated: 2020-01-17
  • One-Bit Over-the-Air Aggregation for Communication-Efficient Federated Edge Learning: Design and Convergence Analysis
    arXiv.cs.DC Pub Date : 2020-01-16
    Guangxu Zhu; Yuqing Du; Deniz Gunduz; Kaibin Huang

    Federated edge learning (FEEL) is a popular framework for model training at an edge server using data distributed at edge devices (e.g., smart-phones and sensors) without compromising their privacy. In the FEEL framework, edge devices periodically transmit high-dimensional stochastic gradients to the edge server, where these gradients are aggregated and used to update a global model. When the edge devices share the same communication medium, the multiple access channel from the devices to the edge server induces a communication bottleneck. To overcome this bottleneck, an efficient broadband analog transmission scheme has been recently proposed, featuring the aggregation of analog modulated gradients (or local models) via the waveform-superposition property of the wireless medium. However, the assumed linear analog modulation makes it difficult to deploy this technique in modern wireless systems that exclusively use digital modulation. To address this issue, we propose in this work a novel digital version of broadband over-the-air aggregation, called one-bit broadband digital aggregation (OBDA). The new scheme features one-bit gradient quantization followed by digital modulation at the edge devices and a majority-voting based decoding at the edge server. We develop a comprehensive analysis framework for quantifying the effects of wireless channel hostilities (channel noise, fading, and channel estimation errors) on the convergence rate. The analysis shows that the hostilities slow down the convergence of the learning process by introducing a scaling factor and a bias term into the gradient norm. However, we show that all the negative effects vanish as the number of participating devices grows, but at a different rate for each type of channel hostility.
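
    A minimal NumPy sketch of the two OBDA ingredients named above, one-bit gradient quantization and majority-vote decoding, is given below; the over-the-air superposition, fading and channel-noise modelling analyzed in the paper are deliberately omitted.

```python
import numpy as np

def obda_round(local_gradients, learning_rate=0.1):
    """One noise-free aggregation round: devices send gradient signs,
    the server decodes by majority vote and returns the update direction."""
    signs = np.stack([np.sign(g) for g in local_gradients])  # device side, shape (K, d)
    decoded = np.sign(signs.sum(axis=0))                     # server side: majority vote
    return -learning_rate * decoded                          # signSGD-style global update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=8) for _ in range(5)]  # 5 edge devices, 8-dim gradients
    print(obda_round(grads))
```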

    Updated: 2020-01-17
  • Smart Data based Ensemble for Imbalanced Big Data Classification
    arXiv.cs.DC Pub Date : 2020-01-16
    Diego García-Gil; Johan Holmberg; Salvador García; Ning Xiong; Francisco Herrera

    Big Data scenarios pose a new challenge to traditional data mining algorithms, since they are not prepared to work with such amounts of data. Smart Data refers to data of enough quality to improve the outcome of a data mining algorithm. The inability of existing data mining algorithms to handle Big Datasets prevents the transition from Big to Smart Data. The automation in data acquisition that characterizes Big Data also brings some problems, such as differences in data size per class, which lead classifiers to lean towards the most represented classes. This problem is known as imbalanced data distribution, where one class is underrepresented in the dataset. Ensembles of classifiers are machine learning methods that improve the performance of a single base classifier by combining several of them. Ensembles are not exempt from the imbalanced classification problem; to deal with this issue, the ensemble method has to be designed specifically. In this paper, a data preprocessing ensemble for imbalanced Big Data classification is presented, with a focus on two-class problems. Experiments carried out on 21 Big Datasets show that our ensemble classifier outperforms classic machine learning models, such as Random Forests, combined with a data balancing method.

    Updated: 2020-01-17
  • Duet Benchmarking: Improving Measurement Accuracy in the Cloud
    arXiv.cs.DC Pub Date : 2020-01-16
    Lubomír Bulej; Vojtěch Horký; Petr Tůma; François Farquet; Aleksandar Prokopec

    We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3% to 12.5% (5.03% on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8% to 82.4% (37.4% on average) for the SPEC CPU 2017 workloads.

    Updated: 2020-01-17
  • Run-time Deep Model Multiplexing
    arXiv.cs.DC Pub Date : 2020-01-14
    Amir Erfan Eshratifar; Massoud Pedram

    We propose a framework to design a light-weight neural multiplexer that, given an input and resource budgets, decides upon the appropriate model to be called for the inference. Mobile devices can use this framework to offload the hard inputs to the cloud while inferring the easy ones locally. Furthermore, in large-scale cloud-based intelligent applications, instead of replicating the most accurate model, a range of small and large models can be multiplexed, depending on the input's complexity and resource budgets. Our experimental results demonstrate the effectiveness of our framework, benefiting both mobile users and cloud providers.
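
    As a toy illustration of the multiplexing decision (not the paper's learned multiplexer), a simple confidence-threshold router between a small local model and a large remote model could look as follows; the threshold rule and the placeholder models are assumptions made for the sake of the example.

```python
import numpy as np

class ModelMultiplexer:
    """Route "easy" inputs to a cheap local model and "hard" inputs to a large remote one."""

    def __init__(self, small_model, large_model, threshold=0.8):
        self.small, self.large, self.threshold = small_model, large_model, threshold

    def predict(self, x):
        probs = self.small(x)                    # cheap local inference first
        if probs.max() >= self.threshold:        # confident enough: keep the local result
            return probs.argmax(), "local"
        return self.large(x).argmax(), "cloud"   # hard input: offload to the large model

if __name__ == "__main__":
    small = lambda x: np.array([0.6, 0.4])       # placeholder classifiers
    large = lambda x: np.array([0.1, 0.9])
    mux = ModelMultiplexer(small, large)
    print(mux.predict(np.zeros(4)))              # low confidence, so the input is offloaded
```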

    Updated: 2020-01-17
  • Robust Massively Parallel Sorting
    arXiv.cs.DC Pub Date : 2016-06-28
    Michael Axtmann; Peter Sanders

    We investigate distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements. The main outcome is that four sorting algorithms cover the entire range of possible input sizes. For three algorithms we devise new low overhead mechanisms to make them robust with respect to duplicate keys and skewed input distributions. One of these, designed for medium sized inputs, is a new variant of quicksort with fast high-quality pivot selection. At the same time asymptotic analysis provides performance guarantees and guides the selection and configuration of the algorithms. We validate these hypotheses using extensive experiments on 7 algorithms, 10 input distributions, up to 262144 cores, and varying input sizes over 9 orders of magnitude. For difficult input distributions, our algorithms are the only ones that work at all. For all but the largest input sizes, we are the first to perform experiments on such large machines at all and our algorithms significantly outperform the ones one would conventionally have considered.

    Updated: 2020-01-17
  • Improved Parallel Construction of Wavelet Trees and Rank/Select Structures
    arXiv.cs.DC Pub Date : 2016-10-11
    Julian Shun

    Existing parallel algorithms for wavelet tree construction have a work complexity of $O(n\log\sigma)$. This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based on parallel integer sorting and has either $O(n\log\log n\lceil\log\sigma/\sqrt{\log n\log\log n}\rceil)$ work and polylogarithmic depth, or $O(n\lceil\log\sigma/\sqrt{\log n}\rceil)$ work and sub-linear depth. We also describe another algorithm that has $O(n\lceil\log\sigma/\sqrt{\log n} \rceil)$ work and $O(\sigma+\log n)$ depth. We then show how to use similar ideas to construct variants of wavelet trees (arbitrary-shaped binary trees and multiary trees) as well as wavelet matrices in parallel with lower work complexity than prior algorithms. Finally, we show that the rank and select structures on binary sequences and multiary sequences, which are stored on wavelet tree nodes, can be constructed in parallel with improved work bounds, matching those of the best existing sequential algorithms for constructing rank and select structures.

    Updated: 2020-01-17
  • Decrypting Distributed Ledger Design -- Taxonomy, Classification and Blockchain Community Evaluation
    arXiv.cs.DC Pub Date : 2018-10-30
    Mark C. Ballandies; Marcus M. Dapp; Evangelos Pournaras

    More than 1000 distributed ledger technology (DLT) systems, raising $600 billion in investment in 2016, feature the unprecedented and disruptive potential of blockchain technology. A systematic and data-driven analysis, comparison and rigorous evaluation of the different design choices of distributed ledgers and their implications is a challenge. The rapidly evolving nature of the blockchain landscape hinders reaching a common understanding of the techno-socio-economic design space of distributed ledgers and the cryptoeconomies they support. To fill this gap, this paper makes the following contributions: (i) a conceptual architecture of DLT systems, with which (ii) a taxonomy is designed and (iii) a rigorous classification of DLT systems is made using real-world data and the wisdom of the crowd; (iv) a DLT design guideline is the end result of applying machine learning methodologies to the classification data. Compared to related work, and as defined in earlier taxonomy theory, the proposed taxonomy is highly comprehensive, robust, explanatory and extensible. The findings of this paper can provide new insights and a better understanding of the key design choices evolving the modeling complexity of DLT systems, while identifying opportunities for new research contributions and business innovation.

    Updated: 2020-01-17
  • HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
    arXiv.cs.DC Pub Date : 2019-01-07
    Linghao Song; Jiachen Mao; Youwei Zhuo; Xuehai Qian; Hai Li; Yiran Chen

    With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy-efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to exploit coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. This poses the key research question of seeking the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution, HyPar, to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search for a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communication. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.

    Updated: 2020-01-17
  • Unsupervised Segmentation Algorithms' Implementation in ITK for Tissue Classification via Human Head MRI Scans
    arXiv.cs.DC Pub Date : 2019-02-26
    Shadman Sakib; Md. Abu Bakr Siddique

    Tissue classification is one of the significant tasks in the field of biomedical image analysis. Magnetic Resonance Imaging (MRI) is of great importance in tissue classification, especially brain tissue classification, which supports applications such as surgical planning, monitoring therapy, clinical drug trials, image registration, stereotactic neurosurgery, and radiotherapy. The task of this paper is to implement different unsupervised classification algorithms in ITK and perform tissue classification (white matter, gray matter, cerebrospinal fluid (CSF) and background of the human brain). For this purpose, 5 grayscale head MRI scans are provided. In order to classify brain tissues, three algorithms are used: Otsu thresholding, Bayesian classification, and Bayesian classification with Gaussian smoothing. The obtained classification results are analyzed in the results and discussion section.
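
    Of the three classifiers listed, Otsu thresholding is the simplest to reproduce; a NumPy sketch is shown below (ITK exposes the equivalent functionality as OtsuThresholdImageFilter), with a synthetic two-mode image standing in for a real MRI slice.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Pick the gray level that maximizes the between-class variance."""
    hist, edges = np.histogram(image.ravel(), bins=bins)
    p = hist / hist.sum()                     # gray-level probabilities
    omega = np.cumsum(p)                      # class-0 probability up to each level
    mu = np.cumsum(p * np.arange(bins))       # cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    return edges[np.argmax(np.nan_to_num(sigma_b))]

if __name__ == "__main__":
    img = np.concatenate([np.random.normal(60, 10, 5000),
                          np.random.normal(180, 10, 5000)]).clip(0, 255)
    print(otsu_threshold(img))                # roughly between the two intensity modes
```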

    Updated: 2020-01-17
  • Incentive Analysis of Bitcoin-NG, Revisited
    arXiv.cs.DC Pub Date : 2020-01-14
    Jianyu Niu; Ziyu Wang; Fangyu Gai; Chen Feng

    Bitcoin-NG is among the first blockchain protocols to approach the \emph{near-optimal} throughput by decoupling blockchain operation into two planes: leader election and transaction serialization. Its decoupling idea has inspired a new generation of high-performance blockchain protocols. However, the existing incentive analysis of Bitcoin-NG has several limitations. First, the impact of network capacity is ignored. Second, an integrated incentive analysis that jointly considers both key blocks and microblocks is still missing. In this paper, we aim to address the two limitations. First, we propose a new incentive analysis that takes the network capacity into account, showing that Bitcoin-NG can achieve better incentive compatibility against the microblock mining attack under limited network capacity. Second, we leverage a Markov decision process (MDP) to jointly analyze the incentive of both key blocks and microblocks, showing that Bitcoin-NG is as secure as Bitcoin when the adversary controls less than 35% of the computation power. We hope that our in-depth incentive analysis for Bitcoin-NG can shed some light on the mechanism design and incentive analysis of next-generation blockchain protocols.

    Updated: 2020-01-16
  • Entangled Polynomial Codes for Secure, Private, and Batch Distributed Matrix Multiplication: Breaking the "Cubic" Barrier
    arXiv.cs.DC Pub Date : 2020-01-15
    Qian Yu; A. Salman Avestimehr

    In distributed matrix multiplication, a common scenario is to assign each worker a fraction of the multiplication task, by partitioning the input matrices into smaller submatrices. In particular, by dividing two input matrices into $m$-by-$p$ and $p$-by-$n$ subblocks, a single multiplication task can be viewed as computing linear combinations of $pmn$ submatrix products, which can be assigned to $pmn$ workers. Such block-partitioning based designs have been widely studied under the topics of secure, private, and batch computation, where the state-of-the-art schemes all require computing at least a "cubic" ($pmn$) number of submatrix multiplications. Entangled polynomial codes, first presented for straggler mitigation, provide a powerful method for breaking the cubic barrier. They achieve a subcubic recovery threshold, meaning that the final product can be recovered from \emph{any} subset of multiplication results with a size order-wise smaller than $pmn$. In this work, we show that entangled polynomial codes can be further extended to also cover these three important settings, and provide a unified framework that order-wise reduces the total computational costs compared to the state of the art by achieving subcubic recovery thresholds.

    Updated: 2020-01-16
  • Optimized implementation of the conjugate gradient algorithm for FPGA-based platforms using the Dirac-Wilson operator as an example
    arXiv.cs.DC Pub Date : 2020-01-15
    G. Korcyl; P. Korcyl

    It is now a noticeable trend in High Performance Computing that systems are becoming more and more heterogeneous. Compute nodes with a host CPU are being equipped with accelerators, the latter being GPU or FPGA cards, or both. In many cases, at the heart of scientific applications running on such systems are iterative linear solvers. In this work we present a software package which includes an FPGA implementation of the Conjugate Gradient algorithm, using the particular problem of the Dirac-Wilson operator as encountered in numerical simulations of Quantum Chromodynamics. The software is written in OpenCL and C++ and is optimized for maximal performance. Our framework allows for a simple implementation of other linear operators, while keeping the data transport mechanisms unaltered. Hence, our software can serve as a backbone for many applications which are expected to gain a significant boost factor on FPGA accelerators. As such systems are expected to become more and more widespread, the need for highly performant FPGA implementations of the Conjugate Gradient algorithm and its variants will certainly increase, and the porting investment can be greatly facilitated by the attached code.
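
    For reference, the Conjugate Gradient iteration that such a solver implements is sketched below in NumPy; a small dense symmetric positive-definite matrix stands in for the (suitably preconditioned/squared) Dirac-Wilson operator applied by the FPGA kernel.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=1000):
    """Plain CG for a symmetric positive-definite operator given as a mat-vec."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.normal(size=(50, 50))
    A = M @ M.T + 50 * np.eye(50)              # SPD test matrix
    b = rng.normal(size=50)
    x = conjugate_gradient(lambda v: A @ v, b)
    print(np.linalg.norm(A @ x - b))           # residual close to zero
```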

    Updated: 2020-01-16
  • An n/2 Byzantine node tolerate Blockchain Sharding approach
    arXiv.cs.DC Pub Date : 2020-01-15
    Yibin Xu; Yangyu Huang

    Traditional Blockchain Sharding approaches can only tolerate up to n/3 of nodes being adversarial because they rely on the hypergeometric distribution to make a failure (an adversary that does not control n/3 of nodes globally but can manipulate the consensus of a Shard) hard to happen. The system must maintain a large Shard size (the number of nodes inside a Shard) to sustain a low failure probability, so that only a small number of Shards may exist. In this paper, we present a new approach to Blockchain Sharding that can withstand up to n/2 of nodes being bad. We categorise the nodes into different classes, and every Shard has a fixed number of nodes from different classes. We prove that this design is much more secure than the traditional models (which have only one class) and that the Shard size can be reduced significantly. In this way, many more Shards can exist, and the transaction throughput can be largely increased. The improved Blockchain Sharding approach is a promising foundation for decentralised autonomous organisations and decentralised databases.
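
    The hypergeometric argument referred to above can be made concrete in a few lines of Python; the node counts and the 1/3 intra-Shard threshold below are illustrative assumptions, not the paper's parameters.

```python
from math import comb, ceil

def shard_failure_prob(n, adversarial, shard_size, threshold_frac=1/3):
    """Probability that a uniformly sampled Shard contains enough adversarial
    nodes to manipulate its consensus (hypergeometric tail)."""
    t = ceil(threshold_frac * shard_size)      # smallest failing count of bad nodes
    total = comb(n, shard_size)
    fail = sum(comb(adversarial, k) * comb(n - adversarial, shard_size - k)
               for k in range(t, shard_size + 1))
    return fail / total

if __name__ == "__main__":
    # 2000 nodes, 25% adversarial globally: larger Shards are far safer, which is
    # why traditional designs must keep Shards big and therefore few.
    for s in (30, 60, 120, 240):
        print(s, shard_failure_prob(2000, 500, s))
```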

    Updated: 2020-01-16
  • Lazy object copy as a platform for population-based probabilistic programming
    arXiv.cs.DC Pub Date : 2020-01-09
    Lawrence M. Murray

    This work considers dynamic memory management for population-based probabilistic programs, such as those using particle methods for inference. Such programs exhibit a pattern of allocating, copying, potentially mutating, and deallocating collections of similar objects through successive generations. These objects may assemble data structures such as stacks, queues, lists, ragged arrays, and trees, which may be of random, and possibly unbounded, size. For the simple case of $N$ particles, $T$ generations, $D$ objects, and resampling at each generation, dense representation requires $O(DNT)$ memory, while sparse representation requires only $O(DT+DN\log DN)$ memory, based on existing theoretical results. This work describes an object copy-on-write platform to automate this saving for the programmer. The core idea is formalized using labeled directed multigraphs, where vertices represent objects, edges the pointers between them, and labels the necessary bookkeeping. A specific labeling scheme is proposed for high performance under the motivating pattern. The platform is implemented for the Birch probabilistic programming language, using smart pointers, hash tables, and reference-counting garbage collection. It is tested empirically on a number of realistic probabilistic programs, and shown to significantly reduce memory use and execution time in a manner consistent with theoretical expectations. This enables copy-on-write for the imperative programmer, lazy deep copies for the object-oriented programmer, and in-place write optimizations for the functional programmer.
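
    A minimal sketch of the copy-on-write idea for resampled objects is given below; it shows only reference counting and a deep copy on first mutation, whereas the platform described above additionally handles lazy copies across whole object graphs and in-place write optimizations.

```python
import copy

class Cow:
    """Copy-on-write handle: clones share the payload until a writer mutates it."""

    def __init__(self, obj, _shared=None):
        self._shared = _shared if _shared is not None else {"obj": obj, "refs": 1}

    def clone(self):
        """O(1) logical copy: just bump the reference count."""
        self._shared["refs"] += 1
        return Cow(None, self._shared)

    def read(self):
        return self._shared["obj"]

    def write(self):
        """Return a mutable payload, deep-copying it first if it is shared."""
        if self._shared["refs"] > 1:
            self._shared["refs"] -= 1
            self._shared = {"obj": copy.deepcopy(self._shared["obj"]), "refs": 1}
        return self._shared["obj"]

if __name__ == "__main__":
    parent = Cow({"trajectory": [0.0]})
    child = parent.clone()                         # resampled particle: nothing copied yet
    child.write()["trajectory"].append(1.0)        # first mutation triggers the deep copy
    print(parent.read(), child.read())
```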

    Updated: 2020-01-16
  • Throughput Optimal Routing in Blockchain Based Payment Systems
    arXiv.cs.DC Pub Date : 2019-12-12
    Sushil Mahavir Varma; Siva Theja Maguluri

    Cryptocurrency networks such as Bitcoin have emerged as a distributed alternative to traditional centralized financial transaction networks. However, there are major challenges in scaling up the throughput of such networks. The Lightning network and the Spider network are alternatives that build bidirectional payment channels on top of cryptocurrency networks using smart contracts, to enable fast transactions that bypass the Blockchain. In this paper, we study the problem of routing transactions in such a payment processing network. We first propose a stochastic model to study such a system, as opposed to the fluid model studied in the literature. Each link in such a model is a two-sided queue, and unlike classical queues, such queues are not stable unless there is an external control. We propose a notion of stability for the payment processing network consisting of such two-sided queues using the notion of on-chain rebalancing. We then characterize the capacity region and propose a throughput optimal algorithm that stabilizes the system under any load within the capacity region. The stochastic model enables us to study closed loop policies, which typically have better queuing/delay performance than the open loop policies (or static split rules) studied in the literature. We investigate this through simulations.

    Updated: 2020-01-16
  • A novel countermeasure technique to protect WSN against denial-of-sleep attacks using firefly and Hopfield neural network (HNN) algorithms
    arXiv.cs.DC Pub Date : 2020-01-15
    Reza Fotohi; Somayyeh Firoozi Bari

    Wireless sensor networks (WSNs) contain numerous nodes whose main goals are to monitor and control environments. Sensor nodes are distributed based on network usage. One of the most significant issues in this type of network is the energy consumption of sensor nodes. In fixed-sink networks, nodes that are near the sink act as an interface to transfer the data of other nodes to the sink. This causes the energy of these sensors to be consumed rapidly, and therefore the lifetime of the network declines. Owing to their weaknesses, sensor nodes are susceptible to several threats, one of which is the denial-of-sleep attack (DoSA) threatening WSNs. The DoSA drains the energy of these nodes by preventing them from entering the energy-saving sleep mode. In this paper, a hybrid approach is proposed based on a mobile sink, a LEACH-based firefly algorithm, and a Hopfield neural network (WSN-FAHN). The mobile sink is applied both to improve energy consumption and to increase network lifetime. The firefly algorithm is used to cluster nodes and to authenticate at two levels in order to prevent DoSA. In addition, the Hopfield neural network determines the direction of the sink movement for sending the data of the cluster heads (CHs). The WSN-FAHN technique is assessed through extensive simulations performed in the NS-2 environment. The simulation outcomes demonstrate the superiority of the WSN-FAHN procedure over contemporary schemes in terms of performance metrics such as packet delivery ratio (PDR), average throughput, detection ratio, and network lifetime, while decreasing the average residual energy.

    Updated: 2020-01-16
  • The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits
    arXiv.cs.DC Pub Date : 2020-01-15
    Ronshee Chawla; Abishek Sankararaman; Ayalvadi Ganesh; Sanjay Shakkottai

    We consider a decentralized multi-agent Multi Armed Bandit (MAB) setup consisting of $N$ agents, solving the same MAB instance to minimize individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip style communications. We develop two novel algorithms, where each agent only plays from a subset of all the arms. Agents use the communication medium to recommend only arm-IDs (not samples), and thus update the set of arms from which they play. We establish that, if agents communicate $\Omega(\log(T))$ times through any connected pairwise gossip mechanism, then every agent's regret is a factor of order $N$ smaller compared to the case of no collaborations. Furthermore, we show that the communication constraints only have a second order effect on the regret of our algorithm. We then analyze this second order term of the regret to derive bounds on the regret-communication tradeoffs. Finally, we empirically evaluate our algorithm and conclude that the insights are fundamental and not artifacts of our bounds. We also show a lower bound establishing that the regret scaling obtained by our algorithm cannot be improved even in the absence of any communication constraints. Our results demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.

    Updated: 2020-01-16
  • Model Pruning Enables Efficient Federated Learning on Edge Devices
    arXiv.cs.DC Pub Date : 2019-09-26
    Yuang Jiang; Shiqiang Wang; Bong Jun Ko; Wei-Han Lee; Leandros Tassiulas

    Federated learning is a recent approach for distributed model training without sharing the raw data of clients. It allows model training using the large amount of user data collected by edge and mobile devices, while preserving data privacy. A challenge in federated learning is that the devices usually have much lower computational power and communication bandwidth than machines in data centers. Training large-sized deep neural networks in such a federated setting can consume a large amount of time and resources. To overcome this challenge, we propose a method that integrates model pruning with federated learning in this paper, which includes initial model pruning at the server, further model pruning as part of the federated learning process, followed by the regular federated learning procedure. Our proposed approach can save the computation, communication, and storage costs compared to standard federated learning approaches. Extensive experiments on real edge devices validate the benefit of our proposed method.
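
    A generic magnitude-based pruning primitive of the kind such a pipeline builds on is sketched below; the paper's exact pruning criterion and schedule are not reproduced here.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Return a boolean mask keeping the largest-magnitude weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                  # number of weights to remove
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.abs(weights) > threshold

if __name__ == "__main__":
    w = np.random.randn(4, 4)
    mask = prune_by_magnitude(w, sparsity=0.75)    # keep roughly 25% of the weights
    print(int(mask.sum()), "weights kept of", w.size)
    w_pruned = w * mask                            # pruned layer used in subsequent FL rounds
```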

    Updated: 2020-01-16
  • Decomposing Collectives for Exploiting Multi-lane Communication
    arXiv.cs.DC Pub Date : 2019-10-29
    Jesper Larsson Träff

    Many modern, high-performance systems increase the cumulated node-bandwidth by offering more than a single communication network and/or by having multiple connections to the network. Efficient algorithms and implementations for collective operations as found in, e.g., MPI must be explicitly designed for such multi-lane capabilities. We discuss a model for the design of multi-lane algorithms, and in particular give a recipe for converting any standard, one-ported, (pipelined) communication tree algorithm into a multi-lane algorithm that can effectively use $k$ lanes simultaneously. We first examine the problem from the perspective of \emph{self-consistent performance guidelines}, and give simple, \emph{full-lane, mock-up implementations} of the MPI broadcast, reduction, scan, gather, scatter, allgather, and alltoall operations using only similar operations of the given MPI library itself in such a way that multi-lane capabilities can be exploited. These implementations which rely on a decomposition of the communication domain into communicators for nodes and lanes are full-fledged and readily usable implementations of the MPI collectives. The mock-up implementations, contrary to expectation, in many cases show surprising performance improvements with different MPI libraries on a small 36-node dual-socket, dual-lane Intel OmniPath cluster, indicating severe problems with the native MPI library implementations. Our full-lane implementations are in many cases considerably more than a factor of two faster than the corresponding MPI collectives. We see similar results on the larger Vienna Scientific Cluster, VSC-3. These experiments indicate considerable room for improvement of the MPI collectives in current libraries including more efficient use of multi-lane communication.

    Updated: 2020-01-16
  • Live Exploration with Mobile Robots in a Dynamic Ring, Revisited
    arXiv.cs.DC Pub Date : 2020-01-13
    Subhrangsu Mandal; Anisur Rahaman Molla; William K. Moses Jr

    The graph exploration problem requires a group of mobile robots, initially placed arbitrarily on the nodes of a graph, to work collaboratively to explore the graph such that each node is eventually visited by at least one robot. One important requirement of exploration is the {\em termination} condition, i.e., the robots must know that exploration is completed. The problem of live exploration of a dynamic ring using mobile robots was recently introduced in [Di Luna et al., ICDCS 2016]. In it, they proposed multiple algorithms to solve exploration in fully synchronous and semi-synchronous settings with various guarantees when $2$ robots were involved. They also provided guarantees that with certain assumptions, exploration of the ring using two robots was impossible. An important question left open was how the presence of $3$ robots would affect the results. In this paper, we try to settle this question in a fully synchronous setting and also show how to extend our results to a semi-synchronous setting. In particular, we present algorithms for exploration with explicit termination using $3$ robots in conjunction with either (i) unique IDs of the robots and edge crossing detection capability (i.e., two robots moving in opposite directions through an edge in the same round can detect each other), or (ii) access to randomness. The time complexity of our deterministic algorithm is asymptotically optimal. We also provide complementary impossibility results showing that there does not exist any explicit termination algorithm for $2$ robots. The theoretical analysis and comprehensive simulations of our algorithm show the effectiveness and efficiency of the algorithm in dynamic rings. We also present an algorithm to achieve exploration with partial termination using $3$ robots in the semi-synchronous setting.

    Updated: 2020-01-15
  • Cloudburst: Stateful Functions-as-a-Service
    arXiv.cs.DC Pub Date : 2020-01-14
    Vikram Sreekanti; Chenggang Wu; Xiayue Charles Lin; Jose M. Faleiro; Joseph E. Gonzalez; Joseph M. Hellerstein; Alexey Tumanov

    Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.

    Updated: 2020-01-15
  • Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach
    arXiv.cs.DC Pub Date : 2020-01-14
    Pengchao Han; Shiqiang Wang; Kin K. Leung

    Federated learning (FL) is an emerging technique for training machine learning models using geographically dispersed data collected by local entities. It includes local computation and synchronization steps. To reduce the communication overhead and improve the overall efficiency of FL, gradient sparsification (GS) can be applied, where instead of the full gradient, only a small subset of important elements of the gradient is communicated. Existing work on GS uses a fixed degree of gradient sparsity for i.i.d.-distributed data within a datacenter. In this paper, we consider adaptive degree of sparsity and non-i.i.d. local datasets. We first present a fairness-aware GS method which ensures that different clients provide a similar amount of updates. Then, with the goal of minimizing the overall training time, we propose a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off that is controlled by the degree of gradient sparsity. The online learning algorithm uses an estimated sign of the derivative of the objective function, which gives a regret bound that is asymptotically equal to the case where exact derivative is available. Experiments with real datasets confirm the benefits of our proposed approaches, showing up to $40\%$ improvement in model accuracy for a finite training time.
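
    The underlying gradient-sparsification primitive that the adaptive scheme controls can be sketched as a top-k selection; the fairness-aware weighting and the online choice of the sparsity degree are the paper's contributions and are not shown.

```python
import numpy as np

def sparsify_top_k(gradient, k):
    """Keep only the k largest-magnitude gradient elements (indices and values)."""
    idx = np.argpartition(np.abs(gradient), -k)[-k:]
    return idx, gradient[idx]

def desparsify(idx, values, dim):
    """Server side: rebuild a dense gradient from the transmitted sparse update."""
    g = np.zeros(dim)
    g[idx] = values
    return g

if __name__ == "__main__":
    g = np.random.randn(1000)
    idx, vals = sparsify_top_k(g, k=50)            # transmit ~5% of the gradient
    g_hat = desparsify(idx, vals, g.size)
    print(np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```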

    Updated: 2020-01-15
  • What's Live? Understanding Distributed Consensus
    arXiv.cs.DC Pub Date : 2020-01-14
    Saksham Chand; Yanhong A. Liu

    Distributed consensus algorithms such as Paxos have been studied extensively. They all use the same definition of safety. Liveness is especially important in practice despite well-known theoretical impossibility results. However, many different liveness properties and assumptions have been stated, and there are no systematic comparisons for a better understanding of these properties. This paper studies and compares different liveness properties stated for over 30 well-known consensus algorithms and variants. We build a lattice of liveness properties combining a lattice of the assumptions used and a lattice of the assertions made, and we compare the strengths and weaknesses of algorithms that ensure these properties. Our precise specifications and systematic comparisons led to the discovery of a range of problems in various stated liveness properties, from lacking assumptions or too weak assumptions for which no liveness assertions can hold, to too strong assumptions that make it trivial or uninteresting to achieve the assertions. We also developed TLA+ specifications of these liveness properties. We show that model checking execution steps using TLC can illustrate liveness patterns for single-valued Paxos on up to 4 proposers and 4 acceptors in a few hours, but becomes too expensive for multi-valued Paxos or more processes.

    Updated: 2020-01-15
  • s-Step Orthomin and GMRES implemented on parallel computers
    arXiv.cs.DC Pub Date : 2020-01-14
    A. T. Chronopoulos; S. K. Kim

    The Orthomin (Omin) and the Generalized Minimal Residual method (GMRES) are commonly used iterative methods for approximating the solution of non-symmetric linear systems. The s-step generalizations of these methods enhance their data locality and parallel properties by forming $s$ simultaneous search direction vectors. Good data locality is key to achieving near-peak rates on memory-hierarchical supercomputers. The theoretical derivation of the s-step Arnoldi and Omin methods has been published in the past. Here we derive the s-step GMRES method. We then implement s-step Omin and GMRES on a Cray-2 hierarchical memory supercomputer.

    Updated: 2020-01-15
  • Processing Distribution and Architecture Tradeoff for Large Intelligent Surface Implementation
    arXiv.cs.DC Pub Date : 2020-01-14
    Jesus Rodriguez Sanchez; Ove Edfors; Fredrik Rusek; Liang Liu

    The Large Intelligent Surface (LIS) concept has emerged recently as a new paradigm for wireless communication, remote sensing and positioning. Despite its potential, there are many challenges from an implementation point of view, with the interconnection data-rate and computational complexity being the most relevant. Distributed processing techniques and hierarchical architectures are expected to play a vital role in addressing this. In this paper we perform algorithm-architecture codesign and analyze the hardware requirements and architecture trade-offs for a discrete LIS performing uplink detection. By doing this, we expect to give concrete case studies and guidelines for efficient implementation of LIS systems.

    Updated: 2020-01-15
  • An Informal Method
    arXiv.cs.DC Pub Date : 2016-08-04
    Victor Yodaiken

    A method for specifying the behavior and architecture of discrete state systems such as digital electronic devices and software. The method draws on state machine theory, automata products, and recursive functions and is ordinary working mathematics, not involving formal methods or any foundational or meta-mathematical techniques. Systems in which there are levels of components that may operate in parallel or concurrently are specified in terms of function composition. Illustrative examples include real-time systems, distributed consensus, a Java producer/consumer solution, and digital circuits.

    Updated: 2020-01-15
  • Who started this rumor? Quantifying the natural differential privacy guarantees of gossip protocols
    arXiv.cs.DC Pub Date : 2019-02-19
    Aurélien Bellet; Rachid Guerraoui; Hadrien Hendrikx

    Gossip protocols (also called rumor spreading or epidemic protocols) are widely used to disseminate information in massive peer-to-peer networks. These protocols are often claimed to guarantee privacy because of the uncertainty they introduce on the node that started the dissemination. But is that claim really true? Can one indeed start a gossip and safely hide in the crowd? This paper studies, for the first time, gossip protocols using a rigorous mathematical framework based on differential privacy to determine the extent to which the source of a gossip can be traceable. Considering the case of a complete graph in which a subset of the nodes are curious, we derive matching lower and upper bounds on differential privacy, showing that some gossip protocols achieve strong privacy guarantees. Our results reveal an interesting tension between privacy and dissemination speed: the standard "push" gossip protocol has very weak privacy guarantees, while the optimal guarantees are attained at the cost of a drastic increase in the spreading time. Yet, we show that it is possible to leverage the inherent randomness and partial observability of gossip protocols to achieve both fast dissemination speed and near-optimal privacy. These theoretical results are supported by numerical experiments.
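
    For concreteness, a minimal simulation of the standard "push" protocol mentioned above is sketched here; the curious-node model and the privacy-preserving variants studied in the paper are not modelled.

```python
import random

def push_gossip(n, source):
    """On a complete graph, every informed node pushes the rumor to one uniformly
    random node per round; returns the number of rounds until all are informed."""
    informed = {source}
    rounds = 0
    while len(informed) < n:
        informed |= {random.randrange(n) for _ in informed}
        rounds += 1
    return rounds

if __name__ == "__main__":
    random.seed(0)
    trials = [push_gossip(1000, source=0) for _ in range(20)]
    print(sum(trials) / len(trials), "rounds on average")
```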

    Updated: 2020-01-15
  • Latency, Capacity, and Distributed MST
    arXiv.cs.DC Pub Date : 2019-02-24
    John Augustine; Seth Gilbert; Fabian Kuhn; Peter Robinson; Suman Sourav

    We study the cost of distributed MST construction in the setting where each edge has a latency and a capacity, along with the weight. Edge latencies capture the delay on the links of the communication network, while capacity captures their throughput (in this case, the rate at which messages can be sent). Depending on how the edge latencies relate to the edge weights, we provide several tight bounds on the time and messages required to construct an MST. When edge weights exactly correspond with the latencies, we show that, perhaps interestingly, the bottleneck parameter in determining the running time of an algorithm is the total weight $W$ of the MST (rather than the total number of nodes $n$, as in the standard CONGEST model). That is, we show a tight bound of $\tilde{\Theta}(D + \sqrt{W/c})$ rounds, where $D$ refers to the latency diameter of the graph, $W$ refers to the total weight of the constructed MST and edges have capacity $c$. The proposed algorithm sends $\tilde{O}(m+W)$ messages, where $m$, the total number of edges in the network graph under consideration, is a known lower bound on message complexity for MST construction. We also show that $\Omega(W)$ is a lower bound for fast MST constructions. When the edge latencies and the corresponding edge weights are unrelated, and either can take arbitrary values, we show that (unlike the sub-linear time algorithms in the standard CONGEST model, on small diameter graphs), the best time complexity that can be achieved is $\tilde{\Theta}(D+n/c)$. However, if we restrict all edges to have equal latency $\ell$ and capacity $c$ while having possibly different weights (weights could deviate arbitrarily from $\ell$), we give an algorithm that constructs an MST in $\tilde{O}(D + \sqrt{n\ell/c})$ time. In each case, we provide nearly matching upper and lower bounds.

    Updated: 2020-01-15
  • The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters
    arXiv.cs.DC Pub Date : 2019-10-25
    Daning Cheng; Hanping Zhang; Fen Xia; Shigang Li; Yunquan Zhang

    To gain better performance, many researchers put more computing resources into an application. However, in the AI area, there is still a lack of a successful large-scale machine learning training application: the scalability and performance reproducibility of parallel machine learning training algorithms are limited, and while a few studies examine why these metrics are limited, very few research efforts explain the reasons in essence. In this paper, we propose that the sample difference in a dataset plays a prominent role in the scalability of parallel machine learning algorithms. Dataset characteristics can measure sample difference; these characteristics include the variance of the samples in a dataset, sparsity, sample diversity, and similarity in the sampling sequence. To test our proposal, we choose four kinds of parallel machine learning training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model-average SGD algorithm (mini-batch SGD algorithm), (3) a decentralized optimization algorithm, and (4) dual coordinate optimization (DADM algorithm). These algorithms cover different types of machine learning optimization algorithms. We present an analysis of their convergence proofs and design experiments. Our results show that the characteristics of datasets decide the scalability of the machine learning algorithm. What is more, there is an upper bound on the parallel scalability of machine learning algorithms.

    Updated: 2020-01-15
  • Verifiable and Auditable Digital Interchange Framework
    arXiv.cs.DC Pub Date : 2020-01-11
    Prabal Banarjee; Dushyant Behl; Palanivel Kodeswaran; Chaitanya Kumar; Sushmita Ruj; Sayandeep Sen

    We address the problem of fairness and transparency in online marketplaces selling digital content, where not all parties are actively participating in the trade. We present the design, implementation and evaluation of VADER, a highly scalable solution for multi-party fair digital exchange that combines the trusted execution of blockchains with intelligent protocol design and incentivization schemes. We prototype VADER on Hyperledger Fabric and extensively evaluate our system on a realistic testbed spanning five public cloud datacenters, spread across four continents. Our results demonstrate that VADER adds only a minimal overhead of 16% in the median case compared to a baseline solution, while significantly outperforming a naive blockchain-based solution that adds an overhead of 764%.

    Updated: 2020-01-14
  • Permissioned Blockchain Revisited: A Byzantine Game-Theoretical Perspective
    arXiv.cs.DC Pub Date : 2020-01-12
    Dongfang Zhao

    Despite the popularity and practical applicability of blockchains, there is very limited work on the theoretical foundation of blockchains: The lack of rigorous theory and analysis behind the curtain of blockchains has severely staggered its broader applications. This paper attempts to lay out a theoretical foundation for a specific type of blockchains---the ones requiring basic authenticity from the participants, also called \textit{permissioned blockchain}. We formulate permissioned blockchain systems and operations into a game-theoretical problem by incorporating constraints implied by the wisdom from distributed computing and Byzantine systems. We show that in a noncooperative blockchain game (NBG), a Nash equilibrium can be efficiently found in a closed-form even though the game involves more than two players. Somewhat surprisingly, the simulation results of the Nash equilibrium implies that the game can reach a stable status regardless of the number of Byzantine nodes and trustworthy players. We then study a harder problem where players are allowed to form coalitions: the coalitional blockchain game (CBG). We show that although the Shapley value for a CBG can be expressed in a more succinct form, its core is empty.

    Updated: 2020-01-14
  • Private and Communication-Efficient Edge Learning: A Sparse Differential Gaussian-Masking Distributed SGD Approach
    arXiv.cs.DC Pub Date : 2020-01-12
    Xin Zhang; Minghong Fang; Jia Liu; Zhengyuan Zhu

    With the rise of machine learning (ML) and the proliferation of smart mobile devices, recent years have witnessed a surge of interest in performing ML in wireless edge networks. In this paper, we consider the problem of jointly improving data privacy and communication efficiency of distributed edge learning, both of which are critical performance metrics in wireless edge network computing. Toward this end, we propose a new decentralized stochastic gradient method with sparse differential Gaussian-masked stochastic gradients (SDM-DSGD) for non-convex distributed edge learning. Our main contributions are three-fold: i) We theoretically establish the privacy and communication efficiency performance guarantees of our SDM-DSGD method, which outperforms all existing works; ii) We show that SDM-DSGD improves the fundamental training-privacy trade-off by {\em two orders of magnitude} compared with the state-of-the-art; iii) We reveal theoretical insights and offer practical design guidelines for the interactions between privacy preservation and communication efficiency, two conflicting performance goals. We conduct extensive experiments with a variety of learning models on MNIST and CIFAR-10 datasets to verify our theoretical findings. Collectively, our results contribute to the theory and algorithm design for distributed edge learning.

    Updated: 2020-01-14
  • Hierarchical Multi-Agent Optimization for Resource Allocation in Cloud Computing
    arXiv.cs.DC Pub Date : 2020-01-12
    Xiangqiang Gao, Senior Member, IEEE; Rongke Liu, Senior Member, IEEE; Aryan Kaushik

    In cloud computing, an important concern is to allocate the available resources of service nodes to the requested tasks on demand and to optimize the objective function, i.e., maximizing resource utilization, payoffs, and available bandwidth. This paper proposes a hierarchical multi-agent optimization (HMAO) algorithm in order to maximize resource utilization and minimize the bandwidth cost for cloud computing. The proposed HMAO algorithm is a combination of the genetic algorithm (GA) and the multi-agent optimization (MAO) algorithm. To maximize resource utilization, an improved GA is implemented to find a set of service nodes that are used to deploy the requested tasks. A decentralized MAO algorithm is presented to minimize the bandwidth cost. We study the effect of key parameters of the HMAO algorithm by the Taguchi method and evaluate the performance results. When compared with the genetic algorithm (GA) and the fast elitist non-dominated sorting genetic algorithm (NSGA-II), the simulation results demonstrate that the HMAO algorithm is more effective than the existing solutions in solving the problem of resource allocation with a large number of requested tasks. Furthermore, we provide a performance comparison of the HMAO algorithm with the first-fit greedy approach in on-line resource allocation.

    Updated: 2020-01-14
  • Competitive Broadcast against Adaptive Adversary in Multi-channel Radio Networks
    arXiv.cs.DC Pub Date : 2020-01-12
    Haimin Chen; Chaodong Zheng

    Wireless networks are vulnerable to adversarial jamming due to the open nature of the communication medium. To thwart such malicious behavior, researchers have proposed resource competitive analysis. In this framework, sending, listening, or jamming on one channel for one time slot costs one unit of energy. The adversary can employ arbitrary jamming strategy to disrupt communication, but has a limited energy budget $T$. The honest nodes, on the other hand, aim to accomplish the distributed computing task in concern with a spending of $o(T)$. In this paper, we focus on solving the broadcast problem, in which a single source node wants to disseminate a message to all other $n-1$ nodes. Previous work have shown, in single-hop single-channel scenario, each node can receive the message in $\tilde{O}(T+n)$ time, while spending only $\tilde{O}(\sqrt{T/n}+1)$ energy. If $C$ channels are available, then the time complexity can be reduced by a factor of $C$, without increasing nodes' cost. However, these multi-channel algorithms only work for certain values of $n$ and $C$, and can only tolerate an oblivious adversary. We develop two new resource competitive algorithms for the broadcast problem. They work for arbitrary $n,C$ values, require minimal prior knowledge, and can tolerate a powerful adaptive adversary. In both algorithms, each node's runtime is dominated by the term $O(T/C)$, and each node's energy cost is dominated by the term $\tilde{O}(\sqrt{T/n})$. The time complexity is asymptotically optimal, while the energy complexity is near optimal in some cases. We use "epidemic broadcast" to achieve time efficiency and resource competitiveness, and employ the coupling technique in the analysis to handle the adaptivity of the adversary. These tools might be of independent interest, and can potentially be applied in the design and analysis of other resource competitive algorithms.

    Updated: 2020-01-14
  • Heterogeneous Computation Assignments in Coded Elastic Computing
    arXiv.cs.DC Pub Date : 2020-01-12
    Nicholas Woolsey; Rong-Rong Chen; Mingyue Ji

    We study the optimal design of a heterogeneous coded elastic computing (CEC) network where machines have varying relative computation speeds. CEC, introduced by Yang et al., is a framework which mitigates the impact of elastic events, where machines join and leave the network. A set of data is distributed among storage-constrained machines using a Maximum Distance Separable (MDS) code such that any subset of machines of a specific size can perform the desired computations. This design eliminates the need to re-distribute the data after each elastic event. In this work, we develop a process for an arbitrary heterogeneous computing network to minimize the overall computation time by defining an optimal computation load, i.e., the number of computations assigned to each machine. We then present an algorithm to define a specific computation assignment among the machines that makes use of the MDS code and meets the optimal computation load.

    Updated: 2020-01-14
  • Reliable and interoperable computational molecular engineering: 2. Semantic interoperability based on the European Materials and Modelling Ontology
    arXiv.cs.DC Pub Date : 2020-01-13
    Martin Thomas Horsch; Silvia Chiacchiera; Youness Bami; Georg J. Schmitz; Gabriele Mogni; Gerhard Goldbeck; Emanuele Ghedini

    The European Materials and Modelling Ontology (EMMO) is a top-level ontology designed by the European Materials Modelling Council to facilitate semantic interoperability between platforms, models, and tools in computational molecular engineering, integrated computational materials engineering, and related applications of materials modelling and characterization. Additionally, domain ontologies exist based on data technology developments from specific platforms. The present work discusses the ongoing work on establishing a European Virtual Marketplace Framework, into which diverse platforms can be integrated. It addresses common challenges that arise when marketplace-level domain ontologies are combined with a top-level ontology like the EMMO by ontology alignment.

    Updated: 2020-01-14
  • Towards High Performance Java-based Deep Learning Frameworks
    arXiv.cs.DC Pub Date : 2020-01-13
    Athanasios Stratikopoulos; Juan Fumero; Zoran Sevarac; Christos Kotselidis

    The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has set the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. Prior research has focused on employing hardware accelerators as a means to overcome this inefficiency. This trend has driven software development to target heterogeneous execution, and several modern computing systems have incorporated a mixture of diverse computing components, including GPUs and FPGAs. However, the specialization of the applications' code for heterogeneous execution is not a trivial task, as it requires developers to have hardware expertise in order to obtain high performance. The vast majority of the existing deep learning frameworks that support heterogeneous acceleration rely on the implementation of wrapper calls from a high-level programming language to a low-level accelerator backend, such as OpenCL, CUDA or HLS. In this paper we have employed TornadoVM, a state-of-the-art heterogeneous programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework. Our initial results demonstrate up to 8x performance speedup when executing the back propagation process of the network's training on AMD GPUs against the sequential execution of the original Deep Netts framework.

    Updated: 2020-01-14
  • Resource Sharing in the Edge: A Distributed Bargaining-Theoretic Approach
    arXiv.cs.DC Pub Date : 2020-01-13
    Faheem Zafari; Prithwish Basu; Kin K. Leung; Jian Li; Ananthram Swami; Don Towsley

    The growing demand for edge computing resources, particularly due to increasing popularity of Internet of Things (IoT), and distributed machine/deep learning applications poses a significant challenge. On the one hand, certain edge service providers (ESPs) may not have sufficient resources to satisfy their applications according to the associated service-level agreements. On the other hand, some ESPs may have additional unused resources. In this paper, we propose a resource-sharing framework that allows different ESPs to optimally utilize their resources and improve the satisfaction level of applications subject to constraints such as communication cost for sharing resources across ESPs. Our framework considers that different ESPs have their own objectives for utilizing their resources, thus resulting in a multi-objective optimization problem. We present an $N$-person \emph{Nash Bargaining Solution} (NBS) for resource allocation and sharing among ESPs with \emph{Pareto} optimality guarantee. Furthermore, we propose a \emph{distributed}, primal-dual algorithm to obtain the NBS by proving that the strong-duality property holds for the resultant resource sharing optimization problem. Using synthetic and real-world data traces, we show numerically that the proposed NBS based framework not only enhances the ability to satisfy applications' resource demands, but also improves utilities of different ESPs.

    Updated: 2020-01-14
  • Notes on Theory of Distributed Systems
    arXiv.cs.DC Pub Date : 2020-01-10
    James Aspnes

    Notes for the Yale course CPSC 465/565 Theory of Distributed Systems.

    Updated: 2020-01-14
  • Fast-Fourier-Forecasting Resource Utilisation in Distributed Systems
    arXiv.cs.DC Pub Date : 2020-01-13
    Paul J. Pritz; Daniel Perez; Kin K. Leung

    Distributed computing systems often consist of hundreds of nodes, executing tasks with different resource requirements. Efficient resource provisioning and task scheduling in such systems are non-trivial and require close monitoring and accurate forecasting of the state of the system, specifically resource utilisation at its constituent machines. Two challenges present themselves towards these objectives. First, collecting monitoring data entails substantial communication overhead. This overhead can be prohibitively high, especially in networks where bandwidth is limited. Second, forecasting models to predict resource utilisation should be accurate and need to exhibit high inference speed. Mission critical scheduling and resource allocation algorithms use these predictions and rely on their immediate availability. To address the first challenge, we present a communication-efficient data collection mechanism. Resource utilisation data is collected at the individual machines in the system and transmitted to a central controller in batches. Each batch is processed by an adaptive data-reduction algorithm based on Fourier transforms and truncation in the frequency domain. We show that the proposed mechanism leads to a significant reduction in communication overhead while incurring only minimal error and adhering to accuracy guarantees. To address the second challenge, we propose a deep learning architecture using complex Gated Recurrent Units to forecast resource utilisation. This architecture is directly integrated with the above data collection mechanism to improve inference speed of our forecasting model. Using two real-world datasets, we demonstrate the effectiveness of our approach, both in terms of forecasting accuracy and inference speed. Our approach resolves challenges encountered in resource provisioning frameworks and can be applied to other forecasting problems.
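
    The frequency-domain truncation step can be illustrated with a small sketch, assuming a fixed fraction of retained coefficients rather than the paper's adaptive, accuracy-guaranteed truncation: a batch of utilisation samples is transformed with an FFT, only the largest-magnitude bins are transmitted, and the controller reconstructs the series from them. The trace and the keep fraction below are illustrative.

    # Sketch of Fourier-based batch compression for monitoring data.
    import numpy as np

    def compress(batch, keep_fraction=0.1):
        """Return (indices, coefficients, length) of the retained spectrum."""
        spectrum = np.fft.rfft(batch)
        k = max(1, int(keep_fraction * len(spectrum)))
        idx = np.argsort(np.abs(spectrum))[-k:]        # largest-magnitude bins
        return idx, spectrum[idx], len(batch)

    def decompress(idx, coeffs, n):
        spectrum = np.zeros(n // 2 + 1, dtype=complex)
        spectrum[idx] = coeffs
        return np.fft.irfft(spectrum, n)

    # toy CPU-utilisation trace: periodic load plus noise (synthetic)
    t = np.arange(512)
    cpu = 0.5 + 0.3 * np.sin(2 * np.pi * t / 128) + 0.05 * np.random.randn(512)
    idx, coeffs, n = compress(cpu, keep_fraction=0.05)
    recon = decompress(idx, coeffs, n)
    print("mean abs error:", np.mean(np.abs(cpu - recon)))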

    Updated: 2020-01-14
  • Domination in Signed Petri Net
    arXiv.cs.DC Pub Date : 2020-01-13
    Payal; Sangita Kansal

    In this paper, domination in Signed Petri nets (SPNs) is introduced. We identify some of the Petri net structures in which a dominating set can exist. Applications to the producer-consumer problem, the search for food by bees, and finding similarity between research papers are given to illustrate areas where the proposed theory can be used.

    Updated: 2020-01-14
  • Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems
    arXiv.cs.DC Pub Date : 2018-03-01
    Georgios C. Chasparis; Vladimir Janjic; Michael Rossbory

    Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds resource-awareness to any parallel application. In particular, we introduce a learning-based framework for dynamic placement of parallel threads to Non-Uniform Memory Access (NUMA) architectures. Decisions are taken independently by each thread in a decentralized fashion that significantly reduces computational complexity. The advantage of the proposed learning scheme is the ability to easily incorporate any multi-objective criterion and easily adapt to performance variations during runtime. Under the multi-objective criterion of maximizing total completed instructions per second (i.e., both computational and memory-access instructions), we provide analytical guarantees with respect to the expected performance of the parallel application. We also compare the performance of the proposed scheme with the Linux operating system scheduler in an extensive set of applications, including both computationally and memory intensive ones. We have observed that performance improvement could be significant especially under limited availability of resources and under irregular memory-access patterns.
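
    As a rough illustration of decentralized, learning-based pinning (not the authors' specific update rule or guarantees), the sketch below has each thread keep its own running estimate of the performance observed on each NUMA node and re-pin itself epsilon-greedily; the node count, exploration rate, and measurement function are assumptions.

    # Per-thread epsilon-greedy node selection sketch (illustrative only).
    import random

    NUM_NODES = 4          # NUMA nodes (assumed)
    EPSILON = 0.1          # exploration probability
    ALPHA = 0.2            # smoothing factor for the running estimate

    class ThreadAgent:
        def __init__(self):
            self.estimates = [0.0] * NUM_NODES

        def choose_node(self):
            if random.random() < EPSILON:
                return random.randrange(NUM_NODES)                            # explore
            return max(range(NUM_NODES), key=self.estimates.__getitem__)      # exploit

        def update(self, node, measured_ips):
            # exponential moving average of observed instructions per second
            self.estimates[node] += ALPHA * (measured_ips - self.estimates[node])

    def measure_ips(node):
        # placeholder for reading hardware counters after one epoch pinned to `node`;
        # here node 2 is artificially the best.
        return random.gauss(1.0 + (0.5 if node == 2 else 0.0), 0.1)

    agent = ThreadAgent()
    for epoch in range(200):
        node = agent.choose_node()
        agent.update(node, measure_ips(node))
    print("learned preference:", max(range(NUM_NODES), key=agent.estimates.__getitem__))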

    Updated: 2020-01-14
  • The Fog Development Kit: A Development Platform for SDN-based Edge-Fog Systems
    arXiv.cs.DC Pub Date : 2019-07-06
    Colton Powell; Christopher Desiniotis; Behnam Dezfouli

    With the rise of the Internet of Things (IoT), fog computing has emerged to help traditional cloud computing in meeting scalability demands. Fog computing makes it possible to fulfill real-time requirements of applications by bringing more processing, storage, and control power geographically closer to end-devices. However, since fog computing is a relatively new field, there is no standard platform for research and development in a realistic environment, and this dramatically inhibits innovation and development of fog-based applications. In response to these challenges, we propose the Fog Development Kit (FDK). By providing high-level interfaces for allocating computing and networking resources, the FDK abstracts the complexities of fog computing from developers and enables the rapid development of fog systems. In addition to supporting application development on a physical deployment, the FDK supports the use of emulation tools (e.g., GNS3 and Mininet) to create realistic environments, allowing fog application prototypes to be built with zero additional costs and enabling seamless portability to a physical infrastructure. Using a physical testbed and various kinds of applications running on it, we verify the operation and study the performance of the FDK. Specifically, we demonstrate that resource allocations are appropriately enforced and guaranteed, even amidst extreme network congestion. We also present simulation-based scalability analysis of the FDK versus the number of switches, the number of end-devices, and the number of fog-devices.

    Updated: 2020-01-14
  • Succinct Population Protocols for Presburger Arithmetic
    arXiv.cs.DC Pub Date : 2019-10-10
    Michael Blondin; Javier Esparza; Blaise Genest; Martin Helfrich; Stefan Jaax

    Angluin et al. proved that population protocols compute exactly the predicates definable in Presburger arithmetic (PA), the first-order theory of addition. As part of this result, they presented a procedure that translates any formula $\varphi$ of quantifier-free PA with remainder predicates (which has the same expressive power as full PA) into a population protocol with $2^{O(\text{poly}(|\varphi|))}$ states that computes $\varphi$. More precisely, the number of states of the protocol is exponential in both the bit length of the largest coefficient in the formula, and the number of nodes of its syntax tree. In this paper, we prove that every formula $\varphi$ of quantifier-free PA with remainder predicates is computable by a leaderless population protocol with $O(\text{poly}(|\varphi|))$ states. Our proof is based on several new constructions, which may be of independent interest. Given a formula $\varphi$ of quantifier-free PA with remainder predicates, a first construction produces a succinct protocol (with $O(|\varphi|^3)$ leaders) that computes $\varphi$; this completes the work initiated in [STACS'18], where we constructed such protocols for a fragment of PA. For large enough inputs, we can get rid of these leaders. If the input is not large enough, then it is small, and we design another construction producing a succinct protocol with one leader that computes $\varphi$. Our last construction gets rid of this leader for small inputs.
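
    For readers unfamiliar with the model, the sketch below simulates a very simple population protocol for the Presburger-definable threshold predicate "at least K agents hold input 1" under a random scheduler. It only illustrates how pairwise interactions let the population agree on a predicate; it is not the paper's succinct construction, and K, the state encoding, and the step budget are chosen for illustration.

    # Threshold population protocol: each state is (token count capped at K, flag).
    import random

    K = 5  # threshold of the predicate x >= K (assumed)

    def transition(p, q):
        total = min(p[0] + q[0], K)
        flag = p[1] or q[1] or (total == K)          # threshold ever reached?
        return (total, flag), (p[0] + q[0] - total, flag)

    def simulate(inputs, steps=200_000):
        # each 0/1 input becomes one agent holding that many tokens
        agents = [(x, x >= K) for x in inputs]
        for _ in range(steps):
            i, j = random.sample(range(len(agents)), 2)
            agents[i], agents[j] = transition(agents[i], agents[j])
        return all(flag for _, flag in agents)       # consensus output

    print(simulate([1] * 7 + [0] * 13))   # 7 ones >= 5 -> expected True
    print(simulate([1] * 3 + [0] * 17))   # 3 ones  < 5 -> expected False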

    Updated: 2020-01-14
  • Similarity Driven Approximation for Text Analytics
    arXiv.cs.DC Pub Date : 2019-10-16
    Guangyan Hu; Yongfeng Zhang; Sandro Rigo; Thu D. Nguyen

    Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of highly compressed vectors representing words and subcollections of documents. Then, at query processing time, EmApprox uses the index to guide sampling of the data set, with the probability of selecting each subcollection of documents being proportional to its {\em similarity} to the query as computed using the vector representations. We have implemented a prototype of EmApprox as an extension of the Apache Spark system, and used it to approximate three types of queries: aggregation, information retrieval, and recommendation. Experimental results show that EmApprox's similarity-guided sampling achieves much better accuracy than random sampling. Further, EmApprox can achieve significant speedups if users can tolerate small amounts of inaccuracies. For example, when sampling at 10\%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22\% for 90\% of the queries.
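
    The similarity-guided sampling step can be sketched as follows, with synthetic embeddings and counts standing in for the learned NLP model and the real corpus: subcollections are drawn with probability proportional to their cosine similarity to the query vector, and a Hansen-Hurwitz-style estimator rescales the sampled counts into an estimate of the full total. This is an illustration of the idea, not the EmApprox implementation.

    # Similarity-proportional sampling plus inverse-probability estimation.
    import numpy as np

    rng = np.random.default_rng(0)
    num_sub, dim = 100, 16
    sub_vecs = rng.normal(size=(num_sub, dim))            # subcollection embeddings (synthetic)
    true_counts = rng.poisson(lam=50, size=num_sub)       # phrase counts per subcollection (synthetic)
    query_vec = sub_vecs[:10].mean(axis=0)                # a query "similar" to the first 10

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    sims = np.array([max(cosine(v, query_vec), 1e-3) for v in sub_vecs])
    probs = sims / sims.sum()                             # selection distribution

    sample_size = 10                                      # ~10% sampling rate
    chosen = rng.choice(num_sub, size=sample_size, replace=True, p=probs)

    # Hansen-Hurwitz estimator: average of sampled counts scaled by 1/p_i
    estimate = np.sum(true_counts[chosen] / (sample_size * probs[chosen]))
    print("true total:", true_counts.sum(), "estimate:", round(float(estimate), 1))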

    Updated: 2020-01-14
  • Self-stabilizing Uniform Reliable Broadcast
    arXiv.cs.DC Pub Date : 2020-01-09
    Oskar Lundström; Michel Raynal; Elad M. Schiller

    We study a well-known communication abstraction called Uniform Reliable Broadcast (URB). URB is central in the design and implementation of fault-tolerant distributed systems, as many non-trivial fault-tolerant distributed applications require communication with provable guarantees on message deliveries. Our study focuses on fault-tolerant implementations for time-free message-passing systems that are prone to node-failures. Moreover, we aim at the design of an even more robust communication abstraction. We do so through the lenses of self-stabilization---a very strong notion of fault-tolerance. In addition to node and communication failures, self-stabilizing algorithms can recover after the occurrence of arbitrary transient faults; these faults represent any violation of the assumptions according to which the system was designed to operate (as long as the algorithm code stays intact). This work proposes the first self-stabilizing URB solution for time-free message-passing systems that are prone to node-failures. The proposed algorithm has an O(bufferUnitSize) stabilization time (in terms of asynchronous cycles) from arbitrary transient faults, where bufferUnitSize is a predefined constant that can be set according to the available memory. Moreover, the communication costs of our algorithm are similar to the ones of the non-self-stabilizing state-of-the-art. The main differences are that our proposal considers repeated gossiping of O(1) bits messages and deals with bounded space (which is a prerequisite for self-stabilization). Specifically, each node needs to store up to bufferUnitSize n records and each record is of size O(v + n log n) bits, where n is the number of nodes in the system and v is the number of bits needed to encode a single URB instance.

    Updated: 2020-01-13
  • RMWPaxos: Fault-Tolerant In-Place Consensus Sequences
    arXiv.cs.DC Pub Date : 2020-01-10
    Jan Skrzypczak; Florian Schintke; Thorsten Schütt

    Building consensus sequences based on distributed, fault-tolerant consensus, as used for replicated state machines, typically requires a separate distributed state for every new consensus instance. Allocating and maintaining this state causes significant overhead. In particular, freeing the distributed, outdated states in a fault-tolerant way is not trivial and adds further complexity and overhead to the system. In this paper, we propose an extension to the single-decree Paxos protocol that can learn a sequence of consensus decisions 'in-place', i.e., with a single set of distributed states. Our protocol does not require dynamic log structures and hence has no need for distributed log pruning, snapshotting, compaction, or dynamic resource allocation. The protocol builds a fault-tolerant atomic register that supports arbitrary read-modify-write operations. We use the concept of consistent quorums to detect whether the previous consensus still needs to be consolidated or is already finished so that the next consensus value can be safely proposed. Reading a consolidated consensus is done without state modification and is thereby free of concurrency control and demand for serialisation. A proposer that is not interrupted reaches agreement on consecutive consensuses within a single message round-trip per consensus decision by preparing the acceptors eagerly with the preceding request.

    Updated: 2020-01-13
  • Demo: Light-Weight Programming Language for Blockchain
    arXiv.cs.DC Pub Date : 2020-01-10
    Junhui Kim; Joongheon Kim

    This demo abstract introduces a new light-weight programming language, koa, which is suitable for blockchain system design and implementation. In this abstract, the basic features of koa are introduced, including the working system (with playground), architecture, and virtual machine operations. Run-time execution of software implemented in koa will be presented during the session.

    Updated: 2020-01-13
  • Decentralized Optimization of Vehicle Route Planning -- A Cross-City Comparative Study
    arXiv.cs.DC Pub Date : 2020-01-10
    Brionna Davis; Grace Jennings; Taylor Pothast; Ilias Gerostathopoulos; Evangelos Pournaras; Raphael E. Stern

    New mobility concepts are at the forefront of research and innovation in smart cities. The introduction of connected and autonomous vehicles enables new possibilities in vehicle routing. Specifically, knowing the origin and destination of each agent in the network can allow for real-time routing of the vehicles to optimize network performance. However, this relies on individual vehicles being "altruistic", i.e., willing to accept an alternative, non-preferred route in order to achieve a network-level performance goal. In this work, we conduct a study to compare different levels of agent altruism and the resulting effect on the network-level traffic performance. Specifically, this study compares the effects of different underlying urban structures on the overall network performance, and investigates which characteristics of the network make it possible to realize routing improvements using a decentralized optimization router. The main finding is that, with increased vehicle altruism, it is possible to balance traffic flow among the links of the network. We show evidence that the decentralized optimization router is more effective in networks under high load. We also study the influence of city characteristics; in particular, networks with a higher number of nodes (intersections) or edges (roads) per unit area allow for more possible alternate routes, and thus a higher potential to improve network performance.

    Updated: 2020-01-13
  • Real-Time RFI Mitigation for the Apertif Radio Transient System
    arXiv.cs.DC Pub Date : 2020-01-10
    Alessio Sclocco; Dany Vohl; Rob V. van Nieuwpoort

    Current and upcoming radio telescopes are being designed with increasing sensitivity to detect new and mysterious radio sources of astrophysical origin. While this increased sensitivity improves the likelihood of discoveries, it also makes these instruments more susceptible to the deleterious effects of Radio Frequency Interference (RFI). The challenge posed by RFI is exacerbated by the high data-rates achieved by modern radio telescopes, which require real-time processing to keep up with the data. Furthermore, the high data-rates do not allow for permanent storage of observations at high resolution. Offline RFI mitigation is therefore not possible anymore. The real-time requirement makes RFI mitigation even more challenging because, on one side, the techniques used for mitigation need to be fast and simple, and on the other side they also need to be robust enough to cope with just a partial view of the data. The Apertif Radio Transient System (ARTS) is the real-time, time-domain, transient detection instrument of the Westerbork Synthesis Radio Telescope (WSRT), processing 73 Gb of data per second. Even with a deep learning classifier, the ARTS pipeline requires state-of-the-art real-time RFI mitigation to reduce the number of false-positive detections. Our solution to this challenge is RFIm, a high-performance, open-source, tuned, and extensible RFI mitigation library. The goal of this library is to provide users with RFI mitigation routines that are designed to run in real-time on many-core accelerators, such as Graphics Processing Units, and that can be highly-tuned to achieve code and performance portability to different hardware platforms and scientific use-cases. Results on the ARTS show that we can achieve real-time RFI mitigation, with a minimal impact on the total execution time of the search pipeline, and considerably reduce the number of false-positives.
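
    A toy illustration of threshold-based RFI clipping on a channel-by-time block of samples is given below; RFIm itself provides tuned, portable many-core kernels and several mitigation routines, so this CPU sketch only conveys the general flagging idea. The threshold, data shapes, and injected burst are assumptions.

    # Robust sigma-clipping of a (channels x time) block: flag and replace outliers.
    import numpy as np

    def sigma_clip(block, threshold=3.0):
        """Flag samples deviating more than `threshold` robust sigmas from the
        per-channel median; return (cleaned block, boolean flag mask)."""
        med = np.median(block, axis=1, keepdims=True)
        mad = np.median(np.abs(block - med), axis=1, keepdims=True)
        sigma = 1.4826 * mad + 1e-12            # MAD-based robust std estimate
        flags = np.abs(block - med) > threshold * sigma
        cleaned = np.where(flags, med, block)   # replace flagged samples
        return cleaned, flags

    rng = np.random.default_rng(1)
    data = rng.normal(1.0, 0.1, size=(64, 1024))   # 64 channels, 1024 time samples (synthetic)
    data[10, 200:210] += 5.0                        # inject a narrow-band RFI burst
    cleaned, flags = sigma_clip(data)
    print("flagged samples:", int(flags.sum()))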

    Updated: 2020-01-13
  • An Efficient Universal Construction for Large Objects
    arXiv.cs.DC Pub Date : 2020-01-10
    Panagiota Fatourou; Nikolaos D. Kallimanis; Eleni Kanellou

    This paper presents L-UC, a universal construction that efficiently implements dynamic objects of large state in a wait-free manner. The step complexity of L-UC is O(n+kw), where n is the number of processes, k is the interval contention (i.e., the maximum number of active processes during the execution interval of an operation), and w is the worst-case time complexity to perform an operation on the sequential implementation of the simulated object. L-UC efficiently implements objects whose size can change dynamically. It improves upon previous universal constructions either by efficiently handling objects whose state is large and can change dynamically, or by achieving better step complexity.

    Updated: 2020-01-13
  • AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs
    arXiv.cs.DC Pub Date : 2020-01-06
    Pengfei Xu; Xiaofan Zhang; Cong Hao; Yang Zhao; Yongan Zhang; Yue Wang; Chaojian Li; Zetong Guan; Deming Chen; Yingyan Lin

    Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for DNN chips. However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the large design space due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs to meet the application specifications. Therefore, DNN chips take a long time to design and require cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip - a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementation given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.), optimize chip design via the Chip Predictor, and then generate optimized synthesizable RTL to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured ones by < 10% when validated using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, accelerators generated by our AutoDNNchip can achieve better (up to 3.86X improvement) performance than that of expert-crafted state-of-the-art accelerators.

    Updated: 2020-01-13
  • OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems
    arXiv.cs.DC Pub Date : 2020-01-08
    Chenhao Xie; Xin Fu; Mingsong Chen; Shuaiwen Leon Song

    With their strong computation capability, NUMA-based multi-GPU systems are a promising candidate to provide sustainable and scalable performance for Virtual Reality. However, the entire multi-GPU system is viewed as a single GPU, which ignores the data locality in VR rendering during workload distribution, leading to tremendous remote memory accesses among GPU modules (GPMs). By conducting comprehensive characterizations on different kinds of parallel rendering frameworks, we observe that distributing the rendering object along with its required data per GPM can reduce the inter-GPM memory accesses. However, this object-level rendering still faces two major challenges in NUMA-based multi-GPU systems: (1) the large data locality between the left and right views of the same object and the data sharing among different objects, and (2) the unbalanced workloads induced by the software-level distribution and composition mechanisms. To tackle these challenges, we propose an object-oriented VR rendering framework (OO-VR) that conducts software and hardware co-optimization to provide a NUMA-friendly solution for VR multi-view rendering in NUMA-based multi-GPU systems. We first propose an object-oriented VR programming model to exploit the data sharing between the two views of the same object and group objects into batches based on their texture sharing levels. Then, we design an object-aware runtime batch distribution engine and a distributed hardware composition unit to achieve balanced workloads among GPMs. Finally, evaluations on our VR-featured simulator show that OO-VR provides 1.58x overall performance improvement and 76% inter-GPM memory traffic reduction over the state-of-the-art multi-GPU systems. In addition, OO-VR provides NUMA-friendly performance scalability for future larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.

    Updated: 2020-01-13
  • Fine-Grained Complexity of Safety Verification
    arXiv.cs.DC Pub Date : 2018-02-15
    Peter Chini; Roland Meyer; Prakash Saivasan

    We study the fine-grained complexity of Leader Contributor Reachability (LCR) and Bounded-Stage Reachability (BSR), two variants of the safety verification problem for shared memory concurrent programs. For both problems, the memory is a single variable over a finite data domain. Our contributions are new verification algorithms and lower bounds. The latter are based on the Exponential Time Hypothesis (ETH), the problem Set Cover, and cross-compositions. LCR is the question whether a designated leader thread can reach an unsafe state when interacting with a certain number of equal contributor threads. We suggest two parameterizations: (1) By the size of the data domain D and the size of the leader L, and (2) by the size of the contributors C. We present algorithms for both cases. The key techniques are compact witnesses and dynamic programming. The algorithms run in O*((L(D+1))^(LD) * D^D) and O*(2^C) time, showing that both parameterizations are fixed-parameter tractable. We complement the upper bounds by (matching) lower bounds based on ETH and Set Cover. Moreover, we prove the absence of polynomial kernels. For BSR, we consider programs involving t different threads. We restrict the analysis to computations where the write permission changes s times between the threads. BSR asks whether a given configuration is reachable via such an s-stage computation. When parameterized by P, the maximum size of a thread, and t, the interesting observation is that the problem has a large number of difficult instances. Formally, we show that there is no polynomial kernel, no compression algorithm that reduces the size of the data domain D or the number of stages s to a polynomial dependence on P and t. This indicates that symbolic methods may be harder to find for this problem.

    Updated: 2020-01-13
  • Nakamoto Consensus with Verifiable Delay Puzzle
    arXiv.cs.DC Pub Date : 2019-08-18
    Jieyi Long; Ribao Wei

    This paper summarizes our work-in-progress on a new consensus protocol based on verifiable delay function. First, we introduce the concept of verifiable delay puzzle (VDP), which resembles the hashing puzzle used in the PoW mechanism but can only be solved sequentially. We then present a VDP implementation based on Pietrzak's verifiable delay function. Further, we show that VDP can be combined with the Nakamoto consensus in a proof-of-stake/proof-of-delay hybrid protocol. We analyze the persistence and liveness of the protocol, and show that compared to PoW, our proposal consumes much less energy; compared to BFT based consensus algorithms which usually place an upper limit on the number of consensus nodes, our proposal is much more scalable and can thus achieve a higher level of decentralization.
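
    The sequential core of such a delay function can be sketched as repeated squaring modulo an RSA-style modulus of unknown factorisation, the operation underlying Pietrzak's VDF: each squaring depends on the previous one, so the evaluation cannot be parallelised. The succinct correctness proof and the paper's puzzle and consensus integration are omitted, and the modulus size and step count below are toy values.

    # Sequential-squaring delay: y = x^(2^T) mod N, evaluated by T squarings.
    from sympy import randprime
    import random

    def setup(bits=256):
        # RSA-style modulus of unknown factorisation (toy size; real VDFs use far larger moduli)
        p = randprime(2**(bits // 2), 2**(bits // 2 + 1))
        q = randprime(2**(bits // 2), 2**(bits // 2 + 1))
        return p * q

    def evaluate(x, T, N):
        y = x % N
        for _ in range(T):       # inherently sequential: each squaring needs the last result
            y = (y * y) % N
        return y

    N = setup()
    challenge = random.randrange(2, N)
    delay_steps = 100_000
    output = evaluate(challenge, delay_steps, N)
    print("VDF output:", hex(output)[:18], "...")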

    Updated: 2020-01-13
  • H2O-Cloud: A Resource and Quality of Service-Aware Task Scheduling Framework for Warehouse-Scale Data Centers -- A Hierarchical Hybrid DRL (Deep Reinforcement Learning) based Approach
    arXiv.cs.DC Pub Date : 2019-12-20
    Mingxi Cheng; Ji Li; Paul Bogdan; Shahin Nazarian

    Cloud computing has attracted both end-users and Cloud Service Providers (CSPs) in recent years. Improving resource utilization rate (RUtR), such as CPU and memory usages on servers, while maintaining Quality-of-Service (QoS) is one key challenge faced by CSPs with warehouse-scale data centers. Prior works proposed various algorithms to reduce energy cost or to improve RUtR, which either lack the fine-grained task scheduling capabilities, or fail to take a comprehensive system model into consideration. This article presents H2O-Cloud, a Hierarchical and Hybrid Online task scheduling framework for warehouse-scale CSPs, to improve resource usage effectiveness while maintaining QoS. H2O-Cloud is highly scalable and considers comprehensive information such as various workload scenarios, cloud platform configurations, user request information and dynamic pricing model. The hierarchy and hybridity of the framework, combined with its deep reinforcement learning (DRL) engines, enable H2O-Cloud to efficiently start on-the-go scheduling and learning in an unpredictable environment without pre-training. Our experiments confirm the high efficiency of the proposed H2O-Cloud when compared to baseline approaches, in terms of energy and cost while maintaining QoS. Compared with a state-of-the-art DRL-based algorithm, H2O-Cloud achieves up to 201.17% energy cost efficiency improvement, 47.88% energy efficiency improvement and 551.76% reward rate improvement.

    Updated: 2020-01-13
  • DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
    arXiv.cs.DC Pub Date : 2020-01-08
    Udit Gupta; Samuel Hsia; Vikram Saraph; Xiaodong Wang; Brandon Reagen; Gu-Yeon Wei; Hsien-Hsin S. Lee; David Brooks; Carole-Jean Wu

    Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

    Updated: 2020-01-10
  • Architecture and Security of SCADA Systems: A Review
    arXiv.cs.DC Pub Date : 2020-01-09
    Geeta Yadav; Kolin Paul

    Pipeline bursting, production lines shut down, frenzied traffic, train confrontations, nuclear reactors shut down, disrupted electric supply, interrupted oxygen supply in the ICU - these catastrophic events could result from an erroneous SCADA system / Industrial Control System (ICS). SCADA systems have become an essential part of the automated control and monitoring of many Critical Infrastructures (CI). Modern SCADA systems have evolved from standalone systems into sophisticated, complex, open systems connected to the Internet. These geographically distributed modern SCADA systems are vulnerable to threats and cyber attacks. In this paper, we first review the SCADA system architectures that have been proposed or implemented, followed by attacks on such systems, to understand and highlight the evolving security needs of SCADA systems. A short investigation of the current state of intrusion detection techniques in SCADA systems is done, followed by a brief study of testbeds for SCADA systems. Cloud and Internet of Things (IoT) based SCADA systems are studied by analysing the architecture of modern SCADA systems. This review ends by highlighting the critical research problems that need to be resolved to close the gaps in the security of SCADA systems.

    Updated: 2020-01-10
  • LibreSocial: A Peer-to-Peer Framework for Online Social Networks
    arXiv.cs.DC Pub Date : 2020-01-09
    Kalman Graffi; Newton Masinde

    Distributed online social networks (DOSNs) were first proposed to address the problems of privacy, security, and scalability. A significant amount of research has been undertaken to offer viable DOSN solutions capable of competing with existing centralized OSN applications such as Facebook, LinkedIn, and Instagram. This research led to the use of peer-to-peer (P2P) networks as a possible solution, upon which several OSNs such as LifeSocial.KOM, Safebook, and PeerSoN, among others, were based. In this paper, we define the basic requirements for a P2P OSN. We then revisit one of the first P2P-based OSNs, LifeSocial.KOM, now called LibreSocial, which has evolved over the past years to address the challenges of running a completely decentralized social network. Over the course of time, several essential new technologies have been incorporated into LibreSocial for better functionality. We describe the architecture and each individual component of LibreSocial and point out how LibreSocial meets the basic requirements for a fully functional distributed OSN.

    Updated: 2020-01-10
Contents have been reproduced by permission of the publishers.