Current journal: arXiv - CS - Performance
  • Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
    arXiv.cs.PF Pub Date : 2020-04-07
    Shijian Li; Robert J. Walls; Tian Guo

    Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration---e.g., server type and number---for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the

  • The State of Research on Function-as-a-Service Performance Evaluation: A Multivocal Literature Review
    arXiv.cs.PF Pub Date : 2020-04-07
    Joel Scheuner; Philipp Leitner

    Function-as-a-Service (FaaS) is one form of the serverless cloud computing paradigm and is defined through FaaS platforms (e.g., AWS Lambda) executing event-triggered code snippets (i.e., functions). Many studies that empirically evaluate the performance of such FaaS platforms have started to appear, but we currently lack a comprehensive understanding of the overall domain. In our work, we survey

  • Using HEP experiment workflows for the benchmarking and accounting of WLCG computing resources
    arXiv.cs.PF Pub Date : 2020-04-03
    Andrea Valassi; Manfred Alef; Jean-Michel Barbet; Olga Datskova; Riccardo De Maria; Miguel Fontes Medeiros; Domenico Giordano; Costin Grigoras; Christopher Hollowell; Martina Javurkova; Viktor Khristenko; David Lange; Michele Michelotto; Lorenzo Rinaldi; Andrea Sciabà; Cas Van Der Laan

    Benchmarking of CPU resources in WLCG has been based on the HEP-SPEC06 (HS06) suite for over a decade. It has recently become clear that HS06, which is based on real applications from non-HEP domains, no longer describes typical HEP workloads. The aim of the HEP-Benchmarks project is to develop a new benchmark suite for WLCG compute resources, based on real applications from the LHC experiments. By

  • Heavy Traffic Analysis of the Mean Response Time for Load Balancing Policies in the Mean Field Regime
    arXiv.cs.PF Pub Date : 2020-04-02
    Tim Hellemans; Benny Van Houdt

    Mean field models are a popular tool used to analyse load balancing policies. In some exceptional cases the response time distribution of the mean field limit has an explicit form. In most cases it can be computed using either a recursion or a differential equation (for exponential job sizes with mean one). In this paper we study the value of the mean response time $E[R_\lambda]$ as the arrival rate
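
    As a baseline for this behavior, the textbook M/M/1 queue (a standard result, not the paper's mean field model) already shows the mean response time diverging as the arrival rate approaches the mean-one service rate. A minimal sketch, with a hypothetical helper name:

```python
# Mean response time of an M/M/1 queue, E[R] = 1/(mu - lam).
# Illustrative only: a standard queueing formula, not the mean field
# analysis of the paper. Service rate mu = 1 matches mean-one job sizes.
def mm1_mean_response_time(lam, mu=1.0):
    if lam >= mu:
        raise ValueError("queue is unstable for lam >= mu")
    return 1.0 / (mu - lam)

# E[R] grows without bound as lam -> mu.
for lam in (0.5, 0.9, 0.99):
    print(lam, mm1_mean_response_time(lam))
```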

  • Computational Performance of a Germline Variant Calling Pipeline for Next Generation Sequencing
    arXiv.cs.PF Pub Date : 2020-04-01
    Jie Liu; Xiaotian Wu; Kai Zhang; Bing Liu; Renyi Bao; Xiao Chen; Yiran Cai; Yiming Shen; Xinjun He; Jun Yan; Weixing Ji

    With the booming of next generation sequencing technology and its implementation in clinical practice and life science research, the need for faster and more efficient data analysis methods becomes pressing in the field of sequencing. Here we report on the evaluation of an optimized germline mutation calling pipeline, HummingBird, by assessing its performance against the widely accepted BWA-GATK pipeline

  • Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints
    arXiv.cs.PF Pub Date : 2020-04-01
    Mehrnoosh Shafiee; Javad Ghaderi

    Motivated by modern parallel computing applications, we consider the problem of scheduling parallel-task jobs with heterogeneous resource requirements in a cluster of machines. Each job consists of a set of tasks that can be processed in parallel; however, the job is considered completed only when all its tasks finish their processing, which we refer to as the "synchronization" constraint. Further, assignment

  • Fundamental Limits of Online Network-Caching
    arXiv.cs.PF Pub Date : 2020-03-31
    Rajarshi Bhattacharjee; Subhankar Banerjee; Abhishek Sinha

    Optimal caching of files in a content distribution network (CDN) is a problem of fundamental and growing commercial interest. Although many different caching algorithms are in use today, the fundamental performance limits of network caching algorithms from an online learning point-of-view remain poorly understood to date. In this paper, we resolve this question in the following two settings: (1) a

  • Static vs accumulating priorities in healthcare queues under heavy loads
    arXiv.cs.PF Pub Date : 2020-03-31
    Binyamin Oz; Seva Shneer; Ilze Ziedins

    Amid unprecedented times caused by COVID-19, healthcare systems all over the world are strained to the limits of, or even beyond, capacity. A similar event is experienced by some healthcare systems regularly, due to, for instance, seasonal spikes in the number of patients. We model this as a queueing system in heavy traffic (where the arrival rate is approaching the service rate from below) or in overload

  • Is Your Load Generator Launching Web Requests in Bunches?
    arXiv.cs.PF Pub Date : 2018-09-27
    James F Brady

    One problem with load test quality, almost always overlooked, is the potential for the load generator's user thread pool to sync up and dispatch queries in bunches rather than independently of each other, the way real users initiate their requests. A spiky launch pattern misrepresents workload flow and yields erroneous application response time statistics. This paper describes what a real user
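
    The contrast between bunched and independent launches can be sketched in a few lines (hypothetical helper names; a toy simulation of the idea, not the paper's measurement method): threads that fire in lockstep each cycle produce bursts as large as the whole pool, while independently timed threads spread their requests out.

```python
import random

# Bunched pattern: every thread dispatches at the same instant each cycle.
def bunched_launches(threads, cycles, period):
    return sorted(c * period for c in range(cycles) for _ in range(threads))

# Independent pattern: each thread waits an exponential gap (mean = period)
# between its own requests, the way independent real users behave.
def independent_launches(threads, cycles, period, seed=1):
    rng = random.Random(seed)
    times = []
    for _ in range(threads):
        t = 0.0
        for _ in range(cycles):
            t += rng.expovariate(1.0 / period)
            times.append(t)
    return sorted(times)

# Largest number of launches falling inside any window of the given width.
def max_burst(times, window):
    best = 0
    for i, t in enumerate(times):
        j = i
        while j < len(times) and times[j] < t + window:
            j += 1
        best = max(best, j - i)
    return best

bunched = bunched_launches(threads=50, cycles=20, period=1.0)
spread = independent_launches(threads=50, cycles=20, period=1.0)
print(max_burst(bunched, 0.01))  # 50: the whole pool fires at once
print(max_burst(spread, 0.01))   # far smaller for independent arrivals
```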

  • Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models
    arXiv.cs.PF Pub Date : 2019-12-05
    Matthew LeMay; Shijian Li; Tian Guo

    Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition, and traditional features, e.g. web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However

  • Optimal Multiserver Scheduling with Unknown Job Sizes in Heavy Traffic
    arXiv.cs.PF Pub Date : 2020-03-30
    Ziv Scully; Isaac Grosof; Mor Harchol-Balter

    We consider scheduling to minimize mean response time of the M/G/k queue with unknown job sizes. In the single-server case, the optimal policy is the Gittins policy, but it is not known whether Gittins or any other policy is optimal in the multiserver case. Exactly analyzing the M/G/k under any scheduling policy is intractable, and Gittins is a particularly complicated policy that is hard to analyze

  • FFT, FMM, and Multigrid on the Road to Exascale: performance challenges and opportunities
    arXiv.cs.PF Pub Date : 2018-10-28
    Huda Ibeid; Luke Olson; William Gropp

    FFT, FMM, and multigrid methods are widely used fast and highly scalable solvers for elliptic PDEs. However, emerging large-scale computing systems are introducing challenges in comparison to current petascale computers. Recent efforts (Dongarra et al. 2011) have identified several constraints in the design of exascale software that includes massive concurrency, resilience management, exploiting the

  • Next-Generation Information Technology Systems for Fast Detectors in Electron Microscopy
    arXiv.cs.PF Pub Date : 2020-03-25
    Dieter Weber; Alexander Clausen; Rafal E. Dunin-Borkowski

    The Gatan K2 IS direct electron detector (Gatan Inc., 2018), which was introduced in 2014, marked a watershed moment in the development of cameras for transmission electron microscopy (TEM) (Pan & Czarnik, 2016). Its pixel frequency, i.e. the number of data points (pixels) recorded per second, was two orders of magnitude higher than the fastest cameras available only five years before. Starting from

  • CoCoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way
    arXiv.cs.PF Pub Date : 2020-03-14
    Shaoshan Liu; Bin Ren; Xipeng Shen; Yanzhi Wang

    Assuming hardware is the major constraint for enabling real-time mobile intelligence, the industry has mainly dedicated its efforts to developing specialized hardware accelerators for machine learning and inference. This article challenges that assumption. By drawing on a recent real-time AI optimization framework, CoCoPIE, it maintains that with effective compression-compiler co-design, it is possible

  • A Transactional Perspective on Execute-order-validate Blockchains
    arXiv.cs.PF Pub Date : 2020-03-23
    Pingcheng Ruan; Dumitrel Loghin; Quang-Trung Ta; Meihui Zhang; Gang Chen; Beng Chin Ooi

    Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering for emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the blockchain's throughput. However, this new architecture

  • Integrating State of the Art Compute, Communication, and Autotuning Strategies to Multiply the Performance of the Application Program CPMD for Ab Initio Molecular Dynamics Simulations
    arXiv.cs.PF Pub Date : 2020-03-18
    Tobias Klöffel; Gerald Mathias; Bernd Meyer

    We present our recent code modernizations of the ab initio molecular dynamics program CPMD (www.cpmd.org) with a special focus on the ultra-soft pseudopotential (USPP) code path. Following the internal instrumentation of CPMD, all time-critical routines have been revised to maximize the computational throughput and to minimize the communication overhead for optimal performance. Throughout the

  • ContainerStress: Autonomous Cloud-Node Scoping Framework for Big-Data ML Use Cases
    arXiv.cs.PF Pub Date : 2020-03-18
    Guang Chao Wang; Kenny Gross; Akshay Subramaniam

    Deploying big-data Machine Learning (ML) services in a cloud environment presents a challenge to the cloud vendor with respect to the cloud container configuration sizing for any given customer use case. OracleLabs has developed an automated framework that uses nested-loop Monte Carlo simulation to autonomously scale any size customer ML use cases across the range of cloud CPU-GPU "Shapes" (configurations

  • Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction
    arXiv.cs.PF Pub Date : 2020-03-17
    Ajitesh Srivastava (University of Southern California); Naifeng Zhang (University of Southern California); Rajgopal Kannan (US Army Research Lab-West); Viktor K. Prasanna (University of Southern California)

    Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware knowledge. This often leads to poor productivity and portability and is inconvenient for a non-programmer domain specialist such as a physicist. More desirable is a high-level language where the domain specialist simply specifies the workload in terms of high-level operations

  • A TTL-based Approach for Content Placement in Edge Networks
    arXiv.cs.PF Pub Date : 2017-11-10
    Nitish K. Panigrahy; Jian Li; Faheem Zafari; Don Towsley; Paul Yu

    Edge networks are promising to provide better services to users by provisioning computing and storage resources at the edge of networks. However, due to the uncertainty and diversity of user interests, content popularity, distributed network structure, cache sizes, it is challenging to decide where to place the content, and how long it should be cached. In this paper, we study the utility optimization
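
    The TTL primitive the paper builds on can be sketched minimally (illustrative code with hypothetical names, not the authors' utility-optimization machinery): an object fetched on a miss is kept for a fixed time and served from the edge until it expires.

```python
class TTLCache:
    """Non-reset TTL cache: each object lives `ttl` time units after the
    miss that fetched it; the TTL is the knob controlling how long
    content stays cached at the edge."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}  # key -> expiration timestamp

    def request(self, key, now):
        """Return True on a hit; on a miss, fetch and (re)cache."""
        if self.expiry.get(key, float("-inf")) > now:
            return True
        self.expiry[key] = now + self.ttl  # cache on miss
        return False

cache = TTLCache(ttl=5.0)
print(cache.request("video.mp4", now=0.0))  # False: first request misses
print(cache.request("video.mp4", now=3.0))  # True: still within the TTL
print(cache.request("video.mp4", now=9.0))  # False: expired at t = 5.0
```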

  • Securing of Unmanned Aerial Systems (UAS) against security threats using human immune system
    arXiv.cs.PF Pub Date : 2020-03-01
    Reza Fotohi

    UASs form a large part of the fighting ability of advanced military forces. In particular, these systems that carry confidential information are subject to security attacks. Accordingly, an Intrusion Detection System (IDS) based on the human immune system (HIS) is proposed in this design to protect against these security problems. IDSs are used to detect and respond to attempts to

  • Scaling Hyperledger Fabric Using Pipelined Execution and Sparse Peers
    arXiv.cs.PF Pub Date : 2020-03-11
    Parth Thakkar; Senthil Nathan

    Many proof-of-concept blockchain applications built using Hyperledger Fabric, a permissioned blockchain platform, have recently been transformed into production. However, the performance provided by Hyperledger Fabric is of significant concern for enterprises due to steady growth in network usage. Hence, in this paper, we study the performance achieved in a Fabric network using vertical scaling (i

  • Covert Cycle Stealing in a Single FIFO Server
    arXiv.cs.PF Pub Date : 2020-03-11
    Bo Jiang; Philippe Nain; Don Towsley

    Consider a setting where Willie generates a Poisson stream of jobs and routes them to a single server that follows the first-in first-out discipline. Suppose there is an adversary Alice, who desires to receive service without being detected. We ask the question: what is the amount of service that she can receive covertly, i.e. without being detected by Willie? In the case where both Willie and Alice

  • In Situ Network and Application Performance Measurement on Android Devices and the Imperfections
    arXiv.cs.PF Pub Date : 2020-03-11
    Mohammad A. Hoque; Ashwin Rao; Sasu Tarkoma

    Understanding network and application performance is essential for debugging, improving user experience, and performance comparison. Meanwhile, modern mobile systems are optimized for energy-efficient computation and communications that may limit the performance of networks and applications. In recent years, several tools have emerged that analyze the network performance of mobile applications in situ

  • DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs (Extended)
    arXiv.cs.PF Pub Date : 2019-11-18
    Cheng Li; Abdul Dakkak; Jinjun Xiong; Wen-mei Hwu

    The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, object detection, machine translation, etc. While DL models provide an opportunity to solve otherwise intractable tasks, their adoption relies on them being optimized to meet latency and resource requirements. Benchmarking is a key step in this process but has been hampered

  • The Locus Algorithm IV: Performance metrics of a grid computing system used to create catalogues of optimised pointings
    arXiv.cs.PF Pub Date : 2020-03-10
    Oisín Creaner; John Walsh; Kevin Nolan; Eugene Hickey

    This paper discusses the requirements for and performance metrics of the Grid Computing system used to implement the Locus Algorithm to identify optimum pointings for differential photometry of 61,662,376 stars and 23,779 quasars. Initial operational tests indicated a need for a software system to analyse the data and a High Performance Computing system to run that software in a scalable manner

  • Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL
    arXiv.cs.PF Pub Date : 2020-03-08
    Philip Heinisch; Katharina Ostaszewski; Hendrik Ranocha

    When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance parallel cluster systems due to financial and practical limits on power consumption and cooling. Recent developments in hard- and software have given programmers the

  • Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach
    arXiv.cs.PF Pub Date : 2020-03-05
    Peng Zhang; Jianbin Fang; Canqun Yang; Chun Huang; Tao Tang; Zheng Wang

    This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used

  • Modeling the Invariance of Virtual Pointers in LLVM
    arXiv.cs.PF Pub Date : 2020-02-22
    Piotr Padlewski; Krzysztof Pszeniczny; Richard Smith

    Devirtualization is a compiler optimization that replaces indirect (virtual) function calls with direct calls. It is particularly effective in object-oriented languages, such as Java or C++, in which virtual methods are typically abundant. We present a novel abstract model to express the lifetimes of C++ dynamic objects and invariance of virtual table pointers in the LLVM intermediate representation

  • The distribution of age-of-information performance measures for message processing systems
    arXiv.cs.PF Pub Date : 2019-04-11
    George Kesidis; Takis Konstantopoulos; Michael Zazanis

    The idea behind the recently introduced "age of information" performance measure of a networked message processing system is that it indicates our knowledge regarding the "freshness" of the most recent piece of information that can be used as a criterion for real-time control. In this foundational paper, we examine two such measures, one that has been extensively studied in the recent literature and
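
    For orientation, the most widely cited closed form in this literature (the FCFS M/M/1 result of Kaul, Yates, and Gruteser, quoted here for context rather than taken from this paper): with arrival rate $\lambda$, service rate $\mu$, and load $\rho = \lambda/\mu$, the average age of information is

```latex
% Average age of information of an M/M/1 FCFS status-update system
% with load \rho = \lambda / \mu (classical result, quoted for context).
\Delta = \frac{1}{\mu}\left(1 + \frac{1}{\rho} + \frac{\rho^{2}}{1-\rho}\right)
```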

  • nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
    arXiv.cs.PF Pub Date : 2019-11-08
    Andreas Abel; Jan Reineke

    We present nanoBench, a tool for evaluating small microbenchmarks using hardware performance counters on Intel and AMD x86 systems. Most existing tools and libraries are intended to either benchmark entire programs, or program segments in the context of their execution within a larger program. In contrast, nanoBench is specifically designed to evaluate small, isolated pieces of code. Such code is common

  • Optimizing JPEG Quantization for Classification Networks
    arXiv.cs.PF Pub Date : 2020-03-05
    Zhijing Li; Christopher De Sa; Adrian Sampson

    Deep learning for computer vision depends on lossy image compression: it reduces the storage required for training and test data and lowers transfer costs in deployment. Mainstream datasets and imaging pipelines all rely on standard JPEG compression. In JPEG, the degree of quantization of frequency coefficients controls the lossiness: an 8 by 8 quantization table (Q-table) decides both the quality
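
    The knob the paper tunes can be shown in a few lines (a toy 2-by-2 block and made-up table entries for brevity; real JPEG uses 8-by-8 blocks, and Python's round applies banker's rounding): each DCT coefficient is divided by the matching Q-table entry and rounded, so larger entries discard more information.

```python
# Quantize: divide each DCT coefficient by its Q-table entry and round.
def quantize(dct_block, q_table):
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(dct_block, q_table)]

# Dequantize: multiply back; the difference from the input is the loss.
def dequantize(quant_block, q_table):
    return [[c * q for c, q in zip(crow, qrow)]
            for crow, qrow in zip(quant_block, q_table)]

coeffs = [[200.0, 31.0], [-24.0, 5.0]]  # toy "DCT" block
q = [[16, 11], [12, 10]]                # toy Q-table entries
quant = quantize(coeffs, q)
print(quant)                  # small integers: cheap to store
print(dequantize(quant, q))   # close to, but not equal to, coeffs
```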

  • Me Love (SYN-)Cookies: SYN Flood Mitigation in Programmable Data Planes
    arXiv.cs.PF Pub Date : 2020-03-06
    Dominik Scholz; Sebastian Gallenmüller; Henning Stubbe; Bassam Jaber; Minoo Rouhi; Georg Carle

    The SYN flood attack is a common attack strategy on the Internet, which tries to overload services with requests leading to a Denial-of-Service (DoS). Highly asymmetric costs for connection setup - putting the main burden on the attackee - make SYN flooding an efficient and popular DoS attack strategy. Abusing the widely used TCP as an attack vector complicates the detection of malicious traffic and

  • Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application
    arXiv.cs.PF Pub Date : 2020-03-06
    David Goz; Georgios Ieronymakis; Vassilis Papaefstathiou; Nikolaos Dimou; Sara Bertocco; Giuliano Taffoni; Francesco Simula; Antonio Ragagnin; Luca Tornatore; Igor Coretti

    New challenges in Astronomy and Astrophysics (AA) are driving the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound

  • Efficient statistical validation with edge cases to evaluate Highly Automated Vehicles
    arXiv.cs.PF Pub Date : 2020-03-04
    Dhanoop Karunakaran; Stewart Worrall; Eduardo Nebot

    The wide-scale deployment of Autonomous Vehicles (AV) seems imminent despite many safety challenges that are yet to be resolved. It is well known that there are no universally agreed Verification and Validation (VV) methodologies to guarantee absolute safety, which is crucial for the acceptance of this technology. Existing standards focus on deterministic processes where the validation requires

  • Enabling URLLC for Low-Cost IoT Devices via Diversity Combining Schemes
    arXiv.cs.PF Pub Date : 2020-03-04
    Onel L. Alcaraz López; Nurul Huda Mahmood; Hirley Alves

    Supporting Ultra-Reliable Low-Latency Communication (URLLC) in the Internet of Things (IoT) era is challenging due to stringent constraints on latency and reliability combined with the simple circuitry of IoT nodes. Diversity is usually required for sustaining the reliability levels of URLLC, but there is an additional delay, associated with auxiliary procedures, to be considered, especially when communication

  • Blind GB-PANDAS: A Blind Throughput-Optimal Load Balancing Algorithm for Affinity Scheduling
    arXiv.cs.PF Pub Date : 2019-01-13
    Ali Yekkehkhany; Rakesh Nagi

    Dynamic affinity load balancing of multi-type tasks on multi-skilled servers, when the service rate of each task type on each of the servers is known and can possibly differ across servers, has been an open problem for over three decades. The goal is to assign tasks to servers in real time so that the system becomes stable, meaning that the queue lengths do not diverge to infinity

  • VulnDS: Top-k Vulnerable SME Detection System in Networked-Loans
    arXiv.cs.PF Pub Date : 2019-12-28
    Dawei Cheng; Xiaoyang Wang; Ying Zhang; Shunzhang Wang

    Groups of small and medium enterprises (SMEs) can back each other to obtain loans from banks and thus form guarantee networks. If the loan repayment of a small company in the network defaults, its backers are required to repay the loan. Therefore, risk over networked enterprises may cause significant contagious damage. In real-world applications, it is critical to detect top vulnerable nodes in such

  • High Performance Code Generation in MLIR: An Early Case Study with GEMM
    arXiv.cs.PF Pub Date : 2020-03-01
    Uday Bondhugula

    This article is primarily meant to present an early case study on using MLIR, a new compiler intermediate representation infrastructure, for high-performance code generation. Aspects of MLIR covered in particular include memrefs, the affine dialect, and polyhedral utilities and pass infrastructure surrounding those. This article is also aimed at showing the role compiler infrastructure could play in

  • Change Point Detection in Software Performance Testing
    arXiv.cs.PF Pub Date : 2020-03-01
    David Daly; William Brown; Henrik Ingo; Jim O'Leary; David Bradford

    We describe our process for automatic detection of performance changes for a software product in the presence of noise. A large collection of tests run periodically as changes to our software product are committed to our source repository, and we would like to identify the commits responsible for performance regressions. Previously, we relied on manual inspection of time series graphs to identify significant
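
    A toy version of the detection step (a simple mean-shift scan chosen for illustration, with hypothetical names; the production pipeline described in the paper is more sophisticated): given a per-commit latency series, pick the split point that maximizes the gap between the means on either side.

```python
# Return the index k that best separates the series into two regimes,
# scoring each split by the absolute difference of the two means.
def best_split(series):
    def mean(xs):
        return sum(xs) / len(xs)
    return max(range(1, len(series)),
               key=lambda k: abs(mean(series[:k]) - mean(series[k:])))

# Latencies (ms) per commit: a regression lands at index 5.
latency = [100, 101, 99, 100, 102, 131, 130, 129, 132, 130]
print(best_split(latency))  # 5
```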

  • GPU-Accelerated Mobile Multi-view Style Transfer
    arXiv.cs.PF Pub Date : 2020-03-02
    Puneet Kohli; Saravana Gunaseelan; Jason Orozco; Yiwen Hua; Edward Li; Nicolas Dahlquist

    An estimated 60% of smartphones sold in 2018 were equipped with multiple rear cameras, enabling a wide variety of 3D-enabled applications such as 3D Photos. The success of 3D Photo platforms (Facebook 3D Photo, Holopix, etc.) depends on a steady influx of user-generated content. These platforms must provide simple image manipulation tools to facilitate content creation, akin to traditional photo platforms

  • Quantized Neural Network Inference with Precision Batching
    arXiv.cs.PF Pub Date : 2020-02-26
    Maximilian Lam; Zachary Yedidia; Colby Banbury; Vijay Janapa Reddi

    We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only
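
    Our reading of the bitlayer decomposition can be sketched as follows (illustrative code with hypothetical names, not the authors' implementation; unsigned weights only, and plain Python stands in for the fast 1-bit hardware operations): the weight matrix splits into binary bitplanes, each bitplane multiplies the full-precision activations, and the partial products recombine with each bit's place value.

```python
# Split an unsigned integer weight matrix into nbits binary bitplanes.
def bitplanes(weights, nbits):
    return [[[(w >> b) & 1 for w in row] for row in weights]
            for b in range(nbits)]

def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

# Accumulate one 1-bit matvec per bitplane, weighted by 2**b.
def bitlayer_matvec(weights, vec, nbits):
    out = [0.0] * len(weights)
    for b, plane in enumerate(bitplanes(weights, nbits)):
        for i, p in enumerate(matvec(plane, vec)):
            out[i] += p * (1 << b)
    return out

W = [[5, 3], [2, 7]]  # 3-bit unsigned weights
x = [1.0, 2.0]        # full-precision activations
print(bitlayer_matvec(W, x, nbits=3))  # equals matvec(W, x)
```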

  • ICE: An Interactive Configuration Explorer for High Dimensional Categorical Parameter Spaces
    arXiv.cs.PF Pub Date : 2019-07-29
    Anjul Tyagi; Zhen Cao; Tyler Estro; Erez Zadok; Klaus Mueller

    There are many applications where users seek to explore the impact of the settings of several categorical variables with respect to one dependent numerical variable. For example, a computer systems analyst might want to study how the type of file system or storage device affects system performance. A usual choice is the method of Parallel Sets designed to visualize multivariate categorical variables

  • MLPerf Training Benchmark
    arXiv.cs.PF Pub Date : 2019-10-02
    Peter Mattson; Christine Cheng; Cody Coleman; Greg Diamos; Paulius Micikevicius; David Patterson; Hanlin Tang; Gu-Yeon Wei; Peter Bailis; Victor Bittorf; David Brooks; Dehao Chen; Debojyoti Dutta; Udit Gupta; Kim Hazelwood; Andrew Hock; Xinyuan Huang; Atsushi Ike; Bill Jia; Daniel Kang; David Kanter; Naveen Kumar; Jeffery Liao; Guokai Ma; Deepak Narayanan; Tayo Oguntebi; Gennady Pekhimenko; Lillian

    Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high

  • Towards a Geometry Automated Provers Competition
    arXiv.cs.PF Pub Date : 2020-02-28
    Nuno Baeta (University of Coimbra); Pedro Quaresma (University of Coimbra); Zoltán Kovács (The Private University College of Education of the Diocese of Linz)

    The geometry automated theorem proving area distinguishes itself by a large number of specific methods and implementations, different approaches (synthetic, algebraic, semi-synthetic) and different goals and applications (from research in the area of artificial intelligence to applications in education). Apart from the usual measures of efficiency (e.g. CPU time), the possibility of visual and/or readable

  • Optimizing Memory-Access Patterns for Deep Learning Accelerators
    arXiv.cs.PF Pub Date : 2020-02-27
    Hongbin Zheng; Sejong Oh; Huiqing Wang; Preston Briggs; Jiading Gai; Animesh Jain; Yizhi Liu; Rich Heaton; Randy Huang; Yida Wang

    Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can

  • Throughput Prediction of Asynchronous SGD in TensorFlow
    arXiv.cs.PF Pub Date : 2019-11-12
    Zhuojin Li; Wumo Yan; Marco Paolieri; Leana Golubchik

    Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this

  • Parallel Data Distribution Management on Shared-Memory Multiprocessors
    arXiv.cs.PF Pub Date : 2019-11-07
    Moreno Marzolla; Gabriele D'Angelo

    The problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles appears frequently in the context of agent-based simulation studies. For this reason, the High Level Architecture (HLA) specification -- a standard framework for interoperability among simulators -- includes a Data Distribution Management (DDM) service whose responsibility is to report all intersections
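
    The DDM primitive being parallelized reduces to a simple per-axis check (a brute-force reference sketch with hypothetical names; the paper's contribution is doing this efficiently in parallel): two axis-parallel rectangles intersect iff their intervals overlap along every dimension.

```python
# A rectangle is a list of (lo, hi) intervals, one per dimension.
def overlap(r1, r2):
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(r1, r2))

# Report every (update, subscription) pair whose regions intersect.
def intersections(updates, subscriptions):
    return [(i, j)
            for i, u in enumerate(updates)
            for j, s in enumerate(subscriptions)
            if overlap(u, s)]

updates = [[(0, 2), (0, 2)], [(5, 6), (5, 6)]]
subs = [[(1, 3), (1, 3)], [(7, 8), (0, 1)]]
print(intersections(updates, subs))  # [(0, 0)]
```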

  • The Deep Learning Compiler: A Comprehensive Survey
    arXiv.cs.PF Pub Date : 2020-02-06
    Mingzhen Li; Yi Liu; Xiaoyan Liu; Qingxiao Sun; Xin You; Hailong Yang; Zhongzhi Luan; Depei Qian

    The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed from both industry and academia, such as TensorFlow XLA and TVM. Generally, these DL compilers take the DL models described in different DL frameworks as input and then generate optimized codes for diverse

  • Reflection Resource Management for Intelligent Reflecting Surface Aided Wireless Networks
    arXiv.cs.PF Pub Date : 2020-02-02
    Yulan Gao; Chao Yong; Zehui Xiong; Jun Zhao; Yue Xiao; Dusit Niyato

    In this paper, the adoption of an intelligent reflecting surface (IRS) for multiple single-antenna source terminal (ST)-destination terminal (DT) pairs in two-hop networks is investigated. Different from previous studies on IRS that merely focused on tuning the reflection coefficient of all the reflection elements at the IRS, in this paper we consider true reflection resource management. Specifically, the true reflection

  • A High-Throughput Solver for Marginalized Graph Kernels on GPU
    arXiv.cs.PF Pub Date : 2019-10-14
    Yu-Hang Tang; Oguz Selvitopi; Doru Popovici; Aydın Buluç

    We present the design and optimization of a linear solver on General Purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between

  • Learning Queuing Networks by Recurrent Neural Networks
    arXiv.cs.PF Pub Date : 2020-02-25
    Giulio Garbi; Emilio Incerto; Mirco Tribastone

    It is well known that building analytical performance models in practice is difficult because it requires a considerable degree of proficiency in the underlying mathematics. In this paper, we propose a machine-learning approach to derive performance models from data. We focus on queuing networks, and crucially exploit a deterministic approximation of their average dynamics in terms of a compact system

  • Graph Computing based Distributed State Estimation with PMUs
    arXiv.cs.PF Pub Date : 2020-02-20
    Yi Lu; Chen Yuan; Xiang Zhang; Hua Huang; Guangyi Liu; Renchang Dai; Zhiwei Wang

    Power system state estimation plays a fundamental and critical role in the energy management system (EMS). To achieve a high performance and accurate system states estimation, a graph computing based distributed state estimation approach is proposed in this paper. Firstly, a power system network is divided into multiple areas. Reference buses are selected with PMUs being installed at these buses for

  • Deadline-aware Scheduling for Maximizing Information Freshness in Industrial Cyber-Physical System
    arXiv.cs.PF Pub Date : 2019-12-24
    Devarpita Sinha; Rajarshi Roy

    Age of Information is an interesting metric that captures the freshness of information in the underlying applications. It is a combination of both packet inter-arrival time and packet transmission delay. In recent times, advanced real-time systems rely on this metric for delivering status updates as timely as possible. This paper aims to find an optimal transmission scheduling policy to maintain

  • Simplified Ray Tracing for the Millimeter Wave Channel: A Performance Evaluation
    arXiv.cs.PF Pub Date : 2020-02-21
    Mattia Lecci; Paolo Testolina; Marco Giordani; Michele Polese; Tanguy Ropitault; Camillo Gentile; Neeraj Varshney; Anuraag Bodi; Michele Zorzi

    Millimeter-wave (mmWave) communication is one of the cornerstone innovations of fifth-generation (5G) wireless networks, thanks to the massive bandwidth available in these frequency bands. To correctly assess the performance of such systems, however, it is essential to have reliable channel models, based on a deep understanding of the propagation characteristics of the mmWave signal. In this respect

  • Taurus: An Intelligent Data Plane
    arXiv.cs.PF Pub Date : 2020-02-12
    Tushar Swamy; Alexander Rucker; Muhammad Shahbaz; Kunle Olukotun

    Emerging applications -- cloud computing, the internet of things, and augmented/virtual reality -- need responsive, available, secure, ubiquitous, and scalable datacenter networks. Network management currently uses simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under an intelligent, millisecond-latency control plane that runs data-driven performance and security policies. However

  • Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers
    arXiv.cs.PF Pub Date : 2020-02-20
    Xingyu Zhou; Ness Shroff; Adam Wierman

    We consider the load balancing problem in large-scale heterogeneous systems with multiple dispatchers. We introduce a general framework called Local-Estimation-Driven (LED). Under this framework, each dispatcher keeps local (possibly outdated) estimates of queue lengths for all the servers, and the dispatching decision is made purely based on these local estimates. The local estimates are updated via
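As a toy illustration of the LED idea, a dispatcher can route each job to the server with the smallest local estimate, bump that estimate, and only occasionally refresh from the true queues. The probabilistic sync rule below is a hypothetical choice for the sketch; the paper's framework covers a range of update schemes.

```python
import random

class LEDDispatcher:
    """Local-Estimation-Driven dispatching sketch.

    Each job goes to the server with the smallest *local* (possibly
    stale) queue-length estimate; the estimate is then incremented to
    account for the routed job. Estimates are refreshed from the true
    queues only with probability sync_prob per arrival (an illustrative
    assumption, not the paper's specific update scheme).
    """
    def __init__(self, n_servers, sync_prob=0.1, seed=0):
        self.estimates = [0] * n_servers
        self.sync_prob = sync_prob
        self.rng = random.Random(seed)

    def dispatch(self, true_queues):
        if self.rng.random() < self.sync_prob:
            self.estimates = list(true_queues)   # refresh the local view
        i = min(range(len(self.estimates)), key=self.estimates.__getitem__)
        self.estimates[i] += 1                   # account for the routed job
        return i
```

Between syncs the dispatcher behaves like round-robin over the least-loaded servers in its own view, which is what keeps per-job communication with the servers off the critical path.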

  • Performance Analysis of Single-Cell Adaptive Data Rate-Enabled LoRaWAN
    arXiv.cs.PF Pub Date : 2020-02-19
    Arliones Hoeller; Richard Demo Souza; Samuel Montejo-Sánchez; Hirley Alves

    LoRaWAN enables massive connectivity for Internet-of-Things applications. Many published works employ stochastic geometry to derive outage models of LoRaWAN over fading channels assuming fixed transmit power and distance-based spreading factor (SF) allocation. However, in practice, LoRaWAN employs the Adaptive Data Rate (ADR) mechanism, which dynamically adjusts SF and transmit power of nodes based

  • Throughput Optimal Decentralized Scheduling with Single-bit State Feedback for a Class of Queueing Systems
    arXiv.cs.PF Pub Date : 2020-02-19
    Avinash Mohan; Aditya Gopalan; Anurag Kumar

    Motivated by medium access control for resource-challenged wireless Internet of Things (IoT), we consider the problem of queue scheduling with reduced queue state information. In particular, we consider a time-slotted scheduling model with $N$ sensor nodes, with pair-wise dependence, such that Nodes $i$ and $i + 1,~0 < i < N$ cannot transmit together. We develop new throughput-optimal scheduling policies
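The pairwise constraint means a feasible activation set is an independent set on a path graph, so a centralized, full-information max-weight schedule reduces to the classic independent-set DP sketched below. This is only a reference point: the paper's contribution is achieving throughput optimality decentrally with single-bit feedback, which this sketch does not model.

```python
def max_weight_schedule(queues):
    """Max-weight independent set on a path.

    queues[i] is the queue length (weight) of node i; nodes i and i+1
    may not transmit together. Returns (total_weight, selected_nodes).
    """
    n = len(queues)
    dp = [0] * (n + 1)                 # dp[i]: best weight using nodes 0..i-1
    for i in range(1, n + 1):
        skip = dp[i - 1]
        take = (dp[i - 2] if i >= 2 else 0) + queues[i - 1]
        dp[i] = max(skip, take)
    # Backtrack to recover the activated set
    sel, i = [], n
    while i >= 1:
        if dp[i] == dp[i - 1]:
            i -= 1                     # node i-1 stays idle
        else:
            sel.append(i - 1)          # node i-1 transmits
            i -= 2
    return dp[n], sorted(sel)
```

For example, with queues [3, 5, 4] the schedule activates nodes 0 and 2 for total weight 7, skipping the heavier middle node because its two neighbors together serve more.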

  • Honing and proofing Astrophysical codes on the road to Exascale. Experiences from code modernization on many-core systems
    arXiv.cs.PF Pub Date : 2020-02-19
    Salvatore Cielo; Luigi Iapichino; Fabio Baruffa; Matteo Bugli; Christoph Federrath

    The complexity of modern and upcoming computing architectures poses severe challenges for code developers and application specialists, and forces them to expose the highest possible degree of parallelism, in order to make the best use of the available hardware. The Intel(R) Xeon Phi(TM) of second generation (code-named Knights Landing, henceforth KNL) is the latest many-core system, which

  • Interface Modeling for Quality and Resource Management
    arXiv.cs.PF Pub Date : 2020-02-19
    Martijn Hendriks; Marc Geilen; Kees Goossens; Rob de Jong; Twan Basten

    We develop an interface-modeling framework for quality and resource management that captures configurable working points of hardware and software components in terms of functionality, resource usage and provision, and quality indicators such as performance and energy consumption. We base these aspects on partially-ordered sets to capture quality levels, budget sizes, and functional compatibility. This

Contents have been reproduced by permission of the publishers.