Current journal: arXiv - CS - Performance
  • tinyMD: A Portable and Scalable Implementation for Pairwise Interactions Simulations
    arXiv.cs.PF Pub Date : 2020-09-16
    Rafael Ravedutti L. Machado (Chair for System Simulation at University of Erlangen-Nürnberg); Jonas Schmitt (Chair for System Simulation at University of Erlangen-Nürnberg); Sebastian Eibl (Chair for System Simulation at University of Erlangen-Nürnberg); Jan Eitzinger (Regional Computer Center Erlangen at University of Erlangen-Nürnberg); Roland Leißa (Saarland Informatics Campus at Saarland University); Sebastian

    This paper investigates the suitability of the AnyDSL partial evaluation framework to implement tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application that scales very well on parallel supercomputers. We discuss the differences between both implementations and contrast miniMD's performance for single-node

  • Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-time Edge Computing
    arXiv.cs.PF Pub Date : 2020-09-13
    Karel Adámek; Jan Novotný; Jeyarajan Thiyagalingam; Wesley Armour

    The Square Kilometre Array (SKA) is an international initiative for developing the world's largest radio telescope with a total collecting area of over a million square meters. The scale of the operation, combined with the remote location of the telescope, requires the use of energy-efficient computational algorithms. This, along with the extreme data rates that will be produced by the SKA and the

  • Artery-C -- An OMNeT++ Based Discrete Event Simulation Framework for Cellular V2X
    arXiv.cs.PF Pub Date : 2020-09-12
    Anupama Hegde; Andreas Festag

    Cellular Vehicle-to-X (Cellular V2X) is a communication technology that aims to facilitate the communication among vehicles and with the roadside infrastructure. Introduced with LTE Release 14, Cellular V2X enables device-to-device communication to support road safety and traffic efficiency applications. We present Artery-C, a simulation framework for the performance evaluation of Cellular V2X protocols

  • Repeated Recursion Unfolding for Super-Linear Speedup within Bounds
    arXiv.cs.PF Pub Date : 2020-09-11
    Thom Fruehwirth

    Repeated recursion unfolding is a new approach that repeatedly unfolds a recursion with itself and simplifies it while keeping all unfolded rules. Each unfolding doubles the number of recursive steps covered. This reduces the number of recursive rule applications to its logarithm at the expense of introducing a logarithmic number of unfolded rules to the program. Efficiency crucially depends on the

  • Hierarchical Roofline Performance Analysis for Deep Learning Applications
    arXiv.cs.PF Pub Date : 2020-09-11
    Yunsong Wang; Charlene Yang; Steven Farrell; Thorsten Kurth; Samuel Williams

    This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses the extension of the Empirical Roofline Toolkit for more data precision support and Tensor Core support and introduces an Nsight Compute based method to accurately collect application performance information. This methodology allows for automated

  • Time-Based Roofline for Deep Learning Performance Analysis
    arXiv.cs.PF Pub Date : 2020-09-09
    Yunsong Wang; Charlene Yang; Steven Farrell; Yan Zhang; Thorsten Kurth; Samuel Williams

    Deep learning applications are usually very compute-intensive and require a long runtime for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. This approach is an extension of the Roofline model widely used in traditional

  • GPA: A GPU Performance Advisor Based on Instruction Sampling
    arXiv.cs.PF Pub Date : 2020-09-09
    Keren Zhou; Xiaozhu Meng; Ryuichi Sai; John Mellor-Crummey

    Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual lines, loops

  • Latency and Throughput Optimization in Modern Networks: A Comprehensive Survey
    arXiv.cs.PF Pub Date : 2020-09-01
    Amir Mirzaeinnia; Mehdi Mirzaeinia; Abdelmounaam Rezgui

    Modern applications are highly sensitive to communication delays and throughput. This paper surveys major attempts on reducing latency and increasing the throughput. These methods are surveyed on different networks and surroundings such as wired networks, wireless networks, application layer transport control, Remote Direct Memory Access, and machine learning based transport control.

  • Collaborative Management of Benchmark Instances and their Attributes
    arXiv.cs.PF Pub Date : 2020-09-07
    Markus Iser; Luca Springer; Carsten Sinz

    Experimental evaluation is an integral part in the design process of algorithms. Publicly available benchmark instances are widely used to evaluate methods in SAT solving. For the interpretation of results and the design of algorithm portfolios their attributes are crucial. Capturing the interrelation of benchmark instances and their attributes is considerably simplified through our specification of

  • Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs
    arXiv.cs.PF Pub Date : 2020-09-05
    Charlene Yang

    This paper surveys a range of methods to collect necessary performance data on Intel CPUs and NVIDIA GPUs for hierarchical Roofline analysis. As of mid-2020, two vendor performance tools, Intel Advisor and NVIDIA Nsight Compute, have integrated Roofline analysis into their supported feature set. This paper fills the gap for when these tools are not available, or when users would like a more customized
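
    As a hedged illustration of what the collected data feeds into: the standard Roofline bound caps attainable performance at the minimum of peak compute throughput and bandwidth times arithmetic intensity, and a hierarchical analysis repeats this per memory level. The sketch below assumes hypothetical level names and made-up numbers; it is not taken from the paper or from any vendor tool.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved at one memory level."""
    return flops / bytes_moved


def roofline_bound(peak_flops, bandwidth, intensity):
    """Attainable performance: min(compute roof, bandwidth roof)."""
    return min(peak_flops, bandwidth * intensity)


def hierarchical_roofline(peak_flops, bandwidths, flops, bytes_by_level):
    """Apply the Roofline bound once per memory level.

    bandwidths and bytes_by_level are dicts keyed by a level name
    (hypothetical names such as "L2" or "HBM").
    """
    return {
        level: roofline_bound(
            peak_flops,
            bandwidths[level],
            arithmetic_intensity(flops, bytes_by_level[level]),
        )
        for level in bandwidths
    }


# Illustrative numbers only (arbitrary units).
bounds = hierarchical_roofline(
    peak_flops=100.0,
    bandwidths={"L2": 20.0, "HBM": 5.0},
    flops=40.0,
    bytes_by_level={"L2": 4.0, "HBM": 10.0},
)
```

    In this toy case the kernel is compute-bound with respect to the faster level and bandwidth-bound with respect to the slower one.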

  • ScalAna: Automating Scaling Loss Detection with Graph Analysis
    arXiv.cs.PF Pub Date : 2020-09-03
    Yuyang Jin; Haojie Wang; Teng Yu; Xiongchao Tang; Torsten Hoefler; Xu Liu; Jidong Zhai

    Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks either base on profiling or tracing. Profiling incurs low overheads but does not capture detailed dependencies needed for root-cause analysis. Tracing collects all information at prohibitive overheads

  • Service Rate Region: A New Aspect of Coded Distributed System Design
    arXiv.cs.PF Pub Date : 2020-09-03
    Mehmet Aktas; Gauri Joshi; Swanand Kadhe; Fatemeh Kazemi; Emina Soljanin

    Erasure coding has been recently employed as a powerful method to mitigate delays due to slow or straggling nodes in distributed systems. In this work, we show that erasure coding of data objects can flexibly handle skews in the request rates. Coding can help boost the service rate region, that is, increase the overall volume of data access requests that can be handled by the system. The goal of this

  • Analysis of an M/G/1 system for the optimization of the RTG performances in the delivery of containers in Abidjan Terminal
    arXiv.cs.PF Pub Date : 2020-08-27
    Bakary Kone; Salimata Gueye Diagne; Dethie Dione; Coumba Diallo

    In front of the major challenges to increase its productivity while satisfying its customer, it is today important to establish in advance the operational performances of the RTG Abidjan Terminal. In this article, by using an M/G/1 retrial queue system, we obtained the average number of parked delivery trucks and as well as their waiting time. Finally, we used Matlab to represent them graphically then

  • Architectural Implications of Graph Neural Networks
    arXiv.cs.PF Pub Date : 2020-09-02
    Zhihui Zhang; Jingwen Leng; Lingxiao Ma; Youshan Miao; Chao Li; Minyi Guo

    Graph neural networks (GNN) represent an emerging line of deep learning models that operate on graph structures. It is becoming more and more popular due to its high accuracy achieved in many graph-related tasks. However, GNN is not as well understood in the system and architecture community as its counterparts such as multi-layer perceptrons and convolutional neural networks. This work tries to introduce

  • Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines
    arXiv.cs.PF Pub Date : 2020-09-01
    Sören Henning; Wilhelm Hasselbring

    Distributed stream processing engines are designed with a focus on scalability to process big data volumes in a continuous manner. We present the Theodolite method for benchmarking the scalability of distributed stream processing engines. Core of this method is the definition of use cases that microservices implementing stream processing have to fulfill. For each use case, our method identifies relevant

  • Performance portability through machine learning guided kernel selection in SYCL libraries
    arXiv.cs.PF Pub Date : 2020-08-30
    John Lawson

    Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware, however these techniques are typically focused on finding optimal kernel parameters for particular input sizes and parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so these techniques are of limited

  • Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
    arXiv.cs.PF Pub Date : 2020-08-31
    Sungsoo Ha; Wonyong Jeong; Gyorgy Matyasfalvi; Cong Xie; Kevin Huck; Jong Youl Choi; Abid Malik; Li Tang; Hubertus Van Dam; Line Pouchard; Wei Xu; Shinjae Yoo; Nicholas D'Imperio; Kerstin Kleese Van Dam

    Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace

  • CoShare: An Efficient Approach for Redundancy Allocation in NFV
    arXiv.cs.PF Pub Date : 2020-08-31
    Yordanos Tibebu Woldeyohannes; Besmir Tola; Yuming Jiang; K. K. Ramakrishnan

    An appealing feature of Network Function Virtualization (NFV) is that in an NFV-based network, a network function (NF) instance may be placed at any node. This, on the one hand, offers great flexibility in redundancy allocation to meet the availability requirements of flows; on the other hand, it makes the challenge unique and difficult. One particular highlight is that there is inherent correlation

  • Power and Performance Analysis of Persistent Key-Value Stores
    arXiv.cs.PF Pub Date : 2020-08-31
    Stella Mikrou; Anastasios Papagiannis; Giorgos Saloustros; Manolis Marazakis; Angelos Bilas

    With the current rate of data growth, processing needs are becoming difficult to fulfill due to CPU power and energy limitations. Data serving systems and especially persistent key-value stores have become a substantial part of data processing stacks in the data center, providing access to massive amounts of data for applications and services. Key-value stores exhibit high CPU and I/O overheads because

  • Analysis of Interference between RDMA and Local Access on Hybrid Memory System
    arXiv.cs.PF Pub Date : 2020-08-28
    Kazuichi Oe

    We can use a hybrid memory system consisting of DRAM and Intel Optane DC Persistent Memory (We call it DCPM in this paper) as DCPM is now commercially available since April 2019. Even if the latency for DCPM is several times higher than that for DRAM, the capacity for DCPM is several times higher than that for DRAM and the cost of DCPM is also several times lower than that for DRAM. In addition, DCPM

  • Quality of Service (QoS): Measurements of Video Streaming
    arXiv.cs.PF Pub Date : 2020-08-27
    Sajida Karim; Hui He; Asif Ali Laghari; Hina Madiha

    Nowadays video streaming is growing over the social clouds, where end-users always want to share High Definition (HD) videos among friends. Mostly videos were recorded via smartphones and other HD devices and short time videos have a big file size. The big file size of videos required high bandwidth to upload and download on the Internet and also required more time to load in a web page for play. So

  • Adaptive Neural Network-Based Approximation to Accelerate Eulerian Fluid Simulation
    arXiv.cs.PF Pub Date : 2020-08-26
    Wenqian Dong; Jie Liu; Zhen Xie; Dong Li

    The Eulerian fluid simulation is an important HPC application. The neural network has been applied to accelerate it. The current methods that accelerate the fluid simulation with neural networks lack flexibility and generalization. In this paper, we tackle the above limitation and aim to enhance the applicability of neural networks in the Eulerian fluid simulation. We introduce Smartfluidnet, a framework

  • Smart-PGSim: Using Neural Network to Accelerate AC-OPF Power Grid Simulation
    arXiv.cs.PF Pub Date : 2020-08-26
    Wenqian Dong; Zhen Xie; Gokcen Kestor; Dong Li

    The optimal power flow (OPF) problem is one of the most important optimization problems for the operation of the power grid. It calculates the optimum scheduling of the committed generation units. In this paper, we develop a neural network approach to the problem of accelerating the current optimal power flow (AC-OPF) by generating an intelligent initial solution. The high quality of the initial solution

  • Optimising AI Training Deployments using Graph Compilers and Containers
    arXiv.cs.PF Pub Date : 2020-08-26
    Nina Mujkanovic; Karthee Sivalingam; Alfio Lazzaro

    Artificial Intelligence (AI) applications based on Deep Neural Networks (DNN) or Deep Learning (DL) have become popular due to their success in solving problems like image analysis and speech recognition. Training a DNN is computationally intensive and High Performance Computing (HPC) has been a key driver in AI growth. Virtualisation and container technology have led to the convergence of cloud and

  • 8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks
    arXiv.cs.PF Pub Date : 2020-08-26
    Charlene Yang

    Performance optimization can be a daunting task especially as the hardware architecture becomes more and more complex. This paper takes a kernel from the Materials Science code BerkeleyGW, and demonstrates a few performance analysis and optimization techniques. Despite challenges such as high register usage, low occupancy, complex data access patterns, and the existence of several long-latency instructions

  • High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality
    arXiv.cs.PF Pub Date : 2020-08-26
    Maciej Besta; Armon Carigiet; Zur Vonarburg-Shmaria; Kacper Janda; Lukas Gianinazzi; Torsten Hoefler

    We develop the first parallel graph coloring heuristics with strong theoretical guarantees on work and depth and coloring quality. The key idea is to design a relaxation of the vertex degeneracy order, a well-known graph theory concept, and to color vertices in the order dictated by this relaxation. This introduces a tunable amount of parallelism into the degeneracy ordering that is otherwise hard

  • Optimized routines for event generators in QED-PIC codes
    arXiv.cs.PF Pub Date : 2020-08-24
    V. Volokitin; S. Bastrakov; A. Bashinov; E. Efimenko; A. Muraviev; A. Gonoskov; I. Meyerov

    In recent years, the prospects of performing fundamental and applied studies at the next-generation high-intensity laser facilities have greatly stimulated the interest in performing large-scale simulations of laser interaction with matter with the account for quantum electrodynamics (QED) processes such as emission of high energy photons and decay of such photons into electron-positron pairs. These

  • Tearing Down the Memory Wall
    arXiv.cs.PF Pub Date : 2020-08-24
    Zaid Qureshi; Vikram Sharma Mailthody; Seung Won Min; I-Hsin Chung; Jinjun Xiong; Wen-mei Hwu

    We present a vision for the Erudite architecture that redefines the compute and memory abstractions such that memory bandwidth and capacity become first-class citizens along with compute throughput. In this architecture, we envision coupling a high-density, massively parallel memory technology like Flash with programmable near-data accelerators, like the streaming multiprocessors in modern GPUs. Each

  • A Principled Approach to Design Using High Fidelity Fluid-Structure Interaction Simulations
    arXiv.cs.PF Pub Date : 2020-08-21
    Wensi Wu; Christophe Bonneville; Christopher J. Earls

    A high fidelity fluid-structure interaction simulation may require many days to run, on hundreds of cores. This poses a serious burden, both in terms of time and economic considerations, when repetitions of such simulations may be required (e.g. for the purpose of design optimization). In this paper we present strategies based on (constrained) Bayesian optimization (BO) to alleviate this burden. BO

  • Reinforcement Learning-based Admission Control in Delay-sensitive Service Systems
    arXiv.cs.PF Pub Date : 2020-08-21
    Majid Raeis; Ali Tizghadam; Alberto Leon-Garcia

    Ensuring quality of service (QoS) guarantees in service systems is a challenging task, particularly when the system is composed of more fine-grained services, such as service function chains. An important QoS metric in service systems is the end-to-end delay, which becomes even more important in delay-sensitive applications, where the jobs must be completed within a time deadline. Admission control

  • Optimal Load Balancing in Bipartite Graphs
    arXiv.cs.PF Pub Date : 2020-08-20
    Wentao Weng; Xingyu Zhou; R. Srikant

    Applications in cloud platforms motivate the study of efficient load balancing under job-server constraints and server heterogeneity. In this paper, we study load balancing on a bipartite graph where left nodes correspond to job types and right nodes correspond to servers, with each edge indicating that a job type can be served by a server. Thus edges represent locality constraints, i.e., each job

  • Accuracy and Performance Comparison of Video Action Recognition Approaches
    arXiv.cs.PF Pub Date : 2020-08-20
    Matthew Hutchinson; Siddharth Samsi; William Arcand; David Bestor; Bill Bergeron; Chansup Byun; Michael Houle; Matthew Hubbell; Michael Jones; Jeremy Kepner; Andrew Kirby; Peter Michaleas; Lauren Milechin; Julie Mullen; Andrew Prout; Antonio Rosa; Albert Reuther; Charles Yee; Vijay Gadepally

    Over the past few years, there has been significant interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remain clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-the-art

  • An In-Depth Analysis of the Slingshot Interconnect
    arXiv.cs.PF Pub Date : 2020-08-20
    Daniele De Sensi; Salvatore Di Girolamo; Kim H. McMahon; Duncan Roweth; Torsten Hoefler

    The interconnect is one of the most critical components in large scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we will describe Slingshot, an interconnection network for large scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenters networks with

  • High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip
    arXiv.cs.PF Pub Date : 2020-08-20
    Kris Nikov (University of Bristol, UK); Mohammad Hosseinabady (University of Bristol, UK); Rafael Asenjo (Universidad de Málaga, Spain); Andrés Rodríguez (Universidad de Málaga, Spain); Angeles Navarro (Universidad de Málaga, Spain); Jose Nunez-Yanez (University of Bristol, UK)

    This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17%

  • FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices
    arXiv.cs.PF Pub Date : 2020-08-19
    Haoran Qiu; Subho S. Banerjee; Saurabh Jha; Zbigniew T. Kalbarczyk; Ravishankar K. Iyer

    Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute-resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests

  • Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse Linear Algebra Computations
    arXiv.cs.PF Pub Date : 2020-08-19
    Yuhsiang Mike Tsai; Terry Cojean; Hartwig Anzt

    GPU accelerators have become an important backbone for scientific high performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA's newest server line GPU, the A100 architecture part of the Ampere generation. Specifically, we assess its performance for sparse linear algebra operations that form the backbone

  • Benchmarking network fabrics for data distributed training of deep neural networks
    arXiv.cs.PF Pub Date : 2020-08-18
    Siddharth Samsi; Andrew Prout; Michael Jones; Andrew Kirby; Bill Arcand; Bill Bergeron; David Bestor; Chansup Byun; Vijay Gadepally; Michael Houle; Matthew Hubbell; Anna Klein; Peter Michaleas; Lauren Milechin; Julie Mullen; Antonio Rosa; Charles Yee; Albert Reuther; Jeremy Kepner

    Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple

  • A Microservices Architecture for Distributed Complex Event Processing in Smart Cities
    arXiv.cs.PF Pub Date : 2020-08-17
    Fernando Freire Scattone; Kelly Rosa Braghetto

    A considerable volume of data is collected from sensors today and needs to be processed in real time. Complex Event Processing (CEP) is one of the most important techniques developed for this purpose. In CEP, each new sensor measurement is considered an event and new event types can be defined based on other events occurrence. There exists several open-source CEP implementations currently available

  • Load Balancing Under Strict Compatibility Constraints
    arXiv.cs.PF Pub Date : 2020-08-17
    Daan Rutten; Debankur Mukherjee

    We study large-scale systems operating under the JSQ$(d)$ policy in the presence of stringent task-server compatibility constraints. Consider a system with $N$ identical single-server queues and $M(N)$ task types, where each server is able to process only a small subset of possible task types. Each arriving task selects $d\geq 2$ random servers compatible to its type, and joins the shortest queue among
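
    The dispatching rule described above can be sketched as follows. This is a minimal illustration, not the authors' model: sampling without replacement, tie-breaking by sample order, and the names `queue_lengths` and `compatible_servers` are all assumptions made here.

```python
import random


def jsq_d_assign(queue_lengths, compatible_servers, d=2, rng=random):
    """Return the server a new task joins under a JSQ(d)-style rule.

    queue_lengths: list of current queue lengths, indexed by server.
    compatible_servers: servers able to process this task's type.
    d: number of compatible servers sampled (assumed without replacement).
    """
    candidates = list(compatible_servers)
    k = min(d, len(candidates))       # cannot sample more than exist
    sampled = rng.sample(candidates, k)
    # Join the shortest queue among the sampled servers.
    return min(sampled, key=lambda s: queue_lengths[s])


# Server 1 has the shortest queue overall but is incompatible,
# so it is never chosen for this task type.
queues = [5, 0, 3, 7]
chosen = jsq_d_assign(queues, compatible_servers=[0, 2, 3], d=3)
```

    With `d` equal to the number of compatible servers the rule degenerates to joining the shortest compatible queue; smaller `d` trades balance quality for less state inspection.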

  • Erlang Redux: An Ansatz Method for Solving the M/M/m Queue
    arXiv.cs.PF Pub Date : 2020-08-16
    Neil J. Gunther

    This exposition presents a novel approach to solving an M/M/m queue for the waiting time and the residence time. The motivation comes from an algebraic solution for the residence time of the M/M/1 queue. The key idea is the introduction of an ansatz transformation, defined in terms of the Erlang B function, that avoids the more opaque derivation based on applied probability theory. The only prerequisite
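
    For reference, the textbook route to the same quantities can be sketched numerically: the Erlang B function via its standard recursion, the Erlang C waiting probability derived from it, and from those the mean waiting and residence times. This is the conventional formulation, not the ansatz method of the paper; the function names are chosen here for illustration.

```python
def erlang_b(m, a):
    """Erlang B blocking probability for m servers, offered load a = lam / mu."""
    b = 1.0  # B(0, a) = 1
    for k in range(1, m + 1):
        b = a * b / (k + a * b)
    return b


def erlang_c(m, a):
    """Erlang C probability that an arrival has to wait (requires a < m)."""
    if a >= m:
        raise ValueError("unstable queue: offered load must be below m")
    rho = a / m
    b = erlang_b(m, a)
    return b / (1.0 - rho * (1.0 - b))


def mmm_times(lam, mu, m):
    """Return (mean waiting time, mean residence time) for an M/M/m queue."""
    a = lam / mu
    wait = erlang_c(m, a) / (m * mu - lam)
    return wait, wait + 1.0 / mu


w1, r1 = mmm_times(0.5, 1.0, 1)  # M/M/1 special case
w2, r2 = mmm_times(1.0, 1.0, 2)  # M/M/2 at 50% utilization
```

    For m = 1 the result reduces to the familiar M/M/1 residence time 1 / (mu - lam), which is a convenient sanity check.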

  • In-situ Workflow Auto-tuning via Combining Performance Models of Component Applications
    arXiv.cs.PF Pub Date : 2020-08-16
    Tong Shu; Yanfei Guo; Justin Wozniak; Xiaoning Ding; Ian Foster; Tahsin Kurc

    In-situ parallel workflows couple multiple component applications, such as simulation and analysis, via streaming data transfer in order to avoid data exchange via shared file systems. Such workflows are challenging to configure for optimal performance due to the large space of possible configurations. Expert experience is rarely sufficient to identify optimal configurations, and existing empirical

  • Toward an End-to-End Auto-tuning Framework in HPC PowerStack
    arXiv.cs.PF Pub Date : 2020-08-14
    Xingfu Wu; Aniruddha Marathe; Siddhartha Jana; Ondrej Vysocky; Jophin John; Andrea Bartolini; Lubomir Riha; Michael Gerndt; Valerie Taylor; Sridutt Bhalachandra

    Efficiently utilizing procured power and optimizing performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack. This survey paper presents the findings of a working group focused on the

  • Consideration for effectively handling parallel workloads on public cloud system
    arXiv.cs.PF Pub Date : 2020-08-14
    Kazuichi Oe

    We retrieved and analyzed parallel storage workloads of the FUJITSU K5 cloud service to clarify how to build cost-effective hybrid storage systems. A hybrid storage system consists of fast but low-capacity tier (first tier) and slow but high-capacity tier (second tier). And, it typically consists of either SSDs and HDDs or NVMs and SSDs. As a result, we found that 1) regions for first tier should be

  • FLCD: A Flexible Low Complexity Design of Coded Distributed Computing
    arXiv.cs.PF Pub Date : 2020-08-13
    Nicholas Woolsey; Xingyue Wang; Rong-Rong Chen; Mingyue Ji

    We propose a flexible low complexity design (FLCD) of coded distributed computing (CDC) with empirical evaluation on Amazon Elastic Compute Cloud (Amazon EC2). CDC can expedite MapReduce like computation by trading increased map computations to reduce communication load and shuffle time. A main novelty of FLCD is to utilize the design freedom in defining map and reduce functions to develop asymptotic

  • Study on State-of-the-art Cloud Services Integration Capabilities with Autonomous Ground Vehicles
    arXiv.cs.PF Pub Date : 2020-08-11
    Praveen Damacharla; Dhwani Mehta; Ahmad Y Javaid; Vijay K. Devabhaktuni

    Computing and intelligence are substantial requirements for the accurate performance of autonomous ground vehicles (AGVs). In this context, the use of cloud services in addition to onboard computers enhances computing and intelligence capabilities of AGVs. In addition, the vast amount of data processed in a cloud system contributes to overall performance and capabilities of the onboard system. This

  • Reinforced Wasserstein Training for Severity-Aware Semantic Segmentation in Autonomous Driving
    arXiv.cs.PF Pub Date : 2020-08-11
    Xiaofeng Liu; Yimeng Zhang; Xiongchang Liu; Song Bai; Site Li; Jane You

    Semantic segmentation is important for many real-world systems, e.g., autonomous vehicles, which predict the class of each pixel. Recently, deep networks achieved significant progress w.r.t. the mean Intersection-over-Union (mIoU) with the cross-entropy loss. However, the cross-entropy loss can essentially ignore the difference of severity for an autonomous car with different wrong prediction mistakes

  • PROFIT: A Novel Training Method for sub-4-bit MobileNet Models
    arXiv.cs.PF Pub Date : 2020-08-11
    Eunhyeok Park; Sungjoo Yoo

    4-bit and lower precision mobile models are required due to the ever-increasing demand for better energy efficiency in mobile devices. In this work, we report that the activation instability induced by weight quantization (AIWQ) is the key obstacle to sub-4-bit quantization of mobile networks. To alleviate the AIWQ problem, we propose a novel training method called PROgressive-Freezing Iterative Training

  • Performance Analysis of Priority-Aware NoCs with Deflection Routing under Traffic Congestion
    arXiv.cs.PF Pub Date : 2020-08-10
    Sumit K. Mandal; Anish Krishnakumar; Raid Ayoub; Michael Kishinevsky; Umit Y. Ogras

    Priority-aware networks-on-chip (NoCs) are used in industry to achieve predictable latency under different workload conditions. These NoCs incorporate deflection routing to minimize queuing resources within routers and achieve low latency during low traffic load. However, deflected packets can exacerbate congestion during high traffic load since they consume the NoC bandwidth. State-of-the-art analytical

  • BSF: a parallel computation model for scalability estimation of iterative numerical algorithms on cluster computing systems
    arXiv.cs.PF Pub Date : 2020-08-08
    Leonid B. Sokolinsky

    This paper examines a new parallel computation model called bulk synchronous farm (BSF) that focuses on estimating the scalability of compute-intensive iterative algorithms aimed at cluster computing systems. In the BSF model, a computer is a set of processor nodes connected by a network and organized according to the master/slave paradigm. A cost metric of the BSF model is presented. This cost metric

  • Achievable Stability in Redundancy Systems
    arXiv.cs.PF Pub Date : 2020-08-08
    Youri Raaijmakers; Sem Borst

    We consider a system with $N$ parallel servers where incoming jobs are immediately replicated to, say, $d$ servers. Each of the $N$ servers has its own queue and follows a FCFS discipline. As soon as the first job replica is completed, the remaining replicas are abandoned. We investigate the achievable stability region for a quite general workload model with different job types and heterogeneous servers

  • High performance on-demand de-identification of a petabyte-scale medical imaging data lake
    arXiv.cs.PF Pub Date : 2020-08-04
    Joseph Mesterhazy; Garrick Olson; Somalee Datta

    With the increase in Artificial Intelligence driven approaches, researchers are requesting unprecedented volumes of medical imaging data which far exceed the capacity of traditional on-premise client-server approaches for making the data research analysis-ready. We are making available a flexible solution for on-demand de-identification that combines the use of mature software technologies with modern

  • A Learned Performance Model for the Tensor Processing Unit
    arXiv.cs.PF Pub Date : 2020-08-03
    Samuel J. Kaufman; Phitchaya Mangpo Phothilimthana; Yanqi Zhou; Mike Burrows

    Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration of a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators

  • A Survey on the Evolution of Stream Processing Systems
    arXiv.cs.PF Pub Date : 2020-08-03
    Marios Fragkoulis; Paris Carbone; Vasiliki Kalavri; Asterios Katsifodimos

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state

  • Custom Tailored Suite of Random Forests for Prefetcher Adaptation
    arXiv.cs.PF Pub Date : 2020-08-01
    Furkan Eris; Sadullah Canakci; Cansu Demirkiran; Ajay Joshi

    To close the gap between memory and processors, and in turn improve performance, there has been an abundance of work in the area of data/instruction prefetcher designs. Prefetchers are deployed in each level of the memory hierarchy, but typically, each prefetcher gets designed without comprehensively accounting for other prefetchers in the system. As a result, these individual prefetcher designs do

  • The Effect of TCP Variants on the Coexistence of MMORPG and Best-Effort Traffic
    arXiv.cs.PF Pub Date : 2020-07-30
    Jose Saldana; Mirko Suznjevic; Luis Sequeira; Julian Fernandez-Navajas; Maja Matijasevic; Jose Ruiz-Mas

    We study the coexistence of TCP flows between Massive Multiplayer Online Role Playing Games (MMORPGs) and other TCP applications, taking World of Warcraft (WoW) and a file transfer application based on File Transfer Protocol (FTP) as an example. Our focus is on the effects of the sender buffer size and FTP cross-traffic on the queuing delay experienced by the MMORPG traffic. A network scenario corresponding

  • Delay and Price Differentiation in Cloud Computing: A Service Model, Supporting Architectures, and Performance
    arXiv.cs.PF Pub Date : 2020-07-30
    Xiaohu Wu; Francesco De Pellegrini; Giuliano Casale

    Many cloud service providers (CSPs) provide on-demand service at a price with a small delay. We propose a QoS-differentiated model where multiple SLAs deliver both on-demand service for latency-critical users and delayed services for delay-tolerant users at lower prices. Two architectures are considered to fulfill SLAs. The first is based on priority queues. The second simply separates servers into

  • Traffic Optimization for TCP-based Massive Multiplayer Online Games
    arXiv.cs.PF Pub Date : 2020-07-30
    Jose Saldana; Luis Sequeira; Julian Fernandez-Navajas; Jose Ruiz-Mas

    This paper studies the use of a traffic optimization technique named TCM (Tunneling, Compressing and Multiplexing) to reduce the bandwidth of MMORPGs (Massively Multiplayer Online Role-Playing Games), which employ TCP to provide a soft real-time service. In order to optimize the traffic and to improve bandwidth efficiency, TCM can be applied when the packets of a number of players share the same link

  • Implications of Dissemination Strategies on the Security of Distributed Ledgers
    arXiv.cs.PF Pub Date : 2020-07-30
    Luca Serena; Gabriele D'Angelo; Stefano Ferretti

    This paper describes a simulation study on security attacks over Distributed Ledger Technologies (DLTs). We specifically focus on attacks at the underlying peer-to-peer layer of these systems, which is in charge of disseminating messages containing data and transactions to be spread among all participants. In particular, we consider the Sybil attack, according to which a malicious node creates many Sybils

  • Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic
    arXiv.cs.PF Pub Date : 2020-07-28
    Sumit K. Mandal; Raid Ayoub; Michael Kishinevsky; Mohammad M. Islam; Umit Y. Ogras

    Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware

  • Monocular Real-Time Volumetric Performance Capture
    arXiv.cs.PF Pub Date : 2020-07-28
    Ruilong Li; Yuliang Xiu; Shunsuke Saito; Zeng Huang; Kyle Olszewski; Hao Li

    We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution

Contents have been reproduced by permission of the publishers.