-
HPC AI500: Representative, Repeatable and Simple HPC AI Benchmarking arXiv.cs.PF Pub Date : 2021-02-25 Zihan Jiang; Wanling Gao; Fei Tang; Xingwang Xiong; Lei Wang; Chuanxin Lan; Chunjie Luo; Hongxiao Li; Jianfeng Zhan
Recent years witness a trend of applying large-scale distributed deep learning algorithms (HPC AI) in both business and scientific computing areas, whose goal is to speed up the training time to achieve a state-of-the-art quality. The HPC AI benchmarks accelerate the process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges. This paper presents a representative, repeatable
-
Performance Comparison for Scientific Computations on the Edge via Relative Performance arXiv.cs.PF Pub Date : 2021-02-25 Aravind Sankaran; Paolo Bientinesi
In a typical Internet-of-Things setting that involves scientific applications, a target computation can be evaluated in many different ways depending on the split of computations among various devices. On the one hand, different implementations (or algorithms)--equivalent from a mathematical perspective--might exhibit significant difference in terms of performance. On the other hand, some of the implementations
-
Reading from External Memory arXiv.cs.PF Pub Date : 2021-02-22 Ruslan Savchenko
Modern external memory is represented by several device classes. At present, HDD, SATA SSD and NVMe SSD are widely used. Recently ultra-low latency SSD such as Intel Optane became available on the market. Each of these types exhibits its own pattern for throughput, latency and parallelism. To achieve the highest performance one has to pick an appropriate I/O interface provided by the operating system
-
Trust Computational Heuristic for Social Internet of Things: A Machine Learning-based Approach arXiv.cs.PF Pub Date : 2021-02-03 Subhash Sagar; Adnan Mahmood; Quan Z. Sheng; Wei Emma Zhang
The Internet of Things (IoT) is an evolving network of billions of interconnected physical objects, such as numerous sensors, smartphones, wearables, and embedded devices. These physical objects, generally referred to as smart objects, aggregate useful information from their surrounding environment when deployed in the real world. As of late, this notion of IoT has been extended to incorporate
-
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics arXiv.cs.PF Pub Date : 2021-02-22 Subho S. Banerjee; Saurabh Jha; Zbigniew T. Kalbarczyk; Ravishankar K. Iyer
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in
-
Residual-Aided End-to-End Learning of Communication System without Known Channel arXiv.cs.PF Pub Date : 2021-02-22 Hao Jiang; Shuangkaisheng Bi; Linglong Dai
Leveraging powerful deep learning techniques, the end-to-end (E2E) learning of communication system is able to outperform the classical communication system. Unfortunately, this communication system cannot be trained by deep learning without known channel. To deal with this problem, a generative adversarial network (GAN) based training scheme has been recently proposed to imitate the real channel.
-
FlexClock: Generic Clock Reconfiguration for Low-end IoT Devices arXiv.cs.PF Pub Date : 2021-02-20 Michel Rottleuthner; Thomas C. Schmidt; Matthias Wählisch
Clock configuration within constrained general-purpose microcontrollers takes a key role in tuning performance, power consumption, and timing accuracy of applications in the Internet of Things (IoT). Subsystems governing the underlying clock tree must nonetheless cope with a huge parameter space, complex dependencies, and dynamic constraints. Manufacturers expose the underlying functions in very diverse
-
ALTO: Adaptive Linearized Storage of Sparse Tensors arXiv.cs.PF Pub Date : 2021-02-20 Ahmed E. Helal; Jan Laukemann; Fabio Checconi; Jesmin Jahan Tithi; Teresa Ranadive; Fabrizio Petrini; Jeewhan Choi
The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space
-
DeepScaleTool : A Tool for the Accurate Estimation of Technology Scaling in the Deep-Submicron Era arXiv.cs.PF Pub Date : 2021-02-19 Satyabrata Sarangi; Bevan Baas
The estimation of classical CMOS "constant-field" or "Dennard" scaling methods that define scaling factors for various dimensional and electrical parameters have become less accurate in the deep-submicron regime, which drives the need for better estimation approaches especially in the educational and research domains. We present DeepScaleTool, a tool for the accurate estimation of deep-submicron technology
-
Anytime Diagnosis for Reconfiguration arXiv.cs.PF Pub Date : 2021-02-19 Alexander Felfernig; Rouven Walter; Jose A. Galindo; David Benavides; Seda Polat-Erdeniz; Muesluem Atas; Stefan Reiterer
Many domains require scalable algorithms that help to determine diagnoses efficiently and often within predefined time limits. Anytime diagnosis is able to determine solutions in such a way and thus is especially useful in real-time scenarios such as production scheduling, robot control, and communication networks management where diagnosis and corresponding reconfiguration capabilities play a major
-
Latency Modeling of Hyperledger Fabric for Blockchain-based IoT (BC-IoT) Networks arXiv.cs.PF Pub Date : 2021-02-18 Sungho Lee; Minsu Kim; Jemin Lee; Ruei-Hau Hsu; Tony Q. S. Quek
With the worldwide growth of IoT industry, the need for a strong security level for IoT networks has also increased, leading to blockchain-based IoT (BC-IoT) networks. While blockchain technology is leveraged to ensure data integrity in a distributed manner, Hyperledger Fabric (HLF) attracts attention with its distinctive strong point without requiring the power-consuming consensus protocol, that is
-
SonicChain: A Wait-free, Pseudo-Static Approach Toward Concurrency in Blockchains arXiv.cs.PF Pub Date : 2021-02-06 Kian Paimani
Blockchains have a two-sided reputation: they are praised for disrupting some of our institutions through innovative technology for good, yet notorious for being slow and expensive to use. In this work, we tackle this issue with concurrency, yet we aim to take a radically different approach by valuing simplicity. We embrace the simplicity through two steps: first, we formulate a simple runtime mechanism
-
Performance Optimizations of Recursive Electronic Structure Solvers targeting Multi-Core Architectures (LA-UR-20-26665) arXiv.cs.PF Pub Date : 2021-02-17 Adetokunbo A. Adedoyin; Christian F. A. Negre; Jamaludin Mohd-Yusof; Nicolas Bock; Daniel Osei-Kuffuor; Jean-Luc Fattebert; Michael E. Wall; Anders M. N. Niklasson; Susan M. Mniszewski
As we rapidly approach the frontiers of ultra large computing resources, software optimization is becoming of paramount interest to scientific application developers interested in efficiently leveraging all available on-Node computing capabilities and thereby improving a requisite science per watt metric. The scientific application of interest here is the Basic Math Library (BML) that provides a singular
-
Distributed Fair Scheduling for Information Exchange in Multi-Agent Systems arXiv.cs.PF Pub Date : 2021-02-17 Majid Raeis; S. Jamaloddin Golestani
Information exchange is a crucial component of many real-world multi-agent systems. However, the communication between the agents involves two major challenges: the limited bandwidth, and the shared communication medium between the agents, which restricts the number of agents that can simultaneously exchange information. While both of these issues need to be addressed in practice, the impact of the
-
Large-Scale Benchmarks for the Job Shop Scheduling Problem arXiv.cs.PF Pub Date : 2021-01-25 Giacomo Da Col; Erich Teppan
This report contains the description of two novel job shop scheduling benchmarks that resemble instances of real scheduling problem as they appear in industry. In particular, the aim was to provide large-scale benchmarks (up to 1 million operations) to test the state-of-the-art scheduling solutions on problems that are closer to what occurs in a real industrial context. The first benchmark is an extension
-
AdEle: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs arXiv.cs.PF Pub Date : 2021-02-16 Ebadollah Taheri; Ryan G. Kim; Mahdi Nikdast
By lowering the number of vertical connections in fully connected 3D networks-on-chip (NoCs), partially connected 3D NoCs (PC-3DNoCs) help alleviate reliability and fabrication issues. This paper proposes a novel, adaptive congestion- and energy-aware elevator-selection scheme called AdEle to improve the traffic distribution in PC-3DNoCs. AdEle employs an offline multi-objective simulated-annealing-based
-
An In-Depth Investigation of Performance Characteristics of Hyperledger Fabric arXiv.cs.PF Pub Date : 2021-02-15 Tobias Guggenberger; Johannes Sedlmeir; Gilbert Fridgen; André Luckow
Private permissioned blockchains, such as Hyperledger Fabric, are widely deployed across the industry to facilitate cross-organizational processes and promise improved performance compared to their public counterparts. However, the lack of empirical and theoretical results prevents precise prediction of the real-world performance. We address this gap by conducting an in-depth performance analysis of
-
On the Impact of Device and Behavioral Heterogeneity in Federated Learning arXiv.cs.PF Pub Date : 2021-02-15 Ahmed M. Abdelmoniem; Chen-Yu Ho; Pantelis Papageorgiou; Muhammad Bilal; Marco Canini
Federated learning (FL) is becoming a popular paradigm for collaborative learning over distributed, private datasets owned by non-trusting entities. FL has seen successful deployment in production environments, and it has been adopted in services such as virtual keyboards, auto-completion, item recommendation, and several IoT applications. However, FL comes with the challenge of performing training
-
Comparative Code Structure Analysis using Deep Learning for Performance Prediction arXiv.cs.PF Pub Date : 2021-02-12 Nathan Pinnow; Tarek Ramadan; Tanzima Z. Islam; Chase Phelps; Jayaraman J. Thiagarajan
Performance analysis has always been an afterthought during the application development process, focusing on application correctness first. The learning curve of the existing static and dynamic analysis tools are steep, which requires understanding low-level details to interpret the findings for actionable optimizations. Additionally, application performance is a function of an infinite number of unknowns
-
T-RACKs: A Faster Recovery Mechanism for TCP in Data Center Networks arXiv.cs.PF Pub Date : 2021-02-15 Ahmed M. Abdelmoniem; Brahim Bensaou
Cloud interactive data-driven applications generate swarms of small TCP flows that compete for the small buffer space in data-center switches. Such applications require a short flow completion time (FCT) to perform their jobs effectively. However, TCP is oblivious to the composite nature of application data and artificially inflates the FCT of such flows by several orders of magnitude. This is due
-
NumaPerf: Predictive and Full NUMA Profiling arXiv.cs.PF Pub Date : 2021-02-10 Xin Zhao (University of Massachusetts Amherst); Jin Zhou (University of Massachusetts Amherst); Hui Guan (University of Massachusetts Amherst); Wei Wang (University of Texas at San Antonio); Xu Liu (North Carolina State University); Tongping Liu (University of Massachusetts Amherst)
Parallel applications are extremely challenging to achieve the optimal performance on the NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. This paper proposes a novel profiling tool - NumaPerf - that overcomes these issues. NumaPerf aims to identify
-
Automated and Distributed Statistical Analysis of Economic Agent-Based Models arXiv.cs.PF Pub Date : 2021-02-10 Andrea Vandin; Daniele Giachini; Francesco Lamperti; Francesca Chiaromonte
We propose a novel approach to the statistical analysis of simulation models and, especially, agent-based models (ABMs). Our main goal is to provide a fully automated and model-independent tool-kit to inspect simulations and perform counterfactual analysis. Our approach: (i) is easy-to-use by the modeller, (ii) improves reproducibility of results, (iii) optimizes running time given the modeller's machine
-
Using hardware performance counters to speed up autotuning convergence on GPUs arXiv.cs.PF Pub Date : 2021-02-10 Jiří Filipovič; Jana Hozzová; Amin Nezarat; Jaroslav Oľha; Filip Petrovič
Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization
-
Multi-GPU SNN Simulation with Perfect Static Load Balancing arXiv.cs.PF Pub Date : 2021-02-09 Dennis Bautembach; Iason Oikonomidis; Antonis Argyros
We present a SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs. This is made possible by 1) a novel, cache-aware spike transmission algorithm 2) a model parallel multi-GPU distribution scheme and 3) a static, yet very effective load balancing strategy. The simulator further features an easy to use API and the ability to create custom models. We compare the proposed
-
DV-DVFS: Merging Data Variety and DVFS Technique to Manage the Energy Consumption of Big Data Processing arXiv.cs.PF Pub Date : 2021-02-07 Hossein Ahmadvand; Fouzhan Foroutan; Mahmood Fathy
Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU consumption. This issue has been overlooked in previous works. To overcome the mentioned problem, in the present work, we used Dynamic
-
Revocation Statuses on the Internet arXiv.cs.PF Pub Date : 2021-02-08 Nikita Korzhitskii; Niklas Carlsson
The modern Internet is highly dependent on the trust communicated via X.509 certificates. However, in some cases certificates become untrusted and it is necessary to revoke them. In practice, the problem of secure certificate revocation has not yet been solved, and today no revocation procedure (similar to Certificate Transparency w.r.t. certificate issuance) has been adopted to provide transparent
-
Dynamic Performance Management: An Approach for Managing the Common Goods arXiv.cs.PF Pub Date : 2021-02-08 A. Sardi; E. Sorano
Public organizations need innovative approaches for managing common goods and to explain the dynamics linking the (re)generation of common goods and organizational performance. Although system dynamics is recognised as a useful approach for managing common goods, public organizations rarely adopt the system dynamics for this goal. The paper aims to review the literature on the system dynamics and its
-
Estimate The Efficiency Of Multiprocessor's Cash Memory Work Algorithms arXiv.cs.PF Pub Date : 2021-02-07 Mohamed A. Hamada; Abdelrahman Abdallah
Many computer systems for calculating the proper organization of memory are among the most critical issues. Using a tier cache memory (along with branching prediction) is an effective means of increasing modern multi-core processors' performance. Designing high-performance processors is a complex task and requires preliminary verification and analysis of the model level, usually used in analytical
-
A Newcomer In The PGAS World -- UPC++ vs UPC: A Comparative Study arXiv.cs.PF Pub Date : 2021-02-06 Jérémie Lagravière; Johannes Langguth; Martina Prugger; Phuong H. Ha; Xing Cai
A newcomer in the Partitioned Global Address Space (PGAS) 'world' has arrived in its version 1.0: Unified Parallel C++ (UPC++). UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, futures and promises. UPC++ API for moving non-contiguous data and handling memories with different optimal
-
Matching Impatient and Heterogeneous Demand and Supply arXiv.cs.PF Pub Date : 2021-02-04 Angelos Aveklouris; Levi DeValve; Amy R. Ward; Xiaofan Wu
Service platforms must determine rules for matching heterogeneous demand (customers) and supply (workers) that arrive randomly over time and may be lost if forced to wait too long for a match. We show how to balance the trade-off between making a less good match quickly and waiting for a better match, at the risk of losing impatient customers and/or workers. When the objective is to maximize the cumulative
-
Function Delivery Network: Extending Serverless Computing for Heterogeneous Platforms arXiv.cs.PF Pub Date : 2021-02-03 Anshul Jindal; Michael Gerndt; Mohak Chadha; Vladimir Podolskiy; Pengfei Chen
Serverless computing has rapidly grown following the launch of Amazon's Lambda platform. Function-as-a-Service (FaaS), a key enabler of serverless computing, allows an application to be decomposed into simple, standalone functions that are executed on a FaaS platform. The FaaS platform is responsible for deploying and facilitating resources to the functions. Several of today's cloud applications spread
-
Pick the Right Edge Device: Towards Power and Performance Estimation of CUDA-based CNNs on GPGPUs arXiv.cs.PF Pub Date : 2021-02-02 Christopher A. Metz; Mehran Goli; Rolf Drechsler
The emergence of Machine Learning (ML) as a powerful technique has been helping nearly all fields of business to increase operational efficiency or to develop new value propositions. Besides the challenges of deploying and maintaining ML models, picking the right edge device (e.g., GPGPUs) to run these models (e.g., CNN with the massive computational process) is one of the most pressing challenges
-
Performance Measurements within Asynchronous Task-based Runtime Systems: A Double White Dwarf Merger as an Application arXiv.cs.PF Pub Date : 2021-01-30 Patrick Diehl; Dominic Marcello; Parsa Armini; Hartmut Kaiser; Sagiv Shiber; Geoffrey C. Clayton; Juhan Frank; Gregor Daiß; Dirk Pflüger; David Eder; Alice Koniges; Kevin Huck
Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Octo-Tiger
-
Understanding Cache Boundness of ML Operators on ARM Processors arXiv.cs.PF Pub Date : 2021-02-01 Bernhard Klein; Christoph Gratl; Manfred Mücke; Holger Fröning
Machine Learning compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations of typical compute-intense operators in ML workloads to design a proper solution. This is the first in-detail analysis of dense and convolution operators, generated
-
Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach arXiv.cs.PF Pub Date : 2021-01-31 Geoffrey X. Yu; Yubo Gao; Pavel Golikov; Gennady Pekhimenko
Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with competing concerns: maximizing compute performance while minimizing costs. In this work, we present a new practical technique to help users make informed and cost-efficient
-
BAMSim Simulator arXiv.cs.PF Pub Date : 2021-01-30 Rafael F. Reale; Walter P. Neto; Joberto S. B. Martins
Resource allocation is an essential design aspect for current systems and bandwidth allocation is an essential design aspect in multi-protocol label switched and OpenFlow/SDN network infrastructures. The bandwidth allocation models (BAMs) are an alternative to allocate and share bandwidth among network users. BAMs have an extensive number of parameters that need to be defined and tuned to achieve an
-
No-Regret Caching via Online Mirror Descent arXiv.cs.PF Pub Date : 2021-01-29 Tareq Si Salem; Giovanni Neglia; Stratis Ioannidis
We study an online caching problem in which requests can be served by a local cache to avoid retrieval costs from a remote server. The cache can update its state after a batch of requests and store an arbitrarily small fraction of each content. We study no-regret algorithms based on Online Mirror Descent (OMD) strategies. We show that the optimal OMD strategy depends on the request diversity present
-
A Model of WiFi Performance With Bounded Latency arXiv.cs.PF Pub Date : 2021-01-29 Bjørn Ivar Teigen; Neil Davies; Kai Olav Ellefsen; Tor Skeie; Jim Torresen
In September 2020, the Broadband Forum published a new industry standard for measuring network quality. The standard centers on the notion of quality attenuation. Quality attenuation is a measure of the distribution of latency and packet loss between two points connected by a network path. A vital feature of the quality attenuation idea is that we can express detailed application requirements and network
-
A New Approach to Capacity Scaling Augmented With Unreliable Machine Learning Predictions arXiv.cs.PF Pub Date : 2021-01-28 Daan Rutten; Debankur Mukherjee
Modern data centers suffer from immense power consumption. The erratic behavior of internet traffic forces data centers to maintain excess capacity in the form of idle servers in case the workload suddenly increases. As an idle server still consumes a significant fraction of the peak energy, data center operators have heavily invested in capacity scaling solutions. In simple terms, these aim to deactivate
-
Random-Mode Frank-Wolfe Algorithm for Tensor Completion in Wireless Edge Caching arXiv.cs.PF Pub Date : 2021-01-28 Navneet Garg; Tharmalingam Ratnarajah
Wireless edge caching is a popular strategy to avoid backhaul congestion in the next generation networks, where the content is cached in advance at the base stations to fulfil the redundant requests during peak periods. In the edge caching data, the missing observations are inevitable due to dynamic selective popularity. Among the completion methods, the tensor-based models have been shown to be the
-
Enhancing Application Performance by Memory Partitioning in Android Platforms arXiv.cs.PF Pub Date : 2021-01-26 Geunsik Lim; Changwoo Min; Young Ik Eom
This paper suggests a new memory partitioning scheme that can enhance process lifecycle, while avoiding Low Memory Killer and Out-of-Memory Killer operations on mobile devices. Our proposed scheme offers the complete concept of virtual memory nodes in operating systems of Android devices.
-
Personal Data Access Control Through Distributed Authorization arXiv.cs.PF Pub Date : 2021-01-25 Mirko Zichichi; Stefano Ferretti; Gabriele D'Angelo; Víctor Rodríguez-Doncel
This paper presents an architecture of a Personal Information Management System, in which individuals can define the access to their personal data by means of smart contracts. These smart contracts, running on the Ethereum blockchain, implement access control lists and grant immutability, traceability and verifiability of the references to personal data, which is stored itself in a (possibly distributed)
-
Comparing Broadband ISP Performance using Big Data from M-Lab arXiv.cs.PF Pub Date : 2021-01-24 Xiaohong Deng; Yun Feng; Thanchanok Sutjarittham; Hassan Habibi Gharakheili; Blanca Gallego; Vijay Sivaraman
Comparing ISPs on broadband speed is challenging, since measurements can vary due to subscriber attributes such as operating system and test conditions such as access capacity, server distance, TCP window size, time-of-day, and network segment size. In this paper, we draw inspiration from observational studies in medicine, which face a similar challenge in comparing the effect of treatments on patients
-
Creating a Virtuous Cycle in Performance Testing at MongoDB arXiv.cs.PF Pub Date : 2021-01-25 David Daly
It is important to detect changes in software performance during development in order to avoid performance decreasing release to release or dealing with costly delays at release time. Performance testing is part of the development process at MongoDB, and integrated into our continuous integration system. We describe a set of changes to that performance testing environment designed to improve testing
-
Experiences & Challenges with Server-Side WiFi Indoor Localization Using Existing Infrastructure arXiv.cs.PF Pub Date : 2021-01-23 Dheryta Jaisinghani; Vinayak Naik; Rajesh Balan; Archan Misra; Youngki Lee
Real-world deployments of WiFi-based indoor localization in large public venues are few and far between as most state-of-the-art solutions require either client or infrastructure-side changes. Hence, even though high location accuracy is possible with these solutions, they are not practical due to cost and/or client adoption reasons. The majority of public venues use commercial controller-managed WLAN
-
F3ORNITS: A Flexible Variable Step Size Non-Iterative Co-simulation Method handling Subsystems with Hybrid Advanced Capabilities arXiv.cs.PF Pub Date : 2021-01-22 Yohan Eguillon; Bruno Lacabanne; Damien Tromeur-Dervout
This paper introduces the F3ORNITS non-iterative co-simulation algorithm in which F3 stands for the 3 flexible aspects of the method: flexible polynomial order representation of coupling variables, flexible time-stepper applying variable co-simulation step size rules on subsystems allowing it and flexible scheduler orchestrating the meeting times among the subsystems and capable of asynchronousness
-
TAOS-CI: Lightweight & Modular Continuous Integration System for Edge Computing arXiv.cs.PF Pub Date : 2021-01-21 Geunsik Lim; MyungJoo Ham; Jijoong Moon; Wook Song; Sangjung Woo; Sewon Oh
With the proliferation of IoT and edge devices, we are observing a lot of consumer electronics becoming yet another IoT and edge devices. Unlike traditional smart devices, such as smart phones, consumer electronics, in general, have significant diversities with fewer number of devices per product model. With such high diversities, the proliferation of edge devices requires frequent and seamless updates
-
Efficient MPI-based Communication for GPU-Accelerated Dask Applications arXiv.cs.PF Pub Date : 2021-01-21 Aamir Shafi; Jahanzeb Maqbool Hashmi; Hari Subramoni; Dhabaleswar K. Panda
Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for adding new communication devices. It currently has two communication devices: one for TCP and the other for high-speed networks using UCX-Py -- a Cython wrapper to
-
Virtual Memory Partitioning for Enhancing Application Performance in Mobile Platforms arXiv.cs.PF Pub Date : 2021-01-21 Geunsik Lim; Changwoo Min; Young Ik Eom
Recently, the amount of running software on smart mobile devices is gradually increasing due to the introduction of application stores. The application store is a type of digital distribution platform for application software, which is provided as a component of an operating system on a smartphone or tablet. Mobile devices have limited memory capacity and, unlike server and desktop systems, due to
-
Comparison and Improvement for Delay Analysis Approaches: Theoretical Models and Experimental Tests arXiv.cs.PF Pub Date : 2021-01-21 Yue Hong Gao; Xiao Hong; Hao Tian Yang; Lu Chen; Xiao Nan Zhang
Computer network tends to be subjected to the proliferation of mobile demands and increasingly multifarious, therefore it poses a great challenge to guarantee the quality of network service. By designing the model according to different requirements, we may get some related indicators such as delay and packet loss rate in order to evaluate the quality of network service and verify the user data surface
-
UNIT: Unifying Tensorized Instruction Compilation arXiv.cs.PF Pub Date : 2021-01-21 Jian Weng; Animesh Jain; Jie Wang; Leyuan Wang; Yida Wang; Tony Nowatzki
Because of the increasing demand for computation in DNNs, researchers develop both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to leverage mixed precision without hardware support because of the overhead of data casting. Hardware vendors offer tensorized instructions for mixed-precision
-
Thread Evolution Kit for Optimizing Thread Operations on CE/IoT Devices arXiv.cs.PF Pub Date : 2021-01-20 Geunsik Lim; Donghyun Kang; Young Ik Eom
Most modern operating systems have adopted the one-to-one thread model to support fast execution of threads in both multi-core and single-core systems. This thread model, which maps the kernel-space and user-space threads in a one-to-one manner, supports quick thread creation and termination in high-performance server environments. However, the performance of time-critical threads is degraded when
-
PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses arXiv.cs.PF Pub Date : 2021-01-20 Seung Won Min; Kun Wu; Sitao Huang; Mert Hidayetoğlu; Jinjun Xiong; Eiman Ebrahimi; Deming Chen; Wen-mei Hwu
With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool to accelerate GNN training. However, training GNNs on very large graphs that do not fit in GPU memory is still a challenging task. Unlike conventional neural networks, mini-batching input samples in GNNs requires complicated tasks such as traversing neighboring nodes and
-
An Efficient Graph Mining System for Large Patterns arXiv.cs.PF Pub Date : 2021-01-19 Peng Jiang; Rujia Wang; Bo Wu
There is a growing interest in designing systems for graph pattern mining in recent years. The existing systems mostly focus on small patterns and have difficulty in mining larger patterns. In this work, we propose Angelica, a single-machine graph pattern mining system aiming at supporting large patterns. We first propose a new computation model called multi-vertex exploration. The model allows us
-
Accelerating Deep Learning Inference via Learned Caches arXiv.cs.PF Pub Date : 2021-01-18 Arjun Balasubramanian; Adarsh Kumar; Yuhan Liu; Han Cao; Shivaram Venkataraman; Aditya Akella
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. However, this high accuracy has been achieved by building deeper networks, posing a fundamental challenge to the low latency inference desired by user-facing applications. Current low latency solutions trade-off on accuracy or fail to exploit the inherent temporal
-
Verifiable Failure Localization in Smart Grid under Cyber-Physical Attacks arXiv.cs.PF Pub Date : 2021-01-18 Yudi Huang; Ting He; Nilanjan Ray Chaudhuri; Thomas La Porta
Cyber-physical attacks impose a significant threat to the smart grid, as the cyber attack makes it difficult to identify the actual damage caused by the physical attack. To defend against such attacks, various inference-based solutions have been proposed to estimate the states of grid elements (e.g., transmission lines) from measurements outside the attacked area, out of which a few have provided theoretical
-
Online Caching with Optimal Switching Regret arXiv.cs.PF Pub Date : 2021-01-18 Samrat Mukhopadhyay; Abhishek Sinha
We consider the classical uncoded caching problem from an online learning point-of-view. A cache of limited storage capacity can hold $C$ files at a time from a large catalog. A user requests an arbitrary file from the catalog at each time slot. Before the file request from the user arrives, a caching policy populates the cache with any $C$ files of its choice. In the case of a cache-hit, the policy
-
Acceleration of multiple precision matrix multiplication based on multi-component floating-point arithmetic using AVX2 arXiv.cs.PF Pub Date : 2021-01-17 Tomonori Kouya
In this paper, we report the results obtained from the acceleration of multi-binary64-type multiple precision matrix multiplication with AVX2. We target double-double (DD), triple-double (TD), and quad-double (QD) precision arithmetic designed by certain types of error-free transformation (EFT) arithmetic. Furthermore, we implement SIMDized EFT functions, which simultaneously compute with four binary64
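To illustrate what an error-free transformation is, the following is a minimal Python sketch of Knuth's TwoSum, a standard EFT for addition. It is not taken from the paper and uses no AVX2; the name `two_sum` is chosen here purely for illustration.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's TwoSum: return (s, e) where s is the rounded sum a + b
    and e is the rounding error, so that s + e equals a + b exactly
    (assuming no overflow)."""
    s = a + b
    b_virtual = s - a              # the part of b that was absorbed into s
    a_virtual = s - b_virtual      # the part of a that was absorbed into s
    e = (a - a_virtual) + (b - b_virtual)  # what rounding discarded
    return s, e


if __name__ == "__main__":
    s, e = two_sum(0.1, 0.2)
    # Exact rational arithmetic confirms that nothing was lost.
    assert Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)
    print(s, e)
```

A double-double number can be stored as such an unevaluated pair of binary64 values, a high part and a low part.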
-
Sensitivity of Mean-Field Fluctuations in Erlang loss models with randomized routing arXiv.cs.PF Pub Date : 2021-01-16 Thirupathaiah Vasantam; Ravi R. Mazumdar
In this paper, we study a large system of $N$ servers each with capacity to process at most $C$ simultaneous jobs and an incoming job is routed to a server if it has the lowest occupancy amongst $d$ (out of $N$) randomly selected servers. A job that is routed to a server with no vacancy is assumed to be blocked and lost. Such randomized policies are referred to as JSQ(d) (Join the Shortest Queue out of
-
FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference arXiv.cs.PF Pub Date : 2021-01-13 Daya Khudia; Jianyu Huang; Protonu Basu; Summer Deng; Haixin Liu; Jongsoo Park; Mikhail Smelyanskiy
Deep learning models typically use single-precision (FP32) floating point data types for representing activations and weights, but a slew of recent research work has shown that computations with reduced-precision data types (FP16, 16-bit integers, 8-bit integers or even 4- or 2-bit integers) are enough to achieve the same accuracy as FP32 and are much more efficient. Therefore, we designed fbgemm, a high-performance