• arXiv.cs.PF Pub Date : 2020-01-19
Sungjin Im; Benjamin Moseley; Kamesh Munagala; Kirk Pruhs

In this paper, we consider the following dynamic fair allocation problem: given a sequence of job arrivals and departures, the goal is to maintain an approximately fair allocation of the resource against a target fair allocation policy, while minimizing the total number of disruptions, which is the number of times the allocation of any job is changed. We consider a rich class of fair allocation policies that significantly generalizes those considered in previous work. We first consider the models where jobs only arrive, or jobs only depart. We present tight upper and lower bounds for the number of disruptions required to maintain a constant-approximate fair allocation at every time step. In particular, for the canonical case where jobs have weights and the resource allocation is proportional to each job's weight, we show that maintaining a constant-approximate fair allocation requires $\Theta(\log^* n)$ disruptions per job, almost matching the bounds in prior work for the unit-weight case. For the more general setting where the allocation policy only decreases the allocation to a job when new jobs arrive, we show that maintaining a constant-approximate fair allocation requires $\Theta(\log n)$ disruptions per job. We then consider the model where jobs can both arrive and depart. We first show strong lower bounds on the number of disruptions required to maintain constant approximate fairness for arbitrary instances. In contrast, we then show that there is an algorithm that can maintain constant approximate fairness with $O(1)$ expected disruptions per job if the weights of the jobs are independent of the jobs' arrival and departure order. Finally, we show how our results can be extended to the setting with multiple resources.
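As a toy illustration of why exact proportionality is expensive in disruptions (a hypothetical sketch, not the paper's algorithm; the job names and weights are invented), recomputing the exact proportional share on every arrival changes every existing job's allocation:

```python
def proportional_allocation(weights):
    """Allocate a unit resource in proportion to job weights."""
    total = sum(weights.values())
    return {job: w / total for job, w in weights.items()}

def count_disruptions(old, new, eps=1e-12):
    """A disruption is any job whose allocation changed."""
    return sum(1 for job in new if abs(new[job] - old.get(job, 0.0)) > eps)

# Jobs arrive one by one; maintaining the *exact* proportional share
# disrupts every existing job on each arrival -- exactly the cost the
# approximate policies studied in the paper are designed to avoid.
weights, alloc, total_disruptions = {}, {}, 0
for i, w in enumerate([1, 2, 1, 4]):
    weights["job%d" % i] = w
    new_alloc = proportional_allocation(weights)
    total_disruptions += count_disruptions(alloc, new_alloc)
    alloc = new_alloc
```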

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-20
Jeremy Kepner; Tim Davis; Chansup Byun; William Arcand; David Bestor; William Bergeron; Vijay Gadepally; Matthew Hubbell; Michael Houle; Michael Jones; Anna Klein; Peter Michaleas; Lauren Milechin; Julie Mullen; Andrew Prout; Antonio Rosa; Siddharth Samsi; Charles Yee; Albert Reuther

The SuiteSparse GraphBLAS C-library implements high-performance hypersparse matrices with bindings to a variety of languages (Python, Julia, and Matlab/Octave). GraphBLAS provides a lightweight in-memory database implementation of hypersparse matrices that is ideal for analyzing many types of network data, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of hypersparse matrices put enormous pressure on the memory hierarchy. This work benchmarks an implementation of hierarchical hypersparse matrices that reduces memory pressure and dramatically increases the update rate into a hypersparse matrix. Hierarchical hypersparse matrices are parameterized by the number of entries allowed in each level of the hierarchy before an update is cascaded. The parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical hypersparse matrices achieve over 1,000,000 updates per second in a single instance. Scaling to 31,000 instances of hierarchical hypersparse matrix arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 75,000,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
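A minimal sketch of the cascading idea (the class name, two-level structure, and cutoff values are hypothetical; SuiteSparse GraphBLAS's actual implementation is far more sophisticated):

```python
class HierarchicalHypersparse:
    """Toy sketch: each level is a dict mapping (row, col) -> value.
    Updates land in the smallest, fastest level; when a level exceeds
    its cutoff, its entries are flushed (added) into the next level."""

    def __init__(self, cutoffs=(4, 16)):
        self.cutoffs = cutoffs  # max entries per level before cascading
        self.levels = [dict() for _ in range(len(cutoffs) + 1)]

    def update(self, i, j, v):
        lvl = self.levels[0]
        lvl[(i, j)] = lvl.get((i, j), 0) + v
        self._cascade()

    def _cascade(self):
        # flush any over-full level into the next (coarser) level
        for k, cutoff in enumerate(self.cutoffs):
            if len(self.levels[k]) > cutoff:
                nxt = self.levels[k + 1]
                for key, v in self.levels[k].items():
                    nxt[key] = nxt.get(key, 0) + v
                self.levels[k].clear()

    def get(self, i, j):
        # the logical value is the sum over all levels
        return sum(lvl.get((i, j), 0) for lvl in self.levels)
```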

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-20
Lorenz Braun; Sotirios Nikas; Chen Song; Vincent Heuveline; Holger Fröning

Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs, using only hardware-independent features extracted from the kernels. The model is built with random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluating the model with cross-validation yields a median Mean Average Percentage Error (MAPE) of [13.45%, 44.56%] for time prediction and [1.81%, 2.91%] for power prediction on five different GPUs, while the latency of a single prediction varies between 0.1 and 0.2 seconds.

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-07-30
Andrew Daw; Robert C. Hampshire; Jamol Pender

Driverless vehicles promise a host of societal benefits including dramatically improved safety, increased accessibility, greater productivity, and higher quality of life. As this new technology approaches widespread deployment, both industry and government are making provisions for teleoperations systems, in which remote human agents provide assistance to driverless vehicles. This assistance can involve real-time remote operation and even ahead-of-time input via human-in-the-loop artificial intelligence systems. In this paper, we address the problem of staffing such a remote support center. Our analysis focuses on the tradeoffs between the total number of remote agents, the reliability of the remote support system, and the resulting safety of the driverless vehicles. By establishing a novel connection between queues with large batch arrivals and storage processes, we determine the probability of the system exceeding its service capacity. This connection drives our staffing methodology. We also develop a numerical method to compute the exact staffing level needed to achieve various performance measures. This moment-generating-function-based technique may be of independent interest, and our overall staffing analysis may be of use in other applications that combine human expertise and automated systems.

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-11-11
Maciej Besta; Raghavendra Kanakagiri; Harun Mustafa; Mikhail Karasikov; Gunnar Rätsch; Torsten Hoefler; Edgar Solomonik

The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large genome samples. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
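The encoding into a sparse matrix product can be sketched in miniature (a hypothetical illustration: `jaccard_all_pairs` and the toy samples are invented, not the paper's API; the inverted index plays the role of the sparse matrix, and the double loop over its entries is the sparse product A·Aᵀ):

```python
def jaccard_all_pairs(samples):
    """samples: dict name -> set of features (e.g. k-mers).
    Computes intersection sizes via a sparse inner product, then
    J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    # inverted index: feature -> list of samples containing it
    inverted = {}
    for name, feats in samples.items():
        for f in feats:
            inverted.setdefault(f, []).append(name)
    # accumulate pairwise intersection counts (the sparse A @ A.T)
    inter = {}
    for members in inverted.values():
        for a in members:
            for b in members:
                if a < b:
                    inter[(a, b)] = inter.get((a, b), 0) + 1
    sizes = {n: len(s) for n, s in samples.items()}
    return {(a, b): c / (sizes[a] + sizes[b] - c)
            for (a, b), c in inter.items()}
```

Pairs with empty intersection never appear in the output, which is what makes the distributed formulation communication-efficient at scale.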

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-12-31
Jiri Borovec

This report presents a generic image registration benchmark with automatic evaluation using landmark annotations. The key features of the BIRL framework are: easy extensibility, performance evaluation, parallel experimentation, simple visualisations, an experiment time-out limit, and resuming of unfinished experiments. From research practice, we identified and focused on two main use cases: (a) comparing a user's (newly developed) method with some State-of-the-Art (SOTA) methods on a common dataset, and (b) experimenting with SOTA methods on a user's custom dataset (which should contain landmark annotations). Moreover, we present an integration of several standard image registration methods aimed at biomedical imaging into the BIRL framework. The report also contains experimental results of these SOTA methods on the CIMA dataset, a dataset of Whole Slide Imaging (WSI) from histology/pathology containing several multi-stain tissue samples from three kinds of tissue. Source and results: https://borda.github.io/BIRL

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-14
Oliver Urbann; Simon Camphausen; Arne Moos; Ingmar Schwarz; Sören Kerner; Maximilian Otten

Inference of Convolutional Neural Networks in time-critical applications usually requires a GPU. In robotics or embedded devices, these are often not available due to energy, space and cost constraints. Furthermore, installation of a deep learning framework or even a native compiler on the target platform is not possible. This paper presents a neural network code generator (NNCG) that generates, from a trained CNN, a plain ANSI C code file that encapsulates the inference in a single function. It can easily be included in existing projects and, due to the lack of dependencies, cross compilation is usually possible. Additionally, the code generation is optimized based on the known trained CNN and target platform, following four design principles. The system is evaluated utilizing a small CNN designed for this application. Compared to TensorFlow XLA and Glow, speed-ups of up to 11.81x can be shown, and even GPUs are outperformed in terms of latency.

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-16
Lubomír Bulej; Vojtěch Horký; Petr Tůma; François Farquet; Aleksandar Prokopec

We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3% to 12.5% (5.03% on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8% to 82.4% (37.4% on average) for the SPEC CPU 2017 workloads.
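The core duet idea can be illustrated with a toy model (a hypothetical sketch; the noise model and function name are invented): because both artifacts of a pair run under the same machine conditions, the per-pair ratio cancels much of the shared interference.

```python
import statistics

def duet_relative_performance(times_a, times_b):
    """Paired (duet) runs: A and B execute side by side, so each pair
    shares the same machine conditions; the per-pair ratio cancels
    much of the common noise that plagues independent measurements."""
    return statistics.median(a / b for a, b in zip(times_a, times_b))

# toy model: interference scales both members of a pair equally
noise = [1.0, 1.5, 0.9, 2.0, 1.1]
times_a = [10.0 * c for c in noise]   # artifact A, true cost 10
times_b = [8.0 * c for c in noise]    # artifact B, true cost 8
```

In this idealized model the noise cancels exactly and the recovered ratio is 10/8 = 1.25; in practice cancellation is only partial, which is what the paper quantifies across cloud environments.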

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-16
Patrick Rodler

We challenge existing query-based ontology fault localization methods with respect to the assumptions they make, the criteria they optimize, and the interaction means they use. We find that their efficiency depends largely on the behavior of the interacting expert, that the performed calculations can be inefficient or imprecise, and that the used optimization criteria are often not fully realistic. As a remedy, we suggest a novel (and simpler) interaction approach which overcomes all identified problems and, in comprehensive experiments on faulty real-world ontologies, enables successful fault localization while requiring fewer expert interactions in 66% of the cases, and always at least 80% less expert waiting time, compared to existing methods.

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2018-02-06
Giovanni Luca Torrisi; Michele Garetto; Emilio Leonardi

We consider the Erdős–Rényi random graph $G_{n,p}$ and analyze the simple irreversible epidemic process on the graph, known in the literature as bootstrap percolation. We give a quantitative version of some results by Janson et al. (2012), providing a fine asymptotic analysis of the final number $A_n^*$ of active nodes, under a suitable super-critical regime. More specifically, we establish large deviation principles for the sequence of random variables $\{\frac{n- A_n^*}{f(n)}\}_{n\geq 1}$ with explicit rate functions, allowing the scaling function $f$ to vary in the widest possible range.
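The bootstrap percolation process itself can be sketched as a simulation (a sketch of the dynamics only, not the paper's analysis; the activation threshold `r` and the deterministic seed are arbitrary choices):

```python
import random

def bootstrap_percolation(n, p, seeds, r=2, rng=None):
    """Bootstrap percolation on an Erdos-Renyi graph G(n, p): start
    from a seed set of active nodes; an inactive node irreversibly
    activates once it has at least r active neighbours. Returns the
    final number of active nodes."""
    rng = rng or random.Random(0)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    active = set(seeds)
    frontier = list(seeds)
    hits = [0] * n                    # active neighbours seen so far
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in active:
                hits[v] += 1
                if hits[v] >= r:
                    active.add(v)
                    frontier.append(v)
    return len(active)
```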

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-15
Subhankar Banerjee; Rajarshi Bhattacharjee; Abhishek Sinha

We study the multi-user scheduling problem for minimizing the Age of Information (AoI) in cellular wireless networks under stationary and non-stationary regimes. We derive fundamental lower bounds for the scheduling problem and design efficient online policies with provable performance guarantees. In the stationary setting, we consider the AoI optimization problem for a set of mobile users travelling around multiple cells. In this setting, we propose a scheduling policy and show that it is $2$-optimal. Next, we propose a new adversarial channel model for studying the scheduling problem in non-stationary environments. For $N$ users, we show that the competitive ratio of any online scheduling policy in this setting is at least $\Omega(N)$. We then propose an online policy and show that it achieves a competitive ratio of $O(N^2)$. Finally, we introduce a relaxed adversarial model with channel state estimations for the immediate future. We propose a heuristic model predictive control policy that exploits this feature and compare its performance through numerical simulations.

Updated: 2020-01-16
• arXiv.cs.PF Pub Date : 2020-01-13
Devarpita Sinha; Rajarshi Roy

Age of Information (AoI) is a recently introduced metric that has attracted considerable attention as a measure of the freshness of information in real-time networks. It quantifies how stale the information a user holds is relative to the latest status update generated by a real-time application. In this paper, we study a centralized, closed-loop, networked-control industrial wireless sensor-actuator network for cyber-physical production systems. We jointly address the problem of scheduling the transmission of sensor updates and of restoring an information flow-line after a hard-deadline real-time update is dropped, resulting in a break in the loop. Unlike existing real-time scheduling policies that only ensure timely updates, this work aims to achieve both timeliness and data freshness, in terms of the age of information, for new and regenerated real-time updates. The coexistence of cyber and physical units, each with its own quality-of-service requirements, is one of the major challenges for the system as a whole. We thoroughly investigate the minimization of the staleness of time-critical updates, so as to extract maximum utility from their information content, as well as its effect on other network performance measures. A greedy scheduling policy called Deadline-aware Highest Latency First is used to solve this problem, and its performance optimality is proved analytically. Finally, we validate our claims by comparing the results obtained by our algorithm with those of other popular scheduling policies through extensive simulations.
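A rough sketch of the greedy rule (a simplified, hypothetical model: one update transmitted per slot, absolute deadlines, infeasible updates dropped; the paper's system model is richer than this):

```python
def dhlf_schedule(updates, now=0):
    """Deadline-aware Highest Latency First, sketched: each update is
    (id, generated_at, deadline); one update is sent per unit slot;
    among updates that can still meet their deadline, send the one
    with the highest latency, i.e. the oldest generation time."""
    pending = list(updates)
    order = []
    t = now
    while pending:
        # transmission started at t completes at t + 1
        feasible = [u for u in pending if u[2] >= t + 1]
        if not feasible:
            break  # remaining updates can no longer meet their deadlines
        pick = min(feasible, key=lambda u: u[1])  # oldest = highest latency
        order.append(pick[0])
        pending.remove(pick)
        t += 1
    return order
```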

Updated: 2020-01-14
• arXiv.cs.PF Pub Date : 2020-01-13
Marat Dukhan; Artsiom Ablavatski

The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors the performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the "mantissa" and another representing the "exponent". Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in the AVX512 implementation, and by up to 18% in the AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors. To foster reproducibility, we released an open-source implementation of the new Two-Pass Softmax algorithm and other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
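The two-pass idea can be sketched in scalar Python (a sketch only: the paper's algorithm uses a vectorized mantissa/exponent pair representation; here a running (sum, max) pair plays the analogous role of deferring normalization):

```python
import math

def softmax_two_pass(x):
    """Pass 1 fuses the maximum and the normalizer: keep a running
    maximum m and a running sum s of exp(x_i - m), rescaling s
    whenever the maximum grows. No exp() ever sees a large positive
    argument, so overflow cannot occur. Pass 2 emits the outputs."""
    m = -math.inf   # running maximum (the "exponent" part)
    s = 0.0         # running sum of exp(x_i - m) (the "mantissa" part)
    for v in x:     # pass 1
        if v > m:
            s = s * math.exp(m - v) + 1.0 if s else 1.0
            m = v
        else:
            s += math.exp(v - m)
    return [math.exp(v - m) / s for v in x]  # pass 2
```

With inputs like `[1000.0, 1001.0]`, a naive single-pass `exp` would overflow, while this formulation stays finite throughout.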

Updated: 2020-01-14
• arXiv.cs.PF Pub Date : 2020-01-07
Renzo Angles; János Benjamin Antal; Alex Averbuch; Peter Boncz; Orri Erling; Andrey Gubichev; Vlad Haprian; Moritz Kaufmann; Josep Lluís Larriba Pey; Norbert Martínez; József Marton; Marcus Paradies; Minh-Duc Pham; Arnau Prat-Pérez; Mirko Spasić; Benjamin A. Steer; Gábor Szárnyas; Jack Waudby

The Linked Data Benchmark Council's Social Network Benchmark (LDBC SNB) is an effort intended to test various functionalities of systems used for graph-like data management. For this, LDBC SNB uses the recognizable scenario of operating a social network, characterized by its graph-shaped data. LDBC SNB consists of two workloads that focus on different functionalities: the Interactive workload (interactive transactional queries) and the Business Intelligence workload (analytical queries). This document contains the definition of the Interactive Workload and the first draft of the Business Intelligence Workload. This includes a detailed explanation of the data used in the LDBC SNB benchmark, a detailed description for all queries, and instructions on how to generate the data and run the benchmark with the provided software.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-08
Andrew Mackey; Petros Spachos; Liang Song; Konstantinos Plataniotis

The interconnectedness of all things is continuously expanding, allowing individuals to interact ever more closely with their surroundings. Internet of Things (IoT) devices are used in a plethora of context-aware applications such as Proximity-Based Services (PBS) and Location-Based Services (LBS). For these systems to perform well, it is essential to have reliable hardware and to predict a user's position with high accuracy, in order to differentiate between individuals in a small area. A variety of wireless solutions that utilize Received Signal Strength Indicators (RSSI) have been proposed to provide PBS and LBS for indoor environments, though each solution presents its own drawbacks. In this work, Bluetooth Low Energy (BLE) beacons are examined in terms of their accuracy in proximity estimation. Specifically, a mobile application is developed along with three Bayesian filtering techniques to improve BLE beacon proximity estimation accuracy: a Kalman filter, a particle filter, and a Non-parametric Information (NI) filter. Since the RSSI is heavily influenced by the environment, experiments were conducted to examine the performance of beacons from three popular vendors in two different environments. The error is compared in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). According to the experimental results, Bayesian filters can improve proximity estimation accuracy by up to 30% in comparison with traditional filtering, when the beacon and the receiver are within 3 m.
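A minimal scalar Kalman filter over an RSSI stream, together with the standard log-distance path-loss conversion, illustrates the filtering step (the parameter values, `tx_power` calibration, and path-loss exponent are hypothetical; the paper's filters and environments are more elaborate):

```python
def kalman_1d(measurements, q=0.05, r=4.0, x0=None, p0=1.0):
    """Minimal scalar Kalman filter for smoothing noisy RSSI samples
    (constant-signal model: the hidden state is the 'true' RSSI).
    q = process noise variance, r = measurement noise variance."""
    x = measurements[0] if x0 is None else x0
    p = p0
    smoothed = []
    for z in measurements:
        p += q               # predict: state unchanged, uncertainty grows
        k = p / (p + r)      # Kalman gain
        x += k * (z - x)     # correct with the new measurement
        p *= (1.0 - k)
        smoothed.append(x)
    return smoothed

def distance_from_rssi(rssi, tx_power=-59.0, n=2.0):
    """Log-distance path-loss model: RSSI = tx_power - 10*n*log10(d),
    where tx_power is the calibrated RSSI at 1 m (assumed value)."""
    return 10 ** ((tx_power - rssi) / (10 * n))
```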

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-03
Pengfei Zhang; Eric Lo; Baotong Lu

Lightweight convolutional neural networks (e.g., MobileNets) are specifically designed to carry out inference directly on mobile devices. Among the various lightweight models, depthwise convolution (DWConv) and pointwise convolution (PWConv) are the key operations. In this paper, we observe that the existing implementations of DWConv and PWConv do not fully utilize the ARM processors in mobile devices: they exhibit many cache misses under multi-core execution and poor data reuse at the register level. We propose techniques to re-optimize the implementations of DWConv and PWConv for the ARM architecture. Experimental results show that our implementation achieves speedups of up to 5.5x and 2.1x against TVM (Chen et al. 2018) on DWConv and PWConv, respectively.
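For reference, the two operations can be written naively (a pure-Python sketch with valid padding and unit stride; real implementations are vectorized for ARM NEON, which is exactly where the register-reuse issues the paper studies arise):

```python
def dwconv(inp, kernels):
    """Depthwise conv: one k x k filter per channel, valid padding.
    inp: [C][H][W], kernels: [C][k][k]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    k = len(kernels[0])
    out = [[[0.0] * (W - k + 1) for _ in range(H - k + 1)] for _ in range(C)]
    for c in range(C):                       # each channel filtered alone
        for y in range(H - k + 1):
            for x in range(W - k + 1):
                out[c][y][x] = sum(inp[c][y + i][x + j] * kernels[c][i][j]
                                   for i in range(k) for j in range(k))
    return out

def pwconv(inp, weights):
    """Pointwise (1x1) conv: mixes channels at each spatial position.
    inp: [C_in][H][W], weights: [C_out][C_in]."""
    C_in, H, W = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(weights[o][c] * inp[c][y][x] for c in range(C_in))
              for x in range(W)] for y in range(H)]
            for o in range(len(weights))]
```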

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-02
Guillaume Fieni; Romain Rouvoy; Lionel Seinturier

Fine-grained power monitoring of software activities is becoming unavoidable to maximize the power usage efficiency of data centers. In particular, achieving an optimal scheduling of containers requires the deployment of software-defined power meters that go beyond the granularity of hardware power monitoring sensors, such as Power Distribution Units (PDU) or Intel's Running Average Power Limit (RAPL), to deliver power estimations of activities at the granularity of software containers. However, the definition of the underlying power models that estimate the power consumption remains a long and fragile process that is tightly coupled to the host machine. To overcome these limitations, this paper introduces SmartWatts: a lightweight power monitoring system that adopts online calibration to automatically adjust the CPU and DRAM power models in order to maximize the accuracy of runtime power estimations of containers. Unlike state-of-the-art techniques, SmartWatts does not require any a priori training phase or hardware equipment to configure the power models and can therefore be deployed on a wide range of machines, including those with the latest power optimizations, at no cost.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-07
Jin Xu; Natarajan Gautam

Polling systems have been widely studied; however, most of these studies focus on polling systems with renewal arrival processes and random service times. There is a need, driven by practical applications, to study polling systems with arbitrary arrivals (not restricted to time-varying or batch arrivals) and service times revealed upon a job's arrival. To address that need, our work considers a polling system in a generic setting and, for the first time, provides a worst-case analysis of online scheduling policies in this system. We provide conditions for the existence of a constant competitive ratio for this system, as well as the competitive ratios of several well-studied policies, such as the cyclic exhaustive, gated and l-limited policies, the Stochastic Largest Queue policy, the One Machine policy, and the Gittins Index policy for polling systems. We show that any policy with a (1) purely static, (2) queue-length-based, or (3) job-processing-time-based routing discipline does not have a competitive ratio smaller than $k$, where $k$ is the number of queues. Finally, a mixed strategy is provided for the practical scenario where setup times are large but bounded.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2019-12-02
Markku-Juhani O. Saarinen

Standardization of Post-Quantum Cryptography (PQC) was started by NIST in 2016 and has proceeded to its second elimination round. The upcoming standards are intended to replace (or supplement) current RSA and Elliptic Curve Cryptography (ECC) on all targets, including lightweight, embedded, and mobile systems. We present an energy requirement analysis based on extensive measurements of PQC candidate algorithms on a Cortex-M4-based reference platform. We relate the computational (energy) costs of PQC algorithms to their data transmission costs, which are expected to increase with the new types of public keys and ciphertext messages. The energy, bandwidth, and latency needs of PQC algorithms span several orders of magnitude, which is substantial enough to impact battery life, user experience, and application protocol design. We propose metrics and guidelines for PQC algorithm usage in IoT and mobile systems based on our findings. Our evidence supports the view that fast structured-lattice PQC schemes are the preferred choice for cloud-connected mobile devices in most use cases, even when the per-bit data transmission energy cost is relatively high.

Updated: 2020-01-08
• arXiv.cs.PF Pub Date : 2020-01-04
Heng-Li Liu; Quan-Lin Li; Yan-Xia Chang; Chi Zhang

This paper studies a block-structured double-ended queue, whose block structure comes from two independent Markovian arrival processes (MAPs), and its stability is guaranteed by customers' impatient behaviors. We show that such a queue can be expressed as a new bilateral quasi birth-and-death (QBD) process. For this purpose, we provide a detailed analysis for the bilateral QBD process, including the system stability, the stationary probability vector, the sojourn time, and so forth. Furthermore, we develop three effective algorithms for computing the performance measures (i.e., the probabilities of stationary queue lengths, the average stationary queue lengths, and the average sojourn times) of the block-structured double-ended queue. Finally, numerical examples are employed to verify the correctness of our theoretical results, and illustrate how the performance measures of this queue are influenced by key system parameters. We believe that the methodology and results described in this paper can be applied to deal with general matching queues (e.g., bilateral Markov processes of GI/M/1 type and those of M/G/1 type) via developing their corresponding bilateral block-structured Markov processes, which are very useful in analyzing many practical issues, such as those encountered in sharing economy, organ transplantation, intelligent manufacturing, intelligent transportation, and so on.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-05
János Végh

In the "gold rush" for higher performance numbers, considerable confusion has been introduced into supercomputing. The present paper attempts to clear this up by scrutinizing the basic terms, contributions, and measurement methods. It is shown that using extremely large numbers of processing elements in computing systems leads to unexpected phenomena that cannot be explained within the classical computing paradigm. These phenomena show interesting parallels with phenomena experienced in science more than a century ago, whose study gave rise to modern science. We introduce a simple, non-technical model that provides a frame and formalism for explaining the hitherto unexplained observations around supercomputing. The model also enables predictions of supercomputer performance for the near future, and provides hints for enhancing supercomputer components.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-06
Tobias Gysi; Tobias Grosser; Laurin Brandner; Torsten Hoefler

While the cost of computation is an easy-to-understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache model for fully associative caches with a least recently used (LRU) replacement policy that gives fast and accurate results. We count the cache misses without explicit enumeration of all memory accesses by using symbolic counting techniques twice: 1) to derive the stack distance for each memory access and 2) to count the memory accesses with stack distance larger than the cache size. While this technique seems infeasible in theory, due to non-linearities after the first round of counting, we show that the counting problems are sufficiently linear in practice. Our cache model often computes the results within seconds and, unlike simulation, its execution time is largely independent of problem size. Our evaluation measures modeling errors below 0.6% on real hardware. By providing accurate data placement information, we enable memory-hierarchy-aware software development.
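The stack-distance view of LRU misses that the model counts symbolically can be illustrated by explicit enumeration (the opposite of the paper's approach, but it shows exactly what is being counted; the function name is invented):

```python
def lru_misses(trace, cache_size):
    """Count misses of a fully associative LRU cache via stack (reuse)
    distances, without simulating the cache state: an access hits iff
    fewer than cache_size distinct other addresses were touched since
    the previous access to the same address."""
    last_seen = {}
    misses = 0
    for t, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to addr
            distance = len(set(trace[last_seen[addr] + 1:t]))
            if distance >= cache_size:
                misses += 1
        else:
            misses += 1  # first touch: cold miss (infinite distance)
        last_seen[addr] = t
    return misses
```

The paper's contribution is computing these counts symbolically over affine loop nests instead of enumerating the trace, which is why its runtime is largely independent of problem size.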

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2019-08-07
Sumit K. Mandal; Raid Ayoub; Michael Kishinevsky; Umit Y. Ogras

Networks-on-chip (NoCs) have become the standard interconnect solution in industrial designs ranging from client CPUs to many-core chip-multiprocessors. Since NoCs play a vital role in system performance and power consumption, pre-silicon evaluation environments include cycle-accurate NoC simulators. Long simulations increase the execution time of evaluation frameworks, which are already notoriously slow, and prohibit design-space exploration. Existing analytical NoC models, which assume fair arbitration, cannot replace these simulations, since industrial NoCs typically employ priority schedulers and multiple priority classes. To address this limitation, we propose a systematic approach to construct priority-aware analytical performance models using micro-architecture specifications and input traffic. Our approach consists of two novel transformations of the queuing system and an algorithm that iteratively applies these transformations to estimate end-to-end latency. The approach decomposes the given NoC into individual queues with modified service times, enabling accurate and scalable latency computations. Experimental evaluations using real architectures and applications show a high accuracy of 97% and up to 2.5x speedup in full-system simulation.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2019-09-20
Ameer Haj-Ali; Nesreen K. Ahmed; Ted Willke; Sophia Shao; Krste Asanovic; Ion Stoica

One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is to decide if vectorization or interleaving is beneficial. Then, the compiler has to determine how many instructions to pack together and how many loop iterations to interleave. Compilers are designed today to use fixed-cost models that are based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependency, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time. In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a Deep RL agent, which determines the vectorization factors for all the loops. We further extend our framework to support multiple supervised learning methods. We evaluate our approaches against the currently used LLVM vectorizer and loop polyhedral optimization techniques. Our experiments show 1.29X-4.73X performance speedup compared to baseline and only 3% worse than the brute-force search on a wide range of benchmarks.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-02
Jiajia Li; Mahesh Lakshminarasimhan; Xiaolong Wu; Ang Li; Catherine Olschanowsky; Kevin Barker

Tensor computations present significant performance challenges that impact a wide spectrum of applications, ranging from machine learning, healthcare analytics, social network analysis, and data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO), on CPUs and GPUs. It presents a set of reference tensor kernel implementations that are compatible with real-world tensors and with power-law tensors extended from synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insights into computer platforms from a sparse tensor perspective.
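A COO-format sparse tensor and one simple kernel, tensor-times-vector, can be sketched as follows (the helper names are hypothetical and the benchmark's reference kernels are far more extensive; COO simply stores one (index-tuple, value) pair per nonzero, which works for any tensor order):

```python
def coo_tensor(entries):
    """Arbitrary-order sparse tensor in COO form: a list of
    (index_tuple, value) pairs, one per nonzero."""
    return [(tuple(idx), v) for idx, v in entries]

def ttv(coo, vector, mode):
    """Tensor-times-vector along one mode: contract the chosen index
    against the vector, accumulating into the remaining indices."""
    out = {}
    for idx, v in coo:
        rest = idx[:mode] + idx[mode + 1:]   # surviving indices
        out[rest] = out.get(rest, 0.0) + v * vector[idx[mode]]
    return out
```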

Updated: 2020-01-06
• arXiv.cs.PF Pub Date : 2019-02-27
Tal Ben-Nun; Johannes de Fine Licht; Alexandros Nikolaos Ziogas; Timo Schneider; Torsten Hoefler

The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control-flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs --- from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.
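The separation the abstract describes, keeping the scientific computation fixed while transformations rewrite only the schedule, can be illustrated in miniature. This pure-Python analogy of a tiling transformation is not the actual DaCe/SDFG API; it only shows that the per-element computation is untouched while the iteration structure changes.

```python
def saxpy_naive(a, x, y):
    # The "scientific code": one definition of the computation.
    return [a * xi + yi for xi, yi in zip(x, y)]

def tiles(n, tile_size):
    """Yield (start, end) ranges covering range(n) in blocks."""
    for s in range(0, n, tile_size):
        yield s, min(s + tile_size, n)

def saxpy_tiled(a, x, y, tile_size=4):
    # A tiling "transformation": the schedule changes (blocked iteration),
    # but the per-element computation a*x[i] + y[i] is identical.
    out = [0.0] * len(x)
    for s, e in tiles(len(x), tile_size):
        for i in range(s, e):
            out[i] = a * x[i] + y[i]
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
assert saxpy_tiled(2.0, x, y) == saxpy_naive(2.0, x, y)
```

In the SDFG setting this rewrite happens on the graph representation via pattern matching, so the same transformation applies across CPU, GPU, and FPGA backends.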

Updated: 2020-01-06
• arXiv.cs.PF Pub Date : 2019-12-26
Issam Hammad; Kamal El-Sankary; Jason Gu

This paper shows, through simulation, how approximate multipliers can be utilized to enhance the training performance of convolutional neural networks (CNNs). Approximate multipliers have significantly better performance in terms of speed, power, and area compared to exact multipliers. However, approximate multipliers have an inaccuracy, which is defined in terms of the Mean Relative Error (MRE). To assess the applicability of approximate multipliers to CNN training, a simulation of the impact of approximate-multiplier error on CNN training is presented. The paper demonstrates that using approximate multipliers for CNN training can significantly enhance performance in terms of speed, power, and area at the cost of a small negative impact on the achieved accuracy. Additionally, the paper proposes a hybrid training method that mitigates this negative impact on accuracy. Using the proposed hybrid method, training starts with approximate multipliers and then switches to exact multipliers for the last few epochs. With this method, the performance benefits of approximate multipliers in terms of speed, power, and area can be attained for a large portion of the training stage, while the negative impact on accuracy is diminished by using exact multipliers for the final epochs.
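The hybrid schedule can be simulated in a few lines. The sketch below models an approximate multiplier as an exact product perturbed by a bounded relative error (standing in for the MRE), trains a single scalar parameter by gradient descent with it, and switches to exact multiplies for the last epochs. The 5% error level, the switch point, and the toy objective are all illustrative assumptions, not values from the paper.

```python
import random

def train_scalar(epochs=100, switch_at=90, lr=0.1, target=3.0, mre=0.05):
    """Fit w so that w * 2 matches target * 2, using approximate multiplies
    for the first `switch_at` epochs and exact ones afterwards."""
    rng = random.Random(0)

    def approx_mul(a, b):
        # Exact product perturbed by a zero-mean relative error (~ the MRE).
        return a * b * (1.0 + rng.uniform(-2 * mre, 2 * mre))

    w = 0.0
    for epoch in range(epochs):
        mul = approx_mul if epoch < switch_at else (lambda a, b: a * b)
        err = mul(w, 2.0) - target * 2.0
        w -= lr * 2.0 * err   # gradient step on 0.5 * err**2
    return w

w_final = train_scalar()
```

The noisy phase hovers near the optimum, and the short exact phase contracts the remaining error geometrically, which is the mechanism the hybrid method relies on.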

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2017-09-21
Mejbah Alam; Justin Gottschlich; Nesime Tatbul; Javier Turek; Timothy Mattson; Abdullah Muzahid

The field of machine programming (MP), the automation of the development of software, is making notable research advances. This is, in part, due to the emergence of a wide range of novel techniques in machine learning. In this paper, we apply MP to the automation of software performance regression testing. A performance regression is a software performance degradation caused by a code change. We present AutoPerf - a novel approach to automate regression testing that utilizes three core techniques: (i) zero-positive learning, (ii) autoencoders, and (iii) hardware telemetry. We demonstrate AutoPerf's generality and efficacy against 3 types of performance regressions across 10 real performance bugs in 7 benchmark and open-source programs. On average, AutoPerf exhibits 4% profiling overhead and accurately diagnoses more performance bugs than prior state-of-the-art approaches. Thus far, AutoPerf has produced no false negatives.
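Zero-positive learning means fitting the detector only on "normal" runs and flagging anything it cannot reconstruct. The miniature below replaces AutoPerf's autoencoder with a mean-based reconstructor over hardware-counter vectors; the counter values and threshold rule are illustrative, not the paper's.

```python
def fit(normal_runs):
    """Train on normal runs ONLY (zero-positive learning); return a detector
    that flags runs whose reconstruction error exceeds anything seen in
    training."""
    dim = len(normal_runs[0])
    mean = [sum(r[d] for r in normal_runs) / len(normal_runs) for d in range(dim)]

    def err(run):
        # Euclidean reconstruction error against the "decoded" mean vector.
        return sum((v - m) ** 2 for v, m in zip(run, mean)) ** 0.5

    threshold = max(err(r) for r in normal_runs)
    return lambda run: err(run) > threshold   # True => regression suspected

# Made-up counter vectors, e.g. (cache misses, pipeline stalls) per run.
normal = [[100.0, 5.0], [102.0, 6.0], [98.0, 5.5]]
is_regression = fit(normal)
assert not is_regression([101.0, 5.2])
assert is_regression([180.0, 20.0])
```

An autoencoder plays the same role with a learned nonlinear reconstruction, which is what lets AutoPerf generalize across regression types without ever seeing a positive example.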

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2019-02-22
Geoff Langdale; Daniel Lemire

JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.
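A core simdjson idea is locating all structural characters of a 64-byte block at once with SIMD compares, producing a 64-bit bitmap. The scalar Python sketch below computes the same bitmap one byte at a time; a real implementation would also mask out structural characters that occur inside strings, a stage omitted here.

```python
# Structural JSON characters, as single byte values.
STRUCTURAL = set(b'{}[]:,')

def structural_bitmap(block):
    """block: up to 64 bytes -> int with bit i set iff block[i] is structural.

    SIMD hardware produces this bitmap for a whole 64-byte block in a few
    instructions; here we build it a byte at a time.
    """
    bits = 0
    for i, c in enumerate(block):
        bits |= (c in STRUCTURAL) << i
    return bits

doc = b'{"a":[1,2]}'
bm = structural_bitmap(doc)
positions = [i for i in range(len(doc)) if bm >> i & 1]
# structural chars: '{' at 0, ':' at 4, '[' at 5, ',' at 7, ']' at 9, '}' at 10
assert positions == [0, 4, 5, 7, 9, 10]
```

Iterating set bits of the bitmap (e.g. with count-trailing-zeros) is how the parser jumps between structural characters without branching on every byte.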

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2019-10-21
Yajun Zhao; Juan Liu; Saijin Xie

In NR-based Access to Unlicensed Spectrum (NR-U) in the 5G system, to satisfy the Occupied Channel Bandwidth (OCB) rules of unlicensed spectrum, the PRACH and PUCCH channels have to use sequence repetition mechanisms in the frequency domain. These repetition mechanisms cause serious cubic metric (CM) problems for these channels, even though both channel types are composed of Constant Amplitude Zero Auto-correlation (CAZAC) sequences. Based on the characteristics of the CAZAC sequences used for PRACH and PUCCH (specifically PUCCH formats 0 and 1) in 5G NR, in this paper we propose new CM-reduction mechanisms for these two channel types, with design principles that preserve the auto-correlation and cross-correlation performance of the sequences. The proposed CM schemes are then evaluated, and optimized parameters are provided that balance CM performance against complexity.
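The CAZAC properties the design relies on can be checked numerically for a Zadoff-Chu sequence, the family underlying NR PRACH and PUCCH formats 0/1. This is a generic sketch of one common odd-length form, x[n] = exp(-j*pi*u*n*(n+1)/N), not the exact NR sequence construction; the root u=1 and length N=7 are illustrative.

```python
import cmath

def zadoff_chu(u, N):
    """Length-N Zadoff-Chu sequence (odd N, gcd(u, N) = 1 assumed)."""
    return [cmath.exp(-1j * cmath.pi * u * n * (n + 1) / N) for n in range(N)]

N = 7
zc = zadoff_chu(u=1, N=N)

# Constant amplitude: every sample has unit magnitude.
assert all(abs(abs(x) - 1.0) < 1e-9 for x in zc)

# Zero autocorrelation: the periodic autocorrelation vanishes at a
# nonzero cyclic shift.
shift = 3
corr = sum(zc[n] * zc[(n + shift) % N].conjugate() for n in range(N))
assert abs(corr) < 1e-9
```

It is exactly this constant envelope that frequency-domain repetition disturbs, which is why the repeated channels need dedicated CM-reduction schemes.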

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
