• arXiv.cs.PF Pub Date : 2020-01-19
Sungjin Im; Benjamin Moseley; Kamesh Munagala; Kirk Pruhs

In this paper, we consider the following dynamic fair allocation problem: given a sequence of job arrivals and departures, the goal is to maintain an approximately fair allocation of the resource against a target fair allocation policy, while minimizing the total number of disruptions, which is the number of times the allocation of any job is changed. We consider a rich class of fair allocation policies that significantly generalizes those considered in previous work. We first consider the models where jobs only arrive, or jobs only depart. We present tight upper and lower bounds for the number of disruptions required to maintain a constant-approximate fair allocation at every time step. In particular, for the canonical case where jobs have weights and the resource allocation is proportional to each job's weight, we show that maintaining a constant-approximate fair allocation requires $\Theta(\log^* n)$ disruptions per job, almost matching the bounds in prior work for the unit-weight case. For the more general setting where the allocation policy only decreases the allocation to a job when new jobs arrive, we show that maintaining a constant-approximate fair allocation requires $\Theta(\log n)$ disruptions per job. We then consider the model where jobs can both arrive and depart. We first show strong lower bounds on the number of disruptions required to maintain constant approximate fairness for arbitrary instances. In contrast, we then show that there is an algorithm that can maintain constant approximate fairness with $O(1)$ expected disruptions per job if the weights of the jobs are independent of the jobs' arrival and departure order. Finally, we show how our results can be extended to the setting with multiple resources.
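As a toy illustration of why exact proportionality is expensive in disruptions (a hypothetical sketch, not the paper's algorithm; the job names and weights are invented), recomputing the exact proportional share on every arrival changes every existing job's allocation:

```python
def proportional_allocation(weights):
    """Allocate a unit resource in proportion to job weights."""
    total = sum(weights.values())
    return {job: w / total for job, w in weights.items()}

def count_disruptions(old, new, eps=1e-12):
    """A disruption is any job whose allocation changed."""
    return sum(1 for job in new if abs(new[job] - old.get(job, 0.0)) > eps)

# Jobs arrive one by one; maintaining the *exact* proportional share
# disrupts every existing job on each arrival -- exactly the cost the
# approximate policies studied in the paper are designed to avoid.
weights, alloc, total_disruptions = {}, {}, 0
for i, w in enumerate([1, 2, 1, 4]):
    weights["job%d" % i] = w
    new_alloc = proportional_allocation(weights)
    total_disruptions += count_disruptions(alloc, new_alloc)
    alloc = new_alloc
```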

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-20
Jeremy Kepner; Tim Davis; Chansup Byun; William Arcand; David Bestor; William Bergeron; Vijay Gadepally; Matthew Hubbell; Michael Houle; Michael Jones; Anna Klein; Peter Michaleas; Lauren Milechin; Julie Mullen; Andrew Prout; Antonio Rosa; Siddharth Samsi; Charles Yee; Albert Reuther

The SuiteSparse GraphBLAS C-library implements high-performance hypersparse matrices with bindings to a variety of languages (Python, Julia, and Matlab/Octave). GraphBLAS provides a lightweight in-memory database implementation of hypersparse matrices that is ideal for analyzing many types of network data, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of hypersparse matrices put enormous pressure on the memory hierarchy. This work benchmarks an implementation of hierarchical hypersparse matrices that reduces memory pressure and dramatically increases the update rate into a hypersparse matrix. Hierarchical hypersparse matrices are parameterized by the number of entries allowed in each level of the hierarchy before an update is cascaded. The parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical hypersparse matrices achieve over 1,000,000 updates per second in a single instance. Scaling to 31,000 instances of hierarchical hypersparse matrix arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 75,000,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
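A minimal sketch of the cascading idea (the class name, two-level structure, and cutoff values are hypothetical; SuiteSparse GraphBLAS's actual implementation is far more sophisticated):

```python
class HierarchicalHypersparse:
    """Toy sketch: each level is a dict mapping (row, col) -> value.
    Updates land in the smallest, fastest level; when a level exceeds
    its cutoff, its entries are flushed (added) into the next level."""

    def __init__(self, cutoffs=(4, 16)):
        self.cutoffs = cutoffs  # max entries per level before cascading
        self.levels = [dict() for _ in range(len(cutoffs) + 1)]

    def update(self, i, j, v):
        lvl = self.levels[0]
        lvl[(i, j)] = lvl.get((i, j), 0) + v
        self._cascade()

    def _cascade(self):
        # flush any over-full level into the next (coarser) level
        for k, cutoff in enumerate(self.cutoffs):
            if len(self.levels[k]) > cutoff:
                nxt = self.levels[k + 1]
                for key, v in self.levels[k].items():
                    nxt[key] = nxt.get(key, 0) + v
                self.levels[k].clear()

    def get(self, i, j):
        # the logical value is the sum over all levels
        return sum(lvl.get((i, j), 0) for lvl in self.levels)
```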

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-20
Lorenz Braun; Sotirios Nikas; Chen Song; Vincent Heuveline; Holger Fröning

Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs, using only hardware-independent features extracted from the kernels. The model is built with random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluating the model with cross-validation yields a median Mean Average Percentage Error (MAPE) of [13.45%, 44.56%] for time prediction and [1.81%, 2.91%] for power prediction on five different GPUs, while the latency of a single prediction varies between 0.1 and 0.2 seconds.

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-07-30
Andrew Daw; Robert C. Hampshire; Jamol Pender

Driverless vehicles promise a host of societal benefits including dramatically improved safety, increased accessibility, greater productivity, and higher quality of life. As this new technology approaches widespread deployment, both industry and government are making provisions for teleoperations systems, in which remote human agents provide assistance to driverless vehicles. This assistance can involve real-time remote operation and even ahead-of-time input via human-in-the-loop artificial intelligence systems. In this paper, we address the problem of staffing such a remote support center. Our analysis focuses on the tradeoffs between the total number of remote agents, the reliability of the remote support system, and the resulting safety of the driverless vehicles. By establishing a novel connection between queues with large batch arrivals and storage processes, we determine the probability of the system exceeding its service capacity. This connection drives our staffing methodology. We also develop a numerical method to compute the exact staffing level needed to achieve various performance measures. This moment-generating-function-based technique may be of independent interest, and our overall staffing analysis may be of use in other applications that combine human expertise and automated systems.

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-11-11
Maciej Besta; Raghavendra Kanakagiri; Harun Mustafa; Mikhail Karasikov; Gunnar Rätsch; Torsten Hoefler; Edgar Solomonik

The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large genome samples. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
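The encoding into a sparse matrix product can be sketched in miniature (a hypothetical illustration: `jaccard_all_pairs` and the toy samples are invented, not the paper's API; the inverted index plays the role of the sparse matrix, and the double loop over its entries is the sparse product A·Aᵀ):

```python
def jaccard_all_pairs(samples):
    """samples: dict name -> set of features (e.g. k-mers).
    Computes intersection sizes via a sparse inner product, then
    J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    # inverted index: feature -> list of samples containing it
    inverted = {}
    for name, feats in samples.items():
        for f in feats:
            inverted.setdefault(f, []).append(name)
    # accumulate pairwise intersection counts (the sparse A @ A.T)
    inter = {}
    for members in inverted.values():
        for a in members:
            for b in members:
                if a < b:
                    inter[(a, b)] = inter.get((a, b), 0) + 1
    sizes = {n: len(s) for n, s in samples.items()}
    return {(a, b): c / (sizes[a] + sizes[b] - c)
            for (a, b), c in inter.items()}
```

Pairs with empty intersection never appear in the output, which is what makes the distributed formulation communication-efficient at scale.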

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2019-12-31
Jiri Borovec

This report presents a generic image registration benchmark with automatic evaluation using landmark annotations. The key features of the BIRL framework are: easy extensibility, performance evaluation, parallel experimentation, simple visualisations, an experiment time-out limit, and resuming of unfinished experiments. From research practice, we identified and focused on two main use cases: (a) comparing a user's (newly developed) method with some State-of-the-Art (SOTA) methods on a common dataset, and (b) experimenting with SOTA methods on a user's custom dataset (which should contain landmark annotations). Moreover, we present an integration of several standard image registration methods aimed at biomedical imaging into the BIRL framework. The report also contains experimental results of these SOTA methods on the CIMA dataset, a dataset of Whole Slide Imaging (WSI) from histology/pathology containing several multi-stain tissue samples from three kinds of tissue. Source and results: https://borda.github.io/BIRL

Updated: 2020-01-22
• arXiv.cs.PF Pub Date : 2020-01-14
Oliver Urbann; Simon Camphausen; Arne Moos; Ingmar Schwarz; Sören Kerner; Maximilian Otten

Inference of Convolutional Neural Networks in time-critical applications usually requires a GPU. In robotics or embedded devices, these are often not available due to energy, space and cost constraints. Furthermore, installation of a deep learning framework or even a native compiler on the target platform is not possible. This paper presents a neural network code generator (NNCG) that generates, from a trained CNN, a plain ANSI C code file that encapsulates the inference in a single function. It can easily be included in existing projects and, due to the lack of dependencies, cross compilation is usually possible. Additionally, the code generation is optimized based on the known trained CNN and target platform, following four design principles. The system is evaluated utilizing a small CNN designed for this application. Compared to TensorFlow XLA and Glow, speed-ups of up to 11.81x can be shown, and even GPUs are outperformed in terms of latency.

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-16
Lubomír Bulej; Vojtěch Horký; Petr Tůma; François Farquet; Aleksandar Prokopec

We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3% to 12.5% (5.03% on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8% to 82.4% (37.4% on average) for the SPEC CPU 2017 workloads.
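The core duet idea can be illustrated with a toy model (a hypothetical sketch; the noise model and function name are invented): because both artifacts of a pair run under the same machine conditions, the per-pair ratio cancels much of the shared interference.

```python
import statistics

def duet_relative_performance(times_a, times_b):
    """Paired (duet) runs: A and B execute side by side, so each pair
    shares the same machine conditions; the per-pair ratio cancels
    much of the common noise that plagues independent measurements."""
    return statistics.median(a / b for a, b in zip(times_a, times_b))

# toy model: interference scales both members of a pair equally
noise = [1.0, 1.5, 0.9, 2.0, 1.1]
times_a = [10.0 * c for c in noise]   # artifact A, true cost 10
times_b = [8.0 * c for c in noise]    # artifact B, true cost 8
```

In this idealized model the noise cancels exactly and the recovered ratio is 10/8 = 1.25; in practice cancellation is only partial, which is what the paper quantifies across cloud environments.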

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-16
Patrick Rodler

We challenge existing query-based ontology fault localization methods with respect to the assumptions they make, the criteria they optimize, and the interaction means they use. We find that their efficiency depends largely on the behavior of the interacting expert, that the performed calculations can be inefficient or imprecise, and that the used optimization criteria are often not fully realistic. As a remedy, we suggest a novel (and simpler) interaction approach which overcomes all identified problems and, in comprehensive experiments on faulty real-world ontologies, enables successful fault localization while requiring fewer expert interactions in 66% of the cases, and always at least 80% less expert waiting time, compared to existing methods.

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2018-02-06
Giovanni Luca Torrisi; Michele Garetto; Emilio Leonardi

We consider the Erdős–Rényi random graph $G_{n,p}$ and analyze the simple irreversible epidemic process on the graph, known in the literature as bootstrap percolation. We give a quantitative version of some results by Janson et al. (2012), providing a fine asymptotic analysis of the final number $A_n^*$ of active nodes, under a suitable super-critical regime. More specifically, we establish large deviation principles for the sequence of random variables $\{\frac{n- A_n^*}{f(n)}\}_{n\geq 1}$ with explicit rate functions, allowing the scaling function $f$ to vary in the widest possible range.
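The bootstrap percolation process itself can be sketched as a simulation (a sketch of the dynamics only, not the paper's analysis; the activation threshold `r` and the deterministic seed are arbitrary choices):

```python
import random

def bootstrap_percolation(n, p, seeds, r=2, rng=None):
    """Bootstrap percolation on an Erdos-Renyi graph G(n, p): start
    from a seed set of active nodes; an inactive node irreversibly
    activates once it has at least r active neighbours. Returns the
    final number of active nodes."""
    rng = rng or random.Random(0)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    active = set(seeds)
    frontier = list(seeds)
    hits = [0] * n                    # active neighbours seen so far
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in active:
                hits[v] += 1
                if hits[v] >= r:
                    active.add(v)
                    frontier.append(v)
    return len(active)
```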

Updated: 2020-01-17
• arXiv.cs.PF Pub Date : 2020-01-15
Subhankar Banerjee; Rajarshi Bhattacharjee; Abhishek Sinha

We study the multi-user scheduling problem for minimizing the Age of Information (AoI) in cellular wireless networks under stationary and non-stationary regimes. We derive fundamental lower bounds for the scheduling problem and design efficient online policies with provable performance guarantees. In the stationary setting, we consider the AoI optimization problem for a set of mobile users travelling around multiple cells. In this setting, we propose a scheduling policy and show that it is $2$-optimal. Next, we propose a new adversarial channel model for studying the scheduling problem in non-stationary environments. For $N$ users, we show that the competitive ratio of any online scheduling policy in this setting is at least $\Omega(N)$. We then propose an online policy and show that it achieves a competitive ratio of $O(N^2)$. Finally, we introduce a relaxed adversarial model with channel state estimations for the immediate future. We propose a heuristic model predictive control policy that exploits this feature and compare its performance through numerical simulations.

Updated: 2020-01-16
• arXiv.cs.PF Pub Date : 2020-01-13
Devarpita Sinha; Rajarshi Roy

Age of Information (AoI) is a recently introduced metric that has attracted considerable attention as a measure of the freshness of information in real-time networks. It quantifies how stale the information a user holds is relative to the latest status update generated by a real-time application. In this paper, we study a centralized, closed-loop, networked-control industrial wireless sensor-actuator network for cyber-physical production systems. We jointly address the problem of scheduling the transmission of sensor updates and of restoring an information flow-line after a hard-deadline real-time update is dropped, resulting in a break in the loop. Unlike existing real-time scheduling policies that only ensure timely updates, this work aims to achieve both timeliness and data freshness, in terms of the age of information, for new and regenerated real-time updates. The coexistence of cyber and physical units, each with its own quality-of-service requirements, is one of the major challenges for the system as a whole. We thoroughly investigate the minimization of the staleness of time-critical updates, so as to extract maximum utility from their information content, as well as its effect on other network performance measures. A greedy scheduling policy called Deadline-aware Highest Latency First is used to solve this problem, and its performance optimality is proved analytically. Finally, we validate our claims by comparing the results obtained by our algorithm with those of other popular scheduling policies through extensive simulations.
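A rough sketch of the greedy rule (a simplified, hypothetical model: one update transmitted per slot, absolute deadlines, infeasible updates dropped; the paper's system model is richer than this):

```python
def dhlf_schedule(updates, now=0):
    """Deadline-aware Highest Latency First, sketched: each update is
    (id, generated_at, deadline); one update is sent per unit slot;
    among updates that can still meet their deadline, send the one
    with the highest latency, i.e. the oldest generation time."""
    pending = list(updates)
    order = []
    t = now
    while pending:
        # transmission started at t completes at t + 1
        feasible = [u for u in pending if u[2] >= t + 1]
        if not feasible:
            break  # remaining updates can no longer meet their deadlines
        pick = min(feasible, key=lambda u: u[1])  # oldest = highest latency
        order.append(pick[0])
        pending.remove(pick)
        t += 1
    return order
```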

Updated: 2020-01-14
• arXiv.cs.PF Pub Date : 2020-01-13
Marat Dukhan; Artsiom Ablavatski

The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors the performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the "mantissa" and another representing the "exponent". Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in the AVX512 implementation, and by up to 18% in the AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors. To foster reproducibility, we released an open-source implementation of the new Two-Pass Softmax algorithm and other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
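The two-pass idea can be sketched in scalar Python (a sketch only: the paper's algorithm uses a vectorized mantissa/exponent pair representation; here a running (sum, max) pair plays the analogous role of deferring normalization):

```python
import math

def softmax_two_pass(x):
    """Pass 1 fuses the maximum and the normalizer: keep a running
    maximum m and a running sum s of exp(x_i - m), rescaling s
    whenever the maximum grows. No exp() ever sees a large positive
    argument, so overflow cannot occur. Pass 2 emits the outputs."""
    m = -math.inf   # running maximum (the "exponent" part)
    s = 0.0         # running sum of exp(x_i - m) (the "mantissa" part)
    for v in x:     # pass 1
        if v > m:
            s = s * math.exp(m - v) + 1.0 if s else 1.0
            m = v
        else:
            s += math.exp(v - m)
    return [math.exp(v - m) / s for v in x]  # pass 2
```

With inputs like `[1000.0, 1001.0]`, a naive single-pass `exp` would overflow, while this formulation stays finite throughout.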

Updated: 2020-01-14
• arXiv.cs.PF Pub Date : 2020-01-07
Renzo Angles; János Benjamin Antal; Alex Averbuch; Peter Boncz; Orri Erling; Andrey Gubichev; Vlad Haprian; Moritz Kaufmann; Josep Lluís Larriba Pey; Norbert Martínez; József Marton; Marcus Paradies; Minh-Duc Pham; Arnau Prat-Pérez; Mirko Spasić; Benjamin A. Steer; Gábor Szárnyas; Jack Waudby

The Linked Data Benchmark Council's Social Network Benchmark (LDBC SNB) is an effort intended to test various functionalities of systems used for graph-like data management. For this, LDBC SNB uses the recognizable scenario of operating a social network, characterized by its graph-shaped data. LDBC SNB consists of two workloads that focus on different functionalities: the Interactive workload (interactive transactional queries) and the Business Intelligence workload (analytical queries). This document contains the definition of the Interactive Workload and the first draft of the Business Intelligence Workload. This includes a detailed explanation of the data used in the LDBC SNB benchmark, a detailed description for all queries, and instructions on how to generate the data and run the benchmark with the provided software.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-08
Andrew Mackey; Petros Spachos; Liang Song; Konstantinos Plataniotis

The interconnectedness of all things is continuously expanding, allowing individuals to interact ever more closely with their surroundings. Internet of Things (IoT) devices are used in a plethora of context-aware applications such as Proximity-Based Services (PBS) and Location-Based Services (LBS). For these systems to perform well, it is essential to have reliable hardware and to predict a user's position with high accuracy, in order to differentiate between individuals in a small area. A variety of wireless solutions that utilize Received Signal Strength Indicators (RSSI) have been proposed to provide PBS and LBS for indoor environments, though each solution presents its own drawbacks. In this work, Bluetooth Low Energy (BLE) beacons are examined in terms of their accuracy in proximity estimation. Specifically, a mobile application is developed along with three Bayesian filtering techniques to improve BLE beacon proximity estimation accuracy: a Kalman filter, a particle filter, and a Non-parametric Information (NI) filter. Since the RSSI is heavily influenced by the environment, experiments were conducted to examine the performance of beacons from three popular vendors in two different environments. The error is compared in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). According to the experimental results, Bayesian filters can improve proximity estimation accuracy by up to 30% in comparison with traditional filtering, when the beacon and the receiver are within 3 m.
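A minimal scalar Kalman filter over an RSSI stream, together with the standard log-distance path-loss conversion, illustrates the filtering step (the parameter values, `tx_power` calibration, and path-loss exponent are hypothetical; the paper's filters and environments are more elaborate):

```python
def kalman_1d(measurements, q=0.05, r=4.0, x0=None, p0=1.0):
    """Minimal scalar Kalman filter for smoothing noisy RSSI samples
    (constant-signal model: the hidden state is the 'true' RSSI).
    q = process noise variance, r = measurement noise variance."""
    x = measurements[0] if x0 is None else x0
    p = p0
    smoothed = []
    for z in measurements:
        p += q               # predict: state unchanged, uncertainty grows
        k = p / (p + r)      # Kalman gain
        x += k * (z - x)     # correct with the new measurement
        p *= (1.0 - k)
        smoothed.append(x)
    return smoothed

def distance_from_rssi(rssi, tx_power=-59.0, n=2.0):
    """Log-distance path-loss model: RSSI = tx_power - 10*n*log10(d),
    where tx_power is the calibrated RSSI at 1 m (assumed value)."""
    return 10 ** ((tx_power - rssi) / (10 * n))
```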

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-03
Pengfei Zhang; Eric Lo; Baotong Lu

Lightweight convolutional neural networks (e.g., MobileNets) are specifically designed to carry out inference directly on mobile devices. Among the various lightweight models, depthwise convolution (DWConv) and pointwise convolution (PWConv) are the key operations. In this paper, we observe that the existing implementations of DWConv and PWConv do not fully utilize the ARM processors in mobile devices: they exhibit many cache misses under multi-core execution and poor data reuse at the register level. We propose techniques to re-optimize the implementations of DWConv and PWConv for the ARM architecture. Experimental results show that our implementation achieves speedups of up to 5.5x and 2.1x against TVM (Chen et al. 2018) on DWConv and PWConv, respectively.
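For reference, the two operations can be written naively (a pure-Python sketch with valid padding and unit stride; real implementations are vectorized for ARM NEON, which is exactly where the register-reuse issues the paper studies arise):

```python
def dwconv(inp, kernels):
    """Depthwise conv: one k x k filter per channel, valid padding.
    inp: [C][H][W], kernels: [C][k][k]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    k = len(kernels[0])
    out = [[[0.0] * (W - k + 1) for _ in range(H - k + 1)] for _ in range(C)]
    for c in range(C):                       # each channel filtered alone
        for y in range(H - k + 1):
            for x in range(W - k + 1):
                out[c][y][x] = sum(inp[c][y + i][x + j] * kernels[c][i][j]
                                   for i in range(k) for j in range(k))
    return out

def pwconv(inp, weights):
    """Pointwise (1x1) conv: mixes channels at each spatial position.
    inp: [C_in][H][W], weights: [C_out][C_in]."""
    C_in, H, W = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(weights[o][c] * inp[c][y][x] for c in range(C_in))
              for x in range(W)] for y in range(H)]
            for o in range(len(weights))]
```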

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-02
Guillaume Fieni; Romain Rouvoy; Lionel Seinturier

Fine-grained power monitoring of software activities is becoming unavoidable to maximize the power usage efficiency of data centers. In particular, achieving an optimal scheduling of containers requires the deployment of software-defined power meters that go beyond the granularity of hardware power monitoring sensors, such as Power Distribution Units (PDU) or Intel's Running Average Power Limit (RAPL), to deliver power estimations of activities at the granularity of software containers. However, the definition of the underlying power models that estimate the power consumption remains a long and fragile process that is tightly coupled to the host machine. To overcome these limitations, this paper introduces SmartWatts: a lightweight power monitoring system that adopts online calibration to automatically adjust the CPU and DRAM power models in order to maximize the accuracy of runtime power estimations of containers. Unlike state-of-the-art techniques, SmartWatts does not require any a priori training phase or hardware equipment to configure the power models and can therefore be deployed on a wide range of machines, including those with the latest power optimizations, at no cost.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2020-01-07
Jin Xu; Natarajan Gautam

Polling systems have been widely studied; however, most of these studies focus on polling systems with renewal arrival processes and random service times. There is a need, driven by practical applications, to study polling systems with arbitrary arrivals (not restricted to time-varying or batch arrivals) and service times revealed upon a job's arrival. To address that need, our work considers a polling system in a generic setting and, for the first time, provides a worst-case analysis of online scheduling policies in this system. We provide conditions for the existence of a constant competitive ratio for this system, as well as the competitive ratios of several well-studied policies, such as the cyclic exhaustive, gated and l-limited policies, the Stochastic Largest Queue policy, the One Machine policy, and the Gittins Index policy for polling systems. We show that any policy with a (1) purely static, (2) queue-length-based, or (3) job-processing-time-based routing discipline does not have a competitive ratio smaller than $k$, where $k$ is the number of queues. Finally, a mixed strategy is provided for the practical scenario where setup times are large but bounded.

Updated: 2020-01-09
• arXiv.cs.PF Pub Date : 2019-12-02
Markku-Juhani O. Saarinen

Standardization of Post-Quantum Cryptography (PQC) was started by NIST in 2016 and has proceeded to its second elimination round. The upcoming standards are intended to replace (or supplement) current RSA and Elliptic Curve Cryptography (ECC) on all targets, including lightweight, embedded, and mobile systems. We present an energy requirement analysis based on extensive measurements of PQC candidate algorithms on a Cortex-M4-based reference platform. We relate the computational (energy) costs of PQC algorithms to their data transmission costs, which are expected to increase with the new types of public keys and ciphertext messages. The energy, bandwidth, and latency needs of PQC algorithms span several orders of magnitude, which is substantial enough to impact battery life, user experience, and application protocol design. We propose metrics and guidelines for PQC algorithm usage in IoT and mobile systems based on our findings. Our evidence supports the view that fast structured-lattice PQC schemes are the preferred choice for cloud-connected mobile devices in most use cases, even when the per-bit data transmission energy cost is relatively high.

Updated: 2020-01-08
• arXiv.cs.PF Pub Date : 2020-01-04
Heng-Li Liu; Quan-Lin Li; Yan-Xia Chang; Chi Zhang

This paper studies a block-structured double-ended queue, whose block structure comes from two independent Markovian arrival processes (MAPs), and its stability is guaranteed by customers' impatient behaviors. We show that such a queue can be expressed as a new bilateral quasi birth-and-death (QBD) process. For this purpose, we provide a detailed analysis for the bilateral QBD process, including the system stability, the stationary probability vector, the sojourn time, and so forth. Furthermore, we develop three effective algorithms for computing the performance measures (i.e., the probabilities of stationary queue lengths, the average stationary queue lengths, and the average sojourn times) of the block-structured double-ended queue. Finally, numerical examples are employed to verify the correctness of our theoretical results, and illustrate how the performance measures of this queue are influenced by key system parameters. We believe that the methodology and results described in this paper can be applied to deal with general matching queues (e.g., bilateral Markov processes of GI/M/1 type and those of M/G/1 type) via developing their corresponding bilateral block-structured Markov processes, which are very useful in analyzing many practical issues, such as those encountered in sharing economy, organ transplantation, intelligent manufacturing, intelligent transportation, and so on.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-05
János Végh

In the "gold rush" for higher performance numbers, considerable confusion has been introduced into supercomputing. The present paper attempts to clear this up by scrutinizing the basic terms, contributions, and measurement methods. It is shown that using extremely large numbers of processing elements in computing systems leads to unexpected phenomena that cannot be explained within the classical computing paradigm. These phenomena show interesting parallels with phenomena experienced in science more than a century ago, whose study gave rise to modern science. We introduce a simple, non-technical model that provides a frame and formalism for explaining the hitherto unexplained observations around supercomputing. The model also enables predictions of supercomputer performance for the near future, and provides hints for enhancing supercomputer components.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-06
Tobias Gysi; Tobias Grosser; Laurin Brandner; Torsten Hoefler

While the cost of computation is an easy-to-understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache model for fully associative caches with a least recently used (LRU) replacement policy that gives fast and accurate results. We count the cache misses without explicit enumeration of all memory accesses by using symbolic counting techniques twice: 1) to derive the stack distance for each memory access and 2) to count the memory accesses with stack distance larger than the cache size. While this technique seems infeasible in theory, due to non-linearities after the first round of counting, we show that the counting problems are sufficiently linear in practice. Our cache model often computes the results within seconds and, unlike simulation, its execution time is largely independent of problem size. Our evaluation measures modeling errors below 0.6% on real hardware. By providing accurate data placement information, we enable memory-hierarchy-aware software development.
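The stack-distance view of LRU misses that the model counts symbolically can be illustrated by explicit enumeration (the opposite of the paper's approach, but it shows exactly what is being counted; the function name is invented):

```python
def lru_misses(trace, cache_size):
    """Count misses of a fully associative LRU cache via stack (reuse)
    distances, without simulating the cache state: an access hits iff
    fewer than cache_size distinct other addresses were touched since
    the previous access to the same address."""
    last_seen = {}
    misses = 0
    for t, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to addr
            distance = len(set(trace[last_seen[addr] + 1:t]))
            if distance >= cache_size:
                misses += 1
        else:
            misses += 1  # first touch: cold miss (infinite distance)
        last_seen[addr] = t
    return misses
```

The paper's contribution is computing these counts symbolically over affine loop nests instead of enumerating the trace, which is why its runtime is largely independent of problem size.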

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2019-08-07
Sumit K. Mandal; Raid Ayoub; Michael Kishinevsky; Umit Y. Ogras

Networks-on-chip (NoCs) have become the standard interconnect solution in industrial designs ranging from client CPUs to many-core chip-multiprocessors. Since NoCs play a vital role in system performance and power consumption, pre-silicon evaluation environments include cycle-accurate NoC simulators. Long simulations increase the execution time of evaluation frameworks, which are already notoriously slow, and prohibit design-space exploration. Existing analytical NoC models, which assume fair arbitration, cannot replace these simulations, since industrial NoCs typically employ priority schedulers and multiple priority classes. To address this limitation, we propose a systematic approach to construct priority-aware analytical performance models using micro-architecture specifications and input traffic. Our approach consists of two novel transformations of the queuing system and an algorithm that iteratively applies these transformations to estimate end-to-end latency. The approach decomposes the given NoC into individual queues with modified service times, enabling accurate and scalable latency computations. Experimental evaluations using real architectures and applications show a high accuracy of 97% and up to 2.5x speedup in full-system simulation.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2019-09-20
Ameer Haj-Ali; Nesreen K. Ahmed; Ted Willke; Sophia Shao; Krste Asanovic; Ion Stoica

One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is to decide if vectorization or interleaving is beneficial. Then, the compiler has to determine how many instructions to pack together and how many loop iterations to interleave. Compilers are designed today to use fixed-cost models that are based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependency, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time. In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a Deep RL agent, which determines the vectorization factors for all the loops. We further extend our framework to support multiple supervised learning methods. We evaluate our approaches against the currently used LLVM vectorizer and loop polyhedral optimization techniques. Our experiments show 1.29X-4.73X performance speedup compared to baseline and only 3% worse than the brute-force search on a wide range of benchmarks.

Updated: 2020-01-07
• arXiv.cs.PF Pub Date : 2020-01-02
Jiajia Li; Mahesh Lakshminarasimhan; Xiaolong Wu; Ang Li; Catherine Olschanowsky; Kevin Barker

Tensor computations present significant performance challenges that impact a wide spectrum of applications, ranging from machine learning, healthcare analytics, social network analysis, and data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO), on CPUs and GPUs. It presents a set of reference tensor kernel implementations that are compatible with real-world tensors and with power-law tensors extended from synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insights into computer platforms from a sparse tensor perspective.
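A COO-format sparse tensor and one simple kernel, tensor-times-vector, can be sketched as follows (the helper names are hypothetical and the benchmark's reference kernels are far more extensive; COO simply stores one (index-tuple, value) pair per nonzero, which works for any tensor order):

```python
def coo_tensor(entries):
    """Arbitrary-order sparse tensor in COO form: a list of
    (index_tuple, value) pairs, one per nonzero."""
    return [(tuple(idx), v) for idx, v in entries]

def ttv(coo, vector, mode):
    """Tensor-times-vector along one mode: contract the chosen index
    against the vector, accumulating into the remaining indices."""
    out = {}
    for idx, v in coo:
        rest = idx[:mode] + idx[mode + 1:]   # surviving indices
        out[rest] = out.get(rest, 0.0) + v * vector[idx[mode]]
    return out
```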

Updated: 2020-01-06
• arXiv.cs.PF Pub Date : 2019-02-27
Tal Ben-Nun; Johannes de Fine Licht; Alexandros Nikolaos Ziogas; Timo Schneider; Torsten Hoefler

The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control-flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs --- from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.
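The separation the abstract describes, keeping the scientific computation fixed while transformations rewrite only the schedule, can be illustrated in miniature. This pure-Python analogy of a tiling transformation is not the actual DaCe/SDFG API; it only shows that the per-element computation is untouched while the iteration structure changes.

```python
def saxpy_naive(a, x, y):
    # The "scientific code": one definition of the computation.
    return [a * xi + yi for xi, yi in zip(x, y)]

def tiles(n, tile_size):
    """Yield (start, end) ranges covering range(n) in blocks."""
    for s in range(0, n, tile_size):
        yield s, min(s + tile_size, n)

def saxpy_tiled(a, x, y, tile_size=4):
    # A tiling "transformation": the schedule changes (blocked iteration),
    # but the per-element computation a*x[i] + y[i] is identical.
    out = [0.0] * len(x)
    for s, e in tiles(len(x), tile_size):
        for i in range(s, e):
            out[i] = a * x[i] + y[i]
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
assert saxpy_tiled(2.0, x, y) == saxpy_naive(2.0, x, y)
```

In the SDFG setting this rewrite happens on the graph representation via pattern matching, so the same transformation applies across CPU, GPU, and FPGA backends.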

Updated: 2020-01-06
• arXiv.cs.PF Pub Date : 2019-12-26
Issam Hammad; Kamal El-Sankary; Jason Gu

This paper shows, through simulation, how approximate multipliers can be utilized to enhance the training performance of convolutional neural networks (CNNs). Approximate multipliers have significantly better performance in terms of speed, power, and area compared to exact multipliers. However, approximate multipliers have an inaccuracy, which is defined in terms of the Mean Relative Error (MRE). To assess the applicability of approximate multipliers to CNN training, a simulation of the impact of approximate-multiplier error on CNN training is presented. The paper demonstrates that using approximate multipliers for CNN training can significantly enhance performance in terms of speed, power, and area at the cost of a small negative impact on the achieved accuracy. Additionally, the paper proposes a hybrid training method that mitigates this negative impact on accuracy. Using the proposed hybrid method, training starts with approximate multipliers and then switches to exact multipliers for the last few epochs. With this method, the performance benefits of approximate multipliers in terms of speed, power, and area can be attained for a large portion of the training stage, while the negative impact on accuracy is diminished by using exact multipliers for the final epochs.
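The hybrid schedule can be simulated in a few lines. The sketch below models an approximate multiplier as an exact product perturbed by a bounded relative error (standing in for the MRE), trains a single scalar parameter by gradient descent with it, and switches to exact multiplies for the last epochs. The 5% error level, the switch point, and the toy objective are all illustrative assumptions, not values from the paper.

```python
import random

def train_scalar(epochs=100, switch_at=90, lr=0.1, target=3.0, mre=0.05):
    """Fit w so that w * 2 matches target * 2, using approximate multiplies
    for the first `switch_at` epochs and exact ones afterwards."""
    rng = random.Random(0)

    def approx_mul(a, b):
        # Exact product perturbed by a zero-mean relative error (~ the MRE).
        return a * b * (1.0 + rng.uniform(-2 * mre, 2 * mre))

    w = 0.0
    for epoch in range(epochs):
        mul = approx_mul if epoch < switch_at else (lambda a, b: a * b)
        err = mul(w, 2.0) - target * 2.0
        w -= lr * 2.0 * err   # gradient step on 0.5 * err**2
    return w

w_final = train_scalar()
```

The noisy phase hovers near the optimum, and the short exact phase contracts the remaining error geometrically, which is the mechanism the hybrid method relies on.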

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2017-09-21
Mejbah Alam; Justin Gottschlich; Nesime Tatbul; Javier Turek; Timothy Mattson; Abdullah Muzahid

The field of machine programming (MP), the automation of the development of software, is making notable research advances. This is, in part, due to the emergence of a wide range of novel techniques in machine learning. In this paper, we apply MP to the automation of software performance regression testing. A performance regression is a software performance degradation caused by a code change. We present AutoPerf - a novel approach to automate regression testing that utilizes three core techniques: (i) zero-positive learning, (ii) autoencoders, and (iii) hardware telemetry. We demonstrate AutoPerf's generality and efficacy against 3 types of performance regressions across 10 real performance bugs in 7 benchmark and open-source programs. On average, AutoPerf exhibits 4% profiling overhead and accurately diagnoses more performance bugs than prior state-of-the-art approaches. Thus far, AutoPerf has produced no false negatives.
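Zero-positive learning means fitting the detector only on "normal" runs and flagging anything it cannot reconstruct. The miniature below replaces AutoPerf's autoencoder with a mean-based reconstructor over hardware-counter vectors; the counter values and threshold rule are illustrative, not the paper's.

```python
def fit(normal_runs):
    """Train on normal runs ONLY (zero-positive learning); return a detector
    that flags runs whose reconstruction error exceeds anything seen in
    training."""
    dim = len(normal_runs[0])
    mean = [sum(r[d] for r in normal_runs) / len(normal_runs) for d in range(dim)]

    def err(run):
        # Euclidean reconstruction error against the "decoded" mean vector.
        return sum((v - m) ** 2 for v, m in zip(run, mean)) ** 0.5

    threshold = max(err(r) for r in normal_runs)
    return lambda run: err(run) > threshold   # True => regression suspected

# Made-up counter vectors, e.g. (cache misses, pipeline stalls) per run.
normal = [[100.0, 5.0], [102.0, 6.0], [98.0, 5.5]]
is_regression = fit(normal)
assert not is_regression([101.0, 5.2])
assert is_regression([180.0, 20.0])
```

An autoencoder plays the same role with a learned nonlinear reconstruction, which is what lets AutoPerf generalize across regression types without ever seeing a positive example.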

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2019-02-22
Geoff Langdale; Daniel Lemire

JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.
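A core simdjson idea is locating all structural characters of a 64-byte block at once with SIMD compares, producing a 64-bit bitmap. The scalar Python sketch below computes the same bitmap one byte at a time; a real implementation would also mask out structural characters that occur inside strings, a stage omitted here.

```python
# Structural JSON characters, as single byte values.
STRUCTURAL = set(b'{}[]:,')

def structural_bitmap(block):
    """block: up to 64 bytes -> int with bit i set iff block[i] is structural.

    SIMD hardware produces this bitmap for a whole 64-byte block in a few
    instructions; here we build it a byte at a time.
    """
    bits = 0
    for i, c in enumerate(block):
        bits |= (c in STRUCTURAL) << i
    return bits

doc = b'{"a":[1,2]}'
bm = structural_bitmap(doc)
positions = [i for i in range(len(doc)) if bm >> i & 1]
# structural chars: '{' at 0, ':' at 4, '[' at 5, ',' at 7, ']' at 9, '}' at 10
assert positions == [0, 4, 5, 7, 9, 10]
```

Iterating set bits of the bitmap (e.g. with count-trailing-zeros) is how the parser jumps between structural characters without branching on every byte.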

Updated: 2020-01-04
• arXiv.cs.PF Pub Date : 2019-10-21
Yajun Zhao; Juan Liu; Saijin Xie

In NR-based Access to Unlicensed Spectrum (NR-U) in the 5G system, to satisfy the Occupied Channel Bandwidth (OCB) rules of unlicensed spectrum, the PRACH and PUCCH channels have to use sequence repetition mechanisms in the frequency domain. These repetition mechanisms cause serious cubic metric (CM) problems for these channels, even though both channel types are composed of Constant Amplitude Zero Auto-correlation (CAZAC) sequences. Based on the characteristics of the CAZAC sequences used for PRACH and PUCCH (specifically PUCCH formats 0 and 1) in 5G NR, in this paper we propose new CM-reduction mechanisms for these two channel types, with design principles that preserve the auto-correlation and cross-correlation performance of the sequences. The proposed CM schemes are then evaluated, and optimized parameters are provided that balance CM performance against complexity.
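The CAZAC properties the design relies on can be checked numerically for a Zadoff-Chu sequence, the family underlying NR PRACH and PUCCH formats 0/1. This is a generic sketch of one common odd-length form, x[n] = exp(-j*pi*u*n*(n+1)/N), not the exact NR sequence construction; the root u=1 and length N=7 are illustrative.

```python
import cmath

def zadoff_chu(u, N):
    """Length-N Zadoff-Chu sequence (odd N, gcd(u, N) = 1 assumed)."""
    return [cmath.exp(-1j * cmath.pi * u * n * (n + 1) / N) for n in range(N)]

N = 7
zc = zadoff_chu(u=1, N=N)

# Constant amplitude: every sample has unit magnitude.
assert all(abs(abs(x) - 1.0) < 1e-9 for x in zc)

# Zero autocorrelation: the periodic autocorrelation vanishes at a
# nonzero cyclic shift.
shift = 3
corr = sum(zc[n] * zc[(n + shift) % N].conjugate() for n in range(N))
assert abs(corr) < 1e-9
```

It is exactly this constant envelope that frequency-domain repetition disturbs, which is why the repeated channels need dedicated CM-reduction schemes.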

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
