PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms, arXiv - CS - Distributed, Parallel, and Cluster Computing


arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-06-22, DOI: arxiv-2006.12587
Thomas Rausch, Waldemar Hummer, and Vinod Muthusamy

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.
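
To make the phrase "stochastic, discrete event simulator" concrete, the following is a minimal sketch in Python. It is not the PipeSim implementation: the function `simulate`, its parameters, the single pool of job slots, the FIFO policy, and the exponential arrival and duration distributions are all assumptions chosen for illustration. The abstract states that the actual model is built from production analytics data.

```python
import heapq
import random
from collections import deque


def simulate(num_pipelines=20, capacity=2, mean_interarrival=5.0,
             mean_duration=8.0, seed=42):
    """Simulate pipelines competing for `capacity` job slots under FIFO.

    Hypothetical toy model. Returns one (arrival, start, finish) tuple
    per pipeline, indexed by pipeline id.
    """
    rng = random.Random(seed)
    events = []  # min-heap of (time, seq, kind, pipeline_id)
    seq = 0      # unique tie-breaker for events with equal timestamps

    # Stochastic arrivals: exponentially distributed interarrival times.
    t = 0.0
    for pid in range(num_pipelines):
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, seq, "arrive", pid))
        seq += 1

    free = capacity
    waiting = deque()  # FIFO queue of pipeline ids awaiting a slot
    arrival, start, finish = {}, {}, {}

    # Discrete event loop: jump from one event time to the next.
    while events:
        now, _, kind, pid = heapq.heappop(events)
        if kind == "arrive":
            arrival[pid] = now
            waiting.append(pid)
        else:  # "finish"
            finish[pid] = now
            free += 1
        # Scheduling decision point: start queued pipelines while slots are free.
        while free > 0 and waiting:
            nxt = waiting.popleft()
            free -= 1
            start[nxt] = now
            duration = rng.expovariate(1.0 / mean_duration)
            heapq.heappush(events, (now + duration, seq, "finish", nxt))
            seq += 1

    return [(arrival[p], start[p], finish[p]) for p in range(num_pipelines)]


results = simulate()
mean_wait = sum(s - a for a, s, _ in results) / len(results)
print(f"mean wait: {mean_wait:.2f}")
```

Swapping the FIFO `deque` for a different queueing policy, or changing `capacity`, is the kind of operational strategy comparison the abstract describes, here reduced to a single resource and a single task per pipeline.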


Updated: 2020-06-24
