PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-06-22, DOI: arxiv-2006.12587
Thomas Rausch, Waldemar Hummer, and Vinod Muthusamy

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.
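
To make the idea of a stochastic, discrete event simulation of pipeline scheduling concrete, here is a minimal sketch in plain Python. It is not the authors' simulator or toolkit: the function `simulate`, its parameters, the FIFO scheduling policy, and the exponential arrival and execution-time distributions are all assumptions chosen for illustration, whereas the abstract describes a model built from production analytics data.

```python
import heapq
import random
from collections import deque


def simulate(num_pipelines=50, capacity=2, mean_interarrival=1.0,
             mean_service=1.5, seed=0):
    """Toy discrete event simulation: pipelines compete for `capacity` slots.

    All names and distributions are hypothetical; a trace-driven model would
    sample arrivals and durations from fitted or recorded data instead.
    """
    rng = random.Random(seed)
    events = []  # min-heap of (time, sequence number, kind, pipeline id)
    seq = 0      # unique tie-breaker so heap entries with equal times stay ordered

    # Pre-generate the arrival events with exponential inter-arrival times.
    t = 0.0
    for pid in range(num_pipelines):
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, seq, "arrive", pid))
        seq += 1

    free = capacity    # currently idle execution slots
    waiting = deque()  # FIFO queue of (pipeline id, arrival time)
    waits = {}         # pipeline id -> time spent queued
    now = 0.0

    while events:
        # Jump the clock straight to the next scheduled event.
        now, _, kind, pid = heapq.heappop(events)
        if kind == "arrive":
            waiting.append((pid, now))
        else:  # "finish": the pipeline releases its slot
            free += 1

        # Scheduling decision: start queued pipelines while slots are free.
        while free > 0 and waiting:
            queued_pid, arrived = waiting.popleft()
            free -= 1
            waits[queued_pid] = now - arrived
            duration = rng.expovariate(1.0 / mean_service)
            heapq.heappush(events, (now + duration, seq, "finish", queued_pid))
            seq += 1

    return {
        "makespan": now,
        "mean_wait": sum(waits.values()) / len(waits),
        "waits": waits,
    }


if __name__ == "__main__":
    for slots in (1, 2, 4):
        result = simulate(capacity=slots, seed=42)
        print(slots, round(result["mean_wait"], 2), round(result["makespan"], 2))
```

Swapping the inner scheduling loop for a different policy, or varying `capacity`, is the kind of operational experiment the abstract refers to; fixing `seed` keeps runs reproducible so that results can be compared statistically.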

Updated: 2020-06-24