The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development
arXiv - CS - Software Engineering. Pub Date: 2019-05-22, DOI: arxiv-1905.08942
Micah J. Smith, Carles Sala, James Max Kanter, Kalyan Veeramachaneni

As machine learning is applied more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of "pipeline jungles" - brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML pipelines, abstracting away glue code, data flow, and data storage. We further pair these pipelines with a hierarchy of AutoML strategies - Bayesian optimization and bandit learning. We use these components to create a general-purpose, multi-task, end-to-end AutoML system that provides solutions to a variety of data modalities (image, text, graph, tabular, relational, etc.) and problem types (classification, regression, anomaly detection, graph matching, etc.). We demonstrate 5 real-world use cases and 2 case studies of our approach. Finally, we present an evaluation suite of 456 real-world ML tasks and describe the characteristics of 2.5 million pipelines searched over this task suite.
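The abstract's three ideas (primitives behind a uniform API, pipelines that hide glue code, and bandit-based selection among candidates) can be illustrated with a small sketch. Every name below (`Primitive`, `Pipeline`, `make_centering`, `make_threshold_classifier`, `ucb1_select`) is hypothetical and is not the framework's actual API. UCB1 is used only as one familiar bandit rule, since the abstract does not say which bandit algorithm the authors use, and the Bayesian optimization layer is omitted entirely.

```python
import math
from typing import Callable, Dict, List, Sequence


class Primitive:
    """Hypothetical uniform interface: every component has fit() and produce()."""

    def __init__(self, name: str, fit_fn: Callable, produce_fn: Callable):
        self.name = name
        self._fit_fn = fit_fn
        self._produce_fn = produce_fn
        self.state = None  # whatever fit_fn learned from the data

    def fit(self, X, y=None) -> None:
        self.state = self._fit_fn(X, y)

    def produce(self, X):
        return self._produce_fn(self.state, X)


class Pipeline:
    """Chains primitives so the caller writes no glue code between steps."""

    def __init__(self, primitives: Sequence[Primitive]):
        self.primitives = list(primitives)

    def fit(self, X, y) -> None:
        last = len(self.primitives) - 1
        for i, primitive in enumerate(self.primitives):
            primitive.fit(X, y)
            if i < last:  # pass transformed data on to the next step
                X = primitive.produce(X)

    def predict(self, X):
        for primitive in self.primitives:
            X = primitive.produce(X)
        return X


def make_centering() -> Primitive:
    """Toy preprocessing primitive: subtract the training mean."""
    return Primitive(
        "center",
        lambda X, y: sum(X) / len(X),
        lambda mean, X: [x - mean for x in X],
    )


def _class_midpoint(X, y) -> float:
    neg = [x for x, label in zip(X, y) if label == 0]
    pos = [x for x, label in zip(X, y) if label == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2


def make_threshold_classifier() -> Primitive:
    """Toy estimator primitive: threshold at the midpoint of the class means."""
    return Primitive(
        "threshold",
        _class_midpoint,
        lambda t, X: [1 if x > t else 0 for x in X],
    )


def ucb1_select(scores: Dict[str, List[float]]) -> str:
    """Pick the next candidate pipeline to evaluate using the UCB1 rule."""
    for arm, history in scores.items():
        if not history:  # evaluate every candidate at least once
            return arm
    total = sum(len(history) for history in scores.values())

    def upper_bound(arm: str) -> float:
        history = scores[arm]
        mean = sum(history) / len(history)
        return mean + math.sqrt(2 * math.log(total) / len(history))

    return max(scores, key=upper_bound)
```

A caller would build `Pipeline([make_centering(), make_threshold_classifier()])`, call `fit` and `predict`, record a validation score under that pipeline's name, and ask `ucb1_select` which candidate to try next. In a real system a tuner would also propose hyperparameters for the chosen candidate, which this sketch leaves out.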

Updated: 2020-04-08