Multi-layer Optimizations for End-to-End Data Analytics
arXiv - CS - Databases. Pub Date: 2020-01-10, DOI: arxiv-2001.03541
Amir Shaikhha, Maximilian Schleich, Alexandru Ghita, Dan Olteanu

We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, and data layout optimizations, and is finally compiled into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
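The abstract stays high-level, so the following is a minimal, hypothetical Python sketch of the kind of algebraic rewriting it alludes to for linear regression over a join: the learner only needs the aggregates X^T X and X^T y, and those sums can be pushed past the join and computed group-by-group over the base relations instead of over a materialized wide training table. The relations, column names, and functions below are illustrative assumptions, not IFAQ's actual DSL, API, or pipeline.

```python
# Hypothetical sketch: pushing the aggregates needed by linear regression
# past a join, instead of materializing the joined training dataset.
import numpy as np
from collections import defaultdict

# Toy input relations (made up): Orders(item_id, quantity, label), Items(item_id, price).
orders = [(1, 2.0, 10.0), (1, 3.0, 16.0), (2, 1.0, 7.0), (2, 4.0, 25.0)]
items = {1: 4.0, 2: 6.0}
# Features per joined tuple: [1 (bias), quantity, price]; target: label.

def mainstream(orders, items):
    """Mainstream approach: materialize the join into a flat design matrix, then aggregate."""
    X = np.array([[1.0, q, items[i]] for i, q, _ in orders])
    y = np.array([lbl for _, _, lbl in orders])
    return X.T @ X, X.T @ y

def pushed_past_join(orders, items):
    """Compute the same aggregates without materializing the join:
    aggregate Orders per join key first, then combine with Items,
    exploiting that price is constant within each item group."""
    # Per-item partial aggregates over Orders: [count, sum q, sum q^2, sum y, sum q*y]
    part = defaultdict(lambda: np.zeros(5))
    for i, q, lbl in orders:
        part[i] += np.array([1.0, q, q * q, lbl, q * lbl])

    XtX, Xty = np.zeros((3, 3)), np.zeros(3)
    for i, v in part.items():
        n, sq, sqq, sy, sqy = v
        p = items[i]  # join-key lookup; each price is touched once per group, not per order
        XtX += np.array([[n,     sq,     n * p],
                         [sq,    sqq,    p * sq],
                         [n * p, p * sq, n * p * p]])
        Xty += np.array([sy, sqy, p * sy])
    return XtX, Xty

A1, b1 = mainstream(orders, items)
A2, b2 = pushed_past_join(orders, items)
assert np.allclose(A1, A2) and np.allclose(b1, b2)  # same aggregates, no wide table
print("least-squares parameters:", np.linalg.solve(A1, b1))
```

In IFAQ such rewrites are applied automatically as one of several optimization layers, followed by schema specialization, data layout optimization, and compilation to low-level C++; the sketch only illustrates the algebraic step.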

Updated: 2020-01-13