ARDA: Automatic Relational Data Augmentation for Machine Learning,arXiv - CS - Databases

arXiv - CS - Databases. Pub Date: 2020-03-21, DOI: arxiv-2003.09758
Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

Automatic machine learning (AML) is a family of techniques that automate the process of training predictive models, aiming both to improve performance and to make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline such as model selection, hyperparameter tuning, and feature selection, relatively few have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
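The two components named in the abstract (join, then prune) can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not specify how joins are discovered or how features are scored, so the sketch assumes a known join key and substitutes a plain Pearson-correlation threshold for the feature selection step. The helper names (`left_join`, `select_features`), the column names, and the threshold of 0.5 are all hypothetical.

```python
from math import sqrt


def left_join(base, other, key):
    """Left-join two lists of row dicts on `key`; unmatched rows get None."""
    index = {row[key]: row for row in other}
    extra_cols = [c for c in (other[0] if other else {}) if c != key]
    joined = []
    for row in base:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(rows, target, key, threshold=0.5):
    """Stand-in selector (an assumption, not the paper's method):
    keep columns whose |correlation| with the target meets the threshold."""
    ys = [r[target] for r in rows]
    kept = []
    for col in rows[0]:
        if col in (target, key):
            continue
        xs = [r[col] for r in rows]
        if any(x is None for x in xs):
            continue  # simplification: skip columns with missing values
        if abs(pearson(xs, ys)) >= threshold:
            kept.append(col)
    return kept


# Toy input dataset (with the prediction target) and one repository table.
base = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 20.0},
    {"id": 3, "price": 30.0},
    {"id": 4, "price": 40.0},
]
repo_table = [
    {"id": 1, "area": 1.0, "noise": 5.0},
    {"id": 2, "area": 2.1, "noise": -3.0},
    {"id": 3, "area": 2.9, "noise": -3.0},
    {"id": 4, "area": 4.0, "noise": 5.0},
]

augmented = left_join(base, repo_table, "id")
selected = select_features(augmented, target="price", key="id")
print(selected)
```

In this toy data the `noise` column is uncorrelated with `price` and is dropped, while `area` is kept. A real system would also have to discover candidate joins across the repository and handle missing or non-numeric values, which the sketch deliberately leaves out.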

Updated: 2020-03-24
