Snorkel: rapid training data creation with weak supervision, The VLDB Journal



Snorkel: rapid training data creation with weak supervision
The VLDB Journal ( IF 4.2 ) Pub Date : 2019-07-15 , DOI: 10.1007/s00778-019-00552-1
Alexander Ratner ₁ , Stephen H Bach ₁,₂ , Henry Ehrenberg ₁ , Jason Fries ₁ , Sen Wu ₁ , Christopher Ré ₁


Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models \(2.8\times \) faster and increase predictive performance an average \(45.5\%\) versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to \(1.8\times \) speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides \(132\%\) average improvements to predictive performance over prior heuristic approaches and comes within an average \(3.60\%\) of the predictive performance of large hand-curated training sets.
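To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It is not Snorkel's API: the function names, the keyword heuristics, and the example sentences are hypothetical, and the combination step is a simple majority vote standing in for the generative model that data programming uses to estimate labeling-function accuracies and correlations.

```python
from collections import Counter
from typing import Callable, List

# Label conventions assumed for this sketch (not taken from the abstract).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical labeling functions: each encodes one noisy heuristic and
# abstains when the heuristic does not apply.
def lf_mentions_causes(text: str) -> int:
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_negation(text: str) -> int:
    lowered = text.lower()
    if "does not" in lowered or "no evidence" in lowered:
        return NEGATIVE
    return ABSTAIN


def lf_induced(text: str) -> int:
    return POSITIVE if "induced" in text.lower() else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_mentions_causes,
    lf_negation,
    lf_induced,
]


def apply_lfs(texts: List[str], lfs: List[Callable[[str], int]]) -> List[List[int]]:
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine one row of votes; abstain when nothing fired or the vote is tied.

    This is a deliberately simple stand-in. It weights every labeling
    function equally instead of learning their accuracies.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    texts = [
        "Aspirin causes stomach bleeding.",
        "The drug does not cause headaches.",
        "Patients were observed for two weeks.",
    ]
    for text, row in zip(texts, apply_lfs(texts, LABELING_FUNCTIONS)):
        print(row, majority_vote(row), text)
```

The combined labels would then serve as training labels for a downstream model. A learned label model differs from this sketch mainly in that it weights each labeling function by its estimated accuracy rather than counting votes equally.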


Updated: 2019-07-15
