Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation
arXiv - CS - Machine Learning. Pub Date: 2021-03-03, DOI: arxiv-2103.02761
Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Ré

Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios -- well-specified, misspecified, and corrected models -- to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
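
To make the labeled-versus-unlabeled comparison concrete, here is a minimal synthetic sketch of method-of-moments estimation in a toy setting. It is not the paper's estimator or experimental setup. It assumes three binary sources with values in {-1, +1} that are conditionally independent given a uniformly distributed label y, so that E[λ_i λ_j] = E[λ_i y] E[λ_j y]. The function names (`simulate`, `mom_unlabeled`, `mom_labeled`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(n, accs, rng):
    """Draw y uniformly from {-1, +1}; source i agrees with y w.p. accs[i]."""
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, len(accs))) < np.asarray(accs)
    lam = np.where(agree, y[:, None], -y[:, None])
    return lam, y


def mom_unlabeled(lam):
    """Estimate a_i = E[lam_i * y] for three sources without using labels.

    Assumes conditional independence given y, so E[lam_i lam_j] = a_i * a_j,
    and assumes every source is better than random (a_i > 0).
    """
    m = (lam.T @ lam) / lam.shape[0]  # m[i, j] estimates E[lam_i * lam_j]
    a0 = np.sqrt(abs(m[0, 1] * m[0, 2] / m[1, 2]))
    a1 = np.sqrt(abs(m[0, 1] * m[1, 2] / m[0, 2]))
    a2 = np.sqrt(abs(m[0, 2] * m[1, 2] / m[0, 1]))
    return np.array([a0, a1, a2])


def mom_labeled(lam, y):
    """Estimate a_i = E[lam_i * y] directly as a sample mean over labeled points."""
    return (lam * y[:, None]).mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    accs = [0.8, 0.75, 0.7]  # true a_i = 2 * acc_i - 1 = [0.6, 0.5, 0.4]
    lam_u, _ = simulate(200_000, accs, rng)  # many unlabeled points
    lam_l, y_l = simulate(2_000, accs, rng)  # few labeled points
    print("unlabeled estimate:", mom_unlabeled(lam_u))
    print("labeled estimate:  ", mom_labeled(lam_l, y_l))
```

In this well-specified toy model both estimators target the same quantities, so the choice comes down to variance and sample count. If the conditional independence assumption is violated, the product identity used by `mom_unlabeled` no longer holds, which is the kind of misspecification under which the abstract says the unlabeled-only approach incurs additional bias.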

Updated: 2021-03-05