Predicting with Proxies: Transfer Learning in High Dimension
Management Science (IF 5.4), Pub Date: 2020-10-02, DOI: 10.1287/mnsc.2020.3729
Hamsa Bastani

Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true predictive task of interest, and must instead rely on more abundant data on a closely-related proxy predictive task. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. Yet, not accounting for the bias in the proxy can lead to sub-optimal decisions. Using real datasets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features). Our proof relies on a new LASSO tail inequality for approximately sparse vectors. Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare datasets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
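
The abstract does not spell out the form of the two-step estimator, so the following is only a minimal sketch of one plausible reading: fit a linear model on the abundant proxy data, then use the small true dataset to fit an L1-penalized (LASSO) correction, relying on the stated observation that the proxy bias is a sparse function of the features. The linear model, the function names (`lasso_cd`, `two_step_estimate`), the synthetic data, and the penalty value `lam` are all assumptions made for illustration and are not taken from the paper.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal map of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    With lam = 0 this reduces to ordinary least squares.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            diff = new - beta[j]
            if diff != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * diff
                beta[j] = new
    return beta


def two_step_estimate(X_proxy, y_proxy, X_true, y_true, lam):
    """Assumed two-step scheme: proxy fit, then sparse correction on true data."""
    # Step 1: unpenalized fit on the large proxy dataset.
    beta_proxy = lasso_cd(X_proxy, y_proxy, lam=0.0)
    # Step 2: LASSO on what the proxy model fails to explain in the true data.
    true_resid = [
        yi - sum(x * b for x, b in zip(row, beta_proxy))
        for row, yi in zip(X_true, y_true)
    ]
    delta = lasso_cd(X_true, true_resid, lam)
    return [b + dl for b, dl in zip(beta_proxy, delta)], delta


def demo(seed=0):
    """Synthetic illustration: the true task differs from the proxy in one feature."""
    rng = random.Random(seed)
    d, n_proxy, n_true = 8, 400, 20
    beta_proxy = [rng.uniform(-1, 1) for _ in range(d)]
    beta_true = list(beta_proxy)
    beta_true[0] += 1.0  # sparse bias: only the first coefficient differs

    def sample(n, beta):
        X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0, 0.5) for row in X]
        return X, y

    X_p, y_p = sample(n_proxy, beta_proxy)
    X_t, y_t = sample(n_true, beta_true)
    estimate, delta = two_step_estimate(X_p, y_p, X_t, y_t, lam=0.2)
    true_only = lasso_cd(X_t, y_t, lam=0.0)  # baseline: ignore the proxy data

    def l1_error(b):
        return sum(abs(u - v) for u, v in zip(b, beta_true))

    print("two-step L1 error: ", l1_error(estimate))
    print("true-only L1 error:", l1_error(true_only))
    print("estimated bias:    ", [round(v, 2) for v in delta])


if __name__ == "__main__":
    demo()
```

In this sketch the penalty pulls the correction toward zero, so features on which the proxy and true tasks agree keep their well-estimated proxy coefficients and the scarce true data is spent only on the features where the two differ. The nonzero entries of the correction are one way to read the "nature of the bias" the abstract mentions.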

Updated: 2020-10-02