Regularized Optimal Transport of Covariates and Outcomes in Data Recoding
Journal of the American Statistical Association (IF 3.7), Pub Date: 2020-07-20, DOI: 10.1080/01621459.2020.1775615
Valérie Garès, Jérémy Omer

Abstract

When databases are constructed from heterogeneous sources, it is not unusual for different encodings to be used for the same outcome. In such cases, the outcome variable must be recoded before the two databases are merged. The recoding method proposed here is an application of optimal transportation, in which we search for a bijective mapping between the distributions of this variable in the two databases. In this article, we build upon the work by Garès et al., who transport the distributions of categorical outcomes assuming that they are distributed equally in the two databases. Here, we extend the scope of the model to treat all situations where the covariates explain the outcomes similarly in the two databases. In particular, we do not require that the outcomes be distributed equally. To this end, we propose a model where the joint distributions of outcomes and covariates are transported. We also propose to enrich the model by relaxing the constraints on the marginal distributions and adding an L1 regularization term. The performance of the models is evaluated in a simulation study, and the models are applied to a real dataset. The code used in the computational assessment and in the simulation of test cases is publicly available in a GitHub repository: https://github.com/otrecoding/OTRecod.jl.
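To make the idea of transporting one database onto another concrete, here is a minimal sketch in Python. It is not the authors' model and does not use the OTRecod.jl API. It assumes two equally sized samples with uniform weights and a single numeric covariate, in which case optimal transport reduces to an assignment problem that can be solved by brute force on a toy example. The data, the absolute-difference cost, and the function names `optimal_assignment` and `recode` are all hypothetical, and the relaxed marginals and L1 regularization mentioned in the abstract are left out.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Brute-force the permutation minimising sum(cost[i][perm[i]]).

    With equal sample sizes and uniform weights, an optimal transport
    plan can be taken to be such a permutation (a bijective mapping).
    Only suitable for very small n, since it enumerates n! candidates.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


def recode(cov_a, cov_b, z_b):
    """Predict the B-encoded outcome for each individual of database A.

    Assumption for illustration: the transport cost between two
    individuals is the absolute difference of their single covariate.
    """
    cost = [[abs(xa - xb) for xb in cov_b] for xa in cov_a]
    perm, total = optimal_assignment(cost)
    # Each A individual inherits the outcome of its matched B individual.
    return [z_b[j] for j in perm], total


# Hypothetical toy data: one covariate per individual.
cov_a = [1.0, 2.0, 8.0, 9.0]           # database A (outcome Z unobserved)
cov_b = [8.5, 1.5, 9.5, 2.5]           # database B
z_b = ["high", "low", "high", "low"]   # outcome as encoded in database B

predicted_z, total_cost = recode(cov_a, cov_b, z_b)
print(predicted_z, total_cost)
```

In this toy case, individuals of A with small covariate values are matched to the individuals of B with small covariate values, so they inherit the "low" encoding. A realistic implementation would solve a linear program over group-level joint distributions rather than enumerate permutations.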




Updated: 2020-07-20