Fair Data Integration
arXiv - CS - Databases. Pub Date: 2020-06-10, DOI: arXiv-2006.06053
Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri and Kush R. Varshney

The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps in generating high-quality training data, most of the fairness literature ignores this stage. In this work, we consider fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. We work under the causal interventional fairness paradigm. Without requiring the underlying structural causal model a priori, we propose an approach that identifies a sub-collection of features ensuring the fairness of the dataset by performing conditional independence tests between different subsets of features. We use group testing to reduce the complexity of the approach. We theoretically prove the correctness of the proposed algorithm for identifying features that ensure interventional fairness, and show that a sub-linear number of conditional independence tests suffices to identify these variables. A detailed empirical evaluation on real-world datasets demonstrates the efficacy and efficiency of our technique.
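The abstract does not spell out the paper's algorithm, but the basic building block it names, a conditional independence test between a candidate feature and the sensitive attribute given admissible features, can be sketched. The snippet below is an illustration only, not the authors' method: it uses a partial-correlation test with a Fisher z-transform (a standard CI test for roughly linear-Gaussian data) on a toy causal structure where one feature depends on the sensitive attribute `s` only through an admissible mediator `z` (and so passes the test), while another depends on `s` directly (and fails). All variable names and the data-generating process are assumptions for illustration.

```python
import math
import numpy as np

def partial_corr_ci_test(x, y, z=None, alpha=0.05):
    """Return True if X ⟂ Y | Z is NOT rejected at level alpha.

    Uses the partial correlation of x and y given z, with a Fisher
    z-transform to get an approximate p-value. Assumes (roughly)
    linear-Gaussian, zero-mean data; z may be None for a marginal test.
    """
    n = len(x)
    if z is None:
        r = np.corrcoef(x, y)[0, 1]
        k = 0
    else:
        z = np.asarray(z).reshape(n, -1)
        k = z.shape[1]
        # Residualize x and y on z via least squares, then correlate residuals.
        bx, *_ = np.linalg.lstsq(z, x, rcond=None)
        by, *_ = np.linalg.lstsq(z, y, rcond=None)
        r = np.corrcoef(x - z @ bx, y - z @ by)[0, 1]
    # Fisher z-transform; the statistic is approximately standard normal
    # under the null hypothesis of conditional independence.
    stat = math.sqrt(n - k - 3) * np.arctanh(r)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided p-value
    return p_value > alpha

# Toy data: s is the sensitive attribute, z an admissible mediator.
rng = np.random.default_rng(0)
n = 2000
s = rng.normal(size=n)
z = s + rng.normal(size=n)
x_fair = z + rng.normal(size=n)    # depends on s only through z
x_biased = s + rng.normal(size=n)  # depends on s directly

print("x_fair passes CI test:", partial_corr_ci_test(x_fair, s, z))
print("x_biased passes CI test:", partial_corr_ci_test(x_biased, s, z))
```

The paper's contribution is, in part, avoiding one such test per candidate subset: group testing lets batches of features be tested jointly, so that a sub-linear number of tests identifies the fair sub-collection.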

Updated: 2020-06-12