Representation Bias in Data: A Survey on Identification and Resolution Techniques, ACM Computing Surveys


ACM Computing Surveys (IF 23.8), Pub Date: 2023-03-17, DOI: 10.1145/3588433
Nima Shahbazi₁, Yin Lin₂, Abolfazl Asudeh₁, H. V. Jagadish₂


Data-driven algorithms are only as good as the data they work with, yet data sets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation methods. Given that “bias in, bias out”, one cannot expect AI-based solutions to have equitable outcomes for societal applications without addressing issues such as representation bias. While fairness in machine learning models has been studied extensively, including in several review papers, bias in the data has been less studied. This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later. The scope of this survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies to categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties.

There is still a long way to go to fully address representation bias issues in data. The authors hope that this survey motivates researchers to approach these challenges in the future by observing existing work within their respective domains.
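As a rough illustration of what treating representation bias as a property of the data set itself can mean for tabular data, the sketch below compares each group's share of the rows with a reference share and reports the groups that fall short. It is not taken from the survey: the function name `representation_gaps`, the `reference_shares` input, and the `tolerance` threshold are all assumptions made for this example, and real identification techniques may define under-representation quite differently.

```python
from collections import Counter


def representation_gaps(records, attribute, reference_shares, tolerance=0.05):
    """Report groups whose share of `records` falls short of a reference share.

    records: list of dicts, one per row of a tabular data set
    attribute: key holding the group label (hypothetical name, e.g. "group")
    reference_shares: assumed expected share per group (illustrative input)
    tolerance: shortfall allowed before a group is reported (assumed value)
    """
    counts = Counter(row[attribute] for row in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        # A group that never appears has an observed share of zero.
        observed = counts.get(group, 0) / total if total else 0.0
        if expected - observed > tolerance:
            gaps[group] = {"observed": observed, "expected": expected}
    return gaps


# Toy data: nine rows from group "A" and one row from group "B".
rows = [{"group": "A"}] * 9 + [{"group": "B"}]
print(representation_gaps(rows, "group", {"A": 0.5, "B": 0.5}))
```

With the toy rows above, only group "B" is reported, because its observed share is well below the assumed reference share. The check looks at the data alone and makes no assumption about any downstream model.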


Updated: 2023-03-18
