Large image datasets: A pyrrhic win for computer vision?
arXiv - CS - Computers and Society. Pub Date: 2020-06-24, DOI: arxiv-2006.16923
Vinay Uday Prabhu, Abeba Birhane

In this paper, we investigate problematic practices and consequences of large-scale vision datasets. We examine broad issues such as the question of consent and justice, as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality-analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up-table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic: shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey the landscape of harm and threats that both society broadly and individuals face due to uncritical and ill-considered dataset curation practices. We then propose possible courses of correction and critique their pros and cons. We have duly open-sourced all of the code and the census meta-datasets generated in this endeavor for the computer vision community to build on. By unveiling the severity of the threats, our hope is to motivate the constitution of mandatory Institutional Review Boards (IRB) for large-scale dataset curation processes.

Updated: 2020-07-27