Big Data, Small Personas: How Algorithms Shape the Demographic Representation of Data-Driven User Segments

Big Data (IF 4.6), Pub Date: 2022-08-12, DOI: 10.1089/big.2021.0177
Joni Salminen, Kamal Chhirang, Soon-Gyo Jung, Saravanan Thirumuruganathan, Kathleen W Guan, Bernard J Jansen

Derived from the notion of algorithmic bias, it is possible that creating user segments such as personas from data results in over- or under-representing certain segments (FAIRNESS), does not properly represent the diversity of the user populations (DIVERSITY), or produces inconsistent results when hyperparameters are changed (CONSISTENCY). Collecting user data on 363M video views from a global news and media organization, we compare personas created from this data using different algorithms. Results indicate that the algorithms fall into two groups: those that generate personas with low diversity–high fairness and those that generate personas with high diversity–low fairness. The algorithms that rank high on diversity tend to rank low on fairness (Spearman's correlation: −0.83). The algorithm that best balances diversity, fairness, and consistency is Spectral Embedding. The results imply that the choice of algorithm is a crucial step in data-driven user segmentation, because the algorithm fundamentally impacts the demographic attributes of the generated personas and thus influences how decision makers view the user population. The results have implications for algorithmic bias in user segmentation and creating user segments that not only consider commercial segmentation criteria but also consider criteria derived from ethical discussions in the computing community.
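
The abstract summarizes the diversity-fairness trade-off with a Spearman rank correlation across algorithms. As a minimal sketch of how such a figure is computed, the Python snippet below applies the tie-free Spearman formula to two rankings of the same items. The algorithm labels and rank values are hypothetical placeholders chosen for illustration, not the paper's data, and the function name `spearman_from_ranks` is an assumption of this sketch.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rho for two tie-free rankings of the same items.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    per-item difference between the two ranks.
    """
    if set(rank_x) != set(rank_y):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_x)
    if n < 2:
        raise ValueError("need at least two items")
    d_squared = sum((rank_x[k] - rank_y[k]) ** 2 for k in rank_x)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks (1 = best) for five placeholder algorithms.
# These values are illustrative only and do not come from the paper.
diversity_rank = {"algo_a": 1, "algo_b": 2, "algo_c": 3, "algo_d": 4, "algo_e": 5}
fairness_rank = {"algo_a": 5, "algo_b": 3, "algo_c": 4, "algo_d": 1, "algo_e": 2}

rho = spearman_from_ranks(diversity_rank, fairness_rank)
print(f"Spearman's rho: {rho:.2f}")  # prints "Spearman's rho: -0.80"
```

A strongly negative value, as in this toy example, means that items ranked high on one criterion tend to rank low on the other. With tied ranks the closed-form shortcut no longer holds, and a library routine such as `scipy.stats.spearmanr` would be the safer choice.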

Updated: 2022-08-16
