Demographic differences in search engine use with implications for cohort selection
Information Retrieval Journal (IF 2.5), Pub Date: 2019-01-01, DOI: 10.1007/s10791-018-09349-2
Elad Yom-Tov

The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences in language use in search engines have not been thoroughly analyzed, especially with respect to age and gender. Such differences matter because search engine data is increasingly used in the study of human health, where queries are used to identify patient populations. Using three datasets comprising queries from multiple general-purpose Internet search engines, we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines. Our results show that females and younger people use longer queries: females make approximately 25% more queries with 10 or more words. For queries that identify users as having specific medical conditions, we find that females make 53% more queries than expected, and that this results in patient cohorts that are highly skewed in gender and age compared to known patient populations. We show that cohort selection methods that use information beyond queries in which users indicate their condition produce less skewed cohorts. Finally, we show that biased training cohorts can lead to differential performance of models designed to detect disease from search engine queries. Our results indicate that in studies where demographic representation is important, such as studies of users' health or evaluations of search engine fairness, care should be taken in selecting search engine data so as to create a representative dataset.

Updated: 2019-01-01