Can big data deliver its promises in migration research?
International Migration (IF 2.022) Pub Date: 2022-03-19, DOI: 10.1111/imig.12984
Albert Ali Salah 1, 2

As a computer scientist working on algorithms and applications for human behaviour analysis since the 90s, I was able to witness the rise of "big data," which was jokingly defined by one of my university professors as "data that would not fit on your computer". Indeed, such is the volume of data constantly generated by humans moving about with mobile phones in their pockets, using devices and services that create, store and share information about such usage, contributing and propagating content on social media, that it becomes difficult to organise, classify and interpret these digital traces with traditional computational approaches. Add to this the devices and systems that observe humans, such as surveillance cameras, smart infrastructure elements that log users, remote sensing and satellite systems, and the picture becomes even more complex. Yet, with ample opportunities for monetisation, approaches and systems were quickly developed to process big data. In particular, companies quickly recognised that big data collected from customers could provide valuable insights for marketing and optimisation tasks, for content customisation, user adaptation and modelling. But such data were also valuable to governments for decision-making and policy, and for many researchers interested in human behaviour, including those in the field of migration research, looking for reliable and granular indicators of human movements over the globe.

The main premise of computational social science (or social computing) is that large-scale and complex human behavioural data, typically stored by companies that benefit from it for the mentioned purposes, can be analysed and re-purposed to address research questions from the social sciences (Lazer et al., 2020). In this new paradigm, carefully constructed samples and in-depth questions were replaced by numbers of subjects greater by orders of magnitude, and simple indicators that were sampled across much greater temporal and spatial resolutions than the traditional "snapshots". In addition, these were analysed in an aggregated fashion to provide proxy variables. For example, in a seminal study related to migration research, Blumenstock et al. (2015) illustrated that a person's mobile phone usage history could be used to create a wealth indicator. They used mobile phone call detail records (CDR) collected for accounting purposes by the phone company and containing the base tower locations, times, and durations of phone calls for each customer. Aggregating anonymised data from 1.5 million customers, Blumenstock and co-authors created a detailed map that predicted the wealth of people living in Rwanda, which showed a high correlation with the last two Demographic and Health Surveys conducted with seven and thirteen thousand people, respectively, at a fraction of the cost, and in a very timely fashion. Does this mean that similar big data initiatives can be easily created all over the world to enable improved analyses of factors known to relate to migration (Sîrbu et al., 2021), or to provide timely insights to policy makers? Hardly so; there are technical, ethical, legal, and as Scheel and Ustek-Spilda (2018) highlighted, political hurdles that hinder adopting big data solutions in migration statistics.
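The aggregate-then-validate idea described above can be illustrated with a minimal sketch. It is not the method of Blumenstock et al. (2015), who worked with far richer features and models; it only shows the general shape of turning call records into a district-level proxy and checking it against survey values. All data, names (`CDR`, `TOWER_DISTRICT`, `SURVEY_WEALTH`) and the choice of mean call duration as the proxy are hypothetical assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical, anonymised CDR rows: (user_id, tower_id, duration_seconds).
CDR = [
    ("u1", "t1", 60), ("u1", "t2", 120), ("u2", "t1", 30),
    ("u3", "t3", 300), ("u3", "t3", 200), ("u4", "t4", 20),
]
# Hypothetical mapping from base towers to districts.
TOWER_DISTRICT = {"t1": "A", "t2": "A", "t3": "B", "t4": "C"}
# Hypothetical survey-based wealth index per district (the ground truth).
SURVEY_WEALTH = {"A": 0.4, "B": 0.9, "C": 0.1}


def district_mean_call_duration(cdr, tower_district):
    """Aggregate individual calls into one proxy value per district."""
    totals = defaultdict(lambda: [0, 0])  # district -> [duration sum, call count]
    for _user, tower, duration in cdr:
        district = tower_district[tower]
        totals[district][0] += duration
        totals[district][1] += 1
    return {d: s / n for d, (s, n) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


proxy = district_mean_call_duration(CDR, TOWER_DISTRICT)
districts = sorted(SURVEY_WEALTH)
r = pearson([proxy[d] for d in districts], [SURVEY_WEALTH[d] for d in districts])
print(proxy, round(r, 3))
```

In practice the validation step is the hard part: where a data gap exists, there may be no survey to correlate against.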

Before continuing further, I would like to point out that the big data usage examples I advocate work with anonymised and aggregated data, and do not, in principle, involve data that can be used to identify specific individuals. In the last few years, several instances of technology usage in areas related to migration management came into question (Molnar, 2021). In particular, high-tech solutions for large-scale biometric identification of refugees, drone and robotic surveillance at borders, and automatic decision-making based on artificial intelligence in sensitive areas where the cost of an error cannot (and should not) be quantified, including lie detection technologies at European borders, have been discussed in terms of their ethical and human rights ramifications. Molnar (2021, p.134) also pointed out that Covid-19 increased the investment in technological solutions and stated that "as governments move towards biosurveillance to contain the spread of the pandemic, there has been an increase in the use of tracking, automatic drones and other types of technologies that purport to help manage migration, exacerbating potential human rights concerns (Cliffe, 2020; Lewis & Mok, 2020; Molnar & Naranjo, 2020)." These are all valid concerns, but it is possible to process big data with proper checks and balances, controlling for both individual and group privacy (Salah, Canca, et al., 2022), and we should weigh the usefulness of such data in the context of specific risks. After all, flexible and ubiquitous technologies such as mobile phones do have many uses, such as helping to contain the spread of a pandemic (Oliver et al., 2020). The existence of technology and its associated data brings an obligation with it: if it is possible to harness them responsibly to improve the lives of people, governmental and non-governmental actors should consider doing so seriously (Letouzé & Oliver, 2019).

The primary potential of big data for migration research seems to be in addressing data gaps (Bircan et al., 2020; Bosco et al., 2022), which are created by, for instance, inconsistent definitions and data collection methodologies, a lack of adequate statistics, and the plain absence of data on irregular migration. Different data types may provide information about such gaps (Salah, Bircan, et al., 2022). Mobile phone call detail records contain very detailed mobility information but, some exceptions aside, they are processed without being linked to demographic information, to protect the privacy of the data subjects. Nonetheless, they can be used to produce wealth indicators, show the concentration and mobilities of groups, indicate infrastructure usage, and even provide indicators of social integration by modelling intergroup encounters and communication (Bakker et al., 2019). Remote sensing provides population estimates that are considered valuable in the contexts of natural and man-made disasters and climate-related mobility (Bircan, 2022). Social media data can be a particularly rich source, offering different insights depending on the platform. For example, Twitter can provide information about sentiments of and about migrants and refugees, while LinkedIn can offer indicators about skilled migration, and Facebook about demographics (Bosco et al., 2022; Coimbra et al., 2022; Kim et al., 2022). Combining multiple data sources can potentially be even more powerful; for example, refugee mobility and settlement can be observed via mobile phone data, and real-estate price averages can be linked to home locations, so examining these two sets of data together provides an opportunity to estimate refugee wealth in greater detail (Bertoli et al., 2019).
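The combination of mobile phone data with real-estate prices mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure of Bertoli et al. (2019): the home location is taken to be the tower a user contacts most often at night (a common heuristic, assumed here), the night window of 20:00 to 06:00 is arbitrary, and the records, group labels and prices are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical anonymised rows: (user_id, group, tower_id, hour_of_day).
CALLS = [
    ("u1", "refugee", "t1", 22), ("u1", "refugee", "t1", 23),
    ("u1", "refugee", "t2", 14),
    ("u2", "refugee", "t3", 2), ("u2", "refugee", "t1", 12),
    ("u3", "non_refugee", "t3", 21), ("u3", "non_refugee", "t3", 5),
]
# Hypothetical average real-estate price around each tower.
TOWER_PRICE = {"t1": 1000.0, "t2": 3000.0, "t3": 2000.0}


def is_night(hour):
    """Assumed night window: 20:00 up to (not including) 06:00."""
    return hour >= 20 or hour < 6


def home_towers(calls):
    """Assign each user the tower they used most often at night."""
    night_use = defaultdict(Counter)
    for user, _group, tower, hour in calls:
        if is_night(hour):
            night_use[user][tower] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in night_use.items()}


def mean_home_price_by_group(calls, tower_price):
    """Average the price at inferred home locations, reported per group only."""
    homes = home_towers(calls)
    group_of = {user: group for user, group, _tower, _hour in calls}
    sums = defaultdict(lambda: [0.0, 0])  # group -> [price sum, user count]
    for user, tower in homes.items():
        sums[group_of[user]][0] += tower_price[tower]
        sums[group_of[user]][1] += 1
    return {group: s / n for group, (s, n) in sums.items()}


print(mean_home_price_by_group(CALLS, TOWER_PRICE))
```

Only group-level averages leave the function; the per-user home assignment is an intermediate step that would stay with the data holder.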

What are the premises under which data gaps are filled? First of all, data access is an important issue. Some of these data sources are publicly available for research (e.g. remote sensing data with limited resolution), but others are buried in the servers of private companies and very difficult to access (e.g. mobile CDR). Second, processing such high volumes of data requires mastery of data science tools, possibly including database management systems, scripting tools and analysis tools that range from natural language processing to image processing, depending on the modality. Ideally, this calls for interdisciplinary collaborations between computer scientists and migration scholars. The third premise is the translation of migration scholars' inquiries (or the data gap) into the disciplinary language of computer scientists, while avoiding the reductionism typical of this field. For instance, if social integration is to be assessed via mobile phone data, quantitative indicators that are relevant to social integration should be found, and they need to be validated in some way. Blumenstock et al. (2015) used previous census data to validate their method, but if there is a data gap to be addressed, there may not be such an obvious choice. The assumptions under which the approximations and projections would work should also be determined in collaboration, to account not only for domain expertise but also for biases that may come from the algorithmic models employed. Finally, applied ethics expertise is required to properly assess the potential risks, and this topic is absent from most technical computer science curricula. Ethical review is necessary not only for the study itself but also for the communication of results, given how sensitive the presentation of findings can be.

As a way of addressing the first difficulty, Verhulst and Young (2019) proposed data collaboratives, which are public-private partnerships based on data owned by private parties. Different models have been proposed for such collaborations (Letouzé & Oliver, 2019), but the main idea is that a private dataset is re-purposed for public good. Data challenges are a form of data collaboratives, where the private data are processed to remove personal information and opened to a larger research community, allowing multiple groups to analyse it from different aspects (Salah, 2021). An example of such challenges was the Data for Refugees (D4R) project supported by TUBITAK, UNHCR, UNICEF and IOM, where Türk Telekom opened a mobile CDR dataset collected in Turkey from 1 million users over a year to provide insights into Syrian refugee mobility in Turkey with the aim of improving their living conditions (Salah et al., 2019). This challenge, with 60+ participating research teams (and an ethics committee examining both the projects and resulting publications) was useful for capacity building, initialising inter-disciplinary and international collaborations, and creating some policy recommendations. However, a challenge is (typically) a one-off event and sustainable data processing requires longer relationships, where the infrastructure must be created to share computed indicators instead of the data itself. Furthermore, to be truly useful, data-driven recommendations must be taken up by researchers with deep domain expertise and by policy makers, further investigated, combined and triangulated with existing information sources and evaluated by taking into account other qualitative factors influencing research and policy considerations. Bridging the data gaps requires actions from all involved parties, including private data holders, policy makers, and most importantly, migration scholars and data scientists bridging disciplinary language gaps.
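The idea of sharing computed indicators instead of the data itself can be sketched as below. This is a minimal illustration, not a description of how D4R or any data collaborative operates: the data holder publishes only per-district counts of distinct users, and suppresses any district with fewer than `k` users. The threshold `k=3`, the function name and the sample records are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw rows that stay with the data holder: (user_id, district).
RAW = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u1", "A"),
    ("u4", "B"), ("u5", "B"),
    ("u6", "C"),
]


def shareable_counts(records, k=3):
    """Return distinct-user counts per district, dropping cells smaller than k."""
    users = defaultdict(set)
    for user, district in records:
        users[district].add(user)  # repeated rows of one user count once
    return {d: len(u) for d, u in users.items() if len(u) >= k}


print(shareable_counts(RAW))       # only sufficiently large cells are released
print(shareable_counts(RAW, k=1))  # k=1 disables suppression, for comparison
```

Small-cell suppression alone does not guarantee privacy, which is one reason the ethics review discussed earlier remains necessary.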




Updated: 2022-03-19