Variation across Scales: Measurement Fidelity under Twitter Data Sampling
arXiv - CS - Information Retrieval Pub Date : 2020-03-21 , DOI: arxiv-2003.09557
Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

A comprehensive understanding of data quality is the cornerstone of measurement studies in social media research. This paper presents in-depth measurements of the effects of Twitter data sampling across different timescales and different subjects (entities, networks, and cascades). By constructing complete tweet streams, we show that the Twitter rate limit message is an accurate indicator of the volume of missing tweets. Sampling also differs significantly across timescales. While the hourly sampling rate is influenced by the diurnal rhythm in different time zones, millisecond-level sampling is heavily affected by implementation choices. For Twitter entities such as users, we find that a Bernoulli process with a uniform rate approximates the empirical distributions well. It also allows us to estimate the true ranking from the observed sample data. For networks on Twitter, their structures are altered significantly, and some components are more likely to be preserved than others. For retweet cascades, we observe changes in the distributions of tweet inter-arrival time and user influence, which will affect models that rely on these features. This work calls attention to noise and potential biases in social data, and provides a few tools to measure Twitter sampling effects.
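As a rough illustration of the Bernoulli sampling idea mentioned in the abstract, the following is a minimal sketch, not the authors' code. It assumes each tweet is kept independently with one uniform probability, that the sampling rate can be approximated from the number of collected tweets and the missing volume reported by rate limit messages, and that true per-entity volumes can be approximated by rescaling observed counts. The function names and the toy data are hypothetical. The paper's procedure for estimating true rankings is not described in the abstract and is not reproduced here.

```python
import random


def bernoulli_sample(true_counts, rate, seed=0):
    """Keep each tweet independently with probability `rate` (assumed uniform)."""
    rng = random.Random(seed)
    return {
        entity: sum(rng.random() < rate for _ in range(n))
        for entity, n in true_counts.items()
    }


def estimate_sampling_rate(n_collected, n_missing):
    """Approximate the sampling rate as collected / (collected + missing).

    `n_missing` stands for the missing volume reported by rate limit
    messages; how those messages are parsed is outside this sketch.
    """
    return n_collected / (n_collected + n_missing)


def estimate_true_counts(observed_counts, rate):
    """Rescale observed counts by 1 / rate to approximate true volumes."""
    return {entity: n / rate for entity, n in observed_counts.items()}


if __name__ == "__main__":
    # Hypothetical per-user tweet volumes in a complete stream.
    true_counts = {"user_a": 1000, "user_b": 400, "user_c": 50}

    observed = bernoulli_sample(true_counts, rate=0.6, seed=42)
    n_collected = sum(observed.values())
    n_missing = sum(true_counts.values()) - n_collected

    rate_hat = estimate_sampling_rate(n_collected, n_missing)
    print("estimated sampling rate:", round(rate_hat, 3))
    print("estimated true counts:", estimate_true_counts(observed, rate_hat))
```

Under a uniform rate, rescaling leaves the order of entities in the sample unchanged, so this sketch only recovers approximate volumes. Entities with few tweets can still be missed or reordered by chance, which is why rankings in a sample may differ from the true ranking.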

Updated: 2020-04-07