Early Detection of Social Media Hoaxes at Scale
ACM Transactions on the Web (IF 3.5), Pub Date: 2020-08-18, DOI: 10.1145/3407194
Arkaitz Zubiaga, Aiqi Jiang

The unmoderated nature of social media enables the diffusion of hoaxes, which in turn jeopardises the credibility of information gathered from social media platforms. Existing research on automated detection of hoaxes has the limitation of using relatively small datasets, owing to the difficulty of getting labelled data. This, in turn, has limited research exploring early detection of hoaxes as well as exploring other factors such as the effect of the size of the training data or the use of sliding windows. To mitigate this problem, we introduce a semi-automated method that leverages the Wikidata knowledge base to build large-scale datasets for veracity classification, focusing on celebrity death reports. This enables us to create a dataset with 4,007 reports including over 13M tweets, 15% of which are fake. Experiments using class-specific representations of word embeddings show that we can achieve F1 scores nearing 72% within 10 minutes of the first tweet being posted when we expand the size of the training data following our semi-automated means. Our dataset represents a realistic scenario with a real distribution of true, commemorative, and false stories, which we release for further use as a benchmark in future research.

Updated: 2020-08-18