Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
arXiv - CS - Social and Information Networks. Pub Date: 2020-01-21, DOI: arxiv-2001.07487
Antonis Papasavva, Savvas Zannettou, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.

Updated: 2020-04-02