BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on Twitter


arXiv - CS - Information Retrieval. Pub Date: 2021-05-04. DOI: arxiv-2105.01331
Hasan Kemik, Nusret Özateş, Meysam Asgari-Chenaghlu, Erik Cambria

The protection of human rights is one of the most important problems facing the world. In this paper, we aim to provide a dataset covering one of the most significant human rights controversies of recent months, one that affected the whole world: the George Floyd incident. We propose a labeled dataset for topic detection that contains 17 million tweets. The tweets were collected from 25 May 2020 to 21 August 2020, covering 89 days from the start of the incident. We labeled the dataset by monitoring the most trending news topics in global and local newspapers. In addition, we present two baselines, TF-IDF and LDA, and evaluate both methods at three different values of k using precision, recall, and F1-score. The collected dataset is available at https://github.com/MeysamAsgariC/BLMT.
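The abstract does not say how the k-based metrics are defined or which three k values were used. As a rough illustration only, the sketch below assumes that k means "the top-k terms returned by a method" and that those terms are compared against a set of reference topic terms. The function name, the sample terms, and the k values (1, 3, 5) are all hypothetical and are not taken from the paper or the dataset.

```python
def precision_recall_f1_at_k(ranked_terms, reference_terms, k):
    """Score the top-k predicted terms against a set of reference terms.

    Assumed definitions (not confirmed by the paper):
      precision@k = hits / k
      recall@k    = hits / number of reference terms
      F1@k        = harmonic mean of the two
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_terms[:k]
    reference = set(reference_terms)
    # A hit is a top-k term that also appears in the reference set.
    hits = len(set(top_k) & reference)
    precision = hits / k
    recall = hits / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


# Hypothetical example data, invented for illustration.
ranked = ["protest", "police", "floyd", "weather", "minneapolis", "music"]
reference = {"floyd", "police", "protest", "minneapolis"}

for k in (1, 3, 5):
    p, r, f = precision_recall_f1_at_k(ranked, reference, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under these assumptions, a small k tends to favor precision and a larger k tends to favor recall, which is one plausible reason to report results at several k values.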


Updated: 2021-05-05
