The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity

arXiv - CS - Digital Libraries. Pub Date: 2020-05-27, DOI: arxiv-2005.14024
Lukas Gebhard and Felix Hamborg

News articles covering policy issues are an essential source of information in the social sciences and are also frequently used for other purposes, e.g., to train NLP language models. To derive meaningful insights from the analysis of news, large datasets are required that represent real-world distributions, e.g., with respect to the popularity of the contained outlets, topics, or time. Information on the political leanings of media publishers is often needed, e.g., to study differences in news reporting across the political spectrum, which is one of the prime use cases in the social sciences when studying media bias and related societal issues. Existing datasets have major flaws with respect to these requirements, so researchers repeatedly spend redundant and cumbersome effort on dataset creation. To fill this gap, we present POLUSA, a dataset that represents the online media landscape as perceived by an average US news consumer. The dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum. Each outlet is labeled by its political leaning, which we derive using a systematic aggregation of eight data sources. The news dataset is balanced with respect to publication date and outlet popularity. POLUSA enables studying a variety of subjects, e.g., media effects and political partisanship. Due to its size, the dataset allows the use of data-intensive deep learning methods.
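
The abstract does not say how the balancing by publication date and outlet popularity was carried out. As an illustration only, the following Python sketch shows one way such a balanced sample could be drawn: a fixed number of articles per month, split across outlets in proportion to a popularity weight. The function name `balanced_sample`, the record fields `month` and `outlet`, the outlet names, and the weights are all hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def balanced_sample(articles, popularity, per_month, seed=0):
    """Draw up to `per_month` articles for each month, split across outlets
    in proportion to `popularity` (hypothetical weights that sum to 1)."""
    rng = random.Random(seed)

    # Group articles by (month, outlet) so each cell can be sampled separately.
    buckets = defaultdict(list)
    for article in articles:
        buckets[(article["month"], article["outlet"])].append(article)

    months = sorted({month for month, _ in buckets})
    sample = []
    for month in months:
        for outlet, share in popularity.items():
            quota = round(per_month * share)
            pool = buckets.get((month, outlet), [])
            # If an outlet published fewer articles than its quota, take them all.
            sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Toy data: two months, two made-up outlets, 50 articles per outlet and month.
articles = [
    {"month": month, "outlet": outlet, "id": f"{outlet}-{month}-{i}"}
    for month in ("2017-01", "2017-02")
    for outlet in ("outlet_a", "outlet_b")
    for i in range(50)
]
popularity = {"outlet_a": 0.75, "outlet_b": 0.25}  # assumed weights
sample = balanced_sample(articles, popularity, per_month=20)
```

In this toy setup every month contributes the same number of articles, and the more popular outlet contributes three times as many as the less popular one. A real pipeline would also have to decide what to do when an outlet falls short of its quota in a given month; the sketch simply takes whatever is available.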


Updated: 2020-05-29
