NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)
arXiv - CS - Digital Libraries Pub Date : 2021-04-19 , DOI: arxiv-2104.09647
Alexander Spangher, Jonathan May

News article revision histories have the potential to give us novel insights across varied fields of linguistics and social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, NewsEdits. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories of any domain.
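
To make the added, changed, and removed sentence categories concrete, the following is a minimal sketch of how such counts could be computed for a single version pair. It is not the authors' pipeline: the function names `split_sentences` and `count_sentence_edits`, the naive punctuation-based sentence splitter, and the use of Python's `difflib.SequenceMatcher` for alignment are all assumptions made for illustration.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_sentence_edits(old_text, new_text):
    """Count added, removed, and changed sentences between two versions.

    Hypothetical helper for illustration; the alignment heuristic is
    difflib's, not the one used to build the dataset.
    """
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    counts = {"added": 0, "removed": 0, "changed": 0}
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            counts["added"] += j2 - j1
        elif tag == "delete":
            counts["removed"] += i2 - i1
        elif tag == "replace":
            # Pair replaced sentences one-to-one as "changed"; any surplus
            # on either side counts as added or removed.
            paired = min(i2 - i1, j2 - j1)
            counts["changed"] += paired
            counts["removed"] += (i2 - i1) - paired
            counts["added"] += (j2 - j1) - paired
    return counts


if __name__ == "__main__":
    old = "The mayor spoke on Monday. Police closed the road. No injuries were reported."
    new = "Breaking news update. The mayor spoke on Monday. Police closed the main road."
    print(count_sentence_edits(old, new))
```

Summing such per-pair counts over every consecutive version pair of every article would give corpus-level totals of the kind reported above, although the exact figures depend on the sentence splitter and the alignment rule chosen.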

Updated: 2021-04-21