Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

arXiv - CS - Digital Libraries. Pub Date: 2020-07-14, DOI: arxiv-2007.07022
Harshdeep Singh, Robert West, Giovanni Colavizza

Wikipedia's contents are based on reliable, published sources. To date, little is known about which sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020 and classified as citing books, journal articles, or Web content. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further labeled an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI. Scientific articles cited from Wikipedia correspond to 3.5% of all articles with a DOI currently indexed in the Web of Science. We release all our code to allow the community to build on our work and update the dataset in the future.
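To make the idea of "citations with known identifiers" concrete, here is a minimal sketch of pulling DOI, PMID, PMC, and ISBN values out of a single wikitext citation template. It is an illustration only and not the authors' extraction pipeline: the regular expressions, the `ID_PATTERNS` table, and the `extract_identifiers` helper are all assumptions made for this example, and the sample citation is invented.

```python
import re

# Hypothetical patterns for illustration; these are NOT the paper's rules.
# Each one looks for a "|name=value" parameter inside a citation template.
ID_PATTERNS = {
    "doi": re.compile(r"\|\s*doi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE),
    "pmid": re.compile(r"\|\s*pmid\s*=\s*(\d+)", re.IGNORECASE),
    "pmc": re.compile(r"\|\s*pmc\s*=\s*(?:PMC)?(\d+)", re.IGNORECASE),
    "isbn": re.compile(r"\|\s*isbn\s*=\s*([\dXx][\dXx -]{8,16})", re.IGNORECASE),
}


def extract_identifiers(citation: str) -> dict:
    """Return the identifiers found in one citation template string."""
    found = {}
    for name, pattern in ID_PATTERNS.items():
        match = pattern.search(citation)
        if match:
            # Strip padding that may precede the next "|" separator.
            found[name] = match.group(1).strip()
    return found


# Invented sample citation, used only to exercise the helper.
sample = (
    "{{cite journal |last=Doe |title=Example |journal=J. Ex. "
    "|doi=10.1000/xyz123 |pmid=12345678 |pmc=PMC7654321}}"
)
print(extract_identifiers(sample))
# {'doi': '10.1000/xyz123', 'pmid': '12345678', 'pmc': '7654321'}
```

A real pipeline would also need to handle free-text references, nested templates, and malformed identifiers, which this sketch deliberately ignores.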

Updated: 2020-07-15
