Softcite dataset: A dataset of software mentions in biomedical and economic research publications
Journal of the Association for Information Science and Technology (IF 2.8), Pub Date: 2021-02-02, DOI: 10.1002/asi.24454
Caifan Du 1, Johanna Cohoon 1, Patrice Lopez 2, James Howison 1

Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF-format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in the hope of encouraging more discussion about creating datasets for machine learning use.

Updated: 2021-02-02