The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History
arXiv - CS - Software Engineering, Pub Date: 2020-11-16, DOI: arxiv-2011.07824
Antoine Pietri (DGD-I), Diomidis Spinellis (AUEB), Stefano Zacchiroli (UP, DGD-I)

Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation that links together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and makes it possible to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as often happens.
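
The uniform representation described above can be pictured as a directed graph whose nodes are files, directories, commits, and snapshots. The following is a minimal sketch under stated assumptions, not the dataset's actual schema or API: the class `ArchiveGraph`, the kind labels, the node identifiers, and the edge directions (snapshot to commit, commit to parent commit and root directory, directory to entries) are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Node kinds named in the abstract; the string labels are illustrative only.
KINDS = {"file", "directory", "commit", "snapshot"}


class ArchiveGraph:
    """Toy in-memory graph; NOT the real Software Heritage schema."""

    def __init__(self):
        self.kind = {}                    # node id -> kind label
        self.children = defaultdict(set)  # node id -> ids it points to

    def add_node(self, node_id, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kind[node_id] = kind

    def add_edge(self, src, dst):
        if src not in self.kind or dst not in self.kind:
            raise KeyError("both endpoints must be added first")
        self.children[src].add(dst)

    def reachable_files(self, start):
        """Return the set of file nodes reachable from `start` (iterative DFS)."""
        seen, stack, files = {start}, [start], set()
        while stack:
            node = stack.pop()
            if self.kind[node] == "file":
                files.add(node)
            for child in self.children[node]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return files


def build_example():
    """Build a tiny hypothetical archive: one snapshot, two commits, two files."""
    g = ArchiveGraph()
    for node_id, kind in [
        ("snap1", "snapshot"),
        ("c1", "commit"), ("c2", "commit"),
        ("d1", "directory"), ("d2", "directory"),
        ("f1", "file"), ("f2", "file"),
    ]:
        g.add_node(node_id, kind)
    g.add_edge("snap1", "c2")  # snapshot points at the latest commit
    g.add_edge("c2", "c1")     # commit points at its parent commit
    g.add_edge("c1", "d1")     # each commit points at a root directory
    g.add_edge("c2", "d2")
    g.add_edge("d1", "f1")
    g.add_edge("d2", "f1")     # an unchanged file is a single shared node
    g.add_edge("d2", "f2")
    return g
```

In this sketch a file that is identical across two commits appears once and is referenced from both directories, which is one way to read the abstract's count of unique files; calling `build_example().reachable_files("snap1")` walks from the snapshot through both commits to the files they contain.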

Updated: 2020-11-17