Google Dataset Search by the Numbers
arXiv - CS - Databases, Pub Date: 2020-06-12, DOI: arxiv-2006.06894
Omar Benjelloun, Shiyu Chen, and Natasha Noy

Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata -- expressed using schema.org and similar vocabularies -- from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.

