Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search -- Extended Technical Report
arXiv - CS - Databases, Pub Date: 2020-06-12, DOI: arxiv-2006.07064
Till Blume and Ansgar Scherp

Indexing the Web of Data offers many opportunities, in particular, to find and explore data sources. One major design decision when indexing the Web of Data is to find a suitable index model, i.e., how to index and summarize data. Various efforts have been conducted to develop specific index models for a given task. With each index model designed, implemented, and evaluated independently, it remains difficult to judge whether an approach generalizes well to another task, set of queries, or dataset. In this work, we empirically evaluate six representative index models with unique feature combinations. Among them is a new index model incorporating inferencing over RDFS and owl:sameAs. We implement all index models for the first time into a single, stream-based framework. We evaluate variations of the index models considering sub-graphs of size 0, 1, and 2 hops on two large, real-world datasets. We evaluate the quality of the indices regarding the compression ratio, summarization ratio, and F1-score denoting the approximation quality of the stream-based index computation. The experiments reveal huge variations in compression ratio, summarization ratio, and approximation quality for different index models, queries, and datasets. However, we observe meaningful correlations in the results that help to determine the right index model for a given task, type of query, and dataset.
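As a rough illustration of what a schema-level index and the F1-based approximation quality might look like, the following minimal sketch groups subjects by their set of rdf:type values (a 0-hop summary) and compares an approximate query result against an exact one. It is not the authors' framework: the function names, the toy triples, and the `elements_per_instance` ratio are assumptions made for illustration, and the abstract does not give the exact definitions of compression ratio or summarization ratio.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated predicate name, used here for readability


def build_type_index(triples):
    """Group subjects by their set of rdf:type objects (a 0-hop schema summary).

    Hypothetical helper: each key is one schema element (a type set),
    each value is the set of subjects summarized by that element.
    """
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    index = defaultdict(set)
    for s, type_set in types.items():
        index[frozenset(type_set)].add(s)
    return dict(index)


def f1_score(retrieved, relevant):
    """Standard F1: harmonic mean of precision and recall over two sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this example.
triples = [
    ("ex:a", RDF_TYPE, "ex:Person"),
    ("ex:a", RDF_TYPE, "ex:Author"),
    ("ex:b", RDF_TYPE, "ex:Person"),
    ("ex:b", RDF_TYPE, "ex:Author"),
    ("ex:c", RDF_TYPE, "ex:Paper"),
    ("ex:a", "ex:wrote", "ex:c"),
]

index = build_type_index(triples)

# Illustrative ratio only (schema elements per typed instance);
# not necessarily the summarization ratio as defined in the report.
instances = set().union(*index.values())
elements_per_instance = len(index) / len(instances)

# Approximation quality of a hypothetical approximate answer vs. the exact one.
exact = index[frozenset({"ex:Person", "ex:Author"})]
approximate = {"ex:a", "ex:b", "ex:c"}
quality = f1_score(approximate, exact)
```

Here two subjects share one type set, so three typed instances collapse into two schema elements. A lower ratio of this kind would indicate that more instances share the same schema structure.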

Updated: 2020-06-15