Graph-based keyword search in heterogeneous data sources,arXiv - CS - Databases


Graph-based keyword search in heterogeneous data sources
arXiv - CS - Databases, Pub Date: 2020-09-09, DOI: arxiv-2009.04283
Mhd Yamen Haddad (CEDAR), Angelos Anadiotis (CEDAR), Ioana Manolescu (CEDAR)

Data journalism is the field of investigative journalism which focuses on digital data by treating them as first-class citizens. Following the trends in human activity, which leaves strong digital traces, data journalism becomes increasingly important. However, as the number and the diversity of data sources increase, heterogeneous data models with different structure, or even no structure at all, need to be considered in query answering. Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel query algorithm for exploiting such heterogeneous corpora through keyword search. We model our underlying data as graphs and, given a set of search terms, our algorithm finds links between them within and across the heterogeneous datasets included in the graph. We draw inspiration from prior work on keyword search in structured and unstructured data, which we extend with the data heterogeneity dimension, which makes the keyword search problem computationally harder. We implement our algorithm and we evaluate its performance using synthetic and real-world datasets.
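
To make the idea of "finding links between search terms in a graph" concrete, here is a minimal sketch in Python. It is not the algorithm of the paper, whose details the abstract does not give. It is a simplified backward expansion: a breadth-first search starts from the nodes matching each keyword, and the paths that meet at a common node are joined. The function name `connect_keywords`, the toy node identifiers and labels, and the choice of undirected, unweighted edges are all assumptions made for illustration.

```python
from collections import deque


def connect_keywords(edges, labels, keywords):
    """Return a connected set of edges linking one match per keyword, or None.

    Illustrative only: assumes every edge endpoint is a key of `labels`,
    treats edges as undirected, and matches keywords by substring.
    """
    # Undirected adjacency list: links are followed in both directions.
    adj = {node: set() for node in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    searches = []
    for kw in keywords:
        starts = [n for n, text in labels.items() if kw.lower() in text.lower()]
        if not starts:
            return None  # this keyword matches no node
        dist = {n: 0 for n in starts}
        parent = {n: None for n in starts}
        queue = deque(starts)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
        searches.append((dist, parent))

    # Candidate meeting points are the nodes reached from every keyword.
    common = set(labels)
    for dist, _ in searches:
        common.intersection_update(dist)
    if not common:
        return None
    # Pick the node with the smallest total distance (ties broken by name).
    root = min(common, key=lambda n: (sum(d[n] for d, _ in searches), str(n)))

    # Follow parent pointers from the meeting point back to each match.
    result = set()
    for _, parent in searches:
        node = root
        while parent[node] is not None:
            result.add(tuple(sorted((node, parent[node]))))
            node = parent[node]
    return result


# Hypothetical toy graph: nodes could come from different datasets.
labels = {"a": "Alice", "c": "ACME", "d": "contract", "b": "Bob",
          "x": "unrelated", "y": "Yves"}
edges = [("a", "c"), ("c", "d"), ("d", "b"), ("x", "a")]
links = connect_keywords(edges, labels, ["alice", "bob"])
```

On this toy graph, `links` holds the three edges on the path from "Alice" to "Bob"; a keyword with no match, or one whose matches lie in a disconnected part of the graph, yields `None`. A real system would also need to rank alternative answers and bound the search, which this sketch ignores.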


Updated: 2020-09-10
