Graph integration of structured, semistructured and unstructured data for data journalism
Information Systems (IF 3.7), Pub Date: 2021-07-06, DOI: 10.1016/j.is.2021.101846
Angelos Christos Anadiotis 1,2, Oana Balalau 3, Catarina Conceição 4, Helena Galhardas 4, Mhd Yamen Haddad 3, Ioana Manolescu 3, Tayeb Merabti 3, Jingmao You 3

Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) to semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources.

We describe a complete approach for integrating dynamic sets of heterogeneous datasets of the kinds described above into graphs: the challenges we faced in making such graphs useful and in allowing their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.




Updated: 2021-07-06