Graph integration of structured, semistructured and unstructured data for data journalism, Information Systems



Information Systems (IF 3.7), Pub Date: 2021-07-06, DOI: 10.1016/j.is.2021.101846
Angelos Christos Anadiotis¹,², Oana Balalau³, Catarina Conceição⁴, Helena Galhardas⁴, Mhd Yamen Haddad³, Ioana Manolescu³, Tayeb Merabti³, Jingmao You³


Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) to semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources.

We describe a complete approach for integrating dynamic sets of heterogeneous datasets, of the kinds described above, into graphs: the challenges we faced in making such graphs useful and in allowing their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.


Updated: 2021-07-06
