Detecting Opportunities for Differential Maintenance of Extracted Views
arXiv - CS - Databases, Pub Date: 2020-07-04, DOI: arxiv-2007.01973
Besat Kassaie and Frank Wm. Tompa

Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that can be applied to a wide range of extraction programs. In this work, we focus on extracted view maintenance, a problem that is well-motivated and thoroughly addressed in the relational setting. In particular, we formalize and address the problem of keeping extracted relations consistent with source documents that can be arbitrarily updated. We formally characterize three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect to a given extractor. Finally, we propose algorithms to detect pseudo-irrelevant document updates with respect to extractors that are expressed as document spanners, a model of information extraction inspired by SystemT.

Updated: 2020-07-07