Efficient Enumeration Algorithms for Regular Document Spanners
ACM Transactions on Database Systems (IF 1.8), Pub Date: 2020-02-08, DOI: 10.1145/3351451
Fernando Florenzano 1, Cristian Riveros 1, Martín Ugarte 2, Stijn Vansummeren 3, Domagoj Vrgoč 1

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data in variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner's result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
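
To make the notion of a spanner with one capture variable concrete, here is a minimal Python sketch of a naive baseline. It is not the algorithm from the paper: it simply tests every candidate span, so it takes at least quadratic time in the document length and offers no delay guarantee. The function name `naive_spans` and the half-open span convention are assumptions made for illustration only.

```python
import re
from typing import Iterator, Tuple

# A span is a half-open interval [start, end) of document positions
# (an illustrative convention, not necessarily the paper's notation).
Span = Tuple[int, int]


def naive_spans(pattern: str, document: str) -> Iterator[Span]:
    """Yield every non-empty span [i, j) whose content fully matches `pattern`.

    Brute-force baseline: it tries all O(n^2) candidate spans, unlike the
    linear-precomputation, output-linear-delay algorithm the abstract describes.
    """
    compiled = re.compile(pattern)
    n = len(document)
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            # fullmatch with pos/endpos checks only the slice document[i:j]
            if compiled.fullmatch(document, i, j):
                yield (i, j)


if __name__ == "__main__":
    # Every span of digits is reported, including overlapping ones.
    print(list(naive_spans(r"[0-9]+", "a12b")))
```

Unlike `re.findall`, which reports only non-overlapping matches, this sketch reports all matching spans, including overlapping ones. A document of n digits therefore already yields n(n+1)/2 spans for the pattern `[0-9]+`, which illustrates why outputs can grow large and why the delay between consecutive results matters.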

Updated: 2020-02-08