Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors:
Antoine Amarilli (LTCI, Télécom Paris, Institut Polytechnique de Paris)
Pierre Bourhis (CNRS, CRIStAL, UMR 9189 & Inria Lille)
Stefan Mengel (CRIL, CNRS & Univ Artois)
Matthias Niewerth (University of Bayreuth)

ACM SIGMOD Record, Volume 49, Issue 1, March 2020, pp. 25–32. https://doi.org/10.1145/3422648.3422655

Published: 04 September 2020

Abstract

One of the classical tasks in information extraction is to extract subparts of texts through regular expressions. In the database theory literature, this approach has been generalized and formalized as document spanners. In this model, extraction is performed by evaluating a particular kind of automaton, called a sequential variable-set automaton (VA). The efficiency of this task is then measured in the context of enumeration algorithms: we first run a preprocessing phase that computes a compact representation of the answers, and we then produce the results one after the other with a short time between consecutive answers, called the delay of the enumeration. Our goal is an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., a constant delay that does not depend on the document. We present such an algorithm for a variant of VAs called extended sequential VAs and give an experimental evaluation of this algorithm.
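
The two-phase scheme described in the abstract (preprocessing, then enumeration) can be illustrated with a small Python sketch. This is not the algorithm of the paper: it is a naive illustration whose delay grows with the document length rather than being constant, and it uses recursion, so it only suits short documents. The encoding of the automaton is an assumption made for this example: `letter` maps (state, character) to a set of states, `marker` maps (state, set of markers) to a set of states, and at most one marker set is taken per position, as in extended VAs. All function and variable names are hypothetical.

```python
from collections import defaultdict

def preprocess(doc, letter, marker, finals):
    """Backward pass. ready[i]: states that can accept from position i without
    taking a marker set at i. alive[i]: same, but a marker set at i is allowed."""
    n = len(doc)
    ready = [set() for _ in range(n + 1)]
    alive = [set() for _ in range(n + 1)]
    for i in range(n, -1, -1):
        if i == n:
            ready[i] = set(finals)
        else:
            ready[i] = {q for (q, a), targets in letter.items()
                        if a == doc[i] and targets & alive[i + 1]}
        alive[i] = ready[i] | {q for (q, _), targets in marker.items()
                               if targets & ready[i]}
    return ready, alive

def enumerate_mappings(doc, letter, marker, initial, ready, alive):
    """Enumeration phase. Tracks a SET of useful states, so two runs of a
    nondeterministic automaton that pick the same markers give one output."""
    n = len(doc)

    def walk(i, states, chosen):
        # Group the possible moves at position i by the marker set they take.
        options = defaultdict(set)
        options[None] |= states & ready[i]          # take no marker at i
        for (p, markers), targets in marker.items():
            if p in states:
                options[markers] |= targets & ready[i]
        for markers, after in options.items():
            if not after:
                continue                            # dead branch, pruned
            picked = chosen if markers is None else chosen + [(i, markers)]
            if i == n:
                yield picked
            else:
                nxt = set()
                for q in after:
                    nxt |= letter.get((q, doc[i]), set()) & alive[i + 1]
                yield from walk(i + 1, nxt, picked)

    start = {initial} & alive[0]
    if start:
        yield from walk(0, start, [])

def to_spans(picked):
    """Turn [(position, markers), ...] into {variable: (start, end)}.
    Assumes a sequential automaton: every opened variable is also closed."""
    start, end = {}, {}
    for pos, markers in picked:
        for var, kind in markers:
            (start if kind == "open" else end)[var] = pos
    return {var: (start[var], end[var]) for var in start}

# Toy automaton (an assumption for this example): capture one letter 'a' as x.
OPEN, CLOSE = frozenset({("x", "open")}), frozenset({("x", "close")})
letter = {(0, "a"): {0}, (0, "b"): {0}, (1, "a"): {2},
          (3, "a"): {3}, (3, "b"): {3}}
marker = {(0, OPEN): {1}, (2, CLOSE): {3}}

doc = "aba"
ready, alive = preprocess(doc, letter, marker, finals={3})
results = [to_spans(m) for m in enumerate_mappings(doc, letter, marker, 0, ready, alive)]
print(results)
```

Because the backward pass discards states that cannot reach acceptance, every branch explored during enumeration leads to at least one answer, which is the property that bounds the time between consecutive outputs. Making that bound independent of the document requires further ideas that this sketch does not attempt.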

Published in:
ACM SIGMOD Record, Volume 49, Issue 1, March 2020, 72 pages
ISSN: 0163-5808
DOI: 10.1145/3422648

Editors:
Rada Chirkova (North Carolina State University)
Vanessa Braganholo (Universidade Federal Fluminense)
Wim Martens (University of Bayreuth)
Divesh Srivastava (ATT research)
Pinar Tözü (IBM Almaden Research Center)
Marianne Winslett (University of Illinois)
Jun Yang (Duke University)
Azza Abouzied (NYU)
Lyublena Antova (Datometry)
Aaron J. Elmore (University of Chicago)
Kyriakos Mouratidis (Singapore Management University)
Dan Olteanu (University of Oxford)
Immanuel Trummer (Cornell University)
Yannis Velegrakis (Utrecht University)

Copyright © 2020 Copyright is held by the owner/author(s)
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher:
Association for Computing Machinery, New York, NY, United States