Abstract
One of the classical tasks in information extraction is to extract subparts of texts through regular expressions. In the database theory literature, this approach has been generalized and formalized as document spanners. In this model, extraction is performed by evaluating a particular kind of automaton, called a sequential variable-set automaton (VA). The efficiency of this task is then measured in the context of enumeration algorithms: we first run a preprocessing phase that computes a compact representation of the answers, and we then produce the results one after the other with a short time between consecutive answers, called the delay of the enumeration. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., a constant delay that does not depend on the document. We present such an algorithm for a variant of VAs called extended sequential VAs and give an experimental evaluation of this algorithm.
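To make the two-phase enumeration framework concrete, here is a minimal Python sketch for a toy extraction task: return every span of the document that consists only of ASCII digits. It is not the algorithm for extended sequential VAs presented in the paper. The function names `preprocess` and `enumerate_spans` and the half-open span convention are assumptions made for illustration. Preprocessing takes time linear in the document, and each answer is then produced in a constant number of steps after the previous one.

```python
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) over the document


def is_ascii_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def preprocess(doc: str) -> Tuple[List[int], List[int]]:
    """Linear-time preprocessing (hypothetical helper, toy task only).

    Returns the list of positions holding a digit, and for every position i
    the end (exclusive) of the digit run that starts at i.
    """
    n = len(doc)
    run_end = [0] * (n + 1)
    run_end[n] = n
    # Scan right to left so each run end is known before it is needed.
    for i in range(n - 1, -1, -1):
        if not is_ascii_digit(doc[i]):
            run_end[i] = i  # empty run
        elif i + 1 < n and is_ascii_digit(doc[i + 1]):
            run_end[i] = run_end[i + 1]  # extend the run to the right
        else:
            run_end[i] = i + 1  # run of length one
    starts = [i for i in range(n) if is_ascii_digit(doc[i])]
    return starts, run_end


def enumerate_spans(starts: List[int], run_end: List[int]) -> Iterator[Span]:
    """Enumeration phase: every loop iteration yields an answer,
    so the work between two consecutive answers is constant."""
    for i in starts:
        for j in range(i + 1, run_end[i] + 1):
            yield (i, j)


if __name__ == "__main__":
    index = preprocess("a12b3")
    for span in enumerate_spans(*index):
        print(span)
```

The precomputed `starts` list is what avoids scanning over non-digit positions during enumeration, which would otherwise make the delay depend on the document.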