research-article

Constant-Delay Enumeration for Nondeterministic Document Spanners

Published: 04 September 2020

Abstract

One of the classical tasks in information extraction is to extract subparts of texts through regular expressions. In the database theory literature, this approach has been generalized and formalized as document spanners. In this model, extraction is performed by evaluating a particular kind of automaton, called a sequential variable-set automaton (VA). The efficiency of this task is then measured in the context of enumeration algorithms: we first run a preprocessing phase that computes a compact representation of the answers, and we then produce the results one after the other with a short time between consecutive answers, called the delay of the enumeration. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., a constant delay that does not depend on the document. We present such an algorithm for a variant of VAs called extended sequential VAs and give an experimental evaluation of this algorithm.
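
To make the two-phase scheme concrete, here is a minimal toy sketch in Python. It is not the algorithm of the article and does not use VAs: it assumes a fixed, hand-coded extraction task (all pairs of spans where the first covers an `a` and the second covers a later `b`), and the names `preprocess`, `enumerate_answers`, and `next_b` are hypothetical. The point it illustrates is that the number of answers can be quadratic in the document, yet a linear-time preprocessing phase is enough to then output them with a delay that does not depend on the document.

```python
from typing import Iterator, List, Tuple

Span = Tuple[int, int]            # half-open interval [start, end)
Index = Tuple[List[int], List[int]]


def preprocess(doc: str) -> Index:
    """Linear-time preprocessing (toy task: an 'a' followed later by a 'b')."""
    n = len(doc)
    # next_b[i] = smallest j >= i with doc[j] == 'b', or n if there is none.
    next_b = [n] * (n + 1)
    for i in range(n - 1, -1, -1):
        next_b[i] = i if doc[i] == "b" else next_b[i + 1]
    # Keep only the 'a' positions that have some later 'b'. Pruning these
    # dead ends now is what keeps the delay constant during enumeration.
    a_starts = [i for i in range(n) if doc[i] == "a" and next_b[i + 1] < n]
    return a_starts, next_b


def enumerate_answers(doc: str, index: Index) -> Iterator[Tuple[Span, Span]]:
    """Yield every answer; each loop step does O(1) work and outputs one answer."""
    a_starts, next_b = index
    n = len(doc)
    for i in a_starts:
        j = next_b[i + 1]         # guaranteed < n by the pruning above
        while j < n:
            yield (i, i + 1), (j, j + 1)
            j = next_b[j + 1]     # jump directly to the next 'b'


if __name__ == "__main__":
    doc = "abab"
    for x, y in enumerate_answers(doc, preprocess(doc)):
        print(x, y)
```

The design choice worth noting is the pruning of `a_starts`: without it, the enumeration phase could scan long stretches of the document without producing any answer, and the delay would depend on the document size.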



  • Published in

    ACM SIGMOD Record, Volume 49, Issue 1
    March 2020
    72 pages
    ISSN: 0163-5808
    DOI: 10.1145/3422648

    Copyright © 2020 Copyright is held by the owner/author(s)

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

