Abstract
Regular Expressions (REs) are ubiquitous in database and programming languages. Many applications use REs extended with interleaving (shuffle) and unordered concatenation operators, but these extensions increase the complexity of basic operations; in particular, they make membership NP-hard, which is unacceptable in most practical scenarios.
In this article, we study the problem of membership checking for a restricted class of these extended REs, called conflict-free REs, which are expressive enough to cover the vast majority of real-world applications. We present several polynomial-time algorithms for membership checking over conflict-free REs; they differ in the optimization techniques they adopt and in the operators they support. As a particular application, we generalize the approach to check membership of Extensible Markup Language (XML) trees in a class of EDTDs (Extended Document Type Definitions) that models the crucial aspects of DTDs (Document Type Definitions) and XSD (XML Schema Definition) schemas.
An extensive experimental analysis validates the efficiency of the presented membership checking techniques.
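To give an intuition of why restricting the expression class can make membership tractable, here is a minimal Python sketch for a toy fragment. The fragment is an assumption of this example, not the article's definition of conflict-free REs, and the code is not one of the article's algorithms: the expression is an interleaving of distinct symbols, each carrying an inclusive occurrence interval. In such a fragment the order of symbols is irrelevant, so a single pass that counts occurrences decides membership. The names `interleaving_member` and `bounds` are hypothetical.

```python
from collections import Counter


def interleaving_member(word, bounds):
    """Decide membership of `word` in a toy interleaving expression.

    `bounds` maps each symbol to an inclusive (min, max) occurrence
    interval; max may be None to mean "unbounded". Each symbol occurs
    once in the expression because it is a dictionary key.
    This is an illustrative fragment only (assumption of this sketch).
    """
    counts = Counter()
    for symbol in word:
        # A symbol that the expression never mentions cannot be matched.
        if symbol not in bounds:
            return False
        counts[symbol] += 1
        upper = bounds[symbol][1]
        # Fail early as soon as an upper bound is exceeded.
        if upper is not None and counts[symbol] > upper:
            return False
    # Every symbol must reach its lower bound; unseen symbols count as 0.
    return all(counts[s] >= lower for s, (lower, _) in bounds.items())


# Hypothetical expression: a[1..2] & b[0..1] & c[1..1]
expr = {"a": (1, 2), "b": (0, 1), "c": (1, 1)}
print(interleaving_member("cab", expr))   # True
print(interleaving_member("ab", expr))    # False: "c" is missing
print(interleaving_member("aaac", expr))  # False: too many "a"
```

The check runs in time linear in the length of the word plus the number of symbols in the expression. General interleaving of arbitrary subexpressions does not reduce to counting in this way, which is where the hardness mentioned in the abstract comes from.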
Index Terms
- Linear Time Membership in a Class of Regular Expressions with Counting, Interleaving, and Unordered Concatenation