A deterministic parsing algorithm for ambiguous regular expressions

  • Original Article
Acta Informatica

Abstract

We introduce a new parser generator, called the Berry–Sethi Parser (BSP), for ambiguous regular expressions (REs). The generator constructs a deterministic finite-state transducer that recognizes an input string, as the classical Berry–Sethi algorithm does, and additionally outputs a linear representation of all the syntax trees of the string; for infinitely ambiguous strings, a policy for selecting representative sets of trees is chosen. To construct the transducer, the RE symbols, including letters, parentheses and other metasymbols, are distinctly numbered, so that the corresponding language becomes locally testable. In this way a deterministic position automaton can be constructed, which recognizes the input and translates it into a compact DAG representation of the syntax trees. The correctness of the construction is proved. The transducer runs in time linear in the length of the input. Its descriptive complexity is analyzed as a function of established RE parameters: the alphabetic width, the number of null string symbols and the height of the RE tree. A condition for checking RE ambiguity on the transducer graph is stated. Experimental results of running the parser generator and the parser on a large RE collection are presented. The POSIX RE disambiguation criterion has also been applied to the parser.
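As background for the abstract, the following is a minimal sketch of the classical Berry–Sethi recognizer that BSP builds on, not of BSP itself: it numbers only the letter occurrences, whereas BSP also numbers parentheses and metasymbols and emits syntax trees, which is not shown here. The tuple encoding of the RE, the end-marker string `-|`, and the names `analyze`, `build_dfa` and `accepts` are assumptions made for this illustration.

```python
END = "-|"  # assumed end-of-text marker; two characters, so no input letter equals it


def analyze(node, letter_at, follow):
    """Return (nullable, first, last) for node, numbering letters left to right.

    letter_at maps a position to its letter; follow maps a position to the
    set of positions that may come immediately after it.
    """
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "sym":
        pos = len(letter_at) + 1  # next unused position number
        letter_at[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind == "star":
        _, first, last = analyze(node[1], letter_at, follow)
        for p in last:  # iteration: a last position may be followed by a first one
            follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyze(node[1], letter_at, follow)
    n2, f2, l2 = analyze(node[2], letter_at, follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        for p in l1:  # concatenation: last of the left part precedes first of the right
            follow[p] |= f2
        return n1 and n2, ((f1 | f2) if n1 else f1), ((l1 | l2) if n2 else l2)
    raise ValueError(f"unknown node kind: {kind}")


def build_dfa(re):
    """Build a deterministic automaton whose states are sets of positions."""
    letter_at, follow = {}, {}
    _, first, _ = analyze(("cat", re, ("sym", END)), letter_at, follow)
    start = frozenset(first)
    delta, seen, todo = {}, {start}, [start]
    while todo:
        state = todo.pop()
        for a in {letter_at[p] for p in state} - {END}:
            # Successor state: union of the follow sets of the positions labelled a.
            target = frozenset().union(
                *(follow[p] for p in state if letter_at[p] == a))
            delta[state, a] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, delta, letter_at


def accepts(dfa, text):
    """Run the automaton; accept if the end-marker position is reachable."""
    state, delta, letter_at = dfa
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return any(letter_at[p] == END for p in state)


# (a | ab)(b | eps) is ambiguous: "ab" parses as a.b and as ab.eps.
RE = ("cat",
      ("alt", ("sym", "a"), ("cat", ("sym", "a"), ("sym", "b"))),
      ("alt", ("sym", "b"), ("eps",)))
dfa = build_dfa(RE)
print([w for w in ["", "a", "b", "ab", "abb", "abbb"] if accepts(dfa, w)])
# prints ['a', 'ab', 'abb']
```

The recognizer is deterministic even though the RE is ambiguous: after reading `a` it sits in a single state that holds the positions of both readings, but it keeps no record of which tree each position belongs to. Recovering the trees is the part that the numbered parentheses and metasymbols of BSP are meant to supply.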

Notes

  1. The code is available at https://github.com/FLC-project/BSP together with the input data used for the experiments.

  2. The benchmark and generator codes are available at https://github.com/FLC-project/BSP.

  3. On an AMD Athlon 64 X2 4200+ computer clocked at 2.2 GHz, running Windows 10.

  4. Since RE2 outputs one tree and is coded in C++, to offset the difference due to the programming language we implemented a version of BSP, also coded in C++, that uses POSIX disambiguation to select one tree; some experimental results are available at https://github.com/FLC-project/BSP. A systematic experimental comparison of existing RE parsing algorithms would be interesting, but it requires more research and presents practical difficulties. Only a few published algorithms come with well-engineered, available programs, and such programs may be coded in different languages. Moreover, the parsing process may return incomparable information on the syntax trees. Lastly, such research must face the problem of choosing an unbiased collection of REs as a benchmark.

References

  1. Aaraj, N., Raghunathan, A., Jha, N.K.: Dynamic binary instrumentation-based framework for malware defense. In: Zamboni, D. (ed.) DIMVA, LNCS, vol. 5137, pp. 64–87. Springer (2008)

  2. Allauzen, C., Mohri, M.: A unified construction of the Glushkov, Follow, and Antimirov automata. In: Kralovic, R., Urzyczyn, P. (eds.) MFCS, LNCS, vol. 4162, pp. 110–121. Springer (2006)

  3. Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theor. Comput. Sci. 48(1), 117–126 (1986)

  4. Berstel, J., Pin, J.E.: Local languages and the Berry–Sethi algorithm. Theor. Comput. Sci. 155(2), 439–446 (1996)

  5. Bille, P., Gørtz, I.L.: From regular expression matching to parsing. In: Rossmanith, P., Heggernes, P., Katoen, J. (eds.) MFCS, LIPIcs, vol. 138, pp. 71:1–71:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

  6. Book, R., Even, S., Greibach, S., Ott, G.: Ambiguity in graphs and expressions. IEEE Trans. Comput. C–20(2), 149–153 (1971)

  7. Borsotti, A., Breveglieri, L., Crespi Reghizzi, S., Morzenti, A.: From ambiguous regular expressions to deterministic parsing automata. In: Drewes, F. (ed.) CIAA, LNCS, vol. 9223, pp. 35–48. Springer (2015)

  8. Borsotti, A., Breveglieri, L., Crespi Reghizzi, S., Morzenti, A.: A benchmark production tool for regular expressions. In: Hospodár, M., Jirásková, G. (eds.) CIAA, LNCS, vol. 11601, pp. 95–107. Springer (2019)

  9. Crespi Reghizzi, S., Breveglieri, L., Morzenti, A.: Formal Languages and Compilation. Texts in Computer Science, 3rd edn. Springer, Berlin (2019)

  10. Dubé, D., Feeley, M.: Efficiently building a parse tree from a regular expression. Acta Inf. 37(2), 121–144 (2000)

  11. Frisch, A., Cardelli, L.: Greedy regular expression matching. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP, LNCS, vol. 3142, pp. 618–629. Springer (2004)

  12. Grathwohl, N., Henglein, F., Nielsen, L., Rasmussen, U.: Two-pass greedy regular expression parsing. In: Konstantinidis, S. (ed.) CIAA, LNCS, vol. 7982, pp. 60–71. Springer (2013)

  13. Gruber, H., Holzer, M.: From finite automata to regular expressions and back—a summary on descriptional complexity. Int. J. Found. Comput. Sci. 26(8), 1009–1040 (2015)

  14. Haber, S., Horne, W., Manadhata, P., Mowbray, M., Rao, P.: Efficient submatch extraction for practical regular expressions. In: Dediu, A.H., Martín-Vide, C., Truthe, B. (eds.) LATA, LNCS, vol. 7810, pp. 323–334. Springer (2013)

  15. IEEE: std. 1003.2, POSIX, regular expression notation, section 2.8 (1992)

  16. Kearns, S.: Extending regular expressions with context operators and parse extraction. Softw. Pract. Exp. 21(8), 787–804 (1991)

  17. Laurikari, V.: NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions. In: de la Fuente, P. (ed.) SPIRE, pp. 181–187. IEEE Computer Society (2000)

  18. McNaughton, R., Papert, S.: Counter-Free Automata. MIT Press, Cambridge (1971)

  19. Nielsen, L., Henglein, F.: Bit-coded regular expression parsing. In: Dediu, A.H., Inenaga, S., Martín-Vide, C. (eds.) LATA, LNCS, vol. 6638, pp. 402–413. Springer (2011)

  20. Okui, S., Suzuki, T.: Disambiguation in regular expression matching via position automata with augmented transitions. In: Domaratzki, M., Salomaa, K. (eds.) CIAA, LNCS, vol. 6482, pp. 231–240. Springer (2010)

  21. Schwarz, N., Karper, A., Nierstrasz, O.: Efficiently extracting full parse trees using regular expressions with capture groups. PeerJ PrePrints 3, e1248 (2015)

  22. Sulzmann, M., Lu, K.Z.M.: POSIX regular expression parsing with derivatives. In: Codish, M., Sumii, E. (eds.) FLOPS, LNCS, vol. 8475, pp. 203–220. Springer (2014)

  23. Sulzmann, M., Lu, K.Z.M.: Derivative-based diagnosis of regular expression ambiguity. Int. J. Found. Comput. Sci. 28(5), 543–562 (2017)

  24. Watson, B.: A taxonomy of finite automata construction algorithms. Technical Report, Computing Science Notes, Technische Univ. Eindhoven (1993)

Acknowledgements

We thank the anonymous reviewers for their valuable suggestions and references.

Author information

Corresponding author

Correspondence to Luca Breveglieri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of the first part of this work is in [7].

About this article

Cite this article

Borsotti, A., Breveglieri, L., Crespi Reghizzi, S. et al. A deterministic parsing algorithm for ambiguous regular expressions. Acta Informatica 58, 195–229 (2021). https://doi.org/10.1007/s00236-020-00366-7

  • DOI: https://doi.org/10.1007/s00236-020-00366-7
