Efficient POSIX submatch extraction on nondeterministic finite automata, Software: Practice and Experience



Software: Practice and Experience (IF 2.6), Pub Date: 2020-10-18, DOI: 10.1002/spe.2881
Angelo Borsotti, Ulya Trofimovich


In this paper we study the performance of POSIX submatch extraction algorithms based on nondeterministic finite automata (NFA). We propose an algorithm that combines Laurikari tagged NFA and extended Okui-Suzuki disambiguation. The algorithm works in worst-case O(n m² t) time and O(m²) space (including preprocessing), where n is the length of input, m is the size of the regular expression with bounded repetition expanded, and t is the number of capturing groups and subexpressions that contain them. On real-world benchmarks our algorithm performs close to the O(n m t) complexity of leftmost-greedy matching, although on artificial benchmarks it can be significantly slower. We propose a lazy version of the algorithm that runs much faster, but requires O(n m²) space. We show that the Kuklewicz algorithm is slower in practice, and the backward matching algorithm proposed by Cox is incorrect.
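The abstract contrasts POSIX submatch extraction with leftmost-greedy matching. A minimal sketch of how the two disambiguation rules can differ on the same input is given below. It is not the algorithm from the paper: `greedy_groups` relies on Python's `re` module (which is leftmost-greedy), while `posix_groups` is a hypothetical brute-force helper that only handles a flat concatenation of groups and runs in exponential time. The pattern and input are illustrative assumptions.

```python
import re


def greedy_groups(parts, text):
    """Leftmost-greedy (Perl-style) submatches, as computed by Python's re."""
    pattern = "".join(f"({p})" for p in parts)
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None


def posix_groups(parts, text):
    """Brute-force POSIX-style submatches for a flat concatenation of groups.

    Illustration only (assumption: no nested groups). Earlier groups get
    priority and each takes the longest piece that still allows the rest
    of the text to match.
    """
    def splits(i, start):
        if i == len(parts):
            if start == len(text):
                yield ()
            return
        # Longer pieces are tried first, so the first complete split found
        # is the one that maximizes earlier groups.
        for end in range(len(text), start - 1, -1):
            if re.fullmatch(parts[i], text[start:end]):
                for rest in splits(i + 1, end):
                    yield (text[start:end],) + rest

    return next(splits(0, 0), None)


parts = ["a|ab", "c|bcd", "d*"]
print(greedy_groups(parts, "abcd"))  # ('a', 'bcd', '')
print(posix_groups(parts, "abcd"))   # ('ab', 'c', 'd')
```

Both calls match the whole input, but they assign different substrings to the groups: the greedy matcher commits to the first alternative that works, whereas the POSIX rule prefers the longest match for the leftmost subexpression.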


Updated: 2020-10-18
