Regular Expression Learning from Positive Examples Based on Integer Programming
International Journal of Software Engineering and Knowledge Engineering (IF 0.6), Pub Date: 2020-11-09, DOI: 10.1142/s0218194020400203
Juntao Gao, Yingqian Zhang

This paper presents a novel method for inferring regular expressions from positive examples. The method consists of a candidate construction phase and an optimization phase. In the candidate construction phase, we first propose multiscaling sample augmentation to capture cyclic patterns within single examples. We then use common substrings to build regular expressions that capture patterns across multiple examples, and we show that this algorithm is more general than those based on common prefixes or suffixes. We also propose a pruning mechanism that makes the mining of useful common substrings more efficient; this mining is an important part of the common-substring-based expression-building algorithm. Finally, in the optimization phase, we model the problem of choosing a lowest-cost set of regular expressions as an integer linear program, which can be solved to optimality. Experimental results on synthetic and real-life samples demonstrate the effectiveness of our approach in inferring concise and semantically meaningful regular expressions for string datasets.
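
The optimization phase can be illustrated with a small sketch. The abstract does not give the paper's actual objective or constraints, so everything below is an assumption made for illustration: the cost of a candidate is taken to be its character length, the only constraint is that every positive example is fully matched by at least one chosen candidate, and the function name `select_min_cost` is hypothetical. The tiny 0-1 program is solved by exhaustive enumeration rather than by an integer programming solver, which is only practical for a handful of candidates.

```python
import re
from itertools import product


def select_min_cost(candidates, examples, cost=len):
    """Pick a cheapest subset of candidate regexes that covers all examples.

    Illustrative 0-1 program (an assumption, not the paper's model):
        minimise    sum_j cost(r_j) * x_j
        subject to  sum_{j : r_j fully matches s_i} x_j >= 1  for every s_i
                    x_j in {0, 1}
    Solved here by brute force over all binary vectors x.
    """
    # covers[j] is the set of example indices that candidate j fully matches.
    covers = [
        {i for i, s in enumerate(examples) if re.fullmatch(r, s)}
        for r in candidates
    ]
    all_examples = set(range(len(examples)))

    best_cost, best_choice = None, None
    for x in product((0, 1), repeat=len(candidates)):
        covered = set()
        for j, chosen in enumerate(x):
            if chosen:
                covered |= covers[j]
        if covered != all_examples:
            continue  # violates a covering constraint
        total = sum(cost(r) for r, chosen in zip(candidates, x) if chosen)
        if best_cost is None or total < best_cost:
            best_cost, best_choice = total, x

    if best_choice is None:
        return None, []  # no subset covers every example
    return best_cost, [r for r, c in zip(candidates, best_choice) if c]


# Hypothetical positive examples and hand-written candidate expressions.
examples = ["ab12", "ab345", "cd7"]
candidates = [r"ab\d+", r"cd\d", r"[a-z]{2}\d+", r"ab12", r"ab345"]

best_cost, chosen = select_min_cost(candidates, examples)
print(best_cost, chosen)
```

Under this length-based cost, the pair `ab\d+` and `cd\d` (total cost 9) beats both the single broader expression `[a-z]{2}\d+` (cost 11) and the three literal strings (cost 13). A different cost function would change which trade-off between specificity and conciseness wins.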

Updated: 2020-11-09