

Combinatorial Algorithms for String Sanitization
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2020-12-07, DOI: 10.1145/3418683
Giulia Bernardini₁, Huiping Chen₂, Alessio Conte₃, Roberto Grossi₄, Grigorios Loukides₂, Nadia Pisanti₄, Solon P. Pissis₅, Giovanna Rosone₃, Michelle Sweering₆


String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by “reversing” the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching. 
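To make the first setting concrete, here is a minimal Python sketch. It rests on assumptions the abstract does not state: patterns are substrings of a fixed length `k`, the sensitive patterns are given as a set of strings, and a separator letter `#` that is not in the alphabet may be inserted. The function `naive_sanitize` is a hypothetical baseline and not TFS-ALGO: it cuts the string at every sensitive window and makes no attempt to minimize the output length. The helper `nonsensitive_sequence` checks that the order of appearance and frequency of non-sensitive patterns are unchanged.

```python
def kgrams(s, k):
    """All length-k substrings of s, in order of appearance."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def nonsensitive_sequence(s, k, sensitive, sep="#"):
    """Length-k substrings of s that are neither sensitive nor span a separator."""
    return [g for g in kgrams(s, k) if sep not in g and g not in sensitive]


def naive_sanitize(w, k, sensitive, sep="#"):
    """Hypothetical baseline (not TFS-ALGO): split w at sensitive windows.

    Maximal runs of consecutive non-sensitive windows are kept as blocks,
    and the blocks are joined with the separator letter.
    """
    blocks = []
    current = ""
    for g in kgrams(w, k):
        if g in sensitive:
            # A sensitive window ends the current block.
            if current:
                blocks.append(current)
            current = ""
        elif current:
            # Consecutive windows overlap in k-1 letters: add only the new one.
            current += g[-1]
        else:
            current = g
    if current:
        blocks.append(current)
    return sep.join(blocks)


w, k, sensitive = "aabaab", 3, {"aba"}
x = naive_sanitize(w, k, sensitive)
print(x)  # aab#baab
# No sensitive pattern occurs, and non-sensitive patterns keep order and frequency.
assert all(s not in x for s in sensitive)
assert nonsensitive_sequence(x, k, sensitive) == nonsensitive_sequence(w, k, sensitive)
```

The output here is longer than the input, which illustrates why constructing a minimal-length string with the same guarantees is the harder problem the article addresses.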
We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, and experimentally show that it is effective and efficient. We also show that TFS-ALGO is nearly as effective at minimizing the edit distance as ETFS-ALGO, while being substantially more efficient than ETFS-ALGO.
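The comparison above is stated in terms of edit distance between the original and the sanitized string. As a rough illustration, the sketch below computes the standard unit-cost Levenshtein distance with a row-by-row dynamic program. This is only a measuring tool under the assumption of unit costs for insertions, deletions, and substitutions; it is not ETFS-ALGO, and the example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical original string and a sanitized version of it.
print(levenshtein("aabaab", "aab#baab"))  # 2
```

A smaller distance means the sanitized string stays closer to the original, which is the utility measure the second setting optimizes.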


Last updated: 2020-12-07
