Combinatorial Algorithms for String Sanitization

Authors:
Giulia Bernardini, University of Milano - Bicocca and CWI, Milano, Italy
Huiping Chen, King’s College London, London, UK
Alessio Conte, University of Pisa, Largo Pontecorvo, Pisa, Italy
Roberto Grossi, University of Pisa and ERABLE Team, Pisa, Italy
Grigorios Loukides, King’s College London, London, UK
Nadia Pisanti, University of Pisa and ERABLE Team, Pisa, Italy
Solon P. Pissis, CWI, Vrije Universiteit Amsterdam, and ERABLE Team, Amsterdam, Netherlands
Giovanna Rosone, University of Pisa, Largo Pontecorvo, Pisa, Italy
Michelle Sweering, CWI, Amsterdam, Netherlands

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 1, Article No. 8, pp. 1–34. https://doi.org/10.1145/3418683

Published: 07 December 2020

Abstract

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks.

In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by “reversing” the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.
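To make the first setting concrete, the following is a minimal sketch of the idea of concatenating non-sensitive parts of the input. It is a naive baseline and not TFS-ALGO or PFS-ALGO as defined in the article: it makes no attempt to minimize length. The fixed pattern length `k`, the separator letter `#`, and the names `naive_sanitize` and `clean_kmers` are assumptions introduced here for illustration only.

```python
from typing import Iterable, List


def naive_sanitize(w: str, k: int, sensitive: Iterable[str], sep: str = "#") -> str:
    """Join maximal runs of non-sensitive length-k windows of w with sep.

    Assumption: patterns are length-k substrings and sep does not occur in w.
    """
    sensitive = set(sensitive)
    pieces: List[str] = []
    run_start = None  # start index of the current run of non-sensitive windows
    for i in range(len(w) - k + 1):
        if w[i:i + k] in sensitive:
            if run_start is not None:
                # The last non-sensitive window starts at i - 1 and ends at i + k - 2.
                pieces.append(w[run_start:i + k - 1])
                run_start = None
        elif run_start is None:
            run_start = i
    if run_start is not None:
        pieces.append(w[run_start:])
    return sep.join(pieces)


def clean_kmers(s: str, k: int, sep: str = "#") -> List[str]:
    """List, in order, the length-k windows of s that do not contain sep."""
    return [s[i:i + k] for i in range(len(s) - k + 1) if sep not in s[i:i + k]]


if __name__ == "__main__":
    w = "aabaaaababbbaab"
    secret = {"aaaa", "baaa", "bbaa"}
    x = naive_sanitize(w, 4, secret)
    print(x)
    # Order and frequency of non-sensitive windows are unchanged.
    print(clean_kmers(x, 4) == [p for p in clean_kmers(w, 4) if p not in secret])
```

In this sketch every sensitive window is broken by a separator, while the sequence of non-sensitive windows read from left to right stays the same. Exposing the separators is exactly the weakness the abstract points out when it notes that the concatenation can be "reversed".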

We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, and experimentally show that it is effective and efficient. We also show that TFS-ALGO is nearly as effective at minimizing the edit distance as ETFS-ALGO, while being substantially more efficient than ETFS-ALGO.
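The utility measure used in the second setting is the edit distance between the original and the sanitized string. As a reminder of what is being measured, here is a minimal sketch of the textbook dynamic program for unit-cost edit distance (insertions, deletions, substitutions). It is not ETFS-ALGO, and the function name `edit_distance` is an assumption introduced here.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance between x and y, using two rolling rows."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

This quadratic-time routine only evaluates a candidate sanitized string against the original; finding a string that minimizes the distance subject to the sanitization constraints is the harder problem the article addresses.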

Index Terms

Combinatorial Algorithms for String Sanitization
1. Mathematics of computing
  1. Discrete mathematics
    1. Combinatorics
      1. Combinatorics on words
2. Security and privacy
  1. Database and storage security
    1. Data anonymization and sanitization


Published in
ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 1
February 2021
361 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3441647
Editors:
Charu Aggarwal, IBM T. J. Watson Research, USA
Xindong Wu, Mininglamp Academy of Sciences, China
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 December 2020
- Accepted: 1 August 2020
- Revised: 1 June 2020
- Received: 1 December 2019
Author Tags
Data privacy
data sanitization
knowledge hiding
sensitive knowledge
sequences
strings
Qualifiers
- research-article
- Research
- Refereed