Combinatorial Algorithms for String Sanitization

Published: 07 December 2020

Abstract

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks.

In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by “reversing” the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented.

In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.
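To make the first setting concrete, the following is a naive sketch of building a sanitized string by concatenating non-sensitive parts of the input. It is not TFS-ALGO or PFS-ALGO, and it makes no attempt to produce a minimal-length string. It assumes that patterns are substrings of a fixed length `k` and that a separator letter `#`, absent from the input alphabet, marks the places where sensitive patterns were cut out. The abstract states neither assumption, and the function name `sanitize_sketch` is hypothetical.

```python
from typing import AbstractSet


def sanitize_sketch(w: str, k: int, sensitive: AbstractSet[str], sep: str = "#") -> str:
    """Naive illustration (assumed model: length-k patterns, '#' separator).

    Every non-sensitive length-k window of w is kept, in its original order
    and with its original frequency; a separator is emitted wherever one or
    more sensitive windows were skipped between two kept windows.
    """
    out = []
    emitted_any = False   # has at least one window been written?
    skipped = False       # did we skip a sensitive window since the last write?
    for i in range(len(w) - k + 1):
        window = w[i:i + k]
        if window in sensitive:
            skipped = True
            continue
        if not emitted_any:
            out.append(window)           # first kept window: write it whole
        elif skipped:
            out.append(sep + window)     # gap: start a fresh block after sep
        else:
            out.append(window[-1])       # adjacent windows overlap by k-1 letters
        emitted_any = True
        skipped = False
    return "".join(out)


if __name__ == "__main__":
    print(sanitize_sketch("abcabcd", 3, {"cab"}))  # abca#abcd
```

In the example, the non-sensitive windows `abc`, `bca`, `abc`, `bcd` each appear in the output as often as in the input and in the same order, while `cab` no longer occurs. The separator positions are exactly what a "reversing" attack could exploit, which is the weakness the abstract says MCSR-ALGO is designed to address.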

We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, and experimentally show that it is effective and efficient. We also show that TFS-ALGO is nearly as effective at minimizing the edit distance as ETFS-ALGO, while being substantially more efficient than ETFS-ALGO.
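For readers unfamiliar with the measure used in this comparison, here is a minimal sketch of the classic dynamic program for edit distance between two strings. It assumes unit costs for insertion, deletion, and substitution. The abstract does not specify the cost model used in the article, so this is the textbook measure rather than the authors' exact definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                        # deleting the first i letters of a
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete ca
                cur[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb) # match or substitute
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))      # 3
    print(edit_distance("abcabcd", "abca#abcd"))   # 2
```

The second call compares an illustrative input string with a sanitized version that contains a hypothetical separator letter `#`. Two insertions turn one into the other, so the distance is 2.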

Published in

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 1
February 2021
361 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3441647

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 7 December 2020
• Accepted: 1 August 2020
• Revised: 1 June 2020
• Received: 1 December 2019

Qualifiers

• research-article
• Research
• Refereed
