Abstract
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks.
In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows tasks that rely on the sequential nature and pattern frequencies of the string to be performed accurately. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves only a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by “reversing” the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented.
In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.
We implemented our sanitization approach, which applies TFS-ALGO, PFS-ALGO, and then MCSR-ALGO, and show experimentally that it is effective and efficient. We also show that TFS-ALGO is nearly as effective at minimizing the edit distance as ETFS-ALGO, while being substantially more efficient.
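To make the idea of "concatenating non-sensitive parts of the input string" concrete, the following is a minimal sketch of a naive baseline. It is not TFS-ALGO, PFS-ALGO, or any algorithm from the article. It assumes, purely for illustration, that patterns are substrings of a fixed length `k` and that a separator letter `#` absent from the input may be inserted between the kept parts. The function names `sanitize_by_concatenation` and `kgram_counts` are hypothetical. The output preserves the order and frequency of non-sensitive length-`k` patterns, but it makes no attempt to be of minimal length.

```python
from collections import Counter


def sanitize_by_concatenation(w, k, sensitive, sep="#"):
    """Naive baseline (assumption: patterns are length-k substrings).

    Keeps every maximal run of consecutive non-sensitive length-k windows
    and joins the kept parts with a separator that does not occur in w.
    """
    if sep in w:
        raise ValueError("separator must not occur in the input string")
    blocks = []
    start = None  # index of the first window in the current non-sensitive run
    for i in range(len(w) - k + 1):
        if w[i:i + k] in sensitive:
            if start is not None:
                # The run covers windows start..i-1, i.e. letters start..i+k-2.
                blocks.append(w[start:i + k - 1])
                start = None
        elif start is None:
            start = i
    if start is not None:
        blocks.append(w[start:])  # final run extends to the end of w
    return sep.join(blocks)


def kgram_counts(s, k, sep="#"):
    """Count length-k substrings of s that do not contain the separator."""
    windows = (s[i:i + k] for i in range(len(s) - k + 1))
    return Counter(x for x in windows if sep not in x)


if __name__ == "__main__":
    w, k, sensitive = "abcabcd", 3, {"cab"}
    x = sanitize_by_concatenation(w, k, sensitive)
    print(x)
    print(kgram_counts(x, k))
```

In this toy input, the sensitive pattern `cab` no longer occurs in the output, while every non-sensitive length-3 pattern keeps its original frequency. A minimal-length construction would additionally merge adjacent parts wherever that can be done without creating new pattern occurrences, which this sketch does not attempt.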
Index Terms
- Combinatorial Algorithms for String Sanitization