
DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark

The Journal of Supercomputing

Abstract

One of the main steps in the data lifecycle is publishing data so that analysts can discover hidden patterns. However, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques enforce privacy models to prevent the disclosure of individuals’ private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to satisfy the ℓ-diversity privacy model. The method anonymizes large-scale data in a three-phase process: seed selection, data clustering for ℓ-diversity, and a finalizing phase. For the clustering phase, a hierarchical k-means-based data clustering algorithm has been designed. One of the major challenges of anonymization methods is to establish a good trade-off between data utility and privacy. Therefore, to calculate the distance between records and form more cohesive ℓ-diverse clusters, the authors have designed two distance functions, one Manhattan-based and one Euclidean-based, that satisfy the requirements of the ℓ-diversity model. Given Spark's reported 100-fold speedup over MapReduce, the proposed method is implemented with in-memory RDD programming in Apache Spark to address the runtime, scalability, and performance problems of previous MapReduce-based algorithms in large-scale data anonymization. The method provides general guidance on using Spark's parallel in-memory computation for big data anonymization. In experiments, the method obtained lower information loss while losing about 1% to 2% in the accuracy and F-measure criteria; it therefore establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods.
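
The two ingredients named in the abstract, a distance between records and an ℓ-diversity check on each cluster, can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses the plain Manhattan and Euclidean distances rather than the paper's modified distance functions, it checks only distinct ℓ-diversity (at least ℓ different sensitive values per cluster), and it runs in plain Python rather than on Spark RDDs. All function names, the toy records, and the seeds are assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def manhattan(a, b):
    """Plain Manhattan (L1) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Plain Euclidean (L2) distance between two numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(records, seeds, distance):
    """Group (quasi_identifiers, sensitive_value) records by nearest seed.

    This is a single k-means-style assignment step, not the full
    hierarchical procedure described in the paper.
    """
    clusters = defaultdict(list)
    for qi, sensitive in records:
        nearest = min(range(len(seeds)), key=lambda i: distance(qi, seeds[i]))
        clusters[nearest].append((qi, sensitive))
    return dict(clusters)


def is_l_diverse(cluster, l):
    """Distinct l-diversity: the cluster holds at least l sensitive values."""
    return len({sensitive for _, sensitive in cluster}) >= l


# Hypothetical toy data: (age, weight) as quasi-identifiers, disease as
# the sensitive attribute.
records = [
    ((25, 50), "flu"),
    ((27, 52), "cancer"),
    ((26, 49), "asthma"),
    ((60, 90), "flu"),
    ((62, 91), "flu"),
    ((61, 88), "diabetes"),
]
seeds = [(26, 50), (61, 90)]

clusters = assign_to_nearest(records, seeds, manhattan)
diverse = {cid: is_l_diverse(members, 3) for cid, members in clusters.items()}
```

With ℓ = 3, the first cluster passes the check (three distinct diseases) and the second fails (only two), which is the kind of cluster a finalizing phase would still have to repair, for example by merging it with a neighbour.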

Author information

Corresponding author

Correspondence to Keyhan Khamforoosh.

Cite this article

Ashkouti, F., Khamforoosh, K., Sheikhahmadi, A. et al. DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark. J Supercomput 78, 2616–2650 (2022). https://doi.org/10.1007/s11227-021-03958-3

  • DOI: https://doi.org/10.1007/s11227-021-03958-3
