Latent Dirichlet Allocation Model Training With Differential Privacy
IEEE Transactions on Information Forensics and Security (IF 6.3) Pub Date: 10-21-2020, DOI: 10.1109/tifs.2020.3032021
Fangyuan Zhao, Xuebin Ren, Shusen Yang, Qing Han, Peng Zhao, Xinyu Yang

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for hidden semantic discovery of text data and serves as a fundamental tool for text analysis in various applications. However, both the LDA model and its training process may expose the text information in the training data, raising significant privacy concerns. To address the privacy issue in LDA, we systematically investigate the privacy protection of the mainstream LDA training algorithm based on Collapsed Gibbs Sampling (CGS) and propose several differentially private LDA algorithms for typical training scenarios. In particular, we present the first theoretical analysis of the inherent differential privacy guarantee of CGS-based LDA training and further propose a centralized privacy-preserving algorithm (HDP-LDA) that can prevent data inference from the intermediate statistics in the CGS training. Also, we propose a locally private LDA training algorithm (LP-LDA) on crowdsourced data to provide local differential privacy for individual data contributors. Furthermore, we extend LP-LDA to an online version, OLP-LDA, to achieve LDA training on locally private mini-batches in a streaming setting. Extensive analysis and experimental results validate both the effectiveness and efficiency of our proposed privacy-preserving LDA training algorithms.
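The local-differential-privacy setting described above can be illustrated with a minimal randomized-response sketch: each data contributor flips every bit of a binary bag-of-words vector before submission, and the aggregator debiases the noisy sums. This is only an illustrative sketch of the general LDP principle; the function names (`perturb_bow`, `unbias_count`) are hypothetical and LP-LDA's actual mechanism differs in detail.

```python
import math
import random

def perturb_bow(bits, epsilon):
    """Perturb a binary bag-of-words vector with randomized response.

    Each bit is kept with probability e^eps / (e^eps + 1) and flipped
    otherwise, giving epsilon-local differential privacy per bit.
    (Illustrative sketch only; not the paper's exact mechanism.)
    """
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return [b if random.random() < p_keep else 1 - b for b in bits]

def unbias_count(noisy_sum, n, epsilon):
    """Debias an aggregated noisy count over n contributors.

    If the true count is t, the expected noisy sum is t*p + (n-t)*q
    with p = e^eps/(e^eps+1) and q = 1-p; inverting that linear map
    recovers an unbiased estimate of t.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    q = 1.0 - p
    return (noisy_sum - n * q) / (p - q)
```

A server aggregating such perturbed vectors can estimate per-word frequencies without ever seeing any contributor's raw document, which is the property the locally private training algorithm relies on.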

Updated: 2024-08-22