Scalable Iterative Classification for Sanitizing Large-Scale Datasets,IEEE Transactions on Knowledge and Data Engineering


Scalable Iterative Classification for Sanitizing Large-Scale Datasets
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2017-03-01, DOI: 10.1109/tkde.2016.2628180
Bo Li, Yevgeniy Vorobeychik, Muqun Li, Bradley Malin

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93 percent of the original data, and completes after at most five iterations.
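The iterative idea in the abstract (train a classifier, withhold what it flags, then retrain on what remains until nothing more is automatically detectable) can be illustrated with a minimal sketch. This is not the authors' algorithm: the token-based learner, the `min_precision` threshold, the stopping rule, and the names `train_token_classifier` and `greedy_sanitize` are all assumptions made for illustration, and the toy data is invented.

```python
from collections import Counter


def train_token_classifier(instances, labels, min_precision=0.6):
    """Toy learner (an assumption, not the paper's model): flag any token
    that occurs mostly in sensitive instances of the training data."""
    pos, total = Counter(), Counter()
    for text, is_sensitive in zip(instances, labels):
        for tok in set(text.lower().split()):
            total[tok] += 1
            if is_sensitive:
                pos[tok] += 1
    flagged = {t for t in total if pos[t] / total[t] >= min_precision}
    # The classifier predicts "sensitive" if any flagged token is present.
    return lambda text: any(t in flagged for t in text.lower().split())


def greedy_sanitize(instances, labels, max_iters=5):
    """Repeatedly train on the retained data and withhold whatever the new
    classifier flags; stop once a freshly trained classifier flags nothing."""
    retained = list(range(len(instances)))
    classifiers = []
    for _ in range(max_iters):
        clf = train_token_classifier(
            [instances[i] for i in retained],
            [labels[i] for i in retained],
        )
        predicted = {i for i in retained if clf(instances[i])}
        if not predicted:
            break  # nothing left that this learner can identify
        classifiers.append(clf)
        retained = [i for i in retained if i not in predicted]
    return retained, classifiers


# Invented toy data: True marks an instance containing an identifier.
instances = [
    "patient john smith admitted",
    "dr smith reviewed chart",
    "blood pressure stable",
    "patient reports mild pain",
    "call mary at home",
    "chart reviewed no changes",
]
labels = [True, True, False, False, True, False]

retained, classifiers = greedy_sanitize(instances, labels)
shared_fraction = len(retained) / len(instances)
```

Here `retained` holds the indices of instances the publisher would release, and `shared_fraction` is the share of the original data that survives. In a realistic setting the learner would be trained on labeled data and evaluated on held-out text, so the publisher would not observe true labels for the data being published, as this toy version does.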

Updated: 2017-03-01
