TC 11 Briefing Papers
ExSense: Extract sensitive information from unstructured data
Introduction
With the rapid development of the Internet, a large amount of sensitive information is stored and transmitted online, and large-scale leakage incidents have been reported frequently in recent years. In March 2018, the New York Times reported that the personal information of 50 million Facebook users had been harvested by a company named Cambridge Analytica Cadwalladr and Graham-Harrison (2018). According to IBM’s 2019 Cost of a Data Breach Report IBM Security (2019), the average cost of a data breach in 2019 was $3.92 million, a 12% increase from 2014, and the average breach involved 25,575 records. Moreover, once sensitive information is leaked, it can lead to significant contractual or legal liabilities; serious damage to personal image and reputation; or legal, financial, or business losses Ohm (2014). In this context, the issue of sensitive information leakage has received considerable critical attention.
Sensitive information leakage can be caused by both internal and external factors. The 2019 Data Breach Investigations Report released by Verizon Verizon (2019) shows that breaches caused by external factors include hacking and social attacks. Internal information breaches cannot be ignored either, such as unintentional leakage caused by weak security awareness, internal data misuse by authorized users, and corporate espionage. Many security measures, including firewalls, access control, and IPS/IDS, address data leakage caused by external factors. However, there is no effective way to deal with leakage caused by internal factors, because sensitive information often resides in common unstructured data (e.g., email messages, blog posts, news articles, configuration files), making it difficult to notice when leakage occurs.
Currently, more than 80% of the data on the Internet is unstructured Allahyari et al. (2017). Unstructured data usually refers to information that does not reside in a relational database; its structure is irregular or incomplete, and there is no predefined data model. Notably, although formats such as CSV, JSON, and XML have some organizational properties, they usually lack a clear predefined data model. Compared to structured data, such data are still difficult to retrieve, analyze, and store. Unstructured data is easily processed by humans but is very hard for machines to understand Gupta and Gupta (2019). It is thus beneficial to devise a means of processing unstructured data that helps us automatically detect sensitive information and prevent data leakage.
In order to protect sensitive information, many researchers focus on data leakage prevention (DLP) Hart et al. (2011); Meli et al. (2019); Shapira et al. (2013); Shu et al. (2015b). The existing methods Lin et al. (2020); Noor et al. (2019); Shvartzshnaider et al. (2019); Trabelsi (2019) can be classified into two categories: content-based analysis and context-based analysis. Content-based methods inspect data content using features of the sensitive information itself, such as regular expressions and data fingerprints, and achieve high detection accuracy for sensitive information with predictable patterns (e.g., IP addresses, email addresses, API keys). Context-based methods detect sensitive information based on contextual features around the monitored data; for sensitive information without predictable patterns (e.g., login-password combinations), the context-based approach is more effective. Therefore, to extract sensitive information more comprehensively and accurately, appropriate methods should be adopted for different kinds of sensitive information.
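To illustrate the content-based side, the following is a minimal Python sketch of pattern matching for sensitive information with predictable formats. The patterns are simplified examples for illustration only, not the rule set used by ExSense; real DLP rule sets are far more extensive.

```python
import re

# Simplified illustrative patterns; production DLP rules are stricter
# (e.g., validating IPv4 octet ranges, vendor-specific key formats).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def content_based_scan(text):
    """Return (type, match) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

For example, scanning the string `"contact admin@example.com from 10.0.0.1"` yields one email hit and one IPv4 hit. Such rules are precise for predictable patterns but, as noted above, cannot cover sensitive content that has no fixed surface form.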
Deep learning methods have achieved tremendous success in computer vision and pattern recognition, and neural networks based on dense vector representations have achieved strong results in many NLP tasks. Compared with traditional machine learning, deep learning makes multi-level automatic feature representation learning possible. With this in mind, applying deep learning methods to data leakage protection has become a clear trend.
In this work, we propose a method to extract sensitive information from unstructured data, utilizing both content-based and context-based extraction mechanisms. On the one hand, the method uses regular-expression matching to extract sensitive information with predictable patterns. On the other hand, we build a model named BERT-BiLSTM-Attention to label sensitive information entities in text based on contextual features. To validate its accuracy, we compare it to several popular baseline methods and achieve significant improvements. Experimental results on real-world datasets show that the hybrid of content analysis and context analysis tends to outperform either method alone. In general, the specific contributions of our research are the following:
- •
The paper proposes a framework named ExSense to extract sensitive information. ExSense contains two main modules: regular expressions extract content-based sensitive information with predictable patterns, and an automatic machine-learning extractor extracts context-based sensitive information using natural language processing technologies.
- •
In the context-based analysis, the paper formulates sensitive information extraction as a sequence labeling problem and builds a BERT-BiLSTM-Attention model, which utilizes a BiLSTM neural network and an attention mechanism. Experimental results show that this model outperforms the other baselines, with an F1 score of 99.15%.
- •
The paper analyzes about 1 million texts on Pastebin using ExSense and displays different types of sensitive information samples, which demonstrates the effectiveness and accuracy of our framework.
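Framing extraction as sequence labeling means the model emits one tag per token; turning those tags into extracted entities is then a small decoding step. The sketch below assumes a standard BIO tagging scheme, and the entity labels (CRED, USER) are illustrative placeholders, not necessarily the label set used in the paper.

```python
def decode_bio(tokens, tags):
    """Collapse per-token BIO tags into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new entity, closing any open one.
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        entities.append((current_type, " ".join(current_tokens)))
    return entities
```

For instance, tokens `["leaked", "password", "hunter2", "for", "alice"]` with tags `["O", "B-CRED", "I-CRED", "O", "B-USER"]` decode to `[("CRED", "password hunter2"), ("USER", "alice")]`.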
The rest of the paper is organized as follows: Section 2 presents related work. Section 3 presents a detailed description of the sensitive information extraction method in this paper. Section 4 presents the experiments and analysis related to this work. Section 5 concludes and proposes future work.
Section snippets
Related work
A prerequisite for sensitive information protection is extracting sensitive information, so in this section we first review the existing sensitive information detection methods. Secondly, we introduce technologies related to the method proposed in this paper, which include the sequence labeling model and the attention mechanism. At the end of this section, we summarize the existing methods, including content-based analysis and context-based analysis.
Methodology
For a given unstructured data including various types of unstructured text, our goal is to extract sensitive information from it. First of all, we need to figure out what sensitive information is and the types of sensitive information. Ohm summarized the definition of sensitive information by surveying dozens of different laws and regulations. Sensitive information describes the information that can be used to enable privacy or security harm when placed in the wrong hands Ohm (2014). In this
Datasets
The datasets used in the paper are collected from Pastebin (https://pastebin.com). This website is a text sharing platform where users can store any text. Pastebin contains voluminous sensitive information. For example, some users accidentally uploaded personal information, password credentials, and financial information. Developers and engineers leaked internal configurations and API keys. In addition, several hackers uploaded illegally obtained sensitive information to Pastebin.
We collected
Conclusion and future work
In this work, we present ExSense, a sensitive information extraction method from unstructured data. ExSense utilizes a hybrid approach that combines content and context analysis. Appropriate methods (i.e., content analysis and context analysis) are used for different sensitive information. In content analysis, regular expressions are used to extract sensitive information with predictable patterns. In context analysis, we build a sequence labeling model named BERT-BiLSTM-Attention, which
Ethics Statement
In this section, we discuss issues related to the ethical conduct of this research.
The Institutional Review Board informed us that the data we collected was outside the scope of the review because we only collected public documents in Pastebin, which are publicly available on the Internet. In addition, we did not obtain research data through any illegal means.
The sensitive information involved in this study had already been leaked on the Internet before our collection. Nevertheless, we have
CRediT authorship contribution statement
Yongyan Guo: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Jiayong Liu: Conceptualization, Methodology, Investigation. Wenwu Tang: Investigation, Software, Data curation. Cheng Huang: Conceptualization, Methodology, Validation, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is the result of a research project funded by the National Natural Science Foundation of China (No. 61902265) and the Sichuan Science and Technology Program (No. 2020YFG0047, No. 2020YFG0076).
Yongyan Guo is currently pursuing his master’s degree in the College of Cybersecurity, Sichuan University, China. His current research interests include data breach protection, attack detection, and artificial intelligence.
References (38)
- et al., 2014. Coban: a context based model for data leakage prevention. Inf. Sci. (Ny).
- et al., 2020. Terminologies augmented recurrent neural network model for clinical named entity recognition. J. Biomed. Inform.
- et al., 2019. A machine learning framework for investigating data breaches based on semantic analysis of adversary’s attack patterns in threat intelligence repositories. Future Generation Computer Systems.
- et al., 2017. A brief survey of text mining: classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919.
- et al., 2015. Detecting data semantic: a data leakage prevention approach. 2015 IEEE Trustcom/BigDataSE/ISPA.
- et al., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- et al., 2018. Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach. The Guardian.
- et al., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- et al., 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- et al., 2010. Data leak prevention through named entity recognition. 2010 IEEE Second International Conference on Social Computing.
- Natural language processing in mining unstructured data from software repositories: a review. Sādhanā.
- Text classification for data loss prevention. International Symposium on Privacy Enhancing Technologies Symposium.
- Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Detecting security breaches in personal data protection with machine learning. 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM).
- Finding function in form: compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.
- A data-centric approach to insider attack detection in database systems. International Workshop on Recent Advances in Intrusion Detection.
- How bad can it git? Characterizing secret leakage in public GitHub repositories. NDSS.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Jiayong Liu received his B.Eng. degree in 1982, M.Eng. degree in 1989, and Ph.D. degree in 2008 from Sichuan University, China. He is currently a professor in the School of Cybersecurity, Sichuan University, China. His research interests include network information processing, information security, and communications and network information systems.
Wenwu Tang received his master’s degree from Sichuan University, Chengdu, China, in 2010, and received the certificate of Network Security Engineer in 2014. At present, his main research directions include web security, big data applications, computer forensics, attack detection, and other fields.
Cheng Huang received the Ph.D. degree from Sichuan University, Chengdu, China, in 2017. From 2014 to 2015, he was a visiting student at the School of Computer Science, University of California, CA, USA. He is currently an Assistant Research Professor at the College of Cybersecurity, Sichuan University, Chengdu, China. His current research interests include web security, attack detection, and artificial intelligence.