Computers & Security

Volume 102, March 2021, 102156

TC 11 Briefing Papers
Exsense: Extract sensitive information from unstructured data

https://doi.org/10.1016/j.cose.2020.102156

Abstract

Large-scale sensitive information leakage incidents have been frequently reported in recent years. Once sensitive information is leaked, it can cause serious consequences. In this context, sensitive information leakage has long been a question of great interest in the field of cybersecurity. However, most sensitive information resides in unstructured data. Therefore, extracting sensitive information from voluminous unstructured data has become one of the greatest challenges. To address this challenge, we propose a method named ExSense for extracting sensitive information from unstructured data, which utilizes both content-based and context-based extraction mechanisms. On the one hand, the method uses regular matching to extract sensitive information with predictable patterns. On the other hand, we build a model named BERT-BiLSTM-Attention for extracting sensitive information with natural language processing. This model uses the BERT algorithm to accomplish word embedding and extracts sensitive information using a BiLSTM network and an attention mechanism, achieving an F1 score of 99.15%. Experimental results on real-world datasets show that ExSense has a higher detection rate than either individual method (i.e., content analysis or context analysis) alone. In addition, we analyze about a million texts on Pastebin, and the results show that ExSense can extract sensitive information from unstructured data effectively.

Introduction

With the rapid development of the Internet, a large amount of sensitive information is stored and transmitted online, and large-scale sensitive information leakage incidents have been frequently reported in recent years. In March 2018, the New York Times reported that the information of 50 million Facebook users was leaked by a company named Cambridge Analytica Cadwalladr and Graham-Harrison (2018). According to IBM’s 2019 Cost of a Data Breach Report IBM Security (2019), the average cost of a data breach in 2019 was $3.92 million, a 12% increase from 2014, and the average size of a data breach was 25,575 records. Moreover, once sensitive information is leaked, it can lead to significant contractual or legal liabilities; serious damage to personal image and reputation; or legal, financial, or business losses Ohm (2014). In this context, the issue of sensitive information leakage has received considerable critical attention.

Sensitive information leakage can be caused by both internal and external factors. The 2019 Data Breach Investigations Report Verizon (2019) released by Verizon shows that data breaches caused by external factors include hacking and social attacks. Internal information breaches cannot be ignored either, such as unintentional leakage caused by weak security awareness, internal data misuse by authorized users, and corporate espionage. Many security measures, including firewalls, access control, and IPS/IDS, address data leakage caused by external factors. However, there is no effective way to deal with data leakage caused by internal factors. This is because sensitive information often resides in common unstructured data (e.g., email messages, blog posts, news, and configuration files), making it difficult to notice that a leak has occurred.

Currently, more than 80% of the data on the Internet is unstructured data Allahyari et al. (2017). Unstructured data usually refers to information that does not reside in a relational database. In other words, the structure of unstructured data is irregular or incomplete and there is no predefined data model. In particular, although some formats such as CSV, JSON, and XML have organizational properties, they usually lack a clear predefined data model. Compared to structured data, such data is still difficult to retrieve, analyze, and store. Unstructured data is easily processed by humans but is very hard for machines to understand Gupta and Gupta (2019). It is thus beneficial to devise a means of processing unstructured data, which helps us automatically detect sensitive information in it and prevent data leakage.

In order to protect sensitive information, many researchers focus on data leakage prevention (DLP) Hart et al. (2011); Meli et al. (2019); Shapira et al. (2013); Shu et al. (2015b). The existing methods Lin et al. (2020); Noor et al. (2019); Shvartzshnaider et al. (2019); Trabelsi (2019) can be classified into two categories: content-based analysis and context-based analysis. Content-based methods inspect data content based on features of sensitive information itself, such as regular expressions and data fingerprints. Content-based methods have high detection accuracy for sensitive information with predictable patterns (e.g., IP, email, API KEY). Context-based methods detect sensitive information based on contextual features around the monitored data. For sensitive information without predictable patterns (e.g., Login Password Combo), the context-based approach is more effective. Therefore, in order to extract sensitive information more comprehensively and accurately, appropriate methods should be adopted for different sensitive information.
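The content-based branch described above can be sketched with a few regular expressions. The patterns below are illustrative simplifications for common sensitive-information types (email, IPv4, API key), not the paper's actual regex set:

```python
import re

# Illustrative content-based patterns; real DLP rule sets are far larger
# and more carefully tuned than these simplified examples.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # Hypothetical API-key shape: a long alphanumeric token after a keyword.
    "api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*([A-Za-z0-9]{20,})"),
}

def extract_content_based(text):
    """Return {type: [matches]} for every pattern that fires on the text."""
    found = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[name] = hits
    return found
```

Because these patterns encode the structure of the sensitive value itself, they detect it regardless of the surrounding text, which is exactly why content-based analysis works well only for predictable formats.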

Deep learning methods have achieved tremendous success in the fields of computer vision and pattern recognition. Neural networks based on dense vector representations have achieved great results in many NLP tasks. Compared with traditional machine learning, deep learning makes multi-level automatic feature representation learning possible. With this in mind, applying deep learning methods to data leakage protection has become a clear trend.

In this work, we propose a method to extract sensitive information from unstructured data, utilizing both content-based and context-based extraction mechanisms. On the one hand, the method uses regular matching to extract sensitive information with predictable patterns. On the other hand, we build a model named BERT-BiLSTM-Attention to label sensitive information entities in the text based on contextual features. To verify its accuracy, we compare it to several popular baseline methods and achieve significant improvements. Experimental results on real-world datasets show that the hybrid of content analysis and context analysis tends to produce better performance than either individual method. In summary, the specific contributions of our research are the following:

  • The paper proposes a framework named ExSense to extract sensitive information. ExSense contains two main modules: regular expressions extract content-based sensitive information with predictable patterns, and an automatic machine-learning extractor extracts context-based sensitive information using natural language processing technologies.

  • In the context-based analysis, the paper presents the sensitive information extraction problem as a sequence labeling problem and builds a BERT-BiLSTM-Attention model. This model utilizes the BiLSTM neural network and the attention mechanism. Experimental results show that the performance of this model is better than other baselines, with an F1 score of 99.15%.

  • The paper analyzes about 1 million texts on Pastebin and presents different types of sensitive information samples extracted by ExSense, which demonstrates the effectiveness and accuracy of our framework.
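Casting extraction as sequence labeling, as the second contribution describes, means tagging each token and then decoding contiguous tagged spans back into entities. A common encoding for this is the BIO scheme; the sketch below uses an illustrative tag set (the paper's exact labels are not given here):

```python
# BIO encoding for sequence labeling: "B-X" marks the start of a sensitive
# entity of type X, "I-X" its continuation, and "O" a non-sensitive token.
def bio_decode(tokens, labels):
    """Group BIO-labeled tokens back into (entity_type, text) spans."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]
```

For example, with tokens `["password", ":", "hunter", "2"]` and labels `["O", "O", "B-CRED", "I-CRED"]` (where `CRED` is a hypothetical credential tag), the decoder recovers the span `("CRED", "hunter 2")`.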

The rest of the paper is organized as follows: Section 2 presents related work. Section 3 gives a detailed description of the proposed sensitive information extraction method. Section 4 presents the experiments and analysis. Section 5 concludes the paper and outlines future work.


Related work

A prerequisite for sensitive information protection is extracting sensitive information, so in this section we first review existing sensitive information detection methods. Next, we introduce the technologies related to the method proposed in this paper, including the sequence labeling model and the attention mechanism. At the end of this section, we summarize the existing methods, covering both content-based analysis and context-based analysis.
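The attention mechanism reviewed here can be understood as a learned weighted pooling over a sequence of hidden states (e.g., BiLSTM outputs). The following is a generic additive-attention sketch with assumed parameter shapes, not the paper's exact formulation:

```python
import numpy as np

def attention_pool(hidden_states, w, b, u):
    """Additive attention over a sequence of hidden states.

    hidden_states: (seq_len, dim) matrix, e.g. BiLSTM outputs.
    w (dim, dim), b (dim,), u (dim,): learned parameters, passed in
    here only for illustration.
    Returns the attention-weighted summary vector and the weights.
    """
    scores = np.tanh(hidden_states @ w + b) @ u        # (seq_len,)
    weights = np.exp(scores - scores.max())            # numerically stable
    weights /= weights.sum()                           # softmax over tokens
    summary = weights @ hidden_states                  # (dim,) weighted sum
    return summary, weights
```

The softmax weights let the model emphasize the tokens most indicative of sensitive content, which is why attention helps when the sensitive value itself has no predictable pattern.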

Methodology

Given unstructured data containing various types of unstructured text, our goal is to extract sensitive information from it. First, we need to establish what sensitive information is and what types exist. Ohm summarized the definition of sensitive information by surveying dozens of different laws and regulations: sensitive information is information that can be used to enable privacy or security harm when placed in the wrong hands Ohm (2014). In this

Datasets

The datasets used in the paper are collected from Pastebin (https://pastebin.com). This website is a text sharing platform where users can store any text. Pastebin contains voluminous sensitive information. For example, some users accidentally uploaded personal information, password credentials, and financial information. Developers and engineers leaked internal configurations and API keys. In addition, several hackers uploaded illegally obtained sensitive information to Pastebin.

We collected

Conclusion and future work

In this work, we present ExSense, a sensitive information extraction method from unstructured data. ExSense utilizes a hybrid approach that combines content and context analysis. Appropriate methods (i.e., content analysis and context analysis) are used for different sensitive information. In content analysis, regular expressions are used to extract sensitive information with predictable patterns. In context analysis, we build a sequence labeling model named BERT-BiLSTM-Attention, which

Ethics Statement

In this section, we discuss issues related to the ethical conduct of this research.

The Institutional Review Board informed us that the data we collected was outside the scope of the review because we only collected public documents in Pastebin, which are publicly available on the Internet. In addition, we did not obtain research data through any illegal means.

The sensitive information involved in this study had already been leaked on the Internet before our collection. Nevertheless, we have

CRediT authorship contribution statement

Yongyan Guo: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Jiayong Liu: Conceptualization, Methodology, Investigation. Wenwu Tang: Investigation, Software, Data curation. Cheng Huang: Conceptualization, Methodology, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is the result of research projects funded by the National Natural Science Foundation of China (No. 61902265) and the Sichuan Science and Technology Program (No. 2020YFG0047, No. 2020YFG0076).
References (38)

  • S. Gupta et al., "Natural language processing in mining unstructured data from software repositories: a review," Sādhanā, 2019.
  • M. Hart et al., "Text classification for data loss prevention," International Symposium on Privacy Enhancing Technologies Symposium, 2011.
  • Z. Huang et al., "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
  • IBM Security, P.I., "Cost of a data breach report," 2019.
  • C.-H. Lin et al., "Detecting security breaches in personal data protection with machine learning," 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), 2020.
  • W. Ling et al., "Finding function in form: compositional character models for open vocabulary word representation," arXiv preprint arXiv:1508.02096, 2015.
  • S. Mathew et al., "A data-centric approach to insider attack detection in database systems," International Workshop on Recent Advances in Intrusion Detection, 2010.
  • M. Meli et al., "How bad can it git? Characterizing secret leakage in public GitHub repositories," NDSS, 2019.
  • T. Mikolov et al., "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

    Yongyan Guo is currently pursuing his masters degree in the College of Cybersecurity, Sichuan University, China. His current research interests include data breach protection, attack detection, and artificial intelligence.

    Jiayong Liu received his B.Eng. degree in 1982, M. Eng. degree in 1989, and Ph.D. degree in 2008 from Sichuan University, China. He is currently a professor in School of Cybersecurity, Sichuan University, China. His research interests include network information processing and information security, communications and network information system.

    Wenwu Tang received the master's degree from Sichuan University, Chengdu, China, in 2010, and received the certificate of Network Security Engineer in 2014. His main research interests include web security, big data applications, computer forensics, and attack detection.

    Cheng Huang received the Ph.D. degree from Sichuan University, Chengdu, China, in 2017. From 2014 to 2015, he was a visiting student at the School of Computer Science, University of California, CA, USA. He is currently an Assistant Research Professor at the College of Cybersecurity, Sichuan University, Chengdu, China. His current research interests include web security, attack detection, and artificial intelligence.
