RCBAC: A risk-aware content-based access control model for large-scale text data

https://doi.org/10.1016/j.jnca.2020.102733

Abstract

Unstructured data (mostly text data) have become a vital part of the era of big data. As a result, it has become increasingly difficult to identify the internal relations among data and to describe the access control object when designing access control policies, especially fine-grained ones. Furthermore, in recent years, security incidents caused by insiders leaking secrets have occurred frequently in both enterprises and government agencies around the world. Because user behavior is dynamic, it is difficult to detect “curious accesses” and grant authority with traditional static access control models. Therefore, we need a dynamic access control model that is content-driven and can be used to find curious users in daily practice. This paper proposes a risk-aware content-based access control model (RCBAC) that can be used to address over-authorization problems and can grant file-level authority to users. Based on the relevance of the data content and the duties of each user, RCBAC can quantify the risk of both the access behavior and the access history; accordingly, each user's access ability can be adjusted dynamically. The experimental results show that the RCBAC model can separate curious users from normal users and limit the access ability of curious users.

Introduction

In the era of big data, data have become an asset with economic importance. To use data more efficiently, it is important to share data safely. As a key technology for ensuring safe data sharing, access control plays an important role. The traditional access control models include discretionary access control (DAC), mandatory access control (MAC), and role-based access control (RBAC). However, alongside the convenience that big data has brought, it also creates new challenges, and the traditional access control techniques cannot meet some of its demands. On the one hand, unstructured data (mostly text data) have become a vital part of big data, and it is difficult to identify the internal relations among the data, which increases the difficulty of describing the access control object when designing access control (especially fine-grained access control) policies. On the other hand, traditional access control policies cannot be effectively applied to big data because of its large scale and rapid growth. Policies that are too lax can cause valuable or sensitive private data to be leaked; policies that are too strict can cause access to critical resources to be denied (Bauer and Kerschbaum, 2014). Access control models that can grant appropriate authority to users are therefore needed.

In recent years, security incidents have frequently occurred due to the leaking of secrets by insiders, both in enterprises and government agencies around the world. For example, a Google engineer spied on the data of four underage teens for months in 2010; accounts held by over 100,000 clients and 20,000 offshore companies with HSBC in Geneva were leaked in 2015; and six employees of Huawei leaked company secrets in 2017. These leakages demonstrate the need to solve over-authorization problems: it is not enough for access control models to grant the corresponding authority to users; they also need to evaluate each user's access behavior and detect potential risks.

Furthermore, faced with large-scale unstructured text data, it is much more difficult to enforce access control policies with traditional access control models. We can use RBAC to assign unstructured data to different roles so that we can grant users corresponding authority based on these relations. However, in big data applications, it is difficult for administrators to assign unstructured data, and this work requires too much labor. Moreover, in access control systems, “who can access what” is typically defined by administrators and this kind of authorization is very labor intensive (Zeng et al., 2014). When facing massive data, it is more difficult for administrators to perform this labor-intensive authorization; as a result, the users might be granted too much or even full authority to ensure the systems can be used well. However, this kind of over-authorization could cause high risk. To further motivate our research, we consider the following example:

Example 1

In a law enforcement agency with highly sensitive case files, a supervisor assigns a case to agent Alice in Department A for investigation. Naturally, Alice also needs to access related or similar cases (whether cases are related or similar is determined by the semantic content of the files). As an agent in Department A, Alice has the authority to access related files in the database of Department A. However, she cannot access related cases that belong to other departments. When Alice needs to access data that belong to Department B, there are two possibilities: (i) the data are unsupervised and Alice can obtain them directly, or (ii) Alice obtains access authorization through complex procedures, such as request and approval.

Each of these situations has drawbacks. In Situation (i), there is no security guarantee, and if Alice is a curious user, she can access unrelated data without any restriction, which could cause infinite risk. In Situation (ii), because of the complex procedures, Alice may not be able to obtain related files in time, which would affect Alice's normal work. Furthermore, the manual cost will be high due to the increase in the number of access requests. Without first obtaining related approval, not even the administrators can examine and approve requests to grant users authority, which could result in the same risk as that in Situation (i). We hope that Alice can only access data that are related to her duties and her work, which satisfies the need-to-know principle. Based on this requirement, we need a fine-grained, self-adaptive access control policy that considers the data content and user duties.

In the above example, the problem can be partially solved with attribute-based access control (ABAC). With ABAC, attributes can be compared to grant authority to Alice. For example, Alice can obtain access authority for the messages of suspects who are F feet tall and appeared in position P at time T. However, in ABAC, it is almost impossible to extract attributes from unstructured text data, and it is difficult to identify the internal links between different text files. Furthermore, the traditional access control models are often enforced too strictly, which could lead to insufficient authorization (e.g., Alice can only access the data that belong to Department A); typical users would then obtain too little authority, which reduces the usability of the system.

Beginning with the content-based authorization model for digital libraries that was proposed in 2002 (Adam et al., 2002), researchers have proposed various access control policies that are based on various types of content, including text content in databases (Zeng et al., 2014), social network data (Jadliwala et al., 2014; Paradesi et al., 2013), electronic data (Wu et al., 2010; Wu and Zhuo, 2014), context (Rubart, 2005), and content-centric networks (Mannes et al., 2015; Nagai et al., 2015). However, in a complex big data system, the access authority can change frequently, and static access control policies cannot satisfy the requirements of dynamic scenarios. If access control is strictly enforced with predefined policies, it might result in insufficient authorization or over-authorization.

Break The Glass (BTG) is a technique by which a person who does not have access privileges to certain information can quickly gain access when necessary. Integrating the main strategy of BTG into the NIST/ANSI role-based access control model, Ferreira et al. propose the BTG-RBAC model (Ferreira et al., 2009), which can be used in health care institutions to override restrictions in a controlled and justifiable manner and grant necessary authorities. In the field of medical privacy, Moura et al. use the concept of BTG to propose SoTRAACE, a socio-technical risk-adaptable access control model. For example, if a hospital patient is unable to communicate whether he/she is allergic to a specific medication, the nurse treating him/her has a justified possibility of overriding pre-defined policies and accessing the patient's allergy data before administering the medication (Moura et al., 2017).

Risk-based access control is another approach to overcoming the problem of insufficient authorization. Zhang et al. propose an access control model that balances the benefits and risks by combining them with each Read/Write operation (Zhang et al., 2006). When unforeseen access behavior occurs, if the risk is acceptable and smaller than the benefit, the access will be permitted. The model increases the usability in the situation of insufficient authorization. The JASON Program Office defines the concept of risk quota, and uses it to compute and manage access requests. In their research on addressing over-authorization in medical big data, Wang and Jin (2011) and Hui et al. (2015) propose risk-based access control models that distinguish between “normal doctors” and “curious doctors” based on their access behaviors so that they can adjust doctors' access abilities adaptively, such that “normal doctors” can access data in the typical manner while the access authority of “curious doctors” is limited.

Machine learning methods have also been applied in access control models to improve authorizations. Molloy et al. (2012) train a classifier on existing access control decisions and use it to determine whether an uncertain new access request should be permitted. Pervaiz et al. (2015) satisfy the privacy requirement of l-diversity generalization of stream data and use the sliding window technique to propose a precision-bounded access control approach.

To solve the problem of over-authorization for large-scale text data, this paper proposes a risk-aware content-based access control (RCBAC) model, which uses data content and risk management technology to grant users access authority. In RCBAC, the risk is evaluated according to both access behavior and access history. (i) On the aspect of access behavior, the user duties are compared with the duty attribute of files, and the text content is used to compute the concrete access behavior risk. (ii) On the aspect of access history, sliding windows are defined for each user to store their historical access requests. Then, the information entropy is used to compute their confusion in various access periods and evaluate users' history risk. (iii) RCBAC uses risk quotas to manage users' accesses. Users can consume their risk quota to obtain access authority.
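The history-risk idea in (ii) can be illustrated with a short sketch. It is not the paper's formula, which is not reproduced in this excerpt; it only assumes that each access is tagged with a duty category and that the Shannon entropy of the categories in a fixed-size sliding window, normalized to [0, 1], stands in for the user's "confusion". The class name `AccessWindow`, the window size of 8, and the normalization are hypothetical choices made for illustration.

```python
import math
from collections import Counter, deque


def shannon_entropy(labels):
    """Shannon entropy, in bits, of a sequence of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


class AccessWindow:
    """Sliding window over the categories a user accessed most recently.

    Hypothetical helper: the window size and the normalization below are
    assumptions for illustration, not values taken from the paper.
    """

    def __init__(self, size=8):
        self.recent = deque(maxlen=size)  # oldest entries drop out automatically

    def record(self, category):
        self.recent.append(category)

    def history_risk(self):
        """Entropy of the window divided by its maximum possible value."""
        n = len(self.recent)
        if n < 2:
            return 0.0  # too little history to call the user "confused"
        return shannon_entropy(self.recent) / math.log2(n)


focused = AccessWindow()
for _ in range(8):
    focused.record("sports")

scattered = AccessWindow()
for category in ["sports", "business", "health", "politics",
                 "education", "computers", "culture", "engineering"]:
    scattered.record(category)

print(focused.history_risk(), scattered.history_risk())
```

Under these assumptions, a user who stays within one category scores 0, while a user who touches a different category on every access in the window scores 1.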

Our contributions in this paper are as follows:

  • (1) A content-driven risk-aware access control model for text data: RCBAC uses both the user duty and the semantic content of data to compute the risk that is caused by users. A user will cause much lower risk if he tries to access files that are related to his current work (related duty or related content) compared to curious users. If one user has sufficiently high behavior risk and history risk, RCBAC will identify him and reduce his access ability.

  • (2) RCBAC enforcement on data sets of different sizes: We develop two RCBAC enforcements, one on a small data set (Google snippets) and one on a big data set (Amazon product reviews). In both cases, the experimental results show that RCBAC can easily distinguish curious users from normal users.

  • (3) A feasible solution for helping to solve over-authorization problems: RCBAC computes behavior risks for each user's access requests, and sliding windows are defined for users to compute their history risk. If a user is over-authorized and often tries to access data unrelated to him, he will be tagged “curious” and descriptions of his suspicious behaviors will be sent to administrators. The user's risk quota will be reduced, and eventually he will be able to access only a small amount of data. Moreover, administrators can focus on tracking the accesses of these curious users to identify suspicious users and reject their accesses, which can solve over-authorization problems to a certain extent.
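The quota mechanism described in the contributions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Risk Quota & Management Module: the initial quota, the threshold above which a user is tagged curious, the halving penalty, and all names (`RiskQuota`, `request`, `alerts`) are hypothetical. The only behavior taken from the text is that accesses consume quota, that a user with high behavior risk and high history risk is tagged curious and reported to administrators, and that his quota is then reduced.

```python
from dataclasses import dataclass, field


@dataclass
class RiskQuota:
    """Per-user risk budget (all default values are illustrative assumptions)."""
    quota: float = 10.0             # remaining risk the user may still consume
    curious_threshold: float = 0.7  # both risks at or above this => "curious"
    penalty: float = 0.5            # fraction of quota kept after being tagged
    curious: bool = False
    alerts: list = field(default_factory=list)  # messages for administrators

    def request(self, file_id, behavior_risk, history_risk):
        """Return True if the access is granted, False if the quota is too low."""
        if behavior_risk > self.quota:
            return False  # not enough quota left to pay for this access
        self.quota -= behavior_risk  # the access consumes quota
        if (behavior_risk >= self.curious_threshold
                and history_risk >= self.curious_threshold):
            self.curious = True
            self.alerts.append(
                f"suspicious access to {file_id}: "
                f"behavior={behavior_risk:.2f}, history={history_risk:.2f}"
            )
            self.quota *= self.penalty  # shrink the user's access ability
        return True


user = RiskQuota(quota=2.0)
print(user.request("case-001", behavior_risk=0.1, history_risk=0.1))
print(user.request("case-777", behavior_risk=0.9, history_risk=0.9))
print(user.request("case-778", behavior_risk=0.9, history_risk=0.9))
print(user.curious, round(user.quota, 2), len(user.alerts))
```

In this sketch a normal request costs little and leaves the quota almost untouched, whereas repeated high-risk requests both drain and shrink the quota, so a curious user is soon limited to low-risk files.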

The remainder of this paper is organized as follows. Section 2 describes the problems to be addressed. Section 3 presents the proposed RCBAC model. Section 4 develops RCBAC enforcement on two data sets. Related works are presented in Section 5, and the paper is concluded in Section 6.

Section snippets

Problem description

Insufficient authorization and over-authorization are two main problems to be solved in access control systems. If the authorization is too strict, users might not be able to access data that should be accessible, while over-authorization could result in large risk, which is also not acceptable. Therefore, to grant suitable authority to users, we need a new authorization method that grants access between these two authorization levels.

In daily work, each employee has his own duties, and his

RCBAC model

In this section, we will first introduce the basic framework of RCBAC. Then, we will elaborate on the details of RCBAC from three aspects: access behavior risk computing, access history risk computing, and the Risk Quota & Management Module. Moreover, we will show the computational complexity and space complexity of RCBAC in Section 3.5. Finally, we will discuss the security and usability of RCBAC.

RCBAC enforcement

In RCBAC enforcement, the Google Snippets data set and the Amazon Product Review data set are used to implement file-level access control. The main commonalities between the two data sets are as follows: (i) they are rich in semantic information; (ii) all data are divided into categories and we can use these categories as duty attributes.
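To make the role of categories as duty attributes concrete, the sketch below scores one access request. It is an assumption-laden illustration rather than the paper's method: the behavior-risk formula is not given in this excerpt, so a plain bag-of-words cosine similarity is swapped in for whatever semantic measure the authors use, and the rule "zero risk when the file's category is among the user's duties" is a simplification. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two bag-of-words Counters (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def behavior_risk(user_duties, task_text, file_category, file_text):
    """Risk in [0, 1] of one access (illustrative stand-in, not the paper's formula).

    user_duties   -- set of category labels the user is responsible for
    task_text     -- text describing the user's current work
    file_category -- category label of the requested file (its duty attribute)
    file_text     -- content of the requested file
    """
    if file_category in user_duties:
        return 0.0  # related duty: treated as risk-free in this sketch
    task_words = Counter(task_text.lower().split())
    file_words = Counter(file_text.lower().split())
    # Related content lowers the risk; unrelated content approaches 1.
    return 1.0 - cosine_similarity(task_words, file_words)


duties = {"sports"}
task = "injury report for the football team"
print(behavior_risk(duties, task, "sports", "football match schedule"))
print(behavior_risk(duties, task, "health", "football injury treatment report"))
print(behavior_risk(duties, task, "business", "quarterly earnings forecast"))
```

The three calls correspond to a file inside the user's duty, a file outside the duty but with overlapping content, and a file that is unrelated on both counts.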

Related works

Access controls with semantics and contents. Comprehensive data protection requires related mechanisms to enforce access control policies based on data contents. In recent years, many access control policies that are related to “semantics” have been proposed. Based on XACML with the application of semantic inter-operation, Zhao et al. (2010) realize the semantic inter-operation between the attributes of the service requester and the service provider, which increases the security of

Conclusion and future work

In big data applications, when facing large-scale data that are rapidly increasing in size, it is difficult for administrators to carry out accurate authorizations for users. To ensure that users can use applications smoothly, administrators might grant them too many or even complete privileges, which could cause high risk. Furthermore, as a vital type of unstructured data, text data play an important role in the era of big data, which places new demands on fine-grained access control.

To

CRediT authorship contribution statement

Ke Ma: Conceptualization, Methodology, Software, Writing - original draft, Investigation. Geng Yang: Supervision, Project administration, Writing - review & editing. Yang Xiang: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 61972209, Grant 61572263, Grant 61872197, Grant 61502243, and Grant 61602263, in part by the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant KYCX18_0891, and in part by the China Postdoctoral Science Foundation under Grant 2016M601859 and Grant 2015M581794.

Ke Ma received the Master of Science degree from Nanjing University of Posts and Telecommunications, China, in 2016. He is currently working towards the Ph.D. degree at the School of Computer Science, Nanjing University of Posts and Telecommunications. His research interests include information security, parallel and distributed computing, and data mining.

References (49)

  • A. Chen et al. A dynamic risk-based access control model for cloud computing
  • X. Chen et al. Semantic-aware access control for grid application
  • P.-C. Cheng et al. Fuzzy multi-level security: an experiment on quantified risk-adaptive access control
  • G. Dartmann et al. Big Data Analytics for Cyber-Physical Systems: Machine Learning for the Internet of Things (2019)
  • D.R. Dos Santos et al. A dynamic risk-based access control architecture for cloud computing
  • L. El Haourani et al. Knowledge based access control a model for security and privacy in the big data
  • N. Elahi et al. Semantic access control in web based communities
  • S. Etigowni et al. CPAC: securing critical infrastructure with cyber-physical access control
  • L.B. Fazlic et al. A novel NLP-fuzzy system prototype for information extraction from medical guidelines
  • A. Ferreira et al. How to securely break into RBAC: the BTG-RBAC model
  • Z. Hui et al. Risk-adaptive access control model for big data in healthcare (in Chinese). J. Commun. (2015)
  • M. Jadliwala et al. Social puzzles: context-based access control in online social networks
  • M. Liu et al. Semantic access control for web services
  • Z. Lu et al. Risk assessment based access control with text and behavior analysis for document management



Geng Yang was born in 1961. He is a Professor and Ph.D. supervisor with the School of Computer Science, Nanjing University of Posts and Telecommunications. His current research interests include computer communication and networks, parallel and distributed computing, cloud computing security, and information security.

Yang Xiang received the Ph.D. degree in computer science from Deakin University, Australia. He is currently with the Digital Research & Innovation Capability Platform, Swinburne University of Technology. His research interests include network and system security, distributed systems, and networking. In particular, he is currently leading a research group developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC).
