Analyzing privacy policies through syntax-driven semantic analysis of information types

https://doi.org/10.1016/j.infsof.2021.106608

Highlights

  • Abstract terms in privacy policies reduce shared understanding among stakeholders.

  • Ontology is a knowledge graph containing information types and semantic relations.

  • Ontology guides requirements analysts in the selection of the most appropriate terms.

  • Ontology helps reduce the abstraction and ambiguity problems in privacy policies.

Abstract

Context:

Several government laws and app markets, such as Google Play, require the disclosure of app data practices to users. These data practices constitute critical privacy requirements statements, since they underpin the app’s functionality while describing how various personal information types are collected, used, and with whom they are shared.

Objective:

Abstract and ambiguous terminology in requirements statements concerning information types (e.g., “we collect your device information”) can reduce shared understanding among app developers, policy writers, and users.

Method:

To address this challenge, we propose a syntax-driven method that first parses a given information type phrase (e.g., mobile device identifier) into its constituents using a context-free grammar, and then infers semantic relationships between the constituents using semantic rules. The inferred semantic relationships between a given phrase and its constituents generate a hierarchy that models the generality and ambiguity of phrases. Through this method, we infer relations from a lexicon consisting of a set of information type phrases to populate a partial ontology. The resulting ontology is a knowledge graph that can be used to guide requirements authors in the selection of the most appropriate information type terms.

Results:

We evaluate the method’s performance using two criteria: (1) expert assessment of relations between information types; and (2) non-expert preferences for relations between information types. The results suggest performance improvement when compared to a previously proposed method. We also evaluate the reliability of the method considering the information types extracted from different data practices (e.g., collection, usage, sharing, etc.) in privacy policies for mobile or web-based apps in various app domains.

Contributions:

The method achieves an average of 89% precision and 87% recall considering information types from various app domains and data practices. Based on these results, we conclude that the method generalizes reliably in inferring relations and reducing the ambiguity and abstraction in privacy policies.

Introduction

Government regulations increasingly require mobile and web-based application (app) companies to standardize their data practices concerning the collection, use, and sharing of various types of information. A summary of these practices is communicated to users through online privacy policies [1], [2], which have become a well-established source of requirements for requirements engineers [3], [4], because they need to be consistent with software behaviors.

The challenge of acquiring requirements from data practice descriptions, however, is that privacy policies often contain ambiguities [5], which admit more than one interpretation [6]. Furthermore, policies are intended to generalize across a wide range of data practices, and are not limited to describing a single software system, in which case they also exhibit vagueness and generality [7]. Berry and Kamsties distinguish four broad categories of linguistic ambiguity: lexical, syntactic, semantic, and pragmatic ambiguity [8]. They further separate vagueness and generality from ambiguity. Vagueness occurs when a phrase admits borderline cases, e.g., the word “tall” is vague when considering a subject who is neither tall nor not tall [8]. In generality, a superordinate term refers to two or more subordinate terms. In linguistics, generality is encoded by the relationship between a hypernym, or the general term, and more specific terms, called hyponyms.

In privacy policies, information types can be expressed using both vague and general terms. Many policies describe the vague phrase “personal information”, which can include both a person’s “age” and their “health conditions”, which users may consider more or less private, leading to boundary cases. In addition, they contain general terms, such as “address”, which are intended to refer to more specific meanings, such as “postal address”, “e-mail address”, or “network address”, in which case the reader must choose an interpretation to fit the given context. Finally, these policies also contain another kind of semantic indeterminacy that has not historically been included in ambiguity, generality or vagueness, which concerns holonyms, or wholes, and meronyms, or the parts of wholes. For example, when a policy refers to “postal address”, it also refers to “city”, “country”, and “postal code”, which are distinct parts of the postal address.
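To make these relations concrete, the examples in the paragraph above can be stored as a small knowledge graph. The following is a minimal sketch only: the triple layout, the relation labels `kind_of` and `part_of`, and the helper names are illustrative assumptions, not the representation used in the paper.

```python
from collections import defaultdict

# Triples are (specific term, relation, general term or whole).
# The terms come from the examples in the text; the encoding is an assumption.
RELATIONS = [
    ("postal address", "kind_of", "address"),
    ("e-mail address", "kind_of", "address"),
    ("network address", "kind_of", "address"),
    ("city", "part_of", "postal address"),
    ("country", "part_of", "postal address"),
    ("postal code", "part_of", "postal address"),
]


def build_graph(triples):
    """Index triples by target term, then by relation label."""
    graph = defaultdict(lambda: defaultdict(set))
    for source, relation, target in triples:
        graph[target][relation].add(source)
    return graph


def interpretations(graph, term):
    """Return the hyponyms a reader could choose when reading the general `term`."""
    return sorted(graph[term]["kind_of"])


def parts(graph, term):
    """Return the meronyms (parts) implied by the holonym `term`."""
    return sorted(graph[term]["part_of"])


graph = build_graph(RELATIONS)
print(interpretations(graph, "address"))
# ['e-mail address', 'network address', 'postal address']
print(parts(graph, "postal address"))
# ['city', 'country', 'postal code']
```

Querying the general term “address” returns the competing interpretations, while querying the holonym “postal address” returns the parts it implicitly covers.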

Ambiguity, generality, and vagueness have been extensively studied in requirements engineering research, particularly in regulatory and policy documents. This includes techniques to identify, classify, and model ambiguity in regulations, such as HIPAA [5], [9], and techniques to identify generality [3], [10], [11] and vagueness [12] in privacy policies. Recently, two studies employed hand-crafted regular expressions over nominals, and constituency parse trees derived from individual policy statements to extract generalities, specifically hypernyms [13], [14]. This prior work demonstrates the difficulty of scaling manual methods to construct ontologies from policies, and thus motivates the need for automated ontology discovery techniques.

In this paper, we focus on the role of hypernyms, meronyms and synonyms and their formal relationships among terminology in privacy policies (see Section 2 for examples). We propose a novel, automated syntax-driven semantic analysis method for constructing partial ontologies to formalize these relationships. Formal ontologies can be used to automate requirements analysis, specifically where the “informal meets the formal”, as in where mathematical models are extracted from natural language text [15]. Recently, such ontologies have enabled precise, reusable and semi-automated analysis to trace requirements from policies to code execution [11], [16] and in checking formal specifications for conflicting interpretations [3], [17].

Our proposed method is based on the principle of compositionality, which states the meaning of a given phrase can be derived from the meaning of its constituents [18], [19]. Using this principle and grounded analysis of 356 unique information type phrases (e.g., mobile device identifier), we developed a context-free grammar (CFG) to decompose a given information type phrase into its constituents. The production rules in the CFG are augmented with semantic rules, which we call semantic attachments [20], that are used to infer semantic relationships, including hypernymy, meronymy, and synonymy between the given information type constituents. This method is evaluated on two sets of 491 and 1853 information type phrases extracted from 60 privacy policies. Applying our method on these information types yields 5044 and 21,745 semantic relations, respectively.
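The idea of attaching a semantic rule to each production can be illustrated with a toy recursive parser. This is a hedged sketch under stated assumptions: the three tags (`modifier`, `thing`, `property`), the two productions, and the relations they emit are hypothetical simplifications chosen for illustration, and they are not the grammar or the semantic attachments developed in this paper.

```python
def phrase_text(tagged):
    """Join the words of a tagged phrase back into a string."""
    return " ".join(word for word, _ in tagged)


def infer_relations(tagged):
    """Infer relations from a list of (word, tag) pairs.

    Hypothetical productions and their attachments:
      Phrase -> modifier Phrase : the whole phrase is a kind of the remainder
      Phrase -> thing Phrase    : the whole phrase is a kind of the remainder
                                  and a part of the leading thing
    """
    relations = set()
    _parse(tagged, relations)
    return relations


def _parse(tagged, relations):
    if len(tagged) <= 1:
        return  # a single head word yields no relation on its own
    whole = phrase_text(tagged)
    (first_word, first_tag), rest = tagged[0], tagged[1:]
    if first_tag == "modifier":
        relations.add((whole, "kind_of", phrase_text(rest)))
    elif first_tag == "thing":
        relations.add((whole, "kind_of", phrase_text(rest)))
        relations.add((whole, "part_of", first_word))
    else:
        raise ValueError(f"no production starts with tag {first_tag!r}")
    _parse(rest, relations)


# Tags are assigned by hand here; they are assumptions for this example.
phrase = [("mobile", "modifier"), ("device", "thing"), ("identifier", "property")]
for relation in sorted(infer_relations(phrase)):
    print(relation)
# ('device identifier', 'kind_of', 'identifier')
# ('device identifier', 'part_of', 'device')
# ('mobile device identifier', 'kind_of', 'device identifier')
```

Even in this reduced form, one three-word phrase yields several relations, which suggests why a lexicon of a few hundred phrases can produce thousands of relations.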

This work extends a previous conference paper [21] with a new evaluation of the syntax-driven method using two sources of ground truth: (a) relations identified by experts with experience in privacy and data practices; and (b) the preferences expressed by a population of web and mobile-app users (i.e., non-experts) toward relationships between information types. This novel contribution adds a new evaluation method that reaches beyond expert opinion, which is the historical benchmark for constructing corpora and performing natural language evaluation, to include popular opinion, which is better suited to measure how potential users interpret data practice descriptions in privacy policies. The overall contributions of the current paper are as follows: (1) a syntax-driven method to infer semantic relations from a given information type using the principle of compositionality; (2) an empirical evaluation of the method using expert-inferred relations; (3) an empirical evaluation of the method using population preferences; (4) an empirical evaluation of the method using statements about mobile and web-based apps across multiple domains.
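Comparing inferred relations against a ground truth, whether it comes from experts or from user preferences, reduces to comparing two sets of relations. The sketch below uses the standard definitions of precision and recall; the function name and the sample relations are illustrative assumptions and do not reproduce the paper's evaluation protocol or data.

```python
def precision_recall(inferred, ground_truth):
    """Compute precision and recall of inferred relations against a ground truth."""
    inferred, ground_truth = set(inferred), set(ground_truth)
    true_positives = len(inferred & ground_truth)
    false_positives = len(inferred - ground_truth)
    false_negatives = len(ground_truth - inferred)
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical relations, made up for this example only.
inferred = {
    ("device identifier", "kind_of", "identifier"),
    ("device identifier", "part_of", "device"),
    ("postal address", "kind_of", "postal"),  # a spurious inference
}
ground_truth = {
    ("device identifier", "kind_of", "identifier"),
    ("device identifier", "part_of", "device"),
    ("postal address", "kind_of", "address"),
    ("city", "part_of", "postal address"),
}

precision, recall = precision_recall(inferred, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.50
```

In this framing, a false negative is a ground-truth relation the method fails to infer, and a false positive is an inferred relation that the ground truth does not contain.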

This paper is organized as follows: in Section 2, we discuss the problem and motivation for ontology construction with an example ontology illustration; in Section 3, we discuss important terminology and lexicons; in Section 4, we review related work; and in Section 5, we introduce the syntax-driven method. In Section 6, we present the evaluation using four experiments and their results, followed by limitations in Section 7 and threats to validity in Section 8. Finally, we discuss the application and implications of ontologies in practice, and future work, in Sections 9 and 10.


Problem and motivation

Ambiguous terminology in legal requirements and privacy policies can lead to multiple, unwanted interpretations [9] and further reduce the shared understanding among requirements engineers, policy authors, and regulators [22]. A lack of shared understanding has consequences, such as the recent $5 billion settlement between the Federal Trade Commission and Facebook [23]. This penalty arose from poor data practices that resulted in the leaking of 87 million users’ personal information to third parties. The

Background

In this section, we introduce the terminology and three different lexicons used throughout this paper.

Related work

In this section, we review related work concerning the definition of ambiguity, tools and techniques to detect and resolve ambiguity, ontology in requirements modeling, and ontology in requirements analysis. Lastly, we summarize our observations on the related work and make explicit our contribution to the body of knowledge in the requirements engineering domain.

Syntax-driven ontology construction method

Our method for constructing ontology fragments is based on grounded theory [57], which is a qualitative inquiry approach that involves applying codes to data through coding cycles to develop a theory grounded in the data [58]. Based on this theory, we describe three applications in this paper: (1) codes applied to information types in Lexicon L1 to construct a context-free grammar (see Section 5.3); (2) memo-writing to capture results from applying the grammar and its semantic attachments to

Evaluation and results

In this paper, we evaluate the syntax-driven method using two approaches:

  • 1. Expert Evaluation: relations identified by experts with experience in privacy and data practices.

  • 2. Non-Expert Evaluation: the preferences expressed by a population of web and mobile-app users (i.e., non-experts) toward the relationships between information types.

Fig. 5 depicts the overview of the method evaluation. Through this evaluation, we compare our method with expert and non-expert viewpoints. This comparison requires

Limitations

In this section we discuss limitations of the work by analyzing false negatives. While the syntax-driven method affords high precision, the low recall presents a limitation of the approach. To understand this limitation, we open code the false negatives (FNs) for all four experiments (i.e., E1.1, E1.2, E2.1, and E2.2) and identify four explanations for this limitation. The four explanations are as follows.

(1) Tacit Knowledge: The method solely relies on syntax and infers semantic relations

Threats to validity

In this section we discuss threats to validity of the work.

Internal Validity: Internal validity concerns whether the inferences drawn from the experiments are valid. The relations inferred by the syntax-driven method depend on reliable labeling of information types by analysts (see Section 5.2). Changes in tags affect the performance of the method when compared to a ground truth.

To understand the effect of labels on the inferred relations by the syntax-driven method, we analyze the FNs for

Discussion

Privacy policies contain ambiguous, general, and vague terminology, making requirements elicitation from data practices a challenging task. Our methodology provides requirements analysts with reusable ontologies that formalize semantic relations between ambiguous, general, and vague information types in privacy policies. However, such research poses fundamental challenges due to the problem’s nature that requires systematic validation. In the following subsections, we provide brief responses to

Conclusion

Privacy policies are legal documents containing application data practices. These documents are well-established sources of requirements in software engineering. However, privacy policies are written in natural language, thus subject to ambiguity and abstraction. Eliciting requirements from privacy policies is a challenging task as these ambiguities can result in more than one interpretation of a given term. In this paper, we focus on the role of hypernyms, meronyms, and synonyms and their

CRediT authorship contribution statement

Mitra Bokaei Hosseini: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Visualization. Travis D. Breaux: Conceptualization, Software, Validation, Formal analysis, Investigation, Writing - review & editing, Supervision, Funding acquisition. Rocky Slavin: Writing - review & editing. Jianwei Niu: Methodology, Formal analysis, Resources, Funding acquisition, Supervision. Xiaoyin Wang: Methodology, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank Xue Qin for her participation as an expert in experiment E1.2. This research was supported by NSF, United States, awards #1736209, #1748109, #2007718, #1453139, and #1948244.

References (64)

  • Janssen T.M. et al. Compositionality
  • Breaux T.D. et al. Legally “reasonable” security requirements: A 10-year FTC retrospective. Comput. Secur. (2011)
  • Breaux T.D. et al. A distributed requirements management framework for legal compliance and accountability. Comput. Secur. (2009)
  • Harris K. Privacy on the Go: Recommendations for the Mobile Ecosystem (2013)
  • Anton A.I. et al. A requirements taxonomy for reducing web site privacy vulnerabilities. Requir. Eng. (2004)
  • Breaux T.D. et al. Eddy, a formal language for specifying and analyzing data flow specifications for conflicting privacy requirements. Requir. Eng. (2014)
  • Massey A.K. et al. Automated text mining for requirements analysis of policy documents
  • Massey A.K. et al. Modeling regulatory ambiguities for requirements analysis
  • E. Kamsties, B. Peach, Taming ambiguity in natural language requirements, in: Proceedings of the Thirteenth...
  • Reidenberg J.R. et al. Ambiguity in privacy policies and the impact of regulation. J. Legal Stud. (2016)
  • Berry D.M. et al. Ambiguity in requirements specification
  • Massey A.K. et al. Identifying and classifying ambiguity for regulatory requirements
  • M.B. Hosseini, S. Wadkar, T.D. Breaux, J. Niu, Lexical similarity of information type hypernyms, meronyms and synonyms...
  • R. Slavin, X. Wang, M.B. Hosseini, J. Hester, R. Krishnan, J. Bhatia, T.D. Breaux, J. Niu, Toward a framework for...
  • Bhatia J. et al. A theory of vagueness and privacy risk perception
  • Hosseini M.B. et al. Inferring ontology fragments from semantic role typing of lexical variants
  • Evans M.C. et al. An evaluation of constituency-based hyponymy extraction from privacy policies
  • Jackson M. Problems and requirements
  • X. Wang, X. Qin, M.B. Hosseini, R. Slavin, T.D. Breaux, J. Niu, Guileak: Tracing privacy policy claims on user input...
  • Breitman K.K. et al. Ontology as a requirements engineering product
  • Frege G. Über begriff und gegenstand (1892)
  • Bach E. An extension of classical transformational grammar (1976)
  • Hosseini M.B. et al. Disambiguating requirements through syntax-driven semantic analysis of information types
  • FTCT.D. FTC’s $5 billion Facebook settlement: Record-breaking and history-making (2019)
  • G. Petronella, Analyzing privacy of Android applications, Italy,...
  • S. Zimmeck, Z. Wang, L. Zou, R. Iyengar, B. Liu, F. Schaub, S. Wilson, N. Sadeh, S.M. Bellovin, J. Reidenberg,...
  • Bhatia J. et al. Towards an information type lexicon for privacy policies
  • Slavin R. et al. PVDetector: a detector of privacy-policy violations for Android apps
  • Slovic P. The construction of preference. Amer. Psychol. (1995)
  • Jurafsky D. et al. Speech and Language Processing, Vol. 3 (2014)
  • Baader F. et al. The Description Logic Handbook: Theory, Implementation and Applications (2003)
  • De Saussure F. et al. Course in General Linguistics. (Open Court Classics) (1998)