Analyzing privacy policies through syntax-driven semantic analysis of information types
Introduction
Government regulations increasingly require mobile and web-based application (app) companies to standardize their data practices concerning the collection, use, and sharing of various types of information. A summary of these practices is communicated to users through online privacy policies [1], [2], which have become a well-established source of requirements for requirements engineers [3], [4], because policies need to be consistent with software behavior.
The challenge of acquiring requirements from data practice descriptions, however, is that privacy policies often contain ambiguities [5], which admit more than one interpretation [6]. Furthermore, policies are intended to generalize across a wide range of data practices and are not limited to describing a single software system, in which case they also exhibit vagueness and generality [7]. Berry and Kamsties distinguish four broad categories of linguistic ambiguity: lexical, syntactic, semantic, and pragmatic ambiguity [8]. They further separate vagueness and generality from ambiguity. Vagueness occurs when a phrase admits borderline cases, e.g., the word “tall” is vague when considering a subject who is neither tall nor not tall [8]. In generality, a superordinate term refers to two or more subordinate terms. In linguistics, generality is encoded by the relationship between a hypernym, or the general term, and more specific terms, called hyponyms.
In privacy policies, information types can be expressed using both vague and general terms. Many policies use the vague phrase “personal information”, which can include both a person’s “age” and their “health conditions”, which users may consider more or less private, leading to boundary cases. In addition, policies contain general terms, such as “address”, which are intended to refer to more specific meanings, such as “postal address”, “e-mail address”, or “network address”, in which case the reader must choose an interpretation to fit the given context. Finally, these policies also contain another kind of semantic indeterminacy that has not historically been classified as ambiguity, generality, or vagueness, which concerns holonyms, or wholes, and meronyms, or the parts of wholes. For example, when a policy refers to “postal address”, it also refers to “city”, “country”, and “postal code”, which are distinct parts of the postal address.
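To make these relation kinds concrete, the sketch below encodes a few of the relations just described as labeled pairs and queries them. This is purely illustrative: the phrases and tuples are drawn from the examples above, not from the paper’s actual lexicon or ontology.

```python
# Illustrative semantic relations among information type phrases.
# Each key is a (specific, general-or-whole) pair; the value names the relation.
relations = {
    ("postal address", "address"): "hypernymy",    # "address" is the hypernym
    ("e-mail address", "address"): "hypernymy",
    ("postal code", "postal address"): "meronymy",  # "postal code" is part of a postal address
    ("city", "postal address"): "meronymy",
    ("zip code", "postal code"): "synonymy",
}

def hyponyms(term):
    """Return terms whose hypernym is `term`."""
    return sorted(t for (t, h), rel in relations.items()
                  if rel == "hypernymy" and h == term)

def parts(whole):
    """Return meronyms (parts) of `whole`."""
    return sorted(p for (p, w), rel in relations.items()
                  if rel == "meronymy" and w == whole)

print(hyponyms("address"))      # ['e-mail address', 'postal address']
print(parts("postal address"))  # ['city', 'postal code']
```

A full ontology would close these relations under transitivity (e.g., a part of a hyponym is a part of its hypernym’s referent in context), which is one reason a formal representation pays off over ad hoc lists.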
Ambiguity, generality, and vagueness have been extensively studied in requirements engineering research, particularly in regulatory and policy documents. This includes techniques to identify, classify, and model ambiguity in regulations, such as HIPAA [5], [9], and techniques to identify generality [3], [10], [11] and vagueness [12] in privacy policies. Recently, two studies employed hand-crafted regular expressions over nominals and constituency parse trees derived from individual policy statements to extract generalities, specifically hypernyms [13], [14]. This prior work demonstrates the difficulty of scaling manual methods to construct ontologies from policies, and thus motivates the need for automated ontology discovery techniques.
In this paper, we focus on the role of hypernyms, meronyms, and synonyms, and their formal relationships among terminology in privacy policies (see Section 2 for examples). We propose a novel, automated syntax-driven semantic analysis method for constructing partial ontologies to formalize these relationships. Formal ontologies can be used to automate requirements analysis, specifically where the “informal meets the formal”, i.e., where mathematical models are extracted from natural language text [15]. Recently, such ontologies have enabled precise, reusable, and semi-automated analyses to trace requirements from policies to code execution [11], [16] and to check formal specifications for conflicting interpretations [3], [17].
Our proposed method is based on the principle of compositionality, which states that the meaning of a given phrase can be derived from the meanings of its constituents [18], [19]. Using this principle and a grounded analysis of 356 unique information type phrases (e.g., mobile device identifier), we developed a context-free grammar (CFG) to decompose a given information type phrase into its constituents. The production rules in the CFG are augmented with semantic rules, which we call semantic attachments [20], that are used to infer semantic relationships, including hypernymy, meronymy, and synonymy, between the given information type’s constituents. This method is evaluated on two sets of 491 and 1853 information type phrases extracted from 60 privacy policies. Applying our method to these information types yields 5044 and 21,745 semantic relations, respectively.
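A minimal sketch of the compositional idea is shown below. The grammar and inference rule here are invented for illustration and are far simpler than the paper’s CFG with semantic attachments: the sketch treats an information type as a right-branching noun phrase and, for each production that strips the leftmost modifier, attaches a hypernymy inference (the shorter suffix generalizes the longer phrase).

```python
def decompose(phrase):
    """Toy analogue of a semantic attachment on a right-branching
    noun-phrase grammar: successively drop the leftmost modifier and
    record the remaining suffix as a hypernym of the longer phrase.
    The real method tags constituents and applies per-rule inferences
    for hypernymy, meronymy, and synonymy."""
    words = phrase.split()
    triples = []
    for i in range(1, len(words)):
        longer = " ".join(words[i - 1:])   # e.g., "mobile device identifier"
        shorter = " ".join(words[i:])      # e.g., "device identifier"
        triples.append((longer, "hypernymy", shorter))
    return triples

print(decompose("mobile device identifier"))
# [('mobile device identifier', 'hypernymy', 'device identifier'),
#  ('device identifier', 'hypernymy', 'identifier')]
```

This also shows why the head noun matters: a naive suffix rule would wrongly infer that “device identifier” is a kind of “identifier of any device”, whereas the paper’s semantic attachments distinguish when a constituent acts as a modifier, a part, or a synonym.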
This work extends a previous conference paper [21] with a new evaluation of the syntax-driven method using two sources of ground truth: (a) relations identified by experts with experience in privacy and data practices; and (b) the preferences expressed by a population of web and mobile-app users (i.e., non-experts) toward relationships between information types. This novel contribution adds a new evaluation method that reaches beyond expert opinion, which is the historical benchmark for constructing corpora and performing natural language evaluation, to include popular opinion, which is better suited to measure how potential users interpret data practice descriptions in privacy policies. The overall contributions of the current paper are as follows: (1) a syntax-driven method to infer semantic relations from a given information type using the principle of compositionality; (2) an empirical evaluation of the method using expert-inferred relations; (3) an empirical evaluation of the method using population preferences; (4) an empirical evaluation of the method using statements about mobile and web-based apps across multiple domains.
This paper is organized as follows: in Section 2, we discuss the problem and motivation for ontology construction with an example ontology illustration; in Section 3, we discuss important terminology and lexicons; in Section 4, we review related work; in Section 5, we introduce the syntax-driven method; in Section 6, we present the evaluation using four experiments and their results, followed by limitations in Section 7 and threats to validity in Section 8. Finally, we discuss the application and implications of ontologies in practice, and future work, in Sections 9 (Discussion) and 10 (Conclusion).
Section snippets
Problem and motivation
Ambiguous terminology in legal requirements and privacy policies can lead to multiple, unwanted interpretations [9] and further reduce the shared understanding among requirements engineers, policy authors, and regulators [22]. A lack of shared understanding has consequences, such as the Federal Trade Commission’s recent $5 billion settlement with Facebook [23]. This penalty arose from poor data practices resulting in the leaking of 87 million users’ personal information to third parties.
Background
In this section, we introduce the terminology and three different lexicons used throughout this paper.
Related work
In this section, we review related work concerning the definition of ambiguity, tools and techniques to detect and resolve ambiguity, ontology in requirements modeling, and ontology in requirements analysis. Lastly, we summarize our observations of the related work and make explicit our contribution to the body of knowledge in the requirements engineering domain.
Syntax-driven ontology construction method
Our method for constructing ontology fragments is based on grounded theory [57], which is a qualitative inquiry approach that involves applying codes to data through coding cycles to develop a theory grounded in the data [58]. Based on this theory, we describe three applications in this paper: (1) codes applied to information types in Lexicon to construct a context-free grammar (see Section 5.3); (2) memo-writing to capture results from applying the grammar and its semantic attachments to
Evaluation and results
In this paper, we evaluate the syntax-driven method using two approaches:
1. Expert Evaluation: relations identified by experts with experience in privacy and data practices.
2. Non-Expert Evaluation: the preferences expressed by a population of web and mobile-app users (i.e., non-experts) toward the relationships between information types.
Fig. 5 depicts an overview of the method evaluation. Through this evaluation, we compare our method with expert and non-expert viewpoints.
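Comparing the method’s inferred relations against either ground truth reduces to set operations over relation triples. The sketch below computes precision, recall, and F1 in this way; the triples and the resulting scores are invented for illustration and are not the paper’s results.

```python
def prf(inferred, ground_truth):
    """Precision, recall, and F1 of inferred relation triples against a
    ground-truth set (e.g., expert-identified relations)."""
    tp = len(inferred & ground_truth)  # true positives: relations in both sets
    precision = tp / len(inferred) if inferred else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical triples, for illustration only.
method = {("postal address", "hypernymy", "address"),
          ("city", "meronymy", "postal address")}
experts = {("postal address", "hypernymy", "address"),
           ("country", "meronymy", "postal address")}

print(prf(method, experts))  # (0.5, 0.5, 0.5)
```

Relations in the ground truth that the method misses (here, the “country” meronymy) are the false negatives analyzed in the Limitations section.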
Limitations
In this section, we discuss limitations of the work by analyzing false negatives. While the syntax-driven method affords high precision, its low recall presents a limitation of the approach. To understand this limitation, we open-code the false negatives (FNs) for all four experiments and identify four explanations, as follows.
(1) Tacit Knowledge: The method solely relies on syntax and infers semantic relations
Threats to validity
In this section we discuss threats to validity of the work.
Internal Validity: Internal validity concerns whether the inferences drawn from the experiments are valid. The relations inferred by the syntax-driven method depend on reliable labeling of information types by analysts (see Section 5.2). Changes in tags affect the performance of the method when compared to a ground truth.
To understand the effect of labels on the inferred relations by the syntax-driven method, we analyze the FNs for
Discussion
Privacy policies contain ambiguous, general, and vague terminology, making requirements elicitation from data practices a challenging task. Our methodology provides requirements analysts with reusable ontologies that formalize semantic relations between ambiguous, general, and vague information types in privacy policies. However, the nature of the problem poses fundamental challenges that require systematic validation. In the following subsections, we provide brief responses to
Conclusion
Privacy policies are legal documents containing application data practices. These documents are well-established sources of requirements in software engineering. However, privacy policies are written in natural language, thus subject to ambiguity and abstraction. Eliciting requirements from privacy policies is a challenging task as these ambiguities can result in more than one interpretation of a given term. In this paper, we focus on the role of hypernyms, meronyms, and synonyms and their
CRediT authorship contribution statement
Mitra Bokaei Hosseini: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Visualization. Travis D. Breaux: Conceptualization, Software, Validation, Formal analysis, Investigation, Writing - review & editing, Supervision, Funding acquisition. Rocky Slavin: Writing - review & editing. Jianwei Niu: Methodology, Formal analysis, Resources, Funding acquisition, Supervision. Xiaoyin Wang: Methodology, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank Xue Qin for her participation as an expert. This research was supported by NSF (United States) awards #1736209, #1748109, #2007718, #1453139, and #1948244.
References (64)
- Compositionality.
- Legally “reasonable” security requirements: A 10-year FTC retrospective. Comput. Secur. (2011).
- A distributed requirements management framework for legal compliance and accountability. Comput. Secur. (2009).
- Privacy on the Go: Recommendations for the Mobile Ecosystem (2013).
- A requirements taxonomy for reducing web site privacy vulnerabilities. Requir. Eng. (2004).
- Eddy, a formal language for specifying and analyzing data flow specifications for conflicting privacy requirements. Requir. Eng. (2014).
- Automated text mining for requirements analysis of policy documents.
- Modeling regulatory ambiguities for requirements analysis.
- E. Kamsties, B. Paech, Taming ambiguity in natural language requirements, in: Proceedings of the Thirteenth...
- Ambiguity in privacy policies and the impact of regulation. J. Legal Stud. (2016).