Abstract

With advances in information and communication technology (ICT), web search engines (WSEs) have become a preferred source of health-related information published on the Internet. Google alone receives more than one billion health-related queries daily. However, to provide the results most relevant to the user, WSEs maintain user profiles. These profiles may contain private and sensitive information, such as the user’s health condition and disease status. Health-related queries therefore carry privacy-sensitive information, and because the identity of the user is exposed, this information may be misused by the WSE and by third parties. One well-known solution for preserving privacy is to issue queries via a peer-to-peer private information retrieval (PIR) protocol, such as useless user profile (UUP), thereby hiding the user’s identity from the WSE. This paper investigates the level of protection offered by UUP. For this purpose, we present the QuPiD (query profile distance) attack: a machine learning-based attack that evaluates the effectiveness of UUP in protecting privacy. The QuPiD attack determines the distance between the user’s profile (web search history) and an incoming query using our proposed novel feature vector. For the sake of comparison, the experiments were conducted using ten classification algorithms belonging to the tree-based, rule-based, lazy learner, metaheuristic, and Bayesian families. Furthermore, two subsets of an America Online (AOL) dataset (noisy and clean) were used for experimentation. The results show that the proposed QuPiD attack associates more than 70% of queries with the correct user at a precision of over 72% for the clean dataset, while for the noisy dataset it associates more than 40% of queries with the correct user at 70% precision.

1. Introduction

Web search engines (WSEs) have become the preferred way to find health care-related content on the World Wide Web. A recent survey reports that more than 80% of patients use a WSE to seek health-related information before consulting a physician [1], while according to a report published by the Pew Research Center, 35% of American adults have consulted a WSE to diagnose medical conditions [2]. While using web search services, however, users usually post their physical condition and health information as queries [3]. Web search engines claim that they collect and maintain user queries as user profiles for various activities such as result ranking [4], market research [3], personalization [5], targeted advertisements [6], and others. On the one hand, maintaining user profiles can improve the quality of results and the user experience; on the other hand, this indiscriminate collection of users’ queries may cause critical privacy breaches, as the queries may contain sensitive and personal information [7]. The issue received significant attention in 2005 when the US Department of Justice compelled Google to submit records of users’ queries [8]. Later, America Online (AOL) released 20 million pseudonymized queries submitted by more than 650,000 users over three months [9], from which the identities of some users were inferred through personal information enclosed in their queries [10].

A patient’s health information has been considered sensitive since ancient times, as reflected in the Hippocratic Oath [11], under which the physician keeps the patient’s information secret [12]. In online and public health services, however, user privacy is giving way to behavior tracking [12]. Consider a scenario in which a user posts a series of private queries related to his/her health condition, such as “HIV” or “diabetes.” The WSE may sell this information to advertisement agencies or other companies for business purposes, which ultimately breaches the user’s privacy [3]. Such a privacy disclosure happened in 2006, when the New York Times managed to infer personal information from the search histories in the pseudonymized log published by AOL. One of the identified users was a 62-year-old widow who had conducted hundreds of searches related to her health condition, such as “hand tremors,” “dry mouth,” and “nicotine effects on the body,” which were linked back to her [13].

To address this issue of privacy infringement, several methods have been proposed. These methods include user profile obfuscation [14], query scrambling [15], anonymizing networks [16], and private information retrieval (PIR) protocols [17–20]. In user profile obfuscation, the user profile is contaminated with fake queries to mislead the WSE. In query scrambling, the user query is replaced by a set of blurred and benign synonyms, which are then posted to the WSE. Techniques based on anonymizing networks forward the user query through a series of routers to make it difficult for the WSE to trace the origin of the query. These methods hide the IP address, but the user remains traceable through cookies and device fingerprints [21]. In PIR protocols, a group of users submit queries on behalf of each other to hide their identities.

Although the aforementioned methods improve user privacy, some previous studies [22–25] using machine learning algorithms and the user profile (i.e., user history or logged user queries) show that an adversary is able to break profile obfuscation and anonymizing network methods. However, it is not clear whether an adversary can break PIR protocols using machine learning techniques. Therefore, in this research, we propose a machine learning-based attack to evaluate the effectiveness of a popular PIR protocol, namely, useless user profile (UUP) [17, 18].

The higher-level goal of this work is to analyze the effectiveness of PIR protocols in preserving users’ privacy against an adverse WSE (from here on, we refer to the PIR protocols as UUP, for simplicity of presentation and without loss of generality). In UUP, a group of users exchange their queries with each other in such a way that the identity of the query originator remains hidden from the other group members. In the next step, all group members submit the received queries to the WSE, and the results are broadcast within the group. On the WSE side, the user’s query is received in plain text but under a different identity, and thus the WSE cannot identify the originator of the query. We set out to investigate whether, and to what extent, an adverse WSE equipped with users’ web search profiles (histories) can link the queries coming out of a UUP exit user to the original users and thus undermine the privacy provided by UUP.

To better understand the limits of UUP in protecting users’ privacy, we present in this paper a study of UUP focusing on active users. The study is conducted with the QuPiD attack, a machine learning-based attack that determines the distance between the user’s profile and a query. We conducted our experiments with 100 randomly selected active users from the publicly available AOL dataset and treated them as users of UUP. The AOL dataset is composed of over 20 million queries submitted between March 1, 2006, and May 31, 2006, by more than 650,000 users. The data of the first two months are used as training data, while the data of the last month are used as testing data. We measured the efficiency of the attack using well-known machine learning metrics: precision, recall, F-measure, and true-positive rate. The results showed that our proposed QuPiD attack associates more than 70% of queries with the correct user at more than 72% precision. Based on the results, we conclude that most users are vulnerable to privacy infringement despite using UUP. The contributions of this work are as follows:
(1) The proposed QuPiD attack: a machine learning-based attack for privacy evaluation of PIR protocols
(2) A proposed new feature vector for query classification
(3) Recommendation of a suitable machine learning algorithm for query classification

The remainder of the paper is organized as follows: In Section 2, we describe the proposed QuPiD attack. Experimentation setup, preprocessing of the dataset, feature vector construction, and classification algorithms are discussed in Section 3. Section 4 presents the experimental results. Section 5 presents the conclusions and outlines directions for future work.

2. Adverse Model and QuPiD Attack

Users are increasingly concerned about the privacy risks of querying WSEs. In this work, we investigated the robustness of a popular PIR protocol, namely, UUP. As mentioned earlier, the WSE receives a user’s query under a different identity because of the shuffling process. Therefore, queries never appear alongside their true originator in the weblog. The weakness of this protocol, however, is the timing of query submission by the group members. After the query shuffling step, every group member submits the received query to the WSE at almost the same time, so their entries appear close to each other in the weblog. Figure 1 illustrates an example of query entries in the weblog. In Figure 1, exhibit 1 shows the users’ queries before the shuffling process, while exhibit 2 shows the queries after the shuffling process. After shuffling, the queries are submitted to the WSE (Figure 1, exhibit 3).

In the proposed adverse model, the WSE is assumed to be an entity whose goal is to work against the privacy-preserving solution and to identify the queries of the user of interest (UoI) for profiling purposes. It is assumed that the WSE is equipped with the user’s search history (i.e., user profile) PU. The user profile contains the queries submitted by the user in the past without using any UUP protocol, as shown in equation (1), where Pqi denotes a query in the UoI profile:

$$P_U = \{Pq_1, Pq_2, \ldots, Pq_n\} \quad (1)$$

The user profile PU is used as training data for building the classification model. As the dataset used for experimentation spans three months, the first two months’ data are used as the training set, while the UUP protocol is simulated with the third month’s data to create an anonymized log (as shown in Figure 1, exhibit 3). The anonymized log is used as the test set. For testing, all session windows of the UoI are drawn from the query log. Here, a session window is a block of records (query entries in the log) in the anonymized log that contains an entry carrying the identity of the UoI, although the query in that entry originates from another user [26, 27]. In other words, the session window is composed of a selected number of query entries in the WSE query log that appear immediately before and after the entry of the UoI. As shown in Figure 1 (exhibit 4), our UoI is “User 3” and the session window size is 15 records (7 records before the UoI entry and 7 after it). For this research, we used a window size of 251 records. Each session window (Swin) is composed of the 125 queries appearing before and the 125 queries appearing after the query of the UoI (as per the recommendation of [27]). A generic session window Swin is shown in equation (2), where qi represents a query in the session window. The collection of all session windows GSwin is shown in equation (3).

$$S_{win} = \{q_1, q_2, \ldots, q_n\} \quad (2)$$

$$GS_{win} = \{S_{win_1}, S_{win_2}, \ldots, S_{win_m}\} \quad (3)$$
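The session window extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the log is assumed to be a list of (user ID, query) pairs in submission order, the function name `session_windows` and the toy log are hypothetical, and windows near the ends of the log are simply truncated.

```python
from typing import List, Tuple

LogEntry = Tuple[str, str]  # (user ID shown in the log, query text)


def session_windows(log: List[LogEntry], uoi: str,
                    half_width: int = 125) -> List[List[LogEntry]]:
    """Return one window per log entry that carries the UoI's identity.

    Each window holds up to `half_width` entries before and after the
    UoI entry, so a full window has 2 * half_width + 1 records.
    Windows are truncated at the start and end of the log (assumption).
    """
    windows = []
    for i, (user_id, _query) in enumerate(log):
        if user_id == uoi:
            start = max(0, i - half_width)
            end = min(len(log), i + half_width + 1)
            windows.append(log[start:end])
    return windows


# Toy log of 600 entries; the UoI ("u3") appears at positions 5 and 300.
toy_log = [(f"u{i % 50 + 100}", f"query {i}") for i in range(600)]
toy_log[5] = ("u3", "query 5")
toy_log[300] = ("u3", "query 300")

wins = session_windows(toy_log, "u3")
print([len(w) for w in wins])  # [131, 251]
```

With the default half width of 125, a window that is not cut off by the log boundary contains 251 records, matching the window size used in this work; `half_width=7` gives the 15-record window of the Figure 1 example.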

As shown in the query log, the target user who uses any PIR protocol will remain hidden since his/her query is exchanged with a query of another user in the group. Therefore, a session window is used to reduce the testing data. Both PU (training set) and GSwin (testing set) are used as input to the algorithm of the adverse model. The working of the adverse model is presented in Algorithm 1 and depicted in Figure 2. The working of the algorithm is as follows:

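Algorithm 1: Query association.
Input: User profile (PU); all session windows belonging to the user (GSwin).
Output: Expected user label (Lu)
(1) procedure QUERY ASSOCIATION (PU, GSwin)
(2)   for each query Pqi in PU do
(3)     acquire the feature vector of Pqi from uClassify
(4)   build the classification model PModel from the profile feature vectors
(5)   for each session window Swin in GSwin do
(6)     for each query qi in Swin do
(7)       acquire the feature vector of qi from uClassify
(8)       Lu ← label assigned to qi by PModel
(9)   return Lu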
(1) Firstly, the feature vector of the user profile (PU) is acquired for training purposes. The user profile with the feature vector is shown in equation (4). The feature vector is acquired from the uClassify (http://www.uclassify.com) service, a machine learning web service that provides numerous classifiers for text classification. We selected the “Topics” classifier, which scores each phrase or query in 10 major classes: Sports, Society, Science, Recreation, Home, Health, Games, Computers, Business, and Arts.
(2) In the second step, a classification model PModel is built using the profile feature vectors and supervised machine learning algorithms. To test the response of the data to different classification techniques, 10 classification algorithms are selected from the tree-based, rule-based, lazy learner, metaheuristic, and Bayesian families.
(3) Once the classification model (PModel) is built, the third step is to acquire from uClassify the feature vector, shown in equation (5), for the queries of the session window Swin, which form the testing data.
(4) In the last step, each query of the session window is provided to the classification model to obtain the expected label Lu. The label Lu shows whether the incoming query belongs to the UoI or not.
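The four steps can be illustrated with a small, self-contained Python sketch. Several things here are assumptions made for illustration only: the topic scores are invented placeholders rather than uClassify output, the helper names `to_vector` and `nearest_label` are hypothetical, and a plain nearest-neighbour rule stands in for a lazy learner such as IBK rather than reproducing the classifier configuration used in the experiments.

```python
import math
from typing import Dict, List, Tuple

TOPICS = ["Arts", "Business", "Computers", "Games", "Health",
          "Home", "Recreation", "Science", "Society", "Sports"]

Vector = List[float]


def to_vector(scores: Dict[str, float]) -> Vector:
    """Order topic scores into a fixed 10-dimensional vector (missing = 0)."""
    return [scores.get(t, 0.0) for t in TOPICS]


def nearest_label(train: List[Tuple[Vector, str]], query: Vector) -> str:
    """Return the label of the training vector closest to `query` (1-NN)."""
    best = min(train, key=lambda pair: math.dist(pair[0], query))
    return best[1]


# Steps 1-2: labelled profile vectors act as the "model" of a lazy learner.
# All scores below are invented placeholders, not uClassify output.
train = [
    (to_vector({"Health": 0.9, "Science": 0.1}), "UoI"),
    (to_vector({"Health": 0.7, "Home": 0.3}), "UoI"),
    (to_vector({"Sports": 0.8, "Recreation": 0.2}), "other"),
    (to_vector({"Computers": 0.6, "Business": 0.4}), "other"),
]

# Steps 3-4: vectorise each query of a session window and ask for its label.
window = [
    to_vector({"Health": 0.8, "Science": 0.2}),
    to_vector({"Sports": 0.7, "Games": 0.3}),
]
labels = [nearest_label(train, q) for q in window]
print(labels)  # ['UoI', 'other']
```

In this toy setting the health-dominated query lands next to the UoI’s health-dominated history, which is the intuition behind measuring the distance between a query and a profile.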

For experimentation purposes, two subsets of 100 users were created from the AOL dataset, which constitutes a three-month web query log of AOL users. Each subset was divided into two portions, i.e., training and testing data. The training data are composed of the first two months of the log, while the testing data are composed of the last month of the log. The details of the user selection criteria and dataset formation are discussed in Section 3.3.

3. Methodology

3.1. AOL Dataset

We used the real-world web search query log released by AOL in 2006 to evaluate our proposed adverse model. The AOL dataset consists of over 20 million queries submitted between March 1, 2006, and May 31, 2006, by more than 650,000 users. Although the AOL dataset is old and has many deficiencies relative to the current situation, we were forced to use it because no more suitable benchmark dataset is available. The attributes of the query log are user ID, query, date and time of the query, the rank of the content clicked, and the clicked URL. For experimentation purposes, the data of the first two months were used as the user profile (PU), or training data, while the third month’s data were the new queries to be classified (i.e., testing data). The distribution of the number of queries issued per user in the selected dataset is shown in Figure 3. For experimentation, we chose 100 users with high query frequency instead of concentrating on all users. The user selection criteria are discussed in Section 3.3, while a summary of the dataset is provided in Table 1.
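The two-month/one-month split can be expressed as a simple date filter. The sketch below is illustrative only: it assumes each record is a (user ID, query, timestamp) triple with timestamps formatted as `YYYY-MM-DD HH:MM:SS`, and the function name `split_by_month` and the sample records are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Record = Tuple[str, str, str]  # (user ID, query, timestamp string)

SPLIT = datetime(2006, 5, 1)  # first instant of the third month


def split_by_month(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Put March-April records in the training set and May records in the test set."""
    train, test = [], []
    for rec in records:
        # Timestamp format is an assumption about the log layout.
        when = datetime.strptime(rec[2], "%Y-%m-%d %H:%M:%S")
        (train if when < SPLIT else test).append(rec)
    return train, test


records = [
    ("42", "olive oil", "2006-03-01 07:17:12"),
    ("42", "hand tremors", "2006-04-30 23:59:59"),
    ("42", "dry mouth", "2006-05-01 00:00:00"),
]
train, test = split_by_month(records)
print(len(train), len(test))  # 2 1
```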

3.2. Feature Vector Extraction

The dataset is composed of five attributes: user ID, query, date and time of the query, the rank of the content clicked, and the clicked URL. Since our adverse model works with the user ID, the submitted query, and the query’s scores in ten major topics, we discard the remaining attributes. To obtain the query scores in the ten major classes, we used the uClassify service, which provides classifiers for topics, age, gender, sentiments, language detection, and many others. In this paper, we employ the topic classifier, which returns a numeric value for each of 10 categories for every query. The topic classifier uses a subset of topics from the Open Directory Project (ODP) directory, in which topics are placed in a hierarchy. The classes are Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society, and Sports. The classifier provides the percentage of each query in each category. For example, the score of the query “olive oil” in each topic is shown in Table 2.

In some cases, uClassify was unable to determine the dominant topic of the submitted query. For example, uClassify is unable to find the dominant class for the query “glenliviet 18.” In such cases, uClassify simply assigns an equal score to every class, i.e., 10% for each class. We refer to this kind of query as a “confused query” (shown in Table 2). In the dataset of the selected 100 users, uClassify marked 28% of the queries as confused queries. We therefore conducted our experiments using two datasets. One dataset comprised both confused and unconfused queries, while the other comprised only unconfused queries, in order to determine the impact of confused queries on the results of a classifier. From this point onwards, the dataset with confused queries is referred to as the noisy dataset, while the dataset with only unconfused queries is referred to as the clean dataset. The details of both datasets are given in Table 3.
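The distinction between the noisy and clean datasets can be sketched as a filter on the topic scores. In this hedged example, a query counts as confused when all ten scores are equal; the function names `is_confused` and `split_noisy_clean` are hypothetical, and the scores listed for “olive oil” are invented placeholders, not the values of Table 2.

```python
import math
from typing import List, Tuple

Row = Tuple[str, List[float]]  # (query text, ten topic scores)


def is_confused(scores: List[float], tol: float = 1e-6) -> bool:
    """True when all topic scores are equal, i.e. no dominant topic exists."""
    return all(math.isclose(s, scores[0], abs_tol=tol) for s in scores)


def split_noisy_clean(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Noisy keeps every query; clean drops the confused ones."""
    noisy = list(rows)
    clean = [row for row in rows if not is_confused(row[1])]
    return noisy, clean


rows = [
    # Placeholder scores for illustration only.
    ("olive oil", [0.02, 0.05, 0.01, 0.01, 0.20, 0.60, 0.05, 0.03, 0.02, 0.01]),
    # A confused query: 10% in each of the ten classes.
    ("glenliviet 18", [0.1] * 10),
]
noisy, clean = split_noisy_clean(rows)
print(len(noisy), len(clean))  # 2 1
```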

3.3. User Selection and Subset Construction

Instead of conducting experiments using all users, we focused on a subset of users who were considered to be active. Active users are those who submitted more than 300 queries for at least 61 days during the entire period. From the analysis of the dataset, we found only 21,407 (3.29%) users to be active. From those active users, we randomly selected 100 users as UoI. The cumulative distribution of queries in both the noisy and the clean dataset is shown in Figure 4. To observe the effect of the size of the training data, we divided the selected 100 users into five groups in both datasets, based on average query frequency. The average number of total, training, and testing instances in all groups for both datasets is given in Table 4.
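One possible reading of the active-user criterion is sketched below. The interpretation is an assumption: a user is treated as active when the total number of queries exceeds 300 and the queries fall on at least 61 distinct days. The function name `active_users` and the toy records are hypothetical.

```python
import random
from collections import defaultdict
from typing import Iterable, List, Tuple


def active_users(records: Iterable[Tuple[str, int]],
                 min_queries: int = 300, min_days: int = 61) -> List[str]:
    """Return users with more than `min_queries` queries on >= `min_days` days.

    `records` yields (user ID, day index) pairs, one per query.
    The exact criterion is an assumed interpretation of the text.
    """
    counts = defaultdict(int)
    days = defaultdict(set)
    for user_id, day in records:
        counts[user_id] += 1
        days[user_id].add(day)
    return sorted(u for u in counts
                  if counts[u] > min_queries and len(days[u]) >= min_days)


# Toy data: "a" is frequent and regular, "b" is frequent but active on
# only 10 days, "c" is regular but submits too few queries.
records = (
    [("a", i % 61) for i in range(305)]
    + [("b", i % 10) for i in range(400)]
    + [("c", i % 70) for i in range(100)]
)
active = active_users(records)

# Random selection of UoI (at most 100) with a fixed seed for repeatability.
uoi = random.Random(0).sample(active, k=min(100, len(active)))
print(active, uoi)  # ['a'] ['a']
```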

3.4. Anonymized Log Creation

As mentioned earlier, the AOL data span three months. For experimentation purposes, we considered the first two months’ data as the clean history of the UoI available to the search engine and the last month’s data as new queries to be classified. The selected PIR protocol, i.e., UUP, is simulated with the third month’s query log to create the anonymized log of the UoI. The parameters considered for the simulations are the group size and the number of queries submitted by the respective users. According to the literature, UUP has been tested with group sizes of 3, 4, 5, and 10 users [17, 18]. Another study indicated that a bigger group size offers more privacy [27]. We therefore considered a group size of 20 users. The number of queries submitted by the target user depends on the actual query frequency of the selected user in the third month’s query log.
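A single exchange round of such a simulation might look as follows. This is a simplified sketch under stated assumptions, not the UUP protocol itself: the cryptographic shuffling is replaced by a random permutation followed by a cyclic shift, which guarantees that no query is submitted under its originator’s ID, and the function name `uup_round` is hypothetical.

```python
import random
from typing import List, Tuple


def uup_round(group: List[Tuple[str, str]],
              rng: random.Random) -> List[Tuple[str, str]]:
    """Simulate one exchange: every member submits another member's query.

    `group` holds (user ID, query) pairs. Members are shuffled and each one
    submits the query of the next member in the shuffled order, so no query
    appears under its originator's ID (for groups of two or more).
    """
    order = group[:]
    rng.shuffle(order)
    n = len(order)
    return [(order[i][0], order[(i + 1) % n][1]) for i in range(n)]


rng = random.Random(7)
group = [(f"user{i}", f"query of user{i}") for i in range(20)]

# The 20 returned entries would be written to the log next to each other,
# because all members submit at almost the same time.
anonymized = uup_round(group, rng)
print(anonymized[:3])
```

Because all 20 entries of a round sit side by side in the log, the true originator of each query is always among the identities that appear in the surrounding records, which is why a bounded session window suffices for the attack.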

3.5. Classification Algorithms

In previous studies, Peddinti et al. [23, 24] and Petit [21] used Random Forest, AD Tree, Zero R, Regression, and SVM algorithms to classify queries. In those studies, the classification model was binary, i.e., it decided whether a query was machine generated or user generated. Moreover, the model was built on two attributes: the query and its assigned label. In our work, however, the classification model is multiclass, i.e., for the testing data the model decides which query belongs to which user, and the model is based on twelve attributes (discussed in Section 3.2). We selected ten off-the-shelf classification algorithms (with default settings) from different families. We chose J48 [28] and Logistic Model Tree (LMT) [29] from the tree-based family; Decision Table [30], JRip [31], and OneR [32] from the rule-based family; IBK [33] and KStar [34] from the lazy learner family; Bagging [35] and LogitBoost [36] from the metaheuristic family; and Bayes Net [37] from the Bayesian family. Rep Tree [38] and Regression are used as base classifiers for the Bagging and LogitBoost algorithms.

3.6. Performance Evaluation Metrics

Three metrics, precision, recall, and F-measure, are commonly used to evaluate the performance of a classifier. Precision represents how many of the samples identified as positive are correct, and recall describes how many of the actual positive samples are correctly identified. Both are mathematically represented in the following equations:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

where true positive (TP) is the number of actual positives correctly identified by the classifier, false positive (FP) is the number of negatives that still yield a positive outcome, and false negative (FN) is the number of positives that yield a negative outcome. The trade-off between precision and recall is represented by a unified metric called F-measure. The value of F-measure ranges from 0 to 1, where 0 shows that none of the samples is classified correctly, while 1 shows perfect classification. Mathematically, F-measure is represented as

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
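As a worked illustration of the three metrics, the following Python sketch computes them from raw counts using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as their harmonic mean). The counts are made-up numbers chosen for easy arithmetic, not results from the experiments.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of queries attributed to the UoI that really belong to the UoI."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of the UoI's queries that the classifier recovered."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 60 UoI queries found, 20 false alarms, 40 missed.
tp, fp, fn = 60, 20, 40
p = precision(tp, fp)   # 60 / 80  = 0.75
r = recall(tp, fn)      # 60 / 100 = 0.60
f = f_measure(p, r)     # 0.9 / 1.35, about 0.667
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```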

4. Results and Discussion

The primary aim of this study is to propose and evaluate a privacy quantification model for PIR protocols. Experiments are performed with two datasets: noisy and clean (Section 3.3), each set composed of 100 users having variable query frequencies distributed over five groups. For each UoI, we measured precision, recall, and true-positive percentage of correctly classified queries from an anonymized log.

Tables 5 and 6 show the true-positive percentage of the queries of UoI in both datasets. According to Table 5, all algorithms except OneR and LogitBoost correctly identified more than 89% of the queries of 2 users in the noisy dataset. OneR correctly identified 80% to 90% of the queries of 4 users. Overall, IBK correctly identified more than 50% of the queries of 36 users, followed by Bagging and KStar with 30 and 28 users, respectively, in the noisy dataset. Similarly, in the clean dataset, LMT and IBK correctly identified more than 89% of the queries of 14 users, followed by J48 and Bagging with 12 users each. Overall, IBK correctly identified more than 50% of the queries of all 100 users, followed by KStar and Bagging with 96 and 92 users, respectively, in the clean dataset. The detailed performance of all algorithms (in terms of true-positive rate) on the clean dataset is given in Table 6. In both datasets, the lazy learner algorithms (i.e., IBK and KStar) perform better than the other selected algorithms.

As mentioned earlier, both datasets are further divided into 5 groups of 20 users (Table 4) in order to observe the impact of the size of the training data on the accuracy of the results. Table 7 compares the performance of all algorithms as the training dataset size varies in the noisy dataset. The performance of each algorithm is measured in precision and recall. IBK and KStar associated more than 40% of queries with the correct user at a precision above 60% in all cases, while Bagging, J48, Decision Table, and Bayes Net associated more than 25% of queries with the correct user at a precision above 60% in all cases. It is difficult to draw a conclusion about the effect of the training dataset size on accuracy, because almost every algorithm behaves irregularly as the training dataset size varies. For the first three groups, IBK, J48, KStar, and LMT classify more accurately. Unexpectedly, however, recall drops for the last two groups. The precision and recall results for the noisy data are plotted in Figure 5.

In the clean dataset, however, a clear pattern of improvement in recall is visible. According to Table 8, the performance of all algorithms improves as the size of the training dataset increases. IBK and KStar associated more than 62% of queries with the correct user at a precision above 70% in all cases, while Bagging, J48, Decision Table, and LMT associated between 51.68% and 82.84% of queries with the correct user at a precision above 60% in all cases. Among the other algorithms, Bayes Net was able to associate more than 70% of the queries in some cases. Although the increase in recall with increasing training data is not linear, an improvement pattern is clearly visible in the clean dataset. The precision and recall results for the clean data are plotted in Figure 6.

Overall, IBK and Bagging associated 45.1% and 43% of queries with the correct user at above 70% precision for the noisy dataset, while J48, KStar, and LMT associated 42.2%, 41.7%, and 40.6% of queries with the correct user at precisions of 70.9%, 73.5%, and 70.2%, respectively. Similarly, in the clean dataset, IBK and Bagging associated 79.5% and 75.7% of queries with the correct user at 79.6% and 75.9% precision, while J48, KStar, and LMT associated 73.9%, 74.4%, and 72% of queries with the correct user at precisions of 73.9%, 76.1%, and 72.6%, respectively. The top three algorithms in terms of F-measure (the trade-off between precision and recall) for the noisy dataset are IBK, Bagging, and J48, with scores of 0.514, 0.487, and 0.477, respectively, while for the clean dataset the top three algorithms are IBK, Bagging, and KStar, with scores of 0.793, 0.753, and 0.745, respectively. Hence, IBK is determined to be the most appropriate algorithm for the “categories” feature vector. The average F-measure results for the noisy and the clean dataset are plotted in Figure 7.

5. Conclusions

Health information has been regarded as sensitive private information since ancient times. However, WSEs collect this information for selling and for targeted advertisements, which can infringe the user’s privacy. This paper presents the QuPiD attack: a machine learning-based attack that quantifies the level of protection provided by the popular PIR protocol UUP. The QuPiD attack uses a classification algorithm and the history of the user to classify an incoming query. We used two subsets (noisy and clean datasets) of real-world web data to test the proposed model. We showed that our proposed attack succeeds in correctly associating incoming queries with their real originator at a high ratio. To select the best classification algorithm, we conducted our experiments with ten classification algorithms from different families: J48 and LMT from the tree-based family; Decision Table, JRip, and OneR from the rule-based family; IBK and KStar from the lazy learner family; Bagging and LogitBoost from the metaheuristic family; and Bayes Net from the Bayesian family. The results showed that IBK is the most appropriate algorithm if the “categories” feature vector is used.

During the analysis of the noisy dataset, almost every algorithm showed irregular behavior as the training dataset size varied. In the clean dataset, however, we found that recall on the testing data improves as the size of the training data used to build the classification model increases. We therefore conclude that noise is one of the factors responsible for the unsteady behavior. Our analysis shows that PIR protocols are vulnerable to machine learning attacks, even with only the first-degree classification tags of queries. This situation is alarming for currently available PIR protocols. Any web search engine, or even a web service, armed with a profile of the user can expose a targeted user. In the future, we are interested in assessing the proposed attack from different perspectives, such as the impact of group size, the number of queries in a session, user profile size, and others. Moreover, we intend to explore the unsteady behavior of the classification algorithms.

Data Availability

The data used to support the findings of the study are available at http://www.radiounderground.net/aol-data/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Deanship of Scientific Research, King Abdulaziz University (KAU), Jeddah, Saudi Arabia.