Abstract

This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attacks, where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods that rely on manually engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage rests on the unrealistic assumption that many training samples exist per website; without them, the advantage disappears. To address this, we introduce a model-agnostic, efficient, and harmonious data augmentation (HDA) method that can improve deep WF attack methods significantly. HDA involves both intrasample and intersample data transformations that can be used in a harmonious manner to expand a tiny training dataset to an arbitrarily large collection, thereby effectively and explicitly addressing the intrinsic data scarcity problem. We conducted extensive experiments to validate HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attack scenarios, in the absence and presence of a strong defense. For instance, in the more challenging and realistic evaluation scenario with WTF-PAD-based defense, our HDA method surpasses the previous state-of-the-art results by nearly 3% in classification accuracy in the 20-shot learning case. An earlier version of this work, Chen et al. (2021), was presented as a preprint on arXiv (https://arxiv.org/abs/2101.10063).

1. Introduction

For privacy protection in accessing the Internet, an increasing number of users have turned to anonymous networks. The Onion Router (Tor) [1, 2] is one of the most popular choices [3].

As free and open-source software, Tor enables anonymous communication. It directs Internet traffic through a free, worldwide, volunteer overlay network with thousands of relays, concealing a user’s location and usage from anyone conducting network surveillance or traffic analysis. Concretely, it encrypts the content of communication and sends the data through a route comprising successive, randomly selected Tor nodes. However, this is still not completely secure because data transportation patterns are exposed before the traffic reaches the Tor servers. For instance, a local attacker could eavesdrop on the connection between a user and the guard node of the Tor network, with possible attacking positions including any device in the same LAN or wireless network, a switch, a router, or a compromised Tor guard node (see Figure 1). By just analyzing the patterns of packet traffic without observing the content inside, the attacker is likely able to infer which website a target user is visiting. This is often known as a website fingerprinting (WF) attack [4].

To implement a WF attack, the attacker first needs to create a digital fingerprint for every individual website and then learn the intrinsic pattern characteristics of these fingerprints to carry out the attack. Earlier attack methods rely on manually designed features based on expert domain knowledge [4–13]. They are not only inflexible but also susceptible to environmental changes over time. This limitation can now be addressed by more advanced deep learning techniques [14]. This is because, rather than utilizing manually designed features, deep learning methods can automatically learn feature representations directly from training data and are more scalable provided that up-to-date training data are accessible. Two recent state-of-the-art studies, deep fingerprinting (DF) [15] and Var-CNN [16], have demonstrated this potential in comparison to manual feature-based methods. However, these deep learning solutions are not perfect: they are data hungry, as their success rests on the unrealistic assumption that a sufficiently large number (e.g., hundreds) of training samples per website are available. When only a small training dataset is given, as is typical in practical use, their performance is not necessarily superior to that of traditional methods [11–13]. It is expensive, tedious, or even infeasible to collect a vast training set in reality due to highly frequent and continuous changes in Internet environments. Consequently, WF attack is fundamentally a few-shot learning problem, which nevertheless is largely unrecognized in the literature.

The few-shot nature of WF attack is also considered in the recent triplet fingerprinting method [17], under the condition that a large set of relevant auxiliary training samples is available for model pretraining. It is essentially a transfer learning setting. This significantly limits its scalability in practice, as in-the-wild changes of Internet data traffic conditions would very likely invalidate such assumptions. In contrast, we introduce a realistic, generic few-shot WF attack setting where only a handful of training samples are available for every target website, without making any domain-specific assumptions. Clearly, triplet fingerprinting is not applicable in our setting due to its need for auxiliary training data.

We summarize the contributions of this paper as follows:
(I) We introduce a novel, practical few-shot website fingerprinting attack problem, in which only a few training samples are available without rich auxiliary data. This respects the intrinsic nature of highly dynamic Internet traffic conditions and the high cost of collecting extensive training data in practice. By highlighting the importance of few-shot learning without any auxiliary data assumption for the first time, we hope that more future efforts will be dedicated to solving this practically significant WF attack challenge.
(II) To solve the proposed few-shot learning challenges, we embrace the enormous potential and advantages of deep learning for WF attack by introducing a new harmonious data augmentation (HDA) method to explicitly solve the training data scarcity problem in deep learning. Specifically, we augment the original training data by randomly rotating and masking out individual samples and by mixing (linearly combining) sample pairs in arbitrary proportions. With such intrasample and intersample data transformations, our HDA method can efficiently expand a tiny training dataset to any scale.
(III) We benchmark the performance of few-shot WF attack and demonstrate the efficacy of our data augmentation method using existing state-of-the-art deep learning models. In particular, we consider 5–20 shots per website/class in closed-world and open-world settings, with and without defense. The results show that our method can significantly improve the performance of previous state-of-the-art deep learning solutions [16, 18].

2.1. Objectives, Scenarios, and Assumptions

The objective of a WF attack is to identify which website a victim user is interacting with among a set of monitored target websites that the adversary is interested in detecting. Conceptually, it is a multiclass classification problem with each website regarded as a unique class. There are several scenarios with different assumptions. The most common scenario is the closed-world attack, which assumes that the user can only visit a small set of websites and that the adversary collects samples to train on all of them. Given that the websites in a closed-world setting are far fewer than those in the real world, this assumption is not realistic. In an open-world scenario, the victim user may visit any website, including but not limited to the monitored ones, as typically experienced in real-world applications. As a result, the adversary cannot collect data and train for every website.

The above two scenarios are focused on the range of websites involved in WF attack, independent of WF defense.

WF defense means that the user takes some actions to defend against a potential attack, which makes the attack more difficult. Representative defense techniques include BuFLO [19], Tamaraw [20], Walkie-Talkie [21], WTF-PAD [22], BiMorphing [23], DFD [24], FRONT [25], and GLUE [25]. Among them, WTF-PAD is not the newest defense method, but it is the main candidate to be deployed in Tor. We consider WTF-PAD-based defense in our evaluations.

In the literature, several common assumptions are made. We briefly discuss three main ones. Regarding user behavior, it is assumed that all Tor users browse websites sequentially, opening only a single tab at a time. Regarding background traffic, it is assumed that the attacker is able to collect all the clean traces generated by the victim’s visits against dynamic background traffic. This is increasingly feasible, as shown in [26], and the multiplexed TLS traffic can be split into individual encrypted connections to each website. Regarding network conditions, the attacker is assumed to have the same conditions as the victim, including traffic conditions and settings. To compare with the benchmark results, we follow these general assumptions for fair evaluations.

Instead, we focus on addressing the following assumption. Often, the attacker assumes that the training data follow a distribution similar to that of the deployment data. This is a particularly strong and artificial assumption, as the network condition is actually changing and evolving frequently. Such a property forces the attacker to update the training data in order to keep the attack model robust over time. This implies that it is not feasible for the attacker to collect a large set of training data each time due to the high acquisition costs. However, existing WF attack methods often ignore this factor by assuming the availability of large training data. In contrast, we study the largely ignored few-shot learning setting in the WF attack. Specifically, we approach this problem by explicitly solving the small training data issue via synthesizing new labelled training data.

2.2. Website Fingerprinting Attack Methods

The first pioneering attack against the Tor network was evaluated by Herrmann et al. [7] in 2009. It achieved an accuracy of 2.96% using around 20 training samples per website in the closed-world scenario. Later, Wang and Goldberg [10] proposed representing the traffic data using the more fundamental Tor cells (i.e., direction data) as the unit rather than TCP/IP packets. This representation is meaningful and informative as it encodes essential characteristics of Tor data. By training a kernel SVM classifier, a ground-breaking performance of 90.9% accuracy was achieved on 100 sites, each with 40 training samples. In 2016, Panchenko et al. [13] proposed sampling the features from a cumulative trace representation and achieved 91.38% accuracy with 90 training instances per website. Hayes and Danezis [12] exploited random decision forests to achieve similar results. A typical design of the above methods is a two-stage strategy comprising feature design and classifier learning. This is not only constrained by the limitations of hand-crafted features but also lacks interaction between the two stages, making the model performance inferior.

Motivated by the remarkable success of deep learning techniques in computer vision and natural language processing [27, 28], several deep learning WF attack methods have been introduced that can address the weaknesses mentioned above. This is because deep learning methods carry out feature learning and classification optimization from the raw training data end-to-end. For example, Rimmer et al. [29] applied deep learning methods (e.g., stacked denoising autoencoders, recurrent neural networks, and convolutional neural networks) to WF attacks, assuming sufficient training data. Later, Oh et al. [30] utilized an autoencoder (AE) to generate low-dimensional features to improve the performance of WF attacks. Meanwhile, using a popular neural network architecture called the VGG network [31] as the backbone, Sirinam et al. [15] proposed a deep fingerprinting (DF) attack model that attains 90% accuracy on 95 websites. However, this method needs a training set of a certain size (e.g., at least 50 training samples per website); otherwise, it suffers a significant performance drop. When using 20 training samples per website, DF reaches only around 80% accuracy.

To overcome this limitation, Bhat et al. [16] developed the Var-CNN model based on ResNet [18] and dilated causal convolution [32, 33]. When small training sets (e.g., 100 samples per website) are available, it achieves superior performance over DF, but at the cost of depending on less realistic time features and less scalable hand-crafted statistical information. Meanwhile, Rahman et al. [34] focused on how to utilize timing-related features in WF attacks.

One solution to few-shot learning is the recently proposed triplet fingerprinting (TF) method [17]. The key idea of TF is to pretrain a metric model that can measure pairwise distances on new classes. When the pretraining dataset is similar to the target data in distribution, TF can reach an accuracy of 94.5% on 100 websites using only 20 training samples per website. This is a strong transfer learning scenario. However, considering that the dynamics of network conditions are largely unknown and uncontrollable, such a transfer learning assumption is hardly valid in practice. In light of this observation, we propose a more realistic few-shot learning setting that does not assume any auxiliary data with similar characteristics for model pretraining. Hence, it is more scalable and generic for real-world deployments. Under the proposed, more challenging few-shot setting, TF is unable to work properly due to insufficient network initialization.

2.3. Data Augmentation

Data augmentation is an important element in deep learning due to its data-hungry nature [14]. Examples include random insertion, random swap, and random deletion for text classification in natural language processing [35], as well as geometric transformations (e.g., flipping, rotation, translation, cropping, and scaling), color space transformations (e.g., color casting, varying brightness, and noise injection), and interimage mixup [36] for image analysis [37–39]. These previous attempts have shown the significance of different augmentation methods for model performance on the respective tasks. Inspired by these findings, we extensively investigate the effectiveness of training data augmentation by adapting existing operations to deep learning WF attacks in few-shot learning settings. To the best of our knowledge, this is the first attempt of its kind. Crucially, we demonstrate that the existing state-of-the-art deep WF attack method [16] benefits significantly from the proposed data augmentation operations in varying evaluation scenarios. This result should be encouraging and influential for future investigation of deep learning WF attack methods in particular.

3. Method

3.1. Problem Definition

In a website fingerprinting (WF) attack, the objective is to detect which website a target user is visiting. The common observations are data traffic traces $x$ produced by one visit to a website $y$. Taking each website as a specific class, this is essentially a multiclass classification problem. For model training, a labelled training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ is often provided, where $y_i \in \{1, \ldots, K\}$ specifies one of $K$ target websites. Two different settings are often considered in model testing: (1) closed-world attack, where any test sample is assumed to belong to the target websites/classes, and (2) open-world attack, where the above assumption is dropped, i.e., a test trace may be produced by a nontarget (unmonitored) website. The latter is a more realistic setting, yet it presents a more challenging task, as identifying whether a test sample falls into the target classes or not is nontrivial.

3.1.1. Feature Representation

For the Tor network, the raw representation of a specific traffic trace consists of a sequence of temporally successive Tor cells travelling between a target user and the website visited. It is derived from TCP/IP data. Specifically, after retransmitted TCP/IP packets are discarded, TLS records are first reconstructed, and their lengths are then rounded down to the nearest multiple of 512 to form the final sequence data $x$. In value, each $x$ is a sequence of $1$ (outgoing cell) and $-1$ (incoming cell), with a variable length. This raw representation is hence known as the direction sample. Besides, temporal information about interpacket time is another modality of data used, but it is limited by its high reliance on network conditions, i.e., it is unstable and much noisier. Consequently, we mainly consider the direction data samples in this study, which are more scalable and generic.
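As a minimal sketch of how a variable-length direction sample might be prepared for a 1D CNN, the snippet below pads with zeros or truncates to a fixed length. The function name `to_direction_vector` and the default length of 5000 are assumptions made for illustration; the text above does not specify how variable lengths are handled.

```python
import numpy as np

def to_direction_vector(cells, length=5000):
    """Pad with zeros or truncate a +1/-1 cell sequence to a fixed length.

    `length` is an assumed input size, not a value taken from the paper.
    """
    x = np.zeros(length, dtype=np.float32)
    cells = np.asarray(cells, dtype=np.float32)[:length]
    x[:len(cells)] = cells
    return x

# +1 marks an outgoing cell and -1 an incoming cell.
trace = [1, -1, -1, 1, -1]
x = to_direction_vector(trace, length=8)
```

Zero is a natural padding value here because it is distinct from both cell directions.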

3.2. Deep Learning for Website Fingerprinting Attack

Most existing WF attack methods rely on hand-crafted feature representations [4–13]. This strategy is not only unscalable but also unsatisfactory in performance due to limited and incomplete domain knowledge. Deep learning methods provide a viable solution by directly learning more effective and expressive representations from training data, as shown in a few recent studies [15, 16]. In this work, we advance this new direction further.

1D convolutional neural networks (CNN) [40] are usually explored for WF attacks as the raw data are temporal sequences. Building on the success of deep learning in computer vision, we adopt the same high-level network designs of standard 2D CNN models [41], whilst translating them into 1D counterparts. This is similar to [15, 16].

As shown in Figure 2, a CNN model consists of multiple convolutional layers with nonlinear activation functions such as ReLU [42] and fully-connected (FC) layers, characterized by end-to-end feature extraction and classification. With convolutional operations, the filters of each layer transform input sequences using learnable parameters and output new feature sequences. This feature transformation is conducted layer by layer in a hierarchical fashion. The receptive field (kernel) with size 3 is often used in each layer to capture local feature patterns. By stacking more layers and pooling operations, the model can perceive the information of larger regions and achieve translational invariance. Another effective method for enlarging the receptive field is dilated causal convolutions [32, 33], which has been exploited in [16].

The feature representation $\mathbf{z}$ of a WF sample is the output of the global average pooling layer on top of the last convolutional layer. To obtain the classification probability vector $\mathbf{p}$ over the $K$ target classes, $\mathbf{z}$ is fed into an FC layer and normalized by a softmax function.

For model training, we compute a cross-entropy objective loss function with the classification vector $\mathbf{p}$ against the ground-truth class label over all $N$ training samples as

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\delta_{k,y_i}\log p_{i,k}, \tag{1}$$

where $y_i$ refers to the ground-truth class label of a training sample $x_i$, $p_{i,k}$ is the predicted probability of class $k$ for $x_i$, and $\delta_{k,y_i}$ is a Dirac function that equals 1 if $k = y_i$ and 0 otherwise. The objective is to maximize the probability of the ground-truth class in prediction. This loss function is differentiable, with its gradients backpropagated to update all the learnable model parameters.
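The softmax normalization and cross-entropy loss described above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' Keras code; the function names and the small `eps` constant for numerical stability are assumptions. Labels are passed as one-hot rows, so the same function also accepts the soft labels produced by sample mixing.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy; each row of `targets` is one-hot (or soft) and sums to 1."""
    return float(-(targets * np.log(probs + eps)).sum(axis=1).mean())

# Two samples, three classes; the ground-truth classes are 0 and 2.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.eye(3)[[0, 2]]
loss = cross_entropy(softmax(logits), labels)
```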

Once the deep model is trained, we forward a given test sample, obtain a classification probability vector, and take the most likely class as the prediction in both closed-world and open-world settings. For the open-world setting, all unmonitored websites are considered to belong to a background class.

3.2.1. Discussion

While deep learning techniques have advanced significantly in the last several years, they still assume that a large set of labelled training samples is available. This is not always true, for example, in WF attack problems. In real-world applications, an attacker is usually faced with highly dynamic network environments, meaning that the distribution of raw features evolves continuously. As such, the training data need to be updated frequently, which prevents the collection of large labelled training sets in practice due to prohibitively high labelling costs. Consequently, only a small training set is accessible in reality, making deep learning methods ineffective.

3.3. Harmonious Website Fingerprinting Data Augmentation

To address the above small training data challenge, we propose an intuitive, novel harmonious data augmentation (HDA) method. We introduce both intrasample and intersample augmentation operations that can be applied in a joint and harmonious manner for more effective data expansion.

3.3.1. Intrasample Augmentation

The key idea of intrasample augmentation is that given an individual training sample, we introduce a certain degree of random data perturbation and/or variation whilst keeping the same class labels. Doing so allows us to generate an infinite number of labelled training samples due to the nature of randomness. We consider two perturbation operations: random rotation and random masking.

Random rotation-based data augmentation means rotating an original training sample forward or backward by a random number of steps to generate virtual samples (Figure 3(a)):

$$\tilde{x} = \mathrm{rotate}(x, r, d),$$

where $r$ and $d$ specify the number of steps and the direction (forward or backward) to rotate an input sample $x$. The underlying hypothesis is that class-sensitive information encoded in a sample is distributed across different subsequences, and that data traffic order is less important than signal patterns. After a sample is rotated, the original class information is largely preserved, i.e., semantically invariant. Hence, the rotated variants can be annotated with the same class. However, this hypothesis is likely to hold only up to a certain (unknown) degree of rotation. We therefore introduce an upper bound parameter $R$ so that the rotation range is limited to at most $R$ steps in both directions, $r \in [0, R]$.
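A minimal sketch of random rotation, assuming the rotation is circular (elements shifted past one end re-enter at the other). The function name `random_rotate` and the argument `max_steps` (the upper bound on the number of steps) are illustrative choices, not names from the paper.

```python
import numpy as np

def random_rotate(x, max_steps, rng):
    """Circularly rotate x by a random number of steps in a random direction."""
    steps = int(rng.integers(0, max_steps + 1))   # 0..max_steps inclusive
    direction = int(rng.choice([-1, 1]))          # +1 forward, -1 backward
    return np.roll(x, direction * steps)

rng = np.random.default_rng(0)
x = np.array([1, 1, -1, -1, 1, -1, -1, -1])
x_rot = random_rotate(x, max_steps=3, rng=rng)    # the class label is kept unchanged
```

Because the rotation only reorders cells, the counts of outgoing and incoming cells are preserved.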

In contrast, random masking introduces localized corruption to an original training sample by setting a random subsequence to zero (Figure 3(b)). This data augmentation is written as

$$\tilde{x} = \mathrm{mask}(x, l, s),$$

where $l$ and $s$ denote the length and location of the subsequence that is masked out from an original sample $x$. Rather than masking a contiguous subsequence, another strategy is to randomly select individual positions to mask. We consider that this may introduce more significant corruption to the underlying semantic information.

Conceptually, random masking simulates varying traffic measurement errors in data transportation. Meanwhile, under the same hypothesis as above, such masking would not dramatically change the semantic class information provided that the masking is subject to some limit, e.g., on the length $l$ of the subsequences masked out. It hence offers a complementary data perturbation choice with respect to random rotation.
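Random masking of a contiguous subsequence could be sketched as below. The name `random_mask` and the choice to draw the start position uniformly over all valid positions are assumptions made for illustration.

```python
import numpy as np

def random_mask(x, mask_len, rng):
    """Set a randomly located subsequence of length mask_len to zero."""
    mask_len = min(mask_len, len(x))
    start = int(rng.integers(0, len(x) - mask_len + 1))  # valid start positions
    out = x.copy()                                       # leave the original intact
    out[start:start + mask_len] = 0
    return out

rng = np.random.default_rng(0)
x = np.array([1, -1, 1, 1, -1, -1, 1, -1])
x_mask = random_mask(x, mask_len=3, rng=rng)             # the class label is kept unchanged
```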

3.3.2. Intersample Augmentation

Apart from data augmentation on individual samples, we further introduce data perturbation across two different samples to enrich the limited training set.

We propose random mixing that generates virtual samples and class labels by linear interpolation between two original samples $x_i$ and $x_j$ as

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{\mathbf{y}} = \lambda \mathbf{y}_i + (1 - \lambda) \mathbf{y}_j,$$

where $\mathbf{y}_i$ and $\mathbf{y}_j$ are the one-hot class labels of $x_i$ and $x_j$. The mixing parameter $\lambda \in [0, 1]$ follows a Beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, with the parameter $\alpha > 0$ controlling the strength of interpolation. This is in a similar spirit to mixup in the image understanding domain [36]. Unlike the intrasample augmentation above, random mixing changes the semantic class information, since the original samples may be drawn from different classes. It simplifies the data distribution by imposing a linear relationship between classes for complexity minimization. As shown in Figure 3(c), only the common features remain in the mixed sample. If two original samples are generated from visiting the same website, the mixed sample reflects the shared characteristics with respect to this website. Otherwise, it reflects the commonality of two different websites.

While seemingly counterintuitive, we will show that such a method brings positive contributions on top of random masking and random rotation.
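The mixing step can be sketched as follows, in the spirit of mixup. The function name `random_mix`, the example traces, and the value `alpha=0.2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def random_mix(x_i, y_i, x_j, y_j, alpha, rng):
    """Linearly interpolate two samples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam

rng = np.random.default_rng(0)
x_i, y_i = np.array([1.0, 1.0, -1.0, -1.0]), np.array([1.0, 0.0, 0.0])
x_j, y_j = np.array([1.0, -1.0, -1.0, 1.0]), np.array([0.0, 0.0, 1.0])
x_mix, y_mix, lam = random_mix(x_i, y_i, x_j, y_j, alpha=0.2, rng=rng)
```

In this toy pair, positions where both traces agree keep their value of 1 or -1, while positions where they disagree shrink toward zero, which is one way to read the remark that only the common features remain.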

3.3.3. Combination and Compatibility

Different augmentation operations can be applied to the same samples in harmony, without conflicting with each other. There is also no particular constraint on the order in which the three data augmentation operations are applied in combination. Given a fixed set of parameters as discussed above, different augmentation orders will result in different virtual samples. This makes little conceptual difference, as the space of virtual samples is infinite in any case.
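One way to compose the three operations on a mini-batch in a configurable order is sketched below. All names (`augment_batch`, `OPS`, the `params` keys) and the parameter values are hypothetical. As a simplification, one rotation offset, one mask position, and one mixing coefficient are drawn per batch rather than per sample.

```python
import numpy as np

def rotate(x, y, p, rng):
    k = int(rng.integers(-p["max_steps"], p["max_steps"] + 1))  # signed shift
    return np.roll(x, k, axis=1), y

def mask(x, y, p, rng):
    out = x.copy()
    s = int(rng.integers(0, x.shape[1] - p["mask_len"] + 1))
    out[:, s:s + p["mask_len"]] = 0
    return out, y

def mix(x, y, p, rng):
    lam = rng.beta(p["alpha"], p["alpha"])
    perm = rng.permutation(len(x))          # random partner for every sample
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

OPS = {"rotate": rotate, "mask": mask, "mix": mix}

def augment_batch(x, y, order, params, rng):
    """Apply the named operations in the given order to a batch (x, y)."""
    for name in order:
        x, y = OPS[name](x, y, params, rng)
    return x, y

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=(4, 16))   # 4 toy traces of length 16
y = np.eye(3)[[0, 1, 2, 0]]                 # one-hot labels
params = {"max_steps": 3, "mask_len": 4, "alpha": 0.2}  # illustrative values only
x_aug, y_aug = augment_batch(x, y, ("rotate", "mask", "mix"), params, rng)
```

Changing `order` changes the virtual samples that are produced but not the shapes of the outputs.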

3.3.4. Augmentation Optimization

In our harmonious data augmentation (HDA), three hyperparameters are introduced. To generate meaningful virtual samples, obtaining their optimal values is necessary; otherwise, adverse effects may even be introduced.

Instead of manual tuning, we adopt an automatic Bayesian estimator called the Tree of Parzen Estimators (TPE) [43]. The conventional TPE can take only a single parameter at a time, so each of the three hyperparameters would need to be optimized independently. This differs from our data augmentation process, where the three augmentation operations are typically applied together and thus become interdependent, making the independently tuned parameters of TPE suboptimal.

For solving this problem, we propose a sequential optimization process that takes into account the interdependence property of different augmentation operations gradually (see Algorithm 1). Specifically, we start with a random, fixed order of applying our random rotation, masking, and mixing operations. Then, we optimize from the first one with TPE, move to the next one with all the previous ones optimized and fixed, and stop by finishing the last one. Each time, we still optimize a single hyperparameter whilst keeping all the previous optimized ones fixed. In this way, we expand the interdependence among different operations sequentially.

Input: A training set $\mathcal{D}_{train}$ and a validation set $\mathcal{D}_{val}$.
Output: Data augmentation with optimal parameters $\Theta$.
1: Set $\Theta = \emptyset$ (empty set);
2: Sequence the data augmentation operations randomly;
3: while enumerating augmentation operations do
4:  Get the search space $S$ of the current augmentation;
5:  Use TPE on $S$ to obtain the optimal parameter $\theta$, with the model trained by $\mathcal{D}_{train}$ and $\Theta$;
6:  $\Theta = \Theta \cup \{\theta\}$;
7: end while
8: return $\Theta$
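The sequential procedure of Algorithm 1 can be sketched in Python as follows. To keep the example self-contained, an exhaustive scan over a small grid stands in for the TPE search, which is a deliberate substitution and not the method used in the paper. The `evaluate` callback (assumed to train a model with the given augmentations active and return validation accuracy), the toy scoring function, and the search grids are all hypothetical.

```python
import random

def sequential_search(search_spaces, evaluate, seed=0):
    """Tune one hyperparameter at a time, keeping earlier choices fixed."""
    order = list(search_spaces)
    random.Random(seed).shuffle(order)          # random but fixed operation order
    chosen = {}
    for name in order:
        best_value, best_score = None, float("-inf")
        for value in search_spaces[name]:       # grid scan as a stand-in for TPE
            score = evaluate({**chosen, name: value})
            if score > best_score:
                best_value, best_score = value, score
        chosen[name] = best_value               # fix it before moving on
    return chosen, order

# Toy validation score that peaks at rotate=10, mask=40, mix=0.3.
TARGET = {"rotate": 10, "mask": 40, "mix": 0.3}
def toy_evaluate(active):
    return -sum((active[k] - TARGET[k]) ** 2 for k in active)

spaces = {"rotate": [0, 5, 10, 15], "mask": [0, 20, 40, 60], "mix": [0.1, 0.2, 0.3, 0.4]}
best, order = sequential_search(spaces, toy_evaluate)
```

Each later search is evaluated with the earlier operations already active, which is how the interdependence between operations enters the tuning.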
3.4. Theoretical Foundation and Formulation

The objective of learning a WF attack model is equivalent to deriving a function $f$ that fits the latent translation relationship between raw feature vectors $x$ and corresponding website class labels $y$, that is, fitting a joint distribution $P(x, y)$. To this end, in deep learning, we often leverage a loss function $\ell$ defined to penalize the differences between predictions $f(x)$ and targets $y$. We minimize the average loss over the joint distribution:

$$R(f) = \int \ell(f(x), y)\, \mathrm{d}P(x, y),$$

which is known as expected risk minimization [44].

However, the joint distribution $P(x, y)$ is often unknown, particularly for WF attacks with small training data. Given a limited training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, the joint distribution can only be approximated by an empirical distribution as

$$P_\delta(x, y) = \frac{1}{N}\sum_{i=1}^{N}\delta(x = x_i, y = y_i),$$

where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at a sample $(x_i, y_i)$. Accordingly, the expected risk can now be approximated by an empirical risk:

$$R_\delta(f) = \int \ell(f(x), y)\, \mathrm{d}P_\delta(x, y) = \frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i), y_i).$$

The above approximation follows the empirical risk minimization (ERM) principle [44]. The cross-entropy loss (1) is a representative example, which essentially minimizes the empirical risk for the classification task.

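While ERM is a common strategy, it suffers from a high risk of poor generalization due to the tendency of memorization, particularly when a large model is used [45]. To mitigate this issue, we adopt the notion of vicinal distribution [46], which can better approximate the true joint distribution. In particular, the vicinal distribution in the data space is defined over the $N$ training samples as

$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{N}\sum_{i=1}^{N}\nu(\tilde{x}, \tilde{y} \mid x_i, y_i).$$

Intuitively, $\nu$ measures the probability of finding a virtual labelled sample $(\tilde{x}, \tilde{y})$ in the vicinity around an original training sample $(x_i, y_i)$.

Given such vicinal distributions, we first construct a virtual dataset $\mathcal{D}_\nu = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{m}$ by sampling randomly and then minimize an empirical vicinal risk to learn the model $f$ with the loss function $\ell$ as

$$R_\nu(f) = \frac{1}{m}\sum_{i=1}^{m}\ell(f(\tilde{x}_i), \tilde{y}_i).$$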

Clearly, at the core of this strategy is performing data augmentation around original training samples. Rather than computing a loss value for every single training sample, it derives a local distribution centered at each individual sample and generates more virtual training samples to reduce the negative memorization effect of deep learning. This is the key rationale of our data augmentation method.

3.4.1. Augmentation Formulation

We formulate the proposed harmonious data augmentation operations in the vicinal distribution manner. For intrasample augmentation (including random rotation and masking), the vicinal distribution is defined as

$$\nu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \delta\big(\tilde{x} = T(x_i),\ \tilde{y} = y_i\big),$$

where $T$ is a transformation operator.

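For random rotation, given any length-$n$ sample $x$, we first define a circle matrix for forward rotation as

$$\mathrm{circ}(\mathbf{v}) = \begin{bmatrix} v_0 & v_{n-1} & \cdots & v_1 \\ v_1 & v_0 & \cdots & v_2 \\ \vdots & \vdots & \ddots & \vdots \\ v_{n-1} & v_{n-2} & \cdots & v_0 \end{bmatrix},$$

where $\mathbf{v} = (v_0, \ldots, v_{n-1})$ is a length-$n$ vector.

Then, we sample the step size $r$ uniformly from a range of $[0, R]$, where $R$ is the upper bound on the number of rotation steps. By one-hot representation $\mathbf{e}_r$ of $r$ (1 at index $r$ and 0 elsewhere), we can obtain a rotation transformation as

$$T_{\mathrm{rot}}(x) = \mathrm{circ}(\mathbf{e}_r)\, x.$$

For the backward case, we perform the same process as above but with a backward rotation matrix (the transpose of the forward one) instead.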
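As a small worked check of the matrix view of rotation, the sketch below builds a circulant matrix from the one-hot vector of the step size and confirms that multiplying by it is the same as a circular shift. The helper names `circulant` and `rotation_matrix` are illustrative, and using the transpose for the backward direction is an assumption that follows from this particular construction.

```python
import numpy as np

def circulant(v):
    """Circulant matrix whose first column is v: entry (i, j) is v[(i - j) % n]."""
    n = len(v)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return np.asarray(v)[idx]

def rotation_matrix(n, steps, forward=True):
    """Rotation by `steps` positions, built from the one-hot vector of `steps`."""
    one_hot = np.zeros(n, dtype=int)
    one_hot[steps % n] = 1
    C = circulant(one_hot)
    return C if forward else C.T   # the transpose rotates in the opposite direction

x = np.array([1, 1, -1, -1, 1, -1])
x_fwd = rotation_matrix(len(x), 2) @ x
x_bwd = rotation_matrix(len(x), 2, forward=False) @ x
```

In practice a direct circular shift such as `np.roll` avoids forming the $n \times n$ matrix; the matrix form is mainly useful for the formulation.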

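For random masking, we similarly sample the start position $s$ uniformly in the range of $[0, n - l]$, where $n$ is the sample length and $l$ is the length of the masked subsequence. The masking transformation can be represented by a matrix as

$$\mathbf{T}_{\mathrm{mask}} = \mathrm{diag}\Big(\mathbf{1} - \sum_{i=s}^{s+l-1} [\mathbf{I}]_i\Big),$$

where $\mathbf{I}$ is the identity matrix, $\mathbf{1}$ is the all-one vector, $[\cdot]_i$ selects the $i$th row of a matrix (indexed from 0), and $\mathrm{diag}(\cdot)$ transforms a vector to a diagonal matrix. The masking operation is finally conducted by matrix multiplication as

$$\tilde{x} = \mathbf{T}_{\mathrm{mask}}\, x.$$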
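The masking matrix can be illustrated with a short sketch that follows the description above: rows of the identity matrix are summed, subtracted from the all-one vector, and placed on a diagonal. The function name `masking_matrix` and the zero-based indexing of the start position are assumptions.

```python
import numpy as np

def masking_matrix(n, start, mask_len):
    """Diagonal matrix that zeroes positions start .. start + mask_len - 1."""
    rows = np.eye(n)[start:start + mask_len]      # rows of I for the masked positions
    return np.diag(np.ones(n) - rows.sum(axis=0))

x = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])
M = masking_matrix(len(x), start=2, mask_len=3)
x_mask = M @ x                                    # positions 2, 3, and 4 become zero
```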

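For intersample augmentation, random mixing in our case, the vicinal distribution is defined as

$$\nu(\tilde{x}, \tilde{\mathbf{y}} \mid x_i, \mathbf{y}_i) = \frac{1}{N}\sum_{j=1}^{N}\mathbb{E}_{\lambda}\Big[\delta\big(\tilde{x} = \lambda x_i + (1 - \lambda) x_j,\ \tilde{\mathbf{y}} = \lambda \mathbf{y}_i + (1 - \lambda) \mathbf{y}_j\big)\Big],$$

where $N$ is the number of training samples, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ is a random variable drawn from a Beta distribution, and $\mathbf{y}$ is a one-hot class label vector. This local vicinity is assumed to respect a linear structure with respect to class labels.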

4. Experiments

4.1. Experimental Setup
4.1.1. Datasets

We evaluated our data augmentation method HDA on the following standard WF attack datasets.
(1) The dataset of [29] provides a total of 100 monitored target websites, each with 2,500 raw feature traces.
(2) The dataset of [26] gives 100 monitored websites, each contributing 90 feature traces.
(3) The dataset of [15] gives 95 monitored websites, each contributing 1,000 feature traces.
(4) ROWUM [29]: this dataset includes a large set of samples, each generated by a visit to a page of the top 400,000 Alexa websites.
(5) The defended dataset of [15]: unlike all the above datasets, this is a more challenging dataset due to the presence of WTF-PAD-based defense against WF attack. It has 95,000 raw feature samples from 95 websites.
We considered both closed-world and open-world WF attack scenarios using the above datasets.

4.1.2. Network Architectures

We used two different network architectures for testing the generic benefits of the proposed HDA method. (1) Var-CNN [16] is the current state-of-the-art deep learning WF method. (2) ResNet-34 [18] is a strong and popular network widely deployed in many different fields such as computer vision.

4.1.3. Implementation Details

We conducted our experiments in Keras [47]. In our experiments, we used the standard training, validation, and test splits for all competitors for fair comparisons. HDA was applied only to the training set. We optimized HDA’s hyperparameters using Var-CNN [16] as the deep learning model on in closed-world setting and applied the same parameter setting for all the other deep learning methods, datasets, and settings. This allows testing the generality and scalability of our HDA method. For augmentation optimization, we set the search space as with step 5 for forward/backward (random rotation) with step 20 for (random masking) and with step 0.1 for (random mixing). We selected the best value for each of these parameters with respect to the validation performance. The optimal parameter values we obtained are , , and . We applied the same parameter setting tuned on to all other datasets for both simplicity and generalization test.

To save storage, we performed online data augmentation within each mini-batch without any data preprocessing. In each experiment, we trained every deep learning model for 150 epochs and used the checkpoint with the best performance on the validation set for testing. We only used the direction feature data, without time sequences and hand-crafted features. We ran each experiment 10 times and reported the mean and standard deviation of the results as the final performance.

4.1.4. Why We Do Not Apply HDA to DF

First, we found that DF is unstable when trained with HDA. In some experiments, DF + HDA obtains better results than the original DF, but not consistently. Second, the feature extractor of TF is taken from DF. Hence, we report only the best result of TF, following its recommended setting, as the baseline.

4.2. Closed-World WF Attack
4.2.1. Setting

We conducted the closed-world attack on , , and . We separated each dataset into training and test (70 samples per class) splits. We considered few-shot settings with training samples per class. The validation set was used to select the best performing model for test. We used classification accuracy as the performance metric. Besides deep network models, we also compared our method with two conventional hand-crafted feature-based methods: CUMUL [13] and k-FP [12].
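The per-class splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `few_shot_split` is hypothetical, and because the size of the validation split is not stated for this setting, `n_val` is a placeholder parameter.

```python
import random
from collections import defaultdict

def few_shot_split(samples, n_shot, n_test=70, n_val=0, seed=0):
    """Split (trace, label) pairs into disjoint per-class train/val/test sets.

    n_test follows the 70 test samples per class stated in the text;
    n_val is a placeholder because the validation size is not given here.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for trace, label in samples:
        by_class[label].append(trace)

    train, val, test = [], [], []
    for label, traces in sorted(by_class.items()):
        needed = n_test + n_val + n_shot
        if len(traces) < needed:
            raise ValueError(f"class {label} has fewer than {needed} samples")
        rng.shuffle(traces)
        test += [(t, label) for t in traces[:n_test]]
        val += [(t, label) for t in traces[n_test:n_test + n_val]]
        train += [(t, label) for t in traces[n_test + n_val:needed]]
    return train, val, test

# Toy data: 3 classes with 100 placeholder "traces" (integer ids) each.
toy = [(cls * 1000 + i, cls) for cls in range(3) for i in range(100)]
train, val, test = few_shot_split(toy, n_shot=20, n_val=10)
```

Drawing the test samples first keeps the test set identical across different shot counts for a given seed, so accuracies at different values of `n_shot` remain comparable.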

4.2.2. Results

The results of different methods are compared in Tables 1–3. We have the following observations: (1) TF remains the best few-shot WF attack algorithm, especially when pretrained with similar datasets (pretrained and tested with the AWF dataset, and tested with the Wang dataset). (2) However, deep learning methods (Var-CNN) become clearly stronger when pretrained TF is faced with different distributions across training and testing datasets (pretraining on AWF and testing on and ), suggesting great potential. In the 10/15/20-shot cases, Var-CNN + HDA achieves the best overall result on both and . In particular, on , the benefit from HDA is significant, and Var-CNN + HDA surpasses TF by a large margin of 13.2% in the 20-shot case. (3) With our HDA method for training data augmentation, every deep learning method improves in all few-shot cases. For example, the 20-shot accuracy of Var-CNN is increased from 78.7% to 90.7% on , from 88.4% to 90.6% on Wang, and from 68.1% to 91.3% on . Similarly, the 20-shot accuracy of ResNet-34 is improved from 51.3% to 86.4% on , from 85.9% to 87.4% on Wang, and from 61.4% to 85.8% on . (4) Our HDA consistently improves different methods on varying datasets, suggesting good generality. (5) The performance deviation of Var-CNN assisted by HDA is the smallest among all the competitors, implying strong stability.

4.3. Open-World WF Attack
4.3.1. Setting

We conducted the open-world attack experiments on the combination of and . We treated the websites of as target (monitored) classes and those of as nontarget (unmonitored) classes. In this test, we randomly selected 8,020 out of 400,000 unmonitored websites and separated them into three disjoint sets of size 20/1,000/7,000 for training, validation, and test, respectively. In this scenario, precision and recall were used to evaluate model performance because nontarget classes must also be detected [48]. We considered the same two deep learning methods (ResNet-34 and Var-CNN [16]) for comparison.

4.3.2. Results

The results of different methods are reported in Table 4. We considered two settings, one tuned for best precision and one tuned for best recall. Overall, we observed trends similar to those above: our HDA is highly effective in improving both deep learning methods. Note that, unlike in the closed-world scenario, Var-CNN + HDA achieves top or near-top results in most cases under both tuning settings, even where it is not the single best method. Similarly, Var-CNN + HDA remains more stable and less sensitive to training sample size. Importantly, our HDA method further enhances these strengths through efficient data augmentation, leading to more robust WF attack solutions.
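To make the two open-world metrics and the two tuning settings concrete, the following sketch computes precision and recall from classifier confidences. It is an illustration under assumptions rather than the evaluation code used here: it assumes a trace is flagged as monitored when its highest class probability reaches a threshold, and it counts true positives at the binary monitored/unmonitored level. The function name and the toy numbers are hypothetical.

```python
def open_world_precision_recall(max_probs, is_monitored, threshold):
    """Binary open-world metrics from per-trace top-class confidences.

    A trace is flagged as monitored when its confidence reaches the threshold.
    """
    tp = fp = fn = 0
    for prob, monitored in zip(max_probs, is_monitored):
        flagged = prob >= threshold
        if flagged and monitored:
            tp += 1      # monitored trace correctly flagged
        elif flagged and not monitored:
            fp += 1      # unmonitored trace wrongly flagged
        elif monitored:
            fn += 1      # monitored trace missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy confidences: three monitored traces followed by three unmonitored ones.
probs = [0.95, 0.80, 0.60, 0.90, 0.55, 0.30]
truth = [True, True, True, False, False, False]

recall_tuned = open_world_precision_recall(probs, truth, threshold=0.50)
precision_tuned = open_world_precision_recall(probs, truth, threshold=0.92)
```

A low threshold flags more traces and favors recall, whereas a high threshold flags only confident predictions and favors precision, which is why results are typically reported under both tunings.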

4.4. WF Attack against Defense
4.4.1. Setting

In contrast to the two experiments above, we further tested a more challenging WF attack scenario with defense involved. Defense changes the data traffic patterns to be more similar to one another, thereby making the attack more difficult. We considered the most popular defense, WTF-PAD, widely deployed in the Tor network. We used the dataset in this experiment. We used 100 random samples per website and divided them into three sets for training (20 samples), validation (10 samples), and test (70 samples), respectively. We reported the classification accuracy as the performance metric in the closed-world scenario. We augmented the two deep learning methods above (ResNet-34 and Var-CNN [16]) with HDA and compared them with the pretrained few-shot method (TF [17]) and hand-crafted feature-based methods (k-NN [11], k-FP [12], and CUMUL [13]).

4.4.2. Results

We reported the results of closed-world WF attack under WTF-PAD-based defense in Table 5. We made the following observations. (1) Some hand-crafted feature-based methods (CUMUL) are superior to recent deep learning methods (ResNet-34 and Var-CNN) in the few-shot learning scenarios. This is mainly because the latter lack sufficient training samples, resulting in model overfitting. (2) Using our HDA for training data augmentation, we can directly address the data scarcity problem and significantly boost the performance of previous deep learning methods. As a result, Var-CNN + HDA outperforms the other competitors by a moderate margin, e.g., a 2.9% gap over the best competitor CUMUL. (3) Var-CNN consistently surpasses ResNet-34. By benefiting more from our data augmentation, Var-CNN achieves the best results across all shot cases. This implies that Var-CNN requires more training data but has higher performance potential than ResNet-34. (4) If TF is not pretrained with a similar dataset, it loses its advantage when a few more samples (20-shot) are provided.

4.5. Ablation Studies

We carried out a set of component analysis experiments to examine the exact effect of different designs of our method (HDA). We adopted the most common closed-world attack scenario without defense on the dataset, following the same setting as Section 4.2. It is noteworthy that this dataset is different from the dataset in Section 4.2 because they are different subsets. In this section, we evaluated the 15-shot learning case in particular, using Var-CNN [16] as the deep learning model backbone.

4.5.1. Individual Augmentation Operations

Recall that our data augmentation method (HDA) consists of three different operations (random rotation, masking, and mixing). We have demonstrated their performance advantages as a whole in the varying test settings above. For more in-depth insights, it is informative to examine their individual contributions as well as their different combinations. We conducted these experiments with an exhaustive set of operation combinations and reported the results in Table 6.

It is observed that (1) each of the three operations makes a significant difference in performance, with rotation and masking being the best individual operations, improving the classification accuracy by 17.4%. (2) When any two augmentation operations are used jointly, the performance can be further increased. The combination of masking and mixing gives the highest accuracy among them. (3) Combining all three operations (HDA) achieves the best result with a smaller deviation. This suggests that the different operations are complementary and compatible with each other.

4.5.2. Augmentation Optimization

For optimal data augmentation, we propose a sequential optimization strategy (see Algorithm 1) for capturing the interdependence between different augmentation operations applied. To evaluate its effect, we compared with a baseline algorithm that independently optimizes each augmentation parameter.

As shown in Table 7, the proposed optimization algorithm (see Algorithm 1) is clearly superior, validating our consideration that there exists interdependence between different augmentation operations when applied jointly on the same samples. Note that we obtained this performance gain at the same cost as the baseline counterpart. Besides, it is worth noting that even with the simpler optimization, our data augmentation method (HDA) can still greatly improve the previous deep learning model Var-CNN and achieve new state-of-the-art results (Table 7 vs. Table 1). This further validates that the proposed augmentation operations are highly compatible with one another and can be applied together well.
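The difference between the two optimization strategies can be sketched as follows. Algorithm 1 is not reproduced in this section, so the sketch assumes that the sequential strategy tunes one parameter at a time while keeping the already tuned parameters fixed at their best values, and that the baseline tunes each parameter with all others left at their defaults. The search grids and the toy objective `toy_validation_score` are hypothetical; the objective has an interaction between rotation and masking built in, so the example illustrates only the mechanism and says nothing about the reported results.

```python
def independent_search(evaluate, grids, defaults):
    """Baseline: tune each parameter alone, with all others at their defaults."""
    best = {}
    for name, grid in grids.items():
        best[name] = max(grid, key=lambda v: evaluate({**defaults, name: v}))
    return best

def sequential_search(evaluate, grids, defaults):
    """Tune parameters one after another, keeping earlier choices fixed."""
    current = dict(defaults)
    for name, grid in grids.items():
        current[name] = max(grid, key=lambda v: evaluate({**current, name: v}))
    return current

def toy_validation_score(params):
    """Hypothetical stand-in for validation accuracy with a built-in interaction."""
    r, m, x = params["rotation"], params["masking"], params["mixing"]
    return 1 - 0.001 * (r - 10) ** 2 - 0.0001 * (m - 4 * r) ** 2 - (x - 0.2) ** 2

# Hypothetical search grids; the actual search spaces are not reproduced here.
grids = {
    "rotation": [0, 5, 10, 15, 20],
    "masking": [0, 20, 40, 60, 80],
    "mixing": [0.0, 0.1, 0.2, 0.3],
}
defaults = {"rotation": 0, "masking": 0, "mixing": 0.0}

independent = independent_search(toy_validation_score, grids, defaults)
sequential = sequential_search(toy_validation_score, grids, defaults)
```

Both procedures evaluate each grid value exactly once (14 evaluations in this toy setup), which matches the observation above that the sequential strategy comes at the same cost as the baseline.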

5. Conclusion

We presented a model-agnostic, simple yet surprisingly effective data augmentation method, called HDA, for the few-shot website fingerprinting attack. This is an understudied and practically critical problem, as in practice only a handful of training samples per website can be feasibly collected due to the inherently high dynamics of Internet networks and the high cost of label collection. Importantly, we focus on deep learning-based methods, a line of new research efforts with vast potential for future investigation. In particular, our HDA method offers three different data augmentation operations, namely random rotation, masking, and mixing, in intrasample and intersample fashion. They can be applied to the same training samples harmoniously, with high complementarity and compatibility. Moreover, we introduce a sequential augmentation parameter optimization method that captures the interdependence between different operations when applied jointly. With recent state-of-the-art deep learning WF attack models, we conducted extensive experiments on four benchmark datasets to validate the efficacy of our HDA method in both closed-world and open-world scenarios, with and without defense. The results show that the proposed data augmentation method makes dramatic differences in performance and enables previous deep learning methods to outperform hand-crafted feature-based counterparts in the few-shot learning setting for the first time, often by a large margin. Moreover, when the pretraining-based few-shot WF attack (TF) is placed in a new environment, it cannot outperform our augmented methods. This is achieved without making any artificial assumptions about relevant, large auxiliary training data for model pretraining. With our HDA method, the need to frequently collect large training data is eliminated, whilst still achieving stronger and more robust WF attacks. Finally, we performed a detailed component analysis to diagnose the effect of individual model components.

5.1. Additional Discussion

Besides data augmentation, an alternative approach to reducing the demand for data annotation in a few-shot learning context is semisupervised learning, which has been extensively studied in e-mail classification [49], intrusion detection [50], authorship attribution [51], computer vision [52, 53], and so forth. The key idea is to exploit the structural knowledge (manifold and cluster structures) of unlabeled data to increase the volume of training data. Crucially, we believe that our proposed HDA can benefit existing semisupervised learning methods due to its algorithm-agnostic nature. One limitation of our HDA is that more training data leads to a higher training cost. However, this is a general problem common to all data augmentation methods, including ours. To further advance research on website fingerprinting, it is necessary to connect website fingerprinting with other fingerprinting fields, from traditional fingerprint-based biometric systems [54] to the newest collaborative intrusion detection networks under passive message fingerprint attack [55, 56]. By introducing strategies that have proven effective in related fingerprinting fields, website fingerprinting, especially few-shot website fingerprinting, could advance further.

Data Availability

The principal datasets used in this research can be downloaded from the following websites: https://github.com/DistriNet/DLWF and https://www.cse.ust.hk/~taow/wf/data/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Hunan Province, China (no. 2021JJ40682), the National Key Research and Development Program of China (no. 2018YFB0204301), and the National Natural Science Foundation of China (no. 61472439).