XGBXSS: An Extreme Gradient Boosting Detection Framework for Cross-Site Scripting Attacks Based on Hybrid Feature Selection Approach and Parameters Optimization
Introduction
Cross-Site Scripting (XSS) vulnerabilities are among the most common high-risk weaknesses in web applications, and they put users, web applications, and even industrial systems at risk [1]. A web application with such a vulnerability can allow cybercriminals to inject malicious scripts into the webpages displayed to other end-users. The vulnerability can then be exploited to cause damage such as completely changing the appearance or behavior of an organization's website, stealing sensitive enterprise or user information, acting on behalf of the real user, and much more [2]. Various prevention and mitigation schemes have been proposed to counter XSS-based attacks on the client side, the server side, or both, using static, dynamic, or hybrid analysis methods [3]. However, solutions based on these traditional methods have become insufficient because XSS payloads are growing in both number and sophistication [4]; most of them do not scale over time and produce a non-negligible number of false positives [5]. XSS was the most widespread attack vector in 2019. According to PreciseSecurity research, nearly 40% of all attacks recorded by security experts were XSS attacks, and almost 75% of prestigious companies across North America and Europe were targeted during 2019 [6]. In addition, according to the National Vulnerability Database (NVD), the number of new XSS vulnerabilities in 2019 (2,023) increased by 30.2% compared to 2018 (1,554) and by 79.2% compared to 2017 (1,129) [7]. In 2018, the State of Application Security report ranked XSS-based vulnerabilities as the second most common vulnerabilities [8]. XSS was also listed among the top three attack vectors used against web applications in 2018 in Akamai's State of the Internet Security Report [9], and it remained among the top 10 attack vectors in 2017 according to OWASP [10].
As the figures above show, the risk posed by XSS-based attacks is growing rapidly and demands attention. It is also evident that sophisticated XSS-based attacks are emerging and pose a real threat to both companies and individuals. Commercial Artificial Intelligence (AI) techniques are becoming mainstream and are readily available to organizations and cybercriminals alike [11]. Therefore, the development of an efficient and robust cyber-defense mechanism that uses the latest AI schemes to detect these sophisticated XSS-based attacks accurately and precisely is of keen interest to researchers and web-based communities.
Researchers are working to add intelligence to detection systems by means of AI techniques, as is evident from the considerable use of machine learning (ML) in detecting cyberattacks [12]. Such detection systems are trained on a dataset of previously known behaviors, in which each group of actions is labelled as either malicious or benign. Although AI technology adds significant value in the cybersecurity domain, XSS detectors based on AI technologies still have some deficiencies: a considerable number of false negatives (FN), a non-negligible number of false positives (FP), and the long time needed to handle large volumes of high-dimensional data. FNs are a more significant concern than FPs, since each FN is a real attack that passes through the detection system undetected.
Consequently, the target system is compromised and the cybercriminals get what they want. However, most related research focuses on the FP rate while ignoring the FN rate. Moreover, these works lack an appropriate, accurate, and balanced dataset, as well as proper methods for identifying the optimal features of such attacks, which are needed to reduce computational cost while maintaining high detector performance. Hence, it is important to investigate whether advanced ML approaches, combined with a clean dataset and an optimal subset of features, can increase detection capabilities against XSS attacks and thereby strengthen the defenses against this kind of cyberattack.
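The FP and FN rates discussed above can be made concrete with a small confusion-matrix computation. The following Python sketch is illustrative only: the class and function names are hypothetical, the toy labels are invented, and the formulas are the standard textbook definitions rather than code from this study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a binary detector (1 = malicious, 0 = benign)."""
    tp: int  # malicious samples flagged as malicious
    fp: int  # benign samples flagged as malicious
    tn: int  # benign samples passed as benign
    fn: int  # malicious samples passed as benign (missed attacks)


def count_outcomes(y_true, y_pred):
    """Tally the four outcomes from parallel lists of 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def detection_metrics(c):
    """Return standard rates as fractions in [0, 1].

    Assumes both classes are present and at least one sample is flagged,
    so that no denominator is zero.
    """
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp),
        "detection_rate": c.tp / (c.tp + c.fn),  # probability of detection (recall)
        "fp_rate": c.fp / (c.fp + c.tn),
        "fn_rate": c.fn / (c.fn + c.tp),
    }


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = detection_metrics(count_outcomes(y_true, y_pred))
```

Because the detection rate and the FN rate share the same denominator, they always sum to one, so lowering the FN rate is the same as raising the probability of detection.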
In this study, a novel detection framework, namely XGBXSS, is proposed to overcome the shortcomings discussed above. The framework provides a significant performance improvement over our previous proposal [13], detecting XSS-based attacks while minimizing the FP and FN rates simultaneously.
To enhance the proposed model's ability to learn the most useful features, we improved the feature extraction model presented in our previous work by using a dictionary search function. This step gives the model the ability to choose the best features from among the 160 features. To identify the optimal subset while maintaining strong performance, we propose a hybrid feature selection method (IG-SBS) that fuses information gain (IG) with sequential backward selection (SBS). To provide a robust model, we adopt ensemble learning using an XGBoost-based algorithm with an extreme optimization approach rather than relying on a single model.
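To illustrate how a filter stage based on IG can be chained with an SBS wrapper stage, the following Python sketch runs both on a tiny invented dataset. It is an assumption-laden illustration, not the implementation used in this study: all function names are hypothetical, the features are discrete toy values, and `make_scorer` is a simple lookup-table training accuracy standing in for whatever classifier score the wrapper stage would use in practice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(column, labels):
    """IG of one discrete feature: H(y) - H(y | feature)."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def ig_filter(X, y, keep):
    """Filter stage: indices of the `keep` features with the highest IG."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return ranked[:keep]


def sbs(candidates, score, target_size):
    """Wrapper stage: repeatedly drop the feature whose removal hurts `score` least."""
    selected = list(candidates)
    while len(selected) > target_size:
        selected = max(
            ([f for f in selected if f != drop] for drop in selected),
            key=score,
        )
    return selected


def make_scorer(X, y):
    """Hypothetical stand-in scorer: training accuracy of a majority-vote lookup table."""
    def score(subset):
        groups = {}
        for row, label in zip(X, y):
            key = tuple(row[j] for j in subset)
            groups.setdefault(key, Counter())[label] += 1
        return sum(max(c.values()) for c in groups.values()) / len(y)
    return score


# Invented toy data: 8 samples, 4 binary features.
y = [1, 1, 1, 1, 0, 0, 0, 0]
f0 = [1, 1, 1, 1, 0, 0, 0, 0]  # perfectly informative
f1 = [0, 1, 0, 1, 0, 1, 0, 1]  # uninformative
f2 = [1, 1, 1, 0, 0, 0, 0, 1]  # weakly informative
f3 = [0, 0, 0, 0, 0, 0, 0, 0]  # constant
X = list(zip(f0, f1, f2, f3))

shortlist = ig_filter(X, y, keep=2)                          # cheap ranking first
selected = sbs(shortlist, make_scorer(X, y), target_size=1)  # costly search second
```

The design point is the ordering: the inexpensive IG ranking shrinks the candidate pool before the wrapper search runs, since SBS must re-evaluate the scorer once for every feature it considers dropping.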
Moreover, this study features an in-depth empirical evaluation of the proposed framework using various performance measures and statistical testing. On the test dataset, the proposed framework achieved accuracy, precision, probability of detection, FP rate, FN rate, and area under the ROC curve (AUC) scores of 99.59%, 99.50%, 99.02%, 0.20%, 0.98%, and 99.41%, respectively. This study's main contributions can be summarized as follows:
- We propose a novel ensemble-based framework that uses XGBoost with extreme parameter optimization on a realistic and up-to-date XSS dataset, achieving higher accuracy and detection rate.
- We propose a fused feature selection method, combining information gain (IG) with sequential backward selection (SBS), to select the optimal features from the dataset, aiming to decrease computational requirements while improving performance.
- We derive the best subset of features, consisting of 30 out of 160 features, capable of efficiently characterizing XSS scenarios.
- We provide an in-depth experimental evaluation of the proposed framework using various performance evaluation metrics, demonstrating that the proposed XGBXSS is robust, with high precision, high detection probability, and low computational complexity. These properties make it lighter, faster, and easier to deploy.
The remainder of this study is organized as follows. Section 2 discusses related work, highlighting the gaps in previous studies that are the primary focus of our proposed framework. Section 3 presents the key details of the mechanisms and techniques used in this study, including the Enhanced Feature Extraction (EFE) model, the hybrid feature selection method, and the ensemble model construction. Section 4 describes the experimental design and the extreme parameter optimization. Section 5 presents the results and a comparative analysis of the proposed framework against existing techniques reported in the literature. Finally, Section 6 concludes the study, focusing on its significance and highlighting key directions for future work.
Section snippets
Related work
Many traditional mechanisms against XSS attacks have been proposed for the client side [14], the server side [15], or both [3], and they analyze different attack vectors using various approaches. These analysis methods fall into the following types. (i) Static analysis. The web application's code, which may be source code, byte-code, or binary code, is reviewed before the application is executed in order to disclose how data or control will flow at runtime [16]. However, due to the complexity
Detection Methodology
In recent years, web-based XSS attacks have become a crucial concern for security analysts. They are caused by security bugs in websites that are brought about by the features provided with dynamic web applications. Alongside HTML and CSS, JavaScript (JS) is a key technology for creating dynamic web content. However, JS code has long been used to pass infections to web applications. JS is prevalent inside webpages, interacts with DOM elements, and can be injected into various tags,
The dataset
The dataset consists of 138,569 samples, of which 100,000 are benign and 38,569 are malicious, each described by the 30 selected features. To develop a robust and accurate estimation model and to obtain an unbiased estimate of the model's efficiency, the dataset was split randomly into three separate parts, with a partition ratio of 60%:20%:20% for the training, validation, and testing sets. The training set includes 83,142 samples labelled as [0: Benign, 1: Malicious], the validation set
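As a rough illustration of a random 60%:20%:20% partition, the Python sketch below shuffles sample indices and cuts them into three disjoint parts. The function name, the seed, and the rounding rule are assumptions: rounding the training share up happens to reproduce the 83,142 training samples stated above, while the validation and test sizes simply follow from that rounding and are not taken from this study.

```python
import math
import random


def split_indices(n_samples, train_ratio=0.6, val_ratio=0.2, seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts.

    The test part receives whatever remains after the first two cuts.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (arbitrary seed)

    n_train = math.ceil(n_samples * train_ratio)  # assumption: round the training share up
    n_val = round(n_samples * val_ratio)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(138_569)
```

Keeping the three index sets disjoint is what lets the test set give an unbiased estimate: no sample used for fitting or tuning is ever scored at test time.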
Finalization of the detection framework
The parameters of the proposed framework were calibrated to achieve better performance; the tuned parameter settings are shown in Table 5, and all other parameters were kept at their default values. Subsequently, while carefully observing the experimental results of the XGBXSS framework, performance was monitored on the validation dataset to verify the number of calibrated trees and to apply an early stopping technique once
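The snippet above is cut off before it states the stopping condition, so the following Python sketch assumes the common patience rule: stop when the validation loss has not improved for a fixed number of consecutive boosting rounds. The function name, the `patience` parameter, and the loss values are all hypothetical and serve only to show the mechanism of validation-based early stopping.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, rounds_run) under a patience-based stopping rule.

    val_losses[i] is the validation loss after boosting round i (lower is better).
    Training stops once `patience` consecutive rounds fail to beat the best loss.
    """
    best_loss = float("inf")
    best_round = -1
    rounds_run = 0
    for round_idx, loss in enumerate(val_losses):
        rounds_run = round_idx + 1
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding trees
    return best_round, rounds_run


# Invented validation losses: improvement stalls after round 3.
losses = [0.60, 0.41, 0.33, 0.30, 0.31, 0.30, 0.32, 0.29]
best, ran = best_round_with_early_stopping(losses, patience=3)
```

In this toy run the loop stops before it ever sees the final value of 0.29, which shows the trade-off of the rule: a small `patience` saves training time but can miss a late improvement.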
Conclusions
This research proposes the XGBXSS detection framework for detecting web-based XSS attacks. The framework proved efficient, achieving outstanding accuracy and detection rate with minimal FP and FN rates, i.e., close to zero. It was trained and tested on a large dataset, using the proposed feature extraction and selection techniques and an ensemble learning technique for the detection task. Numerous analyses have
Author Contribution Statement
Fawaz Mahiuob Mohammed Mokbal conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final version of the manuscript.
Dan Wang conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, supervised the work, and approved the final version of the manuscript.
Xiaoxi Wang analyzed the data, performed the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (45)
- et al. A survey of detection methods for XSS attacks. J Netw Comput Appl (2018)
- et al. An ensemble learning approach for XSS attack detection with domain knowledge and threat intelligence. Comput Secur (2019)
- et al. Securing web applications from injection and logic vulnerabilities: Approaches and challenges. Inf Softw Technol (2016)
- et al. Web application protection techniques: A taxonomy. J Netw Comput Appl (2016)
- et al. Detection of malicious web pages based on hybrid analysis. J Inf Secur Appl (2017)
- et al. New deep learning method to detect code injection attacks on hybrid applications. J Syst Softw (2018)
- et al. A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors. Appl Soft Comput (2019)
- et al. Industrial Control Systems Vulnerabilities Statistics (2016)
- et al. A survey on detection and prevention of cross-site scripting attack. Int J Secur Its Appl (2015)
- et al. Cross-site scripting (XSS) attacks and mitigation: A survey. Comput Networks (2019)