XGBXSS: An Extreme Gradient Boosting Detection Framework for Cross-Site Scripting Attacks Based on Hybrid Feature Selection Approach and Parameters Optimization

https://doi.org/10.1016/j.jisa.2021.102813

Abstract

With the widespread popularity of the Internet and the transformation of the world into a global village, Web applications have drawn increasing attention over the years from companies, organizations, and social media, making them a prime target for cyber-attacks. The cross-site scripting (XSS) attack is one of the most severe concerns and has been at the forefront of information security experts' reports. In this study, we propose XGBXSS, a novel web-based XSS attack detection framework based on an ensemble-learning technique using the Extreme Gradient Boosting algorithm (XGBoost) with an extreme parameter optimization approach. An enhanced feature extraction method is presented to extract the most useful features from the developed dataset. Furthermore, a novel hybrid approach for feature selection is proposed, fusing information gain (IG) with sequential backward selection (SBS) to select an optimal subset that reduces computational costs while maintaining the detector's high performance. The proposed framework passed several tests on the holdout testing dataset and achieved state-of-the-art results, with accuracy, precision, detection probability, F-score, false-positive rate, false-negative rate, and AUC-ROC scores of 99.59%, 99.53%, 99.01%, 99.27%, 0.18%, 0.98%, and 99.41%, respectively. Moreover, it can bridge the existing research gap concerning previous detectors, with a higher detection rate and lower computational complexity. It also has the potential to be deployed as a self-reliant system that is efficient enough to defeat such attacks, including zero-day XSS-based attacks.

Introduction

Cross-Site Scripting (XSS) vulnerabilities are among the most common high-risk security flaws in web applications, putting users, web applications, and even the industrial field at high risk [1]. A web application with these security vulnerabilities could permit cybercriminals to inject malicious scripts into the webpages displayed to different end-users. Hence, it can be exploited to cause damage such as completely changing the appearance or behavior of an organization's website, stealing sensitive enterprise information or users' information, acting on behalf of the real user, and much more [2]. Various prevention and mitigation schemes have been proposed to counter XSS-based attacks on the client side, the server side, or both sides, using different analysis methods such as static, dynamic, or hybrid [3]. However, existing traditional solutions for detecting such attacks have become insufficient due to the increasingly sophisticated and numerous forms of XSS payloads [4]; most of them do not scale over time and have a non-negligible number of false positives [5]. XSS was the most widespread attack vector in 2019. According to PreciseSecurity research, nearly 40% of all attacks recorded by security experts were XSS attacks, and almost 75% of prestigious companies across North America and Europe were targeted over 2019 [6]. In addition, the overall number of new XSS vulnerabilities in 2019 (2,023) increased by 30.2% compared to 2018 (1,554) and by 79.2% compared to 2017 (1,129), as per the National Vulnerability Database (NVD) [7]. A 2018 report on the state of application security ranked XSS-based vulnerabilities as the second most common vulnerabilities [8]. XSS was also listed among the top three attack vectors used against web applications in 2018 as per Akamai's State of the Internet Security Report [9]. It also remained among the OWASP top 10 risks in 2017 [10].

As mentioned above, the rapid increase in risk draws attention to the challenges that XSS-based attacks pose to the modern world. It is also evident that sophisticated XSS-based attacks have begun to emerge in significant numbers and pose a real threat to both companies and individuals. Commercial Artificial Intelligence (AI) techniques are becoming mainstream and are readily available to organizations and cybercriminals alike [11]. Therefore, developing an efficient and robust cyber-defense mechanism that uses the latest AI schemes to detect these sophisticated XSS-based attacks accurately and precisely is of keen interest to researchers and web-based communities.

Researchers are endeavoring to add intelligence to detection systems to improve detection probabilities, which is evident from the considerable use of machine learning (ML) techniques for detecting cyberattacks [12]. Security detection systems are trained on a dataset of previously known behaviors, in which each group of actions is labelled as either malicious or benign. Although AI technology adds significant value in the cybersecurity domain, XSS detectors based on AI technologies still have some deficiencies: a notable number of false negatives (FN), a non-negligible number of false positives (FP), and the time needed to handle a huge amount of high-dimensional data. FNs are a more significant concern than FPs, since many real attacks pass through the detection system without being detected.

Consequently, the target system's security is inevitably compromised and the cybercriminals get what they want. However, most related research works focus on the FP rate while ignoring the FN rate. Moreover, they lack an appropriate, accurate, and balanced dataset. Additionally, proper methods for identifying the optimal features of such attacks, which would reduce computational costs while maintaining high detector performance, are missing. Hence, investigating whether advanced ML approaches with a clean dataset and an optimal subset of features can increase detection capabilities against XSS attacks is important for strengthening defenses against this kind of cyberattack.
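To make the distinction between the two error types concrete, the following minimal sketch computes the FP rate, FN rate, and related measures from confusion-matrix counts. The function name and the counts are illustrative assumptions and are not taken from the study.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute common detector metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # detection probability (true-positive rate)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "detection_rate": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),  # benign samples wrongly flagged
        "fn_rate": fn / (fn + tp),  # attacks that pass undetected
    }


# Hypothetical counts chosen for illustration only.
m = detection_metrics(tp=90, fp=5, tn=895, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these hypothetical counts the accuracy is high even though one attack in ten is missed, which is why reporting the FN rate alongside the FP rate matters.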

In this study, a novel detection framework, namely XGBXSS, is proposed to overcome the shortcomings discussed above. This framework provides a significant performance improvement over our previous proposal [13], detecting XSS-based attacks while minimizing the FP and FN rates simultaneously.

To enhance the proposed model's ability to learn the most useful features, we improved the feature extraction model presented in our previous work using a dictionary search function. This step gives the model the ability to choose the best features from among the 160 features. To identify the optimal subset while maintaining strong performance, we propose a hybrid feature selection method (IG-SBS) that fuses information gain (IG) with sequential backward selection (SBS). To provide a robust model, we adopt ensemble learning using an XGBoost-based algorithm with an extreme optimization approach rather than relying on a single model.
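The general idea of combining an IG filter with SBS can be sketched as follows. This is only an illustration under stated assumptions: the exact IG-SBS procedure is not given in this excerpt, the function names and the two size parameters (`keep_after_ig`, `target_size`) are hypothetical, and the scorer is a simple majority-vote lookup table standing in for the XGBoost model.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    groups = defaultdict(list)
    for x, y in zip(column, labels):
        groups[x].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def subset_accuracy(rows, labels, subset):
    """Stand-in scorer (assumption): training accuracy of a majority-vote table."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(row[j] for j in subset)][y] += 1
    return sum(max(c.values()) for c in votes.values()) / len(labels)


def ig_sbs(rows, labels, keep_after_ig, target_size, score=subset_accuracy):
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    # Stage 1 (filter): keep the features with the highest information gain.
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    subset = sorted(ranked[:keep_after_ig])
    # Stage 2 (wrapper): repeatedly drop the feature whose removal costs least.
    while len(subset) > target_size:
        candidates = [[f for f in subset if f != drop] for drop in subset]
        subset = max(candidates, key=lambda s: score(rows, labels, s))
    return subset


# Toy data (hypothetical): four binary features, label depends on features 0 and 2.
rows = [list(bits) for bits in product([0, 1], repeat=4)]
labels = [r[0] & r[2] for r in rows]
selected = ig_sbs(rows, labels, keep_after_ig=3, target_size=2)
print(selected)
```

The filter stage is cheap and prunes clearly uninformative features first, so the more expensive wrapper stage only has to evaluate a reduced candidate set.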

Moreover, this study features an in-depth empirical evaluation of the proposed framework using various performance measures and a statistical test. The proposed framework achieved excellent results on the test dataset, with accuracy, precision, probability of detection, FP rate, FN rate, and area under the ROC curve (AUC) scores of 99.59%, 99.50%, 99.02%, 0.20%, 0.98%, and 99.41%, respectively. This study's main contributions can be summarized as follows:

  • The study proposes a novel ensemble-based framework using XGBoost with extreme parameter optimization, trained on a realistic and up-to-date XSS dataset, that achieves higher accuracy and detection rate.

  • A fused feature selection method that combines information gain (IG) with sequential backward selection (SBS) is proposed to select the most useful features from the dataset, aiming to decrease computational requirements while improving performance.

  • We derive the best subset of features, consisting of 30 out of 160 features, capable of efficiently characterizing XSS scenarios.

  • This study also features an in-depth experimental evaluation of the proposed framework using various performance evaluation metrics, demonstrating that the proposed XGBXSS is robust, with high precision, high detection probability, and low computational complexity. These properties make it lighter, faster, and easier to deploy.

The remainder of this study is organized as follows. Section 2 discusses related work, highlighting the gaps in previous studies that are the primary focus of our proposed framework. Section 3 presents the key details of this study's mechanism and techniques, including the Enhanced Feature Extraction (EFE) model, the hybrid feature selection method, and ensemble model construction. Section 4 describes the experimental design and the extreme parameter optimization strategy. Section 5 details the results and discusses the comparative analysis of the proposed framework against various existing techniques reported in the literature. Finally, Section 6 concludes this study, focusing on its significance and highlighting key future research directions.

Section snippets

Related work

Many traditional mechanisms against XSS attacks have been proposed for the client side [14], the server side [15], or both [3], and they analyze different attack vectors using various approaches. These analysis methods can be of the following types. (i) Static analysis. The web application's code, including source code, byte-code, or binary code, is reviewed to reveal how data or control will flow at runtime before the application is executed [16]. However, due to the complexity

Detection Methodology

In recent years, web-based XSS attacks have been the most crucial concern for security analysts. They are caused by security bugs in websites that arise from the features provided by dynamic web applications. Alongside HTML and CSS, JavaScript (JS) is a key technology for creating dynamic web content. However, JS code has long been used to pass infections to web applications. JS is prevalent inside webpages, interacts with the DOM elements, and can be injected into various tags,

The dataset

The dataset consists of 138,569 samples, of which 100,000 are benign and 38,569 are malicious, each with the 30 selected features. To develop a robust and accurate estimation model and provide an unbiased sense of the model's efficiency, the dataset was split randomly into three separate parts with a partition ratio of 60%:20%:20% for the training, validation, and testing sets. The training set includes 83,142 samples labelled as [0: Benign, 1: Malicious], the validation set
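A 60%:20%:20% random partition of this kind can be sketched as follows. This is a minimal illustration on hypothetical toy data: the function name, the seed, and the rounding rule are assumptions, and the sketch does not attempt to reproduce the exact partition sizes reported in the study.

```python
import random


def split_60_20_20(samples, seed=0):
    """Shuffle once, then cut into training, validation, and test partitions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(0.6 * len(shuffled))
    n_val = round(0.2 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical toy data: (sample_id, label) with 0 = benign, 1 = malicious.
data = [(i, 1 if i % 4 == 0 else 0) for i in range(1000)]
train, val, test = split_60_20_20(data)
print(len(train), len(val), len(test))
```

Keeping the three partitions disjoint means that tuning decisions made on the validation set do not leak into the final test estimate.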

Finalization of the detection framework

The parameters of the proposed framework are calibrated to achieve optimal results, with the parameter settings shown in Table 5. The other parameters are kept at their defaults. Later on, while carefully observing the experimental results of the XGBXSS framework, the performance was monitored on the validation dataset to verify the number of calibrated trees and adopt an early stopping technique once
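Early stopping on a validation metric can be illustrated with the generic sketch below. The stopping condition used in the study is cut off in this excerpt, so the patience-based rule, the function name, and the loss values here are all assumptions for illustration.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, best_loss) under a patience-based stopping rule.

    Training is considered stopped once `patience` consecutive rounds pass
    without the validation loss improving on the best value seen so far.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss


# Hypothetical per-round validation losses (one value per added tree).
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.10]
best = early_stopping_round(losses, patience=3)
print(best)
```

In this toy run the search stops at round 5 and keeps the model from round 2, so the later value of 0.10 is never evaluated. That is the usual trade-off of a patience rule: fewer trees and less training time in exchange for possibly missing a late improvement.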

Conclusions

This research proposes the XGBXSS detection framework for detecting web-based XSS attacks. The detection framework has proved efficient, achieving outstanding accuracy and detection rate with minimal FP and FN rates, i.e., close to zero. The detection framework used a large dataset for training and testing, together with the proposed feature extraction and selection techniques and an ensemble learning technique for the detection task. Numerous analyses have

Author Contribution Statement

Fawaz Mahiuob Mohammed Mokbal conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final version of the manuscript.

Dan Wang conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, supervised the work, and approved the final version of the manuscript.

Xiaoxi Wang analyzed the data, performed the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (45)

  • Precise Security. Cross-Site Scripting (XSS) Makes Nearly 40% of All Cyber Attacks in 2019 - PreciseSecurity.com 2020....
  • NIST.gov. National Vulnerability Database (NVD), Vulnerabilities n.d....
  • Application Security Research Update (2018)
  • Akamai. State of the Internet: Security | Credential Stuffing Attacks Report (Volume 4, Issue 4)| Akamai. vol. 4....
  • OWASP top 10 - 2017 The Ten Most Critical Web Application Security Risks. OwaspOrg (2017)
  • WhiteHat. 2018 Application Security Statistics Report....
  • J. Murphree. Machine learning anomaly detection in large systems. AUTOTESTCON (Proceedings) (2016)
  • FMM Mokbal et al. MLPXSS: An Integrated XSS-Based Attack Detection Scheme in Web Applications Using Multilayer Perceptron Technique. IEEE Access (2019)
  • PJB Pajila et al. (2020)
  • S Gupta et al. XSS-SAFE: A Server-Side Approach to Detect and Mitigate Cross-Site Scripting (XSS) Attacks in JavaScript Code. Arab J Sci Eng (2016)
  • M Mohammadi et al. Detecting Cross-Site Scripting Vulnerabilities through Automated Unit Testing. IEEE Int. Conf. Softw. Qual. Reliab. Secur. (2017)
  • X Guo et al. XSS Vulnerability Detection Using Optimized Attack Vector Repertory. Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Discov., IEEE (2015)