Abstract

Despite the efforts of information security experts, cybercrimes are still emerging at an alarming rate. Among the tools used by cybercriminals, malicious domains are indispensable and harm from the Internet has become a global problem. Malicious domains play an important role from SPAM and Cross-Site Scripting (XSS) threats to Botnet and Advanced Persistent Threat (APT) attacks at large scales. To ensure there is not a single point of failure or to prevent their detection and blocking, malware authors have employed domain generation algorithms (DGAs) and domain-flux techniques to generate a large number of domain names for malicious servers. As a result, malicious servers are difficult to detect and remove. Furthermore, the clues of cybercrime are stored in network traffic logs, but analyzing long-term big network traffic data is a challenge. To adapt the technology of cybercrimes and automatically detect unknown malicious threats, we previously proposed a system called MD-Miner. To improve its efficiency and accuracy, we propose the MD-MinerP here, which generates more features with identification capabilities in the feature extraction stage. Moreover, MD-MinerP adapts interaction profiling bipartite graphs instead of annotated bipartite graphs. The experimental results show that MD-MinerP has better area under curve (AUC) results and found new malicious domains that could not be recognized by other threat intelligence systems. The MD-MinerP exhibits both scalability and applicability, which has been experimentally validated on actual enterprise network traffic.

1. Introduction

Cybercrimes are becoming increasingly serious with the proliferation of Internet devices and applications. One of the most frequently used tools for cybercrimes is malicious domains to perform phishing, XSS, and other attacks. Internet attack organizations generally use code obfuscation techniques to generate a large number of polymorphic variants with the same malware [1] before establishing more than one command and control (C&C) server. Cybercriminals and malware authors leverage not only hidden and slow APT attacks but also various techniques, such as DGAs and domain-flux, to make them successful. By adopting technologies such as DGAs, these servers change their domain names and corresponding IP addresses over time to prevent being blocked by antivirus software or intrusion prevention systems [2]. The detection of malicious domains is difficult because of the defense dilemma caused by the long-term attack and the volatility of their domain names. However, malware generally exhibit footprints that show where they have been. The clue to tracking cybercrimes is in the network traffic; the challenge is how to analyze the huge amount of network traffic. Of the applications in malicious domains, botnets are considered the most damaging by enterprises.

A set of infected and controlled entities can be viewed as a botnet [3]. The botnet structure is composed of three main components: (1) the bots, (2) the command and control servers (C&C), and (3) the threat actor, or bot herder itself; bot, which refers to a remote victim computer, usually without the victim’s knowledge; and C&C server, responsible for managing the trunk host that controls the entire botnet and passes along the bot herder’s instructions. Once the botnet deployment is complete and launches a cyber-attack, the distributed denial-of-service (DDoS) shuts down the victim organization’s Internet service, and the APT leads to additional damage. Compromised hosts need the Internet as a communication bridge to perform cybercrimes, such as receiving instructions or stealing sensitive data and returning it to the C&C servers [4, 5]. The impact of botnets is great enough that several studies have focused their attention on the discovery of botnets, which has continued to be a hot topic [610].

To defend against cyber-attacks, many organizations have established systems such as intrusion detection systems (IDS) to detect and log suspicious traffic, but these produce many false alarms that dull their vigilance [11]. Unlike advanced traffic analysis techniques that require large amounts of computational resources and time, the domain blacklist matching method can instantly detect malicious domains and further disrupt their communications. However, the methods to perform string changes to domain names are simple, cheap, and fast, indicating that using domain blacklists to prevent attacks is effective but difficult to update in real time. Therefore, automating the maintenance of the domain blacklist is indispensable to improve the information security of organizations.

As described in our previous research [12], discovering botnets is important, and detecting C&C servers is vital to analyze APT events. Malicious domain names commonly require an Internet connection to communicate with compromised hosts, but tracking or mining them from the global public Internet has been a difficult problem. Fortunately, such processes leave footprints, and most enterprises leverage proxy servers as intermediate HTTP communications between internal computers and the Internet that result in logging footprints. Thus, systems can take advantage of packet capturing systems to obtain the HTTP communication records. However, one of the bottlenecks in analyzing network traffic is a single workstation can easily have millions of packets each day, which inhibits manually analyzing such traffic without automated intelligence systems. Therefore, we proposed the MD-Miner (MD stands for malicious domain) that adapts big data analysis with a scalability framework. The process utilizes network traffic to build a Process-domain annotated graph that discovers who is connecting with what. The MD-Miner uses user-agent plus client-IP as a feature to distinguish the distinct processes and incorporates this into the annotated bipartite graph to become the Process-domain annotated graph. The evaluation in [12] shows that the MD-Miner can determine a part of unknown domains that has a high probability of being malicious and demonstrates great identifiability, but there is still room for further improvement.

Inheriting from our previous research [12], we built a new scalable network-level behavior system called MD-MinerP (P represents Plus) that is based on the Hadoop and Spark cluster architecture. The design effectively uses an incremental clustering algorithm to handle large amounts of data. The MD-MinerP has evolved unique analytic capabilities that constantly examine the subtle clues left in proxy or network traffic logs to discriminate malicious domains.

This article demonstrates the steps to convert the MD-Miner to the MD-MinerP through two key points. First, the MD-MinerP replaces the annotated bipartite graph with an interaction profiling bipartite graph that better represents the association of Internet interactions. Second, the MD-MinerP exploits more connection factors to construct features with classification capabilities. In addition to the user-agent plus client-IP (Process), the MD-MinerP uses HTTP requests, domain IP addresses, and domain name lexical characteristics. The MD-MinerP leverages the user-agent plus client-IP building Process-domain interaction profiling graph to acquaint process queries that leverage HTTP requests to build the Trace-domain interaction profiling graph and determine the interactions between the client-server. The system also leverages the IP address of the destination domain to build the IP-domain interaction profiling graph to identify corresponding relations of the IP used by the domain name. The lexical algorithm is also used to extract variations in the domain string. Finally, these features are aggregated to frame the malicious domain detector. Related works and observations related to improvements of the MD-MinerP are detailed in Section 2.

The evaluation stage in Section 4 uses the CyberGraph [13] to verify new malicious domains found by the MD-MinerP in addition to the previously used K-fold cross-validation. The CyberGraph is a novel potential malicious domain verification analysis platform that retrieves different types of observable intelligence from different sources to produce a series of observations over time. This allows users to judge threats on the Internet. The CyberGraph is committed to integrating standardized and structured information through a vast and complex network intelligence.

The remainder of this paper is organized as follows. Section 2 describes the background and the assumptions and observations of our approach. Section 3 provides implementation detail of MD-MinerP and formulates the research contribution. Moreover, our design goals and core concepts are introduced and a simple example is used to illustrate the data flow of the framework. Section 4 shows the results from our evaluation using ISP-confirmed real-world network traffic to determine the effectiveness of the proposed system. Finally, a summary of the contributions and future research developments are presented in Section 5.

The principles of related techniques used by the MD-MinerP to generate the domain features are described in this section. The MD-MinerP has two major evolutions: improvements to the annotated bipartite graph and additional significant features. There are different annotated bipartite graphs imported for feature extraction. In [14, 15], two systems called Segugio and Doctrina are built from different annotated bipartite graphs with the DNS logs. These systems extract DNS answer-based features, time-based features, domain name-based features, and TTL value-based features of the DNS traffic to detect malicious domain activities. We used annotated bipartite graphs to develop a system, called MD-Miner, that monitors the network traffic to build a Process-domain annotated graph, as shown in Figure 1, to represent who is connecting to what [12]. The MD-Miner has abundant DNS logs available and is a scalable architecture. As shown in Figure 1, there are only malicious, benign, and unknown labels in the annotated bipartite graph, but the content of the network traffic log is not as simple as the DNS log. Therefore, the MD-MinerP replaces the annotated bipartite graph with an interaction profiling bipartite graph, which is detailed in Section 3 and has experimental results that show promise for its application.

2.1. User-Agent

The first factor in the network traffic log is the user-agent in the HTTP header sent along with a request for an Internet server, which is often but not always sent from a web browser. The intent is to inform the server of the capabilities of the software used by the client. The implementation of a classifier for user-agent strings with support vector machines is described in [16]. On the other hand, as mentioned in [12], the text area of a binary-analyzed result for malware suggests that when the user-agent string is hard-coded in the malware’s text area, the user-agent and malicious activities have a considerable degree of correlation. Anomalous user-agent strings were considered in [17] to determine the association with malware activities. However, dedicated user-agent strings that define attackers can easily evade detection by changing their form. Therefore, the MD-Miner [12] proposed a Process-domain annotated graph that uses user-agent strings and the client-IP in the network traffic as a feature to differentiate the network activity that was emitted from the same process and stores the information about who is connecting to what. In this annotated bipartite graph, the nodes represent either the Process nodes (p1p4) or domain nodes (d1d5), and an edge connects a Process to a domain if the connection occurred during the considered traffic observation time window. The classification results are used here to construct more effective bipartite graphs based on its composition using the factors described below.

2.2. HTTP Request

The HTTP network traffic contains significant important information to detect malicious interactions between malware-controlled domains and malware-compromised machines. HTTP is an application layer protocol that uses headers to transfer metadata over a client-server model where the client sends a request to a server, which responds with the available appropriate resource. The HTTP requests are important in Internet interactions, making this the second factor used to extract domain features. Many works have confirmed that the vast majority of malware leverages HTTP as a communication bridge with a cybercriminal’s C&C server to perpetrate malicious activities [18, 19]. Such tricks are not only used in the majority of SPAM botnets but also operated on the APT [2025]. In addition, the malware sample network activity experiments in [26] indicate that approximately 75% of malware samples trigger network activities and generate HTTP traffic. A malware clustering system was introduced in [26] to analyze the structural similarities between malicious HTTP requests in network traffic and used the application path and query string to calculate the distance between malware to clustering malware to obtain its signature. In addition, the HTTP request contained in the headers includes the path (e.g., /path/data) and query (e.g., ?key = value&key2 = value2) as ensconced interactive information between the client and server. References [27, 28] tried to detect malicious phishing web sites using path and query keywords by comparing the relevancy of terms within their URLs. One risk level is the similarity between the path and query terms based on Google Trend and Yahoo Clue. In studies that use HTTP protocols to detect suspicious packets [29, 30], the similarity from the URL path, parameter, and value could identify the packet as malicious or benign.

The MD-MinerP refers to the Trace-Channel interaction profiling graph proposed by our previous research on the CC-Tracker [19], which extends similar observations to [26]. The observation is that different malware samples that rely on the same web server application have similarly structured queries and related URL sequences. To reduce the complexity of the computing similarity between HTTP requests, we simplified the HTTP request as Trace, as shown in Figure 2. The upper part of Figure 2 shows that the Trace takes a raw HTTP request of “GET /web page.php?key1 = value1&key2 = value2&k3 = v3” as an example, where m indicates the method to query the URL and p denotes the queried page. The remaining terms used to query the URL are n and , which are after the question mark and are in the form of a key = value pair, where n indicates the parameter name of the queried URL and denotes the parameter. As the parameter values are relatively easy to change, all parameter values are replaced with the same symbol, which ignores the parameter values [19]. Therefore, the original HTTP request can be simplified to “GET_/web page.php?|key1|key2|k3|,” as shown in the lower part of Figure 2.

2.3. IP Address

The Internet protocol (IP) address is a unique logical digital address assigned to each hardware-equipped network and is recognized by the other devices through the IP address. Benign and malicious domains also have their own IP address and the correspondences are recorded in the network traffic files. The IP addresses are more stable than other metrics, such as the URL and DNS. That is, the domain string can easily change while the IP address is generally fixed. Cybercrimes create a specific technique called obfuscation to change the domain name string, which has been identified and summarized as having four basic types [31]. In contrast, the IP address holds two inborn traits that make it more difficult to change: stability with time and address space skewness [3234]. If it can be proven that the IP address used by a domain name d is positively related to a known malicious activity, then the domain name may be considered as malicious. Considering these two characteristics, the Segugio [14] and Doctrina [15] approaches successfully transformed the correspondence between the IP and domain names into features to mine for malicious domain names from the DNS logs. Moreover, some research used domain IP mapping as a trait to find network threats [35, 36]. While these detection methods can still be improved, they prove that IP addresses could be an effective identification factor. The MD-MinerP takes advantage of the mapping between the domain and IP address to become the third factor. This approach employs the interaction profiling bipartite graph concept to construct the IP-domain interaction profiling graph from network traffic logs to produce effective detection features.

2.4. Lexical Analysis

Manipulating the domain name is another common practice for cybercrimes. Previous research [37] has shown that nearly one-third of all websites in the world are potentially malicious. Many malicious URLs follow obfuscation methods that make the URL strings similar to benign URLs to avoid detection. However, studying various detection methods by analyzing the diversity of domain strings allows designing effective malicious URL detection solutions [38]. These develop lexical features that excavate the divergence of URLs by analyzing the statistical properties of URL strings. The adjective lexical describes the relation to a vocabulary of words and the associated lexical analysis is based on the characteristics of the URL string to determine the lexical features that represent the features of a URL name. Lexical features refer to the actual text without other external information of the URL string. The intention is to make malicious URLs “look” different to experts when compared with benign ones [27].

Most lexical features commonly used for such classifications include the statistical properties of the URL string, like the numerical information regarding the feature lengths (URL length, top-level domain length, primary domain length, etc.) and the number of special characters [39]. The extracted information is obfuscation-resistant and useful. One lexical analysis approach is called the bag-of-words (BoW), which builds a dictionary as a feature set by referring to all the different types of words in all URLs. When a URL includes a word in the dictionary, the value of the feature is 1; otherwise, it is 0. The MD-MinerP developed a kind of BoW approach to adapt to big data and accelerate the computing, which is described in detail in Section 3.

Due to the lack of scalability of previous research [2630], this was restricted to a small amount of material. Therefore, this paper proposes a MD-MinerP system which is mainly used to extract hidden malicious threats from long period and large amount of network traffic logs. Our approach takes full advantage of the concept that Internet communications for a specific purpose will invoke similar interactions. The MD-MinerP uses attributes in the network traffic log to create representative characteristics for each domain, which answers four important questions. (1) Who is connecting what (Process-domain interaction profiling graph)? (2) How to interact with what (Trace-domain interaction profiling graph)? (3) What domain name used what IP (IP-domain interaction profiling graph)? (4) What does it “look” like (lexical analysis)? An exhaustive description of how to use the unique methodologies proposed in this paper to establish effective classification features for each domain is given in the following section.

3. MD-MinerP Implementation

The concept of the MD-MinerP is to track known and discover unknown malicious network domains, which are designated as a channel for attackers to perform malicious acts. Looking at network communications from this perspective allows finding similar traces of connections, and the victim machine generally attempts to connect to malicious or newly created domains. Therefore, the MD-MinerP is based on the following main intuitions:(1)Victim clients tend to connect malicious domain families.(2)Malware belonging to the same family tend to connect to partially overlapping malware-controlled domains.(3)Benign applications rarely connect to domains that exist only to provide malicious functionality.(4)Cybercriminals prepare multiple malicious domains to prevent single-point failure.(5)Malicious domains reuse the same IP addresses.(6)Domain names with the same purpose often “look the same.”

To take advantage of these points, we proposed a new malicious domain detection system called MD-MinerP. The first part of this section gives a detailed explanation for the capture of network domain features. The second part elaborates on the implementation details of the MD-MinerP based on the MapReduce framework.

3.1. Domain Features

For each domain in the network traffic, the MD-MinerP creates four feature vectors. Three feature vectors are generated based on the interaction profiling bipartite graph, and the other feature vector is generated based on the lexical analysis. The intuitions described in Section 2 are used to generate relevant features through the interaction profiling bipartite graph, as shown in Figure 3.

In the interaction profiling bipartite graph, the domain represents the node on one side of the binary graph, and the CF stands for “connection factor,” which is the node on the other side. The connection factors include the Process, Trace, and Address used by the domain. The Process indicates the user-agent plus client-IP, Trace indicates the simplified HTTP request, and Address indicates the domain IP address. The MD-MinerP defines three interaction profiling bipartite graphs using these connection factors, where GP  = (P, D, EPD) represents the interaction profiling bipartite graph for Process, GT  = (T, D, ETD) is for Trace, and GA = (A, D, EAD) is for Address. Node set D represents the domain nodes with di ∈ D, node set P represents the Process nodes with pi ∈ P, node set T represents the Trace nodes with ti ∈ T, and node set A represents the Address nodes with ai ∈ A. The edge sets are called EPD in GP, ETD in GT, and EAD in GA. The Process pi connects a domain dj with an edge eij ∈ EPD, the Trace ti connects a domain dj with an edge eij ∈ ETD, and the Address ai connects a domain dj with an edge eij ∈ EAD. The features of different aspects of the network domain can be described from the interaction profiling bipartite graphs for different CFs. Communications with the same purposes interact through similar CFs. For example, the d1 and d2 conduct similar communications as shown in Figure 3. Once the interaction profiling bipartite graph is constructed, the next step is to extract the domain feature vectors from each graph, as detailed below.

Each domain name needs to go through three phases to extract the feature vector by analyzing the interaction profiling bipartite graph. The first phase is to mark the domain node, which obtains benign and malicious domain intelligence (whitelist/blacklist) from a public or private reputation database. If the domain exists in the whitelist, it is marked as DomainWhite; if it exists in the blacklist, it is a known malicious domain and marked as DomainBlack. All remaining domains are marked as DomainUnknown, which are the primary targets for further classification to mine malicious domains that are not recorded in the threat intelligence but are actually hidden.

The second phase is to label each CF node as White, Black, Mix, Unknown, or Leaf. The labeling method is based on the labeled domain nodes where each CF node is linked. Three numbers are counted for each CF node, namely, Whitesum, Blacksum, and Unknownsum, These are the number of edges of a CF node connected to different DomainWhite, the number of edges for different DomainBlack, and the number of edges for different DomainUnknown. Each CF node in the interaction profiling bipartite graph is then labeled with its own Whitesum, Blacksum, and Unknownsum. The labeling method is as follows, where the CF nodes in the lower part of Figure 4 illustrate the labeling method.(1)White: Whitesum > 0 & Blacksum = 0 (circle)(2)Black: Blacksum > 0 & Blacksum > Whitesum (cross)(3)Mix: Blacksum > 0 & Whitesum > Blacksum (circle sign combined with cross)(4)Unknown: Blacksum = 0 & Whitesum = 0 (question mark)(5)Leaf: Connect to a single domain only (triangle)

The third phase is to compute the feature values for each domain node. Figure 4 shows the interaction profiling bipartite graph GP, where the Process feature values of the domain node d3 are calculated using the GP as an example. Five values are counted from the attributes of the labeled Process nodes to which d3 is linked: SP, WP, BP, MP, and UP, where SP is the total number of Process nodes linked to d3; WP is the number of Process nodes linked to d3 and labeled as White; BP is the number of Process nodes linked to d3 and labeled as Black; MP is the number of Process nodes linked to d3 and labeled as Mix; and UP is the number of Process nodes linked to d3 and labeled as Unknown. The six following Process feature values of d3 are calculated using the following formulas.(1)Fraction of White Process nodes,  = |WP|/|SP|(2)Fraction of Black Process nodes, bP  =  |BP|/|SP|(3)Fraction of Mix Process nodes, mP  =  |MP|/|SP|(4)Fraction of Unknown Process nodes, uP  =  |UP|/|SP|(5)Fraction of Leaf Process nodes, lP  =  |LP|/|SP|(6)Fraction of total Process nodes, sP  =  |SP|

The feature values for d3 obtained from the above six formulas are 1⁄6, 2⁄6, 1⁄6, 1⁄6, 1⁄6, and 6. Following the same pattern applied to the interaction profiling bipartite graph GT allows using d3 to obtain , bT, mT, uT, lT, and sT. Applying this to the interaction profiling bipartite graph GA allows using d3 to obtain , bA, mA, uA, lA, and sA. All the domain nodes are assigned their own 18 feature values in the same way.

The lexical features are those acquired based on the properties of a domain name or string. The motivation is that the domain-based “appearance” should be able to identify the malicious nature of a domain. The MD-MinerP directly uses the BoW model, which loses information on the order of tokens that belong to the top-level and primary domains. This is done by creating a separate dictionary for each fragment. The lexical features also include the statistical properties of the domain, such as the length of its name and the number of “.” characters.

3.2. MapReduce Algorithm

The MD-MinerP is based on two important phases to detect potentially malicious domains, as shown in Figure 5: domain feature extraction and random forest classifier. First, the MD-MinerP constructs the domain node feature vector by taking the network traffic log and benign\malicious domain intelligence stored in the domain threat intelligence database as inputs. The domain feature extraction phase consists of four parts that extract 22 features of each domain node: (1) Process, (2) Trace, (3) Address, and (4) lexical feature extractions. Second, the MD-MinerP adopts Spark parallel processing to build a random forest classifier based on the decision tree model, which is employed to detect malicious domains.

Parts (1)–(3) of the domain feature extraction are based on a similar concept of using the interaction profiling bipartite graph to obtain adjacent information as features. The MD-MinerP designs four MapReduce jobs to realize feature extraction of the interaction profiling bipartite graph: (1) domain node labeling, (2) CF node labeling, (3) interaction profiling bipartite graph building, and (4) behavior feature calculating. Taking part (1) as an example, the following is a detailed description of the MapReduce jobs for the Process feature extraction when the Process nodes are used as CF nodes.

The domain node labeling job first utilizes multiple input mechanisms of the map phases with the network traffic and whitelist/blacklist as the input and domain as the key. Parallel label domain nodes are either DB (DomainBlack), DW (DomainWhite), or DU (DomainUnknown) in the reduce phase based on shuffle and sorting mechanisms. An example of the data flow for domain node labeling is shown in Figure 6.

The next job after labeling the domain nodes is to label the CF nodes. As described in Section 3.1, the label of a CF node is determined from the connected domain nodes. The five label types are White, Black, Mix, Unknown, and Leaf. The input to the CF node labeling job is the output of the domain node label from the previous step. Therefore, the MapReduce job at this step takes the CF (e.g., Process) node as the key and the domain node as the value in the map phase. In the reduce phase, the number of occurrences for DB, DW, and DU for each CF node are counted and the corresponding labels are calculated. The Process nodes are taken as the CF nodes as an example, and Figure 7 shows the data flow of the labeled CF nodes.

The next job is to build the interaction profiling bipartite graph to aggregate the labeled domain nodes and labeled CF nodes into a dataset. In the map phase, the output of the domain and CF node labeling jobs are taken as the inputs to use the advantages of multiple input mechanisms with the identity of the CF (e.g., Process) node as the key. The CF node labels are annotated for each record to obtain the interaction profiling bipartite graph in the reduce phase. Figure 8 shows an example of the data flow to build an interaction profiling bipartite graph during this job.

The interaction profiling bipartite graph constructed in the above jobs allows calculating the behavior features for each domain node. In the map phase, the constructed interaction profiling bipartite graph output from the previous job is taken as the input, where the domain node is the key. In the reduce phase, each domain node obtains its neighbor’s information (labels of CF nodes) through the shuffle and sorting mechanism. Therefore, the MD-MinerP can compute the behavior features of each domain node in parallel. Figure 9 shows an example of the parallel computing behavior features in the job.

Parts (2) and (3) can be implemented as similar MapReduce jobs for GT and GA. The only difference is that Part (1) uses the Process (user-agent + client-IP) and domain nodes to construct the interaction profiling bipartite graph GP, Part (2) uses the Trace nodes instead of the Process nodes to build the interaction profiling bipartite graph GT, and Part (3) uses the Address (destination IP address) nodes to replace the Process nodes and construct the interaction profiling bipartite graph GA.

The lexical feature extraction in Part (4) uses distributed caching mechanisms to store dictionaries for both the primary and top-level domains and gives each term an index number. The distributed caching mechanism allows calculating the lexical features in a single map phase, including the length of the domain, the number of “.” characters, and the index numbers of the top-level and main domains.

Once each domain in the dataset has its own 22 feature values based on the above steps, the MD-MinerP performs two steps to employ the random forest classifier based on Spark, which is a unified analytics engine for large-scale data processing. The first step constructs a classifier RFC by taking all the DB, DW, and their feature values in the dataset as the training set and inputs them into the random forest algorithm. The second step is to use the classifier RFC to identify all unknown domains labeled as DU in the dataset.

4. Evaluation

The MD-MinerP mines stealthy malicious domains for enterprise-scale big network traffic data. Therefore, the MD-MinerP is deployed for enterprise network environments. The deployed network environments are called ENTN1 and ENTN2, which are both real-world companies based in Taiwan with thousands of networked clients that install and run antivirus software. The ENTN1 is a medium-scale company and its compliance with security management rules is relatively relaxed. The organization’s network traffic was collected for 8 months (Jan 1, 2018, to Aug 31, 2018). The ENTN2 is a large-scale company that follows strict security and information management regulations with a collected network traffic period of 2 weeks (Aug 1, 2018, to Aug 15, 2018). Table 1 gives further details for both datasets.

The experiment presented in this paper is based on the two large network traffic datasets ENTN1 and ENTN2 and evaluates the overall performance of the MD-MinerP from three perspectives. First, k-fold cross-validation was employed to evaluate the classification capabilities of MD-MinerP. Second, the actual instances demonstrate the ability of MD-MinerP to mine hidden malicious domains. Finally, the ability of MD-MinerP to handle big data is demonstrated by adjusting the number of nodes in the parallel computing cluster and observing its operational performance.

To perform the k-fold cross-validation, we begin by marking all the known samples in the dataset as n (negative) or p (positive), where n is interpreted as benign and p is interpreted as malicious. A prediction result produced from classifying a sample with the model is divided into four types. First, true positive (TP) indicates the result of the classifier to predict the sample is p when it is; second, false positive (FP) indicates the result of the classifier predicts the sample is p when it is n; third, true negative (TN) indicates the result when the classifier predicts the sample is n when it is; and fourth, false negative (FN) indicates the result when the classifier predicts the sample is n when it is p.

The above description indicates that the first step to prepare the benchmark is to label the ENTN1 and ENTN2 datasets. Labeling DW required employing commercial and public intelligence as whitelists, such as Bluecoat, and the collection of the top 1 million most famous domain names from alexa.com. Labeling the DB data required checking if the entire domain name string matches an existing domain from a commercial domain blacklist. When the tagged datasets (ENTN1 and ENTN2) are ready, we can generate evaluation criteria for different feature vectors. To conduct a comprehensive evaluation, different metrics are needed: precision, recall, F-measure, accuracy, and AUC. Furthermore, each metric is calculated through a cross-validation process.

The k-fold cross-validation is a resampling procedure used to evaluate machine learning models for limited data samples. The original samples are randomly divided into k equally sized subsamples, where a single subsample is retained as the data for the validation model, while the other k-1 subsamples are used to train the model. The cross-validation process is then repeated k times (called folds), with each k subsample being used only once as the verification data. The results of the k-folds can then be averaged to produce a single estimate. The advantage of the k-fold cross-validation process is that each data sample only needs to be tested once and used to train k-1 times [40]. This paper adopts the 10-fold cross-validation procedure.

4.1. 10-Fold Cross-Validation for MD-MinerP

The first experiment deployed MD-MinerP to the real-world ENTN1 and ENTN2 datasets and determined their 10-fold cross-validation results. The MD-MinerP constructed three interaction profiling bipartite graphs (GP, GT, and GA) by applying the feature extraction method described in Section 3. The three feature vectors were then generated using GP, GT, and GA; each feature vector contained six feature values. Moreover, we used the proposed lexical analysis method to generate the fourth feature vector that contains four feature values. In addition, a feature vector containing 22 feature values was generated by merging the above four feature vectors. The performance of each feature vector was confirmed by observing the classification results calculated by different metrics as shown in Table 2. In addition, we used the ROC curve to show the ability of the classification model to all classification thresholds as shown in 10 and 11

The above experimental results show that when the MD-MinerP was deployed to the ENTN1 dataset, the Address feature vector performed the best, which the AUC and F-measure were as high as 0.99. Although the AUC of the other three feature vectors is greater than 0.8, the recall metric is low, indicating that these features are only applicable to partial data. However, combining feature vectors can improve the overall ability of classification. When MD-Minerp was deployed in ENTN2 dataset, the characteristic of combining features resulted in a more significant increase in overall classification capacity. Since ENTN2 belongs to a relatively diversified dataset, the recall value of general feature vectors is low. However, by combining the feature vectors, the recall can be significantly improved. Furthermore, in both the ENTN1 or ENTN2 datasets, the AUCs of the feature vectors that combined the other four were above 0.98, which indicates outstanding discrimination.

4.2. Interaction Profiling Bipartite Graph versus Annotated Bipartite Graph

The second experiment was to prove that the interaction profiling bipartite graph leveraged here is better than the annotated bipartite graph adopted in previous studies [12, 14, 15], which are compared in Figure 12. The annotated bipartite graph only considers the benign and malicious attributes of the connected domain when extracting features, as described in Section 3. The interaction profiling bipartite graph further considers additional aspects, such as outlier domains. Therefore, for the same CF, the annotated bipartite graph exports three feature values, while the interaction profiling bipartite graph brings out six feature values.

The experiment also utilized the ENTN1 and ENTN2 datasets. The annotated and interaction profiling bipartite graphs were formed using the same datasets and the same CF (selected from Process, Trace, and Address) to generate three feature vectors, which were combined into a fourth feature vector. Comparing the 10-fold cross-validation AUCs of the four feature vectors generated from the annotated bipartite graph and interaction profiling bipartite graph shows which bipartite graph had a better recognition effect. It is noted that the lexical feature vector is not included in the experiment because it only compares two bipartite graphs. The implementation of the annotated bipartite graph is based on previous studies [12, 14, 15], and the interaction profiling bipartite graph is defined in Section 3.

Figures 1316 are the results of experiments based on the ENTN1 dataset. Figure 13 shows that, for the Process CF, the annotated bipartite graph had an AUC of 0.85 and the interaction profiling bipartite graph had an AUC of 0.85. Figure 14 shows that, for the Trace CF, the annotated bipartite graph had an AUC of 0.74 and the interaction profiling bipartite graph had an AUC of 0.80. Figure 15 shows that, for the Address CF, the annotated bipartite graph had an AUC of 0.62 and the interaction profiling bipartite graph had an AUC of 1.00. Figure 16 shows the experiment that combined the feature values generated by the Process, Trace, and Address CFs gave AUCs for the annotated bipartite graph and interaction profiling bipartite graph of 0.94 and 1.00, respectively.

Figures 1720 are the results of experiments based on the ENTN2 dataset. Figure 17 shows that, for the Process CF, the annotated bipartite graph had an AUC of 0.79 and the interaction profiling bipartite graph had an AUC of 0.97. Figure 18 shows that, for the Trace CF, the annotated bipartite graph had an AUC of 0.54 and the interaction profiling bipartite graph had an AUC of 0.75. Figure 19 shows that, for the Address CF, the annotated bipartite graph had an AUC of 0.68 and the interaction profiling bipartite graph had an AUC of 0.96. Figure 20 shows the experiment that combined the feature values generated from the Process, Trace, and Address CFs. The annotated and interaction profiling bipartite graphs had AUCs of 0.83 and 0.95, respectively. Table 3 summarizes the AUC value of the feature vector generated by the annotated bipartite graph and interaction profiling bipartite graph in respect of classification assessment, which is used to evaluate the ability of classification.

The experiments in this section show that the data for the proposed interaction profiling bipartite graph are superior to the annotated bipartite graph in either deployed network environments of ENTN1 or ENTN2.

4.3. Identified Malicious Domain Analysis

This section demonstrates the effectiveness of the MD-MinerP at mining potentially malicious domains. With unknown domains in the ENTN2 dataset as the objects for detection, Table 4 shows the top 10 domains with the highest malicious probability as detected with the MD-MinerP. These 10 domains were analyzed using the VirusTotal, and four were identified as malicious while the remaining six were classified as clean. Due to the limited space for digital forensics content of the domain name “folder[.]maroon91[.]com,” this section describes the digital forensics as an example. The evidence to classify the domain as malicious is collected from external threat intelligence, including the file communication records and passive DNS records, and the relationship graph regarding the domain based on the collected information is created.

The relationship graph of the domain “folder[.]maroon91[.]com” is constructed using CyberGraph [13], as shown in Figure 21. The figure shows the direct and indirect relationship between the domain and malicious files, malicious IP, and malicious domains. The domain “folder[.]maroon91[.]com” is not a malicious site in VirusTotal, but the malicious files are directly connected to the domain. The domain is then resolved as “221[.]228[.]214[.]69,” “58[.]215[.]186[.]83,” “118[.]193[.]145[.]130,” and “118[.]193[.]187[.]35.” The records of these IP addresses that communicate with malicious files are observed from the graph. Moreover, the domains “dsc[.]maroon91[.]com,” “app[.]maroon91[.]com,” “usjzx[.]maroon91[.]com,” “upgrade[.]maroon91[.]com,” and “folderhw[.]maroon91[.]com” were discovered to communicate with the malicious files, which have a domain sibling relationship with each other. A series of outward relationships helped identify the domain “folder[.]maroon91[.]com” as malicious. As a consequence, this section demonstrates that the MD-MinerP can detect malicious domains that are not recognized by other reputable intelligence systems.

4.4. Performance Evaluation

From a complexity theory viewpoint, the MapReduce framework is unique in that it combines bounds on time, space, and communication. Each of these bounds would be very weak on its own: the total time available to processors is polynomial; the total space and communication are slightly less than quadratic. In particular, even though arranging the communication between processors is one of the most difficult parts of designing a MapReduce algorithm, classical results from communication complexity do not apply since the total communication available is more than linear [41]. Therefore, we use fixed dataset to measure the execution performance and scalability of MapReduce through the execution time of different cluster sizes.

The performance and scalability of the MD-MinerP are verified by adjusting the number of nodes in the Hadoop cluster, which were two, four, and six. Each node had 24 CPUs (each is an Intel (R) Xeon (R) CPU E5-2620 2.00 GHz processor) with 32 GB of RAM. The dataset used as a benchmark to analyze the MD-MinerP runtime is the ENTN2 dataset described in Table 1, which is sized at 172.7 GB.

The flow of the MD-MinerP can be divided into three parts: data preprocessing, feature extraction, and domain classification. The feature extraction stage of the MD-MinerP can be classified into two parts: interaction profiling bipartite graph mining and lexical feature extraction. The MD-MinerP designed four MapReduce jobs to accomplish the interaction profiling bipartite graph mining: (1) domain node labeling, (2) CF node labeling, (3) interaction profiling bipartite graph building, and (4) behavior feature calculation. The lexical feature extraction can be divided into three MapReduce jobs: (1) creation of primary domain dictionary, (2) creation of top-level domain dictionary, and (3) calculation of lexical features. On the other hand, both the data preprocess stage and domain classification stage have only one MapReduce job. Figure 22 shows the runtime analysis of the MD-MinerP. We observe that the data preprocess stage and the domain node labeling of the feature extraction are the primary bottleneck of the MD-MinerP process. As the above two jobs mainly involve I/O operations, the I/O is the primary performance bottleneck in processing the massive data. However, with an increased number of nodes, the computation time of the data preprocess stage and domain node labeling decreases substantially. The experiments show that the MD-MinerP tends to possess a superior scalability for the MapReduce.

5. Conclusions

This paper proposes a malicious domain detection system based on a novel bipartite graph called MD-MinerP. The interaction profiling bipartite graphs and lexical analysis adopted by the MD-MinerP can handle big data. The mining of unknown malicious domains is accomplished by analyzing network interaction behaviors between clients and domains in big network traffic data. The MD-MinerP is designed as a scalable system to monitor and analyze big network traffic data to find illegal network activities. Two big network traffic datasets (ENTN1 and ENTN2), three validation aspects, and four experiments were proposed to inspect the performance of MD-MinerP. The experiments used ROC curves and 10-fold cross-validation with known domains. The experimental results confirm that the feature extraction method proposed by MD-MinerP as applied to ENTN1 obtained an AUC of 1.00 and applied to the ENTN2 obtained an AUC of 0.98. The experimental results of the direct comparison showed that the feature vectors extracted from the interaction profiling bipartite graph are superior to the annotated bipartite graph for both the single and merged feature vectors. In addition, verifying the unknown domain predicted as malicious by the MD-MinerP allows the verification method to shape the relationship diagram of the domain. The relationship diagram shows that the domain is directly and indirectly associated with the IP and the domain with malicious behavior. Finally, controlling the number of nodes in the Hadoop cluster verifies that the MD-MinerP is a system that fully satisfies the parallel computing conditions, even if the enterprise’s network traffic data is large. Therefore, the MD-MinerP is applied to conduct malicious domain data mining.

This paper has confirmed the contribution of MD-MinerP, but it has some limitations. As described in Section 3, interaction profiling bipartite graph requires domain threat intelligence to label known domain nodes as black and white. Therefore, the quality and quantity of ground truth affect the performance of MD-MinerP. Fortunately, collecting public and commercial domain intelligence can effectively overcome this problem. In addition, MD-MinerP may not be suitable for DHCP network environment. This is because the proposed bipartite graph uses client-IP to locate individual hosts, and DHCP may cause different hosts to be assigned to the same IP. The solution to this challenge is to correlate DHCP logs with network traffic data to obtain the network behavior of each individual host. The final challenge is that MD-MinerP needs to be retrained periodically to maintain detection accuracy. As cybercriminals’ technology is constantly evolving, it is necessary to regularly employ MD-MinerP and through the latest network traffic data and network threat intelligence to obtain updated domain classification model.

Future work will focus on two areas. First, for detecting malicious domain from big network traffic data, it will be considered whether this approach applies to other large log data, such as firewall and DNS logs. Furthermore, the proposed bipartite graph algorithm can be used to perform correlation analysis for multiple types of network traffic logs to optimize the detection capability. Second, the proposed algorithm is applied to the analysis of other malicious threats. For example, treat the smartphone application’s dynamic analysis data (e.g., system call) and static analysis data (e.g., opcode) as CF, and match the threat intelligence of applications to build the interaction profiling bipartite graph of applications to mine hidden malicious applications. In addition, the MD-MinerP mechanism can be used as the basis for a bilateral market service model [42] to collect malicious traffic. We have provided a website, https://netflowtotal.firebaseapp.com/, to prove this concept [43].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.