1 Introduction

Targeted attacks are among the most serious threats on the Internet. Standard payloads such as traditional executable files remain popular; according to one report, executable files are the second most common type of malicious email attachment [49]. Attackers often use obfuscated malware to evade anti-virus programs, and pattern matching-based detection methods such as anti-virus programs can barely detect new malware [28]. To detect new malware, automated dynamic analysis systems and sandboxes are effective. The idea is to force any suspicious binary to execute in a sandbox; if its behavior is malicious, the file is classified as malware. While dynamic analysis is powerful, it requires too much time to examine all suspicious files arriving from the Internet. Furthermore, it requires high-performance servers and licenses for commercial OSes and applications. Therefore, a fast filtering method is required.

To achieve this, static detection methods based on machine learning techniques can be applied. These methods extract features from the malware binary and the Portable Executable (PE) header. Although printable strings have often been analyzed, they have not been a decisive element for detection. With the recent development of natural language processing (NLP) techniques, printable strings have become more effective for detecting malware [8]. Therefore, the combination of printable strings and NLP techniques can be used as a filtering method.

In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our time series dataset consists of more than 500,000 samples obtained from multiple sources. Our experimental results demonstrate that our method can detect new malware. Furthermore, our method is effective against packed malware and anti-debugging techniques. This paper makes the following contributions.

  • Printable strings with NLP techniques are effective for detecting malware in a practical environment.

  • Our method is effective against not only subspecies of existing malware, but also new malware.

  • Our method is effective against packed malware and anti-debugging techniques.

The structure of this paper is as follows. Section 2 describes related studies. Section 3 provides the natural language processing techniques related to this study. Section 4 describes our NLP-based detection model. Section 5 evaluates our model with the dataset. Section 6 discusses the performance and research ethics. Finally, Section 7 concludes this study.

2 Related work

2.1 Static analysis

Malware detection methods are categorized into dynamic analysis and static analysis. Our detection model is categorized into static analysis. Hence, this section focuses on the features used in static analysis.

One of the most common features is byte n-grams extracted from the malware binary [47]. Abou-Assaleh et al. used frequent n-grams to generate signatures from malicious and benign samples [1]. Kolter et al. used information gain to extract 500 n-gram features [12, 13]. Zhang et al. also used information gain to select important n-gram features [55]. Henchiri et al. conducted an exhaustive feature search on a set of malware samples and strived to obviate over-fitting [6]. Jacob et al. used bigram distributions to detect similar malware [9]. Raff et al. applied neural networks to raw bytes without explicit feature construction [41]. Similar approaches extract features from opcode n-grams [3, 10, 35, 57] or program disassembly [7, 14, 18, 44, 50].

While many studies focus on the body of malware, several studies focus on the PE headers. Shafiq et al. proposed a framework that automatically extracts 189 features from PE headers [48]. Perdisci et al. used PE headers to distinguish between packed and non-packed executables [38]. Elovici et al. used byte n-grams and PE headers to detect malicious code [4]. Webster et al. demonstrated how the contents of PE files can help to detect different versions of malware [53]. Saxe et al. used a histogram of byte entropy values, DLL imports, and numerical PE fields with neural networks [45]. Li et al. extracted the top features from PE headers and sections with a recurrent neural network (RNN) model [17]. Raff et al. used raw byte sequences obtained from PE headers with a Long Short-Term Memory (LSTM) network [40].

Other features are also used to detect malware. Schultz et al. used n-grams, printable strings, and DLL imports with machine learning techniques for malware detection [46]. Masud et al. used byte n-grams, assembly instructions, and DLL function calls [20]. Ye et al. used interpretable strings such as API execution calls and important semantic strings [54]. Lee et al. focused on the similarity between two files to identify and classify malware [16]; the similarity is calculated from the extracted printable strings. Mastjik et al. analyzed string matching methods to identify the same malware family [19]. Their method used three pattern matching algorithms: Jaro, Longest Common Subsequence (LCS), and n-grams. Kolosnjaji et al. proposed a method to classify malware with a neural network consisting of convolutional and feedforward layers [11]. Their model extracts features from n-grams of instructions and the headers of executable files. Aghakhani et al. studied how machine learning based on static analysis features performs on packed samples [2]. They used a balanced dataset with 37,269 benign samples and 44,602 malicious samples.

Thus, several studies have used printable strings as features. However, printable strings have not been used as the main method for detection, and the combination of printable strings and NLP techniques has not been evaluated in a practical environment. This paper pursues the possibility of printable strings as a filtering method.

2.2 NLP-based detection

Our detection model uses some NLP techniques. This section focuses on the NLP-based detection methods.

Moskovitch et al. used NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) to represent byte n-gram features [35]. Nagano et al. proposed a method to detect malware with Paragraph Vector [36]. Their method extracts features from the DLL import names, assembly code, and hex dump. A similar approach classifies malware from API sequences with TF-IDF and Paragraph Vector [51]; this method requires dynamic analysis to extract the API sequences. Thus, in these studies, printable strings are not used as the main method for detection.

NLP-based detection has also been applied to malicious traffic and other content. Paragraph Vector was applied to extract the features of proxy logs [30, 32]. This method was extended to analyze network packets [22, 31]. To mitigate class imbalance problems, the lexical features are adjusted by extracting frequent words [23, 33]; our method uses this technique to mitigate class imbalance problems. Some methods use a Doc2vec model to detect malicious JavaScript code [29, 37, 39]. Other methods use NLP techniques to detect memory corruptions [52] or malicious VBA macros [24,25,26,27, 34].

3 NLP techniques

This section describes the NLP techniques related to this study. The main goal of NLP is to enable computers to process and analyze large amounts of natural language data. Documents written in natural language are separated into words so that NLP techniques such as Bag-of-Words can be applied. The corpus of words is converted into vectors that can be processed by computers.

3.1 Bag-of-words

Bag-of-Words (BoW) [43] is a fundamental method of document classification in which the frequency of each word is used as a feature for training a classifier. This model converts a document into a vector based on the frequency of each word. Let d be a document, \(w_{i}\) (\(i = 1,2,3,...\)) a word, and \(n_{w_i}\) the frequency of \(w_i\) in d. The document d can then be defined by Eq. (1). Equation (2) fixes the order of the frequencies in d and omits the words themselves, so that each document \(\hat{d}_j\) (\(j = 1,2,3,...\)) is represented as a vector. Stacking these vectors for all documents yields the document-word matrix in Eq. (3), which records the term frequencies of all distinct words (with the words ordered as in Eq. (2)).

$$\begin{aligned} d &= [(w_1, n_{w_1}), (w_2, n_{w_2}), (w_3, n_{w_3}), \ldots, (w_i, n_{w_i})] \end{aligned}$$
(1)
$$\begin{aligned} \hat{\mathbf{d}}_j &= (n_{w_1}, n_{w_2}, n_{w_3}, \ldots, n_{w_i}) \end{aligned}$$
(2)
$$\begin{aligned} |D| &= \begin{bmatrix} n_{w_{1,1}} & \ldots & n_{w_{1,i}} \\ \vdots & \ddots & \vdots \\ n_{w_{j,1}} & \ldots & n_{w_{j,i}} \end{bmatrix} . \end{aligned}$$
(3)

Thus, this model converts documents into vectors. Note that this model does not preserve the order of the words in the original documents. The dimension of the converted vectors equals the number of distinct words in the original documents. To analyze large-scale data, this dimension should be reduced so that the data can be analyzed in practical time.
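To make the conversion concrete, the following is a minimal sketch of the BoW vectorization described above, using only the Python standard library; the example documents are hypothetical word lists standing in for the words extracted from samples.

```python
from collections import Counter

# Hypothetical documents, already split into words (e.g., printable strings)
docs = [
    ["kernel32", "dll", "loadlibrarya", "kernel32"],
    ["user32", "dll", "messageboxa"],
]

# Fix the word order once, as Eq. (2) does
vocab = sorted({w for d in docs for w in d})

def to_bow_vector(words, vocab):
    """Convert one document into its term-frequency vector (Eq. (2))."""
    counts = Counter(words)
    return [counts[w] for w in vocab]

# Document-word matrix |D| of Eq. (3): one row per document
matrix = [to_bow_vector(d, vocab) for d in docs]
print(vocab)   # ['dll', 'kernel32', 'loadlibrarya', 'messageboxa', 'user32']
print(matrix)  # [[1, 2, 1, 0, 0], [1, 0, 0, 1, 1]]
```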

3.2 Latent semantic indexing

Latent Semantic Indexing (LSI) analyzes the relevance between a group of documents and the words they contain. In the LSI model, the BoW vectors are reduced by singular value decomposition, and each component of the vectors is weighted. The decomposed matrix shows the relevance between the document group and the words included in the documents. To weight each component of the vector, Term Frequency-Inverse Document Frequency (TF-IDF) is usually used. Let |D| be the total number of documents, \(\{d:d \ni t_{i}\}\) the number of documents including word i, and \(frequency_{i, j}\) the frequency of word i in document j. TF-IDF is defined by Eq. (4).

$$\begin{aligned} tf_{i,j} \cdot idf_{i} = frequency_{i,j} \cdot \log \frac{|D|}{\{d : d \ni t_{i}\}} . \end{aligned}$$
(4)

TF-IDF weights the vector before singular value decomposition is performed. The component \(x_{i,j}\) of the matrix X is the TF-IDF value of word i in document j. By the theory of linear algebra, X can be decomposed into orthogonal matrices U and V and a diagonal matrix \(\Sigma \). In this singular value decomposition, U is a column orthogonal matrix whose column vectors are linearly independent; therefore, U is a basis of the document vector space. The decomposition of X, which captures the correlation between words and documents, is expressed as follows. Generally, the matrix U represents latent meanings.

$$\begin{aligned} X &= \begin{bmatrix} x_{1,1} & \ldots & x_{1,j} \\ \vdots & \ddots & \vdots \\ x_{i,1} & \ldots & x_{i,j} \end{bmatrix} = U \Sigma V^{T} \\ &= \begin{bmatrix} u_{1,1} & \ldots & u_{1,r} \\ \vdots & \ddots & \vdots \\ u_{i,1} & \ldots & u_{i,r} \end{bmatrix} \begin{bmatrix} \sigma_{1,1} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \sigma_{r,r} \end{bmatrix} \begin{bmatrix} v_{1,1} & \ldots & v_{1,j} \\ \vdots & \ddots & \vdots \\ v_{r,1} & \ldots & v_{r,j} \end{bmatrix} . \end{aligned}$$
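This decomposition can be reproduced numerically. The sketch below uses NumPy on a toy TF-IDF matrix (the values are made up) to illustrate the factorization and the truncation used for dimension reduction.

```python
import numpy as np

# Toy TF-IDF matrix X: rows correspond to words, columns to documents
X = np.array([[0.0, 1.2, 0.5],
              [0.8, 0.0, 0.3],
              [0.4, 0.9, 0.0]])

# Singular value decomposition X = U * Sigma * V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))  # True

# Keeping only the k largest singular values reduces the dimension (LSI)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(X_k.round(2))
```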

In this model, the number of dimensions can be chosen arbitrarily. Thus, this model reduces the dimension so that large-scale data can be analyzed in practical time.
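Since the paper's implementation uses Gensim (Sect. 4.3), the following is a minimal sketch of how an LSI model could be built from tokenized documents; the toy documents and the two-dimensional setting are illustrative, not the actual configuration used in the experiments.

```python
from gensim import corpora, models

# Hypothetical tokenized documents (printable strings split into words)
docs = [
    ["kernel32", "dll", "loadlibrarya"],
    ["user32", "dll", "messageboxa"],
    ["kernel32", "dll", "createfilea"],
]

dictionary = corpora.Dictionary(docs)                # word <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]   # BoW vectors (Sect. 3.1)

tfidf = models.TfidfModel(bow_corpus)                # weighting of Eq. (4)
tfidf_corpus = tfidf[bow_corpus]

# Truncated SVD over the TF-IDF matrix; num_topics is the reduced dimension
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)

# Each document becomes a low-dimensional lexical feature vector
for vec in lsi[tfidf_corpus]:
    print(vec)
```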

3.3 Paragraph vector

To represent word meaning or context, Word2vec was created [21]. Word2vec is a shallow neural network trained to reconstruct the linguistic contexts of words. It takes a large corpus of documents as input and produces a vector space of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector. Word vectors are positioned in the space such that words sharing common contexts in the corpus are located close to one another. \( queen = king - man + woman \) is an example of an operation on word vectors generated by Word2vec. Thus, Word2vec represents a word with its meaning or context. Paragraph Vector is the extension of Word2vec that represents a document [15], and Doc2vec is an implementation of Paragraph Vector. The main change is that a document ID is treated as an additional word in the input. Words can have different meanings in different contexts; hence, the vectors of two documents that contain the same word in two distinct senses need to account for this distinction. Thus, this model represents a document with word meaning or context.
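As a minimal sketch of the Paragraph Vector idea with Gensim's Doc2vec (recent Gensim versions; the paper's own implementation runs on an older Gensim under Python 2.7), assuming hypothetical word lists and illustrative parameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical tokenized documents, each tagged with a document ID
tagged = [
    TaggedDocument(words=["kernel32", "dll", "loadlibrarya"], tags=["doc_0"]),
    TaggedDocument(words=["user32", "dll", "messageboxa"], tags=["doc_1"]),
]

# Train a small Paragraph Vector model (parameters are illustrative only)
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)

# Infer a fixed-length vector for a new, unseen document
vec = model.infer_vector(["kernel32", "dll", "createfilea"])
print(vec[:5])
```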

4 NLP-based detection model

This section presents our detection model, which is based on a previous study [8]. The previous study used an SVM model to classify samples; the main revision is the addition of several classifiers. The structure of our detection model is shown in Fig. 1.

Fig. 1 Structure of the NLP-based detection model

Our detection model consists of language models and classifiers; one language model and one classifier are selected. In the training phase, the selected language model is constructed with the words extracted from malicious and benign samples. The constructed language model extracts the lexical features, and the selected classifier is trained with the lexical features and labels. In the testing phase, the constructed language model and the trained classifier classify unknown executable files as malicious or benign.

4.1 Training

The training procedure is shown in Algorithm 1. Our method extracts all printable (ASCII) strings from the malicious and benign samples and splits the strings into words. The frequent words are selected from each class. Thereafter, the selected language model is constructed from the selected words. Our method uses a Doc2vec or LSI model to represent the lexical features; the Doc2vec model is constructed from the corpus of the words, and the LSI model is constructed from the TF-IDF scores of the words. These words are converted into lexical features with the selected language model, yielding labeled feature vectors. Thereafter, the selected classifier is trained with the labeled feature vectors. The classifiers are Support Vector Machine (SVM), Random Forests (RF), XGBoost (XGB), Multi-Layer Perceptron (MLP), and Convolutional Neural Networks (CNN). These classifiers are popular in various fields and have distinct characteristics.

Algorithm 1 Training procedure
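The following is a rough sketch of this training procedure with the Doc2vec language model and the SVM classifier. It assumes the printable strings have already been extracted and split into word lists; the helper select_frequent_words, the parameter values, and the SVM settings are illustrative rather than the actual ones used in the experiments.

```python
from collections import Counter
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

def select_frequent_words(docs, n_words):
    """Keep only the n_words most frequent words of one class (hypothetical helper)."""
    counts = Counter(w for d in docs for w in d)
    keep = {w for w, _ in counts.most_common(n_words)}
    return [[w for w in d if w in keep] for d in docs]

def train(malicious_docs, benign_docs, n_words=250, dim=100):
    # Select the same number of frequent words from each class (see Sect. 5.3)
    mal = select_frequent_words(malicious_docs, n_words)
    ben = select_frequent_words(benign_docs, n_words)
    docs = mal + ben
    labels = [1] * len(mal) + [0] * len(ben)

    # Construct the language model (Doc2vec here) from the selected words
    tagged = [TaggedDocument(d, [i]) for i, d in enumerate(docs)]
    lang_model = Doc2Vec(tagged, vector_size=dim, min_count=1, epochs=20)

    # Convert words into lexical features and train the classifier
    features = [lang_model.infer_vector(d) for d in docs]
    classifier = SVC(kernel="rbf").fit(features, labels)
    return lang_model, classifier
```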

4.2 Test

The test procedure is shown in Algorithm 2. Our method extracts printable strings from unknown samples and splits the strings into words. The extracted words are converted into lexical features with the selected language model, which was constructed in the training phase. The trained classifier examines the feature vectors and predicts the labels.

Algorithm 2 Test procedure
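Continuing the hypothetical sketch from the training phase above, the test phase reduces to converting the words of unknown samples and predicting their labels:

```python
def classify(lang_model, classifier, unknown_docs):
    """Predict labels (1 = malicious, 0 = benign) for word lists of unknown samples."""
    features = [lang_model.infer_vector(d) for d in unknown_docs]
    return classifier.predict(features)

# Hypothetical usage with the objects returned by the training sketch above:
# lang_model, classifier = train(malicious_docs, benign_docs)
# labels = classify(lang_model, classifier, [["kernel32", "dll", "createfilea"]])
```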

4.3 Implementation

Our detection model is implemented in Python 2.7. Gensim [42] provides the LSI and Doc2vec models. Scikit-learn provides the SVM and RF classifiers, XGB is provided as a Python package, and the MLP and CNN are implemented with Chainer. The parameters are optimized in the next section.

5 Evaluation

5.1 Dataset

To evaluate our detection model, hundreds of thousands of PE samples were obtained from multiple sources. One source is the FFRI dataset, which is a part of the MWS datasets [5]. This dataset contains logs collected from the dynamic malware analysis system Cuckoo Sandbox and a static analyzer. The dataset is written in JSON format and is divided into the years 2013 to 2019 based on the year the samples were obtained. Each year's data except 2018 contain printable strings extracted from malware samples, and these data are used as malicious samples (FFRI 2013–2017, 2019). Note that this dataset does not contain the malware samples themselves. Printable strings extracted from benign samples are contained in the 2019 data as Cleanware. These benign data do not contain time stamps; hence, we randomly divided them into 3 groups (Clean A, B, and C). The other samples were obtained from Hybrid Analysis (HA), a popular malware distribution site. Almost ten thousand samples were obtained with our web crawler. These samples were submitted to VirusTotal and identified by multiple anti-virus programs. Thereafter, the identified samples were categorized into 2016 to 2018 based on the year they were first defined by Microsoft Defender. The printable strings were extracted from these malware samples (HA 2016–2018). To extract printable strings from these samples, we use the strings command on Linux, which outputs each string on one line. Our method uses each of these strings as a word. Our extraction method is identical to that of the FFRI dataset. Thus, our dataset is constructed from multiple sources.
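For illustration, the sketch below approximates the default behavior of the strings command (runs of at least four printable ASCII characters) in pure Python; the actual extraction in this study uses the command itself, and "sample.exe" is a placeholder path.

```python
import re

def extract_strings(path, min_len=4):
    """Extract printable ASCII strings, roughly like the Unix strings command."""
    with open(path, "rb") as f:
        data = f.read()
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Each extracted string is then treated as one word of the document
# words = extract_strings("sample.exe")
```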

Table 1 shows the number of samples, unique words, and family names in each dataset. Tables 2 and 3 show the top family names.

Table 1 The number of samples, unique words, and family names in each dataset
Table 2 Top 5 family names in the FFRI dataset
Table 3 Top 10 family names in the HA dataset

The unique words column indicates the number of distinct words extracted from the whole dataset. In an NLP-based detection method, the number of unique words is important because the classification difficulty and complexity mainly depend on this number. The family column indicates the number of distinct malware families defined by Microsoft Defender. Each benign dataset contains a huge number of unique words, which means these benign samples are well distributed and not biased. Each malicious dataset contains a sufficient number of unique words and malware families, which suggests they contain not only subspecies of existing malware, but also new malware. Hence, these samples are well distributed and not biased.

5.2 Metrics

Several metrics are used to evaluate our detection model. These metrics are based on the confusion matrix shown in Table 4.

Table 4 Confusion matrix

In this experiment, Accuracy (A), Precision (P), Recall (R), and F1 score (F) are mainly used. These metrics are defined as Eqs. (5)–(8).

$$\begin{aligned} Accuracy&= \frac{TP + TN}{TP + FP + FN + TN} \end{aligned}$$
(5)
$$\begin{aligned} Precision&= \frac{TP}{ TP + FP } \end{aligned}$$
(6)
$$\begin{aligned} Recall&= \frac{TP}{ TP + FN } \end{aligned}$$
(7)
$$\begin{aligned} F1&= \frac{2 \cdot Recall \cdot Precision}{ Recall + Precision }. \end{aligned}$$
(8)

In this experiment, TP indicates correctly predicted malicious samples. Since our detection model performs binary classification, the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are also used. An ROC curve is a graph showing the performance of a classification model at all classification thresholds, and the AUC measures the two-dimensional area underneath the ROC curve.
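These metrics are readily computed with scikit-learn, which the implementation already depends on; the labels and scores below are made-up values used only to illustrate the calls.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical labels, predictions, and classifier scores (1 = malicious)
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
y_score = [0.9, 0.4, 0.2, 0.1, 0.8]

print("Accuracy :", accuracy_score(y_true, y_pred))   # Eq. (5)
print("Precision:", precision_score(y_true, y_pred))  # Eq. (6)
print("Recall   :", recall_score(y_true, y_pred))     # Eq. (7)
print("F1       :", f1_score(y_true, y_pred))         # Eq. (8)
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```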

5.3 Parameter optimization

To optimize the parameters of our detection model, Clean A–B and FFRI 2013–2016 are used. Clean A and FFRI 2013–2015 are used as the training samples, and the remaining Clean B and FFRI 2016 are used as the test samples.

First, the number of unique words is optimized. To construct a language model, our detection model selects frequent words from each class; the same number of frequent words is selected from each class. This process adjusts the lexical features and mitigates class imbalance problems [23]. The F1 score for each model is shown in Fig. 2.

Fig. 2 The F1 score for each model

In this figure, the vertical axis represents the F1 score, and the horizontal axis represents the total number of unique words. In the Doc2vec model, the optimal number of unique words is 500. In the LSI model, the F1 score gradually rises and reaches its maximum at 9000.

Thereafter, the other parameters are optimized by grid search, a search technique that has been widely used in machine learning research. The optimized parameters are shown in Tables 5 and 6.

Table 5 The optimized parameters in each language model
Table 6 The optimized parameters in each classifier

Thus, our detection model uses these parameters in the remaining experiments.
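As a sketch of the grid search step for one classifier (SVM), the snippet below uses scikit-learn's GridSearchCV on toy feature vectors; the parameter grid is an assumption for illustration and does not reproduce the grids behind Tables 5 and 6.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy feature vectors and labels standing in for the lexical features
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.15, 0.25], [0.85, 0.95]]
y = [0, 1, 0, 1, 0, 1]

# Candidate SVM parameters; the actual grid used in the experiments is not listed here
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_)
```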

5.4 Cross-validation

To evaluate the generalization performance, tenfold cross-validation is performed on the Clean A and FFRI 2013–2015. Figure 3 shows the result.

Fig. 3 The result of tenfold cross-validation

The vertical axis represents the Accuracy (A), Precision (P), Recall (R), or F1 score (F). Overall, each metric showed good performance, and the LSI model was more effective than the Doc2vec model. Thus, the generalization performance of our detection model was almost perfect.

5.5 Time series analysis

To evaluate the detection rate (recall) for new malware, the time series is important. The purpose of our method is to detect unknown malware, so in practical use the test samples should not contain samples earlier than the training samples. To address this, we consider the time series of the samples. In this experiment, Clean A and FFRI 2013–2015 are used as the training samples, and Clean B, FFRI 2016–2017, and HA 2016–2018 are used as the test samples. As described in Table 1, the training samples are selected from the earlier ones. Moreover, benign samples account for the majority of the test samples; this means the test samples are imbalanced, which represents a more practical environment. Thus, this combination of datasets is more challenging than cross-validation. The results of the time series analysis are shown in Figs. 4 and 5.

Fig. 4 The result of Doc2vec in the time series analysis

Fig. 5 The result of LSI in the time series analysis

The vertical axis represents the recall. Overall, the recall gradually decreases as time proceeds. The detection rates on the FFRI dataset are better than those on the HA dataset, probably because the training samples were obtained from the same source. Nonetheless, the recall on HA remains at almost 0.9. Note that these samples were identified by VirusTotal and categorized based on the year they were first defined. The LSI model was more effective than the Doc2vec model. Regarding the classifiers, the SVM and MLP achieved good accuracy. Thus, our detection model is effective against new malware. Furthermore, the same technique [23] mitigates class imbalance problems for executable files.

To visualize the relationship between sensitivity and specificity, the ROC curves of each model with FFRI 2016 are depicted in Figs. 6 and 7.

Fig. 6 The ROC curve of Doc2vec in the time series analysis

Fig. 7 The ROC curve of LSI in the time series analysis

The vertical axis represents the true positive rate (recall), and the horizontal axis represents the false positive rate. Our detection model maintains practical performance with a low false positive rate. As expected, the LSI model was more effective than the Doc2vec model. The best AUC score is 0.992, achieved with the LSI and SVM models.

The required time for training and testing on FFRI 2016 is shown in Table 7.

Table 7 The required time for training and test of FFRI 2016

This experiment was conducted on a computer with Windows 10, a Core i7-5820K 3.3 GHz CPU, 32 GB of DDR4 memory, and a Serial ATA 3 HDD. The training time depends mainly on the classifier; complicated classifiers such as CNN require more time for training. The test time remains flat regardless of the classifier, and the time to classify a single file is almost 0.1 s. This speed seems sufficient to examine all suspicious files arriving from the Internet.

5.6 Practical performance

In a practical environment, actual samples are more widely distributed; hence, the experimental samples might not represent the population appropriately. To mitigate this problem, a larger-scale dataset has to be used, and the training samples should be smaller. To represent the actual sample distribution, FFRI 2019 and Clean A–C are used, which together contain 500,000 samples. These samples are randomly divided into 10 groups: one group is used as the training samples, and the remaining 9 groups are used as the test samples. The training and testing are repeated 10 times. The average result of the practical experiment is shown in Fig. 8.

Fig. 8 The average result of the practical experiment

The vertical axis represents the Accuracy (A), Precision (P), Recall (R), or F1 score (F). Note that the training samples account for only 10 percent, which means the dataset is highly imbalanced. The LSI and SVM combination is the best, with an F1 score of 0.934. Thus, our detection model is effective in a practical environment.
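A minimal sketch of this 10-group split protocol is given below with placeholder samples; rotating the training group across the 10 repetitions is our reading of the description above, and the sample pool is purely illustrative.

```python
import random

# Hypothetical pool of (words, label) samples standing in for FFRI 2019 and Clean A-C
samples = [(["word%d" % i], i % 2) for i in range(100)]

random.seed(0)
random.shuffle(samples)
groups = [samples[i::10] for i in range(10)]  # 10 random groups

for k in range(10):
    train_set = groups[k]                                              # 10% for training
    test_set = [s for j, g in enumerate(groups) if j != k for s in g]  # 90% for test
    print(len(train_set), len(test_set))  # train and evaluate here, then average
```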

6 Discussion

6.1 Detecting new malware

In the time series analysis, our detection model was effective against new malware on the imbalanced dataset. The new malware samples are categorized into known malware and unknown malware. In this study, we assume that known malware samples are those previously defined by Microsoft Defender; these samples are new, but merely subspecies of existing malware. We also assume that unknown malware samples are those not defined by Microsoft Defender at that time. In this experiment, our detection model was trained on samples from 2015 and earlier; hence, samples newly defined in 2016 or later are regarded as new malware. The detection rates for known and unknown malware are shown in Table 8.

Table 8 Detection rate (recall) of the known and unknown malware (LSI and SVM)

The detection rate for unknown malware is on the same level as that for known malware. Thus, our detection model is effective against not only subspecies of existing malware, but also new malware.

6.2 Defeating packed malware and anti-debugging techniques

Our detection model uses lexical features extracted from printable strings. These features include API or argument names, which are useful for identifying malware. These useful strings, however, are obfuscated in packed malware, and several anti-debugging techniques might alter the lexical features. The test samples of the time series analysis are categorized into 4 types by PEiD: packed or unpacked, and anti-debugging or no anti-debugging. PEiD detects the most common packers and anti-debugging techniques with more than 470 different signatures in PE files. Since the FFRI dataset does not contain malware samples, we analyzed the HA dataset. Table 9 shows the detection rate for each malware type, and Tables 10 and 11 show the top names.

Table 9 Detection rate (recall) of each malware type (LSI and SVM)
Table 10 Top 10 PEiD names in the HA dataset
Table 11 Top 10 anti-debugging names in the HA dataset

Contrary to expectations, the detection rates are all on the same level. We also analyzed the benign samples, which contain 27,196 packed samples and 77,893 samples with anti-debugging techniques. The detection rate for each type is 0.988 to 0.997. Therefore, our method does not seem to simply flag packers or anti-debugging techniques as malware. One possible reason is that the API and argument names used for obfuscation, as well as the typical instructions, can still be extracted as printable strings; they must remain in the file for deobfuscation. Thus, our method is effective against packed malware and anti-debugging techniques.

6.3 Limitation

We are aware that our study may have some limitations.

The first limitation is attributed to our dataset. As described previously, this study used more than 500,000 samples; actual samples, however, might be more widely distributed, so our dataset might not represent the population appropriately. In fact, we cannot use all actual samples for evaluation. To the best of our knowledge, the best available mitigation is to use large-scale datasets from multiple sources.

The second is the lack of detailed analysis. In this study, we used a simple signature-based packer detector, which has a false negative rate of approximately 30 percent [38], and we did not fully identify the packer names of our samples. Hence, our experimental results may not be applicable to some sophisticated packers that are not detected by signature-based packer detectors. We identified our samples with VirusTotal and Microsoft Defender; as reported in a previous paper, labels in VirusTotal can change over time [56], which can affect the accuracy of our experiment. Further analysis is required to address these issues.

The third is the lack of comparison. In this paper, we focused on practical performance and did not compare our method with other related studies. Owing to issues with datasets and implementations, a fair comparison was not feasible; further experiments are required to provide one.

7 Conclusion

In this paper, we applied NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of more than 500,000 samples obtained from multiple sources. The training samples were selected from the earlier samples, and the test samples contain many benign samples and are therefore imbalanced. In the further experiment, the training samples account for only 10 percent of the dataset, making it highly imbalanced. Thus, our dataset represents a more practical environment. Our experimental results show that our detection model is effective against not only subspecies of existing malware, but also new malware. Furthermore, our detection model is effective against packed malware and anti-debugging techniques.

Our study clearly has some limitations. Despite this, we believe it can be a starting point for evaluating practical performance, and our method might be applicable to other architectures. In future work, we will analyze the samples in more detail; a more precise packer detector will improve the reliability of this study.