1 Introduction

The increasing number and complexity of Cyber security attacks in recent years have brought data mining techniques to the attention of researchers and experts as important tools for detecting such attacks through the analysis of data and side effects left by malware and spyware programs and by incidents of network and host intrusion. In particular, text analysis and mining are widely used in many Cyber security areas, such as malware detection and classification (e.g. [14, 15, 16, 18, 22, 24, 27, 29, 33]) and malicious code detection (e.g. [7, 30, 31]). The popularity of social media has opened the door for text mining and analysis as important techniques for increasing knowledge about users’ context, e.g. their location and time, and for combining that knowledge with other attributes related to important events, topics, emotions and interests [21]. Text mining and analysis have also been used for predicting links [8] and detecting leaks of confidential data [28] and the unintentional sharing of private health information [17], as well as in classical areas such as digital forensics [12], electronic communications analysis [36] and Web text analysis [19].

Despite all this popularity, the Cyber security community has remained somewhat reluctant to adopt an open approach to security-related data, due to many factors. These include political factors, stemming from the continuing hesitance of organisations to share data, and technical factors related to the consistency and quality of the data and the lack of consensus on the variables that should be monitored or the metrics that should be used to quantify the security data themselves [20]. Nonetheless, in recent times, this trend has started to turn with the arrival of large and open security data sets and data-sharing platforms backed by the reliability and reputation of well-established organisations in the area of Cyber security. Examples of these include VCDB [34], CERT’s Knowledgebase at Carnegie Mellon University [10], SecRepo [26], CAIDA [9] and others. As a result, we are starting to witness an increasing trend in the usage of data sets in gaining insight into how Cyber incidents occur and how they should be defended against. A slightly different approach to the sharing of knowledge is based on what is known as community data sets, where a pre-set-up federation allows organisations to share data, knowledge and experience. Good examples of such federations are the European Cyber Security Organisation (ECSO) [1] and the European Network and Information Security Agency (ENISA) [2]. Other, national-level Cyber security hubs have also been set up around the world for the same purposes.

The main goal of this study is to demonstrate that experience can be shared in a community of organisations based on a model of text analysis, where text classifiers can be used across organisations to predict future aspects of Cyber incidents. We consider, as a case study, one community data set that represents Cyber security incidents and response actions collected from five Korean companies. The data set was obtained from the KAITS Industrial Technology Security Hub [3] in Korea. In that regard, the paper makes the following contributions:

  • First, we demonstrate how Cyber security-related questions can be answered by analysing a data set. In our analysis, we focus on questions related to the prediction of future responses to Cyber incidents from the type of malware or the name of the malicious code encountered in previous incidents. We demonstrate that this can be achieved with good accuracy using standard text classification algorithms.

  • We also construct an initial model of how trained instances of the text classifiers can then be shared across the companies without revealing their local data sets, therefore lifting the contribution of this paper from a purely technical level (text analysis) to a higher organisational level (experience sharing).

Our paper is an expanded version of recent work presented in [27]. We expand on [27] by providing a more comprehensive related work survey and discussion, as well as full details of the motivating scenario, the data set used in the case study, the analysis model and the results of the classification process. Based on these results, we also give an initial experience-sharing model as a starting point for future research.

The rest of the paper is structured as follows. In Sect. 2, we discuss other works in the literature related to this paper. In Sect. 3, we describe the scenario motivating our work and the underlying data set sample used. In Sect. 4, we outline the research questions and the text mining model used in predicting the answers to these questions. In Sect. 5, we present the results of the experiments, and in Sect. 6, we discuss some aspects of these results. In Sect. 7, we outline a model of how companies in our scenario can share experience through the sharing of their classifier instances. Finally, in Sect. 8, we conclude the paper and outline some directions for future research.

2 Related work

The use of text analysis and data mining in detecting vulnerabilities and Cyber security threats is a research area that has been active for a number of years now, particularly with the massive proliferation of social media and user content-based applications. Most of this effort has focused on the extraction of text in order to derive information related to incidents. We review here a few examples of works that have used text and data analysis for the detection of malware and malicious code based on a variety of techniques.

Text mining has been used extensively in the area of intrusion detection, e.g. in [5, 6, 22, 23, 31, 33, 37]. In [5], the authors presented a method based on byte n-gram analysis to detect malicious code using common n-gram (CNG) analysis, which relies on profiles for class representation. The results showed that the applied method achieved 100% accuracy on training data and 98% accuracy in threefold cross-validation. Similarly, the authors in [31] presented a method to analyse suspicious files using OpCode n-gram patterns, which are extracted from the file data after disassembly, for detecting unknown malicious code. The OpCode n-gram patterns can then be included in anti-virus programs as signatures. The evaluation of the methodology was performed using a test collection comprising more than 30,000 files, in which OpCode n-gram patterns of various size representations and eight types of classifiers were evaluated. The evaluation results indicated that the proposed method achieved levels of accuracy higher than 96%.

In [6], text categorisation was used to learn the characteristics of normal and malicious user behaviour from data stored in the log files of Web servers, more specifically in the context of tele-medicine applications where malware programs can undermine user privacy. In [23], a framework was designed for the detection of intrusions based on system calls, using text processing and data mining techniques. Suspicious system calls are textually analysed and then clustered using the K-means method [25] to determine whether they fall within the malicious group of calls. The authors in [33] used data mining and text classification methods to detect security threats by extracting relevant information from various unstructured log messages, while in [22], the authors presented a text mining-based anomaly detection (TMAD) model to detect HTTP attacks in network traffic. TMAD is an anomaly detector that uses n-gram text categorisation and term frequency-inverse document frequency (TF-IDF) methods.

In addition, the authors in [37] proposed a method to automatically detect malicious code using n-gram analysis. The proposed method selected features based on information gain. Probabilistic neural networks were used in building and testing the proposed multi-classifier system. The individual classifiers were used to produce classification evidence, which was then combined using the Dempster–Shafer combination rules [13, 32] to form the final classification result for new malicious code. Experimental results showed that the detection engine outperformed the individual classifiers. Other works have used more general (i.e. non-textual) data mining methods in the context of malware detection and analysis. For example, in [15], the authors utilised hooking techniques to trace the dynamic signatures that malware programs try to hide, in which behaviour records are used for classification training and for building a description model. Machine learning algorithms, such as naïve Bayes, J48 (decision tree) and support vector machine, were then used for the classification process. The authors in [18] proposed a graph-mining method to detect variants of malware using static analysis, while addressing the defects of existing approaches. The proposed approach combines static analysis and graph-mining techniques. In addition, a novel algorithm, called the minimal contrast frequent subgraph miner (MCFSM), was proposed for extracting minimal discriminative and widely employed malicious behaviour patterns. The proposed method showed high detection rates and low false-positive rates. Additionally, in [14], an API-based association mining method was proposed for detecting malware. Criteria for API selection and for association rule selection were proposed to reduce the number of rules and to improve their quality. A classification method based on multiple association rules was adopted. The experiments showed that the proposed strategies can significantly improve the running speed of object-oriented association mining, reducing the time cost of data mining by 32% and the time cost of classification by 50%.

Another area that has benefited largely from the application of data mining techniques is spyware detection and analysis. For example, in [7], a method was proposed for spyware detection using data mining techniques. The framework used the breadth-first search approach, which is known to work well for detecting viruses and similar software. Experimental results showed that the method had an accuracy of 90.5% in detecting spyware.

The authors in [35] proposed an integrated architecture to defend against surveillance spyware, using features extracted from both static and dynamic analysis. These features were ranked according to their information gain. In addition, a support vector machine classifier was created for each client. In order to keep the classifier up to date, a server collects reports from all clients and then retrains and redistributes the new classifier instance to each client. The proposed spyware detection system achieved an overall accuracy rate of 97.9% for known surveillance spyware programs and 96.4% for unknown programs.

Our approach in this paper differs from all of the above works in that it not only demonstrates the use of text and data mining techniques in the analysis of Cyber security data sets, but also defines an initial model of how experience derived from analysing the data and training the classifiers can be shared in a community of organisations and used in future predictions of Cyber incidents across these organisations. In this respect, our paper contributes to the technical as well as the organisational dimensions of the problem. We are also more focused on predicting features of incidents rather than detecting an incident as a whole.

3 Motivating scenario and data set

We describe in this section the scenario motivating our work and the sample data set collected and analysed in the paper.

3.1 Description of the scenario

The main scenario motivating our work was based on the current operational model of the KAITS Industrial Technology Security Hub [3] in Korea. The hub is a public-private partnership supported by Korean governmental agencies in order to promote the sharing of Cyber security knowledge, experience and expertise across small and medium enterprises (SMEs) and to increase their ability to respond to future incidents. In our case, this ability is interpreted as the ability to predict attributes of future Cyber security incidents based on features of existing ones. The scenario is depicted in Fig. 1.

Fig. 1 Experience-sharing scenario across n organisations

Underlying this scenario is the assumption that each instance of a classifier represents the experience of the company that owns it. This assumption is based on the fact that the classifier is trained using the company’s past data, which represent the company’s past experience with Cyber security incidents. Therefore, as depicted in the figure, the sharing of experience is simply translated in our case as the sharing of trained classifiers. This sharing can be regulated through some experience or classifier sharing agreement, which would rely on the predictive qualities of each instance of the classifier being shared. The scenario brings with it a number of operational constraints; however, one in particular has an impact on the approach we follow for the rest of the paper.

Constraint [Each company’s data set is secret to that company]. Companies share their trained classifier instances but not their internal data sets. Therefore, we cannot assume in this scenario the existence of an overall hub-level data set composed of the individual internal company data sets, and thus cannot create a single instance of a classifier. As a result, the scenario requires that the sample data sets collected be kept isolated from one another, leading to the creation of different instances of the classification algorithms. In practical terms, this meant that we had to keep the individual company files separate rather than combining them into one single master file.
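As a minimal illustration of this constraint, the following Python sketch keeps one data frame per company and deliberately avoids building a combined master data set; the file names and the use of pandas are illustrative assumptions only, not part of the actual hub infrastructure.

```python
import pandas as pd

# Hypothetical per-company incident files; the names are illustrative only.
company_files = {
    "company_1": "company_1_incidents.csv",
    "company_2": "company_2_incidents.csv",
    "company_3": "company_3_incidents.csv",
    "company_4": "company_4_incidents.csv",
    "company_5": "company_5_incidents.csv",
}

# Each company's data are loaded into their own frame; there is deliberately
# no concatenation into a single master data set, so each classifier instance
# can only ever be trained on the data of the company that owns it.
company_data = {name: pd.read_csv(path) for name, path in company_files.items()}
```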

3.2 Description of the data set

The data set represents a sample of Cyber security intrusion events detected in five Korean SMEs over a period of ten months and collected by the KAITS industrial hub [3]. These companies are known by the pseudonyms DF, MT, SE, EP and MS. In line with the constraint of Sect. 3.1, the data for each company are stored in a separate file, isolated from the files of the other companies. There are 4643 incidents overall in the data set, distributed as shown in Table 1.

Table 1 Incidents distribution

Each entry in a file (expressed as a row) has the following metadata:

  • Date and Time of Occurrence: this is a date/time value representing the date and time of the incident’s occurrence.

  • End Device: this is a textual (string) value representing the name of the end device affected in the incident.

  • Malicious Code: this is a textual (string) value representing the name of the malicious code detected in the incident.

  • Response: this is a textual (string) value representing the response action that was applied to the malicious code.

  • Type of Malware: this is a textual (string) value representing the type of the malware (malicious code) detected in the incident.

  • Detail: this is a textual (string) value that freely describes any other detail about the incident (e.g. the location on the computer where the malware resided).

The following is an example entry from one of the files in the data set:

(14/02/2017 11:58, rc0208-pc, Gen:Variant.Mikey.57034, deleted, virus, C:\Users\RC0208\AppData\Local\Temp\is-ANFS3.tmp\SetupG.exe)
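For illustration, the following minimal Python sketch parses an entry of this form into the six metadata fields described above; the comma-separated layout and the date format are taken from the example, while the Incident dataclass itself is purely illustrative.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    occurred_at: datetime   # Date and Time of Occurrence
    end_device: str         # End Device
    malicious_code: str     # Malicious Code
    response: str           # Response
    malware_type: str       # Type of Malware
    detail: str             # Detail (free text, e.g. a file path)

def parse_entry(entry: str) -> Incident:
    # Split on the first five commas only, since the Detail field may itself contain commas.
    date_s, device, code, response, mtype, detail = entry.split(",", 5)
    return Incident(
        occurred_at=datetime.strptime(date_s.strip(), "%d/%m/%Y %H:%M"),
        end_device=device.strip(),
        malicious_code=code.strip(),
        response=response.strip(),
        malware_type=mtype.strip(),
        detail=detail.strip(),
    )

example = ("14/02/2017 11:58, rc0208-pc, Gen:Variant.Mikey.57034, deleted, virus, "
           r"C:\Users\RC0208\AppData\Local\Temp\is-ANFS3.tmp\SetupG.exe")
print(parse_entry(example))
```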

4 Research questions and methodology

Our research in this paper investigates one class of problems, to which we take a text classification approach. This class of problems concerns the prediction of future aspects of Cyber security incidents. More specifically, we are concerned with questions related to how organisations can gain the ability to predict response actions to future Cyber security incidents involving malware. We consider the following two questions:

Q1. How to predict a response action from the name of malicious code

Q2. How to predict a response action from the type of malware involved

The significance of questions like these is that they can help an organisation plan efficiently the level of resources it needs to allocate in response to future incidents, as well as take further actions to limit the spread of malware and maintain the organisation’s business continuity. We note that the three metadata elements of the data set relevant to the above two questions, and used in the classification process, are Malicious Code, Response and Type of Malware. We discuss later the classes we extracted for each of these metadata elements.

Our proposed model consists of the following two main phases: (1) analysis and (2) classification. This model is illustrated in Fig. 2. The purpose of the model is to prepare the data for feature selection and classification.

Fig. 2 Our analysis and classification model

We next discuss the two phases of this model.

  1. The analysis phase is divided into two sub-phases: pre-processing and feature extraction. These are defined as follows:

    Pre-processing: This phase is mainly responsible for text pre-processing, as some of the data entries in the data set consist of textual values (e.g. Detail, Malicious Code). The main objective of pre-processing is to clean the data of noise in order to improve the accuracy of the results by reducing errors in the data. This is done by removing special characters, stop words such as “a” and “the”, punctuation marks such as question and exclamation marks, and numbers. In addition, all terms are converted to lowercase. The resulting terms are used to generate the n-gram features. For example, one Response entry reads “The file is a malware and has been blocked”; after pre-processing, the following information is extracted: “file malware blocked”.

    Feature extraction: In this phase, features are extracted from the three main metadata elements in the data set (i.e. Malicious Code, Response and Type of Malware). The most commonly used features in text mining are n-grams and bag-of-words. The model makes use of “bigrams”, which are n-grams where n=2; for example, every two adjacent words form a bigram such as “malware detection”. In this phase, a bag-of-words containing all bigrams is created, and features are derived from the three metadata elements above. This bag-of-words is filtered based on the minimum term frequency method, where terms that occur less often than the minimum frequency are filtered out and not used as features. For example, after pre-processing, the Response entry above yields “file malware blocked”; once terms occurring less often than the minimum frequency are filtered out, the features extracted from this entry are “malware” and “blocked”. (A minimal code sketch of both sub-phases is given after this list.)

  2. The classification phase makes use of four machine learning algorithms, namely the J48 decision tree, support vector machine (SVM), naïve Bayes (NB) and random forest (RF), in order to build, test and compare the n-gram feature predictive models. Each company’s data set is split into training and test sub-data sets. The training data set is used for building the model, and the test data set is used to evaluate its performance. The features extracted in the analysis phase are used in the classification process.
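The following is a minimal sketch of the two analysis sub-phases in Python with scikit-learn; the stop-word list and the minimum term frequency threshold (min_df) are illustrative assumptions rather than the exact settings used in the study.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stop-word list; the exact list used in the study is not reproduced here.
STOP_WORDS = {"a", "the", "is", "and", "has", "been"}

def preprocess(text: str) -> str:
    """Lowercase, remove punctuation/special characters and numbers, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)          # strip punctuation, digits, symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("The file is a malware and has been blocked"))   # -> "file malware blocked"

# Bigram bag-of-words filtered by a minimum term frequency; min_df=2 is illustrative.
vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=2, preprocessor=preprocess)
```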

To assess the performance of the machine learning classifiers, KNIME [4] was used. The experiments were set up using the typical tenfold cross-validation, i.e. the data set is split into ten folds, and each fold is used, in turn, for testing, while the other nine are used for training. The output of the training process is a model, which is then used for classification on the test fold. The labels produced by the model are matched against the true labels, and typical performance indicators, such as accuracy, precision, recall and the F-measure [11], are computed.
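The experiments themselves were run in KNIME; the sketch below is a rough, hedged equivalent in Python with scikit-learn, with DecisionTreeClassifier standing in for J48 (a C4.5 implementation), applied to one company's pre-processed texts and response labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier  # stands in for J48 (C4.5)

classifiers = {
    "J48": DecisionTreeClassifier(),
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "RF": RandomForestClassifier(),
}

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

def evaluate(texts, labels):
    """Tenfold cross-validation of each classifier on one company's data, kept separate."""
    results = {}
    for name, clf in classifiers.items():
        pipe = make_pipeline(CountVectorizer(ngram_range=(2, 2), min_df=2), clf)
        results[name] = cross_validate(pipe, texts, labels, cv=10, scoring=scoring)
    return results
```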

5 Results

In this section, we present the results of applying each of the four machine learning algorithms to each of the five companies’ data sets, in order to predict the answers to the questions posed in the previous section. For brevity, we show only the accuracy results for each case, presented as percentage values. More detailed results can be found in the Appendix at the end of the paper.

5.1 Q1 Results

Table 2 presents the classification performance results of the J48, NB, RF and SVM classifiers for the five companies for the question “How to predict a response action from the name of malicious code”. Full details can be found in Tables 7, 8, 9, 10 and 11 in the Appendix.

Table 2 Performance of the classifiers for identifying the types of response based on malicious code—best results are highlighted in bold

For Company 1 (DF), SVM demonstrated the highest classification accuracy with 87%, whereas RF was the lowest with 82% accuracy. J48 and NB classifiers had a low recall value for “deleted” responses. More specifically, looking at where the errors occurred, J48 and NB classified most of this category as “segregated”. Similar to “deleted” responses, J48, NB and RF had relatively low recall values for “not defined” responses and classified a small percentage of these as “deleted” and “segregated”. In addition, SVM and RF could not identify “blocked” responses and incorrectly classified these as “segregated”.

For Company 2 (MT), SVM and RF showed the highest accuracy with 87%, whereas accuracy was 85% for NB, which was the lowest value. For this company, all the classifiers had 0% precision, recall and F-measure values for the cases of “none” and “name changed” responses, which were misclassified as “deleted” and “segregated”. Furthermore, J48 had the lowest recall value for “blocked” responses and misclassified a percentage of these as “deleted”. SVM, RF and NB also had low recall values for the “deleted” response, as a small percentage of these were misclassified as “segregated”.

For Company 3 (SE), the classification accuracy for J48, SVM and RF was 89%, and for NB was 85%. Similar to Company 2 (MT), all the classifiers could not identify “name changed” responses and had 0% precision, recall and F-measure values. This category was misclassified as “none” and “segregated”. In addition, NB had low recall values for “deleted” and the rest of the classifiers had 0% precision, recall and F-measure values. Furthermore, J48, SVM and RF misclassified a percentage of “none” responses as “segregated”.

For Company 4 (EP), SVM had the highest classification accuracy with 91% and RF had the lowest accuracy with 82%. J48 and SVM had 0% precision, recall and F-measure values for “none” responses, and these were misclassified as “segregated”. In addition, similar to Companies 2 (MT) and 3 (SE), all classifiers could not identify “name changed” responses and these were misclassified as “segregated”. Furthermore, SVM and NB had a lower recall value for “deleted” responses and J48 and RF had 0% precision, recall and F-measure values for “name changed” responses, which were misclassified as “segregated” and “blocked”. J48 misclassified “recovered” responses as “blocked”.

Finally, for Company 5 (MS), J48, SVM and RF had the highest accuracy value with 93% and NB had the lowest accuracy value with 89%. All the classifiers had 0% precision, recall and F-measure for “none” and “name changed” responses and misclassified these as “deleted” and “segregated”. Moreover, all the classifiers had relatively low recall values for “segregated” responses and misclassified a percentage of these as “deleted”.

5.2 Q2 Results

Table 3 presents the classification performance details of the J48, NB, RF and SVM classifiers for the five companies for the question “How to predict a response action from the type of malware involved”. Full results are shown in Tables 12, 13, 14, 15 and 16 in the Appendix.

Table 3 Performance of the classifiers for identifying the types of response based on malware—best results are highlighted in bold

For Company 1 (DF), J48, RF and NB had similar accuracy results of 66%, while SVM had the lowest accuracy of 52%. All classifiers had 100% precision, recall and F-measure values for the response type “not defined”. For “deleted” responses, J48, RF and NB had high precision, recall and F-measure values, whereas SVM had a recall value of 100% but low precision and F-measure values. In addition, for “blocked” responses, J48, RF and NB had high recall values of 100%, while SVM could not identify this type and had 0% precision, recall and F-measure values. SVM misclassified this type as “deleted”. Furthermore, all the classifiers failed to identify or classify other types of responses such as “none”, “recovered”, “segregated” and “name changed” based on the malware type and had 0% precision, recall and F-measure values as a result. These types were all misclassified as “deleted”.

For Company 2 (MT), all classifiers had similar accuracy of 70%. The classifiers were able to classify only two types of responses, which were “not defined” and “segregated” and had 100% recall values for both. However, they failed to identify or classify other types of response such as “deleted”, “none”, “blocked” and “name changed” and consequently had 0% precision, recall and F-measure values for these responses. These were then wrongly classified as “segregated”.

For Company 3 (SE), all classifiers had similar accuracy of 66% and recall values of 100% for two types of responses, which were “not defined” and “blocked”. In addition, all classifiers had similar precision, recall and F-measure values for the response type “deleted”, although some percentage of cases of this type were misclassified as “blocked”. Moreover, all the classifiers failed to identify or classify other types of responses such as “none”, “segregated” and “name changed” and therefore had 0% precision, recall and F-measure values for these types. These types were then wrongly classified as “blocked”.

For Company 4 (EP), all classifiers had similar accuracy values of 75% and recall values of 100% for two responses: “not defined” and “blocked”. In addition, all classifiers had similar recall values of 80% for “deleted”. However, 20% of “deleted” were misclassified as “blocked”. Furthermore, all the classifiers failed to identify or classify other types of responses such as “none”, “recovered”, “segregated” and “name changed” and had 0% precision, recall and F-measure values. These were misclassified as “blocked” and “deleted”.

Company 5 (MS) had the best classification performance in which all classifiers had an accuracy value of 98%. However, similar to the previous four companies, all the classifiers failed to identify or classify other types of responses such as “none”, “recovered” and “name changed” and had 0% precision, recall and F-measure values. These types were classified as “deleted” and “segregated”. In addition, J48, SVM, RF and NB had recall values of 100% for two types of responses, “not defined” and “segregated”, and a recall value of 98% for response type “deleted”.

6 Implications

In this case study, many factors affected the machine learning-based identification and classification process outlined in Sect. 4. We discuss here the results of the previous section in the light of these factors.

The overall results for the identification of different types of responses based on the given malicious code indicated that SVM was the best classifier for all five companies in terms of performance and accuracy. In addition, most classifiers could not identify response categories such as “none”, “blocked” and “deleted”. The results for the identification of the different types of responses based on the given malware showed that all classifiers were suitable for some types of responses, but they were still unable to identify or classify every response type. The poor performance of the classifiers was due to the fact that some types of malware were assigned by companies to multiple response types (e.g. “segregated” and “name changed” were both assigned to the malware type “virus”), and the very high or very low frequency of some types affected the classification due to the imbalance of the categories. This data imbalance problem is clearly visible from the data distributions in Tables 4 and 5.

Table 4 Type of response distribution for the five companies showing data imbalance
Table 5 Type of malware distribution for the five companies showing data imbalance
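To make the notion of imbalance concrete, the following minimal Python sketch computes per-class counts, shares and an imbalance ratio for a label column such as Response; the label counts shown are illustrative only, and the actual distributions are those reported in Tables 4 and 5.

```python
import pandas as pd

def imbalance_summary(labels: pd.Series) -> pd.DataFrame:
    """Per-class counts, shares and imbalance ratio for a label column (e.g. Response)."""
    counts = labels.value_counts()
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    summary["imbalance_ratio"] = counts.max() / counts  # 1.0 for the majority class
    return summary

# Illustrative label counts only; the real distributions are given in Tables 4 and 5.
example = pd.Series(["deleted"] * 120 + ["segregated"] * 30 + ["name changed"] * 2)
print(imbalance_summary(example))
```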

We can summarise the above observations by the following points:

  1. The classification accuracy was affected by the imbalance of the data set categories, as shown in Tables 4 and 5.

  2. The performance of the classifiers was affected by the multi-labelling of some of the data categories.

  3. The inconsistency of the categories used across the five companies (e.g. types of responses and malware), as shown in Tables 4 and 5, made the evaluation task difficult.

  4. Our results showed that SVM is more suitable for the detection of types of response using malicious code, whereas J48, RF and NB are more suitable for the detection of types of response using malware.

    Table 6 shows the average performance of the classifiers against the two questions considered in the study. It is clear from this table that Q2 was the more difficult question to predict the answer for, hence the lower performance of the classifiers.

Table 6 Average performance of the four classifiers used for both questions

7 An experience-sharing model

In this section, we define an experience-sharing model that builds on the scenario of Sect. 3 and uses the results we obtained in Sect. 5 to demonstrate, in principle, how experience sharing can be implemented across organisations based on data prediction. The main motivation behind such experience sharing is to improve the ability of an organisation to predict features related to future incidents, e.g. in our case, improve the ability to answer the questions of Sect. 4.

We define experience sharing as the sharing of an instance of a classification algorithm that has been trained using data belonging to some company. Sharing implies sending a classifier, which we term the internal classifier, and possibly receiving either the other company’s classifier, termed the external classifier, or some form of reward in exchange for sending the internal one. We do not consider here decisions related to the sending of a classifier instance, since this is determined by the quality of what is received. Therefore, we only focus on decisions related to the receiving of the other company’s classifier, where such external classifiers are compared in their quality to the internal ones to determine whether they are acceptable from the receiving company’s point of view. Since the assumption is that we compare like for like in terms of the algorithms used, the difference will purely reflect the quality of the experience of either company as expressed by its local data set. For simplicity, we leave out of the scope of this model other administrative factors, such as the policy or contract governing the sharing process, and will consider these in future (richer) versions of this model.

Assuming \({\mathcal {C}}\) is the set of all classification algorithms being considered across organisations, where in our case \({\mathcal {C}}=\{\textit{J48}, \textit{SVM}, \textit{RF}, \textit{NB}\}\), and \({\mathcal {Q}}\) is the set of questions an organisation aims to answer, here \({\mathcal {Q}}=\{Q1, Q2\}\), then we define a classifier-receiving decision function, \(\zeta :{\mathcal {C}}\times {\mathcal {C}}\times {\mathcal {Q}}\rightarrow {\mathbb {B}}\), as a predicate:

$$\begin{aligned} \zeta (c_\mathrm{int}, c_\mathrm{ext}, q)= \left\{ \begin{array}{l} \text {if}\;\text {(accuracy}(c_\mathrm{ext},q)- \text {accuracy}(c_\mathrm{int},q))<\ell ~~\text {then}~~\mathbf{F }\\ \text {otherwise}~~\mathbf{T } \end{array} \right. \end{aligned}$$

where \(c_\mathrm{int}\in {\mathcal {C}}\) is the internally trained classifier, \(c_\mathrm{ext}\in {\mathcal {C}}\) is the externally trained classifier that the organisation is considering importing, and \(q\in {\mathcal {Q}}\) is the question the company is attempting to predict the answer for. The purpose of this decision function is to determine whether a classifier should be received (True, T) or not (False, F). In this case, we have restricted the decision function to a comparison of classifier accuracy only. The function returns false (F) if the difference in accuracy between the externally and the internally trained classifiers is less than some predefined threshold percentage value, \(\ell \%\). This means that the decision is not to receive the external classifier, as it would be of little use to the company: its accuracy in predicting the answer to the question does not improve on that of the internally trained classifier by at least \(\ell \%\). Note again here that we are assuming that the algorithm (e.g. J48, SVM, etc.) underlying both classifier instances is the same. Otherwise, if the external classifier exceeds the internal one by at least \(\ell \%\) in accuracy, the decision is true (T), i.e. to receive the external classifier. As noted earlier, it is straightforward to change the function so that it compares based on any other metric.
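The decision predicate can be transcribed directly into code; the following minimal Python sketch is one such transcription, where accuracy is assumed to be a lookup of the cross-validated accuracies of Sect. 5 and threshold corresponds to \(\ell \).

```python
from typing import Any, Callable

# accuracy(c, q): the accuracy of classifier instance c on question q (cf. Sect. 5).
AccuracyFn = Callable[[Any, str], float]

def zeta(c_int: Any, c_ext: Any, q: str, accuracy: AccuracyFn, threshold: float) -> bool:
    """Classifier-receiving decision function.

    Returns True (receive the external classifier) only if the external
    instance's accuracy on question q exceeds the internal one's by at
    least `threshold` (the value of l in the text); otherwise False.
    """
    return (accuracy(c_ext, q) - accuracy(c_int, q)) >= threshold
```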

Figure 3 shows how the process of experience sharing would work between two companies using the decision function, \(\zeta \).

Fig. 3 The process of experience sharing between two companies

Example. Let us consider an example of how the above decision function can be used. Assume Companies 1 (DF) and 5 (MS) would like to share their knowledge of answering Q1. In this case, both would select SVM as their best classifier. The sharing process, from both companies’ points of view, would result in the following two calculations of the \(\zeta \) function, assuming that \(\ell _\mathrm{DF}=5\%\) and \(\ell _\mathrm{MS}=2\%\):

$$\begin{aligned}&\zeta _\mathrm{DF}(\text {SVM}_\mathrm{DF}, \text {SVM}_\mathrm{MS}, Q1)\\&\quad = \text {if}\; (\text {accuracy(SVM}_\mathrm{MS}, \text {Q1)}-\text {accuracy(SVM}_\mathrm{DF, Q1}))<\; 5\%\; \text {then}\; \mathbf{F }\; \text {otherwise}\; \mathbf{T }\\&\quad =\text {if}\; (93\%-87\%)\;<\;5\% \;\text {then}\; \mathbf{F }\; \text {otherwise} \;\mathbf{T }=\mathbf{T }\\&\zeta _\mathrm{MS}(\text {SVM}_\mathrm{MS}, \text {SVM}_\mathrm{DF}, Q1)\\&\quad = \text {if}\; \text {(accuracy(SVM}_\mathrm{DF}, \text {Q1)}-\text {accuracy(SVM}_\mathrm{MS}, {\text {Q1}}))\;<\;2\% \;\text {then} \;\mathbf{F }\; \text {otherwise} \;\mathbf{T }\\&\quad = \text {if}\; (87\%-93\%)\;<\;2\% \;\text {then}\; \mathbf{F }\; \text {otherwise} \;\mathbf{T }=\mathbf{F } \end{aligned}$$

In the above example, we see that company DF has an incentive to accept company MS’s classifier, as it improves on its own classification accuracy. On the other hand, company MS has no incentive to accept company DF’s classifier, as the latter does not improve upon MS’s classification accuracy. As in other cases, the value of \(\ell \) will determine the outcome of the decision function. Setting \(\ell \) in a real-world case would require an internal study by the organisation. We consider this one possible direction for future research.
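Using the sketch of \(\zeta \) given above, the same two decisions can be reproduced as follows, with the accuracies taken from Table 2.

```python
# Accuracies from Table 2 for Q1: SVM trained on DF data = 87%, on MS data = 93%.
accuracies = {("SVM_DF", "Q1"): 0.87, ("SVM_MS", "Q1"): 0.93}
accuracy = lambda c, q: accuracies[(c, q)]

print(zeta("SVM_DF", "SVM_MS", "Q1", accuracy, threshold=0.05))  # True:  DF would accept MS's classifier
print(zeta("SVM_MS", "SVM_DF", "Q1", accuracy, threshold=0.02))  # False: MS would reject DF's classifier
```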

8 Conclusion

In this paper, we presented the results of a case study on analysing a Cyber security incidents community data set from five Korean SMEs. Our analysis framework was designed and implemented to address two kinds of problems: (1) the prediction of response actions to future Cyber security incidents involving malware or malicious code, and (2) the utilisation of knowledge of the response actions in guiding analyses to determine the type of malware or the name of the malicious code. Furthermore, text mining methods such as n-grams and bag-of-words were used to extract relevant features from the data set, and machine learning algorithms were then applied for the classification process. The experimental results showed good performance for the classifiers in predicting different types of response and malware. An initial model of how experience can be shared among companies was also defined based on the given scenario.

As future work, we aim to investigate whether other questions may be answerable within the scope of the features offered by the data set. We also plan to expand the analysis to other data sets in order to investigate the impact of handling the problem of class imbalance on the classification performance and accuracy. Finally, we aim to formalise a more comprehensive model for the sharing of text and data classifiers as a means of sharing experience across organisations. Such a model would allow organisations to “trade” their trained classifiers using, for example, a smart contract specifically designed for this purpose. Using blockchain technology, for example, one could envisage a scenario where successful predictions are rewarded, whereas unsuccessful ones are penalised, therefore leading to a fairer trade model.