1 Introduction

It is argued that innovativeness plays a crucial role in the development of various branches of the modern economy, such as medicine, agriculture, security, defence, and industry. Providing advanced products and services depends on smooth and efficient cooperation between companies and research units. As these two groups may have difficulties in finding appropriate partners or information, we have created an information platform, Inventorum, which aims to establish connections between them. In order to prevent users from being overwhelmed or distracted by useless data, the system recommends only innovations, projects, partners, and experts that suit them. Such an approach requires both the preparation of relevant information and its careful selection. In this paper, we focus on finding potentially innovative companies on the internet, which should help attract more participants, i.e. information consumers, to Inventorum. The system selectively gathers data from the internet. Then, each acquired website, consisting of a group of web pages, is analysed to establish whether or not it is related to an innovative company. This study evaluates the classification of such websites into two groups, namely (i) domains that represent innovative companies and (ii) others. For simplicity, we refer to these groups as (i) innovative domains or websites and (ii) non-innovative domains or websites. This classification does not mean that the internet domains themselves are innovative, but rather indicates whether the companies represented by these domains are innovative.

In reality, an innovative business is one that offers innovative products or services, employs professionals or scientists who are the best in their field, invests in research and development projects, and cooperates with other creative firms or top champions [9, 36, 71]. All of the above factors are disputable except one: an innovative product or service. The Oslo Manual [48] defines an innovation as something new or substantially enhanced and, at the same time, beneficial for consumers. It may begin as merely an idea that triggers organisational, process, or production change, resulting in invention, implementation, and diffusion [35]. Let us imagine how an individual could evaluate a company based on its website. First, he or she should look for innovations or legal protections. If these appear on the website and belong to the firm, it may be innovative. Nonetheless, not all entities possess innovations, yet they may still be innovative because they might be working on something ground-breaking. Thus, the individual analysing the company's web page must look for the inconclusive factors mentioned above, i.e. the particular types of people, investments, and cooperation.

We must mention that some attempts have been made to construct models for assessing company innovativeness based on inquiry tools, e.g. a survey to benchmark organisational readiness for digital innovation [43] or a questionnaire to find a relationship between cooperation with innovative partners and internal innovations [53]. However, these techniques require conducting surveys of the companies under examination. We want to develop an automatic model utilising machine learning techniques. As we can imagine, this task is not easy even for an expert, so the question is how it can be realised by a classification algorithm meant to substitute for the expert's work. Firstly, our imaginary algorithm has to retrieve innovation-related entities such as a person or a cooperating company. Then, it should assess whether these objects are relevant to the firm under examination. More specifically, the algorithm needs (i) to judge whether potential innovations are actually innovative; (ii) to assess whether persons are unique professionals; and (iii) to determine whether related companies are innovative. To the best of our knowledge, these tasks cannot be realised directly by any machine learning algorithm because the algorithm cannot judge, for example, whether a product/service or a professional is innovative without evaluating almost all possibilities across the world. Thus, an indirect technique, which we propose below, is required.

Although the concept of "an innovative website" is an abstract idea, similar to concepts such as what constitutes spam, pornography, or phishing [1, 4, 6, 39, 52, 68, 74, 75], it has never been adequately addressed by known machine learning techniques. Unfortunately, innovativeness is difficult to grasp and cannot be recognised by identifying typical words, phrases, origin addresses, or the like, as in spam detection. Innovativeness is something that we must determine from context, so we propose an original procedure, which assumes that it is possible to create a machine learning classification model that can determine whether a company is innovative based on automatic analysis of its website. The procedure analyses a data set from three different perspectives. The first perspective considers only the company's description located on its website. This text most likely contains descriptions of innovative products or services and some notes about the most important professionals. The second perspective analyses links included on the company's website, which may contain information about cooperating companies. The third perspective utilises all of the text on the website. It may be important if our assumptions about where particular information is located on the website turn out to be incorrect.

In our previous study [47], we proposed an experimental Naïve Bayes (NB) classification committee for categorisation of companies' websites into the innovative or non-innovative group. The committee was based on a voting idea and diversified feature spaces, feature weights, and two models of feature distribution, i.e. Bernoulli and Multinomial [19, 44]. We showed that the committee, as well as the selected individual NB classifiers, can solve the defined classification task adequately. Furthermore, we noted that the proposed ensemble procedure may require some improvement to provide better results. Therefore, in this study, we explore this task more deeply. The novelties of this paper are as follows:

  1. We investigate a novel classification task, i.e. recognition of Polish innovative companies' websites.

  2. We propose a novel, improved classification approach based on multiple data sources to accomplish the classification task mentioned above.

In the recent literature, we did not find any related works covering the topics mentioned above. In this study, we propose a new classification committee, which is an extended version of our previous proposition [47]. The primary enhancement relies on the application of a stacked generalisation method instead of a simple voting procedure. The objectives of the study are as follows:

  1. To demonstrate that the newly proposed method can improve the decision process and produce better overall classification quality than the previously proposed voting committee.

  2. To demonstrate that a genetic algorithm can create an unaligned feature space that improves the classification results of the simple voting procedure.

Both objectives are realised based on empirical experiments and obtained results.

This paper is structured as follows. Section 2 summarises previous and related works. Section 3 presents an overview of the proposed classification system of innovative websites. Section 4 describes the evaluation process of the proposed system and the results obtained during experiments. Finally, Section 5 concludes the findings.

2 Related works vs. proposed approach

A review of recent literature indicates that there are some works concerning detection of innovative themes on the internet [7, 50, 51]. However, the analysed approaches are insufficient to solve the problem presented in this study because they do not offer a direct solution for recognising whether a company is innovative based only on its public website. Thus, we constructed a classification model that can recognise innovative websites on the internet with sufficient quality. We propose a method related to ensemble learning. Ensemble learning refers to procedures employed to train multiple classifiers, combining their outputs and considering them a "committee" of decision makers. Various approaches accomplish this learning concept, for example, bagging, boosting, AdaBoost (adaptive boosting), stacked generalisation, mixtures of experts, and voting-based methods [2, 5, 11, 14, 40, 42, 58, 73]. The presented approach is closely related to stacked generalisation [66, 67, 70]. Stacking is an ensemble learning technique in which a set of models is constructed from bootstrap samples of a data set, and their outputs on a hold-out data set are then used as inputs to a meta-model. The set of base models is called level-0, whereas the meta-model is called level-1. The task of the level-1 model is to combine the set of level-0 outputs so as to classify the target correctly, thereby correcting any mistakes made by the level-0 models [60].
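The level-0/level-1 split described above can be sketched in a few lines of Python. The toy data, the single-feature threshold learners, and the lookup-table meta-model below are purely hypothetical stand-ins, not the classifiers used in this study; the sketch only illustrates how level-0 models trained on bootstrap samples feed a level-1 model fitted on hold-out outputs.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Synthetic 2-feature data: class 1 when x0 + x1 > 1.
data = [([random.random(), random.random()], 0) for _ in range(200)]
data = [(x, int(x[0] + x[1] > 1.0)) for x, _ in data]
holdout, train = data[:50], data[50:]

def fit_threshold(sample, feat):
    """Level-0 learner: the threshold on one feature that best splits the sample."""
    best_t, best_acc = 0.5, 0.0
    for t in [i / 20 for i in range(21)]:
        acc = sum((x[feat] > t) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x, t=best_t, f=feat: int(x[f] > t)

# Level-0: one model per bootstrap sample of the training part.
models = []
for feat in (0, 1):
    sample = [random.choice(train) for _ in range(len(train))]
    models.append(fit_threshold(sample, feat))

# Level-1: learn, on the hold-out set, which level-0 output pattern maps to
# which class (a minimal stand-in for a trained meta-model).
table = defaultdict(Counter)
for x, y in holdout:
    table[tuple(m(x) for m in models)][y] += 1

def meta_predict(x):
    counts = table[tuple(m(x) for m in models)]
    return counts.most_common(1)[0][0] if counts else 0

accuracy = sum(meta_predict(x) == y for x, y in data) / len(data)
```

The meta-model here corrects the mixed-vote cases where the two level-0 thresholds disagree, which is exactly the role of the level-1 model in stacking.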

There are some differences between our approach and the approach described above. Typical classification methods, as well as classification tasks, usually utilise only one data source to construct a classification model. These data include information describing a target class, i.e. innovative vs. non-innovative, spam vs. non-spam, etc. In our case, three data sources represent different aspects of the target concept, namely the innovative company. Thus, we can state that the classification is based on multiple sources. In our first study [47], we combined these sources and proposed a classification method that obtained satisfactory classification indicators. However, simple classification, which utilised each data set separately, produced insufficient results. Appendix 1 covers additional analysis of the classification results achieved by such an approach. The empirical results suggest that a solitary classifier trained on a single data set may not be able to obtain satisfactory classification quality.

The stacked generalisation method splits one data source into n bootstrap samples to learn n classification functions (level-0), and then uses one classification function to classify the remainder of the data (level-1). In our case, we learn one classification function per data source (level-0), and the results of these classifications are used to build the level-1 classifier. Thus, we name this method diversified stacked generalisation. Thanks to this approach, we obtain a single three-dimensional feature space when we have three data sources. The validation of this method is simple, especially when we have a small number of learning examples and an unbalanced number of positive and negative examples.
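As a minimal sketch of this idea, the snippet below assembles the three-dimensional meta-feature vector from three per-source probability estimates. The keyword-based stand-in classifiers are purely hypothetical; in the actual system, each would be a trained per-source NB model.

```python
# Each company yields three text views: description, link labels, big document.
# One level-0 classifier per data source returns P(innovative | view); stacking
# their outputs gives a single 3-D feature vector per company for level-1.

def meta_features(views, source_classifiers):
    """Level-0 step: one probability per data source -> 3-D level-1 input."""
    return [clf(view) for clf, view in zip(source_classifiers, views)]

# Toy keyword-based stand-ins (the real system trains an NB model per source).
clfs = [
    lambda v: 0.9 if "patent" in v else 0.2,    # description view
    lambda v: 0.8 if "lab" in v else 0.3,       # link-label view
    lambda v: 0.7 if "research" in v else 0.1,  # big-document view
]

site = ("we hold a patent", "partner lab links", "research and development")
vec = meta_features(site, clfs)  # -> [0.9, 0.8, 0.7]
```

Whatever the number of source classifiers, each website collapses to one point in this low-dimensional space, which is what makes the level-1 learning problem small and easy to validate.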

Moreover, in our previous work, we assumed an aligned feature space, i.e. each classifier in the committee used the same feature space and the number of features was the same for all classifiers. In this study, we evaluate the case in which there are unaligned feature spaces, i.e. the number of features is not the same for each classifier. In this manner, we check whether there are other combinations of features that may improve classification results.

3 Innovative website recognition system

This section presents the proposed classification system for innovative websites and explains the general aspects of the proposed system at a high level of abstraction. Figure 1 depicts a flow chart outlining its structure. Below, we describe in detail how the proposed method works.

Fig. 1 General overview of the diversified stacked generalisation architecture

Below, we investigate how the proposed system works based on a simple use case. This use case explains how a single company website, which is a collection of HTML documents, is processed. We can generalise this case to n websites. The system (Fig. 1) takes a company's website as input. A preprocessing phase processes the HTML documents and creates three different data sets. The first data set (Dataset1 (LD)) contains a textual description of the company. This description is extracted from the first (index) page of the website. The second data set (Dataset2 (LL)) involves the link labels that were extracted from the index page. The last data set (Dataset3 (LB)) consists of a so-called bigdocument. The bigdocument is constructed from all of the collected HTML documents, i.e. it is a concatenation of the n best documents from a given website as determined by the Okapi BM25 search system [33, 44, 47, 57] (for more technical details, see Appendix 1). The textual data from each data set is classified by an appropriate classification function. We use the NB learning method to create each function. The results of the classification are merged (DMLE, a matrix which stores the collected decisions) and are utilised by a meta-classifier (γML). We used and tested several learning methods, such as k nearest neighbours (k-nn), Naïve Bayes (NB), support vector machine (SVM), and decision trees (DT) [2, 17, 19, 24, 34, 44], to create the final decision classification function. The result provides information on whether a given website describes an innovative or a non-innovative company.
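To make this flow concrete, here is a highly simplified, self-contained sketch of the pipeline under strong assumptions: a toy Bernoulli-style NB replaces the tuned NB classifiers, the three views are given as plain strings (no HTML preprocessing or BM25 document selection), and a mean-probability threshold stands in for the trained meta-classifier γML. All training texts are invented examples.

```python
import math
from collections import Counter

class TinyNB:
    """Minimal Bernoulli Naive Bayes over token sets (illustration only; the
    paper's classifiers use feature selection and weighting on top of NB)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.n = {c: labels.count(c) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(set(doc.split()))
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        self.total = len(labels)
        return self
    def prob_pos(self, doc):
        toks = set(doc.split())
        logp = {}
        for c in self.classes:
            lp = math.log(self.n[c] / self.total)
            for t in self.vocab:
                p = (self.counts[c][t] + 1) / (self.n[c] + 2)  # Laplace smoothing
                lp += math.log(p if t in toks else 1.0 - p)
            logp[c] = lp
        m = max(logp.values())
        return math.exp(logp[1] - m) / sum(math.exp(v - m) for v in logp.values())

# Three views per training website: (description, link labels, big document).
train_sites = [
    (("novel patented sensor", "research lab partner", "we develop prototypes"), 1),
    (("award winning biotech platform", "university spin off", "new patented method"), 1),
    (("cheap used cars", "price list contact", "discount offers daily"), 0),
    (("pizza delivery service", "menu opening hours", "fresh pizza daily"), 0),
]
views = list(zip(*[s for s, _ in train_sites]))   # group texts by view
labels = [y for _, y in train_sites]
level0 = [TinyNB().fit(list(v), labels) for v in views]  # one NB per data source

def classify(site, threshold=0.5):
    dmle = [clf.prob_pos(text) for clf, text in zip(level0, site)]  # DMLE row
    # Stand-in meta-classifier: mean probability (the paper tests k-nn, NB, SVM, DT).
    return int(sum(dmle) / len(dmle) > threshold)
```

A real deployment would replace the threshold rule with a classifier trained on the DMLE rows of labelled websites, as described in Section 4.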

4 Empirical evaluation of the solution

4.1 The data set and evaluation method

We decided to create an original test set due to the lack of such data in other works. We labelled 2,747 real websites (509 as innovative, 2,238 as non-innovative) and created three sets as follows: |LD| = 2,747 description examples described by |FD| = 140,699 features that also include information about logotypes, |LL| = 2,747 examples of link labels described by |FL| = 140,271 features that also include information about logotypes, and |LB| = 2,747 examples of big documents described by |FB| = 663,015 features.

It is worth mentioning that to build the data set, we used annual rankings of leading companies on the market, e.g. Forbes rankings and awards such as the Business Gazelle. We assumed that these rankings are highly reliable and utilised them to collect company websites as learning data describing innovative companies. Opposite examples, i.e. non-innovative websites, were obtained from a collected set of common company websites.

The classification quality is evaluated according to the 10-fold cross-validation procedure and measured by error, accuracy, precision, recall, and F-measure. Specifically, we use the F1 measure, which assumes an equal balance of precision and recall [44, 61, 64].
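For reference, the fold construction and the evaluation indicators can be expressed compactly; this is a generic sketch, not the exact implementation used in the experiments.

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def prf(tp, fp, fn):
    """Precision, recall, and F1 (equal balance of precision and recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example fold: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = prf(8, 2, 2)  # precision = recall = 0.8, F1 ≈ 0.8
```

Per-fold indicators computed this way are then averaged over the ten folds to obtain the reported values.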

Furthermore, we applied statistical tests to verify whether the Fscore distributions of different models for meta-learning classification (γML) and voting classification (γVoting), are identical without assuming that they follow the normal distribution. The tests’ procedures varied depending on the type of feature space.

When the feature space was aligned (Section 4.2), we compared the results produced by the voting method and each meta-learning method using two tests. First, Friedman's aligned rank test verifies the null hypothesis HF0 stating that the performances of the experimental models are equal [18, 22], assuming a statistical significance level α of 0.05. Second, the Wilcoxon signed-rank test with the Bonferroni correction of p-values for multiple testing (corrected pairwise tests) [20, 49, 59] verifies the null hypothesis HW0 stating that the median difference between pairs of experimental models is zero, again assuming that α is equal to 0.05.

Conversely, when the feature space was unaligned (Section 4.3), we compared the results of the voting procedure and of the best meta-learning method for the aligned feature space with the results of the same models for the unaligned feature space. For this purpose, we used the Wilcoxon rank-sum test [20, 49]. The null hypothesis HWR0 of the Wilcoxon rank-sum test is that the median difference between pairs of experimental models is zero, with α equal to 0.05.

We used the scmamp R package [13] to execute Friedman's aligned rank test, and the standard stats package [55] to perform the Wilcoxon signed-rank test with Bonferroni correction and the Wilcoxon rank-sum test (Wilcoxon-Mann-Whitney test).
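Equivalent tests are also available in Python via SciPy, as sketched below with made-up F-score vectors. Note one caveat: SciPy provides the classical Friedman test (`friedmanchisquare`), not the aligned-rank variant used in this study, so the omnibus step here only approximates the reported procedure.

```python
from scipy import stats

# Hypothetical F-scores of three models over the same grid of feature numbers.
f_voting = [0.70, 0.72, 0.71, 0.69, 0.68, 0.66, 0.65, 0.64]
f_meta   = [0.69, 0.73, 0.74, 0.74, 0.75, 0.75, 0.76, 0.76]
f_other  = [0.68, 0.71, 0.72, 0.73, 0.73, 0.74, 0.74, 0.75]

# Omnibus comparison (classical Friedman test, not the aligned-rank variant).
_, p_friedman = stats.friedmanchisquare(f_voting, f_meta, f_other)

# Paired pairwise comparisons: Wilcoxon signed-rank test, Bonferroni-corrected.
alpha = 0.05 / 2  # two pairwise comparisons against the voting baseline
_, p_meta = stats.wilcoxon(f_voting, f_meta)
_, p_other = stats.wilcoxon(f_voting, f_other)

# Independent-sample comparison: Wilcoxon rank-sum (Mann-Whitney) test.
_, p_ranksum = stats.mannwhitneyu(f_voting, f_meta, alternative="two-sided")
```

A comparison is declared significant when its corrected p-value falls below `alpha`.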

4.2 Aligned space of features

Simple diversified classification committee γVoting

Our previous experiments [47] were intended to compare the best NB classifiers (γNB1, γNB2, and γNB3) with γVoting. The γVoting classification function is based on a simple voting process. For each classifier, we choose the class label with the highest weight, and then we select the most common label from all the decisions. We recall some of those results below to provide background on the applied meta-learning techniques and to justify the differences between the simple voting and meta-learning methods. Figure 2 compares the results of each classification method mentioned above.
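The voting rule itself is short enough to state directly; the example below is a generic sketch with made-up class weights.

```python
def vote(decisions):
    """decisions: one dict of class -> weight per committee member.
    Each member votes for its highest-weight label; the majority label wins."""
    labels = [max(d, key=d.get) for d in decisions]
    return max(set(labels), key=labels.count)

result = vote([
    {"innovative": 0.8, "non-innovative": 0.2},
    {"innovative": 0.4, "non-innovative": 0.6},
    {"innovative": 0.7, "non-innovative": 0.3},
])  # two of three members vote "innovative"
```

With an odd number of committee members and two classes, this rule always yields a unique decision.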

Fig. 2 Precision, Recall, and F-measure achieved by the best NB classifiers γNB1, γNB2, γNB3 and the γVoting classifier, i.e. the committee [47]

Our previous research indicated that methods for feature selection have a significant influence on the classifiers’ performance. For example, we observed that the γNB1 classifier produced the best outcomes when the χ2 feature selection method was used, whereas the γNB2 and the γNB3 classifiers performed best when using the Fisher feature selection method.

Figure 2 compares the values of precision, recall, and F-measure of γVoting and the best three NB classifiers. Based on these results, we can conclude that the performance of the classifiers depends not only on the feature selection method but also on the number of input features. More specifically, the committee outperforms every single classifier when the number of features varies between 600 and 4,000. However, a further increase in the number of features leads to a decrease in its F-measure. Although a higher number of features in the γNB1 classifier with the χ2 feature selection method leads to better performance than the committee, the precision and recall values indicate that the increase in the number of features causes over-fitting of almost all classifiers and of the committee. The over-fitting, manifested by increasing precision and decreasing recall, represents incorrect performance of the classifiers because many innovative companies are missed; for the purposes of this study, it is better that some non-innovative companies are labelled as innovative (lower precision) provided a high number of truly innovative businesses is found in the internet data sets. Moreover, we applied an information retrieval system in the γNB2 classifier, which led to high-quality performance of the classifier when the number of features was very high.

It can be concluded that the simple committee produces the firmest decisions and that, concurrently, the proposed information retrieval system improves the classification quality and prevents over-fitting.

Figure 3 presents a visualisation of the data points in the constructed three-dimensional feature space that was built based on the values representing the probabilities of the innovative class of every single classifier function, i.e. γNB1, γNB2, γNB3.

Fig. 3 Plot of the feature values received from each classification function, i.e. γNB1, γNB2, γNB3 (10,100 features, 473 training innovative examples, and 1,999 training non-innovative examples)

It can be observed (see Fig. 3) that the classes are well separated. The presented results indicate that we can use other supervised classification method(s) ΓML to create classification function(s) based on the feature space created this way. In our case, we refer to this type of learning as meta-learning classification.

Meta-learning classification

We tested several learning methods in the ΓML category to determine the final meta-learned classifier γML (Fig. 6). Figures 4 and 5, and Table 1 show the results of the classification experiments with respect to the different meta-classifiers γML and the number of features in the range of 600-10,100. More specifically, six meta-classifiers were tested: three based on k-nn with different numbers of neighbours, and three based on other algorithms (NB, SVM, and DT). Figure 4 depicts the precision, recall, and F-measures achieved by these meta-classifiers. All meta-classifiers γML work better than γVoting when the number of features is average or high. However, their quality is lower than that of γVoting at the lowest range of features. The best meta-classifiers are based on DT, SVM, and k-nn with three or nine neighbours. By contrast, the smallest enhancements are produced by the meta-classifiers based on NB and on k-nn with one neighbour.

Fig. 4 Precision, Recall and F-measure of the different meta-classifiers γML

Fig. 5 Comparison of the difference between the F-measures of the γVoting classifier and the meta-classifiers γML

Fig. 6 Detailed overview of the diversified stacked generalisation architecture

Table 1 The comparison of the mean values of basic indicators for various types of learning methods (∗ or without a mark: the averages are calculated for the number of features in the range of 600 to 10,100; ∗∗: the averages are calculated for the number of features in the range of 600 to 4,500)

Additionally, we calculated the mean values of precision, recall, F-measure, error, and accuracy for each meta-classifier γML and for γVoting in the specified range of features, i.e. from 600 to 10,100. The results are included in Table 1. The γVoting has two values for each indicator. The first value (marked by ∗) is calculated for the entire range of features, whereas the second (marked by ∗∗) considers the number of features in the range of 600 to 4,500. The average results confirm the above conclusions. Moreover, they show clearly that γVoting performs better when the number of features is low. However, when the number of features increases, all meta-classifiers γML improve. Additionally, Fig. 5 presents the differences in F-scores between γVoting and each meta-classifier γML. Thus, the improvements introduced by the new methods are clearly apparent.

Furthermore, Friedman's aligned rank test indicated that the F-scores of the seven best classification models, for which the number of features ranges from 600 to 10,100 (see Table 1), were significantly different (p-value < 2.2e-16). Thus, we can reject the hypothesis HF0. In addition, we obtained the following p-values for the multiple pairwise comparisons made using the Wilcoxon signed-rank test: (1) p-value = 1.5e-13 for γVoting vs. γML, k-nn, k = 1; (2) p-value = 5.4e-16 for γVoting vs. γML, k-nn, k = 3; (3) p-value = 5.2e-16 for γVoting vs. γML, k-nn, k = 9; (4) p-value = 3.8e-16 for γVoting vs. γML, SVM; (5) p-value = 1.1e-15 for γVoting vs. γML, NB; and (6) p-value = 4.2e-16 for γVoting vs. γML, DT. The test showed that there are significant differences between the compared pairs. As a consequence, we can reject the null hypothesis HW0.

4.3 Unaligned space of features

Although each classifier in the committee uses a different feature space, the number of features is the same for all classifiers. This is probably a suboptimal solution because it is apparent that the classifiers produce the best results in distinct ranges of feature numbers (Fig. 2). Thus, the optimal system may involve an adaptive selection of the feature numbers for each classifier in the committee.

As finding an optimal feature space is similar to selecting the best solution among many candidates, we decided to apply a genetic algorithm to solve this problem. We conducted several computational experiments to verify this assumption. They involved choosing an optimal feature space by means of the genetic algorithm for the γVoting classifier, as well as for the meta-classifier γML based on the SVM. Table 2 lists the best and average values of precision, recall, F-measure, error, and accuracy obtained in the experiments.
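A miniature genetic algorithm of this kind is sketched below. Each genome is a triple of per-source feature counts; the fitness function is a synthetic surrogate (a smooth function peaking at invented per-source optima), not the cross-validated F-score evaluated in the actual experiments, and the 4,500 bound follows the γVoting setting described in this section.

```python
import random

random.seed(1)

BOUNDS = (100, 4500)  # feature-count range per source, mirroring the voting cap

def fitness(genome):
    """Synthetic stand-in for cross-validated F-score, peaking at
    hypothetical optimal counts for the three data sources."""
    peaks = (3300, 800, 4100)
    return -sum((g - p) ** 2 for g, p in zip(genome, peaks))

def mutate(genome, step=200):
    g = list(genome)
    i = random.randrange(3)
    g[i] = min(BOUNDS[1], max(BOUNDS[0], g[i] + random.randint(-step, step)))
    return tuple(g)

def crossover(a, b):
    cut = random.randrange(1, 3)
    return a[:cut] + b[cut:]

# Each genome is a triple of feature counts, one per data source.
pop = [tuple(random.randint(*BOUNDS) for _ in range(3)) for _ in range(30)]
init_best = max(pop, key=fitness)
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]  # elitist selection keeps the best genomes
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    pop = parents + children
best = max(pop, key=fitness)
```

In the real system, evaluating `fitness` means training and cross-validating the committee for the candidate feature counts, which makes each generation far more expensive than in this toy surrogate.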

Table 2 The comparison of basic indicators for different types of feature space sizes

Firstly, note that the γVoting classifier utilised approximately 4,500 features because we previously observed that over-fitting may occur when the number of features is higher than this value. Thus, the genetic algorithm searches for optimal solutions without the risk of creating an over-fitted model. For example, the combination of 3,258 features for the γNB1, 812 features for the γNB2, and 4,092 features for the γNB3 functions produces the optimal solution for the γVoting classifier (see the first row in Table 2). The average outcomes of the genetic algorithm in the case of the γVoting model were calculated by using a 10-fold cross-validation procedure. When we compare these results (see the second row in Table 2) with the outcomes covered in Table 1, we note that there are some combinations of features that slightly improve the classification quality. Moreover, the Wilcoxon rank-sum test compared γVoting for the unaligned feature space (Table 2) with γVoting∗∗ for the aligned feature space (Table 1). It indicated that the p-value < 2.2e-16, showing a significant difference between the compared models. As a consequence, we can reject the null hypothesis HWR0.

Secondly, note that the maximum number of features for the meta-classifier γML based on SVM was equal to approximately 20,000 (the third and fourth rows in Table 2). This boundary was selected experimentally with consideration of the problems of over-fitting and acceptable performance. For instance, if we take into account the F-measure, the following combination of features is optimal: 14,207 features for γNB1, 16,081 for γNB2, and 16,283 for γNB3. The average values of the indicators for the γML models were obtained by using a 10-fold cross-validation procedure. Note that the genetic model selected more features than the previously specified upper boundary, i.e. 10,100. Furthermore, the Wilcoxon rank-sum test compared γML for the unaligned feature space (Table 2) with γML, SVM for the aligned feature space (Table 1). Since the test indicated that the p-value < 2.2e-16, there is a significant difference between the compared models. As a consequence, we can reject the null hypothesis HWR0.

Finally, we tested the meta-classifiers γML with the genetic algorithm again, but we assumed that the maximum number of features should not exceed 10,100 (the marked rows in Table 2). The meta-classifiers γML performed best when the genetic algorithm selected the following combination of features: 8,662 for γNB1, 8,022 for γNB2, and 9,481 for γNB3. Interestingly, in most cases, the algorithm searched for optimal solutions in ranges where the feature numbers were higher than 10,100. Moreover, we compared γML for the unaligned feature space (see the marked rows in Table 2) with γML, SVM for the aligned feature space (Table 1) by using the Wilcoxon rank-sum test. Since the test indicated a p-value of 0.57, there is no significant difference between the compared models. As a consequence, we cannot reject the null hypothesis HWR0.

5 Conclusions

Nowadays, owing to the growing need for innovativeness, modern companies use highly complex processes to develop increasingly advanced products in order to gain a competitive edge in the market. Such strategies require sophisticated knowledge and skills, as well as a high organisational and technical culture. These attributes characterise businesses that either employ in-house researchers or outsource their services. Thus, suitable matching between firms and researchers is crucial for boosting innovativeness. One aspect of this process is the identification of innovative companies, which can be based on their websites.

In this study, we developed the novel diversified stacked generalisation method for recognition of websites that may indicate innovative companies. The main idea behind this approach is that we consider an entire data source as one bootstrap sample and learn one classification function for each data source. In our case, we consider three data sources, which thus represent three bootstrap samples. Then, the classification results from each bootstrap sample are used to build the final meta-classifier.

Because we consider three data sources, a single three-dimensional feature space is obtained. The system has been verified experimentally with regard to various combinations of feature numbers and meta-classification methods. A genetic algorithm selects the combinations of features. We experimentally confirmed that the simple voting classifier performs better when the number of features is low, but as the number of features increases, all meta-classifiers improve. The proposed system is robust to over-fitting, especially when big documents are considered.

In conclusion, the most important findings of this work are as follows:

  • We can recognise innovative companies with satisfactory precision based on analysis of the textual content of companies' websites. For this purpose, we may use machine learning techniques, such as those applied in the proposed method.

  • The proposed diversified stacked generalisation method (the multi-source classification model), based on the meta-classifier, outperforms the voting approach. The results were improved by 11% in terms of the mean F-score.

  • We may utilise a genetic algorithm to construct the unaligned feature space. More specifically, we can create various combinations of feature numbers originating from the different data sources and, in this manner, improve the classification results. The results of the simple voting procedure were improved by 4.6% in terms of the mean F-score. On the other hand, we did not notice a significant improvement when the meta-classifier used the unaligned feature space instead of the aligned one.

  • The proposed method creates non-over-fitted models over some ranges of feature numbers.

  • We achieve a well separated three-dimensional feature space that is simple to visualise.

In further work, we will attempt to better evaluate our method by using other classification approaches and data sets. For instance, we believe that the method could be applied to the problem of multilingual text classification. Preliminary experiments indicate significant classification improvements in this field [54].

In our current approach, the training set is created by experts. We believe that, through these data, the classification system acquires unique knowledge that allows it to reject non-innovative companies.

Another issue is the model's robustness to fake innovativeness, such as the use of buzzwords or mere advertising on a website. Since classification models are constructed in a supervised way, they are only as robust as the training set is correct and general. The system may classify companies improperly if unknown features indicating innovativeness appear in real data. On the other hand, we must underline that the proposed system is not resistant to poisoning attacks, i.e. the inclusion of corrupted data in the training data set. A poisoning attack is an issue of data or application security and was beyond the scope of this study. It may be a topic of further research.

Finally, we argue that complex classification tasks require more sophisticated approaches than applying a single model. The diversified stacked generalisation method integrates many views (classifiers) of the same problem and, in this manner, categorises the data adequately.