1 Introduction

It is argued that innovativeness plays a crucial role in the development of various branches of the modern economy, such as medicine, agriculture, security, defence, and industry. Providing advanced products and services depends on smooth and efficient cooperation between companies and research units. As these two groups may have difficulties in finding appropriate partners or information, we have created an information platform, Inventorum, which aims to establish connections between them. In order to prevent users from being overwhelmed or distracted by useless data, the system recommends only innovations, projects, partners, and experts that suit them. Such an approach requires both the preparation of relevant information and its careful selection. In this paper, we focus on finding potentially innovative companies on the internet, which should help attract more participants, i.e. information consumers, to Inventorum. The system selectively gathers data from the internet. Then, each acquired website, consisting of a group of web pages, is analysed to establish whether or not it is related to an innovative company. This study evaluates the classification of such websites into two groups, namely (i) domains that represent innovative companies and (ii) others. For simplicity, we refer to these groups as (i) innovative domains or websites and (ii) non-innovative domains or websites. This classification does not mean that the internet domains themselves are innovative, but rather indicates whether the companies represented by these domains are innovative.

In reality, an innovative business is one that offers innovative products or services, employs professionals or scientists who are the best in their field, invests in research and development projects, and cooperates with other creative firms or top champions [9, 36, 71]. All of the above factors are disputable except one: an innovative product or service. The Oslo Manual [48] defines an innovation as something new or substantially enhanced and, at the same time, beneficial for consumers. It may begin as merely an idea that triggers organisational, process, or production change, resulting in invention, implementation, and diffusion [35]. Let us imagine how an individual could evaluate a company based on its website. First, he or she should look for innovations or legal protections. If these appear on the website and belong to the firm, it may be innovative. Nonetheless, not all entities possess innovations, yet they may still be innovative because they might be working on something ground-breaking. Thus, the individual analysing the company's web page must look for the inconclusive factors mentioned above, i.e. the particular types of people, investments, and cooperation.

We must mention that some attempts have been made to construct models for assessing company innovativeness based on inquiry tools, e.g. a survey to benchmark organisational readiness for digital innovation [43] or a questionnaire to find a relationship between cooperation with innovative partners and internal innovations [53]. However, these techniques require conducting surveys of the companies under examination. We want to develop an automatic model utilising machine learning techniques. As we can imagine, this task is not easy even for an expert, so the question is how it can be realised by a classification algorithm meant to substitute for the expert's work. Firstly, our imaginary algorithm has to retrieve innovation-related entities such as a person or a cooperating company. Then, it should assess whether these objects are relevant to the firm under examination. More specifically, the algorithm needs (i) to judge whether potential innovations are actually innovative; (ii) to assess whether persons are unique professionals; and (iii) to determine whether related companies are innovative. To the best of our knowledge, these tasks cannot be realised directly by any machine learning algorithm because the algorithm cannot judge, for example, whether a product/service or a professional is innovative without evaluating almost all possibilities across the world. Thus, an indirect technique, which we propose below, is required.

Although the concept of "an innovative website" is an abstract idea, similar to concepts such as what constitutes spam, pornography, or phishing [1, 4, 6, 39, 52, 68, 74, 75], it has never been adequately addressed by known machine learning techniques. Unfortunately, innovativeness is difficult to grasp and cannot be recognised by identifying typical words, phrases, origin addresses, or the like, as in spam detection. Innovativeness is something that we must determine from context, so we propose an original procedure, which assumes that it is possible to create a machine learning classification model that can determine whether a company is innovative based on automatic analysis of its website. The procedure analyses a data set from three different perspectives. The first perspective considers only the company's description located on its website. This text most likely contains descriptions of innovative products or services and some notes about the most important professionals. The second perspective analyses links included on the company's website, which may contain information about cooperating companies. The third perspective utilises all of the text on the website. It may be important if our assumptions about where particular information is located on the website turn out to be incorrect.

In our previous study [47], we proposed an experimental Naïve Bayes (NB) classification committee for categorisation of companies' websites into the innovative or non-innovative group. The committee was based on a voting idea and diversified feature spaces, feature weights, and two models of feature distribution, i.e. Bernoulli and Multinomial [19, 44]. We showed that the committee, as well as the selected individual NB classifiers, can solve the defined classification task adequately. Furthermore, we noted that the proposed ensemble procedure may require some improvement to provide better results. Therefore, in this study, we explore this task more deeply. The novelties of this paper are as follows:

  1. We investigate a novel classification task, i.e. recognition of Polish innovative companies' websites.

  2. We propose a novel, improved classification approach based on multiple data sources to accomplish the classification task mentioned above.

In the recent literature, we did not find any related works covering the topics mentioned above. In this study, we propose a new classification committee, which is an extended version of our previous proposition [47]. The primary enhancement relies on the application of a stacked generalisation method instead of a simple voting procedure. The objectives of the study are as follows:

  1. To demonstrate that the newly proposed method can improve the decision process and produce better overall classification quality than the previously proposed voting committee.

  2. To demonstrate that a genetic algorithm can create an unaligned feature space that improves the classification results of the simple voting procedure.

Both objectives are realised based on empirical experiments and obtained results.

This paper is structured as follows. Section 2 summarises previous and related works. Section 3 presents an overview of the proposed classification system of innovative websites. Section 4 describes the evaluation process of the proposed system and the results obtained during experiments. Finally, Section 5 concludes the findings.

2 Related works vs. proposed approach

A review of recent literature indicates that there are some works concerning detection of innovative themes on the internet [7, 50, 51]. However, the analysed approaches are insufficient to solve the problem presented in this study because they do not offer a direct solution for recognising whether a company is innovative based only on its public website. Thus, we constructed a classification model that can recognise innovative websites on the internet with sufficient quality. We propose a method related to ensemble learning. Ensemble learning refers to procedures employed to train multiple classifiers, combining their outputs and considering them a "committee" of decision makers. Various approaches accomplish this learning concept, for example, bagging, boosting, AdaBoost (adaptive boosting), stacked generalisation, mixtures of experts, and voting-based methods [2, 5, 11, 14, 40, 42, 58, 73]. The presented approach is closely related to stacked generalisation [66, 67, 70]. Stacking is an ensemble learning technique in which a set of models is constructed from bootstrap samples of a data set, and their outputs on a hold-out data set are then used as inputs to a meta-model. The set of base models is called level-0, whereas the meta-model is called level-1. The task of the level-1 model is to combine the set of level-0 outputs so as to classify the target correctly, thereby correcting any mistakes made by the level-0 models [60].
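The level-0/level-1 split described above can be sketched in a few lines of Python. The toy data, the single-feature threshold learners, and the lookup-table meta-model below are purely hypothetical stand-ins, not the classifiers used in this study; the sketch only illustrates how level-0 models trained on bootstrap samples feed a level-1 model fitted on hold-out outputs.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Synthetic 2-feature data: class 1 when x0 + x1 > 1.
data = [([random.random(), random.random()], 0) for _ in range(200)]
data = [(x, int(x[0] + x[1] > 1.0)) for x, _ in data]
holdout, train = data[:50], data[50:]

def fit_threshold(sample, feat):
    """Level-0 learner: the threshold on one feature that best splits the sample."""
    best_t, best_acc = 0.5, 0.0
    for t in [i / 20 for i in range(21)]:
        acc = sum((x[feat] > t) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x, t=best_t, f=feat: int(x[f] > t)

# Level-0: one model per bootstrap sample of the training part.
models = []
for feat in (0, 1):
    sample = [random.choice(train) for _ in range(len(train))]
    models.append(fit_threshold(sample, feat))

# Level-1: learn, on the hold-out set, which level-0 output pattern maps to
# which class (a minimal stand-in for a trained meta-model).
table = defaultdict(Counter)
for x, y in holdout:
    table[tuple(m(x) for m in models)][y] += 1

def meta_predict(x):
    counts = table[tuple(m(x) for m in models)]
    return counts.most_common(1)[0][0] if counts else 0

accuracy = sum(meta_predict(x) == y for x, y in data) / len(data)
```

The meta-model here corrects the mixed-vote cases where the two level-0 thresholds disagree, which is exactly the role of the level-1 model in stacking.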

There are some differences between our approach and the approach described above. Typical classification methods, as well as classification tasks, usually utilise only one data source to construct a classification model. These data include information describing a target class, i.e. innovative vs. non-innovative, spam vs. non-spam, etc. In our case, three data sources represent different aspects of the target concept, namely the innovative company. Thus, we can state that the classification is based on multiple sources. In our first study [47], we combined these sources and proposed a classification method that obtained satisfactory classification indicators. However, simple classification, which utilised each data set separately, produced insufficient results. Appendix 1 covers additional analysis of the classification results achieved by such an approach. The empirical results suggest that a solitary classifier trained on a single data set may not be able to obtain satisfactory classification quality.

The stacked generalisation method splits one data source into n bootstrap samples to learn n classification functions (level-0), and then uses one classification function to classify the remainder of the data (level-1). In our case, we learn one classification function per data source (level-0), and the results of these classifications are used to build the level-1 classifier. Thus, we name this method diversified stacked generalisation. Thanks to this approach, we obtain a single three-dimensional feature space when we have three data sources. The validation of this method is simple, especially when we have a small number of learning examples and an unbalanced number of positive and negative examples.
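As a minimal sketch of this idea, the snippet below assembles the three-dimensional meta-feature vector from three per-source probability estimates. The keyword-based stand-in classifiers are purely hypothetical; in the actual system, each would be a trained per-source NB model.

```python
# Each company yields three text views: description, link labels, big document.
# One level-0 classifier per data source returns P(innovative | view); stacking
# their outputs gives a single 3-D feature vector per company for level-1.

def meta_features(views, source_classifiers):
    """Level-0 step: one probability per data source -> 3-D level-1 input."""
    return [clf(view) for clf, view in zip(source_classifiers, views)]

# Toy keyword-based stand-ins (the real system trains an NB model per source).
clfs = [
    lambda v: 0.9 if "patent" in v else 0.2,    # description view
    lambda v: 0.8 if "lab" in v else 0.3,       # link-label view
    lambda v: 0.7 if "research" in v else 0.1,  # big-document view
]

site = ("we hold a patent", "partner lab links", "research and development")
vec = meta_features(site, clfs)  # -> [0.9, 0.8, 0.7]
```

Whatever the number of source classifiers, each website collapses to one point in this low-dimensional space, which is what makes the level-1 learning problem small and easy to validate.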

Moreover, in our previous work, we assumed an aligned feature space, i.e. each classifier in the committee used the same feature space and the number of features was the same for all classifiers. In this study, we evaluate the case in which there are unaligned feature spaces, i.e. the number of features is not the same for each classifier. In this manner, we check whether there are other combinations of features that may improve classification results.

3 Innovative website recognition system

This section presents the proposed classification system for innovative websites and explains the general aspects of the proposed system at a high level of abstraction. Figure 1 depicts a flow chart outlining its structure. Below, we describe in detail how the proposed method works.

Fig. 1 General overview of the diversified stacked generalisation architecture

Below, we investigate how the proposed system works based on a simple use case. This use case explains how a single company website, which is a collection of HTML documents, is processed. We can generalise this case to n websites. The system (Fig. 1) takes a company's website as input. A preprocessing phase processes the HTML documents and creates three different data sets. The first data set (Dataset1 (LD)) contains a textual description of the company. This description is extracted from the first (index) page of the website. The second data set (Dataset2 (LL)) involves the link labels that were extracted from the index page. The last data set (Dataset3 (LB)) consists of a so-called bigdocument. The bigdocument is constructed from all of the collected HTML documents, i.e. it is a concatenation of the n best documents from a given website as determined by the Okapi BM25 search system [33, 44, 47, 57] (for more technical details, see Appendix 1). The textual data from each data set is classified by an appropriate classification function. We use the NB learning method to create each function. The results of the classification are merged (DMLE, a matrix which stores the collected decisions) and are utilised by a meta-classifier (γML). We used and tested several learning methods, such as k nearest neighbours (k-nn), Naïve Bayes (NB), support vector machine (SVM), and decision trees (DT) [2, 17, 19, 24, 34, 44], to create the final decision classification function. The result provides information on whether a given website describes an innovative or a non-innovative company.
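To make this flow concrete, here is a highly simplified, self-contained sketch of the pipeline under strong assumptions: a toy Bernoulli-style NB replaces the tuned NB classifiers, the three views are given as plain strings (no HTML preprocessing or BM25 document selection), and a mean-probability threshold stands in for the trained meta-classifier γML. All training texts are invented examples.

```python
import math
from collections import Counter

class TinyNB:
    """Minimal Bernoulli Naive Bayes over token sets (illustration only; the
    paper's classifiers use feature selection and weighting on top of NB)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.n = {c: labels.count(c) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(set(doc.split()))
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        self.total = len(labels)
        return self
    def prob_pos(self, doc):
        toks = set(doc.split())
        logp = {}
        for c in self.classes:
            lp = math.log(self.n[c] / self.total)
            for t in self.vocab:
                p = (self.counts[c][t] + 1) / (self.n[c] + 2)  # Laplace smoothing
                lp += math.log(p if t in toks else 1.0 - p)
            logp[c] = lp
        m = max(logp.values())
        return math.exp(logp[1] - m) / sum(math.exp(v - m) for v in logp.values())

# Three views per training website: (description, link labels, big document).
train_sites = [
    (("novel patented sensor", "research lab partner", "we develop prototypes"), 1),
    (("award winning biotech platform", "university spin off", "new patented method"), 1),
    (("cheap used cars", "price list contact", "discount offers daily"), 0),
    (("pizza delivery service", "menu opening hours", "fresh pizza daily"), 0),
]
views = list(zip(*[s for s, _ in train_sites]))   # group texts by view
labels = [y for _, y in train_sites]
level0 = [TinyNB().fit(list(v), labels) for v in views]  # one NB per data source

def classify(site, threshold=0.5):
    dmle = [clf.prob_pos(text) for clf, text in zip(level0, site)]  # DMLE row
    # Stand-in meta-classifier: mean probability (the paper tests k-nn, NB, SVM, DT).
    return int(sum(dmle) / len(dmle) > threshold)
```

A real deployment would replace the threshold rule with a classifier trained on the DMLE rows of labelled websites, as described in Section 4.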

4 Empirical evaluation of the solution

4.1 The data set and evaluation method

We decided to create an original test set due to the lack of such data in other works. We labelled 2,747 real websites (509 as innovative, 2,238 as non-innovative) and created three sets as follows: |LD| = 2,747 description examples described by |FD| = 140,699 features that also include information about logotypes, |LL| = 2,747 examples of link labels described by |FL| = 140,271 features that also include information about logotypes, and |LB| = 2,747 examples of big documents described by |FB| = 663,015 features.

It is worth mentioning that to build the data set, we used annual rankings of leading companies on the market, e.g. Forbes rankings and awards such as the Business Gazelle. We assumed that these rankings are highly reliable and utilised them to collect company websites as learning data describing innovative companies. Opposite examples, i.e. non-innovative websites, were obtained from a collected set of common company websites.

The classification quality is evaluated according to the 10-fold cross-validation procedure and measured by error, accuracy, precision, recall, and F-measure. Specifically, we use the F1 measure, which assumes an equal balance of precision and recall [44, 61, 64].
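For reference, the fold construction and the evaluation indicators can be expressed compactly; this is a generic sketch, not the exact implementation used in the experiments.

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def prf(tp, fp, fn):
    """Precision, recall, and F1 (equal balance of precision and recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example fold: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = prf(8, 2, 2)  # precision = recall = 0.8, F1 ≈ 0.8
```

Per-fold indicators computed this way are then averaged over the ten folds to obtain the reported values.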

Furthermore, we applied statistical tests to verify whether the Fscore distributions of different models for meta-learning classification (γML) and voting classification (γVoting), are identical without assuming that they follow the normal distribution. The tests’ procedures varied depending on the type of feature space.

When the feature space was aligned (Section 4.2), we compared the results produced by the voting method and each meta-learning method using two tests. First, Friedman's aligned rank test verifies the null hypothesis HF0 stating that the performances of the experimental models are equal [18, 22], assuming a statistical significance level α of 0.05. Second, the Wilcoxon signed-rank test with the Bonferroni correction of p-values for multiple testing (corrected pairwise tests) [20, 49, 59] verifies the null hypothesis HW0 stating that the median difference between pairs of experimental models is zero, again assuming that α is equal to 0.05.

Conversely, when the feature space was unaligned (Section 4.3), we compared the results of the voting procedure and of the best meta-learning method for the aligned feature space with the results of the same models for the unaligned feature space. For this purpose, we used the Wilcoxon rank-sum test [20, 49]. The null hypothesis HWR0 of the Wilcoxon rank-sum test is that the median difference between pairs of experimental models is zero, with α equal to 0.05.

We used the scmamp R package [13] to execute Friedman's aligned rank test, and the standard stats package [55] to perform the Wilcoxon signed-rank test with Bonferroni correction and the Wilcoxon rank-sum test (Wilcoxon-Mann-Whitney test).
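Equivalent tests are also available in Python via SciPy, as sketched below with made-up F-score vectors. Note one caveat: SciPy provides the classical Friedman test (`friedmanchisquare`), not the aligned-rank variant used in this study, so the omnibus step here only approximates the reported procedure.

```python
from scipy import stats

# Hypothetical F-scores of three models over the same grid of feature numbers.
f_voting = [0.70, 0.72, 0.71, 0.69, 0.68, 0.66, 0.65, 0.64]
f_meta   = [0.69, 0.73, 0.74, 0.74, 0.75, 0.75, 0.76, 0.76]
f_other  = [0.68, 0.71, 0.72, 0.73, 0.73, 0.74, 0.74, 0.75]

# Omnibus comparison (classical Friedman test, not the aligned-rank variant).
_, p_friedman = stats.friedmanchisquare(f_voting, f_meta, f_other)

# Paired pairwise comparisons: Wilcoxon signed-rank test, Bonferroni-corrected.
alpha = 0.05 / 2  # two pairwise comparisons against the voting baseline
_, p_meta = stats.wilcoxon(f_voting, f_meta)
_, p_other = stats.wilcoxon(f_voting, f_other)

# Independent-sample comparison: Wilcoxon rank-sum (Mann-Whitney) test.
_, p_ranksum = stats.mannwhitneyu(f_voting, f_meta, alternative="two-sided")
```

A comparison is declared significant when its corrected p-value falls below `alpha`.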

4.2 Aligned space of features

Simple diversified classification committee γVoting

Our previous experiments [47] were intended to compare the best NB classifiers (γNB1, γNB2, and γNB3) with γVoting. The γVoting classification function is based on a simple voting process. For each classifier, we choose the class label with the highest weight, and then we select the most common label from all the decisions. We recall some of those results below to provide background on the applied meta-learning techniques and to justify the differences between the simple voting and meta-learning methods. Figure 2 compares the results of each classification method mentioned above.
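The voting rule itself is short enough to state directly; the example below is a generic sketch with made-up class weights.

```python
def vote(decisions):
    """decisions: one dict of class -> weight per committee member.
    Each member votes for its highest-weight label; the majority label wins."""
    labels = [max(d, key=d.get) for d in decisions]
    return max(set(labels), key=labels.count)

result = vote([
    {"innovative": 0.8, "non-innovative": 0.2},
    {"innovative": 0.4, "non-innovative": 0.6},
    {"innovative": 0.7, "non-innovative": 0.3},
])  # two of three members vote "innovative"
```

With an odd number of committee members and two classes, this rule always yields a unique decision.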

Fig. 2 Precision, Recall, and F-measure achieved by the best NB classifiers γNB1, γNB2, γNB3 and the γVoting classifier, i.e. the committee [47]

Our previous research indicated that methods for feature selection have a significant influence on the classifiers’ performance. For example, we observed that the γNB1 classifier produced the best outcomes when the χ2 feature selection method was used, whereas the γNB2 and the γNB3 classifiers performed best when using the Fisher feature selection method.

Figure 2 compares the values of precision, recall, and F-measure of γVoting and the best three NB classifiers. Based on these results, we can conclude that the performance of the classifiers depends not only on the feature selection method but also on the number of input features. More specifically, the committee outperforms every single classifier when the number of features varies between 600 and 4,000. However, a further increase in the number of features leads to a decrease in its F-measure. Although a higher number of features in the γNB1 classifier with the χ2 feature selection method leads to better performance than the committee, the precision and recall values indicate that the increase in the number of features causes over-fitting of almost all classifiers and of the committee. The over-fitting, manifested by increasing precision and decreasing recall, represents incorrect performance of the classifiers because many innovative companies are missed; for the purposes of this study, it is better that some non-innovative companies are labelled as innovative (lower precision) provided a high number of truly innovative businesses is found in the internet data sets. Moreover, we applied an information retrieval system in the γNB2 classifier, which led to high-quality performance of the classifier when the number of features was very high.

It can be concluded that the simple committee produces the firmest decisions and that, concurrently, the proposed information retrieval system improves the classification quality and prevents over-fitting.

Figure 3 presents a visualisation of the data points in the constructed three-dimensional feature space that was built based on the values representing the probabilities of the innovative class of every single classifier function, i.e. γNB1, γNB2, γNB3.

Fig. 3 Plot of the feature values received from each classification function, i.e. γNB1, γNB2, γNB3 (10,100 features, 473 training innovative examples, and 1,999 training non-innovative examples)

It can be observed (see Fig. 3) that the classes are well separated. The presented results indicate that we can use other supervised classification method(s) ΓML to create classification function(s) based on the feature space created this way. In our case, we refer to this type of learning as meta-learning classification.

Meta-learning classification

We tested several learning methods in the ΓML category to determine the final meta-learned classifier γML (Fig. 6). Figures 4 and 5, and Table 1 show the results of the classification experiments with respect to the different meta-classifiers γML and the number of features in the range of 600-10,100. More specifically, six meta-classifiers were tested: three based on k-nn with different numbers of neighbours, and three based on other algorithms (NB, SVM, and DT). Figure 4 depicts the precision, recall, and F-measures achieved by these meta-classifiers. All meta-classifiers γML work better than γVoting when the number of features is average or high. However, their quality is lower than that of γVoting at the lowest range of features. The best meta-classifiers are based on DT, SVM, and k-nn with three or nine neighbours. By contrast, the smallest enhancements are produced by the meta-classifiers based on NB and on k-nn with one neighbour.

Fig. 4 Precision, Recall and F-measure of the different meta-classifiers γML

Fig. 5 Comparison of the difference between the F-measures of the γVoting classifier and the meta-classifiers γML

Fig. 6 Detailed overview of the diversified stacked generalisation architecture

Table 1 The comparison of the mean values of basic indicators for various types of learning methods (∗ or without a mark: the averages are calculated for the number of features in the range of 600 to 10,100; ∗∗: the averages are calculated for the number of features in the range of 600 to 4,500)

Additionally, we calculated the mean values of precision, recall, F-measure, error, and accuracy for each meta-classifier γML and for γVoting in the specified range of features, i.e. from 600 to 10,100. The results are included in Table 1. The γVoting has two values for each indicator. The first value (marked by ∗) is calculated for the entire range of features, whereas the second (marked by ∗∗) considers the number of features in the range of 600 to 4,500. The average results confirm the above conclusions. Moreover, they show clearly that γVoting performs better when the number of features is low. However, when the number of features increases, all meta-classifiers γML improve. Additionally, Fig. 5 presents the differences in F-scores between γVoting and each meta-classifier γML. Thus, the improvements introduced by the new methods are clearly apparent.

Furthermore, Friedman's aligned rank test indicated that the F-scores of the seven best classification models, for which the number of features ranges from 600 to 10,100 (see Table 1), were significantly different (p-value < 2.2e-16). Thus, we can reject the hypothesis HF0. In addition, we obtained the following p-values for the multiple pairwise comparisons made using the Wilcoxon signed-rank test: (1) p-value = 1.5e-13 for γVoting vs. γML, k-nn, k = 1; (2) p-value = 5.4e-16 for γVoting vs. γML, k-nn, k = 3; (3) p-value = 5.2e-16 for γVoting vs. γML, k-nn, k = 9; (4) p-value = 3.8e-16 for γVoting vs. γML, SVM; (5) p-value = 1.1e-15 for γVoting vs. γML, NB; and (6) p-value = 4.2e-16 for γVoting vs. γML, DT. The test showed that there are significant differences between the compared pairs. As a consequence, we can reject the null hypothesis HW0.

4.3 Unaligned space of features

Although each classifier in the committee uses a different feature space, the number of features is the same for all classifiers. This is probably a suboptimal solution because it is apparent that the classifiers produce the best results in distinct ranges of feature numbers (Fig. 2). Thus, the optimal system may involve an adaptive selection of the feature numbers for each classifier in the committee.

As finding an optimal feature space is similar to selecting the best solution among many candidates, we decided to apply a genetic algorithm to solve this problem. We conducted several computational experiments to verify this assumption. They involved choosing an optimal feature space by means of the genetic algorithm for the γVoting classifier, as well as for the meta-classifier γML based on the SVM. Table 2 lists the best and average values of precision, recall, F-measure, error, and accuracy obtained in the experiments.
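A miniature genetic algorithm of this kind is sketched below. Each genome is a triple of per-source feature counts; the fitness function is a synthetic surrogate (a smooth function peaking at invented per-source optima), not the cross-validated F-score evaluated in the actual experiments, and the 4,500 bound follows the γVoting setting described in this section.

```python
import random

random.seed(1)

BOUNDS = (100, 4500)  # feature-count range per source, mirroring the voting cap

def fitness(genome):
    """Synthetic stand-in for cross-validated F-score, peaking at
    hypothetical optimal counts for the three data sources."""
    peaks = (3300, 800, 4100)
    return -sum((g - p) ** 2 for g, p in zip(genome, peaks))

def mutate(genome, step=200):
    g = list(genome)
    i = random.randrange(3)
    g[i] = min(BOUNDS[1], max(BOUNDS[0], g[i] + random.randint(-step, step)))
    return tuple(g)

def crossover(a, b):
    cut = random.randrange(1, 3)
    return a[:cut] + b[cut:]

# Each genome is a triple of feature counts, one per data source.
pop = [tuple(random.randint(*BOUNDS) for _ in range(3)) for _ in range(30)]
init_best = max(pop, key=fitness)
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]  # elitist selection keeps the best genomes
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    pop = parents + children
best = max(pop, key=fitness)
```

In the real system, evaluating `fitness` means training and cross-validating the committee for the candidate feature counts, which makes each generation far more expensive than in this toy surrogate.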

Table 2 The comparison of basic indicators for different types of feature space sizes

Firstly, note that the γVoting classifier utilised approximately 4,500 features because we previously observed that over-fitting may occur when the number of features is higher than this value. Thus, the genetic algorithm searches for optimal solutions without the risk of creating an over-fitted model. For example, the combination of 3,258 features for the γNB1, 812 features for the γNB2, and 4,092 features for the γNB3 functions produces the optimal solution for the γVoting classifier (see the first row in Table 2). The average outcomes of the genetic algorithm in the case of the γVoting model were calculated by using a 10-fold cross-validation procedure. When we compare these results (see the second row in Table 2) with the outcomes covered in Table 1, we note that there are some combinations of features that slightly improve the classification quality. Moreover, the Wilcoxon rank-sum test compared γVoting for the unaligned feature space (Table 2) with γVoting∗∗ for the aligned feature space (Table 1). It indicated that the p-value < 2.2e-16, showing a significant difference between the compared models. As a consequence, we can reject the null hypothesis HWR0.

Secondly, note that the maximum number of features for the meta-classifier γML based on SVM was equal to approximately 20,000 (the third and fourth rows in Table 2). This boundary was selected experimentally with consideration of the problems of over-fitting and acceptable performance. For instance, if we take into account the F-measure, the following combination of features is optimal: 14,207 features for γNB1, 16,081 for γNB2, and 16,283 for γNB3. The average values of the indicators for the γML models were obtained by using a 10-fold cross-validation procedure. Note that the genetic model selected more features than the previously specified upper boundary, i.e. 10,100. Furthermore, the Wilcoxon rank-sum test compared γML for the unaligned feature space (Table 2) with γML, SVM for the aligned feature space (Table 1). Since the test indicated that the p-value < 2.2e-16, there is a significant difference between the compared models. As a consequence, we can reject the null hypothesis HWR0.

Finally, we tested the meta-classifiers γML with the genetic algorithm again, but we assumed that the maximum number of features should not exceed 10,100 (the marked rows in Table 2). The meta-classifiers γML performed best when the genetic algorithm selected the following combination of features: 8,662 for γNB1, 8,022 for γNB2, and 9,481 for γNB3. Interestingly, in most cases, the algorithm searched for optimal solutions in ranges where the feature numbers were higher than 10,100. Moreover, we compared γML for the unaligned feature space (see the marked rows in Table 2) with γML, SVM for the aligned feature space (Table 1) by using the Wilcoxon rank-sum test. Since the test indicated a p-value of 0.57, there is no significant difference between the compared models. As a consequence, we cannot reject the null hypothesis HWR0.

5 Conclusions

Nowadays, owing to the growing need for innovativeness, modern companies use highly complex processes to develop increasingly advanced products in order to gain a competitive edge in the market. Such strategies require sophisticated knowledge and skills, as well as a high organisational and technical culture. These attributes characterise businesses that either employ in-house researchers or outsource their services. Thus, suitable matching between firms and researchers is crucial for boosting innovativeness. One aspect of this process is the identification of innovative companies, which can be based on their websites.

In this study, we developed the novel diversified stacked generalisation method for recognition of websites that may indicate innovative companies. The main idea behind this approach is that we consider an entire data source as one bootstrap sample and learn one classification function for each data source. In our case, we consider three data sources, which thus represent three bootstrap samples. Then, the classification results from each bootstrap sample are used to build the final meta-classifier.

Because we consider three data sources, a single three-dimensional feature space is obtained. The system has been verified experimentally with regard to various combinations of feature numbers and meta-classification methods. A genetic algorithm selects the combinations of features. We experimentally confirmed that the simple voting classifier performs better when the number of features is low, but as the number of features increases, all meta-classifiers improve. The proposed system is robust to over-fitting, especially when big documents are considered.

In conclusion, the most important findings of this work are as follows:

  • We can recognise innovative companies with satisfactory precision based on analysis of the textual content of companies' websites. For this purpose, we may use machine learning techniques, such as those applied in the proposed method.

  • The proposed diversified stacked generalisation method (the multi-source classification model), based on the meta-classifier, outperforms the voting approach. The results were improved by 11% in terms of the mean F-score.

  • We may utilise a genetic algorithm to construct the unaligned feature space. More specifically, we can create various combinations of feature numbers originating from the different data sources and, in this manner, improve the classification results. The results of the simple voting procedure were improved by 4.6% in terms of the mean F-score. On the other hand, we did not notice a significant improvement when the meta-classifier used the unaligned feature space instead of the aligned one.

  • The proposed method creates non-over-fitted models over some ranges of feature numbers.

  • We achieve a well separated three-dimensional feature space that is simple to visualise.

In further work, we will attempt to better evaluate our method by using other classification approaches and data sets. For instance, we believe that the method could be applied to the problem of multilingual text classification. Preliminary experiments indicate significant classification improvements in this field [54].

In our current approach, the training set is created by experts. We believe that, through these data, the classification system acquires unique knowledge that allows it to reject non-innovative companies.

Another issue is the model's robustness to fake innovativeness, such as the use of buzzwords or mere advertising on a website. Since classification models are constructed in a supervised way, they are only as robust as the training set is correct and general. The system may classify companies improperly if unknown features indicating innovativeness appear in real data. On the other hand, we must underline that the proposed system is not resistant to poisoning attacks, i.e. the inclusion of corrupted data in the training data set. A poisoning attack is an issue of data or application security and was beyond the scope of this study. It may be a topic of further research.

Finally, we argue that complex classification tasks require more sophisticated approaches than applying a single model. The diversified stacked generalisation method integrates many views (classifiers) of the same problem and, in this manner, categorises the data adequately.