Introduction

Since it was proposed by Huang et al., the extreme learning machine (ELM) has shown fast training speed and remarkable classification ability in the fields of machine learning and neural networks over the past decades [1, 2]. From the perspective of network structure, ELMs are single-hidden-layer feedforward networks (SLFNs). Because they require no iterative techniques [4, 5], ELMs offer an efficient way for SLFNs to avoid drawbacks such as sensitivity to the learning rate, long training epochs, and local minima [3].

The hidden nodes, input weights, and hidden biases of an ELM are all generated randomly, and the output weights are then calculated with the conventional Moore–Penrose generalized inverse based on the least-squares method. Hence, no parameter tuning is required before training, and the hidden-layer parameters are entirely independent of the training data. ELM therefore offers clear advantages over other machine learning methods in terms of training speed, global optimality, and generalization performance.
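The training procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact setup: the tanh activation, node count `L`, and regularization factor `C` are illustrative assumptions, and the output weights are computed in the regularized form \(\beta = H^{T}(I/C + HH^{T})^{-1}T\) that appears later in Eq. (1).

```python
import numpy as np

def elm_train(X, T, L=40, C=1e4, rng=np.random.default_rng(0)):
    """Basic ELM: random hidden layer, least-squares output weights."""
    d = X.shape[1]
    W = rng.standard_normal((d, L))      # random input weights (never tuned)
    b = rng.standard_normal(L)           # random hidden biases (never tuned)
    H = np.tanh(X @ W + b)               # hidden-layer output matrix
    # beta = H^T (I/C + H H^T)^{-1} T  -- regularized least squares
    beta = H.T @ np.linalg.solve(np.eye(len(X)) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: fit y = x1 + x2 on random data; no iterative training occurs.
X = np.random.default_rng(1).uniform(-1, 1, (100, 2))
T = X.sum(axis=1, keepdims=True)
W, b, beta = elm_train(X, T)
err = np.max(np.abs(elm_predict(X, W, b, beta) - T))
```

Note that the only linear-algebra step is a single solve; this is the source of ELM's training-speed advantage.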

Although the effectiveness of ELM has been proved in some specific applications, in many scenarios the training data arrive one by one or chunk by chunk, which the batch ELM cannot handle directly. Therefore, a wide range of improved ELMs has been developed. The incremental ELM (I-ELM) proposed by Huang et al. [6] calculates the output weights for the ever-increasing hidden nodes more accurately while remaining a universal approximator. The adaptive growth ELM (AG-ELM) proposed by Zhang et al. [7], which combines hidden-node growth, incremental renewal of the network weights, and sequential generation of grouped new networks, can approximate nonlinear functions effectively. ELM can also be transformed into an online version based on online sequence analysis [8]. As for generalization, the kernel ELM (KELM) based on the kernel function was proposed in [9] for classification problems, and a sequential online ELM based on KELM was proposed in [10] for online learning. By optimizing the input weights, Sun et al. [11] proposed a differential evolutionary ELM with a compact network structure to improve the generalization performance. However, the current I-ELM still suffers from redundant nodes that continually increase the network complexity and reduce the learning efficiency, and its convergence rate and prediction ability remain low because of the unfavorable number of hidden-layer nodes and the sensitivity to new data.

Training speed and learning accuracy, both crucial for ELM, depend heavily on the parameter optimization process. Therefore, researchers have proposed several intelligent optimization methods to compute ELM parameters more precisely, and improving the training speed and learning accuracy of ELMs with bio-inspired methods has become a research focus. A differential evolution algorithm is utilized in [12] to optimize the ELM input parameters. An adaptive differential evolution algorithm is given in [13] to optimize the hidden-layer node parameters, with the MP generalized inverse used to calculate the output weights. An improved particle swarm optimization algorithm is proposed in [14] to optimize the hidden-layer node parameters.

Although the traditional differential evolution method has strong global optimization ability, it cannot avoid premature convergence, while the traditional particle swarm optimization algorithm has a low search speed despite its local optimization merits. Therefore, a hybrid approach called DEPSO, combining the differential evolution algorithm and the particle swarm optimization method, is given in [15] to optimize the hidden-layer nodes. A novel hybrid optimization method given in [16] takes advantage of the global search ability of the differential evolution algorithm and the local search capability of the multi-group gray wolf optimization algorithm. However, local optima and poor population diversity in later iterations remain problems.

As an extension of the research in [16], a Hybrid Intelligent Deep Kernel Incremental Extreme Learning Machine (HI-DKIELM) based on an improved coyote optimization algorithm (ICOA) and an improved beetle swarm optimization algorithm (IBSOA) is proposed in this paper. First, COA is improved with the Gaussian global best-growing operator to enhance the search efficiency and convergence speed. Second, tent mapping inverse learning and dynamic mutation strategies are adopted to keep BSOA out of local optima and to improve the population diversity in later iterations. Finally, the hybrid intelligent optimization method is adopted to tune the parameters of a deep kernel incremental extreme learning machine (DKIELM) and improve the training speed and classification accuracy. To the best of our knowledge, this paper is the first to propose the HCOBSO strategy.

The significant contributions of this paper are summarized as follows:

  1.

    An improved Coyote Optimization Algorithm (ICOA) is proposed, in which the Gaussian global best-growing operator replaces the original growing operator of COA to improve the search efficiency and convergence speed.

  2.

    An improved Beetle Swarm Optimization Algorithm (IBSOA) is designed based on tent mapping inverse learning and dynamic mutation strategies to keep BSOA from falling into local optima. Owing to its high solving speed and accuracy, BSOA has been used successfully in signal positioning and data classification. However, the original BSOA has several drawbacks. First, it easily falls into local optima when solving complex optimization problems. Second, its computational complexity is very high when updating the parameters. Third, in multi-dimensional function optimization, relying on a single individual search increases the possibility of running into a local optimum. Therefore, BSOA is improved in this paper using tent mapping inverse learning and dynamic mutation strategies.

  3.

    The novel HCOBSO method is proposed for optimizing the parameters of the DKIELM. The proposed hybrid intelligent optimization method takes advantage of the global search ability of COA and the local search capability of BSOA.

The remainder of this paper is organized as follows: related works are briefly reviewed in section “Preliminary”. Section “The proposed HCOBSO method” presents the HCOBSO algorithm. Section “The proposed HI-DKIELM” presents details of the DKIELM. The experimental results are presented in section “Results and discussion”. Section “Conclusion” concludes our work and outlines future work.

Preliminary

In this part, the notation of the kernel incremental extreme learning machine (KI-ELM), the Gaussian global best-growing operator, tent mapping inverse learning, and the dynamic mutation strategies are introduced for the convenience of understanding the proposed ELM algorithm.

Kernel incremental extreme learning machine

Suppose training set \({\left\{{x}_{i},{t}_{i}\right\}}_{i=1}^{N}\) is composed of \(N\) training samples, the input \({x}_{i}\) has \(d\) dimensions, and \({t}_{i}\) is the label of the output, then the output of the ELM is [17, 18]

$$ f\left( x \right) = h\left( x \right)\beta = h\left( x \right)H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} T. $$
(1)

The kernel matrix of KI-ELM can be expressed as

$$ K_{{{\text{ELM}}}} = HH^{T} = h\left( {x_{i} } \right) \cdot h\left( {x_{j} } \right) = K\left( {x_{i} ,x_{j} } \right); $$
(2)

thus, the output function of KI-ELM can be given as

$$ \begin{aligned} { }f\left( x \right) & = h\left( x \right)\beta = h\left( x \right)H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} T \\ & = \left[ {\begin{array}{*{20}c} {K\left( {x,x_{1} } \right)} \\ {K\left( {x,x_{2} } \right)} \\ {\begin{array}{*{20}c} \vdots \\ {K\left( {x,x_{N} } \right)} \\ \end{array} } \\ \end{array} } \right]\left( {\frac{1}{C} + K_{{{\text{ELM}}}} } \right)^{ - 1} T . \\ \end{aligned} $$
(3)
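Eq. (3) can be sketched directly in code. This is an illustrative implementation, not the paper's exact configuration: the RBF kernel and its width `gamma` are assumptions (any Mercer kernel works), and the regularizer is taken as \(I/C\), consistent with Eq. (1).

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), a common choice for KELM."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_fit_predict(Xtr, T, Xte, C=1e4, gamma=2.0):
    # Eq. (3): f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + K_ELM)^{-1} T
    K = rbf_kernel(Xtr, Xtr, gamma)                     # K_ELM = H H^T
    alpha = np.linalg.solve(np.eye(len(Xtr)) / C + K, T)
    return rbf_kernel(Xte, Xtr, gamma) @ alpha

# Toy usage: fit sin(x) on [0, 3] and evaluate on the training points.
rng = np.random.default_rng(0)
Xtr = rng.uniform(0, 3, (80, 1))
T = np.sin(Xtr)
pred = kelm_fit_predict(Xtr, T, Xtr)
err = np.max(np.abs(pred - T))
```

Note that no explicit feature map \(h(x)\) is ever formed; the kernel trick replaces \(h(x_i) \cdot h(x_j)\) with \(K(x_i, x_j)\) as in Eq. (2).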

In Eq. (3), assuming \(A = \left[ {\frac{1}{C} + K_{{{\text{ELM}}}} } \right]\), then at moment t, there are

$$ A_{t} = \left[ {\begin{array}{*{20}c} {\frac{1}{C} + K\left( {x_{1} ,x_{1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{1} ,x_{N} } \right)} \\ \vdots & \cdots & \vdots \\ {\frac{1}{C} + K\left( {x_{N} ,x_{1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{N} ,x_{N} } \right)} \\ \end{array} } \right] . $$
(4)

Thus, at \(t + 1\), there are

$$ A_{t + 1} = \left[ {\begin{array}{*{20}c} {\frac{1}{C} + K\left( {x_{1} ,x_{1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{1} ,x_{N + k} } \right)} \\ \vdots & \cdots & \vdots \\ {\frac{1}{C} + K\left( {x_{N + k} ,x_{1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{N + k} ,x_{N + k} } \right)} \\ \end{array} } \right]. $$
(5)

Suppose

$$ U_{t} = \left[ {\begin{array}{*{20}c} {K\left( {x_{1} ,x_{N + 1} } \right)} & \cdots & {K\left( {x_{1} ,x_{N + k} } \right)} \\ \vdots & \cdots & \vdots \\ {K\left( {x_{N} ,x_{N + 1} } \right)} & \cdots & {K\left( {x_{N} ,x_{N + k} } \right)} \\ \end{array} } \right] $$
(6)
$$ D_{t} = \left[ {\begin{array}{*{20}c} {\frac{1}{C} + K\left( {x_{N + 1} ,x_{N + 1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{N + 1} ,x_{N + k} } \right)} \\ \vdots & \cdots & \vdots \\ {\frac{1}{C} + K\left( {x_{N + k} ,x_{N + 1} } \right)} & \cdots & {\frac{1}{C} + K\left( {x_{N + k} ,x_{N + k} } \right)} \\ \end{array} } \right]; $$
(7)

Then, Eq. (5) can be simplified as

$$ A_{t + 1} = \left[ {\begin{array}{*{20}c} {A_{t} } & {U_{t} } \\ {U_{t}^{T} } & {D_{t} } \\ \end{array} } \right]. $$
(8)

For the new input data, the inverse of \(A_{t + 1}\) can be updated from \(A_{t}^{ - 1}\) by block inversion, with the Schur complement \(C_{t} = D_{t} - U_{t}^{T} A_{t}^{ - 1} U_{t}\):

$$ A_{t + 1}^{ - 1} = \left[ {\begin{array}{*{20}c} {A_{t}^{ - 1} + A_{t}^{ - 1} U_{t} C_{t}^{ - 1} U_{t}^{T} A_{t}^{ - 1} } & { - A_{t}^{ - 1} U_{t} C_{t}^{ - 1} } \\ { - C_{t}^{ - 1} U_{t}^{T} A_{t}^{ - 1} } & {C_{t}^{ - 1} } \\ \end{array} } \right]. $$
(9)
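The incremental step in Eq. (9) is the standard block-matrix (Schur-complement) inversion of the partitioned matrix in Eq. (8). A sketch, assuming \(C_{t}\) denotes the Schur complement \(D_{t} - U_{t}^{T} A_{t}^{-1} U_{t}\), verified against direct inversion:

```python
import numpy as np

def block_inverse_update(A_inv, U, D):
    """Given A_t^{-1} and the new blocks U_t, D_t of Eq. (8), return
    A_{t+1}^{-1} via the Schur complement S = D - U^T A^{-1} U."""
    S = D - U.T @ A_inv @ U
    S_inv = np.linalg.inv(S)
    AU = A_inv @ U
    top_left = A_inv + AU @ S_inv @ AU.T   # A^{-1} + A^{-1} U S^{-1} U^T A^{-1}
    top_right = -AU @ S_inv                # -A^{-1} U S^{-1}
    return np.block([[top_left, top_right],
                     [top_right.T, S_inv]])

# Check against direct inversion on a random symmetric positive-definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A_full = M @ M.T + 6 * np.eye(6)          # plays the role of A_{t+1}
A, U, D = A_full[:4, :4], A_full[:4, 4:], A_full[4:, 4:]
updated = block_inverse_update(np.linalg.inv(A), U, D)
err = np.max(np.abs(updated - np.linalg.inv(A_full)))
```

The benefit is that only the \(k \times k\) Schur complement is inverted when \(k\) new samples arrive, instead of re-inverting the full \((N+k) \times (N+k)\) matrix.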

The output value \(\hat{Y}_{{{\text{test}}}} \) can be estimated online as

$$ \hat{Y}_{{{\text{test}}}} = \left[ {\begin{array}{*{20}c} {K\left( {{\text{x}}_{{{\text{test1}}}} ,{\text{x}}_{1} } \right)} & \cdots & {K\left( {{\text{x}}_{{{\text{test1}}}} ,{\text{x}}_{M} } \right)} \\ \vdots & \cdots & \vdots \\ {K\left( {{\text{x}}_{{{\text{test}}N}} ,{\text{x}}_{1} } \right)} & \cdots & {K\left( {{\text{x}}_{{{\text{test}}N}} ,{\text{x}}_{M} } \right)} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {A_{M}^{ - 1} y_{1} } \\ \vdots \\ {A_{M}^{ - 1} y_{M} } \\ \end{array} } \right]. $$
(10)

Gaussian global best-growing operator

The Gaussian global best-growing operator is based on the global-best strategy and the Gaussian distribution, and is inspired by the global-best harmony search algorithm. The global-best strategy uses the state information of the current globally optimal target to improve the exploitation ability and enhance information sharing within the population. The Gaussian (normal) distribution differs from the uniform random distribution on \([0,1]\) utilized in COA: it expands the scope of the search and improves the global search ability to a certain extent.

Tent mapping inverse learning

Research has shown that in swarm intelligence search, the convergence performance is affected by the initial population: the larger and the more evenly distributed the population, the faster the algorithm converges to the optimal solution. Initializing the population with a chaotic map is random, ergodic, and bounded, so the search efficiency can be improved.

The initial sequence generated by tent mapping [19, 20] is more uniform than that generated by logistic mapping. Therefore, tent mapping is applied in BSOA to initialize the population, and inverse learning strategies are applied to optimize the initial population. Through competition, the better individuals are selected for the next generation of learning. Thus the population search scope is expanded, invalid search operations are reduced, and the convergence speed is improved. The mathematical expression is as follows [21]:

$$ x_{n + 1} = \left\{ {\begin{array}{*{20}c} {2x_{n} ,} & {0 \le x_{n} \le 1/2} \\ {2\left( {1 - x_{n} } \right),} & {1/2 < x_{n} \le 1} \\ \end{array} } \right.. $$
(11)

Define a feasible solution in the D-dimensional space as \(x = \left( {x_{1} ,x_{2} , \ldots ,x_{D} } \right)\) with \(x_{i} \in \left[ {a,b} \right]\); then its inverse solution is \(x^{\prime} = \left( {x_{1}^{\prime } ,x_{2}^{\prime } , \ldots ,x_{D}^{\prime } } \right)\), where \(x_{i}^{\prime } = a + b - x_{i}\).
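The two ingredients, the tent map of Eq. (11) and the inverse (opposition) solution, are small enough to sketch directly. The seed value `0.37` below is an arbitrary illustrative choice (dyadic seeds such as 0.5 collapse the tent map to 0 and should be avoided):

```python
def tent_sequence(x0, n):
    """Eq. (11): iterate the tent map on [0, 1] to get a chaotic sequence."""
    seq, x = [], x0
    for _ in range(n):
        x = 2 * x if x < 0.5 else 2 * (1 - x)
        seq.append(x)
    return seq

def inverse_solution(x, a, b):
    """Opposition point: x'_i = a + b - x_i in each dimension."""
    return [a + b - xi for xi in x]

pts = tent_sequence(0.37, 5)                      # chaotic values in [0, 1]
opp = inverse_solution([1.0, -2.0, 3.0], -5.0, 5.0)
```

Evaluating both a candidate and its opposite, then keeping the fitter one, is what expands the search scope without extra random sampling.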

Dynamic mutation strategies

In later iterations of BSOA, the decreasing population diversity reduces the search capacity. Dynamic mutation strategies are introduced to increase the population diversity in later iterations and improve the convergence accuracy, so that premature convergence can be avoided. Scholars have proposed a variety of mutation operators, such as the Gaussian mutation and the Cauchy mutation [22, 23]. Compared with the Gaussian operator, the Cauchy operator has longer tails and can generate a wider range of random numbers, offering a greater chance to jump out of a local optimum. Meanwhile, the Cauchy mutation spends less time searching nearby areas because of its lower peak value.

The proposed HCOBSO method

The output weight \(\beta\) is the most important parameter of an ELM, and the number of hidden-layer nodes \(L\) must be estimated in advance. In this section, the HCOBSO method based on ICOA and IBSO is proposed. The motivation is stated in section “Motivations”, ICOA and IBSO are explained in detail in sections “The improved COA method” and “The improved BSO method”, and the implementation of the proposed hybrid intelligent optimization method is summarized in section “The proposed HCOBSO method”, so that an understanding of the proposed optimization method is facilitated.

Motivations

Although COA and BSO are both swarm intelligence optimization algorithms, they differ in some respects. In terms of search, COA simulates the growing and dying processes of coyotes, whereas BSO simulates the swarming and foraging patterns of beetles. In terms of guidance, COA uses only the single best coyote to guide the growth of the others. Therefore, the two methods are improved respectively and then fused so that each compensates for the drawbacks of the other, yielding the HCOBSO method. The proposed hybrid intelligent optimization method takes advantage of the global search ability of COA and the local search capability of BSOA.

The improved COA method

The COA method is inspired by the Canis latrans species of North America and is classified as both a swarm intelligence and an evolutionary heuristic method. Unlike other optimization methods, COA has a structure that does not focus on social hierarchy and dominance rules.

In this paper, we introduce the Gaussian global best-growing operator to replace the original growing operator in the COA method to enhance the searching efficiency and convergence speed. In the following, each step in the proposed COA framework will be explained in detail.

  1.

    Parameter initialization and random initialization of the coyote pack. First, the global population of coyotes, consisting of \(N_{p}\) packs each with \(N_{c}\) coyotes, is initialized. The initial social condition of the \(c{\text{th}}\) coyote of the \(p{\text{th}}\) pack in the \(j{\text{th}}\) dimension of the search space is set randomly as follows [24, 25]:

    $$ {\text{soc}}_{c,j} = {\text{lb}}_{j} + r_{j} \times \left( {{\text{ub}}_{j} - {\text{lb}}_{j} } \right) , $$
    (12)

where \({\text{lb}}_{j}\) and \( {\text{ub}}_{j}\) represent the lower and upper bounds of the \(j{\text{th}}\) decision variable, and \(r_{j}\) is a real random number generated inside the range \(\left[ {0,1} \right]\) using uniform probability. Then, the social fitness value of the coyote is evaluated as

$$ {\text{fit}}_{c} = f\left( {{\text{soc}}_{c} } \right). $$
(13)
  2.

    Coyote growing inside the pack. The information of the coyotes is computed with the COA as the cultural tendency \( {\text{cult}}\) of the pack

    $$ {\text{cult}}_{j} = \left\{ {\begin{array}{*{20}c} {O_{{\left( {N_{c} + 1} \right)/2,j}} ,} & {N_{c} \,\,{\text{is odd}}} \\ {\left( {O_{{N_{c} /2,j}} + O_{{N_{c} /2 + 1,j}} } \right)/2,} & {{\text{otherwise}}} \\ \end{array} } \right. $$
    (14)

where \(O\) denotes the ranked social conditions of the pack; i.e., \({\text{cult}}_{j}\) is the median of the coyotes' social conditions in the \(j{\text{th}}\) dimension.
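Since the cultural tendency of Eq. (14) is the per-dimension median of the pack's ranked social conditions, a short sketch (checked against NumPy's median) makes the odd/even branches concrete:

```python
import numpy as np

def cultural_tendency(pack_soc):
    """Eq. (14): per-dimension median of the pack's social conditions.
    pack_soc has shape (Nc, D): one row per coyote, one column per dimension."""
    O = np.sort(pack_soc, axis=0)          # rank the coyotes in each dimension
    Nc = O.shape[0]
    if Nc % 2 == 1:                        # odd pack size: middle element
        return O[(Nc - 1) // 2]
    return (O[Nc // 2 - 1] + O[Nc // 2]) / 2.0   # even: mean of the two middles

rng = np.random.default_rng(0)
odd_pack = rng.standard_normal((5, 3))     # Nc = 5 coyotes, D = 3
even_pack = rng.standard_normal((6, 3))    # Nc = 6 coyotes, D = 3
```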

Therefore, the coyotes are under the alpha influence \(\delta_{1}\) and the pack influence \(\delta_{2}\): \(\delta_{1}\) is the cultural difference from a random coyote of the pack \({\text{cr}}_{{1}}\) to the alpha coyote, and \(\delta_{2}\) is the cultural difference from a random coyote of the pack \({\text{cr}}_{2}\) to the cultural tendency of the pack. They are written as

$$ \delta_{1} = {\text{alpha}} - {\text{soc}}_{{{\text{cr}}_{{1}} }} $$
(15)
$$ \delta_{2} = {\text{cult}} - {\text{soc}}_{{{\text{cr}}_{{2}} }} . $$
(16)

The \({\text{alpha}}\) is the best coyote inside the pack; the \({\text{alpha}}\) of the \(p{\text{th}}\) pack at time instant \(t\) is defined as

$$ {\text{alpha}}^{p,t} = \left\{ {{\text{soc}}_{c}^{p,t} {\text{ | arg}}\mathop {\min }\limits_{{c = \left\{ {1,2, \ldots ,N_{c} } \right\}}} f\left( {{\text{soc}}_{c}^{p,t} } \right)} \right\}. $$
(17)

Unlike the original COA method, in this paper the new social condition of the coyote, \({\text{new}}\_{\text{soc}}_{1}\), is updated using the Gaussian global best-growing operator through the following equations:

$$ \delta_{3} = {\text{GP}} - {\text{soc}}_{{{\text{cr}}_{{1}} }} $$
(18)
$$ {\text{new}}\_{\text{soc}}_{1} = {\text{soc}} + rn_{1} \times \delta_{3} + rn_{2} \times \delta_{2} . $$
(19)

Here, \({\text{GP}}\) is the global best coyote in the current iteration, and \(rn_{1}\) and \(rn_{2}\) are real random numbers drawn from the Gaussian normal distribution. Compared with the original COA, the differences between individuals are large at the beginning of the search, which enhances the exploration ability; in the later search the differences become small and the search space contracts. Therefore, the new social condition of the coyote improves both exploration and exploitation.
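The update of Eqs. (18)–(19) is a one-line vector operation. A sketch under the stated assumptions (Gaussian draws \(rn_1, rn_2 \sim N(0,1)\) replacing the uniform factors of the original COA):

```python
import numpy as np

def gaussian_growing_update(soc, soc_cr1, soc_cr2, gp, cult, rng):
    """Eqs. (18)-(19): new_soc = soc + rn1*delta3 + rn2*delta2, where
    delta3 = GP - soc_cr1 (pull toward the global best, Eq. 18) and
    delta2 = cult - soc_cr2 (pull toward the cultural tendency, Eq. 16)."""
    delta3 = gp - soc_cr1
    delta2 = cult - soc_cr2
    rn1, rn2 = rng.standard_normal(2)   # Gaussian instead of uniform U[0, 1]
    return soc + rn1 * delta3 + rn2 * delta2

# Degenerate check: if GP == soc_cr1 and cult == soc_cr2, both differences
# vanish and the coyote keeps its current social condition.
rng = np.random.default_rng(0)
soc = np.zeros(4)
new = gaussian_growing_update(soc, soc_cr1=np.ones(4), soc_cr2=np.ones(4),
                              gp=np.ones(4), cult=np.ones(4), rng=rng)
```

Because the Gaussian factors are unbounded, early steps can overshoot the bracket spanned by the difference vectors, which is exactly the wider exploration the operator is meant to provide.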

  3.

    Birth and death of coyotes. The birth of a new coyote pup is written as a combination of the social conditions of two parents and environmental influence:

    $$ {\text{pup}}_{j} = \left\{ {\begin{array}{*{20}c} {{\text{soc}}_{{cr_{1,j} }} } & {r_{j} < p_{{\text{s}}} \;{\text{or}}\;j = j_{1} } \\ {{\text{soc}}_{{cr_{2,j} }} } & {r_{j} \ge p_{{\text{s}}} + p_{{\text{a}}} \;{\text{or}}\;j = j_{2} } \\ {R_{j} } & {{\text{otherwise}}} \\ \end{array} } \right\}, $$
    (20)

wherein \(r_{j}\) is the random number in the range \(\left[ {0,1} \right]\) using uniform probability, \(j_{1}\) and \(j_{2}\) are two random dimensions of the problem, \(R_{j}\) is a random number inside the decision variable bound of the \(j{\text{th}}\) dimension, and \(p_{{\text{s}}}\) and \(p_{{\text{a}}}\) are scatter probability and association probability, respectively. Besides, \(p_{{\text{s}}}\) and \(p_{{\text{a}}}\) can be written as

$$ p_{{\text{s}}} = 1/D $$
(21)
$$ p_{{\text{a}}} = \left( {1 - p_{{\text{s}}} } \right)/2. $$
(22)
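A literal sketch of Eqs. (20)–(22) may help: each pup gene comes from parent 1, parent 2, or a fresh random value, with two dimensions \(j_1, j_2\) guaranteed to be inherited. The uniform parent selection below is an assumption; the surrounding birth/death bookkeeping (Alg. 1) is omitted.

```python
import numpy as np

def breed_pup(soc_cr1, soc_cr2, lb, ub, rng):
    """Eq. (20): gene-wise crossover with scatter/association probabilities."""
    D = len(soc_cr1)
    ps = 1.0 / D                     # Eq. (21): scatter probability
    pa = (1.0 - ps) / 2.0            # Eq. (22): association probability
    j1, j2 = rng.choice(D, size=2, replace=False)   # dimensions inherited for sure
    pup = np.empty(D)
    for j in range(D):
        r = rng.uniform()
        if r < ps or j == j1:
            pup[j] = soc_cr1[j]                      # gene from the first parent
        elif r >= ps + pa or j == j2:
            pup[j] = soc_cr2[j]                      # gene from the second parent
        else:
            pup[j] = rng.uniform(lb[j], ub[j])       # random gene R_j
    return pup

rng = np.random.default_rng(0)
lb, ub = np.zeros(5), np.ones(5)
p1, p2 = rng.uniform(size=5), rng.uniform(size=5)
pup = breed_pup(p1, p2, lb, ub, rng)
```

Since \(p_{\text{s}} = 1/D\), the random-gene region shrinks as the dimension grows, keeping pups close to their parents in high-dimensional problems.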

To keep the population size static, the COA syncs the birth and death of coyotes, as shown in Alg. 1.

  4.

    Coyote expulsion and admission. Sometimes a coyote leaves its pack to become solitary, or joins another pack instead; this occurs with probability \(p_{{\text{e}}}\):

    $$ p_{{\text{e}}} = 0.005 \cdot N_{C}^{2} . $$
    (23)

The pseudocode of the proposed COA is described in Alg. 2.

To evaluate the robustness of the proposed COA algorithm, the Schaffer function is adopted to test its optimization performance, as shown in Fig. 1.

Fig. 1
figure 1

The test performance of Schaffer function

Figure 1 shows that the proposed COA method can find the global optimal solution effectively. The optimization results are close to the global extremum reference value, indicating that the proposed COA method has good optimization ability.

The improved BSO method

The solving speed and accuracy of BSO are higher than those of other optimization methods, and BSO has been used successfully in signal positioning and data classification. However, the original BSO has several drawbacks. First, it easily falls into local optima when solving complex optimization problems. Second, its computational complexity is very high when updating the parameters. Third, in multi-dimensional function optimization, relying on a single individual search increases the possibility of the algorithm running into a local optimum. Therefore, BSO is improved in this paper using tent mapping inverse learning and dynamic mutation.

For the population initialization, tent mapping inverse learning is adopted to initialize the population as follows [26]:

  1.

    In the search space, tent mapping is adopted to generate \(N\) positions \(x_{ij}\) of the beetle population as the initial population \({\text{OB}}\);

  2.

    According to the definition of inverse solving, inverse positions \(x_{ij}^{\prime }\) of each beetle position \(x_{ij}\) in the initial population \({\text{OB}}\) are generated as the inverse population \({\text{ FB}}\);

  3.

    The populations \({\text{OB}}\) and \({\text{FB}}\) are combined, and the fitness values of the resulting \(2N\) beetles are sorted in ascending order; the \(N\) beetles with the top \(N\) fitness values are selected as the initial population.
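The three initialization steps above can be sketched together. This is an illustrative implementation: the tent seed `x0 = 0.37` and the single chaotic stream feeding all dimensions are assumptions, and the common per-dimension bounds \([a, b]\) of the inverse-solution definition are generalized to vectors `lb`, `ub`.

```python
import numpy as np

def init_population(N, D, lb, ub, fitness, x0=0.37):
    """Steps 1-3: tent-map population OB, inverse population FB,
    and selection of the N fittest of the 2N candidates."""
    # Step 1: tent chaotic sequence (Eq. 11) mapped into [lb, ub]
    seq = np.empty((N, D))
    x = x0
    for i in range(N):
        for j in range(D):
            x = 2 * x if x < 0.5 else 2 * (1 - x)
            seq[i, j] = x
    OB = lb + seq * (ub - lb)
    # Step 2: inverse (opposition) population, x' = lb + ub - x
    FB = lb + ub - OB
    # Step 3: keep the N fittest of OB and FB, sorted ascending
    both = np.vstack([OB, FB])
    fit = np.array([fitness(ind) for ind in both])
    return both[np.argsort(fit)[:N]]

# Toy usage with the sphere function as the fitness to minimize.
sphere = lambda x: float(np.sum(x ** 2))
pop = init_population(N=10, D=3, lb=np.full(3, -5.0), ub=np.full(3, 5.0),
                      fitness=sphere)
fits = [sphere(ind) for ind in pop]
```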

In later iterations of the beetle optimization method, the diversity of the population worsens. Therefore, a secondary optimization operation is performed on the beetle population using the dynamic mutation

$$ \left\{ {\begin{array}{*{20}l} {X^{*} \left( t \right) = X\left( t \right) + \eta *C\left( {0,1} \right)} \hfill \\ {\eta = e^{{ - \lambda \frac{1}{T}}} } \hfill \\ \end{array} } \right.. $$
(24)

In Eq. (24), \(\eta\) is the mutation weight and \(C\left( {0,1} \right)\) is a random value generated by the Cauchy operator with scale parameter 1. The pseudocode of the improved BSO is shown in Alg. 3.
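Eq. (24) can be sketched as follows. The weight is implemented exactly as written, \(\eta = e^{-\lambda/T}\); the values of \(\lambda\) and \(T\) below are illustrative assumptions.

```python
import numpy as np

def cauchy_mutation(X, lam=1.0, T=100.0, rng=np.random.default_rng(0)):
    """Eq. (24): X* = X + eta * C(0, 1), with eta = exp(-lam / T).
    The Cauchy draw's heavy tails give a real chance of long jumps,
    helping the swarm escape local optima late in the run."""
    eta = np.exp(-lam / T)
    return X + eta * rng.standard_cauchy(X.shape)

# Mutate a small 4x2 population of beetle positions.
X = np.zeros((4, 2))
X_star = cauchy_mutation(X)
```

Most perturbations stay near zero (the Cauchy mode), while the occasional large draw supplies the diversity that the plain swarm loses in later iterations.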

To demonstrate the effectiveness of the proposed BSO method, we test its convergence using the unimodal function \(\min f\left( x \right) = \sum\nolimits_{i = 1}^{n} {x_{i}^{2} }\) and compare it with the GA, PSO, DE, and original BSO methods. The test results are shown in Fig. 2. The experimental results show that the convergence speed and accuracy are slightly improved compared with the other four optimization approaches. With tent mapping inverse learning and dynamic mutation, the global and local search abilities are balanced, the convergence is accelerated, and the algorithm is helped to jump out of local minima.

Fig. 2
figure 2

The convergence results of the proposed BSO method

The proposed HCOBSO method

A new hybrid optimization approach, the hybrid Coyote Optimization and Beetle Swarm Optimization (HCOBSO) algorithm, is proposed. It combines the advantages of the improved COA and improved BSO methods and borrows ideas from the frog leaping algorithm (FLA) to improve the generalization performance. The detailed pseudocode of the proposed optimization method is given in Table 1.

Table 1 The pseudocode of hybrid Coyote Optimization and Beetle Swarm Optimization methods (HCOBSO) algorithm

The proposed HI-DKIELM

This section gives the detailed structure of the proposed HI-DKIELM. The design of the HI-DKIELM is based on KI-ELM and deep learning networks: the HI-DKIELM consists of an input layer, an output layer, and several cascaded hidden layers, as shown in Fig. 3. During the training process, the HCOBSO optimization method is utilized to optimize the output weights and enhance robustness. For the \(k{\text{th}}\) hidden layer, the input feature \(X^{k}\) is obtained by subtracting from the initial input data and then mapping with the kernel function. The detailed implementation process of the proposed HI-DKIELM is given in Table 2.

Fig. 3
figure 3

The structure of the proposed DKI-ELM

Table 2 The pseudocode of the HI-DKIELM algorithm

Results and discussion

Experimental settings

Several experiments are provided in diverse ways to demonstrate the effectiveness of the proposed HI-DKIELM approach.

The experiments are run on a PC with an Intel Core i7-8700 at 3.40 GHz and 16 GB RAM. The proposed method is implemented in Matlab 2013a, and the code for the other models comes directly from the code published by the respective authors. To verify the effectiveness and robustness of the HI-DKIELM algorithm, the experiments are divided into six parts:

  1.

    First of all, in section “Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database”, the CEC2017 and CEC2019 databases are adopted to test the performance and robustness of the HCOBSO optimization algorithm proposed in section “The proposed HCOBSO method”; it is compared with incomplete (ablated) optimization methods to test the contribution of each improvement.

  2.

    In section “Performance evaluation of the HCOBSO optimization algorithm on typical functions”, the proposed HCOBSO optimization method is tested on five typical optimization functions to check its optimization ability.

  3.

    Section “Selection of hyperparameters” focuses on selecting hyperparameters in HI-DKIELM, including the penalty factor \(C\) and the number of hidden nodes \(L\).

  4.

    Section “Regression analysis” focuses on regression problems. The generalization performance of the HI-DKIELM algorithm is tested on 10 real UCI data sets and compared with the traditional ELM, I-ELM, EM-ELM, EI-ELM, and B-ELM.

  5.

    Section “Classification analysis” focuses on classification problems. The generalization performance of the proposed HI-DKIELM algorithm is tested on ten real UCI data sets and compared with the traditional ELM, I-ELM, EM-ELM, EI-ELM, and B-ELM.

  6.

    In section “Real-world application learning tasks”, the performance of the HI-DKIELM in practical application learning tasks is verified and compared with other baseline methods.

Data and parameter settings

In these experiments, several databases are used in various experiments to verify the performance of the novel HI-DKIELM method.

For the experiment shown in sections “Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database” and “Performance evaluation of the HCOBSO optimization algorithm on typical functions”, the CEC2017 database and five typical optimization functions are used to prove the optimization effectiveness of the proposed HCOBSO optimization method, respectively.

The CEC2017 database consists of unimodal functions (\(F_{1} - F_{3}\)), multimodal functions (\(F_{4} - F_{10}\)), hybrid functions (\(F_{11} - F_{20}\)), and composition functions (\(F_{21} - F_{30}\)). The public parameters of all algorithms are set the same to ensure fairness: the dimension \(D = 30\), and the maximum number of function evaluations is \(D \times 10{,}000\).

The CEC2019 database is known as “The 100-Digit Challenge” and is intended for use in the annual optimization competition; it is used to evaluate algorithms on large-scale optimization problems. The first three functions, CEC01–CEC03, have various dimensions, while CEC04–CEC10 are ten-dimensional minimization problems defined on \(\left[ { - 100,100} \right]\) that are shifted and rotated. All the CEC functions are scalable, and the global optimum of each function is unified to the value 1.

The five typical functions are shown in Table 3. \({\text{MaxDT}}\) is set to 100 and 500 for \(D = 10\) and \(D = 30\), respectively. The range of solutions of the typical function \(F_{n5}\) is set to \(\left[ { - 100,100} \right]\), while the others are set to \(\left[ { - 30,30} \right]\).

Table 3 Typical optimization functions

In section “Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database”, the best parameters for the COA optimization method are \(N = 100\), \(N_{c} = 5\), and \(N_{p} = 20\). The parameters of HCOBSO are set to \(N = 100\), \(N_{c} = 5\), \(N_{p} = 20\) in the earlier search and to \(N = 100\), \(N_{c} = 10\), \(N_{p} = 10\) in the later search.

In sections “Selection of hyperparameters” and “Regression analysis”, UCI machine learning data sets are adopted in the experiments, covering 13 regression problems and 14 classification problems. The data specifications are shown in Table 4, where the notations #train and #test denote the training set and test set, respectively.

Table 4 The specification of 27 benchmark UCI machine data sets

Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database

To test the contribution of each improvement to the HCOBSO method, HCOBSO is compared with several incomplete optimization algorithms, HCOBSO5 (\(N_{c} = 5,\;N_{p} = 20\)), COA, BSO, and ICOA, as well as LPB [27] and FDO [28], on the CEC2017 and CEC2019 databases, as shown in Table 5. In the experiment, three functions are selected from each group. The rank method is used to compare the means: a smaller mean gives a better rank, and when the means are equal, the standard variances are compared.

Table 5 The comparison results of HCOBSO and its incomplete algorithms under CEC2017 database

Table 5 shows that HCOBSO5 obtains the best mean and variance most often on the unimodal functions, indicating that the global search ability of the BSO method and the exploitation ability of the COA algorithm improve with the growth of \(N_{p}\): a large \(N_{p}\) means strong exploitation ability and a fast convergence rate. However, HCOBSO5 falls into local minima easily, and its optimization performance on the multimodal functions is not good. Among the five optimization methods, the average rank of HCOBSO is 1.58, while the average ranks of COA, BSO, ICOA, and HCOBSO5 are 4.16, 4.58, 2.75, and 2, respectively. The HCOBSO optimization method proposed in this paper obtains the best total rank among the five methods, while the total ranks of COA, BSO, ICOA, and HCOBSO5 are 4, 5, 3, and 2, respectively. These results show that HCOBSO achieves the best optimization performance and prove the effectiveness of the method.

From the results shown in Table 6, we can see that HCOBSO obtains the best mean and variance most often on the CEC2019 database, which means that with the growth of \(N_{p}\), the global search ability of the BSO method and the exploitation ability of the COA algorithm improve correspondingly: the larger the value of \(N_{p}\), the stronger the exploitation ability and the faster the convergence. Among the six optimization methods, the average rank of HCOBSO is 1.7, while the average ranks of LPB, FDO, COA, BSO, and HCOBSO5 are 3.3, 3, 4.5, 5.2, and 3.3, respectively. We can also observe that the HCOBSO optimization method proposed in this paper obtains the best total rank among the six methods.

Table 6 The comparison results of HCOBSO and its incomplete algorithms under CEC2019 database

Performance evaluation of the HCOBSO optimization algorithm on typical functions

In this part, we compare the HCOBSO algorithm with four baseline optimization methods: COA, BSO, ICOA, and DE-MPGWO [15] using typical functions.

In this study, the number of iterations of each algorithm is set to 2000, and the population scale of all methods is the same, i.e., \(N_{p} = 40\). Each experiment is repeated 50 times, and the averaged results are reported as the final optimization result. Tables 7 and 8 show the optimization results of the COA, BSO, ICOA, DE-MPGWO, and HCOBSO methods on five typical functions under \(D = 10\) and \(D = 30\), respectively.

Table 7 Typical optimization function optimization results under \(D = 10\)
Table 8 Typical optimization function optimization results under \(D = 30\)

Tables 7 and 8 show that all methods perform ideally on function \(F_{n2}\); the results on the remaining functions are summarized as follows: (1) the proposed HCOBSO gives more precise results than the COA, BSO, ICOA, and DE-MPGWO methods on the other four typical functions; (2) the BSO algorithm can quickly drop to the minimum and jump out of local minima within a short time.

To sum up, the HCOBSO optimization method proposed in this study achieves a good balance between search accuracy and convergence speed and improves the search ability.
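The evaluation protocol used above (many independent runs on a benchmark function, with the results averaged) can be sketched as follows; `random_search` is a hypothetical stand-in for the actual optimizers, and the Sphere function is one common choice of typical test function (the paper's exact function set is not reproduced here):

```python
# Illustrative protocol: run an optimizer 50 independent times on a
# benchmark function and average the best values found.
import random

def sphere(x):
    """Sphere benchmark: global minimum 0 at the origin."""
    return sum(xi * xi for xi in x)

def random_search(f, dim, iters, rng):
    """Baseline optimizer standing in for COA/BSO/etc. in this sketch."""
    best = float("inf")
    for _ in range(iters):
        x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        best = min(best, f(x))
    return best

rng = random.Random(0)
runs = [random_search(sphere, dim=10, iters=2000, rng=rng) for _ in range(50)]
print(sum(runs) / len(runs))  # averaged best value over 50 runs
```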

Selection of hyperparameters

In this part, we examine the hyperparameters of HI-DKIELM and the original ELM. Compared with traditional neural-network training algorithms, ELM and HI-DKIELM require fewer parameters. According to the description of HI-DKIELM in section “The proposed HI-DKIELM”, two hyperparameters are crucial: the penalty factor \(C\) and the number of hidden nodes \(L\). The results are shown in Fig. 4.

Fig. 4
figure 4

Testing accuracy in \((C,L)\) subspace for the HI-DKIELM and ELM. a Accuracy of HI-DKIELM in terms of \(L\). b Accuracy of ELM in terms of \(L\). c Accuracy of HI-DKIELM in terms of \(C\). d Accuracy of ELM in terms of \(C\)

The learning accuracies of ELM and HI-DKIELM in the \(L\) subspace are shown in Fig. 4a and b, with the parameter \(C\) fixed. The learning accuracies of ELM and HI-DKIELM in the \(C\) subspace are shown in Fig. 4c and d, with the parameter \(L\) fixed. Figure 4a and b shows that the proposed HI-DKIELM follows a convergence behaviour similar to that of ELM but with higher testing accuracy; moreover, its performance is more stable over a wide range of \(L\). According to Fig. 4c and d, the accuracy of HI-DKIELM first changes only slightly as \(C\) increases and then rises rapidly, converging to a better performance than the original ELM.

The value of \(L\) only needs to be large enough: once it is, the accuracy curve is insensitive to \(L\), so \(C\) can be selected with relatively few hidden nodes.
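The selection of \((C, L)\) described above can be sketched as a grid search over testing accuracy; `train_and_score` and the toy accuracy surface below are hypothetical stand-ins, not the actual HI-DKIELM training routine:

```python
# Sketch of (C, L) hyperparameter selection by grid search: evaluate testing
# accuracy over a grid of penalty factors C and hidden-node counts L, keep the
# best pair. `train_and_score` stands in for training/evaluating the model.
import itertools

def select_hyperparameters(train_and_score, C_grid, L_grid):
    best = (-1.0, None, None)
    for C, L in itertools.product(C_grid, L_grid):
        acc = train_and_score(C, L)
        if acc > best[0]:
            best = (acc, C, L)
    return best  # (best accuracy, best C, best L)

# Toy accuracy surface: saturates once L is "large enough" and C is moderate,
# loosely mimicking the behaviour reported in Fig. 4 (purely illustrative).
def toy_score(C, L):
    return min(L / 100.0, 1.0) * (1.0 - abs(C - 2.0 ** 4) / 100.0)

acc, C, L = select_hyperparameters(
    toy_score, [2.0 ** k for k in range(-2, 8)], [20, 50, 100, 200]
)
print(acc, C, L)  # 1.0 16.0 100
```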

Regression analysis

In this part, we test the performance of the HI-DKIELM method proposed in section “The proposed HI-DKIELM” on the regression analysis, using the actual UHI data set in Table 4, to evaluate its generalization and robustness. It is compared with four baseline ELM methods: enhanced random search-based ELM (EI-ELM), error-minimization ELM (EM-ELM), incremental ELM (I-ELM), and bidirectional ELM (B-ELM). In the experiment, the initial number of hidden-layer neurons and the number of hidden neurons added per step are both set to one, so the number of neurons in the hidden layer equals the number of iterations. Table 9 compares the testing root-mean-square error (RMSE) and training time on the regression problem.
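As background for the incremental variants compared here, the base ELM described in the introduction draws its input weights and biases at random and solves the output weights with the Moore–Penrose pseudoinverse. A minimal sketch, in which the sigmoid activation, toy data, and node count are illustrative assumptions rather than the paper's configuration:

```python
# Minimal base-ELM sketch: random input weights and hidden biases, output
# weights solved in closed form via the Moore-Penrose pseudoinverse.
import numpy as np

def elm_fit(X, y, L, rng):
    W = rng.standard_normal((X.shape[1], L))   # random input weights
    b = rng.standard_normal(L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output matrix (sigmoid)
    beta = np.linalg.pinv(H) @ y               # output weights (least squares)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2             # toy regression target
W, b, beta = elm_fit(X, y, L=50, rng=rng)
rmse = np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
print(rmse)  # small training RMSE on this smooth toy target
```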

Table 9 Generalization performance comparison between different ELM-based methods on regression problem

The proposed method shows a generalization performance similar to the other ELMs but with a much higher learning speed. For instance, on the Auto MPG dataset, the training time of the proposed HI-DKIELM method is only 0.0001 s, 156 times as fast as ELM, 2025 times as fast as I-ELM, 13004 times as fast as EI-ELM, 75 times as fast as EM-ELM, and 3732 times as fast as B-ELM. On the Delta elevators dataset, the training time of the proposed HI-DKIELM method is about 0.0001 s, 616 times as fast as ELM, 4680 times as fast as I-ELM, 35478 times as fast as EI-ELM, 431 times as fast as EM-ELM, and 453 times as fast as B-ELM. Furthermore, in this section, the performance of ELM and the proposed method has been tested on the XOR problem, which has one training sample in each class; the aim of this simulation is to verify whether the proposed method can handle rare cases with extremely few training data. The results show that the testing RMSE obtained by ELM approaches 0.0010 with 2000 hidden nodes, while the proposed method reaches a testing RMSE close to 0.0010 with only 87 hidden nodes.

In addition, the comparison of average test accuracy is shown in Fig. 5, where the \(x\)-axis represents the number of hidden nodes and the \(y\)-axis represents the average testing root-mean-square error (RMSE). Compared with the other ELM methods, the HI-DKIELM method has the best test performance. For instance, on the Ca house dataset, the test error obtained by the HI-DKIELM method is half that of ELM. Moreover, the test RMSE of the ELM method approaches 0.0010 with 2000 hidden nodes, while that of the HI-DKIELM method is close to 0.0010 with only 89 hidden nodes.

Fig. 5
figure 5

The average testing accuracy of the different ELM-based methods under different databases for the regression problem

Classification analysis

In this part, we test the performance of the HI-DKIELM method proposed in section “The proposed HI-DKIELM” on the classification problem, using the actual UHI data set given in Table 4, to evaluate its generalization and robustness in comparison with the four baseline ELM methods.

In the experiment, the initial number of hidden-layer neurons and the number of hidden neurons added per step are both set to one, so the number of neurons in the hidden layer equals the number of iterations. We compare the testing accuracy and training time on the classification problem; the comparison of the testing accuracy is displayed in Fig. 6.

Fig. 6
figure 6

The average testing accuracy of the different ELM-based methods under different databases for the classification problem

The results in Fig. 6 show that the proposed HI-DKIELM has significant advantages in training speed. On the Connect-4 dataset, which has a large number of training samples of medium input dimension, the proposed method runs 5600 times, 300 times, 45 times, and 1000 times as fast as I-ELM, EI-ELM, EM-ELM, and PC-ELM, respectively. The Hill Valley dataset has a moderate number of training samples with a moderate input dimension; on it, the proposed method runs 3 times, 20 times, 25 times, and 1200 times as fast as I-ELM, EI-ELM, EM-ELM, and PC-ELM, respectively, with better performance.

Real-world application learning tasks

In this section, we evaluate the performance of the proposed HI-DKIELM on a real-world application: the image classification task.

The original data are color images from the Corel data set. Each image is segmented, using the Blobworld system, into fragments that represent instances. Fragments containing specific visual content (e.g., an elephant) are labeled positive, while the remaining fragments are labeled negative; the fragments (i.e., instances) from images of the same kind (e.g., elephant) thus form a binary learning problem. We use five image data sets: Tiger, Elephant, Fox, Bikes, and Cars, with 1096, 1259, 1474, 5215, and 5600 instances, respectively. The instances in the Tiger, Elephant, and Fox data sets are described by a 230-dimensional feature vector representing the color, texture, and shape of the region, while the instances in the Bikes and Cars data sets are represented by a 90-dimensional feature vector.

In image classification, visual content-based image retrieval is an important application, for example, finding pictures containing an elephant from a data set. In this subsection, sample images from the benchmark data sets are shown in Fig. 7.

Fig. 7
figure 7

Example images used in the experiment from the COREL image categorization database

The detailed experimental results for the image classification tasks are reported in Tables 10 and 11 in terms of classification accuracy (ACC) and area under the ROC curve (AUC), respectively. As Tables 10 and 11 show, the proposed HI-DKIELM achieves the best ACC and AUC on all image datasets, indicating that HI-DKIELM is superior to the other methods in content-based image retrieval tasks. This extraordinary performance is due to the many local approximations created by the proposed HI-DKIELM. The results also show that the naive Bayes (NB) method has the worst ACC and AUC performance on the Tiger, Elephant, Bikes, and Cars datasets, whereas the SVM method performs worst on the Fox data set. More detailed results for the other baselines can be found in Tables 10 and 11.
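The two reported metrics can be computed as follows; AUC is evaluated here via its rank-statistic (Mann–Whitney) formulation, and the labels and scores are illustrative:

```python
# Sketch of the reported metrics: classification accuracy (ACC) and area
# under the ROC curve (AUC), the latter as the probability that a random
# positive instance outscores a random negative one (ties count half).
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred))  # 0.8
print(auc(y_true, scores))       # ~0.667
```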

Table 10 Experimental results on image classification data sets concerning classification accuracy (ACC) %
Table 11 Experimental results on image classification data sets concerning the area under the ROC curve (AUC) (%)

Conclusion

In this paper, a novel ELM method named Hybrid Intelligent Deep Kernel Incremental Extreme Learning Machine (HI-DKIELM), based on the hybrid coyote optimization Beetle Swarm Optimization (HCOBSO) method, is proposed. The proposed method has several features that distinguish it from existing ELM-based methods.

  1. An improved Coyote Optimization Algorithm is proposed to improve the search efficiency and convergence, in which a Gaussian global best-growing operator replaces the original growing operator.

  2. An improved Beetle Swarm Optimization method is designed based on Tent-mapping inverse learning and dynamic mutation strategies to prevent the BSO algorithm from falling into a local optimum.

  3. A novel hybrid intelligent optimization algorithm, the hybrid coyote optimization Beetle Swarm Optimization (HCOBSO) method, is presented for optimizing the parameters of the DKIELM. It combines the global search ability of the coyote optimization algorithm with the local search capability of the Beetle Swarm Optimization algorithm.

In future work, there is room to further develop the proposed methods. First, gaining deeper insight into ELM to explore its deep learning capability remains challenging. Second, from an optimization point of view, other novel heuristic algorithms, such as the five-element cycle optimization method, could be employed.