1 Introduction

Although a number of direct search methods (Kolda et al. 2003) for solving continuous optimization problems were proposed over half a century ago, the last two decades have witnessed a very rapid influx of papers introducing novel metaheuristics (e.g. Boussaïd et al. 2013; Xing and Gao 2014; Salcedo-Sanz 2016). The vast majority of these papers propose yet another optimization algorithm; a much smaller fraction aims at understanding the behaviour of the algorithms that already exist. In recent years, however, important questions have been raised regarding the behaviour and usefulness of metaheuristics, as well as the quality of the related research (both in general terms and with regard to specific methods).

The potential deficiency of metaheuristics discussed in this study, referred to as structural bias, was addressed by Kononova et al. (2015). According to Kononova et al. (2015), a heuristic algorithm is structurally biased when it is more prone to visit some parts of the search space than others and this behaviour is not justified by the fitness landscape. They found such a tendency in a simple variant of the genetic algorithm (GA) (Holland 1975) and in a basic particle swarm optimization (PSO) algorithm (Eberhart and Kennedy 1995). As metaheuristics are designed to sample the search space based only on the fitness values observed at already visited points, such behaviour suggests that the algorithm itself is somehow biased. As a very simple example, imagine a heuristic that moves a population of individuals in the search space according to the rule \(x_{ij}^{t+1} =x_{ij}^t/\textit{rand}(1,2)\), where \(x_{ij}^t \) is the position of the ith individual in the jth dimension at the tth iteration, and \(\textit{rand}(1,2)\) is a random number generated from the uniform distribution within the interval [1, 2]. Such an algorithm, composed of individuals that move independently, is structurally biased and will converge to \(\mathbf x = 0\). The closer the optimum of a particular problem lies to \(\mathbf x = 0\), the higher the chance that such an algorithm finds it. Because in many artificial benchmark functions used to test optimizers the optimum is located at, or close to, \(\mathbf x = 0\) [as, for example, in 40% of the problems defined in Yao et al. (1999)], the naive procedure defined above could seem very successful. This is of course a trivial example, but if an algorithm is more complicated, the presence of some structural bias may easily be overlooked. This can lead to doubtful conclusions of good performance, especially if an approach is tested on a small number of simple problems.
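To make the toy example concrete, the following sketch (our illustration, not code from the original study) simulates the \(x_{ij}^{t+1}=x_{ij}^t/\textit{rand}(1,2)\) rule and shows that the whole population collapses towards \(\mathbf x = 0\) without ever consulting a fitness value:

```python
import numpy as np

rng = np.random.default_rng(42)
pop_size, dim, iterations = 20, 2, 200

# individuals initialized uniformly in [0, 1]^dim
x = rng.random((pop_size, dim))

for t in range(iterations):
    # the toy update rule: divide every coordinate by a random factor in [1, 2];
    # note that no fitness value is ever consulted
    x = x / rng.uniform(1.0, 2.0, size=x.shape)

print(x.max())  # after a few hundred iterations all coordinates are very close to 0
```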

The presence of structural bias may be determined by testing algorithms on a fully random function \(f({\varvec{x}})\) that maps any location \({\varvec{x}}\) to a random value drawn from the uniform distribution on [0, 1]. Values of \(f({{\varvec{x}}})\) depend neither on the values of f in the neighbourhood of \({\varvec{x}}\) nor on the values of f sampled in the past. Without loss of generality, minimization of f is considered in this study. Examples of such functions have been given in Cleghorn and Engelbrecht (2014, 2015) and Kononova et al. (2015). In this study, we use the variant of the function proposed in Kononova et al. (2015):

$$\begin{aligned} {f:S\subset R^{D}\rightarrow [0,1],}\qquad {f({{\varvec{x}}})\in \textit{uniform}[0,1],} \qquad S=[0,1]^{D} \end{aligned}$$
(1)

The function defined in Eq. (1) does not have any stable fitness landscape, as each time a particular location \({\varvec{x}}\) is visited, the value of \(f({{\varvec{x}}})\) is generated anew from the uniform distribution. This is a slightly different approach from that proposed in Cleghorn and Engelbrecht (2014, 2015), where f is static during a particular run. Hence, when searching for the minimum of f, if the algorithm does not sample some parts of the search space more often than others (due to the algorithm's own structural bias), the distribution of the positions of the best solutions found during multiple runs should also be uniform over the search space. This can be confirmed by a visual inspection of the respective plots or, more objectively, by statistical tests. If the distribution cannot be considered uniform, there must be some structural bias in the algorithm itself.
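A minimal implementation of the fully random objective of Eq. (1) might look as follows (a sketch of ours; the study's own implementation was in MATLAB):

```python
import numpy as np

def fully_random_f(x: np.ndarray, rng: np.random.Generator) -> float:
    """Fully random objective of Eq. (1): every evaluation returns a fresh
    uniform [0, 1) value, regardless of where (or how often) x was sampled before."""
    assert np.all((x >= 0.0) & (x <= 1.0)), "search domain is S = [0, 1]^D"
    return rng.random()

rng = np.random.default_rng(1)
x = rng.random(30)                                     # a 30-dimensional point in S
print(fully_random_f(x, rng), fully_random_f(x, rng))  # two different values for the same x
```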

In Kononova et al. (2015), it is suggested that the structural bias is "stronger" when the population size increases and that such bias has a higher impact on the search efficiency when the problem to be solved is "difficult". However, Kononova et al. (2015) examined only two very simple metaheuristics, basic variants of GA and PSO, and only on 30-dimensional variants of f defined in Eq. (1), with the number of function calls fixed to 300,000. Each experiment was repeated 50 times. Although this number of repetitions is frequently used for statistical tests, to convincingly detect structural bias in specific algorithms a much larger number of repetitions may be needed. In addition, in Kononova et al. (2015) more attention was paid to visual observations than to statistical tests, even though such observations are inevitably subjective.

The present study expands the research described in Kononova et al. (2015) and goes a step further. The main goal is to search for the presence of structural bias in various variants of two popular families of metaheuristics, namely PSO and differential evolution (DE) (Storn and Price 1997), and in two widely used historical direct search methods (Nelder and Mead 1965; Rosenbrock 1960). In addition, we verify how quickly the effect of structural bias can be observed and how strongly the problem's dimensionality and the algorithm's population size affect its occurrence. We also find the source of structural bias in one selected variant of PSO and show that it may be removed. Finally, all 14 algorithms that are tested for the presence of structural bias are also used to solve 22 real-world problems from CEC2011 (Das and Suganthan 2010) and 50-dimensional versions of 30 artificial benchmarks from CEC2014 (Liang et al. 2013). The objective is to check whether the presence of structural bias can indeed affect the practical application of metaheuristics (see the discussion in Michalewicz 2012).

2 Background: recent criticisms of optimization metaheuristics

Notwithstanding the wide popularity of metaheuristics and the overall acclaim they receive in the majority of papers devoted to these methods, some doubts have been raised in the recent literature. Although in this study we pay attention mostly to the problem of structural bias in the algorithms (Kononova et al. 2015), it is useful to place our discussion in a broader context. First of all, the abundance of metaheuristics that are frequently developed without any theoretical justification leads to at least three undesired effects: (i) it is hard to properly choose the right method for a particular task (Muñoz et al. 2015; Yuen and Zhang 2015); (ii) although the performance of some widely used metaheuristics is promising, many others show little (if any) novelty and efficiency; instead they rather compete for popularity by using appealing nomenclature (Sörensen 2015); (iii) even though studies of the behaviour of metaheuristics do appear (Clerc and Kennedy 2002; Van den Bergh and Engelbrecht 2004; Auger and Doerr 2011; Liao et al. 2013; Bonyadi and Michalewicz 2014; Cleghorn and Engelbrecht 2014; Hu et al. 2014; Rada-Vilela et al. 2014; Leonard et al. 2015; Hu et al. 2016), in the majority of papers the desire to propose yet another novel optimizer dominates over the willingness to gain deeper insight into the already available methods. This fact, combined with a lack of commonly accepted procedures to properly develop, study and compare metaheuristics [see for example the discussion in Michalewicz (2012) and Sörensen et al. (2015)], has triggered a number of critical research outputs that identified many important issues.

From the no-free-lunch (NFL) theorems for optimization, which have already been widely discussed from various points of view (Wolpert and Macready 1997; Köppen et al. 2001; Schumacher et al. 2001; Droste et al. 2002; Whitley and Rowe 2008; Rowe et al. 2009; Duéñez-Guzmán and Vose 2013), it is well known that the performance of all heuristic optimizers averaged over all problems is equal. Although it is generally understood that only a very small fraction of "all problems" [for a detailed discussion of what "all problems" means, see Wolpert and Macready (1997) and Schumacher et al. (2001)] is of interest to anyone (Igel and Toussaint 2003, 2004), it is very unfortunate that the implications of the NFL theorems are rarely considered and discussed when novel metaheuristics are proposed. Although any applicable set of problems would be just an extremely small fraction of "all problems", this is not an excuse for the fact that new methods are frequently tested on very few artificial benchmark problems. Moreover, such benchmark problems are often the ones for which the performance of the promoted algorithms is good enough to justify their being proposed. Although relationships between the types of problems, or their specific features, and the performance of algorithms may be identified (Malan and Engelbrecht 2013, 2014; Muñoz and Smith-Miles 2016), such relations are very rarely studied in practice. As a result, when metaheuristics are applied to solve practical tasks, despite their popularity and wide acclaim (Fogel 2000; Eiben and Smith 2015; Zelinka 2015), they are often outperformed by problem-specific algorithms (Droste et al. 2006) or, when the computational budget is limited, by mathematical programming approaches (Pošik et al. 2012). Moreover, although the convergence of most metaheuristics cannot be proven, for some of them one may prove that on specific types of problems their convergence cannot be guaranteed (Hu et al. 2016).

Numerous studies show how disputable the empirical comparison of metaheuristics may be. For example, Derrac et al. (2014) noted, in research based on the CEC2011 real-world problems (Das and Suganthan 2010), that very different rankings of metaheuristics may be obtained depending on whether statistical tests are used to verify the significance of differences between the final results achieved by the different metaheuristics or to verify the differences in their convergence properties. Veček et al. (2014) likewise emphasize the extent to which the final opinion on the algorithms' performance may depend on the choice of the statistical tests used. In Dymond et al. (2015), it was pointed out that the choice of proper control parameter values for various metaheuristics largely depends on the number of function calls used. Liao et al. (2013) showed that the claimed superiority of the newer variants of artificial bee colony (ABC) algorithms over the older methods may be confirmed when tests are performed on simple low-dimensional problems, but not when more difficult high-dimensional problems are considered. Draa (2015) showed, based on a wide selection of problems, that the performance of the flower pollination algorithm, one of the recently introduced approaches, turns out to be relatively poor when compared with classical algorithms, in contradiction to what is claimed in the literature. Similarly, Fong et al. (2016) found no superiority of a novel bat-inspired algorithm over the classical PSO. Finally, Piotrowski (2015) and Piotrowski and Napiorkowski (2016) showed that very different algorithm rankings may be derived when one performs tests by searching for the minimum versus the maximum of exactly the same functions.

The above discussion shows that the way in which a comparative analysis is conducted may affect the opinion on the performance of competing metaheuristics; the results may even be skewed to draw the intended conclusions. No wonder, then, that the very quality of research reported in papers analysing metaheuristics has been questioned a number of times. Mernik et al. (2015) studied all papers on the artificial bee colony (ABC) algorithm that were available up to August 2013 and that referred to the first journal paper on ABC (Karaboga and Basturk 2008). They found that, depending on the evaluation criteria, between 40 and 67% of such studies led to results that are not trustworthy or are even simply wrong, due to improperly made comparisons among different methods. Some papers (Ampellio and Vassio 2016) introduce both a novel algorithm and a novel method of comparison between algorithms, which may raise doubts regarding the evaluation of the achieved results. Črepinšek et al. (2016) discussed in depth how, by manipulating the replications of experiments, the comparison of metaheuristics may become unfair. This frequent lack of scrutiny, which may lead to erroneous results when comparing various metaheuristics, has also been addressed in numerous other studies (Eiben and Jelasity 2002; Weise et al. 2012; Črepinšek et al. 2012, 2014; Chinta et al. 2016). The fact that similar problems are also observed in other fields of science (Yancey 1990; Hall et al. 1996; Kitchenham et al. 2002) should not be considered an excuse.

Table 1 Algorithms tested for the presence of structural bias

Apart from the difficulty of setting fair comparison rules, and occasional errors and mistakes, the simple lack of originality of some quasi-novel metaheuristics has been discussed in various recent papers. It seems that in many recently proposed methods the novelty is largely... the name. For example, Weyland (2010, 2015) showed that the harmony search algorithm is just a special case of one of the well-known evolution strategies (Rechenberg 1973). Piotrowski et al. (2014) pointed out that the black-hole-based heuristic algorithm is just a much simplified variant of PSO with inertia weight (Shi and Eberhart 1998). Simon et al. (2011) discussed the large similarities between biogeography-based optimization and many older metaheuristics. A number of recently introduced approaches based on physical processes are also considered to be mainly replications of earlier methods (Salcedo-Sanz 2016), and a few core components that are widely used in various metaheuristics are identified in Fong et al. (2016) to expose the lack of novelty in some of the most recent methods. According to Sörensen (2015), the introduction of such redundant approaches can be partly explained by the common practice of hiding the lack of methodological novelty behind new nomenclature derived from the "inspiration", resulting in algorithms based on "quantum frogs", "competitive imperialists" or "intelligent water drops". A kind of recipe for how to prepare such algorithms is outlined by Fister et al. (2016) to draw more attention to the problem. Luckily, the authors of some papers (Sörensen et al. 2015; Fister et al. 2016) argue that there are signs that this tendency may soon be stopped. As one may conclude from this brief review, the question regarding the presence of structural bias in such popular families of optimization algorithms as particle swarm optimization and differential evolution touches just the tip of an iceberg that, if warmed up, may flood the field of metaheuristics with at least partly deserved scepticism.

3 Methods for structural bias detection

In this paper, the 14 metaheuristics and direct search methods listed in Table 1 are tested for the presence of structural bias. All abbreviations of the specific algorithms are defined in Table 1. Among these 14 methods, there are five PSO variants (PSO-basic, i.e. the first version proposed by Eberhart and Kennedy (1995), as well as ALC-PSO, CLPSO, DNS-PSO and PSO-init-weight), five DE variants (DE-basic, i.e. the first version proposed by Storn and Price in 1995, as well as CDE, DEGL, MDE_pBX and SADE), one hybrid PSO-DE approach (CLPSO-DEGL), two historical direct search methods (NMA and RA) and, for comparison, the winner of the CEC2011 competition (GA-MPC). Control parameters of the algorithms are mostly the same as in the original papers, with the exception of the population sizes and of those parameters discussed in the comments column of Table 1. Using standard values of control parameters is important, as different control parameter values may exaggerate or mitigate the presence of structural bias. We explicitly verify the impact of the population size settings on the structural bias in various metaheuristics. Apart from CLPSO and GA-MPC, which implement their own approaches to handling boundary constraints (Liang et al. 2006; Elsayed et al. 2011), the reflection method (Helwig et al. 2013) is used to prevent algorithms from exceeding the bounds.
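For illustration, one common way of implementing such a reflection scheme is sketched below (our own formulation; details of the variant of Helwig et al. (2013) used in the study may differ slightly): out-of-bound coordinates are mirrored back at the violated bound.

```python
import numpy as np

def reflect(x: np.ndarray, lower: float = 0.0, upper: float = 1.0) -> np.ndarray:
    """Mirror out-of-bound coordinates back into [lower, upper].

    A coordinate below the lower bound is reflected at that bound, and likewise
    at the upper bound; the loop covers the rare case of overshooting the
    opposite bound after a reflection."""
    x = np.asarray(x, dtype=float).copy()
    while np.any((x < lower) | (x > upper)):
        x = np.where(x < lower, 2.0 * lower - x, x)
        x = np.where(x > upper, 2.0 * upper - x, x)
    return x

print(reflect(np.array([-0.3, 0.4, 1.2])))   # -> [0.3 0.4 0.8]
```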

The PSO and DE algorithms are chosen in this study due to their large popularity and versatile applications in various fields of science. The interested reader may find details on both families of methods in numerous reviews or books (Price et al. 2005; Clerc 2006; Poli et al. 2007; Banks et al. 2008; Das et al. 2008, 2016; Neri and Tirronen 2010; Das and Suganthan 2011; Xin et al. 2012; Rada-Vilela et al. 2014; Zhang et al. 2015; Piotrowski 2016; Bonyadi and Michalewicz 2016).

To detect the presence of structural bias, each algorithm is tested in numerous experiments on the fully random function defined in Eq. (1), under the following conditions:

  • the dimension (D) of the domain of function f (Eq. (1)) is set to 2, 10, 30, 100 and 1000;

  • the maximum number of allowed function calls (MNFC) is set to 10,000 and 300,000;

  • the population size (PopSize) of each algorithm, except for RA and NMA, is set to 20, 50, 100 and 500 (by definition, RA, the only non-population-based heuristic used in this study, can only be tested with \(\textit{PopSize} = 1\), and for NMA the PopSize is always tied to D by \(\textit{PopSize} = D+ 1\)).

Hence, for each algorithm \(5\cdot 2\cdot 4 = 40\) experiments are performed (with the exception of RA and NMA, for which we have \(5\cdot 2\cdot 1 = 10\) experiments). Each experiment is repeated 1000 times to achieve a statistically meaningful sample. The locations of the best solution found after random initialization of the population (i.e. after PopSize function calls) and of the best solution found after MNFC function calls are stored. Samples containing 1000 elements are considered large enough to detect the presence of structural bias and to relate its strength to the problem dimensionality, the number of allowed function calls or the population size of the algorithm.
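A compact harness for one such experiment might look like the sketch below (ours, not the study's MATLAB code); the `optimizer` callable and its return values are a hypothetical interface introduced only for illustration:

```python
import numpy as np

def run_bias_experiment(optimizer, dim, pop_size, mnfc, n_runs=1000, seed=0):
    """Run `optimizer` n_runs times on the fully random function of Eq. (1) and
    store, for every run, the best location right after initialization and the
    best location found within the budget of mnfc function calls."""
    rng = np.random.default_rng(seed)
    best_init = np.empty((n_runs, dim))
    best_final = np.empty((n_runs, dim))
    for run in range(n_runs):
        f = lambda x: rng.random()   # fresh uniform value at every call, as in Eq. (1)
        # `optimizer` is a hypothetical interface: it is assumed to return the best
        # point of the initial population and the best point found within mnfc calls
        x_init, x_final = optimizer(f, dim, pop_size, mnfc, rng)
        best_init[run], best_final[run] = x_init, x_final
    return best_init, best_final
```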

To check for the presence and the strength of structural bias, the one-sample and the two-sample Kolmogorov–Smirnov (KS) statistical tests (Stephens 1970; Press et al. 2006), of which only the first has been used in the previous study by Kononova et al. (2015), are applied to each D-dimensional sample of 1000 best locations found in each experiment. As in Kononova et al. (2015), each test is performed independently for each dimension, giving D results (p values) for every sample.

The one-sample KS test verifies whether the empirical distribution function (edf) of the sample of locations of the best solutions found within the MNFC for a particular dimension differs from the expected cumulative distribution function (cdf), which is the uniform distribution in this case. Hence, the results depend on the coordinate system [i.e. the test is not rotation invariant; see Salomon (1996)]. Hypothesis H0 is that the positions of the best individuals in a particular dimension found by the algorithm follow the uniform distribution. The H0 hypothesis is rejected at the \(\alpha = 0.05\) significance level if the p value obtained in the test for a particular dimension is below 0.05. For every algorithm (and case considered), the number of such tests is equal to the problem dimensionality. The discussion is based on the percentages of rejected H0 hypotheses, the lowest (among D) obtained p values and, in the case of selected algorithms, graphical illustrations. It is assumed that the more frequently the H0 hypotheses are rejected, the stronger the structural bias of the particular algorithm. Note that in Kononova et al. (2015), only 30-dimensional variants of f were tested, limiting the conclusions drawn there. For 1000-dimensional variants of f, it is expected that 50 out of 1000 tests (i.e. 5%) reject the true H0, but for the 30-dimensional case only 1.5 out of the 30 tests performed are expected to reject the true H0. This is an important difference when higher rejection rates are considered an effect of the presence of structural bias.
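In practice, the per-dimension one-sample test and the two summary statistics reported later in Tables 2 and 4 can be computed as in the following sketch (ours; SciPy is used here, whereas the study's implementation was in MATLAB):

```python
import numpy as np
from scipy import stats

def ks_rejection_rate(best_locations: np.ndarray, alpha: float = 0.05):
    """best_locations: array of shape (n_runs, D) with the best points of n_runs runs.

    Tests, separately for every dimension, H0: 'the coordinate follows the uniform
    distribution on [0, 1]', and returns the rejection percentage together with
    the lowest p value among the D dimensions."""
    n_runs, D = best_locations.shape
    p_values = np.array([
        stats.kstest(best_locations[:, j], "uniform").pvalue for j in range(D)
    ])
    return 100.0 * np.mean(p_values < alpha), p_values.min()

# example with an unbiased sample: the rejection rate should stay close to 5%
sample = np.random.default_rng(3).random((1000, 30))
print(ks_rejection_rate(sample))
```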

Some readers may suspect that the structural bias could be caused not by the algorithm itself but by the pseudo-random number generator, although Kononova et al. (2015) claimed that this is not the case. All algorithms tested in this study were implemented in MATLAB (or obtained from their inventors; see notes in Table 1), using the Mersenne Twister (Matsumoto and Nishimura 1998) as the random number generator. To verify that the bias is not already present in the randomly initialized solutions (if it were, it would be hard to blame the algorithm), the one-sample KS test is used twice for each experiment, the second time to test the H0 that the edf of the best solutions obtained just after initialization of the algorithm comes from a uniform distribution. The results of the latter are then compared with the results of the one-sample KS test performed on the sets of the best solutions after the algorithm completed the MNFC.

In addition, in this study the two-sample KS test is also performed (again, separately for each dimension at the \(\alpha = 0.05\) significance level). The two-sample KS test compares two empirical distribution functions: the edf of the sample of locations of the best solutions found within the MNFC and the edf of the sample of the best solutions obtained right after initialization of the algorithm. The hypothesis H0 is that both samples come from the same distribution. The results of both the one-sample and two-sample tests, performed for all algorithms, population sizes and problem dimensionalities considered, are discussed in Sect. 4.1.
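The corresponding per-dimension two-sample test (the quantity reported later in Tables 6 and 7) can be sketched analogously (again our illustration, not the study's code):

```python
import numpy as np
from scipy import stats

def ks_two_sample_rejections(best_init: np.ndarray, best_final: np.ndarray, alpha=0.05):
    """Per-dimension two-sample KS test of H0: 'the best locations right after
    initialization and after MNFC function calls come from the same distribution'.
    Both inputs have shape (n_runs, D); the rejection percentage is returned."""
    D = best_init.shape[1]
    p = np.array([stats.ks_2samp(best_init[:, j], best_final[:, j]).pvalue
                  for j in range(D)])
    return 100.0 * np.mean(p < alpha)
```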

The number of allowed function calls may affect the results obtained. In practical applications, MNFC is sometimes related linearly to the problem dimensionality [for example set to 10,000D, where D is the problem dimensionality, as in the CEC2005 problems introduced in Suganthan et al. (2005) or the CEC2014 benchmarks proposed by Liang et al. (2013)] or is fixed, independent of the problem dimensionality [as in the collection of CEC2011 real-world problems introduced in Das and Suganthan (2010)]. Also, in various practical applications the MNFCs differ noticeably. It has already been shown by Pošik et al. (2012) that metaheuristics are mostly useful when the MNFC is relatively large; when it is small, mathematical programming methods are suggested. This is why we conducted our main experiments with MNFC set to 10,000 and 300,000. However, as the present paper also aims to verify how quickly the presence of structural bias may be detected, apart from the experimental settings discussed above, additional tests are performed by running all algorithms 1000 times with PopSize set to 100 (apart from NMA and RA, which require specific values of PopSize) on the 1000-dimensional fully random function (Eq. (1)) for very low numbers of function calls, namely 200, 300, 400, 500, 1000, 2000 and 5000. Runs for each number of function calls are performed independently. The H0 hypothesis that the obtained empirical distributions of the best solutions found within so few function calls follow the uniform distribution is then tested by the one-sample KS test. This allows us to show how quickly the presence of structural bias may be observed for different DE and PSO variants. Note that these additional experiments are limited to specific values of population size and problem dimensionality. The results are discussed in Sect. 4.2.

Finally, to check whether there is any relationship between the possible presence (and strength) of structural bias and the performance of the algorithm on various kinds of problems, each tested heuristic is applied to solve 22 real-world problems from the CEC2011 set (Das and Suganthan 2010) and 50-dimensional versions of 30 artificial benchmarks from the CEC2014 set (Liang et al. 2013). Due to the much longer computational time, for the CEC2011 and CEC2014 problems each algorithm is tested with only one selected population size that seems suitable for its practical application. The number of allowed function calls is set to 150,000 for the CEC2011 problems (as suggested in Das and Suganthan 2010) and to 10,000D for the CEC2014 benchmark problems (as suggested by Liang et al. 2013). The number of runs is reduced to 51 (as suggested in Liang et al. 2013). The statistical significance of pair-wise comparisons among all 14 algorithms is tested by means of the Friedman test with the post hoc Shaffer's static procedure (Shaffer 1986), as suggested in Garcia and Herrera (2008) and Ulas et al. (2012). The MATLAB implementation of Shaffer's procedure, developed by Ulas et al. (2012) and available at www.cmpe.boun.edu.tr/~ulas/m2test/details.php, is applied in this study.
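The omnibus Friedman test itself is available in standard libraries; a minimal sketch with toy numbers is shown below (ours), while Shaffer's static post hoc procedure is taken from the MATLAB implementation of Ulas et al. (2012) referenced above:

```python
import numpy as np
from scipy import stats

# rows: the 22 CEC2011 problems, columns: the 14 algorithms;
# entries: averaged best objective values (toy numbers, for illustration only)
results = np.random.default_rng(7).random((22, 14))

# Friedman's rank-based test of the null hypothesis that all algorithms perform equally
stat, p = stats.friedmanchisquare(*[results[:, a] for a in range(results.shape[1])])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```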

4 Experiments

Discussion of the results is divided into four parts. Section 4.1 discusses the wide-scale search for the presence of structural bias in heuristic algorithms. Section 4.2 shows how quickly the effects of the structural bias may be observed for various algorithms in a case study experiment. Section 4.3 aims at finding the reason for the structural bias in a specific algorithm. Finally, in Sect. 4.4 the results achieved by all considered algorithms on the CEC2011 and 50-dimensional versions of the CEC2014 benchmark problems are discussed in order to determine the relationship between the strength of the structural bias and the algorithm’s practical performance.

Table 2 Percentages of H0 hypotheses rejected by the Kolmogorov–Smirnov test at the \(\alpha = 0.05\) significance level. The H0 hypothesis is that the positions of the best individuals in the final generation, i.e. after 10,000 (upper part of the table) or 300,000 (lower part of the table) function calls, follow the uniform distribution. H0 is tested for each of the D dimensions separately. NMA always has population size equal to \(D + 1\); RA is not population based
Table 3 Percentages of H0 hypotheses rejected by the Kolmogorov–Smirnov test at the \(\alpha = 0.05\) significance level. The H0 hypothesis is that the positions of the best individuals after initialization follow the uniform distribution. As initialization for the tests with 10,000 and 300,000 function calls is performed independently, results for both cases are presented separately. H0 is tested for each of the D dimensions separately. NMA always has population size equal to \(D+1\); RA is not population based
Table 4 Lowest p value obtained from the Kolmogorov–Smirnov test among the D dimensions. The H0 hypothesis, tested for each of the D dimensions separately, is that the positions of the best individuals after the final generation, i.e. after 10,000 (upper part of the table) or 300,000 (lower part of the table) function calls, follow the uniform distribution. NMA always has population size equal to \(D+1\); RA is not population based
Table 5 Lowest p value obtained from the Kolmogorov–Smirnov test among the D dimensions. The H0 hypothesis, tested for each of the D dimensions separately, is that the positions of the best individuals after initialization follow the uniform distribution. As initialization for the tests with 10,000 and 300,000 function calls is performed independently, results for both cases are presented separately. NMA always has population size equal to \(D + 1\); RA is not population based

4.1 The presence of structural bias

The percentage of dimensions for which the presence of structural bias was detected by the one-sample KS test at \(\alpha =0.05\) (i.e. the H0 hypothesis was rejected) is given in Table 2. This number is shown for every combination of algorithm (columns), problem dimensionality D (rows) and population size PopSize (rows). The upper part of Table 2 displays the results obtained when MNFC was set to 10,000, whereas the lower part displays the results when MNFC was set to 300,000. Table 3 contains the respective percentages of rejections of H0 hypotheses for the best solutions found right after random initialization of the population, i.e. before the algorithm proper started to work. Table 4 presents the lowest p values obtained by the KS test among all D dimensions for each considered experimental setting with \(\hbox {MNFC} = 10{,}000\) (upper part) and \(\hbox {MNFC} = 300{,}000\) (lower part). Table 5 shows the lowest p values of the KS tests performed for solutions obtained just after initialization, before the algorithms started. These experiments are carried out to find out whether some bias in the locations of the best solutions is observed:

  • immediately after initialization (the presence of such bias cannot be attributed to the algorithm);

  • after the algorithm was run for a relatively small number of function calls set to 10,000;

  • after the algorithm was run for a number of function calls set to a more frequently used value of 300,000;

  • or such bias is not observed at all.

Note that initialization is performed randomly and independently for each MNFC level. Therefore, the KS test produced different results after initialization for experiments in which algorithms were run with 10,000 versus 300,000 function calls (see Tables 3, 5). To show the varied effects of structural bias on the location of the best solutions found by various algorithms for the fully random function f, edfs of 10-, 100- and 1000-dimensional variants for five algorithms (two PSO, two DE and one GA variant) with PopSize set to 100 are illustrated graphically in Figs. 1–3 and in the Supplementary Material available in the online version of the paper (Suppl. Fig. 1). In addition, four very different histograms of the p values obtained for the 1000-dimensional fully random function by four chosen algorithms (GA-MPC, DNS-PSO, CDE and DE-basic) with PopSize set to 100 are illustrated in Fig. 4.

Tables 2 and 3 show that the percentage of H0 hypotheses rejected after the search had terminated is, for most algorithms, much higher than the expected 5% of the D tests, indicating the presence of structural bias. This is the case for both 10,000 and 300,000 function calls. Hence, the presence of structural bias affects the location of the best solutions found by most PSO and DE variants even after a relatively short run. More details on how quickly the effect of the structural bias may be observed are given in Sect. 4.2. The 14 tested algorithms may be roughly classified into six groups (G1–G6) according to the "strength" of the structural bias observed in Tables 2, 3, 4 and 5; these groups are discussed below.

The DNS-PSO and PSO-init-weight algorithms show a very clear presence of structural bias (G1). For both of these algorithms, the KS test at \(\alpha = 0.05\) (for all coordinates in all experiments, irrespective of the population size, problem dimensionality or maximum number of function calls) always rejects the H0 hypothesis that the empirical distribution of the best solutions found in 1000 runs is uniform. This indicates a very strong structural bias that may be attributed to the algorithm, as the distribution of values obtained during initialization is unbiased. The presence of the extreme structural bias of DNS-PSO may be observed visually: the highly nonlinear edf for DNS-PSO (Fig. 1) and the narrow peak-like shape of the histogram of DNS-PSO p values (Fig. 4) confirm that the solutions are concentrated close to \(x_{j} = 0\) in every dimension j, irrespective of the problem dimensionality.

Fig. 1

Empirical distribution function (edf) of the locations of the best solutions found by the DNS-PSO algorithm with population size set to 100 individuals. The left-hand plots show edfs of the best solutions found after 300,000 function calls (green lines); the right-hand plots show edfs of the best solutions found after random initialization (blue lines). Each green line and each blue line represents the edf for a single axis out of D (D varies from 10 in the upper plots to 1000 in the lower plots) (Color figure online)

Fig. 2

Empirical distribution function (edf) of the locations of the best solutions found after 300,000 function calls by the ALC-PSO (left), SADE (middle) and GA-MPC (right column) algorithms, when the population size is set to 100. Each green line represents the edf for a single axis out of D (D varies from 10 in the upper plots to 1000 in the lower plots). The respective edfs of the best solutions found in the initialized populations, which show no sign of structural bias, are given in the Supplementary Material available online (Suppl. Fig. 1) (Color figure online)

Fig. 3

Empirical distribution function (edf) of the locations of the best solutions found by the DE-basic algorithm with population size set to 100 individuals. The left-hand plots show edfs of the best solutions found after 300,000 function calls (green lines); the right-hand plots show edfs of the best solutions found after random initialization (blue lines). Each green line and each blue line represents the edf for a single axis out of D (D varies from 10 in the upper plots to 1000 in the lower plots) (Color figure online)

Fig. 4

Histogram of p values obtained for four selected algorithms with population size set to 100 individuals and MNFC set to 300,000

In the case of three other algorithms (G2), namely PSO-basic, CLPSO-DEGL and GA-MPC, the H0 rejection rate is also very high. For CLPSO-DEGL and GA-MPC, the rejection rate is predominantly within the 75–100% range, with an average exceeding 90%. For PSO-basic, the rejection rate is slightly lower. Irrespective of the problem dimensionality, in the case of the PSO-basic algorithm the percentage of rejected H0 hypotheses increases with the population size, often reaching 90–100% for the largest population sizes tested (see Table 2). Interestingly, the opposite relation between the population size and the H0 rejection rate is observed for CLPSO-DEGL and GA-MPC. As illustrated in Fig. 2 (right column), in the case of GA-MPC the edf is mildly S-shaped, indicating a less evident structural bias. In contrast to DNS-PSO, the best solutions concentrate in the middle of the search space, and the divergence from the uniform distribution is much weaker. However, the histogram of p values obtained for GA-MPC, given in Fig. 4, shows that the distribution of p values is highly skewed. Importantly, the distribution of the best solutions obtained after initialization can again be considered uniform (see Tables 3, 5).

In the case of four other metaheuristics (ALC-PSO, DEGL, MDE_pBX and SADE), the presence of structural bias is also quite clearly observed (G3). However, although for these algorithms the H0 rejection rates are much higher than 5%, they are not as high as in the case of the five algorithms discussed earlier. Also, the H0 rejection rates vary depending on the problem dimensionality, the population size and MNFC. Except for SADE, which seems to be the most biased within this group, the H0 rejection rates are well below 50% for the vast majority of higher-dimensional experiments. In the case of DEGL and MDE_pBX, the H0 rejection rates are often within the range of 10–30%. Generally, the observed structural bias seems weakest when the population size of SADE, MDE_pBX and ALC-PSO is set to the highest tested value (500 particles or individuals). However, for these four algorithms the relation between the strength of structural bias and the population size does not show clear trends and hence cannot be generalized. For example, for the DEGL algorithm the percentage of rejected H0 hypotheses is often lowest when the population size is highest. In the case of ALC-PSO, the structural bias is strongest for low-dimensional problems. When the population size is lower than 500 and the number of function calls is set to 300,000, the structural bias of ALC-PSO is observed for all dimensions of the 2- and 10-dimensional experiments. Hence, because such a large population size is never used in practice for PSO algorithms, ALC-PSO de facto suffers from strong structural bias. Here again, the distributions of the best solutions obtained after initialization indicate the lack of structural bias.

It is worth noting that for many of the algorithms included in groups G2 and G3, almost the same S-shaped edf of the best solutions found after 300,000 function calls is observed; three chosen examples (ALC-PSO, SADE and GA-MPC) are illustrated in Fig. 2. Again, the edfs of the best solutions obtained after initialization of all these algorithms do not show any signs of structural bias (see Suppl. Fig. 1). As these examples include DE, PSO and GA variants, this may suggest that metaheuristics from various families of methods may too frequently sample locations in the "middle" part of the search space. Interestingly, the shape of the edf does not depend on the chosen dimensionality of f.

For two tested algorithms (G4), CLPSO and CDE, the presence of structural bias may be considered disputable (see also Fig. 4 for an illustrative example of the histogram of p values of CDE). When CLPSO is applied to low-dimensional problems, the presence of structural bias is observed in 100% of cases. However, this number diminishes as the problem dimensionality increases, and for 100- or 1000-dimensional problems the percentage of H0 rejections for CLPSO varies between 7 and 25%. This indicates a relatively weak bias. The strength of the structural bias is similar for 10,000 and 300,000 function calls, and the relationship between the number of H0 rejections and the population size is not straightforward. However, there may be some questions regarding the distribution of solutions obtained after CLPSO's pseudo-random initialization. In the case of the CLPSO algorithm (and also CLPSO-DEGL, which uses the same CLPSO code), the number of rejected H0 hypotheses for each experiment after initialization remains between 5 and 10%. This is higher than the expected 5% and higher than observed for the other algorithms. As indicated in Table 1, the CLPSO source code has been obtained from one of the authors of Liang et al. (2006). We consider this code to be the official version of the CLPSO algorithm and use it "as is", especially since it has also been applied by many other users. For higher-dimensional problems, the number of H0 hypotheses rejected for the distributions of the best solutions obtained after termination of the algorithm is only slightly larger than for the initialized ones. This implies that the structural bias of CLPSO is either non-existent or very weak when the problem dimensionality is high. Hence, the supposed structural bias of CLPSO is observed for low-dimensional problems but becomes barely detectable for high-dimensional ones.

In the case of CDE, the opposite relationship between the presence of structural bias and the problem dimensionality is noted: the CDE algorithm seems to be unbiased for low-dimensional problems. For higher-dimensional problems, the presence of structural bias in CDE depends on the population size: the higher the population size, the weaker the structural bias. When the number of individuals is set to 500 [which, contrary to PSO methods, is not an unreasonable choice for DE algorithms, especially when the problem dimensionality is high; see Storn and Price (1997), Gämperle et al. (2002) and Piotrowski (2016)], no presence of structural bias is noted even for high-dimensional problems. However, the CDE algorithm is biased for lower population sizes. This is illustrated by the histogram of p values for the 1000-dimensional case with \(\textit{PopSize} =100\) and MNFC set to 300,000 (Fig. 4). Nonetheless, the structural bias is relatively "weak" compared with most other methods tested. As a result, CDE is the only tested metaheuristic proposed in the present century that may be considered structurally unbiased, at least when it is used with a large but still reasonable population size.

We can also conclude, based on Tables 2, 3, 4 and 5, that the legendary Nelder–Mead algorithm (NMA) invented in 1965 and the basic DE variant (DE-basic) proposed in 1995 are structurally unbiased (G5). This result is so clear that it does not require lengthy discussion. The edfs of the best solutions found by DE-basic are linear, irrespective of the dimensionality (see Fig. 3). In further support of our findings, we provide a histogram of p values for the DE-basic algorithm for the 1000-dimensional fully random function, with PopSize set to 100 and \(\hbox {MNFC} = 300{,}000\) (see Fig. 4). Although each of the four histograms presented in Fig. 4 clearly differs from the others, the only one that does not point to any bias in the results is the one obtained for the DE-basic algorithm.

At first glance it may seem that RA, another 50-year-old approach, is also structurally unbiased. However, the tests performed on functions like f are rather inconclusive for RA. This is why we put it into a separate group (G6), for which structural bias cannot be reliably measured by the approach discussed. When testing any algorithm on f (Eq. (1)), the expected probability of success (finding a better solution than all those sampled previously in that run) at the nth function call (\(n = 1,{\ldots },\hbox {MNFC}\)) equals \(E\left[ \Pr \left( f(n)<\min (f(n-1),f(n-2),{\ldots },f(1))\right) \right] =\frac{1}{n}\), where \(\Pr \) stands for probability. RA is not a rotation-invariant method; it makes moves along the coordinate axes, in each run starting from the same coordinate system. Only under special circumstances may it rotate its coordinate system during a run, namely after one successful and one unsuccessful step in each direction. The objective function is called for the first time (\(n = 1\)) for the initial solution, hence this initial solution is always successful (\(\Pr = 1\)); the first move along the first coordinate leads to the second function call (\(n = 2\)) with the expected probability of success \(E\left[ \Pr (f(2)<f(1)) \right] =\frac{1}{2}\), and so on. When moves along all D coordinate axes have been tried, the loop restarts. Let us denote the number of loops already completed by nl. When the loop restarts, \(n = \textit{nl}\cdot D+2\) for the move along the first coordinate axis, \(n = \textit{nl}\cdot D+3\) for the move along the second coordinate axis, etc. The later the move along a particular coordinate is executed, the lower the probability that such a move will be successful on f defined by Eq. (1). Also, the more loops completed, the lower the probability that any move within any subsequent loop in the sequence will be successful. To change the location of the best solution generated during initialization along a particular coordinate axis, either the move along that coordinate must be successful (which is rare after the first loop), or the coordinate system must be rotated to allow a move in the new coordinate system to be successful. However, in RA the rotation of coordinates may take place only after one successful and one unsuccessful step in each direction. This, as shown above, is unrealistic for higher-dimensional variants of f. As a result, along many coordinates no successful moves occur in high-dimensional cases and, consequently, no algorithm-induced structural bias can be observed. In the case of low-dimensional problems, successful moves may occur for all coordinates and the presence of structural bias may be (and indeed is) observed. As a result, the experiment with RA seems useful not for detecting structural bias, but for showing some limitations of the concept.
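The \(\frac{1}{n}\) success probability quoted above follows directly from the fact that the sampled function values are independent and identically distributed; a one-line justification (added here for completeness) is:

$$\begin{aligned} \Pr \left( f(n)<\min \left( f(1),\ldots ,f(n-1)\right) \right) =\Pr \left( f(n)=\min \left( f(1),\ldots ,f(n)\right) \right) =\frac{1}{n}, \end{aligned}$$

since \(f(1),\ldots ,f(n)\) are independent draws from the same continuous distribution, each of them is the smallest with equal probability, and ties occur with probability zero.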

The lowest p values presented in Tables 4 and 5 support the above conclusions. Moreover, knowing the lowest p values allows us to spot differences in the strength of the structural bias. Based on the H0 rejection rates at \(\alpha = 0.05\), both DNS-PSO and PSO-init-weight could be considered equally biased. However, comparison of the lowest p values from Tables 4 and 5 reveals that they differ by dozens of orders of magnitude. For example, the lowest p value obtained when PSO-init-weight is run for 300,000 function calls on the 1000-dimensional function f with population size set to 50 (which is a frequent choice for PSO algorithms) equals 1.10e-11, but the respective p value for DNS-PSO is as low as... 5.45e-144. Also, the histograms of p values obtained for the four algorithms (GA-MPC, DNS-PSO, CDE and DE-basic) on the 1000-dimensional fully random function with population size set to 100, shown in Fig. 4, imply large differences between algorithms in terms of the strength of the structural bias.

Table 6 Percentages of H0 hypotheses rejected by the two-sample Kolmogorov–Smirnov test at the \(\alpha = 0.05\) significance level. The H0 hypothesis is that the positions of the best individuals among those generated during initialization from the uniform distribution and the positions of the best individuals obtained after 10,000 function calls by a particular algorithm come from the same distribution. H0 is tested for each of the D dimensions separately. NMA always has population size equal to \(D + 1\); RA is not population based
Table 7 Percentages of H0 hypotheses rejected by the two-sample Kolmogorov–Smirnov test at the \(\alpha = 0.05\) significance level. The H0 hypothesis is that the positions of the best individuals among those generated during initialization from the uniform distribution and the positions of the best individuals obtained after 300,000 function calls by a particular algorithm come from the same distribution. H0 is tested for each of the D dimensions separately. NMA always has population size equal to \(D + 1\); RA is not population based

When the two-sample KS test is used to verify whether the initial and the final edfs of a particular algorithm differ, the results generally confirm the conclusions based on the one-sample KS test (see Tables 6, 7 for the numbers of rejected H0 hypotheses when the maximum number of function calls is set to 10,000 and 300,000, respectively). This is not surprising, as we found that the initial sampling could be considered unbiased, based on the results of testing the hypothesis that the distributions of the best solutions after initialization of each algorithm are uniform (H0 rejection rates are close to 5%). The exceptions observed for CLPSO and CLPSO-DEGL were discussed earlier.

According to the above discussion, structural bias is observed in most PSO and most DE variants tested. However, according to the results given in Tables 2, 3, 4, 5, 6 and 7, the PSO algorithms seem more structurally biased than the DE algorithms. This may be due to the fact that the basic DE version proposed by Storn and Price (1995) does not exhibit any sign of structural bias, whereas the basic PSO version proposed by Eberhart and Kennedy (1995) does. Only one PSO variant (CLPSO) may be considered almost structurally unbiased, at least in some experiments. In general, the number of H0 hypotheses rejected by the KS test is much higher for the PSO algorithms than for the DE algorithms. Nonetheless, apart from DE-basic and (in some cases) CDE, the presence of structural bias is also observed in the DE algorithms.

One may expect that a combination of two approaches, one structurally biased and the other (almost) unbiased, will likely be structurally biased. Although this observation is based on only one example, namely the CLPSO-DEGL hybrid (see Tables 2, 3, 4, 5), it is hard to see how one constituent algorithm could correct the effect of the other's structural bias.

4.2 How quickly structural bias affects the search

The above discussion concluded that the effect of structural bias is observed within 10,000 function calls. Moreover, for most algorithms the strength of the structural bias seems similar after 10,000 and 300,000 function calls, and in the case of a few methods it may even decrease. Because in all cases but CLPSO and CLPSO-DEGL the initial solutions are unbiased, the structural bias has to affect the early stages of the search. Here a natural question arises as to how quickly this bias impacts the distribution of the best solutions found. To address this question, additional tests with very low numbers of function calls (between 200 and 5000) were conducted by running each algorithm with PopSize set to 100 (apart from NMA and RA, which require population sizes of \(D+1\) and 1, respectively) on the 1000-dimensional version of the fully random function f (Eq. (1)). The results are given in Table 8, which also includes the values obtained after initialization and after 10,000 function calls, copied from Tables 2 and 3. Note that, in the case of NMA and RA, some rows in Table 8 are empty. This is because NMA requires 2002 function calls (1001 for initialization and 1001 for the first run) to make a first move, and RA needs at least 1001 function calls (1 for initialization and 1000 to sample along each coordinate) to produce any meaningful results. However, for simplicity, in Table 8 the first available results are given in the same rows as the results obtained after 1000 (RA) and 2000 (NMA) function calls, but are marked with *.

Table 8 Percentages of H0 hypotheses rejected by the Kolmogorov–Smirnov test at the \(\alpha = 0.05\) significance level after very low numbers of function calls. Tests are performed for the 1000-dimensional fully random function, with \(\textit{PopSize} = 100\) for all algorithms except NMA (whose population size always equals \(D+1\)) and RA (whose population size is always 1). The H0 hypothesis is that the positions of the best individuals after the specified number of function calls follow the uniform distribution within the [0, 1] search domain. H0 is tested for each of the D dimensions separately. Tests for each number of function calls are independent, but the results obtained after initialization and after 10,000 function calls are the same as given in Tables 2 and 3 (results for the initialization of other experiments are skipped). NMA requires at least 2002 function calls (1001 for initialization and 1001 for the first run); RA requires at least 1001 calls (1 for initialization and 1000 to sample along each dimension), but for simplicity the respective first NMA and RA results are given in the same rows as the results obtained after 2000 (NMA) or 1000 (RA) function calls

It is very important to note here that the results shown in Table 8 for the various numbers of function calls are mutually independent. Each time, every algorithm was initialized independently. Consequently, the results reported after 300 function calls come from different runs than those reported after 200 calls, and so on. This may lead to occasional bumps in specific values, but it avoids the problem of dependency between the observed values.

We can conclude from Table 8 that the structural bias, if present, affects the distribution of the best solutions very quickly. However, there are clear differences between the PSO and DE algorithms in this respect. For PSO, in the case of three out of the five variants tested, the bias developed already after the first iteration (i.e. one move of each particle). Afterwards, the number of H0 hypotheses rejected remains more or less constant in subsequent generations, although the p values may diminish further, reflecting a "strengthening" of the structural bias. The rapid effect of the structural bias present in DNS-PSO may be observed in Fig. 5. The edf of the best solutions found in the initialized population (hence after 100 function calls) is linear, but just after 200 function calls the edf becomes S-shaped (as discussed earlier, this is a frequent shape observed for many metaheuristics). From 300 function calls onwards, the edf becomes more and more bent, indicating that the majority of the best solutions found by the algorithm are located close to \(\mathbf x = 0\).

Fig. 5

Empirical distribution function (edf) of the locations of the best solutions of the 1000-dimensional fully random function found by the DNS-PSO algorithm with population size set to 100 individuals after various numbers of function calls. Each green line and each blue line represents the edf for a single axis out of D (\(D = 1000\)) (Color figure online)

On the other hand, in the case of the DE algorithms and the GA variant tested, the effect of the structural bias is low after the first few generations. The effect of structural bias then often steadily increases during the early generations and reaches a maximum after a few or several thousand function calls. After that, the strength of the structural bias may even start receding (at least in the case of the DEGL and MDE_pBX variants). The reason for this remains unclear and needs to be addressed in the future by a separate, algorithm-focused study. The slower development of the structural bias in the case of DE variants is illustrated in Fig. 6, using the MDE_pBX algorithm.

Fig. 6

Empirical distribution function (edf) of the locations of the best solutions of the 1000-dimensional fully random function found by the MDE_pBX algorithm with population size set to 100 individuals after various numbers of function calls. Each green line and each blue line represents the edf for a single axis out of D (\(D = 1000\)) (Color figure online)

Fig. 7

Empirical distribution function (edf) of the locations of the best solutions of the 1000-dimensional fully random function found by the DE-basic algorithm with population size set to 100 individuals after various numbers of function calls. Each green line and each blue line represents the edf for a single axis out of D (\(D = 1000\)) (Color figure online)

The results given in Table 8 agree with the findings from Sect. 4.1 that some algorithms (NMA and DE-basic) are structurally unbiased, and a few others may be called almost structurally unbiased (CDE, CLPSO). The lack of change in the shape of the edf of the DE-basic algorithm with the increasing number of function calls is illustrated in Fig. 7, which visually confirms that there is no sign of structural bias in the distribution of the best solutions found after any number of function calls when the classical DE scheme is used.

Fig. 8

Empirical distribution function (edf) of the locations of the best solutions of the 1000-dimensional fully random function found by the DNS-PSO algorithm and by a version of DNS-PSO with the neighbourhood strategies skipped, each with population size set to 100 individuals. Results obtained after 200, 1000 and 10,000 function calls (runs are independent) are shown. Each green line represents the edf for a single axis out of D (\(D = 1000\)) (Color figure online)

4.3 Can we identify reasons for the structural bias?

Searching for the parts of each tested algorithm that cause structural bias must unfortunately be considered beyond the scope of the present study. Such research probably needs to be performed for each specific approach separately, although we cannot exclude the possibility that the reasons for, for example, the frequently observed S-shaped edf may be similar for various metaheuristics. However, as an example for future studies, it may be useful to investigate the reasons for the presence of the extremely "strong" structural bias in the DNS-PSO algorithm, which tends to locate the best solutions close to \(\mathbf x = 0\). In DNS-PSO (Wang et al. 2013), local and global neighbourhood search strategies are carried out, both aimed at creating additional temporary particles \(\mathbf{LX}_{i}\) and \(\mathbf{GX}_{i}\) for which the function value is sampled. These particles are generated as \(\mathbf{LX}_i =r_1 \cdot \mathbf{X}_i +r_2 \cdot \mathbf{pbest}_i +r_3 \cdot \left( {\mathbf{X}_c -\mathbf{X}_d } \right) \) and \(\mathbf{GX}_i =r_4 \cdot \mathbf{X}_i +r_5 \cdot \mathbf{pbest}_i +r_6 \cdot \left( {\mathbf{X}_e -\mathbf{X}_f } \right) \), where \(\mathbf{X}_{i}\) is the current position vector of the ith particle, \(\mathbf{pbest}_{i}\) is the best position sampled so far by particle \(\mathbf{X}_{i}\), \(\mathbf{X}_{c}\) and \(\mathbf{X}_{d}\) are the positions of two different particles chosen randomly from those located in the neighbourhood of \(\mathbf{X}_{i}\) (for a definition of the neighbourhood used we refer the reader to Wang et al. 2013), \(\mathbf{X}_{e}\) and \(\mathbf{X}_{f}\) are the positions of two different particles chosen randomly from the entire swarm, and, crucially, \(r_{1}\), \(r_{2}\), \(r_{3}\), \(r_{4}\), \(r_{5}\) and \(r_{6}\) are random numbers from the [0, 1] interval meeting the following constraints: \(r_1 +r_2 +r_3 =1\) and \(r_4 +r_5 +r_6 =1\). One may note that for the given domain \(S=[0,1]^{D}\) we have \(X_{ij}>0\) and \(\textit{pbest}_{ij}>0\) for every particle i and every dimension j, whereas \(-1\le X_{cj} -X_{dj} \le 1\) and \(-1\le X_{ej} -X_{fj} \le 1\); hence, the third terms in the equations for generating \(\mathbf{LX}_{i}\) and \(\mathbf{GX}_{i}\) may be negative. This leads to a decrease in the position values of the \(\mathbf{LX}_{i}\) and \(\mathbf{GX}_{i}\) particles and causes the bias. In fact, if no boundary constraints were applied, elements of \(\mathbf{LX}_{i}\) and \(\mathbf{GX}_{i}\) could even become negative. Figure 8 compares the edfs of DNS-PSO and of DNS-PSO with the neighbourhood search excluded (i.e. when \(\mathbf{LX}_{i}\) and \(\mathbf{GX}_{i}\) are skipped) for selected cases with \(\textit{PopSize} = 100\) and MNFC set to 200, 1000 and 10,000. One can conclude from Fig. 8 that the neighbourhood search is indeed the part of DNS-PSO to which the structural bias can mostly be attributed.
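The pull towards \(\mathbf x = 0\) can also be illustrated numerically with a short sketch (ours, not the authors' code): because \(r_1+r_2+r_3=1\) and the difference term has zero mean, a temporary particle is, on average, a contraction of the non-negative coordinates of \(\mathbf{X}_i\) and \(\mathbf{pbest}_i\). In the sketch the \(r\) triples are drawn from a Dirichlet distribution (one of several ways to satisfy the sum-to-one constraint; the original paper's exact sampling scheme may differ), and the difference partners are taken from the whole swarm rather than from a neighbourhood.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, pop, trials = 1000, 100, 200
ratios = []

for _ in range(trials):
    # positions and personal bests as right after uniform initialization in [0, 1]^dim
    X = rng.random((pop, dim))
    pbest = rng.random((pop, dim))

    # r1 + r2 + r3 = 1 for every particle (Dirichlet draw; an assumed sampling scheme)
    r = rng.dirichlet(np.ones(3), size=pop)
    r1, r2, r3 = r[:, 0:1], r[:, 1:2], r[:, 2:3]

    # difference partners drawn from the whole swarm (they may occasionally coincide,
    # which only weakens the effect illustrated here)
    c = rng.integers(0, pop, size=pop)
    d = rng.integers(0, pop, size=pop)

    LX = r1 * X + r2 * pbest + r3 * (X[c] - X[d])
    LX = np.abs(LX)                  # reflection at the lower bound 0
    ratios.append(LX.mean() / X.mean())

print(f"mean(LX)/mean(X) ~= {np.mean(ratios):.3f}")   # clearly below 1: pull towards x = 0
```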

Fig. 9 Empirical distribution function (edf) of the localization of the best solutions of the 1000-dimensional fully random function found by the ALC-PSO algorithm with the population size set to 100 individuals after various numbers of function calls. Each green line and each blue line represent the edf for a single axis out of D (\(D = 1000\)) (Color figure online)

However, we doubt that the causes of structural bias can be identified so easily in every case. For example, the results presented in Table 8 reveal some strange behaviour of the ALC-PSO algorithm during the first few hundred function calls. The percentage of H0 hypotheses rejected (see Table 8) increased from 5.2 after initialization to 68.1 after 200 function calls, then fell back to 7.9 after 300 function calls, and afterwards slowly increased, similarly to the DE algorithms. To verify that such a result is not accidental, all experiments with 200 and 300 function calls by ALC-PSO were repeated five times with identical parameter settings, and a similar strange behaviour was observed each time. This shows that some algorithms, under specific circumstances, may exhibit a temporary tendency to concentrate solutions in some parts of the search space. The observation can also be confirmed visually by inspecting the changes in the edf during the first thousands of function calls (Fig. 9). However, we failed to find any reason for this strange behaviour of ALC-PSO, which may serve as a warning that explaining the structural bias of a given algorithm, as well as its practical impact on the results, may not always be as simple as in the case of DNS-PSO.
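For reference, the percentage of rejected hypotheses reported in Table 8 can be computed along the following lines. The sketch assumes, purely for illustration, a one-sample Kolmogorov-Smirnov test of per-axis uniformity; this is not necessarily the test actually used to produce Table 8.

import numpy as np
from scipy import stats

def percent_rejected(best, alpha=0.05):
    # best: array of shape (n_runs, D) with the best-solution locations from
    # independent runs; H0 for axis j: the j-th coordinates follow uniform[0,1].
    n_runs, dim = best.shape
    rejected = sum(stats.kstest(best[:, j], "uniform").pvalue < alpha
                   for j in range(dim))
    return 100.0 * rejected / dim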

4.4 Strength of structural bias and algorithms’ performance in the case of various benchmark problems

To verify whether the presence of structural bias affects the performance of metaheuristics, all algorithms summarized in Table 1 are applied to solve 22 real-world problems from the CEC2011 set (Das and Suganthan 2010) and 50-dimensional variants of 30 artificial benchmark problems from the CEC2014 set (Liang et al. 2013). The averages of the best objective function values obtained within 150,000 function calls in the case of CEC2011 (the value specified in Das and Suganthan 2010) or within 500,000 function calls in the case of CEC2014 [equal to 10,000D, as specified in Liang et al. (2013)] are given in Tables 9 and 10, whereas the respective standard deviations are shown in Supplementary Tables 1 and 2 (available as Supplementary Material in the online version of the paper). The last rows of Tables 9 and 10 present the ranking of each algorithm averaged over all 22 CEC2011 problems (Table 9) or all 30 CEC2014 problems (Table 10). The ranking was obtained as follows. First, for each problem separately, the algorithms were ranked from best (rank 1) to worst (rank 14) according to the objective function values averaged over 30 runs. Then, the ranks achieved by each algorithm were averaged over all 22 (in the case of CEC2011) or all 30 (in the case of CEC2014) problems, giving the values the reader may find in Tables 9 and 10; a sketch of this procedure is given below.
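The ranking procedure can be summarized in a few lines of code; the sketch below is illustrative rather than the script actually used.

import numpy as np
from scipy import stats

def average_ranks(mean_values):
    # mean_values: array of shape (n_problems, n_algorithms) holding the 30-run
    # mean objective values (lower is better). rankdata gives rank 1 to the
    # smallest value in each row; ties receive averaged ranks.
    ranks = np.apply_along_axis(stats.rankdata, 1, mean_values)
    return ranks.mean(axis=0)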

Table 9 The 30-run averaged fitness function values obtained for CEC2011 real-world problems
Table 10 The 30-run averaged fitness function values obtained for 50-dimensional variants of CEC2014 benchmark problems

One should point out here that the code of GA-MPC, the winner of the CEC2011 competition, was downloaded from the conference web page (http://www.ntu.edu.sg/home/epnsugan). For the tests on the CEC2011 and CEC2014 problems, neither the population size nor the values of other control parameters were changed in this study, with one exception described below. For a few CEC2011 problems (codes implemented in MATLAB are also available at http://www.ntu.edu.sg/home/epnsugan), one may obtain “not a number” as the objective function value even if the solution is within the specified domain. In such a case, the objective function value was set to 0 in the original GA-MPC code. In the present application this has been modified and, as the CEC2011 problems aim at minimization, a very large number (\(10^{100}\)) has been used instead. This may have some impact on the results, as, for instance, in the case of problem F5 from CEC2011 the initially generated sample often contains positive, negative and “not a number” objective function values. Although the results obtained for GA-MPC with this code differ from those published in Elsayed et al. (2011) and from the “modified” values available on the CEC2011 web page, this should not be surprising: in the GA-MPC code, the authors of Elsayed et al. (2011) use a seed taken from the clock, and therefore different results are expected when the computations are repeated; this is also discussed on the CEC2011 homepage (http://www.ntu.edu.sg/home/epnsugan).
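The modification amounts to a simple guard on the objective function value; the following sketch illustrates the idea (the helper name is hypothetical).

import math

def safe_objective(raw_value, penalty=1e100):
    # Replace an undefined ("not a number") objective value with a very large
    # penalty so that, under minimization, such a solution is never preferred.
    return penalty if math.isnan(raw_value) else raw_value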

The rankings of the algorithms on the CEC2011 problems presented in Table 9 indicate that the performance of CDE and CLPSO, the two modern metaheuristics for which the evidence of structural bias turned out to be inconclusive, is moderate. Their averaged rankings are poorer than those of three structurally biased methods, namely SADE, GA-MPC and MDE_pBX, but the differences in rankings among these five best algorithms are small and (see Supplementary Table 3) statistically insignificant at the \(\alpha =0.05\) level. On the other hand, two structurally unbiased methods, the basic version of the DE algorithm (DE-basic) and the historical NMA algorithm, turned out to be among the four poorest optimizers. The two most biased algorithms (PSO-init-weight and DNS-PSO) are in the middle of the pack. According to the statistical tests performed on the results obtained for the CEC2011 problems (Supplementary Table 3; see also the discussion at the end of Sect. 3), the differences between most of the tested algorithms are not significant at the \(\alpha = 0.05\) level, and one cannot find any direct link between the presence of an algorithm's structural bias and its performance. Interestingly, the differences between the structurally unbiased NMA and DE-basic on the one hand, and the four best methods (SADE, GA-MPC, MDE_pBX and CDE, the three best of which are structurally biased) on the other, are statistically significant.

The ranking of the algorithms on the 50-dimensional artificial benchmark problems from CEC2014 looks similar to the one obtained for the CEC2011 real-world problems (see Table 10). SADE, GA-MPC and MDE_pBX (all structurally biased) remain the best three approaches, while the structurally (almost) unbiased CLPSO and CDE came 4th and 7th, respectively. There is, however, one major difference: the structurally unbiased, historical NMA turned out to be quite highly ranked, and the difference between NMA and the best-performing SADE is not statistically significant at the \(\alpha =0.05\) level (see Supplementary Table 4). DE-basic remains among the two poorest methods. Although, again, no clear link between the strength of structural bias and algorithm performance is observed, three out of the four supposedly unbiased methods perform quite well.

According to the above results, the practical implications of the structural bias of metaheuristics remain questionable. It appears that the effect of the fitness function landscape is much stronger than the effect of structural bias. However, the presence of structural bias should not be ignored. Firstly, we cannot exclude the possibility that structural bias indeed hampers the search, preventing algorithms from achieving better results. Would the variants of SADE, GA-MPC and MDE_pBX perform better if the part responsible for the structural bias were detected and removed? Secondly, the algorithms that show the strongest indications of structural bias perform rather poorly on the CEC2011 and CEC2014 problems. Thirdly, three out of the four methods that are almost structurally unbiased (CLPSO, CDE and NMA) show quite good performance: CLPSO is among the best PSO variants; CDE is ranked in the middle of the DE-based ones; NMA remains competitive against optimizers 50 years its junior on the artificial benchmark problems from CEC2014 and at least on some real-world problems from CEC2011 (indeed, with some surprise, we must acknowledge that NMA turned out to be the best approach for two CEC2011 problems, see Table 9, although it performed poorly on the majority of the others). Hence, one can conclude that although the presence of a “moderate” structural bias does not preclude potentially good performance of the optimizer, methods that are unbiased may perform better than expected (like NMA, or also the relatively old CLPSO).

In this paper, we did not intend to provide a performance comparison between PSO and DE approaches. The number of variants in both families of methods is too large, and many classical as well as very recent ones were not considered in this study. We believe that no strong opinions regarding the overall performance of PSO or DE algorithms can be formulated based on this research. The sole purpose of this section was to verify whether there is any relationship between the presence of structural bias and algorithm performance.

5 Conclusions

The goal of this paper was to empirically verify the presence of structural bias in 11 particle swarm optimization (PSO) and differential evolution (DE) algorithms. For comparison, tests were also performed on three other optimizers: a genetic algorithm, GA-MPC (Elsayed et al. 2011), the winner of the CEC2011 competition, and two historical direct search methods, the Nelder and Mead (1965) and Rosenbrock (1960) algorithms.

In the vast majority of the tested PSO and DE variants, the presence of structural bias was detected. In addition, GA-MPC was found to be structurally biased. However, some PSO (CLPSO; Liang et al. 2006) and DE (CDE; Cai et al. 2011) algorithms turned out to be almost structurally unbiased, at least when they were applied with an appropriate population size. The historical Nelder and Mead (1965) algorithm and the basic DE variant proposed by Storn and Price (1995) are the only methods among the 14 algorithms tested that showed no sign of structural bias. We also found that testing some optimization methods (such as Rosenbrock's 1960 algorithm) on a fully random function, as proposed in Kononova et al. (2015), may not allow us to draw any conclusions regarding the presence of structural bias. In general, the PSO algorithms seem to be more structurally biased than the DE algorithms, probably because the initial version of PSO (Eberhart and Kennedy 1995) is structurally biased, contrary to the initial version of DE (Storn and Price 1995).

The impact of the population size, problem dimensionality and the number of allowed function calls on the strength of structural bias depends on the particular algorithm, and no general conclusion may be drawn. The ALC-PSO method (Chen et al. 2013) can serve as a good illustration of this fact: the strength of its structural bias depends on the population size and on the length of the run, but not monotonically. When only 10,000 function calls are allowed, the larger the population size, the stronger the structural bias of ALC-PSO; but when the number of function calls is extended to 300,000, the opposite relation is found. Various algorithm-specific control parameters may affect the presence and the strength of structural bias as well. In the paper, we pointed out how to (almost) remove the structural bias from the most biased algorithm [DNS-PSO (Wang et al. 2013)].

The presence of structural bias affects the distribution of the best solutions very quickly, often within the first few generations in the case of PSO, and during the first dozens of generations in the case of DE algorithms. What is quite strange and not understood at this point is the following finding: in the case of some DE algorithms, the effect of structural bias may start receding after a large enough number of function calls, although it does not disappear completely.

To find the link between the presence of structural bias and algorithms' performance, all 14 methods were tested on the 22 real-world problems proposed for the CEC2011 competition and on 50-dimensional versions of the 30 artificial benchmarks proposed for the CEC2014 competition. No firm relationship was found. The best methods were those with a moderate structural bias. The performance of the two PSO and DE variants that are (almost) structurally unbiased (CLPSO and CDE) turned out to be slightly above the middle of the pack. The structurally unbiased basic DE variant performed poorly. However, the unbiased 50-year-old Nelder–Mead algorithm, although it performed rather poorly on most CEC2011 real-world problems, outperformed all 13 other optimizers on two of the CEC2011 problems and was highly ranked on the CEC2014 artificial benchmark problems. One may speculate that the impact of structural bias on the search is relatively low compared to the impact of the fitness landscape, although this may largely depend on the specific problem to be solved. Determining to what extent structural bias affects the search on various types of fitness landscapes may be an important topic for future studies. Nonetheless, we highly recommend verifying the presence of structural bias in existing and newly proposed algorithms and removing it, if possible.