1 Introduction

Clustering is one of the most relevant areas in data mining and machine learning (Larose 2005; Witten and Frank 2005). Clustering techniques extract patterns from data without supervision, an approach referred to as unsupervised learning. Using clustering techniques, data analysts are able to extract information from datasets without human or expert supervision. Clustering is designed to group data by similarity: the aim is to minimize the value of a pre-defined cost function by assigning data instances to different groups (clusters) and optimizing this assignment in order to obtain the lowest value of the cost function.

Several areas have dealt with clustering problems. One of the most relevant is statistics, where well-known clustering algorithms have been proposed, such as K-means, expectation maximization (EM), hierarchical, spectral and fuzzy clustering, among others. Over the last few years, bio-inspired algorithms have received increasing attention. The strength of swarm intelligence and evolutionary algorithms in optimization has made them promising techniques for clustering. This paper explores this potential, specifically focusing on ant colony optimization (ACO; Dorigo and Stützle 2004).

The proposed algorithms address the main problems with centroid-based approaches: the need to know the features of the search space in order to determine the central point of each cluster, and the sensitivity to noise. Centroid-based clustering algorithms use a multi-dimensional space to represent the data based on their features, in order to find the centroid (central point) position of each cluster. A distance metric (in most cases Euclidean) is used to place a centroid and optimize its position according to the distance between the centroid and the data. As a centroid position is determined by averaging the coordinate values of the data in each cluster, this process does not cope well with outliers. Centroid-based clustering algorithms work well when the data can be represented by features in a multi-dimensional space, e.g. clustering of houses based on features such as price, square metres, number of bedrooms/bathrooms and distance to public transportation. However, they are not appropriate in cases where the features of the data are not clear, e.g. clustering of face images: while it is straightforward to calculate the similarity of images, it is not easy to define features to represent them in a multi-dimensional space.

Medoid-based clustering algorithms are usually more robust to noise, and data instances do not need to be represented in a multi-dimensional space. They use a notion of similarity/distance among the data instances, which can be obtained as the Gram matrix of a kernel or from a distance measure, and they choose data instances to define the cluster centres; the selected instances are called medoids.

This paper proposes two medoid-based ACO clustering algorithms, where the only information needed is the distance among data: one algorithm that uses an ACO procedure to determine an optimal medoid set (METACOC algorithm) and another algorithm that additionally uses an automatic selection of the number of clusters (METACOC-K algorithm). These algorithms use a graph-based structure and a search strategy that requires no knowledge about the search space features. As aforementioned, this strategy is different from classical centroid-based approaches, where the position of the centroid is optimized in order to define the different clusters. In order to evaluate the performance of the proposed algorithms, we have compared them against the ACO-based ACOC algorithm (Kao and Cheng 2006) using synthetic and real-world datasets, and also against five well-known clustering algorithms: K-means (MacQueen 1967), partition around medoids (PAM; Kaufman and Rousseeuw 1987), PAMK (Kaufman and Rousseeuw 2009), EMBIC (Fraley and Raftery 2007) and Clues (Wang et al. 2007).

The remainder of this paper is organized as follows. Section 2 presents related work, discussing the clustering problem and previous ACO algorithms for clustering. Section 3 introduces the proposed algorithms. Computational experiments and analysis of the obtained results are presented in Sect. 4. Finally, Sect. 5 presents conclusions and future work.

2 Related work

Data mining and machine learning techniques have been used in several applications. One of the most prominent application areas is the identification of patterns in data, which helps data analysts to extract hidden information from data (Larose 2005). Recent data analysis demands have presented new challenges for machine learning techniques (Cao 2010); for example, the need to create new scalable and robust methodologies is currently receiving increasing interest. In order to improve the robustness of these analyses, new methodologies based on swarm intelligence have shown promise due to the quality of their results, which are highly competitive when compared with classical algorithms.

One of the most successful swarm intelligence techniques is ACO (Dorigo and Stützle 2004). ACO algorithms are based on some aspects of the foraging behaviour of ants that collectively can find the shortest path from the nest to a food source. The use of ACO has been extended to several optimization areas, including machine learning. This section provides a general description of the clustering problem—including a discussion of issues about the K-adaptive problem within clustering—and it discusses ACO applications in clustering and the related classification task.

2.1 The clustering problem

Clustering has been widely used in several interdisciplinary areas, such as image segmentation (Menéndez et al. 2014) and sport prediction (Menéndez et al. 2013), among others. Given a dataset \(X = \{x_1,x_2,\ldots ,x_n\}\), the aim of clustering is to group data instances in different clusters, in such a way that similar data instances fall into the same cluster. Let \(C = \{c_1,\ldots ,c_k\}\) be the set of clusters, where k is the number of clusters and \(c_i\) is a cluster. The goal is to generate a function that assigns each data instance to a cluster so that a cost function J is minimized; the classical cost function is based on the Euclidean distance and the squared norm. J is minimized by selecting the best cluster (\(c_j\) out of the k different clusters) for each data instance \(x_i\). The cost function is given by

$$\begin{aligned} J = \sum _{i=1}^n \min _{j=1}^k ||x_i-c_j||^2. \end{aligned}$$
(1)

The search for an optimal clustering has usually been implemented as an iterative procedure that (1) updates each cluster according to the data associated with it and (2) updates the data associated with each cluster based on the cluster centroid position (i.e. the average point across all the points in the cluster) in the space. This is the main idea behind the best-known clustering algorithm: K-means (MacQueen 1967). This algorithm represents the clusters as a set of centroids and optimizes their positions according to the cost function using the iterative process described above.
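As a concrete illustration of this two-step iteration, the following is a minimal NumPy sketch; the initialization scheme, convergence test and variable names are our own choices, not a specification of the original algorithm.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the instances assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids
```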

Several statistical techniques have also been applied to clustering problems, such as EM (Dempster et al. 1977). This approach uses the likelihood of the cluster selection to guide the search, and it can apply different statistical estimators depending on the problem. The most frequent estimator for EM is a Gaussian mixture model, where the user defines one Gaussian distribution per cluster and the process optimizes the mean and variance of each distribution in order to generate a good clustering while reducing a cost function.

Statistical techniques usually use a search space representation, where the parameters of the estimator are optimized; they are known as parametric techniques. However, there are other important approaches that do not use parameters or estimators. These are named nonparametric techniques (Menéndez et al. 2014), and one of the most relevant approaches in this domain is based on medoids (Kaufman and Rousseeuw 1987). Medoids can be defined as a group of relevant instances of a specific dataset, which can be considered representatives of the clusters. In the medoid-based approach, the set of k clusters can be defined as

$$\begin{aligned} C = \{m_1,\ldots ,m_k \ | \ m_i \in X\}, \end{aligned}$$
(2)

where \(m_i\) represents a medoid selected out of the set X of data instances. In these approaches, the search is focused on the data instances instead of the whole search space. However, a topology among the data must be generated using a similarity/dissimilarity metric. One of the main techniques, called PAM (Kaufman and Rousseeuw 1987), generates a graph topology through a dissimilarity matrix. This matrix contains the pairwise cost metric between data instances, and the algorithm tries to minimize the cost function J (i.e. the differences between data instances and their medoids) with respect to the medoid selection.
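To make the medoid formulation concrete, the sketch below (function names are illustrative) evaluates the cost of a candidate medoid set directly from a precomputed dissimilarity matrix, which is the quantity PAM-style algorithms minimize.

```python
import numpy as np

def medoid_cost(D, medoids):
    """Cost J of a medoid set. D is an (n, n) pairwise dissimilarity matrix and
    medoids a list of instance indices; each instance contributes the
    dissimilarity to its closest medoid."""
    return D[:, medoids].min(axis=1).sum()

def assign_to_medoids(D, medoids):
    """Deterministic cluster assignment: the index of the closest medoid per instance."""
    return np.asarray(medoids)[D[:, medoids].argmin(axis=1)]
```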

The main advantages of medoids when compared with centroids are:

  • Centroids are determined by averaging the coordinate values of the data in each cluster, while medoids are representative members of the data: centroids are not suitable when the average cannot be defined (e.g. clustering of face images, time series or gene expression data);

  • Centroids are more sensitive to outliers: an instance that is far away from the rest of the cluster produces an important modification in the centroid position. This does not happen with medoids, because a medoid is an actual instance of the dataset.

Figure 1 illustrates the problem of outliers in centroid-based clustering: it shows how the K-means cluster assignment is affected by an outlier, while PAM keeps the optimal solution even in the presence of the outlier. Since medoid-based algorithms use the information extracted from the data distances, they are a good choice for problems where the search space is not well defined, such as time series clustering.

Fig. 1 Results of PAM and K-means after the introduction of an outlier into three Gaussian distributions: PAM keeps its correct solution, while K-means is diverted by the outlier, creating a cluster with a single data point. The lines connecting the different clusters illustrate their distance
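The outlier effect described above is easy to reproduce numerically; in this small illustrative example, a single outlier drags the mean far away from the bulk of the data, while the medoid remains a representative member.

```python
import numpy as np

x = np.array([0.8, 0.9, 1.0, 1.05, 1.1, 1.2, 25.0])  # one cluster plus an outlier

print(x.mean())                        # ~4.44: the centroid is pulled towards the outlier

D = np.abs(x[:, None] - x[None, :])    # pairwise distances
medoid = x[D.sum(axis=1).argmin()]     # instance minimizing the total distance to the rest
print(medoid)                          # 1.05: the medoid stays inside the cluster
```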

One of the main challenges around the clustering problem is how to choose a good number of clusters (Tibshirani et al. 2001). The majority of clustering algorithms require the specification of the number of clusters a priori as a parameter of the algorithm. An alternative to having the number of clusters fixed is based on the use of a metric to evaluate the clusters’ quality, allowing an algorithm to test a variable number of clusters. The most relevant metric used in the literature is the silhouette (Rousseeuw 1987; see Sect. 3.2). This metric represents a balance between the number of clusters and the cluster separation, which can be used to evaluate the trade-off between the number of clusters and their dissimilarity. Different algorithms have been proposed to optimize the silhouette measure. The most relevant are PAMK (Kaufman and Rousseeuw 2009; an extended version of PAM allowing the number of clusters to vary) and Clues (Wang et al. 2007; an iterative algorithm focused on the silhouette optimization).

2.2 Ant colony optimization in clustering

ACO has already been applied to clustering (Jafar and Sivakumar 2010) and classification (Martens et al. 2011). The advantage of applying ACO algorithms to these problems is that ACO performs a global search in the solution space, which is less likely to get trapped in local minima and, thus, has the potential to find more accurate solutions.

The most popular bio-inspired approaches to the clustering problem are based on evolutionary algorithms (Menéndez et al. 2014); Hruschka et al. (2009) present a survey of clustering algorithms based on different evolutionary approaches. In the context of ant-based approaches, researchers have explored mainly two different strategies. The first focuses on the cooperative self-organization characteristics of ant algorithms. Handl et al. (2006) present an adaptive clustering algorithm, called ATTA, based on the corpse-clustering behaviour of ants. An interesting aspect of ATTA is its ability to adapt the total number of clusters k during the search, although at the same time this can be viewed as a limitation, since the algorithm does not allow the specification of k for problems where the number of clusters is known a priori. More examples can be found in Fernandes et al. (2008) and Herrmann and Ultsch (2008). These approaches can also be characterized by the way data are manipulated by ants: they can be based on a grid, where ants move data to define the clusters mimicking a behaviour observed in nature (e.g. the way ants move their brood or their waste), or based on the association of each data instance with an ant (Hamdi et al. 2010). The second strategy involves the use of an ACO procedure, where the clustering problem is modelled as an optimization problem and pheromone is used to guide the search towards better solutions. Kao and Cheng (2006) designed a centroid-based ACO clustering algorithm, where ants assign each data instance to one of the available clusters and cluster centroids are adjusted based on this assignment. França et al. (2008) introduce a bi-clustering algorithm. Ashok and Messinger (2012) focused their work on graph-based clustering of spectral imagery, where the data are represented as a graph and an ACO procedure is used to find long paths through the data. Several other approaches are discussed in Jafar and Sivakumar (2010).

3 Medoid-based ACO clustering algorithms

This section presents the proposed medoid-based ACO clustering algorithms. Both algorithms employ an ACO procedure to select an optimal medoid set that determines the clusters. The first algorithm, called MEdoid seT ACO Clustering algorithm (METACOC), is similar to the PAM algorithm: its goal is to choose the best k medoids (data instances) based only on distance information, where k is the pre-defined number of clusters. The second algorithm, called K-adaptive MEdoid seT ACO Clustering algorithm (METACOC-K), is an extension of METACOC that automatically adjusts the number of clusters, useful for problems where the number of clusters is not known a priori.

3.1 METACOC: a medoid set ACO clustering algorithm

The METACOC algorithm is based on several ants looking for the best path in the construction graph, which is composed of all data instances. Solutions are generated by choosing medoids (data instances) and assigning the remaining data instances to them deterministically, according to their distance to the selected medoids. The medoid selection is illustrated in Fig. 2. The rationale is that once the medoids are determined, there is a deterministic optimal cluster allocation based on the similarity/dissimilarity values.

Each ant (a) has the following features:

  • a list of visited data instances (\(tb_a\));

  • a set of chosen medoids \(M_a\), which is initially empty.

Ants have two possible search strategies: exploitation and exploration. In each iteration, an ant chooses the medoid decision j for the current data instance according to the pseudo-random proportional rule (Dorigo and Gambardella 1997)

$$\begin{aligned} j = \left\{ \begin{array}{ll} \hbox {argmax}_{u \in \{y,n\}} \{ \tau (i,u) \}&{} \hbox {if} \quad q \le q_0\\ S&{}\hbox {otherwise}\\ \end{array} \right. , \end{aligned}$$
(3)

where \(\{y,n\}\) are the two possibilities (to be or not to be a medoid) for data instance i (see Fig. 2), \(\tau (i,u)\) is the pheromone value between i (the data instance) and u (the decision “yes” or “no” to become a medoid), \(q_0\) is the user-defined exploitation probability, q is a random number uniformly distributed in [0, 1] used for the strategy selection and S is the ACO-based exploration strategy, which is defined by

$$\begin{aligned} S = P(i,u) = \frac{\tau (i,u)}{\sum _{l \in \{y,n\}}\tau (i,l)}, \end{aligned}$$
(4)

where \(P(i,u)\) is the probability that data instance i is selected as a medoid or not, with \(u \in \{y,n\}\). Note that METACOC does not use heuristic information to select a medoid. While the number of selected medoids m is less than k, where k is the pre-defined number of clusters, any data instance can be selected as a new medoid and the pheromone values are used to decide whether a data instance becomes a medoid or not. When the maximum number of medoids is reached, the selection process stops and the remaining data instances are set to not be medoids.
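In code, the rule of Eqs. 3 and 4 reduces to a per-instance binary choice driven by the two pheromone values; a minimal sketch (names are ours) could look as follows.

```python
import numpy as np

def medoid_decision(tau_yes, tau_no, q0, rng):
    """Pseudo-random proportional rule for one data instance (Eqs. 3-4).
    tau_yes and tau_no are the pheromone values tau(i, y) and tau(i, n)."""
    if rng.random() <= q0:
        return tau_yes >= tau_no             # exploitation: highest pheromone wins
    p_yes = tau_yes / (tau_yes + tau_no)     # exploration: sample proportionally (Eq. 4)
    return rng.random() < p_yes

# Example: rng = np.random.default_rng(0); medoid_decision(0.75, 0.72, 0.0001, rng)
```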

Fig. 2 An ant travelling through the construction graph. The pheromone values are stored in the edges: the order of visiting the data instances is random, and the pheromone values represent the desirability of considering an instance x as a medoid (\(\tau (x,y)\) value) or not (\(\tau (x,n)\) value)

The METACOC algorithm can be described as follows:

  1. Initialize the pheromone matrix \(\tau _0\).

  2. Initialize each ant a: set the chosen medoids \(M_a = \emptyset \) and the visited data instances \(tb_a = \emptyset \).

  3. For each ant, check if all instances have been visited (\(|tb_a| = n\)) or all medoids have been chosen (\(|M_a| = k\)). If not:

    (a) select the next data instance i;

    (b) choose a search strategy;

    (c) if i is selected as a medoid, add it to \(M_a\);

    (d) add i to the list of visited data instances \(tb_a\).

  4. Assign each data instance to its closest medoid and calculate the objective function value for each ant a:

    $$\begin{aligned} J^a = \sum _{i=1}^n \min _{j=1}^{|M_a|} d(x_i,m_j^a), \end{aligned}$$
    (5)

    where \(x_i\) represents a data instance and \(m_j^a\) represents a medoid in \(M_a\).

  5. Choose the best solution:

    (a) rank the ants' solutions;

    (b) if an ant has fewer medoids than k, it is eliminated from the ranking;

    (c) choose the best ant \(a^*\) (iteration-best solution);

    (d) compare \(a^*\) with the best-so-far solution \(a^{**}\) and keep the better of the two as the new best-so-far solution.

  6. Update the pheromone trails (global updating rule). Only the r best ants add pheromone:

    $$\begin{aligned} \tau _{t+1}(i,j) = (1- \rho )\tau _t(i,j) + \sum _{h=1}^r \varDelta \tau _t(i,j)^h , \quad \varDelta \tau _t(i,j)^h = \frac{1}{J^h}, \end{aligned}$$
    (6)

    where \(\rho \) is the pheromone evaporation rate (\(0< \rho < 1\)), t is the iteration counter, r is the number of elitist ants and \(J^h\) is the quality of the solution created by ant h.

  7. Check the termination condition:

    (a) if the number of iterations is greater than the maximum number of iterations, stop and return the best-so-far solution \(a^{**}\);

    (b) otherwise, go to step 2.

Once this process has finished, the best-so-far solution is chosen as the solution found by the algorithm. The solution consists of a set of medoids, which are the data instances representative of the clusters. Each data instance is then assigned to its closest medoid to define the clusters.
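Putting the steps together, the following is a compact, hedged sketch of the METACOC loop, assuming a precomputed distance matrix D, the decision rule of Eqs. 3 and 4 and the elitist update of Eq. 6; the parameter defaults and the detail of depositing pheromone on both the "yes" and "no" decisions are our reading of the description above, not the reference implementation.

```python
import numpy as np

def metacoc(D, k, n_ants=50, n_iter=100, q0=0.0001, rho=0.1, r=10, seed=0):
    """Hedged sketch of METACOC. D: (n, n) distance matrix; k: number of clusters."""
    rng = np.random.default_rng(seed)
    n = len(D)
    tau = rng.uniform(0.7, 0.8, size=(n, 2))   # per-instance pheromone: column 0 = "no", 1 = "yes"
    best_cost, best_medoids = np.inf, None
    for _ in range(n_iter):
        solutions = []
        for _ in range(n_ants):
            medoids = []
            for i in rng.permutation(n):       # visit data instances in random order
                if len(medoids) == k:          # stop: remaining instances are non-medoids
                    break
                if rng.random() <= q0:                    # exploitation (Eq. 3)
                    is_medoid = tau[i, 1] >= tau[i, 0]
                else:                                     # exploration (Eq. 4)
                    is_medoid = rng.random() < tau[i, 1] / tau[i].sum()
                if is_medoid:
                    medoids.append(i)
            if len(medoids) < k:
                continue                       # discard ants with fewer than k medoids (step 5b)
            cost = D[:, medoids].min(axis=1).sum()        # objective J^a (Eq. 5)
            solutions.append((cost, medoids))
        solutions.sort(key=lambda s: s[0])
        if solutions and solutions[0][0] < best_cost:     # keep the best-so-far solution (step 5d)
            best_cost, best_medoids = solutions[0]
        tau *= (1 - rho)                                  # evaporation (Eq. 6)
        for cost, medoids in solutions[:r]:               # elitist deposit (Eq. 6)
            is_m = np.zeros(n, dtype=bool)
            is_m[medoids] = True
            tau[is_m, 1] += 1.0 / cost                    # reinforce "yes" for chosen medoids
            tau[~is_m, 0] += 1.0 / cost                   # reinforce "no" for the rest
    return best_medoids, best_cost
```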

In terms of computational complexity, we can assume that all data instances are visited during the search process—although in practice this is not frequent—which takes O(A n) (where A is the number of ants and n is the number of data instances). The algorithm also includes a step that assigns each data instance to its closest medoid, which takes O(A n k) (where k is the number of medoids). The evaluation involves calculating the similarity of each data instance to its assigned medoid, which takes O(A n). Finally, the ranking of solutions takes \(O(A \log A)\) and the pheromone update uses r elitist ants and visits all data instances, which takes O(r n). Since these steps are repeated T iterations, the total complexity is \(O(T A n) + O(T A n k) + O(T A \log A) + O(T r n)\)—as \(O(T A n k) \ge O(T A n) \ge O(T r n) \ge O(T A \log A)\), the complexity is simplified to O(T A n k).

3.2 METACOC-K: a k-adaptive extension of METACOC

The proposed METACOC algorithm cannot choose the number of clusters; it requires a value for k as input. This section presents the METACOC-K algorithm, which uses METACOC as a starting point and adds the ability to estimate the number of clusters.

The main features of METACOC-K are:

  • each ant can have a different number of clusters;

  • the quality metric is designed to balance between the number of clusters and the cluster assignment cost.

The first modification is a straightforward change to the ant behaviour: each ant chooses a random number of clusters k to build a solution, limited to a pre-defined range \([k_{\mathrm{min}}, k_{\mathrm{max}}]\). This allows the algorithm to explore different numbers of clusters. The second modification concerns the solution evaluation, which now takes into account that each candidate solution can have a different number of clusters. As the metric used to update the pheromone information, we take the average silhouette, calculated as

$$\begin{aligned} \hbox {Avg}\_\mathrm{sil}(X) = \frac{\sum _{x \in X}\hbox {sil}(x)}{|X|}, \end{aligned}$$
(7)

where \(x \in X\) is a data instance and \(\hbox {sil}(x)\) is the silhouette metric (Kaufman and Rousseeuw 2009) given by

$$\begin{aligned} \hbox {sil}(x) = \frac{d(x,C_{\mathrm{closest}})-\frac{\sum _{j \in C_x} d(x,j)}{|C_x|}}{\max \left( \frac{\sum _{j \in C_x} d(x,j)}{|C_x|}, d(x,C_{\mathrm{closest}})\right) }, \end{aligned}$$
(8)

where \(d(x,j)\) is the distance between data instances x and j, \(d(x,C_{\mathrm{closest}})\) is the distance between x and the closest neighbouring cluster \(C_{\mathrm{closest}}\), \(C_x\) is the cluster to which x belongs and \(|C_x|\) is the number of elements of \(C_x\). The silhouette compares the tightness and separation of clusters. It is calculated per data instance and indicates which data instances are well assigned to a cluster and which should be moved. The silhouette values of all data instances together provide an overview of the clusters' quality (in a way similar to a Riemann integral), and the area of the shape defined by the silhouette is useful to assess the quality of the selected number of clusters (see Fig. 3).
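For reference, the sketch below computes sil(x) for every instance from a distance matrix and a label vector and averages the result (Eqs. 7 and 8); here the distance to the closest neighbouring cluster is taken as the mean distance to its members, the usual silhouette convention, and at least two clusters are assumed.

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette (Eqs. 7-8). D: (n, n) distance matrix;
    labels: cluster index per instance. Assumes at least two clusters."""
    labels = np.asarray(labels)
    n = len(D)
    clusters = np.unique(labels)
    sil = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                     # exclude the instance itself
        if not same.any():
            continue                        # convention: sil = 0 for singleton clusters
        a = D[i, same].mean()               # mean distance to the own cluster
        b = min(D[i, labels == c].mean()    # mean distance to the closest other cluster
                for c in clusters if c != labels[i])
        sil[i] = (b - a) / max(a, b)
    return sil.mean()
```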

Fig. 3 Silhouette for a dataset where four clusters have been discriminated: the first value represents the cluster number, the second the number of instances and the third the average silhouette of the cluster (cluster number: instances | silhouette). The average silhouette value across all clusters is 0.74, which measures the quality of the selected number of clusters

The METACOC-K algorithm follows a structure similar to METACOC. The main differences are:

  1. Selection of the number of clusters:

    (a) during the ant initialization (step 2 in METACOC), each ant additionally chooses, uniformly at random, the number of clusters in the range \([k_{\mathrm{min}}, k_{\mathrm{max}}]\); the solution is then created using the same procedure as in METACOC.

  2. Solution evaluation:

    (a) candidate solutions are evaluated using the average silhouette (Eq. 7), which balances the number of clusters against the cluster assignment cost (step 4 in METACOC).

Concerning the computational complexity of METACOC-K, we have to consider that the silhouette calculation is more expensive. The silhouette is used to evaluate the solutions, and it involves the distance between every pair of the n data instances and the distance between each data instance and the k medoids, repeated over T iterations; therefore, the evaluation of solutions takes \(O(T A n^2 k)\) (where A is the number of ants). Since the remaining steps are the same as in METACOC and \(O(T A n^2 k) \ge O(T A n k)\), the total complexity is \(O(T A n^2 k)\).
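Relative to the METACOC sketch given at the end of Sect. 3.1, the METACOC-K changes are local: each ant draws its own k, and solutions are scored by the average silhouette instead of J. Schematically, reusing the earlier sketches (names remain illustrative):

```python
import numpy as np

def metacoc_k_ant(D, k_min, k_max, build_medoids, average_silhouette, rng):
    """One METACOC-K ant: draw k uniformly in [k_min, k_max], build the medoid
    set exactly as in METACOC (build_medoids stands for that construction step),
    then score the resulting assignment with the average silhouette (Eq. 7)."""
    k = rng.integers(k_min, k_max + 1)                  # per-ant number of clusters
    medoids = build_medoids(D, k, rng)                  # same construction as METACOC
    labels = np.asarray(medoids)[D[:, medoids].argmin(axis=1)]
    return medoids, average_silhouette(D, labels)      # quality used for the pheromone update
```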

4 Computational experiments

This section presents the experiments that were carried out to measure the performance of the proposed algorithms, METACOC and METACOC-K. METACOC was compared against K-means, ACOC and PAM as non-adaptive algorithms (i.e. algorithms that require a fixed number of clusters), whereas METACOC-K was compared against EMBIC, Clues and PAMK as adaptive algorithms (i.e. algorithms that do not require a fixed number of clusters).

4.1 Datasets

We divided the computational results into three sets of experiments. In the first set, we evaluated the proposed algorithms on synthetic datasets. The following synthetic datasets were generated:

  • synthetic dataset 1: This dataset corresponds to points in a two-dimensional Euclidean space, where nine clusters of points, each derived from a two-dimensional Gaussian distribution, were generated; three of the Gaussians are closer to each other than the rest. This dataset has 450 instances, and it is illustrated in the top-left plot in Fig. 4;

  • synthetic dataset 2: This second dataset is generated analogously to dataset 1 (nine clusters of points), but with additional noisy data in the background. This dataset has 550 instances, and it is illustrated in the top-right plot in Fig. 4;

  • synthetic dataset 3: This dataset is composed of three two-dimensional Gaussian distributions, which are well separated. This dataset has 150 instances, and it is illustrated in the bottom-centre plot in Fig. 4.

Fig. 4 Data points generated for the three synthetic datasets used in the experiments: the first (top-left plot) shows nine two-dimensional Gaussian distributions, where three of them are very close; the second (top-right plot) adds noise to the nine Gaussian distributions; and the third (bottom-centre plot) shows three well-separated Gaussian distributions

In the second set of experiments, we chose 20 real-world datasets from the UCI Machine Learning Repository (Frank and Asuncion 2010). These datasets are benchmark datasets for clustering and classification tasks. Table 1 shows the main characteristics for each UCI dataset used in our experiments. Finally, in the third set of experiments, we chose 10 time series benchmark datasets from the UCR time series repository (Chen et al. 2015) in order to evaluate the medoid-based methodologies in a specific area where they have been successful. Table 2 shows the main characteristics for each UCR dataset used in our experiments.

Table 1 Description of the UCI datasets used in the experiments
Table 2 Description of the UCR datasets used in the experiments

4.2 Experimental setup

This section briefly describes the selected algorithms used for comparison. ACOC (Kao and Cheng 2006) is an ACO clustering algorithm based on centroids. ACOC uses a pheromone matrix to store the relationship between the data instances and the centroid labels, where ants assign each data instance to one of the available clusters and cluster centroids are adjusted based on this assignment. Comparing ACOC to METACOC and METACOC-K, both METACOC and METACOC-K use a different construction graph, where an ant chooses whether an instance is a medoid or not (i.e. it is always a binary decision regardless of the number of clusters).

K-means (MacQueen 1967) is an iterative algorithm based on centroids, which are randomly selected at the beginning. The goal of the algorithm is to find the best centroid positions. It proceeds in two steps: first, it assigns each data instance to the closest centroid (cluster); second, it recalculates the position of each centroid as the mean of the data instances assigned to it.

PAM (Kaufman and Rousseeuw 1987) is similar to K-means, but it uses medoids instead of centroids. PAM can work with a dissimilarity/similarity matrix, which is used to calculate the overall cost of a cluster. PAMK (Kaufman and Rousseeuw 2009) is an extension of PAM, which calculates the number of clusters using the silhouette as a decision metric.

EMBIC (Fraley and Raftery 2007) combines EM with the Bayesian information criterion (BIC). The EM algorithm optimizes the parameters of an estimator (in this case, Gaussian mixture models), and BIC adds a penalty to the likelihood based on the number of parameters, which is helpful when the number of clusters needs to be controlled. Finally, Clues (Wang et al. 2007) creates a cluster per data instance and merges clusters according to the silhouette metric.

We used the standard R implementation of K-means, PAM, PAMK, EMBIC and Clues: for each algorithm, the number of iterations was set to 100 and the remaining parameters were kept at their default values; the initial centroids for K-means were randomly chosen. The parameters of the ACOC, METACOC and METACOC-K algorithms have been set in a similar way as in the original work (Kao and Cheng 2006): the number of ants is 1000, the number of elitist ants is 10, the exploitation probability (\(q_0\)) is 0.0001, the initial pheromone values follow a uniform distribution in [0.7, 0.8], \(\beta = 2.0\) (only used by ACOC), \(\rho = 0.1\) and the maximum number of iterations is 1000.

All the experiments have been carried out using the Euclidean distance as the basic distance metric, which is defined as

$$\begin{aligned} d(x_i,x_j) = ||x_i-x_j|| = \sqrt{\sum _v (x_i^v - x_j^v)^2}, \end{aligned}$$
(9)

where \(x_i, x_j\) represent two data instances and v indexes the attributes of a data instance. Additionally, the K-means, PAM, ACOC and METACOC algorithms need the number of clusters as an input parameter. The experiments have been carried out 100 times per algorithm and dataset, and the average is reported.

The evaluation of the experiments has been focused on two different criteria: on one hand, the synthetic datasets have been evaluated according to how well the algorithms discriminate the original clusters, including the noisy case; on the other hand, the real-world datasets have been evaluated using the silhouette metric, which is optimized directly by the PAMK, EMBIC, Clues and METACOC-K algorithms, and indirectly by the remaining algorithms (K-means, PAM, ACOC and METACOC) when they optimize the cost function defined by the Euclidean metric.

4.3 Synthetic experiments

This section presents the results for the synthetic experiments. We have measured how well the algorithms discriminate the data, applying the adjusted Rand index (Hubert and Arabie 1985) to the solutions generated for each dataset. As mentioned above, we considered three datasets. Table 3 shows the average results for each algorithm (average \(\pm \) SD) over 100 executions; no standard deviation is shown when its value is lower than 0.001. For the adaptive algorithms METACOC-K, PAMK, EMBIC and Clues, the average number of clusters identified is shown in brackets. Table 4 shows the median results for each algorithm. Finally, Table 5 shows the best results obtained by each algorithm: a value in this table corresponds to the highest adjusted Rand index achieved by an algorithm.

Table 3 Average results of the application of the algorithms to the synthetic datasets in adjusted rand index terms, calculated over 100 executions (average \(\pm \) SD); no SD is shown for an algorithm when all values are lower than 0.001
Table 4 Median value for the adjusted rand index on the synthetic datasets

Table 3 shows that METACOC is the algorithm that most clearly discriminates the data in all three datasets, achieving the highest average adjusted Rand index of all algorithms. METACOC-K also performs well overall, although it has more difficulty discriminating the cluster boundaries on synthetic dataset 1. PAM and PAMK obtain similar performances, but PAMK has problems in identifying the correct number of clusters on synthetic dataset 2. This is also the case for EMBIC, which performs well on synthetic datasets 1 and 3, but has problems on synthetic dataset 2. Clues achieved the lowest average on synthetic dataset 3, since it generates many more clusters than exist in the data during the discrimination process (12 clusters); it achieves a good performance on the remaining datasets. ACOC performs well overall, with the exception of synthetic dataset 2, where it has problems discriminating the cluster centres. K-means has problems in all three datasets: while it managed to discriminate the clusters in the majority of the runs, it seems to be more sensitive to the initial centroid positions, as can be noticed by its lower average and higher standard deviation values.

Looking closely at the median (Table 4) and average (Table 3) results, we get an intuition about the convergence of METACOC and METACOC-K. METACOC has similar values for both median and average, showing that the solutions are similar over multiple runs. METACOC-K varies more, with an average that is usually lower than the median. This shows that an outlier result may appear when METACOC-K is applied multiple times, which affects the average value. Comparing the median of METACOC-K with the maximum value (Table 5) of the other algorithms, METACOC-K achieves a better or similar result, which suggests that in more than 50 % of the runs METACOC-K obtains a result better than or similar to the best result of the other algorithms.

Table 5 Highest value for the adjusted rand index on the synthetic datasets

These results show that the proposed algorithms achieve good results on synthetic datasets when compared with classical algorithms, and in general achieve better results than ACOC.

4.4 Experiments with real-world datasets

This section presents the results of the experiments with real-world datasets. In this case, the evaluation is focused on the algorithms' objective, i.e. optimizing the silhouette metric. Table 6 shows the results of all the non-adaptive algorithms, and Table 7 shows the results of the adaptive algorithms. The values in these tables represent the average and standard deviation (average \(\pm \) SD) over 100 executions; no standard deviation is shown for an algorithm when all values are lower than 0.001 (EMBIC, Clues and PAMK results).

Table 6 Average results of the application of the non-adaptive algorithms to the UCI datasets in silhouette metric terms (average \(\pm \) SD)
Table 7 Average results of the application of the adaptive algorithms to the UCI datasets in silhouette metric terms (average \(\pm \) SD); no standard deviation is shown for an algorithm when all values are lower than 0.001
Table 8 Highest value for the silhouette metric on the UCI datasets

We have performed a statistical analysis using the Wilcoxon test (Demšar 2006). We compared the performance of METACOC against PAM (Table 6) and METACOC-K against PAMK (Table 7): the datasets where METACOC's (METACOC-K's) performance is statistically significantly better according to the Wilcoxon test with a significance level of 0.05 are marked with a dedicated symbol; the datasets where METACOC's (METACOC-K's) performance is statistically significantly worse are marked with a different symbol; if no symbol is shown, no significant difference was observed. In the first case, METACOC and PAM have been chosen since both are medoid-based clustering algorithms, but METACOC employs a different search strategy compared to PAM. In the second case, METACOC-K and PAMK have been chosen as PAMK is the adaptive algorithm with the best performance among the algorithms optimizing the silhouette metric.

Table 9 Average results of the best K-means run computed over 30 restarts, and a single run of METACOC and METACOC-K on the UCI datasets in silhouette metric terms (average \(\pm \) SD)

Table 6 shows that METACOC obtains statistically significantly better results than PAM in 8 out of 20 datasets, while achieving statistically significantly worse results in only 3. The comparison of METACOC with the rest of the non-adaptive algorithms shows that it achieves the best results in 6 out of 20 datasets, similar to PAM, while K-means obtains the best results in 10 out of 20 datasets. The good performance of K-means is likely a consequence of the fact that this algorithm can move its centroids in the whole search space (i.e. centroid values do not necessarily correspond to values from a data instance), while METACOC and PAM, the medoid-based algorithms, choose data instances as medoids, which probably reduces the silhouette values. The situation is similar when ACOC and METACOC are compared: ACOC is able to use the whole search space, while METACOC has to use a reduced, discrete version based only on the data instances. As a consequence, the performance of ACOC and METACOC is similar in several cases. However, it is important to remark that centroid-based algorithms cannot be used when only the distances/similarities among data are known.

Table 10 Average results of the application of the adaptive algorithms to the UCR time series datasets in silhouette metric terms (average \(\pm \) SD); no SD is shown for an algorithm when all values are lower than 0.001

Table 7 shows the experimental results for the datasets when the adaptive algorithms are considered. This table shows that METACOC-K obtains statistically significantly better results than PAMK in 15 of the 20 datasets, while achieving statistically significantly worse results in only 4. When METACOC-K is compared with the rest of the adaptive algorithms, it obtains better results than both EMBIC and Clues—with the exception of the So and Li datasets, where Clues obtains better results.

Table 8 presents a summary of the best results obtained by each algorithm. A value in the table corresponds to the highest value in terms of the silhouette metric achieved by an algorithm over 100 executions. These results show again the better performance of METACOC-K in optimizing the silhouette metric over the remaining algorithms: METACOC-K obtained the highest value in 17 of the 20 datasets; although K-means optimizes the silhouette metric only indirectly by minimizing the Euclidean error in the clusters, it obtained the highest value in two datasets (in one of them it tied with METACOC-K); ACOC and PAMK obtained the highest value in one dataset each.

Table 11 Average computational time (average \(\pm \) SD) in seconds taken by METACOC and METACOC-K on the UCI datasets

We also compared the best results of K-means against a single run of METACOC and METACOC-K. This comparison balances computational time against performance, given that the proposed algorithms use a more time-consuming ACO procedure where multiple candidate solutions are evaluated, while K-means employs a faster local search strategy. The results are presented in Table 9. A value in the table corresponds to the average of the best K-means value over 30 executions (where the best value is determined over 30 restarts for each execution) and a single execution of METACOC and METACOC-K. The results show that METACOC-K is the best of the ACO-based algorithms, achieving statistically significantly better results than K-means in 14 of the 20 datasets and statistically significantly worse results in only one dataset; in the remaining 5 datasets, no statistically significant differences were detected. In this case, the advantage of the ACO procedure is evident, since it leads to the creation of high-quality solutions. The results obtained by METACOC are mixed: K-means is statistically significantly better than METACOC in 9 datasets; K-means is statistically significantly worse than METACOC in 5 datasets; and they have similar performances in 4 datasets. Given the stochastic nature of the ACO search, better results might be obtained by multiple executions of METACOC, at the cost of a higher computational time.

Overall, we consider the results presented in Tables 6, 7, 8 and 9 positive. In summary, METACOC shows statistically significant improvements over PAM; METACOC-K, the proposed algorithm that can adapt the number of clusters, obtains the highest results of all the algorithms in 17 of the 20 datasets. More importantly, it statistically significantly outperforms PAMK in 15 of the 20 datasets.

4.5 Time series experiments

In this section, we present a set of experiments focused on a specific domain where medoid-based approaches have been successful: time series analysis (Liao 2005). We have selected ten datasets from the UCR Time Series Classification Archive (Chen et al. 2015); details of these datasets are presented in Table 2. The similarity matrix is derived from the alignment between each pair of time series, generated by applying the Dynamic Time Warping distance (Keogh and Ratanamahatana 2005). Table 10 shows the experimental results for the medoid-based algorithms: PAM, METACOC, Clues, PAMK and METACOC-K. The values in this table represent the average and standard deviation (average \(\pm \) SD) over 100 executions; no standard deviation is shown for an algorithm when all values are lower than 0.001 (Clues and PAMK results).
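For completeness, a textbook dynamic-programming sketch of the Dynamic Time Warping distance used to build this matrix is shown below; it uses the absolute difference as the local cost and no warping window, which may differ from the exact experimental configuration.

```python
import numpy as np

def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) Dynamic Time Warping distance between two 1-D series."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```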

In these experiments, we use PAM, Clues and PAMK as benchmarks for the Wilcoxon test with a significance level of 0.05, comparing them with METACOC and METACOC-K, respectively. METACOC shows better performance than PAM overall, achieving statistically significantly better results in two datasets (CB and SA) and statistically significantly worse results in only one dataset (Lt). METACOC-K achieved statistically significantly better results than Clues in 9 out of 10 datasets, with no statistically significant difference detected in the remaining dataset (He); compared with PAMK, METACOC-K achieved statistically significantly better results in 6 out of 10 datasets (AH, CB, Co, Ha, IP and SA), with no statistically significant differences detected in the remaining datasets. Additionally, METACOC-K identified the right number of clusters in all cases except the AH dataset, which was also not identified by any of the other adaptive algorithms.

Fig. 5 Illustration of the convergence of METACOC and METACOC-K on the breast cancer, breast tissue, Haberman and Wine UCI datasets

4.6 Computational time

Table 11 shows the average computational time (average \(\pm \) SD) in seconds taken by METACOC and METACOC-K on the UCI datasets over a fixed number of iterations. The algorithms are around 10 times slower than K-means, 6 times slower than PAM and Clues, 4 times slower than PAMK and comparable to EMBIC. Overall, METACOC is faster than METACOC-K. We were expecting a higher computational time for METACOC-K, since the algorithm explores solutions with different values of k and uses a more complex evaluation function. In our observations, both METACOC and METACOC-K are generally faster than ACOC. We attribute this to their simpler construction process: as soon as an ant selects k medoids (where k is the number of clusters), the solution construction stops, while ACOC must visit all instances of the dataset to create a solution.

Figure 5 illustrates the convergence of METACOC and METACOC-K. It is interesting to note that METACOC-K converges in fewer iterations than METACOC, while taking more time over the same number of iterations. This suggests that the overall computation time of METACOC-K can be reduced by using a smaller number of iterations, without a negative impact on its performance.

5 Conclusions and future work

In this paper, we proposed two medoid-based ACO clustering algorithms, METACOC and METACOC-K. Medoid-based clustering algorithms only need the distances/similarities among data to find a solution, and they are more robust to outliers. One of their main advantages is that they can be directly applied to problems where the features of the data cannot be easily represented in a multi-dimensional space. The first algorithm, METACOC, uses an ACO procedure to determine an optimal medoid set. The second algorithm, METACOC-K, additionally performs an automatic selection of the number of clusters, useful for problems where the number of clusters is not known a priori.

We compared the proposed algorithms against classical clustering algorithms, both centroid- and medoid-based, on synthetic and real-world datasets. METACOC's results were positive: it statistically significantly outperformed PAM in 8 out of 20 real-world datasets and achieved competitive results against the (centroid-based) K-means and ACOC algorithms, while using only information about the distances among the data instances. METACOC-K's results were also positive: it statistically significantly outperformed PAMK in 15 out of the 20 real-world datasets, and it was the algorithm that most consistently achieved the best results on the real-world datasets in the experiments optimizing the silhouette metric. Concerning the time series datasets, METACOC shows better performance than PAM overall, achieving statistically significantly better results in two datasets and statistically significantly worse results in only one; METACOC-K achieved statistically significantly better results than Clues in 9 out of 10 datasets and than PAMK in 6 out of 10 datasets, with no statistically significant differences detected in the remaining datasets.

There are several future research directions. Both METACOC and METACOC-K do not employ heuristic information during the construction process; it would be interesting to investigate whether the search can be further improved by such information. Exploring the use of different cluster evaluation measures to improve the selection of the number of clusters in METACOC-K is another interesting direction, which could be evaluated in an automatic configuration setting (López-Ibáñez et al. 2011). At the moment, the selection of the number of clusters is not part of the construction graph and is therefore not influenced by pheromone values; adding this selection to the construction graph might improve the search. Finally, the application of the algorithms to large-scale data analysis tasks is also worth further exploration.