1 Introduction

The history of clustering algorithms based on, and motivated by, gravitational (and electrostatic) models stretches back to the 1970s. These works are too numerous to be reviewed exhaustively here; we highlight below a few relevant contributions, thereby placing our work into context and pointing out its distinguishing features. For a general review of clustering techniques, see Berkhin (2006); for a recent survey, see Xu and Tian (2015).

One of the earliest papers of cognate interest is Wright (1977b), where the attractive force (between the particles) under Newtonian gravity moves the data points to form successively larger and fewer clumps, eventually resulting in all the mass being lumped together in one single cluster.

In Shi et al. (2002), starting from a set of data points each of which has a potential, an agglomerative hierarchical clustering algorithm is presented, based on comparing distances between already defined clusters; the distance used there is a weighted value of differences of potentials.

In Lu and Wan (2012), each of the data points is endowed with a potential (of a Newtonian type), and the points are sorted into a list in ascending order of their potential. Cluster centres are nominated as the list is worked through, using a preselected bandwidth parameter; the ability to measure the distance between points is an essential feature of the algorithm.

An interesting recent paper with a philosophical flavour is Henning (2015). The authors reason that there is no such thing as the ‘best’ clustering algorithm, since the notion of a cluster depends on the context and on the intention of the investigator. At the other end of the spectrum is the early paper Wright (1977a), which seeks to make clustering an objective endeavour by assuming a priori a ‘clustering function’ which is then used to assess the quality of the clustering achieved.

The view taken here is that ‘reality’ and ‘truth’ are elusive concepts and can be of at best partial concern to the modeller; what matters is that, initially, a collection of (more or less plausible) models is available, and that out of these the one is chosen which most successfully predicts future events. We want our model to be understood and received in this spirit: the model put forward is rooted in the physics of the data’s own attributes, it is rich in structure, and it appears plausible in certain circumstances.

The algorithm suggested here is a hierarchical clustering algorithm for a finite set of data points in Euclidean space, each of which is assumed to generate a potential field constructed by means of the density of a certain family of probability distributions. The individual point potentials can be anisotropic, thus allowing the surface representing the total potential to be thought of (in two dimensions, for the sake of simplicity) as an undulating landscape with hills and valleys extending in any direction. This construction allows clusters of data points on a ‘real’ surface to be modelled, a feature which may turn out to be useful in future work on epidemic modelling using networks (see Barabási 2016, Chapter 10). The novelty of our model also lies in the fact that the result is not a single dendrogram but a randomized dendrogram, defined as a convex combination of deterministic dendrograms, thus allowing the results to be formulated in a probabilistic framework.

The important novel feature introduced here is the proximity curve, whose pattern will be the key to determining the suggested clustering. A welcome by-product of our algorithm is that the black hole problem (a local-minima challenge discussed, for example, in Li and Fu 2011) does not arise here: all massive data points will be ‘collected’ prior to visiting genuine clusters comprising smaller data points.

In the next section, we place the paper’s topic into context and introduce some notation. Then, in the main Section 3, the new proximity curve concept and the proposed algorithm are described in detail and illustrated using a synthetic data set. In Section 4, validation experiments are conducted on the Iris data set (Fisher 1936, 2011), a standard classification and clustering benchmark. In Section 5, the results and various features of the suggested technique and its merits are discussed. In Section 6, our findings are summarised and some future research ideas are indicated.

Appendices A – C contain auxiliary theoretical material, whereas all non-textual material, namely contour maps, proximity curves, potential curves, dendrograms, pseudocodes, tables and large matrices, is collected in Appendix E.

2 Context and Assumptions

We are given N data points x1,…,xN in Euclidean space, assumed here for the sake of simplicity to be the 2-dimensional plane \(\mathbb {R}^{2}\). This is not a technical limitation as such; it is intended merely as a vehicle for conveying the underlying principles. The ith point induces a potential vi, and together they generate the total potential as follows:

$$ V(\mathbf{y}) = \sum\limits_{i=1}^{N} v_{i}(\mathbf{y}), \quad \mathbf{y} \in \mathbb{R}^{2}. $$

Depending on the context and emphases, the data points may be referred to also as nodes, point masses, point charges or simply points. The collection (set) of all data points may be referred to as a data cloud.

The structure of the point potentials vi is as follows:

$$ v_{i}(\mathbf{y}) = - M_{i} f\left( \mathbf{y} \left\vert \boldsymbol{\theta}_{i} \right. \right), $$

where Mi > 0 is the ‘mass’ of point i and \(f\left (. \left \vert \boldsymbol {\theta } \right . \right )\) stands for a Cauchy probability density function (pdf) with parameter 𝜃. The parameter 𝜃 has five components, reflecting the geometric transformation steps carried out to carry the seed distribution (a standard Cauchy) over to the required Cauchy distribution; the construction involved is described in detail in Appendices A – C.
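To fix ideas, the following R sketch evaluates a point potential and the resulting total potential. Purely for illustration it uses an isotropic bivariate Cauchy density with a single scale parameter gamma; the five-parameter anisotropic construction actually used is the one described in Appendices A – C.

```r
# Sketch only: an isotropic bivariate Cauchy stands in for the paper's
# five-parameter anisotropic Cauchy (Appendices A-C).
cauchy2d <- function(y, mu, gamma = 1) {
  r2 <- sum((y - mu)^2)
  gamma / (2 * pi * (gamma^2 + r2)^(3 / 2))        # bivariate Cauchy density at y
}

point_potential <- function(y, x_i, M_i, gamma = 1) {
  -M_i * cauchy2d(y, x_i, gamma)                   # v_i(y) = -M_i f(y | theta_i)
}

total_potential <- function(y, X, M, gamma = 1) {
  # X: N x 2 matrix of data point locations, M: vector of N masses
  sum(sapply(seq_len(nrow(X)), function(i) point_potential(y, X[i, ], M[i], gamma)))
}
```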

Our running example is a synthetic data set comprising N = 19 points in the plane shown in Fig. 2 with individual and joint equipotentials. The corresponding parameter values are shown in Table 2.

We will be interested in an algorithm for finding data clusters based on the points’ location and the overall potential field generated by them.

The physical models of gravity and electrostatics serve as an intuitive background (e.g. Ramsey 1959; Kip 1962). The field’s gradient vector ∇V (y) is the force exerted on an infinitesimal unit ‘test mass’ or ‘test charge’ located at y. The gradient vector ∇V (y) will be used to reflect qualitatively on where the point y lies in relation to the others: if \(\lVert \nabla V(\mathbf {y}) \rVert \), the magnitude of the gradient, is close to zero, then y is near the bottom of a valley (i.e. near a local minimum of V ), whereas ‘noticeably positive’ values of \(\lVert \nabla V(\mathbf {y}) \rVert \) are associated with the point y being on a slope of the surface describing V. Respective examples in Fig. 2 are the points #10 and #6. (See Table 2 for the points’ numbering.)
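This diagnostic can be sketched numerically as follows, building on the total_potential sketch above; the gradient is approximated by central differences rather than computed in closed form.

```r
# Central-difference approximation of the gradient of V at y (sketch only).
grad_V <- function(y, X, M, h = 1e-5) {
  c((total_potential(y + c(h, 0), X, M) - total_potential(y - c(h, 0), X, M)) / (2 * h),
    (total_potential(y + c(0, h), X, M) - total_potential(y - c(0, h), X, M)) / (2 * h))
}

grad_norm <- function(y, X, M) sqrt(sum(grad_V(y, X, M)^2))
# grad_norm close to zero: y lies near a valley (local minimum of V);
# noticeably positive: y lies on a slope of the potential surface.
```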

Our aim is to discover data clusters by considering properties derived from the gradient field. Finite lists of tuples of real numbers will be constructed (we shall call them proximity curves) whose pattern will suggest ways of clustering the data.

3 The Algorithm: Cluster Discovery by Proximity Curves

3.1 Proximity Curves

Initially, all data points are said to be active. Starting from some initial position, steepest descent is applied to the total potential generated by all the active points, guiding us to one of the data points. The point thus found is then deactivated, the total potential is redefined (i.e. updated) to be the one generated by the now active points, and the process is restarted at the position of the point just deactivated. This is carried out until the last point has been found, whereupon all points are inactive and the total potential is zero.

The core of this algorithm is ProximityCurve; it is shown as Pseudocode E-2. As can be gleaned from Pseudocode E-2, in the course of the computations, a list of pairs is being built up, each entry comprising the label of the node being deactivated and the logarithm of the vector norm of the gradient of the updated potential at the node’s location. (The algorithm ClosestActive, shown as Pseudocode E-3, is an auxiliary.) Figure 3 illustrates the progress of the algorithm with a snapshot of the equipotentials just after having deactivated seven data points.
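For orientation, the R sketch below mirrors the deactivation loop just described. It is an illustrative reimplementation, not the paper's Pseudocodes E-1 – E-3; in particular, the simple fixed-step descent is only a stand-in for SteepestDescent (Pseudocode E-1), and the nearest-active-point search stands in for ClosestActive (Pseudocode E-3).

```r
# Fixed-step steepest descent on the current potential (stand-in for Pseudocode E-1).
steepest_descent <- function(z, X, M, step = 0.05, tol = 1e-8, max_iter = 10000) {
  for (it in seq_len(max_iter)) {
    g <- grad_V(z, X, M)
    if (sqrt(sum(g^2)) < tol) break
    z <- z - step * g                               # move downhill on V
  }
  z
}

# Sketch of the deactivation loop: descend to the nearest active point,
# deactivate it, record log ||grad V|| of the updated potential at its location.
proximity_curve <- function(z_start, X, M) {
  active <- rep(TRUE, nrow(X))
  curve  <- NULL
  z <- z_start
  while (any(active)) {
    z_min <- steepest_descent(z, X[active, , drop = FALSE], M[active])
    idx_a <- which(active)
    k <- idx_a[which.min(colSums((t(X[active, , drop = FALSE]) - z_min)^2))]  # closest active node
    active[k] <- FALSE
    lam <- if (any(active)) log(grad_norm(X[k, ], X[active, , drop = FALSE], M[active])) else -Inf
    curve <- rbind(curve, c(node = k, lambda = lam))
    z <- X[k, ]                                     # restart at the point just deactivated
  }
  curve                                             # N x 2 matrix: columns 'node' and 'lambda'
}
```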

A proximity curve, illustrated by Fig. 4, is a plot of the logarithm of the vector norm of the gradient of the successively updated potential (as described above) versus the points’ position in the list (the logarithm applied to the magnitude of the gradient is merely there to bring out and amplify an existing pattern). In total, there will be at most N proximity curves, since the first node visited completely determines the rest of the proximity curve and, in practice, some nodes can never be reached as the first node to be visited. The full information about all possible proximity curves can be conveyed by two square matrices of size N (their assembly is sketched in code after the following list):

  • The rows of matrix \(\mathbf {S} = \left (s_{\text {ij}} \right )_{i,j = 1, \ldots , N}\) record all node sequences thus obtainable.

  • The rows of matrix \(\boldsymbol {\Lambda } = \left (\lambda _{\text {ij}} \right )_{i,j = 1, \ldots , N}\) record the logarithms of the norms of the gradients of the correspondingly updated potentials at the respective node locations.
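One pragmatic way of assembling the two matrices, using the proximity_curve sketch from Section 3.1, is outlined below; rows corresponding to unreachable first nodes simply remain unfilled. This is an illustrative heuristic, not the procedure of Appendix E.

```r
# Sketch: run the deactivation loop starting at each data point's location and
# file the resulting curve under the row of the first node actually reached.
build_S_Lambda <- function(X, M) {
  N <- nrow(X)
  S      <- matrix(NA_real_, N, N)
  Lambda <- matrix(NA_real_, N, N)
  for (i in seq_len(N)) {
    pc    <- proximity_curve(X[i, ], X, M)
    first <- pc[1, "node"]                # row index = label of the first node visited
    S[first, ]      <- pc[, "node"]
    Lambda[first, ] <- pc[, "lambda"]
  }
  list(S = S, Lambda = Lambda)
}
```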

This is illustrated in Fig. 4 by the proximity curve whose first node is #2. (It was obtained by starting ProximityCurve at (− 2, − 6).) The curve shown in Fig. 4 is obtained by combining the second rows of each of the matrices S and Λ (shown in Appendix E.4) into the list of pairs

$$ \begin{array}{@{}rcl@{}} \mathbf{L} &=& \lbrack \left( s_{2 1}, \lambda_{2 1}\right), \left( s_{2 2}, \lambda_{2 2}\right), \left( s_{2 3}, \lambda_{2 3}\right), \ldots, \left( s_{2 (N-1)}, \lambda_{2 (N-1)}\right), \left( s_{2 N}, \lambda_{2 N}\right)\rbrack\\ &= &\lbrack (2, -3.39),(1,-4.07),(16,-2.12), \ldots,(7,-2.46),(9,-\infty) \rbrack. \end{array} $$
(1)

3.2 Pattern Extraction

The curve shown in Fig. 4 has roughly a self-similar structure (in a statistical sense) akin to the Elliott Wave Pattern described in Casti (2002) in the context of financial data. The principle formulated below will guide us when using proximity curves for identifying clusters of data points.

Guiding Principle. Large values of the norm of the gradient of the updated potential function are indicative of the beginning of a new cluster. Decreasing values of the norm of the gradient indicate a lack of ‘attractive mass’ of what is left in the current cluster. Cluster boundaries are therefore marked by local minima of the proximity curve.

The pattern of a proximity curve is explored by a recursive algorithm, described in the following list and sketched in code after it.

  • The recursive step consists of locating the smallest local minimum. In the example shown in Fig. 4, this occurs at node #13, as is seen from the appropriate section of the list representation of the proximity curve (1) as follows:

    $$ \mathbf{L} = \lbrack \left( 2, -3.39\right), \ldots, \left( 15, -2.68\right), \left( 13, -7.72\right), \left( 19, -7.14\right), \ldots,\left( 9,-\infty\right) \rbrack $$

    The proximity curve L is now recognised as the concatenation of the two lists:

    $$ \begin{array}{@{}rcl@{}} \mathbf{L}_{\text{left}} &=& \lbrack \left( 2, -3.39\right), \ldots, \left( 15, -2.68\right), \left( 13, -7.72\right)\rbrack\end{array} $$

    and

    $$ \begin{array}{@{}rcl@{}} \mathbf{L}_{\text{right}} &=& \lbrack \left( 19, -7.14\right), \ldots,\left( 9,-\infty\right) \rbrack, \end{array} $$

    each of which is a proximity curve in its own right. Obviously, if local minima determine cluster boundaries, then Lleft and Lright give rise to the clustering as follows:

    $$ \lbrace \lbrace 2, \ldots, 15, 13 \rbrace, \lbrace 19, \ldots, 7, 9 \rbrace \rbrace, $$
    (2)

    defining level 2 in the hierarchical clustering scheme. The next level of clustering will be arrived at by subjecting each of the lists Lleft and Lright to the same recursive step. (A list not containing a local minimum will not be expanded further but will be replaced by a list containing the list itself as the only entry.)

  • The base case is arrived at if none of the input lists can be written as the concatenation of two non-empty lists as described above.
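The recursion can be sketched in R as follows, operating on a proximity curve stored as a two-column matrix with columns node and lambda (as produced by the proximity_curve sketch in Section 3.1); in this sketch the endpoints of a curve are not treated as local minima.

```r
# Recursive pattern extraction: split at the smallest interior local minimum
# of lambda (minimum included in the left part), recurse on both parts;
# a curve with no local minimum is returned as a leaf cluster.
split_curve <- function(L) {
  lam <- L[, "lambda"]
  n   <- length(lam)
  locmin <- rep(FALSE, n)
  if (n >= 3)
    locmin[2:(n - 1)] <- lam[2:(n - 1)] < lam[1:(n - 2)] & lam[2:(n - 1)] < lam[3:n]
  if (!any(locmin)) return(list(L[, "node"]))               # base case: leaf cluster
  cut <- which(locmin)[which.min(lam[locmin])]              # smallest local minimum
  list(split_curve(L[1:cut, , drop = FALSE]),               # left part, minimum included
       split_curve(L[(cut + 1):n, , drop = FALSE]))         # right part
}
```

Applied repeatedly, this yields the nested-list representation of the clustering used in the next subsection.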

3.3 Deterministic Dendrograms

The algorithm in Section 3.2 can be used for obtaining a nested list representation of the set of nodes indicating the clustering thus found; for the present example, this is as follows:

$$ \lbrack\lbrack\lbrack\lbrack\lbrack2,1\rbrack\rbrack,\lbrack\lbrack16,17,18\rbrack,\lbrack3,4,5\rbrack\rbrack\rbrack,\lbrack\lbrack\lbrack14,12,15,13\rbrack\rbrack\rbrack\rbrack,\lbrack\lbrack\lbrack\lbrack19,10,11,8\rbrack\rbrack\rbrack,\lbrack\lbrack\lbrack6,7,9\rbrack\rbrack\rbrack\rbrack\rbrack. $$

The dendrogram shown in Fig. 5 is arrived at by parsing this nested list structure. The upper hyphenated line in Fig. 5 indicates the clustering shown in Eq. 2 (at depth 2); in full,

$$ \lbrace\lbrace 2,1,16,17,18,3,4,5,14,12,15,13\rbrace,\lbrace19,10,11,8,6,7,9\rbrace\rbrace. $$

The lower hyphenated line indicates the clustering found at depth 3 (see Fig. 5),

$$ \lbrace\lbrace\lbrace2,1,16,17,18,3,4,5\rbrace,\lbrace14,12,15,13\rbrace\rbrace,\lbrace\lbrace19,10,11,8\rbrace,\lbrace6,7,9\rbrace\rbrace\rbrace. $$

3.4 Probabilistic Dendrograms

Every proximity curve gives rise uniquely to a fully developed dendrogram, as described in Sections 3.2 – 3.3. Not all conceivable proximity curves can be obtained by the present algorithm, however, as is illustrated by the running example. We describe here how those that are reachable can be combined into a probabilistic (or randomized) dendrogram. Such a dendrogram allows the user to reason probabilistically and to motivate a choice of starting point. Questions like “What is the probability that two given points belong to the same cluster?” can then be meaningfully addressed.

Choose a window of interest \(\mathcal {W} \subset \mathbb {R}^{2}\); it is chosen so that the entire data set fits into it and so that it provides a selection of possible starting points which are “sensible”, i.e. could have been observed given the context of the initial data. To any given position \(\mathbf {u} \in \mathcal {W}\), a possible dendrogram is assigned in the following manner.

To arrive at \(\mathbf {z} = \text {\textsc {SteepestDescent}}\left (\mathbf {u}, V \right )\), use Pseudocode E-1 from Appendix E.3. The point z will be the position of a local minimum of the potential V. Find the data point xι(u) ∈{x1,…,xN} which is closest to z. In practice, circumstances will be such that z is unique (as far as the numerical procedure allows), resulting in a unique choice of the closest point xι(u). The process thus described defines the following mapping:

$$ \begin{array}{c} \iota : \mathcal{W} \rightarrow \lbrace 1, \ldots, N \rbrace\\ \mathbf{u} \mapsto \iota(\mathbf{u}) \end{array} $$

Running Pseudocode E-2 (see Appendix E.3) with the initial vector z(start) = u will result in the proximity curve whose first node is labelled ι(u). The proximity curve thus found is unique in that no two different proximity curves have the same first node. The dendrogram based on this proximity curve will be denoted by \(\mathcal {D}_{\iota (\mathbf {u})}\).
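In terms of the sketches above, the mapping ι can be expressed as follows; this is again illustrative R code, not the pseudocode of Appendix E.3.

```r
# Sketch of the mapping iota: descend from u on the full potential V and
# return the label of the data point closest to the local minimum reached.
iota <- function(u, X, M) {
  z <- steepest_descent(u, X, M)          # z = SteepestDescent(u, V)
  which.min(colSums((t(X) - z)^2))        # index of the nearest data point
}
# proximity_curve(u, X, M) then yields the curve whose first node is iota(u),
# and hence the dendrogram D_{iota(u)}.
```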

Let u1,…,uℓ be a random sample of size ℓ from the uniform distribution on \(\mathcal {W}\). We define the dendrogram probability vector d by the following:

$$ \begin{array}{@{}rcl@{}} \mathbf{d} &=& \left( d_{1}, \ldots, d_{N} \right),\end{array} $$

whose kth component

$$ \begin{array}{@{}rcl@{}} d_{k} &=& \frac{1}{\ell} \sum\limits_{j=1}^{\ell} I_{\lbrace \iota(\mathbf{u}_{j}) = k \rbrace} \end{array} $$
(3)

is the relative frequency of the node k having been chosen as the first data point for the corresponding proximity curve, where

$$ \begin{array}{@{}rcl@{}} I_{B} (x) = \left\{\begin{array}{llll} 1 & \quad \text{if } x \in B\\ 0 & \quad \text{if } x \notin B \end{array}\right. \end{array} $$
(4)

is the indicator function of a given set B.
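A Monte Carlo sketch of Eq. 3 is given below, assuming a rectangular window \(\mathcal {W}\) specified by its corner coordinates; the rectangular specification is an illustrative assumption.

```r
# Estimate the dendrogram probability vector d by uniform sampling over a
# rectangular window W = [x_lo, x_hi] x [y_lo, y_hi] (sketch of Eq. 3).
dendrogram_probabilities <- function(X, M, x_lo, x_hi, y_lo, y_hi, ell = 1000) {
  first_nodes <- replicate(ell, {
    u <- c(runif(1, x_lo, x_hi), runif(1, y_lo, y_hi))   # u_j ~ Uniform(W)
    iota(u, X, M)
  })
  tabulate(first_nodes, nbins = nrow(X)) / ell           # d_k = relative frequency of node k
}
```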

The probability of two given nodes m1 and m2 being in the same cluster can now be evaluated by the following:

$$ \begin{array}{@{}rcl@{}} P\left( \text{nodes}\ m_{1}\ \text{and}\ m_{2}\ \text{are in the same cluster} \right) =\\ \sum\limits_{k=1}^{N} d_{k} P\left( \text{nodes}\ m_{1}\ \text{and}\ m_{2}\ \text{are in the same cluster}|\mathcal{D}_{k}\ \text{applies} \right). \end{array} $$
(5)

The conditional probabilities in Eq. 5 take values in {0,1}; they are obtained by inspecting dendrogram \(\mathcal {D}_{k}\). There are in total N(N − 1)/2 probabilities defined in Eq. 5.

Probabilities for i pairwise different nodes m1,…,mi being in the same cluster can be evaluated analogously; there are \(\binom {N}{i}\) such values.
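Equation 5 (and its i-node analogue) can be evaluated as sketched below, assuming a hypothetical helper in_same_cluster(k, nodes, depth) that inspects the deterministic dendrogram \(\mathcal {D}_{k}\) at the chosen depth and returns TRUE or FALSE.

```r
# Sketch of Eq. 5: a d-weighted sum of 0/1 indicators read off the
# deterministic dendrograms; in_same_cluster() is an assumed helper.
prob_same_cluster <- function(nodes, d, in_same_cluster, depth = Inf) {
  contrib <- sapply(seq_along(d), function(k) {
    if (d[k] == 0) return(0)                     # unreachable dendrograms contribute nothing
    d[k] * as.numeric(in_same_cluster(k, nodes, depth))
  })
  sum(contrib)
}
```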

Based on a sample of size ℓ = 1,000, the dendrogram probabilities were calculated by Eq. 3; they are shown in Table 3. Notice that the data points 6, 7, 9, 13, and 15 (and the corresponding dendrograms) are unreachable, which is a consequence of none of these points being located close enough to a valley (a local minimum) of V; more precisely, for each of these data points there is another data point closer to the candidate minimum location; see Fig. 2, the right-hand box.

Consider m1 = 16, m2 = 18 as an example. It can be seen by inspection that out of the 14 possible dendrograms (not shown here), each of the following 9 will assign these two nodes to the same cluster:

$$ \lbrace \mathcal{D}_{2},\mathcal{D}_{3},\mathcal{D}_{4},\mathcal{D}_{8},\mathcal{D}_{10},\mathcal{D}_{11},\mathcal{D}_{12},\mathcal{D}_{16},\mathcal{D}_{19} \rbrace. $$

Therefore,

$$ \begin{array}{@{}rcl@{}} &&P\left( \text{nodes}\ \#16\ \text{and}\ \#18\ \text{are in the same cluster} \right)\\ &=&d_{2} + d_{3} + d_{4} + d_{8} + d_{10} + d_{11} + d_{12} + d_{16} + d_{19} = 0.698 \end{array} $$

In the above calculations and example, only fully developed dendrograms were considered (requiring us to descend to depth 5). In general, the probability of any given set of nodes belonging to the same cluster can be calculated at any required level in a similar manner.

3.5 Implementation

A suite of programmes has been written to implement the algorithms. The functions in Appendix E.3 were implemented in R, since R has excellent programming and plotting facilities for data analytics. The pictorial and computational outputs of the R programmes are the contour plots and the proximity curve, the latter internally represented as a list of tuples which then serves as input to a recursive Haskell programme returning the dendrogram as a nested list. Haskell was also used to produce LaTeX code for drawing the dendrogram (as a tree), as exemplified in Fig. 5.

The associated code and data are available from

http://www.comp.brad.ac.uk/~dneagu/proximity

4 Experimental Work

In order to evaluate the accuracy achieved by the clustering method proposed, we use the Iris data set (Fisher 2011), which is commonly used in the literature. The Cauchy parameters are selected from the data set as detailed below. The Iris data set has the following fields:

  • Sepal width (S.W.)

  • Sepal length (S.L.)

  • Petal width (P.W.)

  • Petal length (P.L.)

In order to test how the algorithm behaves under different assignments of those fields to Cauchy distribution parameters, we varied which Iris data set field gets assigned to which parameter (μ1, μ2, c1, c2).

For each of the inputs, the mass is assumed to be 1, and the angle parameters are computed as follows:

$$ \alpha_{i} = 2 \pi \left( \frac{\mu_{i1} + \mu_{i2} + c_{i1} + c_{i2}}{\max(\boldsymbol{\mu}_{1}) + \max(\boldsymbol{\mu}_{2}) + \max(\mathbf{c}_{1}) + \max(\mathbf{c}_{2})}\right) $$

It should be noted that this is only one pragmatic assignment of αi; in principle, other functions of the parameters could be used.
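As a sketch, this assignment can be computed as follows, assuming the four Iris fields have already been mapped (in some chosen order) to vectors mu1, mu2, c1 and c2 of length N; the commented example assignment is hypothetical.

```r
# alpha_i = 2*pi * (mu_i1 + mu_i2 + c_i1 + c_i2) /
#           (max(mu_1) + max(mu_2) + max(c_1) + max(c_2))
alpha_from_fields <- function(mu1, mu2, c1, c2) {
  denom <- max(mu1) + max(mu2) + max(c1) + max(c2)
  2 * pi * (mu1 + mu2 + c1 + c2) / denom
}

# One hypothetical assignment, using the iris data shipped with R:
# alpha <- alpha_from_fields(iris$Sepal.Width, iris$Sepal.Length,
#                            iris$Petal.Width, iris$Petal.Length)
```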

In addition, since we know which points should be clustered together (i.e. those belonging to the same Iris species), we can evaluate the accuracy of a given clustering (a given depth of the dendrogram). Each cluster is assigned the class label of the majority of its points, and accuracy is measured as the proportion of data points belonging to a cluster labelled with their correct class. The full results, for all possible parameter assignments, are given in Appendix E.5.
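A sketch of this accuracy measure is given below, assuming a clustering represented as a list of integer vectors of point indices (one entry per cluster at the chosen dendrogram depth) and a vector of true class labels.

```r
# Majority-vote accuracy: label each cluster with its majority class and count
# the proportion of points that carry the correct label.
cluster_accuracy <- function(clusters, true_labels) {
  correct <- sum(sapply(clusters, function(idx) max(table(true_labels[idx]))))
  correct / length(true_labels)
}

# e.g. cluster_accuracy(list(1:50, 51:100, 101:150), iris$Species)  # perfect split -> 1
```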

For the sake of simplicity, we consider here the 4 parameter assignments shown in Table 1, along with the related accuracies (the experimental work covered the entire range of possible permutations; see Appendix E.5 for the full list of results). Accuracy at depth 1 is not shown, as that represents the root of the tree, i.e. the point before any clustering has been attempted. This information is visualised in Fig. 7. As evidenced by Table 1, the accuracy achieved is highly dependent on the parameter assignments chosen.

Table 1 Parameter assignments and accuracy measures

For example, the best overall result, as well as the fastest, is obtained for 𝜃3. That assignment gives an accuracy of 0.593 after the first split, which grows to 0.607 after one more iteration and reaches 0.860 at dendrogram depth 5. These results validate our theoretical assumption that data attributes translated into Cauchy parameters (which give rise to a particular proximity curve) influence and determine cluster definitions. The proposed algorithm allows further tuning in subsequent iterations, creating the opportunity to improve classification results further.

5 Discussion

The distinguishing feature of our algorithm is that the data points are kept stationary while the space is explored with a moving unit test charge (as in electrostatics) or test mass (as in gravity) to discover clusters. In doing so, the test charge will be attracted to groups of charges (or masses), depending on the force of attraction as measured by the gradient of the potential. As a data point is ‘discovered’, it is deactivated (i.e. removed) and the gradient of the updated potential field is used to guide us to discover the next data point. By successively removing data points, the current cluster gets depleted, resulting in a steady weakening of attraction to what remains of the cluster. Eventually, the current cluster gets exhausted, indicated by the small magnitude of the gradient of the potential. This then is the sign for starting a new cluster: the magnitude of the gradient of the total potential of the remaining active points begins to increase. The pattern of the proximity curve thus constructed holds the clue to the definition of the clustering.

The process described above suggests a nested structure of clusters inherent in hierarchical clustering.

Individual proximity curves are usually suitable for finding the clustering at the local level only, because the removal of points may eventually interfere noticeably with the global interplay of the individual point potentials. This effect of ‘point removal’, inherent in the technique, is intended to be counteracted by taking into account all possible (reachable) proximity curves in a probabilistic sense.

The dendrogram in Fig. 5, in conjunction with the set’s joint equipotentials in Fig. 2, shows that the clustering suggested is intuitively plausible. That point #19 is not grouped together with the points {16,17,18} is attributable to the fact that #19, even though physically close to the group, is separated from it by a ridge. Figure 3 shows a snapshot of the node deactivation process. It is seen that the algorithm duly observes both the mutual closeness of data points and the topography of the potential surface.

Finally, it is instructive to see that the north-eastern group of data points {14,12,15,13} is separated by the dendrogram from the south-western group {6,7,9} even though the locations of all these points are roughly on the same potential level, as shown in Fig. 6. This illustrates once more the point that the algorithm takes into account mutual geographic proximity of points as well as surface topography.

6 Conclusion and Outlook

The new concept of the proximity curve has been introduced and used for finding clusterings in finite sets of data points in Euclidean space endowed with individual potentials derived from a multivariate Cauchy distribution. The finite collection of all proximity curves gives rise to a probabilistic dendrogram. A synthetic example comprising 19 data points was used to illustrate the technique. The evaluation of the algorithm on the Iris data set case studies reported in Section 4 shows that this new approach, which considers the contributions of data features to the global field potential as a way to discover clusters, is promising and valuable.

Further research is planned to explore the technique for data sets in higher dimensions, for large data sets and for further real-life data sets, and to study the associated computational complexity. More comprehensive evaluations on real data are envisaged as future work, giving insight into the applicability of the proposed methodology. Another research direction could address how the technique generalises to the continuous case, where the data set is so large and densely packed that it is reasonable to model it as a continuum. The authors also see interesting research avenues in parameter tuning and in the extension to multi-dimensional spaces, which can be treated as multi-objective optimisation challenges.