1 Introduction

Understanding the particularities of geosystems and geoscience data is critical for obtaining accurate and physically consistent inferences and predictions (Reichstein et al. 2019). Due to technological advances in capturing geoscience data, the archives of input data are large and ever growing (Sellars 2018). Geoscience data are obtained from a variety of sources, including remote sensing (e.g. hyperspectral satellite images, airborne geophysical surveys and high-quality aerial photography via drones), in situ sensors in close proximity to the phenomenon under investigation (e.g. cameras on conveyor belts, multiple sensors in the flotation cells, and soil pH sensors), direct observations and sampling during field campaigns (e.g. soil geochemical samples and drilling data), historical records and simulation data generated from process-based models. Inconsistency and poor quality are therefore often unavoidable; For instance, noise and missing values, measurement errors and analytical errors often accompany geochemical data (Grunsky 2010).

Geoscience processes and attributes vary significantly through time and space. Such heterogeneity is related to the spatial and temporal variation of soil types, rock types, land uses, vegetation types, climatic conditions and tectonic activities. The heterogeneity and non-stationarity of geosystems and geoscience data must be accounted for during modelling of geoscience variables across all points in space and time (Chilès and Delfiner 2012). Geoscience attributes are spatially and/or temporally auto- and cross-correlated (Goovaerts, 1997; Webster and Oliver, 2007) or show even more complex statistical and spatial patterns (Mariethoz and Caers 2015); For instance, a geochemical sample that shows a low proportion of magnesium oxide (MgO) is generally surrounded by locations that have similar MgO proportions. This sample and surrounding locations potentially share similar geological characteristics, such as bedrock geology or surficial quaternary units.

Extracting information from high-dimensional and large datasets far exceeds any human’s abilities. Machine learning (ML) approaches are used to extract the hidden information from such datasets. However, ML algorithms may not be optimal when applied to geoscience data (Karpatne et al. 2019). The particularities of geosystems and geoscience data (e.g. big data, multi-source, multi-scale, high-dimensionality, poor quality data, limited sample size, paucity of ground-truth information, physics-based systems, importance of extreme cases, spatial and temporal heterogeneity, auto- and cross-correlations and complex uncertainty model) should be accounted for in the development of ML algorithms suitable for geoscience data. The most advanced ML algorithms are accurate when good-quality training data are abundant, but are seldom transparent, credible or interpretable. In particular, a lack of physical realism is problematic, as it makes ML algorithms potentially inaccurate in terms of extrapolations and less interpretable in terms of input parameters and predictor ranking (Reichstein et al. 2019). The majority of ML algorithms are based on the assumption of identically, independently distributed data. Such algorithms are not credible when applied to geoscience data (Schaeben et al. 2019). Fit-for-purpose ML algorithms should be able to capture dynamic (varying through time and space) multivariate spatial and/or temporal patterns of different scales and types.

Either current statistical learning algorithms can be amended to be consistent with the nature of geoscience data or new algorithms need to be developed; For instance, earth science data can be clustered to split the domain of study to account for the radically different behaviours of the natural phenomenon over the domain (useful for earth process discovery) and to simplify the subsequent modelling steps. However, to achieve this goal, a consistent clustering algorithm should be implemented. Non-spatial clustering techniques generally cluster observations based on their relationships in the feature space, so they do not have the means to consider auto- and cross-correlations of the regionalised variables. As a result of the lack of spatial awareness, the spatial coherence of the resulting clusters is not ensured (Fouedjio 2016a).

Incorporating coordinates as additional dimensions into the feature space and applying classical non-spatial clustering algorithms subsequently may lead to unsatisfactory results; For instance, two distal (in geographical space) points may belong to the same geological unit of interest or teleconnections in climate studies (Kawale et al. 2013). Spatial contiguity can also be enforced during the clustering process by imposing a proximity condition based on a graph organising the observations in the geographical space (Romary et al. 2015). Secchi et al. (2013) proposed a technique to cluster spatially dependent functional data using random Voronoi tessellations. In their proposed approach, the original data are replaced by some local representatives, and these local data are clustered subsequently. In addition, they achieved a model of uncertainty by implementing a bagging process. Another possibility is to apply the non-spatial clustering algorithms on a modified version of the dissimilarity matrix. Dissimilarities between observations are modified to take into account the spatial dependence (Oliver and Webster 1989; Bourgault et al. 1992; Fouedjio 2016a, b).

However, the aforementioned possibilities for improving the spatial awareness of a clustering algorithm are not suitable for recognising complex spatial patterns, objects and structures of different scales (which are not easily captured by two-point geostatistics) or their spatial distributions across the domain of study. Deep clustering technique such as those based on convolutional auto-encoders (Guo et al. 2017; Min et al. 2018) need a pre-processing step for multi-sensor imagery and vector data (e.g. interpolation and upscaling), and usually there is no sensitivity analysis on the effects of such pre-processing on the final results. The black-box nature of deep learning techniques makes the inference process somewhat difficult (Kuwajima et al. 2019). More importantly, the clustering of spatial mixed (e.g. continuous and categorical) and constrained (e.g. compositional) data has not been discussed.

The main objectives of this paper are to highlight the nature and characteristics of geoscience data and geosystems, the limitations of ML algorithms when it comes to spatially dependent data, and finally illustrate how classical ML algorithms can be amended to account for the particularities of spatial data through a case study aiming to improve the spatial awareness of spectral clustering.

Spectral clustering is a relatively recent clustering algorithm that relies on graph-theoretical concepts (Ng et al. 2001; von Luxburg 2007) but generally does not account for spatial relationships between observations. A new approach is introduced to generate a dissimilarity matrix computed from the distances between multivariate data events (e.g. geophysical, geochemical and geological spatial patterns) of different size and scales. Subsequently, spectral clustering will be used to cluster the input data based on the novel dissimilarity matrix. Incorporating existing domain knowledge (e.g. prior knowledge on the size and geometry of geophysical, geochemical and geological spatial patterns, and reliability and data abundance for any sources of information) into models of this type generates more physically realistic outputs.

The proposed technique will be illustrated through one synthetic and one mining case study where the primary geometallurgical attributes are spatially clustered to split the deposit into different domains. These domains will simplify the subsequent metallurgical sampling and modelling steps.

Section 2 presents the proposed methodology for physically realistic and spatially aware spectral clustering. Section 3 illustrates the implementation of the new technique using synthetic and real cases. Finally, some conclusions and final thoughts are presented in Sect. 4.

2 Methodology

Spectral clustering algorithms cluster the eigenvectors derived from a similarity matrix of input data. Normally, a kernel function, e.g. a radial basis function (RBF) kernel, is used to generate the \(n\times n\) similarity matrix (where \(n\) is the number of input data). The kernel function ignores spatial information such as heterogeneity, auto- and cross-correlations, as well as complex spatial objects and patterns; For instance, in the case of the RBF, the kernel function is based on the pairwise squared Euclidean distance between observations in the feature space. Subsequently, a normalised Laplacian is defined from the non-spatial similarity matrix. After defining the eigensystem of the Laplacian, a classical clustering algorithm such as k-means is applied on the top \(m\) eigenvectors to obtain \(m\) clusters. To improve the spatial awareness of the spectral clustering, the following algorithm is proposed:

A set of \(n\) regionalised multivariate data \(Y=\left\{{\varvec{y}}\left({u}_{i}\right)=[{y}^{1}\left({u}_{i}\right), {y}^{2}\left({u}_{i}\right),\dots ,{y}^{K}\left({u}_{i}\right)]; i=1,\dots ,n\right\}\) needs to be clustered into \(m\) subsets. The variables \({y}^{1}\left(\cdot \right), {y}^{2}\left(\cdot \right),\dots ,{y}^{K}\left(\cdot \right)\) can be a mixture of categorical and continuous spatial variables, and some of the continuous variables may be compositional. For ease of discussion, it is assumed that the first \(L, L\ge 0\), variables form a composition. For each location \({u}_{i}\), the data event for the kth variable is defined as \({d}_{{E}_{k}}({u}_{i})=\left\{{y}^{k}\left({u}_{ij}\right); j=1,\dots ,{E}_{k}\right\}\), which consists of the values of the kth variable, \(k=L+1, \dots ,K\), at all the \({E}_{k}\) nodes (including the node \({u}_{i}\)) in the neighbourhood. If a subset of the continuous variables is compositional, \({y}^{1}\left(\cdot \right),\dots \), \({y}^{L}\left(\cdot \right)\), then the data event for these variables is multivariate and defined as \({d}_{{E}_{Z}}({u}_{i})=\left\{{[y}^{1}\left({u}_{ij}\right),\dots {y}^{L}\left({u}_{ij}\right)]; j=1,\dots ,{E}_{Z}\right\}\). The overall data event at \({u}_{i}\) is then given by \({{\varvec{d}}}_{E}\left({u}_{i}\right)=\left[{d}_{{E}_{Z}}\left({u}_{i}\right), {d}_{{E}_{L+1}}\left({u}_{i}\right),\dots , {d}_{{E}_{K}}({u}_{i})\right].\) The parameters \({E}_{Z} \mathrm{a}\mathrm{n}\mathrm{d}\) \({E}_{k}, k=L+1,\dots , K\) control the order of spatial statistics (the size of the patterns) relevant for the composition and for the kth variable. Selecting \({E}_{k}={E}_{Z}=1\) changes the proposed spatial algorithm to the classical non-spatial clustering. The order of spatial statistics and the geometry of the spatial pattern can be different for each source of information (e.g. mineralogy and geochemical compositions, rock type, alteration code, porosity, permeability and resistivity). Figure 1 shows examples of data event geometry useful for measuring pattern similarity at small scale, large scale, a combination of all scales, oriented, and three-dimensional, respectively. It is the responsibility of the expert users to define the most relevant geometry for the spatial patterns based on their prior knowledge.

Fig. 1
figure 1

Examples of data event geometry useful for measuring pattern similarity at a small scale (\(E=9\)), b large scale (\(E=9\)), c a combination of all scales (\(E=25\)), d oriented (\(E=11\)), and e three-dimensional (\(E=75\))

To define a distance between two multivariate data events, the nature of the variables must be considered. If a subset Z composed of the first L variables is compositional, then the distance is given by

$$ \begin{aligned} &d_{{{\text{composition}}}}^{Z} \left\{ {d_{{E_{Z} }} \left( {u_{i} } \right),d_{{E_{Z} }} \left( {u_{j} } \right)} \right\} \\ & \quad = \frac{1}{{E_{Z} d_{{{\max}}}^{{{Z}}} }}\mathop \sum \limits_{e = 1}^{{E_{Z} }} \sqrt {\frac{1}{2L}\mathop \sum \limits_{r = 1}^{L} \mathop \sum \limits_{q = 1}^{L} \left( {ln\frac{{y^{r} \left( {u_{{{{ie}}}} } \right)}}{{y^{q} \left( {u_{{{{ie}}}} } \right)}} - ln\frac{{y^{r} \left( {u_{{{{je}}}} } \right)}}{{y^{q} \left( {u_{{{{je}}}} } \right)}}} \right)^{2} } ,\end{aligned}$$
(1)

where dZmax is the compositional range, i.e. the longest Aitchison distance between compositional observations (Aitchison 1982), for the index set \(Z\). If the kth variable, \(k\ge L+1\), is categorical, then the distance is defined as

$$ d_{{{\text{categorical}}}}^{{{k}}} \left\{ {d_{{E_{k} }} \left( {u_{i} } \right),d_{{E_{k} }} \left( {u_{j} } \right)} \right\} = \frac{1}{{E_{k} }}\mathop \sum \limits_{e = 1}^{{E_{k} }} \left( {1 - \delta \left( {y^{k} \left( {u_{{{{ie}}}} } \right),y^{k} \left( {u_{{{{je}}}} } \right)} \right)} \right){ }, $$
(2)

where \(\delta \left(x,y\right)=1\) if \(x=y\) and \(0\) otherwise. If the kth variable is continuous, then the distance is defined as

$$ d_{{{\text{continuous}}}}^{{{k}}} \left\{ {d_{{E_{k} }} \left( {u_{i} } \right),d_{{E_{k} }} \left( {u_{j} } \right)} \right\} = \frac{1}{{E_{k} d_{{\max}}^{{{k}}} }}\mathop \sum \limits_{e = 1}^{{E_{k} }} \sqrt {\left[ {y^{k} \left( {u_{{{{ie}}}} } \right) - y^{k} \left( {u_{{{{je}}}} } \right)} \right]^{2} } , $$
(3)

where the normalising factor \({d}_{\text{max}}^{k}\) is the range of the kth variable.

The total distance between two spatial mixed data events is the convex combination of the individual distances

$$ d_{{{{ij}}}} \left\{ {{\varvec{d}}_{E} \left( {u_{i} } \right),{\varvec{d}}_{E} \left( {u_{j} } \right)} \right\} = \mu_{Z} d_{{{\text{composition}}}}^{{{Z}}} + \mathop \sum \limits_{k = L + 1}^{K} \mu_{k} d^{k} , $$
(4)

where \({\mu }_{Z}+\sum_{k=L+1}^{K}{\mu }_{k}=1; {{\mu }_{Z}, \mu }_{k}\ge 0\), and

$$ d^{k} = \left\{ {\begin{array}{*{20}c} {{{d}}_{{{\text{categorical}}}}^{{{k}}} {\text{ if the }k\text{th variable is categorical}}} \\ {{{d}}_{{{\text{continuous}}}}^{{{k}}} {\text{ if the }k\text{th variable is continuous}}} \\ \end{array} } \right.. $$

By construction, \({d}_{ij}\in \left[\mathrm{0,1}\right]\). The function \({d}_{ij}\) satisfies the following conditions: \({d}_{ij}\ge 0\), \({d}_{ij}=0\) if and only if \(i=j\), \({d}_{ij}={d}_{ji}\), and \({d}_{ij}\le {d}_{ik}+{d}_{kj}.\) The weights \({\mu }_{Z}\) and \({\mu }_{k}\) account for the multiple-point spatial dependence between mixed variables. It is highly recommended that the geometries of the spatial patterns and dependency weights be defined by geoscientists based on their expert domain knowledge.

The radial basis function is used to define a symmetric similarity matrix \({\mathrm{A}}_{n\times n}\) from the spatially aware distance matrix as

$$ {{A}}_{{{{ij}}}} = \left( {1 - \delta \left( {i,j} \right)} \right)\exp \left( {\frac{{ - d_{ij}^{{2}} }}{{2\sigma^{2} }}} \right). $$
(5)

The scaling parameter \(\sigma \) controls the deceleration of the similarity \({\mathrm{A}}_{ij}\) with distance \({\mathrm{d}}_{ij}\), and several approaches have been introduced for choosing it automatically (Ng et al. 2001; Zelnik-Manor and Perona 2004). In this study, to select a practical value for \(\sigma \), a distribution of the \(n\) closest distances is generated from the distance matrix (minimum for each row, excluding diagonal values), and the 99th percentile is selected as \(\sigma \). The symmetric normalised Laplacian is subsequently defined as

$$ L = D^{ - 1/2} AD^{ - 1/2} { }, $$
(6)

where \(D\) is the diagonal matrix whose ith element is the sum of the entries in the ith row of A. The Laplacian \(L\) is symmetric, and its eigenvectors can be chosen to be pairwise orthogonal. In spectral clustering implementation, the eigengap heuristic is often used to find the number of clusters \(m\). The \(m\) largest eigenvalues represent subregions of the study area that share similar multivariate spatial patterns. The eigengap \({\updelta }_{m}=\left|{\lambda }_{m+1}-{\lambda }_{m}\right|\) is the absolute difference between the \({(m+1)}^{\rm th}\) and \({m}^{\rm th}\) largest eigenvalues of \(L\). The eigenvectors \({x}_{1},\dots ,{x}_{m}\)(counted according to their multiplicity) corresponding to the m largest eigenvalues are used to form the \(n\times m\) matrix \(X=[ {x}_{1}\dots {x}_{m}]\). Based on matrix perturbation theory, the subspace spanned by \(\mathrm{X}\) is stable if \({\updelta }_{m}\) is sufficiently large. The rows of \(\mathrm{X}\) are normalised to unit Euclidean length. The resulting points on the \(m\)-dimensional unit sphere can then be clustered via a distortion minimisation technique such as K-means clustering. Finally, the original observation \({\varvec{y}}\left({u}_{i}\right)\) is assigned to a cluster \(s\in (\mathrm{1,2},\cdots ,m)\) if and only if the ith point was assigned to cluster \(s\). Theoretical considerations on spectral clustering can be found in Ng et al. (2001) and von Luxburg (2007). There is no assumption on the underlying model for the proposed clustering algorithm, thus a large variety of different indices (Charrad et al. 2014) can be used to pick the number of clusters and optimise the input parameters including the geometry of data events, \({E}_{Z}\), \({E}_{k}\), \({\mu }_{Z}\), and \({\mu }_{k}\); For instance, the silhouette metric (ranging from −1 to +1) that measures the similarity of an object to its own cluster and dissimilarity from the other clusters can be used to tune the input parameters. The input parameters for the proposed technique can also be selected by experts in a knowledge-driven way. In this study, this approach is implemented. The input parameters enable experts to use their existing domain knowledge. The resultant clusters are highly dependent on the experts’ knowledge and decisions. Each geological structure determines the choice of geometry for the corresponding spatial pattern in a multivariate data event; For instance, a geological structure such as a weathering profile may be captured using a spatial pattern with a greater vertical than horizontal extent. Similar decisions need to be made for alteration zones and intensities, faults and folds, and geochronological order. Reliability and data abundance for any sources of information can also be handled via different weighting configurations. Such flexibility in terms of the physical realism of the input parameters leads to an interpretable and reliable unsupervised model.

It should be noted that some pre-processing might be required. If the input data do not lie on a regular grid, they should first be migrated to the closest nodes of a regular grid of suitable resolution (some inputs such as satellite images are already regularly spaced) or be rasterised using geostatistical simulation techniques. Similarly, missing values and empty cells should be imputed first, and multiple-point simulations are recommended for this purpose. Moreover, dissimilarity measures for the data located at the margins of the study area are based on a lower order of spatial statistics (part of the data event is located outside the known region). For the cases where full pattern matching is needed, the margins can be dropped from calculations.

3 Experiments

3.1 Synthetic Case Study

The performance of the proposed method was tested using a synthetic case study (Fig. 2). Figure 2a shows a geological cross-section. Three geological units, two types of sedimentary transition and two types of fault transition (rock type #1 to #2 and rock type #2 to #3) are indicated in this figure. Three samples from a uniform distribution (with different centres) were combined to define the input variable (Fig. 2b). Classical spectral clustering (\({E}_{k}=1\), model M1) was implemented to assess the gain in the implementation of a spatially and physically aware learner. The geometry in Fig. 1c (\({E}_{k}=25\)) was selected to build the spatially aware distance matrix (model M2). Figure 2c and d show the large eigenvalues of the symmetric normalised Laplacians for models M1 and M2, respectively. The eigengap heuristic suggests three clusters (jump after the third largest eigenvalue) for the M1 and 11 for the M2 model. Comparing Fig. 2e and f reveals the superior performance (larger silhouette value) of the proposed algorithm in terms of generating tight clusters. There is a consistency between the eigengap heuristic and the silhouette metric in this case; For instance, both of these methods suggest 11 clusters for the M2 model. The final clustered maps are shown in Fig. 2g, h. The proposed method recognised the three geological units and the four types of transitions. In addition, M2 recognised the different sides of transitions, which is useful for some geological phenomenon such as uranium roll-front or skarn mineralisation. The proposed method is also robust in terms of removing unstructured noise and improving the spatial coherency of the final clusters (Fig. 2g, h).

Fig. 2
figure 2

a Cross-section of the input data, b histogram of the input data, large eigenvalues of the Laplacian matrix for model c M1 and d M2, optimum number of clusters (based on silhouette metric) for model e M1 and f M2, and final clustered map for model g M1 and h M2

3.2 Mining Data

Murrin Murrin East (MME) is a nickel–cobalt laterite deposit located in Western Australia. At MME, nickel laterite deposits occur as laterally extensive, undulating blankets of mineralisation with strong vertical trends covering basement ultramafic rocks (Murphy 2003). In total, 17,377 samples (of length 1 m) from 920 regularly spaced RC holes (25 m × 25 m) make up the database for this study (Fig. 3). Variables of interest consist of one categorical variable, viz. rock type: ferruginous (FZ), smectite (SM), saprolite (SM) and ultramafic (UM), and geochemical compositions: three major (Fe, Al and Mg) and two target (Ni and Co) elements, plus Rest to achieve closure in the compositional data.

Fig. 3
figure 3

Three-dimensional representation of the input data, ae continuous variables of interest, and g categorical variable (rock type)

The geometry of the three-dimensional data event in Fig. 1e was selected to measure the distances. In this study, the same parameters were used for geological information and compositional geochemistry (\({E}_{k}={E}_{Z}=75\)). Weights \({\mu }_{Z}=0.7\) and \({\mu }_{7}=0.3\) were defined to give 70% importance to compositional information and 30% to rock types. Equation 1 was used to measure the distance between spatial patterns of geochemical compositions, and Eq. (2) to measure the distance between spatial patterns of lithological units. Classical spectral clustering (\({E}_{k}={E}_{Z}=1\), \({\mu }_{Z}=0.7\) and \({\mu }_{7}=0.3\)) was also implemented to assess the gain in the implementation of the spatial spectral clustering for geometallurgical domaining.

The eigengap heuristic suggested two (with a major jump after the second largest eigenvalue) while the average silhouette index showed seven clusters as the best option (Fig. 4a, b). The first two eigenvalues are related to the dominant geological units, FZ and SA (Fig. 3f). To provide more detail, boreholes were clustered into seven spatially homogeneous regions. Figure 5a and b show the spatial distribution of the final seven clusters for the non-spatial and spatial spectral clustering, respectively. The non-spatial clustering generated scattered domains (less spatial continuity), while the spatial clustering recognised domains with complex structure; For instance, cluster #1 shows complex curvilinear structure that is not easy to capture by two-point geostatistics.

Fig. 4
figure 4

Selecting the number of clusters using the a eigengap heuristic and b average silhouette

Fig. 5
figure 5

Spatial distribution of the clustered data: a non-spatial clustering (\({E}_{k}={E}_{Z}=1\)) and b spatial clustering (\({E}_{k}={E}_{Z}=75\))

Compositional characteristics such as closed geometric means, total compositional variances and rock proportions (Pawlowsky-Glahn et al. 2015) of the generated spatial clusters are presented in Table 1. Cluster #6 is rich in Fe, depleted in Mg and mainly composed of ferruginous rocks. Cluster #5 shows high levels of Co and Ni mineralisation and mainly consists of smectite units. Fresh ultramafic rocks are mainly located in cluster #3. This cluster shows the lowest level of global compositional dispersion, which might be related to the fact that weathering has not reached to this depth. The spatial distribution and compositional characteristics of the domains generated by the spatial spectral clustering are consistent with the current geological understanding of this deposit (Talebi et al. 2017, 2019).

Table 1 Statistical characteristics of each spatially clustered region

4 Conclusions

Traditional statistical learning techniques do not fully account for the particularities of geosystems and geoscience data. These techniques must thus be amended to account for the characteristics of geoscience data, or new spatial learners must be developed. A novel methodology is proposed herein to improve the physical realism and spatial awareness of spectral clustering. This spatial spectral clustering approach allows the use of existing domain knowledge to select the input parameters (e.g. geometry of the patterns, order of spatial statistics, and weights for the convex combination of distance matrices). Clusters generated via the proposed algorithm are homogeneous (similar multivariate spatial patterns) up to a selected order of spatial statistics, thus presenting a richer view of stationarity than variogram-based or non-spatial clustering techniques. Many attributes of different natures (continuous, categorical and constrained) can be used as inputs to the proposed clustering technique. The results from the synthetic and mining datasets proved the usefulness of the technique. The proposed method will be developed further for clustering of multisensor imagery and vector data simultaneously.