Coal elemental (compositional) data analysis with hierarchical clustering algorithms

https://doi.org/10.1016/j.coal.2021.103892

Abstract

The modes of occurrence of elements in coal are extremely important for deciphering the geological processes of coal formation and for anticipating the technological behavior and the environmental and health impacts of coal utilization. Hierarchical clustering has been widely adopted to investigate the modes of occurrence of elements in coal. Traditional statistics (e.g., Pearson correlation, Euclidean distance) applied to coal elemental data may lead to misinterpretation because such data are compositional in nature and follow the rules of Aitchison geometry. This work applied log-ratio transformations to overcome this problem. Different hierarchical clustering algorithms with various data transformations can infer modes of occurrence for coal elements, but which algorithm is optimum deserves investigation. In this paper, we discuss four commonly used hierarchical clustering algorithms utilizing pivot coordinates and weighted symmetric pivot coordinates (WSPC), two types of log-ratio transformations, to infer modes of occurrence of elements in coal, based on published coal elemental data. Results showed that the Pearson correlation produces more meaningful results than the Euclidean distance in clustering rare earth elements and Y (REY). WSPC produces more interpretable results than pivot coordinates for these coal elemental data. Compared with the single-, complete-, and centroid-linkage algorithms, the average-linkage algorithm is the optimum.

Introduction

Coal is an important resource in many countries around the world, particularly developing countries such as China, India, and Turkey. Coal is composed primarily of organic matter with up to 50 wt% inorganic components, and the latter is usually referred to as mineral matter (Ward, 2002, 2016). Geochemically, mineral matter in coal mainly consists of non-mineral elements (i.e., elements bound by organic matter, adsorbed on the surfaces of organics, and dissolved in pore waters) and elements hosted in minerals (Dai et al., 2020b, 2021; Finkelman et al., 2019; Ward, 2002, 2016). The environmental and health impacts of potentially toxic elements in coal are determined not only by their concentrations but also by their modes of occurrence (Dai et al., 2020b, 2021; Finkelman and Greb, 2008). Accurate determination of the modes of occurrence of elements in coal is important for deciphering the geological processes of coal formation and for anticipating the technological behavior, economic by-product potential, and environmental and health impacts of coal utilization.

A number of physical and chemical methods have been used to determine the modes of occurrence of elements in coal (Dai et al., 2020a, 2021; Finkelman et al., 2019), including optical microscopy, X-ray diffraction analysis (XRD), scanning electron microscopy equipped with energy dispersive X-ray spectroscopy (SEM-EDS), X-ray fluorescence spectrometry (XRF), sensitive high-resolution ion microprobe (SHRIMP), transmission electron microscopy (TEM), electron microprobe analyzer (EMPA), and laser ablation inductively coupled plasma mass spectrometry (LA ICP-MS). In addition, indirect methods, e.g., density separations and selective leaching procedures, have also been used (Dai et al., 2020a, 2021; Finkelman et al., 2019). Furthermore, several statistical methods have been widely adopted to investigate the modes of occurrence of elements in coal (e.g., Eskanazy et al., 2010; Geboy et al., 2013; Liu et al., 2019; Ward, 2002), although this approach remains controversial. For example, Drew et al. (2008) and Geboy et al. (2013) noted that correlations between elements differ depending on whether coal geochemical data are reported on a whole-coal or an ash basis. Both studies reported that the root cause of the difference between bases is mathematical (i.e., subcompositional incoherence) and can potentially be solved using statistical methods based on compositional data analysis. Eskanazy et al. (2010) pointed out some potential problems in using statistical analysis to determine the modes of occurrence of elements in coal and cautioned that geochemical principles must be carefully considered. Dai et al. (2020b, 2021) reviewed statistical analyses used in coal geochemistry, such as principal component analysis, cluster analysis, and correlation analysis, and pointed out that statistical analysis does not always correctly decipher the modes of occurrence of elements in coal.
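The basis dependence noted by Drew et al. (2008) and Geboy et al. (2013) can be illustrated with a small numerical sketch. The concentrations and ash yields below are hypothetical values chosen only for illustration, and `pearson_r` is a helper written for this example. Dividing whole-coal concentrations by the ash fraction changes the Pearson correlation between two elements, while their log-ratio is unchanged by the change of basis.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical whole-coal concentrations (ug/g) of two elements in five samples.
whole_a = [10.0, 12.0, 14.0, 16.0, 18.0]
whole_b = [9.0, 8.0, 7.0, 6.0, 5.0]
# Hypothetical ash yields of the same samples, as mass fractions.
ash_yield = [0.05, 0.10, 0.20, 0.30, 0.50]

# Ash-basis concentration = whole-coal concentration / ash fraction.
ash_a = [c / f for c, f in zip(whole_a, ash_yield)]
ash_b = [c / f for c, f in zip(whole_b, ash_yield)]

r_whole = pearson_r(whole_a, whole_b)
r_ash = pearson_r(ash_a, ash_b)
print("r (whole-coal basis):", round(r_whole, 3))
print("r (ash basis):       ", round(r_ash, 3))

# The log-ratio of the two elements does not depend on the basis,
# because the ash fraction cancels in the ratio.
logratio_whole = [math.log(a / b) for a, b in zip(whole_a, whole_b)]
logratio_ash = [math.log(a / b) for a, b in zip(ash_a, ash_b)]
```

In this constructed case the correlation is negative on a whole-coal basis and positive on an ash basis, which is the kind of inconsistency that log-ratio methods are intended to avoid.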

Among all the statistical methods and clustering algorithms, hierarchical clustering is commonly used because it can represent the degree of affinity of the elements in coal with each other, but does not require relationships to be linear (Dai et al., 2008, Dai et al., 2012c; Dai et al., 2012a; Templ et al., 2008; Jain et al., 1999; Xu et al., 2020; Xu and Wunsch, 2005; Zhou and Jia, 2000). Specifically, hierarchical clustering aims to divide elements in coal into different clusters using different algorithms, based on the affinity of elemental data. Hierarchical clustering algorithms usually include single-, complete-, centroid-, and average-linkage (Jain et al., 1999; Xu et al., 2020; Xu and Wunsch, 2005). Common dissimilarity measures include Pearson correlation and Euclidean distance (Jain et al., 1999; Xu et al., 2020; Xu and Wunsch, 2005). Of these algorithms, centroid-linkage is usually paired with Euclidean distance (Jain et al., 1999; Xu et al., 2020; Xu and Wunsch, 2005). In most of the literature on hierarchical clustering of coal elemental data, one specific algorithm is chosen to deduce the modes of occurrence. For example, Eskanazy et al. (2010) used Euclidean distance and a centroid-linkage hierarchical clustering algorithm to infer elemental modes of occurrence from a suite of 75 samples from a Bulgarian lignite deposit. Dai et al. (2012c) analyzed 33 coal samples from the Adaohai coal mine in the Daqingshan Coalfield, Inner Mongolia, China, using Pearson correlation dissimilarity with a centroid-linkage hierarchical clustering algorithm, and obtained insights into the geological processes that affected the modes of occurrence of elements in coal.

Elemental coal data are compositional (Geboy et al., 2013). Compositional data have been defined historically as random vectors with strictly positive components whose sum may be constant, though the latter is not a strict requirement. Compositional data exist in a hyperplane of real space (known as the Simplex) but do not follow the rules of Euclidean geometry, meaning that the geometric properties used in conventional statistics (e.g., distance and correlation) may provide incorrect or spurious results if applied to these data (Aitchison, 1986). Rather, the data follow the rules of Aitchison geometry, for which a series of compositional data analysis methods has been developed (Egozcue et al., 2003; Xu et al., 2020). In particular, log-ratio transformations have been proposed either to transform compositional data into real space or to allow the data to be handled correctly within the Simplex. These log-ratio transformations include the additive log-ratio transformation (alr) (Aitchison, 1986), the centered log-ratio transformation (clr) (Aitchison, 1986), and the orthonormal log-ratio transformation (olr) (Egozcue et al., 2003; Fišerová and Hron, 2011). The performance of compositional data transformations has been evaluated for many years, especially through mathematical analysis (Aitchison, 1986; Aitchison et al., 2000; Egozcue et al., 2003). However, there are notable differences among the transformed data, and care must be taken in how log-ratio transformed data are utilized. For example, olr-transformed compositional data exhibit orthonormal properties while alr- and clr-transformed data do not.
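As a minimal sketch of the two simpler transformations named above, the following code computes clr and alr coordinates for a single composition. The function names and the three-part sample are illustrative assumptions, not taken from the paper. The clr coordinates always sum to zero, which is the source of the singularity that olr coordinates avoid.

```python
import math

def clr(x):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def alr(x):
    """Additive log-ratio, using the last part as the common divisor."""
    return [math.log(v / x[-1]) for v in x[:-1]]

# Hypothetical three-part composition (e.g., percentages summing to 100).
sample = [10.0, 30.0, 60.0]

print("clr:", clr(sample))  # three coordinates that sum to zero
print("alr:", alr(sample))  # two coordinates, relative to the last part
```

Both transformations depend only on ratios between parts, so rescaling the sample (for example from percentages to fractions) leaves the coordinates unchanged.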

Another approach for constructing orthonormal coordinates is pivot coordinates, which are built so that one coordinate isolates the compositional part of interest. That coordinate (for example, the first) contains all the relevant information about the part through its pairwise log-ratios to the other parts of the composition. Hron et al. (2017) constructed weighted pivot coordinates that treat the redundant information in a controlled manner. Kynčlová et al. (2017) proposed symmetric pivot coordinates to measure the strength of association of compositional parts through the correlation coefficient of a particular choice of orthonormal coordinates. Hron et al. (2021) proposed weighted symmetric pivot coordinates (WSPC), in which variables with large log-ratio variances are down-weighted to suppress their effects on the remaining variables. Compared with weighted pivot coordinates and symmetric pivot coordinates, WSPC focus on pairwise associations.

From the hierarchical clustering algorithms and data transformations discussed above, it follows that different hierarchical clustering algorithms can yield different modes of occurrence for coal elements. This paper evaluates the performance of hierarchical clustering algorithms with pivot coordinates and WSPC (Egozcue et al., 2003; Drew et al., 2008; Mateu-Figueras et al., 2011; Hron et al., 2021) for coal elemental data and their associations, so as to determine which approach is optimum among the different clustering techniques for the dataset being investigated.

Section snippets

Orthonormal log-ratio (olr) and Pivot Coordinates

The olr coordinates, previously referred to as isometric log-ratio coordinates, map the data from Simplex space xi ∈ S to Euclidean space yi ∈ R. The olr does this by building an orthonormal basis in the hyperplane and has the advantage of avoiding the singularity that occurs with clr coefficients (Filzmoser et al., 2009). One particular choice of basis leads to pivot (logratio) coordinates (Filzmoser et al., 2018; Fišerová and Hron, 2011), which are defined as: yi=pivotcoordinatexi=n
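Because the defining equation is cut off in this excerpt, the sketch below assumes the standard pivot coordinate formula from the compositional data literature (e.g., Fišerová and Hron, 2011): for a D-part composition, the i-th coordinate is sqrt((D-i)/(D-i+1)) times the log of part i over the geometric mean of parts i+1 to D. The function name `pivot_coordinates` is illustrative, and the paper's own notation may differ.

```python
import math

def pivot_coordinates(x):
    """Pivot (olr) coordinates of one strictly positive composition.

    Assumed standard form, for i = 1, ..., D-1:
        z_i = sqrt((D-i)/(D-i+1)) * ln( x_i / gmean(x_{i+1}, ..., x_D) )
    """
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(x) - 1):
        rest = logs[i + 1:]            # logs of the parts after part i
        k = len(rest)                  # equals D - i in 1-based indexing
        scale = math.sqrt(k / (k + 1))
        coords.append(scale * (logs[i] - sum(rest) / k))
    return coords

# Hypothetical three-part composition.
z = pivot_coordinates([1.0, 2.0, 4.0])
print(z)  # D - 1 = 2 coordinates; the first isolates the first part
```

To obtain a coordinate that isolates a different part, the composition can be reordered so that the part of interest comes first before calling the function.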

Dissimilarity for the affinity of coal elemental data

Coal elemental concentration data can be represented as a vector x(i) = [x1(i), x2(i), …, xm(i)]T, i = 1…n, where m is the number of samples and n is the number of elements. The dissimilarity between the concentrations of element x(i) and element x(j) is denoted as D(x(i), x(j)). Pearson correlation and Euclidean distance are widely used to measure the dissimilarity between two elements.
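The two measures can be sketched as follows. The excerpt does not state how the Pearson correlation is converted into a dissimilarity, so the common form D = 1 - r is assumed here. The function names and the concentration vectors are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_dissimilarity(a, b):
    """Assumed form D = 1 - r: 0 for perfect positive correlation, 2 for perfect negative."""
    return 1.0 - pearson_r(a, b)

def euclidean_distance(a, b):
    """Straight-line distance between two concentration vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# Hypothetical concentrations of two elements across four samples (m = 4).
# The second element is ten times more abundant but varies in step with the first.
elem_i = [1.0, 2.0, 3.0, 4.0]
elem_j = [10.0, 20.0, 30.0, 40.0]

print(pearson_dissimilarity(elem_i, elem_j))  # near zero: same pattern
print(euclidean_distance(elem_i, elem_j))     # large: different magnitudes
```

The example shows why the choice matters: the correlation-based measure groups elements that vary together regardless of abundance, whereas Euclidean distance is dominated by differences in magnitude.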

Different hierarchical clustering algorithms

Hierarchical clustering algorithms usually include single-, complete-, centroid-, and average-linkage.

  • (1) Single-linkage
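The linkage rules differ only in how the dissimilarity between two clusters is derived from the pairwise dissimilarities of their members. The sketch below is a naive illustration written for this excerpt, not the authors' implementation. It covers single, complete, and average linkage. Centroid linkage is omitted because it needs the coordinates of the points and not only their pairwise dissimilarities. The function name `agglomerate` and the four-point example are hypothetical.

```python
from itertools import combinations

# Each rule reduces the list of between-cluster pairwise dissimilarities to one value.
LINKAGES = {
    "single": min,                               # closest pair of members
    "complete": max,                             # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),     # mean over all pairs
}

def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix.

    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where clusters are tuples of item indices.
    """
    combine = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            height = combine([dist[i][j] for i in a for j in b])
            if best is None or height < best[2]:
                best = (a, b, height)
        a, b, height = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

# Hypothetical example: four items placed at 0, 1, 3, and 7 on a line.
positions = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in positions] for p in positions]

for name in LINKAGES:
    heights = [h for _, _, h in agglomerate(dist, name)]
    print(name, heights)
```

The merge order is the same for all three rules in this small example, but the merge heights differ, and on real data such differences can change which elements end up in the same cluster.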

Background information of the coal datasets

To interpret the results of the different algorithms with different data transformations, the data used in this study are from late Paleozoic coals (i.e., CP2 coal) from the Adaohai and Datanhao mines in the Daqingshan Coalfield, Inner Mongolia, northern China (Fig. 1A). The Daqingshan Coalfield contains 16 mines (Fig. 1B). During the period of peat deposition, the Daqingshan Coalfield was close to the sediment source region, i.e., Yinshan Upland (Dai et al., 2012c, Dai et al., 2015). Due to

Conclusion

In this paper, we have conducted extensive experiments applying hierarchical clustering algorithms to real datasets collected from the Adaohai and Datanhao mines. Based on these studies, the main conclusions are as follows:

  • (1) The Pearson correlation is much better than the Euclidean distance in clustering REY. The correlation was applied to the raw data and to data transformed with pivot coordinates and WSPC.

  • (2) In general, for the hierarchical clustering results with correlation, pivot coordinates

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (No. 61772320) and 111 Projects (No. B17042). Thanks are given to the anonymous reviewers for their careful reviews and detailed comments.
