Cluster analysis of crude oils with k-means based on their physicochemical properties
Introduction
Crude oils are mixtures of organic components, mainly hydrocarbons, that also contain metals, sulfur, nitrogen and oxygen (Silva et al., 2011). Their composition can vary significantly depending on the geographic source, so their properties span a wide range of values. Components such as sulfur, nitrogen, oxygen and heavy metals, although present in small quantities, play a major role in crude oil quality and significantly affect the refining process. Refineries convert crude oils into a wide range of products. Lighter crude oils tend to yield products with better properties but are generally more expensive; heavier crude oils tend to be less expensive but contain more impurities and are harder to process. Depending on its sulfur content, a crude oil can be classified as “sweet” or “sour”, with the latter requiring more intensive treatment in the refining process. Typically, crude oil categorization is based only roughly on density and sulfur content, which is insufficient for planning the subsequent blending and refining operations.
With the increasing amount of data available in the industry, the use of data science and machine learning techniques to extract additional knowledge is becoming more popular (Hassani and Silva, 2018). In the petrochemical industry, supervised learning regression methods have been combined with advanced characterization techniques such as near-infrared spectroscopy (NIR) (Falla et al., 2006), nuclear magnetic resonance spectroscopy (NMR) (Masili et al., 2012) and gas chromatography-mass spectrometry (GC-MS) (El Nady et al., 2014) to successfully determine physicochemical properties of crude oils.
Classification methods have also been applied to crude oils. Linear discriminant analysis was used on GC-MS data (Sun et al., 2018) and physicochemical data (Vieira et al., 2016) to determine the crude oil type. Other common classification methods, such as partial least squares discriminant analysis, were also applied to NIR (Galtier et al., 2011) and Fourier-transform infrared spectroscopy (FTIR) (Abbas et al., 2006) data to identify the source of crude oil samples. Furthermore, non-linear methods such as Artificial Neural Networks (ANN) have been successfully developed to determine the geographical origin of bitumen from NIR data (Blanco et al., 2001).
On the other hand, unsupervised learning methods have the power to explore the natural structure of data, using the similarities between observations and/or variables. One of the most important applications of unsupervised learning methods in the petrochemical industry is the determination of the geographical source for crude oils. Zhan (2019) and Liu et al. (2017) successfully applied hierarchical clustering to geochemical biomarkers to identify crude oil samples from different geographic locations. Fernández-Varela (2010) combined different techniques such as Principal Component Analysis (PCA), hierarchical clustering and feature selection to find a reduced subset of diagnostic ratios derived from GC-MS that were sufficient to characterize the crude oils and the groups formed, reducing the number of features from 28 to 4.
Advanced characterization techniques such as gas chromatography (Teixeira, 2014; Zhang et al., 2015; Hashemi-Nasab and Parastar, 2020) and different types of spectroscopy (Chiaberge et al., 2013; Zhan et al., 2015; Barbosa et al., 2013; Onojake et al., 2015) have been used with hierarchical clustering to successfully identify the crude oil source. However, the cluster analyses performed lacked an internal validation stage and were not guided by objective criteria for selecting the number of clusters. Furthermore, these characterization techniques are time consuming and require samples to be already available in the plant laboratories, which is a disadvantage when logistics must be planned in advance. Because of the inherent diversity of crude oils, however, the seller is required to provide a list of physical and chemical properties, usually called the crude oil assay, and this information is available even before the crude reaches the plant. Sad et al. (2019) applied PCA and hierarchical clustering to simple physicochemical properties, such as density and total acid number, to identify outliers in the evaluation of crude oil blends. Ferreira et al. (2017) applied particle swarm optimization, with the silhouette score as the fitness function, to assay properties of different crude samples and found groups that the standard density-based crude oil classification was not able to identify.
In this work, the k-means clustering method is applied to crude oils described by their physical and chemical properties, and two internal validation indexes are used to robustly determine the number of groups offering the best clustering performance. Cluster analysis of data already available in crude oil assays can give refineries additional information before a crude is acquired, yet this area lacks in-depth research on how best to extract information from such data and use it for decision-making.
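The idea of choosing the number of clusters with internal validation indexes can be illustrated with a minimal sketch. It assumes NumPy, synthetic two-dimensional data in place of the assay properties, and the silhouette and Calinski-Harabasz indexes as the two criteria; the text does not name the indexes used, so this pairing is an assumption, and the function names `kmeans`, `silhouette`, `calinski_harabasz` and `scan_k` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with random restarts; returns the labels of the best run."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        # Initialise centroids on k distinct observations.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old centroid if a cluster happens to be empty.
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        inertia = float((dist[np.arange(len(X)), labels] ** 2).sum())
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def silhouette(X, labels):
    """Mean silhouette width (higher is better); assumes at least two clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton gets a silhouette of 0 by convention
        a = D[i, own].sum() / (own.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return float(s.mean())


def calinski_harabasz(X, labels):
    """Ratio of between- to within-cluster dispersion (higher is better)."""
    n, clusters = len(X), np.unique(labels)
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        pts = X[labels == c]
        center = pts.mean(axis=0)
        between += len(pts) * np.sum((center - overall) ** 2)
        within += np.sum((pts - center) ** 2)
    return float((between / (len(clusters) - 1)) / (within / (n - len(clusters))))


def scan_k(X, k_values):
    """Score every candidate number of clusters with both indexes."""
    scores = {}
    for k in k_values:
        labels = kmeans(X, k)
        scores[k] = {
            "silhouette": silhouette(X, labels),
            "calinski_harabasz": calinski_harabasz(X, labels),
        }
    return scores


# Synthetic stand-in for scaled assay data: three well-separated groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])
demo_scores = scan_k(X_demo, range(2, 6))
for k, values in demo_scores.items():
    print(k, round(values["silhouette"], 3), round(values["calinski_harabasz"], 1))
```

Agreement between the two indexes on the same number of clusters is the kind of consistency check that makes the choice less dependent on a single criterion.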
Industrial data set
The data set used in the present work was collected from crude oils processed in the Matosinhos and Sines refineries of Galp over the past years, and contains 454 observations from 45 different crude oil sources.
The features used in this study consist of crude oil physicochemical properties, measured in situ: API gravity; sulfur content; pour point; acidity; Conradson carbon residue (CCR) content; nickel content; vanadium content; iron content; vanadium-nickel ratio. These properties are used for
Preprocessing
The data set studied contained 13.5% invalid entries, the vast majority of them of the “less than” (“<”) type (Table 1). The observations were preprocessed using the imputation technique described in Section 2.2.
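As an illustration of how “less than” entries can be handled, the sketch below parses raw assay strings and replaces a censored value with a fixed fraction of its reported limit. The imputation technique of Section 2.2 is not reproduced here; substituting half the limit is a common convention used only as a stand-in, and the names `parse_entry` and `impute_censored` are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_entry(raw: str) -> Tuple[Optional[float], bool]:
    """Return (value, censored); for a "<" entry the value is the reported limit."""
    text = raw.strip()
    censored = text.startswith("<")
    if censored:
        text = text[1:].strip()
    try:
        return float(text), censored
    except ValueError:
        # Empty or non-numeric entries stay missing.
        return None, False


def impute_censored(raw_values: List[str], fraction: float = 0.5) -> List[Optional[float]]:
    """Replace "<limit" entries with fraction * limit (assumed rule, not the paper's)."""
    imputed = []
    for raw in raw_values:
        value, censored = parse_entry(raw)
        if value is None:
            imputed.append(None)
        elif censored:
            imputed.append(value * fraction)
        else:
            imputed.append(value)
    return imputed


print(impute_censored(["1.2", "<0.5", "", "n/a", "< 2"]))
```

Entries that are empty or otherwise unreadable are left as missing so that a separate rule can decide whether to impute or discard them.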
Outlier detection, as described in Section 2.2, was also implemented. This was done by computing the Mahalanobis distance after scaling the data set (Eq. (2)), fitting a distribution and considering observations above the 99th percentile of
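A Mahalanobis-distance screen of this kind could look roughly like the following sketch. It assumes NumPy, z-score scaling in place of Eq. (2), and the empirical 99th percentile of the computed distances as the cut-off instead of a fitted distribution; neither choice is confirmed by the text, and `mahalanobis_outliers` is a hypothetical name.

```python
import numpy as np


def mahalanobis_outliers(X, percentile=99.0):
    """Return (distances, flags); flags mark observations above the percentile cut-off."""
    X = np.asarray(X, dtype=float)
    # Assumed scaling: zero mean and unit variance per feature.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Z, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates near-singular covariance
    centered = Z - Z.mean(axis=0)
    # Quadratic form (z - mean)^T * inv_cov * (z - mean), one value per observation.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    distances = np.sqrt(np.maximum(d2, 0.0))
    threshold = np.percentile(distances, percentile)
    return distances, distances > threshold


# Synthetic example: 200 regular observations plus one extreme point.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(200, 3)), [[50.0, 50.0, 50.0]]])
distances, flags = mahalanobis_outliers(X_demo)
print(np.flatnonzero(flags))
```

Because the cut-off here is an empirical percentile, a fixed share of observations is always flagged; a threshold derived from a fitted distribution would flag only points that are extreme relative to that model.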
Conclusions
In the present work, industrial data on a variety of physicochemical properties of crude oils were analyzed, and their clustering potential was assessed and described using an unsupervised machine learning methodology. k-means was applied and the clustering quality was evaluated using different internal validation metrics, which in the end gave consistent results. From this analysis, it was possible to determine that 3 is the recommended number of clusters for organizing the crude oils of
CRediT authorship contribution statement
A. Sancho: Conceptualization, Writing – review & editing. J.C. Ribeiro: Conceptualization, Writing – review & editing. M.S. Reis: Conceptualization, Writing – review & editing. F.G. Martins: Conceptualization, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the project UIDB/00511/2020 – Laboratory for Process Engineering, Environment, Biotechnology and Energy – LEPABE, funded by national funds through FCT/MCTES (PIDDAC); and by the project UID/EQU/00102/2019 – Chemical Process Engineering and Forest Products Research Centre – CIEPQPF, funded by national funds through FCT/MCTES (PIDDAC). A. Sancho also acknowledges the research grant financially supported by the FCT Doctoral Program PD00158/2012, Ref: PD/BDE/142834/2018, and Petrogal.
References (31)
- et al. Classification of crude oil samples through statistical analysis of APPI FTICR mass spectra. Fuel Process. Technol. (2013)
- et al. Biomarker characteristics of crude oils from Ashrafi and GH oilfields in the Gulf of Suez, Egypt: an implication to source input and paleoenvironmental assessments. Egypt. J. Pet. (2014)
- et al. Characterization of crude petroleum by NIR. J. Pet. Sci. Eng. (2006)
- et al. Selecting a reduced suite of diagnostic ratios calculated between petroleum biomarkers and polycyclic aromatic hydrocarbons to characterize a set of crude oils. J. Chromatogr. A (2010)
- et al. Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions. Vib. Spectrosc. (2011)
- et al. Pattern recognition analysis of gas chromatographic and infrared spectroscopic fingerprints of crude oil for source identification. Microchem. J. (2020)
- et al. Geochemistry and correlation of oils and source rocks in Banqiao Sag, Huanghua depression, northern China. Int. J. Coal Geol. (2017)
- et al. Chemometric representation of molecular marker data of some Niger delta crude oils. Egypt. J. Pet. (2015)
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
- et al. Multivariate data analysis applied in the evaluation of crude oil blends. Fuel (2019)
- Chromatographic and spectroscopic analysis of heavy crude oil mixtures with emphasis in nuclear magnetic resonance spectroscopy: a review. Anal. Chim. Acta
- An efficient classification method for fuel and crude oil types based on m/z 256 mass chromatography by COW-PCA-LDA. Fuel
- Source identification of sea surface oil with geochemical data in Cantarell, Mexico. Microchem. J.
- Exploratory data analysis using API gravity and V and Ni contents to determine the origins of crude oil samples from petroleum fields in the Espírito Santo Basin (Brazil). Microchem. J.
- Chemometric differentiation of crude oil families in the southern Dongying depression, Bohai Bay Basin, China. Org. Geochem.