Cluster analysis of crude oils with k-means based on their physicochemical properties

doi:10.1016/j.compchemeng.2021.107633

Computers & Chemical Engineering

Volume 157, January 2022, 107633

Highlights

• Physicochemical properties of crude oils have a structure suitable for clustering.
• k-means with 3 clusters provides additional knowledge to better group crude oils.
• Usage of internal validation indexes to measure clustering quality is relevant.

Abstract

The values of the physicochemical properties of crude oils vary significantly, depending on their geographical origins. The standard categorization of crude oils is based roughly on density and sulfur content and does not consider other properties that can have a meaningful impact on blending and on some refining processes. Cluster analysis is an unsupervised machine learning technique that categorizes observations based on their similarity. In this work, the k-means clustering algorithm was applied to a wide range of physicochemical properties to identify groups of crude oils with high affinity that may behave similarly later on, in downstream operations.

A data set from Galp SA refineries (located in Portugal) containing 454 observations of 9 properties from 45 different crude oil sources was used in the present analysis. After suitable preprocessing, k-means was applied with different numbers of clusters, and the resulting partitions were evaluated with two internal validation metrics: the silhouette index and the Local Cores-based Cluster Validity (LCCV) index. The recommended number of clusters was 3, which gave the best performance, with an LCCV index of 0.39. Crude oils from the same source should fall in the same cluster, and this was corroborated by external validation: 1.8% of the observations were placed in a different cluster than the majority of crude oils from the same source. When more clusters were considered, the proposed method was also able to identify observations with unusually high iron contents relative to other crude oils from the same source.
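The external validation described above (the share of observations that fall outside the cluster holding most observations from the same source) can be illustrated with a minimal sketch. The function name `source_mismatch_rate` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def source_mismatch_rate(sources, labels):
    """Fraction of observations assigned to a cluster other than the
    majority cluster of their own crude oil source."""
    clusters_by_source = defaultdict(list)
    for source, label in zip(sources, labels):
        clusters_by_source[source].append(label)

    mismatched = 0
    for cluster_labels in clusters_by_source.values():
        # Number of observations in the most common cluster for this source.
        majority_count = Counter(cluster_labels).most_common(1)[0][1]
        mismatched += len(cluster_labels) - majority_count
    return mismatched / len(labels)


# Toy example with made-up source names and cluster labels.
sources = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
labels = [0, 0, 0, 1, 2, 2, 2, 1, 1, 1]
rate = source_mismatch_rate(sources, labels)
print(f"{rate:.1%} of observations fall outside their source's majority cluster")
```

A low rate indicates that the clusters are consistent with the crude oil sources, which is how the 1.8% figure above is read.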

This work provides a methodology to obtain a better categorization of crude oils by using cluster analysis, allowing refineries to know how similar crude oils and their sources are. This categorization is very useful for improving the formulation of crude blends and crude oil quality control, with the goal of further optimizing refining operations.

Introduction

Crude oils consist mainly of hydrocarbons and are mixtures of different organic components that also contain metals, sulfur, nitrogen and oxygen (Silva et al., 2011). Their composition can vary significantly, depending on the geographic source, causing the crude oils to have properties with a wide range of values. Inorganic components like sulfur, nitrogen, oxygen and heavy metals, although present in small quantities, play a major role in the crude oil quality, significantly affecting the refining process. Refineries convert crude oils into a wide range of products. Lighter crude oils tend to produce products with better properties but are generally more expensive; heavier crude oils tend to be less expensive but have more impurities and are harder to process. Depending on its sulfur content, a crude oil can be classified as “sweet” or “sour”, with the latter requiring more intensive treatments in the refining process. Typically, crude oil categorization is based roughly on density and sulfur content, which is insufficient for planning subsequent blending and refining.

With the increasing amount of data available in the industry, the use of data science and machine learning techniques to extract additional knowledge is becoming more popular (Hassani and Silva, 2018). In the petrochemical industry, supervised learning regression methods have been combined with advanced characterization techniques such as near-infrared spectroscopy (NIR) (Falla et al., 2006), nuclear magnetic resonance spectroscopy (NMR) (Masili et al., 2012) and gas chromatography-mass spectrometry (GC-MS) (El Nady et al., 2014) to successfully determine physicochemical properties of crude oils.

Classification methods have also been applied to crude oils. Linear discriminant analysis was used on data from GC-MS (Sun et al., 2018) and physicochemical data (Vieira et al., 2016) to determine the crude oil type. Other common classification methods, such as partial least squares for discriminant analysis, were also applied to NIR (Galtier et al., 2011) and Fourier-transform infrared spectroscopy (FTIR) (Abbas et al., 2006) data to identify the source of the crude oil samples. Furthermore, non-linear methods such as Artificial Neural Networks (ANN) have been successfully developed to determine the geographical location using NIR data of bitumen (Blanco et al., 2001).

On the other hand, unsupervised learning methods have the power to explore the natural structure of data, using the similarities between observations and/or variables. One of the most important applications of unsupervised learning methods in the petrochemical industry is the determination of the geographical source of crude oils. Zhan et al. (2019) and Liu et al. (2017) successfully applied hierarchical clustering to geochemical biomarkers to identify crude oil samples from different geographic locations. Fernández-Varela et al. (2010) combined different techniques such as Principal Component Analysis (PCA), hierarchical clustering and feature selection to find a reduced subset of diagnostic ratios derived from GC-MS that was sufficient to characterize the crude oils and the groups formed, reducing the number of features from 28 to 4.

Advanced characterization techniques such as gas chromatography (Teixeira et al., 2014; Zhang et al., 2015; Hashemi-Nasab and Parastar, 2020) and different types of spectroscopy (Chiaberge et al., 2013; Zhan et al., 2015; Barbosa et al., 2013; Onojake et al., 2015) have been used to successfully identify the crude oil source using hierarchical clustering. However, the cluster analyses performed lacked an internal validation stage and were not guided by objective criteria for selecting the number of clusters. Furthermore, these characterization techniques are time-consuming and require samples to be already available in the plant laboratories, which is a disadvantage when logistic planning is required in advance. In contrast, owing to the inherent diversity of crude oils, the seller is required to provide a list of physical and chemical properties, usually called the crude oil assay, and this information is available even before the crude reaches the plant. Sad et al. (2019) applied PCA and hierarchical clustering to simple physicochemical properties, such as density and total acid number, to identify outliers in the evaluation of crude oil blends. Ferreira et al. (2017) applied particle swarm optimization, with the silhouette score as the fitness function, to assay properties of different crude samples and found groups that the standard density-based crude oil classification was not able to identify.

In this work, the k-means clustering method is applied to crude oils, and two internal validation indexes are used to robustly determine the number of groups offering the best clustering performance based on their physical and chemical properties. While cluster analysis using data already available in the crude oil assays can provide refineries with additional information before a crude is acquired, this area lacks in-depth research on how best to extract information from the data and use it for decision-making.
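The selection of the number of clusters outlined above can be sketched as a sweep over candidate values of k, scoring each k-means partition with the silhouette index (Rousseeuw, 1987). This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic blobs standing in for preprocessed and scaled properties, the helper names `kmeans` and `silhouette_index` are hypothetical, and the LCCV index used in the paper is not reproduced here. The code relies on NumPy only to stay self-contained; library implementations such as those in scikit-learn are the usual choice in practice.

```python
import numpy as np


def kmeans(X, k, n_init=20, n_iter=100, seed=0):
    """Lloyd's k-means; returns the labels of the lowest-inertia run."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        # Start from k distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its members (keep it if empty).
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def silhouette_index(X, labels):
    """Mean silhouette width using Euclidean distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton cluster gets a silhouette of 0
        a = D[i, own].sum() / (own.sum() - 1)  # mean distance to own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths.mean()


# Synthetic stand-in data: three well-separated groups in two dimensions.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in blob_centers])

scores = {k: silhouette_index(X, kmeans(X, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("number of clusters with the highest silhouette:", best_k)
```

Using several random restarts per k reduces the chance that a poor local optimum of k-means distorts the comparison between cluster numbers.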

Section snippets

Industrial data set

The data set used in the present work was collected from crude oils processed at the Matosinhos and Sines refineries of Galp over the past years and contains 454 observations from 45 different crude oil sources.

The features used in this study are crude oil physicochemical properties, measured in situ: API gravity; sulfur content; pour point; acidity; Conradson carbon residue (CCR) content; nickel content; vanadium content; iron content; vanadium-nickel ratio. These properties are used for

Preprocessing

The data set studied contained a total of 13.5% of invalid entries, the vast majority being “less than” (“<”) entries (Table 1). The observations were preprocessed using the imputation technique described in Section 2.2.
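The imputation technique of Section 2.2 is not reproduced in this excerpt, so the sketch below only illustrates how “less than” entries might be detected, counted and replaced. Substituting half of the reported limit is a common convention for such censored values and is purely an assumption here, not the paper's method; the function names and the sample readings are hypothetical.

```python
def parse_entry(raw):
    """Return (value, censored). The value is the reported limit for
    "<" entries and None for entries that cannot be parsed."""
    text = str(raw).strip()
    if text.startswith("<"):
        return float(text[1:]), True
    try:
        return float(text), False
    except ValueError:
        return None, False


def invalid_fraction(raw_values):
    """Share of entries that are censored ("<") or missing."""
    parsed = [parse_entry(v) for v in raw_values]
    invalid = sum(1 for value, censored in parsed if censored or value is None)
    return invalid / len(parsed)


def impute_censored(raw_values, factor=0.5):
    """Replace "<L" entries by factor * L (assumed convention);
    missing entries stay None for a later imputation step."""
    imputed = []
    for raw in raw_values:
        value, censored = parse_entry(raw)
        imputed.append(value * factor if censored else value)
    return imputed


# Made-up readings for a single property.
iron_ppm = ["3.2", "<1", "12.0", "", "<0.5", "7.4"]
print("invalid fraction:", invalid_fraction(iron_ppm))
print("after imputation:", impute_censored(iron_ppm))
```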

The detection of outliers, as described in Section 2.2, was also implemented. This was done by determining the Mahalanobis distance after scaling the data set (Eq. (2)), fitting a $\chi^{2}$ distribution and considering observations above the 99th percentile of $\chi^{2}$
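The outlier screening described above can be sketched as follows, assuming that observations whose squared Mahalanobis distance exceeds the 99th percentile of a $\chi^{2}$ distribution (with one degree of freedom per feature) are the ones flagged. The scaling of Eq. (2) is not shown in this excerpt, so a plain z-score is assumed. The $\chi^{2}$ quantile is computed with the Wilson-Hilferty approximation to avoid extra dependencies, which is a substitution made for this sketch; `scipy.stats.chi2.ppf` gives the exact value. All names and data are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3


def mahalanobis_outliers(X, percentile=0.99):
    """Return (flags, d2): flags marks rows whose squared Mahalanobis
    distance d2 exceeds the chi-square quantile for X.shape[1] features."""
    # Assumed z-score scaling; the paper's Eq. (2) may differ.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    # Row-wise quadratic form z_i^T S^{-1} z_i.
    d2 = np.einsum("ij,jk,ik->i", Z, cov_inv, Z)
    return d2 > chi2_quantile(percentile, X.shape[1]), d2


# Synthetic stand-in for 9 properties, with one grossly atypical observation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
X[0] = 10.0
flags, d2 = mahalanobis_outliers(X)
print("threshold:", round(float(chi2_quantile(0.99, 9)), 2))
print("flagged observations:", np.flatnonzero(flags))
```

Because the Mahalanobis distance already accounts for the covariance of the features, the result does not depend on the choice of per-feature scaling; the scaling step is kept only to mirror the order of operations in the text.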

Conclusions

In the present work, industrial data from a variety of physicochemical properties of crude oils was analyzed and its clustering potential assessed and described using an unsupervised machine learning methodology. k-means was applied and the clustering quality evaluated using different internal validation metrics, which in the end revealed consistent results. From this analysis, it was possible to determine that 3 clusters is the recommended number of clusters for organizing the crude oils of

CRediT authorship contribution statement

A. Sancho: Conceptualization, Writing – review & editing. J.C. Ribeiro: Conceptualization, Writing – review & editing. M.S. Reis: Conceptualization, Writing – review & editing. F.G. Martins: Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the project UIDB/00511/2020 – Laboratory for Process Engineering, Environment, Biotechnology and Energy–LEPABE funded by national funds through FCT/MCTES (PIDDAC); and the project UID/EQU/00102/2019 – Chemical Process Engineering and Forest Products Research centre – CIEPQPF funded by national funds through FCT/MCTES (PIDDAC). A. Sancho also acknowledges the research grant financially supported by FCT Doctoral Program PD00158/2012, Ref: PD/BDE/142834/2018, and Petrogal.

References (31)

S. Chiaberge et al. Classification of crude oil samples through statistical analysis of APPI FTICR mass spectra. Fuel Process. Technol. (2013)
M.M. El Nady et al. Biomarker characteristics of crude oils from Ashrafi and GH oilfields in the Gulf of Suez, Egypt: an implication to source input and paleoenvironmental assessments. Egypt. J. Pet. (2014)
F.S. Falla et al. Characterization of crude petroleum by NIR. J. Pet. Sci. Eng. (2006)
R. Fernández-Varela et al. Selecting a reduced suite of diagnostic ratios calculated between petroleum biomarkers and polycyclic aromatic hydrocarbons to characterize a set of crude oils. J. Chromatogr. A (2010)
O. Galtier et al. Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions. Vib. Spectrosc. (2011)
F.S. Hashemi-Nasab et al. Pattern recognition analysis of gas chromatographic and infrared spectroscopic fingerprints of crude oil for source identification. Microchem. J. (2020)
Q. Liu et al. Geochemistry and correlation of oils and source rocks in Banqiao Sag, Huanghua depression, northern China. Int. J. Coal Geol. (2017)
M.C. Onojake et al. Chemometric representation of molecular marker data of some Niger delta crude oils. Egypt. J. Pet. (2015)
P.J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
C.M. Sad et al. Multivariate data analysis applied in the evaluation of crude oil blends. Fuel (2019)
S.L. Silva et al. Chromatographic and spectroscopic analysis of heavy crude oil mixtures with emphasis in nuclear magnetic resonance spectroscopy: a review. Anal. Chim. Acta (2011)
P. Sun et al. An efficient classification method for fuel and crude oil types based on m/z 256 mass chromatography by COW-PCA-LDA. Fuel (2018)
C.C. Teixeira et al. Source identification of sea surface oil with geochemical data in Cantarell, Mexico. Microchem. J. (2014)
L.V. Vieira et al. Exploratory data analysis using API gravity and V and Ni contents to determine the origins of crude oil samples from petroleum fields in the Espírito Santo Basin (Brazil). Microchem. J. (2016)
Z.W. Zhan et al. Chemometric differentiation of crude oil families in the southern Dongying depression, Bohai Bay Basin, China. Org. Geochem. (2019)
