Cluster analysis of crude oils with k-means based on their physicochemical properties

https://doi.org/10.1016/j.compchemeng.2021.107633

Highlights

  • Physicochemical properties of crude oils have a structure suitable for clustering.

  • k-means with 3 clusters provides additional knowledge to better group crude oils.

  • Usage of internal validation indexes to measure clustering quality is relevant.

Abstract

The values of the physicochemical properties of crude oils vary significantly, depending on their geographical origins. The standard categorization of crude oils is roughly based on density and sulfur content and does not consider other properties that can have meaningful impacts on blending and on some refining processes. Cluster analysis is an unsupervised machine learning technique that categorizes observations based on their similarity. In this work, the k-means clustering algorithm was applied to a wide range of physicochemical properties to identify groups of crude oils with high affinity that may behave similarly later on, in downstream operations.

A data set from Galp SA refineries (located in Portugal) containing 454 observations, corresponding to values of 9 properties, from 45 different crude oil sources was used in the present analysis. After suitable preprocessing, k-means was applied using different numbers of clusters, and the performance of each was evaluated through two internal validation metrics: the silhouette index and the Local Cores-based Cluster Validity (LCCV) index. The recommended number of clusters was 3, which presented the best performance with an LCCV index of 0.39. Crude oils from the same source should be assigned to the same cluster, and this was corroborated by external validation: only 1.8% of the observations were placed in a different cluster than the majority of crude oils from the same source. When more clusters were considered, the proposed method was also able to identify observations with unusually high iron contents relative to other crude oils from the same source.
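The external validation figure quoted above (the share of observations placed in a different cluster than the majority of their source) can be illustrated with a minimal sketch. The function name, the source names and the labels below are hypothetical and are not taken from the paper; the sketch only shows one plausible way to compute such a rate, assuming ties are resolved in favor of the label seen first.

```python
from collections import Counter, defaultdict


def majority_mismatch_rate(sources, labels):
    """Fraction of observations whose cluster label differs from the
    majority label of their own crude oil source (hypothetical helper)."""
    labels_by_source = defaultdict(list)
    for source, label in zip(sources, labels):
        labels_by_source[source].append(label)

    mismatches = 0
    for source_labels in labels_by_source.values():
        # Size of the most frequent cluster label within this source.
        majority_count = Counter(source_labels).most_common(1)[0][1]
        mismatches += len(source_labels) - majority_count
    return mismatches / len(labels)


# Toy data: three made-up sources and the cluster assigned to each sample.
sources = ["A", "A", "A", "B", "B", "C"]
labels = [0, 0, 1, 1, 1, 2]
rate = majority_mismatch_rate(sources, labels)
print(f"{rate:.1%} of observations fall outside their source's majority cluster")
```

In this toy case only one of the six observations (the third sample of source "A") departs from its source's majority cluster.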

This work provides a methodology to obtain a better categorization of crude oils by using cluster analysis, allowing refineries to know how similar crude oils and their sources are. This categorization is useful for improving the formulation of crude blends and the quality control of crude oils, with the goal of further optimizing the refining operations.

Introduction

Crude oils, composed mainly of hydrocarbons, are mixtures of different organic components that also contain metals, sulfur, nitrogen and oxygen (Silva et al., 2011). Their composition can vary significantly, depending on the geographic source, causing the crude oils to have properties with a wide range of values. Inorganic components like sulfur, nitrogen, oxygen and heavy metals, although present in small quantities, play a major role in the crude oil quality, significantly affecting the refining process. Refineries convert crude oils into a wide range of products. Lighter crude oils tend to produce products with better properties but are generally more expensive; heavier crude oils tend to be less expensive but have more impurities and are harder to process. Depending on its sulfur content, a crude oil can be classified as “sweet” or “sour”, with the latter requiring more intensive treatments in the refining process. Typically, crude oil categorization is roughly based on density and sulfur content, which is insufficient for planning the subsequent blending and refining operations.

With the increasing amount of data available in the industry, the use of data science and machine learning techniques to extract additional knowledge is becoming more popular (Hassani and Silva, 2018). In the petrochemical industry, supervised learning regression methods have been combined with advanced characterization techniques such as near-infrared spectroscopy (NIR) (Falla et al., 2006), nuclear magnetic resonance spectroscopy (NMR) (Masili et al., 2012) and gas chromatography-mass spectrometry (GC-MS) (El Nady et al., 2014) to successfully determine physicochemical properties of crude oils.

Classification methods have also been applied to crude oils. Linear discriminant analysis was used on data from GC-MS (Sun et al., 2018) and physicochemical data (Vieira et al., 2016) to determine the crude oil type. Other common classification methods, such as partial least squares for discriminant analysis, were also applied to NIR (Galtier et al., 2011) and Fourier-transform infrared spectroscopy (FTIR) (Abbas et al., 2006) data to identify the source of the crude oil samples. Furthermore, non-linear methods such as Artificial Neural Networks (ANN) have been successfully developed to determine the geographical location using NIR data of bitumen (Blanco et al., 2001).

On the other hand, unsupervised learning methods have the power to explore the natural structure of data, using the similarities between observations and/or variables. One of the most important applications of unsupervised learning methods in the petrochemical industry is the determination of the geographical source for crude oils. Zhan (2019) and Liu et al. (2017) successfully applied hierarchical clustering to geochemical biomarkers to identify crude oil samples from different geographic locations. Fernández-Varela (2010) combined different techniques such as Principal Component Analysis (PCA), hierarchical clustering and feature selection to find a reduced subset of diagnostic ratios derived from GC-MS that were sufficient to characterize the crude oils and the groups formed, reducing the number of features from 28 to 4.

Advanced characterization techniques such as gas chromatography (Teixeira, 2014; Zhang et al., 2015; Hashemi-Nasab and Parastar, 2020) and different types of spectroscopy (Chiaberge et al., 2013; Zhan et al., 2015; Barbosa et al., 2013; Onojake et al., 2015) have been used to successfully identify the crude oil source using hierarchical clustering. However, the cluster analysis performed lacked an internal validation stage and was not guided by objective criteria for selecting the number of clusters. Furthermore, these characterization techniques are time-consuming and require samples to be already available in the plant laboratories, which is a disadvantage when logistic planning is required in advance. By contrast, due to the inherent diversity of crude oils, the seller is required to provide a list of physical and chemical properties, usually called the crude oil assay. This information is available even before the crude reaches the plant. Sad et al. (2019) applied PCA and hierarchical clustering to simple physicochemical properties, such as density and total acidity number, to identify outliers in the evaluation of crude oil blends. Ferreira et al. (2017) applied particle swarm optimization, with the silhouette score as the fitness function, to assay properties of different crude samples, revealing groups that the standard density-based crude oil classification was not able to identify.

In this work, the k-means clustering method is applied and two internal validation indexes are used to robustly determine the number of groups offering the best clustering performance, based on the physical and chemical properties of the crude oils. While cluster analysis using data already available in the crude oil assays can provide additional information to refineries before a crude is acquired, this area lacks in-depth research on how best to extract information from the data and make better use of it for decision-making.
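The idea of choosing the number of clusters with an internal validation index can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses only the silhouette index (the LCCV index is not implemented here), a plain Lloyd-style k-means with a deterministic farthest-point initialization chosen for reproducibility, and made-up two-dimensional points instead of the nine scaled assay properties. All function names are hypothetical.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialization: start at the first point, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette over all points (singleton clusters score 0)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: three well-separated groups of four points each.
points = [(x + dx, y + dy)
          for x, y in ((0, 0), (10, 0), (0, 10))
          for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1))]

scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On real assay data the features would first need to be preprocessed and scaled, and a randomized initialization with several restarts would normally replace the deterministic one used here.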

Industrial data set

The data set used in the present work was collected from crude oils processed in Matosinhos and Sines refineries of Galp, over the past years, containing 454 observations from 45 different crude oil sources.

The features used in this study consist of physicochemical properties of the crude oils, measured in situ: API gravity; sulfur content; pour point; acidity; Conradson carbon residue (CCR) content; nickel content; vanadium content; iron content; vanadium-nickel ratio. These properties are used for

Preprocessing

The data set studied contained a total of 13.5% of invalid entries, the vast majority of which were “less than” (“<”) entries (Table 1). The observations were preprocessed using the imputation technique described in Section 2.2.

The detection of outliers, as described in Section 2.2, was also implemented. This was done by determining the Mahalanobis distance, after scaling the data set (Eq. (2)), adjusting a χ2 distribution and considering observations above the 99th percentile of χ2
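A minimal sketch of this kind of outlier rule is given below. It is an illustration under stated assumptions, not the procedure of Section 2.2: it uses only two made-up features (so that the χ2 quantile with 2 degrees of freedom has the closed form -2 ln(1 - p)), whereas the study uses nine properties; it omits the scaling step, since the Mahalanobis distance is unaffected by per-feature rescaling; and all names and values are hypothetical.

```python
import math


def chi2_quantile_2dof(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom.
    Its CDF is 1 - exp(-x / 2), so the inverse has a closed form."""
    return -2.0 * math.log(1.0 - p)


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean,
    using the sample covariance matrix (denominator n - 1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
    return [(syy * (x - mx) ** 2
             - 2.0 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det
            for x, y in points]


# Made-up (API gravity, sulfur content) pairs plus one atypical sample.
samples = [(api, sulfur)
           for api in (30.0, 31.0, 32.0, 33.0, 34.0)
           for sulfur in (0.5, 0.6, 0.7, 0.8)]
samples.append((10.0, 4.0))

threshold = chi2_quantile_2dof(0.99)  # 99th percentile, 2 degrees of freedom
distances = mahalanobis_sq_2d(samples)
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("threshold:", round(threshold, 3), "flagged indices:", flagged)
```

With more than two features, the covariance matrix would need a general inverse and the χ2 quantile would be taken with as many degrees of freedom as there are features.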

Conclusions

In the present work, industrial data from a variety of physicochemical properties of crude oils was analyzed and its clustering potential assessed and described using an unsupervised machine learning methodology. k-means was applied and the clustering quality evaluated using different internal validation metrics, which in the end revealed consistent results. From this analysis, it was possible to determine that 3 clusters is the recommended number of clusters for organizing the crude oils of

CRediT authorship contribution statement

A. Sancho: Conceptualization, Writing – review & editing. J.C. Ribeiro: Conceptualization, Writing – review & editing. M.S. Reis: Conceptualization, Writing – review & editing. F.G. Martins: Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the project UIDB/00511/2020 – Laboratory for Process Engineering, Environment, Biotechnology and Energy–LEPABE funded by national funds through FCT/MCTES (PIDDAC); and the project UID/EQU/00102/2019 – Chemical Process Engineering and Forest Products Research centre – CIEPQPF funded by national funds through FCT/MCTES (PIDDAC). A. Sancho also acknowledges the research grant financially supported by the FCT Doctoral Program PD00158/2012, Ref: PD/BDE/142834/2018, and Petrogal.
