Novel manifold learning based virtual sample generation for optimizing soft sensor with small data

doi:10.1016/j.isatra.2020.10.006

ISA Transactions

Volume 109, March 2021, Pages 229-241

https://doi.org/10.1016/j.isatra.2020.10.006

Highlights

• A novel manifold learning based virtual sample generation method is proposed.
• Isomap is used to visualize high-dimensional data for finding data-sparse regions.
• Feasible virtual samples can be generated to supplement the small sample space.
• Case studies including numerical simulations and a real-world application are carried out.
• The effectiveness of the proposed virtual sample generation method is confirmed.

Abstract

Owing to the extremely complex mechanisms and strongly non-linear characteristics of industrial processes, data-driven soft sensor technologies play a key role in intelligent measurement in the process industries. However, the information in process data collected during the steady stage is quite limited and unreliable, which causes the small sample problem. As a result, capturing the nature of the process and building accurate soft sensor models becomes a difficult challenge. To solve this problem, this paper proposes a novel manifold learning based virtual sample generation method (Isomap-VSG) that generates feasible virtual samples in the information gaps to supplement the original small sample space. To locate data-sparse regions reasonably, a manifold learning method called Isomap is used to visualize the high-dimensional process data. Virtual samples are then generated by an interpolation method and an extreme learning machine. Simulation results on a standard dataset and a real-world application demonstrate that, compared with other advanced methods, the proposed Isomap-VSG method achieves better performance in generating feasible virtual samples and in improving the accuracy of soft sensor models built from limited samples.

Introduction

In modern process industries, establishing accurate and reliable soft sensor models plays a crucial role in early planning for policy makers [1], [2]. Petrochemical production processes typically involve a series of complex physical and chemical reactions with multiple transfer phenomena (momentum, heat and mass transfer) [3], [4], [5]. Mechanism modeling requires a thorough grasp of the inherent mechanisms of the entire reaction process [6], and as the process scale increases, mechanism models become more difficult to establish [7]. Data-driven modeling approaches, in contrast, build mathematical models primarily from process data rather than from complex mechanisms [8], [9], and they have been widely used to establish soft sensor models. However, building data-driven models places specific requirements on sample size and data characteristics. The number of valid samples is the most important factor affecting the accuracy of data-driven models [10], [11], [12]. Although advanced equipment and techniques are employed in modern process industries to collect and store large amounts of production operation data, it is still difficult to obtain sufficient representative data for two reasons. First, the number of representative samples is limited by steady operation with small fluctuations, the high cost of data collection, and the low probability of abnormal events [13], [14]. Second, it is hard to extract valuable information and knowledge effectively because the collected data may exhibit non-linearity, random noise, missing values and uncertainty [15]. Insufficient sample capacity, poor representativeness and uneven distribution therefore seriously constrain the development and application of data-driven models. These challenges can be regarded as small sample problems. Small sample problems are common in many fields, such as the process industry, biomedical engineering and materials science. How to establish a reliable soft sensor model from small data is a key issue in manufacturing management [16], [17], [18], [19]. A variety of approaches, such as the bat algorithm and support vector regression (SVR), have been proposed to deal with small sample problems [20], [21], [22].

Because limited data are imprecise, uncertain and incomplete, virtual sample generation (VSG) technology was proposed as a way of expanding the sample data [23], [24]. VSG-based methods can effectively increase the quantity of data using the information in the small sample set. They have been widely used to solve small sample problems and have made great progress in both theoretical research and practical applications [25], [26]. According to the underlying generation idea, VSG-based methods are usually classified into three types: sampling-based VSG, information diffusion-based VSG and feature representation-based VSG. Sampling-based VSG approaches are represented by Bootstrap and the Synthetic Minority Oversampling Technique. Such methods mainly expand the sample size by resampling the small sample set and infer the real distribution from the sampling distribution [27]. They can only increase the number of samples and cannot make up for the information missing from the small sample set. Information diffusion-based VSG methods are typified by mega trend diffusion (MTD) and tree-based trend diffusion (TTD). In [28], TTD used the information diffusion principle to derive a diffusion function for each dimension of the sample data, from which the range of the diffusion domain could be obtained. Fuzzy theory was then used to generate new samples within that domain range. The drawback of TTD is that it calculates the expansion domain of each variable separately and randomly generates virtual values for each variable without considering the correlation between variables [29]. Feature representation-based VSG methods rely on feature extraction algorithms or deep networks to find a feature space suitable for processing high-dimensional samples [30], [31]. By generating virtual samples in the feature space of high-dimensional small samples, the real characteristics and the real distribution can be learnt more easily. Compared with sampling-based and information diffusion-based VSG, feature representation-based VSG is more suitable for processing high-dimensional small data and can also generate new data to fill the information gaps in small data. The work in this paper is therefore based on the generation idea of feature representation.
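The limitation of sampling-based VSG described above can be illustrated with a minimal Bootstrap sketch. The sample values and the helper name `bootstrap_resample` are illustrative assumptions and do not come from the paper; the point is only that resampling with replacement returns copies of existing samples.

```python
import random

def bootstrap_resample(samples, n_new, seed=0):
    """Draw n_new samples with replacement from the original small sample set."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in range(n_new)]

# Hypothetical small sample set of (input, output) pairs.
small_set = [(0.1, 1.2), (0.4, 1.9), (0.9, 3.1)]
resampled = bootstrap_resample(small_set, n_new=10)

# Every resampled point is a copy of an original point: the sample count grows,
# but nothing is added in the gaps between the original samples.
print(len(resampled), set(resampled) <= set(small_set))
```

Because the resampled set is always a subset of the original values, the information gaps between samples remain empty, which is the motivation for generating new points instead.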

Based on the generation idea of feature representation, the feature space of small samples can be learnt by feature extraction algorithms or deep networks. The latest VSG method based on deep networks is the Generative Adversarial Net (GAN) [32], [33]. GAN has been used in the field of image generation, although problems such as mode collapse may occur. Moreover, far fewer samples are available in the process industries than in image processing, so it is difficult to learn the parameters of deep networks from small industrial process datasets. Isomap, a manifold learning method for feature representation, is therefore selected to reduce the dimensionality of the high-dimensional data and to visualize it, because Isomap is an effective feature extraction algorithm. To obtain new samples, a regression model between the extracted features and the original data must be built. The extreme learning machine (ELM), an effective non-linear mapping tool, is adopted to build both this regression model and the soft sensor model. To enhance the performance of soft sensors using limited data, this paper proposes a manifold learning based virtual sample generation method (Isomap-VSG). The specific steps of the proposed method are as follows: (1) Isomap is used to reduce the dimensionality of the original high-dimensional data; (2) virtual samples are generated in the sparse regions according to the visual structure of the data in the low-dimensional space; (3) an evaluation index for virtual samples, namely the range of the asymmetric acceptable extension domain, is calculated to select feasible virtual samples; (4) the selected virtual samples are used to modify the established ELM soft sensor model. To verify the effectiveness of Isomap-VSG, case studies on a standard dataset and a real-world application are carried out. The simulation results show that the expanded samples capture the essence of the industrial process and that the accuracy of the soft sensor models is improved. Compared with SVR and several advanced VSG methods such as MTD, TTD and Bootstrap, the proposed Isomap-VSG achieves better performance.
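Step (1) relies on Isomap, which in its textbook form builds a nearest-neighbour graph, approximates geodesic distances by shortest paths, and applies classical multidimensional scaling. The following is a minimal NumPy sketch of that textbook procedure, not the authors' implementation; the function name `isomap`, the neighbour count and the toy arc data are assumptions chosen for illustration.

```python
import numpy as np

def isomap(X, n_neighbors=5, n_components=2):
    """Minimal Isomap: kNN graph -> shortest-path distances -> classical MDS."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # Keep only edges to the k nearest neighbours, symmetrised.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, nearest[i]] = D[i, nearest[i]]
        G[nearest[i], i] = D[i, nearest[i]]
    # Floyd-Warshall shortest paths approximate geodesic distances.
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    if np.isinf(G).any():
        raise ValueError("neighbourhood graph is disconnected; increase n_neighbors")
    # Classical MDS: double-centre the squared distances and eigendecompose.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Toy data (assumption): 20 points on a half circle embedded in 3-D.
t = np.linspace(0.0, np.pi, 20)
arc = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
Z = isomap(arc, n_neighbors=2, n_components=1)
print(Z.shape)
```

On this toy arc the one-dimensional embedding unrolls the curve, so the two end points end up further apart than their straight-line distance of 2. In practice a library implementation such as the one in scikit-learn would normally be preferred over this cubic-time sketch.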

The remainder of this paper is organized as follows: preliminaries on ELM, the manifold learning method and the asymmetric acceptable domain range expansion method are described in Section 2; the proposed Isomap-VSG method and the soft sensor modeling procedure are presented in detail in Section 3; Section 4 analyzes the simulation results on a benchmark function and the Purified Terephthalic Acid production process; Section 5 draws the conclusions.

Section snippets

Extreme learning machine

ELM is a learning algorithm based on the concept of single-hidden-layer feed-forward neural networks [34], [35]. The basic structure of ELM is depicted in Fig. 1.

Assume that all small samples are represented as $S = \{(x_{i}, y_{i}) \mid i = 1, 2, \dots, n\}$, where $x_{i} = [x_{i1}, x_{i2}, \dots, x_{im}] \in R^{m}$ is the $i$th input sample, $x_{im}$ is the $m$th variable in $x_{i}$, and $y_{i}$ is the $i$th value of the single output $y$. The training process can be written as follows: $\sum_{j = 1}^{N_{h}} \beta_{j} g(x_{i} \cdot \omega_{j} + b_{j}) = o_{i}$, where $\omega_{j} = (w_{j1}, w_{j2}, \dots, w_{jm})^{T}$ $(j = 1, 2, \dots, N_{h})$ is the input weights vector
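The training equation above can be sketched in NumPy as follows. The snippet stops before the solution step, so this sketch follows the standard ELM recipe: the input weights $\omega_{j}$ and biases $b_{j}$ are drawn at random and left fixed, and the output weights $\beta_{j}$ are obtained by least squares through the Moore-Penrose pseudo-inverse of the hidden-layer output matrix. The sigmoid choice for $g$, the uniform initialisation range, the function names `elm_fit` and `elm_predict`, and the toy data are assumptions for illustration.

```python
import numpy as np

def _hidden(X, W, b):
    # g(x_i . omega_j + b_j) with g taken to be the sigmoid (assumption).
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, y, n_hidden=20, seed=0):
    """Fit a single-hidden-layer ELM; returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # columns are omega_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_j
    H = _hidden(X, W, b)                                     # shape (n, N_h)
    beta = np.linalg.pinv(H) @ y                             # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Compute o_i = sum_j beta_j * g(x_i . omega_j + b_j) for each row of X."""
    return _hidden(X, W, b) @ beta

# Toy small sample set (assumption): n = 30 samples, m = 1 input variable.
X = np.linspace(-1.0, 1.0, 30).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
W, b, beta = elm_fit(X, y)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2)))
print(rmse)
```

Only `beta` is learnt from data, which is why ELM training reduces to a single linear least-squares problem and remains cheap on small sample sets.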

The proposed Isomap-VSG

The objective of the proposed method is to solve the small sample problem and improve soft sensor modeling accuracy by generating feasible virtual samples. In this section, the Isomap-VSG method and the construction process of a soft sensor model are introduced in detail. The whole process can be divided into four steps: constructing an ELM model using small samples, generating virtual samples by the manifold learning method and the interpolation method, selecting suitable virtual samples and adding them
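The idea of interpolating in sparse regions of the low-dimensional space can be sketched as below. The snippet does not give the paper's interpolation rule or its sparsity criterion, so this is a hypothetical heuristic only: sparsity is measured by nearest-neighbour distance and a virtual point is placed at the midpoint of each widest gap. The function name `midpoint_virtual_points` and the toy coordinates are assumptions.

```python
import math

def midpoint_virtual_points(points, n_virtual):
    """Return up to n_virtual midpoints placed in the widest nearest-neighbour gaps."""
    gaps = []
    for i, p in enumerate(points):
        # Distance from p to its nearest neighbour measures local sparsity.
        dist, j = min((math.dist(p, q), k) for k, q in enumerate(points) if k != i)
        gaps.append((dist, min(i, j), max(i, j)))
    virtual, used = [], set()
    for dist, i, j in sorted(gaps, reverse=True):  # widest gaps first
        if len(virtual) >= n_virtual:
            break
        if (i, j) in used:                         # skip a pair already interpolated
            continue
        used.add((i, j))
        virtual.append(tuple((a + c) / 2 for a, c in zip(points[i], points[j])))
    return virtual

# Hypothetical low-dimensional coordinates with one isolated point.
low_dim = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (6.0, 0.0)]
virtual = midpoint_virtual_points(low_dim, n_virtual=1)
print(virtual)  # [(4.0, 0.0)]
```

A virtual point generated this way lives in the low-dimensional space, so a regression model from the low-dimensional features back to the original variables (the role the paper assigns to ELM) would still be needed to obtain a full virtual sample.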

Case study

Two validation cases are provided in this section to verify the validity and practical value of the proposed method. The validation datasets include one standard dataset and one industrial dataset collected from the Purified Terephthalic Acid (PTA) production process. The selected datasets are introduced in detail, followed by a discussion and analysis of the simulation results.

Conclusions

For data-driven approaches, the modeling performance depends on the number of valid samples. In complex petrochemical processes, data-driven modeling methods are limited by uncertain small sample environments. To capture the nature of the process and build accurate soft sensor models from a small sample set, a new VSG method named Isomap-VSG is presented to generate feasible virtual samples. The proposed method is applied to enhancing the performance of prediction modeling for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grant Nos. 61973024 and 61703027.

References (37)

He Y.L. et al. Novel soft sensor development using echo state network integrated with singular value decomposition: Application to complex chemical processes. Chemometr Intell Lab Syst (2020)
Sadati N. et al. Observational data-driven modeling and optimization of manufacturing processes. Expert Syst Appl (2018)
Zhang X.H. et al. Energy modeling using an effective latent variable based functional link learning machine. Energy (2018)
Kadlec P. et al. Data-driven soft sensors in the process industry. Comput Chem Eng (2009)
Gong H.F. et al. A Monte Carlo and PSO based virtual sample generation method for enhancing the energy prediction and energy optimization on small data problem: An empirical study of petrochemical industries. Appl Energy (2017)
Liu Y.F. et al. Wasserstein GAN-based small-sample augmentation for new-generation artificial intelligence: A case study of cancer-staging data in biology. Engineering (2019)
Shaikhina T. et al. Handling limited datasets with neural networks in medical applications: A small-data approach. Artif Intell Med (2017)
Tian C.L. et al. Data driven parallel prediction of building energy consumption using generative adversarial nets. Energy Build (2019)
He Y.L. et al. A novel and effective nonlinear interpolation virtual sample generation method for enhancing energy prediction and analysis on small data problem: A case study of Ethylene industry. Energy (2018)
Li D.C. et al. Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency. Decis Support Syst (2014)
Espezua S. et al. A Projection Pursuit framework for supervised dimension reduction of high dimensional small sample datasets. Neurocomputing (2015)
Meng M. et al. A small-sample hybrid model for forecasting energy-related CO₂ emissions. Energy (2014)
Hong W.C. et al. Novel chaotic bat algorithm for forecasting complex motion of floating platforms. Appl Math Model (2019)
Liu Z. et al. A SVM controller for the stable walking of biped robots based on small sample sizes. Appl Soft Comput (2016)
Cheng A. et al. Multiple sources and multiple measures based traffic flow prediction using the chaos theory and support vector regression method. Physica A (2017)
Chen Z.S. et al. A PSO based virtual sample generation method for small sample sets: Applications to regression datasets. Eng Appl Artif Intell (2017)
Yang J. et al. A novel virtual sample generation method based on Gaussian distribution. Knowl-Based Syst (2011)
He Y.L. et al. Fault diagnosis using novel AdaBoost based discriminant locality preserving projection with resamples. Eng Appl Artif Intell (2020)
