Information Sciences

Volume 581, December 2021, Pages 262-277
A Gaussian mixture model based virtual sample generation approach for small datasets in industrial processes

https://doi.org/10.1016/j.ins.2021.09.014

Abstract

Due to the small quantity and frequent imbalance of labeled samples, it is challenging to establish a robust and accurate prediction model through data-driven methods. To deal with the small-dataset problem, new virtual samples may be generated via virtual sample generation (VSG) methods based on the trend of the original small raw dataset, thereby improving modeling performance. Effective VSG is desirable, but also challenging. Conventional VSG usually assumes that the raw sample set contains only a single operating mode. Since actual processes are often multi-mode, taking multiple modes into account can improve VSG-based modeling performance. To this end, an information expansion function considering sample density and amount (IEDA) is first developed in this paper to expand the domain range of the attributes. Then, virtual samples under multiple operating modes are generated by a proposed Gaussian mixture model based virtual sample generation (GMMVSG) method. Applications of GMMVSG to the Tennessee Eastman benchmark process and an industrial hydrocracking process show significant improvements in modeling and prediction over other conventional VSG methods.

Introduction

Modeling of industrial processes has received much attention with the development of intelligent manufacturing and the goal of green and highly efficient industrial production. It benefits the analysis and design of actual systems, the prediction of trends in certain states, and the development of optimal control strategies [1], [2], [3]. By far, many kinds of process modeling methods have been proposed. Generally, they can be split into mechanism-based, data-based, and hybrid approaches [4]. The mechanism-based model [5], [6], also known as a white-box model, is a mathematical model based on the internal mechanism of the process or the transfer mechanisms of material flows. In other words, it is built from mass balance equations, energy balance equations, momentum balance equations, phase equilibrium equations, and so on. However, as industrial processes are often complicated, it is difficult to describe them with accurate and concise mathematical relationships. Moreover, mechanistic models usually contain a large number of parameters, some of which are difficult to identify. Therefore, to meet the strong demand for simplified modeling, data-based modeling methods [7], [8] were proposed. Different from the mechanism-based model, a data-based model is a black-box model obtained by training on and fitting various types of data. In this modeling process, easily measurable auxiliary variables are used as model inputs, and key process variables (leading variables) are used as outputs [9]. The data-based model is easier to implement. To make full use of both process knowledge and process data, hybrid modeling procedures [10] have also been developed. However, given characteristics of complex processes such as insufficient and imbalanced labeled samples, it is still challenging to fully and effectively use all available data.

It is well known that, with the rapid development of data-driven technology, data-based modeling methods are widely used in many fields, for example, soft sensors, monitoring, diagnosis, and operating performance assessment [11], [12], [13], [14]. To build an accurate data-based model, sufficient data and an appropriate distribution assumption are needed [15], [16]. However, the small-sample-set problem [17], [18] usually occurs in industrial processes, especially during process transitions. Due to the discrete and sparse distribution of small sample sets, the information gap between data samples can worsen the characterization of processes. To this end, capturing potential information within the intervals between sample points is of great significance for describing overall process characteristics [19], [20].

In the past decades, learning based on small sample sets has drawn special attention from both academia and industry. It can be divided into three categories: grey modeling [21], [22], feature extraction [23], [24], and virtual sample generation (VSG) [25], [26]. Among them, VSG is a relatively new method that can improve the performance of data-based modeling by generating effective virtual samples (VS). Virtual samples usually refer to auxiliary samples generated using prior knowledge and known training samples. They can make up for the information gap caused by insufficient real samples in the raw sample space and expand the number of samples, thereby improving the predictive ability of the model and suppressing overfitting. To improve the performance of image recognition, Niyogi et al. first noted in 1998 that virtual samples can be generated using prior process knowledge [27]. Inspired by this study, Li et al. developed an internalized kernel density estimator (IKDE) [28] and an extended version of IKDE (named GKIDE) [29] in 2006 and 2008, respectively. Both IKDE and GKIDE can be used to extract additional information from a small sample set to expedite learning. Huang et al. trained a learning function via information diffusion [30] based on back-propagation neural networks; the resulting network is called the diffusion-neural-network (DNN) [31]. In DNN, each data point is regarded as a data center with a fuzzy normal distribution over a certain interval and is symmetrically diffused to the left and right of that distribution via a symmetric diffusion function. However, DNN requires the correlation between variables to be greater than 0.9, a requirement industrial data sets can rarely meet, so its application is greatly restricted. To fill the information gap more fully, a mega-trend-diffusion (MTD) technique was developed [32]. This method extends the single-point diffusion of DNN to diffusion over the whole data set. To generate optimal virtual samples, genetic algorithms, particle swarm optimization, and Monte Carlo algorithms were then applied to VSG [33], [34], [35]. These methods do not place strict requirements on the raw small sample set or the prediction training set, but only prediction models with a MAPE below 10% can be used for subsequent VSG. Later, researchers proposed several VSG methods based on deep learning networks. Ian Goodfellow of the University of Montreal proposed generative adversarial networks (GAN) in 2014 [36], [37]. The core idea of GAN comes from the Nash equilibrium of game theory. A GAN consists of two networks, a generator and a discriminator. The generator is responsible for capturing the potential distribution of the real data samples and generating new samples, while the discriminator is a binary classifier that distinguishes whether its input is a real or a generated sample. By continuously optimizing these two networks, the generated samples eventually become indistinguishable from real ones. However, GAN training is unstable and prone to failure. Moreover, because a large number of known samples are needed to train the generator network, GAN cannot effectively extract the features of small data sets. Therefore, GAN cannot be directly applied to virtual sample generation for small sample sets.
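To make the diffusion idea more concrete, a minimal sketch of an MTD-style asymmetric domain expansion for a single attribute is given below. It is an illustration only: the function name, trust level, and simplified formulas are assumptions made here, not the exact equations of the MTD method in [32].

```python
import numpy as np

def expanded_domain(x, trust=1e-20):
    """Simplified MTD-style asymmetric domain expansion for one attribute.

    Illustrative sketch only; the exact formulas of mega-trend-diffusion
    [32] differ in their details.
    """
    x = np.asarray(x, dtype=float)
    a, b = x.min(), x.max()
    centre = (a + b) / 2.0                    # centre of the observed domain
    n_low = max(int((x < centre).sum()), 1)   # samples below the centre
    n_up = max(int((x >= centre).sum()), 1)   # samples above the centre
    skew_low = n_low / (n_low + n_up)         # left diffusion weight
    skew_up = n_up / (n_low + n_up)           # right diffusion weight
    s2 = x.var(ddof=1) if x.size > 1 else 0.0
    # Diffusion reach grows with variance and shrinks with sample count
    reach_low = np.sqrt(-2.0 * (s2 / n_low) * np.log(trust)) if s2 > 0 else 0.0
    reach_up = np.sqrt(-2.0 * (s2 / n_up) * np.log(trust)) if s2 > 0 else 0.0
    lower = centre - skew_low * reach_low
    upper = centre + skew_up * reach_up
    return min(lower, a), max(upper, b)

# Example: expand the domain of a small attribute sample
print(expanded_domain([3.1, 3.4, 3.3, 4.0, 5.2]))
```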
In addition, both the conventional VSG methods and GAN are only applicable to single-mode processes and do not address VSG for multi-mode processes. In fact, most industrial processes operate in multiple modes due to changes in inlet conditions, product demands, and even manual operation. Each operating mode has its own modeling characteristics. Hence, a single global VSG model is no longer applicable and shows limited effectiveness for real multi-mode industrial processes. Therefore, even though VSG can generate more information for constructing predictive models from small sample sets, many problems remain.

The following are the main difficulties that often arise when generating virtual samples.

(1) Industrial processes are often multi-mode; hence, it is difficult to establish a single general VSG model that can be well applied to all modes.

(2) As the data information (the amount of labeled data) is limited, it is difficult to accurately estimate the underlying population distribution of the small sample set. Therefore, it is difficult to determine the boundary of the extended domain range of the VSG model.

To overcome the aforesaid difficulties, a novel Gaussian mixture model based virtual sample generation (GMMVSG) method for small datasets is proposed in this paper, aiming at the VSG problems in multi-mode processes. The novelty and contributions of the proposed GMMVSG method with respect to existing VSG approaches are summarized as follows.

(1) It introduces the idea of local modeling into VSG and provides an alternative data-driven solution for VSG of multi-mode processes, yielding a more plentiful and balanced modeling data set.

(2) Better modeling performance and more accurate predictions are obtained in the multi-mode processes with small sample sets.

(3) It presents a sufficient asymmetric expansion of the acceptable domain using the designed information expansion function considering sample density and amount (IEDA), which draws the virtual sample hyperplanes closer to the population distribution and is helpful to VSG.

The remainder of the paper is structured as follows. The Gaussian mixture model (GMM), the expectation-maximization (EM) algorithm, and the extreme learning machine (ELM) are briefly described in Section 2. In Section 3, the proposed GMMVSG is thoroughly explained. The proposed GMMVSG is applied to two case studies, the Tennessee Eastman benchmark process and an industrial hydrocracking process, in Section 4. Finally, conclusions are provided in Section 5.

Section snippets

Gaussian mixture model

GMM [38], [39] is a probabilistic modeling approach used to represent normally distributed subpopulations within an overall population. Suppose $z \in \mathbb{R}^{d}$ is a sample with $d$ features. If the sample $z$ comes from a Gaussian mixture model, its probability density function can be expressed as Eq. (1):

$$p(z \mid \Omega)=\sum_{k=1}^{K}\omega_{k}\, f(z \mid \theta_{k}) \tag{1}$$

where $K$ represents the number of Gaussian components in the GMM, and $\omega_{k}$ represents the probabilistic weight of the $k$th Gaussian component.

Let $\Omega=\{\{\omega_{1},\mu_{1},\Sigma_{1}\},\{\omega_{2},\mu_{2},\Sigma_{2}\},\ldots,\{\omega_{K},\mu_{K},\Sigma_{K}\}\}$ be the
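As a small numerical illustration of Eq. (1), the following sketch evaluates a two-component mixture density; the weights, means, and covariances are made-up values, not parameters taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(z, weights, means, covs):
    """Evaluate Eq. (1): p(z | Omega) = sum_k w_k * N(z; mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(z, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Made-up two-component parameters (illustrative only)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]

print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```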

The proposed GMM based VSG

In this section, the proposed GMMVSG is thoroughly explained.
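The detailed algorithm is omitted from this snippet. As a rough, hypothetical illustration of the overall idea described in the abstract (fit a GMM via the EM algorithm to capture the operating modes of a small data set, then draw virtual samples from the fitted mixture and restrict them to an expanded acceptable domain), one possible realization based on scikit-learn's GaussianMixture is sketched below; the function name, the clipping step, and the sampling strategy are assumptions, not the authors' exact procedure, which additionally relies on the IEDA expansion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_virtual_samples(X, n_components=2, n_virtual=200, bounds=None, seed=0):
    """Hypothetical sketch: fit a GMM (via EM) to a small multi-mode data
    set, draw virtual samples from the fitted mixture, and optionally clip
    them to an expanded acceptable domain (e.g. IEDA-style bounds)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed).fit(X)
    Xv, modes = gmm.sample(n_virtual)   # virtual inputs and their component labels
    if bounds is not None:              # bounds = (lower, upper), one value per attribute
        Xv = np.clip(Xv, bounds[0], bounds[1])
    return Xv, modes

# Example on synthetic two-mode data (15 + 10 real samples, 3 attributes)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 3)), rng.normal(5.0, 0.5, (10, 3))])
Xv, modes = gmm_virtual_samples(X, n_components=2, n_virtual=100)
print(Xv.shape, np.bincount(modes))
```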

Case studies

The proposed method was applied to two case studies: the Tennessee Eastman (TE) process and an industrial hydrocracking process. In each case study, 30 independent runs are performed and in each run, performance indices, MAPE and error improving rate (EIR), are computed. The expression for EIR is provided in Eq. (21):

$$\mathrm{EIR}=\frac{\mathrm{MAPE}_{\mathrm{before}}-\mathrm{MAPE}_{\mathrm{after}}}{\mathrm{MAPE}_{\mathrm{before}}}\times 100\% \tag{21}$$

The average of MAPEs and EIRs obtained from the 30 independent runs is calculated using the following formulae:

$$\mathrm{AveMAPE}=\frac{\sum_{i=1}^{30}\mathrm{MAPE}_{i}}{30},\qquad \mathrm{AveEIR}=\frac{\sum_{i=1}^{30}\mathrm{EIR}_{i}}{30}$$
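For reference, these indices can be computed as in the sketch below; the per-run MAPE values used here are placeholders, not results from the paper.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

def eir(mape_before, mape_after):
    """Error improving rate, Eq. (21)."""
    return (mape_before - mape_after) / mape_before * 100.0

# Placeholder MAPEs for 30 independent runs (illustrative values only)
rng = np.random.default_rng(0)
mape_before = rng.uniform(8.0, 12.0, 30)   # models trained without virtual samples
mape_after = rng.uniform(5.0, 9.0, 30)     # models trained with virtual samples
ave_mape = mape_after.mean()
ave_eir = np.mean(eir(mape_before, mape_after))
print(ave_mape, ave_eir)
```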

Conclusions

In this paper, a novel VSG method based on the Gaussian mixture model is proposed with the objective of boosting the prediction performance of data-driven models on small datasets. Many virtual sample generation methods have been proposed, but none of them addressed the issue of generating acceptable new samples for multi-mode processes. This research gap is filled by the proposed GMMVSG method. In this study, it is shown that the consideration of both the density and amount of samples

CRediT authorship contribution statement

Ling Li: Conceptualization, Data curation, Methodology, Software, Writing - original draft. Seshu Kumar Damarla: Writing - review & editing. Yalin Wang: Funding acquisition, Project administration, Supervision. Biao Huang: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by National Key Research and Development Program of China (2018AAA0101603), in part by National Natural Science Foundation of China (NSFC) (6210021657, 61590921, U1911401).

References (46)

  • D.-C. Li et al.

    Rebuilding sample distributions for small dataset learning

    Decis. Support Syst.

    (2018)
  • D.-C. Li et al.

    Utilization of virtual samples to facilitate cancer identification for DNA microarray data in the early stages of an investigation

    Inf. Sci.

    (2009)
  • Z. Ge

    Active learning strategy for smart soft sensor development under a small number of labeled data samples

    J. Process Control

    (2014)
  • L. Wu et al.

    The effect of sample size on the grey system model

    Appl. Math. Model.

    (2013)
  • S. Espezua et al.

    A projection pursuit framework for supervised dimension reduction of high dimensional small sample datasets

    Neurocomputing

    (2015)
  • D.-C. Li et al.

    Using structure-based data transformation method to improve prediction accuracies for small data sets

    Decis. Support Syst.

    (2012)
  • J. Yang et al.

    A novel virtual sample generation method based on Gaussian distribution

    Knowl.-Based Syst.

    (2011)
  • D.-C. Li et al.

    Using virtual sample generation to build up management knowledge in the early manufacturing stages

    Eur. J. Oper. Res.

    (2006)
  • D.-C. Li et al.

    Learning management knowledge for manufacturing systems in the early stages using time series data

    Eur. J. Oper. Res.

    (2008)
  • C. Huang

    Principle of information diffusion

    Fuzzy Sets Syst.

    (1997)
  • C. Huang et al.

    A diffusion-neural-network for learning from small samples

    Int. J. Approx. Reason.

    (2004)
  • D.-C. Li et al.

    Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge

    Comput. Oper. Res.

    (2007)
  • D.-C. Li et al.

    A genetic algorithm-based virtual sample generation technique to improve small data set learning

    Neurocomputing

    (2014)