Missing value imputation in multivariate time series with end-to-end generative adversarial networks
Introduction
Multivariate time series are the main resources for data analysis and forecasting in various fields [2]. For instance, touch screen gesture series have been utilized to recognize individuals [3], sequences of user behaviors in social networks have been used for recommendation [4], and records of patients in hospitals are usually mined to predict the future states of patients [5]. However, time series often contain many missing values because of errors during data collection, the failure of acquisition devices, and human errors. In industrial applications, some data, such as videos, are partially damaged by irregular graphics because of insufficient resolution or the incomplete functioning of equipment [6].
We depict the effect of missing values on the time series analysis task in Fig. 1. Physionet [7] is a public electronic medical record dataset that has been developed using data from the intensive care unit (ICU). Each record contains 41 indicators (variables) sampled per second in the first 48 h after the admission of patients to the ICU. Every sample has a label that indicates whether the patient died in the ICU. We used data from the first 48 h to predict whether a patient would die in the hospital. Fig. 1(a) reveals that the missing rates of the original Physionet dataset increase over time and that the missing rates in this real scenario are very high. Furthermore, we randomly discard a portion of the original data and classify the remaining data. The classification results become worse as the missing rate increases, as shown in Fig. 1(b). Missing values in time series data thus significantly damage the performance of downstream applications, including classification and regression [8].
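The random discarding described above can be sketched as follows. This is a minimal illustration only, not the authors' experimental code: the helper names `drop_observations` and `missing_rate`, the toy data, and the use of `None` for a missing entry are all assumptions made for the example.

```python
import random


def drop_observations(series, rate, seed=0):
    """Return a copy of `series` with a fraction `rate` of its observed entries set to None."""
    rng = random.Random(seed)
    # Collect the (time step, variable) positions that currently hold a value.
    observed = [(i, j) for i, row in enumerate(series)
                for j, v in enumerate(row) if v is not None]
    n_drop = round(rate * len(observed))
    dropped = set(rng.sample(observed, n_drop))
    return [[None if (i, j) in dropped else v for j, v in enumerate(row)]
            for i, row in enumerate(series)]


def missing_rate(series):
    """Fraction of entries in `series` that are missing (None)."""
    total = sum(len(row) for row in series)
    missing = sum(v is None for row in series for v in row)
    return missing / total


# Hypothetical toy record: 5 time steps, 4 variables, fully observed.
toy = [[float(i + j) for j in range(4)] for i in range(5)]
for rate in (0.1, 0.5, 0.9):
    print(rate, missing_rate(drop_observations(toy, rate)))
```

Running a downstream classifier on the output for each rate would give a curve of the kind the text attributes to Fig. 1(b); the classifier itself is outside the scope of this sketch.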
Imputation of missing values for a multivariate time series is useful for various applications involving data analysis. In video portals, the acquisition of user-click behavior for web pages may be affected by network congestion, resulting in missing user behavior records. Therefore, filling in missing values in records can help websites push video content to users. In hospitals, devices, such as electrocardiogram machines, may not be properly connected to the patient, resulting in missing parts in patient records. This scenario can affect doctors’ diagnoses. Imputing meaningful values to obtain complete data can help doctors pinpoint the cause. In the internet-of-things, the location of sensor nodes affects the transmission of environment information. If the node is far from the cluster head, it will have a high packet loss rate while transferring information, thereby resulting in missing values. Imputation methods can be employed to obtain complete sensor data to evaluate regional environment information.
Deep learning methods have been proposed that impute missing values during training and make full use of the temporal relations between observations. GRU-D [9] uses a mask that indicates whether data are missing as a part of the input and then trains a model with the masked data. BRITS [10] is a multi-task model for imputing missing values while performing downstream tasks. Recent studies have explored the use of generative models for the imputation of missing values. GAIN [11] was proposed to impute missing values with generative adversarial networks (GANs). However, its focus was on non-sequential datasets, and it did not adopt measures to process temporal relations. We developed GAN-2-Stage [12], which uses the standard GRUI [12] in a generative model to impute incomplete time series data. It achieved state-of-the-art results on the imputation task and downstream applications.
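The idea of feeding a missingness mask to the model as part of the input can be illustrated with a small sketch. It is not the GRU-D implementation: the function name `with_mask`, the zero fill value, and the convention that 1 marks an observed entry are assumptions made here.

```python
def with_mask(series, fill=0.0):
    """Append a 0/1 observation mask to every time step of `series`.

    Missing entries (None) are replaced by `fill` so the row is numeric,
    and the mask tells the model which entries were really observed.
    """
    rows = []
    for step in series:
        mask = [0.0 if v is None else 1.0 for v in step]
        values = [fill if v is None else v for v in step]
        rows.append(values + mask)
    return rows


# Hypothetical series: 3 time steps, 2 variables, None marks a missing value.
x = [[1.0, None], [None, 4.0], [5.0, 6.0]]
for row in with_mask(x):
    print(row)
```

Each output row holds the filled values followed by their mask, so a recurrent model receiving it can distinguish a true zero from a placeholder.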
However, GAN-2-Stage has two main disadvantages:
1) Time consumption. It includes two phases for training the model and generating the imputation values. Because its output depends on the input “noise” vector, it needs extra time to optimize that vector for reasonable imputation values.
2) Error propagation. The generator of GAN-2-Stage uses the value generated at the previous step as the input at the current time step. Therefore, if an imputation value generated at some point is far from the real data, the subsequent imputation values will also be inaccurate. This affects the performance of the downstream tasks [13].
Our preliminary version [1] first proposed an end-to-end architecture for time series imputation by combining an autoencoder and a GAN. In this paper, we not only introduce an encoder network into a standard GAN but also propose a stable generation method. Instead of sampling a random “noise” vector from the latent space and optimizing it for reasonable imputation values, we introduce an encoder network based on GRUI to compress the original incomplete time series samples to low-dimensional vectors. The generator uses a compressed vector from the encoder to generate imputation values in a self-feed manner. Inspired by the Professor Forcing algorithm [14], which introduced real data into the training of recurrent networks, we propose a method called real-data forcing (RF). RF applies a combination of real data and generated data in the generator of the GAN. Owing to the advantages of the encoder network and RF, our model can overcome the disadvantages of conventional imputation methods, and it exhibits state-of-the-art performance in terms of imputation and classification/regression tasks.
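The difference between pure self-feeding and mixing real data into the generator input can be illustrated with a toy rollout. This is a sketch under strong assumptions, not the RF formulation of the paper: the one-step predictor `prev + 1.0` is a hypothetical stand-in for a trained generator cell, and the rule "use the real value whenever it is observed, otherwise the generated value" is one plausible way to combine real and generated data.

```python
def generate(first_input, observed, use_real_data):
    """Roll a toy one-step predictor forward over len(observed) steps.

    observed[t] is the real value at step t, or None when it is missing.
    """
    outputs = []
    prev = first_input
    for x_t in observed:
        y_t = prev + 1.0  # hypothetical stand-in for a trained generator cell
        outputs.append(y_t)
        if use_real_data and x_t is not None:
            prev = x_t    # assumed mixing rule: feed the real value when observed
        else:
            prev = y_t    # self-feed: the generated value becomes the next input
    return outputs


truth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # complete toy series
observed = [1.0, None, 3.0, None, None, 6.0]  # same series with gaps
bad_start = 2.0                               # a correct first input would be 0.0

self_fed = generate(bad_start, observed, use_real_data=False)
forced = generate(bad_start, observed, use_real_data=True)
print([abs(y - t) for y, t in zip(self_fed, truth)])
print([abs(y - t) for y, t in zip(forced, truth)])
```

In this toy setting the self-fed rollout carries the initial error through every step, whereas the rollout that is fed real observations recovers as soon as one is available.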
The main contributions of this paper are as follows:
1) We analyze the existing generative model, which has a time-consuming extra phase for optimizing the “noise” vector. We also highlight that its generation process is unstable under one-step-ahead prediction.
2) We apply an encoder network to GAN-2-Stage and remove the need to optimize the “noise” vector. To make the generation phase more stable, we propose the RF method and apply it to the generator network.
3) Our model is applied to three real-world datasets and achieves state-of-the-art results in the imputation and downstream prediction tasks. Auxiliary experiments demonstrate the superiority of the proposed model over the baselines in terms of time.
The rest of this paper is organized as follows. In Section 2, we discuss related studies on imputation methods and time series analysis. We formulate our problem and introduce a related model in Section 3. Our model is described in detail in Section 4. We conduct intrinsic and extrinsic evaluations and compare the influencing factors, such as the training epoch and hyper-parameter experiments, in Section 5. Finally, Section 6 concludes this paper.
Section snippets
Traditional imputation methods
Many researchers have proposed useful methods to handle missing values. Some simple imputation methods have attempted to impute missing values statistically. The missing values can be replaced with mean values, median values, and so on [15]. Akouemo et al. proposed an autoregressive integrated moving average with an exogenous input model to extract the characteristics of the time series and impute the anomalous data [16]. Some machine learning-based imputation methods have been proposed to
Problem formulation
In this study, we focus on the time series data with missing values. Some notations are defined to describe this problem.
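A d-dimensional time series $X$ observed at timestamps $T = (t_0, \ldots, t_{n-1})$ can be described as a sequence $X = (x_{t_0}, \ldots, x_{t_i}, \ldots, x_{t_{n-1}})$. Therefore, the observation of $X$ at $t_i$ is defined as $x_{t_i}$, and $x_{t_i}^{j}$ is the jth feature in $x_{t_i}$. In the following example, “none” is a missing value,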
To represent the missing values in X, we
Method
In this section, we introduce the end-to-end GAN with RF (GAN-RF). Our model consists of an encoder network, a generator network, and a discriminator network, all of which are based on GRUI. We avoid using the “noise” vector that has been randomly sampled from the latent space as the first input of the generator. An encoder network is applied to obtain a low-dimensional representation of the original time series. We do not optimize the “noise” vector because the vector compressed by the
Experiments
Our model has a wide variety of applications. We evaluated our method on three real-world datasets and compared its performance with that of baseline methods. The details of the datasets are presented in Table 1.
Conclusion
In this paper, we propose an end-to-end GAN with RF to impute missing values using an efficient and stable generation process. GAN-2-Stage requires an extra process to optimize the “noise” vector to generate imputation values correlated with the input data. However, our GAN-RF compresses the input data into a low-dimensional representation as the initialization of the generator which relates the imputation values with the original data. In addition, GAN-2-Stage has an unstable generation
CRediT authorship contribution statement
Ying Zhang: Conceptualization, Methodology. Baohang Zhou: Software, Visualization, Writing - original draft. Xiangrui Cai: Formal analysis, Investigation, Writing - review & editing. Wenya Guo: Writing - review & editing. Xiaoke Ding: Writing - review & editing. Xiaojie Yuan: Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank Ya Guo and Yonghong Luo for their valuable suggestions regarding this manuscript. This work is supported by the Chinese Scientific and Technical Innovation Project 2030 (2018AAA0102100), NSFC-General Technology Joint Fund for Basic Research (No. U1936206, No. U1836109), National Natural Science Foundation of China (No. 62002178 and No. 61772289), and Natural Science Foundation of Tianjin, China (No. 20JCQNJC01730).
References (50)
- et al., A novel ensemble deep learning model with dynamic error correction and multi-objective ensemble pruning for time series forecasting, Inf. Sci. (2021)
- et al., Person recognition based on touch screen gestures using computational intelligence methods, Inf. Sci. (2017)
- et al., Time-semantic-aware poisson tensor factorization approach for scalable hotel recommendation, Inf. Sci. (2019)
- et al., Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values, Comput. Biol. Med. (2015)
- et al., Sample generation based on a supervised wasserstein generative adversarial network for high-resolution remote-sensing scene classification, Inf. Sci. (2020)
- et al., PAN: pipeline assisted neural networks model for data-to-text generation in social internet of things, Inf. Sci. (2020)
- et al., Image captioning by incorporating affective concepts learned from both visual and textual components, Neurocomputing (2019)
- et al., Model-coupled autoencoder for time series visualisation, Neurocomputing (2016)
- et al., A generative model for category text generation, Inf. Sci. (2018)
- et al., Egan: end-to-end generative adversarial network for multivariate time series imputation, IJCAI (2019)
- Learning time series associated event sequences with recurrent point process networks, TNNLS
- Feature extraction for incomplete data via low-rank tensor decomposition with feature regularization, TNNLS
- Recurrent neural networks for multivariate time series with missing values, Sci. Rep.
- The treatment of missing values and its effect on classifier accuracy
- Missing data: a comparison of neural network and expectation maximization techniques, Curr. Sci.
- Matrix completion and low-rank svd via fast alternating least squares, J. Mach. Learn. Res.