Elsevier

Information Sciences

Volume 551, April 2021, Pages 67-82

Missing value imputation in multivariate time series with end-to-end generative adversarial networks

https://doi.org/10.1016/j.ins.2020.11.035

Abstract

Missing values are inherent in multivariate time series because of multiple reasons, such as collection errors, which deteriorate the performance of follow-up analytic applications on the multivariate time series. Numerous missing value imputation methods have been proposed to mitigate the influence of missing values on multivariate time series analysis. Recently, inspired by the success of generative adversarial networks (GANs) in image generation, the GAN-2-Stage has been used to address the imputation problem with the generative model. Specifically, GAN-2-Stage employs GANs to impute the missing values. However, an extra phase is required to optimize the input random “noise” of the generator. In addition, the imputed values can be very different from real values because of the difficulty in training a GAN and the unstable generation process. Therefore, this paper proposes an end-to-end model to impute the missing values in a multivariate time series. Specifically, we introduce an encoder network into the standard GAN architecture that eliminates the input optimization phase in the GAN-2-Stage. Our generator utilizes real data during training to force the imputed values to be close to the real ones. Experiments on three real-world multivariate time series datasets demonstrate that the proposed model outperforms state-of-the-art methods in imputation tasks and downstream applications, including classification and regression.

Introduction

Multivariate time series are the main resources for data analysis and forecasting in various fields [2]. For instance, touch screen gesture series have been utilized to recognize people [3], sequences of user behaviors in social networks have been used for recommendation [4], and records of patients in hospitals are usually mined to predict the future states of patients [5]. However, time series often contain many missing values because of errors during data collection, the failure of acquisition devices, and human errors. In industrial applications, some data, such as videos, are partially damaged by irregular graphics because of insufficient resolution or the incomplete functioning of equipment [6].

We depict the effect of missing values on the time series analysis task in Fig. 1. Physionet [7] is a public electronic medical record dataset that has been developed using data from the intensive care unit (ICU). Each record contains 41 indicators (variables) sampled per second in the first 48 h after the admission of patients to the ICU. Every sample has a label that indicates if the patient died in the ICU. We used data on the first 48 h to predict whether a patient would die in a hospital. Fig. 1(a) reveals that the missing rates of the original Physionet dataset increase over time. As we can observe, the missing rates in this real scenario are very high. Furthermore, we discard 20%, 30%, ..., 90% of the original data randomly and classify them accordingly. The classification results become worse with the increasing missing rate, as shown in Fig. 1(b). The missing values in the time series data significantly damage the performance of downstream applications, including classification and regression [8].
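The random-discard experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions: missing values are encoded as NaN, and `discard_observed`, `missing_rate`, and the toy series are hypothetical names and data, not taken from the paper.

```python
import math
import random

def discard_observed(series, rate, seed=0):
    """Return a copy of `series` with a fraction `rate` of its observed entries set to NaN."""
    rng = random.Random(seed)
    # Positions that currently hold a real observation (not NaN).
    observed = [(i, j) for i, row in enumerate(series)
                for j, v in enumerate(row) if not math.isnan(v)]
    n_drop = round(rate * len(observed))
    dropped = set(rng.sample(observed, n_drop))
    return [[math.nan if (i, j) in dropped else v for j, v in enumerate(row)]
            for i, row in enumerate(series)]

def missing_rate(series):
    """Fraction of entries that are NaN."""
    total = sum(len(row) for row in series)
    missing = sum(math.isnan(v) for row in series for v in row)
    return missing / total

# Hypothetical toy series: 4 time steps, 5 features, fully observed.
toy = [[float(i * 5 + j) for j in range(5)] for i in range(4)]
for rate in (0.2, 0.5, 0.9):
    masked = discard_observed(toy, rate)
    print(rate, missing_rate(masked))
```

A downstream classifier would then be trained on each masked copy to trace how accuracy changes with the missing rate.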

Imputation of missing values for a multivariate time series is useful for various applications involving data analysis. In video portals, the acquisition of user-click behavior for web pages may be affected by network congestion, resulting in missing user behavior records. Therefore, filling in missing values in records can help websites push video content to users. In hospitals, devices, such as electrocardiogram machines, may not be properly connected to the patient, resulting in missing parts in patient records. This scenario can affect doctors’ diagnoses. Imputing meaningful values to obtain complete data can help doctors pinpoint the cause. In the internet-of-things, the location of sensor nodes affects the transmission of environment information. If the node is far from the cluster head, it will have a high packet loss rate while transferring information, thereby resulting in missing values. Imputation methods can be employed to obtain complete sensor data to evaluate regional environment information.

Deep learning methods have been proposed that impute missing values during training while making full use of the temporal relations between observations. GRU-D [9] uses a mask to indicate whether data are missing as a part of the input and then trains a model with the masked data. BRITS [10] is a multi-task model for imputing missing values while performing downstream tasks. Recent studies have explored the use of generative models for the imputation of missing values. GAIN [11] was proposed to impute missing values with generative adversarial networks (GANs). However, the focus was on non-sequential datasets, and pertinent measures to process temporal relations were not adopted. We developed GAN-2-Stage [12] composed of the standard GRUI [12] as a generative model to impute the incomplete time series data. It achieved state-of-the-art results on the imputation task and downstream applications.

However, GAN-2-Stage has two main disadvantages:

  • 1) Time Consuming. It includes two phases for training the model and generating the imputation values. Depending on the input “noise” vector, it needs extra time to optimize the “noise” vector for reasonable imputation values.

  • 2) Error propagation. The generator of GAN-2-Stage uses the generated one-step-ahead values as input of the current time. Therefore, if the imputation value generated at some point is far from the real data, the following imputation values will always be inaccurate. This affects the performance of the downstream tasks [13].

Our preliminary version [1] first proposed an end-to-end architecture for time series imputation by combining an autoencoder and a GAN. In this paper, we not only introduce an encoder network into a standard GAN but also propose a stable generation method. Instead of sampling a random “noise” vector from the latent space and optimizing it for reasonable imputation values, we introduce an encoder network based on GRUI to compress the original incomplete time series samples to low-dimensional vectors. The generator uses a compressed vector from the encoder to generate imputation values in a self-feed manner. Inspired by the Professor Forcing algorithm [14], which introduced real data into the training of recurrent networks, we propose a method called real-data forcing (RF). RF applies a combination of real data and generated data in the generator of GAN. Owing to the advantages of the encoder network and RF, our model can overcome the disadvantages of conventional imputation methods, and it exhibits state-of-the-art performance in terms of imputation and classification/regression tasks.
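The text here does not spell out how RF combines real and generated data. One plausible reading, assumed purely for illustration, is that the input to the next generator step takes the real value wherever it was observed and falls back to the generated value elsewhere. The function name `next_generator_input` and the NaN encoding of missing values are hypothetical.

```python
import math

def next_generator_input(real_step, generated_step):
    """Mix one time step: keep observed real values, use generated ones where real is NaN.

    This mixing rule is an assumption made for this sketch, not the paper's stated method.
    """
    return [g if math.isnan(r) else r for r, g in zip(real_step, generated_step)]

real = [1.0, math.nan, 3.0]    # the second feature is missing at this step
generated = [0.7, 2.2, 3.9]    # hypothetical generator output for the same step
mixed = next_generator_input(real, generated)
print(mixed)
```

Under this reading, an inaccurate generated value affects later steps only at positions where no real observation is available, which is the kind of error propagation the second disadvantage describes.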

The main contributions of this paper are as follows:

  • 1) The existing generative model that has a time-consuming extra phase for optimizing the “noise” vector is analyzed. As we highlighted, the generation process is always unstable with one-step-ahead prediction.

  • 2) We apply an encoder network to GAN-2-Stage, which removes the need to optimize the “noise” vector. To make the generation phase more stable, we propose the RF method and apply it to the generator network.

  • 3) Our model is applied to three real-world datasets and achieves state-of-the-art results in the imputation and downstream prediction tasks. Auxiliary experiments demonstrate the superiority of the proposed model over the baselines in terms of time.

The rest of this paper is organized as follows. In Section 2, we discuss related studies on imputation methods and time series analysis. We formulate our problem and introduce a related model in Section 3. Our model is described in detail in Section 4. We conduct intrinsic and extrinsic evaluations and compare the influencing factors, such as the training epoch and hyper-parameter experiments, in Section 5. Finally, Section 6 concludes this paper.

Section snippets

Traditional imputation methods

Many researchers have proposed useful methods to handle missing values. Some simple imputation methods have attempted to impute missing values statistically. The missing values can be replaced with mean values, median values, and so on [15]. Akouemo et al. proposed an autoregressive integrated moving average with an exogenous input model to extract the characteristics of the time series and impute the anomalous data [16]. Some machine learning-based imputation methods have been proposed to
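As a concrete instance of the simple statistical imputation mentioned above, the sketch below fills each feature's missing entries with that feature's observed mean or median. The function name `impute_columns`, the NaN encoding, and the toy data are assumptions made for illustration.

```python
import math
import statistics

def impute_columns(series, strategy="mean"):
    """Fill NaN entries in each column with that column's observed mean or median."""
    reducer = statistics.mean if strategy == "mean" else statistics.median
    n_cols = len(series[0])
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in series if not math.isnan(row[j])]
        # Fall back to 0.0 when a feature is never observed (an arbitrary choice).
        fill.append(reducer(observed) if observed else 0.0)
    return [[fill[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in series]

nan = math.nan
# Hypothetical toy series: 4 time steps, 2 features.
toy = [[1.0, nan],
       [3.0, 10.0],
       [8.0, 20.0],
       [nan, nan]]
mean_filled = impute_columns(toy, "mean")
median_filled = impute_columns(toy, "median")
print(mean_filled[3], median_filled[3])
```

Such fills ignore the ordering of the time steps entirely, which is one reason temporal models are of interest for this task.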

Problem formulation

In this study, we focus on the time series data with missing values. Some notations are defined to describe this problem.

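A $d$-dimensional time series observed in $T=(t_1,\ldots,t_n)$ can be described as a sequence $X=(x_{t_1},\ldots,x_{t_i},\ldots,x_{t_n})^{\top}\in\mathbb{R}^{n\times d}$. Therefore, the observation of $X$ at $t_i$ is defined as $x_{t_i}$, and $x_{t_i}^{j}$ is the $j$th feature in $x_{t_i}$. In the following example, $d=6$, $n=3$, and “none” is a missing value:

$$X=\begin{bmatrix}31 & \text{none} & \text{none} & 32 & 27 & 32\\ 6 & 17 & \text{none} & \text{none} & \text{none} & 13\\ \text{none} & 107 & \text{none} & 87 & 66 & 90\end{bmatrix},\qquad T=\begin{bmatrix}0\\ 0.3\\ 1.2\end{bmatrix}.$$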
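The notation can be made concrete with a small sketch. The toy series below is hypothetical and smaller than the example in the text (only the timestamps 0, 0.3, 1.2 are reused). Encoding missing entries as NaN, deriving a binary observed-mask, and computing the gaps between timestamps are assumptions made for illustration.

```python
import math

nan = math.nan
# Hypothetical toy series with n = 3 observations of d = 2 features.
X = [[1.0, nan],
     [nan, 4.0],
     [5.0, 6.0]]
T = [0.0, 0.3, 1.2]  # timestamps need not be evenly spaced

def observed_mask(series):
    """1 where a value is observed, 0 where it is missing."""
    return [[0 if math.isnan(v) else 1 for v in row] for row in series]

def intervals(timestamps):
    """Gap between consecutive observations; the first gap is set to 0 by convention here."""
    return [0.0] + [b - a for a, b in zip(timestamps, timestamps[1:])]

M = observed_mask(X)
gaps = intervals(T)
print(M)
print(gaps)
```

Each row of `M` lines up with one timestamp in `T`, so a model can tell both which features were recorded and how much time passed since the previous observation.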

To represent the missing values in X, we

Method

In this section, we introduce the end-to-end GAN with RF (E2GAN-RF). Our model consists of an encoder network, a generator network, and a discriminator network, all of which are based on GRUI. We avoid using the “noise” vector that has been randomly sampled from the latent space as the first input of the generator. An encoder network is applied to obtain a low-dimensional representation of the original time series. We do not optimize the “noise” vector because the vector compressed by the

Experiments

Our model has a wide variety of applications. We evaluated our method on three real-world datasets and compared its performance with those of other baseline methods. The detailed items of the datasets are presented in Table 1.

Conclusion

In this paper, we propose an end-to-end GAN with RF to impute missing values using an efficient and stable generation process. GAN-2-Stage requires an extra process to optimize the “noise” vector to generate imputation values correlated with the input data. However, our E2GAN-RF compresses the input data into a low-dimensional representation as the initialization of the generator which relates the imputation values with the original data. In addition, GAN-2-Stage has an unstable generation

CRediT authorship contribution statement

Ying Zhang: Conceptualization, Methodology. Baohang Zhou: Software, Visualization, Writing - original draft. Xiangrui Cai: Formal analysis, Investigation, Writing - review & editing. Wenya Guo: Writing - review & editing. Xiaoke Ding: Writing - review & editing. Xiaojie Yuan: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank Ya Guo and Yonghong Luo for their valuable suggestions regarding this manuscript. This work is supported by the Chinese Scientific and Technical Innovation Project 2030 (2018AAA0102100), NSFC-General Technology Joint Fund for Basic Research (No. U1936206, No. U1836109), National Natural Science Foundation of China (No. 62002178 and No. 61772289), and Natural Science Foundation of Tianjin, China (No. 20JCQNJC01730).

References (50)

  • S. Xiao et al., Learning time series associated event sequences with recurrent point process networks, TNNLS (2019)
  • Q. Shi et al., Feature extraction for incomplete data via low-rank tensor decomposition with feature regularization, TNNLS (2019)
  • I. Silva, G. Moody, D.J. Scott, L.A. Celi, R.G. Mark, Predicting in-hospital mortality of ICU patients: The...
  • Q. Lan, X. Xu, H. Ma, G. Li, Multivariable data imputation for the analysis of incomplete credit data, Expert Syst....
  • Z. Che et al., Recurrent neural networks for multivariate time series with missing values, Sci. Rep. (2018)
  • W. Cao, D. Wang, J. Li, H. Zhou, L. Li, Y. Li, BRITS: Bidirectional recurrent imputation for time series, in: NeurIPS,...
  • J. Yoon, J. Jordon, M. van der Schaar, GAIN: Missing data imputation using generative adversarial nets, in: ICML, 2018,...
  • Y. Luo, X. Cai, Y. Zhang, J. Xu, X. Yuan, Multivariate time series imputation with generative adversarial networks, in:...
  • A.M. Lamb, A. Goyal, Y. Zhang, S. Zhang, A.C. Courville, Y. Bengio, Professor forcing: A new algorithm...
  • E. Acuna et al., The treatment of missing values and its effect on classifier accuracy
  • H.N. Akouemo, R.J. Povinelli, Time series outlier detection and imputation, in: IEEE PES General Meeting | Conference &...
  • F.V. Nelwamondo et al., Missing data: A comparison of neural network and expectation maximization techniques, Curr. Sci. (2007)
  • T. Hastie et al., Matrix completion and low-rank SVD via fast alternating least squares, J. Mach. Learn. Res. (2015)
  • X. Zhang, C. Yan, C. Gao, B.A. Malin, Y. Chen, XGBoost imputation for time series data, in: ICHI, IEEE, 2019, pp....
  • M. Peña, P. Ortega, M. Orellana, A novel imputation method for missing values in air pollutant time series data, in:...