Efficient online learning with improved LSTM neural networks

https://doi.org/10.1016/j.dsp.2020.102742

Abstract

We introduce efficient online learning algorithms based on Long Short-Term Memory (LSTM) networks that employ covariance information. In particular, we introduce the covariance of the present and one-time-step-past input vectors into the gating structure of the LSTM networks. Additionally, we include the covariance of the output vector, and we learn the corresponding weight matrices, for which we also provide the updates, to improve the learning performance of the LSTM networks. We reduce the number of system parameters through weight matrix factorization, converting each LSTM weight matrix into two smaller matrices in order to achieve high learning performance with low computational complexity. Moreover, we apply the introduced approach to the Gated Recurrent Unit (GRU) architecture. In our experiments, we illustrate significant performance improvements achieved by our methods on real-life datasets with respect to the vanilla LSTM and vanilla GRU networks.

Introduction

In the modern era, online learning has become increasingly important and is extensively studied in the literature due to its wide range of application areas, including machine learning, signal processing and network intrusion detection [1], [2], [3]. In this paper, we study online learning for sequentially observed data in a nonlinear regression problem. We aim to sequentially learn the parameters of a system in time with low computational complexity.

In the computational learning literature, linear modeling approaches are used in various applications due to their low computational complexity [4]. However, such approaches are not suitable for modeling real-life signals since they are usually not capable of capturing the complex nonlinear relationships in the signal [1]. Therefore, nonlinear approaches are used in a wide range of applications due to their high modeling power [5], [6], [7], [8]. Among the nonlinear approaches, Deep Neural Networks (DNNs) are extensively used [9], [10]. Nevertheless, DNN-based methods suffer from limited performance in modeling temporal data due to their lack of memory [11]. In order to mitigate this issue, Recurrent Neural Networks (RNNs), which are able to handle long-term data dependencies, were introduced [12], [13]. These methods, on the other hand, suffer from exploding and vanishing gradients during the backpropagation of the error [14]. Long Short-Term Memory (LSTM) networks solve this problem thanks to their gating architectures that control the flow of information [15]. Here, we introduce efficient online learning algorithms using LSTM networks. Furthermore, since the gating mechanism of the LSTM network is its backbone, we incorporate additional information into the gating mechanism that helps in learning and modeling complex processes effectively while keeping the computational complexity low [15].

There are many applications, such as genetic protein therapy and housing prices, where there is a direct or indirect link between the past and the present data instances [16]. In such cases, using this relation, which is captured by the covariance information, increases the prediction performance in an efficient manner. We utilize this covariance information in our LSTM networks through the gating structures. Furthermore, we also learn the covariance weight matrices during the training process. Hence, we improve the learning performance of the LSTM networks by not only utilizing the covariance information but also learning its relation to the desired output. We call these improved LSTM networks based on covariance information “Co-LSTM” networks.
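
To make the idea concrete, the following is a minimal NumPy sketch of one covariance-augmented gate. The exact Co-LSTM formulation is given in Section 3.1; the way the cross-covariance term x_t x_{t-1}^T enters the gate here (folded back to input dimension before a learned map V) and all variable names are our illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, m = 8, 16          # input and hidden dimensions (illustrative)
rng = np.random.default_rng(0)

# Standard LSTM gate parameters plus one extra covariance weight matrix.
W = rng.normal(scale=0.1, size=(m, p))   # input weights
U = rng.normal(scale=0.1, size=(m, m))   # recurrent weights
V = rng.normal(scale=0.1, size=(m, p))   # learned covariance weights
b = np.zeros(m)

def co_gate(x_t, x_prev, h_prev):
    """One gate of a covariance-augmented LSTM (a sketch).

    The cross-covariance of the present and one-step-past inputs,
    C_t = x_t x_{t-1}^T, is folded back to input dimension by
    applying it to x_t before the learned map V.
    """
    C_t = np.outer(x_t, x_prev)                  # p x p cross-covariance term
    return sigmoid(W @ x_t + U @ h_prev + V @ (C_t @ x_t) + b)

x_t, x_prev = rng.normal(size=p), rng.normal(size=p)
h_prev = np.zeros(m)
print(co_gate(x_t, x_prev, h_prev).shape)        # (16,)
```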

Although including the covariance information in the gating structure of the LSTM networks increases the modeling power, and hence the learning performance, it also increases the time and computational complexity. This increase results from the additional input and output weight and covariance matrices of the Co-LSTM networks. In order to keep the complexity of the resulting network low, we incorporate the Weight Matrix Factorization (WMF) method into the Co-LSTM network weight matrices [17]. The objective of the WMF is to break a higher-rank matrix down into two lower-rank matrices. Hence, we significantly reduce the time and computational complexities by applying the WMF to the Co-LSTM network weight matrices while maintaining the overall accuracy and performance of the system.
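
The parameter saving from such a factorization is easy to quantify: an m × n weight matrix with mn entries is replaced by the product of an m × r and an r × n matrix with r(m + n) entries, a reduction whenever r < mn/(m + n). Below is a hedged sketch; the shapes and the truncated-SVD initialization are our illustrative choices (in training, the two factors would simply be learned directly), not necessarily the paper's procedure.

```python
import numpy as np

m, n, r = 128, 128, 16
rng = np.random.default_rng(1)
W = rng.normal(size=(m, n))                 # original full weight matrix

# Factorize W into two smaller matrices W1 (m x r) and W2 (r x n).
# A truncated SVD gives the best rank-r approximation in Frobenius
# norm; in training, W1 and W2 would instead be learned directly.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :r] * s[:r]                       # m x r
W2 = Vt[:r, :]                              # r x n

full_params = m * n                         # 16384
factored_params = r * (m + n)               # 4096, a 4x reduction here
print(full_params, factored_params)

x = rng.normal(size=n)
# Applying the factored weights: two cheap multiplies replace one large one.
y = W1 @ (W2 @ x)
```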

In the learning literature, neural network based approaches are extensively used due to their high capability in modeling data [18]. In particular, LSTM networks provide significantly high learning performance thanks to their ability to learn long-term data dependencies with their gating-based architectures [19], [15]. However, as a result of their performance-complexity trade-off, LSTM networks are not fully utilized in nonlinear regression tasks. In this paper, we solve the nonlinear regression problem with our novel LSTM networks that bypass this trade-off. With our improved structure with WMF, we achieve higher learning performance and lower training time compared to the vanilla LSTM, as we illustrate through our experiments.

The covariance matrix provides useful information about the data, which makes it highly valuable for learning tasks. However, its use in neural network based methods is highly limited due to their black-box nature. The most common approach in the literature to utilizing the covariance matrix in neural networks is to use it for preprocessing, such as dimensionality reduction [20], [21]. Nonetheless, such preprocessing steps may degrade the learning performance due to information loss. In this paper, we directly employ the covariance matrix along with the data, without any information loss, in our novel LSTM structure, which results in high learning performance.
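
For contrast, here is a minimal sketch of that conventional preprocessing route: PCA computed from the sample covariance matrix. Projecting onto the top-k eigenvectors reduces the dimension but discards the variance in the remaining directions, which is exactly the information loss noted above (the toy data and the choice k = 4 are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))  # correlated data
X = X - X.mean(axis=0)                     # center before PCA

cov = (X.T @ X) / (len(X) - 1)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
k = 4
P = eigvecs[:, -k:]                        # top-k principal directions

X_reduced = X @ P                          # preprocessed, lower-dim input
X_back = X_reduced @ P.T                   # best rank-k reconstruction
loss = np.linalg.norm(X - X_back) ** 2 / np.linalg.norm(X) ** 2
print(f"fraction of variance discarded: {loss:.3f}")
```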

Matrix factorization is widely used in neural network based algorithms for compression purposes [22], [23]. These algorithms decrease the training time by reducing the number of training parameters that are naturally introduced with the basic neural network architecture [22]. However, they provide low performance since the direct application of this method degrades the learning capacity. In this paper, we apply matrix factorization to our improved Co-LSTM structure, which provides high learning performance. Hence, even though we observe a slight degradation through WMF, we still achieve significantly high performance.

The main contributions of this paper are as follows:

  • We propose effective online learning algorithms based on the LSTM networks where we incorporate input and output covariance information, which captures the relationship between the previous and present data instances, into the gating structure.

  • We assign a weight matrix to each covariance matrix in our improved LSTM based gating structure and learn all the weights through the sequential updates where we enhance the regression performance.

  • In order to alleviate the time and computational complexity, we use the Weight Matrix Factorization (WMF) method, through which we elegantly control the trade-off between computational complexity and performance.

  • Through our experiments on real-life datasets, we illustrate significant performance improvements of the introduced Co-LSTM and Co-GRU networks over the vanilla LSTM and vanilla GRU networks, respectively.

This paper is organized as follows. In Section 2, we first describe our problem setting and the model formulation. We then introduce our models in Section 3.1 and provide the parameter updates in Section 3.2. In Section 3.3, we describe the Weight Matrix Factorization (WMF) method used to reduce the computational complexity. We provide the updates for the reformulation of our method based on the WMF in Section 3.4. In Section 4, we demonstrate the performance improvements of our methods on real datasets. Finally, we conclude with our remarks in Section 5.

Section snippets

Problem description

In this paper, all vectors are column vectors, and they are denoted by boldface lowercase letters. Matrices are represented by boldface uppercase letters. For a vector $\mathbf{u}$, $\mathbf{u}^T$ is the ordinary transpose. For a vector $\mathbf{x}_t$, $x_{t,i}$ is the $i$th element of $\mathbf{x}_t$. Similarly, for a matrix $\mathbf{W}$, $w_{ij}$ is the entry in the $i$th row and $j$th column.

We sequentially observe $\mathbf{x}_t \in \mathbb{R}^p$ along with the desired output $d_t$ at each discrete time step $t$, and we aim to estimate the output as $\hat{d}_t$. Here, $\hat{d}_t$ depends only on the current and past inputs $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t\}$.
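
This online setting amounts to a predict-then-update loop: at each step the learner outputs $\hat{d}_t$ before observing $d_t$, suffers the instantaneous loss, and updates its parameters. A self-contained sketch follows, with a plain linear predictor standing in for the Co-LSTM of Section 3; the predictor, the squared-error loss and the learning rate are illustrative assumptions.

```python
import numpy as np

class LinearPredictor:
    """Placeholder predictor; the paper's actual predictor is the Co-LSTM."""
    def __init__(self, p):
        self.w = np.zeros(p)
    def predict(self, x):
        return self.w @ x
    def update(self, x, err, lr):
        self.w += lr * err * x              # SGD step on squared loss

def online_learn(stream, model, lr=0.01):
    losses = []
    for x_t, d_t in stream:
        d_hat = model.predict(x_t)          # predict before seeing d_t
        err = d_t - d_hat
        losses.append(err ** 2)             # instantaneous squared loss
        model.update(x_t, err, lr)          # sequential parameter update
    return losses

# Toy stream: d_t generated by a fixed linear rule plus noise.
rng = np.random.default_rng(2)
p, w_true = 4, np.array([1.0, -2.0, 0.5, 3.0])
stream = [(x, w_true @ x + 0.01 * rng.normal())
          for x in rng.normal(size=(500, p))]
losses = online_learn(stream, LinearPredictor(p))
print(np.mean(losses[:50]), np.mean(losses[-50:]))  # loss should shrink
```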

Efficient online learning algorithms with covariance information and WMF

In this section, we present our LSTM models based on the weighted input and output covariances. Then, we apply the weight matrix factorization method to our proposed models to achieve low computational complexity. Finally, we provide the stochastic gradient based weight updates for the introduced method.

Experiments

In this section, we illustrate the regression performance of the proposed models on real-life datasets including Kinematics [27], Elevators [28], Protein Tertiary [29] and Puma8NH [30]. We demonstrate the classification performance of the proposed models on two real-life dynamic datasets, i.e., the ISCX IDS 2012 [31] and Pen-Based Handwritten Digits [32] datasets. Moreover, we compare the proposed models with the state-of-the-art methods based on their performance on image caption generation [33] …

Conclusion

We have introduced efficient online learning algorithms based on the LSTM network architecture. We have utilized the input and the output covariance information and introduced them into the gating structure of the vanilla LSTM networks. We have proposed three variants of the introduced Co-LSTM network approach where we assign weight matrices to the input and output covariance matrices and learn them in a sequential manner. Furthermore, we have used the matrix factorization method on the Co-LSTM …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (57)

  • N. Cesa-Bianchi, et al., Prediction, Learning, and Games (2006)
  • K. Greff, et al., LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. (2017)
  • M. Hermans, et al., Training and analysing deep recurrent neural networks
  • J. Mazumdar, et al., Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents, IEEE Trans. Ind. Electron. (2008)
  • H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach, vol. 5 (2002)
  • Y. Bengio, et al., Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. (1994)
  • S. Hochreiter, et al., Long short-term memory, Neural Comput. (1997)
  • X. Liu, Deep recurrent neural network for protein function prediction from sequence
  • O. Kuchaiev, et al., Factorization tricks for LSTM networks
  • D.F. Specht, A general regression neural network, IEEE Trans. Neural Netw. (1991)
  • F.A. Gers, et al., Learning to Forget: Continual Prediction with LSTM (1999)
  • M.J. Er, et al., Face recognition with radial basis function (RBF) neural networks, IEEE Trans. Neural Netw. (2002)
  • K. Hadad, et al., Enhanced neural network based fault detection of a VVER nuclear power plant with the aid of principal component analysis, IEEE Trans. Nucl. Sci. (2008)
  • T.N. Sainath, et al., Low-rank matrix factorization for deep neural network training with high-dimensional output targets
  • Y. Zhang, et al., Extracting deep neural network bottleneck features using low-rank matrix factorization
  • S. Ruder, An overview of gradient descent optimization algorithms
  • R.J. Williams, et al., A learning algorithm for continually running fully recurrent neural networks, Neural Comput. (1989)
  • D.D. Lee, et al., Learning the parts of objects by non-negative matrix factorization, Nature (1999)

Ali H. Mirza received the B.S. degree with honors in Electrical Engineering from the University of Engineering and Technology, Lahore, Pakistan, in 2014 and the M.S. degree in Electrical and Electronics Engineering from Bilkent University, Ankara, Turkey. He is currently working as a Developer for Stability in 4G/5G Systems at Ericsson HQ, Sweden. His research interests include machine learning, signal processing, online learning and deep neural networks.

Mine Kerpicci received the B.S. degree in Electrical and Electronics Engineering from Middle East Technical University, Ankara, Turkey in 2017. She is currently pursuing the M.S. degree in the Department of Electrical and Electronics Engineering at Bilkent University. Her research interests include machine learning, online learning and statistical signal processing.

Suleyman Serdar Kozat received the B.S. (Hons.) degree from Bilkent University, Ankara, Turkey, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, IL, USA.

He joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member and later became a Project Leader with the Pervasive Speech Technologies Group, where he focused on problems related to statistical signal processing and machine learning. He was a Research Associate with the Cryptography and Anti-Piracy Group, Microsoft Research, Redmond, WA, USA. He is currently a Professor in the Electrical and Electronics Engineering Department at Bilkent University. He has co-authored over 120 papers in refereed high-impact journals and conference proceedings, and his research with IBM and Microsoft has produced several patented inventions currently used in products such as MSN and ViaVoice. His current research interests include cyber security, anomaly detection, big data, data intelligence, adaptive filtering and machine learning algorithms for signal processing.

Dr. Kozat has received many international and national awards. He is the Elected President of the IEEE Signal Processing Society, Turkey Chapter.

This work was supported by the Scientific and Technological Research Council of Turkey under Contract 117E153.
