Efficient online learning with improved LSTM neural networks

https://doi.org/10.1016/j.dsp.2020.102742

Abstract

We introduce efficient online learning algorithms based on Long Short-Term Memory (LSTM) networks that employ covariance information. In particular, we introduce the covariance of the present and one-time-step-past input vectors into the gating structure of the LSTM networks. Additionally, we include the covariance of the output vector, and we learn the corresponding weight matrices, for which we also provide the updates, to improve the learning performance of the LSTM networks. We reduce the number of system parameters through weight matrix factorization, converting each LSTM weight matrix into two smaller matrices in order to achieve high learning performance with low computational complexity. Moreover, we apply the introduced approach to the Gated Recurrent Unit (GRU) architecture. In our experiments, we illustrate significant performance improvements achieved by our methods on real-life datasets with respect to the vanilla LSTM and vanilla GRU networks.

Introduction

In the modern era, online learning has become increasingly important and is extensively studied in the literature due to its wide range of application areas, including machine learning, signal processing and network intrusion detection [1], [2], [3]. In this paper, we study online learning for sequentially observed data in a nonlinear regression problem. We aim to sequentially learn the parameters of a system in time with low computational complexity.

In the computational learning literature, linear modeling approaches are used in various applications due to their low computational complexity [4]. However, such approaches are not suitable for modeling real-life signals since they are usually not capable of capturing the complex nonlinear relationships in the signal [1]. Therefore, nonlinear approaches are used in a wide range of applications due to their high modeling power [5], [6], [7], [8]. Among the nonlinear approaches, Deep Neural Networks (DNNs) are extensively used [9], [10]. Nevertheless, DNN-based methods suffer from limited performance in modeling temporal data due to their lack of memory [11]. In order to mitigate this issue, Recurrent Neural Networks (RNNs), which are able to handle long-term data dependencies, were introduced [12], [13]. These methods, on the other hand, suffer from exploding and vanishing gradients during the backpropagation of the error [14]. Long Short-Term Memory (LSTM) networks solve this problem thanks to their gating architectures that control the flow of information [15]. Here, we introduce efficient online learning algorithms using LSTM networks. Furthermore, since the gating mechanism of the LSTM network is its backbone, we incorporate additional information into the gating mechanism that helps in learning and modeling complex processes effectively while keeping the computational complexity low [15].

There are many applications, such as genetic protein therapy and housing prices, where there is a direct or indirect link between the past and the present data instances [16]. In such cases, using this relation, which is captured by the covariance information, increases the prediction performance in an efficient manner. We utilize this covariance information in our LSTM networks through the gating structures. Furthermore, we also learn the covariance weight matrices during the training process. Hence, we improve the learning performance of the LSTM networks by not only utilizing the covariance information but also learning its relation to the desired output. We call these improved LSTM networks based on covariance information “Co-LSTM” networks.
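
To make the idea concrete, the following is a minimal NumPy sketch of one covariance-augmented gate. The exact Co-LSTM formulation is given in Section 3.1; the way the cross-covariance term x_t x_{t-1}^T enters the gate here (folded back to input dimension before a learned map V) and all variable names are our illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, m = 8, 16          # input and hidden dimensions (illustrative)
rng = np.random.default_rng(0)

# Standard LSTM gate parameters plus one extra covariance weight matrix.
W = rng.normal(scale=0.1, size=(m, p))   # input weights
U = rng.normal(scale=0.1, size=(m, m))   # recurrent weights
V = rng.normal(scale=0.1, size=(m, p))   # learned covariance weights
b = np.zeros(m)

def co_gate(x_t, x_prev, h_prev):
    """One gate of a covariance-augmented LSTM (a sketch).

    The cross-covariance of the present and one-step-past inputs,
    C_t = x_t x_{t-1}^T, is folded back to input dimension by
    applying it to x_t before the learned map V.
    """
    C_t = np.outer(x_t, x_prev)                  # p x p cross-covariance term
    return sigmoid(W @ x_t + U @ h_prev + V @ (C_t @ x_t) + b)

x_t, x_prev = rng.normal(size=p), rng.normal(size=p)
h_prev = np.zeros(m)
print(co_gate(x_t, x_prev, h_prev).shape)        # (16,)
```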

Although including the covariance information in the gating structure of the LSTM networks increases the modeling power, and hence the learning performance, it also increases the time and computational complexity. This increase results from the additional input and output weight and covariance matrices of the Co-LSTM networks. In order to keep the complexity of the resulting network low, we incorporate the Weight Matrix Factorization (WMF) method into the Co-LSTM network weight matrices [17]. The objective of the WMF is to break a higher-rank matrix down into two lower-rank matrices. Hence, we significantly reduce the time and computational complexities by applying the WMF to the Co-LSTM network weight matrices while maintaining the overall accuracy and performance of the system.
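
The parameter saving from such a factorization is easy to quantify: an m × n weight matrix with mn entries is replaced by the product of an m × r and an r × n matrix with r(m + n) entries, a reduction whenever r < mn/(m + n). Below is a hedged sketch; the shapes and the truncated-SVD initialization are our illustrative choices (in training, the two factors would simply be learned directly), not necessarily the paper's procedure.

```python
import numpy as np

m, n, r = 128, 128, 16
rng = np.random.default_rng(1)
W = rng.normal(size=(m, n))                 # original full weight matrix

# Factorize W into two smaller matrices W1 (m x r) and W2 (r x n).
# A truncated SVD gives the best rank-r approximation in Frobenius
# norm; in training, W1 and W2 would instead be learned directly.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = U[:, :r] * s[:r]                       # m x r
W2 = Vt[:r, :]                              # r x n

full_params = m * n                         # 16384
factored_params = r * (m + n)               # 4096, a 4x reduction here
print(full_params, factored_params)

x = rng.normal(size=n)
# Applying the factored weights: two cheap multiplies replace one large one.
y = W1 @ (W2 @ x)
```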

In the learning literature, neural network based approaches are extensively used due to their high capability in modeling data [18]. In particular, LSTM networks provide significantly high learning performance thanks to their ability to learn long-term data dependencies with their gating-based architectures [19], [15]. However, as a result of their performance-complexity trade-off, LSTM networks are not fully utilized in nonlinear regression tasks. In this paper, we solve the nonlinear regression problem with our novel LSTM networks that bypass this trade-off. With our improved structure with WMF, we achieve higher learning performance and lower training time compared to the vanilla LSTM, as we illustrate through our experiments.

The covariance matrix provides useful information about the data, which makes it highly valuable for learning tasks. However, its use in neural network based methods is highly limited due to their black-box nature. The most common approach in the literature to utilizing the covariance matrix in neural networks is to use it for preprocessing, such as dimensionality reduction [20], [21]. Nonetheless, such preprocessing steps may degrade the learning performance due to information loss. In this paper, we directly employ the covariance matrix along with the data, without any information loss, in our novel LSTM structure, which results in high learning performance.
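
For contrast, here is a minimal sketch of that conventional preprocessing route: PCA computed from the sample covariance matrix. Projecting onto the top-k eigenvectors reduces the dimension but discards the variance in the remaining directions, which is exactly the information loss noted above (the toy data and the choice k = 4 are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))  # correlated data
X = X - X.mean(axis=0)                     # center before PCA

cov = (X.T @ X) / (len(X) - 1)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
k = 4
P = eigvecs[:, -k:]                        # top-k principal directions

X_reduced = X @ P                          # preprocessed, lower-dim input
X_back = X_reduced @ P.T                   # best rank-k reconstruction
loss = np.linalg.norm(X - X_back) ** 2 / np.linalg.norm(X) ** 2
print(f"fraction of variance discarded: {loss:.3f}")
```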

Matrix factorization is widely used in neural network based algorithms for compression purposes [22], [23]. These algorithms decrease the training time by reducing the number of training parameters that are naturally introduced with the basic neural network architecture [22]. However, they provide low performance since the direct application of this method degrades the learning capacity. In this paper, we apply matrix factorization to our improved Co-LSTM structure, which provides high learning performance. Hence, even though we observe a slight degradation through WMF, we still achieve significantly high performance.

The main contributions of this paper are as follows:

  • We propose effective online learning algorithms based on the LSTM networks where we incorporate input and output covariance information, which captures the relationship between the previous and present data instances, into the gating structure.

  • We assign a weight matrix to each covariance matrix in our improved LSTM based gating structure and learn all the weights through the sequential updates where we enhance the regression performance.

  • In order to alleviate the time and computational complexity, we use the Weight Matrix Factorization (WMF) method, through which we elegantly control the trade-off between computational complexity and performance.

  • Through our experiments on real-life datasets, we illustrate significant performance improvements of the introduced Co-LSTM and Co-GRU networks over the vanilla LSTM and vanilla GRU networks, respectively.

This paper is organized as follows. In Section 2, we first describe our problem setting and the model formulation. We then introduce our models in Section 3.1 and provide the parameter updates in Section 3.2. In Section 3.3, we describe the Weight Matrix Factorization (WMF) method used to reduce the computational complexity. We provide the updates for the reformulation of our method based on the WMF in Section 3.4. In Section 4, we demonstrate the performance improvements of our methods on real datasets. Finally, we conclude with our remarks in Section 5.

Section snippets

Problem description

In this paper, all vectors are column vectors, and they are denoted by boldface lowercase letters. Matrices are represented by boldface uppercase letters. For a vector $\mathbf{u}$, $\mathbf{u}^T$ is the ordinary transpose. For a vector $\mathbf{x}_t$, $x_{t,i}$ is the $i$th element of $\mathbf{x}_t$. Similarly, for a matrix $\mathbf{W}$, $w_{ij}$ is the entry in the $i$th row and $j$th column.

We sequentially observe $\mathbf{x}_t \in \mathbb{R}^p$ along with the desired output $d_t$ at each discrete time step $t$, and we aim to estimate the output as $\hat{d}_t$. Here, $\hat{d}_t$ depends only on the current and past inputs $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t\}$.
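
This online setting amounts to a predict-then-update loop: at each step the learner outputs $\hat{d}_t$ before observing $d_t$, suffers the instantaneous loss, and updates its parameters. A self-contained sketch follows, with a plain linear predictor standing in for the Co-LSTM of Section 3; the predictor, the squared-error loss and the learning rate are illustrative assumptions.

```python
import numpy as np

class LinearPredictor:
    """Placeholder predictor; the paper's actual predictor is the Co-LSTM."""
    def __init__(self, p):
        self.w = np.zeros(p)
    def predict(self, x):
        return self.w @ x
    def update(self, x, err, lr):
        self.w += lr * err * x              # SGD step on squared loss

def online_learn(stream, model, lr=0.01):
    losses = []
    for x_t, d_t in stream:
        d_hat = model.predict(x_t)          # predict before seeing d_t
        err = d_t - d_hat
        losses.append(err ** 2)             # instantaneous squared loss
        model.update(x_t, err, lr)          # sequential parameter update
    return losses

# Toy stream: d_t generated by a fixed linear rule plus noise.
rng = np.random.default_rng(2)
p, w_true = 4, np.array([1.0, -2.0, 0.5, 3.0])
stream = [(x, w_true @ x + 0.01 * rng.normal())
          for x in rng.normal(size=(500, p))]
losses = online_learn(stream, LinearPredictor(p))
print(np.mean(losses[:50]), np.mean(losses[-50:]))  # loss should shrink
```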

Efficient online learning algorithms with covariance information and WMF

In this section, we present our LSTM models based on the weighted input and output covariances. Then, we apply the weight matrix factorization method to our proposed models to achieve low computational complexity. Finally, we provide the stochastic gradient based weight updates for the introduced method.

Experiments

In this section, we illustrate the regression performance of the proposed models on real-life datasets including Kinematics [27], Elevators [28], Protein Tertiary [29] and Puma8NH [30]. We demonstrate the classification performance of the proposed models on two real-life dynamic datasets, i.e., the ISCX IDS 2012 [31] and Pen-Based Handwritten Digits [32] datasets. Moreover, we compare the proposed models with the state-of-the-art methods based on their performance on image caption generation [33] …

Conclusion

We have introduced efficient online learning algorithms based on the LSTM network architecture. We have utilized the input and the output covariance information and introduced them into the gating structure of the vanilla LSTM networks. We have proposed three variants of the introduced Co-LSTM network approach where we assign weight matrices to the input and output covariance matrices and learn them in a sequential manner. Furthermore, we have used the matrix factorization method on the Co-LSTM …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (57)

  • N. Cesa-Bianchi, et al., Prediction, Learning, and Games (2006)
  • K. Greff, et al., LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. (2017)
  • M. Hermans, et al., Training and analysing deep recurrent neural networks
  • J. Mazumdar, et al., Recurrent neural networks trained with backpropagation through time algorithm to estimate nonlinear load harmonic currents, IEEE Trans. Ind. Electron. (2008)
  • H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach, vol. 5 (2002)
  • Y. Bengio, et al., Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. (1994)
  • S. Hochreiter, et al., Long short-term memory, Neural Comput. (1997)
  • X. Liu, Deep recurrent neural network for protein function prediction from sequence
  • O. Kuchaiev, et al., Factorization tricks for LSTM networks
  • D.F. Specht, A general regression neural network, IEEE Trans. Neural Netw. (1991)
  • F.A. Gers, et al., Learning to Forget: Continual Prediction with LSTM (1999)
  • M.J. Er, et al., Face recognition with radial basis function (RBF) neural networks, IEEE Trans. Neural Netw. (2002)
  • K. Hadad, et al., Enhanced neural network based fault detection of a VVER nuclear power plant with the aid of principal component analysis, IEEE Trans. Nucl. Sci. (2008)
  • T.N. Sainath, et al., Low-rank matrix factorization for deep neural network training with high-dimensional output targets
  • Y. Zhang, et al., Extracting deep neural network bottleneck features using low-rank matrix factorization
  • S. Ruder, An overview of gradient descent optimization algorithms
  • R.J. Williams, et al., A learning algorithm for continually running fully recurrent neural networks, Neural Comput. (1989)
  • D.D. Lee, et al., Learning the parts of objects by non-negative matrix factorization, Nature (1999)

Ali H. Mirza received the B.S. degree with honors in Electrical Engineering from the University of Engineering and Technology, Lahore, Pakistan, in 2014 and the M.S. degree in Electrical and Electronics Engineering from Bilkent University, Ankara, Turkey. He is currently working as a Developer for Stability in 4G/5G Systems at Ericsson HQ, Sweden. His research interests include machine learning, signal processing, online learning and deep neural networks.

Mine Kerpicci received the B.S. degree in Electrical and Electronics Engineering from Middle East Technical University, Ankara, Turkey in 2017. She is currently pursuing the M.S. degree in the Department of Electrical and Electronics Engineering at Bilkent University. Her research interests include machine learning, online learning and statistical signal processing.

Suleyman Serdar Kozat received the B.S. (Hons.) degree from Bilkent University, Ankara, Turkey, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, IL, USA.

He joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member and later became a Project Leader with the Pervasive Speech Technologies Group, where he focused on problems related to statistical signal processing and machine learning. He was a Research Associate with the Cryptography and Anti-Piracy Group, Microsoft Research, Redmond, WA, USA. He is currently a Professor in the Electrical and Electronics Engineering Department at Bilkent University. He has co-authored over 120 papers in refereed high-impact journals and conference proceedings, and his research with IBM and Microsoft has produced several patented inventions currently used in products such as MSN and ViaVoice. His current research interests include cyber security, anomaly detection, big data, data intelligence, adaptive filtering and machine learning algorithms for signal processing.

Dr. Kozat has received many international and national awards. He is the Elected President of the IEEE Signal Processing Society, Turkey Chapter.

This work was supported by the Scientific and Technological Research Council of Turkey under Contract 117E153.
