Learning longer-term dependencies via grouped distributor unit
Introduction
Recurrent Neural Networks (RNNs, [1], [2]) are powerful dynamic systems for tasks that involve sequential inputs, such as audio classification, machine translation and speech generation. Because they process a sequence one element at a time while maintaining internal states that store information computed from past inputs, RNNs are in theory capable of modeling temporal dependencies between elements separated by arbitrary distances.
In practice, however, it is difficult for RNNs to learn long-term dependencies using back-propagation through time (BPTT, [1]) owing to the well-known vanishing and exploding gradient problem [3]. In addition, training RNNs suffers from gradient conflicts (e.g., input and output conflicts [4]), which make it challenging to latch long-term information while simultaneously maintaining mid- and short-term memory. Various attempts have been made to increase the temporal range over which credit assignment takes effect when training recurrent models, including adopting the considerably more sophisticated Hessian-free optimization method instead of stochastic gradient descent [5], [6], using orthogonal weight matrices to ease optimization [7], [8], and allowing direct connections to model inputs or states from the distant past [9], [10], [11]. Long short-term memory (LSTM, [4]) and its variant, the gated recurrent unit (GRU, [12]), mitigate gradient conflicts with multiplicative gate units, and the additivity of their state transition operators alleviates the vanishing gradient problem. Simplified gated units have also been proposed [13], [14], yet their ability to capture long-term dependencies has not improved. Recent work likewise supports partitioning the hidden units of an RNN into separate modules with different processing periods [15].
In this paper, we introduce the Grouped Distributor Unit (GDU), a new gated recurrent architecture with an additive state transition and only one gate unit. The hidden states inside a GDU are partitioned into groups, each of which keeps a constant proportion of its previous memory at each time step, forcing information to be latched. The vanishing gradient problem and the gradient conflict issue, both of which impede the extraction of long-term dependencies, are thus alleviated.
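To make the grouped-distributor idea concrete, the following is a minimal NumPy sketch of one possible GDU-style step. It assumes a GRU-like additive transition and a per-group softmax as the distributor; the function and parameter names are illustrative and this is not the paper's reference implementation.

```python
import numpy as np

def gdu_step(h_prev, x, params, group_size):
    """One GDU-style step (illustrative sketch, not the reference implementation).

    h_prev : (K,) previous hidden state, with K divisible by group_size
    x      : (D,) current input
    params : weight matrices and biases for the gate and the candidate state
    """
    Wz, Uz, bz, Wh, Uh, bh = params
    # Single gate unit producing raw update scores (assumed form).
    scores = Wz @ x + Uz @ h_prev + bz
    # Candidate state, as in GRU-style additive transitions.
    h_tilde = np.tanh(Wh @ x + Uh @ h_prev + bh)
    # Distributor: a softmax inside every group makes that group's update
    # weights sum to 1, so a group of size S overwrites 1/S of its memory
    # and keeps the constant proportion (S - 1)/S at every step.
    K = h_prev.shape[0]
    s = scores.reshape(K // group_size, group_size)
    s = np.exp(s - s.max(axis=1, keepdims=True))
    d = (s / s.sum(axis=1, keepdims=True)).reshape(K)
    # Additive state transition with a single (distributed) gate.
    return (1.0 - d) * h_prev + d * h_tilde
```

Under these assumptions, the retained proportion of each group is fixed by the group size rather than learned, which is what forces part of the memory to be latched across time steps.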
We empirically evaluated the proposed model against LSTM, GRU and other models on both synthetic problems designed to be pathologically difficult and natural datasets containing long-term components. On the synthetic problems, the proposed model outperforms LSTM and GRU with a simpler structure and fewer parameters; on the natural datasets, GDU outperforms other recently proposed related models.
Section snippets
Background and related work
An RNN is able to encode sequences of arbitrary length into a fixed-length representation by folding a new observation $x_t$ into its hidden state $h_t$ using a transition operator $\mathcal{T}$ at each time step $t$: $h_t = \mathcal{T}(h_{t-1}, x_t)$.
Simple recurrent networks (SRN, [16]), known as one of the earliest variants, define $\mathcal{T}$ as the composition of an element-wise nonlinearity $\phi$ with an affine transformation of both $x_t$ and $h_{t-1}$: $h_t = \phi(W_x x_t + W_h h_{t-1} + b)$.
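For concreteness, the SRN transition above amounts to a single line of NumPy; the weight names W_h, W_x and b, and the choice of tanh as the nonlinearity, are illustrative assumptions.

```python
import numpy as np

def srn_step(h_prev, x, W_h, W_x, b):
    # Element-wise nonlinearity applied to an affine transformation
    # of the previous state h_{t-1} and the current input x_t.
    return np.tanh(W_h @ h_prev + W_x @ x + b)
```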
Grouped distributor unit
As introduced in Section 2, a network combining the advantages of GAST with the idea of partitioning state units into groups seems promising. Further, we argue that a dynamic system with memory does not need to overwrite the vast majority of its memory based on relatively little input data. For cGAST models, in which the retain and overwrite gates are coupled so that they sum to one, we define the proportion of states to be overwritten at time step $t$ as $\Delta_t = \frac{1}{K}\sum_{k=1}^{K} z_t^{(k)}$, where $K$ is the state size and $z_t$ is the update gate. On the other hand, the proportion of previous states to be
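As a concrete reading of this definition (under the coupled-gate form assumed above), the overwritten proportion is simply the mean update-gate activation; a small sketch:

```python
import numpy as np

def overwrite_proportion(z_t):
    # Mean update-gate activation over the K state units: the fraction of
    # the previous state overwritten at step t under the coupled transition
    # h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde.
    return float(np.mean(z_t))

# e.g. z_t = [0.9, 0.8, 0.1, 0.2] gives 0.5: on average half of the
# previous memory is overwritten at this step.
```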
Experiments
We evaluated the proposed GDU on both pathological synthetic tasks and natural datasets, in comparison with other models. The pathological tasks comprise the adding problem (Section 4.1), the 3-bit temporal order problem (Section 4.2) and the multi-embedded Reber grammar (Section 4.3), on which GDU was compared with LSTM and GRU. It is important to point out that although LSTM and GRU perform similarly on natural datasets [21], one model may outperform the other by a large margin on different pathological tasks like
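For reference, one common formulation of the adding problem data can be generated as follows; this is a sketch, and conventions such as sequence length and marker placement vary between papers.

```python
import numpy as np

def adding_problem_batch(batch_size, seq_len, rng=None):
    """Generate one batch of the adding problem: channel 0 holds random
    values in [0, 1), channel 1 marks two positions with 1s, and the
    regression target is the sum of the two marked values."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.random((batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.zeros(batch_size)
    for i in range(batch_size):
        a, b = rng.choice(seq_len, size=2, replace=False)
        markers[i, a] = markers[i, b] = 1.0
        targets[i] = values[i, a] + values[i, b]
    x = np.stack([values, markers], axis=-1)  # shape (batch, seq_len, 2)
    return x, targets
```

Because the two marked positions can be far apart, a model must carry the first marked value across many time steps, which is what makes the task a probe of long-term dependency learning.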
Conclusions and future work
We proposed a novel RNN architecture with a gated additive state transition that contains only one gate unit. The issues of gradient vanishing and gradient conflict are mitigated by explicitly limiting the proportion of states to be overwritten at each time step. Our experiments comprise challenging pathological problems and tasks on natural datasets. The results were consistent across different tasks and clearly demonstrate that the proposed grouped distributor architecture is helpful to extract
CRediT authorship contribution statement
Wei Luo: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Feng Yu: Conceptualization, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References (40)
- Generalization of backpropagation with application to a recurrent gas market model, Neural Networks, 1988.
- Finding structure in time, Cognitive Sci., 1990.
- Novel architecture for long short-term memory used in question classification, Neurocomputing, 2018.
- Learning representations by back-propagating errors, Nature, 1986.
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
- Long short-term memory, Neural Comput., 1997.
- J. Martens, Deep learning via hessian-free optimization, in: Proceedings of the 27th International Conference on...
- J. Martens, I. Sutskever, Learning recurrent neural networks with hessian-free optimization, in: Proceedings of the...
- A.M. Saxe, J.L. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural...
- Q.V. Le, N. Jaitly, G.E. Hinton, A simple way to initialize recurrent networks of rectified linear units, 2015...
- Learning long-term dependencies in NARX recurrent neural networks, Trans. Neur. Netw.
- Learning to forget: continual prediction with LSTM, Neural Comput.
Wei Luo received his B.Sc. degree in Information and Computing Science from Zhejiang University, Hangzhou, China, in 2014. He is currently pursuing the Ph.D. degree in Zhejiang University. His research interests include machine learning, in particular neural networks and optimization.
Dr. Feng Yu is a Professor in the College of Biomedical Engineering and Instrument Science at Zhejiang University. He has a BS in Semiconductor from Nankai University and MS/PhD in Instrumentation Engineering from Zhejiang University. His current primary research interests are in novel computational architectures for biomedical image processing applications.