Learning longer-term dependencies via grouped distributor unit
Introduction
Recurrent Neural Networks (RNNs, [1], [2]) are powerful dynamic systems for tasks that involve sequential inputs, such as audio classification, machine translation and speech generation. Because they process a sequence one element at a time while maintaining internal states that store information computed from past inputs, RNNs are in theory capable of modeling temporal dependencies between elements separated by arbitrary distances.
In practice, however, it is difficult for RNNs to learn long-term dependencies using back-propagation through time (BPTT, [1]) owing to the well-known vanishing and exploding gradient problem [3]. In addition, training RNNs suffers from gradient conflicts (e.g., input and output conflicts [4]), which make it challenging to latch long-term information while simultaneously maintaining mid- and short-term memory. Various attempts have been made to increase the temporal range over which credit assignment takes effect when training recurrent models, including adopting the considerably more sophisticated Hessian-free optimization method instead of stochastic gradient descent [5], [6], using orthogonal weight matrices to ease optimization [7], [8], and allowing direct connections to model inputs or states from the distant past [9], [10], [11]. Long short-term memory (LSTM, [4]) and its variant, the gated recurrent unit (GRU, [12]), mitigate gradient conflicts with multiplicative gate units, and the additivity of their state transition operators alleviates the vanishing gradient problem. Simplified gated units have also been proposed [13], [14], yet their ability to capture long-term dependencies has not improved. Recent work likewise supports partitioning the hidden units of an RNN into separate modules with different processing periods [15].
In this paper, we introduce the Grouped Distributor Unit (GDU), a new gated recurrent architecture with an additive state transition and only one gate unit. The hidden states inside a GDU are partitioned into groups, each of which keeps a constant proportion of its previous memory at each time step, forcing information to be latched. The vanishing gradient problem and the gradient conflict issue, both of which impede the extraction of long-term dependencies, are thus alleviated.
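To make the grouped-distributor idea concrete, the following is a minimal NumPy sketch of one possible GDU-style step. It assumes a GRU-like additive transition and a per-group softmax as the distributor; the function and parameter names are illustrative and this is not the paper's reference implementation.

```python
import numpy as np

def gdu_step(h_prev, x, params, group_size):
    """One GDU-style step (illustrative sketch, not the reference implementation).

    h_prev : (K,) previous hidden state, with K divisible by group_size
    x      : (D,) current input
    params : weight matrices and biases for the gate and the candidate state
    """
    Wz, Uz, bz, Wh, Uh, bh = params
    # Single gate unit producing raw update scores (assumed form).
    scores = Wz @ x + Uz @ h_prev + bz
    # Candidate state, as in GRU-style additive transitions.
    h_tilde = np.tanh(Wh @ x + Uh @ h_prev + bh)
    # Distributor: a softmax inside every group makes that group's update
    # weights sum to 1, so a group of size S overwrites 1/S of its memory
    # and keeps the constant proportion (S - 1)/S at every step.
    K = h_prev.shape[0]
    s = scores.reshape(K // group_size, group_size)
    s = np.exp(s - s.max(axis=1, keepdims=True))
    d = (s / s.sum(axis=1, keepdims=True)).reshape(K)
    # Additive state transition with a single (distributed) gate.
    return (1.0 - d) * h_prev + d * h_tilde
```

Under these assumptions, the retained proportion of each group is fixed by the group size rather than learned, which is what forces part of the memory to be latched across time steps.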
We empirically evaluated the proposed model against LSTM, GRU and other models on both synthetic problems designed to be pathologically difficult and natural datasets containing long-term components. On the synthetic problems, the proposed model outperforms LSTM and GRU with a simpler structure and fewer parameters; on the natural datasets, GDU outperforms other recently proposed related models.
Section snippets
Background and related work
An RNN is able to encode sequences of arbitrary length into a fixed-length representation by folding a new observation $x_t$ into its hidden state $h_t$ using a transition operator $\mathcal{T}$ at each time step $t$: $h_t = \mathcal{T}(h_{t-1}, x_t)$.
Simple recurrent networks (SRN, [16]), known as one of the earliest variants, define $\mathcal{T}$ as the composition of an element-wise nonlinearity $\phi$ with an affine transformation of both $x_t$ and $h_{t-1}$: $h_t = \phi(W_x x_t + W_h h_{t-1} + b)$.
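For concreteness, the SRN transition above amounts to a single line of NumPy; the weight names W_h, W_x and b, and the choice of tanh as the nonlinearity, are illustrative assumptions.

```python
import numpy as np

def srn_step(h_prev, x, W_h, W_x, b):
    # Element-wise nonlinearity applied to an affine transformation
    # of the previous state h_{t-1} and the current input x_t.
    return np.tanh(W_h @ h_prev + W_x @ x + b)
```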
Grouped distributor unit
As introduced in Section 2, a network combining the advantages of GAST with the idea of partitioning state units into groups seems promising. Further, we argue that a dynamic system with memory does not need to overwrite the vast majority of its memory based on relatively little input data. For cGAST models, in which the retain and overwrite gates are coupled so that they sum to one, we define the proportion of states to be overwritten at time step $t$ as $\Delta_t = \frac{1}{K}\sum_{k=1}^{K} z_t^{(k)}$, where $K$ is the state size and $z_t$ is the update gate. On the other hand, the proportion of previous states to be
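As a concrete reading of this definition (under the coupled-gate form assumed above), the overwritten proportion is simply the mean update-gate activation; a small sketch:

```python
import numpy as np

def overwrite_proportion(z_t):
    # Mean update-gate activation over the K state units: the fraction of
    # the previous state overwritten at step t under the coupled transition
    # h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde.
    return float(np.mean(z_t))

# e.g. z_t = [0.9, 0.8, 0.1, 0.2] gives 0.5: on average half of the
# previous memory is overwritten at this step.
```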
Experiments
We evaluated the proposed GDU on both pathological synthetic tasks and natural datasets, in comparison with other models. The pathological tasks comprise the adding problem (Section 4.1), the 3-bit temporal order problem (Section 4.2) and the multi-embedded Reber grammar (Section 4.3), on which GDU was compared with LSTM and GRU. It is important to point out that although LSTM and GRU perform similarly on natural datasets [21], one model may outperform the other by a large margin on different pathological tasks like
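For reference, one common formulation of the adding problem data can be generated as follows; this is a sketch, and conventions such as sequence length and marker placement vary between papers.

```python
import numpy as np

def adding_problem_batch(batch_size, seq_len, rng=None):
    """Generate one batch of the adding problem: channel 0 holds random
    values in [0, 1), channel 1 marks two positions with 1s, and the
    regression target is the sum of the two marked values."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.random((batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.zeros(batch_size)
    for i in range(batch_size):
        a, b = rng.choice(seq_len, size=2, replace=False)
        markers[i, a] = markers[i, b] = 1.0
        targets[i] = values[i, a] + values[i, b]
    x = np.stack([values, markers], axis=-1)  # shape (batch, seq_len, 2)
    return x, targets
```

Because the two marked positions can be far apart, a model must carry the first marked value across many time steps, which is what makes the task a probe of long-term dependency learning.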
Conclusions and future work
We proposed a novel RNN architecture with a gated additive state transition that contains only one gate unit. The issues of gradient vanishing and gradient conflict are mitigated by explicitly limiting the proportion of states to be overwritten at each time step. Our experiments comprise challenging pathological problems and tasks on natural datasets. The results were consistent across different tasks and clearly demonstrate that the proposed grouped distributor architecture is helpful to extract
CRediT authorship contribution statement
Wei Luo: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Feng Yu: Conceptualization, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References (40)
- Generalization of backpropagation with application to a recurrent gas market model, Neural Networks, 1988.
- Finding structure in time, Cognitive Sci., 1990.
- Novel architecture for long short-term memory used in question classification, Neurocomputing, 2018.
- Learning representations by back-propagating errors, Nature, 1986.
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
- Long short-term memory, Neural Comput., 1997.
- J. Martens, Deep learning via hessian-free optimization, in: Proceedings of the 27th International Conference on...
- J. Martens, I. Sutskever, Learning recurrent neural networks with hessian-free optimization, in: Proceedings of the...
- A.M. Saxe, J.L. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural...
- Q.V. Le, N. Jaitly, G.E. Hinton, A simple way to initialize recurrent networks of rectified linear units, 2015...
- Learning long-term dependencies in NARX recurrent neural networks, Trans. Neur. Netw.
- Learning to forget: continual prediction with LSTM, Neural Comput.
Wei Luo received his B.Sc. degree in Information and Computing Science from Zhejiang University, Hangzhou, China, in 2014. He is currently pursuing the Ph.D. degree in Zhejiang University. His research interests include machine learning, in particular neural networks and optimization.
Dr. Feng Yu is a Professor in the College of Biomedical Engineering and Instrument Science at Zhejiang University. He has a BS in Semiconductor from Nankai University and MS/PhD in Instrumentation Engineering from Zhejiang University. His current primary research interests are in novel computational architectures for biomedical image processing applications.