Hierarchical and lateral multiple timescales gated recurrent units with pre-trained encoder for long text classification

https://doi.org/10.1016/j.eswa.2020.113898

Highlights

  • Performance of current text classifiers degrades with longer input sequences.

  • The proposed model can classify texts of diverse input lengths.

  • A hierarchical and lateral architecture is proposed to enhance the performance.

  • The model uses rich features extracted by pre-trained bidirectional encoders.

  • Our model outperforms existing models on various long text classification datasets.

Abstract

Text classification using deep learning techniques has become a challenging research problem in natural language processing. Most existing deep learning models for text classification face difficulties as the length of the input text increases: they work well on shorter inputs, but their performance degrades with longer ones. In this work, we introduce a model for text classification that alleviates this problem. We present the hierarchical and lateral multiple timescales gated recurrent units (HL-MTGRU), in combination with pre-trained encoders, to address the long text classification problem. HL-MTGRU can represent dependencies at multiple temporal scales for the discrimination task. By combining the slow and fast units of the HL-MTGRU, our model effectively classifies long multi-sentence texts into the desired classes. We also show that the HL-MTGRU structure helps the model prevent performance degradation on longer text inputs. We demonstrate that the proposed network, using the latest pre-trained encoders for feature extraction, outperforms conventional models on various long text classification benchmark datasets.

Introduction

Text classification is a core task in Natural Language Processing (NLP) with many real-world applications. The goal of this task is to assign labels to texts. Applications include topic labeling (Wang & Manning, 2012), sentiment classification (Maas et al., 2011, Pang et al., 2008), chat discrimination (Moirangthem et al., 2017), and spam detection (Sahami et al., 1998). Conventional methods (Wang & Manning, 2012) represent documents with sparse lexical features such as n-grams and then apply linear models or kernel methods to these representations. An important intermediate step is text representation. More recent approaches use deep learning, such as convolutional neural networks (CNN) (Conneau et al., 2016, Johnson and Zhang, 2017, Kalchbrenner et al., 2014, Kim, 2014, Shen et al., 2018, Zhang et al., 2015), recurrent neural networks (RNN) based on long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) as in Liu et al., 2016, Seo et al., 2017 and Yogatama et al. (2017), and attention mechanisms (Lin et al., 2017, Yang et al., 2016) to learn text representations. The recently proposed Transformer (Vaswani et al., 2017) uses only self-attention for sequence tasks such as machine translation and achieves superior performance compared to its recurrent counterparts.

Language model pre-training, which learns universal language representations from huge amounts of unlabeled data, has been shown to be very effective; a recent survey by Gao et al. (2019) demonstrates this effectiveness. Some of the most prominent examples are embeddings from language models (Peters et al., 2018), generative pre-training (Radford et al., 2018), and bidirectional encoder representations from Transformers (BERT) (Devlin et al., 2018), with a number of variants introduced recently. These models are neural language models trained on large amounts of text using unsupervised objectives. For example, BERT is a Transformer encoder trained on unlabeled text for masked word prediction and next sentence prediction. Unlike earlier models, BERT produces bidirectional encoder features from text sequences. To use such pre-trained models for natural language understanding (NLU) tasks, the models must be fine-tuned with additional layers using task-specific training data for each task. For example, BERT can be fine-tuned to address a range of NLU tasks (Devlin et al., 2018). However, such models have a large number of parameters, and the resources required to run them grow accordingly. Recently, Lan et al. (2020) introduced a pre-trained encoder with parameter reduction techniques to alleviate these scaling issues. The model, A Lite BERT (ALBERT), significantly reduces the number of parameters while retaining performance, thereby improving parameter efficiency. ALBERT has 18x fewer parameters and is about 1.7x faster than BERT. In our work, we adopt ALBERT for feature extraction due to its reduced parameter count and strong performance.
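
As a concrete illustration of the feature-extraction setup described above, the following minimal sketch uses the Hugging Face transformers library to obtain contextual token features from a pre-trained ALBERT encoder. The library, the checkpoint name, and the maximum length are our assumptions for illustration, not details reported in the paper.

```python
import torch
from transformers import AlbertTokenizer, AlbertModel

# Illustrative checkpoint; the paper does not specify which ALBERT variant is used.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
encoder = AlbertModel.from_pretrained("albert-base-v2")

text = "A long multi-sentence input document to be classified ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():  # use ALBERT purely as a feature extractor in this sketch
    outputs = encoder(**inputs)

# Contextual features for each token: shape (1, seq_len, hidden_size)
token_features = outputs.last_hidden_state
```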

Recently introduced language-model pre-trained Transformers have been widely and successfully used for sentence-level text classification (Devlin et al., 2018, Lan et al., 2020, Radford et al., 2018). However, recent studies reveal that the lack of recurrence in Transformer models hinders further improvement because of their limitations in handling longer sequences (Chen et al., 2018, Dehghani et al., 2019). Introducing recurrence into pre-trained Transformers is one way to build on existing models and further enhance performance. In this work, we address long text classification, a multi-sentence text classification problem. This task requires handling longer sequences efficiently, and the classification model must also be robust to diverse sequence lengths. For example, in the Yahoo Answers topic classification task, the input text length may vary from 130 to 4018 characters, and these varying lengths must be handled properly by the model to classify the inputs into 10 different topics. Therefore, there is a need for a model that can handle both long and short sequences efficiently in order to address long multi-sentence text classification problems.

In this work, we develop a deep network for long text classification using a multiple timescales gated recurrent unit (MTGRU) (Kim et al., 2016, Moirangthem and Lee, 2017, Moirangthem and Lee, 2020, Moirangthem et al., 2017) and a pre-trained Transformer encoder. We take advantage of both recurrence and self-attention while reusing existing pre-trained models in a modular manner, saving time and computational resources while enhancing performance. Although pre-trained Transformer based approaches to NLP tasks have shown enhanced performance, we argue that better representations can be obtained by incorporating a temporal hierarchy into the architecture built on top of large pre-trained Transformer language models. The intuition underlying our model is that different parts of the model focus on different temporal scales of the text, with the goal of better classifying input texts of diverse lengths. The temporal hierarchy concept realized with the MTGRU has also been shown to perform well in language modeling (Moirangthem and Lee, 2017, Moirangthem et al., 2017) and summarization (Kim et al., 2016) tasks. The MTGRU handles long-term dependencies better by using multiple timescales to represent the multiple compositionalities of language. The temporal hierarchy approach has also been shown to eliminate the need for complex structures and normalization techniques (Chung et al., 2017, Cooijmans et al., 2017, Ha et al., 2017, Krueger and Memisevic, 2016), thereby increasing the computational efficiency of the model. Moreover, pre-trained encoders have proven to be excellent feature extractors, suitable for several NLU tasks including text classification. However, the available pre-trained encoders such as BERT and ALBERT are trained with a masked language model (MLM) objective, whose output vectors are grounded to tokens rather than sentences, whereas long text classification requires encoding and representing multi-sentence inputs. We address this representation issue by introducing the proposed MTGRU layers on top of the pre-trained encoders.
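
To make the timescale idea concrete, the following PyTorch sketch shows one common way to realize a multiple-timescale GRU cell: a leaky-integrator mixing of the previous hidden state with the standard GRU update, controlled by a timescale constant tau. The class name, the use of nn.GRUCell, and the exact mixing form are our assumptions for illustration; they follow the general MTGRU idea rather than the paper's exact implementation.

```python
import torch.nn as nn


class MTGRUCell(nn.Module):
    """GRU cell with a timescale constant tau (sketch).

    Larger tau makes the unit change more slowly ("slow" units);
    tau = 1 recovers the ordinary GRU update ("fast" units).
    """

    def __init__(self, input_size, hidden_size, tau=1.0):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.tau = tau

    def forward(self, x_t, h_prev):
        h_gru = self.cell(x_t, h_prev)  # standard GRU state update
        # Leaky integration: mix the new state in at rate 1/tau.
        return (1.0 - 1.0 / self.tau) * h_prev + (1.0 / self.tau) * h_gru
```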

We improve the MTGRU model by introducing a hierarchical and lateral multiple timescales structure. The conventional hierarchical MTGRU is most effective at handling long-term dependencies in very long text inputs for applications such as summarization, but it performs comparably to a vanilla GRU on shorter inputs. Our proposed hierarchical and lateral multiple timescales gated recurrent unit (HL-MTGRU) differs significantly from the conventional MTGRU structure. HL-MTGRU combines a lateral (branch or root) architecture with a hierarchical structure: the slow and fast units are directly connected to the inputs, the fast units also feed the slow units in a hierarchical fashion, and the final outputs of the units are combined to form the final representation. These hierarchical and lateral connections enable the encoding of rich features with different temporal dependencies from the input sentences, which helps classify the information correctly. The structure allows all layers, each with a different timescale, to capture relevant features directly from the inputs while keeping the advantages of hierarchical multi-layer structures, enabling efficient handling of both short and long sequences. Since the data consist of inputs of different lengths, HL-MTGRU is well suited for this task.
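
The sketch below illustrates the lateral and hierarchical wiring described in this paragraph, reusing the MTGRUCell sketch above: both the fast and the slow branches read the encoder features directly (lateral connections), the fast branch additionally feeds the slow branch (hierarchical connection), and the branch outputs are concatenated into the final representation. The layer sizes, number of branches, and tau values are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn


class HLMTGRU(nn.Module):
    """Hierarchical-and-lateral arrangement of fast and slow MTGRU units (sketch)."""

    def __init__(self, feat_size, hidden_size, tau_fast=1.0, tau_slow=4.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.fast = MTGRUCell(feat_size, hidden_size, tau=tau_fast)
        # The slow branch sees the input features AND the fast branch state.
        self.slow = MTGRUCell(feat_size + hidden_size, hidden_size, tau=tau_slow)

    def forward(self, feats):  # feats: (batch, seq_len, feat_size) encoder features
        batch, seq_len, _ = feats.shape
        h_fast = feats.new_zeros(batch, self.hidden_size)
        h_slow = feats.new_zeros(batch, self.hidden_size)
        for t in range(seq_len):
            x_t = feats[:, t]
            h_fast = self.fast(x_t, h_fast)  # lateral: input -> fast branch
            # Lateral (input -> slow) plus hierarchical (fast -> slow) connection.
            h_slow = self.slow(torch.cat([x_t, h_fast], dim=-1), h_slow)
        # Combine the slow and fast representations.
        return torch.cat([h_fast, h_slow], dim=-1)
```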

Our major contributions are as follows:

  • We introduce the HL-MTGRU network, with the help of pre-trained encoders, to build rich features from input texts to classify texts of diverse lengths.

  • The HL-MTGRU architecture enables our model to perform well on longer text sequences with the help of the slow layer as well as maintain comparable performance on shorter sequences.

  • In order to demonstrate that the proposed model outperforms the existing models, we report the performance on various long text classification benchmark datasets.

  • The results of our experiments demonstrate that the proposed model achieves state-of-the-art performance on the benchmark datasets.

Section snippets

Related work

Deep neural network models have demonstrated huge success in many NLP tasks, including learning distributed word, sentence, and document representations (Le and Mikolov, 2014, Mikolov et al., 2013), neural machine translation (Cho et al., 2014), sentiment classification (Kim, 2014), etc. Neural network models require little external domain knowledge in learning distributed sentence representations, and such models can produce satisfactory results in related tasks like document classification,

The proposed model

In this section, we describe in detail the proposed long text classifier model. We develop a hybrid of hierarchical and lateral MTGRU (HL-MTGRU) in a unified architecture for semantic sequence modeling. We extract the text features with the help of pre-trained Transformer encoders and feed the features directly to the HL-MTGRU. This architecture enables the network to learn multiple temporal scale dependencies from higher-order features. We hypothesize that the combination of slow and fast
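
Putting the pieces together, the following sketch shows one plausible end-to-end arrangement of the pipeline outlined here: pre-trained encoder features are fed directly into the HL-MTGRU, and the combined slow/fast representation is passed to a classification layer. It reuses the HLMTGRU sketch above; freezing the encoder and using a single linear classifier are illustrative assumptions rather than the paper's exact training setup.

```python
import torch
import torch.nn as nn


class LongTextClassifier(nn.Module):
    """Pre-trained encoder features -> HL-MTGRU -> class logits (sketch)."""

    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder  # e.g. a pre-trained ALBERT model
        self.hl_mtgru = HLMTGRU(encoder.config.hidden_size, hidden_size)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # encoder used as a frozen feature extractor here
            feats = self.encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        representation = self.hl_mtgru(feats)  # combined slow/fast representation
        return self.classifier(representation)  # logits over the target classes
```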

The text classification datasets

In this section, we describe several benchmark datasets for text classification with diverse lengths of input text for gauging the performance of our text classifier. Table 1 shows the statistics of the datasets that we have used. Some of the data samples are shown in Fig. 3.

DBpedia Ontology Classification The DBpedia ontology classification dataset (Zhang et al., 2015) is constructed by picking 14 non-overlapping classes from DBpedia 2014.

Amazon Review The Amazon review dataset (Zhang et al.,

Experiments and results

We evaluate the performance of the proposed method and compare it to the conventional models using the text classification datasets described in Section 4.

Discussion

We have investigated in detail the difference in performance between the pre-trained Transformer models, a Transformer and GRU hybrid model, and our proposed model. GRU is the base model of our HL-MTGRU, and both models have been integrated with the large pre-trained ALBERT Transformer model in this work. The results shown in Table 2 illustrate that our HL-MTGRU model significantly outperforms the GRU model. The statistical significance test results given in Table 3 also show a significant

Conclusion and future work

This paper addressed the issue of classifying texts of multiple lengths. We developed a hybrid model consisting of an HL-MTGRU network with a pre-trained encoder to classify different sets of text inputs. The proposed HL-MTGRU was able to classify the text inputs effectively despite the variance in their lengths. We evaluated the performance of the proposed hybrid model on various benchmark text classification datasets with differing lengths in order to compare to several existing

CRediT authorship contribution statement

Dennis Singh Moirangthem: Conceptualization, Methodology, Data curation, Writing - original draft, Software. Minho Lee: Supervision, Investigation, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), South Korea grant funded by the Korea government (MSIT) (2016-0-00564, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding) (50%) and the Technology Innovation Program: Industrial Strategic Technology Development Program (No: 10073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (50%).

References (56)

  • Dathathri, S., et al. Plug and play language models: A simple approach to controlled text generation.

  • Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, L. (2019). Universal transformers. In International...

  • Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.

  • Ding, N., et al. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience.

  • Gao, J., et al. (2019). Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval.

  • Ha, D., Dai, A., & Le, Q. V. (2017). HyperNetworks. In Proceeding of the international conference on learning...

  • Hao, J., et al. (2019). Modeling recurrence for transformer.

  • Heinrich, S., et al. Adaptive learning of linguistic hierarchy in a multiple timescale recurrent neural network.

  • Hochreiter, S., et al. (1997). Long short-term memory. Neural Computation.

  • Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the...

  • Johnson, R., & Zhang, T. (2017). Deep pyramid convolutional neural networks for text categorization. In Proceedings of...

  • Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. In...

  • Kim, Y. Convolutional neural networks for sentence classification.

  • Kim, M., et al. Towards abstraction from extraction: Multiple timescale gated recurrent unit for summarization.

  • Kingma, D. P., et al. (2014). Adam: A method for stochastic optimization.

  • Kiros, R., et al. Skip-thought vectors.

  • Krueger, D., & Memisevic, R. (2016). Regularizing RNNs by stabilizing activations. In Proceeding of the international...

  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A lite BERT for self-supervised...