Hierarchical and lateral multiple timescales gated recurrent units with pre-trained encoder for long text classification

https://doi.org/10.1016/j.eswa.2020.113898

Highlights

  • Performance of current text classifiers degrades with longer input sequences.

  • The proposed model can classify texts of diverse input lengths.

  • A hierarchical and lateral architecture is proposed to enhance the performance.

  • The model uses rich features extracted by pre-trained bidirectional encoders.

  • Our model outperforms existing models on various long text classification datasets.

Abstract

Text classification using deep learning techniques has become a challenging research problem in natural language processing. Most existing deep learning models for text classification face difficulties as the length of the input text increases: they work well on shorter inputs, but their performance degrades with longer ones. In this work, we introduce a model for text classification that alleviates this problem. We present the hierarchical and lateral multiple timescales gated recurrent units (HL-MTGRU), in combination with pre-trained encoders, to address the long text classification problem. HL-MTGRU can represent dependencies at multiple temporal scales for the discrimination task. By combining the slow and fast units of the HL-MTGRU, our model effectively classifies long multi-sentence texts into the desired classes. We also show that the HL-MTGRU structure helps the model prevent performance degradation on longer text inputs. We demonstrate that the proposed network, using the latest pre-trained encoders for feature extraction, outperforms conventional models on various long text classification benchmark datasets.

Introduction

Text classification is a core task in Natural Language Processing (NLP) with many real-world applications. The goal of this task is to assign labels to texts. Applications include topic labeling (Wang & Manning, 2012), sentiment classification (Maas et al., 2011, Pang et al., 2008), chat discrimination (Moirangthem et al., 2017), and spam detection (Sahami et al., 1998). Conventional methods (Wang & Manning, 2012) represent documents with sparse lexical features such as n-grams and then apply linear models or kernel methods to these representations. An important intermediate step is text representation. More recent approaches use deep learning, such as convolutional neural networks (CNN) (Conneau et al., 2016, Johnson and Zhang, 2017, Kalchbrenner et al., 2014, Kim, 2014, Shen et al., 2018, Zhang et al., 2015), recurrent neural networks (RNN) based on long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) as in Liu et al., 2016, Seo et al., 2017 and Yogatama et al. (2017), and attention mechanisms (Lin et al., 2017, Yang et al., 2016) to learn text representations. The recently proposed Transformer (Vaswani et al., 2017) uses only self-attention for sequence tasks such as machine translation and achieves superior performance compared to its recurrent counterparts.

Language model pre-training, which learns universal language representations from huge amounts of unlabeled data, has been shown to be very effective; a recent survey by Gao et al. (2019) demonstrates this effectiveness. Some of the most prominent examples are embeddings from language models (Peters et al., 2018), generative pre-training (Radford et al., 2018), and bidirectional encoder representations from Transformers (BERT) (Devlin et al., 2018), with a number of variants introduced recently. These models are neural language models trained on large amounts of text using unsupervised objectives. For example, BERT is a Transformer encoder trained on unlabeled text for masked word prediction and next sentence prediction. Unlike earlier models, BERT produces bidirectional encoder features from text sequences. To use such pre-trained models for natural language understanding (NLU) tasks, the models must be fine-tuned with additional layers using task-specific training data for each task. For example, BERT can be fine-tuned to address a range of NLU tasks (Devlin et al., 2018). However, such models have a large number of parameters, and the resources required to run them grow accordingly. Recently, Lan et al. (2020) introduced a pre-trained encoder with parameter reduction techniques to alleviate these scaling issues. The model, A Lite BERT (ALBERT), significantly reduces the number of parameters while retaining performance, thereby improving parameter efficiency. ALBERT has 18x fewer parameters and is about 1.7x faster than BERT. In our work, we adopt ALBERT for feature extraction due to its reduced parameter count and strong performance.
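
As a concrete illustration of the feature-extraction setup described above, the following minimal sketch uses the Hugging Face transformers library to obtain contextual token features from a pre-trained ALBERT encoder. The library, the checkpoint name, and the maximum length are our assumptions for illustration, not details reported in the paper.

```python
import torch
from transformers import AlbertTokenizer, AlbertModel

# Illustrative checkpoint; the paper does not specify which ALBERT variant is used.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
encoder = AlbertModel.from_pretrained("albert-base-v2")

text = "A long multi-sentence input document to be classified ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():  # use ALBERT purely as a feature extractor in this sketch
    outputs = encoder(**inputs)

# Contextual features for each token: shape (1, seq_len, hidden_size)
token_features = outputs.last_hidden_state
```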

Recently introduced language-model pre-trained Transformers have been widely and successfully used for sentence-level text classification (Devlin et al., 2018, Lan et al., 2020, Radford et al., 2018). However, recent studies reveal that the lack of recurrence in Transformer models hinders further improvement because of their limitations in handling longer sequences (Chen et al., 2018, Dehghani et al., 2019). Introducing recurrence into pre-trained Transformers is one way to build on existing models and further enhance performance. In this work, we address long text classification, a multi-sentence text classification problem. This task requires handling longer sequences efficiently, and the classification model must also be robust to diverse sequence lengths. For example, in the Yahoo Answers topic classification task, the input text length may vary from 130 to 4018 characters, and these varying lengths must be handled properly by the model to classify the inputs into 10 different topics. Therefore, there is a need for a model that can handle both long and short sequences efficiently in order to address long multi-sentence text classification problems.

In this work, we develop a deep network for long text classification using a multiple timescales gated recurrent unit (MTGRU) (Kim et al., 2016, Moirangthem and Lee, 2017, Moirangthem and Lee, 2020, Moirangthem et al., 2017) and a pre-trained Transformer encoder. We take advantage of both recurrence and self-attention while reusing existing pre-trained models in a modular manner, saving time and computational resources while enhancing performance. Although pre-trained Transformer based approaches to NLP tasks have shown enhanced performance, we argue that better representations can be obtained by incorporating a temporal hierarchy into the architecture built on top of large pre-trained Transformer language models. The intuition underlying our model is that different parts of the model focus on different temporal scales of the text, with the goal of better classifying input texts of diverse lengths. The temporal hierarchy concept realized with the MTGRU has also been shown to perform well in language modeling (Moirangthem and Lee, 2017, Moirangthem et al., 2017) and summarization (Kim et al., 2016) tasks. The MTGRU handles long-term dependencies better by using multiple timescales to represent the multiple compositionalities of language. The temporal hierarchy approach has also been shown to eliminate the need for complex structures and normalization techniques (Chung et al., 2017, Cooijmans et al., 2017, Ha et al., 2017, Krueger and Memisevic, 2016), thereby increasing the computational efficiency of the model. Moreover, pre-trained encoders have proven to be excellent feature extractors, suitable for several NLU tasks including text classification. However, the available pre-trained encoders such as BERT and ALBERT are trained with a masked language model (MLM) objective, whose output vectors are grounded to tokens rather than sentences, whereas long text classification requires encoding and representing multi-sentence inputs. We address this representation issue by introducing the proposed MTGRU layers on top of the pre-trained encoders.
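
To make the timescale idea concrete, the following PyTorch sketch shows one common way to realize a multiple-timescale GRU cell: a leaky-integrator mixing of the previous hidden state with the standard GRU update, controlled by a timescale constant tau. The class name, the use of nn.GRUCell, and the exact mixing form are our assumptions for illustration; they follow the general MTGRU idea rather than the paper's exact implementation.

```python
import torch.nn as nn


class MTGRUCell(nn.Module):
    """GRU cell with a timescale constant tau (sketch).

    Larger tau makes the unit change more slowly ("slow" units);
    tau = 1 recovers the ordinary GRU update ("fast" units).
    """

    def __init__(self, input_size, hidden_size, tau=1.0):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.tau = tau

    def forward(self, x_t, h_prev):
        h_gru = self.cell(x_t, h_prev)  # standard GRU state update
        # Leaky integration: mix the new state in at rate 1/tau.
        return (1.0 - 1.0 / self.tau) * h_prev + (1.0 / self.tau) * h_gru
```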

We improve the MTGRU model by introducing a hierarchical and lateral multiple timescales structure. The conventional hierarchical MTGRU is most effective at handling long-term dependencies in very long text inputs for applications such as summarization, but it performs comparably to a vanilla GRU on shorter inputs. Our proposed hierarchical and lateral multiple timescales gated recurrent unit (HL-MTGRU) differs significantly from the conventional MTGRU structure. HL-MTGRU combines a lateral (branch or root) architecture with a hierarchical structure: the slow and fast units are directly connected to the inputs, the fast units also feed the slow units in a hierarchical fashion, and the final outputs of the units are combined to form the final representation. These hierarchical and lateral connections enable the encoding of rich features with different temporal dependencies from the input sentences, which helps classify the information correctly. The structure allows all layers, each with a different timescale, to capture relevant features directly from the inputs while keeping the advantages of hierarchical multi-layer structures, enabling efficient handling of both short and long sequences. Since the data consist of inputs of different lengths, HL-MTGRU is well suited for this task.
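
The sketch below illustrates the lateral and hierarchical wiring described in this paragraph, reusing the MTGRUCell sketch above: both the fast and the slow branches read the encoder features directly (lateral connections), the fast branch additionally feeds the slow branch (hierarchical connection), and the branch outputs are concatenated into the final representation. The layer sizes, number of branches, and tau values are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn


class HLMTGRU(nn.Module):
    """Hierarchical-and-lateral arrangement of fast and slow MTGRU units (sketch)."""

    def __init__(self, feat_size, hidden_size, tau_fast=1.0, tau_slow=4.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.fast = MTGRUCell(feat_size, hidden_size, tau=tau_fast)
        # The slow branch sees the input features AND the fast branch state.
        self.slow = MTGRUCell(feat_size + hidden_size, hidden_size, tau=tau_slow)

    def forward(self, feats):  # feats: (batch, seq_len, feat_size) encoder features
        batch, seq_len, _ = feats.shape
        h_fast = feats.new_zeros(batch, self.hidden_size)
        h_slow = feats.new_zeros(batch, self.hidden_size)
        for t in range(seq_len):
            x_t = feats[:, t]
            h_fast = self.fast(x_t, h_fast)  # lateral: input -> fast branch
            # Lateral (input -> slow) plus hierarchical (fast -> slow) connection.
            h_slow = self.slow(torch.cat([x_t, h_fast], dim=-1), h_slow)
        # Combine the slow and fast representations.
        return torch.cat([h_fast, h_slow], dim=-1)
```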

Our major contributions are as follows:

  • We introduce the HL-MTGRU network, with the help of pre-trained encoders, to build rich features from input texts to classify texts of diverse lengths.

  • The HL-MTGRU architecture enables our model to perform well on longer text sequences with the help of the slow layer as well as maintain comparable performance on shorter sequences.

  • In order to demonstrate that the proposed model outperforms the existing models, we report the performance on various long text classification benchmark datasets.

  • The results of our experiments demonstrate that the proposed model achieves state-of-the-art performance on the benchmark datasets.

Section snippets

Related work

Deep neural network models have demonstrated huge success in many NLP tasks, including learning distributed word, sentence, and document representations (Le and Mikolov, 2014, Mikolov et al., 2013), neural machine translation (Cho et al., 2014), sentiment classification (Kim, 2014), etc. Neural network models require little external domain knowledge in learning distributed sentence representations, and such models can produce satisfactory results in related tasks like document classification,

The proposed model

In this section, we describe in detail the proposed long text classifier model. We develop a hybrid of hierarchical and lateral MTGRU (HL-MTGRU) in a unified architecture for semantic sequence modeling. We extract the text features with the help of pre-trained Transformer encoders and feed the features directly to the HL-MTGRU. This architecture enables the network to learn multiple temporal scale dependencies from higher-order features. We hypothesize that the combination of slow and fast
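
Putting the pieces together, the following sketch shows one plausible end-to-end arrangement of the pipeline outlined here: pre-trained encoder features are fed directly into the HL-MTGRU, and the combined slow/fast representation is passed to a classification layer. It reuses the HLMTGRU sketch above; freezing the encoder and using a single linear classifier are illustrative assumptions rather than the paper's exact training setup.

```python
import torch
import torch.nn as nn


class LongTextClassifier(nn.Module):
    """Pre-trained encoder features -> HL-MTGRU -> class logits (sketch)."""

    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder  # e.g. a pre-trained ALBERT model
        self.hl_mtgru = HLMTGRU(encoder.config.hidden_size, hidden_size)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # encoder used as a frozen feature extractor here
            feats = self.encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        representation = self.hl_mtgru(feats)  # combined slow/fast representation
        return self.classifier(representation)  # logits over the target classes
```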

The text classification datasets

In this section, we describe several benchmark datasets for text classification with diverse lengths of input text for gauging the performance of our text classifier. Table 1 shows the statistics of the datasets that we have used. Some of the data samples are shown in Fig. 3.

DBpedia Ontology Classification The DBpedia ontology classification dataset (Zhang et al., 2015) is constructed by picking 14 non-overlapping classes from DBpedia 2014.

Amazon Review The Amazon review dataset (Zhang et al.,

Experiments and results

We evaluate the performance of the proposed method and compare it to the conventional models using the text classification datasets described in Section 4.

Discussion

We have investigated in detail the difference in performance between the pre-trained Transformer models, a Transformer and GRU hybrid model, and our proposed model. GRU is the base model of our HL-MTGRU, and both models have been integrated with the large pre-trained ALBERT Transformer model in this work. The results shown in Table 2 illustrate that our HL-MTGRU model significantly outperforms the GRU model. The statistical significance test results given in Table 3 also show a significant

Conclusion and future work

This paper addressed the issue of classifying texts of multiple lengths. We developed a hybrid model consisting of an HL-MTGRU network with a pre-trained encoder to classify different sets of text inputs. The proposed HL-MTGRU was able to classify the text inputs effectively despite the variance in their lengths. We evaluated the performance of the proposed hybrid model on various benchmark text classification datasets with differing lengths in order to compare to several existing

CRediT authorship contribution statement

Dennis Singh Moirangthem: Conceptualization, Methodology, Data curation, Writing - original draft, Software. Minho Lee: Supervision, Investigation, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), South Korea grant funded by the Korea government (MSIT) (2016-0-00564, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding) (50%) and the Technology Innovation Program: Industrial Strategic Technology Development Program (No: 10073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (50%).

References (56)

  • Dathathri, S., et al. Plug and play language models: A simple approach to controlled text generation.

  • Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, L. (2019). Universal transformers. In International...

  • Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.

  • Ding, N., et al. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience.

  • Gao, J., et al. (2019). Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval.

  • Ha, D., Dai, A., & Le, Q. V. (2017). HyperNetworks. In Proceeding of the international conference on learning...

  • Hao, J., et al. (2019). Modeling recurrence for transformer.

  • Heinrich, S., et al. Adaptive learning of linguistic hierarchy in a multiple timescale recurrent neural network.

  • Hochreiter, S., et al. (1997). Long short-term memory. Neural Computation.

  • Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the...

  • Johnson, R., & Zhang, T. (2017). Deep pyramid convolutional neural networks for text categorization. In Proceedings of...

  • Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. In...

  • Kim, Y. Convolutional neural networks for sentence classification.

  • Kim, M., et al. Towards abstraction from extraction: Multiple timescale gated recurrent unit for summarization.

  • Kingma, D. P., et al. (2014). Adam: A method for stochastic optimization.

  • Kiros, R., et al. Skip-thought vectors.

  • Krueger, D., & Memisevic, R. (2016). Regularizing RNNs by stabilizing activations. In Proceeding of the international...

  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A lite BERT for self-supervised...