Pattern Recognition Letters

Volume 155, March 2022, Pages 128-134

Transformer-based approach for joint handwriting and named entity recognition in historical document

https://doi.org/10.1016/j.patrec.2021.11.010

Highlights

  • End-to-end Transformer-based architecture for named entity recognition in handwritten document images.

  • Joint recognition of handwriting and named entities at the paragraph level, allowing the model to avoid segmentation errors.

  • Exploration of different training scenarios including two-stage, mixed-data, and curriculum learning.

  • Extensive ablation and comparative experiments are conducted to validate the effectiveness of our approach.

Abstract

The extraction of relevant information carried by named entities in handwritten documents is still a challenging task. Unlike traditional information extraction approaches, which usually treat text transcription and named entity recognition as separate sequential tasks, we propose in this paper an end-to-end transformer-based approach that performs these two tasks jointly. The proposed approach operates at the paragraph level, which brings two main benefits. First, it allows the model to avoid unrecoverable early errors due to line segmentation. Second, it allows the model to exploit larger bi-dimensional context information to identify the semantic categories, reaching a higher final prediction accuracy. We also explore different training scenarios to show their effect on performance, and we demonstrate that a two-stage learning strategy can drive the model to a higher final prediction accuracy. To the best of our knowledge, this work presents the first approach that adopts transformer networks for named entity recognition in handwritten documents. We achieve new state-of-the-art performance in the ICDAR 2017 Information Extraction competition on the Esposalles database, for the complete task, even though the proposed technique does not use any dictionaries, language modeling, or post-processing.

Introduction

In recent decades, researchers have explored various document recognition techniques to recover textual information from images. Optical character recognition techniques now achieve high accuracy on modern documents, but they still require refinement for historical documents due to the degraded quality of the images and the complexity of old handwriting styles.

Although Handwritten Text Recognition (HTR) of historical document images is an important step toward recovering textual information [1], there is increasing interest within the research community in information extraction and document understanding, which allow meaningful semantic access to the information contained in document collections.

In this context, Named Entity Recognition (NER) from document images is one of the most challenging and practical problems; it consists of transcribing textual content and classifying it into semantic categories (names, organizations, locations, etc.).

In the literature, traditional NER methods for document images mainly adopt two processing steps [2], [3], [4], [5]: text is first extracted via an HTR process, and Natural Language Processing (NLP) techniques are then applied to parse the output text and extract named entity tags. Despite recent improvements in deep learning-based NLP systems, the performance of these two-stage approaches still depends on the quality of the HTR step: errors in the HTR stage, caused for example by low-quality scans, considerably degrade the performance of the NLP stage.

The second category aims to jointly perform transcription and named entity recognition from document images without an intermediate HTR stage [6], [7], [8], [9]. Most studies in this category confirm the benefit of leveraging the dependency between these two tasks with a single joint model. In [8], a single Convolutional Neural Network (CNN) is used to directly classify word images into different categories, skipping the recognition step. However, this approach does not use the context surrounding the word to be classified, which can be critical to correctly predict named entity tags. In [9], a CNN is combined with a Long Short-Term Memory (LSTM) network to integrate a larger context, achieving better results than [8]. Still, in that work the context is limited to the line level, which hampers the extraction of semantic named entity tags. To integrate a bi-dimensional context, the authors of [7] propose an end-to-end model that jointly performs handwritten text detection, transcription, and named entity recognition at the page level, benefiting from features shared across these tasks. This approach has two main drawbacks. First, it requires word bounding-box annotation, which entails a huge annotation cost in real applications. Second, such a multi-task model can be limited in performance when one specific task is much harder than, and unrelated to, the others.

Recently, inspired by their success in many NLP applications, Sequence-to-Sequence (Seq2Seq) approaches using attention-based encoder-decoder architectures have been successfully applied to HTR [10], [11]. Most of these architectures still combine the attention mechanism with a recurrent network (BLSTM or GRU), which imposes substantial memory limitations and severely reduces effectiveness on longer sequences. The authors of [12] propose a transformer-inspired architecture that dispenses with any recurrence for HTR of text-line images. The major drawback of this method is that line segmentation errors are often irreversible and therefore significantly affect recognition performance.

For handwritten historical documents, line segmentation is a complicated task compared to modern documents. Besides the complexity of the handwriting itself (inconsistent spacing between lines, characters of successive lines that may overlap, etc.), the images can contain distortions and noisy pixels due to the condition of these documents. Many studies have tried to enhance the quality of the document image before segmentation [13] or to improve segmentation quality in historical documents [14], [15], [16]. In most cases, however, segmentation is applied as a preprocessing step before recognition. Lately, researchers have examined recognizing text blocks instead of text lines, without any segmentation step [17], [18], following two categories of approaches. In the first, text-block images are transformed into a line representation using convolutional layers [19] or an attention mechanism [18], in order to perform Connectionist Temporal Classification (CTC) decoding. In the second, feature extraction preserves the 2D representation of the text block, and decoding is performed using 2D-CTC [20] or an attention-based Seq2Seq architecture [17]. To the best of our knowledge, no prior work applies the transformer architecture at the paragraph level to perform HTR and NER jointly.

Motivated by the above observations, we propose in this paper an end-to-end transformer-based approach that jointly performs full-paragraph handwriting and named entity recognition in historical documents. To the best of our knowledge, this is the first study to use the transformer architecture [21] for such a task. The aim is to bypass line segmentation problems and to allow the model to exploit larger bi-dimensional context information to identify the semantic NE tags. To this end, our first contribution consists in adapting the transformer architecture to the 2D representation of the input text block: the 2D feature maps produced by the ResNet architecture are transformed into 1D sequential features by a flattening operation. To inject positional information, we test two positional encoding (PE) methods: a 2D PE applied to the 2D feature maps, and a 1D PE applied to the flattened feature sequence.
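To make the flattening step and the two PE variants concrete, below is a minimal PyTorch sketch. The feature-map shape, the model width, and the particular 2D encoding (splitting channels between the vertical and horizontal axes) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import math
import torch

def pe_1d(length: int, d_model: int) -> torch.Tensor:
    """Standard 1D sinusoidal positional encoding, shape (length, d_model)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def pe_2d(h: int, w: int, d_model: int) -> torch.Tensor:
    """2D variant: half the channels encode the row index, half the column."""
    half = d_model // 2
    pe_y = pe_1d(h, half).unsqueeze(1).expand(h, w, half)
    pe_x = pe_1d(w, half).unsqueeze(0).expand(h, w, half)
    return torch.cat([pe_y, pe_x], dim=-1)            # (h, w, d_model)

# Hypothetical CNN feature maps: (batch, d_model, H, W).
feats = torch.randn(2, 256, 8, 64)
b, d, h, w = feats.shape

# Variant A: add the 2D PE on the feature maps, then flatten to a sequence.
seq_a = (feats.permute(0, 2, 3, 1) + pe_2d(h, w, d)).reshape(b, h * w, d)

# Variant B: flatten first, then add a 1D PE along the sequence axis.
seq_b = feats.permute(0, 2, 3, 1).reshape(b, h * w, d) + pe_1d(h * w, d)
```

Either sequence (`seq_a` or `seq_b`) can then be fed to a standard transformer encoder-decoder.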

The second contribution of this paper is an exploration of different training scenarios, including two-stage learning, mixed-data learning, and curriculum learning, showing their effect on performance and how they help the model reach a higher final prediction accuracy; a sketch of the two-stage schedule follows below.
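As a concrete illustration of the two-stage scenario, here is a hedged PyTorch sketch: `model`, `htr_loader`, and `joint_loader` are hypothetical placeholders, and the learning rates and epoch counts are assumptions rather than the paper's settings. The idea is to warm the model up on plain transcriptions before fine-tuning it on target sequences that interleave named-entity tags with the text.

```python
import torch

def train_epochs(model, loader, optimizer, epochs):
    """Teacher-forced cross-entropy training over target token sequences."""
    criterion = torch.nn.CrossEntropyLoss(ignore_index=0)  # assume id 0 = <pad>
    for _ in range(epochs):
        for images, targets in loader:         # targets: (batch, seq) token ids
            logits = model(images, targets)    # hypothetical forward, teacher forcing
            loss = criterion(logits.transpose(1, 2), targets)  # (batch, vocab, seq)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: HTR-only targets (plain transcriptions) warm up the optical model.
# train_epochs(model, htr_loader, torch.optim.Adam(model.parameters(), lr=1e-4), epochs=50)

# Stage 2: fine-tune the same weights on transcriptions interleaved with
# named-entity tags, typically at a lower learning rate.
# train_epochs(model, joint_loader, torch.optim.Adam(model.parameters(), lr=1e-5), epochs=20)
```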

The major contributions of this paper can therefore be summarized as follows:

  • We propose an end-to-end transformer-based architecture for named entity recognition in handwritten document images.

  • The proposed method jointly performs handwriting and named entity recognition at the paragraph level, allowing the model to avoid unrecoverable early errors due to line segmentation and to exploit larger context information to identify the semantic relations between the named entities.

  • We explore different training scenarios, including two-stage learning, mixed-data learning, and curriculum learning, showing their effect on performance and how they help the model reach a higher final prediction accuracy.

  • Extensive ablation and comparative experiments are conducted to validate the effectiveness of our approach. Even though the proposed technique does not use any dictionaries, language modeling, or post-processing, we achieve new state-of-the-art performance on the public IEHHR competition [22].

The remainder of this paper is structured as follows. Section 2 presents the proposed approach, including a description of the architecture and the associated training strategies. Section 3 reports the experimental results. Section 4 states the conclusion and perspectives.


Proposed approach

The proposed approach consists of an end-to-end neural network architecture, shown in Fig. 1, that recognizes the text and any named entities it contains from multi-line historical document images. The network is composed of two components: a multi-line feature extractor and a transformer-based sequence labeler.

The input image is fed to a ResNet-50 architecture to extract features. The number of feature channels is then compressed to match the transformer dimensionality using a simple 2D convolutional layer. After,
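As an illustration of this front end, here is a minimal sketch using torchvision; truncating ResNet-50 before its pooling head and compressing to a width of 256 with a 1x1 convolution are assumptions for illustration, not the paper's published configuration.

```python
import torch
from torchvision.models import resnet50

class FeatureExtractor(torch.nn.Module):
    """ResNet-50 backbone (classifier and pooling heads removed) followed by
    a 1x1 convolution that compresses 2048 channels to the transformer width."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to and including the last residual stage.
        self.cnn = torch.nn.Sequential(*list(backbone.children())[:-2])
        self.compress = torch.nn.Conv2d(2048, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.compress(self.cnn(x))   # (batch, d_model, H/32, W/32)

# Example: a paragraph image of size 256x1024 (3 channels).
img = torch.randn(1, 3, 256, 1024)
print(FeatureExtractor()(img).shape)        # torch.Size([1, 256, 8, 32])
```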

Datasets

Esposalles dataset: We conducted our experiments on the public dataset proposed in the ICDAR 2017 Information Extraction from Historical Handwritten Records (IEHHR) competition [22]. This dataset is a subset of the Esposalles dataset [25] that has been labeled for information extraction. It contains 125 handwritten pages comprising 1221 marriage records (paragraphs). Each record is composed of several text lines giving information about the husband, the wife, and their parents' names,

Conclusion

In this paper, we have proposed an end-to-end architecture to perform paragraph-level handwriting and named entity recognition on historical document images. To the best of our knowledge, it is the first approach that adopts transformer networks at the paragraph level for such a task. In contrast to traditional approaches based on two subsequent tasks (HTR and NLP), the proposed method jointly learns these two tasks in a single stage. Detailed analysis and evaluation are performed on each

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (27)

  • J.I. Toledo et al., Handwritten word image categorization with convolutional neural networks and spatial pyramid pooling, S+SSPR (2016).

  • Y. Deng et al., Image-to-markup generation with coarse-to-fine attention, Proceedings of the 34th International Conference on Machine Learning, Volume 70 (2017).

  • J. Michael et al., Evaluating sequence-to-sequence models for handwritten text recognition, 2019 International Conference on Document Analysis and Recognition (ICDAR) (2019).