EAML: ensemble self-attention-based mutual learning network for document image classification
International Journal on Document Analysis and Recognition (IF 2.3) Pub Date: 2021-06-24, DOI: 10.1007/s10032-021-00378-0
Souhail Bakkali , Zuheng Ming , Mickaël Coustaty , Marçal Rusiñol

In the recent past, complex deep neural networks have attracted considerable interest in various document understanding tasks such as document image classification and document retrieval. Because many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images suffers from low inter-class discrimination and high intra-class structural variation across categories. In parallel, text-level understanding learned jointly with the corresponding visual properties of a given document image has considerably improved classification accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network; it learns the discriminant features of the image and text modalities simultaneously throughout the training stage. In addition, we encourage mutual learning by transferring positive knowledge between the image and text modalities during training. This constraint is realized by adding a truncated Kullback–Leibler divergence loss (Tr-\(\hbox {KLD}_{{\mathrm{Reg}}}\)) as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification. The experimental results demonstrate the effectiveness of our approach in terms of accuracy in both single-modal and multi-modal settings. The proposed ensemble self-attention-based mutual learning model thus outperforms state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
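To make the two mechanisms described in the abstract concrete, below is a minimal PyTorch sketch of a self-attention-based fusion block over image and text embeddings, and a mutual-learning objective with a truncated KL-divergence regularizer. The module names, feature dimensions, truncation rule, and loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: the architecture details below (dim, heads, eps, lam)
# are assumptions; the paper's actual EAML design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    """Fuse image and text features with multi-head self-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, txt_feat):
        # Treat the two modality embeddings as a 2-token sequence.
        tokens = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)                   # residual + norm
        return fused.mean(dim=1)                            # (B, dim) joint feature

def truncated_kl(p_logits, q_logits, eps=1e-2):
    """Truncated KL(p || q): probabilities are clamped to at least `eps`
    so that a peer branch's near-zero (possibly noisy) predictions do
    not dominate the gradient. This truncation rule is an assumption."""
    p = F.softmax(p_logits, dim=-1).clamp(min=eps)
    q = F.softmax(q_logits, dim=-1).clamp(min=eps)
    p = p / p.sum(dim=-1, keepdim=True)   # renormalize after clamping
    q = q / q.sum(dim=-1, keepdim=True)
    return (p * (p / q).log()).sum(dim=-1).mean()

def mutual_learning_loss(img_logits, txt_logits, labels, lam=1.0):
    """Per-branch supervised cross-entropy plus the Tr-KLD regularizer
    that transfers knowledge between the image and text branches; each
    branch matches the other's detached predictions."""
    ce = F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)
    reg = truncated_kl(img_logits.detach(), txt_logits) \
        + truncated_kl(txt_logits.detach(), img_logits)
    return ce + lam * reg
```

Detaching the peer's logits in each regularization term is the usual deep-mutual-learning convention, so each branch is pulled toward the other's current predictions without back-propagating through both at once.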




Updated: 2021-06-25