PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2022-08-23, DOI: 10.1109/tip.2022.3197981
Yuxin Wang, Hongtao Xie, Shancheng Fang, Mengting Xing, Jing Wang, Shenggao Zhu, Yongdong Zhang

The exploration of linguistic information promotes the development of the scene text recognition task. Benefiting from its strengths in parallel reasoning and global relationship capture, the transformer-based language model (TLM) has recently achieved dominant performance. Because the TLM is a structure decoupled from the recognition process, we argue that its capability is limited by the low-quality visual prediction it receives as input. Specifically: 1) a visual prediction with low character-wise accuracy increases the correction burden on the TLM; 2) an inconsistent word length between the visual prediction and the original image provides wrong language modeling guidance to the TLM. In this paper, we propose a Progressive scEne Text Recognizer (PETR) to improve the capability of the transformer-based language model by addressing these two problems. First, a Destruction Learning Module (DLM) is proposed to exploit the linguistic information in the visual context. During training, the DLM introduces the recognition of destructed images whose patches have been shuffled. By guiding the vision model to restore the patch order and make word-level predictions on these destructed images, a visual prediction with high character-wise accuracy is obtained through exploring the inner relationships among local visual patches. Second, a new Language Rectification Module (LRM) is proposed to optimize the word length, thereby rectifying the language guidance. By applying the LRM progressively across different language modeling steps, a novel progressive rectification network is constructed that handles extremely challenging cases (e.g., distortion and occlusion). With the DLM and LRM, PETR enhances the capability of the transformer-based language model from a more general perspective, namely by reducing the correction burden and rectifying the language modeling guidance. Compared with parallel transformer-based methods, PETR obtains 1.0% and 0.8% improvements on regular and irregular datasets respectively, while introducing only 1.7M additional parameters. Extensive experiments on both English and Chinese benchmarks demonstrate that PETR achieves state-of-the-art results.
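
To make the destruction learning idea concrete, below is a minimal sketch of how shuffling horizontal patches and supervising both patch-order restoration and word recognition could be wired up. It assumes a PyTorch setting; the function names, the two-headed vision_model interface, and the equal-weight loss combination are illustrative assumptions, not the paper's actual architecture or losses.

# Hedged sketch of destruction learning (DLM-style): shuffle the horizontal
# patches of a text image, then supervise the model to (a) restore the patch
# order and (b) still predict the word. Shapes and heads are illustrative.
import torch
import torch.nn.functional as F

def destruct(image, num_patches=8):
    """Split an image (C, H, W) into horizontal patches and shuffle them.
    Returns the destructed image and the permutation (the order target).
    Any leftover columns when W is not divisible are dropped for simplicity."""
    c, h, w = image.shape
    patch_w = w // num_patches
    patches = [image[:, :, i * patch_w:(i + 1) * patch_w] for i in range(num_patches)]
    perm = torch.randperm(num_patches)
    shuffled = torch.cat([patches[i] for i in perm], dim=2)
    return shuffled, perm

def destruction_loss(vision_model, image, word_target, num_patches=8):
    """Joint loss: word recognition on the destructed image plus patch-order
    restoration. vision_model is assumed (hypothetically) to return
    per-character logits and per-patch order logits."""
    destructed, perm = destruct(image, num_patches)
    char_logits, order_logits = vision_model(destructed.unsqueeze(0))
    rec_loss = F.cross_entropy(char_logits.squeeze(0), word_target)   # word-level prediction
    order_loss = F.cross_entropy(order_logits.squeeze(0), perm)       # restore patch order
    return rec_loss + order_loss

The joint objective is what pushes the vision model to exploit inner relationships among local patches: it can only restore the order and still read the word if it has learned how neighboring character fragments relate.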
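
The progressive rectification can likewise be read as an iterative loop in which each language modeling step first rectifies the predicted word length before refining the characters. The sketch below is only one plausible reading of the abstract's description; lrm, tlm_step, and the number of steps are hypothetical placeholders.

# Hedged sketch of progressive rectification (LRM-style): at each language
# modeling step, a rectification module adjusts the length of the current
# prediction against the visual features, then the TLM refines the characters.
def progressive_rectify(visual_pred, visual_feat, lrm, tlm_step, num_steps=3):
    pred = visual_pred
    for _ in range(num_steps):
        pred = lrm(pred, visual_feat)   # rectify word length using visual evidence
        pred = tlm_step(pred)           # transformer-based language refinement
    return pred

Repeating the rectify-then-refine cycle is what lets later language modeling steps recover from the extreme cases (distortion, occlusion) that a single pass would misjudge.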
