Semi-Supervised Scene Text Recognition
IEEE Transactions on Image Processing ( IF 10.6 ) Pub Date : 2021-01-20 , DOI: 10.1109/tip.2021.3051485
Yunze Gao , Yingying Chen , Jinqiao Wang , Hanqing Lu

Scene text recognition has been widely studied with supervised approaches. Most existing algorithms require a large amount of labeled data, and some methods even require character-level or pixel-wise supervision. However, labeled data is expensive to obtain, while unlabeled data is relatively easy to collect, especially for many low-resource languages. In this paper, we propose a novel semi-supervised method for scene text recognition. Specifically, we design two global metrics, i.e., an edit reward and an embedding reward, to evaluate the quality of the generated string, and adopt reinforcement learning techniques to directly optimize these rewards. The edit reward measures the edit distance between the ground-truth label and the generated string. In addition, the image feature and string feature are embedded into a common space, and the embedding reward is defined by the similarity between the input image and the generated string. It is natural that the generated string should be the one closest to the image it is generated from; therefore, the embedding reward can be obtained without any ground-truth information. In this way, we can effectively exploit a large number of unlabeled images to improve recognition performance without any additional laborious annotation. Extensive experimental evaluations on five challenging benchmarks, namely Street View Text, IIIT5K, and the ICDAR datasets, demonstrate the effectiveness of the proposed approach: our method significantly reduces annotation effort while maintaining competitive recognition performance.
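The two rewards described in the abstract can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes the edit reward is a normalized Levenshtein distance turned into a score in [0, 1], and the embedding reward is the cosine similarity between the image and string embeddings in the shared space (the exact normalization and similarity function used in the paper may differ).

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program using a single rolling row.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def edit_reward(gt: str, pred: str) -> float:
    # Higher when the generated string is closer to the label;
    # normalized to [0, 1] by the longer string's length (an assumption).
    denom = max(len(gt), len(pred), 1)
    return 1.0 - edit_distance(gt, pred) / denom

def embedding_reward(img_emb: np.ndarray, str_emb: np.ndarray) -> float:
    # Cosine similarity in the common embedding space.
    # Needs no ground truth, so it applies to unlabeled images.
    num = float(np.dot(img_emb, str_emb))
    den = float(np.linalg.norm(img_emb) * np.linalg.norm(str_emb)) + 1e-8
    return num / den
```

Because `embedding_reward` compares the generated string only against its own source image, it can supervise the recognizer on unlabeled data, while `edit_reward` requires a labeled pair; in a semi-supervised setup the two would be mixed across the labeled and unlabeled portions of a batch.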

Updated: 2021-02-19