Mixed-Supervised Scene Text Detection With Expectation-Maximization Algorithm
IEEE Transactions on Image Processing (IF 10.8) Pub Date: 2022-08-17, DOI: 10.1109/tip.2022.3197987
Mengbiao Zhao, Wei Feng, Fei Yin, Xu-Yao Zhang, Cheng-Lin Liu
Scene text detection is an important and challenging task in computer vision. For detecting arbitrarily-shaped texts, most existing methods require heavy data labeling efforts to produce polygon-level text region labels for supervised training. To reduce the cost of data labeling, we study mixed-supervised arbitrarily-shaped text detection by combining various forms of weak supervision (e.g., image-level tags, and coarse, loose, and tight bounding boxes), which are far easier to annotate. Because existing weakly-supervised learning methods (such as multiple instance learning) do not promote full object coverage, to approximate the performance of fully-supervised detection we propose an Expectation-Maximization (EM) based mixed-supervised learning framework that trains a scene text detector using only a small amount of polygon-level annotated data combined with a large amount of weakly annotated data. The polygon-level labels are treated as latent variables and recovered from the weak labels by the EM algorithm. A new contour-based scene text detector is also proposed to facilitate the use of weak labels in our mixed-supervised learning framework. Extensive experiments on six scene text benchmarks show that (1) using only 10% strongly annotated data and 90% weakly annotated data, our method yields performance comparable to that of fully supervised methods, and (2) with 100% strongly annotated data, our method achieves state-of-the-art performance on five scene text benchmarks (CTW1500, Total-Text, ICDAR-ArT, MSRA-TD500, and C-SVT) and competitive results on the ICDAR2015 dataset. We will make our weakly annotated datasets publicly available.
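The alternating scheme described above can be illustrated with a minimal toy sketch. The paper's actual detector is contour-based and trained with gradient descent; here, as a simplification, the "detector" is just a 2-D score map, the E-step recovers a pseudo pixel mask (standing in for the latent polygon-level label) by thresholding scores inside a loose weak bounding box, and the M-step pulls the scores toward that pseudo label. All names, thresholds, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import random

def e_step(scores, box):
    """E-step: recover a pseudo text mask (the latent polygon-level label)
    by thresholding the current detector scores, restricted to the weak
    bounding box (pixels outside the box cannot be text)."""
    x0, y0, x1, y1 = box
    return [[(y0 <= y < y1 and x0 <= x < x1 and s > 0.5)
             for x, s in enumerate(row)]
            for y, row in enumerate(scores)]

def m_step(scores, mask, lr=0.5):
    """M-step: pull the detector scores toward the pseudo labels
    (a stand-in for one gradient step on the detection loss)."""
    return [[s + lr * ((1.0 if m else 0.0) - s)
             for s, m in zip(row, mrow)]
            for row, mrow in zip(scores, mask)]

# Toy 6x6 "image": noisy scores, with the detector weakly firing
# on the true text region (rows 2-3, cols 1-4).
random.seed(0)
H, W = 6, 6
scores = [[random.uniform(0.3, 0.7) for _ in range(W)] for _ in range(H)]
for y in range(2, 4):
    for x in range(1, 5):
        scores[y][x] += 0.2

box = (0, 1, 6, 5)  # loose weak annotation (x0, y0, x1, y1)

for _ in range(5):  # alternate E- and M-steps
    mask = e_step(scores, box)
    scores = m_step(scores, mask)
```

After a few iterations the scores on the weakly-detected text region saturate toward 1 while everything outside the weak box is suppressed toward 0, illustrating how the weak label constrains the latent-label recovery even though it never specifies the exact text contour.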
