Learning to detect, localize and recognize many text objects in document images from few examples
International Journal on Document Analysis and Recognition (IF 1.8), Pub Date: 2018-06-09, DOI: 10.1007/s10032-018-0305-2
Bastien Moysset , Christopher Kermorvant , Christian Wolf

The current trend in object detection and localization is to learn predictions with high-capacity deep neural networks trained on very large amounts of annotated data and requiring large amounts of processing power. In this work, we target the detection of text in document images and propose a new neural model which directly predicts object coordinates. The particularity of our contribution lies in the local computation of predictions with a new form of local parameter sharing which keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data are not as abundant as in the classical configuration of natural images and ImageNet/Pascal-VOC tasks. The proposed model also facilitates the detection of many objects in a single image and can deal with inputs of variable sizes without resizing. To enhance the localization precision of the coordinate regressor, we limit the amount of information produced by the local model components and propose two different regression strategies: (i) separately predict the lower-left and upper-right corners of each object bounding box, followed by combinatorial pairing; (ii) only predict the left side of the objects and estimate the right position jointly with text recognition. These strategies lead to good full-page text recognition results on heterogeneous documents. Experiments have been performed on a document analysis task, the localization of text lines in the Maurdor dataset.
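As a rough illustration of the approach described in the abstract, the following is a minimal sketch assuming a PyTorch-style implementation. The class names, layer sizes, and the number of candidates per cell are hypothetical and do not reproduce the authors' exact architecture; in particular, the multi-directional 2D-LSTM layers are approximated here by separable row-wise and column-wise bidirectional LSTMs. The point of the sketch is the locality of the prediction: a 1x1 convolutional head whose parameters are shared across all positions, so that each cell of the feature grid regresses a few candidate points (e.g. lower-left corners) with confidence scores.

import torch
import torch.nn as nn


class RowColLSTM(nn.Module):
    """Simplified stand-in for a spatial 2D-LSTM: a bidirectional LSTM is run
    over every row, then over every column, so each position receives both
    horizontal and vertical context."""

    def __init__(self, channels, hidden):
        super().__init__()
        self.row = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.col = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        t, _ = self.row(t)                      # sweep along rows
        t = t.reshape(b, h, w, -1).permute(0, 2, 1, 3).reshape(b * w, h, -1)
        t, _ = self.col(t)                      # sweep along columns
        return t.reshape(b, w, h, -1).permute(0, 3, 2, 1)   # (B, 2*hidden, H, W)


class LocalBoxRegressor(nn.Module):
    """Convolutional features + spatial recurrence + a 1x1 'local' regressor
    shared across all positions: every grid cell predicts a few candidate
    points (x, y) together with a confidence score."""

    def __init__(self, candidates_per_cell=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.context = RowColLSTM(32, 32)
        # 2 coordinates + 1 confidence per candidate, predicted locally (1x1 conv)
        self.head = nn.Conv2d(64, candidates_per_cell * 3, kernel_size=1)

    def forward(self, page):                    # page: (B, 1, H, W), variable size
        f = self.features(page)
        f = self.context(f)
        out = self.head(f)                      # (B, K*3, H', W')
        b, _, hh, ww = out.shape
        return out.view(b, -1, 3, hh, ww)       # per-cell candidates: (x, y, conf)


if __name__ == "__main__":
    model = LocalBoxRegressor()
    page = torch.randn(1, 1, 256, 192)          # grayscale page image
    preds = model(page)
    print(preds.shape)                          # torch.Size([1, 3, 3, 64, 48])

Because all prediction parameters are shared locally, a page of a different size simply yields a prediction grid of a different size, which is consistent with the variable-input-size property claimed in the abstract; the corner-pairing or joint-recognition strategies described there would be applied on top of these per-cell candidates.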

Updated: 2018-06-09