Building an efficient OCR system for historical documents with little training data
Neural Computing and Applications (IF 6), Pub Date: 2020-05-09, DOI: 10.1007/s00521-020-04910-x
Jiří Martínek, Ladislav Lenc, Pavel Král

As the number of digitized historical documents has increased rapidly over the last few decades, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allow OCR to be performed on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system includes two main tasks: page layout analysis (including text block and line segmentation) and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in the relevant fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.




Updated: 2020-05-09