Efficient and effective OCR engine training, International Journal on Document Analysis and Recognition

Efficient and effective OCR engine training
International Journal on Document Analysis and Recognition (IF 2.3), Pub Date: 2019-10-30, DOI: 10.1007/s10032-019-00347-8
Christian Clausner, Apostolos Antonacopoulos, Stefan Pletschacher

We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve best results. The widely used Tesseract OCR engine is used as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
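The abstract does not state which metric is used for the quantitative evaluation step. As an illustration only, the sketch below computes a character-level accuracy from the edit distance between a ground-truth transcription and an OCR result, which is one common way to score a trained engine. The function names `levenshtein` and `character_accuracy` are hypothetical and are not taken from Aletheia or Tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

In an iterative training loop of the kind the abstract describes, a score like this could be recomputed on a held-out set of pages after each training round to decide whether more training data is worth adding.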

Updated: 2019-10-30
