OCR error correction using correction patterns and self-organizing migrating algorithm, Pattern Analysis and Applications


Pattern Analysis and Applications (IF 3.7), Pub Date: 2020-11-23, DOI: 10.1007/s10044-020-00936-y
Quoc-Dung Nguyen, Duc-Anh Le, Nguyet-Minh Phan, Ivan Zelinka

Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques introduce various kinds of errors into OCR output. Post-processing is an essential step in improving the output quality of OCR systems by detecting and correcting these errors. In this paper, we present an automatic model for OCR post-processing that consists of both an error detection phase and an error correction phase. We propose a novel approach to OCR post-processing error correction that uses correction pattern edits and an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches when evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.
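
The abstract does not say how the table of correction pattern edits is built, so the following is only a minimal sketch of the general idea: count character-level edits between aligned OCR tokens and their ground-truth forms, then reuse those edits to generate correction candidates. The function names `learn_correction_patterns` and `generate_candidates` are hypothetical, and the alignment relies on Python's standard `difflib.SequenceMatcher` as an assumption; it is not the authors' procedure and does not include the self-organizing migrating algorithm or the fitness function.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_correction_patterns(pairs):
    """Count (ocr_fragment, truth_fragment) edits over aligned token pairs.

    Assumption: `pairs` is an iterable of (ocr_token, ground_truth_token)
    strings that have already been aligned at the token level.
    """
    patterns = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Covers replace, delete (empty truth side) and
                # insert (empty OCR side) operations.
                patterns[(ocr[i1:i2], truth[j1:j2])] += 1
    return patterns


def generate_candidates(word, patterns):
    """Apply each learned pattern once at every position where it fits."""
    candidates = set()
    for wrong, right in patterns:
        if wrong == "":
            # Insertion pattern: try every insertion point.
            for i in range(len(word) + 1):
                candidates.add(word[:i] + right + word[i:])
        else:
            start = word.find(wrong)
            while start != -1:
                candidates.add(word[:start] + right + word[start + len(wrong):])
                start = word.find(wrong, start + 1)
    candidates.discard(word)  # the unchanged word is not a correction
    return candidates


# Toy training pairs, invented purely for illustration.
table = learn_correction_patterns([("tbe", "the"), ("arnount", "amount")])
suggestions = generate_candidates("tbe", table)
```

In a full system the candidates would still need to be ranked, for example by a fitness function over linguistic features as the abstract describes; this sketch stops at candidate generation.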


Updated: 2020-11-23
