Survey of Post-OCR Processing Approaches, ACM Computing Surveys

ACM Computing Surveys (IF 23.8), Pub Date: 2021-07-13, DOI: 10.1145/3453476
Thi Tuyet Hai Nguyen ₁, Adam Jatowt ₂, Mickael Coustaty ₁, Antoine Doucet ₁

Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines perform well on modern text, their performance drops significantly on historical materials. Additionally, many texts have already been processed with various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying how it affects information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies current trends and outlines some research directions in this field.

Updated: 2021-07-13
