Aligning document layouts extracted with different OCR engines with clustering approach
Egyptian Informatics Journal ( IF 5.0 ) Pub Date : 2020-12-30 , DOI: 10.1016/j.eij.2020.12.004
S. Tomovic , K. Pavlovic , M. Bajceta

Layout analysis is an essential step in extracting information from scanned document images. In this paper we propose an algorithm for aligning layouts generated by different OCR engines. The main requirement is that a given document image always yields the same layout, regardless of which OCR engine processed it. Information extraction from scanned documents, which depends heavily on the positions of fields in the document, then no longer depends on a specific OCR engine. In other words, it is sufficient to maintain a single, universal body of extractor knowledge; the extractor need not be trained explicitly on samples processed by each OCR engine. The proposed algorithm can handle administrative documents with complex layouts.
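The abstract does not spell out the clustering step, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each OCR engine returns words with slightly different bounding boxes, and it groups words into text lines by a simple one-dimensional clustering on their vertical centers. The names `Word`, `cluster_into_lines`, and the `tolerance` parameter are hypothetical, as is the sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A recognized word with a simplified bounding box (hypothetical type)."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # box height


def cluster_into_lines(words, tolerance=0.5):
    """Group words into text lines by clustering their vertical centers.

    A word joins the current line when its center lies within
    tolerance * height of that line's running mean center; otherwise
    it starts a new line. This is an assumed heuristic for illustration.
    """
    lines = []    # each entry is a list of Word objects on one line
    centers = []  # running mean vertical center of each line
    for w in sorted(words, key=lambda w: w.y + w.height / 2):
        c = w.y + w.height / 2
        if lines and abs(c - centers[-1]) <= tolerance * w.height:
            lines[-1].append(w)
            # Update the running mean of the line's center.
            centers[-1] += (c - centers[-1]) / len(lines[-1])
        else:
            lines.append([w])
            centers.append(c)
    # Within each line, order words left to right and keep only the text.
    return [[w.text for w in sorted(line, key=lambda w: w.x)] for line in lines]


# Made-up output of two engines for the same image: same words,
# slightly different box coordinates and heights.
engine_a = [Word("Invoice", 10, 10, 12), Word("No.", 80, 11, 12),
            Word("Total", 10, 40, 12), Word("100", 80, 39, 12)]
engine_b = [Word("Invoice", 12, 12, 11), Word("No.", 82, 10, 13),
            Word("Total", 11, 41, 11), Word("100", 81, 40, 12)]

print(cluster_into_lines(engine_a))
print(cluster_into_lines(engine_b))
```

In this toy case both engines produce the same line grouping despite the coordinate jitter, which is the property the abstract asks for: a downstream extractor that relies on field positions sees one layout regardless of the engine.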




Updated: 2020-12-30