Multi-modal page stream segmentation with convolutional neural networks
Language Resources and Evaluation (IF 2.7), Pub Date: 2019-09-27, DOI: 10.1007/s10579-019-09476-2
Gregor Wiedemann, Gerhard Heyer

In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives as well as an important task in electronic mailroom applications. As a first step, the workflow usually involves batch scanning and optical character recognition (OCR) of documents. In the case of multi-page documents, preserving document context is a major requirement. To facilitate workflows involving very large amounts of paper scans, page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into coherent multi-page documents. In a digitization project together with a German federal archive, we developed a novel approach to PSS based on convolutional neural networks (CNN). As a first project, we combine visual information from scanned images with semantic information from OCR-ed texts for this task. The multi-modal combination of features in a single classification architecture allows for major improvements towards optimal document separation. Beyond multimodality, our PSS approach profits from transfer learning and sequential page modeling. We achieve an accuracy of up to 95% on multi-page documents on our in-house dataset and up to 93% on a publicly available dataset.
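
To make the segmentation task concrete, here is a minimal sketch of how per-page decisions could be turned into documents. It assumes PSS is framed as a binary decision per page ("does this page start a new document?"), which the abstract does not spell out. The function name `segment_page_stream` and the flag list are hypothetical and only for illustration; in practice the flags would come from a classifier such as the multi-modal CNN described above.

```python
from typing import List, Sequence


def segment_page_stream(starts_new_doc: Sequence[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page boundary flags.

    starts_new_doc[i] is True if page i is assumed to open a new document.
    This is an illustrative post-processing step, not the authors' code.
    """
    documents: List[List[int]] = []
    for page_index, is_start in enumerate(starts_new_doc):
        # The first page always opens a document, whatever its flag says.
        if is_start or not documents:
            documents.append([])
        documents[-1].append(page_index)
    return documents


# Hypothetical classifier output for a stream of six scanned pages.
flags = [True, False, False, True, True, False]
print(segment_page_stream(flags))  # [[0, 1, 2], [3], [4, 5]]
```

Under this framing, a single wrong flag either splits one document in two or merges two neighbouring documents, which is why per-page accuracy matters so much for long streams.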




Updated: 2019-09-27