Historical document layout analysis using anisotropic diffusion and geometric features
International Journal on Digital Libraries, Pub Date: 2020-01-23, DOI: 10.1007/s00799-020-00280-w
Galal M. BinMakhashen, Sabri A. Mahmoud

Several digital libraries worldwide maintain valuable historical manuscripts. Digital copies of these manuscripts are usually offered to researchers and readers as raster images. These images carry several kinds of document degradation that may hinder automatic information retrieval tasks such as manuscript indexing, categorization, and retrieval by content. In this paper, we propose a learning-free, hybrid document layout analysis method for handwritten historical manuscripts. It has two main phases: page characterization and segmentation. First, the proposed method coarsely locates the main content using top-down whitespace analysis, employing anisotropic diffusion filtering to find whitespace. Then, it extracts template features that represent the writing behavior of the manuscripts' authors. After that, moving windows scan the manuscript page to define the main-content boundaries more precisely. We evaluated the proposed method on two datasets: a publicly available set of 38 historical manuscript pages, and a set of 51 historical manuscript pages collected from the online Harvard Library. Experiments on both datasets show promising results, with main-content segmentation quality reaching a success rate of up to 98.5%.
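
To make the whitespace step more concrete, here is a minimal sketch of how anisotropic diffusion can smooth a page before whitespace is located. It is not the authors' implementation: the abstract does not say which diffusion variant is used, so the Perona-Malik scheme here is an assumption, as are the function names `perona_malik` and `whitespace_rows`, the parameters `kappa`, `step`, and `threshold`, and the simple row-mean test that stands in for the paper's top-down whitespace analysis.

```python
import math


def perona_malik(image, iterations=10, kappa=0.1, step=0.2):
    """Smooth a 2D grayscale image (list of lists of floats in [0, 1])
    with Perona-Malik anisotropic diffusion over the 4 nearest neighbours.
    Illustrative only; parameter values are assumptions, not from the paper."""
    rows, cols = len(image), len(image[0])
    current = [row[:] for row in image]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                flux = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    # Neighbours outside the page are skipped (zero-flux border).
                    if 0 <= nr < rows and 0 <= nc < cols:
                        diff = current[nr][nc] - current[r][c]
                        # Conduction is close to 1 for small differences and
                        # close to 0 across strong edges, so noise is smoothed
                        # while ink/paper boundaries are preserved.
                        flux += math.exp(-((diff / kappa) ** 2)) * diff
                # step <= 0.25 keeps the explicit 4-neighbour update stable.
                updated[r][c] = current[r][c] + step * flux
        current = updated
    return current


def whitespace_rows(image, threshold=0.9):
    """Return indices of rows whose mean intensity is at least `threshold`
    (1.0 = blank paper, 0.0 = ink). A hypothetical stand-in for the
    paper's whitespace analysis."""
    return [i for i, row in enumerate(image) if sum(row) / len(row) >= threshold]


if __name__ == "__main__":
    # Toy 7x7 page: white paper, one ink row, one faint speckle.
    page = [[1.0] * 7 for _ in range(7)]
    page[3] = [0.0] * 7
    page[1][3] = 0.95
    smoothed = perona_malik(page)
    print(whitespace_rows(smoothed))
```

The design point is that the conduction term decays with the local intensity difference, so faint speckle is flattened while the sharp boundary between a text line and the surrounding paper survives, which is what makes the whitespace easier to separate afterwards.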




Updated: 2020-01-23