Comparative Study of Layout Analysis of Tabulated Historical Documents
Big Data Research (IF 2.673), Pub Date: 2021-01-08, DOI: 10.1016/j.bdr.2021.100195
Xusheng Liang; Abbas Cheddad; Johan Hall

The field of multimedia retrieval systems has attracted considerable attention because such systems help retrieve information more efficiently and accelerate daily tasks. Within this context, image processing techniques such as layout analysis and word recognition play an important role in transcribing the content of printed or handwritten documents into digital data that can be processed further. This transcription procedure is called document digitization. This work stems from an industrial need: a Swedish company (ArkivDigital AB) has scanned more than 80 million pages of Swedish historical documents from all over the country, and there is high demand to transcribe their contents into digital data. Such a process starts by locating the text, which, seen from another angle, amounts to table layout analysis. This study aims to identify the most effective way to extract the document layout of Swedish handwritten historical documents, which are characterized by their tabular form. In short, the outputs of a public tool (i.e., Breuel's OCRopus method), traditional image processing techniques (e.g., Hessian/Gabor filters, Hough transform, histogram of oriented gradients (HOG) features), and machine learning techniques (e.g., support vector machines, transfer learning) are studied and compared. Results show that the existing OCR tool cannot carry out the layout analysis task on our Swedish historical handwritten documents. Traditional image processing techniques are only moderately capable of extracting the general table layout in these documents, but accuracy improves when machine learning techniques are introduced. The best-performing approach will be used in our future document mining research to allow for the development of scalable, resource-efficient systems for big data analytics.
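
The abstract names the Hough transform among the traditional techniques for recovering table layout. As a rough illustration only, and not the authors' implementation, the sketch below shows how ruling lines of a table could be located by voting in (rho, theta) space. The names `hough_lines` and `make_demo_page` are hypothetical, the page is a tiny synthetic binary grid, and restricting the search to 0 and 90 degrees is a simplifying assumption for axis-aligned tables.

```python
import math
from collections import Counter


def hough_lines(image, min_votes, angles_deg=(0, 90)):
    """Minimal Hough transform restricted to a few candidate angles.

    image: list of rows, each a list of 0/1 values (1 = ink).
    Each ink pixel (x, y) votes for the line rho = x*cos(theta) + y*sin(theta).
    With theta = 0 the line is vertical at x = rho;
    with theta = 90 it is horizontal at y = rho.
    Returns a sorted list of (rho, theta_deg, votes) with votes >= min_votes.
    """
    votes = Counter()
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for theta in angles_deg:
                t = math.radians(theta)
                rho = round(x * math.cos(t) + y * math.sin(t))
                votes[(rho, theta)] += 1
    return sorted(
        (rho, theta, n) for (rho, theta), n in votes.items() if n >= min_votes
    )


def make_demo_page(width=10, height=8):
    """Synthetic page (an assumption for illustration, not real scan data)."""
    page = [[0] * width for _ in range(height)]
    for x in range(width):      # horizontal rule at y = 2
        page[2][x] = 1
    for y in range(height):     # vertical rule at x = 5
        page[y][5] = 1
    page[6][1] = 1              # one isolated noise pixel
    return page


if __name__ == "__main__":
    for rho, theta, n in hough_lines(make_demo_page(), min_votes=6):
        kind = "vertical" if theta == 0 else "horizontal"
        print(f"{kind} rule at offset {rho} ({n} votes)")
```

On the synthetic page, the two rules collect far more votes than the noise pixel, so a simple vote threshold separates them. Real scans would likely need binarization, a finer angle grid to tolerate skew, and peak merging, none of which are shown here.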




Updated: 2021-01-13