CluSTi: Clustering Method for Table Structure Recognition in Scanned Images
Mobile Networks and Applications (IF 2.3), Pub Date: 2021-04-30, DOI: 10.1007/s11036-021-01759-9
Arthur Zucker, Younes Belkada, Hanh Vu, Van Nam Nguyen

OCR (Optical Character Recognition) for scanned paper invoices is very challenging because of the variability of 19 invoice layouts, differing information fields, large data tables, and low scanning quality. In this setting, table structure recognition is a critical task: every row, column, and cell must be accurately located and extracted. Existing methods such as DeepDeSRT handle only high-quality born-digital images (e.g., PDF) with low noise and a clearly visible table structure. This paper proposes an efficient method called CluSTi (Clustering method for recognition of the Structure of Tables in invoice scanned Images). CluSTi makes three contributions. First, it removes heavy noise from the table images using a clustering algorithm. Second, it extracts all text boxes using state-of-the-art text recognition. Third, based on a horizontal and vertical clustering algorithm with optimized parameters, it groups the text boxes into their correct rows and columns, respectively. The method was evaluated on three datasets: i) 397 public scanned images; ii) 193 PDF document images from the ICDAR 2013 competition dataset; and iii) 281 PDF document images from the numeric tables of ICDAR 2019. CluSTi achieved F1-scores of 87.5%, 98.5%, and 94.5% on these datasets, respectively. Our method also outperformed DeepDeSRT with an F1-score of 91.44% on only 34 images from the ICDAR 2013 competition dataset. To the best of our knowledge, CluSTi is the first method to tackle the table structure recognition problem on scanned images.
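
The third step, grouping text boxes into rows and columns, can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or its optimized parameters, so the code below swaps in a simple one-dimensional gap-threshold clustering as an assumption. The box layout `(x_min, y_min, x_max, y_max)`, the function names `cluster_1d` and `assign_cells`, and the `row_gap` and `col_gap` thresholds are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def cluster_1d(values: List[float], max_gap: float) -> List[int]:
    """Label each value with a cluster id.

    Values are visited in sorted order; a new cluster starts whenever the
    gap to the previous value exceeds max_gap. This is a stand-in for the
    paper's clustering step, not the authors' algorithm.
    """
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > max_gap:
            current += 1
        labels[cur] = current
    return labels


def assign_cells(boxes: List[Box], row_gap: float, col_gap: float) -> List[Tuple[int, int]]:
    """Return a (row, column) index for every text box.

    Rows come from clustering vertical centers (horizontal grouping);
    columns come from clustering left edges (vertical grouping).
    """
    y_centers = [(b[1] + b[3]) / 2 for b in boxes]
    x_starts = [b[0] for b in boxes]
    rows = cluster_1d(y_centers, row_gap)
    cols = cluster_1d(x_starts, col_gap)
    return list(zip(rows, cols))


if __name__ == "__main__":
    # Four text boxes forming a 2 x 2 table, with slight scan jitter.
    boxes = [(10, 10, 50, 20), (100, 11, 140, 21),
             (12, 40, 48, 50), (101, 41, 150, 51)]
    print(assign_cells(boxes, row_gap=10, col_gap=20))
```

In this sketch the thresholds play the role of the parameters the abstract says are optimized. On real scans they would need tuning against the typical line height and column spacing of the invoices.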




Updated: 2021-04-30