NN-based analytic approach to symbol level recognition for degraded Bengali printed documents
Sādhanā (IF 1.6), Pub Date: 2020-10-22, DOI: 10.1007/s12046-020-01492-1
Jayati Mukherjee , Swapan K Parui , Utpal Roy

Analysis of degraded printed documents has been a research topic for the last several years. The contribution of this article lies in the segmentation of word images into symbols and the recognition of those symbols in degraded printed document images of Bengali, the 7th most popular language in the world. A novel approach to symbol level segmentation based on a Multilayer Perceptron (MLP) network is proposed. A database of segmenting and non-segmenting image columns is developed from the ISIDDI page level database, and segmentation is treated as a two-class classification problem. The MLP weights are learnt from this database using the back propagation algorithm. We introduce certain new metrics, based on which the F-score of the proposed segmentation algorithm is determined. Our method uses the information that is relevant for character segmentation and ignores other highly variable information contained in a printed text document, thus allowing for efficient transfer learning between datasets and alleviating the need for labelled training data. In addition to Bengali, we have tested the method on English, Tamil and Devanagari scripts. For classification we have identified 336 symbols, and the corresponding training and test sets have been developed from the ISIDDI database. Two classifiers, one CNN based and the other LSTM based, have been developed for this 336-class problem. The classification accuracies obtained on the test set by the CNN classifier and the LSTM classifier are 86.05% and 88.11%, respectively. The proposed classifiers outperform the existing classifiers for the ISIDDI database.
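To make the two-class framing concrete, the sketch below shows one way a word image could be turned into per-column feature vectors and how a standard F-score could be computed over segmenting (1) versus non-segmenting (0) column labels. It is an illustration only: the abstract does not specify the features fed to the MLP or the new metrics behind the reported F-score, so the ink-density window, the function names `column_features` and `f_score`, and the `half_width` parameter are all assumptions, and the conventional precision/recall definition is used in place of the authors' metrics.

```python
from typing import List, Sequence, Tuple


def column_features(image: Sequence[Sequence[int]],
                    half_width: int = 1) -> List[List[float]]:
    """Build one feature vector per image column (hypothetical features).

    `image` is a binarised word image given as rows of 0/1 pixels (1 = ink).
    Each column is described by the ink densities of the columns in a
    window of size 2 * half_width + 1 centred on it; positions outside
    the image are padded with 0.0.
    """
    height = len(image)
    width = len(image[0])
    # Fraction of ink pixels in every column.
    density = [sum(row[x] for row in image) / height for x in range(width)]
    features = []
    for x in range(width):
        window = []
        for dx in range(-half_width, half_width + 1):
            xx = x + dx
            window.append(density[xx] if 0 <= xx < width else 0.0)
        features.append(window)
    return features


def f_score(y_true: Sequence[int],
            y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the segmenting class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy 2 x 5 word image: two ink blobs separated by a blank column.
    toy = [
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 1],
    ]
    print(column_features(toy)[2])
    # Ground truth marks column 2 as segmenting; the prediction adds column 1.
    print(f_score([0, 0, 1, 0, 0], [0, 1, 1, 0, 0]))
```

In a pipeline of this kind, the vectors from `column_features` would be the inputs to a two-class classifier such as an MLP, and its per-column predictions would be passed to `f_score` together with the ground-truth labels.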




Updated: 2020-10-30