A Comparative analysis for identification and classification of text segmentation challenges in Takri Script, Sādhanā


Sādhanā (IF 1.6), Pub Date: 2020-06-04, DOI: 10.1007/s12046-020-01384-4
Shikha Magotra, Baijnath Kaushik, Ajay Kaul

Takri is an Indian regional class of scripts used in the hilly areas of north-west India, which include Jammu and Kashmir (J&K), Himachal Pradesh (H.P.), Punjab and Uttarakhand. The script has immense variation; almost 13 variants have been identified across north-west India. It has been observed that no work on text identification and recognition of Takri script has been done so far. Our work therefore focuses on identifying and classifying the challenges in the script through a comparative analysis of existing text segmentation approaches, since correct segmentation of text leads to more accurate machine recognition. Because no metal fonts were available for the script, machine-printed data had to be collected in order to address the text identification problem in Takri script. The paper surveys different text segmentation approaches and, based on the structural properties of the script, implements them on Takri data in three steps: the Gurmukhi segmentation technique, the Connected Component segmentation approach, and the Gurmukhi touching-characters segmentation approach. The results are analyzed for segmentation accuracy, and the challenges are identified along with their statistical analysis. The identified challenges, namely half-forms, numerous types of touching characters, and overlapping bounding boxes, are then classified. The effectiveness of these challenges was evaluated using the Naïve Bayesian machine learning algorithm. The results showed 80% accuracy in text identification and classification of Takri script.
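To make the connected-component step and the "overlapping bounding boxes" challenge concrete, the following is a minimal sketch in Python. It is not the authors' implementation, which the abstract does not describe in detail. The toy binary image `TOY_LINE`, the function names, the choice of 8-connectivity, and the reading of "overlap" as shared columns between two component boxes are all assumptions made for illustration.

```python
from collections import deque

def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions.

    `image` is a list of rows holding 1 for ink and 0 for background.
    Boxes are returned in row-major order of each region's first pixel.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def columns_overlap(a, b):
    """True when two boxes share at least one column (assumed overlap criterion)."""
    return a[1] <= b[3] and b[1] <= a[3]

def overlapping_pairs(boxes):
    """Index pairs of boxes whose column ranges overlap."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if columns_overlap(boxes[i], boxes[j])]

# Hypothetical 3x6 text-line fragment: a top stroke, a lower stroke that
# sits partly beneath it, and a separate stroke further right.
TOY_LINE = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 1],
]

boxes = connected_components(TOY_LINE)
pairs = overlapping_pairs(boxes)
```

In this toy input the top stroke and the lower stroke are separate components whose boxes share a column, so a purely vertical cut between them would not separate the two shapes cleanly. That is the kind of case a segmentation pipeline would need to flag for special handling.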


Updated: 2020-06-04
