Clustering online handwritten mathematical expressions
Pattern Recognition Letters (IF 5.1), Pub Date: 2021-03-28, DOI: 10.1016/j.patrec.2021.03.027
Huy Quang Ung, Cuong Tuan Nguyen, Khanh Minh Phan, Vu Tran Minh Khuong, Masaki Nakagawa

To help human markers mark many answers in the form of online handwritten mathematical expressions (OHMEs), this paper proposes a bag-of-features approach for clustering OHMEs. It consists of six levels of features, from low-level pattern features to high-level symbolic and structural features obtained from a state-of-the-art OHME recognizer. The paper then introduces a distance-based representation (DbR) to reduce the dimensionality of the proposed feature spaces, and presents a method for combining the proposed features to improve performance. Experiments using the k-means++ algorithm are conducted on a set of 3,150 OHMEs (Dset_50) and on an answer dataset (Dset_Mix) containing, for each of 10 questions, 200 OHMEs that mix real and synthesized patterns. When the number of clusters is set to the true number of categories, the best purity, around 0.99, is produced by bag-of-symbols with DbR on Dset_50, which is better than state-of-the-art methods for clustering offline patterns converted from their OHMEs. On Dset_Mix, the combination of low-level and high-level features with DbR achieves a purity of around 0.777; adjusting the number of clusters raises the purity to more than 0.90 and reduces the marking cost by more than 0.35 point compared with manually marking the OHME answers.
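
The purity figures quoted in the abstract can be illustrated with a minimal sketch. It uses the standard definition of purity, in which each cluster is credited with its most frequent true category. The function name `purity` and the toy cluster assignments and labels are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Return the fraction of samples that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Count how often each true label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1

    # Credit every cluster with the size of its most frequent label.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data (not from the paper): six answers in two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["x+1", "x+1", "x-1", "x-1", "x-1", "x+1"]
score = purity(clusters, labels)  # (2 + 2) / 6
```

Under this definition, splitting a cluster can never lower purity, and one cluster per sample gives a purity of 1.0. This is why a purity gain obtained by adjusting the number of clusters is best read together with the marking cost, as the abstract does.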




Updated: 2021-04-13