HMATC: Hierarchical multi-label Arabic text classification model using machine learning, Egyptian Informatics Journal


Egyptian Informatics Journal (IF 5.0), Pub Date: 2020-09-22, DOI: 10.1016/j.eij.2020.08.004
Nawal Aljedani, Reem Alotaibi, Mounira Taileb

Multi-label classification assigns multiple labels to each document simultaneously. Many real-world classification problems involve high-dimensional label spaces that are naturally structured as a hierarchy. In such problems, each instance may carry multiple labels, and the labels are organized hierarchically. This is more complex than flat classification, because the classifier must account for the hierarchical relationships between labels while predicting multiple labels for the same instance. Few studies have investigated multi-label text classification for the Arabic language, and most of them focus on flat classification and neglect the hierarchical structure. This paper therefore explores hierarchical multi-label classification in the context of the Arabic language and proposes a hierarchical multi-label Arabic text classification (HMATC) model based on a machine learning approach. The impact of feature selection methods and feature set dimensions on classification performance is also investigated. In addition, the Hierarchy Of Multilabel ClassifiER (HOMER) algorithm is optimized by examining different multi-label classifiers, clustering algorithms, and numbers of clusters to improve the hierarchical classification. The study also contributes a hierarchical multi-label Arabic dataset, in a format suitable for hierarchical classification, and makes it publicly available. The results show that the proposed model has the lowest computational cost (2 h) of all models considered in the experiments. It also shows a significant improvement over the state-of-the-art model (Fatwa model) in terms of Hamming loss (0.004), hierarchical loss (1.723), multi-label accuracy (0.758), subset accuracy (0.292), micro-averaged precision (0.879), micro-averaged recall (0.828), and micro-averaged F-measure (0.853).
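
For readers unfamiliar with the flat multi-label metrics named in the abstract, the following is a minimal sketch of how they are commonly defined. It assumes the standard textbook definitions (Hamming loss as the fraction of wrong label slots, multi-label accuracy as the per-document Jaccard overlap, micro-averaging over pooled label decisions). The abstract does not say how the authors computed their figures, so the function names and the toy data here are illustrative only. Hierarchical loss is not covered, since it depends on the label hierarchy.

```python
from typing import List, Set, Tuple

LabelSets = List[Set[int]]  # one set of label ids per document


def hamming_loss(y_true: LabelSets, y_pred: LabelSets, n_labels: int) -> float:
    """Fraction of (document, label) slots that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


def subset_accuracy(y_true: LabelSets, y_pred: LabelSets) -> float:
    """Fraction of documents whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multilabel_accuracy(y_true: LabelSets, y_pred: LabelSets) -> float:
    """Mean Jaccard overlap per document (taken as 1.0 when both sets are empty)."""
    scores = [len(t & p) / len(t | p) if (t | p) else 1.0
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(y_true: LabelSets, y_pred: LabelSets) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over all label decisions."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Toy example with 4 possible labels (hypothetical data, not from the paper).
    truth = [{0, 1}, {2}, {1, 3}]
    preds = [{0, 1}, {2, 3}, {1}]
    print(hamming_loss(truth, preds, n_labels=4))
    print(subset_accuracy(truth, preds))
    print(multilabel_accuracy(truth, preds))
    print(micro_prf(truth, preds))
    # Consistency check on the reported figures: P = 0.879, R = 0.828.
    print(round(f_measure(0.879, 0.828), 3))
```

Under these definitions, the reported micro-averaged F-measure of 0.853 is consistent with the reported precision and recall, since the harmonic mean of 0.879 and 0.828 rounds to 0.853. Subset accuracy (0.292) is far stricter than multi-label accuracy (0.758) because a single wrong label makes the whole document count as an error.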


Updated: 2020-09-22
