Chinese semantic document classification based on strategies of semantic similarity computation and correlation analysis

https://doi.org/10.1016/j.websem.2020.100578

Highlights

  • This paper proposes two novel strategies to resolve two types of semantic problems.

  • We use a novel SSC strategy to resolve the polysemy problem.

  • We develop a novel SCM method to resolve the synonym problem.

  • Experiments indicate that the proposed strategies can improve the performance of SDC.

Abstract

Document classification has become an indispensable technology for realizing intelligent information services. It is often applied to tasks such as document organization, analysis, and archiving, or implemented as a submodule to support high-level applications. Semantic analysis has been shown to improve the performance of document classification and has been incorporated in previous automatic classification methods; with the growing number of documents stored online, the use of semantic information for document classification has attracted increasing attention because it can greatly reduce human effort. In this paper, we propose two semantic document classification strategies targeting two types of semantic problems: (1) a novel semantic similarity computation (SSC) method to solve the polysemy problem and (2) a strong correlation analysis method (SCM) to solve the synonym problem. Experimental results indicate that, compared with traditional machine learning, n-gram, and contextualized word embedding methods, efficient semantic similarity and correlation analysis can eliminate word ambiguity and extract useful features, improving the accuracy of semantic document classification for Chinese texts.

Introduction

In recent years, document classification (text classification/categorization) has become an indispensable technology for implementing intelligent information services. This technique is utilized in various tasks, such as document organization, analysis, and archiving, or serves as a submodule to support high-level applications [1]. For example, a medium-sized company may receive a considerable number of service requests in the form of documents every day. Some of them may lack necessary information, such as the specific department that should process them. Although customer service staff could manage such task assignment manually, an automatic document classification system can considerably reduce the workload and save time.

Considering the rapid growth in the number of online electronic documents, the conventional approach of classifying information by having humans read and analyze documents can no longer meet daily needs [2]. As a result, more precise, intelligent, and fast text categorization techniques are needed to classify incoming documents into categories of possibly different granularities. On this basis, users can estimate the importance of contents and determine the priority of each document, thereby planning their work in a more organized way [3].

Automatic document classification has wide applications in data mining and natural language processing (NLP). In general, it organizes documents according to pre-determined classes by using machine learning (ML) algorithms. Specifically, a commonly used formulation of text classification is as follows: given a set of classification labels and a set of training documents, calculate the probability of each category label for each test document, and then select the label with the highest probability as the predicted category. Conventional ML algorithms, such as the naïve Bayesian (NB) classifier, decision tree (DT), k-nearest neighbor, support vector machine (SVM), and neural networks (NNs), can be applied to text classification tasks [4]. In addition, deep learning algorithms have been considered as a possible solution for text classification. Common methods include convolutional NNs (CNNs) [5], which are often used for image classification in computer vision, and recurrent NNs (RNNs), which are used for processing sequential information [6].
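The probabilistic formulation above can be illustrated with a minimal naïve Bayes classifier written from scratch. The toy documents, vocabulary, and labels below are invented for illustration and are not from the paper's datasets.

```python
from collections import Counter, defaultdict
import math

# Toy training corpus: (tokenized document, label). Documents and labels
# are invented here purely for illustration.
train = [
    (["grain", "harvest", "farm"], "agriculture"),
    (["soil", "crop", "farm"], "agriculture"),
    (["phone", "chip", "startup"], "technology"),
    (["software", "chip", "phone"], "technology"),
]

def fit(train):
    """Count label priors and per-label word frequencies."""
    priors = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for words, label in train:
        word_counts[label].update(words)
    return priors, word_counts

def predict(words, priors, word_counts, vocab_size):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    best_label, best_score = None, float("-inf")
    total = sum(priors.values())
    for label, prior in priors.items():
        n = sum(word_counts[label].values())
        score = math.log(prior / total)
        for w in words:
            # Laplace smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][w] + 1) / (n + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors, word_counts = fit(train)
vocab = {w for words, _ in train for w in words}
print(predict(["crop", "harvest"], priors, word_counts, len(vocab)))  # agriculture
```

Note that the model relies purely on word frequencies: a polysemous word such as "Xiaomi" would contribute the same counts regardless of its intended sense, which is exactly the weakness the paper addresses.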

However, most of the aforementioned methods do not consider solving classification problems from the viewpoint of semantic analysis (SA) [1]. For example, a conventional text classification method based on the Bayesian principle constructs a classification model by counting the frequency of feature words in a text corpus. However, in the classification process, this method does not address the negative effects associated with polysemy (that is, a word may have different meanings depending on the context) and synonyms (that is, different words can have the same meaning). For example, the Chinese word “Xiaomi” (“millet” in English) can refer to both agricultural products and the high-tech corporation Xiaomi. Therefore, documents containing the word “Xiaomi” may be classified as “agriculture” or “technology” when using conventional Bayesian methods.

At the same time, document misclassification may be caused by synonyms. For example, "expert" is a synonym for "specialist" and "professional", which may appear in documents in various contexts (for example, architecture, culture, and history). Therefore, selecting these words as features in a classification model may result in classification errors. Similar situations may also arise when deep learning methods based on word embeddings are applied to document classification tasks. Dependencies between words are estimated statistically as the posterior probability of one word following another. However, a single embedding cannot represent multiple meanings, whereas similar embeddings may involve different topics. An example illustrating the negative effects of polysemy and synonyms on document classification is provided in Appendix A.

To sum up, in this paper, we mainly explore the following research problems:

(1) Polysemy problem: a word often has multiple meanings, and texts containing such words are prone to misclassification;

(2) Synonym problem: different words with similar meanings often appear in different contexts; when they appear in the same text simultaneously, they can lead to document classification errors.

Previous related works have reported important progress on these two research problems. Khan, Baharudin, Lee, and Khan (2010) [4] considered that SA could be applied to improve the performance of classification [7]. They noted that SA could be implemented by introducing ontologies [4]. Fang, Guo, Wang, and Yang (2007) also discussed that SA could be conducted using ontology methods [8]. An ontology was defined as domain-specific knowledge pre-defined by experts [9]. However, ontology methods (for example, ontology integration) were found not always sufficient to eliminate ambiguities between different domains or natural languages [10]. Therefore, this type of method could not appropriately solve the problems of polysemy and synonyms [11]. Liu, Scheuermann, Li, and Zhu (2007) [12] proposed a word sense disambiguation (WSD) method [13] based on the WordNet dictionary and used it for text classification. Other methods relying on WSD include supervised methods (for example, those proposed by Jin, Zhang, Chen, and Xia (2016) [14]) and unsupervised/joint methods (for example, those discussed by Wawer and Mykowiecka (2017) [15]). However, most existing text classification methods do not simultaneously consider the meaning uncertainty of polysemous words and the multi-scene characteristics of synonyms. Moreover, several methods employed named entities for text classification. For example, Anđelić, Kondić, Perić, Jocić, and Kovačević (2017) [3] proposed a method based on named entity linking [16]. However, from their experiments, they concluded that this strategy did not yield significant improvements, and in some cases, the performance was even worse. Türker, Zhang, Koutraki, and Sack (2018) [17] proposed a method based on a named entity dictionary (namely, an anchor text dictionary). However, if a word in the text was not specified in the dictionary, the classification results could be biased.

The results of the extended HIT IR-Lab Tongyici Cilin experiment confirmed that the performance of information retrieval (IR) [18], text classification, and automatic question-answering systems [19] can be significantly improved by effectively extending word meanings or replacing keywords with synonyms [20]. In the field of NLP, Fadaee, Bisazza, and Monz [21] demonstrated that machine translation models could be improved by expanding the text. Kobayashi [22] proposed a bi-directional language model (LM) to predict possible target words based on surrounding ones. Kobayashi verified the LM with CNN- and RNN-based classifiers on six datasets, and the obtained results indicated that text expansion can further improve the performance of NLP models. On the basis of the aforementioned linguistic evidence, in the present paper, we propose two novel strategies to improve the performance of document classification from the perspective of SA, referred to as semantic document classification (SDC). The first strategy solves the polysemy problem through a novel semantic similarity computation (SSC) method. The contextual semantics of a polysemous word in the original sentence are determined by semantically comparing them with the different concepts representing this word in a reference dictionary. In the present study, the CoDic [23], [24] and Hownet dictionaries [25] are utilized as reference dictionaries to implement semantic disambiguation. The purpose of this strategy is to ensure that each feature has clear semantics; features with ambiguity are removed. The second strategy applies a strong correlation analysis method (SCM) to solve the synonym problem. This method aims to remove the synonyms that are irrelevant to a considered classification task. Given a set of synonymous feature words S, it extracts the common semantics of S from the reference dictionary through semantic replacement.
Experimental results demonstrate that conventional ML methods (for example, Bernoulli naïve Bayes or nu-support vector classification) are prone to be affected by the polysemy and synonym issues, specifically in the case of Chinese documents. Moreover, n-gram and contextualized word embedding methods (for example, FastText and bidirectional encoder representations from transformers (BERT)) are insufficient for Chinese text classification, although models built on them perform appropriately on English documents and are not affected by these two issues.
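The dictionary-based disambiguation at the core of the first strategy can be sketched with a simplified Lesk-style heuristic: each candidate sense of a word is scored by how much its dictionary gloss overlaps with the surrounding context. The sense inventory and glosses below are invented stand-ins for a reference dictionary such as Hownet, and the overlap score is a toy substitute for the paper's actual SSC measure.

```python
# A toy sense inventory standing in for a reference dictionary such as
# Hownet; the glosses below are invented for illustration only.
SENSES = {
    "xiaomi": {
        "grain": {"millet", "cereal", "crop", "food", "farm"},
        "company": {"smartphone", "electronics", "brand", "technology"},
    }
}

def disambiguate(word, context, senses=SENSES):
    """Return the sense whose gloss overlaps most with the context words
    (a simplified Lesk heuristic, not the paper's exact SSC measure)."""
    context = set(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("xiaomi", ["the", "new", "smartphone", "brand"]))      # company
print(disambiguate("xiaomi", ["harvest", "of", "crop", "and", "cereal"]))  # grain
```

Once each occurrence is resolved to a single sense, ambiguous features can be replaced or discarded before classification, which is the intent of the SSC strategy described above.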

The paper is organized as follows. Section 2 provides an overview of research works dedicated to document classification and describes the two considered semantic dictionaries. Section 3 describes the two novel proposed strategies (namely, SSC and SCM) for SDC. In Section 4, we discuss the experiments conducted for performance evaluation. The final section draws conclusions and outlines future research directions.

Section snippets

Document classification

The history of automatic document classification technology started in the early 1960s. Aggarwal and Zhai [26] formulated a mathematical definition of document classification: given a set of text documents D = {d1, d2, …, dn}, each document di is assigned a category index from a list of m text category labels {c1, c2, …, cm}. In its early stages, document classification relied on heuristic strategies. These approaches use a set of rules based on expert knowledge to categorize a text.

Semantic document classification

In this section, we describe the two novel strategies developed to solve the two research problems above.

Experiments

In this section, we first introduce the considered datasets and define the evaluation metrics used. Then, we compare the proposed strategies with several baselines and describe the detailed experimental procedure. Finally, the experimental results are discussed based on the classification performance estimates.

Conclusion

In the present paper, we propose two novel strategies for SDC of Chinese texts. They mainly introduce the two following improvements: (1) Solving the polysemy problem by utilizing a novel SSC. SSC implies implementing SA by executing semantic similarity computation and embedding by using a common semantic dictionary. In the present paper, we apply CoDic for English texts and Hownet for Chinese ones. (2) Solving the synonym problem by developing an SCM. SCM is based on the CDM strategy for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research is supported by both the National Natural Science Foundation of China (grant no.: 61802079) and Guangzhou University Grant. The authors would like to thank all the anonymous referees for their valuable comments and helpful suggestions.

References (57)

  • Sáez C. et al., An HL7-CDA wrapper for facilitating semantic interoperability to rule-based clinical decision support systems, Comput. Methods Programs Biomed. (2013)

  • Yang S. et al., An improved ID3 algorithm for medical data classification, Comput. Electr. Eng. (2018)

  • Anđelić S. et al., Text classification based on named entities

  • Khan A. et al., A review of machine learning algorithms for text-documents classification, J. Adv. Inf. Technol. (2010)

  • Kim Y., Convolutional neural networks for sentence classification (2014)

  • Young T. et al., Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag. (2018)

  • Fang J. et al., Ontology-based automatic classification and ranking for web documents

  • Thangaraj M. et al., Text classification techniques: A literature review, Interdiscip. J. Inf. Knowl. Manage. (2018)

  • Gambhir M. et al., Recent automatic text summarization techniques: a survey, Artif. Intell. Rev. (2017)

  • Liu Y. et al., Using WordNet to disambiguate word senses for text classification

  • Jin P. et al., Bag-of-embeddings for text classification

  • Wawer A., Mykowiecka A., Supervised and unsupervised word sense disambiguation on word embedding vectors of unambiguous...

  • Türker R. et al., TECNE: Knowledge based text classification using network embeddings

  • Jiu-le T. et al., Words similarity algorithm based on Tongyici Cilin in semantic web adaptive learning system, J. Jilin Univ. (Inf. Sci. Ed.) (2010)

  • Fadaee M. et al., Data augmentation for low-resource neural machine translation (2017)

  • Kobayashi S., Contextual augmentation: Data augmentation by words with paradigmatic relations (2018)

  • Guo J. et al., Improving multilingual semantic interoperation in cross-organizational enterprise systems through concept disambiguation, IEEE Trans. Ind. Inf. (2012)

  • Dong Z. et al., HowNet and the Computation of Meaning (2006)