Chinese semantic document classification based on strategies of semantic similarity computation and correlation analysis

https://doi.org/10.1016/j.websem.2020.100578

Highlights

  • This paper proposes two novel strategies to resolve two types of semantic problems.

  • We use a novel SSC strategy to resolve the polysemy problem.

  • We develop a novel SCM method to resolve the synonym problem.

  • Experiments indicate that the proposed strategies can improve the performance of SDC.

Abstract

Document classification has become an indispensable technology for realizing intelligent information services. It is often applied to tasks such as document organization, analysis, and archiving, or implemented as a submodule to support high-level applications. Semantic analysis has been shown to improve the performance of document classification and has been incorporated in previous automatic classification methods; with the growing number of documents stored online, the use of semantic information for document classification has attracted increasing attention because it can greatly reduce human effort. In this paper, we propose two semantic document classification strategies targeting two types of semantic problems: (1) a novel semantic similarity computation (SSC) method to solve the polysemy problem and (2) a strong correlation analysis method (SCM) to solve the synonym problem. Experimental results indicate that, compared with traditional machine learning, n-gram, and contextualized word embedding methods, efficient semantic similarity and correlation analysis can eliminate word ambiguity and extract useful features, improving the accuracy of semantic document classification for Chinese texts.

Introduction

In recent years, document classification (text classification/categorization) has become an indispensable technology for implementing intelligent information services. This technique is utilized in various tasks, such as document organization, analysis, and archiving, or serves as a submodule to support high-level applications [1]. For example, a medium-sized company may receive a considerable number of service requests in the form of documents every day. Some of them may lack necessary information, such as the specific department that should process them. Although customer service staff could manage such task assignment manually, an automatic document classification system can considerably reduce the workload and save time.

Considering the rapid growth in the number of online electronic documents, the conventional approach of classifying information by having humans read and analyze documents can no longer meet daily needs [2]. As a result, more precise, intelligent, and fast text categorization techniques are needed to classify incoming documents into categories of possibly different granularities. On this basis, users can estimate the importance of contents and determine the priority of each document, thereby planning their work in a more organized way [3].

Automatic document classification has wide applications in data mining and natural language processing (NLP). In general, it organizes documents according to pre-determined classes by using machine learning (ML) algorithms. Specifically, a commonly used formulation of text classification is as follows: given a set of classification labels and a set of training documents, calculate the probability of each category label for each test document, and then select the label with the highest probability as the predicted category. Conventional ML algorithms, such as the naïve Bayesian (NB) classifier, decision tree (DT), k-nearest neighbor, support vector machine (SVM), and neural networks (NNs), can be applied to text classification tasks [4]. In addition, deep learning algorithms have been considered as a possible solution for text classification. Common methods include convolutional NNs (CNNs) [5], which are often used for image classification in computer vision, and recurrent NNs (RNNs), which are used for processing sequential information [6].
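The probabilistic formulation above can be illustrated with a minimal naïve Bayes classifier written from scratch. The toy documents, vocabulary, and labels below are invented for illustration and are not from the paper's datasets.

```python
from collections import Counter, defaultdict
import math

# Toy training corpus: (tokenized document, label). Documents and labels
# are invented here purely for illustration.
train = [
    (["grain", "harvest", "farm"], "agriculture"),
    (["soil", "crop", "farm"], "agriculture"),
    (["phone", "chip", "startup"], "technology"),
    (["software", "chip", "phone"], "technology"),
]

def fit(train):
    """Count label priors and per-label word frequencies."""
    priors = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for words, label in train:
        word_counts[label].update(words)
    return priors, word_counts

def predict(words, priors, word_counts, vocab_size):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    best_label, best_score = None, float("-inf")
    total = sum(priors.values())
    for label, prior in priors.items():
        n = sum(word_counts[label].values())
        score = math.log(prior / total)
        for w in words:
            # Laplace smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][w] + 1) / (n + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors, word_counts = fit(train)
vocab = {w for words, _ in train for w in words}
print(predict(["crop", "harvest"], priors, word_counts, len(vocab)))  # agriculture
```

Note that the model relies purely on word frequencies: a polysemous word such as "Xiaomi" would contribute the same counts regardless of its intended sense, which is exactly the weakness the paper addresses.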

However, most of the aforementioned methods do not consider solving classification problems from the viewpoint of semantic analysis (SA) [1]. For example, a conventional text classification method based on the Bayesian principle constructs a classification model by counting the frequency of feature words in a text corpus. However, in the classification process, this method does not address the negative effects associated with polysemy (that is, a word may have different meanings depending on the context) and synonyms (that is, different words can have the same meaning). For example, the Chinese word “Xiaomi” (“millet” in English) can refer to both agricultural products and the high-tech corporation Xiaomi. Therefore, documents containing the word “Xiaomi” may be classified as “agriculture” or “technology” when using conventional Bayesian methods.

At the same time, document misclassification may be caused by synonyms. For example, "expert" is a synonym for "specialist" and "professional", which may appear in documents in various contexts (for example, architecture, culture, and history). Therefore, selecting these words as features in a classification model may result in classification errors. Similar situations may also arise when deep learning methods based on word embeddings are applied to document classification tasks. Dependencies between words are estimated statistically as the posterior probability of one word following another. However, a single embedding cannot represent multiple meanings, whereas similar embeddings may involve different topics. An example illustrating the negative effects of polysemy and synonyms on document classification is provided in Appendix A.

To sum up, in this paper, we mainly explore the following research problems:

(1) Polysemy problem: a word often has multiple meanings, and texts containing such words are prone to misclassification;

(2) Synonym problem: different words with similar meanings often appear in different contexts; when they appear in the same text simultaneously, they can lead to document classification errors.

Previous related works have reported important progress on these two research problems. Khan, Baharudin, Lee, and Khan (2010) [4] considered that SA could be applied to improve the performance of classification [7]. They noted that SA could be implemented by introducing ontologies [4]. Fang, Guo, Wang, and Yang (2007) also discussed that SA could be conducted using ontology methods [8]. An ontology was defined as domain-specific knowledge pre-defined by experts [9]. However, ontology methods (for example, ontology integration) were found not always sufficient to eliminate ambiguities between different domains or natural languages [10]. Therefore, this type of method could not appropriately solve the problems of polysemy and synonyms [11]. Liu, Scheuermann, Li, and Zhu (2007) [12] proposed a word sense disambiguation (WSD) method [13] based on the WordNet dictionary and used it for text classification. Other methods relying on WSD include supervised methods (for example, those proposed by Jin, Zhang, Chen, and Xia (2016) [14]) and unsupervised/joint methods (for example, those discussed by Wawer and Mykowiecka (2017) [15]). However, most existing text classification methods do not simultaneously consider the meaning uncertainty of polysemous words and the multi-scene characteristics of synonyms. Moreover, several methods employed named entities for text classification. For example, Anđelić, Kondić, Perić, Jocić, and Kovačević (2017) [3] proposed a method based on named entity linking [16]. However, from their experiments, they concluded that this strategy did not yield significant improvements, and in some cases, the performance was even worse. Türker, Zhang, Koutraki, and Sack (2018) [17] proposed a method based on a named entity dictionary (namely, an anchor text dictionary). However, if a word in the text was not specified in the dictionary, the classification results could be biased.

The results of the extended HIT IR-Lab Tongyici Cilin experiment confirmed that the performance of information retrieval (IR) [18], text classification, and automatic question-answering systems [19] can be significantly improved by effectively extending word meanings or replacing keywords with synonyms [20]. In the field of NLP, Fadaee, Bisazza, and Monz [21] demonstrated that machine translation models could be improved by expanding the text. Kobayashi [22] proposed a bi-directional language model (LM) to predict possible target words based on surrounding ones. Kobayashi verified the LM with CNN- and RNN-based classifiers on six datasets, and the obtained results indicated that text expansion can further improve the performance of NLP models. On the basis of the aforementioned linguistic evidence, in the present paper, we propose two novel strategies to improve the performance of document classification from the perspective of SA, referred to as semantic document classification (SDC). The first strategy solves the polysemy problem through a novel semantic similarity computation (SSC) method. The contextual semantics of a polysemous word in the original sentence are determined by semantically comparing them with the different concepts representing this word in a reference dictionary. In the present study, the CoDic [23], [24] and Hownet dictionaries [25] are utilized as reference dictionaries to implement semantic disambiguation. The purpose of this strategy is to ensure that each feature has clear semantics; features with ambiguity are removed. The second strategy applies a strong correlation analysis method (SCM) to solve the synonym problem. This method aims to remove the synonyms that are irrelevant to a considered classification task. Given a set of synonymous feature words S, it extracts the common semantics of S from the reference dictionary through semantic replacement.
Experimental results demonstrate that conventional ML methods (for example, Bernoulli naïve Bayes or nu-support vector classification) are prone to be affected by the polysemy and synonym issues, specifically in the case of Chinese documents. Moreover, n-gram and contextualized word embedding methods (for example, FastText and bidirectional encoder representations from transformers (BERT)) are insufficient for Chinese text classification, although models built on them perform appropriately on English documents and are not affected by these two issues.
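The dictionary-based disambiguation at the core of the first strategy can be sketched with a simplified Lesk-style heuristic: each candidate sense of a word is scored by how much its dictionary gloss overlaps with the surrounding context. The sense inventory and glosses below are invented stand-ins for a reference dictionary such as Hownet, and the overlap score is a toy substitute for the paper's actual SSC measure.

```python
# A toy sense inventory standing in for a reference dictionary such as
# Hownet; the glosses below are invented for illustration only.
SENSES = {
    "xiaomi": {
        "grain": {"millet", "cereal", "crop", "food", "farm"},
        "company": {"smartphone", "electronics", "brand", "technology"},
    }
}

def disambiguate(word, context, senses=SENSES):
    """Return the sense whose gloss overlaps most with the context words
    (a simplified Lesk heuristic, not the paper's exact SSC measure)."""
    context = set(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("xiaomi", ["the", "new", "smartphone", "brand"]))      # company
print(disambiguate("xiaomi", ["harvest", "of", "crop", "and", "cereal"]))  # grain
```

Once each occurrence is resolved to a single sense, ambiguous features can be replaced or discarded before classification, which is the intent of the SSC strategy described above.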

The paper is organized as follows. Section 2 provides an overview of research works dedicated to document classification and describes the two considered semantic dictionaries. Section 3 describes the two novel proposed strategies (namely, SSC and SCM) for SDC. In Section 4, we discuss the experiments conducted for performance evaluation. The final section draws conclusions and outlines future research directions.

Section snippets

Document classification

The history of automatic document classification technology started in the early 1960s. Aggarwal and Zhai [26] formulated a mathematical definition of document classification: given a set of text documents D = {d1, d2, …, dn}, each document di is assigned a category index from a list of m text category labels {c1, c2, …, cm}. In its early stages, document classification relied on heuristic strategies. These approaches use a set of rules based on expert knowledge to categorize a text.

Semantic document classification

In this section, we describe the two novel strategies developed to solve the two research problems above.

Experiments

In this section, we first introduce the considered datasets and define the evaluation metrics used. Then, we compare the proposed strategies with several baselines and describe the detailed experimental procedure. Finally, the experimental results are discussed based on the classification performance estimates.

Conclusion

In the present paper, we propose two novel strategies for SDC of Chinese texts. They mainly introduce the two following improvements: (1) Solving the polysemy problem by utilizing a novel SSC. SSC implies implementing SA by executing semantic similarity computation and embedding by using a common semantic dictionary. In the present paper, we apply CoDic for English texts and Hownet for Chinese ones. (2) Solving the synonym problem by developing an SCM. SCM is based on the CDM strategy for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research is supported by both the National Natural Science Foundation of China (grant no.: 61802079) and Guangzhou University Grant. The authors would like to thank all the anonymous referees for their valuable comments and helpful suggestions.

References (57)

  • Sáez C. et al., An HL7-CDA wrapper for facilitating semantic interoperability to rule-based clinical decision support systems, Comput. Methods Programs Biomed. (2013)

  • Yang S. et al., An improved ID3 algorithm for medical data classification, Comput. Electr. Eng. (2018)

  • Anđelić S. et al., Text classification based on named entities

  • Khan A. et al., A review of machine learning algorithms for text-documents classification, J. Adv. Inf. Technol. (2010)

  • Kim Y., Convolutional neural networks for sentence classification (2014)

  • Young T. et al., Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag. (2018)

  • Fang J. et al., Ontology-based automatic classification and ranking for web documents

  • Thangaraj M. et al., Text classification techniques: A literature review, Interdiscip. J. Inf. Knowl. Manage. (2018)

  • Gambhir M. et al., Recent automatic text summarization techniques: a survey, Artif. Intell. Rev. (2017)

  • Liu Y. et al., Using WordNet to disambiguate word senses for text classification

  • Jin P. et al., Bag-of-embeddings for text classification

  • Wawer A., Mykowiecka A., Supervised and unsupervised word sense disambiguation on word embedding vectors of unambiguous...

  • Türker R. et al., TECNE: Knowledge based text classification using network embeddings

  • Jiu-le T. et al., Words similarity algorithm based on Tongyici Cilin in semantic web adaptive learning system, J. Jilin Univ. (Inf. Sci. Ed.) (2010)

  • Fadaee M. et al., Data augmentation for low-resource neural machine translation (2017)

  • Kobayashi S., Contextual augmentation: Data augmentation by words with paradigmatic relations (2018)

  • Guo J. et al., Improving multilingual semantic interoperation in cross-organizational enterprise systems through concept disambiguation, IEEE Trans. Ind. Inf. (2012)

  • Dong Z. et al., HowNet and the Computation of Meaning (2006)