Non-numerical nearest neighbor classifiers with value-object hierarchical embedding

https://doi.org/10.1016/j.eswa.2020.113206

Highlights

  • A value-object hierarchical distance measure for categorical data.

  • Two nearest neighbor classifiers applied for categorical data classification.

  • Better results of the new classifiers compared with state-of-the-art classifiers.

  • Detailed analysis of the impact of factors on the new classifiers.

Abstract

Non-numerical classification plays an essential role in many real-world applications such as DNA analysis, recommendation systems and expert systems. The nearest neighbor classifier is one of the most popular and flexible models for performing classification tasks in these applications. However, due to the complexity of non-numerical data, existing nearest neighbor classifiers that use the overlap measure and its variants cannot capture the inherent ordered relationships and statistical information of non-numerical data. This limits the classification performance of nearest neighbor classifiers in non-numerical data environments. To overcome this challenge, we propose a novel object distance metric, i.e., the value-object hierarchical metric (VOHM), which is able to capture inherent ordered relationships within non-numerical data. We then construct two nearest neighbor classifiers, i.e., the value-object hierarchical embedded nearest neighbor classifier (VO-kNN) and the two-stage value-object hierarchical embedded nearest neighbor classifier (TSVO-kNN), which take advantage of both VOHM and non-numerical feature selection. Experiments show that both VO-kNN and TSVO-kNN can mine more knowledge from data and achieve better performance than state-of-the-art classifiers in non-numerical data environments.

Introduction

The classification problem (Murphy, 2012) is a fundamental issue in the areas of machine learning, data mining, artificial intelligence, and expert systems. It plays an essential role in many applications, including sentiment analysis, spam filtering, image analysis, text analysis, and DNA sequence analysis. Many methods exist for solving classification tasks, such as logistic regression (LR) (Murphy, 2012, Walker, Duncan, 1967), random forests (RF) (Breiman, 2001, Ho, 1995), support vector machines (SVMs) (Catanzaro, Sundaram, Keutzer, 2008, Cortes, Vapnik, 1995), and artificial neural networks (ANN) and deep learning (DL) (Bengio, Courville, Vincent, 2013, Lecun, Bengio, Hinton, 2015), all of which have achieved great success in numerical environments. However, applying these models to classification tasks in non-numerical environments remains a big challenge due to the lack of useful non-numerical distance metrics for evaluating the relations between objects.

Non-numerical (or categorical) data is a widely used data type in expert systems. For example, Table 1 is a staff table instance with non-numerical values. People can conveniently acquire knowledge from non-numerical values. However, most existing machine learning algorithms struggle to do the same, because, from the perspective of these algorithms, non-numerical data does not carry the semantics or context that humans can easily capture and understand. Hence, narrowing the gap between algorithms and humans has become one of the critical factors for improving the classification performance of algorithms in non-numerical data environments. To solve the non-numerical classification problem, the most commonly used strategy is to convert non-numerical data into numerical data and process it with machine learning algorithms (Buttrey, 1998) such as nearest neighbor classifiers (Chen, Guo, 2015, Hu, Yu, Xie, 2008, Liu, Cao, Yu, 2014). The k-nearest neighbor classifier (kNN) is one of the most frequently used nonparametric classification models in expert systems (Müller, Salminen, Nieminen, Kontunen, Karjalainen, Isokoski, Rantala, Savia, Väliaho, Kallio, Lekkala, Surakka, 2019, Rodger, 2014), because the model has no training phase and is easy to implement. Therefore, the problem of how to convert non-numerical data into numerical values becomes the bottleneck for kNN classifiers performing non-numerical classification tasks.

The overlap metric (or simple matching coefficient) (Boriah, Chandola, & Kumar, 2008) is a commonly used discrepancy measure that converts data from non-numerical form into numerical form. However, because it neglects the different contributions of attributes in non-numerical classification problems, the overlap metric cannot meet the requirements of real-world applications. Although attribute weighting based approaches (Chen, Guo, 2015, Chen, Ye, Guo, Zhu, 2016, Morlini, Zani, 2012) satisfy these requirements, they still fail to capture latent ordered relationships that exist among non-numerical values. For example, in Table 1, according to common sense, the distance between ‘Bachelor’ and ‘Doctor’ is obviously greater than that between ‘Bachelor’ and ‘Master’. However, capturing this latent ordered information is beyond the scope of the overlap measure and attribute weighting based measures. Therefore, when a classification algorithm processes non-numerical data, how to mine latent ordered information from non-numerical values becomes one of the basic problems to be solved. Furthermore, there also exists a more complicated dependency relationship between non-numerical values from different attributes. For example, the association rule ‘Beer’ ⇒ ‘Diapers’ found in sales data would indicate that if a customer buys beer, they are likely to also buy diapers with a certain probability (Agrawal, Srikant, 1994, Han, Pei, Yin, 2000, Zaki, 2000). In other words, dependency relationships exist between non-numerical values from different attributes. Hence, another question arises: how to represent and capture this second kind of relationship in non-numerical data environments. In summary, these challenges impose new requirements on distance measures for non-numerical data. Data science (Cao, 2017a, Cao, 2017b) shows that complicated relationships exist in data, and much of the knowledge they carry requires intelligent algorithms to discover.
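As a concrete illustration of the first limitation, consider a minimal Python sketch of the overlap measure (not the paper's code; the example values are hypothetical): distinct values always contribute a discrepancy of 1, so 'Master' cannot be placed "between" 'Bachelor' and 'Doctor'.

```python
def overlap_distance(x, y):
    """Overlap distance between two categorical objects of equal length:
    count the attributes whose values differ (identical -> 0, distinct -> 1)."""
    return sum(1 for a, b in zip(x, y) if a != b)

alice = ["Bachelor", "Engineer"]
bob   = ["Master",   "Engineer"]
carol = ["Doctor",   "Engineer"]

# Both pairs differ in exactly one attribute, so the overlap measure
# assigns them the same distance and loses the latent order.
print(overlap_distance(alice, bob))    # 1
print(overlap_distance(alice, carol))  # 1
```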

To overcome these problems, in this paper, we construct a novel distance metric, called the value-object hierarchical metric (VOHM), to update the distance measurement strategy for non-numerical data. VOHM learns the latent ordered relationship from the value-object hierarchy structure. At the value level, VOHM handles all values from a probability perspective, which captures the latent ordered relationship of values relative to the class label distribution. At the object level, VOHM treats the distance of each object pair as the total sum of the discrepancies from the value level. We then propose a nearest neighbor classifier that takes advantage of both VOHM and attribute reduction. First, to avoid the curse of dimensionality and reduce computation, we employ a rough set theory based attribute reduction strategy to select the required attributes. Then, we equip the nearest neighbor classifier with VOHM to perform the classification task on the attribute-filtered data set. More specifically, we develop two nearest neighbor classifiers, i.e., the value-object hierarchical embedded nearest neighbor classifier (VO-kNN) and the two-stage value-object hierarchical embedded nearest neighbor classifier (TSVO-kNN).
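The value-object idea above can be sketched in Python. This is an illustrative interpretation under stated assumptions, not the paper's exact definition of VOHM: the value-level discrepancy is taken here as the L1 gap between class-conditional distributions P(class | value), and the object-level distance as the sum of value-level discrepancies over attributes; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def value_class_distributions(column, labels):
    """Estimate P(class | value) for one categorical attribute."""
    counts = defaultdict(Counter)
    for v, y in zip(column, labels):
        counts[v][y] += 1
    classes = sorted(set(labels))
    return {v: [c[y] / sum(c.values()) for y in classes]
            for v, c in counts.items()}

def value_distance(dist, u, v):
    """Value-level discrepancy: L1 gap between class-conditional distributions."""
    return sum(abs(p - q) for p, q in zip(dist[u], dist[v]))

def object_distance(dists, x1, x2):
    """Object-level distance: sum of value-level discrepancies over attributes."""
    return sum(value_distance(dists[j], a, b)
               for j, (a, b) in enumerate(zip(x1, x2)))

# Toy data: education levels whose class distributions shift gradually.
edu = ["Bachelor"] * 4 + ["Master"] * 4 + ["Doctor"] * 4
labels = [0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1]
dist = value_class_distributions(edu, labels)

# The learned discrepancies recover the intuitive order:
# 'Bachelor' is closer to 'Master' than to 'Doctor'.
print(value_distance(dist, "Bachelor", "Master"))  # 0.5
print(value_distance(dist, "Bachelor", "Doctor"))  # 1.0
```

Because the discrepancy is derived from how each value co-occurs with the class labels, ordered values whose label distributions shift gradually end up at gradually increasing distances, which is exactly what the overlap measure cannot express.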

The main contributions of this paper are summarized as follows:

  • A new distance measure for non-numerical data, i.e., VOHM, is proposed. VOHM enables us to capture the latent ordered relationship, which gives more knowledge than the overlap measure and the weighted overlap measure.

  • We propose a value-object hierarchical embedded nearest neighbor classifier that takes advantage of VOHM.

  • We propose a two-stage value-object hierarchical embedded nearest neighbor classifier to perform the classification task on the reduced data set to avoid the curse of dimensionality.

The rest of the paper is organized as follows. In Section 2, we briefly review existing classifiers for non-numerical data. In Section 3, we formulate the research problem and introduce the preliminary notions. Section 4 defines the VO-kNN classifier. In Section 5, we design a feature selection algorithm based on rough set theory. In Section 6, we conduct experiments to show the advantages of our model and algorithm. Finally, we conclude this work in Section 7.

Section snippets

Overview of existing classifiers for categorical data and related work

In the literature, many classifiers have been proposed for categorical data. This section presents an overview of them as follows.

Problem formulation and framework

In what follows, the dataset T consists of data objects, i.e., T = {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}, where m is the size of the set T, and y^(i) is the class label of the object x^(i). Each object x^(i) consists of categorical feature values, i.e., x^(i) = {x_1^(i), x_2^(i), …, x_d^(i)}, where d is the total number of features. A categorical feature is a feature whose value set has limited cardinality. For example, the categorical feature “sex” may have the two values {‘male’, ‘female’}.
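The formulation above admits a direct in-memory encoding, sketched here in Python with hypothetical staff-table values (the feature names and labels are illustrative, not from the paper's datasets):

```python
# T is a list of (object, label) pairs; each object is a tuple of
# d categorical values, mirroring x^(i) = {x_1^(i), ..., x_d^(i)}.
T = [
    (("male",   "Bachelor", "Engineer"),  "junior"),
    (("female", "Master",   "Manager"),   "senior"),
    (("male",   "Doctor",   "Professor"), "senior"),
]

m = len(T)            # size of the dataset
d = len(T[0][0])      # number of categorical features

# Each feature's value set has limited cardinality:
values_of = [set(x[j] for x, _ in T) for j in range(d)]
print(m, d)  # 3 3
```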

VO-kNN classifier

In this section, we will introduce the components of the VO-kNN classifier.

Two-stage VO-kNN classifier

Due to feature correlation, we need to select the features that play a significant role in the classification task. In numerical environments, many feature selection methods exist, for example principal component analysis (PCA). For non-numerical data, however, feature selection methods are rare apart from rough set based methods.

The core issue of feature selection is to select a feature subset which has the same ability to distinguish all objects like the
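The core idea can be sketched as follows (a minimal Python illustration of rough set based attribute reduction; the greedy forward-selection strategy and function names are assumptions, not necessarily the paper's exact algorithm): a feature subset is sufficient when its positive region, i.e., the set of objects whose equivalence classes under that subset are label-pure, matches that of the full feature set.

```python
def positive_region_size(rows, labels, attrs):
    """Number of objects whose equivalence class (under attrs) is label-pure."""
    groups = {}
    for x, y in zip(rows, labels):
        groups.setdefault(tuple(x[j] for j in attrs), set()).add(y)
    return sum(1 for x, y in zip(rows, labels)
               if len(groups[tuple(x[j] for j in attrs)]) == 1)

def greedy_reduct(rows, labels):
    """Greedily add the attribute that most enlarges the positive region
    until the full feature set's discriminating ability is matched."""
    all_attrs = list(range(len(rows[0])))
    target = positive_region_size(rows, labels, all_attrs)
    selected = []
    while positive_region_size(rows, labels, selected) < target:
        best = max((a for a in all_attrs if a not in selected),
                   key=lambda a: positive_region_size(rows, labels, selected + [a]))
        selected.append(best)
    return selected

# Toy example: attribute 0 alone separates the classes, attribute 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
print(greedy_reduct(rows, labels))  # [0]
```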

Experiments

The empirical study of the VO-kNN and TSVO-kNN classifiers is given in this section. We first set up the experiments by introducing the datasets and comparison methods. Then we evaluate performance in terms of prediction accuracy compared with other methods. We also evaluate the effects of factors such as the parameter k and the selected attributes on classification performance.

Conclusion

Due to the complexity of the data and the lack of effective non-numerical distance metrics, it is challenging to exploit the inherent characteristics of non-numerical data and to represent them effectively. In this work, we developed a novel value-object hierarchical metric to capture the latent order relationship in non-numerical data. We then equipped the nearest neighbor classifier with the new non-numerical metric and rough set theory based feature selection. It extends nearest

CRediT authorship contribution statement

Sheng Luo: Writing - original draft, Writing - review & editing, Conceptualization, Formal analysis, Methodology, Data curation, Validation. Duoqian Miao: Conceptualization, Formal analysis, Supervision, Writing - review & editing, Funding acquisition. Zhifei Zhang: Writing - review & editing, Data curation, Investigation. Zhihua Wei: Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors thank both the editors and the anonymous referees for their valuable suggestions, which substantially improved this paper. This work is supported by National Key R&D Program of China (Grant no. 213), the National Science Foundation of China (Grant nos. 61673301, 61906137).

References (55)

  • S.J. Russell et al.

    Artificial intelligence: A modern approach

    (2016)
  • R. Agrawal et al.

    Fast algorithms for mining association rules in large databases

    Proceedings of the 20th international conference on very large data bases

    (1994)
  • M.J. Alamelu et al.

    A novel web page classification model using an improved k nearest neighbor algorithm

3rd international conference on intelligent computational systems (ICICS’2013), April

    (2013)
  • Y. Bengio et al.

    Representation learning: A review and new perspectives

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2013)
  • S. Boriah et al.

    Similarity measures for categorical data: A comparative evaluation

    Proceedings of the 8th SIAM international conference on data mining

    (2008)
  • L. Breiman

    Random forests

    Machine Learning

    (2001)
  • L.I. Breiman et al.

    Classification and regression trees (CART)

    Encyclopedia of Ecology

    (1984)
  • L. Cao

    Data science: A comprehensive overview

    ACM Computing Surveys

    (2017)
  • L. Cao

    Data science: Challenges and directions

    Communications of the ACM

    (2017)
  • B. Catanzaro et al.

    Fast support vector machine training and classification on graphics processors

    Proceedings of the 25th international conference on machine learning

    (2008)
  • L. Chen et al.

    Kernel-based linear classification on categorical data

    Soft Computing

    (2016)
  • C. Cortes et al.

    Support-vector networks

    Machine Learning

    (1995)
  • S. Cost et al.

    A weighted nearest neighbor algorithm for learning with symbolic features

    Machine Learning

    (1993)
  • W. Daelemans et al.

    Memory-based language processing

    (2005)
  • P.J. García-Laencina et al.

    K nearest neighbours with mutual information for simultaneous classification and missing data imputation

    Neurocomputing

    (2009)
  • J. Han et al.

    Mining frequent patterns without candidate generation

Proceedings of the 2000 ACM SIGMOD international conference on management of data

    (2000)
  • D.J. Hand et al.

    Idiot’s Bayes – Not so stupid after all?

    International Statistical Review

    (2001)