Non-numerical nearest neighbor classifiers with value-object hierarchical embedding

https://doi.org/10.1016/j.eswa.2020.113206

Highlights

  • A value-object hierarchical distance measure for categorical data.

  • Two nearest neighbor classifiers applied for categorical data classification.

  • Better results of the new classifiers compared with state-of-the-art classifiers.

  • Detailed analysis of the impact of factors on the new classifiers.

Abstract

Non-numerical classification plays an essential role in many real-world applications such as DNA analysis, recommendation systems and expert systems. The nearest neighbor classifier is one of the most popular and flexible models for performing classification tasks in these applications. However, due to the complexity of non-numerical data, existing nearest neighbor classifiers that use the overlap measure and its variants cannot capture the inherent ordered relationships and statistical information of non-numerical data. This limits the classification performance of nearest neighbor classifiers in non-numerical data environments. To overcome this challenge, we propose a novel object distance metric, i.e., the value-object hierarchical metric (VOHM), which is able to capture inherent ordered relationships within non-numerical data. We then construct two nearest neighbor classifiers, i.e., the value-object hierarchical embedded nearest neighbor classifier (VO-kNN) and the two-stage value-object hierarchical embedded nearest neighbor classifier (TSVO-kNN), which take advantage of both VOHM and non-numerical feature selection. Experiments show that both VO-kNN and TSVO-kNN can mine more knowledge from data and achieve better performance than state-of-the-art classifiers in non-numerical data environments.

Introduction

The classification problem (Murphy, 2012) is a fundamental issue in the areas of machine learning, data mining, artificial intelligence, and expert systems. It plays an essential role in many applications, including sentiment analysis, spam filtering, image analysis, text analysis, and DNA sequence analysis. Many methods exist for solving classification tasks, such as logistic regression (LR) (Murphy, 2012, Walker, Duncan, 1967), random forests (RF) (Breiman, 2001, Ho, 1995), support vector machines (SVMs) (Catanzaro, Sundaram, Keutzer, 2008, Cortes, Vapnik, 1995), and artificial neural networks (ANN) and deep learning (DL) (Bengio, Courville, Vincent, 2013, Lecun, Bengio, Hinton, 2015), all of which have achieved great success in numerical environments. However, applying these models to classification tasks in non-numerical environments remains a big challenge due to the lack of useful non-numerical distance metrics for evaluating the relations between objects.

Non-numerical (or categorical) data is a widely used data type in expert systems. For example, Table 1 is a staff table instance with non-numerical values. People can conveniently acquire knowledge from non-numerical values. However, most existing machine learning algorithms struggle to do the same, because, from the perspective of these algorithms, non-numerical data does not carry the semantics or context that humans can easily capture and understand. Hence, narrowing the gap between algorithms and humans has become one of the critical factors for improving the classification performance of algorithms in non-numerical data environments. To solve the non-numerical classification problem, the most commonly used strategy is to convert non-numerical data into numerical data and process it with machine learning algorithms (Buttrey, 1998) such as nearest neighbor classifiers (Chen, Guo, 2015, Hu, Yu, Xie, 2008, Liu, Cao, Yu, 2014). The k-nearest neighbor classifier (kNN) is one of the most frequently used nonparametric classification models in expert systems (Müller, Salminen, Nieminen, Kontunen, Karjalainen, Isokoski, Rantala, Savia, Väliaho, Kallio, Lekkala, Surakka, 2019, Rodger, 2014), because the model has no training phase and is easy to implement. Therefore, the problem of how to convert non-numerical data into numerical values becomes the bottleneck for kNN classifiers performing non-numerical classification tasks.

The overlap metric (or simple matching coefficient) (Boriah, Chandola, & Kumar, 2008) is a commonly used discrepancy measure that converts data from non-numerical form into numerical form. However, because it neglects the different contributions of attributes in non-numerical classification problems, the overlap metric cannot meet the requirements of real-world applications. Although attribute weighting based approaches (Chen, Guo, 2015, Chen, Ye, Guo, Zhu, 2016, Morlini, Zani, 2012) satisfy these requirements, they still fail to capture latent ordered relationships that exist among non-numerical values. For example, in Table 1, according to common sense, the distance between ‘Bachelor’ and ‘Doctor’ is obviously greater than that between ‘Bachelor’ and ‘Master’. However, capturing this latent ordered information is beyond the scope of the overlap measure and attribute weighting based measures. Therefore, when a classification algorithm processes non-numerical data, how to mine latent ordered information from non-numerical values becomes one of the basic problems to be solved. Furthermore, there also exists a more complicated dependency relationship between non-numerical values from different attributes. For example, the association rule ‘Beer’ ⇒ ‘Diapers’ found in sales data would indicate that if a customer buys beer, they are likely to also buy diapers with a certain probability (Agrawal, Srikant, 1994, Han, Pei, Yin, 2000, Zaki, 2000). In other words, dependency relationships exist between non-numerical values from different attributes. Hence, another question arises: how to represent and capture this second kind of relationship in non-numerical data environments. In summary, these challenges impose new requirements on distance measures for non-numerical data. Data science (Cao, 2017a, Cao, 2017b) shows that complicated relationships exist in data, and much of the knowledge they carry requires intelligent algorithms to discover.
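As a concrete illustration of the first limitation, consider a minimal Python sketch of the overlap measure (not the paper's code; the example values are hypothetical): distinct values always contribute a discrepancy of 1, so 'Master' cannot be placed "between" 'Bachelor' and 'Doctor'.

```python
def overlap_distance(x, y):
    """Overlap distance between two categorical objects of equal length:
    count the attributes whose values differ (identical -> 0, distinct -> 1)."""
    return sum(1 for a, b in zip(x, y) if a != b)

alice = ["Bachelor", "Engineer"]
bob   = ["Master",   "Engineer"]
carol = ["Doctor",   "Engineer"]

# Both pairs differ in exactly one attribute, so the overlap measure
# assigns them the same distance and loses the latent order.
print(overlap_distance(alice, bob))    # 1
print(overlap_distance(alice, carol))  # 1
```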

To overcome these problems, in this paper, we construct a novel distance metric, called the value-object hierarchical metric (VOHM), to update the distance measurement strategy for non-numerical data. VOHM learns the latent ordered relationship from the value-object hierarchy structure. At the value level, VOHM handles all values from a probability perspective, which captures the latent ordered relationship of values relative to the class label distribution. At the object level, VOHM treats the distance of each object pair as the total sum of the discrepancies from the value level. We then propose a nearest neighbor classifier that takes advantage of both VOHM and attribute reduction. First, to avoid the curse of dimensionality and reduce computation, we employ a rough set theory based attribute reduction strategy to select the required attributes. Then, we equip the nearest neighbor classifier with VOHM to perform the classification task on the attribute-filtered data set. More specifically, we develop two nearest neighbor classifiers, i.e., the value-object hierarchical embedded nearest neighbor classifier (VO-kNN) and the two-stage value-object hierarchical embedded nearest neighbor classifier (TSVO-kNN).
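The value-object idea above can be sketched in Python. This is an illustrative interpretation under stated assumptions, not the paper's exact definition of VOHM: the value-level discrepancy is taken here as the L1 gap between class-conditional distributions P(class | value), and the object-level distance as the sum of value-level discrepancies over attributes; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def value_class_distributions(column, labels):
    """Estimate P(class | value) for one categorical attribute."""
    counts = defaultdict(Counter)
    for v, y in zip(column, labels):
        counts[v][y] += 1
    classes = sorted(set(labels))
    return {v: [c[y] / sum(c.values()) for y in classes]
            for v, c in counts.items()}

def value_distance(dist, u, v):
    """Value-level discrepancy: L1 gap between class-conditional distributions."""
    return sum(abs(p - q) for p, q in zip(dist[u], dist[v]))

def object_distance(dists, x1, x2):
    """Object-level distance: sum of value-level discrepancies over attributes."""
    return sum(value_distance(dists[j], a, b)
               for j, (a, b) in enumerate(zip(x1, x2)))

# Toy data: education levels whose class distributions shift gradually.
edu = ["Bachelor"] * 4 + ["Master"] * 4 + ["Doctor"] * 4
labels = [0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1]
dist = value_class_distributions(edu, labels)

# The learned discrepancies recover the intuitive order:
# 'Bachelor' is closer to 'Master' than to 'Doctor'.
print(value_distance(dist, "Bachelor", "Master"))  # 0.5
print(value_distance(dist, "Bachelor", "Doctor"))  # 1.0
```

Because the discrepancy is derived from how each value co-occurs with the class labels, ordered values whose label distributions shift gradually end up at gradually increasing distances, which is exactly what the overlap measure cannot express.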

The main contributions of this paper are summarized as follows:

  • A new distance measure for non-numerical data, i.e., VOHM, is proposed. VOHM enables us to capture the latent ordered relationship, which gives more knowledge than the overlap measure and the weighted overlap measure.

  • We propose a value-object hierarchical embedded nearest neighbor classifier that takes advantage of VOHM.

  • We propose a two-stage value-object hierarchical embedded nearest neighbor classifier to perform the classification task on the reduced data set to avoid the curse of dimensionality.

The rest of the paper is organized as follows. In Section 2, we briefly review existing classifiers for non-numerical data. In Section 3, we formulate the research problem and introduce the preliminary notions. Section 4 defines the VO-kNN classifier. In Section 5, we design a feature selection algorithm based on rough set theory. In Section 6, we conduct experiments to show the advantages of our model and algorithm. Finally, we conclude this work in Section 7.

Section snippets

Overview of existing classifiers for categorical data and related work

In the literature, many classifiers have been proposed for categorical data. This section presents an overview of them as follows.

Problem formulation and framework

In what follows, the dataset T consists of data objects, i.e., T = {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}, where m is the size of the set T, and y^(i) is the class label of the object x^(i). Each object x^(i) consists of categorical feature values, i.e., x^(i) = {x_1^(i), x_2^(i), …, x_d^(i)}, where d is the total number of features. A categorical feature is a feature whose value set has limited cardinality. For example, the categorical feature “sex” may have the two values {‘male’, ‘female’}.
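The formulation above admits a direct in-memory encoding, sketched here in Python with hypothetical staff-table values (the feature names and labels are illustrative, not from the paper's datasets):

```python
# T is a list of (object, label) pairs; each object is a tuple of
# d categorical values, mirroring x^(i) = {x_1^(i), ..., x_d^(i)}.
T = [
    (("male",   "Bachelor", "Engineer"),  "junior"),
    (("female", "Master",   "Manager"),   "senior"),
    (("male",   "Doctor",   "Professor"), "senior"),
]

m = len(T)            # size of the dataset
d = len(T[0][0])      # number of categorical features

# Each feature's value set has limited cardinality:
values_of = [set(x[j] for x, _ in T) for j in range(d)]
print(m, d)  # 3 3
```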

VO-kNN classifier

In this section, we will introduce the components of the VO-kNN classifier.

Two-stage VO-kNN classifier

Due to feature correlation, we need to select the features that play a significant role in the classification task. In numerical environments, many feature selection methods exist, for example principal component analysis (PCA). For non-numerical data, however, feature selection methods are rare apart from rough set based methods.

The core issue of feature selection is to select a feature subset which has the same ability to distinguish all objects like the
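The core idea can be sketched as follows (a minimal Python illustration of rough set based attribute reduction; the greedy forward-selection strategy and function names are assumptions, not necessarily the paper's exact algorithm): a feature subset is sufficient when its positive region, i.e., the set of objects whose equivalence classes under that subset are label-pure, matches that of the full feature set.

```python
def positive_region_size(rows, labels, attrs):
    """Number of objects whose equivalence class (under attrs) is label-pure."""
    groups = {}
    for x, y in zip(rows, labels):
        groups.setdefault(tuple(x[j] for j in attrs), set()).add(y)
    return sum(1 for x, y in zip(rows, labels)
               if len(groups[tuple(x[j] for j in attrs)]) == 1)

def greedy_reduct(rows, labels):
    """Greedily add the attribute that most enlarges the positive region
    until the full feature set's discriminating ability is matched."""
    all_attrs = list(range(len(rows[0])))
    target = positive_region_size(rows, labels, all_attrs)
    selected = []
    while positive_region_size(rows, labels, selected) < target:
        best = max((a for a in all_attrs if a not in selected),
                   key=lambda a: positive_region_size(rows, labels, selected + [a]))
        selected.append(best)
    return selected

# Toy example: attribute 0 alone separates the classes, attribute 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
print(greedy_reduct(rows, labels))  # [0]
```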

Experiments

The empirical study of the VO-kNN and TSVO-kNN classifiers is given in this section. We first set up the experiments by introducing the datasets and comparison methods. Then we evaluate performance in terms of prediction accuracy compared with other methods. We also evaluate the effects of factors such as the parameter k and the selected attributes on classification performance.

Conclusion

Due to the complexity of the data and the lack of effective non-numerical distance metrics, it is challenging to exploit the inherent characteristics of non-numerical data and to represent them effectively. In this work, we developed a novel value-object hierarchical metric to capture the latent order relationship in non-numerical data. We then equipped the nearest neighbor classifier with the new non-numerical metric and rough set theory based feature selection. It extends nearest

CRediT authorship contribution statement

Sheng Luo: Writing - original draft, Writing - review & editing, Conceptualization, Formal analysis, Methodology, Data curation, Validation. Duoqian Miao: Conceptualization, Formal analysis, Supervision, Writing - review & editing, Funding acquisition. Zhifei Zhang: Writing - review & editing, Data curation, Investigation. Zhihua Wei: Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors thank both the editors and the anonymous referees for their valuable suggestions, which substantially improved this paper. This work is supported by National Key R&D Program of China (Grant no. 213), the National Science Foundation of China (Grant nos. 61673301, 61906137).

References (55)

  • S.J. Russell et al.

    Artificial intelligence: A modern approach

    (2016)
  • R. Agrawal et al.

    Fast algorithms for mining association rules in large databases

    Proceedings of the 20th international conference on very large data bases

    (1994)
  • M.J. Alamelu et al.

    A novel web page classification model using an improved k nearest neighbor algorithm

3rd international conference on intelligent computational systems (ICICS’2013), April

    (2013)
  • Y. Bengio et al.

    Representation learning: A review and new perspectives

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2013)
  • S. Boriah et al.

    Similarity measures for categorical data: A comparative evaluation

    Proceedings of the 8th SIAM international conference on data mining

    (2008)
  • L. Breiman

    Random forests

    Machine Learning

    (2001)
  • L.I. Breiman et al.

    Classification and regression trees (CART)

    Encyclopedia of Ecology

    (1984)
  • L. Cao

    Data science: A comprehensive overview

    ACM Computing Surveys

    (2017)
  • L. Cao

    Data science: Challenges and directions

    Communications of the ACM

    (2017)
  • B. Catanzaro et al.

    Fast support vector machine training and classification on graphics processors

    Proceedings of the 25th international conference on machine learning

    (2008)
  • L. Chen et al.

    Kernel-based linear classification on categorical data

    Soft Computing

    (2016)
  • C. Cortes et al.

    Support-vector networks

    Machine Learning

    (1995)
  • S. Cost et al.

    A weighted nearest neighbor algorithm for learning with symbolic features

    Machine Learning

    (1993)
  • W. Daelemans et al.

    Memory-based language processing

    (2005)
  • P.J. García-Laencina et al.

    K nearest neighbours with mutual information for simultaneous classification and missing data imputation

    Neurocomputing

    (2009)
  • J. Han et al.

    Mining frequent patterns without candidate generation

Proceedings of the 2000 ACM SIGMOD international conference on management of data

    (2000)
  • D.J. Hand et al.

    Idiot’s Bayes – Not so stupid after all?

    International Statistical Review

    (2001)