Abstract

This paper investigates three-way clustering involving fuzzy covering, threshold acquisition, and boundary region processing. First, a valid fuzzy covering of the universe is constructed on the basis of an appropriate fuzzy similarity relation, which helps capture the structural information and the internal connections of the dataset from a global perspective. Owing to these advantages, we use the valid fuzzy covering instead of the raw dataset for RFCM algorithm-based three-way clustering. Subsequently, from the perspective of the semantic interpretation of balancing the uncertainty changes in fuzzy sets, a method of acquiring partition thresholds that combines linear and nonlinear fuzzy entropy is proposed. Furthermore, boundary regions in three-way clustering correspond to abstaining decisions and generate uncertain rules. To improve the classification accuracy, the k-nearest neighbor (kNN) algorithm is used to reduce the number of objects in the boundary regions. The experimental results show that the proposed three-way clustering algorithm based on fuzzy covering, kNN-FRFCM, performs better than the compared algorithms in most cases.

1. Introduction

Three-way decisions (3WD), proposed by Yao [1, 2], have been a hot topic in various fields in recent years. Since the idea of tripartition was put forward, it has attracted many scholars. In particular, great progress has recently been made in the theoretical research and model building of three-way decisions based on rough sets. For example, Liang and Liu et al. [3–6] proposed fuzzy three-way decision models and stochastic three-way decision models to deal with real-valued or linguistic-valued decision-making problems. Qian et al. [7] established a multigranulation decision-theoretic rough set model based on granular computing theory. Hu [8, 9] introduced the concept of three-way decision space and established a three-way decision model based on partially ordered sets. Qi et al. [10] investigated the 3WD model in the framework of lattice theory. Li et al. [11] constructed a cost-sensitive sequential three-way decision model to simulate the decision-making process from coarse granularity (high cost) to fine granularity (low cost); see [12–14] for further generalizations and applications of this model. Yao et al. [15] constructed an optimization-based framework for three-way approximations of fuzzy sets. Meanwhile, for dynamic objects and attributes, some algorithms and incremental 3WD models have been designed for the classification of dynamic data [16, 17]. From the viewpoint of application, three-way decisions have been widely used in research fields such as pattern recognition [18, 19], artificial intelligence [20–22], engineering, management [23], and social communities [24].

Based on the above background and work on three-way decisions, a novel method for three-way clustering based on fuzzy covering is discussed. First, a fuzzy covering of the dataset is constructed according to a reasonable fuzzy similarity relation. The fuzzy covering of the universe requires that the more similar the objects in the universe are, the more similar the corresponding fuzzy classes are. A fuzzy covering established in this way can better reflect the intrinsic relationship between objects in the universe. Therefore, clustering with a valid fuzzy covering is expected to yield more accurate results. One of the inevitable problems of clustering is threshold calculation. As is well known, for most of the three-way decision models mentioned above, we first need to obtain the pair of partition thresholds $\alpha$ and $\beta$. Different thresholds lead to different decision results. Appropriate partition thresholds make the decision more accurate, whereas inappropriate thresholds distort the decision. Traditionally, the partition thresholds are selected in advance according to experts' experience [25–27]. According to the loss function, Yao et al. [1] proposed a method to determine the thresholds by Bayesian risk decision theory. By using Shannon entropy as a measure of uncertainty, Deng et al. [28] presented an information-theoretic approach to explain and calculate the thresholds. Zhou et al. [29] explored the shadowed set to automatically obtain the partition thresholds of three-way decisions, but this approach cannot theoretically give a reasonable semantic explanation. To address this issue, inspired by the idea of balancing the uncertainty change of fuzzy sets, a threshold calculation method combining linear fuzzy entropy with nonlinear fuzzy entropy is proposed. This method provides a new scientific explanation for the generation of thresholds. The boundary regions of three-way clustering are then processed by the kNN algorithm to reduce uncertainty and improve decision accuracy.

The structure of the rest of this paper is as follows: Section 2 briefly introduces the necessary notions of three-way decisions. Section 3 focuses on constructing the fuzzy covering of the raw dataset according to the fuzzy similarity relation and some necessary conditions and discusses its related properties. In Section 4, a novel rough fuzzy C-means (FRFCM) algorithm based on valid fuzzy covering is established. Then, we investigate the partition thresholds by combining the linear and nonlinear fuzzy entropy. Furthermore, the framework for processing the boundary region of three-way clustering using the kNN algorithm is introduced. In Section 5, the validity and practicability of the algorithm are evaluated by experiment. Concluding remarks are given in Section 6.

2. Preliminaries

The basic concepts on three-way decisions are briefly reviewed in this section.

An information system is defined as a 4-tuple $S = (U, A = C \cup D, V, f)$, where $U$ denotes a finite nonempty universe, $C$ is a nonempty finite set of condition attributes, $D$ is a nonempty finite set of decision attributes, and $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain of attribute $a$; $f: U \times A \to V$ is an information function such that $f(x, a) \in V_a$ for every $x \in U$ and $a \in A$. If $f(x, a)$ is a membership function value, then the value of object $x$ under attribute $a$ can be expressed as $\mu_a(x)$.

The trisecting-and-acting framework of three-way decisions is an extension of binary decision in order to overcome some shortcomings of binary decision. The traditional binary decision model only has acceptance and rejection options, which can easily lead to errors when the information available is insufficient to make an accurate judgment. Sometimes, the cost of wrong decisions is very high. Therefore, deferment decision is necessary, which allows decision makers to collect more information and make more accurate judgment. This is a strategy that people often adopt in the decision-making process, and deferment decision is consistent with human cognition. A three-way decision model based on the evaluation function and a pair of thresholds is shown as follows.

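Definition 1. (see [30]). Let $U$ be a finite nonempty universe, $v: U \to [0, 1]$ be an evaluation function, and $(\alpha, \beta)$ be a pair of thresholds with $0 \le \beta < \alpha \le 1$. Then the positive, negative, and boundary regions of any subset $X \subseteq U$ are defined as follows:

$$\begin{aligned} \mathrm{POS}_{(\alpha, \beta)}(X) &= \{x \in U \mid v(x) \ge \alpha\}, \\ \mathrm{NEG}_{(\alpha, \beta)}(X) &= \{x \in U \mid v(x) \le \beta\}, \\ \mathrm{BND}_{(\alpha, \beta)}(X) &= \{x \in U \mid \beta < v(x) < \alpha\}. \end{aligned} \tag{1}$$

The evaluation function is the key to the decision: different evaluation functions lead to different decision results, and various evaluation functions can be adopted. If a fuzzy membership function $\mu_A$ is used as the evaluation function, then the induced three regions are defined by the following equations [31]:

$$\begin{aligned} \mathrm{POS}_{(\alpha, \beta)}(A) &= \{x \in U \mid \mu_A(x) \ge \alpha\}, \\ \mathrm{NEG}_{(\alpha, \beta)}(A) &= \{x \in U \mid \mu_A(x) \le \beta\}, \\ \mathrm{BND}_{(\alpha, \beta)}(A) &= \{x \in U \mid \beta < \mu_A(x) < \alpha\}. \end{aligned} \tag{2}$$

The three-valued approximation of a fuzzy set $A$ is described by Zadeh [32] as follows: (1) $x$ belongs to $A$ if $\mu_A(x) \ge \alpha$; (2) $x$ does not belong to $A$ if $\mu_A(x) \le \beta$; and (3) $x$ has an indeterminate status relative to $A$ if $\beta < \mu_A(x) < \alpha$. These three cases correspond to the three-way decisions of the above fuzzy set. When $\alpha = 1$ and $\beta = 0$, we obtain the qualitative three-way decisions of a fuzzy set. However, the qualitative decision model of a fuzzy set is very restrictive, and we generally do not select these two thresholds.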
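The trisection of a fuzzy set by a pair of thresholds can be sketched in a few lines of Python. This is only an illustration: the function name `trisect`, the dictionary representation of the membership function, and the sample values are assumptions, and the sketch uses the common convention that an object is accepted when its membership is at least the upper threshold and rejected when it is at most the lower threshold.

```python
def trisect(membership, alpha, beta):
    """Split objects into positive, boundary, and negative regions.

    membership: dict mapping each object to its membership degree in [0, 1].
    alpha, beta: upper and lower thresholds (assumed 0 <= beta < alpha <= 1).
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    pos = {x for x, mu in membership.items() if mu >= alpha}  # accept
    neg = {x for x, mu in membership.items() if mu <= beta}   # reject
    bnd = set(membership) - pos - neg                         # defer
    return pos, bnd, neg


# Hypothetical membership degrees of four objects.
membership = {"x1": 0.95, "x2": 0.60, "x3": 0.10, "x4": 0.35}
pos, bnd, neg = trisect(membership, alpha=0.8, beta=0.3)
print(sorted(pos), sorted(bnd), sorted(neg))
```

With $\alpha = 1$ and $\beta = 0$ the same function yields the qualitative model, in which every object with a membership strictly between 0 and 1 is deferred.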

3. Fuzzy Covering and Its Validity

This section focuses on the method of constructing a valid fuzzy covering of raw data and discusses the properties of the fuzzy covering. Let us first recall some concepts that help us better understand fuzzy covering.

Definition 2. (see [33, 34]). Let be a finite universe and be the fuzzy power set of . For each , we call with , a fuzzy -covering of , if for each . is called a fuzzy -covering approximation space. If for each , then is called a fuzzy covering of U. is called a fuzzy covering approximation space. for each , then is called a fuzzy partition of U. We call a fuzzy partition approximation space.

Definition 3. (see [35]). Let be a mapping . is called the degree of similarity between fuzzy sets and , if satisfies the following properties:(1)(2) = (3)if , then Some similarity measures are listed as follows:The fuzzy set in this paper is constructed by fuzzy similarity relation which satisfies the following properties. For any ,(1)(2)For a fuzzy similarity relation , , and , the membership of belonging to fuzzy set is denoted asObviously, if , it means that certainly belongs to . Conversely, if , it indicates that certainly does not belong to . is also called a fuzzy similarity class associated with on . Therefore, the set of fuzzy similarity classes constructed by relation is a fuzzy covering of universe .
In the following, we investigate the validity and related properties of the fuzzy covering of the raw dataset.
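As an illustration of how fuzzy similarity classes form a fuzzy covering, the following Python sketch builds one class per object from a reflexive and symmetric similarity relation. The Gaussian similarity, the parameter `sigma`, and the function names are assumptions chosen only for the example; the paper does not state that this particular relation is used.

```python
import math


def similarity(x, y, sigma=1.0):
    """Hypothetical Gaussian similarity: equals 1 when x == y and is symmetric."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))


def fuzzy_similarity_classes(data, sigma=1.0):
    """Row i is the fuzzy similarity class of object i, stored as its
    membership vector over the whole universe."""
    return [[similarity(x, y, sigma) for y in data] for x in data]


# Three hypothetical objects: two close together and one far away.
data = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
cover = fuzzy_similarity_classes(data)

# Every object belongs to its own class with degree 1, so the largest
# membership of each object over all classes is 1 (the covering condition).
print([max(row[j] for row in cover) for j in range(len(data))])
```

Because each class is a vector over the whole universe, similar objects produce similar rows, which is the property the validity condition is meant to check.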

Definition 4. Let be a universe. is the fuzzy similarity relation on , and is the similarity relation on . is a fuzzy covering of constructed by fuzzy similarity relation R. For any , is the set of similarity objects with . is defined as a valid fuzzy covering of with respect to , if the following condition holds:where .
It is easy to know that the value of depends on and and the choice of . is generally assigned no less than 0.8. The closer the is to 1, the more relation the expresses the structure of sample space. If is less than 0.5, the fuzzy covering of the universe is invalid. The fuzzy covering satisfies that similar objects in have corresponding similar fuzzy classes, so the fuzzy covering more fully reflects the original distribution of objects in .

Proposition 1. Let , then .

Proof. It can be easily verified by the definition.

Remark 1. Let and be two valid fuzzy coverings of with respect to the same . We choose fuzzy covering with a larger validity index as research data.

4. Three-Way Clustering

4.1. Rough Fuzzy C-Means Algorithm Based on Fuzzy Covering

In this section, we discuss the rough fuzzy C-means algorithm with fuzzy covering. The reason for clustering with fuzzy covering is that each fuzzy similarity class can reflect the relationship with the whole dataset, avoiding the disadvantage of excessive loss of clustering information with raw data.

The combination of fuzzy sets and rough sets provides an important direction for uncertain reasoning. Lingras [36] developed rough C-means (RCM) by combining the C-means clustering algorithm with rough set theory. In RCM, the new clustering center is related only to the positive region and the boundary region, unlike fuzzy C-means (FCM) [37], in which it is related to all objects. Since no membership is involved, RCM cannot effectively deal with the uncertainty caused by overlapping boundaries. To address this, Mitra et al. [25] proposed a rough fuzzy C-means (RFCM) algorithm, which combines the advantages of both fuzzy sets and rough sets in the framework of the C-means clustering algorithm. The innovation of RFCM is that, when dividing objects into approximation regions, the absolute distance is replaced with a fuzzy membership. This adjustment makes the clustering more robust in overlapping situations. Maji et al. [26] modified the calculation of the new clustering center in the RFCM model by assuming that the objects in the lower approximation have definite weights and the objects in the boundary have fuzzy weights. In what follows, we discuss the rough fuzzy C-means of fuzzy covering (FRFCM) algorithm, which is an RFCM algorithm based on a fuzzy covering of the universe.

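Suppose $\mathcal{C} = \{C_1, C_2, \ldots, C_n\}$ is a valid fuzzy covering of $U$. The cluster centers are denoted as $v_1, v_2, \ldots, v_c$. In the FRFCM algorithm, $\mathcal{C}$ is divided into $c$ clusters $G_1, G_2, \ldots, G_c$. The membership of $C_j$ to the cluster $G_i$ is

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}, \tag{6}$$

where $d_{ij}$ is the distance between $C_j$ and $v_i$, $i = 1, 2, \ldots, c$, and $j = 1, 2, \ldots, n$. The parameter $m$ is the fuzzifier, which is greater than 1.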
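The membership computation can be sketched as follows. The sketch assumes the standard fuzzy C-means membership rule with Euclidean distance; the function names, the handling of an object that coincides with a center, and the sample data are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fcm_memberships(points, centers, m=2.0):
    """Return u, where u[i][j] is the membership of point j in cluster i."""
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    power = 2.0 / (m - 1.0)
    u = [[0.0] * len(points) for _ in centers]
    for j, p in enumerate(points):
        dists = [euclidean(p, c) for c in centers]
        zero = [i for i, d in enumerate(dists) if d == 0.0]
        if zero:
            # A point lying exactly on a center belongs fully to that cluster
            # (shared equally if several centers coincide).
            for i in zero:
                u[i][j] = 1.0 / len(zero)
            continue
        for i, d_i in enumerate(dists):
            u[i][j] = 1.0 / sum((d_i / d_k) ** power for d_k in dists)
    return u


# Hypothetical one-dimensional data with two centers.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships(points, centers, m=2.0)
for row in u:
    print([round(value, 3) for value in row])
```

For each point the memberships over all clusters sum to 1, so a point that is far from every center is still shared among the clusters rather than rejected.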

A two-category dataset is taken to explain the influence of different parameters on classification. The membership degree of each object in each cluster can be considered a function of the relative distances and the fuzzifier parameter. Then, formula (6) translates to the following form:

$$u = \frac{1}{1 + r^{2/(m-1)}}, \tag{7}$$

where $r$ denotes the relative distance of an object with respect to one of the clusters.

The uncertainty caused by different values of the fuzzifier is illustrated in Figure 1.

It is easy to see that as the fuzzifier $m$ tends to 1, the memberships become crisper and the uncertainty of the system is reduced, which is suitable for three-way clustering. In this circumstance, only objects that are approximately equidistant from the cluster centers are assigned to boundary regions. In addition, $m$ should not be assigned a very large value: as $m$ increases, the memberships of objects around the center of a cluster are still assigned 1, while most objects are divided into the boundary region, which increases the uncertainty of the system and the error rate of decision-making. Furthermore, the positive region of a cluster may become empty.
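The effect of the fuzzifier in the two-cluster case can be checked numerically. The sketch below assumes the standard two-cluster reduction of the fuzzy C-means membership, in which the membership depends only on the relative distance `r = d1 / d2` and on the fuzzifier `m`; the function name and the sampled values are illustrative assumptions.

```python
def two_cluster_membership(r, m):
    """Membership in cluster 1 for relative distance r = d1 / d2 and fuzzifier m > 1."""
    return 1.0 / (1.0 + r ** (2.0 / (m - 1.0)))


relative_distances = (0.5, 0.9, 1.0, 1.1, 2.0)
for m in (1.1, 2.0, 10.0):
    row = [round(two_cluster_membership(r, m), 3) for r in relative_distances]
    print(f"m = {m}: {row}")
```

Running it shows that for a fuzzifier close to 1 the memberships are nearly 0 or 1 except very near `r = 1`, whereas for a large fuzzifier almost every membership is close to 0.5, so far more objects would fall between any fixed pair of thresholds.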

The center vectors are updated as follows:where and can be considered as the contributions to the center by the fuzzy lower region and fuzzy boundary region, respectively. denotes the boundary region of cluster , where and are the lower and upper approximations of cluster with respect to relation R, respectively. The weighted values and usually satisfy and . In this paper, we take and .
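A minimal sketch of such a center update is given below. It assumes the commonly used RFCM form, in which the lower-region and boundary-region contributions are membership-weighted means and are blended by two weights; the function names and the default weights `w_low=0.7` and `w_bnd=0.3` are hypothetical and are not taken from this paper.

```python
def weighted_mean(points, weights):
    """Weighted mean of equal-length tuples; weights must not all be zero."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total for d in range(dim)
    )


def update_center(lower, boundary, u_lower, u_boundary, m=2.0, w_low=0.7, w_bnd=0.3):
    """lower / boundary: points in the lower and boundary regions of one cluster.
    u_lower / u_boundary: their memberships in that cluster.
    w_low and w_bnd are assumed weights with w_low + w_bnd = 1."""
    low = weighted_mean(lower, [u ** m for u in u_lower]) if lower else None
    bnd = weighted_mean(boundary, [u ** m for u in u_boundary]) if boundary else None
    if low is not None and bnd is not None:
        return tuple(w_low * a + w_bnd * b for a, b in zip(low, bnd))
    if low is not None:
        return low   # empty boundary: only the lower region contributes
    if bnd is not None:
        return bnd   # empty lower region: only the boundary contributes
    raise ValueError("cluster has no objects in either region")


center = update_center(
    lower=[(0.0, 0.0), (2.0, 0.0)], boundary=[(5.0, 0.0)],
    u_lower=[1.0, 1.0], u_boundary=[0.5],
)
print(center)
```

Giving the lower region the larger weight keeps the center anchored to the objects that definitely belong to the cluster, while boundary objects only pull it slightly.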

The approximation regions are determined by the FRFCM algorithm with the following principles: if , where and , then , It also means . In this case, cannot be divided into the positive region of any clusters. Otherwise, and . Due to the particularity structure of the fuzzy covering of , the results of fuzzy covering clustering can well reflect the clustering results of the raw dataset through the above FRFCM algorithm.

4.2. Acquisition of Thresholds for Three-Way Clustering

In this section, we first review the shadowed set model for computing thresholds. Then, a novel method of calculating thresholds is proposed by combining the linear and nonlinear fuzzy entropy.

The FRFCM algorithm is an important tool for dealing with imprecise, incomplete, and inconsistent data. The thresholds in FRFCM, which determine the formation of the approximation regions, should be carefully selected. Unreasonable thresholds may distort the partition of the approximation regions and cause the clustering centers to deviate from the expected locations. Therefore, the partition thresholds should be computed in a principled way.

There are many methods to obtain the thresholds, and the most popular one is the shadowed set [38]. In fact, the shadowed set elevates and reduces membership degrees, which divides the domain of a fuzzy set into three regions. The corresponding membership function is as follows:

$$S_A(x) = \begin{cases} 1, & \mu_A(x) \ge \alpha, \\ [0, 1], & \beta < \mu_A(x) < \alpha, \\ 0, & \mu_A(x) \le \beta, \end{cases} \tag{9}$$

where $\mu_A(x)$ is the membership function of fuzzy set $A$.

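In the following study, only discrete fuzzy systems are considered, and similar models and conclusions can be obtained for continuous fuzzy systems. According to shadowed set theory, the optimal thresholds $\alpha$ and $\beta$ are obtained by minimizing the following objective:

$$V(\alpha, \beta) = \left| \sum_{\mu_A(x_i) \le \beta} \mu_A(x_i) + \sum_{\mu_A(x_i) \ge \alpha} \bigl(1 - \mu_A(x_i)\bigr) - \operatorname{card}\{x_i \mid \beta < \mu_A(x_i) < \alpha\} \right|. \tag{10}$$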
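The shadowed-set threshold search for a discrete fuzzy set can be sketched as a grid search. The objective below is the balance commonly used for shadowed sets (reduced memberships plus elevated memberships against the size of the shadow); the function names, the grid resolution, the restriction of the lower threshold to values below 0.5 and the upper threshold to values above 0.5, and the sample memberships are assumptions made for the example.

```python
def shadowed_objective(mu, alpha, beta):
    """Absolute imbalance between changed membership mass and shadow size."""
    reduced = sum(v for v in mu if v <= beta)          # memberships lowered to 0
    elevated = sum(1.0 - v for v in mu if v >= alpha)  # memberships raised to 1
    shadow = sum(1 for v in mu if beta < v < alpha)    # objects left in the shadow
    return abs(reduced + elevated - shadow)


def search_thresholds(mu, grid=100):
    """Exhaustive search over beta in [0, 0.5) and alpha in (0.5, 1]."""
    best = None
    for a in range(grid // 2 + 1, grid + 1):
        for b in range(0, grid // 2):
            alpha, beta = a / grid, b / grid
            score = shadowed_objective(mu, alpha, beta)
            if best is None or score < best[0]:
                best = (score, alpha, beta)
    return best


# Hypothetical membership degrees of eight objects in one cluster.
mu = [0.05, 0.10, 0.30, 0.45, 0.55, 0.70, 0.90, 0.95]
score, alpha, beta = search_thresholds(mu)
print(score, alpha, beta)
```

The search returns the first grid pair with the smallest imbalance, so ties are broken in favor of smaller thresholds; a finer grid or a different tie-breaking rule may give a different pair.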

However, the semantic interpretation of the threshold pairs obtained by the above method is not very clear. Because the shadowed set model cannot reasonably explain the relationship between the obtained shadowed set and the fuzziness of the original fuzzy set, further research is needed. Various methods for measuring uncertainty are described in the literature [39]. Fuzzy entropy is an important tool for measuring the uncertainty of a fuzzy set and meets the following requirements.

Definition 5. (see [40]). Let $A$ be a fuzzy set on the universe of discourse $U$. The fuzzy entropy of fuzzy set $A$ is the mapping $E: F(U) \to [0, 1]$, which satisfies the following four conditions:
(1) $E(A) = 0$ if $A$ is a crisp set
(2) $E(A) = 1$ if and only if $\mu_A(x) = 0.5$ for every $x \in U$
(3) For any $x \in U$, if $\mu_A(x) \le \mu_B(x) \le 0.5$ or $\mu_A(x) \ge \mu_B(x) \ge 0.5$, then $E(A) \le E(B)$
(4) $E(A) = E(A^c)$, where $A^c$ is the complement of $A$

It is easy to verify that if $\mu_A(x) = 0$ or $\mu_A(x) = 1$ for any $x \in U$, the value of the corresponding entropy function is 0, and the fuzzy entropy of the fuzzy set equals 0; i.e., the uncertainty of the fuzzy set is minimal. When $\mu_A(x) = 0.5$ holds for any $x \in U$, the value of the corresponding entropy function is 1, and the fuzzy set has maximum uncertainty. The commonly used linear and nonlinear fuzzy entropy functions are listed as follows [41–43]:

With the above fuzzy entropy functions, the corresponding fuzzy entropy of the fuzzy set can be easily obtained as follows:

The basic idea of calculating the thresholds by fuzzy entropy is to reduce to 0 the uncertainty of the memberships that are elevated or reduced in the shadowed set, while the memberships of objects corresponding to the middle part of the shadowed set are adjusted to maximal uncertainty; i.e., their fuzzy degree increases to 1. In what follows, we propose a flexible fuzzy entropy method, which combines the linear fuzzy entropy function and the nonlinear fuzzy entropy function to obtain the clustering thresholds. Then, the calculation model is as follows:

where the weighting parameter adjusts the impacts of the linear entropy and the nonlinear entropy.
In equation (13), when , only linear fuzzy entropy function is used to calculate the thresholds. If , only nonlinear fuzzy entropy function is used to calculate the thresholds. The smaller the value of , the more the influence brought from the linear fuzzy entropy, and vice versa. In the subsequent experiments of this study, we assign .
Figure 2 illustrates the increase and decrease in fuzzy degree of the fuzzy entropy function by taking the linear fuzzy entropy function , the nonlinear fuzzy entropy function , and the fuzzy entropy function which is combined by and with equal weight as examples.
It can be seen from Figure 2 that the curve of the flexible fuzzy entropy function lies between the curves of the linear and nonlinear entropy functions. Using flexible fuzzy entropy to obtain the thresholds prevents the uncertainty of a fuzzy set, as measured by linear or nonlinear fuzzy entropy alone, from being too small or too large, which would make the partition thresholds unreasonable.
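The idea of blending a linear and a nonlinear entropy function can be illustrated with a short Python sketch. The two component functions used here (a piecewise-linear function and the Shannon-type function of De Luca and Termini), the convex-combination form, and the names `lam` and `flexible_entropy` are assumptions chosen for illustration; they are not necessarily the functions used in this paper.

```python
import math


def linear_entropy(mu):
    """Piecewise-linear entropy: 0 at mu = 0 or 1, and 1 at mu = 0.5."""
    return 1.0 - abs(2.0 * mu - 1.0)


def shannon_entropy(mu):
    """Nonlinear (Shannon-type) entropy with the same end points and maximum."""
    if mu in (0.0, 1.0):
        return 0.0
    return -(mu * math.log2(mu) + (1.0 - mu) * math.log2(1.0 - mu))


def flexible_entropy(mu, lam=0.5):
    """Convex combination: lam = 0 is purely linear, lam = 1 purely nonlinear."""
    return (1.0 - lam) * linear_entropy(mu) + lam * shannon_entropy(mu)


def fuzzy_set_entropy(memberships, lam=0.5):
    """Average flexible entropy over the objects of a discrete fuzzy set."""
    return sum(flexible_entropy(mu, lam) for mu in memberships) / len(memberships)


for mu in (0.1, 0.25, 0.5, 0.75):
    print(mu, round(linear_entropy(mu), 3),
          round(flexible_entropy(mu), 3), round(shannon_entropy(mu), 3))
```

For every membership degree the blended value lies between the two component values, which mirrors the relationship between the three curves described for Figure 2.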
Thresholds used in RFCM and its related algorithms are usually user-defined. However, the threshold calculated by the above model can not only be interpreted from the change in fuzzy degree of fuzzy set but also be adjusted and optimized automatically.
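According to $\alpha_i$ and $\beta_i$, the positive, boundary, and negative regions of each cluster can be expressed as

$$\begin{aligned} \mathrm{POS}(C_i) &= \{x_j \mid u_{ij} \ge \alpha_i\}, \\ \mathrm{BND}(C_i) &= \{x_j \mid \beta_i < u_{ij} < \alpha_i\}, \\ \mathrm{NEG}(C_i) &= \{x_j \mid u_{ij} \le \beta_i\}, \end{aligned} \tag{14}$$

where $u_{ij}$ is the membership degree of the $j$th object belonging to the $i$th class.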

4.3. Boundary Region Processing of Three-Way Clustering Based on kNN Algorithm

Following the above discussion on automatically selecting the optimal partition thresholds based on fuzzy entropy theory, this section will present the object processing in the boundary regions of three-way clustering.

In three-way clustering, the boundary region objects are rarely processed further. The k-nearest neighbor (kNN) algorithm [44] is a well-known nonparametric classifier, which is considered one of the simplest methods in data mining and pattern recognition. The principle of the kNN algorithm is to find the k nearest neighbors of a query in the dataset and then predict the query with the majority class among those k nearest neighbors. In this paper, the kNN algorithm is used to process the objects in the boundary regions. If no positive region is found for an object, it remains in the boundary region. Therefore, the uncertainty of the boundary region decreases as the number of objects in it decreases, and reclassifying the objects in the boundary region can improve the accuracy of the three-way clustering.

The details of updating the boundary region with the kNN algorithm are as follows.

Because the kNN algorithm relies mainly on a limited number of adjacent objects for classification, it is more suitable than other methods when class domains overlap or when the objects to be classified lie in the boundary region. Therefore, Algorithm 1 can handle the uncertainty arising from the boundary region. Of course, processing the boundary region with the k-nearest neighbor algorithm adds extra computational burden and also carries the risk of misclassifying objects.

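Algorithm 1: Updating the boundary region with the kNN algorithm.
Input: a set of objects, the cluster centers, the positive regions, the boundary region, and the optimal value of k.
Output: the updated positive regions and boundary region.
Step 1: for an object in the boundary region, calculate the distance between it and the other objects;
Step 2: find the regions in which the k objects with the smallest distances are located;
Step 3: count how many of these k objects lie in the positive region of each class and how many lie in the boundary region. If there is only one cluster whose positive region contains the largest number of them, then the object is moved from the boundary region to the positive region of that cluster; else it remains in the boundary region;
Step 4: repeat Steps 1–3 until all boundary objects have been processed.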
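The boundary update can be sketched in Python as follows. This is an interpretation rather than the paper's exact procedure: it assumes Euclidean distance, integer object identifiers, votes cast only by neighbors that already lie in a positive region, and a move to a positive region only when exactly one cluster receives the most votes. The function name `update_boundary` and the sample data are hypothetical.

```python
import math
from collections import Counter


def update_boundary(points, positive, boundary, k=7):
    """points: list of coordinate tuples indexed by object id.
    positive: dict mapping cluster id -> set of object ids in its positive region.
    boundary: set of object ids currently in the boundary region."""
    label = {i: c for c, ids in positive.items() for i in ids}
    new_pos = {c: set(ids) for c, ids in positive.items()}
    new_bnd = set(boundary)
    for x in sorted(boundary):
        # Steps 1 and 2: the k nearest other objects.
        others = [j for j in range(len(points)) if j != x]
        others.sort(key=lambda j: math.dist(points[x], points[j]))
        # Step 3: votes from neighbors located in a positive region.
        votes = Counter(label[j] for j in others[:k] if j in label)
        ranked = votes.most_common(2)
        unique_winner = bool(ranked) and (
            len(ranked) == 1 or ranked[0][1] > ranked[1][1]
        )
        if unique_winner:
            new_pos[ranked[0][0]].add(x)
            new_bnd.discard(x)
    return new_pos, new_bnd


# Hypothetical one-dimensional data: two tight groups and two boundary objects.
points = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,), (0.6,), (2.7,)]
positive = {0: {0, 1, 2}, 1: {3, 4, 5}}
new_pos, new_bnd = update_boundary(points, positive, boundary={6, 7}, k=3)
print(new_pos, new_bnd)
```

Votes are taken from the original positive regions only, so the outcome does not depend on the order in which boundary objects are visited; an object whose neighbors are split evenly stays in the boundary region.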

In what follows, based on the valid fuzzy covering, FRFCM, and kNN algorithms, we propose a three-way clustering algorithm called the kNN-FRFCM algorithm, which is given in Algorithm 2.

Algorithm 2: The kNN-FRFCM algorithm.
Input: the valid fuzzy covering of the universe, the cluster centers, and the initial fuzzy membership degrees;
Output: the positive, boundary, and negative regions of each cluster, respectively.
Step 1: compute the optimal partition thresholds $\alpha$ and $\beta$ for each cluster using formula (13);
Step 2: according to formula (14), determine the positive region, boundary region, and negative region of each cluster from $\alpha$, $\beta$, and the fuzzy partition matrix;
Step 3: update each clustering region by Algorithm 1;
Step 4: update the membership partition matrix by formula (6);
Step 5: update the cluster centers with formula (8);
repeat Step 1 to Step 5 until convergence is reached;
Step 6: replace the results of fuzzy covering clustering by the corresponding objects in the universe.

Thus, according to Algorithm 2, we obtain three-way clustering results of the original dataset by using the valid fuzzy covering.

5. Experiment Analysis

The three-way clustering method based on fuzzy covering proposed in this paper is suitable for datasets with few objects and dimensions, or for datasets whose number of objects is comparable to the number of dimensions. Otherwise, because each fuzzy similarity class is described by its memberships over the whole universe, a fuzzy covering constructed from a dataset with many objects and few dimensions suffers from the curse of dimensionality. In this paper, six datasets from the UCI Machine Learning Repository [45] are used for the empirical study: Iris, Breast Cancer Wisconsin (Original) (BCWO) with the missing data removed, New thyroid, Seeds, Forest-type mapping (FTM), and CT. On these datasets and their corresponding fuzzy coverings, the results of the clustering methods FCM, RCM, RFCM, kNN-RCM, and kNN-RFCM are compared. To distinguish the results obtained on the raw dataset from those obtained on the fuzzy covering with the same algorithm, the clustering algorithms applied to the fuzzy covering are denoted FFCM, FRCM, FRFCM, kNN-FRCM, and kNN-FRFCM, respectively. Details of the six datasets are described in Table 1.

The partition threshold related to RCM and its related algorithms is set to 0.001. The two parameters involved in the fuzzy covering are set to 0.8 and 0.9, respectively. The value of k in the kNN algorithm is set to 7, and the evaluation indexes normalized mutual information (NMI) [47], ACC [48], and rand index (RI) [49] are used to investigate the validity of the algorithm. Furthermore, the reasonable values of the fuzzifier involved in all compared algorithms are greater than 1. Two values of the fuzzifier are selected, and the experimental comparison results are listed in Tables 2–7.
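For reference, the rand index can be computed directly from two label vectors, as in the following Python sketch. It assumes that every object carries exactly one label in both the ground truth and the clustering result; how boundary-region objects are labeled before evaluation is not specified here, and the function name is an assumption.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree, i.e. pairs
    placed together in both or apart in both. Needs at least two objects."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        pairs += 1
    return agree / pairs


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index depends only on which objects are grouped together, so renaming the clusters does not change its value.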

From Tables 2–7, it can be concluded that the selected fuzzy parameters have a significant impact on the performance of all compared algorithms when dealing with the same dataset. Since the boundary region is the main cause of system uncertainty, overly large boundary regions are undesirable for three-way clustering, and we need to pay attention to the uncertainty caused by the fuzzifier when implementing the algorithms. Moreover, the clustering results show that the kNN-FRFCM algorithm performs better than the other algorithms in most cases. This is mainly because it reduces the uncertainty of the system by reprocessing the objects in the boundary regions. The results also show that clustering based on fuzzy covering is mostly better than clustering with raw data. Therefore, a valid fuzzy covering can replace the raw dataset for clustering and, in most of the cases tested, gives better results. The premise for this replacement is that an appropriate fuzzy similarity relation is selected [46].

6. Conclusions

In this paper, a valid fuzzy covering of the raw dataset is constructed according to some principles. Because the similarity between fuzzy similarity classes in the valid fuzzy covering can be used to measure the similarity between objects in the raw dataset, and each fuzzy similarity class reflects the connection with the whole dataset, using the valid fuzzy covering instead of the raw data for clustering can improve the precision of clustering. From the perspective of the semantic explanation of uncertainty change in fuzzy sets, we investigate a method that combines linear fuzzy entropy with nonlinear fuzzy entropy to obtain decision threshold pairs. The threshold calculation method in this paper obtains the classification thresholds objectively from the intrinsic relations of the objects, uses a formula that is simple and easy to understand, and avoids inappropriate subjective assignment. Additionally, the objects in the boundary region obtained by the FRFCM algorithm are reprocessed by the kNN algorithm to reduce the uncertainty of the system.

In future work, we will continue to investigate methods of threshold acquisition and boundary region processing for three-way clustering following the idea of this paper. Three-way clustering in incremental information systems is another direction for future research.

Data Availability

The experimental data supporting the findings of this study are available on the website provided in this article.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the Science Research Project of Inner Mongolia University for Nationalities with the title “Research on three-way clustering methods of preference linguistic data” (no. NMDYB18030) and Natural Science Foundation of Inner Mongolia Autonomous Region (nos. 2018MS01008 and 2020MS07008).