Abstract

This study focuses on clustering high-dimensional text data, motivated by the inability of K-means to process high-dimensional data effectively and by its need to specify the number of clusters and to randomly select the initial cluster centers. We propose a Stacked-Random Projection dimensionality reduction framework and an enhanced K-means algorithm, DPC-K-means, based on an improved density peaks algorithm. The improved density peaks algorithm determines the number of clusters and the initial clustering centers for K-means. The proposed algorithm is validated on seven text datasets. Experimental results show that, by correcting these defects of K-means, the algorithm is well suited to clustering text data.

1. Introduction

Clustering is the main technique used for unsupervised information extraction. The aim of clustering is to divide an unlabelled dataset into multiple nonoverlapping clusters, making the data points within a cluster as similar as possible and the data points between clusters as different as possible. In text clustering, text vectors are characterized by high dimensionality, sparsity, and correlation among dimensions, which requires improvements to clustering algorithms so that they can process high-dimensional text [1, 2].

When the K-means method is used to process high-dimensional data, the “Curse of Dimensionality” [3] becomes prominent, and feature redundancy also increases. Consequently, the conventional clustering method cannot process the data accurately. Some research [4–9] has proposed improvements to text clustering algorithms, and some studies [10, 11] have proposed improvements to the K-means algorithm. To apply K-means, it is necessary to specify the number of clusters in advance and to randomly select the initial clustering centers. The clustering result is greatly influenced by the selection of the initial center points: improper selection can easily trap the clustering result in a local optimum and lead to inaccurate clustering.

In recognition of these problems, we propose an enhanced K-means text clustering algorithm based on the clustering by fast search and find of density peaks (DPC) algorithm [12]. Since text data is usually high-dimensional and sparse, we also propose a deep random projection dimensionality reduction framework, named Stacked-Random Projection (SRP), with a greedy layer-wise architecture. We first use this dimensionality reduction method to reduce the dimension of the high-dimensional text feature vectors. We then use the improved density peaks algorithm to determine the number of clusters and the initial clustering centers, after which the K-means algorithm is used for clustering.

The organization of this paper is as follows. The proposed methodology is discussed in Methods. In Experiments and Discussion, experimental results are explained. Finally, Conclusions concludes the paper and highlights future work related to the study.

2. Methods

2.1. Stacked-Random Projection

The basic idea of random projection is to choose a random hyperplane to map the original variables into a low-dimensional space. In 1984, Johnson and Lindenstrauss proved a theorem nowadays termed the Johnson-Lindenstrauss (JL) lemma [13]. The JL lemma is the theoretical basis of random projection; it guarantees that the subspace errors generated by random projection are controllable. The JL lemma states that for any $0 < \varepsilon < 1$ and any integer $n$, let $k$ be a positive integer such that

$$k \ge \frac{4 \ln n}{\varepsilon^2/2 - \varepsilon^3/3}. \tag{1}$$

Then, for any $n$-point set $V$ in $\mathbb{R}^d$, there is a map $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$ such that for all $u, v \in V$,

$$(1-\varepsilon)\,\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\varepsilon)\,\|u-v\|^2. \tag{2}$$

This indicates that by using random projection, the original high-dimensional data can be reduced to low-dimensional data while the pairwise distances between the original data points are approximately preserved with high probability. Zhang et al. [14] proposed a random projection ensemble approach and applied it to the prediction of drug-target interactions. Gondara [15] also proposed an ensemble random projection, in which the random projection matrix is applied to different subsets of the original dataset, and which can achieve greater classification accuracy than the random forest and AdaBoost methods.

According to the Johnson-Lindenstrauss lemma, the minimum size of the target dimension after dimensionality reduction that guarantees the embedding is given by Equation (3):

$$k \ge \frac{4 \ln n}{\varepsilon^2/2 - \varepsilon^3/3}, \tag{3}$$

where $n$ is the number of samples and $\varepsilon$ is the allowed distortion.

For example, with $n = 2{,}000$ samples, at least 6,515 dimensions are required to project the data without too much distortion ($\varepsilon = 0.1$). Thousands of dimensions are still high-dimensional for subsequent steps such as classification or clustering. Inspired by the stacked Auto-Encoder, we propose a deep random projection framework, named Stacked-Random Projection (SRP), which incorporates random projection as its core stacking element. The SRP framework with $L$ layers uses the input data as the first layer, and the output of the $(l-1)$-th layer is taken as the input of the $l$-th layer. In this way, a group of random projections can be combined layer by layer into a stack.
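To make the layer-wise construction concrete, the following sketch chains scikit-learn random projections so that each layer's output becomes the next layer's input. The layer dimensions and the synthetic sparse matrix are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np
from scipy import sparse
from sklearn.random_projection import (SparseRandomProjection,
                                        johnson_lindenstrauss_min_dim)

def stacked_random_projection(X, layer_dims, random_state=0):
    """Sketch of SRP: apply random projections layer by layer.

    layer_dims lists the target dimension of each layer; the output of
    layer l-1 is used as the input of layer l.
    """
    Z = X
    for i, dim in enumerate(layer_dims):
        rp = SparseRandomProjection(n_components=dim, random_state=random_state + i)
        Z = rp.fit_transform(Z)
    return Z

# Illustrative example: 2,000 samples with 130,107 TF-IDF features,
# as in the 20-newsgroups example; JL minimum dimension for eps = 0.1.
print(johnson_lindenstrauss_min_dim(2000, eps=0.1))   # 6515
X = sparse.random(2000, 130107, density=0.001, format="csr", random_state=0)
X_low = stacked_random_projection(X, layer_dims=[6515, 2000, 500, 100])
print(X_low.shape)                                     # (2000, 100)
```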

The main idea of the SRP dimensionality reduction method for high-dimensional text feature vectors can be illustrated using the 20-newsgroups dataset as an example (further details are provided in Experiments and Discussion). First, the dataset is subjected to tokenization, stop-words removal, and TF-IDF weighting in order to obtain the high-dimensional sparse text vector space (the feature dimension of the 20-newsgroups dataset was found to be 130,107). Then, a 4-layer SRP is constructed; this process is shown in Figure 1. Thus, the dimensionality reduction from high dimensionality to low dimensionality is completed. An illustration of the proposed SRP is provided in Figure 2.

2.2. Improved DPC

The DPC algorithm is a granular computing model based on two assumptions: (1) a clustering center is surrounded by neighbouring data points with lower local density; (2) the distance between any clustering center and data points with higher density is relatively large. In recent years, DPC has been applied in many fields, particularly natural language processing, due to its simple procedure and its effectiveness. The DPC algorithm can cluster data of different dimensions and shapes. Many researchers have studied DPC and proposed improved algorithms; the main optimization aspects are speed [16], accuracy [17–19], and other aspects [20, 21]. Heimerl et al. [22] applied the DPC algorithm in high-dimensional space to estimate the optimal number of clusters for a given set of documents and assigned stability to one of the peaks based on the density structure of the data; however, the resulting computing speed of the DPC algorithm in high-dimensional space was slow. Wang et al. [23] used DPC to measure the hierarchical relevance and diversity of sentences and selected highly representative sentences to generate news summaries. However, they reported that if there are multiple peaks among the sentences, the selected key sentences become redundant.

For any point $x_i$, two properties are required: the local density $\rho_i$ and the relative distance $\delta_i$. The calculation of these two attributes depends on the distance $d_{ij}$ between any two points $x_i$ and $x_j$. The two attributes are defined as follows:

Definition 1. Local density $\rho_i$ (Gaussian kernel):

$$\rho_i = \sum_{j \ne i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right), \tag{4}$$

where $d_{ij}$ is the Euclidean distance between $x_i$ and $x_j$, and $d_c$ is the cut-off distance; both are important parameters for calculating $\rho_i$. One recommended practice is to select $d_c$ so that the average number of nearest neighbours of each point is 1%~2% of the total dataset size. As can be seen in Equation (4), the more points contained within $d_c$ of $x_i$, the greater the local density $\rho_i$.

Among text clustering methods, K-means based on cosine similarity is still the most widely used text clustering algorithm due to its simplicity and fast convergence [24]. For text vectors, cosine similarity works better than Euclidean distance. Euclidean distance is a direct measure of the linear interval or length between vectors and reflects the absolute difference in dimension values. Cosine similarity describes the similarity between vectors using the cosine of the angle between them, that is, their direction, and pays more attention to the relative differences between dimensions. In text similarity analysis, one indicator of similarity is the co-occurrence of the same words, which translates into nonzero values in the same dimensions. We therefore redefine Definition 1 in terms of cosine similarity.

Definition 2. Local density based on cosine similarity (Gaussian kernel):
For any two vectors $x_i$ and $x_j$ in the space, the cosine similarity is defined as the cosine of the angle between the two vectors:

$$\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|}, \tag{5}$$

and the local density is then

$$\rho_i = \sum_{j \ne i} \exp\!\left(-\frac{\bigl(1-\cos(x_i, x_j)\bigr)^2}{d_c^2}\right), \tag{6}$$

where $\cos(x_i, x_j)$ is the cosine similarity between $x_i$ and $x_j$, and $d_c$ is the cut-off distance, which must be set manually so that the average number of nearest neighbours of each sample is approximately 1%~2% of the size of the entire dataset. As can be seen in Equation (6), the more points contained within $d_c$ of $x_i$, the greater the local density $\rho_i$.
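As a sketch of Definition 2 (not the authors' reference implementation), the cosine-based local density can be computed as follows; treating 1 − cosine similarity as the distance inside the Gaussian kernel is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def local_density_cosine(X, dc):
    """Gaussian-kernel local density with cosine distance (1 - cosine similarity).

    dc is the cut-off distance, chosen so that the average number of
    neighbours per point is roughly 1%-2% of the dataset size.
    """
    d = 1.0 - cosine_similarity(X)      # pairwise cosine distances
    np.fill_diagonal(d, np.inf)         # exclude each point from its own sum
    return np.exp(-(d / dc) ** 2).sum(axis=1)
```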

Definition 3. Relative distance $\delta_i$:

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} d_{ij}, & \text{if } \exists\, j \text{ such that } \rho_j > \rho_i,\\[4pt] \max\limits_{j} d_{ij}, & \text{otherwise,} \end{cases} \tag{7}$$

where $d_{ij} = 1 - \cos(x_i, x_j)$ is the cosine distance.

Equation (7) indicates that the relative distance $\delta_i$ is obtained by calculating the minimum cosine distance from data point $x_i$ to any point with a higher density; for the point of highest density, $\delta_i$ is its maximum distance to any other point. After calculating the two parameters, a decision graph with $\rho$ as the horizontal axis and $\delta$ as the vertical axis can be constructed. The decision graph divides the data points into three different types, namely, density peak points, normal points, and outlier points. As shown in Figure 3, the data points are arranged in order of decreasing density. Five points stand out, spread towards the upper right corner of the decision graph, with both high $\rho$ values and high $\delta$ values. These five points indicate that there are no data points with higher density than them within a larger area. Therefore, these five points are the density peak points, and they make suitable clustering centers. In order to better verify the accuracy of the clustering center points in the decision graph, DPC defines another variable $\gamma_i$: a clustering center point has both a large $\rho_i$ value and a large $\delta_i$ value, and hence a higher $\gamma_i$ value. Our analysis of the decision graph shows that $\rho$ and $\delta$ are of two different orders of magnitude. To avoid the influence of the different orders of magnitude, it is necessary to normalize them:

$$\rho_i' = \frac{\rho_i - \min_j \rho_j}{\max_j \rho_j - \min_j \rho_j}, \tag{8}$$

$$\delta_i' = \frac{\delta_i - \min_j \delta_j}{\max_j \delta_j - \min_j \delta_j}, \tag{9}$$

$$\gamma_i = \rho_i' \, \delta_i'. \tag{10}$$

The $\gamma$ values computed by Equation (10) are plotted in Figures 4 and 5. Figure 4 verifies the correctness of the clustering centers in the decision graph shown in Figure 3. Figure 5 plots the $\gamma$ values in descending order. The points at the clustering centers have much larger $\gamma$ values, while the noncenter points have smaller $\gamma$ values and the curve tends to flatten. It can be concluded from the $\gamma$ values that there are five clustering centers.
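The following sketch, under the same cosine-distance assumption as above and with min-max normalization for Equations (8)–(10), shows how the relative distance, the normalized γ values, and candidate centers could be computed; the helper local_density_cosine is the hypothetical function sketched earlier.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def relative_distance(X, rho):
    """Relative distance delta_i of Equation (7) over cosine distances."""
    d = 1.0 - cosine_similarity(X)
    n = d.shape[0]
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]          # points with larger density
        delta[i] = d[i, higher].min() if higher.size else d[i].max()
    return delta

def gamma_scores(rho, delta):
    """Normalized gamma of Equation (10); larger values indicate cluster centers."""
    norm = lambda v: (v - v.min()) / (v.max() - v.min())
    return norm(rho) * norm(delta)

# centers = np.argsort(-gamma_scores(rho, delta))[:k]   # top-k gamma values
```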

2.3. DPC-K-means

The K-means clustering algorithm cannot extract data features effectively when processing high-dimensional data directly, and problems also arise because it randomly selects the initial clustering centers and requires the number of clusters to be specified in advance. These problems have been researched in numerous papers over recent decades, as discussed elsewhere [25–27]. Therefore, we propose an improved method using the DPC algorithm.

We first use SRP or random projection to reduce the dimensionality of the high-dimensional text data and then combine it with the improved DPC algorithm. The choice between SRP and plain random projection depends on whether the feature vector dimension is greater than the minimum target dimension calculated according to Formula (3). If the feature vector dimension is greater than this minimum target dimension, the SRP dimension reduction framework is applied; if it is less than or equal to the target dimension, random projection is used directly. Using the cosine-similarity-based calculation of $\rho$ and $\delta$, we select points with high local density that are far apart from each other as the clustering centers; by doing so, the initial clustering centers and the number of clusters are obtained automatically. This yields the improved clustering algorithm, which we name DPC-K-means. The improved algorithm is described below:

Input: text feature vector $X$ with feature dimension $m$; $k$, the minimum size of the target dimension.
Output: the clustering results.
Begin:
Step1: determine whether $m$ is greater than $k$ calculated according to Formula (3). If $m$ is greater than $k$, use the SRP dimension reduction framework in Step2; if $m$ is less than or equal to $k$, use random projection in Step2.
Step2: use the SRP dimension reduction framework to reduce the dimensionality of $X$ layer by layer until the reduced matrix $X'$ is obtained, or directly use random projection to obtain the matrix $X'$.
Step3: calculate the $\rho$ and $\delta$ values of $X'$ according to Equations (6) and (7) and plot the decision graph with $\rho$ and $\delta$ as axes.
Step4: calculate the $\gamma$ values according to Equation (10) to verify the clustering centers and the number of clusters.
Step5: perform K-means clustering: the clustering centers obtained in Step4 are used as the initial cluster centers, and the number of clusters is used as the $K$ value for K-means clustering.
Algorithm 1: The DPC-K-means.
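A minimal end-to-end sketch of Steps 1–5 is given below, assuming the hypothetical helpers stacked_random_projection, local_density_cosine, relative_distance, and gamma_scores sketched earlier; the layer dimensions and the automatic center cut-off are illustrative choices, not values prescribed by the paper. Scikit-learn's KMeans is seeded with the density-peak centers instead of random initialization.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.random_projection import johnson_lindenstrauss_min_dim

def dpc_kmeans(X, dc, n_centers=None, eps=0.1):
    """Sketch of DPC-K-means: density peaks supply K and the initial centers."""
    k_min = johnson_lindenstrauss_min_dim(X.shape[0], eps=eps)      # Step1
    if X.shape[1] > k_min:                                          # Step2: SRP
        X = stacked_random_projection(X, layer_dims=[k_min, 500, 100])
    else:                                                           # Step2: plain RP
        X = stacked_random_projection(X, layer_dims=[100])
    if sparse.issparse(X):
        X = X.toarray()                      # dense array for the density step
    rho = local_density_cosine(X, dc)                               # Step3
    delta = relative_distance(X, rho)
    gamma = gamma_scores(rho, delta)                                # Step4
    if n_centers is None:
        # in the paper K is read off the decision graph / gamma plot;
        # this automatic cut is used here purely for illustration
        n_centers = max(1, int((gamma > gamma.mean() + 3 * gamma.std()).sum()))
    centers = X[np.argsort(-gamma)[:n_centers]]
    km = KMeans(n_clusters=n_centers, init=centers, n_init=1)       # Step5
    return km.fit_predict(X)
```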

Suppose the input data contains $n$ samples with original dimension $d$, and SRP or random projection reduces the data to a low-dimensional space of dimension $m$. The time complexity analysis of the DPC-K-means algorithm is as follows:
(1) The time complexity of a single random projection in Step2 is $O(ndm)$. The time complexity of Stacked-Random Projection is $O(ndm_1 + nm_1m_2 + \cdots + nm_{L-1}m)$, where $m_1$ is the target dimension of the second layer and $m_{L-1}$ is the target dimension of the penultimate layer.
(2) The time complexity of Step3, calculating $\rho$ and $\delta$, is $O(n^2)$.
(3) The time complexity of Step4, calculating $\gamma$ and sorting it in descending order, is $O(n \log n)$.
(4) The time complexity of Step5, K-means with specified cluster centers and number of clusters, is $O(tKnm)$, where $K$ is the number of clusters and $t$ is the number of iterations.

The total time complexity of the DPC-K-means algorithm is therefore $O(ndm + n^2)$.

Figure 6 shows the overall structure of the proposed method. Table 1 shows the time complexity of several clustering algorithms.

3. Experiments and Discussion

3.1. Summarization Datasets

Experimental work was conducted on seven standard text datasets. A summary of the datasets is presented in Table 2, and the datasets are described as follows. The features were obtained by tokenization, stop-words removal, and TF-IDF weighting.

The BBC news dataset (http://mlg.ucd.ie/datasets/bbc.html.) has a total of 2,225 text files on five topical areas published on the BBC news website. Text documents were arranged into folders containing five labels: business, entertainment, politics, sports, and technology.

The 20-newsgroups dataset (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) comprises approximately 20k newsgroup documents partitioned evenly across 20 different newsgroups. We selected 1k documents and 4~8 newsgroups (4 groups~8 groups) for our experimental datasets.

The Sports Article dataset (http://archive.ics.uci.edu/ml/datasets.php) was labelled using Amazon Mechanical Turk as objective or subjective.

The Asian Religious dataset (http://archive.ics.uci.edu/ml/datasets.php) consists of the bag-of-words features obtained by preprocessing a mini-corpus made up of eight religious books.

The CNAE-9 dataset (http://archive.ics.uci.edu/ml/datasets.php) contains 1,080 documents of free text business descriptions of Brazilian companies which were categorized into a subset of nine categories.

The Stack Overflow dataset (http://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/download/train.zip) is challenge data published on http://Kaggle.com/. The dataset consists of 3,370,538 samples dated from July 31, 2012, to August 14, 2012. In our experiments, we randomly selected 167 question titles from 4 different tags.

The Amazon dataset (http://archive.ics.uci.edu/ml/datasets.php) consists of product reviews extracted from websites and labelled as positive or negative.

3.2. Simulation Environments

The simulation environments for all algorithms performed in our experiments were as follows: the Python 3.7 software environment running with Intel i7-7500U CPU, 2.70GHz with 8GB RAM.

3.3. Experiment 1

According to Formula (3), the minimum sizes of the target dimension ($k$) of the BBC and 20-newsgroups datasets are 6,609 and 5,920, respectively. According to the flowchart in Figure 6, the feature vector dimensions of the two datasets are larger than the minimum size of the target dimension, so SRP was used to reduce the dimensionality of these two datasets. We compared the dimensionality reduction performance of Principal Component Analysis (PCA), Multiple Dimensional Scaling (MDS), Random Projection (RP), and Stacked-Random Projection (SRP). To compare these dimensionality reduction methods fairly, we reduced the feature vectors of the BBC news dataset and the 20-newsgroups dataset to 2k, 500, and 100 dimensions. Table 3 shows the run time (time), the mean ratio of distances (projected/original, ratio), and the standard deviation of the ratio of distances (projected/original, standard deviation). The mean ratio of distances measures the degree to which the distances between the original data points are maintained in the low-dimensional space after the original high-dimensional data is reduced to low-dimensional data; values close to 1 indicate better preservation. The smaller the standard deviation of the ratio of distances, the more tightly the ratios cluster around the mean. As shown in Table 3, RP and SRP considerably shorten the run time of dimension reduction compared with PCA and MDS. There is little difference in the distribution of the distortion between SRP and RP for high values of the target dimension, but for low values of the target dimension, the distortion is better controlled and the distances are better preserved by SRP. Text data is usually high-dimensional, small-sample data, in which the number of dimensions is much larger than the number of samples. SRP is suitable for dimensionality reduction of this type of data: it significantly reduces the running time of dimensionality reduction while preserving distances well.
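As an illustration of the preservation metric reported in Table 3 (a sketch, not the authors' evaluation script), the mean and standard deviation of the pairwise-distance ratios can be computed as follows.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def distance_ratio_stats(X_original, X_projected):
    """Mean and std of pairwise-distance ratios (projected / original).

    Ratios close to 1 with a small standard deviation mean the projection
    preserves the original geometry well. Duplicate points (zero original
    distance) are excluded to avoid division by zero.
    """
    d_orig = euclidean_distances(X_original)
    d_proj = euclidean_distances(X_projected)
    mask = ~np.eye(d_orig.shape[0], dtype=bool) & (d_orig > 0)
    ratios = d_proj[mask] / d_orig[mask]
    return ratios.mean(), ratios.std()
```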

3.4. Experiment 2

Since DPC is a clustering algorithm, we used the Euclidean distance and cosine similarity to calculate the DPC local density and compared the two approaches using clustering performance metrics. According to Formula (3), the minimum sizes of the target dimension ($k$) of the BBC and 20-newsgroups datasets are 6,609 and 5,920, respectively. According to the flowchart in Figure 6, SRP was used to reduce the dimensionality of these two datasets to 100 dimensions. The minimum size of the target dimension of the Sports Article, CNAE-9, and Stack Overflow datasets is larger than the feature vector dimension, so their dimensionality was reduced to 100 dimensions by random projection. The feature dimensions of the Asian Religious and Amazon datasets are ≤100, so no dimension reduction was needed in this experiment. To compare the two methods’ performance, we used four cluster evaluation metrics—ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), FMI (Fowlkes-Mallows Index), and Clusters (the number of clusters)—to evaluate the performance of the clustering algorithm. ARI, NMI, and FMI all measure the consistency between clustering results and the real category labels, with value ranges of [-1,1], [0,1], and [0,1], respectively. The higher these three metrics, the better the clustering quality and the more consistent the clustering results are with the real categories. Clusters is the number of clusters obtained by DPC. By comparing with Table 2, we can determine which of the two measures, Euclidean distance or cosine similarity, yields accurate clustering. Table 4 shows the clustering performance with local density calculated by Euclidean distance (Euclidean) and cosine similarity (Cosine).
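For reference, the three external metrics can be computed with scikit-learn as in the short sketch below; the label arrays are hypothetical.

```python
from sklearn.metrics import (adjusted_rand_score,
                             fowlkes_mallows_score,
                             normalized_mutual_info_score)

# Hypothetical ground-truth labels and clustering assignments.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]

ari = adjusted_rand_score(true_labels, pred_labels)           # range [-1, 1]
nmi = normalized_mutual_info_score(true_labels, pred_labels)  # range [0, 1]
fmi = fowlkes_mallows_score(true_labels, pred_labels)         # range [0, 1]
print(ari, nmi, fmi)  # all 1.0 here: the partition matches up to relabelling
```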

To further judge the clustering performance of the method proposed in this paper, a paired t-test was used to test the significance of the clustering results. A paired t-test determines whether there is a significant difference between two paired samples. The Euclidean distance and cosine similarity were used to calculate the local density of DPC, and the cluster evaluation metrics were tested. The p value gives the probability of observing the test results under the null hypothesis. The confidence level is 95%, and the cut-off value of p is 0.05; if p < 0.05, the clustering results of the proposed algorithm and the comparison algorithm are significantly different. If p ≥ 0.05, there is no significant difference between the clustering results of the proposed algorithm and the comparison algorithm. Table 5 shows the paired t-test results for each evaluation metric of the Euclidean distance (Euclidean) and cosine similarity (Cosine) in Table 4.
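A paired t-test of this kind can be run with SciPy as sketched below; the per-dataset FMI values are hypothetical placeholders, not results from Table 4.

```python
from scipy import stats

# Hypothetical FMI scores of the two local-density variants on seven datasets,
# paired by dataset; ttest_rel performs the paired t-test at the 95% level.
fmi_euclidean = [0.61, 0.55, 0.48, 0.70, 0.52, 0.66, 0.58]
fmi_cosine    = [0.72, 0.63, 0.57, 0.74, 0.60, 0.71, 0.65]

t_stat, p_value = stats.ttest_rel(fmi_cosine, fmi_euclidean)
print(t_stat, p_value, p_value < 0.05)   # significant if p < 0.05
```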

As shown in Table 5, there is a significant difference in FMI between the Euclidean distance and cosine similarity, and no significant difference in ARI and NMI. As can be seen from the numbers of clusters in Tables 2 and 4, the improved DPC with local density calculated using cosine similarity can accurately determine the number of clusters. Figure 7 shows the decision graph and the $\gamma$ values of the BBC dataset following dimensionality reduction by SRP. Figure 8 shows the decision graph and the $\gamma$ values of the four newsgroups in the 20-newsgroups dataset following dimensionality reduction by SRP. Figure 9 shows the decision graph and the $\gamma$ values of the Amazon dataset. Figure 10 shows the decision graph and the $\gamma$ values of the Sports Article dataset. As shown in these figures, the improved DPC can accurately determine the number of clusters of each dataset, indicating that using cosine similarity to calculate the local density of DPC is better than using the Euclidean distance. Therefore, cosine similarity is more suitable for text vector calculation.

3.5. Experiment 3

We compared the clustering performance of DPC, DBSCAN, Spectral Clustering, Affinity Propagation, and DPC-K-means. In this comparative study, we used four evaluation metrics—ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), FMI (Fowlkes-Mallows Index), and MSE (Mean Squared Error)—to evaluate the performance of the clustering algorithms. The mean squared error (MSE) is the average of the squared differences between the predicted values and the real values; it is nonnegative, and values closer to zero are better. For fair comparisons with these methods, we repeated each experiment ten times and took the average clustering performance as the final performance of each method. Table 6 shows the ARI of each method, Table 7 the NMI, Table 8 the FMI, and Table 9 the MSE.

To further judge the difference between the clustering results of the proposed DPC-K-means algorithm and those of the other clustering methods, a paired t-test was used to test the significance of the clustering results. Table 10 shows the paired t-test results for each evaluation metric of these methods in Tables 6–9. The p value gives the probability of observing the test results under the null hypothesis. The confidence level is 95%, and the cut-off value of p is 0.05; if p < 0.05, the clustering results of the proposed algorithm and the comparison algorithm are significantly different. If p ≥ 0.05, there is no significant difference between the clustering performance of the proposed algorithm and the comparison algorithm.

As shown in Table 10, there are significant differences in the NMI, FMI, and MSE metrics between DPC-K-means and the comparison methods; DPC-K-means is superior to the comparison algorithms in NMI, FMI, and MSE. DPC-K-means shows no significant difference from DPC and Spectral Clustering in ARI, indicating that DPC and Spectral Clustering perform as well as DPC-K-means on the ARI metric. A one-sample t-test was used to evaluate the significance of the DPC-K-means algorithm on the different datasets. Taking the FMI metric on the BBC dataset as an example, the significance testing process is as follows. Firstly, a test hypothesis is established and the threshold for statistical significance is chosen ($\alpha = 0.05$). Secondly, the t statistic is calculated:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},$$

where $\mu_0$ represents the FMI value obtained by DPC-K-means on the BBC dataset, $\bar{x}$ is the mean FMI of the five comparison algorithms on the BBC dataset, $s$ is the standard deviation of the FMI of the five comparison algorithms on the BBC dataset, and $n$ is the sample size. The degree of freedom used in this test is 4. Finally, the t-distribution table is queried, the p value is determined, and an inference conclusion is made. The calculated t value is 1.533, and H0 is rejected, indicating that the difference between the FMI metric of DPC-K-means on the BBC dataset and the FMI values of the other comparison algorithms is statistically significant. Following the above significance testing steps, the t-test results of DPC-K-means were calculated for the ARI, NMI, and FMI evaluation metrics. The results are shown in Tables 11–13.
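A one-sample test of this form can be reproduced with SciPy as sketched below; the five comparison FMI values and the DPC-K-means FMI are hypothetical placeholders used only to show the call.

```python
from scipy import stats

# Hypothetical FMI values of the five comparison algorithms on one dataset,
# tested against the FMI obtained by DPC-K-means (degrees of freedom = n - 1 = 4).
comparison_fmi = [0.55, 0.60, 0.48, 0.63, 0.58]
dpc_kmeans_fmi = 0.72

t_stat, p_value = stats.ttest_1samp(comparison_fmi, popmean=dpc_kmeans_fmi)
print(t_stat, p_value, p_value < 0.05)
```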

As shown in Table 11, there are significant differences in the ARI metric of DPC-K-means on five datasets, and as shown in Table 12, there are significant differences in the NMI metric of DPC-K-means on eight datasets. It can be seen from Table 13 that DPC-K-means has significant differences in the FMI metric on seven datasets. DPC-K-means is thus statistically significant on most datasets. Combined with Tables 9 and 10, this further shows that DPC-K-means is better than the other comparison algorithms.

The clustering performance of DPC-K-means is better than that of K-means because DPC-K-means can select the number of clusters and obtain the initial clustering centers automatically. The number of clusters and the initial clustering centers can then be used in the K-means algorithm, which achieves better clustering performance than standard K-means. Figures 11 and 12 illustrate that the clustering centers automatically determined by DPC-K-means are closer to the real class centers.

Tables 6–9 show that the clustering metrics changed significantly from 4 newsgroups to 8 newsgroups; this degradation was caused by the irregular data distribution. Due to the inherent nature of the DPC algorithm, it cannot identify the phenomenon of “false peaks,” and its clustering effect on datasets with no density peaks is poor; these factors all affect the accuracy of the DPC-K-means algorithm. The algorithm is therefore limited in its processing of more complex datasets.

DPC-K-means has a parameter $d_c$, the cut-off distance. The value suggested in the literature [12] sets $d_c$ so that the average number of nearest neighbours of each sample is approximately 1%~2% of the total dataset size. In the experiment, we obtained the correct number of clusters following this principle, and the parameter had no significant influence on the result of the algorithm within the range of 1%–2% of the entire dataset size. For Spectral Clustering, the K-nearest neighbour method was used to establish the similarity matrix. The damping factor in Affinity Propagation was set to 0.9, and the distance metric of DBSCAN was set to “cosine” (cosine similarity).
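For reference, these comparison-method settings could be instantiated in scikit-learn roughly as follows; the eps, min_samples, n_clusters, and n_neighbors values are hypothetical placeholders, and only the damping factor, the cosine metric, and the nearest-neighbours affinity come from the text above.

```python
from sklearn.cluster import DBSCAN, SpectralClustering, AffinityPropagation

dbscan = DBSCAN(metric="cosine", eps=0.5, min_samples=5)
spectral = SpectralClustering(n_clusters=5, affinity="nearest_neighbors",
                              n_neighbors=10)
affinity = AffinityPropagation(damping=0.9)

# labels = dbscan.fit_predict(X_low)   # X_low: the dimension-reduced text vectors
```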

4. Conclusions

This study proposed a Stacked-Random Projection (SRP) dimension reduction framework based on deep networks and an improved K-means text clustering algorithm based on density peaks (DPC-K-means). In the experiments, SRP, the improved DPC, and DPC-K-means were validated using different datasets. Firstly, we compared SRP with PCA, MDS, and Random Projection. Multiple evaluation metrics demonstrated that SRP maintains a good balance between running time and the preservation of distances before and after dimension reduction. Secondly, we compared the difference between the Euclidean distance and cosine similarity in calculating the DPC local density; cosine similarity is more suitable for text vector calculation. Finally, DPC-K-means is an improved K-means algorithm that uses the cosine similarity of text feature vectors to calculate the local density and to obtain the initial clustering centers and the number of clusters, after which the K-means algorithm is used for clustering. We compared DPC-K-means with DPC, DBSCAN, Spectral Clustering, and Affinity Propagation and found that DPC-K-means can accurately determine the number of clusters and the initial clustering centers of high-dimensional text data; it is superior to the other clustering algorithms in ARI, NMI, FMI, and MSE. Furthermore, we analyzed the influence of the parameters on the algorithm and the limitations of our proposed methods. For future work, we will focus on determining the number of layers and the target dimension of each dimensionality reduction layer, and on improving the matching degree between DPC-K-means and datasets.

Data Availability

The BBC news data used to support the findings of this study have been deposited in the open-source repository (http://mlg.ucd.ie/datasets/bbc.html). The 20-newsgroups data used to support the findings of this study have been deposited in the open-source repository (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets).

Conflicts of Interest

Yujia Sun and Jan Platoš declare that there is no conflict of interest regarding the publication of this paper.