Introduction

In information retrieval and machine learning, many techniques rely on similarity/distance measures to perform a wide range of tasks [1]. Clustering and classification are among the most widely used techniques for knowledge discovery across scientific fields [2,3,4,5,6,7,8,9,10], and text classification and clustering have long been vital research areas in information retrieval (IR). Text classification assigns a text/document to its actual class by means of a similarity measure and a suitable classifier, whereas clustering groups similar texts into groups called clusters. With the ever-growing amount of data and information on the internet, the need for highly effective classification algorithms is pressing, and improving classification performance remains a central task for researchers in the text mining field. Because similarity/distance measures are the core component of classification and clustering algorithms, their efficiency and effectiveness directly affect the performance of these techniques. Selecting the best similarity measure for a given technique therefore remains an open and challenging problem.

Even though several works in the IR literature compare similarity/distance measures for clustering and classification purposes [2, 3, 11,12,13,14,15,16], these studies still fail to provide a comprehensive overview of the actual performance of similarity measures. Moreover, some of these works present efficient similarity measures while ignoring effectiveness [17, 21], whereas others present effective similarity measures while ignoring efficiency [2,3,4]. This work addresses this critical limitation by introducing a similarity measure that balances effectiveness and time efficiency, while the most widely used similarity measures are thoroughly investigated under numerous settings. Using the K-nearest neighbor (KNN) classifier, the K-means clustering algorithm, and the bag-of-words (BoW) representation model [17,18,19] for feature selection, the similarity measures are examined in detail. The K value in KNN is varied from 1 to 120, and the number of features is set to 10, 50, 100, 200, 350, 3000, 6000, and the whole number of features of the considered dataset. In doing so, the superiority of the STB-SM measure is emphasized, and each measure is tested under several conditions so that its effectiveness, including accuracy, can be observed for particular K values and feature counts. The measures are evaluated on low-dimensional representations (50, 100, 200, and 350 features) and on high-dimensional representations (3000, 6000, and all features of the dataset). The behavior of the measures is analyzed to determine which measure gives the best results for certain K values on a specific number of features. Furthermore, for the clustering performance analysis, five evaluation metrics were employed, two internal and three external. The key objective of this work is to present a new competitive measure and to compare and benchmark the performance of similarity measures on both low- and high-dimensional datasets. Briefly, the main contributions of this work are listed below:

  1. Introducing a novel similarity measure for text retrieval that is fundamentally based on set-theoretic operations. This measure is named the set theory based similarity measure for text retrieval (STB-SM). According to the experimental results for both classification and clustering, STB-SM proves to be a promising measure, outperforming the existing state-of-the-art measures.

  2. Along with proposing STB-SM, seven similarity measures that are commonly applied for text retrieval and machine learning purposes are thoroughly investigated and evaluated to benchmark their impact on text retrieval. They are comprehensively tested on two widely used publicly available datasets (namely, Web-KB and Reuters-R8). Using BoW, a thorough comparative analysis of these measures, in terms of their effectiveness and efficiency, is drawn. The classification effectiveness is assessed with six evaluation factors, namely accuracy, precision (PRE), recall (REC), F-measure (FM), G-measure (GM), and average mean precision (AMP). The clustering effectiveness is assessed with five evaluation metrics, namely purity, completeness, and Rand index as external metrics, along with the Calinski-Harabasz index and the Davies-Bouldin index as internal metrics. Moreover, for both classification and clustering efficiency, the run time taken by each measure to compute the similarity degree is rigorously observed.

  3. The scope of this work is to improve the performance of text clustering and classification through a new measure, along with a detailed comparative analysis of the proposed measure against state-of-the-art BoW-based similarity measures. The resulting analyses provide a practical guide for selecting similarity measures for the considered datasets and help researchers fully understand the present and future challenges linked with text retrieval.

The rest of this paper is structured as follows: the most relevant similarity measures for this study are concisely presented in Sect. "Related work". Section "The set theory" briefly describes the basics and definitions of set theory in the context of text retrieval. Section "The proposed similarity measure (STB-SM)" defines, formulates, and analyzes the proposed similarity measure in the context of set theory. The experimental setup is described in Sect. "Experimental setup". The results of the work are given in Sect. "Experimental results". The discussion is detailed in Sect. "Discussion". Finally, conclusions and future work recommendations are presented in Sect. "Conclusions and future work".

Related work

The Vector Space Model (VSM) has long been used to represent documents in text retrieval. In VSM, each document is drawn as an N-dimensional vector, where each dimension represents a vocabulary term/feature. In the information retrieval (IR) literature, there is a good number of similarity measures to compute pairwise document similarity using VSM. Although several works have been proposed in the IR literature to perform clustering and classification using similarity/distance measures [2,3,4, 11,12,13,14,15,16], these works lack a comprehensive overview of the actual performance of similarity measures. Moreover, some of them propose efficient similarity measures irrespective of their effectiveness [21, 22], while other works present only effective similarity measures without consideration of their efficiency [2,3,4].

Euclidean and Manhattan distances are among the most common geometric measures used to compute the distance between a pair of vectors [2, 20]. Similarly, Cosine similarity measures the similarity between a document pair using the angle between their vectors [10]. The triangle distance is likewise viewed as the Cosine of a triangle between a vector pair [10]; its value ranges between 0 and 2. For 0–1 vectors, the Hamming distance [4] gives the number of positions at which the feature weights differ. The Kullback–Leibler divergence (KLD) [23, 24], a non-symmetric measure, was used in [24] to compute the similarity between a vector pair using the probability distributions associated with both vectors. In [4], a similarity measure for text processing, named SMTP, was proposed to compute the similarity between a document pair. An information-theoretic measure (IT-Sim) was proposed in [18] for document similarity purposes. In [3], a new similarity measure called Improved Sqrt-Cosine (ISC) was proposed. The Bhattacharyya coefficient was introduced in [21] to approximate the overlap rate between a pair of statistical samples. The Jaccard coefficient was developed in [25] to compute similarity as the ratio of the number of features present in both documents to the number of features present in at least one of them. Subsequently, in [2], a new similarity measure named pairwise document similarity measure based on present term set (PDSM) was presented, based on the feature weights as well as the number of features present in at least one of the considered documents.

Some of these measures have been shown to be highly effective, such as PDSM [2], ISC [3], and SMTP [4], yet they are unfortunately time-inefficient. In contrast, some measures are not effective yet are highly efficient, notably Euclidean and Manhattan. Cosine, on the other hand, is seen as a compromise, being both effective and highly efficient. Furthermore, as reported in the IR literature, almost all of these measures were tested in the context of text classification and clustering. For example, PDSM was compared in [2] with five similarity measures in terms of classification and a near-duplicate detection application. Likewise, ISC [3] and SMTP [4] were evaluated against several similarity measures concerning text classification and clustering. Similarly, the measure proposed in this work is evaluated against some of the most widely used similarity measures in the machine learning and information retrieval literature, particularly with respect to text classification and clustering. Finally, [7] assessed the clustering performance of several measures on three collections of web documents; their experimental results revealed that Cosine similarity outperforms both the Jaccard coefficient and the Euclidean distance.

The most relevant similarity measures

In this subsection, the similarity measures considered in this study are presented. Seven similarity measures are introduced as the most widely used measures for text clustering and classification [2, 20,21,22,23,24]. These measures work by considering the presence and absence of terms, by evaluating the angle between each vector pair, or by computing a distance. Assuming two documents doc1 and doc2 with vectors d1 and d2, the aim is to determine how similar they are using the intended similarity measure, as follows:

Euclidean distance (ED)

Every document is represented as a point in N-dimensional space, where each dimension corresponds to the term frequency of one of the N terms. ED computes the distance between each pair of points in this N-dimensional space using their coordinates, based on the following equation:

$$D_{Euc} \left( {doc1,doc2} \right) = \sqrt {\mathop \sum \limits_{i = 1}^{n} \left( {doc_{i1} - doc_{i2} } \right)^{2} }$$
(1)

Manhattan

The Manhattan distance (also known as the sum or L1 norm) computes the sum of absolute differences between the corresponding coordinates of a document pair's vectors, as follows:

$$Manhattan - distance \left( {doc1, doc2} \right) = \mathop \sum \limits_{i = 1}^{n} \left| {doc_{i1} - doc_{i2} } \right|$$
(2)

Cosine similarity measure

The Cosine similarity calculates the pairwise similarity between two documents using the dot product and the magnitudes of their vectors. It is widely utilized across scientific fields, including IR [20], and is defined as follows:

$$Sim_{Cos} \left( {doc1,doc2} \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{n} (doc_{i1} * doc_{i2} )}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} doc_{i1}^{2} } *\sqrt {\mathop \sum \nolimits_{i = 1}^{n} doc_{i2}^{2} } }}$$
(3)

The product of the two vector magnitudes is used to normalize the inner product.
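For concreteness, Eqs. (1)–(3) can be sketched in a few lines of NumPy; this is our own illustrative code (with toy weight vectors), not the implementation used in the experiments.

```python
import numpy as np

def euclidean(d1, d2):
    # Eq. (1): square root of the summed squared coordinate differences
    return np.sqrt(np.sum((d1 - d2) ** 2))

def manhattan(d1, d2):
    # Eq. (2): sum of the absolute coordinate differences
    return np.sum(np.abs(d1 - d2))

def cosine(d1, d2):
    # Eq. (3): dot product normalized by the product of the vector magnitudes
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

doc1 = np.array([2, 5, 7, 8, 0, 9], dtype=float)  # toy BoW weight vectors
doc2 = np.array([9, 0, 0, 6, 5, 1], dtype=float)
print(euclidean(doc1, doc2), manhattan(doc1, doc2), round(cosine(doc1, doc2), 2))
```

Note that Euclidean and Manhattan are distances (lower means more similar), whereas Cosine is a similarity (higher means more similar).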

Jaccard similarity measure

This coefficient was introduced in [25]; it divides the size of the intersection of two sets of points by the size of their union. Its value ranges between 0 (no similarity between the documents) and 1 (the documents are identical). The Jaccard similarity is given by the following equation:

$$Sim_{jaccard} \left( {doc1,doc2} \right) = \frac{{\left| {doc1 \cap doc2} \right|}}{{\left| {doc1 \cup doc2} \right|}}$$
(4)

Bhattacharyya coefficient

The Bhattacharyya coefficient is used to approximate the overlap rate between each pair of statistical samples [21]. In our work, these samples are documents, and the coefficient is utilized to approximate the closeness of each document pair:

$$Sim_{Bhatta} \left( {doc1,doc2} \right) = 1 - \log \left( {\sum\nolimits_{i = 1}^{n} {\sqrt {doc_{i1} \cdot doc_{i2} } } } \right)$$
(5)

Kullback–Leibler divergence

It is also known as "relative entropy" [23, 24] and is used to measure the difference between two probability distributions. When this measure equals 0, the two distributions are identical. Its equation is as follows:

$$Sim_{KL} \left( {doc1,doc2} \right) = \mathop \sum \limits_{i = 1}^{n} (doc_{i1} )*{ \log }\left( {\frac{{doc_{i1} }}{{doc_{i2} }}} \right)$$
(6)
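The remaining three measures can be sketched in the same style. Eq. (4) is applied here to the sets of present terms, and Eqs. (5) and (6) are coded literally as written above; the small eps guard against log(0) and division by zero is our own addition, not part of the original formulas.

```python
import numpy as np

def jaccard(d1, d2):
    # Eq. (4) on term presence: |terms present in both| / |terms present in either|
    p1, p2 = d1 > 0, d2 > 0
    return np.sum(p1 & p2) / np.sum(p1 | p2)

def bhattacharyya(d1, d2):
    # Eq. (5), literal form: 1 - log( sum_i sqrt(d1_i * d2_i) )
    return 1.0 - np.log(np.sum(np.sqrt(d1 * d2)))

def kl_divergence(d1, d2, eps=1e-12):
    # Eq. (6): sum_i d1_i * log(d1_i / d2_i); in practice the vectors are usually
    # normalized to probability distributions first, and eps avoids log(0)
    p, q = d1 + eps, d2 + eps
    return np.sum(p * np.log(p / q))
```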

PDSM

This measure was introduced in [2] to tackle a limitation of state-of-the-art measures by taking the number of present terms into account. PDSM proved effective according to the experimental results of [2] as well as the experimental results of our current work. The PDSM equation is formulated as follows:

$$D_{pdsm} \left( {doc1,doc2} \right) = \frac{{doc_{i1} \cap doc_{i2} }}{{doc_{i1} \cup doc_{i2} }}*\frac{{PF\left( {doc_{i1} ,doc_{i2} } \right)}}{{M - AF\left( {doc_{i1} ,doc_{i2} } \right) + 1}}$$
(7)

where

$$doc_{i1} \cap doc_{i2} = { \hbox{min} }\left( {doc_{i1} , doc_{i2} } \right)$$
$$doc_{i1} \cup doc_{i2} = { \hbox{max} }\left( {doc_{i1} , doc_{i2} } \right)$$

where \(PF(doc_{i1}, doc_{i2})\) denotes the number of present terms, \(AF(doc_{i1}, doc_{i2})\) denotes the number of absent terms, and M is the total number of terms (features).
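The following sketch shows one possible reading of Eq. (7). It assumes that the ∩/∪ fraction is summed over all terms (a generalized Jaccard ratio), that PF counts terms present in at least one of the two documents, that AF counts terms absent from both, and that M is the total number of terms; these are our assumptions, and [2] should be consulted for the authoritative definition.

```python
import numpy as np

def pdsm(d1, d2):
    # One possible reading of Eq. (7); see the stated assumptions above.
    min_max = np.sum(np.minimum(d1, d2)) / np.sum(np.maximum(d1, d2))
    pf = np.sum((d1 > 0) | (d2 > 0))    # assumed: terms present in at least one document
    af = np.sum((d1 == 0) & (d2 == 0))  # assumed: terms absent from both documents
    m = len(d1)                         # assumed: total number of terms
    return min_max * pf / (m - af + 1)
```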

The set theory

Before introducing the proposed measure, some basics and definitions of set theory (upon which our measure is built) in the context of text retrieval should be understood. The main objective of this section is therefore to introduce the set-theoretic operations on which our proposed measure relies.

Generally speaking, set theory is a vital component of modern mathematics and is widely used in formal descriptions. A set can be a collection, a group, or even a cluster of points that are called members of that set. For instance, a set of documents is a collection of documents, and a set of people is a group of people. For a point to be a member of a set, its membership must be defined clearly. Sometimes, however, due to a lack of information, defining membership is difficult and may even be vague; if the membership definition of a collection is vague, the collection cannot be called a set. Simply put, if there is a set S and two of its members X and Y, then it must be determinable whether or not X = Y. Strictly speaking, a set can be finite, infinite, or empty. In the following, some basic definitions and key operations are introduced to further clarify the basics on which the STB-SM measure is built.

Definition 1

If we have two sets S1 and S2, both sets are equal if and only if they have the same points, i.e., every \(X \in S1 \Leftrightarrow X \in S2.\) For example, in the context of text retrieval, if we have Doc1 = {Ali, Jun, Sarah} and Doc2 = {Jun, Sarah, Ali}, then Doc1 = Doc2 and the documents are identical, as every word that belongs to Doc1 also belongs to Doc2.

Definition 2

If we have two sets S1 and S2, S1 is a subset of S2 (S1 \(\subseteq\) S2) if every X \(\in\) S1 also satisfies X \(\in\) S2. For example, in the context of text retrieval, if we have Doc1 = {Ali, Hassan, Sarah} and Doc2 = {Hassan, Sarah, Ali, Mark, Farah}, then Doc1 \(\subseteq\) Doc2, and Doc1 is a subset of Doc2, as every word that belongs to Doc1 also belongs to Doc2.

Definition 3

The document doc is a collection of terms, with a vector that holds these terms; that is, doc is any subset of C, where C is the document collection (including C itself).

Let doc be a document, a subset of C. We say that doc exists as a vector if the terms of doc exist in doc itself. First, let us define the key relationships between a document pair doc1 and doc2 in the collection C as follows:

$$doc1 \subset doc2 \Leftrightarrow \left( {T \in doc1 \Rightarrow T \in doc2} \right)\quad \left( {containment} \right)$$
$$doc1 = doc2 \Leftrightarrow doc1 \subset doc2\;{\text{and}}\;doc2 \subset doc1\quad \left( {equality} \right)$$

So, for a given document pair doc1 and doc2, the following set operations hold:

Operation 1—union

The union of two sets S1 and S2 (S1 \(\cup\) S2) is the set that contains all the elements of both S1 and S2, without duplication.

$$S1 \cup S2 = \left\{ {X |X \in S1 or X \in S2} \right\}$$

In the context of text retrieval, the union operation of doc1 and doc2, doc1 \(\cup\) doc2, is the group of terms {t1,…, tn}, where n is the number of terms considered in both documents, that are contained in doc1, doc2, or both:

$$doc1 \cup doc2 = \left\{ {t : t \in doc1 {\text{or }}t \in doc2} \right\}.$$

Operation 2—intersection

The intersection of two sets S1 and S2 (S1 \(\cap\) S2) is the set that contains the elements shared by S1 and S2.

$$S1 \cap S2 = \left\{ {X |X \in S1 and X \in S2} \right\}$$

In the context of text retrieval, the intersection operation of doc1 and doc2, doc1 \(\cap\) doc2, is the group of terms {t1,…, tn}, where n is the number of terms considered in both documents, that are contained in both doc1 and doc2 at the same time:

$$doc1 \cap doc2 = \left\{ {t : t \in doc1 {\text{and }}t \in doc2} \right\}.$$

Operation 3—negation

The negation (set difference) operation, doc1\doc2 or doc2\doc1, is the group of terms that belong to one document but not to the other:

$$doc1 \backslash doc2 = \left\{ {t:t \in doc1\;{\text{and}}\;t \notin doc2} \right\}.$$
$$doc2 \backslash doc1 = \left\{ {t:t \in doc2\;{\text{and}}\;t \notin doc1} \right\}.$$
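In Python, these three operations map directly onto the built-in set type; the document contents below are illustrative only.

```python
doc1 = {"ali", "hassan", "sarah"}
doc2 = {"hassan", "sarah", "mark"}

print(doc1 | doc2)   # union: terms in either document
print(doc1 & doc2)   # intersection: terms shared by both documents
print(doc1 - doc2)   # doc1 \ doc2: terms in doc1 but not in doc2
print(doc2 - doc1)   # doc2 \ doc1: terms in doc2 but not in doc1
```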

The proposed similarity measure (STB-SM)

The formulation of STB-SM similarity measure

Suppose we have a document pair doc1 and doc2. Let doc1 = (w11, w12,…) and doc2 = (w21, w22,…) be the weighting vectors (using the BoW model) of the term sets of document 1 and document 2, respectively. Let T1 = {t11, t12,…, t1n} and T2 = {t21, t22,…, t2n} be the sets of terms contained in doc1 and doc2, respectively. For the sake of simplicity, the proposed STB-SM equations are as follows:

$$X = \left( {\mathop \sum \limits_{{t \in doc_{1} \cap doc_{2} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{1} \cap doc_{2} }} W_{2j} } \right)$$
(8)
$$Y = \left( {\mathop \sum \limits_{{t \in doc_{1} \backslash doc_{2} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{2} \backslash doc_{1} }} W_{2j} } \right)$$
(9)
$$Z = \left( {\mathop \sum \limits_{{t \in doc_{1} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{2} }} W_{2j} } \right)$$
(10)
$${\text{STB}} - {\text{SM}}\left( {doc_{1} ,doc_{2} } \right) = \frac{X}{Z}*\left( {1 - \frac{Y}{Z}} \right)$$
(11)

where the notations "∩" and "\" denote the intersection and set difference (complement) operators of set theory, and Wij is the weighting value. To further explain the mechanism of this measure and briefly highlight some deficits of the state-of-the-art measures, we provide three examples, as follows:

Example 1

Assuming we have doc1 = (2, 5, 7, 8, 0, 9) and doc2 = (9, 0, 0, 6, 5, 1), STB-SM works as follows (for simplicity, we write X1 and X2 for the two factors of X, Y1 and Y2 for those of Y, and Z1 and Z2 for those of Z; Ti.w denotes the weight of term i):

 

      T1.w  T2.w  T3.w  T4.w  T5.w  T6.w
Doc1    2     5     7     8     0     9
Doc2    9     0     0     6     5     1

X1 = 2 + 8 + 9 = 19; X2 = 9 + 6 + 1 = 16; Z1 = 2 + 5 + 7 + 8 + 9 = 31; Z2 = 9 + 6 + 5 + 1 = 21; Y1 = 5 + 7 = 12; Y2 = 5

While STB-SM yielded 0.47 × 0.91 = 0.43, Cosine and Jaccard yielded 0.42 and 0.22, respectively.

Example 2

Assuming we have doc1 = (0, 2, 1, 1, 0, 1) and doc2 = (3, 1, 1, 1, 1, 0), STB-SM works as follows:

 

      T1.w  T2.w  T3.w  T4.w  T5.w  T6.w
Doc1    0     2     1     1     0     1
Doc2    3     1     1     1     1     0

X1 = 4; X2 = 3; Z1 = 5; Z2 = 7; Y1 = 1; Y2 = 4

While STB-SM yielded 0.34 × 0.89 = 0.30, Cosine and Jaccard yielded 0.42 and 0.50, respectively.

Example 3

Assuming we have doc1 (1, 1, 3) and doc2 (1, 0, 2), then STB-SM will work as follows;

 

      T1.w  T2.w  T3.w
Doc1    1     1     3
Doc2    1     0     2

X1 = 4; X2 = 3; Z1 = 5; Z2 = 3; Y1 = 1; Y2 = 0

While STB-SM yielded 0.80, Cosine and Jaccard yielded 0.94 and 0.25, respectively.

As seen from the examples above, Cosine occasionally finds a good similarity, as in example 1. However, Cosine gives the same value for examples 1 and 2 despite the clear difference between the vector pairs, and, to further exacerbate the issue, its similarity value is highly exaggerated in example 3. It is worth noting that one novelty of the STB-SM measure is that its similarity value is never exaggerated in the way shown in example 3 for Cosine or other state-of-the-art measures. STB-SM enables non-zero/non-shared features to contribute explicitly to the similarity computation; therefore, STB-SM effectively takes the presence and absence of all features into consideration.

On the other hand, Jaccard occasionally produces a good similarity, as shown in example 2, but more frequently the Jaccard similarity is poor, as in examples 1 and 3. Our proposed measure therefore offers a compromise in which the desired effect is captured: examples 1 and 3 show a better and more accurate similarity found by STB-SM in comparison with Cosine and Jaccard.
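To make Eqs. (8)–(11) concrete, the following is a minimal sketch (our own illustrative code, not the authors' implementation) that reproduces example 1; small rounding differences from the hand-computed 0.43 are expected.

```python
import numpy as np

def stb_sm(d1, d2):
    """Set theory based similarity measure, Eqs. (8)-(11), for two BoW weight vectors."""
    shared = (d1 > 0) & (d2 > 0)   # terms present in both documents (doc1 ∩ doc2)
    only1 = (d1 > 0) & (d2 == 0)   # terms of doc1 absent from doc2 (doc1 \ doc2)
    only2 = (d2 > 0) & (d1 == 0)   # terms of doc2 absent from doc1 (doc2 \ doc1)
    x = d1[shared].sum() * d2[shared].sum()  # Eq. (8)
    y = d1[only1].sum() * d2[only2].sum()    # Eq. (9)
    z = d1.sum() * d2.sum()                  # Eq. (10)
    return (x / z) * (1 - y / z)             # Eq. (11)

doc1 = np.array([2, 5, 7, 8, 0, 9], dtype=float)
doc2 = np.array([9, 0, 0, 6, 5, 1], dtype=float)
print(round(stb_sm(doc1, doc2), 2))  # about 0.42-0.43, matching example 1
```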

STB-SM analysis

In this subsection, we concisely analyze the behavior of the proposed measure in three cases, as follows:

The worst-case:

This case occurs when there is not even one shared feature between the document vectors.

Example (worst case): Assume doc1 = (3, 0, 1) and doc2 = (0, 2, 0). Applying the measure, we find X1 = 0, X2 = 0; Z1 = 4, Z2 = 2; Y1 = 4, Y2 = 2. Because X = 0, STB-SM = 0; the same holds for the documents (1, 0, 1) and (0, 1, 0). This is logically correct, since no shared feature exists.

The average case:

This occurs when there is at least one shared feature, as in examples 1–3 above. In this case, STB-SM takes a value between 0 and 1.

The best case:

This occurs when both vectors are completely equivalent.

Example (best case): Assume doc1 = (4, 4, 4) and doc2 = (4, 4, 4), or doc1 = (1, 1, 1) and doc2 = (1, 1, 1). Applying the measure to the latter pair, we find X = 3 × 3 = 9, Z = 3 × 3 = 9, and Y = 0. Accordingly, STB-SM = 1, which is logically correct, as both documents are identical (the same holds for the pair of (4, 4, 4) vectors).
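Assuming the stb_sm sketch given earlier is in scope, the two boundary cases can be checked directly:

```python
import numpy as np

print(stb_sm(np.array([3., 0., 1.]), np.array([0., 2., 0.])))  # worst case: no shared feature -> 0.0
print(stb_sm(np.array([1., 1., 1.]), np.array([1., 1., 1.])))  # best case: identical vectors -> 1.0
```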

The properties of similarity measures

According to [2, 4], there are six vital properties every similarity measure should satisfy to be considered an optimal measure. These properties are listed below:

Property 1:

The presence or absence of a feature is more important than the difference between the values associated with a present feature. According to the examples calculated above, STB-SM explicitly takes the presence and absence of features into consideration.

Property 2:

The similarity value should grow as the difference between the values of non-zero features declines. For instance, if f1 and f2 are two features belonging to doc1 and doc2 respectively, then the similarity between f1 = 12 and f2 = 6 is higher than the similarity between f1 = 20 and f2 = 6. This property is also clearly shown in example 3, along with the worst-case example.

Property 3:

The similarity value should be reduced as the number of features present in one document but absent in the other rises. This was showcased in both the worst-case and best-case examples, clearly indicating the applicability of this property.

Property 4:

A pair of documents has low similarity if many non-zero-valued features in one document correspond to zero-valued features in the other. For instance, consider two vectors doc1(f1, f2) = (1, 0) and doc2(f3, f4) = (1, 1). Then doc1.f2 and doc2.f4 are the key cause of the lowered similarity between both documents, as f2 × f4 = 0 while f2 + f4 > 0. Example 2 supports the applicability of this property.

Property 5:

The similarity measure should be symmetric. For instance, the similarity between doc1 = (1, 1, 0) and doc2 = (1, 1, 1) must be the same as the similarity between doc2 = (1, 1, 1) and doc1 = (1, 1, 0). According to the examples above, STB-SM fully satisfies this property.

Property 6:

The distribution of feature values should contribute to the similarity between two documents; that is, features with a higher spread (standard deviation) contribute more to similarity than those with a lower spread.

Experimental setup

Text pre-processing

Several standard operations were carried out to transform the raw text into text vectors for processing. The text was converted to lower case; numbers, punctuation, stop words (common words), and extra white space were removed; and certain symbols (such as $ and %) were converted into spaces.

Text representation

The bag of words (BoW) model [26, 27] was used to represent documents in the vector space model (VSM). The BoW model represents each document as a collection of words, disregarding grammar and word order [28].

Since Python was used for the text pre-processing, the preprocessing was performed with the NLTK (Natural Language Toolkit) library as follows (a minimal sketch is given after the list):

  • Tokenization: using the NLTK word tokenizer

  • Converting all words to lower case: using the lower() Python string function

  • Lemmatizing: using the NLTK WordNetLemmatizer

  • Stop word removal: using the NLTK stopwords list

  • Keeping only words with 4 or more letters
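A minimal sketch of this pipeline (our own illustrative code) is shown below; the NLTK resources punkt, wordnet, and stopwords must be downloaded beforehand, and the isalpha() filter stands in for the removal of numbers and punctuation described above.

```python
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text)                         # tokenization
    tokens = [t.lower() for t in tokens]                 # lower-casing
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [t for t in tokens if len(t) >= 4 and t.isalpha()]  # keep words with 4+ letters

print(preprocess("The quick brown foxes were jumping over the lazy dogs"))
```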

The comparison mechanism of classification

After pre-processing, all documents were represented using the BoW model in VSM so that the classification process could start smoothly. Following that, the performance of every similarity measure across the different kinds of documents was compared and evaluated. Six evaluation criteria were used, namely accuracy, precision, recall, F-measure, G-measure, and average mean precision. For each criterion, the KNN algorithm was run from K = 1 to K = 120 over each number of features of each dataset, and the averaged results were accumulated and reported in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13. The number of features (NF) was varied over NF = 10, 50, 100, 200, 350, 3000, 6000, and the whole number of features (see Appendix samples). Consequently, we have eight runs of the KNN algorithm over two datasets to examine six criteria using eight similarity measures, so the total number of experiments behind the results below is 8 × 2 × 6 × 8 = 768 runs. If we also consider the sixty (60) values of K tested in each KNN cycle, the total number of runs is 46,080.

Term weighting

We adopted the most widely used term frequency (TF) weighting technique, which simply counts the occurrences of each word in the corresponding document [29, 30].
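As a sketch, TF weighting over a fixed vocabulary can be written as follows; the vocabulary and document below are illustrative, and in practice the vocabulary would be the selected top-NF features.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    # Term frequency: raw count of each vocabulary term in the document
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocabulary = ["trade", "market", "price", "bank"]          # illustrative feature set
doc_tokens = ["market", "price", "price", "bank", "deal"]  # a pre-processed document
print(tf_vector(doc_tokens, vocabulary))                   # [0, 1, 2, 1]
```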

K-nearest neighbor classifier

The K-nearest neighbor algorithm (KNN) is widely used in the IR literature to perform document classification. Although it is a lazy algorithm [27], it is nonparametric, simple, and considered to be among the top ten algorithms in data mining [31]. It works by selecting the nearest points to the point in question. The underlying idea of KNN is that points belonging to the same class are likely to be close to one another according to the similarity measure in use. KNN assumes the following: (1) points in the feature space have a distance between each other, which is used as a metric to gauge closeness, and (2) each training point has a vector and a class label. A number "k" is then chosen to define the neighborhood of the point in question.
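A minimal sketch of KNN with a pluggable similarity measure (our own illustration; the actual experiments vary K from 1 to 120) could look like this:

```python
import numpy as np
from collections import Counter

def knn_predict(query, train_vectors, train_labels, k, similarity):
    # Score the query document against every training document with the chosen measure
    scores = [similarity(query, doc) for doc in train_vectors]
    # Take the k most similar training documents and vote on their class labels
    top_k = np.argsort(scores)[-k:]
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]
```

For distance measures such as Euclidean or Manhattan, the ordering would be inverted so that the k smallest values are taken.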

K-means clustering algorithm

Generally speaking, clustering a huge text dataset can be performed efficiently using partitional clustering algorithms. One of the most popular partitional clustering algorithms is K-means, which is widely regarded in the literature as a good fit for handling huge datasets [8, 32]. Like any clustering algorithm, K-means relies on a similarity measure to compute the similarity between each document and the representative (head) of each cluster. The similarity measure is the core of the clustering process, and clustering performance is analyzed with respect to it; however, choosing the most suitable similarity measure for effective clustering is still an open challenge. In our work, for the clustering performance analysis, we ran K-means with each similarity measure and recorded the values of the evaluation metrics (the external metrics purity, completeness, and Rand index, and the internal metrics Calinski-Harabasz index and Davies-Bouldin index). We used a voting technique to determine which similarity measure best fits the K-means algorithm: the voting counts, for each similarity measure, how many metrics it achieved the best value on, and the measure with the largest count is declared the best fit. According to the experimental results of the clustering process, our proposed measure (STB-SM) was the best fit in most cases, achieving 11 out of 20 points by being the best on four metrics out of five. Unfortunately, in the K-means algorithm, the number of clusters is still an ill-posed problem, as stated in [32, 33]. Therefore, in this study, we picked 4 and 8 as the numbers of clusters simply to analyze and highlight the behavior of all the similarity measures. It is worth noting that we are not arguing that K = 4 or K = 8 is optimal; these values were chosen as the numbers of actual classes in each dataset [34] to draw the performance analysis of K-means using the considered similarity measures. In follow-up work, we plan to examine the performance with several numbers of clusters and with other clustering algorithms, such as hierarchical clustering.
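A minimal sketch of K-means with a pluggable similarity measure (our own illustration) is given below; it stops after max_iter iterations or when the assignment is unchanged between two consecutive cycles, as described above, and for distance measures np.argmax would be replaced by np.argmin.

```python
import numpy as np

def kmeans(docs, k, similarity, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    docs = docs.astype(float)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]  # random initial centroids
    prev_assign = None
    for _ in range(max_iter):
        # Assignment step: each document joins the cluster of its most similar centroid
        assign = np.array([np.argmax([similarity(d, c) for c in centroids]) for d in docs])
        if prev_assign is not None and np.array_equal(assign, prev_assign):
            break  # clusters unchanged for two consecutive cycles
        prev_assign = assign
        # Update step: each centroid becomes the mean of its assigned documents
        for j in range(k):
            members = docs[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```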

Machine description

Table 1 displays the machine and environment descriptions used to perform this work.

Table 1 Machine and environment description

Dataset description

Reuters dataset (Table 2): the Reuters-R8 dataset holds the eight most frequent classes of the original ninety classes in the Reuters dataset. After pre-processing, a total of 18,308 features were extracted.

Table 2 Splitting of documents among eight classes in Reuters-R8 dataset

Web-KB dataset (Table 3): it consists of web pages from the computer science departments of the following universities: Cornell, Texas, Washington, and Wisconsin. It was obtained from the World Wide Knowledge Base project of the CMU text learning group. After pre-processing, a total of 33,025 features were extracted. The data in both datasets were divided into training and testing sets at a 2:1 ratio (67%:33%). To avoid over-fitting or under-fitting, instead of dividing the whole data randomly into training and testing portions, each class was divided individually and the portions were then combined into training and testing sets. Both datasets are read directly from Python, as they are already integrated with it.

Table 3 Splitting of documents among four classes in Web-KB dataset

The classification evaluation criteria

This subsection presents the evaluation criteria used for classification, as follows:

Accuracy (ACC)

ACC measures the proportion of samples that are correctly classified out of the whole sample collection. ACC is defined by the following equation:

$${\text{ACC}} = \frac{{{\text{True Positive}} + {\text{True Negative}}}}{{{\text{True Positive}} + {\text{True Negative}} + {\text{False Positive}} + {\text{False Negative}}}}$$
(12)

Precision (PRE)

PRE measures the proportion of items correctly identified as positive out of all items identified as positive.

$${\text{PRE}} = \frac{\text{True Positive}}{{{\text{True Positive}} + {\text{False Positive}}}}$$
(13)

Recall (REC)

REC measures the proportion of items correctly identified as positive out of all actual positives.

$${\text{REC}} = \frac{\text{True Positive}}{{{\text{True Positive}} + {\text{False Negative}}}}$$
(14)

F-measure or F-Score (FM)

FM is the harmonic mean of precision and recall. It is useful when classes are not distributed evenly.

$${\text{FM}} = 2 *\frac{\text{Precision*Recall}}{{{\text{Precision}} + {\text{Recall}}}}$$
(15)

G-measure or G-score (GM)

GM is the geometric mean of precision and recall. It is also used when classes are not distributed evenly.

$${\text{GM}} = \frac{\text{True Positive}}{{\sqrt {\left( {{\text{True Positive}} + {\text{False Positive}}} \right) *\left( {{\text{True Positive}} + {\text{False Negative}}} \right) } }}$$
(16)

Average mean precision (AMP)

AMP is the mean of the average precision of all classes. This is used to evaluate how precisely the classifier is performing.

$$AMP = \mathop \sum \limits_{n} \left( {R_{n} - R_{n - 1} } \right)P_{n}$$
(17)

where Pn and Rn are the precision and recall at the nth threshold. Finally, True Positive, True Negative, False Positive, and False Negative are defined as follows:

True positive: the number of class1 testing documents that are correctly identified as class1.

True negative: the number of instances of class2, class3, …, classN correctly identified as class2, class3, …, classN, respectively.

False positive: the number of class1 testing documents that are incorrectly identified as class2, class3, …, classN.

False negative: the number of class2, class3, …, classN testing documents that are incorrectly identified as class1.
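As a sketch of how these one-vs-rest counts translate into the macro-averaged criteria (our own illustrative code; AMP additionally needs per-class ranking scores and is therefore omitted, and GM is computed here from the macro-averaged precision and recall):

```python
import numpy as np

def macro_metrics(y_true, y_pred, classes):
    precisions, recalls = [], []
    for c in classes:  # one-vs-rest counts per class
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)  # Eq. (13)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)     # Eq. (14)
    pre, rec = np.mean(precisions), np.mean(recalls)
    acc = np.mean(y_true == y_pred)                            # Eq. (12)
    fm = 2 * pre * rec / (pre + rec) if pre + rec else 0.0     # Eq. (15)
    gm = np.sqrt(pre * rec)                                    # Eq. (16), geometric-mean form
    return acc, pre, rec, fm, gm
```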

The clustering evaluation criteria

This subsection presents the evaluation criteria used for clustering. While the external metrics require ground-truth labels to assess cluster quality (see Eqs. 18, 19, 20), the internal metrics do not (see Eqs. 21, 22).

Accuracy (also known as Purity)

It is used to check the extent to which a cluster is pure, i.e., ideally every cluster contains only one class and different clusters contain different classes. In other words, this metric evaluates the coherence of a cluster. It is defined by the following equation:

$${\text{Purity}} = \frac{1}{N} \mathop \sum \limits_{i = 1}^{k} max_{j} \left| {c_{i } \cap t_{j} } \right|$$
(18)

where N is the number of objects (data points), k is the number of clusters, ci is a cluster in C, and tj is the class that has the maximum count within cluster ci.

Completeness

Completeness checks whether all members of a given class are assigned to the same cluster.

$$Completeness = \left\{ {\begin{array}{*{20}c} 1 & {{\text{if }}H\left( {K,C} \right) = 0} \\ {1 - \frac{{H\left( {K|C} \right)}}{{H\left( K \right)}}} & {{\text{otherwise}}} \\ \end{array} } \right.$$
(19)

where

$$H\left( {K|C} \right) = - \mathop \sum \limits_{c = 1}^{\left| C \right|} \mathop \sum \limits_{k = 1}^{\left| K \right|} \frac{{a_{ck} }}{N}\log \frac{{a_{ck} }}{{\mathop \sum \nolimits_{k = 1}^{\left| K \right|} a_{ck} }}$$
$$H\left( K \right) = - \mathop \sum \limits_{k = 1}^{\left| K \right|} \frac{{\mathop \sum \nolimits_{c = 1}^{\left| C \right|} a_{ck} }}{N}\log \frac{{\mathop \sum \nolimits_{c = 1}^{\left| C \right|} a_{ck} }}{N}$$

and \(a_{ck}\) is the number of samples of class c assigned to cluster k, with N the total number of samples.

Rand index

It is used to check the proportion of point pairs on which the predicted clustering agrees with the actual labels.

$$\text{R} = \frac{{\left( {a + b} \right)}}{nC2}\varvec{ }$$
(20)

where n is the total number of samples, nC2 is the number of sample pairs, and (a + b) is the number of pairs on which the real and assigned cluster labels agree.

Calinski-Harabasz index

It measures the ratio of the between-cluster dispersion to the within-cluster dispersion.

$$s = \frac{{\text{tr}\left( {B_{\varvec{k}} } \right)}}{{\text{tr}\left( {W_{\varvec{k}} } \right)}}*\frac{{n_{\varvec{E}} - k}}{k - 1}$$
(21)

where

$$W_{\varvec{k}} = \mathop \sum \limits_{{\varvec{q} = 1}}^{\varvec{k}} \varvec{ }\mathop \sum \limits_{{\varvec{x} \in \varvec{C}_{\varvec{q}} }} \varvec{ }\left( {x - c_{q} } \right)(x - c_{q} )^{T}$$
$$\varvec{B}_{\varvec{k}} = \mathop \sum \limits_{{\varvec{q} = 1}}^{\varvec{k}} \varvec{ n}_{\varvec{q}} \left( {\varvec{c}_{\varvec{q}} - \varvec{c}_{\varvec{E}} } \right)(\varvec{c}_{\varvec{q}} - \varvec{c}_{\varvec{E}} )^{\varvec{T}}$$

where Cq is the set of points in cluster q, cq is the center of cluster q, cE is the center of the data E, nq is the number of points in cluster q, and nE is the total number of data points.

Davies-Bouldin index

This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves.

$$DB = \frac{1}{k}\mathop \sum \limits_{i = 1}^{\varvec{k}} \varvec{ }\mathop {max}\limits_{i \ne j} R_{ij}$$
(22)

where

$$R_{ij} = \frac{{s_{i} + s_{j} }}{{d_{ij} }}$$

where si is the average distance between each point of cluster i and the centroid of that cluster, and dij is the distance between the centroids of clusters i and j. Finally, the best and worst values and the range of each metric are given in Table 4.

Table 4 The best and worst values and the range of each metric
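Assuming scikit-learn (version 0.24 or later for rand_score) is available, the five metrics can be computed as in the following sketch; purity is derived from the contingency matrix, since scikit-learn does not provide it directly.

```python
import numpy as np
from sklearn.metrics import completeness_score, rand_score
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    # Eq. (18): each cluster is credited with its majority class
    cm = contingency_matrix(labels_true, labels_pred)
    return np.sum(np.max(cm, axis=0)) / np.sum(cm)

def clustering_report(X, labels_true, labels_pred):
    return {
        "purity": purity_score(labels_true, labels_pred),              # external, Eq. (18)
        "completeness": completeness_score(labels_true, labels_pred),  # external, Eq. (19)
        "rand_index": rand_score(labels_true, labels_pred),            # external, Eq. (20)
        "calinski_harabasz": calinski_harabasz_score(X, labels_pred),  # internal, Eq. (21)
        "davies_bouldin": davies_bouldin_score(X, labels_pred),        # internal, Eq. (22)
    }
```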

Experimental results

Classification results

This work comprehensively investigated all considered measures based on six performance evaluation criteria, making it the first study of its type in the information retrieval field with respect to text classification. The K values of KNN were varied from 1 to 120 with an increment of 2 in each cycle (see Appendix samples). The number of features of each dataset was varied (10, 50, 100, 200, 350, 3000, 6000, and all features) to clearly reveal the best performance of each measure under several settings. Then, for each measure, the results were averaged over all K values for each NF to yield the results reported in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13. In other words, the following tables contain the results of each similarity measure averaged, for each number of features (NF), over all K values in the range [1–120], as given in the Appendix. Each table displays the averaged results over all K values for each performance criterion. Table 5 displays the averaged results of all criteria when NF = 10. For simplicity, we present the averaged results of all measures while briefly analyzing three criteria, namely ACC, FM, and AMP.

Table 5 Performance evaluation of all measures when NF = 10–the averaged results (K = 1–120; +2)
Table 6 Performance evaluation of all measures when NF = 50–the averaged results (K = 1–120; +2)
Table 7 Performance evaluation of all measures when NF = 100–the averaged results (K = 1–120; +2)
Table 8 Performance evaluation of all measures when NF = 200–the averaged results (K = 1–120; +2)
Table 9 Performance evaluation of all measures when NF = 350–the averaged results (K = 1–120; +2)
Table 10 Performance evaluation of all measures when NF = 3000–the averaged results (K = 1–120; +2)
Table 11 Performance evaluation of all measures when NF = 6000–the averaged results (K = 1–120; +2)
Table 12 Performance evaluation of all measures when NF = the whole size (Reuters = 18,308, Web-KB = 33,025 features)–the averaged results (K = 1–120; +2)
Table 13 Performance evaluation of all measures when the average of the averaged results is taken (K = 1–120; +2)

As shown in Table 5, for the Reuters dataset, Euclidean, followed by STB-SM and Cosine, achieved the highest accuracy, while STB-SM, followed by Euclidean and Kullback–Leibler, outperformed all measures on both the FM and AMP criteria. On the Web-KB dataset, PDSM, followed by STB-SM and Cosine, outperformed all similarity measures in ACC. In regard to FM and AMP, Cosine, followed by STB-SM and PDSM, outperformed all measures, with STB-SM superior to PDSM on FM and PDSM superior to STB-SM on AMP. So the best measures when NF = 10 were Euclidean, STB-SM, and Cosine on Reuters, and PDSM followed by STB-SM and Cosine on Web-KB.

Tables 6, 7, 8, 9, 10 show that, for both Reuters and Web-KB, STB-SM, followed by PDSM and Cosine, achieved the highest ACC, FM, and AMP. However, two exceptions are noted: first, when NF = 350, Cosine outperformed PDSM in terms of FM and AMP on both Reuters and Web-KB; second, when NF = 3000, Cosine outperformed PDSM in terms of FM and AMP on Reuters only. Nevertheless, the top performing measures when NF was in the range [50–3000] were STB-SM, PDSM, and Cosine.

On Reuters, in Tables 11, 12, similarly to Table 10, STB-SM, followed by PDSM and Cosine, was superior, with the highest ACC, FM, and AMP; moreover, Cosine outperformed PDSM in terms of FM and AMP. In contrast, on Web-KB, PDSM, followed by STB-SM and Jaccard, was superior in terms of ACC and AMP, while Cosine was superior to Jaccard in terms of FM only. So the top performing measures, when NF was in the range [6000–all features], were STB-SM, PDSM, Cosine, and Jaccard. It is worth mentioning that the results in Tables 6, 7, 8, 9, 10, 11, 12 are almost the same; in other words, the results remained stable.

Finally, in Table 13, when the average of the averaged results is taken, it is clear that, for both Reuters and Web-KB, STB-SM, PDSM, and Cosine are the best measures for all criteria. Thus, in conclusion, the top performing measures on average are STB-SM, PDSM, and Cosine.

Clustering results

In this subsection, we evaluate and compare the impact of all considered similarity measures on the behavior of the K-means clustering algorithm. Fixing the number of clusters at 4 and 8 and using the clustering evaluation metrics presented above (see Table 4), the experiments were conducted on both datasets (Reuters and Web-KB) to experimentally identify which measure best fits K-means. The experiments clearly show, positively or negatively, the impact of similarity measure selection on clustering quality. All features of both datasets were considered when the clustering process was run (Reuters = 18,308, Web-KB = 33,025 features). As stated earlier, we used two internal metrics and three external metrics to evaluate K-means under each of the considered similarity measures. As for the stopping condition, K-means was allowed to stop after running 50 iterations, or alternatively when the algorithm reached the stability condition, i.e., when the K-means clusters remained unchanged for two consecutive cycles. Centroids of clusters were chosen randomly in each run. We used the voting technique (see Table 20) to decide the best-fit similarity measure, i.e., the one under which K-means performance was highest. According to the results in Tables 14, 15, 16, 17, 18, STB-SM, followed by PDSM and Euclidean, was the best fit in this study. The bolded values in Tables 14, 15, 16, 17, 18 indicate the best values each measure achieved on the corresponding metric.

Table 14 External metric–Purity (also known as "Accuracy")–K-means performance
Table 15 External metric–Completeness–K-means performance
Table 16 External metric–Rand index–K-means performance
Table 17 Internal metric–Calinski-Harabasz index–K-means performance
Table 18 Internal metric–Davies-Bouldin index–K-means performance

Discussion

The discussion revolves around two key points: first, the stability of each measure's performance over both datasets; second, the number of features at which each measure performed best in terms of accuracy (ACC), F-measure (FM), and average mean precision (AMP).

Classification-performance stability

Based on the results given in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, Table 19 identifies the most stable measures on both datasets based on the points each measure achieved for each number of features. The more points a measure achieves, the more stable it is. In Table 19, R and W indicate the Reuters and Web-KB datasets, respectively.

Table 19 Measure stability points

Table 19 shows that the most stable measures were STB-SM, PDSM, and Cosine, with 48, 45, and 45 points, respectively. While PDSM was more stable on Web-KB than Cosine, Cosine was more stable on Reuters than PDSM. Table 19 assigns one point to a measure whenever it was superior in terms of a specific criterion (out of three criteria, namely ACC, FM, and AMP) on a dataset; the raw numbers in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 also show indisputably that the top performing measures are STB-SM, PDSM, Cosine, and Jaccard. It can also be deduced from the results that, unlike on Reuters, where the measures show unstable performance, all measures on Web-KB have almost stable performance, chiefly the top performers. On Reuters, the competition was between STB-SM, PDSM, and Cosine. Moreover, based on the points recorded for each NF in Table 13, it can be concluded that STB-SM, PDSM, Cosine, and Jaccard can be used effectively for low-, middle-, and high-dimensional datasets, as these measures performed well for every NF value. Euclidean and Manhattan also performed well on low-dimensional representations (NF in [10–200] features), and Bhattacharyya behaved well on middle- and high-dimensional representations (NF in [200–N] features).

Strictly speaking, the highest performance over both datasets was seen for STB-SM and PDSM, as both measures showed almost stable and similar performance across all NF values, whereas the poorest performance was seen for Kullback–Leibler, Manhattan, and Euclidean, chiefly on high-dimensional representations.

Classification-performance climax

This subsection examines the number of features at which each similarity measure's performance peaked in terms of accuracy, F-measure, and average mean precision.

Performance analysis-Reuters

Figures 1, 2, 3 map the movement of the criteria (averaged results) for all measures over the different NF values.

Fig. 1

a Accuracy over all measures on all NF values–average results (K = 1–120; +2)–Reuters. b Accuracy over competitive measures on all NF values–average results (K = 1–120; +2)–Reuters

Fig. 2

a F-measure over all measures on all NF values–average results (K = 1–120; +2)–Reuters. b F-measure over competitive measures on all NF values–average results (K = 1–120; +2)–Reuters

Fig. 3

a AMP over all measures on all NF values–average results (K = 1–120; +2)–Reuters. b AMP over competitive measures on all NF values–average results (K = 1–120; +2)–Reuters

Figure 1a depicts that Manhattan, followed by Euclidean and Kullback–Leibler, did not have stable accuracy and performed the worst as NF grew. In contrast, STB-SM, PDSM, and Bhattacharyya had the most stable performance. Cosine and Jaccard showed a punctuated accuracy as NF grew from 10 to 3000, and then declined slightly as NF grew further. Figure 1b shows the top competitors, which were STB-SM, PDSM, and Cosine, with STB-SM being superior.

Figure 2a shows that Euclidean and Manhattan had a higher FM when NF was in the range [10–100]; however, their performance started deteriorating as NF grew. As with accuracy, STB-SM, PDSM, Cosine, Bhattacharyya, and Jaccard yielded an almost stable FM as NF grew from 10 to N, with STB-SM and PDSM being the best and Bhattacharyya outperforming Jaccard, although Bhattacharyya's performance declined slightly in favor of STB-SM, PDSM, and Cosine. Finally, Kullback–Leibler performed the worst. Figure 2b shows the top competitors, which were STB-SM, Cosine, and PDSM, with STB-SM superior to both and Cosine superior to PDSM.

Finally, from Fig. 3a, it is noted that Euclidean, followed by Manhattan and Kullback–Leibler, had the worst performance in terms of AMP. On the other hand, STB-SM, PDSM, Cosine, and Bhattacharyya had the best performance. The Jaccard measure represented a middle ground between the measures with the highest AMP and those with the lowest. It is worth noting that all of the best-performing measures were effective from NF = 10 to NF = 3000, and their effectiveness improves again when all features are used; however, as NF surpassed 3000 to reach 6000 or more, their performance decreased slightly, as shown in Fig. 3b.

Performance analysis-Web-KB

Figures 4, 5, 6 map the movement of the criteria (averaged results) for all measures over the different NF values.

Fig. 4

a Accuracy over all measures on all NF values–average results (K = 1–120; +2)–Web-KB. b Accuracy over competitive measures on all NF values–average results (K = 1–120; +2)–Web-KB

Fig. 5

a F-measure over all measures on all NF values–average results (K = 1–120; +2)–Web-KB. b F-measure over competitive measures on all NF values–average results (K = 1–120; +2)–Web-KB

Fig. 6

a AMP over all measures on all NF values–average results (K = 1–120; +2)–Web-KB. b AMP over competitive measures on all NF values–average results (K = 1–120; +2)–Web-KB

From Fig. 4a, it can be seen that Manhattan, followed by Kullback–Leibler, had almost stable accuracy despite performing poorly as NF grew. STB-SM, PDSM, Cosine, and Jaccard showed clearly stable and higher accuracy as NF grew from 10 to all features, with STB-SM and PDSM being highly superior. While STB-SM outperformed PDSM when NF was in the range [50–3000], PDSM outperformed STB-SM from 6000 features to all features, as shown in Fig. 4b; the two measures intersected at 3000 features. In addition, on average, STB-SM still took the lead. Manhattan and Euclidean performed similarly to each other when NF was in the range [10–200]; however, as NF grew, Euclidean outperformed Manhattan and tracked Bhattacharyya more closely when NF was in [350–33,025].

Figure 5a shows that Manhattan and Kullback–Leibler had the worst performance, although Kullback–Leibler was close to Cosine when NF = 10, a rare case. Similarly to Fig. 4b, Fig. 5b shows that PDSM, STB-SM, and Cosine delivered the best performance, with PDSM and STB-SM being fierce rivals. On the other hand, Jaccard and Euclidean outperformed Manhattan, Kullback–Leibler, and Bhattacharyya when NF was in the range [50–6000]; however, as NF grew larger, Bhattacharyya gradually improved over Euclidean.

Finally, from Fig. 6a, it is clear that Manhattan and Bhattacharyya, followed by Kullback–Leibler, had the worst performance in terms of AMP, although Manhattan had higher AMP when NF was in [10–100]. When NF was in the range [50–200], Bhattacharyya followed by Kullback–Leibler had the worst AMP values; as NF grew, however, Bhattacharyya started to perform better than Manhattan. Similarly, Euclidean outperformed Jaccard when NF was in the range [10–350]; however, as NF grew, Jaccard behaved better than Euclidean. Similar to Fig. 5b, Fig. 6b shows that PDSM, STB-SM, and Cosine had the best performance, with PDSM and STB-SM being close rivals.

Classification-execution time analysis

Finally, the time consumed by each measure on each dataset over each NF was accumulated and averaged to show which measure runs fastest and which runs slowest. A given measure may give higher accuracy and the desired performance while running slower than others, and vice versa. The following figures map the time taken by each measure to produce the results. According to the execution times in Figs. 7, 8, it is abundantly clear that all measures share one fact: execution time grows steadily as NF increases, for PDSM in particular. It is worth mentioning that the time was measured while each similarity measure was run over all six classification evaluation metrics.

Fig. 7

Execution Time–Reuters

Fig. 8

Execution Time–Web-KB

Figure 7 clearly shows that Bhattacharyya and Manhattan were the fastest similarity measures, with Manhattan being much faster when all features were used. However, this speed came at the expense of effectiveness, as these two measures were the second- and third-worst after Kullback–Leibler. Euclidean was observed to be a middle ground in terms of speed between the first group (Bhattacharyya, Manhattan) and the second group (PDSM, Kullback–Leibler, Jaccard, Cosine, and STB-SM). When all features were addressed, Manhattan was the fastest measure and PDSM was the slowest, taking roughly 1493.85 min on Reuters when NF = all features. On the other hand, Euclidean, similarly to Bhattacharyya, recorded worse effectiveness compared with Cosine, Jaccard, STB-SM, and PDSM. Jaccard was faster than Cosine and STB-SM, and Cosine was slower than STB-SM. In order, PDSM, Jaccard, Cosine, and STB-SM were observed to be slower than the first group, with PDSM being the slowest measure.

Similarly to Fig. 7, Fig. 8 clearly shows that Bhattacharyya and Manhattan were also the fastest similarity measures, with Manhattan being slightly faster when all features were used. However, as on Reuters, this speed came at the expense of effectiveness, as these measures were the second- and third-worst after Kullback–Leibler. Euclidean, on the other hand, was the fastest measure when all features were considered, while PDSM was the slowest, taking almost 1001.067 min with all features on Web-KB. However, like Bhattacharyya, Euclidean recorded worse effectiveness compared with Cosine, Jaccard, PDSM, and STB-SM. Meanwhile, Cosine was faster than Jaccard and STB-SM in all NF cases except when all features were addressed, in which case Jaccard was faster than both Cosine and STB-SM. In general, in order, PDSM, Kullback–Leibler, Jaccard, STB-SM, and Cosine were the slowest measures, with PDSM being the slowest overall.

Clustering analysis

Based on the results in Tables 14, 15, 16, 17, 18, the analysis is summarized in Table 20 by counting the points each similarity measure achieved on each metric. A point is counted for a measure when its value is bolded as the highest in Tables 14, 15, 16, 17, 18. The total number of points is 20, since there are two datasets and five metrics over two values of the clustering variable (K = 4, K = 8); in each table, a measure can thus achieve up to four points. For example, Euclidean got 4 points in Table 14, as its results are the top values for the purity metric on both datasets for both K values (4 and 8). Table 20 lists the points recorded for each measure on each metric (Tables 14, 15, 16, 17, 18), as well as the total points and the rank. The bolded values in Table 20 indicate the highest point counts, reflecting the strength of each measure on the corresponding metric.

Table 20 Rank of similarity measures based on clustering results

In general, STB-SM behaves better than PDSM in Web-KB clustering, chiefly as K grows. This suggests that STB-SM can work well on big data and enjoys scalability properties, scalability here referring to the case in which the dataset grows larger and larger. Interestingly, unlike in classification, PDSM works better than STB-SM on Reuters. Briefly, according to Tables 14, 15, 16, 17, 18, the order of the top performing measures on Reuters was Euclidean, PDSM, Cosine, and STB-SM; on the other hand, the order of the top performing measures on Web-KB was STB-SM, Cosine, Euclidean, and PDSM. Strictly speaking, the competition is fierce between STB-SM, Euclidean, Cosine, and PDSM, with STB-SM being the overall leader. In other words, according to the numbers in Tables 14, 15, 16, 17, 18, STB-SM has better values than all other measures in most cases; for example, for purity on Web-KB and for completeness and Rand index on both datasets, STB-SM's values are much higher than those of the other measures. Thus, it can be confidently said that STB-SM outperformed all similarity measures significantly on most of the clustering evaluation metrics.

Clustering–execution time analysis

Based on the times reported in Tables 21 and 22, PDSM was the slowest measure and Manhattan was the fastest. As shown in Tables 14, 15, 16, 17, 18, 21, and 22, our proposed STB-SM measure came as a compromise between efficiency and effectiveness. It is worth noting that the clustering time was measured while K-means was run with nine evaluation metrics, whereas only five metrics are used in this work. Hence, the times reported in Tables 21 and 22 are somewhat longer (either slightly or significantly) than would be expected if K-means were run with only these five metrics, since computing each additional metric adds extra time to the clustering run. Nevertheless, this does not refute or contradict the final conclusions drawn in this paper about the relative speed of each similarity measure.

Table 21 Reuters – run time in (hour:minute:second)
Table 22 Web-KB – run time in (hour:minute:second)

The applicability of the proposed measure (STB-SM) in a big data environment

Since the advent of the Internet, the amount of textual information has kept growing as information technologies continue to evolve. These technologies have allowed massive volumes of data to grow exponentially across online content such as webpages of all kinds (academic, scientific, news, medical, etc.), blogs, social networks like Facebook and Twitter, and YouTube. Trillions of bytes of data are generated every day, and it has been estimated that 90% of the data in the world was created in the last couple of years [34, 35]. Consequently, this fast growth of data volumes has led to critical information retrieval problems, among them how to retrieve the relevant document(s) of interest from such gigantic volumes of textual data. To address this problem, clustering, as a data mining technique, is used to analyze these massive volumes of data, commonly referred to as "Big Data". Without clustering and classification, it is challenging to manage and discover knowledge in a big data environment. However, applying clustering algorithms to big data has been difficult, as these algorithms come with high computational cost and complexity. To make matters worse, the emergence of big data, with all its characteristics including volume, variety, velocity, variability, and complexity, adds further difficulties and pushes research toward every possible way of improving clustering algorithms.

This leads to the question of how to overcome this dilemma and how to apply clustering algorithms to big data while obtaining results in a reasonable time. One possible way to improve clustering performance, i.e., to obtain more accurate results in a reasonable time, is to use a well-designed, time-efficient similarity measure. In fact, the performance of clustering and classification depends heavily on the similarity measure in use, as this work has shown for PDSM and STB-SM. Although both measures are effective, PDSM is time-inefficient, chiefly when used for clustering. Unlike PDSM, STB-SM is time-efficient, making it a promising measure for scalable clustering.

Presently, similarity measures are sought mainly to improve the accuracy of classification and clustering, as well as the efficiency of the techniques they are used with, such as the KNN classifier and the K-means clustering algorithm. Therefore, in this work, we proposed a similarity measure which we believe is capable of handling big data analysis effectively and efficiently. Based on the results obtained for both classification and, in particular, clustering, we believe that our proposed measure (STB-SM) is a promising technique for processing voluminous data in reasonable time with high accuracy. When STB-SM was applied to all features of each dataset for clustering, it produced highly competitive results in a reasonable time compared with the state-of-the-art measures. This means that STB-SM is both significantly effective and highly efficient and would add a valuable contribution to the field of information retrieval (a vital part of big data) in particular and machine learning in general. In fact, while designing STB-SM, our focus has been on a measure that helps scale up and expedite the clustering algorithm without sacrificing result quality. In doing so, the clustering process gains flexibility and provides faster response times at the same time. In other words, with the proposed measure (STB-SM) being both effective and efficient, clustering for big data (including document clustering) can be implemented efficaciously to enhance search speed, precision, recall, search engines, and so on.
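To illustrate the kind of flexibility described above, the snippet below is a minimal, hypothetical sketch (not the implementation used in this work) of a K-means-style loop that accepts any pluggable similarity function; cosine is used only as a placeholder where STB-SM or another measure could be supplied.

```python
import numpy as np

def kmeans_with_similarity(X, k, similarity, n_iter=20, seed=0):
    """K-means-style clustering driven by an arbitrary similarity function."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each document to its most similar centroid.
        sims = np.array([[similarity(x, c) for c in centroids] for x in X])
        labels = sims.argmax(axis=1)
        # Recompute centroids; keep the old centroid if a cluster goes empty.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels

# Placeholder similarity; in practice STB-SM or another measure would be plugged in.
cosine = lambda a, b: (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

X = np.random.default_rng(1).random((100, 50))  # placeholder BoW vectors
labels = kmeans_with_similarity(X, k=4, similarity=cosine)
```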

Conclusions and future work

Using the BoW model, the KNN classifier, and the K-means algorithm, in the context of text classification and clustering, this paper introduced a new similarity measure based on set theory, named STB-SM. Besides STB-SM, a comparative study was thoroughly carried out on seven similarity measures using six classification criteria and five clustering metrics. The obtained results demonstrated that the STB-SM similarity measure achieved almost the best performance on all classification and clustering criteria on both datasets (Reuters-21 and Web-KB). Moreover, to stress the superiority of the proposed measure, it was imperative to utilize more than one performance criterion to effectively assess all similarity measures. In fact, it was difficult to determine which measure was optimal for any dataset and/or evaluation criterion unless they were all evaluated against each other comprehensively, because each dataset displayed different characteristics when classification or clustering was performed on it. Nonetheless, from the obtained results, it can be concluded that STB-SM, PDSM, Cosine, and Jaccard showed superiority over the other measures and obtained the most stable performance trends on both datasets for all K values, compared to the Euclidean, Manhattan, and Kullback–Leibler measures, with Manhattan and Kullback–Leibler noted to have the worst results. On the other extreme, Euclidean and Bhattacharyya had a fluctuating performance that can be classified as middle ground between the high-performance and poor-performance measures.

Additionally, using the K-means clustering algorithm, all similarity measures were involved in a fierce clustering competition. Each similarity measure was individually used to evaluate K-means performance with respect to five evaluation metrics, of which three are external and two are internal. STB-SM, PDSM, and Euclidean were observed to be the top performers in terms of clustering, and STB-SM outperformed Euclidean and PDSM on most of the evaluation metrics. It is worth mentioning that all clustering results were collected and analyzed for the case in which the number of clusters K is taken as the number of actual classes in each dataset (4 and 8). Thus, in follow-up work, to avoid bias and gain a deeper insight into clustering performance, an exhaustive analysis with several K values on different clustering algorithms will be carried out.

All these measures were rigorously examined with regard to their execution time when classification and clustering were run on either dataset. For classification, the results have shown that some measures, such as Bhattacharyya, Manhattan, and Euclidean, achieved the highest speed but at the expense of their overall performance. On the other hand, confirming that the trade-off is unavoidable, PDSM achieved better effectiveness results but again at the expense of its efficiency, as it was the slowest measure. Nevertheless, the STB-SM, Jaccard, and Cosine measures offered a suitable compromise between the fastest measures (Bhattacharyya, Manhattan, and Euclidean) and the slowest measure (PDSM): they were not only faster than PDSM but also closer to the speed of the fastest measures. For clustering, PDSM was likewise the slowest measure and Manhattan the fastest. As a compromise between effectiveness and efficiency for both classification and clustering, our proposed measure STB-SM has shown superiority with regard to clustering as well as classification. Finally, this work briefly described the applicability of STB-SM to big data scenarios. In future work, we plan to broaden the current work to involve more state-of-the-art measures such as those described in [3, 4]. Moreover, the behavior of all these measures will be thoroughly examined on different machine learning tasks such as text summarization [36] and plagiarism detection.