An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining
Entropy (IF 2.1) Pub Date: 2021-04-29, DOI: 10.3390/e23050553
Salim Miloudi, Yulin Wang, Wenjia Ding

Clustering algorithms for multi-database mining (MDM) rely on computing (n^2 - n)/2 pairwise similarities between n multiple databases to generate and evaluate m ∈ [1, (n^2 - n)/2] candidate clusterings in order to select the ideal partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when choosing which database pairs are eligible to be grouped together. Consequently, a trivial result is produced by putting all n databases in one cluster or by returning n singleton clusters. To tackle the latter problem, we propose a learning algorithm that reduces the fuzziness of the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model improves the certainty of the clustering algorithm by correctly identifying the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm that assesses the candidate clusterings generated on the fly in fewer, upper-bounded iterations. To achieve our goal, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the n multiple databases in a way that minimizes a convex clustering-quality measure L(θ) in fewer than (n^2 - n)/2 iterations. By using a max-heap data structure within our CD algorithm, we optimally choose the largest weight variable θ_{p,q}^{(i)} at each iteration i, such that taking the partial derivative of L(θ) with respect to θ_{p,q}^{(i)} allows us to attain the next steepest descent minimizing L(θ) without using a learning rate. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.
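The similarity-sharpening idea from the abstract can be sketched in code. This is a minimal illustration, not the paper's exact model: it assumes a one-parameter logistic model sigmoid(w·s + b) over the pairwise similarities, pseudo-labels obtained by thresholding at the mean similarity, and class weights forming the weighted binary entropy loss; the function name `sharpen` and all parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sharpen(sim, lr=0.5, epochs=1000):
    """Sharpen a vector of (n^2 - n)/2 pairwise similarities.

    Hypothetical sketch: fit p = sigmoid(w*sim + b) by gradient descent on a
    weighted binary cross-entropy loss, so similarities clustered around the
    mean are pushed toward 0 or 1.
    """
    # Pseudo-labels: pairs at or above the mean similarity count as "similar".
    y = (sim >= sim.mean()).astype(float)
    # Class weights to balance positive and negative pairs in the loss.
    w_pos = 0.5 / max(y.mean(), 1e-9)
    w_neg = 0.5 / max(1.0 - y.mean(), 1e-9)
    weights = np.where(y == 1, w_pos, w_neg)
    w, b = 1.0, 0.0                      # model parameters
    for _ in range(epochs):
        p = sigmoid(w * sim + b)
        grad = weights * (p - y)         # d(loss)/d(logit), per pair
        w -= lr * np.mean(grad * sim)    # back-propagate to w
        b -= lr * np.mean(grad)          # back-propagate to b
    return sigmoid(w * sim + b)          # sharpened similarities in (0, 1)
```

Similarities that sit near the mean get mapped toward 0 or 1, so the downstream clustering step faces a crisper eligibility decision instead of an indecisive, near-uniform similarity matrix.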

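The learning-rate-free coordinate-descent step described in the abstract can be illustrated with a generic sketch. This is not the paper's L(θ): it assumes a convex quadratic objective L(θ) = ½θᵀAθ − bᵀθ, for which the exact one-dimensional minimizer along the chosen coordinate has a closed form (this exact line step is what removes the learning rate), and it uses a max-heap (Python's `heapq` with negated keys and lazy deletion) to select the coordinate with the largest gradient magnitude each iteration. The function `greedy_cd` and its arguments are hypothetical.

```python
import heapq
import numpy as np

def greedy_cd(A, b, max_iters):
    """Greedy (Gauss-Southwell) coordinate descent on 0.5*x'Ax - b'x.

    Illustrative sketch: a max-heap picks the coordinate with the largest
    gradient magnitude; the exact quadratic line minimizer replaces a
    hand-tuned learning rate.
    """
    n = len(b)
    theta = np.zeros(n)
    grad = A @ theta - b                 # gradient of L at theta
    version = np.zeros(n, dtype=int)     # lazy-deletion bookkeeping
    heap = [(-abs(grad[j]), j, 0) for j in range(n)]
    heapq.heapify(heap)
    for _ in range(max_iters):
        # Pop until an up-to-date entry surfaces (stale ones are skipped).
        while heap:
            neg_g, j, v = heapq.heappop(heap)
            if v == version[j]:
                break
        if -neg_g < 1e-10:
            break                        # gradient ~ 0: converged
        # Exact minimization along coordinate j (closed form for a quadratic).
        step = grad[j] / A[j, j]
        theta[j] -= step
        grad -= step * A[:, j]           # rank-1 gradient update
        for k in range(n):               # re-push refreshed heap entries
            version[k] += 1
            heapq.heappush(heap, (-abs(grad[k]), k, version[k]))
    return theta
```

Re-pushing every coordinate each iteration is the simplest correct bookkeeping for a dense A; with a sparse objective, only the coordinates whose gradients actually changed would be re-pushed, which is where the heap pays off.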
Updated: 2021-04-29