Evaluation of clustering and topic modeling methods over health-related tweets and emails
Artificial Intelligence in Medicine (IF 7.5), Pub Date: 2021-05-07, DOI: 10.1016/j.artmed.2021.102096
Juan Antonio Lossio-Ventura, Sergio Gonzales, Juandiego Morzan, Hugo Alatrista-Salas, Tina Hernandez-Boussard, Jiang Bian

Background

The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails often consist of short texts, on which existing topic modeling and clustering approaches perform suboptimally. Moreover, research on health-related short texts using these methods has become difficult to reproduce and benchmark, partly because no detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts is available.

Methods

We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., indices that assess the goodness of a clustering structure without external information) and five external indices (i.e., indices that compare the results of a cluster analysis to externally provided class labels).
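As a rough illustration of what an external index measures, the sketch below computes two well-known ones, purity and the Rand index, on a toy labeling. The abstract does not name the five external indices that were used, so these two are assumptions chosen for illustration only, and the toy labels and cluster ids are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_ids):
    """Fraction of document pairs on which the clustering and the labels agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        agree += same_class == same_cluster
        total += 1
    return agree / total


# Hypothetical toy data: six documents, two known classes, two clusters.
toy_labels = ["hpv", "hpv", "hpv", "hpv", "lynch", "lynch"]
toy_clusters = [0, 0, 0, 1, 1, 1]

print(purity(toy_labels, toy_clusters))
print(rand_index(toy_labels, toy_clusters))
```

Both scores lie in [0, 1], with higher values indicating closer agreement with the known classes. Internal indices, by contrast, would be computed from the document representations alone, without `toy_labels`.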

Results

Overall, for numbers of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, for k = 2, most of the methods reproduced the initial class distribution of the tweets (N = 286,971; HPV accounts for 94.6% of tweets and Lynch syndrome for 5.4%). However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.
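For a single-label setting, the Hamming loss reduces to the fraction of documents that receive the wrong label. The abstract does not say how cluster ids were matched to class labels before computing it, so the majority-vote mapping in this sketch is an assumption, as are the toy data and the function names.

```python
from collections import Counter


def map_clusters_to_classes(true_labels, cluster_ids):
    """Assumed mapping: give each cluster the most frequent true label of its members."""
    counts = {}
    for label, cid in zip(true_labels, cluster_ids):
        counts.setdefault(cid, Counter())[label] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


def hamming_loss(true_labels, predicted_labels):
    """Fraction of documents whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Hypothetical toy data: six documents, two known classes, two clusters.
toy_labels = ["hpv", "hpv", "hpv", "hpv", "lynch", "lynch"]
toy_clusters = [0, 0, 0, 1, 1, 1]

mapping = map_clusters_to_classes(toy_labels, toy_clusters)
predicted = [mapping[cid] for cid in toy_clusters]
print(hamming_loss(toy_labels, predicted))
```

Unlike the validity indices, this is a loss, so lower values are better and 0 means every document was assigned to its true class.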

Conclusions

Researchers hoping to group or classify health-related short-text data need to select the topic modeling and clustering methods best suited to their specific research questions. To support that choice, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggests that researchers can improve their analysis of model performance by using a variety of metrics, since there is no single best metric.




Updated: 2021-05-22