Review and Implementation of Topic Modeling in Hindi, Applied Artificial Intelligence



Review and Implementation of Topic Modeling in Hindi
Applied Artificial Intelligence (IF 2.9), Pub Date: 2019-09-05, DOI: 10.1080/08839514.2019.1661576
Santosh Kumar Ray, Amir Ahmad, Ch. Aswani Kumar


ABSTRACT Due to the widespread use of electronic devices and the growing popularity of social media, text data is being generated at a rate never seen before. It is not possible for humans to read all of the data generated and find what is being discussed in their field of interest. Topic modeling is a technique for identifying the topics present in a large set of text documents. In this paper, we discuss the widely used techniques and tools for topic modeling. There has been a lot of research on topic modeling in English, but little progress in resource-scarce languages like Hindi, despite Hindi being spoken by millions of people across the world. We discuss the challenges faced in developing topic models for Hindi. We apply the Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA) algorithms to topic modeling in Hindi. The outcomes of topic modeling algorithms are usually difficult for the common user to interpret, so we use various visualization techniques to represent them in a meaningful way. We then use metrics such as perplexity and coherence to evaluate the topic models. The results of topic modeling in Hindi seem promising and comparable to some results reported in the literature on English datasets.
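As a rough illustration of one of the three algorithms named in the abstract, the following is a minimal pure-Python sketch of NMF using the standard multiplicative update rules. It is not the authors' implementation: the toy corpus, the whitespace tokenization, the choice of two topics, and every function name (`build_matrix`, `nmf`, `frobenius_error`) are assumptions made for illustration only.

```python
import random

# Hypothetical toy corpus: two cricket documents, two weather documents.
docs = [
    "क्रिकेट मैच टीम गेंद",
    "टीम गेंद मैच जीत",
    "बारिश मौसम बादल तापमान",
    "मौसम तापमान बारिश गर्मी",
]

def build_matrix(docs):
    """Return (vocab, V), where V[i][j] counts term j in document i."""
    # Whitespace splitting is an assumption; it keeps Devanagari vowel
    # signs attached to their words, unlike a naive \w+ regex.
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    V = [[0.0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for w in d.split():
            V[i][index[w]] += 1.0
    return vocab, V

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) by W (docs x k) times H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random initialization.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

def frobenius_error(V, W, H):
    """Frobenius norm of V - W H, i.e. the reconstruction error."""
    WH = matmul(W, H)
    return sum((v - x) ** 2
               for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

vocab, V = build_matrix(docs)
W, H = nmf(V, k=2)

# Three highest-weighted terms per topic, and the dominant topic per document.
top_terms = [
    [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -H[a][j])[:3]]
    for a in range(2)
]
doc_topics = [max(range(2), key=lambda a: W[i][a]) for i in range(len(docs))]

for a, terms in enumerate(top_terms):
    print(f"topic {a}:", " ".join(terms))
print("dominant topic per document:", doc_topics)
```

Each row of `H` can be read as a topic (a weighting over terms) and each row of `W` as a document's mixture over topics. On a real Hindi corpus one would normally add stop-word removal and some form of normalization or stemming before building the matrix; those steps are omitted here.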


Updated: 2019-09-05
