Improving Semantic Coherence of Gujarati Text Topic Model Using Inflectional Forms Reduction and Single-letter Words Removal
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8). Pub Date: 2021-03-10. DOI: 10.1145/3447760
Uttam Chauhan, Apurva Shah

A topic model is one of the best stochastic models for summarizing an extensive collection of text. It has achieved remarkable success in text analysis as well as text summarization. It can be applied to a set of documents represented as a bag of words, without considering grammar or word order. We modeled topics for a corpus of Gujarati news articles. Because Gujarati has a diverse morphological structure and is inflectionally rich, processing Gujarati text is comparatively complex. Vocabulary size plays an important role in the inference process and in topic quality: as the vocabulary grows, inference slows down and topic semantic coherence decreases, whereas shrinking the vocabulary accelerates topic inference and may also improve topic quality. In this work, a list of suffixes that occur very frequently in Gujarati words was prepared, and inflectional forms were reduced to their root words with respect to the suffixes in this list. In addition, Gujarati single-letter words were removed for faster inference and better topic quality. Experiments show that reducing inflectional forms to root words shrinks the vocabulary to a significant extent and makes the topic formation process quicker. Moreover, inflectional forms reduction and single-letter word removal enhanced the interpretability of topics, which was assessed in terms of semantic coherence, word length, and topic size. The experimental results showed an improvement in the topical semantic coherence score, and topic size grew notably as the number of tokens assigned to the topics increased.
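
The preprocessing described above is straightforward to sketch. The following Python snippet illustrates suffix-based inflectional-form reduction and single-letter word removal applied before building a bag-of-words vocabulary; the suffix list, the minimum-stem-length threshold, and the sample tokens are illustrative assumptions for this sketch, not the authors' actual resources.

```python
# A minimal sketch (not the authors' implementation) of the preprocessing above:
# strip frequent Gujarati inflectional suffixes to recover root words and drop
# single-letter words before building the bag-of-words vocabulary.

# Hypothetical list of frequent Gujarati suffixes (case/postposition markers often
# written attached to the word); tried longest-first so longer matches win.
SUFFIXES = sorted(["માં", "થી", "ના", "ની", "નો", "ને", "ઓ", "એ"], key=len, reverse=True)

MIN_STEM_LENGTH = 2  # do not strip a suffix if the remaining stem would be too short


def reduce_inflection(word: str) -> str:
    """Return the root obtained by stripping the first matching suffix, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def preprocess(document: str) -> list[str]:
    """Whitespace-tokenize, reduce inflectional forms, and remove single-letter words."""
    reduced = (reduce_inflection(token) for token in document.split())
    return [token for token in reduced if len(token) > 1]


if __name__ == "__main__":
    # Illustrative input: inflected forms of one root plus a single-letter word.
    doc = "શાળા શાળામાં શાળાના છે અ"
    print(preprocess(doc))
    # The inflected forms collapse onto a single root and the single-letter word
    # is dropped, so the bag-of-words vocabulary for this document shrinks.
```

Trying longer suffixes first avoids stripping a short suffix that is merely the tail of a longer one, and the minimum-stem-length guard keeps very short words from being reduced to unusable fragments.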

Updated: 2021-03-10