An integrated retrieval framework for similar questions: Word-semantic embedded label clustering – LDA with question life cycle, Information Sciences


Information Sciences, Pub Date: 2020-05-26, DOI: 10.1016/j.ins.2020.05.014
Yue Liu, Aihua Tang, Zhibin Sun, Weize Tang, Fei Cai, Chengjin Wang

Question retrieval is an important research area in Community Question Answering (CQA). Most existing question retrieval methods rely on semantic analysis of questions, and their effectiveness suffers from the short length of the questions and the noise words in the question corpus. In addition, to recommend questions that carry more advanced knowledge, retrieval should take the popularity of questions into account. To retrieve questions that are both semantically similar and popular, we propose an integrated retrieval framework for similar questions named Word-semantic Embedded Label Clustering – LDA with Question Life Cycle (WELQLC-QR), which consists of Word-semantic Embedded Label Clustering – LDA (WEL) and the Question Life Cycle Optimization Similar Question List Strategy (QLC). First, WEL performs question retrieval from the perspective of semantic matching. It overcomes the over-generalization of the semantic information that topic models extract from short questions with multi-level labels, and it avoids the influence of noise vocabulary during semantic extraction. Then, based on internal factors of a question (i.e., the number of comments and answers it has received) and external factors (i.e., programming language ranking information), QLC builds a popularity prediction model to optimize the similar question set retrieved by WEL, so that the final retrieval results are both semantically similar and popular. Finally, experiments on the CQADupStack dataset show that the MRR@N of WELQLC-QR improves on average by 8.99%, 8.3%, 4.74% and 3.56% over L-LDA, LC-LDA, BM25 and Word2vec, respectively.
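
For readers unfamiliar with the reported metric, the following is a minimal sketch of how MRR@N (mean reciprocal rank over the top N results) is commonly computed. It follows the standard definition, not necessarily the exact evaluation script used in the paper; the function name `mrr_at_n`, the question IDs, and the toy data are all illustrative assumptions.

```python
from typing import Hashable, Sequence, Set


def mrr_at_n(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    n: int,
) -> float:
    """Mean reciprocal rank, looking only at the top-n results of each query.

    rankings[i] is the ranked list of retrieved question IDs for query i;
    relevant[i] is the set of IDs judged similar (e.g. duplicates) to query i.
    A query with no relevant item in its top n contributes 0.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        # Reciprocal rank of the first relevant hit within the cutoff.
        for rank, qid in enumerate(ranked[:n], start=1):
            if qid in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data: three queries with three retrieved questions each.
rankings = [["q7", "q3", "q9"], ["q2", "q5", "q1"], ["q4", "q8", "q6"]]
relevant = [{"q3"}, {"q2"}, {"q0"}]
score = mrr_at_n(rankings, relevant, n=3)
```

In this toy case the first relevant hits sit at ranks 2 and 1, and the third query has no hit, so the score is the mean of 1/2, 1 and 0. A re-ranking step such as QLC can change this value only by moving relevant questions up or down within the top N.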


Updated: 2020-05-26
