Topic Modeling on User Stories using Word Mover's Distance



arXiv - CS - Information Retrieval. Pub Date: 2020-07-10, DOI: arxiv-2007.05302
Kim Julian Gülle, Nicholas Ford, Patrick Ebel, Florian Brokhausen, Andreas Vogelsang

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that the combination of word embeddings and Word Mover's Distance is the most promising. Depending on the word embeddings we use, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that offers new insights into the dataset, e.g., by revealing potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.
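To make the third approach more concrete, the sketch below computes a distance between two tokenized user stories from word vectors. It is not the authors' pipeline: it implements the relaxed Word Mover's Distance, a cheaper lower bound on the exact Word Mover's Distance in which each word simply moves all of its weight to the nearest word in the other story. The two-dimensional vectors in `TOY_EMBEDDINGS`, the example stories, and all function names are assumptions made up for illustration; a real setup would use pretrained embeddings and an exact optimal-transport solver.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional word vectors, invented for this illustration.
# Real word embeddings have hundreds of dimensions and come from a trained model.
TOY_EMBEDDINGS = {
    "login": (0.9, 0.1),
    "password": (1.0, 0.2),
    "account": (0.8, 0.3),
    "light": (0.1, 0.9),
    "lamp": (0.2, 1.0),
    "bedroom": (0.3, 0.8),
}


def normalized_bow(tokens):
    """Map tokens to relative frequencies, skipping words without a vector."""
    counts = Counter(t for t in tokens if t in TOY_EMBEDDINGS)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def one_sided_cost(source, target):
    """Cost of sending each source word's whole weight to its nearest target word."""
    return sum(
        weight
        * min(math.dist(TOY_EMBEDDINGS[word], TOY_EMBEDDINGS[other]) for other in target)
        for word, weight in source.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed Word Mover's Distance: the larger of the two one-sided costs.

    This is a lower bound on the exact Word Mover's Distance, not the exact value.
    """
    bow_a = normalized_bow(tokens_a)
    bow_b = normalized_bow(tokens_b)
    if not bow_a or not bow_b:
        raise ValueError("each story needs at least one word with a known vector")
    return max(one_sided_cost(bow_a, bow_b), one_sided_cost(bow_b, bow_a))


# Hypothetical, already tokenized user stories.
story_login = ["login", "password"]
story_account = ["account", "password"]
story_lamp = ["lamp", "bedroom"]

print(relaxed_wmd(story_login, story_account))
print(relaxed_wmd(story_login, story_lamp))
```

With these toy vectors, the two account-related stories end up closer to each other than either is to the lamp story. Pairwise distances of this kind can then be passed to a clustering algorithm that accepts a precomputed distance matrix.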


Updated: 2020-07-14
