Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering, Journal of Data and Information Science



Journal of Data and Information Science (IF 1.5), Pub Date: 2021-06-01, DOI: 10.2478/jdis-2021-0024
Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass


Abstract

Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions about establishing scientific fields. It also supports better collaboration with governments and businesses. This study investigates the development of research fields over time by framing it as a topic detection problem.

Design/methodology/approach: We propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches transform the documents into vector-based representations. The proposed method is evaluated on a benchmark dataset by comparing it with combinations of different embedding and clustering approaches and with classical topic modeling algorithms (i.e., LDA). A case study also explores the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in AI-related publications.

Findings: Clustering performance indicators show that the proposed method outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the most recent 30 years, using a keyword extraction method to tag and label clusters and to convey the context of the topics.

Research limitations: We found that no single solution generalizes to all downstream tasks. Solutions therefore need to be fine-tuned or optimized for each task and even each dataset. In addition, the interpretation of cluster labels can be subjective and vary with readers' opinions. Labeling techniques are also very difficult to evaluate, which further limits how well the clusters can be explained.

Practical implications: The case study shows, in a real-world example, how the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. By establishing and explaining the topics, this helps the scientific community and related organizations analyze fields quickly and effectively.

Originality/value: We introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction, and we use a concept extraction method as the labeling approach. The effectiveness of the method is evaluated in a case study of AI publications, in which we analyze AI topics over the past three decades.
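
The abstract names deep embedding clustering over document vectors but does not spell out the mechanics. As a rough illustration only, the sketch below shows the soft-assignment step of the standard deep embedded clustering (DEC) formulation: a Student's t kernel gives each document a soft membership in each cluster, and a sharpened target distribution is used for refinement. It does not reproduce the authors' modified and tuned variant. The toy 2-D vectors stand in for Doc2Vec embeddings, and the function names, centroids, and `alpha` value are all assumptions made for this example.

```python
import math

def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedding i to centroid j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # normalize over clusters
    return q

def target_distribution(q):
    """Sharpened targets: p[i][j] proportional to q[i][j]^2 / f_j, f_j = sum_i q[i][j]."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all documents and clusters (the DEC clustering loss)."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )

# Hypothetical toy data: four "document embeddings" and two initial centroids.
docs = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(docs, centroids)
p = target_distribution(q)
labels = [max(range(len(row)), key=row.__getitem__) for row in q]
loss = kl_divergence(p, q)
print(labels, round(loss, 4))
```

In a full pipeline the loss would be minimized by gradient descent over both the encoder weights and the centroids. This sketch only evaluates it once, to show how the hard cluster labels and the training signal are derived from the same soft assignments.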


Updated: 2021-06-01
