Recent Advances in Text Analysis
Annual Review of Statistics and Its Application (IF 7.9), Pub Date: 2023-11-29, DOI: 10.1146/annurev-statistics-040522-022138
Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin, Wanshan Li

Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identified 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of the 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research from 1975 to 2015, from a text analysis perspective.

Expected final online publication date for the Annual Review of Statistics and Its Application, Volume 11 is March 2024. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Updated: 2023-11-29