Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations
Organizational Research Methods (IF 8.9), Pub Date: 2020-11-23, DOI: 10.1177/1094428120971683
Louis Hickman 1, Stuti Thapa 1, Louis Tay 1, Mengyang Cao 2, Padmini Srinivasan 3

Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.




Updated: 2020-12-23