Comparative study of term-weighting schemes for environmental big data using machine learning
Environmental Modelling & Software (IF 4.9), Pub Date: 2022-09-23, DOI: 10.1016/j.envsoft.2022.105536
JungJin Kim, Han-Ul Kim, Jan Adamowski, Shadi Hatami, Hanseok Jeong

Widely used term-weighting schemes and machine learning (ML) classifiers, run with default parameter settings, were assessed for their performance in environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF–inverse document frequency (TF-IDF), Best Match 25 (BM25), TF–inverse gravity moment (TF-IGM), and TF–IDF–inverse class frequency (TF-IDF-ICF)] and five ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The best-performing term-weighting scheme and classifier were TF-IDF-ICF and LR, respectively. According to the evaluation criteria, their combination outperformed all other scheme and classifier combinations in the analysis of the full environmental data. Category classification performance differed by environmental section (climate, air, water, or waste/garbage), with the best performance achieved for climate and the poorest for water. These results demonstrate the importance of selecting appropriate term-weighting schemes and ML classifiers when analyzing human-generated environmental big data.
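
To make the difference between three of the schemes named above concrete, the following is a minimal sketch that computes TF, TF-IDF, and TF-IDF-ICF weights for a toy labelled corpus. The corpus, the function name `term_weights`, and the exact formulas are assumptions for illustration: the abstract does not state which variants the study used, so the sketch takes one common smoothed form, IDF = 1 + ln(N / df) and ICF = 1 + ln(C / cf), where N is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes in which the term occurs.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical; not data from the study).
DOCS = [
    ("climate", "warming climate carbon emissions"),
    ("climate", "climate change carbon policy"),
    ("air", "air pollution dust emissions"),
    ("water", "river water pollution"),
]


def term_weights(docs):
    """Return a list of (label, {term: weights}) for each document.

    Assumed formulas (one common variant, not confirmed by the source):
      idf = 1 + ln(N / df)   N = documents, df = documents containing term
      icf = 1 + ln(C / cf)   C = classes,   cf = classes containing term
    """
    tokenized = [(label, text.split()) for label, text in docs]
    n_docs = len(tokenized)
    n_classes = len({label for label, _ in tokenized})

    doc_freq = Counter()   # term -> number of documents containing it
    term_classes = {}      # term -> set of class labels containing it
    for label, tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    rows = []
    for label, tokens in tokenized:
        row = {}
        for term, count in Counter(tokens).items():
            idf = 1 + math.log(n_docs / doc_freq[term])
            icf = 1 + math.log(n_classes / len(term_classes[term]))
            row[term] = {
                "tf": count,
                "tf_idf": count * idf,
                "tf_idf_icf": count * idf * icf,
            }
        rows.append((label, row))
    return rows


if __name__ == "__main__":
    label, weights = term_weights(DOCS)[0]
    for term, w in sorted(weights.items()):
        print(label, term, {k: round(v, 3) for k, v in w.items()})
```

In this toy corpus, "climate" and "emissions" each occur in two documents, so they receive the same TF-IDF weight in the first document. Under the assumed ICF factor, "climate" is weighted higher because it appears in only one class, whereas "emissions" is spread across two. This illustrates how a class-aware scheme can separate terms that a document-level scheme treats identically.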




Updated: 2022-09-23