Data Evaluation and Enhancement for Quality Improvement of Machine Learning
IEEE Transactions on Reliability (IF 5.0), Pub Date: 2021-04-28, DOI: 10.1109/tr.2021.3070863
Haihua Chen, Jiangping Chen, Junhua Ding

Poor data quality has a direct impact on the performance of the machine learning system that is built on the data. As a demonstrated effective approach for data quality improvement, transfer learning has been widely used to improve machine learning quality. However, the “quality improvement” brought by transfer learning was rarely rigorously validated, and some of the quality improvement results were misleading. This article first exposed the hidden quality problem in the datasets used to build a machine learning system for normalizing medical concepts in social media text. The system was claimed to have achieved the best performance compared to existing work on a machine learning task. However, the results of our experiments showed that the “best performance” was due to the poor quality of the datasets and the defective validation process. To address the data quality issue and build a high-performance medical concept normalization system, we developed a transfer-learning-based strategy for data quality enhancement and system performance improvement. The results of the experiments showed a strong correlation between the quality of the datasets and the performance of the machine learning system. The results also demonstrated that a rigorous evaluation of data quality is necessary for guiding the quality improvement of machine learning. Therefore, we propose a data quality evaluation framework that includes the quality criteria and their corresponding evaluation approaches. The data validation process, the performance improvement strategy, and the data quality evaluation framework discussed in this article can be used by machine learning researchers and practitioners to build high-performance machine learning systems. The code and datasets used in this research are available on GitHub (https://github.com/haihua0913/dataEvaluationML).
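
The abstract does not list the quality criteria in its framework, so the following is only a minimal sketch of two checks that a data quality evaluation might plausibly include: the rate of duplicate samples within a split, and the share of test samples that also appear in the training split. Both criteria, the function names `normalize`, `duplicate_rate`, and `overlap_rate`, and the toy phrases are assumptions made for illustration. They are not taken from the article or its GitHub repository.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat another sample in the same split."""
    if not samples:
        return 0.0
    counts = Counter(normalize(s) for s in samples)
    # Each distinct text contributes (occurrences - 1) redundant copies.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(samples)


def overlap_rate(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalized text also occurs in train."""
    if not test:
        return 0.0
    seen = {normalize(s) for s in train}
    return sum(normalize(s) in seen for s in test) / len(test)


if __name__ == "__main__":
    # Toy phrases invented for illustration; not drawn from any real dataset.
    train = ["head ache", "Head  ache", "cannot sleep", "feeling dizzy"]
    test = ["head ache", "stomach pain"]
    print(f"train duplicate rate: {duplicate_rate(train):.2f}")
    print(f"test/train overlap:   {overlap_rate(train, test):.2f}")
```

A high overlap rate would mean that a reported test score partly measures memorization of training samples, which is one way a validation process can overstate performance.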

Updated: 2021-04-28