Data Evaluation and Enhancement for Quality Improvement of Machine Learning
IEEE Transactions on Reliability (IF 5.9), Pub Date: 2021-04-28, DOI: 10.1109/tr.2021.3070863
Haihua Chen, Jiangping Chen, Junhua Ding

Poor data quality has a direct impact on the performance of the machine learning system that is built on the data. Transfer learning, a demonstrably effective approach to data quality improvement, has been widely used to improve machine learning quality. However, the “quality improvement” brought by transfer learning has rarely been rigorously validated, and some of the reported quality improvement results are misleading. This article first exposes the hidden quality problems in the datasets used to build a machine learning system for normalizing medical concepts in social media text. The system was claimed to have achieved the best performance among existing work on that task. However, the results of our experiments show that the “best performance” was due to the poor quality of the datasets and a defective validation process. To address the data quality issue and build a high-performance medical concept normalization system, we developed a transfer-learning-based strategy for data quality enhancement and system performance improvement. The experimental results show a strong correlation between the quality of the datasets and the performance of the machine learning system. They also demonstrate that a rigorous evaluation of data quality is necessary for guiding the quality improvement of machine learning. Therefore, we propose a data quality evaluation framework that includes the quality criteria and their corresponding evaluation approaches. The data validation process, the performance improvement strategy, and the data quality evaluation framework discussed in this article can be used by machine learning researchers and practitioners to build high-performance machine learning systems. The code and datasets used in this research are available on GitHub ( https://github.com/haihua0913/dataEvaluationML ).

Updated: 2021-06-11