Test Selection for Deep Learning Systems
ACM Transactions on Software Engineering and Methodology (IF 4.4) Pub Date: 2021-01-03, DOI: 10.1145/3417330
Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, Yves Le Traon

Testing of deep learning models is challenging due to the excessive number and complexity of the computations involved. As a result, test data selection is performed manually and in an ad hoc way. This raises the question of how we can automatically select candidate data to test deep learning models. Recent research has focused on defining metrics to measure the thoroughness of a test suite and to rely on such metrics to guide the generation of new tests. However, the problem of selecting/prioritising test inputs (e.g., to be labelled manually by humans) remains open. In this article, we perform an in-depth empirical comparison of a set of test selection metrics based on the notion of model uncertainty (model confidence on specific inputs). Intuitively, the more uncertain we are about a candidate sample, the more likely it is that this sample triggers a misclassification. Similarly, we hypothesise that the samples for which we are the most uncertain are the most informative and should be used in priority to improve the model by retraining. We evaluate these metrics on five models and three widely used image classification problems involving real and artificial (adversarial) data produced by five generation algorithms. We show that uncertainty-based metrics have a strong ability to identify misclassified inputs, being three times stronger than surprise adequacy and outperforming coverage-related metrics. We also show that these metrics lead to faster improvement in classification accuracy during retraining: up to two times faster than random selection and other state-of-the-art metrics on all models we considered.
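
As a rough illustration of the core idea only (not the specific uncertainty metrics evaluated in the paper), the sketch below ranks candidate inputs by the predictive entropy of a model's softmax output and selects a small labelling budget of the most uncertain ones; the probability values, the budget parameter, and the helper names are placeholders invented for this example.

import numpy as np

def predictive_entropy(softmax_probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of class probabilities; higher = more uncertain."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(softmax_probs * np.log(softmax_probs + eps), axis=1)

def select_most_uncertain(softmax_probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` candidate inputs the model is least confident about."""
    scores = predictive_entropy(softmax_probs)
    return np.argsort(scores)[::-1][:budget]

# Dummy softmax outputs for four candidate inputs over three classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.25],   # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain
])
print(select_most_uncertain(probs, budget=2))  # -> [3 1]

In a selection/prioritisation workflow along the lines described above, the returned indices would be the candidates sent to humans for labelling and then added to the training set for retraining.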

Updated: 2021-01-03