To what extent do DNN-based image classification models make unreliable inferences?
Empirical Software Engineering (IF 3.5) | Pub Date: 2021-06-18 | DOI: 10.1007/s10664-021-09985-1
Yongqiang Tian , Shiqing Ma , Ming Wen , Yepang Liu , Shing-Chi Cheung , Xiangyu Zhang

Deep Neural Network (DNN) models are widely used for image classification. While they offer high accuracy, researchers are concerned about whether these models inappropriately make inferences using features irrelevant to the target object in a given image. To address this concern, we propose a metamorphic testing approach that assesses whether a given inference is made based on irrelevant features. Specifically, we propose two metamorphic relations (MRs) to detect such unreliable inferences. These relations expect that (a) after the object-relevant features of an image are corrupted, the model outputs a different label, or the same label with lower confidence, and (b) after object-irrelevant features are corrupted, the model outputs the same label. Inferences that violate either metamorphic relation are regarded as unreliable. Our evaluation demonstrated that our approach can effectively identify unreliable inferences for single-label classification models, with an average precision of 64.1% and 96.4% for the two MRs, respectively. For multi-label classification models, the corresponding precision for MR-1 and MR-2 is 78.2% and 86.5%, respectively. Further, we conducted an empirical study to understand the problem of unreliable inferences in practice. Specifically, we applied our approach to 18 pre-trained single-label image classification models and 3 multi-label classification models, and then examined their inferences on the ImageNet and COCO datasets. We found that unreliable inferences are pervasive: for each model, thousands of correct classifications are actually made using irrelevant features. Next, we investigated the effect of such pervasive unreliable inferences and found that they can significantly degrade a model's overall accuracy; once these unreliable inferences in the test set are taken into account, the model's measured accuracy can change substantially. We therefore recommend that developers pay more attention to unreliable inferences during model evaluation. We also explored the correlation between model accuracy and the number of unreliable inferences, and found that inferences on inputs with smaller objects are more likely to be unreliable. Lastly, we found that current model training methodologies can guide models to learn object-relevant features to a certain extent, but may not necessarily prevent them from making unreliable inferences. We encourage the community to propose more effective training methodologies to address this issue.
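
For illustration, the following is a minimal sketch of how such an MR-based check might look for a single-label classifier. It is not the authors' implementation: the `predict` callable, the pixel-level `object_mask`, and the noise-based `corrupt` helper are assumptions introduced here only to make the two relations concrete.

import numpy as np

def corrupt(image, mask):
    """Replace the masked pixels with Gaussian noise (one simple corruption choice;
    the paper's actual corruption operators may differ)."""
    noisy = image.copy()
    noise = np.random.normal(loc=image.mean(), scale=image.std(), size=image.shape)
    noisy[mask] = noise[mask]
    return noisy

def check_inference(predict, image, object_mask):
    """predict: callable returning a probability vector for one image.
    object_mask: boolean array marking the object-relevant pixels."""
    probs = predict(image)
    label, conf = int(np.argmax(probs)), float(probs[int(np.argmax(probs))])

    # MR-1: after corrupting object-relevant pixels, the label should change
    # or the confidence in the original label should drop.
    probs_rel = predict(corrupt(image, object_mask))
    mr1_violated = int(np.argmax(probs_rel)) == label and probs_rel[label] >= conf

    # MR-2: after corrupting object-irrelevant pixels, the label should stay the same.
    probs_irr = predict(corrupt(image, ~object_mask))
    mr2_violated = int(np.argmax(probs_irr)) != label

    # An inference violating either relation is flagged as unreliable.
    return {"MR-1 violated": mr1_violated, "MR-2 violated": mr2_violated}

In this sketch, the relevant/irrelevant regions are taken from a ground-truth object mask (e.g., a bounding box converted to a boolean array), and a violation of either relation marks the inference as unreliable.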



Updated: 2021-06-18