1 Introduction

Object detection and identification in digital RGB images is a widely discussed problem in the literature. There are many projects containing ready-to-use modules with pre-trained neural networks that perform object recognition [1]. However, X-ray images differ from photographic images. An X-ray image of the contents of a package may help to determine whether there is any dangerous object inside and to avoid a potentially threatening situation. Unfortunately, recognizing objects by their shape is often insufficient in the case of X-ray images because many objects have an undefined shape or, to put it more precisely, can take very different shapes, such as liquids, powdery substances and fabrics. With only shape information, it is often impossible to identify objects that are obscured by other objects. Moreover, in X-ray scans, information about the material from which these objects are made is helpful. The main problem in discriminating the material of a given object from only a single projection in an X-ray image is to determine its thickness, density and composition. Two very different materials (e.g., steel and water) can give identical readings on the X-ray detectors if they have different densities and/or thicknesses. For this reason, multi-energy techniques are used that make such a distinction possible for a single material. Dual-energy X-ray imaging (DEXA) is one such well-known technique [2]; it requires two measurements at different energies. However, because raw X-ray images are not always easy to analyze and interpret, image processing methods such as object detection, frequency resolution enhancement or pseudocoloring are used [3, 4]. Overall, the problem of material discrimination has not been well investigated by the computer vision community because, in the domain of luggage inspection, a significant part of the work is focused on object detection.
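
The ambiguity between thickness, density and composition can be illustrated with the Beer-Lambert attenuation law. The following is only a sketch under simplifying assumptions (a monoenergetic beam and a single homogeneous material, which real scanners only approximate); the symbols are illustrative notation and are not taken from the cited works:

\[
I(E) = I_0(E)\, e^{-\mu_m(E)\,\rho\,t}
\quad\Longrightarrow\quad
-\ln\frac{I(E)}{I_0(E)} = \mu_m(E)\,\rho\,t ,
\]

where \(I_0\) is the unattenuated intensity, \(\mu_m\) the mass attenuation coefficient, \(\rho\) the density and \(t\) the thickness. A single-energy reading determines only the product \(\mu_m \rho t\), so two materials with \(\mu_{m,1}\rho_1 t_1 = \mu_{m,2}\rho_2 t_2\) cannot be told apart. With a low energy \(E_L\) and a high energy \(E_H\), the ratio

\[
R = \frac{\ln\left(I(E_L)/I_0(E_L)\right)}{\ln\left(I(E_H)/I_0(E_H)\right)} = \frac{\mu_m(E_L)}{\mu_m(E_H)}
\]

no longer depends on \(\rho\,t\) and, under these assumptions, reflects the material alone.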

In this paper, we propose a method that classifies materials in DEXA scans into six main types. Our aim is to employ a CNN approach for the entire feature extraction, representation and classification process. More specifically, we optimize the CNN structure and fine-tune the convolutional, fully connected and other layers of the feature-to-classification pipeline within this problem domain. We perform experiments that illustrate the effect of various architectural decisions (i.e., regularization methods, number of layers and convolution filters) on the ability of the model to generalize. The proposed method produces a per-pixel probability map that assigns each point of the X-ray scan to one of six classes of materials: background, light organic, heavy organic, light metals, heavy metals and impenetrable, giving a form of initial segmentation. We believe that a complete system for identifying and classifying objects in X-ray scans should primarily use information about the material and, secondly, information about shape, especially in the case of contraband organic materials such as cigarettes, drugs, powders, explosives or liquids. Such materials do not have a specific shape, so shape-based detection algorithms often fail.

This paper is organized as follows. Section 2 discusses the related work and summarizes the pros and cons of each approach. Section 3 gives the necessary details of the proposed method. Section 4 presents and analyzes the results of the experiments, while in Section 5 the results are compared to various popular CNN architectures designed to recognize visual patterns. Finally, Section 6 presents our conclusions and planned future work.

2 Related works

Table 1 A summary of the literature on X-ray security imaging in terms of the task and methods used

Dual-energy X-ray imaging requires two measurements at different energies providing two images based on different levels of radiation absorption of different materials. A common approach is to visualize these images using a linear color map (LCM). There are four or six main colors used widely in X-ray scanners to label material classes (see Fig. 1).

Fig. 1 Material pseudocolors and their classes widely used in X-ray security scanners (color figure online)

More advanced methods allow for assigning scanned objects or their parts to certain materials on the basis of the mass attenuation coefficient [20]. This coefficient depends on the material’s atomic number. In theory, it should allow us to classify an object on the basis of its atomic number, but our tests showed that an object’s thickness has a major impact on the classification. As discussed in [26], the unambiguous definition of a material class for a composite of more than three substances is unachievable in the case of a fixed X-ray tube system. For this reason, several approximate hand-crafted methods have been proposed in [5,6,7] (see Table 1 for details). However, such classic material discrimination methods for dual-energy X-ray scans do not cope well with determining the type of material when many layers of different materials overlap at a given point. In the literature, several approaches have used classical machine learning methods for the automated inspection of X-ray images of airport baggage [9, 10, 14] and cargo [17, 18], object/threat detection [15, 16, 18, 27], sub-component level segmentation strategies for supervised anomaly detection [13, 15, 17], or material identification [12]. Deep learning techniques allow real-time and accurate detection of prohibited items even in cluttered X-ray images, although very often only for already segmented or colored images. In our previous work, we examined several machine learning techniques, such as SVM and Random Forest, for material prediction in X-ray scans [11], where the obtained classification results can be used for initial image segmentation. It should be added that there are many works in which machine learning methods have been used for material recognition in conventional images [21, 22, 28, 29]. The authors of [19, 30] showed that CNNs could also be used for the segmentation of large materials obtained using X-ray computed tomography.

The presented literature, summarized in Table 1, shows that relatively few deep learning methods have been used to classify materials in X-ray images. This type of problem is more popular in the domain of traditional images, while most X-ray, dual- or multi-energy X-ray and CT solutions focus on the detection of objects, threats and anomalies, or on image classification. However, such algorithms do not work well for objects of undefined shape, such as powders or liquids, so the methods proposed in related works do not solve the problem.

3 Proposed method

3.1 Proposed CNN architecture

The X-ray images processed by the proposed classification system are two-channel images (low and high energy) with 16-bit integer precision. We want to classify the materials into six groups: background, light organic, heavy organic, light metals, heavy metals and impenetrable. A single training sample is a set of patches with different sizes for the specific type of material. The patches have the following sizes: \(3 \times 3\), \(5 \times 5\), \(7 \times 7\), \(9 \times 9\) and \(15 \times 15\). As attributes of materials are not defined semantically, we annotate every set of training patches with the appropriate label.
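
As an illustration of how such a training sample could be assembled, the sketch below cuts one patch per size from a two-channel scan. It is not the pipeline used to build the dataset: the function name, the synthetic scan, the assumption that all patches are centred on the same pixel, and the reflection padding at the borders are all hypothetical choices made for the example.

```python
import numpy as np

# Patch sizes listed in the text; all are odd, so each patch has a central pixel.
PATCH_SIZES = (3, 5, 7, 9, 15)


def extract_multiscale_patches(scan, row, col, sizes=PATCH_SIZES):
    """Return one patch per size, all centred on pixel (row, col).

    `scan` is assumed to have shape (height, width, 2), holding the
    low- and high-energy channels as 16-bit unsigned integers.
    """
    margin = max(sizes) // 2
    # Assumption: image borders are handled by reflection padding.
    padded = np.pad(scan, ((margin, margin), (margin, margin), (0, 0)), mode="reflect")
    r, c = row + margin, col + margin
    patches = []
    for size in sizes:
        half = size // 2
        patches.append(padded[r - half:r + half + 1, c - half:c + half + 1, :])
    return patches


# Synthetic scan standing in for real DEXA data.
rng = np.random.default_rng(0)
scan = rng.integers(0, 2**16, size=(64, 48, 2), dtype=np.uint16)
patches = extract_multiscale_patches(scan, row=0, col=47)
shapes = [p.shape for p in patches]
```

Because every patch shares the same centre, the label of the set can be taken from the material at that pixel.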

It is important to note that the statistical properties of the input data can vary widely between patch sizes, which makes it difficult for a single, sequential model to directly encode such data (i.e., by simply concatenating the data and then applying a single-channel model). To overcome this difficulty, we need a multi-scale model that is better able to model multi-channel, multi-scale input data and fuse them together to generate high-level features. Inspired by deep convolutional neural networks (CNN) [25] and multi-scale architectures [22, 30], we propose our version of a multi-scale network with five inputs, each fed with a different patch size, that produces a final material class at the output.

Figure 2 shows the schema of the proposed multi-scale convolutional neural network, which consists of five subnetworks, each with a different structure depending on the size of the input patch. Following [25], to increase the performance of our convolutional neural network, we adapted the subnetworks of our model to the resolution of the received patches. As the patch resolution increases, the subnetwork becomes deeper and wider. The output of each subnetwork is a feature vector. The feature vectors from all five subnetworks are concatenated and passed to two sequential fully connected (FC) layers followed by the softmax layer. This enables training CNNs based on multiple input scales.
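
The fusion step can be sketched in a few lines of NumPy. This is a minimal illustration, not the trained model: the feature-vector lengths, the numbers of hidden units and the random weights are hypothetical, and treating the softmax layer as a separate output layer after the two FC layers is one possible reading of the description above.

```python
import numpy as np


def elu(x, alpha=1.0):
    """Exponential linear unit applied element-wise."""
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def softmax(z):
    """Numerically stable softmax over the last axis."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fusion_head(features, params):
    """Concatenate per-scale feature vectors, apply two FC layers, then softmax."""
    (w1, b1), (w2, b2), (w_out, b_out) = params
    x = np.concatenate(features, axis=-1)
    x = elu(x @ w1 + b1)
    x = elu(x @ w2 + b2)
    return softmax(x @ w_out + b_out)


# Hypothetical sizes, chosen only for illustration.
FEATURE_SIZES = (16, 32, 64, 64, 128)  # one feature vector per patch size
HIDDEN_UNITS = (128, 64)
NUM_CLASSES = 6                        # the six material groups named in the text
BATCH = 4

rng = np.random.default_rng(0)


def dense_params(n_in, n_out):
    """Xavier (Glorot) uniform weights and zero biases for one FC layer."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out)), np.zeros(n_out)


# Random vectors standing in for the outputs of the five subnetworks.
features = [rng.normal(size=(BATCH, n)) for n in FEATURE_SIZES]
params = [
    dense_params(sum(FEATURE_SIZES), HIDDEN_UNITS[0]),
    dense_params(HIDDEN_UNITS[0], HIDDEN_UNITS[1]),
    dense_params(HIDDEN_UNITS[1], NUM_CLASSES),
]
probs = fusion_head(features, params)  # shape (BATCH, NUM_CLASSES)
```

Each row of `probs` is a probability distribution over the material classes for one sample.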

Fig. 2 Schema of the proposed multi-input CNN (color figure online)

Specifically, we regard the outputs from the last two layers of the CNN as the learned high-level appearance features of multiple input patches. It is essential to extract image features in a precise way. Although the architecture of our CNN is not very complex, it allows for a good extraction of features. This is the result of a hierarchical arrangement of successive layers of the subnetwork. Moreover, we used the exponential linear unit (ELU) proposed in [31] as the activation function. That work reported that the ELU outperformed the compared variants of the rectified linear unit (ReLU), resulting in a shorter learning time and better neural network performance on the test set.
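
For reference, the two activation functions can be written as follows. The value of \(\alpha\) is not stated in the text; \(\alpha = 1\) is a common default and is only an assumption here:

\[
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ELU}(x) =
\begin{cases}
x, & x > 0,\\
\alpha\left(e^{x} - 1\right), & x \le 0.
\end{cases}
\]

Unlike ReLU, the ELU takes negative values that saturate at \(-\alpha\), which keeps the mean activation closer to zero.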

3.2 Training details

Our CNN classifier has been trained on input data based on low (LE) and high (HE) energy X-ray readings. Input data consisting of two X-ray energies compose the following three-channel image: (1) HE, (2) LE and (3) filled with zeros.
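
A minimal sketch of this channel composition is given below. The function name and the random input patches are hypothetical and serve only to show the channel order described above.

```python
import numpy as np


def compose_three_channel(he, le):
    """Stack HE, LE and an all-zero plane into an (H, W, 3) array."""
    if he.shape != le.shape:
        raise ValueError("HE and LE patches must have the same shape")
    zeros = np.zeros_like(he)
    # Channel order follows the text: (1) HE, (2) LE, (3) zeros.
    return np.stack([he, le, zeros], axis=-1)


# Random 16-bit patches standing in for real readings.
rng = np.random.default_rng(1)
he_patch = rng.integers(0, 2**16, size=(5, 5), dtype=np.uint16)
le_patch = rng.integers(0, 2**16, size=(5, 5), dtype=np.uint16)
image = compose_three_channel(he_patch, le_patch)
```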

We train our CNN by fine-tuning the network, starting from weights initialized using the Xavier initialization strategy. During training, we use the adaptive moment estimation (Adam) optimizer with a batch size of 2048 and a constant learning rate of \(10^{-4}\) (decay is zero). The choice of the cost function was motivated by the fact that we trained the classifier to predict only one class (i.e., a multi-class model, not multi-output), so we use a softmax function (also called a normalized exponential function) at the output. The purpose of training is to obtain a model that estimates a high probability for the target class and, at the same time, a low probability for the other classes. Therefore, we use cross-entropy [32] as the cost function.
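
In standard notation (the symbols are illustrative and not taken from the original formulation), for output scores \(z_1, \dots, z_K\) over \(K\) classes and a one-hot target \(y\) with target class \(c\), the softmax output and the cross-entropy cost are

\[
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad
\mathcal{L} = -\sum_{k=1}^{K} y_k \ln p_k = -\ln p_c .
\]

Minimizing \(\mathcal{L}\) raises \(p_c\) and, because the \(p_k\) sum to one, lowers the probabilities of the remaining classes.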

In order to avoid overfitting, we examined various regularization methods. One of the most important is a dropout layer with a rate of 0.5 (active during training only), proposed by Hinton et al. [33]. The accuracy metric was used to estimate network efficiency and its ability to generalize on the validation data. Training was scheduled for up to 10,000 epochs, i.e., 10,000 passes over the entire training dataset. However, early stopping was also used as a regularization method: training ends when the model has not improved for 100 epochs. In addition, we explored regularization applied to all convolutional layers. Regularizers allow for applying penalties on layer parameters or layer activity during optimization. These penalties are incorporated in the loss function that the network optimizes. We chose activity regularization with the \(L_1\) and \(L_2\) norms, weighted by 0.001 and 0.01, respectively. Finally, we compared the effectiveness and generalization capabilities of our network with a Dropout layer and with multiple DropBlock layers. DropBlock, introduced by Ghiasi et al. [34], is a form of structured dropout in which units in a contiguous region of a feature map are dropped together. As its authors reported, DropBlock works better than dropout in regularizing convolutional networks.
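
Two of these mechanisms can be sketched in plain Python. The sketch is illustrative only: the function and class names are hypothetical, the penalty is written in its usual form (weighted sums of absolute and squared activations), and the assumption that early stopping monitors validation accuracy is ours, since the monitored quantity is not stated above.

```python
def activity_penalty(activations, l1=0.001, l2=0.01):
    """L1 and L2 activity penalty for one layer's outputs, added to the loss."""
    l1_term = l1 * sum(abs(a) for a in activations)
    l2_term = l2 * sum(a * a for a in activations)
    return l1_term + l2_term


class EarlyStopping:
    """Signal a stop once the monitored value has not improved for `patience` epochs."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def should_stop(self, value):
        if value > self.best:
            self.best = value
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy run with a short patience so that the behaviour is easy to follow.
stopper = EarlyStopping(patience=2)
history = [0.90, 0.95, 0.94, 0.95, 0.93]  # made-up validation accuracies
stopped_at = next(i for i, acc in enumerate(history) if stopper.should_stop(acc))
penalty = activity_penalty([1.0, -2.0])
```

In the toy run, the accuracy last improves at epoch index 1, so with a patience of 2 the stop is signalled at index 3.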

4 Experiments and results

The main parameters we analyzed are the accuracy of material recognition and the efficiency in terms of learning speed and model size. The proposed model is implemented using the TensorFlow and Keras libraries. All of our experiments were conducted on an Nvidia GeForce 2080 graphics processing unit (Turing microarchitecture).

All training and test data come from our “Materials in DEXA Scans Database” (MDD) presented in [11]. We trained the classifier on a dataset of over 1 million sample patches and tested it on over 100k test patches. Materials were classified into five groups: background, light organic, heavy organic, light metals and heavy metals. The sixth class, i.e., impenetrable materials, was not trained because it is a degenerate class. All models were trained and tuned on the validation set, and the final results were computed on the test set to verify the stability of the method.

Table 2 Normalized confusion matrices for the validation and test dataset with ELU function and various regularization methods
Table 3 The average accuracy of classification of all material classes for validation and test datasets for our multi-input CNN with various regularization methods

4.1 Proposed multi-scale solution

Our estimator (without any regularization methods) quickly reaches an accuracy of over \(99\%\) on the validation set (just after the 142nd epoch), which is very satisfactory. However, if we look at the confusion matrix for the material classes in Tables 2(1) and 4, it turns out that the network itself is poor at distinguishing the classes of light organics, heavy organics and light metals. This allows us to conclude that the model overfits as the learning process continues through subsequent epochs. For this reason, we tested various regularization methods that allow for better generalization of material classification.
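
The gap between a high overall accuracy and weak per-class results is easy to reproduce with a normalized confusion matrix. The sketch below uses made-up labels, not data from the MDD dataset, and assumes that each row is normalized by the number of samples of the true class.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, num_classes):
    """Row-normalized confusion matrix: entry [t][p] is the share of class t predicted as p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Synthetic example: class 0 dominates and is always right, class 1 is right half the time.
true_labels = [0] * 90 + [1] * 10
predicted_labels = [0] * 90 + [1] * 5 + [2] * 5

accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
matrix = normalized_confusion_matrix(true_labels, predicted_labels, num_classes=3)
```

Here the overall accuracy is 0.95, while the row for class 1 shows that only half of its samples are recognized correctly.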

Analyzing Table 3, we can note the following: using the ELU activation function in the proposed network improved its performance. We also see this increase in accuracy when the ELU activation function is combined with the regularization methods, i.e., L1, L2 and DropBlock, with the exception of the Dropout method. Another point to notice is the reduction of learning time for the DropBlock, Dropout and L1 regularization methods. The strong generalization provided by the Dropout, L1 and L2 methods and the benefits of using the ELU activation function prompted us to verify the combination of all these elements, which is presented in the last two rows of Table 3. As can be seen, the combination of these elements brought the greatest prediction stability for both the validation and test datasets: the difference in accuracy between the two is much smaller than for the other options presented in the table. The problem that appears is the number of epochs needed to train such a stable classifier. However, as will be shown in Sect. 5 and Table 5, the training time is still much shorter than for popular architectures.

In order to identify the regularization methods that best support the generalization of our network, we prepared the confusion matrices in Table 2 for the test dataset with the ELU activation function. We achieved the best material classification results for the ELU activation function combined with the L1 and the L1 + Dropout regularization methods. It can also be seen in Table 2 that all tested cases without the L1 or L2 regularization method have a significant problem with the classification of heavy organic materials. This is the class of materials that intermingles most with the neighboring classes, i.e., with light organics and light metals, which we also noticed during the verification of machine learning methods in [11]. However, as shown in Tables 2 and 3, not all regularization methods increased the accuracy and generalization of the neural network.

In summary, the best results for our neural network were achieved by using the L1 + Dropout method, with an average accuracy equal to 0.955. The results for the DropBlock layer are somewhat disappointing, especially in relation to the results presented in [34], probably due to the high noise of the input data and the relatively small input images (patch sizes from \(3 \times 3\) to \(15 \times 15\)). Thus, we obtained a model that achieves the highest accuracy and has the best ability to generalize the problem of material classification in DEXA scans.

5 Discussion

Due to the lack of multi-input and multi-scale convolutional network architectures that would allow full use of the capabilities of our dataset [11], we decided to compare our results to three popular architectures from the ImageNet challenge (ILSVRC): (1) VGG16 [23], (2) InceptionResNet V2 [24] and (3) EfficientNet in its B0 version [25], as well as to our previous solution based on the Random Forest classifier [11]. A full comparison study including all the methods listed in Table 1 is, however, beyond the scope of this work.

In each of the ImageNet architectures, we only change the last three dense layers to the following numbers of units: 128, 64 and an output layer with 5 units (the number of material classes). For research purposes, we checked the accuracy of the neural networks for their default input sizes: \(224 \times 224\) (VGG16, EfficientNet) and \(299 \times 299\) (InceptionResNet V2). These resolutions are much larger than those provided by the MDD dataset. Prior to training, each patch was scaled with cubic interpolation to the appropriate size for the given architecture and labeled with the appropriate material class. All values in the patches were scaled to the range [0, 255] and composed into three-channel images from HE, LE and zeros. All the selected ImageNet networks were trained for each patch type and size, and their accuracy was verified based on ensemble predictions for each patch size from a given perceptual area.

The accuracy of these architectures and of our proposal is presented in Table 5. The results are the ensemble accuracy over all patch sizes for the classification of all material classes, calculated from the class prediction that obtained the highest probability summed over all patch sizes. We chose the ensemble accuracy for the ImageNet architectures because it more closely resembles the operation of the model we proposed. The training time given in Table 5 is only a rough comparison, but it can be seen that the solution we propose is much faster and much more effective, in particular after adding regularization methods. As we can see in the confusion matrix in Table 2(8), InceptionResNet-v2 (the best of the compared ImageNet models) incorrectly classifies materials from the heavy organics and light metals classes.
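
The ensemble rule described above can be sketched as follows. The function name and the probability values are hypothetical; the example only illustrates summing class probabilities over patch sizes and selecting the class with the highest total.

```python
def ensemble_predict(probabilities_per_size):
    """Sum class probabilities over patch sizes and return the class with the highest total."""
    num_classes = len(probabilities_per_size[0])
    totals = [0.0] * num_classes
    for probs in probabilities_per_size:
        for k, p in enumerate(probs):
            totals[k] += p
    return max(range(num_classes), key=totals.__getitem__)


# Made-up outputs of one network for the five patch sizes (five material classes).
predictions = [
    [0.10, 0.50, 0.30, 0.05, 0.05],
    [0.10, 0.30, 0.50, 0.05, 0.05],
    [0.05, 0.25, 0.60, 0.05, 0.05],
    [0.10, 0.45, 0.35, 0.05, 0.05],
    [0.10, 0.40, 0.40, 0.05, 0.05],
]
predicted_class = ensemble_predict(predictions)
```

In this example class 2 is selected, because its summed probability (2.15) exceeds that of class 1 (1.90), even though class 1 has the highest probability for two of the five patch sizes.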

The weakest point in our comparison is the initial interpolation of the input data, which is necessary due to the architecture of the investigated ImageNet networks. Unfortunately, images from X-ray scanners are often very noisy. Interpolation of such images causes this noise to be blurred and enlarged. As a result, the noise is learned as a feature of a given material class rather than treated as an artifact that should be ignored when learning material attributes. As shown in [35], this type of image quality distortion affects the effectiveness of solutions based on convolutional neural networks. The networks are more sensitive to changes in blur and noise than to compression and contrast. Most likely, this effect and the fact that the models are too complex for the problem we present make the results of our solution much better. In particular, this can be seen in the confusion matrices (presented in Table 2), which depict well the problem with the interpenetration of the material classes. Additionally, the network we proposed learns much faster, which is related to the fact that it has fewer trainable parameters.

Table 4 Classification metrics for test dataset for the proposed CNN with ELU activation function and L1+Dropout regularization methods
Table 5 Comparison of the estimated training time, the total number of parameters and the average accuracy of the classification of all material classes for the test dataset for all considered architectures

6 Conclusion and future scope

We successfully developed a multi-scale convolutional network architecture that classifies materials in DEXA scans into six main types with per-pixel precision and very high accuracy, and that achieves better results than popular CNN architectures and a classical machine learning method. The presented method creates a per-pixel probability map of membership in the appropriate material class and can only be treated as an initial segmentation. We also confirmed our assumption from [11] that deep learning methods achieve better results in this problem than classical machine learning algorithms. In this paper, we analyzed several regularization methods and their impact on the effectiveness of our architecture. Additionally, by analyzing the two activation functions ReLU and ELU, we have shown that, for the presented problem and the multi-scale neural network, the ELU activation function also brings benefits in the form of better accuracy and a faster learning process. A considerable advantage in terms of accuracy and speed was observed for the proposed network-based method, with an accuracy improvement of approximately 7.30% compared to the best of the selected ImageNet architectures (InceptionResNet-v2) and of 2.14% compared to our previous solution based on the Random Forest classifier.

There are still some problems with the classification of the heavy organic material, which intermingles with the neighboring classes, i.e., with light organics and light metals, but the same problem was also noticed during the verification of the other discussed methods. A further limitation of the proposed method is the classification into six fixed classes of materials, which, however, is motivated by the use of the method in typical security scanners.

Our next goal is to develop a method that allows for more precise material discrimination on full-resolution scans and performs image segmentation, which will contribute to the development of this field and increase the efficiency of solutions for more advanced computer vision problems in X-ray images, such as object detection. There remains the problem of noisy X-ray images, which probably has the greatest impact on the classification and segmentation of such images. In order to overcome this obstacle, we want to try deep neural networks (autoencoders) that denoise images or interpolate images while simultaneously removing noise.