1. Introduction
Significant attention has been devoted to the problem of timely forest-fire detection. Given the rapid development of convolutional neural networks, as well as their generality and effectiveness compared with classic algorithms, such methods can be applied to flame-detection tasks in addition to their original purposes (e.g., medical image segmentation).
A previous study [1] described a fire object-detection solution using a YOLOv2 [2] model to obtain bounding boxes for areas of flame without specifying the flame class. The results are shown in Figure 1.
The application of object-detection methods to fire, however, has several drawbacks. First, fire can take a wide variety of contour configurations and, unlike regular convex objects such as automobiles and pedestrians (or the signs used in advanced driver-assistance systems), a fire object cannot be tightly inscribed in a bounding box, which leads to large variance in the mAP accuracy metric. Additionally, the center of mass of a fire conveys much more accurate information about an irregular contour than the center of its bounding box. As such, fire segmentation is a more suitable class of computer-vision task for this problem than object detection.
There are two approaches to accurate fire-contour segmentation. The first is a two-step pipeline: obtaining regions of interest (ROIs) of uniform color, followed by fire recognition for each region. Such methods have been described in previous studies [3,4]. The results of ROI recognition are shown in Figure 2.
The super-pixel extraction step is the main drawback of this approach. Its calibration (the setup of the color range) affects the shape of the candidate flame areas and therefore the accuracy of the subsequent recognition. Moreover, it is a sequential algorithm, so it cannot exploit the massively parallel architectures used for real-time GPU computation.
The second approach to accurate fire-contour segmentation is one-shot semantic-segmentation models. Recent research achievements have been obtained using the DeepLabv3+ [5] method for binary fire segmentation, as described in [6,7]. This architecture is used with heavy backbones such as Xception to obtain fire contours without distinguishing flame color (i.e., binary fire segmentation).
We address the problem of multiclass fire segmentation because directing suppression at the hottest regions of the fire is crucial for optimal firefighting [8]. The burning temperature correlates with the flame color: yellow-white regions are the hottest, whereas red areas are the coldest.
We used the UNet method [9] for multiclass fire segmentation with the original lightweight VGG16 backbone for several reasons. While extinguishing a fire, it is important to obtain real-time information about the fire’s vulnerable points [10]. In addition, a lightweight model can easily be ported to Jetson architectures, which are widely used in robotics systems. Finally, it allows a faster training process, with the ability to train on medium-sized batches with an inexpensive GPU such as the Nvidia RTX2070.
We investigated segmentation accuracy for input sizes of 224 × 224 and 448 × 448 pixels using two calculation schemes. The first fits the input image to the size of the CNN input (one-window). The second processes non-intersected sub-areas of the original image using a sliding window (non-intersected). Additionally, we propose methods to improve segmentation accuracy based on compositing partially intersected areas via weighted addition and Gaussian mixtures of the per-window results.
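The non-intersected scheme can be illustrated with a minimal sketch: the frame is cut into non-overlapping tiles matching the CNN input size. The padding policy and array layout here are our own illustrative assumptions, not the exact preprocessing used in the paper.

```python
import numpy as np

def split_non_intersected(image, tile=224):
    """Split an H x W x C image into non-overlapping tile x tile patches.

    The image is assumed to be padded so that H and W are multiples of
    `tile` (an assumption for this sketch).
    """
    h, w, _ = image.shape
    patches = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patches.append(image[y:y + tile, x:x + tile])
    return patches

# Example: a 448 x 448 frame yields four 224 x 224 tiles.
frame = np.zeros((448, 448, 3), dtype=np.float32)
tiles = split_non_intersected(frame)
print(len(tiles))  # 4
```

Each patch is then fed to the CNN independently, and the per-tile masks are stitched back in the same order.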
In the next step, we describe a new deep-learning architecture, UUNet-concatenative, as a modernization of UNet. This model combines binary and multiclass UNet parts: it performs multiclass (color-differentiated) segmentation of the single-nature objects (flame areas) acquired by the binary part. Unlike a simple cascade of two UNets, UUNet adds skip connections from the binary decoder to the encoder of the multiclass part. The suffix "concatenative" means that the binary-segmentation result is concatenated with the input image and fed to the multiclass UNet part as its input. wUUNet is the next step in the improvement of the UUNet model; it uses the entire combinatorial set of skip connections.
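The channel concatenation at the junction of the two UNet parts can be sketched as follows; the channel-first layout and shapes are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def concat_binary_with_input(image_chw, binary_mask_hw):
    """Join the binary-segmentation result with the original image along
    the channel axis, forming the input of the multiclass UNet part."""
    mask = binary_mask_hw[np.newaxis, ...]            # 1 x H x W
    return np.concatenate([image_chw, mask], axis=0)  # (C+1) x H x W

img = np.random.rand(3, 224, 224).astype(np.float32)           # RGB input
binary = (np.random.rand(224, 224) > 0.5).astype(np.float32)   # binary mask
multiclass_input = concat_binary_with_input(img, binary)
print(multiclass_input.shape)  # (4, 224, 224)
```

The multiclass part therefore receives a 4-channel tensor: the three color channels plus the binary flame mask.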
We prepared a custom dataset of 6250 samples with a resolution of 224 × 224 pixels. We use soft Dice [11,12] and Jaccard [13] as the target loss functions. We use the Adam [14] optimizer to update the CNN weights, with the initialization proposed by He et al. [15]. We use the SGDR [16] schedule over 300 epochs, with an initial learning rate (lr) of 0.001 and annealing of the learning rate by a factor of 10 every 60 epochs.
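The two target losses follow standard soft-relaxation definitions; a minimal numpy sketch (a global, single-class form with a small smoothing constant `eps`, which is our assumption, since the paper's exact batching and smoothing are not shown here) is:

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice-Sorensen loss: 1 - 2|P * T| / (|P| + |T|)."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def soft_jaccard_loss(pred, target, eps=1e-6):
    """Soft Jaccard (IoU) loss: 1 - |P * T| / |P u T|."""
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return 1.0 - (inter + eps) / (union + eps)

p = np.array([0.9, 0.8, 0.1])  # predicted probabilities
t = np.array([1.0, 1.0, 0.0])  # ground truth
dice = soft_dice_loss(p, t)
jaccard = soft_jaccard_loss(p, t)
print(dice, jaccard)
```

Note that the Jaccard loss penalizes the same prediction more heavily than Dice, which is consistent with the two losses driving the models to different optima in the comparisons below.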
3. Results
This section compares the models and calculation schemes presented in Section 2 to determine the best approach to fire segmentation. First, we evaluate the best UNet model with a simple scheme to establish a baseline. Then, we apply the proposed methods and models described in this article to demonstrate the improvement in segmentation accuracy.
3.1. UNet One-Window vs. Full-Size
The first comparison in this study was between one-window models trained using the soft Dice–Sørensen and Jaccard loss functions. The results are shown in Figure 13.
Captions in each part of the figure describe its purpose. The left side of Figure 13a shows the result of the UNet model trained with soft Dice loss, and the right side shows the Jaccard result. The last row of the image shows the difference between ground truth and prediction, revealing that the Jaccard model fails to recognize a significant red flare, resulting in lower scores for both binary and multiclass segmentation. Indeed, the model trained with the soft Dice–Sørensen loss performs better, as can be seen from the frequency histogram of the accuracy distribution on the validation dataset shown in Figure 13b.
Although the peak of this histogram lies in the same multiclass Jaccard accuracy range (i.e., 82–84%), the model trained with the soft Jaccard function has another significant peak at 66–68%, which lowers the average precision over the entire dataset.
The following comparison concerns the one-window and full-size modes of the UNet 448 × 448 model, shown in Figure 14. The full-size mode clearly outperforms the one-window mode; in this case, however, significantly better results are obtained when the soft Jaccard loss function is used.
The full-size model trained with Dice yields an unsatisfactory result, as evidenced by the missed flame at the center of the actual one. However, as the first and third difference images show, the full-size model recognizes flares better than the one-window model, which also has a positive effect on the full-size model's accuracy.
The last comparison in this section, between the UNet 448 × 448 and 224 × 224 full-size models, is shown in Figure 15.
The UNet 224 × 224 model performs significantly better, showing a corresponding gap in the binary and multiclass precision metrics. This is also confirmed by the accuracy histograms in Figure 15b, where the peaks shift towards higher accuracy.
To complete this section, Table 3 reports the mean and variance of the binary and multiclass segmentation precision for the models analyzed above. The full-size UNet 224 × 224 model shows the best results for multiclass segmentation, whereas the 448 × 448 model works better for binary segmentation. The 224 × 224 full-size model also exhibits the largest accuracy variance. Its accuracy is further improved in the next section.
3.2. Non-Intersected vs. Averaged Half-Intersected Calculation Schemes
Returning to the full-size non-intersected calculation scheme (Section 2.2) and the segmentation problems it exhibits at the boundaries of non-intersected nodes, we propose new calculation schemes based on averaging the segmentation of half-intersected areas. Searching for the optimal divisor n in Equation (3), we obtained the graph of multiclass Jaccard accuracy shown in Figure 16.
The figure shows that accuracy, as a function of n, has a local maximum, which can be refined by a dichotomy search. Applying this algorithm yields maximal accuracy at n = 4.07512.
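The refinement of n can be sketched as a one-dimensional maximization of a unimodal curve. The accuracy function below is a toy stand-in with its peak placed at the reported optimum, not Equation (3) itself, and we use a ternary-search variant of the dichotomy idea; the search interval is also an assumption.

```python
def maximize_unimodal(f, lo, hi, tol=1e-4):
    """Locate the maximum of a unimodal function on [lo, hi] by
    repeatedly shrinking the interval (ternary-search variant of
    interval bisection)."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1  # maximum lies in [m1, hi]
        else:
            hi = m2  # maximum lies in [lo, m2]
    return 0.5 * (lo + hi)

# Toy accuracy curve peaking near n = 4.075 (illustrative only).
acc = lambda n: -(n - 4.075) ** 2
n_opt = maximize_unimodal(acc, 2.0, 8.0)
print(round(n_opt, 3))
```

In practice, f would be the validation-set multiclass Jaccard accuracy evaluated at a given divisor n, so each probe is one full validation pass.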
Comparison results for the full-size non-intersected and half-intersected models using both methods are shown in Figure 17.
The results demonstrate improved model accuracy with both methods. The shape of the accuracy histogram does not change significantly; however, the methods improve accuracy for almost all samples in the validation dataset. The corresponding mean accuracy and variance values provide further confirmation of this (see Table 4). The Gaussian-weighted scheme achieves the best accuracy values, as well as significantly lower variance than the other full-size UNet 224 × 224 models, for both multiclass and binary segmentation.
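The Gaussian composition of half-intersected windows can be sketched as per-pixel weighted averaging, where each window's prediction is weighted by a Gaussian peaking at the window center so that tile borders contribute little. The window function, σ, and toy tile sizes below are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def gaussian_window(tile, sigma_frac=0.5):
    """2-D Gaussian weight map for one tile, peaking at the tile center."""
    ax = np.arange(tile) - (tile - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * (sigma_frac * tile) ** 2))
    return np.outer(g, g)

def blend_half_intersected(preds, positions, out_hw, tile):
    """Weighted average of per-window predictions on overlapping tiles."""
    acc = np.zeros(out_hw)
    norm = np.zeros(out_hw)
    w = gaussian_window(tile)
    for pred, (y, x) in zip(preds, positions):
        acc[y:y + tile, x:x + tile] += w * pred
        norm[y:y + tile, x:x + tile] += w
    return acc / np.maximum(norm, 1e-12)

# Two half-overlapping 4 x 4 windows on a 4 x 6 strip (toy sizes).
tile = 4
preds = [np.ones((tile, tile)), 3.0 * np.ones((tile, tile))]
positions = [(0, 0), (0, 2)]
blended = blend_half_intersected(preds, positions, (4, 6), tile)
```

In the overlap (columns 2–3), the result is a smooth mixture of the two windows; outside it, each window's prediction passes through unchanged, which removes the hard seams of the non-intersected scheme.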
3.3. UUNet and wUUNet
The proposed models are described in Section 2.3, and the comparison results are summarized in Table 5. The wUUNet with the Gaussian half-intersected scheme yields the best segmentation quality, with an accuracy above 80% according to the multiclass Jaccard metric.
The performance characteristics are shown in Table 6. We used an Nvidia RTX2070-based workstation as the target device and a pure PyTorch model to obtain the FPS values, minimum memory consumption, and the number of video streams able to run in parallel. The table shows that all of the UNet-based models can run in real time. We do not take into account the time needed to move the input frame from CPU to GPU, because robotics systems fetch and save the frame directly to video RAM.
To complete the investigation of the multiclass fire-segmentation task, it is worth noting that the methods can be used in different environmental conditions, as shown in Figures 18 and 19. For each original image, the CNN calculates an accurate fire-segmentation mask; we visualize it below by swapping the red and blue channels of the image and marking the detected flame areas in red, orange, and yellow.
The visualization demonstrates effective segmentation for video captured from the air and on the ground, and for both large and small fires. It also shows reliable elimination of false alarms caused by firefighting vehicles and other red, yellow, and orange objects.
4. Discussion
This article describes a detailed solution to multiclass fire-image segmentation using an advanced neural-network architecture, UNet. Because, to our knowledge, this problem is solved here for the first time, we collected and labeled datasets for training and evaluated several configurations of the UNet model. We determined the best configuration for fire segmentation and proposed methods to improve accuracy. Based on the UNet architecture, we developed the wUUNet-concatenative model, which demonstrates the best results in multiclass fire segmentation, outperforming UNet by 2% and 3% in binary and multiclass segmentation accuracy, respectively.
The next steps are to run the proposed models and calculation schemes on a Jetson Nano board to perform real-time computations from a connected CSI camera module, and to create a prototype fire-detection system providing automatic forest-fire segmentation and effective suppression.