Abstract

Cracks are a critical indicator in evaluating the quality of concrete structures, since they affect the safety, applicability, and durability of a structure. Owing to its excellent performance in image processing, the convolutional neural network is becoming the mainstream alternative to manual crack detection. In this paper, we adapt EfficientNetB0 to detect concrete surface cracks using transfer learning. The base model is designed by neural architecture search, and its weights are pretrained on ImageNet. During supervised training, the Adam optimizer is used to update the network parameters. In the testing process, crack images from different locations are used to further test the generalization capability of the model. A comparison with the MobileNetV2, DenseNet201, and InceptionV3 models shows that our model greatly reduces the number of parameters while achieving high accuracy (0.9911) and good generalization capability. It is an efficient detection model that provides a new option for crack detection where computing resources are limited.

1. Introduction

In current infrastructure, concrete structures account for the largest proportion. For concrete structures, cracks are a frequently encountered defect. As service time increases, the number and width of cracks gradually grow, which seriously affects the safety, applicability, and durability of the structure. Therefore, it is of great significance to detect cracks regularly and take corresponding maintenance measures for the safety of concrete structures [1–3]. The traditional crack detection method relies mainly on direct inspection by professionals with related instruments. This method is both labor-intensive and time-consuming. In addition, it poses considerable safety hazards to the inspectors.

To find an efficient and safe crack detection method that overcomes the shortcomings of manual detection, researchers have turned their attention to image processing technologies (IPTs) [4]. In the past decades, the computer vision community has been working on automated image analysis and has proposed a series of image processing techniques, including thresholding [5, 6], edge detection [7], wavelet transforms [8, 9], and machine learning [10]. Image thresholding segments the crack in the image at the pixel level according to differences in pixel values, which simplifies the image and facilitates further processing. The differential operators used for edge detection mainly include the Roberts, Sobel, and Laplace operators [11]. The basic idea of the wavelet transform is to use a set of wavelet functions or basis functions to represent a function or signal, such as an image signal. Machine learning extracts feature vectors of cracks from the training dataset and combines them with certain algorithms to make predictions. These methods solve the problem of crack detection in engineering effectively. However, due to the unevenness of cracks, the diversity of surface textures, and the complexity of the background, this field of research is still active.

As a new technology in machine learning research, deep learning is motivated by building neural networks that simulate the human brain in analyzing problems and learning. Through feature learning [12], each layer of the network takes the output of the previous layer as its input, so that the network learns a deep nonlinear structure and transforms the raw data into a more abstract representation. The rapid expansion of available datasets, the arrival of high-performance computing hardware, and the continuous improvement of training methods are driving the rapid development of deep learning. In 2012, AlexNet [13] won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) by an overwhelming margin. Since then, convolutional neural networks (CNNs) have attracted the attention of many researchers. A CNN is a feed-forward neural network in which the connections between neurons are inspired by the animal visual cortex. It is characterized by local connectivity and parameter sharing and performs excellently in large-scale image processing [14]. In recent years, researchers have begun to use convolutional neural networks to detect road defects automatically. The following is a brief overview of the application of CNNs in crack detection.

Zhang et al. [15] proposed a crack detection method based on deep learning, which appears to be one of the earliest works applying CNNs to road crack detection. The pavement pictures were taken with smartphones, and the network model was built on the Caffe deep learning (DL) framework. By comparing with traditional machine learning classifiers such as the support vector machine (SVM) and boosting methods, the authors demonstrated the effectiveness of deep learning methods. Pauly et al. [16] studied the influence of CNN depth and of the change in location between the training and test datasets on pavement crack detection accuracy. The results show that increasing the network depth can improve network performance, but when the image location changes, the detection accuracy is greatly reduced. Maeda et al. [17] created a large-scale road damage dataset and marked the location and type of road damage in each picture. An end-to-end object detection method based on deep learning was then used to train the damage detection model. Maeda et al. also ported their model to a mobile phone application to facilitate road damage detection in areas lacking experts and financial resources. Xu et al. [18] established an end-to-end bridge crack detection model to realize automatic bridge crack detection. The use of depthwise separable convolution reduces the number of parameters effectively, and the atrous spatial pyramid pooling (ASPP) module extracts information at multiple scales. The model achieves a detection accuracy of 96.37% without pretraining. Li et al. [19] proposed the YOLOv3-Lite method, which greatly improves the crack detection speed without reducing the detection accuracy. Tong et al. [20] used convolutional neural networks to automatically detect, locate, and measure concealed cracks in ground penetrating radar images and finally reconstruct them in three dimensions, realizing a low-cost damage characterization method. Yang et al. [21] realized the pixel-level detection of cracks based on a fully convolutional neural network. The fully convolutional neural network is composed of downsampling and upsampling stages and can detect objects at different scales. In terms of crack segmentation, the accuracy, precision, recall, and F1-score are 97.96%, 81.73%, 78.97%, and 79.95%, respectively. Zhu and Song [22] used transfer learning to improve VGG16 and realized the accurate classification of surface defects on concrete bridges. The training of convolutional neural networks usually requires a large amount of data, but in many cases large-scale data are expensive to obtain; with transfer learning, a pretrained model can be transferred to the task of crack detection. The results show that the model can effectively extract defect features and provides a new idea for surface defect detection. Deng et al. [23] added a region-based deformable module to Faster R-CNN [24], R-FCN [25], and FPN-based Faster R-CNN [26] to improve the evaluation accuracy of crack detection.

In this paper, we use the transfer learning method to build a model for concrete surface crack detection. Compared with existing models, our model achieves a good balance among accuracy, model size, and training speed. Due to the use of transfer learning, the model becomes easier to train and faster to converge and has better generalization capability.

The remainder of this paper is structured as follows: Section 2 describes the dataset and the image preprocessing method; Section 3 presents the overall model architecture and training details; Section 4 shows our experimental results; and Section 5 concludes the paper.

2. Dataset and Data Augmentation

2.1. Building Dataset

In this study, we use the dataset collected by Li and Zhao [27]. The photos in this dataset were taken with a smartphone at a resolution of 4160 × 3120 pixels on the surface of a pylon and anchor room of a suspension bridge in Dalian, Liaoning, China. The images were then cropped to a resolution of 256 × 256 pixels. After cropping, the images were manually divided into two categories: with cracks and without cracks. In this study, we use only 12,000 photos from the dataset, with equal numbers of crack and noncrack images. These images include crack features and background features under various conditions. The 12,000 selected images are divided into the training set, validation set, and test set at a ratio of 6 : 2 : 2, and the numbers of crack and noncrack images are equal in each of the three sets. In addition, we select 1,000 concrete bridge images with cracks and 1,000 images without cracks from the SDNET2018 [28] dataset. This introduces various changes, such as changes in lighting conditions, crack features, and surface texture, which allows us to further test the generalization capability of the model and evaluate it more comprehensively. Figure 1 shows several crack and noncrack images from the two datasets.
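The class-balanced 6 : 2 : 2 split described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the file names, the function name `split_balanced`, and the random seed are hypothetical.

```python
import random

def split_balanced(crack, noncrack, ratios=(6, 2, 2), seed=0):
    """Split two class lists into train/val/test so each set stays class-balanced."""
    rng = random.Random(seed)
    total = sum(ratios)
    splits = {"train": [], "val": [], "test": []}
    for label, items in (("crack", crack), ("noncrack", noncrack)):
        items = list(items)
        rng.shuffle(items)  # shuffle within each class before cutting
        n_train = len(items) * ratios[0] // total
        n_val = len(items) * ratios[1] // total
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["val"] += [(x, label) for x in items[n_train:n_train + n_val]]
        splits["test"] += [(x, label) for x in items[n_train + n_val:]]
    return splits

# Hypothetical file names standing in for the 6,000 + 6,000 selected images.
crack = [f"crack_{i}.jpg" for i in range(6000)]
noncrack = [f"noncrack_{i}.jpg" for i in range(6000)]
splits = split_balanced(crack, noncrack)
print({name: len(items) for name, items in splits.items()})
```

With 6,000 images per class, this yields 7,200 training, 2,400 validation, and 2,400 test images, half of each set being crack images.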

2.2. Data Augmentation

The generalization capability of a neural network model is closely related to the amount of training data. In reality, however, the amount of data is limited. One way to solve this problem is data augmentation [29], which creates synthetic data and adds it to the training set. By generating additional, equivalent samples from limited data to artificially expand the training dataset, data augmentation can also effectively reduce overfitting, and it is widely used in various fields of deep learning. At present, the commonly used data augmentation methods in computer vision fall into two groups: methods based on image processing techniques and methods based on deep learning.

In this paper, we use the built-in ImageDataGenerator interface of TensorFlow 2.0 to augment the input image data with operations including image flip, rotation, and shift. Figure 2 shows several crack images after data augmentation.
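To make the flip, rotation, and shift operations concrete, the following plain-Python sketch applies them to a tiny image stored as a list of rows. It only illustrates the geometric operations and is not the ImageDataGenerator API; the function names, the 0.5 probabilities, and the shift range are assumptions.

```python
import random

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]

def shift_right(img, dx, fill=0):
    """Shift the image dx pixels to the right (0 <= dx <= width), padding with `fill`."""
    width = len(img[0])
    return [[fill] * dx + row[:width - dx] for row in img]

def augment(img, rng):
    """Apply each operation at random, as an online augmenter would per sample."""
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    if rng.random() < 0.5:
        img = rot90(img)
    return shift_right(img, rng.randint(0, 1))

tile = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
augmented = augment(tile, random.Random(0))
```

Because the transformed copies are produced on the fly from the original image, nothing needs to be written to disk.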

3. Model Construction and Training

3.1. The CNN

A convolutional neural network is a special kind of neural network. Its main feature is the convolution operation, which performs excellently in large-scale image processing. Generally speaking, a convolutional neural network is a hierarchical model that takes the original data (such as RGB images) from the input layer and, through a series of operations such as convolution, pooling, and nonlinear activation function mapping, abstracts it layer by layer to extract feature information and finally make predictions. Deep convolutional neural networks have become popular since 2012 and are now a pivotal research topic in the field of artificial intelligence. Classical convolutional neural networks include AlexNet [13], VGG [30], GoogLeNet [31], and ResNet [32]. A layer is the basic calculation unit of a CNN. A CNN is mainly composed of an input layer, convolution layers, activation functions, pooling layers, fully connected layers, and a Softmax layer.
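The sequence of convolution, nonlinear activation, and pooling can be illustrated with a minimal plain-Python sketch. The 6 × 6 image, the hand-written gradient kernel, and the function names are assumptions chosen for illustration; in a real CNN the kernel values are learned.

```python
def conv2d(img, kernel):
    """Valid 2D convolution (cross-correlation, as used in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Nonlinear activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# Bright surface (1.0) with a dark vertical line (0.0) standing in for a crack.
image = [[1.0, 1.0, 1.0, 0.0, 1.0, 1.0] for _ in range(6)]
# Horizontal-gradient kernel that responds to dark-to-bright vertical edges.
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

features = conv2d(image, kernel)       # 4 x 4 feature map
pooled = max_pool2x2(relu(features))   # 2 x 2 map after activation and pooling
```

The pooled map keeps a strong response only where the dark line meets the brighter surface on its right, which is the kind of local feature that deeper layers combine into a crack/noncrack decision.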

3.2. Swish Activation Function

Ramachandran et al. [33] proposed the Swish activation function, which was found using a combination of exhaustive and reinforcement learning-based search. The effectiveness of this activation function has been verified in several large neural networks. The EfficientNet model used in this article uses the Swish activation function. Swish is defined as follows:

$$f(x) = x \cdot \sigma(\beta x),$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $\beta$ is either a constant or a trainable parameter.

Figure 3 shows the graph of Swish for different values of $\beta$.
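A minimal sketch of Swish in plain Python, assuming the usual form x · sigmoid(βx) from Ramachandran et al. [33]; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

# beta = 0 gives the linear function x / 2; a large beta approaches ReLU.
for beta in (0.0, 1.0, 50.0):
    print(beta, [round(swish(x, beta), 4) for x in (-2.0, 0.0, 2.0)])
```

The two limiting cases explain the family of curves: as β grows, the sigmoid becomes a step function and Swish behaves like ReLU, while β = 0 removes the nonlinearity entirely.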

3.3. Architecture Description

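EfficientNet [34], proposed by Google in 2019, has a strong feature extraction capability. Compared with other classic convolutional neural networks, it has fewer parameters and higher accuracy. The baseline network of EfficientNet is designed using multiobjective neural architecture search, and the baseline network is then scaled in terms of depth, width, and resolution to achieve a balance among them. The compound scaling method is defined as follows:

$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \ \alpha \ge 1, \ \beta \ge 1, \ \gamma \ge 1,$$

where $d$, $w$, and $r$ are the scaling factors for depth, width, and resolution, $\phi$ is the compound coefficient, and $\alpha$, $\beta$, and $\gamma$ are constants that can be determined by a small grid search.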
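The compound scaling rule can be illustrated numerically. The sketch below assumes the form given in the EfficientNet paper [34] (depth α^φ, width β^φ, resolution γ^φ, with α · β² · γ² ≈ 2) and the constants α = 1.2, β = 1.1, γ = 1.15 reported there for the baseline; these values come from [34], not from the experiments in this paper, and the function names are illustrative.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return {
        "depth": alpha ** phi,
        "width": beta ** phi,
        "resolution": gamma ** phi,
    }

def flops_factor(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate growth in FLOPS; roughly doubles per unit of phi."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi

for phi in range(4):
    scale = compound_scale(phi)
    print(phi,
          {name: round(value, 3) for name, value in scale.items()},
          round(flops_factor(phi), 3))
```

With φ = 0 all multipliers equal 1, which corresponds to the unscaled baseline (EfficientNetB0) used in this paper.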

Firstly, EfficientNetB0 performs a 3 × 3 convolution operation on the input image, and then, the next 16 mobile inverted bottleneck convolution modules are used to further extract image features. Finally, after 1 × 1 convolution and global average pooling operations, the classification results can be obtained in the fully connected layer. After each convolution operation in the network, batch normalization is performed. The activation function used in this network is Swish. The overall architecture of EfficientNetB0 is shown in Figure 4.

The core component of the network is the mobile inverted bottleneck convolution module (MBConv). Figure 5 shows the framework of this module, whose design is inspired by the inverted residual and residual structures. Before the 3 × 3 or 5 × 5 convolution, the channel dimension of the feature map is increased via a 1 × 1 convolution in order to extract more feature information. The Squeeze-and-Excitation (SE) [35] block is added after the 3 × 3 or 5 × 5 convolution to further improve performance. Finally, a 1 × 1 convolution is used to reduce the dimension, and a residual connection is added.

The Squeeze-and-Excitation block first compresses the feature map: global average pooling reduces each channel to a single value. An excitation operation is then performed on this global feature to learn the relationships among the channels, and the weights of the different channels are obtained through the sigmoid activation function. Finally, the original feature map is multiplied by these weights to obtain the final feature. The block allows the model to pay more attention to the channels that carry the most information while suppressing those that are not important. Figure 6 illustrates the details of the Squeeze-and-Excitation block.
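The squeeze, excitation, and reweighting steps can be sketched in plain Python for a tiny feature map. The weights below are hypothetical hand-picked values (in a real network they are learned), the bottleneck uses ReLU as in the original SE design [35], and all names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(vec, weights, bias):
    """Fully connected layer: weights[i][j] maps input j to output i."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def squeeze_excite(fmap, w1, b1, w2, b2):
    """fmap is a list of C channels, each an H x W list of floats."""
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid,
    # producing one weight in (0, 1) per channel.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    scale = [sigmoid(s) for s in dense(hidden, w2, b2)]
    # Reweight: every value of channel c is multiplied by scale[c].
    out = [[[v * s for v in row] for row in ch] for ch, s in zip(fmap, scale)]
    return out, scale

# Two 2 x 2 channels and hypothetical weights (2 -> 1 -> 2 units).
fmap = [[[2.0, 2.0], [2.0, 2.0]],
        [[1.0, 1.0], [1.0, 1.0]]]
w1, b1 = [[1.0, -1.0]], [0.0]
w2, b2 = [[2.0], [-2.0]], [0.0, 0.0]
out, scale = squeeze_excite(fmap, w1, b1, w2, b2)
```

With these weights the first channel is emphasized (weight about 0.88) and the second is suppressed (weight about 0.12).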

3.4. Loss Function and Adam
3.4.1. Loss Function

The loss function is mainly used to evaluate how well the model performs. Given a large amount of data, the machine discovers regularities through autonomous learning and makes predictions. The loss function measures the degree of deviation between the predicted result and the actual value. During network training, the model parameters are continuously updated to reduce this error until the best fit is found.

The cross-entropy loss function, which is applied to the output of the Softmax function and is therefore often called the Softmax loss, is widely used to measure the gap between the predicted value and the actual value when a convolutional neural network deals with classification problems. The Softmax function is defined as follows:

$$S_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}},$$

where $n$ represents the number of neurons in the output layer and $z_i$ is the input signal of the $i$-th output neuron.
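The relationship between Softmax and the cross-entropy loss can be sketched in plain Python. The example assumes a two-neuron output layer (crack and noncrack) and hypothetical logit values; subtracting the maximum before exponentiation is a standard numerical-stability step and does not change the result.

```python
import math

def softmax(z):
    """Convert output-layer inputs z into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cross-entropy loss for a single sample with a one-hot target."""
    return -math.log(probs[target_index])

logits = [2.0, -1.0]                  # hypothetical inputs for [crack, noncrack]
probs = softmax(logits)
loss_if_crack = cross_entropy(probs, 0)     # small loss: prediction agrees
loss_if_noncrack = cross_entropy(probs, 1)  # large loss: prediction disagrees
```

The loss is small when the probability assigned to the true class is close to 1 and grows without bound as that probability approaches 0.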

3.4.2. Adam

The internal parameters of a model play a significant role in training it effectively and in making accurate predictions. We therefore need a suitable optimizer to update the network parameters so that they approach or reach their optimal values. The optimization algorithm minimizes the loss function by updating the model parameters until convergence is reached. In this article, the Adam algorithm is used to update the weights.

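Kingma and Ba [36] use exponential moving averages to estimate the moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$

where $m_t$ and $v_t$ are the moving averages, $g_t$ is the gradient, and $\beta_1$ and $\beta_2$ are the decay rates of the moment estimates (set to 0.9 and 0.999, respectively).

Then, we do bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The final formula for the weight update is

$$w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\eta$ is the learning rate and $\epsilon$ is a hyperparameter (set to 1e-3).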
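A single Adam update for one scalar weight can be sketched as follows, assuming the standard formulation of Kingma and Ba [36]. The default `eps` uses the 1e-3 quoted in the text, on the assumption that this value refers to the small constant in the denominator; the toy objective f(w) = w², the learning rate of 0.1, and the 200 steps are chosen only for illustration.

```python
import math

def adam_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-3):
    """One Adam update for weight w with gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    g = 2.0 * w
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
w_final = w
```

Because the bias-corrected ratio in the first step is close to the sign of the gradient, the first update moves the weight by almost exactly the learning rate, regardless of the gradient's scale.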

3.5. Training

The convolutional neural network models used in this article use transfer learning to detect concrete surface cracks. Specifically, the weights of each model are first trained on ImageNet, saved, and then transferred to our model. Our model therefore has a better starting point, which greatly reduces training time and yields better performance.

All the experiments in this paper are performed with TensorFlow on a Windows system. The hardware settings are as follows: CPU: Intel(R) Core(TM) [email protected] GHz; RAM: 16 GB; GPU: NVIDIA GTX 1080 Ti.

A pretrained model is a model that has been trained on a large dataset and saved in advance. To detect concrete surface cracks, we need to retrain the pretrained convolutional neural network model. In addition, the last fully connected layer of the original model is replaced by a new fully connected layer. The specific experimental steps are as follows:

Step 1: data loading. Concrete surface crack images are imported. A batch of data (batch size: 16) is randomly loaded from the training set for subsequent processing.

Step 2: image preprocessing. We use a built-in TensorFlow function to resize the input images to the fixed input size of the model. Data augmentation is then performed via image flip, translation, rotation, and other operations. It is worth noting that, because TensorFlow’s built-in function is used, the large number of pictures generated by data augmentation are not saved to the local computer; all of them are generated online.

Step 3: define the structure of the crack detection model. The pretrained model (such as EfficientNetB0) is loaded and fine-tuned. We remove the last fully connected layer and replace it with a custom layer. In this article, the number of classes in the custom layer is set to 2. The weight values of the other layers are left unchanged.

Step 4: compile the model and start training. Before training the model, it is necessary to specify the hyperparameters related to the network structure and to select an appropriate optimization strategy. In this experiment, a random batch training method is used to train the neural network. The dataset is randomly shuffled before each epoch so that the batches differ from epoch to epoch, which can speed up model convergence. The learning rate plays a significant role in model training. An appropriate learning rate speeds up convergence, whereas an inappropriate one may cause the loss value of the objective function to explode. Since transfer learning is used and the network has already converged on the original dataset, the initial learning rate in this experiment is set to 5e-7. When the validation loss does not decrease for two consecutive evaluations, the learning rate is halved. We select the adaptive learning rate optimization algorithm Adam and use the cross-entropy loss function to guide model training. The weights are initialized as follows: the weights of the newly added fully connected layer are initialized randomly, and the remaining weights are initialized with the pretrained weights. Figure 7 shows the process of model adjustment and weight initialization. To decide how long to train the model, we adopt an early stopping strategy that avoids overtraining the network. During training, we monitor the validation loss. Once the validation loss has improved by less than 1e-3 over 30 epochs, training stops.

Step 5: test the performance of the model. The model is evaluated on the test dataset. In addition, we select 1,000 crack and 1,000 noncrack images of concrete bridge decks from the SDNET2018 dataset to construct a completely different dataset, so that we can further evaluate the detection performance of the model.
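The learning rate schedule and early stopping rule of Step 4 can be sketched in plain Python as follows. The class name and its interface are hypothetical, and the sketch assumes that the validation loss is checked once per epoch. Keras offers similar behavior through its ReduceLROnPlateau and EarlyStopping callbacks, although the text does not state which mechanism was actually used.

```python
class PlateauController:
    """Halve the learning rate on short plateaus; stop on long ones."""

    def __init__(self, lr, lr_patience=2, factor=0.5,
                 stop_patience=30, min_delta=1e-3):
        self.lr = lr
        self.lr_patience = lr_patience
        self.factor = factor
        self.stop_patience = stop_patience
        self.min_delta = min_delta
        self.best_for_lr = float("inf")
        self.best_for_stop = float("inf")
        self.lr_wait = 0
        self.stop_wait = 0

    def update(self, val_loss):
        """Feed one epoch's validation loss; return True when training should stop."""
        # Learning rate: any decrease of the validation loss counts as progress.
        if val_loss < self.best_for_lr:
            self.best_for_lr = val_loss
            self.lr_wait = 0
        else:
            self.lr_wait += 1
            if self.lr_wait >= self.lr_patience:
                self.lr *= self.factor
                self.lr_wait = 0
        # Early stopping: progress must exceed min_delta to reset the counter.
        if self.best_for_stop - val_loss > self.min_delta:
            self.best_for_stop = val_loss
            self.stop_wait = 0
        else:
            self.stop_wait += 1
        return self.stop_wait >= self.stop_patience

controller = PlateauController(lr=5e-7)
for loss in [1.0, 0.9, 0.9, 0.9]:      # hypothetical validation losses
    should_stop = controller.update(loss)
print(controller.lr, should_stop)
```

In this hypothetical run the loss stalls at 0.9 for two epochs, so the learning rate is halved from 5e-7 to 2.5e-7, while training continues because the 30-epoch stopping patience has not been reached.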

4. Experimental Results and Analyses

4.1. Experimental Results and Evaluation Index

When performing image classification tasks, evaluation metrics need to be selected in order to compare the performance of different algorithms. Accuracy is the ratio of the number of correctly predicted crack and noncrack images to the total number of input images. Precision is the number of correctly predicted crack images divided by the number of images predicted as crack by the classifier. Recall is the ratio of the number of correctly predicted crack images to the total number of crack images. The F1-score is the harmonic mean of precision and recall. Accuracy, Precision, Recall, and F1-score are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP and TN are the numbers of crack and noncrack images that are correctly classified, FP is the number of noncrack images wrongly classified as crack, and FN is the number of crack images wrongly classified as noncrack.
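The four metrics follow directly from the confusion counts, as the sketch below shows. The counts are hypothetical and are not results from this paper; the function assumes at least one predicted and one actual crack image, since otherwise precision or recall is undefined.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the images predicted as crack, how many are crack
    recall = tp / (tp + fn)      # of the crack images, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for 100 crack and 100 noncrack test images.
accuracy, precision, recall, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

In this hypothetical case the classifier misses more cracks than it raises false alarms, so recall (0.9) is lower than precision (about 0.947), and the F1-score lies between the two.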

Figures 8 and 9 show the images, together with the respective probability of correct classification.

4.2. Comparative Experiments

To verify the performance of the model, we compare the proposed model with other classic convolutional neural networks on the same dataset. Figure 10 shows the change in loss and accuracy during the training and validation process. The number of parameters of each model is given in Table 1. The results of the different methods are compared in Table 2. Table 3 compares the training time of the four models, and Table 4 shows their sizes. Among the four models, MobileNetV2 [37] has the smallest number of parameters in the task of detecting concrete surface cracks, but its test accuracy is clearly lower than that of the other three models. Although the accuracy of EfficientNetB0 on the test dataset is slightly lower than that of DenseNet201 [38] (by 0.21%), the model is 3.5x smaller and trains 2.5x faster (average training time per epoch). At the same time, its number of parameters is reduced by 77.89%. EfficientNetB0 thus achieves a good balance among accuracy, model size, and training speed and is quite efficient for the crack detection task. It can also be seen from Table 2 that when the network is tested on a dataset that is quite different from the training dataset, its performance drops slightly. This drop is mainly caused by the changes in the background conditions of the images, some features of which have not been learned well by the network.

It is worth noting that, in Figure 10, the validation accuracy is slightly higher than the training accuracy during the training process. Two reasons can account for this phenomenon. On the one hand, we use transfer learning to train the model. The network was initialized with weights pretrained on ImageNet, so the model has a better feature extraction ability and its features are more effective. On the other hand, this phenomenon results from the use of dropout, which behaves differently during training and validation. Dropout forces the neural network to become a very large set of weak classifiers: a single classifier does not have a very high classification accuracy, and only when they are connected together do they become more powerful. During training, dropout cuts off a random subset of these classifiers, so the training accuracy is affected. During validation, dropout is automatically turned off and all the weak classifiers in the neural network are used, so the validation accuracy is higher.
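The different behavior of dropout in training and validation can be sketched in plain Python. The sketch uses the "inverted dropout" formulation common in deep learning frameworks, in which the kept activations are rescaled during training; the dropout rate of 0.5 and the function name are assumptions, since the rate used in the experiments is not stated.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: active only in training, identity otherwise."""
    if not training or rate == 0.0:
        return list(values)              # validation/test: all units are used
    keep = 1.0 - rate
    # Each unit is kept with probability `keep` and rescaled by 1 / keep,
    # so the expected activation matches the one seen at validation time.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [0.2, 0.4, 0.6, 0.8]
rng = random.Random(0)
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
val_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

During training a random subset of units is zeroed in every forward pass, so the network predicts with a thinned version of itself, whereas at validation time the full network is used.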

5. Conclusions

In this paper, a concrete surface crack detection model based on transfer learning and a convolutional neural network is proposed. EfficientNetB0 is a highly effective convolutional neural network. Its last fully connected layer is replaced by a new fully connected layer with two output classes. The newly added fully connected layer is initialized randomly, and the remaining weights are initialized with pretrained weights. A comparison with other models shows that our model achieves a good balance among accuracy, model size, and training speed. In addition, when tested on crack images taken at other locations, our model also shows good performance and generalization capability. Our model is an efficient crack detection model and a good choice where computing resources are limited. Traditional crack detection mostly focuses on identifying the cracks in an image. However, it is also important to characterize the severity of the cracks, which is an area that is often overlooked. We will devote future research to this work.

Data Availability

The code used in this paper is available from the author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (grant no. 51579089).