Introduction

Image classification is now widely used across many fields and plays an important role in them. Image recognition underpins applications such as facial recognition, image classification, and video analysis. Deep learning has become a core topic in machine learning and achieves outstanding results in image recognition [1]. By processing image features through multilayer structures, deep learning greatly improves recognition performance [2]. As image recognition and deep learning develop rapidly, more and more fields benefit from them: the fresh supply chain, factories, and supermarkets, among others, increasingly rely on these technologies to develop further. In other words, applying image recognition and deep learning in the logistics and supply chain field has become a trend. For example, image recognition can guide logistics and transportation paths, alleviating the problem of automated transport vehicles making mistakes due to large path identification errors [3]. Another example is fruit and vegetable classification, where deep learning can extract image features effectively and then perform classification.

In the past, fruit picking and processing relied on manual methods, wasting a great deal of labor [4]. More recently, researchers have applied near-infrared imaging, gas sensors, and high-performance liquid chromatography devices to analyze fruit. However, these methods require expensive devices (various types of sensors) and professional operators, and their overall accuracies are commonly below 85% [4]. With the rapid development of 4G communication and the widespread adoption of mobile Internet devices, people generate vast quantities of images, sound, video, and other data, and image recognition technology has gradually matured. Image-based fruit identification has attracted researchers' attention because of its inexpensive hardware and excellent performance [4]. Intelligent fruit identification can be used not only in the early picking stage but also in later picking and processing stages. Fruit recognition based on deep learning can significantly improve recognition performance and helps promote the development of smart agriculture. Compared with the combination of hand-crafted features and traditional machine learning, deep learning extracts features automatically and performs better, so it has gradually become the mainstream approach to intelligent identification [5]. For instance, Rocha et al. [6] presented a unified method that combines features and classifiers, and Tu et al. [7] developed a machine vision method to detect passion fruit and identify maturity using RGB-D images. Fruit and vegetable classification is challenging because it is hard to give each kind of fruit an adequate definition. Nevertheless, accurate fruit and vegetable classification matters in the fresh supply chain for several reasons. First, automatic classification reduces labor cost, since factories no longer need workers for this task and the savings can be invested elsewhere. Second, accurate classification supports automated fruit packing and transportation [5]. Packing and transportation are two core stages in these settings, and errors there harm downstream processes, for instance by delaying when customers receive their fruits and vegetables. Third, accurate classification saves time and raises efficiency: after sufficient training, automatic classification reaches a high standard, which guarantees the accuracy of the results. Moreover, unlike humans, an automatic classifier retains what it has learned and can repeat the classification process indefinitely, saving considerable time and greatly enhancing efficiency [8].

The development of the attention mechanism and the autoencoder is steadily improving the performance of deep learning. The attention mechanism in a deep learning model simulates the attention of the human brain [8]. Attention mechanisms were first presented in image research and were later used in wider fields [9]. When people observe an image, they focus on its important parts rather than attending to every pixel. As attention mechanisms have become widespread, more cutting-edge models have been created and adopted in various studies. Attention mechanisms can be grouped along four dimensions: number of sequences, number of abstraction levels, number of positions, and number of representations [10]. The number of sequences distinguishes distinctive, co-attention, and self-attention types; the number of abstraction levels distinguishes single-level and multi-level types; the number of positions distinguishes soft, hard, and local types; and the number of representations distinguishes multi-representational and multi-dimensional types. The autoencoder, meanwhile, is widely used as an effective feature extraction method. An autoencoder (AE) is a special artificial neural network used in semi-supervised and unsupervised learning and is composed of an encoder and a decoder [11]. Its function is to perform representation learning on the input, using the input itself as the learning target [12].

This paper makes two main contributions. (1) Earlier work relied on manual features chosen by experts; when there are many fruit subspecies, identifying a specific subspecies with hand-crafted feature extraction is very difficult. We construct a model that identifies specific fruit subspecies accurately. (2) An attention module and a CAE are integrated into DenseNet, which refines the features of fruit images and improves the interpretability of the method.

In this paper, we develop a hybrid deep learning-based fruit recognition method, attention-based densely connected convolutional networks with convolutional autoencoder (CAE-ADN), which uses a convolutional autoencoder to pre-train the network on the images and an attention-based DenseNet to classify them. Experimental results illustrate the effectiveness of the method, and the model can improve the efficiency of related work.

The remainder of this paper is organized as follows. Section 2 reviews related work on fruit classification. Section 3 details the methodology. Section 4 describes the experimental setup and results, together with a comparison with other studies. Finally, Sect. 5 concludes this work.

Related work

With the rapid development of machine vision technology, automatic sorting based on machine vision has entered production and processing. Given the low efficiency and accuracy of traditional sorting under today's huge fruit output, machine vision and deep learning can clearly improve both. Under real conditions, however, images are affected by lighting, fruit reflections, and occlusion; for example, the varied shapes and colors of fruit make identification and localization difficult under different lighting and noise conditions. In addition, because the color and texture features of a fruit image depend on the growth period, recognition becomes even more complex.

In earlier studies, color, texture, and edge properties were used to categorize fruit [13,14,15]. Garcia et al. [16] used artificial vision for fruit recognition, extracting shape, color chromaticity, and texture features. To distinguish among large numbers of object categories, low- and mid-level features have been applied [17,18,19]. Regarding product classification, the first attempt at a fruit recognition system for supermarkets, which considered texture, color, and density, must be mentioned [20]; its accuracy could reach nearly 95% in some situations. Later, Jurie et al. and Agarwal et al. [21, 22] decomposed the classification problem into the recognition of different parts, i.e., the features of each object class. These techniques were called bag-of-features, and they showed promising results even though they did not model spatial constraints between features [23, 24].

Beyond color, texture, and edge properties, many other methods have been used for fruit and vegetable classification. For example, some scholars use gas sensors, near-infrared imaging, and high-performance liquid chromatography devices to scan the fruit [25,26,27]. Fei-Fei et al. [28] introduced prior knowledge into the estimation of the distribution, reducing the number of training examples to around ten images while preserving a good recognition rate. Even with this improvement, the number of parameters still grows exponentially with the number of parts, which makes the approach impractical for the problem considered here, where on-line operation requires speed. Moreover, the sensor-based methods, despite their expensive devices, do not deliver good results, with accuracies below 85% [4]. To address this, later scholars turned to image-based fruit classification for its inexpensive hardware and strong performance. Wang et al. [29] applied a backpropagation neural network (BPNN) using fractional Fourier entropy (FRFE) features. Lu et al. [30] proposed an improved hybrid genetic algorithm (IHGA) to replace the BPNN. Zhang et al. [31] presented biogeography-based optimization with a feedforward neural network (BBO-FNN), which achieved higher accuracy, and Zhang et al. [29] built a categorization model using a fitness-scaled chaotic artificial bee colony (FSCABC) method in place of kernel SVM.

According to past studies, two main problems must be solved to improve the accuracy of fruit recognition. One limitation is that some scholars used manual features chosen by experts; when there are many fruit subspecies, identifying a specific subspecies with hand-crafted feature extraction is very difficult. The other is that the accuracy of existing models is not sufficient to support the recognition of dozens of fruits in related fields [32, 33].

Method

Model design

To exploit all fruit features contained in an image, we use an attention module to force the network to learn high-level features. Attention not only indicates where to focus but also enhances the representation of features. Woo et al. [34] proposed the convolutional block attention module (CBAM), an effective attention module that can be widely used to boost the representation power of convolutional neural networks (CNNs). In addition, given the complexity of fruit features, traditional CNNs are difficult to train on these images; densely connected convolutional networks (DenseNet) [35] are better suited to the proposed problem thanks to improved feature delivery. The problem of vanishing gradients is resolved by a direct path to all preceding layers that routes residuals during backpropagation. We therefore propose an attention-based DenseNet to train on the images.

An autoencoder (AE) is a special artificial neural network applied to unsupervised learning and efficient coding [36]. Hinton and the PDP Group [37] first proposed AEs in the 1980s to address the problem of "backpropagation without a teacher" by using the training data themselves as the teacher [38]. Nowadays, AEs are widely applied in learning generative models [39], and the convolutional autoencoder (CAE) also performs well in image identification [40].

In this work, an attention-based densely connected convolutional network with convolutional autoencoder (CAE-ADN) framework is developed, which uses a convolutional autoencoder to pre-train an attention-based densely connected convolutional network. The whole framework of CAE-ADN is illustrated in Fig. 1. In the first part of the framework, an unsupervised method is applied over a set of images to pre-train the greedy layer-wise CAE. We use the CAE structure to initialize a set of weights and biases of the ADN. In the second part, the supervised ADN is trained with the ground truth. The final part of the framework predicts the fruit category.

Fig. 1 The structure of the CAE-ADN

Attention-based DenseNet

Attention mechanism

CBAM is the convolutional block attention module, an attention mechanism that combines spatial and channel attention; compared with channel-only attention, it achieves better results. An intermediate feature map \(F \in {\mathbb{R}}^{C \times H \times W}\) is given as the input of CBAM, and a 1D channel attention map \(M_{{\text{c}}} \in {\mathbb{R}}^{C \times 1 \times 1}\) and a 2D spatial attention map \(M_{{\text{s}}} \in {\mathbb{R}}^{1 \times H \times W}\) are computed consecutively, as shown in Fig. 2. The attention structure is summarized as:

$$ F^{\prime} = M_{{\text{c}}} \left( F \right) \otimes F $$
(1)
$$ F^{\prime\prime} = M_{{\text{s}}} \left( {F^{\prime}} \right) \otimes F^{\prime}, $$
(2)

where \(\otimes\) denotes element-wise multiplication and \(F^{\prime\prime}\) is the final refined output of CBAM. Each attention module is described below.

Fig. 2 Structure of CBAM

First, we aggregate the spatial information of a feature map using average-pooling and max-pooling layers, generating two different spatial context descriptors: \(F_{{{\text{avg}}}}^{{\text{c}}}\) denotes the average-pooled features and \(F_{\max }^{{\text{c}}}\) the max-pooled features. Both \(F_{{{\text{avg}}}}^{{\text{c}}}\) and \(F_{\max }^{{\text{c}}}\) are then fed to a shared network to compute the channel attention map \(M_{{\text{c}}} \in {\mathbb{R}}^{C \times 1 \times 1}\). To limit the number of parameters, the hidden activation size is set to \({\mathbb{R}}^{C/r \times 1 \times 1}\), where \(r\) denotes the reduction ratio. In short, the channel attention is computed as:

$$ \begin{aligned} M_{{\text{c}}} \left( F \right) & = \sigma \left( {{\text{MLP}}\left( {{\text{AvgPool}}\left( F \right)} \right) + {\text{MLP}}\left( {{\text{MaxPool}}\left( F \right)} \right)} \right) \\ & = \sigma \left( {W_{1} \left( {W_{0} \left( {F_{{{\text{avg}}}}^{{\text{c}}} } \right)} \right) + W_{1} \left( {W_{0} \left( {F_{\max }^{{\text{c}}} } \right)} \right)} \right), \\ \end{aligned} $$
(3)

where \(\sigma\) is the sigmoid function and \(W_{0} \in {\mathbb{R}}^{C/r \times C}\) and \(W_{1} \in {\mathbb{R}}^{C \times C/r}\) are the weights of the MLP; \(W_{0}\) and \(W_{1}\) are shared by both inputs, and \(W_{0}\) is followed by a ReLU activation. Pooling along the channel axis has been shown to highlight informative regions effectively [41,42,43]. A convolution layer is then applied to the concatenated feature descriptor to compute a spatial attention map \(M_{{\text{s}}} \left( F \right) \in {\mathbb{R}}^{H \times W}\), which encodes where to focus. The detailed operation follows [34].

The channel information of the feature map is aggregated by two pooling operations, yielding two 2D maps: \(F_{{{\text{avg}}}}^{{\text{s}}} \in {\mathbb{R}}^{1 \times H \times W}\), the average-pooled features across the channel axis, and \(F_{\max }^{{\text{s}}} \in {\mathbb{R}}^{1 \times H \times W}\), the max-pooled features across the channel axis. In short, the spatial attention is computed as [34]:

$$ \begin{aligned} M_{{\text{s}}} \left( F \right) & = \sigma \left( {f^{7 \times 7} \left( {\left[ {{\text{AvgPool}}\left( F \right);{\text{MaxPool}}\left( F \right)} \right]} \right)} \right) \\ & = \sigma \left( {f^{7 \times 7} \left( {\left[ {F_{{{\text{avg}}}}^{{\text{s}}} ;F_{\max }^{{\text{s}}} } \right]} \right)} \right), \\ \end{aligned} $$
(4)

where \(\sigma\) represents the sigmoid function and \(f^{7 \times 7}\) denotes a convolution operation with the filter size of \(7 \times 7\).
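To make Eqs. (1)–(4) concrete, below is a minimal TensorFlow/Keras sketch of CBAM, not the authors' released code: it assumes channels-last tensors (B, H, W, C) instead of the C × H × W notation above, and the reduction ratio r = 16 is the default suggested in Woo et al. [34], not a value fixed by this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttention(layers.Layer):
    """Channel attention of Eq. (3): a shared MLP over average- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = tf.keras.Sequential([
            layers.Dense(channels // reduction, activation="relu"),  # W0 + ReLU
            layers.Dense(channels),                                  # W1
        ])

    def call(self, f):
        avg = tf.reduce_mean(f, axis=[1, 2])             # F_avg^c, shape (B, C)
        mx = tf.reduce_max(f, axis=[1, 2])               # F_max^c, shape (B, C)
        mc = tf.sigmoid(self.mlp(avg) + self.mlp(mx))    # M_c(F), shape (B, C)
        return f * mc[:, None, None, :]                  # F' = M_c(F) ⊗ F (Eq. 1)

class SpatialAttention(layers.Layer):
    """Spatial attention of Eq. (4): a 7x7 conv over channel-pooled maps."""
    def __init__(self):
        super().__init__()
        self.conv = layers.Conv2D(1, 7, padding="same", activation="sigmoid")

    def call(self, f):
        avg = tf.reduce_mean(f, axis=-1, keepdims=True)  # F_avg^s, (B, H, W, 1)
        mx = tf.reduce_max(f, axis=-1, keepdims=True)    # F_max^s, (B, H, W, 1)
        ms = self.conv(tf.concat([avg, mx], axis=-1))    # M_s(F'), (B, H, W, 1)
        return f * ms                                    # F'' = M_s(F') ⊗ F' (Eq. 2)

class CBAM(layers.Layer):
    """Channel attention followed by spatial attention (Eqs. 1-2)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def call(self, f):
        return self.sa(self.ca(f))
```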

Attention-based dense block

Huang et al. [35] introduced direct connections from any layer to all subsequent layers. To refine the features of each layer, we combine CBAM and DenseNet so that the network pays more attention to the scale features of targets.

Figure 3 illustrates the structure of the attention-based dense block. The \(\ell\)th layer takes the feature maps of all preceding layers, \(x_{0} ,x_{1} , \ldots ,x_{\ell - 1}\), as input:

$$ x_{\ell } = H_{\ell } \left( {\left[ {x_{0} ,x_{1} , \ldots ,x_{\ell - 1} } \right]} \right), $$
(5)

where \(\left[ {x_{0} ,x_{1} , \ldots ,x_{\ell - 1} } \right]\) denotes the concatenation of the feature maps produced in layers \(0, \ldots ,\ell - 1\); the multiple inputs of \(H_{\ell } \left( \cdot \right)\) in Eq. (5) are concatenated into a single tensor. Following Huang et al. [35], \(H_{\ell } \left( \cdot \right)\) is defined as a composite function of four sequential operations: batch normalization (BN) → rectified linear unit (ReLU) → 3 × 3 convolution (Conv) → CBAM. The other settings of the attention-based dense block, including growth rate, bottleneck layers, and compression, are the same as in DenseNet. A code sketch follows the figure below.

Fig. 3 Structure of attention-based dense block
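As a sketch of Eq. (5), the block below reuses the hypothetical CBAM layer from the previous listing; bottleneck layers and compression, which the paper keeps identical to DenseNet, are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_dense_block(x, num_layers, growth_rate):
    """Attention-based dense block: each layer H_l applies
    BN -> ReLU -> 3x3 Conv -> CBAM to the concatenation of all
    preceding feature maps [x_0, ..., x_{l-1}] (Eq. 5)."""
    features = [x]
    for _ in range(num_layers):
        h = layers.Concatenate()(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same", use_bias=False)(h)
        h = CBAM(growth_rate)(h)  # CBAM layer from the sketch above
        features.append(h)        # dense connectivity: output feeds all later layers
    return layers.Concatenate()(features)
```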

Attention-based DenseNet

The input to the attention-based DenseNet is the image, followed by a convolution layer with a 7 × 7 filter, whose output is denoted \(D_{0}^{1}\):

$$ D_{i}^{q} ={ \circledcirc }\left( {D_{{i - 1}}^{p} ,W_{i}^{d} } \right), $$
(6)

where \(D_{i}^{q}\) denotes the \(i\)th feature map of the attention-based DenseNet with channel size \(q\), \({ \circledcirc }\) denotes a set of operations (an attention-based dense block followed by a \(1 \times 1\) convolution layer and a \(2 \times 2\) average-pooling layer with stride 2), and \(W_{i}^{d}\) denotes the parameters of the \(i\)th stage of the attention-based DenseNet.
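The sketch below shows how the \({ \circledcirc }\) operation of Eq. (6) stacks into a backbone; the initial channel count, block sizes, and 0.5 compression factor are illustrative assumptions, not values reported in the paper.

```python
from tensorflow.keras import layers

def adn_backbone(image, block_sizes, growth_rate=32):
    """Stacked Eq. (6): D_i = ⊚(D_{i-1}, W_i), where ⊚ is an attention-based
    dense block followed by a 1x1 conv and 2x2 average pooling, stride 2."""
    x = layers.Conv2D(64, 7, strides=2, padding="same")(image)  # initial 7x7 conv -> D_0
    for n_layers in block_sizes:
        x = attention_dense_block(x, n_layers, growth_rate)     # from the sketch above
        x = layers.Conv2D(x.shape[-1] // 2, 1)(x)               # 1x1 conv (0.5 compression assumed)
        x = layers.AveragePooling2D(pool_size=2, strides=2)(x)  # 2x2 average pooling, stride 2
    return x
```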

Stacked convolutional autoencoders

The AE is a well-known learning algorithm derived from the idea of sparse coding and has a great advantage in data feature extraction. A traditional AE consists of an encoder and a decoder, and uses the backpropagation algorithm to find the optimal solution that makes the reconstruction equal to the input. Traditional AEs ignore the 2D image structure, whereas the convolutional autoencoder replaces the fully connected layers with convolutional layers; the principle is otherwise the same. It downsamples the input to a smaller latent representation and forces the autoencoder to learn a compressed version of the input [44, 45]. When processing high-dimensional data such as images, especially three-channel color images, a traditional AE generates many redundant parameters. Moreover, because the layer parameters of a traditional AE are global, it cannot preserve spatial locality, which slows network learning.

The structure of the CAE is shown in Fig. 4. It has \(N\) attention-based dense blocks, with adjacent blocks connected by a convolution operation and a pooling operation. The output is the same as the input image, so the structure can be pre-trained with a supervised training method (the input itself serves as the target), and there is a convolution operation after the input and before the output.

Fig. 4 The structure of the CAE section
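The rough sketch below illustrates how such a CAE could be assembled and trained for reconstruction. Fig. 4 does not fix the decoder, so the upsampling mirror, the block count, and all layer sizes here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cae(input_shape=(100, 100, 3), n_blocks=2):
    """Encoder of attention-based dense blocks joined by conv + pooling,
    followed by an assumed upsampling decoder that reconstructs the input.
    n_blocks must divide the spatial size evenly for shapes to match."""
    inp = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(64, 3, padding="same")(inp)                  # conv after the input
    for _ in range(n_blocks):                                      # encoder
        x = attention_dense_block(x, num_layers=4, growth_rate=32)
        x = layers.Conv2D(x.shape[-1] // 2, 1)(x)
        x = layers.AveragePooling2D(2)(x)
    for _ in range(n_blocks):                                      # decoder (assumed mirror)
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(input_shape[-1], 3, padding="same",
                        activation="sigmoid")(x)                   # conv before the output
    cae = tf.keras.Model(inp, out)
    cae.compile(optimizer="adam", loss="mse")                      # reconstruction objective
    return cae
```

Pre-training would then take the form `cae.fit(images, images, ...)`, since the target is the input image itself.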

Classification and parameter learning

The structure of the ADN is shown in Fig. 5. The \(i\)th attention-based dense block in the ADN is the same as in the CAE, and the ADN retains the first half of the CAE's attention-based dense blocks; if \(N\) is odd, we round up, so the ADN has \({\text{Ceiling}} \left( {N/2} \right)\) attention-based dense blocks. Finally, a SoftMax function makes the final prediction. We train the network by minimizing the logarithmic loss between the predictions and the labels:

$$ \theta = \arg \min_{\theta } \left\{ { - \mathop \sum \limits_{i = 1}^{M} g_{i} \log \left( {p_{i} } \right)} \right\}, $$
(7)

where \(M\) is the number of images in the training set, \(g_{i}\) is the ground truth of the \(i\)th image, and \(p_{i}\) is the output of the neural network after SoftMax. \(\theta\) represents the set of parameters of the framework, which we learn with the Adam optimizer [44] and the backpropagation algorithm.

Fig. 5 The structure of the ADN section
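A sketch of the classification head and the Eq. (7) training objective is shown below; the pooling layer, layer sizes, and the weight-copying pattern in the trailing comment are assumptions rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_adn_classifier(n_classes, input_shape=(100, 100, 3)):
    """ADN head: backbone -> global pooling -> SoftMax, trained by
    minimizing the log loss of Eq. (7) with Adam."""
    inp = tf.keras.Input(shape=input_shape)
    x = adn_backbone(inp, block_sizes=(4, 4))       # Ceiling(N/2) blocks kept from the CAE
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inp, out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",            # -sum_i g_i log(p_i), Eq. (7)
        metrics=["accuracy"],
    )
    return model

# Initializing the ADN from the pre-trained CAE encoder is layer-order
# dependent; one common pattern (an assumption, not the authors' code):
# for adn_layer, cae_layer in zip(adn.layers[:k], cae.layers[:k]):
#     adn_layer.set_weights(cae_layer.get_weights())
```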

Experimental results

Datasets

We use two fruit datasets to test the effectiveness of the model. Mureşan et al. [45] collected a fruit dataset with 26 labels (Fruit 26); each image shows a fruit on its original background, and after background removal the images were downsampled to 100 × 100 pixels. Fruit 26 contains 124,212 fruit images across 26 labels, with 85,260 images for training and 38,952 for testing. Figure 6 shows some samples of Fruit 26.

Fig. 6 Samples of Fruit 26

Hussain et al. [46] collected a fruit dataset with 15 labels (Fruit 15). It covers 15 different kinds of fruit and consists of 44,406 images, all captured on a clear background at a resolution of 320 × 258 pixels. Figure 7 shows some samples of Fruit 15.

Fig. 7 Samples of Fruit 15

Experimental implementation

The loss is the cross-entropy between the predicted results and the ground truth, defined as:

$$ L = - \mathop \sum \limits_{i = 1}^{M} g_{i} \log \left( {p_{i} } \right), $$
(8)

where \(M\) represents the number of images in the training dataset, \(g_{i}\) denotes the ground truth of the \(i\)th image, and \(p_{i}\) is the output of the neural network after SoftMax.

Accuracy is defined as:

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}, $$
(9)

where \({\text{TP}}\), \({\text{TN}}\), \({\text{FP}}\), and \({\text{FN}}\) represent True Positives, True Negatives, False Positives, and False Negatives, respectively.
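For reference, here is a generic NumPy sketch of the Top-1/Top-5 accuracies reported in the results; in the multi-class case, Eq. (9) reduces to the fraction of correctly classified samples.

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring
    classes: k=1 matches Eq. (9) for multi-class problems, k=5 gives the
    Top-5 accuracy reported in Table 2."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest scores per row
    return np.mean([y in row for y, row in zip(labels, top_k)])
```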

Considering GPU memory, we set the batch size to 10. The learning rate starts at 0.0001 and is halved at epochs 50, 100, and 150, and the network is fine-tuned with a learning rate of 0.000001 for the last 10 epochs. We train the network for 200 epochs in total.
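This schedule can be expressed as a simple Keras callback; `model`, `x_train`, and `y_train` below are placeholders for the compiled network and training data.

```python
import tensorflow as tf

def lr_schedule(epoch):
    """Schedule described above: 1e-4, halved at epochs 50, 100, and 150,
    then 1e-6 for the final 10 of the 200 epochs."""
    if epoch >= 190:
        return 1e-6
    lr = 1e-4
    for milestone in (50, 100, 150):
        if epoch >= milestone:
            lr *= 0.5
    return lr

model.fit(x_train, y_train, batch_size=10, epochs=200,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```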

In this paper, we evaluate ADN-q variants of CAE-ADN with q ∈ {121, 169, 201}, where q denotes the number of layers in the ADN. Table 1 shows the ADN-q architectures. Each CAE-p is generated from the corresponding ADN-q to pre-train the structure. All experiments are conducted on a single GPU (NVIDIA GeForce RTX 2080) with 8 GB RAM using TensorFlow.

Table 1 ADN of CAE-ADN architectures

Results

Comparison with baselines

We compare CAE-ADN with two baselines: ResNet-50 [47] and DenseNet-169 [35] (two state-of-the-art methods with parameter counts similar to CAE-ADN). For ResNet-50, we use a 3 × 3 kernel at each convolutional layer. For DenseNet-169, the growth rate k is set to 4; for convolutional layers with 3 × 3 kernels, each side of the input is zero-padded by one pixel to keep the feature-map size fixed, and a 1 × 1 convolution followed by 2 × 2 average pooling serves as the transition layer between two contiguous dense blocks. The training configurations of CAE-ADN, ResNet-50, and DenseNet-169 are identical for a fair comparison. Table 2 reports the Top-1 and Top-5 accuracy of fruit classification for the different networks: ADNs without pre-training outperform ResNet-50 and DenseNet-169, and ADNs with pre-training outperform ADNs without it. ADN-169 with pre-training is the best CAE-ADN configuration, with Top-1 accuracies of 95.86% and 93.78% and Top-5 accuracies of 99.98% and 98.78% on Fruit 26 and Fruit 15, respectively.

Table 2 Performance of fruit classification with different networks

Performance of each class

The precision and recall on the testing datasets are presented in Tables 3 and 4. The results show that the model performs well on all kinds of fruit. The model is also good at fruit color recognition and can distinguish different varieties of the same kind of fruit: for instance, the precisions of Apple Red 1, Apple Red 2, and Apple Red 3 are 96.07%, 95.36%, and 95.28%, and their recalls are 95.67%, 95.58%, and 94.89%, respectively. Owing to differences in shooting conditions, accuracy on Fruit 26 is generally higher than on Fruit 15.

Table 3 Performance of each class (Fruit 26)
Table 4 Performance of each class (Fruit 15)

Comparison with other studies

We compared our framework with five up-to-date methods: PCA + FSCABC [48], WE + BBO [49], FRFE + BPNN [50], FRFE + IHGA [51], and a 13-layer CNN [52]. Some use traditional machine learning with feature engineering, and others use simple BPNN or CNN structures. The overall accuracies are shown in Table 5; our method reaches 95.86% and 93.78% on Fruit 26 and Fruit 15, showing that CAE-ADN outperforms these state-of-the-art approaches. We conclude that the attention module and CAE pre-training improve the performance of a CNN model on the fruit classification problem. Compared with traditional recognition methods based on hand-crafted features, deep learning algorithms based on convolutional neural networks have stronger adaptability, better robustness, and higher recognition accuracy. Image recognition based on deep learning learns abstract features itself, which avoids the difficulty of designing specific features for specific tasks and makes the whole recognition process more intelligent. Owing to its strong learning ability, the model transfers well to other tasks: only the convolutional neural network needs to be retrained. Considering the development of IoT technology [34, 53], it is meaningful to establish a decision-making system [54,55,56], and the proposed method could be used in practice.

Table 5 Comparison with other studies

Conclusion

In this work, we develop a hybrid deep learning-based fruit image classification approach that uses a convolutional autoencoder to pre-train the network on the images and an attention-based DenseNet to extract image features. In the first part of the framework, an unsupervised method is applied over a set of images to pre-train the greedy layer-wise CAE; the CAE structure initializes a set of weights and biases of the ADN. In the second part, the supervised ADN is trained with the ground truth, and the final part of the framework predicts the fruit category. We test the model on two fruit datasets; as shown in Table 5, our method reaches accuracies of 95.86% and 93.78% on Fruit 26 and Fruit 15, demonstrating that CAE-ADN outperforms the other approaches. We conclude that the attention module and CAE pre-training improve the performance of a CNN on the fruit classification problem. In addition, compared with traditional algorithms, deep learning simulates human visual perception through a neural network: each layer receives the output of the layer below, abstracting local features from the lowest level upward, automatically learning the hidden features of the data, and finally perceiving the whole target.

Our method remains an offline experiment for now. In the future, we will build an application to test it in actual supply chain scenarios and expand the dataset so that the model can handle more complex classification tasks. The proposed model cannot yet be installed in some embedded machine vision systems, and it still works on two-dimensional images, so it cannot truly localize points in space. In future work, we will study the transformation from 2-D to 3-D coordinates, which could be combined with a Kinect sensor for three-dimensional positioning.