Introduction

Natural disasters, climate change [54], and plant diseases [51] are among the many factors that threaten food security. Plant diseases in particular may cause great losses, not only for farmers but also for the global economy. For instance, the International Potato Center (CIP) reports that around 15% of potato production is lost to late blight disease alone [24]. Globally, plant diseases cause more than 20% crop loss annually [49]. Plant diseases are an even bigger threat to smallholder farmers, who have limited knowledge, resources, and financial means to deal with them. This is the case in Indonesia, where smallholder farmers comprise the majority of farmers.

Early detection of plant diseases is effective in reducing the risk of crop failure, as farmers can perform curative and preventive actions to avoid further damage. Detecting plant diseases by naked-eye inspection requires human experts. A large team of experts that continuously monitors the condition of the farms would be needed, and this would be very costly, especially if the farms are large and dispersed. In countries like Indonesia, where farmers are distributed over large areas and separated by seas and islands, the government may have limited capacity to provide experts, especially in remote areas. Therefore, automatic detection of plant diseases is needed.

Some works indicate the diseases of plants by detecting their level of stress. This can be detected using various methods, for instance, hyperspectral and multispectral sensing [5, 13, 36, 62], thermal imaging [32], chemical substances [6], and/or molecular and genetic level analysis [34, 35]. However, these approaches are expensive and require experts to operate, and thus are unattainable for smallholder farmers. Other studies apply various image processing techniques combined with machine learning to search for disease patterns, which is one solution for providing easily accessible aids for small farmers. Given enough image data of infected plants, we can train machine learning systems that are able to identify the diseases from the corresponding data. Various machine learning methods are applied, such as Support Vector Machines (SVM) [10], k-Nearest Neighbor (k-NN), Naive Bayes, and Random Forest [19]. For instance, spectroscopy of plant tissues is used in [26]. In another study [14], multispectral data are used as features for neural networks to detect diseases on cucumber. SVM is used as the classifier for Huanglongbing citrus disease detection with fluorescence images in [57]. PCA and Linear Discriminant Analysis are used to detect rice blast disease in [59]. For a review of the use of spectral data for plant disease detection, readers may refer to [47]. Unfortunately, spectral data are extracted using expensive equipment [44].

Many efforts have been made to develop machine learning based detectors that work with standard image data. Using such data, the efforts focus on developing hand-engineered features from images, usually based on human knowledge of the particular problem and various transformations, such that the features discriminate well between classes. Several examples of engineered features are Histogram of Oriented Gradients (HOG) [11], Scale-Invariant Feature Transform (SIFT) [33], Speeded Up Robust Features (SURF) [4], and Local Binary Patterns (LBP) [38]. For example, k-means clustering with a neural network is used in [2]. In [42], combinations of the Gabor wavelet transform (GWT) and the gray level co-occurrence matrix (GLCM) are used as features with k-NN as the classifier. In another study, SIFT features with principal component analysis (PCA) are used with Learning Vector Quantization (LVQ) [41]. A variant of SVM with GLCM and LBP features is used in [1] for citrus. Unfortunately, these engineered features usually require complex computations and processes [47].

Recent advancements in machine learning, called deep learning (DL), pave the way for more accurate classification given simpler features [30]. DL methods have been used in many machine learning tasks, such as speech recognition [12, 46], natural language processing [9, 48], and computer vision [56]. In the areas of computer vision and object recognition [45], DL is the state of the art for many applications. Most studies in computer vision and object recognition use convolutional neural networks (CNN) [31] and their variants, such as AlexNet [29], VGGNet [50], GoogleNet [53], Xception [7], MobileNet [20], ZFNet [61], and ResNet [18]. As reported in [45], CNNs have clearly become the dominant technology for object recognition.

For disease detection, many studies have also implemented the aforementioned architectures. In [37], GoogleNet and AlexNet are used to detect diseases of 19 plants. Meanwhile, ResNet and VGGNet are used to detect diseases on tomatoes in [16]. AlexNet and VGGNet are used to detect diseases of 25 plants in [15]. A simplified VGGNet is proposed in [39] for potato disease detection. MobileNet combined with the Single Shot Multibox Detector (SSD) model is used to detect diseases on cassava [43]. These works use variants of convolutional neural networks with decent-resolution image data (above \(128 \times 128\) pixels). Most studies also have not evaluated the robustness of the methods when tested in non-ideal conditions, where the image data may be blurred or have different orientations and/or resolutions than the training data. To improve robustness, multicondition training of a CNN for tea disease detection is proposed in [60].

In addition, in most implementations the DL models are placed on servers due to the large size of the networks. The targeted image data are transferred to the server for processing and classification, and the results are transmitted back to the devices. Most studies work with decent-quality images (with resolutions of \(256 \times 256\)). However, many developing countries such as Indonesia may still have limited internet connectivity, especially in remote areas where many small farmers live. Therefore, it is better to have systems that are trained with low resolution images. In addition, the systems must be robust against various image transformations, as it is very likely that the images are taken under different conditions than the training data.

Meanwhile, it is well known that increasing the size of deep learning networks by increasing their depth is effective in improving performance. Later variants of CNN usually come with an increasing number of convolutional layers. But this causes two drawbacks. First, it increases the computational load due to the high number of model parameters. Second, very deep networks may be prone to overfitting, especially when the amount of training data is limited, and hence adding more layers does not necessarily improve performance. This is known as the vanishing gradient problem [23].

One solution to this problem is to use skip connections. Skip connections are connections designed to skip a few layers of the network so that later layers can reuse activations from earlier layers during training, preventing the gradients from vanishing. The residual network is one example of skip connections; it is applied in ResNet and Xception, where residual connections are added to the outputs of the convolutional layers [18]. However, the additive information fusion in Xception may attenuate the information passed to the next layers. Meanwhile, to reduce the number of parameters, other studies replace convolutional layers with separable convolutions [25], as in MobileNet and Xception. Building on these findings, several newer networks use more branches of skip connections, such as DenseNet [21] and ResNeXt [58]. But such networks usually require more memory to train and hence may not be applicable on machines with limited resources.

The contributions of this paper are twofold. First, we propose a new DCNN architecture that significantly reduces the number of training parameters and overcomes the vanishing gradient problem. We call it Compact Network (ComNet). Our method applies only two branches of concatenation layers to minimize the memory needed for training. The first branch carries information from previous layers, and the second concatenates the output of the first concatenation layer with small-kernel convolutional layers to prevent the gradients from vanishing during training. We evaluate the performance and robustness of five major DCNN architectures on low resolution data: VGGNet [50], AlexNet [29], Xception [7], MobileNet [20], and ResNet [18]. Second, we develop a dataset of tea diseases that consists of five types of diseases that are common in Indonesia plus a healthy class. In addition to the evaluation on our dataset, we also evaluate the methods on a subset of the PlantVillage dataset [37] with reduced resolution, due to our limited training resources. We evaluate the methods on three plants: apple, corn, and potato, with 11 class labels made up of eight types of diseases and three healthy classes.

The remainder of the paper is organized as follows. We explain all the architectures used in this study in the “Convolutional neural networks and their variants” section. Our proposed method is explained in more detail in the “Proposed method” section. We describe our experimental setup in the “Experimental setup” section, and the results are discussed and analysed in the “Results and discussions” section. We conclude the paper in the “Conclusion” section.

Convolutional neural networks and their variants

DL has caught the attention of many researchers in machine learning in recent years. DL systems have won numerous competitions in pattern recognition and machine learning [3, 8]. For object recognition tasks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [45], DL achieves the best performance and outperforms many conventional methods.

Deep learning technologies are mostly based on artificial neural networks (ANNs). In an ANN, perceptrons are stacked in such a way as to approximate the relation between the inputs and the outputs (usually the target classes). DL can therefore be seen as a universal function approximator. By using a large number of hidden layers, DL can model complex relations, allowing the networks to learn various abstractions of the data in different layers. This allows the networks to learn representations of the data by themselves given only the raw information, which is one of the advantages of DL architectures. Thus, it is unnecessary to design handcrafted features, which is a common approach when conventional machine learning methods are used.

For object recognition, deep convolutional neural networks (DCNN) and their variants are mainly used. A DCNN is built from stacked convolutional neural networks (CNNs). A CNN is a variant of the feed-forward network (FFN), where the flow of information has no feedback from the output layer to the previous layers. Like a typical FFN, a CNN consists of input, hidden, and output layers, with the hidden layers typically constructed from convolutional layers, pooling layers, and fully connected layers. The CNN architecture was first proposed in the 1980s [17]. The proposed structure, called the Neocognitron, is very similar to the current CNN except in how the weights are updated: they are updated in an unsupervised manner and pre-wired, whereas in current CNNs the weights are updated using gradient descent based methods [31]. CNNs are applied to many tasks in computer vision [56]. Currently, the CNN is the leading architecture for image recognition, classification, and detection [29].

In a convolutional layer, a convolution operation is applied to the input, as in a standard FFN. The difference is that the nodes in a convolutional layer are connected only to a particular subset of the inputs, unlike an FFN where they are connected to all inputs. This is one advantage of CNNs over FFNs on image data: the local relations between inputs may need to be emphasized before the network learns the more global relations of the images. By doing so, the network can learn abstractions over larger areas, and different kinds of abstraction, as the data pass through the higher layers. Additionally, applying an FFN to images would be impractical as it would produce a significantly larger number of parameters, which can be greatly reduced by using a CNN.
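
To make this parameter reduction concrete, here is a minimal Keras sketch (the layer sizes are toy values of our own choosing, not taken from the paper) comparing a fully connected layer with a convolutional layer on a \(64 \times 64\) RGB input:

```python
# Minimal sketch (toy sizes, not from the paper): parameter count of a
# fully connected layer versus a convolutional layer on a 64x64 RGB image.
import tensorflow as tf

# Fully connected: every output unit is connected to all 64*64*3 inputs.
ffn = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(64),  # (64*64*3)*64 + 64 = 786,496 parameters
])

# Convolutional: each unit sees only a local 3x3 window, and weights are
# shared across all spatial positions.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), input_shape=(64, 64, 3)),
    # 3*3*3*64 + 64 = 1,792 parameters
])

print(ffn.count_params(), cnn.count_params())  # 786496 vs 1792
```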

A pooling layer is usually included after the convolutional layers. In the pooling layer, the outputs of a group of neurons are combined to produce a single node that is passed to the following layers. The most commonly used combination rule is max-pooling, where the output is the maximum value of the grouped neurons. Another approach is average pooling, where the average value of the nearby inputs is used as the output.
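
For illustration, a small example of both pooling rules on a toy input of our own choosing:

```python
# Minimal sketch: max- and average-pooling over 2x2 windows.
import numpy as np
import tensorflow as tf

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)  # one 4x4 feature map
print(tf.keras.layers.MaxPooling2D((2, 2))(x).numpy().squeeze())
# [[ 5.  7.]
#  [13. 15.]]
print(tf.keras.layers.AveragePooling2D((2, 2))(x).numpy().squeeze())
# [[ 2.5  4.5]
#  [10.5 12.5]]
```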

A fully connected layer is usually placed at the top of a CNN architecture. This layer connects every neuron in the previous layer to every neuron in the next layer, so in principle it is the same as a standard multi-layer perceptron network. Its purpose is to find the global relations in the data.

There have been many DCNN architectures. However, due to our limited training resources, we are only able to evaluate five DCNN architectures in this study: AlexNet, VGGNet, MobileNet, Xception, and ResNet. Newer architectures such as DenseNet, ResNeXt, Inception-v4, and Inception-ResNet-v2 [52] are not feasible on our current machines.

AlexNet

AlexNet is proposed in [29]. AlexNet is the winner of ILSVRC 2012, where CNNs gained global recognition for the first time. It achieves significantly better performance than conventional methods. It comprises five convolutional layers and three fully connected layers. Max-pooling is applied after the first, second, and fifth convolutional layers, and dropout is applied after the first and second fully connected layers.

Originally, AlexNet is used for images of size \(224 \times 224\). To accommodate low resolution images (\(64 \times 64\)), we modify the number of filters in the convolutional layers from 96-256-384-384-256 in the original design to 64-256-384-384-256. We also change the kernel size of the first layer from \(11 \times 11\) to \(3 \times 3\). The details of our implementation of AlexNet are shown in Fig. 1.

Fig. 1 AlexNet architecture
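
As a rough illustration, the following Keras sketch follows the layer order described above for \(64 \times 64\) inputs; the kernel sizes of layers two to five and the fully connected widths are our own assumptions, as the text only specifies the filter counts and the first-layer kernel:

```python
# Minimal sketch of the modified AlexNet; kernel sizes after the first
# layer and the FC widths are assumptions, not specified in the paper.
import tensorflow as tf
from tensorflow.keras import layers

def modified_alexnet(num_classes):
    return tf.keras.Sequential([
        layers.Conv2D(64, (3, 3), activation='relu', padding='same',
                      input_shape=(64, 64, 3)),       # 11x11 -> 3x3 kernel
        layers.MaxPooling2D((2, 2)),                  # after 1st conv
        layers.Conv2D(256, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),                  # after 2nd conv
        layers.Conv2D(384, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(384, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(256, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),                  # after 5th conv
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),                          # dropout after FC1
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),                          # dropout after FC2
        layers.Dense(num_classes, activation='softmax'),
    ])
```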

VGGNet

VGGNet is proposed in [50]. It is the runner-up of ILSVRC 2014. VGGNet is similar to AlexNet except that it is deeper, utilizing smaller convolutional kernels: it has 13 convolutional layers instead of the 5 in AlexNet, plus 3 fully connected layers. Even though VGGNet achieved second place in ILSVRC 2014, it is quite a popular architecture for learning features from images due to its simplicity: it uses only \(3 \times 3\) convolutions and \(2 \times 2\) pooling throughout the network. This architecture shows that the network can be improved by simply adding more convolutional layers. One drawback of VGGNet is its size: it has significantly more parameters and requires a longer training time. There are two variants of VGGNet: VGGNet16 with 16 layers and VGGNet19 with 19 layers. Here, we implement VGGNet16. The details of the VGGNet we use are shown in Fig. 2.

Fig. 2 VGGNet architecture
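
A minimal sketch of this pattern of stacked \(3 \times 3\) convolutions and \(2 \times 2\) pooling is shown below; the per-stage filter counts follow the standard VGG16 design, which we assume the paper's variant also uses:

```python
# Minimal sketch of the VGG16 pattern: repeated 3x3 convolutions followed
# by 2x2 max-pooling; stage filter counts assume the standard VGG16 design.
import tensorflow as tf
from tensorflow.keras import layers

def vgg16(num_classes, input_shape=(64, 64, 3)):
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    for num_convs, filters in [(2, 64), (2, 128), (3, 256),
                               (3, 512), (3, 512)]:
        for _ in range(num_convs):                    # only 3x3 convolutions
            model.add(layers.Conv2D(filters, (3, 3), activation='relu',
                                    padding='same'))
        model.add(layers.MaxPooling2D((2, 2)))        # only 2x2 pooling
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model
```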

MobileNet

MobileNet is proposed in [20]. DCNN architectures such as VGGNet require heavy computational loads and a great number of parameters. To make the model smaller, the convolutions in MobileNet are replaced by depthwise separable convolutions. A conventional convolutional layer operates by filtering and summing over the input and a defined filter (a cross-correlation operation, to be exact) to produce a new set of outputs. In a depthwise separable convolution, this operation is factorized into two steps: a \(3 \times 3\) depthwise convolution that filters each input channel separately, and a \(1 \times 1\) pointwise convolution that combines the filtered channels. This factorization yields much smaller computation and model size, since no full convolution over all channels is performed.

The structure of MobileNet is shown in Fig. 3. The first layer is a regular convolutional layer. It is followed by 13 pairs of depthwise and pointwise convolution layers. After that, average pooling and dropout are applied before the final convolutional classification layer.

Fig. 3 MobileNet architecture
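
The parameter saving can be illustrated with a short Keras sketch (toy shapes of our own choosing):

```python
# Minimal sketch: a standard convolution versus its depthwise separable
# factorization (3x3 depthwise + 1x1 pointwise), as used in MobileNet.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 64))

# Standard 3x3 convolution: 3*3*64*128 + 128 = 73,856 parameters.
full = tf.keras.Model(
    inputs, layers.Conv2D(128, (3, 3), padding='same')(inputs))

# Depthwise separable: 3*3*64 + 64 (depthwise) plus 1*1*64*128 + 128
# (pointwise) = 8,960 parameters, roughly an 8x reduction here.
x = layers.DepthwiseConv2D((3, 3), padding='same')(inputs)
x = layers.Conv2D(128, (1, 1))(x)
sep = tf.keras.Model(inputs, x)

print(full.count_params(), sep.count_params())  # 73856 vs 8960
```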

ResNet

ResNet is proposed to overcome the vanishing gradient problem [55]. Studies found that networks such as VGGNet that use small kernels only work up to a certain depth. In ResNet, skip connections are implemented in the form of residual connections. A residual network is illustrated in Fig. 4. The network is designed to learn “the residual” by adding the input of a block to its output.

Fig. 4 Structure of a residual network
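
A minimal sketch of such a residual block in Keras, with illustrative layer sizes of our own choosing, is:

```python
# Minimal sketch of a residual (identity) block: the block's input is
# added back to the output of the stacked layers. Layer sizes are
# illustrative assumptions, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                     # the skip connection
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # add input to output
    return layers.Activation('relu')(y)

inputs = tf.keras.Input(shape=(16, 16, 64))
model = tf.keras.Model(inputs, residual_block(inputs, 64))
```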

In this paper, we implement ResNet50. ResNet is built by stacking Conv Blocks and Identity Blocks, with the structure illustrated in Fig. 5. The image data first pass through a convolutional layer. ResNet50 consists of five Conv Blocks and 13 Identity Blocks. At the final stage, average pooling is applied before passing the result to the softmax classification layer.

Fig. 5 ResNet architecture

Xception

Xception is proposed in [7]. Xception stands for eXtreme inception. Xception is similar to ResNet in that it applies residual connections to enable very deep networks. The difference is that Xception applies inception modules [53]. Inception modules work by concatenating convolutional layers of various sizes; the purpose is to keep good resolution for small areas of the image while also capturing information over larger areas. In Xception, however, the inputs are instead projected into the various output channels using depthwise separable convolution layers. The architecture of Xception is shown in Fig. 6.

Fig. 6 Xception architecture

In Xception, the data first pass through two regular convolutional layers before entering the entry block. In the entry block, two layers of \(3 \times 3\) depthwise separable convolution are combined with a \(1 \times 1\) convolution shortcut. This procedure is repeated three times. Then, in the middle block, three layers of depthwise separable convolution are stacked; this is repeated eight times. In the exit block, two layers of \(3 \times 3\) depthwise separable convolution are combined with a \(1 \times 1\) convolution shortcut, followed by two further depthwise separable convolution layers.
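
As a hedged illustration, a sketch of an entry-flow style block is given below; the exact ordering of activations and normalization in the paper's implementation may differ, and the filter counts are our own assumptions:

```python
# Minimal sketch of an Xception-style entry block: two 3x3 depthwise
# separable convolutions whose output is summed with a 1x1 convolution
# shortcut. Filter counts and activation placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def entry_block(x, filters):
    # 1x1 convolution shortcut, strided to match the pooled main branch.
    shortcut = layers.Conv2D(filters, (1, 1), strides=2, padding='same')(x)
    y = layers.SeparableConv2D(filters, (3, 3), padding='same',
                               activation='relu')(x)
    y = layers.SeparableConv2D(filters, (3, 3), padding='same')(y)
    y = layers.MaxPooling2D((3, 3), strides=2, padding='same')(y)
    return layers.Add()([y, shortcut])               # residual combination

inputs = tf.keras.Input(shape=(64, 64, 32))
model = tf.keras.Model(inputs, entry_block(inputs, 128))
```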

Proposed method

The most straightforward approach to improving the performance of a DCNN is to increase model complexity by adding to the width or depth of the network. However, adding more layers makes the network prone to vanishing/exploding gradients, and as a result the network may be unable to converge. This can be overcome by adding skip connections, i.e. connections that skip some layers. By doing so, the network can reuse activations from much earlier layers, and hence vanishing gradients can be avoided. One implementation of skip connections is the residual network [18]. This concept is applied in Xception, where residual connections are used as skip connections.

Residual networks work as follows. Consider the Sub-Middle Block of Xception (see Fig. 6). The output of the \(l^{th}\) layer of a residual network with transformation function f can be expressed as:

$$\begin{aligned} x_l = f \left( H_l \left( x_{l-1} \right) + x_{l-1} \right) \end{aligned}$$
(1)

where \(H_l\) is the nonlinear transformation due to the stacking process; in Xception, it is a composite of batch normalization, rectified linear units (ReLU), pooling, and convolutional layers. In Xception, a layer receives the outputs of previous layers and adds them to the residual connection. Therefore, vanishing gradients can be avoided.

Following this finding, more recent studies apply more branches of skip connections, as in DenseNet [21], ResNeXt [58], and Inception-v4 [52]. However, the multiple branches of skip connections require larger GPU memory capacity, making these networks inapplicable on computers with limited GPU memory.

To overcome this, we limit the network to only two branches of skip connections. In addition, since the addition process in Xception may lose some information, we concatenate the layers instead. We call these compact modules, since all the inputs are compacted at the output. The comparison between the proposed architecture and residual networks is illustrated in Fig. 7.

Fig. 7 Comparison of a the residual connections in Xception and b the concatenation layers of the proposed system

The detailed implementation of the proposed architecture is shown in Fig. 8 and Table 1. Our network comprises two convolutional layers, followed by three compact modules with transition modules in between, and then global average pooling followed by the output layers. The transition modules are built from a convolutional layer, batch normalization, and average pooling. We refer to this architecture as ComNet.

Fig. 8 The proposed architecture

Table 1 Architectures of the proposed model
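
As a hedged illustration of the compact and transition modules, the Keras sketch below reflects our reading of the textual description above; the filter counts and kernel sizes are assumptions, and the exact configuration is given in Table 1:

```python
# Hedged sketch of a compact module: the first branch concatenates the
# module input with a small-kernel convolution's output, and the second
# branch concatenates that result with a further small-kernel convolution.
# Filter counts and kernel sizes are assumptions (see Table 1).
import tensorflow as tf
from tensorflow.keras import layers

def compact_module(x, filters):
    # First concatenation branch: carry the input information forward.
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    c1 = layers.Concatenate()([x, y])
    # Second branch: concatenate with another small-kernel convolution.
    z = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(c1)
    return layers.Concatenate()([c1, z])

def transition_module(x, filters):
    # Convolution, batch normalization, and average pooling, as described.
    x = layers.Conv2D(filters, (1, 1), padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.AveragePooling2D((2, 2))(x)
```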

Experimental setup

Dataset

We use a subset of the database published in [22]. The database contains 54,306 images of plant leaves with a total of 38 class labels covering 14 plants. From these data, we selected three plants, apple, corn, and potato, due to our limited computational resources, giving a total of 9,176 images for the experiments. The original size of the images is \(256 \times 256\) pixels; we rescaled them to \(64 \times 64\). Sample images from the dataset are shown in Fig. 10.

We also develop a dataset of tea diseases. We collected 11,367 images of tea leaves comprising six classes: one healthy class and five types of diseases that are commonly found in tea, namely blister blight, leafhopper attacks, looper caterpillar attacks, mosquito bug attacks, and yellow mite attacks. The data were collected at the Research Center for Tea and Cinchona, Gambung, West Java, Indonesia, using two digital cameras and five smartphone cameras. All images were taken indoors with only room lighting, at various hours from 8 a.m. to 5.30 p.m. The data are scaled to \(256 \times 256\) and then rescaled down to \(64 \times 64\) for the experiments. This dataset is an extension of the dataset we published in [28, 60]. Sample images from this dataset are shown in Fig. 9.

Fig. 9 Samples of collected images of tea plants

The distribution of the data for each plant and class label is shown in Table 2. We apply various image transformations to the test data: Gaussian blur with a \(5 \times 5\) kernel, median blur with the kernel size set to 5, 90° and 180° rotations, and scaling down to \(32 \times 32\) and \(48 \times 48\); a sketch of these transformations is given below. The sample images for training are shown in Fig. 10, while sample test images (apple) with the resulting transformations are depicted in Fig. 11.
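
A minimal sketch of these test-time transformations using OpenCV (the file path is a placeholder):

```python
# Minimal sketch of the test-time transformations; the path is a placeholder.
import cv2

img = cv2.imread('test_leaf.png')                   # a 64x64 test image
gau_blr = cv2.GaussianBlur(img, (5, 5), 0)          # Gaussian blur, 5x5 kernel
med_blr = cv2.medianBlur(img, 5)                    # median blur, size 5
rot90 = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)    # 90 degree rotation
rot180 = cv2.rotate(img, cv2.ROTATE_180)            # 180 degree rotation
sc32 = cv2.resize(img, (32, 32))                    # scale down to 32x32
sc48 = cv2.resize(img, (48, 48))                    # scale down to 48x48
```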

Table 2 List of plants and class labels used in the experiments with the number of data for each class
Fig. 10 Samples of training images for the PlantVillage dataset

Fig. 11 Samples of apple test images: original and after various transformations

Experimental configurations

For the experiments, we use 10-fold cross-validation for each model. We use a learning rate of 0.00001, the Adam method [27] for adaptive learning, and cross-entropy as the loss function. For PlantVillage, we train all six architectures on all three plants together and set the number of epochs to 100. We also train all the architectures on the tea dataset with the same configuration.
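
A minimal sketch of this training configuration for one fold is given below; `build_model`, `x_train`, `y_train`, `x_test`, and `y_test` are placeholders for the architecture being trained and the fold's data, and the batch size is our assumption as the paper does not state it:

```python
# Minimal sketch of the stated training configuration for one fold;
# build_model() and the data arrays are placeholders (not from the paper).
import tensorflow as tf

model = build_model()  # any of the six evaluated architectures
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),  # lr = 1e-5
    loss='categorical_crossentropy',                            # cross-entropy
    metrics=['accuracy'],
)
history = model.fit(x_train, y_train, epochs=100,               # 100 epochs
                    batch_size=32,                              # assumed
                    validation_data=(x_test, y_test))
```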

Results and discussions

Number of parameters and training time

As shown in Table 3, of all the evaluated architectures, VGGNet has the largest number of parameters whereas MobileNet has the smallest, owing to MobileNet's use of separable convolutional layers. ComNet has only around 9.666 million parameters, much smaller than VGGNet, ResNet, Xception, and AlexNet.

Table 3 The number of parameters (in millions) and the average training time (in seconds) of an epoch for each CNN architectures

It is also clear that the training time of our method is smaller than those of VGGNet, ResNet, Xception, and MobileNet, indicating its smaller computational complexity. The computational times in Table 3 are averaged over 100 iterations for both datasets. The experiments are conducted on an Intel Xeon E5 2.10GHz CPU and a Tesla P100 GPU with 4GB RAM. The result is not surprising, since our architecture has a much smaller number of parameters than VGGNet and Xception. Interestingly, although MobileNet has fewer parameters, ComNet requires less computational time. This may be due to MobileNet's use of depthwise separable convolutions, which were not yet supported by the cuDNN library [40]. Meanwhile, the training time of our method is slightly worse than AlexNet's despite having fewer parameters: the concatenation layers require the network to consume more memory, and since our computational resources have very limited memory, more communication with storage is expected, making training slower.

Performance comparisons

The results are summarized in Table 4 as the average accuracy over 10-fold cross-validation. They clearly show that the proposed method has higher accuracy on both datasets despite its smaller number of parameters. The boxplots are shown in Fig. 12. ComNet achieves more than 2% improvement over AlexNet (the second best) on PlantVillage and almost 5% over VGGNet (the second best) on Tea. Over the 10 folds, ComNet shows quite a small range in the boxplot, suggesting that the performance of the method is consistent. The results are consistent on both datasets.

Table 4 The average accuracy of DCNN architectures
Fig. 12 Box-plots of the accuracies of the evaluated DCNNs using 10-fold cross-validation

The progressions of accuracy and loss on the training and testing data are plotted in Figs. 13 and 14. For PlantVillage, we observe the following. First, on the training data, ResNet's loss is the slowest to converge, followed by MobileNet and VGGNet; ComNet and Xception are comparable in terms of how fast their losses converge. In terms of accuracy, MobileNet and VGGNet are the slowest to converge, as expected since their losses also converge slowly. Interestingly, the training accuracy of ResNet converges quickly despite its slow loss progression. Meanwhile, on the testing data, MobileNet appears to fail to learn, and as a result its performance is the worst among all the networks. It may be stuck in a local minimum, and a larger learning rate may be needed for it to work.

Fig. 13 Progression of a training loss, b training accuracy, c testing loss, and d testing accuracy for all DCNN architectures on PlantVillage. The results are the average values of 10-fold cross-validation

For the tea dataset, we notice similar trends to PlantVillage on the training data. On the testing data, however, we notice that the losses slowly increase after around epoch 25 for MobileNet, Xception, and AlexNet. This indicates that the networks may already be overfitting, and as a consequence, their accuracies on the testing data also slowly decrease. MobileNet also fails to learn in this case.

Fig. 14 Progression of a training loss, b training accuracy, c testing loss, and d testing accuracy for all DCNN architectures on Tea. The results are the average values of 10-fold cross-validation

Our results confirm that DCNNs are very promising for recognizing plant diseases even when trained with low resolution data. We achieve accuracies of more than 90% on PlantVillage for all architectures except MobileNet. Lower accuracies are obtained on the tea dataset; these results strongly indicate the need for more tea data, due to the high variation in its data acquisition. The results also show that DCNN architectures can be used without any complex feature engineering.

Evaluations on the robustness

Table 5 shows the performance of the evaluated DCNNs when tested with transformed images: blurred, rotated, or scaled down. We transformed the image data using Gaussian blur with a \(5 \times 5\) kernel (notated as GauBlr), median blur with size 5 (notated as MedBlr), 90° rotation (notated as Rot90), 180° rotation (notated as Rot180), scaling down to \(32 \times 32\) (notated as Sc32), and scaling down to \(48 \times 48\) (notated as Sc48). In most experiments, the transformations cause a drop in performance. We notice that blurring and rotation cause large drops in performance, while scaling down has the least effect: when tested on images scaled down to \(48 \times 48\), the performance is only slightly worse in most cases. This should be expected, since convolutional and pooling layers themselves act as scaling-down operations on images: a DCNN learns abstractions of the image data at various sizes to capture various resolutions, so greater robustness to scaling-down operations is expected.

Table 5 The robustness of evaluated DCNN architectures against images transformations

We observe that ComNet is consistently more robust than the other networks. The two stages of concatenation and the average pooling may contribute to this. Mostly, the diseases are identified by the spots found on the leaves. At higher layers, where larger effective convolution windows are used, the addition and global average pooling operations may produce outputs similar to a blurred version of the image, contributing to the robustness. It should be noted, however, that our architecture has not been evaluated on realistic “noisy” data, as the test data are only transformed artificially. Evaluations in more realistic scenarios are needed.

Conclusions

In this paper, we propose a DCNN architecture for plant disease detection using two branches of skip connections. We compared it with five other popular DCNN architectures: AlexNet, ResNet, Xception, MobileNet, and VGGNet. We find that our method is consistently better than the reference methods, with a smaller number of parameters and a faster training time. It is also more robust when tested with blurred, rotated, and scaled-down images. However, the methods have not been evaluated on data whose conditions differ greatly from the training data, as in actual field conditions. Therefore, adding data collected under different conditions is essential to improve the robustness of the system in various environments. This is in our future plans.

It is worth noting that plant disease detection systems are not meant to replace actual diagnosis by experts, but to supplement it. Since machine learning methods can only predict with some uncertainty, laboratory tests remain the most reliable way to diagnose plant diseases. Nevertheless, such implementations could help smallholder farmers who may find it difficult to get a fast response from experts.