Introduction

Sustainability is a crucial goal involving ecological, economic and social concerns that affect the health of present and future societies. Scientific progress has produced new automatic tools that assist the human workforce by integrating artificial intelligence and robotics to meet such high-level needs. These efforts affect all production fields but especially agriculture, whose improvement must address sustainability-related topics such as finite-resource management, yield optimization and pest control. In general, every sustainability goal can require actual crop monitoring, implemented through low-cost technologies (cameras) and reliable methodologies (machine and deep learning techniques) in engineered solutions (Saleem et al., 2021). These requirements translate into the need to develop image acquisition and processing systems that extract helpful information for the farmer. At a low level, such systems must identify specific targets by applying semantic inference mechanisms, including image classification or segmentation.

In general, crop monitoring without physical contact with the targets can be divided into remote and proximal sensing, depending on the sensor-plant distance and, thus, on the level of detail of the achievable information. Remote sensing typically refers to aerial imaging from satellites, unmanned aerial vehicles (UAVs) or airplanes. UAVs are equipped with imaging sensors, such as hyperspectral, LIDAR and RGB cameras (Adão et al., 2017; Kim et al., 2019), to compute vegetation indicators, e.g. the normalized difference vegetation index (NDVI) or canopy size and volume (Zhou et al., 2020), or to create semantic maps of the fields (Dyson et al., 2019; Guo et al., 2018; Osco et al., 2021; Wu et al., 2019; Yang et al., 2020a). In proximal sensing, acquisitions are taken from the ground, close to the target, and with greater detail. Typical sensors include color, hyperspectral and infrared (IR) thermal cameras and LIDAR (Das et al., 2015; Tian et al., 2020a), targeted at object segmentation, fruit counting, phenotype analysis, plant classification and disease monitoring (Jiang et al., 2019; Ma et al., 2017; Yang et al., 2020b). In proximal sensing, data can be collected in structured and well-controlled environmental contexts, such as greenhouses (Afonso et al., 2020; Sa et al., 2016), or under excellent acquisition conditions, typically manual, with high-resolution sensors (Mack et al., 2017). Referring to extensive crops, the practical implementation of proximal sensing is achievable through agricultural robots working in-field. However, any approach to extensive monitoring must face the actual problems of in-field raw image data (natural images), such as low resolution, motion blurring, occlusions and uncontrolled lighting conditions.

The processing of natural images captured from ground robotic platforms, and more specifically the semantic segmentation of images, has been proposed mainly for weed detection (Bosilj et al., 2020; Knoll et al., 2018; Milioto et al., 2018; Wang et al., 2020a), even sharing the same input dataset (Chebrolu et al., 2017) and a common processing background, centered on deep learning (LeCun et al., 2015). More specifically, input images, which are often reported in terms of NDVI, are processed by convolutional neural networks (CNNs) for pixel or area classification, trained from scratch or by applying transfer learning (Tan et al., 2018). Deep learning is often used to segment objects of interest, such as fruits, leaves, infrastructures (wires and poles) and single branches (Naranjo-Torres et al., 2020; Wosner et al., 2021). In horticulture, several methodologies have been presented for monitoring fruit orchards through flower classification for thinning (Tian et al., 2020b), fruit classification for automatic harvesting (Gao et al., 2020) and segmentation of supporting infrastructures, such as wires (Song et al., 2021).

Automatic procedures for object segmentation are even more attractive in those areas of horticulture of high added value, such as viticulture (Barriguinha et al., 2021). Here, monitoring at the plant scale allows vine-growers to understand possible spatial variabilities and find fine-tuned solutions. For instance, Majeed et al. (2020) presented a ResNet deep residual network and region-based convolutional neural network to detect green shoots in grapevine canopies and precisely segment the trajectories of cordons for thinning purposes. Grape cluster and canopy segmentation using an artificial neural network and a genetic algorithm on images of a publicly available dataset (Berenstein et al., 2010) were proposed by Behroozi-Khazaei and Maleki (2017), while Santos et al. (2020) presented a comparison of three neural networks for instance segmentation of grape clusters tested on their public dataset (Embrapa Wine Grape Instance Segmentation Dataset - WGISD) of RGB images captured from a mobile robot.

In any of the cases above, all sensors are standard RGB cameras, which provide a flat 2D representation of the targets. In contrast, RGB-D cameras, able to produce three-dimensional (3D) colored models of the crops, can give more information, helpful for fruit monitoring and counting (Fu et al., 2020a). Several technologies, including complex setups of dedicated 3D cameras (Barnea et al., 2016; Gongal et al., 2016) or integrated low-cost consumer-grade cameras, such as the Microsoft Kinect v1 and v2 cameras (Redmond, WA, USA) (Fu et al., 2020b; Nguyen et al., 2016; Paulus et al., 2014; Tao & Zhou, 2017; Zhang et al., 2018), have been used for plant phenotyping, fruit counting and automatic robotic harvesting. Even low-cost stereo cameras, such as those of the Intel Realsense family (R200 and D4xx, Santa Clara, CA, USA), have gained attention in fruit detection and plant phenotyping (Milella et al., 2019) since they can effectively model the outdoors without suffering from illumination variability due to sunlight (Kuan et al., 2019). Several works processing color images acquired by the Intel RealSense R200 and D435 for object segmentation have been presented (Kang & Chen, 2020; Marani et al., 2021; Wang et al., 2020b). Although RGB-D cameras help yield monitoring, their output color images are often of low quality and resolution because such low-cost cameras are mainly designed for robot navigation, mapping and manipulation. Natural image segmentation from color data is still an open problem, and solving it is what enables effective use of the depth channel.

In this scenario, this paper extends previous work by Heras et al. (2021) on the accurate segmentation of plant leaves, wooden structures (trunks, branches, canes, etc.), artificial infrastructures (poles, ropes, cables, etc.) and fruits. Here, multiple network architectures were compared to find the best solution for natural image segmentation. A refined ground truth was also considered to further improve the quality of segmentation. The original contribution of the paper is manifold:

  1. The analysis of three semi-supervised learning models to counteract the small size of the annotated dataset by taking advantage of unlabeled images;

  2. A detailed comparison of several pre-trained deep neural networks (architectures and backbones) for processing low-quality images, affected by blurring and compression artifacts, as captured by consumer-grade devices mounted onto moving agricultural vehicles;

  3. A statistical analysis to identify significant differences among the deep learning models studied and the semi-supervised learning methods;

  4. A comprehensive discussion on the quality of manual image annotation and how it can affect segmentation.

Materials and methods

Input dataset

This work tackled the problem of segmentation of single natural images captured in-field by the low-cost consumer-grade Intel Realsense R200 camera (Santa Clara, CA, USA). Semantic segmentation is the classification of every pixel of an image among target classes of interest. In viticulture, segmenting specific targets, such as leaves, fruits, wooden structures (trunks, branches, canes, etc.) and artificial infrastructures (poles, ropes, cables, etc.) can be the key for yield monitoring and robotic harvesting.

Several techniques for natural image segmentation were tested on the dataset by Marani et al. (2019, 2021). This dataset consists of 405 color images acquired by the Intel Realsense R200 in a vineyard in Switzerland (Räuschling, N47° 14′ 27.6″, E8° 48′ 25.2″). The camera was mounted on a moving agricultural tractor (Niko Caterpillar, Bühl/Baden, Germany) and acquired lateral views of the grapevine rows at a distance between 0.8 and 1 m. Under those conditions, every image covered a horizontal field of view between 0.9 and 1.2 m, enough to completely frame every plant in a single image. The tractor moved along the rows at an average speed of 1.5 m/s. The image frame rate was then tuned according to the robot speed and the horizontal field of view of the camera so as to frame the same plant in at least three consecutive captures. A camera frame rate of 5 Hz was enough to produce overlapping images, spaced by about 0.3 m. The image resolution was limited to 640 × 480 pixels to match the maximum resolution of the depth data stream. It is worth noticing that, although video sequences were produced to create overlaps, the proposed implementations did not take advantage of object tracking strategies, such as that of Santos et al. (2020). All methodological approaches considered images individually, without managing multiple detections of the same elements. A sample image of the dataset is shown in Fig. 1.
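As a quick sanity check of these acquisition settings, the minimum frame rate guaranteeing at least three views of every plant can be derived from the tractor speed and the horizontal field of view. The snippet below reproduces this back-of-the-envelope calculation with the values reported above; the helper function is purely illustrative.

```python
# Back-of-the-envelope check of the acquisition settings reported above.
# The helper is illustrative only; the numerical values come from the text.

def min_frame_rate(speed_m_s: float, fov_m: float, min_views: int) -> float:
    """Minimum frame rate so that every plant appears in at least `min_views` frames."""
    max_spacing_m = fov_m / min_views      # maximum allowed distance between consecutive frames
    return speed_m_s / max_spacing_m       # frames per second

# Worst case: 1.5 m/s tractor speed and the narrowest field of view (0.9 m).
print(min_frame_rate(1.5, 0.9, 3))         # 5.0 Hz, i.e. one frame every 0.3 m
```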

Fig. 1

a A sample color image acquired by the Intel Realsense R200. b and c are magnifications of the area enclosed by the yellow and cyan boxes, respectively

As shown in Fig. 1, the resulting images are poor in detail and clearness. As a result of the movement of the tractor and the low quality of both the camera sensor and optics, images suffer from blurring, soft hue and weak contrast. Moreover, the JPEG compression further decreased the quality of the acquisition. For example, the insets of Fig. 1 show how similar the appearance of the foreground grape bunches is to that of the small background leaves.

The automatic segmentation of natural images is achieved by representing them in more descriptive and discriminative feature spaces, learned from actual images, in which pixels having similar semantic attributes can be grouped and labeled into different classes. A set of annotated images is thus required to train the model and then evaluate the segmentation results against the ground truth.

Manual annotation is a complex, time-demanding and tedious task. For this reason, annotation is typically limited to a small subset of all the images acquired in-field. However, unlabeled images were captured under the same experimental conditions and could give further information to tune the training of the networks using semi-supervised approaches.

The whole dataset of 405 natural color images from the Intel Realsense R200 camera was thus split into two sets of 85 manually annotated images and 320 unlabeled images. This roughly 20–80 proportion was chosen to highlight the improvement in results attributable to the semi-supervised approaches.

Images were processed to segment five classes of interest:

  • Bunch: bunches of white grapes;

  • Pole: supporting infrastructure made of concrete or metal poles;

  • Wood: canes, cordons and trunks of the plant;

  • Leaves: canopy leaves of the grape; and,

  • Background: the remaining objects framed by the camera, such as the ground, the sky and far grape lines.

Manual annotation was performed twice on the same images to produce two sets of labels:

  • Bunch/leaves-detection-oriented (BLDO) labels: BLDO labels were the same as in Milella et al. (2019) and Marani et al. (2019, 2021) and were mainly focused on bunch and leaves segmentation. The corresponding ground truth was obtained for each image by giving a different priority level to each class. First, bunches were annotated as closed objects, even if their appearance was partially altered by a crossing object or image artifacts. Then, plant leaves, poles and wooden structures were annotated with the same strategy but with decreasing priority levels. The background was the last labeled class, enclosing the remaining pixels (a minimal sketch of this priority-based merging is given after this list); and,

  • Object-segmentation-oriented (OSO) labels: OSO labels were created for an object segmentation task, as typically referred to in the corresponding literature. Annotation gave equal priority to every class to label objects as they appeared in the image.
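The priority scheme of the BLDO labels can be made concrete with a small sketch. Assuming that the annotator produced one binary mask per class, the snippet below merges them into a single label map in which higher-priority classes keep the overlapping pixels; the class encoding and the helper itself are hypothetical, not the tool actually used for annotation.

```python
import numpy as np

# Hypothetical class encoding; the dataset may use a different one.
BACKGROUND, BUNCH, LEAVES, POLE, WOOD = 0, 1, 2, 3, 4
PRIORITY_ORDER = (BUNCH, LEAVES, POLE, WOOD)   # decreasing priority, as described above

def merge_with_priority(binary_masks, shape):
    """Merge per-class binary masks into one BLDO-style label map.

    Pixels are assigned to the highest-priority class claiming them;
    whatever remains unassigned becomes background.
    """
    label_map = np.full(shape, BACKGROUND, dtype=np.uint8)
    assigned = np.zeros(shape, dtype=bool)
    for cls in PRIORITY_ORDER:
        mask = binary_masks.get(cls, np.zeros(shape, dtype=bool))
        label_map[mask & ~assigned] = cls      # only pixels not already claimed
        assigned |= mask
    return label_map
```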

An insight into the difference between the two kinds of labels is shown in Fig. 2. Specifically, all wooden structures and supporting infrastructures, i.e. poles, have more weight and are better detailed since they are no longer included in the leaves class. In the following, all analyses were run on BLDO labels to enable the comparison with previous results in Marani et al. (2019, 2021). Then, further experiments on the best models, trained with OSO labels, are presented to discuss the importance of manual labeling for accurate segmentation results.

Fig. 2

Annotations of the image in Fig. 1a: a bunch/leaves-detection-oriented labels as in Marani et al. (2019, 2021); b object-segmentation-oriented labels, resulting in a refinement of the BLDO labels in (a)

The whole dataset, made of in-field natural images and corresponding labels (BLDO and OSO), can be downloaded for further comparative tests at the following webpage: https://github.com/ispstiima/S3CavVineyardDataset.

Once the dataset and its annotations have been presented, the following subsections detail the network architectures and the semi-supervised algorithms.

Semantic segmentation models

As stated in the previous section, the 85 labeled images, split into training and test sets, were used to set up and evaluate the deep segmentation architectures. The training set was used to fine-tune several deep-learning segmentation architectures (Razavian et al., 2014) to produce inference on natural images in the form of output masks. In more detail, 13 architectures, summarized in Table 1, were trained. All the selected architectures were based on either fully convolutional networks (FCN) (Long et al., 2015) or encoder-decoder networks (Ronneberger et al., 2015).

Table 1 Segmentation architectures and backbones employed in this work

FCN architectures extract features from a given image using a backbone of convolutional layers and generate an initial coarse classification map. The classification map is a spatially reduced version of the original image. Then, deconvolutional layers restore the original resolution of the classification map to output the final segmentation mask. The two main drawbacks of this architecture are the loss of information when working with high-resolution images and its limited speed. To tackle the high-resolution problem, the HRNet architecture (Sun et al., 2019) maintained high-resolution representations by connecting high-to-low resolution convolutions in parallel and repeatedly conducting multi-scale fusions across parallel convolutions. Atrous convolutions were instead used in the DenseApp architecture (Yang et al., 2018) to face the same resolution issue. To tackle the problem of high time requirements, the ContextNet architecture (Poudel et al., 2018) used factorized convolution, network compression and pyramid representation, while the CGNet architecture (Wu et al., 2018) employed a context-guided block.

In the encoder-decoder architectures, the encoder is usually made of several convolutional and pooling layers responsible for extracting the features and generating an initial coarse prediction map. In these architectures, the encoder is known as the backbone. The decoder, commonly composed of convolution, deconvolution and/or unpooling layers, is responsible for further processing the initial prediction map, gradually increasing its spatial resolution and generating the final prediction. The Unet architecture (Ronneberger et al., 2015) was the first network to propose an encoder-decoder architecture to perform semantic segmentation in medical contexts. From that seminal work, several variants have been proposed to address the two main limitations of the Unet architecture, which are the same as those previously mentioned for the FCN architectures: the loss of information when working with high-resolution images and its limited speed. Regarding the issues related to the use of high-resolution images:

  • the DeepLabV3+ (Chen et al., 2018) architecture introduced the notion of atrous convolutions to extract features at an arbitrary resolution;

  • the PAN architecture (Li et al., 2018) adopted a global attention upsample module to squeeze high-level context and embed it into low-level features as guidance;

  • the FPENet architecture (Liu & Yin, 2019) defined a MEU module that used attention maps to embed semantic concepts and spatial details into low-level and high-level features; and,

  • the Unet++ architecture (Zhou et al., 2018) redesigned the connection between the encoder and the decoder components of the architecture.

Referring to the speed issue:

  • the Bisenet architecture (Yu et al., 2018) proposed a fast downsampling strategy to obtain a sufficient receptive field; and,

  • the LedNet architecture (Wang et al., 2019) employed an attention pyramid network in the decoder.

All the aforementioned architectures are based on convolutional operations. In addition, two other architectures based on the attention mechanism, namely OCNet (Yuan & Wang, 2018) and Manet (Li et al., 2020), were considered.
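Many of the encoder-decoder architectures and interchangeable backbones listed in Table 1 are available off the shelf, for instance in the segmentation_models_pytorch package. The sketch below shows how a few of them could be instantiated for the five classes of this work; whether this exact library was used by the authors is an assumption.

```python
# Illustrative sketch: instantiating some Table 1 architecture/backbone pairs
# with the segmentation_models_pytorch library (an assumed, not confirmed, tool).
import segmentation_models_pytorch as smp

NUM_CLASSES = 5  # bunch, pole, wood, leaves, background

models = {
    "Unet++-ResNet50": smp.UnetPlusPlus(encoder_name="resnet50",
                                        encoder_weights="imagenet",
                                        classes=NUM_CLASSES),
    "DeepLabV3+-ResNext50": smp.DeepLabV3Plus(encoder_name="resnext50_32x4d",
                                              encoder_weights="imagenet",
                                              classes=NUM_CLASSES),
    "Manet-EfficientnetB3": smp.MAnet(encoder_name="efficientnet-b3",
                                      encoder_weights="imagenet",
                                      classes=NUM_CLASSES),
    "Pan-Resnet50": smp.PAN(encoder_name="resnet50",
                            encoder_weights="imagenet",
                            classes=NUM_CLASSES),
}
```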

All the architectures with their respective backbones presented in Table 1 were trained using the PyTorch (Paszke et al., 2019) and FastAI (Howard & Gugger, 2020) libraries on an Nvidia RTX 2080 Ti GPU (Santa Clara, CA, USA). The procedure presented in Howard and Gugger (2020) was employed to set the learning rate for the different architectures. Also, early stopping was applied in all the architectures to avoid overfitting.
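A minimal sketch of this training setup, combining a segmentation model with FastAI's learning-rate finder and early stopping, is given below. The dataset layout, batch size and number of epochs are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal training sketch (PyTorch + FastAI); paths and hyperparameters are assumptions.
import segmentation_models_pytorch as smp
from fastai.vision.all import (SegmentationDataLoaders, Learner,
                               EarlyStoppingCallback, get_image_files)

codes = ["background", "bunch", "pole", "wood", "leaves"]
dls = SegmentationDataLoaders.from_label_func(
    "dataset/", get_image_files("dataset/images"),
    label_func=lambda p: f"dataset/labels/{p.stem}.png",   # assumed label naming
    codes=codes, bs=4)

model = smp.UnetPlusPlus(encoder_name="resnet50", encoder_weights="imagenet",
                         classes=len(codes))
learn = Learner(dls, model,
                cbs=[EarlyStoppingCallback(monitor="valid_loss", patience=5)])
suggested = learn.lr_find()                 # learning-rate finder (Howard & Gugger, 2020)
learn.fit_one_cycle(50, suggested.valley)   # suggestion attribute name varies across fastai versions
```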

After training, all the models were evaluated on the test set of 25 annotated images using the mean segmentation accuracy of the c-th class ($MSA_c$):

$$ MSA_{c} = \mathrm{mean}\left\{ \frac{TP_{c}}{n_{obs,c}},\ \forall\ image \in Dataset \right\} \qquad (1) $$

where $TP_c$ is the number of true positives, i.e. correct pixel labels, over the entire population of the c-th class ($n_{obs,c}$) (Marani et al., 2021).
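For reference, Eq. (1) can be computed directly from integer-encoded prediction and ground-truth masks, as in the short sketch below; the helper name and mask encoding are assumptions.

```python
import numpy as np

def mean_segmentation_accuracy(preds, gts, class_id):
    """Eq. (1): mean over images of TP_c / n_obs,c for a given class.

    `preds` and `gts` are lists of integer-encoded label maps of equal shape;
    images that do not contain the class are skipped to avoid division by zero.
    """
    per_image = []
    for pred, gt in zip(preds, gts):
        n_obs = np.count_nonzero(gt == class_id)                       # pixels of class c in the ground truth
        if n_obs == 0:
            continue
        tp = np.count_nonzero((gt == class_id) & (pred == class_id))   # correctly labeled pixels
        per_image.append(tp / n_obs)
    return 100.0 * float(np.mean(per_image))                           # percentage, as reported in the tables
```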

All the code necessary for training the models is available at https://github.com/ancasag/GrapeBunchSegmentation.

Semi-supervised learning methods

As stated in the previous section, the dataset contains 320 additional unlabeled images. In this case, semi-supervised learning approaches can help the training phase by adding more information from unlabeled images. For this reason, three semi-supervised learning approaches were employed: PseudoLabeling (Lee, 2013), Distillation (Hinton et al., 2015) and Model distillation (Bucilua et al., 2006). Figure 3 presents a sketch of each of these semi-supervised learning methods.

Fig. 3

Schemes of the semi-supervised approaches presented in this analysis: a pseudolabeling, b distillation and c model distillation. Yellow, blue and green arrows refer to the processes of dataset union between manual and automatically labeled datasets, model training on the corresponding training set and prediction of the input images with the model crossed by the arrow, respectively. The models enumerated from 1 to N represent the architectures and backbones of Table 1

The pseudolabeling approach consists of two steps: given a deep learning architecture, a first model is trained using that architecture on the manually labeled dataset and used to make predictions on the unlabeled dataset; second, the manually and automatically labeled datasets are combined to train a new model using the same architecture. This pseudolabeling approach was applied to all the architectures presented in the previous section (Table 1).
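A sketch of this two-step procedure is given below; `train_model` and `predict_masks` are hypothetical helpers standing in for the actual training and inference code.

```python
# Sketch of the pseudolabeling procedure described above.
# `train_model` and `predict_masks` are hypothetical placeholders.

def pseudolabeling(architecture, labeled_set, unlabeled_images):
    # Step 1: train on the manually labeled set, then predict the unlabeled images.
    model = train_model(architecture, labeled_set)
    pseudo_labeled_set = [(img, predict_masks(model, img)) for img in unlabeled_images]

    # Step 2: retrain the *same* architecture on the union of both sets.
    combined_set = list(labeled_set) + pseudo_labeled_set
    return train_model(architecture, combined_set)
```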

The distillation approach is similar to pseudolabeling, but in the second step the trained model might have a different underlying architecture than the model trained in the first step. In this case, all the models of Table 1 were trained using the training procedure presented in the previous sections, but only the best model was used for generating the automatically labeled dataset. Then, both sets (manually and automatically labeled) were combined to re-train all the architectures in Table 1.

Finally, model distillation differs from the distillation approach in how the automatically labeled dataset is produced. Instead of using a single model for making predictions on the unlabeled dataset, predictions are generated from an ensemble of models. In this approach, the five models with the best total MSA produced the predictions on the unlabeled dataset, which were then combined into single label maps. Finally, as in the previous approaches, the manually and automatically labeled datasets were used to train all the architectures presented in the previous section.
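For model distillation, the ensemble predictions have to be merged into a single pseudo-label map per image. The text only states that predictions were combined; the sketch below assumes a pixel-wise majority vote over the five best models.

```python
import numpy as np

def combine_predictions(prediction_maps, num_classes=5):
    """Pixel-wise majority vote over an ensemble of integer-encoded label maps.

    This voting rule is an assumption; the combination strategy actually used
    to build the automatically labeled dataset is not specified in the text.
    """
    stacked = np.stack(prediction_maps, axis=0)                    # (n_models, H, W)
    votes = np.zeros((num_classes,) + stacked.shape[1:], dtype=np.int32)
    for cls in range(num_classes):
        votes[cls] = (stacked == cls).sum(axis=0)                  # how many models chose this class
    return votes.argmax(axis=0).astype(np.uint8)                   # winning class per pixel
```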

Experimental study

In addition to searching for the best-performing model, a statistical study was conducted to determine whether the results obtained with the different semi-supervised learning approaches were statistically significant. To this aim, several null hypothesis tests were performed using the methodology presented by García et al. (2010) and Sheskin (2011). In order to choose between a parametric or a non-parametric test to compare the models, three conditions were checked: independence, normality and homoscedasticity. If all three conditions were satisfied, the use of a parametric test was appropriate (Garcia et al., 2010). This study fulfilled the independence condition since each semi-supervised learning approach was independent of the others. Normality was checked by the Shapiro–Wilk test (Shapiro & Wilk, 1965), whose null hypothesis is that the data are normally distributed. Finally, homoscedasticity was checked by the Levene test (Levene, 1960), whose null hypothesis is that the variances of the results are equal.

Since more than two training approaches were compared, an ANOVA test was used when the parametric conditions were fulfilled, while a Friedman test was used otherwise (Sheskin, 2011). In both cases, the null hypothesis was that all the training approaches had the same performance. After checking which method was statistically better than the others, a post-hoc procedure was employed to address the multiple hypothesis testing among the different approaches. A Holm post-hoc procedure (Holm, 1979), in the non-parametric case, or a Bonferroni-Dunn post-hoc procedure (Sheskin, 2011), in the parametric case, was used to assess the significance of the multiple comparisons (Garcia et al., 2010; Sheskin, 2011), correcting and adjusting the p-values accordingly. The significance level of the experimental analysis was set to 0.05. In addition, the effect size was measured using Cohen's d (Cohen, 1969) and eta squared (Cohen, 1973).
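The statistical pipeline described above can be assembled from standard libraries, as in the hedged sketch below; `scores` is a hypothetical dictionary mapping each training approach to its per-architecture total MSA values, and the pairwise Wilcoxon test used for the post-hoc comparisons is an assumption of this sketch.

```python
# Sketch of the statistical analysis: condition checks, omnibus test and
# Holm-adjusted post-hoc comparisons against the best (control) approach.
from scipy import stats
from statsmodels.stats.multitest import multipletests

def compare_training_approaches(scores, alpha=0.05):
    names = list(scores)
    groups = [scores[n] for n in names]

    # Parametric conditions: normality (Shapiro-Wilk) and equal variances (Levene).
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    homoscedastic = stats.levene(*groups).pvalue > alpha

    if normal and homoscedastic:
        _, p_omnibus = stats.f_oneway(*groups)            # parametric: ANOVA
    else:
        _, p_omnibus = stats.friedmanchisquare(*groups)   # non-parametric: Friedman

    # Post-hoc: compare the control (best mean) approach against the others.
    best = max(names, key=lambda n: sum(scores[n]) / len(scores[n]))
    others = [n for n in names if n != best]
    raw_p = [stats.wilcoxon(scores[best], scores[n]).pvalue for n in others]
    _, adj_p, _, _ = multipletests(raw_p, alpha=alpha, method="holm")
    return p_omnibus, best, dict(zip(others, adj_p))
```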

Results and discussion

The performance of the trained networks (both with and without applying the semi-supervised learning methods) was first evaluated considering an independent test set of 25 images. Performance was first assessed using the BLDO labels to compare the results with those in Marani et al. (2021), where several classification networks (namely, AlexNet, GoogleNet, VGG16 and VGG19) were implemented to construct probability maps from image patches generated using a sliding window. Then, the best models were trained and tested using the OSO labels to show the influence of manual annotation on the segmentation results. Finally, the inference time of the different architectures was analyzed.

Evaluation of the semi-supervised learning methods

All but two of the deep segmentation networks trained without semi-supervised learning methods (see Table 2) outperformed the approach presented in Marani et al. (2021). Namely, the total MSA (average of the five $MSA_c$ values) of the best segmentation model improved by more than 15%, and the bunch MSA by more than 5%. It is worth mentioning that the approach presented in Marani et al. (2021) was aimed only at the segmentation of the bunch class. For this reason, the improvement in the segmentation of the bunch class was lower than that of the other classes, which was considerably larger.

Table 2 Mean segmentation accuracy (percentage) computed on test images of the deep learning models trained by the manually labeled dataset

Comparing the segmentation networks, four of them (DeepLabV3+-ResNext50, Manet-EfficientnetB3, Manet-Resnest50 and Unet++-ResNet50) achieved a total MSA of over 84%. Among them, the DeepLabV3+-ResNext50 showed better segmentation accuracy than the other networks. With the focus on the bunch class, the DeepLabV3+-ResNext50 and the Manet-EfficientnetB3 networks stood out from the others, achieving an MSA of over 85% for that class. The Pan-Resnet50 model produced the best segmentation of the leaves, while the Unet++-ResNet50 model outperformed the others for the pole class and the Manet-Resnest50 model for the wood class. This illustrates the importance of testing different architectures since they focus on various aspects of the images. Therefore, they can be employed with different aims. For instance, if the final objective is measuring the production of grape bunches, the DeepLabV3+-ResNext50 or Manet-EfficientnetB3 models should be used since they provided the best accuracy for the bunch class. In contrast, if the segmentation is aimed at trimming, the Manet-Resnest50 model should be used since it offered the best accuracy for the wood class.

In addition to the raw numbers, several conclusions can be drawn by observing the segmentations of the best model for each class in Fig. 4. For the same image, although all the models achieved a mean bunch segmentation accuracy of over 80%, only the Manet-EfficientnetB3 model detected three of the four grape bunches. In addition, some leaves partially occluded the last bunch, making its segmentation difficult: that region was segmented as either background or leaves by all the models.

Fig. 4

Example of the segmentation results using the best model for each class

The impact of the different semi-supervised learning methods on the networks studied is provided in Table 3; the results of the semi-supervised methods for each class are in the appendix. Fig. 5 shows the effects of applying these approaches on the segmentation mask output of the DeepLabV3+-ResNext50, which produced the best total MSA with plain training. From Fig. 5, it can be noticed that the segmentations made by using the semi-supervised learning methods were less noisy than those produced by the original models. This happens because the semi-supervised methods helped to smooth the predictions. It is also worth mentioning that training using semi-supervised learning methods could help detect objects, such as the grape bunches in the pseudolabeling result of Fig. 5, that were missed by the models trained only with the manually annotated data.

Table 3 Total mean segmentation accuracy (percentage) on the test images after applying the different semi-supervised learning procedures
Fig. 5

Example of the segmentation results using DeepLabV3+-ResNext50 with the four training strategies

In more detail, the pseudolabeling approach produced a mean improvement of 5.62% (with a standard deviation of 13.04%). Only four networks obtained worse results using this training approach while, in some cases, namely for the FPENet model in Fig. 6, the improvement was over 55%. In Fig. 6, grape bunches and other objects that were not segmented with the initial FPENet model were correctly detected using the FPENet version trained with the pseudolabeling approach. Similarly, the distillation method produced a mean improvement of 6.01% (with a standard deviation of 12.91%), with only two networks having worse results. Finally, the model distillation method also considerably improved the performance of the models (a mean of 5.80% with a standard deviation of 12.90%). However, this improvement was slightly lower than that of the distillation approach.

Fig. 6

Example of the segmentation results using the FPENet with plain training and using pseudolabeling

As stated before, a statistical analysis was performed to determine significant differences among the training procedures. Since the normality condition was not fulfilled (Shapiro–Wilk's test W = 0.313172; p = 0.000000), Friedman's non-parametric test was employed to compare the training procedures. Friedman's test produced a ranking of the training procedures under comparison (see Table 4), assuming as null hypothesis that all the models have the same performance. In this case, significant differences arose (F = 15.66; p < 8.48e−8) with a large effect size (eta squared = 0.13). The distillation method produced the best models. Moreover, looking at the standard deviation values of Table 4, the performance variability produced by the distillation approach was considerably reduced compared with plain training. Consequently, models can be trained more reliably with distillation, whereas using only manually annotated data can lead to poor results.

Table 4 Friedman’s test for the mean Total MSA of the training methods

Table 5 shows the results of the application of the Holm algorithm to compare the control training procedure (the winner, based on distillation) with all the other training approaches, adjusting the p-values. Results showed significant differences between the semi-supervised learning procedures and the plain training approach, while all the semi-supervised learning methods produced statistically equivalent outcomes. The effect size was also taken into account using Cohen's d and, as shown in Table 5, it was medium or large when the winning approach was compared with the rest of the models.

Table 5 Adjusted p-values with Holm and Cohen’s d

In summary, semi-supervised learning methods provided a considerable boost to all segmentation models without requiring the annotation of additional images. Providing precise annotations is a time-consuming task and, therefore, reducing the annotation load could help the adoption of deep learning methods. However, deep learning models can only learn what is provided in the annotations. For segmentation tasks in agriculture, several small objects are annotated as background, making their automatic segmentation unfeasible, even when applying semi-supervised learning methods. This could be solved by a more fine-grained annotation, implementing object-segmentation-oriented (OSO) labels, as shown in the next section.

Evaluations with OSO labels

As described in the “Input dataset” subsection, a different annotation scheme was followed to produce more refined labels suitable for object segmentation models (OSO labels). These labels were used to train the same segmentation models of Table 1, following the plain training approach. The new results of the different architectures trained with the OSO labels are shown in Table 6.

Table 6 Mean segmentation accuracy (percentage) computed on test images of the deep learning models trained on the dataset of OSO labels

Several models achieved a total MSA of over 85%, including DeepLabV3+-ResNet50, Pan-Resnet50, HRNet and all the versions of the Unet and Unet++ architectures. The best overall model was HRNet, with a total MSA of 85.91%. This model also obtained the best accuracy for the leaves, pole and bunch classes. In contrast, the best models for segmenting wood and the background were based on the Unet++ architecture. The outstanding results of the HRNet model were due to the design of its architecture, which aggregates the output representations at four different resolutions, thus allowing the model to provide a precise segmentation of objects at different scales.

Segmentation maps in Fig. 7 help draw additional conclusions about the models trained with the BLDO and OSO labels. For the same image, the best overall model trained with the BLDO labels (DeepLabV3+-ResNext50 using model distillation) and the best model trained with the OSO labels (HRNet using plain training) could both segment grape bunches and leaves. However, the segmentation of smaller objects, such as small wooden fragments, was much better when models were trained with OSO labels. In contrast, BLDO labels were not accurate enough to train a model to segment such small objects, even when using the semi-supervised learning approaches.

Fig. 7

Comparison of the results obtained with the best model trained on the BLDO labels (DeepLabV3+-ResNext50 using model distillation) and the best model (HRNet) trained on the refined OSO labels

Whether it is better to produce a dataset with a coarse annotation that is later combined with semi-supervised learning methods, or a dataset with a fine-grained annotation, depends on the final aim of the trained models. Production monitoring or vegetation index estimation requires the segmentation of the main objects of the images (bunches and leaves), achievable even with coarse datasets carrying information about their appearance. However, tasks like trimming or robotic harvesting require a more precise segmentation to interact with the environment appropriately. In such cases, it is mandatory to invest more time and effort in producing a fine-grained annotation of the images.

Inference time performance

This comparative study ended with the analysis of the inference time of the models, since producing segmentations in a reasonable time is as crucial as obtaining precise results. This would enable their actual implementation for accurate yield monitoring and robotic harvesting in near real-time. The inference times of each model using an Nvidia RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz are shown in Fig. 8. It is worth noticing that the inference time was independent of the training method and of the dataset used to construct the models, as it depends only on the selected architecture. The DeepLabV3+-ResNext50 model, which obtained the best accuracy with the BLDO labels, could process 100 images in 26.1 ms using a GPU and 315 ms with a CPU, whereas the HRNet model, which obtained the best accuracy with the OSO labels, processed 100 images in 26.3 ms using a GPU and 118 ms with a CPU. The best model at inference time was the ContextNet model, which segmented 100 images in 11.6 ms using a GPU and 68.9 ms using a CPU. This model also provided the best trade-off between accuracy and inference time. Therefore, ContextNet would be the preferred model to be implemented in-field for real-time processing.

Fig. 8

Inference time (in milliseconds) for 100 images of each segmentation model using CPU and GPU
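As a complement to Fig. 8, the sketch below shows one way such inference times could be measured with PyTorch, processing 100 images sequentially and synchronizing the GPU before reading the clock; the input size matches the dataset resolution, while the model object is a placeholder.

```python
import time
import torch

def time_inference(model, device="cuda", n_images=100, shape=(1, 3, 480, 640)):
    """Measure the time (in ms) needed to segment `n_images` images one by one."""
    model = model.to(device).eval()
    images = [torch.rand(shape, device=device) for _ in range(n_images)]
    with torch.no_grad():
        model(images[0])                      # warm-up pass (allocations, kernel autotuning)
        if device == "cuda":
            torch.cuda.synchronize()          # make sure no GPU work is still pending
        start = time.perf_counter()
        for img in images:
            model(img)
        if device == "cuda":
            torch.cuda.synchronize()          # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) * 1000.0
```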

Future work

Future work will address the use of the infrared and depth streams returned by such RGB-D cameras as additional inputs to the segmentation models or as objects of investigation for accurate yield monitoring. Moreover, hardware-aware models or quantization methods will be explored to integrate the segmentation models into low-cost devices used in-field. The larger amount of input information is expected to lead to better segmentation results and even to the direct regression of crop productivity.

Conclusions

Analyzing natural images captured by moving robotic platforms is a key point for yield monitoring at the plant level. Its actual implementation requires low-cost sensors, such as RGB-D cameras, able to provide detailed information about both the appearance and the volume of the targets, e.g. whole plants or single fruits. As a first step in using these data, reliable software methods are mandatory to process low-quality color images and provide helpful knowledge to farmers.

In the scenario of viticulture, this paper presented:

  • several deep learning architectures for the segmentation of natural color images acquired in vineyards by the Intel Realsense R200 stereo camera;

  • three semi-supervised approaches to improve segmentation accuracy by taking advantage of a set of unlabeled images, thus avoiding the need for a large dataset of labeled images, whose annotation can be time-demanding; and

  • a comprehensive discussion on the need for highly detailed manual annotation for improving environmental awareness.

Results showed that the DeepLabV3+-ResNext50 model, trained with the set of labeled images, achieved the best total MSA of 84.78% (average of the MSAs of all the target classes), whereas the Manet-EfficientnetB3 model reached the best MSA for the bunch class (85.69%) under the same training conditions. On average, the application of semi-supervised learning methods boosted MSAs by between 5.62% and 6.01%. In particular, the model distillation semi-supervised approach improved the total MSA of the DeepLabV3+-ResNext50 model to 85.86%. However, other architectures, such as the FPENet, benefited by more than 55% in MSA from the semi-supervised approaches, which de facto enabled the creation of appropriate models. Finally, time efficiency was also investigated, showing that the ContextNet model almost halved the inference time of the DeepLabV3+-ResNext50 model at the expense of a slight worsening of the total MSA, which, in the case of the distillation semi-supervised learning procedure, reached 83.44%.

A final comparison of models trained with two label sets, oriented at bunch/leaves detection and at object segmentation, was presented to show the effect of manual annotation. Specifically, coarse labels can be efficiently used to model large objects, such as grape bunches or leaf clusters, making them suitable for production monitoring and vegetation index estimation. In contrast, applications requiring exact environmental awareness, such as robotic harvesting or trimming, must use more detailed labels to create exhaustive segmentation models.