Introduction

Food insecurity is a major threat to human civilization given the increasing global population and climate change (Al Hasan et al., 2022; Andreoli et al., 2021). Several studies have estimated that global food production must increase by at least 70% by 2050 to meet demand (Beltran-Pena et al., 2020; Jain et al., 2020; van Dijk et al., 2021). However, expanding agricultural land use and intensifying the use of natural resources to bolster food production cannot be undertaken without considering the consequences for climate change (Fanzo et al., 2021; Wheeler & von Braun, 2013). Therefore, policy makers, breeders, and growers around the world have acknowledged the need for smarter technologies and scientific innovation in crop breeding, which can produce hybrids that yield more with fewer natural resources (Edgerton, 2009; Lobell et al., 2009). Accurate and non-destructive estimation of plot-level yield is a significant step toward revolutionizing crop breeding operations and understanding the interactions of genetics, management, and environment in crop yield (Araus & Cairns, 2014; Song et al., 2021).

Crop yield prediction is a broad topic of interest for crop growers, breeders, and policy makers alike. While policy makers are concerned with crop yield over larger areas at coarser scales, growers and breeders are more interested in the variance of crop yield at much finer scales, often within the agricultural field (Mcmichael, 1994). Crop growers can benefit from early-season yield prediction (van Klompenburg et al., 2020), which can inform in-field management decisions such as precise fertilizer application, irrigation, and pest management. Crop breeders, in turn, often use yield as a metric to evaluate the performance of genotypes in development (Hassan et al., 2019). Typically, crop breeding is a multi-year process involving many trials in greenhouses and experimental fields. The trials in experimental fields are more complex because interactions between genotypes, management practices, and environmental conditions can affect yield (Marsh et al., 2021). Therefore, breeders use mechanical harvesters to measure the yield of each experimental plot and identify which genotypes or trials perform well under different management and environmental conditions. This process is known as the crop breeding pipeline, and along with many other plant traits (or phenotypes), yield is one of its most important components (Hu et al., 2018). However, measuring yield using harvesters or other destructive means is time consuming and costly in large breeding operations. Therefore, high-throughput, efficient, and scalable approaches to crop yield prediction can significantly improve the breeding pipeline and shorten the time needed for seed development (Marsh et al., 2021).

Numerous approaches have been undertaken for crop yield prediction, such as expert knowledge (Ballot et al., 2018; Papageorgiou et al., 2011, 2013), crop growth models (Basso & Liu, 2019; Huang et al., 2019; Kasampalis et al., 2018), remote sensing methods (Ali et al., 2022; Hara et al., 2021; Khaki et al., 2021; Morales et al., 2023), and hybrid approaches in which crop growth models are combined with environmental factors (Gopal & Bhargavi, 2019; Kogan et al., 2013) and remote sensing data (Bai et al., 2019; Zhuo et al., 2019, 2022). While crop growth models are robust and generalize across many different crops and experimental sites, difficulties often arise in properly parameterizing the models for large-scale applications and for the intrinsic variability of agriculture (Kipkulei et al., 2022).

The plant science community has relied on remote sensing for the past few decades in many different agricultural applications, for instance, estimation of aboveground biomass (Kumar & Mutanga, 2017; Lu et al., 2016), nitrogen (Knyazikhin et al., 2013; Wang et al., 2017), chlorophyll concentration (Carmona et al., 2015; Guo et al., 2023; Xie et al., 2019), crop yield (Cao et al., 2021; Maimaitijiang et al., 2020; Morales et al., 2023; Sagan et al., 2021), and so on. Given the immense improvements in the spatial, spectral, and temporal resolution of remote sensing products, high-throughput crop-trait prediction has been receiving new momentum. The principal idea behind using remote sensing for crop trait estimation is that imagery can capture the spatial and spectral variability of different crops in the field. This information can then be mapped using linear or non-linear machine learning models, where the independent variables are either the images themselves or features derived from them, and the target variable is the yield measured from samples within the experiment.

Many studies have used satellite-based remote sensing and different machine learning models to predict yield for different crops. For example, Schwalbert et al. (2020) utilized Long Short-Term Memory (LSTM) neural networks combined with satellite imagery and weather data to predict soybean yield in Brazil at the municipality level. Peng et al. (2020) used satellite-based Solar-Induced Chlorophyll Fluorescence (SIF) products along with vegetation indices and land surface temperature (LST) from the Moderate Resolution Imaging Spectroradiometer (MODIS) to predict soybean and maize yield for the entire United States. Similarly, regional-scale crop yield prediction has been performed using random forests (Han et al., 2020; Jeong et al., 2016), deep neural networks (Crane-Droesch, 2018), and convolutional neural network (CNN)-based LSTM models (Sun et al., 2019). However, the major challenge with satellite-based remote sensing is that in-field variation of crop characteristics is not well captured by the coarser spatial resolution of satellite data, which does not assist crop breeders in the breeding pipeline or growers in making within-field management decisions.

Recent advances in unmanned aerial vehicles (UAVs) and sensor technologies have dramatically shifted the capabilities of remote sensing and machine learning for plot-level yield prediction. The higher spatial, spectral, and temporal resolution of UAV sensors offers significant improvements in yield prediction accuracy along with other crop trait estimates. Numerous studies have conducted yield prediction using multispectral (Hassan et al., 2019; Wan et al., 2020), hyperspectral (Feng et al., 2020; Li et al., 2020), or fused multi-modal UAV systems (Fei et al., 2022; Maimaitijiang et al., 2020) and achieved consistent results. A major component of these studies was the extraction of hand-crafted features, such as vegetation indices (e.g., the normalized difference vegetation index, NDVI, and the wide dynamic range vegetation index, WDRVI), structural features (e.g., different height metrics), and textural features (e.g., the grey level co-occurrence matrix, GLCM). Extraction of these hand-crafted features requires a priori knowledge of reflectance and crop biophysical properties. There can be hundreds or thousands of potential input features for a machine learning model, which can significantly limit the robustness and transferability of the model (Sagan et al., 2021).

A convolutional neural network (CNN) is a type of deep learning model that recognizes spatial and textural patterns within a given image. CNNs use convolution and pooling layers to extract meaningful information from complex images and learn to predict a target variable. These capabilities can be leveraged for automatic feature extraction, so hand-crafted feature engineering can be avoided during model training.

CNN architectures have proven highly effective for general computer vision tasks, such as classification and object detection. The superiority of CNN architectures was first demonstrated with the release of AlexNet (Krizhevsky et al., 2017), followed by VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and many other architectures. However, applications of these architectures have mostly been confined to general computer vision tasks (Hussain et al., 2019; Nanni et al., 2017) and medical imaging (Singh et al., 2020; Suzuki, 2017). Recently, several studies have employed CNN architectures for direct imagery-based yield prediction from UAV images and achieved considerable accuracy. For example, Yang et al. (2019) concluded that their proposed CNN model provided the most robust rice grain yield forecast compared to other methods. Nevavuori et al. (2019) also found CNN models superior for estimating the yield of wheat and malting barley.

While CNN architectures were initially proposed to solve computer vision problems from 2D images, several studies have identified the superiority of 3D CNNs for analyzing multitemporal images. In a 3D CNN, images captured as a timeseries are stacked together and 3D convolution kernels are applied across all three dimensions, i.e., two spatial dimensions and a third temporal dimension. The principal assumption behind this idea is that 3D convolution can capture both spatial and temporal characteristics from the image samples to predict a target variable. For example, Nevavuori et al. (2020) utilized a CNN-LSTM model and a 3D CNN architecture to estimate crop yield and found that the 3D CNN provided better results than the CNN-LSTM model. Similarly, Sagan et al. (2021) employed a 3D version of the ResNet architecture for soybean and corn yield prediction, achieving improved performance over traditional machine learning models. Another important finding from these studies is the strong performance of RGB images in CNN-based yield prediction. Both Yang et al. (2019) and Nevavuori et al. (2019) concluded that RGB images offer more robust CNN training compared to multispectral images. Although many studies have indicated the importance of multispectral cameras with near-infrared and red-edge bands for yield prediction (Bascon et al., 2022; Shen et al., 2022; Tanabe et al., 2023), Yang et al. (2019) and Nevavuori et al. (2019) highlighted the capability of CNNs to capture important textural and spatial information from the RGB bands alone. Since RGB cameras are lightweight, low-cost, and easy to process, the use of RGB images in CNNs offers substantial potential for crop breeding and precision agriculture applications.

Although CNNs offer automatic feature extraction along with an end-to-end mechanism for crop yield prediction, most deep neural networks require massive amounts of ground truth (label) data to learn robustly. For typical computer vision and medical imaging tasks, generating ground truth data often involves manual labor and some level of expert knowledge. Producing ground truth for agricultural applications, however, is a major challenge, especially for regression tasks. For example, a yield prediction problem requires an experimental site with a variety of crop genotypes, management practices, and environmental conditions. Additionally, images must be collected and plot-level yield must be measured using harvesters or manual methods. To the best of the authors' knowledge, no study has investigated different CNN architectures for plot-level yield prediction over a large region with a sample size approaching 30,000 plots. Therefore, a knowledge gap exists regarding the choice of CNN architecture and the optimal spatiotemporal resolution for UAV-based yield prediction.

In this study, end-to-end 3D CNN architectures were examined for plot-scale soybean yield prediction using multitemporal UAV-based RGB images. Datasets from a soybean breeding pipeline, encompassing a large-scale training sample from different parts of Argentina with high variability in genotypes, management, and environmental conditions, were incorporated. To the best of the authors' knowledge, this is the first study in the open scientific literature to tackle data of this scale for deep learning in agricultural management. The major objectives of this study are: (a) to determine the best 3D CNN architecture for end-to-end yield prediction, (b) to identify the optimal spatiotemporal resolution for the best prediction accuracy, and (c) to provide an end-to-end solution for big data processing and training in the cloud and offer recommendations for scalability.

Methods

Experimental sites

The experimental sites for this study are in Argentina, specifically in the Province of Buenos Aires, where soybean trials were distributed across four locations. For anonymity of the breeding program, the locations have been codenamed CON, RAM, ASC, and IND (Fig. 1). ASC and IND are located in the middle of the province, while CON and RAM are in the north. Since all the locations are in the southern hemisphere, the coldest temperatures occur around July and the warmest around January. All locations have a predominantly mild climate, which is typical of the Province of Buenos Aires. ASC and IND are 50 km apart; thus, their average temperatures and precipitation are similar. The warmest month is January, with a historical average temperature of 23 °C, and the coldest month is July, with an average temperature of 9 °C. RAM and CON are around 24 km apart and therefore also share similar average precipitation and temperatures, with a maximum average temperature of 25 °C in January and a minimum of 11 °C in July. In terms of precipitation, ASC and IND receive an average of 1043 mm annually, while CON and RAM accumulated 1135 mm of precipitation during the 2021 growing season.

Fig. 1

Location of the experimental sites. A total of 13 fields were established in 4 locations (i.e., RAM, CON, IND, and ASC) northwest of Buenos Aires, Argentina

Each location contained several fields, as shown in Fig. 1. Three types of trial were analyzed in this study. Early (ERL) trials consisted of soybean genotypes in the early stages of the breeding program, where no previous selections had been performed. Intermediate (MID) trials were formed with genotypes in intermediate stages of the breeding program that had gone through previous selection filters. Finally, the advanced (ADV) trials included genotypes with more than one year of evaluation and several selection filters, from which it can be assumed that the genotypes in these trials should have the highest performance consistency across locations. All trial fields were planted between October and November of 2020 and harvested between March and May of 2021 (following the different maturity groups of the genotypes in the trials). All trials were managed under no-till schemes without irrigation, following the typical agricultural practices for the region. The trials also had similar plot sizes and layouts; each plot consisted of four rows with industry-standard dimensions.

Ground truth data collection

The ground truth for this study was the yield measured from each plot in kg ha−1. The yield value was obtained by harvesting each individual plot with an ALMACO SPC20 specialized plot combine (Fig. 2d; ALMACO, Nevada, Iowa, United States). These combines are widely used in the breeding industry. After harvesting each plot, the combine measured the humidity and weight of the harvested grain and stored the information in a local database, where each value was associated with its corresponding plot ID. Finally, the plot weight was corrected for humidity by subtracting the weight of water in the seed and extrapolated from kg plot−1 to kg ha−1. These values were stored in an online database, where each plot was associated with a measured yield value.

Fig. 2

Pictures of soybean trials at the seed filling stage (a) and at full maturity (b), the DJI Phantom 4 Pro UAV with the P4 sensor used to collect RGB data (c), and the ALMACO harvester used to measure yield (d)

UAV data collection

The DJI Phantom 4 Pro UAV (Shenzhen, China) equipped with the DJI P4 (Shenzhen, China) multispectral camera (Fig. 2c) was used in this study. The P4 sensor has a total of six lenses: five capturing the blue, green, red, red-edge, and near-infrared bands, plus one dedicated RGB lens. For this study, only the outputs from the RGB lens were used to test the different 3D CNN architectures. The flights were performed using DJI GS PRO software (v2.0.17) with a front overlap of 80% and a side overlap of 70%. All the flights were performed at a speed of 4.2 m s−1 with images captured every 2 s.

Flight dates were planned based on the soybean phenological cycle, with the aim of covering all the reproductive stages from R1 to R6, since these stages have been found to be the most important for soybean yield prediction. More information about soybean growth stages can be found in Fehr and Caviness (1977). First, an initial flight was performed at around the late vegetative stage (V3-V4) for all locations so that the plots could be seen clearly for later plot demarcation. Subsequent flights were conducted when the crops were approximately in the R1 stage and continued at roughly two-week intervals (Fig. 3a). However, there were some gaps in the UAV flights due to logistical complexities and bad weather conditions. Approximately nine flights were conducted per site. Figure 3a shows that the gaps between the first and second flights for ASC and IND are shorter than for CON and RAM. This is because ASC and IND contain genotypes of shorter maturity groups, which reached the reproductive stage earlier than the genotypes in CON and RAM.

Fig. 3

UAV data collection timeline at the experimental sites (a) and a sample UAV orthomosaic generated from the RAM-R1 field for the entire growth season (b). The orthomosaics in b are not to scale and are distorted for perspective visualization

The flights were conducted in clear-sky conditions, and the UAVs were flown near noon to ensure consistent illumination. Unfortunately, no calibration panel was included for reflectance calculation, which is one of the major limitations of this study and is further explained in the Discussions section. However, the assumption was that since the flights were conducted in clear-sky conditions near noon, the DN values from the RGB lens would provide normalized information about crop growth. During the flights, several ground control points (GCPs) were placed on the ground and kept consistent between flights for the same location.

Data processing pipeline

Given the large number of datasets involved in this study, different automatic and semi-automatic measures were adopted in the data processing pipeline. The overall workflow of the pipeline is illustrated in Fig. 4.

Fig. 4

Data processing pipeline, which includes orthomosaic generation, georeferencing, plot boundary extraction, and sample preparation for CNN training. The level of automation is also defined in each part

Orthomosaic and georeferencing

The orthomosaic images for each field and date were generated in Pix4D (v4.6.4, Caputo et al., 2023). This process takes the individual images captured sequentially during a flight, and an automated pipeline within the Pix4D software was run to generate RGB orthomosaics. The images from the dedicated RGB lens of the P4 camera were used, which provide automatically calibrated digital numbers (DN) ranging from 0 to 255. Therefore, no radiometric calibration was performed to compute reflectance for this lens. Rather, the primary objective was to assess the performance of uncalibrated RGB images within CNN models, as many low-cost UAV systems may lack the capability to perform radiometric calibration. Although the UAV had its own GPS and inertial navigation system onboard, the resulting orthomosaics still contained positional errors, so an orthomosaic of a field on one date did not fully align with an orthomosaic of the same field from a different date. To fix this problem, the ENVI Image Registration workflow (Xie et al., 2003) was used to automatically co-register the consecutive temporal orthomosaics for each field. The Image Registration workflow identified the GCPs in the fields automatically using its proprietary registration engine and used a first-order polynomial (affine) transformation to rectify the georeferenced images. However, this process required manual inspection to ensure the quality of the georeferencing outputs. If appropriate overlap was not found between images, the tie points were rechecked and adjusted until the GCPs and other prominent features aligned with each other.

Plot boundary creation

The plot boundaries for each field were created using both automatic and manual digitizing methods in ArcGIS Pro. First, the starting location for planting was acquired based on how the planter was driven to plant soybean seeds in each field. Based on that initial coordinate, a rectangular grid of polygons was created from the known plot width and height using the Create Fishnet tool in ArcGIS Pro. To check the accuracy of the plot boundaries created by the fishnet tool, a randomized quality checking system involving multiple people was developed. This human-in-the-loop mechanism made it possible to remove unnecessary plot polygons or adjust polygon shapes to accurately match the underlying soybean rows. Each plot was uniquely identified with three attributes, i.e., block, range, and column, which were filled in semi-automatically using ArcGIS Pro's automatic numbering tools.

Sample preparation for CNN

The CNN architectures require the input to be in a multi-dimensional tensor format, where the values in the tensors represent image patches. A fully automatic sample preparation pipeline was developed in Python to efficiently produce the desired sample outputs. First, each plot polygon was used to clip all the temporal images. Since the orientation of the experimental sites was not the same, the clipped images had different no-data segments for different fields (no-data is the black area in Fig. 4d). If such images were used as input for the CNN models, the CNN might pick up the varying orientation of the samples and incorrectly correlate yield with the no-data orientation. Therefore, an automatic rotation algorithm was developed to rotate each clipped image to a horizontal orientation. First, the clipped images were converted to binary images, where all the no-data pixels were assigned "0" and all other pixels were assigned "1". Due to the high contrast of the binary image, the solid boundary lines could be detected using a Canny edge detector. Then, the angle of the longest detected line was calculated and used as the rotation parameter to rotate the image. The resulting image then had a significant amount of padding (no-data values) around the actual pixel values (Fig. 4d), which was trimmed using threshold values.
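The rotation step can be sketched as follows with OpenCV and NumPy; the function name, Canny thresholds, and Hough parameters here are illustrative assumptions rather than the exact values used in the pipeline.

```python
import cv2
import numpy as np

def rotate_plot_clip(clip, nodata=0):
    """Rotate a clipped plot image so its longest edge is horizontal (illustrative sketch)."""
    # Binary mask: 255 where valid data, 0 where no-data
    mask = (clip[..., 0] != nodata).astype(np.uint8) * 255
    # Detect the plot boundary lines on the high-contrast mask
    edges = cv2.Canny(mask, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                            minLineLength=20, maxLineGap=5)
    if lines is None:
        return clip  # nothing detected; keep original orientation
    # Pick the longest line and compute its angle relative to horizontal
    x1, y1, x2, y2 = max(lines[:, 0, :], key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
    # Rotate the image around its center by that angle
    h, w = clip.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(clip, m, (w, h), flags=cv2.INTER_NEAREST)
```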

In this study, the effect of spatiotemporal resolution on different CNN architectures was explored for yield prediction. Therefore, the image samples were interpolated to several spatial and temporal sizes. Originally, the 60-m altitude UAV flights resulted in a ground sampling distance (GSD) of approximately 3.4 cm. Spatial interpolation was performed using the nearest neighbor method; it was necessary because the image samples had non-uniform spatial dimensions after clipping, rotation, and padding removal. Table 1 shows the different spatial grid sizes considered in this study. In the temporal dimension, only 9 acquisitions were available as the depth of each sample. The architectures used in this study required more than 9 temporal steps to run all the convolution and pooling layers. Therefore, the 9 values of each pixel (one from each temporal image) were treated as an independent timeseries and interpolated to different lengths, i.e., 32, 48, 64, and 80. The linear interpolation populated the intermediate values by considering the number of days between image acquisitions. Table 1 shows all the combinations of spatial and temporal sizes used in the experiments. The value 32 was simply picked as the base size and incremented by 16 for the different input configurations.
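A minimal sketch of the temporal interpolation is given below, assuming NumPy arrays and hypothetical day-of-season offsets for the nine flights; the variable names are illustrative only.

```python
import numpy as np

def interpolate_time(stack, flight_days, target_steps=32):
    """Linearly interpolate a (T, H, W, C) image stack to target_steps along time.

    stack       : array with T=9 temporal acquisitions
    flight_days : days after planting for each acquisition, e.g. [35, 48, 62, ...]
    """
    t, h, w, c = stack.shape
    # Evenly spaced "virtual" acquisition days spanning the observed period
    target_days = np.linspace(flight_days[0], flight_days[-1], target_steps)
    # Interpolate each pixel/band timeseries independently
    flat = stack.reshape(t, -1)                      # (T, H*W*C)
    out = np.empty((target_steps, flat.shape[1]), dtype=np.float32)
    for i in range(flat.shape[1]):
        out[:, i] = np.interp(target_days, flight_days, flat[:, i])
    return out.reshape(target_steps, h, w, c)
```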

Table 1 Combination of CNN input samples (excluding the number of bands in the shape) considering different spatiotemporal sizes

CNN architectures

Numerous studies in the field of crop yield prediction have employed CNNs, with many constructing bespoke architectures by iteratively adding layers with varying input dimensions, kernel shapes, and filter configurations (Nevavuori et al., 2019, 2020; Yang et al., 2019). In computer vision, however, where millions of training samples are readily available in standardized datasets, CNNs have evolved into well-established architectures such as VGG, ResNet, and DenseNet. While such architectures would bring scalability, robustness, and transferability to a crop breeding pipeline, evaluating them for crop phenotyping or yield prediction is challenging due to the scarcity of large-scale samples in this specialized field. Therefore, a substantial dataset of approximately 30,000 samples with accompanying ground truth yield data was leveraged to analyze three fundamental and widely recognized architecture families (i.e., VGG, ResNet, and DenseNet) and discern the most accurate and efficient performer across input spatiotemporal dimensions.

Neural networks have shown improved performance over many different classification and regression problems. However, fully connected deep neural networks do not share parameters across spatial locations and therefore scale poorly to 2D images. Convolution is a mathematical operation that merges spatial information within a local neighborhood of a 2D space. Therefore, 2D convolutions are the most common operations in CNN architectures, where they extract useful information from a complex 2D space.

The 2D input image can have multiple channels, which is 3 in this study (i.e., red, green, and blue), and multiple convolution kernels of size \(k\times k\) act as sliding windows over the image to perform the following mathematical operation:

$$z^{(t,h)} = \sum_{i=1}^{k_h k_w} x_i^{(t)} y_i^{(h)} + b^{(h)}$$
(1)

where \(t\) and \(h\) represent the \(t\)-th sliding window and the \(h\)-th convolution kernel, respectively; \(k_h\) and \(k_w\) are the kernel sizes along the height and width dimensions; \(x^{(t)}\) and \(z^{(t)}\) are the \(t\)-th input and output, and \(y^{(h)}\) and \(b^{(h)}\) denote the \(h\)-th kernel matrix and its bias, respectively. The output of a single-channel convolution is a 2D feature map of shape \((h-k_h+1)\times \left(w-k_w+1\right)\).

Similarly, 3D convolution is an intuitive extension of 2D convolution that considers an extra dimension for information extraction. In this case, the third dimension is the temporal dimension, which represents the sequential growth pattern of soybean in the field. The 3D convolution operation for a 3D input matrix can be expressed as Eq. 2.

$$z^{(t,h)} = \sum_{j=1}^{k_d} \sum_{i=1}^{k_h k_w} x_{i,j}^{(t)} y_{i,j}^{(h)} + b^{(h)}$$
(2)

where \(k_d\) is the kernel depth representing the temporal dimension. In 3D convolution, a 3D kernel slides over the 3D input matrix and summarizes the information content across all the channels. The resulting output of a 3D convolution is also a 3D matrix of shape \((h-k_h+1)\times \left(w-k_w+1\right)\times \left(d-k_d+1\right)\).
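As an illustration of this shape behavior, the following TensorFlow snippet applies a single unpadded 3D convolution to a dummy input with dimensions used in this study; it is a sketch only, not part of the training code.

```python
import tensorflow as tf

# Dummy batch: 1 sample, 32 time steps, 48 x 48 pixels, 3 RGB channels
x = tf.random.uniform((1, 32, 48, 48, 3))

# One 3D convolution with a 3x3x3 kernel and no padding ("valid")
conv = tf.keras.layers.Conv3D(filters=16, kernel_size=3, padding="valid", activation="relu")
y = conv(x)

# Output size shrinks by (kernel - 1) in each dimension:
# (32-3+1, 48-3+1, 48-3+1) = (30, 46, 46)
print(y.shape)  # (1, 30, 46, 46, 16)
```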

In this study, the three most widely used CNN architecture families (i.e., VGG, ResNet, and DenseNet) were considered as the basis of the experiment. However, these CNN architectures were initially proposed to handle 2D images for classification problems, so their dimensionality was extended to match the input-output shape of a 3D matrix. In the following sections, the major configuration of these architectures and their 3D representations for the yield prediction problem are discussed.

VGG

The family of VGG (Visual Geometry Group) architectures was first proposed by Simonyan and Zisserman (2015) as an improvement over AlexNet (Krizhevsky et al., 2017) and ZFNet (Zeiler & Fergus, 2014). The VGG family was proposed with several network depths of 11, 13, 16, and 19 layers. The network consists of repeating building blocks, each with a stack of up to four convolution layers followed by dimension reduction via max pooling. The architecture uses a \(3\times 3\) kernel size exclusively in the convolutional layers; here, the kernel size was extended from \(3\times 3\) to \(3\times 3\times 3\) to match the 3D input shape of the yield prediction problem. After each convolution, ReLU (Rectified Linear Unit) was used as the activation function, which calculates \(f\left(x\right)={\text{max}}(0,x)\) on the resulting feature maps (Nair & Hinton, 2010). VGG-13, in which 13 layers of convolution and pooling operations are performed, was chosen as the candidate from the VGG family (Fig. 5a). After feature map generation by the convolution and pooling layers, global average pooling was applied to the 3D feature map, followed by two dense layers and an output layer with one neuron to predict yield.
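A minimal sketch of such a 3D VGG-style structure in Keras is shown below; the block count, filter sizes, and dense layer widths are illustrative assumptions, not the exact configuration of the VGG-13 used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def vgg_block_3d(x, filters, n_convs):
    # Stack of 3x3x3 convolutions followed by max pooling, as in a VGG block
    for _ in range(n_convs):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling3D(pool_size=2)(x)

inputs = layers.Input(shape=(32, 48, 48, 3))   # (time, height, width, RGB)
x = vgg_block_3d(inputs, 64, 2)
x = vgg_block_3d(x, 128, 2)
x = vgg_block_3d(x, 256, 2)
x = layers.GlobalAveragePooling3D()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1)(x)                   # single neuron for yield regression
model = models.Model(inputs, outputs)
```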

Fig. 5

The schematic diagram of the three 3D CNN architectures, i.e., VGG-13 (a), ResNet-50 (b), and DenseNet-169 (c). VGG-13 includes several convolution and pooling layers as the building block of the architecture. The ResNet-50 has two types of residual blocks building the entire network. Alternatively, DenseNet-169 consists of dense and transition blocks as the main backbone

ResNet

The ResNet family was first introduced by He et al. (2016) to address the vanishing gradient problem of deep CNN architectures. While building very deep networks, the computer vision community identified vanishing or exploding gradients, where the product of derivatives shrinks until the partial derivative of the loss function approaches zero (Ide & Kurita, 2017; Su et al., 2018; Yang et al., 2022a, 2022b). This phenomenon often leads to poor learning, or no learning at all, in the deeper layers and results in decreased performance. ResNet architectures avoid this degradation of deeper networks and support optimization with backpropagation by reformulating the function \(H(x,{W}_{i})\), which represents the convolutional operations of the main trunk. The function is expressed as a residual function \(H\left( {x,W_{i} } \right) = F\left( {x,W_{i} } \right) + x\), where \(F\left(x,{W}_{i}\right)\) is the residue to approximate and \(x\) is the input before the convolutional operation. Approximating the residue is easier because \(x\) is known through the residual connection. Moreover, an identity function can be learned when the residue of the convolutional operations in the final part of the network is zero. From a backpropagation perspective, the residual connections allow larger gradients to reach the earlier convolution operations, ensuring that those layers are still optimized. Therefore, residual connections offer better performance in deeper networks compared to shallower networks (Xie et al., 2017; Yang et al., 2022a, 2022b).

If a network with \(L\) layers is considered, where each layer performs a non-linear transformation \(H_l\), the output of the \(l\)-th layer is denoted as \(X_l\), and the input image is represented as \(x_0\), then the skip connection can be represented as Eq. 3.

$$X_{l} = H_{l}\left( {X_{l - 1} } \right) + X_{l - 1}$$
(3)

ResNet was proposed with multiple depths, for instance, 18, 34, 50, 101, and 152 layers; ResNet-50 was considered as the candidate for the yield prediction model (Fig. 5b). Similar to the VGG-13 network, all the 2D convolution and pooling layers were transformed into 3D form to accommodate the spatiotemporal input data. The final output layer of the network also consists of one neuron to directly predict yield. Figure 5b shows the overall architecture of ResNet-50 along with a schematic diagram of the residual blocks.
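A minimal sketch of a 3D residual block with an identity shortcut (Eq. 3) is given below; the filter counts are illustrative and this is not the exact ResNet-50 bottleneck configuration used in the study.

```python
from tensorflow.keras import layers

def residual_block_3d(x, filters):
    """Identity residual block: output = F(x) + x (illustrative sketch).

    Assumes x already has `filters` channels; ResNet additionally uses a
    projection (1x1x1 convolution) shortcut when the shapes differ.
    """
    shortcut = x
    y = layers.Conv3D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv3D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Element-wise addition of the shortcut implements the skip connection
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```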

DenseNet

DenseNet is a variant of the residual network first proposed by Huang et al. (2017). Whereas ResNet performs identity mapping to promote gradient propagation through element-wise addition, each DenseNet layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers (Fig. 5c). The DenseNet network is divided into multiple densely connected dense blocks to facilitate both down-sampling and feature concatenation. This division of the network into dense blocks allows concatenation of feature maps of different sizes. Moreover, transition layers are used between the dense blocks, each consisting of a batch normalization layer and a \(1\times 1\times 1\) convolution followed by a \(2\times 2\times 2\) average pooling layer. Because of the concatenation within dense blocks, each layer learns collective knowledge from all preceding layers, resulting in a thinner and more compact network. Therefore, DenseNet requires fewer model parameters to train and provides higher computational efficiency (Chen et al., 2021).

The difference between the skip connection of ResNet and the dense connectivity of the DenseNet architecture can be seen in Eq. 4, using the variables defined for Eq. 3.

$$X_{l} = H_{l} \left( {\left[ {X_{0} ,\,X_{1} , \ldots ,X_{l - 1} } \right]} \right)$$
(4)

where \([{X}_{0},{X}_{1},\dots ,{X}_{l-1}]\) represents the concatenation of the feature maps produced by layers \(0, 1, \dots, l-1\).

DenseNet architectures were originally proposed with different depths, i.e., 121, 169, 201, and 264 layers; DenseNet-169 was chosen for this study. Similar to the VGG and ResNet architectures, the convolution and pooling layers were transformed into 3D format and the final output layer was a single neuron performing yield prediction.
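A minimal sketch of a 3D dense block and transition layer illustrating the concatenation in Eq. 4 follows; the growth rate and layer counts are illustrative assumptions rather than the DenseNet-169 configuration.

```python
from tensorflow.keras import layers

def dense_block_3d(x, n_layers, growth_rate=12):
    """Each layer receives the concatenation of all previous feature maps (Eq. 4)."""
    for _ in range(n_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv3D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])     # dense connectivity
    return x

def transition_3d(x, filters):
    """BN + 1x1x1 convolution + 2x2x2 average pooling between dense blocks."""
    x = layers.BatchNormalization()(x)
    x = layers.Conv3D(filters, 1, padding="same")(x)
    return layers.AveragePooling3D(pool_size=2)(x)
```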

Model training

The models were trained using cloud solutions provided by Amazon Web Services (AWS). The AWS SageMaker (v2.140.1) platform was used to train and evaluate the performance of the three architectures for the different input configurations shown in Table 1. The automatic sample preparation pipeline was used to generate the different input configurations, which were uploaded to an AWS S3 bucket. During model training, the input data were streamed directly from the corresponding S3 bucket. The TensorFlow API (v2.6.0) was used to train the models in AWS SageMaker.

The training configuration for each architecture and input shape was kept the same so that the experiments return a fair comparison. First, the sample dataset for each input configuration was divided into three sets, i.e., training, validation, and test, using a randomized 50%, 20%, and 30% split, respectively. The sample image values ranged from 0 to 255 in a 32-bit data structure; therefore, each sample was divided by 255 to scale the values between 0 and 1. Since the problem in this study is a regression problem, the mean squared error was used as the loss function for all model training. The Adam algorithm (Kingma & Ba, 2015) was used to optimize the loss function, as it has proved to be highly efficient in most advanced deep learning problems.

A standardized set of hyperparameters was used during model training. The batch size was kept at 32 for all training runs. The learning rate was initially set at 0.0001 and gradually reduced with increasing iterations. The total number of iterations was set at 200, but an early stopping criterion automatically halted training if the validation loss did not improve. During model training, both training and validation losses were retained for further evaluation.
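A sketch of this training configuration with the Keras API is shown below, reusing the `model` from the earlier VGG sketch and hypothetical arrays `x_train`, `y_train`, `x_val`, and `y_val`; the callback patience values and learning-rate schedule details are illustrative assumptions.

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")

callbacks = [
    # Stop training when the validation loss stops improving
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Gradually reduce the learning rate as training progresses
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]

history = model.fit(x_train / 255.0, y_train,
                    validation_data=(x_val / 255.0, y_val),
                    batch_size=32, epochs=200, callbacks=callbacks)
```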

Model evaluation

The performance of each model was evaluated based on three common regression metrics, i.e., the coefficient of determination (\({R}^{2}\)), root mean squared error (\(RMSE\)), and normalized root mean squared error (\(NRMSE\)), which are expressed in the following equations:

$$R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \hat{y}_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y} } \right)^{2} }}$$
(5)
$$RMSE\, = \,\sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} \, - \,\hat{y}_{i} } \right)^{2} }}{n\, - \,1}}$$
(6)
$$NRMSE\, = \,\frac{RMSE}{{\overline{y}}}\, \times \,100$$
(7)

where \(i = 1, 2, 3, \dots, n\) indexes the test samples, \({\widehat{y}}_{i}\) and \({y}_{i}\) represent the predicted and measured yield values, respectively, and \(\overline{y }\) is the mean of the measured values.
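A small sketch computing these metrics with NumPy is given below (note the \(n-1\) denominator in the RMSE, following Eq. 6).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                  # Eq. 5
    rmse = np.sqrt(ss_res / (n - 1))          # Eq. 6
    nrmse = rmse / y_true.mean() * 100        # Eq. 7 (percent)
    return r2, rmse, nrmse
```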

The Global Moran's I statistic (Anselin, 1995; Harries, 2006) of the regression residuals was also calculated for each plot, field, and model. First, each trained model was applied to all the plots of each field, and the residuals were then analyzed with Global Moran's I. Understanding the spatial heterogeneity and dependence of crops over the experimental fields is an important consideration for yield prediction, as crop breeding trials often suffer from intrinsic spatial dependence (Haghighattalab et al., 2017). Therefore, many studies have accounted for the effect of spatial autocorrelation in regression-based estimations (Anselin et al., 2004; Ghulam et al., 2015; Maimaitijiang et al., 2015; Peralta et al., 2016). Global Moran's I measures spatial autocorrelation based on both feature locations and values simultaneously. It indicates whether a given set of values is clustered, dispersed, or random at a given spatial scale by producing a Moran's Index and an associated p-value for statistical significance. A positive Moran's Index indicates that the spatial pattern is clustered, whereas a negative index represents a dispersed pattern. However, if the p-value associated with the Moran's Index is not statistically significant, complete spatial randomness of the model residuals cannot be ruled out.
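For reference, Global Moran's I can be computed directly from its definition as sketched below, where `w` is a row-standardized spatial weights matrix between plot centroids; how the weights were constructed for this study is not restated here, so this is an illustrative sketch only.

```python
import numpy as np

def morans_i(residuals, w):
    """Global Moran's I for a residual vector and an (n x n) spatial weights matrix."""
    x = np.asarray(residuals, dtype=float)
    z = x - x.mean()                        # deviations from the mean
    n = len(z)
    s0 = w.sum()                            # sum of all weights
    num = np.sum(w * np.outer(z, z))        # cross-products weighted by spatial proximity
    den = np.sum(z ** 2)
    return (n / s0) * (num / den)
```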

Results

Descriptive statistics of yield

The descriptive statistics for the target variable, yield, are shown in Fig. 6 and Table 2. When all the datasets are considered (32,272 samples), the yield follows a normal distribution (Fig. 6a) with a mean of 3610 kg ha−1 and a standard deviation of 663 kg ha−1 (Table 2). The yield values within each trial type are also normally distributed (Fig. 6b), with the number of samples in MID trials being the highest, followed by ERL and ADV (Fig. 6b and Table 2). As discussed previously, the ERL trials include soybean genotypes at a very early stage of the breeding program, which can result in extreme variation of their yield values. This can be observed in the range values in Table 2: the range of the ERL trials (6225) is considerably higher than that of the MID (4819) and ADV (4621) trials. The interquartile range follows a similar pattern. The distribution of yield values over the different trials and fields also shows a variety of patterns. Figure 6c visualizes the distributions as boxplots for the 13 experimental fields. Each field may or may not contain multiple trials; for example, the ASC-A1 and CON-C4 fields only had ERL trials, whereas IND-I2, RAM-R5, and RAM-R4 had all three trials. All other fields contained at least a combination of MID and ADV trials. Figure 6c clearly shows that the range (the total spread of the whiskers) and the interquartile range (the height of each box) were comparatively larger for the ERL trials, which indicates the higher variability of yield in the earlier stages of soybean breeding. The median yield value of the ERL trials (the middle line in the boxplots) also tends to be higher than the medians of the corresponding MID and ADV trials. Although ERL has the second highest number of samples after MID, only 6 of the 13 fields had ERL trials.

Fig. 6

Descriptive statistics of the yield, where a shows the distribution of the entire dataset, b visualizes the distribution of yield over three trials in the breeding program (i.e., ERL, MID, and ADV), and c shows the individual distribution of yield over both trial and fields. In boxplots: whiskers show maximum and minimum values, the box shows the interquartile range, the middle line is the median

Table 2 Descriptive statistics of yield values for all data and different trials

Prediction results

The test set model evaluation metrics are shown in Table 3, and the R2 values of the different architectures are visually compared in Fig. 7. Overall, VGG-13 and DenseNet-169 both outperformed ResNet-50 in terms of model accuracy. The highest R2 (0.70) and the lowest NRMSE (10.08%) were achieved by VGG with an input shape of \(80\times 80\times 32\) (Table 3). However, VGG also performed very similarly with an input shape of \(48\times 48\times 32\) (R2 = 0.70 and NRMSE = 10.15%), which indicates the usefulness of even coarser spatial resolution for imagery-based yield prediction. The second-best result was achieved by DenseNet with an input shape of \(80\times 80\times 48\) (R2 = 0.69 and NRMSE = 10.33%), followed by \(48\times 48\times 32\) (R2 = 0.69 and NRMSE = 10.44%) with a slight difference in NRMSE. The overall performance of ResNet was poorer than that of VGG and DenseNet; its best performance was achieved with a \(64\times 64\times 64\) input shape (R2 = 0.65 and NRMSE = 11.05%).

Table 3 Evaluation metrics of test set for different models and input shape combinations
Fig. 7

Coefficient of determination (R2) of test set data for the VGG, ResNet, and DenseNet architectures with different input shapes. The x-axis shows the varying input shapes, the left y-axis shows the R2, and the right y-axis shows the number of trainable parameters in millions. The dashed lines correspond to the right y-axis, representing the number of parameters

In terms of different trials, the trained CNN architectures predicted unseen samples from ADV relatively better than MID and ERL samples (Fig. 8). For example, VGG resulted in an R2 of 0.71 and NRMSE of 9.63% for ADV, whereas it slightly underperformed for MID (R2 = 0.67 and NRMSE = 10.00%) and ERL (R2 = 0.67 and NRMSE = 10.32%). A similar pattern was also observed for DenseNet (Fig. 8b, e, and h) and ResNet (Fig. 8c, f, and i). Since the ERL trials included many early candidate soybean genotypes in the breeding operations, the lower performance in ERL trials is expected. However, most of the models consistently provided the best performance for the ADV trials, which included the most advanced soybean lines in the breeding pipeline. It should be noted that crop breeders are more interested in accurate yield prediction of the commercially promising trials rather than preliminary trials, which was achieved by the 3D CNN architectures in this experiment (Fig. 8g-i).

Fig. 8

Scatterplots of measured and predicted yields from different models and trials. The models consistently performed best for ADV trials, followed by the MID and ERL trials. The dashed blue circles indicate regions with higher error rates

Figure 8 also highlights that the inconsistencies between the measured and predicted yields are concentrated in the lower-yield samples compared to the higher-yield samples (marked with blue dashed circles in Fig. 8). This phenomenon was further quantified by dividing the test set prediction results into 4 quartiles (i.e., 0-25th percentile, 25-50th percentile, 50-75th percentile, and 75-100th percentile) of the measured yield range. Figure 9 plots the absolute error over the different quartiles of measured yield along with the corresponding NRMSE values. The NRMSE is consistently the highest in the 1st quartile, followed by the 4th quartile, which indicates that the models could successfully explain the yield variability in the middle quartiles, whereas most of the errors are concentrated in the lower-performing soybean plots.

Fig. 9

Absolute error distribution of the models for different quartiles of the measured yield. The NRMSE is consistently higher for the lower quartile, whereas the middle quartiles showed the least NRMSE

Efficient CNN architecture

The results from this study clearly indicate the efficacy of 3D CNN architectures in capturing the spatiotemporal characteristics of soybean growth and accurately predicting soybean yield. Both VGG and DenseNet outperformed ResNet in overall yield prediction. Sagan et al. (2021) reported that a 3D version of ResNet provided the best performance over hand-crafted feature-based yield predictions from traditional machine learning algorithms (i.e., SVR, RFR, and PLSR) using WorldView and PlanetScope satellite data. However, this study shows that a classic architecture (VGG) and a more streamlined residual network (DenseNet) can outperform ResNet. It is noteworthy that ResNet is a more costly architecture in terms of model parameters compared to VGG and DenseNet. Figure 7 shows three lines representing the number of trainable parameters for the different architectures, and ResNet has significantly more parameters than the others. For example, the best ResNet model, with the \(64\times 64\times 64\) input, had around 130 million parameters, whereas the best VGG and DenseNet models had approximately 53 million and 42 million parameters, respectively. As discussed previously, ResNet uses residual skip connections and additions, which result in a higher number of trainable parameters. Alternatively, DenseNet, a more efficient variant of the residual network, uses concatenation in its dense blocks and results in a smaller number of parameters regardless of input shape. CNN architectures with fewer parameters provide faster training and prediction inference from large amounts of data (Hoeser & Kuenzer, 2020). Therefore, it can be concluded that DenseNet provided the most efficient yield prediction, as its evaluation metrics were consistent while it used fewer model parameters and offered much faster training and inference speed.

Spatial analysis of yield prediction

The spatial pattern of the prediction residuals for the plots in each field was analyzed using the Global Moran's I statistic. Figure 10 shows the prediction map generated from the best performing DenseNet model for one field from each location. Similar maps were generated for all the fields and models in this study, and the Global Moran's I analysis was performed for all models, as illustrated in Fig. 11.

Fig. 10

Prediction map after applying the best DenseNet model over all the plots for four sample fields, i.e., ASC-A1 (a), CON-C1 (b), IND-I1 (c), and RAM-R1 (d)

Fig. 11

Spatial autocorrelation results of the prediction residuals for all samples in different fields and models. A positive Moran's Index indicates spatial clustering, whereas a negative value shows a dispersed spatial pattern. The bars with an asterisk (*) indicate a Moran's I index that is statistically significant at the 95% confidence level (\(p<0.05\))

The Global Moran's I analysis produced statistically insignificant results for some fields at the 95% confidence level (\(p<0.05\)). However, other fields returned statistically significant results, and in most cases DenseNet showed a weaker clustering pattern than the other models. The clustering could result from the planting of soybean plots of similar trials within blocks. In terms of spatial autocorrelation, DenseNet therefore tends to outperform the other models by producing less spatially clustered yield residuals.

Discussions

Effectiveness of RGB images in yield prediction

The contribution of spectral information through different vegetation indices (VIs), structural, and textural features has been well documented in the remote sensing literature for crop trait estimation (Ballester et al., 2017; Zhou et al., 2017). Many studies have utilized CNNs with UAV-based multispectral sensors containing near-infrared and red-edge bands for yield prediction. For example, Li et al. (2023) recommended CNN as the best performing model compared with three statistical machine learning models (i.e., random forest, gradient boosting machine, and K-nearest neighbor) for winter wheat yield prediction using 16 yield-sensitive VIs. Similarly, Tanabe et al. (2023) and Bellis et al. (2022) found CNNs superior for winter wheat and rice yield prediction using several multispectral-derived VIs. In most of these studies, near-infrared based VIs appeared to be highly informative for yield prediction performance.

However, when direct images are used in predictive models instead of hand-crafted VIs, several studies have found that RGB images provide better and more robust predictive models than VI-based models (Nevavuori et al., 2019; Yang et al., 2019). Canopy Height Model (CHM) and Vegetation Fraction (VF) information derived from RGB images has also often been found to be more informative than VI-based spectral features for understanding plant characteristics (Carly et al., 2017; Maimaitijiang et al., 2017; Rischbeck et al., 2016; Wallace, 2013). A probable reason is that direct imagery-based CNN models can capture the spatial variation of canopy architecture within a plot, enabling accurate and efficient per-plot yield mapping. Additionally, the inclusion of multi-temporal observations enables the 3D CNN architectures to learn the temporal growth pattern of crops over space, whereas tabular feature-based modeling would not consider the spatial and temporal dependency of features. Considering these scenarios, RGB image-based yield prediction was adopted, with the focus placed on the specific type of CNN architecture and the spatiotemporal resolution for this task. It is possible that the inclusion of near-infrared bands might improve model performance, but it increases the complexity and cost of the camera system, raises the battery consumption of UAVs, and reduces the amount of training data in the process. Alternatively, the use of an RGB-equipped UAV provides the advantages of simple flight operations and easier image calibration and post-processing in the modeling pipeline.

Effect of spatiotemporal resolution

Although the use of custom CNN architectures has been examined in different studies, it was unclear whether spatiotemporal resolution has any significant effect on model performance. Based on the flight altitude of the UAV, an approximate 3.4 cm spatial resolution was achieved for the orthomosaics. The samples were then resampled to three different spatial grid sizes (Table 1) and tested with the different CNN architectures. The results suggest that even with coarser spatial resolution, the architectures achieved comparable yield prediction performance. While previous studies reported the use of 3-4 cm spatial resolution for each sample plot in CNN architectures (Nevavuori et al., 2019, 2020; Yang et al., 2019), the VGG and DenseNet architectures achieved comparable performance using a \(48\times 48\) spatial grid, corresponding to a ground resolution of approximately 8.93 cm.

In terms of temporal resolution, the temporal dimension was increased from 9 to 32, 48, 64, and 80 by performing linear interpolation along the temporal scale. This was done to accommodate the deeper 3D CNN architectures used in this study: with only 9 temporal steps, the 3D versions of VGG, ResNet, and DenseNet could not be trained because the repeated convolution and pooling operations would reduce the temporal dimension below a valid size. The temporal interpolation enabled the architectures to learn the smooth transition of soybean growth over time. The importance of temporal observations in CNNs and deep learning has also been reported in multiple studies, mainly for crop type classification and yield prediction (Ghamisi et al., 2019; Hauglin & Orka, 2016; Ji et al., 2018; Nevavuori et al., 2020; Sun et al., 2020). However, increasing the temporal dimension did not necessarily improve model performance, as the highest performance of both VGG and DenseNet was achieved with only 32 temporal steps. It can therefore be recommended that UAV flights at two-week intervals can adequately capture the growth variability of soybean for 3D CNN-based model training.

Integration of end-to-end modeling in crop breeding

Improving crop yield is a fundamental goal in achieving food security and sustainable agricultural management. Crop breeding, supported by advances in genetic technologies and high-throughput phenotyping (HTP), aims to achieve such yield goals with fewer natural resources (Araus & Cairns, 2014). However, it still takes years to breed a successful genotype given the high variability of climate and other natural hazards. An end-to-end yield prediction pipeline involving low-cost RGB cameras and UAVs can offer a significant breakthrough in this crop breeding bottleneck (Shakoor et al., 2017; Song et al., 2021). The findings from this study can be readily integrated into current crop breeding pipelines and reduce the time required for typical breeding. It was also found that a 3D version of DenseNet could accurately explain the variability of soybean yield using RGB images, which are easy to collect and process. Although the analysis was done for the 2020-2021 growing season involving four major soybean locations, the methodology can easily be extended to other study areas and crops in the future. Additionally, the concept of transfer learning can be applied to this scenario (Tan et al., 2018; Zhuang et al., 2021), where a model trained on a large dataset is later retrained with a smaller batch of data. Moreover, the use of AWS cloud infrastructure for model training offers significant scalability: a prediction model can be hosted as an endpoint in the cloud and used to predict future yield from new sets of UAV data.

Limitations and future steps

In this study, a UAV equipped with a multispectral sensor capable of collecting blue, green, red, red-edge, and near-infrared information simultaneously was used. However, only the RGB information was used instead of the multispectral data because of radiometric calibration issues. The DJI P4 sensor includes a sunshine sensor that collects simultaneous sun irradiance values (in W m−2) for each scene capture. The camera documentation suggests that the sunshine sensor is accurate enough to perform radiometric calibration without the use of a reference panel on the ground. However, the literature suggests that using a factory-tested calibration panel during UAV data collection provides more accurate reflectance values for analysis (Cao et al., 2019; Iqbal et al., 2018). After processing the orthomosaics and radiometric calibration in Pix4D, incorrect reflectance values were observed for the multispectral bands. For instance, the values of the blue band were significantly higher than the corresponding green or red bands, which is not theoretically possible for healthy vegetation. Therefore, no yield prediction modeling was performed with the multispectral images. The use of a factory-tested calibration panel during data collection is now being considered, and a comparative assessment with correctly calibrated multispectral images will be performed for yield prediction in the future.

Another limitation of this study was the use of a dataset from a single growing season. If multiple growing seasons of data for similar genotypes and experimental sites had been available, the training could have been done on one or two seasons while model performance was validated using data from a different season. However, the major objective of this work was to evaluate the performance of several state-of-the-art CNN architectures in yield prediction and to assess which end-to-end system performed most efficiently. The comparison was done for all the architectures using the same train-test splitting strategy, which made it possible to make robust inferences about relative model performance. In the future, data from more growing seasons will be accumulated and used with the most efficient model from this study (i.e., DenseNet), which will directly enable a more robust splitting strategy.

The major reason for not training any hand-crafted feature-based models in this study was the lack of calibrated reflectance values from the RGB lens. Typically, in a tabular feature space, many different vegetation indices are calculated first, and then statistical machine learning models, such as random forests, support vector machines, or extreme gradient boosting algorithms, are trained. Since the dataset in this study only contained scaled 8-bit DN values, calculation of vegetation indices would be unrealistic. Furthermore, the investigation aimed to explore the potential utility of CNNs within a crop breeding pipeline. While it was evident that CNNs can learn to estimate yield, the models can be further enhanced by implementing transfer learning with datasets from other crops, growing seasons, and locations. Statistical machine learning algorithms, on the other hand, have much shallower architectures, which limits the practicality of transfer learning.

Conclusions

In conclusion, this study presents a comprehensive end-to-end 3D CNN framework for plot-level soybean yield estimation utilizing multitemporal Unmanned Aerial Vehicle (UAV)-based RGB imagery. To the best of the authors' knowledge, this is the first study of its kind, incorporating approximately 30,000 samples across various fields and multiple CNN architectures for soybean yield prediction. Among the key findings, DenseNet emerged as the most effective architecture in terms of both accuracy and model complexity for yield prediction, followed by VGG and ResNet. Although VGG exhibited marginal improvements in evaluation metrics compared to DenseNet, it proved more resource-intensive for training and inference. Furthermore, it was observed that higher spatiotemporal resolution of the input samples did not enhance model performance, but instead increased the complexity of VGG and ResNet. A coarser spatiotemporal resolution of \(48\times 48\times 32\) achieved yield prediction performance comparable to finer resolutions. In addition, DenseNet displayed reduced clustering patterns in its prediction residuals relative to the other models. Overall, the study demonstrated that multitemporal observations employing UAV-based RGB images supplied sufficient information for the 3D CNN architectures to accurately and efficiently estimate soybean yields in a non-destructive manner.

In summary, the results and the development of the end-to-end 3D CNN pipeline underscore the potential for employing cost-effective RGB-based UAV systems in soybean yield prediction, with applications in crop breeding operations and precision agriculture. Nevertheless, it is imperative to evaluate the model’s efficacy across various test sites, crops, and temporal scales. Additionally, incorporating secondary datasets (e.g., weather station data, soil sensor information) as separate branches in the prediction model, in conjunction with the 3D CNN, could further enhance its performance. Consequently, the insights gathered from this study offer valuable guidance on selecting appropriate architectures, determining optimal input configurations, and managing vast amounts of data.