Machine learning in agriculture: from silos to marketplaces
Plant Biotechnology Journal (IF 10.1), Pub Date: 2020-12-02, DOI: 10.1111/pbi.13521
Philipp E. Bayer, David Edwards

Introduction

The increasing global population combined with climate change presents a major challenge for agriculture. Most crops have been bred to perform in specific environments, and given the time required to produce new varieties, it is unlikely that breeders will be able to adapt varieties to the changing climate (Abberton et al., 2016). There is an urgent need to develop new approaches to accelerate the production of high-performing, resilient crop varieties. Crop breeding has seen many changes in recent decades, from the application of molecular markers through to genetically modified and, more recently, genome‐edited crops (Scheben et al., 2017). However, these approaches are often limited by our lack of understanding of the genomic basis of complex traits, even with the deluge of data being generated by new genome sequencing and phenomics technologies. New approaches are required to translate this explosion of data into improved crop varieties.

Deep learning represents a set of machine learning approaches that are not new but have seen major advances in the last few years, with deep learning being adopted in robotics, smart cars, smart homes, and agriculture. The breakthroughs in deep learning have not been driven by major advances in deep learning methods but rather by the increasing availability of large labelled training datasets as well as advances in computational hardware, especially graphics processors (GPUs). With the continued increase in agricultural phenotype and genotype data, there are opportunities to apply deep learning to accelerate crop breeding and agricultural productivity (Figure 1).

Figure 1. An overview of the lifecycle of data in farming. First, data are generated using various observational methods, from cameras to drones to satellites. Then, these data are analysed using different machine learning methods. The predictions of these techniques will then be used to inform business decisions. Outcomes of these business decisions will result in different data being generated in the next growing season, restarting the cycle.

Deep learning approaches are most advanced in the field of image recognition (Kamilaris and Prenafeta‐Boldú, 2018); for example, Convolutional Neural Networks (CNNs, a class of deep neural network) have been developed to count wheat spikes and spikelets with 95.91% and 99.66% accuracy, respectively (Pound et al., 2017). For training, a novel dataset with 520 images of wheat plants was generated. The data were annotated by an expert who manually counted the 4100 ears and 48 000 spikelets and recorded whether each image showed an awned phenotype.

In the field, images taken by aerial drones of lettuce fields were used to accurately count lettuce plants and predict yields at a fraction of the cost of manual counting (Bauer et al., 2019). This was a larger dataset than the wheat spikelet counting data: 60 subsections of lettuce‐growing fields had their pictures taken from light aircraft. Each subsection contained between 300 and 1000 lettuce heads. For training purposes, each lettuce head was manually labelled with a red dot. This work therefore required the labelling of up to 60 000 lettuce heads. The complete training set, including labelled regions containing no lettuce heads, comprised 100 000 manually generated and curated data points.

Another example is SeedGerm, a breeding platform that combines cost‐effective custom hardware implementation with a graphical user interface that uses three different machine learning approaches to automatically phenotype commercial seeds in the growth chamber (Colmer et al., 2020). The performance of SeedGerm matches that of experts when asked to score radicle emergence.

Outside of image recognition, there are few examples of the application of deep learning in crops. The prediction of phenotypes from genotypes is increasingly being applied in crop breeding but remains challenging due to the complexity of interactions in the data; however, this may be addressed using deep learning approaches. DeepGS is a CNN trained with data from 2403 Iranian bread wheat landraces from CIMMYT’s wheat gene bank (Ma et al., 2018). The data consisted of 33 709 DArT markers per individual, and each individual was assessed for eight phenotypes of agronomic importance, such as grain hardness and plant height. Trained using the DArT markers to predict these eight phenotypes, DeepGS improved prediction accuracy by between 1% and 65% over the state‐of‐the‐art RR‐BLUP approach. A later publication used a CNN similar to DeepGS along with many other state‐of‐the‐art machine learning approaches on six plant datasets (Azodi et al., 2019). These datasets ranged from 332 178 markers for 391 maize individuals to 4420 markers for 5014 soybean individuals, and each species was phenotyped for three traits drawn from height, flowering time, yield, grain moisture, time to R8 developmental stage, diameter at breast height, wood density, and standability. Six linear and six nonlinear approaches, including Artificial Neural Networks (ANNs) and CNNs, were tested for accuracy in phenotype prediction. No approach, including the more complex ANNs and CNNs, consistently outperformed all others, and there was no clear link between a trait and which algorithm predicted it best. This supports a large body of work showing that machine learning experts need to consider and assess multiple diverse approaches. At the same time, these results make clear that machine learning methods will not replace skilled plant breeders. Instead, these methods will support their work, making it more accurate and reliable.
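
The kind of comparison described above can be sketched on synthetic data. The following is a minimal illustration only, not the method of Ma et al. (2018) or Azodi et al. (2019): ridge regression with a fixed penalty stands in for RR‐BLUP, the marker matrix and phenotype are simulated, and every function name, dataset size and penalty value is an assumption chosen for illustration. Each candidate model is scored on the same cross‐validation folds so that the scores are comparable.

```python
import numpy as np


def simulate_markers(n_individuals, n_markers, n_causal, noise_sd, rng):
    """Simulate genotypes coded 0/1/2 and a phenotype driven by a few causal markers."""
    genotypes = rng.integers(0, 3, size=(n_individuals, n_markers)).astype(float)
    effects = np.zeros(n_markers)
    causal = rng.choice(n_markers, size=n_causal, replace=False)
    effects[causal] = rng.normal(0.0, 1.0, size=n_causal)
    phenotype = genotypes @ effects + rng.normal(0.0, noise_sd, size=n_individuals)
    return genotypes, phenotype


def mean_baseline(x_train, y_train, x_test):
    """Predict the training mean for every test individual (ignores the markers)."""
    return np.full(x_test.shape[0], y_train.mean())


def ridge_fit_predict(x_train, y_train, x_test, penalty):
    """Ridge regression solved in dual form, which is cheap when markers outnumber individuals."""
    x_mean = x_train.mean(axis=0)
    y_mean = y_train.mean()
    centred = x_train - x_mean
    gram = centred @ centred.T  # n_train x n_train matrix
    alpha = np.linalg.solve(gram + penalty * np.eye(len(y_train)), y_train - y_mean)
    return (x_test - x_mean) @ centred.T @ alpha + y_mean


def cross_validated_mse(model, genotypes, phenotype, n_folds, seed):
    """Mean squared prediction error averaged over k folds; the seed fixes the fold split."""
    order = np.random.default_rng(seed).permutation(len(phenotype))
    folds = np.array_split(order, n_folds)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predicted = model(genotypes[train_idx], phenotype[train_idx], genotypes[test_idx])
        errors.append(np.mean((predicted - phenotype[test_idx]) ** 2))
    return float(np.mean(errors))


# Hypothetical sizes: 300 individuals, 500 markers, 20 of them causal.
genotypes, phenotype = simulate_markers(300, 500, 20, 1.0, np.random.default_rng(0))

models = {
    "mean baseline": mean_baseline,
    "ridge": lambda xtr, ytr, xte: ridge_fit_predict(xtr, ytr, xte, penalty=50.0),
}
# Same seed for every model, so all models are scored on identical folds.
scores = {
    name: cross_validated_mse(model, genotypes, phenotype, n_folds=5, seed=1)
    for name, model in models.items()
}
for name, mse in scores.items():
    print(f"{name}: cross-validated MSE = {mse:.2f}")
```

Further candidates (for example a neural network) could be added to the `models` dictionary as functions with the same three-argument signature. In a real study the penalty would be tuned or estimated rather than fixed, and the ranking of models should be expected to vary by trait and dataset, as the studies above report.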

A major limitation to the advancement of deep learning in agriculture is the availability of suitable high‐quality labelled data. As outlined above, deep learning approaches need tens of thousands of data points to make highly accurate predictions possible. Deep learning is also somewhat vulnerable to unseen conditions and therefore requires data from as many distinct conditions as possible. Vast quantities of genomic and phenotypic data are generated each year; however, these data are predominantly maintained in data silos, and training labels are either missing or not curated. Unlabelled data can either be labelled manually by experts or have labels imputed using well‐trained machine learning solutions. Commercial systems exist to automatically capture farm‐scale data, but due to the lack of agreed standards these data often remain within the data silos of the company which built the data capture solution. However, there is an increasing culture of sharing and integrating crop data, and some bodies such as the AgBioData consortium, the wheat information system and the international rice informatics consortium are establishing best practices for integrating crop data across data silos (Harper et al., 2018; Scheben et al., 2018).
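
Label imputation, as mentioned above, can be illustrated with a deliberately simple sketch. It assumes a nearest‐centroid classifier in place of a well‐trained model, made‐up two‐dimensional feature vectors, and hypothetical class names; the confidence rule (`min_ratio`) is likewise an assumption. A label is imputed only when the nearest class is clearly closer than the runner‐up, and all other points are set aside for expert labelling.

```python
from math import dist
from statistics import fmean


def fit_centroids(labelled):
    """Average the feature vectors of each class. `labelled` is a list of (features, label)."""
    by_label = {}
    for features, label in labelled:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def impute_labels(centroids, unlabelled, min_ratio=2.0):
    """Impute the nearest class only if the runner-up is at least `min_ratio` times farther away.

    Requires at least two classes. Returns (imputed, needs_expert).
    """
    imputed, needs_expert = [], []
    for features in unlabelled:
        ranked = sorted(centroids, key=lambda label: dist(features, centroids[label]))
        nearest = dist(features, centroids[ranked[0]])
        runner_up = dist(features, centroids[ranked[1]])
        if runner_up >= min_ratio * nearest:
            imputed.append((features, ranked[0]))
        else:
            needs_expert.append(features)  # ambiguous: leave for manual labelling
    return imputed, needs_expert


# Made-up measurements for illustration only.
labelled = [
    ((1.0, 1.0), "class A"),
    ((1.2, 0.8), "class A"),
    ((4.0, 4.2), "class B"),
    ((3.8, 4.0), "class B"),
]
unlabelled = [(1.1, 1.1), (4.1, 3.9), (2.5, 2.5)]

centroids = fit_centroids(labelled)
imputed, needs_expert = impute_labels(centroids, unlabelled)
print("imputed:", imputed)
print("for expert review:", needs_expert)
```

Here the two points close to a class centroid receive imputed labels, while the point lying midway between the classes is routed to an expert. Imputed labels inherit the errors of the model that produced them, so a conservative threshold and spot checks by domain experts remain advisable.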

Along with the exchange of data, there is a requirement for an exchange of knowledge and skills as technical approaches develop. Choosing which particular deep learning architecture or data‐cleaning steps to use in a new project is still a matter of experience, which leads to a second type of silo: the silo of skills. Skills are often shared through informal social media, blogs, forum posts, and communities of practitioners such as Kaggle, which complement the more traditional scientific publications, reference books and presentations at conferences. Practitioners seeking to apply machine learning in agriculture will need to connect to these streams in order to assess models and drive the field forward.

If we are going to continue to accelerate crop production, we need to capture, label and connect as much data as possible, to train complex machine learning models and apply them to crop breeding. This will require that plant breeding companies, universities and other research institutions share large amounts of clean and labelled data from their own data silos in agreed formats, as well as connect the silos of knowledge and skills needed to interpret these data. Connecting these disparate data requires an enormous effort by researchers with experience in handling and normalizing ‘dirty’ data together with enough domain knowledge to assess the quality of agricultural data. This needs to be paired with skilled practitioners on the data‐generating side who manually label training datasets with an understanding of what the machine learning practitioners require of the labels and the data. Currently, these sets of skills are rare. Over time, some of the skills required may be supplanted by automated machine learning (AutoML) approaches, where parts of the machine learning pipeline are supervised and customized by other learning algorithms, allowing non‐experts to perform the required analyses. To our knowledge, no AutoML implementations aimed at non‐experts in the agriculture space are available.

Only by merging disparate silos will there be a deep learning revolution in agriculture. Data from different silos and backgrounds will result in models that generalize much better than models trained on data from only one source. Given the increased prediction accuracy with larger datasets, we expect that within the next few years, companies and academia will increasingly pool their training data and share their experience, providing a significant advantage over groups which maintain their data and skills within silos.




Updated: 2020-12-02