
Plants meet machines: Prospects in machine learning for plant biology
Applications in Plant Sciences (IF 2.7), Pub Date: 2020-07-01, DOI: 10.1002/aps3.11371
Pamela S. Soltis, Gil Nelson, Alina Zare, Emily K. Meineke

Machine learning approaches are affecting all aspects of modern society, from autocorrect applications on cell phones to self‐driving cars to facial recognition, personalized medicine, and precision agriculture. Although machine learning has a long history, drastic improvements in these application areas recently have been driven by improvements to computational infrastructure; increased computing power; increased ability to collect, manage, and store very large amounts of data; and algorithmic advances. Multiple types of machine learning have been developed, each with its own techniques, strengths, and weaknesses, making certain approaches better matches for certain problems than others.

Supervised machine learning and the use of neural networks (e.g., deep learning; Table 1) underlie much of the recent accelerated application of machine learning to many biological problems, including those across a range of scientific questions in plant science. For example, deep learning technologies have recently achieved impressive performance on a variety of predictive tasks, such as species identification (Unger et al., 2016; Carranza‐Rojas et al., 2017), plant species distribution modeling (e.g., Zhang and Li, 2017; Botella et al., 2018), weed detection (Yu et al., 2019), and detection of mercury damage to herbarium specimens (Schuettpelz et al., 2017). They are also being applied to questions of comparative genomics (e.g., Xu and Jackson, 2019) and gene expression (Mochida et al., 2018) and to conduct high‐throughput phenotyping (e.g., Singh et al., 2016; Ubbens and Stavness, 2017) for agricultural and ecological research. Moreover, novel approaches are poised to revolutionize studies of plant phenology (e.g., Pearson et al., 2020) and functional traits through application to more than 30 million images of herbarium specimens now available at iDigBio (http://www.idigbio.org) as well as other digital repositories.

Table 1. Glossary of terms related to machine learning.

Term	Definition
Artificial neural network	A type of machine learning algorithm whose computational model is (loosely) motivated by biological neural networks.
Deep learning	The use of artificial neural networks composed of many layers of neurons.
Supervised learning	A type of machine learning in which a model is fit using labeled training examples.
Unsupervised learning	A type of machine learning in which data samples are unlabeled. The goal of unsupervised learning is to uncover the latent structure in the data.
Clustering	A type of unsupervised learning in which the goal is to partition the data into groups that are composed of similar samples.
Classification	A type of supervised learning in which the goal is to identify (i.e., classify) samples into one of several known categories.
Convolutional neural networks	A type of artificial neural network (or deep learning network, if the network consists of many layers) in which spatial arrangement of input data (e.g., pixels in an image) is leveraged during analysis.
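
The distinction between supervised learning (classification) and unsupervised learning (clustering) in Table 1 can be illustrated with a minimal sketch. The leaf lengths, the class names, and the helper functions `fit_centroids`, `classify`, and `two_means` are all hypothetical and chosen only for illustration; the nearest‐mean classifier and the one‐dimensional two‐group k‐means used here are simple stand‐ins, not methods taken from any paper in this collection.

```python
def fit_centroids(samples, labels):
    """Supervised step: use the labels to compute one mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def classify(x, centroids):
    """Classification: assign x to the class whose mean is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def two_means(samples, iterations=10):
    """Clustering: split unlabeled values into two groups (1-D k-means, k = 2)."""
    low, high = min(samples), max(samples)  # initial group centers
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for x in samples:
            nearer_high = abs(x - high) < abs(x - low)
            groups[1 if nearer_high else 0].append(x)
        # Move each center to the mean of the values assigned to it.
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return groups


# Hypothetical leaf lengths in cm, invented for illustration.
lengths = [2.0, 2.4, 2.2, 7.9, 8.3, 8.1]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = fit_centroids(lengths, labels)  # supervised: labels are used
predicted = classify(3.0, centroids)        # a new, unlabeled leaf
clusters = two_means(lengths)               # unsupervised: labels are ignored
print(predicted, clusters)
```

The classifier needs the labeled examples to place a new sample in a known category, whereas the clustering step recovers the same two groups from the measurements alone, without being told what the groups mean.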

The application of machine learning methods to extract data from herbarium specimens has grown and diversified in a few short years, beginning with species identification in a specific geographic region (e.g., Unger et al., 2016). Subsequent attempts to use deep learning to tackle the difficult taxonomic task of identifying species in large collections of herbarium specimens showed that convolutional neural networks trained on thousands of digitized herbarium sheets are able to learn highly discriminative patterns (e.g., Carranza‐Rojas et al., 2017). These results are very promising for extracting a broad range of accurate annotations in a fully automated way. Such approaches are also being applied to identification of plant phenophase (i.e., bud, flower, fruit), which is important for assessing the effects of climate change on plant growth and reproduction and for comparing plant responses with those of pollinators, migratory birds, and other species that rely on plants for food and/or nesting sites (see, e.g., Lorieul et al., 2019; Pearson et al., 2020; Brenskelle et al., 2020; Goëau et al., 2020). Likewise, other evolutionary or ecological traits, such as leaf shape and size, leaf margins, and flower color, could also potentially be scored from images of herbarium specimens. However, despite the promise of applying deep learning to herbarium specimen images to address a range of questions, this emerging field also raises challenging methodological questions about how to avoid any bias and misleading conclusions when analyzing the produced data. Indeed, as for any statistical learning method, convolutional neural networks are sensitive to bias issues, including the way in which the training data sets are built. 
Moreover, as good as the prediction might be on average, the quality of the produced annotations can be very heterogeneous from one sample to another, depending on various factors such as the morphology of the species, the storage conditions in which the specimen was preserved, and the age of the specimen when imaged. Given both the opportunities and challenges, additional research into the application of machine learning approaches to herbarium specimen images is needed to enable greater applicability to a broad range of scientific questions.

The field of machine learning is moving rapidly, with the development of alternative approaches that may be best suited to specific questions, data sources, and analytical techniques. This special collection of articles in Applications in Plant Sciences presents 16 papers, published across two issues of the journal, that explore methods and applications of machine learning to studies of plant ecology, morphology, genomics, and agriculture. The first issue comprises eight papers and focuses on applications to images of herbarium specimens, on topics from phenology to herbivory. The second issue includes papers that address a broader range of topics, data, and biological scale. We summarize the content of both issues here.

Plant phenological research has seen major advances in recent years through the use of herbarium specimens (Willis et al., 2017). Herbarium specimens collected over the past three centuries provide insight into flowering, leaf‐out, and fruit timing globally and across plant phylogeny (Davis et al., 2015). A major hurdle, however, is that harnessing the full power of herbarium specimens for phenological research requires counting reproductive structures, which can be time consuming. Thus, automated recognition of reproductive structures on herbarium specimens is a key goal in current phenological research (Lorieul et al., 2019; Pearson et al., 2020). Two papers in this special issue address plant reproductive phenology. To make use of the extensive volume of herbarium specimens for examining angiosperm reproductive phenology, Goëau et al. (2020) applied a state‐of‐the‐art segmentation approach (mask R‐CNN) to automate locating, segmenting, and counting reproductive structures on images of herbarium specimens of Streptanthus tortuosus Kellogg (Brassicaceae). Phenological stages (i.e., buds, flowers, immature fruits, mature fruits) are distinct in S. tortuosus, and specimens were scored for phenophase. Evaluation of the performance of the method indicated that it shows particular promise in identifying the number of reproductive structures (accuracy was nearly 80%), but the accuracy of the results varied with respect to the training annotations, the type of reproductive structures scored, and the size of the reproductive structures. Although promising, these results suggest that further refinement is needed, and it is unclear how well the approach will scale to other species with different floral morphologies and perhaps less well‐differentiated phenophases.

Training machine learning algorithms to recognize reproductive structures, however, will require massive amounts of input data, because these algorithms are data‐hungry. In this issue, Brenskelle et al. (2020) assess the conditions needed for volunteers to help gather these data. The authors test for the effects of training type (in person or online), career stage, plant taxon, and phenological stage scored on the accuracy of volunteer‐provided phenological data from herbarium specimens. Regardless of expertise and training method, users provided highly accurate data, although data from people trained in person were more accurate than those trained online. This study provides a best practices guide for collecting annotation data. Importantly, the authors also demonstrate that online citizen science platforms might be able to provide accurate annotation data that can then be used downstream to train machine learning algorithms to recognize phenological stages.

Morphological variation, coupled with variation in the quality of herbarium specimens, leads to noise and potential bias in automated coding of characters from specimen images. Image segmentation is a computer vision technique that groups together pixels of an image that have similar attributes and generates a mask for each focal object in the image, such as a flower in an image of a herbarium specimen. Application of masks, such as those applied to plant phenophases, can help to reduce noise and bias. White et al. (2020) developed a workflow to apply segmentation masks to plant images using deep learning. Focusing on ferns, they generated a model that could segment herbarium images automatically, efficiently, and accurately across the morphological diversity of this clade. Although their study was restricted to ferns, the workflow is generalizable to all herbarium images and, with modification, may be applicable to other clades of plants with highly different morphologies.
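
The idea of grouping similar pixels and producing one mask per object can be shown with a toy sketch. This is not the deep learning workflow of White et al. (2020): it assumes a simple brightness threshold and connected‐pixel grouping on a tiny, invented grayscale image, and the function name `segment` and the threshold value are hypothetical.

```python
from collections import deque


def segment(image, threshold):
    """Return one binary mask per connected group of pixels darker than threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    masks = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue  # background pixel, or already assigned to an object
            mask = [[0] * cols for _ in range(rows)]
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill over 4-connected neighbors
                y, x = queue.popleft()
                mask[y][x] = 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] < threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            masks.append(mask)
    return masks


# Invented 5 x 6 grayscale "specimen": bright sheet (220) with two dark objects.
image = [
    [220, 220, 220, 220, 220, 220],
    [220,  40,  50, 220, 220, 220],
    [220,  45, 220, 220,  60, 220],
    [220, 220, 220, 220,  55, 220],
    [220, 220, 220, 220, 220, 220],
]

masks = segment(image, threshold=128)
areas = [sum(map(sum, mask)) for mask in masks]  # pixel count of each object
print(len(masks), areas)
```

Each mask isolates one object, so downstream measurements (here, only the pixel area) are computed on the object alone rather than on the whole sheet, which is how masking can reduce noise from labels, rulers, and background.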

Plants and insects have been interacting for 400 million years, and these interactions have likely driven diversification of both clades. The fossil record shows evidence of herbivory, providing a glimpse into long‐term patterns of plant–herbivore interactions and evolution. However, how herbivory changes over shorter timescales and geography is much less clear. Despite the fact that botanists generally attempt to collect specimens that are free of herbivore damage, herbarium specimens offer a view of plant–herbivore interactions over the past three or four centuries, with the potential to infer spatial and temporal patterns of herbivory, including response to climate change (Meineke and Davies, 2018; Meineke et al., 2018). However, manual scoring of insect damage to herbarium specimens is extremely laborious, and the possibility of applying machine learning to quantify the patterns and extent of insect damage to plant specimens is appealing. Meineke et al. (2020) began exploring the ability of machine learning methods to classify multiple types of herbivory (and its absence) across a pair of divergent plant species. Although herbivory could not always be classified with high accuracy, the use of hand‐drawn boxes to locate areas of potential herbivory increased the accuracy of herbivory classification to 81.5%. The authors further identify ways to improve the accuracy of the models in future applications, potentially paving the way for exploring patterns of herbivory in relation to climate change, invasive species, and more.

The contributions of machine learning to the plant sciences, especially for automated species identification from images of digitized herbarium specimens, are showing great promise (Schuettpelz et al., 2017; Wäldchen and Mäder, 2018). This is especially true for genera with only slight morphological variation among species, particularly when compounded by hybridization and the presence of infraspecific taxa. Pryer et al. (2020) have built on this work with Equisetum L., a distinctive genus with 15 extant species complicated by morphological plasticity and frequent hybridization events that have resulted in a disproportionately high number of misidentified herbarium specimens. Equisetum includes two relatively distinct species (E. hyemale L. and E. laevigatum A. Braun) and a widespread, sexually sterile hybrid between them (E. ×ferrissii Clute) (Rutz and Farrar, 1984; Des Marais et al., 2003). The challenges faced here result from the cylindrical nature of the stem, which results in dramatic differences in specimen images due to factors such as the geometry of the flattened stems, the number of stems included on a single sheet, stem colors, and imaging parameters. Compounding the variations among images is the fact that accurate identification has more to do with the appearance of stem nodes and strobili than other features. Through successive testing of several models, Pryer and colleagues discovered that, out of 30 test images, 27 were classified correctly. Although the number of specimens is probably too small to be broadly generalizable, E. hyemale images were correctly classified in nine of 10 cases, E. ×ferrissii images in eight of 10 cases, and E. laevigatum images were never confused, resulting in an accuracy of 90%. These results suggest strong potential for machine learning’s impact on the accurate determination of closely similar taxa.
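
The 90% figure follows directly from the per‐taxon counts reported above, as the short calculation below shows. It assumes that "never confused" means all 10 E. laevigatum test images were classified correctly, which is consistent with the stated total of 27 correct out of 30; the function names are hypothetical.

```python
# (correct, total) test images per taxon, as reported in the text.
# The 10-of-10 entry for E. laevigatum is our reading of "never confused".
results = {
    "E. hyemale": (9, 10),
    "E. ×ferrissii": (8, 10),
    "E. laevigatum": (10, 10),
}


def overall_accuracy(per_class):
    """Fraction of all test images that were classified correctly."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return correct / total


def per_class_rates(per_class):
    """Fraction classified correctly within each taxon."""
    return {name: c / t for name, (c, t) in per_class.items()}


accuracy = overall_accuracy(results)
rates = per_class_rates(results)
print(f"overall accuracy: {accuracy:.0%}")
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Reporting the per‐taxon rates alongside the overall figure makes it visible that the errors fall on E. hyemale and on the hybrid, which a single pooled accuracy would hide.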

In their contribution, Ott et al. (2020) outline the development and output of GinJinn, object‐detection software designed to extract leaf images from herbarium specimens based on the TensorFlow (Abadi et al., 2016) object‐detection application programming interface (API), an API designed to make supervised deep learning object detection accessible for plant scientists. Although GinJinn makes heavy use of TensorFlow’s API, the authors maintain that GinJinn is not merely a wrapper for the API; it also provides data preprocessing, project setup, pretrained model download, simple model exporting, and the use of trained networks for the extraction of bounding boxes from newly acquired data. GinJinn was tested on a data set of 286 JPEG images of preserved plant herbarium specimens provided by the herbarium of the Botanic Garden and Botanical Museum Berlin‐Dahlem, Berlin, Germany. The images were annotated using the free open‐source tool LabelImg version 1.8.1 (https://github.com/tzutalin/labelImg), resulting in a total of 889 annotated intact leaves within 243 images of herbarium specimens of two species of Leucanthemum Mill. (the diploid L. vulgare Lam. and the tetraploid L. ircutianum DC.) known for their high variability in leaf shape. The task is complicated by the rare occurrence of intact leaves versus non‐intact leaves in these species. Using 183 specimens as the training data set, the GinJinn pipeline extracted one or more intact leaves in 95% of 61 test images.

A major challenge to cataloging and describing plant diversity lies in the development of high‐throughput technologies that facilitate rapid discovery of new taxa hidden in the backlog of still‐to‐be processed herbarium specimens. The 400,000 plant species currently known to science have required more than 250 years to name and classify, and as many as 70,000 flowering plant species are likely yet to be discovered (Joppa et al., 2011). Many of these may well be among the estimated one million specimens currently backlogged in herbaria. From Little et al.’s (2020) perspective, this renders herbaria largely untapped resources for the new and rapidly developing use of artificial intelligence (AI) in taxonomic research (Wäldchen and Mäder, 2018). To capitalize on this enthusiasm and encourage an increasing number of AI specialists to devote attention to algorithms that can produce species identifications, these authors mounted a competition on the Kaggle platform to crowdsource effective machine learning algorithms for analyzing plant specimen images. The competition data set included 46,469 images representing 683 species of the family Melastomataceae (Tan et al., 2019). In just two months, 254 models were developed that automatically identified the taxa among these digital representations, with the top four models identifying specimens to species with >88% accuracy.

Trait extraction from herbarium specimens can be laborious and time consuming, making the process an excellent candidate for the application of high‐throughput machine learning protocols and algorithms. Here, Weaver et al. (2020) describe and test LeafMachine, an automated, open‐source software tool for recognizing and measuring leaf dimensions from herbarium specimens and single leaf images across a wide range of largely woody taxa (trees, shrubs, lianas), although some herbaceous taxa were also included. The tests show varying results based on image resolution, specimen presentation, leaf condition, and whether leaf clumping was present. Of ~1000 images containing measurable leaves as confirmed through assessment, LeafMachine produced morphometric information for at least one leaf in 82.0% of high‐resolution images and 60.8% of low‐resolution images, results that the researchers consider positive but that leave room for enhancement as machine learning technologies advance.

The second set of papers explores a broad range of topics, beginning with application of machine learning approaches to agriculture. The use of herbicides to control weeds in agricultural fields is costly both economically and environmentally, and alternatives are needed, especially for organic farming. Possible solutions include the use of targeted application of small doses of herbicide precisely on weeds via a robotic detector and application system and non‐herbicide methods of removal such as electrocution. However, such approaches to precision agriculture require highly accurate methods of detection and identification of weeds in agricultural fields. Champ et al. (2020) applied an instance segmentation convolutional neural network to robotically generated images of agricultural field plots to detect individual plants and then identify them as crops or weeds. Using this mask R‐CNN approach, the authors were able to correctly identify individual maize and bean crop plants at average precision values of 0.85 and 0.59, respectively; identification of weeds was generally more difficult, with average precision values as high as 0.73 for Brassica nigra W. D. J. Koch but less than 0.5 for the other weeds studied. Using these detection results, up to 60% of weeds could be removed, and plant centroids were more precisely located than with alternative bounding box approaches. Refinement of the models to account for plant species, plant size, plant position, and possible crop–weed interactions could improve accuracy for greater automated weed removal with fewer possibilities of confusion with crops.

Plant–insect interactions are biodiverse (Forister et al., 2015) and can be highly consequential for agricultural productivity (Sharma, 2014) and ecosystem function (Kurz et al., 2008). As a result, quantifying plant traits associated with resistance to insects is of broad interest in the natural sciences. One such type of defense against insect herbivores is trichomes, small hairs that serve as mechanical defenses that discourage insect herbivore feeding, oviposition, and movement. As with many such leaf traits, counting the trichomes required to address a given research hypothesis can be a Herculean task. Mirnezami et al. (2020) make advances toward automating quantification of trichome densities by capturing images of leaves, making the leaves transparent through a clearing process, and applying novel semi‐automatic and automatic methods for counting trichomes. They then compare results from these novel methods to manual counting and determine that the most accurate novel method was semi‐automatic (requiring input from the user) and was 90% accurate at estimating trichome densities on leaf surfaces. Although fully automated trichome counting has not yet been achieved, this study represents an important and detailed description of a major step forward in automated defense trait phenotyping for plants.

Given the ability to automate the estimation of plant traits, a follow‐on question would be whether the plant traits extracted could be reliably used for plant species identification. Furthermore, could the most informative traits for species identification be determined using machine learning approaches? Almeida et al. (2020) investigate the use of decision trees both for identifying plants from trait databases and for determining the most informative traits distinguishing between species. Using the TRY Plant Trait Database (Kattge et al., 2011, 2020) and a collection of species that spanned trees, herbs, grasses, and other taxa, they were able to correctly identify plant species with up to 90% accuracy in cross‐validation. Traits such as leaf shape, fruit type, and flower color were identified as being some of the most informative. As more plant trait data are collected (including by automated methods as mentioned above), the type of approach presented in this paper can be used to guide and inform the data collection process.
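
One way a decision tree can rank traits by how informative they are is through information gain, the reduction in label uncertainty (entropy) obtained by splitting the records on a trait. The sketch below illustrates that criterion only; the summary above does not state which criterion Almeida et al. (2020) used, and the four records, the species names "A" and "B", and the function names are invented for illustration rather than drawn from the TRY database.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(records, trait):
    """Entropy of the species labels minus the entropy left after splitting on trait."""
    groups = {}
    for record in records:
        groups.setdefault(record[trait], []).append(record["species"])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([r["species"] for r in records]) - remainder


# Invented trait records for two hypothetical species.
records = [
    {"species": "A", "leaf_shape": "ovate", "fruit_type": "berry", "flower_color": "white"},
    {"species": "A", "leaf_shape": "ovate", "fruit_type": "berry", "flower_color": "yellow"},
    {"species": "B", "leaf_shape": "linear", "fruit_type": "berry", "flower_color": "white"},
    {"species": "B", "leaf_shape": "linear", "fruit_type": "capsule", "flower_color": "yellow"},
]

traits = ["leaf_shape", "fruit_type", "flower_color"]
gains = {trait: information_gain(records, trait) for trait in traits}
ranked = sorted(traits, key=gains.get, reverse=True)
print(ranked)
```

In this toy data set leaf shape separates the two species perfectly, fruit type helps only partly, and flower color carries no information, so a tree would split on leaf shape first. A ranking of this kind is what allows trait‐based models to guide which traits are worth collecting.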

Acquiring high‐resolution images of plant root architecture for use in downstream analysis and machine learning algorithms has proved a challenging endeavor. Most current methods use techniques that are destructive to root architecture (e.g., Trachsel et al., 2011); involve ex situ imaging under controlled conditions, often using aboveground rhizotrons (chambers with windows into the soil of plants under cultivation); incorporate intrusive methods through which cameras are inserted into the ground (Johnson et al., 2001), sometimes by soil coring (Wu et al., 2018), with the tendency to disturb soil and roots; or use non‐intrusive methods such as ground‐penetrating radar for trees and woody plants with roots ≥1 cm in diameter or X‐ray computed tomography (Tabb et al., 2018) or magnetic resonance imaging (Pflugfelder et al., 2017) for pot‐grown plants with finer root systems. Ruiz‐Munoz et al. (2020) report on experiments to improve the resolution of these images by adapting two state‐of‐the‐art deep learning approaches, the Fast‐Super‐Resolution Convolutional Neural Network (FSRCNN) (Dong et al., 2016) and the Super Resolution Generative Adversarial Network (SRGAN). Their method is designed to estimate high‐resolution output from low‐resolution images to expose details not clearly delineated by a sensing device. Results of these evaluations demonstrate that these super‐resolution models outperform the basic bicubic interpolation even when trained with non‐root data sets.

Supervised methods are the machine learning methods most commonly applied in plant science. Machine learning approaches are often used to automate, or to reduce the effort and time needed for, tasks that were traditionally completed manually by researchers. These sorts of tasks lend themselves well to supervised approaches. Yet machine learning approaches also provide mechanisms for data mining and unsupervised exploration of collected data. Saryan et al. (2020) investigated and proposed the use of unsupervised spectral clustering to aid in the discovery of species boundaries. Comparing the method with principal component analysis and non‐metric multidimensional scaling, the authors determine that interactive spectral clustering can lead to improved partitioning and understanding in some problems and data sets.

Text recognition and mining are useful in a range of applications including the automated processing of specimen labels and search indexing. Thus, the automated recognition of Latin scientific names can be particularly useful for some applications. Little (2020) investigated and developed an open‐source browser‐executable approach for Latin scientific name recognition using artificial neural networks. The method relies on an ensemble network approach that can recognize Latin scientific names across a range of languages (e.g., Chinese, French, German, Japanese) with high recall and precision and at competitive speeds of 8.6 ms/word.
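
Recall and precision, the two measures cited for name recognition, can be computed as in the sketch below. The example names and the function `precision_recall` are hypothetical and are not taken from Little (2020); the sketch only shows how the two quantities are defined for a set of recognized names compared with the names actually present in a text.

```python
def precision_recall(predicted, actual):
    """Precision: share of predicted names that are real.
    Recall: share of real names that were found."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Scientific names truly present in a hypothetical text.
actual_names = {"Quercus alba", "Acer rubrum", "Pinus strobus", "Zea mays"}
# Output of a hypothetical recognizer: two correct names and one false positive.
predicted_names = {"Quercus alba", "Acer rubrum", "New York"}

precision, recall = precision_recall(predicted_names, actual_names)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A recognizer can trade one measure against the other (flagging fewer strings raises precision but lowers recall), which is why both are reported together.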

Plant genomes are generally large and complex, with multi‐gene families and high amounts of repeated sequences. With over 200 plant genomes now published (Chen et al., 2018), many more underway, and both genomic and transcriptomic resources available for thousands of other plant species (e.g., Matasci et al., 2014; Leebens‐Mack et al., 2019), data are now available for comparative analysis of plant genomes across phylogenetic scales. Although methods for identifying genic regions are currently quite successful, tools for inferring gene function and other attributes of plant genomes require further refinement. Machine learning approaches are being applied to a range of problems in plant genomics, and Mahood et al. (2020) review the promise of these methods. They focus on supervised machine learning for predicting gene function from sequence information as well as post‐genomic data. Because gene function may vary spatially and temporally within a plant and have either direct or indirect effects on phenotypes, functional prediction involves a combination of analyses aimed at genome structure, gene expression patterns, and protein–protein interactions, and the authors review machine learning methods aimed at each of these problems as well as those designed to integrate information across molecular and biological scales. Beyond introducing these methods, the authors identify current roadblocks to more efficient models and suggest possible solutions.

Many machine learning methods have been developed for visual imagery and text as outlined above. Yet more and more methods are being developed and adapted to non‐visual imagery such as X‐ray computed tomography, ground‐penetrating radar, and hyperspectral imagery (Zare and Ho, 2013; Rogers et al., 2016; Travassos et al., 2018). Théroux‐Rancourt et al. (2020) developed a three‐dimensional segmentation and characterization approach for leaf internal anatomy using X‐ray microcomputed tomography. The approach outlined by the authors leveraged a small number of hand‐segmented image slices to automate segmentation over more than 1000 scans with accuracies of greater than 90%. The approach is focused on segmented grapevine leaf scans while requiring minimal manual labeling, but highlights the possibilities of being able to apply machine learning methods to automate the analysis of a wide variety of data and image types.

The application of machine learning to questions in plant biology is still in its infancy, yet the promise of these methods to a broad range of problems is clear. From genomic tools to measures of plant morphology, growth, and development, and from assessing ecological interactions of plants with herbivores and their broader, changing environment to use in agriculture, new approaches involving machine learning have the potential to change how we study plants and even the questions we can ask. Further integration with fields ranging from subcellular to ecosystem scales, all likewise enabled by new machine learning approaches, will further enable new discoveries in plant biology. However, as the contributions to this special issue have cautioned, methods with sufficiently high accuracy for application are still under development and may require extensive investments in generating training data sets. Thus, despite the promise and appeal of machine learning approaches, certain problems may not be amenable either because of difficulty in refining the underlying model or because the data needed for appropriate training sets are not available or not easily acquired. We hope that the papers presented in this collection encourage further progress on the emerging applications of machine learning to plant biology.

中文翻译：

植物与机器相遇：植物生物学机器学习的前景

机器学习方法正在影响现代社会的各个方面，从手机上的自动更正应用到自动驾驶汽车再到面部识别，个性化医学和精准农业。尽管机器学习的历史由来已久，但是最近在这些应用领域中的巨大进步是由计算基础架构的改进所推动的。计算能力增强；增强了收集，管理和存储大量数据的能力；和算法的进步。已经开发了多种类型的机器学习，每种类型都有自己的技术，强项和弱项，使得某些方法比其他方法更适合某些问题。

监督机器学习和神经网络的使用（例如，深度学习；表1）是机器学习对许多生物学问题（包括植物科学中一系列科学问题的生物学问题）最近加速应用的基础。例如，深度学习技术最近在各种预测任务上取得了令人印象深刻的性能，例如物种识别（Unger等人，2016 ; Carranza-Rojas等人，2017），植物物种分布模型（例如Zhang和Li），2017 ; Botella等，2018），杂草检测（Yu等，2019），以及对标本室标本的汞破坏（Schuettpelz等，2017））。它们还被用于比较基因组学问题（例如Xu和Jackson，2019年）和基因表达问题（Mochida等人，2018年）以及进行高通量表型分析（例如Singh等人，2016年; Ubbens和Stavness），2017）用于农业和生态研究。此外，通过应用到现在可从iDigBio（http://www.idigbio.org）获得的超过3,000万张植物标本室标本图像上，新颖的方法有望彻底改变植物物候学（例如Pearson等人，2020年）和功能性状的研究。）以及其他数字存储库。

表1.与机器学习相关的术语表。

术语	定义
人工神经网络	一种机器学习算法，其计算模型（宽松地）是由生物神经网络驱动的。
深度学习	人工神经网络的使用由多层神经元组成。
监督学习	一种机器学习，其中使用标记的训练示例来拟合模型。
无监督学习	一种不标记数据样本的机器学习。无监督学习的目的是发现数据中的潜在结构。
聚类	一种无监督学习，其目的是将数据划分为由相似样本组成的组。
分类	一种监督学习，其目的是将样本识别（即分类）为几种已知类别之一。
卷积神经网络	一种人工神经网络（如果网络由许多层组成，则为深度学习网络），其中在分析过程中利用输入数据（例如，图像中的像素）的空间排列。

机器学习方法从标本室标本中提取数据的应用在短短几年内就得到了增长和多样化，首先是在特定地理区域进行物种识别（例如，Unger等人，2016年）。随后尝试使用深度学习来解决在大量标本集标本中识别物种的困难分类学任务，结果表明，在数以千计的数字化标本馆表上训练的卷积神经网络能够学习高度区分性的模式（例如Carranza-Rojas等，2017年）。这些结果对于以全自动方式提取各种准确的注释非常有希望。此类方法也正在用于鉴定植物的表相（即芽，花，果实），这对于评估气候变化对植物生长和繁殖的影响以及将植物与传粉媒介，候鸟和昆虫的反应进行比较非常重要。其他依靠植物为食物和/或筑巢地点的物种（参见，例如Lorieul等人，2019年; Pearson等人，2020年; Brenskelle等人，2020年;Goëau等人，2020年）。同样，还可以从植物标本室的标本中对其他进化或生态特征（例如叶片形状和大小，叶片边缘和花朵颜色）进行评分。然而，尽管有望对植物标本室图像进行深度学习以解决一系列问题，但这一新兴领域也提出了具有挑战性的方法论问题，即在分析生成的数据时如何避免任何偏见和误导性结论。实际上，对于任何统计学习方法，卷积神经网络都对偏见问题敏感，包括构建训练数据集的方式。此外，根据平均水平的预测，所产生注释的质量可能会从一个样本到另一个样本非常不同，具体取决于各种因素，例如物种的形态，标本保存的储存条件以及标本成像时的年龄。在机遇与挑战并存的情况下，需要对机器学习方法在植物标本室图像中的应用进行更多的研究，以使之更广泛地应用于广泛的科学问题。

随着替代方法的发展，机器学习领域正在快速发展，这些替代方法可能最适合特定问题，数据源和分析技术。该植物科学应用中的文章的特殊集合提供了16篇论文，分别发表在该期刊的两期中，探讨了机器学习在植物生态学，形态学，基因组学和农业研究中的方法和应用。第一期包括八篇论文，重点是对植物标本室的图像的应用，涉及从物候到食草的话题。第二期包括涉及更广泛主题，数据和生物学规模的论文。我们在这里总结两个问题的内容。

近年来，通过使用植物标本室标本，植物物候学研究取得了重大进展（Willis等，2017）。在过去的三个世纪中收集的植物标本室标本提供了全球以及整个植物系统发育的开花，叶期和果实时间的见识（Davis等，2015）。但是，主要障碍是要利用植物标本室的全部功能进行物候研究，需要对生殖结构进行计数，这可能很耗时。因此，自动识别植物标本室上的生殖结构是当前物候研究的关键目标（Lorieul等人，2019年; Pearson等人，2020年）。本期特刊中的两篇论文涉及植物生殖物候学。为了利用大量的植物标本室标本检查被子植物的生殖物候，Goëau等人。（2020年）应用了最先进的分割方法（mask R-CNN）来自动定位，分割和计数在白色链霉菌（Brassicaceae）的植物标本室标本上的生殖结构。候阶段（即，芽，花，未成熟果实，成熟水果）是在不同的S. tortuosus，并对标本进行表位评分。对方法性能的评估表明，它在确定生殖结构的数量方面显示出了特别的希望（准确性接近80％），但是结果的准确性随训练注释，评分的生殖结构类型和生殖结构的大小。尽管有希望，但这些结果表明还需要进一步完善，目前尚不清楚该方法将如何扩展到具有不同花形和也许分化程度较低的表相的其他物种。

然而，要训练机器学习算法来做到这一点，将需要大量的输入数据来消耗大量数据的机器学习算法。在这个问题上，Brenskelle等。（2020年）评估志愿者帮助收集这些数据所需的条件。作者测试了培训类型（亲自或在线），职业阶段，植物分类和物候阶段得分对标本室志愿者提供的物候数据准确性的影响。不管专业知识和培训方法如何，用户提供的数据都非常准确，尽管亲自培训的人的数据比在线培训的人更准确。这项研究为收集注释数据提供了最佳实践指南。重要的是，作者还证明了在线公民科学平台可能能够提供准确的注释数据，然后可以在下游使用这些注释数据来训练机器学习算法以识别物候阶段。

形态变化以及标本室标本质量的变化会导致标本图像中字符的自动编码产生噪声和潜在偏差。图像分割是一种计算机视觉算法，可将具有相似属性的图像像素组合在一起，并为图像中的每个焦点对象（例如标本室标本图像中的花朵）生成遮罩。遮罩的应用（例如应用于植物表相的遮罩）可以帮助减少噪音和偏差。怀特等。（2020年）开发了一个工作流程，以使用深度学习将分割蒙版应用于植物图像。他们着眼于蕨类植物，生成了一个模型，该模型可以自动，高效和准确地分割该进化枝的形态多样性。尽管他们的研究仅限于蕨类植物，但该工作流程可推广到所有植物标本室图像，并且经过修改后，可能适用于其他形态各异的植物。

植物和昆虫已经相互作用了4亿年，这些相互作用很可能推动了两个进化枝的多样化。化石记录显示出食草动物的证据，使人们对植物与草食动物之间相互作用和进化的长期模式有了一窥。但是，食草动物如何在较短的时间范围和地理范围内变化尚不清楚。尽管植物学家通常试图收集没有草食动物危害的标本，但标本馆标本提供了过去三，四个世纪植物与草食动物之间相互作用的观点，并具有推断食草动物的时空分布特征的潜力，包括对食草动物的反应。气候变化（Meineke and Davies，2018 ; Meineke et al。，2018）。但是，对昆虫标本室标本的人工评分非常费力，并且应用机器学习来量化昆虫对植物标本的损害程度和程度的可能性很有吸引力。Meineke等。（2020年）发起了机器学习方法，以探索它们对一对不同植物物种中的多种食草类型（及其缺失）进行分类的能力。尽管无法始终以很高的精度对草食动物进行分类，但是使用手绘框定位潜在的草食动物的区域，使草食动物分类的准确性提高了81.5％。作者进一步确定了在将来的应用中扩展模型准确性的方法，从而为探索与气候变化，入侵物种等相关的食草模式铺平了道路。

机器学习对植物科学的贡献，特别是对从数字化标本馆标本图像进行物种自动识别的贡献，显示出巨大的希望（Schuettpelz et al。，2017 ;WäldchenandMäder，2018）。对于物种间形态变化很小的属，尤其如此，尤其是通过杂交和亚种下分类群的存在而复合时。Pryer等。（2020）在Equisetum L.的基础上开展了这项工作，Equisetum L.是一个独特的属，有15种现存物种，并伴随着形态可塑性和频繁的杂交事件，导致误判的标本室标本数量过多。木贼包括两个相对不同的物种（E. hyemale L.和E. laevigatum A. Braun）以及它们之间广泛分布的性不育杂种（E.×ferrissii Clute）（Rutz和Farrar，1984 ; Des Marais等，2003））。此处面临的挑战来自茎的圆柱性质，由于诸如扁平茎的几何形状，单个薄片中包含的茎数量，茎颜色和成像参数等因素，导致标本图像的巨大差异。使图像之间的变化更加复杂的事实是，准确识别与茎节和可塑性的外观比其他特征更多。通过对多个模型的连续测试，Prayer及其同事发现，在30张测试图像中，有27张被正确分类。虽然样本数量，可能是太小，无法广泛普及，E. hyemale图像进行正确分类的10 9例，E.×ferrissii图像八10例，E. laevigatum图像从未混淆，因此准确度达到90％。这些结果表明，机器学习对精确确定紧密相似的分类单元有很大的潜力。

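In their contribution, Ott et al. (2020) outline the development and output of GinJinn, object detection software designed to extract leaf images from herbarium specimens. It is based on the TensorFlow (Abadi et al., 2016) object detection application programming interface (API) and is intended to make supervised deep learning object detection accessible to plant scientists. Although GinJinn makes heavy use of TensorFlow's API, the authors maintain that GinJinn is more than a wrapper for that API and includes many additional features: data preprocessing, project setup, download of pretrained models, simple model export, and the use of trained networks to extract bounding boxes from newly acquired data. GinJinn was tested on a data set of 286 JPEG images provided by the herbarium of the Botanic Garden and Botanical Museum Berlin, Germany. Images were annotated using the free, open-source tool LabelImg version 1.8.1 (https://github.com/tzutalin/labelImg). Leucanthemum Mill. (diploid L. vulgare Lam. and tetraploid L. ircutianum DC.) is known for high variability in leaf shape. In these species, the rarity of intact leaves among non-intact leaves complicates the task. Using 183 specimens as the training data set, the GinJinn pipeline extracted one or more intact leaves from 95% of 61 test images.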

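A major challenge in cataloging and describing plant diversity lies in developing high-throughput technologies that facilitate the rapid discovery of new taxa hidden among herbarium specimens still awaiting processing. Naming and classifying the 400,000 currently known plant species has taken more than 250 years, and as many as 70,000 flowering plant species remain undiscovered (Joppa et al., 2011). Many of these are likely among the estimated backlog of one million specimens currently in herbaria. From the perspective of Little et al. (2020), this makes herbaria a largely untapped resource for the new and rapidly developing use of artificial intelligence (AI) in taxonomic research (Wäldchen and Mäder, 2018). To harness this enthusiasm and to encourage a growing number of AI experts to focus on algorithms that can yield species identifications, these authors set up a competition on the Kaggle platform to crowdsource effective machine learning algorithms for analyzing herbarium specimen images. The competition data set comprised 46,469 images representing 683 species of Melastomataceae (Tan et al., 2019). In just two months, 254 models were developed that automatically identify taxa from these digital representations, and the top four models identified specimens to species with accuracy >88%.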

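Extracting traits from herbarium specimens can be laborious and time-consuming, which makes the process an ideal candidate for high-throughput machine learning protocols and algorithms. Here, Weaver et al. (2020) describe and test LeafMachine, an automated, open-source software tool for identifying and measuring leaf dimensions from images of herbarium specimens and of single leaves across a range of woody taxa (trees, shrubs, lianas), although some herbaceous taxa are also included. Tests showed results that varied with image resolution, specimen presentation, leaf condition, and the presence or absence of clumped leaves. The evaluation confirmed that, of about 1000 images containing measurable leaves, LeafMachine produced morphometric information for at least one leaf in 82.0% of high-resolution images and 60.8% of low-resolution images. This is a positive result for researchers, but one that will need to be improved as machine learning technology develops.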

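The second set of papers explores a broad range of topics, beginning with the application of machine learning methods to agriculture. Using herbicides to control weeds in agriculture is costly both economically and environmentally, and alternatives are needed, particularly for organic farming. Possible solutions include the targeted, precise application of small doses of herbicide to weeds by robotic detection and application systems, and non-herbicide removal methods such as electrocution. However, such approaches to precision agriculture require highly accurate methods for detecting and identifying weeds in agricultural fields. Champ et al. (2020) applied an instance segmentation convolutional neural network to robot-generated images of agricultural plots to detect individual plants and then identify them as crop or weed. Using this Mask R-CNN approach, the authors correctly identified individual maize and bean crop plants with average precision values of 0.85 and 0.59, respectively. Weeds were generally more difficult to identify, with average precision values as high as 0.73 for black mustard (Brassica nigra (L.) W. D. J. Koch) but less than 0.5 for the other weeds studied. Using these detections, up to 60% of weeds could be removed, and plant centroids were localized more precisely than with other bounding-box methods. Refining the model to account for plant species, plant size, plant position, and possible interactions between crops and weeds could improve accuracy, making automated weed removal easier and reducing the likelihood of confusion with crops.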
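
For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a single class from a ranked list of detections. It is an illustration under stated assumptions: the text does not specify the evaluation protocol used by Champ et al. (2020) (for example, the overlap threshold or any interpolation), and the detection scores and the function name `average_precision` are hypothetical.

```python
from typing import List, Tuple


def average_precision(
    detections: List[Tuple[float, bool]], n_ground_truth: int
) -> float:
    """Non-interpolated average precision for one class.

    `detections` holds (confidence score, is_true_positive) pairs;
    `n_ground_truth` is the number of real objects of that class.
    Precision is evaluated at the rank of each true positive, and the
    values are averaged over all ground-truth objects, so missed objects
    lower the score.
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_ground_truth


# Hypothetical detections for one weed class: four predictions, four real plants.
example = [(0.9, True), (0.8, False), (0.7, True), (0.4, True)]
ap = average_precision(example, n_ground_truth=4)
```

A value near 1 means that most real plants are found and that confident predictions are rarely wrong; values below 0.5, as reported for several weeds, indicate many missed plants, many false detections, or both.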

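Plant-insect interactions are integral to biodiversity (Forister et al., 2015) and are highly significant for agricultural productivity (Sharma, 2014) and ecosystem function (Kurz et al., 2008). As a result, quantifying plant traits associated with insect resistance is of broad interest across the natural sciences. One defense against insect herbivores is trichomes, fine hairs that mechanically deter insect herbivores from feeding, ovipositing, and moving. As with many such leaf traits, counting enough trichomes to address a given research hypothesis can be an arduous task. Mirnezami et al. (2020) make progress toward automated quantification of trichome density by capturing leaf images, making the leaves transparent through a clearing process, and applying novel semi-automated and automated methods for counting trichomes. They then compare the results of these novel methods with manual counts and determine that the most accurate novel method is semi-automated (requiring user input) and is up to 90% accurate in estimating trichome density on the leaf surface. Although fully automated trichome counting has not yet been achieved, the study is an important and thoroughly described step toward automated phenotyping of plant defensive traits.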

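Given the ability to estimate plant traits automatically, the next question is whether the extracted traits can be used reliably for plant species identification. Moreover, can machine learning methods be used to determine which traits are most informative for species identification? Almeida et al. (2020) investigated the use of decision trees for plant identification from trait databases and identified the informative traits that best distinguish species. Using the TRY plant trait database (Kattge et al., 2011, 2020) and a set of species spanning trees, herbs, grasses, and other groups, they were able to identify plant species correctly with cross-validation accuracy of up to 90%. Traits such as leaf shape, fruit type, and flower color were found to be the most useful. As more plant trait data are collected (including through automated methods such as those described above), approaches of the type presented in this paper can be used to guide and inform the data collection process.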
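
Decision trees choose each split by scoring how well a trait separates the classes; information gain is one widely used score. The sketch below ranks traits by information gain on a toy table. It is only an illustration: the records, trait names, and species labels are hypothetical, and the text does not state which splitting criterion Almeida et al. (2020) used.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Dict, List

Record = Dict[str, str]


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(records: List[Record], trait: str, target: str = "species") -> float:
    """Reduction in label entropy obtained by splitting the records on `trait`."""
    base = entropy([r[target] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[r[trait]].append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical trait table: leaf shape separates the two species, flower color does not.
records = [
    {"species": "sp_A", "leaf_shape": "ovate", "flower_color": "white"},
    {"species": "sp_A", "leaf_shape": "ovate", "flower_color": "yellow"},
    {"species": "sp_B", "leaf_shape": "linear", "flower_color": "white"},
    {"species": "sp_B", "leaf_shape": "linear", "flower_color": "yellow"},
]

gains = {t: information_gain(records, t) for t in ("leaf_shape", "flower_color")}
most_informative = max(gains, key=gains.get)
```

Ranking traits this way also suggests which traits are worth collecting first when trait data are incomplete.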

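Acquiring high-resolution images of plant root architecture for downstream analysis and machine learning algorithms has proven to be a difficult endeavor. Most current methods fall into one of four groups. Some use techniques that are destructive to root structure (e.g., Trachsel et al., 2011). Some involve ex situ imaging under controlled conditions, typically using aboveground rhizotrons (chambers with windows onto the soil in which the plants are grown). Some incorporate invasive methods in which cameras are inserted into the ground (Johnson et al., 2001), sometimes via soil coring (Wu et al., 2018), and these tend to disturb the soil and roots. Others use non-invasive methods, such as ground-penetrating radar for the roots of trees and woody plants greater than 1 cm in diameter, or X-ray computed tomography (Tabb et al., 2018) or magnetic resonance imaging (Pflugfelder et al., 2017) for potted plants with finer roots. Ruiz-Munoz et al. (2020) report experiments to enhance the resolution of these images by applying two state-of-the-art deep learning approaches: the fast super-resolution convolutional neural network (FSRCNN; Dong et al., 2016) and the super-resolution generative adversarial network (SRGAN). Their approach aims to estimate high-resolution outputs from low-resolution images in order to expose details not explicitly rendered by the sensing device. The results of these evaluations show that these super-resolution models outperform basic bicubic interpolation, even when trained on non-root data sets.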

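Supervised machine learning methods are the most commonly used in applications to plant science. Typically, machine learning methods are used to automate, or to reduce the effort and time needed for, tasks traditionally completed manually by researchers. These tasks are well suited to supervised approaches. However, machine learning methods also provide mechanisms for data mining and for unsupervised exploration of collected data. Saryan et al. (2020) investigate and propose the use of unsupervised spectral clustering as an aid in discovering species boundaries. The authors determine (in comparison with principal component analysis and non-metric multidimensional scaling) that interactive spectral clustering can improve delimitation and understanding for some problems and data sets.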

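Text recognition and mining are useful in many applications, including automated processing of specimen labels and search indexing. Automated recognition of Latin scientific names can therefore be particularly useful in some applications. Little (2020) investigates and develops an open-source, browser-executable method for recognizing Latin scientific names using artificial neural networks. The method relies on an ensemble of networks that can identify Latin scientific names in text written in multiple languages (e.g., Chinese, French, German, Japanese) with high recall and precision and at a competitive speed of 8.6 ms/word.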

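Plant genomes are often large and complex, with multigene families and abundant repetitive sequences. With more than 200 plant genomes now published (Chen et al., 2018) and many more in progress, and with genomic and transcriptomic resources available for thousands of other plant species (e.g., Matasci et al., 2014; Leebens-Mack et al., 2019), data are now available for comparative analyses of plant genomes across phylogenetic scales. Although current methods for identifying genic regions are quite successful, tools for inferring gene function and other attributes of plant genomes need further refinement. Machine learning approaches are being applied to a range of problems in plant genomics, and Mahood et al. (2020) review the prospects for these approaches. They focus on supervised machine learning for predicting gene function from sequence information and post-genomic data. Because gene function in plants may vary in space and time and may affect phenotype directly or indirectly, functional prediction involves a combination of analyses addressing genome structure, gene expression patterns, and protein-protein interactions. The authors review machine learning approaches to these problems, as well as to problems aimed at integrating information across molecular and biological scales. In addition to presenting these approaches, the authors identify current obstacles to more effective models and propose possible solutions.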

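As described above, many machine learning methods have been developed for visual imagery and text. Increasingly, methods are being developed and adapted for non-visual imagery, such as X-ray computed tomography, ground-penetrating radar, and hyperspectral imagery (Zare and Ho, 2013; Rogers et al., 2016; Travassos et al., 2018). Théroux-Rancourt et al. (2020) developed methods for three-dimensional segmentation and characterization of internal leaf anatomy using X-ray micro-computed tomography. The approach outlined by the authors uses a small number of hand-segmented image slices to automatically segment more than 1000 scans with accuracy exceeding 90%. The approach focuses on segmenting scans of grapevine leaves with minimal manual labeling, but it highlights the possibility of applying machine learning methods to the automated analysis of a wide variety of data and image types.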

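The application of machine learning to problems in plant biology is still in its infancy, but the promise of these approaches for addressing a broad range of questions is clear. From genomic tools to measures of plant morphology, growth, and development, and from assessing interactions of plants with herbivores and their widely varying ecological contexts to agriculture, new approaches involving machine learning have the potential to transform how we study plants and even the questions we can ask. Through new machine learning approaches, further integration across fields spanning subcellular to ecosystem scales will promote new discoveries in plant biology. However, as the contributions to this special issue remind us, methods with accuracy high enough for application are still in development and may require substantial investment to generate training data sets. Thus, despite the promise and appeal of machine learning approaches, some problems may remain intractable because of difficulty in refining the underlying models or because the data needed for suitable training sets are unavailable or not easily acquired. We hope that the papers presented in this collection encourage further progress in the emerging application of machine learning to plant biology.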

Updated: 2020-07-24
