Automated classification of fauna in seabed photographs: The impact of training and validation dataset size, with considerations for the class imbalance
Progress in Oceanography (IF 3.8), Pub Date: 2021-05-20, DOI: 10.1016/j.pocean.2021.102612
Jennifer M. Durden, Brett Hosking, Brian J. Bett, Danelle Cline, Henry A. Ruhl

Machine learning is rapidly developing as a tool for gathering data from imagery and may be useful in identifying (classifying) visible specimens in large numbers of seabed photographs. Application of an automated classification workflow requires manually identified specimens to be supplied for training and validating the model. These training and validation datasets are generally generated by partitioning the available manually identified specimens; typical ratios of training to validation dataset sizes are 75:25 or 80:20. However, this approach does not facilitate the desired scalability, which would require models to successfully classify specimens in hundreds of thousands to millions of images after training on a relatively small subset of manually identified specimens. A second problem is related to the ‘class imbalance’, where natural community structure means that fewer specimens of rare morphotypes are available for model training. We investigated the impact of independent variation of the training and validation dataset sizes on the performance of a convolutional neural network classifier on benthic invertebrates visible in a very large set of seabed photographs captured by an autonomous underwater vehicle at the Porcupine Abyssal Plain Sustained Observatory. We tested the impact of increasing training dataset size on specimen classification in a single validation dataset, and then tested the impact of increasing validation set size, evaluating ecological metrics in addition to computer vision metrics. Computer vision metrics (recall, precision, F1-score) indicated that classification improved with increasing training dataset size. In terms of ecological metrics, the number of morphotypes recorded increased, while diversity decreased with increasing training dataset size. Variation and bias in diversity metrics decreased with increasing training dataset size. Multivariate dispersion in apparent community composition was reduced, and bias from expert-derived data declined with increasing training dataset size. In contrast, classification success and resulting ecological metrics did not differ significantly with varying validation dataset sizes. Thus, the selection of an appropriate training dataset size is key to ensuring robust automated classifications of benthic invertebrates in seabed photographs, in terms of ecological results, and validation may be conducted on a comparatively small dataset with confidence that similar results will be obtained in a larger production dataset. In addition, our results suggest that automated classification of less common morphotypes may be feasible, provided that the overall training dataset size is sufficiently large. Thus, tactics for reducing class imbalance in the training dataset may produce improvements in the resulting ecological metrics.
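The two families of metrics named in the abstract can be illustrated with a small sketch. It is not the authors' code: the specimen labels are invented, the helper names `per_class_metrics` and `shannon_diversity` are hypothetical, and the Shannon index is only one possible diversity metric (the abstract does not say which ones were used). It computes per-morphotype precision, recall and F1-score from expert versus classifier labels, then compares morphotype count and diversity between the two.

```python
import math
from collections import Counter


def per_class_metrics(expert, predicted):
    """Return {label: (precision, recall, f1)} for every label in either list."""
    labels = sorted(set(expert) | set(predicted))
    pairs = list(zip(expert, predicted))
    metrics = {}
    for label in labels:
        tp = sum(1 for e, p in pairs if e == label and p == label)
        fp = sum(1 for e, p in pairs if e != label and p == label)
        fn = sum(1 for e, p in pairs if e == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def shannon_diversity(labels):
    """Shannon index H' = -sum(p_i * ln p_i) over morphotype proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Invented example: an imbalanced community of 10 specimens.
# The classifier assigns the rare morphotype and one holothurian
# to the dominant class.
expert = ["ophiuroid"] * 6 + ["holothurian"] * 3 + ["anemone"] * 1
predicted = ["ophiuroid"] * 6 + ["holothurian"] * 2 + ["ophiuroid"] * 2

metrics = per_class_metrics(expert, predicted)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Ecological view of the same predictions: morphotype count and
# diversity, with bias measured relative to the expert-derived data.
richness_expert = len(set(expert))
richness_predicted = len(set(predicted))
diversity_bias = shannon_diversity(predicted) - shannon_diversity(expert)
print(f"morphotypes: expert={richness_expert} predicted={richness_predicted}")
print(f"Shannon diversity bias: {diversity_bias:.3f}")
```

In this toy case the dominant morphotype still scores well on the computer vision metrics, while the rare morphotype disappears from the predicted community, which is why the ecological metrics are worth evaluating alongside them.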




Updated: 2021-05-30