Study becomes insight: Ecological learning from machine learning
Methods in Ecology and Evolution ( IF 6.3 ) Pub Date : 2021-07-28 , DOI: 10.1111/2041-210x.13686
Qiuyan Yu 1 , Wenjie Ji 1, 2 , Lara Prihodko 3 , C Wade Ross 4 , Julius Y Anchang 1 , Niall P Hanan 1

  1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the ‘learning’ hidden in the ML models.
  2. We revisit the theoretical background and effectiveness of four approaches for ranking independent-variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI) and two approaches for inferring bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multi-variate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables.
  3. We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately impacted by sample size. Removing spurious variables improves interpretation of ML models. Meanwhile, increasing sample size has limited value in the presence of spurious variables, but it does improve performance once spurious variables are omitted. Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors and the response variable can be enhanced using surrogate models, including three-dimensional visualizations and use of loess planes to represent independent variable effects and interactions.
  4. Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to ‘learn from machine learning’.
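The surrogate-model approach mentioned in point 3 can also be sketched briefly. The paper uses loess planes; here a quadratic response surface stands in as the interpretable surrogate (an assumption for illustration, not the authors' method). The surrogate is fitted to the black-box model's predictions, and its fit to those predictions measures how faithfully it summarizes the ML model:

```python
# Hedged sketch of a surrogate model: fit a simple, interpretable model
# (here a quadratic surface; the paper uses loess planes) to the predictions
# of a fitted ML model to summarize effects and interactions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
n = 800
X = rng.uniform(-1, 1, (n, 2))
# Illustrative response with a main effect and an interaction, plus noise
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, n)

rf = RandomForestRegressor(random_state=0).fit(X, y)

# Surrogate: quadratic surface fitted to the black box's predictions, not to y
surrogate = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surrogate.fit(X, rf.predict(X))

r2 = surrogate.score(X, rf.predict(X))  # fidelity of surrogate to the ML model
coefs = surrogate.named_steps["linearregression"].coef_
print("surrogate fidelity R^2:", round(r2, 3))
print("surface coefficients:", coefs)
```

The fitted coefficients (or, in the paper's version, the loess surface) give a low-dimensional, plottable summary of how predictors interact in the ML model, which is what enables the three-dimensional visualizations described above.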

