Techniques to Improve Ecological Interpretability of Black-Box Machine Learning Models
Journal of Agricultural, Biological and Environmental Statistics ( IF 1.4 ) Pub Date : 2021-10-28 , DOI: 10.1007/s13253-021-00479-7
Thomas Welchowski, Kelly O. Maloney, Richard Mitchell, Matthias Schmidt

Statistical modeling of ecological data is often faced with a large number of variables as well as possible nonlinear relationships and higher-order interaction effects. Gradient boosted trees (GBT) have been successful in addressing these issues and have shown a good predictive performance in modeling nonlinear relationships, in particular in classification settings with a categorical response variable. They also tend to be robust against outliers. However, their black-box nature makes it difficult to interpret these models. We introduce several recently developed statistical tools to the environmental research community in order to advance interpretation of these black-box models. To analyze the properties of the tools, we applied gradient boosted trees to investigate biological health of streams within the contiguous USA, as measured by a benthic macroinvertebrate biotic index. Based on these data and a simulation study, we demonstrate the advantages and limitations of partial dependence plots (PDP), individual conditional expectation (ICE) curves and accumulated local effects (ALE) in their ability to identify covariate–response relationships. Additionally, interaction effects were quantified according to interaction strength (IAS) and Friedman’s \(H^2\) statistic. Interpretable machine learning techniques are useful tools to open the black-box of gradient boosted trees in the environmental sciences. This finding is supported by our case study on the effect of impervious surface on the benthic condition, which agrees with previous results in the literature. Overall, the most important variables were ecoregion, bed stability, watershed area, riparian vegetation and catchment slope. These variables were also present in most identified interaction effects. In conclusion, graphical tools (PDP, ICE, ALE) enable visualization and easier interpretation of GBT but should be supported by analytical statistical measures. 
Future methodological research is needed to investigate the properties of interaction tests. Supplementary materials accompanying this paper appear online.
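To make the graphical tools named in the abstract concrete, here is a minimal sketch of ICE curves and their pointwise mean, the PDP. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for the paper's GBT models, and a small synthetic dataset rather than the stream-health data from the study; the feature roles and grid are illustrative assumptions only.

```python
# Hedged sketch: ICE curves and the PDP for a gradient boosted tree.
# Synthetic data; not the benthic macroinvertebrate dataset of the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n = 500
X = rng.uniform(0.0, 1.0, size=(n, 3))  # three illustrative covariates
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.1, n)

gbt = GradientBoostingRegressor(random_state=0).fit(X, y)

def ice_curves(model, X, feature, grid):
    """One prediction curve per observation: vary `feature` over `grid`
    while holding that observation's remaining covariates fixed."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, feature] = v
        curves[:, j] = model.predict(Xv)
    return curves

grid = np.linspace(0.0, 1.0, 20)
ice = ice_curves(gbt, X, feature=0, grid=grid)
pdp = ice.mean(axis=0)  # the PDP is the pointwise mean of the ICE curves
```

Plotting the individual `ice` rows alongside `pdp` reproduces the standard ICE/PDP display; heterogeneous ICE curves around a flat PDP are the visual signature of the interaction effects the paper quantifies analytically.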
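The abstract also mentions Friedman's \(H^2\) statistic for quantifying interactions. The sketch below follows the standard definition (the share of a feature pair's joint partial-dependence variance not explained additively by the two one-way partial dependences); it is not the authors' code, and the model and data are illustrative assumptions.

```python
# Hedged sketch of Friedman's H^2 statistic for one feature pair,
# using a scikit-learn GBT on synthetic data with a built-in interaction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = X[:, 0] + X[:, 1] + 4.0 * X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, 200)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def centered_pd(model, X, features):
    """Partial dependence on `features`, evaluated at each observed point
    and centered to mean zero (as required by the H^2 definition)."""
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        Xv = X.copy()
        Xv[:, features] = X[i, features]
        out[i] = model.predict(Xv).mean()
    return out - out.mean()

pd_j = centered_pd(model, X, [0])
pd_k = centered_pd(model, X, [1])
pd_jk = centered_pd(model, X, [0, 1])

# H^2: variance of the joint effect left over after removing the additive parts
h2 = np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```

Values of `h2` near zero indicate a purely additive pair; here the simulated `4.0 * X[:, 0] * X[:, 1]` term produces a clearly nonzero statistic, mirroring how the paper flags interacting covariates such as ecoregion and bed stability.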



