Bayesian approach for sample size determination, illustrated with Soil Health Card data of Andhra Pradesh (India)
Geoderma (IF 5.6), Pub Date: 2021-09-09, DOI: 10.1016/j.geoderma.2021.115396
D J Brus 1, B Kempen 2, D Rossiter 2,3, Balwinder-Singh 4, A J McDonald 3

A crucial decision in designing a spatial sample for soil survey is the number of sampling locations required to answer, with sufficient accuracy and precision, the questions posed by decision makers at different levels of geographic aggregation. In the Indian Soil Health Card (SHC) scheme, many thousands of locations are sampled per district. In this paper the SHC data are used to estimate the mean of a soil property within a defined study area, e.g., a district, or the areal fraction of the study area where some condition is satisfied, e.g., exceedance of a critical level. The central question is whether this large sample size is needed for this aim. The sample size required for a given maximum length of a confidence interval can be computed with formulas from classical sampling theory, using a prior estimate of the variance of the property of interest within the study area. Similarly, for the areal fraction a prior estimate of this fraction is required. In practice, we are uncertain about these prior estimates, and our uncertainty is not accounted for in classical sample size determination (SSD). This deficiency can be overcome with a Bayesian approach, in which the prior estimate of the variance or areal fraction is replaced by a prior distribution. Once new data from the sample are available, this prior distribution is updated to a posterior distribution using Bayes' rule. The apparent problem with a Bayesian approach prior to a sampling campaign is that the data are not yet available. This dilemma can be solved by computing, for a given sample size, the predictive distribution of the data, given a prior distribution on the population and design parameter. Thus, we do not have a single vector with data values, but a finite or infinite set of possible data vectors. As a consequence, we have as many posterior distribution functions as we have data vectors.
This leads to a probability distribution of lengths or coverages of Bayesian credible intervals, from which various criteria for SSD can be derived. Besides the fully Bayesian approach, a mixed Bayesian-likelihood approach for SSD is available. This is of interest when, after the data have been collected, we prefer to estimate the mean from these data only, using the frequentist approach and ignoring the prior distribution. The fully Bayesian and mixed Bayesian-likelihood approaches are illustrated for estimating the mean of log-transformed Zn and the areal fraction with Zn deficiency, defined as a Zn concentration < 0.9 mg kg⁻¹, in the thirteen districts of Andhra Pradesh state. The SHC data from 2015–2017 are used to derive prior distributions. For all districts the Bayesian and mixed Bayesian-likelihood sample sizes are much smaller than the current sample sizes. The hyperparameters of the prior distributions have a strong effect on the sample sizes. We discuss methods to deal with this. Even at the mandal (sub-district) level the sample size can almost always be reduced substantially. Clearly, the SHC scheme over-sampled, and here we show how to reduce the effort while still providing the information required for decision-making. R scripts for SSD are provided as supplementary material.
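To make the contrast between the classical and the prior-averaged calculation concrete, the following Python sketch may help. It is not the authors' R code and does not reproduce their criteria. All function names are hypothetical, the inverse-gamma prior on the variance and its hyperparameters `a` and `b` are assumptions chosen for illustration, and a normal quantile is used in place of a t quantile for simplicity. The first function applies the classical formula with a single prior estimate of the standard deviation. The second averages the confidence-interval length over variances drawn from the prior and over sample variances simulated for a given sample size, in the spirit of the mixed Bayesian-likelihood idea.

```python
import math
import random
from statistics import NormalDist, fmean


def classical_n(sd_prior, max_length, alpha=0.05):
    """Classical sample size: smallest n for which the normal-theory
    confidence interval for the mean is no longer than max_length,
    given a single prior estimate of the standard deviation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((2 * z * sd_prior / max_length) ** 2)


def average_length(n, a, b, alpha=0.05, draws=2000, seed=0):
    """Monte Carlo average of the interval length for sample size n.

    Assumption (illustrative only): the population variance has an
    inverse-gamma(a, b) prior. For each draw, a sample variance is
    simulated from its sampling distribution under normality."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lengths = []
    for _ in range(draws):
        # Inverse-gamma draw: reciprocal of a gamma with shape a, scale 1/b.
        sigma2 = 1.0 / rng.gammavariate(a, 1.0 / b)
        # Chi-square with n-1 degrees of freedom is gamma((n-1)/2, scale 2).
        s2 = sigma2 * rng.gammavariate((n - 1) / 2, 2.0) / (n - 1)
        lengths.append(2 * z * math.sqrt(s2 / n))
    return fmean(lengths)


def smallest_n(max_length, a, b, n_max=2000, **kwargs):
    """Smallest n whose average interval length is at most max_length."""
    for n in range(2, n_max + 1):
        if average_length(n, a, b, **kwargs) <= max_length:
            return n
    return None  # no n up to n_max satisfies the criterion


if __name__ == "__main__":
    print(classical_n(sd_prior=0.5, max_length=0.2))
    print(smallest_n(max_length=0.2, a=10, b=2.25))
```

With a prior standard deviation of 0.5 and a maximum interval length of 0.2, `classical_n` returns 97. The prior-averaged result depends on the assumed hyperparameters, which is consistent with the abstract's remark that the hyperparameters have a strong effect on the sample sizes.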




Updated: 2021-09-10