Skip to main content
Log in

Extending null scenarios with Faddy distributions in a probabilistic randomization protocol for presence-absence data

  • Published:
Environmental and Ecological Statistics Aims and scope Submit manuscript

Abstract

Navarro and Manly (Popul Ecol 51:505–512, 2009) (NM) have proposed a randomization protocol for null model analysis of species occurrences at discrete locations based on probability distributions and generalized linear models. In the NM method, presences-absences are governed by independent Bernoulli random variables. In addition, a non-observable non-negative random variable (“quasi-abundance”) from either Poisson, Binomial or Negative Binomial distributions are log-linearly related to the qualitative effects of species and location. By connecting the probability of occurrence of each species on each location and the quasi-abundance distributions, one generalized linear model for the observed presences-absences is selected by profile deviance, and the resulting fitted probabilities of the null model with minimum deviance is used to generate random matrices via parametric bootstrap. This work contributes with a unified theoretical formulation of the NM method, based on Faddy distributions, to allow general distributions of over-dispersed and under-dispersed discrete random variables. For a subset of the Faddy models, the log concave property of the inverse link function guarantees convergence to a global minimum deviance thus providing unique estimates for the linear parameters of the models. The method is illustrated using presence-absence data of island lizard communities. Interpretations of this combined GLM-parametric bootstrap protocol are discussed, highlighting the way fitted probabilities under the chosen null model are related to the row and column totals of the observed table. Additional properties of the probabilistic NM protocol, with possible avenues of future research, are also discussed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2

Similar content being viewed by others

Data Availability

All data and code generated or analysed during this study are included in this published article [and its supplementary information files].

References

  • Aranda-Ordaz FJ (1981) On two families of transformation to additivity for binary response data. Biometrika 68:357–363

    Article  Google Scholar 

  • Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 16:1190–1208. https://doi.org/10.1137/0916069

    Article  Google Scholar 

  • Case TJ (1983) Niche overlap and the assembly of island lizard communities. Oikos 41:427–433

    Article  Google Scholar 

  • Czado C, Munk A (2000) Noncanonical links in generalized linear models—when is the effort justified? J Stat Plan Inference 87:317–345

    Article  Google Scholar 

  • Diamond JM (1975) Assembly of species communities. In: Cody ML, Diamond JM (eds) Ecology and evolution of communities. Harvard University Press, Cambridge, pp 342–444

    Google Scholar 

  • Dobson AJ, Barnett AG (2018) An introduction to generalized linear models, 4th edn. Chapman and Hall/CRC, Boca Raton

    Google Scholar 

  • Donoghoe MW, Marschner IC (2018) logbin: an R package for relative risk regression using the log-binomial model. J Stat Softw 86(9):1–22

    Article  Google Scholar 

  • Faddy MJ (1997) Extended Poisson process modelling and analysis of count data. Biom Ournal 39(4):431–440

    Google Scholar 

  • Faddy MJ, Smith DM (2011) Analysis of count data with covariate dependence in both mean and variance. J Appl Stat 38(12):2683–2694

    Article  Google Scholar 

  • Faddy MJ, Smith DM (2012) Extended Poisson process modelling and analysis of grouped binary data. Biom J 54(3):426–435

    Article  Google Scholar 

  • Fletcher R (2000) Practical methods of optimization, 2nd edn. Wiley, Chichester

    Book  Google Scholar 

  • Gilpin ME, Diamond JM (1982) Factors contributing to non-randomness in species co-occurrences on islands. Oecologia 52:75–84

    Article  Google Scholar 

  • Gotelli NJ (2000) Null model analysis of species co-occurrence patterns. Ecology 81:2606–2621

    Article  Google Scholar 

  • Gotelli NJ, Ulrich W (2012) Statistical challenges in null model analyses. Oikos 121:171–180

    Article  Google Scholar 

  • Götzenberger L, de Bello F, Bråthen KA, Davison J, Dubuis A, Guisan A, Lepš J, Lindborg R, Moora M, Pärtel M, Pellissier L, Pottier J, Vittoz P, Zobel K, Zobel M (2012) Ecological assembly rules in plant communities-approaches, patterns and prospects. Biol Rev 82:111–127

    Article  Google Scholar 

  • Henningsen A, Toomet O (2011) maxLik: a package for maximum likelihood estimation in R. Comput Stat 26(3):443–458. https://doi.org/10.1007/s00180-010-0217-1

    Article  Google Scholar 

  • Huang X (2020) Improved wrong-model inference for generalized linear models for binary responses in the presence of link misspecification. Stat Methods Appl. https://doi.org/10.1007/s10260-020-00529-3

    Article  Google Scholar 

  • Kallio A (2016) Properties of fixed-fixed models and alternatives in presence-absence data analysis. PLoS ONE 11(11):e0165456. https://doi.org/10.1371/journal.pone.0165456

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Manly BFJ (1995) A note on the analysis of species co-occurrences. Ecology 76:1109–1115

    Article  Google Scholar 

  • Marschner IC (2014) Combinatorial EM algorithms. Stat Comput 24(6):921–940

    Article  Google Scholar 

  • McCullagh P, Nelder JA (1989) Generalized linear models. Chapman and Hall, London

    Book  Google Scholar 

  • Navarro J (2003) Generalized linear models and Monte Carlo methods in the analysis of species co-occurrences. Dissertation. University of Otago, Dunedin, New Zealand

  • Navarro J, Manly BFJ (2009) Null model analyses of presence-absence matrices need a definition of independence. Popul Ecol 51:505–512

    Article  Google Scholar 

  • Nychka D, Furrer R, Paige J, Sain S (2017). Fields: tools for spatial data. https://doi.org/10.5065/D6W957CT. R package version 11.6, https://github.com/NCAR/Fields

  • Peres-Neto PR, Olden J, Jackson DA (2001) Environmentally constrained null models: site suitability as occupancy criterion. Oikos 93:110–120

    Article  Google Scholar 

  • Pascual-García A, Tamames J, Bastolla U (2014) Bacteria dialog with Santa Rosalia: are aggregations of cosmopolitan bacteria mainly explained by habitat filtering or by ecological interactions? BMC Microbiol 14:284. https://doi.org/10.1186/s12866-014-0284-5

    Article  PubMed  PubMed Central  Google Scholar 

  • R Core Team (2021) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Roberts A, Stone L (1990) Island-sharing by archipelago species. Oecologia 83:560–567

    Article  Google Scholar 

  • Ryti RT, Gilpin ME (1987) The comparative analysis of species occurrence patterns on archipelagos. Oecologia 73:282–287. https://doi.org/10.1007/BF00377519

    Article  CAS  PubMed  Google Scholar 

  • Stone L, Roberts A (1990) The checkerboard score and species distributions. Oecologia 85:74–79. https://doi.org/10.1007/BF00317345

    Article  PubMed  Google Scholar 

  • Stone L, Roberts A (1992) Competitive exclusion, or species aggregation? An aid in deciding. Oecologia 91:419–424

    Article  Google Scholar 

  • Tang W, Ye Y (2020) The existence of maximum likelihood estimate in high-dimensional binary response generalized linear models. Electron J Stat 14:4028–4053

    Article  Google Scholar 

  • Tornero-Velez R, Egeghy P, Cohen Hubal E (2012) Biogeographical analysis of chemical co-occurrence data to identify priorities for mixtures research. Risk Anal 32(2):224–236

    Article  Google Scholar 

  • Williamson T, Eliasiw M, Fick GH (2013) Log-binomial models: exploring failed convergence. Emerg Themes Epidemiol 10:14

    Article  Google Scholar 

  • Wilson JB (1995) Null models for assembly rules: the Jack Horner effect is more insidious than the Narcissus effect. Oikos 72:139–144

    Article  Google Scholar 

  • Varadhan R (2015) Alabama: constrained nonlinear optimization. R package version 2015.3-1. https://CRAN.R-project.org/package=alabama. Accessed 25 Aug 2020

  • Veech JA (2013) A probabilistic model for analysing species co-occurrence. Glob Ecol Biogeogr 22:252–260

    Article  Google Scholar 

  • Whittam TS, Siegel-Causey D (1981) Species interactions and community structure in Alaskan seabird colonies. Ecology 62:1515–1524

    Article  Google Scholar 

Download references

Acknowledgements

We wish to thank an anonymous reviewer for suggestions that led to a remarkable improvement of the present work.

Funding

This study was carried out during Jorge Navarro’s sabbatical year in the Department of Mathematics and Statistics, University of Wyoming, under the sponsorship of the Consejo Nacional de Ciencia y Tecnología, Mexico (2018–000007-01EXTV-00185).

Author information

Authors and Affiliations

Authors

Contributions

JN developed the model extension based on a suggestion given by BFJM. JN also designed, programmed and tested the R code for estimation and randomization tests. BFJM and KG provided conceptual support. All authors contributed to writing.

Corresponding author

Correspondence to Jorge A. Navarro Alberto.

Ethics declarations

Conflict of interest

All authors have no conflict of interest to declare.

Additional information

Handling Editor: Luiz Duczmal.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (ZIP 1486 kb)

Appendices

Appendix

The basic null model is given in Eq. (1) in terms of the expected value of the unobserved quasi-abundance \(\mu_{kl} = E(N_{kl} )\), k = 1,.., t; l = 1,…,s:

$$ \mu_{kl} = \exp (\gamma + \alpha_{k} + \beta_{l} ),k = {1},..,t;l = {1}, \ldots ,s $$

where \({\alpha }_{1}={\beta }_{1}=0\). These unobserved means are connected to probabilities of observed Bernoulli-distributed outcomes \({Y}_{kl}\), \(E({Y}_{kl})={\pi }_{kl}\), using different generalized linear models. One collection of models defines link functions in terms of the diffusion approximation to the mean quasi-abundance for Faddy distributions with parameters b and c, as described in Sects. 3.2 (Eq. 20) and Sect. 3.5.1. Another generalized linear model based on Faddy distributions is given by Eq. 21. This “Appendix” focuses mostly on models given by Eq. 20 (indicated here as Eq. 30):

$$ \pi_{kl} = \left\{ {\begin{array}{ll} {1 - \exp \left\{ {\frac{b}{1 - c}\left[ {1 - \left( {1 + \frac{{\mu_{kl} }}{b}} \right)^{1 - c} } \right]} \right\},} & {c < 1} \\ {1 - \left( {\frac{b}{{b + \mu_{kl} }}} \right)^{b} ,} & {c = 1} \\ \end{array} } \right. $$
(30)

where b > 0. The estimation of each generalized linear model is performed by maximum likelihood, and a model-selection process using profile deviance is conducted for varying values of b and c, the deviance function given by Eq. 25.

The following lemma applies to any model under the NM protocol:

Lemma

If \(P(Y_{kl} = 0) = P(N_{kl} = 0);\quad P(Y_{kl} = 1) = P(N_{kl} \ge 1)\), then, for all probability mass functions on the non-negative integers \(N_{kl}\):

$${\pi}_{\text{kl}}\left({\mu}_{\text{kl}}\right){\le}{\mu}_{\text{kl}}.$$
(31)

Proof

$$ \mu _{{kl}} = E\left( {N_{{kl}} } \right) = \sum\limits_{{n_{{kl}} = 0}}^{\infty } {n_{{kl}} } P\left( {N_{{kl}} {\text{ }} = {\text{ }}n_{{kl}} } \right) \ge \sum\limits_{{n_{{kl}} = 1}}^{\infty } P \left( {N_{{kl}} {\text{ }} = {\text{ }}n_{{kl}} } \right) = P\left( {N_{{kl}} \ge 1} \right) = P\left( {Y_{{kl}} {\text{ }} = {\text{ }}1} \right){\text{ }} = {\text{ }}\pi _{{kl}} . $$

The equality in \({\pi}_{\text{kl}}\left({\mu}_{\text{kl}}\right){\le}{\mu}_{\text{kl}}\) is achieved when \({\text{P}}\left({\text{N}}_{\text{kl}}\text{ = 0}\right)\text{ = 1}-{ \pi}_{\text{kl}}\), \({\text{P}}\left({\text{N}}_{\text{kl}}\text{ = 1}\right)\text{=}{ \pi}_{\text{kl}}\), and \({\text{P}}\left({\text{N}}_{\text{kl}}\text{ > 1}\right)\text{ = 0}\).

The inequality (31) is essential for the appropriate application of the “Faddy inspired” models (30) and will constrain the set of competing models in the profile deviance.

Theorem

Under the NM protocol, the generalized linear models connecting the probabilities of the observed Bernoulli-distributed outcomes \(Y_{kl}\) with parameter \(E(Y_{kl} ) = \pi_{kl}\), and the quasi-abundance whose link functions are defined in terms of the diffusion approximation to the mean for Faddy distributions with parameters b and c, are restricted to the set

R = {b > 0, c ≤ 1, b ≥ ‒c}.

Proof

For (31) to be true, the second derivative of \({\pi}_{\text{kl}}\left({\mu}_{\text{kl}}\right)\) evaluated at \({\mu}_{\text{kl}}\)= 0 must be less than or equal to 0. And it turns out for Eq. (30) this is also a sufficient condition. Evaluating the derivative in Eq. (30) we have:

$${\left.\frac{d\pi ({\mu }_{kl})}{d{\mu }_{kl}}\right|}_{{\mu }_{kl}=0}={\left.\mathrm{exp}\left(\left(b\left(1-{\left({1+\mu }_{kl}/b\right)}^{1-c}\right)\right)/(1-c)\right){\left({1+\mu }_{kl}/b\right)}^{-c}\right|}_{{\mu }_{kl}=0}=1$$

In addition, \({\pi}_{\text{kl}}\left({\mu}_{\text{kl}}\right){\le}{\mu}_{\text{kl}}\) implies that \(\frac{{d}^{2}\pi ({\mu }_{kl})}{{d}^{2}{\mu }_{kl}} \le 0\) and

$${\left.\frac{{d}^{2}\pi ({\mu }_{kl})}{d{{\mu }_{kl}}^{2}}\right|}_{{\mu }_{kl}=0}={\left.-\frac{\mathrm{exp}\left(\left(b\left(1-{\left({1+\mu }_{kl}/b\right)}^{1-c}\right)\right)/(1-c)\right){\left(\frac{b+{\mu }_{kl}}{b}\right)}^{-2c}\left(c{\left(\frac{b+{\mu }_{kl}}{b}\right)}^{c}+b+{\mu }_{kl}\right)}{b+{\mu }_{kl}}\right|}_{{\mu }_{kl}=0}\le 0$$

Note that the last term in the numerator is the only term that can be negative. Therefore, the last inequality implies that

$$-(b+c)\le 0$$

For \(b>0\), and \(0\le c\le 1\), this last expression is always valid. However, the validity of this expression forces the inequality:

$$b\ge - c.$$

In particular, for c < 0, we must have: \(b>-c\).

The case b = ‒c in Eq. (30)

On the lower boundary of the valid region R, b = ‒c, c < 0, the Faddy model can be written as:

$$\pi \left({\mu }_{kl}\right)=1-\mathrm{exp}\left(1-\left(b/\left(1+b\right)\right){\left(1+{\mu }_{kl}/b\right)}^{1+b}\right)$$

In this case, the probability mass function of \({\text{N}}_{\text{kl}}\) is under-dispersed. The profile deviance of the lizard data suggested that the minimum deviance is possibly attained at this boundary as b goes to infinity. Thus, we were also interested in the “Faddy model” given by:

$$\pi \left({\mu }_{kl}\right)=\underset{b\to \infty }{\mathrm{lim}}\left(1-\mathrm{exp}\left(1-\left(b/\left(1+b\right)\right){\left(1+{\mu }_{kl}/b\right)}^{1+b}\right)\right)=1-\mathrm{exp}\left(1-\mathit{exp}\left({\mu }_{kl}\right)\right)$$
(32)

Putting model (32) to compete in the profile deviance for the lizard data, turns out to be the optimal model among all models of the form (30).

Log-concavity for models given by Eq. (30)

The proposed Faddy models (30) fitted by maximum likelihood have good properties in the permissible region R. For every data set and for every starting value in this region, the minimization procedures will converge to a global minimum. This statement can be proved by noticing that the “inverse links functions” defined by expression (30) for b and c in R, all have the property of log-concavity. Therefore, as far as maximum likelihood estimation is concerned, a subset of the link functions induced by Faddy models have the same good properties as the usual link functions for binary data (except for the exponential family properties). For parameters outside this region, convergence to the global minimum may or not may be assured; it will depend on the observed binary data and on the initial starting values. The model fitting process was also tried for the lizard data using Faddy models with b < ‒c and values of the deviances were computed but they were highlighted so that their interpretation should be made with caution. To prove the claim that the deviance function

$$D\left({\varvec{\Pi}}\text{,}{\text{y}}\right)\text{=}-2\left[\sum_{{y}_{kl}=0}\mathrm{log}{(1-\pi }_{kl})+\sum_{{y}_{kl}=1}\mathrm{log}{(\pi }_{kl})\right]$$

is concave when the inverse links functions \({\pi }_{kl}\) are defined by expression (30), we make use of a result given in Tang and Ye (2020) that the sum of concave functions is concave. Thus, a sufficient condition for the existence of a unique MLE estimator of the log-linear parameters is that \(\mathrm{log}({\pi }_{kl}\left({\mu }_{kl}\right))\) and \(\mathrm{log}(1-{\pi }_{kl}\left({\mu }_{kl}\right))\) be concave. This is equivalent to show that

\(\frac{{\partial }^{2}\mathrm{log}({\pi }_{kl}\left({\mu }_{kl}\right))}{\partial {{\mu }_{kl}}^{2}}\le 0\) and \(\frac{{\partial }^{2}\mathrm{log}({1-\pi }_{kl}\left({\mu }_{kl}\right))}{\partial {{\mu }_{kl}}^{2}}\le 0\)

It can be shown that, for c < 1 and b > 0:

$$\frac{{\partial }^{2}\mathrm{log}({1-\pi }_{kl}\left({\mu }_{kl}\right))}{\partial {{\mu }_{kl}}^{2}}=\frac{{e}^{{\mu }_{kl}}{\left(\frac{b+{e}^{{\mu }_{kl}}}{b}\right)}^{-c-1}\left(\left(c-1\right){e}^{{\mu }_{kl}}-b\right)}{b}\le 0$$

The case c = 1, b > 0 is analogous:

$$\frac{{\partial }^{2}\mathrm{log}({1-\pi }_{kl}\left({\mu }_{kl}\right))}{\partial {{\mu }_{kl}}^{2}}=-\frac{b(b+1){\left(\frac{b}{b+{\mu }_{kl}}\right)}^{b}}{{(b+{\mu }_{kl})}^{2}}\le 0$$

Thus, \(\mathrm{log}(1-{\pi }_{kl}\left({\mu }_{kl}\right))\) is log-concave. Showing log-concavity of \(\mathrm{log}({\pi }_{kl}\left({\mu }_{kl}\right))\) by looking at the sign of \(\frac{{\partial }^{2}\mathrm{log}({\pi }_{kl}\left({\mu }_{kl}\right))}{\partial {{\mu }_{kl}}^{2}}\) is more complex but the resulting expression can be inspected in a similar way as the previous case.

$$ \frac{{\partial ^{2} {\text{log}}(\pi _{{kl}} \left( {\mu _{{kl}} } \right))}}{{\partial \mu _{{kl}} ^{2} }} = \frac{{e^{{\frac{{b\left( { - 1 + \left( {\frac{{b + e\mu _{{kl}} }}{b}} \right)^{{1 - c}} } \right)}}{{ - 1 + c}}}} + \mu _{{kl}} \left( {\frac{{b + e^{{\mu _{{kl}} }} }}{b}} \right)^{{ - 1 - 2c}} \left( { - e^{{\mu _{{kl}} }} \left( {b + e^{{\mu _{{kl}} }} } \right) + \left( { - 1 + e^{{\frac{{b\left( { - 1 + \left( {\frac{{b + e\mu _{{kl}} }}{b}} \right)^{{1 - c}} } \right)}}{{ - 1 + c}}}} } \right)\left( {\frac{{b + e^{{\mu _{{kl}} }} }}{b}} \right)^{c} ( - b + ( - 1 + c)e^{{\mu _{{{\text{kl}}}} }} } \right)}}{{b\left( { - 1 + e^{{\frac{{b\left( { - 1 + \left( {\frac{{b + e\mu _{{kl}} }}{b}} \right)^{{1 - c}} } \right)}}{{ - 1 + c}}}} } \right)}} $$

With the exception of the last term in the numerator, all terms are positive for c < 1 and b > 0. Thus, we need to demonstrate that:

$$ f = \left( { - e^{{\mu _{{kl}} }} \left( {b + e^{{\mu _{{kl}} }} } \right) + \left( { - 1 + e^{{\frac{{b\left( { - 1 + \left( {\frac{{b + e\mu _{{kl}} }}{b}} \right)^{{1 - c}} } \right)}}{{ - 1 + c}}}} } \right)\left( {\frac{{b + e^{{\mu _{{kl}} }} }}{b}} \right)^{c} ( - b + ( - 1 + c)e^{{\mu _{{kl}} }} } \right) \le 0$$

One can show that the inequality does not hold for R1 = {c < 0, 0 < b < ‒c} (Region 1). At first glance, it appears that the inequality holds for R2 = {c < 0, b ≥ ‒c} (Region 2), and also for R3 = {0 ≤ c < 1, b > 0} (Region 3). Therefore, numerically maximizing the function f over the Region

$${\text{R}}_{2}{ \cup } \, {\text{R}}_{3}=\mathrm{R}$$

always yields a non-positive value. To show that the expression fails over R1, we can demonstrate that it fails in the neighbourhood of \({\mu }_{kl}=0\). It can be shown that the evaluation of f at 0 in the limit and its first two derivatives yields:

$$\underset{{\mu }_{\mathit{kl}}\to 0}{\mathrm{lim}}f=0$$
$$\underset{{\mu }_{\mathit{kl}}\to 0}{\mathrm{lim}}\frac{\partial f}{\partial {\mu }_{kl}}=0$$
$$\underset{{\mu }_{\mathit{kl}}\to 0}{\mathrm{lim}}\frac{{\partial }^{2}f}{\partial {{\mu }_{kl}}^{2}}=-c-b>0$$

By continuity of f and its first two derivatives, there exists \(\varepsilon >0\), such that for \(\varepsilon >{\mu }_{kl}>0\) and \(-c>b\) we have \({f}^{{^{\prime}}{^{\prime}}}\left({\mu }_{kl}\right)>0\), then \({f}^{^{\prime}}\left({\mu }_{kl}\right)>0\) and, consequently, \(f\left({\mu }_{kl}\right)>0.\) However, a necessary condition for log-concavity of \(f\) is that \(f\left({\mu }_{kl}\right)\le 0\) for all \({\mu }_{kl}>0.\) Thus, f fails to be log-concave in R1 = {c < 0, 0 < b < ‒c}. By the same logic we can see that, for Regions 2 and 3, \(f\left({\mu }_{kl}\right)\le 0\) in the neighbourhood of \(0\). Although this is not an analytic proof that the function f(\({\mu }_{kl},b,c)\) over Regions 2 and 3 is always negative, this “proof” by numerical maximization seems sufficient.

Remark

The most extreme function that is not in the “Faddy family” (30) but satisfies (31) is:

$${\pi }^{*}\left({\mu }_{kl}\right)=\mathrm{min}\left({\mu }_{kl},1\right).$$

We explored this model and we found that it yields a higher likelihood (smaller deviance) than any other model in the permissible region. However, in a “real-world” setting one would not be able to justify this kind of model.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Navarro Alberto, J.A., Manly, B.F.J. & Gerow, K. Extending null scenarios with Faddy distributions in a probabilistic randomization protocol for presence-absence data. Environ Ecol Stat 29, 625–654 (2022). https://doi.org/10.1007/s10651-022-00537-4

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10651-022-00537-4

Keywords

Navigation