Abstract
Discrete random probability measures are a key ingredient of Bayesian nonparametric inference. A sample from such a measure exhibits ties with positive probability, and a fundamental object of both theoretical and applied interest is the corresponding number of distinct values. Its growth rate can be determined from the rate of decay of the small frequencies, which implies that, when the decreasingly ordered frequencies admit a tractable form, the asymptotics of the number of distinct values can be conveniently assessed. We focus on the geometric stick-breaking process and investigate the effect of the distribution of the success probability on the asymptotic behavior of the number of distinct values. A whole range of logarithmic behaviors is obtained by appropriately tuning the prior. A two-term expansion is also derived and illustrated in a comparison with a larger family of discrete random probability measures having an additional parameter given by the scale of the negative binomial distribution.
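To make the object of study concrete, the following is a minimal simulation sketch of the number of distinct values \(K_n\) in a sample of size n from geometric weights \(w_j=p(1-p)^{j-1}\). The Beta prior on the success probability p, the function names and the parameter values are illustrative assumptions, not choices taken from the paper.

```python
import math
import random

def num_distinct_geometric(n, p, rng):
    """Number of distinct labels among n i.i.d. draws with P(J = j) = p (1 - p)^(j - 1)."""
    log_q = math.log1p(-p)                        # log(1 - p) < 0
    labels = set()
    for _ in range(n):
        u = 1.0 - rng.random()                    # uniform on (0, 1]
        labels.add(1 + int(math.log(u) / log_q))  # inverse-CDF geometric draw
    return len(labels)

def mean_num_distinct(n, a, b, reps, rng):
    """Monte Carlo mean of K_n when p ~ Beta(a, b); the Beta prior is an assumption."""
    total = 0
    for _ in range(reps):
        p = rng.betavariate(a, b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)       # guard against p = 0 or p = 1
        total += num_distinct_geometric(n, p, rng)
    return total / reps

rng = random.Random(0)
for n in (10, 100, 1000):
    # The estimated mean of K_n should grow slowly (roughly logarithmically) in n.
    print(n, mean_num_distinct(n, a=2.0, b=2.0, reps=200, rng=rng))
```

Changing the hypothetical prior parameters `a` and `b` shifts mass of p toward 0 or 1 and gives a rough feel for how the prior affects the growth of \(K_n\).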
References
Archibald, M., Knopfmacher, A., Prodinger, H. (2006). The number of distinct values in a geometrically distributed sample. European Journal of Combinatorics, 27, 1059–1081.
Argiento, R., Cremaschi, A., Vannucci, M. (2020). Hierarchical normalized completely random measures to cluster grouped data. Journal of the American Statistical Association, 115(529), 318–333.
Arratia, R., Barbour, A. D., Tavaré, S. (2003). Logarithmic combinatorial structures: A probabilistic approach. EMS Monographs in Mathematics. Zurich: European Mathematical Society.
Ayed, F., Lee, J., Caron, F. (2019). Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with double power-law behavior. In: K. Chaudhuri and R. Salakhutdinov (eds) Proceedings of the 36th International Conference on Machine Learning, PMLR (vol. 97, pp. 395–404).
Barndorff-Nielsen, O. E., Cox, D. R. (1989). Asymptotic techniques for use in statistics. London, New York: Chapman and Hall.
Bassetti, F., Casarin, R., Rossini, L. (2020). Hierarchical species sampling models. Bayesian Analysis, 15(3), 809–838.
Bingham, N. H., Goldie, C. M., Teugels, J. L. (1987). Regular variation. Cambridge: Cambridge University Press.
Camerlenghi, F., Lijoi, A., Orbanz, P., Prünster, I. (2019). Distribution theory for hierarchical processes. The Annals of Statistics, 47(1), 67–92.
Caron, F., Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5), 1295–1366.
Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., Knuth, D. E. (1996). On the Lambert W function. Advances in Computational Mathematics, 5, 329–359.
Dahl, D. B., Day, R., Tsai, J. W. (2017). Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, 112(518), 721–732.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 212–229.
De Blasi, P., Martinez, A. F., Mena, R. H., Prünster, I. (2020). On the inferential implications of decreasing weight structures in mixture models. Computational Statistics and Data Analysis, 147, 106940.
Di Benedetto, G., Caron, F., Teh, Y. W. (2020). Non-exchangeable random partition models for microclustering. The Annals of Statistics. (forthcoming).
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.
Fuentes-García, R., Mena, R. H., Walker, S. G. (2010). A new Bayesian nonparametric mixture model. Communications in Statistics - Simulation and Computation, 39(4), 669–682.
Gnedin, A. (2004). The Bernoulli sieve. Bernoulli, 10, 79–96.
Gnedin, A. (2010). Regeneration in random combinatorial structures. Probability Surveys, 7, 105–156.
Gnedin, A., Pitman, J. (2005). Regenerative composition structures. The Annals of Probability, 33(2), 445–479.
Gnedin, A., Pitman, J., Yor, M. (2006a). Asymptotic laws for compositions derived from transformed subordinators. The Annals of Probability, 34(2), 468–492.
Gnedin, A., Pitman, J., Yor, M. (2006b). Asymptotic laws for regenerative compositions: Gamma subordinators and the like. Probability Theory and Related Fields, 135(4), 576–602.
Gnedin, A., Hansen, B., Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. Probability Surveys, 4, 146–171.
Gnedin, A., Iksanov, A. M., Negadajlov, P., Rösler, U. (2009). The Bernoulli sieve revisited. The Annals of Applied Probability, 19, 1634–1655.
Gutiérrez, L., Gutiérrez-Peña, E., Mena, R. H. (2014). Bayesian nonparametric classification for spectroscopy data. Computational Statistics and Data Analysis, 78, 56–68.
Hatjispyros, J., Merkatas, C., Nicoleris, T., Walker, S. (2018). Dependent mixtures of geometric weights priors. Computational Statistics and Data Analysis, 119, 1–18.
Ishwaran, H., James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161–173.
Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. Journal of Mathematics and Mechanics, 17(24), 373–401.
Korwar, R. M., Hollander, M. (1973). Contributions to the theory of Dirichlet processes. The Annals of Probability, 1(4), 705–711.
Lijoi, A., Mena, R. H., Prünster, I. (2007a). A Bayesian nonparametric method for prediction in EST analysis. BMC Bioinformatics, 8, 339.
Lijoi, A., Mena, R. H., Prünster, I. (2007b). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika, 94(4), 769–786.
Lijoi, A., Mena, R. H., Prünster, I. (2007c). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 715–740.
Lijoi, A., Muliere, P., Prünster, I., Taddei, F. (2016). Innovation, growth and aggregate volatility from a Bayesian nonparametric perspective. Electronic Journal of Statistics, 10(2), 2179–2203.
Mena, R. H., Ruggiero, M., Walker, S. G. (2011). Geometric stick-breaking processes for continuous-time Bayesian nonparametric modeling. Journal of Statistical Planning and Inference, 141(9), 3217–3230.
Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2), 145–158.
Pitman, J. (2006). Combinatorial Stochastic Processes. Berlin: Springer.
Pitman, J., Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2), 855–900.
Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of Coling/ACL (pp. 985–992).
Acknowledgements
The authors are grateful to an Associate Editor and two Referees for their helpful comments and suggestions. R.H. Mena gratefully acknowledges the support of PAPIIT-UNAM project IG100221. P. De Blasi and I. Prünster are supported by MIUR, PRIN Project 2015SNS29B. P. De Blasi also acknowledges “Dipartimenti di Eccellenza” Grant 2018-2020 and CNM for supporting this research.
Appendix
1.1 Proof of Theorem 1
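The proof follows arguments similar to those of Bingham et al. (1987, Theorem 3.9.1). It consists of evaluating \(\big (\varPhi (n)-\overrightarrow{\nu }(1/n)\big )/\ell (1/n)\) in the decomposition
\[ \frac{\mathrm {E}(K_n)-\overrightarrow{\nu }(1/n)}{\ell (1/n)} = \frac{\mathrm {E}(K_n)-\varPhi (n)}{\ell (1/n)} + \frac{\varPhi (n)-\overrightarrow{\nu }(1/n)}{\ell (1/n)}. \]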
Indeed, since \(\varPhi (n)\sim \overrightarrow{\nu }(1/n)\) and \(\ell (1/n)\) are both slowly varying, \(|\mathrm {E}(K_n)-\varPhi (n)| \le \textstyle \frac{2}{n} \displaystyle \varPhi (n)=o(\ell (1/n))\), cf. (4), so the conclusion follows by showing that \(\big (\varPhi (n)-\overrightarrow{\nu }(1/n)\big )/\ell (1/n) \rightarrow -c\gamma\). To this end,
where in taking the limit we used the dominated convergence theorem, cf. global bounds in Bingham et al. (1987, Theorem 3.8.6). \(\square\)
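As a sketch of the computation behind this limit, assume the following forms, which are not displayed in this appendix and are taken here as in Gnedin et al. (2007): \(\varPhi (n)=\int _0^1(1-\mathrm {e}^{-nx})\,\nu (\mathrm {d}x)\), \(\overrightarrow{\nu }(x)=\nu [x,1]\) with \(\overrightarrow{\nu }(x)=0\) for \(x>1\), and the hypothesis of Theorem 1 in the form \(\big (\overrightarrow{\nu }(\lambda x)-\overrightarrow{\nu }(x)\big )/\ell (x)\rightarrow c\log \lambda\) as \(x\rightarrow 0\) for every \(\lambda >0\). Under these assumptions, integration by parts and the substitution \(s=nx\) would give

\[ \varPhi (n) = n\int _0^1 \mathrm {e}^{-nx}\,\overrightarrow{\nu }(x)\,\mathrm {d}x = \int _0^\infty \mathrm {e}^{-s}\,\overrightarrow{\nu }(s/n)\,\mathrm {d}s, \]

\[ \frac{\varPhi (n)-\overrightarrow{\nu }(1/n)}{\ell (1/n)} = \int _0^\infty \mathrm {e}^{-s}\,\frac{\overrightarrow{\nu }(s/n)-\overrightarrow{\nu }(1/n)}{\ell (1/n)}\,\mathrm {d}s \;\longrightarrow \; c\int _0^\infty \mathrm {e}^{-s}\log s\,\mathrm {d}s = -c\gamma , \]

where the last equality uses \(\int _0^\infty \mathrm {e}^{-s}\log s\,\mathrm {d}s=-\gamma\). The sign of the limit agrees with the statement \(-c\gamma\) above only under the assumed form of the hypothesis, so this should be read as a consistency check rather than as the authors' own display.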
1.2 Proof of Lemma 1
We will use the following integral representation of the Euler-Mascheroni constant:
By the change of variable \(t=-\log (1-\mathrm {e}^{-x})\) so that \(\mathrm {d}t=-\frac{\mathrm {e}^{-x}}{1-\mathrm {e}^{-x}} \mathrm {d}x\) and \(x=-\log (1-\mathrm {e}^{-t})\), we obtain
It is easy to check that \(\lim _{t\rightarrow 0}{f}(t)=0\) and \(\lim _{t\rightarrow \infty }{f}(t)=1\). As for the tail behavior, by the Taylor expansion \(\log (1+x)=x-x^2/2+O(x^3)\) as \(x\rightarrow 0\), we find that, as \(t\rightarrow \infty\),
\(\square\)
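The integral representation used in this proof is not reproduced above. A standard one is \(\gamma =-\int _0^\infty \mathrm {e}^{-x}\log x\,\mathrm {d}x\), which the stated change of variable turns into \(\gamma =-\int _0^\infty \mathrm {e}^{-t}\log \big (-\log (1-\mathrm {e}^{-t})\big )\,\mathrm {d}t\). The following minimal numerical check assumes this is the representation intended; the helper names and grid parameters are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # reference value of the Euler-Mascheroni constant

def integrate_on_log_grid(g, u_min=-40.0, u_max=5.0, h=0.01):
    """Trapezoid rule for int_0^inf g(x) dx after substituting x = exp(u)."""
    steps = int(round((u_max - u_min) / h))
    total = 0.0
    for k in range(steps + 1):
        x = math.exp(u_min + k * h)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(x) * x            # dx = x du
    return total * h

def x_of_t(t):
    """x = -log(1 - exp(-t)), evaluated stably for small and large t."""
    if t < 1.0:
        return -math.log(-math.expm1(-t))
    return -math.log1p(-math.exp(-t))

# Direct representation: gamma = -int_0^inf exp(-x) log(x) dx.
gamma_direct = -integrate_on_log_grid(lambda x: math.exp(-x) * math.log(x))

# After the change of variable t = -log(1 - exp(-x)).
gamma_transformed = -integrate_on_log_grid(
    lambda t: math.exp(-t) * math.log(x_of_t(t))
)

print(gamma_direct, gamma_transformed, EULER_GAMMA)
```

The substitution \(x=\mathrm {e}^{u}\) is a design choice that removes the logarithmic singularity at the origin, so that a plain trapezoid rule is accurate for both integrals.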
1.3 Proof of Equation (11)
Let W(z) be the Lambert function defined by \(W(z)\mathrm {e}^{W(z)}=z\). It is a multivalued function that has, for z a real number, two branches: the principal branch \(W_0(z)\), for which \(W(z)\ge -1\), and the branch \(W_{-1}(z)\), for which \(W(z)\le -1\). We have that \(\lim _{z\rightarrow 0^+}W_0(z)=0\) while \(\lim _{z\rightarrow 0^-}W_{-1}(z)=-\infty\). In particular, according to Corless et al. (1996, Section 4), as \(z\rightarrow 0^-\),
\[ W_{-1}(z)=\log (-z)-\log (-\log (-z))+O\!\left( \frac{\log (-\log (-z))}{\log (-z)}\right) . \tag{16} \]
By algebraic manipulation of (10)
where, in the last display,
Solving for m(x, p),
where we used the branch \(W_{-1}\) of W(z) since z in (17) is \(\le 0\) and \(W(z)\le -1\). The fact that \(W(z)\le -1\) is easily checked by using \(m(x,p)\ge 0\). In fact
and \(\frac{1+p}{p}\log (1-p)\) decreases from \(-1\) to \(-\infty\) for \(p\in (0,1)\). From (17) one finds that \(z\rightarrow 0^-\) as \(x\rightarrow 0^+\). In particular, from \(w_1(p)>x\), that is \(p(1+p)/2>x\), it follows that
and the lower bound is larger than \(-\mathrm {e}^{-1}\) for any \(p\in (0,1)\). Hence \(\log (-z)<-1\) and \(\log (-\log (-z))>0\). By direct calculation
and
In (18), replace \(W_{-1}(z)\) with \(\log (-z)-\log (-\log (-z))\), according to the two-term expansion in (16), to find
The remainder of the expansion is easily found. \(\square\)
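The two-term expansion of \(W_{-1}\) used in this proof can be checked numerically. The sketch below computes \(W_{-1}(z)\) by bisection on \(w\mathrm {e}^{w}=z\) and compares it with \(\log (-z)-\log (-\log (-z))\). The function names are illustrative, and bisection is chosen here for simplicity; it is not the method of Corless et al. (1996).

```python
import math

def lambert_w_minus1(z):
    """Branch W_{-1}: the root w <= -1 of w * exp(w) = z, for -1/e < z < 0."""
    if not (-1.0 / math.e < z < 0.0):
        raise ValueError("z must lie in (-1/e, 0)")

    def f(w):
        return w * math.exp(w) - z       # strictly decreasing on (-inf, -1]

    hi = -1.0                            # f(hi) = -1/e - z < 0
    lo = -2.0
    while f(lo) <= 0.0:                  # expand the bracket until f(lo) > 0
        lo *= 2.0
    for _ in range(200):                 # bisection down to double precision
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_term_expansion(z):
    """log(-z) - log(-log(-z)), the leading two terms as z -> 0^-."""
    l1 = math.log(-z)
    return l1 - math.log(-l1)

for z in (-1e-3, -1e-6, -1e-12):
    w = lambert_w_minus1(z)
    # The gap between the exact value and the expansion shrinks slowly as z -> 0^-.
    print(z, w, two_term_expansion(z), w - two_term_expansion(z))
```

The gap decays only like \(\log (-\log (-z))/\log (-z)\), so the approximation improves slowly as \(z\rightarrow 0^-\).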
Cite this article
De Blasi, P., Mena, R.H. & Prünster, I. Asymptotic behavior of the number of distinct values in a sample from the geometric stick-breaking process. Ann Inst Stat Math 74, 143–165 (2022). https://doi.org/10.1007/s10463-021-00791-6