Asymptotic behavior of the number of distinct values in a sample from the geometric stick-breaking process

Annals of the Institute of Statistical Mathematics

Abstract

Discrete random probability measures are a key ingredient of Bayesian nonparametric inference. A sample generates ties with positive probability, and a fundamental object of both theoretical and applied interest is the corresponding number of distinct values. The growth rate can be determined from the rate of decay of the small frequencies, implying that, when the decreasingly ordered frequencies admit a tractable form, the asymptotics of the number of distinct values can be conveniently assessed. We focus on the geometric stick-breaking process and investigate the effect of the distribution of the success probability on the asymptotic behavior of the number of distinct values. A whole range of logarithmic behaviors is obtained by appropriately tuning the prior. A two-term expansion is also derived and illustrated in a comparison with a larger family of discrete random probability measures having an additional parameter given by the scale of the negative binomial distribution.
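As a rough illustration of the quantity studied here, the sketch below simulates the number of distinct values \(K_n\) in a sample of size n. It assumes the usual geometric stick-breaking weights \(w_j=p(1-p)^{j-1}\), which are not written out in this excerpt, so that sampling from the weights amounts to drawing geometric random variables. The Beta(2, 2) prior on p and all function names are illustrative choices, not the authors' setup.

```python
import math
import random


def sample_geometric(p, rng):
    """Draw J >= 1 with P(J = j) = p * (1 - p) ** (j - 1) by inversion."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1


def distinct_values(n, p, rng):
    """Number of distinct labels K_n in n i.i.d. draws from the weights."""
    return len({sample_geometric(p, rng) for _ in range(n)})


rng = random.Random(0)
n = 10_000

# Fixed success probability: to leading order K_n grows like
# log(n) / (-log(1 - p)).
k_fixed = distinct_values(n, 0.5, rng)

# Random success probability (assumed Beta(2, 2) prior, for illustration).
p_random = rng.betavariate(2.0, 2.0)
k_random = distinct_values(n, p_random, rng)

print(k_fixed, math.log(n) / -math.log(0.5))
print(p_random, k_random)
```

Repeating the second experiment over many draws of p gives a Monte Carlo picture of how the prior on the success probability affects the growth of \(K_n\).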

References

  • Archibald, M., Knopfmacher, A., Prodinger, H. (2006). The number of distinct values in a geometrically distributed sample. European Journal of Combinatorics, 27, 1059–1081.

  • Argiento, R., Cremaschi, A., Vannucci, M. (2020). Hierarchical normalized completely random measures to cluster grouped data. Journal of the American Statistical Association, 115(529), 318–333.

  • Arratia, R., Barbour, A.D., Tavaré, S. (2003). Logarithmic combinatorial structures: A probabilistic approach. EMS Monographs in Mathematics, European Mathematical Society, Zurich

  • Ayed, F., Lee, J., Caron, F. (2019). Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with double power-law behavior. In: K. Chaudhuri and R. Salakhutdinov (eds) Proceedings of the 36th International Conference on Machine Learning, PMLR (vol. 97, pp. 395–404).

  • Barndorff-Nielsen, O. E., Cox, D. R. (1989). Asymptotic techniques for use in statistics. London, New York: Chapman and Hall.

  • Bassetti, F., Casarin, R., Rossini, L. (2020). Hierarchical species sampling models. Bayesian Analysis, 15(3), 809–838.

  • Bingham, N. H., Goldie, C. M., Teugels, J. L. (1987). Regular variation. Cambridge: Cambridge University Press.

  • Camerlenghi, F., Lijoi, A., Orbanz, P., Prünster, I. (2019). Distribution theory for hierarchical processes. The Annals of Statistics, 47(1), 67–92.

  • Caron, F., Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society : Series B (Statistical Methodology), 79(5), 1295–1366.

  • Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., Knuth, D. E. (1996). On the Lambert W function. Advances in Computational Mathematics, 5, 329–359.

  • Dahl, D. B., Day, R., Tsai, J. W. (2017). Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, 112(518), 721–732.

  • De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 212–229.

  • De Blasi, P., Martinez, A. F., Mena, R. H., Prünster, I. (2020). On the inferential implications of decreasing weight structures in mixture models. Computational Statistics and Data Analysis, 147, 106940.

  • Di Benedetto, G., Caron, F., Teh, Y. W. (2020). Non-exchangeable random partition models for microclustering. The Annals of Statistics. (forthcoming).

  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.

  • Fuentes-García, R., Mena, R. H., Walker, S. G. (2010). A new Bayesian nonparametric mixture model. Communications in Statistics Simulation and Computation, 39(4), 669–682.

  • Gnedin, A. (2004). The Bernoulli sieve. Bernoulli, 10, 79–96.

  • Gnedin, A. (2010). Regeneration in random combinatorial structures. Probability Surveys, 7, 105–156.

  • Gnedin, A., Pitman, J. (2005). Regenerative composition structures. The Annals of Probability, 33(2), 445–479.

  • Gnedin, A., Pitman, J., Yor, M. (2006a). Asymptotic laws for compositions derived from transformed subordinators. The Annals of Probability, 34(2), 468–492.

  • Gnedin, A., Pitman, J., Yor, M. (2006b). Asymptotic laws for regenerative compositions: Gamma subordinators and the like. Probability Theory and Related Fields, 135(4), 576–602.

  • Gnedin, A., Hansen, B., Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. Probability Surveys, 4, 146–171.

  • Gnedin, A., Iksanov, A. M., Negadajlov, P., Rösler, U. (2009). The Bernoulli sieve revisited. The Annals of Applied Probability, 19, 1634–1655.

  • Gutiérrez, L., Gutiérrez-Peña, E., Mena, R. H. (2014). Bayesian nonparametric classification for spectroscopy data. Computational Statistics and Data Analysis, 78, 56–68.

  • Hatjispyros, J., Merkatas, C., Nicoleris, T., Walker, S. (2018). Dependent mixtures of geometric weights priors. Computational Statistics and Data Analysis, 119, 1–18.

  • Ishwaran, H., James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161–173.

  • Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. Journal of Mathematics and Mechanics, 17(24), 373–401.

  • Korwar, R. M., Hollander, M. (1973). Contributions to the theory of Dirichlet processes. The Annals of Probability, 1(4), 705–711.

  • Lijoi, A., Mena, R. H., Prünster, I. (2007a). A Bayesian nonparametric method for prediction in EST analysis. BMC Bioinformatics, 8, 339.

  • Lijoi, A., Mena, R. H., Prünster, I. (2007b). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika, 94(4), 769–786.

  • Lijoi, A., Mena, R. H., Prünster, I. (2007c). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 715–740.

  • Lijoi, A., Muliere, P., Prünster, I., Taddei, F. (2016). Innovation, growth and aggregate volatility from a Bayesian nonparametric perspective. Electronic Journal of Statistics, 10(2), 2179–2203.

  • Mena, R. H., Ruggiero, M., Walker, S. G. (2011). Geometric stick-breaking processes for continuous-time Bayesian nonparametric modeling. Journal of Statistical Planning and Inference, 141(9), 3217–3230.

  • Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2), 145–158.

  • Pitman, J. (2006). Combinatorial Stochastic Processes. Berlin: Springer.

  • Pitman, J., Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2), 855–900.

  • Teh, Y.W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of Coling/ACL (pp. 985–99).

Acknowledgements

The authors are grateful to an Associate Editor and two Referees for their helpful comments and suggestions. R.H. Mena gratefully acknowledges the support of PAPIIT-UNAM project IG100221. P. De Blasi and I. Prünster are supported by MIUR, PRIN Project 2015SNS29B. P. De Blasi also acknowledges “Dipartimenti di Eccellenza” Grant 2018-2020 and CNM for supporting this research.

Author information

Corresponding author

Correspondence to Pierpaolo De Blasi.

Appendix

1.1 Proof of Theorem 1

The proof follows arguments similar to those of Bingham et al. (1987, Theorem 3.9.1). It consists in evaluating \(\big (\varPhi (n)-\overrightarrow{\nu }(1/n)\big )/\ell (1/n)\) in the decomposition

$$\begin{aligned} \mathrm {E}(K_n) =\overrightarrow{\nu }(1/n) +\frac{\varPhi (n)-\overrightarrow{\nu }(1/n)}{\ell (1/n)}\ell (1/n) +\mathrm {E}(K_n)-\varPhi (n). \end{aligned}$$

Indeed, since both \(\varPhi (n)\sim \overrightarrow{\nu }(1/n)\) and \(\ell (1/n)\) are slowly varying, we have \(|\mathrm {E}(K_n)-\varPhi (n)| \le \textstyle \frac{2}{n} \displaystyle \varPhi (n)=o(\ell (1/n))\), cf. (4). The conclusion therefore follows by showing that \(\big (\varPhi (n)-\overrightarrow{\nu }(1/n)\big )/\ell (1/n) \rightarrow -c\gamma\). To this aim,

$$\begin{aligned} \frac{\varPhi (1/x)-\overrightarrow{\nu }(x)}{\ell (x)}&=\frac{1}{\ell (x)}\bigg [ \int _0^\infty \frac{1}{x}\mathrm {e}^{-y/x}\overrightarrow{\nu }(y)\mathrm {d}y -\int _0^\infty \overrightarrow{\nu }(x)\mathrm {e}^{-\lambda }\mathrm {d}\lambda \bigg ]\\&=\frac{1}{\ell (x)}\bigg [ \int _0^\infty \mathrm {e}^{-\lambda }\overrightarrow{\nu }(\lambda x) \mathrm {d}\lambda -\int _0^\infty \overrightarrow{\nu }(x)\mathrm {e}^{-\lambda }\mathrm {d}\lambda \bigg ] \\&=\int _0^\infty \frac{\overrightarrow{\nu }(\lambda x)-\overrightarrow{\nu }(x)}{\ell (x)} \mathrm {e}^{-\lambda }\mathrm {d}\lambda \\&\rightarrow \int _0^\infty c(\log \lambda )\mathrm {e}^{-\lambda } \mathrm {d}\lambda =c\varGamma '(1)=-c\gamma ,\quad \text {as }x\rightarrow 0, \end{aligned}$$

where in taking the limit we used the dominated convergence theorem, cf. global bounds in Bingham et al. (1987, Theorem 3.8.6). \(\square\)
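The constant in the limit comes from the identity \(\int _0^\infty (\log \lambda )\mathrm {e}^{-\lambda }\mathrm {d}\lambda =\varGamma '(1)=-\gamma\). The following sketch checks it numerically; the substitution \(\lambda =\mathrm {e}^{s}\), the truncation range and the step count are choices made here for convenience and are not part of the proof.

```python
import math

EULER_GAMMA = 0.5772156649015329


def integral_log_exp(s_min=-40.0, s_max=5.0, steps=4500):
    """Trapezoidal approximation of int_0^inf log(lam) * exp(-lam) dlam.

    The substitution lam = exp(s) turns the integrand into
    s * exp(s - exp(s)), which is smooth and decays fast in both tails.
    """
    h = (s_max - s_min) / steps
    total = 0.0
    for i in range(steps + 1):
        s = s_min + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * s * math.exp(s - math.exp(s))
    return total * h


value = integral_log_exp()
print(value, -EULER_GAMMA)
```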

1.2 Proof of Lemma 1

We will use the following integral representation of the Euler-Mascheroni constant:

$$\begin{aligned} \gamma =\int _0^\infty \left( \frac{1}{1-\mathrm {e}^{-x}}-\frac{1}{x} \right) \mathrm {e}^{-x}\mathrm {d}x. \end{aligned}$$

By the change of variable \(t=-\log (1-\mathrm {e}^{-x})\) so that \(\mathrm {d}t=-\frac{\mathrm {e}^{-x}}{1-\mathrm {e}^{-x}} \mathrm {d}x\) and \(x=-\log (1-\mathrm {e}^{-t})\), we obtain

$$\begin{aligned} \gamma&=\int _0^\infty \left( \frac{1}{1-\mathrm {e}^{-x}}-\frac{1}{x} \right) \mathrm {e}^{-x}\mathrm {d}x =\int _0^\infty \left( 1-\frac{1-\mathrm {e}^{-x}}{x} \right) \frac{\mathrm {e}^{-x}}{1-\mathrm {e}^{-x}}\mathrm {d}x\\&=\int _0^\infty \left( 1-\frac{\mathrm {e}^{-t}}{-\log (1-\mathrm {e}^{-t})} \right) \mathrm {d}t =\int _0^\infty \big (1-{f}(t)\big )\mathrm {d}t. \end{aligned}$$

It is easy to check that \(\lim _{t\rightarrow 0}{f}(t)=0\) and \(\lim _{t\rightarrow \infty }{f}(t)=1\). As for the tail behavior, by the Taylor expansion \(\log (1+x)=x-x^2/2+O(x^3)\) as \(x\rightarrow 0\), we find that, as \(t\rightarrow \infty\),

$$\begin{aligned} 1-{f}(t)&=1-\frac{\mathrm {e}^{-t}}{-\log (1-\mathrm {e}^{-t})} \sim 1-\frac{\mathrm {e}^{-t}}{\mathrm {e}^{-t}+\mathrm {e}^{-2t}/2} =\frac{\mathrm {e}^{-2t}/2}{\mathrm {e}^{-t}+\mathrm {e}^{-2t}/2} \sim \frac{\mathrm {e}^{-t}}{2}. \end{aligned}$$

\(\square\)
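Both claims of the proof can be checked numerically. The sketch below approximates \(\int _0^\infty (1-f(t))\mathrm {d}t\) with a composite midpoint rule and compares \(1-f(t)\) with \(\mathrm {e}^{-t}/2\) at a moderately large t. The truncation at t = 40 and the grid size are assumptions made here; the logarithmic singularity of the derivative at t = 0 limits the accuracy to a few digits.

```python
import math

EULER_GAMMA = 0.5772156649015329


def f(t):
    """f(t) = exp(-t) / (-log(1 - exp(-t))), as in the display above."""
    u = math.exp(-t)
    return u / -math.log1p(-u)


def midpoint(g, a, b, steps):
    """Composite midpoint rule; never evaluates g at the endpoints."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


# Integral of 1 - f over (0, inf), truncated at t = 40 where the
# integrand is of order exp(-40).
approx_gamma = midpoint(lambda t: 1.0 - f(t), 0.0, 40.0, 40_000)

# Tail check: (1 - f(t)) / (exp(-t) / 2) should be close to 1 for large t.
tail_ratio = (1.0 - f(10.0)) / (math.exp(-10.0) / 2.0)

print(approx_gamma, EULER_GAMMA, tail_ratio)
```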

1.3 Proof of Equation (11)

Let W(z) be the Lambert function, defined by \(W(z)\mathrm {e}^{W(z)}=z\). It is multivalued and, for z a real number, has two real branches: the principal branch \(W_0(z)\), with \(W(z)\ge -1\), and the branch \(W_{-1}(z)\), with \(W(z)< -1\). We have that \(\lim _{z\rightarrow 0^+}W_0(z)=0\) while \(\lim _{z\rightarrow 0^-}W_{-1}(z)=-\infty\). In particular, according to Corless et al. (1996, Section 4), as \(z\rightarrow 0^-\)

$$\begin{aligned} W_{-1}(z)= \log (-z)-\log (-\log (-z))+O\left( \frac{\log (-\log (-z))}{\log (-z)}\right) . \end{aligned}$$
(16)
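To see the expansion (16) at work, the sketch below computes \(W_{-1}(z)\) for a few small negative z and compares the remainder with \(\log (-\log (-z))/\log (-z)\). The solver is a hand-rolled Newton iteration on the logarithmic form of the defining equation, written here only for illustration; the function name and tolerances are assumptions, not a library routine.

```python
import math


def lambert_w_minus1(z, tol=1e-15, max_iter=100):
    """Real branch W_{-1}(z) for -1/e < z < 0.

    Solves w + log(-w) = log(-z), the logarithm of w * exp(w) = z on
    w < 0, by Newton's method started from the two-term expansion.
    """
    if not -math.exp(-1.0) < z < 0.0:
        raise ValueError("this sketch requires -1/e < z < 0")
    log_neg_z = math.log(-z)
    w = log_neg_z - math.log(-log_neg_z)  # leading two terms of (16)
    for _ in range(max_iter):
        step = (w + math.log(-w) - log_neg_z) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) <= tol * abs(w):
            break
    return w


for z in (-1e-3, -1e-6, -1e-12):
    l1 = math.log(-z)
    l2 = math.log(-l1)
    w = lambert_w_minus1(z)
    # The remainder w - (l1 - l2) should be of the order of l2 / l1.
    print(z, w, w - (l1 - l2), l2 / l1)
```

The starting point lies to the right of the root and the target function is concave and increasing on w < -1, so after the first step the iterates approach the root monotonically from the left.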

By algebraic manipulation of (10),

$$\begin{aligned}&p(1-p)^{m(x,p)}\frac{1}{2}(1+p+p\,m(x,p))=x;\quad \mathrm {e}^{\log (1-p)m(x,p)}(1+p+p\,m(x,p))=2x/p;\\&\quad (1+p+p\,m(x,p))\log (1-p)\mathrm {e}^{\log (1-p)m(x,p)} =\frac{2x\log (1-p)}{p};\\&\quad (1+p+p\,m(x,p))\log (1-p)\mathrm {e}^{\log (1-p)(1/p+1+m(x,p))} =\frac{2x\log (1-p)}{p}\mathrm {e}^{(1/p+1)\log (1-p)};\\&\quad \frac{\log (1-p)}{p}(1+p+p\,m(x,p)) \mathrm {e}^{\frac{\log (1-p)}{p}(1+p+p\,m(x,p))} =\frac{2x\log (1-p)}{p^2} \mathrm {e}^{\frac{1+p}{p}\log (1-p)};\\&\frac{\log (1-p)}{p}(1+p+p\,m(x,p)) =W(z), \end{aligned}$$

where, in the last display,

$$\begin{aligned} z=\frac{2x\log (1-p)}{p^2} \exp \Big (\frac{1+p}{p}\log (1-p)\Big ). \end{aligned}$$
(17)

Solving for m(x, p),

$$\begin{aligned} m(x,p)=\frac{1}{p\log (1-p)} \left( pW_{-1}(z) -\log (1-p)\right) -1, \end{aligned}$$
(18)

where we used the branch \(W_{-1}\) of W(z) since z in (17) is \(\le 0\) and \(W(z)\le -1\). The fact that \(W(z)\le -1\) is easily checked by using \(m(x,p)\ge 0\). In fact

$$\begin{aligned}&\frac{1}{p\log (1-p)}\left( pW(z) -\log (1-p)\right) -1\ge 0;\quad pW(z)-\log (1-p)\le p\log (1-p);\\&pW(z)\le \log (1-p)(1+p);\quad W(z)\le \log (1-p)\frac{1+p}{p} \end{aligned}$$

and \(\frac{1+p}{p}\log (1-p)\) decreases from \(-1\) to \(-\infty\) for \(p\in (0,1)\). From (17) one finds that \(z\rightarrow 0^-\) as \(x\rightarrow 0^+\). In particular, from \(w_1(p)>x\), that is \(p(1+p)/2>x\), it follows that

$$\begin{aligned} \frac{1+p}{p}\log (1-p) \exp \Big (\frac{1+p}{p}\log (1-p)\Big ) \le z\le 0 \end{aligned}$$

and the lower bound is larger than \(-\mathrm {e}^{-1}\) for any \(p\in (0,1)\). Hence \(\log (-z)<-1\) and \(\log (-\log (-z))>0\). By direct calculation

$$\begin{aligned} \log (-z) =\log (1-p)\left( \frac{\log (x/p)}{\log (1-p)} +\frac{1}{\log (1-p)}\log \frac{-2\log (1-p)}{p} +\frac{1+p}{p}\right) \end{aligned}$$

and

$$\begin{aligned}&\frac{1}{p\log (1-p)}\left( p\log (-z) -\log (1-p)\right) -1 \\&\quad =\frac{\log (x/p)}{\log (1-p)}+ \frac{1}{\log (1-p)} \log \frac{-2\log (1-p)}{p}. \end{aligned}$$

Replacing \(W_{-1}(z)\) in (18) with \(\log (-z)-\log (-\log (-z))\), the first two terms of the expansion (16), we find

$$\begin{aligned}&\frac{1}{p\log (1-p)} \left( p\Big (\log (-z) -\log (-\log (-z))\Big ) -\log (1-p)\right) -1\\&\quad =\frac{1}{p\log (1-p)} \left( p\log (-z)-\log (1-p)\right) -1 -\frac{1}{\log (1-p)}\log (-\log (-z))\\&\quad =\frac{\log (x/p)}{\log (1-p)}+ \frac{1}{\log (1-p)} \log \frac{-2\log (1-p)}{p} -\frac{1}{\log (1-p)}\log (-\log (-z))\\&\quad =\frac{\log (x/p)}{\log (1-p)} -\frac{1}{\log (1-p)} \log \left( \frac{p}{2\log (1-p)} \log (-z)\right) \\&\quad =\frac{\log (x/p)}{\log (1-p)} -\frac{1}{\log (1-p)} \\&\qquad \log \left( \frac{p}{2} \left( \frac{\log (x/p)}{\log (1-p)} +\frac{1}{\log (1-p)}\log \frac{-2\log (1-p)}{p} +\frac{1+p}{p}\right) \right) \\&\quad =\frac{\log (x/p)}{\log (1-p)} -\frac{1}{\log (1-p)}\\&\qquad \log \left( \frac{1}{2}\left( 1+p +p\frac{\log (x/p)}{\log (1-p)} +\frac{p}{\log (1-p)} \log \frac{-2\log (1-p)}{p} \right) \right) . \end{aligned}$$

The remainder of the expansion is easily found. \(\square\)
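A numerical sanity check of the final expression is sketched below. It solves (10) for m(x, p) directly by bisection, treating m as a real variable, and compares the result with the last line of the display above. The value p = 0.3, the grid of x values and the function names are illustrative assumptions.

```python
import math


def weight(m, p):
    """Left-hand side of (10): p * (1 - p)**m * (1 + p + p*m) / 2."""
    return p * (1.0 - p) ** m * (1.0 + p + p * m) / 2.0


def m_exact(x, p, iters=200):
    """Solve weight(m, p) = x for real m >= 0 by bisection.

    Assumes 0 < x < weight(0, p); weight is decreasing in m on m >= 0.
    """
    lo, hi = 0.0, 1.0
    while weight(hi, p) > x:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weight(mid, p) > x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def m_two_term(x, p):
    """Closed form obtained by keeping two terms of the W_{-1} expansion."""
    lq = math.log(1.0 - p)
    a = math.log(x / p) / lq
    b = math.log(-2.0 * lq / p) / lq
    return a - math.log(0.5 * (1.0 + p + p * a + p * b)) / lq


p = 0.3
for x in (1e-4, 1e-8, 1e-12):
    print(x, m_exact(x, p), m_two_term(x, p))
```

The gap between the two columns shrinks only slowly as x decreases, which is consistent with a remainder of order \(\log (-\log (-z))/\log (-z)\) in (16).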

About this article

Cite this article

De Blasi, P., Mena, R.H. & Prünster, I. Asymptotic behavior of the number of distinct values in a sample from the geometric stick-breaking process. Ann Inst Stat Math 74, 143–165 (2022). https://doi.org/10.1007/s10463-021-00791-6

  • DOI: https://doi.org/10.1007/s10463-021-00791-6
