
Bayesian nonparametric priors for hidden Markov random fields

Abstract

One of the central issues in statistics and machine learning is how to select an adequate model that can automatically adapt its complexity to the observed data. In the present paper, we focus on the issue of determining the structure of clustered data, both in terms of finding the appropriate number of clusters and of modeling the right dependence structure between the observations. Bayesian nonparametric (BNP) models, which do not impose an upper limit on the number of clusters, are appropriate to avoid having to guess the number of clusters in advance, but they have been mainly developed for independent data. In contrast, Markov random fields (MRF) have been extensively used to model dependencies in a tractable manner, but they are usually restricted to a finite number of clusters when used for clustering tasks. Our main contribution is to propose a general scheme to design tractable BNP–MRF priors that combine both features: no commitment to an arbitrary number of clusters and explicit dependence modeling. A key ingredient in this construction is the availability of a stick-breaking representation, which has the threefold advantage of allowing us to extend standard discrete MRFs to an infinite state space, to design a tractable estimation algorithm using a variational approximation, and to derive theoretical properties on the predictive distribution and the number of clusters of the proposed model. This approach is illustrated on a challenging natural image segmentation task, for which it shows good performance with respect to the literature.


Notes

  1. Note that the notation introduced for the different \(\theta _j^\star \) differs from that used for the stick-breaking variables, \(\theta _j^*\).

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012)

  • Albughdadi, M., Chaâri, L., Tourneret, J., Forbes, F., Ciuciu, P.: A Bayesian non-parametric hidden Markov random model for hemodynamic brain parcellation. Signal Process. 135, 132–146 (2017)

  • Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 898–916 (2011)

  • Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2007). http://dl.acm.org/citation.cfm?id=1283383.1283494

  • Beal, M., Ghahramani, Z.: The variational Bayesian EM Algorithm for incomplete data: with application to scoring graphical model structures. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M. (eds.) Bayesian Statistics, pp. 453–464. Oxford University Press, Oxford (2003)

  • Beal, M.J.: Variational algorithms for approximate Bayesian inference. PhD thesis (2003)

  • Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B (Methodol.) 36, 192–236 (1974)

  • Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1(1), 121–143 (2006)

  • Celeux, G., Forbes, F., Peyrard, N.: EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognit. 36, 131–144 (2003)

  • Chaari, L., Vincent, T., Forbes, F., Dojat, M., Ciuciu, P.: Fast joint detection-estimation of evoked brain activity in event-related fMRI using a variational approach. IEEE Trans. Med. Imag. 32(5), 821–837 (2013)

  • Chandler, D.: Introduction to Modern Statistical Mechanics. Oxford University Press, New York (1987)

  • Chatzis, S.P.: A Markov random field-regulated Pitman–Yor process prior for spatially constrained data clustering. Pattern Recognit. 46(6), 1595–1603 (2013)

  • Chatzis, S.P., Tsechpenakis, G.: The infinite hidden Markov random field model. IEEE Trans. Neural Netw. 21(6), 1004–1014 (2010)

  • Corduneanu, A., Bishop, C.M.: Variational Bayesian model selection for mixture distributions. In: Proceedings Eighth International Conference on Artificial Intelligence and Statistics, pp. 27–34. Morgan Kaufmann (2001)

  • da Silva, A.R.F.: A Dirichlet process mixture model for brain MRI tissue classification. Med. Image Anal. 11(2), 169–182 (2007)

  • De Blasi, P., Favaro, S., Lijoi, A., Mena, R.H., Prünster, I., Ruggiero, M.: Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Trans. Pattern Anal. Mach. Intell. 37(2), 212–229 (2015)

  • Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I., Teh, Y.W.: On the stick-breaking representation for homogeneous NRMIs. Bayesian Anal. 11(3), 697–724 (2016)

  • Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973)

  • Forbes, F., Peyrard, N.: Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1089–1101 (2003)

  • Ghosal, S., Van der Vaart, A.: Fundamentals of Nonparametric Bayesian Inference, vol. 44. Cambridge University Press, Cambridge (2017)

  • Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995)

  • Ishwaran, H., James, L.F.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96(453), 161–173 (2001)

  • Johnson, T.D., Liu, Z., Bartsch, A.J., Nichols, T.E.: A Bayesian non-parametric Potts model with application to pre-surgical fMRI data. Stat. Methods Med. Res. 22(4), 364–381 (2013)

  • McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley, New York (1996)

  • Miller, J.W., Harrison, M.T.: Mixture models with a prior on the number of components. J. Am. Stat. Assoc. 113, 340–356 (2018)

  • Murphy, K.P.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)

  • Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368 (1998)

  • Orbanz, P., Buhmann, J.M.: Nonparametric Bayesian image segmentation. Int. J. Comput. Vis. 77(1–3), 25–45 (2008)

  • Pitman, J.: Combinatorial stochastic processes. Lecture Notes in Mathematics, vol. 1875. Springer, Berlin (2006). Lectures from the 32nd Summer School on Probability Theory held in Saint-Flour, July 7–24 (2002)

  • Pitman, J., Yor, M.: The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25(2), 855–900 (1997)

  • Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  • Sethuraman, J.: A constructive definition of Dirichlet priors. Stat. Sin. 4(2), 639–650 (1994)

  • Shyr, A., Darrell, T., Jordan, M.I., Urtasun, R.: Supervised hierarchical Pitman–Yor process for natural scene segmentation. In: Proceedings of CVPR 2011, pp. 2281–2288 (2011)

  • Sodjo, J., Giremus, A., Dobigeon, N., Giovannelli, J.F.: A generalized Swendsen–Wang algorithm for Bayesian nonparametric joint segmentation of multiple images. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1882–1886. IEEE, New Orleans (2017)

  • Stoehr, J.: A review on statistical inference methods for discrete Markov random fields (2017). arXiv e-prints arXiv:1704.03331

  • Sudderth, E.B., Jordan, M.I.: Shared segmentation of natural scenes using dependent Pitman–Yor processes. In: Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8–11, 2008, pp. 1585–1592 (2008)

  • Teh, Y.W.: A Bayesian interpretation of interpolated Kneser–Ney. Technical report (2006)

  • Unnikrishnan, R., Pantofaru, C., Hebert, M.: A measure for objective evaluation of image segmentation algorithms. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)—Workshops, pp. 34–34 (2005)

  • Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. Int. J. Comput. Vis. 62(1), 61–81 (2005)

  • Wallach, H., Jensen, S., Dicker, L., Heller, K.: An alternative prior process for nonparametric Bayesian clustering. In: Y.W. Teh, M. Titterington (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp. 892–899. PMLR, Chia Laguna Resort, Sardinia (2010)

  • Wang, C., Blei, D.M.: Truncation-free stochastic variational inference for Bayesian nonparametric models. In: Proceedings of the 25th International Conference on Neural Information Processing Systems—Volume 1, NIPS’12, pp. 413–421 (2012)

  • Xu, D., Caron, F., Doucet, A.: Bayesian nonparametric image segmentation using a generalized Swendsen–Wang algorithm. arXiv e-prints (2016)

Download references

Acknowledgements

This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the “Investissements d’avenir” program (ANR-15-IDEX-02).

Author information

Corresponding author

Correspondence to Florence Forbes.


A Appendix

Proofs of Propositions 1 and 2 are given in the first two sections. Details on the VBEM steps, completing the previous developments where necessary, then follow.

A.1 Proof of Proposition 1

Integrating out \({\varvec{\pi }} \) from distribution (12), and noticing that the Markov term does not depend on \({\varvec{\pi }} \), leads to

$$\begin{aligned} p(z_{1:n+1}; \beta ) \propto p(z_{1:n+1}; \beta =0)\; \mathrm {e}^{\beta H(z_{1:n+1})} \end{aligned}$$

where \(H(z_{1:n+1})= \sum _{i\sim j}\delta _{z_i=z_j}\) counts the number of homogeneous edges. It follows that

$$\begin{aligned} p(z_{n+1} | z_{1:n}; \beta ) \propto p(z_{n+1} | z_{1:n}; \beta =0) \; \mathrm {e}^{\beta H(z_{1:n+1})} \;. \end{aligned}$$

When \(\beta =0\), the model (12) reduces to standard Gibbs-type priors, for which the quantities above are given in (23). Using (23) and enumerating how \(z_{n+1}\) can affect the number of homogeneous edges, we obtain the following:

for \(z_{n+1} \not \in \{z_1, \ldots , z_n\}\),

$$\begin{aligned} p(z_{n+1}|z_{1:n}; \beta ) \propto \frac{V_{n+1,{K_n}+1}}{V_{n,{K_n}}}\; \mathrm {e}^{\beta H(z_{1:n})} \end{aligned}$$

and for \(\ell \in \{z_1 \ldots z_{n}\}\),

$$\begin{aligned} p(z_{n+1}= & {} \ell \; | \; z_{1:n}; \beta ) \propto \frac{V_{n+1,{K_n}}}{V_{n,{K_n}}} (n_\ell -\sigma ) \; \mathrm {e}^{\beta \tilde{n}_\ell \delta _{{{\mathcal {N}}_{n+1}}}(\ell )} \;\\&\mathrm {e}^{\beta H(z_{1:n}) } \; . \end{aligned}$$

The result follows by normalizing the quantities above, using (21) and noticing that

$$\begin{aligned} \sum _{\ell \in \{z_1 \ldots z_{n}\}} (n_\ell -\sigma ) \mathrm {e}^{\beta \tilde{n}_\ell \delta _{{{\mathcal {N}}_{n+1}}}(\ell )} = \varvec{\eta }_{n+1} + n -\sigma K_n \; . \end{aligned}$$

\(\square \)
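For illustration, the short sketch below specializes this predictive rule to the PY-MRF case, assuming the standard Pitman–Yor values of the ratios \(V_{n+1,{K_n}+1}/V_{n,{K_n}}\) and \(V_{n+1,{K_n}}/V_{n,{K_n}}\): an existing cluster \(\ell \) then receives weight \((n_\ell -\sigma )\,\mathrm {e}^{\beta \tilde{n}_\ell \delta _{{{\mathcal {N}}_{n+1}}}(\ell )}\) and a new cluster receives weight \(\alpha +\sigma K_n\), with normalizing constant \(\alpha + n + \varvec{\eta }_{n+1}\) (the one appearing in the proof of Proposition 2). This is a sketch under that assumption, not the authors' implementation.

```python
import numpy as np

def py_mrf_predictive(counts, multiplicities, alpha, sigma, beta):
    """Predictive weights of z_{n+1} for the PY-MRF special case of Proposition 1.
    counts: n_l for each existing cluster; multiplicities: ñ_l, the number of
    neighbours of site n+1 already in cluster l (0 if l is not represented there).
    Returns the normalized probabilities; the last entry is P(new cluster)."""
    counts = np.asarray(counts, dtype=float)
    mult = np.asarray(multiplicities, dtype=float)
    K_n = len(counts)
    w_existing = (counts - sigma) * np.exp(beta * mult)   # (n_l - sigma) e^{beta ñ_l}
    w_new = alpha + sigma * K_n                           # weight of a fresh cluster
    w = np.append(w_existing, w_new)
    return w / w.sum()

# Example: three existing clusters with counts (5, 3, 2); among the neighbours of
# the new site, two belong to cluster 0 and one to cluster 2.
# py_mrf_predictive([5, 3, 2], [2, 0, 1], alpha=1.0, sigma=0.25, beta=0.5)
```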

A.2 Proof of Proposition 2

Consider the DP first. Let \(D_n = \delta _{\theta _n\,\text {is new}\vert \theta _{1:n-1}}\) be the Bernoulli random variable equal to one when \(\theta _n\) is a fresh draw, which for the DP-MRF happens with probability

$$\begin{aligned} \frac{\alpha }{\alpha +n-1+ \varvec{\eta }_{n}(0,\beta )}. \end{aligned}$$

Then \({K_n}= \sum _{i=1}^n D_i\), so we have for the DP-MRF

$$\begin{aligned} {\mathbb {E}}[{K_n}] = \sum _{i=0}^{n-1} {\mathbb {E}}\left[ \frac{\alpha }{\alpha +i + \varvec{\eta }_{i+1}(0,\beta )}\right] . \end{aligned}$$
(48)

By definition of the maximal degree D of the graph, the multiplicity \({\tilde{n}}_\ell \) is at most D for any \(\ell \), hence we have the following upper bound

$$\begin{aligned} \sum _{\ell \in z_{{\mathcal {N}}_{i+1}}}n_{\ell }(\mathrm {e}^{\beta {\tilde{n}}_\ell }-1) \le \sum _{\ell \in z_{{\mathcal {N}}_{i+1}}}n_{\ell }(\mathrm {e}^{D\beta }-1) \le i(\mathrm {e}^{D\beta }-1). \end{aligned}$$
(49)

Plugging (49) into (48) provides the desired inequality

$$\begin{aligned} {\mathbb {E}}[{K_n}]&\ge \sum _{i=0}^{n-1} \frac{\alpha }{\alpha +i\mathrm {e}^{D\beta }}\\&= \frac{\alpha }{\mathrm {e}^{D\beta }} \sum _{i=0}^{n-1} \frac{1}{i+\alpha \mathrm {e}^{-D\beta }} \gtrsim \frac{\alpha }{\mathrm {e}^{D\beta }}\log n. \end{aligned}$$

Turning to the Pitman–Yor process, we follow the proof technique of Teh (2006). The prior expectation of the number of clusters satisfies the following recursion

$$\begin{aligned} {\mathbb {E}}[{K_{n+1}}]={\mathbb {E}}[{K_n}]+{\mathbb {E}}\left[ \frac{\alpha +\sigma {K_n}}{\alpha +n+ \varvec{\eta }_{n+1}}\right] . \end{aligned}$$

Assuming here that \(\alpha \ge 0\) and using the bound (49) yields

$$\begin{aligned} {\mathbb {E}}[{K_{n+1}}]\ge {\mathbb {E}}[{K_n}]\left( 1+\frac{\sigma }{\alpha +n\mathrm {e}^{D\beta }}\right) . \end{aligned}$$

By induction, and using \(K_1=1\),

$$\begin{aligned} {\mathbb {E}}[{K_n}]&\ge \prod _{i=1}^{n-1}\left( 1+\frac{\sigma }{\alpha +i\mathrm {e}^{D\beta }}\right) \\&\quad = \exp {\left( \sum _{i=1}^{n-1}\log \left( 1+\frac{\sigma }{\alpha +i\mathrm {e}^{D\beta }}\right) \right) }\\&\simeq \exp {\left( \sum _{i=1}^{n-1}\frac{\sigma }{\alpha +i\mathrm {e}^{D\beta }}\right) } \\&\quad \simeq \exp { \left( \frac{\sigma }{\mathrm {e}^{D\beta }}\log n \right) } = n^{\sigma \mathrm {e}^{-D\beta }}, \end{aligned}$$

where \(a_n\simeq b_n\) means \(a_n/b_n\) converges to some positive constant when \(n\rightarrow \infty \). The desired inequality for PY is obtained by denoting this constant by c. \(\square \)
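For a quick numerical check of the DP-MRF part of the proposition, the sum in (48) with the bound (49) plugged in can be evaluated directly; the sketch below (illustrative only) compares it with the logarithmic rate \(\alpha \,\mathrm {e}^{-D\beta }\log n\).

```python
import numpy as np

def dp_mrf_kn_lower_bound(alpha, beta, D, n):
    """Lower bound on E[K_n] for the DP-MRF obtained by plugging the bound (49)
    into (48): sum_{i=0}^{n-1} alpha / (alpha + i * exp(D * beta))."""
    i = np.arange(n)
    return np.sum(alpha / (alpha + i * np.exp(D * beta)))

# Example with alpha = 1, beta = 0.5 and D = 4 (e.g. a 4-neighbour grid):
# the bound grows logarithmically in n, at the rate alpha * exp(-D * beta).
for n in (10**3, 10**4, 10**5):
    print(n, dp_mrf_kn_lower_bound(1.0, 0.5, 4, n),
          1.0 * np.exp(-4 * 0.5) * np.log(n))
```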

A.3 VB-E-\({\varvec{\tau }} \) step

$$\begin{aligned}&q_{\tau _k}(\tau _k)\nonumber \\&\quad \propto \exp {\biggl (\displaystyle {\mathbb {E}}_{q_{\alpha ,\sigma }}[\log p(\tau _k \mid \alpha , \sigma )] + \sum _{j=1}^n {\mathbb {E}}_{q_{z_j} q_{\tau \backslash \{k\}}}[\log \pi _{z_j}({\varvec{\tau }})] \biggr )} \nonumber \\&\quad \propto \exp { \biggl ( \displaystyle - {\mathbb {E}}_{q_{\alpha ,\sigma }}[\sigma ] \log \tau _k + ({\mathbb {E}}_{q_{\alpha ,\sigma }}[\alpha ] + k {\mathbb {E}}_{q_{\alpha ,\sigma }}[\sigma ]-1) \log (1-\tau _k)} \nonumber \\&\qquad + \sum _{j=1}^n \sum _{l=k+1}^K q_{z_j}(l) \log (1-\tau _k) + \sum _{j=1}^n q_{z_j}(k) \log (\tau _k) \biggr ) . \end{aligned}$$
(50)

Considering the terms involving \(\tau _k\), we recognize the beta distribution \(\mathcal{B}(\tau _k; {\hat{\gamma }}_{k,1}, {\hat{\gamma }}_{k,2})\) specified in (33). This yields the following expressions,

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{q_{\tau _k}}[\log (\tau _k)] = \psi ({\hat{\gamma }}_{k,1})-\psi ({\hat{\gamma }}_{k,1}+{\hat{\gamma }}_{k,2}) \\&\mathbb {E}_{q_{\tau _k}}[\log (1-\tau _k)] = \psi ({\hat{\gamma }}_{k,2})-\psi ({\hat{\gamma }}_{k,1}+{\hat{\gamma }}_{k,2}) \end{aligned} \end{aligned}$$
(51)

where \(\psi (\cdot )\) is the digamma function defined in Sect. 5.2.
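For illustration, a minimal sketch of this step follows, assuming that the parameters \({\hat{\gamma }}_{k,1}, {\hat{\gamma }}_{k,2}\) of (33) are those that can be read off from (50), namely \({\hat{\gamma }}_{k,1} = 1 - \mathbb {E}[\sigma ] + \sum _j q_{z_j}(k)\) and \({\hat{\gamma }}_{k,2} = \mathbb {E}[\alpha ] + k\,\mathbb {E}[\sigma ] + \sum _j\sum _{l>k} q_{z_j}(l)\); the expectations (51) then follow from the digamma function.

```python
import numpy as np
from scipy.special import digamma

def vbe_tau(q_z, e_alpha, e_sigma):
    """VB-E-tau step, with beta parameters read off from Eq. (50) (assumed to
    match Eq. (33)) and the expectations of Eq. (51).
    q_z: (n, K) array of variational probabilities. Illustrative sketch only;
    under the truncation only k < K is actually needed."""
    n, K = q_z.shape
    Nk = q_z.sum(axis=0)                          # sum_j q_{z_j}(k)
    N_gt = Nk[::-1].cumsum()[::-1] - Nk           # sum_j sum_{l > k} q_{z_j}(l)
    k = np.arange(1, K + 1)
    g1 = 1.0 - e_sigma + Nk
    g2 = e_alpha + k * e_sigma + N_gt
    e_log_tau = digamma(g1) - digamma(g1 + g2)    # Eq. (51)
    e_log_1mtau = digamma(g2) - digamma(g1 + g2)
    return g1, g2, e_log_tau, e_log_1mtau
```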

A.4 VB-E-\((\alpha , \sigma )\) step

In the PY case, \(q_{\alpha ,\sigma }(\alpha ,\sigma )\) is proportional to

$$\begin{aligned}&{\tilde{q}}_{\alpha ,\sigma }(\alpha ,\sigma )\\&\quad \propto p(\alpha , \sigma ; s_1, s_2, a) \; \exp { \left( \displaystyle \sum _{k=1}^{K-1} {\mathbb {E}}_{q_{\tau _k}}[\log p(\tau _k \mid \alpha , \sigma )]\right) } \\&\quad \propto p(\alpha , \sigma ; s_1, s_2, a)\; \prod _{k=1}^{K-1} B(1-\sigma , \alpha +k\sigma )^{-1}\; \times \\&\quad \exp \biggl ( \displaystyle -\sigma \left( \sum _{k=1}^{K-1} {\mathbb {E}}_{q_{\tau _k}}[\log \tau _k]- \sum _{k=1}^{K-1} k \; {\mathbb {E}}_{q_{\tau _k}}[\log (1-\tau _k)] \right) \\&\qquad + (\alpha -1) \sum _{k=1}^{K-1} {\mathbb {E}}_{q_{\tau _k}}[\log (1-\tau _k)]\biggr ) , \end{aligned}$$

where \(B(a,b):=\frac{\varGamma (a)\varGamma (b)}{\varGamma (a+b)}\) denotes the beta function. The product of beta functions simplifies into

$$\begin{aligned} \begin{aligned}&\prod _{k=1}^{K-1} B(1-\sigma , \alpha +k\sigma )^{-1}\\&\quad = \dfrac{\varGamma (\alpha )}{ \varGamma (1-\sigma )^{K-1} \; \varGamma (\alpha +(K-1)\sigma )} \\&\quad \quad \times \prod _{k=1}^{K-1} (\alpha +(k-1)\sigma ). \end{aligned} \end{aligned}$$
(52)

The difficulty is that, except in the DP-MRF case, the normalizing constant for \({\tilde{q}}_{\alpha ,\sigma }(\alpha ,\sigma )\) is not tractable. Nevertheless, to carry out the VBEM algorithm, we do not need the full \(q_{\alpha , \sigma }\) distribution but only the means \({\mathbb {E}}_{q_{\sigma }}[\sigma ]\) and \({\mathbb {E}}_{q_{\alpha }}[\alpha ]\). One solution is therefore to use importance sampling or MCMC to compute these expectations via Monte Carlo sums. Using the prior on \((\alpha ,\sigma )\) given in (31), we obtain

$$\begin{aligned} \begin{aligned} {\tilde{q}}_{\alpha ,\sigma }(\alpha ,\sigma )&= \mathcal{G}({\alpha +\sigma } ; {\hat{s}}_1, {\hat{s}}_2) \; \mathrm {e}^{-\sigma \xi } \\&p(\sigma ; a) \; \dfrac{\varGamma (\alpha )}{ \varGamma (1-\sigma )^{K-1} \; \varGamma (\alpha +(K-1)\sigma )}\\&\quad \prod _{k=1}^{K-1} \left( {\frac{\alpha +(k-1)\sigma }{\alpha + \sigma }}\right) \; \end{aligned} \end{aligned}$$
(53)

where \(\mathcal{G}(\alpha +\sigma ; {\hat{s}}_1, {\hat{s}}_2)\) is the pdf of a \(\sigma \)-shifted gamma distribution with \({\hat{s}}_1, {\hat{s}}_2\) given in (36). The parameter \(\xi \) is defined as

$$\begin{aligned} \begin{aligned} \xi = \sum _{k=1}^{K-1} {\mathbb {E}}_{q_{\tau _k}}[\log \tau _k] - \sum _{k=1}^{K-1} {(k-1)} \; {\mathbb {E}}_{q_{\tau _k}}[\log (1-\tau _k)], \end{aligned} \end{aligned}$$
(54)

which can be computed using (51). We then propose to use the importance distribution \(\nu (\alpha , \sigma ) = {{\mathcal {G}}}(\alpha +\sigma ; {\hat{s}}_1, {\hat{s}}_2) \; p(\sigma ; a) \), with \(p(\sigma ; a) \) the uniform distribution on [0,1], \({{\mathcal {U}}}_{[0,1]}(\sigma )\). The importance weights are then given by

$$\begin{aligned} \begin{aligned} W(\alpha ,\sigma )&= \dfrac{ \tilde{q}_{\alpha ,\sigma }(\alpha ,\sigma ) }{\nu (\alpha , \sigma )} \\&= \mathrm {e}^{-\sigma \xi } \; \dfrac{\varGamma (\alpha )}{ \varGamma (1-\sigma )^{K-1} \; \varGamma (\alpha +(K-1)\sigma )}\\&\quad {\prod _{k=1}^{K-1} \left( \frac{\alpha +(k-1)\sigma }{\alpha +\sigma }\right) }. \end{aligned} \end{aligned}$$
(55)

The importance sampling scheme then consists of:

  • For \(i=1\) to M, first simulate \(\sigma _i\) independently from \({{\mathcal {U}}}_{[0,1]}(\sigma )\) and then simulate \(\alpha _i\) conditionally using the \(\sigma _i\)-shifted gamma \({\mathcal SG}(\sigma _i, {\hat{s}}_1, {\hat{s}}_2) \). This latter simulation is easily obtained by simulating a standard \({{\mathcal {G}}}({\hat{s}}_1, {\hat{s}}_2) \) variable and then subtracting \(\sigma _i\) from the result.

  • Compute the importance weights \(w_i = W(\alpha _i, \sigma _i)\).

  • Approximate the means \({\mathbb {E}}_{q_{\sigma }}[\sigma ] \approx \dfrac{\sum _{i=1}^M w_i \sigma _i}{\sum _{i=1}^M w_i}\) and \({\mathbb {E}}_{q_{\alpha }}[\alpha ] \approx \dfrac{\sum _{i=1}^M w_i \alpha _i}{\sum _{i=1}^M w_i}\).

Note that this complication is due to the PY process. In the DP-MRF case, the E-\(\alpha \) step is considerably simpler, as it reduces to computing the approximate posterior expectation of \(\alpha \), namely \(\mathbb {E}_{q_\alpha }[\alpha ] = {\hat{s}}_1/{\hat{s}}_2\).
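The following sketch implements the importance sampling scheme above for illustration. It assumes \({\hat{s}}_2\) is the rate parameter of the gamma distribution, works with log-weights for numerical stability, and gives zero weight to draws with \(\alpha _i \le 0\) (possible since \(\alpha _i\) is a gamma draw shifted by \(-\sigma _i\)); it is a minimal sketch, not the implementation used in the paper.

```python
import numpy as np
from scipy.special import gammaln

def is_estimate_alpha_sigma(xi, s1_hat, s2_hat, K, M=10000, rng=None):
    """Importance-sampling approximation of E[sigma] and E[alpha] under
    q_{alpha,sigma}, following Eqs. (53)-(55)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.uniform(0.0, 1.0, size=M)                     # sigma_i ~ U[0,1]
    alpha = rng.gamma(s1_hat, 1.0 / s2_hat, size=M) - sigma   # sigma_i-shifted gamma
    k = np.arange(1, K)                                       # k = 1, ..., K-1
    valid = alpha > 0                                         # zero weight otherwise
    if not np.any(valid):
        raise ValueError("no valid draws; increase M or check s1_hat, s2_hat")
    a, s = alpha[valid], sigma[valid]
    logw = np.full(M, -np.inf)
    logw[valid] = (-s * xi                                    # log of Eq. (55)
                   + gammaln(a)
                   - (K - 1) * gammaln(1.0 - s)
                   - gammaln(a + (K - 1) * s)
                   + np.log(a[:, None] + (k - 1) * s[:, None]).sum(axis=1)
                   - (K - 1) * np.log(a + s))
    w = np.exp(logw - logw[valid].max())                      # stabilized weights
    w /= w.sum()
    return np.sum(w * sigma), np.sum(w * alpha)               # E[sigma], E[alpha]
```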

A.5 VB-E-\({\mathbf {Z}} \) step

The VB-E-\({\mathbf {Z}} \) step is divided into n steps. Since we assume \(q_{z_j}(z_j)=0\) for \(z_j > K\), it is only necessary to compute the distributions for \(z_j \le K\), namely

$$\begin{aligned} \begin{aligned}&q_{z_j}(z_j) \propto \exp \biggl (\displaystyle \mathbb {E}_{q_{\theta ^*_{z_j}}} [\log p(y_j \mid \theta ^*_{z_j})] + \mathbb {E}_{q_{\varvec{\tau }}}[\log \pi _{z_j}(\varvec{\tau })]\\&\qquad \qquad \qquad + \beta \sum _{i\in {\mathcal {N}}_{j}} q_{z_i}(z_j)\biggr ), \end{aligned} \end{aligned}$$
(56)

where for \(z_j=k\),

$$\begin{aligned} \mathbb {E}_{q_{\varvec{\tau }}}[\log \pi _{k}(\varvec{\tau })] = \mathbb {E}_{q_{\tau _k}}[\log \tau _k] + \sum _{l=1}^{k-1} \mathbb {E}_{q_{\tau _l}}[\log (1-\tau _l)]. \end{aligned}$$
(57)

The term \(\mathbb {E}_{q_{\theta ^*_{z_j}}}[\log p(y_j \vert \theta ^*_{z_j})]\) is computed using the fact that \(q_{\theta ^*_{k}}\) is a normal-inverse-Wishart distribution as described in Eq. (40) of the next VB-E-\(\varvec{\theta ^*}\) step, namely

$$\begin{aligned} \begin{aligned} q_{\theta ^*_k}(\mu _k, \varSigma _k )&= \mathcal{NIW}(\mu _k, \varSigma _k ; {\hat{m}}_k, {\hat{\lambda }}_k, {\hat{\varPsi }}_k, {\hat{\nu }}_k) \\&= {\mathcal {N}}\left( \mu _k ; {\hat{m}}_k, \dfrac{\varSigma _k}{{\hat{\lambda }}_k}\right) {\mathcal {IW}}(\varSigma _k ; {\hat{\varPsi }}_k, {\hat{\nu }}_k). \end{aligned} \end{aligned}$$
(58)

It follows that (where d is the dimension of \(y_j\))

$$\begin{aligned} \begin{aligned} \mathbb {E}_{q_{\theta ^*_{k}}}{[\log p(y_j \mid \theta ^*_{k})]}&= - \dfrac{d}{2} \log 2\pi - \dfrac{1}{2} \mathbb {E}_{q_{\theta ^*_{k}}}{[\log | \varSigma _k |]} \\&- \dfrac{1}{2} \mathbb {E}_{q_{\theta ^*_{k}}}{[(y_j- \mu _k)^T \varSigma ^{-1}_k (y_j- \mu _k)]}. \end{aligned} \end{aligned}$$
(59)

Using the decomposition (cf. Eq. (58)) and the fact that \(\varSigma _k^{-1}\) follows a Wishart distribution, we obtain

$$\begin{aligned} \begin{aligned} \mathbb {E}_{q_{\theta ^*_{k}}}[\log | \varSigma _k |]&= \mathbb {E}_{q_{\varSigma _{k}}}[\log | \varSigma _k |] = - \mathbb {E}_{q_{\varSigma _{k}}}[\log | \varSigma ^{-1}_k |] \\&= - \sum _{i=1}^d \psi \left( \dfrac{{\hat{\nu }}_k + (1-i)}{2}\right) + \log \left| \dfrac{{\hat{\varPsi }}_k}{2}\right| , \end{aligned} \end{aligned}$$
(60)

Then, one has

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{q_{\theta ^*_{k}}} [(y_j-\mu _k)^T \varSigma ^{-1}_k (y_j- \mu _k)] \\&\quad = \mathbb {E}_{q_{\varSigma _{k}}} [(y_j- {\hat{m}}_k)^T \varSigma ^{-1}_k (y_j- {\hat{m}}_k) + {\mathrm {Tr}}(\varSigma _k^{-1} \varSigma _k/{\hat{\lambda }}_k)] \\&\quad = {\hat{\nu }}_k (y_j- {\hat{m}}_k)^T {\hat{\varPsi }}_k^{-1} (y_j- {\hat{m}}_k) + \dfrac{d}{{\hat{\lambda }}_k} . \end{aligned} \end{aligned}$$
(61)

Plugging all of the above expressions back into Eq. (56) yields, for \(z_j = k\) and \(k=1,\ldots ,K\), \(q_{z_j}(k) \propto \tilde{q}_j(k)\) with \( \tilde{q}_j(k)\) given in (38).
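As an illustration, the expected log-likelihood (59), with (60) and (61), can be computed as follows; this is a minimal sketch for a single observation \(y_j \in \mathbb {R}^d\), taking the NIW parameters \(({\hat{m}}_k, {\hat{\lambda }}_k, {\hat{\varPsi }}_k, {\hat{\nu }}_k)\) of (58) as inputs.

```python
import numpy as np
from scipy.special import digamma

def expected_gauss_loglik(y, m_hat, lam_hat, Psi_hat, nu_hat):
    """E_{q_{theta*_k}}[log p(y | theta*_k)] under the normal-inverse-Wishart
    posterior of Eq. (58), following Eqs. (59)-(61). y: vector of length d."""
    d = y.shape[0]
    i = np.arange(1, d + 1)
    # Eq. (60): expected log-determinant of Sigma_k
    e_logdet = (-np.sum(digamma((nu_hat + 1 - i) / 2.0))
                + np.linalg.slogdet(Psi_hat / 2.0)[1])
    # Eq. (61): expected Mahalanobis term
    diff = y - m_hat
    e_quad = nu_hat * diff @ np.linalg.solve(Psi_hat, diff) + d / lam_hat
    # Eq. (59)
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * e_logdet - 0.5 * e_quad
```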

A.6 VB-M-\(\beta \) step

This step does not admit a closed-form expression but can be solved numerically. The maximization in \(\beta \) admits a unique solution. Indeed, it is equivalent to solve the following equation:

$$\begin{aligned} \nabla _{\beta } \, \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\log p({\mathbf {z}} \mid \varvec{\tau }; \beta )] = \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\nabla _{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )] - \mathbb {E}_{q_{\varvec{\tau }}}[\nabla _{\beta } \log C(\varvec{\tau }, \beta )] = 0, \end{aligned}$$
(62)

where C denotes the normalizing constant, which depends on \(\varvec{\tau }\) and \(\beta \). Denoting the gradient vector and Hessian matrix respectively by \(\nabla _{\beta }\) and \(\nabla ^2_{\beta }\), we obtain

$$\begin{aligned} \begin{aligned} \nabla _{\beta } \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\log p({\mathbf {z}} \mid \varvec{\tau } ;\beta )]&= \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}}[\nabla _{\beta }V({\mathbf {z}}\mid \varvec{\tau }; \beta )]\\&\quad -\mathbb {E}_{p({\mathbf {z}} \mid \varvec{\tau }; \beta ) q_{\varvec{\tau }}}[\nabla _{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )],\\ \nabla ^2_{\beta } \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\log p({\mathbf {z}} \mid \varvec{\tau }; \beta )]&= \mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\nabla ^2_{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )] \\&\quad - \mathbb {E}_{p({\mathbf {z}} \mid \varvec{\tau }; \beta ) q_{\varvec{\tau }}}[\nabla ^2_{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )] \\&\quad - \mathbb {E}_{q_{\varvec{\tau }}} [{\mathrm {Var}}_{p({\mathbf {z}} \mid \varvec{\tau }; \beta )}[\nabla _{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )]]. \end{aligned} \end{aligned}$$
(63)

It follows that whenever \(V({\mathbf {z}}\mid \varvec{\tau }; \beta )\) is linear in \(\beta \), \(\nabla ^2_{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )\) vanishes, so the Hessian matrix is negative semidefinite and the function to optimize is concave.

Unfortunately, due to the intractable normalizing constant C, it is necessary to evaluate the terms involving \(p({\mathbf {z}} \mid \varvec{\tau }; \beta )\) in an approximate manner. A natural approach is to use a mean-field-like approximation that consists of replacing all neighbors in the interaction term by non-random values. Thus, the Potts model is approximated by

$$\begin{aligned} p({\mathbf {z}} \mid \varvec{\tau }; \beta ) \simeq \prod \limits _{j=1}^n p^{\mathrm {MF}}_{z_j}(z_j \mid \varvec{\tau }; \beta ), \end{aligned}$$
(64)

with \(p^{\mathrm {MF}}_{z_j}(z_j \mid \varvec{\tau }; \beta )\) defined as

$$\begin{aligned} p^{\mathrm {MF}}_{z_j}(z_j=k \mid \varvec{\tau }; \beta ) \propto \exp {\biggl ( \displaystyle \log \pi _k(\varvec{\tau }) + \beta \sum _{i\in {\mathcal {N}}_{j}} q_{z_i}(k)\biggr )}. \end{aligned}$$
(65)

This approximation, induced by the variational posterior approximation, was proposed in Celeux et al. (2003) and Forbes and Peyrard (2003), and also exploited in Chaari et al. (2013). It thus follows that

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{q_{{\mathbf {z}}}q_{\varvec{\tau }}} [\nabla _{\beta }V({\mathbf {z}}\mid \varvec{\tau }; \beta )] = \sum _{k=1}^K \sum _{i\in {\mathcal {N}}_{j}} q_{z_j}(k) q_{z_i}(k) \\&\mathbb {E}_{p({\mathbf {z}} \mid \varvec{\tau }; \beta ) q_{\varvec{\tau }}} [\nabla _{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )] \\&\quad \simeq \mathbb {E}_{q_{\varvec{\tau }}} \left[ \sum _{k=1}^K \sum _{i\in {\mathcal {N}}_{j}} p^{\mathrm {MF}}_{z_j}(k \mid \varvec{\tau }; \beta ) p^{\mathrm {MF}}_{z_i}(k \mid \varvec{\tau }; \beta )\right] . \end{aligned} \end{aligned}$$
(66)

An additional difficulty is that the expectation with respect to each \(q_{\tau _k}\) has to be computed. This can be done by simulation but, to avoid the Monte Carlo sum, we can replace (65) by another approximation in which the random \(\varvec{\tau }\) is replaced by the set of fixed values \(\varvec{{\widetilde{\tau }}}\) given by

$$\begin{aligned} {\widetilde{\tau }}_k = \mathbb {E}_{q_{\tau _k}}[\tau _k] = \dfrac{{\hat{\gamma }}_{k,1}}{{\hat{\gamma }}_{k,1}+{\hat{\gamma }}_{k,2}}. \end{aligned}$$
(67)

Thus, Eq. (65) turns into

$$\begin{aligned} \begin{aligned} p^{\mathrm {MF}}_{z_j}(z_j=k; \beta )&\propto \exp {\biggl ( \displaystyle \log \pi _k(\varvec{{\widetilde{\tau }}}) + \beta \sum _{i \in {\mathcal {N}}_{j}} q_{z_i}(k)\biggr )}, \end{aligned} \end{aligned}$$
(68)

where \( \log \pi _k(\varvec{{\widetilde{\tau }}}) = \log {\widetilde{\tau }}_k + \sum _{l=1}^{k-1} \log (1-{\widetilde{\tau }}_l). \) Similarly, it follows that

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{p({\mathbf {z}} \mid \varvec{\tau }; \beta ) q_{\varvec{\tau }}} [\nabla _{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )]\\&\quad \simeq \sum _{k=1}^K \sum _{i\in {\mathcal {N}}_{j}} p^{\mathrm {MF}}_{z_j}(k ; \beta ) p^{\mathrm {MF}}_{z_i}(k ; \beta ). \end{aligned} \end{aligned}$$
(69)

By setting these two quantities equal to each other and solving the resulting equation for \(\beta \) over an interval, say [0, 10], one obtains the updated value \({\hat{\beta }}\).
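A minimal sketch of this VB-M-\(\beta \) update on a regular 2D grid with a 4-neighbourhood is given below for illustration. It assumes the variational probabilities \(q_{z_j}(k)\) are stored in an array of shape (H, W, K) and that the values \(\log \pi _k(\varvec{{\widetilde{\tau }}})\) of (67)-(68) have been precomputed; the root of the gradient equation is searched for on [0, 10] with Brent's method. Each edge is counted twice in both sums, so the factor cancels. This is a sketch under these assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import brentq

def vbm_beta(q_z, log_pi_tilde, beta_max=10.0):
    """VB-M-beta step via the mean-field surrogate (64)-(69) on a 2D grid.
    q_z: (H, W, K) variational probabilities; log_pi_tilde: (K,) values of
    log pi_k(tilde tau)."""

    def neighbour_sum(p):
        # sum of p over the 4 neighbours of each site (zero padding at borders)
        s = np.zeros_like(p)
        s[1:] += p[:-1]
        s[:-1] += p[1:]
        s[:, 1:] += p[:, :-1]
        s[:, :-1] += p[:, 1:]
        return s

    nb = neighbour_sum(q_z)
    lhs = np.sum(q_z * nb)                                # first line of Eq. (66)

    def rhs(beta):
        logits = log_pi_tilde[None, None, :] + beta * nb  # Eq. (68)
        logits -= logits.max(axis=-1, keepdims=True)
        p_mf = np.exp(logits)
        p_mf /= p_mf.sum(axis=-1, keepdims=True)
        return np.sum(p_mf * neighbour_sum(p_mf))         # Eq. (69)

    def g(beta):
        return lhs - rhs(beta)

    if g(0.0) * g(beta_max) > 0:                          # no sign change on [0, beta_max]
        return 0.0 if abs(g(0.0)) <= abs(g(beta_max)) else beta_max
    return brentq(g, 0.0, beta_max)
```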

A.7 VB-M-\((s_1,s_2,a)\) step

For the sake of simplicity, we use a uniform prior for \(\sigma \), so that the parameter a does not have to be taken into account. With the previous choice of priors on \(\alpha \) and \(\sigma \) for PY-MRF mixture models, this VB-M-step becomes

$$\begin{aligned} (s_1, s_2)^{(r)} = \mathop {\mathrm {arg\,max}}_{(s_1, s_2)} {\mathbb {E}}_{q_{\alpha ,\sigma }^{(r)}}[\log p(\alpha \mid \sigma ; s_1, s_2)] \end{aligned}$$
(70)

The issue now is that the precise form of \(q_{\alpha ,\sigma }^{(r)}\) is not known explicitly. Importance sampling can again be used to carry out the optimization.

More specifically, these steps do not admit an explicit closed-form expression but can be solved numerically using gradient ascent schemes. Indeed, for \(s_1, s_2\), it is equivalent to solve

$$\begin{aligned} \begin{aligned} \log s_2 - \psi (s_1) + {\mathbb {E}}_{q_{\alpha ,\sigma }^{(r)}}[\log (\alpha +\sigma )]&= 0, \\ \frac{s_1}{s_2} - {\mathbb {E}}_{q_{\alpha ,\sigma }^{(r)}}[\alpha +\sigma ]&= 0. \end{aligned} \end{aligned}$$
(71)

As before, when \(\sigma =0\), this step simplifies into \(s_1^{(r)} = {\hat{s}}_1^{(r)}\) and \(s_2^{(r)} = {\hat{s}}_2^{(r)}\), namely to the DP-MRF case.
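Assuming the stationarity conditions written in (71) above (a moment-matching reconstruction from the shifted-gamma prior of (31)), they can be solved numerically from the importance samples of Sect. A.4; since \(\alpha _i+\sigma _i\) is exactly the underlying gamma draw, it is always positive. The sketch below is illustrative only.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_s1_s2(alpha, sigma, w):
    """VB-M-(s1, s2) step, matching E[alpha + sigma] and E[log(alpha + sigma)]
    under q_{alpha, sigma}, approximated with importance samples
    (alpha_i, sigma_i) and weights w_i from Sect. A.4."""
    w = np.asarray(w, dtype=float) / np.sum(w)
    x = np.asarray(alpha) + np.asarray(sigma)     # alpha_i + sigma_i > 0
    m, logm = np.sum(w * x), np.sum(w * np.log(x))
    # First condition with s2 = s1 / m: log(s1) - digamma(s1) = log(m) - E[log x].
    target = np.log(m) - logm                     # >= 0 by Jensen's inequality
    s1 = brentq(lambda s: np.log(s) - digamma(s) - target, 1e-6, 1e6)
    s2 = s1 / m                                   # second condition
    return s1, s2
```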

A.8 Initialization of the VBEM algorithm

An important question, rarely addressed in detail, is how to initialize the VBEM algorithm. In contrast to standard EM, which can equivalently be started with an initial E-step or an initial M-step, VBEM requires several steps to be initialized, depending on the complexity of the model.

In the present work, we propose the following procedure for initializing the VBEM algorithm. First, we set values for \(s_1\) and \(s_2\), which in our experiments are taken to be \(s_1=1\) and \(s_2=200/K\). From this, we can initialize the VB-E-\((\alpha ,\sigma )\) step by setting \(\mathbb {E}[\alpha ]=s_1/s_2\) and \(\mathbb {E}[\sigma ]=0\). Values then need to be set for the other hyperparameters. The interaction parameter \(\beta \) can be initialized to \(\beta =0\), assuming no initial spatial interaction, and for the \(\rho _k\)'s defining the normal-inverse-Wishart priors, we suggest using, for all k, \(m_k={\mathbf {0}} \), \(\varPsi _k = {\mathbf {1}} \cdot 10^3\), \(\nu _k =d\) and \(\lambda _k=1\), in order to start with a large variance for the \(\mu _k\)'s.
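For concreteness, these initial values can be gathered as follows; this is a sketch reflecting one reading of the settings above, in particular \(\varPsi _k = {\mathbf {1}} \cdot 10^3\) is taken as \(10^3\) times the identity matrix.

```python
import numpy as np

def init_hyperparameters(K, d):
    """Initial hyperparameter values for the VBEM algorithm, one reading of the
    settings described above. Illustrative sketch only."""
    s1, s2 = 1.0, 200.0 / K
    return {
        "s1": s1, "s2": s2,
        "E_alpha": s1 / s2, "E_sigma": 0.0,    # VB-E-(alpha, sigma) initialization
        "beta": 0.0,                           # no spatial interaction initially
        "niw": [{"m": np.zeros(d), "Psi": 1e3 * np.eye(d),
                 "nu": d, "lambda": 1.0} for _ in range(K)],
    }
```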

In addition, we need to initialize the cluster assignments, which corresponds to the VB-E-\(\mathbf {Z} \) step. Several approaches are possible depending on the available information and the model. A common choice in image segmentation is to initialize the \(q_Z(z_j)\)'s randomly or from an initial segmentation, obtained either from preliminary information or from a simpler non-spatial clustering procedure, e.g., k-means. In our experiments, an initialization into K clusters obtained with k-means++ (Arthur and Vassilvitskii 2007) was used. k-means++ is identical to the k-means algorithm except for the selection of the initial centers. More concretely, k-means++ allocates one cluster center at random and then selects each subsequent center with a probability proportional to the squared distance to the closest center already chosen. The essential idea is to spread the initial centers as far apart as possible. Since k-means++ still relies on a random starting point, it can give different results on different runs. To mitigate the effect of a poor initialization, we run k-means++ several times and use the labels that yield the best compactness (the smallest sum of squared distances from each point to its assigned center) to initialize the \(q_Z(z_j)\)'s and thus update the \(\rho _k\)'s. From the initialization of \(q_Z(\mathbf {z})\) and \({\varvec{\rho }} \), the VB-E-\({\varvec{\tau }} \) and VB-E-\({\varvec{\theta }} ^*\) steps can then be derived. This simple scheme has the advantage of accelerating the convergence of the VBEM algorithm.
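A minimal sketch of this initialization using scikit-learn's k-means++ is given below; the `n_init` option already keeps the most compact of several runs via the inertia criterion, and the resulting hard labels are converted into one-hot values for the \(q_Z(z_j)\)'s. This is illustrative, not the code used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_qz(Y, K, n_restarts=10, seed=0):
    """Initialize the q_{z_j}'s with k-means++: run it n_restarts times and keep
    the most compact labelling (smallest within-cluster sum of squares).
    Y: (n, d) array of observations. Returns a (n, K) one-hot array."""
    km = KMeans(n_clusters=K, init="k-means++", n_init=n_restarts,
                random_state=seed).fit(Y)         # best of n_restarts runs kept
    q_z = np.zeros((Y.shape[0], K))
    q_z[np.arange(Y.shape[0]), km.labels_] = 1.0  # hard one-hot initialization
    return q_z
```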


Cite this article

Lü, H., Arbel, J. & Forbes, F. Bayesian nonparametric priors for hidden Markov random fields. Stat Comput 30, 1015–1035 (2020). https://doi.org/10.1007/s11222-020-09935-9
