Abstract
One of the central issues in statistics and machine learning is how to select an adequate model that can automatically adapt its complexity to the observed data. In the present paper, we focus on the issue of determining the structure of clustered data, both in terms of finding the appropriate number of clusters and of modeling the right dependence structure between the observations. Bayesian nonparametric (BNP) models, which do not impose an upper limit on the number of clusters, are appropriate to avoid having to guess the number of clusters but have mainly been developed for independent data. In contrast, Markov random fields (MRF) have been extensively used to model dependencies in a tractable manner but are usually restricted to finite cluster numbers when clustering tasks are addressed. Our main contribution is to propose a general scheme for designing tractable BNP–MRF priors that combine both features: no commitment to an arbitrary number of clusters and dependence modeling. A key ingredient in this construction is the availability of a stick-breaking representation, which has the threefold advantage of allowing us to extend standard discrete MRFs to an infinite state space, to design a tractable estimation algorithm using variational approximation, and to derive theoretical properties of the predictive distribution and the number of clusters of the proposed model. This approach is illustrated on a challenging natural image segmentation task, for which it shows good performance with respect to the literature.
Notes
Note that the notation introduced for the different \(\theta _j^\star \) differs from that used for the stick-breaking variables, \(\theta _j^*\).
References
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012)
Albughdadi, M., Chaâri, L., Tourneret, J., Forbes, F., Ciuciu, P.: A Bayesian non-parametric hidden Markov random model for hemodynamic brain parcellation. Signal Process. 135, 132–146 (2017)
Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 898–916 (2011)
Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2007). http://dl.acm.org/citation.cfm?id=1283383.1283494
Beal, M., Ghahramani, Z.: The variational Bayesian EM Algorithm for incomplete data: with application to scoring graphical model structures. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M. (eds.) Bayesian Statistics, pp. 453–464. Oxford University Press, Oxford (2003)
Beal, M.J.: Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London (2003)
Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B (Methodol.) 36, 192–236 (1974)
Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Anal. 1(1), 121–143 (2006)
Celeux, G., Forbes, F., Peyrard, N.: EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognit. 36, 131–144 (2003)
Chaari, L., Vincent, T., Forbes, F., Dojat, M., Ciuciu, P.: Fast joint detection-estimation of evoked brain activity in event-related fMRI using a variational approach. IEEE Trans. Med. Imag. 32(5), 821–837 (2013)
Chandler, D.: Introduction to Modern Statistical Mechanics. Oxford University Press, New York (1987)
Chatzis, S.P.: A Markov random field-regulated Pitman–Yor process prior for spatially constrained data clustering. Pattern Recognit. 46(6), 1595–1603 (2013)
Chatzis, S.P., Tsechpenakis, G.: The infinite hidden Markov random field model. IEEE Trans. Neural Netw. 21(6), 1004–1014 (2010)
Corduneanu, A., Bishop, C.M.: Variational Bayesian model selection for mixture distributions. In: Proceedings Eighth International Conference on Artificial Intelligence and Statistics, pp. 27–34. Morgan Kaufmann (2001)
da Silva, A.R.F.: A Dirichlet process mixture model for brain MRI tissue classification. Med. Image Anal. 11(2), 169–182 (2007)
De Blasi, P., Favaro, S., Lijoi, A., Mena, R.H., Prünster, I., Ruggiero, M.: Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Trans. Pattern Anal. Mach. Intell. 37(2), 212–229 (2015)
Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I., Teh, Y.W.: On the stick-breaking representation for homogeneous NRMIs. Bayesian Anal. 11(3), 697–724 (2016)
Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973)
Forbes, F., Peyrard, N.: Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1089–1101 (2003)
Ghosal, S., Van der Vaart, A.: Fundamentals of Nonparametric Bayesian Inference, vol. 44. Cambridge University Press, Cambridge (2017)
Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995)
Ishwaran, H., James, L.F.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96(453), 161–173 (2001)
Johnson, T.D., Liu, Z., Bartsch, A.J., Nichols, T.E.: A Bayesian non-parametric Potts model with application to pre-surgical fMRI data. Stat. Methods Med. Res. 22(4), 364–381 (2013)
McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley, New York (1996)
Miller, J.W., Harrison, M.T.: Mixture models with a prior on the number of components. J. Am. Stat. Assoc. 113, 340–356 (2018)
Murphy, K.P.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)
Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368 (1998)
Orbanz, P., Buhmann, J.M.: Nonparametric Bayesian image segmentation. Int. J. Comput. Vis. 77(1–3), 25–45 (2008)
Pitman, J.: Combinatorial stochastic processes. Lecture Notes in Mathematics, vol. 1875. Springer, Berlin (2006). Lectures from the 32nd Summer School on Probability Theory held in Saint-Flour, July 7–24 (2002)
Pitman, J., Yor, M.: The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25(2), 855–900 (1997)
Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
Sethuraman, J.: A constructive definition of Dirichlet priors. Stat. Sin. 4(2), 639–650 (1994)
Shyr, A., Darrell, T., Jordan, M.I., Urtasun, R.: Supervised hierarchical Pitman–Yor process for natural scene segmentation. In: Proceedings of CVPR 2011, pp. 2281–2288 (2011)
Sodjo, J., Giremus, A., Dobigeon, N., Giovannelli, J.F.: A generalized Swendsen–Wang algorithm for Bayesian nonparametric joint segmentation of multiple images. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1882–1886. IEEE, La Nouvelle Orléans (2017)
Stoehr, J.: A review on statistical inference methods for discrete Markov random fields (2017). arXiv e-prints arXiv:1704.03331
Sudderth, E.B., Jordan, M.I.: Shared segmentation of natural scenes using dependent Pitman–Yor processes. In: Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8–11, 2008, pp. 1585–1592 (2008)
Teh, Y.W.: A Bayesian interpretation of interpolated Kneser–Ney. Technical report (2006)
Unnikrishnan, R., Pantofaru, C., Hebert, M.: A measure for objective evaluation of image segmentation algorithms. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)—Workshops, pp. 34–34 (2005)
Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. Int. J. Comput. Vis. 62(1), 61–81 (2005)
Wallach, H., Jensen, S., Dicker, L., Heller, K.: An alternative prior process for nonparametric Bayesian clustering. In: Y.W. Teh, M. Titterington (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp. 892–899. PMLR, Chia Laguna Resort, Sardinia (2010)
Wang, C., Blei, D.M.: Truncation-free stochastic variational inference for Bayesian nonparametric models. In: Proceedings of the 25th International Conference on Neural Information Processing Systems—Volume 1, NIPS’12, pp. 413–421 (2012)
Xu, D., Caron, F., Doucet, A.: Bayesian nonparametric image segmentation using a generalized Swendsen–Wang algorithm. ArXiv e-prints (2016)
Acknowledgements
This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the “Investissements d’avenir” program (ANR-15-IDEX-02).
A Appendix
Proofs of Propositions 1 and 2 are given in the first two sections. Details on the VBEM steps, completing the previous developments where necessary, follow.
1.1 A.1 Proof of Proposition 1
Integrating out \({\varvec{\pi }} \) from distribution (12) and noticing that the Markov term does not depend on \({\varvec{\pi }} \) leads to
where \(H(z_{1:n+1})= \sum _{i\sim j}\delta _{z_i=z_j}\) counts the number of homogeneous edges. It follows that
When \(\beta =0\), model (12) reduces to standard Gibbs-type priors, for which the quantities above were given in (23). Using (23) and enumerating how \(z_{n+1}\) can affect the number of homogeneous edges, we obtain:
for \(z_{n+1} \not \in \{z_1, \ldots , z_n\}\),
and for \(\ell \in \{z_1, \ldots , z_{n}\}\),
The result follows by normalizing the quantities above, using (21) and noticing that
\(\square \)
1.2 A.2 Proof of Proposition 2
Consider the DP first. Let \(D_n = \delta _{\theta _n\,\text {is new}\vert \theta _{1:n-1}}\) be the Bernoulli random variable equal to one when \(\theta _n\) is a fresh draw, which for the DP-MRF happens with probability
Then \({K_n}= \sum _{i=1}^n D_i\), so we have for the DP-MRF
By definition of the maximal degree D of the graph, the multiplicity \({\tilde{n}}_\ell \) is at most D for any \(\ell \), hence we have the following upper bound
Plugging (49) into (48) provides the desired inequality
Turning to the Pitman–Yor process, we follow the proof technique of Teh (2006). The prior expectation of the number of clusters satisfies the following recursion
Assuming here that \(\alpha \ge 0\) and using the lower bound (49) yields
By induction, and using \(K_1=1\),
where \(a_n\simeq b_n\) means that \(a_n/b_n\) converges to a positive constant as \(n\rightarrow \infty \). The desired inequality for PY is obtained by denoting this constant by c. \(\square \)
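For orientation, in the classical case \(\beta = 0\) (no MRF interaction) the same recursions recover the well-known growth rates of the number of clusters for Gibbs-type priors (see, e.g., Pitman 2006):

```latex
\mathbb{E}[K_n] \sim \alpha \log n \quad \text{(DP)}, \qquad
\mathbb{E}[K_n] \sim \frac{\Gamma(\alpha+1)}{\sigma\,\Gamma(\alpha+\sigma)}\, n^{\sigma} \quad \text{(PY, } 0<\sigma<1\text{)}.
```

The bounds derived above show how the MRF term, controlled through \(\beta \) and the maximal degree D, perturbs these classical rates.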
1.3 A.3 VB-E-\({\varvec{\tau }} \) step
Considering the terms involving \(\tau _k\), we recognize the beta distribution \(\mathcal{B}(\tau _k; {\hat{\gamma }}_{k,1}, {\hat{\gamma }}_{k,2})\) specified in (33). This yields the following quantities,
where \(\psi (\cdot )\) is the digamma function defined in Sect. 5.2.
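As a minimal numerical sketch, the required beta log-moments \({\mathbb {E}}_{q}[\log \tau _k]\) and \({\mathbb {E}}_{q}[\log (1-\tau _k)]\) only involve the digamma function \(\psi \). The implementation below is an illustration, not the paper's code; the digamma approximation uses the standard recurrence plus asymptotic expansion.

```python
import math

def digamma(x):
    """Digamma psi(x) via the recurrence psi(x) = psi(x+1) - 1/x
    and a standard asymptotic expansion for large arguments."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return (r + math.log(x) - 1.0 / (2 * x)
            - 1.0 / (12 * x**2) + 1.0 / (120 * x**4) - 1.0 / (252 * x**6))

def beta_log_moments(g1, g2):
    """E[log tau] and E[log(1 - tau)] for tau ~ Beta(g1, g2):
    psi(g1) - psi(g1 + g2) and psi(g2) - psi(g1 + g2)."""
    d = digamma(g1 + g2)
    return digamma(g1) - d, digamma(g2) - d
```

For example, with \({\hat{\gamma }}_{k,1}={\hat{\gamma }}_{k,2}=1\) (a uniform \(\tau _k\)), both moments equal \(-1\).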
1.4 A.4 VB-E-\((\alpha , \sigma )\) step
In the PY case, \(q_{\alpha ,\sigma }(\alpha ,\sigma )\) is proportional to
where \(B(a,b):=\frac{\varGamma (a)\varGamma (b)}{\varGamma (a+b)}\) denotes the beta function. The last term simplifies to
The difficulty is that, except in the DP-MRF case, the normalizing constant for \({\tilde{q}}_{\alpha ,\sigma }(\alpha ,\sigma )\) is not tractable. Nevertheless, to carry out the VBEM algorithm, we do not need the full \(q_{\alpha , \sigma }\) distribution but only the means \({\mathbb {E}}_{q_{\sigma }}[\sigma ]\) and \({\mathbb {E}}_{q_{\alpha }}[\alpha ]\). One solution is therefore to use importance sampling or MCMC to compute these expectations via Monte Carlo sums. Using the prior on \((\alpha ,\sigma )\) given in (31), it follows that
where \(\mathcal{G}(\alpha +\sigma ; {\hat{s}}_1, {\hat{s}}_2)\) is the pdf of a \(\sigma \)-shifted gamma distribution with \({\hat{s}}_1, {\hat{s}}_2\) given in (36). The parameter \(\xi \) is defined as
which can be computed using (51). We then propose to use as importance distribution \(\nu (\alpha , \sigma ) = {{\mathcal {G}}}(\alpha +\sigma ; {\hat{s}}_1, {\hat{s}}_2) \; p(\sigma ; a) \) with \(p(\sigma ; a) \) the uniform distribution on [0,1], \({{\mathcal {U}}}_{[0,1]}(\sigma )\). This yields the importance weights,
The importance sampling scheme then consists of:
For \(i=1\) to M, simulate first independently \(\sigma _i\) from \({{\mathcal {U}}}_{[0,1]}(\sigma )\) and then simulate \(\alpha _i\) conditionally using the \(\sigma _i\)-shifted gamma \({\mathcal SG}(\sigma _i, {\hat{s}}_1, {\hat{s}}_2) \). The latter simulation is easily obtained by simulating a standard \({{\mathcal {G}}}({\hat{s}}_1, {\hat{s}}_2) \) draw and subtracting \(\sigma _i\) from the result.
Compute the importance weights \(w_i = W(\alpha _i, \sigma _i)\).
Approximate the means \({\mathbb {E}}_{q_{\sigma }}[\sigma ] \approx \dfrac{\sum _{i=1}^M w_i \sigma _i}{\sum _{i=1}^M w_i}\) and \({\mathbb {E}}_{q_{\alpha }}[\alpha ] \approx \dfrac{\sum _{i=1}^M w_i \alpha _i}{\sum _{i=1}^M w_i}\).
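The scheme above can be sketched as follows. This is an illustration only: the weight function \(W(\alpha , \sigma )\) depends on quantities not reproduced here, so it is passed in as a user-supplied `weight_fn`; the names, `M`, and `seed` are illustrative.

```python
import random

def posterior_means_alpha_sigma(s1_hat, s2_hat, weight_fn, M=100_000, seed=0):
    """Self-normalized importance sampling for E[sigma] and E[alpha].

    Proposal: sigma ~ U[0,1] and alpha + sigma ~ Gamma(s1_hat, rate=s2_hat),
    i.e. alpha is a sigma-shifted gamma draw.  `weight_fn(alpha, sigma)`
    stands in for the importance weight W, whose exact form depends on
    the untractable ratio discussed in the text.
    """
    rng = random.Random(seed)
    num_a = num_s = den = 0.0
    for _ in range(M):
        sigma = rng.random()                         # sigma_i ~ U[0,1]
        g = rng.gammavariate(s1_hat, 1.0 / s2_hat)   # Gamma(s1, rate=s2)
        alpha = g - sigma                            # sigma-shifted gamma draw
        w = weight_fn(alpha, sigma)                  # importance weight w_i
        num_s += w * sigma
        num_a += w * alpha
        den += w
    return num_a / den, num_s / den                  # E[alpha], E[sigma]
```

As a sanity check, with constant weights the estimates reduce to the proposal means, \({\hat{s}}_1/{\hat{s}}_2 - 1/2\) for \(\alpha \) and \(1/2\) for \(\sigma \).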
Note that this complication is due to the PY process. In the DP-MRF case, the E-\(\alpha \) step is considerably simpler, as it reduces to computing the approximate posterior expectation of \(\alpha \), namely \(\mathbb {E}_{q_\alpha }[\alpha ] = {\hat{s}}_1/{\hat{s}}_2\).
1.5 A.5 VB-E-\({\mathbf {Z}} \) step
The VB-E-\({\mathbf {Z}} \) is divided into n steps. Since we assume \(q_{z_j}(z_j)=0\) for \(z_j > K\), it is only necessary to compute the distributions for \(z_j \le K\), namely
where for \(z_j=k\),
The term \(\mathbb {E}_{q_{\theta ^*_{z_j}}}[\log p(y_j \vert \theta ^*_{z_j})]\) is computed using the fact that \(q_{\theta ^*_{k}}\) is a normal-inverse-Wishart distribution as described in Eq. (40) of the next VB-E-\(\varvec{\theta ^*}\) step, namely
It follows that (with d the dimension of \(y_j\))
Using the decomposition (cf. Eq. (58)) and the fact that \(\varSigma _k^{-1}\) follows a Wishart distribution, it follows that
Then, one has
Plugging in all of the above expressions back into Eq. (56) yields, for \(z_j = k\) and \(k=1,\ldots ,K\), \(q_{z_j}(k) \propto \tilde{q}_j(k)\) with \( \tilde{q}_j(k)\) given in (38).
1.6 A.6 VB-M-\(\beta \) step
This step does not admit a closed-form expression but can be solved numerically. The maximization in \(\beta \) admits a unique solution. Indeed, it is equivalent to solving the following equation:
where C denotes the normalizing constant, which depends on \(\varvec{\tau }\) and \(\beta \). Denoting the gradient vector and Hessian matrix by \(\nabla _{\beta }\) and \(\nabla ^2_{\beta }\), respectively, we obtain
It follows that, whenever \(V({\mathbf {z}}\mid \varvec{\tau }; \beta )\) is linear in \(\beta \), \(\nabla ^2_{\beta } V({\mathbf {z}}\mid \varvec{\tau }; \beta )\) vanishes, so the Hessian matrix is negative semidefinite and the function to optimize is concave.
Unfortunately, due to the intractable normalizing constant C, it is necessary to evaluate the terms involving \(p({\mathbf {z}} \mid \varvec{\tau }; \beta )\) in an approximate manner. A natural approach is to use a mean-field-like approximation that consists of replacing all neighbors in the interaction term by non-random values. Thus, the Potts model is approximated by
with \(p^{\mathrm {MF}}_{z_j}(z_j \mid \varvec{\tau }; \beta )\) defined as
This approximation, induced by the posterior variational approximation, was proposed in Celeux et al. (2003) and Forbes and Peyrard (2003) and also exploited in Chaari et al. (2013). It thus follows that
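A minimal sketch of one such mean-field sweep follows, under illustrative assumptions: `q` holds the current variational marginals \(q_j(k)\) (one length-K list per site), `neighbors[j]` lists the sites adjacent to j, and `log_pi` stands for the external-field term \(\log \pi _k(\varvec{\tau })\), taken as given. All names are hypothetical, not the paper's code.

```python
import math

def mean_field_potts_update(q, neighbors, log_pi, beta):
    """One mean-field sweep for a Potts model with external field.

    Each neighbor z_i in the interaction term is replaced by its current
    marginal q_i, so p_MF(z_j = k) is proportional to
    pi_k * exp(beta * sum over neighbors i of q_i(k)).
    """
    K = len(log_pi)
    new_q = []
    for j in range(len(q)):
        logits = [log_pi[k] + beta * sum(q[i][k] for i in neighbors[j])
                  for k in range(K)]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]   # numerically stable softmax
        s = sum(w)
        new_q.append([x / s for x in w])
    return new_q
```

With \(\beta = 0\) the interaction vanishes and each marginal reduces to the softmax of the external field alone, as expected.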
The additional difficulty is that we have to compute the expectation with respect to each \(q_{\tau _k}\). This can be done using simulations. To avoid the Monte Carlo sum, we can replace (65) by another approximation in which the random \(\varvec{\tau }\) is replaced by a set of fixed values \(\varvec{{\widetilde{\tau }}}\) given by
Thus, Eq. (65) turns into
where \( \log \pi _k(\varvec{{\widetilde{\tau }}}) = \log {\widetilde{\tau }}_k + \sum _{l=1}^{k-1} \log (1-{\widetilde{\tau }}_l). \) Similarly, it follows that
By equating these two expressions and solving the resulting equation for \(\beta \) over an interval, say [0, 10], one obtains an updated value \({\hat{\beta }}\).
1.7 A.7 VB-M-\((s_1,s_2,a)\) step
For the sake of simplicity, we use a uniform prior for \(\sigma \), so that the parameter a does not have to be taken into account. With the previous choice of priors on \(\alpha \) and \(\sigma \) for PY-MRF mixture models, this VB-M-step becomes
The issue is now that the precise form of \(q_{\alpha ,\sigma }^{(r)}\) is not known. We can again use importance sampling for the optimization.
More specifically, these steps do not admit an explicit closed-form expression but can be solved numerically using gradient ascent schemes. Indeed, for \(s_1, s_2\), it is equivalent to solving
As before, when \(\sigma =0\), this step simplifies to \(s_1^{(r)} = {\hat{s}}_1^{(r)}\) and \(s_2^{(r)} = {\hat{s}}_2^{(r)}\), namely to the DP-MRF case.
1.8 A.8 Initialization of the VBEM algorithm
An important question that is not often addressed in detail is how to initialize the VBEM algorithm. In contrast to standard EM, which can equivalently be started with an initial E-step or an initial M-step, VBEM requires several steps to be initialized, depending on the complexity of the model.
In the present work, we propose the following procedure for initializing the VBEM algorithm. First, we set values for \(s_1\) and \(s_2\), set in our experiments to \(s_1=1\) and \(s_2=200/K\). From this, we can initialize the VB-E-\((\alpha ,\sigma )\) step by setting \(\mathbb {E}[\alpha ]=s_1/s_2\) and \(\mathbb {E}[\sigma ]=0\). It is then required to set values for the other hyperparameters. The interaction parameter \(\beta \) can be initialized to \(\beta =0\), assuming no initial spatial interaction, and for the \(\rho _k\)’s defining the normal-inverse-Wishart priors, we suggest using for all k, \(m_k={\mathbf {0}} \), \(\varPsi _k = {\mathbf {1}} \cdot 10^3\), \(\nu _k =d\) and \(\lambda _k=1\), in order to start with a large variance for the \(\mu _k\)’s.
In addition, we need to initialize the cluster assignments, which corresponds to the VB-E-\(\mathbf {Z} \) step. Several approaches are possible depending on the available information and model. A common choice in image segmentation is to initialize the \(q_Z(z_j)\)’s randomly or using an initial segmentation coming either from preliminary information or from a simpler non-spatial clustering procedure, e.g., k-means. In our experiments, an initialization into K clusters obtained with k-means++ (Arthur and Vassilvitskii 2007) was used. k-means++ is identical to the k-means algorithm except for the selection of initial centers: it allocates one cluster center at random and then selects each subsequent center with probability proportional to the squared distance to the closest center already chosen. The essential idea is to spread the initial centers as far apart as possible. Since k-means++ still starts from a random draw, it can give different results on different runs. To mitigate poor initializations, we run k-means++ several times and use the labels that yield the smallest compactness (the sum of squared distances from each point to its corresponding center) to initialize the \(q_Z(z_j)\)’s and thus update the \(\rho _k\)’s. From the initialization of \(q_Z(\mathbf {z})\) and \({\varvec{\rho }} \), the VB-E-\({\varvec{\tau }} \) and VB-E-\({\varvec{\theta }} ^*\) steps can then be derived. This simple scheme has the advantage of accelerating the convergence of the VBEM algorithm.
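The seeding-with-restarts strategy above can be sketched as follows (a one-dimensional toy version for clarity; in practice a library implementation such as scikit-learn's `KMeans` with `init='k-means++'` and `n_init` performs the restarts internally). All function names here are illustrative.

```python
import random

def kmeanspp_centers(points, K, rng):
    """k-means++ seeding (1-D sketch): first center uniform at random,
    each further center drawn with probability proportional to the
    squared distance to the closest center already chosen."""
    centers = [rng.choice(points)]
    while len(centers) < K:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        u, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= u:
                centers.append(p)
                break
    return centers

def compactness(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def best_seeding(points, K, n_runs=10, seed=0):
    """Run k-means++ seeding several times; keep the most compact result."""
    rng = random.Random(seed)
    runs = [kmeanspp_centers(points, K, rng) for _ in range(n_runs)]
    return min(runs, key=lambda c: compactness(points, c))
```

On two well-separated groups of points, the squared-distance sampling almost surely places one center in each group, so the retained seeding has small compactness.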
Lü, H., Arbel, J. & Forbes, F. Bayesian nonparametric priors for hidden Markov random fields. Stat Comput 30, 1015–1035 (2020). https://doi.org/10.1007/s11222-020-09935-9