Parallel inference for big data with the group Bayesian method

Abstract

In recent years, big datasets have often had to be split into several subsets because of storage constraints. We propose a parallel group Bayesian method for statistical inference on sparse big data. The method improves on existing approaches in two respects: the full dataset is split into a sequence of data subsets, and the parameter vector is divided into several sub-vectors. In addition, we introduce a weight sequence to optimize the sub-estimators when each of them has a different covariance matrix. We establish several theoretical properties of the resulting estimator. Numerical simulations show that the method behaves in accordance with the theoretical results and is more effective than classic Markov chain Monte Carlo methods.
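
To make the split-and-combine idea above concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the subset estimator (ordinary least squares standing in for a Bayesian sub-estimator) and the precision-weighted combination rule are assumptions chosen only to illustrate how sub-estimators with different covariance matrices can be weighted.

```python
# Illustrative sketch of the split-and-combine idea, NOT the paper's method.
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 10_000, 5, 10                      # sample size, dimension, number of subsets
beta_star = np.array([1.0, -2.0, 0.0, 0.5, 0.0])

X = rng.normal(size=(n, p))
y = X @ beta_star + rng.normal(size=n)

sub_estimates, sub_precisions = [], []
for X_k, y_k in zip(np.array_split(X, K), np.array_split(y, K)):
    XtX = X_k.T @ X_k                         # p x p Gram matrix of subset k
    b_k = np.linalg.solve(XtX, X_k.T @ y_k)   # sub-estimator on subset k
    sub_estimates.append(b_k)
    sub_precisions.append(XtX)                # precision of the sub-estimator (up to noise variance)

# Weight each sub-estimator by its precision and renormalize, so subsets whose
# sub-estimators have smaller covariance receive larger weights.
P_total = sum(sub_precisions)
beta_hat = np.linalg.solve(P_total, sum(P @ b for P, b in zip(sub_precisions, sub_estimates)))

print("combined estimate:", np.round(beta_hat, 3))
print("l1 error:", float(np.abs(beta_hat - beta_star).sum()))
```

In this linear toy setting the precision-weighted combination coincides with the full-data least-squares estimate; the weight sequence studied in the paper plays the analogous role for the Bayesian sub-estimators.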


Acknowledgements

We thank a co-editor and three anonymous referees for their extremely valuable suggestions. This work was supported by a grant from the Natural Science Foundation of Shandong (Project ID ZR2016AM09).

Author information

Corresponding author

Correspondence to Guangbao Guo.


Appendix: Technical proofs

In this section, we collect the technical proofs.

Proof of Theorem 1

Let \(t=C_r\log n/\epsilon \), where \(C_r\) is a large constant. The first inequality below follows from Markov's inequality applied to \(\exp (t\,\cdot )\); noting further that \(e^x\le 1+x+\frac{1}{2}x^2e^{|x|}\) for all \(x>0\), we obtain

$$\begin{aligned} \sum _{n=1}^\infty n^{r-2} P\Bigg (\sum _{g=1}^{G_n} \Vert \varepsilon _{I_{k,g}}\Vert _1>\epsilon \Bigg )\le & {} \sum _{n=1}^\infty n^{r-2} e^{-\epsilon t} E\exp \Bigg (t\sum _{g=1}^{G_n} \Vert \varepsilon _{I_g}\Vert _1\Bigg ) \\\le & {} \sum _{n=1}^\infty n^{r-2-C_r} \prod _{g=1}^{G_n} Ee^{t \Vert \varepsilon _{I_g}\Vert _1} \\\le & {} \sum _{n=1}^\infty n^{r-2-C_r} \prod _{g=1}^{G_n} \Big [1+\frac{1}{2}t^2 E\varepsilon _{I_g}^2 e^{t \Vert \varepsilon _{I_g}\Vert _1}\Big ] \\\le & {} \sum _{n=1}^\infty n^{r-2-C_r} \prod _{g=1}^{G_n} \Big [1+c(\log n)^2 E e^{(1+c)\Vert \varepsilon _{I_g}\Vert _1}\Big ] \\\le & {} \sum _{n=1}^\infty n^{r-2-C_r} \exp \Big ( c(\log n)^2 {G_n} \Big ) \\\le & {} \sum _{n=1}^\infty n^{(r+\epsilon )-(2+C_r)} <\infty \quad {\text {for}}\; k=1,\ldots ,K_n. \end{aligned}$$

Here \(C_r>(r+\epsilon )\) for \(\epsilon >0\), and \(c\) is a suitable constant. This proves the theorem. \(\square \)
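
The decay established above can be checked numerically on a toy model. The sketch below is our own illustration: the Gaussian group errors, the number of groups \(G=5\) and the threshold \(\epsilon =0.5\) are assumptions made only for this check, which estimates \(P\big (\sum _{g}\Vert \varepsilon _{I_g}\Vert _1>\epsilon \big )\) by Monte Carlo and shows it dropping rapidly as \(n\) grows.

```python
# Toy Monte Carlo check: with light-tailed group errors, the tail probability
# P( sum_g ||eps_{I_g}||_1 > eps ) falls off rapidly as n grows.
import numpy as np

rng = np.random.default_rng(1)

def tail_prob(n, G=5, eps=0.5, reps=100_000):
    # each group error is the mean of n/G standard normals, i.e. N(0, G/n)
    group_errors = rng.normal(scale=np.sqrt(G / n), size=(reps, G))
    return np.mean(np.abs(group_errors).sum(axis=1) > eps)

for n in (100, 400, 1_600, 6_400):
    print(n, tail_prob(n))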

Proof of Theorem 2

Since \(\{w_{k}\}_{k=1}^{K_n}\) and \(\{\epsilon _{I_k}\}_{k=1}^{K_n}\) are independent, we have, for \(r>0\),

$$\begin{aligned} E\Bigg (\sum _{k=1}^\infty \Vert w_{k}\epsilon _{I_k}\Vert _1^r\Bigg )\le \sum _{k=1}^\infty E(w_{k}^r)\ E\Vert \epsilon _{I_1}\Vert _1^r<\infty . \end{aligned}$$

Therefore, \(\sum _{k=1}^\infty w_{k}\epsilon _{I_k}\) converges.

Let \(s_{K_n}=\sum _{k=1}^{K_n}\epsilon _{I_k}\) for \(K_n\ge 1\), with \(s_0=0\), and set \(Y_{K_n}=s_{K_n}/K_n^{1/r}\). Then \(\lim _{K_n\rightarrow \infty }Y_{K_n}=0\), and, by summation by parts,

$$\begin{aligned} \sum _{k=1}^\infty w_{k}\epsilon _{I_k} =\sum _{k=1}^\infty w_{k}(s_{k}-s_{k-1})=\lim _{K_n\rightarrow \infty }\bigg (\sum _{k=1}^{K_n-1} (w_{k}-w_{k+1})s_k+w_{K_n}s_{K_n}\bigg ). \end{aligned}$$

We obtain

$$\begin{aligned} \sum _{k=1}^\infty w_{k}\epsilon _{I_k}=\sum _{k=1}^\infty k^{1/r}(w_{k}-w_{k+1})Y_k \quad {\text {for}} \; Y_k=\sum _{j=1}^{k}\epsilon _{I_j}/k^{1/r}. \end{aligned}$$

Let \(D_{I_k}=k^{1/r}(w_{k}-w_{k+1})\). Then \(\sum _{k=1}^\infty |D_{I_k}|< C_w\) and \(\lim _{k\rightarrow \infty }D_{I_k} =0\) for \(k,n\in \mathbb {N}^+\). For any \(\epsilon >0\), choose \(N_D\) such that \(\Vert Y_k\Vert _1<\epsilon \) for \(k\ge N_D\). We then have

$$\begin{aligned} \sum _{k=1}^\infty \big \Vert D_{I_k}Y_k \big \Vert _1 \le \sum _{k=1}^{N_D-1} |D_{I_k}|\cdot \Vert Y_k\Vert _1+ \sum _{k=N_D}^\infty |D_{I_k}|\cdot \Vert Y_k\Vert _1 \longrightarrow C_w\epsilon \quad {\text {as}} \; N_D\longrightarrow \infty . \end{aligned}$$

Letting \(\epsilon \rightarrow 0\), we have

$$\begin{aligned} \sum _{k=1}^{K_n} \big \Vert D_{I_k}Y_k \big \Vert _1\longrightarrow 0 \quad {\text {as}}\; K_n\longrightarrow \infty . \end{aligned}$$

Thus, for \(K_n=O(\sqrt{n})\),

$$\begin{aligned} \big \Vert \hat{\beta }_w-\beta ^*\big \Vert _1= \Big \Vert \sum _{k=1}^{K_n} w_{k} \epsilon _{I_k}\Big \Vert _1\le \sum _{k=1}^{K_n} \big \Vert D_{I_k}Y_k\big \Vert _1 \longrightarrow 0 \quad {\text {as}} \; n\longrightarrow \infty . \end{aligned}$$

\(\square \)
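
As a sanity check on Theorem 2, the short simulation below shows the \(\ell _1\) error of the weighted combination \(\sum _k w_k\epsilon _{I_k}\) shrinking as the number of subsets grows. It is only a sketch under our own assumptions: equal weights \(w_k=1/K\) and Gaussian subset errors of scale \(1/\sqrt{n_k}\) stand in for the paper's weight sequence and sub-estimator errors.

```python
# Toy check of Theorem 2: || sum_k w_k eps_{I_k} ||_1 shrinks as K_n grows.
import numpy as np

rng = np.random.default_rng(2)

def weighted_l1_error(K, subset_size=200, p=5):
    eps = rng.normal(scale=1 / np.sqrt(subset_size), size=(K, p))  # subset errors
    w = np.full(K, 1.0 / K)                                        # weight sequence
    return float(np.abs(w @ eps).sum())                            # ||sum_k w_k eps_k||_1

for K in (4, 16, 64, 256):
    print(K, round(weighted_l1_error(K), 4))
```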

Proof of Theorem 3

The proof proceeds in three steps.

  1. i)

\(\pi \big [p_{I_k,M}: d_1(p_{I_k}^*,p_{I_k,M})\le \epsilon _{n_k}^2/4\big ]\ge \exp (-n_k\epsilon _{n_k}^2/4)\) for all sufficiently large \(n_k\), where \(n=\sum _{k=1}^{K_n} n_k\). Since \(\Vert \theta _{I_k}^*-\theta _{I_k,M}\Vert _1\longrightarrow 0\), we have \(d_1(p_{I_k}^*,p_{I_k,M})=h(\theta _{I_k}^l)(\theta _{I_k}^*-\theta _{I_k,M})\), where the derivative function \(h\) is continuous in some neighborhood of \(\theta _{I_k}^*\) and \(\theta _{I_k}^l\) lies between \(\theta _{I_k}^*\) and \(\theta _{I_k,M}\). Observe that, for \(k=1,\ldots ,K_n\),

    $$\begin{aligned} \Vert \theta _{I_k}^*-\theta _{I_k}^l\Vert _1\le & {} \Vert \theta _{I_k}^*-\theta _{I_k,M}\Vert _1\le \sum _{g=1}^{G_n} \big \Vert X_{I_{k,g}}\beta _{I_{k,g}}^*- M_{I_g} X_{I_{k,g}}^M \beta _{I_{k,g}}^M\big \Vert _1 \\\le & {} \bigg \Vert \sum _{g=1}^{G_n} M_{{\setminus } I_g} X_{I_{k,g}}\beta _{I_{k,g}}^*\bigg \Vert _1+\bigg \Vert \sum _{g=1}^{G_n} X_{I_{k,g}}(\beta _{I_{k,g}}^*- M_{I_g}\beta _{I_{k,g}}^M)\bigg \Vert _1 \\\le & {} C_\theta (\Delta _{n_k}+r_{n_k}\delta _{n_k}), \end{aligned}$$

    where \(\Delta _{n_k}= \sum _{g=1}^{G_n}\Vert M_{{\setminus } I_g}\beta _{I_{k,g}}^*\Vert _1\), and \(C_\theta \) is a constant. By (A1) and Theorem 2, note that

    $$\begin{aligned} \Vert \theta _{I_k}^l\Vert _1\le \Vert \theta _{I_k}^*\Vert _1+\Vert \theta _{I_k}^*-\theta _{I_k}^l\Vert _1\le \lim _{n_k\rightarrow \infty }\sum _{g=1}^{G_n} \Vert \beta _{I_{k,g}}^*\Vert _1+\Delta _{n_k}+r_{n_k}\delta _{n_k}, \end{aligned}$$

    so \(\Vert \theta _{I_k}^l\Vert _1\) is bounded since \(r_{n_k}\delta _{n_k}\longrightarrow 0\). Therefore, \(\Vert h(\theta _{I_k}^l)\Vert _1\) is bounded, and there exists a constant \(C_h\) such that

    $$\begin{aligned} d_1(p_{I_k}^*,p_{I_k,M})\le C_h(\Delta _{n_k}+r_{n_k}\delta _{n_k}). \end{aligned}$$

    Let \(\delta _{n_k}=c_\epsilon \epsilon _{n_k}^2/|M| \) for a suitable constant \(c_\epsilon >0\). Since \(\Delta _{n_k}\prec \epsilon _{n_k}^2\), we have \(d_1(p_{I_k}^*,p_{I_k,M})\le \epsilon _{n_k}^2/4\). Let \(S_{I_k}=\{p(y_{I_k}|M,\beta _{I_k}):\beta _{I_k}\in (\beta _{I_{k,g}}^*\pm \delta _{n_k})_{g=1,\ldots ,G_n}\}\) and \(T_{I_k}=\{p_{I_k}:d_1(p_{I_k}^*,p_{I_k})\le \epsilon _{n_k}^2/4\}\). For all sufficiently large \(n_k\), \(\pi (T_{I_k})>\pi (S_{I_k})\ge \exp (-n_k\epsilon _{n_k}^2/4)\).

  2. ii)

    \(\log N(\epsilon _{n_k},{\mathcal {P}}_{n_k})\le n_k \epsilon _{n_k}^2\) for all sufficiently large \(n_k\). Denote the regression parameters \(u_{I_k}=\{u_{I_{k,1}},\ldots ,u_{I_{k,G_n}}\}\) and \(v_{I_k}=\{v_{I_{k,1}},\ldots ,v_{I_{k,G_n}}:\Vert v_{I_{k,g}}\Vert _1\le C_G\}\), where \(C_G\) is the constant with \(\Vert \beta _{I_{k,g}}\Vert _1\le C_G\), and \(u_{I_{k,g}}\) and \(v_{I_{k,g}}\) are zero on the same set of components \(M\) (\(|M|\le \bar{r}_{n_k}\)). Let \(p_{I_{k,u}}=\exp [y_{I_k}^\top \theta _{I_{k,u}}-b(\theta _{I_{k,u}})+c(y_{I_k})]\) with \(\theta _{I_{k,u}}=\sum _{g=1}^{G_n}X_{I_{k,g}}u_{I_{k,g}}\), and define \(p_{I_{k,v}}\) and \(\theta _{I_{k,v}}\) analogously. Then the Hellinger distance satisfies \(d(p_{I_{k,u}},p_{I_{k,v}})\le \sqrt{d_0(p_{I_{k,u}},p_{I_{k,v}})} \) and

    $$\begin{aligned} d_0(p_{I_{k,u}},p_{I_{k,v}})\le E(b'(\theta _{I_{k,v}})-b'(\theta _{I_k}^l))^\top (\theta _{I_{k,v}}-\theta _{I_{k,u}}), \end{aligned}$$

    where \(\theta _{I_k}^l\) lies between \(\theta _{I_{k,v}}\) and \(\theta _{I_{k,u}}\). Observe that

    $$\begin{aligned} \Vert \theta _{I_{k,v}}-\theta _{I_{k,u}}\Vert _1=\bigg \Vert \sum _{g=1}^{G_n} X_{I_{k,g}}(v_{I_{k,g}}-u_{I_{k,g}})\bigg \Vert _1\le C_\theta \bar{r}_{n_k} \delta ; \end{aligned}$$

    then,

    $$\begin{aligned} d_0(p_{I_{k,u}},p_{I_{k,v}})\le 2 \sup _{\Vert \theta \Vert _1\le \bar{r}_{n_k} C_G} \Vert b'(\theta )\Vert _1 \bar{r}_{n_k} \delta \quad {\text {and}}\quad d(p_{I_{k,u}},p_{I_{k,v}})\le \sqrt{2 \sup _{\Vert \theta \Vert _1\le \bar{r}_{n_k} C_G} \Vert b'(\theta )\Vert _1 \bar{r}_{n_k} \delta }. \end{aligned}$$

    Therefore, \(d(p_{I_{k,u}},p_{I_{k,v}})\le \epsilon _{n_k}\) if \(\delta =\epsilon _{n_k}^2/\{\sup _{\Vert \theta _{I_k}\Vert _1\le \bar{r}_{n_k} C_G} \Vert b'(\theta _{I_k})\Vert _1 \bar{r}_{n_k} \}\). Additionally,

    $$\begin{aligned} N(\epsilon _{n_k},{\mathcal {P}}_{n_k})\le (\bar{r}_{n_k}+1)G_n^{\bar{r}_{n_k}} \biggl (1+2\epsilon _{n_k}^{-2}\cdot \sup _{\Vert \theta _{I_k}\Vert _1\le \bar{r}_{n_k} C_G} \Vert b'(\theta _{I_k})\Vert _1 \bar{r}_{n_k} C_G \biggr )^{\bar{r}_{n_k}}. \end{aligned}$$

    By (A2), we obtain the result of ii); a toy numerical check of this covering-number bound is given after the proof.

  3. iii)

    Using (A2), \(\pi ({\mathcal {P}}_{n_k}^c)\le \exp (-2n_k\epsilon _{n_k}^2)\) for all sufficiently large \(n_k\). Observe that

    $$\begin{aligned} \pi ({\mathcal {P}}_{n_k}^c)\le & {} \pi (|M|>\bar{r}_{n_k})\\&+\sum _{M:|M|\le \bar{r}_{n_k}}\pi (M) \pi \bigg (\bigcup _{g=1,\ldots ,G_n}\Big (\Vert \beta _{I_{k,g}}\Vert _1>C_G\Big )\big |M\bigg ). \end{aligned}$$

Then,

$$\begin{aligned}&(1+\bar{r}_{n_k})\exp (-4n_k\epsilon _{n_k}^2)\\&\quad =\exp \big [\log (1+\bar{r}_{n_k})-4n_k\epsilon _{n_k}^2\big ]\le \exp (-2n_k\epsilon _{n_k}^2) \end{aligned}$$

for all sufficiently large \(n_k\). From this, we obtain the result of iii).

Combining i), ii), and iii), there exists \(N_{n_k}\) such that

$$\begin{aligned} E \Big \{\pi \Big [ d(p_{I_k}^*,p_{I_k,M})>4\epsilon _{n_k}\big |(y_{I_k},X_{I_k})\Big ]\Big \}\le 4 \exp (-n_k\epsilon _{n_k}^2/2)\quad {\text {for}}\; n_k>N_{n_k}. \end{aligned}$$

Thus, we have the theorem. \(\square \)
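
The entropy condition in step ii) can be sanity-checked numerically. The snippet below evaluates the covering-number bound from ii) for a few values of \(n_k\); all quantities in it (the bound \(B\) on \(\Vert b'(\theta )\Vert _1\), \(C_G\), \(G_n\), \(\bar{r}_{n_k}\), and the rate assumed for \(\epsilon _{n_k}^2\)) are our own illustrative choices, not values taken from the paper.

```python
# Rough numeric check of step ii): the covering-number bound
#   log N <= log(r_bar + 1) + r_bar*log(G_n) + r_bar*log(1 + 2*eps^-2 * B * r_bar * C_G)
# stays far below n_k * eps_{n_k}^2 for the toy rates chosen below.
import numpy as np

B, C_G, G_n = 5.0, 2.0, 50                             # assumed constants
for n_k in (10**3, 10**4, 10**5):
    eps2 = np.log(n_k) / np.sqrt(n_k)                  # assumed rate for eps_{n_k}^2
    r_bar = int(np.sqrt(n_k) / np.log(n_k))            # assumed sieve dimension
    log_N = (np.log(r_bar + 1) + r_bar * np.log(G_n)
             + r_bar * np.log(1 + 2 / eps2 * B * r_bar * C_G))
    print(f"n_k={n_k:>7}  log N bound = {log_N:8.1f}   n_k*eps^2 = {n_k * eps2:8.1f}")
```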

Proof of Theorem 4

Since \(|p(\beta , M)|\) is bounded, there exists a constant \(C_M\) such that

$$\begin{aligned} \Bigg |\sum _{k=1}^{K_n}u_k p(\beta _{I_k},M,y_{I_k})-p(\beta ,M,y)\Bigg |\le C_M\cdot \Bigg |\sum _{k=1}^{K_n} u_k (l_{I_k}-l)\Bigg |. \end{aligned}$$
$$\begin{aligned}&P\Bigg (\bigg |\sum _{k=1}^{K_n}u_kp(\beta _{I_k},M,y_{I_k})-p(\beta ,M,y)\bigg |>\varepsilon \Bigg ) \le P\Bigg (C_M\cdot \big |\sum _{k=1}^{K_n} u_kl_{I_k}-l\big |>\varepsilon \Bigg )\\&\quad =P\Bigg (\big |\sum _{k=1}^{K_n}u_kl_{I_k}-l\big |>\frac{\varepsilon }{C_M}\Bigg )\\&\quad = P\Bigg (\big |\sum _{k=1}^{K_n}u_k(l_{I_k}-El_{I_k}+El_{I_k}-El+El-l)\big |>\frac{\varepsilon }{C_M}\Bigg ) \\&\quad \le P\Bigg (\big |\sum _{k=1}^{K_n}u_k(l_{I_k}-El_{I_k})\big |>\frac{\varepsilon }{3C_M}\Bigg )\\&\qquad + P\Bigg (\big |\sum _{k=1}^{K_n}u_k(El_{I_k}-El)\big |>\frac{\varepsilon }{3 C_M}\Bigg )\\&\qquad +P\Bigg ( \big | \sum _{k=1}^{K_n}u_k(El-l)\big |>\frac{\varepsilon }{3C_M}\Bigg )\\&\quad = A_1+A_2+A_3.\\ \end{aligned}$$

By Chebyshev’s inequality,

$$\begin{aligned} A_2\le \frac{\sum _{k=1}^{K_n}u_k^2 \ \text {var}(l_{I_k}-l)}{\varepsilon ^2}. \end{aligned}$$

With (A3), we obtain

$$\begin{aligned} A_2\le O (n^{-1})\longrightarrow 0 \quad {\text {as}}\; n\longrightarrow \infty . \end{aligned}$$

By Hoeffding’s inequality, (A4) and (A5), we have

$$\begin{aligned} A_1\le & {} P\Big (\frac{1}{n}\big |\sum _{k=1}^{K_n}l_{I_k}-El_{I_k}\big |\ge t\Big ) \le 2 \exp (-2 n^2 t^2 R_l^{-2}), \\ A_3\le & {} P\Big (\frac{1}{n}\big |\sum _{k=1}^{K_n}El-l\big |\ge t\Big ) \le 2 \exp ( -2 n^2t^2 R_l^{-2})\quad {\text {for}}\; t>0. \end{aligned}$$

Letting \(t=\varepsilon \), we then have

$$\begin{aligned} A_1, A_3\le 2 \exp (-2 n^2 \varepsilon ^2R_l^{-2}) \longrightarrow 0 \quad {\text {as}}\; n\longrightarrow \infty . \end{aligned}$$

The proof is finished. \(\square \)
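
The concentration behind \(A_1\) and \(A_3\) can be visualised with a small simulation. The sketch below computes the averaged deviation \(\frac{1}{n}\big |\sum _k (l_{I_k}-El_{I_k})\big |\) of the subset log-likelihoods from their expectations and shows it vanishing as \(n\) grows; the \(N(\mu ,1)\) model and the equal split into \(K\) subsets are assumptions made only for this illustration.

```python
# Toy illustration of the concentration behind A_1: the averaged deviation
# (1/n) | sum_k ( l_{I_k} - E l_{I_k} ) | vanishes as n grows.
import numpy as np

rng = np.random.default_rng(3)
mu = 1.0
# For y ~ N(mu, 1), E log p(y) at the true mu is -0.5*log(2*pi) - 0.5.
expected_logpdf = -0.5 * np.log(2 * np.pi) - 0.5

def averaged_deviation(n, K=10):
    y = rng.normal(loc=mu, size=n)
    logpdf = -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2
    subset_sums = [part.sum() for part in np.array_split(logpdf, K)]  # l_{I_k}
    return abs(sum(subset_sums) - n * expected_logpdf) / n

for n in (100, 1_000, 10_000, 100_000):
    print(n, averaged_deviation(n))
```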

Cite this article

Guo, G., Qian, G., Lin, L. et al. Parallel inference for big data with the group Bayesian method. Metrika 84, 225–243 (2021). https://doi.org/10.1007/s00184-020-00784-0
