Published by De Gruyter July 14, 2018

On the relation between the true and sample correlations under Bayesian modelling of gene expression datasets

Royi Jacobovic

Abstract

The prediction of cancer prognosis and metastatic potential immediately after the initial diagnosis is a major challenge in current clinical research. The relevance of such a signature is clear, as it would free many patients from the agony and toxic side-effects associated with adjuvant chemotherapy that is automatically, and sometimes carelessly, prescribed to them. Motivated by this issue, several previous works presented a Bayesian model which led to the following conclusion: thousands of samples are needed to generate a robust gene list for predicting outcome. This conclusion rests on several statistical assumptions, including asymptotic independence of the sample correlations. The current work makes two main contributions: (1) It shows that while the assumptions of the Bayesian model discussed by previous papers seem non-restrictive, they are in fact quite strong. To demonstrate this point, it is shown that some standard sparse and Gaussian models are not included in the set of models which are mathematically consistent with these assumptions. (2) It is shown that the empirical Bayes methodology which was applied in order to test the relevant assumptions does not detect severe violations, and consequently the required sample size may be overestimated. Finally, we suggest that under some regularity conditions the current theoretical results may be used to develop a new method for testing the asymptotic independence assumption.

Acknowledgement

Special thanks to Or Zuk, Yuval Benjamini and Jonathan Fiat for their helpful comments.

Appendix


Proof. (Theorem 3.1)

Due to symmetry considerations, it is enough to prove that

$$P\left\{\rho_{x_1x_2}=c\right\}<1,\qquad\forall\, c\in(-1,1).$$

To this end, assume towards a contradiction that there exists some c ∈ (−1, 1) such that $P\{\rho_{x_1x_2}=c\}=1$. Since G(⋅) is a probability distribution over Θ and it is known that almost surely $F_\theta(\cdot)$ has finite first two moments, the probability (with respect to G(⋅)) that the correlation matrix of (X₁, X₂, Y) is positive semi-definite equals one. To obtain a contradiction, it is shown that, with positive probability, the characteristic polynomial of this correlation matrix has a negative root. In detail, since $P\{\rho_{x_1x_2}=c\}=1$, almost surely the characteristic polynomial of the correlation matrix of (X₁, X₂, Y) is given by

$$P(\lambda;\rho_1,\rho_2,c)=\det\begin{bmatrix}1-\lambda & c & \rho_1\\ c & 1-\lambda & \rho_2\\ \rho_1 & \rho_2 & 1-\lambda\end{bmatrix}=(1-\lambda)^3-(1-\lambda)\left(\rho_1^2+\rho_2^2+c^2\right)+2c\rho_1\rho_2.$$

Now, set ρ₁ = 0 and obtain the following equation:

$$P(\lambda;\rho_1=0,\rho_2,c)=(1-\lambda)^3-(1-\lambda)\left(c^2+\rho_2^2\right)=0.$$

Clearly, if $c\in(-1,1)\setminus\{0\}$, then for $\rho_2=\frac{\sqrt{2-c^2}}{\sqrt{2}}\in\left(\frac{1}{\sqrt{2}},1\right)$ there exists one root of P(λ) which is given by

$$\hat{\lambda}=1-\sqrt{1+\frac{c^2}{2}}<0,$$

i.e., there exists a negative solution of the equation

$$P\left(\lambda;\rho_1=0,\rho_2=\frac{\sqrt{2-c^2}}{\sqrt{2}},c\right)=0.$$

Since Cardano's formula [see e.g. Spiegel (1968)] implies that the solutions of the equation P(λ; ρ₁, ρ₂, c) = 0 are continuous in (ρ₁, ρ₂) at the point $p:=\left(0,\frac{\sqrt{2-c^2}}{\sqrt{2}}\right)$, there exists δ > 0 associated with a ball $B_\delta(p)\subset(-1,1)^2$ such that for any (ρ₁, ρ₂) ∈ B_δ(p), P(λ; ρ₁, ρ₂, c) has a negative root. In addition, the fact that ϕ(⋅) is a strictly increasing continuous function on (−1, 1) and $\phi(\rho_1),\phi(\rho_2)\overset{\text{i.i.d.}}{\sim}N(0,\sigma_q^2)$ implies that (ρ₁, ρ₂) is continuously distributed over (−1, 1)². Therefore, with positive probability, P(λ; ρ₁, ρ₂, c) has a negative root, which yields a contradiction. To complete the proof, if c = 0, then

$$P(\lambda;\rho_1,\rho_2,c=0)=(1-\lambda)\left[(1-\lambda)^2-\rho_1^2-\rho_2^2\right].$$

If ρ₁ and ρ₂ are both close enough to 1, then this characteristic polynomial has a negative root. Therefore, the same arguments used for the case $c\in(-1,1)\setminus\{0\}$ may be carried out once again to justify the existence of a negative root with positive probability.

   □
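As a numerical sanity check of the construction above (illustrative only, not part of the proof): for ρ₁ = 0 the polynomial factors as (1 − λ)[(1 − λ)² − (c² + ρ₂²)], so its smallest root can be computed directly in plain Python.

```python
import math

def min_root(rho2, c):
    """Smallest root of P(lam; rho1=0, rho2, c) = (1-lam)[(1-lam)^2 - (c^2+rho2^2)].
    The roots are 1 and 1 +/- sqrt(c^2 + rho2^2)."""
    s = math.sqrt(c * c + rho2 * rho2)
    return min(1.0, 1.0 - s, 1.0 + s)

c = 0.5                                      # any c in (-1,1)\{0}
rho2 = math.sqrt(2 - c * c) / math.sqrt(2)   # the choice used in the proof
lam_hat = min_root(rho2, c)

# with this rho2, c^2 + rho2^2 = 1 + c^2/2, so the smallest root is negative
assert abs(lam_hat - (1 - math.sqrt(1 + c * c / 2))) < 1e-12
assert lam_hat < 0
```

The assertion confirms that the smallest eigenvalue equals 1 − √(1 + c²/2) < 0, so the matrix cannot be positive semi-definite.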


Lemma 5.1. (Multivariate Delta Method)

Let Ψ be an m-dimensional positive definite matrix and let $\{X_n\}_{n=1}^{\infty}$ be a sequence of m-dimensional random vectors such that

$$\sqrt{n}\left(X_n-\mu\right)\xrightarrow{L}N(0,\Psi).$$

If $f:\mathbb{R}^m\to\mathbb{R}^d$ has a derivative matrix $\nabla f(\cdot)$ which is continuous in some neighbourhood of μ, then

$$\sqrt{n}\left[f(X_n)-f(\mu)\right]\xrightarrow{L}N\left(0,\left[\nabla f(\mu)\right]^T\Psi\left[\nabla f(\mu)\right]\right)$$

where $\nabla f(\tilde{\beta})=\left[\frac{\partial f_j}{\partial\beta_i}\right]_{ij}(\tilde{\beta})$ is the partial derivative matrix of f(⋅) at the point $\tilde{\beta}\in\mathbb{R}^m$.

Proof.

See chapter 7 of Ferguson (1996).    □
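A small Monte Carlo illustration of the lemma in the scalar case m = d = 1 (a sketch under illustrative choices of f, μ and σ, not part of the text): with f(x) = x², the limiting variance of √n(f(X̄ₙ) − f(μ)) should be f′(μ)²σ².

```python
import math
import random
import statistics

random.seed(0)
n, reps = 500, 2000
mu, sigma = 2.0, 1.0

def f(x):
    return x * x  # f'(mu) = 2 * mu

vals = []
for _ in range(reps):
    # X_n = mean of n i.i.d. N(mu, sigma^2) draws
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (f(xbar) - f(mu)))

emp_var = statistics.pvariance(vals)
theo_var = (2.0 * mu) ** 2 * sigma ** 2  # delta-method prediction: 16
assert abs(emp_var - theo_var) / theo_var < 0.25
```

The empirical variance of the rescaled differences agrees with the delta-method prediction up to Monte Carlo error.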


Proof. (Theorem 3.2)

Consider the following notations:

$$m_{x_i}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij},\quad i=1,\ldots,k$$
$$m_{x_i^2}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij}^2,\quad i=1,\ldots,k$$
$$m_{x_iy}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij}Y_j,\quad i=1,\ldots,k$$
$$m_y:=\frac{1}{n}\sum_{j=1}^{n}Y_j,\qquad m_{y^2}:=\frac{1}{n}\sum_{j=1}^{n}Y_j^2$$
$$s_{x_i}^2:=m_{x_i^2}-m_{x_i}^2,\qquad s_{x_iy}:=m_{x_iy}-m_{x_i}m_y,\quad i=1,\ldots,k$$
$$s_y^2:=m_{y^2}-m_y^2.$$

Using these notations, the sample correlations can be written as:

$$r_i:=r_n^i=\frac{s_{x_iy}}{s_{x_i}s_y},\quad i=1,\ldots,k.$$
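As a quick illustration that these moment notations reproduce the usual sample correlation, here is a plain-Python check on arbitrary made-up data:

```python
import math

x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [1.2, 1.9, 4.1, 2.8, 5.3]
n = len(x)

m_x, m_y = sum(x) / n, sum(y) / n
m_x2 = sum(v * v for v in x) / n
m_y2 = sum(v * v for v in y) / n
m_xy = sum(a * b for a, b in zip(x, y)) / n

s_x2 = m_x2 - m_x ** 2       # s_x^2  = m_{x^2} - m_x^2
s_y2 = m_y2 - m_y ** 2       # s_y^2  = m_{y^2} - m_y^2
s_xy = m_xy - m_x * m_y      # s_{xy} = m_{xy} - m_x m_y

r = s_xy / math.sqrt(s_x2 * s_y2)

# the same correlation from the centred-sums definition
num = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
den = math.sqrt(sum((a - m_x) ** 2 for a in x) * sum((b - m_y) ** 2 for b in y))
assert abs(r - num / den) < 1e-12
```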

Before going forward, let us mention that the following analysis is carried out with respect to the probability space conditioned on θ, for which the first four moments of $F_\theta(\cdot)$ are finite. Since it is given that $F_\theta(\cdot)$ has such finite moments with probability one, the results of the upcoming analysis hold with probability one. Now, with respect to this setup, an application of the multivariate central limit theorem (CLT) justifies the convergence

$$\sqrt{n}\left[\begin{pmatrix}m_{x_1}\\ \vdots\\ m_{x_k}\\ m_y\\ m_{x_1^2}\\ \vdots\\ m_{x_k^2}\\ m_{y^2}\\ m_{x_1y}\\ \vdots\\ m_{x_ky}\end{pmatrix}-\begin{pmatrix}0\\ \vdots\\ 0\\ 0\\ \sigma_{x_1}^2\\ \vdots\\ \sigma_{x_k}^2\\ \sigma_y^2\\ \sigma_{x_1y}\\ \vdots\\ \sigma_{x_ky}\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_{3k+2}\left(0,\Sigma^1\right)$$

where the covariance matrix Σ¹ is given by $\Sigma^1_{ij}=C(Z_i,Z_j)$, $1\leq i,j\leq 3k+2$, and Z is as follows:

$$Z=\left(X_1,\ldots,X_k,Y,X_1^2,\ldots,X_k^2,Y^2,X_1Y,\ldots,X_kY\right)^T.$$

Define a function $\eta:\mathbb{R}^{3k+2}\to\mathbb{R}^{2k+1}$ by

$$\eta(z)=\begin{pmatrix}z_{k+2}-z_1^2\\ \vdots\\ z_{2k+2}-z_{k+1}^2\\ z_{2k+3}-z_{k+1}z_1\\ \vdots\\ z_{3k+2}-z_{k+1}z_k\end{pmatrix}$$

and notice that

  1. $\eta\left[\left(m_{x_1},\ldots,m_{x_k},m_y,m_{x_1^2},\ldots,m_{x_k^2},m_{y^2},m_{x_1y},\ldots,m_{x_ky}\right)^T\right]=\left(s_{x_1}^2,\ldots,s_{x_k}^2,s_y^2,s_{x_1y},\ldots,s_{x_ky}\right)^T$
  2. $\nabla\eta(z)=\begin{pmatrix}-2\,\mathrm{diag}(z_1,\ldots,z_{k+1}) & B\\ I_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ O_{k\times(k+1)} & I_{k\times k}\end{pmatrix}$

    where the matrix B is given by

    $$B:=\begin{pmatrix}-z_{k+1}I_{k\times k}\\ -(z_1,\ldots,z_k)\end{pmatrix}.$$

Thus, the multivariate delta method can be used in order to deduce that

$$\sqrt{n}\left[\begin{pmatrix}s_{x_1}^2\\ \vdots\\ s_{x_k}^2\\ s_y^2\\ s_{x_1y}\\ \vdots\\ s_{x_ky}\end{pmatrix}-\begin{pmatrix}\sigma_{x_1}^2\\ \vdots\\ \sigma_{x_k}^2\\ \sigma_y^2\\ \sigma_{x_1y}\\ \vdots\\ \sigma_{x_ky}\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_{2k+1}\left(0,\Sigma^2\right)$$

where Σ² is given by

$$\Sigma^2=\nabla\eta^T\left[\left(0,\ldots,0,\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right]\;\Sigma^1\;\nabla\eta\left[\left(0,\ldots,0,\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right].$$

Notice that for the vector of inputs written above, ∇η is given by

$$\begin{pmatrix}O_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ I_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ O_{k\times(k+1)} & I_{k\times k}\end{pmatrix},$$

which means that Σ² equals the bottom-right (2k+1) × (2k+1) block of Σ¹. Considering this result, define another function $\gamma:\mathbb{R}^{2k+1}\to\mathbb{R}^k$ as follows:

$$\gamma(v)=\begin{pmatrix}\frac{v_{k+2}}{\sqrt{v_{k+1}v_1}}\\ \vdots\\ \frac{v_{2k+1}}{\sqrt{v_{k+1}v_k}}\end{pmatrix}$$

which satisfies

  1. $\gamma\left[\left(s_{x_1}^2,\ldots,s_{x_k}^2,s_y^2,s_{x_1y},\ldots,s_{x_ky}\right)^T\right]=\left(r_1,\ldots,r_k\right)^T$
  2. $\nabla\gamma^T(v)=\left[A\,|\,B\,|\,C\right]$

    where

    $$A:=\mathrm{diag}\left(-\frac{v_{k+2}}{2\sqrt{v_1^3v_{k+1}}},\ldots,-\frac{v_{2k+1}}{2\sqrt{v_k^3v_{k+1}}}\right)$$
    $$B:=\left(-\frac{v_{k+2}}{2\sqrt{v_1v_{k+1}^3}},\ldots,-\frac{v_{2k+1}}{2\sqrt{v_kv_{k+1}^3}}\right)^T$$
    $$C:=\mathrm{diag}\left(\frac{1}{\sqrt{v_1v_{k+1}}},\ldots,\frac{1}{\sqrt{v_kv_{k+1}}}\right).$$

Therefore, the multivariate delta method can be used once again to obtain the limit

$$\sqrt{n}\left[\begin{pmatrix}r_1\\ \vdots\\ r_k\end{pmatrix}-\begin{pmatrix}\rho_1\\ \vdots\\ \rho_k\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_k\left(0,\Sigma^3\right)$$

where Σ3 is given by

$$\Sigma^3=\nabla\gamma^T\left[\left(\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right]\;\Sigma^2\;\nabla\gamma\left[\left(\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right].$$

Define $\phi:(-1,1)^k\to\mathbb{R}^k$ as follows:

$$\phi(w):=\left(\phi(w_1),\ldots,\phi(w_k)\right)^T$$

and notice that ϕ(⋅) is differentiable in its domain, and hence its derivative matrix is given by

$$\nabla\phi(w)=\mathrm{diag}\left[\phi'(w_1),\ldots,\phi'(w_k)\right].$$

If so, one more application of the multivariate delta method yields the convergence

$$\sqrt{n}\left[\begin{pmatrix}\phi(r_1)\\ \vdots\\ \phi(r_k)\end{pmatrix}-\begin{pmatrix}\phi(\rho_1)\\ \vdots\\ \phi(\rho_k)\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_k\left(0,\Sigma^4\right)$$

where Σ⁴ is given by

$$\left[\Sigma^4\right]_{ij}=\phi'(\rho_i)\phi'(\rho_j)\left[\Sigma^3\right]_{ij},\quad 1\leq i,j\leq k.$$
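The scaling factor ϕ′ becomes explicit when ϕ is taken to be Fisher's z-transform — an assumption here, since the argument only requires a differentiable strictly increasing ϕ, but it is the choice consistent with the N(0, 1/(n − 3)) noise used in the computations of Figures 1 and 2:

```latex
\phi(\rho)=\tfrac{1}{2}\log\frac{1+\rho}{1-\rho}=\operatorname{artanh}(\rho),\qquad
\phi'(\rho)=\frac{1}{1-\rho^2}>0,\quad\rho\in(-1,1),
```

so that in this case $[\Sigma^4]_{ij}=[\Sigma^3]_{ij}/\big((1-\rho_i^2)(1-\rho_j^2)\big)$.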

Since ϕ′(⋅) is positive for any possible input and non-correlation is equivalent to independence under a Gaussian law, for any i ≠ j asymptotic independence of ϕ(r_i) and ϕ(r_j) is equivalent to $[\Sigma^3]_{ij}=0$. To see how the needed result stems from this observation, for simplicity and w.l.o.g., consider the case where i = 1 and j = 2. In this case r₁ and r₂ are asymptotically independent iff the following equation holds:

$$(2)\qquad\Sigma^3_{12}=\begin{pmatrix}-\frac{\rho_1}{2\sigma_{x_1}^2}\\ 0\\ \vdots\\ 0\\ -\frac{\rho_1}{2\sigma_y^2}\\ \frac{1}{\sigma_{x_1}\sigma_y}\\ 0\\ \vdots\\ 0\end{pmatrix}^T\Sigma^2\begin{pmatrix}0\\ -\frac{\rho_2}{2\sigma_{x_2}^2}\\ 0\\ \vdots\\ 0\\ -\frac{\rho_2}{2\sigma_y^2}\\ 0\\ \frac{1}{\sigma_{x_2}\sigma_y}\\ 0\\ \vdots\\ 0\end{pmatrix}=0.$$

To conclude, recall that Σ² is the bottom-right (2k+1) × (2k+1) block of Σ¹ and deduce the needed result. □


Proof. (Theorem 3.3)

For simplicity and w.l.o.g., it is enough to show that the event that $r_n^1$ and $r_n^2$ are not asymptotically independent occurs with positive probability. To do so, consider (i, j) = (1, 2) and notice that, due to the previous theorem, it is enough to prove that Equation (2) fails to hold with positive probability. Now, (X₁, X₂, Y) is Gaussian and hence, as was shown by Isserlis (1918), each of the covariances appearing in Equation (2) can be expressed as follows:

$$\begin{aligned}
C\left(X_1^2,X_2^2\right)&=E\left(X_1^2X_2^2\right)-E\left(X_1^2\right)E\left(X_2^2\right)=\sigma_{x_1}^2\sigma_{x_2}^2+2\sigma_{x_1x_2}^2-\sigma_{x_1}^2\sigma_{x_2}^2=2\sigma_{x_1x_2}^2\\
C\left(Y^2,X_2^2\right)&=\cdots=2\sigma_{x_2y}^2\\
C\left(X_1Y,X_2^2\right)&=E\left(X_1YX_2^2\right)-E(X_1Y)E\left(X_2^2\right)=\sigma_{x_2}^2\sigma_{x_1y}+2\sigma_{x_1x_2}\sigma_{x_2y}-\sigma_{x_1y}\sigma_{x_2}^2=2\sigma_{x_1x_2}\sigma_{x_2y}\\
C\left(X_1^2,Y^2\right)&=\cdots=2\sigma_{x_1y}^2\\
C\left(Y^2,Y^2\right)&=EY^4-EY^2EY^2=3\sigma_y^4-\sigma_y^2\sigma_y^2=2\sigma_y^4\\
C\left(X_1Y,Y^2\right)&=E\left(X_1Y^3\right)-E(X_1Y)E\left(Y^2\right)=3\sigma_y^2\sigma_{x_1y}-\sigma_{x_1y}\sigma_y^2=2\sigma_y^2\sigma_{x_1y}\\
C\left(X_1^2,X_2Y\right)&=\cdots=2\sigma_{x_1x_2}\sigma_{x_1y}\\
C\left(Y^2,X_2Y\right)&=\cdots=2\sigma_y^2\sigma_{x_2y}\\
C\left(X_1Y,X_2Y\right)&=E\left(X_1X_2Y^2\right)-E(X_1Y)E(X_2Y)=\sigma_y^2\sigma_{x_1x_2}+2\sigma_{x_1y}\sigma_{x_2y}-\sigma_{x_1y}\sigma_{x_2y}=\sigma_y^2\sigma_{x_1x_2}+\sigma_{x_1y}\sigma_{x_2y}
\end{aligned}$$

where $\sigma_{x_1}^2:=V(X_1)$, $\sigma_{x_2}^2:=V(X_2)$, $\sigma_{x_1x_2}:=C(X_1,X_2)$, $\sigma_{x_1y}:=C(X_1,Y)$ and $\sigma_{x_2y}:=C(X_2,Y)$. By insertion of these expressions into Equation (2), a necessary and sufficient condition for asymptotic independence of $r_n^1$ and $r_n^2$ is that the following equation holds with probability one:

$$(3)\qquad\frac{\rho_1\rho_2}{2}\rho_{x_1x_2}^2+\left(1-\rho_1^2-\rho_2^2\right)\rho_{x_1x_2}+\frac{\rho_1\rho_2^3+\rho_1^3\rho_2-\rho_1\rho_2}{2}=0$$

where $\rho_{x_1x_2}:=\sigma_{x_1x_2}/\sqrt{\sigma_{x_1}^2\sigma_{x_2}^2}$. The next step is to show that, with positive probability, (ρ₁, ρ₂) is such that this equation has no real solution. Since $\rho_{x_1x_2}$ is real with probability one, this in fact demonstrates that the equation fails to hold with positive probability. To see this, since ϕ(x) = 0 iff x = 0, Assumption 2.1 implies that $P\{\rho_i\neq 0,\ i=1,2\}=1$. Therefore, Equation (3) is almost surely a quadratic equation w.r.t. $\rho_{x_1x_2}$ which, depending on the values of ρ₁ and ρ₂, might have no real solution. Indeed, if (ρ₁, ρ₂) = (0.5, 0.9) ∈ (−1, 1)², then the discriminant of the quadratic equation equals −0.00855 < 0.
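The quoted discriminant is easy to verify numerically; reading Equation (3) as the quadratic a·t² + b·t + c in t = ρ_{x₁x₂}, with a = ρ₁ρ₂/2, b = 1 − ρ₁² − ρ₂² and c = (ρ₁ρ₂³ + ρ₁³ρ₂ − ρ₁ρ₂)/2:

```python
# Discriminant of Equation (3), viewed as a quadratic in rho_{x1x2},
# evaluated at (rho1, rho2) = (0.5, 0.9) as in the text.
rho1, rho2 = 0.5, 0.9
a = rho1 * rho2 / 2
b = 1 - rho1 ** 2 - rho2 ** 2
c = (rho1 * rho2 ** 3 + rho1 ** 3 * rho2 - rho1 * rho2) / 2
disc = b ** 2 - 4 * a * c
assert abs(disc - (-0.00855)) < 1e-12  # matches the value quoted in the text
assert disc < 0                        # hence no real root at this point
```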

Now, the fact that the discriminant of the quadratic equation is continuous in ρ₁ and ρ₂ at the point (0.5, 0.9) implies that there exists some δ > 0 such that the discriminant is negative for any (ρ₁, ρ₂) ∈ B_δ(0.5, 0.9) ⊂ (−1, 1)². By Assumption 2.1, G(⋅) is a prior such that $\phi(\rho_1),\ldots,\phi(\rho_k)\overset{\text{i.i.d.}}{\sim}N(0,\sigma_q^2)$ and hence[5]

$$P\left(\left(\phi(\rho_1),\phi(\rho_2)\right)\in\phi\left(B_\delta(0.5,0.9)\right)\right)>0.$$

To proceed, since ϕ(⋅) is strictly increasing and continuous, there exists a strictly monotone and continuous inverse $\phi^{-1}(\cdot)$. Therefore,

$$P\left(\left(\rho_1,\rho_2\right)\in B_\delta(0.5,0.9)\right)>0,$$

i.e., with positive probability there is no real solution, and hence the needed result is obtained.    □

Computation of Figure (1)

Input: n, k, u

Output: Histogram and normal QQ-plot of a random realization of $\phi(r_n^1),\ldots,\phi(r_n^k)$

  1. For j = 1, …, n:

  2. Draw $(X_{1j},\ldots,X_{kj})\sim N_k(0,I)$ and set $Y_j=\sum_{i=1}^{u}X_{ij}$.

  3. End for.

  4. For i = 1, …, k:

  5. Compute the empirical correlation between the vectors $(X_{i1},\ldots,X_{in})$ and $(Y_1,\ldots,Y_n)$ and denote it by $r_n^i$.

  6. End for.

  7. Return Histogram and normal QQ-plot of $\phi(r_n^1),\ldots,\phi(r_n^k)$.
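As an illustration, the procedure above can be implemented in plain Python. Here ϕ is taken to be Fisher's z-transform — an assumption, since the text only requires a strictly increasing continuous ϕ — and the plotting step is omitted, so the function simply returns a realization of ϕ(r_n^1), …, ϕ(r_n^k).

```python
import math
import random

def figure1_sample(n, k, u, seed=0):
    rng = random.Random(seed)
    # steps 1-3: draw X and set Y_j = sum of the first u coordinates
    X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    Y = [sum(row[:u]) for row in X]
    phis = []
    for i in range(k):  # steps 4-6: empirical correlation of gene i with Y
        xi = [row[i] for row in X]
        mx, my = sum(xi) / n, sum(Y) / n
        sxy = sum(a * b for a, b in zip(xi, Y)) / n - mx * my
        sx2 = sum(a * a for a in xi) / n - mx * mx
        sy2 = sum(b * b for b in Y) / n - my * my
        r = sxy / math.sqrt(sx2 * sy2)
        phis.append(math.atanh(r))  # step 7: Fisher's z-transform
    return phis

phis = figure1_sample(n=200, k=50, u=5)
assert len(phis) == 50
```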

Computation of Figure (2)

Input: n, k, u, B

Output: Estimates of the expectation and standard deviation of the proportion of genes that are selected correctly, as computed by the straightforward and fast approximated approaches.

  1. For t = 1, …, B:

  2. For j = 1, …, n:

  3. Draw $(X_{1j},\ldots,X_{kj})\sim N_k(0,I)$ and set $Y_j=\sum_{i=1}^{u}X_{ij}$.

  4. End for.

  5. For i = 1, …, k:

  6. Compute the empirical correlation between the vectors $(X_{i1},\ldots,X_{in})$ and $(Y_1,\ldots,Y_n)$ and denote it by $r_n^i(t)$.

  7. Compute $d_t=\frac{1}{u}\sum_{i=1}^{u}I_i$ where $I_i$ indicates whether $|r_n^i(t)|$ is one of the u highest values of the vector $(|r_n^1(t)|,\ldots,|r_n^k(t)|)$.

  8. End for.

  9. Compute the empirical variance of $r_n^1(t),\ldots,r_n^k(t)$ and denote it by W.

  10. Set $\hat{\sigma}_q=\sqrt{W-\frac{1}{n-3}}$.

  11. Draw $\phi(\rho_1),\ldots,\phi(\rho_k)\overset{\text{i.i.d.}}{\sim}N(0,\hat{\sigma}_q^2)$.

  12. Compute the set of indices associated with the u highest values of the vector $(|\phi(\rho_1)|,\ldots,|\phi(\rho_k)|)$, denote it by $S_1$, and draw $z_1(t),\ldots,z_k(t)\overset{\text{i.i.d.}}{\sim}N\left(0,\frac{1}{n-3}\right)$.

  13. For i = 1, …, k:

  14. Compute $v_i(t)=|r_n^i(t)+z_i(t)|$.

  15. End for.

  16. Compute the set of indices associated with the u highest values of the vector $(v_1(t),\ldots,v_k(t))$, denote it by $S_2$, and compute $c_t=\frac{|S_1\cap S_2|}{u}$.

  17. End for.

  18. Compute $\bar{d}=\frac{1}{B}\sum_{t=1}^{B}d_t$ and $\bar{c}=\frac{1}{B}\sum_{t=1}^{B}c_t$.

  19. Return $\bar{d}$, $\bar{c}$, $\sqrt{\frac{1}{B}\sum_{t=1}^{B}\left(c_t-\bar{c}\right)^2}$ and $\sqrt{\frac{1}{B}\sum_{t=1}^{B}\left(d_t-\bar{d}\right)^2}$.
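The Figure 2 procedure can be sketched in plain Python along the following lines. Two interpretive assumptions are made where the extracted pseudocode is ambiguous: ϕ is taken to be Fisher's z-transform, and W is computed from the transformed correlations ϕ(r_n^i(t)) — the reading under which W − 1/(n − 3) is a natural estimate of σ_q²; the function and parameter names are illustrative.

```python
import math
import random
import statistics

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n - mx * my
    sx2 = sum(a * a for a in x) / n - mx * mx
    sy2 = sum(b * b for b in y) / n - my * my
    return sxy / math.sqrt(sx2 * sy2)

def top_u(values, u):
    """Indices of the u largest entries of `values`."""
    return set(sorted(range(len(values)), key=lambda i: -values[i])[:u])

def figure2(n, k, u, B, seed=0):
    rng = random.Random(seed)
    d, c = [], []
    for _ in range(B):
        X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
        Y = [sum(row[:u]) for row in X]
        r = [corr([row[i] for row in X], Y) for i in range(k)]
        # straightforward proportion: the informative genes are indices 0..u-1
        d.append(len(top_u([abs(v) for v in r], u) & set(range(u))) / u)
        # fast approximated proportion (steps 9-16)
        W = statistics.pvariance([math.atanh(v) for v in r])
        sq = math.sqrt(max(W - 1.0 / (n - 3), 0.0))
        S1 = top_u([abs(rng.gauss(0.0, sq)) for _ in range(k)], u)
        z = [rng.gauss(0.0, math.sqrt(1.0 / (n - 3))) for _ in range(k)]
        S2 = top_u([abs(r[i] + z[i]) for i in range(k)], u)
        c.append(len(S1 & S2) / u)
    d_bar, c_bar = sum(d) / B, sum(c) / B
    d_sd = math.sqrt(sum((x - d_bar) ** 2 for x in d) / B)
    c_sd = math.sqrt(sum((x - c_bar) ** 2 for x in c) / B)
    return d_bar, c_bar, d_sd, c_sd

d_bar, c_bar, d_sd, c_sd = figure2(n=100, k=40, u=4, B=5)
assert 0.0 <= d_bar <= 1.0 and 0.0 <= c_bar <= 1.0
```

Note the guard `max(..., 0.0)`: for small priors W may fall below 1/(n − 3), in which case the plug-in estimate of σ_q² is truncated at zero.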

References

Alam, K. (1979): “Distribution of sample correlation coefficients.” Nav. Res. Logist., 26, 327–330. doi:10.1002/nav.3800260212.

Alam, K. and M. H. S. Rizvi (1976): “Selection of largest multiple correlation coefficients: exact sample size case.” Ann. Stat., 4, 614–620. doi:10.1214/aos/1176343467.

Carvalho, C., J. Chang, J. Lucas, J. Nevins, Q. Wang and M. West (2008): “High-dimensional sparse factor modeling: applications in gene expression genomics.” JASA, 103, 1438–1456. doi:10.1198/016214508000000869.

Cui, X. and J. Wilson (2008): “On the probability of correct selection for large k populations with application to microarray data.” Biometrical J., 50, 833–870.

Cui, X., H. Zhao and J. Wilson (2010): “Optimized ranking and selection methods for feature selection with application in microarray experiments.” J. Biopharm. Stat., 20, 223–239. doi:10.1080/10543400903572720.

Dobra, A., C. Hans, B. Jones, J. R. Nevins, G. Yao and M. West (2004): “Sparse graphical models for exploring gene expression data.” J. Multivariate Anal., 90, 196–212. doi:10.1016/j.jmva.2004.02.009.

Donoho, D. (2000): High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture.

Ein-Dor, L., O. Zuk and E. Domany (2006): “Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.” Proc. Natl. Acad. Sci. USA, 103, 5923–5928. doi:10.1073/pnas.0601231103.

Ferguson, T. (1996): A course in large sample theory. Chapman and Hall, London. doi:10.1007/978-1-4899-4549-5.

Fisher, R. (1915): “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population.” Biometrika, 10, 507–521. doi:10.2307/2331838.

Fisher, R. (1921): “On the probable error of a coefficient of correlation deduced from a small sample.” Metron, 1, 3–32.

Guyon, I. and A. Elisseeff (2003): “An introduction to variable and feature selection.” J. Mach. Learn. Res., 3, 1157–1182.

Hall, M. (1998): Correlation based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.

Isserlis, L. (1918): “On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables.” Biometrika, 12, 134–139. doi:10.1093/biomet/12.1-2.134.

Jacobovic, R. and O. Zuk (2017): “On the asymptotic efficiency of selection procedures for independent Gaussian populations.” Electron. J. Stat., 11, 5375–5405. doi:10.1214/17-EJS1375.

Knowles, D. and Z. Ghahramani (2011): “Nonparametric Bayesian sparse factor models with application to gene expression modeling.” Ann. Appl. Stat., 5, 1534–1552. doi:10.1214/10-AOAS435.

Levy, K. (1975): “Selecting the best population from among k binomial populations or the population with the largest correlation coefficient from among k bivariate normal populations.” Psychometrika, 40, 121–122. doi:10.1007/BF02291486.

Levy, K. (1977): “Appropriate sample sizes for selecting a population with the largest correlation coefficient from among k bivariate normal populations.” Educ. Psychol. Meas., 37, 61–66. doi:10.1177/001316447703700107.

McDowell, I. C., D. Manandhar, C. Vockley, A. Schmid and T. Reddy (2018): “Clustering gene expression time series data using an infinite Gaussian process mixture model.” PLoS Comput. Biol., 14, e1005896. doi:10.1371/journal.pcbi.1005896.

Pakman, A. and L. Paninski (2014): “Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians.” J. Comput. Graph. Stat., 23, 518–542. doi:10.1080/10618600.2013.788448.

Ramberg, J. (1977): “Selecting the best predictor variate.” Commun. Stat. Theory Methods, 11, 1133–1147. doi:10.1080/03610927708827556.

Rizvi, M. H. S. (1973): “Selection of largest multiple correlation coefficients: asymptotic case.” J. Am. Stat. Assoc., 68, 184–188. doi:10.1080/01621459.1973.10481360.

Spiegel, M. R. (1968): Mathematical handbook of formulas and tables. Schaum.

Wilcox, R. (1978): “Some comments on selecting the best of several binomial populations or the bivariate normal population having the largest correlation coefficient.” Psychometrika, 43, 127–128. doi:10.1007/BF02294099.

Yeung, K., C. Fraley, A. Murua, A. Raftery and W. Ruzzo (2001): “Model-based clustering and data transformations for gene expression data.” Bioinformatics, 17, 977–987. doi:10.1093/bioinformatics/17.10.977.

Yu, L. and H. Liu (2003): “Feature selection for high-dimensional data: a fast correlation-based filter solution.” Proceedings of the Twentieth International Conference on Machine Learning, 856–863.

Zuk, O., L. Ein-Dor and E. Domany (2007): “Ranking under uncertainty.” UAI, 466–473.


©2018 Walter de Gruyter GmbH, Berlin/Boston
