Published by De Gruyter July 14, 2018

On the relation between the true and sample correlations under Bayesian modelling of gene expression datasets

Royi Jacobovic

Abstract

The prediction of cancer prognosis and metastatic potential immediately after the initial diagnosis is a major challenge in current clinical research. The relevance of such a signature is clear, as it would free many patients from the agony and toxic side-effects associated with adjuvant chemotherapy that is automatically, and sometimes carelessly, prescribed to them. Motivated by this issue, several previous works presented a Bayesian model which led to the following conclusion: thousands of samples are needed to generate a robust gene list for predicting outcome. This conclusion rests on several statistical assumptions, including asymptotic independence of the sample correlations. The current work makes two main contributions: (1) It shows that while the assumptions of the Bayesian model discussed by previous papers seem non-restrictive, they are in fact quite strong. To demonstrate this point, it is shown that some standard sparse and Gaussian models are not included in the set of models which are mathematically consistent with these assumptions. (2) It is shown that the empirical Bayes methodology which was applied in order to test the relevant assumptions does not detect severe violations, and consequently the required sample size may be overestimated. Finally, we suggest that under some regularity conditions the current theoretical results may be used to develop a new method for testing the asymptotic independence assumption.

Acknowledgement

Special thanks to Or Zuk, Yuval Benjamini and Jonathan Fiat for their helpful comments.

Appendix


Proof. (Theorem 3.1)

Due to symmetry considerations, it is enough to prove that

$$P\left\{\rho_{x_1x_2}=c\right\}<1,\qquad\forall\, c\in(-1,1).$$

To this end, assume towards a contradiction that there exists some c ∈ (−1, 1) such that $P\{\rho_{x_1x_2}=c\}=1$. Since G(⋅) is a probability distribution over Θ and it is known that almost surely $F_\theta(\cdot)$ has finite first two moments, the probability (with respect to G(⋅)) that the correlation matrix of (X₁, X₂, Y) is positive semi-definite equals one. To obtain a contradiction, it is shown that, with positive probability, the characteristic polynomial of this correlation matrix has a negative root. In detail, since $P\{\rho_{x_1x_2}=c\}=1$, almost surely the characteristic polynomial of the correlation matrix of (X₁, X₂, Y) is given by

$$P(\lambda;\rho_1,\rho_2,c)=\det\begin{bmatrix}1-\lambda & c & \rho_1\\ c & 1-\lambda & \rho_2\\ \rho_1 & \rho_2 & 1-\lambda\end{bmatrix}=(1-\lambda)^3-(1-\lambda)\left(\rho_1^2+\rho_2^2+c^2\right)+2c\rho_1\rho_2.$$

Now, set ρ₁ = 0 and obtain the following equation:

$$P(\lambda;\rho_1=0,\rho_2,c)=(1-\lambda)^3-(1-\lambda)\left(c^2+\rho_2^2\right)=0.$$

Clearly, if $c\in(-1,1)\setminus\{0\}$, then for $\rho_2=\frac{\sqrt{2-c^2}}{\sqrt{2}}\in\left(\frac{1}{\sqrt{2}},1\right)$ there exists one root of P(λ) which is given by

$$\hat{\lambda}=1-\sqrt{1+\frac{c^2}{2}}<0,$$

i.e., there exists a negative solution of the equation

$$P\left(\lambda;\rho_1=0,\rho_2=\frac{\sqrt{2-c^2}}{\sqrt{2}},c\right)=0.$$

Since Cardano's formula [see e.g. Spiegel (1968)] implies that the solutions of the equation P(λ; ρ₁, ρ₂, c) = 0 are continuous in (ρ₁, ρ₂) at the point $p:=\left(0,\frac{\sqrt{2-c^2}}{\sqrt{2}}\right)$, there exists δ > 0 associated with a ball $B_\delta(p)\subset(-1,1)^2$ such that for any (ρ₁, ρ₂) ∈ B_δ(p), P(λ; ρ₁, ρ₂, c) has a negative root. In addition, the fact that ϕ(⋅) is a strictly increasing continuous function on (−1, 1) and $\phi(\rho_1),\phi(\rho_2)\overset{\text{i.i.d.}}{\sim}N(0,\sigma_q^2)$ implies that (ρ₁, ρ₂) is continuously distributed over (−1, 1)². Therefore, with positive probability, P(λ; ρ₁, ρ₂, c) has a negative root, which yields a contradiction. To complete the proof, if c = 0, then

$$P(\lambda;\rho_1,\rho_2,c=0)=(1-\lambda)\left[(1-\lambda)^2-\rho_1^2-\rho_2^2\right].$$

If ρ₁ and ρ₂ are both close enough to 1, then this characteristic polynomial has a negative root. Therefore, the same arguments used for the case $c\in(-1,1)\setminus\{0\}$ may be carried out once again to justify the existence of a negative root with positive probability.

   □
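As a numerical sanity check of the construction above (illustrative only, not part of the proof): for ρ₁ = 0 the polynomial factors as (1 − λ)[(1 − λ)² − (c² + ρ₂²)], so its smallest root can be computed directly in plain Python.

```python
import math

def min_root(rho2, c):
    """Smallest root of P(lam; rho1=0, rho2, c) = (1-lam)[(1-lam)^2 - (c^2+rho2^2)].
    The roots are 1 and 1 +/- sqrt(c^2 + rho2^2)."""
    s = math.sqrt(c * c + rho2 * rho2)
    return min(1.0, 1.0 - s, 1.0 + s)

c = 0.5                                      # any c in (-1,1)\{0}
rho2 = math.sqrt(2 - c * c) / math.sqrt(2)   # the choice used in the proof
lam_hat = min_root(rho2, c)

# with this rho2, c^2 + rho2^2 = 1 + c^2/2, so the smallest root is negative
assert abs(lam_hat - (1 - math.sqrt(1 + c * c / 2))) < 1e-12
assert lam_hat < 0
```

The assertion confirms that the smallest eigenvalue equals 1 − √(1 + c²/2) < 0, so the matrix cannot be positive semi-definite.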


Lemma 5.1. (Multivariate Delta Method)

Let Ψ be an m-dimensional positive definite matrix and let $\{X_n\}_{n=1}^{\infty}$ be a sequence of m-dimensional random vectors such that

$$\sqrt{n}\left(X_n-\mu\right)\xrightarrow{L}N(0,\Psi).$$

If $f:\mathbb{R}^m\to\mathbb{R}^d$ has a derivative matrix $\nabla f(\cdot)$ which is continuous in some neighbourhood of μ, then

$$\sqrt{n}\left[f(X_n)-f(\mu)\right]\xrightarrow{L}N\left(0,\left[\nabla f(\mu)\right]^T\Psi\left[\nabla f(\mu)\right]\right)$$

where $\nabla f(\tilde{\beta})=\left[\frac{\partial f_j}{\partial\beta_i}\right]_{ij}(\tilde{\beta})$ is the partial derivative matrix of f(⋅) at the point $\tilde{\beta}\in\mathbb{R}^m$.

Proof.

See chapter 7 of Ferguson (1996).    □
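A small Monte Carlo illustration of the lemma in the scalar case m = d = 1 (a sketch under illustrative choices of f, μ and σ, not part of the text): with f(x) = x², the limiting variance of √n(f(X̄ₙ) − f(μ)) should be f′(μ)²σ².

```python
import math
import random
import statistics

random.seed(0)
n, reps = 500, 2000
mu, sigma = 2.0, 1.0

def f(x):
    return x * x  # f'(mu) = 2 * mu

vals = []
for _ in range(reps):
    # X_n = mean of n i.i.d. N(mu, sigma^2) draws
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (f(xbar) - f(mu)))

emp_var = statistics.pvariance(vals)
theo_var = (2.0 * mu) ** 2 * sigma ** 2  # delta-method prediction: 16
assert abs(emp_var - theo_var) / theo_var < 0.25
```

The empirical variance of the rescaled differences agrees with the delta-method prediction up to Monte Carlo error.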


Proof. (Theorem 3.2)

Consider the following notations:

$$m_{x_i}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij},\quad i=1,\ldots,k$$
$$m_{x_i^2}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij}^2,\quad i=1,\ldots,k$$
$$m_{x_iy}:=\frac{1}{n}\sum_{j=1}^{n}X_{ij}Y_j,\quad i=1,\ldots,k$$
$$m_y:=\frac{1}{n}\sum_{j=1}^{n}Y_j,\qquad m_{y^2}:=\frac{1}{n}\sum_{j=1}^{n}Y_j^2$$
$$s_{x_i}^2:=m_{x_i^2}-m_{x_i}^2,\qquad s_{x_iy}:=m_{x_iy}-m_{x_i}m_y,\quad i=1,\ldots,k$$
$$s_y^2:=m_{y^2}-m_y^2.$$

Using these notations, the sample correlations can be written as:

$$r_i:=r_n^i=\frac{s_{x_iy}}{s_{x_i}s_y},\quad i=1,\ldots,k.$$
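As a quick illustration that these moment notations reproduce the usual sample correlation, here is a plain-Python check on arbitrary made-up data:

```python
import math

x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [1.2, 1.9, 4.1, 2.8, 5.3]
n = len(x)

m_x, m_y = sum(x) / n, sum(y) / n
m_x2 = sum(v * v for v in x) / n
m_y2 = sum(v * v for v in y) / n
m_xy = sum(a * b for a, b in zip(x, y)) / n

s_x2 = m_x2 - m_x ** 2       # s_x^2  = m_{x^2} - m_x^2
s_y2 = m_y2 - m_y ** 2       # s_y^2  = m_{y^2} - m_y^2
s_xy = m_xy - m_x * m_y      # s_{xy} = m_{xy} - m_x m_y

r = s_xy / math.sqrt(s_x2 * s_y2)

# the same correlation from the centred-sums definition
num = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
den = math.sqrt(sum((a - m_x) ** 2 for a in x) * sum((b - m_y) ** 2 for b in y))
assert abs(r - num / den) < 1e-12
```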

Before going forward, let us mention that the following analysis is carried out with respect to the probability space conditioned on θ, for which the first four moments of $F_\theta(\cdot)$ are finite. Since it is given that $F_\theta(\cdot)$ has such finite moments with probability one, the results of the upcoming analysis hold with probability one. Now, with respect to this setup, an application of the multivariate central limit theorem (CLT) justifies the convergence

$$\sqrt{n}\left[\begin{pmatrix}m_{x_1}\\ \vdots\\ m_{x_k}\\ m_y\\ m_{x_1^2}\\ \vdots\\ m_{x_k^2}\\ m_{y^2}\\ m_{x_1y}\\ \vdots\\ m_{x_ky}\end{pmatrix}-\begin{pmatrix}0\\ \vdots\\ 0\\ 0\\ \sigma_{x_1}^2\\ \vdots\\ \sigma_{x_k}^2\\ \sigma_y^2\\ \sigma_{x_1y}\\ \vdots\\ \sigma_{x_ky}\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_{3k+2}\left(0,\Sigma^1\right)$$

where the covariance matrix Σ¹ is given by $\Sigma^1_{ij}=C(Z_i,Z_j)$, $1\leq i,j\leq 3k+2$, and Z is as follows:

$$Z=\left(X_1,\ldots,X_k,Y,X_1^2,\ldots,X_k^2,Y^2,X_1Y,\ldots,X_kY\right)^T.$$

Define a function $\eta:\mathbb{R}^{3k+2}\to\mathbb{R}^{2k+1}$ by

$$\eta(z)=\begin{pmatrix}z_{k+2}-z_1^2\\ \vdots\\ z_{2k+2}-z_{k+1}^2\\ z_{2k+3}-z_{k+1}z_1\\ \vdots\\ z_{3k+2}-z_{k+1}z_k\end{pmatrix}$$

and notice that

  1. $\eta\left[\left(m_{x_1},\ldots,m_{x_k},m_y,m_{x_1^2},\ldots,m_{x_k^2},m_{y^2},m_{x_1y},\ldots,m_{x_ky}\right)^T\right]=\left(s_{x_1}^2,\ldots,s_{x_k}^2,s_y^2,s_{x_1y},\ldots,s_{x_ky}\right)^T$
  2. $\nabla\eta(z)=\begin{pmatrix}-2\,\mathrm{diag}(z_1,\ldots,z_{k+1}) & B\\ I_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ O_{k\times(k+1)} & I_{k\times k}\end{pmatrix}$

    where the matrix B is given by

    $$B:=\begin{pmatrix}-z_{k+1}I_{k\times k}\\ -(z_1,\ldots,z_k)\end{pmatrix}.$$

Thus, the multivariate delta method can be used in order to deduce that

$$\sqrt{n}\left[\begin{pmatrix}s_{x_1}^2\\ \vdots\\ s_{x_k}^2\\ s_y^2\\ s_{x_1y}\\ \vdots\\ s_{x_ky}\end{pmatrix}-\begin{pmatrix}\sigma_{x_1}^2\\ \vdots\\ \sigma_{x_k}^2\\ \sigma_y^2\\ \sigma_{x_1y}\\ \vdots\\ \sigma_{x_ky}\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_{2k+1}\left(0,\Sigma^2\right)$$

where Σ² is given by

$$\Sigma^2=\nabla\eta^T\left[\left(0,\ldots,0,\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right]\;\Sigma^1\;\nabla\eta\left[\left(0,\ldots,0,\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right].$$

Notice that for the vector of inputs written above, ∇η is given by

$$\begin{pmatrix}O_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ I_{(k+1)\times(k+1)} & O_{(k+1)\times k}\\ O_{k\times(k+1)} & I_{k\times k}\end{pmatrix},$$

which means that Σ² equals the bottom-right (2k+1) × (2k+1) block of Σ¹. Considering this result, define another function $\gamma:\mathbb{R}^{2k+1}\to\mathbb{R}^k$ as follows:

$$\gamma(v)=\begin{pmatrix}\frac{v_{k+2}}{\sqrt{v_{k+1}v_1}}\\ \vdots\\ \frac{v_{2k+1}}{\sqrt{v_{k+1}v_k}}\end{pmatrix}$$

which satisfies

  1. $\gamma\left[\left(s_{x_1}^2,\ldots,s_{x_k}^2,s_y^2,s_{x_1y},\ldots,s_{x_ky}\right)^T\right]=\left(r_1,\ldots,r_k\right)^T$
  2. $\nabla\gamma^T(v)=\left[A\,|\,B\,|\,C\right]$

    where

    $$A:=\mathrm{diag}\left(-\frac{v_{k+2}}{2\sqrt{v_1^3v_{k+1}}},\ldots,-\frac{v_{2k+1}}{2\sqrt{v_k^3v_{k+1}}}\right)$$
    $$B:=\left(-\frac{v_{k+2}}{2\sqrt{v_1v_{k+1}^3}},\ldots,-\frac{v_{2k+1}}{2\sqrt{v_kv_{k+1}^3}}\right)^T$$
    $$C:=\mathrm{diag}\left(\frac{1}{\sqrt{v_1v_{k+1}}},\ldots,\frac{1}{\sqrt{v_kv_{k+1}}}\right).$$

Therefore, the multivariate delta method can be used once again to obtain the limit

$$\sqrt{n}\left[\begin{pmatrix}r_1\\ \vdots\\ r_k\end{pmatrix}-\begin{pmatrix}\rho_1\\ \vdots\\ \rho_k\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_k\left(0,\Sigma^3\right)$$

where Σ3 is given by

$$\Sigma^3=\nabla\gamma^T\left[\left(\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right]\;\Sigma^2\;\nabla\gamma\left[\left(\sigma_{x_1}^2,\ldots,\sigma_{x_k}^2,\sigma_y^2,\sigma_{x_1y},\ldots,\sigma_{x_ky}\right)^T\right].$$

Define $\phi:(-1,1)^k\to\mathbb{R}^k$ as follows:

$$\phi(w):=\left(\phi(w_1),\ldots,\phi(w_k)\right)^T$$

and notice that ϕ(⋅) is differentiable in its domain, and hence its derivative matrix is given by

$$\nabla\phi(w)=\mathrm{diag}\left[\phi'(w_1),\ldots,\phi'(w_k)\right].$$

If so, one more application of the multivariate delta method yields the convergence

$$\sqrt{n}\left[\begin{pmatrix}\phi(r_1)\\ \vdots\\ \phi(r_k)\end{pmatrix}-\begin{pmatrix}\phi(\rho_1)\\ \vdots\\ \phi(\rho_k)\end{pmatrix}\right]\xrightarrow[n\to\infty]{L}N_k\left(0,\Sigma^4\right)$$

where Σ⁴ is given by

$$\left[\Sigma^4\right]_{ij}=\phi'(\rho_i)\phi'(\rho_j)\left[\Sigma^3\right]_{ij},\quad 1\leq i,j\leq k.$$
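The scaling factor ϕ′ becomes explicit when ϕ is taken to be Fisher's z-transform — an assumption here, since the argument only requires a differentiable strictly increasing ϕ, but it is the choice consistent with the N(0, 1/(n − 3)) noise used in the computations of Figures 1 and 2:

```latex
\phi(\rho)=\tfrac{1}{2}\log\frac{1+\rho}{1-\rho}=\operatorname{artanh}(\rho),\qquad
\phi'(\rho)=\frac{1}{1-\rho^2}>0,\quad\rho\in(-1,1),
```

so that in this case $[\Sigma^4]_{ij}=[\Sigma^3]_{ij}/\big((1-\rho_i^2)(1-\rho_j^2)\big)$.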

Since ϕ′(⋅) is positive for any possible input and non-correlation is equivalent to independence under a Gaussian law, for any i ≠ j asymptotic independence of ϕ(r_i) and ϕ(r_j) is equivalent to $[\Sigma^3]_{ij}=0$. To see how the needed result stems from this observation, for simplicity and w.l.o.g., consider the case where i = 1 and j = 2. In this case r₁ and r₂ are asymptotically independent iff the following equation holds:

$$(2)\qquad\Sigma^3_{12}=\begin{pmatrix}-\frac{\rho_1}{2\sigma_{x_1}^2}\\ 0\\ \vdots\\ 0\\ -\frac{\rho_1}{2\sigma_y^2}\\ \frac{1}{\sigma_{x_1}\sigma_y}\\ 0\\ \vdots\\ 0\end{pmatrix}^T\Sigma^2\begin{pmatrix}0\\ -\frac{\rho_2}{2\sigma_{x_2}^2}\\ 0\\ \vdots\\ 0\\ -\frac{\rho_2}{2\sigma_y^2}\\ 0\\ \frac{1}{\sigma_{x_2}\sigma_y}\\ 0\\ \vdots\\ 0\end{pmatrix}=0.$$

To conclude, recall that Σ² is the bottom-right (2k+1) × (2k+1) block of Σ¹ and deduce the needed result. □


Proof. (Theorem 3.3)

For simplicity and w.l.o.g., it is enough to show that the event that $r_n^1$ and $r_n^2$ are not asymptotically independent occurs with positive probability. To do so, consider (i, j) = (1, 2) and notice that, due to the previous theorem, it is enough to prove that Equation (2) fails to hold with positive probability. Now, (X₁, X₂, Y) is Gaussian and hence, as was shown by Isserlis (1918), each of the covariances appearing in Equation (2) can be expressed as follows:

$$\begin{aligned}
C\left(X_1^2,X_2^2\right)&=E\left(X_1^2X_2^2\right)-E\left(X_1^2\right)E\left(X_2^2\right)=\sigma_{x_1}^2\sigma_{x_2}^2+2\sigma_{x_1x_2}^2-\sigma_{x_1}^2\sigma_{x_2}^2=2\sigma_{x_1x_2}^2\\
C\left(Y^2,X_2^2\right)&=\cdots=2\sigma_{x_2y}^2\\
C\left(X_1Y,X_2^2\right)&=E\left(X_1YX_2^2\right)-E(X_1Y)E\left(X_2^2\right)=\sigma_{x_2}^2\sigma_{x_1y}+2\sigma_{x_1x_2}\sigma_{x_2y}-\sigma_{x_1y}\sigma_{x_2}^2=2\sigma_{x_1x_2}\sigma_{x_2y}\\
C\left(X_1^2,Y^2\right)&=\cdots=2\sigma_{x_1y}^2\\
C\left(Y^2,Y^2\right)&=EY^4-EY^2EY^2=3\sigma_y^4-\sigma_y^2\sigma_y^2=2\sigma_y^4\\
C\left(X_1Y,Y^2\right)&=E\left(X_1Y^3\right)-E(X_1Y)E\left(Y^2\right)=3\sigma_y^2\sigma_{x_1y}-\sigma_{x_1y}\sigma_y^2=2\sigma_y^2\sigma_{x_1y}\\
C\left(X_1^2,X_2Y\right)&=\cdots=2\sigma_{x_1x_2}\sigma_{x_1y}\\
C\left(Y^2,X_2Y\right)&=\cdots=2\sigma_y^2\sigma_{x_2y}\\
C\left(X_1Y,X_2Y\right)&=E\left(X_1X_2Y^2\right)-E(X_1Y)E(X_2Y)=\sigma_y^2\sigma_{x_1x_2}+2\sigma_{x_1y}\sigma_{x_2y}-\sigma_{x_1y}\sigma_{x_2y}=\sigma_y^2\sigma_{x_1x_2}+\sigma_{x_1y}\sigma_{x_2y}
\end{aligned}$$

where $\sigma_{x_1}^2:=V(X_1)$, $\sigma_{x_2}^2:=V(X_2)$, $\sigma_{x_1x_2}:=C(X_1,X_2)$, $\sigma_{x_1y}:=C(X_1,Y)$ and $\sigma_{x_2y}:=C(X_2,Y)$. By insertion of these expressions into Equation (2), a necessary and sufficient condition for asymptotic independence of $r_n^1$ and $r_n^2$ is that the following equation holds with probability one:

$$(3)\qquad\frac{\rho_1\rho_2}{2}\rho_{x_1x_2}^2+\left(1-\rho_1^2-\rho_2^2\right)\rho_{x_1x_2}+\frac{\rho_1\rho_2^3+\rho_1^3\rho_2-\rho_1\rho_2}{2}=0$$

where $\rho_{x_1x_2}:=\sigma_{x_1x_2}/\sqrt{\sigma_{x_1}^2\sigma_{x_2}^2}$. The next step is to show that, with positive probability, (ρ₁, ρ₂) is such that this equation has no real solution. Since $\rho_{x_1x_2}$ is real with probability one, this in fact demonstrates that the equation fails to hold with positive probability. To see this, since ϕ(x) = 0 iff x = 0, Assumption 2.1 implies that $P\{\rho_i\neq 0,\ i=1,2\}=1$. Therefore, Equation (3) is almost surely a quadratic equation w.r.t. $\rho_{x_1x_2}$ which, depending on the values of ρ₁ and ρ₂, might have no real solution. Indeed, if (ρ₁, ρ₂) = (0.5, 0.9) ∈ (−1, 1)², then the discriminant of the quadratic equation equals −0.00855 < 0.
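The quoted discriminant is easy to verify numerically; reading Equation (3) as the quadratic a·t² + b·t + c in t = ρ_{x₁x₂}, with a = ρ₁ρ₂/2, b = 1 − ρ₁² − ρ₂² and c = (ρ₁ρ₂³ + ρ₁³ρ₂ − ρ₁ρ₂)/2:

```python
# Discriminant of Equation (3), viewed as a quadratic in rho_{x1x2},
# evaluated at (rho1, rho2) = (0.5, 0.9) as in the text.
rho1, rho2 = 0.5, 0.9
a = rho1 * rho2 / 2
b = 1 - rho1 ** 2 - rho2 ** 2
c = (rho1 * rho2 ** 3 + rho1 ** 3 * rho2 - rho1 * rho2) / 2
disc = b ** 2 - 4 * a * c
assert abs(disc - (-0.00855)) < 1e-12  # matches the value quoted in the text
assert disc < 0                        # hence no real root at this point
```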

Now, the fact that the discriminant of the quadratic equation is continuous in ρ₁ and ρ₂ at the point (0.5, 0.9) implies that there exists some δ > 0 such that the discriminant is negative for any (ρ₁, ρ₂) ∈ B_δ(0.5, 0.9) ⊂ (−1, 1)². By Assumption 2.1, G(⋅) is a prior such that $\phi(\rho_1),\ldots,\phi(\rho_k)\overset{\text{i.i.d.}}{\sim}N(0,\sigma_q^2)$ and hence[5]

$$P\left(\left(\phi(\rho_1),\phi(\rho_2)\right)\in\phi\left(B_\delta(0.5,0.9)\right)\right)>0.$$

To proceed, since ϕ(⋅) is strictly increasing and continuous, there exists a strictly monotone and continuous inverse $\phi^{-1}(\cdot)$. Therefore,

$$P\left(\left(\rho_1,\rho_2\right)\in B_\delta(0.5,0.9)\right)>0,$$

i.e., with positive probability there is no real solution, and hence the needed result is obtained.    □

Computation of Figure (1)

Input: n, k, u

Output: Histogram and normal QQ-plot of a random realization of $\phi(r_n^1),\ldots,\phi(r_n^k)$

  1. For j = 1, …, n:

  2. Draw $(X_{1j},\ldots,X_{kj})\sim N_k(0,I)$ and set $Y_j=\sum_{i=1}^{u}X_{ij}$.

  3. End for.

  4. For i = 1, …, k:

  5. Compute the empirical correlation between the vectors $(X_{i1},\ldots,X_{in})$ and $(Y_1,\ldots,Y_n)$ and denote it by $r_n^i$.

  6. End for.

  7. Return Histogram and normal QQ-plot of $\phi(r_n^1),\ldots,\phi(r_n^k)$.
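As an illustration, the procedure above can be implemented in plain Python. Here ϕ is taken to be Fisher's z-transform — an assumption, since the text only requires a strictly increasing continuous ϕ — and the plotting step is omitted, so the function simply returns a realization of ϕ(r_n^1), …, ϕ(r_n^k).

```python
import math
import random

def figure1_sample(n, k, u, seed=0):
    rng = random.Random(seed)
    # steps 1-3: draw X and set Y_j = sum of the first u coordinates
    X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    Y = [sum(row[:u]) for row in X]
    phis = []
    for i in range(k):  # steps 4-6: empirical correlation of gene i with Y
        xi = [row[i] for row in X]
        mx, my = sum(xi) / n, sum(Y) / n
        sxy = sum(a * b for a, b in zip(xi, Y)) / n - mx * my
        sx2 = sum(a * a for a in xi) / n - mx * mx
        sy2 = sum(b * b for b in Y) / n - my * my
        r = sxy / math.sqrt(sx2 * sy2)
        phis.append(math.atanh(r))  # step 7: Fisher's z-transform
    return phis

phis = figure1_sample(n=200, k=50, u=5)
assert len(phis) == 50
```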

Computation of Figure (2)

Input: n, k, u, B

Output: Estimates of the expectation and standard deviation of the proportion of genes that are selected correctly, as computed by the straightforward and fast approximated approaches.

  1. For t = 1, …, B:

  2. For j = 1, …, n:

  3. Draw $(X_{1j},\ldots,X_{kj})\sim N_k(0,I)$ and set $Y_j=\sum_{i=1}^{u}X_{ij}$.

  4. End for.

  5. For i = 1, …, k:

  6. Compute the empirical correlation between the vectors $(X_{i1},\ldots,X_{in})$ and $(Y_1,\ldots,Y_n)$ and denote it by $r_n^i(t)$.

  7. Compute $d_t=\frac{1}{u}\sum_{i=1}^{u}I_i$ where $I_i$ indicates whether $|r_n^i(t)|$ is one of the u highest values of the vector $(|r_n^1(t)|,\ldots,|r_n^k(t)|)$.

  8. End for.

  9. Compute the empirical variance of $r_n^1(t),\ldots,r_n^k(t)$ and denote it by W.

  10. Set $\hat{\sigma}_q=\sqrt{W-\frac{1}{n-3}}$.

  11. Draw $\phi(\rho_1),\ldots,\phi(\rho_k)\overset{\text{i.i.d.}}{\sim}N(0,\hat{\sigma}_q^2)$.

  12. Compute the set of indices associated with the u highest values of the vector $(|\phi(\rho_1)|,\ldots,|\phi(\rho_k)|)$, denote it by $S_1$, and draw $z_1(t),\ldots,z_k(t)\overset{\text{i.i.d.}}{\sim}N\left(0,\frac{1}{n-3}\right)$.

  13. For i = 1, …, k:

  14. Compute $v_i(t)=|r_n^i(t)+z_i(t)|$.

  15. End for.

  16. Compute the set of indices associated with the u highest values of the vector $(v_1(t),\ldots,v_k(t))$, denote it by $S_2$, and compute $c_t=\frac{|S_1\cap S_2|}{u}$.

  17. End for.

  18. Compute $\bar{d}=\frac{1}{B}\sum_{t=1}^{B}d_t$ and $\bar{c}=\frac{1}{B}\sum_{t=1}^{B}c_t$.

  19. Return $\bar{d}$, $\bar{c}$, $\sqrt{\frac{1}{B}\sum_{t=1}^{B}\left(c_t-\bar{c}\right)^2}$ and $\sqrt{\frac{1}{B}\sum_{t=1}^{B}\left(d_t-\bar{d}\right)^2}$.
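The Figure 2 procedure can be sketched in plain Python along the following lines. Two interpretive assumptions are made where the extracted pseudocode is ambiguous: ϕ is taken to be Fisher's z-transform, and W is computed from the transformed correlations ϕ(r_n^i(t)) — the reading under which W − 1/(n − 3) is a natural estimate of σ_q²; the function and parameter names are illustrative.

```python
import math
import random
import statistics

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n - mx * my
    sx2 = sum(a * a for a in x) / n - mx * mx
    sy2 = sum(b * b for b in y) / n - my * my
    return sxy / math.sqrt(sx2 * sy2)

def top_u(values, u):
    """Indices of the u largest entries of `values`."""
    return set(sorted(range(len(values)), key=lambda i: -values[i])[:u])

def figure2(n, k, u, B, seed=0):
    rng = random.Random(seed)
    d, c = [], []
    for _ in range(B):
        X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
        Y = [sum(row[:u]) for row in X]
        r = [corr([row[i] for row in X], Y) for i in range(k)]
        # straightforward proportion: the informative genes are indices 0..u-1
        d.append(len(top_u([abs(v) for v in r], u) & set(range(u))) / u)
        # fast approximated proportion (steps 9-16)
        W = statistics.pvariance([math.atanh(v) for v in r])
        sq = math.sqrt(max(W - 1.0 / (n - 3), 0.0))
        S1 = top_u([abs(rng.gauss(0.0, sq)) for _ in range(k)], u)
        z = [rng.gauss(0.0, math.sqrt(1.0 / (n - 3))) for _ in range(k)]
        S2 = top_u([abs(r[i] + z[i]) for i in range(k)], u)
        c.append(len(S1 & S2) / u)
    d_bar, c_bar = sum(d) / B, sum(c) / B
    d_sd = math.sqrt(sum((x - d_bar) ** 2 for x in d) / B)
    c_sd = math.sqrt(sum((x - c_bar) ** 2 for x in c) / B)
    return d_bar, c_bar, d_sd, c_sd

d_bar, c_bar, d_sd, c_sd = figure2(n=100, k=40, u=4, B=5)
assert 0.0 <= d_bar <= 1.0 and 0.0 <= c_bar <= 1.0
```

Note the guard `max(..., 0.0)`: for small priors W may fall below 1/(n − 3), in which case the plug-in estimate of σ_q² is truncated at zero.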

References

Alam, K. (1979): “Distribution of sample correlation coefficients.” Nav. Res. Logist., 26, 327–330. doi:10.1002/nav.3800260212.

Alam, K. and M. H. S. Rizvi (1976): “Selection of largest multiple correlation coefficients: exact sample size case.” Ann. Stat., 4, 614–620. doi:10.1214/aos/1176343467.

Carvalho, C., J. Chang, J. Lucas, J. Nevins, Q. Wang and M. West (2008): “High-dimensional sparse factor modeling: applications in gene expression genomics.” JASA, 103, 1438–1456. doi:10.1198/016214508000000869.

Cui, X. and J. Wilson (2008): “On the probability of correct selection for large k populations with application to microarray data.” Biometrical J., 50, 833–870.

Cui, X., H. Zhao and J. Wilson (2010): “Optimized ranking and selection methods for feature selection with application in microarray experiments.” J. Biopharm. Stat., 20, 223–239. doi:10.1080/10543400903572720.

Dobra, A., C. Hans, B. Jones, J. R. Nevins, G. Yao and M. West (2004): “Sparse graphical models for exploring gene expression data.” J. Multivariate Anal., 90, 196–212. doi:10.1016/j.jmva.2004.02.009.

Donoho, D. (2000): High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture.

Ein-Dor, L., O. Zuk and E. Domany (2006): “Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.” Proc. Natl. Acad. Sci. USA, 103, 5923–5928. doi:10.1073/pnas.0601231103.

Ferguson, T. (1996): A course in large sample theory. Chapman and Hall, London. doi:10.1007/978-1-4899-4549-5.

Fisher, R. (1915): “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population.” Biometrika, 10, 507–521. doi:10.2307/2331838.

Fisher, R. (1921): “On the probable error of a coefficient of correlation deduced from a small sample.” Metron, 1, 3–32.

Guyon, I. and A. Elisseeff (2003): “An introduction to variable and feature selection.” J. Mach. Learn. Res., 3, 1157–1182.

Hall, M. (1998): Correlation based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.

Isserlis, L. (1918): “On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables.” Biometrika, 12, 134–139. doi:10.1093/biomet/12.1-2.134.

Jacobovic, R. and O. Zuk (2017): “On the asymptotic efficiency of selection procedures for independent Gaussian populations.” Electron. J. Stat., 11, 5375–5405. doi:10.1214/17-EJS1375.

Knowles, D. and Z. Ghahramani (2011): “Nonparametric Bayesian sparse factor models with application to gene expression modeling.” Ann. Appl. Stat., 5, 1534–1552. doi:10.1214/10-AOAS435.

Levy, K. (1975): “Selecting the best population from among k binomial populations or the population with the largest correlation coefficient from among k bivariate normal populations.” Psychometrika, 40, 121–122. doi:10.1007/BF02291486.

Levy, K. (1977): “Appropriate sample sizes for selecting a population with the largest correlation coefficient from among k bivariate normal populations.” Educ. Psychol. Meas., 37, 61–66. doi:10.1177/001316447703700107.

McDowell, I. C., D. Manandhar, C. Vockley, A. Schmid and T. Reddy (2018): “Clustering gene expression time series data using an infinite Gaussian process mixture model.” PLoS Comput. Biol., 14, e1005896. doi:10.1371/journal.pcbi.1005896.

Pakman, A. and L. Paninski (2014): “Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians.” J. Comput. Graph. Stat., 23, 518–542. doi:10.1080/10618600.2013.788448.

Ramberg, J. (1977): “Selecting the best predictor variate.” Commun. Stat. Theory Methods, 11, 1133–1147. doi:10.1080/03610927708827556.

Rizvi, M. H. S. (1973): “Selection of largest multiple correlation coefficients: asymptotic case.” J. Am. Stat. Assoc., 68, 184–188. doi:10.1080/01621459.1973.10481360.

Spiegel, M. R. (1968): Mathematical handbook of formulas and tables. Schaum.

Wilcox, R. (1978): “Some comments on selecting the best of several binomial populations or the bivariate normal population having the largest correlation coefficient.” Psychometrika, 43, 127–128. doi:10.1007/BF02294099.

Yeung, K., C. Fraley, A. Murua, A. Raftery and W. Ruzzo (2001): “Model-based clustering and data transformations for gene expression data.” Bioinformatics, 17, 977–987. doi:10.1093/bioinformatics/17.10.977.

Yu, L. and H. Liu (2003): “Feature selection for high-dimensional data: a fast correlation-based filter solution.” Proceedings of the Twentieth International Conference on Machine Learning, 856–863.

Zuk, O., L. Ein-Dor and E. Domany (2007): “Ranking under uncertainty.” UAI, 466–473.


©2018 Walter de Gruyter GmbH, Berlin/Boston
