Alternating DCA for reduced-rank multitask linear regression with covariance matrix estimation

Annals of Mathematics and Artificial Intelligence

Abstract

We study a challenging problem in machine learning: reduced-rank multitask linear regression with covariance matrix estimation. The objective is to build a linear relationship between the multiple output variables and the input variables of a multitask learning process, taking into account, on the one hand, a general covariance structure for the errors of the regression model and, on the other hand, a reduced-rank regression model. The problem is formulated as the minimization of a nonconvex function in the two joint matrix variables (X,Θ) under a low-rank constraint on X and a positive definiteness constraint on Θ. It is doubly difficult, owing to the nonconvexity of the objective function as well as the low-rank constraint. We investigate a nonconvex, nonsmooth optimization approach based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) for this hard problem. A penalty reformulation is considered, which takes the form of a partial DC program. An alternating DCA and its inexact version are developed; both algorithms converge to a weak critical point of the considered problem. Numerical experiments are performed on several synthetic and benchmark real multitask linear regression datasets. The numerical results show the good performance of the proposed algorithms and their superiority over three classical alternating/joint methods.


References

  1. Aldrin, M.: Reduced-Rank Regression. Encyclopedia of Environmetrics, Vol. 3. Wiley, pp. 1724–1728 (2002)

  2. Chen, L., Huang, J.Z.: Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. 107(500), 1533–1545 (2012)

  3. Chen, L., Huang, J.Z.: Sparse reduced-rank regression with covariance estimation. Stat. Comput. 26(1), 461–470 (2016)

  4. Cover, T.M., Thomas, J.A.: Determinant inequalities via information theory. SIAM J. Matrix Anal. Appl. 9(3), 384–392 (1988)

  5. Dev, H., Sharma, N.L., Dawson, S.N., Neal, D.E., Shah, N.: Detailed analysis of operating time learning curves in robotic prostatectomy by a novice surgeon. BJU Int. 109(7), 1074–1080 (2012)

  6. Dubois, B., Delmas, J.F., Obozinski, G.: Fast algorithms for sparse reduced-rank regression. In: Chaudhuri, K., Sugiyama, M. (eds.) Proceedings of Machine Learning Research, vol. 89, pp. 2415–2424. PMLR (2019)

  7. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936)

  8. Foygel, R., Horrell, M., Drton, M., Lafferty, J.: Nonparametric reduced rank regression. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp 1628–1636. Curran Associates, Inc. (2012)

  9. Ha, W., Foygel Barber, R.: Alternating minimization and alternating descent over nonconvex sets. ArXiv e-prints arXiv:1709.04451 (2017)

  10. Harrison, L., Penny, W., Friston, K.: Multivariate autoregressive modeling of fMRI time series. Neuroimage 19, 1477–1491 (2003)

  11. He, D., Parida, L., Kuhn, D.: Novel applications of multitask learning and multiple output regression to multiple genetic trait prediction. Bioinformatics 32(12), i37–i43 (2016)

  12. Hu, Z., Nie, F., Wang, R., Li, X.: Low rank regularization: A review. Neural Networks, in press. https://doi.org/10.1016/j.neunet.2020.09.021 (2020)

  13. Hyams, E., Mullins, J., Pierorazio, P., Partin, A., Allaf, M., Matlaga, B.: Impact of robotic technique and surgical volume on the cost of radical prostatectomy. J. Endourol. 27(3), 298–303 (2013)

  14. Ioffe, A., Tihomirov, V.: Theory of extremal problems. North-Holland (1979)

  15. Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5(2), 248–264 (1975)

  16. Koshi, S.: Convergence of convex functions and duality. Hokkaido Math. J. 14(3), 399–414 (1985)

  17. Le, H.M., Le Thi, H.A., Nguyen, M.C.: Sparse semi-supervised support vector machines by DC programming and DCA. Neurocomputing 153, 62–76 (2015)

  18. Le Thi, H.A.: Analyse numérique des algorithmes de l'optimisation DC. Approches locale et globale. Codes et simulations numériques en grande dimension. Applications. Ph.D. thesis, University of Rouen, France (1994)

  19. Le Thi, H.A.: Solving large scale molecular distance geometry problems by a smoothing technique via the Gaussian transform and D.C. programming. J. Glob. Optim. 27(1), 375–397 (2003)

  20. Le Thi, H.A.: Portfolio selection under downside risk measures and cardinality constraints based on DC programming and DCA. Comput. Manag. Sci. 6 (4), 459–475 (2009)

  21. Le Thi, H.A.: DC Programming and DCA for supply chain and production management: state-of-the-art models and methods. Int. J. Prod. Res. 58 (20), 6078–6114 (2020)

  22. Le Thi, H.A., Ho, V.T.: Online learning based on online DCA and application to online classification. Neural Comput. 32(4), 759–793 (2020)

  23. Le Thi, H.A., Ho, V.T., Pham Dinh, T.: A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning. J. Glob. Optim. 73(2), 279–310 (2019)

  24. Le Thi, H.A., Huynh, V.N., Pham Dinh, T.: DC Programming and DCA for General DC Programs. In: Van Do, T., Le Thi, H.A., Nguyen, N.T. (eds.) Advanced Computational Methods for Knowledge Engineering, vol. 282, pp 15–35. Springer International Publishing (2014)

  25. Le Thi, H.A., Huynh, V.N., Pham Dinh, T.: Alternating DC Algorithm for Partial DC Programming. Technical report, University of Lorraine (2016)

  26. Le Thi, H.A., Le, H.M., Pham Dinh, T.: New and efficient DCA based algorithms for minimum sum-of-squares clustering. Pattern Recogn. 47 (1), 388–401 (2014)

  27. Le Thi, H.A., Le, H.M., Phan, D.N., Tran, B.: Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, pp. 3394–3403 (2017)

  28. Le Thi, H.A., Nguyen, M.C.: DCA Based algorithms for feature selection in multi-class support vector machine. Ann. Oper. Res. 249(1), 273–300 (2017)

  29. Le Thi, H.A., Nguyen, M.C., Pham Dinh, T.: A DC programming approach for finding communities in networks. Neural Comput. 26(12), 2827–2854 (2014)

  30. Le Thi, H.A., Pham Dinh, T.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133(1–4), 23–46 (2005)

  31. Le Thi, H.A., Pham Dinh, T.: Difference of convex functions algorithms (DCA) for image restoration via a Markov random field model. Optim. Eng. 18 (4), 873–906 (2017)

  32. Le Thi, H.A., Pham Dinh, T.: DC programming and DCA: thirty years of developments. Mathematical Programming, Special Issue: DC Programming – Theory, Algorithms and Applications 169(1), 5–68 (2018)

  33. Le Thi, H.A., Pham Dinh, T., Le, H.M., Vo, X.T.: DC Approximation approaches for sparse optimization. Eur. J. Oper. Res. 244(1), 26–46 (2015)

  34. Le Thi, H.A., Pham Dinh, T., Ngai, H.V.: Exact penalty and error bounds in dc programming. J. Glob. Optim. 52(3), 509–535 (2012)

  35. Le Thi, H.A., Phan, D.N.: DC Programming and DCA for sparse optimal scoring problem. Neurocomputing 186, 170–181 (2016)

  36. Le Thi, H.A., Ta, A.S., Pham Dinh, T.: An efficient DCA based algorithm for power control in large scale wireless networks. Appl. Math. Comput. 318, 215–226 (2018)

  37. Lee, C.L., Lee, C.A., Lee, J.: Handbook of Quantitative Finance and Risk Management. Springer, USA (2010)

  38. Magnus, J.R., Neudecker, H.: Matrix differential calculus with applications to simple, hadamard, and kronecker products. J. Math. Psychol. 29(4), 474–492 (1985)

  39. Nguyen, M.N., Le Thi, H.A., Daniel, G., Nguyen, T.A.: Smoothing techniques and difference of convex functions algorithms for image reconstructions. Optim. 69(7-8), 1601–1633 (2020)

  40. Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to DC programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997)

  41. Pham Dinh, T., Le Thi, H.A.: DC Optimization algorithms for solving the trust region subproblem. SIAM J. Optim. 8(2), 476–505 (1998)

  42. Pham Dinh, T., Le Thi, H.A.: Recent Advances in DC Programming and DCA. In: Nguyen, N.T., Le Thi, H.A. (eds.) Transactions on Computational Intelligence XIII, vol. 8342, pp 1–37. Springer, Berlin (2014)

  43. Phan, D.N., Le Thi, H.A.: Group variable selection via ℓ_{p,0} regularization and application to optimal scoring. Neural Netw. 118, 220–234 (2019)

  44. Reinsel, G.C., Velu, R.P.: Multivariate Reduced-Rank Regression: Theory and Applications, 1st edn. Lecture Notes in Statistics, vol. 136. Springer, New York (1998)

  45. Salinetti, G., Wets, R.J.: On the relations between two types of convergence for convex functions. J. Math. Anal. Appl. 60(1), 211–226 (1977)

  46. Smith, A.E., Coit, D.W.: Constraint-handling techniques - penalty functions. In: Handbook of Evolutionary Computation, Oxford University Press, pp. C5.2:1–C5.2.6 (1997)

  47. Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, I.: Multi-target regression via input space expansion: treating targets as inputs. Mach. Learn. 104(1), 55–98 (2016)

  48. Tran, T.T., Le Thi, H.A., Pham Dinh, T.: DC Programming and DCA for enhancing physical layer security via cooperative jamming. Comput. Oper. Res. 87, 235–244 (2017)

  49. Wold, S., Sjöström, M., Eriksson, L.: PLS-Regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58(2), 109–130 (2001)

  50. Yuan, M., Ekici, A., Lu, Z., Monteiro, R.: Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 329–346 (2007)

  51. Zălinescu, C.: Convex analysis in general vector spaces. World Scientific (2002)

Author information

Corresponding author

Correspondence to Hoai An Le Thi.

Ethics declarations

Conflict of Interests

The authors declare that there is no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Estimate of μΘ

The next lemma indicates how to estimate the value of \(\mu_{\boldsymbol{\varTheta}}\) so as to guarantee the convexity of the function \(\widetilde{H}(\cdot,\boldsymbol{\varTheta})\) defined in Theorem 2. In this lemma, \(\|\cdot\|_{2}\) denotes the 2-norm (or spectral norm) of a matrix.

Lemma 1

For each fixed \(\boldsymbol{\varTheta} \in \mathbb{R}^{m \times m}\), if \(\mu_{\boldsymbol{\varTheta}} \leq 1/\left(\|\frac{2}{n} \sum_{i=1}^{n}\boldsymbol{\phi}_{i} \boldsymbol{\phi}_{i}^{\top}\|_{2} \|\boldsymbol{\varTheta}\|_{2} + 2\alpha\right)\), then \(\widetilde{H}(\cdot,\boldsymbol{\varTheta})\) is convex.

Proof

Since the function \(\max_{\boldsymbol{Y} \in \mathcal{X}} (2\langle \boldsymbol{X}, \boldsymbol{Y} \rangle - \|\boldsymbol{Y}\|_{F}^{2})\) is convex in X and the sum of two convex functions is convex, it suffices to choose \(\mu_{\boldsymbol{\varTheta}}\) such that the function \((\frac{1}{2\mu_{\boldsymbol{\varTheta}}}-\alpha) \|\boldsymbol{X}\|_{F}^{2} - \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i})^{\top} \boldsymbol{\varTheta} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i})\) is convex. To this end, we can take \(\frac{1}{\mu_{\boldsymbol{\varTheta}}}-2\alpha\) greater than or equal to the spectral radius of the Hessian matrix of \({\varLambda}(\boldsymbol{X}) = \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i})^{\top} \boldsymbol{\varTheta} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i})\), i.e., \(\frac{1}{\mu_{\boldsymbol{\varTheta}}}-2\alpha \geq \rho(\nabla^{2} {\varLambda}(\boldsymbol{X}))\). From matrix differential calculus (see, e.g., [38]), we have

$$ \nabla {\varLambda}(\boldsymbol{X}) = \frac{2}{n} \sum\limits_{i=1}^{n} \boldsymbol{\varTheta}(\boldsymbol{X}\boldsymbol{\phi}_{i}-\boldsymbol{z}_{i})\boldsymbol{\phi}_{i}^{\top} = \boldsymbol{\varTheta} \boldsymbol{X} \left(\frac{2}{n} \sum\limits_{i=1}^{n}\boldsymbol{\phi}_{i}\boldsymbol{\phi}_{i}^{\top}\right) - \frac{2}{n} \boldsymbol{\varTheta} \sum\limits_{i=1}^{n} \boldsymbol{z}_{i}\boldsymbol{\phi}_{i}^{\top}; $$
(34)
$$ \nabla^{2} {\varLambda}(\boldsymbol{X}) = \left(\frac{2}{n} \sum\limits_{i=1}^{n}\boldsymbol{\phi}_{i} \boldsymbol{\phi}_{i}^{\top}\right) \otimes \boldsymbol{\varTheta}. $$
Here ⊗ denotes the Kronecker product. Since \(\nabla^{2} {\varLambda}(\boldsymbol{X})\) is symmetric, we obtain \(\rho(\nabla^{2} {\varLambda}(\boldsymbol{X})) = \|\nabla^{2} {\varLambda}(\boldsymbol{X})\|_{2} = \|\left(\frac{2}{n} \sum_{i=1}^{n}\boldsymbol{\phi}_{i} \boldsymbol{\phi}_{i}^{\top}\right) \otimes \boldsymbol{\varTheta}\|_{2} = \|\frac{2}{n} \sum_{i=1}^{n}\boldsymbol{\phi}_{i} \boldsymbol{\phi}_{i}^{\top}\|_{2} \|\boldsymbol{\varTheta}\|_{2}\). Thus, the proof is complete. □
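For concreteness, the bound in Lemma 1 is straightforward to evaluate numerically. The following NumPy sketch is an illustration under the notation above, not part of the paper's code: the names Phi, Theta and alpha are hypothetical placeholders for the matrix collecting the ϕ_i's, the fixed Θ and the constant α. It returns the right-hand side of the inequality; any μΘ not exceeding this value keeps \(\widetilde{H}(\cdot,\boldsymbol{\varTheta})\) convex.

```python
import numpy as np

def mu_theta_bound(Phi, Theta, alpha):
    """Right-hand side of the bound in Lemma 1 (illustrative sketch).

    Phi   : (d, n) array whose columns are the inputs phi_i
    Theta : (m, m) symmetric positive definite matrix
    alpha : the constant alpha appearing in Lemma 1
    """
    n = Phi.shape[1]
    S = (2.0 / n) * Phi @ Phi.T            # (2/n) * sum_i phi_i phi_i^T
    spec_S = np.linalg.norm(S, 2)          # spectral norm ||.||_2
    spec_Theta = np.linalg.norm(Theta, 2)
    return 1.0 / (spec_S * spec_Theta + 2.0 * alpha)
```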

Appendix B: Comparative Algorithms for Solving the Problem (2)

The Al-M method alternates between updating the variables X and Θ at each iteration. In particular, at iteration k, for fixed \(\boldsymbol{\varTheta} = \boldsymbol{\varTheta}^{k}\), we compute \(\boldsymbol{X}^{k+1}\), an optimal solution to the following problem (see, e.g., [1, 44])

$$ \min \frac{1}{n} \sum\limits_{i=1}^{n} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i})^{\top} \boldsymbol{\varTheta}^{k} (\boldsymbol{z}_{i}-\boldsymbol{X} \boldsymbol{\phi}_{i}) \text{ s.t. } \boldsymbol{X} \in \mathcal{X}. $$
(35)

Let us denote by Z (resp. Φ) the matrix in \(\mathbb{R}^{m \times n}\) (resp. \(\mathbb{R}^{d \times n}\)) whose columns are the vectors \(\boldsymbol{z}_{i}\) (resp. \(\boldsymbol{\phi}_{i}\)), and define \(\boldsymbol{D}^{k} := (\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top})^{-1/2}(\boldsymbol{\Phi}\boldsymbol{Z}^{\top})(\boldsymbol{\varTheta}^{k})^{1/2}\). A reduced-rank regression estimate \(\boldsymbol{X}^{k+1}\) of (35) is given by (see [1])

$$ \boldsymbol{X}^{k+1} = \sum\limits_{t=1}^{r} {\varLambda}_{t} \left[\frac{1}{n} \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top} \right]^{-1/2} \boldsymbol{u}_{t} \boldsymbol{v}_{t}^{\top} (\boldsymbol{\varTheta}^{k})^{-1/2}, $$
(36)

where \(\{{\varLambda}_{t}\}\) are the singular values of the matrix \(\boldsymbol{D}^{k}\), and \(\{\boldsymbol{u}_{t}\}\) and \(\{\boldsymbol{v}_{t}\}\) are the corresponding left and right singular vectors of \(\boldsymbol{D}^{k}\). For fixed \(\boldsymbol{X} = \boldsymbol{X}^{k+1}\), Al-M computes the point \(\boldsymbol{\varTheta}^{k+1}\) using (17) at \(\boldsymbol{X}^{k+1}\). Note that the Al-M method has no tuning parameter.

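As an illustration of the Al-M X-update, the following NumPy sketch implements (36) under the assumptions stated above (Z of size m×n, Φ of size d×n with ΦΦ^⊤ invertible, Θ^k symmetric positive definite, prescribed rank r). The symmetric matrix powers are computed by eigendecomposition, and the final transpose is only there so that X multiplies ϕ_i on the left, as in z_i − Xϕ_i. This is a sketch of the classical reduced-rank update, not the authors' implementation.

```python
import numpy as np

def sym_power(A, p):
    """Symmetric matrix power A^p via eigendecomposition (A assumed positive definite)."""
    w, V = np.linalg.eigh(A)
    return (V * w**p) @ V.T

def alm_x_update(Z, Phi, Theta_k, r):
    """One Al-M X-update following (36) (illustrative sketch).

    Z       : (m, n) matrix whose columns are the outputs z_i
    Phi     : (d, n) matrix whose columns are the inputs phi_i
    Theta_k : (m, m) current estimate of Theta
    r       : prescribed rank (r <= min(m, d))
    """
    n = Phi.shape[1]
    C = Phi @ Phi.T                                              # Phi Phi^T, d x d, assumed full rank
    D_k = sym_power(C, -0.5) @ (Phi @ Z.T) @ sym_power(Theta_k, 0.5)
    U, s, Vt = np.linalg.svd(D_k, full_matrices=False)           # singular triplets of D^k
    # sum of the r leading terms Lambda_t [(1/n) Phi Phi^T]^{-1/2} u_t v_t^T (Theta^k)^{-1/2}
    X36 = sym_power(C / n, -0.5) @ U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :] @ sym_power(Theta_k, -0.5)
    return X36.T    # transposed so that the residuals read z_i - X phi_i
```

The Θ-update (17) is then applied at \(\boldsymbol{X}^{k+1}\); since (17) is given in the main text, it is left out of the sketch.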

The Al-GD method differs from the Al-M method in that Al-GD performs one iteration of the gradient descent method for solving the problem (35) (see [9]). In particular, \(\boldsymbol{X}^{k+1}\) is computed as follows:

$$ \boldsymbol{X}^{k+1} = \text{Proj}_{\mathcal{X}}\left( \boldsymbol{X}^{k} + \frac{2\eta_{\boldsymbol{X}}}{n}\boldsymbol{\varTheta}^{k} \sum\limits_{i=1}^{n} (\boldsymbol{z}_{i}-\boldsymbol{X}^{k} \boldsymbol{\phi}_{i})\boldsymbol{\phi}_{i}^{\top} \right), $$
(37)

where the step size \(\eta_{\boldsymbol{X}}\) is a tuning parameter. Al-GD then computes the point \(\boldsymbol{\varTheta}^{k+1}\) using (17).

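A corresponding sketch of the Al-GD step (37) is given below. It assumes that the feasible set \(\mathcal{X}\) is the set of matrices of rank at most r, so that \(\text{Proj}_{\mathcal{X}}\) is the truncated SVD (Eckart–Young [7]); the name eta_X follows the notation above. Again, this is an illustration rather than the authors' code.

```python
import numpy as np

def project_rank(X, r):
    """Projection onto matrices of rank at most r via truncated SVD (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def algd_x_update(X_k, Theta_k, Z, Phi, eta_X, r):
    """One projected-gradient X-update following (37) (illustrative sketch)."""
    n = Phi.shape[1]
    R = Z - X_k @ Phi                                    # residuals z_i - X^k phi_i, stacked as columns
    grad_step = X_k + (2.0 * eta_X / n) * (Theta_k @ R @ Phi.T)
    return project_rank(grad_step, r)
```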

The J-GD method does not update the two variables alternately, but takes one gradient descent step in the joint variable (X,Θ) (see [9]). The estimate \(\boldsymbol{X}^{k+1}\) is computed as in (37), while the point \(\boldsymbol{\varTheta}^{k+1}\) is obtained by a gradient descent step for (16) at the point \((\boldsymbol{X}^{k}, \boldsymbol{\varTheta}^{k})\) as follows:

$$ \boldsymbol{\varTheta}^{k+1} = \text{Proj}_{{\varOmega}}\left( \boldsymbol{\varTheta}^{k} + \eta_{\boldsymbol{\varTheta}} \boldsymbol{\Delta}^{k}\right), $$
(38)

where the step size ηΘ is a tuning parameter and

$$ \boldsymbol{\varDelta}^{k} = (\boldsymbol{\varTheta}^{k})^{(-1)} -\left[\frac{1}{n}\boldsymbol{\varTheta}^{k} \sum\limits_{i=1}^{n} (\boldsymbol{z}_{i}-\boldsymbol{X}^{k} \boldsymbol{\phi}_{i})(\boldsymbol{z}_{i}-\boldsymbol{X}^{k} \boldsymbol{\phi}_{i})^{\top} \right]. $$
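Finally, a sketch of the Θ-update (38). The projection \(\text{Proj}_{{\varOmega}}\) onto the feasible set of Θ is assumed here to be an eigenvalue clipping onto symmetric matrices with eigenvalues bounded below by a small ε; the exact definition of Ω in the main text may differ, so this is only an illustration of the step, with Δ^k computed exactly as displayed above.

```python
import numpy as np

def project_pd(Theta, eps=1e-8):
    """Assumed form of Proj_Omega: symmetrize and clip eigenvalues at eps."""
    S = 0.5 * (Theta + Theta.T)
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, eps)) @ V.T

def jgd_theta_update(Theta_k, X_k, Z, Phi, eta_Theta, eps=1e-8):
    """One gradient step on Theta following (38), with Delta^k as displayed above (illustrative sketch)."""
    n = Phi.shape[1]
    R = Z - X_k @ Phi                                      # residuals z_i - X^k phi_i, m x n
    Delta_k = np.linalg.inv(Theta_k) - (1.0 / n) * Theta_k @ (R @ R.T)
    return project_pd(Theta_k + eta_Theta * Delta_k, eps)
```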


Cite this article

Le Thi, H.A., Ho, V.T. Alternating DCA for reduced-rank multitask linear regression with covariance matrix estimation. Ann Math Artif Intell 90, 809–829 (2022). https://doi.org/10.1007/s10472-021-09732-8
