Abstract
We introduce new global and local inexact oracle concepts for a wide class of convex functions in composite convex minimization. Such inexact oracles naturally arise in many situations, including primal–dual frameworks, barrier smoothing, and inexact evaluations of gradients and Hessians. We also provide examples showing that the class of convex functions equipped with the new inexact oracles is larger than the standard self-concordant and Lipschitz gradient function classes. Further, we investigate several properties of convex and/or self-concordant functions under our inexact oracles which are useful for algorithmic development. Next, we apply our theory to develop inexact proximal Newton-type schemes for solving general composite convex optimization problems equipped with such inexact oracles. Our theoretical results consist of new optimization algorithms accompanied by global convergence guarantees for a wide class of composite convex optimization problems. When the first objective term is additionally self-concordant, we establish different local convergence results for our method. In particular, we prove that, depending on the choice of accuracy levels of the inexact second-order oracles, we obtain different local convergence rates ranging from linear and superlinear to quadratic. In special cases, where convergence bounds are known, our theory recovers the best known rates. We also apply our settings to derive a new primal–dual method for composite convex minimization problems involving linear operators. Finally, we present some representative numerical examples to illustrate the benefit of the new algorithms.
References
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, vol. 3. SIAM, Philadelphia (2001)
Bogolubsky, L., Dvurechenskii, P., Gasnikov, A., Gusev, G., Nesterov, Y., Raigorodskii, A., Tikhonov, A., Zhukovskii, M.: Learning supervised PageRank with gradient-based and gradient-free optimization methods. In: Advances in Neural Information Processing Systems, pp. 4914–4922 (2016)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM, Philadelphia (2008)
d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Dvurechensky, P., Gasnikov, A.: Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. J. Optim. Theory Appl. 171(1), 121–145 (2016)
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)
Gao, W., Goldfarb, D.: Quasi-Newton methods: superlinear convergence without linesearch for self-concordant functions. Optim. Methods Softw. 34(1), 194–217 (2019)
Harmany, Z.T., Marcia, R.F., Willett, R.M.: This is SPIRAL-TAP: sparse Poisson intensity reconstruction algorithms—theory and practice. IEEE Trans. Image Process. 21(3), 1084–1096 (2012)
Hsieh, C.J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.: Sparse inverse covariance matrix estimation using quadratic approximation. Adv. Neural Inf. Process. Syst. 24, 1–18 (2011)
Lefkimmiatis, S., Unser, M.: Poisson image reconstruction with Hessian Schatten-norm regularization. IEEE Trans. Image Process. 22(11), 4314–4327 (2013)
Li, J., Andersen, M., Vandenberghe, L.: Inexact proximal Newton methods for self-concordant functions. Math. Methods Oper. Res. 85(1), 19–41 (2017)
Li, L., Toh, K.C.: An inexact interior-point method for \(\ell _1\)-regularized sparse covariance selection. Math. Program. Comput. 2(3), 291–315 (2010)
Lu, Z.: Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2017)
Marron, S.J., Todd, M.J., Ahn, J.: Distance-weighted discrimination. J. Am. Stat. Assoc. 102(480), 1267–1271 (2007)
Necoara, I., Patrascu, A., Glineur, F.: Complexity of first-order inexact Lagrangian and penalty methods for conic convex programming. Optim. Methods Softw. 34(2), 305–335 (2019)
Necoara, I., Suykens, J.A.K.: Interior-point Lagrangian decomposition method for separable convex optimization. J. Optim. Theory Appl. 143(3), 567–588 (2009)
Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers, Boston (2004)
Nesterov, Y., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)
Olsen, P.A., Oztoprak, F., Nocedal, J., Rennie, S.J.: Newton-like methods for sparse inverse covariance estimation. Adv. Neural Inf. Process. Syst. 25, 1–9 (2012)
Ostrovskii, D.M., Bach, F.: Finite-sample analysis of M-estimators using self-concordance. arXiv:1810.06838v1 (2018)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Rockafellar, R.T.: Convex Analysis. Princeton Mathematics Series, vol. 28. Princeton University Press, Princeton (1970)
Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modelling and Theory. SIAM, Philadelphia (2009)
Sun, T., Tran-Dinh, Q.: Generalized self-concordant functions: a recipe for Newton-type methods. Math. Program. 178, 145–213 (2018)
Toh, K.-C., Todd, M.J., Tütüncü, R.H.: On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming. Technical Report 4, NUS Singapore (2010)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
Tran-Dinh, Q., Necoara, I., Savorgnan, C., Diehl, M.: An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim. 23(1), 95–125 (2013)
Tran-Dinh, Q., Sun, T., Lu, S.: Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Math. Program. 177(1–2), 173–223 (2019)
Zhang, R.Y., Fattahi, S., Sojoudi, S.: Linear-time algorithm for learning large-scale sparse graphical models. IEEE Access 7, 12658–12672 (2019)
Zhang, Y., Lin, X.: DiSCO: distributed optimization for self-concordant empirical loss. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 362–370 (2015)
Acknowledgements
The work of Q. Tran-Dinh was partly supported by the National Science Foundation (NSF), Grant: DMS-1619884, and the Office of Naval Research (ONR), Grant: N00014-20-1-2088 (2020–2023). The work of I. Necoara was partly supported by the Executive Agency for Higher Education, Research and Innovation Funding (UEFISCDI), Romania, PNIII-P4-PCE-2016-0731, Project ScaleFreeNet, No. 39/2017.
A The proof of technical results in the main text
This appendix provides the proofs of technical results and missing concepts in the main text.
A.1 The proof of Lemma 1: properties of global inexact oracle
(a) Substituting \(x = y\) into (5), we obtain (7) directly for all \(x\in \mathrm {dom}(f)\).
(b) Clearly, if \(\langle g(\bar{x}), y - \bar{x}\rangle \ge 0\) for all \(y\in \mathrm {dom}(f)\), then \(\langle g(\bar{x}), x^{\star } - \bar{x}\rangle \ge 0\) for a minimizer \(x^{\star }\) of f. Using this relation in (5), we have
which implies \(f^{\star } \le f(\bar{x}) \le f^{\star } + \delta _1\).
(c) Let \(\nabla f(x)\) be a (sub)gradient of f at \(x \in \mathrm {int}\left( \mathrm {dom}(f)\right) \). For \(y\in \mathrm {dom}(f)\), it follows from (5) and (7) that
Subtracting this estimate from the second inequality of (5), we have
provided that \(\vert\!\Vert y - x \Vert\!\vert_x < \frac{1}{1+\delta _0}\). Let us consider an arbitrary \(z\in {\mathbb{R}}^p\) such that
Let us choose \(y = y_{\tau }(x) := x + \tau \text {sign}(\left\langle \nabla f(x)-g(x),z\right\rangle )z\) for some \(\tau > 0\). Since \(x\in \mathrm {int}\left( \mathrm {dom}(f)\right) \), for sufficiently small \(\tau \), \(y\in \mathrm {dom}(f)\). Moreover, (52) becomes \(\tau \vert\!\Vert \nabla f(x)-g(x) \Vert\!\vert_x^{*}\le \omega _{*}\left( (1+\delta _0)\tau \right) +\delta _1\), which is equivalent to
Let us take \(\tau := \frac{\delta _2}{(1+\delta _0 + \delta _2)(1+\delta _0)}\) for some sufficiently small \(\delta _2 > 0\). Then, we can easily check that \(\vert\!\Vert y - x \Vert\!\vert_x = \tau < \frac{1}{1+\delta _0}\). In this case, the right-hand side of (53) becomes
for any \(\delta _2 > 0\). Minimizing the right-hand side of (54) w.r.t. \(\delta _2 > 0\), we can show that the minimum is attained at the unique solution \(\delta _2(\delta _0,\delta _1) = (1+\delta _0)\omega ^{-1}(\delta _1)\) of \(\omega \left( \tfrac{\delta _2}{1+\delta _0} \right) = \delta _1\) in \(\delta _2\), where \(\omega ^{-1}\) is the inverse function of \(\omega \) (recall that \(\omega (\tau ) = \tau - \ln (1 + \tau )\)).
Substituting \(\delta _2 = \delta _2(\delta _0,\delta _1)\) back into \(s(\tau ;\delta _0,\delta _1)\), we see that the minimum value of (54) is exactly \(\delta _2(\delta _0,\delta _1) = (1+\delta _0)\omega ^{-1}(\delta _1)\). By the definition of \(\omega \), it is clear that if \(\delta _0 \rightarrow 0\) and \(\delta _1\rightarrow 0\), then \(\delta _2 := (1+\delta _0)\omega ^{-1}(\delta _1) \rightarrow 0\).
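As a numerical illustration of this quantity, the following Python sketch (ours, purely illustrative) computes \(\delta _2(\delta _0,\delta _1) = (1+\delta _0)\omega ^{-1}(\delta _1)\) by inverting \(\omega (\tau ) = \tau - \ln (1+\tau )\) with bisection, and confirms that \(\delta _2 \rightarrow 0\) as \(\delta _0, \delta _1 \rightarrow 0\):

```python
import math

def omega(tau):
    # omega(tau) = tau - ln(1 + tau): increasing on [0, +inf)
    return tau - math.log(1.0 + tau)

def omega_inv(s, tol=1e-12):
    # Invert omega on [0, +inf) by bisection.
    lo, hi = 0.0, 1.0
    while omega(hi) < s:          # bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if omega(mid) < s else (lo, mid)
    return 0.5 * (lo + hi)

def delta2(delta0, delta1):
    # delta_2(delta_0, delta_1) = (1 + delta_0) * omega^{-1}(delta_1)
    return (1.0 + delta0) * omega_inv(delta1)

for d in [1e-2, 1e-4, 1e-6]:
    print(d, delta2(d, d))        # decays to 0 with (delta_0, delta_1)
```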
(d) Let us consider the function \(\varphi (y) := f(y) - \langle \nabla {f}(x^0), y\rangle \) for some \(x^0 \in \mathrm {dom}(f)\). It is clear that \(\nabla {\varphi }(x^0) = 0\), which shows that \(x^0\) is a minimizer of \(\varphi \). Hence, for \(x\in \mathrm {int}\left( \mathrm {dom}(f)\right) \), we have \(\varphi (x^0) \le \varphi (x - tH(x)^{-1}h(x))\) for any \(t > 0\) such that \(x - tH(x)^{-1}h(x) \in \mathrm {dom}(f)\). If we define \(\tilde{\varphi }(x) := {\tilde{f}}(x) - \langle \nabla {f}(x^0), x\rangle \) and \(h(x) := g(x) - \nabla {f}(x^0)\), then, by using (5), we can further derive
Minimizing the right-hand side of the last estimate w.r.t. \(t > 0\), we obtain
with the minimum attained at \(t = \frac{1}{(1+\delta _0)(1+\delta _0+\vert\!\Vert h(x) \Vert\!\vert_x^{*})}\). Using the definition of \(\varphi \) and the Cauchy–Schwarz inequality, we have
Setting \(x^0 = y\) in this inequality, we obtain exactly (9). \(\square \)
A.2 The proof of Lemma 2: properties of local inexact oracle
From the second line of (6), for any \(u\in {\mathbb{R}}^p\), we have
which implies the first expression of (10).
Using again the second line of (6), we have \(\frac{1}{(1+\delta _3)^2}\nabla ^2{f}(x)^{-1} \preceq H(x)^{-1} \preceq \frac{1}{(1-\delta _3)^2}\nabla ^2{f}(x)^{-1}\). Hence, for any \(v\in {\mathbb{R}}^p\), one has
which implies the second expression of (10).
Now, we prove (11). For any \(x, y \in {\mathcal{X}}\), using (10) with \(u := y - x\), we have
For \(x, y \in {\mathcal{X}}\) such that \(\vert\!\Vert y-x \Vert\!\vert_x < 1-\delta _3\) for \(\delta _3 \in [0, 1)\), the estimate (55) implies that
Since \(\vert\!\Vert y-x \Vert\!\vert_x < 1-\delta _3\), by (55), we have \(\left\| y-x\right\| _x < 1\). Hence, by [23, Theorem 4.1.6], we can show that
Combining (57) and (56), and using again (6), we can further derive
and
Therefore, we obtain the first estimate of (11) from these expressions.
From the second line of (6), we have \(-(2\delta _3 - \delta _3^2)\nabla ^2{f}(x) \preceq H(x) - \nabla ^2 f(x) \preceq (2\delta _3 + \delta _3^2)\nabla ^2 f(x)\). If we define \(G_x := [\nabla ^2 f(x)]^{-1/2}(\nabla ^2 f(x) - H(x))[\nabla ^2 f(x)]^{-1/2}\), then the last estimate implies that
Moreover, by (55), (57), (58), and the Cauchy–Schwarz inequality in (i), we can further derive
which is exactly the second estimate of (11). \(\square \)
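The norm equivalences in (10) are straightforward to verify numerically. The following Python sketch (ours; the construction of \(H\) is an illustrative choice, not the oracle used in the paper) builds a random \(\nabla ^2 f(x)\succ 0\) and an \(H(x)\) satisfying the second line of (6), then checks both expressions of (10):

```python
import numpy as np

rng = np.random.default_rng(0)
p, delta3 = 5, 0.1

# Random Hessian Q = grad^2 f(x) > 0 and an H with
# (1 - delta3)^2 Q <= H <= (1 + delta3)^2 Q  (second line of (6)).
A = rng.standard_normal((p, p))
Q = A @ A.T + p * np.eye(p)
Qh = np.linalg.cholesky(Q)                      # Q = Qh Qh^T
D = np.diag(rng.uniform((1 - delta3)**2, (1 + delta3)**2, p))
H = Qh @ D @ Qh.T

u = rng.standard_normal(p)
norm_x  = np.sqrt(u @ Q @ u)                    # ||u||_x
Hnorm_x = np.sqrt(u @ H @ u)                    # |||u|||_x
assert (1 - delta3) * norm_x <= Hnorm_x <= (1 + delta3) * norm_x

v = rng.standard_normal(p)
dual_x  = np.sqrt(v @ np.linalg.solve(Q, v))    # ||v||_x^*
Hdual_x = np.sqrt(v @ np.linalg.solve(H, v))    # |||v|||_x^*
assert dual_x / (1 + delta3) <= Hdual_x <= dual_x / (1 - delta3)
print("both expressions of (10) verified")
```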
A.3 The proof of Lemma 3: computational inexact oracle
(a) We first prove the left-hand side inequality of (5). Since f is standard self-concordant, for any \(x, y\in \mathrm {dom}(f)\) and \(\alpha \in [0, 1]\), we have
where (i) follows from [23, Theorem 4.1.7] and (ii) follows from the Cauchy–Schwarz inequality.
Let \(\gamma := 1 - \delta _3 \in (0, 1]\). We consider the function
The first and second derivatives of \(\underline{\psi }\) are given respectively by
Since \(\alpha \in [0, 1]\), it is easy to check that \(\underline{\psi }''(t) \ge 0\) for all \(t \ge 0\). Hence, \(\underline{\psi }\) is convex.
If \((1-\alpha )\gamma > \delta _2\), then \(\underline{\psi }\) attains the minimum at \(\underline{t}^{*} > 0\) as the positive solution of \(\underline{\psi }'(t) = (1-\alpha )\gamma - \delta _2 - \frac{\gamma }{1+\gamma t} + \frac{\alpha \gamma }{1 + \alpha \gamma t} = 0\). Solving this equation for a positive solution, we get
Let us choose \(\alpha := 1 - \frac{2\delta _2}{1-\delta _3} = \frac{1-2\delta _2-\delta _3}{1-\delta _3}\). To guarantee \(\alpha \in [0, 1]\), we impose \(2\delta _2 + \delta _3 \in [0, 1)\). Moreover, \((1-\alpha )\gamma - \delta _2 = \delta _2 \ge 0\). Substituting \(\alpha \) and \(\gamma = 1 - \delta _3\) into \(\underline{t}^{*}\), we eventually obtain
As a result, we can directly compute
In this case, we can write the minimum value \(\underline{\psi }(\underline{t}^{*})\) of \(\underline{\psi }\) explicitly as
where \(\underline{c}_{23} := \left[ (1-\delta _2 - \delta _3)^2 + (1-\delta _3)(1-2\delta _2 - \delta _3)\right] ^{1/2} \ge 0\). This is exactly the third line of (16).
Now, substituting this lower bound \(\underline{\psi }^{*}(\delta _2,\delta _3)\) of \(\underline{\psi }\) into (59) and noting that \(\alpha (1-\delta _3) = 1 - 2\delta _2 - \delta _3\), we obtain
Clearly, if we define \({\tilde{f}}(x) := {\hat{f}}(x) - \varepsilon ~+~ \underline{\psi }^{*}(\delta _2,\delta _3)\) and \(\delta _0 := 2\delta _2 + \delta _3 \in [0, 1)\), then the last inequality is exactly the left-hand side inequality of (5).
(b) To prove the right-hand side inequality of (5), we first derive
where (a) follows from [23, Theorem 4.1.8] and (b) holds due to the Cauchy–Schwarz inequality.
Let \(\beta \ge 1\) and \(\bar{\gamma } := 1 + \delta _3 \ge 1\). We consider the following function
First, we compute the first and second derivatives of \(\bar{\psi }\) respectively as
Clearly, \(\bar{\psi }''(t) \le 0\) for all \(0 \le t < \frac{1}{\bar{\gamma }\beta }\). Hence, \(\bar{\psi }\) is concave in t. To find the maximum value of \(\bar{\psi }\), we need to solve \(\bar{\psi }'(t) = 0\) for \(t > 0\), and obtain
Let us choose \(\beta := 1 + \frac{2\delta _2}{1+\delta _3} \ge 1\). Then we can explicitly compute \(\bar{t}^{*}\) as
To evaluate \(\bar{\psi }(\bar{t}^{*})\), we first compute
Using this expression, we can explicitly compute the maximum value \(\bar{\psi }(\bar{t}^{*})\) of \(\bar{\psi }\) as
where \(\bar{c}_{23} := \sqrt{3(1+\delta _2+\delta _3)^2 - (1+\delta _3)(1+2\delta _2+\delta _3)} \ge 0\). Plugging this expression into (60), and noting that \(\beta (1+\delta _3) = 1 + 2\delta _2 + \delta _3 = 1 + \delta _0\), we can show that
Finally, by defining \(\delta _1 := \max \left\{ 0, 2\varepsilon + \bar{\psi }^{*}(\delta _2,\delta _3) - \underline{\psi }^{*}(\delta _2,\delta _3)\right\} \ge 0\) and noting that \({\tilde{f}}(x) = {\hat{f}}(x) - \varepsilon + \underline{\psi }^{*}\), we obtain
which proves the right-hand side inequality of (5). \(\square \)
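As a sanity check on the stationarity computations in this proof, one can reconstruct \(\underline{\psi }\) from its derivative \(\underline{\psi }'\) quoted above (up to an additive constant, which does not move the minimizer; this antiderivative is our assumption) and locate \(\underline{t}^{*}\) by bisection. A minimal Python sketch:

```python
import math

delta2, delta3 = 0.05, 0.1                      # requires 2*delta2 + delta3 < 1
gamma = 1.0 - delta3
alpha = (1.0 - 2.0 * delta2 - delta3) / (1.0 - delta3)

def dpsi(t):
    # psi'(t) = (1-alpha)*gamma - delta2 - gamma/(1+gamma*t) + alpha*gamma/(1+alpha*gamma*t)
    return ((1 - alpha) * gamma - delta2
            - gamma / (1 + gamma * t)
            + alpha * gamma / (1 + alpha * gamma * t))

def psi(t):
    # antiderivative of psi' (up to a constant)
    return (((1 - alpha) * gamma - delta2) * t
            - math.log(1 + gamma * t) + math.log(1 + alpha * gamma * t))

# psi is convex and psi'(0) = -delta2 < 0, so bisect for the positive root t*.
lo, hi = 0.0, 1.0
while dpsi(hi) < 0:
    hi *= 2.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dpsi(mid) < 0 else (lo, mid)
t_star = 0.5 * (lo + hi)
print("t* =", t_star, " psi'(t*) =", dpsi(t_star), " psi(t*) =", psi(t_star))
```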
A.4 The proof of Lemma 4: inexact oracle of the dual problem
Since \(\varphi \) is self-concordant, by [23, Theorem 4.1.6] and \(\delta (x) := \left\| {\tilde{u}}^{*}(x) - u^{*}(x)\right\| _{{\tilde{u}}^{*}(x)}\), we have
Multiplying this estimate on the left by A and on the right by \(A^{\top }\), we obtain
Using (19) and (20), this estimate leads to
Since \(\delta (x) \le \delta \) and \(\delta _3 := \frac{\delta }{1-\delta } \in [0, 1)\), we have \((1-\delta (x))^2 \ge (1-\delta )^2 \ge (1-\delta _3)^2\) and \(\frac{1}{(1-\delta (x))^2} \le \frac{1}{(1-\delta )^2} = (1+\delta _3)^2\). Using these inequalities in (61), we obtain the second bound of (21).
Next, by the definition of g(x) and \(\nabla {f}(x)\), we can derive that
where we use \(A^{\top }(AQ^{-1}A^{\top })^{-1}A \preceq Q\) for \(Q = \nabla ^2{\varphi }(u^{*}(x))\succ 0\) in (i) (see [34] for a detailed proof of this inequality). This expression implies \(\vert\!\Vert g(x) - \nabla f(x) \Vert\!\vert_x^{*} \le \delta \), the first estimate of (21).
Now, by the definition of f in (17) and of \({\tilde{f}}\) in (20), respectively, and the optimality condition \(\nabla {\varphi }(u^{*}(x)) = A^{\top }x\) in (17), we have
Since \(\varphi \) is standard self-concordant, using [23, Theorems 4.1.7 and 4.1.8] we obtain from the last expression that
which leads to
provided that \(\delta (x)<1\). Combining this with \(\delta (x) \le \delta \) and the monotonicity of \(\omega _{*}\), we obtain \(\vert f(x) - {\tilde{f}}(x)\vert \le \omega _{*}\left( \tfrac{\delta }{1-\delta }\right) =: \varepsilon \).
Using Lemma 3 with \(\varepsilon :=\omega _{*}\left( \frac{\delta }{1-\delta }\right) \), \(\delta _2 := \delta \), and \(\delta _3\) defined above, we conclude that \(({\tilde{f}}, g, H)\) given by (20) is a \((\delta _0,\delta _1)\)-global inexact oracle of f, where \(\delta _0\) and \(\delta _1\) are computed from Lemma 3. Since \(2\delta _2+\delta _3<1\) is required in Lemma 3, a direct numerical calculation gives \(\delta \in [0,0.292]\).
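The range \(\delta \in [0, 0.292]\) can be double-checked directly: with \(\delta _2 = \delta \) and \(\delta _3 = \frac{\delta }{1-\delta }\), the condition \(2\delta _2 + \delta _3 < 1\) reduces to \(2\delta ^2 - 4\delta + 1 > 0\), whose relevant root is \(1 - \frac{\sqrt{2}}{2} \approx 0.2929\). A short Python check (ours):

```python
import math

# 2*delta + delta/(1 - delta) < 1  <=>  2*delta**2 - 4*delta + 1 > 0
print(1.0 - math.sqrt(2.0) / 2.0)            # 0.292893..., so delta in [0, 0.292]
check = lambda d: 2.0 * d + d / (1.0 - d)    # must stay below 1
print(check(0.292), check(0.293))            # just below / just above 1
```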
From the optimality condition of (18) we have \(\nabla {\varphi }(u^{*}(x)) - A^{\top }x = 0\). Let \(r(x) := \nabla {\varphi }({\tilde{u}}^{*}(x)) - A^{\top }x\). Then, using the self-concordance of \(\varphi \), we have
where we use [23, Theorem 4.1.7] in (a). Since \(\delta (x) := \Vert {\tilde{u}}^{*}(x) - u^{*}(x)\Vert _{{\tilde{u}}^{*}(x)}\), by the Cauchy–Schwarz inequality, we can show that \(\frac{\delta (x)^2}{1 + \delta (x)} \le \Vert r(x)\Vert ^{*}_{{\tilde{u}}^{*}(x)}\delta (x)\), which leads to \(\frac{\delta (x)}{1+\delta (x)} \le \Vert r(x)\Vert ^{*}_{{\tilde{u}}^{*}(x)}\).
Finally, we assume that \(\Vert r(x)\Vert ^{*}_{{\tilde{u}}^{*}(x)} \le \frac{\delta }{1+\delta }\) for some \(\delta > 0\) as stated in Lemma 4. Combining this condition with the last inequality \(\frac{\delta (x)}{1+\delta (x)} \le \Vert r(x)\Vert ^{*}_{{\tilde{u}}^{*}(x)}\), we get \(\frac{\delta (x)}{1+\delta (x)} \le \frac{\delta }{1+\delta }\), which implies \(\delta (x) \le \delta \) since \(t\mapsto \frac{t}{1+t}\) is increasing on \([0,+\infty )\). \(\square \)
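For illustration, the following Python sketch implements this stopping rule on a hypothetical instance: we take \(\varphi \) to be the log-barrier \(\varphi (u) = -\sum _i\ln u_i\) (our choice, so that \(u^{*}(x)\) is available in closed form for comparison), replace \(A^{\top }x\) by an arbitrary negative vector b, and compute \({\tilde{u}}^{*}(x)\) by damped Newton steps until \(\Vert r(x)\Vert ^{*}_{{\tilde{u}}^{*}(x)} \le \frac{\delta }{1+\delta }\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 6, 0.05

# Log-barrier: grad phi(u) = -1/u, hess phi(u) = diag(1/u^2).
# b stands in for A^T x; any b < 0 gives the minimizer u*(x) = -1/b > 0.
b = -rng.uniform(0.5, 2.0, n)

u = np.ones(n)                               # strictly feasible start
while True:
    r = -1.0 / u - b                         # r(x) = grad phi(u) - A^T x
    lam = np.sqrt(np.sum((r * u) ** 2))      # ||r(x)||_u^*, using hess phi(u)^{-1} = diag(u^2)
    if lam <= delta / (1.0 + delta):         # the stopping rule of Lemma 4
        break
    u -= (u ** 2) * r / (1.0 + lam)          # damped Newton step

u_star = -1.0 / b
dist = np.sqrt(np.sum(((u - u_star) / u) ** 2))  # delta(x) = ||u~*(x) - u*(x)||_{u~*(x)}
print(f"delta(x) = {dist:.2e} <= delta = {delta}")
```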
A.5 The proof of Lemma 5: key estimate for local convergence analysis
First, recall that \(\nu ^k\in g(x^k)+H(x^k)(z^k - x^k)+\partial R(z^k)\) from (28). Using the definition of \(\mathcal{P}_x\) from (23), this expression leads to
Shifting the index from k to \(k+1\), the last expression leads to
Next, if we denote by \(r_{x^k}(z^k):=g(x^k)+H(x^k)(z^k-x^k)\), then again from (28) and (23), we can rewrite
Denote \(H_k := H(x^k)\), \(f_k':=\nabla f(x^k)\), and \(g_k := g(x^k)\) for simplicity. By the triangle inequality, we have
To upper bound \(\lambda _{k+1}\), we upper bound each term of (64) as follows.
(a) For the first term \(\vert\!\Vert x^{k+1}-z^k\Vert\!\vert_{x^{k+1}}\) of (64), since f is standard self-concordant, by (10) and [23, Theorem 4.1.5], we have
Since \(\alpha _k \in [0, 1]\), \(\lambda _k := \vert\!\Vert d^k \Vert\!\vert_{x^k}\), and \(x^{k+1} := x^k + \alpha _k(z^k - x^k) = x^k + \alpha _kd^k\) due to (iPNA), we have
Substituting (65) into the last estimate, we obtain
(b) For the second term \(\vert\!\Vert z^{k+1} - z^{k} \Vert\!\vert_{x^{k+1}}\) of (64), using (62), (63), the triangle inequality in (i), and the nonexpansiveness of the scaled proximal operator \(\mathcal{P}_{x}\) from (25), we can show that
To further estimate the last term \([\mathcal{T}_2]\) of (67), we have
Utilizing this estimate and the triangle inequality, we can estimate the term \([\mathcal{T}_2]\) of (67) as
Now, using the triangle inequality, we can split the term \([\mathcal{T}_1]\) of (67) as
In addition, using the left-hand side inequality in the first line of (11) with \(x := x^k\) and \(y := x^{k+1}\), we have
To estimate each term of (69), we first note that
Second, using (70) and \(\vert\!\Vert H_kd^k \Vert\!\vert_{x^k}^{*} = \vert\!\Vert d^k \Vert\!\vert_{x^k} = \lambda _k\), we can show that
Third, by (6) and (70), we have
Fourth, utilizing (70) and [23, Theorem 4.1.14], we can show that
Fifth, employing the second inequality of (11) with \(x := x^k\), \(y := x^{k+1}\), and \(v := x^{k+1} - x^k\), we can show that
Finally, substituting (71), (72), (73), (74), and (75) into (69), we can upper bound \([\mathcal{T}_1]\) as
Plugging this upper bound of \([\mathcal{T}_1]\) and the upper bound of \([\mathcal{T}_2]\) from (68) into (67), we obtain
Substituting this estimate and (66) back into (64), we get
Since \(0< 1 - \delta _4 < 1\), rearranging this estimate, we obtain (36). \(\square \)
A.6 Detailed proofs of the missing technical results in the main text
In this subsection, we provide more details of some missing proofs in the main text.
(a) Technical details in the proof of Theorems 3 and 4: Let us denote the right-hand side of (36) by
where \(\lambda _k, \delta _2 \ge 0\), \(\alpha _k \in [0, 1]\), \(\delta _3, \delta _4 \in [0, 1)\), \(\alpha _k\lambda _k + \delta _3 < 1\), and \(\theta := (\delta _2, \delta _3, \delta _4)\).
If \(\alpha _k = 1\), then \(H(\cdot )\) reduces to
If \(\alpha _k = \frac{1-\delta _4}{(1+\delta )(1+\delta + (1-\delta )\lambda _k)} \in [0, 1]\), then \(H(\cdot )\) can be rewritten as
The following lemma is used to prove Theorems 3 and 4 in the main text.
Lemma 6
The function \(H_1(\cdot )\) defined by (76) is monotonically increasing w.r.t. each variable \(\lambda _k \ge 0\), \(\delta _2 \ge 0\), \(\delta _3 \in [0, 1)\), and \(\delta _4\in [0, 1)\) such that \(\lambda _k + \delta _3 < 1\).
Similarly, for given \(\lambda _k > 0\) and \(\delta \in [0, 1)\), the function \(H_2(\cdot )\) defined by (77) is monotonically increasing w.r.t. each variable \(\alpha _k \in [0, 1]\), \(\delta _2 \ge 0\), \(\delta _3 \in [0, 1)\), and \(\delta _4\in [0, 1)\) such that \(\alpha _k\lambda _k + \delta _3 < 1\). Moreover, if \(0 \le \delta _3, \delta _4 \le \delta \), then we can upper bound \(H_2\) as \(H_2(\alpha _k,\lambda _k, \delta , \theta ) \le {\widehat{H}}_2(\lambda _k, \delta , \delta _2)\), where
The function \({\widehat{H}}_2(\cdot )\) is also monotonically increasing w.r.t. each variable \(\delta _2\) and \(\lambda _k\).
Proof
We first consider \(H_1\) defined by (76). For \(\lambda _k \ge 0\), \(\delta _2 \ge 0\), \(\delta _3 \in [0, 1)\), and \(\delta _4\in [0, 1)\) such that \(\lambda _k + \delta _3 < 1\), the first term is \(\frac{\delta _2}{1-\delta _4}\), which is monotonically increasing w.r.t. \(\delta _2\) and \(\delta _4\). The second term is monotonically increasing w.r.t. \(\lambda _k\), \(\delta _2\), \(\delta _3\), and \(\delta _4\). The third and fourth terms are monotonically increasing w.r.t. \(\lambda _k\), \(\delta _3\), and \(\delta _4\). Consequently, \(H_1(\cdot )\) is monotonically increasing w.r.t. \(\lambda _k\), \(\delta _2\), \(\delta _3\), and \(\delta _4\).
For fixed \(\lambda _k > 0\) and \(\delta \in [0, 1)\), we consider \(H_2(\cdot )\) defined by (77). Clearly, for \(\alpha _k \in [0, 1]\), \(\delta _2 \ge 0\), \(\delta _3 \in [0, 1)\), and \(\delta _4\in [0, 1)\) such that \(\alpha _k\lambda _k + \delta _3 < 1\), the first term is monotonically increasing w.r.t. \(\delta _2\) and \(\delta _4\). The second term is monotonically increasing w.r.t. \(\lambda _k\), \(\delta _2\), \(\delta _3\), and \(\delta _4\). The third and fourth terms are monotonically increasing w.r.t. \(\delta _3\) and \(\delta _4\). Consequently, \(H_2(\cdot )\) is monotonically increasing w.r.t. \(\delta _2\), \(\delta _3\), and \(\delta _4\). Using the upper bound \(\delta \) of \(\delta _3\) and \(\delta _4\) in \(H_2\), we easily get \(H_2(\alpha _k,\lambda _k, \delta , \theta ) \le {\widehat{H}}_2(\lambda _k, \delta , \delta _2)\). The monotonicity of \({\widehat{H}}_2\) w.r.t. \(\delta _2\) and \(\lambda _k\) can be checked directly by verifying each term separately. \(\square \)
(b) Technical details for Example 1(c) in Sect. 3.1: We now prove the estimate (14) stated in Example 1(c) of Sect. 3.1.
Since \(f_1(x) = -\ln (x)\), \(f_2(x) = \max \{\delta _1x, \delta _1\}\), and \(f(x) = f_1(x) + f_2(x)\), we have \(\mathrm {dom}(f) = \{x\in {\mathbb{R}}\mid x > 0\}\). Moreover, since \(\nabla ^2{f_1}(x) = H(x) = \frac{1}{x^2}\), the condition \(\vert\!\Vert y - x \Vert\!\vert_{x} < 1\) (here we use \(\delta _0 = 0\)) leads to \(\frac{(y-x)^2}{x^2} < 1\), which is equivalent to \(-x< y-x < x\), i.e., \(0< y < 2x\). Hence, the condition \(\vert\!\Vert y - x \Vert\!\vert_{x} < 1\) is equivalent to \(0< y < 2x\). In this case, we have
Using this expression, one can show that
In summary, we get \(f_2(y) - f_2(x) - \langle g_2(x), y-x\rangle \le \delta _1\), which is exactly (14). \(\square \)
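This bound is also easy to confirm numerically. The following Python sketch (ours) samples pairs \((x, y)\) with \(0< y < 2x\), uses a particular subgradient selection \(g_2\) of \(f_2\) (an illustrative choice), and checks that the gap never exceeds \(\delta _1\):

```python
import numpy as np

delta1 = 0.3
f2 = lambda x: max(delta1 * x, delta1)       # f2 from Example 1(c)
g2 = lambda x: delta1 if x > 1 else 0.0      # one valid subgradient selection

rng = np.random.default_rng(2)
worst = -np.inf
for _ in range(10_000):
    x = rng.uniform(0.01, 5.0)
    y = rng.uniform(0.0, 2.0 * x)            # |||y - x|||_x < 1  <=>  0 < y < 2x
    if y <= 0.0:
        continue
    gap = f2(y) - f2(x) - g2(x) * (y - x)
    worst = max(worst, gap)
print("max gap =", worst, "<= delta1 =", delta1)   # matches (14)
```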
A.7 Implementation details: approximate proximal Newton directions
When solving for \(z^k\) in (iPNA), we use FISTA [1]. At the \(j^{\mathrm {th}}\) iteration of the inner loop, \(d^j\) is computed as
where \(w^j :=d^{j-1}+\frac{t_{j-1}-1}{t_{j}}(d^{j-1}-d^{j-2})\). By the definition of \(\mathrm {prox}_{\alpha R}\), the following relation holds:
which guarantees that the vector \(\nu ^k :=\frac{w^j - d^j}{\alpha }+H(x^k)(d^j - w^j)=\left( \frac{{\mathbb{I}}_p}{\alpha }-H(x^k)\right) (w^j - d^j)\) satisfies the condition \(\nu ^k \in g(x^k)+H(x^k)d^j+\partial R(x^k+d^j)\). In our implementation, this \(\nu ^k\) is used in (28) to decide whether to accept \(d^k := d^j\) as an inexact proximal Newton direction at iteration k of (iPNA).
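For reference, a minimal Python sketch of this inner loop is given below, with \(R = \Vert \cdot \Vert _1\) as an illustrative prox-friendly regularizer and a placeholder acceptance test standing in for the exact condition (28); all names are ours. In practice, the step size \(1/\lambda _{\max }(H(x^k))\) can be estimated by a few power iterations instead of a full eigendecomposition.

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def inner_fista(g, H, xk, delta2, max_iter=500):
    """FISTA on Q_k(d) = <g, d> + 0.5 d^T H d + R(xk + d) with R = ||.||_1."""
    p = g.shape[0]
    alpha = 1.0 / np.linalg.eigvalsh(H)[-1]     # step 1/L with L = lambda_max(H)
    d_prev = d = np.zeros(p)
    t_prev = 1.0
    for _ in range(max_iter):
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        w = d + ((t_prev - 1.0) / t) * (d - d_prev)
        # prox step on d -> R(xk + d): shift, soft-threshold, shift back
        d_prev, d = d, soft_threshold(xk + w - alpha * (g + H @ w), alpha) - xk
        t_prev = t
        # residual nu^k = (I/alpha - H)(w - d) from the prox optimality condition
        nu = (w - d) / alpha - H @ (w - d)
        lam = np.sqrt(d @ H @ d)                # |||d|||_{x^k}
        # placeholder acceptance test in the spirit of (28)
        if np.sqrt(nu @ np.linalg.solve(H, nu)) <= delta2 * lam:
            break
    return d, nu
```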
Cite this article
Sun, T., Necoara, I. & Tran-Dinh, Q. Composite convex optimization with global and local inexact oracles. Comput Optim Appl 76, 69–124 (2020). https://doi.org/10.1007/s10589-020-00174-2
Keywords
- Self-concordant functions
- Composite convex minimization
- Local and global inexact oracles
- Inexact proximal Newton-type method
- Primal–dual second-order method