Abstract
This short note considers an efficient variant of the trust-region algorithm with dynamic accuracy proposed by Carter (SIAM J Sci Stat Comput 14(2):368–388, 1993) and by Conn et al. (Trust-region methods. MPS-SIAM series on optimization, SIAM, Philadelphia, 2000) as a tool for very high-performance computing, an area where allowing multi-precision computations is critical for keeping energy dissipation under control. Numerical experiments are presented indicating that the considered method can bring substantial savings in the “energy costs” of objective-function and gradient evaluations by efficiently exploiting multi-precision computations.
Notes
The solution of nonlinear systems of equations is considered rather than unconstrained optimization.
Numerical experiments not reported here suggest that our default choice of remembering 15 secant pairs gives good performance, although keeping a smaller number of pairs is still acceptable.
Carter [10] requires \(\omega _g \le 1-\eta _2\) while we require \(\omega _g\le \kappa _g\) with \(\kappa _g\) satisfying (2.5). A fixed value is also used for \(\omega _f\), whose upper bound depends on \(\omega _g\). The Hessian approximation is computed using an unsafeguarded standard BFGS update.
The collection of [8] and a few other problems, all available in Matlab.
Recall that this cost is proportional to the square of the number of significant digits.
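This quadratic cost model can be made concrete with a small sketch (ours, purely illustrative: the function name and unit cost are assumptions, not from the paper) comparing the relative energy cost of evaluations carried out at different precisions:

```python
# Hypothetical energy-cost model: one evaluation is assumed to cost
# an amount proportional to the square of its number of significant digits.
def eval_cost(significant_digits, unit_cost=1.0):
    """Energy cost of one function/gradient evaluation at a given precision."""
    return unit_cost * significant_digits**2

# Relative saving of a half-precision evaluation (about 3 significant
# decimal digits) versus a double-precision one (about 15 digits):
saving = 1.0 - eval_cost(3) / eval_cost(15)  # 1 - 9/225 = 0.96
```

Under this model, a half-precision evaluation costs only 4% of a double-precision one, which is why shifting early iterations to low precision can dominate the overall energy budget.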
References
Baboulin, M., Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: Accelerating scientific computations with mixed precision algorithms. Comput. Phys. Commun. 180, 2526–2533 (2009)
Bellavia, S., Gratton, S., Riccietti, E.: A Levenberg–Marquardt method for large nonlinear least-squares problems with dynamic accuracy in functions and gradients. Numer. Math. 140, 791–825 (2018)
Bellavia, S., Gurioli, G., Morini, B.: Theoretical study of an adaptive cubic regularization method with dynamic inexact Hessian information (2018). arXiv:1808.06239
Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L.: Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. SIAM J. Optim. 29(4), 2881–2915 (2019)
Bergou, E., Diouane, Y., Kungurtsev, V., Royer, C.W.: A subsampling line-search method with second-order results (2018). arXiv:1810.07211
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method via supermartingales. INFORMS J. Optim. 1(2), 92–119 (2019)
Brown, A.A., Bartholomew-Biggs, M.: Some effective methods for unconstrained optimization based on the solution of ordinary differential equations. Technical Report 178, Hatfield Polytechnic, Hatfield, UK (1987)
Buckley, A.G.: Test functions for unconstrained minimization. Technical Report CS-3, Computing Science Division, Dalhousie University, Dalhousie, Canada (1989)
Carter, R.G.: A worst-case example using linesearch methods for numerical optimization with inexact gradient evaluations. Technical Report MCS-P283-1291, Argonne National Laboratory, Argonne, USA (1991)
Carter, R.G.: Numerical experience with a class of algorithms for nonlinear optimization using inexact function and gradient information. SIAM J. Sci. Stat. Comput. 14(2), 368–388 (1993)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. In: The Proceedings of the 2018 International Conference of Mathematicians (ICM 2018), Rio de Janeiro (2018)
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. Ser. A 159(2), 337–375 (2018)
Chen, X., Jiang, B., Lin, T., Zhang, S.: On adaptive cubic regularization Newton’s methods for convex optimization via random sampling (2018). arXiv:1802.05426
Conn, A.R., Gould, N.I.M., Lescrenier, M., Toint, Ph.L.: Performance of a multifrontal scheme for partially separable optimization. In: Gomez, S., Hennart, J.P. (eds.) Advances in Optimization and Numerical Analysis. Proceedings of the Sixth Workshop on Optimization and Numerical Analysis, Oaxaca, Mexico, vol. 275, pp. 79–96. Kluwer Academic Publishers, Dordrecht (1994)
Conn, A.R., Gould, N.I.M., Toint, Ph.L.: LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer, Heidelberg (1992)
Conn, A.R., Gould, N.I.M., Toint, Ph.L.: Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia (2000)
Dixon, L.C.W., Maany, Z.: A family of test problems with sparse Hessian for unconstrained optimization. Technical Report 206, Numerical Optimization Center, Hatfield Polytechnic, Hatfield, UK (1988)
Elster, C., Neumaier, A.: A method of trust region type for minimizing noisy functions. Computing 58(1), 31–46 (1997)
Galal, S., Horowitz, M.: Energy-efficient floating-point unit design. IEEE Trans. Comput. 60(7), 913–922 (2011)
Gould, N.I.M., Orban, D., Toint, Ph.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2015)
Griewank, A., Toint, Ph.L.: Partitioned variable metric updates for large structured optimization problems. Numer. Math. 39, 119–137 (1982)
Higham, N.J.: The rise of multiprecision computations. Talk at SAMSI 2017. https://bit.ly/higham-samsi17 (2017)
Kugler, L.: Is “good enough” computing good enough? Commun. ACM 58, 12–14 (2015)
Leyffer, S., Wild, S., Fagan, M., Snir, M., Palem, K., Yoshii, K., Finkel, H.: Moore with less—leapfrogging Moore’s law with inexactness for supercomputing (2016). arXiv:1610.02606v2 (to appear in Proceedings of PMES 2018: 3rd International Workshop on Post Moore’s Era Supercomputing)
Li, G.: The secant/finite difference algorithm for solving sparse nonlinear systems of equations. SIAM J. Numer. Anal. 25(5), 1181–1196 (1988)
Matsuoka, S.: Private communication (2018)
Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software. ACM Trans. Math. Softw. 7(1), 17–41 (1981)
Nocedal, J., Wright, S.J.: Numerical Optimization. Series in Operations Research. Springer, Heidelberg (1999)
Palem, K.V.: Inexactness and a future of computing. Philos. Trans. R. Soc. A 372, 20130281 (2014)
Poenisch, G., Schwetlick, H.: Computing turning points of curves implicitly defined by nonlinear equations depending on a parameter. Computing 20, 101–121 (1981)
Pu, J., Galal, S., Yang, X., Shacham, O., Horowitz, M.: FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 single-precision FPU, and a 43.7 GFLOPS/W at 74.6 GFLOPS/mm2 double-precision FPU, in 28nm UTBB FDSOI. In: Hardware Architecture (2016)
Schmidt, J.W., Vetters, K.: Ableitungsfreie Verfahren für nichtlineare Optimierungsprobleme. Numer. Math. 15, 263–282 (1970)
Spedicato, E.: Computational experience with quasi-Newton algorithms for minimization problems of moderately large size. Technical Report CISE-N-175, CISE Documentation Service, Segrate, Milano (1975)
Wang, N., Choi, J., Brand, D., Chen, C.-Y., Gopalakrishnan, K.: Training deep neural networks with 8-bit floating point numbers. In: 32nd Conference on Neural Information Processing Systems (2018). arXiv:1812.08011
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact Hessian information (2017). arXiv:1708.07164v3
Acknowledgements
S. Gratton: Partially supported by 3IA Artificial and Natural Intelligence Toulouse Institute, French “Investing for the Future - PIA3” program under the Grant Agreement ANR-19-PI3A-0004. P. L. Toint: Partially supported by ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02.
Appendix: Complexity theory for the TR1DA algorithm
For the sake of accuracy and completeness, we now provide details of the first-order worst-case complexity analysis summarized at the end of Sect. 2. As indicated there, the following development can be seen as a combination of the arguments proposed in [16] for the convergence theory of trust-region methods with inexact gradients (p. 280) and dynamic accuracy (p. 400).
We assume that
- AS.1::
The objective function f is twice continuously differentiable in \(\mathfrak {R}^n\) and there exists a constant \(\kappa _\nabla \ge 0\) such that \(\Vert \nabla _x^2f(x)\Vert \le \kappa _\nabla\) for all \(x \in \mathfrak {R}^n\).
- AS.2::
There exists a constant \(\kappa _H\ge 0\) such that \(\Vert H_k\Vert \le \kappa _H\) for all \(k\ge 0\).
- AS.3::
There exists a constant \(\kappa _{\mathrm{low}}\) such that \(f(x)\ge \kappa _{\mathrm{low}}\) for all \(x\in \mathfrak {R}^n\).
Lemma A.1
Suppose AS.1 and AS.2 hold. Then, for each \(k\ge 0,\)
for \(\kappa _{H\nabla } = 1+\max [\kappa _H, \kappa _\nabla ].\)
Proof
(See [16, Theorem 8.4.2].) The definition (2.8), (2.6), the mean-value theorem, the Cauchy–Schwarz inequality and AS.1 give that, for some \(\xi _k\) in the segment \([x_k,x_k+s_k]\),
and (A.1) follows from the Cauchy–Schwarz inequality and the inequality \(\Vert s_k\Vert \le \Delta _k\). \(\square\)
Lemma A.2
We have that, for all \(k\ge 0,\)
and
Proof
(See [16, p. 401].) The mechanism of the TR1DA algorithm ensures that (A.2) holds. Hence,
As a consequence, for iterations where \(\rho _k \ge \eta _1\),
and (A.3) must hold. \(\square\)
This result implies, in particular, that the sequence \(\{f(x_k)\}_{k\ge 0}\) is non-increasing, and the TR1DA algorithm is therefore monotone on the exact function f.
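The monotone accept/shrink/expand mechanism underlying this result can be sketched as follows. This is a simplified illustration, not the paper's TR1DA: it uses exact evaluations, an identity Hessian model, and a Cauchy (steepest-descent) step, and all names and default parameter values are ours.

```python
import numpy as np

def tr_sketch(f, grad, x0, delta0=1.0, eps=1e-6,
              eta1=0.1, eta2=0.75, gamma1=0.5, gamma2=2.0, max_iter=500):
    """Simplified trust-region sketch: Cauchy step for the model
    m(s) = f(x) + g.s + 0.5||s||^2, with the usual radius update."""
    x, delta = np.asarray(x0, dtype=float), delta0
    fvals = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= eps:
            break
        t = min(1.0, delta / gnorm)              # Cauchy step length (H = I)
        s = -t * g
        pred = t * gnorm**2 - 0.5 * t**2 * gnorm**2  # predicted decrease, > 0
        ared = fvals[-1] - f(x + s)                  # actual decrease
        rho = ared / pred
        if rho >= eta1:                          # successful: accept the step
            x = x + s
            fvals.append(f(x))
            if rho >= eta2:                      # very successful: expand radius
                delta *= gamma2
        else:                                    # unsuccessful: shrink radius
            delta *= gamma1
    return x, fvals
```

Since a trial point is accepted only when the actual decrease is a fraction \(\eta_1\) of the predicted one, the recorded values `fvals` are nonincreasing, mirroring the monotonicity just established for the exact function.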
Lemma A.3
Suppose AS.1 and AS.2 hold, and that \(\overline{g}(x_k,\omega _{g,k})\ne 0.\) Then
Proof
(See [16, Theorem 8.4.3].) Since (2.5) implies that \({ \frac{1}{2}}(1-\eta _1)-\eta _0-\kappa _g \in (0,1)\), the first part of (A.4) gives that \(\Delta _k < \Vert \overline{g}(x_k,\omega _{g,k})\Vert\, /\, \kappa _{H\nabla }\). Hence the inequality \(1+\Vert H_k\Vert \le \kappa _{H\nabla }\) and (2.9) yield that
As a consequence, we may use (2.11), the Cauchy–Schwarz inequality, (A.2) (twice), (A.1), the inequality \(\kappa _{H\nabla }\ge 1\) and the first part of (A.4) to deduce that, for all \(k\ge 0\),
Thus \(\rho _k\ge \eta _2\) and (2.12) ensures the second part of (A.4). \(\square\)
Lemma A.4
Suppose that AS.1 and AS.2 hold. Then, before termination,
Proof
(See [16, Theorem 8.4.4].) Before termination, we must have that
Suppose that iteration k is the first iteration such that
Then the update (2.12) implies that
where we have used (A.6) to deduce the last inequality. But this bound and (A.4) imply that \(\Delta _{k+1}\ge \Delta _k\), which is impossible since \(\Delta _k\) is reduced at iteration k. Hence no k exists such that (A.7) holds and the desired conclusion follows. \(\square\)
Lemma A.5
For each \(k \ge 0,\) define
the index sets of “successful” and “unsuccessful” iterations, respectively. Then
Proof
Observe that (2.12) implies that
and that
Combining these two inequalities, we obtain from (A.5) that
Dividing by \(\Delta _0\), taking logarithms and recalling that \(\gamma _2\in (0,1)\), we get
Hence (A.9) follows since \(k = |\mathcal{S}_k|+|\mathcal{U}_k|\). \(\square\)
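The counting argument behind this lemma can be checked numerically (a sketch of ours, not from the paper: the expansion factor name `gamma3` and all values are assumptions). If the radius is multiplied by \(\gamma_3 \ge 1\) on successful iterations, by \(\gamma_2 \in (0,1)\) on unsuccessful ones, and never falls below a floor \(\Delta_{\min}\), then the number of unsuccessful iterations is bounded in terms of the successful ones:

```python
import math, random

random.seed(0)
gamma2, gamma3, delta0, delta_min = 0.5, 2.0, 1.0, 1e-4
delta, nS, nU = delta0, 0, 0
for _ in range(10_000):
    if random.random() < 0.6:            # a "successful" iteration: expand
        nS += 1
        delta *= gamma3
    elif delta * gamma2 >= delta_min:    # "unsuccessful": shrink above the floor
        nU += 1
        delta *= gamma2

# From delta = delta0 * gamma3**nS * gamma2**nU >= delta_min, taking logs:
bound = (nS * math.log(gamma3) + math.log(delta0 / delta_min)) / -math.log(gamma2)
assert nU <= bound
```

The assertion holds by construction here; in the analysis the floor on \(\Delta_k\) is supplied by (A.5), which is what turns the bound on \(|\mathcal{U}_k|\) into (A.9).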
Theorem A.1
Suppose that AS.1–AS.3 hold. Suppose also that \(\Delta _0\ge \theta \epsilon\), where \(\theta\) is defined in (A.5). Then the TR1DA algorithm produces an iterate \(x_k\) such that \(\Vert \nabla _x^1f(x_k)\Vert \le \epsilon\) in at most
successful iterations, at most
iterations in total, at most \(\tau _{\mathrm{tot}}\) (approximate) evaluations of the gradient satisfying (2.6), and at most \(2\tau _{\mathrm{tot}}\) (approximate) evaluations of the objective function satisfying (2.2).
Proof
As in the previous proof, (A.6) must hold before termination. Using AS.3, (A.8), (A.3), (2.9), (A.6), the assumption that \(\Delta _0\ge \theta \epsilon\) and (A.5), we obtain that, for an arbitrary \(k\ge 0\) before termination,
and therefore
As a consequence \(\Vert \overline{g}(x_k,\omega _{g,k})\Vert < \epsilon /(1+\kappa _g)\) after at most \(\tau _S\epsilon ^{-2}\) successful iterations and the algorithm terminates. The relation (2.13) then ensures that \(\Vert \nabla _x^1f(x_k)\Vert < \epsilon\), yielding (A.10). We may now use (A.9) and the mechanism of the algorithm to complete the proof. \(\square\)
Given that \(\epsilon \in (0,1]\), we immediately note that
Moreover, the proof of Theorem A.1 implies that these complexity bounds improve from \(\mathcal{O}(\epsilon ^{-2})\) to \(\mathcal{O}(\epsilon ^{-1})\) if \(\epsilon\) is so large, or \(\Delta _0\) so small, that \(\Delta _0 < \theta \epsilon\).
We conclude this brief complexity theory by noting that the domain in which AS.1 is assumed to hold can be restricted to the “tree of iterates” \(\cup _{k\ge 0}[x_k,x_k+s_k]\) without altering our results. This can be useful if an upper bound \({\bar{\Delta }}\) is imposed on the step length, in which case the monotonicity of the algorithm ensures that the tree of iterates remains in the set
While it can be difficult to verify AS.1 on the (a priori unpredictable) tree of iterates, verifying it on the above set is much easier.
Gratton, S., Toint, P.L. A note on solving nonlinear optimization problems in variable precision. Comput Optim Appl 76, 917–933 (2020). https://doi.org/10.1007/s10589-020-00190-2