Abstract
In this paper, a convex optimization-based method is proposed for numerically solving dynamic programs in continuous state and action spaces. The key idea is to approximate the output of the Bellman operator at a particular state by the optimal value of a convex program. The approximate Bellman operator has a computational advantage because it involves a convex optimization problem in the case of control-affine systems and convex costs. Using this feature, we propose a simple dynamic programming algorithm to evaluate the approximate value function at pre-specified grid points by solving convex optimization problems in each iteration. We show that the proposed method approximates the optimal value function with a uniform convergence property in the case of convex optimal value functions. We also propose an interpolation-free design method for a control policy, whose performance converges uniformly to the optimum as the grid resolution becomes finer. When a nonlinear control-affine system is considered, the convex optimization approach provides an approximate policy with a provable suboptimality bound. For general cases, the proposed convex formulation of dynamic programming operators can be reformulated as a nonconvex bilevel program, in which the inner problem is a linear program, without losing the uniform convergence properties.
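The grid-based value iteration summarized above can be sketched in a few lines. The following is an illustrative reconstruction, not the paper's implementation: it assumes a one-dimensional state and action, linear (hence control-affine) dynamics, a convex stage cost, and piecewise-linear interpolation of the value function on the grid. Under these assumptions the inner minimization over the action is convex, so a simple ternary search stands in for a generic convex solver; all function names are ours.

```python
# Illustrative sketch (not the paper's code): approximate value iteration on a
# 1-D state grid. Each Bellman update solves a one-dimensional convex
# minimization over the action; ternary search is valid because the inner
# objective is convex in u for control-affine dynamics with convex costs and
# a convex (piecewise-linear interpolated) value function.

def ternary_min(f, lo, hi, tol=1e-6):
    """Minimize a convex scalar function f on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def interp(grid, vals, x):
    """Piecewise-linear interpolation of vals on a sorted 1-D grid (clamped)."""
    if x <= grid[0]:
        return vals[0]
    if x >= grid[-1]:
        return vals[-1]
    for i in range(len(grid) - 1):
        if grid[i] <= x <= grid[i + 1]:
            t = (x - grid[i]) / (grid[i + 1] - grid[i])
            return (1 - t) * vals[i] + t * vals[i + 1]

def value_iteration(grid, dynamics, cost, gamma=0.9, u_bounds=(-1.0, 1.0), iters=50):
    """Evaluate the approximate value function at the grid points."""
    v = [0.0] * len(grid)
    for _ in range(iters):
        v_new = []
        for x in grid:
            # Inner problem: convex in u (convex cost plus a convex value
            # function composed with dynamics that are affine in u).
            obj = lambda u, x=x: cost(x, u) + gamma * interp(grid, v, dynamics(x, u))
            u_star = ternary_min(obj, *u_bounds)
            v_new.append(obj(u_star))
        v = v_new
    return v
```

For example, with dynamics `x_next = 0.5 * x + u` and cost `x**2 + u**2`, the computed values are smallest at the origin and grow toward the boundary of the grid, as expected for this regulation problem.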
Notes
However, our method is suitable for problems with high-dimensional action spaces.
More precisely, the set \({\mathcal {U}}({\varvec{x}})\) needs to be represented by convex inequalities, i.e., there exist functions \(a_k: {\mathcal {X}} \times {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) and \(b_k: {\mathcal {X}} \rightarrow {\mathbb {R}}\) such that
$$\begin{aligned} {\mathcal {U}}({\varvec{x}}) := \{ {\varvec{u}} \in {\mathbb {R}}^m : a_k ({\varvec{x}}, {\varvec{u}}) \le b_k({\varvec{x}}), k=1, \ldots , N_{ineq}\}, \end{aligned}$$where \({\varvec{u}} \mapsto a_k ({\varvec{x}}, {\varvec{u}})\) is a convex function for each fixed \({\varvec{x}} \in {\mathcal {X}}\) and each k.
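As a concrete instance of this representation (an illustrative assumption, not an example from the paper), a state-dependent interval constraint already has the required form, with each \(a_k\) linear, hence convex, in \({\varvec{u}}\):

```python
# Illustrative example of the convex-inequality representation of U(x):
# a state-dependent interval U(x) = { u : -u <= b1(x), u <= b2(x) },
# i.e., a_1(x, u) = -u and a_2(x, u) = u, both convex (linear) in u.
# The bound functions b1, b2 are hypothetical choices for illustration.

def in_U(x, u):
    a = [lambda x, u: -u,          # a_1: convex (linear) in u
         lambda x, u: u]           # a_2: convex (linear) in u
    b = [lambda x: 1.0 + abs(x),   # b_1(x): lower-bound magnitude
         lambda x: 1.0 + abs(x)]   # b_2(x): upper-bound magnitude
    return all(ak(x, u) <= bk(x) for ak, bk in zip(a, b))
```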
Note that the convexity of v is not used in the second part of the proof of Proposition 3.1; thus, that part remains valid in the nonconvex case.
The matrix B used in our experiments can be downloaded from the following link: http://coregroup.snu.ac.kr/DB/B1000.mat.
The CPU time increases superlinearly with the number of grid points because the size of the optimization problem (5) also grows with the grid size. Note that the problem size is invariant when using the bilevel method in Sect. 4.2; thus, in that case the CPU time scales linearly, as shown in Table 3.
The observed second-order empirical convergence rate is consistent with our theoretical result, since Theorem 3.1 only guarantees that the suboptimality gap decreases at a first-order rate; the actual convergence rate can therefore be higher than the guaranteed rate.
To compute the optimal value function, we used the method in Sect. 4.2, discretizing the action space with 1001 equally spaced grid points.
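The action-space discretization used here can be sketched as a brute-force minimization over a uniform action grid. This is an illustrative stand-in (with 11 points instead of 1001, for brevity); `q` plays the role of the one-stage cost plus the discounted value of the successor state, and all names are ours.

```python
# Sketch of a Bellman update with a uniformly discretized action space.
# q(u) is a stand-in for cost(x, u) + gamma * v(f(x, u)); the minimum is
# taken over n equally spaced grid points in [u_min, u_max].

def bellman_min_discretized(q, u_min=-1.0, u_max=1.0, n=11):
    grid = [u_min + i * (u_max - u_min) / (n - 1) for i in range(n)]
    return min(q(u) for u in grid)
```

With a step of 0.2 between action grid points, a quadratic `q` whose minimizer lies on the grid is minimized exactly; otherwise, the discretization error is bounded by the grid resolution.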
Acknowledgements
This work was supported in part by the Creative-Pioneering Researchers Program through SNU, the National Research Foundation of Korea funded by the MSIT (2020R1C1C1009766), and Samsung Electronics.
Communicated by Lars Grüne.
Appendix: State Space Discretization Using a Rectilinear Grid
In this appendix, we provide a concrete way to discretize the state space using a rectilinear grid. The construction below satisfies all the conditions in Sect. 2.3.
1. Choose a convex compact set \({\mathcal {Z}}_0 := [\underline{{\varvec{x}}}_{0, 1}, \overline{{\varvec{x}}}_{0, 1}] \times [\underline{{\varvec{x}}}_{0, 2}, \overline{{\varvec{x}}}_{0, 2}] \times \cdots \times [\underline{{\varvec{x}}}_{0, n}, \overline{{\varvec{x}}}_{0, n}]\), and discretize it using an n-dimensional rectilinear grid. Set \(t \leftarrow 0\).
2. Compute (or over-approximate) the forward reachable set (Footnote 8)
$$\begin{aligned} R_{t} := \big \{ f({\varvec{x}}, {\varvec{u}}, {\varvec{\xi }}) : {\varvec{x}} \in {\mathcal {Z}}_{t}, {\varvec{u}} \in {\mathcal {U}}({\varvec{x}}), {\varvec{\xi }} \in \varXi \big \}. \end{aligned}$$
3. Choose a convex compact set \({\mathcal {Z}}_{t+1} := [\underline{{\varvec{x}}}_{t+1, 1}, \overline{{\varvec{x}}}_{t+1, 1}] \times [\underline{{\varvec{x}}}_{t+1, 2}, \overline{{\varvec{x}}}_{t+1, 2}] \times \cdots \times [\underline{{\varvec{x}}}_{t+1, n}, \overline{{\varvec{x}}}_{t+1, n}]\) such that \(R_t \subseteq {\mathcal {Z}}_{t+1}\).
4. Expand the rectilinear grid to fit \({\mathcal {Z}}_{t+1}\).
5. Stop if \(t+1 = K\); otherwise, set \(t \leftarrow t+1\) and go to Step 2.
We can then choose \({\mathcal {C}}_i\) as each grid cell. We label \({\mathcal {C}}_i\) so that \(\bigcup _{i=1}^{N_{{\mathcal {C}}, t}} {\mathcal {C}}_i = {\mathcal {Z}}_t\) for all t. A two-dimensional example is shown in Fig. 1. This construction approach was used in Sects. 5.1 and 5.3.
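The expansion loop above can be illustrated for a one-dimensional system. The dynamics \(x_+ = a x + u + \xi\) with \(|u| \le u_{\max }\), \(|\xi | \le \xi _{\max }\) are an assumed example, not taken from the paper; for an interval \({\mathcal {Z}}_t\) under such monotone linear dynamics, the reachable set is computed exactly and \({\mathcal {Z}}_{t+1}\) is chosen as the smallest enclosing interval.

```python
# Illustrative 1-D sketch of the appendix procedure: expand Z_t so that it
# contains the forward reachable set R_t of x_+ = a*x + u + xi with
# |u| <= u_max and |xi| <= xi_max (assumed dynamics), over K steps.

def expand_grid(z0, a=0.5, u_max=1.0, xi_max=0.1, K=3):
    lo, hi = z0
    intervals = [(lo, hi)]
    for _ in range(K):
        # Exact reachable set of an interval under linear scalar dynamics:
        # image of [lo, hi] under x -> a*x, inflated by the input and
        # disturbance bounds.
        r_lo = min(a * lo, a * hi) - u_max - xi_max
        r_hi = max(a * lo, a * hi) + u_max + xi_max
        lo, hi = r_lo, r_hi  # choose Z_{t+1} = R_t (smallest enclosing box)
        intervals.append((lo, hi))
    return intervals
```

With a stable system (\(|a| < 1\)), the intervals converge to a bounded set; the rectilinear grid only needs to be extended to cover each new \({\mathcal {Z}}_{t+1}\).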
Cite this article
Yang, I. A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces. J Optim Theory Appl 187, 133–157 (2020). https://doi.org/10.1007/s10957-020-01747-1