1 Introduction

Discounting future benefits and costs is crucial for determining fair prices of investments and projects, in particular when long time horizons come into play, as for example in the task of pricing carbon emissions. The idea behind pricing carbon emissions is to add up the cost to society of the economic damage caused by these emissions. To date, the value of one ton of CO\(_2\) emissions varies considerably between countries, ranging for instance from \(\$ 119.43\) in Sweden to \(\$ 2.69\) in Japan (see [11]). This is of course only partly due to the use of different discount functions, but it nevertheless emphasises the role of discounting.

The traditional discounting with a constant discount factor can be traced back to [29]. It is still the most common way to discount benefits and costs, in particular because it simplifies computations. However, Koopmans [25] gave an axiomatic characterisation of a class of recursive utilities, which also includes the classical way of discounting. He introduced an aggregator W that combines the current utility \(u_t\) with the future utility \(v_{t+1}\) via \(v_t=W(u_t,v_{t+1})\). When we choose \(W(u,v)=u+\beta v,\) we recover the classical discounting with discount factor \(\beta \).

In this paper, we study a Markov decision process with a Borel state space, unbounded stage utility and a non-linear discount function \(\delta \) with certain properties specified below. We use an aggregation of the form

$$\begin{aligned} v_t (h_t) = u_t (\pi _t(h_t)) + \int \delta \big (v_{t+1}(h_t, \pi _t(h_t), x_{t+1})\big ) q(dx_{t+1}|x_t,\pi _t(h_t)) \end{aligned}$$

where q is the transition kernel of the Markov decision process, \(\pi _t\) is the decision function at time t and \(h_t\) is the history of the process up to time t. When \(\delta (x)=\beta x,\) we are back in the classical setting. In this case, it is well known how to solve a Markov decision process with an infinite time horizon, see for example [3, 8, 15, 16, 30, 31]. In the unbounded utility case, the established method is to use a weighted supremum norm and to combine it with Banach’s contraction theorem (see e.g. [3, 4, 9, 14, 17]). In our setting the Banach contraction principle cannot be applied. Indeed, our paper is in the spirit of [20,21,22], where non-linear discounting was also used and an extension of the Banach theorem due to Matkowski [26] was applied. Whereas the papers [21, 22] consider a purely deterministic decision process, the work [20] treats a stochastic problem. However, in [20] the expectation is the final operator applied at the end of the aggregation, whereas in the present paper expectation and discounting operators alternate. As will be explained in Sect. 3, this has the advantage that we obtain optimal stationary policies in our setting.

The main result of our paper is a solution procedure for this new kind of discounting problem with stochastic transitions. In particular, we provide an optimality equation, show that it has a solution and prove that there exists an optimal stationary policy for the problem with infinite time horizon. Note that we allow the utility function to be unbounded.

The outline of our paper is as follows. We introduce our model data together with the assumptions in Sect. 2. In Sect. 3, we present our optimisation problem. In particular, we explain how the utility is aggregated in our model and how precisely it differs from the model and results in [20]. In Sect. 4, we summarise some auxiliary results like a measurable selection theorem and a generalised fixed point theorem, which is used later. Next, in Sect. 5.1 we treat the model where the positive and negative parts of the one-stage utility are bounded by a weight function \(\omega .\) We show in this case that the value function \(v^*\) is the unique fixed point of the corresponding maximal reward operator and that every maximiser in the Bellman equation for \(v^*\) defines an optimal stationary policy. In Sect. 5.2, we then consider the setting where the utility function is unbounded from below, but still bounded from above by the weight function \(\omega .\) Here, we can only show that the value function \({\underline{v}}^*\) is a fixed point of the corresponding maximal reward operator; examples show that the fixed point is not necessarily unique. Nevertheless, as in Sect. 5.1, any maximiser in the Bellman equation for \({\underline{v}}^*\) again defines an optimal stationary policy. The proof employs an approximation of \({\underline{v}}^*\) by a monotone sequence of value functions, which are bounded in absolute value by a weight function \(\omega \) as in Sect. 5.1. In Sect. 6, we briefly discuss two numerical algorithms for the solution of problems in Sect. 5.1, namely policy iteration and policy improvement. The last section presents some applications. We discuss two different optimal growth models, an inventory problem and a stopping problem.

2 The Dynamic Programming Model

Let \({\mathbb {N}}\) (\({\mathbb {R}}\)) denote the set of all positive integers (all real numbers) and \(\underline{{\mathbb {R}}}={\mathbb {R}}\cup \{-\infty \},\) \({\mathbb {R}}_+ =[0,\infty ).\) A Borel space Y is a non-empty Borel subset of a complete separable metric space. By \({{\mathcal {B}}}(Y)\) we denote the \(\sigma \)-algebra of all Borel subsets of Y and we write \({{\mathcal {M}}}(Y)\) to denote the set of all Borel measurable functions \(g:Y\rightarrow \underline{{\mathbb {R}}}.\)

A discrete-time Markov decision process is specified by the following objects:

(i):

X is the state space and is assumed to be a Borel space.

(ii):

A is the action space and is assumed to be a Borel space.

(iii):

D is a non-empty Borel subset of \(X\times A.\) We assume that for each \(x\in X,\) the non-empty x-section

$$\begin{aligned} A(x):=\{a\in A: (x,a)\in D\} \end{aligned}$$

of D represents the set of actions available in state x.

(iv):

q is a transition probability from D to X. For each \(B \in {{\mathcal {B}}}(X)\), q(B|x,a) is the probability that the new state is in the set B, given that the current state is \(x\in X\) and an action \(a\in A(x)\) has been chosen.

(v):

\(u\in {{\mathcal {M}}}(D)\) is a one-period utility function.

(vi):

\(\delta : \underline{{\mathbb {R}}}\rightarrow \underline{{\mathbb {R}}}\) is a discount function.

Let \(D^{n}:= D\times \cdots \times D\) (n times) for \(n\in {\mathbb {N}}.\) Let \(H_1=X\) and \(H_{n+1}\) be the space of admissible histories up to the n-th transition, i.e., \(H_{n+1}:= D^n\times X\) for \(n\in {\mathbb {N}}.\) An element of \(H_n\) is called a partial history of the process. We put \(H=D\times D\times \cdots \) and assume that \(H_n\) and H are equipped with the product \(\sigma \)-algebras.

In this paper, we restrict ourselves to deterministic policies, since randomisation does not give any advantage from the point of view of utility maximisation. A policy \(\pi \) is a sequence \((\pi _n)\) of decision rules, where, for every \(n\in {\mathbb {N}},\) \(\pi _n\) is a Borel measurable mapping that associates with any admissible history \(h_n\in H_n\) an action \(a_n\in A(x_n).\) We write \(\varPi \) to denote the set of all policies. Let F be the set of all Borel measurable mappings \(f:X\rightarrow A\) such that \(f(x)\in A(x)\) for every \(x\in X.\) When A(x) is compact for each \(x\in X\), then from the Arsenin-Kunugui result (see Theorem 18.18 in [24]), it follows that \(F\not =\emptyset .\) A policy \(\pi \) is called stationary if \(\pi _n=f\) for all \(n\in {\mathbb {N}}\) and some \(f\in F.\) Therefore, a stationary policy \(\pi =(f,f,\ldots )\) will be identified with \(f\in F,\) and the set of all stationary policies will also be denoted by F.

2.1 Assumptions with Comments

Let \(\omega :X\rightarrow [1, \infty )\) be a fixed Borel measurable function.

Assumptions (A)

(A2.1):

there exists \(b>0\) such that

$$\begin{aligned} u(x,a)\ge -b\omega (x)\quad \text{ for } \text{ all } (x,a)\in D, \end{aligned}$$
(A2.2):

there exists \(c>0\) such that

$$\begin{aligned} u(x,a)\le c\omega (x)\quad \text{ for } \text{ all } (x,a)\in D. \end{aligned}$$

Our next assumptions are on the discount function \(\delta .\)

Assumptions (B)

(B2.1):

there exists an increasing function \(\gamma :{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) such that \(\gamma (z)<z\) for each \(z>0\) and

$$\begin{aligned} |\delta (z_1)-\delta (z_2)|\le \gamma (|z_1-z_2|) \end{aligned}$$

for all \(z_1, z_2 \in {\mathbb {R}}.\)

(B2.2):

\(\delta \) is increasing, \(\delta (0)=0\) and \(\delta (-\infty )=-\infty \).

(B2.3):

(i) \(\gamma \) is subadditive, i.e., \(\gamma (y+z)\le \gamma (y)+\gamma (z)\) for all \(y,\ z\ge 0,\) and (ii) \(\gamma (\omega (x)y)\le \omega (x)\gamma (y)\) for all \(x\in X\) and \(y> 0.\)

(B2.4):

there exists a constant \(\alpha >0\) such that

$$\begin{aligned} \int _X \omega (y)q(dy|x,a)\le \alpha \omega (x) \text{ for } \text{ all } (x,a)\in D \end{aligned}$$

   and either

(i):

\(\alpha \le 1\) or

(ii):

\(\alpha >1\) and \(\alpha \gamma (x)<x\) for all \(x\in (0,+\infty ).\)

Remark 2.1

In some empirical studies it was observed that negative and positive utilities were discounted by different discount factors (“sign effect”). Therefore a simple non-linear discount function is \(\delta (y) = \delta _1y\) for \(y\le 0\) and \(\delta (y) = \delta _2y\) for \(y>0,\) where \(\delta _1, \ \delta _2\in (0,1)\) and \(\delta _1 \ \not = \delta _2.\) For a discussion and interpretation of this and other types of discount functions the reader is referred to Jaśkiewicz, Matkowski and Nowak [21]. Additional examples of discount functions are also given in Sect. 7.
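As a quick illustration (ours, not part of the original model), the sign-effect function satisfies (B2.1) with the linear choice \(\gamma (z)=\max \{\delta _1,\delta _2\}z,\) which is increasing and satisfies \(\gamma (z)<z\) for \(z>0.\) A minimal Python sketch checking this numerically:

```python
import random

d1, d2 = 0.4, 0.9   # hypothetical discount factors for losses and gains

def delta(y):
    # Sign-effect discounting: losses (y <= 0) and gains (y > 0)
    # are discounted by different factors.
    return d1 * y if y <= 0 else d2 * y

def gamma(z):
    # Candidate modulus for (B2.1): increasing and gamma(z) < z for z > 0.
    return max(d1, d2) * z

# Spot-check |delta(z1) - delta(z2)| <= gamma(|z1 - z2|) on random pairs.
for _ in range(10_000):
    z1, z2 = random.uniform(-50, 50), random.uniform(-50, 50)
    assert abs(delta(z1) - delta(z2)) <= gamma(abs(z1 - z2)) + 1e-12
print("(B2.1) holds on all sampled pairs")
```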

Remark 2.2

Obviously, assumption (B2.1) implies that \(\gamma (0)=0 \) and \(\gamma \) is continuous at zero. Moreover, it implies that \(\delta \) is continuous and (with (B2.2)) that \(|\delta (z)|\le \gamma (|z|)\) for all \(z\in {\mathbb {R}}\). From (B2.3), it follows that

$$\begin{aligned} |\gamma (y)-\gamma (z)|\le \gamma (|y-z|)\quad \text{ for } \text{ all }\quad y,\ z\ge 0. \end{aligned}$$
(2.1)

This fact and the continuity of \(\gamma \) at zero imply that \(\gamma \) is continuous at every point of \({\mathbb {R}}_+.\)

Remark 2.3

Assumption (B2.3) holds if the function \(z\mapsto \gamma (z)/z\) is non-increasing on \((0,\infty ).\) Indeed, under this condition we have \(\frac{\gamma (y+z)}{y+z} \le \frac{\gamma (y )}{y}\) and \(\frac{\gamma (y+z)}{y+z}\le \frac{\gamma ( z)}{ z }.\) Writing \(\gamma (y+z)= \frac{y}{y+z}\gamma (y+z)+\frac{z}{y+z}\gamma (y+z),\) we obtain \(\gamma (y+z)\le \gamma (y)+\gamma (z).\) Moreover, for any \(d\ge 1 \) and \(z>0,\) \(\gamma (dz)/dz \le \gamma (z)/z,\) and thus \(\gamma (dz)\le d\gamma (z).\) Taking \(d=\omega (x)\) (recall that \(\omega (x)\ge 1\)) shows that (ii) in (B2.3) holds.

Remark 2.4

There are subadditive functions \(\gamma \) for which \(z\mapsto \gamma (z)/z\) is not non-increasing. An example of such a subadditive function is \(\gamma (x)=(1-\varepsilon )x+\varepsilon |\sin x|\) for some \(\varepsilon \in (0,1).\)
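A short numerical sketch of this remark (the value of \(\varepsilon \) and the grids are our arbitrary choices): the function is subadditive, yet the ratio \(\gamma (z)/z\) increases again after every zero of the sine.

```python
import numpy as np

eps = 0.5
gamma = lambda x: (1 - eps) * x + eps * np.abs(np.sin(x))

# Subadditivity gamma(a + b) <= gamma(a) + gamma(b), checked on a grid.
g = np.linspace(0.01, 20.0, 300)
A, B = np.meshgrid(g, g)
assert np.all(gamma(A + B) <= gamma(A) + gamma(B) + 1e-12)

# gamma(z)/z is not non-increasing: the ratio rises after each zero of sin.
x = np.linspace(0.01, 20.0, 4000)
assert np.any(np.diff(gamma(x) / x) > 0)
print("subadditive on the grid, but gamma(z)/z is not monotone")
```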

The following two standard sets of assumptions will be used alternatively.

Assumptions (W)

(W2.1):

A(x) is compact for every \(x\in X\) and the set-valued mapping \(x\mapsto A(x)\) is upper semicontinuous, i.e., \(\{x\in X:\; A(x)\cap K \not = \emptyset \}\) is closed for each closed set \(K\subset A,\)

(W2.2):

the function u is upper semicontinuous on D

(W2.3):

the transition probability q is weakly continuous, i.e.,

$$\begin{aligned} (x,a) \mapsto \int _X \phi (y)q(dy|x,a) \end{aligned}$$

is continuous on D for each bounded continuous function \(\phi \),

(W2.4):

the function \(\omega \) is continuous on X.

(W2.5):

the function

$$\begin{aligned} (x,a) \mapsto \int _X \omega (y)q(dy|x,a) \end{aligned}$$

is continuous on D.

Assumptions (S)

(S2.1):

A(x) is compact for every \(x\in X\),

(S2.2):

the function \(u(x,\cdot )\) is upper semicontinuous on A(x) for every \(x\in X,\)

(S2.3):

for each \(x \in X\) and every Borel set \({\widetilde{X}} \subset X,\) the function \(q({\widetilde{X}}|x,\cdot )\) is continuous on A(x), 

(S2.4):

the function \(a\mapsto \int _X \omega (y)q(dy|x,a)\) is continuous on A(x) for every \(x\in X.\)

The above conditions were used in stochastic dynamic programming by many authors, see, e.g., Schäl [30], Bäuerle and Rieder [3], Bertsekas and Shreve [7] or Hernández-Lerma and Lasserre [16, 17]. Using the so-called “weight” or “bounding” function \(\omega \) one can study dynamic programming models with an unbounded one-stage utility u. This method was introduced by Wessels [34], but as noted by van der Wal [32], in dynamic programming with a linear discount function \(\delta (z)=\beta z\) one can introduce an extra state \(x_e\not \in X\) and re-define the transition probability and the utility function to obtain an equivalent “bounded model”. More precisely, we consider a new state space \(X\cup \{x_e\},\) where \(x_e\) is an absorbing isolated state. Let \(A(x_e)=\{a_e\}\) with an extra action \(a_e\not \in A.\) For \(x\in X\) the action sets remain A(x). The transition probabilities Q and one-stage utilities R in the new model are as follows

$$\begin{aligned} Q(B|x,a):= & {} \frac{1}{\alpha \omega (x)}\int _B\omega (y)q(dy|x,a),\quad \text{ for } B\in {{\mathcal {B}}}(X), \\ Q(x_e|x,a):= & {} 1-\frac{\int _X\omega (y)q(dy|x,a)}{\alpha \omega (x)}, \\ R(x,a):= & {} \frac{u(x,a)}{\omega (x)},\quad \text{ for } (x,a)\in D,\quad R(x_e,a_e):=0. \end{aligned}$$

Here, \(\alpha \) is the constant from assumption (B2.4). This transformed Markov decision process is equivalent to the original one in the sense that every policy gives the same total expected discounted payoff up to the factor \(\omega (x)\), where \(x\in X\) denotes the initial state. We would like to emphasise that in the case of a non-linear discount function such a transformation to a bounded model is not possible. We need to do some extra work.
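For a finite state and action space the transformation can be written down explicitly. The following Python sketch (ours; all model data are hypothetical) constructs Q and R from q, u, \(\omega \) and \(\alpha \) and checks that Q is a transition kernel:

```python
import numpy as np

# Hypothetical finite model: states 0..S-1, actions 0..A-1.
S, A = 3, 2
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(S), size=(S, A))   # q[x, a, y] = q({y}|x, a)
u = rng.normal(size=(S, A))                  # one-stage utility
w = 1.0 + rng.random(S)                      # weight function, w >= 1
alpha = max((q @ w)[x, a] / w[x] for x in range(S) for a in range(A))

# Transformed model with an extra absorbing state x_e (index S).
Q = np.zeros((S + 1, A, S + 1))
Q[:S, :, :S] = q * w / (alpha * w[:, None, None])   # Q({y}|x, a)
Q[:S, :, S] = 1.0 - Q[:S, :, :S].sum(axis=2)        # remaining mass -> x_e
Q[S, :, S] = 1.0                                    # x_e is absorbing
R = np.zeros((S + 1, A))                            # R(x_e, a_e) = 0
R[:S] = u / w[:, None]                              # bounded utility u/w
assert np.allclose(Q.sum(axis=2), 1.0)              # Q is a transition kernel
```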

Remark 2.5

If u is bounded from above by some constant, then we can skip assumptions (A), and in assumptions (B) it is enough to require (B2.1), (B2.2) and (B2.3)(i). In this case, it suffices to put \(\alpha =1,\) \(\omega (x)=1\) for all \(x\in X,\) and it is easily seen that (B2.3)(ii), (B2.4), (S2.4), (W2.4) and (W2.5) hold. If, on the other hand, u is unbounded in the sense that there exists a function \(\omega \) meeting condition (A2.2), then the function \(\omega \) with \(\omega (x)\ge 1\) for all \(x\in X\) must be unbounded as well. From Remark 2.3, it follows that (B2.3) holds when the function \(z\mapsto \gamma (z)/z\) is non-increasing. We would like to emphasise that condition (ii) in (B2.3) is crucial in our proofs in the case when (A2.2) holds with an unbounded function \(\omega .\) Dynamic programming problems in which only (A2.2) is assumed can be solved by a “truncation method”, i.e., by an approximation with solutions of models that satisfy conditions (A).

3 Discounted Utility Evaluations: Two Alternative Approaches

Let \(r_1(x,a)=u(x,a)\) for \((x,a)\in D \) and, for any \(n\in {\mathbb {N}} \) and \((h_{n+1},a_{n+1})= (x_1,a_1,\ldots ,x_{n+1},a_{n+1})\in D^{n+1},\)

$$\begin{aligned}&r_{n+1}(h_{n+1},a_{n+1})\\&\quad = u(x_1,a_1)+\delta \big (u(x_2,a_2)+\delta \big (u(x_3,a_3)+\cdots +\delta \big (u(x_{n+1},a_{n+1})\big )\cdots \big )\big )\\&\quad =u(x_1,a_1)+\delta \big (r_{n}(x_2,a_2,\ldots ,x_{n+1},a_{n+1})\big ). \end{aligned}$$

Below in this section, we assume that all expectations (integrals) and limits exist. In the sequel, we shall study cases where this assumption is satisfied.

Let \(\pi \in \varPi \) and \(x=x_1\in X\) be an initial state. By \({\mathbb {E}}_x^\pi \) we denote the expectation operator with respect to the unique probability measure \({\mathbb {P}}^\pi _x\) on H induced by the policy \(\pi \in \varPi \) and the transition probability q according to the Ionescu–Tulcea theorem, see Proposition 7.28 in [7].

Definition 3.1

For any \(\pi =(\pi _k) \in \varPi \) and any initial state \(x=x_1\) the n-stage expected discounted utility is

$$\begin{aligned} R_n(x,\pi ) = {\mathbb {E}}_x^\pi \big [r_n(x_1,a_1,\ldots ,x_n,a_n)\big ] \end{aligned}$$

and the expected discounted utility over an infinite horizon is

$$\begin{aligned} R(x,\pi ):= \lim \limits _{n\rightarrow \infty } R_n(x,\pi ). \end{aligned}$$
(3.1)

A policy \(\pi ^*\in \varPi \) is optimal in the dynamic programming model under utility evaluation (3.1), if

$$\begin{aligned} R(x,\pi ^*) \ge R(x,\pi )\quad \text{ for } \text{ all }\quad \pi \in \varPi ,\ x\in X. \end{aligned}$$

Remark 3.2

Utility functions as in (3.1) have been considered by Jaśkiewicz et al. [20]. Optimal policies have been shown to exist for the model with u bounded from above satisfying assumptions (A) and either (W) or (S). However, optimal policies obtained in [20] are history-dependent and are characterised by an infinite system of Bellman equations as in the non-stationary model of Hinderer [18].

To obtain stationary optimal policies, we shall define a recursive discounted utility using ideas similar to those developed in papers on dynamic programming by Denardo [12] and Bertsekas [6] and in papers on economic dynamic optimisation [1, 2, 4, 9, 14, 27, 28, 33]. The seminal article for these studies was the work by Koopmans [25] on stationary recursive utility, generalising the standard discounted utility of Samuelson [29]. To define the recursive discounted utility we must introduce some operator notation.

Let \(\pi =(\pi _1,\pi _2,\ldots )\in \varPi \) and \(v\in {{\mathcal {M}}}(H_{k+1})\). We set

$$\begin{aligned} Q_{\pi _k}^\delta v(h_k)=\int _X\delta (v(h_{k},\pi _k(h_k),x_{k+1}))q(dx_{k+1}|x_k, \pi _k(h_k)) \end{aligned}$$

and

$$\begin{aligned} T_{\pi _k}v(h_k)= & {} u(x_k,\pi _k(h_k)) +Q_{\pi _k}^\delta v(h_k)\\= & {} u(x_k,\pi _k(h_k)) +\int _X\delta (v(h_{k},\pi _k(h_k),x_{k+1}))q(dx_{k+1}|x_k, \pi _k(h_k)). \end{aligned}$$

These operators are well-defined for example when u and v are bounded from above.

Similarly, we define \(Q_{\pi _k}^\gamma \) with \(\delta \) replaced by \(\gamma \). Observe that by (B2.1) and (B2.2) (cf. Remark 2.2)

$$\begin{aligned} |Q_{\pi _k}^\delta v(h_k)|\le Q_{\pi _k}^\gamma |v|(h_k) \end{aligned}$$

provided that \(Q_{\pi _k}^\delta v(h_k)>-\infty .\) This fact will be used frequently.

The interpretation of \(T_{\pi _k}v(h_k)\) is as follows. If \( x_{k+1} \mapsto v(h_{k},\pi _k(h_k),x_{k+1})\) is a “continuation value” of utility, then \(T_{\pi _k}v(h_k)\) is the expected discounted utility given the pair \((x_k,\pi _k(h_k)).\)

The composition \(T_{\pi _1}\circ T_{\pi _2}\circ \cdots \circ T_{\pi _n} \) of the operators \(T_{\pi _1},T_{\pi _2},\ldots , T_{\pi _n}\) is for convenience denoted by \(T_{\pi _1}T_{\pi _2}\cdots T_{\pi _n}. \)

Let \(\mathbf{0}\) be a function that assigns zero to each argument \(y\in X.\)

Definition 3.3

For any \(\pi =(\pi _k) \in \varPi \) and any initial state \(x=x_1,\) the n-stage recursive discounted utility is defined as

$$\begin{aligned} U_n(x,\pi )= T_{\pi _1} \cdots T_{\pi _n}{} \mathbf{0}(x) \end{aligned}$$

and the recursive discounted utility over an infinite horizon is

$$\begin{aligned} U(x,\pi ):= \lim \limits _{n\rightarrow \infty } U_n(x,\pi ). \end{aligned}$$
(3.2)

A policy \(\pi ^*\in \varPi \) is optimal in the dynamic programming model under utility evaluation (3.2), if

$$\begin{aligned} U(x,\pi ^*) \ge U(x,\pi )\quad \text{ for } \text{ all }\quad \pi \in \varPi ,\ x\in X. \end{aligned}$$
(3.3)

For illustration, below we give the full formula for \(n=3.\) Namely,

$$\begin{aligned} U_3(x,\pi )= & {} T_{\pi _{ 1}}T_{\pi _{2}}T_{\pi _3}{} \mathbf{0}(x) \\= & {} u(x,\pi _1(x)) +\int _X\delta \Big (u(x_{2},\pi _{2}(h_2))+ \int _X\delta \big (u(x_{3},\pi _3(h_3))\big )q(dx_{3}|x_{2},\pi _{2}(h_2))\Big )\, q(dx_2|x,\pi _1(x)), \end{aligned}$$

where \(h_2=(x,\pi _1(x),x_2)\) and \(h_3=(h_2,\pi _{2}(h_2),x_3).\)

We would like to point out that in the special case of a linear discount function \(\delta (z)=\beta z\) with \(\beta \in (0,1),\) the two above-mentioned approaches coincide. In that case, we deal with the usual expected discounted utility, because

$$\begin{aligned} R_n(x,\pi )= U_n(x,\pi ) = {\mathbb {E}}_x^\pi \left[ \sum _{k=1}^n \beta ^{k-1} u(x_k,a_k)\right] . \end{aligned}$$
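To make the difference between evaluations (3.1) and (3.2) concrete, the following sketch (ours; a hypothetical two-state model with a single action per state, so the policy is trivial) computes \(U_n\) by the operator recursion and \(R_n\) by enumerating all paths. The two numbers coincide for a linear \(\delta \) and in general differ for a non-linear one.

```python
import numpy as np
from itertools import product

q = np.array([[0.7, 0.3], [0.2, 0.8]])   # q[x, y]; one action per state
u = np.array([1.0, -2.0])                # state-dependent utility
d1, d2 = 0.4, 0.9
delta = lambda z: np.where(z <= 0, d1 * z, d2 * z)   # non-linear discounting

def U_n(n, x):
    # Recursive discounted utility: expectation and discounting alternate.
    v = u.copy()
    for _ in range(n - 1):
        v = u + q @ delta(v)             # v = T v (stationary, single action)
    return v[x]

def R_n(n, x):
    # Expectation applied last: enumerate all paths x = x_1, x_2, ..., x_n.
    total = 0.0
    for path in product(range(len(u)), repeat=n - 1):
        states = (x,) + path
        prob = np.prod([q[states[k], states[k + 1]] for k in range(n - 1)])
        r = u[states[-1]]
        for s in reversed(states[:-1]):  # r_n built from the inside out
            r = u[s] + delta(r)
        total += prob * r
    return total

print(U_n(3, 0), R_n(3, 0))   # differ here; equal if delta is linear
```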

4 Auxiliary Results

Let Y be a Borel space. By \({{\mathcal {U}}}(Y)\) we denote the space of all upper semicontinuous functions on Y. We recall some results on measurable selections, a generalisation of the Banach fixed point theorem and present a property of a subadditive function \(\gamma .\)

Lemma 4.1

Assume that A(x) is compact for each \(x\in X.\)

(a):

Let \(g\in {{\mathcal {M}}}(D)\) be such that \(a\mapsto g(x,a)\) is upper semicontinuous on A(x) for each \(x\in X.\) Then,

$$\begin{aligned} g^*(x):= \max \limits _{a\in A(x)}g(x,a) \end{aligned}$$

is Borel measurable and there exists a Borel measurable mapping \(f^*:X\rightarrow A\) such that

$$\begin{aligned} f^*(x)\in \mathrm{Arg}\max _{a\in A(x)}g(x,a) \end{aligned}$$

for all \(x\in X.\)

(b):

If, in addition, we assume that \(x\mapsto A(x)\) is upper semicontinuous and \(g\in {{\mathcal {U}}}(D),\) then \(g^*\in {{\mathcal {U}}}(X).\)

Part (a) follows from Corollary 1 in [10]. Part (b) is a corollary to Berge’s maximum theorem, see [5, pp. 115–116] and Proposition 10.2 in [30].

Let \({{\mathcal {M}}}^a_b(X)\) be the space of all functions \( v\in {{\mathcal {M}}}(X) \) such that \(x\mapsto v(x)/\omega (x)\) is bounded from above on X. The symbol \({{\mathcal {M}}}^d_b(X)\) is used for the subspace of functions \(v \in {{\mathcal {M}}}^a_b(X) \) such that \(x\mapsto |v(x)|/\omega (x)\) is bounded on X. Let \({{\mathcal {M}}}^a_b(D)\) be the space of all functions \( w\in {{\mathcal {M}}}(D) \) such that \((x,a)\mapsto w(x,a)/\omega (x)\) is bounded from above on D. By \({{\mathcal {M}}}^d_b(D)\) we denote the space of all functions \( w\in {{\mathcal {M}}}_b^a(D) \) such that \((x,a)\mapsto |w(x,a)|/\omega (x)\) is bounded on D. We also define

$$\begin{aligned} {{\mathcal {U}}}^a_b(X):= & {} {{\mathcal {M}}}^a_b(X)\cap {{\mathcal {U}}}(X),\quad {{\mathcal {U}}}^d_b(X):={{\mathcal {M}}}^d_b(X)\cap {{\mathcal {U}}}(X),\quad \text{ and }\\ {{\mathcal {U}}}^a_b(D):= & {} {{\mathcal {M}}}^a_b(D)\cap {{\mathcal {U}}}(D),\quad {{\mathcal {U}}}^d_b(D):={{\mathcal {M}}}^d_b(D)\cap {{\mathcal {U}}}(D). \end{aligned}$$

Lemma 4.2

Let assumptions (B) be satisfied and

$$\begin{aligned} I(v)(x,a):= \int _X\delta (v(y))q(dy|x,a), \quad v\in {{\mathcal {M}}}^a_b(X),\ (x,a)\in D. \end{aligned}$$

Then \(I(v) \in {{\mathcal {M}}}^a_b(D)\). If \(v\in {{\mathcal {M}}}^d_b(X)\), then \(I(v)\in {{\mathcal {M}}}^d_b(D).\)

Proof

Let \(v\in {{\mathcal {M}}}^a_b(X)\) and \(v^+(y)=\max \{v(y),0\}.\) Then there exists \(c_0>0\) such that \(v(y)\le v^+(y) \le c_0\omega (y)\) for all \(y\in X.\) Obviously, we have \(I(v)(x,a)\le I(v^+)(x,a) \le I(c_0\omega )(x,a) \le \alpha c_0\omega (x)\) for all \((x,a)\in D.\) Hence \(I(v) \in {{\mathcal {M}}}^a_b(D).\) Now assume that \(v\in {{\mathcal {M}}}^d_b(X).\) Then there exists a constant \(c_1>0\) such that \(|v(y)|\le c_1\omega (y)\) for all \(y\in X \) and we obtain \(|I(v)(x,a)| \le \alpha c_1\omega (x)\) for all \((x,a)\in D.\) Thus \(I(v) \in {{\mathcal {M}}}^d_b(D).\) \(\square \)

Our results will be formulated using the standard dynamic programming operators. For any \(v\in {{\mathcal {M}}}^a_b(X),\) put

$$\begin{aligned} Sv(x,a):= u(x,a)+\int _X\delta (v(y))q(dy|x,a),\quad (x,a)\in D. \end{aligned}$$

Next define

$$\begin{aligned} Tv(x):=\sup _{a\in A(x)} Sv(x,a) =\sup _{a\in A(x)}\left[ u(x,a)+\int _X \delta (v(y))q(dy|x,a)\right] . \end{aligned}$$
(4.1)

By \(T^{(m)}\) we denote the composition of T with itself m times.

If \(f\in F\) and \(v\in {{\mathcal {M}}}^a_b(X), \) then we put

$$\begin{aligned} T_fv(x):=Sv(x,f(x))=u(x,f(x))+\int _X \delta (v(y))q(dy|x,f(x)). \end{aligned}$$
(4.2)

Clearly, \(T_fv\in {{\mathcal {M}}}^a_b(X).\)

The next result follows from Lemmas 4.1 and 4.2 and from Lemmas 8.3.7 and 8.5.5 in [17].

Lemma 4.3

Assume that assumptions (A) and (B) hold.

(a):

If conditions (W) are also satisfied and \(v\in {{\mathcal {U}}}^d_b(X), \) then \(Sv \in {{\mathcal {U}}}^d_b(D)\) and \(Tv \in {{\mathcal {U}}}^d_b(X). \)

(b):

If (S2.2)–(S2.4) hold and \(v\in {{\mathcal {M}}}^d_b(X), \) then \(Sv \in {{\mathcal {M}}}^d_b(D) \) and, for each \(x\in X,\) the function \(a\mapsto Sv(x,a)\) is upper semicontinuous on A(x). Moreover, \(Tv \in {{\mathcal {U}}}^d_b(X). \)

Remark 4.4

(a):

The assumption that \(\delta \) is continuous and increasing is important for part (a) of Lemma 4.3.

(b):

Under assumptions of Lemma 4.3, in the operator in (4.1) one can replace \(\sup \) by \(\max \).

(c):

Using Lemma 4.2, one can easily see that if \(v\in {{\mathcal {M}}}^a_b(X)\) and \(f\in F,\) then \(T_fv \in {{\mathcal {M}}}^a_b(X).\)

The following fixed point theorem will play an important role in our proof (see e.g. [26] or Theorem 5.2 in [13]).

Lemma 4.5

Let \((Z,m)\) be a complete metric space and \(\psi :{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) be a continuous, increasing function with \(\psi (x)<x\) for all \(x\in (0,\infty )\). If an operator \(T:Z\rightarrow Z\) satisfies the inequality

$$\begin{aligned} m(Tv_1,Tv_2)\le \psi (m(v_1,v_2)) \end{aligned}$$

for all \(v_1,\ v_2\in Z\), then T has a unique fixed point \(v^*\in Z\) and

$$\begin{aligned} \lim _{n\rightarrow \infty }m(T^{(n)}v,v^*)=0 \end{aligned}$$

for each \(v\in Z \). Here \(T^{(n)}\) is the composition of T with itself n times.
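Lemma 4.5 covers maps that are not Banach contractions. A minimal one-dimensional illustration of our own: on \(Z=[0,\infty )\) with the usual metric, the map \(Tv= 1+v/(1+v)\) satisfies \(|Tv_1-Tv_2|\le \psi (|v_1-v_2|)\) with \(\psi (t)=t/(1+t),\) although no global Lipschitz constant smaller than one exists (the slope of T tends to 1 near \(v=0\)).

```python
def T(v):
    # Matkowski contraction on [0, inf) that is not a Banach contraction.
    return 1.0 + v / (1.0 + v)

v = 0.0
for _ in range(80):
    v = T(v)
print(v, (1 + 5 ** 0.5) / 2)   # iterates approach the unique fixed point
```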

For the convenience of the reader we formulate and prove a modification of Lemma 8 from [20] that is used many times in our proofs. Consider a function \(\psi :{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) and put

$$\begin{aligned} \psi _m(z)=z+\psi \big (z+\psi \big (z+\cdots +\psi (z+\psi (z))\cdots \big ) \big ),\quad \text{ where } z>0 \text{ appears } m \text{ times. } \end{aligned}$$

Lemma 4.6

If \(\psi \) is increasing, subadditive and \(\psi (y)< y\) for all \(y>0,\) then for any \(z>0\), there exists

$$\begin{aligned} L(z):=\lim _{m\rightarrow \infty } \psi _{m}(z)= \sup _{m\ge 1}\psi _{m}(z)<\infty . \end{aligned}$$

Proof

For any \(k\in {\mathbb {N}},\) let \(\psi ^{(k)}\) denote the composition of \(\psi \) with itself k times. Since the function \(\psi \) is increasing, for each \(m\ge 1,\)

$$\begin{aligned} \psi _{m+1}(z)>\psi _m(z). \end{aligned}$$

Hence, the sequence \((\psi _{m}(z))\) is increasing. We show that its limit is finite. Indeed, observe that by the subadditivity of \(\psi ,\) we have

$$\begin{aligned} \psi _2(z)-\psi _1(z)= & {} z+\psi (z)-z\le \psi (z),\quad \text{ and } \\ \psi _3(z)-\psi _2(z)= & {} z+\psi (z+\psi (z))-z-\psi (z)\le \psi ^{(2)}(z). \end{aligned}$$

By induction, we obtain

$$\begin{aligned} \psi _{m}(z)-\psi _{m-1}(z)\le \psi ^{(m-1)}(z). \end{aligned}$$

Let \(\epsilon >0\) be fixed. Since \(\psi \) is continuous (apply (2.1) with \(\gamma :=\psi \)) and \(\psi (y)<y\) for all \(y>0,\) the non-increasing sequence \((\psi ^{(m)}(z))\) converges to 0, the unique fixed point of \(\psi .\) Hence, there exists \(m\ge 1\) such that

$$\begin{aligned} \psi _m(z)-\psi _{m-1}(z)< \epsilon -\psi (\epsilon ). \end{aligned}$$

Observe now that from subadditivity of \(\psi \) (set \(\gamma :=\psi \) in (2.1)), it follows that

$$\begin{aligned} \psi _{m+1}(z)-\psi _{m-1}(z)= & {} \psi _{m+1}(z)-\psi _{m}(z)+\psi _{m}(z)-\psi _{m-1}(z)\\\le & {} z+\psi (\psi _{m}(z))-z-\psi (\psi _{m-1}(z))+\epsilon -\psi (\epsilon )\\\le & {} \psi (\psi _{m}(z)-\psi _{m-1}(z))+\epsilon -\psi (\epsilon )\\\le & {} \psi (\epsilon -\psi (\epsilon ))+\epsilon -\psi (\epsilon )<\psi (\epsilon )+\epsilon -\psi (\epsilon )=\epsilon . \end{aligned}$$

By induction, we can easily prove that

$$\begin{aligned} \psi _{m+k}(z)-\psi _{m-1}(z)\le \epsilon \end{aligned}$$

for all \(k\ge 0.\) Hence, \(\psi _{m+k}(z)\le \psi _{m-1}(z)+ \epsilon .\) Since \(\psi _{m-1}(z)\) is finite, it follows that L(z) is finite. \(\square \)
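A numerical sketch of Lemma 4.6 with the hypothetical choice \(\psi (z)=z/(1+z)\) (increasing, subadditive, \(\psi (z)<z\) for \(z>0\)): here the limit can even be computed in closed form from \(L(z)=z+\psi (L(z)),\) which gives \(L(z)=(z+\sqrt{z^2+4z})/2.\)

```python
import math

psi = lambda t: t / (1.0 + t)   # increasing, subadditive, psi(t) < t

def psi_m(z, m):
    # psi_m(z) = z + psi(z + psi(z + ... + psi(z))), z appearing m times.
    acc = z
    for _ in range(m - 1):
        acc = z + psi(acc)
    return acc

z = 2.0
for m in (1, 2, 5, 20, 100):
    print(m, psi_m(z, m))                             # increasing in m
print("L(z) =", (z + math.sqrt(z * z + 4 * z)) / 2)   # the finite limit
```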

5 Stationary Optimal Policies in Dynamic Problems with the Recursive Discounted Utilities

In this section, we prove that if \(u\in {{\mathcal {M}}}^a_b(D),\) assumptions (B) hold and either conditions (W) or (S) are satisfied, then the recursive discounted utility functions (3.2) are well-defined and there exists an optimal stationary policy. Moreover, under assumptions (W) ((S)), the value function \(x\mapsto \sup _{\pi \in \varPi }U(x,\pi )\) belongs to \({{\mathcal {U}}}^a_b(X) \) (\({{\mathcal {M}}}^a_b(X)\)). The value function and an optimal policy will be characterised via a single Bellman equation. First we shall study the case \(u \in {{\mathcal {M}}}^d_b(D)\) and then apply an approximation technique to the case of utilities unbounded from below.

5.1 One-Period Utilities with Bounds on Both Sides

Assume that \({{\mathcal {M}}}^d_b(X)\) is endowed with the so-called weighted norm \(\Vert \cdot \Vert _\omega \) defined as

$$\begin{aligned} \Vert v\Vert _\omega := \sup _{x\in X} \frac{|v(x)|}{\omega (x)},\quad v\in {{\mathcal {M}}}^d_b(X). \end{aligned}$$

Then \({{\mathcal {M}}}^d_b(X)\) is a Banach space, and \({{\mathcal {U}}}^d_b(X)\) is a closed subset of \({{\mathcal {M}}}^d_b(X)\) if (W2.4) holds. The following theorem is the main result of this subsection. Its proof is split into several parts below.

Theorem 5.1

Suppose that assumptions (A),  (B) hold and assumptions (W) are satisfied. Then

(a):

the Bellman equation \(Tv=v\) has a unique solution \(v^* \in {{\mathcal {M}}}^d_b(X)\) and

$$\begin{aligned} \lim _{m\rightarrow \infty }\Vert T^{(m)}{} \mathbf{0}-v^*\Vert _\omega =0\quad \text{ and }\quad v^*(x)=\sup _{\pi \in \varPi }U(x,\pi ), \ x\in X, \end{aligned}$$
(b):

there exists \( f^*\in F\) such that \(T_{ f^*}v^*=v^*\) and \(f^*\) is an optimal stationary policy for problem (3.3),

(c):

\(v^* \in {{\mathcal {U}}}^d_b(X).\) The points (a) and (b) also remain valid under assumptions (A), (B) and (S).

Remark 5.2

As already mentioned, in our approach we consider only deterministic strategies. This is because the optimality results do not change when we take randomised strategies into account. Indeed, we may examine a new model in which the original action sets A(x) are replaced by the sets of probability measures \(\Pr (A(x))\). Then, the Bellman equation has a solution as in Theorem 5.1, but the supremum in (4.1) is taken over the set \(\Pr (A(x))\). However, due to our assumptions the maximum is also attained at a Dirac delta concentrated at some point of A(x). Therefore, randomised strategies do not influence the results.

Since condition (B2.4) contains two cases, it is convenient to define a new function

$$\begin{aligned} {\tilde{\gamma }}(y):=\left\{ \begin{array}{c c} \gamma (y), &{}\quad \alpha \le 1\\ \alpha \gamma (y), &{}\quad \alpha >1. \end{array}\right. \end{aligned}$$

Clearly, \({\tilde{\gamma }}\) is subadditive. Let \(z=\max \{b,c\},\) where the constants \(b>0\) and \(c>0\) come from assumptions (A). Then \(|u(x,a)|\le \omega (x) z \) for all \((x,a)\in D.\) From (B2.3)(ii), it follows that

$$\begin{aligned} {\tilde{\gamma }}(\omega (x)y)\le \omega (x){\tilde{\gamma }}(y),\quad \text{ for } \text{ all } \quad x\in X,\ y\ge 0. \end{aligned}$$
(5.1)

This inequality is frequently used in our proofs. Let

$$\begin{aligned} {\tilde{\gamma }}_k(z)=z+{\tilde{\gamma }}\big (z+{\tilde{\gamma }}\big (z+\cdots +{\tilde{\gamma }}(z)\big )\cdots \big ), \end{aligned}$$

where z appears on the right-hand side k times. Putting \(\psi ={\tilde{\gamma }}\) in Lemma 4.6, we infer that

$$\begin{aligned} {\tilde{L}}(z):= \lim _{k\rightarrow \infty }{\tilde{\gamma }}_k(z)= \sup _{k\in {\mathbb {N}}} {\tilde{\gamma }}_k(z) <\infty . \end{aligned}$$
(5.2)

We point out that \({\tilde{\gamma }}^{(n)}\) is the n-th iteration of the function \({\tilde{\gamma }}.\)

We now prove that the recursive discounted utility (3.2) is well-defined.

Lemma 5.3

If \(u \in {{\mathcal {M}}}^d_b(D)\) and assumptions (B) are satisfied, then \(U(x,\pi ):=\lim _{n\rightarrow \infty } U_n(x,\pi )\) exists for any policy \(\pi \in \varPi \) and any initial state \(x\in X.\) Moreover, \(U(\cdot ,\pi )\in {{\mathcal {M}}}^d_b(X)\) and

$$\begin{aligned} \lim _{n\rightarrow \infty } \Vert U(\cdot ,\pi )- U_n(\cdot ,\pi )\Vert _\omega =0. \end{aligned}$$

Proof

We shall prove that \((U_n(\cdot ,\pi ))\) is a Cauchy sequence of functions in \({{\mathcal {M}}}^d_b(X)\) for each policy \(\pi \in \varPi .\) We claim that

$$\begin{aligned} |U_{n+m}(x,\pi )-U_n(x,\pi )|\le Q_{\pi _1}^\gamma \ldots Q_{\pi _{n-1}}^\gamma Q^\gamma _{\pi _n} |T_{\pi _{n+1}}\ldots T_{\pi _{n+m}}{} \mathbf{0}|(x). \end{aligned}$$
(5.3)

Indeed, using assumptions (B), we can conclude that

$$\begin{aligned} |U_{n+m}(x,\pi )-U_n(x,\pi )|= & {} |T_{\pi _1}\cdots T_{\pi _{n}} T_{\pi _{n+1}}\cdots T_{\pi _{n+m}} \mathbf{0}(x)-T_{\pi _1}\cdots T_{\pi _n} \mathbf{0}(x)|\\= & {} |Q_{\pi _1}^\delta T_{\pi _2}\cdots T_{\pi _{n+m}} \mathbf{0}(x) - Q_{\pi _1}^\delta T_{\pi _2}\cdots T_{\pi _{n}} \mathbf{0}(x) | \\\le & {} Q_{\pi _1}^\gamma |T_{\pi _2}\cdots T_{\pi _{n+m}} \mathbf{0}-T_{\pi _2}\cdots T_{\pi _{n}} \mathbf{0}|(x) \le \cdots \\\le & {} Q_{\pi _1}^\gamma \cdots Q_{\pi _{n-1}}^\gamma |Q^\delta _{\pi _n} T_{\pi _{n+1}}\cdots T_{\pi _{n+m}}{} \mathbf{0}|(x) \\\le & {} Q_{\pi _1}^\gamma \cdots Q_{\pi _{n-1}}^\gamma Q^\gamma _{\pi _n} |T_{\pi _{n+1}}\cdots T_{\pi _{n+m}}{} \mathbf{0}|(x). \end{aligned}$$

Assume that \(m=1\). Then, for any \(h_{n+1}\in H_{n+1},\) we have

$$\begin{aligned} |T_{\pi _{n+1}}{} \mathbf{0}(h_{n+1})|= |u(x_{n+1},\pi _{n+1}(h_{n+1}))|\le \omega (x_{n+1})z. \end{aligned}$$

Take \(m=2\) and notice by (B) that

$$\begin{aligned} |T_{\pi _{n+1}}T_{\pi _{n+2}}{} \mathbf{0}(h_{n+1})|= & {} |u(x_{n+1},\pi _{n+1}(h_{n+1}))+\int _X \delta (u(x_{n+2},\pi _{n+2}(h_{n+1},x_{n+2})))\\&q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))|\\\le & {} \omega (x_{n+1})z+\int _X \gamma (\omega (x_{n+2})z)q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))\\\le & {} \omega (x_{n+1})z+\int _X\omega (x_{n+2}) \gamma (z)q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))\\\le & {} \omega (x_{n+1})(z+\alpha \gamma (z))\le \omega (x_{n+1})(z+{\tilde{\gamma }}(z)). \end{aligned}$$

For \(m=3,\) it follows that

$$\begin{aligned}&|T_{\pi _{n+1}}T_{\pi _{n+2}}T_{\pi _{n+3}}{} \mathbf{0}(h_{n+1})|= |u(x_{n+1},\pi _{n+1}(h_{n+1}))\\&\qquad +\int _X \delta (T_{\pi _{n+2}}T_{\pi _{n+3}} \mathbf{0}(h_{n+1},\pi _{n+1}(h_{n+1}),x_{n+2} ))q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))| \\&\quad \le \omega (x_{n+1})z+ \int _X \gamma \left( \omega (x_{n+2})(z+{\tilde{\gamma }}(z))\right) q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))\\&\quad \le \omega (x_{n+1})z+\int _X\omega (x_{n+2}) \gamma \left( z+{\tilde{\gamma }}(z)\right) q(dx_{n+2}|x_{n+1},\pi _{n+1}(h_{n+1}))\\&\quad \le \omega (x_{n+1})z+\alpha \omega (x_{n+1})\gamma (z+{\tilde{\gamma }}(z))\le \omega (x_{n+1})(z+{\tilde{\gamma }}(z+{\tilde{\gamma }}(z))). \end{aligned}$$

Continuing this way, for any \(h_{n+1}\in H_{n+1},\) we obtain

$$\begin{aligned} |T_{\pi _{n+1}}\cdots T_{\pi _{n+m}}{} \mathbf{0}(h_{n+1}) |\le & {} \omega (x_{n+1}) \big (z+{\tilde{\gamma }}\big (z+{\tilde{\gamma }}(z+\cdots +{\tilde{\gamma }}(z+{\tilde{\gamma }}(z))\cdots ) \big )\big ) \nonumber \\= & {} \omega (x_{n+1}){\tilde{\gamma }}_m\left( z\right) , \end{aligned}$$
(5.4)

where z appears m times on the right-hand side of inequality (5.4). By (5.2), \({\tilde{\gamma }}_m( z)< {\tilde{L}} ( z )<\infty .\) Combining (5.3) and (5.4) and making use of (B2.4) and (5.1), we conclude that

$$\begin{aligned} Q_{\pi _1}^\gamma \ldots Q_{\pi _{n-1}}^\gamma Q^\gamma _{\pi _n} {\tilde{L}}(z)\omega (x)\le & {} Q_{\pi _1}^\gamma \ldots Q_{\pi _{n-1}}^\gamma \gamma ({\tilde{L}}(z)) \alpha \omega (x)\\= & {} Q_{\pi _1}^\gamma \cdots Q_{\pi _{n-1}}^\gamma {\tilde{\gamma }}({\tilde{L}}(z))\omega (x) \le \cdots \\\le & {} {\tilde{\gamma }}^{(n)}({\tilde{L}}(z))\omega (x). \end{aligned}$$

Consequently,

$$\begin{aligned} \Vert U_{n+m}(\cdot ,\pi )-U_n(\cdot ,\pi )\Vert _\omega \le {\tilde{\gamma }}^{(n)} ({\tilde{L}}(z)). \end{aligned}$$
(5.5)

From the proof of (5.4) we deduce that for any \(n\in {\mathbb {N}}\)

$$\begin{aligned} |U_n(x,\pi )|\le \omega (x){\tilde{\gamma }}_n(z)\le \omega (x) {\tilde{L}}(z). \end{aligned}$$

Therefore, for each \(n\in {\mathbb {N}},\) \(U_n(\cdot ,\pi )\in {{\mathcal {M}}}_b^d(X).\) From (5.5) it follows that \((U_n(\cdot ,\pi ))\) is a Cauchy sequence in the Banach space \({{\mathcal {M}}}^d_b(X).\) \(\square \)

Proof of Theorem 5.1

Consider first assumptions (W). By Lemma 4.3, T maps \({{\mathcal {U}}}^d_b(X)\) into itself. We show that T has a fixed point in \({{\mathcal {U}}}^d_b(X).\) Let \(v_1,\ v_2 \in {{\mathcal {U}}}^d_b(X).\) Then, under assumptions (B) we obtain

$$\begin{aligned} |Tv_1(x)-Tv_2(x) |&\le \sup _{a\in A(x)} \int _X\left| \delta \big (v_1(y)\big )-\delta \big (v_2(y)\big )\right| q(dy|x,a) \\&\le \sup _{a\in A(x)} \int _X \gamma \big (|v_1(y) - v_2(y)|\big )q(dy|x,a) \\&\le \sup _{a\in A(x)} \int _X \gamma \big (\Vert v_1 - v_2\Vert _\omega \big )\omega (y)q(dy|x,a) \\&\le \alpha \gamma \big ( \Vert v_1-v_2\Vert _\omega \big )\omega (x). \end{aligned}$$

Hence,

$$\begin{aligned} \Vert Tv_1-Tv_2\Vert _\omega \le {\tilde{\gamma }}( \Vert v_1-v_2\Vert _\omega ). \end{aligned}$$

Since the space \({{\mathcal {U}}}^d_b(X)\) endowed with the metric induced by the norm \(\Vert \cdot \Vert _\omega \) is complete, by Lemma 4.5, there exists a unique \(v^*\in {{\mathcal {U}}}^d_b(X)\) such that \(v^*=Tv^*\) and

$$\begin{aligned} \lim _{n\rightarrow \infty } \Vert T^{(n)}v-v^*\Vert _\omega =0\quad \text{ for } \text{ any } \quad v\in {{\mathcal {U}}}^d_b(X). \end{aligned}$$

By Lemma 4.1 and the assumptions that \(\delta \) is increasing and continuous, it follows that there exists \(f^*\in F\) such that \(v^*=T_{f^*}v^*.\) We claim that

$$\begin{aligned} v^*(x)=U(x,f^*)=\lim _{n\rightarrow \infty }T^{(n)}_{f^*}v^*(x) \quad \text{ for } \text{ all } x\in X. \end{aligned}$$

The operator \(T_{f^*}:{{\mathcal {M}}}^d_b(X)\rightarrow {{\mathcal {M}}}^d_b(X)\) also satisfies the assumptions of Lemma 4.5. Thus there is a unique function \({\tilde{v}}\in {{\mathcal {M}}}^d_b(X)\) such that

$$\begin{aligned} {\tilde{v}}(x)=T_{f^*}{\tilde{v}}(x)=\lim _{n\rightarrow \infty } T_{f^*}^{(n)} h(x),\quad x\in X, \end{aligned}$$

for any \(h\in {{\mathcal {M}}}^d_b(X).\) Therefore, \({\tilde{v}}=v^*.\) Putting \(h: =\mathbf{0}\) we deduce from Lemma 5.3 that

$$\begin{aligned} \lim _{n\rightarrow \infty } T_{f^*}^{(n)} \mathbf{0}(x)=U(x,f^*),\quad x\in X. \end{aligned}$$

In order to prove the optimality of \(f^*\) note that for any \(a\in A(x)\) and \(x\in X\), it holds

$$\begin{aligned} v^*(x)\ge u(x,a)+\int _X \delta (v^*(y))q(dy|x,a). \end{aligned}$$

Taking any policy \(\pi =(\pi _n)\) and iterating the above inequality, we get

$$\begin{aligned} v^*(x)\ge T_{\pi _1}\cdots T_{\pi _{n}}v^*(x),\quad x\in X. \end{aligned}$$

We now prove that

$$\begin{aligned} \lim _{n\rightarrow \infty } T_{\pi _1}\cdots T_{\pi _{n}} v^*(x)=U(x,\pi ), \quad x\in X. \end{aligned}$$

To this end, we first consider the differences

$$\begin{aligned} |U_{n}(x,\pi )- T_{\pi _1}\cdots T_{\pi _{n}} v^*(x)|= & {} | T_{\pi _1}\cdots T_{\pi _{n}}{} \mathbf{0}(x) - T_{\pi _1}\cdots T_{\pi _{n}}v^*(x)|\\\le & {} \omega (x) {\tilde{\gamma }}^{(n)}(\Vert v^*\Vert _\omega ) \rightarrow 0 \quad \text{ as } n\rightarrow \infty . \end{aligned}$$

By Lemma 5.3, \(U_{n}(x,\pi )\rightarrow U(x,\pi )\) for every \(x\in X\) as \(n \rightarrow \infty .\) Therefore, we have that

$$\begin{aligned} v^*(x)\ge \lim _{n\rightarrow \infty } T_{\pi _1}\cdots T_{\pi _{n}} v^*(x)= U(x,\pi ),\quad x\in X \end{aligned}$$

and

$$\begin{aligned} \sup _{\pi \in \varPi } U(x,\pi )\ge U(x,f^*)=v^*(x)\ge \sup _{\pi \in \varPi } U(x,\pi ),\quad x\in X. \end{aligned}$$

This implies that

$$\begin{aligned} U(x,f^*)=v^*(x)= \sup _{\pi \in \varPi } U(x,\pi ),\quad x\in X, \end{aligned}$$

which finishes the proof under assumptions (W). For assumptions (S) the proof proceeds along the same lines. By Lemma 4.3, under (S), \(T: {{\mathcal {M}}}_b^d(X)\rightarrow {{\mathcal {M}}}_b^d(X)\). \(\square \)

Remark 5.4

Under the assumptions of Theorem 5.1, the Bellman equation has a unique solution and it is the optimal value function \(v^*(x)= \sup _{\pi \in \varPi } U(x,\pi ).\) Moreover, it holds that

$$\begin{aligned} \lim _{n\rightarrow \infty }\Vert T^{(n)}{} \mathbf{0} -v^*\Vert _\omega =0. \end{aligned}$$

Obviously, \(T^{(n)}{} \mathbf{0}\) is the value function in the n-step dynamic programming problem. One can say that the value iteration algorithm works and the iterations \(T^{(n)}{} \mathbf{0}(x)\) converge to \(v^*(x)\) for each \(x\in X.\) This convergence is uniform in \(x\in X\) when the weight function \( \omega \) is bounded.
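In a finite model the value iteration of Remark 5.4 is straightforward to implement. A minimal sketch (ours; all data hypothetical, \(\delta \) a sign-effect function as in Remark 2.1), which also extracts a maximiser \(f^*\) as in Theorem 5.1(b):

```python
import numpy as np

S, A = 4, 3
rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(S), size=(S, A))    # q[x, a, y] = q({y}|x, a)
u = rng.normal(size=(S, A))                   # one-stage utility
delta = lambda z: np.where(z <= 0, 0.5 * z, 0.8 * z)

v = np.zeros(S)                               # start from the zero function
for _ in range(1000):
    Sv = u + (q * delta(v)).sum(axis=2)       # Sv(x,a) = u + int delta(v) dq
    v_new = Sv.max(axis=1)                    # Tv(x) = max_a Sv(x,a)
    done = np.max(np.abs(v_new - v)) < 1e-12
    v = v_new
    if done:
        break

f_star = Sv.argmax(axis=1)   # a maximiser of the Bellman equation
print(v, f_star)
```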

5.2 One-Period Utilities Unbounded from Below

In this subsection we drop condition (A2.1) and assume only that there exists \(c>0 \) such that \(u(x,a)\le c\omega (x)\) for all \((x,a)\in D.\) In other words, \(u\in {{\mathcal {M}}}^a_b(D).\) Here we obtain the following result, which is proved in the remainder of this subsection.

Theorem 5.5

Suppose that assumptions (A2.2),  (B) and (W) are satisfied. Then

(a):

the optimal value function

$$\begin{aligned} {\underline{v}}^*(x):=\sup _{\pi \in \varPi }U(x,\pi ), \ x\in X, \end{aligned}$$

is a solution to the Bellman equation \(Tv=v\) and \({\underline{v}}^* \in {{\mathcal {M}}}^a_b(X),\)

(b):

there exists \( {\tilde{f}}\in F\) such that \(T_{{\tilde{f}}}{\underline{v}}^*={\underline{v}}^*\) and \({\tilde{f}}\) is an optimal stationary policy,

(c):

\({\underline{v}}^* \in {{\mathcal {U}}}^a_b(X).\) The points (a) and (b) also remain valid under assumptions (A2.2), (B) and (S).

Remark 5.6

We shall prove that \({\underline{v}}^*\) is the limit of a non-increasing sequence of value functions in “truncated models”, i.e. the models that satisfy (A2.1) and (A2.2). The convergence is monotone, but it is not uniform. The Bellman equation may have many unbounded solutions.

Remark 5.7

The assumptions of Theorem 5.5 do not guarantee uniqueness. A simple example is the following. Assume that \(X={\mathbb {N}},\) \(A=A(x)=\{a\},\) \(u(x,a)=0\) for all \((x,a)\in D,\) and the process moves from state x to \(x+1\) with probability one. The discount function is \(\delta (x)=\beta x\) with \(\beta \in (0,1).\) Clearly, u satisfies assumption (A2.2) with \(\omega (x)=1,\) \(c=1.\) Note that \(v (x)= r/\beta ^x\) is a solution to the Bellman equation \(Tv=v\) for any \(r\in {\mathbb {R}}.\) Clearly, \({\underline{v}}^*(x)= 0\) is one of them. Actually, \({\underline{v}}^*(x)= 0\) is the largest non-positive solution to the Bellman equation. This example does not contradict the uniqueness result in Theorem 5.1: within the class of bounded functions, \({\underline{v}}^*(x)= 0\) is the unique solution to the Bellman equation.
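Indeed, in this example the Bellman operator reduces to \(Tv(x)=\delta (v(x+1))=\beta v(x+1),\) and for \(v(x)=r/\beta ^x\) we obtain

$$\begin{aligned} Tv(x)=\beta \cdot \frac{r}{\beta ^{x+1}}=\frac{r}{\beta ^x}=v(x). \end{aligned}$$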

We now prove that the recursive discounted utility (3.2) is well-defined.

Lemma 5.8

If \(u \in {{\mathcal {M}}}^a_b(D)\) and assumptions (B) are satisfied, then \(U(x,\pi ):=\lim _{n\rightarrow \infty } U_n(x,\pi )\) exists in \(\underline{{\mathbb {R}}}\) for any policy \(\pi \in \varPi \) and any initial state \(x\in X.\) Moreover, \(U(\cdot ,\pi )\in {{\mathcal {M}}}^a_b(X).\)

Our assumption that \(u\in {{\mathcal {M}}}^a_b(D)\) means that (A2.2) holds, i.e., there exists \(c>0\) such that \(u(x,a)\le c\omega (x)\) for all \((x,a)\in D.\)

Proof of Lemma 5.8

We divide the proof into five parts.

Step 1: We start with a simple observation: for any \(n\in {\mathbb {N}},\) \(x\in X\) and \(\pi \in \varPi \) it holds

$$\begin{aligned} U_{n+1}(x,\pi )\le U_n(x,\pi )+{\tilde{\gamma }}^{(n)}(c)\omega (x). \end{aligned}$$

From assumptions (B),  it follows that

$$\begin{aligned} \delta (a+c\omega (x))\le \delta (a)+ \gamma (c\omega (x)) \le \delta (a)+ \gamma (c)\omega (x) \end{aligned}$$

for \(a\in \underline{{\mathbb {R}}}\) and \(x\in X.\) Note that, for any \(h_k\in H_k\) and \(\pi _k\), we have

$$\begin{aligned} T_{\pi _k} c\omega (h_k)= & {} u(x_k,\pi _k(h_k))+\int _X\delta (c\omega (y))q(dy|x_k,\pi _k(h_k))\nonumber \\\le & {} u(x_k,\pi _k(h_k))+Q^\gamma _{\pi _k}c\omega (x_k)\le T_{\pi _k}\mathbf{0}(h_k)+{\tilde{\gamma }}(c)\omega (x_k). \end{aligned}$$
(5.6)

Furthermore, for any \(v\in {{\mathcal {M}}}(H_{k+1})\) such that \(v(h_k,\pi _k(h_k),y)\le \eta \omega (y)\) for all \(y\in X\) and some \(\eta >0,\) we obtain

$$\begin{aligned}&T_{\pi _k} (v+c\omega )(h_k)\\&\quad =u(x_k,\pi _k(h_k))+\int _X\delta (v(h_k,\pi _k(h_k),y)+ c\omega (y))q(dy|x_k,\pi _k(h_k))\\&\quad \le u(x_k,\pi _k(h_k))+Q^\delta _{\pi _k} v(h_k)+ Q^\gamma _{\pi _k} c\omega (h_k) \le T_{\pi _k} v(h_k)+{\tilde{\gamma }}(c)\omega (x_k). \end{aligned}$$

From this fact and (5.6) we conclude that

$$\begin{aligned} U_{n+1}(x,\pi )= & {} T_{\pi _1}\cdots T_{\pi _n}T_{\pi _{n+1}}{} \mathbf{0}(x)\le T_{\pi _1}\cdots T_{\pi _{n-1}}T_{\pi _n}c\omega (x)\nonumber \\\le & {} T_{\pi _1}\cdots T_{\pi _{n-1}}\big (T_{\pi _n}{} \mathbf{0} + {\tilde{\gamma }}(c)\omega \big )(x) \nonumber \\\le & {} T_{\pi _1}\cdots T_{\pi _{n-2}}\big (T_{\pi _{n-1}}T_{\pi _n}\mathbf{0} +{\tilde{\gamma }}^{(2)}(c)\omega \big )(x) \le \cdots \nonumber \\\le & {} T_{\pi _1}\cdots T_{\pi _{n-2}}T_{\pi _{n-1}}T_{\pi _n}{} \mathbf{0} (x)+{\tilde{\gamma }}^{(n)}(c)\omega (x) \nonumber \\= & {} U_n(x,\pi )+{\tilde{\gamma }}^{(n)}(c)\omega (x). \end{aligned}$$
(5.7)

This finishes the first step.

Step 2: Let \(U_{n}(x,\pi )=-\infty \) for some \(n\in {\mathbb {N}}\), \(x\in X\) and \(\pi \in \varPi \). Then, by Step 1, \(U_{m}(x,\pi )=-\infty \) for all \(m\ge n\). Therefore, \(\lim _{m\rightarrow \infty }U_{m}(x,\pi )=-\infty .\)

Step 3: Let \(v \in {{\mathcal {M}}}(H_{k+1}) \) be such that \(v(h_k,a_k,x_{k+1}) \le \eta \omega (x_{k+1})\) for every \(x_{k+1}\in X\) and for some \(\eta >0.\) Define

$$\begin{aligned} {\varGamma }_{\pi _k}v(h_k):= & {} c\omega (x_k)+Q^\gamma _{\pi _k}v(h_k)\quad \text{ and }\\ \varGamma _{\pi _{n+1}}^{\pi _{n+m}}{} \mathbf{0}(h_{n+1}):= & {} \varGamma _{\pi _{n+1}}\cdots \varGamma _{\pi _{n+m}}{} \mathbf{0}(h_{n+1}). \end{aligned}$$

Note that

$$\begin{aligned} \varGamma _{\pi _{n+m}}{} \mathbf{0}(h_{n+m}) = c\omega (x_{n+m}). \end{aligned}$$

Next, we have

$$\begin{aligned}&\varGamma _{\pi _{n+m-1}} \varGamma _{\pi _{n+m}}{} \mathbf{0}(h_{n+m-1})\\&\quad = c\omega (x_{n+m-1})+ \int _X\gamma \big (c\omega ( x_{n+m})\big ) q\big (dx_{n+m}|x_{n+m-1},\pi _{n+m-1}(h_{n+m-1})\big )\\&\quad \le \omega (x_{n+m-1})(c+{\tilde{\gamma }}(c)), \end{aligned}$$

and

$$\begin{aligned}&\varGamma _{\pi _{n+m-2}}\varGamma _{\pi _{n+m-1}} \varGamma _{\pi _{n+m}}\mathbf{0}(h_{n+m-2}) \\&\quad \le c\omega (x_{n+m-2}) + \int _X\gamma \big ( \omega ( x_{n+m-1})(c+{\tilde{\gamma }}(c))\big )\\&\qquad q\big (dx_{n+m-1}|x_{n+m-2},\pi _{n+m-2}(h_{n+m-2})\big )\\&\quad \le \omega (x_{n+m-2})(c+ {\tilde{\gamma }}(c+{\tilde{\gamma }}(c))). \end{aligned}$$

Continuing this way, we get

$$\begin{aligned} \varGamma _{\pi _{n+1}}^{\pi _{n+m}}{} \mathbf{0}(h_{n+1})= & {} \varGamma _{\pi _{n+1}}\cdots \varGamma _{\pi _{n+m}}{} \mathbf{0}(h_{n+1}) \\\le & {} \omega (x_{n+1}) \big (c+{\tilde{\gamma }}\big (c+{\tilde{\gamma }}\big (c+\cdots +{\tilde{\gamma }}(c+{\tilde{\gamma }}(c))\cdots \big ) \big )\big ), \end{aligned}$$

where c appears on the right-hand side of this inequality m times. Putting \({\tilde{c}} ={\tilde{L}}(z)\) with \(z=c\) in (5.2), we obtain

$$\begin{aligned} \varGamma _{\pi _{n+1}}^{\pi _{n+m}}{} \mathbf{0}(h_{n+1}) = \varGamma _{\pi _{n+1}}\cdots \varGamma _{\pi _{n+m}}{} \mathbf{0}(h_{n+1}) \le \omega (x_{n+1}){\tilde{c}}=\omega (x_{n+1}){\tilde{L}}(c) <\infty . \end{aligned}$$
(5.8)

Step 4: For \(m,\ n \in {\mathbb {N}},\) we set

$$\begin{aligned}&W_{n,m}(x,\pi ):=T_{\pi _1}\cdots T_{\pi _n}\varGamma _{\pi _{n+1}}^{\pi _{n+m}} \mathbf{0}(x)\quad \text{ and }\\&\quad \quad W_{n,0}(x,\pi ):=T_{\pi _1}\cdots T_{\pi _n}{} \mathbf{0}(x)=U_n(x,\pi ). \end{aligned}$$

For any \(k=1,\ldots ,n-1\) and \(h_{k+1}\in H_{k+1},\) \(\pi \in \varPi ,\) let \(\pi (k+1)= (\pi _{k+1},\pi _{k+2},\ldots )\) and

$$\begin{aligned} V^{\pi (k+1)}_{n-k,m}(h_{k+1}):= & {} T_{\pi _{k+1}}\cdots T_{\pi _n} \varGamma _{\pi _{n+1}}^{\pi _{n+m}} \mathbf{0}(h_{k+1}), \\ V^{\pi (k+1)}_{n-k,0}(h_{k+1}):= & {} T_{\pi _{k+1}}\cdots T_{\pi _n} \mathbf{0}(h_{k+1}). \end{aligned}$$

For \(k=n-1\), we have

$$\begin{aligned} V^{\pi (n)}_{1,m}(h_n)= T_{\pi _n} \varGamma _{\pi _{n+1}}^{\pi _{n+m}} \mathbf{0}(h_n), \end{aligned}$$

and

$$\begin{aligned} V^{\pi (n)}_{1,0}(h_n)= T_{\pi _n} \mathbf{0}(h_n)= u(x_n,\pi _n(h_n)). \end{aligned}$$

From this and (5.8), it follows that

$$\begin{aligned} V^{\pi (n)}_{1,m}(h_n)- V^{\pi (n)}_{1,0}(h_n) \le \int _X \gamma \big ( \omega (x_{n+1}){\tilde{c}}\big )q(dx_{n+1}|x_n,\pi _n(h_n)) \le \omega (x_n){\tilde{\gamma }}({\tilde{c}}). \end{aligned}$$
(5.9)

Observe that for each \(k= 1,\ldots ,n-2,\)

$$\begin{aligned} V^{\pi (k+1)}_{n-k,m}(h_{k+1}) - V^{\pi (k+1)}_{n-k,0}(h_{k+1})= & {} T_{\pi _{k+1}}V^{\pi (k+2)}_{n-k-1,m}(h_{k+1}) - T_{\pi _{k+1}}V^{\pi (k+2)}_{n-k-1,0}(h_{k+1}) \nonumber \\= & {} Q^\delta _{\pi _{k+1}}V^{\pi (k+2)}_{n-k-1,m}(h_{k+1})- Q^\delta _{\pi _{k+1}}V^{\pi (k+2)}_{n-k-1,0}(h_{k+1}) \nonumber \\\le & {} Q^\gamma _{\pi _{k+1}}\big (V^{\pi (k+2)}_{n-k-1,m} - V^{\pi (k+2)}_{n-k-1,0}\big )(h_{k+1}). \end{aligned}$$
(5.10)

It is important to note that

$$\begin{aligned} V^{\pi (k+1)}_{n-k ,m}- V^{\pi (k+1)}_{n-k ,0}> 0,\quad k=1,\ldots ,n-1. \end{aligned}$$

Now, using (5.10), for any \(\pi \in \varPi \) and \(x=x_1,\) we conclude that

$$\begin{aligned} W_{n,m}(x,\pi )- W_{n,0}(x,\pi )= & {} T_{\pi _1}V^{\pi (2)}_{n-1 ,m}(x) -T_{\pi _1}V^{\pi (2)}_{n-1 ,0}(x)\\= & {} Q^\delta _{\pi _1} V^{\pi (2)}_{n-1 ,m}(x) -Q^\delta _{\pi _1}V^{\pi (2)}_{n-1 ,0}(x) \\\le & {} Q^\gamma _{\pi _1}\big ( V^{\pi (2)}_{n-1 ,m} - V^{\pi (2)}_{n-1 ,0}\big )(x) \\\le & {} Q^\gamma _{\pi _1}Q^\gamma _{\pi _2} \big ( V^{\pi (3)}_{n-2 ,m} - V^{\pi (3)}_{n-2 ,0}\big )(x) \le \cdots \\\le & {} Q^\gamma _{\pi _1}Q^\gamma _{\pi _2}\cdots Q^\gamma _{\pi _{n-1}} \big ( V^{\pi (n)}_{1 ,m} - V^{\pi (n)}_{1 ,0}\big )(x). \end{aligned}$$

This and (5.9) imply that

$$\begin{aligned} W_{n,m}(x,\pi )- W_{n,0}(x,\pi )\le \omega (x){\tilde{\gamma }}^{(n)}({\tilde{c}}),\ \text{ for } \text{ all }\ m,\ n \in {\mathbb {N}}. \end{aligned}$$
(5.11)

Step 5: We now consider the case where \(U_{n}(x,\pi )>-\infty \) for an initial state \(x\in X,\) a policy \(\pi \in \varPi \) and for all \(n\in {\mathbb {N}}\). From (5.8) we have

$$\begin{aligned} W_{n,m}(x,\pi )\le W_{n,m+1}(x,\pi ) \le \varGamma _{\pi _1}^{\pi _{n+m+1}}{} \mathbf{0} (x)\le {\tilde{c}}\omega (x) <\infty . \end{aligned}$$

Therefore, \(\lim _{m\rightarrow \infty } W_{n,m}(x,\pi )\) exists and is finite. Let us denote this limit by \(G_n\). Note that, for each \(m,\ n\in {\mathbb {N}},\)

$$\begin{aligned} U_n(x,\pi )= W_{n,0}(x,\pi ) \le W_{n,m}(x,\pi ). \end{aligned}$$

Let \(\epsilon >0\) be fixed. Then, by (5.11), for sufficiently large n,  say \(n>N_0,\)

$$\begin{aligned} W_{n,m}(x,\pi )\le W_{n,0}(x,\pi )+\epsilon \end{aligned}$$

for all \(m\in {\mathbb {N}}.\) Thus

$$\begin{aligned} W_{n,m}(x,\pi ) - \epsilon \le W_{n,0}(x,\pi )\le W_{n,m}(x,\pi ) \end{aligned}$$

and consequently

$$\begin{aligned} G_n - \epsilon \le W_{n,0}(x,\pi )\le G_n \end{aligned}$$

for all \(n> N_0.\) Observe that the sequence \((G_n)\) is non-increasing and \(G_*:=\lim _{n\rightarrow \infty }G_n\) exists in the extended real line \(\underline{{\mathbb {R}}}\). Hence, the limit

$$\begin{aligned} \lim _{n\rightarrow \infty }W_{n,0}(x,\pi )=\lim _{n\rightarrow \infty } U_n(x,\pi ) \end{aligned}$$

also exists and equals \(G_*.\) \(\square \)

In the proof of Theorem 5.5 we shall need the following result (see [30] or Theorem A.1.5 in [3]).

Lemma 5.9

If Y is a metric space and \((w_n)\) is a non-increasing sequence of upper semicontinuous functions \(w_n:Y\rightarrow \underline{{\mathbb {R}}},\) then

(a):

\(w_\infty =\lim _{n\rightarrow \infty }w_n\) exists and \(w_\infty \) is upper semicontinuous,

(b):

if, additionally, Y is compact, then

$$\begin{aligned} \max _{y\in Y} \lim _{n\rightarrow \infty }w_n(y)=\lim _{n\rightarrow \infty }\max _{y\in Y}w_n(y). \end{aligned}$$

In the proof of Theorem 5.5 we shall refer to the dynamic programming operators defined in (4.1) and (4.2). Moreover, we also define corresponding operators for \(v\in {{\mathcal {M}}}^a_b(X)\) and \(K\in {\mathbb {N}}\) as follows

$$\begin{aligned} T^K v(x):=\sup _{a\in A(x)}\left[ u^K(x,a)+\int _X\delta (v(y))q(dy|x,a)\right] ,\quad x\in X, \end{aligned}$$

and

$$\begin{aligned} T_{f}^K v(x)=u^K(x,f(x))+\int _X \delta (v(y))q(dy|x,f(x)),\quad x\in X, \end{aligned}$$

where \(f\in F\) and \(u^K(x,a)=\max \{u(x,a),1-K\},\) \(K\in {\mathbb {N}}.\) The recursive discounted utility functions with one-period utility \(u^K\) over the finite (n-period) and infinite time horizon for an initial state \(x\in X\) and a policy \(\pi \in \varPi \) will be denoted by \(U_n^K(x,\pi )\) and \(U^K(x,\pi ),\) respectively.
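In a finite model the truncation scheme can be observed directly. A sketch of our own (hypothetical data): solving the model with utility \(u^K\) for growing K produces a non-increasing sequence of value functions \(v^{*,K},\) in line with the proof of Theorem 5.5 below.

```python
import numpy as np

S, A = 4, 2
rng = np.random.default_rng(2)
q = rng.dirichlet(np.ones(S), size=(S, A))    # q[x, a, y]
u = rng.normal(size=(S, A)) - 3.0             # utility well below zero
delta = lambda z: np.where(z <= 0, 0.9 * z, 0.6 * z)

def value_iteration(uK, iters=3000):
    # Solves the truncated model with one-period utility uK.
    v = np.zeros(S)
    for _ in range(iters):
        v = (uK + (q * delta(v)).sum(axis=2)).max(axis=1)
    return v

for K in (1, 2, 5, 10, 50):
    print(K, value_iteration(np.maximum(u, 1 - K)))   # non-increasing in K
```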

Lemma 5.10

For any \(n\in {\mathbb {N}}\) and \(f\in F\), it holds

$$\begin{aligned} \lim _{K\rightarrow \infty } U^K_n(x,f)=U_n(x,f),\quad x\in X. \end{aligned}$$

Proof

We proceed by induction. For \(n=1\) the fact is obvious. Suppose that

$$\begin{aligned} T^{K,(n)}_f\mathbf{0}(x)=U^K_n(x,f)\rightarrow T^{(n)}_f\mathbf{0}(x)=U_n(x,f)\quad \text{ as }\ K\rightarrow \infty , \ \text{ for } \text{ all }\ x\in X. \end{aligned}$$

Here, \(T^{K,(n)}_f\) denotes the n-th composition of the operator \(T^{K}_f\) with itself. Then, by our induction hypothesis, our assumption that \(\delta \) is continuous and increasing, and the monotone convergence theorem, we infer that, for every \(x\in X,\)

$$\begin{aligned} \lim _{K\rightarrow \infty }T_f^K T_f^{K,(n)}{} \mathbf{0} (x)= & {} \lim _{K\rightarrow \infty }\left( u^K(x,f(x))+\int _X\delta (T_f^{K,(n)}{} \mathbf{0} (y))q(dy|x,f(x)) \right) \\= & {} u(x,f(x))+\int _X \delta (T_f^{(n)}{} \mathbf{0} (y))q(dy|x,f(x))= T_f^{(n+1)}{} \mathbf{0} (x). \end{aligned}$$

The lemma now follows by the induction principle. \(\square \)

Proof of Theorem 5.5

Assume first (W). The proof for (S) is analogous with obvious changes. By Theorem 5.1, for any \(K\in {\mathbb {N}},\) there exists a unique solution \(v^{*,K}\in {{\mathcal {U}}}^d_b(X)\) to the Bellman equation and, for each \(x\in X,\) \(v^{*,K}(x) = \sup _{\pi \in \varPi } U^K(x,\pi ).\) Since \(u^K \ge u,\) it follows that

$$\begin{aligned} v^{*,K}(x) = \sup _{\pi \in \varPi } U^K(x,\pi )\ge \sup _{\pi \in \varPi } U(x,\pi )={\underline{v}}^*(x), \quad x\in X. \end{aligned}$$

Clearly, the sequence \((v^{*,K})\) is non-increasing, thus \(v_\infty (x) := \lim _{K\rightarrow \infty } v^{*,K}(x)\) exists in \(\underline{{\mathbb {R}}}\) for every \(x\in X,\) and consequently,

$$\begin{aligned} v_\infty (x) \ge {\underline{v}}^*(x), \quad x\in X. \end{aligned}$$
(5.12)

From Theorem 5.1 we know that \(v^{*,K}\) is a solution to the equation

$$\begin{aligned} v^{*,K}(x)=\sup _{a\in A(x)}\left[ u^K(x,a)+\int _X \delta (v^{*,K}(y))q(dy|x,a)\right] , \quad x\in X. \end{aligned}$$

Since both sequences \((v^{*,K}(x)),\) \(x\in X,\) and \((u^K(x,a)),\) \((x,a)\in D,\) are non-increasing, it follows from Lemma 5.9, our assumption that \(\delta \) is increasing and continuous and the monotone convergence theorem that

$$\begin{aligned} v_\infty (x)= & {} \lim _{K\rightarrow \infty } v^{*,K}(x) = \lim _{K\rightarrow \infty } \max _{a\in A(x)}\left[ u^K(x,a)+\int _X \delta (v^{*,K}(y))q(dy|x,a)\right] \nonumber \\= & {} \max _{a\in A(x)}\lim _{K\rightarrow \infty }\left[ u^K(x,a)+\int _X \delta (v^{*,K}(y))q(dy|x,a)\right] \nonumber \\= & {} \max _{a\in A(x)}\left[ u(x,a)+\int _X \delta (v_\infty (y))q(dy|x,a)\right] ,\quad x\in X. \end{aligned}$$
(5.13)

Moreover, in case (W),  we have that \(v_\infty \in {{\mathcal {U}}}^a_b(X).\) From the obvious inequalities \(u(x,a)\le u^1(x,a)\le c\omega (x),\) \((x,a)\in D,\) it follows that \(v_\infty (x) \le {\tilde{c}}\omega (x)\) for \({\tilde{c}}= {\tilde{L}}(c) \) and for all \(x\in X\) (put \(z=c\) in (5.2)). By Lemma 4.1, there exists a maximiser \({\tilde{f}}\in F\) on the right-hand side of equation (5.13) and we have

$$\begin{aligned} v_\infty (x) = u(x,{\tilde{f}}(x))+\int _X \delta (v_\infty (y))q(dy|x,{\tilde{f}}(x))=T_{{\tilde{f}}} v_\infty (x),\quad x\in X. \end{aligned}$$

Iterating this equation, we obtain that

$$\begin{aligned} v_\infty (x)= T^{(n)}_{{\tilde{f}}} v_\infty (x) \le T^{K,(n)}_{{\tilde{f}}}v_\infty (x) \le T^{K,(n)}_{{\tilde{f}}} {\tilde{c}}\omega (x), \ \text{ for } \text{ all }\ x\in X\ \text{ and } \ K,\ n\in {\mathbb {N}}. \end{aligned}$$

From (5.7) in the proof of Lemma 5.8 (with c replaced by \({\tilde{c}},\) u replaced by \(u^K\) and \(\pi _1=\cdots =\pi _n= {\tilde{f}}\)), we infer that

$$\begin{aligned} v_\infty (x)\le T^{K,(n)}_{{\tilde{f}}}{\tilde{c}}\omega (x) \le U^K_n(x,{\tilde{f}})+{\tilde{\gamma }}^{(n)}({\tilde{c}})\omega (x),\ \text{ for } \text{ all }\ x\in X\ \text{ and } n\in {\mathbb {N}}. \end{aligned}$$

Letting \(K\rightarrow \infty \) in the above inequality and making use of Lemma 5.10, we obtain

$$\begin{aligned} v_\infty (x)\le U_n(x,{\tilde{f}})+{\tilde{\gamma }}^{(n)}({\tilde{c}})\omega (x)\ \text{ for } \text{ all }\ x\in X\ \text{ and }\ n\in {\mathbb {N}}. \end{aligned}$$

Hence,

$$\begin{aligned} v_\infty (x)\le & {} \lim _{n\rightarrow \infty }\left( U_n(x,{\tilde{f}})+{\tilde{\gamma }}^{(n)}({\tilde{c}})\omega (x) \right) = U(x,{\tilde{f}})\le \sup _{\pi \in \varPi } U(x,\pi )\\= & {} {\underline{v}}^*(x),\quad x\in X. \end{aligned}$$

From this inequality and (5.12), we conclude that

$$\begin{aligned} v_\infty (x)= U(x,{\tilde{f}})= \sup _{\pi \in \varPi } U(x,\pi )= {\underline{v}}^*(x),\ \text{ for } \text{ all }\ x\in X, \end{aligned}$$

and the proof is finished. \(\square \)

6 Computational Issues

In this section we consider the unbounded utility setting as in Theorem 5.1.

6.1 Policy Iteration

An optimal stationary policy can be computed as a limit point of a sequence of decision rules. In what follows, we define \(V_0=\mathbf{0}\) and \(V_n := T^{(n)} \mathbf{0}\) for \(n\in {\mathbb {N}}\). Next, for fixed \(x\in X,\) let

$$\begin{aligned} A_{n }^*(x) := \text{ Arg }\max _{a\in A(x)} \Big ( u(x,a) + \int _X \delta (V_{n-1}(y))q(dy|x,a)\Big ) \end{aligned}$$

for \(n\in {\mathbb {N}}. \) In the same way, let

$$\begin{aligned} A^*(x) := \text{ Arg }\max _{a\in A(x)} \Big ( u(x,a) + \int _X \delta (v^*(y))q(dy|x,a)\Big ). \end{aligned}$$

By \( Ls A_n^*(x),\) we denote the upper limit of the set sequence \((A_n^*(x)),\) that is, the set of all accumulation points of sequences \( (a_n)\) with \(a_n\in A_n^*(x) \) for all \(n\in {\mathbb {N}}.\)

The next result states that an optimal stationary policy can be obtained from accumulation points of sequences of maximisers of recursively computed value functions. Related results for dynamic programming with standard discounting are discussed, for example, in [3, 30].
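Before stating the result, the following minimal sketch illustrates the recursion \(V_n=T^{(n)}\mathbf{0}\) and the maximiser sets \(A_n^*(x)\) on a finite toy model. The two-state data u, q and the particular discount function \(\delta \) are illustrative assumptions made only for demonstration; they are not part of the general model.

```python
import numpy as np

# A minimal sketch of value iteration V_n = T^(n) 0 with a non-linear
# discount function applied inside the expectation. The two-state data
# u, q and delta are illustrative assumptions.

eps = 0.5
delta = lambda z: (1 - eps) * z + eps * np.log1p(z)   # here V_n >= 0, so log1p is safe

u = np.array([[1.0, 0.5],                        # u[x, a]: one-stage utility
              [0.2, 0.8]])
q = np.array([[[0.9, 0.1], [0.2, 0.8]],          # q[x, a, y]: transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])

V = np.zeros(u.shape[0])                         # V_0 = 0
for n in range(500):
    # Q[x, a] = u(x, a) + int delta(V_{n-1}(y)) q(dy | x, a)
    Q = u + np.einsum('xay,y->xa', q, delta(V))
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

A_star = Q.argmax(axis=1)                        # maximisers yield a stationary policy
print(V, A_star)
```

On such a finite model the sets \(A_n^*(x)\) eventually stabilise, which is the content of Theorem 6.1 below.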

Theorem 6.1

Under the assumptions of Theorem 5.1, we obtain: \(\emptyset \ne Ls A_n^*(x) \subset A^*(x)\) for all \(x\in {X}\).

Proof

Fix \(x\in {X}\) and define for \(n\in {\mathbb {N}}\) the functions \(v_n : A(x) \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} v_n(a) := u(x,a) + \int _X \delta (V_{n-1}(y))q(dy|x,a). \end{aligned}$$

By Lemma 4.3, \(v_n \) is upper semicontinuous on A(x). Moreover, for \(m\ge n\) and \(a\in A(x)\), we have

$$\begin{aligned} |v_m(a)-v_n(a)|\le & {} |\int _X \delta \big ( T^{(m-1)}\mathbf{0}(y)\big )q(dy|x,a) - \int _X \delta \big ( T^{(n-1)}\mathbf{0}(y)\big )q(dy|x,a) |\nonumber \\\le & {} \int _X \gamma \big ( |T^{(m-1)}\mathbf{0}(y)-T^{(n-1)}{} \mathbf{0}(y)|\big )q(dy|x,a). \end{aligned}$$
(6.1)

Using the same arguments as in the proof of (5.5), we infer that

$$\begin{aligned}&|T^{(m-1)}{} \mathbf{0}(y)-T^{(n-1)}{} \mathbf{0}(y)|\le \sup _{\pi \in \varPi }|U_{m-1}(y,\pi )- U_{n-1}(y,\pi )|\nonumber \\\le & {} \omega (y)\sup _{\pi \in \varPi }\Vert U_{m-1}(\cdot ,\pi )-U_{n-1}(\cdot ,\pi )\Vert _{\omega } \le \omega (y){\tilde{\gamma }}^{(n-1)}({\tilde{L}}(z)). \end{aligned}$$
(6.2)

From (6.1) and (6.2), it follows that

$$\begin{aligned} |v_m(a)-v_n(a)|\le & {} \int _X \gamma \big (\omega (y){\tilde{\gamma }}^{(n-1)}({\tilde{L}}(z))\big )q(dy|x,a)\\\le & {} \int _X \omega (y)\gamma \big ({\tilde{\gamma }}^{(n-1)}({\tilde{L}}(z))\big )q(dy|x,a)\le \omega (x){\tilde{\gamma }}^{(n)}\big ({\tilde{L}}(z)\big ) \end{aligned}$$

for all \(a\in A(x).\) Hence

$$\begin{aligned} \max _{a\in A(x)}|v_m(a)-v_n(a)|\le \omega (x){\tilde{\gamma }}^{(n)}\big ({\tilde{L}}(z)\big ) =:\varepsilon _n. \end{aligned}$$

This then implies that \(v_m(a) \le v_n(a)+\varepsilon _n\) for \(m\ge n\) and all \(a\in A(x).\) Since \( \varepsilon _n\rightarrow 0\) as \(n\rightarrow \infty ,\) the result follows from Theorem A.1.5 in [3] and the fact that \(v^*=\lim _{n\rightarrow \infty } V_n\) (see Remark 5.4). \(\square \)

6.2 Howard’s Policy Improvement Algorithm

The algorithm proposed by Howard [19] is widely discussed in the literature on Markov decision processes (dynamic programming), see [3, 16, 17] and their references. It may also be applied to models with the recursive discounted utility.

For any \(f\in F,\) we write \(U_f:=U(\cdot ,f)\).

Theorem 6.2

Let the assumptions of Theorem 5.1 be satisfied. For any \(f\in F,\) denote

$$\begin{aligned} A(x,f) := \big \{ a\in A(x)\; |\; u(x,a)+\int _X \delta (U(y,f)) q(dy|x,a) > U(x,f)\big \}, \quad x\in {X}. \end{aligned}$$

Then, the following holds:

  1. (a)

    If for some Borel set \({X}_0\subset {X}\) we define a decision rule g by

    $$\begin{aligned} g(x) \in A(x,f)&\quad \;\text{ for }\; x\in {X}_0, \\ g(x) = f(x)&\quad \;\text{ for }\; x\notin {X}_0, \end{aligned}$$

    then \(U_g\ge U_f\) and \(U_g(x) > U_f(x)\) for \(x\in {X}_0\). In this case, the policy g is called an improvement of f.

  2. (b)

    If \(A(x,f) = \emptyset \) for all \(x\in {X}\), then \(U_f=v^*\), i.e., the stationary policy \(f\in F\) is optimal.

Proof

(a) From the definition of g we obtain

$$\begin{aligned} T_g U_f (x) > U(x,f), \end{aligned}$$

if \(x\in {X}_0\), and \(T_g U_f(x) = U(x,f)\) if \(x\notin {X}_0\). Thus by induction

$$\begin{aligned} U_f (x)\le T_g U_f(x)\le T_g^{(n)} U_f(x), \end{aligned}$$

where the first inequality is strict for \(x\in {X}_0\). Letting \(n\rightarrow \infty ,\) it follows as in the proof of Theorem 5.1 that \(U_f \le U_g\) and in particular \(U(x,f) < U(x,g)\) for \(x\in {X}_0\).

(b) The condition \(A(x,f) = \emptyset \) for all \(x\in {X}\) implies \(T U_f \le U_f\). Since we always have \(T U_f \ge T_f U_f = U_f,\) we obtain \(T U_f = U_f\). From Theorem 5.1 we know that T has a unique fixed point \(v^*\in {{\mathcal {U}}}^d_b(X)\) (under assumptions (W)) or \(v^*\in {{\mathcal {M}}}^d_b(X)\) (under assumptions (S)). Thus \(U_f =v^*\). \(\square \)

Altogether, we have the following algorithm for the computation of the value function and an optimal stationary policy:

  1. 1.

Choose \(f_0\in F\) arbitrarily and set \(k=0\).

  2. 2.

    Compute \(U_{f_k} \) as the unique solution \(v\in {{\mathcal {M}}}^d_b({X})\) of the equation \(v=T_{f_k}v\).

  3. 3.

    Choose \(f_{k+1}\in F\) such that

$$\begin{aligned} f_{k+1}(x) \in \text{ Arg }\max _{a\in A(x)} \Big ( u(x,a) + \int _X \delta (U_{f_k}(y))q(dy|x,a)\Big ) \end{aligned}$$

    and set \(f_{k+1}(x)=f_{k}(x)\) if possible. If \(f_{k+1}=f_{k},\) then \(U_{f_{k}}=v^*\) and \((f_k,f_k,\ldots )\) is an optimal stationary policy. Else set \(k:=k+1\) and go to step 2.

It is obvious that the algorithm stops in a finite number of steps if the state and action sets are finite. In general, as in the standard discounted case (see, e.g., Theorem 7.5.1 and Corollary 7.5.3 in [3]) we can only claim that

$$\begin{aligned} v^*(x)= \lim _{k\rightarrow \infty } U(x,f_k). \end{aligned}$$
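The following hedged sketch implements the three steps above on the same illustrative two-state model as before; u, q and \(\delta \) are demonstration assumptions. Step 2 solves \(v=T_{f_k}v\) by successive approximation, which converges under the assumptions of Theorem 5.1.

```python
import numpy as np

# A sketch of Howard's policy improvement for the recursive discounted
# utility on a finite toy MDP; u, q, delta are illustrative assumptions.

eps = 0.5
delta = lambda z: (1 - eps) * z + eps * np.log1p(z)
u = np.array([[1.0, 0.5], [0.2, 0.8]])                     # u[x, a]
q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])                   # q[x, a, y]

def evaluate_policy(f, tol=1e-12):
    """Step 2: compute U_f as the fixed point of v = T_f v."""
    v = np.zeros(u.shape[0])
    idx = np.arange(len(v))
    while True:
        v_new = u[idx, f] + np.einsum('xy,y->x', q[idx, f], delta(v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def howard():
    f = np.zeros(u.shape[0], dtype=int)                    # step 1: f_0 arbitrary
    while True:
        v = evaluate_policy(f)
        Q = u + np.einsum('xay,y->xa', q, delta(v))        # step 3: improvement test
        # keep f(x) whenever no strictly better action exists (A(x, f) empty there)
        f_new = np.where(Q.max(axis=1) > v + 1e-10, Q.argmax(axis=1), f)
        if np.array_equal(f_new, f):                       # no improvement: f optimal
            return v, f
        f = f_new

v_star, f_star = howard()
print(v_star, f_star)
```

Since the state and action sets here are finite, the loop terminates after finitely many improvements, in line with the remark above.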

7 Applications

Example 7.1

(Stochastic optimal growth model 1) There is a single good available to the consumer. The level of this good at the beginning of period \(t\in {\mathbb {N}}\) is given by \(x_t\in X:=[0,\infty ).\) The consumer has to divide \(x_t\) between consumption \(a_t\in A:=[0,\infty )\) and investment (saving) \(y_t=x_t-a_t\). Thus \(A(x_t) = [0,x_t]\). From consumption \(a_t\) the consumer receives utility \(u(x_t,a_t)=\sqrt{a_t}\). Investment, on the other hand, is used for production with input \(y_t\) yielding output

$$\begin{aligned} x_{t+1} = y_t\cdot \xi _t, \qquad t\in {\mathbb {N}}, \end{aligned}$$

where \((\xi _t)\) is a sequence of i.i.d. shocks with distribution \(\nu \) being a probability measure on \([0,\infty ).\) The initial state \(x=x_1 \in {\mathbb {R}}_+\) is non-random. Further, we assume that

$$\begin{aligned} {\bar{s}} ={\mathbb {E}}\xi _t =\int _0^\infty s\nu (ds)\le 1. \end{aligned}$$

Let the discount function be given by

$$\begin{aligned} \delta (z)=(1-\varepsilon )z+\varepsilon \ln (1+z), \quad z\ge 0 \end{aligned}$$

with some \(\varepsilon \in (0,1).\) We observe that there is no constant \(\beta \in (0,1)\) such that \(\delta (z)<\beta z\) for all \(z>0;\) indeed, \(\delta (z)/z=(1-\varepsilon )+\varepsilon \ln (1+z)/z\rightarrow 1\) as \(z\rightarrow 0^+.\) We define \(\gamma :=\delta \) and note that \(z\mapsto \gamma (z)/z\) is non-increasing. Hence, \(\gamma \) and \(\delta \) satisfy assumptions (B2.1)–(B2.3). Now we show that assumptions (A), (W) and (B2.4) are satisfied with an appropriate weight function \(\omega .\) To this end, put

$$\begin{aligned} \omega (x)=\sqrt{x+1}, \quad x\in X. \end{aligned}$$

Then \(|u(x,a)|\le \sqrt{x}\le \sqrt{x+1}\) for \(a\in A(x)=[0,x]\) and \(x\in X.\) Thus, (A) holds. Furthermore, by Jensen’s inequality we have

$$\begin{aligned} \int _X \omega (y)q(dy|x,a)=\int _0^\infty \sqrt{(x-a)s+1}\ \nu (ds)\le \sqrt{ (x-a){\bar{s}}+1}\le \sqrt{x+1} \end{aligned}$$

for \(a\in A(x)\) and \(x\in X.\) Hence, (B2.4) is satisfied with \(\alpha =1.\) It is obvious that conditions (W) are also met. Therefore, by Theorem 5.1, there exists an upper semicontinuous solution to the Bellman equation and an optimal stationary policy \(f^*\in F.\)
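A rough numerical illustration of this example can be obtained by value iteration on a truncated grid. In the sketch below, the truncation of the state space, the grid sizes and the two-point shock distribution (with \({\mathbb {E}}\xi =1\)) are computational assumptions made only for demonstration.

```python
import numpy as np

# Value iteration for the growth model of Example 7.1 on a truncated grid.
# Grid bounds, grid sizes and the shock distribution are illustrative
# assumptions; the model itself lives on [0, infinity).

eps = 0.5
delta = lambda z: (1 - eps) * z + eps * np.log1p(z)   # discount function

x_grid = np.linspace(0.0, 10.0, 201)                  # truncated state space
shocks = np.array([0.8, 1.2])                         # i.i.d. shock values
probs = np.array([0.5, 0.5])                          # with E[xi] = 1

V = np.zeros_like(x_grid)
for _ in range(1000):
    dV = delta(V)                                     # delta(V(y)) on the grid
    V_new = np.empty_like(V)
    for i, x in enumerate(x_grid):
        a = np.linspace(0.0, x, 50)                   # consumption in A(x) = [0, x]
        y = x - a                                     # investment
        # E[delta(V(y * xi))]; next states outside the grid are clamped by np.interp
        EV = sum(p * np.interp(y * s, x_grid, dV) for s, p in zip(shocks, probs))
        V_new[i] = np.max(np.sqrt(a) + EV)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
```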

Example 7.2

(Stochastic optimal growth model 2) We study a modification of the model from Example 7.1. We assume that the next state evolves according to the equation

$$\begin{aligned} x_{t+1} = y_t^\theta \cdot \xi _t+(1-\rho )\cdot y_t,\qquad t\in {\mathbb {N}} \end{aligned}$$

where \(\rho \), \(\theta \) are some constants from the interval (0, 1). Here, \((\xi _t)\) is again a sequence of i.i.d. shocks with distribution \(\nu \) and expected value \({\bar{s}}>0\). The utility function is \(u(x,a)=a^\sigma \) with \(\sigma \in (0,1).\) Let \(\omega (x)=(x+r)^\sigma ,\) where \(r\ge 1.\) Assume now that the discount function is of the form

$$\begin{aligned} \delta (z)=(1-2\varepsilon )z+\varepsilon \ln (1+z), \quad z\ge 0 \end{aligned}$$

with some \(\varepsilon \in (0,1/2],\) which guarantees that \(\delta \) is increasing on \([0,\infty ).\) Then, \(\delta (z)\le (1-\varepsilon )z\) for \(z\ge 0.\) By Example 1 in [2] (see also Example 2 in [23]), we have that

$$\begin{aligned} \int _X \omega (y)q(dy|x,a)\le \left( 1+\frac{({\bar{s}}/\rho ) ^{\frac{1}{1-\theta }}}{r}\right) ^\sigma \omega (x),\quad x\in X. \end{aligned}$$

Hence, in (B2.4), we set

$$\begin{aligned} \alpha := \left( 1+\frac{({\bar{s}}/\rho )^{\frac{1}{1-\theta }}}{r}\right) ^\sigma >1 \end{aligned}$$

and \(\alpha \gamma (z)< z\) is satisfied with \(\gamma :=\delta \) and r such that \(\alpha (1-\varepsilon )<1;\) see the quick check after this paragraph. Clearly, all conditions (A), (B) and (W) are satisfied ((B2.3) follows from the fact that \(z\mapsto \gamma (z)/z\) is non-increasing). By Theorem 5.1, the value function \(v^*\) is upper semicontinuous and satisfies the Bellman equation. Moreover, there exists an optimal stationary policy.
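The following quick numeric check, with hypothetical parameter values chosen only for illustration, shows how large r must be taken for \(\alpha (1-\varepsilon )<1\) to hold.

```python
# Quick check that r in the weight omega(x) = (x + r)^sigma can be chosen
# large enough to ensure alpha * (1 - eps) < 1. All parameter values below
# are illustrative assumptions.
s_bar, rho, theta, sigma, eps = 1.0, 0.1, 0.5, 0.5, 0.25
for r in (1, 10, 100, 1000):
    alpha = (1 + (s_bar / rho) ** (1 / (1 - theta)) / r) ** sigma
    print(r, round(alpha, 4), alpha * (1 - eps) < 1)
# With these values, alpha * (1 - eps) < 1 first holds at r = 1000.
```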

Example 7.3

(Inventory model) A manager inspects the stock at each period \(t\in {\mathbb {N}}.\) The number of units in stock is \(x_t\in X:=[0,\infty ).\) He can sell \(\min \{x_t,\varDelta _t\}\) units in period t, where \(\varDelta _t\ge 0\) is a random variable representing an unknown demand. At the end of period t he can also order any amount \(a_t\in A:=[0,{\hat{a}}]\) of new goods, to be delivered at the beginning of the next period, at a cost \(C(a_t)\) paid immediately. Here \({\hat{a}}\) is some positive constant. Moreover, the function C is bounded above by \({\hat{C}},\) lower semicontinuous, increasing and satisfies \(C(0)=0.\) The state equation is of the form

$$\begin{aligned} x_{t+1}=x_t-\min \{x_t,\varDelta _t\}+a_t,\quad \text{ for } t\in {\mathbb {N}}, \end{aligned}$$

where \((\varDelta _t)\) is a sequence of i.i.d. random variables such that each \(\varDelta _t\) follows a continuous distribution function F and \({\mathbb {E}}\varDelta _t<\infty .\) The manager considers a recursive discounted utility with a discount function \(\delta \) satisfying (B2.1)–(B2.3) (with \(\omega =1\)). This model can be viewed as a Markov decision process, in which \(u(x,a):=p{\mathbb {E}}\min \{x,\varDelta \}-C(a)\) is the one-period bounded utility function and p denotes the unit stock price. (Here \(\varDelta = \varDelta _t\) for any fixed \(t\in {\mathbb {N}}.\)) Clearly, \( -{\hat{C}}\le u(x,a)\le p{\mathbb {E}}\varDelta .\) Next note that the transition probability q is of the form

$$\begin{aligned} q(B|x,a)=\int _{0}^{\infty }1_{B}(x-\min \{x,y\}+a)dF(y), \end{aligned}$$

where \(B\subset X\) is a Borel set, \(x\in X\), \(a\in A.\) If \(\phi \) is a bounded continuous function on X, then the integral

$$\begin{aligned} \int _{X}\phi (y)q(dy|x,a)= & {} \int _{0}^{\infty }\phi (x-\min \{x,y\}+a)dF(y)\\= & {} \int _{0}^{x}\phi (x-y+a)dF(y)+ \int _{x}^{\infty }\phi (a)dF(y)\\= & {} \int _{0}^{x}\phi (x-y+a)dF(y)+ \phi (a)(1-F(x)) \end{aligned}$$

depends continuously on \((x,a)\). Hence, the model satisfies assumptions (W2.1)–(W2.3). Therefore, by Theorem 5.1, there exists a bounded upper semicontinuous solution to the Bellman equation and an optimal stationary policy \(f^*\in F.\)
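A short sketch of this transition integral is given below. The exponential demand distribution and its mean are illustrative assumptions; the model only requires a continuous distribution function with finite expectation.

```python
import numpy as np
from scipy import stats

# Numerical evaluation of the transition integral in the inventory model:
# int_X phi(y) q(dy|x,a) = int_0^x phi(x - y + a) dF(y) + phi(a)(1 - F(x)).
# The exponential demand distribution F is an illustrative assumption.

F = stats.expon(scale=2.0)                     # demand Delta with E[Delta] = 2

def expected_phi(phi, x, a, n=10_000):
    y, dy = np.linspace(0.0, x, n, retstep=True)
    inner = np.sum(phi(x - y + a) * F.pdf(y)) * dy   # Riemann sum over [0, x]
    return inner + phi(a) * (1.0 - F.cdf(x))

# The value varies continuously in (x, a), as required by (W2.3):
print(expected_phi(np.cos, 3.0, 1.0), expected_phi(np.cos, 3.001, 1.0))
```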

Example 7.4

(Stopping problem) We now describe a stopping problem with non-linear discounting. Suppose the Borel state space is X and there is an (uncontrolled) Markov chain with an initial distribution \(q_0\) and transition probability \(q(\cdot |x).\) By \({\mathbb {P}}\) we denote the probability measure on the product space \(X^\infty \) of all trajectories of the Markov chain induced by \(q_0\) and q. At each period the controller has to decide whether to stop the process and receive the reward R(x), where x is the current state, or to continue. In the latter case the reward C(x) (which might be a negative cost) is received. The aim is to find a stopping time such that the recursive discounted reward is maximised. We assume here that the controller has to stop with probability one. This problem is a special case of the more general model of Sect. 2. We have to choose here

(i):

\(X\cup \{\infty \}\) is the state space where \(\infty \) is an absorbing state which indicates that the process is already stopped,

(ii):

\(A:=\{0,1\}\) where \(a=0\) means continue and \(a= 1\) means stop,

(iii):

\(A(x):= A\) for all \(x\in X\cup \{\infty \}\),

(iv):

\({q}(B | x,0):=q(B|x)\) for \(x\in X\) and Borel sets \(B\subset X,\) \(q(\{\infty \} | x,1)=1\) for \(x\in X\), and \(q(\{\infty \}|\infty ,\cdot )=1\),

(v):

\({u}(x,a) := C(x)(1-a)+ R(x) a \) for \(x\in X\) and \(u(\infty ,\cdot )=0\).

We assume now that \(|C(x)|\le \omega (x)\) and \(|R(x)|\le \omega (x),\) which implies (A), and we assume (B). The optimisation problem is considered with the recursive discounted utility, with the interpretation that rewards (costs) are no longer received after a random time. This random time is a stopping time with respect to the filtration generated by the observable states (for more details see Chapter 10 in [3]).

Proposition 7.5

If the above assumptions on the stopping problem are satisfied, then

  1. (a)

    there exists a function \(v^*\in {{\mathcal {M}}}^d_b(X)\) such that

    $$\begin{aligned} v^*(x)=\max \left\{ R(x); \; C(x) +\int _X \delta (v^*(y))q(dy|x)\right\} . \end{aligned}$$
    (7.1)
  2. (b)

Moreover, define \(f^*(x)=1\) if \(v^*(x)=R(x)\) and \(f^*(x)=0\) otherwise, and let \(\tau ^* := \inf \{n\in {\mathbb {N}} : f^*(x_n)=1\}\). If \({\mathbb {P}}(\tau ^*<\infty )=1,\) then \(\tau ^*\) is optimal for the stopping problem and \(v^*(x)=\sup _{\pi \in \varPi }U(x,\pi ).\)

Proof

The action space A consists of only two elements, so assumptions (S) are satisfied. In the above description we have already assumed (A) and (B). Therefore, the result follows from Theorem 5.1. \(\square \)

Let us consider a special example. Imagine a person who wants to sell her house. At the beginning of each week she receives an offer, which is randomly distributed over the interval \([m,M]\) with \(0<m<M\). The offers are independent and identically distributed with distribution q. The house seller has to decide immediately whether to accept or reject this offer. If she rejects, the offer is lost and she incurs a maintenance cost. Which offer should she accept in order to maximise her expected reward?

Here, we have \(X:=[m,M]\), \(C(x)\equiv -c\) and \(R(x):=x\). From Proposition 7.5 we obtain that the value function satisfies

$$\begin{aligned} v^*(x)=\max \left\{ x; -c +\int _{[m,M]} \delta (v^*(y))q(dy)\right\} . \end{aligned}$$

Note that \(C^*:=\int _{[m,M]} \delta (v^*(y))q(dy)\) is obviously a constant independent of x. Thus the optimal strategy is to accept the first offer which is above \(-c+C^*\). The corresponding stopping time is geometrically distributed and thus certainly satisfies \({\mathbb {P}}(\tau ^*<\infty )=1\). Moreover, it is not difficult to see that whenever we have two discount functions \(\delta _1,\delta _2,\) which satisfy assumptions (B) and which are ordered, i.e., \(\delta _1\le \delta _2\), then \(C_1^* \le C_2^*\) because the operator T is monotone. Thus, with stricter discounting we will stop earlier.
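For a concrete feel, \(C^*\) can be approximated by iterating the map \(C\mapsto \int _{[m,M]}\delta (\max \{y,-c+C\})q(dy),\) which is monotone and has \(C^*\) as its fixed point. In the sketch below, the uniform offer distribution and all parameter values are illustrative assumptions.

```python
import numpy as np

# Fixed-point iteration for the threshold constant C* in the house-selling
# example. The uniform distribution of offers on [m, M] and the parameter
# values are assumptions for demonstration only.

rng = np.random.default_rng(0)
m, M, c, eps = 1.0, 2.0, 0.05, 0.5
delta = lambda z: (1 - eps) * z + eps * np.log1p(z)

offers = rng.uniform(m, M, size=200_000)       # Monte Carlo sample of q
C = 0.0
for _ in range(1000):
    C_new = float(np.mean(delta(np.maximum(offers, -c + C))))
    if abs(C_new - C) < 1e-10:
        break
    C = C_new

print("accept the first offer above", -c + C)  # optimal threshold
```

Replacing \(\delta \) by a pointwise smaller discount function lowers the computed constant and hence the acceptance threshold, in line with the monotonicity observation above.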