Abstract

The main objective of multiagent reinforcement learning is to achieve a globally optimal policy. However, it is difficult to evaluate the value function when the state space is high-dimensional. Therefore, we transform the policy evaluation problem of multiagent reinforcement learning into a distributed optimization problem with a consensus constraint. In this problem, all agents share the space of states and actions, but each agent only obtains its own local reward. We then propose a distributed optimization algorithm with fractional order dynamics to solve this problem. Moreover, we prove the convergence of the proposed algorithm and illustrate its effectiveness with a numerical example.

1. Introduction

In recent years, reinforcement learning [1] has received much attention from the research community and has achieved remarkable success in areas such as machine learning and artificial intelligence [2]. In reinforcement learning, an agent determines the optimal strategy through the feedback of rewards obtained by constantly interacting with the environment; the policy is a function that maps states to actions. Although reinforcement learning has made great achievements in the single-agent setting, its application to multiagent systems remains challenging [3]. The goal of a multiagent system is to enable several agents, each with simple intelligence that is easy to manage and control, to realize complex intelligence through mutual cooperation. While reducing the complexity of system modeling, such cooperation should also improve the robustness, reliability, and flexibility of the system [4, 5].

The objective of this paper is to investigate multiagent reinforcement learning (MARL), where each agent exchanges information with its neighbors over a network [6]. All agents share the state space and the joint action, but each agent only observes its own local reward. The purpose of MARL is to determine the globally optimal policy. One feasible way is to construct a central controller, with which every agent must exchange information [7] and which makes decisions for all of them. However, as the state dimension increases, the computational burden on the central controller becomes extremely heavy, and the whole system would collapse if the central controller were attacked.

Therefore, we try to replace the centralized scheme mentioned above with distributed control [8, 9]. A properly designed consensus protocol enables all agents to reach the same state [10–13]. In [14], Zhang et al. proposed a continuous-time distributed version of the gradient algorithm. As far as we know, most gradient methods use integer order iterations. In fact, fractional calculus has been developed for over 300 years and has been used to solve many kinds of problems in areas such as control applications and systems theory [15–17]. Compared with traditional integer order algorithms, the fractional order algorithm has more design freedom and the potential to obtain better convergence performance [18, 19].

The contributions of this paper are listed as follows:
(1) We transform the multiagent policy evaluation problem into a distributed optimization problem with a consensus constraint.
(2) We construct the fractional order dynamics and prove the convergence of the algorithm.
(3) We provide a numerical example to verify the superiority of the proposed fractional order algorithm.

The rest of this paper is organized as follows. Section 2 introduces the problem formulation of MARL and the preliminaries of fractional order calculus. Section 3 transforms the multiagent policy evaluation problem into an optimization problem with a consensus constraint, proposes an algorithm with fractional order dynamics, and proves that the algorithm asymptotically converges to the exact solution. Section 4 presents a simulation example, and we summarize the work in Section 5.

2. Problem Formulation

2.1. Notations

Let $\mathbb{R}$, $\mathbb{R}^{n}$, and $\mathbb{R}^{n \times m}$ represent the real number set, the n-dimensional real column vector set, and the n × m real matrix set, respectively. $A^{T}$ represents the transpose of $A$. The tuple $(\mathcal{S}, \mathcal{A}, P, \{r^{i}\}_{i=1}^{N}, \gamma)$ represents a multiagent Markov decision process (MDP), where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the joint action space. $P(s' \mid s, a)$ is the probability of transition from $s$ to $s'$ when the agents take the joint action $a$, $r^{i}(s, a)$ is the local reward when agent $i$ takes joint action $a$ at state $s$, and $\gamma \in (0, 1)$ is a discount parameter. $\pi(a \mid s)$ represents the conditional probability that the agents take joint action $a$ at state $s$. The reward function of agent $i$ when it follows a joint policy $\pi$ at state $s$ is defined as follows:
$$R^{i}_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r^{i}(s, a), \tag{1}$$
where the right-hand side of the equation averages over all possible choices of the joint action $a$. We then take the expected value over all agents' local rewards:
$$\bar{R}_{\pi}(s) = \frac{1}{N} \sum_{i=1}^{N} R^{i}_{\pi}(s), \tag{2}$$
where $\bar{R}_{\pi}(s)$ represents the average of the local rewards.
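
For illustration, the following Python sketch evaluates (1) and (2) for a single fixed state; the numbers of agents, actions, and the rewards are hypothetical and only serve as an example.

```python
# Illustrative evaluation of (1) and (2) for one fixed state s.
import numpy as np

n_agents, n_actions = 4, 3
pi_s = np.array([0.2, 0.5, 0.3])                               # pi(a | s) over joint actions
r_s = np.random.default_rng(0).random((n_agents, n_actions))   # local rewards r^i(s, a)

R_i = r_s @ pi_s      # R^i_pi(s) = sum_a pi(a|s) r^i(s,a), equation (1)
R_bar = R_i.mean()    # global reward as the average over agents, equation (2)
print(R_i, R_bar)
```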

2.2. Graph Theory

The graph is expressed as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{G}$ represents the graph, $\mathcal{V}$ is the set of vertices, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges in $\mathcal{G}$. If every edge in the graph is undirected, the graph is called an undirected graph [20]. In the graph, $W = [w_{ij}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix with $w_{ij} > 0$ if $(j, i) \in \mathcal{E}$ and $w_{ij} = 0$ otherwise. $D = \operatorname{diag}(d_{1}, \ldots, d_{N})$ is the degree matrix with $d_{i} = \sum_{j=1}^{N} w_{ij}$, and the Laplacian matrix is $L = D - W$. Moreover, if the graph is connected, $L$ has the following two properties:
(1) The Laplacian matrix $L$ is a positive semidefinite matrix.
(2) The minimum eigenvalue is 0 because the sum of every row of the Laplacian matrix is 0.

The minimum nonzero eigenvalue is defined as the algebraic connectivity of the graph.
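
The following Python sketch, which assumes an illustrative 4-vertex ring topology rather than any graph used later in the paper, numerically checks the two Laplacian properties above and reads off the algebraic connectivity.

```python
# Laplacian of an undirected 4-vertex ring (hypothetical topology) and a numerical
# check of its two properties plus the algebraic connectivity.
import numpy as np

W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # adjacency matrix
D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # Laplacian matrix

eigvals = np.sort(np.linalg.eigvalsh(L))
print(np.allclose(L.sum(axis=1), 0.0))      # every row sums to 0
print(eigvals[0] >= -1e-12)                 # L is positive semidefinite
print(eigvals[1])                           # algebraic connectivity (smallest nonzero eigenvalue)
```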

Assumption 1. The undirected graph mentioned in the following text is connected.

Lemma 1 (see [21]). For a fractional order system $D^{\alpha} x(t) = f(x(t))$, where $\alpha \in (0, 1)$, the frequency distributed model is defined as follows:
$$\frac{\partial z(\omega, t)}{\partial t} = -\omega z(\omega, t) + f(x(t)), \qquad x(t) = \int_{0}^{\infty} \mu_{\alpha}(\omega) z(\omega, t)\, d\omega, \tag{3}$$
where $\mu_{\alpha}(\omega) = \frac{\sin(\alpha \pi)}{\pi}\, \omega^{-\alpha}$.

Definition 1 (see [22]). The $\alpha$th order Caputo derivative of $f(t)$ is
$$D^{\alpha} f(t) = \frac{1}{\Gamma(n - \alpha)} \int_{t_{0}}^{t} \frac{f^{(n)}(\tau)}{(t - \tau)^{\alpha - n + 1}}\, d\tau, \tag{4}$$
where $n - 1 < \alpha < n$, $\Gamma(\cdot)$ is the Gamma function, and $f^{(n)}$ is the $n$th order derivative of $f(t)$.
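
Since the algorithm developed below evolves according to Caputo derivatives, a numerical illustration can help. The following sketch approximates $D^{\alpha} t^{2}$ with the Grünwald–Letnikov scheme, which is one common discretization and coincides with the Caputo derivative here because the test function vanishes at the origin; the step size and test function are arbitrary illustrative choices, not taken from the paper.

```python
# Grünwald–Letnikov approximation of a fractional derivative; it matches the
# Caputo derivative of f(t) = t^2 since f(0) = 0.
import numpy as np
from math import gamma

def gl_fractional_derivative(f_vals, alpha, h):
    """Approximate D^alpha f on the uniform grid t_k = (k + 1) * h."""
    n = len(f_vals)
    c = np.ones(n)
    for k in range(1, n):
        c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)    # c_k = (-1)^k * binom(alpha, k)
    out = np.zeros(n)
    for i in range(n):
        out[i] = np.dot(c[:i + 1], f_vals[i::-1]) / h ** alpha
    return out

alpha, h = 0.7, 1e-3
t = np.arange(1, 2001) * h
numeric = gl_fractional_derivative(t ** 2, alpha, h)
exact = 2.0 / gamma(3.0 - alpha) * t ** (2.0 - alpha)  # closed-form Caputo derivative of t^2
print(np.max(np.abs(numeric - exact)[100:]))           # discretization error, shrinks with h
```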

2.3. Policy Evaluation

To measure the benefit of the agents in the current state, we establish the following value function, which represents the cumulative return obtained by the agents starting from the state $s$ and adopting a certain policy $\pi$:
$$V_{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \bar{R}_{\pi}(s_{t}) \,\middle|\, s_{0} = s\right]. \tag{5}$$

We construct the Bellman equation based on $V_{\pi}$ and $\bar{R}_{\pi}$:
$$V_{\pi}(s) = \bar{R}_{\pi}(s) + \gamma \sum_{s' \in \mathcal{S}} P_{\pi}(s' \mid s)\, V_{\pi}(s'), \tag{6}$$
where $P_{\pi}(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)$.
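
When the state space is small, (6) can be solved directly as a linear system, as the following sketch illustrates with a synthetic transition matrix and reward vector (the numbers are illustrative only).

```python
# Direct solution of the Bellman equation (6) for a small synthetic MDP:
# V = (I - gamma * P_pi)^{-1} R_bar.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma_ = 20, 0.95
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)     # row-stochastic transition matrix under pi
R_bar = rng.random(n_states)                # averaged reward of every state

V = np.linalg.solve(np.eye(n_states) - gamma_ * P_pi, R_bar)
print(np.allclose(V, R_bar + gamma_ * P_pi @ V))   # the Bellman equation holds
```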

It is difficult to evaluate $V_{\pi}$ directly if the dimension of the state space is very large. Therefore, we use $V_{\theta}(s) = \phi(s)^{T}\theta$ to approximate $V_{\pi}(s)$, where $\theta \in \mathbb{R}^{d}$ is the parameter vector and $\phi(s) \in \mathbb{R}^{d}$ is a feature function for state $s$. Indeed, solving equation (6) is equivalent to obtaining the vector $\theta$ such that $V_{\theta} \approx V_{\pi}$. In other words, it means minimizing the mean square projected Bellman error with respect to $\theta$, where $\Phi$ stacks the feature vectors $\phi(s)^{T}$ of all states and $D_{\pi}$ is a diagonal matrix determined by the stationary distribution. We construct the objective as follows:
$$\min_{\theta} J(\theta) = \left\| \Pi\left(\bar{R}_{\pi} + \gamma P_{\pi}\Phi\theta - \Phi\theta\right) \right\|_{D_{\pi}}^{2} + \rho\|\theta\|^{2}, \tag{7}$$
where $\rho > 0$ is a regularization parameter and $\Pi$ is the projection operator onto the column subspace of $\Phi$. It is not difficult to rewrite $J(\theta)$ by substituting the linear parameterization into (7):
$$\min_{\theta} J(\theta) = \|A\theta - b\|_{C^{-1}}^{2} + \rho\|\theta\|^{2}, \tag{8}$$
where $A = \mathbb{E}\left[\phi(s)\left(\phi(s) - \gamma\phi(s')\right)^{T}\right]$, $b = \mathbb{E}\left[\bar{R}_{\pi}(s)\,\phi(s)\right]$, $C = \mathbb{E}\left[\phi(s)\phi(s)^{T}\right]$, and $\|x\|_{C^{-1}}^{2} = x^{T}C^{-1}x$.

The minimizer of $J(\theta)$ in equation (8) is unique if $A$ is a full rank matrix and $C$ is a positive definite matrix. In practice, it is difficult to obtain the expectations in this compact form when the distribution is unknown. We therefore replace the expectations with sample averages over a trajectory $\{(s_{t}, a_{t}, r^{i}_{t+1}, s_{t+1})\}_{t=1}^{p}$ as follows:
$$\hat{A} = \frac{1}{p}\sum_{t=1}^{p} \phi(s_{t})\left(\phi(s_{t}) - \gamma\phi(s_{t+1})\right)^{T}, \quad \hat{b}^{i} = \frac{1}{p}\sum_{t=1}^{p} r^{i}_{t+1}\,\phi(s_{t}), \quad \hat{C} = \frac{1}{p}\sum_{t=1}^{p} \phi(s_{t})\phi(s_{t})^{T}, \tag{9}$$
where $\hat{A}$, $\hat{b}^{i}$, and $\hat{C}$ are the empirical counterparts of $A$, $b$, and $C$, and $\hat{b} = \frac{1}{N}\sum_{i=1}^{N}\hat{b}^{i}$.

We assume that the sample size $p$ approaches infinity to ensure the confidence level of these estimates, and in the sampled sequence each state is visited at least once. Then, we reconstruct equation (8) as follows:
$$\min_{\theta} \hat{J}(\theta) = \|\hat{A}\theta - \hat{b}\|_{\hat{C}^{-1}}^{2} + \rho\|\theta\|^{2}. \tag{10}$$

Noteworthy, in the shared space, each agent observes the states and joint actions of its neighbors but only observes its own local reward. In other words, every agent obtains $\hat{A}$ and $\hat{C}$ but only $\hat{b}^{i}$ rather than $\hat{b}$. Thus, we define the local objective of agent $i$ as $f_{i}(\theta) = \|\hat{A}\theta - \hat{b}^{i}\|_{\hat{C}^{-1}}^{2} + \rho\|\theta\|^{2}$. Then, we rewrite equation (10) as follows:
$$\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta). \tag{11}$$
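
The following Python sketch illustrates how the sample estimates in (9) and a centralized reference solution of (10) could be computed; the MDP, feature map, and rewards are synthetic stand-ins rather than the paper's experimental setup.

```python
# Sample estimates of equation (9) and the closed-form minimizer of (10) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_states, d, n_agents, gamma_, rho, p = 20, 5, 4, 0.95, 0.1, 5000

Phi = rng.standard_normal((n_states, d))          # feature phi(s) of every state
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)           # transition matrix under the joint policy
R = rng.random((n_agents, n_states))              # expected local reward r^i(s) under the policy

A_hat = np.zeros((d, d))
C_hat = np.zeros((d, d))
b_hat = np.zeros((n_agents, d))
s = rng.integers(n_states)
for _ in range(p):
    s_next = rng.choice(n_states, p=P_pi[s])
    phi, phi_next = Phi[s], Phi[s_next]
    A_hat += np.outer(phi, phi - gamma_ * phi_next) / p
    C_hat += np.outer(phi, phi) / p
    b_hat += np.outer(R[:, s], phi) / p           # agent i only accumulates its own reward
    s = s_next

b_bar = b_hat.mean(axis=0)                        # average of the local b_hat^i, as in (10)
C_inv = np.linalg.inv(C_hat)
# Closed-form minimizer of (10), useful as a centralized reference solution.
theta_ref = np.linalg.solve(A_hat.T @ C_inv @ A_hat + rho * np.eye(d),
                            A_hat.T @ C_inv @ b_bar)
print(theta_ref)
```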

3. Fractional Order Dynamics for Policy Evaluation

Hereinbefore, the aim of policy evaluation becomes the minimization of the objective function. Now, we rewrite (11) in a distributed form, where each agent $i$ holds a local copy $\theta^{i}$:
$$\min_{\theta^{1}, \ldots, \theta^{N}} \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta^{i}), \quad \text{s.t.} \;\; \theta^{1} = \theta^{2} = \cdots = \theta^{N}. \tag{12}$$

We define $\tilde{\theta}$ as a vector concatenating all $\theta^{i}$, $\tilde{\theta} = [(\theta^{1})^{T}, \ldots, (\theta^{N})^{T}]^{T}$, and the aggregate function as $F(\tilde{\theta}) = \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta^{i})$. As is well known, the consensus constraint in (12) can be expressed as
$$\tilde{L}\tilde{\theta} = 0, \tag{13}$$
where $\tilde{L} = L \otimes I_{d}$, $L$ is the Laplacian matrix of the communication graph, and $I_{d}$ is the $d \times d$ identity matrix. Based on (13), we formulate the following augmented Lagrangian:
$$\mathcal{L}(\tilde{\theta}, \lambda) = F(\tilde{\theta}) + \lambda^{T}\tilde{L}\tilde{\theta} + \frac{1}{2}\tilde{\theta}^{T}\tilde{L}\tilde{\theta}, \tag{14}$$
where $\lambda$ is the Lagrange multiplier.
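
A small sketch of (13) and (14), assuming a hypothetical 4-agent ring and the quadratic penalty form written in (14): the Kronecker-structured Laplacian $\tilde{L}$ annihilates exactly the consensual vectors, so the constraint and the penalty both vanish at consensus.

```python
# Consensus constraint (13) and augmented Lagrangian (14) with L_tilde = kron(L, I_d).
import numpy as np

N, d = 4, 5
L = np.array([[ 2, -1,  0, -1],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [-1,  0, -1,  2]], dtype=float)      # Laplacian of a 4-agent ring
L_tilde = np.kron(L, np.eye(d))

def lagrangian(theta_tilde, lam, F):
    """Evaluate (14): F(theta) + lam^T L_tilde theta + 0.5 * theta^T L_tilde theta."""
    return F(theta_tilde) + lam @ (L_tilde @ theta_tilde) \
        + 0.5 * theta_tilde @ (L_tilde @ theta_tilde)

def F_demo(th):
    return float(th @ th)                          # placeholder smooth objective for the demo

theta_consensus = np.tile(np.ones(d), N)           # identical local copies theta^i
print(np.allclose(L_tilde @ theta_consensus, 0))   # constraint (13) holds at consensus
print(lagrangian(theta_consensus, np.zeros(N * d), F_demo))   # equals F_demo at consensus
```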

It is feasible to design a fractional order continuous-time optimization algorithm from the primal-dual viewpoint via (14): gradient descent for the primal variable $\tilde{\theta}$ and gradient ascent for the dual variable $\lambda$. Both of them are updated according to the fractional order law
$$D^{\alpha}\tilde{\theta}(t) = -\nabla_{\tilde{\theta}}\mathcal{L}(\tilde{\theta}(t), \lambda(t)) = -\nabla F(\tilde{\theta}(t)) - \tilde{L}\tilde{\theta}(t) - \tilde{L}\lambda(t), \qquad D^{\alpha}\lambda(t) = \nabla_{\lambda}\mathcal{L}(\tilde{\theta}(t), \lambda(t)) = \tilde{L}\tilde{\theta}(t), \tag{15}$$
where $\alpha$ is the fractional order and $\nabla_{\tilde{\theta}}\mathcal{L}$ and $\nabla_{\lambda}\mathcal{L}$ are the gradients of $\mathcal{L}$ with respect to $\tilde{\theta}$ and $\lambda$, respectively. We express the details of (15) in Algorithm 1.

Algorithm 1: Distributed policy evaluation with fractional order dynamics.
Initialization: $\theta^{i}(0)$, $\lambda^{i}(0)$ for each agent $i$, and the fractional order $\alpha$.
Update:
  For each agent $i = 1, \ldots, N$ (in parallel):
   $D^{\alpha}\theta^{i}(t) = -\frac{1}{N}\nabla f_{i}(\theta^{i}(t)) - \sum_{j=1}^{N} w_{ij}\left(\theta^{i}(t) - \theta^{j}(t)\right) - \sum_{j=1}^{N} w_{ij}\left(\lambda^{i}(t) - \lambda^{j}(t)\right)$
   $D^{\alpha}\lambda^{i}(t) = \sum_{j=1}^{N} w_{ij}\left(\theta^{i}(t) - \theta^{j}(t)\right)$
  End
 Return $\theta^{i}(t)$

The aim of the distributed algorithm is to obtain the solution of the value function evaluation problem. The proposed algorithm has more design freedom and more potential to achieve better convergence performance than the conventional integer order one. Hereinafter, we provide the following convergence conclusion.
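
As a concrete illustration of Algorithm 1, the following Python sketch discretizes the fractional order dynamics (15) with the Grünwald–Letnikov scheme on synthetic data; the topology, step size, horizon, and all numerical values are assumptions for illustration and do not come from the paper.

```python
# Grünwald–Letnikov discretization of the fractional order primal-dual dynamics (15).
# Agents are stacked into one vector, so one update of theta/lam performs the
# per-agent updates of Algorithm 1 simultaneously.
import numpy as np

rng = np.random.default_rng(1)
N, d, alpha, h, steps, rho = 4, 5, 0.9, 0.01, 4000, 0.1

# Synthetic shared data A_hat, C_hat and local data b_hat^i.
M = rng.standard_normal((d, d))
A_hat = M @ M.T + 0.5 * np.eye(d)
C_inv = np.linalg.inv(A_hat)            # take C_hat = A_hat for simplicity
b_hat = rng.standard_normal((N, d))

L = np.array([[ 2, -1,  0, -1],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [-1,  0, -1,  2]], dtype=float)   # Laplacian of a 4-agent ring
L_tilde = np.kron(L, np.eye(d))

def grad_F(theta_tilde):
    """Gradient of F(theta) = (1/N) sum_i (||A theta^i - b^i||^2_{C^{-1}} + rho ||theta^i||^2)."""
    Theta = theta_tilde.reshape(N, d)
    residual = Theta @ A_hat.T - b_hat           # row i holds A theta^i - b^i
    return (2.0 * (residual @ C_inv @ A_hat + rho * Theta) / N).ravel()

# Grünwald–Letnikov coefficients c_k = (-1)^k * binom(alpha, k).
c = np.ones(steps + 1)
for k in range(1, steps + 1):
    c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)

theta = np.zeros((steps + 1, N * d))     # theta[t] stacks the local copies theta^i
lam = np.zeros((steps + 1, N * d))
for t in range(steps):
    d_theta = -grad_F(theta[t]) - L_tilde @ theta[t] - L_tilde @ lam[t]   # primal descent
    d_lam = L_tilde @ theta[t]                                            # dual ascent
    w = c[1:t + 2][::-1]                 # history weights for steps 0, ..., t
    theta[t + 1] = h ** alpha * d_theta - w @ theta[:t + 1]
    lam[t + 1] = h ** alpha * d_lam - w @ lam[:t + 1]

Theta_final = theta[-1].reshape(N, d)
print(np.max(np.abs(Theta_final - Theta_final.mean(axis=0))))   # consensus error across agents
```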

Theorem 1. Under Assumption 1, let $\tilde{\theta}(t)$ and $\lambda(t)$ be generated according to Algorithm 1. If $\alpha \in (0, 1)$, then $\tilde{\theta}(t)$ asymptotically converges to the optimal solution.

Proof. We obtain the detailed dynamics of $\tilde{\theta}$ and $\lambda$:
$$D^{\alpha}\tilde{\theta} = -\nabla F(\tilde{\theta}) - \tilde{L}\tilde{\theta} - \tilde{L}\lambda, \qquad D^{\alpha}\lambda = \tilde{L}\tilde{\theta}, \tag{16}$$
where $\tilde{L} = L \otimes I$ and $I$ is an identity matrix. We consider the equilibrium $(\tilde{\theta}^{*}, \lambda^{*})$ of (16):
$$0 = -\nabla F(\tilde{\theta}^{*}) - \tilde{L}\tilde{\theta}^{*} - \tilde{L}\lambda^{*}, \qquad 0 = \tilde{L}\tilde{\theta}^{*}. \tag{17}$$
Then, we combine (16) and (17), and according to the facts $D^{\alpha}\tilde{\theta}^{*} = 0$ and $D^{\alpha}\lambda^{*} = 0$ (the Caputo derivative of a constant is zero),
$$D^{\alpha}(\tilde{\theta} - \tilde{\theta}^{*}) = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha}(\lambda - \lambda^{*}) = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{18}$$
Through Lemma 1, we reconstruct (18) as follows:
$$\frac{\partial z_{\theta}(\omega, t)}{\partial t} = -\omega z_{\theta} - \left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad \tilde{\theta} - \tilde{\theta}^{*} = \int_{0}^{\infty}\mu_{\alpha}(\omega) z_{\theta}\, d\omega, \tag{19}$$
and
$$\frac{\partial z_{\lambda}(\omega, t)}{\partial t} = -\omega z_{\lambda} + \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}), \qquad \lambda - \lambda^{*} = \int_{0}^{\infty}\mu_{\alpha}(\omega) z_{\lambda}\, d\omega. \tag{20}$$
We construct the Lyapunov function as follows:
$$V(t) = \frac{1}{2}\int_{0}^{\infty}\mu_{\alpha}(\omega)\left(\|z_{\theta}(\omega, t)\|^{2} + \|z_{\lambda}(\omega, t)\|^{2}\right)d\omega. \tag{21}$$
Then,
$$\dot{V}(t) = -\int_{0}^{\infty}\mu_{\alpha}(\omega)\,\omega\left(\|z_{\theta}\|^{2} + \|z_{\lambda}\|^{2}\right)d\omega - (\tilde{\theta} - \tilde{\theta}^{*})^{T}\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - (\tilde{\theta} - \tilde{\theta}^{*})^{T}\tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) \leq 0, \tag{22}$$
where the last two terms are nonpositive because $F$ is convex and $\tilde{L}$ is positive semidefinite. We obtain the result according to the LaSalle invariance principle.
Hereinafter, we improve the convergence conclusion of Theorem 1 by extending $\alpha$ from (0, 1) to (1, 2).

Theorem 2. Under Assumption 1, let $\tilde{\theta}(t)$ and $\lambda(t)$ be generated according to Algorithm 1. If $\alpha \in (1, 2)$, then $\tilde{\theta}(t)$ asymptotically converges to the optimal solution.

Proof. Under the condition $\alpha \in (1, 2)$, we rewrite the error dynamics of Theorem 1 as follows:
$$D^{\alpha}(\tilde{\theta} - \tilde{\theta}^{*}) = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha}(\lambda - \lambda^{*}) = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{23}$$
Due to $\alpha - 1 \in (0, 1)$ and the Caputo property $D^{\alpha}x(t) = D^{\alpha-1}\dot{x}(t)$,
$$D^{\alpha-1}\dot{\tilde{\theta}} = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha-1}\dot{\lambda} = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{24}$$
Under the condition of (23) and (24), we obtain the frequency distributed model by Lemma 1 as follows:
$$\frac{\partial z_{\theta}(\omega, t)}{\partial t} = -\omega z_{\theta} - \left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \quad \frac{\partial z_{\lambda}(\omega, t)}{\partial t} = -\omega z_{\lambda} + \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}), \quad \dot{\tilde{\theta}} = \int_{0}^{\infty}\mu_{\alpha-1}(\omega) z_{\theta}\, d\omega, \quad \dot{\lambda} = \int_{0}^{\infty}\mu_{\alpha-1}(\omega) z_{\lambda}\, d\omega. \tag{25}$$
We then construct a Lyapunov function for the distributed states $z_{\theta}$ and $z_{\lambda}$ analogous to (21) and show that its time derivative is nonpositive along (25). Through the LaSalle invariance principle, we obtain the result.

4. Experimental Simulation

In this section, we provide an example to illustrate the effectiveness of the proposed algorithm. There are 20 states in the multiagent reinforcement learning problem, and there are 4 agents in the connected network shown in Figure 1. We fix the regularization parameter $\rho$ and the discount parameter $\gamma$. The feature of each state is a randomly generated 5-dimensional column vector, each dimension of which is given by a cosine function, and $P$ is a randomly generated 5-dimensional matrix.

Then, we randomly generate the corresponding matrices.

Before the simulation, it is necessary to obtain the exact solution of the multiagent policy evaluation problem as a reference.

We compare the fractional order algorithm with the conventional integer order one. In Figures 2 and 3, the fractional order curve illustrates almost the same convergence performance as the conventional integer order one when $\alpha$ is 0.995. In Figures 4 and 5, the fractional order algorithm achieves a faster convergence rate than the integer order algorithm. The simulation results illustrate the convergence of both the integer order and the fractional order algorithms. Furthermore, the proposed distributed algorithm with fractional order dynamics has more design freedom to achieve better performance than the conventional first-order algorithm.
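
As a rough, self-contained illustration of the role of the order $\alpha$ (this is a toy scalar system, not the paper's experiment), the following sketch discretizes $D^{\alpha}x = -kx$ for several values of $\alpha$ and prints the transient and long-run magnitudes.

```python
# Toy scalar comparison: D^alpha x = -k x, discretized with the Grünwald–Letnikov scheme.
# It only illustrates that the order alpha is an extra design parameter shaping the
# convergence profile; all numbers are arbitrary.
import numpy as np

def simulate(alpha, k=5.0, h=0.01, steps=2000, x0=1.0):
    c = np.ones(steps + 1)
    for j in range(1, steps + 1):
        c[j] = c[j - 1] * (1.0 - (alpha + 1.0) / j)
    x = np.zeros(steps + 1)
    x[0] = x0
    for t in range(steps):
        w = c[1:t + 2][::-1]                                   # weights for x_0, ..., x_t
        x[t + 1] = x0 + h ** alpha * (-k * x[t]) - w @ (x[:t + 1] - x0)
    return x

for a in (1.0, 0.9, 0.8):
    traj = simulate(a)
    print(a, abs(traj[500]), abs(traj[-1]))    # transient and long-run magnitudes
```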

5. Conclusion

In this paper, the value function evaluation problem of multiagent reinforcement learning was transformed into a distributed optimization problem with a consensus constraint. Then, we proposed a distributed algorithm with fractional order dynamics to solve this problem. Moreover, we proved the asymptotic convergence of the algorithm via Lyapunov functions and illustrated the effectiveness of the proposed algorithm with a numerical example. In the future, we will consider applying reinforcement learning to recommendation systems to obtain better results [23].

Data Availability

The .m and .slx data used to support the findings of this study have been deposited in the Github repository (97weiD/data_DPEFOD).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61973002 and 61902104), and the Anhui Provincial Natural Science Foundation (2008085J32 and 2008085QF295).