Abstract

The main objective of multiagent reinforcement learning is to achieve a globally optimal policy. However, it is difficult to evaluate the value function when the state space is high-dimensional. Therefore, we transform the policy evaluation problem of multiagent reinforcement learning into a distributed optimization problem with a consensus constraint. In this problem, all agents share the space of states and actions, but each agent only obtains its own local reward. We then propose a distributed optimization algorithm with fractional order dynamics to solve this problem. Moreover, we prove the convergence of the proposed algorithm and illustrate its effectiveness with a numerical example.

1. Introduction

In recent years, reinforcement learning [1] has received much attention from the research community and has achieved remarkable success in areas such as machine learning and artificial intelligence [2]. In reinforcement learning, an agent determines the optimal strategy through the feedback of rewards obtained by constantly interacting with the environment; the policy is a function that maps states to actions. Although reinforcement learning has made great achievements in the single-agent setting, its application to multiagent systems remains challenging [3]. The goal of a multiagent system is to enable several agents, each with simple intelligence that is easy to manage and control, to realize complex intelligence through mutual cooperation. While reducing the complexity of system modeling, such cooperation should also improve the robustness, reliability, and flexibility of the system [4, 5].

The objective of this paper is to investigate multiagent reinforcement learning (MARL), where each agent exchanges information with its neighbors over a network [6]. All agents share the state space and the joint action, but each agent only observes its own local reward. The purpose of MARL is to determine the globally optimal policy. One feasible way is to construct a central controller, with which every agent must exchange information [7] and which makes decisions for all of them. However, as the state dimension increases, the computational burden on the central controller becomes extremely heavy, and the whole system would collapse if the central controller were attacked.

Therefore, we try to replace the centralized scheme mentioned above with distributed control [8, 9]. A properly designed consensus protocol enables all agents to reach the same state [10–13]. In [14], Zhang et al. proposed a continuous-time distributed version of the gradient algorithm. As far as we know, most gradient methods use integer order iterations. In fact, fractional calculus has been developed for over 300 years and has been used to solve many kinds of problems in areas such as control applications and systems theory [15–17]. Compared with traditional integer order algorithms, the fractional order algorithm has more design freedom and the potential to obtain better convergence performance [18, 19].

The contributions of this paper are listed as follows:
(1) We transform the multiagent policy evaluation problem into a distributed optimization problem with a consensus constraint.
(2) We construct the fractional order dynamics and prove the convergence of the algorithm.
(3) We provide a numerical example to verify the superiority of the proposed fractional order algorithm.

The rest of this paper is organized as follows. Section 2 introduces the problem formulation of MARL and the preliminaries of fractional order calculus. Section 3 transforms the multiagent policy evaluation problem into an optimization problem with a consensus constraint, proposes an algorithm with fractional order dynamics, and proves that the algorithm asymptotically converges to the exact solution. Section 4 presents a simulation example, and we summarize the work in Section 5.

2. Problem Formulation

2.1. Notations

Let $\mathbb{R}$, $\mathbb{R}^{n}$, and $\mathbb{R}^{n \times m}$ represent the real number set, the n-dimensional real column vector set, and the n × m real matrix set, respectively. $A^{T}$ represents the transpose of $A$. The tuple $(\mathcal{S}, \mathcal{A}, P, \{r^{i}\}_{i=1}^{N}, \gamma)$ represents a multiagent Markov decision process (MDP), where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the joint action space. $P(s' \mid s, a)$ is the probability of transition from $s$ to $s'$ when the agents take the joint action $a$, $r^{i}(s, a)$ is the local reward when agent $i$ takes joint action $a$ at state $s$, and $\gamma \in (0, 1)$ is a discount parameter. $\pi(a \mid s)$ represents the conditional probability that the agents take joint action $a$ at state $s$. The reward function of agent $i$ when it follows a joint policy $\pi$ at state $s$ is defined as follows:
$$R^{i}_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r^{i}(s, a), \tag{1}$$
where the right-hand side of the equation averages over all possible choices of the joint action $a$. We then take the expected value over all agents' local rewards:
$$\bar{R}_{\pi}(s) = \frac{1}{N} \sum_{i=1}^{N} R^{i}_{\pi}(s), \tag{2}$$
where $\bar{R}_{\pi}(s)$ represents the average of the local rewards.
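
For illustration, the following Python sketch evaluates (1) and (2) for a single fixed state; the numbers of agents, actions, and the rewards are hypothetical and only serve as an example.

```python
# Illustrative evaluation of (1) and (2) for one fixed state s.
import numpy as np

n_agents, n_actions = 4, 3
pi_s = np.array([0.2, 0.5, 0.3])                               # pi(a | s) over joint actions
r_s = np.random.default_rng(0).random((n_agents, n_actions))   # local rewards r^i(s, a)

R_i = r_s @ pi_s      # R^i_pi(s) = sum_a pi(a|s) r^i(s,a), equation (1)
R_bar = R_i.mean()    # global reward as the average over agents, equation (2)
print(R_i, R_bar)
```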

2.2. Graph Theory

The graph is expressed as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{G}$ represents the graph, $\mathcal{V}$ is the set of vertices, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges in $\mathcal{G}$. If every edge in the graph is undirected, the graph is called an undirected graph [20]. In the graph, $W = [w_{ij}] \in \mathbb{R}^{N \times N}$ is the adjacency matrix with $w_{ij} > 0$ if $(j, i) \in \mathcal{E}$ and $w_{ij} = 0$ otherwise. $D = \operatorname{diag}(d_{1}, \ldots, d_{N})$ is the degree matrix with $d_{i} = \sum_{j=1}^{N} w_{ij}$, and the Laplacian matrix is $L = D - W$. Moreover, if the graph is connected, $L$ has the following two properties:
(1) The Laplacian matrix $L$ is a positive semidefinite matrix.
(2) The minimum eigenvalue is 0 because the sum of every row of the Laplacian matrix is 0.

The minimum nonzero eigenvalue is defined as the algebraic connectivity of the graph.
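
The following Python sketch, which assumes an illustrative 4-vertex ring topology rather than any graph used later in the paper, numerically checks the two Laplacian properties above and reads off the algebraic connectivity.

```python
# Laplacian of an undirected 4-vertex ring (hypothetical topology) and a numerical
# check of its two properties plus the algebraic connectivity.
import numpy as np

W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # adjacency matrix
D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # Laplacian matrix

eigvals = np.sort(np.linalg.eigvalsh(L))
print(np.allclose(L.sum(axis=1), 0.0))      # every row sums to 0
print(eigvals[0] >= -1e-12)                 # L is positive semidefinite
print(eigvals[1])                           # algebraic connectivity (smallest nonzero eigenvalue)
```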

Assumption 1. The undirected graph mentioned in the following text is connected.

Lemma 1 (see [21]). For a fractional order system $D^{\alpha} x(t) = f(x(t))$, where $\alpha \in (0, 1)$, the frequency distributed model is defined as follows:
$$\frac{\partial z(\omega, t)}{\partial t} = -\omega z(\omega, t) + f(x(t)), \qquad x(t) = \int_{0}^{\infty} \mu_{\alpha}(\omega) z(\omega, t)\, d\omega, \tag{3}$$
where $\mu_{\alpha}(\omega) = \frac{\sin(\alpha \pi)}{\pi}\, \omega^{-\alpha}$.

Definition 1 (see [22]). The $\alpha$th order Caputo derivative of $f(t)$ is
$$D^{\alpha} f(t) = \frac{1}{\Gamma(n - \alpha)} \int_{t_{0}}^{t} \frac{f^{(n)}(\tau)}{(t - \tau)^{\alpha - n + 1}}\, d\tau, \tag{4}$$
where $n - 1 < \alpha < n$, $\Gamma(\cdot)$ is the Gamma function, and $f^{(n)}$ is the $n$th order derivative of $f(t)$.
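
Since the algorithm developed below evolves according to Caputo derivatives, a numerical illustration can help. The following sketch approximates $D^{\alpha} t^{2}$ with the Grünwald–Letnikov scheme, which is one common discretization and coincides with the Caputo derivative here because the test function vanishes at the origin; the step size and test function are arbitrary illustrative choices, not taken from the paper.

```python
# Grünwald–Letnikov approximation of a fractional derivative; it matches the
# Caputo derivative of f(t) = t^2 since f(0) = 0.
import numpy as np
from math import gamma

def gl_fractional_derivative(f_vals, alpha, h):
    """Approximate D^alpha f on the uniform grid t_k = (k + 1) * h."""
    n = len(f_vals)
    c = np.ones(n)
    for k in range(1, n):
        c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)    # c_k = (-1)^k * binom(alpha, k)
    out = np.zeros(n)
    for i in range(n):
        out[i] = np.dot(c[:i + 1], f_vals[i::-1]) / h ** alpha
    return out

alpha, h = 0.7, 1e-3
t = np.arange(1, 2001) * h
numeric = gl_fractional_derivative(t ** 2, alpha, h)
exact = 2.0 / gamma(3.0 - alpha) * t ** (2.0 - alpha)  # closed-form Caputo derivative of t^2
print(np.max(np.abs(numeric - exact)[100:]))           # discretization error, shrinks with h
```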

2.3. Policy Evaluation

To measure the benefit of the agents in the current state, we establish the following value function, which represents the cumulative return obtained by the agents starting from the state $s$ and adopting a certain policy $\pi$:
$$V_{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \bar{R}_{\pi}(s_{t}) \,\middle|\, s_{0} = s\right]. \tag{5}$$

We construct the Bellman equation based on $V_{\pi}$ and $\bar{R}_{\pi}$:
$$V_{\pi}(s) = \bar{R}_{\pi}(s) + \gamma \sum_{s' \in \mathcal{S}} P_{\pi}(s' \mid s)\, V_{\pi}(s'), \tag{6}$$
where $P_{\pi}(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)$.
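
When the state space is small, (6) can be solved directly as a linear system, as the following sketch illustrates with a synthetic transition matrix and reward vector (the numbers are illustrative only).

```python
# Direct solution of the Bellman equation (6) for a small synthetic MDP:
# V = (I - gamma * P_pi)^{-1} R_bar.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma_ = 20, 0.95
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)     # row-stochastic transition matrix under pi
R_bar = rng.random(n_states)                # averaged reward of every state

V = np.linalg.solve(np.eye(n_states) - gamma_ * P_pi, R_bar)
print(np.allclose(V, R_bar + gamma_ * P_pi @ V))   # the Bellman equation holds
```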

It is difficult to evaluate $V_{\pi}$ directly if the dimension of the state space is very large. Therefore, we use $V_{\theta}(s) = \phi(s)^{T}\theta$ to approximate $V_{\pi}(s)$, where $\theta \in \mathbb{R}^{d}$ is the parameter vector and $\phi(s) \in \mathbb{R}^{d}$ is a feature function for state $s$. Indeed, solving equation (6) is equivalent to obtaining the vector $\theta$ such that $V_{\theta} \approx V_{\pi}$. In other words, it means minimizing the mean square projected Bellman error with respect to $\theta$, where $\Phi$ stacks the feature vectors $\phi(s)^{T}$ of all states and $D_{\pi}$ is a diagonal matrix determined by the stationary distribution. We construct the objective as follows:
$$\min_{\theta} J(\theta) = \left\| \Pi\left(\bar{R}_{\pi} + \gamma P_{\pi}\Phi\theta - \Phi\theta\right) \right\|_{D_{\pi}}^{2} + \rho\|\theta\|^{2}, \tag{7}$$
where $\rho > 0$ is a regularization parameter and $\Pi$ is the projection operator onto the column subspace of $\Phi$. It is not difficult to rewrite $J(\theta)$ by substituting the linear parameterization into (7):
$$\min_{\theta} J(\theta) = \|A\theta - b\|_{C^{-1}}^{2} + \rho\|\theta\|^{2}, \tag{8}$$
where $A = \mathbb{E}\left[\phi(s)\left(\phi(s) - \gamma\phi(s')\right)^{T}\right]$, $b = \mathbb{E}\left[\bar{R}_{\pi}(s)\,\phi(s)\right]$, $C = \mathbb{E}\left[\phi(s)\phi(s)^{T}\right]$, and $\|x\|_{C^{-1}}^{2} = x^{T}C^{-1}x$.

The minimizer of $J(\theta)$ in equation (8) is unique if $A$ is a full rank matrix and $C$ is a positive definite matrix. In practice, it is difficult to obtain the expectations in this compact form when the distribution is unknown. We therefore replace the expectations with sample averages over a trajectory $\{(s_{t}, a_{t}, r^{i}_{t+1}, s_{t+1})\}_{t=1}^{p}$ as follows:
$$\hat{A} = \frac{1}{p}\sum_{t=1}^{p} \phi(s_{t})\left(\phi(s_{t}) - \gamma\phi(s_{t+1})\right)^{T}, \quad \hat{b}^{i} = \frac{1}{p}\sum_{t=1}^{p} r^{i}_{t+1}\,\phi(s_{t}), \quad \hat{C} = \frac{1}{p}\sum_{t=1}^{p} \phi(s_{t})\phi(s_{t})^{T}, \tag{9}$$
where $\hat{A}$, $\hat{b}^{i}$, and $\hat{C}$ are the empirical counterparts of $A$, $b$, and $C$, and $\hat{b} = \frac{1}{N}\sum_{i=1}^{N}\hat{b}^{i}$.

We assume that the sample size $p$ approaches infinity to ensure the confidence level of these estimates, and in the sampled sequence each state is visited at least once. Then, we reconstruct equation (8) as follows:
$$\min_{\theta} \hat{J}(\theta) = \|\hat{A}\theta - \hat{b}\|_{\hat{C}^{-1}}^{2} + \rho\|\theta\|^{2}. \tag{10}$$

Noteworthy, in the shared space, each agent observes the states and joint actions of its neighbors but only observes its own local reward. In other words, every agent obtains $\hat{A}$ and $\hat{C}$ but only $\hat{b}^{i}$ rather than $\hat{b}$. Thus, we define the local objective of agent $i$ as $f_{i}(\theta) = \|\hat{A}\theta - \hat{b}^{i}\|_{\hat{C}^{-1}}^{2} + \rho\|\theta\|^{2}$. Then, we rewrite equation (10) as follows:
$$\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta). \tag{11}$$
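
The following Python sketch illustrates how the sample estimates in (9) and a centralized reference solution of (10) could be computed; the MDP, feature map, and rewards are synthetic stand-ins rather than the paper's experimental setup.

```python
# Sample estimates of equation (9) and the closed-form minimizer of (10) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_states, d, n_agents, gamma_, rho, p = 20, 5, 4, 0.95, 0.1, 5000

Phi = rng.standard_normal((n_states, d))          # feature phi(s) of every state
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)           # transition matrix under the joint policy
R = rng.random((n_agents, n_states))              # expected local reward r^i(s) under the policy

A_hat = np.zeros((d, d))
C_hat = np.zeros((d, d))
b_hat = np.zeros((n_agents, d))
s = rng.integers(n_states)
for _ in range(p):
    s_next = rng.choice(n_states, p=P_pi[s])
    phi, phi_next = Phi[s], Phi[s_next]
    A_hat += np.outer(phi, phi - gamma_ * phi_next) / p
    C_hat += np.outer(phi, phi) / p
    b_hat += np.outer(R[:, s], phi) / p           # agent i only accumulates its own reward
    s = s_next

b_bar = b_hat.mean(axis=0)                        # average of the local b_hat^i, as in (10)
C_inv = np.linalg.inv(C_hat)
# Closed-form minimizer of (10), useful as a centralized reference solution.
theta_ref = np.linalg.solve(A_hat.T @ C_inv @ A_hat + rho * np.eye(d),
                            A_hat.T @ C_inv @ b_bar)
print(theta_ref)
```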

3. Fractional Order Dynamics for Policy Evaluation

Hereinbefore, the aim of policy evaluation becomes the minimization of the objective function. Now, we rewrite (11) in a distributed form, where each agent $i$ holds a local copy $\theta^{i}$:
$$\min_{\theta^{1}, \ldots, \theta^{N}} \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta^{i}), \quad \text{s.t.} \;\; \theta^{1} = \theta^{2} = \cdots = \theta^{N}. \tag{12}$$

We define $\tilde{\theta}$ as a vector concatenating all $\theta^{i}$, $\tilde{\theta} = [(\theta^{1})^{T}, \ldots, (\theta^{N})^{T}]^{T}$, and the aggregate function as $F(\tilde{\theta}) = \frac{1}{N}\sum_{i=1}^{N} f_{i}(\theta^{i})$. As is well known, the consensus constraint in (12) can be expressed as
$$\tilde{L}\tilde{\theta} = 0, \tag{13}$$
where $\tilde{L} = L \otimes I_{d}$, $L$ is the Laplacian matrix of the communication graph, and $I_{d}$ is the $d \times d$ identity matrix. Based on (13), we formulate the following augmented Lagrangian:
$$\mathcal{L}(\tilde{\theta}, \lambda) = F(\tilde{\theta}) + \lambda^{T}\tilde{L}\tilde{\theta} + \frac{1}{2}\tilde{\theta}^{T}\tilde{L}\tilde{\theta}, \tag{14}$$
where $\lambda$ is the Lagrange multiplier.
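
A small sketch of (13) and (14), assuming a hypothetical 4-agent ring and the quadratic penalty form written in (14): the Kronecker-structured Laplacian $\tilde{L}$ annihilates exactly the consensual vectors, so the constraint and the penalty both vanish at consensus.

```python
# Consensus constraint (13) and augmented Lagrangian (14) with L_tilde = kron(L, I_d).
import numpy as np

N, d = 4, 5
L = np.array([[ 2, -1,  0, -1],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [-1,  0, -1,  2]], dtype=float)      # Laplacian of a 4-agent ring
L_tilde = np.kron(L, np.eye(d))

def lagrangian(theta_tilde, lam, F):
    """Evaluate (14): F(theta) + lam^T L_tilde theta + 0.5 * theta^T L_tilde theta."""
    return F(theta_tilde) + lam @ (L_tilde @ theta_tilde) \
        + 0.5 * theta_tilde @ (L_tilde @ theta_tilde)

def F_demo(th):
    return float(th @ th)                          # placeholder smooth objective for the demo

theta_consensus = np.tile(np.ones(d), N)           # identical local copies theta^i
print(np.allclose(L_tilde @ theta_consensus, 0))   # constraint (13) holds at consensus
print(lagrangian(theta_consensus, np.zeros(N * d), F_demo))   # equals F_demo at consensus
```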

It is feasible to design a fractional order continuous-time optimization algorithm from the primal-dual viewpoint via (14): gradient descent for the primal variable $\tilde{\theta}$ and gradient ascent for the dual variable $\lambda$. Both of them are updated according to the fractional order law
$$D^{\alpha}\tilde{\theta}(t) = -\nabla_{\tilde{\theta}}\mathcal{L}(\tilde{\theta}(t), \lambda(t)) = -\nabla F(\tilde{\theta}(t)) - \tilde{L}\tilde{\theta}(t) - \tilde{L}\lambda(t), \qquad D^{\alpha}\lambda(t) = \nabla_{\lambda}\mathcal{L}(\tilde{\theta}(t), \lambda(t)) = \tilde{L}\tilde{\theta}(t), \tag{15}$$
where $\alpha$ is the fractional order and $\nabla_{\tilde{\theta}}\mathcal{L}$ and $\nabla_{\lambda}\mathcal{L}$ are the gradients of $\mathcal{L}$ with respect to $\tilde{\theta}$ and $\lambda$, respectively. We express the details of (15) in Algorithm 1.

Algorithm 1: Distributed policy evaluation with fractional order dynamics.
Initialization: $\theta^{i}(0)$, $\lambda^{i}(0)$ for each agent $i$, and the fractional order $\alpha$.
Update:
  For each agent $i = 1, \ldots, N$ (in parallel):
   $D^{\alpha}\theta^{i}(t) = -\frac{1}{N}\nabla f_{i}(\theta^{i}(t)) - \sum_{j=1}^{N} w_{ij}\left(\theta^{i}(t) - \theta^{j}(t)\right) - \sum_{j=1}^{N} w_{ij}\left(\lambda^{i}(t) - \lambda^{j}(t)\right)$
   $D^{\alpha}\lambda^{i}(t) = \sum_{j=1}^{N} w_{ij}\left(\theta^{i}(t) - \theta^{j}(t)\right)$
  End
 Return $\theta^{i}(t)$

The aim of the distributed algorithm is to obtain the solution of the value function evaluation problem. The proposed algorithm has more design freedom and more potential to achieve better convergence performance than the conventional integer order one. Hereinafter, we provide the following convergence conclusion.
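
As a concrete illustration of Algorithm 1, the following Python sketch discretizes the fractional order dynamics (15) with the Grünwald–Letnikov scheme on synthetic data; the topology, step size, horizon, and all numerical values are assumptions for illustration and do not come from the paper.

```python
# Grünwald–Letnikov discretization of the fractional order primal-dual dynamics (15).
# Agents are stacked into one vector, so one update of theta/lam performs the
# per-agent updates of Algorithm 1 simultaneously.
import numpy as np

rng = np.random.default_rng(1)
N, d, alpha, h, steps, rho = 4, 5, 0.9, 0.01, 4000, 0.1

# Synthetic shared data A_hat, C_hat and local data b_hat^i.
M = rng.standard_normal((d, d))
A_hat = M @ M.T + 0.5 * np.eye(d)
C_inv = np.linalg.inv(A_hat)            # take C_hat = A_hat for simplicity
b_hat = rng.standard_normal((N, d))

L = np.array([[ 2, -1,  0, -1],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [-1,  0, -1,  2]], dtype=float)   # Laplacian of a 4-agent ring
L_tilde = np.kron(L, np.eye(d))

def grad_F(theta_tilde):
    """Gradient of F(theta) = (1/N) sum_i (||A theta^i - b^i||^2_{C^{-1}} + rho ||theta^i||^2)."""
    Theta = theta_tilde.reshape(N, d)
    residual = Theta @ A_hat.T - b_hat           # row i holds A theta^i - b^i
    return (2.0 * (residual @ C_inv @ A_hat + rho * Theta) / N).ravel()

# Grünwald–Letnikov coefficients c_k = (-1)^k * binom(alpha, k).
c = np.ones(steps + 1)
for k in range(1, steps + 1):
    c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)

theta = np.zeros((steps + 1, N * d))     # theta[t] stacks the local copies theta^i
lam = np.zeros((steps + 1, N * d))
for t in range(steps):
    d_theta = -grad_F(theta[t]) - L_tilde @ theta[t] - L_tilde @ lam[t]   # primal descent
    d_lam = L_tilde @ theta[t]                                            # dual ascent
    w = c[1:t + 2][::-1]                 # history weights for steps 0, ..., t
    theta[t + 1] = h ** alpha * d_theta - w @ theta[:t + 1]
    lam[t + 1] = h ** alpha * d_lam - w @ lam[:t + 1]

Theta_final = theta[-1].reshape(N, d)
print(np.max(np.abs(Theta_final - Theta_final.mean(axis=0))))   # consensus error across agents
```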

Theorem 1. Under Assumption 1, let $\tilde{\theta}(t)$ and $\lambda(t)$ be generated according to Algorithm 1. If $\alpha \in (0, 1)$, then $\tilde{\theta}(t)$ asymptotically converges to the optimal solution.

Proof. We obtain the detailed dynamics of $\tilde{\theta}$ and $\lambda$:
$$D^{\alpha}\tilde{\theta} = -\nabla F(\tilde{\theta}) - \tilde{L}\tilde{\theta} - \tilde{L}\lambda, \qquad D^{\alpha}\lambda = \tilde{L}\tilde{\theta}, \tag{16}$$
where $\tilde{L} = L \otimes I$ and $I$ is an identity matrix. We consider the equilibrium $(\tilde{\theta}^{*}, \lambda^{*})$ of (16):
$$0 = -\nabla F(\tilde{\theta}^{*}) - \tilde{L}\tilde{\theta}^{*} - \tilde{L}\lambda^{*}, \qquad 0 = \tilde{L}\tilde{\theta}^{*}. \tag{17}$$
Then, we combine (16) and (17), and according to the facts $D^{\alpha}\tilde{\theta}^{*} = 0$ and $D^{\alpha}\lambda^{*} = 0$ (the Caputo derivative of a constant is zero),
$$D^{\alpha}(\tilde{\theta} - \tilde{\theta}^{*}) = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha}(\lambda - \lambda^{*}) = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{18}$$
Through Lemma 1, we reconstruct (18) as follows:
$$\frac{\partial z_{\theta}(\omega, t)}{\partial t} = -\omega z_{\theta} - \left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad \tilde{\theta} - \tilde{\theta}^{*} = \int_{0}^{\infty}\mu_{\alpha}(\omega) z_{\theta}\, d\omega, \tag{19}$$
and
$$\frac{\partial z_{\lambda}(\omega, t)}{\partial t} = -\omega z_{\lambda} + \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}), \qquad \lambda - \lambda^{*} = \int_{0}^{\infty}\mu_{\alpha}(\omega) z_{\lambda}\, d\omega. \tag{20}$$
We construct the Lyapunov function as follows:
$$V(t) = \frac{1}{2}\int_{0}^{\infty}\mu_{\alpha}(\omega)\left(\|z_{\theta}(\omega, t)\|^{2} + \|z_{\lambda}(\omega, t)\|^{2}\right)d\omega. \tag{21}$$
Then,
$$\dot{V}(t) = -\int_{0}^{\infty}\mu_{\alpha}(\omega)\,\omega\left(\|z_{\theta}\|^{2} + \|z_{\lambda}\|^{2}\right)d\omega - (\tilde{\theta} - \tilde{\theta}^{*})^{T}\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - (\tilde{\theta} - \tilde{\theta}^{*})^{T}\tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) \leq 0, \tag{22}$$
where the last two terms are nonpositive because $F$ is convex and $\tilde{L}$ is positive semidefinite. We obtain the result according to the LaSalle invariance principle.
Hereinafter, we improve the convergence conclusion of Theorem 1 by extending $\alpha$ from (0, 1) to (1, 2).

Theorem 2. Under Assumption 1, let $\tilde{\theta}(t)$ and $\lambda(t)$ be generated according to Algorithm 1. If $\alpha \in (1, 2)$, then $\tilde{\theta}(t)$ asymptotically converges to the optimal solution.

Proof. Under the condition $\alpha \in (1, 2)$, we rewrite the error dynamics of Theorem 1 as follows:
$$D^{\alpha}(\tilde{\theta} - \tilde{\theta}^{*}) = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha}(\lambda - \lambda^{*}) = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{23}$$
Due to $\alpha - 1 \in (0, 1)$ and the Caputo property $D^{\alpha}x(t) = D^{\alpha-1}\dot{x}(t)$,
$$D^{\alpha-1}\dot{\tilde{\theta}} = -\left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \qquad D^{\alpha-1}\dot{\lambda} = \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}). \tag{24}$$
Under the condition of (23) and (24), we obtain the frequency distributed model by Lemma 1 as follows:
$$\frac{\partial z_{\theta}(\omega, t)}{\partial t} = -\omega z_{\theta} - \left(\nabla F(\tilde{\theta}) - \nabla F(\tilde{\theta}^{*})\right) - \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}) - \tilde{L}(\lambda - \lambda^{*}), \quad \frac{\partial z_{\lambda}(\omega, t)}{\partial t} = -\omega z_{\lambda} + \tilde{L}(\tilde{\theta} - \tilde{\theta}^{*}), \quad \dot{\tilde{\theta}} = \int_{0}^{\infty}\mu_{\alpha-1}(\omega) z_{\theta}\, d\omega, \quad \dot{\lambda} = \int_{0}^{\infty}\mu_{\alpha-1}(\omega) z_{\lambda}\, d\omega. \tag{25}$$
We then construct a Lyapunov function for the distributed states $z_{\theta}$ and $z_{\lambda}$ analogous to (21) and show that its time derivative is nonpositive along (25). Through the LaSalle invariance principle, we obtain the result.

4. Experimental Simulation

In this section, we provide an example to illustrate the effectiveness of the proposed algorithm. There are 20 states in the multiagent reinforcement learning problem, and there are 4 agents in the connected network shown in Figure 1. We fix the regularization parameter $\rho$ and the discount parameter $\gamma$. The feature of each state is a randomly generated 5-dimensional column vector, each dimension of which is given by a cosine function, and $P$ is a randomly generated 5-dimensional matrix.

Then, we randomly generate the corresponding matrices.

Before the simulation, it is necessary to obtain the exact solution of the multiagent policy evaluation problem as a reference.

We compare the fractional order algorithm with the conventional integer order one. In Figures 2 and 3, the fractional order curve illustrates almost the same convergence performance as the conventional integer order one when $\alpha$ is 0.995. In Figures 4 and 5, the fractional order algorithm achieves a faster convergence rate than the integer order algorithm. The simulation results illustrate the convergence of both the integer order and the fractional order algorithms. Furthermore, the proposed distributed algorithm with fractional order dynamics has more design freedom to achieve better performance than the conventional first-order algorithm.
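
As a rough, self-contained illustration of the role of the order $\alpha$ (this is a toy scalar system, not the paper's experiment), the following sketch discretizes $D^{\alpha}x = -kx$ for several values of $\alpha$ and prints the transient and long-run magnitudes.

```python
# Toy scalar comparison: D^alpha x = -k x, discretized with the Grünwald–Letnikov scheme.
# It only illustrates that the order alpha is an extra design parameter shaping the
# convergence profile; all numbers are arbitrary.
import numpy as np

def simulate(alpha, k=5.0, h=0.01, steps=2000, x0=1.0):
    c = np.ones(steps + 1)
    for j in range(1, steps + 1):
        c[j] = c[j - 1] * (1.0 - (alpha + 1.0) / j)
    x = np.zeros(steps + 1)
    x[0] = x0
    for t in range(steps):
        w = c[1:t + 2][::-1]                                   # weights for x_0, ..., x_t
        x[t + 1] = x0 + h ** alpha * (-k * x[t]) - w @ (x[:t + 1] - x0)
    return x

for a in (1.0, 0.9, 0.8):
    traj = simulate(a)
    print(a, abs(traj[500]), abs(traj[-1]))    # transient and long-run magnitudes
```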

5. Conclusion

In this paper, the value function evaluation problem of multiagent reinforcement learning was transformed into a distributed optimization problem with a consensus constraint. Then, we proposed a distributed algorithm with fractional order dynamics to solve this problem. Moreover, we proved the asymptotic convergence of the algorithm via Lyapunov functions and illustrated the effectiveness of the proposed algorithm with a numerical example. In the future, we will consider applying reinforcement learning to recommendation systems to obtain better results [23].

Data Availability

The .m and .slx data used to support the findings of this study have been deposited in the Github repository (97weiD/data_DPEFOD).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61973002 and 61902104), and the Anhui Provincial Natural Science Foundation (2008085J32 and 2008085QF295).