Using Reinforcement Learning to Herd a Robotic Swarm to a Target Distribution
arXiv - CS - Multiagent Systems · Pub Date: 2020-06-29 · DOI: arxiv-2006.15807
Zahi M. Kakish, Karthik Elamvazhuthi, Spring Berman

In this paper, we present a reinforcement learning approach to designing a control policy for a "leader" agent that herds a swarm of "follower" agents, via repulsive interactions, as quickly as possible to a target probability distribution over a strongly connected graph. The leader control policy is a function of the swarm distribution, which evolves over time according to a mean-field model in the form of an ordinary difference equation. The dependence of the policy on agent populations at each graph vertex, rather than on individual agent activity, simplifies the observations required by the leader and enables the control strategy to scale with the number of agents. Two Temporal-Difference learning algorithms, SARSA and Q-Learning, are used to generate the leader control policy based on the follower agent distribution and the leader's location on the graph. A simulation environment corresponding to a grid graph with 4 vertices was used to train and validate the control policies for follower agent populations ranging from 10 to 100. Finally, the control policies trained on 100 simulated agents were used to successfully redistribute a physical swarm of 10 small robots to a target distribution among 4 spatial regions.
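To make the setup concrete, below is a minimal, hypothetical sketch of the kind of tabular Q-Learning loop the abstract describes: a 2x2 grid graph (4 vertices), a population of follower agents repelled from the leader's vertex, and a leader policy learned over the discretized follower distribution plus the leader's location. The transition rule, reward, discretization, and hyperparameters are illustrative assumptions, not the authors' exact mean-field model or training procedure.

```python
import numpy as np

# Illustrative sketch only: a 2x2 grid graph, N followers, one leader.
# Followers at the leader's vertex hop to a random neighbor (repulsion);
# tabular Q-Learning trains the leader to drive the follower distribution
# toward a target probability distribution. All details are assumptions.

N_AGENTS = 100
VERTICES = 4                                   # 2x2 grid graph
NEIGHBORS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
TARGET = np.array([0.1, 0.4, 0.4, 0.1])        # target distribution (assumed)
BINS = 5                                       # coarse bins per vertex count

def step_followers(counts, leader):
    """Followers on the leader's vertex move to a uniformly random neighbor."""
    counts = counts.copy()
    movers = int(counts[leader])
    counts[leader] = 0
    for _ in range(movers):
        counts[np.random.choice(NEIGHBORS[leader])] += 1
    return counts

def encode(counts, leader):
    """Discretize the swarm distribution and append the leader's vertex."""
    levels = np.minimum(counts * BINS // N_AGENTS, BINS - 1)
    state = 0
    for lv in levels:
        state = state * BINS + int(lv)
    return state * VERTICES + leader

def reward(counts):
    """Negative L1 distance between empirical and target distributions."""
    return -np.abs(counts / N_AGENTS - TARGET).sum()

n_states = (BINS ** VERTICES) * VERTICES
Q = np.zeros((n_states, VERTICES))             # action = vertex leader moves to
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(2000):
    counts = np.full(VERTICES, N_AGENTS // VERTICES)
    leader = 0
    for t in range(50):
        s = encode(counts, leader)
        legal = [leader] + NEIGHBORS[leader]   # stay or move to a neighbor
        if np.random.rand() < eps:
            a = int(np.random.choice(legal))
        else:
            a = legal[int(np.argmax(Q[s, legal]))]
        leader = a
        counts = step_followers(counts, leader)
        r = reward(counts)
        s2 = encode(counts, leader)
        legal2 = [leader] + NEIGHBORS[leader]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2, legal2]) - Q[s, a])
```

Because the state is the population count at each vertex rather than individual agent identities, the same learned policy can be applied to swarms of different sizes, which mirrors the scalability argument in the abstract; swapping the Q-Learning update for the on-policy SARSA update would give the paper's other variant.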

Updated: 2020-06-30