Pattern Recognition Letters

Volume 155, March 2022, Pages 135-142

Robust experience replay sampling for multi-agent reinforcement learning

https://doi.org/10.1016/j.patrec.2021.11.006

Highlights

  • Propose new algorithms for acquiring suitable experiences from the replay buffer through filtering.

  • Strengthen the exploration strategy by reducing repetitive decisions at a given state.

  • Achieve performance higher than or comparable to the baseline algorithms.

  • Achieve early convergence and improved policy search compared to the baselines.

Abstract

Learning from relevant experiences leads to fast convergence if those experiences provide useful information. We present a new, simple yet efficient technique for finding suitable samples of experiences to train the agents in a given state of an environment. We aim to increase the number of visited states and unique sequences, which efficiently reduces the number of states the agents have to explore or exploit. Our technique implicitly adds strength to the exploration-exploitation trade-off. It filters the samples of experiences that can benefit more than half of the agents and then uses those experiences to extract information useful for decision making. To achieve this filtering, we first compute the similarities between the observed state and the previous states stored in the experiences. We then filter the samples using a hyper-parameter, z, to decide which experiences are suitable. We find that agents learn quickly and efficiently because the sampled experiences provide useful information that speeds up convergence. In every episode, most agents learn or contribute to improving the total expected future return. We further study our approach's generalization ability and present different settings to show significant improvements in diverse experiment environments.

Introduction

Replay memory is an essential concept in deep reinforcement learning since it enables algorithms to reuse observed streams of experiences to improve their internal beliefs. Most algorithms use the samples stored in the replay memory for data efficiency [14], [27], [30]. Since experience replay breaks data correlation [2], [21], it significantly improves data efficiency, inducing stability in training and speeding up learning.

In contrast, training data is fairly easy to collect in a simulation environment but much harder to obtain in real-world control tasks [2]. Because of such limitations, most reinforcement learning algorithms do not shine in real-world applications and hence become impractical. In these situations, efficient utilization of resources and time is crucial. Deep learning-based agents can therefore take advantage of the experiences collected and stored in the replay memory to learn efficiently and mitigate several problems [8].

Normally, approximately one million or more samples can be collected and stored in the memory buffer. There must therefore be sampling techniques for selecting relevant experiences from the replay memory. Usually, most RL algorithms randomly sample a batch of transitions at each step to update the agent's parameters. Unfortunately, not all samples at a given state should be equally weighted [2], [21]; therefore, random sampling is an inadequate approach for choosing useful samples.
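For context, the sketch below illustrates this conventional uniform-sampling replay buffer; the class and variable names are our own and are not taken from the paper or its baselines.

    import random
    from collections import deque, namedtuple

    # A transition tuple as stored in the replay memory.
    Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

    class ReplayBuffer:
        def __init__(self, capacity=1_000_000):
            # Oldest transitions are evicted once capacity is reached.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state):
            self.buffer.append(Transition(state, action, reward, next_state))

        def sample(self, batch_size):
            # Uniform random sampling: every stored transition is equally
            # likely to be drawn, regardless of how useful it is for the
            # currently observed state.
            return random.sample(self.buffer, batch_size)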

This paper proposes a sample-filtering technique, since discovering samples rich in information useful to the agents at particular states is challenging. It is even more challenging when dealing with problems involving multiple agents. Our technique adopts cosine similarity to measure the similarity between two vectors, as discussed in Section 4. By computing the similarity scores, we can choose which data samples are suitable for improving the agents' parameters for better performance. Furthermore, this sampling technique reduces the chance of using the same transitions of state-action pairs and rewards too often. In this way, we increase the possibility of examining unexplored decisions, which prevents the agents from taking the same actions repeatedly and always ending up at the same visited states without acquiring new experiences.
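A minimal sketch of this similarity-based filtering is given below, reusing the Transition tuples from the previous sketch and assuming flattened state vectors and a cosine-similarity threshold z. The function names and the direction of the comparison are illustrative; the exact rule (including the requirement that a sample benefit more than half of the agents) follows Section 4.

    import numpy as np

    def cosine_similarity(u, v, eps=1e-8):
        # Cosine similarity between two flattened state vectors.
        u, v = np.ravel(u), np.ravel(v)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

    def filter_samples(batch, current_state, z):
        # Keep only transitions whose stored state is sufficiently similar
        # to the currently observed state, as judged by the threshold z.
        return [t for t in batch if cosine_similarity(t.state, current_state) >= z]

The agents' update would then use only the filtered subset rather than the full randomly drawn batch.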

The following is a summary of the contributions of our work:

  • Propose new algorithms for acquiring relevant experiences from the experience replay memory through filtering.

  • Strengthen the exploration strategy by reducing repetitive decisions at a given state.

  • Achieve performance higher than or comparable to the baseline algorithms.

  • Achieve early convergence and improved policy search in several tasks compared to the baselines.

Section snippets

Related work

While single-agent reinforcement learning (SARL) has gained popularity in research as well as in industrial applications, multi-agent reinforcement learning (MARL) is still facing several challenges, one of which is non-stationarity. Non-stationarity in a multi-agent environment (Markov games) emerges because each agent's policy changes over time, and each agent's actions also affect the state transition functions and reward functions of the others [19].

Various techniques have been

Similarities measure

In 1992, Lin [14] introduced the idea of experience replay, which has contributed significantly to many reinforcement learning algorithms. The main idea behind experience replay is to stabilize the learning process and introduce sample efficiency during training by repeatedly presenting the collected experiences, i.e., by sampling the transitions stored in the replay buffer. These transitions are tuples; at each time-step, a tuple contains a state s_t, the action a_t taken, the next state s_{t+1} and a reward r
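As stated in the introduction, the similarity between the observed state and a stored state is measured with cosine similarity; for state vectors s and s_i this takes the standard form (our notation, not necessarily the paper's):

    \mathrm{sim}(s, s_i) = \frac{s \cdot s_i}{\lVert s \rVert \, \lVert s_i \rVert}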

Proposed method


Experiments

In this section, we describe our experiments to uncover the potential of these algorithms. We implemented them on top of several baselines and tried several settings to evaluate the performance of our sampling technique. We used open-source implementations of Graph Convolutional Reinforcement Learning for Multi-Agent Cooperation (DGN) [11] and Permutation Invariant Critic for Multi-Agent Deep Reinforcement Learning (PIC) [16] as our baselines, naming the corresponding variants DGN + RS-MARL and PIC + RS-MARL,

Results and discussion

High returns with fast, efficient learning are the goal of any reinforcement learning algorithm. With this in mind, we developed simple yet powerful algorithms to find and filter highly efficient samples from the collected experiences to train agents. We discuss these points together with the experiment results.

Conclusion

This paper proposed a method to sample past experiences stored in the replay buffer and take advantage of them to train agents efficiently in a multi-agent reinforcement learning (MARL) environment. We use the currently observed state to filter the samples that are needed, which implicitly introduces some advantages that lead to quick convergence, as shown in the experiment results. Just as the human learning process does not solve the same problem with the same approach repeatedly to learn to generalize in

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (31)

  • W. Zemzem et al., Cooperative multi-agent systems using distributed reinforcement learning techniques, Procedia Comput. Sci. (2018)
  • T. Bansal, J. W. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent complexity via multi-agent competition, arXiv...
  • M. Brittain, J. Bertram, X. Yang, P. Wei, Prioritized sequence experience replay, arXiv preprint...
  • L. Busoniu et al., Multi-agent Reinforcement Learning: An Overview, Technical Report 10,003 (2012)
  • F. Christianos et al., Shared experience actor-critic for multi-agent reinforcement learning, Proceedings of the Conference on Neural Information Processing Systems (NeurIPS'20) (2020)
  • C. Colas et al., GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms, Proceedings of Machine Learning Research (2018)
  • J. Ding et al., Convolutional neural network with data augmentation for SAR target recognition, IEEE Geosci. Remote Sens. Lett. (2016)
  • J. Foerster et al., Counterfactual multi-agent policy gradients, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18) (2018)
  • J. Foerster et al., Stabilising experience replay for deep multi-agent reinforcement learning, Proceedings of the 34th International Conference on Machine Learning (2017)
  • J. Hu et al., Nash Q-learning for general-sum stochastic games, J. Mach. Learn. Res. (2003)
  • S. Iqbal, F. Sha, Actor-attention-critic for multi-agent reinforcement learning, in: Proceedings of the 36th...
  • J. Jiang et al., Graph convolutional reinforcement learning, 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia (2020)
  • D.P. Kingma et al., Adam: a method for stochastic optimization
  • M. Lanctot et al., A unified game-theoretic approach to multiagent reinforcement learning, Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17) (2017)
  • L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Mach. Learn. (1992)