Deep cooperative strategies for multi-agent reinforcement learning
Human beings are inherently social creatures, engaged in constant interaction with one another. A world devoid of these interactions would lack meaning. Among the many interactions that permeate our daily lives, some require multiple individuals to collaborate towards a common goal: construction workers coordinating to erect a building, for instance, or drivers yielding priority to one another so that everyone navigates safely. In the realm of machines and artificial intelligence, such scenarios can be modelled as multi-agent systems.
Multi-agent Reinforcement Learning (MARL) stands out as a popular machine learning paradigm for solving complex problems within multi-agent systems. While real-world multi-agent problems can often be framed as MARL problems, the full potential of recently proposed approaches is realised primarily in computer simulation, owing to the practical constraints that MARL imposes. For instance, in real scenarios, agents usually have access only to their local observations of the environment and cannot see any other global information. In simulation, by contrast, it is possible to leverage a central oracle that has access to all the information in the system during the training stage. This mode of operation is known as Centralised Training with Decentralised Execution (CTDE). Under this paradigm, value function factorisation methods constitute a family of MARL algorithms that learn a decomposition of the joint action-value function into agent-wise policies. Although several such methods have been proposed recently, some still fail to solve certain complex tasks with difficult trade-offs, and others rely on extra state information that may not always be available. In this thesis, we introduce a value function factorisation method called Residual Q-Networks that does not require extra state information during the factorisation process. Theoretically, this method is capable of factorising any family of environments, offering advantages particularly in scenarios with severe penalties for non-cooperative behaviour.
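To make the factorisation idea concrete, the sketch below shows a minimal VDN-style additive decomposition in PyTorch, assuming per-agent utility networks and a simple sum as the mixer. It only illustrates the general CTDE shape that factorisation methods share; it is not the Residual Q-Networks method itself, whose mixing mechanism differs.

```python
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Per-agent utility Q_i(o_i, a_i), conditioned only on the local observation."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (batch, n_actions)

def joint_q_additive(chosen_qs: torch.Tensor) -> torch.Tensor:
    """Factorise Q_tot as the sum of the per-agent utilities of the chosen actions.
    Summation preserves each agent's argmax, so greedy decentralised execution
    stays consistent with the centralised training signal."""
    return chosen_qs.sum(dim=1)  # (batch,)

# Decentralised acting, centralised value used for training on the team reward.
n_agents, obs_dim, n_actions = 3, 8, 5
agents = [AgentQNetwork(obs_dim, n_actions) for _ in range(n_agents)]
local_obs = torch.randn(n_agents, obs_dim)
actions = [agents[i](local_obs[i].unsqueeze(0)).argmax(dim=1).item()
           for i in range(n_agents)]                              # execution: local info only
chosen_qs = torch.stack([agents[i](local_obs[i].unsqueeze(0))[0, actions[i]]
                         for i in range(n_agents)]).unsqueeze(0)  # (1, n_agents)
q_tot = joint_q_additive(chosen_qs)  # regressed towards the shared team return
```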
The emergence of lazy agents is another common issue in MARL. This occurs when, within a team of several agents, some do not contribute to the team's overall objective, opting instead to wait for teammates to do all the work. The issue stems from inaccuracies in assigning credit for the shared reward among team members. To address it, this thesis proposes a causality-based method that aims to find causal relationships between the agents' individual observations and the team reward. The intuition is that, when the team receives a reward, an agent should only be credited with it if that agent had an impact on achieving it. Furthermore, we move beyond learning under CTDE and explore how agents can learn independently, without sharing network parameters, by employing our causality-based method to enhance their cooperative behaviour. Independent learning is considered a more realistic setting, since agents are treated as self-contained entities, each with its own policy, that do not rely on a centralised oracle for learning.
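As a rough illustration of this intuition, the sketch below redistributes a shared team reward using a per-agent causal indicator. The function had_causal_impact is a hypothetical placeholder for a learned causal test between local observations and the team reward; it does not reproduce the thesis's actual method.

```python
import torch

def had_causal_impact(agent_obs: torch.Tensor, team_reward: torch.Tensor) -> torch.Tensor:
    """Placeholder causal indicator per agent: 1.0 if the agent's observation is
    deemed to influence the team reward, else 0.0. A real implementation would
    estimate this relationship from data rather than use a fixed rule."""
    return (agent_obs.abs().mean(dim=-1, keepdim=True) > 0.5).float()

def individual_rewards(obs: torch.Tensor, team_reward: torch.Tensor) -> torch.Tensor:
    """Redistribute the shared team reward as per-agent learning signals.

    obs:         (n_agents, obs_dim) local observations
    team_reward: scalar tensor, the shared team reward
    returns:     (n_agents, 1) masked rewards for each agent's TD target
    """
    mask = had_causal_impact(obs, team_reward)  # (n_agents, 1)
    return mask * team_reward                   # lazy agents receive no credit

obs = torch.randn(3, 8)
r_team = torch.tensor(1.0)
print(individual_rewards(obs, r_team))          # e.g. tensor([[1.], [0.], [1.]])
```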
In many situations, the ability to communicate can help improve cooperative behaviour, and communication channels are commonly available in real applications. To this end, this thesis proposes a communication method named Attentive Regularized Communication (ARCOMM). Allowing agents to talk can be key to learning certain complex tasks, but this hinges on learning efficient messages. Additionally, we explore communication among fully independent learners - agents that do not share network parameters and rely solely on their local observations - examining the interplay between network capacity, parameter sharing, and communication.
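The sketch below illustrates the general ingredients such a method might combine: soft attention over teammates' messages and a regularisation penalty that keeps messages compact. All module shapes and the penalty form are assumptions for illustration, not ARCOMM's actual architecture.

```python
import torch
import torch.nn as nn

class CommAgent(nn.Module):
    def __init__(self, obs_dim: int, msg_dim: int = 16):
        super().__init__()
        self.encode_msg = nn.Linear(obs_dim, msg_dim)  # produce outgoing message
        self.query = nn.Linear(obs_dim, msg_dim)       # key for attending over messages

    def message(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.encode_msg(obs))

    def aggregate(self, obs: torch.Tensor, msgs: torch.Tensor) -> torch.Tensor:
        """Soft attention over teammates' messages, keyed by the local observation.
        obs: (obs_dim,) local observation; msgs: (n_agents, msg_dim)."""
        q = self.query(obs)                           # (msg_dim,)
        scores = msgs @ q / msgs.shape[-1] ** 0.5     # (n_agents,)
        weights = torch.softmax(scores, dim=0)
        return weights @ msgs                         # (msg_dim,) message summary

def message_regulariser(msgs: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    """Penalise message magnitude so agents learn compact, informative signals.
    This term would be added to the usual RL loss during training; the exact
    form of regularisation used by ARCOMM may differ."""
    return coef * msgs.pow(2).mean()

agents = [CommAgent(obs_dim=8) for _ in range(3)]
obs = torch.randn(3, 8)
msgs = torch.stack([a.message(o) for a, o in zip(agents, obs)])  # (3, 16)
summary = agents[0].aggregate(obs[0], msgs)                      # what agent 0 hears
loss_reg = message_regulariser(msgs)
```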
Overall, the approaches proposed in this thesis offer solutions to several current problems in MARL, contributing new ways of improving cooperation and enabling the creation of more capable agents. At the same time, we present perspectives and open discussions aimed at narrowing the gap between simulation and reality. Finally, we conclude by outlining potential avenues for future research, shedding light on additional applications that can benefit from the findings presented herein.
School
- Loughborough University, London
Publisher
- Loughborough University
Rights holder
- © Rafael Moreira Pina
Publication date
- 2023
Notes
- A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of Loughborough University.
Language
- en
Supervisor(s)
- Varuna De Silva
Qualification name
- PhD
Qualification level
- Doctoral