Offline Multi-Agent Reinforcement Learning with Knowledge Distillation


NeurIPS 2022 Submission

Abstract



We introduce an offline multi-agent reinforcement learning (MARL) framework that utilizes previously collected data without additional online data collection. Our method reformulates offline MARL as a sequence modeling problem and thus builds on the simplicity and scalability of the Transformer architecture. Following the centralized-training, decentralized-execution paradigm, we propose to first train a teacher policy as if the MARL dataset were generated by a single agent. After the teacher policy has identified and recombined the "good" behaviors in the dataset, we create separate student policies and distill not only the teacher policy's features but also the structural relations among different agents' features into the students. Despite its simplicity, the proposed method outperforms state-of-the-art model-free offline MARL baselines on several environments while being more robust to the quality of the demonstrations.
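As a concrete illustration of the sequence-modeling view, the sketch below (our own assumption about the preprocessing, not code from the submission) flattens a multi-agent trajectory into the return-to-go, joint-observation, and joint-action inputs that a centralized decision transformer teacher would consume.

    # Minimal sketch: cast an offline MARL trajectory as a Decision Transformer
    # style sequence for the centralized teacher. Shapes and names are
    # illustrative assumptions, not the paper's exact preprocessing.
    import numpy as np

    def to_joint_sequence(observations, actions, rewards):
        """observations: (T, num_agents, obs_dim), actions: (T, num_agents, act_dim),
        rewards: (T,) team reward per step."""
        # Return-to-go: cumulative future team reward at each timestep.
        rtg = np.cumsum(rewards[::-1])[::-1].copy()
        T = observations.shape[0]
        joint_obs = observations.reshape(T, -1)   # concatenate all agents' observations
        joint_act = actions.reshape(T, -1)        # concatenate all agents' actions
        return rtg, joint_obs, joint_act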



Overview



We compare (a) the independent decision transformer (IDT), (b) the multi-agent decision transformer (MADT), and (c) our approach. (a) IDT trains an independent decision transformer for each agent separately. (b) MADT extends IDT by sharing parameters across agents and concatenating each agent's one-hot ID to its observations. (c) Our approach first trains a teacher policy, instantiated as a centralized decision transformer, and then distills both its features and the structural relations among agents' features into the IDT students.

Image
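To make the distillation step concrete, here is a minimal PyTorch sketch of matching both the per-agent features and the pairwise relations among them; the cosine-similarity relation matrix and the loss weights alpha/beta are illustrative choices on our part, not the authors' released code.

    import torch
    import torch.nn.functional as F

    def relation_matrix(feats):
        # feats: (num_agents, batch, dim) -> (batch, num_agents, num_agents)
        # Pairwise cosine similarities among agents' features.
        f = F.normalize(feats, dim=-1)
        return torch.einsum('abd,cbd->bac', f, f)

    def distillation_loss(teacher_feats, student_feats, alpha=1.0, beta=1.0):
        # Match per-agent features and the structural relations among them.
        feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
        rel_loss = F.mse_loss(relation_matrix(student_feats),
                              relation_matrix(teacher_feats).detach())
        return alpha * feat_loss + beta * rel_loss

The feature term keeps each student close to the teacher's representation of the same agent, while the relation term preserves how the teacher relates different agents' features to one another, which is exactly what a purely independent student would otherwise lose.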



Quantitative Results


Image

Image



Qualitative Results


Fill-In: The goal of this task is to explore as many blocks as possible with multiple agents. Explored blocks are colored (e.g., green). The training dataset consists of the agents' random-walk trajectories and their per-step rewards.
 - Ours: performs the best.
 - BC: mimics the random walks and performs the worst.
 - MADT: performs better than IDT and BC by filling more blocks, but still does not outperform our approach.
 - IDT: the agents learn to fill blocks to some extent; however, their trajectories partially overlap.

Image



