Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens

id:

2409.09513

Authors:

Joseph Clinton, Robert Lieck

Published:

2024-09-14

arXiv:

https://arxiv.org/abs/2409.09513

PDF:

https://arxiv.org/pdf/2409.09513

DOI:

N/A

Journal Reference:

N/A

Primary Category:

cs.LG

Categories:

cs.LG, cs.AI, cs.CL

Comment:

11 pages, 5 figures, Submitted to AAAI

github_url:

N/A

abstract

Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent’s future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model’s policy through the interpretable plan visualisations and attention map.
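To make the dual time-scale idea concrete, below is a minimal, hypothetical sketch of how a trajectory could be tokenised with a coarse "plan" inserted at regular intervals alongside the usual return/state/action tokens. The function names, the interval, and the future-summarisation rule are illustrative assumptions, not the paper's exact scheme.

```python
# Hypothetical tokenisation sketch: interleave a coarse, long-horizon "plan"
# token every `plan_interval` steps among the (return, state, action) tokens.
# The summarisation rule and all names are assumptions for illustration.

def summarise_future(states, t, horizon=100, waypoints=4):
    """A coarse plan: a few evenly spaced waypoints from the future states."""
    future = states[t:t + horizon]
    step = max(1, len(future) // waypoints)
    return future[::step][:waypoints]

def build_sequence(returns, states, actions, plan_interval=20):
    """Interleave return / state / (plan) / action tokens for one trajectory."""
    sequence = []
    for t, (r, s, a) in enumerate(zip(returns, states, actions)):
        sequence.append(("return", r))
        sequence.append(("state", s))
        if t % plan_interval == 0:
            # long time-scale target predicted alongside the low-level tokens
            sequence.append(("plan", summarise_future(states, t)))
        sequence.append(("action", a))
    return sequence

if __name__ == "__main__":
    T = 50
    states = [(float(t), 0.5 * t) for t in range(T)]
    actions = [(0.1,) for _ in range(T)]
    returns = [float(T - t) for t in range(T)]
    seq = build_sequence(returns, states, actions)
    print([kind for kind, _ in seq[:10]])
```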

premise

outline

quotes

notes

summary

  1. Brief Overview

The paper introduces the Planning Transformer (PT), a novel model for long-horizon offline reinforcement learning. PT addresses the compounding error problem inherent in autoregressive models like the Decision Transformer (DT) by incorporating “Planning Tokens.” These tokens provide high-level, long-timescale information about the agent’s future, guiding the low-level policy and reducing error. This approach achieves state-of-the-art performance in complex D4RL environments and improves model interpretability through plan visualizations and attention maps.
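As a rough illustration of how Planning Tokens might guide the low-level policy at inference time, the loop below re-predicts a plan at a fixed interval and conditions every action on the most recent plan. The `model.predict_plan` / `model.predict_action` interface and the Gymnasium-style `env` are assumptions for this sketch, not the authors' actual API.

```python
# Hedged inference sketch: regenerate a long-horizon plan every
# `plan_interval` steps and condition each per-step action on it.
# `model.predict_plan`, `model.predict_action`, and the env API are assumed.

def rollout(env, model, goal, max_steps=1000, plan_interval=20):
    obs, _ = env.reset()
    history, plan = [], None
    for t in range(max_steps):
        if t % plan_interval == 0:
            # high-level, long time-scale prediction (implicit planning)
            plan = model.predict_plan(history, obs, goal)
        # low-level policy guided by the current plan
        action = model.predict_action(history, obs, goal, plan)
        obs, reward, terminated, truncated, _ = env.step(action)
        history.append((obs, action, reward))
        if terminated or truncated:
            break
    return history
```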

  2. Key Points

  • Introduces Planning Tokens to address compounding error in long-horizon offline reinforcement learning.

  • Combines strengths of Reinforcement Learning via Supervised Learning (RvS) and Hierarchical Reinforcement Learning (HRL) without HRL’s drawbacks.

  • Achieves state-of-the-art (SOTA) performance on several D4RL benchmark tasks, particularly in long-horizon goal-conditioned environments.

  • Enhances interpretability through visualization of Plans and attention maps, offering insights into the model’s decision-making process.

  • Employs a unified training pipeline for action and plan prediction, simplifying the model architecture (a loss sketch follows below).
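
A minimal sketch of what a unified objective over both token types might look like, assuming the model returns plan and action predictions from a single forward pass; the model interface, batch fields, and loss weighting are illustrative assumptions rather than the paper's implementation.

```python
import torch.nn.functional as F

# Sketch of a single training objective covering both token types.
# `model` and the batch field names are assumptions, not the paper's code.

def training_step(model, batch, plan_weight=1.0):
    out = model(batch["returns"], batch["states"], batch["actions"])
    # low-level head: next-action prediction (as in a Decision Transformer)
    action_loss = F.mse_loss(out["action_preds"], batch["action_targets"])
    # high-level head: long time-scale plan prediction at regular intervals
    plan_loss = F.mse_loss(out["plan_preds"], batch["plan_targets"])
    # one pipeline, one loss: plans and actions are learned jointly
    return action_loss + plan_weight * plan_loss
```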

  3. Notable Quotes

None explicitly stated in the provided text.

  4. Primary Themes

  • Addressing Compounding Error in Offline RL: The core theme is tackling the limitations of autoregressive models in long-horizon tasks by introducing a mechanism (Planning Tokens) for implicit planning.

  • Improving Interpretability in RL: The paper highlights the increased interpretability offered by the visualization of generated plans and attention maps, which is a significant contribution to the field.

  • Efficient Offline RL: The model aims for efficient learning from fixed datasets, which is a crucial aspect of offline RL.

  • Hybrid Approach to HRL: The method leverages the benefits of both RvS and HRL approaches without the typical complexities and limitations of explicit HRL architectures.