Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens

id:

2409.09513

Authors:

Joseph Clinton, Robert Lieck

Published:

2024-09-14

arXiv:

https://arxiv.org/abs/2409.09513

PDF:

https://arxiv.org/pdf/2409.09513

DOI:

N/A

Journal Reference:

N/A

Primary Category:

cs.LG

Categories:

cs.LG, cs.AI, cs.CL

Comment:

11 pages, 5 figures, Submitted to AAAI

github_url:

N/A

abstract

Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent’s future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model’s policy through the interpretable plan visualisations and attention map.
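To make the dual time-scale idea concrete, below is a minimal, hypothetical sketch of how a trajectory could be tokenised with a coarse "plan" inserted at regular intervals alongside the usual return/state/action tokens. The function names, the interval, and the future-summarisation rule are illustrative assumptions, not the paper's exact scheme.

```python
# Hypothetical tokenisation sketch: interleave a coarse, long-horizon "plan"
# token every `plan_interval` steps among the (return, state, action) tokens.
# The summarisation rule and all names are assumptions for illustration.

def summarise_future(states, t, horizon=100, waypoints=4):
    """A coarse plan: a few evenly spaced waypoints from the future states."""
    future = states[t:t + horizon]
    step = max(1, len(future) // waypoints)
    return future[::step][:waypoints]

def build_sequence(returns, states, actions, plan_interval=20):
    """Interleave return / state / (plan) / action tokens for one trajectory."""
    sequence = []
    for t, (r, s, a) in enumerate(zip(returns, states, actions)):
        sequence.append(("return", r))
        sequence.append(("state", s))
        if t % plan_interval == 0:
            # long time-scale target predicted alongside the low-level tokens
            sequence.append(("plan", summarise_future(states, t)))
        sequence.append(("action", a))
    return sequence

if __name__ == "__main__":
    T = 50
    states = [(float(t), 0.5 * t) for t in range(T)]
    actions = [(0.1,) for _ in range(T)]
    returns = [float(T - t) for t in range(T)]
    seq = build_sequence(returns, states, actions)
    print([kind for kind, _ in seq[:10]])
```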

premise

outline

quotes

notes

summary

  1. Brief Overview

The paper introduces the Planning Transformer (PT), a novel model for long-horizon offline reinforcement learning. PT addresses the compounding error problem inherent in autoregressive models like the Decision Transformer (DT) by incorporating “Planning Tokens.” These tokens provide high-level, long-timescale information about the agent’s future, guiding the low-level policy and reducing error. This approach achieves state-of-the-art performance in complex D4RL environments and improves model interpretability through plan visualizations and attention maps.
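As a rough illustration of how Planning Tokens might guide the low-level policy at inference time, the loop below re-predicts a plan at a fixed interval and conditions every action on the most recent plan. The `model.predict_plan` / `model.predict_action` interface and the Gymnasium-style `env` are assumptions for this sketch, not the authors' actual API.

```python
# Hedged inference sketch: regenerate a long-horizon plan every
# `plan_interval` steps and condition each per-step action on it.
# `model.predict_plan`, `model.predict_action`, and the env API are assumed.

def rollout(env, model, goal, max_steps=1000, plan_interval=20):
    obs, _ = env.reset()
    history, plan = [], None
    for t in range(max_steps):
        if t % plan_interval == 0:
            # high-level, long time-scale prediction (implicit planning)
            plan = model.predict_plan(history, obs, goal)
        # low-level policy guided by the current plan
        action = model.predict_action(history, obs, goal, plan)
        obs, reward, terminated, truncated, _ = env.step(action)
        history.append((obs, action, reward))
        if terminated or truncated:
            break
    return history
```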

  2. Key Points

  • Introduces Planning Tokens to address compounding error in long-horizon offline reinforcement learning.

  • Combines strengths of Reinforcement Learning via Supervised Learning (RvS) and Hierarchical Reinforcement Learning (HRL) without HRL’s drawbacks.

  • Achieves state-of-the-art (SOTA) performance on several D4RL benchmark tasks, particularly in long-horizon goal-conditioned environments.

  • Enhances interpretability through visualization of Plans and attention maps, offering insights into the model’s decision-making process.

  • Employs a unified training pipeline for action and plan prediction, simplifying the model architecture (a loss sketch follows below).
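
A minimal sketch of what a unified objective over both token types might look like, assuming the model returns plan and action predictions from a single forward pass; the model interface, batch fields, and loss weighting are illustrative assumptions rather than the paper's implementation.

```python
import torch.nn.functional as F

# Sketch of a single training objective covering both token types.
# `model` and the batch field names are assumptions, not the paper's code.

def training_step(model, batch, plan_weight=1.0):
    out = model(batch["returns"], batch["states"], batch["actions"])
    # low-level head: next-action prediction (as in a Decision Transformer)
    action_loss = F.mse_loss(out["action_preds"], batch["action_targets"])
    # high-level head: long time-scale plan prediction at regular intervals
    plan_loss = F.mse_loss(out["plan_preds"], batch["plan_targets"])
    # one pipeline, one loss: plans and actions are learned jointly
    return action_loss + plan_weight * plan_loss
```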

  3. Notable Quotes

None explicitly stated in the provided text.

  4. Primary Themes

  • Addressing Compounding Error in Offline RL: The core theme is tackling the limitations of autoregressive models in long-horizon tasks by introducing a mechanism (Planning Tokens) for implicit planning.

  • Improving Interpretability in RL: The paper highlights the increased interpretability offered by the visualization of generated plans and attention maps, which is a significant contribution to the field.

  • Efficient Offline RL: The model aims for efficient learning from fixed datasets, which is a crucial aspect of offline RL.

  • Hybrid Approach to HRL: The method leverages the benefits of both RvS and HRL approaches without the typical complexities and limitations of explicit HRL architectures.