PEBL: Pessimistic Ensembles for Offline Deep Reinforcement Learning
Abstract
Offline reinforcement learning (RL), or learning from a fixed data set, is an attractive alternative to online RL. Offline RL promises to address the cost and safety implications of taking numerous random or bad actions online, a crucial aspect of traditional RL that makes it difficult to apply in real-world problems. However, when RL is naïvely applied to a fixed data set, the resulting policy may exhibit poor performance in the real environment. This happens due to over-estimation of the value of state-action pairs not sufficiently covered by the data set. A promising way to avoid this is by applying pessimism and acting according to a lower bound estimate on the value. It has been shown that penalizing the learned value according to a pessimistic bound on the uncertainty can drastically improve offline RL. In deep reinforcement learning, however, uncertainty estimation is highly non-trivial and the development of effective uncertainty-based pessimistic algorithms remains an open question. This paper introduces two novel offline deep RL methods built on Double Deep Q-Learning and Soft Actor-Critic. We show how a multi-headed bootstrap approach to uncertainty estimation is used to calculate an effective pessimistic value penalty. Our approach is applied to benchmark offline deep RL domains, where we demonstrate that our methods can often beat the current state-of-the-art.
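To illustrate the core idea described above, the sketch below shows one plausible way a multi-headed bootstrap Q-network can yield a pessimistic value estimate: the disagreement (standard deviation) across heads serves as an uncertainty measure, and the value used for learning and action selection is the ensemble mean minus a penalty proportional to that disagreement. This is a minimal, hypothetical illustration, not the authors' implementation; the network sizes, number of heads, and the penalty weight `beta` are assumptions chosen for clarity.

```python
# Hypothetical sketch of a pessimistic ensemble value estimate.
# Not the authors' code: layer sizes, head count, and `beta` are illustrative.
import torch
import torch.nn as nn


class MultiHeadQNetwork(nn.Module):
    """Shared trunk with K bootstrapped Q-value heads."""

    def __init__(self, state_dim: int, action_dim: int,
                 num_heads: int = 10, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear output per head; in a bootstrap scheme each head would be
        # trained on its own resampled subset (mask) of the offline data set.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_heads)])

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        features = self.trunk(torch.cat([state, action], dim=-1))
        # Returns shape (num_heads, batch, 1).
        return torch.stack([head(features) for head in self.heads], dim=0)


def pessimistic_value(q_net: MultiHeadQNetwork, state: torch.Tensor,
                      action: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Lower-bound value estimate: ensemble mean minus beta times ensemble std."""
    q_all = q_net(state, action)   # (num_heads, batch, 1)
    q_mean = q_all.mean(dim=0)
    q_std = q_all.std(dim=0)       # head disagreement as epistemic uncertainty
    return q_mean - beta * q_std


if __name__ == "__main__":
    net = MultiHeadQNetwork(state_dim=17, action_dim=6)
    s, a = torch.randn(32, 17), torch.randn(32, 6)
    print(pessimistic_value(net, s, a).shape)  # torch.Size([32, 1])
```

In an offline setting, such a penalized estimate would be used in place of the raw Q-value when forming bootstrapped targets or evaluating candidate actions, so that state-action pairs poorly covered by the data set (where the heads disagree) are valued conservatively.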