Deep, model-based reinforcement learning has achieved state-of-the-art, human-exceeding performance in many challenging domains.
However, low sample efficiency and limited exploration remain leading obstacles in the field.
In this work, we incorporate epistemic uncertainty into planning for better exploration.
We develop a low-cost framework for estimating epistemic uncertainty and propagating it through planning with a learned model.
We propose a new method, \textit{planning for exploration}, that uses the propagated uncertainty to infer, in real time, the best action for exploration. The resulting exploration is informed, sequential over multiple time steps, and accounts for uncertainty in decisions that lie multiple steps into the future (deep exploration).
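As a rough illustration only, and not the exact rule used in our implementation, the following Python sketch shows how a propagated uncertainty estimate can be combined with a value estimate to pick an exploratory action at the root of a search; the function name and the additive bonus weighted by \texttt{beta} are assumptions made for exposition.
\begin{verbatim}
import numpy as np

def select_exploration_action(q_values, uncertainties, beta=1.0):
    # q_values:      value estimates produced by planning, one per action.
    # uncertainties: propagated epistemic uncertainty (e.g. std.), one per action.
    # beta:          trade-off between exploitation and exploration.
    scores = np.asarray(q_values) + beta * np.asarray(uncertainties)
    return int(np.argmax(scores))

# Action 1 has a slightly lower value but a much larger uncertainty,
# so the exploratory choice prefers it.
select_exploration_action([0.50, 0.45, 0.20], [0.01, 0.30, 0.05])  # -> 1
\end{verbatim}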
To evaluate our method with the state-of-the-art algorithm MuZero, we incorporate different uncertainty estimation mechanisms, modify the Monte-Carlo tree search planning used by MuZero to incorporate our framework, and address the challenges of learning from off-policy, exploratory trajectories with an algorithm that learns from on-policy targets. Our results demonstrate that planning for exploration achieves effective deep exploration, even when deployed with an algorithm that learns from on-policy targets and with standard, scalable uncertainty estimation mechanisms.
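For intuition on how uncertainty can be carried through the search tree, here is a minimal sketch of a backup step that propagates an uncertainty estimate alongside the value; the node attributes and the discounted propagation rule are simplifying assumptions for illustration, not the exact statistics of our MuZero modification.
\begin{verbatim}
def backup(search_path, rewards, leaf_value, leaf_uncertainty, discount=0.997):
    # search_path: nodes from root to leaf, each with visit_count,
    #              value_sum and uncertainty_sum attributes (illustrative).
    # rewards:     reward obtained on the transition into each node.
    value, uncertainty = leaf_value, leaf_uncertainty
    for node, reward in zip(reversed(search_path), reversed(rewards)):
        node.visit_count += 1
        node.value_sum += value
        node.uncertainty_sum += uncertainty
        # Both quantities are discounted as they move one step toward the root.
        value = reward + discount * value
        uncertainty = discount * uncertainty
\end{verbatim}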
We further provide an ablation study showing that the methodology we propose for generating on-policy targets from exploratory trajectories is effective at alleviating the adverse effects of training on trajectories that were not sampled from an exploitatory policy. We provide full access to our implementation and our algorithmic contributions through GitHub.
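One simple way to realize the idea of on-policy targets from exploratory data, shown here purely as an illustration and not as our exact procedure, is to re-plan at each stored state with the uncertainty bonus disabled and take the resulting root statistics as training targets; the \texttt{run\_search} interface and the node methods are hypothetical.
\begin{verbatim}
def on_policy_targets(states, run_search):
    # states:     observations stored from an exploratory trajectory.
    # run_search: callable that plans at a state; with explore=False it
    #             ignores the uncertainty bonus (illustrative interface).
    targets = []
    for state in states:
        root = run_search(state, explore=False)
        targets.append((root.value(), root.child_visit_distribution()))
    return targets
\end{verbatim}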