We investigate the generalization performance of predictive models in model-based reinforcement learning when trained using maximum likelihood estimation (MLE) versus proper value equivalence (PVE) loss functions. While the more conventional MLE loss aims to fit models to predict state transitions and rewards as accurately as possible, value-equivalent methods (e.g., PVE) prioritize value-relevant features. We show that in a tabular setting, MLE-based models generalize better than their PVE counterparts when fit to a small number of training policies, whereas PVE-based models perform better as the number of policies increases. As model rank increases, generalization error tends to decrease for both MLE- and PVE-based models, and the gap between them narrows.
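To make the contrast between the two objectives concrete, the following minimal sketch sets up a small tabular MDP and evaluates an MLE-style fitting loss alongside a PVE-style value-gap loss over a finite set of training policies. All names (P_true, r_true, P_hat, r_hat, policies) and the specific loss forms are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

# "True" environment: transition tensor P[a, s, s'] and reward r[s, a].
P_true = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r_true = rng.uniform(size=(n_states, n_actions))

# Candidate approximate model (e.g. a restricted / low-rank model).
P_hat = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r_hat = rng.uniform(size=(n_states, n_actions))

def policy_value(P, r, pi):
    """Exact value V^pi of a stochastic policy pi[s, a] under model (P, r)."""
    P_pi = np.einsum('sa,asp->sp', pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum('sa,sa->s', pi, r)     # expected per-state reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def mle_style_loss(P_hat, r_hat):
    # Cross-entropy of model transitions against the true ones plus reward MSE,
    # standing in for a maximum-likelihood fit to transition/reward data.
    ce = -np.sum(P_true * np.log(P_hat + 1e-12))
    return ce + np.mean((r_true - r_hat) ** 2)

def pve_style_loss(P_hat, r_hat, policies):
    # Squared value gap summed over a finite set of training policies,
    # in the spirit of a proper-value-equivalence objective.
    gaps = [policy_value(P_true, r_true, pi) - policy_value(P_hat, r_hat, pi)
            for pi in policies]
    return sum(np.sum(g ** 2) for g in gaps)

# A few random training policies (each row pi[s, :] sums to one).
policies = [rng.dirichlet(np.ones(n_actions), size=n_states) for _ in range(3)]
print("MLE-style loss:", mle_style_loss(P_hat, r_hat))
print("PVE-style loss:", pve_style_loss(P_hat, r_hat, policies))
```

In this sketch, the MLE-style objective penalizes any mismatch in transitions and rewards, while the PVE-style objective only penalizes errors that change the value of the chosen training policies; how the number of training policies and the model's rank affect generalization under each objective is the subject of the paper.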