Policy Learning with Human Teachers
Using directive feedback in a Gaussian framework
Abstract
A prevalent approach for learning a control policy in the model-free domain is Reinforcement Learning (RL). A well-known disadvantage of RL is that extensive amounts of data are needed to obtain a suitable control policy. For systems that involve physical application, acquiring this vast amount of data may take an extraordinary amount of time. In contrast, humans have proven to be very efficient at finding a suitable control policy for reference tracking problems. Employing this intuitive knowledge has been shown to make model-free learning strategies suitable for physical applications. Recent studies have shown that learning a policy from directive action corrections is a very efficient way of exploiting this domain knowledge. Moreover, feedback-based methods do not necessarily require expert knowledge of modelling and control and are therefore more generally applicable. The current state of the art in directive feedback was introduced by Celemin and Ruiz-del-Solar (2015) and coined COrrective Advice Communicated by Humans (COACH). In this framework the trainer corrects the observed actions by providing directive advice for iterative policy updates. However, COACH employs Radial Basis Function (RBF) networks, which limit its applicability to higher-dimensional problems because the tuning process becomes infeasible. This study introduces Gaussian Process Coach (GPC), an algorithm that preserves COACH's structure but introduces Gaussian Processes (GPs) as an alternative to RBF networks. Moreover, the employment of GPs allows for uncertainty estimation of the policy, which is used to 1) inquire high-informative feedback samples in an Active Learning (AL) framework, 2) introduce an Adaptive Learning Rate (ALR) that adapts the learning rate to the coarse- or refinement-focused learning phase of the trainer, and 3) establish a novel sparsification technique that is specifically designed for iterative GP policy updates. We show, using both synthesized and human teachers, that the novel algorithm outperforms COACH on every domain tested, with the most pronounced difference on higher-dimensional problems. Furthermore, we demonstrate the independent contributions of AL and ALR.
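Since the abstract describes these ideas only at a conceptual level, the minimal Python sketch below illustrates one way a GP policy's predictive uncertainty could scale the step taken from a directive correction, in the spirit of the adaptive learning rate described above. It is not the thesis implementation: the names (gp_policy, act, update, advice, alpha_max) and the use of scikit-learn's GaussianProcessRegressor are assumptions made for illustration only.

```python
# Illustrative sketch (not the thesis implementation): a GP policy whose
# predictive uncertainty scales the update made from a directive correction.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Policy data: maps a 1-D state to a scalar action, starting from no samples.
X, y = np.empty((0, 1)), np.empty((0,))
gp_policy = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3)

def act(state):
    """Return the current policy's action and its uncertainty at `state`."""
    if len(X) == 0:
        return 0.0, 1.0  # uninformed prior: zero action, high uncertainty
    mean, std = gp_policy.predict(np.array([[state]]), return_std=True)
    return float(mean[0]), float(std[0])

def update(state, advice, alpha_max=0.2):
    """Shift the action at `state` in the advised direction (+1 or -1).

    The step shrinks as the policy becomes more certain, mimicking an
    adaptive learning rate driven by the GP's predictive standard deviation.
    """
    global X, y
    action, std = act(state)
    step = alpha_max * std * np.sign(advice)
    X = np.vstack([X, [[state]]])
    y = np.append(y, action + step)
    gp_policy.fit(X, y)

# Example: a teacher advises increasing the action at state 0.3.
update(0.3, advice=+1)
print(act(0.3))
```

The same predictive standard deviation could also drive the active-learning aspect mentioned above, for instance by preferentially requesting feedback in states where `act` reports the largest uncertainty; the sparsification technique for iterative GP updates is specific to the thesis and is not sketched here.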