A Brief Introduction to Reinforcement Learning


A Brief Introduction to Reinforcement Learning
Minlie Huang, Dept. of Computer Science, Tsinghua University
aihuang@tsinghua.edu.cn
http://coai.cs.tsinghua.edu.cn/hml

Reinforcement Learning Agent
At each step t:
- The agent receives a state S_t from the environment
- The agent executes action A_t based on the received state
- The agent receives a scalar reward R_t from the environment
- The environment transitions to a new state S_{t+1}
http://www.cs.ucl.ac.uk/staff/d.silver/web/teaching_files/intro_rl.pdf
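To make this loop concrete, here is a minimal sketch of the agent-environment interaction described above; the `env`/`agent` interface (`reset`, `step`, `act`, `observe`) is a hypothetical stand-in rather than any particular library's API.

```python
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the agent-environment loop: observe S_t, act A_t,
    receive R_t, and transition to S_{t+1}."""
    state = env.reset()                                   # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                         # A_t chosen from S_t
        next_state, reward, done = env.step(action)       # R_t and S_{t+1}
        agent.observe(state, action, reward, next_state)  # hook for learning
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```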

Maze Example
States: the agent's location
Actions: N, E, S, W
Rewards: +100 for reaching the goal, -100 for reaching a dead end, -1 per time-step
http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching_files/intro_rl.pdf

Deep Reinforcement Learning
Deep learning is used to represent states, actions, or policy functions. Application areas include:
- Robotics and control
- Self-driving
- Language interaction
- System operation

Reinforcement Learning
Markov Decision Process (MDP)
[Figure from the ICML tutorial by Sergey Levine and Chelsea Finn]

An MDP Game
If the agent is at state A, which action would you prefer?

Reinforcement Learning
Policy: the agent's behavior strategy, i.e. how to choose an action given the current state in order to maximize the total reward.
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $a \sim \pi(\cdot \mid s) = P(a \mid s)$
- Fully observable: $s = o$; partially observable: $s \neq o$

Value-Based RL
- Value-based: learnt value function, implicit policy
- Policy-based: no value function, learnt policy
- Actor-critic: learnt value function, learnt policy

Q-Learning: Procedure
- Generate samples (run policy): $a \sim \pi(s)$
- Fit a model / estimate return: update $q_\pi(s, a)$
- Improve policy: $\pi(s) = \arg\max_a q_\pi(s, a)$

Value Function
A value function is a prediction of future reward: how much reward will I get from action a in state s under policy π?
The Q-value function (action-value function) gives the expected total reward; each state-action pair (s, a) has an entry q(s, a).

Value Function
The Q-value function gives the expected total reward:
- Sample a trajectory with policy π: $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, \ldots$
- Discount factor $\gamma \in [0, 1]$: how far ahead in time the algorithm looks
- Delayed, future rewards are taken into consideration
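As a small illustration (not from the slides), the discounted return that the Q-value function estimates can be computed backwards over a sampled reward sequence; the maze-style rewards used below are only an example.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([-1, -1, -1, 100]))  # ~70.19: distant rewards are shrunk by gamma
```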

Value Function
Bellman equation (with a particular policy π): $q_\pi(s, a) = \mathbb{E}[\,r_{t+1} + \gamma\, q_\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a\,]$
[Figure: backup diagram of a trajectory $s_t, a_t, s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \ldots$ ending in a terminal state]

Optimal Value Function
An optimal value function is the maximum achievable value:
$q^*(s, a) = \max_\pi q_\pi(s, a) = q_{\pi^*}(s, a)$
Act optimally: $\pi^*(s) = \arg\max_a q^*(s, a)$
Optimal value maximizes over all decisions:
$q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \cdots = r_{t+1} + \gamma \max_{a_{t+1}} q^*(s_{t+1}, a_{t+1})$

Q-Learning: Optimization
Optimal q-values obey the Bellman equation:
$q^*(s, a) = \mathbb{E}_{s'}[\,r + \gamma \max_{a'} q^*(s', a') \mid s, a\,]$
Minimize the mean-squared error between the approximate value and the target value:
$L = (r + \gamma \max_{a'} q(s', a') - q(s, a))^2$, where $r + \gamma \max_{a'} q(s', a')$ is the (biased) "true" target and $q(s, a)$ is the approximate value.
Policy $\pi(s)$: choose the action with the maximum q-value.

Q-Learning: Trial and Error
Exploration & exploitation: ε-greedy. Sufficient exploration is required to enable learning.
$a_t = \begin{cases} \arg\max_a q(s_t, a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$

Q-Learning: Algorithm
Initialize $q(s, a)$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each time step of episode):
        Take action $a$ from $s$ according to a policy derived from $q$ (e.g. ε-greedy)
        Observe reward $r$ and transition to new state $s'$
        $q(s, a) \leftarrow (1 - \alpha)\, q(s, a) + \alpha\, (r + \gamma \max_{a'} q(s', a'))$
        $s \leftarrow s'$
    until $s$ is terminal
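A minimal tabular sketch of the pseudocode above, assuming a hypothetical environment whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    q = defaultdict(float)                      # q[(s, a)], implicitly initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: q[(s, x)])
            s_next, r, done = env.step(a)
            # q(s,a) <- (1 - alpha) q(s,a) + alpha (r + gamma max_a' q(s',a'))
            target = r if done else r + gamma * max(q[(s_next, x)] for x in actions)
            q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
            s = s_next
    return q
```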

Q-Learning: Review
Strengths:
- Learns from experience sampled from the policy
- Chooses the best action for each state to get the highest long-term reward
Weaknesses:
- Over-estimates action values by using the maximum value as an approximation of the maximum expected value → Double Q-learning
- Requires large memory to store q-values as the numbers of states and actions grow → Deep Q-learning

Double Q-learning
The double estimator sometimes underestimates rather than overestimates the maximum expected value.
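The slide only names the double estimator; for reference, here is a sketch of the standard Double Q-learning update (van Hasselt, 2010), which decouples action selection from action evaluation by keeping two Q-tables (e.g. `collections.defaultdict(float)` dicts keyed by `(state, action)`).

```python
import random

def double_q_update(q_a, q_b, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """One Double Q-learning step: with probability 0.5 update table A, using
    table B to evaluate A's greedy action; otherwise do the symmetric update."""
    if random.random() < 0.5:
        update, evaluate = q_a, q_b   # update A, evaluate its greedy action with B
    else:
        update, evaluate = q_b, q_a   # update B, evaluate its greedy action with A
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda x: update[(s_next, x)])  # argmax under the updated table
        target = r + gamma * evaluate[(s_next, a_star)]           # value taken from the other table
    update[(s, a)] += alpha * (target - update[(s, a)])
```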

Function Approximation
Represent the value function by a Q-network with weights $w$: $q(s, a; w) \approx q^*(s, a)$.
Back-propagation may be used to compute the gradients with respect to $w$ in neural networks.
Popular choices: linear approximation, kernel methods, decision trees, deep learning models.
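As one of the "popular choices" listed above, here is a hedged sketch of a semi-gradient Q-learning step with a linear approximator $q(s, a; w) = w^\top \phi(s, a)$; the feature function `phi` is a hypothetical placeholder.

```python
import numpy as np

def linear_q_update(w, phi, s, a, r, s_next, actions, done, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning step for q(s, a; w) = w . phi(s, a)."""
    q_sa = w @ phi(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(w @ phi(s_next, a2) for a2 in actions)
    # for a linear model, the gradient of q(s, a; w) w.r.t. w is simply phi(s, a)
    w = w + alpha * (target - q_sa) * phi(s, a)
    return w
```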

Deep Q-Learning
Advantages:
- Makes it possible to apply the algorithm to larger problems, even when the state space is continuous
- Generalizes earlier experiences to previously unseen states
Challenge:
- Training can be unstable or divergent when a nonlinear function approximator such as a neural network is used to represent the Q-value

Q-learning: Variants
- Deep Q-Network (DQN): Playing Atari with Deep Reinforcement Learning, V. Mnih et al., NIPS 2013 Workshop
- Double DQN: Deep Reinforcement Learning with Double Q-learning, H. van Hasselt et al., AAAI 2016
- Dueling DQN: Dueling Network Architectures for Deep Reinforcement Learning, Z. Wang et al., ICML 2016

Policy-Based RL
- Value-based: learnt value function, implicit policy
- Policy-based: no value function, learnt policy
- Actor-critic: learnt value function, learnt policy

Policy-Based RL
Directly parametrize the policy: $\pi_\theta(s, a) = P[a \mid s, \theta]$
Compared to value-based RL, which:
- approximates the value function $q_w(s, a)$, and
- generates the policy from the value function, either greedily (take the action that maximizes q(s, a)) or ε-greedily (explore with a small probability):
$a_t = \begin{cases} \arg\max_a q(s_t, a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$

Advantages and Disadvantages
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- May converge to a local optimum
- High variance

Policy-based RL
Trajectory probability (Markov chain):
$\pi_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
The goal is to maximize the reward:
$\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}[r(s, a)]$
Let $\tau = (s_1, a_1, \ldots)$ be the state-action sequence:
$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)]$

Policy Search
Goal: find the best $\theta$ for the policy $\pi_\theta(\tau)$.
How do we measure the quality of a policy $\pi_\theta(\tau)$? With an objective function $J(\theta)$; in policy gradient methods, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)]$.
This is an optimization problem: find $\theta$ to maximize $J(\theta)$, e.g. by gradient ascent.

Policy Gradient
Search for a local maximum of $J(\theta)$ by ascending the gradient of the policy:
$\Delta\theta = \alpha \nabla_\theta J(\theta)$
where $\nabla_\theta J(\theta)$ is the policy gradient and $\alpha$ is the learning rate.

Policy Gradient
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$
$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$
using the likelihood-ratio identity
$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$

Policy Gradient
$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)]$
$= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$

One-Step MDPs
A simple class of one-step MDPs:
- Start in state $s \sim d(s)$
- Obtain reward $r = R_{s,a}$ after one time-step
- Then terminate
Compute the policy gradient with likelihood ratios:
$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a)\, R_{s,a}$
$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, R_{s,a} = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, r]$
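A sketch (not from the slides) of estimating this likelihood-ratio gradient by sampling in a one-step MDP with a single state, using a softmax policy over action preferences; `reward_fn` is a hypothetical reward function.

```python
import numpy as np

def softmax_policy(theta):
    """pi_theta(a): softmax over action preferences (theta is a NumPy vector)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient_estimate(theta, reward_fn, n_samples=1000, seed=0):
    """Monte Carlo estimate of grad J(theta) = E[grad log pi_theta(a) * r]."""
    rng = np.random.default_rng(seed)
    pi = softmax_policy(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(theta), p=pi)       # sample an action from pi_theta
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0                  # d log softmax / d theta = onehot(a) - pi
        grad += grad_log_pi * reward_fn(a)
    return grad / n_samples

# e.g. policy_gradient_estimate(np.zeros(3), lambda a: [1.0, 0.0, 2.0][a])
```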

Policy Gradient Theorem
The policy gradient theorem:
- Generalizes the likelihood-ratio approach to multi-step MDPs
- Replaces the instantaneous reward r with the long-term value $R^\pi(\tau)$
Policy Gradient Theorem: for any differentiable policy $\pi_\theta(\tau)$ and any policy objective function $J(\theta)$, the policy gradient is
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, R^\pi(\tau)]$

REINFORCE
How to update the policy with the policy gradient? Use stochastic gradient ascent, with the delayed reward (return) $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$.
function REINFORCE
    Initialise $\theta$ arbitrarily
    for each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do
        for $t = 1$ to $T-1$ do
            $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
        end for
    end for
    return $\theta$
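A minimal sketch of REINFORCE with a tabular softmax policy, following the pseudocode above; `sample_episode(policy)` is a hypothetical roll-out function returning `(state, action, reward)` triples, and the discounting of the return is an assumption (the pseudocode leaves it implicit).

```python
import numpy as np

def reinforce(sample_episode, n_states, n_actions, alpha=0.01, gamma=0.99, episodes=1000):
    """REINFORCE: stochastic gradient ascent on J(theta) using sampled returns v_t."""
    theta = np.zeros((n_states, n_actions))     # policy parameters

    def policy(s):                              # softmax policy pi_theta(.|s)
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    for _ in range(episodes):
        episode = sample_episode(policy)        # [(s_1, a_1, r_2), ..., (s_{T-1}, a_{T-1}, r_T)]
        g, returns = 0.0, []
        for (_, _, r) in reversed(episode):     # v_t = (discounted) return from step t
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for (s, a, _), v_t in zip(episode, returns):
            grad_log = -policy(s)
            grad_log[a] += 1.0                  # grad_theta log pi_theta(a|s) for softmax
            theta[s] += alpha * grad_log * v_t  # theta <- theta + alpha * grad log pi * v_t
    return theta
```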

Review
Policy-based RL: directly parametrize the policy.
Objective function $J(\theta)$: measures the quality of the policy, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)]$
Policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, R^\pi(\tau)]$
REINFORCE algorithm: stochastic gradient ascent with the delayed reward.

Actor-Critic
- Value-based: learnt value function, implicit policy
- Policy-based: no value function, learnt policy
- Actor-critic: learnt value function, learnt policy

Actor-Critic
The policy gradient still has high variance. Use a critic to evaluate the action just selected.
State-value function and state-action value function: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[q^\pi(s, a)]$
Actor-critic maintains two sets of parameters:
- Actor (policy gradient): updates the policy parameters $\theta$
- Critic (function approximation): updates the parameters $\phi$ of the state-value function

Actor-Critic
Approximate the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, R^\pi(\tau)] \approx \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)\, (r + \gamma V_\phi(s') - V_\phi(s))]$
Fit $V_\phi(s)$ to the sampled reward sums: $y^\pi(s_t, a_t) = r(s_t, a_t) + \gamma V_\phi(s_{t+1})$
Target the TD error, as in Q-learning: $L(\phi) = \frac{1}{2}\, \| r + \gamma V_\phi(s') - V_\phi(s) \|^2$

Actor-Critic: Procedure
- Generate samples (run policy): $a \sim \pi_\theta(s)$
- Fit a model / estimate return: update the critic $V_\phi$
- Improve policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Actor-Critic: Algorithm
Initialize the actor $\pi$ with parameters $\theta$ and the critic $V$ with parameters $\phi$ arbitrarily
Repeat (for each episode):
    Initialize $s$
    Repeat (for each time step of episode):
        Take action $a$ from $s$ according to the actor: $a \sim \pi_\theta(s)$
        Observe reward $r$ and transition to new state $s'$
        Evaluate the TD error: $A(s) = r + \gamma V_\phi(s') - V_\phi(s)$
        Update the critic: $\phi \leftarrow \phi + \beta \nabla_\phi V_\phi(s)\, A(s)$
        Recalculate $A(s)$ using the new critic $V_\phi$
        Update the actor: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, A(s)$
        $s \leftarrow s'$
    until $s$ is terminal
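A minimal sketch of the one-step actor-critic above with a tabular softmax actor and a tabular critic; the environment interface (`reset`, `step`), integer state indices, and the learning rates are all assumptions for illustration.

```python
import numpy as np

def actor_critic(env, n_states, n_actions, alpha=0.01, beta=0.1, gamma=0.99, episodes=500):
    """One-step TD actor-critic: the critic V estimates state values,
    the actor pi_theta is updated with the TD error A(s)."""
    theta = np.zeros((n_states, n_actions))   # actor parameters
    v = np.zeros(n_states)                    # critic: state-value estimates
    rng = np.random.default_rng(0)

    def policy(s):                            # softmax actor pi_theta(.|s)
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            pi = policy(s)
            a = rng.choice(n_actions, p=pi)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * v[s_next]
            td_error = target - v[s]          # A(s) = r + gamma V(s') - V(s)
            v[s] += beta * td_error           # critic update
            td_error = target - v[s]          # recompute A(s) with the updated critic
            grad_log = -pi.copy()
            grad_log[a] += 1.0                # grad_theta log pi_theta(a|s)
            theta[s] += alpha * grad_log * td_error   # actor update
            s = s_next
    return theta, v
```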

Summary and Discussions

Reinforcement Learning
Sequential decisions: the current decision affects future decisions.
Trial and error: just try, and do not worry about making mistakes.
- Explore (new possibilities)
- Exploit (the current best policy)
Future reward: maximize future rewards instead of just the immediate reward at each step.
- Remember q(s, a)

Difference to Supervised Learning
Supervised learning: given a set of samples $(x_i, y_i)$, estimate $f: X \to Y$.

Difference to Supervised Learning
You know what the true goal is, but not how to achieve it. Learning happens through interactions with the environment (trial and error). There are many possible solutions (policies); which one is optimal?