Slides credited from Dr. David Silver & Hung-Yi Lee

Review: Reinforcement Learning

Reinforcement Learning. RL is a general-purpose framework for decision making. RL is for an agent with the capacity to act; each action influences the agent's future state, and success is measured by a scalar reward signal. The big three: action, state, reward.

Agent and Environment. (Diagram: at each step t the agent receives an observation o_t and a reward r_t from the environment and takes an action a_t, e.g., MoveRight or MoveLeft.)

Major Components in an RL Agent. An RL agent may include one or more of these components. Value function: how good is each state and/or action. Policy: the agent's behavior function. Model: the agent's representation of the environment.

Reinforcement Learning Approaches. Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy. Policy-based RL: search directly for the optimal policy π*, the policy achieving maximum future reward. Model-based RL: build a model of the environment and plan (e.g., by lookahead) using the model.

RL Agent Taxonomy. (Diagram: model-free methods split into value-based (learning a critic), policy-based (learning an actor), and actor-critic approaches; model-based methods additionally learn a model.)

Deep Reinforcement Learning. Idea: use deep learning for reinforcement learning. Use deep neural networks to represent the value function, the policy, and the model, and optimize the loss function by SGD.

Value-Based Approach: LEARNING A CRITIC

Critic = Value Function. Idea: evaluate how good the actor is. State value function V^π(s): when using actor π, the expected total reward after seeing observation (state) s; the output is a scalar, larger for better situations and smaller for worse ones. A critic does not determine the action, but an actor can be found from a critic.

Monte-Carlo for Estimating V^π(s). Monte-Carlo (MC): the critic watches π playing the game. MC learns directly from complete episodes: no bootstrapping. Idea: value = empirical mean return. After seeing s_a, the cumulated reward until the end of the episode is G_a; after seeing s_b, it is G_b. Issue: long episodes delay learning.
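
As a rough illustration of the MC idea (not from the slides), here is a minimal tabular sketch, assuming hypothetical episode data collected while watching π and stored as lists of (state, reward) pairs:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    """Monte-Carlo estimate of V^pi(s): the empirical mean of observed returns.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs gathered while watching the actor pi play.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return G from each state onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # Value estimate = empirical mean return per state (every-visit MC).
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```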

Temporal-Difference for Estimating V^π(s). Temporal-difference (TD): the critic watches π playing the game. TD learns directly from incomplete episodes by bootstrapping; TD updates a guess towards a guess. Idea: update the value toward the estimated return, i.e., move V^π(s_t) toward r_t + γ V^π(s_{t+1}).
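
A matching TD(0) sketch under the same assumptions (tabular values stored in a dict; the transition tuple layout is hypothetical):

```python
def td0_update(V, transition, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1})."""
    s, r, s_next, done = transition                 # hypothetical (s_t, r_t, s_{t+1}, done) tuple
    v_s, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    target = r if done else r + gamma * v_next      # "update a guess towards a guess"
    V[s] = v_s + alpha * (target - v_s)
    return V
```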

MC vs. TD. Monte-Carlo (MC): large variance, unbiased, does not exploit the Markov property. Temporal-Difference (TD): smaller variance, may be biased, exploits the Markov property.

MC vs. TD.

Critic = Value Function. State-action value function Q^π(s, a): when using actor π, the expected total reward after seeing observation (state) s and taking action a; the output is a scalar (enumerating all outputs works for discrete actions only).

Q-Learning. Given Q^π(s, a), find a new actor π' that is "better" than π, i.e., π'(s) = arg max_a Q^π(s, a). π' has no extra parameters of its own (it depends only on the value function), which makes this approach unsuitable for continuous actions. The loop: π interacts with the environment → learn Q^π(s, a) by TD or MC → find a new actor π' better than π → set π = π' and repeat.

Q-Learning. Goal: estimate the optimal Q-values. Optimal Q-values obey a Bellman equation, whose right-hand side serves as the learning target; value iteration algorithms solve the Bellman equation.
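
For completeness, the Bellman optimality equation referred to here is the standard one (the bracketed quantity on the right is the learning target):

```latex
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```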

Deep Q-Networks (DQN). Estimate the value function by TD. Represent the value function by a deep Q-network with weights w. The objective is to minimize the MSE loss by SGD.

Deep Q-Networks (DQN). The objective is to minimize the MSE loss by SGD, which leads to the Q-learning gradient. Issue: naïve Q-learning oscillates or diverges with neural networks due to 1) correlations between samples and 2) non-stationary targets.
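
The loss and gradient dropped in transcription take the standard form (network weights w; the bootstrapped target is treated as a constant when differentiating):

```latex
L(w) = \mathbb{E}\!\left[\Bigl( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Bigr)^{2}\right],
\qquad
\frac{\partial L(w)}{\partial w}
  = -\,2\,\mathbb{E}\!\left[\Bigl( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Bigr)
    \frac{\partial Q(s, a; w)}{\partial w}\right]
```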

Stability Issues with Deep RL. Naive Q-learning oscillates or diverges with neural nets. 1. Data is sequential: successive samples are correlated, non-i.i.d. (not independent and identically distributed). 2. The policy changes rapidly with slight changes to Q-values: the policy may oscillate, and the distribution of data can swing from one extreme to another. 3. The scale of rewards and Q-values is unknown: naive Q-learning gradients can be unstable when backpropagated.

Stable Solutions for DQN. DQN provides a stable solution to deep value-based RL. 1. Use experience replay: break correlations in the data, bringing us back to the i.i.d. setting, and learn from all past policies. 2. Freeze the target Q-network: avoid oscillation and break correlations between the Q-network and its target. 3. Clip rewards or normalize the network adaptively to a sensible range: robust gradients.

Stable Solution 1: Experience Replay. To remove correlations, build a dataset from the agent's experience: take action a_t according to an ε-greedy policy (a small probability of exploration), store the transition (s_t, a_t, r_t, s_{t+1}) in replay memory D, sample a random mini-batch of transitions from D, and optimize the MSE between the Q-network and the Q-learning targets.
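
A minimal sketch of such a replay memory D (a plain Python container, not the DQN authors' implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay memory D: store transitions, sample random mini-batches to break correlations."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)      # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```

Each environment step pushes one transition; each learning step samples a mini-batch and fits the Q-network to the Q-learning targets.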

Stable Solution 2: Fixed Target Q-Network. To avoid oscillations, freeze the parameters used in the Q-learning target: compute the Q-learning targets w.r.t. the old, fixed parameters w⁻, optimize the MSE between the Q-network and these targets, and only periodically update the fixed parameters (w⁻ ← w).
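
A sketch of how the frozen target is typically used, assuming hypothetical `q_net` / `target_net` objects that expose `predict`, `get_weights`, and `set_weights`:

```python
import numpy as np

def q_learning_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'; w^-), computed with the frozen parameters w^-."""
    next_q = target_net.predict(np.asarray(next_states))     # shape (batch, n_actions)
    return np.asarray(rewards) + gamma * (1.0 - np.asarray(dones)) * next_q.max(axis=1)

def maybe_sync_target(q_net, target_net, step, period=10_000):
    """Periodically copy the online weights w into the frozen target weights w^-."""
    if step % period == 0:
        target_net.set_weights(q_net.get_weights())
```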

Stable Solution 3: Reward / Value Range. To avoid oscillations, control the reward / value range: DQN clips the rewards to [−1, +1], which prevents Q-values from becoming too large and ensures gradients are well-conditioned.

Deep RL in Atari Games

DQN in Atari. Goal: end-to-end learning of values Q(s, a) from pixels. Input: the state is a stack of raw pixels from the last 4 frames. Output: Q(s, a) for all joystick/button positions a. The reward is the score change for that step. (DQN Nature paper.)

DQN in Atari. (Results figure from the DQN Nature paper.)

Other Improvements: Double DQN. Issue with Nature DQN: it tends to select the action whose value is over-estimated. (van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI 2016.)

Other Improvements: Double DQN. Double DQN removes the upward bias caused by max_a Q(s, a; w): the current Q-network w is used to select actions, while the older Q-network w⁻ is used to evaluate them. (van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI 2016.)
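
A small sketch of the Double DQN target under the same hypothetical network interface as above (the current network selects, the older network evaluates):

```python
import numpy as np

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN targets: select a' with the current network, evaluate it with the older one."""
    next_q_online = q_net.predict(np.asarray(next_states))       # used to *select* actions
    next_q_target = target_net.predict(np.asarray(next_states))  # used to *evaluate* them
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return np.asarray(rewards) + gamma * (1.0 - np.asarray(dones)) * evaluated
```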

Other Improvements: Prioritized Replay. Prioritized replay weights experience based on surprise: store experience in a priority queue ordered by the DQN (TD) error.

Other Improvements: Dueling Network. The dueling network splits the Q-network into two channels: an action-independent value function, which estimates how good the state is, and an action-dependent advantage function, which estimates the additional benefit of each action. (Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, arXiv preprint, 2015.)

Other Improvements: Dueling Network. (Architecture figure: the state-value stream and the action-advantage stream are combined into Q(s, a). Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, arXiv preprint, 2015.)
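
A minimal sketch of how the two streams are usually recombined (mean-subtracted form, following Wang et al.; the value/advantage arrays are hypothetical network outputs):

```python
import numpy as np

def dueling_q_values(value, advantage):
    """Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).

    `value` has shape (batch, 1); `advantage` has shape (batch, n_actions).
    Subtracting the mean advantage keeps the V and A streams identifiable.
    """
    return value + (advantage - advantage.mean(axis=1, keepdims=True))
```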

Policy-Based Approach: LEARNING AN ACTOR

On-Policy vs. Off-Policy. On-policy: the agent being learned and the agent interacting with the environment are the same. Off-policy: the agent being learned and the agent interacting with the environment are different.

Goodness of Actor. An episode is considered as a trajectory τ with total reward R(τ). Within a trajectory, the environment's transitions and rewards are not related to your actor, while the actions are controlled by your actor; the actor is a network whose output is a distribution over actions, e.g., left 0.1, right 0.2, fire 0.7.

Goodness of Actor. An episode is considered as a trajectory τ. We define the goodness of an actor as the expected value of the reward, R̄_θ = Σ_τ R(τ) P(τ|θ), a sum over all possible trajectories. If you use the actor to play the game, each τ has probability P(τ|θ) of being sampled; using π_θ to play the game N times yields τ¹, τ², …, τ^N, i.e., sampling τ from P(τ|θ) N times, so R̄_θ ≈ (1/N) Σ_n R(τ^n).

Deep Policy Networks. Represent the policy π_θ by a deep network with weights θ. The objective is to maximize the total discounted reward by SGD (stochastic gradient ascent), updating the model parameters iteratively.

Policy Gradient. Use gradient ascent to maximize the expected reward R̄_θ. R(τ) does not have to be differentiable and can even be a black box. Use π_θ to play the game N times to obtain τ¹, τ², …, τ^N and approximate the gradient from the samples.

Policy Gradient. For an episode trajectory τ = (s_1, a_1, r_1, …, s_T, a_T, r_T), P(τ|θ) factorizes into environment terms and policy terms; ignoring the terms not related to θ, ∇ log P(τ|θ) = Σ_t ∇ log p(a_t | s_t, θ).

Policy Gradient. Gradient ascent iteratively updates the parameters: θ ← θ + η ∇R̄_θ, with ∇R̄_θ ≈ (1/N) Σ_n Σ_t R(τ^n) ∇ log p(a_t^n | s_t^n, θ). If in τ^n the machine takes a_t^n when seeing s_t^n, then a positive R(τ^n) tunes θ to increase p(a_t^n | s_t^n) and a negative R(τ^n) tunes θ to decrease it. Important: use the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n.
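
As a concrete (if simplified) illustration of this estimator, here is a NumPy sketch for a linear softmax policy; the data layout is hypothetical, and each step's gradient is weighted by the whole-trajectory reward R(τ^n), as the slide stresses:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectories, returns):
    """Estimate grad R_bar for p(a|s) = softmax(s @ theta), theta of shape (n_features, n_actions).

    `trajectories` is a list of episodes, each a list of (state_features, action_index) pairs;
    `returns` holds R(tau^n), the cumulative reward of each whole trajectory.
    """
    grad = np.zeros_like(theta)
    for traj, R in zip(trajectories, returns):
        for s, a in traj:
            p = softmax(s @ theta)          # action probabilities at state s
            dlogp = np.outer(s, -p)         # d log p(a|s) / d theta, softmax part
            dlogp[:, a] += s                # indicator term for the action actually taken
            grad += R * dlogp               # weight by the whole-trajectory reward R(tau^n)
    return grad / len(trajectories)

# One gradient-ascent step would then be: theta = theta + eta * reinforce_gradient(...)
```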

Policy Gradient. Given the actor parameters θ, alternate between data collection (use π_θ to play the game and record trajectories with their rewards) and model update (apply the policy gradient to θ), then collect data again with the updated actor.

Improvement: Adding a Baseline. Ideally every action's probability would move according to how good it is; with sampling, however, if rewards are always positive every sampled action is pushed up, so the probability of the actions not sampled will decrease even when they are good. Subtracting a baseline b from R(τ^n) makes the weights positive or negative and mitigates this issue.
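
With the baseline, the gradient estimator is commonly written as (standard form; b can be, for example, the average return):

```latex
\nabla \bar{R}_{\theta} \;\approx\; \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}
  \bigl( R(\tau^{n}) - b \bigr)\, \nabla \log p\!\left(a_{t}^{n} \mid s_{t}^{n}, \theta\right)
```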

Actor-Critic Approach: LEARNING AN ACTOR & A CRITIC

Actor-Critic (Value-Based + Policy-Based). Estimate the value functions Q^π(s, a) and V^π(s), then update the policy based on the value-function evaluation. π' is an actual (parameterized) function that maximizes the value, so this may work for continuous actions. The loop: π interacts with the environment → learn Q^π(s, a), V^π(s) by TD or MC → update the actor from π to π' based on Q^π(s, a), V^π(s) → set π = π' and repeat.

Advantage Actor-Critic. Learn the policy (actor) using the value evaluated by the critic. The loop: π interacts with the environment → learn V^π(s) by TD or MC → update the actor based on V^π(s) → set π = π' and repeat. Advantage function: the reward r_t^n we truly obtain when taking action a_t^n, relative to the expected reward estimated by the critic for actor π; the critic's value acts as the added baseline. A positive advantage increases the probability of action a_t^n; a negative advantage decreases it.
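
One common form of this advantage, sketched in NumPy (assuming hypothetical per-step arrays of rewards and critic values):

```python
import numpy as np

def one_step_advantages(rewards, values, next_values, dones, gamma=0.99):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t): the obtained reward versus the critic's expectation.

    Positive A_t -> increase the probability of the taken action; negative A_t -> decrease it.
    V(s_t) plays the role of the baseline.
    """
    rewards, values = np.asarray(rewards, dtype=float), np.asarray(values, dtype=float)
    next_values, dones = np.asarray(next_values, dtype=float), np.asarray(dones, dtype=float)
    return rewards + gamma * (1.0 - dones) * next_values - values
```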

Advantage Actor-Critic Tips. The parameters of the actor π(s) and the critic V^π(s) can be shared: a common network processes s, and separate output heads produce the action distribution (e.g., left / right / fire) and the scalar V^π(s). Use the output entropy of π(s) as regularization: larger entropy is preferred, which encourages exploration.

Asynchronous Advantage Actor-Critic (A3C). Each worker asynchronously 1. copies the global parameters (θ¹ = θ), 2. samples some data, 3. computes gradients Δθ, and 4. updates the global model with them (θ ← θ + ηΔθ), while other workers also update the model. (Mnih et al., Asynchronous Methods for Deep Reinforcement Learning, ICML 2016.)

Pathwise Derivative Policy Gradient. The original actor-critic tells us whether a given action is good or bad; the pathwise derivative policy gradient tells us which action is good.

Pathwise Derivative Policy Gradient. The actor's output a = π(s) is fed into the critic Q^π(s, a), and the actor is updated by gradient ascent on Q^π(s, π(s)) while the critic is kept fixed; the actor and the critic together form one large network. (Silver et al., Deterministic Policy Gradient Algorithms, ICML 2014; Lillicrap et al., Continuous Control with Deep Reinforcement Learning, ICLR 2016.)

Deep Deterministic Policy Gradient (DDPG). Idea: the critic estimates the value of the current policy by DQN, and the actor updates the policy in the direction that improves Q; the critic provides the loss function for the actor. The loop: π (with added noise for exploration) interacts with the environment, transitions are stored in a replay buffer, the critic learns Q^π(s, a) from sampled transitions, and the actor π is updated to π' based on Q^π(s, a). (Lillicrap et al., Continuous Control with Deep Reinforcement Learning, ICLR 2016.)

DDPG Algorithm. Initialize the critic network θ^Q and the actor network θ^π; initialize the target critic network θ^{Q′} = θ^Q and the target actor network θ^{π′} = θ^π; initialize the replay buffer R. In each iteration: use π(s) + noise to interact with the environment, collect a set of (s_t, a_t, r_t, s_{t+1}) and put them in R; sample N examples (s_n, a_n, r_n, s_{n+1}) from R; update the critic Q to minimize the error against targets computed with the target networks; update the actor π to maximize Q(s_n, π(s_n)); update the target networks, which change more slowly than the online networks. (Lillicrap et al., Continuous Control with Deep Reinforcement Learning, ICLR 2016.)
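
A sketch of the two target-related pieces of this loop (hypothetical `target_actor` / `target_critic` objects with `predict` methods; parameters as lists of NumPy arrays). The soft update is what makes the target networks change slowly:

```python
import numpy as np

def ddpg_critic_targets(target_actor, target_critic, rewards, next_states, dones, gamma=0.99):
    """Critic targets y_n = r_n + gamma * Q'(s_{n+1}, pi'(s_{n+1})), using the target networks."""
    next_actions = target_actor.predict(np.asarray(next_states))
    next_q = target_critic.predict(np.asarray(next_states), next_actions)
    return np.asarray(rewards) + gamma * (1.0 - np.asarray(dones)) * next_q

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target_params, online_params)]
```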

DDPG in Simulated Physics. Goal: end-to-end learning of a control policy from pixels. Input: the state is a stack of raw pixels from the last 4 frames. Output: two separate CNNs are used for Q and π. (Lillicrap et al., Continuous Control with Deep Reinforcement Learning, ICLR 2016.)

Model-Based: Agent's Representation of the Environment

Model-Based Deep RL. Goal: learn a transition model of the environment and plan based on the transition model. The objective is to maximize the measured goodness of the model. Model-based deep RL is challenging and has so far failed in Atari.

Issues for Model-Based Deep RL. Compounding errors: errors in the transition model compound over the trajectory, so a long trajectory may result in totally wrong rewards. In contrast, deep value/policy networks can plan implicitly: each layer of the network performs an arbitrary computational step, so an n-layer network can look ahead n steps.

Model-Based Deep RL in Go. Monte-Carlo tree search (MCTS) simulates future trajectories and builds a large lookahead search tree with millions of positions; state-of-the-art Go programs use MCTS. Convolutional networks: a 12-layer CNN trained to predict expert moves, looking at a single position with no search at all, equals the performance of MoGo with a 10^5-position search tree (the first strong Go program). (Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, 2016. https://deepmind.com/alphago/)

OpenAI Universe. A software platform for measuring and training an AI's general intelligence via the OpenAI Gym environment.

Concluding Remarks. RL is a general-purpose framework for decision making based on interactions between an agent and an environment. An RL agent may include one or more of these components: a value function (how good is each state and/or action; learning a critic), a policy (the agent's behavior function; learning an actor), an actor-critic combination of the two, and a model (the agent's representation of the environment). RL problems can be solved by end-to-end deep learning: Reinforcement Learning + Deep Learning = AI.

References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html
ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf