An introduction to multi-armed bandits


An introduction to multi-armed bandits Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)

Plan 1. An introduction to multi-armed bandits 2. The multi-armed bandit with covariates and the k-nearest neighbour UCB algorithm.

Plan: Intro to multi-armed bandits 1. The concept of reinforcement learning 2. Multi-armed bandits 3. The exploration vs. exploitation trade-off 4. The upper confidence bound algorithm (UCB) 5. The concentration of measure phenomenon 6. A regret bound for the UCB algorithm

Supervised learning and reinforcement learning

Supervised learning

Learning with animals

Reinforcement learning An agent interacts with an environment: the agent selects an action, the environment returns a reward, and the agent learns from this feedback.

Reinforcement learning No supervision - the only feedback given is the reward. An agent's actions affect the information it receives. A sequential learning problem.

The multi-armed bandit problem

The multi-armed bandit problem At each round the learner chooses one of several bandits (slot machines) to play and receives a random reward from the chosen machine.

Multi-armed bandit formalism For t = 1, 2, ..., T: Choose an arm I_t ∈ {1, ..., K} to pull, based on the reward history. Receive a reward. For each arm i the rewards are i.i.d. with expected reward μ_i (and are assumed to be bounded).
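
As a concrete illustration of the formalism above, here is a minimal simulation sketch of the bandit interaction loop in Python. The environment (Bernoulli rewards), the arm means, and the uniformly random policy are illustrative assumptions, not part of the original slides.

    import random

    def bernoulli_bandit_round(mu, arm):
        # Reward for pulling `arm`: 1 with probability mu[arm], else 0.
        return 1.0 if random.random() < mu[arm] else 0.0

    def run_bandit(mu, T, choose_arm):
        # choose_arm(history) -> arm index; history is a list of (arm, reward) pairs.
        history = []
        total_reward = 0.0
        for t in range(1, T + 1):
            arm = choose_arm(history)
            reward = bernoulli_bandit_round(mu, arm)
            history.append((arm, reward))
            total_reward += reward
        return total_reward

    # Example: a uniformly random policy over two arms.
    if __name__ == "__main__":
        mu = [0.3, 0.6]  # unknown to the learner
        random_policy = lambda history: random.randrange(len(mu))
        print(run_bandit(mu, T=10000, choose_arm=random_policy))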

Applications Sequential clinical trials Online advertisement optimisation

Notation The expected reward of each arm; the arm pulled at time t; the number of times each arm has been pulled by time t; the empirical average of the rewards from each arm at time t.
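
The symbols on this slide were not captured in the transcription; a standard choice of notation, used in the remainder of these notes, is:

\[
\mu_i = \text{expected reward of arm } i, \qquad
I_t = \text{arm pulled at time } t,
\]
\[
T_i(t) = \sum_{s=1}^{t} \mathbf{1}\{I_s = i\}, \qquad
\hat{\mu}_i(t) = \frac{1}{T_i(t)} \sum_{s \le t :\, I_s = i} X_s ,
\]

where \(X_s\) denotes the reward received at time \(s\).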

The exploration vs. exploitation trade-off

Exploration vs. exploitation Exploration: obtain more accurate estimates - choose i so that T_i(t) is small (an arm that has been pulled only a few times). Exploitation: achieve a high reward - choose i so that the empirical average μ̂_i(t) is large.

The upper confidence bound (UCB) algorithm: Optimism in the face of uncertainty

The UCB algorithm At each time t, pull the arm with the largest upper confidence bound: the empirical average μ̂_i(t) plus a confidence width that shrinks as T_i(t) grows, e.g. √(2 log t / T_i(t)).
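
A minimal sketch of the UCB policy described above, using the standard UCB1 index of Auer et al. (2002) with confidence width √(2 log t / T_i(t)); the exact constant used on the original slides is not recoverable from the transcription.

    import math

    def ucb_policy(history, num_arms, t):
        # history: list of (arm, reward) pairs observed so far.
        pulls = [0] * num_arms
        sums = [0.0] * num_arms
        for arm, reward in history:
            pulls[arm] += 1
            sums[arm] += reward
        # Pull every arm once before using the index.
        for arm in range(num_arms):
            if pulls[arm] == 0:
                return arm
        # UCB index: empirical mean plus confidence width.
        def index(arm):
            return sums[arm] / pulls[arm] + math.sqrt(2.0 * math.log(t) / pulls[arm])
        return max(range(num_arms), key=index)

Plugged into the simulation loop sketched earlier (e.g. choose_arm = lambda history: ucb_policy(history, len(mu), len(history) + 1)), this policy concentrates its pulls on the best arm while still occasionally revisiting the others.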

Regret Compare the learner's policy with the oracle policy that always pulls an arm with the highest expected reward.

UCB Regret Bound The gap of arm i is Δ_i = μ* − μ_i, where μ* = max_j μ_j. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.
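
The bound itself was shown as an equation on the slide and was not captured; the form proved for UCB1 in Auer, Cesa-Bianchi & Fischer (2002), which is presumably what is intended here, is

\[
R_T \;\le\; \sum_{i:\,\Delta_i > 0} \frac{8 \ln T}{\Delta_i} \;+\; \Bigl(1 + \frac{\pi^2}{3}\Bigr) \sum_{i=1}^{K} \Delta_i \;=\; O(\log T),
\]

where \(R_T = \sum_{i} \Delta_i \,\mathbb{E}[T_i(T)]\) is the expected regret after \(T\) rounds.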

Big O notation We write f(T) = O(g(T)) if there is a constant C such that f(T) ≤ C·g(T) for all sufficiently large T.

UCB Regret Bound (recap) The gap of arm i is Δ_i = μ* − μ_i. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.

Concentration of measure

Concentration of measure I flip a coin 10 000 times. I get heads 7 500 times and tails 2 500 times. If I flip the same coin again, what's the probability I get a head?
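
The intended answer, made precise by the concentration results on the next slides, is that the empirical frequency 7 500 / 10 000 = 0.75 is, with very high probability, close to the true probability of heads. A quick sanity check using Hoeffding's inequality with an illustrative 95% confidence level:

\[
\mathbb{P}\bigl(|\bar{X} - p| \ge \varepsilon\bigr) \le 2 e^{-2 n \varepsilon^2}
\quad\Longrightarrow\quad
\varepsilon = \sqrt{\tfrac{1}{2n}\ln\tfrac{2}{\delta}} \approx 0.0136
\ \text{ for } n = 10\,000,\ \delta = 0.05,
\]

so the probability of heads lies in roughly \([0.736,\,0.764]\) with probability at least 0.95.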

Hoeffding's Inequality (1963) Let X_1, ..., X_n be independent random variables with X_j ∈ [a_j, b_j] for each j. Define the empirical average X̄ = (1/n)(X_1 + ... + X_n). Then X̄ concentrates around its expectation, as quantified below.
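
The inequality itself appeared as an equation on the slide; its standard form is

\[
\mathbb{P}\bigl(\,|\bar{X} - \mathbb{E}[\bar{X}]| \ge \varepsilon \,\bigr)
\;\le\; 2 \exp\!\Bigl(-\frac{2 n^2 \varepsilon^2}{\sum_{j=1}^{n} (b_j - a_j)^2}\Bigr)
\quad\text{for every } \varepsilon > 0,
\]

which reduces to \(2 e^{-2 n \varepsilon^2}\) when every \(X_j\) takes values in \([0, 1]\).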

Proof of the UCB regret bound

UCB Regret Bound (recap) The gap of arm i is Δ_i = μ* − μ_i. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.

Notation (recap) The expected reward of each arm; the arm pulled at time t; the number of times each arm has been pulled by time t; the empirical average of the rewards from each arm at time t.

Bad events Let's define, for each arm i and each time t, the bad event that the empirical average of arm i lies outside its confidence interval. Lemma 1: each bad event has small probability.

Proof of Lemma 1 Apply Hoeffding's Inequality to the rewards from arm i.
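
The displayed calculation was not captured. A standard version of this step (the exact confidence width used in the talk may differ): with confidence width \(\sqrt{2\ln t / s}\) after \(s\) pulls of arm \(i\), Hoeffding's inequality gives

\[
\mathbb{P}\Bigl(\,\bigl|\hat{\mu}_{i,s} - \mu_i\bigr| \ge \sqrt{\tfrac{2 \ln t}{s}}\,\Bigr)
\;\le\; 2 \exp\!\bigl(-2 s \cdot \tfrac{2 \ln t}{s}\bigr)
\;=\; 2\, t^{-4},
\]

where \(\hat{\mu}_{i,s}\) is the empirical average of the first \(s\) rewards from arm \(i\). Summing such bounds over time gives a convergent series, which is what the regret proof needs.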

Lemma 1 (recap) For each arm i and time t, the bad event, that the empirical average of arm i lies outside its confidence interval, has small probability.

Lemma 2 Suppose a suboptimal arm i is pulled at time t, and arm i has already been pulled sufficiently many times (relative to log t / Δ_i²). Then at least one of the bad events, for arm i or for the optimal arm, must hold.

Proof of Lemma 2 Suppose neither of the two bad events holds. Then both empirical averages lie inside their confidence intervals, and from the fact that arm i was pulled (so its upper confidence bound was at least that of the optimal arm) we derive a contradiction.

Proof of Lemma 2 Combining inequalities 1 and 2 with the fact that arm i was preferred to the optimal arm yields a contradiction, completing the proof.

Lemma 2 (recap) If a suboptimal arm i is pulled at time t after it has already been pulled sufficiently many times, then at least one of the bad events, for arm i or for the optimal arm, must hold.

Lemma 3 (Lemma 1 + Lemma 2) Suppose a suboptimal arm i has already been pulled sufficiently many times. Then the probability that arm i is pulled at time t is small.

Lemma 4 Lemma 3 bounded the probability of pulling a suboptimal arm once it has been pulled sufficiently often. Lemma 4: the expected number of times a suboptimal arm i is pulled up to time T is at most of order log T / Δ_i².

Proof of Lemma 4 Split the expected number of pulls of arm i into those made before the threshold of order log T / Δ_i² is reached and those made afterwards, and bound the latter by summing the bad-event probabilities from Lemma 1 (via Lemma 3) over time.

Proof of the UCB regret bound Write the regret as the sum over suboptimal arms of the gap Δ_i times the expected number of pulls of arm i, and apply Lemma 4 to each term; this yields the logarithmic regret bound.
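
In symbols (a standard way of assembling these steps; the constants are those of the UCB1 analysis rather than necessarily those on the original slides):

\[
R_T \;=\; \sum_{i:\,\Delta_i > 0} \Delta_i\, \mathbb{E}[T_i(T)]
\;\le\; \sum_{i:\,\Delta_i > 0} \Delta_i \Bigl( \frac{8 \ln T}{\Delta_i^2} + O(1) \Bigr)
\;=\; \sum_{i:\,\Delta_i > 0} \frac{8 \ln T}{\Delta_i} \;+\; O(1)
\;=\; O(\log T).
\]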

The multi-armed bandit problem with covariates and the k-nearest neighbour UCB algorithm Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)

Plan 1. Multi-armed bandits with covariates 2. Non-parametric assumptions 3. Partition based policies and the UCBogram 4. Manifolds 5. The k-nearest neighbour UCB algorithm

The multi-armed bandit with covariates

Bandits with side-information Multi-armed bandits with additional side-information. Example 1: Personalised sequential clinical trials - access to a patient's genome sequence. Example 2: Personalised online advertisement placement - access to a customer's interests, browsing and purchasing history.

Multi-armed bandit with covariates For t = 1, 2, ..., T: Observe a covariate X_t. Choose an arm to pull based on X_t and the reward history. Receive a reward.

Multi-armed bandit with covariates The covariates X_t are drawn i.i.d. from a distribution on the covariate space. For each arm i, given the covariate X_t = x the reward is drawn from a distribution whose expected reward f_i(x) depends on x.

Bandits with covariates

Regret for bandits with covariates Compare the learner's policy with the oracle policy that, for each covariate x, pulls an arm maximising the expected reward f_i(x). Regret: the expected shortfall of the learner's cumulative reward relative to the oracle.
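
Written out, with f_i as introduced above (this notation is an editorial choice; the slides' own symbols were not captured):

\[
f^{*}(x) \;=\; \max_{i} f_i(x), \qquad
R_T \;=\; \mathbb{E}\Bigl[\, \sum_{t=1}^{T} \bigl( f^{*}(X_t) - f_{I_t}(X_t) \bigr) \Bigr].
\]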

Non-parametric assumptions

The Lipschitz assumption For each arm i, define the reward function f_i by letting f_i(x) be the expected reward of arm i given covariate x. Lipschitz assumption: each f_i is Lipschitz, so nearby covariates have nearby expected rewards.

The margin assumption Define the margin function as the gap, at each covariate x, between the best arm's expected reward and the second best. Margin assumption: covariates where this gap is small (but non-zero) have small probability.
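
The formal statements were displayed as equations and were not captured; the standard forms of these two conditions (cf. Rigollet & Zeevi 2010, Perchet & Rigollet 2011), in the notation used above, are

\[
\text{(Lipschitz)}\quad |f_i(x) - f_i(x')| \;\le\; L\, \rho(x, x') \quad \text{for all arms } i \text{ and covariates } x, x',
\]
\[
\text{(Margin)}\quad \mathbb{P}\bigl( 0 < \Delta(X) \le \delta \bigr) \;\le\; C\, \delta^{\alpha} \quad \text{for all } \delta > 0,
\]

where \(\Delta(x)\) denotes the gap between the largest and second largest of the \(f_i(x)\), and \(\alpha > 0\) is the margin parameter.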

Histogram based policies

The UCBogram Rigollet and Zeevi (COLT, 2010) consider the UCBogram: 1. Partition the covariate space into cubes. 2. Apply UCB locally on each of the separate cubes.

The UCBogram (diagram: the covariate space partitioned into a grid of cubes, with a separate UCB instance running in each cube)

The UCBogram Rigollet and Zeevi (2010): Suppose the covariate distribution is absolutely continuous with a well-behaved density, and that the bandit satisfies the Lipschitz condition and the margin condition. Then the UCBogram satisfies a corresponding regret bound.

Adaptively Binned Successive Elimination Perchet & Rigollet (2011): ABSE: 1. Refine the partition adaptively as more data arrive. 2. Run a standard bandit algorithm (successive elimination) locally within each cell until the subsequent refinement.

Adaptively Binned Successive Elimination (diagram: the partition of the covariate space is refined over time, with a successive elimination (SE) instance running in each cell)

Adaptively Binned Successive Elimination Perchet & Rigollet (2011): Suppose the covariate distribution is absolutely continuous with a well-behaved density, and that the bandit satisfies the Lipschitz condition and the margin condition with any margin parameter. Then ABSE satisfies a corresponding regret bound.

Bandits on manifolds

Manifolds A manifold of dimension d_M looks locally like d_M-dimensional Euclidean space.

Manifolds In many applications the ambient dimension d is large, but the data lie close to a smooth manifold of dimension d_M with d_M much smaller than d. E.g. statistical regularities in the space of MRI scans / genome sequences. We should be able to exploit this property - but the manifold is not known in advance!

The k-nearest neighbour method in supervised learning

The k-nearest neighbour method The k-nearest neighbour method is simple & intuitive, and effectively manages the bias-variance trade-off in supervised learning.

The k-nearest neighbour method Kpotufe (2012): k-nearest neighbours achieves minimax optimal rates in supervised regression (adapts to intrinsic dimension) Chaudhuri & Dasgupta (2014): k-nearest neighbours achieves minimax optimal rates in supervised classification with the margin condition Reeve & Brown (2017): k-nearest neighbours achieves minimax optimal rates for cost-sensitive learning on manifolds.

k-nearest Neighbours UCB

K-Nearest Neighbours UCB Given a covariate x and a value of k, we define: the number of times, amongst the k nearest neighbours of x, that arm i was pulled; the cumulative reward over all the times that arm i was pulled while the covariate was amongst the k nearest neighbours of x; and the k-nearest neighbour reward estimate, the ratio of the two.

Defining uncertainty Let r_k(x) denote the distance from x to its k-th nearest neighbour. The uncertainty of the k-nearest neighbour estimate for arm i at x is the sum of a standard deviation term, which shrinks as more of the k neighbours correspond to pulls of arm i, and a bias term, which grows with r_k(x).
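
The displayed formula was not captured; a plausible reconstruction, consistent with the two terms named on the slide (the exact constants belong to the authors' paper and are not preserved by the transcription), is

\[
U_i(x, t) \;\approx\; \underbrace{\sqrt{\frac{c \,\log t}{N_{i,k}(x, t)}}}_{\text{standard deviation}} \;+\; \underbrace{L\, r_k(x)}_{\text{bias}},
\]

where \(N_{i,k}(x, t)\) is the number of the \(k\) nearest neighbours of \(x\) at which arm \(i\) was pulled, \(L\) is the Lipschitz constant, and \(c\) is a constant.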

K-nearest neighbour UCB

Choosing k? Cross-validation is not a good option in the online setting. In the supervised regression setting, Kpotufe (2012) chooses k by minimising an upper bound on the squared error. Here we choose k to minimise the uncertainty.

The K-NN UCB algorithm For t = 1, 2, ...: Observe a covariate X_t. For each arm, choose k to minimise the uncertainty and compute the corresponding upper confidence bound; pull the arm with the largest bound. Receive a reward.
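
A minimal Python sketch of this policy, under the reconstruction of the uncertainty term given above. Everything here (the Euclidean metric, the constants, the brute-force neighbour search) is an illustrative assumption rather than the authors' implementation.

    import math

    def knn_ucb_choose_arm(x, history, num_arms, t, lip=1.0, c=2.0):
        # history: list of (covariate, arm, reward) triples observed so far.
        # For each arm, choose k to minimise the uncertainty term, then pull
        # the arm whose estimate-plus-uncertainty (UCB index) is largest.
        if not history:
            return 0
        dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        ordered = sorted(history, key=lambda h: dist(x, h[0]))  # neighbours of x, nearest first

        best_arm, best_index = 0, -float("inf")
        for arm in range(num_arms):
            pulls, reward_sum = 0, 0.0
            min_uncertainty, index_at_min = float("inf"), None
            for k, (cov, a, r) in enumerate(ordered, start=1):
                if a == arm:
                    pulls += 1
                    reward_sum += r
                if pulls == 0:
                    continue  # arm not yet pulled among the k nearest neighbours
                r_k = dist(x, cov)  # distance from x to its k-th nearest neighbour
                uncertainty = math.sqrt(c * math.log(max(t, 2)) / pulls) + lip * r_k
                if uncertainty < min_uncertainty:
                    min_uncertainty = uncertainty
                    index_at_min = reward_sum / pulls + uncertainty
            if index_at_min is None:
                return arm  # arm has never been pulled: explore it immediately
            if index_at_min > best_index:
                best_index, best_arm = index_at_min, arm
        return best_arm

In a simulation loop like the one from the first part of the talk, this would replace the UCB policy, with the covariate observed at each round passed in as x.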

The Lipschitz assumption (recap) For each arm i, define the reward function f_i by letting f_i(x) be the expected reward of arm i given covariate x. Lipschitz assumption: each f_i is Lipschitz, so nearby covariates have nearby expected rewards.

The margin assumption (recap) Define the margin function as the gap, at each covariate x, between the best arm's expected reward and the second best. Margin assumption: covariates where this gap is small (but non-zero) have small probability.

The dimension assumption Holds whenever the covariates are drawn from a well-behaved measure on a compact Riemannian manifold of dimension d_M.

The Regret Bound Reeve, Mellor & Brown (2017): Suppose that: 1) The Lipschitz assumption holds, 2) The margin assumption holds, 3) The dimension assumption holds, Then we have the following regret bound:
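
The bound itself was displayed as an equation and is not recoverable from the transcription. Per the discussion slide below, it matches the minimax optimal rate up to a logarithmic factor; for Lipschitz reward functions with margin parameter α on a d_M-dimensional manifold, that minimax rate takes the form

\[
R_T \;=\; \tilde{O}\!\Bigl( T^{\,1 - \frac{\alpha + 1}{2 + d_M}} \Bigr),
\]

so the dependence on dimension is through the intrinsic dimension \(d_M\) rather than the ambient dimension \(d\).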

Empirical validation (plot: cumulative regret against time, for a bandit with covariates on a two-dimensional manifold embedded in a fifteen-dimensional feature space)

Discussion Doesn't require prior knowledge of: the time horizon; the dimension of the manifold. Achieves the minimax optimal rate, up to a logarithmic factor. The regret bound extends to any finite number of arms & reward distributions with sub-Gaussian noise.

Thank you for listening!