An introduction to multi-armed bandits Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)
Plan
1. An introduction to multi-armed bandits
2. The multi-armed bandit with covariates and the k-nearest neighbour UCB algorithm
Plan: Intro to multi-armed bandits
1. The concept of reinforcement learning
2. Multi-armed bandits
3. The exploration vs. exploitation trade-off
4. The upper confidence bound algorithm (UCB)
5. The concentration of measure phenomenon
6. A regret bound for the UCB algorithm
Supervised learning and reinforcement learning
Supervised learning
Learning with animals
Reinforcement learning
[Diagram: the agent sends an action to the environment; the environment returns a reward; the agent learns from this feedback loop]
Reinforcement learning
- No supervision: the only feedback given is the reward
- An agent's action affects the information it receives
- A sequential learning problem
The multi-armed bandit problem
The multi-armed bandit problem
[Figure: a gambler chooses between Bandit 1 and Bandit 2, each pull yielding a random reward]
Multi-armed bandit formalism
For t = 1, 2, ..., n:
- Choose an arm a_t ∈ {1, ..., K} to pull, based on the reward history
- Receive a reward r_t ∈ [0, 1]
Rewards from arm i are i.i.d. with expected value μ_i
Applications
- Sequential clinical trials
- Online advertisement optimisation
Notation
- μ_i: expected reward for each arm i
- a_t: arm pulled at time t
- T_i(t): number of times arm i has been pulled at time t
- μ̂_i(t): empirical average of rewards for arm i at time t
The exploration vs. exploitation trade-off
Exploration vs. exploitation
- Exploration: obtain more accurate estimates. Choose i so that T_i(t) is small.
- Exploitation: achieve a high reward. Choose i so that μ̂_i(t) is large.
The upper confidence bound (UCB) algorithm: Optimism in the face of uncertainty
The UCB algorithm
At each time t, pull the arm with the highest upper confidence bound:
a_t = argmax_i { μ̂_i(t) + sqrt(2 log t / T_i(t)) }
The bonus term is large for rarely pulled arms (exploration), while the empirical average is large for high-reward arms (exploitation).
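The UCB policy can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `ucb1`, the `pull` callback, and the assumption of rewards in [0, 1] are our own.

```python
import math
import random  # used only in the Bernoulli demo below


def ucb1(pull, n_arms, horizon):
    """Run the UCB policy of Auer et al. (2002).

    pull(i) returns a random reward in [0, 1] for arm i.
    Returns the total reward and the pull counts per arm.
    """
    counts = [0] * n_arms   # T_i(t): times each arm has been pulled
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull each arm once to initialise
        else:
            # optimism in the face of uncertainty:
            # empirical mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```

For example, with three Bernoulli arms of means 0.9, 0.5 and 0.1, the policy concentrates its pulls on the first arm while still sampling the others occasionally.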
Regret
Compare the policy with the oracle policy that always pulls an optimal arm:
R_n = n · max_i μ_i − E[ Σ_{t=1}^n r_t ]
UCB Regret Bound
The gap: Δ_i = max_j μ_j − μ_i
Auer et al. (2002): the UCB policy achieves a logarithmic regret bound:
R_n ≤ Σ_{i : Δ_i > 0} 8 log n / Δ_i + O(1) = O(log n)
Big O notation
f(n) = O(g(n)) means there exist constants C and n_0 such that f(n) ≤ C · g(n) for all n ≥ n_0.
Concentration of measure
Concentration of measure
I flip a coin 10,000 times. I get heads 7,500 times and tails 2,500 times. If I flip the same coin again, what's the probability I get a head?
Hoeffding's Inequality (1963)
Let X_1, ..., X_m be independent random variables with X_j ∈ [0, 1] for j = 1, ..., m.
Define the empirical average X̄ = (1/m) Σ_{j=1}^m X_j.
Then for every ε > 0,
P( |X̄ − E[X̄]| ≥ ε ) ≤ 2 exp(−2 m ε²)
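Hoeffding's inequality can be checked empirically. The following small Python sketch (our own illustrative code, not part of the lecture) compares the observed frequency of large deviations of a Bernoulli average against the bound 2·exp(−2mε²).

```python
import math
import random


def hoeffding_bound(m, eps):
    # right-hand side of Hoeffding's inequality for [0, 1]-valued variables
    return 2 * math.exp(-2 * m * eps * eps)


def deviation_frequency(m, eps, trials, p=0.5, seed=0):
    """Estimate P(|empirical mean - p| >= eps) for m Bernoulli(p) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials
```

With m = 100 and ε = 0.1, the bound evaluates to about 0.27, while the simulated deviation frequency for a fair coin is considerably smaller; the inequality holds but is not tight.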
Proof of the UCB regret bound
Recall the goal. The gap: Δ_i = max_j μ_j − μ_i. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.
Recall the notation: μ_i (expected reward of arm i), a_t (arm pulled at time t), T_i(t) (number of times arm i has been pulled at time t), μ̂_i(t) (empirical average of rewards for arm i at time t).
Bad events
For i ∈ {1, ..., K} and t ∈ ℕ, define the bad events
E_{i,t} = { |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / T_i(t)) }.
Lemma 1: P(E_{i,t}) ≤ 2 t^{−3}.
Proof of Lemma 1
Fix a possible value s ∈ {1, ..., t} of T_i(t). By Hoeffding's Inequality with ε = sqrt(2 log t / s),
P( |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / s) and T_i(t) = s ) ≤ 2 exp(−2 s · (2 log t) / s) = 2 t^{−4}.
Taking a union bound over the t possible values of s gives P(E_{i,t}) ≤ 2 t^{−3}.
Recall: we defined the bad events E_{i,t} = { |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / T_i(t)) }, and Lemma 1 states that P(E_{i,t}) ≤ 2 t^{−3}.
Lemma 2
Suppose a_t = i, Δ_i > 0 and T_i(t) > 8 log t / Δ_i². Then at least one of E_{i,t} or E_{i*,t} holds, where i* denotes an optimal arm.
Proof of Lemma 2
Suppose neither E_{i,t} nor E_{i*,t} holds. From T_i(t) > 8 log t / Δ_i² we get sqrt(2 log t / T_i(t)) < Δ_i / 2. Then:
1. μ̂_i(t) + sqrt(2 log t / T_i(t)) < μ_i + 2 · sqrt(2 log t / T_i(t)) < μ_i + Δ_i = μ*
2. μ̂_{i*}(t) + sqrt(2 log t / T_{i*}(t)) ≥ μ*
1. & 2. together imply that the UCB index of i* strictly exceeds that of i, contradicting a_t = i.
Lemma 3
Suppose Δ_i > 0. Then, combining Lemma 1 and Lemma 2,
P( a_t = i and T_i(t) > 8 log t / Δ_i² ) ≤ P(E_{i,t}) + P(E_{i*,t}) ≤ 4 t^{−3}.
Lemma 4
Suppose Δ_i > 0. Then E[T_i(n)] ≤ 8 log n / Δ_i² + O(1).
Proof of Lemma 4
Write T_i(n) ≤ ⌈8 log n / Δ_i²⌉ + Σ_{t=1}^n 1{ a_t = i and T_i(t) > 8 log t / Δ_i² }.
Taking expectations and applying Lemma 3,
E[T_i(n)] ≤ 8 log n / Δ_i² + 1 + Σ_{t=1}^∞ 4 t^{−3} ≤ 8 log n / Δ_i² + O(1).
Proof of UCB Regret bound
The regret decomposes as R_n = Σ_i Δ_i · E[T_i(n)]. Applying Lemma 4 to each arm with Δ_i > 0,
R_n ≤ Σ_{i : Δ_i > 0} Δ_i · ( 8 log n / Δ_i² + O(1) ) = Σ_{i : Δ_i > 0} 8 log n / Δ_i + O(1) = O(log n). ∎
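The logarithmic growth of the regret can also be observed numerically. The sketch below is our own illustrative code (Bernoulli arms, our own function name): it runs UCB and reports the pseudo-regret Σ_i Δ_i · T_i(n). Multiplying the horizon by ten should increase the regret only modestly, in line with the O(log n) bound.

```python
import math
import random


def pseudo_regret(means, horizon, seed=0):
    """Run UCB on Bernoulli arms and return the pseudo-regret
    R_n = sum_i Delta_i * T_i(n)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialise: pull each arm once
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
    best = max(means)
    return sum((best - m) * c for m, c in zip(means, counts))
```

With a large gap (e.g. means 0.9 and 0.2), increasing the horizon from 500 to 5,000 increases the pseudo-regret by a factor far below ten, consistent with logarithmic rather than linear growth.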
The multi-armed bandit problem with covariates and the k-nearest neighbour UCB algorithm Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)
Plan
1. Multi-armed bandits with covariates
2. Non-parametric assumptions
3. Partition based policies and the UCBogram
4. Manifolds
5. The k-nearest neighbour UCB algorithm
The multi-armed bandit with covariates
Bandits with side-information
Multi-armed bandits with additional side-information:
- Example 1: Personalised sequential clinical trials, with access to a patient's genome sequence.
- Example 2: Personalised online advertisement placement, with access to a customer's interests, browsing and purchasing history.
Multi-armed bandit with covariates
For t = 1, 2, ..., n:
- Observe a covariate X_t ∈ X
- Choose an arm a_t to pull, based on X_t and the reward history
- Receive a reward r_t
Multi-armed bandit with covariates
Covariates X_t are drawn i.i.d. from a distribution on X. For each arm i, the reward is drawn from a distribution depending on X_t, with expected reward f_i(X_t), where f_i is a function on X.
Regret for bandits with covariates
Compare with the oracle policy a*(x) = argmax_i f_i(x).
Regret: R_n = E[ Σ_{t=1}^n ( max_i f_i(X_t) − f_{a_t}(X_t) ) ]
Non-parametric assumptions
The Lipschitz assumption
For each arm i, f_i is a function from X to [0, 1].
Lipschitz assumption: for every arm i and all x, x' ∈ X, |f_i(x) − f_i(x')| ≤ L · ρ(x, x').
The Margin assumption
Define the margin function Δ : X → ℝ by Δ(x) = the gap between the best and second-best expected reward at x.
Margin assumption: there exist constants C > 0 and α > 0 such that P( 0 < Δ(X) ≤ δ ) ≤ C δ^α for all δ > 0.
Histogram based policies
The UCBogram
Rigollet and Zeevi (COLT, 2010) consider the UCBogram:
1. Partition the covariate space into cubes
2. Apply UCB locally on each of the separate cubes
The UCBogram
[Figure: the covariate space partitioned into a grid of cubes, with a separate UCB instance running in each cube]
The UCBogram
Rigollet and Zeevi (2010): Suppose X = [0, 1]^d and the covariate distribution is absolutely continuous with a well-behaved density. Suppose that the bandit satisfies the Lipschitz condition and the margin condition with α ∈ (0, 1]. Then the UCBogram satisfies:
R_n = Õ( n^{1 − (1+α)/(2+d)} )
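The partition-then-run-UCB-locally idea can be sketched as follows. This is our own illustrative rendering, not the authors' implementation: the class name `UCBogram`, the side-length parameter `m`, and the dictionary bookkeeping are all assumptions.

```python
import math
import random  # used only in the usage example below


class UCBogram:
    """Sketch of the UCBogram idea (Rigollet & Zeevi, 2010):
    partition [0, 1]^d into cubes of side 1/m and run an
    independent UCB instance inside each cube."""

    def __init__(self, n_arms, d, m):
        self.n_arms, self.d, self.m = n_arms, d, m
        self.counts = {}  # (cube, arm) -> pull count
        self.sums = {}    # (cube, arm) -> cumulative reward
        self.t = {}       # cube -> local time

    def _cube(self, x):
        # index of the cube containing covariate x
        return tuple(min(int(xi * self.m), self.m - 1) for xi in x)

    def choose(self, x):
        c = self._cube(x)
        self.t[c] = self.t.get(c, 0) + 1
        for i in range(self.n_arms):  # pull each arm once in each cube first
            if self.counts.get((c, i), 0) == 0:
                return i
        # local UCB index: empirical mean + exploration bonus
        return max(
            range(self.n_arms),
            key=lambda i: self.sums[(c, i)] / self.counts[(c, i)]
            + math.sqrt(2 * math.log(self.t[c]) / self.counts[(c, i)]),
        )

    def update(self, x, arm, reward):
        c = self._cube(x)
        self.counts[(c, arm)] = self.counts.get((c, arm), 0) + 1
        self.sums[(c, arm)] = self.sums.get((c, arm), 0.0) + reward
```

For example, with two arms whose reward functions cross (f_1(x) = x, f_2(x) = 1 − x on [0, 1]), the cubes near x = 1 learn to favour arm 1 while the cubes near x = 0 favour arm 2.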
Adaptively Binned Successive Elimination
Perchet & Rigollet (2011): ABSE:
1. Refine the partition over time, splitting bins as the algorithm progresses
2. Run a standard bandit algorithm (Successive Elimination) locally on each bin until the subsequent refinement
Adaptively Binned Successive Elimination
[Figure: the partition is refined over time, with a Successive Elimination (SE) instance running on each cell]
Adaptively Binned Successive Elimination
Perchet & Rigollet (2011): Suppose X = [0, 1]^d and the covariate distribution is absolutely continuous with a well-behaved density. Suppose that the bandit satisfies the Lipschitz condition and the margin condition with any α > 0. Then ABSE satisfies:
R_n = Õ( n^{1 − (1+α)/(2+d)} )
Bandits on manifolds
Manifolds
A d*-dimensional manifold looks locally like d*-dimensional Euclidean space.
Manifolds
In many applications d is large, but the data lie close to a d*-dimensional smooth manifold with d* ≪ d.
E.g. statistical regularities in the space of MRI scans / genome sequences.
We should be able to exploit this property, but the manifold is not known in advance!
The k-nearest neighbour method in supervised learning
The k-nearest neighbour method
The k-nearest neighbour method is simple and intuitive, and effectively manages the bias-variance trade-off in supervised learning.
The k-nearest neighbour method
- Kpotufe (2012): k-nearest neighbours achieves minimax optimal rates in supervised regression (adapts to intrinsic dimension)
- Chaudhuri & Dasgupta (2014): k-nearest neighbours achieves minimax optimal rates in supervised classification with the margin condition
- Reeve & Brown (2017): k-nearest neighbours achieves minimax optimal rates for cost-sensitive learning on manifolds
k-nearest Neighbours UCB
K-Nearest Neighbours UCB
Given x and arm i, we define:
- The number of times, amongst the k-nearest neighbours of x, that arm i was pulled
- The cumulative reward over all the times that arm i was pulled and was amongst the k-nearest neighbours of x
- The k-nearest neighbour reward estimate: the ratio of the two
Defining uncertainty
Uncertainty = standard deviation term + bias term, where the bias term is controlled by the distance to the k-th nearest neighbour.
K-nearest neighbour UCB
Choosing k?
Cross-validation is not a good option in the online setting.
In the supervised regression setting, Kpotufe (2012) chooses k by minimising an upper bound on the squared error.
Here: choose k to minimise the uncertainty.
The K-NN UCB algorithm
For t = 1, 2, ..., n:
- Observe a covariate X_t
- For each arm, compute the k-nearest neighbour upper confidence bound (with k chosen to minimise the uncertainty), and pull the maximising arm
- Receive a reward
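The overall shape of the algorithm can be sketched as follows. This is a simplified, illustrative rendering under our own naming (`KNNUCB`, `theta`, `lipschitz`), not the authors' code: the exact confidence widths and constants in the paper differ, but the structure is the same: a k-NN reward estimate per arm, an uncertainty made of a standard-deviation term plus a bias term from the k-th nearest neighbour distance, and k chosen to minimise that uncertainty.

```python
import math
import random  # used only in the usage example below


class KNNUCB:
    """Illustrative sketch of a k-nearest neighbour UCB policy."""

    def __init__(self, n_arms, lipschitz=1.0, theta=1.0):
        self.n_arms = n_arms
        self.L = lipschitz   # Lipschitz constant, scales the bias term
        self.theta = theta   # scale of the standard-deviation term
        self.history = []    # list of (covariate, arm, reward)

    def choose(self, x, t):
        if t <= self.n_arms:
            return (t - 1) % self.n_arms  # initial round-robin
        # past covariates sorted by distance to x, nearest first
        order = sorted(range(len(self.history)),
                       key=lambda j: self._dist(x, self.history[j][0]))
        best_arm, best_index = 0, -float("inf")
        for i in range(self.n_arms):
            index = self._arm_index(x, i, order, t)
            if index > best_index:
                best_arm, best_index = i, index
        return best_arm

    def _arm_index(self, x, i, order, t):
        n_i, s_i, best_u, mean = 0, 0.0, float("inf"), 0.0
        for j in order:  # grow k one neighbour at a time
            xj, aj, rj = self.history[j]
            if aj == i:
                n_i += 1
                s_i += rj
            if n_i == 0:
                continue
            # uncertainty = std-dev term + bias term (k-th NN distance)
            u = (self.theta * math.sqrt(math.log(t) / n_i)
                 + self.L * self._dist(x, xj))
            if u < best_u:  # choose k minimising the uncertainty
                best_u, mean = u, s_i / n_i
        if n_i == 0:
            return float("inf")  # unexplored arm: pull it
        return mean + best_u

    def update(self, x, arm, reward):
        self.history.append((x, arm, reward))

    @staticmethod
    def _dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note the trade-off inside `_arm_index`: as k grows, the standard-deviation term shrinks (more arm-i pulls among the neighbours) while the bias term grows (the k-th neighbour is further away); minimising over k balances the two, echoing Kpotufe's choice of k in regression.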
The Lipschitz assumption (recalled)
For every arm i and all x, x' ∈ X, |f_i(x) − f_i(x')| ≤ L · ρ(x, x').
The margin assumption (recalled)
With Δ the margin function, there exist C > 0 and α > 0 such that P( 0 < Δ(X) ≤ δ ) ≤ C δ^α for all δ > 0.
The dimension assumption
Holds whenever the covariates are drawn from a well-behaved measure on a compact Riemannian manifold of dimension d*.
The Regret Bound
Reeve, Mellor & Brown (2017): Suppose that:
1) The Lipschitz assumption holds,
2) The margin assumption holds,
3) The dimension assumption holds.
Then we have a regret bound depending only on the intrinsic dimension d*:
R_n = Õ( n^{1 − (1+α)/(2+d*)} )
Empirical validation
[Figure: cumulative regret against time, for covariates on a two-dimensional manifold embedded in a fifteen-dimensional feature space]
Discussion
Doesn't require prior knowledge of:
- The time horizon
- The dimension of the manifold
Achieves the minimax optimal rate, up to a logarithmic factor.
The regret bound extends to any finite number of arms & reward distributions with sub-Gaussian noise.
Thank you for listening!