An introduction to multi-armed bandits Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)
Plan
1. An introduction to multi-armed bandits
2. The multi-armed bandit with covariates and the k-nearest neighbour UCB algorithm
Plan: Intro to multi-armed bandits
1. The concept of reinforcement learning
2. Multi-armed bandits
3. The exploration vs. exploitation trade-off
4. The upper confidence bound algorithm (UCB)
5. The concentration of measure phenomenon
6. A regret bound for the UCB algorithm
Supervised learning and reinforcement learning
Supervised learning
Learning with animals
Reinforcement learning
[Diagram: the agent sends an action to the environment; the environment returns a reward; the agent learns from this feedback loop]
Reinforcement learning
- No supervision: the only feedback given is the reward
- An agent's action affects the information it receives
- A sequential learning problem
The multi-armed bandit problem
The multi-armed bandit problem
[Figure: a gambler chooses between Bandit 1 and Bandit 2, each pull yielding a random reward]
Multi-armed bandit formalism
For t = 1, 2, ..., n:
- Choose an arm a_t ∈ {1, ..., K} to pull, based on the reward history
- Receive a reward r_t ∈ [0, 1]
Rewards from arm i are i.i.d. with expected value μ_i
Applications
- Sequential clinical trials
- Online advertisement optimisation
Notation
- μ_i: expected reward for each arm i
- a_t: arm pulled at time t
- T_i(t): number of times arm i has been pulled at time t
- μ̂_i(t): empirical average of rewards for arm i at time t
The exploration vs. exploitation trade-off
Exploration vs. exploitation
- Exploration: obtain more accurate estimates. Choose i so that T_i(t) is small.
- Exploitation: achieve a high reward. Choose i so that μ̂_i(t) is large.
The upper confidence bound (UCB) algorithm: Optimism in the face of uncertainty
The UCB algorithm
At each time t, pull the arm with the highest upper confidence bound:
a_t = argmax_i { μ̂_i(t) + sqrt(2 log t / T_i(t)) }
The bonus term is large for rarely pulled arms (exploration), while the empirical average is large for high-reward arms (exploitation).
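The UCB policy can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `ucb1`, the `pull` callback, and the assumption of rewards in [0, 1] are our own.

```python
import math
import random  # used only in the Bernoulli demo below


def ucb1(pull, n_arms, horizon):
    """Run the UCB policy of Auer et al. (2002).

    pull(i) returns a random reward in [0, 1] for arm i.
    Returns the total reward and the pull counts per arm.
    """
    counts = [0] * n_arms   # T_i(t): times each arm has been pulled
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull each arm once to initialise
        else:
            # optimism in the face of uncertainty:
            # empirical mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```

For example, with three Bernoulli arms of means 0.9, 0.5 and 0.1, the policy concentrates its pulls on the first arm while still sampling the others occasionally.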
Regret
Compare the policy with the oracle policy that always pulls an optimal arm:
R_n = n · max_i μ_i − E[ Σ_{t=1}^n r_t ]
UCB Regret Bound
The gap: Δ_i = max_j μ_j − μ_i
Auer et al. (2002): the UCB policy achieves a logarithmic regret bound:
R_n ≤ Σ_{i : Δ_i > 0} 8 log n / Δ_i + O(1) = O(log n)
Big O notation
f(n) = O(g(n)) means there exist constants C and n_0 such that f(n) ≤ C · g(n) for all n ≥ n_0.
Concentration of measure
Concentration of measure
I flip a coin 10,000 times. I get heads 7,500 times and tails 2,500 times. If I flip the same coin again, what's the probability I get a head?
Hoeffding's Inequality (1963)
Let X_1, ..., X_m be independent random variables with X_j ∈ [0, 1] for j = 1, ..., m.
Define the empirical average X̄ = (1/m) Σ_{j=1}^m X_j.
Then for every ε > 0,
P( |X̄ − E[X̄]| ≥ ε ) ≤ 2 exp(−2 m ε²)
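Hoeffding's inequality can be checked empirically. The following small Python sketch (our own illustrative code, not part of the lecture) compares the observed frequency of large deviations of a Bernoulli average against the bound 2·exp(−2mε²).

```python
import math
import random


def hoeffding_bound(m, eps):
    # right-hand side of Hoeffding's inequality for [0, 1]-valued variables
    return 2 * math.exp(-2 * m * eps * eps)


def deviation_frequency(m, eps, trials, p=0.5, seed=0):
    """Estimate P(|empirical mean - p| >= eps) for m Bernoulli(p) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials
```

With m = 100 and ε = 0.1, the bound evaluates to about 0.27, while the simulated deviation frequency for a fair coin is considerably smaller; the inequality holds but is not tight.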
Proof of the UCB regret bound
Recall the goal. The gap: Δ_i = max_j μ_j − μ_i. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.
Recall the notation: μ_i (expected reward of arm i), a_t (arm pulled at time t), T_i(t) (number of times arm i has been pulled at time t), μ̂_i(t) (empirical average of rewards for arm i at time t).
Bad events
For i ∈ {1, ..., K} and t ∈ ℕ, define the bad events
E_{i,t} = { |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / T_i(t)) }.
Lemma 1: P(E_{i,t}) ≤ 2 t^{−3}.
Proof of Lemma 1
Fix a possible value s ∈ {1, ..., t} of T_i(t). By Hoeffding's Inequality with ε = sqrt(2 log t / s),
P( |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / s) and T_i(t) = s ) ≤ 2 exp(−2 s · (2 log t) / s) = 2 t^{−4}.
Taking a union bound over the t possible values of s gives P(E_{i,t}) ≤ 2 t^{−3}.
Recall: we defined the bad events E_{i,t} = { |μ̂_i(t) − μ_i| ≥ sqrt(2 log t / T_i(t)) }, and Lemma 1 states that P(E_{i,t}) ≤ 2 t^{−3}.
Lemma 2
Suppose a_t = i, Δ_i > 0 and T_i(t) > 8 log t / Δ_i². Then at least one of E_{i,t} or E_{i*,t} holds, where i* denotes an optimal arm.
Proof of Lemma 2
Suppose neither E_{i,t} nor E_{i*,t} holds. From T_i(t) > 8 log t / Δ_i² we get sqrt(2 log t / T_i(t)) < Δ_i / 2. Then:
1. μ̂_i(t) + sqrt(2 log t / T_i(t)) < μ_i + 2 · sqrt(2 log t / T_i(t)) < μ_i + Δ_i = μ*
2. μ̂_{i*}(t) + sqrt(2 log t / T_{i*}(t)) ≥ μ*
1. & 2. together imply that the UCB index of i* strictly exceeds that of i, contradicting a_t = i.
Lemma 3
Suppose Δ_i > 0. Then, combining Lemma 1 and Lemma 2,
P( a_t = i and T_i(t) > 8 log t / Δ_i² ) ≤ P(E_{i,t}) + P(E_{i*,t}) ≤ 4 t^{−3}.
Lemma 4
Suppose Δ_i > 0. Then E[T_i(n)] ≤ 8 log n / Δ_i² + O(1).
Proof of Lemma 4
Write T_i(n) ≤ ⌈8 log n / Δ_i²⌉ + Σ_{t=1}^n 1{ a_t = i and T_i(t) > 8 log t / Δ_i² }.
Taking expectations and applying Lemma 3,
E[T_i(n)] ≤ 8 log n / Δ_i² + 1 + Σ_{t=1}^∞ 4 t^{−3} ≤ 8 log n / Δ_i² + O(1).
Proof of UCB Regret bound
The regret decomposes as R_n = Σ_i Δ_i · E[T_i(n)]. Applying Lemma 4 to each arm with Δ_i > 0,
R_n ≤ Σ_{i : Δ_i > 0} Δ_i · ( 8 log n / Δ_i² + O(1) ) = Σ_{i : Δ_i > 0} 8 log n / Δ_i + O(1) = O(log n). ∎
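The logarithmic growth of the regret can also be observed numerically. The sketch below is our own illustrative code (Bernoulli arms, our own function name): it runs UCB and reports the pseudo-regret Σ_i Δ_i · T_i(n). Multiplying the horizon by ten should increase the regret only modestly, in line with the O(log n) bound.

```python
import math
import random


def pseudo_regret(means, horizon, seed=0):
    """Run UCB on Bernoulli arms and return the pseudo-regret
    R_n = sum_i Delta_i * T_i(n)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialise: pull each arm once
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
    best = max(means)
    return sum((best - m) * c for m, c in zip(means, counts))
```

With a large gap (e.g. means 0.9 and 0.2), increasing the horizon from 500 to 5,000 increases the pseudo-regret by a factor far below ten, consistent with logarithmic rather than linear growth.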
The multi-armed bandit problem with covariates and the k-nearest neighbour UCB algorithm Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester)
Plan
1. Multi-armed bandits with covariates
2. Non-parametric assumptions
3. Partition based policies and the UCBogram
4. Manifolds
5. The k-nearest neighbour UCB algorithm
The multi-armed bandit with covariates
Bandits with side-information
Multi-armed bandits with additional side-information:
- Example 1: Personalised sequential clinical trials, with access to a patient's genome sequence.
- Example 2: Personalised online advertisement placement, with access to a customer's interests, browsing and purchasing history.
Multi-armed bandit with covariates
For t = 1, 2, ..., n:
- Observe a covariate X_t ∈ X
- Choose an arm a_t to pull, based on X_t and the reward history
- Receive a reward r_t
Multi-armed bandit with covariates
Covariates X_t are drawn i.i.d. from a distribution on X. For each arm i, the reward is drawn from a distribution depending on X_t, with expected reward f_i(X_t), where f_i is a function on X.
Regret for bandits with covariates
Compare with the oracle policy a*(x) = argmax_i f_i(x).
Regret: R_n = E[ Σ_{t=1}^n ( max_i f_i(X_t) − f_{a_t}(X_t) ) ]
Non-parametric assumptions
The Lipschitz assumption
For each arm i, f_i is a function from X to [0, 1].
Lipschitz assumption: for every arm i and all x, x' ∈ X, |f_i(x) − f_i(x')| ≤ L · ρ(x, x').
The Margin assumption
Define the margin function Δ : X → ℝ by Δ(x) = the gap between the best and second-best expected reward at x.
Margin assumption: there exist constants C > 0 and α > 0 such that P( 0 < Δ(X) ≤ δ ) ≤ C δ^α for all δ > 0.
Histogram based policies
The UCBogram
Rigollet and Zeevi (COLT, 2010) consider the UCBogram:
1. Partition the covariate space into cubes
2. Apply UCB locally on each of the separate cubes
The UCBogram
[Figure: the covariate space partitioned into a grid of cubes, with a separate UCB instance running in each cube]
The UCBogram
Rigollet and Zeevi (2010): Suppose X = [0, 1]^d and the covariate distribution is absolutely continuous with a well-behaved density. Suppose that the bandit satisfies the Lipschitz condition and the margin condition with α ∈ (0, 1]. Then the UCBogram satisfies:
R_n = Õ( n^{1 − (1+α)/(2+d)} )
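The partition-then-run-UCB-locally idea can be sketched as follows. This is our own illustrative rendering, not the authors' implementation: the class name `UCBogram`, the side-length parameter `m`, and the dictionary bookkeeping are all assumptions.

```python
import math
import random  # used only in the usage example below


class UCBogram:
    """Sketch of the UCBogram idea (Rigollet & Zeevi, 2010):
    partition [0, 1]^d into cubes of side 1/m and run an
    independent UCB instance inside each cube."""

    def __init__(self, n_arms, d, m):
        self.n_arms, self.d, self.m = n_arms, d, m
        self.counts = {}  # (cube, arm) -> pull count
        self.sums = {}    # (cube, arm) -> cumulative reward
        self.t = {}       # cube -> local time

    def _cube(self, x):
        # index of the cube containing covariate x
        return tuple(min(int(xi * self.m), self.m - 1) for xi in x)

    def choose(self, x):
        c = self._cube(x)
        self.t[c] = self.t.get(c, 0) + 1
        for i in range(self.n_arms):  # pull each arm once in each cube first
            if self.counts.get((c, i), 0) == 0:
                return i
        # local UCB index: empirical mean + exploration bonus
        return max(
            range(self.n_arms),
            key=lambda i: self.sums[(c, i)] / self.counts[(c, i)]
            + math.sqrt(2 * math.log(self.t[c]) / self.counts[(c, i)]),
        )

    def update(self, x, arm, reward):
        c = self._cube(x)
        self.counts[(c, arm)] = self.counts.get((c, arm), 0) + 1
        self.sums[(c, arm)] = self.sums.get((c, arm), 0.0) + reward
```

For example, with two arms whose reward functions cross (f_1(x) = x, f_2(x) = 1 − x on [0, 1]), the cubes near x = 1 learn to favour arm 1 while the cubes near x = 0 favour arm 2.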
Adaptively Binned Successive Elimination
Perchet & Rigollet (2011): ABSE:
1. Refine the partition over time, splitting bins as the algorithm progresses
2. Run a standard bandit algorithm (Successive Elimination) locally on each bin until the subsequent refinement
Adaptively Binned Successive Elimination
[Figure: the partition is refined over time, with a Successive Elimination (SE) instance running on each cell]
Adaptively Binned Successive Elimination
Perchet & Rigollet (2011): Suppose X = [0, 1]^d and the covariate distribution is absolutely continuous with a well-behaved density. Suppose that the bandit satisfies the Lipschitz condition and the margin condition with any α > 0. Then ABSE satisfies:
R_n = Õ( n^{1 − (1+α)/(2+d)} )
Bandits on manifolds
Manifolds
A d*-dimensional manifold looks locally like d*-dimensional Euclidean space.
Manifolds
In many applications d is large, but the data lie close to a d*-dimensional smooth manifold with d* ≪ d.
E.g. statistical regularities in the space of MRI scans / genome sequences.
We should be able to exploit this property, but the manifold is not known in advance!
The k-nearest neighbour method in supervised learning
The k-nearest neighbour method
The k-nearest neighbour method is simple and intuitive, and effectively manages the bias-variance trade-off in supervised learning.
The k-nearest neighbour method
- Kpotufe (2012): k-nearest neighbours achieves minimax optimal rates in supervised regression (adapts to intrinsic dimension)
- Chaudhuri & Dasgupta (2014): k-nearest neighbours achieves minimax optimal rates in supervised classification with the margin condition
- Reeve & Brown (2017): k-nearest neighbours achieves minimax optimal rates for cost-sensitive learning on manifolds
k-nearest Neighbours UCB
K-Nearest Neighbours UCB
Given x and arm i, we define:
- The number of times, amongst the k-nearest neighbours of x, that arm i was pulled
- The cumulative reward over all the times that arm i was pulled and was amongst the k-nearest neighbours of x
- The k-nearest neighbour reward estimate: the ratio of the two
Defining uncertainty
Uncertainty = standard deviation term + bias term, where the bias term is controlled by the distance to the k-th nearest neighbour.
K-nearest neighbour UCB
Choosing k?
Cross-validation is not a good option in the online setting.
In the supervised regression setting, Kpotufe (2012) chooses k by minimising an upper bound on the squared error.
Here: choose k to minimise the uncertainty.
The K-NN UCB algorithm
For t = 1, 2, ..., n:
- Observe a covariate X_t
- For each arm, compute the k-nearest neighbour upper confidence bound (with k chosen to minimise the uncertainty), and pull the maximising arm
- Receive a reward
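The overall shape of the algorithm can be sketched as follows. This is a simplified, illustrative rendering under our own naming (`KNNUCB`, `theta`, `lipschitz`), not the authors' code: the exact confidence widths and constants in the paper differ, but the structure is the same: a k-NN reward estimate per arm, an uncertainty made of a standard-deviation term plus a bias term from the k-th nearest neighbour distance, and k chosen to minimise that uncertainty.

```python
import math
import random  # used only in the usage example below


class KNNUCB:
    """Illustrative sketch of a k-nearest neighbour UCB policy."""

    def __init__(self, n_arms, lipschitz=1.0, theta=1.0):
        self.n_arms = n_arms
        self.L = lipschitz   # Lipschitz constant, scales the bias term
        self.theta = theta   # scale of the standard-deviation term
        self.history = []    # list of (covariate, arm, reward)

    def choose(self, x, t):
        if t <= self.n_arms:
            return (t - 1) % self.n_arms  # initial round-robin
        # past covariates sorted by distance to x, nearest first
        order = sorted(range(len(self.history)),
                       key=lambda j: self._dist(x, self.history[j][0]))
        best_arm, best_index = 0, -float("inf")
        for i in range(self.n_arms):
            index = self._arm_index(x, i, order, t)
            if index > best_index:
                best_arm, best_index = i, index
        return best_arm

    def _arm_index(self, x, i, order, t):
        n_i, s_i, best_u, mean = 0, 0.0, float("inf"), 0.0
        for j in order:  # grow k one neighbour at a time
            xj, aj, rj = self.history[j]
            if aj == i:
                n_i += 1
                s_i += rj
            if n_i == 0:
                continue
            # uncertainty = std-dev term + bias term (k-th NN distance)
            u = (self.theta * math.sqrt(math.log(t) / n_i)
                 + self.L * self._dist(x, xj))
            if u < best_u:  # choose k minimising the uncertainty
                best_u, mean = u, s_i / n_i
        if n_i == 0:
            return float("inf")  # unexplored arm: pull it
        return mean + best_u

    def update(self, x, arm, reward):
        self.history.append((x, arm, reward))

    @staticmethod
    def _dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note the trade-off inside `_arm_index`: as k grows, the standard-deviation term shrinks (more arm-i pulls among the neighbours) while the bias term grows (the k-th neighbour is further away); minimising over k balances the two, echoing Kpotufe's choice of k in regression.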
The Lipschitz assumption (recalled)
For every arm i and all x, x' ∈ X, |f_i(x) − f_i(x')| ≤ L · ρ(x, x').
The margin assumption (recalled)
With Δ the margin function, there exist C > 0 and α > 0 such that P( 0 < Δ(X) ≤ δ ) ≤ C δ^α for all δ > 0.
The dimension assumption
Holds whenever the covariates are drawn from a well-behaved measure on a compact Riemannian manifold of dimension d*.
The Regret Bound
Reeve, Mellor & Brown (2017): Suppose that:
1) The Lipschitz assumption holds,
2) The margin assumption holds,
3) The dimension assumption holds.
Then we have a regret bound depending only on the intrinsic dimension d*:
R_n = Õ( n^{1 − (1+α)/(2+d*)} )
Empirical validation
[Figure: cumulative regret against time, for covariates on a two-dimensional manifold embedded in a fifteen-dimensional feature space]
Discussion
Doesn't require prior knowledge of:
- The time horizon
- The dimension of the manifold
Achieves the minimax optimal rate, up to a logarithmic factor.
The regret bound extends to any finite number of arms & reward distributions with sub-Gaussian noise.
Thank you for listening!