An introduction to multi-armed bandits

1 An introduction to multi-armed bandits. Henry WJ Reeve (Manchester). Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester).

2 Plan
1. An introduction to multi-armed bandits
2. The multi-armed bandit with covariates and the k-nearest neighbour UCB algorithm

3 Plan: Intro to multi-armed bandits
1. The concept of reinforcement learning
2. Multi-armed bandits
3. The exploration vs. exploitation trade-off
4. The upper confidence bound algorithm (UCB)
5. The concentration of measure phenomenon
6. A regret bound for the UCB algorithm

4 Supervised learning and reinforcement learning

5 Supervised learning

8 Learning with animals

11-14 Reinforcement learning [diagram: the Agent sends an Action to the Environment, the Environment returns a Reward, and the Agent learns from it]

15 Reinforcement learning. No supervision: the only feedback given is the reward. An agent's action affects the information it receives. A sequential learning problem.

16 The multi-armed bandit problem

17 The multi-armed bandit problem Bandit 1 Random reward Bandit 2

18-19 Multi-armed bandit formalism. For each time step t: choose an arm to pull based on the reward history; receive a reward. The rewards from each arm are i.i.d., with a fixed expected reward for that arm.
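The interaction protocol above can be sketched as a short simulation. This is a minimal illustration, not the talk's own code: the Bernoulli reward model, the function names `pull`, `run_bandit` and `uniform_policy`, and the arm means are all assumptions made for the example.

```python
import random

def pull(arm_means, arm, rng):
    """Draw a reward for the chosen arm (assumed Bernoulli reward model)."""
    return 1.0 if rng.random() < arm_means[arm] else 0.0

def uniform_policy(history, n_arms, rng):
    """Hypothetical baseline policy: ignore the history, pick an arm at random."""
    return rng.randrange(n_arms)

def run_bandit(arm_means, policy, n_rounds, seed=0):
    """Run the bandit protocol: choose an arm from the history, receive a reward."""
    rng = random.Random(seed)
    history = []  # list of (arm, reward) pairs seen so far
    for _ in range(n_rounds):
        arm = policy(history, len(arm_means), rng)
        reward = pull(arm_means, arm, rng)
        history.append((arm, reward))
    return history

history = run_bandit([0.3, 0.7], uniform_policy, n_rounds=1000)
print("total reward:", sum(reward for _, reward in history))
```

Any policy with the same signature can be plugged into `run_bandit`; the policy only ever sees the history of arms pulled and rewards received, never the arm means.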

20 Applications Sequential clinical trials

21 Applications Sequential clinical trials Online advertisement optimisation

22 Notation. Expected reward for each arm. Arm pulled at time t. Number of times each arm has been pulled at time t. Empirical average of rewards for each arm at time t.
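One common way to write these four quantities is given below. The symbols are assumptions chosen for illustration (K arms, reward X_s received at time s); the slide's own symbols may differ.

```latex
\begin{align*}
\mu_i &= \text{expected reward of arm } i, \qquad i \in \{1,\dots,K\},\\
I_t &= \text{arm pulled at time } t,\\
N_i(t) &= \sum_{s=1}^{t} \mathbf{1}\{I_s = i\},\\
\hat{\mu}_i(t) &= \frac{1}{N_i(t)} \sum_{s=1}^{t} X_s \,\mathbf{1}\{I_s = i\}.
\end{align*}
```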

23 The exploration vs. exploitation trade-off

24-25 Exploration vs. exploitation. Exploration: obtain more accurate estimates. Choose arm i so that the number of times it has been pulled is small. Exploitation: achieve a high reward. Choose arm i so that its empirical average reward is large.

26 The upper confidence bound (UCB) algorithm: Optimism in the face of uncertainty

27 The UCB algorithm
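A minimal sketch of a UCB policy follows, assuming the UCB1 index of Auer et al. (2002): the empirical average plus a confidence width of sqrt(2 ln t / N_i). The Bernoulli rewards, the function name `ucb1` and the arm means are assumptions for the example, and the slide's exact index may differ.

```python
import math
import random

def ucb1(arm_means, n_rounds, seed=0):
    """Run UCB1 on simulated Bernoulli arms and return the pull counts."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm has been pulled
    sums = [0.0] * n_arms   # cumulative reward of each arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: pull every arm once
        else:
            # Optimism in the face of uncertainty: estimate plus confidence width.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.8], n_rounds=5000)
print("pulls per arm:", counts)
```

The width shrinks as an arm is pulled more often, so rarely pulled arms keep getting revisited (exploration) while arms with high averages dominate (exploitation).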

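30 Regret. Compare the reward collected by the policy with that of the oracle policy, which always pulls the arm with the highest expected reward.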

31 UCB regret bound. The gap of an arm is the difference between the highest expected reward and that arm's expected reward. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.
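For reference, the regret and the bound of Auer et al. (2002) for UCB1 can be written as follows. The notation is an assumption for illustration: K arms with expected rewards in [0,1], I_t the arm pulled at time t, N_i(n) the number of pulls of arm i after n rounds, and mu^* the highest expected reward.

```latex
\begin{align*}
\Delta_i &= \mu^* - \mu_i, \qquad \mu^* = \max_{1 \le j \le K} \mu_j,\\
R_n &= n\mu^* - \mathbb{E}\Big[\sum_{t=1}^{n} \mu_{I_t}\Big]
     = \sum_{i=1}^{K} \Delta_i \,\mathbb{E}[N_i(n)],\\
R_n &\le 8 \sum_{i:\,\Delta_i > 0} \frac{\ln n}{\Delta_i}
     + \Big(1 + \frac{\pi^2}{3}\Big) \sum_{i=1}^{K} \Delta_i .
\end{align*}
```

The bound grows like ln n, whereas a policy that keeps pulling a suboptimal arm a constant fraction of the time has regret growing linearly in n.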

32 Big O notation

34 UCB regret bound. The gap of an arm is the difference between the highest expected reward and that arm's expected reward. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.

35 Concentration of measure

36 Concentration of measure. I flip a coin a number of times and count the heads and the tails. If I flip the same coin again, what's the probability I get a head?
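A small simulation illustrates the idea: the observed fraction of heads settles near the coin's true bias as the number of flips grows. The bias of 0.6, the flip counts and the function name `head_frequency` are assumptions for the example.

```python
import random

def head_frequency(p_heads, n_flips, rng):
    """Flip a coin with bias p_heads n_flips times; return the fraction of heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < p_heads)
    return heads / n_flips

rng = random.Random(0)
for n_flips in (10, 100, 10_000):
    # The estimate of the bias becomes more reliable as n_flips grows.
    print(n_flips, head_frequency(0.6, n_flips, rng))
```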

37-39 Hoeffding's Inequality (1963). Take independent, bounded random variables and define their empirical average. The probability that the empirical average deviates from its expectation by more than a given amount decays exponentially in the number of variables.
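A standard statement of the inequality is given below, assuming for simplicity that every variable takes values in [0,1]; the slide may state it for general bounded intervals.

```latex
\begin{align*}
&X_1, \dots, X_n \ \text{independent}, \quad X_j \in [0,1] \ \text{for } j = 1, \dots, n,
\qquad \bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j,\\
&\mathbb{P}\big(\bar{X} - \mathbb{E}[\bar{X}] \ge \epsilon\big) \le \exp(-2 n \epsilon^2),
\qquad
\mathbb{P}\big(\big|\bar{X} - \mathbb{E}[\bar{X}]\big| \ge \epsilon\big) \le 2\exp(-2 n \epsilon^2)
\quad \text{for all } \epsilon > 0.
\end{align*}
```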

40 Proof of the UCB regret bound

41 UCB regret bound. The gap of an arm is the difference between the highest expected reward and that arm's expected reward. Auer et al. (2002): the UCB policy achieves a logarithmic regret bound.

42 Notation. Expected reward for each arm. Arm pulled at time t. Number of times each arm has been pulled at time t. Empirical average of rewards for each arm at time t.

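43 Bad events. Let's define bad events, for each arm and time step, on which the arm's empirical average is far from its expected reward. Lemma 1 bounds the probability of each bad event.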

44-46 Proof of Lemma 1: apply Hoeffding's Inequality.
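The key computation in the standard UCB1 analysis is sketched below. It assumes rewards in [0,1], a confidence width of sqrt(2 ln t / s), and writes \hat{\mu}_{i,s} for the average of the first s rewards from arm i; the bad events on the slide may be defined slightly differently.

```latex
\begin{align*}
c_{t,s} &= \sqrt{\frac{2 \ln t}{s}},\\
\mathbb{P}\big(\hat{\mu}_{i,s} \ge \mu_i + c_{t,s}\big)
&\le \exp\big(-2 s\, c_{t,s}^2\big)
 = \exp(-4 \ln t) = t^{-4},\\
\mathbb{P}\big(\hat{\mu}_{i,s} \le \mu_i - c_{t,s}\big)
&\le t^{-4}.
\end{align*}
```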

47 Lemma 1 (recap). We defined bad events for each arm and time step; Lemma 1 bounds the probability of each bad event.

48 Lemma 2. Suppose two conditions hold. Then at least one of two bad events holds.

49-52 Proof of Lemma 2 (by contradiction): suppose neither of the two bad events holds.

53 Lemma 2 (recap). Suppose two conditions hold. Then at least one of two bad events holds.

54 Lemma 3, obtained by combining Lemma 1 and Lemma 2.

55 Lemma 4, obtained from Lemma 3.

56 Proof of Lemma 4

60-63 Proof of the UCB regret bound
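For orientation, the usual way the argument of Auer et al. (2002) is assembled is sketched here; it is not necessarily the exact chain of lemmas on the slides. Assume rewards in [0,1], a suboptimal arm i with gap Delta_i > 0, confidence width c_{t,s} = sqrt(2 ln t / s), and \hat{\mu}_{i,s} the average of the first s rewards from arm i (a star marks the optimal arm).

```latex
\begin{align*}
&\text{If arm } i \text{ is pulled at time } t \text{ after } s_i \text{ pulls of } i \text{ and } s \text{ pulls of the optimal arm, then at least one of}\\
&\qquad \hat{\mu}^*_{s} \le \mu^* - c_{t,s}, \qquad
        \hat{\mu}_{i,s_i} \ge \mu_i + c_{t,s_i}, \qquad
        \mu^* < \mu_i + 2 c_{t,s_i}
\quad \text{holds.}\\
&\text{The third is impossible once } s_i \ge \frac{8 \ln n}{\Delta_i^2},
 \text{ and each of the first two has probability at most } t^{-4}. \text{ Hence}\\
&\qquad \mathbb{E}[N_i(n)]
 \le \Big\lceil \frac{8 \ln n}{\Delta_i^2} \Big\rceil
   + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i=1}^{t} 2 t^{-4}
 \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}.
\end{align*}
```

Multiplying by Delta_i and summing over the suboptimal arms turns this bound on the number of pulls into a logarithmic bound on the regret.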

64 The multi-armed bandit problem with covariates and the k-nearest neighbour UCB algorithm. Henry WJ Reeve (Manchester). Joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester).

65 Plan
1. Multi-armed bandits with covariates
2. Non-parametric assumptions
3. Partition based policies and the UCBogram
4. Manifolds
5. The k-nearest neighbour UCB algorithm

66 The multi-armed bandit with covariates

67 Bandits with side-information. Multi-armed bandits with additional side-information. Example 1: personalised sequential clinical trials, with access to a patient's genome sequence. Example 2: personalised online advertisement placement, with access to a customer's interests, browsing and purchasing history.

68 Multi-armed bandit with covariates. For each time step t: observe a covariate; choose an arm to pull based on the covariate and the reward history; receive a reward.

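69 Multi-armed bandit with covariates. Covariates are drawn i.i.d. from a fixed distribution. For each arm, the reward is drawn from a distribution that depends on the covariate. The expected reward of each arm is a function of the covariate.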

70 Bandits with covariates

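72 Regret for bandits with covariates. Regret: compare the reward collected by the policy with that of the oracle policy, which pulls the arm with the highest expected reward for each observed covariate.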

73 Non-parametric assumptions

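74 The Lipschitz assumption. For each arm, define the expected reward function, which maps a covariate to the arm's expected reward given that covariate. Lipschitz assumption: each expected reward function is Lipschitz continuous.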
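A common way to write this assumption is shown below. The symbols are assumptions for illustration: f_i is the expected reward function of arm i, Y_t^i the reward of arm i at time t, X_t the covariate, rho the metric on the covariate space and lambda the Lipschitz constant.

```latex
\begin{align*}
f_i(x) &= \mathbb{E}\big[\,Y_t^{i} \mid X_t = x\,\big],\\
|f_i(x) - f_i(y)| &\le \lambda\, \rho(x, y) \qquad \text{for all arms } i \text{ and all covariates } x, y.
\end{align*}
```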

75 The Lipschitz assumption

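77 The margin assumption. Define the margin function as the gap, at each covariate, between the highest expected reward and the expected reward of the next-best arm. Margin assumption: the probability that the margin is small decays polynomially in the size of the margin.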
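For two arms, a margin condition of this kind is often written as below, in the style of Rigollet and Zeevi (2010). The constants C and alpha and the threshold delta_0 are assumed names, and the slide's formulation for several arms may differ.

```latex
\begin{align*}
\Delta(x) &= |f_1(x) - f_2(x)|,\\
\mathbb{P}\big(0 < \Delta(X) \le \delta\big) &\le C\, \delta^{\alpha}
\qquad \text{for all } \delta \in (0, \delta_0).
\end{align*}
```

A larger alpha means that covariates where the two arms are nearly tied are rarer, which makes the problem easier.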

78 The margin assumption

80 Histogram based policies

81 The UCBogram. Rigollet and Zeevi (COLT, 2010) consider the UCBogram:
1. Partition the covariate space into cubes.
2. Apply UCB locally on each of the separate cubes.
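The two steps can be sketched as follows: a minimal illustration assuming covariates in the unit cube, a regular grid with a fixed number of bins per axis, and a UCB1-style index computed from each cube's own counts. The class and function names are hypothetical, and the original policy's exact index and bin width are not reproduced here.

```python
import math
from collections import defaultdict

def bin_index(x, bins_per_axis):
    """Map a covariate in [0,1]^d to the index of its cube in a regular grid."""
    return tuple(min(int(v * bins_per_axis), bins_per_axis - 1) for v in x)

class UCBogram:
    """Sketch: one independent UCB instance per cube of the partition."""

    def __init__(self, n_arms, bins_per_axis):
        self.n_arms = n_arms
        self.bins_per_axis = bins_per_axis
        self.counts = defaultdict(lambda: [0] * n_arms)   # pulls per cube and arm
        self.sums = defaultdict(lambda: [0.0] * n_arms)   # rewards per cube and arm

    def choose(self, x):
        cube = bin_index(x, self.bins_per_axis)
        counts, sums = self.counts[cube], self.sums[cube]
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm  # pull every arm once within each cube
        t = sum(counts) + 1  # local round number within this cube (assumption)
        return max(
            range(self.n_arms),
            key=lambda i: sums[i] / counts[i]
            + math.sqrt(2.0 * math.log(t) / counts[i]),
        )

    def update(self, x, arm, reward):
        cube = bin_index(x, self.bins_per_axis)
        self.counts[cube][arm] += 1
        self.sums[cube][arm] += reward
```

Each cube learns from its own rounds only, so finer grids reduce the approximation error within a cube but leave fewer observations per cube.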

82-83 The UCBogram [figure: the covariate space divided into a grid of cubes, with a separate UCB instance run on each cube]

84 The UCBogram. Rigollet and Zeevi (2010): suppose the covariate distribution is absolutely continuous, with a well-behaved density, and that the bandit satisfies the Lipschitz condition and the margin condition. Then the UCBogram satisfies a regret bound that is sublinear in the time horizon.

85 Adaptively Binned Successive Elimination (ABSE). Perchet & Rigollet (2011):
1. Refine the partition adaptively.
2. Run a standard bandit algorithm locally until the subsequent refinement.

86-88 Adaptively Binned Successive Elimination [figure sequence: the partition is progressively refined, with a Successive Elimination (SE) instance run on each bin]

89 Adaptively Binned Successive Elimination. Perchet & Rigollet (2011): suppose the covariate distribution is absolutely continuous, with a well-behaved density, and that the bandit satisfies the Lipschitz condition and the margin condition with any margin exponent. Then ABSE satisfies a regret bound that is sublinear in the time horizon.

90 Bandits on manifolds

91 Manifolds. A manifold of a given dimension looks locally like Euclidean space of the same dimension.

92 Manifolds. In many applications the dimension d of the feature space is large, but the data lie close to a smooth manifold of much lower dimension, e.g. statistical regularities in the space of MRI scans or genome sequences. We should be able to exploit this property, but the manifold is not known in advance!

93 The k-nearest neighbour method in supervised learning

94 The k-nearest neighbour method. The k-nearest neighbour method is simple & intuitive. It effectively manages the bias-variance trade-off in supervised learning.

95 The k-nearest neighbour method. Kpotufe (2012): k-nearest neighbours achieves minimax optimal rates in supervised regression (adapts to intrinsic dimension). Chaudhuri & Dasgupta (2014): k-nearest neighbours achieves minimax optimal rates in supervised classification with the margin condition. Reeve & Brown (2017): k-nearest neighbours achieves minimax optimal rates for cost-sensitive learning on manifolds.

96 k-nearest Neighbours UCB

97 K-Nearest Neighbours UCB. Given a covariate x and a number of neighbours k, we define: the number of times, amongst the k-nearest neighbours of x, that arm i was pulled; the cumulative reward over all the times that arm i was pulled and the covariate was amongst the k-nearest neighbours of x; and the k-nearest neighbour reward estimate, the ratio of the cumulative reward to the number of pulls.
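These three quantities can be computed directly from the history, as in the sketch below. It assumes Euclidean distance and a history stored in three parallel lists; the function names and data layout are hypothetical.

```python
import math

def knn_indices(x, past_covariates, k):
    """Indices of the k past covariates closest to x (Euclidean distance)."""
    order = sorted(range(len(past_covariates)),
                   key=lambda j: math.dist(x, past_covariates[j]))
    return order[:k]

def knn_estimate(x, arm, k, past_covariates, past_arms, past_rewards):
    """Return (pull count, cumulative reward, reward estimate) for one arm at x."""
    neighbours = knn_indices(x, past_covariates, k)
    pulls = sum(1 for j in neighbours if past_arms[j] == arm)
    total = sum(past_rewards[j] for j in neighbours if past_arms[j] == arm)
    # The estimate is undefined when the arm was never pulled among the neighbours.
    estimate = total / pulls if pulls > 0 else None
    return pulls, total, estimate

covariates = [(0.0,), (0.1,), (1.0,)]
arms = [0, 1, 0]
rewards = [1.0, 0.0, 0.0]
print(knn_estimate((0.0,), 0, 2, covariates, arms, rewards))
```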

98 Defining uncertainty. The uncertainty combines a standard deviation term with a bias term; the bias term is governed by the distance from x to its k-th nearest neighbour.

99 K-nearest neighbour UCB

102 Choosing k? Cross-validation is not a good option in the online setting. In the supervised regression setting, Kpotufe (2012) chooses k by minimising an upper bound on the squared error. Here, choose k to minimise the uncertainty.

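103 The K-NN UCB algorithm. For each time step t: observe a covariate; for each arm, choose k to minimise the uncertainty and compute the resulting upper confidence bound; pull the arm with the highest upper confidence bound; receive a reward.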
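A minimal sketch of the arm-selection step is given below. It assumes an uncertainty of the form sqrt(theta * ln t / N) + phi * r_k, where N is the arm's pull count among the k nearest neighbours and r_k is the distance to the k-th nearest neighbour; the constants `theta` and `phi`, the Euclidean metric and the function name are assumptions, and the published algorithm's exact form may differ.

```python
import math

def knn_ucb_choose(x, t, n_arms, covs, arms, rewards, theta=1.0, phi=1.0):
    """Pick the arm with the largest k-NN upper confidence bound at covariate x."""
    if not covs:
        return 0  # no history yet: any arm will do
    # Past rounds sorted by distance to x; dists[k-1] is the k-th neighbour distance.
    order = sorted(range(len(covs)), key=lambda j: math.dist(x, covs[j]))
    dists = [math.dist(x, covs[j]) for j in order]
    best_arm, best_index = 0, -math.inf
    for arm in range(n_arms):
        pulls, total = 0, 0.0
        best_u, index = math.inf, math.inf  # unseen arms get an infinite index
        for k in range(1, len(order) + 1):
            j = order[k - 1]
            if arms[j] == arm:
                pulls += 1
                total += rewards[j]
            if pulls == 0:
                continue
            # Standard-deviation term plus bias term (assumed form).
            u = math.sqrt(theta * math.log(t) / pulls) + phi * dists[k - 1]
            if u < best_u:  # keep the k that minimises the uncertainty
                best_u, index = u, total / pulls + u
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```

Increasing k adds more pulls, which shrinks the standard-deviation term, but also reaches further from x, which grows the bias term; minimising their sum balances the two without cross-validation.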

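104 The Lipschitz assumption. For each arm, define the expected reward function, which maps a covariate to the arm's expected reward given that covariate. Lipschitz assumption: each expected reward function is Lipschitz continuous.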

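105 The margin assumption. Define the margin function as the gap, at each covariate, between the highest expected reward and the expected reward of the next-best arm. Margin assumption: the probability that the margin is small decays polynomially in the size of the margin.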

106 The dimension assumption. It holds whenever the covariates are drawn from a well-behaved measure on a compact Riemannian manifold of the corresponding dimension.

107 The Regret Bound. Reeve, Mellor & Brown (2017): suppose that 1) the Lipschitz assumption holds, 2) the margin assumption holds, and 3) the dimension assumption holds. Then a regret bound holds for the k-NN UCB algorithm, at the minimax optimal rate up to a logarithmic factor.

108 Empirical validation [figure: cumulative regret against time, for a two-dimensional manifold in a fifteen-dimensional feature space]

109 Discussion. Doesn't require prior knowledge of the time horizon or the dimension of the manifold. Achieves the minimax optimal rate, up to a logarithmic factor. The regret bound extends to any finite number of arms and to reward distributions with sub-Gaussian noise.

110 Thank you for listening!
