Real-world bandit applications: Bridging the gap between theory and practice

Size: px

Start display at page:

Download "Real-world bandit applications: Bridging the gap between theory and practice"

Lawrence Shepherd
5 years ago
Views:

1 Real-world bandit applications: Bridging the gap between theory and practice Audrey Durand EWRL 2018

2 The bandit setting (Thompson, 1933; Robbins, 1952; Lai and Robbins, 1985) Action Agent Context Environment Reward Flow 1. (Receive context) 2. Select action 3. Observe reward 4. Go back to step 1 Audrey Durand Real-world bandit applications 1

3 Structure in bandit problems Contextual bandits Reward = function of context Structured bandits Reward = function of action Reward 0-1 Reward 0.5 Action 1 Action 2 Action 3 Action Context Actions Audrey Durand Real-world bandit applications 2

4 Capturing structure with kernel regression Non-linear f(x) f(x) = ϕ(x) θ X ϕ( ) K Kernel k(x, x ) = ϕ(x) ϕ(x ) x, x X Audrey Durand Real-world bandit applications 3

5 Posterior model Gathering observations y 1, y 2,..., y N at locations x 1, x 2,..., x N Assuming that y i = f(x i ) + ξ i with ξ i N (0, σ 2 ) θ N d (0, Σ) with Σ = σ2 λ I d for λ > 0 Gaussian posterior on f f λ,n x 1,..., x N, y 1,..., y N N ([f λ,n (x)] x X, σ2 ) λ [k λ,n(x, x )] x,x X with f λ,n (x) = k N (x) (K N + λi N ) 1 y N, k λ,n (x, x ) = k(x, x ) k N (x) (K N + λi N ) 1 k N (x ) Audrey Durand Real-world bandit applications 4

6 Gaussian Process (GP) (Rasmussen and Williams, 2006) For λ = σ 2, i.e. assuming θ N d (0, I d ) (a) N = 1 (b) N = 10 (c) N = 50 Theoretical guarantees in the streaming setting Many applications in bandits, e.g. Kernel UCB, Kernel TS, CGP/GP-UCB, GP-TS Audrey Durand Real-world bandit applications 5

7 Adaptive treatment allocation for mice trials as contextual bandits Joint work with Joelle Pineau, Georgios D. Mitsis, Katerina Strati, Charis Achilleos, and Demetris Machine Learning for Healthcare (MLHC) 2018

Collecting data for personalized medecine Treatment adapted to the disease Which treatment should be allocated to patients with cancer given the stage of their disease?

8 Collecting data for personalized medecine Treatment adapted to the disease Which treatment should be allocated to patients with cancer given the stage of their disease? Tumour volume v = π 6 (lw)3/2 (Tomayko and Reynolds, 1989) Data acquisition setup Mice with induced cancer tumours Options: none, 5-FU, imiquimod, and imiquimod + 5-FU Natural death or sacrifice when tumour volume is critical Audrey Durand Real-world bandit applications 6

9 Initial data acquisition RCT (Randomized Clinical Trial) 6 mice, 163 triplets (v i, a i, v i ) Quick degradation of subjects Limited covering of the tumour volume space Volume (mm 3 ) (a) Tumour evolution Number of days Number of measurements (b) Tumour volume distribution Volume (mm 3 ) Audrey Durand Real-world bandit applications 7

10 Adaptive allocation ACT (Adaptive Clinical Trial) (Thompson, 1933) Favor selection of better treatments Reduce the exposition of less effective treatments Contextual bandits Context: tumour volume x t Actions: 4 possible treatment options Treatment effect: post-treatment tumour volume x t Reward: y t = x t Audrey Durand Real-world bandit applications 8

11 Select next action given the context BESA (Best Empirical Sampled Average) (Baransi et al., 2014) Fair comparison of empirical estimators Opportunities for action to show how good they are GP BESA for two actions Parameters: context x t, two actions a and b Let N (a) t 1 < N (b) t 1 Subset S (a) t all observations from a history Subset S (b) t Compute posterior mean subsample of N (a) t Compute posterior mean f (a) t f (b) t Select a t = argmax i {a,b} f (i) t (x t ) from b history using S (a) t using S (b) t Audrey Durand Real-world bandit applications 9

12 Regression on subsets of observations (a) N = 5 (b) N = Treatment effect f f N Treatment effect f f N Tumour volume Tumour volume (c) N = 20 (d) N = Treatment effect f f N Treatment effect f f N Tumour volume Tumour volume Audrey Durand Real-world bandit applications 10

13 Animal experiments Delayed feedback Update the history upon the completion of a group Group A mice 1 and 2 Group B mice 3, 4 and 5 Group C mice 6 and 7 Group D mice 8, 9 and 10 Baseline strategies for comparison No treatment (3 mice) Random (5 mice) 5-FU (4 mice) Audrey Durand Real-world bandit applications 11

14 Evolution of tumour volumes with adaptive allocation Volume (mm 3 ) Volume (mm 3 ) Volume (mm 3 ) Volume (mm 3 ) Group A Group B Group C Group D Days Recall: The algorithm is updated after each group Audrey Durand Real-world bandit applications 12

15 Animals live longer (a) Per approach (b) Per group of GP BESA Life duration (days) Life duration (days) None Random 5-FU GP BESA A B C D Preliminary results GP BESA median > 50% 5-FU median Visible improvement after each group To validate on a larger cohort Audrey Durand Real-world bandit applications 13

16 A better state space covering Number of measurements (a) Random allocation Volume (mm 3 ) Number de measurements (b) Adaptive allocation Volume (mm 3 ) Using data in a next phase More information on the tumor growth process 40% more data points of volume > 70mm 3 Audrey Durand Real-world bandit applications 14

17 Contextual bandits for adaptive clinical trials work well! Even though: Contexts were not independent The model was (over) simplistic Rewards were delayed We lost guarantees on the regression model Audrey Durand Real-world bandit applications 15

18 Theory vs Practice In theory... The kernel is well adapted to the function The noise variance σ 2 is known In practice... Sometimes the kernel is picked from expert knowledge Sometimes the kernel is tuned using previous data Sometimes σ 2 is learned from previous data Sometimes an upper bound σ 2 + σ 2 is used Audrey Durand Real-world bandit applications 16

19 Theory vs Practice In theory... The kernel is well adapted to the function The noise variance σ 2 is known In practice... Sometimes the kernel is picked from expert knowledge Sometimes the kernel is tuned using previous data Sometimes σ 2 is learned from previous data Sometimes an upper bound σ 2 + σ 2 is used Theory holds with this! Audrey Durand Real-world bandit applications 16

20 Impact of noise upper bound on regression model Using σ 2 + Using σ Audrey Durand Real-world bandit applications 17

21 Empirical noise variance estimate Confidence intervals on empirical variance (Maillard, 2016) Example with σ = 0.1: Confidence envelope σ + = 1.0 σ + = 0.5 σ + = t Use these intervals to adapt the regularization (λ) Audrey Durand Real-world bandit applications 18

22 Example: Adapting λ t vs fixed λ (unknown σ) 10 observations 100 observations observations observations f λ = σ 2 +/C 2 λ t = σ 2 +,t 1/C X X Audrey Durand Real-world bandit applications 19

23 Theory can hold with more realistic assumptions Requires initial lower/upper bounds σ /σ + instead of σ Preserves guarantees on the resulting regression model Tightest confidence intervals than using fixed σ + All details in Durand, Maillard, Pineau (JMLR 2018) Audrey Durand Real-world bandit applications 20

Lavoie-Cardinal, Paul De Koninck, Christian Gagné, Theresa

24 Optimizing super-resolution imaging parameters as multi-objective structured bandits Joint work with Flavie Lavoie-Cardinal, Paul De Koninck, Christian Gagné, Theresa Wiesner, Marc-André Gardner, Louis-Émile Robitaille, and Anthony Bilodeau

25 Observing structures at the nanoscale (Hell and Wichmann, 1994) Audrey Durand Real-world bandit applications 21

26 Optimizing imaging parameters Biology: The optimal parameters are not always the same Find good parameters during the imaging task Maximize the acquisition of images useful to researchers Minimize trials of poor parameters Structured bandits Action space: Space of parameters that can be tuned Reward function defined over the action space Audrey Durand Real-world bandit applications 22

27 Trade-off between different objectives Conflicting objectives Maximize image quality (structure visible) Minimize photobleaching (structure maintained after imaging) Include an expert in the learning loop Multi-objective bandits Estimates Expert Action Agent Environment Rewards Audrey Durand Real-world bandit applications 23

28 Producing estimates Thompson Sampling (Thompson, 1933) Decision based on sampling from the posterior distribution Kernel TS in the multi-objective loop Model each objective function f,i using kernel regression Sample f t,i from the posterior distribution on f,i Estimate at parameter configuration x = ( f t,i (x)) i E.g. (sampled imaging quality, sampled photobleaching) Audrey Durand Real-world bandit applications 24

29 Presenting estimates to user Expert acts as an argmax on the preference function Audrey Durand Real-world bandit applications 25

30 Experiments on neuronal imaging: two objectives Experimental conditions Three parameters (1000 configurations): Excitation laser power Depletion laser power Duration of imaging per pixel Two objectives: f,1 : image quality (structure visible) f,2 : photobleaching (structure maintained after imaging) Image quality feedback provided by expert Audrey Durand Real-world bandit applications 26

31 Example: Image quality of actin rings Poor images Good images Audrey Durand Real-world bandit applications 27

Logarithmic regret, as expected Neuron: Rat neuron PC12: Rat tumor cell line ulti-objective optimization of microscopy parameters for re HEK293: Human embryonic kidney

32 Logarithmic regret, as expected Neuron: Rat neuron PC12: Rat tumor cell line ulti-objective optimization of microscopy parameters for re HEK293: Human embryonic kidney cells parison across different cell types. a) Parameter configurati agingaudrey trials Durand in neuronal (green), Real-worldPC12 bandit applications (blue), and HEK (o

33 Better images without manual tuning Audrey Durand Real-world bandit applications 29

34 Fully automatized loop! Audrey Durand Real-world bandit applications 30

Logarithmic regret again, without expert in the loop I Bassoon: Protein from the nerve terminals active zone I Tubulin: Protein of cytoskeleton

35 Logarithmic regret again, without expert in the loop I Bassoon: Protein from the nerve terminals active zone I Tubulin: Protein of cytoskeleton I Actin: Protein involved in cell motility and establishment/maintenance of cell junctions/shape Audrey Durand Real-world bandit applications 31

36 Fully automated imaging results Audrey Durand Real-world bandit applications 32

37 Structured bandits for online imaging parameters tuning with multiple objectives work well! Even though: We may have experts in the loop We may even have neural networks in the loop Audrey Durand Real-world bandit applications 33

38 Take home Bandits have great applications Theory is important: we should make it applicable Audrey Durand Real-world bandit applications 34

39 Thanks! Collaborators: Joelle Pineau Odalric-Ambrym Maillard Christian Gagné Georgios D. Mitsis Katerina Strati Demetris Iacovides Charis Achilleos Flavie-Lavoie Cardinal Paul De Koninck Theresa Wiesner Marc-André Gardner Louis-Émile Robitaille Anthony Bilodeau Audrey Durand Real-world bandit applications 35

An introduction to multi-armed bandits

An introduction to multi-armed bandits Henry WJ Reeve (Manchester) (henry.reeve@manchester.ac.uk) A joint work with Joe Mellor (Edinburgh) & Professor Gavin Brown (Manchester) Plan 1. An introduction to