Bandit-based Search for Constraint Programming

Bandit-based Search for Constraint Programming

Manuel Loth (1,2,4), Michèle Sebag (2,4,1), Youssef Hamadi (3,1), Marc Schoenauer (4,2,1), Christian Schulte (5)

1. Microsoft-INRIA joint centre
2. LRI, Univ. Paris-Sud and CNRS
3. Microsoft Research Cambridge
4. INRIA Saclay
5. KTH, Stockholm

Review AERES, Nov. Laboratoire de Recherche en Informatique

Search/Optimization and Machine Learning

Different learning contexts:
- Supervised (from examples) vs. Reinforcement (from reward)
- Off-line (static) vs. On-line (while searching)

Here: use on-line Reinforcement Learning (MCTS) to improve CP search.

Main idea

Constraint Programming:
- Explore a search tree
- Heuristics: (learn to) order variables and values

Monte-Carlo Tree Search:
- A tree-search method
- Breakthrough for games and planning

Hybridizing MCTS and CP: Bandit-based Search for Constraint Programming (BaSCoP).

Overview

- MCTS
- BaSCoP
- Experimental validation
- Conclusions and perspectives

The Multi-Armed Bandit problem (Lai & Robbins, 1985)

In a casino, one wants to maximize one's gains while playing: lifelong learning.

The Exploration vs. Exploitation dilemma:
- Play the best arm so far? (exploitation)
- But there might exist better arms... (exploration)

The Multi-Armed Bandit problem (2)

$K$ arms; the $i$-th arm gives reward 1 with probability $\mu_i$, 0 otherwise.
At each time $t$, one selects an arm $i_t$ and gets a reward $r_t$.
- $n_{i,t}$ = number of times arm $i$ has been selected in $[0,t]$
- $\hat{\mu}_{i,t}$ = average reward of arm $i$ in $[0,t]$

Upper Confidence Bound (Auer et al., 2002): be optimistic when facing the unknown.
Select $\arg\max_i \left\{ \hat{\mu}_{i,t} + C \sqrt{\frac{\log\left(\sum_j n_{j,t}\right)}{n_{i,t}}} \right\}$

$\epsilon$-greedy:
- with probability $1-\epsilon$, select $\arg\max_i \{\hat{\mu}_{i,t}\}$ (exploitation)
- else select an arm uniformly at random (exploration)
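
To make the two rules concrete, here is a minimal Python sketch of both selection policies; it is not from the talk, and the function names and the constant C are illustrative assumptions:

```python
import math
import random

def ucb_select(counts, means, C=1.0):
    """UCB: pick the arm maximizing mean + C * sqrt(log(total pulls) / pulls)."""
    # Play every arm once before the confidence bound is well defined.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: means[i] + C * math.sqrt(math.log(total) / counts[i]))

def epsilon_greedy_select(counts, means, eps=0.1):
    """epsilon-greedy: exploit the best mean w.p. 1 - eps, else explore uniformly."""
    if random.random() < eps:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])
```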

Monte-Carlo Tree Search (Kocsis & Szepesvári, 2006)

UCT = UCB for Trees: gradually grow the search tree.

Iterate tree-walks built from these blocks:
- Select next action: bandit phase
- Add a node: grow a leaf of the search tree
- Select next action (bis): random phase, roll-out
- Compute instant reward: evaluate
- Update information in the visited nodes of the search tree: propagate

Returned solution: the path visited most often.

[Figure: animation of a tree-walk: bandit phase through the explored search tree, a new node added, then a random roll-out.]
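
The five building blocks translate into a short generic skeleton. Below is a hedged Python sketch of one tree-walk, assuming an abstract `problem` object with `actions`, `step`, `is_terminal`, and `evaluate` methods; these names are assumptions made for illustration, not part of the original algorithm description.

```python
import math
import random

class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean reward

def uct_walk(root, state, problem, C=1.0):
    """One tree-walk: select (bandit), grow one leaf, roll out, propagate."""
    path = [root]
    # Bandit phase: descend while every action already has a child node.
    while not problem.is_terminal(state) and \
          len(path[-1].children) == len(problem.actions(state)):
        node = path[-1]
        action = max(node.children,
                     key=lambda a: node.children[a].value +
                         C * math.sqrt(math.log(node.visits) /
                                       node.children[a].visits))
        state = problem.step(state, action)
        path.append(node.children[action])
    # Grow a leaf: add one new node for an untried action.
    if not problem.is_terminal(state):
        untried = [a for a in problem.actions(state) if a not in path[-1].children]
        action = random.choice(untried)
        child = Node()
        path[-1].children[action] = child
        state = problem.step(state, action)
        path.append(child)
    # Random phase (roll-out) down to a terminal state, then evaluate.
    while not problem.is_terminal(state):
        state = problem.step(state, random.choice(problem.actions(state)))
    reward = problem.evaluate(state)
    # Propagate: update visit counts and running means along the visited path.
    for node in path:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
    return reward
```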

Overview

- MCTS
- BaSCoP
- Experimental validation
- Conclusions and perspectives

Adaptation

Main issues:
- Which default policy? (random phase)
- Which reward?
- Which selection rule?

Desired:
- As problem-independent as possible
- Compatible with multiple restarts
- (some) Guarantees of completeness

Default policy: Depth-First Search (DFS)

- Enforces completeness.
- Accounts for priors about values (some are better than others; neighborhood of the last best solution).
- Limited memory required: under each MCTS leaf node, store only the current DFS path (assignments on the left of the DFS path are closed). A sketch of this bookkeeping follows.
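
One way to picture the memory argument: each MCTS leaf keeps only its current DFS path, so the cost is O(depth) per leaf rather than the size of the explored subtree. The sketch below is hypothetical (binary domains, no constraint propagation or failure handling), purely to illustrate the resumable-path representation:

```python
class DFSBelowLeaf:
    """Resumable depth-first search stored under one MCTS leaf.

    Only the current DFS path is kept, as a list of (variable, value)
    choices; assignments to the left of the path are closed (explored).
    Real CP search would also propagate constraints and prune failures.
    """
    def __init__(self, variables, domain=(0, 1)):
        self.variables = variables
        self.domain = sorted(domain)
        self.path = []                       # current DFS path

    def next_leaf(self):
        """Advance to the next full assignment in left-to-right DFS order."""
        if self.path:
            # Backtrack over exhausted right-most choices, then step right once.
            while self.path and self.path[-1][1] == self.domain[-1]:
                self.path.pop()
            if not self.path:
                return None                  # whole subtree explored
            var, val = self.path.pop()
            self.path.append((var, self.domain[self.domain.index(val) + 1]))
        # Extend with left-most values down to a full assignment.
        while len(self.path) < len(self.variables):
            self.path.append((self.variables[len(self.path)], self.domain[0]))
        return list(self.path)
```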

Reward

With multiple restarts, rewards cannot be attached to tree nodes, so rewards are attached to elementary assignments, i.e. (variable = value).

Guiding principles:
- Variables: fail first (existing heuristics perform well).
- Values: fail deep.

$\text{reward}(var = val) = \begin{cases} 1 & \text{if the failure is deeper than the (local) average} \\ 0 & \text{otherwise} \end{cases}$

Discussion:
- Compatible with multiple restarts.
- Noise: $var$ might occur at different depths, but this noise equally affects all values of $var$.
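
A minimal sketch of this fail-deep reward, assuming per-assignment statistics are kept in dictionaries; the data structures and names are illustrative, not from the paper:

```python
from collections import defaultdict

avg_depth = defaultdict(float)   # (var, val) -> running average failure depth
count = defaultdict(int)         # (var, val) -> number of observed failures

def fail_deep_reward(var, val, fail_depth):
    """Reward 1 iff this failure is deeper than the local average for var=val."""
    key = (var, val)
    r = 1.0 if fail_depth > avg_depth[key] else 0.0
    # Update the local average with the newly observed failure depth.
    count[key] += 1
    avg_depth[key] += (fail_depth - avg_depth[key]) / count[key]
    return r
```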

Selection rules

L-value: left-value (0); R-value: right-value (1).

Baselines (non-adaptive):
- Uniform
- $\epsilon$-left: with probability $1-\epsilon$, select the L-value; otherwise, the R-value.

Adaptive selection rules:
- UCB: select the $val$ maximizing $\widehat{\text{reward}}(var = val) + C \sqrt{\frac{\log\left(\sum_{val'} n(var = val')\right)}{n(var = val)}}$
- UCB-left: same, but with $C_{left} = \rho \, C_{right}$, $\rho > 1$.
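
The following Python sketch contrasts the non-adaptive $\epsilon$-left rule with the adaptive UCB-left rule; the counts, means, and constants are illustrative assumptions:

```python
import math
import random

def eps_left(values, eps=0.1):
    """epsilon-left: select the L-value w.p. 1 - eps, otherwise the R-value."""
    return values[0] if random.random() >= eps else values[-1]

def ucb_left(values, n, mean, C_right=0.1, rho=2.0):
    """UCB-left: UCB over values, with C_left = rho * C_right (rho > 1)."""
    total = sum(n[v] for v in values)
    def score(i, v):
        if n[v] == 0:
            return float("inf")            # try unexplored values first
        C = rho * C_right if i == 0 else C_right
        return mean[v] + C * math.sqrt(math.log(total) / n[v])
    return max(enumerate(values), key=lambda iv: score(*iv))[1]
```

With `values = [0, 1]` this reproduces the binary L-value/R-value setting above, and plain UCB is recovered as the special case `rho = 1`.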

Overview

- MCTS
- BaSCoP
- Experimental validation
- Conclusions and perspectives

Goal of experiments

- Compare BaSCoP with baselines: DFS alone; adaptive and non-adaptive selection rules.
- Genericity.
- Robustness w.r.t. multiple restarts.
- Sensitivity analysis w.r.t. parameters.

Experimental setting

Algorithmic framework: Gecode.

Top policies:
- non-adaptive: Uniform, $\epsilon$-left
- adaptive: UCB, UCB-Left

Parameters:
- $\epsilon \in \{0.05, 0.1, 0.15, 0.2\}$
- $C \in \{0.05, 0.1, 0.2, 0.5\}$
- $\rho \in \{1, 2, 4, 8\}$

Bottom policies: Depth-First Search, $\epsilon$-left, UCB, UCB-Left.

Benchmark problems

Job-shop scheduling:
- 40 Taillard instances
- Multiple restarts (Luby sequence), neighborhood search
- Performance: mean relative error (w.r.t. the best known results)

Car-sequencing:
- 70 instances, circa 200 n-ary variables
- Performance: number of violations
- No restart

All results are averaged over 11 runs.
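
For reference, the Luby restart sequence (1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 4, 8, ...) mentioned above can be generated with its standard recursive definition; a minimal sketch, where the base cutoff of 64 is illustrative rather than the setting used in these experiments:

```python
def luby(i):
    """i-th term (i >= 1) of the Luby restart sequence."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if i == (1 << k) - 1:
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)

# Restart cutoffs are typically a base cutoff times the Luby term:
cutoffs = [64 * luby(i) for i in range(1, 9)]   # 64, 64, 128, 64, 64, 128, 256, 64
```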

Structures of visited trees

[Figure: typical shapes of the visited trees on a JSP Taillard instance, under Uniform, UCB, $\epsilon$-left, and UCB-Left.]

Experimental Results

State-of-the-art results on several instances (within a fixed number of tree-walks).

[Figure: mean relative error to the best-known solution vs. number of tree-walks, for DFS, Balanced, $\epsilon$-left ($\epsilon=0.15$), UCB ($C=0.1$), and UCB-left ($C_l=0.2$, $C_r=0.1$).]

Sample result: Mean Relative Error on a Taillard instance.

Car Sequencing

[Figure: car assembly line with an option (ABS) installed on some of the ordered cars, and stall capacities 2/3, 2/5, 1/2.]

- Car assembly line, with different options on the ordered cars.
- Stalls can handle a given number of cars.
- Arrange the car sequence so as not to exceed any capacity; minimize the number of empty stalls.
- n-ary, no restart, no positional bias of values.

Car Sequencing (results)

[Figure: number of empty stalls per instance, for DFS and UCB with $C \in \{0.05, 0.1, 0.2, 0.5\}$.]

BaSCoP is modestly but significantly better than DFS... but both are significantly worse than ad hoc heuristics.

Overview

- MCTS
- BaSCoP
- Experimental validation
- Conclusions and perspectives

Conclusion

- BaSCoP is integrated in the Gecode framework.
- Generic heuristics for value ordering.
- Compatible with multiple restarts.
- DFS as the roll-out policy provides completeness guarantees.
- Improves on DFS on 2 of the 3 benchmark families.
- State-of-the-art CP results on JSP, without any ad hoc heuristics.

Perspectives

Extensions:
- Rank-based reward for values, for n-ary contexts.
- When there is no restart, full MCTS (rewards attached to partial assignments).
- Rewards for variable ordering.
- Control of the parallelization scheme (adaptive work stealing).
