When Network Embedding meets Reinforcement Learning?


1 When Network Embedding meets Reinforcement Learning? ---Learning Combinatorial Optimization Problems over Graphs Changjun Fan 1

2 1. An Introduction to (Deep) Reinforcement Learning 2. How to combine NE and RL to solve combinatorial problems on graphs 2

3 1. An Introduction to (Deep) Reinforcement Learning 3

4 Supervised Learning 4

5 Unsupervised Learning 5

6 Reinforcement Learning 6

7 Reinforcement Learning 7

8 Reinforcement Learning - Example 8

9 How can we mathematically formalize the RL problem? Markov Decision Process: MDP = (S, A, P, R, γ). Markov Assumption: P(s_{t+1} | s_0, a_0, ..., s_t, a_t) = P(s_{t+1} | s_t, a_t). Reward Assumption: R(s_0, a_0, ..., s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) = r_{t+1} ∈ R. 9

10 How can we mathematically formalize the RL problem? Markov Decision Process: MDP = (S, A, P, R, γ). Follow a policy and sample trajectories (paths): s_0, a_0, r_0, s_1, a_1, r_1, ... Policy: π(s, a) = P(a_t | s_t) ∈ [0, 1], that is a_t ~ π(s_t, ·). Goal: find the policy π that maximizes the cumulative discounted reward: max_π Σ_{t≥0} γ^t r_t. 10
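To make the MDP notation concrete, here is a minimal sketch of sampling one trajectory and computing its discounted return; the `env` and `policy` objects and their methods are illustrative assumptions, not part of the slides.

```python
def rollout(env, policy, gamma=0.99, max_steps=100):
    """Sample one trajectory s0, a0, r0, s1, ... and return its discounted return."""
    s = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)             # a_t ~ pi(s_t, .)
        s, r, done = env.step(a)  # next state depends only on (s_t, a_t): Markov assumption
        ret += discount * r       # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return ret
```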

11 Value function 11

12 State Value function : V 12

13 State Value function : V 13

14 State-Action Value function : Q 14

15 Relation between Q and V functions: 15

16 The optimal Value function and optimal Policy Optimal policy and optimal state-value function: 16

17 The optimal Value function and optimal Policy 17

18 The optimal Value function and optimal Policy Proof: Algorithms for Reinforcement Learning, Szepesvári, Csaba. 18

19 Bellman optimality equation for V* 19

20 Bellman optimality equation for V* 20

21 Bellman optimality equation for Q* 21
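For reference, the Bellman optimality equations for V* and Q* in their standard textbook form, written in the MDP notation above (this explicit form is not transcribed on the slides):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]

Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \bigr]
```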

22 Solving for the optimal value function 22

23 Solving for the optimal value function: Q-learning. DQN: Deep Q-Network 23

24 Solving for the optimal value function : Q-learning 24
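A minimal tabular Q-learning sketch showing the standard update Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]; the toy `env` interface (`reset`, `step`, `actions`) is an assumption for illustration, not from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda a: Q[(s, a)])
            s_next, r, done = env.step(a)
            # one-step Q-learning target
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```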

25 Case study: Atari Game. Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015.

26 Q-network Architecture. Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015.

27 Training Techniques: Experience Replay. Experience Replay: store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, ..., e_t} pooled over many episodes into a replay memory, and train the Q network on random mini-batches sampled from this replay memory. Advantages: Ø Each step of experience is potentially used in many weight updates, which allows for greater data efficiency; Ø Learning directly from consecutive samples is inefficient owing to the strong correlations between them; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. 27
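A minimal replay-memory sketch; the class and method names are illustrative, not taken from the Nature paper's code.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random mini-batch breaks correlations between consecutive samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```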

28 Training Techniques: target Q network. Use a separate network for generating the targets y in the Q-learning updates. More precisely, every C updates, clone the network Q to obtain a target network Q̂, and use Q̂ for generating the Q-learning targets y for the following C updates. Advantage: generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y, making divergence or oscillations much less likely. 28
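A small sketch of the target-network idea; the generic `q_net` object with `predict` and copyable parameters is an assumption for illustration.

```python
import copy

class DQNTrainer:
    """Minimal sketch: an online Q network plus a periodically cloned target network."""
    def __init__(self, q_net, sync_every=10_000, gamma=0.99):
        self.q_net = q_net
        self.q_target = copy.deepcopy(q_net)     # frozen copy used to compute targets
        self.sync_every = sync_every
        self.gamma = gamma
        self.step = 0

    def target(self, reward, next_state, done):
        # targets come from the older parameters, which stabilizes learning
        if done:
            return reward
        return reward + self.gamma * max(self.q_target.predict(next_state))

    def after_update(self):
        self.step += 1
        if self.step % self.sync_every == 0:
            self.q_target = copy.deepcopy(self.q_net)   # every C updates: Q-hat := Q
```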

29 Putting it together: Deep Q-Learning with Experience Replay. Human-level control through deep reinforcement learning, Nature, 2015.

30 Putting it together: Deep Q-Learning with Experience Replay. Initialize replay memory and Q-network. Human-level control through deep reinforcement learning, Nature, 2015.

31 Putting it together: Deep Q-Learning with Experience Replay. For each timestep t of the game: Human-level control through deep reinforcement learning, Nature, 2015.

32 Putting it together: Deep Q-Learning with Experience Replay. With small probability, select a random action (explore); otherwise select the greedy action from the current policy. Human-level control through deep reinforcement learning, Nature, 2015.

33 Putting it together: Deep Q-Learning with Experience Replay. Take the action a_t, and observe the reward r_t and next state s_{t+1}. Human-level control through deep reinforcement learning, Nature, 2015.

34 Putting it together: Deep Q-Learning with Experience Replay. Store the transition in replay memory. Human-level control through deep reinforcement learning, Nature, 2015.

35 Putting it together: Deep Q-Learning with Experience Replay. Experience replay: sample a random minibatch of transitions from replay memory and perform a gradient descent step. Human-level control through deep reinforcement learning, Nature, 2015.

36 Putting it together: Deep Q-Learning with Experience Replay Human-level control through deep reinforcement learning, Nature, 2015 Every C steps, clone the current Q network to obtain a target Q network 36
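Putting the pieces above together, a condensed sketch of the deep Q-learning loop with experience replay and a target network; the helper names and hyper-parameter values are illustrative, not the paper's exact ones.

```python
import random

def train_dqn(env, q_net, memory, num_steps=1_000_000,
              batch_size=32, eps=0.05, gamma=0.99, sync_every=10_000):
    q_target = q_net.clone()                       # target network Q-hat
    s = env.reset()
    for t in range(num_steps):
        # epsilon-greedy: explore with small probability, otherwise act greedily
        if random.random() < eps:
            a = env.sample_action()
        else:
            a = q_net.best_action(s)
        s_next, r, done = env.step(a)
        memory.store(s, a, r, s_next, done)        # store transition in replay memory
        s = env.reset() if done else s_next

        if len(memory) >= batch_size:
            batch = memory.sample(batch_size)      # random minibatch from replay memory
            targets = [r_i if d_i else r_i + gamma * max(q_target.predict(s_i_next))
                       for (_, _, r_i, s_i_next, d_i) in batch]
            q_net.gradient_step(batch, targets)    # SGD step on (y - Q(s, a))^2

        if t % sync_every == 0:
            q_target = q_net.clone()               # every C steps: Q-hat := Q
```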

37 Using DQN to play Atari Games

38 Two main methods of Reinforcement Learning Ø Value based: Q-Learning Ø Policy based: Policy Gradient 38

39 Solving for the optimal policy: Policy Gradient. What's the problem with Q-learning? The Q-function can be very complicated! For example, a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair. But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. find the best policy from a collection of policies? 39

40 Solving for the optimal policy: Policy Gradient Gradient ascent on policy parameters! 40

41 Solving for the optimal policy: Policy Gradient 41
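As a concrete illustration of gradient ascent on policy parameters, here is a minimal REINFORCE-style sketch; the `policy` object and its methods (`sample`, `grad_log_prob`, `params`) are assumptions for illustration, not from the slides.

```python
def reinforce(policy, env, gamma=0.99, lr=1e-3, episodes=10):
    """Monte-Carlo policy gradient: increase the log-probability of each action
    in proportion to the discounted return that followed it."""
    for _ in range(episodes):
        # sample one trajectory under the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.sample(s)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # compute the discounted return G_t from each time step onward
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # gradient ascent step on sum_t G_t * grad log pi(a_t | s_t)
        for s, a, G in zip(states, actions, returns):
            policy.params += lr * G * policy.grad_log_prob(s, a)
```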

42 Variance Reduction 42

43 How to choose the baseline? 43

44 Actor-Critic Algorithm 44

45 Actor-Critic Algorithm 45
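A brief sketch of the actor-critic idea, where a learned value function serves as the baseline that reduces the variance of the policy gradient; all object and method names here are illustrative assumptions.

```python
def actor_critic_step(actor, critic, transition, gamma=0.99,
                      actor_lr=1e-3, critic_lr=1e-2):
    """One online actor-critic update: the critic's value estimate is the baseline."""
    s, a, r, s_next, done = transition
    v_next = 0.0 if done else critic.value(s_next)
    td_target = r + gamma * v_next
    advantage = td_target - critic.value(s)          # A(s, a) = TD target - baseline V(s)

    # critic: move V(s) toward the TD target
    critic.params += critic_lr * advantage * critic.grad_value(s)
    # actor: policy gradient weighted by the advantage instead of the raw return
    actor.params += actor_lr * advantage * actor.grad_log_prob(s, a)
```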

46 2. How to combine NE and RL to solve combinatorial optimization problems on graphs. Q1: What is a combinatorial optimization problem? A1: It is the problem of finding an optimal object from a finite set of objects. Q2: What is a combinatorial optimization problem on a graph? A2: The objects are nodes or edges, as in MVC (minimum vertex cover), MAXCUT, etc. Q3: What are the traditional methods for graph combinatorial optimization problems? A3: Traditional methods fall into three types: Ø Exact algorithms Ø Approximate algorithms Ø Heuristic algorithms Q4: Why a new algorithm? A4: All three paradigms seldom exploit a common trait of real-world optimization problems: instances of the same type of problem are solved again and again on a regular basis, maintaining the same combinatorial structure but differing mainly in their data. The new algorithm aims to provide a general framework for these problems. 46

47 A motivational example Minimum Vertex Cover Find smallest vertex subset S s.t. each edge has at least one end in S Models advertising optimization in social networks 47

48 A motivational example Minimum Vertex Cover Find smallest vertex subset S s.t. each edge has at least one end in S Models advertising optimization in social networks 48

49 Proposal: Learning Greedy Algorithms. Minimum Vertex Cover 2-approx: greedily pick the uncovered edge with the maximum sum of degrees of its endpoints and add its two endpoints to the cover. Goal: construct a solution by sequentially adding nodes to a partial solution S, based on maximizing some evaluation function Q which measures the quality of a node in the context of the current partial solution. 49
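A short sketch of the 2-approx greedy heuristic described above; the graph is given as a plain edge list, and the function name is illustrative.

```python
def greedy_vertex_cover(edges):
    """Repeatedly take the uncovered edge whose endpoints have the largest total
    degree, and add both endpoints to the cover (2-approximation for MVC)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    cover, uncovered = set(), set(edges)
    while uncovered:
        u, v = max(uncovered, key=lambda e: degree[e[0]] + degree[e[1]])
        cover.update([u, v])
        uncovered = {e for e in uncovered if e[0] not in cover and e[1] not in cover}
    return cover

# example: a path graph 0-1-2-3 -> cover {1, 2}
print(greedy_vertex_cover([(0, 1), (1, 2), (2, 3)]))
```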

50 Problem Statement. Given a graph optimization problem G and a distribution D of problem instances, can we learn better greedy heuristics that generalize to unseen instances from D? Minimum Vertex Cover: insert nodes into the cover. Maximum Cut: insert nodes into a subset. Traveling Salesman Problem: insert nodes into a sub-tour. Learning combinatorial optimization algorithms over graphs, Hanjun Dai, et al., NIPS 2017. 50

51 Challenge #1: How to Learn. Possible approach: supervised learning. Given a partial solution, predict the next vertex to add to the solution. Data: collect (partial solution, next vertex) pairs, i.e. (features, label). Task: multi-class classification. Pointer Network [Vinyals, et al., NIPS 2015]: a smarter approach with recurrent neural networks. PROBLEM: need to compute good/optimal solutions to NP-hard problems in order to learn! 51

52 Recall: Reinforcement Learning Background. Reward r_t: score you earned at the current step. State S: current screen. Action i: move your paddle left / right. Action-value function Q̂(S, i): your predicted future total reward. Policy π(S): how to choose your action. Greedy policy: i* = argmax_i Q̂(S, i). [Mnih, et al., Nature 2015]

53 Reinforcement Learning Formulation. Minimum Vertex Cover: min Σ_{i∈V} x_i s.t. x_i + x_j ≥ 1 ∀(i, j) ∈ E, x_i ∈ {0, 1}. Repeat until all edges are covered: 1. Compute a score for each vertex 2. Select the vertex with the largest score 3. Add the best vertex to the cover and update state S. Reward: r_t = -1 per node added (so maximizing total reward minimizes cover size). State S: the currently selected nodes. SOLUTION: improve the policy by learning from experience => no need to compute optima. Action-value function: Q̂(S, v). Greedy policy: v* = argmax_v Q̂(S, v). 53
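A hedged sketch of the greedy construction loop just described, with Q̂ abstracted as a scoring function passed in as `q_hat`; in the actual method this score comes from the learned graph embedding introduced later.

```python
def greedy_with_q(nodes, edges, q_hat):
    """Build a vertex cover by repeatedly adding the node with the highest Q score."""
    S, uncovered = [], set(edges)
    while uncovered:
        candidates = [v for v in nodes if v not in S]
        v = max(candidates, key=lambda v: q_hat(S, v))    # greedy policy: argmax_v Q(S, v)
        S.append(v)                                        # update the state
        uncovered = {e for e in uncovered if v not in e}   # drop newly covered edges
    return S
```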

54 Reinforcement Learning Algorithm
Algorithm 1: Q-learning for the Greedy Algorithm
1: Initialize experience replay memory M to capacity N
2: for episode e = 1 to L do
3:   Draw graph G from distribution D  (instance generation)
4:   Initialize the state to empty: S_1 = ()
5:   for step t = 1 to T do
6:     v_t = a random node v ∉ S_t with probability ε, otherwise argmax_{v ∉ S_t} Q̂(h(S_t), v; Θ)
7:     Add v_t to the partial solution: S_{t+1} := (S_t, v_t)  (update state)
8:     if t ≥ n then
9:       Add tuple (S_{t-n}, v_{t-n}, R_{t-n,t}, S_t) to M
10:      Sample a random batch B iid. from M
11:      Update Θ by SGD over the loss (y - Q̂(h(S_t), v_t; Θ))^2 for B, where y = γ max_{v'} Q̂(h(S_{t+1}), v'; Θ) + r(S_t, v_t)  (optimize model parameters)
12:    end if
13:  end for
14: end for
15: return Θ
Θ: model parameters; Q̂ depends on the graph features h(S_t). Storing n-step tuples helps deal with the issue of delayed rewards. 54

55 Challenge #2: How to Represent. Representation of v: a feature vector that describes v in state S_t. Representation of S_t: a feature vector that describes state S_t. Possible approach: feature engineering (degree, 2-hop neighborhood size, other centrality measures). PROBLEMS: 1- Task-specific engineering needed 2- Hard to tell what is a good feature 3- Difficult to generalize across different graph sizes 55

56 structure2vec: Deep Node Representations [Dai, et al., ICML 2016]. Repeat the embedding update T times; for each node v, update its feature vector μ_v^{(t+1)} ∈ R^p as
μ_v^{(t+1)} ← relu( θ_1 x_v + θ_2 Σ_{u∈N(v)} μ_u^{(t)} + θ_3 Σ_{u∈N(v)} relu(θ_4 w(v, u)) ),
combining the node's own tag x_v, the neighbors' features μ_u^{(t)}, and the neighbors' edge weights w(v, u). Non-linearity: relu(x) = max(0, x). Θ: model parameters. 56

57 structure2vec: Deep Node Representations [Dai, et al., ICML 2016]. Non-linearity: relu(x) = max(0, x). Repeat the embedding T times, updating each node's feature vector, then compute the Q-value
Q̂(h(S), v; Θ) = θ_5^T relu([ θ_6 Σ_{u∈V} μ_u^{(T)}, θ_7 μ_v^{(T)} ]),
where the first term is a sum-pooling over all nodes. SOLUTION: 1- No feature engineering needed 2- Features and parameters are trained jointly to be good 3- Can handle different graph sizes. Θ: model parameters. 57
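A hedged numpy sketch of the embedding iteration and Q readout above. The embedding dimension p, the parameter shapes (θ_1, θ_4: vectors of size p; θ_2, θ_3, θ_6, θ_7: p×p matrices; θ_5: vector of size 2p), and the dense adjacency representation are all illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def structure2vec_q(x_tags, adj_w, theta, T=4):
    """x_tags: (n,) node tags (1 if the node is in partial solution S, else 0).
    adj_w:  (n, n) weighted adjacency matrix (0 where there is no edge).
    Returns mu (n, p) node embeddings and q (n,) scores Q(h(S), v)."""
    n = len(x_tags)
    p = theta["theta2"].shape[0]
    mu = np.zeros((n, p))

    edge_term = relu(np.outer(adj_w.reshape(-1), theta["theta4"]))   # relu(theta4 * w(v, u))
    neighbor_edge = edge_term.reshape(n, n, p).sum(axis=1)           # sum over neighbors u

    for _ in range(T):
        neighbor_feat = (adj_w > 0).astype(float) @ mu               # sum_{u in N(v)} mu_u
        mu = relu(np.outer(x_tags, theta["theta1"])                  # theta1 * x_v
                  + neighbor_feat @ theta["theta2"].T                # theta2 * sum mu_u
                  + neighbor_edge @ theta["theta3"].T)               # theta3 * sum relu(theta4 w)

    pooled = theta["theta6"] @ mu.sum(axis=0)                        # sum-pooling over all nodes
    per_node = mu @ theta["theta7"].T                                # theta7 * mu_v
    q = relu(np.concatenate([np.tile(pooled, (n, 1)), per_node], axis=1)) @ theta["theta5"]
    return mu, q
```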

58 Overall Framework 58

59 Experimental Setup Graph types Minimum Vertex Cover (MVC) Erdos-Renyi (ER) or Barabasi-Albert (BA) Maximum Cut (MAXCUT) ER or BA Traveling Salesman Problem (TSP) DIMACS generator; uniform grid or clustered Solvers ILP with CPLEX IQP with CPLEX Concorde Feature embedding size: 64 Embedding iterations T: 3 to 5 Full details in paper 59

60 Results: Solution Quality [MVC - BA]. Near optimal, barely visible (figure: approximation ratio).

61 Results: Solution Quality [MAXCUT - BA] 61

62 Results: Solution Quality [TSP - clustered] 62

63 Results: Realistic Instances
Network      Nodes   Edges   Weighted
MemeTracker                  No
Physics                      {-1, 0, 1}
TSPLIB       \       \
63

64 Results: Algorithm Behavior 64

65 Results: Algorithm Behavior (figure: snapshots of the partial solution at steps (1) through (11)). 65

66 Learning graph opt: quantitative comparison. Train on small graphs; generalize not only to graphs from the same distribution, but also to larger graphs (figure: approximation ratio, generalization to large instances).

67 Conclusion A learning framework that exploits graph structure Applies directly to many graph optimization problems Promising tool for automated algorithm design NIPS paper: Code:

68 Any Questions? 68
