Fast Convergence of Regularized Learning in Games


1 Fast Convergence of Regularized Learning in Games
Vasilis Syrgkanis (Microsoft Research NYC), Alekh Agarwal (Microsoft Research NYC), Haipeng Luo (Princeton University), Robert Schapire (Microsoft Research NYC)

2-3 Strategic Interactions in Computer Systems
Game-theoretic scenarios in modern computer systems: internet routing, advertising auctions.

4-7 How Do Players Behave?
Classical game theory: players play according to Nash equilibrium.
Caveats: How do players converge to equilibrium? Nash equilibrium is computationally hard.
Most scenarios involve repeated strategic interactions, where simple adaptive game playing is more natural: players learn to play well over time from past experience, e.g. dynamic bid optimization tools in online ad auctions.

8-10 No-Regret Learning in Games
Each player uses a no-regret learning algorithm with good regret against adaptive adversaries: the regret of each player against any fixed strategy converges to zero.
Many simple algorithms achieve no regret: MWU, Regret Matching, Follow the Regularized/Perturbed Leader, Mirror Descent [Freund and Schapire 1995; Foster and Vohra 1997; Hart and Mas-Colell 2000; Cesa-Bianchi and Lugosi 2006].

11-14 Known Convergence Results
The empirical distribution converges to a generalization of Nash equilibrium: coarse correlated equilibrium [Young 2004].
The sum of player utilities (welfare) converges to approximate optimality, under structural conditions on the game [Roughgarden 2010].
The convergence rate is inherited from the adversarial analysis: typically O(1/√T), and O(1/√T) is impossible to improve against adversaries.

15 Main Question
When all players invoke no-regret learning, is convergence faster than the adversarial rate?

16-18 High-Level Main Results
Yes! We prove that if each player's algorithm (1) makes stable predictions and (2) has regret bounded by the stability of the environment, then convergence is faster than 1/√T. Both properties can be achieved by regularization and recency bias.

19 Model and Technical Results

20-27 Repeated Game Model
n players play a normal form game for T time steps.
Each player i has a strategy space S_i (the set of available actions, e.g. bids in an auction, or paths in a network) and a utility function U_i : S_1 × … × S_n → [0,1] (the reward, e.g. utility from winning an item at some price, or minus the latency of the chosen path).
At each step t = 1, …, T, each player i:
- Picks a distribution over strategies p_i^t ∈ Δ(S_i) and draws s_i^t ~ p_i^t.
- Receives expected utility E[U_i(s_1^t, …, s_n^t)].
- Observes the expected-utility vector u_i^t, whose coordinate u_i^t[s_i] is the expected utility had strategy s_i been played: u_i^t[s_i] = E[U_i(s_1^t, …, s_i, …, s_n^t)].
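The protocol above can be sketched in a few lines of code. A minimal sketch under assumptions not in the slides: a hypothetical two-player game with three strategies per player, given by explicit payoff matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a 2-player normal-form game with 3 strategies each.
# U[i][s1, s2] is player i's utility in [0, 1] when (s1, s2) is played.
U = [rng.random((3, 3)), rng.random((3, 3))]

def play_round(p):
    """One step of the protocol; p[i] is player i's mixed strategy p_i^t."""
    # Each player draws a strategy s_i^t from their distribution p_i^t.
    s = [int(rng.choice(3, p=p[i])) for i in range(2)]
    # Expected-utility vectors u_i^t: coordinate s_i is player i's expected
    # utility had s_i been played, averaging over the opponent's mixture.
    u = [U[0] @ p[1],      # player 0's vector over its own strategies
         U[1].T @ p[0]]    # player 1's vector over its own strategies
    return s, u

p = [np.ones(3) / 3, np.ones(3) / 3]   # start from uniform mixtures
s, u = play_round(p)
```

Note that each u_i^t depends only on the opponents' mixed strategies, which is what makes the stability arguments later in the deck possible.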

28-30 Regret of a Learning Algorithm
Regret: the gain from switching to the best fixed strategy, i.e. the expected utility if the best fixed s_i had been played all the time, minus the expected utility of the algorithm:
Regret_i = max_{s_i ∈ S_i} E[Σ_{t=1}^T U_i(s_1^t, …, s_i, …, s_n^t)] − E[Σ_{t=1}^T U_i(s_1^t, …, s_i^t, …, s_n^t)]
Many algorithms achieve regret O(√T).
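Given the per-round expected-utility vectors, this quantity can be computed directly. A small illustrative sketch with hypothetical toy data:

```python
import numpy as np

def regret(utility_vectors, played):
    """Regret of one player: the best fixed strategy's cumulative expected
    utility minus the cumulative expected utility actually obtained.

    utility_vectors: (T, |S_i|) array, row t is u_i^t.
    played:          (T, |S_i|) array, row t is the mixed strategy p_i^t.
    """
    u = np.asarray(utility_vectors)
    p = np.asarray(played)
    best_fixed = u.sum(axis=0).max()   # max_{s_i} sum_t u_i^t[s_i]
    achieved = (u * p).sum()           # sum_t <p_i^t, u_i^t>
    return best_fixed - achieved

# Toy check: always playing the best action yields zero regret.
u = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
p = np.array([[1.0, 0.0]] * 3)
print(regret(u, p))  # 0.0
```

Playing uniformly on the same sequence instead gives regret 3 − 1.5 = 1.5, which grows linearly; no-regret algorithms keep this gap sublinear in T.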

31-36 Main Results
We give general sufficient conditions on the learning algorithms (and concrete example families of algorithms) such that:
Thm 1. Each player's regret is O(T^{1/4}), so the empirical distribution converges to a CCE at rate O(T^{-3/4}).
Thm 2. The sum of player regrets is O(1). Corollary: convergence to approximately optimal welfare at rate O(T^{-1}) if the game is smooth. (Smoothness was defined by [Roughgarden 2010]; it is the main tool for proving welfare bounds (price of anarchy) and applies to e.g. congestion games and ad auctions.)
Thm 3. A generic way to modify an algorithm so that it achieves small regret against such nice algorithms and O(√T) regret against any adversarial sequence.

37-38 Related Work
[Daskalakis, Deckelbaum, Kim 2011]: two-player zero-sum games; marginal empirical distributions converge to Nash at rate O(1/T); unnatural coordinated dynamics.
[Rakhlin, Sridharan 2013]: two-player zero-sum games; Optimistic Mirror Descent and Optimistic Hedge; sum of regrets O(1), so marginal empirical distributions converge to Nash at O(1/T).
This work: general multi-player games vs. two-player zero-sum; convergence to CCE and welfare vs. Nash; sufficient conditions vs. specific algorithms; new example algorithms (e.g. general recency bias, Optimistic FTRL).


40-44 Why Expect Faster Rates?
If your opponent does not change their mixed strategy much, i.e. ‖p_2^t − p_2^{t−1}‖ is small, then your expected utility from each strategy is approximately the same between iterations. The last iteration's utility is a good proxy for the next iteration's utility: ‖u_1^t − u_1^{t−1}‖ is small.

45-46 Simplified Sufficient Conditions for Fast Convergence
1. Stability of mixed strategies: ‖p_i^t − p_i^{t−1}‖ ≤ η·γ
2. Regret bounded by the stability of the utility sequence: Regret_i ≤ α/η + η·β·Σ_{t=1}^T ‖u_i^t − u_i^{t−1}‖²
Then, for η = Θ(T^{−1/4}), each player's regret is O(T^{1/4}).
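To see why η of order T^{−1/4} is the right choice, the two conditions can be chained as follows (a sketch; the number of opponents is absorbed into γ):

```latex
% Each opponent j is stable: \|p_j^t - p_j^{t-1}\| \le \eta\gamma, so player i's
% utility vector moves slowly: \|u_i^t - u_i^{t-1}\| = O(\eta\gamma).
\mathrm{Regret}_i
  \;\le\; \frac{\alpha}{\eta} + \eta\beta \sum_{t=1}^{T} \|u_i^t - u_i^{t-1}\|^2
  \;=\; \frac{\alpha}{\eta} + O\!\left(\beta\gamma^2\, T\, \eta^3\right).
% Choosing \eta = \Theta(T^{-1/4}) balances the two terms:
% \alpha/\eta = O(T^{1/4}) and \beta\gamma^2 T \eta^3 = O(T^{1/4}).
```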

47 Simplified Sufficient Conditions for Fast Convergence
1. Stability of mixed strategies: ‖p_i^t − p_i^{t−1}‖ ≤ A
2. Regret bounded by the stability of the utility sequence:
Regret_i ≤ B + C·Σ_{t=1}^T ‖u_i^t − u_i^{t−1}‖² − D·Σ_{t=1}^T ‖p_i^t − p_i^{t−1}‖²
plus conditions on the constants A, B, C, D.

48-54 Example Algorithm
Hedge [Littlestone-Warmuth 94, Freund-Schapire 97]:
p_i^{t+1}(s_i) ∝ exp(η · Σ_{τ=1}^t u_i^τ[s_i])
(weights actions by past cumulative performance)
Optimistic Hedge [Rakhlin-Sridharan 13]:
p_i^{t+1}(s_i) ∝ exp(η · (Σ_{τ=1}^t u_i^τ[s_i] + u_i^t[s_i]))
(past performance, double-counting the last iteration)
Lemma. Optimistic Hedge satisfies both sufficient conditions.
Intuition: (1) it uses the last iteration as a predictor for the next iteration; (2) it is equivalent to Follow the Regularized Leader, and regularization gives stability.
Corollary. If all players in a game use Optimistic Hedge with step size O(T^{−1/4}), then each player's regret is O(T^{1/4}).
The proof extends to Optimistic Follow the Regularized Leader algorithms:
p_i^{t+1} = argmax_{p ∈ Δ(S_i)} ⟨p, Σ_{τ=1}^t u_i^τ + u_i^t⟩ + R(p)/η
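The two update rules above differ only in the extra copy of the last utility vector. A minimal sketch of both (maximizing utilities, as in the slides):

```python
import numpy as np

def hedge_weights(utils, eta):
    """Hedge: p^{t+1}(s) proportional to exp(eta * sum_{tau<=t} u^tau[s])."""
    scores = eta * np.sum(utils, axis=0)
    w = np.exp(scores - scores.max())  # shift by max for numerical stability
    return w / w.sum()

def optimistic_hedge_weights(utils, eta):
    """Optimistic Hedge: identical, except the last utility vector is counted
    twice, i.e. used as a prediction of the next round's utilities."""
    scores = eta * (np.sum(utils, axis=0) + utils[-1])
    w = np.exp(scores - scores.max())
    return w / w.sum()

# Toy example with two actions, where action 0 has been consistently better.
utils = np.array([[1.0, 0.0], [1.0, 0.0]])
p = hedge_weights(utils, eta=0.5)
q = optimistic_hedge_weights(utils, eta=0.5)
# The optimistic update leans further toward the recently better action.
```

Both updates keep the iterates inside the simplex, and the extra term is exactly the recency bias that the sufficient conditions ask for.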

55 Simulations

56-60 Simulation Example: Auction Game
4 bidders bidding on 4 items. At each iteration each bidder picks one item and submits a bid; the highest bidder on each item wins and pays their bid.
[Figure: maximum player cumulative regret over time. Hedge grows like O(√T); Optimistic Hedge appears flat, O(1)?]
[Figure: expected bid on an item over time. Hedge is oscillatory; Optimistic Hedge is stable.]

61-63 Main Take-Away Points
Learning algorithms can enjoy faster regret rates in game-theoretic environments.
We extend previous work to general multi-player games and provide generic sufficient conditions.
Recency bias and regularization are the key components.
Thank you!

64 Appendix Slides

65 Black-Box Robustness to Any Sequence
What if opponents do not use a stable algorithm? Solution: use an adaptive step size that tracks the change of the environment.
Keep an upper estimate B on the path length I_t = Σ_{τ=1}^t ‖u_i^τ − u_i^{τ−1}‖².
Set the parameter η = min{a/√B, T^{−1/4}} (for a constant a).
Once the path length exceeds the estimate, double B and restart the algorithm.
Theorem. This gives the same fast regret guarantees (up to log factors) when opponents are stable, and O(√T) regret when opponents are arbitrary.
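The doubling scheme can be sketched as follows. This is a hypothetical interface, not the paper's implementation: `get_utility(p)` stands in for the environment, and the constant `a` and the exact form η = min{a/√B, T^{−1/4}} are my reading of the slide, up to the paper's constants.

```python
import numpy as np

def adaptive_optimistic_hedge(T, get_utility, n_actions, a=1.0):
    """Doubling-trick sketch: run Optimistic Hedge, and whenever the observed
    path length exceeds the current budget B, double B, reset the step size,
    and restart the base algorithm."""
    B = 1.0
    eta = min(a / np.sqrt(B), T ** -0.25)
    utils, prev_u, path, played = [], None, 0.0, []
    for t in range(T):
        if utils:
            scores = eta * (np.sum(utils, axis=0) + utils[-1])  # optimistic update
            p = np.exp(scores - scores.max())
            p /= p.sum()
        else:
            p = np.ones(n_actions) / n_actions
        played.append(p)
        u = np.asarray(get_utility(p), dtype=float)
        if prev_u is not None:
            path += float(np.sum((u - prev_u) ** 2))  # running path length I_t
        prev_u = u
        utils.append(u)
        if path > B:            # budget exceeded: double the estimate, restart
            B *= 2.0
            eta = min(a / np.sqrt(B), T ** -0.25)
            utils, path = [], 0.0
    return played

# Against a fixed environment the path length stays 0, no restart happens,
# and play concentrates on the better action.
ps = adaptive_optimistic_hedge(100, lambda p: [1.0, 0.0], n_actions=2)
```

Stable opponents keep the path length small, so B stops doubling early and the fast-rate step size survives; an adversary can force at most O(log) restarts.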

66 Simulation Example: Zero-Sum Game
Matching-pennies-style game, with player 1's payoffs:
      H    T
 H    1    0
 T    0   −1
[Figures: trajectories of mixed strategies, plotting the probability of player 1 playing H against the probability of player 2 playing H, under Hedge dynamics and under Optimistic Hedge dynamics, with the Nash equilibrium marked.]
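The appendix figures can be reproduced with a short simulation. A sketch under assumptions not in the slides: standard matching-pennies payoffs ([[1, −1], [−1, 1]]) rather than the slide's exact matrix, full-information expected utilities, and a slightly asymmetric starting point (the uniform profile is a fixed point of the deterministic dynamics).

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row player's payoff (matching pennies)

def simulate(T, eta, optimistic):
    """Run (Optimistic) Hedge for both players on the zero-sum game A;
    return the trajectory of (P[player 1 plays H], P[player 2 plays H])."""
    p, q = np.array([0.6, 0.4]), np.array([0.5, 0.5])  # asymmetric start
    S1, S2 = np.zeros(2), np.zeros(2)                  # cumulative utilities
    traj = []
    for _ in range(T):
        traj.append((p[0], q[0]))
        u1, u2 = A @ q, -A.T @ p     # expected-utility vectors this round
        S1 += u1
        S2 += u2
        m1 = S1 + u1 if optimistic else S1   # double-count the last round
        m2 = S2 + u2 if optimistic else S2
        p = np.exp(eta * (m1 - m1.max())); p /= p.sum()
        q = np.exp(eta * (m2 - m2.max())); q /= q.sum()
    return np.array(traj)

hedge_traj = simulate(500, 0.1, optimistic=False)
opt_traj = simulate(500, 0.1, optimistic=True)
```

In line with the deck's figures, plotting the two trajectories against each other shows Hedge oscillating around the mixed Nash equilibrium (the uniform profile, for standard matching pennies) while Optimistic Hedge settles near it.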


Over-contribution in discretionary databases

Over-contribution in discretionary databases Over-contribution in discretionary databases Mike Klaas klaas@cs.ubc.ca Faculty of Computer Science University of British Columbia Outline Over-contribution in discretionary databases p.1/1 Outline Social

More information

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday. CS 188: Artificial Intelligence Spring 2011 Lecture 21: Perceptrons 4/13/2010 Announcements Project 4: due Friday. Final Contest: up and running! Project 5 out! Pieter Abbeel UC Berkeley Many slides adapted

More information

Hedonic Clustering Games

Hedonic Clustering Games Hedonic Clustering Games [Extended Abstract] Moran Feldman CS Dept., Technion Haifa, Israel moranfe@cs.technion.ac.il Liane Lewin-Eytan IBM Haifa Research Lab. Haifa, Israel lianel@il.ibm.com Joseph (Seffi)

More information

Algorithms, Games, and Networks February 21, Lecture 12

Algorithms, Games, and Networks February 21, Lecture 12 Algorithms, Games, and Networks February, 03 Lecturer: Ariel Procaccia Lecture Scribe: Sercan Yıldız Overview In this lecture, we introduce the axiomatic approach to social choice theory. In particular,

More information

Network Formation Games and the Potential Function Method

Network Formation Games and the Potential Function Method CHAPTER 19 Network Formation Games and the Potential Function Method Éva Tardos and Tom Wexler Abstract Large computer networks such as the Internet are built, operated, and used by a large number of diverse

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Homework 2: Multi-unit combinatorial auctions (due Nov. 7 before class)

Homework 2: Multi-unit combinatorial auctions (due Nov. 7 before class) CPS 590.1 - Linear and integer programming Homework 2: Multi-unit combinatorial auctions (due Nov. 7 before class) Please read the rules for assignments on the course web page. Contact Vince (conitzer@cs.duke.edu)

More information

Slides credited from Dr. David Silver & Hung-Yi Lee

Slides credited from Dr. David Silver & Hung-Yi Lee Slides credited from Dr. David Silver & Hung-Yi Lee Review Reinforcement Learning 2 Reinforcement Learning RL is a general purpose framework for decision making RL is for an agent with the capacity to

More information

Lecture 9: (Semi-)bandits and experts with linear costs (part I)

Lecture 9: (Semi-)bandits and experts with linear costs (part I) CMSC 858G: Bandits, Experts and Games 11/03/16 Lecture 9: (Semi-)bandits and experts with linear costs (part I) Instructor: Alex Slivkins Scribed by: Amr Sharaf In this lecture, we will study bandit problems

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Bottleneck Routing Games on Grids. Costas Busch Rajgopal Kannan Alfred Samman Department of Computer Science Louisiana State University

Bottleneck Routing Games on Grids. Costas Busch Rajgopal Kannan Alfred Samman Department of Computer Science Louisiana State University Bottleneck Routing Games on Grids ostas Busch Rajgopal Kannan Alfred Samman Department of omputer Science Louisiana State University 1 Talk Outline Introduction Basic Game hannel Game xtensions 2 2-d Grid:

More information

Verified Learning Without Regret

Verified Learning Without Regret Verified Learning Without Regret From Algorithmic Game Theory to Distributed Systems with Mechanized Complexity Guarantees Samuel Merten (B), Alexander Bagnall, and Gordon Stewart Ohio University, Athens,

More information

A Rant on Queues. Van Jacobson. July 26, MIT Lincoln Labs Lexington, MA

A Rant on Queues. Van Jacobson. July 26, MIT Lincoln Labs Lexington, MA A Rant on Queues Van Jacobson July 26, 2006 MIT Lincoln Labs Lexington, MA Unlike the phone system, the Internet supports communication over paths with diverse, time varying, bandwidth. This means we often

More information

Scaling Up Game Theory: Representation and Reasoning with Action Graph Games

Scaling Up Game Theory: Representation and Reasoning with Action Graph Games Scaling Up Game Theory: Representation and Reasoning with Action Graph Games Kevin Leyton-Brown Computer Science University of British Columbia This talk is primarily based on papers with: Albert Xin Jiang

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:

More information

Monte Carlo Sampling for Regret Minimization in Extensive Games

Monte Carlo Sampling for Regret Minimization in Extensive Games Monte Carlo Sampling for Regret Minimization in Extensive Games Marc Lanctot Department of Computing Science University of Alberta Edmonton, Alberta, Canada T6G 2E8 lanctot@ualberta.ca Martin Zinkevich

More information

On the Structure of Weakly Acyclic Games

On the Structure of Weakly Acyclic Games 1 On the Structure of Weakly Acyclic Games Alex Fabrikant (Princeton CS Google Research) Aaron D. Jaggard (DIMACS, Rutgers Colgate University) Michael Schapira (Yale CS & Berkeley CS Princeton CS) 2 Best-

More information

New Results on Simple Stochastic Games

New Results on Simple Stochastic Games New Results on Simple Stochastic Games Decheng Dai 1 and Rong Ge 2 1 Tsinghua University, ddc02@mails.tsinghua.edu.cn 2 Princeton University, rongge@cs.princeton.edu Abstract. We study the problem of solving

More information

A Game-Theoretic Framework for Congestion Control in General Topology Networks

A Game-Theoretic Framework for Congestion Control in General Topology Networks A Game-Theoretic Framework for Congestion Control in General Topology SYS793 Presentation! By:! Computer Science Department! University of Virginia 1 Outline 2 1 Problem and Motivation! Congestion Control

More information

Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient

Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient Abraham D. Flaxman, CMU Math Adam Tauman Kalai, TTI-Chicago H. Brendan McMahan, CMU CS May 25, 2006 Outline Online

More information

The Price of Selfishness in Network Coding

The Price of Selfishness in Network Coding The Price of Selfishness in Network Coding Jason R. Marden and Michelle Effros Abstract We introduce a game theoretic framework for studying a restricted form of network coding in a general wireless network.

More information

Regret Transfer and Parameter Optimization

Regret Transfer and Parameter Optimization Proceedings of the wenty-eighth AAAI Conference on Artificial Intelligence Regret ransfer and Parameter Optimization Noam Brown Robotics Institute Carnegie Mellon University noamb@cs.cmu.edu uomas Sandholm

More information

Network Improvement for Equilibrium Routing

Network Improvement for Equilibrium Routing Network Improvement for Equilibrium Routing UMANG BHASKAR University of Waterloo and KATRINA LIGETT California Institute of Technology Routing games are frequently used to model the behavior of traffic

More information

Do incentives build robustness in BitTorrent?

Do incentives build robustness in BitTorrent? Do incentives build robustness in BitTorrent? ronghui.gu@yale.edu Agenda 2 Introduction BitTorrent Overview Modeling Altruism in BitTorrent Building BitTyrant Evaluation Conclusion Introduction 3 MAIN

More information

Lecture Notes on Congestion Games

Lecture Notes on Congestion Games Lecture Notes on Department of Computer Science RWTH Aachen SS 2005 Definition and Classification Description of n agents share a set of resources E strategy space of player i is S i 2 E latency for resource

More information

EC422 Mathematical Economics 2

EC422 Mathematical Economics 2 EC422 Mathematical Economics 2 Chaiyuth Punyasavatsut Chaiyuth Punyasavatust 1 Course materials and evaluation Texts: Dixit, A.K ; Sydsaeter et al. Grading: 40,30,30. OK or not. Resources: ftp://econ.tu.ac.th/class/archan/c

More information

Game-theoretic Robustness of Many-to-one Networks

Game-theoretic Robustness of Many-to-one Networks Game-theoretic Robustness of Many-to-one Networks Aron Laszka, Dávid Szeszlér 2, and Levente Buttyán Budapest University of Technology and Economics, Department of Telecommunications, Laboratory of Cryptography

More information

Project: A survey and Critique on Algorithmic Mechanism Design

Project: A survey and Critique on Algorithmic Mechanism Design Project: A survey and Critique on Algorithmic Mechanism Design 1 Motivation One can think of many systems where agents act according to their self interest. Therefore, when we are in the process of designing

More information

Solving Large Imperfect Information Games Using CFR +

Solving Large Imperfect Information Games Using CFR + Solving Large Imperfect Information Games Using CFR + Oskari Tammelin ot@iki.fi July 16, 2014 Abstract Counterfactual Regret Minimization and variants (ie. Public Chance Sampling CFR and Pure CFR) have

More information

How Hard Is Inference for Structured Prediction?

How Hard Is Inference for Structured Prediction? How Hard Is Inference for Structured Prediction? Tim Roughgarden (Stanford University) joint work with Amir Globerson (Tel Aviv), David Sontag (NYU), and Cafer Yildirum (NYU) 1 Structured Prediction structured

More information

Using Reinforcement Learning to Optimize Storage Decisions Ravi Khadiwala Cleversafe

Using Reinforcement Learning to Optimize Storage Decisions Ravi Khadiwala Cleversafe Using Reinforcement Learning to Optimize Storage Decisions Ravi Khadiwala Cleversafe Topics What is Reinforcement Learning? Exploration vs. Exploitation The Multi-armed Bandit Optimizing read locations

More information

Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks

Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks Modeling and Performance Analysis of BitTorrent-Like Peer-to-Peer Networks Dongyu Qiu and R. Srikant Coordinated Science Laboratory University of Illinois at Urbana-Champaign CSL, UIUC p.1/22 Introduction

More information

CS 4700: Artificial Intelligence

CS 4700: Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Fall 2017 Instructor: Prof. Haym Hirsh Lecture 8 Today Informed Search (R&N Ch 3,4) Adversarial search (R&N Ch 5) Adversarial Search (R&N Ch 5) Homework

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Convergence Issues in Competitive Games

Convergence Issues in Competitive Games Convergence Issues in Competitive Games Vahab S Mirrokni 1, Adrian Vetta 2 and A Sidoropoulous 3 1 Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT mail: mirrokni@theorylcsmitedu 2

More information

An Optimal Dynamic Pricing Framework for Autonomous Mobile Ad Hoc Networks

An Optimal Dynamic Pricing Framework for Autonomous Mobile Ad Hoc Networks An Optimal Dynamic Pricing Framework for Autonomous Mobile Ad Hoc Networks Z. James Ji, Wei (Anthony) Yu, and K. J. Ray Liu Electrical and Computer Engineering Department and Institute for Systems Research

More information

Geometric Registration for Deformable Shapes 3.3 Advanced Global Matching

Geometric Registration for Deformable Shapes 3.3 Advanced Global Matching Geometric Registration for Deformable Shapes 3.3 Advanced Global Matching Correlated Correspondences [ASP*04] A Complete Registration System [HAW*08] In this session Advanced Global Matching Some practical

More information

Apprenticeship Learning for Reinforcement Learning. with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang

Apprenticeship Learning for Reinforcement Learning. with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang Apprenticeship Learning for Reinforcement Learning with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang Table of Contents Introduction Theory Autonomous helicopter control

More information

Learning Socially Optimal Information Systems from Egoistic Users

Learning Socially Optimal Information Systems from Egoistic Users Learning Socially Optimal Information Systems from Egoistic Users Karthik Raman Thorsten Joachims Department of Computer Science Cornell University, Ithaca NY karthik@cs.cornell.edu www.cs.cornell.edu/

More information

Modelling competition in demand-based optimization models

Modelling competition in demand-based optimization models Modelling competition in demand-based optimization models Stefano Bortolomiol Virginie Lurkin Michel Bierlaire Transport and Mobility Laboratory (TRANSP-OR) École Polytechnique Fédérale de Lausanne Workshop

More information

A PAC Approach to ApplicationSpecific Algorithm Selection. Rishi Gupta Tim Roughgarden Simons, Nov 16, 2016

A PAC Approach to ApplicationSpecific Algorithm Selection. Rishi Gupta Tim Roughgarden Simons, Nov 16, 2016 A PAC Approach to ApplicationSpecific Algorithm Selection Rishi Gupta Tim Roughgarden Simons, Nov 16, 2016 Motivation and Set-up Problem Algorithms Alg 1: Greedy algorithm Alg 2: Linear programming Alg

More information

450 Jon M. Huntsman Hall 3730 Walnut Street, Mobile:

450 Jon M. Huntsman Hall 3730 Walnut Street, Mobile: Karthik Sridharan Contact Information Research Interests 450 Jon M. Huntsman Hall 3730 Walnut Street, Mobile: +1-716-550-2561 Philadelphia, PA 19104 E-mail: karthik.sridharan@gmail.com USA http://ttic.uchicago.edu/

More information

On a Network Generalization of the Minmax Theorem

On a Network Generalization of the Minmax Theorem On a Network Generalization of the Minmax Theorem Constantinos Daskalakis Christos H. Papadimitriou {costis, christos}@cs.berkeley.edu February 10, 2009 Abstract We consider graphical games in which edges

More information

Modeling and Simulating Social Systems with MATLAB

Modeling and Simulating Social Systems with MATLAB Modeling and Simulating Social Systems with MATLAB Lecture 7 Game Theory / Agent-Based Modeling Stefano Balietti, Olivia Woolley, Lloyd Sanders, Dirk Helbing Computational Social Science ETH Zürich 02-11-2015

More information

Finding optimal configurations Adversarial search

Finding optimal configurations Adversarial search CS 171 Introduction to AI Lecture 10 Finding optimal configurations Adversarial search Milos Hauskrecht milos@cs.pitt.edu 39 Sennott Square Announcements Homework assignment is out Due on Thursday next

More information

Dynamics, Non-Cooperation, and Other Algorithmic Challenges in Peer-to-Peer Computing

Dynamics, Non-Cooperation, and Other Algorithmic Challenges in Peer-to-Peer Computing Dynamics, Non-Cooperation, and Other Algorithmic Challenges in Peer-to-Peer Computing Stefan Schmid Distributed Computing Group Oberseminar TU München, Germany December 2007 Networks Neuron Networks DISTRIBUTED

More information

Anarchy is free in network creation

Anarchy is free in network creation Anarchy is free in network creation Ronald Graham Linus Hamilton Ariel Levavi Po-Shen Loh Abstract The Internet has emerged as perhaps the most important network in modern computing, but rather miraculously,

More information

The Fly & Anti-Fly Missile

The Fly & Anti-Fly Missile The Fly & Anti-Fly Missile Rick Tilley Florida State University (USA) rt05c@my.fsu.edu Abstract Linear Regression with Gradient Descent are used in many machine learning applications. The algorithms are

More information

A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions

A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions Nira Dyn School of Mathematical Sciences Tel Aviv University Michael S. Floater Department of Informatics University of

More information

Anarchy is Free in Network Creation

Anarchy is Free in Network Creation Anarchy is Free in Network Creation Ronald Graham 1, Linus Hamilton 2, Ariel Levavi 1, and Po-Shen Loh 2 1 Department of Computer Science and Engineering, University of California, San Diego, La Jolla,

More information

A Network Coloring Game

A Network Coloring Game A Network Coloring Game Kamalika Chaudhuri, Fan Chung 2, and Mohammad Shoaib Jamall 2 Information Theory and Applications Center, UC San Diego kamalika@soe.ucsd.edu 2 Department of Mathematics, UC San

More information

Bottleneck Routing Games on Grids

Bottleneck Routing Games on Grids Bottleneck Routing Games on Grids Costas Busch, Rajgopal Kannan, and Alfred Samman Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA {busch,rkannan,samman}@csc.lsu.edu

More information

Pricing and Forwarding Games for Inter-domain Routing

Pricing and Forwarding Games for Inter-domain Routing Pricing and Forwarding Games for Inter-domain Routing Full version Ameya Hate, Elliot Anshelevich, Koushik Kar, and Michael Usher Department of Computer Science Department of Electrical & Computer Engineering

More information

A Distributed Auction Algorithm for the Assignment Problem

A Distributed Auction Algorithm for the Assignment Problem A Distributed Auction Algorithm for the Assignment Problem Michael M. Zavlanos, Leonid Spesivtsev and George J. Pappas Abstract The assignment problem constitutes one of the fundamental problems in the

More information

On Bounded Rationality in Cyber-Physical Systems Security: Game-Theoretic Analysis with Application to Smart Grid Protection

On Bounded Rationality in Cyber-Physical Systems Security: Game-Theoretic Analysis with Application to Smart Grid Protection On Bounded Rationality in Cyber-Physical Systems Security: Game-Theoretic Analysis with Application to Smart Grid Protection CPSR-SG 2016 CPS Week 2016 April 12, 2016 Vienna, Austria Outline CPS Security

More information

A Graph-Theoretic Network Security Game

A Graph-Theoretic Network Security Game A Graph-Theoretic Network Security Game Marios Mavronicolas Vicky Papadopoulou Anna Philippou Paul Spirakis May 16, 2008 A preliminary version of this work appeared in the Proceedings of the 1st International

More information

Pascal De Beck-Courcelle. Master in Applied Science. Electrical and Computer Engineering

Pascal De Beck-Courcelle. Master in Applied Science. Electrical and Computer Engineering Study of Multiple Multiagent Reinforcement Learning Algorithms in Grid Games by Pascal De Beck-Courcelle A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of

More information

Traveling Salesman Problem (TSP) Input: undirected graph G=(V,E), c: E R + Goal: find a tour (Hamiltonian cycle) of minimum cost

Traveling Salesman Problem (TSP) Input: undirected graph G=(V,E), c: E R + Goal: find a tour (Hamiltonian cycle) of minimum cost Traveling Salesman Problem (TSP) Input: undirected graph G=(V,E), c: E R + Goal: find a tour (Hamiltonian cycle) of minimum cost Traveling Salesman Problem (TSP) Input: undirected graph G=(V,E), c: E R

More information

Learning and inference in the presence of corrupted inputs

Learning and inference in the presence of corrupted inputs JMLR: Workshop and Conference Proceedings vol 40:1 21, 2015 Learning and inference in the presence of corrupted inputs Uriel Feige Weizmann Institute of Science Yishay Mansour Microsoft Research, Hertzelia

More information

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING Francesca Odone and Lorenzo Rosasco odone@disi.unige.it - lrosasco@mit.edu June 6, 2011 ABOUT THIS

More information

Online Learning. Lorenzo Rosasco MIT, L. Rosasco Online Learning

Online Learning. Lorenzo Rosasco MIT, L. Rosasco Online Learning Online Learning Lorenzo Rosasco MIT, 9.520 About this class Goal To introduce theory and algorithms for online learning. Plan Different views on online learning From batch to online least squares Other

More information

Approximation Techniques for Utilitarian Mechanism Design

Approximation Techniques for Utilitarian Mechanism Design Approximation Techniques for Utilitarian Mechanism Design Department of Computer Science RWTH Aachen Germany joint work with Patrick Briest and Piotr Krysta 05/16/2006 1 Introduction to Utilitarian Mechanism

More information

Bargaining and Coalition Formation

Bargaining and Coalition Formation 1 These slides are based largely on Chapter 18, Appendix A of Microeconomic Theory by Mas-Colell, Whinston, and Green. Bargaining and Coalition Formation Dr James Tremewan (james.tremewan@univie.ac.at)

More information

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods

Ensemble Learning. Another approach is to leverage the algorithms we have via ensemble methods Ensemble Learning Ensemble Learning So far we have seen learning algorithms that take a training set and output a classifier What if we want more accuracy than current algorithms afford? Develop new learning

More information

Markov Decision Processes. (Slides from Mausam)

Markov Decision Processes. (Slides from Mausam) Markov Decision Processes (Slides from Mausam) Machine Learning Operations Research Graph Theory Control Theory Markov Decision Process Economics Robotics Artificial Intelligence Neuroscience /Psychology

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department RouteBazaar: An Economic Framework for Flexible Routing Holly Esquivel Chitra Muthukrishnan Feng Niu Shuchi Chawla Aditya Akella Technical Report #654 April 29 RouteBazaar:

More information

Tracking Computer Vision Spring 2018, Lecture 24

Tracking Computer Vision Spring 2018, Lecture 24 Tracking http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 24 Course announcements Homework 6 has been posted and is due on April 20 th. - Any questions about the homework? - How

More information

Failure Localization in the Internet

Failure Localization in the Internet Failure Localization in the Internet Boaz Barak, Sharon Goldberg, David Xiao Princeton University Excerpts of talks presented at Stanford, U Maryland, NYU. Why use Internet path-quality monitoring? Internet:

More information

(67686) Mathematical Foundations of AI July 30, Lecture 11

(67686) Mathematical Foundations of AI July 30, Lecture 11 (67686) Mathematical Foundations of AI July 30, 2008 Lecturer: Ariel D. Procaccia Lecture 11 Scribe: Michael Zuckerman and Na ama Zohary 1 Cooperative Games N = {1,...,n} is the set of players (agents).

More information