Backpropagating through Structured Argmax using a SPIGOT

1 Backpropagating through Structured Argmax using a SPIGOT Hao Peng, Sam Thomson, Noah A. July 17, 2018

2 Overview arg max Parser Downstream task Loss L

3 Overview arg max Parser Downstream task Head token Yang and Mitchell, 2017 Tree-RNN Tai et al., 2015 Graph CNN Kipf and Welling, 2017 Loss L

4 Overview arg max Parser A layer in the computation graph? Downstream task Loss L

5 Overview Non-differentiable arg max Parser A layer in the computation graph? Downstream task Loss L

6 Overview Aim: structured prediction as a layer. Motivation: structures help (Ji and Smith, 2017; Oepen et al., 2017); linguistic structures may not be universally optimal (Williams, 2017). Diagram: arg max, intermediate parser, downstream task, loss $L$, $\nabla L$?

7 Overview Aim: structured prediction as a layer. Motivation: structures help (Ji and Smith, 2017; Oepen et al., 2017); linguistic structures may not be universally optimal (Williams, 2017). Challenges: argmax is non-differentiable. Diagram: arg max, intermediate parser, downstream task, loss $L$, $\nabla L$?

8 Overview Aim: structured prediction as a layer. Motivation: structures help (Ji and Smith, 2017; Oepen et al., 2017); linguistic structures may not be universally optimal (Williams, 2017). Challenges: argmax is non-differentiable. Method: a proxy, SPIGOT (Structured Projection of Intermediate Gradients Optimization Technique). Diagram: arg max, intermediate parser, downstream task, loss $L$, $\nabla L$?

9 Outline Background: structured prediction as linear programs Method: SPIGOT algorithm Experiments

10 Structured Prediction Reviewed Input Output

11 Structured Prediction Reviewed Input Score: $S(\text{tree}) = \sum_{\text{arcs}} s(\text{head} \rightarrow \text{mod})$

12 Structured Prediction Reviewed Input Score: $s = [s(\cdot), s(\cdot), s(\cdot), \ldots, s(\cdot)]^\top$, $z = [1?, 0?, 1?, \ldots, 0?]^\top$ Output: $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree

13 Linear Programming Formulation $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, i.e. $Az \le b$, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$ (Roth and Yih, 2004; Martins et al., 2009)

14 Linear Programming Formulation $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, i.e. $Az \le b$, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Relaxation: $z_i \in \{0, 1\}$ becomes $z_i \in [0, 1]$ (Roth and Yih, 2004; Martins et al., 2009)
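As a rough illustration of the structured argmax on these slides, the sketch below enumerates every head assignment for a three-word sentence and keeps the highest-scoring one that forms a tree. The arc scores, the helper names `is_tree` and `structured_argmax`, and the simplified notion of "tree" (every word reaches the root, multiple root children allowed, no projectivity requirement) are assumptions made for this example; the talk itself uses dedicated decoders rather than brute force.

```python
from itertools import product

def is_tree(heads):
    """heads[m - 1] is the head of word m (0 is the root).
    Returns True if every word reaches the root without a cycle."""
    for m in range(1, len(heads) + 1):
        seen = set()
        node = m
        while node != 0:
            if node in seen:
                return False  # found a cycle
            seen.add(node)
            node = heads[node - 1]
    return True

def structured_argmax(scores, n):
    """scores[(h, m)] is the score of arc h -> m.
    Returns the best head tuple, the binary vector z_hat, and the arc order."""
    arcs = sorted(scores)
    candidates = [[h for h in range(n + 1) if h != m] for m in range(1, n + 1)]
    best_heads, best_total = None, float("-inf")
    for heads in product(*candidates):
        if not is_tree(heads):
            continue  # violates the "z forms a tree" constraint
        total = sum(scores[(h, m)] for m, h in enumerate(heads, start=1))
        if total > best_total:
            best_heads, best_total = heads, total
    chosen = {(h, m) for m, h in enumerate(best_heads, start=1)}
    z_hat = [1 if arc in chosen else 0 for arc in arcs]
    return best_heads, z_hat, arcs

# Hypothetical arc scores for a 3-word sentence (0 is the root).
scores = {
    (0, 1): 1.0, (2, 1): 3.0, (3, 1): 0.5,
    (0, 2): 2.0, (1, 2): 3.5, (3, 2): 0.1,
    (0, 3): 0.2, (1, 3): 0.3, (2, 3): 2.5,
}
best_heads, z_hat, arcs = structured_argmax(scores, 3)
print(best_heads, z_hat)
```

In this toy input, picking the best head for each word independently would pair words 1 and 2 in a cycle, so the constraint changes the answer. The output `z_hat` is a 0/1 vector, which is why the map from `s` to `z_hat` is piecewise constant.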

15 Outline Background: structured prediction as linear programs Method: SPIGOT algorithm Experiments

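16 Backprop $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), \ldots, s(\cdot)]^\top$. Diagram labels: $\nabla L$, $\hat{z}$, downstream task, loss $L$.

17 Backprop $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), \ldots, s(\cdot)]^\top$. Diagram labels: $\nabla L$, $\hat{z}$, $\nabla_{\hat{z}} L$ (backprop from the downstream task), downstream task, loss $L$.

18 Backprop $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), \ldots, s(\cdot)]^\top$. Diagram labels: $\nabla L$ (backprop from $\nabla_s L$), $\nabla_s L$, $\hat{z}$, $\nabla_{\hat{z}} L$ (backprop from the downstream task), downstream task, loss $L$.

19 Backprop $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), \ldots, s(\cdot)]^\top$. Diagram labels: $\nabla L$ (backprop from $\nabla_s L$), $\nabla_s L$ (proxy), $\hat{z}$, $\nabla_{\hat{z}} L$ (backprop from the downstream task), downstream task, loss $L$.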

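20 Backprop We have: $\nabla_{\hat{z}} L$. We need: $\nabla_s L$.

21 Backprop We have: $\nabla_{\hat{z}} L$. We need: $\nabla_s L$. Chain rule (Leibniz, 1676): $\nabla_s L = J \nabla_{\hat{z}} L$

22 Backprop We have: $\nabla_{\hat{z}} L$. We need: $\nabla_s L$. Chain rule (Leibniz, 1676): $\nabla_s L = J \nabla_{\hat{z}} L$. With $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, the Jacobian $J$ is not defined.

23 Backprop We have: $\nabla_{\hat{z}} L$. We need: $\nabla_s L$. Chain rule (Leibniz, 1676): $\nabla_s L = J \nabla_{\hat{z}} L$. Straight-through Estimator (STE; Hinton, 2012; Bengio et al., 2013): $\nabla_s L \triangleq \nabla_{\hat{z}} L$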
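To make the Jacobian problem concrete, here is a minimal sketch for the simplest possible "structure", a one-of-k choice (an assumption made for illustration; the slides concern trees). A finite-difference Jacobian of the argmax comes out as all zeros away from ties, so the chain rule passes no signal, while the straight-through estimator simply copies $\nabla_{\hat{z}} L$. The function names are hypothetical.

```python
def one_hot_argmax(s):
    """Return the one-hot vector of the highest-scoring entry of s."""
    best = max(range(len(s)), key=lambda i: s[i])
    return [1.0 if i == best else 0.0 for i in range(len(s))]

def finite_difference_jacobian(f, s, eps=1e-4):
    """Approximate jac[j][i] = d f(s)_i / d s_j with forward differences."""
    base = f(s)
    jac = []
    for j in range(len(s)):
        bumped = list(s)
        bumped[j] += eps
        out = f(bumped)
        jac.append([(o - b) / eps for o, b in zip(out, base)])
    return jac

def ste_grad(grad_z):
    """Straight-through estimator: treat the argmax as the identity map
    in the backward pass, so grad_s is a copy of grad_z."""
    return list(grad_z)

s = [2.0, 1.0, 0.5]            # hypothetical scores
grad_z = [-0.3, 0.5, 0.4]      # hypothetical downstream gradient
jac = finite_difference_jacobian(one_hot_argmax, s)
grad_s_chain = [sum(jac[j][i] * grad_z[i] for i in range(len(s)))
                for j in range(len(s))]
grad_s_ste = ste_grad(grad_z)
print(grad_s_chain, grad_s_ste)
```

The chain-rule gradient vanishes because small score changes do not alter the argmax; STE sidesteps this, but it ignores the constraint set entirely.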

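24 Some Geometry Straight-through Estimator (STE): $\nabla_s L \triangleq \nabla_{\hat{z}} L$. Polytope $Az \le b$; $\hat{z} = [1, 0, 1, \ldots, 0]^\top$

25 Some Geometry Straight-through Estimator (STE): $\nabla_s L \triangleq \nabla_{\hat{z}} L$. Polytope $Az \le b$; $\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$; $\hat{z} = [1, 0, 1, \ldots, 0]^\top$

26 Some Geometry Straight-through Estimator (STE): $\nabla_s L \triangleq \nabla_{\hat{z}} L$; $p = \hat{z} - \nabla_{\hat{z}} L$. Polytope $Az \le b$; $\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$; $\hat{z} = [1, 0, 1, \ldots, 0]^\top$

27 Some Geometry SPIGOT: $p = \hat{z} - \nabla_{\hat{z}} L$, $q$. Polytope $Az \le b$; $\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$; $\hat{z} = [1, 0, 1, \ldots, 0]^\top$

28 Some Geometry SPIGOT: $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_s L \triangleq \hat{z} - q$. Polytope $Az \le b$; $\nabla_{\hat{z}} L = [0.3, 0.5, 0.4, \ldots, 0.2]$; $\hat{z} = [1, 0, 1, \ldots, 0]^\top$

29 Some Geometry SPIGOT [Figure: two cases, each showing $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and the resulting $\nabla_s L$]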

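30 Algorithm Input Parser $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Output: $\hat{z}$

31 Algorithm Input Parser $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Output: $\hat{z}$ Downstream task Loss $L$

32 Algorithm Input Parser $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Output: $\hat{z}$ Downstream task Loss $L$ Backprop: $\nabla_{\hat{z}} L$

33 Algorithm Input Parser $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Downstream task Loss $L$ Backprop: $\nabla_{\hat{z}} L$. Project onto the polytope: $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_s L \triangleq \hat{z} - q$

34 Algorithm Input Parser $\hat{z} = \arg\max_{z} z^\top s$ s.t. $z$ forms a tree, where $s = [s(\cdot), s(\cdot), s(\cdot), s(\cdot)]^\top$. Downstream task Loss $L$ Backprop: $\nabla_{\hat{z}} L$. Project onto the polytope: $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_s L \triangleq \hat{z} - q$. Backprop $\nabla_s L$ into the parser to obtain $\nabla L$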
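The three steps on the algorithm slides (step, project, subtract) can be sketched as follows. To keep the example short, the feasible set is assumed to be the probability simplex (a one-of-k choice) rather than the tree polytope from the talk, because its Euclidean projection has a simple sort-based algorithm. The step size `eta` and the function names are also assumptions of this sketch.

```python
def project_onto_simplex(p):
    """Euclidean projection of p onto {q : q_i >= 0, sum(q) = 1},
    using the standard sort-and-threshold method."""
    u = sorted(p, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that satisfies the condition wins
    return [max(x - theta, 0.0) for x in p]

def spigot_grad(z_hat, grad_z, eta=1.0):
    """SPIGOT-style proxy gradient with respect to the scores s.
    p = z_hat - eta * grad_z; q = proj(p); grad_s = z_hat - q."""
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = project_onto_simplex(p)
    return [z - q_i for z, q_i in zip(z_hat, q)]

z_hat = [1.0, 0.0, 0.0]  # hypothetical hard decision

# Case 1: the step stays inside the simplex, so the result matches STE.
grad_inside = spigot_grad(z_hat, [0.5, -0.5, 0.0])

# Case 2: the step points out of the simplex at the vertex z_hat,
# so the projection returns z_hat and the proxy gradient is zero.
grad_outside = spigot_grad(z_hat, [-0.3, 0.5, 0.4])

print(grad_inside, grad_outside)
```

The second case shows the difference from STE: when the downstream gradient asks for something infeasible (more than a full unit of the already-selected item), the projection discards that part of the update instead of passing it to the parser.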

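35 Connections to Related Work SPIGOT vs. STE [Figure: for each method, $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and $\nabla_s L$]. [Table: Pipeline, STE, Structured Att., and SPIGOT compared on hard decision on $\hat{z}$, backprop, marginal, projection]. Structured Attention: Kim et al., 2017

36 Connections to Related Work SPIGOT vs. Structured Attention. SPIGOT: $\hat{z} = \arg\max(\ldots)$, with $\hat{z} - \nabla_{\hat{z}} L$ and $\nabla_s L$. Structured Attention: $\hat{z} = \mathrm{softmax}(\ldots)$. [Table: Pipeline, STE, Structured Att., and SPIGOT compared on hard decision on $\hat{z}$, backprop, marginal, projection]. Structured Attention: Kim et al., 2017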
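For contrast with the hard argmax, the sketch below shows the soft alternative in the one-of-k case, where marginals reduce to a softmax. This is an assumed simplification: structured attention as cited on the slide computes marginals over structures, not a plain softmax. The forward output is dense rather than a hard decision, and the backward pass uses the exact softmax Jacobian $\mathrm{diag}(\hat{z}) - \hat{z}\hat{z}^\top$.

```python
import math

def softmax(s):
    """Numerically stable softmax over a list of scores."""
    m = max(s)
    exps = [math.exp(x - m) for x in s]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(probs, grad_z):
    """Exact gradient w.r.t. the scores: (diag(p) - p p^T) grad_z."""
    dot = sum(p * g for p, g in zip(probs, grad_z))
    return [p * (g - dot) for p, g in zip(probs, grad_z)]

s = [2.0, 1.0, 0.5]              # hypothetical scores
grad_z = [-0.3, 0.5, 0.4]        # hypothetical downstream gradient
z_soft = softmax(s)              # soft "decision": every entry is nonzero
grad_s = softmax_backward(z_soft, grad_z)
print(z_soft, grad_s)
```

The gradient is well defined here, but the downstream task sees a weighted mixture instead of a single structure, which is the trade-off the comparison on these slides draws.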

37 Applications Joint learning (Swayamdipta et al., 2016). Diagram: training data, arg max, parser, loss $L_1$, $\nabla L_1$

38 Applications Joint learning (Swayamdipta et al., 2016). Diagram: training data, arg max, parser, loss $L_1$, $\nabla L_1$, $\nabla L_2$; downstream task, loss $L_2$, $\nabla L_2$

39 Applications Joint learning (Swayamdipta et al., 2016). Diagram: training data, arg max, parser, loss $L_1$, $\nabla L_1$, $\nabla L_2$; downstream task, loss $L_2$, $\nabla L_2$. Induce latent structures (Yogatama et al., 2017; Williams et al., 2017). Diagram: training data, arg max, parser, $\nabla L$; downstream task, loss $L$, $\nabla L$

40 Outline Background: structured prediction as linear programs Method: SPIGOT algorithm Experiments

41 Experiments: Syntactic-then-semantic Parsing Input arg max Syntactic Parser Syntactic tree Semantic graph arg1 Semantic Parser arg2 poss

42 Experiments: Syntactic-then-semantic Parsing Input Eisner Algorithm Eisner, 1996 arg max Syntactic Parser BiLSTM + MLP Kiperwasser and Goldberg, 2016 Syntactic tree Semantic graph arg1 Semantic Parser arg2 poss

43 Experiments: Syntactic-then-semantic Parsing Input Eisner Algorithm Eisner, 1996 arg max Syntactic Parser BiLSTM + MLP Kiperwasser and Goldberg, 2016 Syntactic tree root NeurboParser Peng et al., 2017 Concat head token embedding Semantic graph arg1 Semantic Parser arg2 poss

44 SemEval 15. Micro-averaged labeled F1, in-domain and out-of-domain. [Bar chart comparing Neurbo, Pipeline, STE, Structured Att., and SPIGOT; the accompanying table lists syntax, backprop, hard decision on $\hat{z}$, and projection for each system.] Neurbo: Peng et al., 2017

48 Semantic Parsing for Sentiment Classification Input Semantic graph arg max arg1 Semantic Parser arg2 poss Classifier Positive? Negative?

49 Semantic Parsing for Sentiment Classification Input arg max via AD$^3$ (Martins et al., 2011) Semantic Parser: NeurboParser (Peng et al., 2017), BiLSTM + MLP Semantic graph with arcs :arg1, :arg2, :poss Concat head token and role Classifier Positive? Negative?

50 Stanford Sentiment Treebank accuracy [Bar chart: accuracy for BiLSTM, Pipeline, STE, and SPIGOT]

51 Conclusion Problem

52 Conclusion Problem Method SPIGOT

53 Conclusion Problem Method Results SPIGOT

54 Thank you!
