Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set
Tianze Shi (Cornell University), Liang Huang (Oregon State University), Lillian Lee (Cornell University)
Short Version
Transition-based dependency parsing has an exponentially large search space, yet O(n³) exact solutions exist.
In practice, however, rich feature sets pushed exact decoding to O(n⁶).
(This work) With bi-LSTMs, we can now decode exactly in O(n³), and we get state-of-the-art results.
Dependency Parsing
Input: She wanted to eat an apple
Output: a dependency tree with arcs root → wanted, nsubj(wanted → She), xcomp(wanted → eat), mark(eat → to), obj(eat → apple), det(apple → an).
Transition-based Dependency Parsing
Search from the initial state through transitions to a terminal state.
Goal: over all transition sequences reaching a terminal state, find the one with max score(sequence), i.e., the highest-scoring parse.
Exact Decoding with Dynamic Programming
Goal: max score(sequence) over all derivations; dynamic programming takes the search from exponential to polynomial (Huang and Sagae, 2010; Kuhlmann, Gómez-Rodríguez and Satta, 2011).
Transition Systems
Arc-standard: DP complexity O(n⁴), 3 action types.
Arc-eager: DP complexity O(n³), 4 action types.
Arc-hybrid: DP complexity O(n³), 3 action types.
All three are treated in our paper; this talk uses arc-hybrid for presentational convenience.
Arc-hybrid Transition System
State: a stack (... s2 s1 s0) and a buffer (b0 b1 ...).
Initial state: stack [ROOT], buffer holds the full sentence (She wanted ...).
Terminal state: stack [ROOT], empty buffer.
(Yamada and Matsumoto, 2003; Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011)
Arc-hybrid Transition System
Transitions:
shift: move b0 from the buffer onto the stack.
reduce-left: pop s0, attaching it to b0.
reduce-right: pop s0, attaching it to s1 (same as arc-standard).
Arc-hybrid Transition System: worked example
Stack [ROOT] | Buffer [She wanted to eat an apple]   (initial)
shift        → [ROOT She] | [wanted to eat an apple]
reduce-left  → [ROOT] | [wanted to eat an apple]   (She attached to wanted)
shift        → [ROOT wanted] | [to eat an apple]
shift        → [ROOT wanted to] | [eat an apple]
Arc-hybrid Transition System: worked example (continued)
[ROOT wanted to] | [eat an apple]
reduce-left  → [ROOT wanted] | [eat an apple]   (to attached to eat)
shift        → [ROOT wanted eat] | [an apple]
shift        → [ROOT wanted eat an] | [apple]
reduce-left  → [ROOT wanted eat] | [apple]   (an attached to apple)
shift        → [ROOT wanted eat apple] | []
Arc-hybrid Transition System: worked example (continued)
[ROOT wanted eat apple] | []
reduce-right → [ROOT wanted eat] | []   (apple attached to eat)
reduce-right → [ROOT wanted] | []   (eat attached to wanted)
reduce-right → [ROOT] | []   (wanted attached to ROOT; terminal)
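The derivation above can be replayed mechanically. A minimal sketch (the simulator and its names are ours, not the paper's implementation):

```python
# Minimal arc-hybrid simulator (illustrative sketch; names are ours).
# shift: move b0 onto the stack.
# reduce-left:  pop s0, attach it to b0.
# reduce-right: pop s0, attach it to s1.

def parse(words, transitions):
    stack = [0]                            # position 0 is ROOT
    buffer = list(range(1, len(words) + 1))
    heads = {}
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "reduce-left":           # head of s0 is the buffer front
            heads[stack.pop()] = buffer[0]
        elif t == "reduce-right":          # head of s0 is the new stack top
            dep = stack.pop()
            heads[dep] = stack[-1]
    return heads

words = ["She", "wanted", "to", "eat", "an", "apple"]
gold = ["shift", "reduce-left",            # She <- wanted
        "shift", "shift", "reduce-left",   # to <- eat
        "shift", "shift", "reduce-left",   # an <- apple
        "shift", "reduce-right",           # eat -> apple
        "reduce-right",                    # wanted -> eat
        "reduce-right"]                    # ROOT -> wanted
heads = parse(words, gold)
```

Replaying the twelve gold transitions recovers exactly the unlabeled tree from the earlier example.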
Dynamic Programming for Arc-hybrid
Positions: 0 (ROOT), 1 (She), 2 (wanted), ..., 6 (apple); n = sentence length.
Deduction item [i, j]: i is the stack top, j is the buffer front.
Goal item: [0, n + 1].
Dynamic Programming for Arc-hybrid
shift: [i, j] ⇒ [j, j + 1]   (push j; the new buffer front is j + 1)
Dynamic Programming for Arc-hybrid
reduce from [k, j] pops k, but what is the new stack top? The item alone does not say: [k, j] ⇒ [?, j].
Dynamic Programming for Arc-hybrid
Combining with the item that ends where [k, j] begins resolves it:
reduce: [i, k], [k, j] ⇒ [i, j]   (pop k; the stack top below k is i)
Dynamic Programming for Arc-hybrid
In Kuhlmann, Gómez-Rodríguez and Satta (2011)'s notation:
reduce: [i, k], [k, j] ⇒ [i, j]
Dynamic Programming for Arc-hybrid
shift: [i, j] ⇒ [j, j + 1]
reduce: [i, k], [k, j] ⇒ [i, j]
Goal: [0, n + 1]
Three free indices (i, k, j) in the reduce rule ⇒ O(n³).
Time Complexity in Practice
Complexity depends on the feature representation!
Typical feature representation: feature templates look at specific positions in the stack and in the buffer.
Time Complexity in Practice
Compare the following feature sets for scoring shift on [i, j] ⇒ [j, j + 1]:
{s0, b0} gives s_sh(i, j): no indices beyond those already in the items.
{s1, s0, b0} gives s_sh(k, i, j): an extra free index k.
The time complexities are different!
Time Complexity in Practice
Best-known complexity in practice: O(n⁶) (Huang and Sagae, 2010), with feature templates over s2, s1, s0, b0, b1 and the leftmost/rightmost children s1.lc, s1.rc, s0.lc, s0.rc.
Best-known Time Complexities
O(n³) theoretical vs. O(n⁶) practical. Gap: the feature representation.
In Practice, Instead of Exact Decoding
Greedy search (Nivre, 2003, 2004, 2008; Chen and Manning, 2014)
Beam search (Zhang and Clark, 2011; Weiss et al., 2015)
Best-first search (Sagae and Lavie, 2006; Sagae and Tsujii, 2007; Zhao et al., 2013)
Dynamic oracles (Goldberg and Nivre, 2012, 2013)
Global normalization on the beam (Zhou et al., 2015; Andor et al., 2016)
Reinforcement learning (Lê and Fokkens, 2017)
Learning to search (Daumé III and Marcu, 2005; Chang et al., 2016; Wiseman and Rush, 2016)
How Many Positional Features Do We Need?
Non-neural, manually engineered models, e.g., Chen and Manning (2014): positions s2, s1, s0, b0, b1, b2, their leftmost/rightmost children (s1.lc, s1.rc, s0.lc, s0.rc), and grandchildren (s1.lc.lc, s1.rc.rc, s0.lc.lc, s0.rc.rc).
How Many Positional Features Do We Need?
Non-neural, manual engineering: Chen and Manning (2014).
Stack LSTM (Dyer et al., 2015): the entire stack is summarized into a single vector σ.
Bi-LSTM feature extractors (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016): a few positional vectors, e.g., s2, s1, s0.
How Many Positional Features Do We Need?
Bi-LSTMs give compact feature representations (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016).
Features used in Kiperwasser and Goldberg (2016): {s2, s1, s0, b0}.
Features used in Cross and Huang (2016): {s1, s0, b0}.
How Many Positional Features Do We Need?
Non-neural, manual engineering (Chen and Manning, 2014): enables slow DP.
Stack LSTM (Dyer et al., 2015): summarizes trees on the stack, so DP stays exponential.
Bi-LSTM (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016): summarizes the input, enabling fast DP.
Model Architecture
Word embeddings + POS embeddings → bi-directional LSTM → positional feature vectors (s2, s1, s0, b0, progressively reduced in our ablation down to just b0) → multi-layer perceptron → scores s_sh and s_re for the transitions.
Example input: She wanted to eat an apple.
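The pipeline can be sketched end to end in a few lines. This is an illustrative toy (made-up dimensions, random weights, and a plain tanh RNN standing in for the LSTM), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 8, 6                        # vocab, embedding, RNN sizes (made up)

emb = rng.normal(size=(V, D))             # word (+POS) embedding table

def rnn(xs, W, U):
    """Plain tanh RNN standing in for one LSTM direction."""
    h, out = np.zeros(H), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        out.append(h)
    return out

Wf, Uf = rng.normal(size=(H, D)), rng.normal(size=(H, H))
Wb, Ub = rng.normal(size=(H, D)), rng.normal(size=(H, H))

def encode(word_ids):
    """One forward + one backward pass; concatenate per position."""
    xs = [emb[i] for i in word_ids]
    fwd = rnn(xs, Wf, Uf)
    bwd = rnn(xs[::-1], Wb, Ub)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# MLP over the two positional features {s0, b0} -> 3 transition scores
W1 = rng.normal(size=(16, 4 * H))
W2 = rng.normal(size=(3, 16))

def score(vecs, s0, b0):
    feat = np.concatenate([vecs[s0], vecs[b0]])
    return W2 @ np.tanh(W1 @ feat)        # shift, reduce-left, reduce-right

vecs = encode([1, 2, 3, 4, 5, 6])         # toy ids for "She wanted to eat an apple"
s = score(vecs, 0, 1)
```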
How Many Positional Features Do We Need?
We answer the question empirically (greedy decoding, UAS on PTB dev):
{s2, s1, s0, b0}: 94.08 ±0.13   (considered in prior work)
{s1, s0, b0}: 94.08 ±0.05   (considered in prior work)
{s0, b0}: 94.03 ±0.12
{b0}: 52.39 ±0.23
Two positional feature vectors are enough!
How Many Positional Features Do We Need?
Our minimal feature set: {s0, b0}.
Counter-intuitive (reduce-right attaches s0 to s1, which the features never see), but it works for greedy decoding.
How Many Positional Features Do We Need?
Our minimal feature set: {s0, b0}. Counter-intuitive, but it works for greedy decoding.
The bare deduction items [i, j] already contain enough information (s0 = i, b0 = j) to extract features for DP.
This leads to the first O(n³) implementation of global decoders!
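As an illustration of how little state the decoder needs, here is a sketch of a greedy arc-hybrid parser whose scoring function sees only the positions of s0 and b0. All names are ours, and the stand-in scorer below is arbitrary; in the real model the scores come from the bi-LSTM + MLP:

```python
import random

def greedy_parse(n, score):
    """Greedy arc-hybrid parsing; score(action, s0, b0) -> float.

    Positions: 0 = ROOT, 1..n = words; b0 = n + 1 encodes an empty buffer."""
    stack, buf = [0], list(range(1, n + 1))
    heads = {}
    while len(stack) > 1 or buf:
        s0, b0 = stack[-1], (buf[0] if buf else n + 1)
        valid = []
        if buf:
            valid.append("shift")
            if len(stack) > 1:
                valid.append("reduce-left")    # pop s0, head is b0
        if len(stack) > 1:
            valid.append("reduce-right")       # pop s0, head is s1
        act = max(valid, key=lambda a: score(a, s0, b0))
        if act == "shift":
            stack.append(buf.pop(0))
        elif act == "reduce-left":
            heads[stack.pop()] = b0
        else:
            heads[stack.pop()] = stack[-1]
    return heads

rnd = random.Random(0)
cache = {}
def toy_score(a, s0, b0):                      # arbitrary deterministic scores
    return cache.setdefault((a, s0, b0), rnd.random())

heads = greedy_parse(6, toy_score)
```

Whatever the scorer returns, every word ends up with exactly one head and the stack drains back to [ROOT], so the loop always terminates in at most 2n + 1 steps.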
How Many Positional Features Do We Need?
Non-neural, manual engineering (Chen and Manning, 2014): enables slow DP.
Stack LSTM (Dyer et al., 2015): summarizes trees on the stack; DP stays exponential.
Bi-LSTM (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016): summarizes the input; enables fast DP.
Our work: summarizes the input with a minimal feature set; enables fast(er) DP.
Best-known Time Complexities (recap)
O(n³) theoretical vs. O(n⁶) practical. Gap: the feature representation.
Our Contribution
With the minimal feature set, practical complexity drops from O(n⁶) to O(n³), matching the theoretical bound.
Decoding
Each item carries the score v of the best sub-sequence deriving it:
shift: [i, j] : v ⇒ [j, j + 1] : 0   (shift scores are added at combination time)
Decoding
reduce: [i, k] : v1, [k, j] : v2 ⇒ [i, j] : v1 + v2 + Δ, where Δ = s_sh(i, k) + s_re(k, j).
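These two rules translate directly into a CKY-style O(n³) decoder. A sketch in our own notation (s_re is split into left and right variants since the two reduces assign different heads; the toy scores below are fabricated just to make the optimum obvious):

```python
# CKY-style exact decoder for arc-hybrid, following the deduction rules above.
# Positions: 0 = ROOT, 1..n = words, n + 1 = end marker.
# s_sh(i, k): score of shifting k with stack top i (features s0 = i, b0 = k).
# s_re_*(k, j): score of reducing k with buffer front j.

def exact_decode(n, s_sh, s_re_left, s_re_right):
    best = {(j, j + 1): 0.0 for j in range(n + 1)}   # axiom items
    back = {}
    for length in range(2, n + 2):
        for i in range(0, n + 2 - length):
            j = i + length
            best[(i, j)] = float("-inf")
            for k in range(i + 1, j):
                base = best[(i, k)] + best[(k, j)] + s_sh(i, k)
                cands = [(base + s_re_right(k, j), "right")]  # head of k is i
                if j <= n:                                    # head of k is j
                    cands.append((base + s_re_left(k, j), "left"))
                for sc, act in cands:
                    if sc > best[(i, j)]:
                        best[(i, j)], back[(i, j)] = sc, (k, act)
    heads = {}
    def recover(i, j):
        if j > i + 1:
            k, act = back[(i, j)]
            heads[k] = j if act == "left" else i
            recover(i, k)
            recover(k, j)
    recover(0, n + 1)
    return best[(0, n + 1)], heads

# Toy scores (n = 2) rewarding the tree ROOT -> 2, 2 -> 1.
sh = lambda i, k: 0.0
rel = lambda k, j: 5.0 if (k, j) == (1, 2) else 0.0
rer = lambda k, j: {(1, 2): 1.0, (2, 3): 2.0, (1, 3): 4.0}.get((k, j), 0.0)
score, heads = exact_decode(2, sh, rel, rer)
```

Backpointers recover the arcs: a left reduce of k gives k the head j, a right reduce gives it the head i.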
Training
Separate incorrect parses from the correct one by a margin:
max over trees y of [score(y) + cost(y)] vs. score(gold), i.e., cost-augmented decoding (Taskar et al., 2005).
In the DP, the cost folds into the reduce rule:
[i, k] : v1, [k, j] : v2 ⇒ [i, j] : v1 + v2 + s_sh(i, k) + s_re(k, j) + 1[assigned head of k ≠ gold head of k]
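Folding the cost into the combination step is a one-line change. A sketch (names ours) of the cost-augmented Δ for combining [i, k] and [k, j]:

```python
def augmented_delta(i, k, j, action, s_sh, s_re, gold_head):
    """Delta for a reduce during cost-augmented decoding.

    action "left" attaches k to j, "right" attaches k to i; the cost term
    adds 1 whenever the proposed head of k disagrees with the gold tree,
    so high-scoring *wrong* parses get boosted and the margin is enforced
    against exactly those."""
    head = j if action == "left" else i
    cost = 0.0 if head == gold_head[k] else 1.0
    return s_sh(i, k) + s_re[action](k, j) + cost

# With zero model scores, only the cost term remains:
zero = lambda *args: 0.0
s_re = {"left": zero, "right": zero}
gold_head = {1: 2, 2: 0}
d_correct = augmented_delta(0, 1, 2, "left", zero, s_re, gold_head)   # head 2: correct
d_wrong = augmented_delta(0, 1, 2, "right", zero, s_re, gold_head)    # head 0: wrong
```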
Comparing with State-of-the-art
(Scatter plot of English PTB UAS, 93.0 to 96.0, against Chinese CTB UAS, 86.0 to 90.5; points for prior local and global parsers BGDS16, DBLMS15, KG16a, KG16b, CH16, KBKDS16, CFHGD16, WC16, DM17, plus our best local model, our global arc-eager DP and arc-hybrid DP, and ensembles of our models. Legend: Local, Global, Our Local, Our Global, Ensemble.)
Our global arc-eager and arc-hybrid DP models outperform our best local model, and ensembles of our models perform best.
Results: CoNLL 2017 Shared Task
Macro-average LAS over 81 treebanks in 49 languages; 2nd-highest overall performance.
Ensemble: 75.00; exact arc-eager: 74.32; exact arc-hybrid: 74.00; graph-based: 73.75.
(Shi, Wu, Chen and Cheng, 2017; Zeman et al., 2017)
Conclusion
The bi-LSTM-derived feature set is minimal yet highly effective.
First O(n³) implementation of exact decoders.
Global training and decoding give high performance.
More in Our Paper
Description and analysis of three transition systems (arc-standard, arc-hybrid, arc-eager).
CKY-style representations of the deduction systems.
Theoretical analysis of the global methods: arc-eager models can simulate arc-hybrid models, and arc-eager models can simulate edge-factored models.
Fast(er) Exact Decoding and Global Training for Transition-Based Dependency Parsing via a Minimal Feature Set
Code: https://github.com/tzshi/dp-parser-emnlp17
Tianze Shi (Cornell University), Liang Huang (Oregon State University), Lillian Lee (Cornell University)
CKY-style Visualization
Results with Arc-eager and Arc-standard