Agenda for today
- Homework questions, issues?
- Non-projective dependencies
- Spanning tree algorithm for non-projective parsing
Projective vs. non-projective dependencies
- If we extract dependencies from trees, they have certain properties:
  - Each word has one governor that it points to
  - Each word governs some span
  - A word's governor cannot fall within the span that word governs
  - When we project to the string, no crossing dependencies result
- Other languages do not fit nicely into constituent structure
  - e.g., free word order languages such as Czech
  - These languages have a higher frequency of non-projective dependencies
Schematic from Hall and Novák (2005)
[Figure 1: Examples of projective and non-projective trees over three words a, b, c. The trees on the left and center are both projective; the tree on the right is non-projective.]
A tree is projective if, for every three nodes w_a, w_b, and w_c where a < b < c:
- if w_a is governed by w_c, then w_b is transitively governed by w_c; or
- if w_c is governed by w_a, then w_b is transitively governed by w_a
Keith Hall and Václav Novák. 2005. Corrective Modeling for Non-Projective Dependency Parsing. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT).
Example from Nivre (2006)
[Figure 1: Dependency graph for a Czech sentence from the Prague Dependency Treebank: "Z nich je jen jedna na kvalitu ." (gloss: out-of them is only one-fem-sg to quality), "Only one of them concerns quality." Arcs are labeled AuxP, Sb, Pred, AuxZ, Adv, AuxK; the AuxP arc from jedna (5) to Z (1) crosses other arcs.]
Nivre, J. (2006). Constraints on Non-Projective Dependency Parsing. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Degree of non-projectivity
- For a dependency graph G, let G(i, j) be the graph over the substring S[i, j]
- Let e be an arc between words i and j, i < j
- The degree of e is the number of weakly connected subgraphs in G(i+1, j−1) that are not dominated by the head of e
- The degree of the graph is the maximum degree of any edge in the graph
- In the graph on the previous slide, the (1,5) arc is non-projective
  - G(2, 4) contains three connected components of one word each
  - Words 2 and 4 are dominated by word 5 in the graph
  - Word 3 is not, hence the arc has degree 1
- When searching non-projective structures, we can restrict the maximum degree
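The degree computation above can be sketched in Python. This is not code from the lecture: the tree below hand-encodes my reading of the Czech example on the previous slide (word 5, jedna, heads the non-projective arc to word 1), and a component counts as "dominated" when all of its words are transitive dependents of the arc's head.

```python
def arc_degree(heads, dep):
    """Degree of the arc from heads[dep] to dep.

    heads maps each word position (1-based) to its governor (0 = ROOT).
    """
    h = heads[dep]
    inside = list(range(min(h, dep) + 1, max(h, dep)))
    if not inside:
        return 0

    # Union-find over arcs lying entirely inside the interval,
    # giving the weakly connected subgraphs of G(i+1, j-1).
    parent = {w: w for w in inside}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w in inside:
        if heads[w] in parent:
            parent[find(w)] = find(heads[w])

    def dominated(w):              # is w a transitive dependent of h?
        while w != 0:
            w = heads[w]
            if w == h:
                return True
        return False

    components = {}
    for w in inside:
        components.setdefault(find(w), []).append(w)
    return sum(1 for comp in components.values()
               if not all(dominated(w) for w in comp))

# The Czech example: "Z nich je jen jedna na kvalitu ."
heads = {1: 5, 2: 1, 3: 0, 4: 5, 5: 3, 6: 3, 7: 6, 8: 3}
print(arc_degree(heads, 1))   # arc (1,5): degree 1 (word 3 not dominated)
print(arc_degree(heads, 2))   # arc (1,2): empty interval, degree 0
```

The graph's degree is then just the maximum of `arc_degree` over all arcs, here 1.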
How common is non-projectivity? (Nivre, Computational Linguistics 34(4), 2008)
Table 1: data sets from treebanks of thirteen languages with considerable typological variation. Tok = number of tokens (×1000); Sen = number of sentences (×1000); T/S = tokens per sentence (mean); Lem = lemmatization present; CPoS = number of coarse-grained part-of-speech tags; PoS = number of (fine-grained) part-of-speech tags; MSF = number of morphosyntactic features (split into atoms); Dep = number of dependency types; NPT = proportion of non-projective dependencies/tokens (%); NPS = proportion of non-projective dependency graphs/sentences (%).

Language     Tok    Sen   T/S  Lem  CPoS  PoS  MSF  Dep  NPT   NPS
Arabic        54    1.5  37.2  yes    14   19   19   27  0.4  11.2
Bulgarian    190   14.4  14.8   no    11   53   50   18  0.4   5.4
Chinese      337   57.0   5.9   no    22  303    0   82  0.0   0.0
Czech      1,249   72.7  17.2  yes    12   63   61   78  1.9  23.2
Danish        94    5.2  18.2   no    10   24   47   52  1.0  15.6
Dutch        195   13.3  14.6  yes    13  302   81   26  5.4  36.4
German       700   39.2  17.8   no    52   52    0   46  2.3  27.8
Japanese     151   17.0   8.9   no    20   77    0    7  1.1   5.3
Portuguese   207    9.1  22.8  yes    15   21  146   55  1.3  18.9
Slovene       29    1.5  18.7  yes    11   28   51   25  1.9  22.2
Spanish       89    3.3  27.0  yes    15   38   33   21  0.1   1.7
Swedish      191   11.0  17.3   no    37   37    0   56  1.0   9.8
Turkish       58    5.0  11.5  yes    14   30   82   25  1.5  11.6
Converting from non-projective to projective
- An issue with our current training data, for use with projective algorithms
- Nivre and Nilsson discuss this in "Pseudo-projective dependency parsing" (2005)
- Basic approach:
  - While there are non-projective arcs:
    - Find the smallest (shortest distance) non-projective arc
    - Change the head of this arc to its head's governor
  - Not guaranteed to be a minimal transformation
- In that paper, they remember (via labels) the lifting
- Then, in a post-process, they can undo the lift
- Hall and Novák (2005) also have a post-processing approach based on constituency parsers
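A minimal sketch of the lifting loop (not the paper's full algorithm, which also encodes the lifts in arc labels so that they can be undone after parsing). As before, heads maps 1-based positions to governors with 0 = ROOT, and the tree is my encoding of the Czech example from the Nivre (2006) slide.

```python
def nonprojective_arcs(heads):
    """Arcs (head, dep) with some word between them not dominated by head."""
    bad = []
    for d, h in heads.items():
        for w in range(min(h, d) + 1, max(h, d)):
            a = w
            while a not in (h, 0):
                a = heads[a]
            if a != h:                # w is not a transitive dependent of h
                bad.append((h, d))
                break
    return bad

def pseudo_projectivize(heads):
    """Lift the shortest non-projective arc until the tree is projective."""
    heads, lifted = dict(heads), {}
    while True:
        bad = nonprojective_arcs(heads)
        if not bad:
            return heads, lifted
        h, d = min(bad, key=lambda arc: abs(arc[0] - arc[1]))
        lifted.setdefault(d, h)       # remember the original attachment
        heads[d] = heads[h]           # re-attach d to its head's governor

heads = {1: 5, 2: 1, 3: 0, 4: 5, 5: 3, 6: 3, 7: 6, 8: 3}
proj, lifted = pseudo_projectivize(heads)
print(proj[1], lifted)    # word 1 is lifted from head 5 to head 3
```

On this example a single lift suffices: the arc (5, 1) is re-attached to word 5's governor (word 3), after which no non-projective arcs remain.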
Relaxing projectivity
- The brute-force CYK and Eisner algorithms result in projective dependency structures
- If we relax the requirement of projectivity:
  - Build a complete directed graph between words
  - Find the minimum spanning tree of the graph
  - Chu-Liu/Edmonds algorithm: O(n²)
- When parsing free word order languages (like Czech), the ability to capture non-projective dependencies is important
- Can hurt performance slightly in languages like English with few non-projective dependencies
Nivre-style non-projective parsing algorithm
- Algorithm for incrementally checking all O(n²) pairs for a link:

  for j = 1 to n
    for i = j−1 downto 0
      if PERMISSIBLE(i, j)
        LINK(i, j)

- PERMISSIBLE checks for well-formedness conditions (and non-projectivity degree limitations)
- LINK is a classifier that decides: head left, head right, no arc
- Nivre (2007) achieves accuracy improvements for 5 languages over the projective algorithm, at some efficiency cost
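The control flow can be sketched as below. PERMISSIBLE and LINK are stand-ins: in the actual algorithm LINK is a trained classifier and PERMISSIBLE enforces the well-formedness and degree constraints; here they are trivial stubs (hypothetical, just to make the loop concrete), and the single-head constraint is enforced only through the permissible stub.

```python
def parse(n, permissible, link):
    """Scan all O(n^2) pairs; 0 is ROOT, words are 1..n."""
    heads = {}
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):        # i = j-1 downto 0
            if permissible(i, j, heads):
                decision = link(i, j)         # 'left', 'right', or None
                if decision == "right":       # i governs j
                    heads[j] = i
                elif decision == "left":      # j governs i
                    heads[i] = j
    return heads

# Stub classifier: attach each word to the word just before it.
permissible = lambda i, j, heads: j not in heads
link = lambda i, j: "right" if i == j - 1 else None
print(parse(3, permissible, link))   # {1: 0, 2: 1, 3: 2}
```

Because every (i, j) pair is considered, nothing prevents the resulting arcs from crossing; projectivity is relaxed by construction.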
Dependency graph
- Given a sentence, we want to discover its dependency structure
- All words (but one) are dependent on one other word in the sentence
- One word is the main (or ROOT) head of the sentence (here, bit)
- For any word, we don't know a priori what its head is
  - Build a graph with a link to every possible head
- We will approach this as a weighted disambiguation problem
  - Every link will have a weight
  - Find the best solution for the particular weighted graph
Building a word graph I
[Graph: ROOT node with an arc to each word of "the dog bit the postman"; ROOT-as-head scores: the 2, dog 8, bit 10, the 2, postman 8]
- One node for each word in the string, plus ROOT
- Directed arcs from head to dependent, with a score (higher is better)
- Here we just show the score of each word with ROOT as its head
Building a word graph II
[Graph: slightly pruned full graph over the words (some arcs omitted for space), e.g. dog→the 40, dog↔bit with scores 30 and 20, bit→postman 30, postman→the 40]
- Scores on arcs must depend only on the head and dependent nodes
- e.g., log P(head | dependent)
Word graph as matrix
From word w_i (row) as dependent to word w_j (column) as head:

           ROOT  the  dog  bit  the  postman
the           2    -   40    0    0        0
dog           8    0    -   30    0       11
bit          10    0   20    -    0        2
the           2    0    0    0    -       40
postman       8    0    3   30    0        -

One arc per row in the final solution
Minimum spanning tree
- A tree that links all nodes in a graph
- Has minimum cost/distance in the space of all trees
- In our case, weight is good (like a log probability), so we want the maximum
- Sum weights over all arcs used in the tree
- A greedy algorithm (Chu-Liu/Edmonds) finds the best solution, then eliminates cycles
Find minimum spanning tree, step 1
[Graph: for each word, its highest-scoring incoming arc is highlighted]
- For each node in the graph, except ROOT:
  - Select the highest scoring incoming arc
MST algorithm, step 1 with matrix
For each row, pick the column with the highest score (marked *):

           ROOT  the   dog   bit  the  postman
the           2    -   40*    0    0        0
dog           8    0    -   30*    0       11
bit          10    0   20*    -    0        2
the           2    0    0     0    -      40*
postman       8    0    3   30*    0        -

- This defines a subgraph, which is a candidate solution
- Note that dog and bit form a cycle in the subgraph
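Step 1 and the cycle check can be sketched directly from the matrix. The scores are the ones on the slide; the determiner labels the_1 and the_2 are assumptions used to tell the two rows apart.

```python
scores = {   # scores[dep][head], from the matrix above
    "the_1":   {"ROOT": 2, "dog": 40, "bit": 0, "the_2": 0, "postman": 0},
    "dog":     {"ROOT": 8, "the_1": 0, "bit": 30, "the_2": 0, "postman": 11},
    "bit":     {"ROOT": 10, "the_1": 0, "dog": 20, "the_2": 0, "postman": 2},
    "the_2":   {"ROOT": 2, "the_1": 0, "dog": 0, "bit": 0, "postman": 40},
    "postman": {"ROOT": 8, "the_1": 0, "dog": 3, "bit": 30, "the_2": 0},
}

# Step 1: each dependent greedily takes its highest-scoring head.
best_head = {d: max(hs, key=hs.get) for d, hs in scores.items()}

def find_cycle(best_head):
    """Return one cycle among the chosen arcs, or None."""
    for start in best_head:
        path, node = [], start
        while node in best_head and node not in path:
            path.append(node)
            node = best_head[node]
        if node in path:                 # walked back onto the path
            return path[path.index(node):]
    return None

print(best_head["dog"], best_head["bit"])  # bit dog
print(find_cycle(best_head))               # ['dog', 'bit']
```

dog's best head is bit (30) and bit's best head is dog (20), so the candidate subgraph contains the dog/bit cycle noted on the slide.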
Minimum spanning tree I
[Graph: the candidate subgraph, with the dog↔bit cycle visible]
- Need to deal with cycles:
  - Collapse one cycle into a single node
  - Recalculate arcs in and out of the new (collapsed) node
  - Find the new highest scoring incoming arc for every node
Re-calculating transitions in and out
Using notation from McDonald et al. (2005):
- Let C be the set of nodes in the cycle
- Let s(x, x′) be the score from x to x′ (head to dependent)
- Let a(v) be the head of the max scoring incoming arc of v
Then, for any x ∉ C:

  s(C, x) = max_{x′ ∈ C} s(x′, x)

  s(x, C) = max_{x′ ∈ C} [ s(x, x′) − s(a(x′), x′) + s(C) ]

where s(C) = Σ_{v ∈ C} s(a(v), v)
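These formulas can be checked numerically on the running example. Here s[h][d] is the arc score from head h to dependent d (the transpose of the dependent-indexed matrix shown earlier; the the_1/the_2 labels are assumptions), C is the dog/bit cycle, and a(v) is each cycle node's current best head.

```python
s = {   # s[head][dep]
    "ROOT":    {"the_1": 2, "dog": 8, "bit": 10, "the_2": 2, "postman": 8},
    "the_1":   {"dog": 0, "bit": 0, "the_2": 0, "postman": 0},
    "dog":     {"the_1": 40, "bit": 20, "the_2": 0, "postman": 3},
    "bit":     {"the_1": 0, "dog": 30, "the_2": 0, "postman": 30},
    "the_2":   {"the_1": 0, "dog": 0, "bit": 0, "postman": 0},
    "postman": {"the_1": 0, "dog": 11, "bit": 2, "the_2": 40},
}
C = ["dog", "bit"]
a = {"dog": "bit", "bit": "dog"}        # best incoming head in the cycle
sC = sum(s[a[v]][v] for v in C)         # s(C) = 30 + 20 = 50
outside = ["ROOT", "the_1", "the_2", "postman"]

# Out of the cycle: s(C, x) = max_{x' in C} s(x', x)
out_of = {x: max((s[xp][x], xp) for xp in C)
          for x in outside if x != "ROOT"}

# Into the cycle: s(x, C) = max_{x' in C} [s(x, x') - s(a(x'), x') + s(C)]
into = {x: max((s[x][xp] - s[a[xp]][xp] + sC, xp) for xp in C)
        for x in outside}

print(sC)                  # 50
print(into["ROOT"])        # (40, 'bit')
print(into["postman"])     # (32, 'bit')
print(out_of["postman"])   # (30, 'bit')
```

The tuples record which internal cycle node supplied the max; they reproduce exactly the entries in the collapsed matrix two slides ahead.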
Create new matrix, with cycle as single node
Original matrix:

           ROOT  the  dog  bit  the  postman
the           2    -   40    0    0        0
dog           8    0    -   30    0       11
bit          10    0   20    -    0        2
the           2    0    0    0    -       40
postman       8    0    3   30    0        -

- The new matrix will collapse dog and bit into a single node
- (Note that these will not always be contiguous words)
New matrix, with cycle as single node
s(C) = 30 + 20 = 50. New scores:

           ROOT      the       dog/bit    the       postman
the           2        -       40 (dog)     0          0
dog/bit  40 (bit)  30 (bit)       -      30 (bit)  32 (bit)
the           2        0        0 (dog)     -         40
postman       8        0       30 (bit)     0          -
New graph, with cycle node
[Graph: ROOT, the, dog/bit, the, postman with the recalculated arc scores]
- Note: we need to remember which internal node supplied the max used in calculating each arc score
- Now select the highest scoring incoming arc for each node
MST algorithm, iteration 2 with matrix
For each row, pick the column with the highest score (marked *):

           ROOT       the       dog/bit     the       postman
the           2         -       40 (dog)*     0          0
dog/bit  40 (bit)*  30 (bit)       -       30 (bit)  32 (bit)
the           2         0        0 (dog)      -         40*
postman       8         0       30 (bit)*     0          -

- This defines a subgraph, which is a candidate solution
- Arcs involving the cycle go from/to the specified cycle member
- Note that this subgraph has no cycles (hence we are done)
Minimum spanning tree II
[Graph: the winning arcs, with the collapsed node expanded back into dog and bit]
- If there are no cycles, this is the minimum spanning tree
- Whichever cycle node receives the max incoming arc is where the cycle breaks
- The max arcs chosen give the head for each dependent
Minimum spanning tree III
[Graph: the final dependency tree]
Final solution:
- bit is the head of the string
- dog and postman are dependents of bit
- Each the is a dependent of the closest noun
Notes on MST algorithm
- The graph has n + 1 nodes and O(n²) transitions (arcs)
- The original formulation of this algorithm (Chu-Liu/Edmonds) can require up to n cycle contractions, hence O(n³)
- Tarjan's update to the algorithm uses a Fibonacci heap to achieve O(n²)
- This is called an edge-factored model, since edges (or arcs) in the graph can depend only on the head and dependent
- Scores can be assigned via a log-linear model
- Has become very popular due to its efficiency and its ability to capture non-projective dependencies for languages like Czech
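Putting the pieces together, here is a compact recursive sketch of Chu-Liu/Edmonds. It is the simple contract-and-recurse version (O(n³) overall, not the O(n²) variant), and the the_1/the_2 node labels are assumptions for the running example.

```python
def find_cycle(best):
    """Return one cycle among the chosen arcs, or None."""
    for start in best:
        path, node = [], start
        while node in best and node not in path:
            path.append(node)
            node = best[node]
        if node in path:
            return path[path.index(node):]
    return None

def chu_liu_edmonds(scores, root="ROOT"):
    """scores[dep][head] -> weight; returns the best head for each dependent."""
    best = {d: max(hs, key=hs.get) for d, hs in scores.items()}
    cycle = find_cycle(best)
    if cycle is None:
        return best
    in_c = set(cycle)
    sC = sum(scores[v][best[v]] for v in cycle)      # total cycle score
    c = ("cycle",) + tuple(cycle)                    # fresh node name

    contracted, leave, enter = {}, {}, {}
    for d, hs in scores.items():                     # arcs out of the cycle
        if d in in_c:
            continue
        nh = {}
        for h, sc in hs.items():
            if h in in_c:
                if c not in nh or sc > nh[c]:
                    nh[c], leave[d] = sc, h          # best cycle-internal head
            else:
                nh[h] = sc
        contracted[d] = nh
    nh = {}
    for xp in cycle:                                 # arcs into the cycle
        for h, sc in scores[xp].items():
            if h in in_c:
                continue
            val = sc - scores[xp][best[xp]] + sC
            if h not in nh or val > nh[h]:
                nh[h], enter[h] = val, xp
    contracted[c] = nh

    sub = chu_liu_edmonds(contracted, root)          # solve the smaller graph
    heads = {d: (leave[d] if h == c else h)
             for d, h in sub.items() if d != c}
    for v in cycle:                                  # expand the cycle...
        heads[v] = best[v]
    heads[enter[sub[c]]] = sub[c]                    # ...breaking it here
    return heads

scores = {   # the running example, scores[dep][head]
    "the_1":   {"ROOT": 2, "dog": 40, "bit": 0, "the_2": 0, "postman": 0},
    "dog":     {"ROOT": 8, "the_1": 0, "bit": 30, "the_2": 0, "postman": 11},
    "bit":     {"ROOT": 10, "the_1": 0, "dog": 20, "the_2": 0, "postman": 2},
    "the_2":   {"ROOT": 2, "the_1": 0, "dog": 0, "bit": 0, "postman": 40},
    "postman": {"ROOT": 8, "the_1": 0, "dog": 3, "bit": 30, "the_2": 0},
}
expected = {"the_1": "dog", "dog": "bit", "bit": "ROOT",
            "the_2": "postman", "postman": "bit"}
print(chu_liu_edmonds(scores) == expected)   # True
```

On the example it contracts the dog/bit cycle once, solves the smaller graph, and expands to exactly the tree from the worked slides: bit under ROOT, dog and postman under bit, each determiner under the nearest noun.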
Log linear modeling
- We haven't really talked about where the scores come from
- Discriminative scenario, so we are given the string
- To retain quadratic complexity, features depend only on the head and dependent
  - Word and POS-tag substrings around each
- McDonald et al. (2005) used MIRA
  - An instance of a passive-aggressive on-line algorithm
  - Will discuss relative to the perceptron in the next lecture
  - Basic idea is to include the error magnitude in the update
- Taking neighboring dependencies into account requires either an increase in complexity or some clever new algorithms
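A sketch of what an edge-factored, log-linear-style scorer might look like. The feature templates and weights below are illustrative assumptions, not the actual features from McDonald et al. (2005).

```python
def features(sent, h, d):
    """Edge features for head index h and dependent index d (0 = ROOT).

    sent is a list of (word, POS) pairs; word i is sent[i - 1].
    """
    hw, hp = ("ROOT", "ROOT") if h == 0 else sent[h - 1]
    dw, dp = sent[d - 1]
    return {
        f"hw={hw},dw={dw}": 1.0,              # head/dependent word pair
        f"hp={hp},dp={dp}": 1.0,              # head/dependent POS pair
        f"dir={'R' if h < d else 'L'}": 1.0,  # attachment direction
        f"dist={min(abs(h - d), 5)}": 1.0,    # bucketed linear distance
    }

def score(w, sent, h, d):
    """s(h, d) = w . f(h, d): depends only on the one edge."""
    return sum(w.get(k, 0.0) * v for k, v in features(sent, h, d).items())

sent = [("the", "DT"), ("dog", "NN"), ("bit", "VB"),
        ("the", "DT"), ("postman", "NN")]
w = {"hp=ROOT,dp=VB": 2.0, "hp=VB,dp=NN": 1.5}
print(score(w, sent, 0, 3))   # 2.0  (ROOT -> bit)
print(score(w, sent, 3, 2))   # 1.5  (bit -> dog)
```

Because every feature fires on a single (head, dependent) pair, filling the full score matrix stays O(n²) score computations, which is what the MST parser needs as input.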