K-best Parsing Algorithms

1 K-best Parsing Algorithms Liang Huang University of Pennsylvania joint work with David Chiang (USC Information Sciences Institute)

2 k-best Parsing Liang Huang (Penn) k-best parsing 2

5 k-best Parsing
Example: I saw a boy with a telescope.
[Figure: two parse trees of this sentence, labeled 1-best and 2nd-best, which differ in where the PP "with a telescope" attaches]

10 Not a trivial task
Example: I saw her duck.
Aravind Joshi

12 I saw her duck.
How about: I saw her duck with a telescope.

15 I saw her duck with a telescope.

18 Another Example
I eat sushi with tuna.
Aravind Joshi

19 Or even...
I eat sushi with tuna.
Aravind Joshi

20 Why k-best?
- postpone disambiguation in a pipeline
  - 1-best is not always optimal in the future
  - propagate k-best lists instead of 1-best
  - e.g.: semantic role labeler uses k-best parses
- approximate the set of all possible interpretations
  - reranking (Collins, 2000)
  - minimum error training (Och, 2003)
  - online training (McDonald et al., 2005)
[Figure: pipeline from part-of-speech tagging (passing a lattice) to syntactic parsing (passing k-best parses) to semantic interpretation]

25 k-best Viterbi Algorithm?
1. topological sort
2. visit each vertex v in sorted order and do updates
   - for each incoming edge (u, v) in E, use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   - key observation: d(u) is fixed to optimal at this time
3. time complexity: O(|V| + |E|)
- naive k-best: O(k |E|)
- good k-best: O(|E| + k log k |V|)
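
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming a max-times semiring (probabilities multiply along a path, and the larger value wins); the function name `viterbi_dag` and the edge-dictionary input format are hypothetical, not from the talk.

```python
from graphlib import TopologicalSorter

def viterbi_dag(edges, source):
    """1-best Viterbi on a DAG. edges maps (u, v) -> weight w(u, v)."""
    incoming = {}
    for (u, v), w in edges.items():
        incoming.setdefault(v, []).append((u, w))
        incoming.setdefault(u, [])
    # Step 1: topological sort (every vertex comes after its predecessors).
    predecessors = {v: [u for u, _ in ins] for v, ins in incoming.items()}
    order = TopologicalSorter(predecessors).static_order()
    d = {v: 0.0 for v in incoming}
    d[source] = 1.0
    # Step 2: visit vertices in order; d(u) is already final when used.
    for v in order:
        for u, w in incoming[v]:
            d[v] = max(d[v], d[u] * w)   # "plus" is max, "times" is *
    return d
```

Each vertex and each edge is touched once, which matches the O(|V| + |E|) bound on the slide.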

29 Parsing as Deduction
- parsing with context-free grammars (CFGs)
- dynamic programming (CKY algorithm)
- deduction rule: from (B, i, j) and (C, j+1, k), derive (A, i, k), given the production A → B C
- instance: from (NP, 1, 3) and (VP, 4, 6), derive (S, 1, 6), given S → NP VP
- computational complexity: O(n^3 |P|), where P is the set of productions (rules)
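
As a rough sketch of the deduction rule in code, the following Viterbi CKY fills a chart of items (A, i, k) with 1-based inclusive spans, as on the slide. The grammar encoding (`lexical` and `binary` dictionaries of rule probabilities) and the function name are assumptions made for this example.

```python
from collections import defaultdict

def cky_viterbi(words, lexical, binary):
    """lexical: (A, word) -> prob; binary: (A, B, C) -> prob of A -> B C."""
    n = len(words)
    best = defaultdict(float)            # item (A, i, k) -> best probability
    for i, word in enumerate(words, start=1):
        for (a, w), p in lexical.items():
            if w == word:
                best[a, i, i] = max(best[a, i, i], p)
    for length in range(2, n + 1):       # shorter spans first (bottom-up)
        for i in range(1, n - length + 2):
            k = i + length - 1
            for j in range(i, k):        # antecedents (B, i, j) and (C, j+1, k)
                for (a, b, c), p in binary.items():
                    score = best[b, i, j] * best[c, j + 1, k] * p
                    if score > best[a, i, k]:
                        best[a, i, k] = score
    return best
```

The three span loops over i, j, k and the loop over productions give the O(n^3 |P|) running time.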

32 Deduction => Hypergraph
- hypergraph is a generalization of graph
- each hyperedge connects several vertices to one vertex
[Figure: deduction steps redrawn as a hypergraph; the vertices are items such as (VB, 3, 3), (PP, 4, 6), (NP, 1, 3), (VP, 4, 6), and (S, 1, 6), and each deduction is a hyperedge, e.g. from (NP, 1, 3) and (VP, 4, 6) to (S, 1, 6)]
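
One possible in-memory representation of such a hypergraph, as a sketch: each hyperedge records one head vertex and a tuple of tail vertices, and the graph indexes hyperedges by head. The class and method names here are illustrative only.

```python
from collections import defaultdict
from typing import NamedTuple

class Hyperedge(NamedTuple):
    head: tuple     # the single consequent item, e.g. ("S", 1, 6)
    tails: tuple    # the antecedent items it connects from

class Hypergraph:
    def __init__(self):
        self.vertices = set()
        self.incoming = defaultdict(list)   # head vertex -> its hyperedges

    def add(self, head, *tails):
        """Record one deduction step as a hyperedge from tails to head."""
        edge = Hyperedge(head, tuple(tails))
        self.vertices.update((head, *tails))
        self.incoming[head].append(edge)
        return edge

forest = Hypergraph()
forest.add(("S", 1, 6), ("NP", 1, 3), ("VP", 4, 6))
```

Indexing by head is a design choice that suits bottom-up algorithms, which repeatedly ask for the incoming hyperedges of a vertex.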

35 Packed Forest as Hypergraph
- packed forest: a compact representation of all parse trees
[Figure: packed forest for "I saw a boy with a telescope", in which the two parses share their common sub-trees]

36 Weighted Deduction/Hypergraph
- weighted rule: from (B, i, j): p and (C, j+1, k): q, derive (A, i, k): f(p, q), given A → B C
- f is the weight function
- e.g.: in Probabilistic Context-Free Grammars: f(p, q) = p · q · Pr(A → B C)

38 Monotonic Weight Functions
- all weight functions must be monotonic on each of their arguments
  - that is, replacing an argument b with a worse value b' cannot make the result better: f(b', c) is no better than f(b, c)
- optimal sub-problem property in dynamic programming
- CKY example: A = (S, 1, 5), B = (NP, 1, 2), C = (VP, 3, 5), f(b, c) = b · c · Pr(S → NP VP)
[Figure: hyperedge from B: b and C: c to A: f(b, c)]
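
A short worked inequality shows why the PCFG weight function in the CKY example is monotonic. It assumes that weights are probabilities in [0, 1] and that larger is better; the primed symbol b' is introduced here only for illustration.

```latex
f(b, c) = b \cdot c \cdot \Pr(S \to NP\ VP), \qquad
b' \le b \;\Longrightarrow\;
f(b', c) = b' \cdot c \cdot \Pr(S \to NP\ VP)
\;\le\; b \cdot c \cdot \Pr(S \to NP\ VP) = f(b, c)
```

The step holds because the factor c · Pr(S → NP VP) is non-negative; the same argument applies to the second argument c.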

39 k-best Problem in Hypergraphs
- 1-best problem: find the best derivation of the target vertex t
- k-best problem: find the top k derivations of the target vertex t
- in CKY, t = (S, 1, n)
- assumption: acyclic, so that we can use topological order

40 Outline
- Formulations
- Algorithms
  - Generic 1-best Viterbi Algorithm
  - Algorithm 0: naïve
  - Algorithm 1: hyperedge-level
  - Algorithm 2: vertex (item)-level
  - Algorithm 3: lazy algorithm
- Experiments
- Applications to Machine Translation

46 Generic 1-best Viterbi Algorithm
- traverse the hypergraph in topological order ("bottom-up")
- for each vertex, for each incoming hyperedge:
  - compute the result of the f function along the hyperedge
  - update the 1-best value for the current vertex if possible
- example: v has a hyperedge f_1 from u: a and w: b, and a hyperedge f_2 from u': c and w': d, e.g. (VP, 2, 4) and (PP, 4, 7) deriving (VP, 2, 7)
  - one hyperedge: v: f_1(a, b)
  - two hyperedges: v: better(f_1(a, b), f_2(c, d))
  - more hyperedges: v: better(better(f_1(a, b), f_2(c, d)), ...)
- overall time complexity: O(|E|)
- in CKY: |E| = O(n^3 |P|)
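
A minimal sketch of this generic procedure follows, assuming "better" means numerically larger and that leaf weights are given up front. The input format (`incoming` mapping each vertex to a list of `(f, tails)` pairs) is an assumption of this example.

```python
from graphlib import TopologicalSorter

def viterbi_hypergraph(incoming, leaf_weights):
    """incoming: head -> list of (f, tails); leaf_weights: leaf -> weight."""
    # A head depends on every tail of every incoming hyperedge.
    deps = {v: {t for _, tails in edges for t in tails}
            for v, edges in incoming.items()}
    best = dict(leaf_weights)
    for v in TopologicalSorter(deps).static_order():   # bottom-up order
        for f, tails in incoming.get(v, []):
            value = f(*(best[t] for t in tails))       # f along the hyperedge
            if v not in best or value > best[v]:       # better(...)
                best[v] = value
    return best
```

Each hyperedge is evaluated exactly once, so the cost is O(|E|) evaluations of f after the topological sort.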

48 Dynamic Programming: 1950's
Richard Bellman, Andrew Viterbi
"We knew everything so far in your talk 40 years ago"

50 k-best Viterbi algorithm 0: naïve
- straightforward k-best extension:
  - a vector of length k instead of a single value
  - vector components are kept sorted
- now what's f(a, b)?
  - k^2 values: the Cartesian product f(a_i, b_j)
  - just need the top k out of the k^2 values
  - mult_k(f, a, b) = top_k { f(a_i, b_j) }
[Figure: hyperedge f_1 from u: a and w: b to v: mult_k(f_1, a, b), next to a grid of the values f(a_i, b_j)]

52 Algorithm 0: naïve
- straightforward k-best extension:
  - a vector of length k instead of a single value
- and how to update?
  - from two vectors of length k (2k elements), select the top k elements: O(k)
- example: v: merge_k(mult_k(f_1, a, b), mult_k(f_2, c, d))
- overall time complexity: O(k^2 |E|)
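
The two operations of Algorithm 0 can be sketched directly. This is an illustration under the assumption that larger scores are better and that all vectors are sorted best-first; the function names mirror the slide's mult_k and merge_k but the code itself is not from the talk.

```python
import heapq
from itertools import islice, product

def mult_k_naive(f, a, b, k):
    """Enumerate all |a| * |b| combinations and keep the k best."""
    return heapq.nlargest(k, (f(x, y) for x, y in product(a, b)))

def merge_k(xs, ys, k):
    """Top k of two best-first sorted vectors, reading only O(k) elements."""
    return list(islice(heapq.merge(xs, ys, reverse=True), k))
```

The enumeration inside `mult_k_naive` is what costs k^2 per hyperedge; the merge itself is cheap.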

59 Algorithm 1: speedup mult_k
- mult_k(f, a, b) = top_k { f(a_i, b_j) }
- only interested in top k, why enumerate all k^2?
- a and b are sorted!
- f is monotonic!
- f(a_1, b_1) must be the 1-best
- the 2nd-best must be either f(a_2, b_1) or f(a_1, b_2)
- what about the 3rd-best?
[Figure: grid of the values f(a_i, b_j), indexed by a_i and b_j]

66 Algorithm 1 (Demo)
f(a, b) = ab
[Figure: step-by-step exploration of the grid of products a_i b_j]

70 Algorithm 1 (Demo)
- use a priority queue (heap) to store the candidates (frontier)
- in each iteration:
  1. extract-max from the heap
  2. push the two "shoulders" into the heap
- k iterations
- O(k log k |E|) overall time
[Figure: grid indexed by a_i and b_j, with the frontier of candidates marked]
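
The frontier search described above might look like the following sketch. It assumes that a and b are sorted best-first, that f is monotonic, and that larger is better; a `seen` set is added here so that a grid cell reachable from two neighbors is pushed only once.

```python
import heapq

def mult_k(f, a, b, k):
    """Top k values of f(a_i, b_j) without enumerating the whole grid."""
    if not a or not b:
        return []
    heap = [(-f(a[0], b[0]), 0, 0)]     # negate scores: heapq is a min-heap
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)             # 1. extract-max
        out.append(-neg)
        for ni, nj in ((i + 1, j), (i, j + 1)):     # 2. the two "shoulders"
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-f(a[ni], b[nj]), ni, nj))
    return out
```

Each of the k iterations does one pop and at most two pushes on a heap of size O(k), giving O(k log k) per hyperedge.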

71 Algorithm 2: speedup merge_k
- Algorithm 1 works on each hyperedge sequentially
- can we process them simultaneously?
[Figure: vertex v with d incoming hyperedges f_1, ..., f_i, ..., f_d, from tails such as u: a, w: b and p: x, q: y]

75 Algorithm 2 (Demo)
k = 2, d = 3
1. start with an initial item-level heap at v of the 1-best derivations from each hyperedge (B_1 × C_1, B_2 × C_2, B_3 × C_3)
2. pop the best (.42) to the output and push its two successors (.07 and .24)
3. pop the 2nd-best (.36)
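
A sketch of the item-level heap follows. It assumes a single monotonic f shared by all hyperedges of the vertex and best-first sorted tail vectors; both are simplifications made for this example. The tail vectors in the usage line are made up so that the popped values reproduce the demo's .42, .07, .24, and .36, scaled by 100 to keep the arithmetic exact.

```python
import heapq

def k_best_vertex(f, hyperedges, k):
    """hyperedges: list of (b, c) pairs of best-first sorted vectors into v."""
    heap, seen, out = [], set(), []

    def push(e, i, j):
        b, c = hyperedges[e]
        if i < len(b) and j < len(c) and (e, i, j) not in seen:
            seen.add((e, i, j))
            heapq.heappush(heap, (-f(b[i], c[j]), e, i, j))

    for e in range(len(hyperedges)):     # initial heap: 1-best of each hyperedge
        push(e, 0, 0)
    while heap and len(out) < k:
        neg, e, i, j = heapq.heappop(heap)   # pop the best over all hyperedges
        out.append(-neg)
        push(e, i + 1, j)                    # push its two successors
        push(e, i, j + 1)
    return out

demo = k_best_vertex(lambda x, y: x * y,
                     [([6, 1], [7, 4]), ([9, 2], [4, 1]), ([4, 3], [5, 2])], 2)
```

Because one heap serves the whole vertex, only k pops happen per vertex rather than k per hyperedge, which is where the |V| k log k term comes from.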

76 Algorithm 3: Offline (lazy)
- from Algorithm 0 to Algorithm 2: delaying the calculations until needed
  - lazier
  - larger locality
- even lazier (one step further): we are interested in the k-best derivations of the final item only!

77 Algorithm 3: Offline (lazy)
- forward phase
  - do a normal 1-best search till the final item
  - but construct the hypergraph (forest) along the way
- recursive backward phase
  - ask the final item: what's your 2nd-best?
  - final item will propagate this question till the leaves
  - then ask the final item: what's your 3rd-best?

88 Algorithm 3 demo
k = 2
After the forward step (1-best parsing), the forest holds the 1-best derivation from each hyperedge into S (1, 7):
- NP (1, 2) and VP (3, 7): 0.4 × 0.5 = .20
- NP (1, 3) and VP (4, 7): 0.7 × 0.6 = .42 (1-best)
- VP (1, 5) and NP (6, 7): 0.7 × 0.4 = .28
Now the backward step:
1. ask S (1, 7): "what's your 2nd-best?"
2. S (1, 7): "I'm not sure... let me ask my parents"
3. "well, it must be either ?? or ??" (two still-unknown candidates on the hyperedge from NP (1, 3) and VP (4, 7))
4. "but wait a minute, did you already know the ?'s?" "oops, forgot to ask more questions recursively"
5. ask NP (1, 3) and VP (4, 7): "what's your 2nd-best?"; the recursion goes on to the leaf nodes and reports back the numbers
6. push .30 and .21 to the candidate heap (priority queue)
7. pop the root of the heap (.30): "now I know my 2nd-best"
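
The question-and-answer recursion of the demo can be sketched as below. This is a simplified illustration, not the authors' implementation: weights multiply, larger is better, leaves are modeled as vertices with zero-tail hyperedges, and there is no separate forward pass (the first visit to a vertex seeds its heap with each hyperedge's 1-best, which plays that role). The class name and input format are hypothetical.

```python
import heapq

class LazyKBest:
    """incoming: vertex -> list of (rule_weight, tails); tails may be ()."""

    def __init__(self, incoming):
        self.incoming = incoming
        self.best, self.cand, self.seen, self.pending = {}, {}, {}, {}

    def kth(self, v, k):
        """Weight of the k-th best derivation of v (1-based), or None."""
        if v not in self.best:              # first visit: seed the heap
            self.best[v], self.cand[v], self.seen[v] = [], [], set()
            for e, (_, tails) in enumerate(self.incoming[v]):
                self._push(v, e, (0,) * len(tails))
        while len(self.best[v]) < k:
            if v in self.pending:           # expand the last popped candidate
                e, ranks = self.pending.pop(v)
                for pos in range(len(ranks)):
                    bumped = ranks[:pos] + (ranks[pos] + 1,) + ranks[pos + 1:]
                    self._push(v, e, bumped)
            if not self.cand[v]:
                break                       # v has fewer than k derivations
            neg, e, ranks = heapq.heappop(self.cand[v])
            self.best[v].append(-neg)
            self.pending[v] = (e, ranks)    # successors are pushed on demand
        return self.best[v][k - 1] if k <= len(self.best[v]) else None

    def _push(self, v, e, ranks):
        if (e, ranks) in self.seen[v]:
            return
        self.seen[v].add((e, ranks))
        weight, tails = self.incoming[v][e]
        for t, r in zip(tails, ranks):
            sub = self.kth(t, r + 1)        # "what's your (r+1)-th best?"
            if sub is None:
                return
            weight *= sub
        heapq.heappush(self.cand[v], (-weight, e, ranks))
```

With the demo's forest, and assuming (this is not shown on the slides) that NP (1, 3) has a 2nd-best of 0.5 and VP (4, 7) a 2nd-best of 0.3, asking the root for its 2nd-best pushes the candidates .30 and .21 and then pops .30. Vertices that are never asked for more than their 1-best never do extra work, which is the point of the lazy strategy.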

89 Interesting Properties
- 1-best is best everywhere (all decisions optimal)
- local picture: 2nd-best is optimal everywhere except one decision
  - and that decision must be 2nd-best
  - and it's the best of all 2nd-best decisions
- so what about the 3rd-best?
- kth-best is ...
[Figure: grid of values indexed by a_i and b_j]

90 Summary of Algorithms
Algorithm        Time Complexity            Locality
1-best           O(|E|)                     hyperedge
alg. 0: naïve    O(k^a |E|)                 hyperedge (mult_k)
alg. 1           O(k log k |E|)             hyperedge (mult_k)
alg. 2           O(|E| + |V| k log k)       item (merge_k)
alg. 3: lazy     O(|E| + |D| k log k)       global
for CKY: a = 2, |E| = O(n^3 |P|), |V| = O(n^2 |N|), |D| = O(n)
a is the arity of the grammar

91 Outline
- Formulations
- Algorithms: Alg. 0 thru Alg. 3
- Experiments
  - Collins/Bikel Parser
  - both efficiency and accuracy
- Applications in Machine Translation

92 Background: Statistical Parsing
- Probabilistic Grammar
  - induced from a treebank (Penn Treebank)
- State-of-the-art Parsers
  - Collins (1999), Bikel (2004), Charniak (2000), etc.
- Evaluation of Accuracy
  - PARSEVAL: tree-similarity (English treebank: ~90%)
- Previous work on k-best Parsing:
  - Collins (2000): turn off dynamic programming
  - Charniak/Johnson (2005): coarse-to-fine, still too slow

93 Efficiency
- implemented Algorithms 0, 1, 3 on top of Collins/Bikel Parser
- average (wall-clock) time on Penn Treebank (per sentence)
[Figure: timing plot, annotated with O(|E| + |D| k log k)]

96 Oracle Reranking
- given k parses of a sentence
  - oracle reranking: pick the best parse according to the gold-standard
  - real reranking: pick the best parse according to the score function
[Figure: accuracy scale with the gold standard (correct parse) at 100% and the 1-best at 89%; the k-best parses are marked at 96%, 91%, 89%, and 78%, with the oracle among them]
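
The difference between the two selection rules can be sketched as follows. The parse representation (a set of labeled spans), the bracket-overlap F1 used as the similarity measure, and all function names are assumptions made for this example; they stand in for the PARSEVAL-style measure mentioned in the talk.

```python
def bracket_f1(parse, gold):
    """F1 overlap between two sets of labeled spans (label, i, j)."""
    matched = len(parse & gold)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(parse), matched / len(gold)
    return 2 * precision * recall / (precision + recall)

def real_rerank(kbest, score):
    """Pick the parse the score function likes best (no gold standard needed)."""
    return max(kbest, key=score)

def oracle_rerank(kbest, gold):
    """Pick the parse closest to the gold standard: an upper bound for rerankers."""
    return max(kbest, key=lambda parse: bracket_f1(parse, gold))
```

Oracle accuracy therefore measures the quality of the k-best list itself, independent of how good any particular reranker is.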

98 Quality of the k-best lists
[Figure: quality of the k-best lists, comparing this work (on top of the Collins parser) with Collins (2000)]

100 Why are our k-best lists better?
[Figure: average number of parses for sentences of a certain length, plotted against sentence length, for this work and for Collins (2000); an annotation reads "7% of the test-set"]
- Collins (2000): turn down dynamic programming
  - theoretically exponential time complexity
  - aggressive beam pruning to make it tractable in practice

106 k-best Viterbi Algorithm?
1. topological sort
2. visit each vertex v in sorted order and do updates
   - for each incoming edge (u, v) in E, use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   - key observation: d(u) is fixed to optimal at this time
3. time complexity: O(|V| + |E|)
- Algorithm 0: O(k |E|)
- Algorithm 2: O(|E| + k log k |V|)
- Algorithm 3: O(|E| + k log k |D|)

107 Conclusions
- monotonic hypergraph formulation
  - the k-best derivations problem
- k-best Algorithms
  - Algorithm 0 (naïve) to Algorithm 3 (lazy)
- experimental results
  - efficiency
  - accuracy (effectively searching over larger space)
- applications in machine translation
  - k-best rescoring and forest rescoring

108 Implemented in...
- state-of-the-art statistical parsers
  - Charniak parser (2005); Berkeley parser (2006)
  - McDonald et al. dependency parser (2005)
  - Microsoft Research (Redmond) dependency parser (2006)
- generic dynamic programming languages/packages
  - Dyna (Eisner et al., 2005) and Tiburon (May and Knight, 2006)
- state-of-the-art syntax-based translation systems
  - Hiero (Chiang, 2005)
  - ISI syntax-based system (2005)
  - CMU syntax-based system (2006)
  - BBN syntax-based system (2007)

109 Applications in Machine Translation

110 Syntax-based Translation
- synchronous context-free grammars (SCFGs)
  - generating pairs of strings/trees simultaneously
  - co-indexed nonterminals are further rewritten as a unit
- example rules:
  S → ⟨NP(1) VP(2), NP(1) VP(2)⟩
  VP → ⟨PP(1) VP(2), VP(2) PP(1)⟩
  NP → ⟨Baoweier, Powell⟩
  ...
[Figure: paired derivation trees for "Baoweier yu Shalong juxing le huitan" and "Powell held a meeting with Sharon"]

119 Translation as Parsing
- translation ("decoding") => monolingual parsing
  - parse the source input with the source projection
  - build the corresponding target sub-strings in parallel
- rules:
  S → ⟨NP(1) VP(2), NP(1) VP(2)⟩
  VP → ⟨PP(1) VP(2), VP(2) PP(1)⟩
  NP → ⟨Baoweier, Powell⟩
  ...
[Figure: bottom-up parse of "Baoweier yu Shalong juxing le huitan" into NP, PP, and VP, with target sub-strings such as "held a talk with Sharon" and "Powell with Sharon held a talk" built alongside]
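
To illustrate how target sub-strings are built in parallel with the source parse, here is a small sketch that reads the target string off a source-side derivation. The tree encoding and the `TARGET_ORDER` table are hypothetical; the lexical translations "with Sharon" and "held a meeting" are taken from the example sentence pair in the talk, but the lexical rules for PP and VP are assumed.

```python
# For each synchronous rule, the order in which the source-side children
# appear on the target side (hypothetical encoding of the co-indexation).
TARGET_ORDER = {
    ("S", ("NP", "VP")): (0, 1),     # S  -> <NP(1) VP(2), NP(1) VP(2)>
    ("VP", ("PP", "VP")): (1, 0),    # VP -> <PP(1) VP(2), VP(2) PP(1)>
}

def translate(node):
    """node is (label, source, target) for lexical rules, else (label, children)."""
    label, children = node[0], node[1]
    if isinstance(children, str):            # lexical rule: emit target side
        return node[2]
    order = TARGET_ORDER[label, tuple(child[0] for child in children)]
    return " ".join(translate(children[i]) for i in order)

source_parse = ("S", [
    ("NP", "Baoweier", "Powell"),
    ("VP", [("PP", "yu Shalong", "with Sharon"),
            ("VP", "juxing le huitan", "held a meeting")]),
])
```

Calling `translate(source_parse)` swaps the PP and VP under the reordering rule while leaving the top-level NP VP order unchanged.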

124 Language Model: Rescoring
[Figure: two-stage pipeline. Spanish/English bilingual text goes through statistical analysis to give the translation model (TM: competency; here a synchronous CFG). English text goes through statistical analysis to give the language model (LM: fluency; here an n-gram LM). Spanish input passes through the TM to give "broken English", which passes through the LM to give English. In k-best rescoring, the TM runs first and the LM rescores its output.]
Example: "Que hambre tengo yo"
- TM candidates: "What hunger have I", "Hungry I am so", "Have I that hunger", "I am so hungry", "How hunger have I", ...
- after the LM: "I am so hungry"
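
A k-best rescoring step could be sketched as follows, assuming both models report log-probabilities that are simply added (optionally with a weight on the LM). The function name, the weight parameter, and the idea of passing the LM as a plain callable are assumptions of this example, not a description of any particular system.

```python
def rescore(kbest, lm_logprob, lm_weight=1.0):
    """kbest: list of (translation, tm_logprob) pairs from the TM-only decoder.

    Returns the same pairs re-sorted by TM score plus weighted LM score.
    """
    def combined(item):
        translation, tm_logprob = item
        return tm_logprob + lm_weight * lm_logprob(translation)
    return sorted(kbest, key=combined, reverse=True)
```

Because the LM only sees the k candidates that survived the TM-only search, a fluent translation that is missing from the list can never be recovered, which is the limitation that motivates integration.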

125 k-best rescoring results
- the ISI syntax-based translation system
  - currently the best performing system on the Chinese-to-English task in NIST evaluations
  - based on synchronous grammars
- BLEU score:
  - translation model (TM) only:
  - rescoring with trigram LM on best list:

127 Language Model: Integration
[Figure (Knight and Koehn, 2003): the same pipeline, with the translation model (TM: competency; synchronous CFG) and the language model (LM: fluency; n-gram LM) combined in a single integrated (LM-integrated) decoder that maps "Que hambre tengo yo" directly to "I am so hungry"]
- computationally challenging!

130 Integrated Decoding Results
- the ISI syntax-based translation system
  - currently the best performing system on the Chinese-to-English task in NIST evaluations
  - based on synchronous grammars
- BLEU score:
  - translation model (TM) only:
  - rescoring with trigram LM on best list:
  - trigram integrated decoding:
- but over 100 times slower!
- see my second talk for details!

134 Rescoring vs. Integration
Figure: quality (poor to good) against speed (slow to fast)
integration: good quality, but slow
rescoring: fast, but poor quality
any compromise? (on-the-fly rescoring?)
Yes, forest rescoring: almost as fast as rescoring, and almost as good as integration

136 Why Integration is Slow?
Figure: node VP1,6 with three incoming hyperedges: (PP1,3 VP3,6), (PP1,4 VP4,6), (NP1,4 VP4,6)
+LM items at VP1,6: hold ... Sharon; held ... Shalong; hold ... Shalong
+LM items at the PP antecedent: with ... Sharon; along ... Sharon; with ... Shalong
+LM items at the VP antecedent: held ... talk; held ... meeting; hold ... talks
split each node into +LM items (with boundary words)
beam search: only keep top-k +LM items at each node
but there are many ways to derive each node
can we avoid enumerating all combinations?
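To make the cost of full integration concrete, the following Python sketch enumerates every combination of +LM items across all hyperedges of a node and then applies the beam. It is a simplified illustration under stated assumptions: a bigram LM (so one boundary word per side suffices), log-scores where higher is better, and hypothetical names `LMItem`, `combine`, and `exhaustive_beam`. It does not merge items that share boundary words, which a real decoder would do.

```python
import heapq
from itertools import product
from typing import Callable, List, NamedTuple, Sequence, Tuple


class LMItem(NamedTuple):
    first: str    # leftmost boundary word
    last: str     # rightmost boundary word
    score: float  # log-score, higher is better


def combine(left: LMItem, right: LMItem,
            bigram: Callable[[str, str], float]) -> LMItem:
    """Concatenate two items; the LM only needs the bigram at the junction."""
    junction = bigram(left.last, right.first)
    return LMItem(left.first, right.last, left.score + right.score + junction)


def exhaustive_beam(hyperedges: Sequence[Tuple[List[LMItem], List[LMItem]]],
                    bigram: Callable[[str, str], float],
                    k: int) -> Tuple[List[LMItem], int]:
    """Score every (left, right) pair on every hyperedge, then keep the top k."""
    candidates = [combine(l, r, bigram)
                  for lefts, rights in hyperedges
                  for l, r in product(lefts, rights)]
    return heapq.nlargest(k, candidates, key=lambda it: it.score), len(candidates)


# Toy data: three hyperedges, each with 3 x 3 antecedent items (scores are made up).
vps = [LMItem("held", "meeting", -1.0), LMItem("held", "talk", -1.2),
       LMItem("hold", "conference", -1.5)]
pps = [LMItem("with", "Sharon", -0.5), LMItem("along", "Sharon", -0.9),
       LMItem("with", "Shalong", -1.1)]
toy_bigram = lambda a, b: -0.1 if (a, b) == ("meeting", "with") else -2.0
top, n_scored = exhaustive_beam([(vps, pps)] * 3, toy_bigram, k=3)
print(n_scored, top[0])
```

With three hyperedges and a beam of 3 at each antecedent, 27 combinations are scored to keep only 3, and the count grows quadratically with the beam size.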

137 Forest Rescoring
k-best parsing Algorithm 2
Figure: node VP with three incoming hyperedges: (PP1,3 VP3,6), (PP1,4 VP4,6), (NP1,4 VP4,6)
process all hyperedges simultaneously!
significant savings of computation
with LM cost, we can only do k-best approximately.
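The idea of processing all hyperedges simultaneously can be sketched with a single priority queue shared by every hyperedge of a node. This is an illustrative Python sketch, not the published algorithm: costs are lower-is-better, each antecedent list is assumed sorted by cost, and `forest_rescore_node` and `combo_cost` are hypothetical names. Because the combination cost can break monotonicity, the result is approximate.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def forest_rescore_node(hyperedges: Sequence[Tuple[List[float], List[float]]],
                        combo_cost: Callable[[int, int, int], float],
                        k: int) -> Tuple[List[Tuple[float, int, int, int]], int]:
    """Lazily pop k candidates for one node from one queue over all its hyperedges.

    hyperedges[e] = (left_costs, right_costs), each sorted ascending.
    combo_cost(e, i, j) is the extra (e.g. LM) cost of cell (i, j) on hyperedge e.
    Returns the popped candidates re-sorted by cost, and how many cells were scored.
    """
    def cost(e: int, i: int, j: int) -> float:
        left, right = hyperedges[e]
        return left[i] + right[j] + combo_cost(e, i, j)

    # Seed the queue with the top-left corner of every hyperedge's grid.
    heap = [(cost(e, 0, 0), e, 0, 0) for e in range(len(hyperedges))]
    heapq.heapify(heap)
    seen = {(e, 0, 0) for e in range(len(hyperedges))}
    buffer = []
    while heap and len(buffer) < k:
        c, e, i, j = heapq.heappop(heap)
        buffer.append((c, e, i, j))
        left, right = hyperedges[e]
        for ni, nj in ((i + 1, j), (i, j + 1)):  # the two successors
            if ni < len(left) and nj < len(right) and (e, ni, nj) not in seen:
                seen.add((e, ni, nj))
                heapq.heappush(heap, (cost(e, ni, nj), e, ni, nj))
    return sorted(buffer), len(seen)


edges = [([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
         ([2.0, 3.0, 4.0], [2.0, 3.0, 4.0]),
         ([5.0, 6.0, 7.0], [5.0, 6.0, 7.0])]
best, scored = forest_rescore_node(edges, lambda e, i, j: 0.0, k=3)
print(best, scored)
```

In this toy run only 8 of the 27 grid cells are scored to obtain 3 candidates, which is where the savings over exhaustive enumeration come from.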

138 Forest Rescoring Results
on the Hiero system (Chiang, 2005): ~10-fold speed-up at the same level of BLEU
on my syntax-directed system (Huang et al., 2006): ~10-fold speed-up at the same level of search error
on a typical phrase-based system (Pharaoh): ~30-fold speed-up at the same level of search error; ~100-fold speed-up at the same level of BLEU
also used in NIST evals by ISI, CMU, and BBN syntax-based systems
see my ACL 07 paper for details

139 Conclusions
monotonic hypergraph formulation: the k-best derivations problem
k-best algorithms: Algorithm 0 (naïve) to Algorithm 3 (lazy)
experimental results: efficiency; accuracy (effectively searching over a larger space)
applications in machine translation: k-best rescoring and forest rescoring

142 Thank you!! Questions? Comments?
Liang Huang and David Chiang (2005). Better k-best Parsing. In Proceedings of IWPT, Vancouver, B.C.
Liang Huang and David Chiang (2007). Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of ACL, Prague, Czech Republic.

143 Quality of the k-best lists (Collins, 2000)

144 Syntax-based Experiments

145 Tree-to-String System
syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006)
the reverse direction is found in (Liu et al., 2006)
synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003)
related to STAG (Shieber/Schabes, 90)
tested on 140 sentences
slightly better BLEU scores than Pharaoh
Figure: example rule pairing an English VP tree fragment ("was shot to death by the police": VBD was, VBN shot, PP to death, PP by the police) with a Chinese string containing "bei"

146 Speed vs. Search Quality

147 Speed vs. Translation Accuracy

148 Cube Pruning
hyperedge: VP1,6 from PP1,3 and VP3,6
monotonic grid?
grid axes: PP1,3 items (with ... Sharon; along ... Sharon; with ... Shalong) against VP3,6 items (held ... meeting; held ... talk; hold ... conference)

150 Cube Pruning
hyperedge: VP1,6 from PP1,3 and VP3,6
non-monotonic grid due to LM combo costs, e.g. bigram (meeting, with)
grid axes: PP1,3 items (with ... Sharon; along ... Sharon; with ... Shalong) against VP3,6 items (held ... meeting; held ... talk; hold ... conference)

154 Cube Pruning
k-best parsing Algorithm 1
a priority queue of candidates
extract the best candidate
push the two successors
grid axes: PP1,3 items (with ... Sharon; along ... Sharon; with ... Shalong) against VP3,6 items (held ... meeting; held ... talk; hold ... conference)

156 Cube Pruning
items are popped out of order
solution: keep a buffer of pop-ups
finally re-sort the buffer and return the items in order
grid axes: PP1,3 items (with ... Sharon; along ... Sharon; with ... Shalong) against VP3,6 items (held ... meeting; held ... talk; hold ... conference)
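The out-of-order behaviour and the buffer fix can be seen on a tiny grid. The sketch below is an assumption-laden illustration in Python, not the talk's implementation: costs are lower-is-better, both axes are pre-sorted, and `cube_prune` and `combo_cost` are hypothetical names. The toy combination cost penalises the top-left cell, so the first pop is not the cheapest item.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Cell = Tuple[float, int, int]  # (total cost, row index, column index)


def cube_prune(left: Sequence[float], right: Sequence[float],
               combo_cost: Callable[[int, int], float],
               k: int) -> Tuple[List[Cell], List[Cell]]:
    """Pop k cells from one grid; return them in pop order and in sorted order."""
    def cost(i: int, j: int) -> float:
        return left[i] + right[j] + combo_cost(i, j)

    heap = [(cost(0, 0), 0, 0)]          # priority queue of candidates
    seen = {(0, 0)}
    popped: List[Cell] = []              # the buffer of pop-ups
    while heap and len(popped) < k:
        c, i, j = heapq.heappop(heap)    # extract the best candidate
        popped.append((c, i, j))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # push the two successors
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (cost(ni, nj), ni, nj))
    return popped, sorted(popped)        # re-sort the buffer at the end


# Non-monotonic grid: the combination cost makes cell (0, 0) the most expensive.
penalty = lambda i, j: 3.0 if (i, j) == (0, 0) else 0.0
popped, in_order = cube_prune([1.0, 2.0], [1.0, 2.0], penalty, k=3)
print("pop order:", popped)
print("in order: ", in_order)
```

Here the cell with cost 5.0 is popped first and ends up last after re-sorting. The cell with cost 4.0 is never popped, a small instance of the search error that makes the method approximate.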
