Better Word Alignment with Supervised ITG Models. Aria Haghighi, John Blitzer, John DeNero, and Dan Klein UC Berkeley
1 Better Word Alignment with Supervised ITG Models Aria Haghighi, John Blitzer, John DeNero, and Dan Klein UC Berkeley
2-5 Word Alignment [alignment-grid figure, built up across slides 2-5: a Chinese sentence aligned link by link to the English sentence "Indonesia 's parliament speaker arraigned in court"; only the English axis labels survive in this transcription]
6-8 Experimental Setup: 2002 Chinese-English NIST hand-aligned data; labeled training examples (the count is not preserved in this transcription) and 191 test examples. Evaluation: precision and recall against the gold alignments, plus end-to-end BLEU.
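For reference, the evaluation quantities on this slide fit in a few lines. The sketch below follows the usual sure/possible-link definitions of precision, recall, and AER; the function and variable names are mine, not the talk's.

    def alignment_metrics(predicted, sure, possible):
        """Precision, recall, and AER of a predicted link set against a gold
        annotation with sure links S and possible links P (S is a subset of P)."""
        hits_p = len(predicted & possible)   # predicted links allowed by the annotation
        hits_s = len(predicted & sure)       # sure links the model recovered
        precision = hits_p / len(predicted) if predicted else 0.0
        recall = hits_s / len(sure) if sure else 0.0
        denom = len(predicted) + len(sure)
        aer = 1.0 - (hits_p + hits_s) / denom if denom else 0.0
        return precision, recall, aer

    # Toy check: three predicted links against two sure links and one extra possible link.
    pred = {(0, 0), (1, 2), (2, 1)}
    S = {(0, 0), (2, 1)}
    P = S | {(1, 1)}
    print(alignment_metrics(pred, S, P))   # (0.666..., 1.0, 0.2)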
9-14 Alignment Families [bar chart, built up across slides 9-14: the best AER achievable within each alignment family ("optimal AER"); the unconstrained family A reaches 0.0, while restricting to 1-1 matchings (A_1-1) leaves some gold links unreachable; the remaining bar values are not preserved in this transcription]
15-22 Alignment Families: Inversion Transduction Grammar (ITG) [Wu, 97]. Terminal productions align single words or leave them unaligned: X → f/e, X → f/ε, X → ε/e. Binary productions combine two sub-spans either monotonically, X → [X(L) X(R)] (the two parts appear in the same order in both languages), or with inversion, X → ⟨X(L) X(R)⟩ (the parts appear in opposite orders in the two languages).
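To make the grammar concrete, here is a small illustrative dynamic program over bitext spans for the one-to-one ITG family: given a score for each candidate link, it returns the best score of any ITG alignment by trying every monotone and inverted split of a span pair. It is a brute-force teaching sketch (roughly cubic in each sentence length, with names of my choosing), not the authors' aligner.

    from functools import lru_cache

    def itg_viterbi_score(link_score, n, m):
        """Best score of a one-to-one ITG alignment of a source sentence of
        length n and a target sentence of length m, where link_score[i][j] is
        the score of aligning source word i to target word j and unaligned
        words score zero."""

        @lru_cache(maxsize=None)
        def best(i, j, k, l):
            # Best score for source span [i, j) paired with target span [k, l).
            if j == i and l == k:
                return 0.0
            value = float('-inf')
            if j - i == 1 and l - k == 1:
                value = link_score[i][k]              # terminal X -> f/e
            if (j - i, l - k) in ((1, 0), (0, 1)):
                value = 0.0                           # null terminals X -> f/eps, eps/e
            for s in range(i, j + 1):
                for t in range(k, l + 1):
                    # Monotone rule X -> [X X]: left halves pair up, right halves pair up.
                    if not ((s == i and t == k) or (s == j and t == l)):
                        value = max(value, best(i, s, k, t) + best(s, j, t, l))
                    # Inverted rule X -> <X X>: left source half pairs with right target half.
                    if not ((s == i and t == l) or (s == j and t == k)):
                        value = max(value, best(i, s, t, l) + best(s, j, k, t))
            return value

        return best(0, n, 0, m)

    # Toy check: two words per side with strong diagonal scores.
    scores = [[2.0, -1.0],
              [-1.0, 2.0]]
    print(itg_viterbi_score(scores, 2, 2))   # 4.0

The skipped split points are the degenerate ones that would reproduce the whole span and recurse forever; every other split strictly shrinks both parts, so memoization over span pairs terminates.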
23-26 Alignment Families [the optimal-AER chart again, now adding A_ITG alongside A and A_1-1; the exact bar values are not preserved in this transcription]
27-31 Alignment Families: Block ITG (BITG). Terminal productions may now emit multi-word blocks, X → f̄/ē, aligning a contiguous source phrase to a contiguous target phrase (covering many-to-one and many-to-many links); the monotone X → [X(L) X(R)] and inverted X → ⟨X(L) X(R)⟩ binary productions are unchanged.
32-36 Alignment Families [optimal-AER chart completed with A_BITG: block terminals recover nearly all gold links, with an optimal AER of about 1.2; the values for A_1-1 and A_ITG are garbled in this transcription (a value of 10.2 appears but cannot be attributed reliably)]
37-39 Supervised Data [figure: a training example is a sentence pair x together with a hand-annotated gold alignment a*, shown on the Indonesia example grid]
40-45 Loss Function: comparing the gold alignment a* with a guess â, L(a*, â) = (# of sure gold links missing from â) + (# of links in â that are incorrect).
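Read literally, this loss is a simple count over link sets. The sketch below is my reading of it, with "incorrect" taken to mean "not among the possible gold links" (an assumption; the slide does not spell out the sure/possible distinction here).

    def alignment_loss(gold_sure, gold_possible, guess):
        """L(a*, a-hat): sure gold links the guess misses, plus guessed links
        the annotation does not license."""
        missing_sure = len(gold_sure - guess)
        incorrect = len(guess - gold_possible)
        return missing_sure + incorrect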
46-51 Linear Model: an alignment a ∈ A is scored as w^T φ(a). The feature vector decomposes over alignment cells and unaligned positions: φ(a) = Σ_{(i,j) ∈ a} φ_ij + Σ_{i ∉ a} φ_{i∅} + Σ_{j ∉ a} φ_{∅j}. Example cell features: φ_13 = { PartialDictMatch(Indonesia, Indonesia), ... }.
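Because the features decompose cell by cell, φ(a) and the model score are easy to assemble with sparse vectors. The sketch below does exactly that; the Counter-valued feature maps and their names (Dice, Dictionary, and so on, echoing the slides) stand in for whatever real feature functions would supply.

    from collections import Counter

    def alignment_features(alignment, cell_feats, null_src_feats, null_tgt_feats, n, m):
        """phi(a): sum the per-link feature vectors, plus a feature vector for
        every unaligned source position and every unaligned target position."""
        phi = Counter()
        aligned_src = {i for i, _ in alignment}
        aligned_tgt = {j for _, j in alignment}
        for i, j in alignment:
            phi.update(cell_feats[i][j])       # e.g. Dice(i,j), Dictionary(i,j), ...
        for i in range(n):
            if i not in aligned_src:
                phi.update(null_src_feats[i])  # source word i left unaligned
        for j in range(m):
            if j not in aligned_tgt:
                phi.update(null_tgt_feats[j])  # target word j left unaligned
        return phi

    def model_score(w, phi):
        """w^T phi(a) as a sparse dot product over feature names."""
        return sum(w.get(name, 0.0) * value for name, value in phi.items())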
52-59 Margin Training: decode the loss-augmented best guess â = argmax_{a ∈ A} w^T φ(a) + λ L(a*, a), then require the gold to win by a margin, w^T φ(a*) ≥ w^T φ(â) + L(a*, â), taking the smallest weight change that satisfies the constraint: min_w ||w − w'||² subject to that margin.
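When only the single most-violated constraint is used, this update has a standard closed form. The sketch below shows that one-constraint MIRA step; the guess â would come from the loss-augmented Viterbi decode above. The clipping cap and all names are illustrative choices, not details taken from the talk.

    def mira_update(w, phi_gold, phi_guess, loss, cap=1.0):
        """One-constraint MIRA step: the smallest change to w that makes the
        (projected) gold score beat the guess by a margin of `loss`.
        All vectors are sparse dicts mapping feature names to values."""
        delta = {f: phi_gold.get(f, 0.0) - phi_guess.get(f, 0.0)
                 for f in set(phi_gold) | set(phi_guess)}
        gap = sum(w.get(f, 0.0) * d for f, d in delta.items())   # current score margin
        norm_sq = sum(d * d for d in delta.values())
        if norm_sq == 0.0:
            return dict(w)                                       # identical feature vectors
        tau = min(cap, max(0.0, loss - gap) / norm_sq)           # clipped step size
        new_w = dict(w)
        for f, d in delta.items():
            new_w[f] = new_w.get(f, 0.0) + tau * d
        return new_w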
60-69 Oracle Projection: the hand-annotated gold a* generally lies outside the restricted family A. Let m* = min_{a ∈ A} L(a*, a) be the smallest loss any family member can achieve, and M(a*) = {a ∈ A : L(a*, a) = m*} the set of minimizers. Training uses a projected pseudo-gold a_p ∈ M(a*) in place of a*.
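One way to picture the projection for the 1-1 family: a member of M(a*) keeps as many sure gold links as the one-link-per-row-and-column constraint allows and proposes nothing outside the gold annotation. The greedy sketch below only illustrates that idea; it is not the procedure used in this work (the projection could equally be computed with the family's own Viterbi machinery, scoring links by the loss), and greedy tie-breaking can be suboptimal when sure links compete for the same position.

    def project_onto_one_to_one(gold_sure, gold_possible):
        """Toy projection of a gold alignment onto the 1-1 family: keep sure
        links first (at most one per source and per target position), then any
        possible links that still fit, and never propose anything else."""
        used_src, used_tgt, projected = set(), set(), set()
        for links in (sorted(gold_sure), sorted(gold_possible - gold_sure)):
            for i, j in links:
                if i not in used_src and j not in used_tgt:
                    projected.add((i, j))
                    used_src.add(i)
                    used_tgt.add(j)
        return projected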
70-74 Learning Alignments [precision/recall chart comparing the 1-1, ITG, and BITG families, MIRA-trained with Viterbi inference and simple features (Dice, lexical, distance, dictionary); the numeric values are not preserved in this transcription]
75-83 Likelihood Training: P_w(a | x) ∝ exp{w^T φ(a)}, with normalizer Z_w(x) = Σ_{a ∈ A} exp{w^T φ(a)}. For A_1-1 this sum is #P-hard to compute (it amounts to a matrix permanent), but for A_ITG and A_BITG it can be computed exactly by a dynamic program over bitext spans [i, j] × [k, l], with block terminals handled at the leaves for BITG. [Figures show the source/target span chart for the two families]
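To see why the choice of family matters for the normalizer, the sketch below computes Z_w(x) by brute force, enumerating every 1-1 matching of a toy sentence pair. The enumeration grows factorially, which is the #P obstacle the slide alludes to; the ITG and BITG families replace it with a polynomial dynamic program over spans. The helper names and toy feature are mine.

    import math
    from itertools import combinations, permutations

    def one_to_one_alignments(n, m):
        """Every 1-1 matching (including partial ones) of n source and m target words."""
        for k in range(min(n, m) + 1):
            for src in combinations(range(n), k):
                for tgt in permutations(range(m), k):
                    yield frozenset(zip(src, tgt))

    def log_partition(w, phi, n, m):
        """log Z_w(x) = log sum_a exp(w^T phi(a)), by exhaustive enumeration
        over the 1-1 family (feasible only for toy sentence lengths)."""
        scores = [sum(w.get(f, 0.0) * v for f, v in phi(a).items())
                  for a in one_to_one_alignments(n, m)]
        top = max(scores)
        return top + math.log(sum(math.exp(s - top) for s in scores))

    # Toy check: a single feature counting on-diagonal links, weight 1.0.
    w = {'diag': 1.0}
    phi = lambda a: {'diag': sum(1.0 for i, j in a if i == j)}
    print(log_partition(w, phi, 2, 2))   # about 2.82 over the 7 matchings of a 2x2 pair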
84-90 Derivations vs. Alignments [figures only in this transcription: a single alignment can have many distinct ITG derivations, so summing over derivations is not the same as summing over alignments; this spurious ambiguity must be dealt with before likelihood training]
91-99 ITG Normal Form [figures: a normal-form grammar controls n-ary productions and fixes where null-aligned words attach, so that derivations correspond to alignments without double counting]
100-106 Oracle Projection (for likelihood): with P_w(a | x) ∝ exp{w^T φ(a)}, we would like to maximize Σ_i log P_w(a*_i | x_i), but a*_i may lie outside the family A; instead maximize the probability of the projected set, Σ_i log P_w(M(a*_i) | x_i).
107-110 Learning Alignments [precision/recall chart comparing margin (MIRA) and likelihood training, both with Viterbi inference and the simple features (Dice, lexical, distance, dictionary); numeric values not preserved]
111-116 Learning Alignments [precision/recall chart comparing GIZA++, the HMM aligner, NeurAlign [Ayan & Dorr, 06], and the likelihood-trained BITG model with HMM features (Likelihood-BITG +HMM); numeric values not preserved]
117-123 End-to-End Experiments. Decoder: Joshua [Li et al., 2009]. Data: FBIS, 100k sentences, maximum length 40. LM: 5-gram trained on the Xinhua portion of English Gigaword. Tuning: 300 sentences of the NIST MT04 test set. Test: NIST 2005 Chinese-English. [Results table with rows GIZA++, HMM, and LL-BITG and columns Recall, Prec, BLEU; the numbers are not preserved in this transcription]
124-129 Conclusions: Blocks are important, and ITG inference keeps them tractable. A normal form and oracle projection make likelihood training possible. Word alignment improvements yield BLEU improvements. Software: nlp.cs.berkeley.edu
130 Thanks!
131-135 Likelihood Criterion (backup): P_w(a | x) = s_w(a) / Σ_{a′ ∈ A_ITG} s_w(a′), trained by max_w Σ_{(x, a*) ∈ D} log P(a* | x). Since a* may fall outside A_ITG, define m* = min_{a ∈ A_ITG} L(a*, a) and M(a*) = {a ∈ A_ITG : L(a*, a) = m*}, and instead maximize max_w Σ_{(x, a*) ∈ D} log P(M(a*) | x). [Chart comparing MIRA (21.1) and log-likelihood training on A_BITG; the second value is not preserved]
136 Adding External Features: adding joint HMM posteriors from DeNero et al. (2007) as features. [Table comparing MIRA and likelihood training across the 1-1, ITG, BITG, BITG-S, and BITG-N models, reporting P, R, and AER with the base features (Dice, distance, blocks, dictionary, lexical) and with HMM posteriors added; the numbers are not preserved in this transcription]
137 End-to-End Results: using the Joshua decoder (Hiero grammar). [Table with alignment precision/recall, extracted rule counts (in millions), and BLEU for GIZA++, joint HMM, Viterbi ITG, and posterior ITG alignments; the numbers are not preserved in this transcription]
138-147 Alignment Families (backup) [figures: nested diagrams of the alignment families A, A_1-1, A_ITG, and A_BITG with their optimal AERs; the only value preserved is roughly 1.2 for A_BITG]
148-151 Learning Alignments (backup): for a ∈ A, s_w(a) = w^T φ(a), with φ(a) = Σ_{(i,j) ∈ a} φ_ij + Σ_{i ∉ a} φ_{i∅} + Σ_{j ∉ a} φ_{∅j}.
152-156 MIRA: Margin Criterion (backup): with pseudo-gold a_p ∈ A_1-1, decode the loss-augmented guess â = argmax_{a ∈ A_1-1} s_{w_t}(a) + λ L(a_p, a), then update w_{t+1} = argmin_w ||w − w_t||² subject to s_{w_{t+1}}(a_p) ≥ s_{w_{t+1}}(â) + L(a_p, â).
157-160 Alignment Families (backup) [figures: the unconstrained family A and the 1-1 matching family A_1-1; no further text preserved]
161-166 Results (backup) [precision/recall charts built up across slides: GIZA++, the HMM aligner [Liang et al., 06], LL-BITG with simple features, NeurAlign [Ayan & Dorr, 06], and LL-BITG with HMM features; numeric values not preserved]
167-168 Linear Model (backup): score w^T φ(a) with φ(a) = Σ_{(i,j) ∈ a} φ_ij, where φ_ij = { Dice(i,j), Dictionary(i,j), LogFreqDiff(i,j), ... }.