S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking
1 S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking. Yi Yang* and Ming-Wei Chang#. *Georgia Institute of Technology, Atlanta. #Microsoft Research, Redmond.
5 Traditional NLP Settings
High dimensional sparse features (e.g., lexical features): languages naturally live in high dimensional spaces. Powerful and very expressive.
Linear models: linear Support Vector Machine, Maximum Entropy model.
Sparse features + linear models.
9 Rise of Dense Features
Low dimensional embedding features.
Low dimensional statistics features: named mention statistics, click-through statistics.
Dense features + non-linear models.
14 Non-linear Models
Neural networks, kernel methods, tree-based models (e.g., Random Forest, Boosted Tree).
[Chart: number of papers in ACL using neural, kernel, and tree-based methods.]
17 Tree-based Models
Empirical successes: information retrieval [LambdaMART; Burges, 2010], computer vision [Babenko et al., 2011], real world classification [Fernandez-Delgado et al., 2014].
Why tree-based models? They handle categorical features and count data better, and they implicitly perform feature selection.
20 Contribution
We present S-MART: Structured Multiple Additive Regression Trees, a general class of tree-based structured learning algorithms and a natural fit for problems with dense features.
We apply S-MART to entity linking on short and noisy texts; entity linking relies on dense statistical features.
Experimental results show that S-MART significantly outperforms all alternative baselines.
21 Outline
- S-MART: a family of tree-based structured learning algorithms
- S-MART for tweet entity linking: non-overlapping inference
- Experiments
25 Structured Learning
Model a joint scoring function S(x, y) over an input structure x and an output structure y.
Obtaining a prediction requires inference (e.g., dynamic programming):
$\hat{y} = \arg\max_{y \in \mathrm{gen}(x)} S(x, y)$
29 Structured Multiple Additive Regression Trees (S-MART)
Assume a decomposition over factors.
Optimize with functional gradient descent.
Model functional gradients using regression trees $h_m(x, y_k)$:
$F(x, y_k) = F_M(x, y_k) = \sum_{m=1}^{M} \eta_m h_m(x, y_k)$
33 Gradient Descent
Linear combination of parameters and feature functions:
$F(x, y_k) = \mathbf{w}^{\top} \mathbf{f}(x, y_k)$
Gradient descent in vector space:
$\mathbf{w}_m = \mathbf{w}_{m-1} - \eta_m \nabla_{\mathbf{w}} L$
[Figure: gradient descent trajectory $w_0 \to w_1 \to w_2 \to \dots$ in parameter space.]
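Ordinary parameter-space gradient descent can be sketched on a toy least-squares problem; the data, step size, and iteration count here are illustrative:

```python
# Gradient descent in parameter (vector) space: w_m = w_{m-1} - eta * gradient,
# for the toy objective 0.5 * sum_i (w * x_i - y_i)^2.

def grad(w, xs, ys):
    # d/dw of 0.5 * sum((w*x - y)^2) = sum((w*x - y) * x)
    return sum((w * x - y) * x for x, y in zip(xs, ys))

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # data generated by y = 2x
w, eta = 0.0, 0.05
for _ in range(100):
    w -= eta * grad(w, xs, ys)

print(round(w, 3))  # converges toward the true slope 2.0
```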
40 Gradient Descent in Function Space
Start from $F_0(x, y_k) = 0$.
At iteration m, compute the functional gradient at the current function $F_{m-1}$ (this step requires inference):
$g_m(x, y_k) = \left. \dfrac{\partial L(y, S(x, y))}{\partial F(x, y_k)} \right|_{F(x, y_k) = F_{m-1}(x, y_k)}$
Update:
$F_m(x, y_k) = F_{m-1}(x, y_k) - \eta_m \, g_m(x, y_k)$
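The function-space updates above can be sketched as a miniature MART-style boosting loop. This is a minimal, unstructured sketch with squared loss; the dataset, tree depth, step size, and use of scikit-learn's DecisionTreeRegressor are illustrative choices, not the paper's implementation:

```python
# Gradient descent in function space: start from F_0 = 0, fit a regression
# tree h_m to the negative functional gradient at F_{m-1}, then set
# F_m = F_{m-1} + eta * h_m. Squared loss on a toy regression task.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2  # toy target function

F = np.zeros(len(y))  # F_0 = 0
eta, trees = 0.5, []
for m in range(50):
    g = F - y                              # functional gradient of 0.5*(F - y)^2
    h = DecisionTreeRegressor(max_depth=2).fit(X, -g)
    F += eta * h.predict(X)                # approximates F_m = F_{m-1} - eta * g
    trees.append(h)

print(round(float(np.mean((F - y) ** 2)), 4))  # training MSE shrinks toward 0
```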
45 Model Functional Gradients
Pointwise functional gradients $g_m(x, y_k)$ are approximated by fitting a regression tree.
[Example tree: "Is linkprob > 0.5?" splits to "Is PER?" (leaves -0.5 and -0.1) and "Is clickprob > 0.1?" (leaf -0.3).]
50 S-MART vs. TreeCRF

                   TreeCRF [Dietterich et al., 2004]   S-MART
Structure:         Linear chain                        Various structures
Loss function:     Logistic loss                       Various losses
Scoring function:  $F_{y_t}(x)$                        $F(x, y_t)$
51 Outline S-MART: A family of Tree-based Structured Learning Algorithms S-MART for Tweet Entity Linking Non-overlapping inference Experiments
54 Entity Linking in Short Texts
Data explosion: noisy and short texts (Twitter messages, web queries).
Downstream applications: semantic parsing and question answering [Yih et al., 2015], relation extraction [Riedel et al., 2013].
55 Tweet Entity Linking
61 Entity Linking meets Dense Features
Shortage of labeled data: lack of context makes annotation more challenging, and as language changes, annotations may become stale and ill-suited for new spellings and words [Yang and Eisenstein, 2013].
Powerful dense statistical features [Guo et al., 2013]: the probability of a surface form being an entity, the view count of a Wikipedia page, textual similarity between a tweet and a Wikipedia page.
69 System Overview
Pipeline: Tokenized Message → Candidate Generation → Joint Recognition and Disambiguation → Entity Linking Results.
Structured learning selects the best non-overlapping entity assignment:
- Choose the top 20 entity candidates for each surface form.
- Add a special NIL entity to represent that no entity should be fired here.
Example: "Eli Manning and the New York Giants are going to win the World Series" → Eli_Manning, New_York_Giants, World_Series (all other surface forms map to NIL).
73 S-MART for Tweet Entity Linking
Logistic loss:
$L(y, S(x, y)) = -\log P(y \mid x) = \log Z(x) - S(x, y)$
Pointwise gradients (computed via non-overlapping inference):
$g(x, y_k = u_k) = P(y_k = u_k \mid x) - \mathbb{1}[y^{*}_k = u_k]$
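The pointwise gradient above can be illustrated for a single factor, assuming a plain softmax over a handful of candidate entities; the candidate names and scores are made up, and the real system computes the marginals P(y_k = u | x) via non-overlapping inference rather than a local softmax:

```python
# Pointwise functional gradient of the logistic (softmax) loss on one factor:
# g(x, y_k = u) = P(y_k = u | x) - 1[y*_k = u].
import math

def softmax(scores):
    mx = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

candidates = ["Eli_Manning", "Eli_(given_name)", "NIL"]  # hypothetical candidates
scores = [2.0, 0.5, 0.1]                                  # illustrative F(x, y_k)
gold = "Eli_Manning"

probs = softmax(scores)
grads = [p - (1.0 if c == gold else 0.0) for c, p in zip(candidates, probs)]
print([round(g, 3) for g in grads])  # negative for the gold entity, positive otherwise
```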
79 Inference: Forward Algorithm
[Lattice over the tweet "Eli Manning and the New York Giants are going to win the World Series" with candidate surface forms: Eli, Eli Manning, Manning, New, New York, New York Giants, York, Giants, win, World, World Series, Series. The forward pass sweeps left to right over non-overlapping choices.]
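A minimal sketch of the non-overlapping inference idea, cast as a Viterbi-style dynamic program over candidate mention spans; the spans, entity names, and scores are toy values, not the paper's actual algorithm or features:

```python
# Forward DP for non-overlapping entity assignment: V[i] is the best total
# score over tokens [0, i); at each position, either leave the token unlinked
# (NIL) or end a candidate mention there.

def best_assignment(n_tokens, mentions):
    """mentions: list of (start, end, entity, score); end is exclusive."""
    V = [0.0] * (n_tokens + 1)
    back = [None] * (n_tokens + 1)
    for i in range(1, n_tokens + 1):
        V[i], back[i] = V[i - 1], None          # token i-1 maps to NIL
        for (s, e, ent, sc) in mentions:
            if e == i and V[s] + sc > V[i]:
                V[i], back[i] = V[s] + sc, (s, e, ent, sc)
    chosen, i = [], n_tokens                    # recover mentions backwards
    while i > 0:
        if back[i] is None:
            i -= 1
        else:
            s, e, ent, sc = back[i]
            chosen.append((s, e, ent))
            i = s
    return V[n_tokens], list(reversed(chosen))

mentions = [(0, 2, "Eli_Manning", 3.0), (1, 2, "Manning_(surname)", 1.0),
            (4, 7, "New_York_Giants", 4.0), (4, 6, "New_York", 2.5),
            (12, 14, "World_Series", 3.5)]
score, chosen = best_assignment(14, mentions)
print(score, chosen)  # the three non-overlapping, highest-scoring mentions
```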
84 Inference: Backward Algorithm
[The same lattice, swept right to left; combining forward and backward quantities yields the marginals needed for the pointwise gradients.]
85 Outline S-MART: A family of Tree-based Structured Learning Algorithms S-MART for Tweet Entity Linking Non-overlapping inference Experiments
87 Data
Named Entity Extraction & Linking (NEEL) Challenge datasets [Cano et al., 2014]; TACL datasets [Fang & Chang, 2014].

Data        #Tweet   #Entity   Date
NEEL Train  2,340    2,202     Jul. ~ Aug. 11
NEEL Test   1,                 Jul. ~ Aug. 11
TACL-IE                        Dec. 12
TACL-IR                        Dec. 12
89 Evaluation Methodology
IE-driven evaluation [Guo et al., 2013]: the standard evaluation of a system's ability to extract entities from tweets. Metric: macro F-score.
IR-driven evaluation [Fang & Chang, 2014]: evaluates a system's ability to disambiguate the target entities in tweets. Metric: macro F-score on query entities.
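A minimal sketch of the macro F-score computation, assuming it is averaged per tweet over gold and predicted entity sets; the sets below are illustrative, not from the datasets:

```python
# Macro F-score over tweets: compute an F1 per tweet from the gold and
# predicted entity sets, then average across tweets.

def f1(gold, pred):
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

tweets = [  # (gold entities, predicted entities) per tweet
    ({"Eli_Manning", "New_York_Giants"}, {"Eli_Manning", "New_York_Giants"}),
    ({"World_Series"}, {"World_Series", "New_York"}),
]
macro_f = sum(f1(g, p) for g, p in tweets) / len(tweets)
print(round(macro_f, 3))  # average of a perfect tweet and a partial one
```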
90 Algorithms

                        Structured   Non-linear   Tree-based
Structured Perceptron       x
Linear SSVM*                x
Polynomial SSVM             x            x
LambdaRank                               x
MART#                                    x            x
S-MART                      x            x            x

* previous state-of-the-art system   # winning system of the NEEL 2014 challenge
103 IE-driven Evaluation
[Bar chart: macro F-scores on the NEEL Test set for SP, Linear SSVM, Poly SSVM (kernel based model), LambdaRank (neural based model), MART (tree based model), and S-MART (tree based structured model). S-MART outperforms both the linear and the non-linear baselines.]
110 IR-driven Evaluation
[Bar chart: macro F-scores on TACL-IR for the same systems; S-MART again outperforms the linear and non-linear baselines.]
115 Conclusion
A novel tree-based structured learning framework, S-MART, which generalizes TreeCRF.
A novel inference algorithm for the non-overlapping structure of the tweet entity linking task.
Application: knowledge base QA (outstanding paper of ACL 2015); our system is a core component of the QA system.
Rise of non-linear models: we can try advanced neural structured algorithms, and it is worth trying different non-linear models.
116 Thank you!
More informationAn Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio Université de Montréal 13/06/2007
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More informationBackpropagating through Structured Argmax using a SPIGOT
Backpropagating through Structured Argmax using a SPIGOT Hao Peng, Sam Thomson, Noah A. Smith @ACL July 17, 2018 Overview arg max Parser Downstream task Loss L Overview arg max Parser Downstream task Head
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences
More informationInformation Retrieval
Information Retrieval Learning to Rank Ilya Markov i.markov@uva.nl University of Amsterdam Ilya Markov i.markov@uva.nl Information Retrieval 1 Course overview Offline Data Acquisition Data Processing Data
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationCollective Entity Disambiguation with Structured Gradient Tree Boosting
Collective Entity Disambiguation with Structured Gradient Tree Boosting Yi Yang Ozan Irsoy Kazi Shefaet Rahman Bloomberg LP New York, NY 10022 {yyang464+oirsoy+krahman7}@bloomberg.net Abstract We present
More informationMore on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization
More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial
More informationExploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge
Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science
More informationarxiv: v1 [cs.cv] 20 Dec 2016
End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr
More informationDetection and Extraction of Events from s
Detection and Extraction of Events from Emails Shashank Senapaty Department of Computer Science Stanford University, Stanford CA senapaty@cs.stanford.edu December 12, 2008 Abstract I build a system to
More informationConstrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell
Constrained Convolutional Neural Networks for Weakly Supervised Segmentation Deepak Pathak, Philipp Krähenbühl and Trevor Darrell 1 Multi-class Image Segmentation Assign a class label to each pixel in
More informationAn Algorithm based on SURF and LBP approach for Facial Expression Recognition
ISSN: 2454-2377, An Algorithm based on SURF and LBP approach for Facial Expression Recognition Neha Sahu 1*, Chhavi Sharma 2, Hitesh Yadav 3 1 Assistant Professor, CSE/IT, The North Cap University, Gurgaon,
More informationUMass at TREC 2017 Common Core Track
UMass at TREC 2017 Common Core Track Qingyao Ai, Hamed Zamani, Stephen Harding, Shahrzad Naseri, James Allan and W. Bruce Croft Center for Intelligent Information Retrieval College of Information and Computer
More informationTri-modal Human Body Segmentation
Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4
More informationMachine Learning Classifiers and Boosting
Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve
More informationAssessing the Quality of Natural Language Text
Assessing the Quality of Natural Language Text DC Research Ulm (RIC/AM) daniel.sonntag@dfki.de GI 2004 Agenda Introduction and Background to Text Quality Text Quality Dimensions Intrinsic Text Quality,
More informationModel Inference and Averaging. Baging, Stacking, Random Forest, Boosting
Model Inference and Averaging Baging, Stacking, Random Forest, Boosting Bagging Bootstrap Aggregating Bootstrap Repeatedly select n data samples with replacement Each dataset b=1:b is slightly different
More informationJoint Inference in Image Databases via Dense Correspondence. Michael Rubinstein MIT CSAIL (while interning at Microsoft Research)
Joint Inference in Image Databases via Dense Correspondence Michael Rubinstein MIT CSAIL (while interning at Microsoft Research) My work Throughout the year (and my PhD thesis): Temporal Video Analysis
More informationConditional Random Fields and beyond D A N I E L K H A S H A B I C S U I U C,
Conditional Random Fields and beyond D A N I E L K H A S H A B I C S 5 4 6 U I U C, 2 0 1 3 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative
More informationEntity and Knowledge Base-oriented Information Retrieval
Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061
More informationBilevel Sparse Coding
Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional
More information16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text
16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics
More informationWeb-based Question Answering: Revisiting AskMSR
Web-based Question Answering: Revisiting AskMSR Chen-Tse Tsai University of Illinois Urbana, IL 61801, USA ctsai12@illinois.edu Wen-tau Yih Christopher J.C. Burges Microsoft Research Redmond, WA 98052,
More informationA study of classification algorithms using Rapidminer
Volume 119 No. 12 2018, 15977-15988 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A study of classification algorithms using Rapidminer Dr.J.Arunadevi 1, S.Ramya 2, M.Ramesh Raja
More informationscikit-learn (Machine Learning in Python)
scikit-learn (Machine Learning in Python) (PB13007115) 2016-07-12 (PB13007115) scikit-learn (Machine Learning in Python) 2016-07-12 1 / 29 Outline 1 Introduction 2 scikit-learn examples 3 Captcha recognize
More informationPractical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow
Practical Methodology Lecture slides for Chapter 11 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26 What drives success in ML? Arcane knowledge of dozens of obscure algorithms? Mountains
More informationCS6716 Pattern Recognition
CS6716 Pattern Recognition Aaron Bobick School of Interactive Computing Administrivia PS3 is out now, due April 8. Today chapter 12 of the Hastie book. Slides (and entertainment) from Moataz Al-Haj Three
More informationNatural Language Processing
Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without
More informationfor Searching Social Media Posts
Mining the Temporal Statistics of Query Terms for Searching Social Media Posts ICTIR 17 Amsterdam Oct. 1 st 2017 Jinfeng Rao Ferhan Ture Xing Niu Jimmy Lin Task: Ad-hoc Search on Social Media domain Stream
More informationAnnouncements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.
CS 188: Artificial Intelligence Spring 2011 Lecture 21: Perceptrons 4/13/2010 Announcements Project 4: due Friday. Final Contest: up and running! Project 5 out! Pieter Abbeel UC Berkeley Many slides adapted
More informationLayerwise Interweaving Convolutional LSTM
Layerwise Interweaving Convolutional LSTM Tiehang Duan and Sargur N. Srihari Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260, United States
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationEvaluation. Evaluate what? For really large amounts of data... A: Use a validation set.
Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?
More informationConditional Random Fields for Word Hyphenation
Conditional Random Fields for Word Hyphenation Tsung-Yi Lin and Chen-Yu Lee Department of Electrical and Computer Engineering University of California, San Diego {tsl008, chl260}@ucsd.edu February 12,
More informationLearning to Rank Aggregated Answers for Crossword Puzzles
Learning to Rank Aggregated Answers for Crossword Puzzles Massimo Nicosia 1,2, Gianni Barlacchi 2 and Alessandro Moschitti 1,2 1 Qatar Computing Research Institute 2 University of Trento m.nicosia@gmail.com,
More informationSupport Vector Machines 290N, 2015
Support Vector Machines 290N, 2015 Two Class Problem: Linear Separable Case with a Hyperplane Class 1 Class 2 Many decision boundaries can separate these two classes using a hyperplane. Which one should
More informationEffective Latent Space Graph-based Re-ranking Model with Global Consistency
Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationLocally Adaptive Regression Kernels with (many) Applications
Locally Adaptive Regression Kernels with (many) Applications Peyman Milanfar EE Department University of California, Santa Cruz Joint work with Hiro Takeda, Hae Jong Seo, Xiang Zhu Outline Introduction/Motivation
More informationMachine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart
Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More information