Parallel Boosted Regression Trees for Web Search Ranking
Parallel Boosted Regression Trees for Web Search Ranking. Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, Jennifer Paykin. Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA; Wesleyan University, Middletown, CT, USA. WWW 2011, March 30, 2011, Hyderabad, India.
Overview: Search Query + Documents → Order by Relevance. Gradient Boosted Regression Trees: ensemble h_{t-1}(·) plus weak learner α g_t(·). Approximate parallel method: histogram bins with (p_j, m_j, r_j). [Plot: speedup and accuracy vs. number of processors.]
Web Ranking: Query + Documents → Feature Generator → Document/Query Features {x_i} → Ranker → Relevance Function h(·) → ordering from More Relevant to Less Relevant. Ranking by pointwise relevance: h : R^f → [0, 4], h(x_i) ≈ y_i.
Web Ranking as a supervised machine learning problem. Feature vectors (document/query pairs): x_i ∈ R^f. Labels (relevance): y_i ∈ {0, 1, 2, 3, 4}. Training data: D = {(x_i, y_i)}_{i=1}^n. Predictor: h : R^f → [0, 4], h(x_i) ≈ y_i.
Learning a Relevance Predictor: Yahoo! Labs Learning to Rank Challenge 2010, top 8 of 1055 submissions. Gradient Boosted Regression Trees: ensemble h_{t-1}(·) plus weak learner α g_t(·).
Gradient Boosted Regression Trees. Final predictor: h_t(x_i) = h_{t-1}(x_i) + α g_t(x_i). Weak learners: g_t(x_i) ≈ y_i − h_{t-1}(x_i). This is approximate gradient descent in predictor space.
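The update rule above can be sketched end to end; a minimal illustration (not the paper's C++/MPI implementation) using depth-1 regression stumps as weak learners under squared loss, with helper names invented here:

```python
import numpy as np

def fit_stump(X, residual):
    """Fit a depth-1 regression tree (stump) minimizing squared error."""
    best = None
    for f in range(X.shape[1]):
        order = np.argsort(X[:, f])
        xs, rs = X[order, f], residual[order]
        csum, total = np.cumsum(rs), rs.sum()
        n = len(rs)
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # no valid threshold between equal feature values
            # Reducing squared error is equivalent to maximizing
            # (left label sum)^2 / left count + (right label sum)^2 / right count.
            gain = csum[i - 1] ** 2 / i + (total - csum[i - 1]) ** 2 / (n - i)
            if best is None or gain > best[0]:
                best = (gain, f, (xs[i - 1] + xs[i]) / 2,
                        rs[:i].mean(), rs[i:].mean())
    _, f, s, left, right = best
    return lambda X: np.where(X[:, f] < s, left, right)

def gbrt(X, y, iterations=50, alpha=0.1):
    """h_t = h_{t-1} + alpha * g_t, where g_t fits the residual y - h_{t-1}."""
    h = np.zeros(len(y))
    trees = []
    for _ in range(iterations):
        g = fit_stump(X, y - h)   # weak learner approximates the residual
        h += alpha * g(X)
        trees.append(g)
    return lambda Xnew: alpha * sum(g(Xnew) for g in trees)
```

Each iteration fits only the residual of the current ensemble, which is why (as the next slides note) the training iterations themselves cannot be parallelized; only the per-tree split search can.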
Why parallelize GBRT? Large training datasets and numerous training iterations. But training is sequential: each weak learner is fit to the residuals of the previous ensemble.
Learning a Regression Tree. [Animation: recursively splitting training instances in a two-dimensional feature space (Feature 1 vs. Feature 2); a query point receives the leaf prediction 3.8.]
A split node stores a feature f, a split point s, and leaf labels ȳ_L and ȳ_R. CART algorithm: greedily minimize the cost at each split, L_s = Σ_{(x_i, y_i) ∈ L_s} (y_i − ȳ_L^s)² + Σ_{(x_i, y_i) ∈ R_s} (y_i − ȳ_R^s)². Optimal split: s* = argmin_s L_s.
Parallel Method: histograms with bins (p_j, m_j, r_j).
Learning a Regression Tree from sufficient statistics. To evaluate the loss of a split point s on feature f: argmin_s Σ_{(x_i, y_i) ∈ L_s} (y_i − ȳ_L^s)² + Σ_{(x_i, y_i) ∈ R_s} (y_i − ȳ_R^s)². Only two statistics are needed: m_s, the number of instances with feature f less than s, and r_s, the sum of labels for those instances. Best split: s* = argmin_s −[ r_s²/m_s + (r − r_s)²/(m − m_s) ], where m and r are the total instance count and label sum. So we can estimate m_s and r_s in parallel!
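That the split search needs only m_s and r_s follows from expanding the squared loss: the total loss equals Σ y_i² minus r_s²/m_s + (r − r_s)²/(m − m_s), so minimizing one is maximizing the other. A small numerical check of this equivalence (function names are illustrative, not from the paper):

```python
import numpy as np

def explicit_loss(x, y, s):
    """Squared loss of split s: sum of squared deviations from the
    left and right means."""
    loss = 0.0
    for part in (y[x < s], y[x >= s]):
        if len(part):
            loss += ((part - part.mean()) ** 2).sum()
    return loss

def stat_score(x, y, s):
    """Score from the sufficient statistics alone; minimizing the loss is
    equivalent to maximizing r_s^2/m_s + (r - r_s)^2/(m - m_s)."""
    m, r = len(y), y.sum()
    m_s = int((x < s).sum())   # count of instances with feature value below s
    r_s = y[x < s].sum()       # sum of their labels
    if m_s == 0 or m_s == m:
        return -np.inf
    return r_s ** 2 / m_s + (r - r_s) ** 2 / (m - m_s)

# Both criteria should pick the same candidate split.
rng = np.random.default_rng(0)
x = rng.random(50)
y = (x > 0.6).astype(float) + 0.1 * rng.standard_normal(50)
candidates = np.linspace(0.05, 0.95, 19)
best_by_loss = min(candidates, key=lambda s: explicit_loss(x, y, s))
best_by_stats = max(candidates, key=lambda s: stat_score(x, y, s))
```

Because the score depends on the data only through counts and label sums, those two quantities are exactly what the processors must estimate and communicate.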
Parallel Tree Construction. Ben-Haim and Yom-Tov (2010): parallel decision tree construction. Our work: adapted to support regression, optimized for low-depth trees, and provides an open-source C++/MPI implementation.
Setup: a master and processors 1 through p; the training data is distributed across the processors.
Parallel Algorithm. The master initializes the regression tree. For each feature, every processor compresses its local feature values into a histogram and sends it to the master.
This repeats for the other features.
The master selects a split point from the merged histograms, expands the tree, and distributes the updated tree to the processors.
Once the master completes the tree, it adds the tree to the ensemble and the processors update their residuals.
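The master/worker exchange above can be simulated in a single process; a sketch (plain Python standing in for MPI, helper names invented here) in which each worker summarizes its shard per candidate split and the master merges the summaries and selects:

```python
import numpy as np

def worker_summary(x_shard, y_shard, thresholds):
    """Each worker reports, per candidate split, its local count and label
    sum of instances below the threshold, plus its shard totals."""
    m_s = np.array([(x_shard < t).sum() for t in thresholds])
    r_s = np.array([y_shard[x_shard < t].sum() for t in thresholds])
    return m_s, r_s, len(y_shard), y_shard.sum()

def master_select(summaries, thresholds):
    """Merge worker summaries and pick the split maximizing
    r_s^2/m_s + (r - r_s)^2/(m - m_s)."""
    m_s = sum(s[0] for s in summaries)
    r_s = sum(s[1] for s in summaries)
    m = sum(s[2] for s in summaries)
    r = sum(s[3] for s in summaries)
    with np.errstate(divide="ignore", invalid="ignore"):
        score = r_s ** 2 / m_s + (r - r_s) ** 2 / (m - m_s)
    score[(m_s == 0) | (m_s == m)] = -np.inf  # disallow empty children
    return thresholds[int(np.argmax(score))]

# Simulate p = 4 workers holding shards of one feature.
rng = np.random.default_rng(1)
x = rng.random(400)
y = (x > 0.5).astype(float)
thresholds = np.linspace(0.1, 0.9, 17)
shards = zip(np.array_split(x, 4), np.array_split(y, 4))
summaries = [worker_summary(xs, ys, thresholds) for xs, ys in shards]
split = master_select(summaries, thresholds)
```

In the actual algorithm the workers send compressed histograms rather than exact per-threshold counts; the communication pattern (summarize, merge at the master, broadcast the chosen split) is the same.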
Histograms compress the distribution of feature values across instances. Dynamic bins with statistics: p_j, the bin center; m_j, the number of points; r_j, the sum of labels. Each histogram has a fixed maximum size.
Histogram functions: Merge(histogramA, histogramB), Uniform(histogram, n), InterpolateM(histogram, s), InterpolateR(histogram, s).
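A compact sketch of the bin operations, simplified from Ben-Haim and Yom-Tov's streaming histograms: Merge fuses the closest neighboring bins, and the interpolation here is a cruder step approximation that assigns each bin's mass to its center (the original interpolates within bins); Uniform, which would return n candidate split points at equal-mass quantiles, is omitted. All names are illustrative:

```python
import bisect

class Histogram:
    """Fixed-size histogram: each bin stores [p_j, m_j, r_j], i.e. the bin
    center, the number of points, and the sum of their labels."""
    def __init__(self, max_bins=8):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [p, m, r]

    def add(self, x, y):
        bisect.insort(self.bins, [x, 1, y])
        self._shrink()

    def merge(self, other):
        """Merge(histogramA, histogramB): combine all bins, then re-shrink."""
        self.bins = sorted(self.bins + other.bins)
        self._shrink()
        return self

    def _shrink(self):
        # While over capacity, fuse the two closest neighboring bins into
        # their weighted-average center, summing counts and label sums.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (p1, m1, r1), (p2, m2, r2) = self.bins[i], self.bins[i + 1]
            fused = [(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2, r1 + r2]
            self.bins[i:i + 2] = [fused]

    def interpolate_m(self, s):
        """InterpolateM(histogram, s): approximate count of points below s."""
        return sum(m for p, m, _ in self.bins if p < s)

    def interpolate_r(self, s):
        """InterpolateR(histogram, s): approximate label sum below s."""
        return sum(r for p, _, r in self.bins if p < s)
```

Because merging and shrinking preserve the total count and label sum, the master can recover approximate m_s and r_s for any candidate split s from the merged histograms alone.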
Why this setup works. Accuracy: under the weak learner assumption, approximate split points suffice. Speedup: tunable communication cost, limited-depth trees, and large data sets.
Results. [Plot: speedup vs. number of processors.]
Datasets. Yahoo LTRC Set 1: 473,134 training instances, 700 features. Set 2: 34,815 training instances, 700 features. Microsoft LETOR Fold 1: 723,412 training instances, 136 features.
Software. pgbrt (this work): approximate, parallel method. RT-Rank (Mohan, et al.): exact GBRT method.
Speedup. On a 48-core SMP machine: maximum speedup = 4 (on 48 processors). On a distributed memory cluster: maximum speedup = 5 (on 3 processors). [Plots: speedup vs. number of processors for pgbrt and RT-Rank.]
Accuracy: ERR and NDCG metrics. Yahoo LTRC Set 2 (34,815 training instances, 700 features): same accuracy after slightly more iterations.
Accuracy: ERR and NDCG metrics. Yahoo LTRC Set 1 (473,134 training instances, 700 features): same accuracy after slightly increased depth. [Plot: ERR vs. iterations for the exact and parallel methods.]
Accuracy: effects of the pgbrt approximation. It requires more iterations or slightly increased depth (permitted by the speedup). Same accuracy on Yahoo LTRC; Microsoft LETOR within 1%-2%.
Conclusions. A parallel, approximate GBRT implementation that achieves both speedup and accuracy, processing in hours what took the WashU LTRC team days.
Acknowledgements. Yahoo! Labs: Ananth Mohan, Zheng Chen. Weinberger lab: Minmin Chen, Eddie Xu, Dor Kedem, Yuzong Liu. Agrawal lab: David Ferry, Jordan Krage.
Machine Learning Duncan Anderson Managing Director, Willis Towers Watson 21 March 2018 GIRO 2016, Dublin - Response to machine learning Don t panic! We re doomed! 2 This is not all new Actuaries adopt
More informationEntity and Knowledge Base-oriented Information Retrieval
Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061
More informationINTRODUCTION TO MACHINE LEARNING. Measuring model performance or error
INTRODUCTION TO MACHINE LEARNING Measuring model performance or error Is our model any good? Context of task Accuracy Computation time Interpretability 3 types of tasks Classification Regression Clustering
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationLecture 06 Decision Trees I
Lecture 06 Decision Trees I 08 February 2016 Taylor B. Arnold Yale Statistics STAT 365/665 1/33 Problem Set #2 Posted Due February 19th Piazza site https://piazza.com/ 2/33 Last time we starting fitting
More informationDynamic Resource Allocation for Distributed Dataflows. Lauritz Thamsen Technische Universität Berlin
Dynamic Resource Allocation for Distributed Dataflows Lauritz Thamsen Technische Universität Berlin 04.05.2018 Distributed Dataflows E.g. MapReduce, SCOPE, Spark, and Flink Used for scalable processing
More information3 Ways to Improve Your Regression
3 Ways to Improve Your Regression Introduction This tutorial will take you through the steps demonstrated in the 3 Ways to Improve Your Regression webinar. First, you will be introduced to a dataset about
More informationBoosting Simple Model Selection Cross Validation Regularization
Boosting: (Linked from class website) Schapire 01 Boosting Simple Model Selection Cross Validation Regularization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 8 th,
More informationAutomatic Initialization of the TLD Object Tracker: Milestone Update
Automatic Initialization of the TLD Object Tracker: Milestone Update Louis Buck May 08, 2012 1 Background TLD is a long-term, real-time tracker designed to be robust to partial and complete occlusions
More informationModel Inference and Averaging. Baging, Stacking, Random Forest, Boosting
Model Inference and Averaging Baging, Stacking, Random Forest, Boosting Bagging Bootstrap Aggregating Bootstrap Repeatedly select n data samples with replacement Each dataset b=1:b is slightly different
More informationFeature-Cost Sensitive Learning with Submodular Trees of Classifiers
Feature-Cost Sensitive Learning with Submodular Trees of Classifiers Matt J. Kusner, Wenlin Chen, Quan Zhou, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, Yixin Chen Washington University in St. Louis, 1
More informationLearning Temporal-Dependent Ranking Models
Learning Temporal-Dependent Ranking Models Miguel Costa, Francisco Couto, Mário Silva LaSIGE @ Faculty of Sciences, University of Lisbon IST/INESC-ID, University of Lisbon 37th Annual ACM SIGIR Conference,
More informationPerceptrons and Backpropagation. Fabio Zachert Cognitive Modelling WiSe 2014/15
Perceptrons and Backpropagation Fabio Zachert Cognitive Modelling WiSe 2014/15 Content History Mathematical View of Perceptrons Network Structures Gradient Descent Backpropagation (Single-Layer-, Multilayer-Networks)
More informationTagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation
TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, Cordelia Schmid LEAR team, INRIA Rhône-Alpes, Grenoble, France
More informationPractical Guidance for Machine Learning Applications
Practical Guidance for Machine Learning Applications Brett Wujek About the authors Material from SGF Paper SAS2360-2016 Brett Wujek Senior Data Scientist, Advanced Analytics R&D ~20 years developing engineering
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationRanking with Query-Dependent Loss for Web Search
Ranking with Query-Dependent Loss for Web Search Jiang Bian 1, Tie-Yan Liu 2, Tao Qin 2, Hongyuan Zha 1 Georgia Institute of Technology 1 Microsoft Research Asia 2 Outline Motivation Incorporating Query
More informationCS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp
CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as
More informationXGBoost: A Scalable Tree Boosting System
XGBoost: A Scalable Tree Boosting System Tianqi Chen University of Washington tqchen@cs.washington.edu Carlos Guestrin University of Washington guestrin@cs.washington.edu ABSTRACT Tree boosting is a highly
More informationPersonalized Web Search
Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents
More informationAssignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis
Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running
More informationInterpretable Machine Learning with Applications to Banking
Interpretable Machine Learning with Applications to Banking Linwei Hu Advanced Technologies for Modeling, Corporate Model Risk Wells Fargo October 26, 2018 2018 Wells Fargo Bank, N.A. All rights reserved.
More informationLearning Dense Models of Query Similarity from User Click Logs
Learning Dense Models of Query Similarity from User Click Logs Fabio De Bona, Stefan Riezler*, Keith Hall, Massi Ciaramita, Amac Herdagdelen, Maria Holmqvist Google Research, Zürich *Dept. of Computational
More informationSemi-supervised learning and active learning
Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners
More informationMulti-Task Learning for Boosting with Application to Web Search Ranking
Multi-Task Learning for Boosting with Application to Web Search Ranking Olivier Chapelle Yahoo! Labs Sunnyvale, CA chap@yahoo-inc.com Kilian Weinberger Washington University Saint Louis, MO kilian@wustl.edu
More informationAssignment No: 2. Assessment as per Schedule. Specifications Readability Assignments
Specifications Readability Assignments Assessment as per Schedule Oral Total 6 4 4 2 4 20 Date of Performance:... Expected Date of Completion:... Actual Date of Completion:... ----------------------------------------------------------------------------------------------------------------
More informationNonparametric Methods Recap
Nonparametric Methods Recap Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010 Nonparametric Methods Kernel Density estimate (also Histogram) Weighted frequency Classification - K-NN Classifier Majority
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 10 - Classification trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey
More informationChapter 7: Numerical Prediction
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.
More informationPARALLELIZED IMPLEMENTATION OF LOGISTIC REGRESSION USING MPI
PARALLELIZED IMPLEMENTATION OF LOGISTIC REGRESSION USING MPI CSE 633 PARALLEL ALGORITHMS BY PAVAN G JOSHI What is machine learning? Machine learning is a type of artificial intelligence (AI) that provides
More informationMore on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization
More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial
More informationWeb Spam Challenge 2008
Web Spam Challenge 2008 Data Analysis School, Moscow, Russia K. Bauman, A. Brodskiy, S. Kacher, E. Kalimulina, R. Kovalev, M. Lebedev, D. Orlov, P. Sushin, P. Zryumov, D. Leshchiner, I. Muchnik The Data
More informationUniversity of Delaware at Diversity Task of Web Track 2010
University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments
More informationApplication of Additive Groves Ensemble with Multiple Counts Feature Evaluation to KDD Cup 09 Small Data Set
Application of Additive Groves Application of Additive Groves Ensemble with Multiple Counts Feature Evaluation to KDD Cup 09 Small Data Set Daria Sorokina Carnegie Mellon University Pittsburgh PA 15213
More informationTutorial on Machine Learning Tools
Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow
More informationRandom Walk Inference and Learning. Carnegie Mellon University 7/28/2011 EMNLP 2011, Edinburgh, Scotland, UK
Random Walk Inference and Learning in A Large Scale Knowledge Base Ni Lao, Tom Mitchell, William W. Cohen Carnegie Mellon University 2011.7.28 1 Outline Motivation Inference in Knowledge Bases The NELL
More informationParallel learning of content recommendations using map- reduce
Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking
More informationStatistical foundations of machine learning
Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be Some algorithms for nonlinear modeling Feedforward neural network
More informationObject recognition. Methods for classification and image representation
Object recognition Methods for classification and image representation Credits Slides by Pete Barnum Slides by FeiFei Li Paul Viola, Michael Jones, Robust Realtime Object Detection, IJCV 04 Navneet Dalal
More informationMachine Learning / Jan 27, 2010
Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,
More information