S8873 GBM INFERENCING ON GPU. Shankara Rao Thejaswi Nanditale, Vinay Deshpande



1 S8873 GBM INFERENCING ON GPU Shankara Rao Thejaswi Nanditale, Vinay Deshpande

2 AGENDA: Introduction, Objective, Experimental Results, Implementation Details, Conclusion

3 INTRODUCTION

4 BOOSTING: What is it? Training an ensemble of weak learners. Together, this ensemble becomes a strong predictor or model. An old but popular machine learning (ML) technique. It helps reduce bias and variance in the models.

5 BOOSTING: [Diagram: Model 1, Model 2, Model 3, ..., Model N produce Output 1, Output 2, Output 3, ..., Output N, which are combined into the Final Output.]

6 GRADIENT BOOSTING: Iteratively perform gradient descent in function space. Build a weak learner at each step of this process. Successive learners put more weight on the mistakes of their predecessors. [Diagram: Model 1 produces Output 1, Model 2 produces Output 2, ..., Model N produces Output N.]
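The iterative procedure can be sketched in a few lines. This is a minimal illustration only, assuming squared loss (where the negative gradient is simply the residual), 1-D inputs, and depth-1 trees (stumps) as weak learners. The names `fit_stump`, `boost`, and `predict`, the learning rate, and the toy data are all made up for this sketch and do not come from the talk.

```python
# Minimal gradient boosting sketch: squared loss, 1-D inputs, stumps as weak learners.
# All names and values here are illustrative assumptions.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return stumps

def predict(stumps, x, learning_rate=0.5):
    # The ensemble output is the (scaled) sum of all weak learner outputs.
    return sum(learning_rate * stump_predict(s, x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
stumps = boost(xs, ys)
mse = sum((predict(stumps, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Each round fits the next stump to what the current ensemble still gets wrong, which is the sense in which later learners focus on earlier mistakes.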

7 GRADIENT BOOSTED TREES: A form of gradient boosting where the weak learners are decision trees. This is what we consider in our experiments. [Diagram: decision tree whose split nodes are labeled f1, f2, f2, f3.]

8 GRADIENT BOOSTED INFERENCE: Also known as testing or prediction. The process of applying a learned model to real-world data. Synonymous with deploying learned models in production. Batched inference is inference across multiple observations at once.
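As a rough illustration of batched inference over a tree ensemble, the sketch below sums the leaf values reached by each row across all trees. It assumes dense, fully grown trees stored in BFS order (children of node i at 2*i+1 and 2*i+2). The tuple layout, function names, and all numbers are hypothetical and are not GIL's actual data structures.

```python
# Batched inference over an ensemble of dense trees (illustrative layout only).
# Each tree is (features, thresholds, leaf_values) for a complete binary tree
# of the given depth, with internal nodes stored in BFS order.

def predict_row(trees, depth, row):
    total = 0.0
    for features, thresholds, leaf_values in trees:
        node = 0
        for _ in range(depth):
            go_left = row[features[node]] <= thresholds[node]
            node = 2 * node + 1 if go_left else 2 * node + 2
        # Leaves are numbered after the 2**depth - 1 internal nodes.
        total += leaf_values[node - (2 ** depth - 1)]
    return total

def predict_batch(trees, depth, rows):
    # Every row is independent, which is what makes batching easy to parallelise.
    return [predict_row(trees, depth, row) for row in rows]

# Two depth-2 trees over 2 features; the values are made up for illustration.
trees = [
    ([0, 1, 1], [0.5, 0.3, 0.7], [1.0, 2.0, 3.0, 4.0]),
    ([1, 0, 0], [0.5, 0.2, 0.8], [0.1, 0.2, 0.3, 0.4]),
]
batch = [[0.4, 0.2], [0.9, 0.9]]
scores = predict_batch(trees, 2, batch)
```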

9 GRADIENT BOOSTING Libraries*: Scikit-learn GradientBoostingClassifier; XGBoost; LightGBM; CatBoost (see S8393, CatBoost: Fast Open-Source Gradient Boosting Library for GPU); TensorFlow Boosted Trees (TFBT). * In no particular order.

10 GRADIENT BOOSTED INFERENCE Libraries: All gradient boosting libraries support inferencing. Treelite is an inference-only library that is multi-threaded on CPUs; it converts the decision trees into C code and executes it.
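To make the idea of turning a decision tree into C code concrete, here is a toy generator that emits nested if/else statements from a tree. It only illustrates the general technique. The dictionary format, the `emit_c` helper, and the emitted function name `predict_tree` are assumptions for this sketch and are not Treelite's API or its actual output.

```python
# Toy tree-to-C code generator (illustrative; not Treelite's real output format).

def emit_c(node, indent=1):
    pad = "    " * indent
    if "leaf" in node:
        return f"{pad}return {node['leaf']};\n"
    return (
        f"{pad}if (row[{node['feature']}] <= {node['threshold']}) {{\n"
        + emit_c(node["left"], indent + 1)
        + f"{pad}}} else {{\n"
        + emit_c(node["right"], indent + 1)
        + f"{pad}}}\n"
    )

# Hypothetical single-split tree.
tree = {
    "feature": 0,
    "threshold": 0.5,
    "left": {"leaf": 1.0},
    "right": {"leaf": -1.0},
}
source = "float predict_tree(const float *row) {\n" + emit_c(tree) + "}\n"
```

Generated code of this shape is one branch per split, which suits CPUs but implies data-dependent branching on every node.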

11 OBJECTIVE

12 OBJECTIVE: Build a GPU-accelerated, inference-only library for gradient boosting. Use decision trees as our weak learners. Assume batched inferencing scenarios. Limit ourselves to a single GPU*. * Batched inference is trivially parallelizable across multiple GPUs (and also across multiple nodes).
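The footnote's point about multiple GPUs follows from rows being independent: a batch can simply be cut into contiguous chunks, one per device. A minimal sketch of that split, where `split_batch` and the device count are hypothetical and no actual GPU dispatch is shown:

```python
# Split a batch into near-equal contiguous chunks, one per device (illustrative).

def split_batch(rows, num_devices):
    base, extra = divmod(len(rows), num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        # The first `extra` devices take one additional row each.
        size = base + (1 if d < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks

chunks = split_batch(list(range(10)), 4)
```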

13 GPU INFERENCING LIBRARY Introduction: This talk introduces GIL (GPU/GBM Inferencing Library), an efficient GPU-accelerated, inference-only library for gradient boosted trees.

14 ASSUMPTIONS For this work: The input dataset is dense. The trees in the ensemble are also dense, that is, fully grown trees with leaves only at the deepest layer*. * If not, it is trivial to construct such a tree from a sparse tree.
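One way the footnote's construction could work is to replace every early leaf with a dummy split whose two children repeat the same leaf value, until all leaves sit at the deepest layer. The sketch below assumes a hypothetical dictionary format for the sparse tree and a target depth at least as large as the sparse tree's depth. The `densify` name and the output arrays are illustrative, not GIL's format.

```python
# Pad a sparse tree into a fully grown tree stored in BFS order (illustrative).

def densify(node, depth):
    """Return (features, thresholds, leaves) for a full tree of the given depth."""
    n_internal = 2 ** depth - 1
    features = [0] * n_internal
    thresholds = [0.0] * n_internal
    leaves = [0.0] * (2 ** depth)

    def fill(n, index, level):
        if level == depth:
            leaves[index - n_internal] = n["leaf"]
            return
        if "leaf" in n:
            # Early leaf: keep a dummy split (feature 0, threshold 0.0) here;
            # both children repeat the leaf, so either branch gives the same value.
            fill(n, 2 * index + 1, level + 1)
            fill(n, 2 * index + 2, level + 1)
        else:
            features[index] = n["feature"]
            thresholds[index] = n["threshold"]
            fill(n["left"], 2 * index + 1, level + 1)
            fill(n["right"], 2 * index + 2, level + 1)

    fill(node, 0, 0)
    return features, thresholds, leaves

# Hypothetical sparse tree: the left child of the root is already a leaf.
sparse = {
    "feature": 1, "threshold": 2.5,
    "left": {"leaf": -1.0},
    "right": {"feature": 0, "threshold": 0.1,
              "left": {"leaf": 0.5}, "right": {"leaf": 1.5}},
}
features, thresholds, leaves = densify(sparse, 2)
```

The cost of this padding is memory: node count grows as 2^depth regardless of how sparse the original tree was.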

15 EXPERIMENTS

16 SETUP: Targets compared are the GPU Inferencing Library (GIL), xgboost.booster.predict on CPU, and Treelite. Datasets are Airline and Higgs. HW/SW: CentOS 7.3, Python 2.7, CUDA 9.1, Driver Version.

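17 EXPERIMENT WORKFLOW: [Diagram: training data is used by XGBoost to produce a trained model; the trained model and the test data are then given to GIL, XGBoost, and Treelite, each of which produces output.]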

18 CORRECTNESS: To test the correctness of our implementation, we match the output value of the ensemble before classification. This means exact matching with the other libraries.
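Comparing the raw ensemble output is a stricter check than comparing class labels, because very different raw values can still map to the same label. The sketch below illustrates this under the assumption of a binary logistic objective, where the label comes from a sigmoid of the raw margin. The helper names and numbers are made up. In XGBoost, raw margins can be requested with the `output_margin` flag of `predict`.

```python
import math

# Compare raw ensemble outputs (margins) rather than class labels (illustrative).

def sigmoid(margin):
    return 1.0 / (1.0 + math.exp(-margin))

def margins_match(ours, reference, tol=0.0):
    # tol=0.0 demands exact equality; a small tolerance could be used instead.
    return len(ours) == len(reference) and all(
        abs(a - b) <= tol for a, b in zip(ours, reference)
    )

ours = [0.75, -1.25, 0.0]
reference = [0.75, -1.25, 0.0]
labels = [int(sigmoid(m) > 0.5) for m in ours]

# Margins of 0.1 and 2.0 both classify as 1, yet they clearly do not match.
same_label = int(sigmoid(0.1) > 0.5) == int(sigmoid(2.0) > 0.5)
same_margin = margins_match([0.1], [2.0])
```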

19 AGGREGATE COMPARISON: [Chart: throughput (rows/sec) of Treelite, XGBoost, and GIL on the Higgs and Airline datasets.] Dataset: Higgs and Airline. Number of boosters: varying. Max depth: varying. GPU: V100. Batch size: 4 million.

20 EFFECT OF MAX DEPTH: [Chart: throughput (rows/sec) versus max depth for GIL, Treelite, and XGBoost.] Dataset: Higgs. Number of boosters: 150. Max depth: varying. GPU: V100. Batch size: 4 million.

21 EFFECT OF NUMBER OF BOOSTERS: [Chart: throughput (rows/sec) versus number of boosters for GIL, Treelite, and XGBoost.] Dataset: Higgs. Number of boosters: varying. Max depth: 13. GPU: V100. Batch size: 4 million.

22 EFFECT OF BATCH SIZE: [Chart: throughput (rows/sec) versus number of rows.] Dataset: Airline. Number of boosters: 200. Max depth: 10. GPU: V100. Batch size: varying.

23 V100 VS P100: [Chart: P100 versus V100 results for XGBoost and GIL.] Dataset: Airline. Number of boosters: varying. Max depth: varying. GPU: P100 and V100. Batch size: varying.

24 IMPLEMENTATION

25 CHALLENGES: Control divergence: potentially different traversals across trees in the ensemble. Uncoalesced memory accesses: different traversals also lead to requests to far-off memory locations. This also means that Treelite-like code generation is not optimal for GPUs. [Diagram: decision tree whose split nodes are labeled f1, f2, f2, f3.]

26 REDUCING MEMORY UNCOALESCING Shared Memory: Store the input row for a given threadblock in shared memory. This helps eliminate uncoalesced accesses to input data and reduces global memory traffic.

27 ELIMINATING CONTROL DIVERGENCE
With control divergence: if (value <= threshold) { next = 2 * current + 1; } else { next = 2 * current + 2; }
Without control divergence: next = 2 * current + 2 - (value <= threshold);
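The branching update can be replaced by plain arithmetic on the comparison result, so that all threads execute the same instruction stream. One arithmetic form that agrees with the if/else version is `2 * current + 2 - (value <= threshold)`. Treat the exact expression as an assumption; other equivalent forms exist. The short check below confirms that the two versions pick the same child on a handful of made-up inputs.

```python
# Branching versus branch-free child selection in a BFS-ordered tree (illustrative).

def next_branchy(current, value, threshold):
    if value <= threshold:
        return 2 * current + 1   # left child
    return 2 * current + 2       # right child

def next_branchless(current, value, threshold):
    # A Python bool is an int (True == 1), mirroring the 0/1 result of a C comparison.
    return 2 * current + 2 - (value <= threshold)

agree = all(
    next_branchy(c, v, t) == next_branchless(c, v, t)
    for c in range(4)
    for v in (0.0, 0.5, 1.0)
    for t in (0.25, 0.5)
)
```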

28 NAÏVE Tree Layout: Store the nodes in the usual BFS order.

29 NAÏVE Tree Layout Contd.: Store the trees themselves adjacent to each other. [Diagram: trees T0, T1, T2, T3 placed one after another in memory.]
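In index terms, the naive layout puts node `n` of tree `t` at offset `t * nodes_per_tree + n`. The sketch below is a hypothetical illustration of that formula; the function name and the chosen depth are assumptions. It shows that the roots of consecutive trees sit a whole tree apart in memory.

```python
# Naive layout: each tree's BFS-ordered nodes are stored back to back (illustrative).

def naive_index(tree, node, nodes_per_tree):
    """Flat offset of `node` (BFS position) within `tree`."""
    return tree * nodes_per_tree + node

depth = 3
nodes_per_tree = 2 ** (depth + 1) - 1  # internal nodes plus leaves of a full tree
# Offsets touched when 4 adjacent threads each read the root of their own tree.
root_offsets = [naive_index(t, 0, nodes_per_tree) for t in range(4)]
stride = root_offsets[1] - root_offsets[0]
```

With one thread per tree, neighbouring threads therefore read addresses `nodes_per_tree` elements apart, which is the source of the uncoalesced node accesses.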

30 NAÏVE Work distribution: Each threadblock works on one row of the input dataset. Each thread works on one tree in the ensemble. The final stage is a reduction across the threads in the threadblock.

31 NAÏVE Work distribution: [Diagram: threads T0, T1, ..., Tn in a threadblock feed a reduction that produces the output.]
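The final reduction step can be pictured as a pairwise tree reduction over the per-thread results, a common pattern for threadblock reductions. The sequential simulation below is only a sketch of that pattern. The `block_reduce` name, the padding to a power of two, and the sample values are assumptions, not details from the talk.

```python
# Sequential simulation of a threadblock tree reduction (illustrative).

def block_reduce(values):
    """Sum per-thread partial results by halving the active range each step."""
    shared = list(values)          # stands in for a shared-memory buffer
    n = 1
    while n < len(shared):
        n *= 2
    shared += [0.0] * (n - len(shared))  # pad to a power of two
    stride = n // 2
    while stride > 0:
        for tid in range(stride):  # on a GPU these iterations run in parallel
            shared[tid] += shared[tid + stride]
        stride //= 2               # a barrier would separate steps on a GPU
    return shared[0]

# One leaf value per thread, that is, one per tree, for a single input row.
per_tree_outputs = [1.0, 2.0, 3.0, 4.0, 5.0]
row_score = block_reduce(per_tree_outputs)
```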

32 NAÏVE Issues: Each thread works on nodes from a different tree, so node accesses suffer from memory uncoalescing. Uncoalesced accesses also result from reading input data during tree traversal.

33 CUSTOM TREE LAYOUT: A GPU cache-friendly layout for nodes. Store a given node of all trees adjacent to each other. Keep these collections of nodes themselves in the usual BFS order. This helps reduce uncoalesced accesses.
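In index terms, this layout swaps the roles of tree and node: node `n` of tree `t` moves from offset `t * nodes_per_tree + n` to `n * num_trees + t`. The sketch below converts a small ensemble from the tree-by-tree layout to the interleaved one. Function names and the string labels are hypothetical, used only to make the resulting order visible.

```python
# Interleaved layout: the same BFS node of every tree is stored contiguously
# (illustrative; not GIL's actual data structure).

def naive_index(tree, node, nodes_per_tree):
    return tree * nodes_per_tree + node

def custom_index(tree, node, num_trees):
    return node * num_trees + tree

def to_custom_layout(naive, num_trees, nodes_per_tree):
    custom = [None] * len(naive)
    for tree in range(num_trees):
        for node in range(nodes_per_tree):
            src = naive_index(tree, node, nodes_per_tree)
            custom[custom_index(tree, node, num_trees)] = naive[src]
    return custom

num_trees, nodes_per_tree = 3, 3
naive = [f"T{t}N{n}" for t in range(num_trees) for n in range(nodes_per_tree)]
custom = to_custom_layout(naive, num_trees, nodes_per_tree)
```

Threads that handle different trees but sit at the same BFS position now read neighbouring elements. Once their traversal paths diverge the accesses spread out again, so the layout reduces uncoalesced accesses rather than removing them.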

34 CUSTOM TREE LAYOUT Illustration: [Diagram: corresponding nodes of trees T0, T1, and so on stored next to each other.]

35 EFFECT OF CUSTOM TREE LAYOUT: [Chart: time in ms versus max depth for the custom tree layout and the naive layout.] Dataset: Synthetic. Number of boosters: 128. Max depth: varying. GPU: P100. Batch size: 1 million.

36 EFFECT OF CUSTOM TREE LAYOUT: [Chart: time in ms versus number of boosters for the custom tree layout and the naive layout.] Dataset: Synthetic. Number of boosters: varying. Max depth: 6. GPU: P100. Batch size: 1 million.

37 CONCLUSION

38 TAKEAWAYS: There is a need for efficient and optimized batched inferencing. There is large scope for improvement in the existing solutions. A custom GPU cache-friendly tree layout reduces memory uncoalescing. GIL is a high-throughput, batched, inference-only library for GPUs.

39 FUTURE WORK: Support for sparsity in binary trees. Support for sparsity in input data. Challenges: need to tackle warp divergence and uncoalesced memory accesses. Optimizing throughput for larger tree depths (>= 16) as well as larger numbers of trees.
