Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning

Size: px

Start display at page:

Download "Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning"

Lisa Bishop
5 years ago
Views:

1 Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning Tianyi David Han and Tarek S. Abdelrahman The Edward S. Rogers Department of Electrical and Computer Engineering University of Toronto

2 Machine-Learning-based Auto-Tuning Increased interest in Machine Learning (ML) based autotuning, particularly for GPUs Program Features Training Programs Train Optimization ML Model New (Test) Program Program Features Predict Optimization Optimization Benefit Should the optimization be applied? 2

3 This Work Develop the AVD metric for the goodness training programs Consider the local memory optimization problem for GPUs Models trained with real programs are poor predictors Develop systematic ways for generating synthetic benchmarks Show significant improvement in models ability to predict when trained with synthetic benchmarks, by a factor of 1.5X 3

4 Outline Motivation Goodness of Training Set Local Memory Optimization Real and Synthetic Benchmarks Related work Conclusions and Future Work 4

Preliminaries Other Optimizations Program 1 (f 1 =a,f 2 =b) (f 1 =e,f 2 =f) Extract features: (f 1, f n ) n-dimensional Feature space Optimization Execute optimized and un-optimized versions 2-D

5 Preliminaries Other Optimizations Program 1 (f 1 =a,f 2 =b) (f 1 =e,f 2 =f) Extract features: (f 1, f n ) n-dimensional Feature space Optimization Execute optimized and un-optimized versions 2-D Feature Space training vector (a,b) f 2 Other Optimizations Program T +/- +/- +/- +/- (f 1 =c,f 2 =d) (f 1 =g,f 2 =h) Each feature vector and its label is a training sample/vector Collection of training samples/vectors is the training set 5 f 1

6 Goodness of Training Set Feature Space Training samples/vectors Test samples/vectors Feature 2 Feature 1 How well do training samples cover test samples? 6

7 Feature 2 Coverage Vicinity Density (VD) Feature Space Feature 1 Average Vicinity Density: AVD(T) = t Α Training samples/vectors Test samples/vectors Train a regression tree size(r): # vectors in R vectors(r): # training samples in R VD(t): vectors(r t )/size(r t ) 7 t VD(t) / T t T T: Test set

8 Outline Motivation Goodness of Training Set Local Memory Optimization Real and Synthetic Benchmarks Related work Conclusions and Future Work 8

9 Local Memory Optimization User-managed caching of data from the GPU device memory into fast local (shared) memory Exploit data re-use and avoid the penalty of memory non-coalescing Smaller number of global memory transactions Overhead of data movement and synchronization Potential for reduced parallelism Factors: degree of non-coalescing, degree of reuse, parallelism, contextual accesses, etc. Hard to predict if the optimization is beneficial [Han and Abdelrahman 2015] 9

10 Outline Motivation Goodness of Training Set Local Memory Optimization Real and Synthetic Benchmarks Related work Conclusions and Future Work 10

11 Real Benchmarks Benchmark Application Domain Description transpose Matrix transpose matrixmul_a Matrix multiply (cache A) Dense matrixmul_b Matrix multiply (cache B) Linear Algebra MVT Matrix vector multiply SGEMM convsep_row convsep_col blur SAD SAD_frame LBM STENCIL Structured Grid (Stencil) C = alpha * A * B + beta * C Separable 2D convolution (row filter) Separable 2D convolution (column filter) Blur filter Sum of Absolute Difference Cache the frame image in SAD Lattice Boltzman Machine (struct elements) 3D, 27-point MRI-GRIDDING Unstructured Grid Maps non-uniform 3D input data onto a regular 3-D grid 11

12 ML Features We use 15 features: Inter-warp (single-access) data reuse Stencil pattern data-reuse Memory non-coalescing (single-access) Total amount of (all) data accessed by a workgroup Data utilization rate in cooperative loading Cooperative loading efficiency Parallelism levels before and after the optimization Grid efficiency Computation length # of contextual accesses Memory non-coalescing in contextual accesses 12

13 Evaluation Methodology Execute every benchmark with and without local memory optimization for all possible launch configurations Nvidia Tesla 2090, CUDA 6.0, Intel Xeon E host Calculate the speedup of the optimized version Label each benchmark/launch configuration (beneficial/not beneficial) Leave-one-out evaluation Build a model with 12 benchmarks and predict for the 13 th Each model a Random Forest with 20 trees, 4 features per tree These are called real models because they are trained with real benchmarks data 13

14 Model Accuracy Use the known labels of the test set to determine the accuracy of the model Averaged for each benchmark over all launch configurations Count-based prediction accuracy % of test vectors where the prediction is correct Penalty-weighted prediction accuracy % of performance achieved by the prediction Weigh the misprediction by the degradation in performance it causes 14

15 Real Models Accuracies 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Count-based Accuracy Penalty-weighted Accuracy C: 57% P: 81% 15

16 Model Accuracy vs. AVD For each real benchmark as a test set, evaluate the model accuracy and AVD for all possible training sets = 4095 possible training sets per test set 4095*13 = ~53K models 16

17 Increasing the AVD Generate synthetic feature vectors that increase the AVD for test sets Challenge: no a priori knowledge of the test data Use a synthetic benchmark template to generate code instances that have these feature vectors Repeat the previous experiment, training with synthetic (and real) benchmark samples The models are called synthetic models because they are trained with synthetic benchmarks 17

18 Synthetic Vectors: Bounding Box (bb-k) Feature Space Training sample/vector Synthetic sample/vector Feature 2 Feature 1 Uniformly sample in expanded bounding box of training data of each benchmark 18

19 Synthetic Vectors: Bounding Box (bb-all) Feature Space Training sample/vector Synthetic sample/vector Feature 2 Feature 1 Uniformly sample in expanded bounding box of all training data of benchmarks 19

20 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Synthetic Models Accuracies bb-k 90K Samples Count-based Accuracy Penalty-weighted Accuracy C: 87% P: 94% 20

21 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Synthetic Models Accuracies bb-all 90K Samples Count-based Accuracy Penalty-weighted Accuracy C: 83% P: 92% 21

22 Model Accuracy We can significantly increase the model accuracy by using synthetic benchmarks, derived to increase the AVD 90,000 synthetic samples versus 7834 real ones (11.5X) 1.52X/1.16X count-based/penalty-weighted prediction accuracies 2.4X AVD 22

23 Impact of Training Set Size Accuracy / AVD bb-k models, averaged over all benchmarks real models Count-based Accuracy AVD Training set size 23

24 Outline Motivation Goodness of Training Set Local Memory Optimization Real and Synthetic Benchmarks Related work Conclusions and Future Work 24

25 Related Work Considerable work on building machine learning models for performance auto-tuning [Grewe et al. 2013], [Magni et al. 2014], [Agakov et al. 2006], [Cavazos et al. 2007], etc. The use of a small set (10 s) of real programs We use a large number of synthetic benchmarks for training Use of synthetic benchmarks [Han & Abdelrahman 2015], [Garvey & Abdelrahman 2015], [Cummins et al. 2016] No systematic way for generating synthetic benchmarks This work reasons why there is benefit and systemizes generation 25

26 Related Work Cont d Deep Learning for generating synthetic benchmarks by mining code repositories [Cummins et al. 2017] Focus is on synthetically correct/human readable code Our focus is complementary Data sets for training and testing, e.g. [Borovicka et al. 2012] How to select a good subset of the data for training Oversampling to balance data sets, e.g., SMOTE 2002 Data already exists and no focus on programs 26

27 Conclusions Advocated the use of synthetic benchmarks for training machine learning model A metric for the quality of a training set with respect to a test set The metrics show that models trained with a small number of real benchmarks have poor performance because of poor training data Proposed methods for generating synthetic benchmarks for training a ML model Shown a significant improvement in model performance The use of synthetic benchmarks is effective and useful 27

28 Future Work This work is an initial step towards showing the effectiveness of synthetic benchmarks Determining a good number of synthetic benchmarks Other approaches for determining feature vectors of synthetic samples, e.g., a hybrid of bb-all and bb-k More efficient ways of generating code from desired feature vectors Other GPU optimizations The use of deeper models for prediction, enabled by the abundance of synthetic data 28

29 Questions? 29

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard