A Multi-Tiered Optimization Framework for Heterogeneous Computing

1 A Multi-Tiered Optimization Framework for Heterogeneous Computing
IEEE HPEC 2014
Alan George, Professor of ECE, University of Florida
Herman Lam, Assoc. Professor of ECE, University of Florida
Andrew Milluzzi, Ph.D. Student, University of Florida
Justin Richardson, Ph.D. Candidate, University of Florida

2 Agenda
Motivation
Device Metrics
Approach Overview
Kernel Implementation Tier
Device Performance Tier
System Configuration Tier
Case Study
Conclusions

3 Motivations
Device metric comparisons provide a first-order estimate of performance
  Performance varies with the computational kernel and the size of the data to process
  DSP, GPU, and CPU devices are rarely optimized for a specific application
Benchmarking is an expensive process
  Access to hardware requires computational time or purchase of a given device
  Developing a platform-specific application carries large non-recurring engineering (NRE) costs
Quantifiable data for kernel performance is lacking
  Microbenchmarks do not always correlate with kernel performance
  Kernel performance is not the same across all types of devices

4 Device Metrics
Computational Density (CD): sustained operations, assuming a random stream of adds and multiplies
Computational Density per Watt (CD/W): CD normalized by TDP
External Memory Bandwidth (EMB): device to RAM
Internal Memory Bandwidth (IMB): cache bandwidth
I/O Bandwidth (IOB): bandwidth of EMB plus all I/O ports, e.g. I2C, UART, SPI

CD of Devices Studied
Device                 Int8 (GOPS)   Int16 (GOPS)   Int32 (GOPS)   SPFP (GOPS)   DPFP (GOPS)
Intel Xeon E5-2670     998.40        499.20         249.60         332.80        166.40
Intel Xeon Phi 5110P   1074.
NVIDIA K
NVIDIA K20x
NVIDIA K

GB/s = Gigabytes per Second; GOPS = Giga Operations per Second

5 Computational Density Example
NVIDIA GK110 architecture, per SMX unit:
  192 Single-Precision Floating Point (SPFP) cores
  64 Double-Precision Floating Point (DPFP) cores
  Frequency of 700+ MHz
NVIDIA K40 GPU stats:
  Operating frequency of 745 MHz
  15 SMX units
NVIDIA K40 Int8, Int16, Int32, SPFP CD: 15 x 192 x 0.745 GHz = 2145.6 GOPS
NVIDIA K40 DPFP CD: 15 x 64 x 0.745 GHz = 715.2 GOPS
1 MAC = 1 OPS; 1 MAC = 2 FLOPs
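The CD arithmetic on this slide can be checked with a short sketch. It assumes one operation per core per cycle, which is what the slide's product of units, cores, and frequency implies; the function name `computational_density_gops` is illustrative and does not come from the presentation.

```python
def computational_density_gops(units: int, cores_per_unit: int, freq_ghz: float) -> float:
    """CD in GOPS = units x cores per unit x clock (GHz).

    Assumes each core retires one operation per cycle (1 MAC = 1 OPS).
    """
    return units * cores_per_unit * freq_ghz


# NVIDIA K40 figures quoted on the slide: 15 SMX units at 745 MHz,
# 192 SPFP cores and 64 DPFP cores per SMX.
k40_spfp_cd = computational_density_gops(15, 192, 0.745)
k40_dpfp_cd = computational_density_gops(15, 64, 0.745)

print(f"K40 Int8/Int16/Int32/SPFP CD: {k40_spfp_cd:.1f} GOPS")
print(f"K40 DPFP CD: {k40_dpfp_cd:.1f} GOPS")
```

Because the slide counts one MAC as one operation, the equivalent FLOP figure would be twice each result.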

6 Approach Overview
Framework Inputs
  Application kernels: a subset of the kernels already in the benchmarking database
  Target device list: optional; if not included, the framework assumes all possible devices
Framework Outputs
  Pareto set of best system configurations and application mappings
  Set is scoped to only the kernels of interest to the user
Framework Processing
  Kernel Implementation Tier: compare and contrast kernel implementations for optimal performance
  Device Performance Tier: identify the most efficient kernel for a given architecture
  System Configuration Tier: leverage data from the other two tiers to determine the optimal mapping

7 Approach Concept Diagram
[Figure: concept diagram. User-specified application kernels and an optional list of n target devices enter the framework. The Kernel Implementation Tier takes the implementations of each kernel and produces a Pareto set of best implementations. One Device Performance Tier per device (through Device n) produces a Pareto set of the best kernels on that device. The System Configuration Tier produces the framework output: a Pareto set of best devices and mappings for the specified kernels.]

8 Kernel Implementation Tier
Leverages the database of existing benchmarking results
Tier Function
  Compare implementations of a given computational kernel on a given device
  Identify the optimal implementation at each dataset size
  Easily expanded to new implementations of kernels
Tier Output
  Pareto set of the optimally performing benchmarks
  Currently considers only performance
  Can be extended to include productivity in terms of NRE
[Figure: DPFP 1D FFT Implementations on Intel E5-2670]
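One way to picture the tier function is a per-size selection over stored benchmark results. The sketch below is only an illustration under stated assumptions: the implementation names, the GOPS values, and the dictionary layout are all hypothetical and are not taken from the authors' benchmarking database.

```python
from typing import Dict, Tuple

# Hypothetical benchmark records for one kernel on one device:
# {implementation name: {dataset size: measured GOPS}}
BENCHMARKS: Dict[str, Dict[int, float]] = {
    "impl_a": {256: 4.0, 1024: 9.0, 4096: 12.0},
    "impl_b": {256: 6.0, 1024: 8.0, 4096: 15.0},
    "impl_c": {1024: 10.0},  # only benchmarked at one size
}


def best_implementation_per_size(
    benchmarks: Dict[str, Dict[int, float]],
) -> Dict[int, Tuple[str, float]]:
    """For each dataset size, keep the implementation with the highest GOPS."""
    best: Dict[int, Tuple[str, float]] = {}
    for impl, results in benchmarks.items():
        for size, gops in results.items():
            if size not in best or gops > best[size][1]:
                best[size] = (impl, gops)
    return dict(sorted(best.items()))


best_per_size = best_implementation_per_size(BENCHMARKS)
for size, (impl, gops) in best_per_size.items():
    print(f"size {size}: {impl} at {gops} GOPS")
```

Adding a new implementation only means adding another entry to the records, which matches the slide's point that the tier is easily expanded.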

9 Device Performance Tier
Leverages the Pareto set for each kernel on each device
Tier Function
  Combine the Kernel Implementation Tier Pareto sets for a given device
  Identify the most efficient computational kernel on the given device
  Evaluate performance at each dataset size and select the highest
  Tier can extrapolate performance based on Device Metrics and Realizable Utilization (RU); this expands the range of the framework and is discussed on slide 11
[Figure: Optimal Kernel Implementation for NVIDIA K20X]

10 Device Performance Tier
Tier Outputs
  Pareto set of the best-performing computational kernels at various dataset sizes for a given device
  The best kernel can vary with dataset size, e.g. both 1D and 2D FFTs tend to outperform matrix multiplication at small dataset sizes
  Tier outputs can later be expanded to include additional factors such as data transfer
Tier's focus on device architecture enables analysis of various devices in the same family
[Figure: Pareto Front of Kernel Performance for NVIDIA K20X. Note: no DGEMM data point at this specific size]

11 Device Performance Extrapolation
Benchmarking every computational device is impractical
CD and RU
  Device metrics enable application-independent architecture comparison
  Realizable Utilization (RU) relates real-world benchmarking results to device metrics
  RU is typically expressed as a percentage of CD
  Apply the RU measured on one architecture to the CD of another device in the same family
[Figure: Extension of K20X Pareto Front for NVIDIA Kepler Family]
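The extrapolation step can be sketched as two small functions. This is a minimal illustration assuming RU is simply measured throughput divided by CD; the benchmarked device's CD and measured GOPS below are hypothetical placeholders, while the target CD reuses the K40 SPFP calculation from slide 5.

```python
def realizable_utilization(measured_gops: float, cd_gops: float) -> float:
    """RU as a fraction of CD (multiply by 100 for a percentage)."""
    if cd_gops <= 0:
        raise ValueError("computational density must be positive")
    return measured_gops / cd_gops


def project_performance(ru: float, target_cd_gops: float) -> float:
    """Apply an RU measured on one device to the CD of a same-family device."""
    return ru * target_cd_gops


# Hypothetical benchmarked device: 2000 GOPS CD, 500 GOPS measured on some kernel.
ru = realizable_utilization(measured_gops=500.0, cd_gops=2000.0)

# Target device in the same family: K40 SPFP CD from slide 5 (15 x 192 x 0.745 GHz).
k40_cd = 15 * 192 * 0.745
projected_gops = project_performance(ru, k40_cd)

print(f"RU = {ru:.0%}, projected K40 performance = {projected_gops:.1f} GOPS")
```

The projection is only as good as the assumption that devices in one family achieve similar utilization on the same kernel and dataset size.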

12 System Configuration Tier
Tier chooses devices and kernel mappings based on the Pareto set from the Device Performance Tier
Tier Inputs
  Pareto fronts from both the Kernel Implementation and Device Performance Tiers
  If a kernel is not present in the Device Performance Tier Pareto set, compare Kernel Implementation Tier results between devices
  The concept diagram presents the Device Performance and Kernel Implementation Tiers as children of the System Configuration Tier
Tier Outputs
  Optimal mapping of application kernels onto hardware devices
  Tier produces the outputs of the framework
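The selection rule described on this slide, including the fallback to Kernel Implementation Tier data, might look roughly like the sketch below. The data layout, device names, and performance numbers are hypothetical, and picking the single highest-GOPS device per kernel is a simplifying assumption rather than the authors' actual procedure.

```python
from typing import Dict, Tuple


def map_kernels(
    requirements: Dict[str, int],
    device_fronts: Dict[str, Dict[int, Tuple[str, float]]],
    impl_results: Dict[Tuple[str, str], Dict[int, float]],
) -> Dict[str, str]:
    """Map each required kernel to a device.

    requirements:  {kernel: dataset size}
    device_fronts: {device: {size: (best kernel, GOPS)}} from the Device Performance Tier
    impl_results:  {(device, kernel): {size: GOPS}} from the Kernel Implementation Tier
    """
    mapping: Dict[str, str] = {}
    for kernel, size in requirements.items():
        # Devices whose front selects this kernel at this dataset size.
        candidates = {
            dev: front[size][1]
            for dev, front in device_fronts.items()
            if size in front and front[size][0] == kernel
        }
        if not candidates:
            # Fallback: compare per-kernel implementation results between devices.
            candidates = {
                dev: results[size]
                for (dev, k), results in impl_results.items()
                if k == kernel and size in results
            }
        if candidates:
            mapping[kernel] = max(candidates, key=candidates.get)
    return mapping


# Hypothetical inputs for illustration only.
requirements = {"fft2d": 4096, "matmul": 1024, "svd": 4096}
device_fronts = {
    "gpu_x": {1024: ("fft2d", 300.0), 4096: ("fft2d", 400.0)},
    "cpu_y": {1024: ("matmul", 250.0), 4096: ("fft2d", 150.0)},
}
impl_results = {
    ("gpu_x", "svd"): {4096: 20.0},
    ("cpu_y", "svd"): {4096: 35.0},
}

kernel_mapping = map_kernels(requirements, device_fronts, impl_results)
print(kernel_mapping)
```

In this toy input, SVD never appears on either device's front, so it is placed using the fallback comparison.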

13 Direct Kernel Comparisons
Dataset size plays a large part in application performance
  Memory access and sustained computation are large factors in observed performance
  When comparing devices, there is often a crossover point in observed performance
Example: Matrix Multiplication
  The K20X, Phi, and Xeon CPU each have their own dataset size of optimal performance
[Figure: Comparison of Matrix Multiplication Performance]
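A crossover point between two devices can be located by scanning their measured performance across dataset sizes. The sketch below uses made-up GOPS values purely to show the idea; it does not reproduce the matrix multiplication data from the figure.

```python
from typing import List, Sequence


def find_crossovers(
    sizes: Sequence[int], perf_a: Sequence[float], perf_b: Sequence[float]
) -> List[int]:
    """Return the dataset sizes at which the faster device changes."""
    crossovers: List[int] = []
    prev_leader = None
    for size, a, b in zip(sizes, perf_a, perf_b):
        if a == b:
            continue  # a tie does not establish a new leader
        leader = "a" if a > b else "b"
        if prev_leader is not None and leader != prev_leader:
            crossovers.append(size)
        prev_leader = leader
    return crossovers


# Hypothetical measurements (GOPS) for a CPU and a GPU on the same kernel.
sizes = [256, 512, 1024, 2048, 4096]
cpu_gops = [10.0, 20.0, 30.0, 35.0, 36.0]
gpu_gops = [2.0, 8.0, 25.0, 60.0, 120.0]

crossover_sizes = find_crossovers(sizes, cpu_gops, gpu_gops)
print(crossover_sizes)
```

With these placeholder numbers the CPU leads at small sizes and the GPU takes over once the dataset is large enough to keep it busy, which is the pattern the slide describes.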

14 Direct Kernel Comparisons
Some kernels are never optimal on any device
  More complex kernels will never outperform some other kernels
  Limits due to memory access (cache) or computational complexity slow some kernels down
Framework must still find an optimal mapping for all input kernels
  Leverage Kernel Implementation Tier data to fill in the gaps
  Similar approach to the Device Performance Tier, applied to a subset of kernels to map
[Figure: Comparison of Singular Value Decomposition Performance]

15 Case Study Requirements
Kernel                  Dataset Size
2D FFT                  4096
Matrix Multiplication   1024
SVD                     4096

Sample application consisting of common computational kernels
  Explore common accelerated libraries such as CUBLAS, Intel MKL, LAPACK, and ATLAS
  Leverage common dataset sizes
  Assume pipelining (concurrent kernel execution)
Test a range of devices using both projected performance and actual benchmarking
Devices: NVIDIA K20, NVIDIA K20X, NVIDIA K40, Intel Xeon E, Intel Xeon Phi 5110P

16 Case Study Results
System 1
Device          Quantity   Kernel            Dataset Size
Intel Xeon E               Matrix Multiply   1024
NVIDIA K40      2          SVD
                           1D FFT

System 2
Device          Quantity   Kernel            Dataset Size
Intel Xeon E               Matrix Multiply   1024
NVIDIA K20X     2          SVD
                           1D FFT

NVIDIA K20X and K40 GPUs show similar performance at the given dataset sizes
The NVIDIA K40 carries a significant cost premium over the NVIDIA K20X relative to its performance gain
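The cost-versus-performance trade-off noted here is the kind of comparison a Pareto filter makes explicit. The sketch below is an assumption-laden illustration: the presentation states that the framework currently considers only performance, so adding cost as a second objective is hypothetical, as are the system names and all numbers.

```python
from typing import List, Tuple

# (name, relative cost, relative performance); every value is a placeholder.
Config = Tuple[str, float, float]


def pareto_front(configs: List[Config]) -> List[str]:
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one costs no more and performs
    no worse, and is strictly better in at least one of the two.
    """
    front: List[str] = []
    for name, cost, perf in configs:
        dominated = any(
            c <= cost and p >= perf and (c < cost or p > perf)
            for _, c, p in configs
        )
        if not dominated:
            front.append(name)
    return front


configs: List[Config] = [
    ("system_1", 10.0, 100.0),  # slightly faster, more expensive
    ("system_2", 7.0, 98.0),    # nearly as fast, cheaper
    ("system_3", 9.0, 90.0),    # slower and dearer than system_2
]

pareto_names = pareto_front(configs)
print(pareto_names)
```

Both of the first two placeholder systems survive the filter, which mirrors the slide's situation: neither option is strictly better, so the choice depends on how much the extra performance is worth.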

17 Future Work
Expand the framework to consider additional factors
  Data transfer between devices in a node, or between nodes, can be a significant performance hit
  Augment the framework to consider data locality
Expand devices and benchmarking
  Include additional device families: Intel Xeon Phi family, NVIDIA Maxwell family
  Grow the benchmarking suite: sorting, image processing, additional BLAS functions

18 Conclusions
Determining optimal system mappings with only application-independent metrics is difficult
  Benchmarking is both expensive and time-consuming
  Limited benchmarking combined with realizable utilization can enable projection of device performance
A structured optimization framework provides transparency at critical decision points in the mapping process
  Observed kernel performance varies significantly with dataset size
  Hardware accelerators do not always provide the best kernel performance in every situation

19 Questions
Andrew Milluzzi
