HETEROGENEOUS CPU+GPU COMPUTING


HETEROGENEOUS CPU+GPU COMPUTING
Ana Lucia Varbanescu, University of Amsterdam, a.l.varbanescu@uva.nl
Significant contributions by: Stijn Heldens (U Twente) and Jie Shen (NUDT, China).

Heterogeneous platforms
Systems combining main processors and accelerators, e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC.
Found everywhere, from supercomputers to mobile devices.

Heterogeneous platforms
Host-accelerator hardware model: a host (CPUs) connected to one or more accelerators (GPUs, MICs, FPGAs) via PCIe or shared memory.

Our focus today
A heterogeneous platform = CPU + GPU (most solutions also work for other/multiple accelerators).
An application workload = an application + its input dataset.
Workload partitioning = distributing the workload among the processing units of a heterogeneous system (a CPU with a few cores and a GPU with thousands of cores).

Generic multi-core CPU

Programming models (CPU), in increasing level of abstraction:
- Pthreads + intrinsics: traditional parallel library
- TBB (Thread Building Blocks): threading library
- OpenCL: to be discussed
- OpenMP: high-level, pragma-based
- Cilk: simple divide-and-conquer model

A GPU Architecture

Offloading model: host code runs on the CPU and offloads kernels to the accelerator.

Programming models (GPU), in increasing level of abstraction:
- CUDA: NVIDIA proprietary
- OpenCL: open standard, functionally portable across multi-cores
- OpenACC: high-level, pragma-based
- Different libraries, programming models, and DSLs for different domains

CPU vs. GPU
CPU: low latency, high flexibility; excellent for irregular codes with limited parallelism. Throughput ~500 GFLOPs, bandwidth ~60 GB/s.
GPU: high throughput; excellent for massively parallel workloads. Throughput ~5,500 GFLOPs, bandwidth ~350 GB/s.
Heterogeneous computing = CPU + GPU working together. Do we gain anything?

Case studies
Matrix Multiplication (SGEMM), n = 2048 x 2048 (48 MB), partitioning granularity = 10%.
[Plot: execution time (ms) for partitionings ranging from Only-GPU to Only-CPU; best = (80,20).]
T_G = GPU kernel execution time, T_D = data transfer time, T_C = CPU kernel execution time, T_Max = max(T_G + T_D, T_C).

Case studies
[Plots: execution time (ms) vs. partitioning, from Only-GPU to Only-CPU, granularity = 10%; T_G, T_D, T_C, T_Max as defined above.]
- Matrix Multiplication (SGEMM), n = 2048 x 2048 (48 MB): best = (80,20)
- Separable Convolution (SConv), n = 6144 x 6144 (432 MB): best = (30,70)
- Dot Product (SDOT), n = 14913072 (512 MB): best = Only-CPU
Very few applications are GPU-only.

Possible advantages of heterogeneity
- Increase performance: both devices work in parallel
- Decrease data communication, which is often the bottleneck of the system: different devices for different roles
- Increase flexibility and reliability: choose one/all *PUs for execution; fall-back solution when one *PU fails
- Increase energy efficiency: cheaper per flop
Main challenges: programming and workload partitioning!

Challenge 1: Programming models

Programming models (PMs)
Heterogeneous computing = a mix of different processors on the same platform. Programming options:
- A mix of programming models: one (or several?) for the CPUs, e.g. OpenMP, and one (or several?) for the GPUs, e.g. CUDA
- A single, unified programming model: OpenCL is a popular choice; OmpSs and OpenMP 4.0 also span both, from low level to high level

Heterogeneous Computing PMs
- Generic, high level (OpenACC, OpenMP 4.0, OmpSs, StarPU, HPL): high productivity; not all applications are easy to implement.
- Domain/application specific, high level (HyGraph, Cashmere, GlassWing): focus on productivity and performance.
- Generic, low level (OpenCL, OpenMP+CUDA): the most common at the moment; useful for performance, more difficult to use in practice.
- Domain specific, low level (TOTEM): focus on performance; more difficult to use; quite rare.

Heterogeneous computing PMs
- CUDA + OpenMP/TBB: typical combination for NVIDIA GPUs; individual development per *PU; glue code can be challenging.
- OpenCL (Khronos Group): functional portability => can be used as a unified model; performance portability via code specialization.
- HPL (University of A Coruna, Spain): library on top of OpenCL, to automate code specialization.
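The glue code typically splits the iteration space, launches the GPU partition asynchronously, and processes the remainder on the CPU while the kernel runs. A minimal, hypothetical sketch follows; the scale_gpu/scale_hybrid names, the kernel itself, and the split fraction beta are illustrative, not part of any framework named above.

```cuda
// Hypothetical glue-code sketch for a static CPU+GPU split with CUDA + OpenMP.
#include <omp.h>
#include <cuda_runtime.h>

__global__ void scale_gpu(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void scale_hybrid(float *x, int n, float a, float beta) {
    int n_gpu = (int)(beta * n);          // first n_gpu elements go to the GPU
    float *d_x = NULL;

    if (n_gpu > 0) {                      // GPU partition: copy, launch (asynchronous)
        cudaMalloc(&d_x, n_gpu * sizeof(float));
        cudaMemcpy(d_x, x, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
        scale_gpu<<<(n_gpu + 255) / 256, 256>>>(d_x, n_gpu, a);
    }

    #pragma omp parallel for              // CPU partition runs while the GPU kernel executes
    for (int i = n_gpu; i < n; i++)
        x[i] *= a;

    if (n_gpu > 0) {                      // copy back; cudaMemcpy waits for the kernel
        cudaMemcpy(x, d_x, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    }
}
```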

Heterogeneous computing PMs
- StarPU (INRIA, France): special API for coding; runtime system for scheduling.
- OmpSs (UPC + BSC, Spain): C + OpenCL/CUDA kernels; runtime system for scheduling and communication optimization.

Heterogeneous computing PMs
- Cashmere (VU Amsterdam + NLeSC): dedicated to divide-and-conquer solutions; OpenCL backend.
- GlassWing (VU Amsterdam): dedicated to MapReduce applications.
- TOTEM (U. of British Columbia, Canada): graph processing; CUDA + multi-threading.
- HyGraph (TU Delft, U Twente, UvA, NL): graph processing; based on CUDA + OpenMP.

Challenge 2: Workload partitioning

Workload
A workload is a DAG (directed acyclic graph) of kernels. Five application classes:
I. SK-One (single kernel, executed once)
II. SK-Loop (single kernel, in a loop)
III. MK-Seq (multiple kernels, in sequence)
IV. MK-Loop (multiple kernels, in a loop)
V. MK-DAG (multiple kernels, general DAG)

Determining the partition
Static partitioning (SP) vs. dynamic partitioning (DP).

Static vs. dynamic
Static partitioning:
+ can be computed before runtime => no overhead
+ can detect GPU-only/CPU-only cases
+ no unnecessary CPU-GPU data transfers
- does not work for all applications
Dynamic partitioning:
+ responds to runtime performance variability
+ works for all applications
- incurs (high) runtime scheduling overhead
- might introduce (high) CPU-GPU data-transfer overhead
- might not work for CPU-only/GPU-only cases

Determining the partition
Static partitioning (SP): (near-)optimal, but low applicability. Dynamic partitioning (DP): often sub-optimal, but high applicability.

Heterogeneous Computing today
- Static, single kernel: systems/frameworks (Qilin, Insieme, SKMD, Glinda, ...) and libraries (HPL). Limited applicability; low overhead => high performance.
- Static, multi-kernel (complex DAG): Glinda 2.0. Low overhead => high performance; still limited in applicability.
- Dynamic, single kernel: sporadic attempts and light runtime systems. Not interesting, given that static and run-time based systems exist.
- Dynamic, multi-kernel (complex DAG): run-time based systems (StarPU, OmpSs, HyGraph). High applicability, high overhead.

Static partitioning: Glinda

Glinda*
*Jie Shen et al., "Look Before You Leap: Using the Right Hardware Resources to Accelerate Applications", HPCC'14.
- Modeling the partitioning: the application workload, the hardware capabilities, the GPU-CPU data transfer
- Predicting the optimal partitioning
- Making the decision in practice: Only-GPU, Only-CPU, or CPU+GPU with the optimal partitioning

Modeling the partitioning
Define the optimal (static) partitioning. β = the fraction of data points assigned to the GPU. The optimal partitioning balances the GPU time (plus data transfer) against the CPU time: T_G + T_D = T_C.

Modeling the workload
n: the total problem size; w: workload per work-item.
W = Σ_{i=0}^{n-1} w_i (the area of the rectangle); W (total workload size) quantifies how much work has to be done.
GPU partition: W_G = w × n × β; CPU partition: W_C = w × n × (1 − β).

Modeling the partitioning
T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q, where:
- W: total workload size, with W_G = β × W and W_C = (1 − β) × W
- P: processing throughput (workload/second)
- O: CPU-GPU data-transfer size
- Q: CPU-GPU bandwidth (bytes/second)
Setting T_G + T_D = T_C gives W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
R_GC: CPU kernel time vs. GPU kernel time; R_GD: GPU data-transfer time vs. GPU kernel execution time.

Predicting the optimal partitioning
Solving for β from the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)), using the total workload size, the hardware capability ratios, and the data-transfer size, yields a β predictor. There are three β predictors, depending on the data-transfer type:
- No data transfer: β = R_GC / (1 + R_GC)
- Partial data transfer: β = R_GC / (1 + (v/w) × R_GD + R_GC)
- Full data transfer: β = (R_GC − (v/w) × R_GD) / (1 + R_GC)
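As a minimal sketch, the three predictors can be evaluated directly once the ratios are known. The function name, the enum, and the reading of v/w as the data volume transferred per unit of workload are assumptions for illustration, not Glinda's actual code.

```cuda
// Illustrative evaluation of the three beta predictors above (not Glinda's code).
// R_gc, R_gd are the hardware-capability ratios from the model; v_over_w is taken
// here to mean the data volume transferred per unit of workload (an assumption).
enum transfer_type { NO_TRANSFER, PARTIAL_TRANSFER, FULL_TRANSFER };

double predict_beta(double R_gc, double R_gd, double v_over_w, enum transfer_type t) {
    switch (t) {
    case NO_TRANSFER:      return R_gc / (1.0 + R_gc);
    case PARTIAL_TRANSFER: return R_gc / (1.0 + v_over_w * R_gd + R_gc);
    case FULL_TRANSFER:    return (R_gc - v_over_w * R_gd) / (1.0 + R_gc);
    default:               return 0.0;
    }
}
```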

Glinda outcome
(Any?) data-parallel application can be transformed to support heterogeneous computing. The outcome is a decision on the execution of the application: only on the CPU, only on the GPU, or CPU+GPU, and, in the latter case, the partitioning point.

Results [1]
Quality (compared to an oracle): the right HW configuration in 62 out of 72 test cases, and the optimal partitioning when CPU+GPU is selected.
[Plot: execution time (ms), predicted partitioning vs. optimal partitioning (auto-tuning).]

Results [2]
Effectiveness (compared to Only-CPU/Only-GPU): 1.2x-14.6x speedup. If taking the GPU for granted, up to 96% of the performance can be lost.
[Plot: execution time (ms), our configuration vs. Only-GPU vs. Only-CPU.]

Related work
Glinda achieves similar or better performance than other partitioning approaches, at lower cost.

More advantages
- Supports different types of heterogeneous platforms: multiple GPUs, identical or non-identical
- Supports different profiling options, suitable for different execution scenarios: online or offline, partial or full
- Determines not only the optimal partitioning but also the best hardware configuration: Only-GPU / Only-CPU / CPU+GPU with the optimal partitioning
- Supports both balanced and imbalanced applications

Glinda: limitations?
- Regular applications only? >> NO, we support imbalanced applications, too.
- Single kernel only? >> NO, we support two types of multi-kernel applications, too.
- Single GPU only? >> NO, we support multiple GPUs/accelerators on the same node.
- Single node only? >> YES, for now.

What about energy?
The Glinda model can be adapted to optimize for energy (ongoing work at UvA). It is not yet clear whether optimality can be determined as easily. Finding the right energy models: we started with statistical models and plan to use existing analytical models in the next stage. There would probably be more limitations regarding the workload types.

Dynamic partitioning: HyGraph

Dynamic partitioning: StarPU, OmpSs
[Diagram: a DAG of kernels (k0...k5, class V) is handed to a runtime system, which partitions and schedules them over the processing units, e.g., k0, k1, k5 on one device and k2, k3, k4 on the other.]
Our own take on it: graph processing.

Graph processing
Graphs are increasing in size and complexity over time: Facebook 1.7B users, LinkedIn ~500M users, Twitter ~400M users. We need HPC solutions for graph processing.

CPU or GPU?
Graph processing:
- Many vertices => massive parallelism
- Some vertices inactive => irregular workload
- Varying vertex degree => irregular tasks
- Low arithmetic intensity => bandwidth limited
- Irregular access pattern => poor caching
Ideal platform?
- CPU: few cores, deep caches, asymmetric processing
- GPU: many cores, latency hiding, data-parallel processing
No clear winner => attempt CPU+GPU!

Design of HyGraph
HyGraph: dynamic scheduling.
- Replicate vertex states on CPU and GPU.
- Divide vertices into fixed-size segments called blocks.
- Processing the vertices inside a block corresponds to one task.
- CPU: 1 thread per block. GPU: 1 SM (= 1024 threads) per block.
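A minimal sketch of the block idea follows, assuming a flat vertex-state array and leaving the actual vertex update out; the block size constant and the function names are placeholders, not HyGraph's implementation.

```cuda
// Block-based task sketch (placeholder names; not HyGraph's actual code).
#define BLOCK 1024   // vertices per block, matching "1 SM (= 1024 threads) per block"

__global__ void process_blocks_gpu(float *state, int n_vertices) {
    // One CUDA thread block (one SM) processes one vertex block; one thread per vertex.
    int v = blockIdx.x * BLOCK + threadIdx.x;
    if (v < n_vertices) {
        // update state[v] from its neighbours (omitted)
    }
}

void process_blocks_cpu(float *state, int n_vertices, int first_block, int last_block) {
    // One CPU thread processes one vertex block at a time.
    #pragma omp parallel for schedule(dynamic)
    for (int b = first_block; b < last_block; b++) {
        int begin = b * BLOCK;
        int end = (begin + BLOCK < n_vertices) ? begin + BLOCK : n_vertices;
        for (int v = begin; v < end; v++) {
            // update state[v] from its neighbours (omitted)
        }
    }
}
```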

Design of HyGraph Overlap of computation with communication

Performance evaluation
- 2 GPUs presented: Kepler (K20) and Maxwell (GTX980)
- 1 CPU node: dual 8-core CPUs, 2.4 GHz
- Three algorithms; PageRank (500 iterations) presented here
- 10 datasets; 5 synthetic ones presented here
- PRELIMINARY results! (Platform: http://www.cs.vu.nl/das5/)

Dataset    |V|       |E|
RMAT22     2.4 M     64.2 M
RMAT23     4.6 M     129.3 M
RMAT24     8.9 M     260.1 M
RMAT25     17.1 M    523.6 M
RMAT26     33.0 M    1051.9 M

Preliminary results: execution time [s]
[Plot: execution time for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Lower is better.]

Preliminary results: energy [Joule]
[Plot: energy for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Lower is better.]

Results [3]: TEPS / Watt (TEPS = traversed edges per second)
[Plot: TEPS/Watt for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Higher is better.]

Results [4]: EDP
[Plot: EDP for GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980, and CPU on RMAT22-RMAT26.]

Lessons learned
For irregular applications:
- Heterogeneous computing helps performance.
- Heterogeneous computing might help energy efficiency: for the K20 it is pretty good; for the GTX980 the GPU alone wins.
- We cannot generalize these results to other types of graphs: both energy and performance are unpredictable.
Is hybrid dynamic processing feasible for energy efficiency? Still an open question.

Instead of a summary

Take home message [1]
Heterogeneous computing works! More resources, specialized resources.
Most efforts are in static partitioning and run-time systems:
- Glinda = static partitioning for single-kernel, data-parallel applications; now works for multi-kernel applications, too.
- StarPU, OmpSs = run-time based dynamic partitioning for multi-kernel, complex-DAG applications.
Domain-specific efforts, too:
- HyGraph: graph processing
- Cashmere: divide-and-conquer, distributed

Take home message [2]
Choose a system based on your application scenario:
- Single-kernel vs. multi-kernel
- Massively parallel vs. data-dependent
- Single run vs. multiple runs
- Programming model of choice
There are models to cover combinations of these choices! No framework combines them all; food for thought?

Take home message [3]
GPUs are energy efficient on their own. Hybrid processing helps performance, but for energy efficiency the outcome is heavily application, workload, and hardware specific. Irregular applications remain the cornerstone of *both* performance and energy.

Current/Future research directions
- More heterogeneous platforms
- Extension to more application classes: multi-kernel with complex DAGs (streaming applications, graph-processing applications)
- Integration with distributed systems: intra-node workload partitioning + inter-node workload scheduling
- Extension of the partitioning model for energy

Backup slides

Imbalanced app: Sound ray tracing

Example 4: Sound ray tracing

Which hardware?
Our application has massive data-parallelism (no data dependency between rays) and is compute-intensive per ray. Clearly, this is a perfect GPU workload!!!

Results [1]
[Plot: execution time (s) for W9 (1.3 GB), Only-GPU vs. Only-CPU.]
Only 2.2x performance improvement! We expected 100x.

Workload profile
[Figure: (a) sound speed (m/s) vs. altitude (km), with the aircraft position; (b) relative distance from source (km); (c) number of simulation iterations per ray ID, over launch angles of -40 to +40 degrees.]
Peak processing: ~7000 iterations per ray; bottom processing: ~500 iterations per ray.

Results [2]
[Plot: execution time (ms).]
62% performance improvement compared to Only-GPU.

Model the app workload
n: the total problem size; w: workload per work-item.

Model the app workload
n: the total problem size; w: workload per work-item.
W = Σ_{i=0}^{n-1} w_i (the area of the rectangle); W (total workload size) quantifies how much work has to be done.
GPU partition: W_G = w × n × β; CPU partition: W_C = w × n × (1 − β).

Modeling the partitioning
T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q, where:
- W: total workload size, with W_G = β × W and W_C = (1 − β) × W
- P: processing throughput (workload/second)
- O: data-transfer size
- Q: data-transfer bandwidth (bytes/second)
Setting T_G + T_D = T_C gives W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).

Model the HW capabilities
Workload: W_G + W_C = W. Execution times: T_G and T_C.
P (processing throughput) is measured as workload processed per second; P evaluates the hardware capability of a processor.
GPU kernel execution time: T_G = W_G / P_G. CPU kernel execution time: T_C = W_C / P_C.

Model the data transfer
O (GPU data-transfer size), measured in bytes. Q (GPU data-transfer bandwidth), measured in bytes per second.
Data-transfer time: T_D = O / Q (+ latency). Latency < 0.1 ms, negligible impact.

Predict the partitioning
From T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q and the condition T_G + T_D = T_C, we obtain W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)); β is found by solving this equation.

Predict the partitioning
Solve the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
β-dependent terms: W_G = w × n × β, W_C = w × n × (1 − β), O = full data-transfer size, or full data-transfer size × β.

Predict the partitioning
Solve the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
β-dependent terms: W_G = w × n × β, W_C = w × n × (1 − β), O = full data-transfer size, or full data-transfer size × β.
β-independent terms: obtained by estimation.

Modeling the partitioning
Estimating the HW capability ratios by profiling:
- R_GC: the ratio of GPU throughput to CPU throughput (CPU kernel execution time vs. GPU kernel execution time)
- R_GD: the ratio of GPU throughput to data-transfer bandwidth (GPU data-transfer time vs. GPU kernel execution time)
These ratios plug into W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).

Predicting the optimal partitioning
Solving for β from the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)), using the total workload size, the HW capability ratios, and the data-transfer size, yields a β predictor. There are three β predictors (by data-transfer type):
- No data transfer: β = R_GC / (1 + R_GC)
- Partial data transfer: β = R_GC / (1 + (v/w) × R_GD + R_GC)
- Full data transfer: β = (R_GC − (v/w) × R_GD) / (1 + R_GC)

Making the decision in practice
From β to a practical HW configuration: calculate N_G = n × β (GPU partition size) and N_C = n × (1 − β) (CPU partition size).
- If N_G is below its lower bound: Only-CPU.
- Else, if N_C is below its lower bound: Only-GPU.
- Else: round up N_G, and use CPU+GPU with the final N_G and N_C.
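A sketch of this decision logic follows; the lower bounds and the rounding granularity are placeholders, since the real thresholds are platform-specific.

```cuda
// Sketch of the decision step (placeholder thresholds; not Glinda's actual code).
enum hw_config { ONLY_CPU, ONLY_GPU, CPU_PLUS_GPU };

enum hw_config decide(double beta, long n, long ng_min, long nc_min,
                      long *N_G, long *N_C) {
    *N_G = (long)(beta * n);                  // GPU partition size
    *N_C = n - *N_G;                          // CPU partition size
    if (*N_G < ng_min) { *N_G = 0; *N_C = n; return ONLY_CPU; }
    if (*N_C < nc_min) { *N_G = n; *N_C = 0; return ONLY_GPU; }
    *N_G = ((*N_G + 255) / 256) * 256;        // round N_G up (assumed granularity of 256)
    if (*N_G > n) *N_G = n;
    *N_C = n - *N_G;
    return CPU_PLUS_GPU;
}
```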

Extensions
- Different profiling options: online vs. offline profiling, partial vs. full profiling; the overall profiling overhead depends on the option chosen (offline/online, partial/full workload, number of runs).
- CPU + multiple GPUs: identical GPUs, or non-identical GPUs (may be suboptimal).

What if we have multiple kernels? Can we still do static partitioning?

Multi-kernel applications?
Use dynamic partitioning (OmpSs, StarPU), as sketched below:
- Partition the kernels into chunks and distribute the chunks to the *PUs, keeping data dependencies
- Enables multi-kernel asynchronous execution (inter-kernel parallelism)
- Responds automatically to runtime performance variability
- (Often) leads to suboptimal performance: scheduling policies and chunk size; scheduling overhead (taking the decision, data transfers, etc.)
Static partitioning: low applicability, high performance. Dynamic partitioning: low performance, high applicability. Can we get the best of both worlds?
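To make the chunk idea concrete, here is a toy self-scheduling loop in which CPU threads and one GPU-driver thread repeatedly claim fixed-size chunks until the iteration space is exhausted. The chunk size, the launch_gpu_chunk helper, and the single-kernel setting are simplifying assumptions; real StarPU/OmpSs schedulers also track data dependencies and use much richer policies.

```cuda
#include <omp.h>

#define CHUNK 4096   // chunk size: a key tuning knob for dynamic partitioning

// Hypothetical helper: copies the chunk to the GPU, runs the kernel, copies back.
void launch_gpu_chunk(float *x, int len, float a);

void run_dynamic(float *x, int n, float a) {
    int next = 0;                                       // next unclaimed index
    #pragma omp parallel
    {
        int gpu_driver = (omp_get_thread_num() == 0);   // thread 0 feeds the GPU
        while (1) {
            int start;
            #pragma omp atomic capture
            { start = next; next += CHUNK; }            // claim the next chunk
            if (start >= n) break;
            int len = (start + CHUNK <= n) ? CHUNK : n - start;
            if (gpu_driver)
                launch_gpu_chunk(x + start, len, a);    // GPU processes this chunk
            else
                for (int i = start; i < start + len; i++)
                    x[i] *= a;                          // CPU processes this chunk
        }
    }
}
```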

How to satisfy both requirements?
We combine static and dynamic partitioning. We design an application analyzer that chooses the best-performing partitioning strategy for any given application:
(1) Implement/get the application source code
(2) Analyze the application's kernel structure (application class)
(3) Select the partitioning strategy (from a partitioning strategy repository)
(4) Enable the partitioning in the code

Application classification
5 application classes (the DAG-of-kernels classification above): I SK-One, II SK-Loop, III MK-Seq, IV MK-Loop, V MK-DAG.

Partitioning strategies
Static partitioning: SP-Single (single-kernel applications); SP-Unified and SP-Varied (multiple kernels, each split between GPU and CPU, with a global sync after every kernel).
Dynamic partitioning (multi-kernel applications): DP-Dep and DP-Perf, which schedule partitions over GPU and CPU threads based on data-dependency chains and scheduling policies.
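One plausible reading of the two static multi-kernel strategies, used only for illustration: SP-Unified applies a single split fraction to every kernel, while SP-Varied picks a fraction per kernel. run_kernel_hybrid is a hypothetical helper that splits kernel k at fraction beta and returns only after both partitions have completed (the global sync).

```cuda
// Hypothetical helper: runs kernel k with fraction beta on the GPU and 1-beta on the CPU,
// returning after both partitions and any needed transfers finish (global sync).
void run_kernel_hybrid(int k, double beta);

// SP-Unified (as interpreted here): one split shared by all kernels in the sequence.
void run_sp_unified(int n_kernels, double beta) {
    for (int k = 0; k < n_kernels; k++)
        run_kernel_hybrid(k, beta);
}

// SP-Varied (as interpreted here): each kernel gets its own split.
void run_sp_varied(int n_kernels, const double *beta_per_kernel) {
    for (int k = 0; k < n_kernels; k++)
        run_kernel_hybrid(k, beta_per_kernel[k]);
}
```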

Partitioning strategies
Static partitioning: in Glinda, for single-kernel + multi-kernel applications.
Dynamic partitioning: in OmpSs, for multi-kernel applications (fall-back scenarios).

Results [4]
MK-Loop (STREAM-Loop):
- Without sync: SP-Unified > DP-Perf >= DP-Dep > SP-Varied
- With sync: SP-Varied > DP-Perf >= DP-Dep > SP-Unified
[Plot: execution time (ms) for Only-GPU, Only-CPU, SP-Unified, DP-Perf, DP-Dep, SP-Varied on STREAM-Loop with and without sync.]

Results: performance gain
Best partitioning strategy vs. Only-CPU or Only-GPU.
[Plot: speedup for MatrixMul (peak bar: 22.2x), Nbody, BlackScholes, HotSpot, STREAM-Seq-w, STREAM-Seq-w/o, STREAM-Loop-w, STREAM-Loop-w/o.]
Average: 3.0x (vs. Only-GPU), 5.3x (vs. Only-CPU).

Summary: Glinda
- Computes (close-to-)optimal partitioning
- Based on static partitioning: single-kernel; multi-kernel (with restrictions)
- No programming model attached: CUDA + OpenMP/TBB, OpenCL, others => we propose HPL