Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids

Size: px

Start display at page:

Download "Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids"

Shauna Cobb
6 years ago
Views:

1 Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Mark Silberstein - Technion Naoya Maruyama Tokyo Institute of Technology Mark Silberstein, Technion 1

2 The case for heterogeneous hardware FLOPs/J General Purpose Processor Accelerators Workloads Mark Silberstein, Technion 2

3 The case for heterogeneity-aware software FLOPs/J General Purpose Processor Accelerators Workloads Slides idea: Uri Weiser Mark Silberstein, Technion 3

4 The case for energy-aware task assignment FLOPs/J GPU CPU Workloads Execute on GPU Execute on CPU Mark Silberstein, Technion 4

5 Experiment: GPU vs. CPU for the same task 1 sec 1.3 sec CPU: AMD Phenom Quadcore GPU: NVIDIA GTX285 Mark Silberstein, Technion 5

6 Outline Task dependency trees Energy-efficient acceleration assignment problem Optimal algorithm Evaluation on probabilistic networks workload Mark Silberstein, Technion 6

7 Input: task dependency trees Dependency tree for AxB+C Algebraic expression evaluation Divide and conquer strategies Inference in probabilistic networks Input Output +C AxB Input Mark Silberstein, Technion 7

8 Goal: minimize power consumption Greedy assignment Output Input +C CPU=60 GPU=30 AxB CPU= GPU= Input Mark Silberstein, Technion 8

9 Goal: minimize power consumption Greedy assignment Output Run on GPU +C CPU=60 GPU=30 Input AxB CPU= GPU= Run on CPU Input Mark Silberstein, Technion 9

10 Communication is not free! CPU RAM Accelerator Accelerator RAM Mark Silberstein, Technion 10

11 Communication is not free! CPU RAM Accelerator Accelerator RAM Mark Silberstein, Technion 11

12 Greedy algorithm ignores communications Output Input and output should be moved to/from GPU Greedy (CPU-GPU) = 70J Optimal (GPU-GPU)= 65J 5 Input +C CPU=60 GPU=30 AxB CPU= GPU= 5 Input 5 Mark Silberstein, Technion 12

13 Greedy algorithm ignores communications Output Input and output should be moved to/from GPU Greedy (CPU-GPU) = 70J Optimal (GPU-GPU)= 65J 5 Input +C CPU=60 GPU=30 AxB CPU= GPU= 5 Needed: communication-aware energy-optimized assignment Input 5 Mark Silberstein, Technion 13

14 How bad can it get? Mark Silberstein, Technion 14

15 How bad can it get? C CPU=10 6 GPU= B CPU=100 GPU=10 As bad as we want... Input A CPU=0.1 GPU=10 6 Mark Silberstein, Technion

16 More formally... Task dependency tree T(V,E,P,D) V tasks, E data dependencies P v (i) cost of execution on processor i D v (i j) cost of data transfer between i and j Goal: find S: V {CPU:GPU} to minimize v V (P v [S (v)]+ w N v D v [ S (w) S (v)]) Execution of task v Communications to task v Mark Silberstein, Technion 16

17 Appears to be NP-hard... Resembles communication-aware parallel scheduling of DAGs on multiprocessors NP-hard even for dependency trees, for fixed number of processors, with non-unit execution and communication costs Mark Silberstein, Technion 17

18 Appears to be NP-hard... Resembles communication-aware parallel scheduling of DAGs on multiprocessors NP-hard even for dependency trees, for fixed number of processors, with non-unit execution and communication costs But is it really the same problem? Mark Silberstein, Technion 18

19 Energy-efficient architecture CPU-intensive workload CPU RAM Accelerator Accelerator RAM Mark Silberstein, Technion 19

20 Energy-efficient architecture Accelerator-intensive workload CPU RAM Accelerator Accelerator RAM Mark Silberstein, Technion

21 Example NVIDIA OPTIMUS From NVIDIA OPTIMUS White Paper Mark Silberstein, Technion 21

22 Energy optimization is easier! Assume we found minimum-energy schedule Two executions are energy-equivalent: All processors are used concurrently Processors used one-at-a-time, unused are powered off Mark Silberstein, Technion 22

23 Energy optimization is easier! Assume we found minimum-energy schedule Two executions are energy-equivalent: All processors are used concurrently Processors used one-at-a-time, unused are powered off We refer to this problem as Energy-efficient acceleration assignment Runtime becomes secondary optimization Mark Silberstein, Technion 23

24 Dynamic programming Observation: assignment in one branch does not affect the assignment in another Forward step: Traverse tree from the leaves to the root Update the best costs for node execution on CPU and GPU, given best costs of its descendants Mark Silberstein, Technion 24

25 Dynamic programming Observation: assignment in one branch does not affect the assignment in another Forward step: Traverse tree from the leaves to the root Update the best costs for node execution on CPU and GPU, given best costs of its descendants Backward step: Traverse tree from the root to the leaves Fix the best cost given the root on a CPU Mark Silberstein, Technion 25

26 Forward step Outut CPU 8 C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 26

27 Forward step Outut CPU 8 C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 27

28 Forward step Outut CPU 8 C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 28

29 Forward step Outut CPU Best Subtree Cost if C is on GPU 8 Best Subtree Cost if C is on CPU C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 29

30 Forward step Outut CPU 8 C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 30 30

31 Forward step Cost of subtree rooted at C if C is on CPU Outut CPU C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 31 30

32 Forward step Outut CPU Cost of subtree rooted at C if C is on GPU C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 32 30

33 Forward step 50 Outut CPU 8 88 Cost of subtree including data transfer C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 33 30

34 Backward step Outut CPU Choose the best path C CPU=10 GPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 34 30

35 Backward step Outut CPU C CPU= A B CPU= CPU=30 GPU= GPU= Mark Silberstein, Technion 35 30

36 Backward step Outut CPU C CPU= A B CPU= CPU=30 GPU= Mark Silberstein, Technion 36 30

37 Backward step Outut CPU C CPU= A B CPU= GPU= Mark Silberstein, Technion 37

38 Final assignment Outut CPU 8 C CPU= A B CPU= GPU= Mark Silberstein, Technion 38

39 Final assignment Outut CPU 8 C CPU= May run in parallel A B CPU= GPU= Mark Silberstein, Technion 39

40 Dynamic programming Forward step: Traverse tree from the leaves to the root Update the best costs for node execution on CPU and GPU, given best costs of its descendants Backward step: Traverse tree from the root to the leaves Fix the best cost given the root on a CPU Complexity: O(#Tasks x #Processors) Mark Silberstein, Technion 40

41 Experiments 6 task dependency trees Source: inference in large probabilistic networks Used in genetic analysis GPU: NVIDIA GTX285 CPU: Quadcore AMD Phenom 9500 Tasks CPU time per task (ms) GPU time per task (ms) Speedup per task Transfer sizes between tasks (KB) 1194 (0.01 / 11 / 667) (0.2 / 0.5 /.9) 0.06 / 7.7 / / 633 / 12K Mark Silberstein, Technion 41

42 Methodology Predict the energy spent on each task and transfer Runtime estimate x observed power consumption Transfer time estimate x observed power consumption Run the assignment algorithm Hybrid-greedy, only-gpu, only-cpu, exact Given the schedule, measure the actual execution and data transfer time of all tasks in the tree Scale by the observed power consumption Mark Silberstein, Technion 42

43 Hybrid strategy pays off 0 Mark Silberstein, Technion 43

44 Hybrid strategy pays off Memory transfer times are comparable with the task execution times 0 Mark Silberstein, Technion 44

45 Hybrid greedy assigns fewer tasks to GPU Total tasks GPU tasks by Optimal GPU tasks by Greedy Mark Silberstein, Technion 45

46 Optimal algorithm avoids islands Mark Silberstein, Technion 46

47 Optimal algorithm avoids islands Mark Silberstein, Technion 47

48 Runtime optimization The algorithm can be used to improve runtime Exploiting the parallelism available in the schedule 0 Mark Silberstein, Technion 48

49 Insights Communication power cost may hinder GPU performance benefits Power off assumption simplifies scheduling algorithms Algorithm utility depends on the relative amount of communications Mark Silberstein, Technion 49

50 Future work Use as a part of a heuristic for DAG Joint runtime power optimization Power measurements on real hardware Application to multicores Mark Silberstein, Technion 50

51 The High Cost of Data Movement From the keynote IPDPS11, Prof. Dally, Stanford mm 64-bit DP pj 26 pj 256 pj 16 nj DRAM Rd/Wr 256-bit buses 256-bit access 8 kb SRAM 50 pj 500 pj Efficient off-chip link 1 nj Its not about 28nm the FLOPS Its about data movement FHTE 4/26/11 51

52 Thank you! Mark Silberstein, Technion 52

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack