On-the-Fly Elimination of Dynamic Irregularities for GPU Computing


On-the-Fly Elimination of Dynamic Irregularities for GPU Computing. Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen

Graphics Processing Units (GPUs) 2

Graphics Processing Unit (GPU): great for regular computation! Threads execute in SIMD groups (warps). Massive parallelism; favorable computing power, cost effectiveness, and energy efficiency. 3

Dynamic Irregularities. Memory: irregular references such as ... = A[P[tid]];. With P[ ] = {0, 1, 2, 3, 4, 5, 6, 7}, a warp's accesses fall into few memory segments; with P[ ] = {0, 5, 1, 7, 4, 3, 6, 2}, they scatter across segments. Control flow (thread divergence): a condition or trip count depends on the data, e.g., if (A[tid]) {...} and for (i = 0; i < A[tid]; i++) {...}. Either irregularity can degrade throughput by up to (warp size - 1) times (warp size = 32 in modern GPUs). 4
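The slide's examples use 8 threads over 4-element memory segments. The throughput loss can be illustrated with a minimal host-side Python sketch of a simplified coalescing model: one memory transaction per distinct segment touched by a SIMD group. The 4-thread group size and the one-transaction-per-segment rule are illustrative assumptions, not the paper's exact hardware model.

```python
# Simplified coalescing model: count one transaction per distinct memory
# segment touched by each SIMD group of threads.

def mem_transactions(indices, warp_size=4, seg_size=4):
    """Count memory transactions for a list of per-thread access indices."""
    total = 0
    for w in range(0, len(indices), warp_size):
        segments = {i // seg_size for i in indices[w:w + warp_size]}
        total += len(segments)
    return total

regular   = [0, 1, 2, 3, 4, 5, 6, 7]   # first P[] on the slide
irregular = [0, 5, 1, 7, 4, 3, 6, 2]   # shuffled P[] on the slide

print(mem_transactions(regular))    # 2: each group stays in one segment
print(mem_transactions(irregular))  # 4: each group straddles two segments
```

The scattered pattern doubles the transaction count even though both patterns touch exactly the same set of elements, which is the waste the transformations below attack.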

Performance Impact. Applications: dynamic programming, fluid simulation, image reconstruction, data mining, etc. [Chart: potential speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap; host: Xeon, device: Tesla.] 5

Previous Work. Mostly on static memory irregularities: [Baskaran+, ICS 08], [Lee+, PPoPP 09], [Yang+, PLDI 10], etc. Dynamic irregularities remain unknown until runtime and have been handled mostly through hardware extensions: [Meng+, ISCA 10], [Tarjan+, SC 09], [Fung+, MICRO 07], etc. 6

Open Questions for Dynamic Irregularity Removal. Is it possible to remove dynamic irregularities without hardware extensions? More fundamentally: What are the relations among data, threads, and irregularities? What layouts or thread-data mappings minimize the irregularities, how can they be found, and at what complexity? How should conflicts among irregularities, and dependences, be resolved? 7

Overview of This Work. Analytic findings on the properties of dynamic irregularity removal, plus a software solution: the G-Streamline library. It needs no profiling or hardware extensions, removes irregularities transparently on the fly, jeopardizes no basic efficiency, and treats both types of irregularity holistically. 8

Basic insight: both memory and control irregularities stem from inferior thread-data mappings, which enables a unified treatment. Two basic mechanisms, data relocation and reference redirection (A[P[tid]] -> A[Q[tid]]), compose into three transformations: data reordering, job swapping, and hybrid. 9

Trans-1: Data Reordering (for memory irregularity only). Original: ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Relocation copies the accessed values into a new array A'[ ] at better locations, maintaining the mapping between threads and data values; redirection then rewrites the access as ... = A'[Q[tid]]; with Q[ ] = {0, 1, 2, 3, 2, 3, 6, 7}. (tid: thread ID.) 10
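A minimal host-side sketch of this transformation. The greedy slot assignment used here (each thread prefers the slot equal to its own tid; threads that access the same element share one slot; otherwise the first free slot is taken) is an assumption inferred from the slide's example, not the paper's published algorithm, but it reproduces the Q above.

```python
# Trans-1 sketch: build the relocated array A2 and redirection map Q so that
# A2[Q[tid]] delivers the same value as A[P[tid]] for every thread.

def data_reorder(A, P):
    n = len(P)
    Q = [None] * n
    slot_of = {}            # original index in A -> slot in A2
    used = [False] * n
    for tid, p in enumerate(P):
        if p in slot_of:                      # duplicate access: share the slot
            Q[tid] = slot_of[p]
        else:
            s = tid if not used[tid] else used.index(False)
            slot_of[p] = s
            used[s] = True
            Q[tid] = s
    A2 = list(A)                              # relocation step
    for p, s in slot_of.items():
        A2[s] = A[p]
    return A2, Q

A = list(range(8))
P = [0, 5, 2, 3, 2, 3, 7, 6]
A2, Q = data_reorder(A, P)
print(Q)                                      # [0, 1, 2, 3, 2, 3, 6, 7]
print(all(A2[Q[t]] == A[P[t]] for t in range(8)))   # True: same values delivered
```

The transformed accesses A2[Q[tid]] are contiguous within each SIMD group except for the duplicates, which is exactly the layout the slide shows.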

Trans-2: Job Swapping (for memory). A job = operations + the data elements they access. Original: ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Redirection swaps jobs between threads: newtid = Q[tid]; ... = A[P[newtid]]; with Q[ ] = {0, 4, 2, 3, 1, 5, 6, 7}. Like data reordering, this reduces one memory transaction, still one more than the optimal. 11
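The one-transaction saving can be checked with the same simplified coalescing model as before (4-thread groups, 4-element segments; an illustrative assumption, redefined here so the sketch is self-contained).

```python
# Trans-2 check: applying the slide's Q swaps the jobs of threads 1 and 4,
# saving one memory transaction under the simplified coalescing model.

def mem_transactions(indices, warp_size=4, seg_size=4):
    total = 0
    for w in range(0, len(indices), warp_size):
        total += len({i // seg_size for i in indices[w:w + warp_size]})
    return total

P = [0, 5, 2, 3, 2, 3, 7, 6]
Q = [0, 4, 2, 3, 1, 5, 6, 7]           # job-swap redirection from the slide

before = mem_transactions([P[t] for t in range(8)])
after  = mem_transactions([P[Q[t]] for t in range(8)])
print(before, after)                    # 4 3
```

The swap brings the first group's accesses {0, 2, 2, 3} into a single segment; the second group still straddles two, so the result is one transaction above the optimal of two.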

Trans-2: Job Swapping (for control), method 1. Original: if (B[tid]) {...}. Redirection: newtid = D[tid]; if (B[newtid]) {...}; with D[ ] = {0, 1, 4, 3, 2, 5, 6, 7}. Side effect: the memory reference pattern changes. Solution: a follow-up data reordering. 12

Trans-2: Job Swapping (for control), method 2. Relocation builds B'[ ] with the condition values rearranged, so the code becomes if (B'[tid]) {...}. To preserve job integrity when the condition involves the thread ID, as in if (B[tid]-tid) {...}, redirection is added: newtid = Q[tid]; if (B'[tid]-newtid) {...}; with Q[ ] = {0, 1, 4, 3, 2, 5, 6, 7}. 13
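A minimal sketch of the idea behind the redirection map: stably partition thread IDs by branch outcome so threads in the same SIMD group agree. The predicate values in B and the 4-thread group size below are made-up illustrations, not values from the paper.

```python
# Job swapping for control flow: group threads with equal branch outcomes
# into the same SIMD group to eliminate thread divergence.

def divergent_warps(outcomes, warp_size=4):
    """Count SIMD groups whose threads disagree on the branch outcome."""
    return sum(len(set(outcomes[w:w + warp_size])) > 1
               for w in range(0, len(outcomes), warp_size))

def build_redirection(B):
    # Stable partition of thread IDs: taken branches first, then not-taken.
    return ([t for t in range(len(B)) if B[t]] +
            [t for t in range(len(B)) if not B[t]])

B = [1, 1, 0, 1, 1, 0, 0, 0]            # hypothetical predicate values
D = build_redirection(B)
print(D)                                          # [0, 1, 3, 4, 2, 5, 6, 7]
print(divergent_warps([B[t] for t in range(8)]))  # 2 divergent groups before
print(divergent_warps([B[D[t]] for t in range(8)]))  # 0 after redirection
```

After redirection every group executes a single path, which is the divergence removal the slide illustrates.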

Trans-3: Hybrid (job swap + data reorder, or data reorder + job swap). Either single optimization alone reduces one memory transaction, one more than the optimal. Example (data reorder + job swap): original ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Data reordering (relocation + redirection) yields ... = A'[Q[tid]]; with Q[ ] = {4, 5, 2, 3, 2, 3, 7, 6}. Job swapping (redirection) then yields the transformed code ntid = R[tid]; ... = A'[Q[ntid]]; with R[ ] = {4, 5, 2, 3, 0, 1, 6, 7}. The optimal is achieved. 14
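The slide's hybrid mappings can be checked under the same simplified coalescing model (4-thread groups, 4-element segments; an illustrative assumption): composing Q and R reaches the optimal two transactions.

```python
# Trans-3 check: data reordering (Q, into relocated array A') composed with
# job swapping (R) gives effective access indices Q[R[tid]] into A'.

def mem_transactions(indices, warp_size=4, seg_size=4):
    total = 0
    for w in range(0, len(indices), warp_size):
        total += len({i // seg_size for i in indices[w:w + warp_size]})
    return total

P = [0, 5, 2, 3, 2, 3, 7, 6]   # original indices into A
Q = [4, 5, 2, 3, 2, 3, 7, 6]   # after data reordering: ... = A'[Q[tid]]
R = [4, 5, 2, 3, 0, 1, 6, 7]   # after job swapping:   ntid = R[tid]

print(mem_transactions(P))                            # 4 in the original
print(mem_transactions([Q[R[t]] for t in range(8)]))  # 2: optimal achieved
```

Each SIMD group now touches exactly one segment ({2, 3, 2, 3} and {4, 5, 7, 6}), so no single-transformation residue remains.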

Comparisons. Memory irregularity: data reordering and job swapping differ in applicability; the hybrid has the largest potential. Control irregularity: job swapping by redirection has lower overhead but side effects; job swapping by relocation has higher overhead but no side effects. 15

G-Streamline consists of three parts: transformations (data reordering, job swapping, and hybrid, built on data relocation and reference redirection); optimality and approximation (a complexity analysis that guides the transformations, plus methods for approximating optimal layouts and mappings); and efficiency control (overhead hiding and minimization through adaptive CPU-GPU pipelining and efficiency-driven adaptation). How to determine optimal layouts and thread-data mappings? Finding an optimal data layout is NP-complete (reducible from 3D matching), as is finding an optimal thread-data mapping (the partition problem); G-Streamline approximates them with duplication/packing and clustered sorting. 16

After Transformation. Benchmark suites: Rodinia, Tesla Bio, and others. [Chart: speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap with and without transformation overhead; host: Xeon, device: Tesla.] With overhead counted, the question becomes: how to minimize or hide the overhead? 17

G-Streamline efficiency control: overhead hiding and minimization through adaptive CPU-GPU pipelining (kernel splitting, partial transformation and overlapping) and adaptive control. The goals: transparent, on-the-fly operation; adaptation to pattern changes; no performance degradation; resilience to dependences; and automatic balancing of benefits and overhead. 18

CPU-GPU Pipelining: utilize idle CPU time by transforming on the CPU while computing on the GPU, with automatic shutdown when necessary.

    for i = 1:n
        async_transform(i+2);
        async_copy(i+2);
        gpu_kernel(i);
    end

[Timeline: while the GPU runs kernel i, the CPU transforms a later chunk and the memory bus copies it to the GPU.] 19

Dependence or No? Loop in CFD (grid Euler solver):

    for i = 1:iterations
        ...
        cuda_compute_flux(...);  // write A, read B
        cuda_time_step(...);     // read A, write B
        ...
    end

CUDA-EC (DNA error correction):

    main() {
        gpu_fix_errors1();
    }

20

Kernel Splitting. Original launch:

    gpukernel_org<<<...>>>(pdata, ...);

Split and pipeline:

    gpukernel_org_sub<<<...>>>(pdata, 0, (1-r)*len, ...);
    gpukernel_opt_sub<<<...>>>(pdata, (1-r)*len+1, len, ...);

Also useful for loops with no dependences; enables partial transformation. 21
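A host-side Python sketch of the splitting idea: the first (1-r) fraction of the threads runs the original sub-kernel, the rest runs the optimized one, and the combined output must match an unsplit run. The kernel body (gather then double), the names org_sub/opt_sub, and the use of data reordering as the optimized version are illustrative assumptions, not the paper's code.

```python
# Kernel splitting sketch: partial transformation must not change results.

A  = list(range(100, 108))
P  = [0, 5, 2, 3, 2, 3, 7, 6]
Q  = [0, 1, 2, 3, 2, 3, 6, 7]                            # Trans-1 redirection
A2 = [A[0], A[5], A[2], A[3], A[4], A[5], A[7], A[6]]    # relocated copy of A

def org_sub(lo, hi):
    # Original access pattern for thread IDs [lo, hi).
    return {t: 2 * A[P[t]] for t in range(lo, hi)}

def opt_sub(lo, hi):
    # Transformed access pattern (data reordering) for thread IDs [lo, hi).
    return {t: 2 * A2[Q[t]] for t in range(lo, hi)}

n, r = 8, 0.5                       # r: fraction handled by the optimized part
split = org_sub(0, int((1 - r) * n))
split.update(opt_sub(int((1 - r) * n), n))
print(split == org_sub(0, n))       # True: splitting preserves every result
```

Because data reordering preserves each thread's job (A2[Q[t]] == A[P[t]]), any split point is safe; for job swapping, the swap mapping must stay closed within the transformed portion.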

Together, CPU-GPU pipelining, kernel splitting, partial transformation and overlapping, and two-level efficiency-driven adaptation provide overhead hiding and minimization: transparent, on-the-fly, adaptive to pattern changes, with no performance degradation, resilient to dependences, and automatically balancing benefits and overhead. 22

Final Speedup. [Chart: speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap for the basic transformation, with efficiency control, and at full potential.] 23

HW Profiling Results.

    Benchmark   Irreg. Type   Mem. Trans. Reduction   Div. Branch Reduction
    3D-LBM      div & mem     10%                     99%
    CFD         mem           12%                     -
    CG          mem           19%                     -
    NN          mem           67%                     -
    Unwrap      mem           19%                     -
    CUDA-EC     div           -                       10%
    GPU-HMMER   div           -                       99%

24

Conclusions. Comprehensive understanding: the computational complexity of dynamic irregularity removal and the relations among threads, data, and irregularities. G-Streamline: a unified software solution that treats both memory and control irregularities, needs no offline profiling or hardware extensions, and works as a whole-system synergy. 25

Thanks! 26

Adaptive Efficiency Control: automatic shutdown and early termination if necessary. A monitor plus scheduler provide whole-system communication and status: GPU status is tracked through cudaStreamQuery(), and the CPU, GPU, and data transfer are coordinated. [Diagram: the monitor and scheduler coordinate transformation tasks on the CPU with computation tasks on the GPU over the memory bus.] 27

Adaptive Efficiency Control: fine-grained adaptation to determine a good splitting ratio. 1. Transparent online profiling in the first several iterations. 2. Build a linear model to obtain the optimal ratio Ropt. 3. Continue adjusting Ropt through a runtime control system: fast convergence to near optimal, avoiding unnecessary oscillations and avoiding being misled by early shutdown. Please see the paper. 28


More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Yunsup Lee UC Berkeley 1

Yunsup Lee UC Berkeley 1 Yunsup Lee UC Berkeley 1 Why is Supporting Control Flow Challenging in Data-Parallel Architectures? for (i=0; i

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise

More information

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation Ziyu Guo Qualcomm CDMA Technologies, San Diego, CA, USA guoziyu@qualcomm.com Bo Wu Computer Science Department

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Yunlian Jiang Eddy Z. Zhang Kai Tian Xipeng Shen (presenter) Department of Computer Science, VA, USA Cache Sharing A common

More information

Peta-Scale Simulations with the HPC Software Framework walberla:

Peta-Scale Simulations with the HPC Software Framework walberla: Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Fast Segmented Sort on GPUs

Fast Segmented Sort on GPUs Fast Segmented Sort on GPUs Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng {kaixihou, hwang121, wfeng}@vt.edu weifeng.liu@nbi.ku.dk Segmented Sort (SegSort) Perform a segment-by-segment sort on a given

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong,

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview

More information

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Breadth-first Search in CUDA

Breadth-first Search in CUDA Assignment 3, Problem 3, Practical Parallel Computing Course Breadth-first Search in CUDA Matthias Springer Tokyo Institute of Technology matthias.springer@acm.org Abstract This work presents a number

More information

Parallel Implementation of VLSI Gate Placement in CUDA

Parallel Implementation of VLSI Gate Placement in CUDA ME 759: Project Report Parallel Implementation of VLSI Gate Placement in CUDA Movers and Placers Kai Zhao Snehal Mhatre December 21, 2015 1 Table of Contents 1. Introduction...... 3 2. Problem Formulation...

More information

Optimizing Parallel Reduction in CUDA

Optimizing Parallel Reduction in CUDA Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each

More information

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache

More information

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs 14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang Stone Ridge Technology, Bel Air, MD

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu

More information