Towards a Performance-Portable FFT Library for Heterogeneous Computing
Towards a Performance-Portable FFT Library for Heterogeneous Computing
Carlo C. del Mundo and Wu-chun Feng
Dept. of ECE and Dept. of CS, Virginia Tech
Slides Updated: 5/19/2014
Forecast (Problem)
How do we find a portable set of optimizations for GPUs? Can one set be performance-portable across the AMD Radeon HD 6970 (VLIW) and the NVIDIA Kepler K20c (non-VLIW)?
Follow along at: goo.gl/1fs9g7
Too much heterogeneity within GPUs: they differ in architecture, vendor, and generation.
- AMD Radeon HD 6970 (VLIW)
- AMD Radeon HD 7970 (non-VLIW)
- NVIDIA C2075 (non-VLIW)
- NVIDIA Kepler K20c (non-VLIW)
Too much heterogeneity within GPUs
Problem: How do we simultaneously optimize for all GPUs and provide insight on machine-level behavior?
Solution (Contribution): A methodology for determining portable optimizations for a class of algorithms on GPUs, with FFTs used as a case study.
FFT: a building block across disciplines
Survey of FFT libraries for CPU and GPU hardware
Chart (GFLOPS): FFTW and AppleFFT on an Intel CPU; CUFFT and AppleFFT on the NVIDIA Tesla C2075 (GPU); AMD APPML on the AMD Radeon HD 7970 (GPU).
Outline
- Forecast
- Introduction
- Background
- Approach (Optimizations)
- Results & Analysis: optimizations in isolation; optimizations in concert; shuffle
- Conclusion
Background (GPUs)
GPU Memory Hierarchy: Global Memory, Image Memory, Constant Memory, Local Memory, Registers.

Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17
Approach ("Human Compilation")
Optimizations in isolation: characterize, collect, measure, then assess improvement.
Optimizations in concert: combine the winners and assess improvement again.
Approach (Optimizations*)
System-level:
1. Register Preloading (RP)
2. Vectorized Access/{Vector,Scalar} Math (VAVM, VASM)
3. Constant Memory Usage (CM)
4. Common Subexpression Elimination (CSE)
5. Inlining (IL)
6. Coalesced Global Access Pattern (CGAP)
Algorithm-level:
7. Naïve Transpose (LM-CM)
8. Compute/Transpose via LM (LM-CC)
9. Compute/No Transpose via LM (LM-CT)
Architecture- and Algorithm-level:
10. Shuffle (SHFL)
* For a complete list of optimizations, refer to Table 4 in "Towards a Performance-Portable FFT Library for Heterogeneous Computing".
System-level Optimizations
1. Register Preloading (RP)

Without Register Preloading:

    kernel void unoptimized(global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
    }

With Register Preloading:

    kernel void optimized(global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        private float2 r0, r1, r2, r3;  // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[4]; r2 = buffer[8]; r3 = buffer[12];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
    }
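The effect of register preloading can be sketched in plain C: the four complex inputs are copied into locals (a stand-in for GPU registers) before the 4-point butterfly runs entirely on them. The `fft4_preloaded` function and the `c32` type are illustrative, not the paper's actual `FFT4_in_order_output`.

```c
#include <assert.h>

typedef struct { float re, im; } c32;  /* complex float, stand-in for float2 */

/* 4-point DFT computed entirely on locals ("registers"): after the four
 * explicit preloads, no memory traffic occurs until the final stores. */
static void fft4_preloaded(const c32 *buf, int stride, c32 out[4])
{
    /* explicit preloads into register candidates */
    c32 r0 = buf[0], r1 = buf[stride], r2 = buf[2 * stride], r3 = buf[3 * stride];

    /* radix-2 butterflies on registers only */
    c32 a  = { r0.re + r2.re, r0.im + r2.im };
    c32 b  = { r0.re - r2.re, r0.im - r2.im };
    c32 c  = { r1.re + r3.re, r1.im + r3.im };
    c32 d  = { r1.re - r3.re, r1.im - r3.im };
    c32 dj = { d.im, -d.re };             /* d multiplied by -i */

    out[0].re = a.re + c.re;  out[0].im = a.im + c.im;
    out[1].re = b.re + dj.re; out[1].im = b.im + dj.im;
    out[2].re = a.re - c.re;  out[2].im = a.im - c.im;
    out[3].re = b.re - dj.re; out[3].im = b.im - dj.im;
}
```

The stride parameter mirrors the kernel's strided `buffer[0], buffer[4], buffer[8], buffer[12]` access.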
System-level Optimizations
2. Vector Access (float{2, 4, 8, 16})
A wide load fetches a[0] a[1] a[2] a[3] at once. The arithmetic can then be either:
- Scalar Math (VASM): float + float
- Vector Math (VAVM): float4 + float4
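The VASM/VAVM split can be sketched in C: both variants perform one `float4`-wide load; VASM then adds component by component, while VAVM expresses the add on the whole vector (which a compiler may map to a SIMD instruction). The `float4` struct and function names here are illustrative, not OpenCL's built-in vector types.

```c
#include <string.h>

typedef struct { float x, y, z, w; } float4;  /* illustrative 4-wide vector */

/* VASM: one wide (vectorized) load, then scalar component-wise arithmetic. */
static float4 add_vasm(const float *a, const float *b)
{
    float4 va, vb, r;
    memcpy(&va, a, sizeof va);   /* vectorized access */
    memcpy(&vb, b, sizeof vb);
    r.x = va.x + vb.x;           /* scalar math, one lane at a time */
    r.y = va.y + vb.y;
    r.z = va.z + vb.z;
    r.w = va.w + vb.w;
    return r;
}

/* VAVM: same wide load, arithmetic written as one whole-vector expression. */
static float4 add_vavm(const float *a, const float *b)
{
    float4 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    return (float4){ va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w };
}
```

Both forms compute the same values; the paper's results concern which form maps better to a given GPU's execution units.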
Architecture- and Algorithm-Level Optimization
10. Shuffle
Enables efficient data communication: local memory (the old way) vs. shuffle (the new way).
Architecture- and Algorithm-Level Optimization
10. Shuffle
We evaluate shuffle using matrix transpose, a data communication step in FFT.
Devised a shuffle transpose algorithm consisting of horizontal (inter-thread shuffle) and vertical (intra-thread) rotations:
Original
Step 1: Horizontal rotation (between threads)
Step 2: Vertical rotation (within a thread)
Step 3: Horizontal rotation (between threads)
Transposed
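The three rotation steps can be simulated in plain C, with a 4x4 array standing in for four threads' register files. The particular rotation amounts below (fetch from lane t+k, rotate registers by the thread index, fetch from lane k-t) are one consistent choice that yields a transpose; the exact amounts used in the paper's algorithm may differ.

```c
enum { W = 4 };  /* 4 threads x 4 registers, illustrative */

/* regs[t][k]: register k of thread t. Transpose via two inter-thread
 * ("horizontal") rotations and one intra-thread ("vertical") rotation,
 * mirroring the Step 1 / Step 2 / Step 3 structure on the slide. */
static void shuffle_transpose(float regs[W][W])
{
    float tmp[W][W], tmp2[W][W];
    int t, k;

    /* Step 1: horizontal - thread t fetches register k from lane (t+k)%W */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            tmp[t][k] = regs[(t + k) % W][k];

    /* Step 2: vertical - thread t rotates its own registers by t */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            tmp2[t][k] = tmp[t][(k - t + W) % W];

    /* Step 3: horizontal - thread t fetches register k from lane (k-t)%W */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            regs[t][k] = tmp2[(k - t + W) % W][k];
}
```

On a real Kepler GPU, Steps 1 and 3 would each be a warp shuffle and Step 2 a local register permutation; no local memory is touched.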
Results (Experimental Testbed)
AMD Radeon HD 6970 (VLIW), AMD Radeon HD 7970 (non-VLIW), NVIDIA C2075 (non-VLIW), NVIDIA Kepler K20c (non-VLIW).
Results (Experimental Testbed)
Application Setup: 1D FFT (batched), N = 16-, 64-, and 256-pts; 2D FFT (batched), N = 256x256.
GPU Testbed:
Device                Architecture
AMD Radeon HD 6970    VLIW
AMD Radeon HD 7970    non-VLIW
NVIDIA Tesla C2075    non-VLIW
NVIDIA Tesla K20c     non-VLIW
Results (Optimizations in Isolation)
Radeon 7970, 256-pts: execution time (ms) per optimization, broken into twiddles, transpose, and column FFTs (bar chart).
RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (float n); VAVM{n}: Vectorized Access & Vector Math (float n); CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.
Results (Optimizations in Isolation)
NVIDIA K20c, 256-pts: execution time (ms) per optimization, broken into twiddles, transpose, and column FFTs (bar chart).
Results (Observations)
1. Use scalar operations (e.g., vector access/scalar math).
2. Focus should be on the memory subsystem (e.g., bus traffic).
Results (Bus Traffic)
Radeon 7970, 256-pts: execution time (ms) and bus traffic (MB) per optimization, each broken into twiddles, transpose, and column FFTs (bar charts).
Results (Insight #1)
The primary cost of FFT is in data movement. Reduce bus traffic by:
- using optimizations that prefetch memory (RP)
- staging the transpose in scratchpad memory (LM-CM, LM-CC, LM-CT)
Results (Optimizations in Concert)
AMD Radeon HD and NVIDIA Tesla K20c, 256-pts: execution time (ms), split into kernel load/store and kernel execution, for LM-CC, LM-CT, and RP+LM-CM under VASM2, VASM4, VAVM2, and VAVM4, each with CM (bar charts).
Results (Insight #2)
One sequence of optimizations performs well across GPUs. These optimizations are:
- RP (Register Preloading)
- LM-CM (Local Memory, Communication Only)
- VASM2/4 (Vector Access, Scalar Math, float2/4)
- CM (Constant Memory Usage)
- CGAP (Coalesced Global Access Pattern)
Speed-up with Shuffle
Overall Performance: maximum speedup bound (Amdahl's Law) vs. achieved speedup.
Surprise result. Goal: accelerate the communication (the gray bar). Result: the computation also accelerated (the black bar). Chart: computation vs. communication execution time (ms) for Shm and SELP (IP).
Results: 2D FFT (N = 256x256)
Optimizations: RP, LM-CM, VASM2, CM, CGAP.
Optimized vs. unoptimized GFLOPS on AMD Radeon HD 6970, AMD Radeon HD 7970, NVIDIA Tesla C2075, and NVIDIA Tesla K20c; observed speedups include 11.76x, 2.05x, and 2.14x (bar chart).
Conclusion (Thank You!)
Title: Towards a Performance-Portable FFT Library for Heterogeneous Computing
Contribution: A methodology for determining portable optimizations for a class of algorithms.
Optimization principles for FFT on GPUs: an analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures.
Insight #1: The primary cost of FFT computation is in data movement (i.e., the FFT is memory bound).
Insight #2: One sequence of optimizations performs well across GPUs.
[1D FFT] Improvement over the baseline GPU version; 9.1-fold improvement over multi-core FFTW with AVX on the CPU.
Chart: optimized vs. unoptimized GFLOPS on AMD Radeon HD 6970, AMD Radeon HD 7970, NVIDIA Tesla C2075, and NVIDIA Tesla K20c (2.14x).
Appendix Slides
Background (Optimizing on GPUs)
1. RP (Register Preloading): All data elements are first preloaded into the register file of the respective GPU. Computation is carried out solely on registers.
2. CGAP (Coalesced Global Access Pattern): Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}): Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only): Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose): Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication): All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-{K,L} (Constant Memory {Kernel, Literal}): The twiddle multiplication stage of FFT is precomputed on the CPU and stored in the GPU's constant memory for fast lookup. CM-K passes constant memory as a kernel argument, while CM-L uses a static global declaration in the OpenCL kernel.
8. CSE (Common Subexpression Elimination): A traditional optimization that collapses identical expressions in order to save computation. It may lengthen register live ranges, thereby increasing register pressure.
9. IL (Function Inlining): A function's body is inserted in place of a function call. It is used primarily for frequently called functions.
10. LU (Loop Unrolling): A loop is explicitly rewritten as an identical sequence of statements without the overhead of loop-variable comparisons.
11. Shuffle: The transpose stage of FFT is performed entirely in registers, eliminating the use of local memory. This optimization is only possible on NVIDIA Kepler GPUs (e.g., Tesla K20c).
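Two of the classical compiler optimizations above, CSE and loop unrolling, can be made concrete with a toy example (illustrative code, not from the paper; the function names are hypothetical):

```c
/* Loop version: per-iteration compare/increment overhead on every pass. */
static float dot4_plain(const float *x, const float *y)
{
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)   /* loop-variable comparison each iteration */
        acc += x[k] * y[k];
    return acc;
}

/* LU: the loop rewritten as an identical straight-line sequence. */
static float dot4_unrolled(const float *x, const float *y)
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}

/* CSE: the common subexpression a*b is computed once into a temporary
 * instead of twice; the temporary lives longer (register pressure). */
static float cse_example(float a, float b, float c)
{
    float ab = a * b;             /* computed once instead of twice */
    return (ab + c) * (ab - c);
}
```

On GPUs these transformations are usually applied by hand in the kernel source, since the paper treats them as explicit optimization knobs.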
S3: Constant Memory
Fast cached lookup for frequently used data.

Without Constant Memory:

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory:

    constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
                                     (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
                                     ... more sin/cos values };

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
System-level Optimizations
Approach
System-level Optimizations (applicable to any application):
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory
Algorithm-level Optimizations [1]
[1] C. del Mundo, W. Feng, "Accelerating Fast Fourier Transform for Wideband Channelization," IEEE ICC, Budapest, Hungary, June.
Algorithm-level optimizations
Transpose: elements across the diagonal are exchanged (4x4 matrix vs. transposed matrix).
Algorithm-level optimizations
1. Naïve Transpose (LM-CM): Original vs. transposed. Threads t0..t3 move elements from the register file through local memory and back.
Algorithm-level optimizations
3. The pseudo transpose (LM-CT)
Idea: Load data into local memory; perform computation on columns, then rows.
Advantage: Skips the transpose step.
Disadvantage: Local memory has lower throughput than registers.
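The LM-CT idea of processing columns in place, rather than transposing first, can be sketched with a toy stage: applying the same operation down each column via a strided index gives the same result as transposing and then sweeping rows. The `stage_sum` operation below is a hypothetical stand-in for a 4-point FFT pass.

```c
enum { N4 = 4 };

/* Toy stage: sum of 4 values (stand-in for a 4-point FFT pass). */
static float stage_sum(const float *p, int stride)
{
    return p[0] + p[stride] + p[2 * stride] + p[3 * stride];
}

/* Row pass: contiguous, unit-stride access. */
static void pass_rows(const float m[N4 * N4], float out[N4])
{
    for (int r = 0; r < N4; ++r)
        out[r] = stage_sum(&m[r * N4], 1);
}

/* Column pass WITHOUT a transpose: the same stage, stride-N4 access.
 * This is the reorganization that lets LM-CT skip the transpose step. */
static void pass_cols_no_transpose(const float m[N4 * N4], float out[N4])
{
    for (int c = 0; c < N4; ++c)
        out[c] = stage_sum(&m[c], N4);
}
```

The trade-off on a GPU is exactly the slide's: strided local-memory access avoids the data movement of a transpose but runs at local-memory rather than register throughput.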
Architecture-level Optimization: Shuffle
Software (the transpose) mapped onto hardware (the NVIDIA Kepler K20c shuffle mechanism).
Results (Shuffle)
Bottleneck: intra-thread data movement. The stages alternate horizontal (shuffle) and vertical (intra-thread) movement through the register file.

Code 1 (NAIVE), 15x:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

General strategies: Registers are fast; CUDA local memory is slow. The compiler is forced to place data into CUDA local memory if array indices cannot be determined at compile time.
Results (Shuffle)
Code 2 (DIV), 6%: the rotation unrolled per thread ID; each if/else-if branch diverges.

    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP), 44%: the same rotation as branch-free selects.

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
src_registers[3] : dst_registers[2]; Follow along 83 at: dst_registers[3] goo.gl/1fs9g7 = (tid == 3)? src_registers[0] : dst_registers[3]; 118
119 Results (Shuffle)

[Bar chart: execution time (ms, up to 4.4) split into computation and communication; X% = improvement in communication time.]

    Shm (shared-memory baseline)
    Naive: 15x
    37.5% occupancy: DIV 6%, SELP (IP) 17%, SELP (OOP) 44%
    50% occupancy: SELP (IP) 14%

Follow along at: goo.gl/1fs9g7