Towards a Performance-Portable FFT Library for Heterogeneous Computing

1 Towards a Performance-Portable FFT Library for Heterogeneous Computing. Carlo C. del Mundo*, Wu-chun Feng* (*Dept. of ECE, Dept. of CS, Virginia Tech). Slides updated: 5/19/2014. Follow along at: goo.gl/1fs9g7

2-4 Forecast (Problem): How do we find a portable set of optimizations for GPUs? Is performance portable between the AMD Radeon HD 6970 (VLIW) and the NVIDIA Kepler K20c (non-VLIW)?

5-12 Too much heterogeneity within GPUs. GPUs differ along three axes: architecture, vendor, and generation. Examples: AMD Radeon HD 6970 (VLIW), AMD Radeon HD 7970 (non-VLIW), NVIDIA C2075 (non-VLIW), NVIDIA Kepler K20c (non-VLIW).

13-19 Too much heterogeneity within GPUs. Problem: How do we simultaneously optimize for all GPUs and provide insight into machine-level behavior? Solution (Contribution): A methodology for determining portable optimizations for a class of algorithms on GPUs, with FFTs used as a case study.

20-26 FFT: a building block across disciplines. [Images of example applications; one recoverable example is the Shazam audio-identification app (shazam-app.png).]

27 Survey of FFT libraries for CPU and GPU hardware. [Chart: achieved GFLOPS (axis to 250; one labeled bar at 226 GFLOPS) for FFTW and AppleFFT on an Intel CPU, CUFFT and AppleFFT on the NVIDIA Tesla C2075 (GPU), and AMD APPML on the AMD Radeon HD 7970 (GPU).]

28 Outline: Forecast; Introduction; Background; Approach (Optimizations); Results & Analysis (optimizations in isolation, optimizations in concert, Shuffle); Conclusion.

29-36 Background (GPUs): the GPU memory hierarchy spans global memory, image memory, constant memory, local memory, and registers.

Table: Memory Read Bandwidth for the Radeon HD 6970
  Memory Unit    Read Bandwidth (TB/s)
  Registers      16.2
  Constant       5.4
  Local          2.7
  L1/L2 Cache    1.35 / 0.45
  Global         0.17
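For orientation, a minimal CUDA sketch (an illustration added to this transcription, not from the slides: CUDA syntax rather than the paper's OpenCL; kernel and variable names are invented) showing where each tier of the hierarchy appears in kernel code. OpenCL's local memory corresponds to CUDA's shared memory.

    __constant__ float scale;   // constant memory: small, cached, read-only

    __global__ void tiers(const float *in, float *out)   // in/out: global memory
    {
        __shared__ float tile[256];    // scratchpad (OpenCL "local memory")
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float r = in[i];               // r lives in a register
        tile[threadIdx.x] = r;         // stage through the scratchpad
        __syncthreads();
        out[i] = tile[threadIdx.x] * scale;
    }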

38-43 Approach ("Human Compilation"): evaluate optimizations first in isolation, then in concert. [Diagram: for each phase, a characterize / collect / measure cycle yielding an improvement estimate.]

44 Approach (Optimizations*)
System-level:
 1. Register Preloading (RP)
 2. Vectorized Access/{Vector, Scalar} Math (VAVM, VASM)
 3. Constant Memory Usage (CM)
 4. Common Subexpression Elimination (CSE)
 5. Inlining (IL)
 6. Coalesced Global Access Pattern (CGAP)
Algorithm-level:
 7. Naïve Transpose (LM-CM)
 8. Compute/Transpose via LM (LM-CC)
 9. Compute/No Transpose via LM (LM-CT)
Architecture- and Algorithm-level:
 10. Shuffle (SHFL)
* For a complete list of optimizations, refer to Table 4 in "Towards a Performance-Portable FFT Library for Heterogeneous Computing".

45-47 System-level Optimizations — 1. Register Preloading (RP)

Without Register Preloading:

    kernel void unoptimized(global float2 *buffer)
    {
        int index = /* ... */;
        buffer += index;
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
    }

With Register Preloading:

    kernel void optimized(global float2 *buffer)
    {
        int index = /* ... */;
        buffer += index;
        private float2 r0, r1, r2, r3;   // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
    }

48-53 System-level Optimizations — 2. Vectorized Access (float{2, 4, 8, 16}): elements a[0]..a[3] are loaded as one vector. With scalar math (VASM), arithmetic is expressed as float + float; with vector math (VAVM), as float4 + float4.
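To make the distinction concrete, a minimal CUDA-style sketch (an illustration, not from the slides: the paper uses OpenCL, where float4 arithmetic is built in; here an operator+ helper is supplied because CUDA C does not define one, and all names are invented):

    // Vectorized access: one 128-bit load brings in four floats at once.
    __device__ __forceinline__ float4 operator+(float4 a, float4 b)
    {
        return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }

    __global__ void vasm_vavm(const float4 *a, const float4 *b,
                              float4 *c, float4 *d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 va = a[i], vb = b[i];   // vectorized access (both variants)

        // VASM: vector access, scalar math -- explicit per-component ops.
        c[i] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);

        // VAVM: vector access, vector math -- one vector-typed expression
        // (maps onto VLIW lanes on AMD; scalarized by the compiler on NVIDIA).
        d[i] = va + vb;
    }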

54-59 Architecture- and Algorithm-Level Optimization — 10. Shuffle: enables efficient data communication between threads. Local memory is the old way; shuffle is the new way.
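A minimal CUDA contrast of the two ways (an illustration, not the authors' code: names are invented; it assumes full warps and a 256-thread block, and uses the CUDA 9+ __shfl_sync — Kepler-era code used __shfl(v, srcLane) without the mask):

    __global__ void swap_pairs(float *buf)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = buf[i];

        // Old way: exchange through a shared-memory (OpenCL "local") buffer.
        __shared__ float lm[256];
        lm[threadIdx.x] = v;
        __syncthreads();
        float via_lm = lm[threadIdx.x ^ 1];

        // New way (Kepler+): register-to-register exchange, no staging buffer
        // and no barrier.
        float via_shfl = __shfl_sync(0xffffffff, v, (threadIdx.x & 31) ^ 1);

        buf[i] = via_lm + via_shfl;   // keep both paths live for illustration
    }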

60-62 Architecture- and Algorithm-Level Optimization — 10. Shuffle: evaluated using matrix transpose, which is the data-communication step in FFT. We devised a shuffle-transpose algorithm consisting of horizontal (inter-thread, via shuffles) and vertical (intra-thread) rotations: Original → Step 1: horizontal rotation (between threads) → Step 2: vertical rotation (within a thread) → Step 3: horizontal rotation (between threads) → Transposed.
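A hedged CUDA sketch of the three-step idea for a 4x4 tile (an illustration, not the authors' code: function and variable names are invented; it assumes four cooperating lanes via __shfl_sync with width 4, available since CUDA 9 — Kepler-era code would use __shfl(var, srcLane, width)):

    // Thread `lane` (0..3) holds row `lane` of a 4x4 tile in reg[0..3];
    // afterwards it holds column `lane`.
    __device__ void transpose4x4(float reg[4])
    {
        const int lane = threadIdx.x % 4;
        float tmp[4];
        // Step 1: horizontal rotation -- register k arrives from lane (lane + k) % 4.
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            tmp[k] = __shfl_sync(0xffffffff, reg[k], (lane + k) % 4, 4);
        // Step 2: vertical rotation within the thread. NOTE: this runtime-
        // dependent register index spills to CUDA local memory -- exactly the
        // problem the NAIVE/DIV/SELP variants in the appendix slides address.
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            reg[k] = tmp[(k - lane + 4) % 4];
        // Step 3: horizontal rotation -- register k arrives from lane (k - lane) % 4.
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            reg[k] = __shfl_sync(0xffffffff, reg[k], (k - lane + 4) % 4, 4);
    }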

64-67 Results (Experimental Testbed): AMD Radeon HD 6970 (VLIW), AMD Radeon HD 7970 (non-VLIW), NVIDIA C2075 (non-VLIW), NVIDIA Kepler K20c (non-VLIW).

68 Results (Experimental Testbed)
Application setup: 1D FFT (batched), N = 16-, 64-, and 256-pts; 2D FFT (batched), N = 256x256.

GPU testbed (the transcription dropped the model numbers and figures; they are restored here from the vendors' public specifications):
  Device               Cores  Peak Performance (GFLOPS)  Peak Bandwidth (GB/s)  Architecture
  AMD Radeon HD 6970   1536   2703                       176                    VLIW
  AMD Radeon HD 7970   2048   3789                       264                    Non-VLIW
  NVIDIA Tesla C2075   448    1030                       144                    Non-VLIW
  NVIDIA Tesla K20c    2496   3520                       208                    Non-VLIW

69 Results — Optimizations in Isolation, Radeon 7970, 256-pts. [Chart: execution time (ms), broken into twiddles / transpose / cols, for each optimization applied in isolation: CM-K, CM-L, CGAP, LU, CSE, IL, VASM{4,8,16}, VAVM{2,4,8,16}, RP, LM-CM, LM-CC, LM-CT, and the baseline.]

Key (applies to slides 69-84): RP: Register Preloading; LM-{CM, CT, CC}: Local Memory {Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (float n); VAVM{n}: Vectorized Access & Vector Math (float n); CM: Constant Memory Usage; CGAP: Coalesced Global Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.

70 Results — Optimizations in Isolation, NVIDIA K20c, 256-pts. [Chart: same layout as slide 69.]

71 Results (Observations): 1. Use scalar operations (e.g., vectorized access with scalar math). 2. Focus on the memory subsystem (e.g., bus traffic).

72-73 Results (Bus Traffic) — Radeon 7970, 256-pts. [Charts: execution time (ms) and bus traffic (MB) per optimization, broken into twiddles / transpose / cols, for the same optimizations as slide 69.]

74-77 Results — Insight #1: The primary cost of FFT is data movement. Reduce bus traffic by (a) using optimizations that prefetch memory (RP) and (b) staging the transpose in scratchpad memory (LM-CM, LM-CC, LM-CT).

78 Results — Optimizations in Concert, AMD Radeon HD 6970, 256-pts. [Chart: execution time (ms), split into kernel load/store and kernel execution, for LM-CC, LM-CT, and RP+LM-CM under each of VASM2, VASM4, VAVM2, and VAVM4, all with CM.]

79 Results — Optimizations in Concert, NVIDIA Tesla K20c, 256-pts. [Chart: same layout as slide 78.]

80 Results — Insight #2: One sequence of optimizations performs well across GPUs: RP (Register Preloading), LM-CM (Local Memory, Communication Only), VASM2/4 (Vectorized Access, Scalar Math, float2/4), CM (Constant Memory Usage), and CGAP (Coalesced Global Access Pattern).

81-83 Speed-up with Shuffle — Overall performance: max. speedup (Amdahl's Law): …-fold; achieved speedup: …-fold. Surprise result — Goal: accelerate the communication (gray bar); Result: the computation (black bar) was also accelerated. [Chart: execution time (ms) for Shm vs. SELP (IP), split into computation and communication.]

84 Results — 2D FFT (N = 256x256). Optimizations: RP, LM-CM, VASM2, CM, CGAP. [Chart: GFLOPS, optimized vs. unoptimized, on the AMD Radeon HD 6970, AMD Radeon HD 7970, NVIDIA Tesla C2075, and NVIDIA Tesla K20c; labeled speedups include 11.76x, 2.05x, and 2.14x (one label lost in transcription).]

85 Conclusion (Thank You!)
Title: Towards a Performance-Portable FFT Library for Heterogeneous Computing.
Contribution: a methodology for determining portable optimizations for a class of algorithms; optimization principles for FFT on GPUs; an analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures.
Insight #1: The primary cost of FFT computation is data movement (i.e., it is memory bound).
Insight #2: One sequence of optimizations performs well across GPUs.
[1D FFT] …-fold improvement over the baseline GPU code; 9.1-fold improvement over multicore FFTW on a CPU with AVX. [Chart: as on slide 84.]

86 Appendix Slides

87 Background (Optimizing on GPUs)
1. RP (Register Preloading) - All data elements are first preloaded into the register file of the respective GPU. Computation is carried out solely in registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k). (A sketch follows this list.)
3. VASM2/4 (Vectorized Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by algorithmic reorganization.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-{K,L} (Constant Memory {Kernel, Literal}) - The twiddle-multiplication stage of the FFT is precomputed on the CPU and stored in GPU constant memory for fast lookup. CM-K passes the constants as a kernel argument; CM-L declares them as a static global in the OpenCL kernel.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions to save computation. It may lengthen register lifetimes and thereby increase register pressure.
9. IL (Function Inlining) - A function's body is inserted in place of a function call; used primarily for frequently called functions.
10. LU (Loop Unrolling) - A loop is explicitly rewritten as an identical sequence of statements, removing the overhead of loop-variable comparisons.
11. Shuffle - The transpose stage of the FFT is performed entirely in registers, eliminating the use of local memory. This optimization is only possible on NVIDIA Kepler GPUs (e.g., the Tesla K20c).
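A minimal CUDA sketch of the CGAP idea (an illustration, not from the slides; STRIDE and the kernel names are invented) contrasting coalesced and strided access:

    __global__ void copy_coalesced(const float *in, float *out)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        out[k] = in[k];   // thread k touches element k: adjacent threads hit
    }                     // adjacent addresses, so a warp needs few transactions

    #define STRIDE 32
    __global__ void copy_strided(const float *in, float *out)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        out[k * STRIDE] = in[k * STRIDE];  // neighbors touch distant addresses:
    }                                      // up to one transaction per thread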

88-89 S3: Constant Memory — fast cached lookup for frequently used data.

    constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        /* ... more sin/cos values ... */ };

Without constant memory (the -2.0 factor, dropped in transcription, is restored from the standard twiddle definition e^(-2*pi*i*n*k/N)):

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With constant memory:

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j + tid];

90 System-level Optimizations

91 Approach
System-level optimizations (applicable to any application):
 1. Register Preloading
 2. Vector Access/{Vector, Scalar} Arithmetic
 3. Constant Memory Usage
 4. Dynamic Instruction Reduction
 5. Memory Coalescing
 6. Image Memory
Algorithm-level optimizations [1]
[1] C. del Mundo, W. Feng. "Accelerating Fast Fourier Transform for Wideband Channelization." IEEE ICC, Budapest, Hungary, June 2013.

92-98 Algorithm-level optimizations — Transpose: elements across the diagonal are exchanged. [Diagram: a 4x4 matrix and its transposed matrix.]

99-102 Algorithm-level optimizations — 1. Naïve Transpose (LM-CM). [Diagram: threads t0-t3 exchange elements between the register file and local memory to go from the original to the transposed layout.]

103-110 Algorithm-level optimizations — 3. The pseudo-transpose (LM-CT). Idea: load data into local memory, then perform the computation on columns, then on rows. Advantage: skips the transpose step. Disadvantage: local memory has lower throughput than registers. [Diagram: original vs. transposed layout staged in local memory.]
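The algebra behind computing on columns and then rows is the standard Cooley-Tukey index-map decomposition (a sketch under the usual conventions, not reproduced from the slides): with N = N_1 N_2, input index n = N_2 n_1 + n_2, output index k = k_1 + N_1 k_2, and \omega_N = e^{-2\pi i/N},

    X_{k_1 + N_1 k_2}
      = \sum_{n_2=0}^{N_2-1} \omega_{N_2}^{\,n_2 k_2} \, \omega_N^{\,n_2 k_1}
        \left( \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2} \, \omega_{N_1}^{\,n_1 k_1} \right)

The inner sums are size-N_1 FFTs over the columns, followed by a twiddle multiplication by \omega_N^{n_2 k_1}; the outer sum is a size-N_2 FFT over the rows. Staging the intermediate in local memory lets threads write columns and read rows directly, which is why the explicit transpose pass can be skipped.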

111 Architecture-level Optimization: Shuffle — mapping the software (transpose) onto the hardware (the K20c and its shuffle mechanism). [Diagram: NVIDIA Kepler K20c shuffle mechanism.]

112-113 Results (Shuffle) — Bottleneck: intra-thread data movement. [Diagram: threads t0-t3 perform horizontal rotations (via shuffle) and a vertical rotation (intra-thread, Stage 2) through the register file.]

Code 1 (NAIVE), 15x:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

114-118 Results (Shuffle)
General strategies: registers are fast; CUDA local memory is slow. The compiler is forced to place data into CUDA local memory if array indices cannot be determined at compile time — which is exactly what the NAIVE rotation does.

Code 1 (NAIVE), 15x:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV), 6%: replaces the dynamic indices with per-tid branches, at the cost of divergence:

    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP), 44%: out-of-place conditional selects with compile-time register indices (SELP refers to the PTX select instruction):

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

119-120 Results (Shuffle) — [Chart: execution time (ms), black = computation, gray = communication; X% = improvement for communication. At 37.5% occupancy: Naive 15x, DIV 6%, SELP (IP) 17%, SELP (OOP) 44%. At 50% occupancy: SELP (IP) 14%. Axis to 4.4 ms.]


More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Cache-oblivious Programming

Cache-oblivious Programming Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

CS560 Lecture Parallel Architecture 1

CS560 Lecture Parallel Architecture 1 Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Architecture-Aware Mapping and Optimization on a 1600-Core GPU 2011 IEEE 17th International Conference on Parallel and Distributed Systems Architecture-Aware Mapping and Optimization on a 1600-Core GPU Mayank Daga, Thomas Scogland, Wu-chun Feng Department of Computer

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information