Towards a Performance-Portable FFT Library for Heterogeneous Computing
Towards a Performance-Portable FFT Library for Heterogeneous Computing
Carlo C. del Mundo and Wu-chun Feng
Dept. of ECE and Dept. of CS, Virginia Tech
Slides Updated: 5/19/2014
Forecast (Problem)
How do we find a portable set of optimizations for GPUs? Can one set be performance-portable across the AMD Radeon HD 6970 (VLIW) and the NVIDIA Kepler K20c (non-VLIW)?
Follow along at: goo.gl/1fs9g7
Too much heterogeneity within GPUs: they differ in architecture, vendor, and generation.
- AMD Radeon HD 6970 (VLIW)
- AMD Radeon HD 7970 (non-VLIW)
- NVIDIA C2075 (non-VLIW)
- NVIDIA Kepler K20c (non-VLIW)
Too much heterogeneity within GPUs
Problem: How do we simultaneously optimize for all GPUs and provide insight on machine-level behavior?
Solution (Contribution): A methodology for determining portable optimizations for a class of algorithms on GPUs, with FFTs used as a case study.
FFT: a building block across disciplines
Survey of FFT libraries for CPU and GPU hardware
Chart (GFLOPS): FFTW and AppleFFT on an Intel CPU; CUFFT and AppleFFT on the NVIDIA Tesla C2075 (GPU); AMD APPML on the AMD Radeon HD 7970 (GPU).
Outline
- Forecast
- Introduction
- Background
- Approach (Optimizations)
- Results & Analysis: optimizations in isolation; optimizations in concert; shuffle
- Conclusion
Background (GPUs)
GPU Memory Hierarchy: Global Memory, Image Memory, Constant Memory, Local Memory, Registers.

Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17
Approach ("Human Compilation")
Optimizations in isolation: characterize, collect, measure, then assess improvement.
Optimizations in concert: combine the winners and assess improvement again.
Approach (Optimizations*)
System-level:
1. Register Preloading (RP)
2. Vectorized Access/{Vector,Scalar} Math (VAVM, VASM)
3. Constant Memory Usage (CM)
4. Common Subexpression Elimination (CSE)
5. Inlining (IL)
6. Coalesced Global Access Pattern (CGAP)
Algorithm-level:
7. Naïve Transpose (LM-CM)
8. Compute/Transpose via LM (LM-CC)
9. Compute/No Transpose via LM (LM-CT)
Architecture- and Algorithm-level:
10. Shuffle (SHFL)
* For a complete list of optimizations, refer to Table 4 in "Towards a Performance-Portable FFT Library for Heterogeneous Computing".
System-level Optimizations
1. Register Preloading (RP)

Without Register Preloading:

    kernel void unoptimized(global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
    }

With Register Preloading:

    kernel void optimized(global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        private float2 r0, r1, r2, r3;  // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[4]; r2 = buffer[8]; r3 = buffer[12];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
    }
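The effect of register preloading can be sketched in plain C: the four complex inputs are copied into locals (a stand-in for GPU registers) before the 4-point butterfly runs entirely on them. The `fft4_preloaded` function and the `c32` type are illustrative, not the paper's actual `FFT4_in_order_output`.

```c
#include <assert.h>

typedef struct { float re, im; } c32;  /* complex float, stand-in for float2 */

/* 4-point DFT computed entirely on locals ("registers"): after the four
 * explicit preloads, no memory traffic occurs until the final stores. */
static void fft4_preloaded(const c32 *buf, int stride, c32 out[4])
{
    /* explicit preloads into register candidates */
    c32 r0 = buf[0], r1 = buf[stride], r2 = buf[2 * stride], r3 = buf[3 * stride];

    /* radix-2 butterflies on registers only */
    c32 a  = { r0.re + r2.re, r0.im + r2.im };
    c32 b  = { r0.re - r2.re, r0.im - r2.im };
    c32 c  = { r1.re + r3.re, r1.im + r3.im };
    c32 d  = { r1.re - r3.re, r1.im - r3.im };
    c32 dj = { d.im, -d.re };             /* d multiplied by -i */

    out[0].re = a.re + c.re;  out[0].im = a.im + c.im;
    out[1].re = b.re + dj.re; out[1].im = b.im + dj.im;
    out[2].re = a.re - c.re;  out[2].im = a.im - c.im;
    out[3].re = b.re - dj.re; out[3].im = b.im - dj.im;
}
```

The stride parameter mirrors the kernel's strided `buffer[0], buffer[4], buffer[8], buffer[12]` access.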
System-level Optimizations
2. Vector Access (float{2, 4, 8, 16})
A wide load fetches a[0] a[1] a[2] a[3] at once. The arithmetic can then be either:
- Scalar Math (VASM): float + float
- Vector Math (VAVM): float4 + float4
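The VASM/VAVM split can be sketched in C: both variants perform one `float4`-wide load; VASM then adds component by component, while VAVM expresses the add on the whole vector (which a compiler may map to a SIMD instruction). The `float4` struct and function names here are illustrative, not OpenCL's built-in vector types.

```c
#include <string.h>

typedef struct { float x, y, z, w; } float4;  /* illustrative 4-wide vector */

/* VASM: one wide (vectorized) load, then scalar component-wise arithmetic. */
static float4 add_vasm(const float *a, const float *b)
{
    float4 va, vb, r;
    memcpy(&va, a, sizeof va);   /* vectorized access */
    memcpy(&vb, b, sizeof vb);
    r.x = va.x + vb.x;           /* scalar math, one lane at a time */
    r.y = va.y + vb.y;
    r.z = va.z + vb.z;
    r.w = va.w + vb.w;
    return r;
}

/* VAVM: same wide load, arithmetic written as one whole-vector expression. */
static float4 add_vavm(const float *a, const float *b)
{
    float4 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    return (float4){ va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w };
}
```

Both forms compute the same values; the paper's results concern which form maps better to a given GPU's execution units.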
Architecture- and Algorithm-Level Optimization
10. Shuffle
Enables efficient data communication: local memory (the old way) vs. shuffle (the new way).
Architecture- and Algorithm-Level Optimization
10. Shuffle
We evaluate shuffle using matrix transpose, a data communication step in FFT.
Devised a shuffle transpose algorithm consisting of horizontal (inter-thread shuffle) and vertical (intra-thread) rotations:
Original
Step 1: Horizontal rotation (between threads)
Step 2: Vertical rotation (within a thread)
Step 3: Horizontal rotation (between threads)
Transposed
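The three rotation steps can be simulated in plain C, with a 4x4 array standing in for four threads' register files. The particular rotation amounts below (fetch from lane t+k, rotate registers by the thread index, fetch from lane k-t) are one consistent choice that yields a transpose; the exact amounts used in the paper's algorithm may differ.

```c
enum { W = 4 };  /* 4 threads x 4 registers, illustrative */

/* regs[t][k]: register k of thread t. Transpose via two inter-thread
 * ("horizontal") rotations and one intra-thread ("vertical") rotation,
 * mirroring the Step 1 / Step 2 / Step 3 structure on the slide. */
static void shuffle_transpose(float regs[W][W])
{
    float tmp[W][W], tmp2[W][W];
    int t, k;

    /* Step 1: horizontal - thread t fetches register k from lane (t+k)%W */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            tmp[t][k] = regs[(t + k) % W][k];

    /* Step 2: vertical - thread t rotates its own registers by t */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            tmp2[t][k] = tmp[t][(k - t + W) % W];

    /* Step 3: horizontal - thread t fetches register k from lane (k-t)%W */
    for (t = 0; t < W; ++t)
        for (k = 0; k < W; ++k)
            regs[t][k] = tmp2[(k - t + W) % W][k];
}
```

On a real Kepler GPU, Steps 1 and 3 would each be a warp shuffle and Step 2 a local register permutation; no local memory is touched.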
Results (Experimental Testbed)
AMD Radeon HD 6970 (VLIW), AMD Radeon HD 7970 (non-VLIW), NVIDIA C2075 (non-VLIW), NVIDIA Kepler K20c (non-VLIW).
Results (Experimental Testbed)
Application Setup: 1D FFT (batched), N = 16-, 64-, and 256-pts; 2D FFT (batched), N = 256x256.
GPU Testbed:
Device                Architecture
AMD Radeon HD 6970    VLIW
AMD Radeon HD 7970    non-VLIW
NVIDIA Tesla C2075    non-VLIW
NVIDIA Tesla K20c     non-VLIW
Results (Optimizations in Isolation)
Radeon 7970, 256-pts: execution time (ms) per optimization, broken into twiddles, transpose, and column FFTs (bar chart).
RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (float n); VAVM{n}: Vectorized Access & Vector Math (float n); CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.
Results (Optimizations in Isolation)
NVIDIA K20c, 256-pts: execution time (ms) per optimization, broken into twiddles, transpose, and column FFTs (bar chart).
Results (Observations)
1. Use scalar operations (e.g., vector access/scalar math).
2. Focus should be on the memory subsystem (e.g., bus traffic).
Results (Bus Traffic)
Radeon 7970, 256-pts: execution time (ms) and bus traffic (MB) per optimization, each broken into twiddles, transpose, and column FFTs (bar charts).
Results (Insight #1)
The primary cost of FFT is in data movement. Reduce bus traffic by:
- using optimizations that prefetch memory (RP)
- staging the transpose in scratchpad memory (LM-CM, LM-CC, LM-CT)
Results (Optimizations in Concert)
AMD Radeon HD and NVIDIA Tesla K20c, 256-pts: execution time (ms), split into kernel load/store and kernel execution, for LM-CC, LM-CT, and RP+LM-CM under VASM2, VASM4, VAVM2, and VAVM4, each with CM (bar charts).
Results (Insight #2)
One sequence of optimizations performs well across GPUs. These optimizations are:
- RP (Register Preloading)
- LM-CM (Local Memory, Communication Only)
- VASM2/4 (Vector Access, Scalar Math, float2/4)
- CM (Constant Memory Usage)
- CGAP (Coalesced Global Access Pattern)
Speed-up with Shuffle
Overall Performance: maximum speedup bound (Amdahl's Law) vs. achieved speedup.
Surprise result. Goal: accelerate the communication (the gray bar). Result: the computation also accelerated (the black bar). Chart: computation vs. communication execution time (ms) for Shm and SELP (IP).
Results: 2D FFT (N = 256x256)
Optimizations: RP, LM-CM, VASM2, CM, CGAP.
Optimized vs. unoptimized GFLOPS on AMD Radeon HD 6970, AMD Radeon HD 7970, NVIDIA Tesla C2075, and NVIDIA Tesla K20c; observed speedups include 11.76x, 2.05x, and 2.14x (bar chart).
Conclusion (Thank You!)
Title: Towards a Performance-Portable FFT Library for Heterogeneous Computing
Contribution: A methodology for determining portable optimizations for a class of algorithms.
Optimization principles for FFT on GPUs: an analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures.
Insight #1: The primary cost of FFT computation is in data movement (i.e., the FFT is memory bound).
Insight #2: One sequence of optimizations performs well across GPUs.
[1D FFT] Improvement over the baseline GPU version; 9.1-fold improvement over multi-core FFTW with AVX on the CPU.
Chart: optimized vs. unoptimized GFLOPS on AMD Radeon HD 6970, AMD Radeon HD 7970, NVIDIA Tesla C2075, and NVIDIA Tesla K20c (2.14x).
Appendix Slides
Background (Optimizing on GPUs)
1. RP (Register Preloading): All data elements are first preloaded into the register file of the respective GPU. Computation is carried out solely on registers.
2. CGAP (Coalesced Global Access Pattern): Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}): Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only): Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose): Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication): All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-{K,L} (Constant Memory {Kernel, Literal}): The twiddle multiplication stage of FFT is precomputed on the CPU and stored in the GPU's constant memory for fast lookup. CM-K passes constant memory as a kernel argument, while CM-L uses a static global declaration in the OpenCL kernel.
8. CSE (Common Subexpression Elimination): A traditional optimization that collapses identical expressions in order to save computation. It may lengthen register live ranges, thereby increasing register pressure.
9. IL (Function Inlining): A function's body is inserted in place of a function call. It is used primarily for frequently called functions.
10. LU (Loop Unrolling): A loop is explicitly rewritten as an identical sequence of statements without the overhead of loop-variable comparisons.
11. Shuffle: The transpose stage of FFT is performed entirely in registers, eliminating the use of local memory. This optimization is only possible on NVIDIA Kepler GPUs (e.g., Tesla K20c).
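Two of the classical compiler optimizations above, CSE and loop unrolling, can be made concrete with a toy example (illustrative code, not from the paper; the function names are hypothetical):

```c
/* Loop version: per-iteration compare/increment overhead on every pass. */
static float dot4_plain(const float *x, const float *y)
{
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)   /* loop-variable comparison each iteration */
        acc += x[k] * y[k];
    return acc;
}

/* LU: the loop rewritten as an identical straight-line sequence. */
static float dot4_unrolled(const float *x, const float *y)
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}

/* CSE: the common subexpression a*b is computed once into a temporary
 * instead of twice; the temporary lives longer (register pressure). */
static float cse_example(float a, float b, float c)
{
    float ab = a * b;             /* computed once instead of twice */
    return (ab + c) * (ab - c);
}
```

On GPUs these transformations are usually applied by hand in the kernel source, since the paper treats them as explicit optimization knobs.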
S3: Constant Memory
Fast cached lookup for frequently used data.

Without Constant Memory:

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory:

    constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
                                     (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
                                     ... more sin/cos values };

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
System-level Optimizations
Approach
System-level Optimizations (applicable to any application):
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory
Algorithm-level Optimizations [1]
[1] C. del Mundo, W. Feng, "Accelerating Fast Fourier Transform for Wideband Channelization," IEEE ICC, Budapest, Hungary, June.
Algorithm-level optimizations
Transpose: elements across the diagonal are exchanged (4x4 matrix vs. transposed matrix).
Algorithm-level optimizations
1. Naïve Transpose (LM-CM): Original vs. transposed. Threads t0..t3 move elements from the register file through local memory and back.
Algorithm-level optimizations
3. The pseudo transpose (LM-CT)
Idea: Load data into local memory; perform computation on columns, then rows.
Advantage: Skips the transpose step.
Disadvantage: Local memory has lower throughput than registers.
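The LM-CT idea of processing columns in place, rather than transposing first, can be sketched with a toy stage: applying the same operation down each column via a strided index gives the same result as transposing and then sweeping rows. The `stage_sum` operation below is a hypothetical stand-in for a 4-point FFT pass.

```c
enum { N4 = 4 };

/* Toy stage: sum of 4 values (stand-in for a 4-point FFT pass). */
static float stage_sum(const float *p, int stride)
{
    return p[0] + p[stride] + p[2 * stride] + p[3 * stride];
}

/* Row pass: contiguous, unit-stride access. */
static void pass_rows(const float m[N4 * N4], float out[N4])
{
    for (int r = 0; r < N4; ++r)
        out[r] = stage_sum(&m[r * N4], 1);
}

/* Column pass WITHOUT a transpose: the same stage, stride-N4 access.
 * This is the reorganization that lets LM-CT skip the transpose step. */
static void pass_cols_no_transpose(const float m[N4 * N4], float out[N4])
{
    for (int c = 0; c < N4; ++c)
        out[c] = stage_sum(&m[c], N4);
}
```

The trade-off on a GPU is exactly the slide's: strided local-memory access avoids the data movement of a transpose but runs at local-memory rather than register throughput.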
Architecture-level Optimization: Shuffle
Software (the transpose) mapped onto hardware (the NVIDIA Kepler K20c shuffle mechanism).
Results (Shuffle)
Bottleneck: intra-thread data movement. The stages alternate horizontal (shuffle) and vertical (intra-thread) movement through the register file.

Code 1 (NAIVE), 15x:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

General strategies: Registers are fast; CUDA local memory is slow. The compiler is forced to place data into CUDA local memory if array indices cannot be determined at compile time.
Results (Shuffle)
Code 2 (DIV), 6%: the rotation unrolled per thread ID; each if/else-if branch diverges.

    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP), 44%: the same rotation as branch-free selects.

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
src_registers[3] : dst_registers[2]; Follow along 83 at: dst_registers[3] goo.gl/1fs9g7 = (tid == 3)? src_registers[0] : dst_registers[3]; 118
119 Results (Shuffle)

[Bar chart: execution time (ms, up to 4.4) split into computation and communication; X% = improvement in communication time.]

    Shm (shared-memory baseline)
    Naive: 15x
    37.5% occupancy: DIV 6%, SELP (IP) 17%, SELP (OOP) 44%
    50% occupancy: SELP (IP) 14%

Follow along at: goo.gl/1fs9g7