Introduction to HPC. Lecture 21
- Horace Gibson
1 443 Introduction to HPC, Lecture 21. Dept. of Computer Science. Fast Fourier Transform
2 FFT followed by inverse FFT, DIF then DIT: use inverse twiddles for the inverse FFT; no bit reversal necessary. FFT followed by inverse FFT, DIT then DIF: use inverse twiddles for the inverse FFT; no bit reversal necessary, and the twiddle allocation is the same for forward and inverse.
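The pairing above can be checked with a small sketch. This is my own minimal radix-2 illustration, not the lecture's code: a decimation-in-frequency forward FFT leaves its output in bit-reversed order, and a decimation-in-time inverse FFT with conjugated (inverse) twiddles consumes exactly that order, so the round trip needs no bit-reversal pass.

```python
import cmath

def fft_dif(a):
    # In-place radix-2 decimation-in-frequency FFT.
    # Natural-order input; bit-reversed-order output.
    n = len(a)
    span = n
    while span > 1:
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)
                u = a[start + k]
                v = a[start + k + half]
                a[start + k] = u + v
                a[start + k + half] = (u - v) * w
        span = half

def ifft_dit(a):
    # In-place radix-2 decimation-in-time inverse FFT with inverse
    # (conjugated) twiddles. Bit-reversed-order input; natural-order
    # output, scaled by 1/n at the end.
    n = len(a)
    span = 2
    while span <= n:
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(2j * cmath.pi * k / span)
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
        span *= 2
    for i in range(n):
        a[i] /= n
```

Running `fft_dif` and then `ifft_dit` on the same buffer restores the original data without any explicit reordering in between.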
3 FFT followed by inverse FFT: DIF followed by DIT, or DIT followed by DIF, have the same twiddle allocation, which is important in parallel computation. DIT followed by DIT, or DIF followed by DIF, have different twiddle allocations for the forward and inverse FFT, which is a problem in parallel computation (more twiddle storage than necessary). We illustrated this for normal input order; the same holds for bit-reversed input order. Since bit reversal is its own inverse, no explicit bit reversal is necessary to restore input order for a forward FFT followed by an inverse FFT.
FFT: arithmetic and memory operations. [Table: per-radix counts of arithmetic operations (add/sub, multiply, total) and storage references (data, twiddles, total) for radix-2, radix-4, and radix-8 FFTs of P points and p stages; the totals per point drop as the radix increases.]
4 Higher-radix FFTs: for modern architectures the most significant benefit is the reduced memory load. The ideal radix is the largest possible for each level of the memory hierarchy, i.e., in the innermost loop use the radix that makes maximum use of the register file, then the radix for maximum use of the first cache level for the next loop, etc.
Parallel FFT, data allocation. Example: power-of-two data set, power-of-two processors; consecutive versus cyclic data allocation. [Figure: assignment of data elements to processors under the two allocations.]
Communication:
Input order     Consecutive       Cyclic
Normal          first n stages    last n stages
Bit-reversed    last n stages     first n stages
5 Permutation-based FFT: the FFT computations are always carried out from msb to lsb in the data index. To make all computations local, the number of permutations depends on the data allocation (and on whether the FFT is self-sorting). Exchanges affect memory access strides, both in carrying out the permutations and in carrying out the butterfly computations.
Parallel unordered FFT: communication requirements. [Table]
6 Twiddle factor allocation. The four-step FFT (assumes fewer processors than √N): factor N into two equal factors, N = √N × √N; then
1. Compute the first rank: √N FFTs of size √N.
2. Multiply with twiddle factors.
3. Transpose the √N × √N matrix.
4. Compute the last rank: √N FFTs of size √N.
[Figure: an N-point example built from size-4 FFT blocks.]
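The four steps can be written out directly. The sketch below is my own illustration, not the lecture's code: it uses a naive DFT in place of the small row and column FFTs and allows any rectangular factorization N = N1 * N2 (the slide's equal-factor case is N1 = N2 = √N).

```python
import cmath

def dft(x):
    # Plain O(n^2) DFT standing in for the small row/column FFTs.
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    # Four-step FFT of a length n = n1*n2 sequence, viewed as an
    # n1 x n2 array with element x[j1 + n1*j2] in row j1, column j2.
    n = n1 * n2
    # Step 1 (first rank): n1 transforms of size n2.
    a = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Step 2: multiply by the twiddle factors w_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            a[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: transpose the n1 x n2 array to n2 x n1.
    at = [[a[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # Step 4 (last rank): n2 transforms of size n1.
    b = [dft(row) for row in at]
    # Output element X[k2 + n2*k1] is b[k2][k1].
    return [b[k % n2][k // n2] for k in range(n)]
```

The transpose in step 3 is exactly the communication/locality step the following slides focus on.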
7 Square transpose. [Plots: square transpose performance.]
Xeon Clovertown: memory .6 GB/s, 88 cycles; L1 cache 3K/core, 64B lines, 8-way, 3 cycles; L2 cache 8M/dual, 64B lines, 6-way, 4 cycles.
Opteron 85: memory 6.4 GB/s, 5 cycles; L1 cache 64K/core, 64B lines, way-assoc., 3 cycles; L2 cache M/core, 64B lines, 6-way, cycles.
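The transpose performance on these machines is dominated by cache behavior. As a generic illustration of the standard remedy (mine, not the measured code), a transpose can visit the matrix in cache-sized blocks so each pair of blocks stays resident while it is swapped:

```python
def blocked_transpose(a, n, b):
    # In-place transpose of an n x n row-major matrix stored as a flat
    # list, visiting b x b blocks so each block pair fits in cache.
    for ii in range(0, n, b):
        for jj in range(ii, n, b):
            for i in range(ii, min(ii + b, n)):
                # On diagonal blocks, only swap above the diagonal so
                # each element pair is exchanged exactly once.
                lo = max(jj, i + 1)
                for j in range(lo, min(jj + b, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```

The block size b would be tuned to the cache line size and capacity of the target machine, which is why the two processors above behave so differently.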
8 Parallel FFT on binary n-cube. Pipelined FFT on n-cube: the first four steps of a pipelined, in-place FFT on a 3-cube. [Table: for each time step, memory location, and processor, the entry is the network dimension used, starting with the dimension corresponding to the msb of the processor address.]
9 Pipelined bisection FFT on n-cube: the first four steps of a pipelined, in-place FFT on a 3-cube. [Table: network dimension per time step, memory location, and processor.]
Pipelined, one-dimensional, in-place FFT on a cube: performance of a pipelined, one-dimensional, in-place, radix-2 FFT on a cube as a function of data allocation (mesh shape of the cube).
10 Pipelined, two-dimensional, in-place FFT on a cube: performance of a pipelined, two-dimensional, in-place, radix-2 FFT on a cube as a function of data allocation (mesh shape of the cube).
FFT on binary cube, conclusion: by pipelining the communication for successive FFT stages, communication time depends essentially only on the data volume (log P becomes an additive factor instead of a multiplicative one for transforms of size P).
11 Some node performance results. Stride has a big impact on performance; the impact depends on cache size and cache replacement policy. Maximum efficiency is architecture-dependent. Codelet performance is highly variable.
12 Characteristics of target architectures:

Processor         Clock      Peak          Cache structure
Intel Pentium IV  .8 GHz     .8 GFlops     L1: 8K+8K, L2: 56K
AMD Athlon        .4 GHz     .4 GFlops     L1: 64K+64K, L2: 56K
PowerPC G4        867 MHz    867 MFlops    L1: 3K+3K, L2: 56K, L3: M
Intel Itanium     8 MHz      3. GFlops     L1: 6K+6K, L2: 9K, L3: 4M
IBM Power3/4      375 MHz    .5 GFlops     L1: 64K+3K, L2: 6M
HP PA 8x          75 MHz     3 GFlops      L1: .5M+.75M
Alpha EV67/       MHz        .66 GFlops    L1: 64K+64K, L2: 4M
MIPS Rx           5 MHz      GFlop         L1: 3K+3K, L2: 4M

Intel PIV .8 GHz: Gb RDRAM, bus speed 4 MHz. OS: Red Hat Linux 7.; compiler: gcc 2.96; options: -O3 -malign-double -fomit-frame-pointer. Complex-to-complex, out-of-place, double-precision transforms. Codelet sizes: 6, 3, 64. Strides: [6]. Performance: absolute, 5*n*log2(n)/t_CPU in FLOPS; relative, absolute/(peak performance of the processor). Peak performance: .8 GFLOPS.
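The performance metric used in all of these measurements is the conventional FFT operation-count model, 5 n log2(n) floating-point operations per length-n complex transform, divided by CPU time. As a small helper (the function name is mine):

```python
import math

def fft_mflops(n, t_cpu):
    # Conventional FFT work model: 5*n*log2(n) floating-point
    # operations per length-n complex transform, over t_cpu seconds.
    flops = 5 * n * math.log2(n)
    return flops / t_cpu / 1e6  # report in MFLOPS

# Example: a 1024-point transform measured at 40 microseconds.
rate = fft_mflops(1024, 40e-6)
```

Dividing this absolute rate by the processor's peak rate gives the relative (efficiency) numbers quoted on the following slides.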
13 Intel PIV .8 GHz codelet performance. Theoretical peak performance: .8 GF. Caches: L1: 8K+8K, 4-way assoc., least-recently-used, write-back/write-through, CL 64B, 6-cycle latency; L2: 56K, 8-way assoc., least-recently-used, write-back, CL 8B, 7-cycle latency (Pentium 4).
Intel PIV .8 GHz codelet efficiency (fraction of peak); same caches as above.
14 Intel PIV .8 GHz codelet performance and codelet efficiency. Best/worst > 5; radix-8/radix-2 ~ 4. Compare with matrix multiplication: DGEMM typically reaches 90+% of peak! How good is 6x% of peak?
15 AMD Athlon .4 GHz: 5 MB DDR RAM, bus speed 66 MHz. OS: Red Hat Linux 7.; compiler: gcc 2.96; options: -O3 -malign-double -fomit-frame-pointer. Complex-to-complex, out-of-place, double-precision transforms. Codelet sizes: 6, 3, 64. Strides: [6]. Performance: absolute, 5*n*log2(n)/t_CPU in FLOPS; relative, absolute/(peak performance). Peak performance: .4 GFLOPS.
AMD Athlon .4 GHz codelet performance: theoretical peak performance .4 GF; caches: L1: 64K+64K, 2-way assoc.; L2: 56K.
16 AMD Athlon .4 GHz codelet efficiency (fraction of peak); caches: L1: 64K+64K, L2: 56K. AMD Athlon .4 GHz codelet performance.
17 AMD Athlon .4 GHz codelet efficiency: best/worst > 7; radix-8/radix-2 ~ 4.
Apple PowerPC G4 867 MHz: . GB SDRAM, bus speed 33 MHz. OS: Mac OS X version ..4; compiler: Apple gcc 2.96; options: -O3 -fomit-frame-pointer. Complex-to-complex, out-of-place, double-precision transforms. Codelet sizes: 5, 3, 36, 45, 64. Strides: [6]. Performance: absolute, 5*n*log2(n)/t_CPU in FLOPS; relative, absolute/(peak performance). Peak performance: 867 MFLOPS.
18 Apple PowerPC G4 867 MHz codelet performance: theoretical peak performance .867 GF; caches: L1: 3K+3K; L2: 56K; L3: M.
Apple PowerPC G4 867 MHz codelet efficiency (fraction of peak); same caches.
19 Apple PowerPC G4 867 MHz codelet efficiency: radix-4/radix-2 ratio and best/worst < 5.
Radix-4 codelet efficiencies, 32-bit: Intel PIV .8 GHz, AMD Athlon .4 GHz, PowerPC G4 867 MHz.
20 Radix-8 codelet efficiencies, 32-bit: Intel PIV .8 GHz, AMD Athlon .4 GHz, PowerPC G4 867 MHz.
UHFFT 32-bit performance.
21 IA-32 architecture comparison:

Architecture  Best codelet  Best/worst (max performance, codelet)  Best/worst (one codelet, stride impact)  Best efficiency
Pentium IV    Radix-8       ~4                                     >5                                       .67
Athlon        Radix-8       ~4                                     >7                                       .85
PowerPC       Radix-4       ~3                                     ~5                                       .9

64-bit architecture experiences.
22 Intel Itanium 8 MHz: GB SDRAM, bus speed 33 MHz. OS: HP-UX 11i version .5; compiler: gcc 2.96; options: -O -fomit-frame-pointer -funroll-all-loops. Complex-to-complex, out-of-place, double-precision transforms. Codelet sizes: 5, 3, 36, 45, 64. Strides: [6]. Performance: absolute, 5*n*log2(n)/t_CPU in FLOPS; relative, absolute/(peak performance). Peak performance: 3. GFLOPS.
Intel Itanium 8 MHz: inherent parallelism in IA-64; multiple FPUs with fused multiply-add instructions; the large number of registers provides good support for ILP; a relatively small L1 cache, so large codelets do not perform very well; a complex scheduling problem, since cache reuse and parallelism have opposite requirements.
23 Intel Itanium 8 MHz codelet performance: best and worst codelets. [Two plots.]
24 Intel Itanium 8 MHz codelet efficiency: radix-16/radix-2 ~ 6; best/worst ~ minimum.
UHFFT codelet performance, stride impact on performance: Itanium (size-3 FFT) a factor of 3+, UltraSparc III a factor of 5+, between max performance (best stride) and min performance (worst stride).
25 UHFFT codelet performance comparison, slides 25 through 31: forward and inverse ('i') codelets, one slide per radix from radix 2 up through radix 64, each measured on Itanium 8 MHz, Itanium 9 MHz, and UltraSparc III 75 MHz. [Plot legends only.]
32 Compaq Alpha 833 MHz: GB SDRAM, bus speed 33 MHz. OS: Tru64 Unix; compiler: gcc 2.96; options: -O -fomit-frame-pointer -funroll-all-loops. Complex-to-complex, out-of-place, double-precision transforms. Codelet sizes: 5, 3, 36, 45, 64. Strides: [6]. Performance: absolute, 5*n*log2(n)/t_CPU in FLOPS; relative, absolute/(peak performance). Peak performance: .66 GFLOPS.
Compaq Alpha 833 MHz codelet performance. [Plots.]
33 IBM Power3 MHz codelet performance: L1 is 8-way set-associative (data and instruction caches); L2 is direct-mapped and very vulnerable to cache thrashing.
IBM Power3 MHz execution plan comparison. Example: size 6; 5 MFLOPS plan.
34 IBM Power3 MHz execution plan comparison: n = 5 (PFA plan) versus the "MFLOPS" plan.
IBM Power3 MHz PFA performance: PFA sizes; 8 Mflops peak.
35 IBM Power3 MHz performance: power-of-two sizes; 8 Mflops peak. IBM Power3 MHz performance: 8 Mflops peak. [Plots.]
36 Compaq Alpha 833 MHz performance: .66 Gflops peak.
HP PA 8, 8 MHz: one-level, MB, direct-mapped cache; the performance drops when the transform no longer fits in the cache.
37 HP PA 8, 8 MHz: radix-8, forward FFT, codelet performance. HP PA 8, 8 MHz: radix-16, forward FFT, codelet performance.
38 HP PA 8, 8 MHz codelet performance: 7 Mflops peak.
HP PA8, 8 MHz execution plan performance. [Table: rank, plan, CPU time, MFLOPS, and relative performance for the candidate execution plans.]
39 HP PA8, 8 MHz execution plan performance (MFLOPS). [Plots.]
40 DFT on real sequences (real FFT). The transform X_k of a real sequence x_k is conjugate symmetric, i.e., X_k = conj(X_{P-k}).
Transformation of pairs of real sequences y_k and z_k: form x_k = y_k + i z_k, then compute X, the transform of x. Let Y be the transform of y and Z the transform of z. Then X = Y + iZ, since the DFT is linear. Hence X_k = Y_k + i Z_k and X_{P-k} = Y_{P-k} + i Z_{P-k}, so conj(X_{P-k}) = conj(Y_{P-k}) - i conj(Z_{P-k}) = Y_k - i Z_k. Then
Y_k = (X_k + conj(X_{P-k}))/2 and Z_k = (X_k - conj(X_{P-k}))/(2i).
One complex FFT plus one post-processing step involving an index-reversal operation transforms two real sequences.

DFT on real sequences (real FFT). Transform of a single real sequence:
X_m = sum_{j=0}^{P-1} w_P^{mj} x_j
    = sum_{k=0}^{P/2-1} w_P^{2mk} x_{2k} + sum_{k=0}^{P/2-1} w_P^{m(2k+1)} x_{2k+1}
    = sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k} + w_P^m sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k+1}
Thus X_m = Y_m + w_P^m Z_m for m in [0, P), with w_P = e^{-2 pi i / P}.
Form v_k = x_{2k} + i x_{2k+1}, then compute V, the transform of v. Let Y be the transform of the even samples x_{2k} and Z the transform of the odd samples x_{2k+1}. Then V = Y + iZ, since the DFT is linear. Note that V is of size P/2.
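The two-real-sequences trick is short enough to write out in full. This is my own sketch of the separation step, with a naive DFT standing in for a complex FFT and function names of my choosing:

```python
import cmath

def dft(x):
    # Naive reference DFT, standing in for a complex FFT.
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def two_real_dfts(y, z):
    # Pack y and z into x = y + i*z, transform once, then separate:
    #   Y_k = (X_k + conj(X_{P-k})) / 2
    #   Z_k = (X_k - conj(X_{P-k})) / (2i)
    # The X_{P-k} access is the index-reversal step noted on the slide.
    p = len(y)
    X = dft([complex(a, b) for a, b in zip(y, z)])
    Y = [(X[k] + X[-k % p].conjugate()) / 2 for k in range(p)]
    Z = [(X[k] - X[-k % p].conjugate()) / 2j for k in range(p)]
    return Y, Z
```

Both output lists agree with transforming y and z separately, at the cost of a single complex transform plus the O(P) unpacking pass.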
41 DFT on real sequences (real FFT). Transform of a single real sequence (cont'd).
We have V_m = Y_m + i Z_m and V_{P/2-m} = Y_{P/2-m} + i Z_{P/2-m}, so conj(V_{P/2-m}) = conj(Y_{P/2-m}) - i conj(Z_{P/2-m}) = Y_m - i Z_m. Then
Y_m = (V_m + conj(V_{P/2-m}))/2 and Z_m = (V_m - conj(V_{P/2-m}))/(2i),
and
X_m = Y_m + w_P^m Z_m,
X_{P/2-m} = conj(Y_m - w_P^m Z_m), for m in [0, P/4].

Note that X_0 = Y_0 + Z_0 = Re{V_0} + Im{V_0} and X_{P/2} = Re{V_0} - Im{V_0}. The remaining values are computed by the two formulas above. In total: a complex FFT of a half-size sequence, with the odd-indexed values treated as the imaginary part of a complex number, followed by one post-processing step.
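The half-size construction transcribes directly into code. Again this is a sketch of mine, with a naive DFT in place of a real FFT library routine:

```python
import cmath

def dft(x):
    # Naive reference DFT, standing in for a complex FFT.
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def real_dft(x):
    # Real transform of even length P via one complex DFT of size P/2:
    # v_k = x_{2k} + i*x_{2k+1}; recover Y (even samples) and Z (odd
    # samples) from V, then combine X_m = Y_m + w_P^m * Z_m.
    p = len(x)
    h = p // 2
    V = dft([complex(x[2 * k], x[2 * k + 1]) for k in range(h)])
    X = []
    for m in range(p):
        Vm = V[m % h]                      # Y and Z are P/2-periodic
        Vc = V[-m % h].conjugate()         # conj(V_{P/2 - m})
        Y = (Vm + Vc) / 2
        Z = (Vm - Vc) / 2j
        X.append(Y + cmath.exp(-2j * cmath.pi * m / p) * Z)
    return X
```

The loop computes all P outputs for clarity; the slides exploit conjugate symmetry to restrict the post-processing to m in [0, P/4].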
42 DFT on real sequences (real FFT). Remark: the described algorithm reduces memory and the number of arithmetic operations almost by a factor of two for real sequences, but adds one communication step involving an index reversal. This step may significantly reduce the benefit of real FFTs on some parallel platforms.

DFT on real sequences (real FFT). Transform of a single sequence, Edson's algorithm. Let
X_m = sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k} + w_P^m sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k+1} = Y_m + w_P^m Z_m,
X_{P/2+m} = sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k} - w_P^m sum_{k=0}^{P/2-1} w_{P/2}^{mk} x_{2k+1} = Y_m - w_P^m Z_m, for m in [0, P/2).
For x real we have
X_{P/2-m} = conj(Y_m) - conj(w_P^m) conj(Z_m) = conj(Y_m - w_P^m Z_m), for m in [0, P/2).
Up to conjugation, these are the same values as in the preceding equations, so the index range of the computations can be reduced to half.
43 DFT on real sequences (real FFT). Transform of a single sequence, Edson's algorithm (cont'd). Thus we get the splitting formula
X_m = Y_m + w_P^m Z_m,
X_{P/2-m} = conj(Y_m - w_P^m Z_m), for m in [0, P/4],
with the first splitting formula computing X_0 through X_{P/4}. X_0 and X_{P/2} are both real numbers, as noted before, and X_{P/4} = Y_{P/4} - i Z_{P/4}, since both Y_{P/4} and Z_{P/4} are real and w_P^{P/4} = -i.
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationCommodity Cluster Computing
Commodity Cluster Computing Ralf Gruber, EPFL-SIC/CAPA/Swiss-Tx, Lausanne http://capawww.epfl.ch Commodity Cluster Computing 1. Introduction 2. Characterisation of nodes, parallel machines,applications
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationThe Role of Performance
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware
More informationLecture 3: Evaluating Computer Architectures. How to design something:
Lecture 3: Evaluating Computer Architectures Announcements - (none) Last Time constraints imposed by technology Computer elements Circuits and timing Today Performance analysis Amdahl s Law Performance
More informationLogiCORE IP Fast Fourier Transform v7.1
LogiCORE IP Fast Fourier Transform v7.1 DS260 April 19, 2010 Introduction The Xilinx LogiCORE IP Fast Fourier Transform (FFT) implements the Cooley-Tukey FFT algorithm, a computationally efficient method
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationComputer Architecture. Fall Dongkun Shin, SKKU
Computer Architecture Fall 2018 1 Syllabus Instructors: Dongkun Shin Office : Room 85470 E-mail : dongkun@skku.edu Office Hours: Wed. 15:00-17:30 or by appointment Lecture notes nyx.skku.ac.kr Courses
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationFast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2
Fast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2 Takahiro Nagai 1, Hitoshi Yoshida 1,HisayasuKuroda 1,2, and Yasumasa Kanada 1,2 1 Dept. of Frontier Informatics, The University
More informationCS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory
CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5
More informationCache Memories October 8, 2007
15-213 Topics Cache Memories October 8, 27 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance The memory mountain class12.ppt Cache Memories Cache
More informationEC 413 Computer Organization
EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationHPCS HPCchallenge Benchmark Suite
HPCS HPCchallenge Benchmark Suite David Koester, Ph.D. () Jack Dongarra (UTK) Piotr Luszczek () 28 September 2004 Slide-1 Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary
More informationCENG3420 Lecture 03 Review
CENG3420 Lecture 03 Review Bei Yu byu@cse.cuhk.edu.hk 2017 Spring 1 / 38 CISC vs. RISC Complex Instruction Set Computer (CISC) Lots of instructions of variable size, very memory optimal, typically less
More informationMain Memory (Fig. 7.13) Main Memory
Main Memory (Fig. 7.13) CPU CPU CPU Cache Multiplexor Cache Cache Bus Bus Bus Memory Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 Memory b. Wide memory organization c. Interleaved memory organization
More informationBenchmarking CPU Performance
Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationLecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture
Lecture Topics ECE 486/586 Computer Architecture Lecture # 5 Spring 2015 Portland State University Quantitative Principles of Computer Design Fallacies and Pitfalls Instruction Set Principles Introduction
More informationAdvance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts
Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism
More informationChapter 8 Memory Management
Chapter 8 Memory Management Da-Wei Chang CSIE.NCKU Source: Abraham Silberschatz, Peter B. Galvin, and Greg Gagne, "Operating System Concepts", 9th Edition, Wiley. 1 Outline Background Swapping Contiguous
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 2: Hardware/Software Interface Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Basic computer components How does a microprocessor
More informationMINIMUM HARDWARE AND OS SPECIFICATIONS File Stream Document Management Software - System Requirements for V4.2
MINIMUM HARDWARE AND OS SPECIFICATIONS File Stream Document Management Software - System Requirements for V4.2 NB: please read this page carefully, as it contains 4 separate specifications for a Workstation
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationHigh Performance Computing
High Performance Computing CS701 and IS860 Basavaraj Talawar basavaraj@nitk.edu.in Course Syllabus Definition, RISC ISA, RISC Pipeline, Performance Quantification Instruction Level Parallelism Pipeline
More informationImproving Cache Performance
Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationComputed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc.
Computed Tomography (CT) Scan Image Reconstruction on the SRC-7 David Pointer SRC Computers, Inc. CT Image Reconstruction Herman Head Sinogram Herman Head Reconstruction CT Image Reconstruction for all
More informationCluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance
Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to
More information