An Adaptive Framework for Scientific Software Libraries. Ayaz Ali, Lennart Johnsson. Dept. of Computer Science, University of Houston


1 An Adaptive Framework for Scientific Software Libraries. Ayaz Ali, Lennart Johnsson. Dept. of Computer Science, University of Houston


2 Challenges: Diversity of execution environments. Growing complexity of modern microprocessors: deep memory hierarchies, out-of-order execution, instruction-level parallelism. Growing diversity of platform characteristics: multicores, SMPs, clusters (employing a range of interconnect technologies), grids (heterogeneity, wide range of characteristics). Wide range of application needs: dimensionality and sizes, data structures and data types.

3 Approach. Automatic algorithm selection (polyalgorithmic functions). Code generation from high-level descriptions. Extensive application-independent compile-time analysis. Integrated performance modeling and analysis. Run-time composition that depends on the application and the execution environment. Automated installation process.

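4 Discrete Fourier Transform (DFT)
The DFT algorithm is a matrix-vector multiplication, costing $\Theta(N^2)$ operations:
$$y_j = \sum_{k=0}^{N-1} \omega_N^{jk}\, x_k, \qquad j = 0, \dots, N-1$$
DFT of size N = 4:
$$Y = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{1} & \omega_4^{2} & \omega_4^{3} \\ 1 & \omega_4^{2} & \omega_4^{4} & \omega_4^{6} \\ 1 & \omega_4^{3} & \omega_4^{6} & \omega_4^{9} \end{pmatrix} X$$
where $\omega_N = e^{-2\pi i/N} = \cos(2\pi/N) - i\,\sin(2\pi/N)$.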
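The matrix-vector view of the DFT can be sketched in a few lines of Python. This is a minimal illustration, not UHFFT code; the function name `dft` is hypothetical, and the forward sign convention (a negative exponent in the root of unity) is an assumption.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT: y_j = sum_k w^(j*k) * x_k.

    Assumes the forward convention w = exp(-2*pi*i/N).
    """
    n = len(x)
    if n == 0:
        return []
    w = cmath.exp(-2j * cmath.pi / n)  # primitive N-th root of unity
    # Row j of the DFT matrix is [w^(j*0), w^(j*1), ..., w^(j*(N-1))].
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]
```

Each output element is one row of the DFT matrix multiplied by the input vector, so the cost grows quadratically with N.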

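5 FFT
There is a faster algorithm, the Fast Fourier Transform (FFT), due to Cooley and Tukey (1965), which costs $\Theta(N \log N)$ operations. For $N = m \cdot r$:
$$F_N = (F_r \otimes I_m) \cdot T^N_m \cdot (I_r \otimes F_m) \cdot P_{N,r}$$
Example: N = 4, m = 2, r = 2 [figure: butterfly diagram mapping inputs x0..x3 to outputs y0..y3].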
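The four factors of the Cooley-Tukey factorization (permutation, small FFTs, twiddle multiplication, and a second set of small FFTs) can be followed step by step in a short recursive sketch. The names `fft`, `dft_naive` and `smallest_factor` are hypothetical, the forward sign convention is assumed, and prime sizes simply fall back to the naive DFT rather than to Rader's algorithm.

```python
import cmath

def dft_naive(x):
    """O(N^2) reference DFT with w = exp(-2*pi*i/N) (assumed convention)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def smallest_factor(n):
    """Smallest factor of n greater than 1 (n itself if n is prime or 1)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def fft(x):
    """Mixed-radix Cooley-Tukey FFT for N = m * r."""
    n = len(x)
    r = smallest_factor(n)
    if r == n:                      # prime (or trivial) size: no split possible
        return dft_naive(x)
    m = n // r
    # P and (I_r (x) F_m): gather r stride-r subsequences, transform each (size m).
    subs = [fft(x[s::r]) for s in range(r)]
    y = [0j] * n
    for t in range(m):
        # T: multiply element t of subsequence s by the twiddle w_N^(s*t).
        col = [subs[s][t] * cmath.exp(-2j * cmath.pi * s * t / n) for s in range(r)]
        # (F_r (x) I_m): size-r DFT across the r subsequences.
        col = dft_naive(col)
        for q in range(r):
            y[t + m * q] = col[q]
    return y
```

Choosing the smallest factor as the radix is only one of many possible splits; which split is fastest depends on the machine.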

6 Some common factorizations
Cooley-Tukey factorization (op-counts, logarithms base 2):
  Radix 2: $5 N \log N$
  Radix 4: $4.25 N \log N$
  Radix 8: $4.08 N \log N$
  Mixed radix: $\Theta(N \log N)$, when $N = m \cdot r$: $F_N = (F_r \otimes I_m) \cdot T^N_m \cdot (I_r \otimes F_m) \cdot P_{N,r}$
  Split radix: $4 N \log N$, when $N = 2^i$
Prime factor, when $\gcd(m, r) = 1$: $F_N = P_2 (F_r \otimes F_m) P_1$. No twiddle multiplication (saving N); no bit-reversal needed.
Rader's, when N is prime: $F_N = Q_{N,r}^{T} \cdot (1 \oplus F_{N-1}) \cdot B \cdot D \cdot (1 \oplus F_{N-1}) \cdot Q_{N,r}$
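To get a feel for how much the choice of radix matters, the leading-order op-counts quoted for the Cooley-Tukey variants can be evaluated for a concrete size. This is a rough sketch: `op_counts` and `COEFF` are hypothetical names, only the leading terms are used, and the size is restricted to powers of two for simplicity (a pure radix-4 or radix-8 transform additionally needs N to be a power of 4 or 8).

```python
# Leading coefficients of N*log2(N) as quoted for each variant.
COEFF = {"radix-2": 5.0, "radix-4": 4.25, "radix-8": 4.08, "split-radix": 4.0}

def op_counts(n):
    """Estimated flop counts for a power-of-two size n (leading terms only)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    log_n = n.bit_length() - 1      # exact log2 for powers of two
    return {name: c * n * log_n for name, c in COEFF.items()}

if __name__ == "__main__":
    for name, flops in op_counts(1024).items():
        print(f"{name:12s} {flops:10.0f}")
```

For N = 1024 the estimate drops from 51200 operations for radix 2 to 40960 for split radix, a 20% reduction before any memory effects are considered.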

7 Op-Counts Effect of Algorithm Selection on Op-Count. Factorizations

8 Size 16 implementation options (8) MFLOPS Codelet is-os Plan

9 Some size 2520 options "MFLOPS" Plan

10 Impact of strides Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

11 UHFFT Codelet Performance 64-bit Architectures

12 UHFFT Codelet Performance 64-bit Architectures

13 UHFFT Codelet Performance 64-bit Architectures

14 Codelet Performance Radix-2

15 Codelet Performance Radix-3

16 Codelet Performance Radix-4

17 Codelet Performance Radix-5

18 Codelet Performance Radix-6

19 Codelet Performance Radix-7

20 Codelet Performance Radix-8

21 Codelet Performance Radix-32

22 Codelet Performance Radix-45

23 Codelet Performance Radix-64

24 Challenges. Algorithmic: unfavorable data access pattern (large 2^n strides); high efficiency of the algorithm, hence a low floating-point vs. load/store ratio; imbalance between additions and multiplications. Version explosion: verification, maintenance.

25 UHFFT: An Adaptive, Portable Library. Auto-tuning is performed in two stages. At installation time, the codelet generator adapts to the microprocessor architecture by optimizing to reduce operation count and register reuse distance. At run time, the planner adapts to the memory system by searching for the best combination of codelets (schedule) to solve the given FFT problem. The plan/schedule can be reused for transforms of the same size.

26 UHFFT structure

27 UHFFT2.0. Code Generator: support for multiple codelet types; finer scheduling through DAG; register blocking and privatization. Run-time Optimization: prime size support (Rader's algorithm); plan search schemes (planner); API standardization (Intel's DFT with parallel extensions); support for both in-place and out-of-place transforms; formal description of FFT plans or schedules; parallelization (multi-core/many-core).

28 Code Generator. Four main codelet types: dyvmc*, dynmc*, dxvmc* (in-place), dyvpc (PFA). *Four sub-types: forward/inverse, twiddle/non-twiddle.

29 Empirical Auto-tuning

30 How many codelets to generate? Itanium has more registers; Opteron has a larger instruction cache.

31 Opteron [figure: MFLOPS vs. stride (is=os) for codelets of size 3, 4, 8, 7, 16 and 64; series: UHFFT, FFTW, TWUHFFT]

32 Itanium2 [figure: MFLOPS vs. stride (is=os) for codelets of size 3, 4, 7, 8, 16 and 64; series: UHFFT2.0, FFTW, FFTW_old, UHFFT162]

33 Planner. Searches for the best factors and algorithms to solve a given FFT problem. Main parameters of search: factors and algorithms at each level of the tree (op-count and locality); parallelization (load balance). Cost of search: the search cost amortizes (especially in multidimensional FFTs); support for multiple search schemes with varying costs (high/medium/low); empirical search; hybrid (empirical/estimation) search. High-resolution timers are important.
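An empirical plan search of the kind described for the planner can be sketched as: enumerate the ways to factor N into available codelet sizes, time each candidate with a high-resolution timer, and keep the fastest. Everything here is illustrative; `CODELETS`, `plans`, `execute` and `search` are hypothetical names, the codelet set is made up, and the naive DFT stands in for generated codelets.

```python
import cmath
import math
import time

CODELETS = (2, 3, 4, 5, 8)  # hypothetical codelet sizes, not UHFFT's actual set

def dft(x):
    """Stand-in for a generated codelet: naive DFT (forward sign assumed)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def execute(x, plan):
    """Run a mixed-radix FFT; plan lists the factors, outermost radix first."""
    n = len(x)
    if len(plan) == 1:
        return dft(x)
    r, m = plan[0], n // plan[0]
    subs = [execute(x[s::r], plan[1:]) for s in range(r)]
    y = [0j] * n
    for t in range(m):
        col = dft([subs[s][t] * cmath.exp(-2j * cmath.pi * s * t / n)
                   for s in range(r)])
        for q in range(r):
            y[t + m * q] = col[q]
    return y

def plans(n):
    """Yield every ordered factorization of n into codelet sizes."""
    for c in CODELETS:
        if n == c:
            yield [c]
        elif n % c == 0:
            for rest in plans(n // c):
                yield [c] + rest

def search(n, repeats=3):
    """Empirical search: time each plan, return the fastest one found."""
    x = [complex(k % 7, -k % 5) for k in range(n)]
    best_plan, best_time = None, math.inf
    for plan in plans(n):
        elapsed = math.inf
        for _ in range(repeats):          # keep the minimum to reduce timer noise
            start = time.perf_counter()   # high-resolution timer
            execute(x, plan)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan
```

Because the winning plan depends only on the size and the machine, it can be cached and reused for later transforms of the same size, which is how the search cost amortizes.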

34 Planner

35 FFT Schedule Specification Language (FSSL). FSSL grammar: UHFFT gives an expert user the flexibility to specify the schedule using its context-free grammar (CFG).

36 FFT Schedule Specification Language (FSSL) Example Schedule: (outplace1020, (rader17,16)mr (inplace60,2mr(pfa15,3pfa5)mr2) )

37 Parallelization. Example: N = 16, m = 4, r = 4, P = 4.

38 Parallelization. Example: N = 16, m = 4, r = 4, P = 4. Step 1: r row FFTs of size m, distributed among P processors. Transpose step (barrier): data sharing among processors takes place. Step 3: m column FFTs of radix r, distributed among P processors.
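The three phases (row FFTs, a barrier with transpose, column FFTs) can be sketched with a thread pool. This is a structural illustration only: `parallel_fft` and `dft` are hypothetical names, the naive DFT stands in for codelets, applying the twiddle factors during the transpose is an assumption about where that work is placed, and Python threads will not show real speedup for this CPU-bound work.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(x):
    """Naive DFT standing in for a codelet (forward sign assumed)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def parallel_fft(x, m, r, workers):
    """FFT of size N = m * r in three phases, using `workers` threads."""
    n = m * r
    if len(x) != n:
        raise ValueError("len(x) must equal m * r")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 1: r row FFTs of size m, distributed among the workers.
        rows = list(pool.map(dft, [x[s::r] for s in range(r)]))
        # Barrier: list() above returns only after every row FFT has finished.
        # Transpose (with twiddle multiplication): data crosses between workers.
        cols = [[rows[s][t] * cmath.exp(-2j * cmath.pi * s * t / n)
                 for s in range(r)] for t in range(m)]
        # Step 3: m column FFTs of size r, distributed among the workers.
        out = list(pool.map(dft, cols))
    y = [0j] * n
    for t in range(m):
        for q in range(r):
            y[t + m * q] = out[t][q]
    return y
```

With N = 16, m = 4, r = 4 and four workers, each worker handles one row FFT in the first phase and one column FFT in the last.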

39 Data Distribution (SMP/CMP) [figure: row FFTs and column FFTs]. Example: N = 16, m = 4, r = 4, distributed among P processors.

40 Data Distribution (SMP/CMP). Larger block size ~ fewer cache coherence issues. Example, N = 64K: ( mrp2b8:$ Col, ( outplace65536, 16mr16mr16mr16 ) )65536
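One way to picture the effect of block size is a block-cyclic assignment of FFT indices to threads. The sketch below is an assumption about the general idea, not a description of UHFFT's distribution; `block_cyclic` is a hypothetical name, and the example values of two workers with block size 8 are only a guess at the meaning of `p2b8` in the schedule.

```python
def block_cyclic(num_items, workers, block):
    """Assign item indices to workers in contiguous blocks, round-robin."""
    if workers < 1 or block < 1:
        raise ValueError("workers and block must be positive")
    owners = [[] for _ in range(workers)]
    for start in range(0, num_items, block):
        owner = (start // block) % workers      # blocks are dealt out cyclically
        owners[owner].extend(range(start, min(start + block, num_items)))
    return owners

if __name__ == "__main__":
    # Block size 1: neighbouring items go to different threads.
    print(block_cyclic(16, 2, 1))
    # Block size 8: each thread owns a contiguous run of items.
    print(block_cyclic(16, 2, 8))
```

With larger blocks, neighbouring items that share a cache line are more likely to belong to the same thread, which is one plausible reason for fewer coherence issues.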

41 OpenMP vs PThreads Generated plan is load balanced by planner. Distributing loops is straightforward.

42 Speedup. FFT is bandwidth starved. Superlinear speedup is due to the increase in effective cache size. The synchronization-to-computation ratio is high.

43 CPU Affinity (Multi-cores) Scheduling Problem / Cache conflicts in shared cache

44 UHFFT2.0.1beta vs FFTW3.2alpha2

45 UHFFT2.0.1beta vs FFTW3.2alpha2

46 In-place In-order FFT Performance Powers of two sizes.

47 In-place In-order FFT Performance Prime Factor Sizes.

48 UHFFT2 Current Status. 1D complex: out-of-place/in-place, in-order; forward scrambled (out-of-place/in-place); forward/inverse (configurable sign). High-, medium- and low-effort plan search. Single/double precision. SMP/CMP executor (DFTi extension). Empirical auto-tuning code generator. Executor extensibility through FSSL grammar. Real FFT (not integrated).

49 Future Work. Executor: multidimensional; MPI. Planner: accurate model-driven plan search scheme; economical (time and memory); schedule memory transactions through pre-fetching. Code Generator: ISA-specific codelet generation (e.g. SSE, FMA); generate codelet cost models.

50 Acknowledgements. The SGAS work was largely carried out by Thomas Sandholm. Other contributors: Olle Mulmo, Peter Gardfjall, Erik Elmroth, Bo Kagstrom. The FFT work: Ayaz Ali, Fredrik Mwandia, Rishad Mahasoom, Dragan Mirkovic, Purvi Shah, Haiyan Teng. Support: NSF and Intel.

51 Thank You!

52 End of Moore's Law? Cannot keep up with the power requirements! Source: Unknown

53 [figure: Performance (vs. VAX-11/780) over time. VAX: 25%/year, 1978 to 1986. RISC + x86: 52%/year, 1986 to 2002. RISC + x86: ??%/year, 2002 to present.]
