SPIRAL Overview: Automatic Generation of DSP Algorithms & More *

1 SPIRAL Overview: Automatic Generation of DSP Algorithms & More *
Jeremy Johnson (& SPIRAL Team), Drexel University
* The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided by Franz Franchetti, Yevgen Voronenko, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel and the rest of the SPIRAL team.
CS 65 Program Generation and Opt.

2 Outline
Introduction
  Challenge and goal
PreSPIRAL
  Formulas, programs, and rewrite rules
  Multilinear Algebra and Block Recursive Algorithms
SPIRAL (straightline code)
  SPL and Ruletrees
SPIRAL (loops)
  Σ-SPL
SPIRAL (vector, parallel, and more)
  Tagged SPL
SPIRAL (library generation)
Operator Language
  Beyond transforms

3 Outline Introduction Challenge and goal PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

4 Challenge of Obtaining Efficient Code
Multiple threads: 2x
Memory hierarchy: 5x
Vector instructions: 3x
High performance library development has become a nightmare

5 Spiral Overview
Research Goal: Teach computers to write fast libraries
  Complete automation of implementation and optimization
  Including vectorization, parallelization
Functionality:
  Linear transforms (discrete Fourier transform, filters, wavelets)
  BLAS
  SAR imaging
  En/decoding (Viterbi, EBCOT in JPEG2000)
  more
Platforms: Desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid
Collaboration with Intel (Kuck, Tang, Sabanin)
  Parts of MKL/IPP generated with Spiral
  IPP 6.0: ippg domain for Spiral generated code

6 SPIRAL: Abstraction Levels
Carnegie Mellon, Drexel, UIUC
Transform: problem specification
Ruletree: easy manipulation for search
SPL: vector/parallel optimizations
Σ-SPL: loop optimizations
C Code: traditional human domain, C compiler tools available
  for (i=0; i<8; i++) { t[2i+4]=x[i-4]+x[i+4]; }

7 Program Generation in Spiral (Sketched)
Optimization at all abstraction levels
Transform: user specified
Fast algorithm in SPL: many choices
  parallelization, vectorization
Σ-SPL [PLDI 5]: loop optimizations
C Code: constant folding, scheduling
Iteration of this process to search for the fastest
But that's not all

8 Outline Introduction PreSPIRAL Formulas, programs, and rewrite rules [CSSP 9] Multilinear Algebra and Block Recursive Algorithms [JSC 9] SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

9 DSP Algorithms & Matrix Factorizations
transform = matrix: output vector (signal) = matrix times input vector (signal)
[Figure: a Fourier transform matrix written as a product of sparse factors; labels: Fourier transform, Identity, Permutation, Diagonal matrix (twiddles), Kronecker product, Input vector, Output vector]
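The matrices on this slide did not survive transcription. As a hedged reconstruction, the labels (permutation, twiddle diagonal, Kronecker product, identity) fit the standard Cooley-Tukey factorization of the size-4 DFT. The block below writes it out under the assumption that the root of unity is i, which agrees with the factor i*u(3) in the code on slide 12; the symbols T^4_2 and L^4_2 are the usual names for the twiddle matrix and the stride permutation and are not taken from the transcript.

```latex
\[
\mathrm{DFT}_4 =
\begin{bmatrix}
1 & 1 & 1 & 1\\
1 & i & -1 & -i\\
1 & -1 & 1 & -1\\
1 & -i & -1 & i
\end{bmatrix}
= (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2
\]
\[
= \begin{bmatrix}
1 & 0 & 1 & 0\\
0 & 1 & 0 & 1\\
1 & 0 & -1 & 0\\
0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix}
1 & & & \\
& 1 & & \\
& & 1 & \\
& & & i
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 0 & 0\\
1 & -1 & 0 & 0\\
0 & 0 & 1 & 1\\
0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 1
\end{bmatrix}
\]
```

Read right to left, the factors permute the input to (x0, x2, x1, x3), apply two size-2 butterflies, scale the last entry by i, and apply two more butterflies at stride 2.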

10 Rewrite Rules & Formula Manipulation

11 Rules and Dynamic Programming
[Figure: Cooley-Tukey (CT) rule with labels Diagonal matrix (twiddles), Kronecker product, Identity, Permutation]
[Figure: three rule trees for DFT(16), splitting into DFT(2) and DFT(8), DFT(4) and DFT(4), DFT(8) and DFT(2)]
Dynamic Programming (DP) searches over different applications of the CT rule, using the best transform found for recursive calls
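A minimal sketch of the DP idea in C: for each size n it tries every split n = k*m allowed by the CT rule and reuses the best result already found for DFT(k) and DFT(m). The cost function `ct_cost`, the leaf cost n*n, and the names `dp_search` and `dp_best_split` are assumptions made for illustration; SPIRAL as described in these slides ranks candidates by measured runtime, not by a formula.

```c
#include <assert.h>

enum { MAX_N = 1024 };

static long best_cost[MAX_N + 1]; /* 0 means "not computed yet" */
static int best_k[MAX_N + 1];     /* left factor of the best split; 0 for a leaf */

/* Hypothetical cost model (an assumption, not SPIRAL's): m copies of DFT(k),
   k copies of DFT(m), plus a penalty n*k standing in for twiddles and
   strided access. */
static long ct_cost(int k, int m, long cost_k, long cost_m)
{
    long n = (long)k * m;
    return m * cost_k + k * cost_m + n * k;
}

/* Returns the best cost found for DFT(n) and memoizes it. */
long dp_search(int n)
{
    assert(n >= 2 && n <= MAX_N);
    if (best_cost[n] != 0)
        return best_cost[n];

    long best = 0;
    int split = 0;
    for (int k = 2; k < n; k++) {
        if (n % k != 0)
            continue;
        int m = n / k;
        /* Recursive calls return the best transform already found. */
        long c = ct_cost(k, m, dp_search(k), dp_search(m));
        if (split == 0 || c < best) {
            best = c;
            split = k;
        }
    }
    if (split == 0)
        best = (long)n * n; /* prime size: leaf evaluated by definition */

    best_k[n] = split;
    best_cost[n] = best;
    return best;
}

/* Left factor k chosen at the root of the best rule tree (0 for a leaf). */
int dp_best_split(int n)
{
    dp_search(n);
    return best_k[n];
}
```

Under this made-up cost model the search for DFT(16) compares the three rule trees from the slide and keeps the split with left factor 2. Swapping `ct_cost` for a timing call would turn the sketch into an empirical search.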

12 Formulas to Code
Direct translation:
t(0) = x(0); t(1) = x(2); t(2) = x(1); t(3) = x(3);
u(0) = t(0) + t(1); u(1) = t(0) - t(1); u(2) = t(2) + t(3); u(3) = t(2) - t(3);
v(0) = u(0); v(1) = u(1); v(2) = u(2); v(3) = i*u(3);
y(0) = v(0) + v(2); y(2) = v(0) - v(2); y(1) = v(1) + v(3); y(3) = v(1) - v(3);
Simplify:
u(0) = x(0) + x(2); u(1) = x(0) - x(2); u(2) = x(1) + x(3); u(3) = x(1) - x(3);
u(3) = i*u(3);
y(0) = u(0) + u(2); y(2) = u(0) - u(2); y(1) = u(1) + u(3); y(3) = u(1) - u(3);
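The simplified straight-line code can be written as a small C function. This is a sketch, assuming a hypothetical `cplx` struct and helper names (`c_add`, `c_sub`, `c_mul_i`) that do not appear on the slide; it keeps the slide's convention of multiplying u(3) by i.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

static cplx c_add(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx c_sub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
/* i * (re + i*im) = -im + i*re */
static cplx c_mul_i(cplx a) { return (cplx){ -a.im, a.re }; }

/* Size-4 DFT following the simplified code on the slide (out of place). */
void dft4(const cplx x[4], cplx y[4])
{
    assert(x != y);
    cplx u0 = c_add(x[0], x[2]);
    cplx u1 = c_sub(x[0], x[2]);
    cplx u2 = c_add(x[1], x[3]);
    cplx u3 = c_mul_i(c_sub(x[1], x[3])); /* twiddle: u(3) = i*u(3) */
    y[0] = c_add(u0, u2);
    y[2] = c_sub(u0, u2);
    y[1] = c_add(u1, u3);
    y[3] = c_sub(u1, u3);
}
```

For the real input (1, 2, 3, 4) this gives y = (10, -2-2i, -2, -2+2i), which matches the matrix-vector product with rows (1, i, -1, -i) and (1, -i, -1, i) for y(1) and y(3).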

13 Formulas and Architectures
Parallel Operation
  [Figure: four copies of A, one each on Processor 0 to Processor 3, mapping x to y]
Vector Operation
  [Figure: the same structure built from 4-way vector blocks]
Task: Rewrite formulas to introduce varying amounts of vectorization and parallelism and modify access patterns

14 Multilinear Algebra & Block Recursive Algs
Matrix Multiplication
Strassen's Algorithm

15 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPL and Ruletrees SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

16 SPL (Signal Processing Language) [PLDI ]
SPL expresses transform algorithms as structured sparse matrix factorization
Examples:
SPL grammar in Backus-Naur form

17 Compiling SPL to Code Using Templates
for i=0..n-1 for j=0..m-1 y[i+n*j]=x[m*i+j]
y[0:1:n-1] = call A(x[0:1:n-1]); y[n:1:n+m-1] = call B(x[n:1:n+m-1])
for i=0..n-1 y[im:1:im+m-1] = call B(x[im:1:im+m-1])
for i=0..n-1 y[im:1:im+m-1] = call B(x[i:n:i+m-1])
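Three of these templates can be sketched in C as follows. The function-pointer type `kernel_fn`, the function names, and the size-2 butterfly `f2` used as the callee B are assumptions for illustration; the slide gives only the loop templates, and the SPL constructs they belong to were lost in transcription.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel reads its inputs from x at stride xs and writes its outputs
   to y at stride ys. The kernel's own size is fixed by the callee. */
typedef void (*kernel_fn)(const double *x, size_t xs, double *y, size_t ys);

/* Size-2 butterfly, used here as the example kernel B (m = 2). */
void f2(const double *x, size_t xs, double *y, size_t ys)
{
    double a = x[0], b = x[xs];
    y[0] = a + b;
    y[ys] = a - b;
}

/* Template 1 (permutation): y[i + n*j] = x[m*i + j]. */
void stride_perm(size_t n, size_t m, const double *x, double *y)
{
    assert(x != y);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i + n * j] = x[m * i + j];
}

/* Template 3: y[im : 1 : im+m-1] = B(x[im : 1 : im+m-1]) for i = 0..n-1. */
void tensor_id_kernel(size_t n, size_t m, kernel_fn B,
                      const double *x, double *y)
{
    assert(x != y);
    for (size_t i = 0; i < n; i++)
        B(x + i * m, 1, y + i * m, 1);
}

/* Template 4: same contiguous writes, but block i is read starting at
   x[i] with stride n. */
void tensor_id_kernel_strided(size_t n, size_t m, kernel_fn B,
                              const double *x, double *y)
{
    assert(x != y);
    for (size_t i = 0; i < n; i++)
        B(x + i, n, y + i * m, 1);
}
```

With n = 4, m = 2 and x = (1, ..., 8), the strided template produces y[2i] = x[i] + x[i+4] and y[2i+1] = x[i] - x[i+4]. The second template on the slide (A on the first n entries, B on the next m) follows the same pattern with two calls at different offsets.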

18 SPIRAL Architecture [IEEE 5]
Approach: Empirical search over alternative recursive algorithms
Flow: Transform, Formula Generator, SPL Compiler, C Compiler, Timer; the SPIRAL Search Engine uses the measured runtime to pick the next formula
Result: Adapted Implementation (DFT_8.c)
[Figure: a generated C loop over j = 0, ..., 3 that combines x[j] and x[j+4], the compiled machine code, and the timer readout]

19 DSP Algorithms: Transforms & Breakdown Rules Combining these rules yields many algorithms for every given transform

20 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) Σ-SPL SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language

21 Problem: Fusing Permutations and Loops
Direct mapping: two passes over the working set, complex index computation
void sub(double *y, double *x) {
  double t[8];
  for (int i=0; i<=7; i++)
    t[(i/4)+2*(i%4)] = x[i];
  for (int i=0; i<4; i++) {
    y[2*i] = t[2*i] + t[2*i+1];
    y[2*i+1] = t[2*i] - t[2*i+1];
  }
}
C compiler cannot do this
Fused: one pass over the working set, simple index computation
void sub(double *y, double *x) {
  for (int j=0; j<=3; j++) {
    y[2*j] = x[j] + x[j+4];
    y[2*j+1] = x[j] - x[j+4];
  }
}
State-of-the-art
  SPIRAL: Hardcoded with templates
  FFTW: Hardcoded in the infrastructure
How does hardcoding scale?

22 New Approach for Loop Merging [PLDI 5]
Search loop: Transform, Formula Generator, SPL formula, New SPL Compiler, Code, C Compiler, Timer, Search Engine, Adapted Implementation
New SPL Compiler: SPL To Σ-SPL, Σ-SPL formula, Loop Merging, Σ-SPL formula, Index Simplification, Σ-SPL formula, Σ-SPL Compiler

23 Σ-SPL
Four central constructs: Σ, G, S, Perm
  Σ (sum) makes loops explicit
  G_f (gather) reads data using the index mapping f
  S_f (scatter) writes data using the index mapping f
  Perm_f permutes data using the index mapping f
Every Σ-SPL formula still represents a matrix factorization
Example: [Figure: a sum over j = 0, ..., 3 in which each iteration gathers from the input, applies F_2, and scatters to the output]
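The three roles (explicit loop, gather, scatter) can be sketched in C. The index mappings below are assumptions chosen to reproduce the merged loop shown on slides 21 and 24 (read x[j] and x[j+4], write y[2j] and y[2j+1]); the names `index_fn`, `gather_idx`, `scatter_idx`, and `sigma_sum_f2` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Maps (iteration j, local index l) to a position in the full vector. */
typedef size_t (*index_fn)(size_t j, size_t l);

/* Assumed gather mapping: iteration j reads x[j] and x[j+4]. */
size_t gather_idx(size_t j, size_t l) { return j + 4 * l; }

/* Assumed scatter mapping: iteration j writes y[2j] and y[2j+1]. */
size_t scatter_idx(size_t j, size_t l) { return 2 * j + l; }

/* y = sum over j of S_j F_2 G_j x, with the sum made an explicit loop. */
void sigma_sum_f2(size_t iters, index_fn g, index_fn s,
                  const double *x, double *y)
{
    assert(x != y);
    for (size_t j = 0; j < iters; j++) {
        double in0 = x[g(j, 0)];            /* G: gather */
        double in1 = x[g(j, 1)];
        double out0 = in0 + in1;            /* F_2 kernel */
        double out1 = in0 - in1;
        y[s(j, 0)] = out0;                  /* S: scatter */
        y[s(j, 1)] = out1;
    }
}
```

Because the data movement lives in the index mappings instead of a separate permutation pass, composing or simplifying those mappings is what lets the loop read x directly. Each iteration writes a disjoint pair of outputs, so the sum needs no accumulation.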

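24 Loop Merging With Rewriting Rules
Steps: SPL, SPL To Σ-SPL, Σ-SPL, Loop Merging, Σ-SPL, Index Simplification, Σ-SPL, Σ-SPL Compiler, Code
[Figure: before loop merging, F_2 computes Y from X through an intermediate T; after merging and index simplification it maps X directly to Y. The rewriting rules were shown as formulas.]
Code:
for (int j=0; j<=3; j++) {
  y[2*j] = x[j] + x[j+4];
  y[2*j+1] = x[j] - x[j+4];
}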

25 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) Tagged SPL SPIRAL (library generation) Operator Language

26 SPL to Shared Memory Code: Basic Idea [SC 6]
Governing construct: tensor product
  [Figure: four copies of A, one each on Processor 0 to Processor 3, mapping x to y]
  p-way embarrassingly parallel, load-balanced
Problematic construct: permutations produce false sharing
Task: Rewrite formulas to extract tensor product + keep contiguous blocks

27 Parallelization by Rewriting coarse platform model Load-balanced No false sharing

28 Same Approach for Other Parallel Paradigms
Message Passing: MPI [ISPA 6]
Vectorization [VecPar 6]
GPUs: Cg/OpenGL
FPGAs: Verilog [DAC 8]

29 Example Results Multicore [SC 6] Code written by the computer is faster (for many sizes) than any human-written code

30 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) From Rules to complete library infrastructure Operator Language

31 Tasks To Be Automated
blue = hand-written, red = generated
FFTW
  Codelet generator
  Library infrastructure
  Recursive functions needed
  Vectorization (partially)
  Threading
  Adaptation mechanism (plan)
  Codelets ( 2 types)
Our Goal: Spiral
  Library infrastructure
  Recursive functions needed
  Vectorization
  Threading
  Adaptation mechanism (plan)
  Codelets ( 2 types)

32 How Library Generation Works [Vor 8]
Input: Transforms + Breakdown rules; Library Target (FFTW, VSIPL, IPP FFT, ...)
Library Structure
  Parallelization / Vectorization
  Recursion Step Closure
  Output: recursion step closure as Σ-SPL formulas
Library Implementation
  Build library plan
  Hot/cold partition
  Generate target code
Result: High-performance library

33 Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT)
[Figure: DFT breakdown, k=4]
Naive implementation:
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  Z = permute(X)
  for i=0 to k-1
    dft_subvec(m, Z, Y, ...)
  for i=0 to n-1
    Y[i] = Y[i]*T[i];
  for i=0 to m-1
    dft_strided(k, Y, Y, ...)
}
2 extra functions needed

34 Breakdown Rules to Library Code
Cooley-Tukey Fast Fourier Transform (FFT)
Naive implementation:
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  Z = permute(X)
  for i=0 to k-1
    dft_subvec(m, Z, Y, ...)
  for i=0 to n-1
    Y[i] = Y[i]*T[i];
  for i=0 to m-1
    dft_strided(k, Y, Y, ...)
}
2 extra functions needed
Optimized implementation:
void dft(int n, cplx X[], cplx Y[]) {
  k = choose_factor(n); m = n/k;
  for i=0 to k-1
    dft_strided2(m, X, Y, ...)
  for i=0 to m-1
    dft_strided3_scaled(k, Y, Y, T, ...)
}
2 extra functions needed
How to discover these specialized variants automatically?

35 Computing Recursion Step Closure
Input: transform T and a breakdown rule
Output: spawned recursion steps + Σ-SPL implementation
Algorithm:
  1. Apply the breakdown rule
  2. Convert to Σ-SPL
  3. Apply loop merging + index simplification rules
  4. Extract recursion steps
  5. Repeat until closure is reached
Parameterization (not shown) derives the independent parameter set for each recursion step

36 Recursion Step Closure Examples
DFT (scalar): 4 mutually recursive functions
  computed automatically
  described using Σ-SPL formulas
DCT4 (vectorized): 7 mutually recursive functions

37 Benchmark: Complex DFT
Complex DFT [Intel Core i7, 2.66 GHz, 4 cores]
[Plot: performance in Gflop/s versus input size; series: Spiral (4 threads), Spiral (1 thread), FFTW, Intel IPP 6.0]

38 Examples of Generated Libraries
Transforms: RDFT, DHT, DCT2, DCT3, DCT4, DFT, Filter, Wavelet
2-way vectorized, 2-threaded
Most are faster than hand-written libs
Code size: 8 2 KLOC or.5 5 MB
Generation time: 3 hours
Total: 3 KLOC / 3.3 MB of code generated in < 2 hours from a few simple algorithm specs
Intel IPP library 6.0 includes Spiral generated code

39 Outline Introduction PreSPIRAL SPIRAL (straightline code) SPIRAL (loops) SPIRAL (vector, parallel, and more) SPIRAL (library generation) Operator Language Beyond transforms

40 Operators [DSL 9]
Definition
  Operator: Multiple complex vectors → multiple complex vectors
  Higher-dimensional data is linearized
  Operators are potentially nonlinear
Example: Matrix-matrix-multiplication (MMM), C = AB

41 Operator Language

42 Example: Matrix Multiplication (MMM) Breakdown rules: capture various forms of blocking

43 Matrix Multiplication Library
[Plots: Rank-k Update, k=4, single precision and double precision; performance in Gflop/s versus input size on a Dual Intel Xeon 56, 3 GHz; series: Spiral-generated library, MKL, GotoBLAS]

44 References
[CSSP 9] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures, Circuits Systems Signal Process., 9(4): 449-500, 1990.
[JSC 9] R. W. Johnson, C. H. Huang and J. R. Johnson, Multilinear algebra and parallel programming, The Journal of Supercomputing, Vol. 5, Oct. 1991.
[IEEE 5] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, pp. 232-275, 2005.
[IPDPS 2] Kang Chen and Jeremy Johnson, A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2002.
[SC 6] Franz Franchetti, Yevgen Voronenko and Markus Püschel, FFT Program Generation for Shared Memory: SMP and Multicore, Proc. Supercomputing (SC), 2006.
[ISPA 6] Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber, Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers, Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes in Computer Science, Springer, Vol. 4330, 2006.

45 References
[VecPar 6] Franz Franchetti, Yevgen Voronenko and Markus Püschel, A Rewriting System for the Vectorization of Signal Transforms, Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, 2006.
[DAC 8] Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel, Formal Datapath Representation and Manipulation for Implementing DSP Transforms, Proc. Design Automation Conference (DAC), 2008.
[IPDPS 4] Jeremy Johnson and Kang Chen, A Self-Adapting Distributed Memory Package for Fast Signal Transforms, Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004.
[PLDI ] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua, SPL: A Language and Compiler for DSP Algorithms, Proc. Programming Languages Design and Implementation (PLDI), 2001.
[PLDI 5] Franz Franchetti, Yevgen Voronenko and Markus Püschel, Formal Loop Merging for Signal Transforms, Proc. Programming Languages Design and Implementation (PLDI), 2005.
[Vor 8] Yevgen Voronenko, Library Generation for Linear Transforms, PhD thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008.
[DSL 9] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel, Operator Language: A Program Generation Framework for Fast Kernels, Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, 2009.


More information

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE Scanning the Issue Special Issue on Program Generation, Optimization, and Platform Adaptation This special issue of the PROCEEDINGS OF THE IEEE offers an overview of ongoing efforts to facilitate the development

More information

Learning to Construct Fast Signal Processing Implementations

Learning to Construct Fast Signal Processing Implementations Journal of Machine Learning Research 3 (2002) 887-919 Submitted 12/01; Published 12/02 Learning to Construct Fast Signal Processing Implementations Bryan Singer Manuela Veloso Department of Computer Science

More information

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber, and Peter Wurzinger Institute for Analysis and

More information

Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes

Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes 12 April 2017 Contents 1.1 What Every User Should Know About This Release... 2 1.2 What s New... 2 1.3 Product Contents...

More information

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 14 th Lecture Feb. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Course Evaluation Email sent out

More information

Computer Generation of Platform-Adapted Physical Layer Software

Computer Generation of Platform-Adapted Physical Layer Software Computer Generation of Platform-Adapted Physical Layer Software Yevgen Voronenko (SpiralGen, Pittsburgh, PA, USA; yevgen@spiralgen.com); Volodymyr Arbatov (Carnegie Mellon University, Pittsburgh, PA, USA;

More information

On the Computer Generation of Adaptive Numerical Libraries

On the Computer Generation of Adaptive Numerical Libraries On the Computer Generation of Adaptive Numerical Libraries A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy FRÉDÉRIC DE MESMAY Supervised by Markus

More information

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS c World Scientific Publishing Company PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS RADE KUTIL and PETER EDER Department of Computer Sciences, University of Salzburg Jakob Haringerstr. 2, 5020

More information

A Fast Fourier Transform Compiler

A Fast Fourier Transform Compiler RETROSPECTIVE: A Fast Fourier Transform Compiler Matteo Frigo Vanu Inc., One Porter Sq., suite 18 Cambridge, MA, 02140, USA athena@fftw.org 1. HOW FFTW WAS BORN FFTW (the fastest Fourier transform in the

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato & Alen Stojanov Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)

More information

Introduction A Parametrized Generator Case Study Real Applications Conclusions Bibliography BOAST

Introduction A Parametrized Generator Case Study Real Applications Conclusions Bibliography BOAST Portability Using Meta-Programming and Auto-Tuning Frédéric Desprez 1, Brice Videau 1,3, Kevin Pouget 1, Luigi Genovese 2, Thierry Deutsch 2, Dimitri Komatitsch 3, Jean-François Méhaut 1 1 INRIA/LIG -

More information

Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling

Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling Basilio B. Fraguela Depto. de Electrónica e Sistemas Universidade da Coruña A Coruña, Spain basilio@udc.es Yevgen Voronenko,

More information

SPIRAL: Extreme Performance Portability

SPIRAL: Extreme Performance Portability SPIRAL: Extreme Performance Portability This paper provides an end-to-end discussion of the SPIRAL system, its domain-specific languages, and code generation techniques. By FRANZ FRANCHETTI, Senior Member

More information

COMPUTER architects are experimenting with ever. SPIRAL: Extreme Performance Portability

COMPUTER architects are experimenting with ever. SPIRAL: Extreme Performance Portability PROCEEDINGS OF THE IEEE, SPECIAL ISSUE ON FROM HIGH LEVEL SPECIFICATION TO HIGH PERFORMANCE CODE 1 SPIRAL: Extreme Performance Portability Franz Franchetti, Senior Member, IEEE, Tze Meng Low, Member, IEEE,

More information

How to Write Fast Numerical Code Spring 2011 Lecture 7. Instructor: Markus Püschel TA: Georg Ofenbeck

How to Write Fast Numerical Code Spring 2011 Lecture 7. Instructor: Markus Püschel TA: Georg Ofenbeck How to Write Fast Numerical Code Spring 2011 Lecture 7 Instructor: Markus Püschel TA: Georg Ofenbeck Last Time: Locality Temporal and Spatial memory memory Last Time: Reuse FFT: O(log(n)) reuse MMM: O(n)

More information

Generating FPGA-Accelerated DFT Libraries

Generating FPGA-Accelerated DFT Libraries Generating FPGA-Accelerated DFT Libraries Paolo D Alberto Yahoo! {pdalbert@yahoo-inc.com} Peter A. Milder, Aliaksei Sandryhaila, Franz Franchetti James C. Hoe, José M. F. Moura, and Markus Püschel Department

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL

Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL Doru Thom Popovici, Francis P. Russell, Karl Wilkinson, Chris-Kriton Skylaris Paul H. J. Kelly and Franz Franchetti

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

Automating the Modeling and Optimization of the Performance of Signal Transforms

Automating the Modeling and Optimization of the Performance of Signal Transforms IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 2003 Automating the Modeling and Optimization of the Performance of Signal Transforms Bryan Singer and Manuela M. Veloso Abstract Fast

More information

Halfway! Sequoia. A Point of View. Sequoia. First half of the course is over. Now start the second half. CS315B Lecture 9

Halfway! Sequoia. A Point of View. Sequoia. First half of the course is over. Now start the second half. CS315B Lecture 9 Halfway! Sequoia CS315B Lecture 9 First half of the course is over Overview/Philosophy of Regent Now start the second half Lectures on other programming models Comparing/contrasting with Regent Start with

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Energy Optimizations for FPGA-based 2-D FFT Architecture

Energy Optimizations for FPGA-based 2-D FFT Architecture Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline

More information

An Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston

An Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston An Adaptive Framework for Scientific Software Libraries Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston Diversity of execution environments Growing complexity of modern microprocessors.

More information

Automated Code Generation for High-Performance, Future-Compatible Graph Libraries

Automated Code Generation for High-Performance, Future-Compatible Graph Libraries Research Review 2017 Automated Code Generation for High-Performance, Future-Compatible Graph Libraries Dr. Scott McMillan, Senior Research Scientist CMU PI: Prof. Franz Franchetti, ECE 2017 Carnegie Mellon

More information

Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Memory

Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Memory Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Fazle Sadi, Berkin Akin, Doru T. Popovici, James C. Hoe, Larry Pileggi and Franz Franchetti Department of Electrical and

More information

An Empirically Optimized Radix Sort for GPU

An Empirically Optimized Radix Sort for GPU 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications An Empirically Optimized Radix Sort for GPU Bonan Huang, Jinlan Gao and Xiaoming Li Electrical and Computer Engineering

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS

Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of

More information

RECENTLY, the characteristics of signal transform algorithms

RECENTLY, the characteristics of signal transform algorithms 2120 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 7, JULY 2004 Dynamic Data Layouts for Cache-Conscious Implementation of a Class of Signal Transforms Neungsoo Park, Member, IEEE, and Viktor K.

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri *R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Document Number: 327356-005US Legal Information Intel Math Kernel Library Getting Started

More information

Automatic Tuning of the Fast Multipole Method Based on Integrated Performance Prediction

Automatic Tuning of the Fast Multipole Method Based on Integrated Performance Prediction Original published: H. Dachsel, M. Hofmann, J. Lang, and G. Rünger. Automatic tuning of the Fast Multipole Method based on integrated performance prediction. In Proceedings of the 14th IEEE International

More information

Multiplicationless DFT Calculation Using New Algorithms

Multiplicationless DFT Calculation Using New Algorithms Multiplicationless DT Calculation Using ew Algorithms Jaya Krishna Sunkara, Chiranjeevi Muppala PG Scholar, SVUCE, Tirupati, IDIA. Asst. Prof., TJS College of Engineering, Anna University, IDIA. Jayakrishna.s7@gmail.com,

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

A Package for Generating, Manipulating, and Testing Convolution Algorithms. Anthony F. Breitzman and Jeremy R. Johnson

A Package for Generating, Manipulating, and Testing Convolution Algorithms. Anthony F. Breitzman and Jeremy R. Johnson A Package for Generating, Manipulating, and Testing Convolution Algorithms Anthony F. Breitzman and Jeremy R. Johnson Technical Report DU-CS-03-05 Department of Computer Science Drexel University Philadelphia,

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Document Number: 327356-005US Legal Information Legal Information By using this document,

More information

The Design and Implementation of FFTW3

The Design and Implementation of FFTW3 The Design and Implementation of FFTW3 MATTEO FRIGO AND STEVEN G. JOHNSON Invited Paper FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize

More information