Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok


Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston

Outline Motivation Introduction FFT Background Parallelization Challenges Performance Results Conclusion

Motivation FFT is one of the most popular algorithms in scientific and engineering applications — more than 40% of compute time if we include all signal processing applications in modern telecommunications. Other applications include digital image processing and compression. In all these applications, FFT takes a significant part of the CPU time. Strong motivation for developing highly optimized implementations.

Motivation Optimization challenges: Growing complexity of modern architectures: deep memory hierarchy; multilevel parallelism. Compiler technology cannot fill all the gaps. Hand tuning: tedious (~200 lines of C code for a size-8 FFT); expensive due to growing heterogeneity. Solution: autotuning through domain-specific frameworks; automatic scheduling of operations.

Automatic Tuning Automatic performance tuning approach — a two-stage methodology (widely used). Installation time: generate optimized code blocks (micro-kernels), optimized for instruction schedules and register use. Run time: select and schedule variants (algorithms, factors), optimized for memory access schedule and load balancing. Examples — FFT: MKL, FFTW, UHFFT; linear algebra: ATLAS.

In this paper Extend the auto-tuning approach in UHFFT to multi-core and SMP architectures: OpenMP and PThreads implementations. Establish heuristics for selecting the best schedule (plan) of computation and factorization — different factorizations have different memory access patterns. Empirical evaluation of parallel FFT on the latest architectures. SMP: Itanium 2 and Opteron. CMP: Xeon and Opteron.

A Brief Introduction to FFT

Mathematical Background The Fast Fourier Transform (FFT): a faster algorithm for evaluating the Discrete Fourier Transform (DFT). The DFT is a matrix-vector product, $O(N^2)$:

$y_j = \sum_{k=0}^{N-1} \omega_N^{jk} x_k, \quad \omega_N = e^{-2\pi i/N}$

DFT matrix for N = 4 ($\omega_4 = -i$):

$Y = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix} X$

Mathematical Background FFT algorithm: exploits the symmetries in the DFT matrix. The famous Cooley-Tukey algorithm (1965), $O(N \log N)$: divide the problem into smaller, non-unique factors $N = m \cdot r$:

$F_N = (F_r \otimes I_m) \, T_m^N \, (I_r \otimes F_m) \, P$

Example: N = 4 with m = 2, r = 2 — the DFT matrix factors into two stages of size-2 butterflies, a diagonal twiddle matrix $T$, and an input permutation $P$.

Mathematical Background The factored form for N = 4 highlights three features: strided access (the permutation gathers even/odd-indexed elements), twiddle multiplication (the diagonal $T$ matrix), and bit reversal (required for in-order output). [Butterfly visualization of the N = 4 example.]

FFT Challenges Algorithmic: Unfavorable data access pattern ($2^n$ strides). Low floating-point to loads/stores ratio (bandwidth bound). Unbalanced multiplication/addition operation count. Compare with linear algebra codes: an FFT codelet (size 16) has low spatial locality and low algorithmic efficiency, whereas a blocked matrix kernel (block size 16) has high spatial locality and high algorithmic efficiency.

Strided Data Access Performance impact: a 66% drop for a stride of only 1K.

UHFFT Design Installation time (input: FFT problem): Codelet Generator — butterfly computation, factorization/algorithm selection (prime factor, split radix, mixed radix, Rader's FFT); scalar and vector optimization; scheduler (DAG); unparser (output); compile, evaluate and guide (AEOS); produces the FFT codelet library. Run time: DFTi API; FSSL grammar (parser/scanner); Planner — choose algorithm, select codelets, evaluate performance (timers); Executor with SMP layer; result: the best FFT descriptor (plan) for a single CPU or an SMP.

Parallelization of FFT on SMP/CMP

Summary FFT: Unfavorable strided memory access. Bit-reversal permutation required for in-order results. Many factorizations (schedules) solve a given FFT problem. Divide-and-conquer algorithm, complexity $O(n \log n)$. Architectures: trend towards deeper memory hierarchies; growing gap between processing power and memory speed/bandwidth, widened by multi-cores. Multi-cores with shared or private caches.

Parallelization of FFT FFT is a naturally data-parallel algorithm. Parallelize at the top level of the recursive tree (2D formulation): N = m × r splits into a stage of row FFTs (size r, from the codelet library) and a stage of column FFTs (size m, computed by serial recursive calls).

Parallelization of FFT Only one rendezvous point. Example: N = 16, m = 4, r = 4 — synchronize between the stages with a barrier or a transpose.

Choice of Factors Where to put the barrier in the FFT plan? Between the row (r) and column (m) FFT stages. FFT_m is performed recursively using smaller factors; FFT_r comes from the codelet library generated at installation time. [Plot: speedup of candidate plans, Opteron 8-way SMP, FFT N = 128K complex double.]

Choice of Factors Work distribution: perfect load balancing is possible by choosing factors m and r that are divisible by the number of threads P. Choose the largest column blocking to reduce cache-coherence issues. [Plot: MFLOPS vs. column block size (log), Itanium 2 4-way SMP, FFT N = 64K complex double.]

Multithreading Overhead Two main sources: Thread synchronization — use busy waiting to bypass OS-level synchronization calls; keep the load balanced across threads. Thread creation — use thread pooling.
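A busy-wait barrier in the spirit described can be sketched with C11 atomics and a sense-reversing flag (our sketch, not the UHFFT source): each arriving thread decrements an atomic counter, and the last arrival flips a shared sense flag that the spinning threads watch, avoiding the OS call a blocking barrier may make.

```c
#include <stdatomic.h>

/* Sense-reversing busy-wait barrier (a hypothetical sketch). */
typedef struct {
    atomic_int count;   /* threads still to arrive               */
    atomic_int sense;   /* flips each time the barrier opens     */
    int nthreads;
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int n)
{
    atomic_init(&b->count, n);
    atomic_init(&b->sense, 0);
    b->nthreads = n;
}

void spin_barrier_wait(spin_barrier_t *b)
{
    int my_sense = atomic_load(&b->sense);
    if (atomic_fetch_sub(&b->count, 1) == 1) {   /* last one in  */
        atomic_store(&b->count, b->nthreads);    /* re-arm       */
        atomic_store(&b->sense, 1 - my_sense);   /* release all  */
    } else {
        while (atomic_load(&b->sense) == my_sense)
            ;                                    /* busy wait    */
    }
}
```

Spinning trades burned CPU cycles for lower wakeup latency — a good trade here because the barrier sits between two compute-bound FFT stages and the wait is short.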

Thread Pooling Create the worker threads at the start; the master thread notifies them when a task is ready. The alternative is the fork-join model, which pays the overhead of creating and destroying threads on every call.
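A minimal persistent pool along these lines might look as follows (a sketch of the model on the slide, not UHFFT code; all names are ours). Workers are created once; the master publishes a task, wakes the pool, and waits for all workers to report back:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct pool {
    pthread_mutex_t lock;
    pthread_cond_t  work_ready, work_done;
    void (*task)(int id, void *arg);   /* current task          */
    void *arg;
    int nthreads, started, finished, shutdown, generation;
    pthread_t *threads;
} pool_t;

static void *pool_worker(void *p)
{
    pool_t *pl = p;
    int seen = 0;                      /* last generation this worker ran */
    for (;;) {
        pthread_mutex_lock(&pl->lock);
        while (pl->generation == seen && !pl->shutdown)
            pthread_cond_wait(&pl->work_ready, &pl->lock);
        if (pl->shutdown) { pthread_mutex_unlock(&pl->lock); return NULL; }
        seen = pl->generation;
        int id = pl->started++;        /* unique id within this dispatch */
        pthread_mutex_unlock(&pl->lock);

        pl->task(id, pl->arg);         /* do the actual work */

        pthread_mutex_lock(&pl->lock);
        if (++pl->finished == pl->nthreads)
            pthread_cond_signal(&pl->work_done);   /* master may proceed */
        pthread_mutex_unlock(&pl->lock);
    }
}

void pool_init(pool_t *pl, int n)
{
    pthread_mutex_init(&pl->lock, NULL);
    pthread_cond_init(&pl->work_ready, NULL);
    pthread_cond_init(&pl->work_done, NULL);
    pl->nthreads = n;
    pl->generation = pl->shutdown = 0;
    pl->threads = malloc((size_t)n * sizeof *pl->threads);
    for (int i = 0; i < n; i++)
        pthread_create(&pl->threads[i], NULL, pool_worker, pl);
}

/* Master: publish a task, wake the pool, wait for completion. */
void pool_run(pool_t *pl, void (*task)(int, void *), void *arg)
{
    pthread_mutex_lock(&pl->lock);
    pl->task = task; pl->arg = arg;
    pl->started = pl->finished = 0;
    pl->generation++;
    pthread_cond_broadcast(&pl->work_ready);
    while (pl->finished < pl->nthreads)
        pthread_cond_wait(&pl->work_done, &pl->lock);
    pthread_mutex_unlock(&pl->lock);
}

void pool_destroy(pool_t *pl)
{
    pthread_mutex_lock(&pl->lock);
    pl->shutdown = 1;
    pthread_cond_broadcast(&pl->work_ready);
    pthread_mutex_unlock(&pl->lock);
    for (int i = 0; i < pl->nthreads; i++)
        pthread_join(pl->threads[i], NULL);
    free(pl->threads);
}

/* Demo task: counts how many workers ran it. */
static atomic_int hits;
static void bump(int id, void *arg)
{
    (void)id; (void)arg;
    atomic_fetch_add(&hits, 1);
}
```

With repeated FFT calls of the same size, pthread_create/pthread_join costs are paid once at pool_init rather than on every transform.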

Thread Pooling Performance [Plot: MFLOPS vs. size (log), Xeon Woodcrest (2×dual CMP/SMP), FFT N = 2^n complex double, P = 4 cores.]

Performance Results

Platforms [Table: specifications of the SMP and SMP/CMP platforms used.]

Multi-core Cache Configuration Xeon Woodcrest (2×dual CMP/SMP): two processors, each with two cores; on each processor the two cores have private L1 caches and share one L2 cache; both processors reach shared memory over the system bus. Opteron 275 (2×dual CMP/SMP): two processors, each with two cores; every core has a private L1 and a private L2 cache; both processors reach shared memory over the system bus.

Shared/Private Cache [Plots: MFLOPS vs. size (log) on Xeon Woodcrest and Opteron 275 (both 2×dual CMP/SMP), FFT N = 2^n complex double, P = 4 cores.] In each graph the problem was executed ten times and the average plotted with a variation bar. The two lines in each graph represent the number of threads spawned.

Shared/Private Cache Although the Linux 2.6 kernel fixed some of the thread-scheduling issues (e.g., threads ping-ponging between cores), to ensure the best performance set the CPU affinity of each thread explicitly in the program, e.g., with sched_setaffinity or KMP_AFFINITY (Intel).
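For example, a thread can pin itself to one core with the Linux sched_setaffinity call (a minimal, Linux-specific sketch of the slide's advice; the wrapper name is ours and error handling is elided):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU.
   pid 0 means "the calling thread"; returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

With Intel's OpenMP runtime, the KMP_AFFINITY environment variable achieves the same pinning without code changes.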

Programming Models OpenMP: a portable multithreaded programming model; parallelization is performed by the compiler; implementations are optimized by vendors. PThreads: allows more flexibility; parallelization is defined by the user. For carefully partitioned data, both OpenMP and PThreads can achieve identical performance.

OpenMP vs PThreads Performance of one, two and four threads using the OpenMP and PThreads implementations in UHFFT. The barrier implementation was customized for PThreads using busy waiting and atomic decrement. [Plot: MFLOPS vs. size (log), Itanium 2 4-way SMP, UHFFT N = 2^n complex double, P = 4 processors.]

UHFFT and FFTW Support for efficient CMP/SMP FFT — recently: UHFFT 2.0.1-beta and FFTW 3.2-alpha. Similar adaptive methodology (code generation and run-time search). Many implementation-level differences, mainly: FFTW generates ISA-specific codelets (e.g., SSE, FMA); UHFFT uses an automatic empirical-optimization code generator; UHFFT is better at cache blocking.

UHFFT vs FFTW [Plot: MFLOPS comparison.] FFTW utilizes threading only for N > 256.

Overall Efficiency Itanium and Opteron: the increase in total cache is proportional to the increase in cores. Xeon: shared cache. [Bar chart: % of peak (0–25%) for Xeon, Opteron 275 and Itanium 2, in-cache vs. out-of-cache.]

Conclusions The best sequential FFT plan is not guaranteed to perform optimally on multi-cores: the factorization has to take cache-coherence issues and load balancing into account. Given a carefully scheduled FFT computation, both OpenMP and PThreads can perform equally well. Performance benchmarks on the most recent architectures show good speedup with our implementation in UHFFT. But: for FFT, high efficiency is only possible if memory bandwidth grows in proportion to processing power.

Thank You!

Backup Slides

Low Efficiency of FFT

Op-Counts Effect of algorithm selection on operation count. [Chart: op-counts for the candidate algorithms.]

Data Distribution (SMP/CMP) Example: N = 16, m = 4, r = 4, P = 2. The 16 elements (0–15) are viewed as a 4 × 4 array; the two threads split the row FFTs between them, then split the column FFTs.

UHFFT2 Current Status 1D complex, out-of-place/in-place, in-order. Forward scrambled (out-of-place/in-place). Forward/inverse (configurable sign). High, medium and low effort plan search. Single/double precision. SMP/CMP executor (DFTi extension). Empirical auto-tuning code generator. Executor extensibility through the FSSL grammar. Real FFT (not yet integrated).

Future Work Executor: multidimensional, MPI. Planner: accurate model-driven plan search scheme — economical in time and memory; schedule memory transactions through prefetching. Code Generator: ISA-specific codelet generation (e.g., SSE, FMA); generate codelet cost models.