
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT. Alexander J. Yee, University of Illinois Urbana-Champaign

The Problem: FFTs are extremely memory-intensive and completely bound by memory access. Memory bandwidth is always a problem: single-node shared memory does not have enough bandwidth, and multi-node is even worse. It is the dominant factor in performance. Naïve implementations are also bound by latency: data-reordering can be many times slower than the FFT computation itself!

The Classic Approach to 3D-FFT:
1. Perform x-dimension FFT.
2. In-memory transpose.
3. All-to-all communication.
4. Perform y-dimension FFT.
5. In-memory transpose.
6. All-to-all communication.
7. Perform z-dimension FFT.
8. In-memory transpose.
9. All-to-all communication.
The exact order may differ. This takes 3 all-to-all communication steps, and an extra transpose may be needed at the beginning to get the data into order.

What is an out-of-order FFT? The out-of-order FFT is mathematically the same as the in-order FFT; only the frequency domain is not in order. Forward transform: start from the in-order time domain and end with an out-of-order frequency domain, using a decimation-in-frequency algorithm. Inverse transform: start from the out-of-order frequency domain and end with the in-order time domain, using a decimation-in-time algorithm. Order of the frequency domain: bit-reversed is the most common, but other orders exist; some algorithms are even faster at the cost of further scrambling the frequency domain. [Figure: Decimation-in-Frequency FFT (image taken from cnx.org)]
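To make the idea concrete, here is a minimal sketch in C of an iterative radix-2 decimation-in-frequency FFT. It is not the talk's code (the actual implementation uses radix 4 and SIMD) and assumes a power-of-two length; the usual final bit-reversal pass is simply omitted, which is exactly what leaves the output "out of order".

#include <complex.h>

/* Iterative radix-2 decimation-in-frequency FFT (sketch).
 * Assumes n is a power of two.  Output is left in bit-reversed order --
 * the "out-of-order" frequency domain. */
static void dif_fft_bitreversed(double complex *a, int n)
{
    const double PI = 3.14159265358979323846;
    for (int len = n; len >= 2; len /= 2) {
        int half = len / 2;
        for (int start = 0; start < n; start += len) {
            for (int k = 0; k < half; k++) {
                double complex w = cexp(-2.0 * PI * I * k / len);  /* twiddle factor */
                double complex x = a[start + k];
                double complex y = a[start + k + half];
                a[start + k]        = x + y;        /* DIF butterfly: sum stays on top */
                a[start + k + half] = (x - y) * w;  /* twiddle applied after the subtraction */
            }
        }
    }
    /* No bit-reversal pass here -- that is the whole point. */
}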

Why Out-of-Order? Many applications do not need an in-order frequency domain; convolution does not even need to look at the frequency domain. The out-of-order FFT is faster: in-order FFTs require data-reordering (bit-reversal), which has very poor memory access, and the re-ordering is more expensive than the FFT itself! In-order FFTs also cannot easily be done in place, so they require double the memory of an out-of-order FFT, aggravating the memory bottleneck. An out-of-order FFT can be several times faster, and for distributed FFTs over many nodes there is no need for a final transpose.

Convolution via Out-of-order FFT: time domain (in order) -> forward FFT -> pointwise multiply (order does not matter) -> inverse FFT -> time domain (in order). (Images taken from cnx.org)
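A cyclic convolution built this way might look like the following sketch, using the dif_fft_bitreversed sketch above. dit_ifft_from_bitreversed is an assumed inverse routine that takes bit-reversed input and returns natural order; neither name comes from the talk.

/* Assumed decimation-in-time inverse: bit-reversed input -> natural-order output. */
void dit_ifft_from_bitreversed(double complex *a, int n);

/* Cyclic convolution of a and b (both of power-of-two length n), result in a.
 * No bit-reversal is ever performed: the pointwise product does not care
 * about the order of the frequency domain. */
static void cyclic_convolution(double complex *a, double complex *b, int n)
{
    dif_fft_bitreversed(a, n);          /* natural order -> bit-reversed frequency domain */
    dif_fft_bitreversed(b, n);
    for (int i = 0; i < n; i++)
        a[i] *= b[i];                   /* pointwise multiply; order does not matter */
    dit_ifft_from_bitreversed(a, n);    /* bit-reversed -> natural order */
    for (int i = 0; i < n; i++)
        a[i] /= n;                      /* normalize by 1/n */
}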

Implementations of out-of-order FFT

Prime95/MPrime, by George Woltman. Used in GIMPS (the Great Internet Mersenne Prime Search), world record holder for the largest prime number found (August 2008); 9 of the 10 largest known prime numbers were found by GIMPS. Uses FFT for cyclic convolution. Fastest known out-of-order FFT. (x86-64 assembly, Windows + Linux)

y-cruncher Multi-threaded Pi Program, by Alexander J. Yee. Fastest program to compute Pi and other constants, world record holder for the most digits of Pi ever computed (5 trillion digits, August 2010). Uses FFT and NTT for multiplying large numbers. Almost as fast as Prime95. (Standard C with Intel SSE intrinsics, cross-platform)

djbfft, by Daniel J. Bernstein. One of the first implementations of out-of-order FFTs. Outperformed FFTW by factors > 3 for convolution. Never widely used, but motivated other out-of-order FFT projects.

Our Approach to 3D-FFT: Recognize that an n-D FFT is the same as a 1D FFT: different twiddle factors, but the same memory access and the same data-flow. So implement a 1D-FFT with modified twiddle factors instead! All tricks for optimizing a 1D-FFT are now available. Use Bailey's 4-step algorithm for the 1D FFT:
1. First computation pass.
2. Matrix transpose.
3. Second computation pass.
4. Matrix transpose data back to initial order.
With an out-of-order FFT there is no need for the final transpose! Total: 1 transpose -> only 1 all-to-all communication.
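A sketch of this single-transpose 4-step structure, with the N-point transform viewed as an n1 x n2 matrix. All helper names here are illustrative placeholders, not the talk's actual routines.

/* Bailey's 4-step algorithm with the final transpose dropped (sketch).
 * The N = n1*n2 point data set is viewed as an n1 x n2 matrix. */
static void four_step_out_of_order(double complex *data, int n1, int n2)
{
    column_ffts(data, n1, n2);        /* pass 1: n2 FFTs of length n1 down the columns */
    apply_twiddles(data, n1, n2);     /* multiply element (j,k) by exp(-2*pi*I*j*k / (n1*n2)) */
    global_transpose(data, n1, n2);   /* the single transpose -> the single all-to-all */
    row_ffts(data, n1, n2);           /* pass 2: n1 FFTs of length n2 along the rows */
    /* The classic 4-step would transpose again here to restore natural order;
     * the out-of-order variant stops, eliminating the second all-to-all. */
}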

Our Approach to 3D-FFT (cont.): To apply Bailey's 4-step method, break the FFT into 2 passes. Easy way (slab decomposition): x and y in one pass, z by itself in the second pass (y can go with either x or z). Hard way (split dimension): split the y dimension across the two passes; this overcomes the scalability issue with the 1st method (see next section). Both are being implemented. The frequency domain will be bit-reversed.
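For the slab decomposition, the single communication step is just one MPI_Alltoall. Below is a minimal sketch, not the project's actual code, assuming an N^3 complex-double grid, P ranks with P dividing N, and buffers of interleaved real/imaginary doubles already packed per destination rank.

#include <mpi.h>

/* Redistribute slabs between the two computation passes (sketch).
 * Each of the P ranks owns n/P planes; it sends an (n/P) x (n/P) x n block
 * to every other rank. */
static void exchange_slabs(double *sendbuf, double *recvbuf, int n, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    int block = 2 * (n / nprocs) * (n / nprocs) * n;   /* doubles sent to each rank */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, comm);
}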

Drawbacks
Slab decomposition:
- Can use standard FFT libraries, though optimal performance will still require custom sub-routines.
- Input data can be contiguous (the most common representation).
- The number of nodes must divide evenly into either the x or z dimension.
- Scalability is limited to N nodes for an N^3 3D-FFT; Blue Waters will have more than 10,000 nodes.
Split dimension:
- Cannot use standard libraries; everything must be written from scratch.
- Input data must be strided, which could imply an extra transpose.
- The number of nodes must divide evenly into x*y or x*z.
- Scalability is limited to N^(3/2) nodes for an N^3 3D-FFT; not a problem on Blue Waters.

Some Implementation Details: All code is written from scratch, no libraries; everything is customized. SIMD: SSE for x86-64, AltiVec for PowerPC, with a struct-of-arrays layout; will extend to AVX and FMA in the future (next-gen Intel/AMD x64). Radix-4 FFT: good performance, fits into 16 registers, not too bad for cache associativity. Pre-computed twiddle factors: duplicate tables to ensure sequential access.
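As an illustration of the struct-of-arrays layout with SSE, here is a sketch of the general style, not the project's actual kernels; it assumes 16-byte-aligned arrays and an even element count.

#include <emmintrin.h>   /* SSE2 */

/* Struct-of-arrays complex layout: separate, contiguous real and imaginary arrays.
 * This keeps SIMD lanes independent and avoids shuffles inside the butterflies. */
typedef struct {
    double *re;
    double *im;
} complex_soa;

/* Pointwise complex multiply a *= b, two elements per iteration. */
static void soa_pointwise_mul(complex_soa a, complex_soa b, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d ar = _mm_load_pd(a.re + i), ai = _mm_load_pd(a.im + i);
        __m128d br = _mm_load_pd(b.re + i), bi = _mm_load_pd(b.im + i);
        /* (ar + I*ai)*(br + I*bi) = (ar*br - ai*bi) + I*(ar*bi + ai*br) */
        _mm_store_pd(a.re + i, _mm_sub_pd(_mm_mul_pd(ar, br), _mm_mul_pd(ai, bi)));
        _mm_store_pd(a.im + i, _mm_add_pd(_mm_mul_pd(ar, bi), _mm_mul_pd(ai, br)));
    }
}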

Benchmarks - Memory Bottleneck. [Chart] Complex out-of-order 3D-FFT (1024^3), Windows OpenMP, 16 GB needed, on 2 x Intel Xeon X5482 with 64 GB DDR2 (2 memory channels per socket, 4 total). Run time in seconds vs. thread count (1, 2, 4, 8) with affinity to 1 socket or 2 sockets; annotations mark threads on different sockets, all in the same socket, 2 per socket, and all cores utilized.

Benchmarks - Shared Memory. [Chart] Complex out-of-order 3D-FFT (1024^3), slab decomposition, 32 GB needed, on 2 x Intel Xeon X5482 with 64 GB DDR2. Time in seconds, broken down into FFT-z, all-to-all, transpose, and FFT-xy.

Early Benchmarks - Distributed. [Chart] Complex out-of-order 3D-FFT (1024^3), slab decomposition, 32 GB needed, on the Accelerator Cluster at UIUC. Time in seconds vs. MPI processes / nodes (16/8, 32/8, 32/16, 64/16), broken down into strided FFT-z, all-to-all, inner-node transpose, and dense FFT-xy.

Analysis: The number of all-to-all communication steps is reduced from 3 to 1 for the out-of-order FFT; an in-order FFT is doable by adding one transpose at the end. The approach is possibly communication-optimal: no data is transferred more than once, some data is never transferred at all, and it is hard to further reduce the number of bytes transferred. (Is our current approach optimal? It may still be possible to improve the communication pattern instead.) There is lots of room for improvement within the node; the FFT computation can be better optimized. It may seem difficult to imagine having more nodes than the x or z dimension, but Blue Waters will have > 10,000 nodes, and 10,000 may be greater than one of the dimensions, so it may not be possible (or efficient) to use the slab decomposition.
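A rough back-of-the-envelope count (an assumption of uniformly distributed data, not a figure from the talk): for an N^3 grid of double-precision complex values spread over P nodes, one all-to-all moves about

$$ V \approx 16\,N^3\,\frac{P-1}{P}\ \text{bytes}, $$

so cutting the number of all-to-all steps from 3 to 1 removes roughly two thirds of the inter-node traffic; for N = 1024, each avoided step is about 16 * 1024^3 bytes, roughly 17 GB.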

Next Steps: Test the current code on larger systems and make sure the current implementation scales. Port the code to PowerPC AltiVec (currently implemented using x86-64 SSE3). Implement blocking and padding (breaks cache-associativity conflicts -> allows higher-radix transforms). Overlap communication and computation. Support prime factors other than 2: 3*2^k, 5*2^k, and maybe 7*2^k. Real-input transforms. In-order FFT.

Thanks for Listening. Questions?