High Performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad. Advisor: Dr. Kishore Kothapalli


Motivation The Fourier transform is widely used in physics, astronomy, engineering, etc.; applications include signal processing and fluid dynamics. The original 1D DFT algorithm is O(N^2); the FFT improves the time complexity to O(N log N). Nevertheless, the FFT is still computationally intensive, and continued advances in computing demand ever larger and faster implementations of the algorithm.

2D Fourier Transform equation The two-dimensional Fourier transform is given by

$$F(u,v) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m,n)\, e^{-2\pi i \left( \frac{um}{M} + \frac{vn}{N} \right)}$$

It can be rewritten as

$$F(u,v) = \sum_{m=0}^{M-1} \left[ \sum_{n=0}^{N-1} f(m,n)\, e^{-2\pi i \frac{vn}{N}} \right] e^{-2\pi i \frac{um}{M}}$$

The term in square brackets corresponds to the one-dimensional Fourier transform of the m-th line and can be computed using the standard fast Fourier transform (FFT). Each line is replaced by its Fourier transform, and then the one-dimensional discrete Fourier transform of each column is computed.
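To make the row-column decomposition concrete, here is a minimal reference sketch in C++ (our illustration, not the project's code): a naive O(N^2) 1D DFT applied to every row and then to every column of a row-major matrix. A real implementation replaces dft1d() with an FFT from CUFFT, MKL, or FFTW.

#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

using cpx = std::complex<double>;

// Naive O(N^2) 1D DFT; stands in for an FFT library call in this sketch.
static std::vector<cpx> dft1d(const std::vector<cpx>& in) {
    const std::size_t N = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cpx> out(N);
    for (std::size_t k = 0; k < N; ++k) {
        cpx acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double ang = -2.0 * pi * double(k * n) / double(N);
            acc += in[n] * cpx(std::cos(ang), std::sin(ang));
        }
        out[k] = acc;
    }
    return out;
}

// 2D DFT of an R x C row-major matrix: transform every row, then every column.
void dft2d(std::vector<cpx>& a, std::size_t R, std::size_t C) {
    for (std::size_t r = 0; r < R; ++r) {          // 1D DFT of each row
        std::vector<cpx> row(a.begin() + r * C, a.begin() + (r + 1) * C);
        row = dft1d(row);
        std::copy(row.begin(), row.end(), a.begin() + r * C);
    }
    for (std::size_t c = 0; c < C; ++c) {          // 1D DFT of each column
        std::vector<cpx> col(R);
        for (std::size_t r = 0; r < R; ++r) col[r] = a[r * C + c];
        col = dft1d(col);
        for (std::size_t r = 0; r < R; ++r) a[r * C + c] = col[r];
    }
}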

Our AIM and Previous Work Our aim: to develop an efficient heterogeneous CPU-GPU implementation of the 2-dimensional FFT. There have been previous attempts at heterogeneous 2D FFT, notably An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library, IPDPS 2008. Summary of the paper: the library achieves optimal performance using heterogeneous CPU-GPU computing resources; the load-distribution ratio is automatically predicted from a performance model; the 2D FFT is computed using 1D FFT libraries such as CUFFT, MKL, and FFTW. Results: the heterogeneous library is around two times slower than the best existing GPU FFT libraries, because of the overhead of transposing matrices and the multiple data transfers between the CPU and the GPU.

Shortcomings and Strengths Strengths: The load-balancing ratio is estimated automatically from a performance model with quite low prediction error, so little manual tuning is needed. The library can handle very large data sizes, which pure GPU libraries cannot because of the GPU's memory limitations. Faster FFT implementations are possible only through heterogeneous architectures. Shortcomings: Only one thread is used on the dual-core CPU; the other core is occupied by the GPU control thread. Transposes of large matrices are done on the CPU, during which the GPU remains idle. No work is overlapped with the data transfers. The hardware and software available today are far more advanced than what the paper used (an Nvidia GeForce 8800 GTX with 128 cores and an Intel Core 2 Duo E6400 at 2.13 GHz). In practice, one would prefer pure GPU FFT libraries for their better performance.

ISCA 2010 GPU-CPU myth paper The paper Debunking the 100X GPU vs. CPU Myth, ISCA 2010, highlights the architectural features of GPUs and CPUs that contribute to the performance gap between the two. After 14 kernels were optimized on both the GPU and the CPU, the performance gap between an Nvidia GTX 280 and an Intel Core i7 960 narrowed to only 2.5x on average (for FFT, the ratio is 3x). The paper reports CUFFT as the best implementation of the 1D FFT on GPUs and MKL as the best on multi-core CPUs. We have seen that a 2D FFT is composed of 1D FFTs along the rows followed by 1D FFTs along the columns, so we can use these libraries in our hybrid algorithm.

Pure GPU and pure CPU timings

FFT Size      CUFFT time (ms), Tesla GPU    MKL time (ms), Core i7, 12 threads
1000 x 1000   1.45                          6.00
2048 x 2048   6.75                          19.00
8000 x 8000   87.5                          400.00
8192 x 8192   173                           400.00

Experimental Setup We use a machine with an Nvidia Tesla processor combined with an Intel Core i7 980. The Tesla has 30 streaming multiprocessors of 8 CUDA cores each, for a total of 240 cores; the Core i7 can run 12 threads at a time. We use the Linux kernel with the Nvidia display driver and CUDA version 4.0. The 2D matrices used in the following experiments consist of pairs of 32-bit floating-point numbers representing the real and imaginary parts, respectively.
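This layout is CUFFT's interleaved complex format (cufftComplex, a float2 holding the real and imaginary parts). As a small sketch (illustrative; alloc_signal is our name), allocating an X x Y signal in pinned memory so that later asynchronous transfers can overlap with kernel execution:

#include <cufft.h>
#include <cuda_runtime.h>

cufftComplex* alloc_signal(int X, int Y) {
    cufftComplex* h = nullptr;
    // Page-locked allocation: required for cudaMemcpyAsync to overlap with kernels.
    cudaMallocHost(&h, sizeof(cufftComplex) * (size_t)X * Y);
    return h;
}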

Approach The basic approach is as follows:
- The first transpose can be shifted to either side (GPU/CPU), whichever is more effective.
- The second transpose is a crucial step; we can exploit the pipelining pattern present in the system.
- All the data transfers in the algorithm can be overlapped with the transpose kernel using CUDA streams.
- If all the data-transfer time is hidden, the effective time will be (FFT computation time + transpose computation time).

CUDA streams Streams help in achieving concurrent kernel and memcpy execution. When no stream is specified, everything goes into the default stream 0 and operations are not concurrent. With streams, approximately all of the data-transfer time is hidden by overlapping it with kernel execution: while the kernel works on data A, data B is already being transferred, and so on.
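A minimal sketch of that overlap pattern (our illustration: scale() is a stand-in kernel, and h_a/h_b are assumed to be pinned via cudaMallocHost, without which the copies do not overlap):

#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // stand-in for real per-chunk work
}

void overlap(const float* h_a, const float* h_b, float* d_a, float* d_b, int n) {
    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);
    size_t bytes = sizeof(float) * n;
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, sa);  // transfer A
    scale<<<(n + 255) / 256, 256, 0, sa>>>(d_a, n);                // kernel on A
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, sb);  // transfer B overlaps kernel on A
    scale<<<(n + 255) / 256, 256, 0, sb>>>(d_b, n);                // kernel on B
    cudaDeviceSynchronize();
    cudaStreamDestroy(sa);
    cudaStreamDestroy(sb);
}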

Hybrid Algorithm (a condensed code sketch follows this list)
- Distribute and transfer data: divide the 2D signal into GPU_rows and CPU_rows, and transfer the signal to the GPU.
- Transpose (GPU): transpose the CPU rows, transpose the GPU rows, and transfer the CPU rows back to the host.
- Row-FFT: use CUFFT for the row FFT on the GPU rows and MKL for the row FFT on the CPU rows.
- Transpose (GPU): transpose the GPU rows; transfer the CPU rows to the device and transpose them.
- Column-FFT: transfer the transposed CPU rows back; use CUFFT for the column FFT on the GPU rows and MKL for the column FFT on the CPU rows.
- Final output transfer: transfer the output currently residing on the GPU back to the host.
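The sketch below condenses the first stages of this flow (illustrative only: the transpose kernel is naive, error checking is omitted, and the MKL work on the CPU partition is indicated by comments rather than real calls):

#include <cufft.h>
#include <cuda_runtime.h>

// Naive transpose: out (cols x rows) = transpose of in (rows x cols).
__global__ void transpose(const cufftComplex* in, cufftComplex* out,
                          int rows, int cols) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < cols && y < rows)
        out[(size_t)x * rows + y] = in[(size_t)y * cols + x];
}

// X x Y row-major signal; the first gpu_rows rows of the current (transposed)
// matrix form the GPU partition.
void hybrid_2dfft(const cufftComplex* h_sig, int X, int Y, int gpu_rows) {
    cufftComplex *d_in, *d_out;
    size_t bytes = sizeof(cufftComplex) * (size_t)X * Y;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_sig, bytes, cudaMemcpyHostToDevice);   // distribute

    dim3 blk(16, 16), grd((Y + 15) / 16, (X + 15) / 16);
    transpose<<<grd, blk>>>(d_in, d_out, X, Y);               // transpose (GPU)
    // Transfer the CPU partition (the trailing rows of d_out) back to the
    // host here, so MKL can transform it while CUFFT handles the GPU rows.

    cufftHandle plan;                              // batched 1D FFTs over
    cufftPlan1d(&plan, X, CUFFT_C2C, gpu_rows);    // gpu_rows rows of length X
    cufftExecC2C(plan, d_out, d_out, CUFFT_FORWARD);          // row-FFT

    // The second transpose, the column FFT, and the final output transfer
    // repeat the same pattern (see the stages listed above).
    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
}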

Timings (Threshold = 90%)

FFT Size      Pure GPU time   Hybrid algorithm time   Transpose time   FFT computation time   Data transfer time for CPU rows
1024 x 1024   1146            3150                    1700             700                    750
2048 x 2048   6000            12000                   6800             2800                   2200
4096 x 4096   28198           52615                   30554            14000                  8230
8192 x 8192   156180          226461                  122614           71451                  32396

(All times in microseconds.) More than half of the time is spent in transpose computations. How can the transpose time be hidden? Perhaps by using the CPU for transposing, via the pipeline FFT -> data transfer -> transpose (CPU).

Using the CPU for transposing: problems By the time the GPU has done the row/column FFT on the entire signal, only 10% of the data can have been transferred to the CPU side for transposing. The CPU transpose time is much higher than that of the equivalent CUDA kernel. And the transposed data must be transferred back to the GPU for the FFT: one more pipeline!

Hiding Transpose time If the transpose time is to be hidden, a much higher bandwidth would be required than is currently available, and we would need a device as powerful as the GPU itself. A multi-GPU system?

Review The previous algorithm took a total of 52000 microseconds, against a CUFFT benchmark timing of 28000 microseconds.

Hiding Transpose The algorithm can be seen as the series row-FFT -> transpose -> row-FFT -> transpose. The transpose is the bottleneck of the algorithm, and it can be hidden with the following possible pipeline: if the transpose time is hidden, the effective time of the algorithm is twice the time taken by the row FFT. Deciding which device to assign the row FFT and the transpose to is not trivial.

Steps to implement the pipeline
1. Divide the signal into chunks of rows.
2. Perform the FFT of one chunk on the CPU and, when it finishes, start transferring that chunk to the GPU.
3. The CPU starts the row FFT of the next chunk while the GPU transposes the previous one.
4. Repeat until all chunks have been transferred and transposed on the GPU.

Implementation using 2 streams
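The slide's figure is not reproduced in this transcription; the sketch below shows the assumed structure (our illustration: cpu_row_fft() is a stub standing in for MKL's batched row FFT, and h_sig must be pinned for the asynchronous copies):

#include <cufft.h>
#include <cuda_runtime.h>

// Stub standing in for an MKL batched 1D row FFT over nrows rows of length Y.
static void cpu_row_fft(cufftComplex* rows, int nrows, int Y) {
    (void)rows; (void)nrows; (void)Y;   // the real pipeline calls MKL here
}

// Transpose one chunk: rows [row0, row0 + nrows) of the X x Y input land in
// the corresponding columns of the Y x X output.
__global__ void transpose_chunk(const cufftComplex* in, cufftComplex* out,
                                int row0, int nrows, int X, int Y) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the chunk
    if (x < Y && y < nrows)
        out[(size_t)x * X + row0 + y] = in[(size_t)(row0 + y) * Y + x];
}

void pipeline(cufftComplex* h_sig /* pinned */, cufftComplex* d_in,
              cufftComplex* d_out, int X, int Y, int rows_per_chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int n_chunks = X / rows_per_chunk;               // assume X divisible
    for (int i = 0; i < n_chunks; ++i) {
        int row0 = i * rows_per_chunk;
        cufftComplex* h_chunk = h_sig + (size_t)row0 * Y;
        cufftComplex* d_chunk = d_in + (size_t)row0 * Y;
        cpu_row_fft(h_chunk, rows_per_chunk, Y);     // step 2: FFT on the CPU
        cudaStream_t st = s[i % 2];
        cudaMemcpyAsync(d_chunk, h_chunk,            // step 2: ship the chunk
                        sizeof(cufftComplex) * (size_t)rows_per_chunk * Y,
                        cudaMemcpyHostToDevice, st);
        dim3 blk(16, 16), grd((Y + 15) / 16, (rows_per_chunk + 15) / 16);
        transpose_chunk<<<grd, blk, 0, st>>>(d_in, d_out, row0,
                                             rows_per_chunk, X, Y);
        // Next iteration: the CPU transforms chunk i+1 while the GPU copies
        // and transposes chunk i (the step-3 overlap).
    }
    cudaDeviceSynchronize();                         // step 4: drain both streams
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}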

Operations per chunk

Chunk size   CPU row-FFT per chunk   Transfer time per chunk
10           280                     275
20           377                     530
50           620                     1200
100          1000                    2200

The time required to transfer one chunk from CPU to GPU exceeds the time required for the FFT of one chunk on the CPU. The two timings are equal at small chunk sizes, but that is of no use: the effective time of the pipeline is dominated by the transfer time.

Result and Observations As expected, the total time of the pipeline is dominated by the one-way data-transfer time, and the GPU is under-utilized. Transferring the entire signal over time is a very inefficient practice. We need data decomposition, assigning part of the FFT computation to the GPU.

Eliminate Transpose Can we eliminate the explicit transpose computation? Doing the row FFT and the transpose in a single step is possible: read the data in row-major form and write the FFT output in column-major form. But the timing of the combined operation should stay close to the original row-FFT timing, so that the total time = 2 * (row-FFT time).

Using Stride parameters It is possible to specify stride parameters for the input and output data. For instance, in a simple Row-Row formulation, input stride = output stride = 1 and input dist = output dist = Y (the row length).

Another Example For an FFT along the columns, input stride = output stride = Y and input dist = output dist = 1.

4 Types of Operations (a cufftPlanMany sketch of these layouts follows this list)
- Row-Row: read in row-major order, write in row-major order.
- Row-Column: read in row-major order, write in column-major order.
- Column-Row: read in column-major order, write in row-major order.
- Column-Column: read in column-major order, write in column-major order.
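In CUFFT these layouts map onto the advanced-layout planner cufftPlanMany, whose istride/idist and ostride/odist arguments are exactly the stride/dist values quoted above. A sketch for an X x Y row-major matrix (our helper names; execute the plans out of place):

#include <cufft.h>

// Row-Row: X transforms of length Y, read and written along the rows.
cufftHandle plan_row_row(int X, int Y) {
    cufftHandle p;
    int n = Y;
    cufftPlanMany(&p, 1, &n,
                  &n, /*istride*/ 1, /*idist*/ Y,
                  &n, /*ostride*/ 1, /*odist*/ Y,
                  CUFFT_C2C, /*batch*/ X);
    return p;
}

// Row-Column: read along the rows, write each result down a column of the
// transposed (Y x X) output, fusing the transpose into the FFT.
cufftHandle plan_row_col(int X, int Y) {
    cufftHandle p;
    int n = Y;
    cufftPlanMany(&p, 1, &n,
                  &n, /*istride*/ 1, /*idist*/ Y,
                  &n, /*ostride*/ X, /*odist*/ 1,
                  CUFFT_C2C, /*batch*/ X);
    return p;
}

// Usage (out of place):
//   cufftExecC2C(plan_row_col(X, Y), d_in, d_out, CUFFT_FORWARD);
// Applying a Row-Column plan twice (with X and Y swapped the second time)
// yields the full 2D FFT without any explicit transpose kernel.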

Timings

Operation       Time (usec), 1024 x 1024   Time (usec), 2048 x 2048
Row-Row         140                        470
Row-Column      245                        1550
Column-Row      265                        2770
Column-Column   273.5                      3138

Because of their poorer spatial locality, the operations other than Row-Row take more time. If the 2D FFT is computed as Row-Row + Column-Column, the time taken is 140 + 273.5 = 413.5 usec (1024 x 1024) and 470 + 3138 = 3608 usec (2048 x 2048), against benchmark timings of 380 usec (1024 x 1024) and 1480 usec (2048 x 2048).

[Chart: Row-Row followed by Column-Column; time (in usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT.]

[Chart: Row-Column followed by Row-Column; time (in usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT.]

[Chart: Column-Row followed by Column-Row; time (in usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT.]

Can Hybrid Help? The strided FFT implementations are about 2x slower than the benchmark. Using the CPU in parallel can improve the timings by 10-20%, which may not be enough to be useful. Rows/columns can be distributed between the CPU and the GPU; possibly, the division threshold can be estimated from the available bandwidth.
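One way such a threshold might be estimated (our sketch; the slides do not give a model): hand a fraction p of the rows to the GPU and choose p so that both processors finish together, charging the CPU partition for its trip across the bus (the GPU partition's transfers can be overlapped, as shown earlier):

$$\frac{pW}{S_{\mathrm{GPU}}} = \frac{(1-p)W}{S_{\mathrm{CPU}}} + \frac{(1-p)W}{B}$$

where W is the total data size, S_GPU and S_CPU are the measured FFT throughputs of the two processors, and B is the effective CPU-GPU bandwidth; solving for p gives the row split.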