FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009

Overview
Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs. Symposium on Application Specific Processors, July 2009. (Best Paper Award)
Outline: Comparison: FPGA vs GPU vs CPU; CUDA; FCUDA; AutoPilot; Results

FCUDA: Overview
CUDA is a popular language for programming Nvidia GPUs; it describes coarse-grained parallelism.
FCUDA converts CUDA into annotated C code.
High-level synthesis (AutoPilot) then converts the C into RTL.
The RTL is synthesized for an FPGA.

Why FPGAs?
FPGAs have higher computational density per Watt than GPUs and CPUs:
32-bit integer arithmetic: 6X higher
32-bit floating-point arithmetic: 2X higher

Power: FPGA vs GPU vs CPU. J. Williams et al., "Computational density of fixed and reconfigurable multi-core devices for application acceleration," RSSI, 2008.

Computational Density: FPGA vs GPU vs CPU. J. Williams et al., "Computational density of fixed and reconfigurable multi-core devices for application acceleration," RSSI, 2008.

CUDA: Compute Unified Device Architecture
CUDA is a C API and compiler infrastructure for Nvidia GPUs. The GPU is a SIMD architecture.

GPU Memory Hierarchy

CUDA Example
C code:

    int main() {
        ...
        for (int i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }

CUDA:

    int main() {
        ...
        VecAdd<<<1, N>>>(A, B, C);
    }

    __global__ void VecAdd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
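The slide elides the host-side setup. For readers following along, here is a minimal complete sketch of the same example, using the standard CUDA runtime API for the elided allocation and copies (the names hA, dA, and so on are introduced here, not taken from the talk):

    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N 256

    __global__ void VecAdd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {
        float hA[N], hB[N], hC[N];
        for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

        /* allocate device buffers and copy the inputs over */
        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, N * sizeof(float));
        cudaMalloc((void **)&dB, N * sizeof(float));
        cudaMalloc((void **)&dC, N * sizeof(float));
        cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

        VecAdd<<<1, N>>>(dA, dB, dC);   /* one block of N threads */

        cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("hC[1] = %f\n", hC[1]);  /* expect 3.0 */

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }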

AutoPilot

AutoPilot

FCUDA
Map thread-blocks onto parallel FPGA cores.
Minimize inter-core communication to avoid synchronization.
FCUDA pragmas:
COMPUTE: computation task of the kernel
TRANSFER: data-communication task to off-chip DRAM
SYNC: synchronization scheme
  simple DMA: serialize data communication and computation
  ping-pong DMA: double the BRAMs to allow concurrent data communication and computation (see the sketch below)
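To make the two SYNC schemes concrete, here is an illustrative C sketch (not FCUDA output; dma_fetch and compute are hypothetical stand-ins for the TRANSFER and COMPUTE tasks). In software the calls run sequentially; the point is the schedule the HLS tool can realize in hardware, where the fetch of block b+1 overlaps the computation on block b:

    /* Hypothetical sketch of the two SYNC schemes. */
    #define TILE 1024
    static float bram[2][TILE];          /* doubled on-chip buffers */

    /* stands in for a DMA burst transfer (TRANSFER task) */
    static void dma_fetch(float *dst, const float *src, int n) {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }

    /* stands in for the per-block computation (COMPUTE task) */
    static void compute(float *buf, int n) {
        for (int i = 0; i < n; i++) buf[i] *= 2.0f;
    }

    /* Simple DMA: transfer and computation are serialized. */
    void simple_dma(float *dram, int nblocks) {
        for (int b = 0; b < nblocks; b++) {
            dma_fetch(bram[0], dram + b * TILE, TILE);
            compute(bram[0], TILE);
        }
    }

    /* Ping-pong DMA: with two BRAM buffers, the fetch of block b+1
       can be scheduled concurrently with the computation on block b. */
    void pingpong_dma(float *dram, int nblocks) {
        dma_fetch(bram[0], dram, TILE);
        for (int b = 0; b < nblocks; b++) {
            int cur = b & 1;
            if (b + 1 < nblocks)
                dma_fetch(bram[cur ^ 1], dram + (b + 1) * TILE, TILE);
            compute(bram[cur], TILE);
        }
    }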

Coulombic Potential (CP) CUDA kernel

    __global__ void cenergy(int numatoms, int gridspacing, int *energygrid) {
        ...
        int energyval = 0;
        for (atomid = 0; atomid < numatoms; atomid++) {
            ...
            energyval += atominfo[atomid].w * r_1;
        }
        energygrid[outaddr] += energyval;
    }

FCUDA-annotated CP kernel (grid and synchronization pragmas)

    #pragma FCUDA GRID x_dim=2 y_dim=1 begin name=cp_grid
    #pragma FCUDA BLOCKS start_x=0 end_x=128 start_y=0 end_y=128
    #pragma FCUDA SYNC type=simple
    __global__ void cenergy(int numatoms, int gridspacing, int *energygrid) {
        ...
        int energyval = 0;
        for (atomid = 0; atomid < numatoms; atomid++) {
            ...
            energyval += atominfo[atomid].w * r_1;
        }
        energygrid[outaddr] += energyval;
    }
    #pragma FCUDA GRID end name=cp_grid

FCUDA-annotated CP kernel (adding the COMPUTE task)

    #pragma FCUDA GRID x_dim=2 y_dim=1 begin name=cp_grid
    #pragma FCUDA BLOCKS start_x=0 end_x=128 start_y=0 end_y=128
    #pragma FCUDA SYNC type=simple
    __global__ void cenergy(int numatoms, int gridspacing, int *energygrid) {
        #pragma FCUDA COMPUTE cores=2 begin name=cp_block
        ...
        int energyval = 0;
        for (atomid = 0; atomid < numatoms; atomid++) {
            ...
            energyval += atominfo[atomid].w * r_1;
        }
        #pragma FCUDA COMPUTE end name=cp_block
        energygrid[outaddr] += energyval;
    }
    #pragma FCUDA GRID end name=cp_grid

FCUDA-annotated CP kernel (adding the TRANSFER tasks)

    #pragma FCUDA GRID x_dim=2 y_dim=1 begin name=cp_grid
    #pragma FCUDA BLOCKS start_x=0 end_x=128 start_y=0 end_y=128
    #pragma FCUDA SYNC type=simple
    __global__ void cenergy(int numatoms, int gridspacing, int *energygrid) {
        #pragma FCUDA TRANSFER cores=1 type=burst begin name=fetch
        #pragma FCUDA DATA type=load from=atominfo start=0 end=MAXATOMS
        #pragma FCUDA TRANSFER end name=fetch
        #pragma FCUDA COMPUTE cores=2 begin name=cp_block
        ...
        int energyval = 0;
        for (atomid = 0; atomid < numatoms; atomid++) {
            ...
            energyval += atominfo[atomid].w * r_1;
        }
        #pragma FCUDA COMPUTE end name=cp_block
        #pragma FCUDA TRANSFER cores=1 type=burst begin name=write
        energygrid[outaddr] += energyval;
        #pragma FCUDA TRANSFER end name=write
    }
    #pragma FCUDA GRID end name=cp_grid

FCUDA compiler passes
FCUDA is built on the Cetus source-to-source compiler.
Front-end transformations:
Convert implicit thread execution into an explicit loop.
Vectorize local thread variables.
Back-end transformations:
Split tasks into separate AutoPilot functions.
Wrap parallel function calls with AUTOPILOT REGION PARALLEL pragmas.
Create DMA burst transfers between off-chip DRAM and on-chip BRAM arrays.

FCUDA front-end processed CP compute task (before the transformation)

    #pragma FCUDA COMPUTE cores=2 begin name=cp_block
    ...
    int energyval = 0;
    for (atomid = 0; atomid < numatoms; atomid++) {
        ...
        energyval += atominfo[atomid].w * r_1;
    }
    #pragma FCUDA COMPUTE end name=cp_block

FCUDA front-end processed CP compute task (after thread-loop creation and variable vectorization)

    #pragma FCUDA COMPUTE cores=2 begin name=cp_block
    int energyval[];
    for (tidx.y = 0; tidx.y < blockDim.y; tidx.y++)   // thread loop
        for (tidx.x = 0; tidx.x < blockDim.x; tidx.x++) {
            ...
            energyval[tidx] = 0.0f;
            for (atomid = 0; atomid < numatoms; atomid++) {
                ...
                energyval[tidx] += atominfo[atomid].w * r_1;
            }
        }
    #pragma FCUDA COMPUTE end name=cp_block

FCUDA-generated AutoPilot code
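The generated code itself is not reproduced in the transcription. As a rough sketch of the structure the back-end emits, per the compiler passes above (all function and array names here are hypothetical, and the parallel-region pragmas appear as comments because the talk does not give the exact AutoPilot syntax):

    /* Hypothetical sketch of FCUDA back-end output: each task becomes an
       AutoPilot C function; parallel core calls form a parallel region. */
    #define MAXATOMS 4000
    #define TILE     1024

    static float atominfo_bram[MAXATOMS];    /* on-chip copy of atom data */
    static float energygrid_bram[2][TILE];   /* one output tile per core  */

    /* TRANSFER task: burst-copy atom data from off-chip DRAM into BRAM */
    static void fetch(const float *dram_atoms) {
        for (int i = 0; i < MAXATOMS; i++)
            atominfo_bram[i] = dram_atoms[i];
    }

    /* COMPUTE task: one core accumulates potential into its grid tile */
    static void cp_block(float grid[TILE], int numatoms) {
        for (int j = 0; j < TILE; j++) {
            float energyval = 0.0f;
            for (int a = 0; a < numatoms; a++)
                energyval += atominfo_bram[a];   /* distance math elided */
            grid[j] += energyval;
        }
    }

    /* TRANSFER task: burst-copy a finished tile back to DRAM */
    static void write_back(float *dram_grid, const float grid[TILE], int off) {
        for (int i = 0; i < TILE; i++)
            dram_grid[off + i] = grid[i];
    }

    void cenergy_fpga(const float *dram_atoms, float *dram_grid, int numatoms) {
        fetch(dram_atoms);
        /* #pragma AUTOPILOT REGION PARALLEL begin -- AutoPilot schedules
           the two core calls below as concurrent hardware */
        cp_block(energygrid_bram[0], numatoms);  /* core 0 */
        cp_block(energygrid_bram[1], numatoms);  /* core 1 */
        /* #pragma AUTOPILOT REGION PARALLEL end */
        write_back(dram_grid, energygrid_bram[0], 0);
        write_back(dram_grid, energygrid_bram[1], TILE);
    }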

Results: CUDA kernels

Results: FPGA kernel characteristics

Results: Devices
FPGA: Xilinx Virtex-5 XC5VFX200T (65nm)
Over 100K LUTs
2MB of on-chip BlockRAM
384 DSP units
No specs given for off-chip DRAM
Max clock: 550 MHz
GPU: Nvidia GeForce 8800 GTX (90nm)
16 Streaming Multiprocessors (SMs), 128 cores total
Register file: 32KB/SM = 512KB total
Shared (on-chip) memory: 16KB/SM = 256KB total
Global (off-chip) memory: 768MB, 86.4GB/s bandwidth
Core clock: 575 MHz

Results: FPGA vs GPU Speedup

Results: FPGA vs GPU Power

Conclusions
A CUDA-to-FPGA flow is possible with decent performance.
No comparison to more recent GPUs (e.g., the GTX 280).
Future work: use OpenCL instead of CUDA; OpenCL is supported on both ATI and Nvidia GPUs.

Cetus: Source-to-source compiler
Cetus is a source-to-source C compiler written in Java, maintained at Purdue University.
Designed for automatic parallelization:
Data-dependence analysis
Control-flow graph generation
Available at cetus.ecn.purdue.edu under a modified Artistic License.

Cetus: Examples
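The example slides are not reproduced in the transcription. As an illustration of the kind of output Cetus's automatic parallelization produces (the exact annotation format is an assumption; Cetus marks loops its dependence analysis proves parallel and can emit OpenMP pragmas), a loop such as:

    /* Input: no loop-carried dependences, so every iteration
       can execute in parallel */
    void add(float *a, const float *b, const float *c, int n) {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

might come out annotated as:

    /* Sketch of Cetus output; annotation spelling is an assumption */
    void add(float *a, const float *b, const float *c, int n) {
        int i;
        #pragma cetus parallel
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }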