Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation


Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation. 2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016. Kevin Andryc, Tedy Thomas and Russell Tessier, University of Massachusetts.

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

Motivation. Compiling FPGA designs is time consuming: each change requires resynthesizing the design (synthesize to create a netlist, then translate, map, place & route, and create the BIT file). Not every system has a GPGPU available; GPGPUs are not practical for systems that require minimal power and heat, and are inflexible compared to FPGAs.

FlexGrip Soft GPGPU. FlexGrip: FLEXible GRaphIcs Processor. A fully CUDA binary-compatible integer soft GPGPU: runs multiple applications without the need to recompile the hardware, with support for highly multithreaded applications and complex conditional execution. Architectural customizations: trade power versus performance; add processing, memory, and custom resources; choose between bitstreams, each with different architectural features; reconfigure (perhaps on-the-fly) for specific applications.

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

Introduction to the GPGPU Hardware Architecture. An array of streaming multiprocessors (SMs); each SM consists of a set of 32-bit scalar processors (SPs). Single Instruction Multiple Data (SIMD) execution: the multiprocessor executes the same instruction on different scalar processors at each clock cycle. SP: scalar processor (core). SFU: special function unit (used for transcendental functions like sine, cosine, log, etc.). Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010.

Software to Hardware Mapping. Compute Unified Device Architecture. Block scheduler: assigns thread blocks to multiprocessors. Thread block: a collection of operations which can be performed in parallel; threads are scheduled in the form of warps. Warp: a subset of operations performed in parallel, sometimes conditionally. Fine-grained scheduling: the SM is architected as a single instruction, multiple thread (SIMT) processor; each scalar processor (SP) executes one thread, maintaining its own PC, and performs the same operation on a different set of data, sometimes conditionally. Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010.

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

System Architecture

FlexGrip Streaming Multiprocessor

Branch Divergence. Branch divergence occurs when threads inside a warp branch to different execution paths. Example: instructions inside the ELSE statement are masked (i.e., not executed); once the IF statement completes, the complement of the mask is used to execute the ELSE statement. [Figure: threads of a warp diverging at a branch into Path A and Path B.]

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

Conditional Branch Optimizations. Each of the 24 warps within an SM contains its own warp stack; each warp stack has an entry for each thread (32). Each entry: a 32-bit active thread mask, a 2-bit type, and a 32-bit address. [Figure: the control flow unit, with a predicate lookup table (4x32 predicate registers P0-P3), an active-thread-mask FSM, and the per-warp stack of {mask, type, token address, RPC} entries feeding the next PC.] Prior to executing the taken path, the instruction address and active thread mask are pushed on the stack. Upon completion of the taken path, the stack is read, the active mask is inverted, and processing continues. Worst case: nesting for all 32 threads requires ~50KB of memory. Optimization: profile applications for the optimal depth.

Source Operand Optimizations. [Figure: the read-stage datapath. A read operand controller per source operand (SRC 1-3) calculates an address and reads the operand, taking control and address inputs from the decode stage, from memory (global, shared, constant) and from the registers (vector, predicate, address) via the memory and register controller; operands Op 1-3 feed the execute stage and multiplier, with results passed to the write stage.]

Multiple Streaming Multiprocessors. A maximum of 256 threads per thread block; at the start of execution, the maximum number of thread blocks that can be scheduled is calculated. [Figure: on the host, the CUDA software's kernel grid of thread blocks 0..N-1; on the device, the block scheduler assigns blocks to FlexGrip streaming multiprocessors 0..N-1, each with a warp scheduling unit, vector register file, SIMD-execution SPs, and shared memory, connected through a memory interconnect to global/system/constant memory.] Threads are scheduled in a round-robin fashion.

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

Design Environment and Benchmarks. Design environment: synthesis and design with Xilinx ISE 14.2; simulation with ModelSim SE 10.1. A total of five CUDA applications evaluated: benchmarks from the University of Wisconsin [1] and the NVIDIA Programmer's Guide [2], a mix of data-parallel and control-flow intensive. Benchmarks: Autocorr (autocorrelation of a 1D array), Bitonic (high-performance sorting network), MatrixMul (multiplication of square matrices), Reduction (parallel reduction of a 1D array), Transpose (matrix transpose). [Figure: per-benchmark percentage breakdown, 0-100%.] [1] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, and K. Compton, "ERCBench: An open-source benchmark suite for embedded and reconfigurable computing," in International Conference on Field Programmable Logic and Applications, Aug. 2010, pp. 408-413. [2] NVIDIA CUDA programming guide, version 2.3.1.

Benchmarking vs. MicroBlaze. MicroBlaze soft processor: implemented on a Xilinx ML605 development board (Virtex-6 VLX240T FPGA); a software timer used for execution time. FlexGrip soft GPGPU: implemented on the ML605 for 1 SM and 8 SPs; ModelSim 10.1 used for benchmarking 1 SM with 16 and 32 SPs, and 2 SM designs with 8, 16, and 32 SPs. All five benchmarks ran successfully with the same bitstream; compile times < 1 second. All designs were evaluated at 100 MHz.

Architecture Scalability: 1 SM. Varying SPs in a single SM; average speedups: 8 cores, 12x; 16 cores, 18x; 32 cores, 22x. Largest speedups: Reduction (array size a multiple of 32, fully utilizing warps); MatrixMult (high arithmetic density); Bitonic (divergence cost amortized by more swapping in parallel). Scaling is limited by memory bandwidth. [Figure: speedup vs. MicroBlaze for variable scalar processors and input data size 256, 1 SM.]

Architecture Scalability: 2 SM. Varying SPs in the 2 SM design: peak speedup over 40x for 4 of 5 benchmarks. 1 SM vs. 2 SM: speedup ranged from 1.77x (Reduction) to 1.98x (Transpose, MatrixMul). [Figure: speedup vs. MicroBlaze for variable scalar processors and input data size 256, 2 SM.]

Speedup of 2 SM vs. 1 SM (256 data size), at 8 / 16 / 32 SPs:
Autocorr: 1.94 / 1.94 / 1.94
Bitonic: 1.82 / 1.83 / 1.85
MatrixMul: 1.98 / 1.98 / 1.98
Reduction: 1.78 / 1.77 / 1.77
Transpose: 1.98 / 1.98 / 1.98

Energy Efficiency. Estimated using Xilinx's XPower tool; dynamic power used to generate efficiency, since static power is largely a function of device size. Energy = Power x Execution Time. MicroBlaze requires an average of 80% more energy than FlexGrip for the 1 SM, 8 SP configuration.

Architectural Customizations. Resource usage per configuration (number of operands / warp depth / slice LUTs / flip-flops / block RAM / DSP / % area reduction / % dynamic power reduction):
Baseline: 3 / 32 / 60,375 / 103,776 / 124 / 156 / - / -
Autocorr: 3 / 16 / 52,121 / 82,017 / 124 / 156 / 14% / 3%
Mat. Mult.: 3 / 0 / 42,536 / 60,161 / 124 / 156 / 20% / 9%
Reduction: 3 / 0 / 42,536 / 60,161 / 124 / 156 / 30% / 9%
Transpose: 3 / 0 / 42,536 / 60,161 / 124 / 156 / 30% / 9%
Bitonic: 3 / 2 / 39,189 / 57,301 / 124 / 156 / 35% / 15%
Bitonic: 2 / 2 / 27,136 / 27,136 / 120 / 12 / 62% / 38%
Removing the multiplier/third operand and reducing warp depth achieves a 23% energy reduction for any benchmark; depending on the application space, one could vary these parameters to optimize the system.

Outline: Motivation; Background; FlexGrip: Soft GPGPU; Optimizations; Experimental Results; Summary

Summary. Implemented a fully-functional soft GPGPU for FPGAs: executes CUDA code on the FPGA very quickly, with no need to resynthesize; can be used in systems that do not have GPGPUs. Scalable and flexible design: control the number of processing cores and multiprocessors; customize hardware to optimize the system; swap the soft GPGPU into the FPGA as needed. Significant benefits vs. MicroBlaze: up to 55x speedup for highly parallel benchmarks with the 2 SM design; on average 80% dynamic energy reduction versus MicroBlaze. Additional benefits from architectural optimizations: additional dynamic energy savings of up to 14%; LUT area reduced by 33% on average.

Thank you! Acknowledgements: my parents, family and friends; my advisor, Prof. Russell Tessier; L-3 KEO; Xilinx.