Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation
2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016
Kevin Andryc, Tedy Thomas, and Russell Tessier, University of Massachusetts
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
Motivation
- Compiling FPGA designs is time consuming: every change requires resynthesizing the design (synthesize to create a netlist, then translate, map, place & route, and create the BIT file)
- Not every system has a GPGPU available
- GPGPUs are not practical for systems that require minimal power and heat
- GPGPUs are inflexible compared to FPGAs
FlexGrip Soft GPGPU
- FlexGrip: FLEXible GRaphIcs Processor
- Fully CUDA binary-compatible integer soft GPGPU
- Runs multiple applications without the need to recompile the hardware
- Supports highly multithreaded applications and complex conditional execution
- Architectural customizations:
  - Trade power versus performance
  - Add processing, memory, and custom resources
  - Choose between bitstreams, each with different architectural features
  - Reconfigure (perhaps on-the-fly) for specific applications
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
Introduction to the GPGPU Hardware
- Architecture: array of streaming multiprocessors (SMs)
- Each SM consists of a set of 32-bit scalar processors (SPs)
- Single Instruction Multiple Data (SIMD) execution: the multiprocessor executes the same instruction on different scalar processors at each clock cycle
- SP: scalar processor (core); SFU: special function unit (used for transcendental functions like sine, cosine, log, etc.)
Image courtesy: S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A Parallel Functional Simulator for GPGPU," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2010
Software to Hardware Mapping
- Compute Unified Device Architecture (CUDA)
- Block scheduler: assigns thread blocks to multiprocessors
  - Thread block: collection of operations which can be performed in parallel
  - Threads are scheduled in the form of warps
  - Warp: subset of operations performed in parallel, sometimes conditionally
- Fine-grained scheduling: SM architected as a single instruction, multiple thread (SIMT) processor
  - Each scalar processor (SP) executes one thread, maintaining its own PC
  - Performs the same operation on a different set of data, sometimes conditionally
Image courtesy: S. Collange et al., MASCOTS 2010 (see previous slide)
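The grid-to-hardware mapping above can be sketched as a serial emulation. This is illustrative only, not FlexGrip code: `run_kernel` and the 2-block, 4-thread grid shape are assumptions chosen to keep the example small.

```python
# Illustrative sketch of the CUDA thread hierarchy: every thread runs
# the SAME kernel (SIMT), but on the element selected by its global
# thread id = blockIdx * blockDim + threadIdx.

def run_kernel(kernel, data, num_blocks, block_dim):
    """Serially emulate a grid of thread blocks executing one kernel."""
    out = list(data)
    for block_idx in range(num_blocks):        # block scheduler assigns blocks
        for thread_idx in range(block_dim):    # threads within one block
            tid = block_idx * block_dim + thread_idx  # global thread id
            out[tid] = kernel(out[tid])        # same operation, per-thread data
    return out

doubled = run_kernel(lambda x: 2 * x, [1, 2, 3, 4, 5, 6, 7, 8],
                     num_blocks=2, block_dim=4)
print(doubled)  # [2, 4, 6, 8, 10, 12, 14, 16]
```

On real hardware the threads of a warp execute in lockstep rather than serially; the serial loops here only model the index arithmetic.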
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
System Architecture
(Figure: FlexGrip system architecture)
FlexGrip Streaming Multiprocessor
(Figure: FlexGrip streaming multiprocessor pipeline)
Branch Divergence
- Branch divergence occurs when threads inside a warp branch to different execution paths
- Example: instructions inside the ELSE statement are masked (i.e., not executed)
- Once the IF statement completes, the complement of the mask is used to execute the ELSE statement
(Figure: threads in a warp diverging into Path A and Path B after a branch)
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
Conditional Branch Optimizations
- Each of the 24 warps within an SM contains its own warp stack
- Each warp stack has an entry for each thread (32)
- Each entry: 32-bit active thread mask, 2-bit type, 32-bit address
- Prior to executing the taken path, the instruction address and active thread mask are pushed on the stack
- Upon completion of the taken path, the stack is read, the active mask is inverted, and processing continues
- Worst case: requires nesting for all 32 threads (~50KB of memory!)
- Optimization: profile applications for the optimal stack depth
(Figure: control flow unit with predicate lookup table, predicate registers P0-P3, FSM, and per-warp stack of {active thread mask, type, token address, RPC} entries feeding the next PC)
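The push/pop protocol above can be sketched with a small model. The entry encoding here (storing the ELSE-side mask together with the reconvergence address) is one plausible reading of the slide, not the exact FlexGrip format; `diverge` and `reconverge` are names invented for illustration.

```python
# Sketch of a warp reconvergence stack: at a divergent branch, push the
# mask for the not-taken (ELSE) side plus the reconvergence address,
# then run the taken side; popping resumes the ELSE side.

def diverge(stack, active_mask, taken_mask, reconv_pc):
    """Split a warp at a divergent branch; returns the taken-path mask."""
    else_mask = [a and not t for a, t in zip(active_mask, taken_mask)]
    stack.append({"type": "DIV", "mask": else_mask, "addr": reconv_pc})
    return [a and t for a, t in zip(active_mask, taken_mask)]

def reconverge(stack):
    """Pop at the end of the taken path: next PC and mask for the ELSE side."""
    entry = stack.pop()
    return entry["addr"], entry["mask"]

stack = []
active = [True] * 4
taken = [True, False, True, False]
if_mask = diverge(stack, active, taken, reconv_pc=0x40)
print(if_mask)                 # [True, False, True, False]
next_pc, else_mask = reconverge(stack)
print(next_pc, else_mask)      # 64 [False, True, False, True]
```

Nested divergence simply pushes more entries, which is why the worst-case stack depth (and hence the memory cost the slide profiles away) grows with nesting depth.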
Source Operand Optimizations
(Figure: the read stage sits between the decode and execute stages. Three read operand controllers, one per source operand, each calculate an address and read a source operand through the memory and register controller, which arbitrates between memory (global, shared, constant) and registers (vector, predicate, address). Operands 1-3 feed the multiplier and adder in the execute stage, with results passed to the write stage.)
Multiple Streaming Multiprocessors
- Maximum of 256 threads in a thread block
- At the start of execution, the maximum number of thread blocks that can be scheduled is calculated
- Threads are scheduled in a round-robin fashion
(Figure: on the host, CUDA software presents a kernel's grid of thread blocks; on the device, the FlexGrip block scheduler distributes blocks across N streaming multiprocessors, each with a warp scheduling unit, vector register file, SPs for SIMD execution, and shared memory, connected through a memory interconnect to global/system/constant memory)
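The scheduling arithmetic on this slide can be made concrete with a small sketch. The helper names are invented for illustration; the 256-thread block limit and 32-thread warps come from the slides.

```python
# Back-of-envelope model: a block of up to 256 threads yields up to
# 8 warps of 32 threads, which the warp scheduling unit issues
# round-robin.

MAX_BLOCK_THREADS = 256
WARP_SIZE = 32

def warps_per_block(threads_per_block):
    assert threads_per_block <= MAX_BLOCK_THREADS
    return -(-threads_per_block // WARP_SIZE)   # ceiling division

def round_robin_order(num_warps, issue_slots):
    """Warp ids in the order a round-robin scheduler would issue them."""
    return [slot % num_warps for slot in range(issue_slots)]

print(warps_per_block(256))        # 8
print(round_robin_order(3, 7))     # [0, 1, 2, 0, 1, 2, 0]
```

Round-robin issue is a natural fit here: while one warp waits on a long-latency memory access, the scheduler keeps the SPs busy with the other warps.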
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
Design Environment and Benchmarks
- Design environment:
  - Synthesis and design: Xilinx ISE 14.2
  - Simulation: ModelSim SE 10.1
- Total of five CUDA applications evaluated
  - Benchmarks from the University of Wisconsin [1] and the NVIDIA Programmer's Guide [2]
  - Mix of data-parallel and control-flow-intensive applications

Benchmark   Description
Autocorr    Autocorrelation of a 1D array
Bitonic     High-performance sorting network
MatrixMul   Multiplication of square matrices
Reduction   Parallel reduction of a 1D array
Transpose   Matrix transpose

[1] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, and K. Compton, "ERCBench: An open-source benchmark suite for embedded and reconfigurable computing," International Conference on Field Programmable Logic and Applications, Aug. 2010, pp. 408-413
[2] NVIDIA CUDA programming guide, version 2.3.1
Benchmarking vs. MicroBlaze
- MicroBlaze soft processor
  - Implemented on a Xilinx ML605 development board (Virtex-6 VLX240T FPGA)
  - Software timer used for execution time
- FlexGrip soft GPGPU
  - Implemented on the ML605 for 1 SM with 8 SPs
  - ModelSim 10.1 used for benchmarking 1 SM with 16 and 32 SPs, and 2 SM designs with 8, 16, and 32 SPs
- All five benchmarks ran successfully with the same bitstream; compile times < 1 second
- All designs were evaluated at 100 MHz
Architecture Scalability: 1 SM
- Varying SPs in a single SM; average speedups: 8 cores 12x, 16 cores 18x, 32 cores 22x
- Largest speedups:
  - Reduction: array size is a multiple of 32, fully utilizing warps
  - MatrixMul: high arithmetic density
  - Bitonic: divergence cost amortized by more swapping in parallel
- Scaling is limited by memory bandwidth
(Figure: speedup vs. MicroBlaze for variable scalar processors and input data size 256, 1 SM)
Architecture Scalability: 2 SM
- Varying SPs in a 2 SM design: peak speedup over 40x for 4 of 5 benchmarks
- 1 SM vs. 2 SM: speedup ranged from 1.77x (Reduction) to 1.98x (Transpose, MatrixMul)
(Figure: speedup vs. MicroBlaze for variable scalar processors and input data size 256, 2 SM)

Speedup of 2 SM vs. 1 SM (256 data size):
            8 SP   16 SP  32 SP
Autocorr    1.94   1.94   1.94
Bitonic     1.82   1.83   1.85
MatrixMul   1.98   1.98   1.98
Reduction   1.78   1.77   1.77
Transpose   1.98   1.98   1.98
Energy Efficiency
- Estimated using Xilinx's XPower tool
- Dynamic power used to generate efficiency; static power is largely a function of device size
- Energy = Power x Execution Time
- MicroBlaze requires an average of 80% more energy than FlexGrip for the 1 SM, 8 SP configuration
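The comparison reduces to Energy = Power x Execution Time. The power and time figures below are hypothetical placeholders, not measured values from the paper; they only illustrate how a design that draws more power but finishes much sooner can still use less energy.

```python
# Energy = Power x Execution Time: a faster, higher-power design can
# beat a slower, lower-power one on energy.

def energy_j(power_w, time_s):
    """Energy in joules from average power (W) and execution time (s)."""
    return power_w * time_s

# Hypothetical numbers for illustration only:
flexgrip_energy = energy_j(power_w=2.0, time_s=0.10)    # 0.2 J
microblaze_energy = energy_j(power_w=0.5, time_s=0.72)  # 0.36 J

extra = microblaze_energy / flexgrip_energy - 1
print(round(extra, 2))  # 0.8, i.e. "80% more energy"
```

With these placeholder numbers MicroBlaze draws a quarter of the power yet spends 80% more energy, which is the shape of the result on this slide.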
Architectural Customizations

Benchmark   Num. of Oper.  Warp Depth  Slice LUTs  Flip Flops  Block RAM  DSP  % Area Red.  % Dyn. Red.
Baseline    3              32          60,375      103,776     124        156  -            -
Autocorr    3              16          52,121      82,017      124        156  14%          3%
Mat. Mult.  3              0           42,536      60,161      124        156  20%          9%
Reduction   3              0           42,536      60,161      124        156  30%          9%
Transpose   3              0           42,536      60,161      124        156  30%          9%
Bitonic     3              2           39,189      57,301      124        156  35%          15%
Bitonic     2              2           27,136      27,136      120        12   62%          38%

- Removing the multiplier/third operand and reducing warp depth achieves a 23% energy reduction for any benchmark
- Depending on the application space, parameters can be varied to optimize the system
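A quick sanity check on the area column: the reductions appear consistent with percent reduction in slice LUTs against the 60,375-LUT baseline (this reading is an assumption; the exact metric is not stated on the slide, and some rows may fold in other resources).

```python
# Percent reduction in slice LUTs relative to the baseline configuration.

BASELINE_LUTS = 60375  # baseline row: 3 operands, warp depth 32

def area_reduction_pct(luts):
    return round(100.0 * (BASELINE_LUTS - luts) / BASELINE_LUTS)

print(area_reduction_pct(52121))  # 14  (matches the Autocorr row)
print(area_reduction_pct(39189))  # 35  (matches the Bitonic, 3-operand row)
```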
Outline
- Motivation
- Background
- FlexGrip: Soft GPGPU
- Optimizations
- Experimental Results
- Summary
Summary
- Implemented a fully-functional soft GPGPU for FPGAs
  - Executes CUDA code on the FPGA very quickly; no need to resynthesize
  - Can be used in systems that do not have GPGPUs
- Scalable and flexible design
  - Control the number of processing cores and multiprocessors
  - Customize hardware to optimize the system
  - Swap the soft GPGPU into the FPGA as needed
- Significant benefits vs. MicroBlaze
  - Up to 55x speedup for highly parallel benchmarks with the 2 SM design
  - On average 80% dynamic energy reduction versus MicroBlaze
- Additional benefits from architectural optimizations
  - Additional dynamic energy savings of up to 14%
  - Reduced LUT area by 33% on average
Thank you!
Acknowledgements:
- My parents, family and friends
- My advisor, Prof. Russell Tessier
- L-3 KEO
- Xilinx