ECE 5775 (Fall '17) High-Level Digital Design Automation: Specialized Computing
Announcements
- All students enrolled in CMS & Piazza
- Vivado HLS tutorial on Tuesday 8/29; bring your laptop!
- Install an SSH client (MobaXterm or PuTTY)
  % ssh -X <netid>@ecelinux-.ece.cornell.edu
Agenda
- Motivation for hardware specialization
- Case study on convolution
- FPGA introduction: basic building blocks, basic homogeneous FPGA architecture, modern heterogeneous FPGA architecture
Tension between Efficiency and Flexibility
[Figure: energy efficiency (MOPS/mW) versus flexibility across programmable processing options]
Source: "High-Performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era," Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.
A Simple Single-Cycle Microprocessor
[Figure: single-cycle datapath with PC, register file (RF), ALU, and RAM, plus control fields DR, SA, SB, IMM, MB, FS, MD, LD, and MW]
Evaluating a Simple Expression on a CPU
Step-by-step CPU activities (each instruction exercises the full PC -> RF -> ALU -> RAM path):
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
Source: adapted from Desh Singh's talk at the HCP 4 workshop
Unrolling the CPU Hardware
1. Replicate the CPU hardware: one copy per instruction, laid out in space
CPU1: R1 <= M[R0]
CPU2: R2 <= M[R0+1]
CPU3: R3 <= R1 + R2
CPU4: M[R0+2] <= R3
Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
Each stage now retains only its register file, ALU, and RAM port.
A Special-Purpose Architecture
5. Wire up registers and propagate values
The sequence collapses into a small fixed datapath: two loads (LW) feeding an adder, whose sum R3 drives a store (SW). The resulting circuit can be realized with either an ASIC or an FPGA.
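The collapsed datapath computes M[R0+2] = M[R0] + M[R0+1]. As a minimal software model of what the fixed circuit does (register numbers R0-R3 are reconstructed from context, not confirmed by the slide):

```c
#include <stdint.h>

/* Software model of the four-instruction sequence: load two adjacent
 * words, add them, and store the sum into the next word. */
void add_adjacent(int32_t *M, int r0) {
    int32_t r1 = M[r0];      /* R1 <= M[R0]   (first LW)  */
    int32_t r2 = M[r0 + 1];  /* R2 <= M[R0+1] (second LW) */
    int32_t r3 = r1 + r2;    /* R3 <= R1 + R2 (adder)     */
    M[r0 + 2] = r3;          /* M[R0+2] <= R3 (SW)        */
}
```

In the specialized circuit these four steps become pure wiring between two load ports, an adder, and a store port, with no fetch/decode overhead.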
Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
Typical superscalar out-of-order pipeline: Fetch -> Decode -> Rename -> Schedule -> Register read/write -> Execute (ALU/FPU) -> Commit, supported by a branch predictor, L1 I-cache, L1 D-cache + TLB, RAT, free list, reservation stations, LSQ, and ROB.
Parameter                     | Value
Fetch/issue/retire width      | 4
# Integer ALUs                | 3
# FP ALUs                     | 2
# ROB entries                 | 96
# Reservation station entries | 64
L1 I-cache                    | 32 KB, 8-way set associative
L1 D-cache                    | 32 KB, 8-way set associative
L2 cache                      | 6 MB, 8-way set associative
[source: Jason Cong, ISLPED'14 keynote]
Energy Breakdown of Pipeline Components
[Pie chart over the pipeline above: Misc 23%, Memory 10%, Fetch unit 9%, Decode 6%, Rename 12%, Scheduler 11%, Register files 3%, Int ALU 14%, Mul/div 4%, FPU 8%]
Removing Non-Computing Portions
Non-computing overheads: Misc 23%, Fetch unit 9%, Decode 6%, Rename 12%, Scheduler 11%, Register files 3%
Computing portion: 10% (memory) + 26% (compute: Int ALU 14%, FPU 8%, Mul/div 4%) = 36%
Energy Comparison of Processor ALUs and Dedicated Units
Operation               | Processor ALU   | 45 nm TSMC standard-cell library
32-bit add              | 0.22 nJ @ 2 GHz | 0.02 nJ @ 1 GHz
32-bit multiply         | 1.2 nJ @ 2 GHz  | 0.17 nJ @ 1 GHz
Single-precision FP op  | 1.5 nJ @ 2 GHz  | 0.18 nJ @ 500 MHz
Why are processor units so expensive?
- The ALU can perform multiple operations: add/sub/bitwise XOR/OR/AND
- A 64-bit ALU is used even for narrower operations
- Dynamic/domino logic used to run at high frequency -> higher power dissipation
Energy Breakdown with Standard-Cell ASICs
[Pie chart: Misc 23%, Fetch unit 9%, Decode 6%, Rename 12%, Scheduler 11%, Register files 3%, Memory 10%, ALU/FPU savings ~25%, Int ALU 0.2%, FPU 0.4%, Mul/div 0.2%]
Computing portion: 10% (memory) + ~1% (compute) = ~11%
Only ~10X gain is attainable?
Additional Energy Savings from Specialization
- Specialized memory architecture: exploit regular memory access patterns to minimize energy per memory read/write
- Specialized communication architecture: exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
- Customized data types: exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
Combined, these techniques can yield another 10-100X energy efficiency improvement over GPPs
Case Study: Memory Specialization for Convolution
The main computation of image/video processing is performed over overlapping stencils, termed convolution.
[Figure: a 3x3 convolution kernel sliding over an input image frame to produce an output image frame]
Example Application: Edge Detection
Identifies discontinuities in an image where brightness (or image intensity) changes sharply. Very useful for feature extraction in computer vision.
Sobel operator: G = (G_X, G_Y)
[Figures: Pilho Kim, GaTech]
CPU Implementation of Convolution
for (n = 0; n < height-2; n++)
  for (m = 0; m < width-2; m++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[n][m] += img[n+i][m+j] * f[i][j];
[Figure: CPU accessing a cache backed by main memory]
General-Purpose Cache for Convolution
A cache minimizes main-memory accesses to improve performance as the window scans the input picture (W pixels wide). However, a general-purpose cache is expensive in area and incurs nontrivial energy overhead.
Specializing the Cache for Convolution (1)
Remove rows that are not in the neighborhood of the convolution window.
Specializing the Cache for Convolution (2)
Rearrange the rows as a 1D array of pixels. Each time we move the window to the right, we push the new pixel into the cache and remove the edge pixels that are no longer needed for computation.
A Specialized Cache: the Line Buffer
Line buffer: a fixed-width cache holding (K-1)*W + K pixels in flight, i.e., 2W+3 pixels for K=3. Fixed addressing brings low area/power and high performance: as the window moves right, one new pixel is pushed in and the oldest pixel retires. In customized FPGA implementations, line buffers can be efficiently implemented with on-chip BRAMs.
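A minimal C model of the K=3 line buffer may clarify the idea. The width W, the memmove-based shift, and the window-access helper are illustrative choices of mine; real hardware uses a shift register or BRAM addressing rather than copying pixels:

```c
#include <string.h>

#define W 8                        /* image width (illustrative)      */
#define K 3                        /* convolution kernel size         */
#define DEPTH ((K - 1) * W + K)    /* (K-1)*W + K = 2W+3 = 19 pixels  */

/* The pixels currently "in flight": the oldest sits at index 0. */
static int line_buf[DEPTH];

/* Push one new pixel per cycle; the oldest pixel falls off the end. */
void lb_push(int new_pixel) {
    memmove(line_buf, line_buf + 1, (DEPTH - 1) * sizeof(int));
    line_buf[DEPTH - 1] = new_pixel;
}

/* Read the KxK window at its current position: row r, col c (0..K-1).
 * Each window row starts W pixels after the previous one. */
int lb_window(int r, int c) {
    return line_buf[r * W + c];
}
```

Because every window access uses a fixed offset into the buffer, no tag lookup or address comparison is needed, which is exactly why the line buffer is cheaper than a general-purpose cache.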
What is an FPGA?
FPGA: Field-Programmable Gate Array, "an integrated circuit designed to be configured by a customer or a designer after manufacturing" (Wikipedia).
Components in an FPGA chip:
- Programmable logic blocks
- Programmable interconnects
- Programmable I/Os
Three Important Pieces
SRAM-based implementation is popular; non-standard (non-SRAM) technologies lag by a technology generation.
- Lookup table (LUT, formed by SRAM bits)
- Pass transistor (controlled by an SRAM bit)
- Multiplexer (controlled by SRAM bits)
Multiplexer as a Universal Gate
Any function of k variables can be implemented with a 2^k:1 multiplexer.
[Figure: full-adder truth table for Sum and Cout; Cout implemented with an 8:1 MUX whose select inputs S2, S1, S0 are driven by Cin, B, A and whose data inputs 0-7 hold the truth-table values]
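To make this concrete, here is a small C model (the table layout and helper names are my own) that implements the full adder's Cout with an 8:1 MUX whose data inputs hold the truth table:

```c
/* Truth table of Cout = AB + ACin + BCin, indexed by (Cin,B,A):
 * the eight constants are the MUX data inputs. */
static const int cout_table[8] = {
    /* CinBA=000 */ 0, /* 001 */ 0, /* 010 */ 0, /* 011 */ 1,
    /* 100 */ 0, /* 101 */ 1, /* 110 */ 1, /* 111 */ 1
};

/* 8:1 multiplexer: three select bits pick one of eight data inputs. */
int mux8(const int d[8], int s2, int s1, int s0) {
    return d[(s2 << 2) | (s1 << 1) | s0];
}

/* The function inputs drive the select lines; the data inputs are
 * constants, so the MUX acts as a universal gate. */
int full_adder_cout(int a, int b, int cin) {
    return mux8(cout_table, cin, b, a);
}
```

Changing only the eight constants reprograms the same MUX to any other 3-input function, which is the principle behind the LUT on the next slides.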
How Many Functions?
How many distinct 3-input, 1-output Boolean functions exist? What about k inputs?
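For reference, a k-input function is a choice of one output bit for each of the 2^k input patterns, so there are 2^(2^k) distinct functions; a small helper (illustrative) computes this count:

```c
/* Number of distinct k-input, 1-output Boolean functions: 2^(2^k).
 * Valid for k <= 5 with a 64-bit unsigned long. */
unsigned long num_functions(int k) {
    return 1UL << (1UL << k);   /* 2^(2^k) */
}
```

For k = 3 this gives 2^8 = 256 functions, each realizable by one configuration of an 8-entry truth table.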
Look-Up Table (LUT)
A k-input LUT (k-LUT) can be configured to implement any k-input, 1-output combinational logic function.
- 2^k SRAM bits
- Delay is independent of the logic function
[Figure: a 3-input LUT built from 8 SRAM cells (0/1) feeding a multiplexer tree selected by x2, x1, x0 to produce Y]
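A k-LUT can be modeled in C as a 2^k-entry table indexed by the packed inputs (a sketch; the function and array names are mine):

```c
/* Evaluate a k-input LUT: the SRAM contents are the truth table,
 * and the inputs x[0..k-1] form the read address. */
int lut_eval(const int *sram, int k, const int *x) {
    int addr = 0;
    for (int i = 0; i < k; i++)
        addr |= (x[i] & 1) << i;   /* pack input i into address bit i */
    return sram[addr];
}
```

Configuring the SRAM bits as {0,1,1,0,1,0,0,1} makes the 3-LUT compute XOR of its three inputs (the Sum output of a full adder); any other bit pattern yields a different function at identical delay.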
How Many LUTs?
How many 3-input LUTs are needed to implement the following full adder (inputs A, B, Cin; outputs S, Cout)? How about using 4-input LUTs?
A Logic Element
A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed. The LUT and FF combined form a logic element.
A Logic Block
A logic block clusters multiple logic elements.
Example: in Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices, each containing four LUTs, plus two independent carry chains per CLB (CIN/COUT) for implementing adders and a crossbar switch for local routing.
Traditional Homogeneous FPGA Architecture
[Figure: island-style array of logic blocks, switch blocks, and routing tracks]
Modern Heterogeneous Field-Programmable System-on-Chip
- Island-style configurable mesh routing
- Lots of dedicated components: memories/multipliers, I/Os, processors
- Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]
Dedicated DSP Blocks
Built-in components for fast arithmetic operations, optimized for DSP applications. Essentially a multiply-accumulate core with many other features. Logic and connections are fixed, but functionality may be configured using control signals at run time. Much faster than a LUT-based implementation (ASIC vs. LUT).
Example: Xilinx DSP48E Slice
- 25x18 signed multiplier
- 48-bit add/subtract/accumulate
- 48-bit logic operations
- SIMD operations (12/24-bit)
- Pipeline registers for high speed
[source: Xilinx Inc.]
Finite Impulse Response (FIR) Filter Mapped to DSP Slices
y[n] = sum_{i=0}^{N} c_i * x[n-i]
[Figure: x(n) feeding a chain of DSP slices with coefficients C0-C3, producing y(n)]
[source: Xilinx Inc.]
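The FIR equation maps one multiply-accumulate per tap onto one DSP slice. A C sketch of the direct form, with illustrative tap count and coefficients:

```c
#define N_TAPS 4   /* matches the four-coefficient figure; illustrative */

/* Direct-form FIR: y[n] = sum over i of c[i] * x[n-i].
 * Each iteration is one multiply-accumulate, i.e., one DSP slice. */
int fir(const int c[N_TAPS], const int x[], int n) {
    int y = 0;
    for (int i = 0; i < N_TAPS; i++)
        if (n - i >= 0)          /* skip samples before the input starts */
            y += c[i] * x[n - i];
    return y;
}
```

In hardware the loop is fully unrolled: the four MACs run in parallel, with the DSP slices' built-in pipeline registers carrying the partial sums from one slice to the next.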
Dedicated Block RAMs (BRAMs)
Example: Xilinx 18K/36K block RAMs
- 32K x 1 to 512 x 72 configurations in one 36K block
- Simple dual-port and true dual-port configurations
- Built-in FIFO logic
- 64-bit error-correction coding per 36K block
[Figure: dual-port block RAM with per-port data (DI/DO), parity (DIP/DOP), address (ADDR), write-enable (WE), enable (EN), and clock (CLK) signals]
[source: Xilinx Inc.]
Embedded FPGA System-on-Chip
Xilinx Zynq All Programmable System-on-Chip:
- Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz - 1 GHz
- Up to 350K logic cells
- 2 MB Block RAM
- 900 DSP48 slices
[Source: Xilinx Inc.]
FPGA as an Accelerator for Cloud Computing
- Massive amount of fine-grained parallelism
- Silicon configurable to fit the application
- Performance/watt advantage over CPUs & GPUs
AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P, with ~2 million logic blocks, thousands of DSP blocks, and hundreds of Mb of block RAM.
[Figure source: David Pellerin, AWS]
Microsoft Deploying FPGAs in Datacenters
FPGAs deployed in Microsoft datacenters accelerate various web, database, and AI services. For example, Project BrainWave claimed ~40 teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs.
[source: Microsoft, BrainWave, Hot Chips 2017]
Summary: FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism: highly parallel and/or deeply pipelined designs achieve maximum parallelism; distributed data/control dispatch
- Silicon configurable to fit the algorithm: compute the exact algorithm at the desired level of numerical accuracy; bit-level sizing and sub-cycle chaining; customized memory hierarchy
- Performance/watt advantage: low power consumption compared to CPUs and GPGPUs; low clock speed; specialized architecture blocks
Next Class
Vivado HLS tutorial (led by TAs)
Acknowledgements
These slides contain/adapt materials developed by Prof. Jason Cong (UCLA)