Course Overview Revisited

  void blur_filter_3x3(Image &in, Image &blur) {
    // blur in the x dimension
    for (int y = 0; y < in.height(); y++)
      for (int x = 0; x < in.width(); x++)
        blur(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3;
  }

Compiler flow: Algorithm -> Parsing -> Transformations -> Scheduling -> Allocation -> Binding -> RTL Generation
[Architecture diagram: BitSel units, convolution windows, adder tree, PE, output buffer]
High-Level Design & Automation for Programmable System-on-Chip
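The blur kernel above can be sketched as a runnable program. The `Image` struct below is an assumption (a minimal stand-in; the course's actual Image class may differ), and the loop bounds are clamped so the 3-point window stays inside the image:

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for the slide's Image class (an assumption --
// the real course code may define it differently).
struct Image {
  int w, h;
  std::vector<int> data;
  Image(int w_, int h_) : w(w_), h(h_), data(w_ * h_, 0) {}
  int width() const { return w; }
  int height() const { return h; }
  int &operator()(int x, int y) { return data[y * w + x]; }
};

// x-dimension pass of the 3x3 blur from the slide, with the inner
// loop clamped to avoid reading outside the image.
void blur_x(Image &in, Image &blur) {
  for (int y = 0; y < in.height(); y++)
    for (int x = 1; x < in.width() - 1; x++)
      blur(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
}
```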
Understanding Energy Inefficiency of General-Purpose Processors

[Diagram: typical superscalar out-of-order pipeline — Fetch (L1-I$, branch predictor), Decode, Rename (RAT, free list), Schedule (reservation stations, ROB), Register read/write (Int RF, FP RF), Execute (ALU, FPU, LSQ + TLB, L1 D-cache), Commit.]

  Parameter                      | Value
  Fetch/issue/retire width       | 4
  # Integer ALUs                 | 3
  # FP ALUs                      | 2
  # ROB entries                  | 96
  # Reservation station entries  | 64
  L1 I-cache                     | 32 KB, 8-way set-associative
  L1 D-cache                     | 32 KB, 8-way set-associative
  L2 cache                       | 6 MB, 8-way set-associative

[source: Jason Cong, ISLPED '14 keynote]
Energy Breakdown of Pipeline Components

[Same pipeline diagram as the previous slide.] Breakdown: Memory 10%, Misc 23%, FPU 8%, Fetch unit 9%, Rename 2%, Scheduler %, Decode 6%, Register files 3%, Mul/div 4%, Int ALU 4%.
Removing Non-Computing Portions

Stripping away the fetch, decode, rename, scheduling, and miscellaneous overheads leaves only the computing portion: 10% (memory) + 26% (compute) = 36% of total energy.
Energy Comparison of Processor ALUs and Dedicated Units

  Operation              | Processor ALU    | 45 nm TSMC standard cell library
  32-bit add             | .22 nJ @ 2 GHz   | .2 nJ @ GHz
  32-bit multiply        | .2 nJ @ 2 GHz    | .7 nJ @ GHz
  Single-precision FP op | .5 nJ @ 2 GHz    | .8 nJ @ 5 MHz

Why are processor units so expensive?
- The ALU can perform multiple operations (add/sub/bitwise XOR/OR/AND) on a 64-bit datapath
- Dynamic/domino logic is used to run at high frequency, leading to higher power dissipation
A Simple Single-Cycle Microprocessor

[Datapath diagram: PC with an incrementing adder; register file (RF) with select fields DR, SA, SB; immediate field IMM with sign extension (SE); mux MB choosing between register data B and the immediate; function select FS driving the ALU, which produces status flags V, C, Z, N; data RAM addressed by the ALU output with write enable MW; mux MD selecting between the ALU result and memory data for write-back, gated by load enable LD.]
Evaluating a Simple Expression on a CPU

Step-by-step CPU activities (each step exercises the PC, RF, ALU, and RAM):
1. R1 <= M[R0]
2. R2 <= M[R0+1]
3. R3 <= R1 + R2
4. M[R0+2] <= R3

Source: adapted from Desh Singh's talk at the HCP 4 workshop
Unrolling the CPU Hardware

1. Replicate the CPU hardware — one copy per instruction, laid out in space:
- CPU1: R1 <= M[R0]
- CPU2: R2 <= M[R0+1]
- CPU3: R3 <= R1 + R2
- CPU4: M[R0+2] <= R3
Eliminating Unused Logic

1. Replicate the CPU hardware
2. The instruction is fixed -> remove the FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic

Each stage now keeps only its RF, ALU, and (where a load/store remains) RAM port.
A Special-Purpose Architecture

1. Replicate the CPU hardware
2. The instruction is fixed -> remove the FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up the registers and propagate values

The result is a fixed datapath: LW (R1) and LW (R2) feed an adder (R3), whose result drives SW. It can be realized with either an ASIC or an FPGA.
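The specialized datapath the five steps produce can be modeled in software. This is a behavioral sketch only — `mem` and `r0` are illustrative names standing in for the RAM and the base register:

```cpp
#include <cassert>

// Software model of the special-purpose datapath: two loads feed an
// adder whose result is stored back.  There is no fetch, decode, or
// scheduling -- the "program" is baked into the structure.
void specialized_unit(int *mem, int r0) {
  int r1 = mem[r0];       // LW: R1 <= M[R0]
  int r2 = mem[r0 + 1];   // LW: R2 <= M[R0+1]
  int r3 = r1 + r2;       // + : R3 <= R1 + R2
  mem[r0 + 2] = r3;       // SW: M[R0+2] <= R3
}
```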
FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism
- Silicon configurable to fit the algorithm
- Performance/watt advantage
What is an FPGA?

FPGA: Field-Programmable Gate Array — "an integrated circuit designed to be configured by a customer or a designer after manufacturing" (Wikipedia)

Components in an FPGA chip:
- Programmable logic blocks
- Programmable interconnects
- Programmable I/Os
Three Important Pieces

SRAM-based implementation is popular — a non-standard technology means being stuck on an older technology generation. The three pieces:
- Lookup table (LUT), formed by SRAM bits
- Pass transistor, controlled by an SRAM bit
- Multiplexer, controlled by SRAM bits
Multiplexer as a Universal Gate

Any function of k variables can be implemented with a 2^k:1 multiplexer.

[Example: the full-adder carry-out Cout(A, B, Cin) implemented with an 8:1 MUX — A, B, Cin drive the select lines S2, S1, S0, and the eight data inputs are tied to the truth-table values of Cout.]
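The mux-as-universal-gate idea can be sketched directly: load the function's truth-table column onto the data inputs and drive the selects with the variables. Here the function is the full-adder carry-out (the names `mux8` and `cout_via_mux` are illustrative):

```cpp
#include <cassert>

// An 8:1 mux: the three select bits form an index into the eight
// data inputs.
int mux8(const int d[8], int s2, int s1, int s0) {
  return d[(s2 << 2) | (s1 << 1) | s0];
}

// Full-adder carry-out via the mux: the data inputs hold the truth
// table of Cout = AB + ACin + BCin, indexed by (A, B, Cin).
int cout_via_mux(int a, int b, int cin) {
  static const int table[8] = {0, 0, 0, 1, 0, 1, 1, 1};
  return mux8(table, a, b, cin);
}
```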
How Many Functions?

How many distinct 2-input 1-output Boolean functions exist? What about k inputs?
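The count follows from the truth table: a k-input 1-output function is one column of 2^k bits, so there are 2^(2^k) distinct functions — 16 for k = 2. A one-line sketch (the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A k-input 1-output function is a truth-table column of 2^k bits,
// so there are 2^(2^k) distinct functions (valid here for k <= 5).
uint64_t num_boolean_functions(int k) {
  return uint64_t(1) << (uint64_t(1) << k);
}
```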
Look-Up Table (LUT)

A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic:
- 2^k SRAM bits
- Delay is independent of the logic function

[Figure: a 3-input LUT — eight SRAM cells feeding an 8:1 MUX selected by x2, x1, x0, producing output y.]
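A LUT read is just the mux-with-SRAM structure above: the inputs form an address into the configuration bits. A behavioral sketch of a 3-LUT (the name `lut_read` is illustrative; here the configuration 0xE8 happens to implement the majority function):

```cpp
#include <cassert>
#include <cstdint>

// A k-LUT is 2^k SRAM bits read out by a mux: the inputs (x2,x1,x0)
// form the address, the configuration word supplies the data bit.
int lut_read(uint32_t config, int x2, int x1, int x0) {
  int addr = (x2 << 2) | (x1 << 1) | x0;
  return (config >> addr) & 1;
}
```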
How Many LUTs?

How many 3-input LUTs are needed to implement a full adder (inputs A, B, Cin; outputs S, Cout)? What about using 4-input LUTs?
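One way to reason about the answer: a LUT has a single output, and the full adder has two outputs, each a function of the same three inputs — so two 3-LUTs suffice (and 4-LUTs don't help: still two outputs, still two LUTs). The two configurations can be checked in software (names and constants are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A 3-LUT read: address = (a, b, cin), data = configuration bits.
int lut3(uint32_t config, int a, int b, int cin) {
  return (config >> ((a << 2) | (b << 1) | cin)) & 1;
}

// Two LUT configurations implement the full adder:
const uint32_t SUM_CFG  = 0x96;  // 0b10010110: S    = A ^ B ^ Cin
const uint32_t COUT_CFG = 0xE8;  // 0b11101000: Cout = majority(A, B, Cin)
```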
A Logic Element

A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed. The LUT and FF combined form a logic element.
A Logic Block

A logic block clusters multiple logic elements:
- In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices connected through a crossbar switch
- Each slice contains four LUTs
- Two independent carry chains (CIN/COUT) per CLB for implementing adders
Traditional Homogeneous FPGA Architecture

[Figure: island-style array of logic blocks and switch blocks connected by routing tracks.]
Modern Heterogeneous Field-Programmable System-on-Chip
- Island-style configurable mesh routing
- Lots of dedicated components: memories, multipliers, I/Os, processors
- Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]
Dedicated DSP Blocks
- Built-in components for fast arithmetic operations, optimized for DSP applications
- Fixed logic and connections; functionality may be configured using control signals at run time
- Much faster than a LUT-based implementation (ASIC vs. LUT)

Xilinx XtremeDSP blocks:
- Starting with the Virtex-4 family, the DSP48 block was introduced for high-speed DSP on FPGAs
- Essentially a multiply-accumulate core with many other features
Example: Xilinx DSP48E Slice
- 25x18 signed multiplier
- 48-bit add/subtract/accumulate
- 48-bit logic operations
- SIMD operations (12/24-bit)
- Pipeline registers for high speed
[source: Xilinx Inc.]
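The core multiply-accumulate behavior can be sketched in software. This is a behavioral model only — it captures the signed multiply feeding a 48-bit accumulator (modeled by masking a 64-bit integer) and ignores the pipeline registers, pre-adder, and pattern-detect features:

```cpp
#include <cassert>
#include <cstdint>

// Behavioral sketch of the DSP48-style core operation: signed
// multiply into a 48-bit accumulator.  The mask models 48-bit
// wraparound; real operand widths (25x18) are not enforced here.
const int64_t MASK48 = (int64_t(1) << 48) - 1;

int64_t mac48(int64_t acc, int32_t a, int32_t b) {
  return (acc + int64_t(a) * int64_t(b)) & MASK48;
}
```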
Finite Impulse Response (FIR) Filter Mapped to DSP Slices

  y[n] = sum over i = 0..N of c[i] * x[n-i]

[Figure: input x(n) streaming through a chain of DSP slices, one per coefficient C0..C3, producing y(n).]
[source: Xilinx Inc.]
Hardened Floating-Point Units

Arria 10 FPGA and SoC variable-precision DSP block architecture
[Source: Altera Corp., 2014]
Dedicated Block RAMs (BRAMs)

Xilinx 18K/36K block RAMs:
- Configurations from 32K x 1 to 512 x 72 in one 36K block
- Simple dual-port and true dual-port configurations
- Built-in FIFO logic
- 64-bit error correction coding per 36K block

[Figure: dual-port block RAM with independent A and B ports — data in/out, parity, address, write enable, enable, and clock per port.] [source: Xilinx Inc.]
Additional Energy Savings from Specialization
- Specialized memory architecture: exploit regular memory access patterns to minimize energy per memory read/write
- Specialized communication architecture: exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
- Customized data types: exploit data range information to reduce bitwidth/precision and simplify arithmetic operations

Together, these techniques can yield substantially better energy efficiency than general-purpose processors.
Case Study: Convolution

The main computation in image/video processing is performed over overlapping stencils, termed convolution: a small kernel (here 3x3) slides over the input image frame, and each output pixel is the weighted sum of the input pixels under the window.

[Figure: input image frame, a 3x3 convolution kernel, and the resulting output image frame.]
Example Application: Edge Detection
- Identifies discontinuities in an image where brightness (image intensity) changes sharply
- Very useful for feature extraction in computer vision
- Sobel operator: G = (G_x, G_y)
[Figures: Pilho Kim, GaTech]
CPU Implementation of Convolution

  for (n = 0; n < height-2; n++)
    for (m = 0; m < width-2; m++)
      for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
          out[n][m] += img[n+i][m+j] * f[i][j];

The CPU accesses img through the memory hierarchy: CPU -> cache -> main memory.
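The four-loop kernel above can be wrapped into a self-contained, runnable sketch (the function name and the vector-of-vectors representation are illustrative choices, not the course's code):

```cpp
#include <cassert>
#include <vector>

// 3x3 convolution with valid-region output: out is (height-2) x
// (width-2), and out[n][m] accumulates the window anchored at (n, m).
std::vector<std::vector<int>>
convolve3x3(const std::vector<std::vector<int>> &img, const int f[3][3]) {
  int height = img.size(), width = img[0].size();
  std::vector<std::vector<int>> out(height - 2,
                                    std::vector<int>(width - 2, 0));
  for (int n = 0; n < height - 2; n++)
    for (int m = 0; m < width - 2; m++)
      for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
          out[n][m] += img[n + i][m + j] * f[i][j];
  return out;
}
```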
Cache for Convolution

A cache minimizes main memory accesses to improve performance.

[Figure: input picture, W pixels wide, with the cached rows and the convolution window highlighted.]

But a general-purpose cache is expensive in cost and incurs nontrivial energy overhead.
Customizing the Cache for Convolution (1)

Remove rows that are not in the neighborhood of the convolution window.

[Figure: only the rows overlapping the current window of the W-pixel-wide picture remain cached.]
Customizing the Cache for Convolution (2)
- Rearrange the cached rows as a 1D array of pixels
- Each time the window moves right, push the new pixel into the cache and retire the oldest one
- Remove the edge pixels that are not needed for computation

[Figure: a 1D pixel array with the old pixel leaving one end and the new pixel entering the other.]
A Customized Cache: the Line Buffer
- A line buffer is a fixed-width cache holding (K-1)*W + K pixels in flight — 2W+3 pixels for K = 3
- Fixed addressing: low area/power and high performance
- In a customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs
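The line buffer's behavior can be sketched as a shift structure: push one pixel per cycle, retire the oldest once (K-1)*W+K pixels are in flight, and read the KxK window at fixed offsets. This model uses a deque for clarity (the struct and method names are illustrative; hardware would use BRAM-backed shift registers):

```cpp
#include <cassert>
#include <deque>

// Behavioral model of a K=3 line buffer over a W-pixel-wide image:
// it holds the last 2W+3 pixels, and the 3x3 window is read at
// fixed offsets from the oldest pixel -- no address computation.
struct LineBuffer {
  int W;
  std::deque<int> buf;  // oldest pixel at the front
  explicit LineBuffer(int w) : W(w) {}
  void push(int pixel) {
    buf.push_back(pixel);             // new pixel enters
    if ((int)buf.size() > 2 * W + 3)
      buf.pop_front();                // oldest pixel retires
  }
  bool full() const { return (int)buf.size() == 2 * W + 3; }
  // Window element at row r, column c (r, c in 0..2).
  int window(int r, int c) const { return buf[r * W + c]; }
};
```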
Customized Memory Hierarchy for Convolution

Memory architecture customized for convolution:
- Processing window: flip-flops feeding the convolve unit
- Line buffers: on-chip SRAMs (BRAMs)
- Frame buffers (frames n-2, n-1, n): off-chip DDR

The input pixel stream flows through the line buffers into the processing window; the convolve unit produces the output pixel stream.
FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism: highly parallel and/or deeply pipelined designs to achieve maximum parallelism; distributed data/control dispatch
- Silicon configurable to fit the algorithm: compute the exact algorithm at the desired level of numerical accuracy; bit-level sizing and sub-cycle chaining; customized memory hierarchy
- Performance/watt advantage: low power consumption compared to CPUs and GPGPUs, thanks to low clock speed and specialized architecture blocks
Acknowledgements

These slides contain/adapt materials developed by Prof. Jason Cong (UCLA).