Course Overview Revisited

Algorithm: a 3x3 blur filter, shown here blurring in the x dimension
    // Assumes `blur` is pre-allocated with the same dimensions as `in`.
    void blur_filter_3x3(Image &in, Image &blur) {
      // blur in the x dimension, skipping the left and right border columns
      for (int y = 0; y < in.height(); y++)
        for (int x = 1; x < in.width() - 1; x++)
          blur(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3;
    }
Compiler: parsing, transformations, scheduling, allocation, binding, RTL generation
Architecture: convolution windows, bit-select (BitSel) units, adder tree, PE, output buffer
[Figure: high-level design and automation flow from algorithm through compiler to architecture, targeting a programmable system-on-chip]

Understanding Energy Inefficiency of General-Purpose Processors
Typical superscalar out-of-order (OoO) pipeline
[Figure: pipeline stages Fetch, Decode, Rename, Schedule, Register read/write, ALU/FPU, Commit; structures: branch predictor, L1 I$, RAT, free list, reservation station, Int RF, FP RF, LSQ + TLB, D-cache, ROB]
Parameter                        Value
Fetch/issue/retire width         4
# Integer ALUs                   3
# FP ALUs                        2
# ROB entries                    96
# Reservation station entries    64
L1 I-cache                       32 KB, 8-way set associative
L1 D-cache                       32 KB, 8-way set associative
L2 cache                         6 MB, 8-way set associative
[source: Jason Cong, ISLPED 4 keynote]

Energy Breakdown of Pipeline Components
[Figure: the superscalar OoO pipeline annotated with the share of energy consumed by each component]
Component         Share of energy
Memory            10%
Misc              23%
FPU               8%
Fetch unit        9%
Rename            12%
Scheduler         11%
Decode            6%
Register files    3%
Mul/div           4%
Int ALU           14%

Removing Non-Computing Portions
Non-computing: Misc 23%, Fetch unit 9%, Decode 6%, Rename 12%, Scheduler 11%, Register files 3%
Computing: Memory 10%, Mul/div 4%, FPU 8%, Int ALU 14%
Computing portion: 10% (memory) + 26% (compute) = 36%

Energy Comparison of Processor ALUs and Dedicated Units
Operation                        Processor ALU      45 nm TSMC standard cell library
32-bit add                       .22 nJ @ 2 GHz     .2 nJ @ GHz
32-bit multiply                  .2 nJ @ 2 GHz      .7 nJ @ GHz
Single-precision FP operation    .5 nJ @ 2 GHz      .8 nJ @ 5 MHz
Why are processor units so expensive?
- The ALU can perform multiple operations: add/sub/bitwise XOR/OR/AND
- 64-bit ALU
- Dynamic/domino logic is used to run at high frequency, which means higher power dissipation

A Simple Single-Cycle Microprocessor
[Figure: single-cycle datapath with PC and adder, register file (RF), sign extension (SE), ALU with status flags V, C, Z, N, and data RAM; control fields DR, SA, SB, IMM, MB, FS, MD, LD, MW]

Evaluating a Simple Expression on CPU
Step-by-step CPU activities:
    R1 <= M[R0]
    R2 <= M[R0+1]
    R3 <= R1 + R2
    M[R0+2] <= R3
[Figure: each step exercises the PC, RF, ALU, and RAM of the same CPU]
Source: Adapted from Desh Singh's talk at HCP 4 workshop

Unrolling the CPU Hardware
1. Replicate the CPU hardware
[Figure: the four steps laid out in space, one CPU copy (PC, RF, ALU, RAM) per instruction: CPU1 runs R1 <= M[R0], CPU2 runs R2 <= M[R0+1], CPU3 runs R3 <= R1 + R2, CPU4 runs M[R0+2] <= R3]

Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
[Figure: the four steps laid out in space (R1 <= M[R0], R2 <= M[R0+1], R3 <= R1 + R2, M[R0+2] <= R3), each copy reduced to the RF, ALU, and RAM it still needs]

A Special-Purpose Architecture
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values
[Figure: resulting datapath for R1 <= M[R0], R2 <= M[R0+1], R3 <= R1 + R2, M[R0+2] <= R3: two load units (LW) with address adders feed R1 and R2, an adder produces R3, and a store unit (SW) writes the result]
Can be realized with either ASIC or FPGA
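
The outcome of the five steps can be sketched in C++ as a fixed dataflow function: two loads, one adder, and one store, with no fetch, decode, or general-purpose ALU. The names `add_pair` and `mem` are hypothetical, and the register numbering R0 to R3 is an assumed reading of the slide, so treat this as an illustration rather than the course's reference code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the specialized datapath. The four-instruction
// sequence is hard-wired, so nothing is fetched or decoded:
//   R1 <= M[R0];  R2 <= M[R0+1];  R3 <= R1 + R2;  M[R0+2] <= R3
void add_pair(std::vector<std::int32_t>& mem, std::size_t r0) {
    const std::int32_t r1 = mem[r0];      // LW: first load unit
    const std::int32_t r2 = mem[r0 + 1];  // LW: second load unit (address adder computes R0+1)
    const std::int32_t r3 = r1 + r2;      // "+": a dedicated adder instead of a full ALU
    mem[r0 + 2] = r3;                     // SW: store unit (address adder computes R0+2)
}
```

Each local variable stands for a wire or register in the figure; in hardware the two loads can proceed in parallel because they no longer share one CPU.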

FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism
- Silicon configurable to fit algorithm
- Performance/watt advantage

What is an FPGA?
FPGA: Field-Programmable Gate Array
- An integrated circuit designed to be configured by a customer or a designer after manufacturing (Wikipedia)
Components in an FPGA chip:
- Programmable logic blocks
- Programmable interconnects
- Programmable I/Os

Three Important Pieces
- Lookup table (LUT, formed by SRAM bits)
- Pass transistor (controlled by an SRAM bit)
- Multiplexer (controlled by SRAM bits)
SRAM-based implementation is popular
- Non-standard technology means older technology generation

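Multiplexer as a Universal Gate
Any function of k variables can be implemented with a 2^k:1 multiplexer
[Figure: full-adder truth table (inputs A, B, Cin; outputs Cout, Sum) next to an 8:1 MUX with select inputs S2, S1, S0 and data inputs 0 to 7 producing Cout; the eight data-input values are left to be filled in]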
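
As a sketch of the idea, the full adder's two outputs can each be produced by an 8:1 multiplexer whose data inputs are tied to the corresponding truth-table column and whose select lines are A, B, and Cin. The function and constant names (`mux8`, `carry_out`, `sum_bit`, `kCoutInputs`, `kSumInputs`) are hypothetical.

```cpp
#include <array>
#include <cstddef>

// 8:1 multiplexer: the three select bits choose one of eight data inputs.
bool mux8(const std::array<bool, 8>& d, bool s2, bool s1, bool s0) {
    const std::size_t index = (static_cast<std::size_t>(s2) << 2) |
                              (static_cast<std::size_t>(s1) << 1) |
                              static_cast<std::size_t>(s0);
    return d[index];
}

// Data inputs 0..7 are the truth-table column for select = (A, B, Cin).
constexpr std::array<bool, 8> kCoutInputs = {false, false, false, true,
                                             false, true,  true,  true};
constexpr std::array<bool, 8> kSumInputs = {false, true,  true,  false,
                                            true,  false, false, true};

// Full-adder outputs, each realized by one 8:1 mux with constant data inputs.
bool carry_out(bool a, bool b, bool cin) { return mux8(kCoutInputs, a, b, cin); }
bool sum_bit(bool a, bool b, bool cin) { return mux8(kSumInputs, a, b, cin); }
```

Changing the function only means changing the eight constants, which is exactly what makes the multiplexer a universal gate.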

How Many Functions?
- How many distinct 2-input 1-output Boolean functions exist?
- What about K inputs?
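
One way to reason about the count: a k-input truth table has 2^k rows, and each row's output bit can be chosen independently, giving 2^(2^k) functions. The helper below is a hypothetical sketch; returning 0 for k > 5 is an arbitrary choice made here because the result no longer fits in 64 bits.

```cpp
#include <cstdint>

// Number of distinct k-input, 1-output Boolean functions: 2^(2^k).
// The value fits in 64 bits only for k <= 5; larger k returns 0 in this sketch.
constexpr std::uint64_t num_boolean_functions(unsigned k) {
    return (k <= 5) ? (std::uint64_t{1} << (1u << k)) : 0;
}
```

For k = 2 this gives 16 functions, and for k = 3 it gives 256.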

Look-Up Table (LUT)
A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
- 2^k SRAM bits
- Delay is independent of logic function
[Figure: a 3-input LUT: eight SRAM bits (each 0/1) feed a MUX selected by inputs x2, x1, x0 to produce output Y]

How Many LUTs?
- How many 3-input LUTs are needed to implement the following full adder?
- What about using 4-input LUTs?
[Figure: full adder with inputs A, B, Cin and outputs Cout, S]
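
A minimal model, assuming single-output LUTs: each LUT produces one bit, so the full adder's two outputs take two 3-input LUTs, one configured as a three-way XOR (S) and one as a majority function (Cout). The type and constant names (`Lut3`, `kSumLut`, `kCoutLut`, `full_adder`) are hypothetical, and the configuration words assume input pattern (A, B, Cin) maps to bit index 4A + 2B + Cin.

```cpp
#include <cstdint>

// A 3-input LUT modeled as 2^3 = 8 configuration (SRAM) bits.
// Bit i of `config` is the output for input pattern i = (x2 x1 x0).
struct Lut3 {
    std::uint8_t config;
    bool eval(bool x2, bool x1, bool x0) const {
        const unsigned index = (static_cast<unsigned>(x2) << 2) |
                               (static_cast<unsigned>(x1) << 1) |
                               static_cast<unsigned>(x0);
        return ((config >> index) & 1u) != 0;
    }
};

constexpr Lut3 kSumLut{0x96};   // ones at patterns 1, 2, 4, 7: XOR of the three inputs
constexpr Lut3 kCoutLut{0xE8};  // ones at patterns 3, 5, 6, 7: majority of the three inputs

struct FullAdderOut {
    bool sum;
    bool cout;
};

// Full adder built from two 3-input LUTs that share the same inputs.
FullAdderOut full_adder(bool a, bool b, bool cin) {
    return {kSumLut.eval(a, b, cin), kCoutLut.eval(a, b, cin)};
}
```

Under the same single-output assumption, moving to 4-input LUTs does not reduce the count for one full adder, since there are still two outputs to produce.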

A Logic Element
- A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
- The LUT and FF combined form a logic element
[Figure: a LUT followed by a bypassable FF]

A Logic Block
A logic block clusters multiple logic elements
With Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
- Each slice contains four LUTs
- Two independent carry chains per CLB for implementing adders
[Figure: a CLB with a crossbar switch and two slices, each with CIN and COUT carry ports]

Traditional Homogeneous FPGA Architecture
[Figure: array of logic blocks connected through switch blocks and routing tracks]

Modern Heterogeneous Field-Programmable System-on-Chip
- Island-style configurable mesh routing
- Lots of dedicated components: memories/multipliers, I/Os, processors
- Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]

Dedicated DSP Blocks
Built-in components for fast arithmetic operations, optimized for DSP applications
- Fixed logic and connections; functionality may be configured using control signals at run time
- Much faster than a LUT-based implementation (ASIC vs. LUT)
Xilinx XtremeDSP blocks
- Starting with the Virtex-4 family, the DSP48 block was introduced for high-speed DSP on FPGAs
- Essentially a multiply-accumulate core with many other features

Example: Xilinx DSP48E Slice
- 25x18 signed multiplier
- 48-bit add/subtract/accumulate
- 48-bit logic operations
- SIMD operations (12/24 bit)
- Pipeline registers for high speed
[source: Xilinx Inc.]
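
To make the operand widths concrete, here is a small software model of one multiply-accumulate step, assuming a 25-bit by 18-bit signed multiply feeding a 48-bit accumulator that wraps on overflow. The function names `sign_extend` and `mac` are hypothetical, and the model ignores the slice's pipeline registers, pre-adder, and SIMD modes.

```cpp
#include <cstdint>

// Sign-extend the low `bits` bits of v (two's complement), for 1 <= bits <= 62.
constexpr std::int64_t sign_extend(std::int64_t v, unsigned bits) {
    const std::int64_t sign = std::int64_t{1} << (bits - 1);
    const std::int64_t mask = (std::int64_t{1} << bits) - 1;
    return ((v & mask) ^ sign) - sign;
}

// One multiply-accumulate step: acc + a * b, with a limited to 25 bits,
// b limited to 18 bits, and the result wrapped to 48 bits.
std::int64_t mac(std::int64_t acc, std::int64_t a, std::int64_t b) {
    const std::int64_t a25 = sign_extend(a, 25);  // multiplier operand A
    const std::int64_t b18 = sign_extend(b, 18);  // multiplier operand B
    return sign_extend(acc + a25 * b18, 48);      // 48-bit accumulator
}
```

For example, `mac(0, 1, 131072)` yields -131072 in this model, because 131072 = 2^17 does not fit in an 18-bit signed operand and wraps to the most negative value.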

Finite Impulse Response (FIR) Filter Mapped to DSP Slices
$y[n] = \sum_{i=0}^{N} c_i \, x[n-i]$
[Figure: input x(n) passes through a chain of DSP slices holding coefficients C0, C1, C2, C3 to produce y(n); source: Xilinx Inc.]
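
The FIR sum can be written directly as a software reference, one multiply-accumulate per tap, which is the operation each DSP slice in the chain performs. The function name `fir` is hypothetical, the coefficient vector `c` holds the N+1 taps, and treating samples before the start of the input as zero is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR filter: y[n] = sum over i of c[i] * x[n - i].
// Samples with n - i < 0 are treated as zero.
std::vector<int> fir(const std::vector<int>& c, const std::vector<int>& x) {
    std::vector<int> y(x.size(), 0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t i = 0; i < c.size() && i <= n; ++i) {
            y[n] += c[i] * x[n - i];  // one multiply-accumulate per tap
        }
    }
    return y;
}
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is a quick sanity check for a hardware mapping as well.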

Hardened Floating-Point Units
[Figure: Arria 10 FPGA and SoC variable-precision DSP block architecture. Source: Altera Corp., 24]

Dedicated Block RAMs (BRAMs)
Xilinx 18K/36K block RAMs
- 32K x 1 to 512 x 72 in one 36K block
- Simple dual-port and true dual-port configurations
- Built-in FIFO logic
- 64-bit error correction coding per 36K block
[Figure: 18K/36K block RAM with port A signals DIA, DIPA, ADDRA, WEA, ENA, CLKA, DOA, DOPA and port B signals DIB, DIPB, ADDRB, WEB, ENB, CLKB, DOB, DOPB; source: Xilinx Inc.]

Additional Energy Savings from Specialization
- Specialized memory architecture: exploit regular memory access patterns to minimize energy per memory read/write
- Specialized communication architecture: exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
- Customized data type: exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
These techniques can lead to -X better energy efficiency over general-purpose processors

Case Study: Convolution
The main computation of image/video processing is performed over overlapping stencils, termed convolution
[Figure: a 3x3 convolution kernel slides over the input image frame to produce the output image frame]

Example Application: Edge Detection
- Identifies discontinuities in an image where brightness (or image intensity) changes sharply
- Very useful for feature extraction in computer vision
Sobel operator G = (G_X, G_Y)
Figures: Pilho Kim, GaTech
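
For reference, a minimal sketch of the Sobel operator at one pixel. The slide does not list the kernel values, so the standard 3x3 Sobel kernels are assumed here, and the gradient magnitude is approximated as |G_X| + |G_Y|, a common simplification. The names `Image`, `kGx`, `kGy`, and `sobel_at` are hypothetical.

```cpp
#include <cstdlib>
#include <vector>

using Image = std::vector<std::vector<int>>;  // img[row][col]

// Standard 3x3 Sobel kernels for the horizontal (Gx) and vertical (Gy) gradients.
constexpr int kGx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
constexpr int kGy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

// Approximate gradient magnitude |Gx| + |Gy| at an interior pixel (row, col).
// The caller must keep (row, col) at least one pixel away from the border.
int sobel_at(const Image& img, int row, int col) {
    int gx = 0;
    int gy = 0;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            const int p = img[row + i - 1][col + j - 1];
            gx += p * kGx[i][j];
            gy += p * kGy[i][j];
        }
    }
    return std::abs(gx) + std::abs(gy);
}
```

A flat region gives 0, while a vertical brightness step gives a large value, which is what marks an edge.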

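CPU Implementation of Convolution
    // 3x3 convolution centered on each interior pixel (n, m);
    // out[][] is assumed to be zero-initialized.
    for (n = 1; n < height-1; n++)
      for (m = 1; m < width-1; m++)
        for (i = 0; i < 3; i++)
          for (j = 0; j < 3; j++)
            out[n][m] += img[n+i-1][m+j-1] * f[i][j];
[Figure: CPU, cache, main memory]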

Cache for Convolution
- Minimizes main memory accesses to improve performance
- A general-purpose cache is expensive in cost and incurs nontrivial energy overhead
[Figure: input picture, W pixels wide]

Customizing Cache for Convolution (1)
Remove rows that are not in the neighborhood of the convolution window
[Figure: the rows of the W-pixel-wide picture that the window touches]

Customizing Cache for Convolution (2)
- Rearrange the rows as a 1D array of pixels
- Each time the window moves to the right, push the new pixel into the cache
- Remove the edge pixels that are not needed for computation
[Figure: rows of W pixels laid end to end; an old pixel leaves as a new pixel enters]

A Customized Cache: Line Buffer
- Line buffer: a fixed-width cache with (K-1)*W+K pixels in flight
- Fixed addressing: low area/power and high performance
- In a customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs
[Figure: 2W+3 pixels (with K=3) held between the old pixel leaving and the new pixel entering]
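
A software sketch of the line buffer for K = 3, assuming pixels arrive in raster order. It keeps the most recent 2W+3 pixels, and window entry (i, j) always sits at offset i*W + j from the oldest pixel, which is the fixed addressing the slide refers to. The class name `LineBuffer3x3` is hypothetical, and `std::deque` stands in for what would be a shift register or BRAM in hardware.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical line buffer for a 3x3 window over a W-pixel-wide image
// streamed in raster order. Holds (K-1)*W + K = 2W + 3 pixels for K = 3.
class LineBuffer3x3 {
public:
    explicit LineBuffer3x3(std::size_t width)
        : width_(width), capacity_(2 * width + 3) {}

    // Push the new pixel in; once the buffer is full, the oldest pixel drops out.
    void push(int pixel) {
        buf_.push_back(pixel);
        if (buf_.size() > capacity_) {
            buf_.pop_front();
        }
    }

    // True once 2W + 3 pixels are held, so a complete window is available.
    bool full() const { return buf_.size() == capacity_; }

    // Window entry (i, j): i is the row and j the column inside the window.
    // Only valid when full() is true and i, j are in 0..2.
    int at(std::size_t i, std::size_t j) const { return buf_[i * width_ + j]; }

private:
    std::size_t width_;
    std::size_t capacity_;
    std::deque<int> buf_;
};
```

After the pixel at image position (r, c) is pushed, the window covers rows r-2 to r and columns c-2 to c, so a caller would skip positions with c < 2, where the window wraps across a row boundary.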

Customized Memory Hierarchy for Convolution
Memory architecture customized for convolution
[Figure: input pixel stream -> Convolve -> output pixel stream; the processing window is held in flip-flops, line buffers in on-chip SRAMs, and frame buffers (frames n-2, n-1, n) in off-chip DDR]

FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism
  - Highly parallel and/or deeply pipelined to achieve maximum parallelism
  - Distributed data/control dispatch
- Silicon configurable to fit algorithm
  - Compute the exact algorithm at the desired level of numerical accuracy
  - Bit-level sizing and sub-cycle chaining
  - Customized memory hierarchy
- Performance/watt advantage
  - Low power consumption compared to CPU and GPGPUs
  - Low clock speed
  - Specialized architecture blocks

Acknowledgements
These slides contain/adapt materials developed by Prof. Jason Cong (UCLA)