ECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing


ECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing

Announcements
- All students enrolled in CMS & Piazza
- Vivado HLS tutorial on Tuesday 8/29
- Install an SSH client (MobaXterm or PuTTY): % ssh -X <netid>@ecelinux-.ece.cornell.edu
- Bring your laptop!

Agenda
- Motivation for hardware specialization
- Case study on convolution
- FPGA introduction: basic building blocks, basic homogeneous FPGA architecture, modern heterogeneous FPGA architecture

Tension between Efficiency and Flexibility
[Figure: energy efficiency (MOPS/mW) versus flexibility, from programmable processors to dedicated hardware]
Source: "High-Performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era," Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.

A Simple Single-Cycle Microprocessor
[Datapath figure: PC, register file (RF) with ports DR/SA/SB, immediate (IMM) with sign-extend (SE), MB operand mux, ALU with function select FS and flags V/C/Z/N, data RAM with write enable MW, writeback mux MD, register-file write enable LD]

Evaluating a Simple Expression on a CPU
Step-by-step CPU activities; each instruction exercises the entire PC / RF / ALU / RAM pipeline in turn:
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
Source: adapted from Desh Singh's talk at the HCP 4 workshop

Unrolling the CPU Hardware
1. Replicate the CPU hardware: one full CPU (PC, RF, ALU, RAM) per instruction, laid out in space:
CPU1: R1 <= M[R0]
CPU2: R2 <= M[R0+1]
CPU3: R3 <= R1 + R2
CPU4: M[R0+2] <= R3

Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction is fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
Each stage now keeps only the RF, ALU, and RAM it actually uses.

A Special-Purpose Architecture
1. Replicate the CPU hardware
2. Instruction is fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values
R1 <= M[R0]: LW -> R1
R2 <= M[R0+1]: LW -> R2
R3 <= R1 + R2: adder -> R3
M[R0+2] <= R3: SW
The resulting circuit can be realized with either an ASIC or an FPGA.
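The end result of steps 1-5 can be mimicked in software as straight-line code with no fetch, decode, or opcode selection. A minimal sketch, with function and variable names that are illustrative rather than from the slides:

```c
#include <stdint.h>

/* Hypothetical sketch: the four-instruction CPU sequence collapsed
 * into one fixed-function circuit. Two load units feed a single
 * dedicated adder whose output is hard-wired to the store port. */
static void specialized_add_store(int32_t *mem, uint32_t base) {
    int32_t r1 = mem[base];      /* LW unit 1: R1 <= M[base]     */
    int32_t r2 = mem[base + 1];  /* LW unit 2: R2 <= M[base+1]   */
    int32_t r3 = r1 + r2;        /* dedicated adder: R3 <= R1+R2 */
    mem[base + 2] = r3;          /* SW unit: M[base+2] <= R3     */
}
```

In hardware, the three statements become concurrent wired connections rather than sequential instructions, which is exactly why the fetch/decode machinery disappears.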

Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
[Figure: typical superscalar out-of-order pipeline: Fetch (L1-I$, branch predictor) -> Decode -> Rename (RAT, free list) -> Schedule (reservation stations) -> Register read/write (Int RF, FP RF) -> ALU/FPU, LSQ + TLB, D-cache -> ROB -> Commit]
Parameters:
- Fetch/issue/retire width: 4
- # Integer ALUs: 3
- # FP ALUs: 2
- # ROB entries: 96
- # Reservation station entries: 64
- L1 I-cache: 32 KB, 8-way set associative
- L1 D-cache: 32 KB, 8-way set associative
- L2 cache: 6 MB, 8-way set associative
[source: Jason Cong, ISLPED 14 keynote]

Energy Breakdown of Pipeline Components
[Figure: the same superscalar pipeline, annotated with per-component energy]
Misc 23%, FPU 8%, Fetch unit 9%, Rename 2%, Scheduler %, Decode 6%, Register files 3%, Mul/div 4%, Int ALU 4%, Memory 10%

Removing Non-Computing Portions
The fetch unit, decode, rename, scheduler, register files, and miscellaneous logic do no computation; only the memory and the arithmetic units actually compute.
Computing portion: 10% (memory) + 26% (compute) = 36%

Energy Comparison of Processor ALUs and Dedicated Units

Operation                     | Processor ALU   | 45 nm TSMC standard cell library
32-bit add                    | .22 nJ @ 2 GHz  | .2 nJ @ 1 GHz
32-bit multiply               | .2 nJ @ 2 GHz   | .7 nJ @ 1 GHz
Single-precision FP operation | .5 nJ @ 2 GHz   | .8 nJ @ 5 MHz

Why are processor units so expensive?
- The ALU can perform multiple operations (add/sub/bitwise XOR/OR/AND) and is 64 bits wide
- Dynamic/domino logic is used to reach high frequency, at the cost of higher power dissipation

Energy Breakdown with Standard-Cell ASICs
Reimplementing the arithmetic units with standard cells shrinks their share dramatically: ALU/FPU savings ~25%, leaving Int ALU .2%, FPU .4%, Mul/div .2% (the non-computing components and memory are unchanged).
Computing portion: 10% (memory) + ~1% (compute) = 11%
Only a marginal overall gain is attainable?

Additional Energy Savings from Specialization
- Specialized memory architecture: exploit regular memory access patterns to minimize energy per memory read/write
- Specialized communication architecture: exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
- Customized data types: exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
These techniques combined can lead to another order-of-magnitude (or more) energy-efficiency improvement over GPPs.
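As a toy illustration of datatype customization (the widths are chosen for illustration, not taken from the slides): if profiling shows the operands fit in 8 bits and the accumulator in 18 bits, hardware can instantiate an 8x8 multiplier and an 18-bit accumulator instead of full 32/64-bit units. A C model can only emulate the narrow register by masking:

```c
#include <stdint.h>

/* Toy model of a bitwidth-customized multiply-accumulate:
 * 8-bit operands, 18-bit accumulator (widths are illustrative). */
#define ACC_BITS 18u
#define ACC_MASK ((1u << ACC_BITS) - 1u)   /* 0x3FFFF */

static uint32_t mac18(uint32_t acc, uint8_t a, uint8_t b) {
    /* 8x8 -> 16-bit product; the sum wraps at 18 bits,
     * exactly as an 18-bit hardware register would. */
    return (acc + (uint32_t)a * (uint32_t)b) & ACC_MASK;
}
```

On an FPGA the narrower datapath directly reduces LUT/DSP usage and energy per operation; in HLS flows this is what arbitrary-precision integer types are for.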

Case Study: Memory Specialization for Convolution
The main computation of image/video processing is performed over overlapping stencils, termed convolution.
[Figure: input image frame -> 3x3 convolution kernel -> output image frame]

Example Application: Edge Detection
Identifies discontinuities in an image where brightness (or image intensity) changes sharply. Very useful for feature extraction in computer vision. Sobel operator: G = (Gx, Gy).
[Figures: Pilho Kim, GaTech]

CPU Implementation of Convolution

for (n = 0; n < height-1; n++)
  for (m = 0; m < width-1; m++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[n][m] += img[n+i][m+j] * f[i][j];

[Figure: CPU -> Cache -> Main Memory]

General-Purpose Cache for Convolution
A cache minimizes main memory accesses to improve performance.
[Figure: input picture, W pixels wide, with the cached rows highlighted]
But a general-purpose cache is expensive in cost and incurs nontrivial energy overhead.

Specializing Cache for Convolution (1)
Remove rows that are not in the neighborhood of the convolution window.
[Figure: only the rows overlapping the window, each W pixels wide, are kept]

Specializing Cache for Convolution (2)
Rearrange the rows as a 1D array of pixels. Each time we move the window to the right, we push the new pixel into the cache and remove the edge pixels that are no longer needed for computation.
[Figure: 1D pixel array made of width-W rows; the oldest pixel retires as the new pixel enters]

A Specialized Cache: Line Buffer
A line buffer is a fixed-width cache with (K-1)*W + K pixels in flight (2W+3 with K=3). Fixed addressing gives low area/power and high performance. In a customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs.
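A minimal software sketch of such a line buffer for K=3 (the width W, the function names, and the shift-register scheme are illustrative; a real FPGA implementation would use BRAM-backed shift registers rather than a memmove):

```c
#include <stdint.h>
#include <string.h>

#define W 8                         /* image width in pixels (example) */
#define K 3                         /* convolution window size         */
#define BUF_LEN ((K - 1) * W + K)   /* (K-1)*W + K = 2W + 3 pixels     */

static uint8_t line_buf[BUF_LEN];   /* line_buf[0] is the newest pixel */

/* Shift in one new pixel; the oldest pixel falls off the far end. */
static void lb_push(uint8_t pix) {
    memmove(line_buf + 1, line_buf, BUF_LEN - 1);
    line_buf[0] = pix;
}

/* Fixed-address window read: (i, j) with i = row (0 = top) and
 * j = column (0 = left) of the KxK window ending at the newest pixel. */
static uint8_t lb_window(int i, int j) {
    return line_buf[(K - 1 - i) * W + (K - 1 - j)];
}
```

Note that every window tap reads a fixed buffer position, which is exactly why no address computation or tag check is needed in hardware.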

What is an FPGA?
FPGA: Field-Programmable Gate Array. "An integrated circuit designed to be configured by a customer or a designer after manufacturing" (Wikipedia).
Components in an FPGA chip:
- Programmable logic blocks
- Programmable interconnects
- Programmable I/Os

Three Important Pieces
- Lookup table (LUT, formed by SRAM bits)
- Pass transistor (controlled by an SRAM bit)
- Multiplexer (controlled by SRAM bits)
SRAM-based implementation is popular, since a non-standard technology would mean lagging a technology generation behind.

Multiplexer as a Universal Gate
Any function of k variables can be implemented with a 2^k:1 multiplexer.
[Figure: full-adder truth table (inputs A, B, Cin; outputs Cout, Sum); Cout is produced by an 8:1 MUX whose select lines S2 S1 S0 are driven by A, B, Cin and whose data inputs 0-7 hold the Cout column of the truth table]

How Many Functions?
How many distinct 3-input, 1-output Boolean functions exist? What about k inputs?
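A k-input, 1-output function is fully determined by its 2^k-row truth table, and each row can independently be 0 or 1, giving 2^(2^k) distinct functions (256 for k = 3). A one-line check:

```c
#include <stdint.h>

/* Number of distinct k-input, 1-output Boolean functions: each of the
 * 2^k truth-table rows is independently 0 or 1, so 2^(2^k) in total. */
static uint64_t num_bool_funcs(unsigned k) {
    return 1ULL << (1ULL << k);   /* overflows uint64_t for k > 5 */
}
```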

Look-Up Table (LUT)
A k-input LUT (k-LUT) can be configured to implement any k-input, 1-output combinational logic function.
- 2^k SRAM bits
- Delay is independent of the logic function
[Figure: a 3-input LUT; eight SRAM cells (0/1) feed an 8:1 MUX selected by x2 x1 x0, producing output Y]
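A hypothetical software model of a 3-LUT (the struct and function names are illustrative): the 2^3 = 8 "SRAM" bits form a truth table and the inputs simply select one bit, so evaluation cost is the same whatever function is stored.

```c
#include <stdint.h>

/* Software model of a 3-input LUT: 8 SRAM bits selected by the
 * inputs acting as a MUX address, as in the figure. */
typedef struct { uint8_t sram[8]; } lut3;

static uint8_t lut3_eval(const lut3 *l, unsigned x2, unsigned x1, unsigned x0) {
    unsigned addr = (x2 << 2) | (x1 << 1) | x0;  /* MUX select lines */
    return l->sram[addr] & 1u;
}
```

For example, loading the SRAM with the majority function's truth table configures the LUT as a full adder's carry-out.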

How Many LUTs?
How many 3-input LUTs are needed to implement the following full adder (inputs A, B, Cin; outputs Cout, S)? How about using 4-input LUTs?

A Logic Element
A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed. The LUT and FF combined form a logic element.

A Logic Block
A logic block clusters multiple logic elements.
Example: in Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices, and each slice contains four LUTs. Two independent carry chains per CLB (CIN to COUT) support adder implementation, and a crossbar switch connects the slices to the routing.

Traditional Homogeneous FPGA Architecture
[Figure: island-style array of logic blocks, switch blocks, and routing tracks]

Modern Heterogeneous Field-Programmable System-on-Chip
- Island-style configurable mesh routing
- Lots of dedicated components: memories/multipliers, I/Os, processors
- Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]

Dedicated DSP Blocks
- Built-in components for fast arithmetic operations, optimized for DSP applications
- Essentially a multiply-accumulate core with many other features
- Fixed logic and connections; functionality may be configured using control signals at run time
- Much faster than a LUT-based implementation (ASIC vs. LUT)

Example: Xilinx DSP48E Slice
- 25x18 signed multiplier
- 48-bit add/subtract/accumulate
- 48-bit logic operations
- SIMD operations (12/24 bit)
- Pipeline registers for high speed
[source: Xilinx Inc.]

Finite Impulse Response (FIR) Filter Mapped to DSP Slices
y[n] = sum_{i=0}^{N} c_i * x[n-i]
[Figure: x(n) feeding a cascade of DSP slices with coefficients C0-C3, producing y(n)]
[source: Xilinx Inc.]
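A behavioral C sketch of the filter (the tap count and bit widths are illustrative): each product c[i] * x[n-i] corresponds to one DSP slice's multiplier, and the running sum rides the slices' cascaded wide accumulators.

```c
#include <stdint.h>

#define NTAPS 4   /* matches the four coefficients C0..C3 in the figure */

/* y[n] = sum over i of c[i] * x[n-i]; samples before the start of
 * the stream are treated as zero. Each tap maps onto one DSP slice. */
static int32_t fir(const int16_t c[NTAPS], const int16_t x[], int n) {
    int32_t acc = 0;             /* wide accumulator, as in the DSP48 */
    for (int i = 0; i < NTAPS; i++)
        if (n - i >= 0)
            acc += (int32_t)c[i] * (int32_t)x[n - i];
    return acc;
}
```

In the FPGA mapping this loop is fully unrolled in space: all NTAPS multiplies happen every cycle, with the additions chained through the DSP cascade.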

Dedicated Block RAMs (BRAMs)
Example: Xilinx 18K/36K block RAMs
- 32K x 1 to 512 x 72 in one 36K block
- Simple dual-port and true dual-port configurations
- Built-in FIFO logic
- 64-bit error correction coding per 36K block
[Figure: dual-port block RAM with per-port data in (DIA/DIB), parity in (DIPA/DIPB), address (ADDRA/ADDRB), write enable (WEA/WEB), enable (ENA/ENB), clock (CLKA/CLKB), data out (DOA/DOB), and parity out (DOPA/DOPB)]
[source: Xilinx Inc.]

Embedded FPGA System-on-Chip
Xilinx Zynq All Programmable System-on-Chip:
- Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz ~ 1 GHz
- Up to 350K logic cells
- 2MB Block RAM
- 900 DSP48s
[Source: Xilinx Inc.]

FPGA as an Accelerator for Cloud Computing
- Massive amount of fine-grained parallelism
- Silicon configurable to fit the application
- Performance/watt advantage over CPUs & GPUs
AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P (~2 million logic blocks, ~5 DSP blocks, ~3Mb block RAM)
[Figure source: David Pellerin, AWS]

Microsoft Deploying FPGAs in Datacenters
FPGAs are deployed in Microsoft datacenters to accelerate various web, database, and AI services. For example, Project Brainwave claimed ~40 Teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs.
[source: Microsoft, Brainwave, Hot Chips 2017]

Summary: FPGA as a Programmable Accelerator
- Massive amount of fine-grained parallelism: highly parallel and/or deeply pipelined designs for maximum parallelism; distributed data/control dispatch
- Silicon configurable to fit the algorithm: compute the exact algorithm at the desired level of numerical accuracy; bit-level sizing and sub-cycle chaining; customized memory hierarchy
- Performance/watt advantage: low power consumption compared to CPUs and GPGPUs; low clock speed; specialized architecture blocks

Next Class
Vivado HLS tutorial (led by the TAs)

Acknowledgements
These slides contain/adapt materials developed by Prof. Jason Cong (UCLA)