Course Overview Revisited

Size: px
Start display at page:

Download "Course Overview Revisited"

Transcription

1 Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for (int x = ; x < in.width(); x++) blur (x, y) = (in(x-, y) + in(x, y) + in(x+, y))/3; } Algorithm Parsing Transformations Scheduling Allocation RTL Generation Compiler Binding BitSel Unit BitSel Unit Architecture conv window conv window Adder Tree PE Output Buffer High-Level Design & Automation Programmable System-on-Chip 3

2 Understanding Energy Inefficiency of General-Purpose Processors L-I$ Typical Superscalar OoO Pipeline RAT Int RF Reservationstation LSQ + TLB D-cache ROB Fetch Decode Rename FP RF Schedule ALU Commit Branch Predictor Free list Register Read/write FPU Parameter Value Fetch/issue/retire width 4 # Integer ALUs 3 # FP ALUs 2 # ROB entries 96 # Reservation station entries 64 L I-cache 32 KB, 8-way set associative L D-cache 32 KB, 8-way set associative L2 cache 6 MB, 8-way set associative [source: Jason Cong, ISLPED 4 keynote] 5

3 Energy Breakdown of Pipeline Components L-I$ RAT Int RF Reservationstation LSQ + TLB D-cache ROB Fetch Decode Rename FP RF Schedule ALU Commit Branch Predictor Free list Register Read/write FPU Memory % Misc 23% FPU 8% Fetch unit 9% Rename 2% Scheduler % Decode 6% Register files 3% Control Mul/div 4% Int ALU 4% 6

4 Removing Non-Computing Portions Misc 23% Fetch unit 9% Decode 6% Memory % Mul/div 4% FPU 8% Int ALU 4% Rename 2% Scheduler % Register files 3% Computing portion: % (memory) + 26% (compute) = 36% 7

5 Energy Comparison of Processor ALUs and Dedicated Units Operation 32-bit add 32-bit multiply Processor ALU.22 2 GHz.2 2 GHz 45 nm TSMC standard cell library.2 GHz.7 GHz Why are processor units so expensive? ALU can perform multiple operations Add/sub/bitwise XOR/OR/AND 64-bit ALU Singleprecision FP operation.5 2GHz.8 5 MHz Dynamic/domino logic used to run at high frequency Higher power dissipation 8

6 A Simple Single-Cycle Microprocessor Adder PC DR SA SB IMM MB FS MD LD MW RAM RF LD SA SB DR D_in DataA DataB SE IMM MB ALU V C Z N M_address Data_in RAM MW MD

7 Evaluating an Simple Expression on CPU R <= M[R] P C RF ALU RAM Step-by-step CPU activities R2 <= M[R+] P C RF ALU RAM R3 <= R + R2 P C RF ALU RAM M[R+2] <= R3 P C RF ALU RAM Source: Adapted from Desh Singh s talk at HCP 4 workshop

8 Unrolling the CPU Hardware R <= M[R] P C RF ALU RAM CPU. Replicate the CPU hardware R2 <= M[R+] P C RF ALU CPU2 RAM R3 <= R + R2 P C RF ALU CPU3 RAM Space M[R+2] <= R3 P C RF ALU CPU4 RAM 2

9 Eliminating Unused Logic R <= M[R] RF ALU RAM. Replicate the CPU hardware 2. Instruction fixed -> Remove FETCH logic R2 <= M[R+] R3 <= R + R2 RF RF ALU ALU RAM Space 3. Remove unused ALU operations 4. Remove unused LOAD/STORE logic M[R+2] <= R3 RF ALU RAM 3

10 A Special-Purpose Architecture R <= M[R] R LW R. Replicate the CPU hardware 2. Instruction fixed -> Remove FETCH logic R2 <= M[R+] R3 <= R + R2 + LW R2 + Space 3. Remove unused ALU operations 4. Remove unused LOAD/STORE logic R3 5. Wire up registers and propagate values M[R+2] <= R3 + SW Can be realized with either ASIC or FPGA 4

11 FPGA as a Programmable Accelerator Massive amount of fine-grained parallelism Silicon configurable to fit algorithm Performance/watt advantage 5

12 What is an FPGA? FPGA: Field-Programmable Gate Array An integrated circuit designed to be configured by a customer or a designer after manufacturing (wikipedia) Components in an FPGA Chip Programmable logic blocks Programmable interconnects Programmable I/Os 6

13 Three Important Pieces SRAM-based implementation is popular Non-standard technology means older technology generation LUT Lookup table (LUT, formed by SRAM bits) Pass transistor (controlled by an SRAM bit) Multiplexer (controlled by SRAM bits) 7

14 Any function of k variables can be implemented with a 2 k : multiplexer 8 Multiplexer as a Universal Gate Cout S Cin B A Cout S Cin B A Cout S Cin B A Cout Sum Cin B A??? S2 8: MUX S S Cout????????

15 How Many Functions? How many distinct 2-input -output Boolean functions exist? What about K inputs? 9

16 Look-Up Table (LUT) A k-input LUT (k-lut) can be configured to implement any k- input -output combinational logic 2 k SRAM bits Delay is independent of logic function / / / / / / / MUX Y / x 2 x x A 3-input LUT 2

17 How Many LUTs? How many 3-input LUTs are needed to implement the following full adder? What about using 4-input LUTs? A B C in C out S 2

18 A Logic Element A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed The LUT and FF combined form a logic element LUT 22

19 A Logic Block A logic block clusters multiple logic elements COUT COUT With Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices Two independent carry chains per CLB for implementing adders Crossbar Switch SLICE SLICE Each slice contains four LUTs CIN CIN 23

20 Traditional Homogeneous FPGA Architecture Switch block Logic block Routing track 24

21 Modern Heterogeneous Field-Programmable System-on-Chip Island-style configurable mesh routing Lots of dedicated components Memories/multipliers, I/Os, processors Specialization leads to higher performance and lower power [Figure credit: embeddedrelated.com] 25

22 Dedicated DSP Blocks Built-in components for fast arithmetic operation optimized for DSP applications Fixed logic and connections, functionality may be configured using control signals at run time Much faster than LUT-based implementation (ASIC vs. LUT) Xilinx XtremeDSP blocks Starting with Virtex 4 family, DSP48 block is introduced for highspeed DSP on FPGAs Essentially a multiply-accumulate core with many other features 26

23 Example: Xilinx DSP48E Slice 25x8 signed multiplier 48-bit add/subtract/accumulate 48-bit logic operations SIMD operations (2/24 bit) Pipeline registers for high speed [source: Xilinx Inc.] 27

24 Finite Impulse Response (FIR) Filter Mapped to DSP Slices N i= y[n] = c i x[n i] x(n) 8 C C C2 C3 DSP Slice 38 y(n) [source: Xilinx Inc.]

25 Hardened Floating-Point Units Arria FPGA and SoC Variable-Precision DSP Block Architecture [Source: Altera Corp., 24] 29

26 Dedicated Block RAMs (BRAMs) Xilinx 8K/36K block RAMs 32k x to 52 x 72 in one 36K block Simple dual-port and true dual-port configurations Built-in FIFO logic 64-bit error correction coding per 36K block 8K/36K block RAM DIA DIPA ADDRA WEA ENA CLKA DIB DIPB ADDRB WEB ENB CLKB DOA DOPA DOB DOPB [source: Xilinx Inc.] 3

27 Additional Energy Savings from Specialization Specialized memory architecture Exploit regular memory access patterns to minimize energy per memory read/write Specialized communication architecture Exploit data movement patterns to optimize the structure/topology of on-chip interconnection network Customized data type Exploit data range information to reduce bitwidth/precision and simply arithmetic operations These techniques can lead to -X better energy efficiency over general-purpose processors 3

28 Case Study: Convolution The main computation of image/video processing is performed over overlapping stencils, termed as convolution Input image frame x3 Convolution Output image frame 32

29 Example Application: Edge Detection Identifies discontinuities in an image where brightness (or image intensity) changes sharply Very useful for feature extractions in computer vision Sobel operator G=(G X, G Y ) Figures: Pilho Kim, GaTech 33

30 CPU Implementation of Convolution for (n=; n<height-; n++) for (m=; m<width-; m++) for (i=; i<3; i++) for (j=; j<3; j++) out[n][m]+=img[n+i][m+j] * f[i][j]; CPU Cache Main Memory 34

31 Cache for Convolution Minimizes main memory accesses to improve performance W Input picture (W pixels wide) A general-purpose cache is expensive in cost and incurs nontrivial energy overhead 35

32 Customizing Cache for Convolution () Remove rows that are not in the neighborhood of the convolution window W 36

33 Customizing Cache for Convolution (2) Rearrange the rows as a D array of pixels Each time we move the window to right and push in the new pixel to the cache W Old Pixel W W Remove the edge pixels that are not needed for computation New Pixel 37

34 A Customized Cache : Line Buffer Line buffer: a fixed-width cache with (K-)*W+K pixels in flight Fixed addressing: Low area/power and high performance Old Pixel 2W+3 (with K=3) New Pixel In customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs 38

35 Customized Memory Hierarchy for Convolution Memory architecture customized for convolution Input pixel stream Flip- Flops Convolve Output pixel stream Processing window Line buffers Frame buffers On-chip SRAMs Frame n-2 Frame n- Frame n Off-chip DDR 39

36 FPGA as a Programmable Accelerator Massive amount of fine-grained parallelism Highly parallel and/or deeply pipelined to achieve maximum parallelism Distributed data/control dispatch Silicon configurable to fit algorithm: Compute the exact algorithm at the desired level of numerical accuracy Bit-level sizing and sub-cycle chaining Customized memory hierarchy Performance/watt advantage Low power consumption compared to CPU and GPGPUs Low clock speed Specialized architecture blocks 4

37 Acknowledgements These slides contain/adapt materials developed by Prof. Jason Cong (UCLA) 42

ECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing

ECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing ECE 5775 (Fall 7) High-Level Digital Design Automation Specialized Computing Announcements All students enrolled in CMS & Piazza Vivado HLS tutorial on Tuesday 8/29 Install an SSH client (mobaxterm or

More information

FPGA architecture and design technology

FPGA architecture and design technology CE 435 Embedded Systems Spring 2017 FPGA architecture and design technology Nikos Bellas Computer and Communications Engineering Department University of Thessaly 1 FPGA fabric A generic island-style FPGA

More information

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved Basic FPGA Architecture 2005 Xilinx, Inc. All Rights Reserved Objectives After completing this module, you will be able to: Identify the basic architectural resources of the Virtex -II FPGA List the differences

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011 FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

ECE 5775 High-Level Digital Design Automation Fall Fixed-Point Types Analysis of Algorithms

ECE 5775 High-Level Digital Design Automation Fall Fixed-Point Types Analysis of Algorithms ECE 5775 High-Level Digital Design Automation Fall 2018 Fixed-Point Types Analysis of Algorithms Announcements Lab 1 on CORDIC is released Due Monday 9/10 @11:59am Part-time PhD TA: Hanchen Jin (hj424)

More information

EECS150 - Digital Design Lecture 16 Memory 1

EECS150 - Digital Design Lecture 16 Memory 1 EECS150 - Digital Design Lecture 16 Memory 1 March 13, 2003 John Wawrzynek Spring 2003 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: Whenever a large collection of state elements is required. data &

More information

Field Programmable Gate Array (FPGA)

Field Programmable Gate Array (FPGA) Field Programmable Gate Array (FPGA) Lecturer: Krébesz, Tamas 1 FPGA in general Reprogrammable Si chip Invented in 1985 by Ross Freeman (Xilinx inc.) Combines the advantages of ASIC and uc-based systems

More information

ECE 2300 Digital Logic & Computer Organization. Caches

ECE 2300 Digital Logic & Computer Organization. Caches ECE 23 Digital Logic & Computer Organization Spring 217 s Lecture 2: 1 Announcements HW7 will be posted tonight Lab sessions resume next week Lecture 2: 2 Course Content Binary numbers and logic gates

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Introduction to FPGA design Rakesh Gangarajaiah Rakesh.gangarajaiah@eit.lth.se Slides from Chenxin Zhang and Steffan Malkowsky WWW.FPGA What is FPGA? Field

More information

Specializing Hardware for Image Processing

Specializing Hardware for Image Processing Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Reconfigurable Computing

Reconfigurable Computing Reconfigurable Computing FPGA Architecture Architecture should speak of its time and place, but yearn for timelessness. Frank Gehry Philip Leong (philip.leong@sydney.edu.au) School of Electrical and Information

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Programmable Logic. Any other approaches?

Programmable Logic. Any other approaches? Programmable Logic So far, have only talked about PALs (see 22V10 figure next page). What is the next step in the evolution of PLDs? More gates! How do we get more gates? We could put several PALs on one

More information

ECE 2300 Digital Logic & Computer Organization. More Single Cycle Microprocessor

ECE 2300 Digital Logic & Computer Organization. More Single Cycle Microprocessor ECE 23 Digital Logic & Computer Organization Spring 28 More Single Cycle Microprocessor Lecture 6: HW6 due tomorrow Announcements Prelim 2: Tues April 7, 7:3pm, Phillips Hall Coverage: Lectures 8~6 Inform

More information

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007 EE178 Lecture Module 2 Eric Crabill SJSU / Xilinx Fall 2007 Lecture #4 Agenda Survey of implementation technologies. Implementation Technologies Small scale and medium scale integration. Up to about 200

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities

More information

CDA 4253 FGPA System Design Xilinx FPGA Memories. Hao Zheng Comp Sci & Eng USF

CDA 4253 FGPA System Design Xilinx FPGA Memories. Hao Zheng Comp Sci & Eng USF CDA 4253 FGPA System Design Xilinx FPGA Memories Hao Zheng Comp Sci & Eng USF Xilinx 7-Series FPGA Architecture On-Chip block RAM On-Chip block RAM Distributed RAM by Logic Fabric Distributed RAM by Logic

More information

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today.

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today. Overview This set of notes introduces many of the features available in the FPGAs of today. The majority use SRAM based configuration cells, which allows fast reconfiguation. Allows new design ideas to

More information

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture FPGA Architecture Overview dr chris dick dsp chief architect wireless and signal processing group xilinx inc. Generic FPGA Architecture () Generic FPGA architecture consists of an array of logic tiles

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

Design Methodologies and Tools. Full-Custom Design

Design Methodologies and Tools. Full-Custom Design Design Methodologies and Tools Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores)

More information

Design Methodologies. Full-Custom Design

Design Methodologies. Full-Custom Design Design Methodologies Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores) Design

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design

More information

ECE 2300 Digital Logic & Computer Organization

ECE 2300 Digital Logic & Computer Organization ECE 2300 Digital Logic & Computer Organization Spring 201 Memories Lecture 14: 1 Announcements HW6 will be posted tonight Lab 4b next week: Debug your design before the in-lab exercise Lecture 14: 2 Review:

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Altera FLEX 8000 Block Diagram

Altera FLEX 8000 Block Diagram Altera FLEX 8000 Block Diagram Figure from Altera technical literature FLEX 8000 chip contains 26 162 LABs Each LAB contains 8 Logic Elements (LEs), so a chip contains 208 1296 LEs, totaling 2,500 16,000

More information

Lecture 41: Introduction to Reconfigurable Computing

Lecture 41: Introduction to Reconfigurable Computing inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Organic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design

Organic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design 1 Reconfigurable Computing Platforms 2 The Von Neumann Computer Principle In 1945, the

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

Implementation of DSP Algorithms

Implementation of DSP Algorithms Implementation of DSP Algorithms Main frame computers Dedicated (application specific) architectures Programmable digital signal processors voice band data modem speech codec 1 PDSP and General-Purpose

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -

More information

PACE: Power-Aware Computing Engines

PACE: Power-Aware Computing Engines PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious

More information

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information

The Virtex FPGA and Introduction to design techniques

The Virtex FPGA and Introduction to design techniques The Virtex FPGA and Introduction to design techniques SM098 Computation Structures Lecture 6 Simple Programmable Logic evices Programmable Array Logic (PAL) AN-OR arrays are common blocks in SPL and CPL

More information

Programmable Logic. Simple Programmable Logic Devices

Programmable Logic. Simple Programmable Logic Devices Programmable Logic SM098 Computation Structures - Programmable Logic Simple Programmable Logic evices Programmable Array Logic (PAL) AN-OR arrays are common blocks in SPL and CPL architectures Implements

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued) Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L

More information

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives

More information

DSP Resources. Main features: 1 adder-subtractor, 1 multiplier, 1 add/sub/logic ALU, 1 comparator, several pipeline stages

DSP Resources. Main features: 1 adder-subtractor, 1 multiplier, 1 add/sub/logic ALU, 1 comparator, several pipeline stages DSP Resources Specialized FPGA columns for complex arithmetic functionality DSP48 Tile: two DSP48 slices, interconnect Each DSP48 is a self-contained arithmeticlogical unit with add/sub/multiply/logic

More information

ASIC Design of Shared Vector Accelerators for Multicore Processors

ASIC Design of Shared Vector Accelerators for Multicore Processors 26 th International Symposium on Computer Architecture and High Performance Computing 2014 ASIC Design of Shared Vector Accelerators for Multicore Processors Spiridon F. Beldianu & Sotirios G. Ziavras

More information

FPGA Polyphase Filter Bank Study & Implementation

FPGA Polyphase Filter Bank Study & Implementation FPGA Polyphase Filter Bank Study & Implementation Raghu Rao Matthieu Tisserand Mike Severa Prof. John Villasenor Image Communications/. Electrical Engineering Dept. UCLA 1 Introduction This document describes

More information

Spiral 2-8. Cell Layout

Spiral 2-8. Cell Layout 2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric

More information

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

The DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK

The DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK The DSP Primer 8 FPGA Technology Return DSPprimer Home Return DSPprimer Notes August 2005, University of Strathclyde, Scotland, UK For Academic Use Only THIS SLIDE IS BLANK August 2005, For Academic Use

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

CS 152, Spring 2011 Section 8

CS 152, Spring 2011 Section 8 CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia

More information

ECE 645: Lecture 1. Basic Adders and Counters. Implementation of Adders in FPGAs

ECE 645: Lecture 1. Basic Adders and Counters. Implementation of Adders in FPGAs ECE 645: Lecture Basic Adders and Counters Implementation of Adders in FPGAs Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 5, Basic Addition and Counting,

More information

Understanding Peak Floating-Point Performance Claims

Understanding Peak Floating-Point Performance Claims white paper FPGA Understanding Peak ing-point Performance Claims Learn how to calculate and compare the peak floating-point capabilities of digital signal processors (DSPs), graphics processing units (GPUs),

More information

XT Node Architecture

XT Node Architecture XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core

More information

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes. Topics! SRAM-based FPGA fabrics:! Xilinx.! Altera. SRAM-based FPGAs! Program logic functions, using SRAM.! Advantages:! Re-programmable;! dynamically reconfigurable;! uses standard processes.! isadvantages:!

More information

Virtex-II Architecture

Virtex-II Architecture Virtex-II Architecture Block SelectRAM resource I/O Blocks (IOBs) edicated multipliers Programmable interconnect Configurable Logic Blocks (CLBs) Virtex -II architecture s core voltage operates at 1.5V

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

FPGA Implementation and Validation of the Asynchronous Array of simple Processors FPGA Implementation and Validation of the Asynchronous Array of simple Processors Jeremy W. Webb VLSI Computation Laboratory Department of ECE University of California, Davis One Shields Avenue Davis,

More information

discrete logic do not

discrete logic do not Welcome to my second year course on Digital Electronics. You will find that the slides are supported by notes embedded with the Powerpoint presentations. All my teaching materials are also available on

More information

L2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA

L2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA L2: FPGA HARDWARE 18-545: ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA 18-545: FALL 2014 2 Admin stuff Project Proposals happen on Monday Be prepared to give an in-class presentation Lab 1 is

More information

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

On the Capability and Achievable Performance of FPGAs for HPC Applications "On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies

More information

Superscalar Machines. Characteristics of superscalar processors

Superscalar Machines. Characteristics of superscalar processors Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance

More information

FABRICATION TECHNOLOGIES

FABRICATION TECHNOLOGIES FABRICATION TECHNOLOGIES DSP Processor Design Approaches Full custom Standard cell** higher performance lower energy (power) lower per-part cost Gate array* FPGA* Programmable DSP Programmable general

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Design and Implementation of a Super Scalar DLX based Microprocessor

Design and Implementation of a Super Scalar DLX based Microprocessor Design and Implementation of a Super Scalar DLX based Microprocessor 2 DLX Architecture As mentioned above, the Kishon is based on the original DLX as studies in (Hennessy & Patterson, 1996). By: Amnon

More information

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Lecture 26: Parallel Processing. Spring 2018 Jason Tang Lecture 26: Parallel Processing Spring 2018 Jason Tang 1 Topics Static multiple issue pipelines Dynamic multiple issue pipelines Hardware multithreading 2 Taxonomy of Parallel Architectures Flynn categories:

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

Basic Computer Architecture

Basic Computer Architecture Basic Computer Architecture CSCE 496/896: Embedded Systems Witawas Srisa-an Review of Computer Architecture Credit: Most of the slides are made by Prof. Wayne Wolf who is the author of the textbook. I

More information

04 - DSP Architecture and Microarchitecture

04 - DSP Architecture and Microarchitecture September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Topics. Midterm Finish Chapter 7

Topics. Midterm Finish Chapter 7 Lecture 9 Topics Midterm Finish Chapter 7 Xilinx FPGAs Chapter 7 Spartan 3E Architecture Source: Spartan-3E FPGA Family Datasheet CLB Configurable Logic Blocks Each CLB contains four slices Each slice

More information

A 1-GHz Configurable Processor Core MeP-h1

A 1-GHz Configurable Processor Core MeP-h1 A 1-GHz Configurable Processor Core MeP-h1 Takashi Miyamori, Takanori Tamai, and Masato Uchiyama SoC Research & Development Center, TOSHIBA Corporation Outline Background Pipeline Structure Bus Interface

More information

ΔΙΑΛΕΞΗ 2: FPGA Architectures

ΔΙΑΛΕΞΗ 2: FPGA Architectures ΗΜΥ 664 ΨΗΦΙΑΚΟΣ ΣΧΕΔΙΑΣΜΟΣ ΜΕ FPGAs Χειμερινό Εξάμηνο 2010 ΔΙΑΛΕΞΗ 2: FPGA Architectures ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ Λέκτορας ΗΜΜΥ (ttheocharides@ucy.ac.cy) Some slides adopted from Digital Integrated Circuits,

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

FPGA for Software Engineers

FPGA for Software Engineers FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

H100 Series FPGA Application Accelerators

H100 Series FPGA Application Accelerators 2 H100 Series FPGA Application Accelerators Products in the H100 Series PCI-X Mainstream IBM EBlade H101-PCIXM» HPC solution for optimal price/performance» PCI-X form factor» Single Xilinx Virtex 4 FPGA

More information

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 The Next Generation 65-nm FPGA Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006 Hot Chips, 2006 Structure of the talk 65nm technology going towards 32nm Virtex-5 family Improved I/O Benchmarking

More information

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013 Introduction to FPGA Design with Vivado High-Level Synthesis Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products.

More information

Workspace for '4-FPGA' Page 1 (row 1, column 1)

Workspace for '4-FPGA' Page 1 (row 1, column 1) Workspace for '4-FPGA' Page 1 (row 1, column 1) Workspace for '4-FPGA' Page 2 (row 2, column 1) Workspace for '4-FPGA' Page 3 (row 3, column 1) ECEN 449 Microprocessor System Design FPGAs and Reconfigurable

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Zynq Ultrascale+ Architecture

Zynq Ultrascale+ Architecture Zynq Ultrascale+ Architecture Stephanie Soldavini and Andrew Ramsey CMPE-550 Dec 2017 Soldavini, Ramsey (CMPE-550) Zynq Ultrascale+ Architecture Dec 2017 1 / 17 Agenda Heterogeneous Computing Zynq Ultrascale+

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information