CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns

Similar documents
CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.5. Krste Asanovic UC Berkeley Fall 2009

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.4. Krste Asanovic UC Berkeley Fall 2009

CS294-48: Hardware Design Patterns Berkeley Hardware Pattern Language Version 0.3. Krste Asanovic UC Berkeley Fall 2009

CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

L2: Design Representations

CS250 VLSI Systems Design Lecture 9: Memory

Unit 2: High-Level Synthesis

EECS150 - Digital Design Lecture 17 Memory 2

CS 250 VLSI Design Lecture 11 Design Verification

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

EECS150 - Digital Design Lecture 09 - Parallelism

HIGH-LEVEL SYNTHESIS

ECE 485/585 Microprocessor System Design

Final Lecture. A few minutes to wrap up and add some perspective

University of California at Berkeley College of Engineering Department of Electrical Engineering and Computer Sciences. Spring 2010 May 10, 2010

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

EECS150 - Digital Design Lecture 11 SRAM (II), Caches. Announcements

EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC

CS 152 Computer Architecture and Engineering Lecture 1 Single Cycle Design

EE178 Spring 2018 Lecture Module 4. Eric Crabill

Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

High-Level Synthesis (HLS)

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

ESE532: System-on-a-Chip Architecture. Today. Process. Message FIFO. Thread. Dataflow Process Model Motivation Issues Abstraction Recommended Approach

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)

Chapter 4. The Processor

1-D Time-Domain Convolution. for (i=0; i < outputsize; i++) { y[i] = 0; for (j=0; j < kernelsize; j++) { y[i] += x[i - j] * h[j]; } }

Outline. EECS Components and Design Techniques for Digital Systems. Lec 11 Putting it all together Where are we now?

FPGAs & Multi-FPGA Systems. FPGA Abstract Model. Logic cells imbedded in a general routing structure. Logic cells usually contain:

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

EECS Components and Design Techniques for Digital Systems. Lec 07 PLAs and FSMs 9/ Big Idea: boolean functions <> gates.

Two HDLs used today VHDL. Why VHDL? Introduction to Structured VLSI Design

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Lecture #1: Introduction

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

EECS150 - Digital Design Lecture 20 - Finite State Machines Revisited

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

CS 3640: Introduction to Networks and Their Applications

PC I/O. May 7, Howard Huang 1

Lecture 11 Cache. Peng Liu.

ESE532: System-on-a-Chip Architecture. Today. Message. Real Time. Real-Time Tasks. Real-Time Guarantees. Real Time Demands Challenges

Page 1. Multilevel Memories (Improving performance using a little cash )

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

CS 152 Computer Architecture and Engineering

Physical Implementation

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2

EE282H: Computer Architecture and Organization. EE282H: Computer Architecture and Organization -- Course Overview

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

COSC 6385 Computer Architecture - Memory Hierarchies (II)

Hardware/Software Partitioning for SoCs. EECE Advanced Topics in VLSI Design Spring 2009 Brad Quinton

ECE 30 Introduction to Computer Engineering

Summer 2003 Lecture 21 07/15/03

Lecture 8: Synthesis, Implementation Constraints and High-Level Planning

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

CS 61C: Great Ideas in Computer Architecture Direct- Mapped Caches. Increasing distance from processor, decreasing speed.

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)


CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

Chapter 1. Introduction

CENG3420 Lecture 08: Memory Organization

CS250 VLSI Systems Design Lecture 3: Hardware Design Languages. Fall Krste Asanovic, John Wawrzynek with John Lazzaro and Brian Zimmer (TA)

Lecture 11: Packet forwarding

Lecture 41: Introduction to Reconfigurable Computing

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

ECE 4750 Computer Architecture, Fall 2017 Lab 1: Iterative Integer Multiplier

Final Exam Fall 2008

ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems

CENG 3420 Computer Organization and Design. Lecture 08: Cache Review. Bei Yu

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

VHDL. VHDL History. Why VHDL? Introduction to Structured VLSI Design. Very High Speed Integrated Circuit (VHSIC) Hardware Description Language

Lecture 13 - VLIW Machines and Statically Scheduled ILP

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

Packet Switching - Asynchronous Transfer Mode. Introduction. Areas for Discussion. 3.3 Cell Switching (ATM) ATM - Introduction

A Streaming Multi-Threaded Model

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Interfacing a PS/2 Keyboard

COE 561 Digital System Design & Synthesis Introduction

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

CS 152 Computer Architecture and Engineering

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

CS519: Computer Networks. Lecture 1 (part 2): Jan 28, 2004 Intro to Computer Networking

CSEE W4824 Computer Architecture Fall 2012

CS519: Computer Networks

CS 2506 Computer Organization II Test 2

EE 4683/5683: COMPUTER ARCHITECTURE

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Storage Systems. Storage Systems

Embedded System Design and Modeling EE382V, Fall 2008

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

UNIT I (Two Marks Questions & Answers)

4. Networks. in parallel computers. Advances in Computer Architecture

Introduction to CMOS VLSI Design (E158) Lecture 7: Synthesis and Floorplanning

Transcription:

CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns John Wawrzynek, Jonathan Bachrach, with Krste Asanovic, John Lazzaro and Rimas Avizienis (TA) UC Berkeley Fall 2012 Lecture 8, Hardware Design Patterns

A Difficult Design Problem? A humble shift register (For today s lecture, we ll assume clock distribution is not an issue) Lecture 8, Hardware Design Patterns 2

First Complication: Output Stall Shift register should only move data to right if output ready to accept next item Ready? What complication does this introduce? Need to fan out to enable signal on each flop Lecture 8, Hardware Design Patterns 3

Stall Fan-Out Example Ready? Enable! 200 bits per shift register stage, 16 stages! 3200 flip-flops! How many fanout-of-four gate delays to buffer up ready signal? Log 4 (3200) = 5.82, ~ 6 FO4 delays! This doesn t include any penalty for driving enable signal wires! Lecture 8, Hardware Design Patterns 4

Loops Prevent Arbitrary Resizing Shift Register Module Ready? Receiving Module Ready Logic We could increase size of gates in ready logic block to reduce fan out required to drive ready signal to flop enables But this increases load on flops, so they have to get bigger --- a vicious cycle! Lecture 8, Hardware Design Patterns 5

Second Complication: Input Bubbles Sender doesn t have valid data every clock cycle, so empty bubbles inserted into pipeline Valid? Stage 1 Stage 2 Stage 3 Stage 4 Ready? Stage 1 Stage 2 Stage 3!Valid Pipeline Diagram Want to squeeze bubble out of pipeline!valid Stage 4!Ready Time!Ready 6 Lecture 8, Hardware Design Patterns

Logic to Squeeze Bubbles Can move one stage to right if Ready asserted, or if there are any bubbles in stages to right of current stage (Assume same enable logic on every stage) Enable? Ready? Valid? Valid?! Fan-in of number of valid signals grows with number of stages! Fan-out of each stage s valid signal grows with number of stages! Longer combinational paths as number of pipeline stages grows Lecture 8, Hardware Design Patterns 7

A Common Design Problem Valid? Stage 1 Stage 2 Stage 3 Stage 4 Ready? The shift register is an abstraction of any synchronous pipelined block of logic that accepts input data and produces output data, where input and output might not be ready every clock cycle How to manage growth in control logic complexity? Lecture 8, Hardware Design Patterns 8

Solution: Decouple Units with FIFOs Processing Pipeline Consumer Pipeline only cares whether space in FIFO, not about whether consumer can take next value Breaks combinational path between pipeline control logic and consumer control logic For full throughput with decoupling, need at least two elements in FIFO With only one element, have to ping-pong between pipeline enqueue and consumer dequeue Allowing both enqueue and dequeue in same cycle to single-element FIFO destroys decoupling (back to a synchronous connection) Lecture 8, Hardware Design Patterns 9

Decoupled Design Discipline Many large digital designs are divided into local synchronous pipelines, or units, connected via decoupling FIFOs Approx. 10K-100K gates per unit Decoupled units may have different clocks In which case, need asynchronous FIFOs Lecture 8, Hardware Design Patterns 10

Hardware Design Patterns Decoupled units are an example of a design pattern Pattern: Solution to a commonly recurring design problem Idea of patterns and a pattern language first proposed for building architecture (Christopher Alexander) Pattern language is an interlocking set of design patterns Probably better named a pattern hierarchy Alexander proposed single pattern language covering architecture from design of cities to design of roof caps Patterns popular in software engineering ( Gang of Four ) and now being used in Par Lab ( Our Pattern Language (OPL) ) to architect parallel software This semester continues an experiment to see if we can teach hardware design using patterns Lecture 8, Hardware Design Patterns 11

Digital Design Through Patterns MP3 bit string Audio Application(s) Berkeley Hardware Pattern Language MP3 bit string Audio Hardware (RTL) Lecture 8, Hardware Design Patterns 12

BHPL Goals BHPL captures problem-solution pairs for creating hardware designs (machines) to execute applications BHPL Non-Goals Doesn t describe applications themselves, only machines that execute applications and strategies for mapping applications onto machines Lecture 8, Hardware Design Patterns 13

BHPL describes Machines not Applications Applications (including OPL patterns) Structural Patterns Iteration Model-View- Controller Map- Reduce Pipelines Event Based Layered Systems Agent&Repository Process Control Task Graphs Computational Patterns Circuits N-Body Methods Unstructured Grids Dense Linear Algebra Sparse Linear Algebra Graph Traversal Spectral Methods Structured Grids Dynamic Programming Graph Algorithms Graphical Models FSMs BHPL Mapping Patterns Machines Lecture 8, Hardware Design Patterns 14

Vocabulary for Machines Lecture 8, Hardware Design Patterns 15

Why a Vocabulary? Need a standard graphical and textual language to describe the problems and solutions in our pattern language Really just a consistent way of drawing and talking about block diagrams Lecture 8, Hardware Design Patterns 16

Machine Vocabulary Machines described using a hierarchical structural decomposition Units (generalized processing engines) Memories Networks (connect multiple entities) Channels (point-to-point connections) (Memories, Networks, Channels are specialized Units) Lecture 8, Hardware Design Patterns 17

Unit types Memories, Networks, and Channels are also units, just with specialized symbol to convey their main intended purpose. Memories store data. Networks connect multiple entities Channels are point-point communication paths High-level channels show primary direction of information flow Might have wires in other direction to handle flow-control etc. Lecture 8, Hardware Design Patterns 18

Hierarchy within Unit Output Port Input Port Ports shown on edge of unit Input/Output Port Lecture 8, Hardware Design Patterns 19

Hierarchy within Memory Lecture 8, Hardware Design Patterns 20

Hierarchy within Network Lecture 8, Hardware Design Patterns 21

Hierarchy within Network (2) Lecture 8, Hardware Design Patterns 22

Hierarchy within Channel Lecture 8, Hardware Design Patterns 23

Unit-Transaction Level (UTL) Our term for a high-level machine design, where system broken up into decoupled communicating units, and where we can understand system functionality as sequences of transactions performed by each unit i.e., can single-step entire system, one transaction at a time This higher level description admits various mappings into RTL with different cycle timings Lecture 8, Hardware Design Patterns 24

Refine UTL design to RTL All units are eventually mapped into digital hardware, i.e., ultimately describable as a finite-state machine (FSM) Different ways of factoring out the FSM description of a unit Structural decomposition into hierarchical sub-units Decompose functionality into control + datapath All factorings are equivalent, so pick factoring that best explains what unit does Lecture 8, Hardware Design Patterns 25

Unit with Synchronous Control When there is only a single state machine controlling operation of a unit Diamond represents unit-level controller, implied control connections to all other components (datapaths + memories) in unit Lecture 8, Hardware Design Patterns 26

Leaf-Level Hardware Register Combinational Logic Wires Memory Multiplexer/ALU Tristate driver FIFO Conventional schematic notation Lecture 8, Hardware Design Patterns 27

Example: Ethernet MAC for Xilinx [Chris Fletcher] Lecture 8, Hardware Design Patterns 28

Top-Level Lecture 8, Hardware Design Patterns 29

Multiple MACs Lecture 8, Hardware Design Patterns 30

One Level Down Lecture 8, Hardware Design Patterns 31

Low-Level Building Blocks Lecture 8, Hardware Design Patterns 32

Berkeley Hardware Pattern Language Lecture 8, Hardware Design Patterns 33

BHPL 0.5 Overview Applications (including OPL patterns) BHPL App-to-UTL Mappings Layer FFT to SIMD array Problem: Application Computation Solution: UTL Machine UTL-to-UTL Transformation Layer Time- Multiplexing Unrolling Problem: UTL violates constraint (too big, too slow) Solution: Transformed UTL UTL-to-RTL Transformation Layer Microcoded Engine In-Order Pipeline Engine Problem: UTL design Solution: RTL behavior RTL-to-Technology Interleaved Memory FIFO CAM Problem: RTL behavior Solution: Structural design Lecture 8, Hardware Design Patterns 34

Pattern Template (Name here) Problem: Describe the particular problem the pattern is meant to solve. Should include some context (small, high throughput), and also the layer of the pattern hierarchy where it fits. Solution: Describe the solution, which should be some hardware structure with a figure. Solution is usually the pattern name. Should not provide a family of widely varying solutions - these should be separate patterns, possibly grouped under a single more abstract parent pattern. Applicability: Longer discussion of where this particular solution would normally be used, or where it would not be used. Consequences: Issues that arise when using this pattern, but only for cases where it is appropriate to use (use Applicability to delineate cases where it is not appropriate to use). These might point at sub-problems for which there are sub-patterns. There might also be limitations on resulting functionality, or implications in design complexity, or CAD tool use etc. Lecture 8, Hardware Design Patterns 35

Decoupled Units Problem: Difficult to design a large unit with a single controller, especially when components have variable processing rates. Large controllers have long combinational paths. Solution: Break large unit into smaller sub-units where each sub-unit has a separate controller and all channels between sub-units have some form of decoupling (i.e., no assumption about timing of sub-units). Applicability: Large unit where area and performance overhead of decoupling is small compared to benefits of simpler design and shorter controller critical paths. Consequences: Decoupled channels generally have greater communication latency and area/power cost. Sub-unit controllers must cope with unknown arrival time of inputs and unknown time of availability of space on outputs. Sub-units must be synchronized explicitly. Lecture 8, Hardware Design Patterns 36

Decoupled Units Channels to Network network are usually decoupled in any case Shared Memory Unless shared memory is truly multiported, channels to memory must be decoupled Lecture 8, Hardware Design Patterns 37

Related Patterns Often multiple solutions to same problem, but which to pick depends on situation Problem statement and Applicability text should help select the correct pattern to use Example: Pipelined versus multi-cycle operators when delay through operator exceeds desired cycle time Lecture 8, Hardware Design Patterns 38

Pipelined Operator Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is required. Solution: Divide combinational function using pipeline registers such that logic in each stage has critical path below desired cycle time. Improve throughput by initiating new operation every clock cycle overlapped with propagation of earlier operations down pipeline. Applicability: Operators that require high throughput but where latency is not critical. Consequences: Latency of function increases due to propagation through pipeline registers, adds energy/op. Any associated controller might have to track execution of operation across multiple cycles. Lecture 8, Hardware Design Patterns 39

Pipelined Operator f(g(in)) Clock Clock g(in) f(in) Clock Clock Clock Lecture 8, Hardware Design Patterns 40

Multicycle Operator Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is not required. Solution: Hold input registers stable for multiple clock cycles of main system, and capture output after combinational function has settled. Applicability: Operators where high throughput is not required, or if latency is critical (in which case, replicate to increase throughput). Consequences: Associated controller has to track execution of operation across multiple cycles. CAD tools might detect false critical path in block. Lecture 8, Hardware Design Patterns 41

Multicycle Operator f(g(in)) Clock Clock f(g(in)) Clock/2 Clock/2 Lecture 8, Hardware Design Patterns 42

Some Families of Patterns Lecture 8, Hardware Design Patterns 43

Multiport Memory Patterns Provides multiple access ports to a common memory True Multiport Memory Banked Multiport Memory Interleave lesser-ported banks to provide higher bandwidth Cached Multiport Memory Use large single-port main memory, but add cache to service requests for each access port Lecture 8, Hardware Design Patterns 44

Controller Patterns Synchronous control of local datapath State Machine Controller control lines generated by state machine Microcoded Controller single-cycle datapath, control lines in ROM/RAM In-Order Pipeline Controller pipelined control, dynamic interaction between stages Out-of-Order Pipeline Controller operations within a control stream might be reordered internally Threaded Pipeline Controller multiple control streams one execution pipeline can be either in-order (PPU) or out-of-order Lecture 8, Hardware Design Patterns 45

Network Patterns Connects multiple units using shared resources Bus Low-cost, ordered Crossbar High-performance Multi-stage network Trade cost/performance Lecture 8, Hardware Design Patterns 46