L2: Design Representations

Similar documents
CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

Technology Dependent Logic Optimization Prof. Kurt Keutzer EECS University of California Berkeley, CA Thanks to S. Devadas

CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns

CS250 VLSI Systems Design Lecture 8: Introduction to Hardware Design Patterns

CSE241 VLSI Digital Circuits UC San Diego

ECE260B CSE241A Winter Logic Synthesis

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

VLSI CAD Flow: Logic Synthesis, Placement and Routing Lecture 5. Guest Lecture by Srini Devadas

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

VLSI Testing. Fault Simulation. Virendra Singh. Indian Institute of Science Bangalore

ECE 3060 VLSI and Advanced Digital Design

ECE260B CSE241A Winter Logic Synthesis

EE 466/586 VLSI Design. Partha Pande School of EECS Washington State University

ARM 64-bit Register File

VLSI Testing. Virendra Singh. Bangalore E0 286: Test & Verification of SoC Design Lecture - 7. Jan 27,

Lecture 17 Introduction to Memory Hierarchies" Why it s important " Fundamental lesson(s)" Suggested reading:" (HP Chapter

Lecture 1: Introduction Course arrangements Recap of basic digital design concepts EDA tool demonstration

problem maximum score 1 8pts 2 6pts 3 10pts 4 15pts 5 12pts 6 10pts 7 24pts 8 16pts 9 19pts Total 120pts

VLSI Test Technology and Reliability (ET4076)

Don t expect to be able to write and debug your code during the lab session.

EECS150 - Digital Design Lecture 5 - Verilog Logic Synthesis

ECEU530. Schedule. ECE U530 Digital Hardware Synthesis. Datapath for the Calculator (HW 5) HW 5 Datapath Entity

EECS150 - Digital Design Lecture 17 Memory 2

EECS150, Fall 2004, Midterm 1, Prof. Culler. Problem 1 (15 points) 1.a. Circle the gate-level circuits that DO NOT implement a Boolean AND function.

Lecture #1: Introduction

ECE 565: VLSI CAD Flow + Brief Intro to HLS, Logic Optimization & Technology Mapping. Shantanu Dutt ECE Dept., UIC

Synthesizable Verilog

(ii) Simplify and implement the following SOP function using NOR gates:

1/28/2013. Synthesis. The Y-diagram Revisited. Structural Behavioral. More abstract designs Physical. CAD for VLSI 2

problem maximum score 1 10pts 2 8pts 3 10pts 4 12pts 5 7pts 6 7pts 7 7pts 8 17pts 9 22pts total 100pts

Architecture as Interface

Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus

Digital VLSI Design with Verilog

Testing & Verification of Digital Circuits ECE/CS 5745/6745. Hardware Verification using Symbolic Computation

Graphics: Alexandra Nolte, Gesine Marwedel, Universität Dortmund. RTL Synthesis

Computer Architecture (TT 2012)

What is Verilog HDL? Lecture 1: Verilog HDL Introduction. Basic Design Methodology. What is VHDL? Requirements

CS 250 VLSI Design Lecture 11 Design Verification

Verilog Fundamentals. Shubham Singh. Junior Undergrad. Electrical Engineering

ECE 4514 Digital Design II. Spring Lecture 20: Timing Analysis and Timed Simulation

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

MODELING LANGUAGES AND ABSTRACT MODELS. Giovanni De Micheli Stanford University. Chapter 3 in book, please read it.

Topics. Midterm Finish Chapter 7

Introduction to Field Programmable Gate Arrays

VHDL for Synthesis. Course Description. Course Duration. Goals

ECE 5745 Complex Digital ASIC Design Topic 12: Synthesis Algorithms

Synthesis at different abstraction levels

Digital Design Methodology

Verilog. What is Verilog? VHDL vs. Verilog. Hardware description language: Two major languages. Many EDA tools support HDL-based design

Electronic Design Automation Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Programming with HDLs

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Sciences

Verilog for High Performance

Chapter 1 Overview of Digital Systems Design

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

structure syntax different levels of abstraction

Here is a list of lecture objectives. They are provided for you to reflect on what you are supposed to learn, rather than an introduction to this

CS250 VLSI Systems Design Lecture 9: Memory

COE 561 Digital System Design & Synthesis Introduction

3. The high voltage level of a digital signal in positive logic is : a) 1 b) 0 c) either 1 or 0

Top-Down Transaction-Level Design with TL-Verilog

CS8803: Advanced Digital Design for Embedded Hardware

Digital Design Methodology (Revisited) Design Methodology: Big Picture

EECS150 - Digital Design Lecture 10 Logic Synthesis

(Refer Slide Time: 00:01:53)

EECS150 - Digital Design Lecture 10 Logic Synthesis

ECE 595Z Digital Systems Design Automation

Verilog Tutorial. Verilog Fundamentals. Originally designers used manual translation + bread boards for verification

Digital Design with FPGAs. By Neeraj Kulkarni

Verilog Tutorial 9/28/2015. Verilog Fundamentals. Originally designers used manual translation + bread boards for verification

Do we need more chips (ASICs)?

Overview of Digital Design with Verilog HDL 1

St.MARTIN S ENGINEERING COLLEGE Dhulapally, Secunderabad

INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad ELECTRONICS AND COMMUNICATIONS ENGINEERING

CPU Pipelining Issues

Lecture 3: Modeling in VHDL. EE 3610 Digital Systems

Advanced VLSI Design Prof. Virendra K. Singh Department of Electrical Engineering Indian Institute of Technology Bombay

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

CS429: Computer Organization and Architecture

HIGH-LEVEL SYNTHESIS

Memory and Programmable Logic

MLR Institute of Technology

TECHNOLOGY MAPPING FOR THE ATMEL FPGA CIRCUITS

CAD for VLSI Design - I. Lecture 21 V. Kamakoti and Shankar Balachandran

Reference. Wayne Wolf, FPGA-Based System Design Pearson Education, N Krishna Prakash,, Amrita School of Engineering

CS61C : Machine Structures

Lecture 8: Synthesis, Implementation Constraints and High-Level Planning

Question?! Processor comparison!

Lab 3 Verilog Simulation Mapping

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis

Picture of memory. Word FFFFFFFD FFFFFFFE FFFFFFFF

Overview. Design flow. Principles of logic synthesis. Logic Synthesis with the common tools. Conclusions

RIZALAFANDE CHE ISMAIL TKT. 3, BLOK A, PPK MIKRO-e KOMPLEKS PENGAJIAN KUKUM. SYNTHESIS OF COMBINATIONAL LOGIC (Chapter 8)

Lecture 15: System Modeling and Verilog

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Unit 2: High-Level Synthesis

Lecture 41: Introduction to Reconfigurable Computing

ESE 150 Lab 07: Digital Logic

Music. Numbers correspond to course weeks EULA ESE150 Spring click OK Based on slides DeHon 1. !

EE178 Spring 2018 Lecture Module 1. Eric Crabill

Transcription:

CS250 VLSI Systems Design L2: Design Representations John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA)

Engineering Challenge Application Gap usually too large to bridge in one step, but there are exceptions... Physics 2

Magnetic Compass Application Physics 3

Design Abstraction Stack Application Unit-Transaction Level (UTL) Register-Transfer Level (RTL) Gates Circuits Devices (Transistors) Physics n Conduction Band Eg Valence Band oxi p n 4

Properties of a Useful Abstraction Hides less important details e.g., for RTL, don t worry how combinational logic is decomposed into logic gates Allows control of more important details e.g., RTL designer still controls how much logic is performed between any two registers If done right, provides portable efficiency i.e., same RTL can be implemented as custom logic, standard cells, FPGA, or even vacuum tube logic, with reasonably good results 5

CS250 Design Abstractions Primary Design Abstractions Interface to Technology Application Unit-Transaction Level (UTL) Register-Transfer Level (RTL) Gates Circuits Devices (Transistors) Physics (UCB EE141/241) (UCB EE130/230) 6

CS250 Design Refinement Application (C/C++) UTL (C/C++) RTL (Verilog) Gates (Stdcell Library) Architecture Design (Manual) Micro(µ)-Architecture Design (Manual) Synthesis + Place&Route (Automated) 7

Course Prerequisites B+ in CS150 for UCB undergrads, or equivalent for incoming grad students This means you should have seen RTL and Verilog/VHDL before We won t be covering Verilog coding details in lecture, but some coverage in section + handouts 8

RTL Representation Combinational Logic Combinational Logic Clock When writing Verilog, be sure to separate RTL code into pure state and pure logic 9

Application to RTL in One Step? Modern hardware systems have complex functionality (graphics chips, video encoders, wireless communication channels), but sometimes designers try to map directly to an RTL cycle-level µarchitecture in one step Requires detailed cycle-level design of each sub-unit Significant design effort required before clear if design will meet goals Interactions between units becomes unclear if arbitrary circuit connections allowed between units, with possible cycle-level timing dependencies Increases complexity of unit specifications Removes degrees of freedom for unit designers Reduces possible space for architecture exploration Difficult to document intended operation, therefore difficult to verify 10

Example Difficult Design Problem The humble shift register (For today s lecture, we ll assume clock distribution is not an issue) 11

First Complication: Output Stall Shift register should only move data to right if output ready to accept next item Ready What complication does this introduce? Need to fan out to enable signal on each flop 12

Stall Fan-Out Example Ready Enable 200 bits per shift register stage, 16 stages 3200 flip-flops How many fanout-of-four gate delays to buffer up ready signal? Log 4 (3200) = 5.82 This doesn t include any penalty for driving enable signal wiring! 13

Loops Prevent Arbitrary Resizing Shift Register Module Ready Receiving Module Ready Logic We could increase size of gates in ready logic block to reduce fan out required to drive ready signal to flop enables BUT, this increases load on flops, so they have to get bigger --- a vicious circle 14

Second Complication: Bubbles Sender doesn t have valid data every clock cycle, empty bubbles inserted into pipeline Valid Ready ~Valid Stage 1 Stage 2 Stage 3 Stage 4 Time ~Ready Would like to squeeze bubbles out of pipeline 15

Logic to Squeeze Bubbles Can move one stage to right if Ready asserted, or there is any bubble in stages to right of current stage Enable? Ready? Valid Valid? Fan-in of number of valid signals grows with number of pipeline stages Fan-out of each stage s valid signal also grows with number of pipeline stages Results in slow combinational paths as number of pipeline stages grows 16

Decoupled Design Discipline The shift register is a simple example that illustrates the control complexity problems of any large synchronous pipeline Usually, there are even more complex interactions between stages Combinational Logic Combinational Logic Clock To avoid these problems (and many others), designers will use a decoupled design discipline, where moderate size synchronous units (~10-100K gates) are connected by decoupling FIFOs or channels 17

Decoupled Architectures and Unit-Transaction Level Design 18

CS250 Design Refinement Application (C/C++) UTL (C/C++) RTL (Verilog) Gates (Stdcell Library) Architecture Design (Manual) µarchitecture Design (Manual) Synthesis + Place&Route (Automated) 19

Unit-Transaction Level Design Arch. State Unit 1 Arch. State Arch. State Unit 2 Unit 3 Shared Memory Unit Model design as messages flowing through FIFO buffers between units containing architectural state Each unit can independently perform an operation, or transaction, that may consume messages, update local state, and send further messages Transaction and/or communication might take many cycles Have to design RTL of unit microarchitecture during design refinement 20

Unit Architectural State Arch. State Architectural state is any state that is visible to an external agent i.e, architectural state can be observed by sending strings of packets into input queues and looking at values returned at outputs. High-level specification of a unit only refers to architectural state Detailed implementation of a unit may have additional microarchitectural state that is not visible externally Intra-transaction sequencing logic Pipeline registers 21

Queues Queues expose communication latency and decouple units execution Queues are point-to-point channels only No fanout, a unit must replicate messages on multiple queues Any buses must be hidden inside a unit in a UTL design Transactions can only pop head of input queues and push at most one element onto each output queue Avoids exposing size of buffers in queues Also avoids synchronization inherent in waiting for multiple elements 22

UTL Example: IP Routing Input packets Routing Trie Packet Output Queues IP Address in header x y z w Packet data Store longest prefix match of address header using trie structure Use next 8 bits of header to move down tree, or finish if subnet found Variable number of RAM lookups in table to find output port for next hop. 23 34.5.103.? Out_1 34.7.?.? Out_2 32.?.?.? Out_1

UTL Example: Packet Routing Table Replies Table Access Packet Input Lookup Table Packet Output Queues Transactions in decreasing scheduler priority Table_Write (request on table access queue) Writes a given 32-bit value to a given 12-bit address Table_Read (request on table access queue) Reads a 32-bit value given a 12-bit address, puts response on reply queue Packet_Process (request on packet input queue) Looks up header in route tree stored in table and places routed packet on correct output queue This level of detail is all the information we really need to understand what the unit is supposed to do! Everything else is implementation. 24

Refining Packet Routing to RTL Table Access Packet Input Completion Buffer Recirculation Pipeline Lookup RAM Table Replies Packet Output Queues The recirculation pipeline registers and the completion buffer are microarchitectural state that should be invisible to external units. Implementation must ensure atomicity of UTL transactions: Completion buffer ensures packets flow through unit in order Must also ensure table write doesn t appear to happen in middle of packet lookup, e.g., wait for pipeline to drain before performing write 25

UTL & Architecture-Level Verification Can easily develop a sequential golden model of a UTL description (pick a unit with a ready transaction and execute that sequentially) This is not straightforward if design does not obey UTL discipline Much more difficult if units not decoupled by point-to-point queues, or semantics of multiple operations depends on which other operations run concurrently Golden model is important component in verification strategy e.g., can generate random tests and compare candidate design s output against architectural golden model s output 26

Function->UTL->RTL->Gates Should develop initial functional model in C++ Refine to UTL architectural model, also in C++ verifiy UTL design against functional model Refine each unit in UTL model to RTL pipeline verify each RTL unit against UTL model of unit It will take some experience to determine when it is appropriate to break design into decoupled units versus a larger synchronous pipeline around 10K-100K gates is common threshold (this is a pretty big chunk of logic - a simple processor fits under this) 27

Design Template for Unit RTL Scheduler Arch. State 1 Arch. State 2 Scheduler (control logic) only fires transaction into pipelined datapath when it can complete without stalls Fire and forget model Avoids driving heavily loaded stall signals backwards from later pipe stages Each piece of architectural state (and outputs) only written in one stage of pipeline Reduces ports, simplifies WAW hazard detection/prevention between transactions Use bypassing logic to get read values earlier 28

RTL to Layout Flow HDL RTL Synthesis manual design Library/ module generators netlist logic optimization a b 0 1 s d clk q netlist a b 0 1 d q physical design s clk layout

Logic Synthesis Following slides by Srini Devadas, MIT

Logic optimization flow LOGIC EQUATIONS TECHNOLOGY-INDEPENDENT OPTIMIZATION Factoring Commonality Extraction TECH-DEPENDENT OPTIMIZATION (MAPPING, TIMING) LIBRARY OPTIMIZED LOGIC NETWORK

Logic optimization flow LOGIC EQUATIONS TECHNOLOGY-INDEPENDENT OPTIMIZATION Factoring Commonality Extraction TECH-DEPENDENT OPTIMIZATION (MAPPING, TIMING) LIBRARY OPTIMIZED LOGIC NETWORK

Why logic optimization? Transistor count reduction Circuit count reduction Gate count (fanout) reduction AREA POWER DELAY Area reduction, power reduction and delay reduction improves design

Boolean Optimizations Involves: Finding common subexpressions. Substituting one expression into another. Factoring single functions. F = F = f 1 f 2 Find common expressions = AB + AC + AD + AE + A BC D E = AB + AC + AD + AF + A BC D F f 1 = A( B + C + D + E) + ABC DE f 2 = A( B + C + D + F) + ABC DF Extract and substitute common expression G = g 1 = B + C + D f 1 = A( g 1 + E ) + A E g 1 f 2 = A( g 1 + F ) + A F g 1

Logic optimization flow LOGIC EQUATIONS TECHNOLOGY-INDEPENDENT OPTIMIZATION Factoring Commonality Extraction TECH-DEPENDENT OPTIMIZATION (MAPPING, TIMING) LIBRARY OPTIMIZED LOGIC NETWORK

Closed Book Technologies A standard cell technology or library is typically restricted to a few tens of types (NAND, NOR, NOT, AOI) of gate e.g., MSU library: 31 cells but may have many different sizes/skews of each A B A A C A A C AB+C B

Standard Cell Library For each cell Functional information Timing information Input slew Intrinsic delay Output capacitance Physical footprint Power characteristics

Gate Type Sample Library INVERTER (2) NAND2 (3) NAND3 (4) NAND4 (5) Gate Area Alternative mappings to NAND2 gates

Sample Library - 2 AOI21 (4) AOI22 (5)

Mapping via DAG * Covering Represent network in canonical form subject DAG Represent each library gate with canonical forms for the logic function primitive DAGs Each primitive DAG has a cost Goal: Find a minimum cost covering of the subject DAG by the primitive DAGs * Directed Acyclic Graph

Trivial Covering Reduce netlist into ND2 gates subject DAG 7 NAND2 = 21 5 INV = 10 31 (area cost)

Covering #1 2 INV = 4 2 NAND2 = 6 1 NAND3 = 4 1 NAND4 = 5 19 (area cost)

Covering #2 1 INV = 2 1 NAND2 = 3 2 NAND3 = 8 1 AOI21 = 4 17 (area cost)

Multiple fan-out

Partitioning a Graph Partition input netlist into a forest of trees Solve each tree optimally Stitch trees back together

Optimum Tree Covering INV 11 + 2 = 13 AOI21 4 + 3 = 7 NAND2 2 + 6 + 3 = 11 INV 2 NAND2 3 + 3 = 6 NAND2 3 NAND2 3

DAG Covering steps Partition DAG into a forest of trees Normalize the netlist Optimally cover each tree Generate all candidate matches Find optimal match using dynamic programming

Logic Synthesis Summary Logic optimization is an important step in the design flow Two-step flow Technology independent optimization Technology dependent optimization Advantages of logic optimization Reduce area Reduce power Reduce delay