EECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007

Similar documents
EECS150 - Digital Design Lecture 23 - High-Level Design (Part 1)

Computer Organization

Computer Organization. Structure of a Computer. Registers. Register Transfer. Register Files. Memories

Block diagram view. Datapath = functional units + registers

Why Should I Learn This Language? VLSI HDL. Verilog-2

BUILDING BLOCKS OF A BASIC MICROPROCESSOR. Part 1 PowerPoint Format of Lecture 3 of Book

EECS150 - Digital Design Lecture 5 - Verilog Logic Synthesis

EE 3170 Microcontroller Applications

REGISTER TRANSFER LANGUAGE

Implementing an Instruction Set

CSE140L: Components and Design Techniques for Digital Systems Lab

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

MICROPROCESSOR B.Tech. th ECE

Outline. EECS Components and Design Techniques for Digital Systems. Lec 11 Putting it all together Where are we now?

One and a half hours. Section A is COMPULSORY UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE

VHDL: RTL Synthesis Basics. 1 of 59

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis

Homework deadline extended to next friday

CSE140L: Components and Design

Synthesis vs. Compilation Descriptions mapped to hardware Verilog design patterns for best synthesis. Spring 2007 Lec #8 -- HW Synthesis 1

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Sciences

Topics. Midterm Finish Chapter 7

Note: Closed book no notes or other material allowed, no calculators or other electronic devices.

Synthesis of Language Constructs. 5/10/04 & 5/13/04 Hardware Description Languages and Synthesis

Microcomputers. Outline. Number Systems and Digital Logic Review

For Example: P: LOAD 5 R0. The command given here is used to load a data 5 to the register R0.

UNIT-III REGISTER TRANSFER LANGUAGE AND DESIGN OF CONTROL UNIT

CSE 141L Computer Architecture Lab Fall Lecture 3

UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) Department of Electronics and Communication Engineering, VBIT

Verilog for High Performance

Register Transfer Level in Verilog: Part I

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital

Hardware Design I Chap. 10 Design of microprocessor

Topics. Midterm Finish Chapter 7

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1

CHAPTER 5 : Introduction to Intel 8085 Microprocessor Hardware BENG 2223 MICROPROCESSOR TECHNOLOGY

Verilog Tutorial. Introduction. T. A.: Hsueh-Yi Lin. 2008/3/12 VLSI Digital Signal Processing 2

ECE 2300 Digital Logic & Computer Organization. More Sequential Logic Verilog

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 16 Memory 1

A Brief Introduction to Verilog Hardware Definition Language (HDL)

Microprocessor Architecture. mywbut.com 1

6.1 Combinational Circuits. George Boole ( ) Claude Shannon ( )

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Design of Embedded DSP Processors

CSE140: Components and Design Techniques for Digital Systems

EECS150 - Digital Design Lecture 10 Logic Synthesis

CHAPTER 4: Register Transfer Language and Microoperations

EECS Components and Design Techniques for Digital Systems. Lec 07 PLAs and FSMs 9/ Big Idea: boolean functions <> gates.

EECS150 - Digital Design Lecture 16 - Memory

Lecture 2: Data Types, Modeling Combinational Logic in Verilog HDL. Variables and Logic Value Set. Data Types. Why use an HDL?

Ch 5: Designing a Single Cycle Datapath

EEL 4783: HDL in Digital System Design

CSE 140L Final Exam. Prof. Tajana Simunic Rosing. Spring 2008

Inf2C - Computer Systems Lecture Processor Design Single Cycle

FPGA architecture and design technology

Lecture 32: SystemVerilog

EECS150 - Digital Design Lecture 10 Logic Synthesis

Digital Design with FPGAs. By Neeraj Kulkarni

Announcements. Midterm 2 next Thursday, 6-7:30pm, 277 Cory Review session on Tuesday, 6-7:30pm, 277 Cory Homework 8 due next Tuesday Labs: project

COSC 243. Computer Architecture 1. COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1

CS222: Processor Design

The Need of Datapath or Register Transfer Logic. Number 1 Number 2 Number 3 Number 4. Numbers from 1 to million. Register

The functional block diagram of 8085A is shown in fig.4.1.

CSE 140L Final Exam. Prof. Tajana Simunic Rosing. Spring 2008

Contents. Chapter 9 Datapaths Page 1 of 28

EECS150 - Digital Design Lecture 4 - Verilog Introduction. Outline

Control and Datapath 8

REGISTER TRANSFER AND MICROOPERATIONS

Field Programmable Gate Array (FPGA)

Digital Circuit Design and Language. Datapath Design. Chang, Ik Joon Kyunghee University

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

Levels in Processor Design

EXPERIMENT NUMBER 11 REGISTERED ALU DESIGN

Lecture 15: System Modeling and Verilog

VHDL for Synthesis. Course Description. Course Duration. Goals

CS 61C: Great Ideas in Computer Architecture Datapath. Instructors: John Wawrzynek & Vladimir Stojanovic

ELCT 501: Digital System Design

FPGA for Software Engineers

Announcements. This week: no lab, no quiz, just midterm

problem maximum score 1 10pts 2 8pts 3 10pts 4 12pts 5 7pts 6 7pts 7 7pts 8 17pts 9 22pts total 100pts

VLSI Design 13. Introduction to Verilog

Synthesizable Verilog

Digital Integrated Circuits

Verilog Sequential Logic. Verilog for Synthesis Rev C (module 3 and 4)

Review: Timing. EECS Components and Design Techniques for Digital Systems. Lec 13 Storage: Regs, SRAM, ROM. Outline.

Chapter 4. The Processor

CpE242 Computer Architecture and Engineering Designing a Single Cycle Datapath

CS429: Computer Organization and Architecture

Mark Redekopp, All rights reserved. EE 352 Unit 8. HW Constructs

Chapter 4. The Processor

Introduction to Verilog/System Verilog

Recommended Design Techniques for ECE241 Project Franjo Plavec Department of Electrical and Computer Engineering University of Toronto

Topic #6. Processor Design

Lab Manual for COE 203: Digital Design Lab

CMPT 250 : Week 3 (Sept 19 to Sept 26)

UC Berkeley CS61C : Machine Structures

COMPUTER ORGANIZATION AND DESIGN

Date Performed: Marks Obtained: /10. Group Members (ID):. Experiment # 11. Introduction to Verilog II Sequential Circuits

Transcription:

EECS 5 - Components and Design Techniques for Digital Systems Lec 2 RTL Design Optimization /6/27 Shauki Elassaad Electrical Engineering and Computer Sciences University of California, Berkeley Slides adapted from Prof. Culler s 24 lecture http://www-inst.eecs.berkeley.edu/~cs5

Levels of Design Representation Pgm Language CS 6C Asm / Machine Lang Instruction Set Arch Machine Organization HDL FlipFlops Gates Circuits EE 4 Devices Transistor Physics Transfer Function 2

A Standard High-level Organization Controller accepts external and control input, generates control and external output and sequences the movement of data in the datapath. Datapath is responsible for data manipulation. Usually includes a limited amount of storage. Memory optional block used for long term storage of data structures. Standard model for CPUs, micro-controllers, many other digital sub-systems. Usually not nested. Often cascaded: 3

Datapath vs Control Datapath Controller signals Control Points Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals Controller: State machine to orchestrate operation on the data path Based on desired function and signals 4

input Register Transfer Level Descriptions A standard high-level representation for describing systems. It follows from the fact that all synchronous digital system can be described as a set of state elements connected by combination logic (CL) blocks: clock CL reg option feedback input output CL reg output RTL comprises a set of register transfers with optional operators as part of the transfer. Example: rega regb regc rega + regb if (start==) rega regc Personal style: use ; to separate transfers that occur on separate cycles. Use, to separate transfers that occur on the same cycle. Example (2 cycles): rega regb, regb ; regc rega; 5

RTL Abstraction Increases productivity by allowing designers to focus on behavior rather than gate-level logic Design components can be specified w/ concise and modular code in verilog Synthesis tools understand RTL design Think of design in terms of Control and Datapath. Designers are still very close to hardware. They can think of and optimize architectures, timing (cycle-level), and other design trade-offs (power, speed, area..) 6

RTL Design Process Data-path Requirements How many registers do you need? What transformations/operations are needed? Interface Requirements What signals control the operations? What order these signals are in? State-machine design What are the outputs in each state? Look for concurrency in the design. 7

A Register Transfer A B C A Sel D E C Sel Sel Sel ; Ld C B Sel ; Ld Bus Ld C Clk Clk Sel A on Bus B on Bus One of potentially many source regs goes on the bus to one or more destination regs Register transfer on the clock Ld Ld C from Bus? 8

Register Transfers - interconnect Point-to-point connection Dedicated wires Muxes on inputs of each register Common input from multiplexer Load enables for each register Control signals for multiplexer Common bus with output enables Output enables and load enables for each register MUX MUX MUX MUX rs rt rd R4 rs rt rd R4 MUX rs rt rd R4 BUS 9

Register Transfer multiple busses One transfer per bus Each set of wires can carry one value MUX MUX MUX MUX State Elements Registers Register files Memory Combinational Elements Busses ALUs Memory (read) rs rt rd R4

Registers Selectively loaded EN or LD input Output enable OE input Multiple registers group 4 or 8 in parallel LD OE D7 D6 D5 D4 D3 D2 D D CLK Q7 Q6 Q5 Q4 Q3 Q2 Q Q OE asserted causes FF state to be connected to output pins; otherwise they are left unconnected (high impedance) LD asserted during a lo-to-hi clock transition loads new data into FFs

Register Files Collections of registers in one package Two-dimensional array of FFs Address used as index to a particular word Separate read and write addresses so can do both at same time Ex: 4 by 4 register file 6 D-FFs Organized as four words of four bits each Write-enable (load) Read-enable (output enable) RE RB RA WE WB WA D3 D2 D D Q3 Q2 Q Q 2

Memories Larger Collections of Storage Elements Implemented not as FFs but as much more efficient latches High-density memories use -5 switches (transitors) per bit Ex: Static RAM 24 words each 4 bits wide Once written, memory holds forever (not true for denser dynamic RAM) Address lines to select word ( lines for 24 words) Read enable» Same as output enable» Often called chip select» Permits connection of many chips into larger array Write enable (same as load enable) Bi-directional data lines» output when reading, input when writing RD WR A9 A8 A7 A6 A5 A4 A3 A2 A2 A A 3 IO3 IO2 IO IO

ALU Block Diagram Input: data and operation to perform» Add, Sub, AND, OR, NOT, XOR, Output: result of operation and status information A B 6 6 Operation 6 N S Z 4

Data Path (Hierarchy) Arithmetic circuits constructed in hierarchical and iterative fashion each bit in datapath is functionally identical 4-bit, 8-bit, 6-bit, 32-bit datapaths Ain Bin Cin FA Sum Cout Ain Bin Cin HA HA Sum Cout 5

Example Data Path (ALU + Registers) Accumulator Special register One of the inputs to ALU Output of ALU stored back in accumulator One-input Operation Other operand and destination is accumulator register AC < AC op REG Single address instructions» AC < AC op Mem[addr] OP REG 6 6 AC 6 N Z 6 6

Data Path (Bit-slice) Bit-slice concept: iterate to build n-bit wide datapaths Data bit busses run through the slice CO ALU CI CO ALU ALU CI AC AC AC R R R rs rs rs rt rt rt rd rd rd from memory from memory from memory bit wide 2 bits wide 7

Example of Using RTL S S R R ACC ACC + R, R R; ACC ACC + R, R R; R ACC; S2 S3 + ACC RTL description is used to sequence the operations on the datapath (dp). It becomes the high-level specification for the controller. Design of the FSM controller follows directly from the RTL sequence. FSM controls movement of data by controlling the multiplexor/tri-state control signals. 8

Example of Using RTL RTL often used as a starting point for designing both the dp and the control: example: rega IN; regb IN; regc rega + regb; regb regc; From this we can deduce: IN must fanout to both rega and regb rega and regb must output to an adder the adder must output to regc regb must take its input from a mux that selects between IN and regc What does the datapath look like: The controller: 9

Announcements Lab Etiquette Food in the lab is still a problem. If problem persists, we will be forced to close the lab when TAs are not present! Discussion sessions are on for this week. No Lab Lecture this week 2

How does RTL design relate to your project? AC97Controller BitCount Decode Sync CodecReady << Shift Register << >> Shift Register >> SDataIn SDataOut Audio Codec Bit Count Decode Mux Decode Control 32b PCM Audio Data Handshaking 32b PCM Audio Recorded Data Mux Audio Buffer FullVolumeControl Understanding data-flow at this level simplifies and clarifies the design Data going in and out of Audio Buffer is specified at packet level (not at bit-level). Compare this block diagram to the detailed synthesized gate-level design Micro-architecture is influenced by design library: 2

Components of the data path Storage Flip-flops Registers Register Files SRAM Arithmetic Units Adders, subtraters, ALUs (built out of FAs or gates) Comparators Counters Interconnect Wires Busses Tri-state Buffers MUX 22

Arithmetic Circuit Design Full Adder Adder Relationship of positional notation and operations on it to arithmetic circuits Each componet has associated costs: Power Speed Area Reliability A B Cin FA Co S 23

List Processor Example RTL gives us a framework for making high-level optimizations. Fixed function unit Approach extends to instruction interpreters General design procedure outline:. Problem, Constraints, and Component Library Spec. 2. Algorithm Selection 3. Micro-architecture Specification 4. Analysis of Cost, Performance, Power 5. Optimizations, Variations 6. Detailed Design 24

. Problem Specification Design a circuit that forms the sum of all the 2's complements integers stored in a linked-list structure starting at memory address : All integers and pointers are 8-bit. The link-list is stored in a memory block with an 8-bit address port and 8-bit data port, as shown below. The pointer from the last element in the list is. At least one node in list. I/Os: START resets to head of list and starts addition process. DONE signals completion R, Bus that holds the final result 25

. Other Specifications Design Constraints: Usually the design specification puts a restriction on cost, performance, power or all. We will leave this unspecified for now and return to it later. Component Library: component simple logic gates n-bit register n-bit 2- multiplexor n-bit adder memory zero compare delay.5ns clk-to-q=.5ns setup=.5ns (data and LD) ns (2 log(n) + 2)ns ns read (asynchronous read).5 log(n) (single ported memory) Are these reasonable? 26

2. Algorithm Specification In this case the memory only allows one access per cycle, so the algorithm is limited to sequential execution. If in another case more input data is available at once, then a more parallel solution may be possible. Assume datapath state registers NEXT and SUM. NEXT holds a pointer to the node in memory. SUM holds the result of adding the node values to this point. If (START==) NEXT, SUM ; repeat { SUM SUM + Memory[NEXT+]; NEXT Memory[NEXT]; } until (NEXT==); R SUM, DONE ; 27

3. Architecture # Datapath + SUM_SEL LD_NEXT LD_SUM SUM Direct implementation of RTL description: == NEXT_SEL + NEXT A_SEL D Memory A Controller NEXT_ZERO If (START==) NEXT, SUM ; repeat { SUM SUM + Memory[NEXT+]; NEXT Memory[NEXT]; } until (NEXT==); R SUM, DONE ; 28

4. Analysis of Cost, Performance, and Power Skip Power for now. Cost: How do we measure it? # of transistors? # of gates? # of CLBs? Depends on implementation technology. Usually we are interested in comparing the relative cost of two competing implementations. (Save this for later) Performance: 2 clock cycles per number added. What is the minimum clock period? The controller might be on the critical path. Therefore we need to know the implementation, and controller input and output delay. 29

Possible Controller Implementation START START LD_SUM NEXT_ZERO COMP SUM SUM_SEL A_SEL LD_NEXT START GET NEXT NEXT_SEL START DONE DONE Based on this, what is the controller input and output delay? 3

Critical Path Longest path from any reg out to any reg input 3

4. Analysis of Performance COMPUTE_SUM state CLK 8-bit add memory 5-bit add setup CLK-Q MUX MUX NEXT.5 8.5 3ns GET_NEXT state CLK control output delay == MUX MUX A_SEL memory control input delay NEXT_ZERO.5.5.5 5.5ns 32

Critical paths SUM_SEL + NEXT_SEL LD_NEXT NEXT A_SEL D A Memory LD_SUM SUM == + NEXT_ZERO Identify bottlenecks in design Share/schedule resources to improve performance 33

4. Analysis of Performance Detailed timing: clock period (T) = max (clock period for each state) T > 3ns, F < 32 MHz Observation: COMPUTE_SUM state does most of the work. Most of the components are inactive in GET_NEXT state. GET_NEXT does: Memory access + COMPUTE_SUM does: 8-bit add, memory access, 5-bit add + Conclusion: Move one of the adds to GET_NEXT. 34

5. Optimization Add new register named NUMA, for address of number to add. Update RTL to reflect our change (note still 2 cycles per iteration): If (START==) NEXT, SUM, NUMA ; repeat { SUM SUM + Memory[NUMA]; NUMA Memory[NEXT] +, NEXT Memory[NEXT] ; } until (NEXT==); R SUM, DONE ; 35

5. Optimization Architecture #2: SUM_SEL + NEXT_SEL LD_NEXT NEXT A_SEL D A Memory LD_SUM SUM == NEXT_ZERO If (START==) NEXT, SUM, NUMA ; repeat { SUM SUM + Memory[NUMA]; NUMA Memory[NEXT] +, NEXT Memory[NEXT] ; } until (NEXT==); R SUM, DONE ; NEXT_SEL LD_NEXT + NUMA Incremental cost: addition of another register and mux. 36

5. Optimization, Architecture #2 COMPUTE_SUM state CLK NUMA MUX memory CLK-Q 5-bit add MUX setup New timing: Clock Period (T) = max (clock period for each state) T > 23ns, F < 43Mhz GET_NEXT state.5.5 23ns Is this worth the extra cost? Can we lower the cost? CLK A_SEL control output delay MUX MUX memory 8-bit add.5 8.5 2ns NUMA reg setup Notice that the circuit now only performs one add on every cycle. Why not share the adder for both cycles? 37

5. Optimization, Architecture #3 ADD_SEL NEXT_SEL A_SEL D Memory + LD_NEXT NEXT A SUM_SEL NEXT_SEL LD_SUM SUM NUMA LD_NEXT == NEXT_ZERO Incremental cost: Addition of another mux and control. Removal of an 8-bit adder. Performance: mux adds ns to cycle time. 24ns, 4.67MHz. Is the cost savings worth the performance degradation? 38

Design Complexity & Productivity Gap Design gap is accelerating with advances in processing technology. RTL Designers must identify downstream problems timing, signal integrity, reliability, and others prior to synthesis and be able to implement design fixes where they will have a more significant impact on chip performance. The key to a successful design is design closure. The various performance specifications comprising timing, power, and reliability, along with chip cost, are all closely coupled. EETimes 8/22/23 39

Design Gap Keeping up with Moore's Law requires the implementation of disruptive design technology every few years. A common theme of advancing design technology is the continuing move to higher design abstraction levels. 4