CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA)
UC Berkeley, Fall 2010

Unit-Transaction Level (UTL)

A UTL design's functionality is specified as sequences of atomic transactions performed at each unit, affecting only the local state and I/O of that unit. In other words, the system is serializable: any legal state can be reached by single-stepping the entire system, one transaction at a time. A high-level UTL spec admits various mappings into RTL, with different cycle timings and different degrees of overlap among transaction executions.

[Figure: a system of units connected by a network and memory, with transactions T1-T5 executing at different units.]

Transactional Specification of a Unit

[Figure: a unit containing a scheduler, a set of transactions, and architectural state, connected to the network and to memory.]

Each transaction has a combinational guard function, defined over the local state and the state of the I/O, indicating when it can fire (e.g., fire only when the head of an input queue is present and of a certain type). A transaction mutates local state and performs I/O when it fires. The scheduler is a combinational function that picks the next ready transaction to fire.
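
As a concrete illustration, here is a minimal Verilog sketch of per-transaction guards and a fixed-priority combinational scheduler, loosely modeled on the route-lookup unit of the next slide. All port names (req_valid, pkt_valid, and so on) are illustrative assumptions, not part of the lecture's design.

module utl_scheduler (
    input  wire req_valid,     // head of table-access queue present
    input  wire req_is_write,  // that request is a Table_Write
    input  wire pkt_valid,     // head of packet input queue present
    input  wire out_ready,     // downstream queue has space
    output wire fire_write,    // one-hot transaction select
    output wire fire_read,
    output wire fire_route
);
    // Combinational guard for each transaction: fires only when its
    // inputs are present and its outputs have room.
    wire guard_write = req_valid &  req_is_write;
    wire guard_read  = req_valid & ~req_is_write & out_ready;
    wire guard_route = pkt_valid & out_ready;

    // Scheduler: combinationally pick one ready transaction, in
    // decreasing priority order (write, read, route).
    assign fire_write = guard_write;
    assign fire_read  = ~guard_write & guard_read;
    assign fire_route = ~guard_write & ~guard_read & guard_route;
endmodule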

Architectural State

The architectural state of a unit is that which is visible from outside the unit through I/O operations, i.e., architectural state is part of the spec. When a unit is refined into RTL, there will usually be additional microarchitectural state that is not visible from outside:
- Intra-transaction sequencing logic
- Pipeline registers
- Internal caches and/or buffers

UTL Example: Route Lookup

[Figure: a route-lookup unit with a packet input queue, a table-access queue, a scheduler, a route lookup table, a table reply queue, and packet output queues.]

Transactions, in decreasing scheduler priority:
- Table_Write (request on table-access queue): writes a given 12-bit value to a given 12-bit address.
- Table_Read (request on table-access queue): reads a 12-bit value given a 12-bit address, and puts the response on the table reply queue.
- Route (request on packet input queue): looks up the header in the table and places the routed packet on the correct output queue.

This level of detail is all the information we really need to understand what the unit is supposed to do! Everything else is implementation.

Refining Route Lookup to RTL

[Figure: RTL implementation with control, a reorder buffer, a trie lookup pipeline, and a lookup RAM, sitting between the table-access/packet-input queues and the table-reply/packet-output queues.]

The reorder buffer, the trie lookup pipeline's registers, and any control state are microarchitectural state that should not affect function as viewed from outside. The implementation must ensure atomicity of the UTL transactions:
- The reorder buffer ensures packets flow through the unit in order.
- A table write must not appear to happen in the middle of a packet lookup, e.g., wait for the pipeline to drain before performing the write.
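
To make the atomicity requirement concrete, here is a minimal Verilog sketch of the "drain before write" condition, assuming a hypothetical lookup pipeline that reports how many packets it has in flight:

module drain_before_write (
    input  wire       write_req,       // Table_Write waiting at head of queue
    input  wire [2:0] inflight_count,  // packets inside the trie lookup pipeline
    output wire       accept_packet,   // admit a new Route transaction
    output wire       do_table_write   // perform the write this cycle
);
    // Stop admitting packets while a write is pending, then fire the
    // write only once the lookup pipeline has fully drained, so the
    // write never appears to land in the middle of a packet lookup.
    assign accept_packet  = ~write_req;
    assign do_table_write = write_req & (inflight_count == 3'd0);
endmodule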

System Design Goal: Rate Balancing

[Figure: processing units connected by a network to on-chip memory and off-chip memory.]

System performance may be limited by application requirements, on-chip performance, off-chip I/O, or power/energy. We want to balance the throughput of all units (processing, memory, networks) so that none is too fast or too slow.

Rate-Balancing Patterns

To make a unit faster, use parallelism:
- Unrolling (for processing units)
- Banking (for memories)
- Multiporting (for memories)
- Widening links (for networks)
I.e., use more resources by expanding in space and shrinking in time.

To make a unit slower, use time-multiplexing:
- Replace dedicated links with a shared bus (for networks)
- Replace dedicated memories with a common memory
- Replace a multiport memory with multiple cycles on a single port
- Multithread computations onto a common pipeline
- Schedule a dataflow graph onto a single ALU
I.e., use fewer resources by shrinking in space and expanding in time.

Stateless Stream Unit Unrolling

(A stream is an ordered sequence.)

Problem: A stateless unit processing a single input stream of requests has insufficient throughput.

Solution: Replicate the unit and stripe requests across the parallel units. Aggregate the results from the units to form a response stream.

Applicability: The stream unit does not communicate values between independent requests.

Consequences: Requires additional hardware for the replicated units, plus networks to route requests and collect responses. Latency and energy for each individual request increase due to the additional interconnect cost.

[Figure: stateless stream unit unrolling. A single unit processes T1, T2, T3, T4 serially in time; after unrolling, a distribute stage stripes the requests across replicated units and a collect stage reassembles the response stream.]
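
A minimal Verilog sketch of the pattern with two replicated units: a distribute stage stripes requests round-robin, and a collect stage drains results in the same round-robin order, preserving stream order. The single-cycle adder stands in for an arbitrary (and in practice slower) stateless function; the sketch assumes at most one request per cycle and an always-ready consumer.

module unrolled_stream_unit (
    input  wire        clk,
    input  wire        rst,
    input  wire        in_valid,
    input  wire [15:0] in_data,
    output reg         out_valid,
    output reg  [15:0] out_data
);
    reg        dist_sel, coll_sel;   // round-robin pointers
    reg        r_valid [0:1];        // result waiting at each unit
    reg [15:0] r_data  [0:1];

    integer i;
    always @(posedge clk) begin
        if (rst) begin
            dist_sel  <= 1'b0;
            coll_sel  <= 1'b0;
            out_valid <= 1'b0;
            for (i = 0; i < 2; i = i + 1) r_valid[i] <= 1'b0;
        end else begin
            // Distribute: stripe each request to the next unit in turn.
            if (in_valid) begin
                r_data[dist_sel]  <= in_data + 16'd1;  // the replicated "unit"
                r_valid[dist_sel] <= 1'b1;
                dist_sel          <= ~dist_sel;
            end
            // Collect: drain results strictly in stream order.
            out_valid <= r_valid[coll_sel];
            if (r_valid[coll_sel]) begin
                out_data          <= r_data[coll_sel];
                r_valid[coll_sel] <= 1'b0;
                coll_sel          <= ~coll_sel;
            end
        end
    end
endmodule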

Variable-Latency Stateless Stream Unit Unrolling

Problem: A stateless stream unit processing a single input stream of requests has insufficient throughput, and each request takes a variable amount of time to process.

Solution: Replicate the unit. Allocate space in an output reorder buffer in stream order, then dispatch each request to the next available unit. A unit writes its result to the allocated slot in the output reorder buffer when it completes (possibly out of order), but results can only be removed in stream order.

Applicability: The stream unit does not communicate values between independent requests.

Consequences: Additional hardware for the replicated units, plus the added scheduler, buffer, and interconnect. Need a scheduler to find the next free unit, and possibly an arbiter for the reorder buffer write ports. Latency and energy for each individual request increase due to the additional buffers and interconnect.

[Figure: variable-latency unrolling. A scheduler dispatches requests to free units; completions pass through an arbiter into a reorder buffer, which releases results in stream order.]
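
A minimal Verilog sketch of the output reorder buffer at the heart of this pattern: slots are allocated at the tail in stream order at dispatch, units write back by slot tag (possibly out of order), and results drain from the head strictly in stream order. The depth, widths, and port names are illustrative assumptions.

module reorder_buffer #(parameter DW = 16) (
    input  wire          clk,
    input  wire          rst,
    // Allocation: one slot per dispatched request, in stream order.
    input  wire          alloc,
    output wire [1:0]    alloc_tag,   // tag travels with the request
    output wire          full,
    // Writeback: a unit returns its result tagged with its slot.
    input  wire          wb_valid,
    input  wire [1:0]    wb_tag,
    input  wire [DW-1:0] wb_data,
    // Drain: results leave in stream order only.
    output wire          out_valid,
    input  wire          out_ready,
    output wire [DW-1:0] out_data
);
    reg [DW-1:0] data [0:3];
    reg          done [0:3];   // result has arrived
    reg          live [0:3];   // slot allocated, not yet drained
    reg [1:0]    head, tail;

    assign alloc_tag = tail;
    assign full      = live[tail];
    assign out_valid = live[head] & done[head];
    assign out_data  = data[head];

    integer i;
    always @(posedge clk) begin
        if (rst) begin
            head <= 2'd0; tail <= 2'd0;
            for (i = 0; i < 4; i = i + 1) begin
                done[i] <= 1'b0; live[i] <= 1'b0;
            end
        end else begin
            if (alloc & ~full) begin               // allocate in stream order
                live[tail] <= 1'b1;
                done[tail] <= 1'b0;
                tail       <= tail + 2'd1;
            end
            if (wb_valid) begin                    // out-of-order writeback
                data[wb_tag] <= wb_data;
                done[wb_tag] <= 1'b1;
            end
            if (out_valid & out_ready) begin       // in-order drain
                live[head] <= 1'b0;
                head       <= head + 2'd1;
            end
        end
    end
endmodule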

Time Multiplexing

Problem: Too much hardware is used by several units processing independent transactions.

Solution: Provide only a single unit, and time-multiplex the hardware within that unit to process the independent transactions.

Applicability: The original units have similar functionality, and the required throughput is low.

Consequences: The combined unit has to provide a superset of the functionality of the original units. The combined unit has to provide architectural state for all of the architectural state in the original units (microarchitectural state, such as pipeline registers, can be shared). Control logic has to arbitrate usage of the shared resources among the independent transactions, and provide any performance guarantees.

[Figure: a dataflow graph over inputs A and B producing C, implemented once with dedicated add and multiply units and once scheduled cycle-by-cycle onto a single shared ALU.]
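
A minimal Verilog sketch of the idea, scheduling a hypothetical two-operation dataflow graph, C = (A + B) * B, over two cycles on one shared ALU instead of a dedicated adder plus multiplier. (The specific graph is an illustrative assumption, not the one on the slide.)

module shared_alu_example (
    input  wire        clk,
    input  wire        rst,
    input  wire [15:0] a,   // a and b assumed held stable across the schedule
    input  wire [15:0] b,
    output reg  [15:0] c,
    output reg         c_valid
);
    reg        step;  // 1-bit schedule counter: which operation this cycle
    reg [15:0] t;     // holds the intermediate value (A + B)

    // One shared ALU, steered by the schedule step.
    wire [15:0] alu_out = (step == 1'b0) ? (a + b)   // cycle 0: add
                                         : (t * b);  // cycle 1: multiply

    always @(posedge clk) begin
        if (rst) begin
            step    <= 1'b0;
            c_valid <= 1'b0;
        end else begin
            if (step == 1'b0) begin
                t       <= alu_out;   // save intermediate result
                c_valid <= 1'b0;
            end else begin
                c       <= alu_out;   // final result every other cycle
                c_valid <= 1'b1;
            end
            step <= ~step;
        end
    end
endmodule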

Other Forms of Rate Balancing

- Increase/reduce voltage: trade dynamic power for performance.
- Increase/reduce Vt, Tox, etc.: trade leakage power for performance.

Processing Unit Design Patterns

Control+Datapath

Problem: Arithmetic operations within a transaction require wide functional units and memories connected by wide buses. Sequencing of operations within a transaction is complex.

Solution: Split the processing unit into (1) a datapath, which contains the functional units, data memories, and their interconnect, and (2) control, which contains all the sequencing logic.

Applicability: Where there is a clear divide between control logic and data-processing logic, with relatively few signals crossing this divide, and mostly from control to datapath rather than vice versa.

Consequences: Most design errors are confined to the control portion. The same datapath design can perform many different transactions by changing the control sequencing. Paths between control and datapath, particularly from datapath back to control, are often timing-critical.

[Figure: a control block driving a datapath with a RAM; status signals return from the datapath to control.]
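
A minimal Verilog sketch of the split, assuming a hypothetical unit that sums eight values from a RAM: the control FSM holds all the sequencing, the datapath holds the adder, registers, and RAM, and the only datapath-to-control signal is the loop-termination flag.

module sum_unit (
    input  wire        clk,
    input  wire        rst,
    input  wire        start,
    output reg         done,
    output reg  [15:0] sum
);
    // ---- Control: an FSM generates all the sequencing ----
    localparam IDLE = 2'd0, RUN = 2'd1, FIN = 2'd2;
    reg [1:0] state;
    reg [2:0] addr;
    wire last = (addr == 3'd7);     // datapath -> control status signal

    // ---- Datapath: RAM + adder, steered by the control lines ----
    reg  [15:0] ram [0:7];          // contents assumed loaded elsewhere
    wire [15:0] next_sum = sum + ram[addr];

    always @(posedge clk) begin
        if (rst) begin
            state <= IDLE; done <= 1'b0;
        end else case (state)
            IDLE: if (start) begin
                      sum  <= 16'd0; addr <= 3'd0; done <= 1'b0;
                      state <= RUN;
                  end
            RUN:  begin
                      sum  <= next_sum;       // accumulate one element
                      addr <= addr + 3'd1;
                      if (last) state <= FIN;
                  end
            FIN:  begin done <= 1'b1; state <= IDLE; end
            default: state <= IDLE;
        endcase
    end
endmodule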

Controller Patterns (synchronous control of a local datapath)

- State Machine Controller: control lines generated by a state machine
- Microcoded Controller: single-cycle datapath, control lines held in ROM/RAM
- In-Order Pipeline Controller: controls a pipelined datapath, with dynamic interaction between stages
- Out-of-Order Pipeline Controller: operations within a control stream might be reordered internally
- Threaded Pipeline Controller: multiple control streams, one execution pipeline

Control Decomposition

Control functions can be divided into three categories:
- Transaction scheduling: pick the next transaction to be executed.
- Transaction sequencing: sequence the operations within a transaction.
- Pipeline control: control the execution of operations on a pipelined datapath.

[Figure: the three control layers, with control lines going to the datapath and status signals returning from it.]

FSM-Controlled Processing Unit

Simple units have all three functions (transaction scheduling, transaction sequencing, and pipeline control) folded into a single finite-state-machine design.

Microcoded Processing Unit

In-Order Pipelined Processing Unit

Out-of-Order Processing Unit

Threaded Processing Unit

Skid Buffering

[Figure: three snapshots of a 3-stage pipeline (Scheduler, Tags, Data), showing miss #1 in the Data stage, miss #2 reaching the Tags stage, and further loads/stores being stopped.]

Consider a non-blocking cache implemented as a 3-stage pipeline (scheduler, tag access, data access). A CPU load/store is not admitted into the pipeline unless a miss tag, a replay queue slot, and a victim buffer are available in case of a miss. But hit/miss is determined only at the end of the Tags stage, so a second miss can enter the pipeline before the first is detected.

Solutions? We could admit only one load/store every two cycles, but that gives low throughput. Skid buffering: add an additional victim buffer, miss tags, and replay queue entries so the following transaction can also complete if it misses, and stall the scheduler whenever there is not enough space for two misses.
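
A minimal Verilog sketch of the skid-buffered admission check: the scheduler admits a load/store only when every miss resource has room for two outstanding misses, covering both a request already past the Tags stage and the new one. Resource names and counter widths are illustrative assumptions.

module skid_admit (
    input  wire [2:0] free_miss_tags,     // free miss-tag entries
    input  wire [2:0] free_victim_bufs,   // free victim-buffer entries
    input  wire [2:0] free_replay_slots,  // free replay-queue entries
    output wire       admit               // allow request into the pipeline
);
    // Stall the scheduler unless there is space for TWO misses.
    assign admit = (free_miss_tags    >= 3'd2) &
                   (free_victim_bufs  >= 3'd2) &
                   (free_replay_slots >= 3'd2);
endmodule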

Interconnect Design Patterns

Implementing Communication Queues

A queue can be implemented as a centralized FIFO, with a single control FSM, if both ends are close to each other and directly connected:

[Figure: a single FIFO with one control block.]

In large designs, there may be several cycles of communication latency from one end to the other. This introduces delay both in the forward data propagation and in the reverse flow control:

[Figure: send and receive ends separated by a multi-cycle pipelined link.]

Control is split into send and receive portions. A credit-based flow-control scheme is often used to tell the sender how many units of data it can send before overflowing the receiver's buffer.

End-to-End Credit-Based Flow Control

[Figure: sender and receiver connected by an N-cycle channel, with credits returning on the reverse path.]

For a one-way latency of N cycles, the receiver needs 2*N buffers to sustain full bandwidth, since it takes at least 2N cycles before the sender can be informed that the first unit sent was consumed (or not) by the receiver. If the receive buffer fills up and stalls communication, it takes N cycles before the first credit flows back to the sender to restart the flow, then another N cycles for the value to arrive from the sender; meanwhile, the receiver can work from the 2*N buffered values.
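
A minimal Verilog sketch of the sender side: it starts with one credit per receiver buffer (2*N for an N-cycle link, per the slide), spends a credit on every send, and regains one whenever a credit returns from the receiver. Widths and port names are illustrative assumptions.

module credit_sender #(parameter [3:0] CREDITS = 4'd8) (  // e.g., 2*N, N = 4
    input  wire clk,
    input  wire rst,
    input  wire data_avail,     // sender has a unit of data ready
    input  wire credit_return,  // one pulse per slot freed at the receiver
    output wire send            // transmit one unit this cycle
);
    reg [3:0] credits;

    // May send only while credits remain, so the receiver buffer
    // (sized 2*N) can never overflow.
    assign send = data_avail & (credits != 4'd0);

    always @(posedge clk) begin
        if (rst)
            credits <= CREDITS;
        else
            credits <= credits - send + credit_return;  // -1 send, +1 return
    end
endmodule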

Distributed Flow Control

[Figure: a chain of small FIFOs, each with its own local control, forming the link.]

An alternative to end-to-end control is distributed flow control (a chain of FIFOs). This requires less storage, as the communication flops are reused as buffers, but needs more distributed control circuitry. Lots of small buffers are also less efficient than a single larger buffer. And sometimes it is not possible to insert logic into the communication path, e.g., on a wave-pipelined multi-cycle wiring path, or a photonic link.
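
A minimal Verilog sketch of one link stage in such a chain: a two-entry FIFO with purely local ready/valid handshakes on each side. Chaining these stages turns the pipeline flops of a long wire into buffering, so the link needs no end-to-end credit loop of its own.

module fifo_stage (
    input  wire        clk,
    input  wire        rst,
    input  wire        in_valid,
    output wire        in_ready,
    input  wire [15:0] in_data,
    output wire        out_valid,
    input  wire        out_ready,
    output wire [15:0] out_data
);
    reg [15:0] buf0, buf1;   // buf0 is always the oldest entry
    reg [1:0]  count;        // 0, 1, or 2 entries occupied

    assign in_ready  = (count != 2'd2);
    assign out_valid = (count != 2'd0);
    assign out_data  = buf0;

    wire push = in_valid  & in_ready;
    wire pop  = out_valid & out_ready;

    always @(posedge clk) begin
        if (rst)
            count <= 2'd0;
        else begin
            count <= count + push - pop;
            if (pop) buf0 <= buf1;                     // shift oldest out
            if (push) begin
                if (count - pop == 2'd0) buf0 <= in_data;  // lands at head
                else                     buf1 <= in_data;  // queues behind
            end
        end
    end
endmodule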

Network Patterns

A network connects multiple units using shared resources:
- Bus: low cost, ordered
- Crossbar: high performance
- Multi-stage network: trades off cost and performance

Buses

[Figure: several units and a bus controller attached to a shared bus.]

Buses were a popular board-level option for implementing communication, as they saved pins and wires. They are less attractive on-chip, where wires are plentiful and buses are slow and cumbersome with central control. They are often used on-chip when shrinking an existing legacy system design onto a single chip. Newer designs are moving to either dedicated point-to-point unit communications or an on-chip network.

On-Chip Network

[Figure: units attached to a chain of routers; long-range wires are shared between routers.]

An on-chip network multiplexes long-range wires to reduce cost. Routers use distributed flow control to transmit packets. Units usually need end-to-end credit flow control in addition, because the intermediate buffering in the network is shared by all units.