ESE534: Computer Organization, Day 22: Time Multiplexing


ESE534: Computer Organization
Day 22: April 9, 2012: Time Multiplexing

Tabula
March 1, 2010: announced a new architecture. In our terms, a w=1, c=8 architecture. [src: www.tabula.com]

Previously
Saw how to pipeline architectures, specifically interconnect, and talked about the general case. Saw how to reuse resources at the maximum rate to do the same thing. Saw the demand for, and options to support, data retiming.

Today
Multicontext: review why, cost, packing into contexts, retiming requirements, some components. [Concepts we saw in the overview of weeks 2-3; we can now dig deeper into the details.]

How often is reuse of the same operation applicable? In what cases can we exploit high-frequency, heavily pipelined operation, and when can we not? Can we exploit the higher frequency offered?
- High throughput, feed-forward (acyclic).
- Cycles in the flowgraph: with abundant data-level parallelism [C-slow]; with no data-level parallelism.
- Low-throughput tasks: structured (e.g. datapaths) [serialize datapath]; unstructured.
- Data-dependent operations: similar ops [local control -- next time]; dis-similar ops.

Structured Datapaths
Datapaths use the same pinst for all bits. We can serialize and reuse the same data elements in succeeding cycles (example: an adder).

Preclass 1
What are the sources of inefficient mapping when taking a W_task=4, L_task=4 task to a W_arch=1, C=1 architecture? How do we transform the task to run efficiently on that architecture, and what is the impact on efficiency?

Throughput Yield
FPGA model: if the throughput requirement is reduced for wide-word operations, serialization allows us to reuse the active area for the same computation. (Same graph, rotated to show the backside.)

Remaining Cases
The remaining cases benefit from multicontext as well as a high clock rate: cycles with no parallelism; data-dependent, dissimilar operations; low-throughput, irregular tasks (can't afford the swap?).
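The adder serialization mentioned above can be sketched bit-serially: a w=1 datapath applies the same full-adder logic once per cycle, LSB first, reusing one active cell W times. This is an illustrative model (the function and names are mine, not the preclass solution):

```python
# Bit-serial addition: one full-adder cell reused over W cycles,
# modeling how a W-bit-wide datapath serializes onto a w=1 architecture.
def bit_serial_add(a_bits, b_bits):
    """a_bits, b_bits: LSB-first lists of 0/1 bits of equal width W."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):       # one cycle per bit position
        out.append(a ^ b ^ carry)          # same full-adder sum each cycle
        carry = (a & b) | (carry & (a ^ b))
    return out, carry

def bits(x, w):
    """Decompose integer x into w LSB-first bits."""
    return [(x >> i) & 1 for i in range(w)]

# 4-bit example: 6 + 3 = 9 = 0b1001
s, carry_out = bit_serial_add(bits(6, 4), bits(3, 4))
```

The same physical cell handles every bit position, which is exactly why a structured datapath needs only one pinst for all bits.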

Single Context
When we have cycles and no data parallelism, low-throughput unstructured tasks, or dis-similar data-dependent tasks, active resources sit idle most of the time: a waste of resources. A single-context device cannot reuse its resources to perform a different function, only the same one.

Resource Reuse
To use resources in these cases, we must direct them to do different things. We must be able to tell the resources how to behave: separate instructions (pinsts) for each behavior.

Preclass 2
How do we schedule onto 3 contexts? Onto 4 contexts? Example with dis-similar operations: how do we schedule onto 6 contexts?

Multicontext Organization / Area
A_ctxt = 80Kλ² (dense encoding), A_base = 800Kλ², so A_ctxt : A_base = 1:10.

Preclass 3
Area: single context? 3 contexts? 4 contexts? 6 contexts?

Multicontext Tradeoff Curves
Assume ideal packing: N_active = N_total / L. Limitations come from scheduling and retiming. In practice, reminder: the robust point is c * A_ctxt = A_base.

Scheduling Limitations
N_A (active) is set by the size of the largest stage. Precedence: we can evaluate a LUT only after its predecessors have been evaluated, so we cannot always completely equalize the stage requirements.
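The Preclass 3 areas follow from a simple linear model using the slide's constants (a sketch assuming block area is A_base plus one A_ctxt per context; note c=1 gives 880Kλ² and c=3 gives 1040Kλ², the per-LUT areas used in the ASCII-to-Hex example later):

```python
# Multicontext block area model with the slide's constants:
# A_ctxt = 80K lambda^2 per context (dense encoding),
# A_base = 800K lambda^2 of active logic (A_ctxt : A_base = 1:10).
A_CTXT = 80_000     # lambda^2 per instruction context
A_BASE = 800_000    # lambda^2 of active logic and interconnect

def block_area(c):
    """Area of one compute block holding c instruction contexts (lambda^2)."""
    return A_BASE + c * A_CTXT

areas = {c: block_area(c) for c in (1, 3, 4, 6)}
# The "robust point" c * A_ctxt = A_base falls at c = 10 for these constants.
```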

Scheduling
Precedence limits packing freedom. The freedom we do have shows up as slack in the network.

Computing Slack
ASAP (As Soon As Possible) schedule: propagate depth forward from the primary inputs; depth = 1 + max input depth.
ALAP (As Late As Possible) schedule: propagate distance back from the outputs; level = 1 + max output consumption level.
Slack: slack = L + 1 - (depth + level), with PI depth = 0 and PO level = 0.

Slack Example

Preclass 4
With precedence constraints and unlimited hardware, how many contexts?

Preclass 5
Without precedence, how many compute blocks are needed to evaluate in 4 contexts?

Preclass 6
Where can we schedule J? Where can we schedule D?
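The slack computation above can be run directly on a netlist. The sketch below uses the slide's definitions verbatim; the 4-LUT network is a made-up example (not the preclass network), chosen so that one node has nonzero slack:

```python
# Slack per the slide: depth = 1 + max input depth (PI depth = 0),
# level = 1 + max output consumption level (PO level = 0),
# slack = L + 1 - (depth + level), with L the critical path length.
preds = {                 # LUT -> predecessor LUTs ([] = fed only by PIs)
    'a': [],
    'b': [],
    'c': ['a'],
    'd': ['c', 'b'],      # 'b' feeds 'd' directly, so 'b' has slack
}
succs = {n: [m for m in preds if n in preds[m]] for n in preds}

depth, level = {}, {}

def get_depth(n):         # ASAP: propagate depth forward from inputs
    if n not in depth:
        depth[n] = 1 + max((get_depth(p) for p in preds[n]), default=0)
    return depth[n]

def get_level(n):         # ALAP: propagate level back from outputs
    if n not in level:
        level[n] = 1 + max((get_level(s) for s in succs[n]), default=0)
    return level[n]

L = max(get_depth(n) for n in preds)                   # critical path length
slack = {n: L + 1 - (get_depth(n) + get_level(n)) for n in preds}
```

Here every LUT except 'b' is on the critical path (slack 0), while 'b' can be scheduled in either of two time slots, which is exactly the packing freedom the scheduler exploits.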

Preclass 6
Where can we schedule D if J is in context 3? If J is in context 2? Where can we schedule J if D is in context 1? If D is in context 2? Where do we schedule the operations, and on which physical blocks?

Sequentialization (reminder: Preclass 1)
Adding time slots makes the design more sequential (more latency) but adds slack, which allows better balance: L=4, N_A=2 (4 contexts).

Multicontext Data Retiming
How do we accommodate the intermediate data?

Signal Retiming
Single context, non-pipelined: hold the value on the LUT output (wire) from production through consumption. How does this show up in multicontext?
Multicontext equivalent: we need a LUT to hold the value for each intermediate context. This wastes wire and switches by occupying them for the entire critical-path delay L, not just for the 1/L-th of the cycle it takes to cross the wire segment.

ASCII-to-Hex Example
Single context: 21 LUTs @ 880Kλ² = 18.5Mλ².
Three contexts: 12 LUTs @ 1040Kλ² = 12.5Mλ².

Alternate Retiming
All retiming on wires (active outputs): saturation based on the inputs to the largest stage. Recall from last time (Day 20): a net buffer is smaller than a LUT; output retiming may have to route multiple times; an input buffer chain only needs a LUT every depth cycles. Ideal: perfect scheduling spread plus no retiming overhead.

Reminder: Input Buffer Retiming
We can only take K unique inputs per cycle, and the configuration depth differs from context to context. We cannot schedule the LUTs in slots 2 and 3 on the same physical block, since together they require 6 inputs.

ASCII-to-Hex Example (input retime)
At depth=4, c=6: 5.5Mλ² (compare 18.5Mλ² single context), a factor of 3.4.

General Throughput Mapping
If we only want to achieve limited throughput (target: produce a new result every t cycles):
1. Spatially pipeline every t stages (cycle = t).
2. Retime to minimize register requirements.
3. Use multicontext evaluation within a spatial stage, trying to minimize resource usage.
4. Map for depth (i) and contexts (c).

Benchmark Set
23 MCNC circuits, area mapped with SIS and Chortle.

Multicontext vs. Throughput
[Results plots.]
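The ASCII-to-Hex numbers quoted on these slides are easy to cross-check; the per-LUT areas are the context-model areas from earlier (880Kλ² for one context, 1040Kλ² for three), and the 5.5Mλ² input-retime figure is taken directly from the slide:

```python
# Checking the ASCII-to-Hex mapping areas (all in Mlambda^2).
single = 21 * 0.88        # 21 LUTs @ 880 Klambda^2 each, single context
three  = 12 * 1.04        # 12 LUTs @ 1040 Klambda^2 each, three contexts
input_retime = 5.5        # depth=4, c=6 input-retimed mapping (from slide)

ratio = single / input_retime    # the slide's factor-of-3.4 savings
print(round(single, 1), round(three, 1), round(ratio, 1))   # -> 18.5 12.5 3.4
```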

General Theme
Ideal benefit (e.g., Active = N/C); logical constraints (precedence); resource limits (sometimes the bottleneck); net benefit (resource balance).

Beyond Area
Is multicontext only an area win? If area were free, would we always want a fully spatial design? (Did not cover this section in class.)

Communication Latency
Communication latency across the chip can limit designs; a serial design is smaller and so has less latency.

Optimal Delay for Graph App.
[Plot.]

Optimal Delay Phenomena
[Plot.]

What Minimizes Energy (HW5)
[Plot of "EnergyALU" versus N, for B = N, with N from 16 to 16384.]

DPGA (1995)
Components. [Tau et al., FPD 1995]

Xilinx Time-Multiplexed FPGA
In the mid 1990s, Xilinx considered a multicontext FPGA based on its XC4K (pre-Virtex) devices, with a prototype layout at F=500nm. It required more physical interconnect than the XC4K, and there was concern about power (10W at 40MHz).
It carried two unnecessary expenses. First, it used output registers with separate outputs, following the XC4K design. Second, it did not densely encode the interconnect configuration: compare 8 bits to configure an input C-Box connection versus log₂(8)=3 bits to control a mux select, roughly 200b pinsts vs. 64b pinsts. [Trimberger, FCCM 1997]

Tabula
8 contexts, 1.6GHz, 40nm, 64b pinsts. Our model with input retiming: 1Mλ² base, 80Kλ² per 64b pinst of instruction memory per context, and 40Kλ² per input-retime depth: 1Mλ² + 8 × 0.12Mλ² ≈ 2Mλ², i.e., 4 LUTs (ideal). Recall the ASCII-to-Hex factor of 3.4; the throughput map is similar. They claim 2.8 LUTs. [MPR/Tabula 3/29/2009]

Admin
No required reading for Wednesday; the supplemental reading is on the web. Office hours Tuesday: good for questions about the project.

Big Ideas [MSB Ideas]
Several cases cannot profitably reuse the same logic at the device cycle rate: cycles with no data parallelism; low-throughput, unstructured tasks; dis-similar, data-dependent computations. These cases benefit from more than one instruction/operation per active element. A_ctxt << A_active makes this interesting: we save area by sharing the active element among instructions.

Big Ideas [MSB-1 Ideas]
Economical retiming becomes important here to achieve the active-LUT reduction; one output register per LUT leads to early saturation. With c=4-8 and I=4-6, automatically mapped designs are roughly 1/3 the single-context size. Most FPGAs typically run in a realm where multicontext is smaller. How much of that is for intrinsic reasons, and how much for lack of HSRA-like register/CAD support?
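The Tabula estimate above can also be checked numerically. This sketch uses only the constants given on the slide; reading "4 LUTs (ideal)" as 8 contexts of logic delivered in roughly two base-LUT areas is my interpretation:

```python
# Tabula-style area model from the slide: base active area plus,
# per context, a 64b pinst memory and input-retime storage.
BASE   = 1.00   # Mlambda^2, base LUT + interconnect
PINST  = 0.08   # Mlambda^2 per 64b pinst (instruction memory per context)
RETIME = 0.04   # Mlambda^2 per input-retime depth

contexts = 8
per_context = PINST + RETIME               # 0.12 Mlambda^2 per context
total = BASE + contexts * per_context      # ~2 Mlambda^2, as on the slide

# Ideal logical-LUT density: 8 contexts of logic in `total` area,
# measured in units of the 1 Mlambda^2 base LUT (interpretation of
# the slide's "4 LUTs (ideal)"; Tabula claims 2.8 in practice).
ideal_luts = contexts * BASE / total
```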