Structured Datapaths. Preclass 1. Throughput Yield. Preclass 1

Size: px

Start display at page:

Download "Structured Datapaths. Preclass 1. Throughput Yield. Preclass 1"

Eunice Gray
5 years ago
Views:

ESE534: Computer Organization Day 23: November 21, 2016 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. March, 2015 Tabula closed doors 1 [src: www.

1 ESE534: Computer Organization Day 23: November 21, 2016 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. March, 2015 Tabula closed doors 1 [src: 2 Previously Today Basic pipelining Saw how to reuse resources at maximum rate to do the same thing Saw how to use instructions to reuse resources in time to do different things Saw demand-for and options-to-support data retiming 3 Multicontext Review why Cost Packing into contexts Retiming requirements for Multicontext Some components [concepts we saw in overview week 2-3, we can now dig deeper into details] 4 How often is reuse of the same operation applicable? In what cases can we exploit highfrequency, heavily pipelined operation? and when can we not? 5 How often is reuse of the same operation applicable? Can we exploit higher frequency offered? High throughput, feed-forward (acyclic) Cycles in flowgraph abundant data level parallelism [C-slow] no data level parallelism Low throughput tasks structured (e.g. datapaths) [serialize datapath] unstructured Data dependent operations similar ops [local control lecture after next] dis-similar ops 6 1

Structured Datapaths Preclass 1 Datapaths: same pinst for all bits Can serialize and reuse the same data elements in succeeding cycles example: adder 7 Recall looked at mismatches Width, instruction

8 Preclass 1 Throughput Yield How transform W task =4, L task =4 (path length from throughput) to run efficiently on W arch =1, C=1 architecture? Impact on efficiency?

2 Structured Datapaths Preclass 1 Datapaths: same pinst for all bits Can serialize and reuse the same data elements in succeeding cycles example: adder 7 Recall looked at mismatches Width, instruction depth/task length Sources of inefficient mapping W task =4, L task =4 to W arch =1, C=1 architecture? 8 Preclass 1 Throughput Yield How transform W task =4, L task =4 (path length from throughput) to run efficiently on W arch =1, C=1 architecture? Impact on efficiency? 9 FPGA Model -- if throughput requirement is reduced for wide word operations, serialization allows us to reuse active area for same computation 10 Same graph, rotated to show backside. Throughput Yield 11 How often is reuse of the same operation applicable? Can we exploit higher frequency offered? High throughput, feed-forward (acyclic) Cycles in flowgraph abundant data level parallelism [C-slow] no data level parallelism Low throughput tasks structured (e.g. datapaths) [serialize datapath] unstructured Data dependent operations similar ops [local control -- next time] dis-similar ops 12 2

3 Remaining Cases Benefit from multicontext as well as high clock rate i.e. cycles, no parallelism data dependent, dissimilar operations low throughput, irregular (can t afford swap?) Single Context/Fully Spatial When have: cycles and no data parallelism low throughput, unstructured tasks dis-similar data dependent tasks Active resources sit idle most of the time Waste of resources Cannot reuse resources to perform different function, only same Resource Reuse To use resources in these cases must direct to do different things. Preclass 2 How schedule onto 3 contexts? Must be able tell resources how to behave! separate instructions (pinsts) for each behavior Preclass 2 Preclass 2 How schedule onto 4 contexts? How schedule onto 6 contexts?

Multicontext Organization/ Area Preclass 3 A ctxt 20KF 2 dense encoding A base 200KF 2 A ctxt :A base = 1:10 Area: Single context? 3 contexts? 4 contexts? 6 contexts?

4 Multicontext Organization/ Area Preclass 3 A ctxt 20KF 2 dense encoding A base 200KF 2 A ctxt :A base = 1:10 Area: Single context? 3 contexts? 4 contexts? 6 contexts? Multicontext Tradeoff Curves Assume Ideal packing: N active =N total /L Limitations from: Scheduling Retiming In Practice Reminder: Robust point: c*a ctxt =A base Scheduling Scheduling Limitations N A (active) size of largest stage Precedence (data dependence): can evaluate a LUT only after predecessors have been evaluated "cannot always, completely equalize stage requirements

5 Scheduling Reminder (Preclass 2) Precedence limits packing freedom Note difference Preclass 2 vs. N total /L Sequentialization Adding time slots more sequential (more latency) add slack allows better balance Retiming L=4 N A =2 (4 contexts) Multicontext Data Retiming Signal Retiming How do we accommodate intermediate data? Single context, non-pipelined hold value on LUT Output (wire) from production through consumption Wastes wire and switches by occupying for entire critical path delay L not just for 1/L th of cycle takes to cross wire segment 29 How show up in multicontext? 30 5

Signal Retiming Multicontext equivalent need LUT to hold value for

retime a value for multiple cycles to deliver to the consumer If

full LUT for signal retiming but we saw last time could retime more

1MF 2 34 ASCII Hex Example Alternate Retiming All retiming on wires

from last time (Day 22) Net buffer smaller than LUT Output retiming

6 Signal Retiming Multicontext equivalent need LUT to hold value for each intermediate context Retiming in Multicontext May need to retime a value for multiple cycles to deliver to the consumer If our retiming resource is a LUT output flop We are forced to use a full LUT for signal retiming but we saw last time could retime more economically ASCII Hex Example ASCII Hex Example Single Context: KF 2 =4.6MF 2 33 Three Contexts: KF 2 =3.1MF 2 34 ASCII Hex Example Alternate Retiming All retiming on wires (active outputs) saturation based on inputs to largest stage Recall from last time (Day 22) Net buffer smaller than LUT Output retiming may have to route multiple times Input buffer chain only need LUT every depth cycles Ideal Perfect scheduling spread + no retime overhead

7 Depth Mismatch Efficiency What happens when Arch(retime) > App(retime)? What happens when Arch(retime) < App(retime)? Input Buffer Retiming Can only take K unique inputs per cycle Configuration depth differ from contextto-context Cannot schedule LUTs in slot 2 and 3 on the same physical block, since require 6 inputs Reminder ASCII Hex Example ASCII Hex Example (input retime) All retiming on wires (active outputs) saturation based on inputs to largest depth=4, c=6: 1.4M F 2 (compare 4.6M F 2 ) 3.4 Ideal Perfect scheduling spread + no retime overhead General throughput mapping: If only want to achieve limited throughput Target produce new result every t cycles 1. Spatially pipeline every t stages cycle = t 2. retime to minimize register requirements 3. multicontext evaluation w/in a spatial stage try to minimize resource usage 4. Map for depth (i) and contexts (c) Benchmark Set 23 MCNC circuits area mapped with SIS and Chortle

8 Multicontext vs. Throughput Multicontext vs. Throughput General Theme Ideal Benefit e.g. Active=N/C Logical Constraints Precedence Resource Limits Sometimes bottleneck Net Benefit Resource Balance 45 Computing Device Composition Retiming as key parameter (d) Alongside c, w, p How balance for designs with c>1 46 Only an Area win? Beyond Area If area were free, would we always want a fully spatial design?

1 2 4 8 Communication Latency Optimal Delay for Graph App.

model) B = N 25000 20000 15000 "EnergyALU" Multicontext Energy Energy normalized to FPGA N=2 29 gates Multicontext Processor W=64,

9 Communication Latency Optimal Delay for Graph App. Communication latency across chip can limit designs Serial design is smaller " less latency What Minimizes Energy HW5 (different model) B = N "EnergyALU" Multicontext Energy Energy normalized to FPGA N=2 29 gates Multicontext Processor W=64, I= "EnergyALU" FPGA [DeHon / FPGA 2014] Compare vs. p, W=1, N=2 19 Energy Processor Components FPGA Multicontext DeHon--FPT Rent p 54 9

DPGA (1995) Xilinx Time-Multiplexed FPGA Mid 1990s Xilinx considered Multicontext FPGA Based on XC4K (pre-virtex) devices Prototype Layout in F=500nm Required more physical interconnect than XC4K

, FPD 1995] 55 [Trimberger, FCCM 1997] 56 Xilinx Time-Multiplexed FPGA Two unnecessary expenses: Used output registers with separate outs Based on XC4K design Did not densely encode interconnect

6GHz, 40nm 64b pinsts Our model w/ input retime 1Mλ 2 base 80Kλ 2 / 64b pinst Instruction mem/context 40Kλ 2 / input-retime depth 1Mλ 2 +8 0.12Mλ 2 ~=2Mλ 2! 4 LUTs (ideal) Recall ASCIItoHex 3.

10 DPGA (1995) Xilinx Time-Multiplexed FPGA Mid 1990s Xilinx considered Multicontext FPGA Based on XC4K (pre-virtex) devices Prototype Layout in F=500nm Required more physical interconnect than XC4K Concerned about power (10W at 40MHz) [Tau et al., FPD 1995] 55 [Trimberger, FCCM 1997] 56 Xilinx Time-Multiplexed FPGA Two unnecessary expenses: Used output registers with separate outs Based on XC4K design Did not densely encode interconnect configuration Compare 8 bits to configure input C-Box connection Versus log 2 (8)=3 bits to control mux select Approx. 200b pinsts vs. 64b pinsts 57 Tabula 8 context, 1.6GHz, 40nm 64b pinsts Our model w/ input retime 1Mλ 2 base 80Kλ 2 / 64b pinst Instruction mem/context 40Kλ 2 / input-retime depth 1Mλ Mλ 2 ~=2Mλ 2! 4 LUTs (ideal) Recall ASCIItoHex 3.4, similar for thput map They claim 2.8 LUTs [MPR/Tabula 3/29/2009] 58 Big Ideas [MSB Ideas] Several cases cannot profitably reuse same logic at device cycle rate cycles, no data parallelism low throughput, unstructured dis-similar data dependent computations These cases benefit from more than one instructions/operations per active element A ctxt << A active makes interesting save area by sharing active among instructions 59 Big Ideas [MSB-1 Ideas] Energy benefit for large W, p For c>1, retiming balance drives d>1 one output reg/lut leads to retiming bottleneck c=4--8, I=4--6 automatically mapped designs roughly 1/3 single context size Most FPGAs typically run in realm where multicontext is smaller (maybe not less energy) How many for intrinsic reasons? How many for lack of register/cad support? 60 10

11 Admin Wednesday 11/23 is a Friday No class for us Next lecture Monday Reading on web Final Exercise Nothing due Friday Wednesday 11/23 First milestone due Wed 11/

ESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable?

ESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable? ESE534: Computer Organization Day 22: April 9, 2012 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. 1 [src: www.tabula.com] 2 Previously Today Saw how to pipeline