Lecture 12: SMART NoC

Size: px
Start display at page:

Download "Lecture 12: SMART NoC"

Transcription

1 ECE 8823 A / CS ICN Interconnection Networks Spring Lecture 12: SMART NoC Tushar Krishna Assistant Professor School of Electrical and Computer Engineering Georgia Institute of Technology tushar@ece.gatech.edu Krishna et al, Breaking the On-Chip Latcy Barrier Using SMART, HPCA 2013

2 Network Architecture 2 Topology How to connect the nodes ~Road Network Routing Which path should a message take ~Series of road segmts from source to destination Flow Control Wh does the message have to stop/proceed ~Traffic signals at d of each road segmt Router Microarchitecture How to build the routers ~Design of traffic intersection (number of lanes, algorithm for turning red/gre) ICN Spring 2017 L10: Router uarch February 15, 2017

3 Common Pipeline Optimizations 3 BW + RC in parallel Lookahead Routing BW RC VA SA BR ST LT SA + VA in parallel VC Select (switch output port winner selects VC from pool of free VCs) Speculative VA (if VA takes long, speculatively allocate a VC while flit performs SA) (Peh and Dally, HPCA 2001) If SA and VA both successful, go for ST If SA or VA fails, retry next cycle BR + SA in parallel The winner of Input Arbitration is read out and st to the input of the crossbar speculatively Low-load Bypassing Wh no flits in input buffer Speculatively ter ST On port conflict, speculation aborted

4 Express Virtual Channels (ISCA 2007) 4 Analogy Express Trains and Local Trains Flits on Express VCs do not get buffered at intermediate routers Sd a lookahead to ask local flits to wait (i.e., kill switch allocation)

5 Modern Pipelines 5 1-cycle for arbitration (t r ), 1-cycle for traversal (t w ) Used by Tilera s imesh, Intel s Ring, NoC prototypes (Park et al., DAC 2012)

6 Is that the best we can do? 6 T N = (t r +t w ) H + T c + T S 1 t w = 1 Latcy α Hops fundamtal limitation? Can we remove the depdce of latcy on hops? Stay tuned!

7 Designing a 1-cycle network 7 What limits us from designing a 1-cycle network? Is it the wire delay? Repeated global wires can go up to 16mm within 1ns Repeated global wire delay expected to remain constant/decrease slightly Technology = 45nm with technology scaling. Target Clock Period = 1ns Energy'(fJ/bit/mm)' 51# 48# 45# 42# 39# 36# 33# 30# 27# 24# 21# 18# 15# 45nm#(Place4and4Route)# Clocked( 45nm#(Projected*)# Driver( 32nm#(Projected*)# 22nm#(Projected*)# 0# 1# 2# 3# 4# 5# 6# 7# 8# 9# 10# 11# 12# 13# 14# 15# 16# 17# 18# 19# 20# 21# 22# Lgth/period'(mm/ns)' Metal Layer = M6 Repeater Spacing = 1mm Wire Width = DRC min Wire Spacing = 3 DRC min (coupling cap à 0) *DSENT (NOCS 2012): Timing-driv NoC Power Estimation Tool

8 Designing a 1-cycle network 8 What limits us from designing a 1-cycle network? Is it the wire delay? Global repeated wires can transmit up to 10-16mm within 1ns (1GHz) Global On-chip repeated wires wires fast delay ough expected to transmit to remain across fairly the chip constant! Chip within dimsions 1-2 cycles expected at 1GHz to remain ev similar as technology (yield) scales Clock frequcy expected to remain similar (power wall) ~20mm ~20mm ~20mm ~20mm

9 Designing a 1-cycle network 9 What limits us from designing a 1-cycle network? Is it the wire delay? No! Classic scaling challge with wires Wire-delay increases relative to logic delay But Wire-delay in cycles expected to remain constant. Wires fast ough to transmit across chip in 1-2 cycles today and in future.

10 Designing a 1-cycle network 10 What limits us from designing a 1-cycle network? Is it the wire delay? No! Dedicated 1-cycle wire Dream Traversal repeater Actual Traversal HopàStopàHop on-chip router to manage sharing of output link every cycle hop router Shared Links! Number of wires = O(n 2 ) Fully-connected (Impractical) Mesh (Practical) Number of wires = O(n)

11 Designing a 1-cycle network 11 What limits us from designing a 1-cycle network? Is it the wire delay? No! Is it the routers? Yes! 1-cycle repeater Dedicated topology impractical* *unless we design a chip for a specific application router 9-Cycles (Best Case) (single-cycle HopàStopàHop router, no other traffic!) Routers Delay: required 2-4 to cycles. share links Best Case: 1 cycle. Note: this is at every hop Number of wires = O(n 2 ) Fully-connected (Impractical) Mesh (Practical) Number of wires = O(n)

12 Designing a 1-cycle network 12 What limits us from designing a 1-cycle network? Is it the wire delay? No! Is it the routers? Yes! 1-cycle repeater Cycles (Best Case) (single-cycle router, no other traffic!) Dedicated topology impractical* *unless we design a chip for a specific application repeater Routers required to share links Can we get both? Yes! SMART: achieve the performance of dedicated connections over a network of shared links Single-cycle Multi-hop Asynchronous Repeated Traversal 1-cycle (no other traffic)

13 A SMART Network-on-Chip 13 Microarchitecture Bypass path with clockless repeater at each router Flow Control Compete for and reserve a sequce of shared links cycle-by-cycle Dynamically create repeated links ( SMART paths ) betwe any two routers repeater Single-cycle traversal Single-cycle reconfiguration Blue Flow has to stop for Pink flow

14 A SMART Network-on-Chip 14 Microarchitecture Bypass path with clockless repeater at each router Flow Control Compete for and reserve a sequce of shared links cycle-by-cycle Dynamically create repeated links ( SMART paths ) betwe any two routers How well does SMART perform? 88-90% of the performance of an O(n 2 ) wire fully-connected (dream) topology with an O(n) wire SMART NoC Baseline Mesh needs to be clocked 5.4 times faster to match SMART (64-core full-system simulation with real applications) Microarchitecture and Flow Control details next! T. Krishna et al. HPCA 2013 IEEE Computer 2013 IEEE Micro Top Picks 2014 NOCS 2014 (Best Paper Award)

15 Microarchitecture: Data Path 15 network packet Flit Input Links Core North South East West 128-bit Traditional SMART Router Microarchitecture Buffer able (buffer): latch input flit or not Control Bypass Path bypass Queues Input Port local bar Bypass select (): who uses crossbar+link (local or bypass) Output Links Repeater These signals are setup by the control path bar select (): select line for s output

16 Microarchitecture: Control Path 16 Dedicated repeated links from every router to help setup a SMART path Lgth = HPC max hops Width = log 2 (1+HPC max ) bits HPC max (max Hops Per Cycle): maximum number of hops that the underlying wire allows the flit to traverse within a clock cycle e.g., HPC max = 1GHz, 1mm hop Let HPC max = 3 SSR for E direction 2-bit control path R0 R1 R2 R3 R4 128-bit data path

17 SMART Flow Control Divide the packet into flits 17 Assume HPC max = 3 (max Hops Per Cycle) head body tail Divide the route into segmts of HPC max A B HPC max Request a path of desired lgth (in hops) over the SSR wires Intermediate routers arbitrate betwe control requests from various routers and setup buffer, and for the data path No ACK has to be st back! Sd the flit on the data wires May get partial or full SMART path based on conttion that cycle

18 Traversal Example 1: R0 à R3 18 Cycle 1 (Ctrl): R0 sds SSR = 3 (3-hop path request) Assume HPC max = 3 (max Hops Per Cycle) SSR R0 = 3 R0 R1 R2 R3 R4

19 Traversal Example 1: R0 à R3 19 Cycle 1 (Ctrl): R0 sds SSR = 3 (3-hop path request) Assume HPC max = 3 (max Hops Per Cycle) All routers set buffer, and for this request. SSR R0 = 3 R0 R1 R2 buffer 1 R3 R4 local bypass bypass

20 Traversal Example 1: R0 à R3 20 Cycle 1 (Ctrl): R0 sds SSR = 3 (3-hop path request) Assume HPC max = 3 (max Hops Per Cycle) All routers set buffer, and for this request. Cycle 2 (Data): R0 sds flit to R3 SSR R0 = 3 R0 R1 R2 buffer 1 R3 R4 local bypass bypass A SMART path is simply a combination of buffer, and at all the intermediate routers.

21 Example 2: R0 à R3 and R2 à R4 21 Cycle 1 (Ctrl): R0 sds SSR = 3. R2 sds SSR = 2. Challge: only one flit can be st on shared link betwe R2 and R3 at a time SSR R2 = 2 Assume HPC max = 3 (max Hops Per Cycle) SSR R0 = 3 R0 R1 R2 R3 R4 Solution: Routers prioritize betwe path requests based on distance two alternate schemes: Prio=Local and Prio=Bypass

22 Example 2: R0 à R3 and R2 à R4 22 Prio = Local è 0 hop > 1 hop > 2 hop > HPC max hop Cycle 2 (Data): R2 sds flit to R4. R0 s flit blocked at R2. R0 R1 R2 R3 R4 buffer 1 buffer 1 local bypass local bypass Cycle 4+ (Data): R2 sds blocked flit to R3. (after local arbitration + sding ctrl) R0 R1 R2 R3 R4 buffer 1 local

23 Example 2: R0 à R3 and R2 à R4 23 Prio = Bypass è HPC max hop > (HPC max - 1) hop > > 1 hop > 0 hop Cycle 2 (Data): R0 sds flit to R3. R2 s flit waits. R0 R1 R2 R3 R4 buffer 1 local bypass bypass Cycle 3 (Data): R2 sds flit to R4. R0 R1 R2 R3 R4 buffer 1 local bypass

24 Achievable Hops Per Cycle (HPC) 24 At low loads (best case), as good as dedicated wires 6" SMART paths are opportunistic Achievable HPC depds on link conttion Average hops in traffic pattern Average'HPC'(hops)' 5" 4" 3" 2" 1" 0" Fully"Connected"(Dream)" SMART" Baseline"(Mesh)" At high loads (worst case), no worse 0" 20" 40" 60" than 80" the baseline 100" Flit'Injec9on'Rate'(%'of'capacity)'

25 Pipeline and Implemtation 25 SSR R2 SSR R1 SSR R0 SSR R3 min{t ST+LT, T Req + T SA-G } determines critical path (thus HPC max ) Repeater Overhead Energy: Asynchronous repeater consumes 14.3% lower ergy than clocked driver Area: No area overhead since repeaters are embedded in wire-dominated crossbar SA-L SA-G buffer Cycle 0 Switch Allocation Local (SA-L) (can be bypassed at low loads) Cycle 1* R3 SA-G R4 Overhead R5 Energy: ~2% SMART_1D, ~2-6% SMART_2D Area: Cycle < 21% SMART_1D, 1-5% SMART_2D Req + Multi-hop Switch (crossbar) + Switch Allocation Link Traversal (ST+LT) Global (SA-G) (till it is stopped by some buffer=1) HPC max = 13 SMART_1D HPC max = 9 SMART_2D HPC max = 11 *Convtionally (i.e. in baseline), winners of SA-L go for Switch + Link Traversal

26 The Devil is in the Details 26 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in

27 The Devil is in the Details 27 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in

28 Managing Distributed Arbitration 28 Cycle 1: R0 sds SSR = 3. R2 sds SSR = 2. Can differt routers force differt priorities? SSR R0 = 3 SSR R2 = 2 Assume HPC max = 3 (max Hops Per Cycle) R0 R1 R2 R3 R4 Prio = Bypass Prioritize SSR R0 over SSR R2 Prio = Local Prioritize SSR R2 over SSR R0

29 Managing Distributed Arbitration 29 Cycle 1: R0 sds SSR = 3. R2 sds SSR = 2. Can differt routers force differt priorities? No! SSR R0 = 3 SSR R2 = 2 Assume HPC max = 3 (max Hops Per Cycle) R0 R1 R2 R3 buffer 1 R4 local bypass bypass bypass WàE WàE WàE WàE Prio = Bypass Prioritize SSR R0 over SSR R2 R0 s flit incorrectly reaches R4, instead of getting stopped at R3. Prio = Local Prioritize SSR R2 over SSR R0

30 Managing Distributed Arbitration 30 Distributed Conssus: All routers need to take the same decision about multiple contding flits in a distributed manner Solution: All routers follow the same static priority betwe the path setup requests that they receive Prio = Local: 0 hop > 1 hop > (HPC max -1) hop > HPC max hop Prio = Bypass: HPC max hop > (HPC max -1) hop > 1 hop > 0 hop Implication: a router will not receive a flit that it does not expect But can a router not receive a flit that it does expect?

31 Managing Distributed Arbitration 31 Can a router not receive a flit that it does expect? SSR R1 = SSR R0 = 3 Control for N direction Control for E direction R0 R1 R2 R3 R4 Prio = Local

32 Managing Distributed Arbitration 32 Can a router not receive a flit that it does expect? Yes! Is there a performance loss? The R2àR3 link was granted for this cycle, but wt unused. What if some other flit wanted to use it? No. (Prio=Local) R0 buffer 1 R1 R2 buffer 1 R3 R4 local local bypass W->N

33 Impact of False Negatives 33 Cycle 1 (Ctrl Req): R0àR2, R1àR4, R3àR4. SSR R1 = 3 SSR R0 = 2 SSR R3 = 1 Req Priority = Bypass R0 R1 R2 R3 R4 Cycle 2: Flit: R0àR2. forced starvation and throughput loss R0 R1 R2 R3 R4 local bypass buffer 1 local bypass buffer 1

34 Impact of False Negatives 34 Average'flit'latcy'(cycles)' Prio = Bypass increases false negatives at high-loads 28" 24" 20" 16" 12" 8" 4" 0" Prio=Bypass" Prio=Local" Prio = Bypass saturates at ~44-48% injection rate 0" 20" 40" 60" 80" 100" Flit'Injec5on'Rate'(%'of'capacity)'

35 The Devil is in the Details 35 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in

36 Bypass at turns 36 Dest Src

37 Bypass at turns 37 Assume HPC max = 3 Any blue router (HPC max neighborhood) can be reached in 1 cycle Intermediate Ctrl Dest Shortest Path Routing à Any router in shaded HPC max quadrant is pottial intermediate destination Src Separate ctrl path for every possible route. One of the ctrl paths chos during route computation. Only one of these 5 Reqs will be valid and will request for E/N/S output port.

38 Control Arbitration 38 Ctrl Prio = Local Challge: All input and output ports at all participating routers should make consistt decisions simultaneously. Red wins over Blue D B A Solution: 2-level priority among Reqs 1. Distance (i.e. Prio=Local or Prio=Bypass) 2. Direction C local WàE buffer 1 local WàE bypass WàN? S àn? bypass Sà??à E How do we choose betwe Red and Purple?

39 Priority at Output Port 39 At A: Purple wins over Red 2-level Ctrl Req Priority 1. Distance 0 hop > 1 hop > 2 hop (Prio = Local) B 2. Direction Straight hops > Left hops > Right hops 0 hop 1 hop 2 hop 3 hop D E A Intermediate C All gre routers (sources) can request for the North output port at the blue (intermediate) router Assume HPC max = 3 1 Source Routers All routers force same priority (to guarantee no false positives)

40 Priority at Input Port 40 At A: Purple wins over Red At B: Purple wins over Red 0 hop B Intermediate D 1 hop 2 hop 3 hop 4 2 A 3 5 C level Ctrl Req Priority 1. Distance 0 hop > 1 hop > 2 hop (Prio = Local) 2. Direction Straight hops > Left hops > Right hops All gre routers (sources) can request for the South input port at the blue (intermediate) router 1 Source Routers Assume HPC max = 3

41 The Devil is in the Details 41 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

42 The Devil is in the Details 42 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

43 Buffer Managemt 43 R0 à R3, R2 à R4 SSR R2 = 2 SSR R0 = 3 R0 R1 R2 R3 R4 local bypass buffer 1 local bypass buffer 1 How do we guarantee R2 has a free buffer / VC for the flit from R0? Every router has free VC information about its neighbor, just like the baseline. R0 sds only if R1 has a free buffer R1 lets the incoming flit bypass only if R2 has a free buffer, else latches it. Corollary: VCid is allocated after it stops, rather than before starting, since router where it stops is not known. ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

44 The Devil is in the Details 44 Managing Distributed Arbitration Could flits get misrouted? Could flits not arrive wh expected? SMART_2D How can flits bypass routers at turns? Buffer Managemt How is a flit guaranteed a buffer (and in the correct virtual channel) if it is stopped mid-way? How is buffer availability conveyed? Multi-flit packets How does SMART guarantee that flits (head, body, tail) of a packet do not get re-ordered? How do flits know which VC to stop in ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

45 Multi-flit Packets (1) 45 R0 à R3, R2 à R4 SSR R2 = 2 SSR R0 = 3 R0 R1 R2 R3 R4 buffer 1 buffer 1 local bypass local bypass How do we guarantee Body/Tail flits from R0 do not bypass R2 if Head was stopped? If R2 sees a SSR requesting bypass from a router from where it has a buffered flit, it stops all subsequt flits from that router, and buffers them in the same VC ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

46 Multi-flit Packets (2) 46 R0 à R3, R2 à R4 SSR R2 = 2 SSR R0 = 3 R0 R1 R2 R3 R4 buffer 1 buffer 1 local bypass local bypass How do the Body/Tail flits know which VC to stop in if VC is allocated at the router where Head stops? Match using source_id of all the flits of the packet ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

47 Multi-flit Packets (2) 47 R0 à R3, R2 à R4 SSR R2 = 2 SSR R0 = 3 R0 R1 R2 R3 R4 buffer 1 buffer 1 local bypass local bypass What if 2 flits from differt packets from the same source arrive? Virtual Cut-Through ough Buffers to store all flits of a packet all flits of one packet should leave before flits of another packet ICN Spring 2016 L11: Network Interface Tushar Krishna, School of ECE, Georgia Tech February 24, 2016

48 Evaluations 48 Simulation Infrastructure: GEMS (Full-System) + Garnet (NoC) System: 64-core (8x8 Mesh) Technology: 45nm, 1GHz Networks being compared BASELINE (t r =1): Baseline Mesh with 1-cycle router at every hop SMART-<HPC max >_1D: Always stop at turning router Best case delay: 4 cycles (Req à Flit à Req Y à Flit Y ) SMART-<HPC max >_2D: Bypass turning router Best case delay: 2 cycles (Req à Flit) DREAM (T N =1): Conttion-less 1-cycle network (fully-connected)

49 Breaking the latcy barrier 49 Average'flit'latcy'(cycles)' 32" 28" 24" 20" 16" 12" 8" 4" 0" BASELINE"(tr=1)" r SMART:8_1D" SMART:8_2D" DREAM IDEAL"(Tn=1)" N = 1) Lower is better 0" 0.1" 0.2" 0.3" 0.4" 0.5" Injec4on'Rate'(flits/node/cycle)' Average'flit'latcy'(cycles)' 32" 28" 24" 20" 16" 12" 8" 4" 0" BASELINE"(tr=1)" r SMART:8_1D" SMART:8_2D" IDEAL"(Tn=1)" DREAM N = 1) 0" 0.05" 0.1" 0.15" 0.2" 0.25" Injec4on'Rate'(flits/node/cycle)' Uniform Random (Avg Hops = 5.33) Bit Complemt (Avg Hops = 8) SMART reduces low-load latcy to 2-4 cycles across all traffic patterns, indepdt of average hops. Average'flit'latcy'(cycles)' 32" 28" 24" 20" 16" 12" 8" 4" 0" BASELINE"(tr=1)" r SMART:8_1D" SMART:8_2D" DREAM IDEAL"(Tn=1)" N = 1) 0" 0.05" 0.1" 0.15" 0.2" 0.25" Injec4on'Rate'(flits/node/cycle)' Shuffle (Avg Hops = 4) Average'flit'latcy'(cycles)' 32" 28" 24" 20" 16" 12" 8" 4" 0" BASELINE"(tr=1)" r SMART:8_1D" SMART:8_2D" IDEAL"(Tn=1)" DREAM N = 1) 0" 0.05" 0.1" 0.15" 0.2" Injec4on'Rate'(flits/node/cycle)' Transpose (Avg Hops = 6)

50 Impact of HPC max 50 HPC max of 2 and 4 give 1.8 and 3 reduction in latcy Avg$Flit$Latcy$(cycles)$ 32" 28" 24" 20" 16" 12" 8" 4" 0" 0" 0.05" 0.1" 0.15" 0.2" 0.25" Injec4on$Rate$(flits/node/cycle)$ SMART is a better design choice than a baseline 1-cycle router network ev at larger tile size or higher frequcy (i.e. lower HPC max ) HPC max of 8 gives 5.4 reduction in latcy HPC max (max Hops Per Cycle): maximum number of hops that the underlying wire allows the flit to traverse within a clock cycle SMART01_1D" SMART02_1D" SMART04_1D" SMART08_1D" SMART04_2D" SMART08_2D" SMART012_2D" Bit Complemt Traffic A baseline mesh network needs to run at 5.4GHz to match the performance of a 1GHz SMART NoC HPC max will go up as technology scales! (smaller cores, similar frequcy)

51 Full-system Runtime 51 Normalized+Applica/on+Run/me+ 1.2" 1" 0.8" 0.6" 0.4" 0.2" 0" BASELINE"(tr=1)" r SMART58_1D" SMART58_2D" DREAM(T IDEAL"(Tn=1)" N = 1) )" lu" nlu" radix" water5nsq" water5spa9al" blackscholes" canneal" fluidanimate" swap9ons" x264" AVERAGE" SPLASH52" PARSEC" Directory Full5state"Directory" Protocol N Private L2/tile SMART reduces runtime by 26-27% for Private L2 8% off a dream 1-cycle network Normalized+Applica/on+Run/me+ 1.2" 1" 0.8" 0.6" 0.4" 0.2" 0" BASELINE"(tr=1)" SMART58_1D" SMART58_2D" DREAM(T IDEAL"(Tn=1)" N = 1) r )" lu" nlu" radix" water5nsq" water5spa9al" blackscholes" canneal" fluidanimate" swap9ons" x264" AVERAGE" SPLASH52" PARSEC" Directory Full5state"Directory" Protocol N Shared L2/tile SMART reduces runtime by 49-52% for Shared L2 9% off a dream 1-cycle network

52 Backups 52

53 Protocol-level ordering 53 If a virtual network requires protocol-level ordering Only deterministic routing allowed within that virtual network. Priority should be Local > Bypass Guarantees that 2 flits from the same source do not overtake each other at any router.

54 SMART vs. Flatted Butterfly 54 What if we exploit repeated wires and add explicit 1-cyle physical express links? Ports Router Delay SMART 5-port router 1-2 cycle in router [SA-L, SSR+SA-G] Flatted Butterfly 15-port router 3-4 cycle in router [8-14:1 Arbiter, Multi-stage bar] Bisection Bandwidth (BB) 128x8 1-flit req 5-flit resp 128x8 [1x] 7-flit req 35-flit resp 448x8 [3.5x] 2-flit req 10-flit resp 896x8 [7x] 1-flit req 5-flit resp Dyn Power (mw) Area (mm x mm) [~1x] 37 [1.5x] 58 [2.3x] 0.27x x0.47 [3x] 0.53x0.53 [3.9x] 0.7x0.7 [6.7x] R0

55 SMART vs. Flatted Butterfly 55 Assume 1-cycle Flatted Butterfly Router: Highly optimistic assumption Avg$Pkt$Latcy$(cycles)$ 32" 28" 24" 20" 16" 12" 8" 4" 0" 1"pkt"="" 128)bit" 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9" 1" SMART always beats FBfly in latcy Injec4on$Rate$(packets/node/cycle)$ FBfly matches SMART in throughput only with 3.5 more wires SMART28_1D" SMART28_2D" FBfly_BB=1x" FBfly_BB=3.5x" FBfly_BB=7x" Better to use SMART and reconfigure 1-cycle multi-hop paths based on traffic rather than use a fixed high-radix topology.

Lecture 3: Flow-Control

Lecture 3: Flow-Control High-Performance On-Chip Interconnects for Emerging SoCs http://tusharkrishna.ece.gatech.edu/teaching/nocs_acaces17/ ACACES Summer School 2017 Lecture 3: Flow-Control Tushar Krishna Assistant Professor

More information

Lecture 7: Flow Control - I

Lecture 7: Flow Control - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 7: Flow Control - I Tushar Krishna Assistant Professor School of Electrical

More information

Breaking the On-Chip Latency Barrier Using SMART

Breaking the On-Chip Latency Barrier Using SMART Breaking the On-Chip Latency Barrier Using SMART Tushar Krishna Chia-Hsin Owen Chen Woo Cheol Kwon Li-Shiuan Peh Computer Science and Artificial Intelligence Laboratory (CSAIL) Massachusetts Institute

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

Lecture 3: Topology - II

Lecture 3: Topology - II ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 3: Topology - II Tushar Krishna Assistant Professor School of Electrical and

More information

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) hyoukjun@gatech.edu April

More information

Low-Power Interconnection Networks

Low-Power Interconnection Networks Low-Power Interconnection Networks Li-Shiuan Peh Associate Professor EECS, CSAIL & MTL MIT 1 Moore s Law: Double the number of transistors on chip every 2 years 1970: Clock speed: 108kHz No. transistors:

More information

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality

More information

SCORPIO: 36-Core Shared Memory Processor

SCORPIO: 36-Core Shared Memory Processor : 36- Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect Chia-Hsin Owen Chen Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett

More information

Lecture 22: Router Design

Lecture 22: Router Design Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip

More information

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) OpenSMART (https://tinyurl.com/get-opensmart)

More information

A Case for Low Frequency Single Cycle Multi Hop NoCs for Energy Efficiency and High Performance

A Case for Low Frequency Single Cycle Multi Hop NoCs for Energy Efficiency and High Performance A Case for Low Frequency Single Cycle Multi Hop NoCs for Energy Efficiency and High Performance Monodeep Kar and Tushar Krishna School of Electrical and Computer Engineering Georgia Institute of Technology

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals Philipp Gorski, Tim Wegner, Dirk Timmermann University

More information

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation

More information

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom ISCA 2018 Session 8B: Interconnection Networks Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom Aniruddh Ramrakhyani Georgia Tech (aniruddh@gatech.edu) Tushar

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal Lecture 19 Interconnects: Flow Control Winter 2018 Subhankar Pal http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk,

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching. Switching/Flow Control Overview Interconnection Networks: Flow Control and Microarchitecture Topology: determines connectivity of network Routing: determines paths through network Flow Control: determine

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Prediction Router: Yet another low-latency on-chip router architecture

Prediction Router: Yet another low-latency on-chip router architecture Prediction Router: Yet another low-latency on-chip router architecture Hiroki Matsutani Michihiro Koibuchi Hideharu Amano Tsutomu Yoshinaga (Keio Univ., Japan) (NII, Japan) (Keio Univ., Japan) (UEC, Japan)

More information

Interconnection Networks

Interconnection Networks Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks Department of Computer Science and Engineering, Texas A&M University Technical eport #2010-3-1 seudo-circuit: Accelerating Communication for On-Chip Interconnection Networks Minseon Ahn, Eun Jung Kim Department

More information

Lecture 13: System Interface

Lecture 13: System Interface ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 13: System Interface Tushar Krishna Assistant Professor School of Electrical

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

A thesis presented to. the faculty of. In partial fulfillment. of the requirements for the degree. Master of Science. Yixuan Zhang.

A thesis presented to. the faculty of. In partial fulfillment. of the requirements for the degree. Master of Science. Yixuan Zhang. High-Performance Crossbar Designs for Network-on-Chips (NoCs) A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University In partial fulfillment of the requirements

More information

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs Lecture 15: NoC Innovations Today: power and performance innovations for NoCs 1 Network Power Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton Energy for a flit

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University *CMU

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip Nasibeh Teimouri

More information

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual

More information

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California

More information

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection

More information

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Lecture 23: Router Design

Lecture 23: Router Design Lecture 23: Router Design Papers: A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, ISCA 06, Penn-State ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 21 Routing Outline Routing Switch Design Flow Control Case Studies Routing Routing algorithm determines which of the possible paths are used as routes how

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees, butterflies,

More information

Express Virtual Channels: Towards the Ideal Interconnection Fabric

Express Virtual Channels: Towards the Ideal Interconnection Fabric Express Virtual Channels: Towards the Ideal Interconnection Fabric Amit Kumar, Li-Shiuan Peh, Partha Kundu and Niraj K. Jha Dept. of Electrical Engineering, Princeton University, Princeton, NJ 8544 Microprocessor

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Extending the Performance of Hybrid NoCs beyond the Limitations of Network Heterogeneity

Extending the Performance of Hybrid NoCs beyond the Limitations of Network Heterogeneity Journal of Low Power Electronics and Applications Article Extending the Performance of Hybrid NoCs beyond the Limitations of Network Heterogeneity Michael Opoku Agyeman 1, *, Wen Zong 2, Alex Yakovlev

More information

NEtwork-on-Chip (NoC) [3], [6] is a scalable interconnect

NEtwork-on-Chip (NoC) [3], [6] is a scalable interconnect 1 A Soft Tolerant Network-on-Chip Router Pipeline for Multi-core Systems Pavan Poluri and Ahmed Louri Department of Electrical and Computer Engineering, University of Arizona Email: pavanp@email.arizona.edu,

More information

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel

More information

The Design and Implementation of a Low-Latency On-Chip Network

The Design and Implementation of a Low-Latency On-Chip Network The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan. Introduction Current

More information

CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Wide links, smaller routing delay Tremendous variation 3/19/99 CS258 S99 2

CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Wide links, smaller routing delay Tremendous variation 3/19/99 CS258 S99 2 Real Machines Interconnection Network Topology Design Trade-offs CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Wide links, smaller routing delay Tremendous variation 3/19/99

More information

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock

More information

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels Lecture: Interconnection Networks Topics: TM wrap-up, routing, deadlock, flow control, virtual channels 1 TM wrap-up Eager versioning: create a log of old values Handling problematic situations with a

More information

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously

More information

OASIS NoC Architecture Design in Verilog HDL Technical Report: TR OASIS

OASIS NoC Architecture Design in Verilog HDL Technical Report: TR OASIS OASIS NoC Architecture Design in Verilog HDL Technical Report: TR-062010-OASIS Written by Kenichi Mori ASL-Ben Abdallah Group Graduate School of Computer Science and Engineering The University of Aizu

More information

ISSN Vol.03,Issue.06, August-2015, Pages:

ISSN Vol.03,Issue.06, August-2015, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.03,Issue.06, August-2015, Pages:0920-0924 Performance and Evaluation of Loopback Virtual Channel Router with Heterogeneous Router for On Chip Network M. VINAY KRISHNA

More information

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

More information

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER A Thesis by SUNGHO PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements

More information

ES1 An Introduction to On-chip Networks

ES1 An Introduction to On-chip Networks December 17th, 2015 ES1 An Introduction to On-chip Networks Davide Zoni PhD mail: davide.zoni@polimi.it webpage: home.dei.polimi.it/zoni Sources Main Reference Book (for the examination) Designing Network-on-Chip

More information

The Runahead Network-On-Chip

The Runahead Network-On-Chip The Network-On-Chip Zimo Li University of Toronto zimo.li@mail.utoronto.ca Joshua San Miguel University of Toronto joshua.sanmiguel@mail.utoronto.ca Natalie Enright Jerger University of Toronto enright@ece.utoronto.ca

More information

Fall 2012 Parallel Computer Architecture Lecture 21: Interconnects IV. Prof. Onur Mutlu Carnegie Mellon University 10/29/2012

Fall 2012 Parallel Computer Architecture Lecture 21: Interconnects IV. Prof. Onur Mutlu Carnegie Mellon University 10/29/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 21: Interconnects IV Prof. Onur Mutlu Carnegie Mellon University 10/29/2012 New Review Assignments Were Due: Sunday, October 28, 11:59pm. Das et

More information

NOW Handout Page 1. Outline. Networks: Routing and Design. Routing. Routing Mechanism. Routing Mechanism (cont) Properties of Routing Algorithms

NOW Handout Page 1. Outline. Networks: Routing and Design. Routing. Routing Mechanism. Routing Mechanism (cont) Properties of Routing Algorithms Outline Networks: Routing and Design Routing Switch Design Case Studies CS 5, Spring 99 David E. Culler Computer Science Division U.C. Berkeley 3/3/99 CS5 S99 Routing Recall: routing algorithm determines

More information

18-740/640 Computer Architecture Lecture 16: Interconnection Networks. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015

18-740/640 Computer Architecture Lecture 16: Interconnection Networks. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015 18-740/640 Computer Architecture Lecture 16: Interconnection Networks Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015 Required Readings Required Reading Assignment: Dubois, Annavaram,

More information

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:

More information

Flow Control can be viewed as a problem of

Flow Control can be viewed as a problem of NOC Flow Control 1 Flow Control Flow Control determines how the resources of a network, such as channel bandwidth and buffer capacity are allocated to packets traversing a network Goal is to use resources

More information

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

Interconnection Networks: Routing. Prof. Natalie Enright Jerger Interconnection Networks: Routing Prof. Natalie Enright Jerger Routing Overview Discussion of topologies assumed ideal routing In practice Routing algorithms are not ideal Goal: distribute traffic evenly

More information

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012.

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012. CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION by Stephen Chui Bachelor of Engineering Ryerson University, 2012 A thesis presented to Ryerson University in partial fulfillment of the

More information

NETWORK-ON-CHIPS (NoCs) [1], [2] design paradigm

NETWORK-ON-CHIPS (NoCs) [1], [2] design paradigm IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Extending the Energy Efficiency and Performance With Channel Buffers, Crossbars, and Topology Analysis for Network-on-Chips Dominic DiTomaso,

More information

Interconnection Networks

Interconnection Networks Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact

More information

Quality-of-Service for a High-Radix Switch

Quality-of-Service for a High-Radix Switch Quality-of-Service for a High-Radix Switch Nilmini Abeyratne, Supreet Jeloka, Yiping Kang, David Blaauw, Ronald G. Dreslinski, Reetuparna Das, and Trevor Mudge University of Michigan 51 st DAC 06/05/2014

More information

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-26 1 Network/Roadway Analogy 3 1.1. Running

More information

Asynchronous Bypass Channel Routers

Asynchronous Bypass Channel Routers 1 Asynchronous Bypass Channel Routers Tushar N. K. Jain, Paul V. Gratz, Alex Sprintson, Gwan Choi Department of Electrical and Computer Engineering, Texas A&M University {tnj07,pgratz,spalex,gchoi}@tamu.edu

More information

3D WiNoC Architectures

3D WiNoC Architectures Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",

More information

Routing Algorithms. Review

Routing Algorithms. Review Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent

More information

PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS. A Thesis SONALI MAHAPATRA

PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS. A Thesis SONALI MAHAPATRA PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS A Thesis by SONALI MAHAPATRA Submitted to the Office of Graduate and Professional Studies of Texas A&M University

More information

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011 CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Evaluating Bufferless Flow Control for On-Chip Networks

Evaluating Bufferless Flow Control for On-Chip Networks Evaluating Bufferless Flow Control for On-Chip Networks George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis Stanford University In a nutshell Many researchers report high buffer

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

NoC Test-Chip Project: Working Document

NoC Test-Chip Project: Working Document NoC Test-Chip Project: Working Document Michele Petracca, Omar Ahmad, Young Jin Yoon, Frank Zovko, Luca Carloni and Kenneth Shepard I. INTRODUCTION This document describes the low-power high-performance

More information

Locality-Oblivious Cache Organization leveraging Single-Cycle Multi-Hop NoCs

Locality-Oblivious Cache Organization leveraging Single-Cycle Multi-Hop NoCs Locality-Oblivious Cache Organization leveraging Single-Cycle Multi-Hop NoCs Woo-Cheol Kwon Department of EECS, MIT Cambridge, MA 9 wckwon@csail.mit.edu Tushar Krishna Intel Corporation, VSSAD Hudson,

More information

SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In- Network Ordering

SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In- Network Ordering SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In- Network Ordering The MIT Faculty has made this article openly available. Please share how this access benefits

More information

548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011

548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011 548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011 Extending the Effective Throughput of NoCs with Distributed Shared-Buffer Routers Rohit Sunkam

More information

NoCAlert: An On-Line and Real- Time Fault Detection Mechanism for Network-on-Chip Architectures

NoCAlert: An On-Line and Real- Time Fault Detection Mechanism for Network-on-Chip Architectures NoCAlert: An On-Line and Real- Time Fault Detection Mechanism for Network-on-Chip Architectures Andreas Prodromou, Andreas Panteli, Chrysostomos Nicopoulos, and Yiannakis Sazeides University of Cyprus

More information

Design of a high-throughput distributed shared-buffer NoC router

Design of a high-throughput distributed shared-buffer NoC router Design of a high-throughput distributed shared-buffer NoC router The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation As Published

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

A Low Latency Router Supporting Adaptivity for On-Chip Interconnects

A Low Latency Router Supporting Adaptivity for On-Chip Interconnects A Low Latency Supporting Adaptivity for On-Chip Interconnects 34.2 Jongman Kim, Dongkook Park, T. Theocharides, N. Vijaykrishnan and Chita R. Das Department of Computer Science and Engineering The Pennsylvania

More information

A Novel Energy Efficient Source Routing for Mesh NoCs

A Novel Energy Efficient Source Routing for Mesh NoCs 2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony

More information

VLPW: THE VERY LONG PACKET WINDOW ARCHITECTURE FOR HIGH THROUGHPUT NETWORK-ON-CHIP ROUTER DESIGNS

VLPW: THE VERY LONG PACKET WINDOW ARCHITECTURE FOR HIGH THROUGHPUT NETWORK-ON-CHIP ROUTER DESIGNS VLPW: THE VERY LONG PACKET WINDOW ARCHITECTURE FOR HIGH THROUGHPUT NETWORK-ON-CHIP ROUTER DESIGNS A Thesis by HAIYIN GU Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment

More information

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Routing Algorithm How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Many routing algorithms exist 1) Arithmetic 2) Source-based 3) Table lookup

More information

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms CS252 Graduate Computer Architecture Lecture 16 Multiprocessor Networks (con t) March 14 th, 212 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

EE 6900: Interconnection Networks for HPC Systems Fall 2016

EE 6900: Interconnection Networks for HPC Systems Fall 2016 EE 6900: Interconnection Networks for HPC Systems Fall 2016 Avinash Karanth Kodi School of Electrical Engineering and Computer Science Ohio University Athens, OH 45701 Email: kodi@ohio.edu 1 Acknowledgement:

More information

Latency Criticality Aware On-Chip Communication

Latency Criticality Aware On-Chip Communication Latency Criticality Aware On-Chip Communication Zheng Li, Jie Wu, Li Shang, Robert P. Dick, and Yihe Sun Tsinghua National Laboratory for Information Science and Technology, Inst. of Microelectronics,

More information

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip Anh T. Tran and Bevan M. Baas Department of Electrical and Computer Engineering University of California - Davis, USA {anhtr,

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information