ES1 An Introduction to On-chip Networks


1 December 17th, 2015 ES1 An Introduction to On-chip Networks Davide Zoni PhD mail: webpage: home.dei.polimi.it/zoni

2 Sources
Main reference book (for the examination):
- Designing Network-on-Chip Architectures in the Nanoscale Era, José Flich and Davide Bertozzi, Chapters 2 and 3
Additional references:
- Timothy M. Pinkston, University of Southern California
- On-Chip Networks, Natalie E. Jerger and Li-Shiuan Peh
- Principles and Practices of Interconnection Networks, William J. Dally and Brian Towles
- Chita Das webpage


3 On-chip Networks for shared memory multicores
- Nomenclature and topology
- Cache implications
- Router microarchitecture: baseline model, optimizations
- Metrics: power, performance
[Figure: 4x4 grid of processing elements (PE)]

4 What about an interconnection network?
- An interconnection network is a programmable system that transports data between terminals
- Technology: an interconnection network helps use scarce resources efficiently
- Application: managing communication can be critical to performance

5 Why networks? (again)
- Why NoCs, if they are so difficult to design?
- Increasing number of cores inside a single chip
- Reliability, flexibility, scalability, etc.

6 Memory Model in CMPs
- Message passing
  - Explicit movement of data between nodes and address spaces
  - Programmers manage communication
- Shared memory
  - Communication occurs implicitly through loads/stores and accessing instructions
- We will focus on shared memory
- We will look at optimizations for cache coherence protocols

7 Memory Model in CMPs
- Logically: all processors access some shared memory
- Practically: cache hierarchies reduce access latency to improve performance
- This requires a cache coherence protocol to maintain a coherent view in the presence of multiple shared copies
- Consistency model: the behaviour of the memory model in a multi-core environment, i.e. what is allowed and what is not allowed
- Coherence: hides the cache hierarchy from the programmer (without losing the performance improvement)

8 Tiled multi-core architecture with shared memory Source: Natalie Jerger, ACACES Summer School,

9 Coherence Protocol on Network Performance
- The coherence protocol shapes the communication needed by the system
- Single-writer, multiple-reader invariant
- Requires:
  - Data requests
  - Data responses
  - Coherence permissions/forwards/acks
Suggested reading for a quick review of coherence: A Primer on Memory Consistency and Cache Coherence, Daniel Sorin, Mark Hill and David Wood. Morgan & Claypool Publishers

10 Hardware cache coherence
- Rough goal: all caches have the same data at all times
- Minimal flushing and maximum caching give the best performance
- Two solutions:
  - Broadcast-based protocol: all processors see all requests at the same time and in the same order. Often relies on a bus, but can also broadcast on an unordered interconnect
  - Directory-based protocol: the ordering of requests relies on a mechanism other than the bus. Possibly better flexibility and scalability, possibly higher latency

11 Scalable Cache Coherence Source: Natalie Jerger, ACACES Summer School,

12 Coherence Protocol Requirements
- Different message types: unicast, multicast, broadcast
- Directory protocol
  - Majority of requests: unicast
  - Lower bandwidth demands on the network
  - More scalable due to point-to-point communication
- Broadcast protocol
  - Majority of requests: broadcast
  - Higher bandwidth demands
  - Often relies on network ordering

13 Impact of Cache Hierarchy
- Injection/ejection port shared among cores and caches
- Caches reduce average memory latency
- Private caches
  - Multiple L2 copies
  - Data can be replicated to be close to the processor
- Shared caches
  - Data can only exist in one L2 bank
  - Addresses striped across banks (lots of different ways to do this)
  - Aside: lots of research on cache block placement, replication and migration
- Caches serve as a filter for interconnect traffic
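
The striping of addresses across shared L2 banks can be illustrated with one of the simplest possible mappings: interleaving consecutive cache blocks across banks. This is only a sketch of one scheme among the many the slide alludes to; the block size, the bank count and the function name `home_bank` are assumptions made for the example.

```python
BLOCK_BYTES = 64  # assumed cache block size
NUM_BANKS = 16    # assumed number of shared L2 banks (e.g. one per tile)


def home_bank(address: int, block_bytes: int = BLOCK_BYTES,
              num_banks: int = NUM_BANKS) -> int:
    """Return the L2 bank that holds the block containing `address`.

    Consecutive blocks are interleaved across banks (block-level striping).
    """
    block_number = address // block_bytes
    return block_number % num_banks


if __name__ == "__main__":
    for addr in (0x0000, 0x0040, 0x0080, 0x0400):
        print(hex(addr), "-> bank", home_bank(addr))
```

With this mapping every address has exactly one home bank, so a miss in the local L1 turns into a network request to that bank unless the bank happens to be local.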

14 On-chip Network: Private L2 Cache Hit
[Figure: a tile with core, L1 I/D cache, private L2 cache (tags, data, controller), router logic and memory controller. Steps: 1 core issues LD A; 2 L1 miss on A; 3 private L2 hit on A.]
Source: Chita Das, ACACES Summer School

15 On-chip Network: Private L2 Cache Miss
[Figure: a tile with core, L1 I/D cache, private L2 cache (tags, data, controller), router logic and memory controller, plus off-chip memory. Steps recoverable from the figure: 1 core issues LD A; 2 L1 miss on A; 3 private L2 miss on A; 4 format message to memory controller; request sent off-chip; 6 data received, sent to L2.]
Source: Chita Das, ACACES Summer School

16 On-chip Network: Shared L2 Local Cache Miss
[Figure: two tiles, each with core, L1 I/D cache, shared L2 cache bank (tags, data, controller) and router logic, connected on-chip, plus a memory controller. Steps: 1 core issues LD A; 2 L1 miss on A; 3 format request message and send it to the L2 bank that A maps to; 4 receive message and send it to L2; 5 L2 hit; 6 send data to requestor; 7 receive data, send to L1 and core.]
Source: Chita Das, ACACES Summer School, 2011

17 Network-on-Chip details

18 Topology nomenclature 1
- Two broad classes: direct and indirect networks
- Direct networks: every node is both a terminal and a switch
  - Examples: mesh, torus, k-ary n-cubes
- Indirect networks: the network is basically composed of switches that connect the end nodes
  - Examples: MIN, crossbar, etc.
[Figure: a direct and an indirect network. Source: Natalie Jerger, ACACES Summer School, 2012]

19 Topology abstract metrics 1
- Switch degree: number of links/edges incident on a node
- Proxy for estimating cost
- Higher degree requires more links and a higher port count at each router
[Figure: three example topologies with node degree 2; 2, 3, 4; and 4. Source: Natalie Jerger, ACACES Summer School]
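
Switch degree can be checked mechanically on small graphs. The sketch below assumes the three example topologies are a 9-node ring, a 3x3 mesh and a 3x3 torus (the slide text does not name them); the helper names `ring`, `grid` and `degrees` are made up for the example.

```python
from itertools import product


def ring(n):
    """Adjacency of an n-node bidirectional ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(k, wrap):
    """Adjacency of a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        neighbours = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                neighbours.add((nx % k, ny % k))
            elif 0 <= nx < k and 0 <= ny < k:
                neighbours.add((nx, ny))
        adj[(x, y)] = neighbours
    return adj


def degrees(adj):
    """Sorted list of the distinct node degrees found in the topology."""
    return sorted({len(neighbours) for neighbours in adj.values()})


if __name__ == "__main__":
    print("ring :", degrees(ring(9)))
    print("mesh :", degrees(grid(3, wrap=False)))
    print("torus:", degrees(grid(3, wrap=True)))
```

Under these assumptions the ring has degree 2 everywhere, the mesh has degree 2, 3 or 4 depending on whether a node is a corner, an edge or the centre, and the torus has degree 4 everywhere.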

20 Topology abstract metrics 2
- Hop count: number of hops a message takes from source to destination
- Proxy for network latency
- Every node and link incurs some propagation delay, even without contention
- Network diameter: largest minimum hop count in the network
- Average minimum hop count: average across all source/destination pairs
- Minimal hop count: smallest hop count connecting two nodes
- An implementation may incorporate non-minimal paths (which increases the average hop count)
[Figure: three example topologies labelled Max=4 Avg=2.2; Max=4 Avg=1.77; Max=2 Avg=1.33. Source: Natalie Jerger, ACACES Summer School, 2012]
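
Diameter and average minimum hop count can be computed with a breadth-first search from every node. The sketch assumes the three figures are a 9-node ring, a 3x3 mesh and a 3x3 torus, and that the average is taken over all ordered source/destination pairs including a node paired with itself; both are assumptions, chosen because they give values that agree with the figure labels to within the last printed digit. All function names are illustrative.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency of an n-node bidirectional ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(k, wrap):
    """Adjacency of a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        neighbours = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                neighbours.add((nx % k, ny % k))
            elif 0 <= nx < k and 0 <= ny < k:
                neighbours.add((nx, ny))
        adj[(x, y)] = neighbours
    return adj


def min_hops_from(adj, src):
    """Breadth-first search: minimal hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter_and_average(adj):
    """Return (network diameter, average minimum hop count)."""
    total, diameter = 0, 0
    for src in adj:
        dist = min_hops_from(adj, src)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    # Assumption: average over all N*N ordered pairs, self-pairs included.
    return diameter, total / (len(adj) ** 2)


if __name__ == "__main__":
    for name, adj in (("ring", ring(9)),
                      ("mesh", grid(3, wrap=False)),
                      ("torus", grid(3, wrap=True))):
        print(name, diameter_and_average(adj))
```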

21 Topology abstract metrics implications
- Abstract metrics are just proxies: they do not always correlate with the real metric they represent
- Example: Network A with 2 hops, a 5-stage pipeline and 4-cycle link traversal vs. Network B with 3 hops, a 1-stage pipeline and 1-cycle link traversal
  - Hop count says A is better than B
  - But A has an 18-cycle latency vs. a 6-cycle latency for B
- Topologies typically trade off hop count and node degree
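
The arithmetic behind the A versus B example can be reproduced by charging each hop one router pipeline traversal plus one link traversal. This is a minimal sketch that ignores contention and serialization; the function name `no_load_latency` is chosen for the example.

```python
def no_load_latency(hops: int, pipeline_stages: int, link_cycles: int) -> int:
    """Cycles to cross `hops` hops when each hop costs router + link cycles."""
    return hops * (pipeline_stages + link_cycles)


if __name__ == "__main__":
    network_a = no_load_latency(hops=2, pipeline_stages=5, link_cycles=4)
    network_b = no_load_latency(hops=3, pipeline_stages=1, link_cycles=1)
    print("A:", network_a, "cycles")  # 2 * (5 + 4)
    print("B:", network_b, "cycles")  # 3 * (1 + 1)
```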

22 Traffic patterns: How to stress a NoC
- Synthetic traffic patterns
  - Uniform random, matrix transpose, hot spot
  - Many others based on probabilistic distributions and pattern selection algorithms
  - PROS: fast analysis, corner-case evaluation, generation of future traffic patterns
  - CONS: the traffic may not reflect real workloads
- Real traffic patterns
  - Real benchmarks executed on the simulated architecture
  - Complete evaluation of the system performance
  - PROS: the data collected come from real scenarios
  - CONS: time-consuming simulations; no scenarios beyond those provided by the benchmark suite used

23 Routing, Arbitration, and Switching
- Routing: defines the allowed path(s) for each packet (which paths?)
  - Problems: livelock and deadlock
- Arbitration: determines the use of paths supplied to packets (when allocated?)
  - Problems: starvation
- Switching: establishes the connection of paths for packets (how allocated?)
  - Switching techniques: circuit switching, packet switching
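
As an illustration of how a routing function restricts the allowed paths, the sketch below uses dimension-order (XY) routing on a 2D mesh. The slide does not name any specific algorithm, so XY routing, the coordinate convention and the function name `xy_route` are assumptions introduced only for the example.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates of a router in a 2D mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Return the single path allowed by XY routing: first along X, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:              # correct the X offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then correct the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((2, 2), (0, 2)))
```

Because every packet takes all its X hops before any Y hop, the path is deterministic and minimal; on a mesh this ordering is a common way of avoiding routing deadlock.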

24 Until now old wine in a new bottle...but for caches
- Deadlock, packets, routing algorithm, flow control, router/switch, throughput, latency
- Where is the difference?

25 Until now old wine in a new bottle...but for caches
On-chip network criticalities:
- Low power
- Limited resources
- High performance
- High reliability
- Thermal issues

26 Network-on-Chip: router architecture

27 NoC granularity overview
- Messages: composed of one or more packets (NOTE: if the message size does not exceed the maximum packet size, only one packet is created)
- Packets: composed of one or more flits
- Flit: flow control digit
- Phit: physical digit (subdivides a flit into chunks equal to the link width)
- Off-chip: channel width limited by pins
- On-chip: abundant wiring means phit size == flit size
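
The message, packet, flit and phit hierarchy can be sketched as successive chunking steps. The sizes, the `Flit` type and the helper names are assumptions made for the example, and real head flits also carry routing information, which is omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str       # "head", "body", "tail", or "head_tail" for a 1-flit packet
    payload: bytes


def packetize(message: bytes, max_packet_bytes: int) -> List[bytes]:
    """Split a message into packets of at most max_packet_bytes each."""
    return [message[i:i + max_packet_bytes]
            for i in range(0, len(message), max_packet_bytes)] or [b""]


def flitize(packet: bytes, flit_bytes: int) -> List[Flit]:
    """Split a packet into flits and tag each one as head, body or tail."""
    chunks = [packet[i:i + flit_bytes]
              for i in range(0, len(packet), flit_bytes)] or [b""]
    flits = []
    for index, chunk in enumerate(chunks):
        if len(chunks) == 1:
            kind = "head_tail"
        elif index == 0:
            kind = "head"
        elif index == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append(Flit(kind, chunk))
    return flits


def phits_per_flit(flit_bits: int, link_bits: int) -> int:
    """Number of link-width chunks (phits) needed to move one flit."""
    return -(-flit_bits // link_bits)  # ceiling division


if __name__ == "__main__":
    packets = packetize(bytes(64), max_packet_bytes=32)
    print(len(packets), "packets")
    print([f.kind for f in flitize(packets[0], flit_bytes=8)])
    print("on-chip :", phits_per_flit(128, 128), "phit per flit")
    print("off-chip:", phits_per_flit(128, 32), "phits per flit")
```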

28 NoC microarchitecture based on granularity
- Message-based: allocation made at message granularity
  - Circuit switching
- Packet-based: allocation made to whole packets (off-chip)
  - Store and forward (SaF): large latency and large buffers required
  - Virtual cut-through (VCT): improves on SaF, but still needs large buffers and has high latency
- Flit-based: allocation made on a flit-by-flit basis (on-chip)
  - Wormhole: efficient buffer utilization and low latency, but suffers from head-of-line (HoL) blocking
  - Virtual channels: primarily used to tackle deadlock, then to tackle HoL blocking

29 Network-on-Chip: wormhole and wormhole+VCs

30 Switch/Router Wormhole Microarchitecture
- Flit-based, i.e. packets are divided into flits
- Pipelined in 4 stages (BW, RC, SA, ST), followed by link traversal (LT)
- Buffers organized on a flit basis
- Single buffer per port
- Buffer state fields:
  - G: idle, routing, active, waiting
  - R: output port (route)
  - C: credit count
  - P: pointers to data

31 Switch/Router Virtual Channel Microarchitecture

32 Router components
- Input buffers, route computation logic, virtual channel allocator, switch allocator, crossbar switch
- Most OCN routers are input buffered
  - They use single-ported memories
- Buffers store flits for their duration in the router
  - Contrast with a processor pipeline, which latches between stages
- Basic router pipeline (canonical 5-stage pipeline, followed by link traversal)
  - BW: Buffer Write
  - RC: Routing Computation
  - VA: Virtual Channel Allocation
  - SA: Switch Allocation
  - ST: Switch Traversal
  - LT: Link Traversal

33 Router components
- Routing computation is performed once per packet
- A virtual channel is allocated once per packet
- Body and tail flits inherit this information from the head flit
- Router performance: baseline (no-load) delay = (5 cycles + link delay) x hop count + t_serialization
- How to reduce latency?
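
The baseline no-load delay can be evaluated numerically. The sketch assumes that each hop costs the 5 router cycles plus the link delay, and that the serialization term is the packet length divided by the channel width, rounded up; the parameter names and the 1-cycle default link delay are assumptions for the example.

```python
import math


def zero_load_latency(hops: int, packet_bits: int, channel_bits: int,
                      router_cycles: int = 5, link_cycles: int = 1) -> int:
    """No-load latency in cycles: per-hop delay times hops, plus serialization."""
    # Cycles needed to push the whole packet through one channel.
    serialization = math.ceil(packet_bits / channel_bits)
    return hops * (router_cycles + link_cycles) + serialization


if __name__ == "__main__":
    # A 512-bit packet over 128-bit channels, 3 hops away.
    print(zero_load_latency(hops=3, packet_bits=512, channel_bits=128))
    # Same packet with a hypothetical 2-cycle router pipeline.
    print(zero_load_latency(hops=3, packet_bits=512, channel_bits=128,
                            router_cycles=2))
```

The second call shows why shortening the router pipeline matters: the per-hop term is multiplied by the hop count, while serialization is paid once.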

34 Pipeline optimization: Actual Baseline=BW+RC
- Pipeline: BW+RC, VA, SA, ST, LT
- Usually RC happens in the BW stage
- A single RC unit is required per input port
- This is fast, since BW is not the critical stage
- Routing computation is needed at the next hop
- It can be computed in parallel with VA

35 Pipeline optimization: Speculation
- Pipeline: BW, RC, VA and SA in parallel, ST, LT
- Assume that the virtual channel allocation stage will be successful
  - Valid under low to moderate loads
- Entire VA and SA performed in parallel
- If VA is unsuccessful (no virtual channel returned), VA/SA must be repeated in the next cycle
- Prioritize non-speculative requests

36 Pipeline optimization: Speculation + LRC
- Pipeline: BW, LRC, VA, SA, ST, LT
- Lookahead Route Computation (LRC): the output port is computed in the previous node
- 4 actions are performed in parallel
- The LRC stage does not actually compute the route; it only reads the precomputed one from the head flit


37 Router Pipeline: module dependencies
- Dependence between the output of one module and the input of another
- Determines the critical path through the router
- Cannot bid for a switch port until routing is performed
Source: Li-Shiuan Peh and William J. Dally, A Delay Model and Speculative Architecture for Pipelined Routers

38 Router Pipeline: delay model
Source: Li-Shiuan Peh and William J. Dally, A Delay Model and Speculative Architecture for Pipelined Routers

39 Switch/Router Flow Control
- Flow control determines how network resources, such as bandwidth, buffer capacity and control state, are allocated to packets that are traversing the NoC
- Resource allocation problem: from the resources' point of view
- Contention resolution: from the packets' point of view
- Bufferless, buffered

40 Switch/Router Buffered Flow Control
- Buffers give more flexibility, with the possibility to decouple resource allocation in steps
- Two modes
  - Wormhole flow control
  - Virtual channel flow control
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

41 Switch/Router Buffered Wormhole Flow Control
- Allocation on a per-flit basis
- More efficient in buffer consumption
- Head-of-line (HoL) blocking issues
- Buffered solutions allow resource allocation to be decoupled
- Figure legend: U upper outport, L lower outport; input port states (I, W, A) = (idle, waiting, allocated); flits (H, B, T) = (head, body, tail)
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

42 Switch/Router Virtual Channel Flow Control
- Multiple buffers on the same input port
- Need for a state on each virtual channel
- More complex to manage than wormhole
- Allows different flows to be managed at the same time
- Solves the HoL issues
- Deadlock avoidance property
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

43 Why Virtual Channels (VCs): Head-of-Line Blocking
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

44 Buffer Management and Backpressure
How to manage buffers between neighbors (i.e. how can I know that the downstream router's buffer is full?). Three ways:
- Credit-based
  - The upstream router keeps track of the flit slots available in the downstream router
  - The upstream router decrements its counter when it sends a flit; the downstream router sends a credit backward, incrementing the counter, when a flit leaves it
  - Accurate, fine-grained flow control, but a lot of messages
- On/off
  - Threshold mechanism with a single-bit, low-overhead signal that tells the upstream router whether it may send
- Ack/nack
  - No state in the upstream node
  - Sends and waits for ack/nack, no net gain
  - Waste of bandwidth: sends without any guarantee of an ack

45 Credit-based flow control
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
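
A minimal sketch of credit-based flow control on a single link, assuming credits travel back with zero delay (real links add a round-trip latency, which is what forces deeper buffers). The class name `CreditLink` and its methods are hypothetical names for the example.

```python
from collections import deque


class CreditLink:
    """One upstream router feeding one downstream input buffer."""

    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots   # upstream view of free downstream slots
        self.downstream = deque()     # downstream input buffer

    def send(self, flit) -> bool:
        """Upstream side: send only when at least one credit is available."""
        if self.credits == 0:
            return False              # backpressure: the flit must wait
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain(self):
        """Downstream side: a flit leaves the buffer, a credit goes back."""
        if not self.downstream:
            return None
        flit = self.downstream.popleft()
        self.credits += 1             # assumption: credit returns instantly
        return flit


if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    print([link.send(f) for f in ("H", "B", "T")])  # third send is refused
    print(link.drain(), link.send("T"))             # slot freed, send succeeds
```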

46 On-off flow control
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
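
A minimal sketch of on/off flow control, assuming the single-bit signal reaches the upstream router with zero delay. The threshold parameters `off_at` and `on_at` (expressed in free slots) and the class name `OnOffBuffer` are assumptions for the example.

```python
from collections import deque


class OnOffBuffer:
    """Downstream buffer that drives a single on/off bit towards upstream."""

    def __init__(self, slots: int, off_at: int, on_at: int):
        if not 0 <= off_at < on_at <= slots:
            raise ValueError("need 0 <= off_at < on_at <= slots")
        self.slots, self.off_at, self.on_at = slots, off_at, on_at
        self.queue = deque()
        self.on = True                # the one-bit signal seen upstream

    def free(self) -> int:
        return self.slots - len(self.queue)

    def accept(self, flit) -> bool:
        """Upstream may send only while the signal is on."""
        if not self.on:
            return False
        self.queue.append(flit)
        if self.free() <= self.off_at:
            self.on = False           # few slots left: signal OFF
        return True

    def release(self):
        """A flit leaves; switch back ON once enough slots are free."""
        if not self.queue:
            return None
        flit = self.queue.popleft()
        if self.free() >= self.on_at:
            self.on = True
        return flit


if __name__ == "__main__":
    buf = OnOffBuffer(slots=4, off_at=1, on_at=3)
    print([buf.accept(f) for f in "abcd"])  # the fourth flit is refused
    buf.release()
    print(buf.on)                           # still off: only 2 slots free
    buf.release()
    print(buf.on)                           # back on: 3 slots free
```

Compared with credits, the upstream router only tracks one bit, and a signal is sent only when a threshold is crossed rather than once per flit.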

47 Ack-nack flow control
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

48 NoCs: some implementation details
From: Designing Network-on-Chip Architectures in the Nanoscale Era, José Flich and Davide Bertozzi, 2011

49 Buffer structure and Crossbar

50 Input-first Switch Allocator

51 Input-first VC allocator

52 EXTRA


53 Evaluation metrics for NoCs
- Performance
  - Network-centric: latency, throughput
  - Application-centric: system throughput (weighted speedup), application throughput (IPC)
- Power/Energy: Watts/Joules, Energy-Delay Product (EDP)
- Fault tolerance: process variation/reliability
- Thermal: temperature
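
Two of the metrics above have simple closed forms. The sketch uses the usual definitions, which the slide names but does not spell out: weighted speedup as the sum over applications of IPC when sharing the system divided by IPC when running alone, and EDP as energy multiplied by delay. The function names and the sample numbers are made up for the example.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float],
                     ipc_alone: Sequence[float]) -> float:
    """Sum over applications of IPC(shared) / IPC(alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("one IPC pair is needed per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """EDP in joule-seconds; lower is better."""
    return energy_joules * delay_seconds


if __name__ == "__main__":
    # Two applications, each running at half its stand-alone IPC.
    print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
    # Hypothetical run: 2 J consumed over 3 s.
    print(energy_delay_product(2.0, 3.0))
```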

54 Network-on-Chip power consumption
Network power breakdown:
- Buffer power, crossbar power and link power are comparable
- Arbiter power is negligible
Source: Chita Das, ACACES summer school

55 Switch/Router Bufferless Flow Control
- No buffers
- Allocate channels and bandwidth to competing packets
- Two modes
  - Dropping flow control
  - Circuit switching flow control
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

56 Bufferless Dropping Flow Control 1
- Simplest form of flow control
- Allocates channel and bandwidth to competing packets
- In case of collisions, packets are dropped
- A collision may or may not be signaled using ack/nack messages
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

57 Bufferless Dropping Flow Control 2
- With no ack messages, the only viable way to detect a drop is a timeout timer
- Ack messages can reduce latency
Source: William Dally and Brian Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


58 Bufferless Circuit switching Flow Control 1
- It allocates all the needed resources before sending the message
- When no further packets must be sent, the circuit is deallocated
- The head flit arbitrates for resources; if it is stalled, no resend is needed
Source: William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
