Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals
Philipp Gorski, Tim Wegner, Dirk Timmermann
University of Rostock, Germany
philipp.gorski@googlemail.com & tim.wegner@uni-rostock.de

Outline
- Fundamentals and trends
  - Chip-multiprocessor
  - Network-on-Chip
- Quadrant-based mesh (QMesh)
  - Overview
  - Experimental results
- Conclusion / future work

Fundamentals and trends - CMPs
Modern Chip-Multiprocessors (CMPs)
- Modular design
- High degree of parallelization (thread/core level)
- Challenge: efficient on-chip communication despite rising core count
Figure: on-chip components
- Computation: GPP, GPU, DSP, system I/O
- Communication infrastructure: P2P, busses, crossbars, on-chip networks
- Memory, local (per IP): L1/L2 cache, regs, FIFO, I/O
- Memory, global: L3/L4 cache, eDRAM, I/O
- External memory

Fundamentals and trends - CMPs
Key trends and issues
- Technology scaling: utilization wall; power, temperature; bandwidth (BW), delay; dark silicon; PVT variation increases; reliability
- Architecture: number of IP cores increases; heterogeneous; on-chip memory; on-chip communication dominant
- Workload: application diversity; multiple domains; latency: BW, comp.; virtualization; interferences; memory access; communication
Starting point: on-chip communication infrastructure (here: Networks-on-Chip)
- Vital link between computation and storage

Fundamentals and trends - NoCs
Networks-on-Chip (NoCs): approach for scalable on-chip communication
- Managed wire segments
- Packet-based communication
- Globally asynchronous, locally synchronous (GALS)
- Topology: global interconnection between components
Figure: 4x4 2D-mesh, tiles addressed by (x,y) from (0,0) to (3,3). Legend: router (R), link, network interface (NI); each tile is a voltage/frequency island.

Fundamentals and trends - NoCs
Paths and routing
- Path: E2E packet route through the NoC (start/end: NI of SRC/DST)
- Determined by the routing algorithm
- Path length (PL): number of traversed routers (hops)
Dimension-ordered routing: XY or YX
- Minimal PL
- Deterministic
- Dead-/livelock-free
- Minimal HW effort
- Non-adaptive
Figure: 4x4 2D-mesh with a path from (x_SRC, y_SRC) to (x_DST, y_DST).

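
A minimal Python sketch of the XY dimension-ordered routing described on this slide. The function names `xy_route` and `path_length` are hypothetical, and the path length follows the slide's definition (number of traversed routers, with source and destination routers both counted, which is an assumption about how the endpoints are treated).

```python
def xy_route(src, dst):
    """Return the routers visited by XY routing, source and destination included."""
    x, y = src
    x_dst, y_dst = dst
    path = [(x, y)]
    # Resolve the X offset completely before touching Y (dimension order).
    while x != x_dst:
        x += 1 if x_dst > x else -1
        path.append((x, y))
    while y != y_dst:
        y += 1 if y_dst > y else -1
        path.append((x, y))
    return path


def path_length(src, dst):
    """Path length as the number of traversed routers (assumed to include both ends)."""
    return len(xy_route(src, dst))


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(path_length((0, 0), (3, 3)))
```

Because the route depends only on the source and destination coordinates, the same pair always yields the same path, which is the deterministic, non-adaptive behavior listed above.
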
QMesh - overview
Idea: improve IP core connectivity
- Increase the number of NIs per IP core (Q0 to Q3)
- Connect each core to all surrounding routers
- QMesh: quadrant-based mesh + XY routing
Figure: IP core with quadrants Q3 (upper left), Q0 (upper right), Q2 (lower left), Q1 (lower right); router R with West, North, South and East ports.

QMesh - overview
QMesh characteristics
- Preservation of basic 2D-mesh structure
- Dual-path routing
- Spatially independent paths
Required modifications / additional HW costs
- 8-ported router
- 4 NIs per IP core
- 1 programmable Path Table (PT) per IP core
- 4 bit addressing extension (for Q_in and Q_out)
Advantages
- Costs comparable to 2D-mesh with XY/YX routing
- Reduced average path length
- Mitigation of traffic interferences
- Increased traffic locality
- Benefits of 2D-mesh maintained (e.g. deterministic routing)
Figure: QMesh tile (processing element, CX, PT, NI Q0 to NI Q3) and router ports Left, Up, Down, Right, Q0 to Q3. Each PT entry holds 4 bits (Q_in, Q_out); quadrant encoding: Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01.

QMesh - overview
Figure, left panel: DST in the same row (L, R) or column (U, D) as SRC
Figure, right panel: DST in one of the quadrants (Q0 to Q3) of SRC
- L, R, U, D: path length reduced by 1 hop (compared to 2D-mesh with XY routing)
- Q0 to Q3: path length reduced by up to 2 hops (worst case: same as 2D-mesh)

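
The hop savings stated on this slide can be reproduced with a small geometric model. This is a sketch under assumptions that are not taken from the slides: routers are placed at the four corners of each tile, tile (x, y) owns the corner routers (x, y) to (x+1, y+1), and the shortest corner-to-corner option is always chosen. All function names are hypothetical.

```python
from itertools import product


def mesh_hops(src, dst):
    """Routers traversed in a 2D-mesh with XY routing (one router per tile)."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1]) + 1


def corner_routers(tile):
    """Assumed model: the four routers surrounding a tile sit at its corners."""
    x, y = tile
    return [(x + i, y + j) for i, j in product((0, 1), repeat=2)]


def qmesh_hops(src, dst):
    """Fewest routers traversed over all source/destination corner choices."""
    return min(
        abs(a[0] - b[0]) + abs(a[1] - b[1]) + 1
        for a in corner_routers(src)
        for b in corner_routers(dst)
    )


if __name__ == "__main__":
    for dst in [(4, 1), (3, 4)]:  # same row, then quadrant case
        print(dst, mesh_hops((1, 1), dst), qmesh_hops((1, 1), dst))
```

The model only yields the best case (1 hop saved in the same row or column, 2 hops saved in a quadrant). It does not model the second, spatially independent path or the worst case mentioned above.
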
QMesh experimental results
Parameters
- Simulator: SystemC-based, cycle-accurate
- Topologies: 2D-mesh and QMesh (XY routing), 8x8 NoC
- Link width: 64 bit
- NoC frequency: 1 GHz
- Router: FIFO depth 9 flits
Synthetic traffic patterns
- Single-threaded applications: bit complement, bit reverse, transpose, shuffle
- Multi-threaded applications: nearest neighbor, hotspot, Rentian
Evaluated parameters
- Network saturation margin
- Packet delay
- Power overhead
- Reliability: wear-out acceleration factor a_MTTF
- Robustness: E2E connectivity, number of reachable resources

QMesh experimental results
- Δ SAT: relative improvement of the network saturation margin
- Due to hop reduction and dual-path options

QMesh experimental results
- Reduction of packet delay (Δ DELAY) compensates the power overhead (Δ POWER)
- Locality/fewer hops -> reduced traffic -> lower router/link activity -> reduced P_dyn
- Δ POWER lower than expected (~100%)

QMesh experimental results
Reliability
- Evaluation via the acceleration factor of the Mean-Time-To-Failure (MTTF): $a_{MTTF}$
- Wear-out increase: $a_{MTTF} < 1$
- Wear-out decrease: $a_{MTTF} > 1$

$$a_{MTTF} = \frac{t_{QMesh}}{t_{2D\text{-}mesh}} = e^{\frac{E_a}{k}\left(\frac{1}{T_{QMesh}} - \frac{1}{T_{2D\text{-}mesh}}\right)}$$

- $t_{QMesh}$, $t_{2D\text{-}mesh}$: MTTF of QMesh/2D-mesh
- $T_{QMesh}$, $T_{2D\text{-}mesh}$: avg. router temperature for QMesh/2D-mesh
- $k$: Boltzmann's constant ($8.6 \cdot 10^{-5}$ eV/K)
- $E_a$: activation energy of the CMOS devices (here: 0.7 eV at 45 nm CMOS)

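
The acceleration factor on this slide can be evaluated numerically as follows. This is a minimal sketch: the constants are the ones given on the slide, while the function names and the example temperatures are hypothetical and not measured results. Temperatures must be absolute (Kelvin), so a helper for converting from degrees Celsius is included.

```python
import math

K_BOLTZMANN_EV_PER_K = 8.6e-5  # Boltzmann's constant as given on the slide
E_A_EV = 0.7                   # activation energy at 45 nm CMOS (from the slide)


def celsius_to_kelvin(t_celsius):
    return t_celsius + 273.15


def a_mttf(t_qmesh_k, t_mesh_k, e_a=E_A_EV, k=K_BOLTZMANN_EV_PER_K):
    """MTTF ratio t_QMesh / t_2D-mesh from average router temperatures in Kelvin."""
    return math.exp((e_a / k) * (1.0 / t_qmesh_k - 1.0 / t_mesh_k))


if __name__ == "__main__":
    # Hypothetical temperatures, chosen only to illustrate the direction of the effect.
    cooler = celsius_to_kelvin(75.0)
    hotter = celsius_to_kelvin(80.0)
    print(a_mttf(cooler, hotter))  # > 1: wear-out decrease
    print(a_mttf(hotter, cooler))  # < 1: wear-out increase
```

A lower average router temperature for QMesh makes the exponent positive, so the factor exceeds 1, matching the "wear-out decrease" case above.
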
QMesh experimental results
- General wear-out decrease through QMesh
- Mean router lifetime increase: 10% for low packet injection rate (PIR), 60% for high PIR (avg. ~35%)

QMesh experimental results
Robustness
- Degradation of E2E connectivity?
- Injection of random router faults (uniform distribution)

QMesh experimental results
Robustness
- Number of reachable resources?
- Injection of random router faults (uniform distribution)

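
The fault-injection experiment can be illustrated for the single-path baseline. The sketch below is not the authors' simulator. It assumes a plain 2D-mesh with XY routing, picks faulty routers uniformly at random, and counts a source/destination pair as connected only if no router on its XY path is faulty. The function names and the `seed` parameter are hypothetical.

```python
import random
from itertools import product


def xy_path(src, dst):
    """Routers on the XY route from src to dst, both ends included."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def e2e_connectivity(n, num_faults, seed=0):
    """Fraction of ordered SRC/DST pairs whose XY path avoids all faulty routers."""
    rng = random.Random(seed)
    routers = list(product(range(n), repeat=2))
    faulty = set(rng.sample(routers, num_faults))  # uniform random router faults
    pairs = [(s, d) for s in routers for d in routers if s != d]
    connected = sum(1 for s, d in pairs if faulty.isdisjoint(xy_path(s, d)))
    return connected / len(pairs)


if __name__ == "__main__":
    for faults in (0, 2, 4, 8):
        print(faults, e2e_connectivity(8, faults, seed=1))
```

With deterministic single-path routing every fault on a path disconnects the pair. A QMesh model would additionally try the second, spatially independent path before declaring a pair unreachable.
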
Conclusion / future work
- Modern CMPs require an efficient architecture for on-chip communication
- NoCs provide an appropriate infrastructure
- QMesh topology: integration of multiple NIs per IP core to improve connectivity
- Preservation of basic NoC structure and associated benefits
- Improvements over standard 2D-mesh
  - Increase of network saturation margin
  - Reduction of avg. packet delay
  - Reliability: increased router lifetime due to lower max. temperatures
  - Higher robustness due to dual-path routing (spatially independent)
  - Tolerable costs
- Next step: full HW/SW integration of traffic monitoring and path reconfiguration

Thank you for your attention! Questions?

QMesh overview
Global view: orthogonal system-level integration (HW/SW)
- Step 1: QMesh (this work): dual-path options, performance, reliability & robustness
- Step 2: Observability: traffic monitoring, extendable & adaptable
- Step 3: Traffic management: path reconfiguration, extendable & adaptable
Figure labels
- Software: traffic evaluation, path adaptation
- Hardware, traffic monitoring: local sensors (links/ports, destinations) at local level; aggregation at cluster level; master-tile
- Hardware, path reconfiguration: local actors (path table); cluster level
- SNoC: QMesh monitoring/control network
- DNoC: QMesh data network
- Tiles grouped into clusters

Network-on-Chip - Packets
De- and packetization
- Message Buffer (MB): read/write messages of workload applications
- Network-Interface (NI): packetization (MB-to-NoC) and depacketization (NoC-to-MB), End-to-End (E2E) flow control between source/destination
- Router/Link: flits, i.e. data words that traverse links and routers
Figure: 2D-mesh tile with PE, L1-I, L1-D, private or shared L2 cache, CX, MB, NI and router R (ports UP, DOWN, LEFT, RIGHT). Messages/transactions (MSG 1 to MSG 4) are split into packets (PKT 3.1 to PKT 3.4), and packets into flits: header (x_dst, y_dst), body, tail.

Network-on-Chip - Router
Standard router architecture for mesh-based topologies
- Five-ported router (N_P = 5)
- FIFO-based input buffers
- Central crossbar
- Control logic: routing (output selection), arbitration (output allocation)
- REQ/ACK flow control
- 4-staged canonical pipeline
- Wormhole switching (WHS)
- Non-interfering concurrent packet traversals if desired outputs differ
Figure: packet worm passing through the router: input buffer FIFOs (UP, RIGHT, DOWN, LEFT, CORE/Q2), crossbar, control logic (routing + arbitration on the packet header), output buffer. Canonical router pipeline (4 stages), labels: link in, routing, arbitration, crossbar, link out.

Network-on-Chip - Link
- Bidirectional point-to-point busses: up-/downlink
- Repeated or buffered wires: wire delay reduction
- Slowest component in the NoC; wire length is a critical design issue
- Rebalancing with router pipelines possible
Figure: uplink and downlink between input and output, link length l_link, control wires and data wires (link width == flit width), wire segments with repeaters/buffers.

QMesh - overview
The 2D-Mesh NoC + XY-DOR has desirable characteristics that should be kept!
- Scalable (N_X/N_Y ++), deterministic and regular (GALS + power), simple and short links
Performance options
- Increase spatial communication locality
- Minimize path length
- Reduce interferences of traffic flows
- Queuing at SRC (backpressure)
- Concurrency along paths in NoC
- Spatial/temporal separation of traffic
Reliability options
- Increase spatial communication locality
- Activity, wear-out
- Add multi-path routing mechanisms
- Circumvent faulty components
- Rebalance activity
- Make paths spatially independent
Figure 1: packet header delay [ns] (log scale) over packet injection rate [packets / IP core / ns]; curves: total packet delay, post-NI network delay, pre-NI queuing delay.
Figure 2: number of destinations over hop distance range (n hops); dual XY/YX path destinations vs. single path destinations (annotation: "Not good!").

QMesh - overview
Additional hardware efforts similar to XY/YX-DOR 2D-Mesh
Required modifications
- Eight-ported router (N_P = 8)
- Four NI terminals
- MB segmentation
- Programmable Path Table (PT)
Addressing extensions
- Input quadrant at SRC: Q_IN
- Output quadrant at router: Q_OUT
- Stored in PT (cf. MB: ~ kbytes)
- 8x8 NoC: PT = 32 bytes
- 16x16 NoC: PT = 128 bytes
PT represents a flexible interface for path adaptation
- Source-based, inter-tile,
Figure: QMesh tile (PE, L1-I, L1-D, CX, MB, PT, NI Q0 to NI Q3, private or shared L2 cache) and router ports LEFT, UP, DOWN, RIGHT, Q0 to Q3. PT entries 0, 1, 2, ... NxN with 4 bits each (Q_IN, Q_OUT); quadrant encoding: Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01.

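
The path table sizes quoted on this slide follow from simple arithmetic if one assumes a single 4-bit entry (2 bits Q_IN, 2 bits Q_OUT) per destination tile. That per-destination layout is an assumption read off the figure, and the function name is hypothetical.

```python
def path_table_bytes(n_x, n_y, bits_per_entry=4):
    """Size of one tile's path table, assuming one entry per destination tile."""
    total_bits = n_x * n_y * bits_per_entry
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for size in (8, 16):
        print(f"{size}x{size} NoC: PT = {path_table_bytes(size, size)} bytes")
```

For the 8x8 case this is 64 entries at 4 bits, i.e. 256 bits or 32 bytes, and for 16x16 it is 256 entries or 128 bytes, which agrees with the figures stated above.
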
QMesh experimental results
Figure: packet delay (PD) over packet injection rate (PIR) for 2D-Mesh and QMesh, with PD limit = 500 ns.
- Comparable region from PIR_LOW to PIR_HIGH (PIR_HIGH = PIR_SAT1): evaluation of Δ DELAY and Δ POWER
- Δ SAT: distance between PIR_SAT1 and PIR_SAT2

QMesh experimental results
Parameter: Value
- Link width: 64 bits / 8 bytes
- Router port buffer depth: 9 flits / 72 bytes
- Topology: 2D-Mesh, CMesh, QMesh
- Switching: Wormhole
- Routing: XY
- CMOS technology: 45 nm
- NoC frequency: 1 GHz
- Packet size (random uniform): 20% with 2 flits / 16 bytes, 80% with 9 flits / 72 bytes
- NoC operation time: 5 ms

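
As a worked example for the packet size mix listed above, the mean packet size is 0.2 * 2 + 0.8 * 9 = 7.6 flits, or 60.8 bytes at 8 bytes per flit. The sketch below only restates this arithmetic; the names `PACKET_MIX` and `mean_packet_flits` are hypothetical.

```python
FLIT_BYTES = 8  # 64-bit link width, one flit per link word

# (share of packets, packet length in flits), taken from the parameter list
PACKET_MIX = [(0.2, 2), (0.8, 9)]


def mean_packet_flits(mix=PACKET_MIX):
    """Expected packet length in flits for the given packet mix."""
    return sum(share * flits for share, flits in mix)


def mean_packet_bytes(mix=PACKET_MIX, flit_bytes=FLIT_BYTES):
    """Expected packet length in bytes."""
    return mean_packet_flits(mix) * flit_bytes


if __name__ == "__main__":
    print(mean_packet_flits(), mean_packet_bytes())
```
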
Quadrant-based Mesh Topology
A deeper look at the parameter progress along the comparable region
- Example: 8x8 NoC with NN pattern at 20% direct neighbor communication
Figure 1: packet delay [ns] and router temperature [°C] over packet injection rate [packets / IP core / ns]; curves: DELAY-2DMESH, DELAY-QMESH, TEMP-2DMESH, TEMP-QMESH; comparable region marked.
Figure 2: relative change [2DMesh-QMesh] of NoC power and packet delay over packet injection rate [packets / IP core / ns]; annotated values: 68%, 32%, -44%, -57%.