Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals


Philipp Gorski, Tim Wegner, Dirk Timmermann
University of Rostock, Germany
philipp.gorski@googlemail.com, tim.wegner@uni-rostock.de

Outline
- Fundamentals and trends
  - Chip-multiprocessor
  - Network-on-Chip
- Quadrant-based mesh (QMesh)
  - Overview
  - Experimental results
- Conclusion / future work

Fundamentals and trends - CMPs
Modern Chip-Multiprocessors (CMPs)
- Modular design
- High degree of parallelization (thread/core level)
- Challenge: efficient on-chip communication despite rising core count
[Figure: on-chip components of a CMP, grouped into computation (GPP, GPU, DSP, system I/O), communication infrastructure (P2P, busses, crossbars, on-chip networks), and memory (local to each IP: L1/L2 cache, regs, FIFO, I/O; global: L3/L4 cache, eDRAM, I/O), plus external memory]

Fundamentals and trends - CMPs
Key trends and issues
- Technology scaling
  - Utilization wall
  - Power / temperature
  - Bandwidth (BW) / delay
  - Dark silicon
  - PVT variation increases
  - Reliability
- Architecture
  - Number of IP cores increases
  - Heterogeneous
  - On-chip memory
  - On-chip communication dominant
- Workload
  - Application diversity
  - Multiple domains
  - Latency: BW, comp.
  - Virtualization
  - Interferences
  - Memory access
  - Communication
Starting point: on-chip communication infrastructure (here: Networks-on-Chip)
- Vital link between computation and storage

Fundamentals and trends - NoCs
Networks-on-Chip (NoCs): approach for scalable on-chip communication
- Managed wire segments
- Packet-based communication
- Globally asynchronous, locally synchronous (GALS)
- Topology: global interconnection between components
[Figure: 4x4 2D-mesh with tiles addressed (x,y) from (0,0) to (3,3); each tile is a voltage/frequency island with a router (R), links, and a network interface (NI)]

Fundamentals and trends - NoCs
Paths and routing
- Path: E2E packet route through NoC (start/end: NI of SRC/DST)
- Determined by routing algorithm
- Path length (PL): number of traversed routers (hops)
Dimension-ordered routing: XY or YX
- Minimal PL
- Deterministic
- Dead-/livelock-free
- Minimal HW effort
- Non-adaptive
[Figure: 4x4 2D-mesh with a path from (x_SRC, y_SRC) to (x_DST, y_DST)]
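The dimension-ordered routing described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the path length is assumed to count every traversed router, including the source and destination routers.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst under XY routing.

    The packet first travels along x until the column matches, then along y.
    Coordinates are (x, y) tuples as in the 4x4 mesh figure.
    """
    x, y = src
    x_dst, y_dst = dst
    path = [(x, y)]
    while x != x_dst:               # resolve the x dimension first
        x += 1 if x_dst > x else -1
        path.append((x, y))
    while y != y_dst:               # then resolve the y dimension
        y += 1 if y_dst > y else -1
        path.append((x, y))
    return path


def path_length(src, dst):
    """Path length as the number of traversed routers (assumption: both ends count)."""
    return len(xy_route(src, dst))
```

Because the order of dimensions is fixed, the route depends only on source and destination, which is what makes the scheme deterministic and non-adaptive.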

QMesh - overview
Idea: improve IP core connectivity
- Increase the number of NIs per IP core (Q0 to Q3)
- Connect each core to all surrounding routers
- QMesh: quadrant-based mesh + XY routing
[Figure: IP core with quadrant NIs Q3 (upper left), Q0 (upper right), Q2 (lower left), Q1 (lower right); router (R) with ports West, North, South, East]

QMesh - overview
QMesh characteristics
- Preservation of basic 2D-mesh structure
- Dual-path routing
- Spatially independent paths
Required modifications / additional HW costs
- 8-ported router
- 4 NIs per IP core
- 1 programmable Path Table (PT) per IP core
- 4 bit addressing extension (for Q_in and Q_out)
- Quadrant codes: Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01
Advantages
- Costs comparable to 2D-mesh with XY/YX routing
- Reduced average path length
- Mitigation of traffic interferences
- Increased traffic locality
- Benefits of 2D-mesh maintained (e.g. deterministic routing)
[Figure: QMesh tile with processing element, CX, PT, and four NIs (Q0 to Q3) attached to the surrounding routers (R); router ports Left, Up, Down, Right plus Q0 to Q3; PT entries of 4 bits (Q_in, Q_out)]

QMesh - overview
- DST in the same row (L, R) or column (U, D): path length reduced by 1 hop (compared to 2D-mesh with XY routing)
- DST in a quadrant (Q0 to Q3): path length reduced by up to 2 hops (worst case: same as 2D-mesh)
[Figure: two mesh views of IP cores around a source (IP SRC), one with a destination (IP DST) in the same row or column (L, R, U, D), one with a destination in one of the quadrants Q0 to Q3]
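The two cases above can be made concrete with a small classifier. The sketch below is hypothetical and rests on assumptions taken from the figure layout rather than from any stated definition: y grows upward, and the quadrants are placed as Q0 upper right, Q1 lower right, Q2 lower left, Q3 upper left. The returned reductions are the upper bounds quoted on the slide, not simulated values.

```python
def classify_destination(src, dst):
    """Classify dst relative to src as L, R, U, D, Q0..Q3, or SELF.

    Assumption: (x, y) tile coordinates with y growing upward; quadrant
    placement Q0 upper right, Q1 lower right, Q2 lower left, Q3 upper left.
    """
    dx = dst[0] - src[0]
    dy = dst[1] - src[1]
    if dx == 0 and dy == 0:
        return "SELF"
    if dy == 0:                      # same row
        return "R" if dx > 0 else "L"
    if dx == 0:                      # same column
        return "U" if dy > 0 else "D"
    if dx > 0:
        return "Q0" if dy > 0 else "Q1"
    return "Q3" if dy > 0 else "Q2"


def max_hop_reduction(region):
    """Upper bound on the hop reduction versus 2D-mesh with XY routing."""
    if region in ("L", "R", "U", "D"):
        return 1                     # same row or column: 1 hop saved
    if region in ("Q0", "Q1", "Q2", "Q3"):
        return 2                     # quadrant: up to 2 hops saved
    return 0                         # SELF: no network path involved
```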

QMesh - experimental results
Parameters
- Simulator: SystemC-based, cycle-accurate
- 2D-mesh and QMesh (XY routing)
- 8x8 NoC
- Link width: 64 bit
- NoC frequency: 1 GHz
- Router: FIFO depth 9 flits
Synthetic traffic patterns
- Single-threaded applications (bit complement/reverse, transpose, shuffle)
- Multi-threaded applications (nearest neighbor, hotspot, rentian)
Evaluated parameters
- Network saturation margin
- Packet delay
- Power overhead
- Reliability: wear-out acceleration factor a_MTTF
- Robustness: E2E connectivity, number of reachable resources

QMesh - experimental results
Δ_SAT: relative improvement of network saturation margin
- Due to hop reduction and dual-path options

QMesh - experimental results
- Reduction of packet delay (Δ_DELAY) compensates power overhead (Δ_POWER)
- Locality/fewer hops -> reduced traffic -> lower router/link activity -> reduced P_dyn
- Δ_POWER lower than expected (~100%)

QMesh - experimental results
Reliability
- Evaluation via acceleration factor of Mean-Time-To-Failure (MTTF): a_MTTF
- Wear-out increase: a_MTTF < 1
- Wear-out decrease: a_MTTF > 1

$$a_{\mathrm{MTTF}} = \frac{t_{\mathrm{QMesh}}}{t_{\mathrm{2D\text{-}mesh}}} = e^{\frac{E_a}{k}\left(\frac{1}{T_{\mathrm{QMesh}}} - \frac{1}{T_{\mathrm{2D\text{-}mesh}}}\right)}$$

- t_QMesh, t_2D-mesh: MTTF of QMesh/2D-mesh
- T_QMesh, T_2D-mesh: avg. router temperature for QMesh/2D-mesh
- k: Boltzmann's constant (8.6 x 10^-5 eV/K)
- E_a: activation energy of the CMOS devices (here: 0.7 eV at 45 nm CMOS)
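As a rough illustration of how the acceleration factor reacts to temperature, the Arrhenius-style expression can be evaluated directly. The sketch below uses the constants quoted on the slide; the function name is hypothetical, temperatures are assumed to be absolute (kelvin), and any temperature values passed in are illustrative inputs, not measured results.

```python
import math

K_BOLTZMANN_EV_PER_K = 8.6e-5   # Boltzmann's constant as quoted on the slide
E_A_EV = 0.7                    # activation energy quoted for 45 nm CMOS


def mttf_acceleration(t_qmesh_k, t_2dmesh_k, e_a=E_A_EV, k=K_BOLTZMANN_EV_PER_K):
    """Return a_MTTF = t_QMesh / t_2D-mesh for average router temperatures in kelvin.

    A value above 1 means the QMesh routers wear out more slowly than the
    2D-mesh routers; a value below 1 means they wear out faster.
    """
    return math.exp((e_a / k) * (1.0 / t_qmesh_k - 1.0 / t_2dmesh_k))
```

A cooler QMesh router makes the exponent positive and therefore yields a_MTTF > 1, which matches the sign convention given for wear-out decrease.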

QMesh - experimental results
- General wear-out decrease through QMesh
- Mean router lifetime increase: 10% for low PIR, 60% for high PIR (avg. ~35%)

QMesh - experimental results
Robustness
- Degradation of E2E connectivity?
- Injection of random router faults (uniform distribution)

QMesh - experimental results
Robustness
- Number of reachable resources?
- Injection of random router faults (uniform distribution)

Conclusion / future work
- Modern CMPs require an efficient architecture for on-chip communication
- NoCs provide appropriate infrastructure
- QMesh topology: integration of multiple NIs per IP core to improve connectivity
  - Preservation of basic NoC structure and associated benefits
- Improvements over standard 2D-mesh
  - Increase of network saturation margin
  - Reduction of avg. packet delay
  - Reliability: increased router lifetime due to lower max. temperatures
  - Higher robustness due to dual-path routing (spatially independent)
  - Tolerable costs
- Next step: full HW/SW integration of traffic monitoring and path reconfiguration

Thank you for your attention! Questions?

QMesh - overview
Global view: orthogonal system-level integration (HW/SW)
- Step 1: QMesh (this work)
  - Dual-path options
  - Performance
  - Reliability & robustness
- Step 2: Observability
  - Traffic monitoring
  - Extendable & adaptable
- Step 3: Traffic management
  - Path reconfiguration
  - Extendable & adaptable
[Figure: software layer (traffic evaluation, path adaptation) above hardware clusters and tiles. Traffic monitoring: local sensors (links/ports, destinations) at local level, aggregation at cluster level (master tile). Path reconfiguration: local actors (path table) at cluster level. SNoC: QMesh monitoring/control network; DNoC: QMesh data network]

Network-on-Chip - Packets
De- and packetization
- Message Buffer (MB): read/write messages of workload applications
- Network Interface (NI): packetization (MB-to-NoC) and depacketization (NoC-to-MB), End-to-End (E2E) flow control between source/destination
- Router/link: flits, i.e. data words that traverse links and routers
[Figure: 2D-mesh tile with PE, L1-I, L1-D, private or shared L2 cache, CX, MB, NI, and router (R) with ports Up, Down, Left, Right. Messages/transactions (MSG 1 to MSG 4) in the MB are split into packets (PKT 3.1 to PKT 3.4), and each packet into header (x_dst, y_dst), body, and tail flits]
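The message-to-packet-to-flit decomposition can be sketched as follows. This is a hypothetical illustration, not the NI implementation from the talk: the 8-byte flit and the 9-flit packet limit are borrowed from the simulation parameters, while the choices that the header flit carries only the destination and that the tail flit carries payload are assumptions made here for brevity.

```python
FLIT_BYTES = 8  # one flit per 64-bit link word (assumption from the link width)


def packetize(message, dst, max_flits=9):
    """Split a message (bytes) into packets; each packet is a list of flits.

    Flits are (kind, payload) tuples. Assumption: the header flit carries
    only the destination (x_dst, y_dst); body and tail flits carry data.
    """
    payload_per_packet = (max_flits - 1) * FLIT_BYTES
    packets = []
    for offset in range(0, len(message), payload_per_packet):
        chunk = message[offset:offset + payload_per_packet]
        words = [chunk[i:i + FLIT_BYTES] for i in range(0, len(chunk), FLIT_BYTES)]
        flits = [("HEADER", dst)]
        flits += [("BODY", word) for word in words[:-1]]
        flits.append(("TAIL", words[-1]))   # last data word closes the packet
        packets.append(flits)
    return packets


def depacketize(packets):
    """Reassemble the message from the data-carrying flits (NoC-to-MB direction)."""
    return b"".join(
        payload for flits in packets for kind, payload in flits if kind != "HEADER"
    )
```

For a 100-byte message this yields one full 9-flit packet and one shorter packet, and depacketize restores the original bytes.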

Network-on-Chip - Router
Standard router architecture for mesh-based topologies:
- Five-ported router (N_P = 5)
- FIFO-based input buffers
- Central crossbar
- Control logic
  - Routing: output selection
  - Arbitration: output allocation
- REQ/ACK flow control
- 4-staged canonical pipeline
- Wormhole switching (WHS)
- Non-interfering concurrent packet traversals if desired outputs differ
[Figure: router block diagram with input buffer FIFOs for ports Up, Right, Down, Left, and Core/Q2, crossbar, control logic (routing + arbitration on the packet header), and output buffers; a packet worm passing through the canonical router pipeline (4 stages) with labels Link in, Header, Routing, Arbitration, Crossbar, Link out]

Network-on-Chip - Link
- Bidirectional point-to-point busses: up-/downlink
- Repeated or buffered wires: wire delay reduction
- Slowest component in the NoC and critical design issue (wire length)
- Rebalancing with router pipelines possible
[Figure: uplink and downlink between input and output, each with data wires and control wires; link length l_link; link width == flit width; wire segments separated by repeaters/buffers]

QMesh - overview
The 2D-mesh NoC + XY-DOR has desirable characteristics that should be kept!
- Scalable (N_X/N_Y++), deterministic and regular (GALS + power), simple and short links
Performance options
- Increase spatial communication locality
  - Minimize path length
  - Reduce interferences of traffic flows
- Queuing at SRC / backpressure
  - Concurrency along paths in NoC
  - Spatial/temporal separation of traffic
Reliability options
- Increase spatial communication locality
  - Activity / wear-out
- Add multi-path routing mechanisms
  - Circumvent faulty components
  - Rebalance activity
  - Make paths spatially independent
[Figure: packet header delay [ns] (total packet delay, post-NI network delay, pre-NI queuing delay) vs. packet injection rate [packets / IP core / ns], logarithmic delay axis]
[Figure: number of destinations with dual XY/YX path vs. single path, by hop distance range (n hops); annotated "Not good!"]

QMesh - overview
Additional hardware efforts similar to XY/YX-DOR 2D-mesh
Required modifications:
- Eight-ported router (N_P = 8)
- Four NI terminals
- MB segmentation
- Programmable Path Table (PT)
Addressing extensions:
- Input quadrant at SRC: Q_IN
- Output quadrant at router: Q_OUT
- Quadrant codes: Q0 = 11, Q1 = 10, Q2 = 00, Q3 = 01
- Stored in PT (cmp. MB ~ kbytes)
  - 8x8 NoC: PT = 32 bytes
  - 16x16 NoC: PT = 128 bytes
PT represents a flexible interface for path adaptation
- Source-based, inter-tile,
[Figure: path table with entries 0, 1, 2, ..., NxN of 4 bits each (Q_IN, Q_OUT); QMesh tile with PE, L1-I, L1-D, private or shared L2 cache, CX, MB, PT, and NIs Q0 to Q3 attached to routers (R) with ports Left, Up, Down, Right plus Q0 to Q3]
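The quoted table sizes follow from one 4-bit entry per destination tile: 64 entries x 4 bits = 32 bytes for 8x8, and 256 entries x 4 bits = 128 bytes for 16x16. The sketch below reproduces that arithmetic and shows one possible packing. The quadrant codes come from the slide; the function names, the bit order inside an entry (Q_IN in the upper two bits), and the two-entries-per-byte layout are assumptions for illustration only.

```python
QUADRANT_CODE = {"Q0": 0b11, "Q1": 0b10, "Q2": 0b00, "Q3": 0b01}  # codes from the slide


def pt_size_bytes(n_x, n_y, bits_per_entry=4):
    """Path table size for an n_x by n_y NoC with one entry per destination tile."""
    return n_x * n_y * bits_per_entry // 8


def pack_entry(q_in, q_out):
    """Pack one entry as 4 bits; assumed layout: Q_IN in the upper two bits."""
    return (QUADRANT_CODE[q_in] << 2) | QUADRANT_CODE[q_out]


def build_path_table(entries):
    """Pack (q_in, q_out) pairs, two 4-bit entries per byte (assumed layout)."""
    table = bytearray((len(entries) + 1) // 2)
    for index, (q_in, q_out) in enumerate(entries):
        table[index // 2] |= pack_entry(q_in, q_out) << (4 * (index % 2))
    return bytes(table)
```

Since the table is only tens of bytes against a message buffer in the kilobyte range, rewriting entries at run time is a cheap way to steer paths from the source side.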

QMesh - experimental results
[Figure: packet delay (PD) vs. packet injection rate (PIR) for 2D-mesh and QMesh, with PD limit = 500 ns. The comparable region, used for Δ_DELAY and Δ_POWER, runs from PIR_LOW to PIR_HIGH = PIR_SAT1; Δ_SAT is marked between PIR_SAT1 and PIR_SAT2]

QMesh - experimental results

Parameter                       Value
Link width                      64 bits / 8 bytes
Router port buffer depth        9 flits / 72 bytes
Topology                        2D-Mesh, CMesh, QMesh
Switching                       Wormhole
Routing                         XY
CMOS technology                 45 nm
NoC frequency                   1 GHz
Packet size (random uniform)    20% with 2 flits / 16 bytes, 80% with 9 flits / 72 bytes
NoC operation time              5 ms
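For relating packet injection rates to link load, the mean packet size implied by the 20%/80% mix is a handy figure: (20 x 2 + 80 x 9) / 100 = 7.6 flits, or 60.8 bytes at 8 bytes per flit. The short sketch below reproduces that arithmetic; the names are hypothetical and the mix is taken from the table above.

```python
FLIT_BYTES = 8                      # 64-bit link width
PACKET_MIX = [(20, 2), (80, 9)]     # (share in percent, packet size in flits)


def mean_packet_flits(mix=PACKET_MIX):
    """Weighted mean packet size in flits for a list of (percent, flits) pairs."""
    total_share = sum(share for share, _ in mix)
    return sum(share * flits for share, flits in mix) / total_share


def mean_packet_bytes(mix=PACKET_MIX, flit_bytes=FLIT_BYTES):
    """Weighted mean packet size in bytes."""
    return mean_packet_flits(mix) * flit_bytes
```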

Quadrant-based mesh topology
A deeper look at the parameter progress along the comparable region
Example: 8x8 NoC with NN pattern at 20% direct neighbor communication
[Figure: packet delay [ns] and router temperature [°C] vs. packet injection rate [packets / IP core / ns]; series DELAY-2DMESH, DELAY-QMESH, TEMP-2DMESH, TEMP-QMESH]
[Figure: relative change [2DMesh-QMesh] of NoC power and packet delay along the comparable region vs. packet injection rate [packets / IP core / ns]; data labels 68%, 32%, -44%, -57%]