Multi-level Fault Tolerance in 2D and 3D Networks-on-Chip

Similar documents
Outline. Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication. Outline

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Network on Chip Architecture: An Overview

Adaptive Multi-bit Crosstalk-Aware Error Control Coding Scheme for On-Chip Communication

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

3D WiNoC Architectures

Configurable Error Control Scheme for NoC Signal Integrity*

Subject: Adhoc Networks

High Performance Interconnect and NoC Router Design

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

EE/CSCI 451: Parallel and Distributed Computation

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Some previous important concepts

Network-on-Chip Architecture

A New CDMA Encoding/Decoding Method for on- Chip Communication Network

4. Networks. in parallel computers. Advances in Computer Architecture

Basic Low Level Concepts

CHAPTER TWO LITERATURE REVIEW

udirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

Transport layer issues

SoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology

Impact of transmission errors on TCP performance. Outline. Random Errors

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

CSE 123: Computer Networks Alex C. Snoeren. HW 2 due Thursday 10/21!

FAULT TOLERANT SYSTEMS

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

RHiNET-3/SW: an 80-Gbit/s high-speed network switch for distributed parallel computing

The MAC Address Format

Chapter 5 Ad Hoc Wireless Network. Jang Ping Sheu

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego

NoC Test-Chip Project: Working Document

ESA IPs & SoCs developments

Message Passing Models and Multicomputer distributed system LECTURE 7

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés

Computer Networks and reference models. 1. List of Problems (so far)

Concepts for Robust NoC Communication

DESIGN AND IMPLEMENTATION ARCHITECTURE FOR RELIABLE ROUTER RKT SWITCH IN NOC

The Nostrum Network on Chip

Part I. Wireless Communication

INTERCONNECTION NETWORKS LECTURE 4

Design and Performance Evaluation of a New Spatial Reuse FireWire Protocol. Master s thesis defense by Vijay Chandramohan

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Network Control and Signalling

Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

NEtwork-on-Chip (NoC) [3], [6] is a scalable interconnect

Fault Tolerance in Parallel Systems. Jamie Boeheim Sarah Kay May 18, 2006

Network-on-chip (NOC) Topologies

E-Commerce. Infrastructure I: Computer Networks

A thesis presented to. the faculty of. In partial fulfillment. of the requirements for the degree. Master of Science. Travis H. Boraten.

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

Part IV: 3D WiNoC Architectures

Design and Test Solutions for Networks-on-Chip. Jin-Ho Ahn Hoseo University

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Multiprocessor Interconnection Networks

CMPE 150/L : Introduction to Computer Networks. Chen Qian Computer Engineering UCSC Baskin Engineering Lecture 18

NETWORKS on CHIP A NEW PARADIGM for SYSTEMS on CHIPS DESIGN

CSMA based Medium Access Control for Wireless Sensor Network

Noc Evolution and Performance Optimization by Addition of Long Range Links: A Survey. By Naveen Choudhary & Vaishali Maheshwari

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

TDT Appendix E Interconnection Networks

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Rollback-Recovery p Σ Σ

Parallel Computer Architecture II

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

CS575 Parallel Processing

Fault Tolerance. The Three universe model

Lecture 4: CRC & Reliable Transmission. Lecture 4 Overview. Checksum review. CRC toward a better EDC. Reliable Transmission

Lecture 9: Bridging & Switching"

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

On-chip Monitoring Infrastructures and Strategies for Many-core Systems

Networked Control Systems for Manufacturing: Parameterization, Differentiation, Evaluation, and Application. Ling Wang

Mapping of Real-time Applications on

Fault-tolerant & Adaptive Stochastic Routing Algorithm. for Network-on-Chip. Team CoheVer: Zixin Wang, Rong Xu, Yang Jiao, Tan Bie

Chapter 8 Fault Tolerance

Design of a System-on-Chip Switched Network and its Design Support Λ

Congestion Management in Lossless Interconnects: Challenges and Benefits

Uncoordinated Multiple Access for Vehicular Networks

AFRA: A Low Cost High Performance Reliable Routing for 3D Mesh NoCs

Links Reading: Chapter 2. Goals of Todayʼs Lecture. Message, Segment, Packet, and Frame

JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS

Data Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Networks. Distributed Systems. Philipp Kupferschmied. Universität Karlsruhe, System Architecture Group. May 6th, 2009

Computation of Multiple Node Disjoint Paths

Introduction. Communication

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Real-Time Mixed-Criticality Wormhole Networks

OASIS Network-on-Chip Prototyping on FPGA

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

Data Link Layer. Goals of This Lecture. Engineering Questions. Outline of the Class

Transcription:

Multi-level Fault Tolerance in 2D and 3D Networks-on-Chip Claudia usu Vladimir Pasca Lorena Anghel TIMA Laboratory Grenoble, France

Outline Introduction Link Level outing Level Application Level Conclusions 2

Network-on-Chip based Systems NoC vs. traditional connection systems P2P Bus NoC outer PE Link NoC advantages Efficient sharing of wires Shorter design time, lower effort Scalability 3

NoC QoS vs. Faults Quality of service (QoS) reliability, throughput, latency, bandwidth Unreliable signal transmission medium timing and data errors process variation induced logic and timing errors, crosstalks, electromagnetic interferences, radiations Technology down scaling Increased system complexity => Increased vulnerability to faults 4

Fault Tolerance in NoCs Faults and Fault Tolerance At different NoC components Links outers switching blocks buffers At different levels of the communication protocol stack Fault tolerant solutions EDC, ECC, NM Fault-tolerant routing Stochastic communication Application checkpoint Application Transport Network Data link Physical outer PE Link Fault 5

3D NoCs Manufacturing Stacked silicon active layers Interlayer wires (TSV) Topology Vertical and horizontal links Challenges Manufacture high density vertical wires (TSV) Place and route for heat removal 6

Outline Introduction Link Level outing Level Application Level Conclusions 7

Error Control Schemes Fault Tolerance Mechanism Goal: eceiver always receives good (error free) data Main classes A. Forward Error Correction received data is always corrected (use only error correction codes: Hamming SEC/DED, Hsiao). B. Detection and etransmission resend data when errors are detected (use error detection codes: Parity, Hamming). C. Hybrid correct as many errors as possible and alternatively request retransmission to improve error correction capability (use correction codes). Flow Control a. Switch to Switch error recovery on each intermediary link b. End to End error recovery at destination Add several blocks to the initial scheme Encoder, Syndrome decoders and Correctors Dedicated buffers for retransmission Additional logic for the control signals of the scheme 8

a.a. Switch to Switch Error Correction O U T E ENCODE DATA DETECT EO COECT EO O U T E Protect data lines with error correction codes Alternative protection (e.g. TM) schemes for control signals Encode before sending flits on the communication lines (Flit) Check for errors on the received flits code syndrome Decode syndrome to identify errors correction vector Correct flit (received flit xor correction vector) 9

a.b. Switch to Switch Detection and etransmission O U T E ETASMISSION BUFFE ENCODE DATA DETECT EO O U T E Protect data lines with error detection codes Alternative protection (e.g. TM) schemes for control signals Store transmitted flits in dedicated buffers Error detected during transmission etransmission from dedicated buffers etransmission Detect Error Start retransmission During etransmission ead Buffer & Disable link Empty Buffer Stop etransmission & Enable link 10

a.c. Switch to Switch Hybrid Error Detection/Correction O U T E ETASMISSION BUFFE ENCODE DATA DETECT EO COECT EO O U T E Protect data lines with error detection codes Alternative protection (e.g. TM) schemes for control signals Store transmitted flits in dedicated buffers Error detected during transmission Decode syndrome and correct error Unable to correct detected errors etransmission from dedicated buffers Uncorrectable Error Start retransmission During etransmission ead Buffer & Disable link Empty Buffer Stop etransmission & Enable link 11

b.a. End to End Error Correction N I ENCODE O U T E O U T E DETECT EO COECT EO N I Protect data lines with error detection codes Alternative protection (e.g. TM) schemes for control signals Encode flits before transmission over the network (at the Network Interface) Intermediate buffers store encoded flits Flit arrives at destination Detect and correct errors Forward corrected data to the NI 12

b.b. End to End Detection and etransmission N I ENCODE O U T E O U T E DETECT EO N I Protect data lines with error detection codes Alternative protection (e.g. TM) schemes for control signals Encode flits before transmission over the network (at the Network Interface) Transmission buffers in intermediate nodes store encoded flits Flit arrives at destination Detect errors Signal errors to NI NI makes retransmission requests for detected errors 13

b.c. End to End Hybrid Error Detection / Correction N I ENCODE O U T E O U T E DETECT EO COECT EO N I Protect data lines with error detection codes Alternative protection (e.g. TM) schemes for control signals Encode flits before transmission over the network (at the Network Interface) Transmission buffers of intermediate nodes store encoded flits When the flit arrives at destination Detect and correct errors Signal detected, but uncorrected errors to NI NI makes a retransmission request for detected, but uncorrected errors 14

Implementation Overhead 3x3 Mesh Network on Chip 9 Nodes and 21 Bi-directional Links Switch to Switch Error Control 42 Encoding, Detection, Correction Modules 42 etransmission Buffers End to End Switch Control 9 Encoding, Detection, Correction Modules 9 etransmission Buffers 15

Link Level - Conclusions Criterion/scheme S2S E2E Deal with cumulated errors along the path yes no Area and power overhead higher smaller Intermediate buffer width narrower wider Latency S2S E2E etransmission 3 clock cycles / link depends on the path, traffic load etc. Correction 1 clock cycle / link 1 clock cycle Area overhead etransmission Correction higher lower Efficiency S2S E2E etransmission Correction - Many errors uniformly distributed along the path - Few errors uniformly distributed along the path - Many errors not uniformly distributed along the path - Few errors not uniformly distributed along the path 16

Outline Introduction Link Level outing Level Application Level Conclusions 17

2D FT outing Virtual networks N Last and S Last virtual networks if Dst is N of Crt, use S Last if Dst is S of Crt, use N Last The VN changes on the path when the location (N or S) of the Dst from the Crt changes Numbering of nodes (only increasing or only decreasing) -> avoid deadlocks 18

3D Fault-tolerant outing Extends the 2D FT routing Folded 2D FT routing Assumptions All layers are reachable All nodes in the same layer are reachable Algorithm while Crt layer <> Dst layer if Dst layer < Crt layer then V = Up else V = Down end if 2D_FT_routing to V route to next layer end while 2D_FT_routing in Dst layer 19

Determining V By gossiping econfiguration in case of failure Initial Node failure (18) V failure (17) 20

outing Level - Conclusions Fault-tolerant routing in 3D NoC Deadlock free Tolerate multiple node and link failures Topology independent Heterogeneous stacking Compatible with any 2D FT routing reusability of specific 2D FT routing of each layer 21

Outline Introduction Link Level outing Level Application Level Conclusions 22

Checkpoint and ollback ecovery. Principle No failure tolerance Failure => estart Checkpoint and rollback recovery Failure => esume from a more recent state Principle start rollback recovery start Failure-free periodically store states on stable storage Failure rollback to the last consistent stored state restart consistent state failure rollback failure t t 23

Checkpoint and ollback ecovery. Consistent State of Several Tasks Message types vs. recovery line T A T B past message S A early/orphan message late message S B future message Consistent state with late messages T A T B past message S A future message S B late message early messages are avoided future message late messages are to be replayed after rollback t t t t 24

Coordinated vs. Uncoordinated Coordinated T A epoch rollback Uncoordinated T A garbage collection rollback T B T B T C T C consistent states global synchronizations global synchronizations recovery line (consistent state) Failure-free synchronization > consistent state Failure rollback to the last consistent state Failure-free individual task checkpoints Failure synchronization > consistent state rollback to consistent state 25

Blocking and Non-blocking Coordinated Checkpointing T A T B I T C T D Messages in NoC during checkpointing Blocking synchronization messages Non-blocking synchronization messages application messages Initiator - broadcast CK_EQ - when CK_TAKEN received from all tasks - validate global checkpoint Unique coordination Non-initiator (blocking or not) - on CK_EQ receipt - broadcast CK_STAT - when CK_STAT received from all tasks - take local checkpoint - send to initiator CK_TAKEN Allows blocking of a task set and the non-blocking of another 26

Coordinated Checkpointing Scalability Improvements Coordinated checkpointing improvements educe number of broadcasts Smart broadcast => Significant reduction of checkpointing overhead and latency, especially for higher NoC sizes => Scalability improvement 2x2 4x4 8x8 16x16 27

Protocol Improvement No optimization I educed number of broadcasts I T A T D T A TD T B T C T B T C n nodes CK_EQ n CK_STAT n*(n-1) O(n 2 ) CK_TAKEN n n nodes CK_EQ n CK_STAT n O(n) CK_STAT_ALL n CK_TAKEN n 28

Smart Broadcast Classical broadcast Smart broadcast simplicity redundancy unequal link utilization; high utilization of certain links > bottlenecks uniform link utilization no redundancy no bottlenecks adjust routing 29

Optimized Configurations simple MU shared MU multiple MUs nodes Application partitioning educe number of participating nodes educe total distance among nodes esults Partition configurations assure lower overhead and lower latency 30

Application Level - Conclusions Coordinated vs. Uncoordinated Lower overhead induced by coordinated checkpointing Traffic and failure rate influence Lower traffic load and lower failure rate Coordinated checkpointing is more efficient Higher traffic load and/or higher failure rate Uncoordinated checkpointing becomes relevant Blocking vs. Non-blocking Checkpointing duration increases with the traffic load Non-blocking: significantly Blocking: lesser Application latency increases with the traffic load and the failure rate Non-blocking: significantly Blocking: lesser For higher traffic loads and higher failure rates, the blocking approach becomes indispensable Unique blocking and non-blocking protocol Scalability Improvements Optimized Configurations 31

Fault-tolerance and econfiguration Link Level econfigurable error correction scheme Self-healing capabilities outing Level Dynamic reconfiguration of the path in case of failures Application Level Node failure Use checkpoint and reallocate task 32

Outline Introduction Link Level outing Level Application Level Conclusions 33

Conclusions Multi-level fault-tolerant mechanisms link, routing, application Complementary mechanisms Efficiency cost trade-off Error rate Application characteristics equired QoS Overhead 34