Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation


Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Kshitij Bhardwaj (Dept. of Computer Science, Columbia University) and Steven M. Nowick
2016 ACM/IEEE Design Automation Conference (DAC), Austin, TX

Motivation for Networks-on-Chip

- The future of computing is many-core
  - 8 to 22 cores widely available (e.g., Intel 22-core Xeon E5-2699 series)
  - Expected progression: hundreds or thousands of cores
- A NoC separates communication from computation
  - Improves scalability: global interconnects such as buses and point-to-point wiring have high latency and power consumption
  - Increases performance/energy efficiency: wiring resources are shared between parallel data flows
  - Facilitates design reuse: optimized IPs can simply plug in, considerably decreasing design effort
- Key challenge for NoCs: support for new traffic patterns
  - Communication patterns of advanced parallel architectures
  - Compatibility with emerging technologies for NoCs: wireless, photonics, CDMA

Multicast (1-to-Many) Communication

- Sending packets from one source to multiple destinations
- Widely used in parallel computing; three key applications:
  - Cache coherence: sending write-invalidates to multiple sharers
    - For the Token Coherence protocol, 52.4% of injected traffic is multicast
  - Shared-operand networks: operand delivery to multiple processors
  - Multi-threaded applications: barrier synchronization
  - [Jerger/Lipasti et al., "Virtual circuit tree multicasting: a case for on-chip hardware multicast support," ISCA-08]
- Additional applications: multicast in emerging technologies
  - Wireless: mixed wire + millimeter-wave (or surface-wave)
  - Nano-photonics: support for energy-efficient optical broadcast
  - Large-scale neuromorphic CMPs: multicast between 1000s of neurons
- Key challenge for NoCs: performance- and energy-efficient multicast

Asynchronous Design: Potential Advantages

- Lower power
  - No clock power
  - Energy-proportional computing: on-demand operation
  - Less overall power than a deeply clock-gated synchronous counterpart
- Comparison with a synchronous NoC router (in 40 nm technology):
  - 71% area reduction
  - 39% lower latency, comparable throughput
  - 44% lower energy/flit
  - [Ghiribaldi/Bertozzi/Nowick, "A transition-signaling bundled data NoC switch architecture for cost-effective GALS multicore systems," DATE-13]
- Industrial uptake of asynchronous NoCs
  - IBM's TrueNorth neuromorphic chip: 5.4 billion transistors, fully asynchronous, consuming only 63 mW
  - 4096 asynchronous neurosynaptic cores modeling 1 million neurons, connected using a fully-asynchronous NoC
  - [Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science (Aug. 2014), cover story]

Related Work: Techniques for Multicast

1) Path-based serial multicast [Ebrahimi/Daneshtalab/Tenhunen, IEEE TC-14]
  - Packet routed to the first destination, from there to the next, and so on
  - Expensive for large numbers of destinations: latency overheads
2) Tree-based parallel multicast: high-performance, widely used
  - First route the packet on a common path from the source toward all destinations
  - When the common path ends, replicate the packet and diverge
  - Earlier works set up the tree in advance using multiple unicasts [Jerger/Lipasti, ISCA-08]
  - Recent works avoid unicast-based setup: the tree is constructed dynamically [Krishna/Reinhardt, MICRO-11]

(Figure: serial path-based vs. parallel tree-based multicast, each delivering from one source to multiple destinations.)
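The traffic saving behind tree-based multicast can be sketched in a few lines. This is our own illustration, not the paper's implementation: each destination's path is given as a hop list, and the tree variant crosses every distinct link once, while unicast-based serial multicast re-sends the packet along each full path.

```python
# Illustrative sketch (not from the paper) of serial-unicast vs. tree-based
# multicast cost, measured in link traversals.

def unicast_links(paths):
    """Total link traversals when each destination gets its own unicast copy."""
    return sum(len(path) for path in paths)

def tree_links(paths):
    """Link traversals for tree-based multicast: each distinct link is crossed
    once, since copies share the common prefix and replicate only on divergence."""
    links = set()
    for path in paths:
        prev = "src"
        for hop in path:
            links.add((prev, hop))
            prev = hop
    return len(links)

# Three destinations, two of which share the prefix src -> A -> B:
paths = [["A", "B", "D"], ["A", "B", "E"], ["A", "C"]]
print(unicast_links(paths))  # serial: 8 traversals
print(tree_links(paths))     # tree:   5 traversals
```

The gap widens with the number of destinations, which is why the slide calls serial multicast "expensive if large number of destinations".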

Major Contributions

1) First general-purpose asynchronous NoC to support multicast
  - Initial solution uses simple tree-based parallel multicast
2) Novel strategy, called Local Speculation, for parallel multicast
  - Always broadcast at a subset of very fast speculative routers
  - Neighboring non-speculative routers:
    - Quickly throttle misrouted packets from speculative nodes
    - Correctly route the other packets based on the source-routing address
  - New multicast protocol: a relaxed variant of tree-based multicast
3) New hybrid network architecture
  - Mixes speculative and non-speculative routers
  - 17.8-21.4% improvement in network latency over the basic non-hybrid tree-based solution
4) Additional contributions:
  - Two more architectures with extreme degrees of speculation: no speculation and full (global) speculation
  - Router-level protocol optimizations for multi-flit packets, further improving power and performance

Variant Mesh-of-Trees Topology

- Variant MoT contains two binary trees:
  - Fanout tree: 1-to-2 routing nodes
  - Fanin tree: 2-to-1 arbitration nodes
- Recently used for core-to-cache networks in shared-memory parallel processors
- Several advantages of the variant MoT:
  - Small, constant hop count from source to destination: log(n)
  - Unique path from source to destination, minimizing network contention
- Challenge: lack of path diversity
  - Can be a bottleneck for unbalanced traffic
  - But overall, significant benefits for improved saturation throughput
- [Horak/Nowick et al., "A low-overhead asynchronous interconnection network for GALS chip multiprocessors," TCAD-11]
- [Balkan/Vishkin et al., "Layout-accurate design and implementation of a high-throughput interconnection network for single-chip parallel processing," HOTI-07]
- [Rahimi/Benini et al., "A fully-synthesizable single-cycle interconnection network for shared-L1 processor clusters," DATE-11]
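The unique-path and log(n) hop-count properties of the fanout tree can be seen in a small sketch. This is our own illustration (leaf numbering is an assumption): at each 1-to-2 routing node, one bit of the destination address selects the branch, so an n-leaf tree admits exactly one path of log2(n) hops per destination.

```python
import math

# Sketch (ours, not the paper's) of routing in an n-leaf binary fanout tree:
# the destination address, read MSB-first, fully determines the path.

def fanout_path(dest, n):
    """Branch decisions (0 = left, 1 = right) from the root to leaf `dest`
    in an n-leaf binary fanout tree; the MSB decides the first hop."""
    depth = int(math.log2(n))
    return [(dest >> (depth - 1 - lvl)) & 1 for lvl in range(depth)]

print(fanout_path(5, 8))       # leaf 5 = 0b101 -> [1, 0, 1]: three hops = log2(8)
print(len(fanout_path(3, 16))) # 4 hops in a 16-leaf tree
```

Because the address fixes every branch decision, there is no route computation to speed up and no alternative path to fall back on, which is exactly the "lack of path diversity" trade-off noted above.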

Baseline Asynchronous NoC

- The new approach builds on a recent async NoC that supports only unicast
  - [Horak/Nowick et al., "A low-overhead asynchronous interconnection network for GALS chip multiprocessors," TCAD-11]
- Comparison with a synchronous 8x8 MoT network:
  - Network latency: 1.7x lower (vs. 800 MHz synchronous)
  - Node-level metrics: significantly lower area and energy/packet than a 1 GHz synchronous design
- Key design decisions: async communication + packet addressing
  - Uses a 2-phase handshaking protocol instead of 4-phase: only one round-trip communication per data transfer
  - Data encoding: single-rail bundled-data encoding, with high coding efficiency and low area/power
  - Source routing: the header contains an address for every fanout node on its path, allowing simple fanout nodes
- Due to the lack of multicast support, a multicast packet is serially routed using multiple unicasts
- Our focus is only on the fanout nodes:
  - Only fanout nodes are modified to support parallel multicast
  - Enhancements support parallel replication and new multicast addressing
  - No changes are needed to fanin nodes for multicast: the baseline ones are used
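The source-routing decision above keeps each fanout node trivially simple. A minimal sketch of the idea (field widths and names are our assumptions, not the paper's header format): the header carries one routing bit per fanout node on the path, and each node just consumes its leading bit and forwards the remainder, with no routing tables in the node.

```python
# Hedged sketch of per-hop source routing: each fanout node strips its own
# routing bit from the header and forwards the rest downstream.

def fanout_hop(header):
    """One fanout node: pop the leading routing bit and pick the output port."""
    bit, rest = header[0], header[1:]
    return ("right" if bit else "left"), rest

def traverse(header):
    """Walk a packet through successive fanout nodes until the route is used up."""
    ports = []
    while header:
        port, header = fanout_hop(header)
        ports.append(port)
    return ports

print(traverse([1, 0, 1]))  # ['right', 'left', 'right']
```

Since every routing decision is precomputed at the source, the fanout node's datapath reduces to a 1-bit select and a shift, which is what makes the later speculative variant (no decision at all) such a small additional step.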

Overview of Proposed Approach

Local Speculation: Basic Idea

- Goal of research: high-performance parallel multicast, improving latency/throughput
- Basic strategy: speculation
  - A fixed subset of fanout nodes is always speculative
  - Speculative nodes always broadcast every packet
  - Lightweight and very fast: no route computation or channel allocation steps
- Novel approach: does not follow classic speculation
  - Hybrid network: non-speculative nodes surround speculative ones
  - Non-speculative nodes always route based on address:
    - Support parallel replication capability for multicast
    - Throttle any redundant copies received from speculative nodes
  - Redundant copies are restricted to small local regions
- Net effect:
  - High performance due to speculation
  - Minimal power overhead due to the local restriction
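The division of labor between the two node types can be captured in a simplified behavioral model. This is our own sketch, not the authors' circuits: a speculative fanout node broadcasts every flit with no route computation, while a non-speculative node forwards only toward subtrees containing a destination and silently drops the redundant copies that speculation produced.

```python
# Behavioral sketch (ours) of the hybrid scheme's two fanout-node types.
# A flit carries its destination set; each non-speculative node knows which
# leaves lie below each of its output ports.

def speculative_node(flit):
    """Always broadcast: lightweight and fast, but may create redundant copies."""
    return {"left": flit, "right": flit}

def non_speculative_node(flit, left_leaves, right_leaves):
    """Forward only toward subtrees that contain a destination; an empty
    result means a misrouted speculative copy was throttled locally."""
    outs = {}
    if flit["dests"] & left_leaves:
        outs["left"] = flit
    if flit["dests"] & right_leaves:
        outs["right"] = flit
    return outs

flit = {"dests": {5}}
print(set(speculative_node(flit)))                              # both ports
print(set(non_speculative_node(flit, {0, 1, 2, 3}, {4, 5, 6, 7})))  # only 'right'
print(non_speculative_node(flit, {8, 9}, {10, 11}))             # {} -> dropped
```

Because every speculative node is surrounded by non-speculative ones, a wrong copy survives at most one hop, which is the "local restriction" that keeps the power overhead small.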

New Hybrid Network Architecture

Local Speculation: Multicast Operation

- Speculative nodes: very fast and simple
  - Latency: 52 ps in 45 nm
  - Low area: 247 µm² in 45 nm
- Non-speculative nodes:
  - Latency: 299 ps in 45 nm
  - Area: 406 µm² in 45 nm
- Similar operation for unicast traffic
- Simplified source routing:
  - Only the non-speculative nodes on the paths to the destinations are encoded
  - No addressing for speculative nodes: improves packet coding efficiency

Node-Level Protocol Optimizations

Optimize power and performance for multi-flit packets:

1) Speculative nodes: extra power due to redundant copies
  - Power optimization: switch to non-speculative mode for body flits
  - After the header, no speculation is needed, since the correct route is known
  - Speculative for the head; non-speculative for the body, going to one port; back to speculative after the tail
2) Non-speculative nodes: slow, since they compute the route and allocate a channel per flit
  - Latency/throughput optimization: channel pre-allocation
  - Routing of the head is used to pre-allocate the correct output channel(s) for the body/tail
  - Body/tail flits are fast-forwarded on arrival

(Figures: header vs. body/tail flit handling.)
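The speculative node's per-flit mode switching can be sketched as follows. This is illustrative only (not the paper's RTL), and `correct_port` stands in for however the route is resolved downstream of the head: only the head flit is broadcast, body and tail follow the single correct port, and speculation is re-armed for the next packet after the tail.

```python
# Sketch (ours) of the multi-flit power optimization at a speculative node:
# broadcast the head, then drive only the resolved port for body/tail.

def spec_node_outputs(flits, correct_port):
    """Yield the output port(s) a speculative node drives for each flit of one
    packet; `correct_port` models the route resolved once the head went through."""
    for flit in flits:
        if flit == "head":
            yield ("left", "right")   # speculative broadcast
        else:
            yield (correct_port,)     # body/tail: non-speculative, one port
        # After the tail, the node reverts to speculative for the next packet.

packet = ["head", "body", "body", "tail"]
print(list(spec_node_outputs(packet, "left")))
# [('left', 'right'), ('left',), ('left',), ('left',)]
```

Only one redundant copy per packet (the head's) is ever created, so the power overhead of speculation no longer grows with packet length.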

Experimental Results

Experimental Setup

- Compare 5 new parallel multicast networks with the serial Baseline:
  - BasicNonSpeculative: tree-based multicast, unoptimized fanout nodes
  - BasicHybridSpeculative: local speculation, unoptimized fanout nodes
  - OptNonSpeculative: tree-based multicast, optimized fanout nodes
  - OptHybridSpeculative: local speculation, optimized fanout nodes
  - OptAllSpeculative: full (global) speculation, optimized fanout nodes
- Six 8x8 MoT networks: one for each configuration
  - Technology-mapped pre-layout implementation using structural Verilog
  - Implemented using the FreePDK Nangate 45 nm technology
- Six synthetic benchmarks
  - 3 unicast: Uniform Random (UR), Bit Permutation, and Hotspot
  - 3 multicast:
    - Multicast5/10: 5% or 10% of injected packets are multicast
    - Multicast_static: 3 sources perform multicast; the remaining perform UR unicast

Network Latency

- Latency measured at 25% saturation load of the respective network
- Significant improvements for the new hybrid networks over tree-based and Baseline:
  - 39.1-74.1% improvement for basic parallel tree-based multicast over the serial Baseline
  - Additional 10.5-21.4% improvement using local speculation (unoptimized/optimized)

(Charts: network latency for unicast and multicast benchmarks.)

Network Power

- Power measured at 25% saturation load of the Baseline
- The optimized network with local speculation (OptHybridSpeculative) has minimal overhead vs. the Baseline:
  - Both the tree-based and speculative (unoptimized) networks incur moderate (5.8-23.8%) overhead over the Baseline
  - The optimized speculative network (OptHybridSpeculative) reduces this overhead to 2.9-10.3% vs. the Baseline

(Charts: network power for unicast and multicast benchmarks.)

Different Degrees of Speculation: Effect on Network Power

- Power measured at 25% saturation load of the Baseline
- The fully-speculative network (OptAllSpeculative) incurs significant overhead due to global speculation:
  - The optimized network with local speculation (OptHybridSpeculative) has almost the same power as the non-speculative network
  - However, the fully-speculative network (OptAllSpeculative) incurs 14.7-22.9% extra power over the non-speculative network

(Charts: network power for unicast and multicast benchmarks.)

Conclusions and Future Work

- New parallel multicast approaches for asynchronous NoCs: the first general-purpose asynchronous NoC to support multicast
- New routing strategy, called Local Speculation, for parallel multicast
  - Fixed high-speed speculative switches: always broadcast; extremely simple and fast
  - Non-speculative switches: rapidly throttle incorrect traffic locally
  - Redundant copies restricted to neighboring local regions
- New hybrid network architecture: mixes speculative and non-speculative switches
- Experimental design-space exploration
  - The new basic tree-based parallel multicast network achieves 39.1-74.1% latency reduction over the unicast-based serial multicast baseline, with only small power overheads
  - The new local-speculation-based hybrid network additionally achieves 17.8-21.4% latency improvement over our basic tree-based solution, with small power reductions over that approach
- Future work: extend the approach to larger MoTs, the 2D-mesh topology, and synchronous NoCs