An Asynchronous NoC Router in a 14nm FinFET Library: Comparison to an Industrial Synchronous Counterpart

Similar documents
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Low-Power Interconnection Networks

The Design of the KiloCore Chip

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

Parallel Architectures

Lecture 3: Flow-Control

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator

Advances in Designing Clockless Digital Systems

High Performance Interconnect and NoC Router Design

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

Design-for-Test Approach of an Asynchronous etwork-on-chip Architecture and its Associated Test Pattern Generation and Application

The Xilinx XC6200 chip, the software tools and the board development tools

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema

Basic Low Level Concepts

Lecture: Interconnection Networks

The Design of Low-Latency Interfaces for Mixed-Timing Systems

Interconnection Networks

SoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3

International Journal of Research and Innovation in Applied Science (IJRIAS) Volume I, Issue IX, December 2016 ISSN

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Clocked and Asynchronous FIFO Characterization and Comparison

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Asynchronous on-chip Communication: Explorations on the Intel PXA27x Peripheral Bus

FABRICATION TECHNOLOGIES

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Design of Asynchronous Interconnect Network for SoC

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

A full asynchronous serial transmission converter for network-on-chips

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

TIMA Lab. Research Reports

CS250 VLSI Systems Design Lecture 9: Memory

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

Architecture without explicit locks for logic simulation on SIMD machines

III. RELATED WORK SWITCH ARCHITECTURES UNDER TEST

Quality-of-Service for a High-Radix Switch

Conquering Memory Bandwidth Challenges in High-Performance SoCs

DESIGN A APPLICATION OF NETWORK-ON-CHIP USING 8-PORT ROUTER

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

WITH THE CONTINUED advance of Moore s law, ever

Design and Analysis of Networks-on-Chip in Heterogeneous Multicore Systems. Young Jin Yoon

OCB-Based SoC Integration

Ting Wu, Chi-Ying Tsui, Mounir Hamdi Hong Kong University of Science & Technology Hong Kong SAR, China

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

Network-on-Chip Architecture

On Packet Switched Networks for On-Chip Communication

A Novel Energy Efficient Source Routing for Mesh NoCs

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

The Nostrum Network on Chip

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

THERE are few published methods to aid in automating

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

ECE/CS 757: Advanced Computer Architecture II Interconnects

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

ScienceDirect. Packet-based Adaptive Virtual Channel Configuration for NoC Systems

Place Your Logo Here. K. Charles Janac

On-chip Networks Enable the Dark Silicon Advantage. Drew Wingard CTO & Co-founder Sonics, Inc.

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

The Design and Implementation of a Low-Latency On-Chip Network

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

KiloCore: A 32 nm 1000-Processor Array

Ultra-Fast NoC Emulation on a Single FPGA

Digital Integrated Circuits

Built-In Self-Test for Regular Structure Embedded Cores in System-on-Chip

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Reconfigurable Cell Array for DSP Applications

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO. IRIS Lab National Chiao Tung University

Hybrid On-chip Data Networks. Gilbert Hendry Keren Bergman. Lightwave Research Lab. Columbia University

Performance Evaluation of Elastic GALS Interfaces and Network Fabric

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Buses. Disks PCI RDRAM RDRAM LAN. Some slides adapted from lecture by David Culler. Pentium 4 Processor. Memory Controller Hub.

Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre

A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

EECS 570 Final Exam - SOLUTIONS Winter 2015

Network on Chip Architecture: An Overview

Kyoung Hwan Lim and Taewhan Kim Seoul National University

Bandwidth Optimization in Asynchronous NoCs by Customizing Link Wire Length

Asynchronous Circuit Design

ISSN Vol.03, Issue.02, March-2015, Pages:

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS

Martin Dubois, ing. Contents

Asynchronous Bypass Channel Routers

Transcription:

An Asynchronous NoC Router in a 14nm FinFET Library: Comparison to an Industrial Synchronous Counterpart Weiwei Jiang Columbia University, USA Gabriele Miorandi University of Ferrara, Italy Wayne Burleson Advanced Micro Devices, USA Davide Bertozzi University of Ferrara, Italy Steven M. Nowick Columbia University, USA Greg Sadowski Advanced Micro Devices, USA ACM/IEEE Design, Automation and Test in Europe (DATE-17)

Motivation for Networks-on-Chip Future of computing is multi-core CPU: 8 to 24 cores widely available - AMD 16-core Opteron 6000 series - AMD Ryzen 4,6,8,+ cores - Intel 24-core Xeon-E7 - Intel Xeon Phi 80+ core AMD Ryzen 8-core Processor (March 2017) GPU: up to 2500-3500 graphics cores - AMD FirePro series: up to 2560 GCN Stream Processors - NVIDIA Titan X: 3584 CUDA Cores 1

Motivation for Networks-on-Chip (Cont.) NoC separates computation and communication Improves scalability - global interconnects have high latency and power consumption (e.g. buses and point-to-point wiring) Increases performance/energy efficiency - share wiring resources between parallel data flows Facilitates design reuse - optimized IPs can simply plug in largely decrease design efforts 2

Potential Advantages of Asynchronous Design No global clock No clock power less overall power than deeply clock-gated sync designs No clock design overhead no clock generation, distribution, skew analysis, etc. - [Gebhardt/Stevens et al., Comparing energy and latency of asynchronous and synchronous NoCs for embedded SoCs, NOCS-10] Greater flexibility/modularity Easily integrates multiple timing domains Supports reusable components - [Bainbridge/Furber, CHAIN: a delay-insensitive chip area interconnect, IEEE Micro-02] Lower system latency No per-router clock synchronization no waiting for clock - [Sheibanyrad/Greiner et al., Multisynchronous and fully asynchronous NoCs for GALS architectures, IEEE Design & Test of Computers-08] 3

Recent Commercial Asynchronous NoC Chips Intel s FM5000/6000 Ethernet switches [IEEE Design & Test 2015] - high performance: 640 Gbps max. bandwidth + 400 ns cut-through latency - support up to 176 ports IBM s TrueNorth neuromorphic chip [Science 2014] - a 5.4-billion-transitor chip with 4096 neurosynaptic cores - models 1M neurons and 256M synapses - ultra-low power: only 63 milliwatts with 400x240 video input at 30 frames/sec. STMicroelectronics STHORM processor [DAC-12] - A GALS computing accelerator for embedded SoCs - connect 4 clusters, each with 16 sync processors - improved performance efficiency over several Quadro and Nvidia GPUs 4

Contributions (1) First comparison for: async vs. commercial sync router in advanced technology Sync baseline is for high-end processors and graphics products - NoC handles system config and power/performance control Sync baseline uses aggressive clock optimization and finegrain clock gating Comparison in a 14nm FinFET library - not textbook academic technology library - state of the art CMOS technology used in commercial products Dominating results for asynchronous - in key metrics: area, latency and idle/active power 5

Contributions (2) Implementation and validation at pre- and post-layout results presented only for pre-layout (confidentiality reasons) Industrial tools used in async design and validation Functional validation tool (using Synopsys environment) - wrapper added for async design for sync environment re-use - used for both pre- and post-layout implementations Place & Route tool (using AMD s internal tool environment) - largely manual synthesis + automated P&R - expect automated logic synthesis can be included with reasonable efforts (e.g.,an existing solution is proposed in [Ghiribaldi/Bertozzi/Nowick DATE-13]) 6

Contributions (3) A novel async end-to-end credit-based Virtual Channel control scheme Key idea = lazy credit-update approach - credit-increments are queued and no immediate update - credit updated only with a credit-decrement - fewer backward credit synchronization to upstream router Potential increased throughput VC is required for practical industrial usage - many existing async NoCs do not include VCs Not the focus of this presentation (see paper for details) 7

Proposed Asynchronous Node Structure Local Terminal North Channel North Channel Router Local Interface North Interface Router West Channel West Interface Switch 1 Switch 0 East Interface East Channel West Channel Router for South Interface East Channel Router for South Channel South Channel Two identical and uncorrelated planes Follows AMD sync baseline router architecture 8

Proposed Asynchronous Node Structure (Cont.) Switch replication inside each plane - as many times as the number of VCs Local Terminal West Channel West Interface North Channel Local Interface North Interface Switch 1 Switch 0 North Channel East Interface East Channel For VC #0 traffic For VC #1 traffic West Channel Router for South Interface East Channel Router for South Channel South Channel 9

Node Operation 1 Example: data from west input -> east output De-mux data to a switch 3 Merge data from 2 VCs Local Terminal North Channel North Channel Local Interface North Interface Datain West Channel West Interface Switch 1 Switch 0 East Interface East Channel Dataout West Channel Router for South Interface East Channel Router for 2 South Channel Data traverses the switch { South Channel Header sets up the path Body/tail flits follow the pre-set up path 10

New Components in the Async Router Two new components added on previous DATE-13 async router Local Terminal North Channel Local Interface North Interface North Channel [Ghiribaldi/Bertozzi/Nowick DATE-13] Input interface: New high-performance Input buffer West Channel West Interface Switch 1 Switch 0 East Interface East Channel Switch 1 Switch 0 Input Interface Output Interface East Channel West Channel Router for South Channel South Interface Router for South Channel East Channel Output interface: New VC control Identical switches; new components in router interfaces 11

Input Buffer Circular FIFO: Forward Latency Default-open single D-latch register Default-open single D-latch register + XOR2 Forward latency: 2 x D Q latch delay + XOR2 + XOR4 Written-in data can be immediately read out (not aligned to clk cycle: much faster than a sync circular FIFO) 12

Input Buffer Circular FIFO: Storage Element 13 Each async storage element = single level-sensitive D-latch register - Each latch register has full storage capacity - Half area/power cost as a typical Flip-Flop storage in sync key source for performance/area/power benefits 13

Output Interface Design: Proposed VC Control Two d a ta inp ut c ha nnels: ea c h from a d ifferent VC a nd c orresp ond ing switc h (OPM) Ackout1 Ackout0 Reqin0 Reqin1 Datain0 Datain1 Blocks or allows output traffic for a particular VC L4 D Q E L3 D Q E full0 full0_valid Full Detector0 Timer 0 forced _clk0 Mutex Input Ctl0 zerowins mutex _req0 Mutex E D Q L1 Mutex Input Ctl1 mutex _req1 E D Q L2 onewins R_ Q Timer 1 forced _clk1 S Q DataMux sel full1 full1_valid Full Detector1 E D Q L5 E D Q L6 E D Q L7 D E Q Data Reg VC c ontrols: from the outp ut link Credit_increment0 Credit_increment1 Ackin Reqout Dataout Da ta outp ut c ha nnel: to the outp ut link Updates downstream credits only every time a flit is sent out (See details in the paper) 14

Design Validation Tool Async Router Design Wrapper Pre- or Post-layout netlist Synchronize async I/O data to a given clock (Ideal wrapper, not considering metastability) Standard Sync Simulator Re-used standard sync I/Os and benchmarks 15

Design Flow and Place & Route Tool Manually derive gate netlist Manually add inverter chains Manual Synthesis Manual Timing Correction Automated P&R Timing violations? Yes No Final Layout Standard sync P&R with don t touch everything Expect further synthesis automation can be included with reasonable effort - An async logic synthesis solution was proposed in [Ghiribaldi/Bertozzi/Nowick DATE-13] 16

Actual Layout for Asynchronous Router West channel pins East channel pins Local channel pins North channels pins Router config.: - double-plane router - 5 port + 2 VCs South channel pins 17

Experimental Results: Overview AMD commercial sync router vs. proposed async router Identical router configuration for both routers - 5-port + 2 VCs - buffer depth = 7 for each VC Pre-layout results only (for confidentiality reasons) - post-layout comparisons expected to be similar for small designs One testing benchmark: activating all switch ports - evenly distributed traffic from all inputs to all outputs - sufficient for initial router-level results Testing corner: 14nm FinFET library (0.65V, TT) Additional projected results for more complex routers 7-port router with 2 VCs 5-port router with 8 VCs for 3D stacking more realistic VC configuration 18

Basic comparison: 5-port router with 2 VCs Asynchronous router dominates in area, latency and power Comparison for 5-port router with 2 VCs Sync router Async router 55% lower 28% lower 88% lower 58% lower 19

Projected Results for More Complex Routers Absolute area and power costs are noticeably increased - due to higher radix or more VCs Relative asynchronous benefits are largely maintained Comparison for 5-port router with 8 VCs Sync 5-port 2 VCs Async 5-port 2 VCs Sync 7-port 2 VCs Async 7-port 2 VCs Sync 5-port 8 VCs Async 5-port 8 VCs 47% lower 28% lower 85% lower 51% lower 47% lower 16% lower 85% lower 51% lower Comparison for 7-port router with 2 VCs 20

Conclusions First async vs. commercial sync router in advanced library Sync router optimized for high-end products with fine-grain clock-gating Comparison in 14nm FinFET library Industrial tools for async design and validation Design validation tool: sync testing environments are largely re-used Manual synthesis + automated P&R - synthesis automation can be further included with some effort Shows opportunity for industrial asynchronous designs Some remaining tool challenges for full automation A novel async end-to-end credit-based VC control approach Lazy credit-update approach potential higher throughput Results: async router shows significant benefits In key metrics: area, latency and power 21