Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

Similar documents
Fast Flexible FPGA-Tuned Networks-on-Chip

CONNECT: Fast Flexible FPGA-Tuned Networks-on-Chip

CONNECT: Re-Examining Conventional Wisdom for Designing NoCs in the Context of FPGAs

Modern integrated circuits

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

BeiHang Short Course, Part 5: Pandora Smart IP Generators

Fast Scalable FPGA-Based Network-on-Chip Simulation Models

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Interconnection Networks

AUGMENTING FPGAS WITH EMBEDDED NETWORKS-ON-CHIP

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

Interconnection Networks

Lecture 6: Hard vs Soft Logic. James C. Hoe Department of ECE Carnegie Mellon University

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

Small Virtual Channel Routers on FPGAs Through Block RAM Sharing

Ultra-Fast NoC Emulation on a Single FPGA

A Statically Scheduled Time- Division-Multiplexed Networkon-Chip for Real-Time Systems

OASIS Network-on-Chip Prototyping on FPGA

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

DESIGN GUIDELINES FOR THE IMPLEMENTATION OF EMBEDDED NETWORK ON CHIP (NOC) IN FPGAS. Noha Gamal Mohamed

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Interconnection Networks

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Lecture 3: Topology - II

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Design of a System-on-Chip Switched Network and its Design Support Λ

Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization

Design Methodologies. Full-Custom Design

ECE/CS 757: Advanced Computer Architecture II Interconnects

NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION SIMPLIFIES THE DISTRIBUTION OF I/O DATA THROUGHOUT THE CHIP, AND IS ALWAYS

Introduction to the Qsys System Integration Tool

Lecture: Interconnection Networks

Networks-on-Chip Router: Configuration and Implementation

INTRODUCTION TO FPGA ARCHITECTURE

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing

FPGA Based Digital Design Using Verilog HDL

POLYMORPHIC ON-CHIP NETWORKS

Network-on-chip (NOC) Topologies

Network on Chip Architecture: An Overview

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre

Design Methodologies and Tools. Full-Custom Design

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

Low-Power Interconnection Networks

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

Networks-on-Chip for FPGAs: Hard, Soft or Mixed?

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Design and Implementation of A Reconfigurable Arbiter

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Internet Architecture and Protocol

Yet Another Implementation of CoRAM Memory

Part IV: 3D WiNoC Architectures

TDT Appendix E Interconnection Networks

Trading hardware overhead for communication performance in mesh-type topologies

ORCA FPGA- Optimized VectorBlox Computing Inc.

Lecture 2: Topology - I

INTERCONNECTION NETWORKS LECTURE 4

Multicycle-Path Challenges in Multi-Synchronous Systems

Lecture 7: Flow Control - I

4. Networks. in parallel computers. Advances in Computer Architecture

Evaluating Bufferless Flow Control for On-Chip Networks

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

A Novel Approach for Network on Chip Emulation

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

NoC Test-Chip Project: Working Document

FPGA-Accelerated Instrumentation

A closer look at network structure:

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

CS310 Embedded Computer Systems. Maeng

Design and Test Solutions for Networks-on-Chip. Jin-Ho Ahn Hoseo University

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator

Digital Systems Design. System on a Programmable Chip

An Efficient Network-on-Chip (NoC) based Multicore Platform for Hierarchical Parallel Genetic Algorithms

Real-Time Mixed-Criticality Wormhole Networks

Organic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Embedded Systems: Hardware Components (part II) Todor Stefanov

Stratix vs. Virtex-II Pro FPGA Performance Analysis

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Packet Switch Architecture

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques

Packet Switch Architecture

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3

Time-Multiplexed FPGA Overlay Networks on Chip

How Much Logic Should Go in an FPGA Logic Block?

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

INT G bit TCP Offload Engine SOC

Programmable NICs. Lecture 14, Computer Networks (198:552)

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Transcription:

This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs Michael K. Papamichael, James C. Hoe papamix@cs.cmu.edu, jhoe@ece.cmu.edu Computer Architecture Lab at Monterey, CA, February 2012

FPGAs and Networks-on-Chip (NoCs) Rapid growth of FPGA capacity and features Extended SoC and full-system prototyping FPGA-based high-performance computing Need for flexible NoCs to support communication Map existing ASIC-oriented NoC designs on FPGAs? 2

FPGAs and Networks-on-Chip (NoCs) Rapid growth of FPGA capacity and features Extended SoC and full-system prototyping FPGA-based high-performance computing Need for flexible NoCs to support communication Map existing ASIC-oriented NoC designs on FPGAs? Inefficient use of FPGA resources ASIC-driven NoC architecture not optimal for FPGA 3

FPGAs and Networks-on-Chip (NoCs) Rapid growth of FPGA capacity and features Extended SoC and full-system prototyping FPGA-based high-performance computing Need for flexible NoCs to support communication Map existing ASIC-oriented NoC designs on FPGAs? Inefficient use of FPGA resources ASIC-driven NoC architecture not optimal for FPGA FPGA-tuned NoC Architecture Embodies FPGA-motivated design principles Very lightweight, minimizes resource usage ~50% resource reduction vs. ASIC-oriented NoC Flexible NoC generator public release today! Often goes against ASIC-driven NoC conventional wisdom 4

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 5

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 6

NoC Terminology Overview Router A Virtual Channels Tail Flit Flits Packet T F F F F F Data Channel/Link Flow Control Link Head Flit H F Router B Virtual Channels Packets Basic logical unit of transmission Flits Packets broken into into multiple flits unit of flow control Virtual Channels Multiple logical channels over single physical link Flow Control Management of buffer space in the network 7

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 8

How FPGAs are Different from ASICs FPGAs peculiar HW realization substrate in terms of Relative cost of speed vs. logic vs. wires vs. memory Unique mapping and operating characteristics focuses on 4 FPGA characteristics: 1 2 3 4 Abundance of Wires Storage Shortage & Peculiarities Frequency Challenged Reconfigurable Nature FPGA characteristics uniquely influence key NoC design decisions 9

Tailoring NoCs to FPGAs 1 Abundance of of Wires Densely connected wiring substrate (Over)provisioned to handle worst case Wires are free compared to other resources Make datapaths and channels as wide as possible Adjust packet format --CONNECT-- NoC Implications E.g. carry control info on the side through dedicated links Adapt traditional credit-based flow control 10

Tailoring NoCs to FPGAs 2 Storage Abundance Shortage & of Peculiarities Wires Modern FPGAs offer storage in two forms Block RAMs and LUT RAMs (use logic resources) Only come in specific aspect ratios and sizes Typically in high demand, especially Block RAMs Minimize usage and optimize for aspect ratios and sizes Implement multiple logical flit buffers in each physical buffer Use LUT RAM for flit buffers --CONNECT-- NoC Implications Block RAM much larger than typically NoC flit buffer sizes Allow rest of design to use scarce Block RAM resources 11

Tailoring NoCs to FPGAs 3 Frequency Abundance Challenged of Wires Much lower frequencies compared to ASICs LUTs inherently slower that ASIC standard cells Large wire delays when chaining LUTs Rapidly diminishing returns when pipelining Deep pipelining hard due to quantization effects --CONNECT-- NoC Implications Design router as single-stage pipeline Also dramatically reduces network latency Make up for lower frequency by adjusting network E.g. increase width of datapath and links or change topology 12

Tailoring NoCs to FPGAs 4 Reconfigurable Abundance of Nature Wires Reconfigurable nature of FPGAs Sets them apart from ASICs Support diverse range of applications --CONNECT-- NoC Implications Support extensive application-specific customization Flexible parameterized NoC architecture Automated NoC design generator (demo!) Adhere to standard common interface NoC appears as plug-and-play black box from user-perspective 13

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 14

LUTs w/ FPGA RTL opts. LUTs CONNECT vs. ASIC-Oriented RTL 16-node 4x4 Mesh Network-on-Chip (NoC) SOTA: state-of-the-art high-quality ASIC-oriented RTL* CONNECT: identically configured -generated RTL FPGA Resource Usage (same router/noc configuration) 9K 8K 7K 60K 50K 6K 5K 4K 40K 30K 50% 3K 2K 1K 50% 20K 10K Single Router 4x4 Mesh Noc *NoC RTL from http://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/resources/router 15

LUTs w/ FPGA RTL opts. LUTs Avg. Packet Latency (in ns) CONNECT vs. ASIC-Oriented RTL 16-node 4x4 Mesh Network-on-Chip (NoC) SOTA: state-of-the-art high-quality ASIC-oriented RTL* CONNECT: identically configured -generated RTL 9K 8K 7K FPGA Resource Usage (same router/noc configuration) 6K 40K 50% 5K ~50% lower LUT 30K Usage 4K 3K 2K 1K 50% Single Router 60K 50K 20K 10K 4x4 Mesh Noc 500 400 300 200 100 Network Performance (uniform random traffic @ 100MHz) similar bandwidth same LUT bugdet 4x BW 2x lower idle latency 2 4 6 8 Load (in Gbps) *NoC RTL from http://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/resources/router 16

Latency (in cycles) Latency (in cycles) CONNECT Sample Networks Four sample CONNECT Networks ( router, endpoint) 16 endpoints, 2/4 virtual channels, 128-bit datapath Ring Fat Tree Mesh High Radix All above networks are interchangeable from user perspective 40 Uniform Random Traffic 40 90% Neighbor Traffic 20 20 0.25 0.50 0.75 1 Load (in flits/cycle) 0.25 0.50 0.75 1 Load (in flits/cycle) 17

Latency (in cycles) Latency (in cycles) CONNECT Sample Networks Four sample CONNECT Networks ( router, endpoint) 16 endpoints, 2/4 virtual channels, 128-bit datapath Ring Fat Tree Mesh High Radix All above networks are interchangeable from user perspective Uniform Random Traffic 90% Neighbor Traffic 40 40 There 20 is no one-size-fits-all NoC! Tune 20 NoC to application. 0.25 0.50 0.75 1 0.25 0.50 0.75 1 please Load see (in paper flits/cycle) for more synthesis & performance Load results (in flits/cycle) 18

CONNECT NoC Generator Generates FPGA-tuned NoCs Used as part of larger system Coexist with rest of FPGA-resident components Lightweight Minimize FPGA resource usage Flexible Support rapid design space exploration Fully parameterized, topology-agnostic architecture Standard interface across different configurations Adjustable Trade off performance vs. area Public Release Demo! 19

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 20

Related Work FPGA-oriented NoC Architectures PNoC: lightweight circuit-switched NoC [Hilton 06] NoCem: simple router block, no virtual channels [Schelle 08] FPGA-related NoC Studies Analytical models for predicting NoC perf. on FPGAs [Lee 10] Effect of FPGA NoC params on multiproccesor system [Lee 09] Modify FPGA configuration circuitry to build NoC Metawire: use configuration circuitry as NoC [Shelburne 08] Time-division multiplexed wiring to enable new NoC [Francis 08] Commercial Interconnect Approaches ARM AMBA, STNoC, CoreConnect PLB/OPB, Altera Qsys, etc. 21

Conclusions Significant gains from tuning for FPGA FPGAs and ASICs have different design sweet spot CONNECT flexible, efficient, lightweight NoC Compared to ASIC-driven NoC, CONNECT offers Significantly lower network latency and ~50% lower LUT usage or 3-4x higher network performance Take advantage of reconfigurable nature of FPGA Tailor NoC to specific communication needs of application 22

Outline NoC Terminology (single-slide review) Approach Tailoring NoCs to FPGAs Results & Demo Related Work & Conclusion Public Release 23

Public Release http://www.ece.cmu.edu/~mpapamic/connect/ NoC Generator with web-based interface Supports multiple pre-configured topologies Includes graphical editor for custom topologies FreeBSD-like license (limited to non-commercial research use) Acknowledgments Derek Chiou, Daniel Becker & Stanford CVA group NSF, Xilinx, Bluespec Please come see me for demo! 24

Thanks! Questions?