Garnet2.0: A Detailed On-Chip Network Model Inside a Full-System Simulator

Similar documents
Unit Testing with VectorCAST and AUTOSAR

SCORPIO: 36-Core Shared Memory Processor

What s New in AppSense Management Suite Version 7.0?

Isilon InsightIQ. Version 2.5. User Guide

Lecture 12: SMART NoC

Lecture 13: System Interface

Lecture 3: Topology - II

EMC ViPR. User Guide. Version

Illumina LIMS. Software Guide. For Research Use Only. Not for use in diagnostic procedures. Document # June 2017 ILLUMINA PROPRIETARY

Lecture 3: Flow-Control

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 7: Flow Control - I

CS 153 Design of Operating Systems Spring 18

Membership Library in DPDK Sameh Gobriel & Charlie Tai - Intel DPDK US Summit - San Jose

CS 153 Design of Operating Systems

Ellucian ODS9.0 Upgrade Migrating from OWB to ODI. Amir Saleem Centennial College May 17, 2017

CS 153 Design of Operating Systems Spring 18

An Introduction to GPU Computing. Aaron Coutino MFCF

EXAMINATIONS 2010 END OF YEAR NWEN 242 COMPUTER ORGANIZATION

Standard. 8029HEPTA DataCenter. Because every fraction of a second counts. network synchronization requiring minimum space. hopf Elektronik GmbH

Agate User s Guide. System Technology and Architecture Research (STAR) Lab Oregon State University, OR

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom

EMC AppSync. User Guide. Version REV 01

DSCS6020: SQLite and RSQLite

VirtuOS: an operating system with kernel virtualization

Chapter 5 Network Layer

Lecture 2: Topology - I

Lecture 13: Traffic Engineering

The final datapath. M u x. Add. 4 Add. Shift left 2. PCSrc. RegWrite. MemToR. MemWrite. Read data 1 I [25-21] Instruction. Read. register 1 Read.

CS 153 Design of Operating Systems Spring 18

Review. A single-cycle MIPS processor

CAPL Scripting Quickstart

EC 513 Computer Architecture

CS 153 Design of Operating Systems Spring 18

Making Full Use of Multi-Core ECUs with AUTOSAR Basic Software Distribution

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

The extra single-cycle adders

Low-Power Interconnection Networks

Lecture 22: Router Design

CS 153 Design of Operating Systems Spring 18

Lecture 23: Router Design

Master for Co-Simulation Using FMI

CS 153 Design of Operating Systems Spring 18

DPDK on Microsoft Azure

Basic Low Level Concepts

5 Performance Evaluation

Computer User s Guide 4.0

TDT Appendix E Interconnection Networks

Networks An introduction to microcomputer networking concepts

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

DPDK s Best Kept Secret: Micro-benchmarks. M Jay DPDK Summit - San Jose 2017

On the Computational Complexity and Effectiveness of N-hub Shortest-Path Routing

EMC M&R (Watch4net ) Installation and Configuration Guide. Version 6.4 P/N REV 02

Content Safety Precaution... 4 Getting started... 7 Input method... 9 Using the Menus Use of USB Maintenance & Safety...

dss-ip Manual digitalstrom Server-IP Operation & Settings

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

L EGAL NOTICES. ScanSoft, Inc. 9 Centennial Drive Peabody, MA 01960, United States of America

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:

Comparison of memory write policies for NoC based Multicore Cache Coherent Systems

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

CS 153 Design of Operating Systems Spring 18

Enhanced Memory Management

rte_security: enabling IPsec hw acceleration

Local Run Manager Generate FASTQ Analysis Module

OpenSMART: An Opensource Singlecycle Multi-hop NoC Generator

webinar series

A Case for Low Frequency Single Cycle Multi Hop NoCs for Energy Efficiency and High Performance

LDAP Configuration Guide

CS 153 Design of Operating Systems Spring 18

Today s Lecture. Software Architecture. Lecture 27: Introduction to Software Architecture. Introduction and Background of

This chapter is based on the following sources, which are all recommended reading:

Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg

EMC VNX Series. Problem Resolution Roadmap for VNX with ESRS for VNX and Connect Home. Version VNX1, VNX2 P/N REV. 03

Chapter 4: Network Layer

EEC 483 Computer Organization

Pipelined van Emde Boas Tree: Algorithms, Analysis, and Applications

Tdb: A Source-level Debugger for Dynamically Translated Programs

BIS - Basic Package V4.6

Lecture 11: IPv6. CSE 123: Computer Networks Alex C. Snoeren. HW 2 due FRIDAY

TDT4255 Friday the 21st of October. Real world examples of pipelining? How does pipelining influence instruction

NEtwork-on-Chip (NoC) [3], [6] is a scalable interconnect

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

The Disciplined Flood Protocol in Sensor Networks

Ultra-Fast NoC Emulation on a Single FPGA

The single-cycle design from last time

(2, 4) Tree Example (2, 4) Tree: Insertion

EMC NetWorker Module for SAP

Lecture 4: Routing. CSE 222A: Computer Communication Networks Alex C. Snoeren. Thanks: Amin Vahdat

BIS - Basic Package V4.4

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

BIS - Basic package V4.2

BIS - Basic package V4.3

Intro To VLANS on the CRS Joshua Gray, Brian Vargyas Baltic Networks MUM 2016 CRS VLans

NextSeq 550. System Guide. For Research Use Only. Not for use in diagnostic procedures. Document # v04 May 2018 ILLUMINA PROPRIETARY

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Exceptions and interrupts

IoT-Cloud Service Optimization in Next Generation Smart Environments

Addressing in Future Internet: Problems, Issues, and Approaches

Review Multicycle: What is Happening. Controlling The Multicycle Design

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

Transcription:

Garnet2.0: A Detailed On-Chip Network Model Inside a Fll-System Simlator Tshar Krishna gem5 workshop ARM Research Smmit September 11, 2017 Assistant Professor School of ECE and CS Georgia Institte of Technology tshar@ece.gatech.ed http://synergy.ece.gatech.ed/tools/garnet

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 2

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 3

Networks-on-Chip 4

Introdction to NoCs Core + L1$ Core + L1$ Core + L1$ Core + L1$ A Core + L1$ Core + L1$ LD (A) R1 Req Rsp On-Chip Network Fwd L2$ L2$ L2$ L2$ L2$ L2$ 5

Modern NoCs Core L1 D$ L2$ L1 I$ Network Interface L3$/Directory Roter Network Interface converts cache messages (ctrl or data) into packets. Packets get broken down into one or more flits depending on NoC link width Tile 6

Network Architectre Topology Roting Flow Control Roter Microarchitectre 7

Topology: How to connect nodes with links ~Road Network 8

Roting: Which path shold a message take ~Series of road segments from sorce to destination 9

Flow Control: When does a message stop/proceed ~Traffic Signals / Stop signs at end of each road segment 10

Roter Microarchitectre: How to bild the roters ~Design of traffic intersection (nmber of lanes, algorithm for trning red/green) 11

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 12

Why model NoCs accrately? Case Stdy I: Directory vs. Broadcast Protocols Case Stdy II: Private vs. Shared L2 1.6 Fll-state Directory HyperTransport Token Coherence 1.2 Shared L2 Private L2 Normalized Rntime 1.4 1.2 1 0.8 0.6 0.4 0.2 0 Baseline NoC (Intel SCC) FANOUT + FANIN NoC (Krishna et al, MICRO 2011) SMART NoC (Krishna et al, HPCA 2013) 64-core CMP with different NoC Microarchitectres Normalized Rntime 1 0.8 0.6 0.4 0.2 0 Baseline NoC (Intel SCC) SMART NoC (Krishna et al, HPCA 2013) 64-core CMP with different NoC Microarchitectres Different NoC Microarchitectres may lead to different microarchitectral decisions and new design optimization opportnities 64-core CMP rnning PARSEC workloads in fll-system gem5. Average rntime plotted. 13

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 14

Garnet2.0 Detailed NoC Model Crrently part of Rby Memory System in gem5 Original version (5-stage pipeline) released in 2009 Developed by Niket Agarwal (crrently at Google) and myself New version (1-stage pipeline, more configrability) released in 2016 Resorces Sorce: src/mem/rby/network/garnet2.0 gem5 wiki page: www.gem5.org/garnet2.0 Dev patches + practice labs: http://synergy.ece.gatech.ed/garnet 15

Topology Each topology is a python file in configs/topologies/ Core + L1$ L2$ DMA Dir ExtLink (bi-directional) R R IntLink (ni-directional) Core + L1$ R R Dir Core + L1$ L2$ L2$ R R Step 1: Instantiate roters and connect to controllers via Ext Links Step 2: Instantiate Int Links and connect to roters in desired topology This is an example of MeshDirCorners_XY.py L2$ L2$ L2$ Core + L1$ Dir Core + L1$ Dir Core + L1$ 16

Topology Configrable Parameters Roter roter_latency (in cycles) Can be set per roter Defined in src/mem/rby/network/basicroter.py Link link_latency (in cycles) Can be set per link Defined in src/mem/rby/network/basiclink.py weight (i.e., link weight) To bias roting algorithm [later slides] src_otport (string) and dst_inport (string) Port direction (e.g., East ) Helps with readability of Config file + Adaptive roting algorithms bw_mltiplier (vale) Used by Rby s simple network model, NOT by Garnet Link bandwidth is set inside Garnet via the ni_flit_size parameter [later slides] 17

Defalt Topologies Spported Pt2Pt Crossbar Mesh Mesh_XY Mesh_westfirst MeshDirCorners_XY Clster 18

Roting Roting table Atomatically poplated based on topology All messages se shortest path In case of mltiple options, the path with the smaller weight is chosen Deterministic Roting Cstom Users can leverage otport/inport direction names associated with each port to implement cstom algorithms (say adaptive) 19

Deadlock Avoidance Deadlock: A condition in which a set of agents wait indefinitely trying to acqire a set of resorces D 0 1 x A C 3 w 2 v B Packet A holds bffer (in 1) and wants bffer v (in 2) Packet B holds bffer v (in 2) and wants bffer w (in 3) Packet C holds bffer w (in 3) and wants bffer x (in 0) Packet D holds bffer x (in 0) and wants bffer (in 1) 20

Deadlock-free Roting Algorithms Deadlocks may occr if the trns taken form a cycle Removing some trns can make the roting algorithm deadlock free XY Model West-First Trn Model 21

Deadlock-free Roting Algorithms Assign weights to bias which links sed first (to ensre no cyclic dependence) XY Roting: Always go X first, then Y DC 1 1 1 1 DA DB 2 2 2 2 1 1 2 2 1 1 SA SB SC 2 2 2 2 1 1 1 1 2 2 Mesh_XY 22

Deadlock-free Roting Algorithms Assign weights to bias which links sed first (to ensre no cyclic dependence) West-first Roting: Go W first, then N/S DC 1 1 2 2 DA DB 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 SA SB SC 1 1 2 2 Mesh_westfirst 23

Flow Control Virtal Channels Coherence Protocol reqires certain nmber of virtal networks / message classes to avoid protocol deadlocks This is the minimm nmber of VCs reqired Within each vnet, there can be more than one VC for boosting network performance Credits In Garnet, only one packet can se a VC inside a roter at a time VCs in vnets carrying control messages are 1-flit deep VCs in vnets carrying data (cacheline) messages are 4-5 flit deep Each VC conveys its bffer availability by sending credits to its pstream roter 24

Conventional VC Roter Microarch FLIT V N 0 V N 1 Rote Compte VC 0 VC 1 VC n Inpt Bffers VC 1 VC 2 VC Allocator SW Allocator BW: Bffer Write RC: Rote Compte VA: VC Allocation Inpt VCs arbitrate for otpt VCs (Inpt VCs at next roter) SA: Switch Allocation Inpt ports arbitrate for otpt ports BR: Bffer Read VC n Inpt Bffers Crossbar Switch ST: Switch Traversal LT: Link Traversal BW RC VA SA BR ST LT 25

Single-Cycle Roter Implementation Roter.ccàwakep() 1 3 2 Network Link.cc InptUnit.cc àwakep() SwitchAllocator.cc àwakep() * Arbitrate inports has_free_vc() OtptUnit.cc àwakep() Credit Link.cc VN0 VC 0 VC 1 VN1 VC 2 VC 3 VirtalChannel.cc * Bffer Write * Rote Compte 4 * Arbitrate otports * VC Allocate * send credit (schedle creditlink wakep next cycle) * Bffer Read (send flit to switch) select_free_vc() OtVCState.cc * Rcv Credit * Update State CreditLink.cc For mlti-cycle roter, add delay in flit before it can do SA CrossbarSwitch.ccàwakep() * Switch Traversal: Psh winner flits on link qee * Schedle otpt link wakep for next cycle NetworkLink.cc 26

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 27

System Organization Rby Memory System GarnetNetwork.cc Coherence Protocol Network Interface Network Link Interface is MessageBffers from Rby Roter Roter Stats Credit Link 28

Network Interface Msg_size decides nmber of flits Can add additional info in flit Core + L1$ Dir NI Core + L1$ <VN, msg> Flitisize VC Select credit L2$ DMA NI NI NI R R NI R R NI L2$ Cache Controller VN0 VN1 VC 0 VC 1 VC 2 VC 3 Ingress To/From Roter L2$ Core + L1$ NI NI NI Dir NI Core + L1$ NI L2$ Egress NetworkInterface.cc àwakep() 29

Garnet Configrable Parameters ni_flit_size Defalt = 16B (128b) à 1-flit ctrl, 5-flit data This sets the bandwidth of each physical link vcs_per_vnet Total VCs in each inport = nm_vnets * vcs_per_vnet bffers_per_data_vc Defalt = 4 bffers_per_ctrl_vc Defalt = 1 roting_algorithm Weight-based table or cstom Defined in: src/mem/rby/network/garnet2.0/garnetnetwork.py 30

Command Line Parameters src/mem/rby/network/ BasicRoter.py BasicLink.py Network.py garnet2.0/garnetnetwork.py Definitions and Defalt Vales rby/rby.py configs/ common/options.py network/network.py example/garnet_synth_test.py - Overrides defalt vales in Rby - All parameters in in these.py files can be specified from command line 31

Rnning Garnet with Rby Bild Rby Coherence Protocol scons bild/x86_moesi_hammer/gem5.opt PROTOCOL=MOESI_hammer Protocol determines nmber of message classes/virtal networks reqired Invoke garnet2.0 from command line with appropriate network parameters./bild/x86_moesi_hammer/gem5.opt configs/example/fs.py --network=garnet2.0 --topology=mesh_xy... 32

Rnning Garnet Standalone Bild the Garnet_standalone protocol scons bild/null/gem5.debg PROTOCOL=Garnet_Standalone Dmmy protocol jst for traffic injection via Garnet Synthetic Traffic tester (next slide) 3 Virtal Networks: vnet 0 and vnet 1 inject ctrl (1-flit) packets, vnet 2 injects data (5-flit) packets Rn ser-specified synthetic traffic./bild/null/gem5.debg configs/example/garnet_synth_traffic.py --network=garnet2.0 --topology=... \ --synthetic=niform_random \ --injectionrate=0.01 \ --nm-packets-max=100 \ 33

Garnet Synthetic Traffic Dmmy CPU Model [only works with Garnet_standalone protocol] src/cp/testers/garnet_synthetic_traffic Inject packets for ser-specified traffic pattern at ser-specified injection rate niform_random tornado bit_complement bit_reverse bit_rotation neighbor shffle transpose Ability to inject continosly or fixed nmber of packets, and/or for fixed time, and/or from fixed sorce and/or to fixed destination in one/more vnets Heavy cstomization very sefl for debgging All parameters described here: http://www.gem5.org/garnet_synthetic_traffic 34

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 35

Sample Simlation./bild/NULL/gem5.debg configs/example/garnet_synth_traffic.py --topology=mesh_xy --nm-cps=16 --nm-dirs=16 --mesh-rows=4 m5ot/config.ini 36

Otpt Stats m5ot/stats.txt Packet and Flit stats (#injected, #received, qeeing latency at NIC, and network latency) per VN and overall More stats can be added via GarnetNetwork.cc 37

Overview What are Networks-On-Chip (NoCs)? Why model NoCs accrately Garnet2.0 Configration Topology Roting Flow-Control Roter Microarchitectre System Integration Network Interface Network Parameters Rnning Garnet with Rby Rby Garnet Standalone Otpt Stats Extensions and FAQs 38

Extensions and FAQs System Level Modeling How can I integrate Garnet into my own simlator (or with the gem5 classic memory system)? This shold not be too hard - wold jst reqire some changes to the NI code on how it receives its inpts. If anyone wants to try it, I wold love to give pointers. How do I print a network trace? Yo can add some code to the NI. I have a patch on my website for reference. Can garnet read a network trace? Yo can rn Garnet in a standalone mode, and have it inject traffic from a trace instead of a fixed synthetic pattern. Alternately, try to se cp/testers/traffic_gen Does Garnet report NoC area and power nmbers? The otpt of Garnet can be fed into DSENT (Sn et al, NOCS 2012) which is present in ext/dsent. We will add an atomatic stats.txt parser for it soon. Can we model mltiple clock domains? The entire Rby memory system (inclding Garnet inside it) is one clock domain. To mimic a mlti-clock domain design, yo can schedle wakeps of slow roters intelligently at some mltiples of the clock rather than every cycle. 39

Extensions and FAQs Topology How do we model mltiple BW links in Garnet? Inherently, that wold reqire spport in the roter to manage mltiple flit sizes. Instead, yo can add mltiple links between the same nodes if yo want to model higher bandwidth. If they have the same weight, Garnet will randomly send over the two. Can we model a heterogeneos CPU-GPU system? Yes. The crrent AMD GPU model models a clster of CPUs connected to a GPU. Can we model indirect networks sch as Clos? Yes, there can be additional roters that are not connected to any controller and act prely as switches. Can we model large-scale HPC networks? Garnet can model any sized network. Yo can rn 256-node standalone simlations easily. However, beyond that gem5 cannot instantiate more directories (which it ses as destination nodes). I have a patch to rn 1024- node synthetic sims on my website. Bt these rn qite slowly. 40

Extensions and FAQs Roting How do we model an adaptive roting algorithm? If yo want to se internal NoC metrics (sch as nmber of credits at an otpt port) for making roting decisions, do not se table-based roting. Instead, set the roting-algorithm to cstom, and implement yor own roting fnction inside RotingUnit.cc. See otportcomptexy() for reference. Flow Control Can we implement alternate deadlock avoidance schemes (sch as escape VCs or dateline)? Yo can pdate the vc selection scheme inside SwitchAllocator to control which VCs get allocated. Microarchitectre Can we model variable nmber of VCs in each roter? Crrently the codebase is very tied to the global vcs_per_vnet parameter. If yo want to model variable nmber of VCs, one hack cold be to have everyone instantiate the same nmber of VCs, bt modify the VC select to never allocate certain VCs 41

Conclsions Garnet2.0 is an open-sorce research vehicle Use it and contribte to it! Being actively maintained by the following people Georgia Tech: Tshar Krishna AMD Research: Brad Beckmann, Onr Kayiran, Jieming Yin, Matt Porembas If yo have any qestions, email on gem5-sers or gem5-dev mailing lists Wold love to have it integrated with the classic memory model Usefl Resorces: www.gem5.org/interconnection_network www.gem5.org/garnet2.0 http://synergy.ece.gatech.ed/garnet Thank yo! 42