Dynamic Routing of Hierarchical On Chip Network Traffic

Similar documents
Deadlock-free XY-YX router for on-chip interconnection network

Basic Switch Organization

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Network-on-chip (NOC) Topologies

CS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control

Topologies. Maurizio Palesi. Maurizio Palesi 1

Linux System Administration

CS644 Advanced Networks

EC 513 Computer Architecture

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

Mesh Networks

Lecture 2: Topology - I

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

Topologies. Maurizio Palesi. Maurizio Palesi 1

Planning for Information Network

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Global Adaptive Routing Algorithm Without Additional Congestion Propagation Network

SEMESTER 2 Chapter 3 Introduction to Dynamic Routing Protocols V 4.0

Fast, Efficient, and Robust Multicast in Wireless Mesh Networks

Principles behind data link layer services

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

Network on Chip Architecture: An Overview

Computer Networks. Routing

Chapter 3. Introduction to Dynamic Routing Protocols. CCNA2-1 Chapter 3

Chapter 3 Part 2 Switching and Bridging. Networking CS 3470, Section 1

Basic Low Level Concepts

4. Networks. in parallel computers. Advances in Computer Architecture

DATA COMMUNICATOIN NETWORKING

Computer Networks. Routing Algorithms

AC : HOT SPOT MINIMIZATION OF NOC USING ANT-NET DYNAMIC ROUTING ALGORITHM

Course 6. Internetworking Routing 1/33

Principles behind data link layer services:

Principles behind data link layer services:

Performance of Multicast Traffic Coordinator Framework for Bandwidth Management of Real-Time Multimedia over Intranets

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

Lecture 3: Flow-Control

Address InterLeaving for Low- Cost NoCs

Multicomputer distributed system LECTURE 8

Packet Switch Architecture

Packet Switch Architecture

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

Network Calculus: A Comparison

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Fast-Response Multipath Routing Policy for High-Speed Interconnection Networks

Outline. Computer Communication and Networks. The Network Core. Components of the Internet. The Network Core Packet Switching Circuit Switching

Lecture 21. Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov.

Comparison of Deadlock Recovery and Avoidance Mechanisms to Approach Message Dependent Deadlocks in on-chip Networks

Open Shortest Path First (OSPF)

Lecture: Interconnection Networks

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

Sample Routers and Switches. High Capacity Router Cisco CRS-1 up to 46 Tb/s thruput. Routers in a Network. Router Design

Quality of Service Mechanism for MANET using Linux Semra Gulder, Mathieu Déziel

IP: Routing and Subnetting

Parameters of a On-Chip Network for Multi-Processor System-on-Chip Environments

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies

Design and Test Solutions for Networks-on-Chip. Jin-Ho Ahn Hoseo University

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

CS 43: Computer Networks Internet Routing. Kevin Webb Swarthmore College November 16, 2017

Lecture 3: Topology - II

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Kent State University

II. Principles of Computer Communications Network and Transport Layer

The Effects of Asymmetry on TCP Performance

From Routing to Traffic Engineering

Routing. 4. Mar INF-3190: Switching and Routing

Basic Idea. Routing. Example. Routing by the Network

Chapter 4: Network Layer

Unit 2 Packet Switching Networks - II

Optimal Internal Congestion Control in A Cluster-based Router

Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization

Routing by the Network

048866: Packet Switch Architectures

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control

OASIS Network-on-Chip Prototyping on FPGA

Fairness Example: high priority for nearby stations Optimality Efficiency overhead

Trading hardware overhead for communication performance in mesh-type topologies

CHAPTER 9: PACKET SWITCHING N/W & CONGESTION CONTROL

Summary of MAC protocols

A Literature Review of on-chip Network Design using an Agent-based Management Method

3. Evaluation of Selected Tree and Mesh based Routing Protocols

Fault-adaptive routing

Queuing Disciplines. Order of Packet Transmission and Dropping. Laboratory. Objective. Overview

Lecture 7: Flow Control - I

3D WiNoC Architectures

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts

Design of a System-on-Chip Switched Network and its Design Support Λ

Configuring QoS CHAPTER

QoS-Aware Hierarchical Multicast Routing on Next Generation Internetworks

ECE 697J Advanced Topics in Computer Networks

CS 43: Computer Networks. 24: Internet Routing November 19, 2018

EEC-484/584 Computer Networks

Ultra-Fast NoC Emulation on a Single FPGA

AVB Latency Math. v5 Nov, AVB Face to Face Dallas, TX Don Pannell -

Simulation & Performance Analysis of Mobile Ad-Hoc Network Routing Protocol

Routing Basics. What is Routing? Routing Components. Path Determination CHAPTER

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

Transcription:

System Level Interconnect Prediction 2014 Dynamic outing of Hierarchical On Chip Network Traffic Julian Kemmerer jvk@drexel.edu Baris Taskin taskin@coe.drexel.edu Electrical & Computer Engineering Drexel University 1 June 2014 Drexel VLSI Laboratory vlsi.ece.drexel.edu

Hierarchical On Chip Network Hierarchical NoC: Designed to alleviate congestion issues common to flat topologies. Accomplished using addition interconnect layers. Homogeneous layers (e.g. mesh placed over another mesh). Heterogeneous layers (e.g. pyramid shape). CMeshX2 Topology MeshX2 Topology 3 Level PyraMesh Topology * *. Manevich, I. Cidon, A. Kolodny, Handling global traffic in future CMP NoCs, Electrical and Electronics Engineers in Israel (IEEEI), Nov. 2012, pp. 1, 5, 14-17 2

outing of Hierarchical On Chip Network Traffic Known that hierarchical networks can suffer from an inability to balance network traffic across all layers. Long distance ('global') traffic tends to be a small portion of total traffic. However, bandwidth allocated to global traffic is significant (%50+). Common metric for dividing local v.s. global traffic is the Manhattan distance between source and destination nodes on the lowest hierarchy level. 3

Local outer ange From previous work* that builds off of Manhattan distance as metric for routing decisions. Uses subdivided lowest hierarchy layer (subnets). Subnets that are not allowed to communicate (without routing through upper hierarchy) are said to be hard-walled. 'Leaky-walls' that allow subnets to communicate were proposed as solution to congestion on upper hierarchy levels. Local router range determines when to use 'leaky' subnet boundaries. * A. More, Network-on-Chip (NoC) Architectures for Exa-scale Chip-Multi-Processors (CMPs), Ph.D. dissertation, ECE. Dept., Drexel. Univ., Philadelphia., PA, 2013. 4

Local outer ange (continued) Local outer ange (L), defined as: Integer distance from the destination node within which the source node injects packets into the bottom-level network instead of the top-level network. 5

Local outer ange Subnet Size: 2x2 Manhattan Distance = 2 L = 3 (continued) Case 1) If the destination node is within the same sub-net as the source node, then the traffic is deemed local enough to be routed on the lowest hierarchy level. DST SC 6

Local outer ange Subnet Size: 2x2 Manhattan Distance = 3 L = 3 Case 2) If the destination node is not in the same sub-net as the source core, and the the Manhattan distance between source and destination on the lowest hierarchy level is less than or equal to the L, then the lowest hierarchy level will be used. SC (continued) DST 7

Local outer ange (continued) DST Subnet Size: 2x2 Manhattan Distance = 4 L = 3 Case 3) If the destination node is not in the same sub-net as the source core, and the number of hops is greater than the L, the packet is routed onto the higher hierarchy level. SC 8

Local outer ange (continued) DST Previous work: Single static L value. Statically balanced traffic across hierarchy levels. Presented technique(s) examine the effect of: Various static L L < 6 values. Modification the L dynamically at run time. SC L >= 6 9

Experimental Setup Custom cycle-accurate NoC simulator. Three topologies examined: MeshX2 CMeshX2 MeshMesh Four traffic types examined: andom, Localization = 25%, 50%, and 75%. andom: Source and destination node chosen randomly. Localization Percentage: Percentage of traffic has destination node within two hops of source node. MeshX2 Topology CMeshX2 Topology 10

Effect of Static L Values Example case: Fixed CMeshX2 topology. Fixed injection rate and traffic pattern. Examined average throughput and delay. Maximum Throughput L = 12 Difference between best performing and worst performing L values: Throughput = 1.8X Delay = 3.4X Overall maximum performance difference: 1.8X Issues: Single L value is not optimal in both throughput and delay. Fixed injection rate. Minimum Delay L = 6 11

Performance Metric Combination of throughput and delay. Need a single 'optimal' L value. Performance defined as: Performance = Throughput / Delay (flits/cycle^2) If throughput increases and/or delay decreases the performance of the NoC has increased. 12

Static L and PI Sweep Each (L, PI) pair yields a single performance value. Can now identify maximally performing (L, PI) pair for each topology. However, link exists between PI and throughput. Not looking to optimize PI. Important Questions: At which PI does the sweep of L values yield the largest change in performance? At that PI, which L value yields the best performance? Topology Static PI Static (pkts/cycle) L MeshX2 0.18 3 CMeshX2 0.05 6 MeshMesh 0.06 8 13

Dynamic Local outer ange Ideally: L Control L NoC Performance Instantaneous knowledge of network wide performance. Use this as feedback for dynamic L control system. ealistically: L Control L NoC? Performance Estimation Estimated Performance 14

Dynamic L (continued) Issues: Centralized difficult to implement in hardware. '?' - What to pull from NoC that indicates performance? 'Performance Estimation' How to use that information to estimate performance? 'L Control' What control system will govern L changes? L Control L NoC? Performance Estimation Estimated Performance 15

Circumventing Centralized Control Systems Standard NoC Tile: outer + Processing Element X/TX X/TX Processing Element outer X/TX X/TX X/TX 16

Circumventing Centralized Control Systems (continued) Distributed L control: X/TX X/TX X/TX X/TX L Processing Element outer X/TX L Control 17

L Control System Parameters AFS - Average Free Slots (buffer space) available in neighboring routers. equires connection to neighboring routers. Value is average buffer space available across all connected neighbors. APQ - Average Packets in processing element Queue. Value is average over all virtual channel queues. These values should indicate, to some degree, network congestion at and around a specific NoC tile. 18

Circumventing Centralized Control Systems (continued) Actual System Implementation: Dynamic and distributed control. FS: Free Slots APQ: Average packets in processing element queue X/TX FS FS FS FS X/TX APQ L X/TX X/TX Processing Element X/TX outer FS L Control FS FS 19

Dynamic L Control System (continued) emaining Issues: 'Performance Estimation' How to use AFS/APQ information to estimate performance? 'L Control' What control system will govern L changes? PE PE L CTL L CTL PE PE L CTL L CTL 20

Performance Estimation Using AFS and APQ Perf(AFS) =?, Perf(APQ) =? Each L CTL block should be identical. PE TIME AFS APQ PEF 1318 2 0.125 0.1 1319 1.75 0.0 0.2 1320 2 0.125 0.1... PE TIME AFS APQ PEF 1318 1.0 0.0 0.2 1319 1.75 0.125 0.2 1320 2 0.125 0.1... Use data from NoC simulator. ecord run time values (AFS, APQ, Performance). PE TIME AFS APQ PEF 1318 1.0 0.125 0.1 1319 1.0 0.0 0.5 1320 2 0.125 0.1... PE TIME AFS APQ PEF 1318 2 0.5 0.1 1319 2 0.5 0.2 1320 2 0.125 0.1... epeat for all four traffic conditions. 21

Performance Estimation Using AFS and APQ (continued) Performance as function of APQ for MeshX2 Topology. Performance as function of AFS for MeshX2 Topology. These Perf(APQ) and Perf(AFS) relationships are required for a L control system. 22

Performance Estimation Using AFS and APQ (continued) Performance as function of APQ for MeshX2 Topology. Performance as function of AFS for MeshX2 Topology. Data from all traffic patterns combined. egression lines shown. 23

L Control Estimated performance now available to L Control blocks. Control Algorithm: How and how often to adjust L values? L Control AFS Performance Estimation L Control System Est. Perf. L Control Algorithm APQ 24

L Control Algorithm #Inputs: L, prev_l, # perf, prev_perf, counter #Initialization # L = 3 # prev_l = L 1 if counter % n == 0: #Update delta = L prev_l prev_l = L if (perf prev_perf) > 0: #Performance increase L += delta else if(perf prev_perf) < 0: #Performance decrease, #2 steps back to make up for mistake L = 2*delta #But keep old local router range #at just one step back prev_l = L + delta ΔL v.s. ΔPerf Optimal static L; Topology specific Frequency of operation epeat performance increasing changes; emove performance decreasing changes 25

Algorithm Evaluation Frequency 'n' Value Sweep 'n' : How often the adjust the L value. Topology Average Optimal 'n' Value Max atio Max Perf. / Min Perf. MeshX2 1536 1.18 CMeshX2 1025 2.66 MeshMesh 1024 1.24 Performance as function of 'n' parameter for MeshMesh topology. 26

Comparison to Static L Dynamic L v.s. optimal static L. atio of dynamic performance / static performance. Dynamic L / static L performance ratio for MeshX2 topology. 27

Comparison to Static L (continued) Dynamic L / static L performance ratio for CMeshX2 topology. 28

Comparison to Static L (continued) Dynamic L / static L performance ratio for MeshMesh topology. 29

Summary and Conclusions Topology Maximum Perf. atio MeshX2 1.37 CMeshX2 1.22 MeshMesh 1.07 Local router range as a mechanism for hierarchical routing decisions Optimal static L Non-optimal v.s. optimal static L performance: 1.8X Dynamic routing without being a dynamic routing algorithm Deadlock free Decentralized control Minimal design overhead Dynamic L performance > Static L MeshX2 and CMeshX2 topologies At all packet injection rates For all traffic types Maximum dynamic v.s. optimal static L performance difference: 1.37X Maximum performance increase over non-optimal static L: 1.8 * 1.37 = 2.47X 30