A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

Zhi-Liang Qian and Chi-Ying Tsui
VLSI Research Laboratory, Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology
ASP-DAC 2011, Jan 27th, Yokohama, Japan

Outline
- Introduction
- Motivation
- Application-specific and thermal-aware routing overview
- Proposed routing algorithm
- Router microarchitecture
- Experimental results
- Conclusions

Network-on-Chip (NoC)
- A network-on-chip (NoC) is a scalable and modular solution for multiprocessor system-on-chip (MPSoC) design.
- Advantages of NoC: scalability, latency, power consumption, throughput, reliability, etc.
[Figure: NoC-based multiprocessor system with CPU1 to CPU4 and DMA blocks]

Application-specific NoC
- The NoC is designed for a given target application domain.
- The application is characterized by a given communication task graph (CTG).
- Traffic information (communication pairs and volumes) is obtained through profiling.
[Figure: an example task graph and tile mapping for the VOPD (Video Object Plane Decoder) application]

Design Challenges for NoC
- Design constraints and objectives: energy and power consumption, latency and throughput, bandwidth requirement, hardware implementation, etc.
- Temperature and peak power have become the dominant constraints.
- A temperature hotspot causes 1) degraded performance and 2) reduced reliability.
[Figure: a typical runtime thermal profile]

Application-specific NoC Design Flow
- Task scheduling and allocation: energy-aware task scheduling/mapping [1]
- Mapping and floorplanning: energy-, bandwidth-, and thermal-aware placement [2]
- Routing design: this work
[Figure: routing design step with Path 1, Path 2, and Path 3 and tiles T1 and T3]
[1] D. Bertozzi et al., "NoC synthesis flow for customized domain specific multiprocessor system-on-chip," IEEE Transactions on Parallel and Distributed Systems, 16(2), pp. 113-129, 2005.
[2] J. Hu et al., "Energy- and performance-aware mapping for regular NoC architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4), pp. 551-562, 2005.

Previous Work
- Network-on-chip routing algorithm design: fault-tolerant routing [3], bandwidth-aware routing [4].
  Limitation: temperature and thermal issues are not taken into consideration.
- Thermal-aware NoC routing: ant-colony routing algorithm [5], thermal-region-based routing [6].
  Limitations: a generic routing algorithm is used; [5] needs complex control schemes; [6] has a deadlock avoidance issue.
[3] D. Fick et al., "A highly resilient routing algorithm for fault-tolerant NoCs," in Proc. DATE, pp. 21-26, 2009.
[4] M. Palesi et al., "Bandwidth-aware routing algorithms for networks-on-chip platforms," IET Computers & Digital Techniques, 3(5), pp. 413-429, 2009.
[5] M. Daneshtalab et al., "NoC Hot Spot minimization Using AntNet Dynamic Routing Algorithm," in Proc. ASAP, pp. 33-38, 2006.
[6] L. Shang et al., "Temperature-Aware On-chip Networks," IEEE Micro, 26(1), pp. 130-139, 2006.

Adaptive Routing
- Deterministic routing: only one path is provided for every communication pair (with XY routing, Path 1, Path 3, and Path 5 in the figure). It is simple but may introduce congestion and hotspots.
- Adaptive routing: several paths are available and one is selected dynamically within the router, so traffic is distributed more evenly.
- This work adopts adaptive, minimal-path routing to reduce the hotspot temperature while maintaining latency and throughput.
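The difference between the single XY route and the full set of minimal routes can be sketched in a few lines of Python. This is an illustration only, not code from the talk: tiles are assumed to be addressed as (x, y) mesh coordinates, and the function names `xy_route` and `minimal_paths` are hypothetical.

```python
from itertools import permutations


def xy_route(src, dst):
    """Deterministic XY routing: finish all X hops first, then the Y hops."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def minimal_paths(src, dst):
    """All minimal mesh paths: every distinct ordering of the required hops."""
    (x, y), (dx, dy) = src, dst
    step_x = (1 if dx > x else -1, 0)
    step_y = (0, 1 if dy > y else -1)
    moves = [step_x] * abs(dx - x) + [step_y] * abs(dy - y)
    paths = set()
    for order in set(permutations(moves)):
        cx, cy = x, y
        path = [(cx, cy)]
        for mx, my in order:
            cx, cy = cx + mx, cy + my
            path.append((cx, cy))
        paths.add(tuple(path))
    return sorted(paths)


src, dst = (0, 0), (2, 1)
print("XY route:", xy_route(src, dst))
print("minimal paths an adaptive router may choose from:", len(minimal_paths(src, dst)))
```

An adaptive minimal router may pick any of the enumerated paths hop by hop, whereas XY routing always returns the same one.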

A Motivation Example
- The routing algorithm can be exploited to reduce the hotspot temperature.
- The communication network consumes a significant share of the power budget (e.g. 39% in [8]).
[8] S. Vangal et al., "A 5.1GHz 0.34mm2 router for Network-on-Chip Applications," in Proc. IEEE Symposium on VLSI Circuits, pp. 42-43, 2007.

Overview of the Proposed Application-specific Routing Algorithm
[Figure: design flow, summarized below]
1. Application-specific design input: task graph, scheduling and allocation results.
2. Build the channel dependency graph (CDG).
3. Run the cycle removing algorithm for deadlock avoidance (the figure marks the CDG edges to be removed, e.g. Re_Edge1 and Re_Edge2).
4. Application-specific, deadlock-avoiding path set finding under the deadlock-free constraints, which yields a set of admissible paths.
5. Linear programming engine for offline traffic allocation. Inputs: the admissible paths, the packet injection rate of each communication pair, and the communication bandwidth. Output: traffic ratios.
6. Routing table update.
7. NoC implementation.

Deadlock-free Path Set Finding Algorithm
- Using application traffic information can improve the adaptivity.
[Figure: communication graph and topology graph over tiles P0 to P5, and the application channel dependency graph (CDG) with nodes L0_1, L1_0, L1_2, L2_1, L3_0, L0_3, L4_1, L1_4, L5_2, L2_5, L3_4, L4_3, L4_5, L5_4]
Number of minimal paths available in the example:
Routing            | Paths
Total minimal      | 18
Application-specific | 16
West-first         | 13
North-last         | 14
Negative-first     | 15
Odd-even           | 14

Deadlock-free Path Set Finding Algorithm
Here we use an application-specific, deadlock-free path set finding procedure similar to that used in [7]:
1. Find the cycles in the CDG.
2. Apply the cycle removing algorithm.
3. Obtain the deadlock-free path set.
[7] M. Palesi et al., "Application Specific Routing algorithms for Network on chip," IEEE Transactions on Parallel and Distributed Systems, 20(3), pp. 316-330, 2009.
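The first step, finding a cycle in the channel dependency graph, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [7]: a channel is modeled as a directed link `(u, v)`, a dependency exists when some path uses link `(u, v)` immediately followed by `(v, w)`, and the names `build_cdg` and `find_cycle` are hypothetical.

```python
def build_cdg(paths):
    """Channel dependency graph: node = directed link, edge = consecutive links on a path."""
    cdg = {}
    for path in paths:
        links = list(zip(path, path[1:]))
        for link in links:
            cdg.setdefault(link, set())
        for first, second in zip(links, links[1:]):
            cdg[first].add(second)
    return cdg


def find_cycle(cdg):
    """Return one dependency cycle as a list of links (depth-first search), or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in cdg}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in cdg[node]:
            if colour[nxt] == GREY:          # back edge closes a cycle
                return stack[stack.index(nxt):]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in cdg:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# 2x2 mesh, tiles numbered 0 1 / 2 3: four paths that all turn clockwise.
clockwise = [[0, 1, 3], [1, 3, 2], [3, 2, 0], [2, 0, 1]]
print(find_cycle(build_cdg(clockwise)))        # a four-link cycle, so deadlock is possible
print(find_cycle(build_cdg(clockwise[:3])))    # dropping one path breaks the cycle
```

Removing a dependency edge forbids every path that uses that pair of consecutive links, which is why the choice of edges to remove affects how many paths remain for each communication pair.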

Deadlock-free Path Set Finding Algorithm
We modified the cost function from [7] to maximize the flexibility to re-divert traffic, so that the power distribution can be evened out:

$$\max \bar{\alpha} = \max \frac{1}{|C|} \sum_{c \in C} \alpha_c w_c = \max \frac{1}{|C|} \sum_{c \in C} \frac{|\Phi_{S_{edge}}(c)|}{|\Phi(c)|}\, w_c$$

where
- $C$: the set of communication pairs in the application; $c \in C$ is one communication pair
- $\alpha_c$ = (number of paths provided for $c$) / (number of total paths existing in the network): adaptivity of the communication pair $c$
- $\bar{\alpha}$: average adaptivity of the communication
- $w_c$: weight of communication pair $c$
- $S_{edge}$: the set of edges to be removed in the channel dependency graph (CDG) to break cycles
- $\Phi(c)$: set of all minimal paths for communication $c$
- $\Phi_{S_{edge}}(c)$: set of all minimal paths for communication $c$ after the edges in $S_{edge}$ are removed
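The adaptivity measure can be evaluated directly once the path sets are known. The sketch below assumes the weighted-average reading of the cost function, with all weights defaulting to 1; the function name `average_adaptivity`, the tile numbering of the 2x3 mesh (P0 P1 P2 over P3 P4 P5), and the example path sets are illustrative choices rather than data from the talk.

```python
def average_adaptivity(all_paths, allowed_paths, weights=None):
    """Mean over pairs c of w_c * |allowed(c)| / |all(c)|.

    all_paths[c]     -- every minimal path for pair c
    allowed_paths[c] -- minimal paths left after the CDG edges were removed
    weights[c]       -- optional per-pair weight (assumed 1.0 if omitted)
    """
    pairs = list(all_paths)
    if weights is None:
        weights = {c: 1.0 for c in pairs}
    total = sum(
        weights[c] * len(allowed_paths[c]) / len(all_paths[c]) for c in pairs
    )
    return total / len(pairs)


all_paths = {
    ("P0", "P5"): [("P0", "P1", "P2", "P5"), ("P0", "P1", "P4", "P5"), ("P0", "P3", "P4", "P5")],
    ("P3", "P2"): [("P3", "P0", "P1", "P2"), ("P3", "P4", "P1", "P2"), ("P3", "P4", "P5", "P2")],
}
# Hypothetical outcome of cycle removal: one P0->P5 path is no longer allowed.
allowed_paths = {
    ("P0", "P5"): all_paths[("P0", "P5")][:2],
    ("P3", "P2"): all_paths[("P3", "P2")],
}
print(average_adaptivity(all_paths, allowed_paths))  # (2/3 + 3/3) / 2
```

A candidate edge set that forbids fewer paths scores closer to 1, so comparing this value across candidate edge sets is one way to rank them.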

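Optimal Traffic Ratio Calculation
Router energy consumption model:

$$\Delta E_{r_i} = (E_{buffer\_rw} + E_{forward}) \cdot S_{packet} + E_{rc} + E_{sel} + E_{vc}$$

where
- $\Delta E_{r_i}$: energy consumption for routing a single packet
- $E_{buffer\_rw}$: buffer read and write
- $E_{forward}$: forwarding a single flit
- $S_{packet}$: packet size
- $E_{rc}$: routing computation
- $E_{sel}$: output port selection
- $E_{rc} + E_{sel} + E_{vc}$: terms associated with the wormhole head flit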

Optimal Traffic Ratio Calculation
Problem formulation for the optimal traffic ratio.
Variables: $r(i, j, k)$ is the ratio of using the $k$-th path for sending packets between tile $i$ and tile $j$.
Example: three deadlock-free paths are available for P0->P5, with ratios r(0,5,1), r(0,5,2), and r(0,5,3):
- Path 1: P0->P1->P2->P5
- Path 2: P0->P1->P4->P5
- Path 3: P0->P3->P4->P5
[Figure: 2x3 mesh P0 to P5 with the three paths]
Tile energy:

$$E_i = E_{p_i} + \Delta E_{r_i} \sum_{(a,b,k)\,:\,T_i \in \text{path}(a,b,k)} r(a,b,k)\, p(a,b)$$

The sum runs over the paths $(a, b, k)$ passing through tile $i$, and $p(a, b)$ is the traffic rate from source $a$ to destination $b$.
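The two energy expressions can be tried out on the P0->P5 example. The sketch below is illustrative only: every numeric value is made up, the processor energy is set to zero to isolate the router contribution, a second flow P1->P2 is added as hypothetical background traffic, and the second P0->P5 path is taken to be P0->P1->P4->P5, the remaining minimal route in the 2x3 mesh.

```python
def packet_energy(e_buffer_rw, e_forward, s_packet, e_rc, e_sel, e_vc):
    """Router energy per packet: per-flit terms scale with the packet size,
    the remaining terms are paid once per packet."""
    return (e_buffer_rw + e_forward) * s_packet + e_rc + e_sel + e_vc


def tile_energies(flows, e_proc, delta_e_r):
    """E_i = E_p_i + delta_E_r * sum of r(a,b,k) * p(a,b) over paths through tile i.

    flows maps (a, b) -> (traffic rate, [(path, ratio), ...]).
    """
    energy = dict(e_proc)
    for rate, routes in flows.values():
        for path, ratio in routes:
            for tile in path:
                energy[tile] += delta_e_r * ratio * rate
    return energy


TILES = ["P0", "P1", "P2", "P3", "P4", "P5"]
E_PROC = {tile: 0.0 for tile in TILES}                   # assumption: ignore processor energy
DELTA_E_R = packet_energy(1.0, 0.5, 4, 0.2, 0.1, 0.1)    # made-up energy units

PATHS_0_5 = [
    ["P0", "P1", "P2", "P5"],
    ["P0", "P1", "P4", "P5"],
    ["P0", "P3", "P4", "P5"],
]


def flows_with(ratios):
    return {
        ("P0", "P5"): (0.3, list(zip(PATHS_0_5, ratios))),
        ("P1", "P2"): (0.3, [(["P1", "P2"], 1.0)]),       # hypothetical background flow
    }


xy_only = tile_energies(flows_with([1.0, 0.0, 0.0]), E_PROC, DELTA_E_R)
diverted = tile_energies(flows_with([0.0, 0.0, 1.0]), E_PROC, DELTA_E_R)
print("peak tile energy, all traffic on Path 1:", max(xy_only.values()))
print("peak tile energy, all traffic on Path 3:", max(diverted.values()))
```

In this toy setting, moving the P0->P5 traffic away from the tiles already loaded by the background flow halves the peak, which is the effect the traffic ratios are meant to exploit.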

Optimal Traffic Ratio Calculation
LP problem formulation for the optimal traffic ratio.
Objective function:

$$obj = \min\left(\max_i E_i\right)$$

Problem constraints:
- Traffic splitting constraints: the traffic allocation ratios between a given pair $(i, j)$ must sum to one.

$$\sum_{k=1}^{L_{i,j}} r(i, j, k) = 1 \quad \forall (i, j) \in C$$

$$r(i, j, k) \ge 0 \quad \forall i, j \in [1, N],\ k \in [1, L_{i,j}]$$

- Bandwidth constraints: the aggregate bandwidth on a link must not exceed the link capacity.

$$\sum_{(a,b,k)\,:\,T_i \to T_j \in \text{path}(a,b,k)} r(a, b, k)\, p(a, b)\, S_{packet\_bit} \le C_{ij}$$
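A min-max objective is not linear as written, but it becomes a linear program with the standard epigraph trick: introduce an auxiliary variable that upper-bounds every tile energy and minimize it. The variable $t$ below is an assumption of this sketch and does not appear on the slides; since each $E_i$ is linear in the ratios $r$, every constraint is linear.

$$
\begin{aligned}
\min_{r,\,t} \quad & t \\
\text{s.t.} \quad & E_i \le t, && i = 1, \dots, N \\
& \sum_{k=1}^{L_{i,j}} r(i, j, k) = 1, && \forall (i, j) \in C \\
& r(i, j, k) \ge 0, && \forall (i, j) \in C,\ k = 1, \dots, L_{i,j} \\
& \text{bandwidth on each link} \le C_{ij}
\end{aligned}
$$

At the optimum $t$ equals the largest $E_i$, so the two formulations have the same solution.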

Converting and Combining the Path Ratios
- Routing tables are used for routing; routing decisions are made locally within each router.
- For minimal-path routing, at most two candidate output ports are available.
- The path ratios are converted into local probabilities, which are stored in the routing table for output port selection.
- Two routing table formats are considered: source-destination pair, and destination only.
[Figure: routing table format in P4 (source-destination pair) with columns source id, input port, destination id, output port, and ratio, shown for Path 1, Path 2, and Path 3 on a mesh of tiles P0 to P8]
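One plausible way to turn end-to-end path ratios into per-router port probabilities is to normalize, at each router, the ratio mass leaving through each next hop by the ratio mass passing through that router. The sketch below shows this for a single source-destination pair; it is an assumption about the conversion, not the exact rule from the talk, and `port_probabilities` is a hypothetical name.

```python
from collections import defaultdict


def port_probabilities(paths_with_ratio):
    """For one (source, destination) pair, map router -> {next hop: probability}."""
    through = defaultdict(float)                     # ratio mass passing through a router
    via = defaultdict(lambda: defaultdict(float))    # ratio mass per next hop
    for path, ratio in paths_with_ratio:
        for here, nxt in zip(path, path[1:]):
            through[here] += ratio
            via[here][nxt] += ratio
    return {
        here: {nxt: mass / through[here] for nxt, mass in hops.items()}
        for here, hops in via.items()
        if through[here] > 0
    }


# Hypothetical LP result for P0->P5 on a 2x3 mesh (P0 P1 P2 over P3 P4 P5).
ratios = [
    (["P0", "P1", "P2", "P5"], 0.50),
    (["P0", "P1", "P4", "P5"], 0.25),
    (["P0", "P3", "P4", "P5"], 0.25),
]
for router, hops in port_probabilities(ratios).items():
    print(router, hops)
```

Multiplying the local probabilities along a path recovers its end-to-end ratio (for the first path, 0.75 at P0 times 2/3 at P1 gives 0.5). A destination-only table would have to merge the entries of all sources sharing a destination, which this sketch does not attempt.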

Router Microarchitecture
Output selection hardware design:
    if output_1 and output_2 are both available:
        τ = output of LFSR
        if τ < p(o1): return o1
        else: return o2
    if only one port is available:
        return the available one
- A pseudo-random number generator based on a linear feedback shift register (LFSR) is employed.
- If an output port is not available for routing, for example because of limited buffer space, the back-pressure signal disables the corresponding port from selection.
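The selection logic can be modeled in software as follows. This is a behavioral sketch, not the authors' hardware: the 16-bit width, the tap polynomial x^16 + x^14 + x^13 + x^11 + 1, the seed, and the fixed-point threshold are all assumptions, and returning `None` when both ports are blocked stands in for a stall.

```python
class LFSR16:
    """16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11."""

    def __init__(self, seed=0xACE1):
        self.state = seed & 0xFFFF
        if self.state == 0:
            raise ValueError("seed must be non-zero, otherwise the LFSR locks up")

    def step(self):
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state


def select_output(lfsr, p_o1, o1_free, o2_free):
    """Pick o1 with probability about p_o1 when both ports are free."""
    if o1_free and o2_free:
        tau = lfsr.step()
        threshold = int(p_o1 * 65536)    # p(o1) as a 16-bit fixed-point value
        return "o1" if tau < threshold else "o2"
    if o1_free:
        return "o1"
    if o2_free:
        return "o2"
    return None                          # both ports back-pressured


lfsr = LFSR16()
picks = [select_output(lfsr, 0.75, True, True) for _ in range(65535)]
print("fraction routed to o1:", picks.count("o1") / len(picks))
```

Comparing the LFSR state against a stored threshold needs only a shift register and a comparator, which keeps the probabilistic selection cheap enough to sit in the router's output selection stage.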

Experimental Results
Simulation environment setup:
- A C++ program was developed for the thermal-aware routing algorithm.
- A cycle-accurate, flit-based NoC simulator, extended from Noxim, is used for simulation.
- Both synthetic traffic and real benchmarks (MPEG4, VOPD, MMS) are used for the simulation.
Adaptivity comparison:
- 20%-30% more paths are available when the application traffic information is considered.
- Higher adaptivity helps to distribute the traffic more uniformly.

Latency Simulation: Synthetic Traffic
[Figure panels: random traffic, transpose-1 traffic, transpose-2 traffic, hotspot-center traffic]

Peak Energy Simulation: Synthetic Traffic
[Figure panels: random traffic, transpose-1 traffic, transpose-2 traffic, hotspot-center traffic]

Peak Energy Simulation: Real Benchmark Traffic
[Figure: peak energy profile, proposed algorithm vs. odd-even]
- On average, a 16.6% peak energy reduction can be achieved.

Peak Energy Reduction with Different PE/Router Ratios
- A tile's energy depends on both the router and the processing element.
- We evaluate the effectiveness of the routing algorithm in reducing the peak energy when the energy ratio r_e = (average processor energy) / (average router energy) varies.
Peak energy reduction (each cell: vs. XY / vs. OE). Uniform random, hotspot-center, and transpose-1 are synthetic traffic.
r_e  | Uniform random | Hotspot-center | Transpose-1   | MMS-1         | VOPD          | Average
0.67 | 17.4% / 12.8%  | 15.7% / 17.6%  | 15.6% / 17.9% | 17.7% / 11.8% | 16.5% / 28.9% | 16.6% / 17.6%
1.00 | 15.7% / 15.3%  | 14.4% / 16.2%  | 13.3% / 14.2% | 15.7% / 10.4% | 12.3% / 23.6% | 14.3% / 15.7%
1.67 | 10.6% / 8.6%   | 12.3% / 14.0%  | 11.2% / 13.8% | 10.6% / 6.5%  | 9.3% / 17.7%  | 10.9% / 12.0%
2.00 | 11.9% / 9.3%   | 11.6% / 15.0%  | 11.1% / 13.8% | 10.6% / 6.5%  | 9.3% / 17.7%  | 10.9% / 12.0%
2.67 | 9.4% / 7.7%    | 10.2% / 13.0%  | 8.9% / 11.1%  | 10.4% / 7.0%  | 7.6% / 14.8%  | 9.3% / 10.5%
3.00 | 8.8% / 7.0%    | 9.6% / 10.9%   | 8.8% / 10.6%  | 9.7% / 6.5%   | 7.6% / 14.2%  | 8.9% / 9.7%
3.67 | 8.1% / 6.5%    | 8.7% / 9.9%    | 7.8% / 9.2%   | 7.9% / 5.2%   | 7.1% / 13.0%  | 7.9% / 8.6%
4.00 | 7.9% / 6.1%    | 8.3% / 9.4%    | 7.0% / 9.0%   | 7.3% / 4.7%   | 6.5% / 12.0%  | 7.4% / 8.1%

Conclusions
- In this paper, we propose an application-specific and thermal-aware routing algorithm for networks-on-chip.
- Given the application traffic characteristics, a set of deadlock-free paths with higher adaptivity is first obtained for routing.
- An LP problem is formulated to allocate the traffic properly among the paths.
- A table-based router is also proposed to select the output ports according to the ratios.
- From the simulation results, the peak energy reduction can be as high as 16.6% for both synthetic traffic and industry benchmarks.