Network-on-Chip Micro-Benchmarks

Zhonghai Lu *, Axel Jantsch *, Erno Salminen and Cristian Grecu
* Royal Institute of Technology, Sweden
Tampere University of Technology, Finland
University of British Columbia, Canada

Abstract

The rapid development of Network-on-Chip (NoC) calls for a systematic approach to evaluate and fairly compare various NoC architectures. In this specification, we define a generic NoC architecture, a comprehensive set of synthetic workloads as micro-benchmarks, workload scenarios and evaluation criteria. These micro-benchmarks enable measuring particular properties of NoC architectures, complementing application benchmarks.

Keywords: Network-on-Chip, Performance Evaluation, Benchmark

Introduction

Network-on-Chip (NoC) has been recognized as a promising architecture to accommodate tens, hundreds or even thousands of cores. As a result, a number of NoC architectures have been and are being proposed. On the one hand, this diversity offers designers a large selection of possibilities. On the other hand, it raises an urgent need to fairly evaluate and compare different NoC architectures in order to assist designers in making the right decisions and to further advance and accelerate the state of the art.

Classic benchmarks for multiprocessor systems, for example SPEC [1] and E3S [2], are application-oriented and cannot be used directly for communication-intensive architectures such as NoCs. Moreover, the nature of the applications running on NoC-based designs is expected to be more varied and heterogeneous than that of typical applications for multiprocessor computers.

To complement application benchmarks, OCP-IP has initiated a NoC benchmark endeavor [3], one part of which is the set of NoC micro-benchmarks. While benchmark programs evaluate the combined effect of many aspects of the platform as well as of the application, micro-benchmarks isolate individual properties and allow for a faster and deeper point analysis.
Micro-benchmarks define synthetic workloads intended to exercise a NoC in a specific way or to measure a single particular aspect. Hence, a measurement offers insight into a specific property and facilitates the analysis and design of a communication infrastructure. A single micro-benchmark provides only a very limited view and does not allow for far-reaching conclusions about the suitability for an application domain. However, a set of well-designed micro-benchmarks can give both a broad and a detailed understanding of a given communication network.

Architecture Definition

Figure 1. A NoC model with four nodes

A NoC features a network as a global interconnect integrating various resources, such as processors, memory, configurable or dedicated logic blocks, or local bus-based subsystems. We exemplify the NoC model to be evaluated in Figure 1, which shows four nodes connected by a packet-switched network. A node consists of a Resource Model (RM), a Network Interface (NI) and a Router (R). In our context, a resource is attached to exactly one NI, which in turn connects to exactly one router. However, a router may have no NI and RM connected to it in the case of indirect networks.

The NI provides a hardware interconnect interface implementing an existing on-chip communication protocol, such as OCP or AXI. Transactions are initiated by RMs, packetized in NIs and, after network delivery, depacketized back into transactions that are received by RMs. No assumptions are made about network attributes such as topology, routing algorithm, switching policy and flow control scheme.

The network may offer two classes of communication services: best-effort (BE) and guaranteed service (GS). The BE service is connection-less, delivering packets as soon as possible. The guaranteed service is connection-oriented, providing certain bounds on latency and/or bandwidth. The network reserves resources such as buffers and link bandwidth for connections. The NIs manage connections in terms of setup, configuration and tear-down.

The evaluation is concerned with the network and the NIs, including the interconnect interface. The scope of the evaluation is limited to unicast communication.

Traffic Configuration

Micro-benchmarks differentiate between temporal and spatial traffic distributions.

Temporal distribution

The temporal distribution determines how an individual RM generates traffic over time. It is based on the b-model [4], in which the burstiness of traffic generation is controlled by a single parameter b in the range 0 < b ≤ 0.5. In the b-model, a bias parameter b = 0.4 means that, within a given time interval, 40% of the data are generated in one half of the interval and the remaining 60% in the other half; this split is applied recursively until the time resolution is reached. When b = 0.5, there is no burstiness and the emission probability is constant. The burstiness increases as b approaches 0.

Spatial distribution

The spatial distribution governs the spatial property of a traffic pattern: who communicates with whom. Assuming N nodes in the network, the following spatial distributions are covered:

1. Uniform: In this classic case, the probability of sending a packet from one node to another node is 1/(N - 1). A node does not send data to itself.
2. Local: The probability of sending a packet to a destination node depends on the source-destination distance.
3. Bit Rotation: A bit permutation pattern in which a given source node sends data only to the one destination node whose address is obtained by rotating the bit string representation of the source node address to the right by one.
4. N-Complement: Similarly to Bit Rotation, this scenario creates load on source-destination pairs. Suppose that nodes are numbered as naturals 1, 2, ..., N; if a source node address is n_s, its destination address is n_d such that n_s + n_d = N.
5. Hot Spot: This scenario selects M of the N nodes as hot spots. A certain fraction of the traffic is targeted at these hot spots. One hot spot is selected at a time by uniform random selection.
6. Fork-Join Pipeline: A pattern in which a fork node feeds c nodes that are the starting points of c parallel pipelines. Each pipeline has a depth of e nodes. At the end of the pipelines, after e stages, the data is merged into a join node.

Measurement

We consider different workload scenarios, for which we define measurement metrics.

Workload type

For best-effort services, we consider three workload types: packets at the network level, and read and write transactions at the application level. The length of a read/write transaction may vary from a byte to a few words. For guaranteed services, we examine another three workload types: open connection, close connection and message. Here, a message is the data transmitted over connections.

Workload cases and metrics

We differentiate unloaded and loaded cases, for which we define the following measurement metrics.

Service  Workload type        | Delay [ns or cycles] | Throughput [Mbits/s] | Energy [pJ]
                              | Min.   Avg.   Max.   | Min.   Avg.   Max.   | Min.   Avg.   Max.
BE       Packet               |                      |                      |
BE       Read 2/4/8           |                      |                      |
BE       Write 2/4/8          |                      |                      |
GS       Open connection      |                      |                      |
GS       Close connection     |                      |                      |
GS       Message 4/16/32/128  |                      |                      |

Table 1. Evaluation criteria for the unloaded case

In the unloaded case, individual packets or transactions are injected/initiated and measured so that only a single traffic source is active. This yields the minimum delay and the peak performance. These values may vary depending on the location of the source. Therefore, determining the minimum, average and maximum values requires several measurements. Table 1 shows the performance metrics for the unloaded case.
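The temporal and spatial distributions defined under Traffic Configuration can be illustrated with a minimal Python sketch. Everything specific here is an assumption of the sketch rather than part of the specification: the function names, the random choice of which half of an interval receives the fraction b, the 0-based node numbering, and the requirement that the node count be a power of two for Bit Rotation.

```python
import random


def b_model(total, levels, b, rng):
    """Split `total` data units over 2**levels time slots with the b-model.

    At every level each interval is halved; one half receives the
    fraction b of the interval's data and the other half 1 - b.
    Which half receives b is picked at random (assumption of this sketch).
    """
    if not 0 < b <= 0.5:
        raise ValueError("b must satisfy 0 < b <= 0.5")
    slots = [float(total)]
    for _ in range(levels):
        refined = []
        for volume in slots:
            first = b if rng.random() < 0.5 else 1.0 - b
            refined.append(volume * first)
            refined.append(volume * (1.0 - first))
        slots = refined
    return slots


def uniform_dest(src, n, rng):
    """Uniform pattern: any node except `src`, each with probability 1/(n - 1)."""
    dest = rng.randrange(n - 1)          # n - 1 candidates
    return dest if dest < src else dest + 1  # skip over src itself


def bit_rotation_dest(src, bits):
    """Bit Rotation pattern: rotate a `bits`-wide address right by one.

    Nodes are assumed to be numbered 0 .. 2**bits - 1. Addresses whose
    bits are all equal (0 and 2**bits - 1) rotate onto themselves.
    """
    return (src >> 1) | ((src & 1) << (bits - 1))
```

Calling `b_model(100.0, 3, 0.4, random.Random(0))` returns eight per-slot volumes whose total stays at 100 up to floating-point rounding; with b = 0.5 every slot receives the same volume, which matches the non-bursty case described above.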

Service  Workload type        | Avg. Delay [ns or cycles] | D_1  D_2  D_3 ... D_n | Avg. Jitter | J_1  J_2 ... J_n | Θ_s [Mbits/s] | Energy [pJ]
BE       Packet               |                           |                       |             |                  |               |
BE       Read 2/4/8           |                           |                       |             |                  |               |
BE       Write 2/4/8          |                           |                       |             |                  |               |
GS       Open connection      |                           |                       |             |                  |               |
GS       Close connection     |                           |                       |             |                  |               |
GS       Message 4/16/32/128  |                           |                       |             |                  |               |

Table 2. Evaluation criteria for the loaded case

The loaded case investigates the network behavior when many independent packets or transactions compete for the same resources, so that congestion, arbitration, buffering and flow control policies are exercised. Part of the load in the network may be generated as background traffic by a micro-benchmark other than the one used for the measurements, for example uniform traffic. In the presence of congestion, the network typically does not exhibit deterministic delay behavior; we therefore capture both delay and delay variation. Table 2 shows the performance metrics for the loaded case, where Θ_s represents the sustained throughput. D_i / J_i is the i-th delay/jitter bound, which holds for a given fraction of the data. For instance, D_1 / J_1, D_2 / J_2 and D_3 / J_3 bound 90%, 99.9% and 99.99% of all packets or transactions, respectively.

Interference of resource reservation on BE traffic

To study the impact of resource allocation by guaranteed services on BE traffic, a set of micro-benchmarks shall measure the BE traffic performance when a given portion of the bandwidth of each link in the network is allocated to a guaranteed service.

Network scalability

Besides performance and power metrics, scalability is an important quality metric. To evaluate the scalability of NoCs, a range of network sizes can be specified as an input to the micro-benchmarks.
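The delay columns of Table 2 can be filled from a list of measured delays. The following is a minimal sketch under stated assumptions: the specification does not prescribe an estimator, so the nearest-rank empirical quantile used here, the function names and the dictionary layout are all hypothetical choices. The default fractions are the 90%, 99.9% and 99.99% figures from the example in the text.

```python
import math


def delay_bound(delays, fraction):
    """Smallest measured delay that is >= at least `fraction` of all samples.

    Nearest-rank empirical quantile (an assumption of this sketch).
    """
    if not delays:
        raise ValueError("need at least one delay sample")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must satisfy 0 < fraction <= 1")
    ordered = sorted(delays)
    # Round before taking the ceiling so that tiny floating-point errors
    # in fraction * len(...) do not push the rank up by one.
    rank = max(1, math.ceil(round(fraction * len(ordered), 9)))
    return ordered[rank - 1]


def summarize_delays(delays, fractions=(0.90, 0.999, 0.9999)):
    """Average delay plus one bound D_i per requested fraction."""
    return {
        "avg": sum(delays) / len(delays),
        "bounds": [delay_bound(delays, f) for f in fractions],
    }
```

The same two functions could be applied to a list of jitter samples to obtain the J_i columns, once a definition of jitter has been fixed for the measurement.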

The Micro-Benchmark Set

Each micro-benchmark has a name that reflects its function. The names have the following format:

NoCmb_TEMP_SPAT_LUL_WORKLOAD_GS_SIZE_MP

Fields:

NoCmb: constant string prefix standing for NoC micro-benchmark.
TEMP: temporal distribution. Besides burstiness, different average emission probabilities may be supplied.
SPAT: spatial distribution.
LUL: workload case, either LOADED or UNLOADED.
WORKLOAD: workload type.
GS: fraction of bandwidth reserved for the guaranteed service.
SIZE: number of network nodes.
MP: measurement point indicating where the performance is measured, Raw or Buffered. The Raw delay measures the delay in the network. The Buffered delay includes the source queuing delay plus the Raw delay.

Hence, a complete set of micro-benchmarks may contain up to n^7 micro-programs if each field has n options. Each combination of the fields is a micro-benchmark. We give five examples:

1. NoCmb_Burst0.3Avg0.5_BitRotation_UNLOADED_Packet_GS[0,0.2,0.3,0.5]_Size16_raw: This unloaded case creates best-effort packets with burstiness 0.3 and average probability 0.5 to bit-rotated destination nodes in a 16-node network when 0%, 20%, 30% and 50% of the network bandwidth is reserved by the guaranteed service. Raw delay is to be measured.
2. NoCmb_Burst[0.5,0.4,0.3,0.2]Avg0.4_Uniform_LOADED_Packet_GS0_Size64_RAW: This loaded case creates uniformly distributed best-effort packets on a 64-node network with burstiness b = 0.5, 0.4, 0.3 and 0.2, and average probability 0.4.
3. NoCmb_Burst0.4Avg0.4_Uniform_LOADED_Packet_GS0_Size[4,16,64,128,256]_RAW: This loaded case creates uniformly distributed best-effort packets with burstiness 0.4 and average probability 0.4 on networks with 4, 16, 64, 128 and 256 nodes.
4. NoCmb_Burst0.3Avg0.05_BitRotation_UNLOADED_OpenConnection_GS[0,0.2,0.3,0.5]_Size32_RAW: This unloaded case creates open-connection packets with burstiness 0.3 and average probability 0.05 to bit-rotated destination nodes in a 32-node network when 0%, 20%, 30% and 50% of the network bandwidth is reserved by the guaranteed service.
5. NoCmb_Burst0.2Avg[0.2,0.3,0.4,0.5,0.6,0.7]_Uniform_LOADED_Packet_GS0_Size64_Buffered: This loaded case creates uniformly distributed best-effort packets on a 64-node network with burstiness 0.2 and average probability ranging from 20% to 70% in steps of 10%. Buffered delay is to be measured.

The micro-benchmarks may be implemented in different modeling or programming languages such as VHDL/Verilog or C/C++/SystemC. The NoC Working Group at OCP-IP aims to provide an open source implementation of the micro-benchmarks in SystemC.

Concluding Remark

We have specified a rich set of micro-benchmarks to evaluate and compare NoC architectures. The set systematically exercises a number of important aspects of a NoC. It gives the NoC developer insight and guidelines for improvement. It also gives the NoC user a detailed understanding of the NoC behavior, its strengths and its weaknesses. We envision that this set of micro-benchmarks will continue to evolve together with increasing endeavors in NoC activities.

Acknowledgment

Developing the NoC micro-benchmarks is a persistent effort of the NoC Working Group at OCP-IP. The authors would like to thank all members for their helpful discussions and insightful comments.

References

[1] The Standard Performance Evaluation Corporation (SPEC). http://www.spec.org/hpg/
[2] R. Dick. Embedded System Synthesis Benchmarks Suites (E3S). http://www.ece.northwestern.edu/~dickrp/e3s/
[3] Cristian Grecu, Andre Ivanov, Partha Pande, Axel Jantsch, Erno Salminen, Umit Ogras, and Radu Marculescu. Towards open network-on-chip benchmarks. In Proceedings of the First International Symposium on Networks-on-Chip, 2007.
[4] Mengzhi Wang, T. Madhyastha, Chan Ngai Hang, S. Papadimitriou, and C. Faloutsos. Data mining meets performance evaluation: fast algorithms for modeling bursty traffic. In Proceedings of the 18th International Conference on Data Engineering, 2002.
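The naming format NoCmb_TEMP_SPAT_LUL_WORKLOAD_GS_SIZE_MP, with its bracketed option lists, lends itself to mechanical expansion into individual micro-benchmarks. The sketch below is one possible way to do this in Python and is not taken from the OCP-IP implementation; the function names `expand` and `split_fields` are hypothetical, and the sketch assumes that no field value contains an underscore.

```python
import itertools
import re

# Matches one bracketed option list such as "[0,0.2,0.3,0.5]".
_OPTION_LIST = re.compile(r"\[([^\]]*)\]")

FIELD_NAMES = ("TEMP", "SPAT", "LUL", "WORKLOAD", "GS", "SIZE", "MP")


def expand(name):
    """Expand every [a,b,c] option list into one concrete name per combination."""
    # With one capture group, split() alternates literal text and options.
    parts = _OPTION_LIST.split(name)
    literals = parts[0::2]
    choices = [options.split(",") for options in parts[1::2]]
    names = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        names.append("".join(pieces))
    return names


def split_fields(concrete_name):
    """Map a concrete (bracket-free) name onto the seven variable fields."""
    prefix, *values = concrete_name.split("_")
    if prefix != "NoCmb" or len(values) != len(FIELD_NAMES):
        raise ValueError("not a complete NoCmb micro-benchmark name")
    return dict(zip(FIELD_NAMES, values))
```

Because `expand` takes the Cartesian product of all option lists, a name in which each of the seven fields carries n options expands to n^7 concrete names, which is the upper bound stated in the text. Example 3 above, with its single five-element SIZE list, expands to five names.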