Network-on-Chip Micro-Benchmarks

Zhonghai Lu and Axel Jantsch (Royal Institute of Technology, Sweden), Erno Salminen (Tampere University of Technology, Finland), and Cristian Grecu (University of British Columbia, Canada)

Abstract

The rapid development of Network-on-Chip (NoC) calls for a systematic approach to evaluate and fairly compare various NoC architectures. In this specification, we define a generic NoC architecture, a comprehensive set of synthetic workloads as micro-benchmarks, workload scenarios and evaluation criteria. These micro-benchmarks enable measuring particular properties of NoC architectures, complementing application benchmarks.

Keywords: Network-on-Chip, Performance Evaluation, Benchmark

Introduction

Network-on-Chip (NoC) has been recognized as a promising architecture to accommodate tens, hundreds or even thousands of cores. As a result, a number of NoC architectures have been and are being proposed. On one hand, this diversity offers designers a large selection of possibilities. On the other hand, it raises an urgent need to fairly evaluate and compare different NoC architectures in order to assist designers in making the right decisions and to further advance and accelerate the state of the art. Classic benchmarks for multiprocessor systems, for example SPEC [1] and E3S [2], are application-oriented and cannot be used directly for communication-intensive architectures such as NoCs. Moreover, the nature of the applications running on NoC-based designs is expected to be more varied and heterogeneous compared to typical applications for multiprocessor computers. To complement application benchmarks, OCP-IP has initiated a NoC benchmark endeavor [3], one part of which is NoC micro-benchmarks. While benchmark programs evaluate the combined effect of many aspects of the platform as well as of the application, micro-benchmarks isolate individual properties and allow for a faster and deeper point analysis.
Micro-benchmarks define synthetic workloads intended to exercise a NoC in a specific way or to measure a single particular aspect. Hence, a measurement offers insight into a specific property and facilitates the analysis and design of a communication infrastructure. A single micro-benchmark provides only a very limited view and does not allow for far-reaching conclusions about the suitability for an application domain. However, a set of well-designed micro-benchmarks can give both a broad and a detailed understanding of a given communication network.
Architecture Definition

A NoC features a network as a global interconnect integrating various resources, such as processors, memories, configurable or dedicated logic blocks, or local bus-based subsystems. We exemplify the NoC model to be evaluated in Figure 1, which shows four nodes connected by a packet-switched network.

[Figure 1. A NoC model with four nodes]

A node consists of a Resource Model (RM), a Network Interface (NI) and a Router (R). In our context, a resource is attached to exactly one NI, which in turn connects to exactly one router. However, a router may have no NI and RM connected to it, as in the case of indirect networks. The NI provides a hardware interconnect interface implementing an existing on-chip communication protocol such as OCP or AXI. Transactions are initiated by RMs, packetized in NIs, and depacketized back into transactions after network delivery before being received by RMs. No assumptions are made about network attributes such as topology, routing algorithm, switching policy and flow control scheme. The network may offer two classes of communication services: best-effort (BE) and guaranteed service (GS). The BE service is connection-less, delivering packets as soon as possible. The guaranteed service is connection-oriented, providing certain bounds on latency and/or bandwidth. The network reserves resources such as buffers and link bandwidth for connections; the NIs manage connections in terms of setup, configuration and tear-down. The evaluation is concerned with the network and the NIs, including the interconnect interface. The scope of evaluation covers unicast communication only.
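As an illustration of the packetization and depacketization steps performed by the NI, the following sketch splits a transaction payload into fixed-size packets and reassembles it on the receiving side. The packet fields and the maximum-payload parameter are hypothetical assumptions for illustration; this specification does not prescribe a packet format.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packet layout; the real header fields depend on the NoC.
struct Packet {
    uint16_t src;   // source node address
    uint16_t dst;   // destination node address
    uint16_t seq;   // sequence number within the transaction
    bool     last;  // marks the final packet of the transaction
    std::vector<uint8_t> payload;
};

// Split a transaction's data into packets of at most max_payload bytes.
std::vector<Packet> packetize(uint16_t src, uint16_t dst,
                              const std::vector<uint8_t>& data,
                              std::size_t max_payload) {
    std::vector<Packet> pkts;
    std::size_t off = 0;
    uint16_t seq = 0;
    do {
        std::size_t n = std::min(max_payload, data.size() - off);
        Packet p;
        p.src = src;
        p.dst = dst;
        p.seq = seq++;
        p.last = (off + n == data.size());
        p.payload.assign(data.begin() + off, data.begin() + off + n);
        pkts.push_back(std::move(p));
        off += n;
    } while (off < data.size());
    return pkts;
}

// Reassemble (depacketize) the transaction payload on the receiving side.
std::vector<uint8_t> depacketize(const std::vector<Packet>& pkts) {
    std::vector<uint8_t> data;
    for (const Packet& p : pkts)
        data.insert(data.end(), p.payload.begin(), p.payload.end());
    return data;
}
```

A round trip through `packetize` and `depacketize` must reproduce the original transaction data exactly, independent of the chosen maximum payload.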
Traffic Configuration

Micro-benchmarks differentiate between temporal and spatial traffic distributions.

Temporal distribution

The temporal distribution determines how an individual RM generates traffic over time. It is based on the b-model [4], in which the burstiness of traffic generation is controlled by a single parameter b in the range 0 < b <= 0.5. In the b-model, a bias parameter b = 0.4 means that, within a given time interval, 40% of the data are generated in one half of the time interval and the remaining 60% in the other half, and this continues recursively until the time resolution is reached. When b = 0.5, there is no burstiness and the emission probability is constant. The burstiness increases as b approaches 0.

Spatial distribution

The spatial distribution governs the spatial property of a traffic pattern: who communicates with whom. Assuming N nodes in the network, the following spatial distributions are covered:

1. Uniform: In this classic case the probability to send a packet from one node to another node is 1/(N-1). A node does not send data to itself.
2. Local: The probability to send a packet to a destination node depends on the source-destination distance.
3. Bit Rotation: a bit permutation pattern in which a given source node sends data only to the one destination node whose address is obtained by rotating the bit-string representation of the source node address to the right by one.
4. N-Complement: Similarly to Bit Rotation, this scenario creates load on fixed source-destination pairs. Suppose that the nodes are numbered as naturals 1, 2, ..., N. If a source node address is n_s, its destination address is n_d such that n_s + n_d = N.
5. Hot Spot: This scenario selects M of the N nodes as hot spots, and a certain fraction of the traffic is targeted at these hot spots. One hot spot is selected at a time by uniform random selection.
6. Fork-Join Pipeline: a pattern in which a fork node feeds c nodes that are the starting points of c parallel pipelines, each with a depth of e nodes. At the end of the pipelines, after e stages, the data is merged into a join node.

Measurement

We consider different workload scenarios, for which we define measurement metrics.

Workload type

For best-effort services, we consider three workload types: packets at the network level and read/write transactions at the application level. The length of a read/write transaction may vary from a byte to a few words. For guaranteed services, we examine another three workload types: open connection, close connection and message. Here, a message is the data transmitted over a connection.

Workload cases and metrics

We differentiate unloaded and loaded cases, for which we define the following measurement metrics.

Service | Workload type         | Delay [ns or cycles] | Throughput [Mbits/s] | Energy [pJ]
        |                       | Min   Avg   Max      | Min   Avg   Max      | Min   Avg   Max
--------+-----------------------+----------------------+----------------------+----------------
BE      | Packet                |                      |                      |
        | Read 2/4/8            |                      |                      |
        | Write 2/4/8           |                      |                      |
GS      | Open connection       |                      |                      |
        | Close connection      |                      |                      |
        | Message 4/16/32/128   |                      |                      |

Table 1. Evaluation criteria for the unloaded case

In the unloaded case, individual packets or transactions are injected/initiated and measured so that only a single traffic source is active. This yields minimum delay and peak performance. These values may vary depending on the location of the source; therefore, determining the minimum, average and maximum values requires several measurements. Table 1 shows the performance metrics for the unloaded case.
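The temporal and spatial distributions above can be sketched in code. The fragment below generates a bursty traffic trace by the recursive b-model splitting and computes Bit Rotation and N-Complement destinations. The function names and the random choice of which half receives the larger share are illustrative assumptions, not part of the specification.

```cpp
#include <random>
#include <utility>
#include <vector>

// b-model: recursively split a data volume over 2^depth time slots.
// At each level, a randomly chosen half of the interval receives a
// fraction b of the volume and the other half 1-b; b = 0.5 yields a
// uniform, burst-free stream, and burstiness grows as b approaches 0.
std::vector<double> bmodel(double volume, double b, int depth,
                           std::mt19937& rng) {
    std::vector<double> slots{volume};
    std::bernoulli_distribution coin(0.5);
    for (int level = 0; level < depth; ++level) {
        std::vector<double> next;
        next.reserve(slots.size() * 2);
        for (double v : slots) {
            double small = b * v, large = (1.0 - b) * v;
            if (coin(rng)) std::swap(small, large);  // pick the bursty half
            next.push_back(small);
            next.push_back(large);
        }
        slots = std::move(next);
    }
    return slots;
}

// Bit Rotation: rotate the addr_bits-wide source address right by one.
unsigned bit_rotation_dest(unsigned src, unsigned addr_bits) {
    unsigned lsb = src & 1u;
    return (src >> 1) | (lsb << (addr_bits - 1));
}

// N-Complement: with nodes numbered 1..N, source n_s targets n_d = N - n_s.
unsigned n_complement_dest(unsigned src, unsigned n) {
    return n - src;
}
```

Regardless of b, the slot values always sum to the original volume; only their distribution over time changes.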
Service | Workload type         | Avg. Delay [ns or cycles] | Avg. Jitter        | Θs [Mbits/s] | Energy [pJ]
        |                       | D1   D2   D3  ...  Dn     | J1   J2  ...  Jn   |              |
--------+-----------------------+---------------------------+--------------------+--------------+------------
BE      | Packet                |                           |                    |              |
        | Read 2/4/8            |                           |                    |              |
        | Write 2/4/8           |                           |                    |              |
GS      | Open connection       |                           |                    |              |
        | Close connection      |                           |                    |              |
        | Message 4/16/32/128   |                           |                    |              |

Table 2. Evaluation criteria for the loaded case

The loaded case investigates the network behavior when many independent packets or transactions compete for the same resources, so that congestion, arbitration, buffering and flow control policies are exercised. Part of the load in the network may be generated as background traffic with some other micro-benchmark than the one used for measurement, for example uniform traffic. In the presence of congestion the network typically does not exhibit deterministic delay behavior; we therefore capture both delay and delay variations. Table 2 shows the performance metrics for the loaded case, where Θs represents sustained throughput and D_i / J_i is the i-th delay/jitter bound on the data. For instance, D1/J1, D2/J2 and D3/J3 bound 90%, 99.9% and 99.99% of all packets or transactions, respectively.

Interference of resource reservation on BE traffic

To study the impact of resource allocation by guaranteed services on BE traffic, a set of micro-benchmarks shall measure the BE traffic performance when a given portion of the bandwidth of each link in the network is allocated to a guaranteed service.

Network scalability

Besides performance and power metrics, scalability is an important quality metric. To evaluate the scalability of NoCs, a range of network sizes can be specified as an input to the micro-benchmarks.
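A loaded-case delay bound D_i can be estimated from measured per-packet delays. The sketch below sorts the samples and returns the smallest delay that covers a requested fraction of them (e.g. 0.9 for a bound on 90% of packets). The estimation procedure is not fixed by this specification; this is one plausible implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Smallest measured delay that bounds the given fraction of all samples,
// e.g. fraction = 0.9 for D1 in Table 2. Assumes a non-empty sample set.
double delay_bound(std::vector<double> delays, double fraction) {
    std::sort(delays.begin(), delays.end());
    // Index of the last sample inside the covered fraction.
    std::size_t idx = static_cast<std::size_t>(fraction * delays.size());
    if (idx > 0) --idx;
    return delays[idx];
}

// Average delay over all samples (the Avg. Delay column).
double average(const std::vector<double>& xs) {
    double s = 0.0;
    for (double x : xs) s += x;
    return s / xs.size();
}
```

With 1000 samples, fractions of 0.9, 0.999 and 0.9999 would correspond to the D1, D2 and D3 bounds of Table 2; jitter bounds J_i can be derived analogously from delay-variation samples.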
The Micro-Benchmark Set

Each micro-benchmark has a name which reflects its function. The names have the following format:

NoCmb_TEMP_SPAT_LUL_WORKLOAD_GS_SIZE_MP

Fields:
NoCmb: constant string prefix standing for NoC micro-benchmark.
TEMP: temporal distribution. Besides burstiness, different average emission probabilities may be supplied.
SPAT: spatial distribution.
LUL: workload case, either LOADED or UNLOADED.
WORKLOAD: workload type.
GS: fraction of bandwidth reserved for the guaranteed service.
SIZE: number of network nodes.
MP: measurement point indicating where the performance is measured, Raw or Buffered. The Raw delay measures the delay in the network. The Buffered delay includes the source queuing delay plus the Raw delay.

Hence, a complete set of micro-benchmarks may contain up to n^7 micro-programs if each of the seven fields has n options. Each combination of the fields is a micro-benchmark. We give five examples:

1. NoCmb_Burst0.3Avg0.5_BitRotation_UNLOADED_Packet_GS[0,0.2,0.3,0.5]_Size16_RAW: This unloaded case creates best-effort packets with burstiness 0.3 and average probability 0.5 to bit-rotated destination nodes in a 16-node network when 0%, 20%, 30% and 50% of the network bandwidth is reserved by the guaranteed service. Raw delay is to be measured.
2. NoCmb_Burst[0.5,0.4,0.3,0.2]Avg0.4_Uniform_LOADED_Packet_GS0_Size64_RAW: This loaded case creates uniformly distributed best-effort packets on a 64-node network with burstiness b = 0.5, 0.4, 0.3 and 0.2 and average probability 0.4.
3. NoCmb_Burst0.4Avg0.4_Uniform_LOADED_Packet_GS0_Size[4,16,64,128,256]_RAW: This loaded case creates uniformly distributed best-effort packets with burstiness 0.4 and average probability 0.4 on networks with 4, 16, 64, 128 and 256 nodes.
4. NoCmb_Burst0.3Avg0.05_BitRotation_UNLOADED_OpenConnection_GS[0,0.2,0.3,0.5]_Size32_RAW: This unloaded case creates open-connection packets with burstiness 0.3 and average probability 0.05 to bit-rotated destination nodes in a 32-node network when 0%, 20%, 30% and 50% of the network bandwidth is reserved by the guaranteed service.
5. NoCmb_Burst0.2Avg[0.2,0.3,0.4,0.5,0.6,0.7]_Uniform_LOADED_Packet_GS0_Size64_Buffered: This loaded case creates uniformly distributed best-effort packets on a 64-node network with burstiness 0.2 and average probability ranging from 20% to 70% in steps of 10%. Buffered delay is to be measured.

The micro-benchmarks may be implemented in different modeling or programming languages such as VHDL/Verilog or C/C++/SystemC. The NoC Working Group at OCP-IP aims to provide an open-source implementation of the micro-benchmarks in SystemC.

Concluding Remark

We have specified a rich set of micro-benchmarks to evaluate and compare NoC architectures. It will systematically exercise a set of important aspects of a NoC, give the NoC developer insight and guidelines for improvement, and give the NoC user a detailed understanding of the NoC's behavior, its strengths and weaknesses. We envision that this set of micro-benchmarks will continue to evolve together with increasing endeavors in NoC activities.

Acknowledgment

Developing the NoC micro-benchmarks is a persistent effort of the NoC Working Group at OCP-IP. The authors would like to thank all members for their helpful discussions and insightful comments.

References

[1] The Standard Performance Evaluation Corporation, SPEC, http://www.spec.org/hpg/
[2] R. Dick, Embedded System Synthesis Benchmark Suites (E3S), http://www.ece.northwestern.edu/~dickrp/e3s/
[3] C. Grecu, A. Ivanov, P. Pande, A. Jantsch, E. Salminen, U. Ogras, and R. Marculescu, "Towards open network-on-chip benchmarks," in Proceedings of the First International Symposium on Networks-on-Chip, 2007.
[4] M. Wang, T. Madhyastha, N. H. Chan, S. Papadimitriou, and C. Faloutsos, "Data mining meets performance evaluation: fast algorithms for modeling bursty traffic," in Proceedings of the 18th International Conference on Data Engineering, 2002.