On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

Govindan Ravindran, Newbridge Networks Corporation, Kanata, ON K2K 2E6, Canada, gravindr@newbridge.com
Michael Stumm, Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada, stumm@eecg.toronto.edu

Abstract

Hierarchical-ring based multiprocessors are interesting alternatives to the more popular two-dimensional direct networks. They allow for simple router designs and wider communication paths than their direct network counterparts. There are several ways hierarchical-ring networks can be configured for a given number of processors. Feasible topologies range from tall, lean networks to short, wide networks, but only a few of these possess high throughput and low latency. This paper presents the results of a simulation study (i) to determine how large hierarchical-ring networks can become before their performance deteriorates due to their bisection bandwidth constraints and (ii) to derive topologies with high throughput and low latency for a given number of processors. We show that a system with a maximum of 120 processors and three levels of hierarchy can sustain most memory access behaviors, but that larger systems can be sustained only if their bisection bandwidth is increased.

1. Introduction

Multiprocessors based on a single ring are limited to a small number of processors, mainly because the diameter of the network grows linearly with the number of processors. A hierarchical-ring network can accommodate a larger number of processors by interconnecting multiple rings in a hierarchical fashion [7]. A major advantage of the hierarchical-ring topology is that it can be used to exploit the spatial locality of memory accesses often exhibited in parallel programs. The hierarchical structure also allows efficient implementation of broadcasts, useful for implementing cache coherence and multicast protocols [7]. However, hierarchical-ring networks have a constant bisection bandwidth, which limits their scalability.

There are several ways a hierarchical-ring network can be built for a given number of processors. Feasible configurations, or topologies, range from tall, lean networks to short, wide networks. However, only a few of these topologies possess the combination of high throughput and low latency that should be the goal of any topology. In this paper, we describe a bottom-up approach to finding good topologies and discuss the effect that bisection bandwidth has on the performance of such networks.

In related earlier work, we presented the results of a simulation study of the scalability and bisection bandwidth constraints of hierarchical-ring networks that considered only two request rates and did not take system throughput into account [4]. In another study, Holliday and Stumm [2] studied the performance and scalability of hierarchical-ring networks, but they assumed a high degree of locality in memory accesses. Hamacher and Jiang [1] used analytical models to derive optimal hierarchical-ring topologies. In contrast to the above work, our study is much more extensive; results are derived under a wide range of request rates and cache line sizes, and in addition to latency, we also consider system throughput.

2. Simulated System

Figure 1 shows a shared-memory multiprocessor system with a number of processing modules connected by a two-level hierarchy of unidirectional rings. The processing modules are all connected to the lowest-level rings, referred to as local rings.
The system provides a flat, global (physical) address space, thereby allowing processors to transparently access any memory location in the system. While local memory accesses do not involve the network, remote memory accesses require a request packet to be sent to the target memory, followed by a response packet from the target memory to the requesting processor. Packets are of variable size and are transferred in flits, bit-parallel, along a unique path through the network.

Figure 1. A hierarchical-ring system with two levels.

We assume wormhole switching, where a packet is sent as a contiguous sequence of flits, with the header flit containing the routing and sequencing information. A packet containing a remote request whose target memory is in a different ring than its source first travels up the hierarchy to the level needed to reach the target node, and then descends the hierarchy to the target node, where it is removed from the local ring. The target node sends a response packet back to the requesting PM along a similar path.

2.1. Network Nodes

In a hierarchical ring, there are two types of network nodes: Network Interface Controllers (NICs) connect processing modules (PMs) to local rings, and Inter-Ring Interfaces (IRIs) connect two rings of adjacent levels. The NIC examines the header of a packet and switches (1) incoming packets from the ring to a PM, (2) outgoing packets from the PM to the ring, and (3) continuing packets from the input link to the output link. The IRI controls the traffic between two rings and is modeled as a 2×2 crossbar switch. Possible implementations of these network nodes are depicted in Figures 2 and 3.

Figure 2. A network interface controller (NIC) for a hierarchical-ring connected multiprocessor network.

Figure 3. An inter-ring interface controller for a hierarchical-ring connected multiprocessor network.

The NIC has a FIFO ring buffer to temporarily store transit packets when the output link is busy transmitting a packet from the local PM. The NIC also has a FIFO input buffer to store packets destined for the local PM and a FIFO output buffer to store packets originating from the local PM and destined for remote nodes (see Figure 2). Priority for transmission on the output link is given to transit packets, followed by packets in the output queue.

An IRI has two ring buffers: one for the lower ring and one for the upper ring. It also has a down buffer and an up buffer (see Figure 3). The down buffer stores packets arriving from the upper ring destined for the lower ring, while the up buffer stores packets arriving from the lower ring destined for the upper ring. Switching takes place independently at the lower and upper ring sides. We use prioritized link arbitration, where priority is given to packets that do not change rings. Arriving transit packets block and are placed in the ring buffer when the output link is in the process of transmitting a packet from the up/down buffer.

We assume that all communication occurs synchronously: that is, within a network clock cycle, each NIC can transfer one flit to the next adjacent node (if the link is not blocked) and receive a flit from the previous node it connects to; and an IRI can transmit and receive a flit on each ring (again, if there is no blocking).
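To make the up-then-down routing concrete, the following minimal sketch (our illustration; the names and structure are not from the paper or its simulator) computes the hierarchy level a request packet must climb to before it can descend to its target, given a topology specified as branching factors from the local ring upward.

```python
# Illustrative sketch of hierarchical-ring routing (not the authors' code).
# A topology is given as branching factors from the local ring upward,
# e.g. (8, 5, 3) for the 8x5x3 hierarchy.

def ring_coordinates(node, topology):
    """Map a flat node number to its per-level ring coordinates."""
    coords = []
    for branching in topology:
        coords.append(node % branching)   # position within this level's ring
        node //= branching
    return coords

def ascent_level(src, dst, topology):
    """Highest level a packet must reach: 0 = same local ring,
    len(topology) - 1 = must cross the global ring."""
    a = ring_coordinates(src, topology)
    b = ring_coordinates(dst, topology)
    level = len(topology) - 1
    while level > 0 and a[level] == b[level]:
        level -= 1                        # source and target share this subtree
    return level

if __name__ == "__main__":
    topo = (8, 5, 3)                      # 120 processors
    print(ascent_level(0, 5, topo))       # 0: same local ring
    print(ascent_level(0, 12, topo))      # 1: same level-2 ring, different local rings
    print(ascent_level(0, 100, topo))     # 2: request must use the global ring
```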

2.2. Simulator

The simulator we use reflects the behavior of a system at the register-transfer level on a cycle-by-cycle basis and was implemented using the smpl simulation library [3]. The batch means method of output analysis is used, with the first batch discarded to account for initialization bias. (In the batch means method, a single long run is divided into several batches, with a separate sample mean computed for each batch. These batch means are then used to compute the grand mean and confidence interval.) The batch termination criterion is that each processor has to complete at least some minimum number of requests. All simulation results have confidence interval half-widths of 1% or less at a 95% confidence level, except near saturation, where the confidence interval half-width may increase to a few percent.

3. System and Workload Parameters

A hierarchical-ring based shared-memory multiprocessor can be characterized in part by the following parameters: (i) the system size (number of processors), (ii) the relative processor, memory, and network cycle times, (iii) the maximum number of transactions a processor may have outstanding at a time, and (iv) the topology. We model the effect of techniques such as prefetching, non-blocking reads, and relaxed consistency models by allowing up to four outstanding transactions per processor. The topology of hierarchical-ring networks is specified by the branching factor at each level of the hierarchy, starting at the local ring up to the global ring. A topology specified as 8×4×2 refers to a three-level hierarchy with 8 nodes per local ring, 4 level-1 rings per level-2 ring, and 2 level-2 rings connected to the global ring.

The main parameters in our synthetic workload model include the request rate, determined by the mean time (in processor cycles) between cache misses given a non-blocked processor (the mean time between cache misses follows a negative exponential distribution), the probability that a cache miss is a read, and a measure of communication locality. We subject the network to a wide range of request rates, from 0.001 to 0.1. Given a cache miss, we assume the probability of it being a read is 0.7. Our synthetic workload model simulates two main memory transactions, read and write, and four types of packets: read request, read response, write request, and write response. In a read transaction, the request packet contains the target memory address and the response packet contains the requested cache line data. In a write transaction, the request packet contains the cache line data to be written and the response packet contains an acknowledgment. (For writes, the response packet is sent back to the requesting station as soon as the write is queued at the target memory, so the latency of the actual memory operation is hidden.)

Communication locality can greatly affect system throughput in shared-memory multiprocessor networks. We use clusters of locality to model locality in our synthetic workloads [5]. This communication model logically organizes all processors into clusters and assigns a probability to each cluster of being the target of a transaction. Two vector parameters specify a specific locality model. The vector S = (S1, S2, ..., Sn) specifies the size of each cluster, and the vector P = (P1, P2, ..., Pn) specifies the probability of each cluster being the target of a transaction. Given that the target memory is in a particular cluster, the probability of a processor module being the target within that cluster is uniformly distributed. This definition of clusters is independent of the topology of the network.
For a hierarchy of rings, the clusters are defined in terms of the absolute difference (modulo the size of the system) between the two processor numbers, where processors are numbered left to right when viewed as the leaves of the tree defined by the ring hierarchy. We use two specific workloads derived from this model:

1. Workload T_loc, represented as S = (1, 4, n-5) and P = (0.5, 0.8, 1.0), models 3 clusters; the first cluster is the source processor module itself, the second cluster contains the source processor module's four closest neighbors, and the third cluster contains all other processor modules. Cluster 1 has probability 0.5 of being the target, cluster 2 has probability 0.8 of being the target given that the target is not in cluster 1, and cluster 3 has probability 1.0 of containing the target given that the target is not in cluster 1 or cluster 2. This workload models high communication locality, where there is a probability of 0.9 that the target memory lies within the first two clusters.

2. Workload T_uniform, represented as S = (n) and P = (1.0), with n being the total number of processors, models a single cluster that has a probability of 1.0 of containing the target memory. This workload models poor communication locality.

The system and workload parameters used in our study are summarized in Table 1. We define the cycle ratio as the relative speed of the processor, network, and memory [2]. It is specified as N_X M_Y, which means that each network cycle is X times as slow as a processor cycle and the memory requires Y processor cycles to service one memory request. We define the network cycle time as the time required for a packet to move from the input of one node to the input of the next node. Such a transfer need not occur in a single network cycle. Our assumption that the network cycle time is a factor of two slower than the processor cycle time is justified by the fact that for a 5 ns processor cycle time (200 MHz), our ring cycle time of 10 ns is close to that used in SCI performance studies [6]. (SCI specifies a ring cycle time of 2 ns, with 4 ring cycles required to transfer a packet from the input of one node to the input of the neighboring node.)
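A minimal sketch of the cluster-of-locality target selection described above, using the T_loc parameters S = (1, 4, n-5) and P = (0.5, 0.8, 1.0); the function and variable names are ours, not the simulator's.

```python
import random

# Illustrative sketch of the clusters-of-locality workload model.
# Clusters are defined by absolute distance (mod n) from the source module,
# with processors numbered left to right across the leaves of the ring tree.

def pick_target(source, n, sizes, probs, rng=random):
    """Choose the target module for one memory transaction.

    sizes[i] -- number of modules in cluster i, e.g. (1, 4, n - 5)
    probs[i] -- probability cluster i holds the target, given that
                clusters 0..i-1 do not, e.g. (0.5, 0.8, 1.0)
    """
    by_distance = sorted(range(n),
                         key=lambda m: min((m - source) % n, (source - m) % n))
    start = 0
    for size, prob in zip(sizes, probs):
        if rng.random() < prob:
            # Within the chosen cluster the target is uniformly distributed.
            return rng.choice(by_distance[start:start + size])
        start += size
    return source   # not reached when the last probability is 1.0

if __name__ == "__main__":
    n = 120
    print(pick_target(7, n, sizes=(1, 4, n - 5), probs=(0.5, 0.8, 1.0)))  # T_loc
    print(pick_target(7, n, sizes=(n,), probs=(1.0,)))                    # T_uniform
```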

Parameter | Value | Description
n | 4-120 | Number of processors
b | 1 | Number of memory banks
n_L1 × n_L2 × ... × n_Lh | 8×5; 8×5×3 | Hierarchical-ring topology
N_X M_Y | N_2 M_10 | Ratio of network and memory cycles to processor cycle
T | 4 | Maximum number of outstanding transactions
λ | 0.001-0.1 | Request rate
R | 0.7 | Probability that a cache miss is a read
S = (S1, S2, ..., Sm) | (n); (1, 4, n-5) | Cluster sizes
P = (P1, P2, ..., Pm) | (1.0); (0.5, 0.8, 1.0) | Cluster probabilities

Table 1. System and synthetic workload parameters and their range of values used in our simulations.

4. Deriving Hierarchical-ring Topologies

In this section, we derive high-performance hierarchical-ring topologies using flit-level simulations. It should be noted that the hierarchical-ring topologies we derive are largely independent of the switching technique or buffer sizes assumed [5]. We use a bottom-up approach: we start from the lowest level in the hierarchy and work up one level at a time. At the lowest level, we derive the maximum number of processors that can be sustained at high throughput and low latency and then fix that configuration. At higher levels, we derive the maximum number of next-lower-level rings of the previously fixed configuration that still gives high throughput and low latency.

4.1. Single Rings

Here, we will show that a single ring can reasonably sustain a total of 8 processor-memory modules across most memory access patterns given the chosen system parameters, and that as we increase the cache line size, the effect of locality in the memory access pattern on system performance becomes less significant.

Figures 4a and 4b present the throughput-latency curves for single-ring topologies when subjected to the T_uniform and T_loc workloads, respectively. For the case with no locality, a 4-processor ring gives us a low-latency configuration with high throughput compared to that of 8- and 16-processor systems; initially, when the number of processors is less than 4, performance is throughput limited, and when we add more processors, throughput increases to a point after which performance becomes latency limited. Hence, we choose n_L1,uniform = 4. For the T_loc workload (Figure 4b), the maximum achievable throughput for the 8-processor ring is much higher than for the 4-processor ring; therefore, we choose n_L1,loc = 8, although the 4-processor configuration exhibits lower latency at low request rates. For both workload models, however, the 16-processor configuration is clearly not desirable, as it exhibits higher latency when compared to the 8-processor topology.

For n_L1,loc, the maximum achievable throughput is 65% higher when there is high locality in the memory accesses than when there is poor locality. For n_L1,uniform, the difference is only 5%. Given the probabilities of memory accesses in the three clusters as P = (P1, P2, 1), we define locality as follows:

locality = P1 + (1 - P1) P2    (1)

For T_uniform, P = (1/n, 4/(n-1), 1), where n is the total number of processors in the system. For T_loc, P = (0.5, 0.8, 1.0). Substituting these values in Equation 1, locality varies from locality_uniform = 1/n + (1 - 1/n) * 4/(n-1) for the T_uniform workload to locality_loc = 0.9 for the T_loc workload. Normalized locality (between 0 and 1) is given by:

normalized locality = (locality - locality_uniform) / (locality_loc - locality_uniform)    (2)

Figure 5 presents the maximum throughput gain in percent of using an 8-processor topology (n_L1,loc) as opposed to a 4-processor topology (n_L1,uniform) for different degrees of locality.
It is obvious that there is a positive throughput gain (as high as 45%) for most memory access patterns when using the 8-processor topology. The trend is similar for larger cache line sizes, where there is still a gain in the maximum achievable throughput when using an 8-processor topology, although it is much smaller for 64- and 128-byte cache line systems than for the 32-byte cache line system. (The ring sizes n_L1,uniform and n_L1,loc remain the same, at 4 and 8 nodes, for 64- and 128-byte cache line sizes.)
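As a quick numeric check of Equations 1 and 2, using the reconstructed forms above (where locality_uniform = 1/n + (1 - 1/n) * 4/(n-1) simplifies to 5/n), the locality measure and its normalized value can be computed as follows; this is our own arithmetic, not the paper's:

```python
# Numeric check of the locality measure (Eq. 1) and its normalized form (Eq. 2),
# assuming the reconstructed expressions given in the text.

def locality(p1, p2):
    """Eq. 1: probability that the target lies in the first two clusters."""
    return p1 + (1 - p1) * p2

def normalized_locality(loc, n):
    """Eq. 2: rescale so that T_uniform maps to 0 and T_loc maps to 1."""
    loc_uniform = locality(1 / n, 4 / (n - 1))   # equals 5/n
    loc_loc = locality(0.5, 0.8)                 # equals 0.9
    return (loc - loc_uniform) / (loc_loc - loc_uniform)

for n in (8, 16, 120):
    lu = locality(1 / n, 4 / (n - 1))
    print(f"n={n:3d}: locality_uniform={lu:.3f}, "
          f"normalized(T_uniform)={normalized_locality(lu, n):.1f}, "
          f"normalized(T_loc)={normalized_locality(0.9, n):.1f}")
```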

Figure 4. Throughput-latency curves for single-ring topologies with 32B cache lines for (a) T_uniform and (b) T_loc workloads.

Figure 5. Throughput gain (loss) in percent of using an 8-processor system as opposed to a 4-processor system.

As a result, we conclude that a total of 8 processor-memory modules can be reasonably sustained in a single ring across most memory access patterns, and that as we increase the cache line size, the effect of locality in the memory access pattern on system performance becomes less significant.

4.2. Two-level Rings

In this section, we will show that a total of 5 local rings can be reasonably sustained in a two-level hierarchical-ring topology for most memory access patterns, and that the effect of locality in memory accesses on system performance is independent of cache line size for systems of this size. To do so, we add a second-level ring, L2, and determine how many L1 local rings a two-level hierarchy can sustain. The L2 global ring connects a number of L1 local rings, each containing the maximum number of processor-memory modules (n_L1 = 8), as determined in the previous section.

Figure 6 presents the throughput-latency curves for 2-level hierarchical rings with 32-byte cache lines. With the T_uniform workload, a global ring can sustain only 3 local rings (n_L2,uniform = 3); any increase in the number of local rings decreases the maximum achievable throughput of the network. However, with the T_loc workload, which has high locality, we can increase the number of local rings to 5 (n_L2,loc = 5). In the latter case, when the number of local rings is increased further, there is no significant increase in the maximum achievable throughput.

Figure 6. Throughput-latency curves for two-level ring topologies with 32B cache lines for the (a) T_uniform and (b) T_loc workloads.

One major difference between the single-ring and two-level ring topologies is the effect of locality on the maximum achievable throughput. For T_loc, the maximum achievable throughput with n_L2,loc (the 8×5 topology) is about 110% higher than for T_uniform. Figure 7 presents the throughput gain in percent when using an 8×5 topology as opposed to an 8×3 topology. We see that by using the 8×5 topology there is a throughput loss for most memory access patterns; however, this loss is small and decreases as locality increases. The throughput gain starts to grow at a higher rate when locality > 0.6, resulting in a 45% throughput gain when locality = 1. Since the throughput gain from using an 8×5 topology is much higher at high locality levels than the throughput loss at lower locality levels, we can reasonably assume that the number of local rings a second-level global ring can sustain is 5.

Figure 7. Throughput gain (loss) in percent of using an 8×5 topology as opposed to an 8×3 topology.
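For reference, the processor counts of the configurations compared here, and of the three-level configurations evaluated in the next subsection, follow directly from the product of the branching factors; a trivial tally (ours, not from the paper):

```python
from math import prod

# Processor counts of the hierarchical-ring configurations discussed in
# Sections 4.1-4.3 (topologies listed from the local ring upward).
def processors(topology):
    return prod(topology)

for topo in [(8,), (8, 3), (8, 5), (8, 5, 2), (8, 5, 3), (8, 5, 4)]:
    label = "x".join(str(b) for b in topo)
    print(f"{label:7s} -> {processors(topo):3d} processors")
```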

4.3. Three-level Rings

We next introduce a third level to the hierarchy and proceed to determine how many L2 rings can be sustained. Each L2 ring now consists of a second-level ring connected to 5 L1 rings of 8 nodes each, for a total of 40 nodes. We refer to the third-level ring as the global ring. Figure 8 presents the throughput-latency curves for 3-level hierarchical rings with 32-byte cache lines. For T_uniform, the trend is similar to what we observed for 2-level rings, namely that a maximum of 3 L2 rings (n_L3,uniform = 3) can be sustained by a global ring. However, for T_loc, we are also only able to sustain 3 L2 rings (n_L3,loc = 3). The constant bisection bandwidth constraint of the hierarchical-ring network offsets the benefits of high locality in the memory accesses. Thus, even good locality (where in this case 90% of all requests lie within a 4-neighbor cluster) saturates the global ring fairly easily.

Figure 8. Throughput-latency curves for three-level rings with 32B cache lines for the (a) T_uniform and (b) T_loc workloads.

5. Effect of Critical Parameters

In this section, we develop a simple analytical model to study the effect of certain critical parameters, such as router speed, on the performance of hierarchical-ring topologies. The analytical model is semi-empirical in that it uses some input parameters derived from simulations. This semi-empirical model allows us to save much simulation time and is useful for determining which part of the design space should be simulated for more accurate predictions. In particular, we define the following parameters:

λ = processor request rate in requests/cycle
λ_max = maximum processor request rate
f_lm = fraction of λ to local memory
f_lr = fraction of λ(1 - f_lm) to processors on the local ring, not including the local processor
f_gr = fraction of λ(1 - f_lm)(1 - f_lr) to processors within the 2-level ring hierarchy, but not to the local ring
S_proc = processor speed in cycles/second
S_nic = NIC speed in cycles/second
S_iri = second-level IRI router speed in cycles/second
S_glb_iri = third-level IRI router speed in cycles/second
n_L1 = number of nodes in a local ring
n_L2 = number of local rings connected to a second-level ring
n_L3 = number of 2-level rings connected to a third-level ring
W = channel width in bits
L_trans = average length of a memory transaction (bits)

5.1. Single Rings

As a first step, we develop a model for a single ring and then extend it to include a second and a third level.

The traffic, m_L1, in bits/sec, injected by a processor into the ring depends on the processor request rate λ, the fraction of requests that go to local memory, f_lm, the average length of a transaction, L_trans, and the processor speed, S_proc:

m_L1 = λ (1 - f_lm) L_trans S_proc    (3)

Assuming the T_uniform workload, the average load at any point in the ring will be m_L1 n_L1 / 2, since a packet typically (on average) traverses half the ring. We refer to this as the bisection load. For the bisection load to be less than or equal to the bisection bandwidth, it is necessary that:

m_L1 n_L1 / 2 ≤ 2 W S_nic    (4)

Substituting for m_L1 from Equation 3, we have:

λ (1 - f_lm) (S_proc / S_nic) (L_trans / W) n_L1 ≤ 4    (5)

If we define S_ratio as the ratio of the processor and NIC router speeds, S_proc / S_nic, and n_phits (number of physical transfer units) as the ratio of the average length of a transaction to the channel width, L_trans / W, and substitute f_lm = 1/n_L1 for the T_uniform workload, Equation 5 can be rewritten as:

λ (n_L1 - 1) S_ratio n_phits ≤ 4    (6)

Therefore, the maximum processor request rate in a single ring is given by:

λ_max(1-level) = 4 / (n_phits (n_L1 - 1) S_ratio)    (7)

In other words, to keep a single-ring network below saturation, a processor's cache miss rate should be at most the value defined in Equation 7. It should be noted that this value is inversely proportional to the average length of a transaction (n_phits), the ratio of processor and NIC router speeds (S_ratio), and the number of nodes in a ring.

5.2. Additional Ring Levels

For two levels of rings, the equivalent of Equation 4 is:

m_L2 n_L2 / 2 ≤ 2 W S_iri    (8)

where m_L2 is the request rate from a local ring into the global ring, n_L2 is the total number of local rings, and S_iri is the inter-ring interface router speed. The traffic, m_L2, can be defined in the same way as in Equation 3:

m_L2 = n_L1 [λ (1 - f_lm) L_trans S_proc] (1 - f_lr) / S_ratio    (9)

Substituting S_proc / S_nic for S_ratio and expanding Equation 8 using Equation 9, we obtain the maximum processor request rate in a 2-level ring system:

λ_max(2-level) = 4 / ((n_L1 - 1) (n_L2 - 1) n_phits S_lcl_ratio)    (10)

where S_lcl_ratio is the ratio of the NIC and IRI router speeds, S_nic / S_iri, f_lr = 1/n_L2, and f_lm = 1/n_L1 (for the T_uniform workload). We can proceed similarly and derive the equation for the maximum processor request rate λ_max(3-level) for 3-level rings.

An interesting property of the (contention-free) maximum processor request rates is that they decrease by a factor of two for every increase in the number of levels in the hierarchy. From Equations 7 and 10, by substituting S_glb_ratio = S_lcl_ratio = 1, S_ratio = 2, n_L2 = 5, and n_L3 = 3, we obtain:

λ_max(3-level) / λ_max(2-level) = λ_max(2-level) / λ_max(single-ring) = 0.5    (11)
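A quick numeric check of the reconstructed Equations 7, 10, and 11, with S_ratio = 2 and S_lcl_ratio = S_glb_ratio = 1 as above; the value of n_phits is an assumed example, and the 3-level formula is our straightforward extension of Equation 10:

```python
# Numeric check of the saturation model (Eqs. 7, 10, 11) as reconstructed above.

def lam_max_1(n_l1, n_phits, s_ratio):
    """Eq. 7: maximum request rate for a single ring."""
    return 4 / (n_phits * (n_l1 - 1) * s_ratio)

def lam_max_2(n_l1, n_l2, n_phits, s_lcl_ratio):
    """Eq. 10: maximum request rate for a 2-level hierarchy."""
    return 4 / ((n_l1 - 1) * (n_l2 - 1) * n_phits * s_lcl_ratio)

def lam_max_3(n_l1, n_l2, n_l3, n_phits, s_glb_ratio):
    """Same form extended to 3 levels (our extension of Eq. 10)."""
    return 4 / ((n_l1 - 1) * (n_l2 - 1) * (n_l3 - 1) * n_phits * s_glb_ratio)

n_phits = 8   # assumed example: average transaction length / channel width
l1 = lam_max_1(8, n_phits, s_ratio=2)
l2 = lam_max_2(8, 5, n_phits, s_lcl_ratio=1)
l3 = lam_max_3(8, 5, 3, n_phits, s_glb_ratio=1)
print(f"lambda_max: 1-level {l1:.4f}, 2-level {l2:.4f}, 3-level {l3:.4f}")
print(f"ratios: {l2 / l1:.2f} and {l3 / l2:.2f}")   # both 0.5, as in Eq. 11
```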

We can, for example, use this property to obtain the number of lower-level rings a global ring can sustain for a level higher than 3. For example, in a 4-level hierarchical-ring network, we know that the maximum processor request rate λ_max(4-level) will be half of that in a 3-level hierarchy. Therefore,

λ_max(4-level) / λ_max(3-level) = 1 / ((1 - f_gr) S_glb1_ratio S_glb2_ratio n_L4) = 0.5    (12)

Substituting f_gr = 1/n_L4 and S_glb1_ratio = S_glb2_ratio = 1, and solving for n_L4, we obtain:

n_L4 = λ_max(3-level) / λ_max(4-level) + 1 = 3    (13)

Therefore, we can sustain up to 3 L3 rings in a 4-level hierarchy.

5.3. Effect of Router Speeds on Performance

As shown earlier, the performance and scalability of hierarchical rings are clearly limited by their constant bisection bandwidth. By increasing the bandwidth of the global ring (and thus the bisection bandwidth), we can connect additional lower-level rings without worsening the average memory access latency. Targeting just the global ring is effective because the utilization of the lower-level rings is low, especially when the global ring is saturated. The bandwidth of the global ring can be increased either by increasing the width of the ring or by increasing the speed of the ring. We explore the option of clocking the global ring at a speed higher than that of the local and intermediate rings.

For 2-level rings, we use Equation 10 to obtain n_L2, the maximum number of local rings connected to a global ring:

n_L2 = 4 / (λ_max(2-level) (n_L1 - 1) n_phits S_iri_ratio) + 1    (14)

If the global ring is twice as fast as the local rings, then S_iri_ratio = S_nic / S_iri = 0.5. Dividing Equation 14 by itself at the two speed ratios, we obtain:

(n_L2(S_iri_ratio = 0.5) - 1) / (n_L2(S_iri_ratio = 1) - 1) = 2    (15)

Substituting n_L2(S_iri_ratio = 1) = 5 from our simulation results and solving for n_L2(S_iri_ratio = 0.5), we obtain:

n_L2(S_iri_ratio = 0.5) = 9    (16)

From this we conclude that a 2-level hierarchical ring can sustain up to 9 local rings when the global ring is twice as fast as the local rings. For 3-level rings, Equation 15 becomes:

(n_L3(S_glb_ratio = 0.5) - 1) / (n_L3(S_glb_ratio = 1) - 1) = 2    (17)

Since n_L3(S_glb_ratio = 1) = 3 and n_L3(S_glb_ratio = 0.5) = 5, the global ring in a 3-level hierarchy can sustain up to 5 second-level rings when it is clocked at twice the speed of the local rings.
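A quick check of the router-speed argument (Equations 15-17) under the same reconstructed model: the number of rings an upper ring can sustain, minus one, scales with the speedup of that ring.

```python
# Numeric check of the router-speed scaling in Section 5.3 (Eqs. 15-17).

def scaled_ring_count(n_baseline, speedup):
    """Rings sustainable when the upper ring is `speedup` times faster,
    given the count sustainable at equal speed."""
    return (n_baseline - 1) * speedup + 1

# Baselines from the simulations: 5 local rings (2-level), 3 L2 rings (3-level).
print(scaled_ring_count(5, 2))   # 9 local rings with a 2x-faster global ring (Eq. 16)
print(scaled_ring_count(3, 2))   # 5 second-level rings with a 2x-faster global ring
```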
6. Conclusion

This paper presented techniques to derive high-performance topologies for hierarchical-ring networks. Our overall goal was to maximize system throughput. Using a bottom-up approach, we derived the following topologies: up to 8 processors on a level-1 ring, a maximum of 5 level-1 rings in a 2-level hierarchy, and a maximum of 3 level-2 rings in a 3-level hierarchy. As we increase the number of levels in the hierarchy, the constant bisection bandwidth constraint of the hierarchical-ring network offsets the benefits of high locality in memory accesses, saturating the global ring fairly easily. It was also shown that single-ring and 2-level hierarchical-ring topologies are more sensitive to locality in memory accesses, whereas higher-level hierarchical-ring topologies are less sensitive. We also presented a semi-empirical analytical model to explore design spaces not considered in our simulations.

References

[1] V.C. Hamacher and H. Jiang, "Performance and configuration of hierarchical ring networks for multiprocessors," Proc. Intl. Conf. on Parallel Processing, Vol. I, August 1997.
[2] M. Holliday and M. Stumm, "Performance evaluation of hierarchical ring-based shared memory multiprocessors," IEEE Trans. on Computers, Vol. 43, No. 1, pp. 52-67, Jan. 1994.
[3] M.H. MacDougall, Simulating Computer Systems: Techniques and Tools, MIT Press, 1987.
[4] G. Ravindran and M. Stumm, "Hierarchical ring topologies and the effect of their bisection bandwidth constraints," Proc. Intl. Conf. on Parallel Processing, pp. I/51-55, August 1995.
[5] G. Ravindran, "Performance issues in the design of hierarchical-ring and direct networks for shared-memory multiprocessors," Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Toronto, January 1998.
[6] S. Scott, J.R. Goodman, and M.K. Vernon, "Performance of the SCI ring," Proc. Intl. Symp. on Computer Architecture, pp. 403-414, 1992.
[7] Z.G. Vranesic et al., "The NUMAchine multiprocessor," Technical Report CSRI-TR-324, CSRI, University of Toronto, 1995.