On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors


Govindan Ravindran, Newbridge Networks Corporation, Kanata, ON K2K 2E6, Canada
Michael Stumm, Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada

Abstract

Hierarchical-ring based multiprocessors are interesting alternatives to the more popular two-dimensional direct networks. They allow for simpler router designs and wider communication paths than their direct network counterparts. There are several ways hierarchical-ring networks can be configured for a given number of processors. Feasible topologies range from tall, lean networks to short, wide networks, but only a few of these possess high throughput and low latency. This paper presents the results of a simulation study (i) to determine how large hierarchical-ring networks can become before their performance deteriorates due to their bisection bandwidth constraints and (ii) to derive topologies with high throughput and low latency for a given number of processors. We show that a system with a maximum of 120 processors and three levels of hierarchy can sustain most memory access behaviors, but that larger systems can be sustained only if their bisection bandwidth is increased.

1. Introduction

Multiprocessors based on a single ring are limited to a small number of processors, mainly because the diameter of the network grows linearly with the number of processors. A hierarchical-ring network can accommodate a larger number of processors by interconnecting multiple rings in a hierarchical fashion [7]. A major advantage of the hierarchical-ring topology is that it can exploit the spatial locality of memory accesses often exhibited in parallel programs. Also, the hierarchical structure allows efficient implementation of broadcasts, which is useful for implementing cache coherence and multicast protocols [7]. However, hierarchical-ring networks have a constant bisection bandwidth that limits their scalability.

There are several ways to build a hierarchical-ring network for a given number of processors. Feasible configurations, or topologies, range from tall, lean networks to short, wide networks. However, only a few of these topologies possess the combination of high throughput and low latency that should be the goal of any topology. In this paper, we describe a bottom-up approach to finding good topologies and discuss the effect that bisection bandwidth has on the performance of such networks.

In related earlier work, we presented the results of a simulation study of the scalability and bisection bandwidth constraints of hierarchical-ring networks that considered only two request rates and did not take system throughput into account [4]. In another study, Holliday and Stumm [2] studied the performance and scalability of hierarchical-ring networks, but they assumed a high degree of locality in memory accesses. Hamacher and Jiang [1] used analytical models to derive optimal hierarchical-ring topologies. In contrast to the above work, our study is much more extensive: results are derived under a wide range of request rates and cache line sizes, and in addition to latency, we also consider system throughput.

2. Simulated System

Figure 1 shows a shared-memory multiprocessor system with a number of processing modules (PMs) connected by a two-level hierarchy of unidirectional rings. The processing modules are all connected to the lowest-level rings, referred to as local rings.
The system provides a flat, global (physical) address space, thereby allowing processors to transparently access any memory location in the system. While local memory accesses do not involve the network, remote memory accesses require a request packet to be sent to the target memory, followed by a response packet from the target memory to the requesting processor. Packets sent are of variable size and are transferred in flits, bit-parallel, along a unique

path through the network. We assume wormhole switching, where a packet is sent as a contiguous sequence of flits, with the header flit containing the routing and sequencing information. A packet containing a remote request whose target memory is in a different ring than its source first travels up the hierarchy to the level needed to reach the target node, and then descends the hierarchy to the target node, where it is removed from the local ring. The target node sends a response packet back to the requesting PM along a similar path.

Figure 1. A hierarchical-ring system with two levels. [Figure: a global ring connects inter-ring interfaces; each local ring connects processor-memory modules (PMs) through network interface controllers.]

2.1 Network Nodes

In a hierarchical ring, there are two types of network nodes: Network Interface Controllers (NICs) connect processing modules (PMs) to local rings, and Inter-Ring Interfaces (IRIs) connect two rings of adjacent levels. The NIC examines the header of a packet and switches (1) incoming packets from the ring to a PM, (2) outgoing packets from the PM to the ring, and (3) continuing packets from the input link to the output link. The IRI controls the traffic between two rings and is modeled as a 2 x 2 crossbar switch. Possible implementations of these network nodes are depicted in Figures 2 and 3.

Figure 2. A network interface controller (NIC) for a hierarchical-ring connected multiprocessor network. [Figure: ring input and output links, a ring buffer, and input and output request/response buffers toward the processor and memory.]

Figure 3. An inter-ring interface controller for a hierarchical-ring connected multiprocessor network. [Figure: upper and lower ring buffers plus up and down buffers between the two rings.]

The NIC has a FIFO ring buffer to temporarily store transit packets when the output link is busy transmitting a packet from the local PM. The NIC also has a FIFO input buffer to store packets destined for the local PM and a FIFO output buffer to store packets originating from the local PM destined for remote nodes (see Figure 2). Priority for transmission on the output link is given to transit packets, followed by packets in the output queue.

An IRI has two ring buffers: one for the lower ring and one for the upper ring. It also has a down buffer and an up buffer (see Figure 3). The down buffer stores packets arriving from the upper ring destined for the lower ring, while the up buffer stores packets arriving from the lower ring destined for the upper ring. Switching takes place independently at the lower and upper ring sides. We use prioritized link arbitration, where priority is given to packets that do not change rings. Arriving transit packets block and are placed in the ring buffer when the output link is in the process of transmitting a packet from the up/down buffer.

We assume that all communication occurs synchronously: that is, within a network clock cycle, each NIC can transfer one flit to the next adjacent node (if the link is not blocked) and receive a flit from the previous node it connects to; and an IRI can transmit and receive a flit on each ring (again, if there is no blocking).
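To make the up-then-down routing concrete, the following is a minimal sketch (in Python, not part of the paper's simulator) that computes how many ring levels a request must climb in a given topology. It assumes PMs are numbered left to right as the leaves of the tree defined by the ring hierarchy, as in the cluster numbering of Section 3; the function names and the example topology are illustrative.

```python
# Sketch: how far up the hierarchy a request travels in a hierarchical-ring network.
# A topology such as (8, 5, 3) means 8 PMs per local ring, 5 local rings per
# level-2 ring, and 3 level-2 rings on the global ring.

def ring_coordinates(pm, topology):
    """Return the position of `pm` within each ring level, lowest level first."""
    coords = []
    for branch in topology:
        coords.append(pm % branch)
        pm //= branch
    return coords

def levels_climbed(src, dst, topology):
    """0 = stays on the local ring; k = must ascend k levels (up to the level-(k+1) ring)."""
    src_c = ring_coordinates(src, topology)
    dst_c = ring_coordinates(dst, topology)
    for k in range(len(topology) - 1, 0, -1):
        if src_c[k:] != dst_c[k:]:   # still under different level-k subtrees
            return k
    return 0

if __name__ == "__main__":
    topo = (8, 5, 3)                            # the 8 x 5 x 3 topology derived in Section 4
    print(levels_climbed(3, 5, topo))           # same local ring -> 0
    print(levels_climbed(3, 12, topo))          # different local rings, same level-2 ring -> 1
    print(levels_climbed(3, 115, topo))         # must cross the global ring -> 2
```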

2.2 Simulator

The simulator we use reflects the behavior of a system at the register-transfer level on a cycle-by-cycle basis and was implemented using the smpl simulation library [3]. The batch means method of output analysis is used, with the first batch discarded to account for initialization bias. (In the batch means method, a single long run is divided into several batches, with a separate sample mean computed for each batch; these batch means are then used to compute the grand mean and confidence interval.) The batch termination criterion is that each processor has to complete at least some minimum number of requests. All simulation results have confidence interval half-widths of 1% or less at a 95% confidence level, except near saturation, where the confidence interval half-width may increase to a few percent.

3. System and Workload Parameters

A hierarchical-ring based shared-memory multiprocessor can be characterized in part by the following parameters: (i) the system size (number of processors), (ii) the relative processor, memory, and network cycle times, (iii) the maximum number of transactions a processor may have outstanding at a time, and (iv) the topology. We model the effect of techniques such as prefetching, non-blocking reads, and relaxed consistency models by allowing up to four outstanding transactions per processor. The topology of hierarchical-ring networks is specified by the branching factor at each level of the hierarchy, starting at the local ring up to the global ring. A topology specified as 8 x 4 x 2, for example, refers to a three-level hierarchy with 8 nodes per local ring, 4 level-1 rings per level-2 ring, and 2 level-2 rings connected to the global ring.

The main parameters in our synthetic workload model include the request rate, which is governed by the mean time (in processor cycles) between cache misses of a non-blocked processor (inter-miss times follow a negative exponential distribution), the probability that a cache miss is a read, and a measure of communication locality. We subject the network to a wide range of request rates, from 0.001 to 0.1. Given a cache miss, we assume the probability of it being a read is 0.7. Our synthetic workload model simulates two main memory transactions, namely read and write transactions, and four types of packets, namely read request, read response, write request, and write response. In a read transaction, the request packet contains the target memory address and the response packet contains the requested cache line data. In a write transaction, the request packet contains the cache line data to be written and the response packet contains an acknowledgment. (For writes, the response packet is sent back to the requesting station as soon as the write is queued at the target memory, so the latency of the actual memory operation is hidden.)

Communication locality can greatly affect system throughput in shared-memory multiprocessor networks. We use clusters of locality to model locality in our synthetic workloads [5]. This communication model logically organizes all processors into clusters and assigns a probability for each cluster being the target of a transaction. Two vector parameters specify a particular locality model: the vector S = (S1, S2, ..., Sn) specifies the size of each cluster, and the vector P = (P1, P2, ..., Pn) specifies the probability of each cluster being the target of a transaction. Given that the target memory is in a particular cluster, the probability of a processor module being the target within that cluster is uniformly distributed. This definition of clusters is independent of the topology of the network.
For a hierarchy of rings, the clusters are defined in terms of the absolute difference (modulo the size of the system) between two processor numbers, when processors are numbered left to right as the leaves of the tree defined by the ring hierarchy. We use two specific workloads derived from this model:

1. Workload T_loc, represented as S = (1, 4, n - 5) and P = (0.5, 0.8, 1.0), models 3 clusters: the first cluster is the source processor module itself, the second cluster contains the source processor module's four closest neighbors, and the third cluster contains all other processor modules. Cluster 1 has probability 0.5 of being the target; cluster 2 has probability 0.8 of being the target, given that the target is not in cluster 1; and cluster 3 has probability 1.0 of containing the target, given that the target is not in cluster 1 or cluster 2. This workload models high communication locality, where there is a probability of 0.9 that the target memory lies within the first two clusters.

2. Workload T_uniform, represented as S = (n) and P = (1.0), with n being the total number of processors, models a single cluster that has probability 1.0 of containing the target memory. This workload models poor communication locality.
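To illustrate how a single synthetic request can be drawn from this model, the following is a minimal sketch (not the smpl-based simulator used in the paper). It assumes clusters are formed around the source in order of increasing circular distance and that the cluster probabilities P are conditional on all earlier clusters having been rejected, as described above; the helper names and the mean miss interval in the example are illustrative.

```python
# Sketch of the cluster-based synthetic workload: S gives cluster sizes, P gives the
# conditional probability that each cluster contains the target, and the target is
# uniform within the chosen cluster.

import random

def build_clusters(source, n, sizes):
    """Partition all n PMs into clusters around `source`, closest PMs first."""
    by_distance = sorted(range(n),
                         key=lambda pm: min((pm - source) % n, (source - pm) % n))
    clusters, start = [], 0
    for size in sizes:
        clusters.append(by_distance[start:start + size])
        start += size
    return clusters

def pick_target(source, n, sizes, probs, rng=random):
    clusters = build_clusters(source, n, sizes)
    for members, p in zip(clusters, probs):
        if rng.random() < p:              # conditional on all earlier clusters missing
            return rng.choice(members)
    return rng.choice(clusters[-1])       # unreachable when the last probability is 1.0

def next_request(source, n, sizes, probs, mean_miss_interval, rng=random):
    """One synthetic cache miss: (cycles until the miss, target PM, is_read)."""
    wait = rng.expovariate(1.0 / mean_miss_interval)  # negative exponential inter-miss time
    target = pick_target(source, n, sizes, probs, rng)
    is_read = rng.random() < 0.7                      # probability that a miss is a read
    return wait, target, is_read

if __name__ == "__main__":
    n = 120                                    # e.g. the 8 x 5 x 3 topology
    t_loc = ((1, 4, n - 5), (0.5, 0.8, 1.0))   # workload T_loc
    t_uni = ((n,), (1.0,))                     # workload T_uniform
    # mean_miss_interval = 100 cycles corresponds to a request rate of 0.01
    print(next_request(7, n, t_loc[0], t_loc[1], mean_miss_interval=100))
    print(next_request(7, n, t_uni[0], t_uni[1], mean_miss_interval=100))
```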

The system and workload parameters used in our study are summarized in Table 1. We define the cycle ratio as the relative speed of the processor, network, and memory [2]. It is specified as NxMy, which means that each network cycle is x times as slow as a processor cycle and that the memory requires y processor cycles to service one memory request. We define the network cycle time as the time required for a packet to move from the input of one node to the input of the next node; such a transfer need not occur in a single network cycle. Our assumption that the network cycle time is a factor of two slower than the processor cycle time is justified by the fact that, for a 5 ns processor cycle time (200 MHz), our ring cycle time of 10 ns is close to that used in SCI performance studies [6]. (SCI specifies a ring cycle time of 2 ns, with 4 ring cycles required to transfer a packet from the input of one node to the input of the neighboring node.)

Table 1. System and synthetic workload parameters and their range of values used in our simulations.

Parameter | Value | Description
n | 4-120 | Number of processors
b | 1 | Number of memory banks
nL1; nL2; ...; nLh | 8; 5; 3 | Hierarchical-ring topology
NxMy | N2M10 | Ratio of network and memory cycle times to the processor cycle time
T | 4 | Maximum number of outstanding transactions
lambda | 0.001-0.1 | Request rate
R | 0.7 | Probability that a cache miss is a read
S = (S1, S2, ..., Sm) | (n); (1, 4, n - 5) | Cluster sizes
P = (P1, P2, ..., Pm) | (1.0); (0.5, 0.8, 1.0) | Cluster probabilities

4. Deriving Hierarchical-ring Topologies

In this section, we derive high-performance hierarchical-ring topologies using flit-level simulations. It should be noted that the hierarchical-ring topologies we derive are largely independent of the switching technique or buffer sizes assumed [5]. We use a bottom-up approach: we start from the lowest level in the hierarchy and work up one level at a time. At the lowest level, we derive the maximum number of processors that can be sustained at high throughput and low latency and then fix that configuration. At higher levels, we derive the maximum number of next-lower-level rings of the previously fixed configuration that still gives high throughput and low latency.

4.1 Single Rings

Here, we show that a single ring can reasonably sustain a total of 8 processor-memory modules across most memory access patterns given the chosen system parameters, and that as we increase the cache line size, the effect of locality in the memory access pattern on system performance becomes less significant.

Figures 4a and 4b present the throughput-latency curves for single ring topologies when subjected to the T_uniform and T_loc workloads, respectively. For the case with no locality, a 4-processor ring gives us a low-latency configuration with high throughput compared to the 8- and 16-processor systems; when the number of processors is less than 4, performance is throughput limited, and as we add more processors, throughput increases to a point after which performance becomes latency limited. Hence, we choose n_L1,uniform = 4. For the T_loc workload (Figure 4b), the maximum achievable throughput for the 8-processor ring is much higher than for the 4-processor ring; therefore, we choose n_L1,loc = 8, although the 4-processor configuration exhibits lower latency at low request rates. For both workload models, however, the 16-processor configuration is clearly not desirable, as it exhibits higher latency than the 8-processor topology.

For n_L1,loc, the maximum achievable throughput is 65% higher when there is high locality in the memory accesses than when there is poor locality. For n_L1,uniform the difference is only 5%. Given the probabilities of memory accesses in the three clusters as P = (P1, P2, 1), we define locality as follows:

locality = P1 + (1 - P1) P2    (1)

For T_uniform, P = (1/n, 4/(n-1), 1), where n is the total number of processors in the system. For T_loc, P = (0.5, 0.8, 1.0). Substituting these values in Equation 1, locality varies from locality_uniform = 1/n + (1 - 1/n) * 4/(n-1) = 5/n for the T_uniform workload to locality_loc = 0.9 for the T_loc workload. Normalized locality (between 0 and 1) is given by:

normalized locality = (locality - locality_uniform) / (locality_loc - locality_uniform)    (2)

Figure 5 presents the maximum throughput gain in percent obtained by using an 8-processor topology (n_L1,loc) rather than a 4-processor topology (n_L1,uniform) for different degrees of locality.
It is clear that there is a positive throughput gain (as high as 45%) for most memory access patterns when using the 8-processor topology. The trend is similar for larger cache line sizes: there is still a gain in the maximum achievable throughput when using an 8-processor topology, although it is much smaller for 64- and 128-byte cache line systems than for the 32-byte cache line system. (The ring sizes n_L1,uniform and n_L1,loc remain the same, at 4 and 8 nodes, for 64- and 128-byte cache line sizes.)
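The locality measure of Equations 1 and 2 can be computed directly; the short sketch below does so for the two workload extremes, assuming P1 and P2 are the conditional probabilities of the self and four-nearest-neighbor clusters as defined above. The helper names are illustrative.

```python
# Sketch of the locality metric in Equations (1) and (2).

def locality(p1, p2):
    """Equation (1): locality = P1 + (1 - P1) * P2."""
    return p1 + (1.0 - p1) * p2

def normalized_locality(p1, p2, n):
    """Equation (2): scale between the T_uniform (0) and T_loc (1) extremes."""
    loc = locality(p1, p2)
    loc_uniform = locality(1.0 / n, 4.0 / (n - 1))   # uniform target over n PMs, = 5/n
    loc_loc = locality(0.5, 0.8)                     # = 0.9 for T_loc
    return (loc - loc_uniform) / (loc_loc - loc_uniform)

if __name__ == "__main__":
    n = 120
    print(normalized_locality(1.0 / n, 4.0 / (n - 1), n))  # T_uniform -> 0.0
    print(normalized_locality(0.5, 0.8, n))                # T_loc -> 1.0
```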

Figure 4. Throughput-latency curves for single ring topologies with 32B cache lines for the (a) T_uniform and (b) T_loc workloads. [Figure: latency (cycles) versus throughput (requests/cycle) for 1-, 2-, 4-, 8-, and 16-processor rings.]

Figure 5. Throughput gain (loss) in percent of using an 8-processor system as opposed to a 4-processor system. [Figure: throughput gain (%) versus locality for 32B, 64B, and 128B cache lines.]

As a result, we conclude that a total of 8 processor-memory modules can be reasonably sustained in a single ring across most memory access patterns, and that as we increase the cache line size, the effect of locality in the memory access pattern on system performance becomes less significant.

4.2 Two-level Rings

In this section, we show that a total of 5 local rings can be reasonably sustained in a two-level hierarchical-ring topology for most memory access patterns, and that the effect of locality in memory accesses on system performance is independent of cache line size for systems of this size. To do so, we add a second-level ring, L2, and determine how many L1 local rings a two-level hierarchy can sustain. The L2 global ring connects a number of L1 local rings, each containing the maximum number of processor-memory modules (n_L1 = 8) determined in the previous section.

Figure 6 presents the throughput-latency curves for 2-level hierarchical rings with 32-byte cache lines. With the T_uniform workload, a global ring can sustain only 3 local rings (n_L2,uniform = 3); any increase in the number of local rings decreases the maximum achievable throughput of the network. However, with the T_loc workload, which has high locality, we can increase the number of local rings to 5 (n_L2,loc = 5). In the latter case, when the number of local rings is further increased, there is no significant increase in the maximum achievable throughput.

One major difference between the single-ring and two-level ring topologies is the effect of locality on the maximum achievable throughput. For T_loc, the maximum achievable throughput with n_L2,loc (the 8 x 5 topology) is about 10% higher than for T_uniform. Figure 7 presents the throughput gain in percent when using an 8 x 5 topology as opposed to an 8 x 3 topology. We see that with the 8 x 5 topology there is a throughput loss for most memory access patterns; however, this loss is small and decreases as locality is increased. The throughput gain starts to grow at a higher rate when locality > 0.6, resulting in a 45% throughput gain when locality = 1. Since the throughput gain of the 8 x 5 topology at higher locality levels is much larger than its throughput loss at lower locality levels, we can reasonably assume that the number of local rings a second-level global ring can sustain is 5.

Figure 6. Throughput-latency curves for two-level ring topologies with 32B cache lines for the (a) T_uniform and (b) T_loc workloads. [Figure: latency (cycles) versus throughput (requests/cycle) for 2 to 6 local rings.]

Figure 7. Throughput gain (loss) in percent of using an 8 x 5 topology as opposed to an 8 x 3 topology. [Figure: throughput gain (%) versus locality for 32B, 64B, and 128B cache lines.]

4.3 Three-level Rings

We next introduce a third level to the hierarchy and proceed to determine how many L2 rings can be sustained. Each L2 ring now consists of a second-level ring connected to 5 L1 rings of 8 nodes each, for a total of 40 nodes. We refer to the third-level ring as the global ring. Figure 8 presents the throughput-latency curves for 3-level hierarchical rings with 32-byte cache lines. For T_uniform, the trend is similar to what we observed for 2-level rings, namely that a maximum of 3 L2 rings (n_L3,uniform = 3) can be sustained by a global ring. However, for T_loc, we are also only able to sustain 3 L2 rings (n_L3,loc = 3). The constant bisection bandwidth of the hierarchical-ring network offsets the benefits of high locality in the memory accesses. Thus, even good locality (where in this case 90% of all requests lie within the source's 4-neighbor cluster) saturates the global ring fairly easily.

Figure 8. Throughput-latency curves for three-level rings with 32B cache lines for the (a) T_uniform and (b) T_loc workloads. [Figure: latency (cycles) versus throughput (requests/cycle) for 2, 3, and 4 level-2 rings.]

5. Effect of Critical Parameters

In this section, we develop a simple analytical model to study the effect of certain critical parameters, such as router speed, on the performance of hierarchical-ring topologies. The analytical model is semi-empirical in that it uses some input parameters derived from simulations. This semi-empirical model allows us to save much simulation time and is useful for determining which part of the design space should be simulated for more accurate predictions. In particular, we define the following parameters:

lambda = processor request rate in requests/cycle
lambda_max = maximum processor request rate
f_lm = fraction of lambda destined to local memory
f_lr = fraction of lambda (1 - f_lm) destined to processors on the local ring, not including the local processor
f_gr = fraction of lambda (1 - f_lm)(1 - f_lr) destined to processors within the 2-level ring hierarchy, but not on the local ring
S_proc = processor speed in cycles/second
S_nic = NIC speed in cycles/second
S_iri = second-level IRI router speed in cycles/second
S_glb_iri = third-level IRI router speed in cycles/second
n_L1 = number of nodes in a local ring
n_L2 = number of local rings connected to a second-level ring
n_L3 = number of 2-level rings connected to a third-level ring
W = channel width in bits
L_trans = average length of a memory transaction in bits

5.1 Single Rings

As a first step, we develop a model for a single ring and then extend it to include a second and a third level. The traffic, m_L1, in bits/second, injected by a processor into the ring depends on the processor request rate, lambda, the fraction of requests that go to local memory, f_lm, the average length of a transaction, L_trans, and the processor speed, S_proc:

m_L1 = lambda (1 - f_lm) L_trans S_proc    (3)

Assuming the T_uniform workload, the average load at any point in the ring will be m_L1 n_L1 / 2, since a packet on average traverses half the ring. We refer to this as the bisection load. For the bisection load to be less than or equal to the bisection bandwidth, it is necessary that:

m_L1 n_L1 / 2 <= 2 W S_nic    (4)

Substituting for m_L1 from Equation 3, we have:

lambda (1 - f_lm) (S_proc / S_nic) (L_trans / W) <= 4 / n_L1    (5)

If we define S_ratio as the ratio of the processor and NIC router speeds, S_proc / S_nic, and n_phits (the number of physical transfer units) as the ratio of the average length of a transaction to the channel width, L_trans / W, and substitute f_lm = 1/n_L1 for the T_uniform workload, Equation 5 can be rewritten as:

lambda (n_L1 - 1) S_ratio n_phits <= 4    (6)

Therefore, the maximum processor request rate in a single ring is given by:

lambda_max(1-level) = 4 / (n_phits (n_L1 - 1) S_ratio)    (7)

In other words, to keep a single-ring network below saturation, a processor's cache miss rate should be at most the value defined in Equation 7. Note that this value is inversely proportional to the average length of a transaction (n_phits), the ratio of processor and NIC router speeds (S_ratio), and the number of nodes in the ring.

5.2 Additional Ring Levels

For two levels of rings, the equivalent of Equation 4 is:

m_L2 n_L2 / 2 <= 2 W S_iri    (8)

where m_L2 is the traffic from a local ring into the global ring, n_L2 is the total number of local rings, and S_iri is the inter-ring interface router speed. The traffic m_L2 can be defined in the same way as in Equation 3:

m_L2 = n_L1 [lambda (1 - f_lm) L_trans S_proc] (1 - f_lr) / S_ratio    (9)

Substituting S_proc / S_nic for S_ratio and expanding Equation 8 using Equation 9, we obtain the maximum processor request rate in a 2-level ring system:

lambda_max(2-level) = 4 / (n_phits (n_L1 - 1) (n_L2 - 1) S_lcl_ratio)    (10)

where S_lcl_ratio is the ratio of the NIC and IRI router speeds, S_nic / S_iri, f_lr = 1/n_L2, and f_lm = 1/n_L1 (for the T_uniform workload). We can proceed similarly and derive the equation for the maximum processor request rate lambda_max(3-level) for 3-level rings.

An interesting property of the (contention-free) maximum processor request rates is that they decrease by a factor of two for every increase in the number of levels in the hierarchy. From Equations 7 and 10, by substituting S_glb_ratio = S_lcl_ratio = 1, S_ratio = 2, n_L2 = 5, and n_L3 = 3, we obtain:

lambda_max(3-level) / lambda_max(2-level) = lambda_max(2-level) / lambda_max(1-level) = 0.5    (11)
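As a quick check of this property, the sketch below evaluates Equations 7 and 10 with the parameter values used above (S_ratio = 2, S_lcl_ratio = 1, n_L1 = 8, n_L2 = 5). The value of n_phits depends on the packet length and channel width, which are not fixed at this point in the text, so the number used here is only an assumed example.

```python
# Sketch: contention-free maximum request rates from Equations (7) and (10).

def lambda_max_1level(n_l1, n_phits, s_ratio):
    """Equation (7): maximum request rate a single ring can sustain."""
    return 4.0 / (n_phits * (n_l1 - 1) * s_ratio)

def lambda_max_2level(n_l1, n_l2, n_phits, s_lcl_ratio):
    """Equation (10): maximum request rate a two-level hierarchy can sustain."""
    return 4.0 / (n_phits * (n_l1 - 1) * (n_l2 - 1) * s_lcl_ratio)

if __name__ == "__main__":
    n_phits = 8   # assumed example: average transaction length / channel width
    one = lambda_max_1level(n_l1=8, n_phits=n_phits, s_ratio=2)
    two = lambda_max_2level(n_l1=8, n_l2=5, n_phits=n_phits, s_lcl_ratio=1)
    print(one, two, two / one)   # the ratio is 0.5, as in Equation (11)
```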

We can, for example, use this property to obtain the number of lower-level rings a global ring can sustain for a level higher than 3. In a 4-level hierarchical-ring network, we know that the maximum processor request rate lambda_max(4-level) will be half of that in a 3-level hierarchy. Therefore:

lambda_max(4-level) = lambda_max(3-level) / ((1 - f_gr) (S_glb_ratio' / S_glb_ratio) n_L4) = 0.5 lambda_max(3-level)    (12)

where S_glb_ratio' denotes the corresponding router speed ratio for the added fourth level. Substituting f_gr = 1/n_L4 and S_glb_ratio = S_glb_ratio' = 1, and solving for n_L4, we obtain:

n_L4 = lambda_max(3-level) / lambda_max(4-level) + 1 = 3    (13)

Therefore, we can sustain up to 3 L3 rings in a 4-level hierarchy.

5.3 Effect of Router Speeds on Performance

As shown earlier, the performance and scalability of hierarchical rings are clearly limited by their constant bisection bandwidth. By increasing the bandwidth of the global ring (and thus the bisection bandwidth), we can connect additional lower-level rings without worsening the average memory access latency. Targeting just the global ring is effective because the utilization of the lower-level rings is low, especially when the global ring is saturated. The bandwidth of the global ring can be increased either by increasing the width of the ring or by increasing its speed. Here we explore the option of clocking the global ring at a higher speed than the local and intermediate rings.

For 2-level rings, we use Equation 10 to obtain n_L2, the maximum number of local rings connected to a global ring:

n_L2 = 4 / (lambda_max(2-level) (n_L1 - 1) n_phits S_iri_ratio) + 1    (14)

If the global ring is twice as fast as the local rings, then S_iri_ratio = S_nic / S_iri = 0.5. Dividing Equation 14 evaluated at S_iri_ratio = 0.5 by Equation 14 evaluated at S_iri_ratio = 1, we obtain:

(n_L2(S_iri_ratio = 0.5) - 1) / (n_L2(S_iri_ratio = 1) - 1) = 2    (15)

Substituting n_L2(S_iri_ratio = 1) = 5 from our simulation results and solving for n_L2(S_iri_ratio = 0.5), we obtain:

n_L2(S_iri_ratio = 0.5) = 9    (16)

From this we conclude that a 2-level hierarchical ring can sustain up to 9 local rings when the global ring is twice as fast as the local rings. For 3-level rings, Equation 15 becomes:

(n_L3(S_glb_ratio = 0.5) - 1) / (n_L3(S_glb_ratio = 1) - 1) = 2    (17)

Since n_L3(S_glb_ratio = 1) = 3 and n_L3(S_glb_ratio = 0.5) = 5, the global ring in a 3-level hierarchy can sustain up to 5 second-level rings when it is clocked at twice the speed of the local rings.

6. Conclusion

This paper presented techniques to derive high-performance topologies for hierarchical-ring networks. Our overall goal was to maximize system throughput. Using a bottom-up approach, we derived the following topologies: up to 8 processors on a level-1 ring, a maximum of 5 level-1 rings in a 2-level hierarchy, and a maximum of 3 level-2 rings in a 3-level hierarchy. As we increase the number of levels in the hierarchy, the constant bisection bandwidth of the hierarchical-ring network offsets the benefits of high locality in memory accesses, saturating the global ring fairly easily. It was also shown that single rings and 2-level hierarchical-ring topologies are more sensitive to locality in memory accesses, whereas higher-level hierarchical-ring topologies are less sensitive. We also presented a semi-empirical analytical model to explore design spaces not considered in our simulations.

References

[1] V.C. Hamacher and H. Jiang, "Performance and configuration of hierarchical ring networks for multiprocessors," Proc. Intl. Conf. on Parallel Processing, Vol. I, August 1997.

[2] M. Holliday and M. Stumm, "Performance evaluation of hierarchical ring-based shared memory multiprocessors," IEEE Trans. on Computers, Vol. 43, No. 1, Jan. 1994.

[3] M.H. MacDougall, Simulating Computer Systems: Techniques and Tools, MIT Press, 1987.

[4] G. Ravindran and M. Stumm, "Hierarchical ring topologies and the effect of their bisection bandwidth constraints," Proc. Intl. Conf. on Parallel Processing, pp. I/51-55, August 1995.

[5] G. Ravindran, "Performance issues in the design of hierarchical-ring and direct networks for shared-memory multiprocessors," Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Toronto, January 1998.

[6] S. Scott, J.R. Goodman, and M.K. Vernon, "Performance of the SCI ring," Proc. Intl. Symp. on Computer Architecture, 1992.

[7] Z.G. Vranesic et al., "The NUMAchine multiprocessor," Technical Report CSRI-TR-324, Computer Systems Research Institute, University of Toronto, 1995.


More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Improving VoD System Efficiency with Multicast and Caching

Improving VoD System Efficiency with Multicast and Caching Improving VoD System Efficiency with Multicast and Caching Jack Yiu-bun Lee Department of Information Engineering The Chinese University of Hong Kong Contents 1. Introduction 2. Previous Works 3. UVoD

More information

TCP Congestion Control in Wired and Wireless networks

TCP Congestion Control in Wired and Wireless networks TCP Congestion Control in Wired and Wireless networks Mohamadreza Najiminaini (mna28@cs.sfu.ca) Term Project ENSC 835 Spring 2008 Supervised by Dr. Ljiljana Trajkovic School of Engineering and Science

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Improving Network Performance by Reducing Network Contention in Source-Based COWs with a Low Path-Computation Overhead Λ

Improving Network Performance by Reducing Network Contention in Source-Based COWs with a Low Path-Computation Overhead Λ Improving Network Performance by Reducing Network Contention in Source-Based COWs with a Low Path-Computation Overhead Λ J. Flich, P. López, M. P. Malumbres, and J. Duato Dept. of Computer Engineering

More information

Prioritized Shufflenet Routing in TOAD based 2X2 OTDM Router.

Prioritized Shufflenet Routing in TOAD based 2X2 OTDM Router. Prioritized Shufflenet Routing in TOAD based 2X2 OTDM Router. Tekiner Firat, Ghassemlooy Zabih, Thompson Mark, Alkhayatt Samir Optical Communications Research Group, School of Engineering, Sheffield Hallam

More information

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS

PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS THE UNIVERSITY OF NAIROBI DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING FINAL YEAR PROJECT. PROJECT NO. 60 PARALLEL ALGORITHMS FOR IP SWITCHERS/ROUTERS OMARI JAPHETH N. F17/2157/2004 SUPERVISOR:

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

On the Relationship of Server Disk Workloads and Client File Requests

On the Relationship of Server Disk Workloads and Client File Requests On the Relationship of Server Workloads and Client File Requests John R. Heath Department of Computer Science University of Southern Maine Portland, Maine 43 Stephen A.R. Houser University Computing Technologies

More information

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,

More information

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011 CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees, butterflies,

More information