The Complexity of FFT and Related Butterfly Algorithms on Meshes and Hypermeshes


T.H. Szymanski
McGill University, Canada

Abstract: Parallel FFT data-flow graphs based on a Butterfly graph followed by a bit-reversal permutation are known, as are optimal-order embeddings of these flow-graphs onto meshes and hypercubes. Embeddings onto a 2D mesh require O(√N) data transfer steps and O(log N) computation steps. Embeddings onto a hypercube require O(log N) data transfer steps and O(log N) computation steps. A similar FFT algorithm for the recently proposed hypermesh, with O(log N) computation steps and O(log N) data transfer steps, is proposed. The performance complexity of the FFT algorithm on all three interconnection networks is then compared, based on the assumptions that (1) all networks are built with discrete crossbar switches interconnected with transmission lines, (2) all networks compared have equivalent aggregate bandwidth, and (3) the packet transmission time is inversely proportional to the link bandwidth. The algorithms are viewed at the word level of abstraction, where every packet is treated as an indivisible unit. Under these assumptions, it is concluded that for practical network sizes the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O(√N/log N) and O(log N) respectively. Considering the computation of a 4K-sample FFT on 4K-processor networks, the hypermesh is roughly a factor of 27 faster than a 2D mesh and a factor of 10 faster than a binary hypercube. Variations in the assumptions may affect the end results slightly; these conclusions may not hold when the network is implemented entirely on a single wafer, but this scenario is unlikely for the next decade or two. These complexity results indicate that the hypermesh is the preferred interconnection scheme in discrete-component constructions of parallel supercomputers.

Index Terms: hypermesh, hypercube, 2D mesh, FFT, SIMD

I. INTRODUCTION

Most commercially available parallel supercomputers are based on point-to-point interconnection networks such as meshes, toroids or hypercubes. A hypermesh interconnection network was recently proposed in [12] and further analysed in [13]. This network is architecturally distinct from the two main classes of networks used in current large-scale machines, namely multistage networks and point-to-point networks. The hypermesh can be modelled as a hypergraph with nodes arranged in n-dimensional space, where all nodes whose addresses differ in exactly one base-b digit belong to a hypergraph net, and where every net can realize permutations of data between all its members. The hypermesh can be built with conventional, commercially available electronic crossbars without requiring advanced technology, and it also has some attractive optical implementations. An efficient optical implementation of hypermeshes which does not require any electrical crossbar switches is described in [12]. (To avoid possible confusion, note that the hypermesh described here is not the same network as the spanning-bus hypercubes described in [2] or the spanning-bus hypermeshes described in [8]. In those networks, a bus or shared transmission medium interconnects all the nodes aligned along a dimension. In the hypermesh described here and in [12][13], all the nodes aligned along a dimension have the ability to perform permutations in one step, which a bus or shared medium cannot perform.
A network similar to the hypermesh was proposed last year in [14], but there are some potentially significant implementation differences; see section 2.) Practical advantages of hypermeshes over banyans and hypercubes were identified in [12][13]: the hypermesh can realize all Omega, Omega-Inverse, DESCEND and ASCEND permutations in one pass and in minimum logical distance. The majority of parallel algorithms, such as the Bitonic sort, the FFT, and matrix algorithms, use these permutations. The hypermesh, like the hypercube, can execute all of these algorithms in optimal-order time under the traditional assumptions.

In this paper, we examine the inherent complexity of the well-known Fast Fourier Transform (FFT) algorithm on 2D meshes, 2D hypermeshes, and binary hypercubes. Assume a word level of abstraction, where every packet is treated as an indivisible unit. The performance of the FFT algorithm on each network is determined by two components: the number of parallel data transfer steps and the number of computation steps. The number of computation steps is the same on each network, so this component need not be considered further. The communication time in each network can be bounded by considering the number of data transfer steps required and the time taken to perform each step. By assumption, all networks are constructed with the same number of crossbar switches, all of the same degree. Each crossbar switch is implemented on a single pin-limited integrated circuit, reflecting realistic engineering constraints in network design. All networks therefore have an equivalent aggregate bandwidth. All inter-processor links are modelled as high-speed transmission lines. In practical discrete-component systems, the time to transmit a packet over a link has two components: the time required for the data to depart from the source (the transmission delay), plus the time required to flush the transmission pipeline (the propagation delay). Propagation delays tend to be negligible in realistic systems, but the effects of such delays can easily be modelled. Under these assumptions, it is concluded that the 2D hypermesh is faster than the 2D mesh and binary hypercubes by factors of O(√N/log N) and O(log N) respectively, for practical network sizes. To estimate the speedups that could be expected in practice, parallel computing systems with 4K Processing Elements, each built with commercially available GaAs crossbar switches, were considered.

In the computation of a 4K-sample FFT on networks with 4K processors, it was concluded that the 2D hypermesh will be faster than the 2D mesh and the binary hypercube by factors of 27 and 10 respectively (considering data transfer time only). When large propagation delays were modelled, the estimated speedups were reduced to 13 and 6 respectively, which are still very significant. A hardware design that realizes even a fraction of this speedup should have considerable impact. Modifications of the basic assumptions may change the results, but it appears that the main conclusion will still apply to the next generation or two of parallel supercomputers.

Repeating the complexity analysis at the bit level, where every packet is viewed as being composed of individual bits, will yield different results. At the bit level, O(log N) bits are required just to encode the destination of a packet, and hence the packet transmission time must be O(log N). The propagation delay must be O(L), where L is the length of the transmission line. Therefore, a basic data transfer step would have duration O(log N + L). However, the packets and the networks would have to be extremely and unrealistically large before the effects would be noticeable. Therefore, in this paper we focus the discussion at the word level.

Recent trends in parallel supercomputer architecture are towards 2D meshes and lower-dimensional toroids. Dally has shown that when the entire interconnection network is implemented on a single VLSI wafer, lower-dimensional meshes and toroids may outperform higher-dimensional networks such as the binary hypercube under certain assumptions [4]. Some notable assumptions in [4] are: (1) the entire network is implemented on a large wafer, (2) all networks compared have an equivalent bisection bandwidth, (3) the logic delay through a node is negligible with respect to the transmission delay over a link, and (4) the traffic is randomly distributed over all nodes in the network. However, some of these assumptions are not valid in current systems. Embedding a large network onto a single wafer is currently technologically impossible, and will likely remain so for a few decades. Therefore Dally's main conclusion, that lower-dimensional toroids outperform higher-dimensional toroids (such as hypercubes), was not meant to apply, nor does it apply, to discrete-component constructions of parallel supercomputers.

Finally, it should be noted that comparisons of the ability of networks to simulate other networks have also been derived. Under the word model, Valiant proved that the hypercube (with nodes of degree log N) can simulate any other bounded-degree network with a slowdown of O(log N) steps [15]. Thus the hypercube is said to be universal, since a parallel computer built with a hypercube could simulate a parallel computer built with any other bounded-degree interconnection network with at most a polylogarithmic slowdown. Leiserson proved that under the word model, his Fat-Trees could simulate any other network built with equivalent volume with at most a polylogarithmic slowdown in time [5]. It is important to note that in [15] the propagation delays were assumed to be negligible (since otherwise the longest wires have length O(√N) and this propagation delay would dominate all others).
Incidentally, under the same assumptions as in [15], it was proven in [13] that the degree-log N hypermesh can simulate any bounded-degree network with at most a slowdown of O(log N/loglog N) steps, which is faster than the hypercube by a factor of O(loglog N). In this sense, hypermeshes are also universal, and faster than hypercubes by a factor of O(loglog N). The result in [13] provided the motivation for this paper: to estimate the speedups that could be expected in practice for a specific class of parallel algorithms.

This paper is organized as follows. Section 2 reviews the hypermesh network. Section 3 describes the parallel FFT algorithms. Section 4 estimates the speedup that could be expected in practice. Section 5 considers bisection bandwidths, and section 6 contains some concluding remarks.

II. THE HYPERMESH NETWORK

A 2D hypermesh is shown in fig. 1 and a PE-node is shown in fig. 2. The bold lines represent hypergraph nets. In the original description of d^n hypermeshes in [12], it was stated that every PE-node required a small n×n crossbar to switch between the n dimensions. This added crossbar can be costly, since it requires a separate integrated circuit, routing logic and buffering ability. The ability to realize useful permutations and embed other useful graphs is not impeded by eliminating this n×n crossbar ([12][13]), which can reduce the crossbar IC count by about 50% for SIMD machines. The network in [14] is essentially a hypermesh with this added n×n crossbar at each PE-node.

III. THE FFT ALGORITHM

The data-flow graph of the standard parallel radix-two Cooley-Tukey FFT algorithm is shown in fig. 3. The flow-graph consists of an SW-banyan or Butterfly graph followed by a bit-reversal permutation. Each node performs a computation, typically multiplying the lower input by a twiddle factor and performing a complex addition of the result with the upper input. The details of the actual computations occurring within each node can be ignored, since it is the communication time that is of interest.

A. FFT on the Binary Hypercube

Since the hypercube can implement all Butterfly permutations without conflict, embedding the SW-banyan part of the data-flow graph is straightforward, and it requires exactly log N data transfer steps and log N computation steps. The worst-case distance between any source and destination in a bit-reversal permutation is exactly log N (e.g., the node at address 0101...01 must send its data to the node at address 1010...10), requiring a traversal over all log N hypercube dimensions. Therefore the best that any routing algorithm could do is exactly log N data transfer steps.
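To make the worst case concrete, the following sketch (illustrative Python, not part of the original paper) confirms that the bit-reversal permutation contains source/destination pairs separated by all log N hypercube dimensions:

def bit_reverse(x, bits):
    # Reverse the low `bits` bits of the integer x.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def hypercube_distance(a, b):
    # Hamming distance = number of hypercube dimensions to traverse.
    return bin(a ^ b).count("1")

bits = 12                              # log N for N = 4096 nodes
N = 1 << bits
worst = max(hypercube_distance(x, bit_reverse(x, bits)) for x in range(N))
print(worst)                           # 12, i.e. log N: 0101...01 maps to 1010...10

Since at least one packet must cross every dimension, no routing algorithm can complete the bit-reversal in fewer than log N data transfer steps.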

In summary, an optimal-order embedding of the FFT flow-graph in fig. 3 onto the binary hypercube requires log N computation steps and 2 log N data transfer steps. (Note: we have not attempted to minimize the constants in this paper.)

B. FFT on the 2D Mesh

Let each row (or column) in the 2D mesh have √N elements. Assume an embedding of the flow graph onto the mesh in row-major order; each row and each column then effectively contains a smaller Butterfly graph with √N inputs and outputs. Assuming that there are no wrap-around links, it is not difficult to verify that the butterflies on a row or column require exactly √N - 1 data transfer steps and (log N)/2 computation steps. Ignoring the bit-reversal at the end, the parallel FFT algorithm on a 2D mesh therefore requires log N computation steps and 2√N - 2 data transfer steps. The longest path in the bit-reversal is formed by the packets in diagonally opposite corners, which must be interchanged. Thus the final bit-reversal permutation will require at least another 2√N - 2 steps, assuming no wrap-around. (With wrap-around links, the longest path is still not less than √N/2; consider a packet in the middle of row 0.)

In summary, an optimal-order embedding of the FFT flow graph in fig. 3 onto the 2D mesh requires log N computation steps and 2√N - 2 data transfer steps, ignoring the bit-reversal permutation needed at the end. The bit-reversal permutation adds at least √N/2 extra data transfer steps (assuming wrap-around links are available).

C. FFT on the 2D Hypermesh

Since the hypermesh can implement all Butterfly permutations without conflict, the SW-banyan part of the FFT data-flow graph can be embedded in exactly log N computation steps and log N data transfer steps (just as in a hypercube). The bit-reversal permutation can be performed in the 2D hypermesh in at most 3 parallel data transfer steps, using the fact that the 2D hypermesh is rearrangeable and can realize any permutation in 3 steps (see property 6 in [12]).

In summary, an optimal-order FFT on a 2D hypermesh requires log N computation steps and not more than log N + 3 data transfer steps. See Table 2A for a summary.
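These bounds are easy to tabulate. The sketch below (again illustrative Python; the function names are ours) reproduces the data-transfer-step totals of Table 2A for a 4K-point FFT on 4K-processor networks:

import math

def mesh_steps(N):
    # 2D mesh: roughly 2*sqrt(N) butterfly steps plus an optimistic
    # sqrt(N)/2 bit-reversal steps (wrap-around links assumed).
    return 5 * math.isqrt(N) // 2

def hypercube_steps(N):
    # Hypercube: log N butterfly steps plus log N bit-reversal steps.
    return 2 * int(math.log2(N))

def hypermesh_steps(N):
    # 2D hypermesh: log N butterfly steps plus at most 3 bit-reversal steps.
    return int(math.log2(N)) + 3

N = 4096
print(mesh_steps(N), hypercube_steps(N), hypermesh_steps(N))   # 160 24 15

The per-step durations differ between networks, however, because the inter-PE link bandwidths differ; these are derived next.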
Therefore, the inter-pe link bandwidth in a net is given by (2 ) K L = K L/2 (1) E. Bounds on the Communication Time The packet transmission time over each inter-pe link is inversely proportional to the inter-pe link bandwidths computed above. For simplicity ignore the computation time in what follows, so that the comparison is based on communication time only. (We assume that propagation delays are negligible with respect to the transmission delay in this section, which is the case for realistic systems.) The FFT algorithm in the 2D mesh requires O( ) communication steps, and each step requires 5/KL = O(1/KL) time, for a total communications time of O( /KL). It is not difficult to verify that the use of virtual channels or the wormhole routing technique described in [4] cannot improve this bound in a 2D mesh. The FFT algorithm in the binary hypercube requires O(log) communications steps, and each step requires O(log/KL) time, for a total communication time of O(log 2 /KL). The FFT algorithm on the 2D hypermesh requires O(log) communications steps, and each step requires O(1/KL) time, for a total communication time of O(log/KL). These results are tabulated in Table 2B. Based on these bounds, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O( /log) and O(log) respectively. IV. REALISTIC COMPARISOS OF 2D MESHES, HYPERMESHES AD HYPERCUBES Currently a crossbar switch can be implemented on a single GaAs IC, and such ICs are commercially available. Each crossbar link (or IO pin) has a bandwidth of 200 Mbit/sec., and each crossbar requires a separate application specific IC to perform application specific routing functions. When such a crossbar is used as a b b node (for b <= 64) assume that each inter-pe link is driven by 64/b crossbar IO pins arranged in parallel. To avoid processor intervention when routing messages in a 2D mesh assume that a crossbar based routing node handles these functions independently. A 4K Processor mesh requires 4K routing nodes; using a GaAs IC for each node, each inter- PE link would use 64/5 = 12.8 crossbar IO pins for an

4 inter-pe link bandwidth of 2.56 Gbit/sec. Therefore the time required to transmit a 128-bit packet between adjacent nodes is 50 nanosec. (ote: the figure 12.8 should be rounded down to 12, but by ignoring this rounding the performance of the 2D mesh is over-estimated slightly.) In a 4K Processor hypercube each processor requires a degree 13 node. Using a GaAs IC for each routing node, each inter-pe link would use 64/13 = 4.92 crossbar IO pins for an inter-pe link bandwidth of.985 Gbit/sec. Therefore the time to transmit a 128-bit packet between two neighboring nodes is 130 nanosec. (ote: the figure 4.92 should be rounded down, but by ignoring the rounding the performance of the hypercube is over-estimated slightly.) A number of choices exist for the hypermesh; a 8 4, 16 3 and 64 2 hypermesh can all interconnect 4K Processors. Consider a 2D 64 2 hypermesh with 64 rows and 64 columns, with hypergraph net in each row and in each column, for a total of 128 nets. To use the same number of GaAs crossbar ICs as the 2D mesh and the hypercube, assume that each hypermesh net uses 32 GaAs ICs in parallel. The inter-pe link bandwidth is then Mbit./sec. = 6.4 Gbit/sec and the time to transmit a 128-bit packet between 2 nodes in the same row or column is then 20 nanosec, plus propagation delay. A. Estimated Speedup, egligible Propagation Delays The total communication time in the 2D mesh, allowing an optimistic /2 steps for the bit-reversal permutation, is then (5/2 steps) (50 nsec/step) = 8 µsec (2) The total communication time in the binary hypercube is then (2 log steps) (130 nanosec/step) = 3.12 µ sec (3) The total communication time in the 2D hypermesh is then (log +3 steps) (20 nanosec/step) = 0.3 µ sec (4) Therefore, the 2D hypermesh is faster than the 2D mesh by a factor of 26.6, and the 2D hypermesh is faster than the binary hypercube by a factor of (If the bit-reversal is not needed, as in many applications, the figures become 26.6 and 6.5 respectively.) A similar comparison was performed in [13] for the Bitonic sort executing on the 2D mesh, the 2D hypermesh and the binary hypercube. In [13] it was concluded that the hypermesh is faster than the 2D mesh and the binary hypercube by factors of 12.3 and 6.47 respectively. B. Estimated Speedup, Including Propagation Delay Finally, we may wish to add a 20 nanosec propagation delay to the hypermesh and hypercube, which would model the propagation of a signal over about 20 feet of transmission line. In this case, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of 13.3 and 6 respectively. ote that the 2D hypermesh is still 25 transfers between nearest neighbors. V. BISECTIO BADWIDTH The bisection bandwidth of a network is defined as the bandwidth that crosses an imaginary bisector which subdivides a network into two halves of equal size. Further insight into why the hypermesh performance exceeds that of the 2D mesh and hypercube can be found by considering the bisection bandwidth of each network. Each network has the same aggregate bandwidth, and one fundamental difference between them is how this bandwidth is distributed over the nodes. The hypermesh has a much larger bisection bandwidth compared to the 2D mesh or hypercube, regardless of how one bisects the network. This increase in bisection bandwidth appears as significantly decreased communication time, especially for ASCED and DESCED algorithms, where every Butterfly permutation causes transfers over a network bisector. 
V. BISECTION BANDWIDTH

The bisection bandwidth of a network is defined as the bandwidth that crosses an imaginary bisector which subdivides the network into two halves of equal size. Further insight into why the hypermesh performance exceeds that of the 2D mesh and hypercube can be found by considering the bisection bandwidth of each network. Each network has the same aggregate bandwidth, and one fundamental difference between them is how this bandwidth is distributed over the nodes. The hypermesh has a much larger bisection bandwidth than the 2D mesh or hypercube, regardless of how one bisects the network. This increase in bisection bandwidth appears as significantly decreased communication time, especially for ASCEND and DESCEND algorithms, where every Butterfly permutation causes transfers over a network bisector.

The bisection bandwidth of the 2D mesh is √N · (KL/5). The bisection bandwidth of the binary hypercube is (N/2) · (KL/log N). The bisection bandwidth of any one hypermesh net is √N · (KL/2), and the bisection bandwidth of the entire 2D hypermesh is N·KL/2. (This figure is intuitively obvious; each crossbar has its full bandwidth crossing the bisector, and there are N/2 such crossbars.) Clearly, the 2D hypermesh has a bisection bandwidth that is larger than that of the 2D mesh and the binary hypercube by factors of O(√N) and O(log N) respectively. Therefore every data permutation which results in data transfers over any bisector will run significantly faster on the hypermesh.
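The three bisection bandwidths can be compared directly; the sketch below (illustrative Python, under the degree-K, bandwidth-L crossbar assumptions above) evaluates them for the 4K-PE systems of section 4:

import math

def bisection_bw(network, N, K=64, L=200e6):
    # Bisection bandwidth in bit/sec for an N-PE network built from
    # N degree-K crossbars with L bit/sec per IO pin.
    if network == "2D mesh":
        return math.isqrt(N) * (K * L / 5)       # sqrt(N) links of K*L/5 each
    if network == "hypercube":
        return (N / 2) * (K * L / math.log2(N))  # N/2 links of ~K*L/log N each
    if network == "2D hypermesh":
        return N * K * L / 2                     # N/2 straddling crossbars at full K*L
    raise ValueError(network)

N = 4096
for net in ("2D mesh", "hypercube", "2D hypermesh"):
    print(net, bisection_bw(net, N) / 1e9, "Gbit/sec")
# Ratios: hypermesh/mesh = (5/2)*sqrt(N) = 160; hypermesh/hypercube = log N = 12.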

VI. CONCLUSIONS

A parallel FFT algorithm for the recently proposed hypermesh [12] was illustrated; the algorithm requires log N - 3 fewer data transfer steps than the corresponding FFT algorithm for the binary hypercube, since the bit-reversal permutation can be implemented in at most 3 parallel steps on a hypermesh. The temporal complexity of the FFT algorithm, when executing on 2D meshes, hypercubes and 2D hypermeshes, was then derived. It was shown that for practical network sizes, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O(√N/log N) and O(log N) respectively. Assuming a 4K-PE parallel processor built with existing technology, in practice the 2D hypermesh should be faster than the 2D mesh and the binary hypercube by factors of 27 and 10 respectively, when propagation delays are negligible. When propagation delays were modelled, the 2D hypermesh should be faster than the 2D mesh and the hypercube by factors of roughly 13 and 6 respectively.

A simple explanation for this increase in performance is based on the concept of bisection bandwidth. While all networks being compared have equivalent aggregate bandwidth, this bandwidth is distributed over the nodes in different ways, depending on the network topology. The 2D hypermesh has a bisection bandwidth that is larger than that of the 2D mesh and the binary hypercube by factors of √N and log N respectively. This increase in bisection bandwidth translates to significantly decreased delays, especially for the ASCEND and DESCEND permutations, each of which causes data transfers over some network bisector.

TABLE 1A: HARDWARE COMPLEXITY BEFORE NORMALIZATION FOR EQUIVALENT COST.
(Each network has N PEs. Hypermesh degree = log N.)

network       | # crossbars  | degree | diameter
2D Mesh       | N            | 4      | 2√N
2D Hypermesh  | N            | 2√N    | 2
hypercube     | N            | log N  | log N
hypermesh     | N/loglog N   | log N  | log N/loglog N

TABLE 1B: COMPARISON AFTER NORMALIZATION.

network       | link-bw   | diameter D | D/BW
2D Mesh       | KL/4      | 2√N        | O(√N/KL)
2D Hypermesh  | KL/2      | 2          | O(1/KL)
hypercube     | KL/log N  | log N      | O(log²N/KL)

TABLE 2A: COMPARISON OF THE FFT ON VARIOUS NETWORKS.

network       | # bit-reversal steps | # d.t. steps total
2D Mesh       | ≥ √N/2               | ≥ 5√N/2
Hypercube     | ≥ log N              | ≥ 2 log N
2D hypermesh  | ≤ 3                  | ≤ log N + 3

TABLE 2B: FFT EXECUTION TIME AFTER NORMALIZATION. (T_comm denotes total communication time.)

network       | # data transfer steps | O(T_comm)
2D Mesh       | O(√N)                 | O(√N/KL)
Hypercube     | O(log N)              | O(log²N/KL)
2D hypermesh  | O(log N)              | O(log N/KL)

Fig. 1. A 2D hypermesh. Bold lines are hypergraph nets.
Fig. 2. A PE-node in a 2D hypermesh-based SIMD machine.
Fig. 3. Data-flow graph of the Cooley-Tukey FFT.

REFERENCES

[1] S. Abraham and K. Padmanabhan, "Constraint-Based Evaluation of Multicomputer Networks," Int. Conf. Parallel Processing, 1991.
[2] L.N. Bhuyan and D.P. Agrawal, "Generalized Hypercube and Hyperbus Structures for a Computer Network," IEEE Trans. Comput., Vol. C-33, No. 4, 1984.
[3] C. Fang and T.H. Szymanski, "An Analysis of Deflection Routing in Multidimensional Regular Mesh Networks," IEEE Infocom 91, April 1991.
[4] W.J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Trans. Comput., June 1990.
[5] C.E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Comput., Vol. C-34, No. 10, Oct. 1985.
[6] M.C. Pease, "The Indirect Binary n-cube Microprocessor Array," IEEE Trans. Comput., Vol. C-26, May 1977.
[7] F.P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile Network for Parallel Computation," CACM, May 1981.
[8] I.D. Scherson, "Orthogonal Graphs for the Construction of a Class of Interconnection Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 2, No. 1, Jan. 1991.
[9] H.J. Siegel, Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies, 2nd Edition, McGraw-Hill, 1990.
[10] H.J. Siegel, "A Model of SIMD Machines and a Comparison of Various Interconnection Networks," IEEE Trans. Comput., Vol. C-28, No. 12, Dec. 1979.
[11] H.S. Stone, High Performance Computer Architecture, 2nd Edition, Addison-Wesley, 1990.
[12] T.H. Szymanski, "A Fiber-Optic Hypermesh for SIMD/MIMD Machines," IEEE Supercomputing-90, November 1990.
[13] T.H. Szymanski, "O(log N/loglog N) Randomized Routing on Degree-log N Hypermeshes," IEEE Int. Conf. Parallel Processing, August 1991.
[14] N. Tanabe, T. Suzuoka, S. Nakamura, Y. Kawakura and S. Oyanagi, "Base-m n-cube: High Performance Interconnection Networks for Highly Parallel Computer Prodigy," IEEE Int. Conf. Parallel Processing, August 1991.
[15] L.G. Valiant and G.J. Brebner, "Universal Schemes for Parallel Communications," Proc. 13th Annual ACM Symp. on Theory of Computing, 1981.
