The Complexity of FFT and Related Butterfly Algorithms on Meshes and Hypermeshes


T.H. Szymanski
McGill University, Canada

Abstract: Parallel FFT data-flow graphs based on a Butterfly graph followed by a bit-reversal permutation are known, as are optimal-order embeddings of these flow-graphs onto meshes and hypercubes. Embeddings onto a 2D mesh require O(√N) data transfer steps and O(log N) computation steps. Embeddings onto a hypercube require O(log N) data transfer steps and O(log N) computation steps. A similar FFT algorithm for the recently proposed hypermesh, with O(log N) computation steps and O(log N) data transfer steps, is proposed. The performance complexity of the FFT algorithm on all three interconnection networks is then compared, based on the assumptions that (1) all networks are built with discrete crossbar switches interconnected with transmission lines, (2) all networks compared have equivalent aggregate bandwidth, and (3) the packet transmission time is inversely proportional to the link bandwidth. The algorithms are viewed at the word level of abstraction, where every packet is treated as an indivisible unit. Under these assumptions, it is concluded that for practical network sizes the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O(√N/log N) and O(log N) respectively. Considering the computation of a 4K-sample FFT on 4K-processor networks, the hypermesh is roughly a factor of 27 faster than a 2D mesh and a factor of 10 faster than a binary hypercube. Variations in the assumptions may affect the end results slightly; these conclusions may not hold when the network is implemented entirely on a single wafer, but this scenario is unlikely for the next decade or two. These complexity results indicate that the hypermesh is the preferred interconnection scheme in discrete-component constructions of parallel supercomputers.

Index Terms: hypermesh, hypercube, 2D mesh, FFT, SIMD

I. INTRODUCTION

Most commercially available parallel supercomputers are based on point-to-point interconnection networks such as meshes, toroids or hypercubes. A hypermesh interconnection network was recently proposed in [12] and further analysed in [13]. This network is architecturally distinct from the two main classes of networks used in current large-scale machines, namely multistage networks and point-to-point networks. The hypermesh can be modelled as a hypergraph with nodes arranged in n-dimensional space, where all nodes whose addresses differ in exactly one base-b digit belong to a hypergraph net, and where every net can realize permutations of data between all its members. The hypermesh can be built with conventional, commercially available electronic crossbars without requiring advanced technology, and it also has some attractive optical implementations. An efficient optical implementation of hypermeshes which does not require any electrical crossbar switches is described in [12]. (To avoid possible confusion, note that the hypermesh described here is not the same network as the spanning-bus hypercubes described in [2] or the spanning-bus hypermeshes described in [8]. In those networks, a bus or shared transmission medium interconnects all the nodes aligned along a dimension. In the hypermesh described here and in [12][13], all the nodes aligned along a dimension have the ability to perform permutations in one step, which a bus or shared medium cannot perform.
A network similar to the hypermesh was proposed last year in [14], but there are some potentially significant implementation differences; see section 2.) Practical advantages of hypermeshes over banyans and hypercubes were identified in [12][13]: the hypermesh can realize all Omega, Omega-Inverse, DESCEND and ASCEND permutations in one pass and in minimum logical distance. The majority of parallel algorithms, such as the Bitonic sort, the FFT, and matrix algorithms, use these permutations. The hypermesh, like the hypercube, can execute all of these algorithms in optimal-order time under the traditional assumptions.

In this paper, we examine the inherent complexity of the well-known Fast Fourier Transform (FFT) algorithm on 2D meshes, 2D hypermeshes, and binary hypercubes. Assume a word level of abstraction, where every packet is treated as an indivisible unit. The performance of the FFT algorithm on each network is determined by two components: the number of parallel data transfer steps and the number of computation steps. The number of computation steps is the same on each network, so this component need not be considered further. The communication time in each network can be bounded by considering the number of data transfer steps required and the time taken to perform each step. By assumption, all networks are constructed with the same number of crossbar switches, all of the same degree. Each crossbar switch is implemented on a single pin-limited integrated circuit, reflecting realistic engineering constraints in network design. All networks therefore have an equivalent aggregate bandwidth. All inter-processor links are modelled as high-speed transmission lines. In practical discrete-component systems, the time to transmit a packet over a link has two components: the time required for the data to depart from the source (the transmission delay), plus the time required to flush the transmission pipeline (the propagation delay). Propagation delays tend to be negligible in realistic systems, but the effects of such delays can easily be modelled. Under these assumptions, it is concluded that the 2D hypermesh is faster than the 2D mesh and binary hypercubes by factors of O(√N/log N) and O(log N) respectively, for practical network sizes. To estimate the speedups that could be expected in practice, parallel computing systems with 4K Processing Elements, each built with commercially available GaAs crossbar switches, were considered.

In the computation of a 4K-sample FFT on networks with 4K processors, it was concluded that the 2D hypermesh will be faster than the 2D mesh and the binary hypercube by factors of 27 and 10 respectively (considering data transfer time only). When large propagation delays were modelled, the estimated speedups were reduced to 13 and 6 respectively, which are still very significant. A hardware design that realizes even a fraction of this speedup should have considerable impact. Modifications of the basic assumptions may change the results, but it appears that the main conclusion will still apply to the next generation or two of parallel supercomputers.

Repeating the complexity analysis at the bit level, where every packet is viewed as being composed of individual bits, will yield different results. At the bit level, O(log N) bits are required just to encode the destination of a packet, and hence the packet transmission time must be O(log N). The propagation delay must be O(L), where L is the length of the transmission line. Therefore, a basic data transfer step would have duration O(log N + L). However, the packets and the networks would have to be extremely and unrealistically large before the effects would be noticeable. Therefore, in this paper we focus the discussion at the word level.

Recent trends in parallel supercomputer architecture are towards 2D meshes and lower-dimensional toroids. Dally has shown that when the entire interconnection network is implemented on a single VLSI wafer, lower-dimensional meshes and toroids may outperform higher-dimensional networks such as the binary hypercube under certain assumptions [4]. Some notable assumptions in [4] are: (1) the entire network is implemented on a large wafer, (2) all networks compared have an equivalent bisection bandwidth, (3) the logic delay through a node is negligible with respect to the transmission delay over a link, and (4) the traffic is randomly distributed over all nodes in the network. However, some of these assumptions are not valid in current systems. Embedding a large network onto a single wafer is currently technologically impossible, and will likely remain so for a few decades. Therefore Dally's main conclusion, that lower-dimensional toroids outperform higher-dimensional toroids (such as hypercubes), was not meant to apply, nor does it apply, to discrete-component constructions of parallel supercomputers.

Finally, it should be noted that comparisons of the ability of networks to simulate other networks have also been derived. Under the word model, Valiant proved that the hypercube (with nodes of degree log N) can simulate any other bounded-degree network with a slowdown of O(log N) steps [15]. Thus the hypercube is said to be universal, since a parallel computer built with a hypercube could simulate a parallel computer built with any other bounded-degree interconnection network with at most a polylogarithmic slowdown. Leiserson proved that under the word model, his Fat-Trees could simulate any other network built with equivalent volume with at most a polylogarithmic slowdown in time [5]. It is important to note that in [15] the propagation delays were assumed to be negligible (since otherwise the longest wires have length O(√N) and this propagation delay would dominate all others).
Incidentally, under the same assumptions as in [15], it was proven in [13] that the degree-log N hypermesh can simulate any bounded-degree network with at most a slowdown of O(log N/loglog N) steps, which is faster than the hypercube by a factor of O(loglog N). In this sense, hypermeshes are also universal, and faster than hypercubes by a factor of O(loglog N). The result in [13] provided the motivation for this paper: to estimate the speedups that could be expected in practice for a specific class of parallel algorithms.

This paper is organized as follows. Section 2 reviews the hypermesh network. Section 3 describes the parallel FFT algorithms. Section 4 estimates the speedup that could be expected in practice. Section 5 considers bisection bandwidths, and section 6 contains some concluding remarks.

II. THE HYPERMESH NETWORK

A 2D hypermesh is shown in fig. 1 and a PE-node is shown in fig. 2. The bold lines represent hypergraph nets. In the original description of d^n hypermeshes in [12], it was stated that every PE-node required a small n×n crossbar to switch between the n dimensions. This added crossbar can be costly, since it requires a separate integrated circuit, routing logic and buffering ability. The ability to realize useful permutations and embed other useful graphs is not impeded by eliminating this n×n crossbar ([12][13]), which can reduce the crossbar IC count by about 50% for SIMD machines. The network in [14] is essentially a hypermesh with this added n×n crossbar at each PE-node.

III. THE FFT ALGORITHM

The data-flow graph of the standard parallel radix-two Cooley-Tukey FFT algorithm is shown in fig. 3. The flow-graph consists of an SW-banyan or Butterfly graph followed by a bit-reversal permutation. Each node performs a computation, typically multiplying the lower input by a twiddle factor and performing a complex addition of the result with the upper input. The details of the actual computations occurring within each node can be ignored, since it is the communication time that is of interest.

A. FFT on the Binary Hypercube

Since the hypercube can implement all Butterfly permutations without conflict, embedding the SW-banyan part of the data-flow graph is straightforward, and it requires exactly log N data transfer steps and log N computation steps. The worst-case distance between any source and destination in a bit-reversal permutation is exactly log N (e.g., the node at address 0101...01 must send its data to the node at address 1010...10), requiring a traversal over all log N hypercube dimensions. Therefore the best that any routing algorithm could do is exactly log N data transfer steps.
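To make the worst case concrete, the following sketch (illustrative Python, not part of the original paper) confirms that the bit-reversal permutation contains source/destination pairs separated by all log N hypercube dimensions:

def bit_reverse(x, bits):
    # Reverse the low `bits` bits of the integer x.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def hypercube_distance(a, b):
    # Hamming distance = number of hypercube dimensions to traverse.
    return bin(a ^ b).count("1")

bits = 12                              # log N for N = 4096 nodes
N = 1 << bits
worst = max(hypercube_distance(x, bit_reverse(x, bits)) for x in range(N))
print(worst)                           # 12, i.e. log N: 0101...01 maps to 1010...10

Since at least one packet must cross every dimension, no routing algorithm can complete the bit-reversal in fewer than log N data transfer steps.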

In summary, an optimal-order embedding of the FFT flow-graph in fig. 3 onto the binary hypercube requires log N computation steps and 2 log N data transfer steps. (Note: we have not attempted to minimize the constants in this paper.)

B. FFT on the 2D Mesh

Let each row (or column) in the 2D mesh have √N elements. Assume an embedding of the flow graph onto the mesh in row-major order; each row and each column then effectively contains a smaller Butterfly graph with √N inputs and outputs. Assuming that there are no wrap-around links, it is not difficult to verify that the butterflies on a row or column require exactly √N - 1 data transfer steps and (log N)/2 computation steps. Ignoring the bit-reversal at the end, the parallel FFT algorithm on a 2D mesh therefore requires log N computation steps and 2√N - 2 data transfer steps. The longest path in the bit-reversal is formed by the packets in diagonally opposite corners, which must be interchanged. Thus the final bit-reversal permutation will require at least another 2√N - 2 steps, assuming no wrap-around. (With wrap-around links, the longest path is still not less than √N/2; consider a packet in the middle of row 0.)

In summary, an optimal-order embedding of the FFT flow graph in fig. 3 onto the 2D mesh requires log N computation steps and 2√N - 2 data transfer steps, ignoring the bit-reversal permutation needed at the end. The bit-reversal permutation adds at least √N/2 extra data transfer steps (assuming wrap-around links are available).

C. FFT on the 2D Hypermesh

Since the hypermesh can implement all Butterfly permutations without conflict, the SW-banyan part of the FFT data-flow graph can be embedded in exactly log N computation steps and log N data transfer steps (just as in a hypercube). The bit-reversal permutation can be performed in the 2D hypermesh in at most 3 parallel data transfer steps, using the fact that the 2D hypermesh is rearrangeable and can realize any permutation in 3 steps (see property 6 in [12]).

In summary, an optimal-order FFT on a 2D hypermesh requires log N computation steps and not more than log N + 3 data transfer steps. See Table 2A for a summary.
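These bounds are easy to tabulate. The sketch below (again illustrative Python; the function names are ours) reproduces the data-transfer-step totals of Table 2A for a 4K-point FFT on 4K-processor networks:

import math

def mesh_steps(N):
    # 2D mesh: roughly 2*sqrt(N) butterfly steps plus an optimistic
    # sqrt(N)/2 bit-reversal steps (wrap-around links assumed).
    return 5 * math.isqrt(N) // 2

def hypercube_steps(N):
    # Hypercube: log N butterfly steps plus log N bit-reversal steps.
    return 2 * int(math.log2(N))

def hypermesh_steps(N):
    # 2D hypermesh: log N butterfly steps plus at most 3 bit-reversal steps.
    return int(math.log2(N)) + 3

N = 4096
print(mesh_steps(N), hypercube_steps(N), hypermesh_steps(N))   # 160 24 15

The per-step durations differ between networks, however, because the inter-PE link bandwidths differ; these are derived next.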
Therefore, the inter-pe link bandwidth in a net is given by (2 ) K L = K L/2 (1) E. Bounds on the Communication Time The packet transmission time over each inter-pe link is inversely proportional to the inter-pe link bandwidths computed above. For simplicity ignore the computation time in what follows, so that the comparison is based on communication time only. (We assume that propagation delays are negligible with respect to the transmission delay in this section, which is the case for realistic systems.) The FFT algorithm in the 2D mesh requires O( ) communication steps, and each step requires 5/KL = O(1/KL) time, for a total communications time of O( /KL). It is not difficult to verify that the use of virtual channels or the wormhole routing technique described in [4] cannot improve this bound in a 2D mesh. The FFT algorithm in the binary hypercube requires O(log) communications steps, and each step requires O(log/KL) time, for a total communication time of O(log 2 /KL). The FFT algorithm on the 2D hypermesh requires O(log) communications steps, and each step requires O(1/KL) time, for a total communication time of O(log/KL). These results are tabulated in Table 2B. Based on these bounds, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O( /log) and O(log) respectively. IV. REALISTIC COMPARISOS OF 2D MESHES, HYPERMESHES AD HYPERCUBES Currently a crossbar switch can be implemented on a single GaAs IC, and such ICs are commercially available. Each crossbar link (or IO pin) has a bandwidth of 200 Mbit/sec., and each crossbar requires a separate application specific IC to perform application specific routing functions. When such a crossbar is used as a b b node (for b <= 64) assume that each inter-pe link is driven by 64/b crossbar IO pins arranged in parallel. To avoid processor intervention when routing messages in a 2D mesh assume that a crossbar based routing node handles these functions independently. A 4K Processor mesh requires 4K routing nodes; using a GaAs IC for each node, each inter- PE link would use 64/5 = 12.8 crossbar IO pins for an

4 inter-pe link bandwidth of 2.56 Gbit/sec. Therefore the time required to transmit a 128-bit packet between adjacent nodes is 50 nanosec. (ote: the figure 12.8 should be rounded down to 12, but by ignoring this rounding the performance of the 2D mesh is over-estimated slightly.) In a 4K Processor hypercube each processor requires a degree 13 node. Using a GaAs IC for each routing node, each inter-pe link would use 64/13 = 4.92 crossbar IO pins for an inter-pe link bandwidth of.985 Gbit/sec. Therefore the time to transmit a 128-bit packet between two neighboring nodes is 130 nanosec. (ote: the figure 4.92 should be rounded down, but by ignoring the rounding the performance of the hypercube is over-estimated slightly.) A number of choices exist for the hypermesh; a 8 4, 16 3 and 64 2 hypermesh can all interconnect 4K Processors. Consider a 2D 64 2 hypermesh with 64 rows and 64 columns, with hypergraph net in each row and in each column, for a total of 128 nets. To use the same number of GaAs crossbar ICs as the 2D mesh and the hypercube, assume that each hypermesh net uses 32 GaAs ICs in parallel. The inter-pe link bandwidth is then Mbit./sec. = 6.4 Gbit/sec and the time to transmit a 128-bit packet between 2 nodes in the same row or column is then 20 nanosec, plus propagation delay. A. Estimated Speedup, egligible Propagation Delays The total communication time in the 2D mesh, allowing an optimistic /2 steps for the bit-reversal permutation, is then (5/2 steps) (50 nsec/step) = 8 µsec (2) The total communication time in the binary hypercube is then (2 log steps) (130 nanosec/step) = 3.12 µ sec (3) The total communication time in the 2D hypermesh is then (log +3 steps) (20 nanosec/step) = 0.3 µ sec (4) Therefore, the 2D hypermesh is faster than the 2D mesh by a factor of 26.6, and the 2D hypermesh is faster than the binary hypercube by a factor of (If the bit-reversal is not needed, as in many applications, the figures become 26.6 and 6.5 respectively.) A similar comparison was performed in [13] for the Bitonic sort executing on the 2D mesh, the 2D hypermesh and the binary hypercube. In [13] it was concluded that the hypermesh is faster than the 2D mesh and the binary hypercube by factors of 12.3 and 6.47 respectively. B. Estimated Speedup, Including Propagation Delay Finally, we may wish to add a 20 nanosec propagation delay to the hypermesh and hypercube, which would model the propagation of a signal over about 20 feet of transmission line. In this case, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of 13.3 and 6 respectively. ote that the 2D hypermesh is still 25 transfers between nearest neighbors. V. BISECTIO BADWIDTH The bisection bandwidth of a network is defined as the bandwidth that crosses an imaginary bisector which subdivides a network into two halves of equal size. Further insight into why the hypermesh performance exceeds that of the 2D mesh and hypercube can be found by considering the bisection bandwidth of each network. Each network has the same aggregate bandwidth, and one fundamental difference between them is how this bandwidth is distributed over the nodes. The hypermesh has a much larger bisection bandwidth compared to the 2D mesh or hypercube, regardless of how one bisects the network. This increase in bisection bandwidth appears as significantly decreased communication time, especially for ASCED and DESCED algorithms, where every Butterfly permutation causes transfers over a network bisector. 
V. BISECTION BANDWIDTH

The bisection bandwidth of a network is defined as the bandwidth that crosses an imaginary bisector which subdivides the network into two halves of equal size. Further insight into why the hypermesh performance exceeds that of the 2D mesh and hypercube can be found by considering the bisection bandwidth of each network. Each network has the same aggregate bandwidth, and one fundamental difference between them is how this bandwidth is distributed over the nodes. The hypermesh has a much larger bisection bandwidth than the 2D mesh or hypercube, regardless of how one bisects the network. This increase in bisection bandwidth appears as significantly decreased communication time, especially for ASCEND and DESCEND algorithms, where every Butterfly permutation causes transfers over a network bisector.

The bisection bandwidth of the 2D mesh is √N · (KL/5). The bisection bandwidth of the binary hypercube is (N/2) · (KL/log N). The bisection bandwidth of any one hypermesh net is √N · (KL/2), and the bisection bandwidth of the entire 2D hypermesh is N·KL/2. (This figure is intuitively obvious; each crossbar has its full bandwidth crossing the bisector, and there are N/2 such crossbars.) Clearly, the 2D hypermesh has a bisection bandwidth that is larger than that of the 2D mesh and the binary hypercube by factors of O(√N) and O(log N) respectively. Therefore every data permutation which results in data transfers over any bisector will run significantly faster on the hypermesh.
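The three bisection bandwidths can be compared directly; the sketch below (illustrative Python, under the degree-K, bandwidth-L crossbar assumptions above) evaluates them for the 4K-PE systems of section 4:

import math

def bisection_bw(network, N, K=64, L=200e6):
    # Bisection bandwidth in bit/sec for an N-PE network built from
    # N degree-K crossbars with L bit/sec per IO pin.
    if network == "2D mesh":
        return math.isqrt(N) * (K * L / 5)       # sqrt(N) links of K*L/5 each
    if network == "hypercube":
        return (N / 2) * (K * L / math.log2(N))  # N/2 links of ~K*L/log N each
    if network == "2D hypermesh":
        return N * K * L / 2                     # N/2 straddling crossbars at full K*L
    raise ValueError(network)

N = 4096
for net in ("2D mesh", "hypercube", "2D hypermesh"):
    print(net, bisection_bw(net, N) / 1e9, "Gbit/sec")
# Ratios: hypermesh/mesh = (5/2)*sqrt(N) = 160; hypermesh/hypercube = log N = 12.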

VI. CONCLUSIONS

A parallel FFT algorithm for the recently proposed hypermesh [12] was illustrated; the algorithm requires log N - 3 fewer data transfer steps than the corresponding FFT algorithm for the binary hypercube, since the bit-reversal permutation can be implemented in at most 3 parallel steps on a hypermesh. The temporal complexity of the FFT algorithm, when executing on 2D meshes, hypercubes and 2D hypermeshes, was then derived. It was shown that for practical network sizes, the 2D hypermesh is faster than the 2D mesh and the binary hypercube by factors of O(√N/log N) and O(log N) respectively. Assuming a 4K-PE parallel processor built with existing technology, in practice the 2D hypermesh should be faster than the 2D mesh and the binary hypercube by factors of 27 and 10 respectively, when propagation delays are negligible. When propagation delays were modelled, the 2D hypermesh should be faster than the 2D mesh and the hypercube by factors of roughly 13 and 6 respectively.

A simple explanation for this increase in performance is based on the concept of bisection bandwidth. While all networks being compared have equivalent aggregate bandwidth, this bandwidth is distributed over the nodes in different ways, depending on the network topology. The 2D hypermesh has a bisection bandwidth that is larger than that of the 2D mesh and the binary hypercube by factors of √N and log N respectively. This increase in bisection bandwidth translates to significantly decreased delays, especially for the ASCEND and DESCEND permutations, each of which causes data transfers over some network bisector.

TABLE 1A: HARDWARE COMPLEXITY BEFORE NORMALIZATION FOR EQUIVALENT COST.
(Each network has N PEs. Hypermesh degree = log N.)

network       | # crossbars  | degree | diameter
2D Mesh       | N            | 4      | 2√N
2D Hypermesh  | N            | 2√N    | 2
hypercube     | N            | log N  | log N
hypermesh     | N/loglog N   | log N  | log N/loglog N

TABLE 1B: COMPARISON AFTER NORMALIZATION.

network       | link-bw   | diameter D | D/BW
2D Mesh       | KL/4      | 2√N        | O(√N/KL)
2D Hypermesh  | KL/2      | 2          | O(1/KL)
hypercube     | KL/log N  | log N      | O(log²N/KL)

TABLE 2A: COMPARISON OF THE FFT ON VARIOUS NETWORKS.

network       | # bit-reversal steps | # d.t. steps total
2D Mesh       | ≥ √N/2               | ≥ 5√N/2
Hypercube     | ≥ log N              | ≥ 2 log N
2D hypermesh  | ≤ 3                  | ≤ log N + 3

TABLE 2B: FFT EXECUTION TIME AFTER NORMALIZATION. (T_comm denotes total communication time.)

network       | # data transfer steps | O(T_comm)
2D Mesh       | O(√N)                 | O(√N/KL)
Hypercube     | O(log N)              | O(log²N/KL)
2D hypermesh  | O(log N)              | O(log N/KL)

Fig. 1. A 2D hypermesh. Bold lines are hypergraph nets.
Fig. 2. A PE-node in a 2D hypermesh-based SIMD machine.
Fig. 3. Data-flow graph of the Cooley-Tukey FFT.

REFERENCES

[1] S. Abraham and K. Padmanabhan, "Constraint-Based Evaluation of Multicomputer Networks," Int. Conf. Parallel Processing, 1991.
[2] L.N. Bhuyan and D.P. Agrawal, "Generalized Hypercube and Hyperbus Structures for a Computer Network," IEEE Trans. Comput., Vol. C-33, No. 4, 1984.
[3] C. Fang and T.H. Szymanski, "An Analysis of Deflection Routing in Multidimensional Regular Mesh Networks," IEEE Infocom 91, April 1991.
[4] W.J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Trans. Comput., June 1990.
[5] C.E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Comput., Vol. C-34, No. 10, Oct. 1985.
[6] M.C. Pease, "The Indirect Binary n-cube Microprocessor Array," IEEE Trans. Comput., Vol. C-26, May 1977.
[7] F.P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile Network for Parallel Computation," CACM, May 1981.
[8] I.D. Scherson, "Orthogonal Graphs for the Construction of a Class of Interconnection Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 2, No. 1, Jan. 1991.
[9] H.J. Siegel, Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies, 2nd Edition, McGraw-Hill, 1990.
[10] H.J. Siegel, "A Model of SIMD Machines and a Comparison of Various Interconnection Networks," IEEE Trans. Comput., Vol. C-28, No. 12, Dec. 1979.
[11] H.S. Stone, High Performance Computer Architecture, 2nd Edition, Addison-Wesley, 1990.
[12] T.H. Szymanski, "A Fiber-Optic Hypermesh for SIMD/MIMD Machines," IEEE Supercomputing-90, November 1990.
[13] T.H. Szymanski, "O(log N/loglog N) Randomized Routing on Degree-log N Hypermeshes," IEEE Int. Conf. Parallel Processing, August 1991.
[14] N. Tanabe, T. Suzuoka, S. Nakamura, Y. Kawakura and S. Oyanagi, "Base-m n-cube: High Performance Interconnection Networks for Highly Parallel Computer Prodigy," IEEE Int. Conf. Parallel Processing, August 1991.
[15] L.G. Valiant and G.J. Brebner, "Universal Schemes for Parallel Communications," Proc. 13th Annual ACM Symp. on Theory of Computing, 1981.
