ANALYSIS OF CLUSTER INTERCONNECTION NETWORK TOPOLOGIES

Sergio N. Zapata, David H. Williams and Patricia A. Nava
Department of Electrical and Computer Engineering
The University of Texas at El Paso
El Paso, TX 79968-0523

Abstract

Cluster computing provides an economical alternative for high-performance computing that in the past could only be provided by expensive parallel supercomputers. Clusters are built from standard components and connected by various interconnection topologies, which provide different approaches to communication between the processing nodes within the cluster. A study has been performed to evaluate the computing and network speeds of a cluster of nine computers, consisting of a front-end and eight compute nodes, interconnected in star, channel bonding, and flat neighborhood network topologies. For this task two applications were developed: one performs a data transfer test between two nodes and measures the round-trip time of the transfer; the second performs distributed matrix multiplication using all of the compute nodes. In addition, the High Performance Linpack (HPL) benchmark was utilized. These applications were run with the cluster network configured in each of the three aforementioned topologies. Results show that 2-way channel bonding is the best alternative, providing a peak performance of 12 GFLOPS. The Flat Neighborhood Network proved to be effective, but at a higher cost, since at least one extra switch and one extra NIC per processing node were required.

Keywords: Cluster Computing, Interconnection Networks

1.0 Introduction

Cluster computing* has become an economical solution to satisfy the need for supercomputing power. This approach is achieved by clustering two or more computers built from mass-market commodity off-the-shelf (M²COTS) components. Commodity components make clustering extremely cost-effective, with an excellent price-performance curve [1]. In addition, with the availability of free software, operating systems, and tools, this design method has become even more appealing.

(*This work was supported in part by an equipment grant from Cisco Systems, and by the National Science Foundation under grant #EIA-0325024.)

The objective of this work was to build a small Beowulf cluster and to investigate its performance under three different interconnection network topologies [2]. To test the overall performance of the cluster, a well-known benchmark was utilized: High Performance Linpack (HPL). HPL performs a series of matrix operations that evaluate the computing power of the compute nodes, and it distributes the processing in different ways that exercise the network topology. In addition, two other applications were developed to test network performance: a data transfer rate test between two CPUs that measures raw communication speed, and a matrix multiplication distributed across multiple CPUs.

2.0 Test Methods

2.1 Hardware Configuration

Each node in the system was built with a Gigabyte GA-7VKMP motherboard, one AMD Athlon XP 2100+ CPU, 512 MB of memory, CD-ROM and floppy drives, and a 60 GB disk drive. On the compute nodes the disk holds the system software; on the front-end node it also stores user files. Two 10/100 D-Link network cards were added to each node, in addition to the 10/100 NIC on the motherboard. Each computer therefore included a total of three 100 Mb/s Ethernet NICs, providing multiple communication channels for the different network topologies. Three topologies were tested: star, channel bonding, and flat neighborhood network configurations.

The star topology is the most common configuration employed in a local area network. In its most basic form, a network switch sits at the center and all computers are directly linked to it, forming a shape resembling a star, as shown in Figure 1.

Figure 1. Star Topology

Channel bonding consists of logically striping (enslaving) N network interfaces, where N ≥ 2, making them work as one. This can theoretically increase the bandwidth by a factor of N. Besides increasing bandwidth, channel bonding can also be used for high availability through redundancy, or to service multiple network segments. In bandwidth-oriented channel bonding, data packets are sent to the output devices (NICs) in round-robin order. The trunk shares the Medium Access Control (MAC) address of the first enslaved device among all the NICs in the trunk; thus, regardless of which interface received a request, the response is always sent from the next available NIC. Each NIC must be connected to a different switch, as shown in Figure 2. The switches perform layer-two routing based on the MAC addresses of the packets, thereby preventing collisions between traffic from different nodes.

Figure 2. Channel Bonding Topology
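
The round-robin behavior just described can be pictured with a short sketch. The following C snippet is a conceptual model of balance-rr style packet distribution over the enslaved NICs, not the actual Linux bonding driver; the interface names, slave count, and packet loop are assumptions made for illustration.

    /* Conceptual sketch of round-robin (balance-rr) NIC selection. */
    #include <stdio.h>

    #define NUM_SLAVES 2                         /* 2-way bonding; use 3 for 3-way */

    static const char *slave_nic[NUM_SLAVES] = { "eth0", "eth1" };  /* assumed names */
    static unsigned int next_slave = 0;          /* rotates through the enslaved NICs */

    /* Pick the NIC on which the next outgoing packet would be queued. */
    static const char *select_output_nic(void)
    {
        const char *nic = slave_nic[next_slave];
        next_slave = (next_slave + 1) % NUM_SLAVES;
        return nic;
    }

    int main(void)
    {
        for (int pkt = 0; pkt < 6; pkt++)
            printf("packet %d -> %s\n", pkt, select_output_nic());
        return 0;
    }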

The Flat Neighborhood Network (FNN) was developed at the University of Kentucky [3], and its main characteristic is low latency. In typical scenarios, when interconnecting large clusters (more than 1,000 nodes), switch fabrics are employed, which, in addition to being very expensive, add latency to the network. An FNN achieves low latency by having only one switch connecting any two nodes in the cluster [3]. Figure 3 shows the FNN configuration employed for this study.

Figure 3. Flat Neighborhood Network

As can be seen from Figure 3, the interconnection network consists of three different subnets that allow each machine to communicate with every other machine through only one intermediate switch.
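
The single-switch property that defines an FNN can be checked mechanically. The C sketch below verifies that every pair of nodes shares at least one switch, given a node-to-switch wiring table; the 8-node, 3-switch table shown is a made-up example, not the exact wiring of the cluster described above.

    /* Sketch of the FNN design rule: every pair of nodes must share a switch. */
    #include <stdio.h>

    #define NODES    8
    #define SWITCHES 3

    /* connected[n][s] is 1 if node n has a NIC plugged into switch s (example data). */
    static const int connected[NODES][SWITCHES] = {
        {1,1,0}, {1,1,0}, {1,0,1}, {1,0,1},
        {0,1,1}, {0,1,1}, {1,1,0}, {1,0,1},
    };

    /* Returns 1 if nodes a and b are reachable through a single common switch. */
    static int one_switch_apart(int a, int b)
    {
        for (int s = 0; s < SWITCHES; s++)
            if (connected[a][s] && connected[b][s])
                return 1;
        return 0;
    }

    int main(void)
    {
        int ok = 1;
        for (int a = 0; a < NODES; a++)
            for (int b = a + 1; b < NODES; b++)
                if (!one_switch_apart(a, b)) {
                    printf("nodes %d and %d have no common switch\n", a, b);
                    ok = 0;
                }
        printf(ok ? "flat neighborhood property holds\n"
                  : "not a flat neighborhood network\n");
        return 0;
    }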

2.2 Software Configuration

In operating a cluster, it is desirable to have software that maintains the homogeneity of the system software on all of the nodes and provides tools to manage and monitor the operation of the system. For our system, the Rocks Cluster Distribution developed by the National Partnership for Advanced Computational Infrastructure [4] was chosen for this task, since it includes all of the necessary administration tools, it is open-source software, and it is based on the well-known Red Hat Linux distribution. We currently employ version 3.6 of Rocks.

2.2.1 Performance Evaluation Software

Three test suites were employed to measure the communication and computational performance of the system: a data transfer program that sends large numbers of packets between two nodes and measures the round-trip time of the transfer; a distributed matrix multiplication program for overall performance evaluation; and the well-known High Performance Linpack (HPL) benchmark for overall system speed.

The data transfer rate test was devised to better understand the behavior of the topology being tested and is based on the client-server model. The client creates a short-integer array with random values and transmits the array to the server using sockets. The server receives the data, stores it in an array, and echoes the same data back to the client. The type and amount of data sent between the client and the server were determined by the size of an Ethernet frame. Excluding protocol overhead, the theoretical data payload per frame is 1,460 bytes [1], but with the aid of the Ethereal sniffer [5] the amount of data that could actually be sent was determined to be 1,448 bytes, or 724 short elements. The effectiveness of the network is determined by measuring the data round-trip time. The amount of data increases in intervals of 724 elements, forcing each transfer to occupy an integer number of Ethernet frames; no partial packets are created. The iterations were run from 1 to 10,000 times in increments of 1, in multiples of 724 short integers, so the minimum data sent is 1,448 bytes and the maximum is approximately 14 MB.
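
A minimal client-side sketch of such a transfer is shown below, assuming a TCP echo server; the server address, port, and number of frames are placeholders, and the original test program is not reproduced here. It sends a buffer that is a whole number of 1,448-byte payloads and times the round trip with gettimeofday(), as the DTRT does.

    /* Sketch of a DTRT-style client: send full-frame payloads, time the echo. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define FRAME_SHORTS 724            /* 1,448 bytes of payload per Ethernet frame */

    int main(void)
    {
        int frames = 10;                                  /* assumed transfer size */
        size_t count = (size_t)frames * FRAME_SHORTS;
        size_t nbytes = count * sizeof(short);
        short *buf = malloc(nbytes);
        for (size_t i = 0; i < count; i++)
            buf[i] = (short)rand();                       /* random test payload */

        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5000);                       /* assumed port */
        inet_pton(AF_INET, "10.0.0.2", &srv.sin_addr);    /* assumed server address */
        if (connect(sock, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect");
            return 1;
        }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (size_t sent = 0; sent < nbytes; ) {          /* send the whole buffer */
            ssize_t w = write(sock, (char *)buf + sent, nbytes - sent);
            if (w <= 0) return 1;
            sent += (size_t)w;
        }
        for (size_t got = 0; got < nbytes; ) {            /* read the echoed copy back */
            ssize_t r = read(sock, (char *)buf + got, nbytes - got);
            if (r <= 0) return 1;
            got += (size_t)r;
        }
        gettimeofday(&t1, NULL);

        double rtt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%zu bytes echoed, round trip %.6f s\n", nbytes, rtt);
        close(sock);
        free(buf);
        return 0;
    }
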
The distributed matrix multiplication program was used to measure the interplay between processing power and network speed in determining cluster performance, and it consists of a client and a server program. The client reads two large random-valued matrices, up to 48 million double-precision elements in size, into memory. The client then forks up to 8 child processes, one per server. Each child uses sockets to send one matrix and 1/N of the other matrix to its corresponding server, where N is the number of servers. Each server reads the data from the socket, performs the matrix multiplication, and returns the results. On the client side, each child, after reading the partial results from its server, writes them to a shared memory region accessible by the client. When all children have returned, the client prints the resulting matrix. This scenario was built to mimic a more realistic data transfer pattern, since the packets produced are no longer forced to fit perfectly in an Ethernet frame, and the computational performance of the cluster is included.
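
A skeletal version of this fork-per-server division of work is sketched below. The send_slice() function stands in for the socket code, the shared-memory collection of results is omitted, and the row count is illustrative rather than the 48-million-element matrices used in the tests.

    /* Sketch of the DMMT client's work division: one child process per server. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N_SERVERS 8                 /* one child process per compute server */
    #define ROWS      4800              /* illustrative row count, split across servers */

    /* Placeholder for "open a socket to server k and ship its share of the work". */
    static void send_slice(int server, int first_row, int nrows)
    {
        printf("child for server %d: rows %d..%d (%d rows)\n",
               server, first_row, first_row + nrows - 1, nrows);
    }

    int main(void)
    {
        int rows_per_server = ROWS / N_SERVERS;

        for (int k = 0; k < N_SERVERS; k++) {
            pid_t pid = fork();
            if (pid == 0) {                         /* child k handles server k */
                send_slice(k, k * rows_per_server, rows_per_server);
                _exit(0);                           /* child exits once its slice is done */
            }
        }
        for (int k = 0; k < N_SERVERS; k++)         /* parent waits for all children */
            wait(NULL);
        return 0;
    }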

The HPL program [6], the standard benchmark for measuring cluster speed [7], solves a dense linear system in double-precision arithmetic. To accommodate the architecture of the system being benchmarked, HPL provides various configuration options. These include where the output of the benchmark is to be written; the problem sizes, which depend directly on the amount of memory; the block sizes into which the operation is divided; the process grids, which are multiples of the number of processors available in the system; and the method of factorization (left-looking, Crout, or right-looking variant). Among the most important factors for this work are the different virtual panel-broadcast topologies for data distribution and processor communication, which directly affect cluster performance depending on the interconnection network topology.

3.0 Results and Conclusions

3.1 Data Transfer Rate Test

The Data Transfer Rate Test (DTRT) was executed between pairs of nodes of the cluster, with transfer data sizes from 1,448 to 14,480,000 bytes, in 10,000 iterations. All data sets are multiples of 1,448 bytes, so that each Ethernet frame is completely filled and no partial frames are created. The DTRT times were measured using the gettimeofday() system call. The DTRT was performed only on the star and 2-way channel bonding networks, since the results for the star topology can be extrapolated to the FNN case. After the results for 2-way channel bonding were obtained, further testing was carried out with an additional switch, allowing 3-way channel bonding. The star topology and the Flat Neighborhood Network share the same model at the pair-wise level: both layouts have only a single direct link, through one switch, between any two nodes. Figure 4 displays the results of the DTRT.

Figure 4. Data Transfer Rate Test Results

The results show that the cost-effectiveness of the channel-bonded networks becomes more apparent as the size of the data increases. Small data sets show no improvement of the channel-bonded networks over the traditional star topology or the FNN; however, as the size increases, the 2-way channel-bonded network shows an improvement of 1.9, close to the theoretical limit of 2.0 that would be expected. Adding another interface to the channel bonding scheme yields an improvement of 2.9 over the star topology, against a theoretical improvement of 3.0. Up to this point, channel bonding appears to be a cost-effective alternative.
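
One way to read these round-trip times is sketched below: the payload crosses the network twice per round trip, so effective throughput is 2 x bytes / RTT, and the improvement factor of a bonded configuration is the ratio of the star round-trip time to the bonded round-trip time for the same payload. The sample times in the snippet are made-up values chosen only to show the arithmetic; they are not measurements from Figure 4.

    /* Helper for interpreting DTRT round-trip times (illustrative values only). */
    #include <stdio.h>

    static double throughput_mbps(double bytes, double rtt_seconds)
    {
        return (2.0 * bytes * 8.0) / (rtt_seconds * 1e6);   /* megabits per second */
    }

    int main(void)
    {
        double bytes     = 14480000.0;   /* largest DTRT transfer size */
        double rtt_star  = 2.60;         /* assumed round-trip times, in seconds */
        double rtt_bond2 = 1.37;

        printf("star  : %.1f Mb/s\n", throughput_mbps(bytes, rtt_star));
        printf("2-way : %.1f Mb/s\n", throughput_mbps(bytes, rtt_bond2));
        printf("improvement factor: %.2f\n", rtt_star / rtt_bond2);
        return 0;
    }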

3.2 Distributed Matrix Multiplication

The Distributed Matrix Multiplication Test (DMMT) gives an overall ideal performance of the network, because the data sent between the machines is forced to avoid partial packet creation and the results do not include computing factors. The DMMT multiplies two very large matrices whose size depends on the number of servers: the matrix dimensions are Nx6,000,000 and 6,000,000xN, where N = {2, 4, 6, 8} is the number of servers. Figure 5 illustrates the timing results of the DMMT for each of the interconnection network topologies.

Figure 5. Timing Results for the DMMT

Figures 4 and 5 illustrate that 2-way channel bonding gives the best overall performance, since it scales linearly and its performance increase over the star topology is close to 2. On the other hand, 3-way channel bonding gives the largest raw performance increase but does not scale linearly. The maximum performance increase for 3-way channel bonding, a factor of 2.8, was reached with 4 active servers. For the most intensive cases, using 6 and 8 servers, the performance increase is only around 2.4 times over the star topology. These results suggest that adding an extra network interface to a system performing actual computations may not be cost-effective.

Several factors can cause this non-ideal performance of 3-way channel bonding. One reason may be the non-symmetric networking approach used and the inability of the driver to properly handle an odd number of network interface cards. Another may be the large number of partial packets created, whose collision overhead directly affects the performance of the network. Lastly, the network queues may be serviced out of order and therefore be used ineffectively by the bonding driver.

The FNN was tested in two different ways: one forcing all possible data distribution and computation through one switch, and an alternative approach using two switches for the communication. The performance of the first approach is close to that of a single-channel star configuration, since only one switch is available for data distribution. The alternative approach shows performance close to 2-way channel bonding, because data are distributed using all of the bandwidth that two switches can provide.

3.3 High Performance Linpack

HPL was executed with various problem sizes, process grids, block sizes, and virtual broadcast topologies. The problem sizes utilized were 2,000; 5,000; 10,000; 15,000; and 20,000 elements, except in one special case in which a problem size of 18,000 elements was employed. Since the process grids depend on the number of processors employed, and up to 9 single-processor nodes (1 front-end and 8 compute nodes) are available, process grids of 1x8, 2x4, and 3x3 (the last using the front-end node as well) were utilized. Data distribution is governed by the virtual broadcast topologies. There are six such topologies, which make full use of the network by distributing the data in a ring and by employing different pairwise distribution schemes [6]. Although all of the broadcast topologies were tested, the results presented below are only those from the best broadcast topology for this application.

Figure 6 shows the performance of the system when configured with each of the three network topologies.

Figure 6. HPL Performance in FLOPS, 2x4 Process Grids and 210 Block Size

The figure illustrates that 2-way channel bonding is the best, by a factor of around 2 GFLOPS over the other two topologies. (3-way channel bonding was not included in this test because, as previously stated, it did not scale well.) From the figure it can also be concluded that a basic FNN is not cost-effective compared to channel bonding, since it provides only minor improvements over the single-channel, or star, topology. Setting up the FNN is also much more expensive than the star topology, or even 2-way channel bonding, since it requires three switches and special routing tables for each node.

The performance of 8 processors was also compared to that of 9 processors, using the front-end as a compute node. The maximum output in both cases was very close to one another, and for the other two networks, when all nine nodes (3x3) were computing, the performance decreased compared with the previous case of 8 (2x4) processors. The most plausible explanation is that the network reached its limit; in other words, network contention dominates.
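
For reference, the GFLOPS figures reported by HPL follow directly from the operation count of the solver: HPL credits 2/3*N^3 + 2*N^2 floating-point operations for an N x N problem and divides by the wall-clock solve time. The short sketch below shows the arithmetic; the 445-second runtime is an assumed value chosen only to illustrate how a 20,000-element problem maps to roughly 12 GFLOPS, not a measurement from the paper.

    /* HPL reporting arithmetic: GFLOPS = (2/3*N^3 + 2*N^2) / (time * 1e9). */
    #include <stdio.h>

    static double hpl_gflops(double n, double seconds)
    {
        double flop = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flop / (seconds * 1e9);
    }

    int main(void)
    {
        double n = 20000.0;          /* largest problem size used in the tests */
        double seconds = 445.0;      /* assumed runtime, for illustration only */
        printf("N = %.0f, t = %.0f s -> %.2f GFLOPS\n", n, seconds, hpl_gflops(n, seconds));
        return 0;
    }
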
3.4 Conclusions

Three test protocols were run on each of the three topologies: the Data Transfer Rate Test (DTRT), Distributed Matrix Multiplication (DMMT), and High Performance Linpack (HPL). The DTRT showed that, in general, channel bonding is the best option to implement, giving improvement factors close to the number of NICs attached to each node.

On the other hand, the DMMT, being closer to a real-life application, showed that true cost-effectiveness is reached only with 2-way channel bonding. HPL tested overall system performance, peaking at 12 GFLOPS with 2-way channel bonding. All in all, 2-way channel bonding provided the fastest, least expensive, and most scalable communications.

4.0 References

[1] Sterling, Thomas. Beowulf Cluster Computing with Linux. The MIT Press, 2002.
[2] Zapata, Sergio N. Analysis of Cluster Interconnection Network Topologies. M.S. Thesis, The University of Texas at El Paso, July 2004.
[3] Dietz, Hank. FNN: Flat Neighborhood Network. University of Kentucky, 2004. http://www.aggregate.org/fnn/
[4] NPACI. Rocks Cluster Distribution Users Guide, 2003. http://rocks.npaci.edu/rocksdocumentation/3.0.0/
[5] Combs, Gerald. Ethereal: The World's Most Popular Network Protocol Analyzer. 2004. http://www.ethereal.com
[6] Petitet, A., Whaley, R. C., Dongarra, J., and Cleary, A. HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. Innovative Computing Laboratory, University of Tennessee, Computer Science Department, 2004. http://www.netlib.org/benchmark/hpl/index.html
[7] Top 500 Supercomputer Sites. University of Mannheim, University of Tennessee, National Energy Research Scientific Computing Center, 2004.