Parallel Simulation and Communication Performance Evaluation of a Multistage BBN Butterfly Interconnection Network for High-Performance Computer Clusters

PLAMENKA BOROVSKA, DESISLAVA IVANOVA, PAVEL TSVETANSKI
Computer Systems Department, Technical University of Sofia, 8 Kliment Ohridski Boul., 1000 Sofia, BULGARIA
pborovska@tu-sofia.bg, d_ivanova@tu-sofia.bg, pavel_tsvetanski@tu-sofia.bg
http://cs-tusofia.eu/

Abstract: The communication performance of multistage interconnection networks is a crucial factor influencing the parallel performance of high-performance computer clusters. In this paper we propose a methodology for parallelizing a sequential OMNeT++ simulation model and design a parallel model of a multistage BBN Butterfly interconnection network topology to meet the demands for efficient and fast communication in high-performance computer systems. The parallel communication performance is evaluated on the basis of parallel simulation models built with the OMNeT++ framework (over MPI) and run on the IBM HS22 Blade Center at the High-Performance and GRID Computing Laboratory, Computer Systems Department, Technical University of Sofia. The parallel simulation results are analyzed.

Key-Words: High-Speed Interconnection Networks, BBN Network Architecture, OMNET++, Null Message Protocol, Parallel Simulations, Communication Performance Evaluation, Performance Analysis

1 Introduction
Interconnection network architecture designs are driven by next-generation high-performance computer clusters and supercomputer technology. The path to next-generation Tier-0 computer systems increasingly depends on building computer clusters with hundreds and thousands of processors. The interconnection topology of a parallel computer system is a critical factor in determining its performance [1-4]. Interconnection network designs vary with respect to the communication parameters throughput, latency and cost.
Communication network performance determines computer cluster performance for many applications. Therefore, the choice of network architecture has a significant impact on computer performance and affects the usability of a parallel computer cluster. Interconnection networks are composed of a set of shared switch nodes and links, and the network topology refers to the arrangement of these nodes and links. Selecting the network topology is the first and most important step in designing a network, because the flow control and routing algorithm depend heavily on the topology design. The goal of this paper is to propose a methodology for parallelizing an OMNeT++ sequential model and to evaluate the communication performance of a multistage BBN network design on the basis of a program implementation on the IBM HS22 Blade Center, located at the High-Performance and GRID Computing Laboratory, Technical University of Sofia. The communication performance of the BBN multistage topology is evaluated by means of network simulations using OMNeT++.

2 OMNET++ Platform and Parallel Simulations
OMNeT++ is essentially a set of software tools and libraries that supports the development of simulation models. Most often OMNeT++ is used to develop models of computer networks and protocols. OMNeT++ represents a simulation environment, including specific libraries (a simulation framework and library). It is built up of individual components called modules. Its main purpose is building network simulations of ad-hoc networks, wireless networks, communication networks and others. OMNeT++ includes an Eclipse-based graphical development environment (IDE) and additional tools that facilitate the work of the developers [5]. OMNeT++ also provides support for parallel simulation execution: very large simulations may benefit from the parallel distributed simulation (PDES) feature, either by achieving speedup or by distributing memory requirements [8].
ISBN: 978-1-61804-130-2 237

2.1 Null Message Protocol
OMNeT++ provides a Null Message protocol, which implements the conservative Null Message synchronization algorithm in a class called cNullMessageProtocol. The implementation of the Null Message Protocol in OMNeT++ is based on the terminology defined in [8, 9]. Let LPp be the logical processes that a given parallel simulation model is composed of, where p is in the range [0, P-1] and P is the number of logical processes. Let r be a moment in the physical time of a given simulation execution. With respect to LPp and r, several quantities can be defined:

Earliest Input Time (EIT): EITp(r) = the lower bound on the timestamp (in units of simulation time) of any message that the logical process LPp can receive in the physical time interval (r, ∞);

Earliest Output Time (EOT): EOTp(r) = the lower bound on the timestamp (in units of simulation time) of any message that the logical process LPp can send in the physical time interval (r, ∞);

Earliest Conditional Output Time (ECOT): ECOTp(r) = the lower bound on the timestamp (in units of simulation time) of any message that the logical process LPp can send in the physical time interval (r, ∞), under the assumption that LPp receives no messages in that interval;

Lookahead: lap(r) = the lower bound on the time after which LPp may send a message to another logical process.

The most common method used to advance EIT (i.e. to synchronize) is the use of Null messages, via the Null Message algorithm.
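The EIT bookkeeping defined above can be sketched in a few lines of C++ (an illustration only, not the actual cNullMessageProtocol implementation; the struct and member names are hypothetical):

```cpp
#include <limits>
#include <map>

// Minimal sketch of Null Message EIT bookkeeping: each LP stores the
// most recent EOT announced by every LP in its source set; its EIT is
// the minimum of those values, and any event with a timestamp below
// the EIT is safe to process.
struct LogicalProcess {
    std::map<int, double> lastEOT;  // source LP id -> last announced EOT

    // Record a Null message (an EOT announcement) from a source LP.
    void onNullMessage(int sourceLp, double eot) {
        lastEOT[sourceLp] = eot;
    }

    // EIT = minimum over the source set of the last received EOT values.
    double eit() const {
        double m = std::numeric_limits<double>::infinity();
        for (const auto& kv : lastEOT)
            if (kv.second < m) m = kv.second;
        return m;
    }

    // Events with timestamp below the EIT cannot be invalidated by any
    // future incoming message, so they may be processed.
    bool safeToProcess(double eventTime) const { return eventTime < eit(); }
};
```
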
For EITp to be increased it is sufficient that the respective LPp sends a Null message to every other LP in its destination set (a vector array of the logical processes that LPp can send messages to) on every change of its EOT. Every logical process calculates its own EIT as the minimum of the most recent EOT values received from its source set (a vector array of the logical processes that LPp can receive messages from).

3 BBN Butterfly OMNeT++ Simulation Model and Result Analysis

3.1 Sequential model
A network simulation model for sequential execution is implemented in OMNeT++ with the BBN Butterfly network topology. The routing algorithm used is destination-tag routing (DTR). DTR determines the port to which a switch has to forward a received packet using only the destination address. This algorithm is typical for omega, butterfly and other multistage networks. It is highly dependent on the network topology in which it operates, since it requires the nodes to be addressed in a specific way. The address of a host is divided into n equal parts (n being the number of stages), each of which corresponds to one stage and has to be wide enough to represent a switch output port number in binary. If the address is too short, it is padded with zeros in its most significant part. Three traffic patterns are simulated: Uniform, Bit reversal and Transpose. In the uniform traffic pattern, the destination of a packet sent from a given node is chosen randomly; the probability of forwarding the packet to each node (excluding the sender itself) is equal. In the bit-reversal traffic pattern, the destination depends on the sender's own address: each node sends only to the address that is the bit reversal of the sender's address. In the matrix transpose pattern, each node sends messages only to the destination whose address is the sender's own address with its upper and lower halves swapped.
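As an illustration, the bit-reversal and transpose destination addresses for a 64-host network (6-bit addresses) can be computed as follows (a minimal sketch with hypothetical helper names, not the model's own code):

```cpp
// Destination computation for the two permutation traffic patterns,
// assuming 64 hosts, i.e. 6-bit host addresses.
const int ADDR_BITS = 6;

// Bit-reversal: the destination is the source address with its bits
// reversed (bit 0 becomes bit 5, bit 1 becomes bit 4, and so on).
int bitReversalDest(int src) {
    int dest = 0;
    for (int i = 0; i < ADDR_BITS; ++i)
        if (src & (1 << i)) dest |= 1 << (ADDR_BITS - 1 - i);
    return dest;
}

// Matrix transpose: swap the upper and lower 3-bit halves of the address.
int transposeDest(int src) {
    const int HALF = ADDR_BITS / 2;        // 3 bits per half
    int low  = src & ((1 << HALF) - 1);    // lower half
    int high = src >> HALF;                // upper half
    return (low << HALF) | high;
}
```

Both patterns are involutions: applying the mapping twice yields the original address, so every node has exactly one fixed destination and one fixed source.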
Fig.1: BBN OMNeT++ Sequential Model

The simulation is executed for three different packet sizes: 32, 64 and 128 flits. Flit size is 16 bits. Ten values of offered traffic (as a percentage of capacity) are simulated, from 10% to 100% in 10% increments. Every host sends 1000 packets. The topology itself is a 4-ary (3+1)-fly, consisting of 4 stages, one of which is an extra stage [1]. The extra stage helps increase the performance of the interconnection network under traffic patterns that cause competition for a given channel, by providing 4 different paths from every sending host (source) to every receiving host (destination). The topology consists of 128 nodes, 64 of which are radix-4 crossbar switches and 64 of which are terminals (hosts). The network as a whole acts as a switch that connects all inputs to all outputs and topologically consists of a number of overlapping trees [3]. Nodes in the simulated network are connected by b = 1 Gbps unidirectional communication channels with a delay of 3.3 ns. The radix-4 switches are buffered and the flow control is credit-based. Credit-based flow control allows switches to prevent the rejection of incoming flits due to a full buffer, thus optimizing the performance of the network. Switches inform every node connected to one of their input ports about the availability of buffer space for the respective port by sending credits. Every credit sent informs the node connected to a given input port of the switch that 1 flit of buffer space is available for that port. The feedback that node A receives from switch B, in the form of credits received via the control communication channel, prevents A from sending to B more flits than B's buffer can accept for the respective port. Sending credits from a switch to the nodes connected to its input ports is implemented using control communication channels out_credit[0..4], matching the data channels in[0..4].

3.2 Parallel model
Parallelization of the BBN Butterfly topology model is the process of transforming the model from a sequential-execution implementation into a parallel-execution implementation, in which a simulation can be run on many nodes of a given computer cluster by means of message passing and node-synchronization algorithms [10]. Four IBM Blade Center nodes are used for running the simulations. All nodes have OMNeT++ version 4.2.1 (build id: 120118-94e2a29) and MPICH2 version 1.4.1 installed. The parallel programming interface MPI, as implemented by Argonne National Laboratory (MPICH2 for Windows), is used by OMNeT++ via the cMPICommunications class as the mechanism for passing messages between cluster nodes. The conservative synchronization protocol Null Message Protocol (the cNullMessageProtocol OMNeT++ class) is used for message synchronization. The sequential simulation model is used as the foundation on which the parallel model is built. That is achieved by creating several new components and modifying existing ones.

Fig.2: BBN OMNeT++ Parallel Model
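The credit-based flow control described in Section 3.1 can be sketched as follows (an illustration of the principle, not the model's code; the struct name is hypothetical):

```cpp
// Credit-based flow control between a sender and one switch input
// port: the sender may transmit a flit only while it holds at least
// one credit, and each credit returned by the switch means one flit
// of buffer space has been freed for that port.
struct CreditChannel {
    int credits;  // flits of buffer space the sender believes are free

    explicit CreditChannel(int bufferFlits) : credits(bufferFlits) {}

    // The sender transmits a flit only if a credit is available;
    // otherwise the flit would overflow the switch buffer.
    bool trySendFlit() {
        if (credits == 0) return false;
        --credits;
        return true;
    }

    // The switch returns one credit when a flit leaves its buffer.
    void receiveCredit() { ++credits; }
};
```

Because the sender decrements its credit count before each flit leaves, the switch buffer can never be overrun, which is exactly the rejection-avoidance property described above.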
Fig.3: BBN Butterfly OMNeT++ Sequence Charts and Event Logs

Parallel simulation requires that message processing in the network defined in the sequential model be divided into 4 partitions. To achieve loose coupling between the partitions, the network is bisected repeatedly until it is divided into 4 parts, Fig.2. In other words, one partition is a network description component (Network Description file) named SubNet. That component contains 16 elements of type host and 16 elements of type switch, i.e. one quarter of the total number of switches and hosts in the network (64 switches and 64 hosts). The interconnection of these elements is analogous to that between the first 16 switches and first 16 hosts of the network, with a few key differences:

a. Switch and host addresses in a given partition are a function of the partition's index, which guarantees uniqueness and full coverage of the address ranges of those components in the network (addresses from 0 to 63 are assigned both to switches and to hosts). Component addresses must be distinguished from component indexes: the former range from 0 to 63, while the latter range from 0 to 15 within every partition. Taking the partition's index into consideration, the switch address is calculated using the formula self_address = ((index%4)+(index-(index%4))*4+4*ownindex), where self_address is the switch address, index is the switch's index in the vector array of 16 switch components for the partition, and ownindex is the partition's index. The host address is calculated using the formula self_address = index + ownindex*numhosts, where self_address is the host's address, index is the host's index in the vector array of 16 host components for the partition, and ownindex is the partition's index.

b. Part of the interconnection between switches from stage 1 and stage 2 of the BBN Butterfly topology is made outside of the SubNet component, because some connections between stage 1 and stage 2 switches run between different partitions and must also be made in order to fully implement the network. Switch interconnections can be classified as internal or external depending on whether the connection is defined inside or outside the partition component SubNet. Internal to a partition is that part of the connections between stage 1 and stage 2 switches for which both the sending switch and the receiving switch are present as components, with regard to their addresses, in the same partition. External to a partition is that part of the connections between stage 1 and stage 2 switches that is not included in the internal part. Internal connections are always 8 in number, but the switch port indexes through which the connections are made are a function of the partition's index: the port index (0, 1, 2 or 3) is equal to the partition's index. In contrast, external connections for a given partition are made through the 3 of 4 ports of every stage 1 and stage 2 switch whose index does not equal the partition's index. The dependency of the external connections' port index on the partition index is modelled in the SubNet component using conditional links (A[switch output port index] --> B[gate output index] if partition's index = x); the link between A and B is only created if the condition is true. This enables the SubNet component to be used for all partitions, despite the dependency of port index values on partition index values. A gate denotes a link to a component that is external to the current component (in this case the partition component SubNet). A higher-level component is used to define connections between gates, thus implementing inter-partition connections and realizing the whole parallel model.
In MPI-based communication there are no global variables that could be used for communication between partitions. Thus, for each partition to read the total number of packets received in the network (reaching a predefined number of sent/received packets is the condition for successful simulation termination), a monitoring system must be implemented. That system is called PartSync. It monitors the total number of received packets in the network and terminates the simulation successfully when that number reaches a preconfigured value. PartSync uses message passing, respecting the MPI constraints, and implements a ring topology: a partition that has just received a new packet sends a message around the ring to all other partitions, thereby informing them of the increase in the total number of received packets. PartSync synchronization messages use a custom OMNeT++ .msg message type, PartSyncMsg, created with the opp_msgc utility. PartSync messages are analogous to Null messages, with the difference that the former carry data about the number of packets received in the sending partition, while the latter are used for time synchronization between partitions.

3.3 Result Analysis
The simulation experiments are targeted at evaluating the parallel performance of the OMNeT++ BBN Butterfly interconnection network model for high-performance computer clusters. The simulation models, implemented in C++, run on the following configuration: software platform: OMNeT++ running on Windows Server R2 64-bit OS using the GCC toolchain for OMNeT++; hardware platform: IBM Blade Center HS22 Blade Servers at the High-Performance and GRID Computing Laboratory, Computer Systems Department, Technical University of Sofia. The experiments cover three different traffic patterns: Uniform, Bit reversal and Transpose, and three different packet sizes: 32, 64 and 128 flits. Thus, nine configurations of parallel execution are run, each for five different offered traffic levels (20% to 100% in 20% increments), Fig.4. The experimental data indicate that the parallel execution speedup increases with increasing offered load (percent of capacity). Also, the speedup for the uniform traffic pattern is greater than for the other communication traffic patterns.
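The PartSync ring described above can be illustrated with a simplified in-process sketch (the real mechanism passes PartSyncMsg messages over MPI; the types and names here are hypothetical):

```cpp
#include <vector>

// Sketch of the PartSync idea: when one partition receives a packet,
// the increment is forwarded hop by hop around a ring of partitions
// until every partition has observed the new total.
struct Partition {
    int totalReceived = 0;  // this partition's view of the global count
};

// Propagate one packet-received increment around the ring, starting at
// the partition that observed the packet, so that all partitions agree
// on the new total.
void ringBroadcastIncrement(std::vector<Partition>& parts, int origin) {
    int n = static_cast<int>(parts.size());
    for (int hop = 0; hop < n; ++hop)
        parts[(origin + hop) % n].totalReceived += 1;
}
```

When every partition's counter reaches the preconfigured packet total, each one can locally decide to terminate the simulation with success.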
The experimental results show a maximum speedup of 2.76 for the parallel discrete event simulation of the BBN Butterfly interconnection network, obtained at 100% offered traffic with the uniform traffic pattern.

Fig.4: Speedup Results of Parallel Execution

4 Conclusion
In this paper we have presented an evaluation of the parallel performance of a BBN Butterfly interconnection network. The parallel performance is evaluated on the basis of parallel simulation models run on an IBM HS22 Blade Center for case studies of several popular communication patterns: Uniform, Bit reversal and Transpose, and for three different packet sizes: 32, 64 and 128 flits. The paper described an approach for designing parallel models in OMNeT++. The proposed BBN Butterfly interconnection network simulation model was designed to work in a parallel discrete event simulation (PDES) environment. Any network composed in this way can be simulated in a completely parallel manner, exploiting the available computational resources in order to simulate more complex network designs connecting a large number of nodes.
This approach can be used as a methodology for developing more complex designs. The empirical simulation data confirm the features of OMNeT++ parallel performance described in [5].

ACKNOWLEDGEMENT
The results reported in this paper are part of research project DRNF 02/9-2009, supported by the National Science Fund, Bulgarian Ministry of Education and Science.

References:
[1] Dally, W. J., Towles, B.: Principles and Practices of Interconnection Networks, Morgan Kaufmann, ISBN-13: 978-0122007514, 2004.
[2] Milano, J., Mullen-Schultz, G. L., Lakner, G.: Blue Gene: Hardware Overview and Planning, IBM Redbook.
[3] Borovska, P.: Computer Systems, Ciela, Sofia, Bulgaria, ISBN 954-649-633-2 (in Bulgarian), 2009.
[4] Duato, J., Yalamanchili, S., Ni, L.: Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers, ISBN 1-55860-852-4, 2002.
[5] OMNET++ Discrete Event Simulation Environment: http://omnetpp.org/doc
[6] Borovska, Pl., Nakov, O., Ivanova, D., Ivanov, K., Georgiev, G.: Communication Performance Evaluation and Analysis of a Mesh System Area Network for High Performance Computers, 12th WSEAS International Conference on Mathematical Methods, Computational Techniques and Intelligent Systems (MAMECTIS '10), Kantaoui, Sousse, Tunisia, May 3-6, 2010, ISBN: 978-960-474-188-5, pp. 217-222.
[7] Borovska, P., Ivanova, D., Ianakieva, V., Mitov, V., Alkaf, H.: Comparative Analysis of Communication Performance Evaluation for Butterfly Bidirectional Multistage Interconnection Network Topology with Routing Table and Destination Tag Routing, Sixth International Scientific Conference Computer Science 2011, Ohrid, Macedonia, September 1-3, 2011, pp. 29-34.
[8] Wu, D., Wu, E., Lai, J., Varga, A., Sekercioglu, Y. A., Egan, G. K.: Implementing MPI-Based Portable Parallel Discrete Event Simulation Support in the OMNeT++ Framework, Proceedings of the 14th European Simulation Symposium, A. Verbraeck, W. Krug, eds., SCS Europe BVBA, 2002.
[9] Bagrodia, R. L., Takai, M., Jha, V.: Performance Evaluation of Conservative Algorithms in Parallel Simulation Languages, IEEE Transactions on Parallel and Distributed Systems, pp. 395-411, Apr. 2000.
[10] Wu, D., Wu, E., Lai, J., Varga, A., Sekercioglu, Y. A., Egan, G. K.: Implementing MPI-Based Portable Parallel Discrete Event Simulation Support in the OMNeT++ Framework, Proceedings of the 14th European Simulation Symposium, A. Verbraeck, W. Krug, eds., SCS Europe BVBA, 2002.