Latest Trends in Applied Informatics and Computing


Parallel Simulation and Communication Performance Evaluation of a Multistage BBN Butterfly Interconnection Network for High-Performance Computer Clusters

PLAMENKA BOROVSKA, DESISLAVA IVANOVA, PAVEL TSVETANSKI
Computer Systems Department, Technical University of Sofia
8 Kliment Ohridski Boul., 1000 Sofia, BULGARIA
pborovska@tu-sofia.bg, d_ivanova@tu-sofia.bg, pavel_tsvetanski@tu-sofia.bg
http://cs-tusofia.eu/

Abstract: The communication performance of multistage interconnection networks is a crucial factor influencing the parallel performance of high-performance computer clusters. In this paper we propose a methodology for parallelizing a sequential OMNET++ model, and we design a multistage BBN interconnection network topology in a parallel manner to meet the demands of efficient and fast communication in high-performance computer systems. The communication performance is evaluated on the basis of parallel simulation models built in the OMNET++ framework (over MPI) and run on an IBM HS22 Blade Center at the High-Performance and GRID Computing Laboratory, Computer Systems Department, Technical University of Sofia. An analysis of the parallel simulation results is presented.

Key-Words: High-Speed Interconnection Networks, BBN Network Architecture, OMNET++, Null Message Protocol, Parallel Simulations, Communication Performance Evaluation, Performance Analysis

1 Introduction

Interconnection network architecture designs are driven by next-generation high-performance computer clusters and supercomputer technology. The path to next-generation Tier-0 computer systems increasingly depends on designing computer clusters with hundreds and thousands of processors. The interconnection topology of a parallel computer system is a critical factor in determining its performance [1-4]. Interconnection network designs vary with respect to the key communication parameters: throughput, latency and cost.
Communication network performance determines computer cluster performance for many applications. The choice of network architecture therefore has a significant impact on computer performance and affects the usability of a parallel computer cluster. Interconnection networks are composed of a set of shared switch nodes and links, and the network topology refers to the arrangement of these nodes and links. Selecting the network topology is the first and most important step in designing a network, because the flow-control and routing algorithms depend heavily on the topology. The goal of this paper is to propose a methodology for parallelizing an OMNET++ sequential model and to evaluate the communication performance of a multistage BBN network design on the basis of a program implementation on an IBM HS22 Blade Center, located at the High-Performance and GRID Computing Laboratory, Technical University of Sofia. The communication performance of the BBN multistage topology is evaluated by means of network simulations using OMNET++.

ISBN: 978-1-61804-130-2 237

2 OMNET++ Platform and Parallel Simulations

OMNeT++ is essentially a set of software tools and libraries that supports the development of simulation models. Most often OMNeT++ is used to develop models of computer networks and protocols. OMNeT++ represents a simulation environment, including specific libraries (simulation framework and library). It is built up of individual components called modules. Its main purpose is building network simulations of ad-hoc networks, wireless networks, communication networks and others. OMNeT++ includes an Eclipse-based graphical development environment (IDE) and some

additional tools to facilitate the work of the developers [5]. OMNeT++ also provides support for parallel simulation execution. Very large simulations may benefit from the parallel distributed simulation (PDES) feature, either by getting speedup or by distributing memory requirements [8].

2.1 Null Message Protocol

OMNeT++ provides a Null Message protocol, which implements the Null Message conservative synchronization algorithm in a class called cNullMessageProtocol. The implementation of the Null Message Protocol in OMNeT++ is based on the terminology defined in [8, 9]. Let LPp be the logical processes that a given parallel simulation model is composed of, where p is in the range [0, number of logical processes − 1]. Let r be a moment in the physical time of a given simulation execution. Taking LPp and r into consideration, several quantities can be identified:

Earliest Input Time (EIT): EITp(r) = the lower bound on the timestamp (measured in units of simulation time) of any message that the logical process LPp can receive in the physical time interval (r, ∞);

Earliest Output Time (EOT): EOTp(r) = the lower bound on the timestamp (measured in units of simulation time) of any message that the logical process LPp can send in the physical time interval (r, ∞);

Earliest Conditional Output Time (ECOT): ECOTp(r) = the lower bound on the timestamp (measured in units of simulation time) of any message that LPp can send in the physical time interval (r, ∞), under the assumption that LPp receives no messages in that interval;

Lookahead: lap(r) = the lower bound on the time after which LPp will send a message to another logical process.

The most common method used to advance EIT (i.e., to synchronize) is the use of Null messages, via the Null Message algorithm.
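The quantities defined above can be sketched in a few lines of C++. This is a minimal illustration of the bookkeeping only; the `LogicalProcess` type and its member names are ours, not part of the OMNeT++ API:

```cpp
#include <algorithm>
#include <limits>
#include <map>

// Simplified LP: caches the most recent EOT received from every LP in its
// source set and takes the minimum as its EIT, as described in the text.
struct LogicalProcess {
    std::map<int, double> latestEOT;  // source LP id -> last EOT received
    double lookahead = 0.0;
    double now = 0.0;                 // current simulation time

    // EOT announced to the destination set: no message with a timestamp
    // earlier than now + lookahead can be sent after this moment.
    double eot() const { return now + lookahead; }

    // EIT = minimum over the most recent EOT values from the source set.
    double eit() const {
        double m = std::numeric_limits<double>::infinity();
        for (const auto& kv : latestEOT) m = std::min(m, kv.second);
        return m;
    }

    // Called when a Null message carrying a new EOT arrives from LP `src`.
    void onNullMessage(int src, double eotValue) { latestEOT[src] = eotValue; }
};
```

The sketch makes the conservative guarantee visible: an LP may safely process any event with a timestamp below its current `eit()`.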
For EITp to be increased it is sufficient that the respective LPp sends a Null message to every other LP in its destination set (a vector array of the logical processes that LPp can send messages to) on every change in its EOT. Every logical process calculates its own EIT as the minimum of the most recent EOT values received from its source set (a vector array of the logical processes that LPp can receive messages from).

3 BBN Butterfly OMNeT++ Simulation Model and Result Analysis

3.1 Sequential model

A network simulation model for sequential execution is implemented in OMNeT++ with the BBN Butterfly network topology. The routing algorithm used is destination tag routing (DTR). DTR determines the port to which a switch has to forward a received packet using only the destination address. This algorithm is typical for omega, butterfly and other multistage networks. It is highly dependent on the network topology in which it operates, since the nodes must be addressed in a definite way. The address of a host is divided into n (the number of stages) equal parts, each corresponding to one stage and large enough to encode a switch output port index in binary. If the address is not long enough, it is padded with zeros in the most significant part.

Three traffic patterns are simulated: Uniform, Bit reversal and Transpose. Under the uniform traffic pattern, a packet from a given node is addressed randomly; the probability of sending the packet to each node (excluding the sender itself) is equal. Under the bit-reversal traffic pattern, the destination depends on the sender's own address: each node sends only to the address that is the bit reversal of the sender's address. Under the matrix-transpose pattern, each node sends messages only to the destination whose address is the sender's own address with the upper and lower halves swapped.
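For this paper's 64-host configuration (6-bit addresses, radix-4 switches, hence 2 address bits per stage), the routing and traffic-pattern rules above can be sketched as follows. The helper names and the most-significant-digit-first stage ordering are our illustrative assumptions:

```cpp
#include <cstdint>

constexpr int kAddrBits = 6;  // 64 hosts

// Destination-tag routing: the output port at a given stage is the base-4
// digit of the destination address for that stage (digits taken
// most-significant first, stage 0 being the first routing stage).
int dtrPort(int dest, int stage) {
    int shift = kAddrBits - 2 * (stage + 1);   // 2 bits per radix-4 digit
    return (dest >> shift) & 0x3;              // port index 0..3
}

// Bit-reversal traffic: destination is the sender's address with its
// 6 bits reversed.
int bitReversalDest(int src) {
    int dest = 0;
    for (int i = 0; i < kAddrBits; ++i)
        if (src & (1 << i)) dest |= 1 << (kAddrBits - 1 - i);
    return dest;
}

// Matrix-transpose traffic: destination is the sender's address with its
// upper and lower 3-bit halves swapped.
int transposeDest(int src) {
    int half = kAddrBits / 2;
    int lo = src & ((1 << half) - 1);
    int hi = src >> half;
    return (lo << half) | hi;
}
```

For example, destination 54 (binary 110110) is routed through ports 3, 1, 2 at the three routing stages, and both traffic generators are involutions: applying them twice returns the original address.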

Fig.1: BBN OMNET++ Sequential Model

Simulation is executed for three different values of packet size: 32 flits, 64 flits and 128 flits. Flit size is 16 bits. Ten values of offered traffic (as a percentage of capacity) are simulated, from 10% to 100% in 10% increments. Every host sends 1000 packets. The topology itself is a 4-ary (3+1)-fly, consisting of 4 stages, one of which is an extra stage [1]. The extra stage helps to increase the performance of the interconnection network under traffic patterns that cause competition for a given channel, by providing 4 different paths from every sending host (source) to every receiving host (destination). The topology consists of 128 nodes, 64 of which are radix-4 crossbar switches and 64 of which are terminals (hosts). The network as a whole acts as a switch that connects all inputs to all outputs and topologically consists of a number of overlapping trees [3]. Nodes in the simulated network are connected by b = 1 Gbps unidirectional communication channels with a delay of 3.3 ns. The radix-4 switches are buffered and the flow control is credit-based. Credit-based flow control allows switches to prevent the rejection of incoming flits due to a full buffer, thus optimizing the performance of the network. Switches inform every node connected to one of their input ports about the availability of buffer space for the respective port by sending credits. Every credit sent informs the node connected to a given input port of the switch that 1 flit of buffer space is available for the respective port. The feedback that node A receives from switch B, by means of the number of credits received via the control communication channel, prevents A from sending to B more flits than B's buffer could accept for the respective port. Sending credits from a switch to the nodes connected to its input ports is implemented using control communication channels out_credit[0..4], matching every input data channel in[0..4].

3.2 Parallel model

Parallelization of the BBN Butterfly topology model is the process of transforming that model from a sequential execution implementation to a parallel execution implementation, where a simulation can be run on many nodes of a given computer cluster by means of message-passing and node-synchronization algorithms [10].

Four IBM Blade Center nodes are used for running the simulations. All nodes have OMNeT++ Version 4.2.1, Build id: 120118-94e2a29 and MPICH2 Version 1.4.1 installed. The parallel programming interface MPI, as implemented by Argonne National Laboratory (MPICH2 for Windows), is used by OMNeT++ via the cMPICommunications class as the mechanism for passing messages between cluster nodes. The conservative synchronization protocol, Null Message Protocol (the cNullMessageProtocol OMNeT++ class), is used for message synchronization. The sequential simulation model is used as the foundation on which the parallel model is built. That is achieved by the creation of several new components and the modification of existing ones.

Fig.2: BBN OMNET++ Parallel Model
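Parallel execution of this kind is typically enabled through the omnetpp.ini configuration file. A minimal fragment might look like the following; the option names follow the OMNeT++ 4.x manual and should be checked against the installed version, and the partition-to-host mapping options are omitted here:

```ini
[General]
# run the model as a parallel distributed simulation
parallel-simulation = true
# pass messages between partitions over MPI
parsim-communications-class = "cMPICommunications"
# conservative synchronization via the Null Message algorithm
parsim-synchronization-class = "cNullMessageProtocol"
```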

Fig.3: BBN Butterfly OMNET++ Sequence Charts End Event Logs

Parallel simulation requires that message processing in the network defined in the sequential model be divided into 4 partitions. To achieve loose coupling between partitions, the network is bisected repeatedly until it is divided into 4 parts, Fig.2. Every partition has a component structure and component interconnection identical to that of the first 16 switches and first 16 hosts of the network, with a few differences. In other words, one partition is a network description component (Network Description File) named SubNet. That component contains 16 elements of type host and 16 elements of type switch, i.e. 1/4 of the total number of switches and hosts in the network (64 switches and 64 hosts). The interconnection of these elements is analogous to that between the first 16 switches and first 16 hosts of the network, with a few key differences:

a. Switch and host addresses in a given partition are a function of the partition's index, which guarantees uniqueness and full coverage of the address ranges of those components in the network (addresses from 0 to 63 are assigned to both switches and hosts). Component addresses must be distinguished from component indexes: the former range from 0 to 63, while the latter range from 0 to 15 within each partition. Taking the partition's index into consideration, the switch address is calculated using the formula: self_address = ((index%4) + (index - (index%4))*4 + 4*ownindex), where self_address is the switch address, index is the switch's index in the vector array of 16 switch components for the partition, and ownindex is the partition's index. The host address is calculated using the formula: self_address = index + ownindex*numhosts, where self_address is the host's address, index is the host's index in the vector array of 16 host components for the partition, and ownindex is the partition's index.

b. Part of the interconnection between switches from stage 1 and stage 2 of the BBN Butterfly topology is made outside of the SubNet component, because some connections between stage 1 and stage 2 switches run between different partitions, in order to fully implement the network. Internal and external switch interconnections may be distinguished based on whether the interconnection is defined inside or outside the partition component SubNet. Internal to a partition is that part of the connections between stage 1 and stage 2 switches for which both the sending switch and the receiving switch are present as components, with regard to their addresses, in the same partition. External to a partition is that part of the connections between stage 1 and stage 2 switches that is not included in the internal part. Internal connections are always 8 in number, but the switch port indexes through which the connections are made are a function of the partition's index: the port index (0, 1, 2 or 3) is equal to the partition's index. In contrast, external connections for a given partition are made through the 3 of the 4 ports of every stage 1 and stage 2 switch whose index does not equal the partition's index. The dependency of an external connection's port index on the partition index is modelled in the SubNet component using conditional links (A[switch output port index] --> B[gate output index] if partition's index = x). The link between A and B is implemented only if the condition is true. This enables the SubNet component to be used for all partitions, despite the dependency of port index values on partition index values. A gate denotes a link to a component that is external to the current component (in this case the partition SubNet). A higher-level component is used to define the connections between gates, thus implementing inter-partition connections and realizing the whole parallel model.
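The two addressing formulas quoted above can be checked directly. The following small self-check (our own illustration, not project code) confirms that each formula assigns every address from 0 to 63 exactly once across the 4 partitions of 16 components each:

```cpp
#include <set>

// Switch address per partition, exactly as given in the text.
int switchAddress(int index, int ownIndex) {
    return (index % 4) + (index - (index % 4)) * 4 + 4 * ownIndex;
}

// Host address per partition (numhosts = 16 components per partition).
int hostAddress(int index, int ownIndex) {
    const int numHosts = 16;
    return index + ownIndex * numHosts;
}

// Returns true if fn(index, ownIndex) yields each address 0..63 exactly once
// over 4 partitions x 16 component indexes.
template <typename F>
bool coversRange(F fn) {
    std::set<int> seen;
    for (int own = 0; own < 4; ++own)
        for (int idx = 0; idx < 16; ++idx)
            seen.insert(fn(idx, own));
    return seen.size() == 64 && *seen.begin() == 0 && *seen.rbegin() == 63;
}
```

For instance, switch index 15 in partition 3 receives address 3 + 12*4 + 4*3 = 63, and host index 15 in partition 3 receives 15 + 3*16 = 63, matching the stated 0..63 range.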
With MPI communication there are no global variables that can be used for communication between partitions. Thus, for each partition to be able to read the total number of packets received in the network (reaching a predefined number of sent/received packets is the condition for successful simulation termination), a monitoring system must be implemented. That system is called PartSync. It monitors the total number of received packets in the

network and terminates the simulation with success when that number reaches a preconfigured value. PartSync uses message passing, taking MPI constraints into account, and implements a ring topology: a partition that has just received a new packet sends a message to all other partitions, informing them of the increase in the total number of received packets. PartSync synchronization messages use a custom OMNeT++ .msg component, PartSyncMsg, created with the opp_msgc utility. PartSync messages are analogous to Null messages, with the difference that the former carry data about the number of packets received in the sending partition, while the latter are used for time synchronization between partitions.

3.3 Result Analysis

The simulation experiments are targeted at evaluating the parallel performance of the OMNET++ BBN Butterfly interconnection network model for high-performance computer clusters. The simulation models, implemented in C++, run on the following configuration: software platform: OMNeT++ running on Windows Server R2 64-bit OS using the GCC for OMNeT++ toolchain; hardware platform: IBM Blade Center HS22 Blade Servers at the High-Performance and GRID Computing Laboratory, Computer Systems Department, Technical University of Sofia. The experiments cover three different traffic patterns: Uniform, Bit reversal and Transpose, and three different values of packet size: 32 flits, 64 flits and 128 flits. Thus, nine configurations of parallel execution are run for five different offered traffic levels (20% to 100% in 20% increments), Fig.4. Experimental data indicate that the parallel execution speedup increases as the offered load (percent of capacity) increases. Also, the speedup for the uniform traffic pattern is greater than for the other communication traffic patterns.
The experiments yield a maximum speedup of 2.76 for parallel discrete event simulation of the BBN Butterfly interconnection network, obtained at 100% offered traffic with the uniform traffic pattern.

4 Conclusion

In this paper we have presented an evaluation of the parallel performance of a BBN Butterfly interconnection network model. The parallel performance is evaluated on the basis of parallel simulation models run on an IBM HS22 Blade Center, for case studies of several of the most popular communication patterns: Uniform, Bit reversal and Transpose, and for three different values of packet size: 32 flits, 64 flits and 128 flits.

Fig.4: Speedup Results of Parallel Execution

This paper described an approach for designing parallel models in OMNeT++. The proposed BBN Butterfly interconnection network simulation model was designed to work in a parallel discrete event simulation (PDES) environment. Any network composed in this way can be simulated in a completely parallel manner, exploiting the available computational resources in order to simulate more complex network designs connecting a large number of nodes.

This approach can be used as a methodology for developing more complex designs. The empirical simulation data confirm the features of OMNET++ parallel performance described in [5].

ACKNOWLEDGEMENT

The results reported in this paper are part of research project DRNF 02/9-2009, supported by the National Science Fund, Bulgarian Ministry of Education and Science.

References:

[1] Dally, W. J., Towles, B.: Principles and Practices of Interconnection Networks, Morgan Kaufmann, ISBN-13: 978-0122007514, 2004.

[2] Milano, J., Mullen-Schultz, G. L., Lakner, G.: Blue Gene: Hardware Overview and Planning, IBM Redbooks.

[3] Borovska, P.: Computer Systems. Sofia, Bulgaria: Ciela, ISBN 954-649-633-2 (in Bulgarian), 2009.

[4] Duato, J., Yalamanchili, S., Ni, L. M.: Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers, ISBN 1-55860-852-4, 2002.

[5] OMNET++ Discrete Event Simulation Environment: http://omnetpp.org/doc

[6] Borovska, Pl., Nakov, O., Ivanova, D., Ivanov, K., Georgiev, G.: Communication Performance Evaluation and Analysis of a Mesh System Area Network for High Performance Computers, 12th WSEAS International Conference on Mathematical Methods, Computational Techniques and Intelligent Systems (MAMECTIS '10), Kantaoui, Sousse, Tunisia, May 3-6, 2010, ISBN: 978-960-474-188-5, pp. 217-222.

[7] Borovska, P., Ivanova, D., Ianakieva, V., Mitov, V., Alkaf, H.: Comparative Analysis of Communication Performance Evaluation for Butterfly Bidirectional Multistage Interconnection Network Topology with Routing Table and Destination Tag Routing, Sixth International Scientific Conference Computer Science 2011, Ohrid, Macedonia, pp. 29-34, September 1-3, 2011.

[8] Wu, D., Wu, E., Lai, J., Varga, A., Sekercioglu, Y. A., Egan, G. K.: Implementing MPI Based Portable Parallel Discrete Event Simulation Support in the OMNeT++ Framework, Proceedings of the 14th European Simulation Symposium, A. Verbraeck, W. Krug, eds., SCS Europe BVBA, 2002.

[9] Bagrodia, R. L., Takai, M., Jha, V.: Performance Evaluation of Conservative Algorithms in Parallel Simulation Languages, IEEE Transactions on Parallel and Distributed Systems, pp. 395-411, April 2000.

[10] Wu, D., Wu, E., Lai, J., Varga, A., Sekercioglu, Y. A., Egan, G. K.: Implementing MPI Based Portable Parallel Discrete Event Simulation Support in the OMNeT++ Framework, Proceedings of the 14th European Simulation Symposium, A. Verbraeck, W. Krug, eds., SCS Europe BVBA, 2002.