Comparing the Parix and PVM parallel programming environments

A.G. Hoekstra, P.M.A. Sloot, and L.O. Hertzberger

Parallel Scientific Computing & Simulation Group, Computer Systems Department, Faculty of Mathematics and Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, the Netherlands, telephone: +31 20 2763, email: alfons@fwi.uva.nl, http://www.fwi.uva.nl/fwi/research/vg/pwrs/

This article has been submitted to the HPCN Europe 1995 conference in Milano, Italy.

Summary

Genericity of parallel programming environments, which enables the development of portable parallel programs, is expected to result in performance penalties. Furthermore, programmability and tool support are important issues when a choice between programming environments has to be made. We propose a methodology to compare native and generic parallel programming environments, taking into account such competing issues as portability and performance. As a case study, this paper compares the Parix and PVM parallel programming environments on a 512 node Parsytec GCel. Furthermore, we apply our methodology to compare Parix and PVM on a new architecture, a 32 node Parsytec PowerXplorer, which is based on the PowerPC chip. In our approach we start from a representative application and isolate the basic environment-dependent building blocks. These building blocks, which depend on the floating point performance and the communication capabilities of the environments, are analysed independently. We have measured point to point communication times, global communication times, and floating point performance. All information is combined into a time complexity analysis, allowing a comparison of the environments on different degrees of functionality.

1. Introduction

Real success of Massively Parallel Processing critically depends on the programmability of parallel computers and on the portability of parallel programs. We are led to believe that parallel computing has come of age. Although it is safe to say that parallel hardware has reached a convincing stage of maturity, both the programmability of the parallel hardware and the portability of parallel programs still pose serious problems to developers of parallel applications. In this paper we address the question of how to compare (native and generic) parallel programming environments, taking into account issues such as performance, portability, and availability of tools. We propose a methodology and apply it to a case study of native and generic environments on a transputer platform and on a PowerPC platform. Many research groups have compared parallel programming environments [e.g. 1, 2, 3, 4, 5]. The majority of such comparisons, however,

concentrate on clusters of workstations, and have not analysed the behaviour of programming environments such as PVM on true massively parallel machines. The goal of this work is to propose a strategy to compare parallel programming environments, and to apply it to native and generic programming environments running on a large massively parallel system. We compare a native parallel programming environment, Parix, with a generic environment, PVM, by examining the behaviour of a representative parallel application implemented in both environments. These experiments are executed on a 512 node Parsytec GCel. As a case study we have implemented an application from physics, Elastic Light Scattering simulations using the Coupled Dipole method [6, 7, 8], on the Parsytec GCel. We analyse the behaviour of the parallel Coupled Dipole method in both environments by examining basic and global communication routines, floating point performance, and the actual execution times of the parallel program as a function of problem size and number of processors. We investigate whether the basic measurements can predict the runtime of the application, and whether such basic measurements can be used as a heuristic to assess the merits of a programming environment. In this way we can judge the trade-off that exists between native environments, which usually offer better performance at the price of extensive programming effort, and generic environments, which allow more portable programs to be developed. Finally, we apply our methodology to assess the merits of Parix and PVM on a very recent PowerPC based architecture, the Parsytec PowerXplorer.

2. Results

2.1 Methodology

In order to compare the environments we have measured floating point performance, basic communication routines, a global communication routine (a vector gather operation), and finally the execution time of the parallel Coupled Dipole method on the Parsytec GCel. Subsequently we have compared PVM and Parix on the PowerXplorer system by measuring floating point performance and the relevant communication routines. Here we briefly present the results of the communication measurements; the complete set of experiments will be provided in the full paper.

2.2 Results for the Parsytec GCel

We assume that point to point communication can be described by a linear two parameter model. The point to point communication time t_pp is

t_pp = τ_setup + n · τ_send,   [1]

with n the number of bytes sent, τ_setup a setup time to initialise and start the communication, and τ_send the time needed to transfer one byte. Here we have neglected effects due to buffering.
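To make model [1] concrete: the two parameters can be estimated by timing ping-pong exchanges for a range of message sizes and fitting a straight line through the results. The sketch below shows one way to do this in C with PVM; it is a minimal illustration, not the paper's measurement code, and it assumes a hypothetical companion task "echo" that bounces every message back.

/* Estimate tau_setup and tau_send of model [1] from timed ping-pong
 * exchanges. Sketch only: error handling is omitted and "echo" is a
 * hypothetical companion program that echoes each message to its sender. */
#include <stdio.h>
#include <sys/time.h>
#include "pvm3.h"

#define TAG   42
#define REPS  100

/* Average one-way time in microseconds for an n-byte message to `peer`. */
static double pingpong(int peer, char *buf, int n)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int r = 0; r < REPS; r++) {
        pvm_initsend(PvmDataRaw);   /* raw encoding, no XDR conversion */
        pvm_pkbyte(buf, n, 1);
        pvm_send(peer, TAG);        /* ping */
        pvm_recv(peer, TAG);        /* pong */
        pvm_upkbyte(buf, n, 1);
    }
    gettimeofday(&t1, NULL);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    return us / (2.0 * REPS);       /* round trip -> one way */
}

/* Fit t = a + b*n by least squares: a estimates tau_setup, b tau_send. */
static void fit(const double *n, const double *t, int m, double *a, double *b)
{
    double sn = 0, st = 0, snn = 0, snt = 0;
    for (int i = 0; i < m; i++) {
        sn += n[i]; st += t[i]; snn += n[i] * n[i]; snt += n[i] * t[i];
    }
    *b = (m * snt - sn * st) / (m * snn - sn * sn);
    *a = (st - *b * sn) / m;
}

int main(void)
{
    static char buf[1 << 14];
    double n[14], t[14];
    int peer, m = 0;

    if (pvm_spawn("echo", NULL, PvmTaskDefault, "", 1, &peer) != 1)
        return 1;
    for (int size = 1; size <= (1 << 13); size <<= 1, m++) {
        n[m] = size;
        t[m] = pingpong(peer, buf, size);
    }
    double tau_setup, tau_send;
    fit(n, t, m, &tau_setup, &tau_send);
    printf("tau_setup = %.1f us, tau_send = %.3f us/byte\n",
           tau_setup, tau_send);
    pvm_exit();
    return 0;
}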

Parix's virtual links allow point to point communication between any pair of nodes; the Parix kernel routes the messages through the hardware. However, in the Coupled Dipole implementation the only point to point communication consists of synchronous send/receive pairs between adjacent processors. Therefore we have only measured synchronous point to point communication between neighbouring processors. The results are τ_send = 0.92 ± 0.02 µs/byte and τ_setup = 67 ± 2 µs.

The analysis of point to point communication in PVM is complicated by the fact that the placement of the parallel processes is not known, and therefore we do not know whether a communication takes place between physically neighbouring processors. In order to get a picture of the basic communication performance of PVM we measured all possible point to point communications between node zero and all other nodes in a PVM partition, and analysed the distribution of the resulting setup and send times. We fitted all experiments and generated histograms of τ_send and τ_setup, which are drawn in figures 1 and 2.

Figure 1: The histogram of the occurrences of send times (in µs/byte) in a 256 node partition of the GCel; the step size is 0.01 µs/byte.

Figure 2: The histogram of the occurrences of setup times (in µs) in a 256 node partition of the GCel; the step size is 100 µs.

We observe two fast connections with a τ_send of approximately 0.9 µs/byte and 1.0 µs/byte. These numbers are comparable to the τ_send of Parix. The remaining PVM connections cluster around 1.6 and 1.8 µs/byte. PVM shows a broad distribution of τ_setup around approximately 3000 µs. Here we clearly observe a difference with Parix: PVM has much higher setup times for point to point communication.

We have measured the time for the vector gather operation as a function of the number of processors p and of the total vector length n in bytes. For each value of p we measured the communication time as a function of n, and fitted the measurements to t_vg = τ_a · n + τ_b. We observe two fundamental differences between PVM and Parix. First, in PVM τ_a depends on p, increasing sublinearly with p. Furthermore, the initialisation term τ_b of PVM increases faster than linearly with p, which for large p results in (unacceptably) high initialisation times. This result is a very good example of the trade-off between programmability and performance. The PVM vector gather operation consists of just one call to a global communication routine plus a trivial buffer compaction; the price to be paid, however, is bad scaling behaviour compared to Parix.
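For reference, the one-call formulation of such a vector gather could look like the sketch below, which assumes the group routines of PVM 3.3. The group name, message tag, and use of doubles are illustrative assumptions, not details taken from the paper.

/* The vector gather expressed as a single call to PVM 3.3's group gather
 * routine. Sketch only: GROUP, TAG, and the element type are assumptions. */
#include "pvm3.h"

#define GROUP "cd"
#define TAG   7

/* Every task in GROUP contributes nlocal doubles from `local`; the task
 * with group instance number 0 receives them, ordered by instance number,
 * into `result` (which must hold ntasks * nlocal doubles on that task). */
void gather_vector(double *result, double *local, int nlocal)
{
    pvm_gather(result, local, nlocal, PVM_DOUBLE, TAG, GROUP, 0);
}

The convenience of the single call is exactly what the trade-off above describes: the programming effort is minimal, but the scaling cost is hidden inside the library's implementation of pvm_gather.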

2.3 Comparison of Parix and PVM on the Parsytec PowerXplorer

Figures 3 and 4 show the histograms of τ_send and τ_setup (as introduced in section 2.2) for the 32 node partition of the PowerXplorer system, for Parix and PowerPVM.

Figure 3: The histograms of the occurrences of send times (in µs/byte) in the 32 node partition of the PowerXplorer, for Parix and PVM; the step size is 0.01 µs/byte.

Figure 4: The histograms of the occurrences of setup times (in µs) in the 32 node partition of the PowerXplorer, for Parix and PVM; the step size is 10 µs.

3. Concluding Remarks

We have compared the Parix and PVM parallel programming environments on a Parsytec GCel by a detailed analysis of the performance of a particular application. Our approach, in which we start from an application, isolate the basic environment-dependent building blocks, analyse them independently, and combine all information in a time complexity analysis, allowed us to compare the environments on all relevant degrees of functionality. Together with the demands for portability of the code and for development time (i.e. programmability), an overall judgement of the environments can be made. Our approach has similarities with the much more elaborate Parkbench benchmark suite [9]. However, by placing the basic measurements in the context of a specific application, as we do, the interpretation of the measurements can be more accurate. In general we observe that increasing portability and programmability, in

going from Parix to PVM, results in a degradation of especially the communication capabilities. The global communication routine of PVM that we tested has very bad scaling behaviour, which clearly shows up in the larger partitions and results in poor scalability of the PVM implementation. Fortunately, in real production situations with large problem sizes, the application has an efficiency very close to one, and the run time is mainly determined by the floating point performance.

Application of our heuristic to compare PVM with Parix on a Parsytec PowerXplorer shows that point to point communication in PVM is as good as in Parix. However, global communication in PVM is still far from optimal; future implementations of PVM should strive for more optimised global routines.

Global communication routines strongly increase the programmability of parallel computers. However, in our case studies we have observed poor scaling behaviour of these routines compared to handcoded Parix routines that exploit data locality and point to point communication on virtual topologies. The Message Passing Interface (MPI) standard has now been defined, and the first implementations of MPI have been reported [10]. Our group is working on an MPI implementation on top of Parix [11]. We will test this MPI implementation using the methods described in this paper. Furthermore, the global communication routines of MPI, such as MPI_BCAST, will be implemented such that the reported scaling behaviour of comparable Express and PVM routines can be improved.
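As an illustration of the kind of implementation that restores scalability, the sketch below builds a textbook binomial-tree broadcast from MPI point-to-point calls, so that the data reaches all p processes in O(log p) message steps instead of the O(p) steps of a sequential root-sends-to-everyone scheme. It is a generic example, not the implementation being developed in [11].

/* A textbook binomial-tree broadcast built from MPI point-to-point calls.
 * Generic illustration of O(log p) broadcast scaling; not the MPI
 * implementation of [11]. */
#include <mpi.h>

void tree_bcast(void *buf, int count, MPI_Datatype type, int root,
                MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;   /* rank relative to the root */

    /* Phase 1: every non-root process receives the data from its parent
     * in the binomial tree; the root skips this phase. */
    while (mask < size) {
        if (rel & mask) {
            MPI_Recv(buf, count, type, (rank - mask + size) % size, 0,
                     comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Phase 2: forward the data to the children, halving the stride. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size)
            MPI_Send(buf, count, type, (rank + mask) % size, 0, comm);
        mask >>= 1;
    }
}

The number of communication steps here grows only logarithmically with the partition size, which is precisely the scaling difference observed between the handcoded Parix routines and the PVM vector gather above.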

4. References

1. M. Weber, Workstation clusters: one way to parallel computing, International J. Mod. Phys. C 4, 1307-1314 (1993).
2. A.R. Larrabee, The P4 Parallel Programming System, the Linda Environment, and Some Experiences with Parallel Computing, Scientific Programming 2, 23-35 (1993).
3. A. Matrone, P. Schiano, and V. Puotti, LINDA and PVM: A comparison between two environments for parallel programming, Parallel Computing 19, 949-957 (1993).
4. T.G. Mattson, C.C. Douglas, and M.H. Schultz, A comparison of CPS, Linda, P4, POSYBL, PVM, and TCGMSG: two node communication times, Tech. Rept. YALEU/DCS/TR-97, Dept. of Computer Science, Yale University, 1993.
5. C.C. Douglas, T.G. Mattson, and M.H. Schultz, Parallel programming systems for workstation clusters, Tech. Rept. YALEU/DCS/TR-97, Dept. of Computer Science, Yale University, 1993.
6. A.G. Hoekstra, Computer Simulations of Elastic Light Scattering: Implementation and Applications, Ph.D. Thesis, University of Amsterdam, 1994.
7. A.G. Hoekstra and P.M.A. Sloot, New Computational Techniques to Simulate Light Scattering from Arbitrary Particles, in Proceedings of the 3rd International Congress on Optical Particle Sizing '93 - Yokohama, pp. 167-172, M. Maeda (Ed.), 1993.
8. A.G. Hoekstra and P.M.A. Sloot, Simulating Elastic Light Scattering Using High Performance Computing Techniques, in European Simulation Symposium 1993, pp. 62-70, A. Verbraeck and E.J.H. Kerckhoffs (Eds.), Society for Computer Simulation International, 1993.
9. R. Hockney and M. Berry, Public International Benchmarks for Parallel Computers: Parkbench Committee Report 1, Scientific Programming 3, 101-146 (1994).
10. D.W. Walker, The design of a standard message passing interface for distributed memory concurrent computers, Parallel Computing 20, 657-673 (1994).
11. P.M.A. Sloot, private communication; for more information send email to peterslo@fwi.uva.nl.