Transactions on Information and Communications Technologies vol 9, 1995 WIT Press, ISSN

Finite difference and finite element analyses using a cluster of workstations

K.P. Wang, J.C. Bruch, Jr.
Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, California, USA

Abstract

Because of their high computing speed, cost effectiveness, and scalability, clusters of workstations are becoming one of the major platforms in the study of parallel computation. This paper presents studies of using a cluster of workstations for finite difference analysis and finite element analysis. A parallel algorithm that has proven simple to implement and efficient for both analyses is used to perform them on a cluster of workstations. A network of workstations serves as the hardware of the parallel system, and two popular parallel software packages, PVM (Parallel Virtual Machine) and P4, are used to handle the communications among the networked workstations. The Paragon and Meiko CS-2 computers are also used for comparison purposes. Furthermore, an approach to developing a portable parallel code is given.

1 Introduction

For the past few years, with advancing technology in the computer industry, workstations have been produced with high computing speed and at low cost. Because of their high computing speed, cost effectiveness, and scalability, clusters of workstations are becoming one of the major trends in the study of parallel computation. Studies of using a cluster of workstations for both finite difference analysis and finite element analysis are presented herein. Previous work [1]-[4] has shown that the SOR (Successive Over-Relaxation) iteration method for the finite element and finite difference methods can be fully parallelized by reordering the discretized equations. Speedups close to linear (the theoretical speedup), or better, have been obtained using the iPSC/2 Hypercube parallel computer.

P4 [5] and PVM [6] are message passing libraries for clusters of workstations and parallel computers. With P4 or PVM, a cluster of workstations can be used as if it were a single parallel computing resource. P4 was developed at Argonne National Laboratory and PVM was developed at Oak Ridge National Laboratory. The version of P4 used in this study is 1.4, while the version of PVM is 3.3.4. A cluster of 7 SGI Indy workstations running the IRIX 5.1 operating system was used for this study. Each workstation is equipped with an Ethernet card, and all workstations are networked with a central file server. The Ethernet transmission rate is slow when compared to the roughly 30 MB/sec of a parallel computer. However, the same type of network configuration is common in many institutions. The two parallel computers that will also be used are the Paragon and Meiko CS-2. Both computers are MIMD distributed memory multicomputers, and processors on the Paragon or Meiko can only communicate by message passing.

The programming model used is SPMD (Single Program Multiple Data): every processor is loaded with the same code, but each may execute a different branch of the code or operate on a different set of data. This is not the only way a parallel program can be written, but it is a widely used model.

This study shows an approach to developing a portable parallel code as well as the feasibility of using a cluster of workstations to perform parallel computation. The approach for developing portable codes using a cluster of workstations is presented first. The same codes are compiled and executed using P4 and PVM on a cluster of SGI workstations and on the Paragon and Meiko parallel computers. Speedups for all test cases are shown in order to discuss the feasibility of using clusters of workstations for parallel computation.

2 Implementation

In solving many engineering problems, both the finite difference method and the finite element method lead to a linear system

Ku = f,

where K is the coefficient matrix, f is the force vector, and u is the solution vector. In general, K is a banded matrix. Thus, it is possible to reorder the equations in the linear system in order to decouple the system of equations. The process of reordering equations is equivalent to decomposing the computational domain into subdomains and interfaces. The parallel SOR iterative algorithm presented in [1]-[4] uses this idea to transform the sequential SOR into a fully parallel SOR algorithm, in the sense that the computations in all the subdomains are performed in parallel and the computations on all the interfaces are also performed in parallel.
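To make the decoupling concrete, here is a minimal serial sketch, not the authors' code, of one sweep of the subdomain/interface SOR ordering for a 1D Poisson model problem. The two-subdomain split, the grid size, and the relaxation factor omega are assumptions made for this illustration; [1]-[4] describe the general reordering.

/* Sketch only: one sweep of subdomain/interface SOR for -u'' = f on a
   uniform grid with fixed ends.  u[0] and u[N-1] are boundary values. */
#include <stdio.h>

#define N   101        /* grid points (an assumed size for the example) */
#define MID (N / 2)    /* the single interface point between subdomains */

static double u[N], f[N];

static void sor_sweep(double omega, double h)
{
    int i;
    /* Phase 1: subdomain interiors.  The two loops update disjoint
       unknowns and read only the frozen interface value u[MID] from
       the other side, so each loop could run on its own processor. */
    for (i = 1; i < MID; i++)           /* subdomain 1 */
        u[i] += omega * (0.5 * (u[i-1] + u[i+1] + h * h * f[i]) - u[i]);
    for (i = MID + 1; i < N - 1; i++)   /* subdomain 2 */
        u[i] += omega * (0.5 * (u[i-1] + u[i+1] + h * h * f[i]) - u[i]);
    /* Phase 2: interface points, using the fresh subdomain values.
       With many subdomains, all interface updates are mutually
       independent and can proceed in parallel as well. */
    u[MID] += omega * (0.5 * (u[MID-1] + u[MID+1] + h * h * f[MID]) - u[MID]);
}

int main(void)
{
    double h = 1.0 / (N - 1);
    int k;
    for (k = 0; k < N; k++) { u[k] = 0.0; f[k] = 1.0; }   /* -u'' = 1 */
    for (k = 0; k < 500; k++)
        sor_sweep(1.8, h);
    printf("u at midpoint = %f (exact 0.125)\n", u[MID]);
    return 0;
}

The two-phase structure is what makes the algorithm friendly to message passing: a processor owning a subdomain needs the interface values only once per sweep.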

The model problem to be solved is the free surface seepage problem presented in [1]-[4]. Because the solution of the reformulated model problem is constrained to be greater than or equal to zero, no system of equations can be generated; only a pointwise iterative scheme can be formulated, as presented in [1]-[4].

The first step in implementing the parallel algorithm on all the parallel systems is to convert the parallel codes developed in previous studies to one of the parallel systems. The second step is to make the code portable. Since different message passing systems differ from one another, the easiest way to make a parallel code portable is to adopt a single message passing library and translate all other message passing libraries to it. The translation mechanism is unique for each system; however, once developed, all parallel codes developed later can use the same mechanism without modification.

The original implementation of the SOR parallel algorithm was on the Intel iPSC/2 Hypercube parallel computer. The programming model was a host-node model: the input and output were controlled by a host program, and the computation was handled by the node program. The message passing library on the iPSC/2 Hypercube is similar to the NX message passing library of the Paragon. The first step in this study is therefore to convert the parallel programs from the host-node programming model to the SPMD programming model used on the Paragon. This conversion can be achieved by assigning processor 0 as the host processor. An extra step is to convert the old iPSC/2 function calls to the NX functions.

To develop portable codes, it follows that if all message passing libraries can be translated to the Paragon NX message passing library, the same parallel code can be recompiled with the translation mechanism on different systems without modifying the code. In this study only a few translation mechanisms need to be developed:

1. begin parallel program;
2. end parallel program;
3. send message;
4. receive message;
5. global collective operation on arrays.

Each mechanism does not necessarily correspond to a single function call; it may represent a group of function calls. Since different message passing libraries have different ways to begin or end the parallel processes, and the NX library has no specific function calls for them, common function calls are needed on all systems to begin and end a parallel program. The first two mechanisms are needed for all message passing libraries. Mechanisms 3-5 are required for P4, PVM, and Meiko. A sketch of such a translation layer is given below.
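As an illustration of mechanisms 3 and 4 (a sketch under assumed conventions, not the authors' actual mechanism), the fragment below maps generic send and receive calls onto the Paragon NX library and onto PVM 3. The MSG_SEND/MSG_RECV macro names, the header names, and the tids[] task-id table are assumptions made for this example; csend/crecv and the pvm_* routines are real entry points of those libraries, but their exact signatures should be checked against the NX and PVM manuals, and a P4 branch would follow the same pattern.

/* Sketch of a translation layer for the "send message" and "receive
   message" mechanisms; macro names and tids[] are invented here. */
#ifdef USE_NX
#include <nx.h>                /* Paragon NX: csend(), crecv() */

#define MSG_SEND(tag, buf, len, dest) \
        csend((long)(tag), (char *)(buf), (long)(len), (long)(dest), 0L)
#define MSG_RECV(tag, buf, len) \
        crecv((long)(tag), (char *)(buf), (long)(len))

#else                          /* PVM 3 */
#include "pvm3.h"

extern int tids[];             /* task ids, filled in by the "begin
                                  parallel program" mechanism */

#define MSG_SEND(tag, buf, len, dest)                 \
        do {                                          \
            pvm_initsend(PvmDataDefault);             \
            pvm_pkbyte((char *)(buf), (int)(len), 1); \
            pvm_send(tids[dest], (int)(tag));         \
        } while (0)

#define MSG_RECV(tag, buf, len)                        \
        do {                                           \
            pvm_recv(-1, (int)(tag));                  \
            pvm_upkbyte((char *)(buf), (int)(len), 1); \
        } while (0)
#endif

In the same spirit, the host-node to SPMD conversion described above amounts to guarding the old host program's input and output with a test on the processor number, so that processor 0 performs the I/O that the iPSC/2 host program used to handle.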

Upon completing the translation mechanism on a system, a parallel code can be compiled with the translation mechanism, without changing any part of the code, and run on that system. Speedup results and their evaluation are presented in the next section.

When testing the parallel codes using P4 or PVM on a cluster of workstations, it is important to make sure that no other users are logged on the workstations to be used; if there are, the timing results will be affected by their computational load. Furthermore, it is also important to make sure that no other users are using computers on the same network, even if they are not using the workstations under test; if there are, the communication speed will be affected by those users. Therefore, the tests were performed late at night and during quarter breaks.

3 Results and Discussion

Figures 1, 2, and 3 show the finite difference speedups for cases with (101,101), (141,141), and (201,201) mesh points. As shown in these three figures, the speedups from the Paragon and Meiko parallel computers improve as the number of mesh points is increased, sometimes exceeding linear speedup because of the way the boundary data is input. Similar trends can be observed for P4 and PVM; their speedups, however, are only a little over 2, even when more than 2 processors are used. Speedups from P4 are better than those from PVM for all three cases.

[Figure 1: Finite difference speedup versus number of processors for (101, 101) mesh points; curves for linear speedup, P4, PVM, Paragon, and Meiko.]

[Figure 2: Finite difference speedup versus number of processors for (141, 141) mesh points; curves for linear speedup, P4, PVM, Paragon, and Meiko.]

[Figure 3: Finite difference speedup versus number of processors for (201, 201) mesh points; curves for linear speedup, P4, PVM, Paragon, and Meiko.]

One explanation is that the communication speed on the cluster of workstations was an important factor in slowing down the parallel execution. The data transmission rate of the Ethernet board is low, and the communication management is not as efficient as on the parallel computers. When the ratio of computation time to communication time is small, the speedup will be small even when the number of processors is increased.
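This behaviour can be summarized with a simple fixed-overhead model (an illustration, not a model fitted by the authors): if a sweep costs T_comp/p in computation on p processors plus a per-sweep communication cost T_comm that does not shrink with p, the speedup is approximately

    S(p) = T_comp / (T_comp/p + T_comm) = p / (1 + p * T_comm/T_comp),

which saturates near T_comp/T_comm as p grows. A computation-to-communication ratio of about 2 would therefore cap the speedup near 2 no matter how many workstations are added, which is consistent with the plateau observed for P4 and PVM.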

These results, despite the modest speedups, show that a parallel program can be developed on a cluster of workstations first and then moved to a parallel computer for production runs. Since the code is portable with the developed mechanisms, no modification of the code is necessary; the goal of portability is achieved.

[Figure 4: Finite element speedup versus number of processors for 4257 degrees of freedom; curves for linear speedup, P4, PVM, Paragon, and Meiko.]

[Figure 5: Finite element speedup versus number of processors for 8353 degrees of freedom; curves for linear speedup, P4, PVM, Paragon, and Meiko.]

Figures 4 and 5 show the finite element speedups for cases with 4257 and 8353 degrees of freedom, respectively. Speedups from the Paragon and Meiko parallel computers are close to linear for 4257 degrees of freedom and better than linear for 8353 degrees of freedom. For P4 and PVM, speedups better than those of the finite difference analysis are obtained; this is due to the ratio of computation time to communication time being larger for the finite element analysis. When fewer than 5 workstations are used, speedups close to or better than linear are obtained with P4. When more than 5 workstations are used, the speedup decreases, possibly because of the communication management of the network. This shows that it is still possible to use a cluster of workstations to perform parallel computations for computation-intensive applications. The speedup for PVM is not as good as that for P4, but the portability of the approach is still demonstrated. Note that it is not the intention here to compare the performance of the different systems: although the results from PVM are not as good as those from P4, this does not mean that P4 is a better system than PVM. As discussed, the performance of a cluster of workstations is limited by the speed of the network devices and by the communication management; it should improve if faster network devices and better communication management are used.

4 Conclusion

In this study, it has been shown that a cluster of workstations can be used for developing parallel applications as well as for performing parallel computation. Although the speedup on a cluster of workstations is not yet satisfactory, faster network devices and better communication management are suggested for further study of parallel computation using clusters of workstations. In addition, it has been shown that it is possible to develop portable parallel codes across different parallel computing resources: by compiling with the translation mechanism on a parallel system, a portable parallel code which uses the function calls of the translation mechanism can be executed without any modification.

Acknowledgements

The authors would like to thank the San Diego Supercomputer Center for providing time on its Paragon parallel computer and the Computer Science Department at the University of California at Santa Barbara for providing time on its Meiko CS-2 parallel computer, which was obtained under a grant from the National Science Foundation, Award No. CDA92-16202.

References

1. Wang, K.P. & Bruch, J.C., Jr., An Efficient Fully Parallel Finite Difference SOR Algorithm for the Solution of a Free Boundary Seepage Problem, 2nd International Conference on Computational Modeling of Free and Moving Boundary Problems, Milan, Italy, ed. L.C. Wrobel & C.A. Brebbia, pp. 37-48, Computational Mechanics Publications, Southampton, U.K., 1993.

2. Wang, K.P. & Bruch, J.C., Jr., A Highly Efficient Iterative Parallel Computational Method for Finite Element Systems, Eng. Comput., 1993, 10, 195-204.

3. Wang, K.P. & Bruch, J.C., Jr., An Efficient Iterative Parallel Finite Element Computational Method, Chapter 12, The Mathematics of Finite Elements and Applications, ed. J.R. Whiteman, pp. 179-188, John Wiley, New York, 1994.

4. Wang, K.P. & Bruch, J.C., Jr., A SOR Iterative Algorithm for the Finite Difference and the Finite Element Methods that is Efficient and Parallelizable, Advances in Engineering Software, 1995, in press.

5. Butler, R. & Lusk, E., User's Guide to the P4 Programming System, Technical Report TM-ANL/92/17, Argonne National Laboratory, 1992.

6. Geist, A. et al., PVM 3 User's Guide and Reference Manual, Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, 1994.