
Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications

Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2

1 Technische Universität München (TUM), Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Arcisstr. 21, D-80333 München
email: {michael.eberl, Wolfgang.Karl, Carsten.Trinitis}@in.tum.de, WWW home page: http://wwwbode.in.tum.de

2 Asea Brown Boveri Corporate Research Center, Speyerer Str. 4, D-69115 Heidelberg, Germany
email: ab@decrc.abb.de, WWW home page: http://www.decrc.abb.de

Abstract. This paper summarizes the results obtained with the parallel 3D electric field simulation program POLOPT on a cluster of PCs connected via Fast Ethernet. Thanks to the high performance of the CPUs and the interconnection technology, the results are comparable to those obtained on multiprocessor machines. Several practical high-voltage engineering problems have been calculated. An outlook regarding further speedup through improved interconnection technology is given.

1 Introduction

One of the most important stages in the development process of high-voltage apparatus is the simulation of the field strength distribution in order to detect critical areas that need to be changed. Roughly speaking, the simulation process consists of the input of geometric data (usually with a CAD modeling program), the creation of an accompanying mesh, the generation of the coefficient matrix and the solution of the linear system, and post-processing tasks like potential and field calculation at points of interest. Typical sizes for the equation systems are on the order of 10^3 to 10^4 unknowns, with fully populated coefficient matrices.

In 1994 ABB Corporate Research started a project aimed at the parallelization of the field calculation program POLOPT [1], which is based on the boundary element method [2], [4], [3]. The parallelization of the code was based on PVM [5]. The results obtained so far show that high efficiency can be achieved with typical industrial hardware such as workstation clusters or multiprocessor supercomputers like the IBM SP2 or the SGI PowerChallenge.

This paper presents an alternative approach to the parallel computation of such CPU-intensive problems: the code has been ported to a PC cluster running Linux. Benchmark problems have been calculated with different cluster configurations, and the results obtained from these experiments are compared to those presented in [3]. We demonstrate that this environment is a suitable alternative to expensive multiprocessor computers for numerical industrial applications.
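As an editorial back-of-envelope (not part of the paper), the following C sketch shows the storage a fully populated coefficient matrix of these sizes requires, assuming 8-byte double-precision entries; the paper does not state POLOPT's actual storage format.

#include <stdio.h>

/* Illustrative only: memory needed for a dense n x n coefficient matrix,
 * assuming 8 bytes per entry (double precision is an assumption). */
int main(void)
{
    const long sizes[] = { 3500, 7000 };   /* the two benchmark sizes used in Section 4 */
    for (int i = 0; i < 2; i++) {
        long n = sizes[i];
        double mb = (double)n * (double)n * 8.0 / (1024.0 * 1024.0);
        printf("n = %4ld: dense matrix occupies about %.0f MB\n", n, mb);
    }
    return 0;
}

Under that assumption, the matrix for the larger benchmark (7000 unknowns) alone occupies roughly 370 MB, more than the 256 MB of RAM per node reported in Section 3, which helps explain why the matrix is generated and stored in parts across the nodes (Section 2) rather than held on a single machine.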

2 Parallelization Concept

This section briefly summarizes the basic idea behind the parallelization concept (see Fig. 1); it has been presented in detail in [4] and [3].

Fig. 1. Parallelization concept: Each node generates its own part of the coefficient matrix and stores it locally. The size of this part corresponds to the node's speed.

As mentioned in the introduction, the field simulation process consists of modeling the geometric data, generating an accompanying mesh, computing the (fully populated) coefficient matrix, solving the resulting equation system, and calculating field and potential at the points of interest. The part that can be parallelized is the actual numerical calculation, i.e. the latter three steps. Each matrix row can be generated independently of the other rows, provided the input data has been replicated on each node. The generated parts of the matrix are distributed over the nodes.

The parallelization is based on a master-slave approach. The workload is distributed by the master following a Mandelbrot algorithm which takes each node's speed and current load into account. The solver used is the iterative GMRES method [8]. It can be parallelized in a straightforward manner because the operation performed in each iteration is essentially a matrix-vector multiplication. Since the basic parallelization concept is algebraic rather than topological (no domain decomposition), the parallel efficiency depends only on the problem size.
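The paper lists no code, but the communication pattern just described can be sketched with the PVM calls it builds on. The fragment below is an illustrative master-side step of one parallel matrix-vector product; the message tags, function name, and row bookkeeping (first_row, num_rows) are assumptions made for the sketch, not POLOPT's actual implementation.

#include <pvm3.h>

#define TAG_VEC 1   /* master -> slaves: current iteration vector (tag value assumed) */
#define TAG_RES 2   /* slaves -> master: partial result           (tag value assumed) */

/* One matrix-vector product y = A*x driven by the master.  Each slave is
 * assumed to hold the block of rows it generated (num_rows[s] rows starting
 * at first_row[s]), sized in proportion to its speed as in Fig. 1. */
void matvec_master(int nslaves, const int *tids, const int *first_row,
                   const int *num_rows, double *x, double *y, int n)
{
    /* send the full vector x to every slave */
    for (int s = 0; s < nslaves; s++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(x, n, 1);
        pvm_send(tids[s], TAG_VEC);
    }
    /* collect the partial results in whatever order they arrive */
    for (int s = 0; s < nslaves; s++) {
        int bufid = pvm_recv(-1, TAG_RES);
        int bytes, tag, tid;
        pvm_bufinfo(bufid, &bytes, &tag, &tid);
        for (int k = 0; k < nslaves; k++) {   /* map the sender to its row block */
            if (tids[k] == tid) {
                pvm_upkdouble(y + first_row[k], num_rows[k], 1);
                break;
            }
        }
    }
}

Each slave would symmetrically unpack x, multiply it by its locally stored rows, and return the result with TAG_RES. Note that all vector traffic passes through the master in this pattern, which is exactly the bottleneck discussed in Section 4.2.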

3 The PC cluster environment

Traditionally, numerical computations like the POLOPT program run either on massively parallel processor (MPP) systems like the IBM SP2, on multiprocessors like the SGI PowerChallenge, or on networks of workstations (NOW). However, the steadily increasing performance of modern PC CPUs has led to a low-cost yet powerful alternative platform for compute-intensive applications. The Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) investigates high-performance PC-based cluster computing, and the ABB POLOPT code, as a typical industrial application, serves as a test case for LRR's PC cluster computing approach [6].

The PC cluster consists of five dual Pentium II Xeon nodes running at 450 MHz, interconnected via switched Fast Ethernet. Each PC is equipped with 256 MB of ECC SDRAM connected to the 100 MHz system bus and a 4 GB Ultra Wide SCSI hard disk. The PCs run Linux (kernel version 2.2.6), and PVM 3.4.0 is used for POLOPT's parallel communication.

4 Results

4.1 Overview

Two benchmark problems of typical size (3500 and 7000 unknowns) have been run on the PC cluster described in the previous section. These benchmarks are the same as the ones used in [4] and [3]. The computation times have been measured on 1, 2, 4, 8 and 10 CPUs, and the resulting parallel efficiency has been compared to that obtained in [3] on the IBM SP2 and the SGI PowerChallenge.

Fig. 2 shows the computation times for the medium-sized problem (3500 unknowns) on the PC cluster compared to those obtained on the IBM SP2 and SGI PowerChallenge machines. Not surprisingly, the more modern processors of our PC cluster outperform the older parallel machines. However, the suitability of PC clusters for large industrial applications can also be demonstrated by comparing the parallel efficiency for the benchmark problems on each parallel system. Fig. 3 compares the parallel efficiency for the medium-sized problem (matrix size 3500), and Fig. 4 shows the comparison for the large problem (matrix size 7000).
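The paper does not spell out the definition, but the parallel efficiency compared in Figs. 3 and 4 is presumably the standard ratio of speedup to processor count,

S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)},

where T(p) is the measured computation time on p CPUs. Under this definition, the 83% and 87% efficiencies quoted in Section 4.2 for 10 processors would correspond to speedups of roughly 8.3 and 8.7, respectively.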

Fig. 2. Computation times for the medium-sized problem (3500 unknowns) compared to those obtained on the multiprocessor machines used in [3]

4.2 A Closer Look

In general, parallel efficiency decreases as the number of processors and the demand for communication increase. In contrast to the PC cluster used for the measurements in this work, specialized parallel computers such as the IBM SP2 or the SGI PowerChallenge referenced in our comparisons use specially designed, highly efficient internal networks to connect their compute nodes. Those networks typically provide much lower message latency and higher bandwidth than the (switched) Ethernet interface used in a PC cluster.

The PC cluster with up to 10 processors yields a reasonably high parallel efficiency (83% for the large and 87% for the medium problem). However, for configurations with more than 4 processors the efficiency is significantly lower and drops faster than on the dedicated parallel computers. Although we use a switched network, the parallel processes communicate only with the master process, so the network interface of the master node becomes the communication bottleneck and negates the advantages of a switched network. In order to increase the parallel efficiency for this type of application, it is not enough to increase the gross network capacity; instead, the performance (latency and bandwidth) at every node has to be improved.
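A crude way to see why the master link dominates (an editorial model, not from the paper, assuming the master sends the full iteration vector of n double-precision values to each of the p slaves and receives their partial results back) is to write the per-iteration communication time as

T_{\mathrm{comm}} \approx p\,t_{\mathrm{lat}} + \frac{8\,n\,(p+1)}{B_{\mathrm{master}}},

where t_lat is the per-message latency and B_master is the bandwidth of the master's single Fast Ethernet port. Every term scales with p and passes through that one port, so adding switch capacity elsewhere in the network does not remove the bottleneck, consistent with the observation above.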

Fig. 3. Parallel efficiency for the medium-sized problem (3500 unknowns) compared to that obtained on the multiprocessor machines used in [3]

5 Conclusion and Outlook

We conclude that the network interface is the Achilles' heel of larger PC clusters. Ethernet was not designed as an interconnection network for parallel systems: it incurs considerable overhead in message transfers, including kernel-mode drivers and a complex protocol stack (TCP/IP). A technical solution that alleviates this problem is the SCI interconnect technology (IEEE Std 1596-1992, [7]). SCI provides high bandwidth and, thanks to its hardware-based distributed shared memory facility, also low message latency. With modern communication architectures on top of SCI, the time-consuming system calls and buffering can be avoided. Consequently, the next step on our roadmap will be the adaptation of POLOPT to the SCI technology, yielding a much more efficient implementation.

Fig. 4. Parallel efficiency for the large problem (7000 unknowns) compared to the IBM SP2 used in [3]

References

1. Z. Andjelic. POLOPT 4.5 User's Guide. Asea Brown Boveri Corporate Research, Heidelberg, 1996.
2. R. Bausinger and G. Kuhn. Die Boundary-Element-Methode. Expert Verlag, Ehingen, 1987.
3. A. Blaszczyk and C. Trinitis. Experience with PVM in an Industrial Environment. Lecture Notes in Computer Science 1156, EuroPVM'96, Springer Verlag, pp. 174-179, 1996.
4. A. Blaszczyk et al. Parallel Computation of Electric Field in a Heterogeneous Workstation Cluster. Lecture Notes in Computer Science 919, HPCN Europe, Springer Verlag, pp. 606-611, 1995.
5. A. Geist et al. PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Tennessee, May 1994.
6. H. Hellwagner, W. Karl, and M. Leberecht. Enabling a PC Cluster for High-Performance Computing. SPEEDUP Journal, 11(1), June 1997.
7. IEEE Standard for the Scalable Coherent Interface (SCI). IEEE Std 1596-1992, 1993. IEEE, 345 East 47th Street, New York, NY 10017-2394, USA.
8. Y. Saad and M. H. Schultz. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput., pp. 856-869, 1986.