
Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications

Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2

1 Technische Universität München (TUM), Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Arcisstr. 21, D-80333 München
email: {michael.eberl, Wolfgang.Karl, Carsten.Trinitis}@in.tum.de, WWW home page: http://wwwbode.in.tum.de

2 Asea Brown Boveri Corporate Research Center, Speyerer Str. 4, D-69115 Heidelberg, Germany
email: ab@decrc.abb.de, WWW home page: http://www.decrc.abb.de

Abstract. This paper summarizes the results obtained with the parallel 3D electric field simulation program POLOPT on a cluster of PCs connected via Fast Ethernet. Thanks to the high performance of the CPUs and the interconnection technology, the results are comparable to those obtained on multiprocessor machines. Several practical high-voltage engineering problems have been calculated. An outlook regarding further speedup through improved interconnection technology is given.

1 Introduction

One of the most important stages in the development process of high-voltage apparatus is the simulation of the field strength distribution in order to detect critical areas that need to be changed. Roughly speaking, the simulation process consists of the input of geometric data (usually with a CAD modeling program), the creation of an accompanying mesh, the generation of the coefficient matrix and the solution of the linear system, and post-processing tasks like potential and field calculation at points of interest. Typical sizes for the equation systems are on the order of 10^3 to 10^4 unknowns, with fully populated coefficient matrices.

In 1994 ABB Corporate Research started a project aimed at the parallelization of the field calculation program POLOPT [1], which is based on the boundary element method [2], [4], [3]. The parallelization of the code was based on PVM [5]. The results obtained so far show that high efficiency can be achieved with typical industrial hardware such as workstation clusters or multiprocessor supercomputers like the IBM SP2 or the SGI PowerChallenge.

This paper presents an alternative approach to the parallel computation of such CPU-intensive problems: the code has been ported to a PC cluster running Linux. Benchmark problems have been calculated with different cluster configurations, and the results obtained from these experiments are compared to those presented in [3]. We demonstrate that this environment is a suitable alternative to expensive multiprocessor computers for numerical industrial applications.
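As an editorial back-of-envelope (not part of the paper), the following C sketch shows the storage a fully populated coefficient matrix of these sizes requires, assuming 8-byte double-precision entries; the paper does not state POLOPT's actual storage format.

#include <stdio.h>

/* Illustrative only: memory needed for a dense n x n coefficient matrix,
 * assuming 8 bytes per entry (double precision is an assumption). */
int main(void)
{
    const long sizes[] = { 3500, 7000 };   /* the two benchmark sizes used in Section 4 */
    for (int i = 0; i < 2; i++) {
        long n = sizes[i];
        double mb = (double)n * (double)n * 8.0 / (1024.0 * 1024.0);
        printf("n = %4ld: dense matrix occupies about %.0f MB\n", n, mb);
    }
    return 0;
}

Under that assumption, the matrix for the larger benchmark (7000 unknowns) alone occupies roughly 370 MB, more than the 256 MB of RAM per node reported in Section 3, which helps explain why the matrix is generated and stored in parts across the nodes (Section 2) rather than held on a single machine.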

2 Parallelization Concept

This section briefly summarizes the basic idea behind the parallelization concept (see Fig. 1); it has been presented in detail in [4] and [3].

Fig. 1. Parallelization concept: Each node generates its own part of the coefficient matrix and stores it locally. The size of this part corresponds to the node's speed.

As mentioned in the introduction, the field simulation process consists of modeling the geometric data, generating an accompanying mesh, computing the (fully populated) coefficient matrix, solving the resulting equation system, and calculating field and potential at the points of interest. The part that can be parallelized is the actual numerical calculation, i.e. the latter three steps. Each matrix row can be generated independently of the other rows, provided the input data has been replicated on each node. The generated parts of the matrix are distributed over the nodes.

The parallelization is based on a master-slave approach. The workload is distributed by the master following a Mandelbrot algorithm which takes each node's speed and current load into account. The solver used is the iterative GMRES method [8]. It can be parallelized in a straightforward manner because the operation performed in each iteration is essentially a matrix-vector multiplication. Since the basic parallelization concept is algebraic rather than topological (no domain decomposition), the parallel efficiency depends only on the problem size.
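The paper lists no code, but the communication pattern just described can be sketched with the PVM calls it builds on. The fragment below is an illustrative master-side step of one parallel matrix-vector product; the message tags, function name, and row bookkeeping (first_row, num_rows) are assumptions made for the sketch, not POLOPT's actual implementation.

#include <pvm3.h>

#define TAG_VEC 1   /* master -> slaves: current iteration vector (tag value assumed) */
#define TAG_RES 2   /* slaves -> master: partial result           (tag value assumed) */

/* One matrix-vector product y = A*x driven by the master.  Each slave is
 * assumed to hold the block of rows it generated (num_rows[s] rows starting
 * at first_row[s]), sized in proportion to its speed as in Fig. 1. */
void matvec_master(int nslaves, const int *tids, const int *first_row,
                   const int *num_rows, double *x, double *y, int n)
{
    /* send the full vector x to every slave */
    for (int s = 0; s < nslaves; s++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(x, n, 1);
        pvm_send(tids[s], TAG_VEC);
    }
    /* collect the partial results in whatever order they arrive */
    for (int s = 0; s < nslaves; s++) {
        int bufid = pvm_recv(-1, TAG_RES);
        int bytes, tag, tid;
        pvm_bufinfo(bufid, &bytes, &tag, &tid);
        for (int k = 0; k < nslaves; k++) {   /* map the sender to its row block */
            if (tids[k] == tid) {
                pvm_upkdouble(y + first_row[k], num_rows[k], 1);
                break;
            }
        }
    }
}

Each slave would symmetrically unpack x, multiply it by its locally stored rows, and return the result with TAG_RES. Note that all vector traffic passes through the master in this pattern, which is exactly the bottleneck discussed in Section 4.2.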

3 The PC cluster environment

Traditionally, numerical computations like the POLOPT program run either on massively parallel processor (MPP) systems like the IBM SP2, on multiprocessors like the SGI PowerChallenge, or on networks of workstations (NOW). However, the steadily increasing performance of modern PC CPUs has led to a low-cost yet powerful alternative platform for compute-intensive applications. The Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) investigates high-performance PC-based cluster computing, and the ABB POLOPT code, as a typical industrial application, serves as a test case for LRR's PC cluster computing approach [6].

The PC cluster consists of five dual Pentium II Xeon nodes running at 450 MHz, interconnected via switched Fast Ethernet. Each PC is equipped with 256 MB of ECC SDRAM connected to the 100 MHz system bus and a 4 GB Ultra Wide SCSI hard disk. The PCs run Linux (kernel version 2.2.6), and PVM 3.4.0 is used for POLOPT's parallel communication.

4 Results

4.1 Overview

Two benchmark problems of typical size (3500 and 7000 unknowns) have been run on the PC cluster described in the previous section. These benchmarks are the same as the ones used in [4] and [3]. The computation times have been measured on 1, 2, 4, 8 and 10 CPUs, and the resulting parallel efficiency has been compared to that obtained in [3] on the IBM SP2 and the SGI PowerChallenge.

Fig. 2 shows the computation times for the medium-sized problem (3500 unknowns) on the PC cluster compared to those obtained on the IBM SP2 and SGI PowerChallenge machines. Not surprisingly, the more modern processors of our PC cluster outperform the older parallel machines. However, the suitability of PC clusters for large industrial applications can also be demonstrated by comparing the parallel efficiency for the benchmark problems on each parallel system. Fig. 3 compares the parallel efficiency for the medium-sized problem (matrix size 3500), and Fig. 4 shows the comparison for the large problem (matrix size 7000).
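The paper does not spell out the definition, but the parallel efficiency compared in Figs. 3 and 4 is presumably the standard ratio of speedup to processor count,

S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)},

where T(p) is the measured computation time on p CPUs. Under this definition, the 83% and 87% efficiencies quoted in Section 4.2 for 10 processors would correspond to speedups of roughly 8.3 and 8.7, respectively.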

Fig. 2. Computation times for the medium-sized problem (3500 unknowns) compared to those obtained on the multiprocessor machines used in [3]

4.2 A Closer Look

In general, parallel efficiency decreases as the number of processors and the demand for communication increase. In contrast to the PC cluster used for the measurements in this work, specialized parallel computers such as the IBM SP2 or the SGI PowerChallenge referenced in our comparisons use specially designed, highly efficient internal networks to connect their compute nodes. Those networks typically provide much lower message latency and higher bandwidth than the (switched) Ethernet interface used in a PC cluster.

The PC cluster with up to 10 processors yields a reasonably high parallel efficiency (83% for the large and 87% for the medium problem). However, for configurations with more than 4 processors the efficiency is significantly lower and drops faster than on the dedicated parallel computers. Although we use a switched network, the parallel processes communicate only with the master process, so the network interface of the master node becomes the communication bottleneck and negates the advantages of a switched network. In order to increase the parallel efficiency for this type of application, it is not enough to increase the gross network capacity; instead, the performance (latency and bandwidth) at every node has to be improved.
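A crude way to see why the master link dominates (an editorial model, not from the paper, assuming the master sends the full iteration vector of n double-precision values to each of the p slaves and receives their partial results back) is to write the per-iteration communication time as

T_{\mathrm{comm}} \approx p\,t_{\mathrm{lat}} + \frac{8\,n\,(p+1)}{B_{\mathrm{master}}},

where t_lat is the per-message latency and B_master is the bandwidth of the master's single Fast Ethernet port. Every term scales with p and passes through that one port, so adding switch capacity elsewhere in the network does not remove the bottleneck, consistent with the observation above.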

Fig. 3. Parallel efficiency for the medium-sized problem (3500 unknowns) compared to that obtained on the multiprocessor machines used in [3]

5 Conclusion and Outlook

We conclude that the network interface is the Achilles' heel of larger PC clusters. Ethernet was not designed as an interconnection network for parallel systems: it incurs considerable overhead in message transfers, including kernel-mode drivers and a complex protocol stack (TCP/IP). A technical solution that alleviates this problem is the SCI interconnect technology (IEEE Std 1596-1992, [7]). SCI provides high bandwidth and, thanks to its hardware-based distributed shared memory facility, also low message latency. With modern communication architectures on top of SCI, the time-consuming system calls and buffering can be avoided. Consequently, the next step on our roadmap will be the adaptation of POLOPT to the SCI technology, yielding a much more efficient implementation.

Fig. 4. Parallel efficiency for the large problem (7000 unknowns) compared to the IBM SP2 used in [3]

References

1. Z. Andjelic. POLOPT 4.5 User's Guide. Asea Brown Boveri Corporate Research, Heidelberg, 1996.
2. R. Bausinger and G. Kuhn. Die Boundary-Element-Methode. Expert Verlag, Ehingen, 1987.
3. A. Blaszczyk and C. Trinitis. Experience with PVM in an Industrial Environment. Lecture Notes in Computer Science 1156, EuroPVM'96, Springer Verlag, pp. 174-179, 1996.
4. A. Blaszczyk et al. Parallel Computation of Electric Field in a Heterogeneous Workstation Cluster. Lecture Notes in Computer Science 919, HPCN Europe, Springer Verlag, pp. 606-611, 1995.
5. A. Geist et al. PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Tennessee, May 1994.
6. H. Hellwagner, W. Karl, and M. Leberecht. Enabling a PC Cluster for High-Performance Computing. SPEEDUP Journal, 11(1), June 1997.
7. IEEE Standard for the Scalable Coherent Interface (SCI). IEEE Std 1596-1992, 1993. IEEE, 345 East 47th Street, New York, NY 10017-2394, USA.
8. Y. Saad and M. H. Schultz. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput., pp. 856-869, 1986.