Solving Tridiagonal Systems on the T3E: a Message Passing RD algorithm.

A. Bevilacqua
Dipartimento di Fisica, Università di Bologna
INFN, Sezione di Bologna

G. Spaletta
Dipartimento di Matematica, Università di Bologna

Summary

This work is meant to complete the analysis of the Recursive Decoupling (RD) algorithm for tridiagonal linear systems. The method, intrinsically parallel and scalable, finds a natural implementation on the CRAY architectures of the MPP generation, such as the T3E. Recently, a request has arisen to employ this solver in large applied problems, related to the engineering of microelectronic circuits and to the reconstruction and restoration of medical images. The need for a version portable to other distributed systems justified the revision of the RD algorithm, in view of its programming in the Message Passing model. Thanks to the CINECA grant, it has been possible to develop on the CRAY T3E a first PVM version of the algorithm, so as to make the comparison with the previous shared-memory implementations on the CRAY T3D as consistent as possible. The performance of the RD method on the CRAY T3E is evaluated in terms of its scalability and of the accuracy of the results obtained.

Abstract

The work developed here is meant as a completion of the analysis of the Recursive Decoupling (RD) solver for tridiagonal linear systems. The method, intrinsically parallel and scalable, finds a natural implementation on the CRAY architectures of the MPP generation, such as the T3E. Recently, a request has arisen to utilize this technique in the solution of applied problems related to the engineering of microwave circuits and to the reconstruction and restoration of medical images. The need for an RD version portable to various distributed systems justified the remodeling of the RD algorithm, to make its Message Passing (MP) implementation feasible. Thanks to the CINECA grant, it has been possible to develop a PVM version of the RD solver for the CRAY T3E, which also makes a comparison between the current MP implementation and the previous CRAY T3D shared-memory versions interesting and consistent. The performance of the RD method on the CRAY T3E is then evaluated, based on the scalability features and accuracy results obtained.

1. The algorithm

Here we summarize the concepts on which the RD technique is based; they can be found in more detail in [1, 2]. Let A be a diagonally dominant, tridiagonal matrix of dimension n = k · 2^q (q, k integers). The related square system

(1)    A u = d

is solved by partitioning A into an easily invertible block diagonal matrix J plus a sum of m-1 rank-one matrices x^(j) (y^(j))^T, where m = n/2. A first approximation to the solution vector u is given by solving

(2)    J u = d

At the same time, the following m-1 systems are solved:

(3)    J g^(j) = x^(j)

The partitioning of A is mirrored in the vectors u and g^(j), i.e. these vectors are partitioned into subvectors whose size is that of the diagonal blocks of J. The recursive use of the Sherman-Morrison formula then yields the solution of the original tridiagonal system (1). The n-dimensional vectors g^(j) are recursively updated and used to update u, providing the desired solution when the final step is reached. RD is a direct method, so the number of steps is finite; it equals log2(m), owing to the sparsity of the matrices involved, to the dimension n being a power of 2, and to the fan-in pattern of the updating. Furthermore, the sparsity and dimension of the problem make it possible to hold all the vectors u and g^(j) in log2(n) auxiliary vectors v^(k), each of dimension n. All the updating operations can be performed in parallel, maintaining the data partition chosen during the solution of (2) and (3), with minimal need for communication among tasks. In fact, systems (2) and (3) can themselves be solved in parallel, partitioned form.
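For reference, the rank-one update driving the recursion is the standard Sherman-Morrison formula; it is not written out explicitly here in the paper, but in the notation above, and using the solutions u and g^(j) of systems (2) and (3), it reads as follows for a single rank-one term:

\[
\bigl(J + x^{(j)} y^{(j)T}\bigr)^{-1}
  = J^{-1} - \frac{J^{-1} x^{(j)}\, y^{(j)T} J^{-1}}{1 + y^{(j)T} J^{-1} x^{(j)}}
\]
so that
\[
\bigl(J + x^{(j)} y^{(j)T}\bigr)^{-1} d
  = u - \frac{y^{(j)T} u}{1 + y^{(j)T} g^{(j)}}\, g^{(j)}.
\]

In the RD method this update is applied recursively over log2(m) levels, pairing blocks in a fan-in pattern; at each level the vectors g^(j) are updated alongside u, until the final level yields the solution of (1).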

The procedure outlined gives rise to an intrinsically parallel and scalable algorithm. The current version uses the C programming language and the Message Passing programming model, in answer to the request for a wider use of this solver. Results of the numerical tests and the analysis of the performance obtained are given next, including a qualitative comparison with analogous tests of the CRAY T3D shared-memory implementation.

2. PVM Recursive Decoupling

The C implementation of the RD method uses the routines of the PVM library for the communication among PEs (Processing Elements). Besides portability, the aims of this implementation are those sought by every parallel program: efficiency, speed-up, scalability and workload balancing. The SPMD programming model, required on the T3E, consists of one main program running on each processor; the master-slave paradigm is put into effect by the following branch in the main program:

    if (PE_id == 0) then <master code> else <slave code>

Data assignment (splitting) is crucial to the message-passing version of the RD algorithm. All the structures involved, namely the auxiliary vectors v^(k), are partitioned into subvectors v_i^(k) of equal length (given by the ratio between the problem dimension n and the number npes of processors used), which are local to each PE. The updating of the vectors v^(k) is then obtained by each PE updating its own substructure. The need for some communication of partial scalar results among groups of PEs justifies the choice of the master-slave programming model, with one PE specialized to execute the master code, in order to avoid the overhead due to synchronization among such groups.

The master code performs the following tasks: it reads the input data A and d from an external file; subdivides A and d and sends their parts to the slaves; collects the local results from each group of processors; computes the global results and sends them to each group of slaves; receives the final result and writes it to an external file.
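For illustration only, the following minimal sketch shows how such an SPMD main program, with PE 0 acting as master, might be organized using generic PVM 3 calls. All identifiers (the group name "rd", the message tags, NPES) are hypothetical, the exchange shown treats all slaves as a single pool (whereas the actual RD code exchanges different partial scalars among different pools of PEs at each level), and the Cray T3E PVM differs in some details of task startup and identification.

    #include <stdio.h>
    #include "pvm3.h"

    #define NPES        4       /* assumed number of PEs, fixed for this sketch */
    #define TAG_PARTIAL 10      /* hypothetical message tags                    */
    #define TAG_GLOBAL  11

    int main(void)
    {
        int PE_id;
        double partial = 0.0, global = 0.0;

        pvm_mytid();                     /* enrol this task in PVM               */
        PE_id = pvm_joingroup("rd");     /* logical PE number via a PVM group    */
        pvm_barrier("rd", NPES);         /* wait until all PEs have joined       */

        if (PE_id == 0) {                /* master code                          */
            int pe;
            for (pe = 1; pe < NPES; pe++) {             /* collect partials      */
                pvm_recv(-1, TAG_PARTIAL);
                pvm_upkdouble(&partial, 1, 1);
                global += partial;                      /* combine               */
            }
            for (pe = 1; pe < NPES; pe++) {             /* return global result  */
                pvm_initsend(PvmDataDefault);
                pvm_pkdouble(&global, 1, 1);
                pvm_send(pvm_gettid("rd", pe), TAG_GLOBAL);
            }
        } else {                         /* slave code                           */
            /* ... local RD computation producing `partial` ... */
            pvm_initsend(PvmDataDefault);
            pvm_pkdouble(&partial, 1, 1);
            pvm_send(pvm_gettid("rd", 0), TAG_PARTIAL); /* to the master         */
            pvm_recv(-1, TAG_GLOBAL);
            pvm_upkdouble(&global, 1, 1);               /* global scalar result  */
        }

        pvm_lvgroup("rd");
        pvm_exit();
        return 0;
    }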

The kernel of the algorithm, corresponding to the updating procedure, basically consists of three nested loops; denoting by the index k each step (level) of this procedure, the following scheme holds for each slave task:

    for each level k, k = 1 .. log2(m)
        if all data needed for the updating are local
            for each remaining level j, j = k+1 .. log2(m)
                update v_i^(k),  i = 1 .. n/npes
        else
            for each remaining level j
                compute its partial scalar result
                send it to the master
                receive the global scalar result
                use it to update v_i^(k),  i = 1 .. n/npes

Having one PE perform the master controls implies that at most 64 slaves can be used, which precludes exploiting all 128 PEs available on the T3E. We considered a model in which every PE performs the master controls as well as the slave task. Such a model would have led to a less efficient implementation, mainly because, at each level k, different partial scalar results have to be shared among different pools of PEs; having one master PE minimizes the occurrence of idle processors. The uniform partitioning of the structures, as described above, meets the goal of a perfectly even load balance.
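A minimal sketch of how one slave's updating kernel might be organized in C is shown below, purely to make the loop structure concrete. The names (rd_update, v, nloc, data_are_local, exchange_partial_scalar) are hypothetical, the element-wise update formulas are placeholders rather than the actual Sherman-Morrison updates of the RD code, and the serial stub for the scalar exchange stands in for the PVM send/receive pair with the master.

    /* Sketch of one slave's updating kernel (hypothetical names).            */
    /* v[k][i] stands for the local subvector v_i^(k); nloc = n / npes.       */

    static double exchange_partial_scalar(double partial)
    {
        /* In the parallel code this is a PVM exchange with the master;       */
        /* here a serial stand-in simply returns the local value.             */
        return partial;
    }

    void rd_update(double **v, int nloc, int nlevels, int data_are_local)
    {
        int k, j, i;

        for (k = 1; k <= nlevels; k++) {            /* levels, 1 .. log2(m)   */
            if (data_are_local) {
                for (j = k + 1; j <= nlevels; j++)  /* remaining levels       */
                    for (i = 0; i < nloc; i++)      /* local subvector        */
                        v[k][i] += v[j][i];         /* placeholder update     */
            } else {
                for (j = k + 1; j <= nlevels; j++) {
                    double partial = 0.0, global;
                    for (i = 0; i < nloc; i++)      /* local contribution     */
                        partial += v[j][i] * v[k][i];   /* placeholder        */
                    global = exchange_partial_scalar(partial);
                    for (i = 0; i < nloc; i++)      /* apply global scalar    */
                        v[k][i] += global * v[j][i];    /* placeholder        */
                }
            }
        }
    }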

3. Performance analysis

As a source of test problems for our numerical experiments, we consider tridiagonal linear systems of large dimension n = 2^q, whose randomly generated entries are such as to guarantee diagonal dominance. The exponent q ranges from 17 to 23 in the double precision (d.p.) version, while q reaches 24 in the single precision (s.p.) implementation. The accuracy results obtained in single precision comply with those obtained by the previous version on the CRAY T3D (see [1], [2] for a more detailed description). The improvement brought by double precision is paid for by the inability to solve systems of dimension greater than 2^23. If the solution is only required to be accurate in the first few significant digits, then the RD routine is serviceable for solving general tridiagonal linear systems.

Timing results, used for the evaluation of speed-up and efficiency, are measured in seconds and shown in the following three tables. Table 1 refers to the computational time required by the RD solver on a problem of dimension n = 2^17, both using PVM on the T3E (single and double precision) and using CRAFT on the T3D (single precision). Such a comparison must take into account the enhanced hardware resources; the time gain observed reflects the increased speed of the T3E processor.

Tables 2 and 3 gather all the computational timings required by the PVM implementation of the RD method on the T3E. The empty fields (marked by a dash) mean that the memory requirements needed to run a 2^q x 2^q problem on the chosen number of PEs were too high. The computational complexity of the PVM implementation is O(n (log2 n)^2), which is confirmed by the timings observed. This is slightly less satisfactory than the theoretical complexity of O(n log2 n) and calls for further improvement in the PVM restructuring of the algorithm. Communication timings are not shown explicitly, since their incidence on the overall computational time is always lower than 10%. As a consequence, the workload is almost perfectly balanced.

Speed-up and efficiency are good, reaching the optimal value in most cases (in a few cases we even obtain superlinear speed-ups); this good behaviour fades when 64 processors are used on a problem whose dimension is not large enough to give each processor a significant amount of work, so as to overcome the communication and synchronization overhead. This is shown in Figure 1, in which the Kuck function is plotted (against the number of processors, on a logarithmic scale) for problems of dimension 2^17, 2^18 and 2^19, respectively. The Kuck function conveys compound information, being the square of the geometric mean of speed-up and efficiency (obtained with a fixed number of processors). The scaling properties of the method are also confirmed by the timing results; the scaling factor, in its best instance, reaches the value of 0.88 (close to the theoretical optimum of 1).
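For clarity, the quantities just mentioned can be written out explicitly. The definitions of speed-up and efficiency below are the standard ones (they are not stated in the paper), the Kuck function is written out following the description given above, and the sample values are simple arithmetic on the Table 1 single precision timings.

\[
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
K(p) = \Bigl(\sqrt{S(p)\,E(p)}\Bigr)^{2} = S(p)\,E(p) = \frac{S(p)^{2}}{p}.
\]

For example, from Table 1 (T3E, s.p.): S(16) = 5.882 / 0.3680 ≈ 16.0, so E(16) ≈ 1.00 and K(16) ≈ 16.0, while S(64) = 5.882 / 0.1423 ≈ 41.3, so E(64) ≈ 0.65 and K(64) ≈ 26.7; this illustrates how efficiency, and hence the Kuck function, drops at 64 PEs for a problem of this relatively small dimension (n = 2^17).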

Table 1. MP-T3E vs SM-T3D timing comparison (seconds); problem dimension q = 17

    PEs          1        2        4        8       16       32       64
    T3E s.p.   5.882    2.8234   1.4349    .7207    .3680    .2094    .1423
    T3E d.p.   6.288    3.1268   1.5659    .7896    .4051    .2285    .1576
    T3D       22.159   11.053    5.494    2.758    1.393     .7073    .3553

Table 2. MP-T3E timings, double precision (seconds); problem dimension given by q

    PEs          1         2         4         8        16        32        64
    q = 18   14.3959    7.1378    3.5883    1.7820     .9101     .4907     .3090
    q = 19      -      16.2389    8.0822    4.0265    2.0372    1.0807     .6491
    q = 20      -         -      18.2524    9.0763    4.5963    2.3884    1.5529
    q = 21      -         -         -      20.4751   10.380     5.3492    2.9657
    q = 22      -         -         -         -      22.9562   11.7664    6.7028
    q = 23      -         -         -         -         -      28.2628   15.6026

Table 3. MP-T3E timings, single precision (seconds); problem dimension given by q

    PEs          1         2         4         8        16        32        64
    q = 18   13.1494    6.5497    3.2791    1.6216     .8272     .4412     .2761
    q = 19   29.7550   14.8085    7.3862    3.6964    1.8816     .9769     .5751
    q = 20      -      33.6471   16.7606    8.3642    4.1986    2.1877    1.2179
    q = 21      -         -      37.9167   18.9307    9.4531    4.9017    2.6547
    q = 22      -         -         -      42.6558   21.4814   11.1691    5.837
    q = 23      -         -         -         -      47.7032   24.0896   12.8759
    q = 24      -         -         -         -         -      53.4316   28.5315

4. Conclusion and future work

PVM ensures flexibility and portability of the code, which is a very important requirement in all branches of applied science. This could allow the use of the RD routine, on its own or as part of a more general application solver, on a heterogeneous cluster of computers or on any other MPP architecture. Because of the intrinsic parallelism of this problem, the workload is perfectly balanced; the accuracy and scaling features are maintained and confirmed by the PVM implementation of the RD algorithm. The communication and synchronization overhead, already quite small, might be decreased further; there is room for further improvement in the current message passing implementation, as suggested both by the Kuck function and by the observed complexity. In parallel with such improvements to the PVM version, an MPI implementation is also being developed.

Acknowledgments

We wish to thank Dr. Bassini, Dr. Voli and their colleagues at CINECA for their kind availability. Computational resources provided by the CINECA Supercomputing Center, under grant n. 97/335-5, are gratefully acknowledged.

Bibliography

[1] G. Spaletta, The Recursive Decoupling Solver for Tridiagonal Linear Systems on the CRAY T3D, Parallel Computing: State-of-the-Art and Perspectives, Elsevier, 1996, pp. 197-204.
[2] G. Spaletta, Recursive Decoupling on the CRAY T3D, Science and Supercomputing at CINECA, 1995 Report, pp. 507-511.
[3] A. Bevilacqua, G. Spaletta, Solving Systems by Recursive Decoupling with PVM and MPI, in preparation.