Simulating tsunami propagation on parallel computers using a hybrid software framework
Xing
Simula Research Laboratory, Norway / Department of Informatics, University of Oslo
March 12, 2007
List of Topics
1. Introduction
2. A hybrid software framework for parallelization
3. Desirable simulation setup for the future
4. Performance analysis done at HLRS
The origin of the word tsunami
"Tsunami" is Japanese: tsu (津, harbor) + nami (波, wave), i.e. "harbor wave".
Different types of tsunamis
Tsunamis: large waves formed by rapid mass movements
- Induced by a submarine earthquake (such as the Dec. 2004 Indian Ocean Tsunami)
- Induced by an asteroid impact (such as the Mjølnir impact)
- Induced by a landslide (of great importance to the Norwegian fjords)
Motivation
- Wave propagation simulation is very important for studying tsunamis
- A computational challenge:
  - huge computational domain
  - different physics required in different areas
- Parallel computing:
  - should reuse existing serial wave codes
  - should allow different mathematical models/resolutions in different areas
- Objective: a framework for parallel hybrid tsunami simulations
Huge computations (example: the Indian Ocean)
- 1 km × 1 km resolution overall: about 40 × 10⁶ mesh points
- 200 m × 200 m resolution overall: 10⁹ mesh points
Computational challenge
Example: the Indian Ocean
- 1 km × 1 km resolution is not sufficient everywhere
- 200 m × 200 m resolution overall is too much
We need smart computing:
- high resolution only in areas where necessary
- a simple mathematical model in vast areas
- an advanced mathematical model (due to complicated physics) in small areas
Result: a parallel hybrid tsunami simulator
Desirable resolution requires:
- number of mesh points: about 100 × 10⁶
- number of time steps: many thousands
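To make the scaling concrete: refining from 1 km to 200 m is a factor of 5 in each horizontal direction, so the mesh-point count grows by a factor of 25:

$$40 \times 10^6 \times \left(\frac{1000\,\mathrm{m}}{200\,\mathrm{m}}\right)^2 = 40 \times 10^6 \times 25 = 10^9.$$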
List of Topics
1. Introduction
2. A hybrid software framework for parallelization
3. Desirable simulation setup for the future
4. Performance analysis done at HLRS
Parallelization objectives
Requirement 1: easy parallelization
- reuse of serial wave codes during parallelization
- different serial codes collaborate inside a hybrid framework
Requirement 2: efficient use of computational resources
- FEM only in areas where unstructured meshes and advanced numerics are needed
- FDM elsewhere
Basic idea: divide and conquer
- Domain decomposition: one global solution domain is divided into many subdomains
- Each subdomain: a (relatively) independent working unit
- Collaboration between the subdomains: communication
Overall parallelization strategy
$\Omega = \bigcup_{s=1}^{P} \Omega_s$
- Divide a vast ocean domain into many subdomains
- Uniform local meshes and FDM on most of the subdomains
- Unstructured local meshes and FEM on selected subdomains
- A global iteration among all subdomains:
  - during each iteration, a subdomain independently updates its local solution
  - local solutions are exchanged between neighboring subdomains at the end of each iteration
- The solution of $\mathcal{L}_{\Omega}(u) = f_{\Omega}$ is found as a sequence $u^0, u^1, \ldots$, where iteration $i$ solves $\mathcal{L}_{\Omega_s}(u^i_s) = f^i_{\Omega_s}$, $1 \le s \le P$, and composes $u^i = \bigcup_{s=1}^{P} u^i_s$
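As a concrete illustration of this iteration, here is a minimal toy sketch (illustration only, not code from the actual Diffpack-based framework): additive Schwarz for -u'' = 1 on [0,1] with two overlapping 1D subdomains, each solved directly, with overlap values exchanged every global iteration.

  // schwarz1d.cpp -- toy sketch of the overlapping Schwarz iteration.
  // Solves -u'' = 1 on [0,1], u(0) = u(1) = 0, on two overlapping
  // subdomains; every global iteration does independent local solves,
  // then exchanges overlap values, exactly as outlined above.
  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Direct tridiagonal (Thomas) solve of -u'' = f on grid indices lo..hi,
  // with Dirichlet values taken from the current global iterate.
  static std::vector<double> solveLocal(int lo, int hi, double uLeft,
                                        double uRight, double h, double fval) {
    const int m = hi - lo + 1;
    std::vector<double> u(m), c(m), d(m);
    c[0] = 0.0; d[0] = uLeft;               // left boundary value is known
    for (int j = 1; j < m - 1; ++j) {       // forward elimination
      const double denom = 2.0 - c[j-1];    // stencil: -u[j-1]+2u[j]-u[j+1] = h^2*f
      c[j] = 1.0 / denom;
      d[j] = (h*h*fval + d[j-1]) / denom;
    }
    u[0] = uLeft; u[m-1] = uRight;
    for (int j = m - 2; j >= 1; --j)        // back substitution
      u[j] = d[j] + c[j] * u[j+1];
    return u;
  }

  int main() {
    const int n = 101;                      // global grid, x_j = j*h
    const double h = 1.0 / (n - 1);
    std::vector<double> u(n, 0.0);          // global iterate u^i
    // Two overlapping subdomains: indices 0..60 and 40..100 (overlap 40..60)
    for (int iter = 0; iter < 50; ++iter) {
      // Independent local solves, boundary data from the previous iterate
      std::vector<double> u1 = solveLocal(0, 60, 0.0, u[60], h, 1.0);
      std::vector<double> u2 = solveLocal(40, 100, u[40], 0.0, h, 1.0);
      // Compose u^i from the subdomain solutions (the exchange step)
      for (int j = 0; j <= 50; ++j)  u[j] = u1[j];
      for (int j = 51; j < n; ++j)   u[j] = u2[j - 40];
    }
    // The iteration converges to the global solution, here u(x) = x(1-x)/2
    double err = 0.0;
    for (int j = 0; j < n; ++j) {
      const double x = j * h;
      err = std::max(err, std::fabs(u[j] - 0.5 * x * (1.0 - x)));
    }
    std::printf("max deviation from global solution: %.3e\n", err);
    return 0;
  }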
Convergence among subdomains
- Schwarz methods work as the numerical foundation
- A small amount of overlap between neighboring subdomains (overlapping domain decomposition)
- Originally well known as a parallel numerical strategy for solving large linear systems
- We apply DD at the software level (not at the linear-algebra level)
- No global matrices/vectors exist; everything is represented by the collection of subdomain matrices/vectors
- Neighboring subdomain meshes may be non-matching and/or of different types
A generic library of Schwarz methods
- Schwarz methods are a general approach to solving PDEs in parallel, so a generic library can be programmed
- Object-oriented programming is well suited
- Generic components: subdomain solvers and a global administrator
  - class SubdomainSolver: generic interface of a subdomain solver; only declarations of standard functions, no implementation
  - class Administrator: implementation of generic functions for invoking communication and checking global convergence
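The slide names the two generic classes but not their interfaces; below is a hypothetical C++ sketch of the pattern (the member functions initLocalSolver, solveLocal, exchangeWithNeighbors, and localConverged are illustrative assumptions, not Diffpack's actual API):

  // Generic interface of a subdomain solver: declarations only,
  // no implementation (member names assumed for illustration).
  class SubdomainSolver {
  public:
    virtual ~SubdomainSolver() {}
    virtual void initLocalSolver() = 0;         // set up local mesh and matrices
    virtual void solveLocal() = 0;              // one local solve per global iteration
    virtual void exchangeWithNeighbors() = 0;   // communicate overlap-zone values
    virtual bool localConverged() const = 0;    // local part of the convergence test
  };

  // Global administrator: generic driver of the Schwarz iteration.
  class Administrator {
  public:
    explicit Administrator(SubdomainSolver& s) : solver(s) {}
    virtual ~Administrator() {}
    void solve(int maxIterations) {
      solver.initLocalSolver();
      for (int i = 0; i < maxIterations; ++i) {
        solver.solveLocal();                    // independent subdomain update
        solver.exchangeWithNeighbors();         // end-of-iteration exchange
        if (globallyConverged()) break;
      }
    }
  protected:
    // In the real framework this would combine the local flags across all
    // processes (e.g., a global reduction); here only the local test.
    virtual bool globallyConverged() { return solver.localConverged(); }
    SubdomainSolver& solver;
  };

The point of the design: the administrator is written once against the abstract interface, so any subdomain solver implementing that interface can be plugged in.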
A framework of hybrid tsunami simulators
Objective: a generic framework for creating hybrid parallel tsunami simulators, based on existing serial codes
Starting point:
- a C++ Boussinesq solver using FEM: class Boussinesq
- a legacy F77 code using FDM: a set of subroutines
- direct parallelization of either code requires too much work
A hybrid software framework:
- class SubdomainBQFEMSolver : public Boussinesq, public SubdomainSolver
- class SubdomainBQFDMSolver : public SubdomainSolver (calling F77 subroutines internally)
- class HybridBQSolver : public Administrator
Implementation uses Diffpack (www.diffpack.com)
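The class names below come from the slide; the bodies and the F77 signature are hypothetical, showing only the multiple-inheritance pattern that lets an existing serial solver double as a subdomain solver (this builds on the SubdomainSolver/Administrator sketch above):

  // Existing serial C++ FEM solver (interface assumed for illustration)
  class Boussinesq {
  public:
    virtual ~Boussinesq() {}
    virtual void solveAtThisTimeLevel() { /* legacy serial FEM solve */ }
  };

  // F77 legacy routine, reached through a C binding (signature assumed)
  extern "C" void bqfdm_(double* eta, int* nx, int* ny);

  // FEM subdomain solver: inherits the serial solver AND the generic
  // interface, so the serial code is reused without modification.
  class SubdomainBQFEMSolver : public Boussinesq, public SubdomainSolver {
  public:
    void initLocalSolver()       { /* read local mesh, allocate fields */ }
    void solveLocal()            { solveAtThisTimeLevel(); }
    void exchangeWithNeighbors() { /* overlap-zone communication */ }
    bool localConverged() const  { return true; /* stub */ }
  };

  // FDM subdomain solver: implements the same interface by calling the
  // F77 subroutines internally.
  class SubdomainBQFDMSolver : public SubdomainSolver {
  public:
    void initLocalSolver()       { /* set up uniform local grid */ }
    void solveLocal()            { bqfdm_(eta, &nx, &ny); }
    void exchangeWithNeighbors() { /* overlap-zone communication */ }
    bool localConverged() const  { return true; /* stub */ }
  private:
    double* eta; int nx, ny;
  };

  // Global administrator specialized for the hybrid Boussinesq setting
  class HybridBQSolver : public Administrator {
  public:
    explicit HybridBQSolver(SubdomainSolver& s) : Administrator(s) {}
  };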
Flexibility
- Free choice between SubdomainBQFEMSolver and SubdomainBQFDMSolver for each subdomain
- Adaptive mesh refinement allowed for FEM subdomains
- Neighboring subdomains may use non-matching local meshes
- Possible to incorporate other serial codes as subdomain solvers
List of Topics
1. Introduction
2. A hybrid software framework for parallelization
3. Desirable simulation setup for the future
4. Performance analysis done at HLRS
Subdomain preparation
[Figure: the ocean domain partitioned into four subdomains p1–p4; a new finite element code is used on selected subdomains, the finite difference legacy code on the rest]
Coarse-mesh simulation of the Indian Ocean Tsunami: initial wave elevation after the earthquake
Coarse-mesh simulation, snapshot 1: after 1.4 hours
Coarse-mesh simulation, snapshot 2: after 2.8 hours
List of Topics
1. Introduction
2. A hybrid software framework for parallelization
3. Desirable simulation setup for the future
4. Performance analysis done at HLRS
Motivation for my HPC-Europa visit
- Vector-CPU-based system (the NEC SX-8) at HLRS
- Extensive experience with performance analysis at HLRS
- Purpose: a fine-grained diagnosis of the tsunami simulator and our parallel PDE library
Observations so far (1)
- When the computational domain has no points on land, the parallel computation is well balanced
- On the SX-8, the main work at each time step goes into the discretization, not into solving the resulting distributed linear system
Observations so far (2)
- When the computational domain has points on land, the parallel computation is not balanced
- Causes of imbalance:
  - imbalance in the distributed discretization (some subdomains have many points on land)
  - imbalance in the parallel DD solver (some subdomain problems are easier to solve)
Observations so far (3)
- The SX compiler does not optimize the discretization phase very well:
  - C++ code
  - many levels of nested for-loops
  - extensive use of virtual functions
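As an illustration of the last point (an assumed pattern, not code quoted from the simulator): a virtual call inside the innermost loop hides the loop body behind an indirect call, which the vectorizing compiler can neither inline nor vectorize across:

  // Illustrative pattern only: the indirect call through 'f' prevents
  // the SX compiler from vectorizing the innermost quadrature loop.
  class Integrand {
  public:
    virtual ~Integrand() {}
    virtual double eval(double x) const = 0;
  };

  double integrateElement(const Integrand& f, const double* qpts,
                          const double* weights, int nq) {
    double sum = 0.0;
    for (int q = 0; q < nq; ++q)            // innermost loop of the assembly
      sum += weights[q] * f.eval(qpts[q]);  // virtual call blocks vectorization
    return sum;
  }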
Observations so far (4)
Vectorization is enabled for some parts of the code. Example: vector addition x = y + z

  // Tell the SX compiler the arrays do not overlap, so the loop can vectorize
  #pragma cdir nodep
  for (int i = 0; i < length; i++)
    tmp_x[i] = tmp_y[i] + tmp_z[i];

The percentage of vectorized code increased from 6–7% to 13–14% in the solution phase.
Observations so far (5)
Vectorization does not work for some parts of the code. Example: sparse matrix-vector multiplication y = Ax
- compressed row storage
- indirect (and random) access of data entries

  #pragma cdir nodep
  for (i = 1; i <= nrows; i++) {        // outer loop over matrix rows
    rstart = ad.irow(i);
    rstop  = ad.irow(i+1);
    tmp = 0.0;
  #pragma cdir novector                 // indirect addressing via jcol
    for (r = rstart; r < rstop; r++)    // inner loop over nonzeros of row i
      tmp += entries(r) * x(ad.jcol(r));
    y(i) += tmp;
  }

Vectorization of the inner for-loop has to be turned off!
Conclusions
- Schwarz methods: the numerical foundation for the parallelization
- Object-oriented programming enables a hybrid framework of tsunami simulators
- Full flexibility in choosing subdomain solvers:
  - different mathematical models
  - different discretizations
  - different local meshes
  - different codes
- Some parts of the tsunami simulator have been improved thanks to the analysis done at HLRS
- Remaining challenge: performance and load balancing