Simulating tsunami propagation on parallel computers using a hybrid software framework

Xing
Simula Research Laboratory, Norway
Department of Informatics, University of Oslo
March 12, 2007

Outline
1 Introduction
2 A hybrid software framework for parallelization
3 Desirable simulation setup for future
4 Performance analysis done at HLRS

1 Introduction

The origin of the word tsunami

Different types of tsunamis
- Tsunamis: large waves formed by rapid mass movements
- Induced by an underwater earthquake (such as the Dec. 2004 Indian Ocean Tsunami)
- Induced by an asteroid impact (such as the Mjølnir Impact)
- Induced by a landslide (of great importance to the Norwegian fjords)

Motivation
- Wave propagation simulation is very important for studying tsunamis
- A computational challenge
  - huge computational domain
  - different physics required in different areas
- Parallel computing
  - should reuse existing serial wave codes
  - should allow different math models/resolutions in different areas
- Objective: a framework for parallel hybrid tsunami simulations

Huge computations (example: Indian Ocean)
- 1 km × 1 km resolution overall: about 40 × 10^6 mesh points
- 200 m × 200 m resolution overall: 10^9 mesh points
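The two point counts are consistent with each other. As a quick check (the symbol N for the number of mesh points is introduced here only for illustration), refining the spacing from 1 km to 200 m divides it by 5 in each horizontal direction, so the point count grows by a factor of 25:

```latex
\[
N_{200\,\mathrm{m}}
  = N_{1\,\mathrm{km}} \left( \frac{1000\,\mathrm{m}}{200\,\mathrm{m}} \right)^{2}
  = 40 \times 10^{6} \times 25
  = 10^{9}
\]
```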

Computational challenge
- Example: Indian Ocean
  - 1 km × 1 km resolution is not sufficient everywhere
  - 200 m × 200 m resolution overall is too much
- We need "smart computing":
  - High resolution only in areas where necessary
  - Simple mathematical model in vast areas
  - Advanced mathematical model (due to complicated physics) in small areas
- Result: parallel hybrid tsunami simulator
- Desirable resolution requires
  - number of mesh points: 100 × 10^6
  - number of time steps: many thousands

2 A hybrid software framework for parallelization

Parallelization objectives
- Requirement 1: easy parallelization
  - Reuse of serial wave codes during parallelization
  - Different serial codes collaborate inside a hybrid framework
- Requirement 2: efficient use of computational resources
  - FEM only in areas where unstructured meshes and advanced numerics are needed
  - FDM elsewhere

Basic idea: divide and conquer
- Domain decomposition: one global solution domain is divided into many subdomains
- Each subdomain: (relatively) independent working unit
- Collaboration between the subdomains: communication

Overall parallelization strategy
$\Omega = \bigcup_{s=1}^{P} \Omega_s$
- Divide a vast ocean domain into many subdomains
- Uniform local meshes and FDM on most of the subdomains
- Unstructured local meshes and FEM on selected subdomains
- A global iteration among all subdomains
  - During each iteration a subdomain independently updates its local solution
  - Exchange of local solutions between neighboring subdomains at end of each iteration
- Solution of $L_{\Omega}(u) = f_{\Omega}$ is found as $u^0, u^1, \ldots, u^i$:
  $L_{\Omega_s}(u_s^i) = f^i_{\Omega_s}, \quad 1 \le s \le P$
  $u^i = \bigcup_{s=1}^{P} u_s^i$
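The global iteration can be illustrated on a much smaller problem than the tsunami equations. The following is a minimal sketch, not the simulator's code: it assumes the 1D model problem -u'' = f on (0, 1) with u(0) = u(1) = 0, two overlapping subdomains, and finite differences on both. The names `solveSubdomain`, `schwarz1D`, `a2` and `b1` are hypothetical. Each subdomain solves its local problem using the neighbor's previous values as boundary data, and the local solutions are then exchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Solve -u'' = f on grid nodes lo..hi (spacing h) with Dirichlet values
// u[lo] and u[hi] taken from the previous global iterate u.
// Returns the local solution on nodes lo..hi (Thomas algorithm).
std::vector<double> solveSubdomain(const std::vector<double>& u,
                                   int lo, int hi, double h, double f) {
    assert(hi - lo >= 2);
    const int m = hi - lo - 1;            // interior unknowns
    std::vector<double> cp(m), dp(m), local(m + 2);
    for (int k = 0; k < m; ++k) {         // forward elimination
        double rhs = h * h * f;
        if (k == 0) rhs += u[lo];         // left boundary value
        if (k == m - 1) rhs += u[hi];     // right boundary value
        const double denom = 2.0 + (k > 0 ? cp[k - 1] : 0.0);
        cp[k] = -1.0 / denom;
        dp[k] = (rhs + (k > 0 ? dp[k - 1] : 0.0)) / denom;
    }
    local[0] = u[lo];
    local[m + 1] = u[hi];
    local[m] = dp[m - 1];                 // back substitution
    for (int k = m - 2; k >= 0; --k) local[k + 1] = dp[k] - cp[k] * local[k + 2];
    return local;
}

// Overlapping Schwarz iteration with two subdomains: nodes 0..b1 and a2..n.
// Returns the number of iterations used, or -1 if maxIter was reached.
int schwarz1D(std::vector<double>& u, int a2, int b1,
              double f, double tol, int maxIter) {
    const int n = static_cast<int>(u.size()) - 1;
    assert(0 < a2 && a2 < b1 && b1 < n);
    const double h = 1.0 / n;
    const int mid = (a2 + b1) / 2;        // where the global solution is stitched
    for (int it = 1; it <= maxIter; ++it) {
        // Both local solves use only the previous iterate: they are independent.
        const std::vector<double> u1 = solveSubdomain(u, 0, b1, h, f);
        const std::vector<double> u2 = solveSubdomain(u, a2, n, h, f);
        // "Exchange": compose the new global iterate from the local solutions.
        double change = 0.0;
        for (int j = 0; j <= n; ++j) {
            const double v = (j <= mid) ? u1[j] : u2[j - a2];
            change = std::max(change, std::fabs(v - u[j]));
            u[j] = v;
        }
        if (change < tol) return it;
    }
    return -1;
}
```

With f = 2 the exact solution is u(x) = x(1 - x), which the finite difference scheme reproduces at the nodes, so calling `schwarz1D(u, 16, 24, 2.0, 1e-10, 500)` on a zero-initialized vector of 41 values should converge to that profile. A wider overlap reduces the number of iterations at the price of more duplicated work.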

Convergence among subdomains
- Schwarz methods work as the numerical foundation
  - Small amount of overlap between neighboring subdomains (overlapping domain decomposition)
  - Originally well-known as a parallel numerical strategy for solving large linear systems
- We apply domain decomposition (DD) at "software level" (not at "linear-algebra level")
  - No global matrices/vectors exist, all represented by the collection of subdomain matrices/vectors
  - Neighboring subdomain meshes may be non-matching and/or of different types

A generic library of Schwarz methods
- Schwarz methods: a general approach to solving PDEs in parallel, so a generic library can be programmed
- Object-oriented programming is well suited
- Generic components: subdomain solvers and a global administrator
  - class SubdomainSolver: generic interface of a subdomain solver, only declaration of standard functions, no implementation
  - class Administrator: implementation of generic functions for invoking communication and checking global convergence
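The split between the two generic components might look roughly as follows. This is a hedged sketch: only the class names `SubdomainSolver` and `Administrator` come from the slides, while every member function (`solveLocal`, `localChange`, `interfaceValue`, `receive`, `attach`, `connect`, `iterate`) and the `ToySolver` class are assumptions made for illustration. The exchange is done in-process here; the actual library would use message passing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Generic interface: declarations only, no implementation.
class SubdomainSolver {
public:
    virtual ~SubdomainSolver() = default;
    virtual void solveLocal() = 0;               // update the local solution
    virtual double localChange() const = 0;      // size of the last update
    virtual double interfaceValue() const = 0;   // data offered to neighbors
    virtual void receive(double neighborValue) = 0;
};

// Generic driver: invokes communication and checks global convergence.
class Administrator {
public:
    void attach(std::shared_ptr<SubdomainSolver> s) { solvers_.push_back(std::move(s)); }
    void connect(std::size_t i, std::size_t j) {
        assert(i < solvers_.size() && j < solvers_.size() && i != j);
        neighbors_.emplace_back(i, j);
    }
    // Returns the number of iterations used, or -1 if maxIter was reached.
    int iterate(double tol, int maxIter) {
        for (int it = 1; it <= maxIter; ++it) {
            double change = 0.0;                 // global max over subdomains
            for (auto& s : solvers_) {
                s->solveLocal();
                change = std::max(change, s->localChange());
            }
            for (const auto& p : neighbors_) {   // exchange between neighbors
                const double vi = solvers_[p.first]->interfaceValue();
                const double vj = solvers_[p.second]->interfaceValue();
                solvers_[p.first]->receive(vj);
                solvers_[p.second]->receive(vi);
            }
            if (change < tol) return it;
        }
        return -1;
    }
private:
    std::vector<std::shared_ptr<SubdomainSolver>> solvers_;
    std::vector<std::pair<std::size_t, std::size_t>> neighbors_;
};

// Toy subdomain solver: averages its value with the neighbor's last value.
class ToySolver : public SubdomainSolver {
public:
    explicit ToySolver(double v0) : v_(v0) {}
    void solveLocal() override {
        const double old = v_;
        v_ = 0.5 * (v_ + r_);
        change_ = std::fabs(v_ - old);
    }
    double localChange() const override { return change_; }
    double interfaceValue() const override { return v_; }
    void receive(double neighborValue) override { r_ = neighborValue; }
private:
    double v_;
    double r_ = 0.0;       // neighbor data, initial guess 0
    double change_ = 0.0;
};
```

The point of the design is that `Administrator` never needs to know what kind of solver sits behind the interface, which is what allows different serial codes to be mixed in one run.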

A framework of hybrid tsunami simulators
- Objective: a generic framework for creating hybrid parallel tsunami simulators, based on existing serial codes
- Starting point
  - C++ Boussinesq solver using FEM: class Boussinesq
  - Legacy F77 code using FDM: a set of subroutines
  - Direct parallelization of either code requires too much work
- A hybrid software framework
  - class SubdomainBQFEMSolver : public Boussinesq, public SubdomainSolver
  - class SubdomainBQFDMSolver : public SubdomainSolver (calling F77 subroutines internally)
  - class HybridBQSolver : public Administrator
- Implementation using Diffpack (www.diffpack.com)
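The way two very different serial codes end up behind one interface can be sketched as below. Only the class names `Boussinesq`, `SubdomainSolver`, `SubdomainBQFEMSolver` and `SubdomainBQFDMSolver` are taken from the slides. All member functions, the toy "physics" (a simple damping step) and the routine `legacy_fdm_step` are hypothetical stand-ins; a real F77 subroutine would instead be declared `extern "C"` with a compiler-dependent symbol name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Common interface required by the framework (members are assumptions).
class SubdomainSolver {
public:
    virtual ~SubdomainSolver() = default;
    virtual void solveLocal() = 0;
    virtual double value(std::size_t i) const = 0;
};

// Stand-in for the existing serial object-oriented FEM solver.
class Boussinesq {
public:
    explicit Boussinesq(std::size_t n) : eta_(n, 1.0) {}
    void advance(double dt) { for (double& e : eta_) e *= (1.0 - dt); }
protected:
    std::vector<double> eta_;   // surface elevation (toy data)
};

// Stand-in for a legacy procedural routine with Fortran-style
// pass-by-reference arguments.
void legacy_fdm_step(double* eta, const int* n, const double* dt) {
    assert(eta != nullptr && n != nullptr && dt != nullptr);
    for (int i = 0; i < *n; ++i) eta[i] *= (1.0 - *dt);
}

// FEM subdomain solver: inherits the serial solver and the interface.
class SubdomainBQFEMSolver : public Boussinesq, public SubdomainSolver {
public:
    explicit SubdomainBQFEMSolver(std::size_t n) : Boussinesq(n) {}
    void solveLocal() override { advance(0.5); }   // reuse the serial code
    double value(std::size_t i) const override { return eta_.at(i); }
};

// FDM subdomain solver: owns the data and calls the legacy routine.
class SubdomainBQFDMSolver : public SubdomainSolver {
public:
    explicit SubdomainBQFDMSolver(std::size_t n) : eta_(n, 1.0) {}
    void solveLocal() override {
        const int n = static_cast<int>(eta_.size());
        const double dt = 0.5;
        legacy_fdm_step(eta_.data(), &n, &dt);
    }
    double value(std::size_t i) const override { return eta_.at(i); }
private:
    std::vector<double> eta_;
};
```

Code that holds a collection of `std::unique_ptr<SubdomainSolver>` can then call `solveLocal()` on each entry without knowing which of the two codes does the work.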

Flexibility
- Free choice between SubdomainBQFEMSolver and SubdomainBQFDMSolver for each subdomain
- Adaptive mesh refinement allowed for FEM subdomains
- Neighboring subdomains may use non-matching local meshes
- Possible to incorporate other serial codes as subdomain solvers

3 Desirable simulation setup for future

Subdomain preparation
[Figure: subdomains p1 to p4, with legend entries "New finite element code" and "Finite difference legacy code"]

Coarse-mesh simulation of Indian Ocean Tsunami
[Figure: initial wave elevation after the earthquake]

Coarse-mesh simulation snapshot 1
[Figure: after 1.4 hours]

Coarse-mesh simulation snapshot 2
[Figure: after 2.8 hours]

4 Performance analysis done at HLRS

Motivation for my HPC Europa visit
- Vector-CPU based system at HLRS
- Extensive experience with performance analysis at HLRS
- Purpose: a fine-grained diagnosis of the tsunami simulator and our parallel PDE library

Observations so far (1)
- When the computational domain has no points on land, the parallel computation is well balanced
- On SX-8, the main work at each time step goes to the discretization, not to solving the resulting distributed linear system

Observations so far (2)
- When the computational domain has points on land, the parallel computation is not balanced
- Causes of imbalance:
  - Imbalance in the distributed discretization (some subdomains have many points on land)
  - Imbalance in the parallel DD solver (some subdomain problems are easier to solve)
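One simple way to put a number on such an imbalance is the ratio of the largest per-subdomain workload to the mean workload. The sketch below assumes, as a simplification not stated on the slides, that the discretization work of a subdomain is proportional to its number of wet (non-land) points. The function name `imbalanceFactor` and the input data are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Ratio max/mean of the per-subdomain workload; 1.0 means perfect balance.
// Assumption: workload is measured by the number of wet mesh points.
double imbalanceFactor(const std::vector<long>& wetPoints) {
    assert(!wetPoints.empty());
    const long maxWork = *std::max_element(wetPoints.begin(), wetPoints.end());
    const long total = std::accumulate(wetPoints.begin(), wetPoints.end(), 0L);
    assert(total > 0);
    const double mean = static_cast<double>(total) / wetPoints.size();
    return maxWork / mean;
}
```

For example, four subdomains with 100, 100, 100 and 20 wet points give a factor of 1.25: the three full subdomains determine the time per step while the fourth is idle for most of it.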

Observations so far (3)
- The SX compiler does not optimize the discretization phase very well
  - C++ code
  - Many levels of nested for-loops
  - Extensive use of virtual functions

Observations so far (4)
- Vectorization is enabled for some parts of the code
- Example: vector addition x = y + z

    #pragma cdir nodep
    for (int i=0; i<length; i++)
      tmp_x[i] = tmp_y[i] + tmp_z[i];

- Percentage of vectorized code is increased from 6-7% to 13-14% in the solution phase

Observations so far (5)
- Vectorization does not work for some parts of the code
- Example: sparse matrix-vector multiplication y = Ax
  - Compressed row storage
  - Indirect (and random) access of data entries

    #pragma cdir nodep
    for (i = 1; i <= nrows; i++) {
      rstart = ad.irow(i);
      rstop = ad.irow(i+1);
    #pragma cdir novector
      tmp = 0.0;
      for (r = rstart; r < rstop; r++)
        tmp += entries(r) * x(ad.jcol(r));
      y(i) += tmp;
    }

- Vectorization of the inner for-loop has to be turned off!
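For readers unfamiliar with compressed row storage, here is a self-contained sketch of the same kernel in portable C++. It is not the Diffpack code: it uses 0-based `std::vector` storage instead of the 1-based accessors on the slide, it overwrites y rather than accumulating into it, and the names `CrsMatrix` and `multiply` are hypothetical. The indirect access `x[jcol[r]]` in the inner loop is the pattern that is hard to vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed row storage: the entries of row i are stored at positions
// irow[i] .. irow[i+1]-1 of `entries`, with column indices in `jcol`.
struct CrsMatrix {
    std::vector<std::size_t> irow;   // size nrows + 1
    std::vector<std::size_t> jcol;   // size nnz
    std::vector<double> entries;     // size nnz
};

// Computes y = A x.
std::vector<double> multiply(const CrsMatrix& A, const std::vector<double>& x) {
    assert(!A.irow.empty());
    assert(A.jcol.size() == A.entries.size());
    const std::size_t nrows = A.irow.size() - 1;
    std::vector<double> y(nrows, 0.0);
    for (std::size_t i = 0; i < nrows; ++i) {
        double tmp = 0.0;
        // Short inner loop with indirect, data-dependent reads of x.
        for (std::size_t r = A.irow[i]; r < A.irow[i + 1]; ++r) {
            assert(A.jcol[r] < x.size());
            tmp += A.entries[r] * x[A.jcol[r]];
        }
        y[i] = tmp;
    }
    return y;
}
```

As a usage example, the 3 × 3 tridiagonal matrix with 2 on the diagonal and -1 off the diagonal is stored as `irow = {0, 2, 5, 7}`, `jcol = {0, 1, 0, 1, 2, 1, 2}` and `entries = {2, -1, -1, 2, -1, -1, 2}`. Multiplying it by x = (1, 2, 3) gives y = (0, 0, 4).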

Conclusions
- Schwarz methods: numerical foundation for the parallelization
- Object-oriented programming enables a hybrid framework of tsunami simulators
- Full flexibility in choosing subdomain solvers
  - different mathematical models
  - different discretizations
  - different local meshes
  - different codes
- Some parts of the tsunami simulator are improved due to analysis done at HLRS
- Challenge: performance and load balancing