Disclaimer This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM and OpenCFD trade marks.

Introductory OpenFOAM Course From 8th to 12th July, 2013 University of Genoa, DICCA Dipartimento di Ingegneria Civile, Chimica e Ambientale

Your Lecturer Joel GUERRERO joel.guerrero@unige.it guerrero@wolfdynamics.com

Acknowledgements These slides and the tutorials presented are based upon personal experience, the OpenFOAM source code, the OpenFOAM user guide, the OpenFOAM programmer's guide, and presentations from previous OpenFOAM training sessions and OpenFOAM workshops. We gratefully acknowledge the following OpenFOAM users for their consent to use their material: Hrvoje Jasak, Wikki Ltd. Håkan Nilsson, Department of Applied Mechanics, Chalmers University of Technology. Eric Paterson, Applied Research Laboratory Professor of Mechanical Engineering, Pennsylvania State University.

Today's lecture 1. Running in parallel 2. Running in a cluster using a job scheduler 3. Running with a GPU 4. Hands-on session

Today's lecture 1. Running in parallel 2. Running in a cluster using a job scheduler 3. Running with a GPU 4. Hands-on session

Running in parallel The method of parallel computing used by OpenFOAM is known as domain decomposition, in which the geometry and associated fields are broken into pieces and distributed among different processors.
- Shared memory architectures: workstations and portable computers.
- Distributed memory architectures: clusters and supercomputers.

Running in parallel Some facts about running OpenFOAM in parallel:
- Applications generally do not require parallel-specific coding. The parallel programming implementation is hidden from the user.
- Most of the applications and utilities run in parallel.
- If you write a new solver, it will run in parallel (most of the time).
- I have been able to run in parallel on up to 4096 processors.
- I have been able to run OpenFOAM using a single GPU and multiple GPUs.
- Do not ask me about scalability; that is problem/hardware specific.
- If you want to learn more about MPI and GPU programming, do not look in my direction.
- Did I forget to mention that OpenFOAM is free?

Running in parallel To run OpenFOAM in parallel you will need to:
1. Decompose the domain. To do so we use the decomposePar utility. You will also need a dictionary named decomposeParDict, which is located in the system directory of the case.
2. Distribute the jobs among the processors or computing nodes. To do so, OpenFOAM uses the open source OpenMPI implementation of the Message Passing Interface (MPI) standard. Using MPI, each processor runs a copy of the solver on a separate part of the decomposed domain.
3. Finally, reconstruct the solution to obtain the final result. This is done using the reconstructPar utility.
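
As an illustration, a minimal sketch of this workflow on a hypothetical case directory (the case folder name, solver, and core count below are placeholders, not part of the course material):

cd myCase                        # hypothetical case directory
decomposePar                     # decompose mesh and fields according to system/decomposeParDict
mpirun -np 4 icoFoam -parallel   # run the solver of your choice on 4 cores (icoFoam is just an example)
reconstructPar                   # reassemble the decomposed solution for post-processing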

Running in parallel Domain Decomposition The mesh and fields are decomposed using the decomposePar utility. The main goal is to break up the domain with minimal effort, but in such a way as to guarantee a fairly economical solution. The geometry and fields are broken up according to a set of parameters specified in a dictionary named decomposeParDict that must be located in the system directory of the case.

Running in parallel Domain Decomposition In the decomposeParDict file the user must set the number of domains into which the case should be decomposed. Usually it corresponds to the number of cores available for the calculation:

numberOfSubdomains NP;

where NP is the number of cores/processors. The user has a choice of seven methods of decomposition, specified by the method keyword (a minimal decomposeParDict sketch is given after the method descriptions below). On completion, a set of subdirectories will have been created, one for each processor. The directories are named processorN, where N = 0, 1, 2, 3, and so on. Each directory contains the decomposed fields.

Running in parallel Domain Decomposition
- simple: simple geometric decomposition in which the domain is split into pieces by direction.
- hierarchical: hierarchical geometric decomposition, the same as simple except that the user specifies the order in which the directional splits are done.
- manual: manual decomposition, where the user directly specifies the allocation of each cell to a particular processor.
- structured
- multiLevel

Running in parallel Domain Decomposition
- scotch: requires no geometric input from the user and attempts to minimize the number of processor boundaries (similar to metis).
- metis: requires no geometric input from the user and attempts to minimize the number of processor boundaries.
- parmetis: MPI-based version of METIS with extended functionality (not supported anymore).
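
As an illustration, a minimal decomposeParDict sketch for an 8-core decomposition (the values and the choice of scotch are placeholders; coefficient sub-dictionaries of methods that are not selected are typically ignored):

numberOfSubdomains 8;        // usually the number of cores available

method scotch;               // requires no geometric input from the user

hierarchicalCoeffs           // only read if method hierarchical is selected
{
    n       (2 2 2);         // number of splits in x, y, z (2 x 2 x 2 = 8 subdomains)
    delta   0.001;           // cell skew factor
    order   xyz;             // order in which the directional splits are done
}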

Running in parallel We will now run a case in parallel. From now on, follow me. Go to the $path_to_openfoamcourse/parallel_tut/rayleigh_taylor folder. In the terminal type:

cd $path_to_openfoamcourse/parallel_tut/rayleigh_taylor/c1
blockMesh
checkMesh
cp 0/alpha1.org 0/alpha1
funkySetFields -time 0 (if you do not have this tool, copy the file alpha1.init to alpha1; the file is located in the directory 0)
decomposePar
mpirun -np 8 interFoam -parallel

Here I am using 8 processors. In your case, use the maximum number of processors available in your laptop; for this you will need to modify the decomposeParDict dictionary located in the system folder. Specifically, you will need to modify the entry numberOfSubdomains and set its value to the number of processors you want to use.

Running in parallel We will now run a case in parallel. From now on, follow me. After the simulation is finished, type in the terminal:

paraFoam -builtin

to directly post-process the decomposed case. Alternatively, you can reconstruct the case and post-process it:

reconstructPar
paraFoam

to post-process the reconstructed case; reconstructPar will reconstruct all saved time steps. Both methods are valid, but the first one does not require reconstructing the case. More about post-processing in parallel in the next slides.

Running in parallel We will now run a case in parallel. From now on, follow me. Notice the syntax used to run OpenFOAM in parallel:

mpirun -np 8 solver_name -parallel

where mpirun is the wrapper script that launches the MPI run, -np sets the number of processors you want to use, solver_name is the OpenFOAM solver you want to use, and -parallel is a flag you must always use if you want to run in parallel.

Running in parallel Almost all OpenFOAM post-processing utilities can be run in parallel. The syntax is as follows:

mpirun -np 2 utility -parallel

(notice that I am using 2 processors). When post-processing cases that have been run in parallel, the user has three options:
1. Reconstruction of the mesh and field data to recreate the complete domain and fields:

reconstructPar
paraFoam

Running in parallel 2. Post-processing each decomposed domain individually:

paraFoam -case processor0

To load all processor folders, the user will need to manually create a file processorN.OpenFOAM (where N is the processor number) in each processor folder and then load each file into paraFoam.
3. Reading the decomposed case without reconstructing it. For this you will need to type:

paraFoam -builtin

(this will use a paraFoam reader that lets you read either a decomposed case or a reconstructed case)

Running in parallel Let us post-process each decomposed domain individually. From now on, follow me. Go to the $path_to_openfoamcourse/parallel_tut/yf17 folder. In the terminal type:

cd $path_to_openfoamcourse/parallel_tut/yf17

In this directory you will find two folders, namely hierarchical and scotch. Let us decompose both geometries and visualize the partitioning. Let us start with the scotch partitioning method. In the terminal type:

cd $path_to_openfoamcourse/parallel_tut/yf17/scotch
decomposePar

By the way, we do not need to run this case. We just need to decompose it to visualize the partitions.

Running in parallel Let us post-process each decomposed domain individually. From now on, follow me.

cd $path_to_openfoamcourse/parallel_tut/yf17/scotch
decomposePar
touch processor0.OpenFOAM
touch processor1.OpenFOAM
touch processor2.OpenFOAM
touch processor3.OpenFOAM
cp processor0.OpenFOAM processor0
cp processor1.OpenFOAM processor1
cp processor2.OpenFOAM processor2
cp processor3.OpenFOAM processor3
paraFoam

Running in parallel In paraFoam, open each *.OpenFOAM file within the processor* directories. By doing this, we are opening the mesh partition for each processor. Now choose a different color for each set you just opened, and visualize the partition for each processor. Now do the same with the directory hierarchical, and compare both partitioning methods. If you partitioned the mesh with many processors, creating the *.OpenFOAM files manually can be extremely time consuming; to do this automatically you can write a small shell script (a sketch is given below). Also, changing the color for each set in paraFoam can be extremely time consuming; to automate this you can write a Python script.
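
As an illustration, a minimal shell sketch (assumed to be run from the case directory, after decomposePar has created the processor* folders) that creates the *.OpenFOAM files automatically:

#!/bin/bash
# Create an empty processorN.OpenFOAM file inside every processor directory,
# so that each mesh partition can be opened separately in paraFoam.
for d in processor[0-9]*
do
    touch "$d/$d.OpenFOAM"
done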

Running in parallel Decomposing big meshes One final word: the utility decomposePar does not run in parallel, so it is not possible to distribute the mesh among different computing nodes to do the partitioning in parallel. If you need to partition big meshes, you will need a computing node with enough memory to handle the mesh. I have been able to decompose meshes with up to 300,000,000 elements, but I used a computing node with 512 gigs of memory. For example, on a computing node with 16 gigs of memory, it is not possible to decompose a mesh with 30,000,000 elements; you will need to use a computing node with at least 32 gigs of memory. The same applies to the utility reconstructPar.

Today's lecture 1. Running in parallel 2. Running in a cluster using a job scheduler 3. Running with a GPU 4. Hands-on session

Running in a cluster using a job scheduler Running OpenFOAM in a cluster is similar to running on a normal workstation with shared memory. The only difference is that you will need to launch your job using a job scheduler. Common job schedulers are:
- Terascale Open-Source Resource and Queue Manager (TORQUE).
- Simple Linux Utility for Resource Management (SLURM).
- Portable Batch System (PBS).
- Sun Grid Engine (SGE).
- Maui Cluster Scheduler.
- BlueGene LoadLeveler (LL).
Ask your system administrator which job scheduler is installed on your system. Hereafter I will assume that you will run using PBS.

Running in a cluster using a job scheduler To launch a job in a cluster with PBS, you will need to write a small script file in which you tell the job scheduler the resources you want to use and what you want to do.

#!/bin/bash
#
# Simple PBS batch script that reserves 16 nodes and runs an
# MPI program on 128 processors (8 processors on each node).
# The walltime is 24 hours!
#
#PBS -N openfoam_simulation                //name of the job
#PBS -l nodes=16,walltime=24:00:00         //requested nodes and max execution time
#PBS -m abe -M joel.guerrero@unige.it      //send an email as soon as the job is launched or terminated

cd PATH_TO_DIRECTORY                       //go to this directory
#decomposePar                              //decompose the case, this line is commented
mpirun -np 128 pimpleFoam -parallel > log  //run OpenFOAM in parallel

The annotations after the double slashes (//) are not PBS comments; they are comments inserted in this slide. PBS comments use the hash character (#).

Running in a cluster using a job scheduler To launch your job you need to use the qsub command (part of the PBS job scheduler). The command will send your job to the queue:

qsub script_name

Remember, running in a cluster is no different from running on your workstation or portable computer. The only difference is that you need to schedule your jobs. Depending on the system's current demand for resources, the resources you request, and your job priority, you can sometimes be in the queue for hours, even days, so be patient and wait for your turn. Remember to always double-check your scripts. Finally, remember to plan how you will use the available resources.

Running in a cluster using a job scheduler For example, suppose each computing node has 8 gigs of memory and 8 cores available. You will need to distribute the workload so as not to exceed the maximum resources available per computing node. So if you are running a simulation that requires 32 gigs of memory, the following options are valid (a sample PBS request for the first option is sketched below):
- Use 4 computing nodes and ask for 32 cores. Each node will use 8 gigs of memory and 8 cores.
- Use 8 computing nodes and ask for 32 cores. Each node will use 4 gigs of memory and 4 cores.
- Use 8 computing nodes and ask for 64 cores. Each node will use 4 gigs of memory and 8 cores.
But the following options are not valid:
- Use 2 computing nodes. Each node would need 16 gigs of memory.
- Use 16 computing nodes and ask for 256 cores. The maximum number of cores for this job is 128.
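
As an illustration, a sketch of a PBS request for the first valid option above (4 nodes with 8 cores each, 32 cores in total). The exact directive syntax, in particular whether the ppn resource is available, depends on how your cluster's PBS/TORQUE installation is configured, so treat this as an assumption to check with your system administrator. As in the previous script, the // annotations are slide comments, not part of the script:

#!/bin/bash
#PBS -N openfoam_simulation_32cores        //name of the job
#PBS -l nodes=4:ppn=8,walltime=24:00:00    //4 nodes, 8 cores per node = 32 cores in total

cd PATH_TO_DIRECTORY                       //go to the case directory
mpirun -np 32 solver_name -parallel > log  //run the solver on all 32 requested cores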

Today's lecture 1. Running in parallel 2. Running in a cluster using a job scheduler 3. Running with a GPU 4. Hands-on session

Running with a GPU The official release of OpenFOAM (version 2.2.0) does not support GPU computing. To use your GPU with OpenFOAM, you will need to install the cufflink library. There are a few more options available, but I like this one and it is free. To test OpenFOAM GPU capabilities with cufflink, you will need to install the latest extend version (OpenFOAM-1.6-ext).

Running with a GPU You can download cufflink from the following link (as of 15/May/2013): http://code.google.com/p/cufflink-library/ Additionally, you will need to install the following libraries:
- Cusp: a library for sparse linear algebra and graph computations on CUDA. http://code.google.com/p/cusp-library/
- Thrust: a parallel algorithms library that resembles the C++ Standard Template Library (STL). http://code.google.com/p/thrust/
- The latest NVIDIA CUDA library. https://developer.nvidia.com/cuda-downloads

Running with a GPU What is cufflink? cufflink stands for Cuda For FOAM Link. cufflink is an open source library for linking numerical methods based on NVIDIA's Compute Unified Device Architecture (CUDA) C/C++ programming language and OpenFOAM. Currently, the library utilizes the sparse linear solvers of Cusp and methods from Thrust to solve the linear system Ax = b derived from OpenFOAM's lduMatrix class and return the solution vector. cufflink is designed to utilize the coarse-grained parallelism of OpenFOAM (via domain decomposition) to allow multi-GPU parallelism at the level of the linear system solver. Cufflink features:
- Currently only supports the OpenFOAM-extend fork of the OpenFOAM code.
- Single GPU support.
- Multi-GPU support via OpenFOAM coarse-grained parallelism achieved through domain decomposition (experimental).

Running with a GPU Cufflink features (continued):
- A conjugate gradient solver based on Cusp for symmetric matrices (e.g. pressure), with a choice of: diagonal preconditioner, sparse approximate inverse preconditioner, or algebraic multigrid (AMG) based on smoothed aggregation preconditioner.
- A bi-conjugate gradient stabilized solver based on Cusp for asymmetric matrices (e.g. velocity, epsilon, k), with a choice of: diagonal preconditioner or sparse approximate inverse preconditioner.
- Single precision (sm_10), double precision (sm_13), and Fermi architecture (sm_20) supported. The double precision solvers are recommended over single precision due to known errors encountered in the smoothed aggregation preconditioner in single precision.

Running with a GPU Running cufflink in the OpenFOAM-extend project Once the cufflink library has been compiled, in order to use it in OpenFOAM one needs to include the line

libs ("libcufflink.so");

in the controlDict dictionary. In addition, a solver must be chosen in the fvSolution dictionary:

p
{
    solver           cufflink_CG;
    preconditioner   none;
    tolerance        1e-10;
    //relTol         1e-08;
    maxIter          10000;
    storage          1;    //COO=0 CSR=1 DIA=2 ELL=3 HYB=4, all other numbers use default CSR
    gpusPerMachine   2;    //for the multi-GPU version on a machine with 2 GPUs per machine node
    AinvType         ;
    dropTolerance    ;
    linStrategy      ;
}

This particular setup uses an un-preconditioned conjugate gradient solver on a single GPU; compressed sparse row (CSR) matrix storage; an absolute tolerance of 1e-10 and a relative tolerance of 0; and a maximum of 10000 inner iterations.

Additional tutorials In the folder $path_to_openfoamcourse/parallel_tut you will find many tutorials; try to go through each one to understand how to set up a parallel case in OpenFOAM.

Thank you for your attention

Today's lecture 1. Running in parallel 2. Running in a cluster using a job scheduler 3. Running with a GPU 4. Hands-on session

Hands-on session In the course's directory ($path_to_openfoamcourse) you will find many tutorials (which are different from those that come with the OpenFOAM installation); let us try to go through each one to understand how to use OpenFOAM and become functional with it. If you have a case of your own, let me know and I will do my best to help you set up your case. But remember, the physics is yours.