Lab: Scientific Computing Tsunami-Simulation

Lab: Scientific Computing Tsunami-Simulation
Session 4: Optimization and OMP
Sebastian Rettenberger, Michael Bader
23.11.15

Linux-Cluster 1999-2015
(Timeline figure: https://www.lrz.de/services/compute/linux-cluster/lx_timeline/)

Hardware (aggregate values per segment)

Segment    #Nodes   #Cores   Peak performance (TFlop/s)   Memory (TByte)
CoolMUC2      252     7056                          260             16.1
MPP           178     2848                         22.7              2.8
UV              2     2080                         20.0              6.0

Intel Haswell EP
(Architecture figure: http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4)

Rules
1. Do not use login nodes to run your application.
2. Do not start more than one interactive job.
3. Do not put large input/output files in your home directory.

Some Basics
X-forwarding: ssh -Y user@lxlogin5.lrz.de
Module system: module load, list, unload, info, avail
Compute, interactive: salloc --ntasks=1 --cpus-per-task=28 --time=00:15:00
                      srun ./executable ARGS...
Compute, batch: squeue, sbatch, sinfo, sview
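
For batch jobs, the resources requested interactively above go into a job script submitted via sbatch. A minimal sketch (not from the slides; the job name and the use of SLURM_CPUS_PER_TASK are assumptions to adapt to the actual cluster setup):

#!/bin/bash
#SBATCH --job-name=tsunami         # name shown in squeue (hypothetical)
#SBATCH --ntasks=1                 # one process...
#SBATCH --cpus-per-task=28         # ...with 28 cores, as in the salloc example
#SBATCH --time=00:15:00            # wall-clock limit

# let OpenMP use exactly the allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./executable ARGS...

Submit with sbatch job.sh and monitor with squeue.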

VTune

(Screenshots: starting a VTune analysis and viewing the results.)

What is OpenMP?
"OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs."
(http://openmp.org/openmp-faq.html#whatis)

Shared-Memory Parallelism
Transparent programming view: all cores access the same memory.
In hardware: be aware of the memory layout, e.g. NUMA; first-touch policy.
(Figures: Sandy Bridge CPU, http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandybridge-review/1; https://computing.llnl.gov/tutorials/openmp/; Knights Corner and Knights Landing, http://www.zdnet.com/sc13-intelreveals-knights-landing-highperformance-cpu-7000023393/)

Fork-Join Model
(Figure: https://computing.llnl.gov/tutorials/openmp/#programmingmodel)

A Simple Loop

#include <omp.h>

// some initialization

#pragma omp parallel for
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        // do some work
    }
}

(OpenMP - Parallel programming on shared memory systems: https://www.lrz.de/services/software/parallel/openmp/)

OMP: An Overview
Compiler directives: #pragma omp ... [clause[[,] clause]...]
Work sharing: #pragma omp for, ..., and many more
Synchronization: #pragma omp barrier, and many more
Data scope clauses: shared, private, firstprivate, reduction, ...
Library routines: omp_get_num_threads(), omp_get_thread_num(), ...

Compilation and Execution
Compile with the additional compiler flag -openmp (-fopenmp in case of gcc):

icc -openmp -o my_algorithm alg.c
export OMP_NUM_THREADS=<number of threads>
./my_algorithm

Parallel Region

#pragma omp parallel [clause[[,] clause]...]
    structured block

Code within a parallel region is executed by all threads:

#include <omp.h>

// more initialization ...
#pragma omp parallel private(size, rank)
{
    size = omp_get_num_threads();
    rank = omp_get_thread_num();
    printf("Hello World! (Thread %d of %d)\n", rank, size);
}

Work Sharing: for

#pragma omp for [clause[[,] clause]...]
    for-loop

Used within a parallel region, directly in front of a for-loop; the iterations are scheduled across the different threads. The schedule clause determines how iterations are mapped to threads:

schedule(static[, chunksize]): default chunksize is #iterations divided by #threads
schedule(dynamic[, chunksize]): default chunksize is 1
schedule(runtime): scheduling is given by the environment variable OMP_SCHEDULE

There is an implicit synchronization at the end of the for-loop (can be disabled with the nowait clause). Shortcut possible: #pragma omp parallel for

(https://computing.llnl.gov/tutorials/openmp/; Intel, Load Balance and Parallel Performance, https://software.intel.com/en-us/articles/load-balance-and-parallel-performance)
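
A minimal sketch (not from the slides; do_work() and its cost profile are made-up assumptions) showing why the schedule clause matters: with an iteration cost that grows with i, schedule(dynamic) balances the load better than the static default.

#include <stdio.h>

// hypothetical work function: cost grows with i, so the static default
// would assign far more work to the threads holding the last chunks
double do_work(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 1000; k++)
        x += 1.0 / (k + 1);
    return x;
}

int main() {
    const int n = 2000;
    double sum = 0.0;

    // dynamic schedule: idle threads grab the next chunk of 16 iterations,
    // compensating for the imbalanced iteration cost
    #pragma omp parallel for schedule(dynamic, 16) reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += do_work(i);

    printf("sum = %f\n", sum);
    return 0;
}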

Work Sharing: task

#pragma omp task [clause[[,] clause]...]
    structured block

where clause can be:
final(expression): if expression is true, the task is executed sequentially; no recursive task generation
untied: the execution of a task is not tied to one single thread
shared, firstprivate, private, [...]

#pragma omp taskwait: waits for the completion of child tasks

(Intel, Load Balance and Parallel Performance, https://software.intel.com/en-us/articles/load-balance-and-parallel-performance)
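
A classic (if inefficient) illustration of tasks is a recursive Fibonacci computation. This sketch is not from the slides; it shows task, taskwait, and the single construct from the next slide.

#include <stdio.h>

long fib(int n) {
    long a, b;
    if (n < 2)
        return n;

    // spawn the two recursive calls as tasks; shared(a) / shared(b)
    // makes the results visible after the taskwait
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);

    // wait for both child tasks before combining their results
    #pragma omp taskwait
    return a + b;
}

int main() {
    long result;
    // one thread creates the initial task; the whole team executes tasks
    #pragma omp parallel
    #pragma omp single
    result = fib(25);

    printf("fib(25) = %ld\n", result);
    return 0;
}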

Work Sharing: single

#pragma omp single [clause[[,] clause]...]
    structured block

Executed by only one (arbitrary) thread; implicit synchronization at the end.

#pragma omp master
    structured block

Executed only by the master thread; NO synchronization at the end.

(https://computing.llnl.gov/tutorials/openmp/)
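
A small sketch (not from the slides) contrasting single and master:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // one arbitrary thread executes this; the implicit barrier at the
        // end of single makes the other threads wait for it
        #pragma omp single
        printf("single: done by thread %d\n", omp_get_thread_num());

        // only thread 0 executes this; NO barrier at the end, so the
        // other threads run past it immediately
        #pragma omp master
        printf("master: done by thread %d\n", omp_get_thread_num());
    }
    return 0;
}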

Synchronization

#pragma omp barrier: blocks execution until all threads have reached the barrier

#pragma omp critical
    structured block

Only one thread at a time can execute the structured block encapsulated by critical.
Warning: use carefully; can definitely kill performance!
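
A small sketch (not from the slides) of critical protecting a shared update; for a plain sum, the reduction clause from the next slide is usually the better choice, since critical serializes the update.

#include <stdio.h>

int main() {
    const int n = 1000;
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double r = i * 0.5;  // stand-in for some per-iteration result

        // without critical, concurrent updates to sum would race;
        // with it, only one thread at a time performs the update
        #pragma omp critical
        sum += r;
    }

    printf("sum = %f\n", sum);
    return 0;
}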

Reduction

reduction(operator: list): executes a reduction of the variables in list using operator
Available operators: +, *, -, &&, ||, ...; since OpenMP 3.1 also min, max

#pragma omp parallel for private(r), reduction(+: sum)
for (i = 0; i < n; i++) {
    r = compute_r(i);
    sum = sum + r;
}

Scopes

private(list): declares the variables in list as private for each thread (no copy!)
shared(list): variables in list are used by all threads (race conditions are possible!); write accesses have to be handled by the programmer
firstprivate(list): private variables, initialized with the last valid value before the parallel region
lastprivate(list): after the parallel region, the serial part sees the value from the last iteration/section

The default data scope is shared, but exceptions exist; be precise!
Local variables are always private.
Loop variables of for-loops are private.
Note: changes between OpenMP 2 and OpenMP 3!
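
A small sketch (not from the slides) illustrating the difference between firstprivate and lastprivate:

#include <stdio.h>

int main() {
    int offset = 10;   // copied into each thread by firstprivate
    int last = -1;     // receives the value of the final iteration

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        // each thread starts with its own copy of offset == 10
        last = i + offset;
    }

    // last now holds the value from iteration i == 99, i.e. 109
    printf("last = %d\n", last);
    return 0;
}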

Fill the Gaps

k = compute_k();
#pragma omp parallel for private(?), shared(?), firstprivate(?) ...
for (i = 0; i < n; i++) {
    k = compute_my_k(i, k);
    r = compute_r(i, k, h);
    sum = sum + r;
}
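
One possible answer, as an assumption about what the exercise intends (the slide leaves the gaps open): r and k carry per-thread state, h and n are only read, and sum needs a reduction. Note that if compute_my_k really depends on the k from the previous iteration, the loop carries a dependency and cannot be parallelized this way at all.

k = compute_k();
#pragma omp parallel for private(r), firstprivate(k), shared(h, n), reduction(+: sum)
for (i = 0; i < n; i++) {
    k = compute_my_k(i, k);   // each thread updates its own copy of k
    r = compute_r(i, k, h);
    sum = sum + r;            // combined across threads by the reduction
}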

NUMA

Non-uniform memory access (NUMA): not all cores access all memory in the same amount of time.
Pinning: use close-to-core memory as far as possible.
Linux first-touch policy: it is not the malloc that counts, but the first (write) access.
numactl has some options.

(Georg Hager's Blog, http://blogs.fau.de/hager/)
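
A common pattern (a sketch, not from the slides) that exploits the first-touch policy: initialize data with the same parallel loop structure that later processes it, so each memory page ends up on the NUMA node of the thread that uses it.

#include <stdlib.h>
#include <stdio.h>

int main() {
    const long n = 1L << 24;
    // malloc only reserves address space; pages are placed on first write
    double *a = malloc(n * sizeof(double));

    // first touch in parallel: each page lands on the NUMA node of the
    // thread that writes it first (assuming threads are pinned)
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    // later compute loops with the same static schedule then access
    // mostly node-local memory
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = a[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}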