Scientific Computing


Lecture on Scientific Computing, Lecture 20
Dr. Kersten Schmidt, Technische Universität Berlin, Institut für Mathematik, Wintersemester 2014/2015

Syllabus
- Linear regression, Fast Fourier transform
- Modelling by partial differential equations (PDEs): Maxwell, Helmholtz, Poisson, linear elasticity, Navier-Stokes equation; boundary value problem, eigenvalue problem; boundary conditions (Dirichlet, Neumann, Robin); handling of infinite domains (wave-guide, homogeneous exterior: DtN, PML); boundary integral equations
- Computer-aided design (CAD), mesh generators
- Space discretisation of PDEs: finite difference method, finite element method, discontinuous Galerkin finite element method
- Solvers: linear solvers (direct, iterative), preconditioners; nonlinear solvers (Newton-Raphson iteration); eigenvalue solvers
- Parallelisation: computer hardware (SIMD, MIMD: shared/distributed memory); programming in parallel: OpenMP, MPI

Shared memory computer

A process is an instance of a program that is executing more or less autonomously on a physical processor. A thread is a sequence of instructions. Several threads may share resources (CPU, memory). The threads of a process share its instructions and its variables.

Shared-memory parallel programming model

Static process generation: the number of threads is fixed a priori by the programmer.

Dynamic process generation: fork/join parallelism. There is a master process that forks slave processes/threads as needed in a parallel region. At the end of the parallel region the slave processes may be killed. The number of threads may vary from one parallel region to another.

OpenMP
- is an application programming interface that provides a parallel programming model for shared-memory and distributed shared-memory multiprocessors,
- extends programming languages (C/C++ and Fortran) by a set of compiler directives (called pragmas in C) to express shared-memory parallelism,
- provides runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. With gcc/g++, compile with the flag -fopenmp, e.g.

    gcc -fopenmp example.c -o example

OpenMP is becoming the de facto standard for parallelising applications for shared-memory multiprocessors. OpenMP is independent of the underlying hardware and operating system.
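Not part of the original slides, but as a minimal sketch of the runtime library routines and environment variables mentioned above (the file name query.c and the printed labels are made up for illustration): omp_get_num_procs and omp_get_max_threads query the execution environment, omp_get_wtime provides a wall-clock timer, and the environment variable OMP_NUM_THREADS controls how many threads the next parallel region uses.

    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* query the execution environment via runtime library routines */
        printf("available processors: %d\n", omp_get_num_procs());
        printf("maximum threads     : %d\n", omp_get_max_threads());

        /* time an (empty) parallel region with the wall-clock timer */
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            /* a team of threads is forked and joined here */
        }
        printf("fork/join time      : %e s\n", omp_get_wtime() - t0);
        return 0;
    }

Compile, e.g., with gcc -fopenmp query.c -o query and run with OMP_NUM_THREADS=4 ./query.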

First OpenMP demo program in C: omp1st.c

    #include <stdio.h>
    #include <omp.h>

    int main() {
        #pragma omp parallel
        {
            int myid = omp_get_thread_num();
            int num  = omp_get_num_threads();
            printf("Thread %d from %d is ready.\n", myid, num);
        }
        return 0;
    }

Calling

    % gcc -fopenmp omp1st.c -o omp1st
    % export OMP_NUM_THREADS=4
    % ./omp1st
    Thread 2 from 4 is ready.
    Thread 3 from 4 is ready.
    Thread 0 from 4 is ready.
    Thread 1 from 4 is ready.

Another OpenMP demo program in C: omp2nd.c

    #include <stdio.h>
    #include <omp.h>

    int main() {
        omp_set_num_threads(2);
        #pragma omp parallel
        {
            int myid = omp_get_thread_num();
            int num  = omp_get_num_threads();
            printf("Thread %d from %d is ready.\n", myid, num);
        }
        printf("\n");

        omp_set_num_threads(3);
        #pragma omp parallel
        {
            int myid = omp_get_thread_num();
            int num  = omp_get_num_threads();
            printf("Thread %d from %d is ready.\n", myid, num);
        }
        return 0;
    }

Calling

    % gcc -fopenmp omp2nd.c -o omp2nd
    % ./omp2nd
    Thread 0 from 2 is ready.
    Thread 1 from 2 is ready.

    Thread 2 from 3 is ready.
    Thread 0 from 3 is ready.
    Thread 1 from 3 is ready.

How is OpenMP typically used? To parallelise loops: find your most time-consuming loops and split them up between threads.

OpenMP parallel control structures that fork new threads:
- The parallel directive is used to create multiple threads that execute concurrently. It applies to a structured block, i.e. in C/C++ the block between { and }.
- The for directive is used to express loop-level parallelism.

An example, SAXPY: y ← αx + y

1. The sequential program

    for (i = 0; i < N; ++i) {
        y[i] = alpha * x[i] + y[i];
    }

2. OpenMP parallel region

    #pragma omp parallel
    {
        int id, i, num, istart, iend;
        id = omp_get_thread_num();
        num = omp_get_num_threads();
        istart = id * N / num;
        iend = (id + 1) * N / num;
        for (i = istart; i < iend; ++i) {
            y[i] = alpha * x[i] + y[i];
        }
    }

3. OpenMP parallel region combined with a for directive

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; ++i) {
        y[i] = alpha * x[i] + y[i];
    }

or, in short,

    #pragma omp parallel for
    for (i = 0; i < N; ++i) {
        y[i] = alpha * x[i] + y[i];
    }

OpenMP communication

The threads share all global variables, variables on the stack, and dynamically allocated variables (on the heap), unless variables are marked as private. For private variables the (possibly different) values of each thread are stored at multiple locations.

OpenMP synchronization

Threads can easily communicate with each other through reads and writes of shared variables. However, this has to be done at the right time. Example: two threads should each add some value to a shared variable. Two forms of process synchronization or coordination are
- mutual exclusion: use the critical directive to indicate that threads have to read and write a variable one after the other.
- event synchronization: use the barrier directive to define a point where each thread waits for all other threads to arrive.
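The two directives are illustrated by the following minimal sketch (not from the slides; the shared counter and the printed message are made up): every thread adds its thread number to a shared counter inside a critical section (mutual exclusion), and a barrier makes all threads wait until every addition has been performed before thread 0 prints the result (event synchronization).

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int counter = 0;                 /* shared by all threads */

        #pragma omp parallel
        {
            int myid = omp_get_thread_num();

            /* mutual exclusion: only one thread at a time updates the counter */
            #pragma omp critical
            {
                counter = counter + myid;
            }

            /* event synchronization: wait until every thread has passed the critical section */
            #pragma omp barrier

            if (myid == 0)
                printf("Sum of all thread ids: %d\n", counter);
        }
        return 0;
    }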

Example: dot product, dotproduct_i.c

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2

    int main() {
        long N = 100;
        long i;
        double sum = 0.0, x[N], y[N];

        omp_set_num_threads(NUM_THREADS);

        #pragma omp parallel for
        for (i = 0; i < N; ++i) {
            y[i] = 1.0;
            x[i] = (double)i;
        }

        #pragma omp parallel for
        for (i = 0; i < N; ++i)
            sum = sum + x[i]*y[i];

        printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
        return 0;
    }

Calling

    % gcc -fopenmp dotproduct_i.c -o dotproduct_i
    % ./dotproduct_i
            3240?= 4950
    % ./dotproduct_i
            4950?= 4950
    % ./dotproduct_i
            2775?= 4950
    % ./dotproduct_i
            4950?= 4950
    % ./dotproduct_i
            4950?= 4950
    % ./dotproduct_i
            4950?= 4950
    % ./dotproduct_i
            4950?= 4950

What is the problem here? Why do we obtain different results if we run the code several times? Unintended sharing of variables (here sum) can lead to race conditions: the results may change as the threads are scheduled differently.

Example: dot product, dotproduct_ii.c. Right result?

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2

    int main() {
        long N = 500000;
        long i;
        double sum = 0.0, x[N], y[N];

        omp_set_num_threads(NUM_THREADS);

        #pragma omp parallel for
        for (i = 0; i < N; ++i) {
            y[i] = 1.0;
            x[i] = (double)i;
        }

        #pragma omp parallel for
        for (i = 0; i < N; ++i) {
            #pragma omp critical
            {
                sum = sum + x[i]*y[i];
            }
        }

        printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
        return 0;
    }

The result is now correct, since the critical directive lets only one thread at a time update sum. However, every single loop iteration is now a point of synchronization, which makes the parallel loop slow.

Example: dot product, dotproduct_iii.c

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2

    int main() {
        long N = 500000, i;
        double sum = 0.0, local_sum, x[N], y[N];

        omp_set_num_threads(NUM_THREADS);

        #pragma omp parallel for
        for (i = 0; i < N; ++i) {
            y[i] = 1.0;
            x[i] = (double)i;
        }

        #pragma omp parallel private(local_sum)
        {
            local_sum = 0.0;
            #pragma omp for
            for (i = 0; i < N; ++i) {
                local_sum = local_sum + x[i]*y[i];
            }
            #pragma omp critical
            {
                sum = sum + local_sum;
            }
        }

        printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
        return 0;
    }

This is already much better: there are now only NUM_THREADS points of synchronization, since each thread accumulates into its private local_sum and only the final additions to sum are protected by the critical directive.

Example: dot product, dotproduct_iv.c

The most elegant way to deal with the above problem is to give the variable sum the reduction attribute. Here, the compiler implicitly creates a local copy of sum for each thread. At the end of the loop, the local copies are combined by a reduction operation in an optimal way. Remark: reduction operations are possible with the operators +, -, *, &, |, ^, && and ||.

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2

    int main() {
        long N = 500000, i;
        double sum = 0.0, x[N], y[N];

        omp_set_num_threads(NUM_THREADS);

        #pragma omp parallel for
        for (i = 0; i < N; ++i) {
            y[i] = 1.0;
            x[i] = (double)i;
        }

        #pragma omp parallel for reduction(+ : sum)
        for (i = 0; i < N; ++i) {
            sum = sum + x[i]*y[i];
        }

        printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
        return 0;
    }
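As a further illustration of the remark on reduction operators (not from the slides; the vector and the check are made up): a logical-and reduction checks in parallel whether all entries of a vector are positive.

    #include <stdio.h>
    #include <omp.h>

    int main() {
        long N = 1000, i;
        double x[N];
        int all_positive = 1;               /* 1 = true */

        for (i = 0; i < N; ++i)
            x[i] = (double)(i + 1);         /* all entries > 0 */

        /* each thread keeps a private copy of all_positive (initialised to 1,
           the neutral element of &&); the copies are combined with && at the
           end of the loop */
        #pragma omp parallel for reduction(&& : all_positive)
        for (i = 0; i < N; ++i) {
            all_positive = all_positive && (x[i] > 0.0);
        }

        printf("all entries positive: %d\n", all_positive);
        return 0;
    }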

Data dependence

Whenever one statement in a program reads or writes a memory location and another statement reads or writes the same memory location, and at least one of the two statements writes the location, then there is a data dependence on that memory location between the two statements. The data dependence is loop-carried if the two statements involved occur in different iterations of a loop.

Removing data dependencies in algorithms, examples:
- copying values into auxiliary variables (arrays), see the sketch below,
- changing the loops such that the outer loop is the longest one and the inner loop is protected with the critical directive.
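A minimal sketch of the first technique (not from the slides; the shift example is made up for illustration): in the sequential loop x[i] = x[i+1] every iteration reads a value that the following iteration overwrites, a loop-carried dependence. Copying the values into an auxiliary array first removes the dependence, and both loops can then be parallelised with the for directive.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main() {
        double x[N], x_old[N];
        long i;

        for (i = 0; i < N; ++i)
            x[i] = (double)i;

        /* Sequential shift: iteration i reads x[i+1] before iteration i+1
           overwrites it, a loop-carried dependence, so this loop must not be
           parallelised as it stands:
           for (i = 0; i < N-1; ++i) x[i] = x[i+1];                          */

        /* Remove the dependence by copying the values into an auxiliary array. */
        #pragma omp parallel for
        for (i = 0; i < N; ++i)
            x_old[i] = x[i];

        #pragma omp parallel for
        for (i = 0; i < N - 1; ++i)
            x[i] = x_old[i + 1];           /* reads only the auxiliary copy */

        printf("x[0] = %f, x[N-2] = %f\n", x[0], x[N-2]);   /* 1 and N-1 */
        return 0;
    }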