Parallel Computing. Lecture 13: OpenMP - I


CSCI-UA.0480-003 Parallel Computing
Lecture 13: OpenMP - I
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com

Small and Easy Motivation

#include <stdio.h>
#include <stdlib.h>

int main() {
    // Do this part in parallel
    printf("Hello, World!\n");
    return 0;
}

Small and Easy Motivation

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }

    return 0;
}

Simple! OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence. OpenMP is a small API that hides cumbersome threading calls behind simpler directives.

Interesting Insights About OpenMP
These insights come from HPC folks, though!
Source: www.sdsc.edu/~allans/cs260/lectures/openmp.ppt

OpenMP: an API for shared-memory parallel programming. Designed for systems in which each thread can potentially have access to all available memory. The system is viewed as a collection of cores or CPUs, all of which have access to main memory: a shared-memory architecture.

Figure: a shared memory system

Pragmas: special preprocessor instructions. Typically added to a system to allow behaviors that aren't part of the basic C specification. Compilers that don't support the pragmas ignore them. #pragma

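The code slide for this example is not reproduced in the transcription. Below is a minimal sketch of an omp_hello.c program consistent with the compile command and sample output that follow; the thread count is assumed to come from the command line, and the Hello helper name is illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);  /* each thread in the team runs this */

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);  /* e.g. ./omp_hello 4 */

    #pragma omp parallel num_threads(thread_count)
    Hello();

    return 0;
}

void Hello(void) {
    int my_rank = omp_get_thread_num();        /* this thread's ID in the team   */
    int thread_count = omp_get_num_threads();  /* number of threads in the team  */
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}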

gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
./omp_hello 4

Running with 4 threads, possible outcomes:

Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4

Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 0 of 4
Hello from thread 3 of 4

Hello from thread 3 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 0 of 4

OpenMP pragmas: # pragma omp parallel is the most basic parallel directive. The number of threads that run the following structured block of code is determined by the run-time system.

Figure: a process forking and joining two threads

Clause: text that modifies a directive. The num_threads clause can be added to a parallel directive. It allows the programmer to specify the number of threads that should execute the following block:

# pragma omp parallel num_threads(thread_count)

Of note: there may be system-defined limitations on the number of threads that a program can start. The OpenMP standard doesn't guarantee that this will actually start thread_count threads. Unless we're trying to start a lot of threads, we will almost always get the desired number of threads.

Some terminology: in OpenMP parlance, the collection of threads executing the parallel block (the original thread and the new threads) is called a team, the original thread is called the master, and the additional threads are called slaves.

Again: the trapezoidal rule (figure)
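The figure itself is not reproduced; for reference, the standard composite trapezoidal rule the example builds on, using n trapezoids of width h:

\[
\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(x_0)}{2} + f(x_1) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2}\right],
\qquad h = \frac{b-a}{n},\quad x_i = a + ih.
\]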

Serial algorithm (code slide)
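The serial code slide is not reproduced in the transcription. A minimal sketch of what such a routine looks like, in the spirit of the textbook; the function name Trap and the example integrand f are assumptions:

double f(double x) { return x * x; }  /* example integrand (assumption) */

/* Approximate the integral of f over [a, b] using n trapezoids. */
double Trap(double a, double b, int n) {
    double h = (b - a) / n;               /* width of each trapezoid */
    double approx = (f(a) + f(b)) / 2.0;  /* endpoints count half    */
    for (int i = 1; i <= n - 1; i++)
        approx += f(a + i * h);           /* interior points         */
    return h * approx;
}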

A First OpenMP Version
1) We identified two types of tasks:
   a) computation of the areas of individual trapezoids, and
   b) adding the areas of trapezoids.
2) There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b.

A First OpenMP Version
3) We assumed that there would be many more trapezoids than cores, so we aggregated tasks by assigning a contiguous block of trapezoids to each thread.

Figure: assignment of trapezoids to threads

Unpredictable results when two (or more) threads attempt to simultaneously execute:

global_result += my_result;

Mutual exclusion:

# pragma omp critical
global_result += my_result;

Only one thread can execute the following structured block at a time.

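The two code slides above are not reproduced in the transcription. Below is a minimal sketch of the first OpenMP trapezoid version in the spirit of the textbook: each thread computes the estimate for its own block of trapezoids, and a critical section protects the shared sum. The names, the integrand, and the problem size are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double f(double x) { return x * x; }  /* example integrand (assumption) */

void Trap(double a, double b, int n, double* global_result_p) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();

    double h = (b - a) / n;            /* width of each trapezoid               */
    int local_n = n / thread_count;    /* assumes n divisible by thread_count   */
    double local_a = a + my_rank * local_n * h;
    double local_b = local_a + local_n * h;

    double my_result = (f(local_a) + f(local_b)) / 2.0;
    for (int i = 1; i <= local_n - 1; i++)
        my_result += f(local_a + i * h);
    my_result = my_result * h;

    #pragma omp critical               /* only one thread at a time */
    *global_result_p += my_result;
}

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);
    double global_result = 0.0;

    #pragma omp parallel num_threads(thread_count)
    Trap(0.0, 1.0, 1024, &global_result);

    printf("Estimate of the integral: %f\n", global_result);
    return 0;
}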

Another Example
Problem: count the number of times each ASCII character occurs on a page of text.
Input: ASCII text stored as an array of characters.
Output: a histogram with 128 buckets, one for each ASCII character.
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example (Sequential Version)
Speed on Quad Core: 10.36 seconds

void compute_histogram_st(char *page, int page_size, int *histogram){
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        histogram[read_character]++;
    }
}

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example We need to parallelize this. source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example

void compute_histogram_st(char *page, int page_size, int *histogram){
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        histogram[read_character]++;
    }
}

The above code does not work!! Why?
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example

void compute_histogram_mt2(char *page, int page_size, int *histogram){
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        #pragma omp atomic
        histogram[read_character]++;
    }
}

Speed on Quad Core: 114.89 seconds, more than 10x slower than the single-threaded version!!
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example

void compute_histogram_mt3(char *page, int page_size, int *histogram, int num_buckets){
    #pragma omp parallel
    {
        int local_histogram[111][num_buckets];
        int tid = omp_get_thread_num();

        #pragma omp for nowait
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }

        for(int i = 0; i < num_buckets; i++){
            #pragma omp atomic
            histogram[i] += local_histogram[tid][i];
        }
    }
}

Runs in 3.8 secs. Why is the speedup not 4 yet?
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example

void compute_histogram_mt4(char *page, int page_size, int *histogram, int num_buckets){
    int num_threads = omp_get_max_threads();

    #pragma omp parallel
    {
        __declspec(align(64)) int local_histogram[num_threads+1][num_buckets];
        int tid = omp_get_thread_num();

        #pragma omp for
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }

        #pragma omp barrier
        #pragma omp single
        for(int t = 0; t < num_threads; t++){
            for(int i = 0; i < num_buckets; i++)
                histogram[i] += local_histogram[t][i];
        }
    }
}

Speed is 4.42 seconds. Slower than the previous version.
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Another Example

void compute_histogram_mt4(char *page, int page_size, int *histogram, int num_buckets){
    int num_threads = omp_get_max_threads();

    #pragma omp parallel
    {
        __declspec(align(64)) int local_histogram[num_threads+1][num_buckets];
        int tid = omp_get_thread_num();

        #pragma omp for
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }

        #pragma omp for
        for(int i = 0; i < num_buckets; i++){
            for(int t = 0; t < num_threads; t++)
                histogram[i] += local_histogram[t][i];
        }
    }
}

Speed is 3.60 seconds.
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

What Can We Learn from the Previous Example?
Atomic operations: they are expensive, yet they are fundamental building blocks.
Synchronization: correctness vs. performance loss.
Rich interaction of hardware-software tradeoffs: hardware primitives and software algorithms must be evaluated together.

OpenMP Parallel Programming
1. Start with a parallelizable algorithm (loop-level parallelism is necessary)
2. Implement serially
3. Test and debug
4. Annotate the code with parallelization (and synchronization) directives, and hope for linear speedup
5. Test and debug

OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin with a single thread: the master thread (ID = 0).
FORK: the master thread then creates a team of parallel threads.
JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate.

Programming Model - Threading

int main() {
    // serial region
    printf("Hello");

    // parallel region
    #pragma omp parallel
    {
        printf("World");
    }

    // serial again
    printf("!");
}

The program forks at the parallel region and joins at its end. We didn't use omp_set_num_threads(); what will be the output?

Isn't Nested Parallelism Interesting?
(Diagram: a fork within a fork, with matching joins.)

Important! The following are implementation dependent:
- Nested parallelism (a small sketch follows below)
- Dynamically altering the number of threads
It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multithreaded program.
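A minimal sketch probing nested parallelism, assuming the classic omp_set_nested() interface (it is deprecated in newer OpenMP versions in favor of omp_set_max_active_levels, and whether the inner region actually gets extra threads remains implementation dependent):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);   /* request nested parallelism; the runtime may ignore this */

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)
        {
            /* If nesting is disabled, each inner team has only 1 thread. */
            printf("outer %d, inner %d of %d\n",
                   outer, omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}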

What we learned so far:
#include <omp.h>
gcc -fopenmp
omp_set_num_threads(x);
omp_get_thread_num();
omp_get_num_threads();
#pragma omp parallel [num_threads(x)]
#pragma omp parallel for [nowait]
#pragma omp atomic
#pragma omp barrier
#pragma omp single
#pragma omp critical

Conclusions OpenMP is a standard for programming shared-memory systems. OpenMP uses both special functions and preprocessor directives called pragmas. OpenMP programs start multiple threads rather than multiple processes. Many OpenMP directives can be modified by clauses.