CS4961 Parallel Programming
Lecture 9: Task Parallelism in OpenMP
Mary Hall, September 22, 2009


Administrative
Programming assignment 1 is posted (after class). Due Tuesday, September 22, before class.
- Use the handin program on the CADE machines
- Use the following command: handin cs4961 prog1 <gzipped tar file>
Mailing list set up: cs4961@list.eng.utah.edu

Today's Lecture
- Go over questions on Proj1
- Review of OpenMP data parallelism
- Discussion of task parallelism in OpenMP 2.x and 3.0
Sources for lecture:
- OpenMP Tutorial by Ruud van der Pas: http://openmp.org/mp-documents/ntu-vanderpas.pdf
- OpenMP 3.0 specification (May 2008): http://www.openmp.org/mp-documents/spec30.pdf

Using Intel Software on VS2008
- File -> New Project (C++)
- Right Click Project or Project -> Intel Compiler -> Use Intel Compiler
- Project -> Properties -> C/C++ -> General -> Suppress Startup Banner = No
- Project -> Properties -> C/C++ -> Optimization -> Maximize Speed (/O2)
- Project -> Properties -> C/C++ -> Optimization -> Enable Intrinsic Functions (/Oi)
- Project -> Properties -> Code Generation -> Runtime Library -> Multithreaded DLL (/MD)
- Project -> Properties -> Code Generation -> Enable Enhanced Instruction Set = Streaming SIMD Extensions 3 (/arch:SSE3)
- Project -> Properties -> Command Line -> Additional Options -> Add /Qvec-report:3
- Click Apply
Your command line options should look like:
/c /O2 /Oi /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /arch:SSE3 /fp:fast /Fo"Debug/" /W3 /ZI /Qvec-report:3
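As a sanity check of this setup, a trivially vectorizable loop can confirm that the report is being emitted. This is an illustrative sketch only (the file and array names are made up); with /Qvec-report:3 enabled, ICC should print the "LOOP WAS VECTORIZED." remark for a loop of this shape.

    /* vec_demo.c -- illustrative only: a loop ICC can typically vectorize. */
    #define N 1024
    float a[N], b[N], c[N];

    void add_arrays(void) {
        int i;
        /* Independent iterations, unit stride, no aliasing concerns
           for globals: a good candidate for SSE vectorization. */
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];
    }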

Project 1 Assignment, PART I
Each of the files t1.c, t2.c, t3.c, t4.c and t5.c from the website cannot be vectorized by the ICC compiler. Your assignment is to produce the equivalent but vectorizable nt1.c, nt2.c, nt3.c, nt4.c and nt5.c. To determine whether or not the compiler has vectorized a loop in the code, look at the compiler output produced with the vec-report3 flag (/Qvec-report:3 above). If it says "remark: LOOP WAS VECTORIZED.", then you have succeeded!
Hints: There are several reasons why the above examples cannot be vectorized. In some cases, the compiler is concerned about the efficiency of the vectorized code. This may be because of the cost of alignment, concerns about exceeding the register capacity (there are only 8 128-bit registers on most SSE3-supporting platforms!) or the presence of data dependences. In other cases, there is concern about correctness, due to aliasing.

Project 1 Assignment, PART II
The code example youwrite.c is simplified from a sparse matrix-vector multiply operation used in conjugate gradient. In the example, the compiler is able to generate vectorizable code, but it must deal with alignment in the inner loop. In this portion, we want you to act as a replacement for the compiler and use the intrinsic operations found in Appendix C (pp. 832-844) of the document found at http://developer.intel.com/design/pentiumii/manuals/243191.htm. This code will be graded by inspection (no running of the code this time around), so it would help if you use comments to describe the operations you are implementing. It would be a good idea to test the code you write, but you will need to generate some artificial input and compare the output of the vectorized code to that of the sequential version.

Examples from Assignment
T2.c:
    maxval = 0.;
    for (i=0; i<n; i++) {
      a[i] = b[i] + c[i];
      maxval = (a[i] > maxval ? a[i] : maxval);
      if (maxval > 1000.0) break;
    }
T3.c:
    for (i=0; i<n; i++)
      a[i] = b[i*4] + c[i];
T4.c:
    int start = rand();
    for (j=start; j<n; j+=2) {
      for (i=start; i<n; i+=2) {
        a[i][j]     = b[i][j]     + c[i][j];
        a[i+1][j]   = b[i+1][j]   + c[i+1][j];
        a[i][j+1]   = b[i][j+1]   + c[i][j+1];
        a[i+1][j+1] = b[i+1][j+1] + c[i+1][j+1];
      }
    }
T5.c:
    for (i=0; i<n; i++) {
      t = t + abs(t-i);
      a[i] = t*b[i] + c[i];
    }
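For Part II's hand-vectorization, the intrinsics in Appendix C map more or less directly onto SSE instructions. The following is a minimal, hypothetical sketch of the style (not the assignment solution; the function and array names are made up) showing a four-wide float addition:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Hypothetical example: c[i] = a[i] + b[i], four floats at a time.
       Assumes n is a multiple of 4; the unaligned loads/stores sidestep
       alignment concerns at some performance cost. */
    void vec_add(float *a, float *b, float *c, int n) {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* 4 adds in one instruction */
            _mm_storeu_ps(&c[i], vc);         /* store 4 floats */
        }
    }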

Examples from Assignment, cont.
Youwrite.c:
    scanf("%f %f %d\n", &a[i], &x[i], &colstr[i]);
    t[i] = 0.0;
    for (j=0; j<n; j++) {
      for (k = colstr[j]; k < colstr[j+1]-1; k++)
        t[k] = t[k] + a[k] * x[j];
    }

OpenMP: Prevailing Shared Memory Programming Approach
- Model for parallel programming
- Shared-memory parallelism
- Portable across shared-memory architectures
- Scalable
- Incremental parallelization
- Compiler based
- Extensions to existing programming languages (Fortran, C and C++), mainly by directives, plus a few library routines
See http://www.openmp.org

A Programmer's View of OpenMP
OpenMP is a portable, threaded, shared-memory programming specification with light syntax.
- Exact behavior depends on the OpenMP implementation!
- Requires compiler support (C/C++ or Fortran)
OpenMP will:
- Allow a programmer to separate a program into serial regions and parallel regions, rather than concurrently-executing threads
- Hide stack management
- Provide synchronization constructs
OpenMP will not:
- Parallelize automatically
- Guarantee speedup
- Provide freedom from data races

OpenMP Summary
Work sharing:
- parallel, parallel for; sections, single/master, and tasks are covered below
- schedule clauses: schedule(static, chunk), schedule(dynamic), schedule(guided)
Data sharing:
- shared, private, reduction
Environment variables:
- OMP_NUM_THREADS, OMP_DYNAMIC, OMP_NESTED, OMP_SCHEDULE
Library routines:
- e.g., omp_get_num_threads(), omp_get_thread_num()
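To see how the work-sharing, data-sharing, and library pieces fit together, here is a minimal sketch (the array contents are illustrative):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        double x[N], sum = 0.0;
        int i;
        for (i = 0; i < N; i++) x[i] = 1.0;   /* illustrative data */

        /* Work sharing + data sharing: iterations are split among threads;
           each thread accumulates a private copy of sum, combined at the end. */
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (i = 0; i < N; i++)
            sum += x[i];

        /* Library routine: query the runtime */
        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }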

Summary of Previous Lecture
OpenMP, data-parallel constructs only; task-parallel constructs are covered next.
What's good?
- Small changes are required to produce a parallel program from a sequential one
- Avoids having to express low-level mapping details
- Portable and scalable, correct on 1 processor
What is missing?
- No scan
- Not completely natural if you want to write a parallel code from scratch
- Not always possible to express certain common parallel constructs
- Locality management
- Control of performance

Conditional Parallelization
if (scalar expression)
- Only execute in parallel if the expression evaluates to true
- Otherwise, execute serially

    #pragma omp parallel if (n > threshold) \
            shared(n,x,y) private(i)
    {
      #pragma omp for
      for (i=0; i<n; i++)
        x[i] += y[i];
    } /*-- End of parallel region --*/

Review: parallel for, private, shared

Locks
Simple locks:
- may not be locked if already in a locked state
Nestable locks:
- may be locked multiple times by the same thread before being unlocked
Simple locks: omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock
Nestable locks: omp_init_nest_lock, omp_destroy_nest_lock, omp_set_nest_lock, omp_unset_nest_lock

OpenMP sections directive
    #pragma omp parallel
    {
      #pragma omp sections
      {
        #pragma omp section
        { a=...; b=...; }
        #pragma omp section
        { c=...; d=...; }
        #pragma omp section
        { e=...; f=...; }
        #pragma omp section
        { g=...; h=...; }
      } /*-- omp end sections --*/
    } /*-- omp end parallel --*/
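A minimal sketch of the simple-lock routines in use (the counter is illustrative; in practice a critical section or atomic would suffice for this particular update):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_lock_t lock;
        int counter = 0;
        omp_init_lock(&lock);

        #pragma omp parallel
        {
            /* Only one thread at a time may hold the lock */
            omp_set_lock(&lock);
            counter++;                /* protected update */
            omp_unset_lock(&lock);
        }

        omp_destroy_lock(&lock);
        printf("counter = %d\n", counter);
        return 0;
    }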

Parallel Sections, Example
    #pragma omp parallel default(none) \
            shared(n,a,b,c,d) private(i)
    {
      #pragma omp sections nowait
      {
        #pragma omp section
        for (i=0; i<n; i++)
          d[i] = 1.0/c[i];
        #pragma omp section
        for (i=0; i<n-1; i++)
          b[i] = (a[i] + a[i+1])/2;
      } /*-- End of sections --*/
    } /*-- End of parallel region --*/

SINGLE and MASTER constructs
- Only one thread in the team executes the enclosed code
- Useful for things like I/O or initialization
- No implicit barrier on entry; single has an implicit barrier on exit unless nowait is specified, master has none

    #pragma omp single
    {
      <code-block>
    }

Similarly, only the master thread executes:

    #pragma omp master
    {
      <code-block>
    }

Motivating Example: Linked List Traversal
    ...
    while (my_pointer) {
      (void) do_independent_work(my_pointer);
      my_pointer = my_pointer->next;
    } // End of while loop
    ...
How to express this with parallel for?
- Must have a fixed number of iterations
- Loop-invariant loop condition and no early exits
- Convert to parallel for by counting the number of iterations a priori (if possible)

OpenMP 3.0: Tasks!
    my_pointer = listhead;
    #pragma omp parallel
    {
      #pragma omp single nowait
      {
        while (my_pointer) {
          #pragma omp task firstprivate(my_pointer)
          {
            (void) do_independent_work(my_pointer);
          }
          my_pointer = my_pointer->next;
        }
      } // End of single - no implied barrier (nowait)
    } // End of parallel region - implied barrier here

firstprivate = private, and copy the initial value in from the global variable
lastprivate = private, and copy the final value back out to the global variable
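Since lastprivate is only defined in passing above, a minimal sketch may help (names and bounds are illustrative):

    #include <stdio.h>

    #define N 8

    int main(void) {
        int i, x = -1;
        int a[N];

        /* Each thread gets a private x; after the loop, x holds the
           value from the sequentially last iteration (i == N-1). */
        #pragma omp parallel for lastprivate(x)
        for (i = 0; i < N; i++) {
            x = 2 * i;
            a[i] = x;
        }
        printf("x after loop = %d\n", x);   /* prints 14 */
        return 0;
    }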

Summary
Completed coverage of OpenMP:
- Locks
- Conditional execution
- Single/Master
- Task parallelism
  - Pre-3.0: parallel sections
  - OpenMP 3.0: tasks
Next time:
- OpenMP programming assignment