OpenMP* Past, Present and Future


OpenMP* Past, Present and Future
Tim Mattson
Intel Corporation, Microprocessor Technology Labs
timothy.g.mattson@intel.com
* The name OpenMP is the property of the OpenMP Architecture Review Board.

OpenMP* Overview

OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers.
- Greatly simplifies writing multi-threaded programs in Fortran, C and C++.
- Standardizes the last 20 years of SMP practice.

A sampling of the API, as it appears in source code and in the environment:

    C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
    #pragma omp parallel for private(A, B)
    C$OMP PARALLEL REDUCTION (+: A, B)
    #pragma omp critical
    C$OMP SINGLE PRIVATE(X)
    !$OMP BARRIER
    CALL OMP_SET_NUM_THREADS(10)
    call omp_test_lock(jlok)
    Nthrds = OMP_GET_NUM_PROCS()
    setenv OMP_SCHEDULE dynamic

* The name OpenMP is the property of the OpenMP Architecture Review Board.

History of OpenMP
ASCI was tired of recoding for SMPs and urged vendors to standardize. DEC, SGI, Cray, HP, IBM, and Intel had merged product lines and needed commonality across products; KAI, an ISV, needed a larger market. Together they wrote a rough draft "straw man" SMP API and invited other vendors to join. The result was OpenMP, in 1997.

OpenMP Release History
- 1997: OpenMP Fortran 1.0
- 1998: OpenMP C/C++ 1.0
- 1999: OpenMP Fortran 1.1
- 2000: OpenMP Fortran 2.0
- 2002: OpenMP C/C++ 2.0
- 2005: OpenMP 2.5, a single specification for Fortran, C and C++

OpenMP Programming Model: Fork-Join Parallelism
The master thread spawns a team of threads as needed. Parallelism is added incrementally until the desired performance is achieved: i.e., the sequential program evolves into a parallel program. Parallel constructs may be nested: a thread inside one parallel region can itself become the master of a nested parallel region.
[Figure: master thread forking and joining teams at successive parallel constructs, including a nested parallel construct.]

OpenMP loves simple loops: The PI program
A numerical integration of 4/(1+x^2) over [0,1]; a single pragma parallelizes the loop:

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    int main () {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i=1; i<=num_steps; i++){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        return 0;
    }

PI Program: Win32 API
The same computation written with raw Windows threads; note how much bookkeeping the programmer must manage by hand:

    #include <windows.h>
    #include <stdio.h>
    #define NUM_THREADS 2
    HANDLE thread_handles[NUM_THREADS];
    CRITICAL_SECTION hUpdateMutex;
    static long num_steps = 100000;
    double step;
    double global_sum = 0.0;

    void Pi (void *arg) {
        int i, start;
        double x, sum = 0.0;
        start = *(int *) arg;
        step = 1.0/(double) num_steps;
        for (i=start; i<=num_steps; i=i+NUM_THREADS){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        EnterCriticalSection(&hUpdateMutex);
        global_sum += sum;
        LeaveCriticalSection(&hUpdateMutex);
    }

    int main () {
        double pi;
        int i;
        DWORD threadID;
        int threadArg[NUM_THREADS];
        for (i=0; i<NUM_THREADS; i++) threadArg[i] = i+1;
        InitializeCriticalSection(&hUpdateMutex);
        for (i=0; i<NUM_THREADS; i++)
            thread_handles[i] = CreateThread(0, 0,
                (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
        WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
        pi = global_sum * step;
        printf(" pi is %f \n", pi);
        return 0;
    }

*Other names and brands may be claimed as the property of others.

OpenMP with complicated loops
OpenMP can't handle pointer-chasing loops:

    nodeptr list, p;
    for (p=list; p != NULL; p=p->next)
        process(p->data);

Intel has proposed (and implemented) a taskq construct to deal with this case:

    nodeptr list, p;
    #pragma omp parallel taskq
    for (p=list; p != NULL; p=p->next) {
        #pragma omp task
        process(p->data);
    }

Reference: Shah, Haab, Petersen and Throop, EWOMP 1999.

Task queue example - FLAME: Shared Memory Parallelism in Dense Linear Algebra
Traditional approach to parallel linear algebra: the only parallelism comes from a multithreaded BLAS. Observation: better speedup is possible if parallelism is exposed at a higher level. The FLAME approach: OpenMP task queues.
Tze Meng Low, Kent Milfeld, Robert van de Geijn, and Field Van Zee, "Parallelizing FLAME Code with OpenMP Task Queues", TOMS, submitted.
*Other names and brands may be claimed as the property of others.

Developing Libraries
Traditional approach: an expert painstakingly develops and implements an algorithm for each operation.
These slides were borrowed from Robert van de Geijn of UT Austin.

A Typical Linear Algebra Algorithm (LU factorization)
[Figures: successive steps of a blocked LU factorization sweeping through the matrix.]
These slides were borrowed from Robert van de Geijn of UT Austin.


Notation: Indices are the root of all evil

The right way to write math libraries
FLAME: www.cs.utexas.edu/users/flame
- Mathematical specification of the operation.
- Mechanical derivation of algorithms, yielding a family of algorithms.
- Mechanical analysis of the algorithms.
- Mathematical specification of the architecture, plus rewrite rules for the language.
- Mechanical translation to code, producing an optimized library.
These slides were borrowed from Robert van de Geijn of UT Austin.

Symmetric rank-k update
Compute C += A A^T, storing only the lower triangle of the symmetric matrix C. Partition A into blocks of rows and C conformally; the current iteration updates one block of rows of C: C_10 += A_1 A_0^T and C_11 += A_1 A_1^T.
Note: the iteration sweeps through C and A, creating a new block of rows to be updated with new parts of A. These updates are completely independent.
Tze Meng Low, Kent Milfeld, Robert van de Geijn, and Field Van Zee, "Parallelizing FLAME Code with OpenMP Task Queues", TOMS, submitted.


[Figure: MFLOPS versus matrix dimension n (0 to 2000) for syrk_ln (var2) on a 4-CPU 1.5 GHz Itanium2; the top line represents the peak of the machine. Curves: Reference, FLAME, and OpenFLAME with 1 to 4 threads.]
Note: the above graph is for the most naive way of marching through the matrices. By picking blocks dynamically, much faster ramp-up can be achieved.

So OpenMP is great and getting better
The current specification (2.5) is very effective for the loop-oriented programs common in scientific computing. The next specification (3.0, expected in November 2006) will handle a much larger variety of control structures.

But there are major challenges ahead!
OpenMP was created with a particular abstract machine, or computational model, in mind:
- Multiple processing elements.
- A shared address space with equal-time access for each processor.
- Multiple lightweight processes (threads) managed outside of OpenMP (by the OS or some other third party).
[Figure: processors 1 through N connected to a single shared address space.]

But there are major challenges ahead!
There are no machines on the market today that truly match the OpenMP model:
- Processors have caches for fast local memory.
- Large multiprocessor machines have NUMA memory subsystems.
- Multi-core only complicates the problem.
What are we going to do? We are studying ways to define processor/thread hierarchies in OpenMP; we hope to get this into OpenMP 3.0. Cluster OpenMP?

Conclusion/Summary
OpenMP is an important standard for programming shared memory machines. OpenMP 3.0 will extend OpenMP to a much larger range of algorithms: OpenFLAME shows the power of incremental parallelism, flexible task-queue constructs, and intelligent algorithm design. But we have important challenges to resolve: how should OpenMP evolve to fit a world where memory hierarchies are complex? NUMA? Clusters?