Task-based Execution of Nested OpenMP Loops


Task-based Execution of Nested OpenMP Loops
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos
Department of Computer Science, University of Ioannina, Ioannina, Greece

Presentation Layout
- Introduction
- Manual transformation of omp for
- Transformation limitations
- Automatic transformation in OMPi
- Evaluation
#2

Introduction
OpenMP:
- Initially designed for loop parallelism
- Nested parallelism an important feature
- V3.0: tasks
Nested parallelism is difficult to handle efficiently:
  o Possible processor oversubscription
Ways of controlling the overheads:
  o Environment variables (e.g. OMP_MAX_ACTIVE_LEVELS)
  o The collapse clause (when applicable)
#3
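
As a minimal sketch of the collapse alternative (an illustration, not from the slides; process() is a hypothetical work function): the two loops are fused into a single iteration space, so one thread team covers all iterations and no nested parallel region is created.

    #include <omp.h>

    void process(int i, int j);   /* hypothetical work function */

    void flattened(int n, int m)
    {
        int i, j;
        /* a single team executes all n*m independent iterations,
           avoiding nested regions and oversubscription */
        #pragma omp parallel for collapse(2) schedule(static)
        for (i = 0; i < n; i++)
            for (j = 0; j < m; j++)
                process(i, j);
    }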

Introduction
OpenMP parallel loops: structures with independent iterations.
- Fortran 95: FORALL
- Intel TBB: parallel_for
- Cilk++: cilk_for
TBB and Cilk++ use tasking mechanisms to perform the job. Can an OpenMP implementation do the same?
#4

Transforming Loop Code Manually

Original code:

    #pragma omp parallel num_threads(m)
    {
        #pragma omp parallel for \
                schedule(static) num_threads(n)
        for (i = lb; i < ub; i++)
            <body>
    }

Transformed code:

    #pragma omp parallel num_threads(m)
    {
        for (t = 0; t < n; t++)
            #pragma omp task
            {
                calculate(n, LB, UB, &lb, &ub);
                for (i = lb; i < ub; i++)
                    <body>
            }
        #pragma omp taskwait
    }

The n implicit tasks of the inner parallel region are transformed into n explicit tasks. Instead of n x m, only m threads exist in the system.
#5
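
A compilable version of the transformed inner loop might look as follows. This is a hedged sketch: the slide leaves calculate() undefined, so the static partitioning below is an assumption, and this variant passes the task id t explicitly. It is meant to be called from inside an existing (outer) parallel region.

    #include <omp.h>

    /* assumed helper: static partitioning of [LB, UB) into n chunks;
       chunk t gets [*lb, *ub). Not the paper's actual implementation. */
    static void calculate(int n, int t, int LB, int UB, int *lb, int *ub)
    {
        int len = UB - LB, chunk = len / n, rem = len % n;
        *lb = LB + t * chunk + (t < rem ? t : rem);
        *ub = *lb + chunk + (t < rem ? 1 : 0);
    }

    /* creates n explicit tasks instead of spawning n nested threads */
    void pfor_as_tasks(int n, int LB, int UB, void (*body)(int))
    {
        int t;
        for (t = 0; t < n; t++) {
            #pragma omp task firstprivate(t)
            {
                int i, lb, ub;
                calculate(n, t, LB, UB, &lb, &ub);
                for (i = lb; i < ub; i++)
                    body(i);
            }
        }
        #pragma omp taskwait
    }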

Manually Transforming Code: Example
Face detection application on a 16-core machine (2x AMD Opteron 6128):

    for each scale (< 14) {     /* level 1 */
        for i = 1 to 4  { <body1> }
        for i = 1 to 14 { <body2> }
        for i = 1 to 14 { <body3> }
    }

[Chart: speedup of the manual transformation with GCC]
#6

Manually Transforming Code: Example
Same application and machine as above.

[Chart: speedup of the manual transformation with ICC]
#7

Manual Transformation Limitations
A similar transformation is possible for the dynamic and guided schedules.
Limitations:
- Complicated user code
- Thread-specific data within loops:
  o Calls to omp_get_thread_num() utilizing the thread's ID
  o Accesses to threadprivate variables
#8
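
As a hedged illustration of the thread-ID problem (not from the slides), a loop body like the following breaks under the manual transformation: inside a task, omp_get_thread_num() returns the ID of whichever thread happened to execute the task, not the inner-team ID the code expects.

    #include <omp.h>

    double partial[16];   /* hypothetical per-thread accumulators */

    void sum_loop(int lb, int ub, double *a)
    {
        int i;
        #pragma omp parallel for num_threads(16)
        for (i = lb; i < ub; i++) {
            /* after the task transformation, this ID no longer
               identifies a distinct member of a 16-thread inner team */
            partial[omp_get_thread_num()] += a[i];
        }
    }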

Manual Transformation Limitations
In general:
- A mini worksharing subsystem must be written: which thread will execute which task?
- Impossible to handle thread-specific data
But within an OpenMP runtime system:
- All the worksharing functionality is already there
- Access to all thread-specific data
=> Automatic transformation in the OMPi compiler
#9

The OMPi C Compiler
OMPi (Univ. of Ioannina, http://www.cs.uoi.gr/~ompi):
- OpenMP V3.0 C infrastructure
- Source-to-source compiler + runtime
Basic code transformation: outlining for parallel and task regions.

    #pragma omp parallel            thread_func0() {
    { <parallel code body> }   ->       <parallel code body>
                                    }

    #pragma omp task                task_func0() {
    { <task code body> }       ->       <task code body>
                                    }
#10
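
To make the outlining concrete, a hedged before/after sketch (function names and the runtime entry point are illustrative only, not OMPi's actual generated code or API):

    #include <omp.h>

    void do_work(void);

    /* hypothetical runtime entry point; OMPi's real API may differ */
    extern void ort_execute_parallel(void *(*func)(void *), void *args);

    /* user code */
    void user(void)
    {
        #pragma omp parallel
        {
            do_work();
        }
    }

    /* roughly what the outlining pass produces: the region body
       becomes a function, executed by every thread of the new team */
    static void *thread_func0(void *args)
    {
        do_work();
        return (void *)0;
    }

    void user_transformed(void)
    {
        ort_execute_parallel(thread_func0, (void *)0);
    }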

OMPi's Runtime Organization
Each OpenMP thread is associated with an EECB (Execution Entity Control Block).
The EECB holds all OpenMP thread info:
1) Thread ID
2) Parallel level
3) Pointer to parent EECB
4) ...
#11
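
A hedged C sketch of such a control block (field names are guesses based on the list above, not OMPi's actual definition):

    /* illustrative only: an execution entity control block as
       described on the slide; OMPi's real struct differs */
    typedef struct eecb {
        int          thread_id;   /* ID within the current team */
        int          level;      /* nesting level of the region */
        struct eecb *parent;     /* enclosing region's EECB */
        /* ... task queues, team info, etc. ... */
    } eecb_t;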

OMPi's Runtime Organization

[Diagram: the initial implicit task (Thread ID = 0, Level = 0) with its TASK_QUEUE, into which the explicit tasks created by "for (i=0; i<4; i++) #pragma omp task" are enqueued]
#12

OMPi's Runtime Organization

[Diagram: #pragma omp parallel num_threads(4): the initial implicit task (Thread ID = 0, Level = 0) spawns implicit tasks P0, P1, P2, P3]
#13

OMPi's Runtime Organization

[Diagram: a parallel team of 4 threads (Thread 0 to Thread 3) executing implicit tasks P0 to P3 with Thread IDs 0 to 3; a TASK_QUEUE table hangs off the initial implicit task]
#14

Auto Transformation: Implementation
Compiler side:
- Almost no changes
- Extra flag to notify the runtime of a combined parallel for
Runtime side:
- New type of task, called pfor_task
- Emulation of implicit tasks using explicit tasking
#15
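
A hedged sketch of the emulation idea, using the eecb_t sketched earlier (all names hypothetical; the slides do not show OMPi's internals): the thread that picks up a pfor task temporarily installs a nested EECB carrying the emulated team ID, so the existing worksharing code behaves as if a real nested team existed.

    /* illustrative pseudo-runtime, not OMPi's actual code */
    extern eecb_t *my_eecb(void);          /* current thread's EECB */
    extern void    set_my_eecb(eecb_t *);  /* install a new EECB */

    void execute_pfor_task(void (*loop_body)(void), int task_id)
    {
        eecb_t *saved = my_eecb();
        /* emulated team member: the task runs as if it were thread
           'task_id' of a nested team, one level deeper */
        eecb_t nested = { task_id, saved->level + 1, saved };
        set_my_eecb(&nested);
        loop_body();            /* worksharing code now sees ID task_id */
        set_my_eecb(saved);
    }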

Nested Parallel for Execution

[Diagram: a team of 4 threads runs implicit tasks P0..P3 (Thread IDs 0..3); one thread encounters "#pragma omp parallel for num_threads(4)" and enqueues four pfor tasks S0, S1, S2, S3, then waits at a TASKWAIT]
#16

Nested Parallel for Execution

[Diagram: workstealing! One thread executes pfor task S0 (id = 0) under an emulated EECB (Thread ID = 0, Level = 2) while another steals and executes S3 (id = 3, Thread ID = 3, Level = 2)]
#17

Nested Parallel for Execution

[Diagram: execution continues; pfor tasks S0..S3 (ids 0..3) are picked up by the team's threads, each running at Level = 2 with the emulated Thread ID of its task]
#18

Nested Parallel for Execution

[Diagram: the remaining pfor tasks S1 (Thread ID = 1, Level = 2) and S2 (Thread ID = 2, Level = 2) complete]
#19

Auto Transformation Concerns
- All schedule types are supported
- The worksharing subsystem remains the same
- Important: fewer threads than num_threads(x) may execute the parallel for. Can this cause a problem?
#20

Auto Transformation Concerns: the ordered Clause
The ordered clause enforces ordering in the execution of iterations. Can task execution order (scheduling) cause problems?
Dynamic and guided schedules: no problem.
  o If a thread (task) blocks at an ordered region, previous iterations have already been given away
  o Hence, progress is guaranteed
#21

Auto Transformation Concerns
Static schedule + chunk size: each task is responsible for executing a precalculated set of iterations.

    #pragma omp parallel for schedule(static, 10) num_threads(4)

With 80 iterations, the chunks are assigned round-robin:

    TASK 0 | TASK 1 | TASK 2 | TASK 3 | TASK 0 | TASK 1 | TASK 2 | TASK 3

The scheduling depends on num_threads().
#22
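
For reference, a small helper (an assumption-laden sketch, not from the slides) that enumerates the iterations task t owns under schedule(static, chunk) with n tasks: task t gets chunks t, t+n, t+2n, and so on.

    #include <stdio.h>

    /* print the iterations task t executes under schedule(static, chunk)
       with n tasks over the iteration space [0, iters) */
    void static_chunks(int t, int n, int chunk, int iters)
    {
        int lb;
        for (lb = t * chunk; lb < iters; lb += n * chunk) {
            int ub = lb + chunk < iters ? lb + chunk : iters;
            printf("task %d: iterations %d..%d\n", t, lb, ub - 1);
        }
    }

    int main(void)
    {
        int t;
        /* the slide's example: 80 iterations, chunk 10, 4 tasks */
        for (t = 0; t < 4; t++)
            static_chunks(t, 4, 10, 80);
        return 0;
    }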

Auto Transformation Concerns
Static + chunk size + ordered scheduling:

    #pragma omp parallel for schedule(static, 10) ordered num_threads(4)

With 80 iterations split round-robin among TASK 0..TASK 3 as above, suppose only 2 threads execute the tasks. Tasks 0 and 1 finish their first chunks and then block at the ordered region, waiting for iterations 20..39; those belong to tasks 2 and 3, which can never start because both threads are occupied by the blocked tasks. Iterations 20 to 79 will never be executed. Deadlock!
#23

Auto Transformation Concerns
Currently, an engineering decision: for this special case we disable the tasking transformation and use threads.
The OpenMP v3.1 taskyield construct may be able to guarantee progress (currently working on it).
#24

Evaluation Environment
- 2x 8-core AMD Opteron 6128 CPUs @ 2 GHz
- 16 GB of main memory
- Debian Squeeze on the 2.6.32.5 kernel
- GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
- Intel icc (version 12.1.0) [-fast -openmp]
- Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
- OMPi uses GNU gcc as a back-end compiler [-O3]
- Default runtime settings
#25

Synthetic Benchmark

    main() {
        #pragma omp parallel for num_threads(16)
        for (i = 0; i < 16; i++)
            testpfor();
    }

    testpfor() {
        for (i = 0; i < 100K; i++) {
            #pragma omp parallel for num_threads(N)
            for (j = 0; j < N; j++)
                delay(task_load);
        }
    }
#26
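
A self-contained version of this benchmark might look as follows; delay() is not shown on the slide, so the spin loop below is an assumption, as are the concrete values of N and the scaling of the load.

    #include <omp.h>

    #define N         4      /* second-level thread count (assumed) */
    #define TASK_LOAD 500    /* per-task work, as on the next slide */

    /* assumed implementation: spin for roughly 'load' units of work */
    void delay(int load)
    {
        volatile int k;
        for (k = 0; k < load * 100; k++)
            ;
    }

    void testpfor(void)
    {
        int i, j;
        for (i = 0; i < 100000; i++) {        /* 100K nested regions */
            #pragma omp parallel for num_threads(N)
            for (j = 0; j < N; j++)
                delay(TASK_LOAD);
        }
    }

    int main(void)
    {
        int i;
        #pragma omp parallel for num_threads(16)
        for (i = 0; i < 16; i++)
            testpfor();
        return 0;
    }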

Synthetic Benchmark Results

[Chart: TASK_LOAD = 500, 16 threads in the 1st level]
#27

Synthetic Benchmark Results

[Chart: 16 threads in the 1st level, N (L2 threads) = 4]
#28

Face Detection Application
- Takes an image as input
- Discovers the number of faces and their positions in the image
- Varying number of scales (depends on the image; usually < 14)
- Level 1: unbalanced iterations, dynamic schedule
- Level 2: static schedule

    for each scale {              /* level 1 */
        for i = 1 to 4  {         /* level 2 */ <body1> }
        for i = 1 to 14 {         /* level 2 */ <body2> }
        for i = 1 to 14 {         /* level 2 */ <body3> }
    }
#29
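
In OpenMP terms, the parallelization described above could be expressed roughly as follows; this is a hedged sketch (the application's real code is not shown on the slides, and the body functions are hypothetical).

    #include <omp.h>

    void body1(int scale, int i);   /* hypothetical work functions */
    void body2(int scale, int i);
    void body3(int scale, int i);

    void detect(int nscales)
    {
        int s;
        omp_set_nested(1);          /* enable nested parallel regions */
        /* level 1: unbalanced scales, dynamic schedule */
        #pragma omp parallel for schedule(dynamic, 1)
        for (s = 0; s < nscales; s++) {
            int i;
            /* level 2: nested parallel loops, static schedule */
            #pragma omp parallel for schedule(static)
            for (i = 0; i < 4; i++)  body1(s, i);
            #pragma omp parallel for schedule(static)
            for (i = 0; i < 14; i++) body2(s, i);
            #pragma omp parallel for schedule(static)
            for (i = 0; i < 14; i++) body3(s, i);
        }
    }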

Face Detection Results

[Chart: 161 images, CMU test set. Speedup for each compiler is calculated w.r.t. its own sequential execution]
#30

Face Detection Results

[Chart: Class 57 image from the CMU test set. Speedup for each compiler is calculated w.r.t. its own sequential execution]
#31

Face Detection Results

    Compiler | Best Configuration | Speedup | OMPi improvement
    GCC      | 6 x 4              | 3.219   | 25.5%
    ICC      | 12 x 8             | 3.236   | 25.2%
    SUNCC    | 4 x 4              | 3.378   | 21.9%
    OMPi     | 16 x 8             | 4.327   | -

Best configurations and comparison with OMPi when processing all images. Speedup is calculated in comparison to the best sequential time overall.
#32

END #33

Parallel for Execution

[Diagram: a first-level team of 4 threads, each with an EECB (Thread ID = 0..3) running implicit tasks P0..P3; pfor tasks S0..S3 execute under emulated EECBs at Level = 2 with Thread IDs matching their task ids]
#34