Task-based Execution of Nested OpenMP Loops
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos
Department of Computer Science, University of Ioannina, Ioannina, Greece
Presentation Layout
- Introduction
- Manual transformation of omp for
- Transformation Limitations
- Automatic Transformation in OMPi
- Evaluation
#2
Introduction
- OpenMP was initially designed for loop parallelism
- Nested parallelism is an important feature; V3.0 added tasks
- Nested parallelism is difficult to handle efficiently
  o Possible processor oversubscription
- Ways of controlling the overheads:
  o Environment variables (e.g. OMP_MAX_ACTIVE_LEVELS)
  o collapse clause (when applicable)
#3
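For illustration, a minimal sketch of both controls in plain OpenMP; the function name and loop bounds are made up for the example:

  #include <omp.h>

  void nested_controls(int N, int M, double a[N][M])
  {
      /* Cap active nesting at one level so an inner parallel region
         cannot oversubscribe the machine (setting OMP_MAX_ACTIVE_LEVELS=1
         in the environment achieves the same). */
      omp_set_max_active_levels(1);

      /* When the loops are perfectly nested, collapse(2) merges them
         into a single iteration space, avoiding nested parallelism. */
      #pragma omp parallel for collapse(2) schedule(static)
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              a[i][j] = i + j;
  }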
Introduction
- OpenMP parallel loops: structures with independent iterations
- Fortran 95: FORALL
- Intel TBB: parallel_for
- Cilk++: cilk_for
- TBB and Cilk++ use tasking mechanisms to perform the job
- Can an OpenMP implementation do the same?
#4
Transforming Loop Code Manually

Original nested loop:
  #pragma omp parallel num_threads(m)
  {
    #pragma omp parallel for schedule(static) num_threads(n)
    for (i = lb; i < ub; i++) {
      <body>
    }
  }

Task-based transformation:
  #pragma omp parallel num_threads(m)
  {
    for (t = 0; t < n; t++)
      #pragma omp task
      {
        calculate(n, LB, UB, &lb, &ub);
        for (i = lb; i < ub; i++)
          <body>
      }
    #pragma omp taskwait
  }

- The n implicit tasks of the inner parallel for become n explicit tasks
- Instead of n x m threads, only the m threads of the outer team exist in the system
#5
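A fuller, compilable sketch of the same transformation; the helper calculate_bounds, the body callback, and the firstprivate(t) clause are our additions for illustration, not part of the slides:

  #include <omp.h>

  /* Illustrative static split of [LB, UB) into n near-equal chunks;
     returns the bounds of chunk t. */
  static void calculate_bounds(int t, int n, int LB, int UB, int *lb, int *ub)
  {
      int total = UB - LB, chunk = total / n, rem = total % n;
      *lb = LB + t * chunk + (t < rem ? t : rem);
      *ub = *lb + chunk + (t < rem ? 1 : 0);
  }

  void nested_loop_as_tasks(int m, int n, int LB, int UB, void (*body)(int))
  {
      #pragma omp parallel num_threads(m)
      {
          /* Each outer thread creates n explicit tasks instead of
             spawning an inner team of n threads. */
          for (int t = 0; t < n; t++) {
              #pragma omp task firstprivate(t)
              {
                  int lb, ub;
                  calculate_bounds(t, n, LB, UB, &lb, &ub);
                  for (int i = lb; i < ub; i++)
                      body(i);
              }
          }
          #pragma omp taskwait
      }
  }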
Manually Transforming Code: Face Detection Example

  for each scale (< 14) {          /* level 1 */
    for i = 1 to 4  { <body1> }    /* level 2 */
    for i = 1 to 14 { <body2> }    /* level 2 */
    for i = 1 to 14 { <body3> }    /* level 2 */
  }

[Chart: results with GCC on a 16-core machine (2x AMD Opteron 6128)]
#6
Manually Transforming Code: Face Detection Example (cont.)
Same code as the previous slide.
[Chart: results with ICC on the same 16-core machine (2x AMD Opteron 6128)]
#7
Manual Transformation Limitations
- A similar transformation is possible for the dynamic and guided schedules
- Complicated user code
- Thread-specific data within loops (see the sketch below):
  o Calls to omp_get_thread_num() utilizing the thread's ID
  o Accesses to threadprivate variables
#8
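A minimal sketch of the thread-specific-data problem (the array and its size are invented for the example): under the manual task transformation, omp_get_thread_num() inside the loop body returns the ID of whichever outer thread executes the task, not the ID the original inner team member would have had.

  #include <omp.h>

  #define N 1000
  double partial[8];   /* sized for the 8-thread inner team of the original code */

  void inner_loop(const double *x)
  {
      #pragma omp parallel for num_threads(8)
      for (int i = 0; i < N; i++) {
          /* Correct with a real inner team: IDs are 0..7.  If the loop is
             hand-converted to tasks, this call yields the executing outer
             thread's ID, which may exceed 7 or collide with another task. */
          partial[omp_get_thread_num()] += x[i];
      }
  }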
Manual Transformation Limitations
- In general:
  o A mini worksharing scheduler must be written by hand
  o Which thread will execute which task?
  o Impossible to handle thread-specific data
- But within an OpenMP runtime system:
  o All the worksharing functionality is already there
  o There is access to all thread-specific data
- Hence: automatic transformation in the OMPi compiler
#9
OMPi C Compiler
- OMPi (Univ. of Ioannina, http://www.cs.uoi.gr/~ompi)
- OpenMP V3.0 C infrastructure: source-to-source compiler + runtime
- Basic code transformation: outlining of parallel and task regions

  #pragma omp parallel { <parallel code body> }  ->  thread_func0() { <parallel code body> }
  #pragma omp task { <task code body> }          ->  task_func0()   { <task code body> }
#10
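A toy, self-contained sketch of the outlining scheme; the stand-in runtime entry execute_parallel below is ours and does not reflect OMPi's actual internal API:

  #include <stdio.h>

  /* Stand-in for the runtime call that would create a team and have every
     member run the outlined body (illustrative only, not OMPi's API). */
  static void execute_parallel(void (*func)(void *), void *arg, int nthreads)
  {
      for (int i = 0; i < nthreads; i++)   /* a real runtime spawns threads */
          func(arg);
  }

  /* Outlined body of:  #pragma omp parallel { printf("hello\n"); } */
  static void thread_func0(void *arg)
  {
      (void)arg;
      printf("hello\n");
  }

  int main(void)
  {
      /* The compiler replaces the parallel construct with this call. */
      execute_parallel(thread_func0, 0, 4);
      return 0;
  }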
OMPi's Runtime Organization
- Each OpenMP thread is associated with an EECB (Execution Entity Control Block)
- The EECB holds all OpenMP thread info:
  1) Thread ID
  2) Parallel level
  3) Pointer to parent EECB
  4) ...
#11
OMPi's Runtime Organization
[Diagram: the initial implicit task (Thread ID = 0, Level = 0) executing "for(i=0; i<4; i++) #pragma omp task" and placing the created tasks in its TASK_QUEUE]
#12
OMPi's Runtime Organization
[Diagram: "#pragma omp parallel num_threads(4)" encountered by the initial implicit task (Thread ID = 0, Level = 0) creates four implicit tasks P0, P1, P2, P3]
#13
OMPi's Runtime Organization
[Diagram: a parallel team of 4 threads; threads 0-3 each get an EECB (Thread ID = 0..3) and run one of the implicit tasks P0-P3, with a table of TASK_QUEUEs serving the team]
#14
Auto Transformation: Implementation
- Compiler side: almost no changes; an extra flag notifies the runtime of a combined parallel for
- Runtime side (see the sketch below):
  o A new type of task, called pfor_task
  o Emulation of implicit tasks using explicit tasking
#15
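A rough sketch of the runtime-side idea in plain OpenMP; the function, its name, and the chunk arithmetic are ours, while the actual pfor_task creation and EECB bookkeeping are internal to OMPi:

  #include <omp.h>

  /* Emulate the n implicit tasks of a nested
     "#pragma omp parallel for num_threads(n) schedule(static)"
     with n explicit tasks; each task behaves as team member 'id' would. */
  static void pfor_as_tasks(int n, int lo, int hi, void (*body)(int))
  {
      for (int id = 0; id < n; id++) {
          #pragma omp task firstprivate(id)
          {
              /* The static schedule of [lo, hi) as seen by member 'id'. */
              int total = hi - lo;
              int chunk = (total + n - 1) / n;
              int mylo  = lo + id * chunk;
              int myhi  = (mylo + chunk < hi) ? mylo + chunk : hi;
              for (int i = mylo; i < myhi; i++)
                  body(i);
          }
      }
      #pragma omp taskwait   /* stands in for the emulated region's barrier */
  }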
Nested Parallel for Execution
[Diagram: thread 0 of the 4-thread team (implicit tasks P0-P3) encounters a nested "#pragma omp parallel for num_threads(4)"; instead of creating an inner team it enqueues four pfor tasks S0-S3 followed by a TASKWAIT]
#16
Nested Parallel for Execution
[Diagram: work stealing; thread 0 executes pfor task S0 (id = 0, Thread ID = 0, Level = 2) while an idle thread steals S3 and runs it with the emulated Thread ID = 3 at Level = 2]
#17
Nested Parallel for Execution
[Diagram: execution continues; the remaining pfor tasks S1 and S2 (ids 1 and 2) are picked up by other threads of the outer team, each running with its emulated Thread ID at Level = 2]
#18
Nested Parallel for Execution
[Diagram: the last pfor tasks S1 (id = 1) and S2 (id = 2) complete on their emulated Thread IDs at Level = 2]
#19
Auto Transformation: Concerns
- All schedule types are supported
- The worksharing subsystem remains the same
- Important: fewer threads than num_threads(x) may end up executing the parallel for
- Can this cause a problem?
#20
Auto Transformation: Concerns (ordered clause)
- ordered enforces ordering in the execution of iterations
- Can task execution order (scheduling) cause problems?
- Dynamic and guided schedules: no problem (an example follows below)
  o If a thread (task) blocks at an ordered region, the previous iterations have already been given away to other tasks
  o Hence, progress is guaranteed
#21
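For reference, a minimal ordered loop with a dynamic schedule (the body is invented); a task holding a later chunk simply waits until the earlier iterations, already handed out to other tasks, pass their ordered regions:

  #include <stdio.h>

  void ordered_dynamic(int n)
  {
      #pragma omp parallel for schedule(dynamic, 4) ordered num_threads(4)
      for (int i = 0; i < n; i++) {
          /* ... unordered work for iteration i ... */
          #pragma omp ordered
          printf("iteration %d\n", i);   /* executed strictly in order */
      }
  }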
Auto Transformation: Concerns
- Static schedule + chunk size: each task must execute a precalculated set of iterations
- Example: #pragma omp parallel for schedule(static, 10) num_threads(4), 80 iterations
  o Chunks of 10 are preassigned round-robin: TASK 0, TASK 1, TASK 2, TASK 3, TASK 0, TASK 1, TASK 2, TASK 3
- The scheduling depends on the value given to num_threads()
#22
Auto Transformation: Concerns
- Static schedule + chunk size + ordered
- Example: #pragma omp parallel for schedule(static, 10) ordered num_threads(4), 80 iterations
  o Chunks preassigned round-robin: TASK 0, TASK 1, TASK 2, TASK 3, TASK 0, TASK 1, TASK 2, TASK 3
- If only 2 threads execute the tasks, iterations 20 to 79 will never be executed: deadlock! (see the sketch below)
#23
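A minimal sketch of the deadlocking pattern (the body is invented); the comment spells out why two executing threads are not enough once the chunks have been preassigned to four emulated tasks:

  #include <stdio.h>

  void static_ordered_trap(void)
  {
      /* 80 iterations, chunks of 10, preassigned round-robin to 4 tasks:
         task 0: 0-9, 40-49    task 1: 10-19, 50-59
         task 2: 20-29, 60-69  task 3: 30-39, 70-79
         If the 4 implicit tasks are emulated by explicit tasks but only 2
         threads run them, tasks 0 and 1 block in the ordered regions of
         iterations 40 and 50, waiting for 20-39; those belong to tasks 2
         and 3, which never get a thread: deadlock. */
      #pragma omp parallel for schedule(static, 10) ordered num_threads(4)
      for (int i = 0; i < 80; i++) {
          #pragma omp ordered
          printf("%d\n", i);
      }
  }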
Auto Transformation: Concerns
- Current engineering decision: for this special case (static + chunk size + ordered) we disable the tasking transformation and fall back to threads
- OpenMP v3.1 taskyield may be able to guarantee progress (work in progress)
#24
Evaluation Environment
- 2x 8-core AMD Opteron 6128 CPUs @ 2 GHz, 16 GB of main memory
- Debian Squeeze on the 2.6.32.5 kernel
- GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
- Intel icc (version 12.1.0) [-fast -openmp]
- Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
- OMPi uses GNU gcc as a back-end compiler [-O3]
- Default runtime settings
#25
Synthetic Benchmark

  main() {
    #pragma omp parallel for num_threads(16)
    for (i = 0; i < 16; i++)
      testpfor();
  }

  testpfor() {
    for (i = 0; i < 100000; i++) {
      #pragma omp parallel for num_threads(N)
      for (j = 0; j < N; j++)
        delay(TASK_LOAD);
    }
  }
#26
Synthetic Benchmark Results
[Chart: TASK_LOAD = 500, 16 threads in the 1st level]
#27
Synthetic Benchmark Results
[Chart: 16 threads in the 1st level, N (2nd-level threads) = 4]
#28
Face Detection Application
- Takes an image as input
- Discovers the number of faces and their positions in the image
- Varying number of scales (depends on the image, usually < 14)
- Level 1: unbalanced iterations, dynamic schedule
- Level 2: static schedule (an annotated sketch follows below)

  for each scale {                 /* level 1 */
    for i = 1 to 4  { <body1> }    /* level 2 */
    for i = 1 to 14 { <body2> }    /* level 2 */
    for i = 1 to 14 { <body3> }    /* level 2 */
  }
#29
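A sketch of how this nesting and these schedules could be expressed in OpenMP; nscales and the body1/body2/body3 stubs are placeholders, not the application's actual code:

  /* Placeholder work functions standing in for the application's bodies. */
  static void body1(int s, int i) { (void)s; (void)i; }
  static void body2(int s, int i) { (void)s; (void)i; }
  static void body3(int s, int i) { (void)s; (void)i; }

  void detect(int nscales)
  {
      /* Level 1: scales have very different costs, hence dynamic. */
      #pragma omp parallel for schedule(dynamic, 1)
      for (int s = 0; s < nscales; s++) {
          /* Level 2: balanced loops, static schedule; with OMPi's
             transformation each of these becomes a set of pfor tasks. */
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < 4; i++)  body1(s, i);

          #pragma omp parallel for schedule(static)
          for (int i = 0; i < 14; i++) body2(s, i);

          #pragma omp parallel for schedule(static)
          for (int i = 0; i < 14; i++) body3(s, i);
      }
  }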
Face Detection Results
- 161 images, CMU test set
- Speedup for each compiler is calculated w.r.t. its own sequential execution
#30
Face Detection Results
- "Class 57" image from the CMU test set
- Speedup for each compiler is calculated w.r.t. its own sequential execution
#31
Face Detection Results

  Compiler | Best Configuration | Speedup | OMPi improvement
  GCC      | 6 x 4              | 3.219   | 25.5%
  ICC      | 12 x 8             | 3.236   | 25.2%
  SUNCC    | 4 x 4              | 3.378   | 21.9%
  OMPi     | 16 x 8             | 4.327   | -

Best configurations and comparison with OMPi when processing all images.
Speedup is calculated in comparison to the best overall sequential time.
#32
END #33
Parallel for Execution
[Diagram: the four EECBs of the outer team (Thread ID = 0..3) run the implicit tasks P0-P3; the pfor tasks S0-S3 execute on separate EECBs at Level = 2 with their emulated Thread IDs]
#34