ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING


ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING. Keynote, ADVCOMP 2017, November 2017, Barcelona, Spain. Prepared by: Ahmad Qawasmeh, Assistant Professor, The Hashemite University, Jordan.

Outline
1. Motivation and Background
2. Related Work
3. APARF Framework Implementation (OpenUH): OpenMP tasking profiling APIs; OpenMP profiling tool and performance analysis; a hybrid machine learning model for adaptive prediction
4. Analysis and Evaluation
5. Summary and Future Work

Motivation and Goal: predict the optimal task scheduling scheme for a given OpenMP program by developing an adaptive and portable framework.

Main Contributions
A. I proposed a new open-source API for OpenMP task profiling in the OpenUH RTL.
B. I developed a reliable OpenMP profiling tool for capturing useful low-level runtime performance measurements.
C. I exploited my performance framework to perform a comprehensive scheduling analysis study.
D. I built and evaluated a portable framework (APARF) for predicting the optimal task scheduling scheme that should be applied to new, unseen applications.

Background

Shared Memory: Logical View. Processors proc1 through procn access a single shared memory space; the two main organizations are SMP and cc-NUMA.

OpenMP API

A standard API for writing parallel shared memory applications in C, C++, and Fortran. It consists of compiler directives, runtime routines, and environment variables. A master thread forks a team of worker threads at each parallel region (regions may be nested) and joins them at a barrier. http://www.openmp.org

OpenMP Tasks

A task is an asynchronous work unit containing a task region and its data environment. C/C++: #pragma omp task; Fortran: !$omp task

    int fib(int n) {
        int x, y;
        if (n < 2) return n;
        else {
            #pragma omp task shared(x)
            x = fib(n-1);
            #pragma omp task shared(y)
            y = fib(n-2);
            #pragma omp taskwait
            return x + y;
        }
    }

OpenMP Task Scheduling

Performance Observation: Profiling vs. Tracing

OpenMP performance APIs before OMPT

POMP (Profiler for OpenMP): instrumentation calls inserted by a source-to-source tool (TAU, KOJAK, Scalasca); can notably affect compiler optimizations.

ORA (Collector API): samples the call stack; originally defined 11 mutually exclusive states, 9 requests, and 22 callback events; was accepted as a white paper by the ARB; introduced before tasks and implemented in the OpenUH RTL.

OMPT (OpenMP Tool Interface)

Related Work

Related Work (Adaptive Scheduling)

An OpenMP scheduler was proposed to adapt the granularity of work within loops depending on data-placement information. Some previous works have focused on disabling threads in parallel loops in the presence of contention. A thread scheduling policy embedded in a GOMP-based framework was proposed for OpenMP programs featuring irregular parallelism. Another line of research aims to reduce scheduling overhead by increasing task granularity, either by chunking a parallel loop or by using a cut-off technique.

Characterization Using Machine Learning: machine learning was used to characterize programs into representative groups.

Automatic Portable and Adaptive Runtime Feedback-Driven (APARF) Framework

Task Execution Model in OpenUH

A task moves through the states READY, RUNNING, WAITING, and EXITING:
- ompc_task_create: a created task enters READY; child tasks are added to the task pool (ompc_add_task_to_pool places a task into the pool).
- ompc_task_switch: executes a task removed from the pool; ompc_remove_task_from_pool removes a task from the pool and switches to it.
- ompc_task_wait: if the task still has children (num_children != 0), it enters WAITING; a tied task waits in place, while an untied task is placed on the tail of the thread's untied queue.
- ompc_task_exit: the task enters EXITING and decrements its parent's num_children; the task is destroyed once num_blocking_children == 0, otherwise destruction is deferred until another task's ompc_task_exit completes the count.

http://web.cs.uh.edu/~openuh/

OMPT and ORA Tasking Implementation in OpenUH RTL

I proposed a tasking profiling interface in the OpenUH RTL as an extension to ORA, before OMPT existed, covering task creation, execution, completion, switching, and suspension.

OMPT is a superset of ORA: it supports call-stack sampling with optional trace-event generation. State support and the task-creation and task-completion events are mandatory, while the others are optional. Adapting my tasking APIs to be compatible with OMPT was straightforward.

Overhead Analysis in OpenUH RTL

Adaptive Scheduling Through APARF

Scheduling-scheme search space:
- Queue organization: global, distributed, hybrid, hierarchical
- Task placement and removal: LIFO, FIFO, Deque, INV-Deque
- Chunk size, NUM-SLOTS, STORAGE

Program characteristics fed to the framework: compute/memory bound, granularity (fine vs. coarse grain), number of task directives, nested tasks and structure.

APARF pipeline: OpenMP program → OMPT tasking API in the OpenUH RTL → UH profiling tool and performance system → data pre-processing → hybrid machine learning model (clustering, then classification into Class 1 / Class 2 / Class 3) → adaptive feedback selecting the scheduling scheme.

Interaction Example in APARF

Ahmad Qawasmeh, Abid Malik, Barbara Chapman, Kevin Huck, Allen Malony. "Open Source Task Profiling by Extending the OpenMP Runtime API", IWOMP 2013, pp. 186-199, September 2013, Canberra, Australia.

APARF OpenMP Profiling Tool

Implements a single handler for all events. Initializes the API to establish a connection with the runtime. Captures useful low-level runtime performance measurements: timing, hardware counters, and energy/power sensors were integrated. (The fib tasking example from the earlier slide serves as the running profiling target.)

OpenMP Task Scheduling Analysis

An OpenMP task scheduler can be distinguished by its queue organization, its work-stealing capability, and the order in which the task graph is traversed.

Two crucial issues must be managed by a task scheduler: data locality and load balancing.

Conflicting goals: queue contention, work-stealing, and synchronization overheads versus task granularity (coarse vs. fine).

Analysis Setup in OpenUH

We performed a detailed analysis study: 200 scheduling schemes were applied to eight BOTS benchmarks, using three different thread counts and two input sizes. Initial observation: the benchmarks can be categorized into three representative groups.

Analysis Setup

We used our performance framework. The captured runtime events are: task suspension, task execution, task completion, task creation, explicit/implicit barrier, parallel region, and single/master/loop region. Exploiting data locality is best assessed through cache behavior (cache misses, CPI, TLB), while load balancing was evaluated by obtaining the timing distribution among threads for each captured event.

A. Qawasmeh, A. Malik, B. Chapman. "OpenMP Task Scheduling Analysis via OpenMP Runtime API and Tool Visualization", 2014 IEEE 28th IPDPSW, pp. 1049-1058, May 2014, Phoenix, Arizona, USA.

Similarity Among Benchmarks


Hybrid Machine Learning Modeling

Why machine learning? Measurements are obtained from the runtime by an external tool, regardless of the runtime or compiler used, yielding 384 data instances with 14 selected features: overwhelming for human processing.

What does hybrid mean in our context? Unsupervised learning (K-Means clustering) combined with supervised learning (classification).

Major challenges: a complex search space, a limited number of task-based programs for training, and feature selection. The model is a Java tool built on the Weka API.

Classification Process for Prediction

Feature ranking:

    Rank    Attribute
    1.519   c_suspend
    1.18    c_execution
    1.169   c_parallel
    0.987   t_finish
    0.919   t_creation
    0.895   t_suspend
    0.841   c_single
    0.83    c_creation
    0.793   t_execution
    0.678   c_finish
    0.66    c_barrier
    0.5     t_parallel
    0.489   t_barrier
    0.488   t_single

Training Data Improvement

    Benchmark   Improvement (AMD)   Improvement (Intel)
    Fib         26%                 35%
    Health      30%                 38%
    Sort        21%                 19%
    FFT         10%                 13%
    Nqueens      9%                 18%
    Strassen     8%                  8%
    Alignment    3%                  3%
    Sparse       4%                  5%

Training Data Behavior: plot of normalized data per runtime event.

Portable Prediction Behavior

93% prediction accuracy

    Program     Predicted Class
    UTS         24 public(1)
    Floorplan   16 simple(0), 8 default(2)
    EPCC        24 public(1)
    Whetstone   24 simple(0)
    MD          24 simple(0)

(Figures: MD application; UTS benchmark.)

Performance Improvement for New/Unseen Applications (AMD Opteron and Intel Xeon)

A. Qawasmeh, A. Malik, B. Chapman. "Adaptive OpenMP Task Scheduling Using Runtime APIs and Machine Learning", 2015 IEEE 14th ICMLA, December 2015, Miami, Florida, USA (accepted with a 25% acceptance rate).

Summary / Future Work

Summary and Future Work
A. I proposed a new open-source API for OpenMP task profiling in OpenUH.
B. I developed a reliable OpenMP profiling tool for capturing useful low-level runtime performance measurements.
C. I used my performance framework to perform a comprehensive scheduling analysis study.
D. I built and evaluated a portable framework (APARF) for predicting the optimal task scheduling scheme to apply to new, unseen applications.

Future work: predict energy consumption behavior at the fine-grain level.


Acknowledgement: HPCTools Group at the University of Houston, Texas, USA.