A Lightweight OpenMP Runtime

A Lightweight OpenMP Runtime -- OpenMP for Exascale Architectures
Alexandre Eichenberger, Kevin O'Brien -- T.J. Watson, IBM Research -- 6/26

Goals

Thread-rich computing environments are becoming more prevalent:
- more computing power, more threads
- less memory relative to compute

There is parallelism, and it comes in many forms:
- hybrid MPI - OpenMP parallelism
- mixed-mode OpenMP / Pthread parallelism
- nested OpenMP parallelism

We have to exploit parallelism efficiently:
- providing ease of use for casual programmers
- providing full control for power programmers
- providing timing feedback

Objectives of the Lightweight OpenMP Runtime

Handle more threads:
- lower OpenMP overheads: lower scalar overheads (Amdahl's law), better scaling of overheads (more threads)
- develop new algorithms inside the research runtime

Handle nested parallelism, with more control via thread affinity:
- more user input on how to map computation to threads
- currently: no affinity support is provided to the user
- proposed a new thread-affinity mechanism to the OpenMP standard committee
- contributed a reference implementation in the research runtime

To do: provide timing feedback:
- users want to know where the time is spent
- feedback at little overhead

Part 1: Handle More Threads

- Impact of overheads
- Approach for near constant-time parallel-region creation
- Results on BG/Q

Impact of Overhead in Prevalent Threading Models

MPI: distributed processes across/within nodes; explicit user-managed communication. Lots of parallelism; overheads have low impact.

Coarse-grain parallel (OpenMP/auto): shared memory within nodes/cores, for outer parallel loops. Impact of OpenMP overheads: little.

Fine-grain parallel (OpenMP/auto): shared memory within cores/nodes, for inner parallel loops. Impact of OpenMP overheads: high.

Basic OpenMP Operation: Parallel Region

[Timeline figure: master thread t0 and worker threads t1-t4. The master does sequential work; at the beginning of the parallel region it recruits threads (avail:), assigns thread IDs (tid:), assigns work (work:), and initializes the participating threads' state (state:); all threads then execute the parallel work; the end of the parallel region is a barrier & cleanup, after which the master resumes sequential work. Shading distinguishes sequential overheads, parallel overheads, and useful work.]
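To make the fork/join points concrete, here is a minimal OpenMP parallel region in C; the comments mark where the recruiting, signaling, and barrier costs from the timeline occur. The loop body and array size are illustrative, not from the slides.

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void) {
    double a[N];

    /* Sequential work: only the master thread runs here. */
    for (int i = 0; i < N; i++) a[i] = i;

    /* Entering the region incurs the sequential overheads:
       find available threads, assign IDs, assign work, signal ready. */
    #pragma omp parallel
    {
        /* Parallel overhead: each worker initializes its thread state. */
        #pragma omp for
        for (int i = 0; i < N; i++)   /* useful work */
            a[i] = a[i] * 2.0;
        /* Implicit barrier at the end of the region, then cleanup. */
    }

    /* Sequential work resumes on the master thread. */
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```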

Source of Overheads, Due to the OpenMP Standard

[Timeline figure: the same threads t0-t4, showing each action, when it happens, and the data it touches.]

Beginning of region (sequential overheads on the master):
1. find 3 available threads (data: avail)
2. assign thread IDs (data: tid)
3. assign work (data: work; the work descriptor holds fct, loopinfo, state)
4. signal ready
5. initialize thread state (data: state; parallel overhead on the workers)

Parallel work, then end of region:
6. barrier
7. cleanup
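The slide's work descriptor (fct, loopinfo, state) might look roughly like the sketch below; the field names and types are assumptions for illustration, not the actual LOMP data structure.

```c
/* Hypothetical work descriptor, modeled on the slide's
   "fct / loopinfo / state" fields; not the actual LOMP layout. */
typedef struct {
    void (*fct)(void *args);   /* outlined parallel-region function */
    struct {
        long lb, ub, incr;     /* loop bounds for worksharing, if any */
        int  sched;            /* schedule kind (static, dynamic, ...) */
    } loopinfo;
    void *state;               /* per-region shared state */
} work_desc_t;
```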

Optimized OpenMP Runtime Design

Systematic re-design to lower overheads:

- Eliminate sequential overhead: reuse previous thread allocations; in practice, a near-100% hit rate.
- Extremely compact state: minimize initialization/cleanup.
- Use hardware support: atomic instructions (atomic increment / xor).

[Timeline figure: the same steps as before (find avail threads, assign IDs, assign work, signal ready, initialize state, parallel work, barrier, cleanup) over threads t0-t4.]
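As an illustration of the "hardware support" point, here is a minimal sketch of a counter-based, sense-reversing barrier built on atomic instructions, in the spirit of (but not copied from) the LOMP runtime; the names and the sense-reversal scheme are assumptions.

```c
#include <stdatomic.h>

/* Sense-reversing barrier using one atomic counter: a sketch of the
   kind of atomic-instruction-based synchronization the slides mention,
   not the actual LOMP barrier. Initialize count = nthreads, sense = 0,
   and each thread's local_sense = 0. */
typedef struct {
    atomic_int count;   /* threads still to arrive */
    atomic_int sense;   /* flips each barrier episode */
    int nthreads;
} barrier_t;

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;          /* per-thread episode flag */
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin; a real runtime would back off or yield */
    }
}
```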

Optimization Guiding Principles

- Cache configurations to eliminate computation & communication: reuse as much as possible when nothing has changed.
- Minimum locking: one lock protects the thread-allocation data structure, and it is locked only when recruiting / freeing threads; everything else uses atomic operations.
- Use global state sparingly: the work descriptor is used only for parallel regions; most other OpenMP constructs use no work descriptors.
- Allocate state statically, and initialize it mostly statically: barriers use counters initialized when OpenMP initializes; some local state is initialized only on first use.

Example: Caching Worker Configurations

[Figure: worker IDs 0-15 shown in successive snapshots: the initial state, (A) the end of a parallel for on masters t0 and t8, and (B)/(C) the start of a new parallel for on t0 and t8.]

When freeing workers: leave the workers in a reserved state (A: end of region on t0).

When recruiting workers: avoid stealing workers that were reserved by others (B: start on t0); aim at reusing workers that were previously reserved by this master (C: start on t0).
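A sketch of the reservation idea in C; the data structures, function names, and the single allocation lock are illustrative assumptions, not the actual LOMP code.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_WORKERS 64

/* Hypothetical per-worker reservation record (not the LOMP layout). */
typedef struct {
    int  reserved_by;   /* master thread ID, or -1 if unreserved */
    bool active;        /* currently executing a parallel region? */
} worker_slot_t;

static worker_slot_t slots[MAX_WORKERS];
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER; /* the one lock */

void init_slots(void) {
    for (int w = 0; w < MAX_WORKERS; w++)
        slots[w] = (worker_slot_t){ .reserved_by = -1, .active = false };
}

/* End of region: keep the reservation, so the next region started by
   the same master reuses its workers without taking the lock. */
void free_workers(const int *ids, int n) {
    for (int i = 0; i < n; i++)
        slots[ids[i]].active = false;     /* reserved_by is unchanged */
}

/* Start of region: first reuse workers this master already reserved
   (fast path; a real runtime would use atomics here), then recruit
   unreserved workers under the lock. Never steal workers reserved by
   another master. */
int recruit_workers(int master, int *ids, int n) {
    int found = 0;
    for (int w = 0; w < MAX_WORKERS && found < n; w++) {
        if (slots[w].reserved_by == master && !slots[w].active) {
            slots[w].active = true;
            ids[found++] = w;
        }
    }
    if (found < n) {                      /* slow path: take the lock */
        pthread_mutex_lock(&alloc_lock);
        for (int w = 0; w < MAX_WORKERS && found < n; w++) {
            if (slots[w].reserved_by == -1) {
                slots[w].reserved_by = master;
                slots[w].active = true;
                ids[found++] = w;
            }
        }
        pthread_mutex_unlock(&alloc_lock);
    }
    return found;   /* number of workers actually recruited */
}
```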

OpenMP Micro-Benchmark (EPCC) Results

[Figure: contributions of the individual optimizations to each phase of the parallel region on threads t0-t4: find avail threads, assign IDs, assign work, signal ready, initialize state, parallel work, barrier, cleanup.]

Overhead Scaling for Parallel Region (EPCC)

[Figure: parallel-region overhead versus thread count.] Overhead is nearly constant over a wide range of thread counts.

Note: LOMP is an experimental runtime that implements a subset of all OpenMP functionality; performance will be impacted until full functionality is provided.
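For reference, the EPCC method measures construct overhead by comparing a parallelized loop against the same work run sequentially; below is a minimal sketch of that method (the kernel, repetition count, and delay are illustrative, not the actual EPCC source).

```c
#include <omp.h>
#include <stdio.h>

#define OUTER 1000   /* repetitions of the measured construct */

void delay(int n) {              /* dummy work inside the region */
    volatile double a = 0.0;
    for (int i = 0; i < n; i++) a += i * 0.5;
}

int main(void) {
    /* Reference: same per-thread work, no OpenMP construct. */
    double t0 = omp_get_wtime();
    for (int k = 0; k < OUTER; k++)
        delay(100);
    double ref = omp_get_wtime() - t0;

    /* Measured: one parallel region per iteration; every thread does
       the same delay, so the extra elapsed time is overhead. */
    t0 = omp_get_wtime();
    for (int k = 0; k < OUTER; k++) {
        #pragma omp parallel
        delay(100);
    }
    double par = omp_get_wtime() - t0;

    printf("parallel region overhead: %g us per region\n",
           (par - ref) * 1e6 / OUTER);
    return 0;
}
```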

Observations

Creating a parallel region with up to 64 threads: overheads are now reduced to below 2K cycles. These are preliminary numbers and will change as we support full OpenMP.

While we have substantially reduced overheads, some of the remaining overheads are mandated by the OpenMP standard, and others are due to necessary locks / barriers / msyncs; compiler optimization can further reduce overheads in some cases.

Barriers become the dominant factor at higher thread counts.

Part 2: Efficient Nested Parallelism

Examples of requests that are not currently possible:
- get threads on separate cores, to get more L1 cache
- get threads collocated on the same core, to maximize cache reuse

Current runtimes have a fixed policy: the runtime tries to even out the load balance across the machine. This works well for a single level of parallelism, but not as well for nested parallelism.

We want to allow users to specify where to get threads, via broad policies that cover most cases, and to specify where threads are allowed to migrate for load-balancing purposes.

OpenMP Affinity Proposal

Define the concept of an OpenMP place:
- a set of one or more logical processors on which OpenMP threads execute
- OpenMP threads may migrate within one place

Let the user specify their own set of places:
- by default, the system defines its own list of places
- in MPI hybrid mode, the mpirun script would define the set of places

Let the user specify how to recruit threads for an OpenMP parallel region:
- MASTER: put threads in the same place as the master
- CLOSE: put threads close to the master; reduce false sharing, distribute among places
- SPREAD: spread threads across the machine; reduce the overhead of threads sharing the same core, optimize memory bandwidth by exploiting all cores/sockets

How to Use Place Lists

Consider a system with 2 chips, 8 cores, and 16 hardware threads:

chip 0: core 0 (t0,t1), core 1 (t2,t3), core 2 (t4,t5), core 3 (t6,t7)
chip 1: core 4 (t8,t9), core 5 (t10,t11), core 6 (t12,t13), core 7 (t14,t15)

One place per hardware thread:
OMP_PLACES=hwthread
OMP_PLACES=(0),(1),(2),...,(15)

One place per core, including both hardware threads:
OMP_PLACES=core
OMP_PLACES=(0,1),(2,3),(4,5),...,(14,15)

One place per chip, excluding the first hardware thread of each chip:
OMP_PLACES=(1,2,...,7),(9,10,11,...,15)
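For readers on a current compiler: this proposal was later standardized, with OpenMP 4.0 adopting the keywords threads/cores/sockets and brace-delimited place lists such as OMP_PLACES="{0,1},{2,3}", and OpenMP 4.5 adding query routines. A small hedged example of inspecting the resulting places from C:

```c
#include <omp.h>
#include <stdio.h>

/* Inspect the place list the runtime built from OMP_PLACES.
   Run e.g. with: OMP_PLACES="{0,1},{2,3},{4,5},{6,7}" ./a.out
   (omp_get_num_places / omp_get_place_num are OpenMP 4.5 routines). */
int main(void) {
    printf("number of places: %d\n", omp_get_num_places());
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d runs in place %d\n",
               omp_get_thread_num(), omp_get_place_num());
    }
    return 0;
}
```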

CLOSE Policy

Compact: select OpenMP threads in the same place as the master; consider the next place(s) when the master's place is full.

Example with OMP_PLACES=hwthread on the machine above:

[Figure: snapshots over t0-t15 for close with 2, 4, and 8 threads, marking the master and its workers; with one place per hardware thread, the workers fill the hardware threads adjacent to the master.]

* technically: omp parallel num_threads(2) affinity(close), and likewise for the larger thread counts

SPREAD Policy

Spread OpenMP threads as evenly as possible among the places.

Example with OMP_PLACES=hwthread:

[Figure: snapshots over t0-t15 for spread with 2, 4, and 8 threads, marking the master and its workers; the threads land at even strides across the machine.]

* technically: omp parallel num_threads(2) affinity(spread), and likewise for the larger thread counts

Spread Policy Partitions the Machine

Spread also implicitly partitions the machine, so that nested parallel regions get threads only from their own subset of the machine.

Example: spread with nested, compact parallel regions.

[Figure: starting from the initial state on t0-t15, an outer spread region places the masters at even strides and partitions the machine between them; each nested close region then recruits workers only within its master's partition.]
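In the syntax that was eventually standardized (the proc_bind clause, rather than the affinity clause proposed on these slides), the partitioning example might be written as the following hedged sketch; the thread counts are illustrative.

```c
#include <omp.h>
#include <stdio.h>

/* Outer region spreads 2 masters across the machine; each master's
   nested region then recruits close neighbors from its own partition.
   Uses the standardized proc_bind clause (the slides' draft called it
   "affinity"). Run with e.g. OMP_PLACES=threads. */
int main(void) {
    omp_set_max_active_levels(2);   /* enable one level of nesting */
    #pragma omp parallel num_threads(2) proc_bind(spread)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4) proc_bind(close)
        {
            #pragma omp critical
            printf("outer %d / inner %d on place %d\n",
                   outer, omp_get_thread_num(), omp_get_place_num());
        }
    }
    return 0;
}
```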

Observations

Give the user more fine-grain control over:
- which hardware threads / cores / chips to use
- which threads to select for a given parallel region (e.g., spread vs. compact)
- where threads are allowed to migrate (within a place)

Ongoing work: implemented in our research OpenMP runtime; currently under review by the OpenMP Standard Language Committee.

Part 3: Providing Timing Info

Possible approaches:
- callbacks
- statistical sampling (requires interrupt support)
- embedded timing (using low-overhead hardware timers)

We experimented with the second approach; approximate overheads are on the order of 100 cycles per OpenMP construct (for reference: a 64-thread barrier costs roughly 800-1000 cycles, a parallel region 1000-2000 cycles).

Questions:
- what is needed by users?
- what is needed by tool developers?
- what info can be provided cheaply?

Timing Info: Cost Evaluation

Timing is relatively cheap on POWER:
- get a local timer (a register move)
- save the difference of 2 timer values (one store)

Saving a current state (idle-barrier / idle-lock): one store per transition.

Callbacks: load the enabled/disabled value, one branch. BUT having a call has a performance impact on an optimized runtime:
- in the optimized runtime, everything is inlined (except calls to outlined functions)
- calls force caller-saved registers back into memory (potentially 10+ loads/stores)
- we have seen overheads of 100+ cycles just for one additional function call
- callbacks are cheaper if they are located just before/after the outlined function calls
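A hedged sketch of the embedded-timing approach: read a hardware timer around a construct and accumulate the delta, matching the cost breakdown above (a register move plus one store). __builtin_ppc_get_timebase is a GCC builtin on POWER; the portable fallback and the accumulator layout are illustrative assumptions.

```c
#include <stdint.h>
#include <time.h>

/* Low-overhead timestamp: on POWER this is a register move from the
   time base; elsewhere we fall back to a portable (slower) clock. */
static inline uint64_t timestamp(void) {
#if defined(__powerpc64__) && defined(__GNUC__)
    return __builtin_ppc_get_timebase();
#else
    struct timespec ts;                   /* illustrative fallback */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
#endif
}

/* Hypothetical per-thread accumulator for time spent in barriers. */
typedef struct { uint64_t barrier_ticks; } lomp_stats_t;

void timed_barrier(lomp_stats_t *stats, void (*barrier_impl)(void)) {
    uint64_t t0 = timestamp();            /* register move */
    barrier_impl();                       /* the actual barrier */
    stats->barrier_ticks += timestamp() - t0;  /* one add + one store */
}
```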