Efficient Work Stealing for Fine-Grained Parallelism


Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009

Task parallel fib in Wool:

    TASK_1( int, fib, int, n )
    {
      if( n < 2 ) {
        return n;
      } else {
        int a, b;
        SPAWN( fib, n-2 );
        a = CALL( fib, n-1 );
        b = SYNC( fib );
        return a + b;
      }
    }

Two kinds of fine-grainedness:
- Task granularity: how often are tasks spawned? G_T = T_S / N_T
- Load balancing granularity: how often must load balancing (migration, stealing) be done? G_L = T_S / N_M

Here T_S is the serial run-time with no parallelism overhead, N_T is the number of tasks spawned, and N_M is the number of migrations (steals, in a work stealing implementation).

The stress program: repeat r times: spawn a tree of tasks of depth d (the slide's figure shows one repetition with d = 3); the leaves each run an empty loop C for n iterations (2n cycles).

Fine-grain tasks and fine-grain load balancing: [speedup plots on 1-8 cores for fib(42) and stress 4096 (3, 128K), comparing Wool, Cilk, TBB, and OpenMP]

Basic structures

The tasks are scheduled on top of worker threads, one per core. Each worker has a worker descriptor containing:
- a task pool with ready tasks for other workers to steal
- a lock protecting the task pool

Each task is represented by a task descriptor with:
- a pointer to the code to run
- arguments for the code
- space for the return value
- a pointer to the thief, if stolen

Designing for fast inlined tasks

The task pool is a stack managed by a top pointer in the worker descriptor:
- push on SPAWN, pop on SYNC
- thieves use a bot pointer, also in the worker descriptor
- the pool contains task descriptors, not pointers, which makes memory management simple

Most of the design follows from this.

Optimizing inlined tasks: Synchronize on task

SYNC (join) needs to synchronize with thieves, so in the baseline it takes the lock. We want to avoid taking the lock on every SYNC:
- it writes to the worker descriptor (making subsequent thief accesses miss in the cache)
- it is a slow operation

Instead, thief and victim synchronize with an atomic swap on a flag in the task descriptor. Thieves still take the lock in the worker descriptor.

Optimizing inlined tasks: Task specific join

Generate a specialized SYNC for each task (rather than using the generic SYNC in the RTS):
- it knows which task to call when inlining, so it can use a direct call rather than a call via the pointer in the task descriptor
- it knows the type of the return value, so it can pass that in the standard way rather than updating it via a pointer

When inlining, this optimization replaces three calls:
1. application to SYNC (an RTS function)
2. RTS to wrapper function (indirect call)
3. wrapper function to task function

with two:
1. application to the specialized SYNC (inlinable, defined in a header)
2. specialized SYNC to the task function (within the same file)

Optimizing inlined tasks: Private tasks

Avoid the atomic swap on each SYNC by making some tasks private:
- a private task cannot be stolen, so no synchronization is needed (the task descriptor is still built)
- private tasks can become public at the discretion of the owner
- the owner must check for the need for more public tasks: thieves notify the owner when only n public tasks remain

Results for inlining optimizations

    Version                      Time (s)   Overhead (cyc)
    Base                           18.9          77
    Synchronize on task             7.8          29
    Task specific join              5.9          19
    Private tasks (no private)      6.0          19
    Private tasks (all private)     3.0           3
    Seq                             2.4           0

Measured by timing the parallel version of fib(42) on a single processor. Overhead is calculated as (T_1 - T_S) / N_T, that is, the time difference divided by the number of SPAWNs; it measures the marginal overhead over a procedure call.

Optimizing steals: peek

Before trying to lock a victim, check whether it has work:
- if the victim has no work, the thief does no write
- on a cache coherent machine, several thieves can cache the relevant part of the worker descriptor, and hence spin locally
- this is important when work is hard to find (low parallelism)
- when a worker spawns, its write notifies the thieves by means of the coherence protocol

Optimizing steals: trylock

When a thief finds a victim with work, it uses pthread_mutex_trylock rather than pthread_mutex_lock. If the lock is not free, it tries another victim:
- contention is expensive
- other workers might also have work

Optimizing steals: nolock

Get rid of the lock on the worker descriptor altogether. We already have mutual exclusion between thieves and owner through the atomic swap on the task descriptor. This almost gives mutual exclusion on the worker descriptor (bot), since only the task that bot points to can be stolen, and bot is only updated upon a successful steal.

Optimizing steals: nolock

However, a long delay is possible between the read of bot and the atomic swap (scheduling, interrupts, ...):
1. Thieves 1 and 2 both read bot = 3
2. Thief 1 steals task 3, then finishes it
3. The owner joins with task 3, then with tasks 2 and 1
4. The owner spawns several new tasks
5. Thief 2 steals (the new) task 3

Now tasks are stolen out of order; if thief 2 updates bot, tasks 1 and 2 become invisible until joined with. Solution: only update bot when it still points at the stolen task.

Optimizing steals: stress tests: [speedup plots on 1-8 cores for stress with (d, n) = (8, 65536), (9, 32768), (10, 16384), (11, 8192), (12, 4096), and (13, 2048), comparing Base, +peek, +trylock, and +nolock]

Comparing Wool with Cilk++, TBB, and OpenMP

    System    Inlined        2         4         8
    Wool        3/19      2 200     5 600    10 400
    Cilk++    134        31 050    73 600   110 400
    TBB       323         5 800    14 000    30 000
    OpenMP    878         4 830     9 200    20 240

The column labelled Inlined gives the cost of inlined tasks computed using fib. The columns labelled 2, 4 and 8 give the per-repetition overhead of stress for a tree of depth 1, 2 and 3 on 2, 4 and 8 processors (respectively), over a tree of depth 0 on one processor (with the same number n of leaf loop iterations).

More measurements: Cholesky: [speedup plots on 1-8 cores for cholesky (rows, nonzeros, repetitions) at (500, 2K, 1024), (1000, 4K, 256), (2000, 8K, 64), and (4000, 16K, 16), comparing Wool, Cilk, TBB, and OpenMP]

More measurements: Matrix multiply: [speedup plots on 1-8 cores for mm (rows, repetitions) at (64, 16384), (128, 2048), (256, 256), and (512, 32), comparing Wool, Cilk, TBB, and OpenMP]

More measurements: Sub String Finder: [speedup plots on 1-8 cores for ssf (fibs, repetitions) at (12, 16384), (13, 8192), (14, 4096), and (15, 2048), comparing Wool, Cilk, TBB, and OpenMP]