A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

Size: px

Start display at page:

Download "A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores"

Gabriella Rich
5 years ago
Views:

1 A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian Vincent Danjean Thierry Gautier Fabien Le Mentec Bruno Raffin Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 1/24

2 Shared Cache of Multicore Processors 1. One core with 2 cache levels L 2 L 1 Core Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

3 Shared Cache of Multicore Processors 1. One core with 2 cache levels 2. Multiple cores with private caches L 2 L 2 L 2 L 2 L 1 L 1 L 1 L 1 Core Core Core Core Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

4 Shared Cache of Multicore Processors 1. One core with 2 cache levels 2. Multiple cores with private caches 3. Multiple cores with private caches and one shared cache L 3 L 2 L 2 L 2 L 2 L 1 L 1 L 1 L 1 Core Core Core Core The sequential has an unfair advantage. Can we still get linear speedup? Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 2/24

5 Scheduling for Efficient Shared Cache Usage Schedule the computation so that shared cache misses do not increase. with a work stealing scheduler allowing efficient dynamic load balancing for parallel loops: no dependencies between tasks 8MB 256KB 256KB 256KB 256KB 32KB 32KB 32KB 32KB Core 0 Core 1 Core 2 Core 3 Xeon Nehalem E5530 (2.4Ghz) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 3/24

6 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops 2. New Scheduler Optimized for Shared Cache 3. Efficient Implementation of the Scheduler 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 4/24

7 Parallel Loops Examples of parallel loops OpenMP #pragma omp parallel for TBB parallel for Cilk parallel for Parallel STL for each or transform Problem characteristics Schedule n iterations on p cores Iterations can be processed independently Time to process one iteration can vary Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 5/24

8 Static Scheduling of Parallel Loops Static Scheduling Allocate n p iterations to each core ex: OpenMP static scheduling core 1 core 2 core 3 core 4 Characteristics low overhead mechanism bad load balancing if workload is irregular Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 6/24

9 Dynamic Scheduling of Parallel Loops OpenMP dynamic scheduling Allocate iterations in chunks of size g All chunks are stored in a centralized list Each thread remove a chunk from the list and process it Characteristics Good load balancing Contention on the list Chunk creation overhead g core 1 core 2 core 3 core 4 Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 7/24

10 Scheduling Parallel Loops with Work Stealing Work Stealing Each thread has its own list of tasks (= chunks) If list is empty, steal tasks in a randomly selected list Binary tree of tasks to minimize number of steals: one steal half of the iterations Characteristics steal Good load balancing Contention is reduced Task creation overhead core 1 core 2 core 3 core 4 Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 8/24

11 Scheduling Parallel Loops with Work Stealing [0, 127] [0, 63] [64, 127] [0, 31] [32, 63] [0, 15] [16, 31] terminated tasks ready tasks running tasks Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 9/24

12 Scheduling Parallel Loops with Work Stealing [0, 127] steal [64, 127] [0, 63] [64, 95] [96, 127] [0, 31] [32, 63] [64, 79] [80, 95] [0, 15] [16, 31] terminated tasks ready tasks running tasks Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 9/24

13 Parallel Loops with Xkaapi Xkaapi Work stealing library Tasks are created on a steal: reduce task creation overhead Cooperative stealing: The victim stops working to answer work requests The victim can answer to multiple requests at a time Characteristics Good load balancing Low overhead mechanism typedef struct { InputIterator ibeg; InputIterator iend; } Work_t ; // Task void parallel_for (...) { while (iend!= ibeg) do_work ( ibeg++ ) ; } // no more work -> become a thief void splitter ( num_req ) { i = 0 ; size = victim.iend - victim.ibeg ; bloc = size / ( num_req + 1 ) ; local_end = victim.iend ; while ( num_req > 0 ) { thief->iend = local_end ; thief->ibeg = local_end - bloc ; local_end -= bloc ; --num_req ; } } // victim + thieves -> parallel_for Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 10/24

14 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache 3. Efficient Implementation of the Scheduler 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 11/24

15 Reuse Distances [Beyls and D Hollander 01] Number of distinct elements accessed between two accesses to the same element. If first access, reuse distance is infinity. On a fully associative LRU cache of size C: reuse distance C hit reuse distance > C miss h d : number of accesses with a reuse distance d number of cache misses M(C) = d=c+1 h d h d C misses d element access A B A B B C B A C reuse distance cache content A A,B A,B A,B A,B B,C B,C A,B Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 12/24

16 Shared Cache Misses with the Classic Schedule Cores work on elements far away Good temporal locality of the sequential algorithm Cores work on distinct data core 1 core 2 shared cache core 3 core 4 To 1 access by a core corresponds p 1 accesses to distinct data by the other cores Reuse distance is multiplied by p: h seq d = h par p d Number of cache misses M par (C) = d=c+1 h par d = d=c+1 h seq d/p = h d d=c/p+1 C p C seq misses d additional misses h seq d = M seq (C/p) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 13/24

17 A Shared Cache Aware Schedule Cores work at distance at most m r(m) accesses to distinct elements in a window of size m The reuse distance is increased by at most r(m): h seq d = hd+r(m) win Number of cache misses M win (C) d=c+1 r(m) Small Overhead over Sequential good sequential locality r(m) small h d small for large d h seq d = M seq (C) + h d m C d=c+1 r(m) r(m) shared cache h seq d seq misses C d additional misses Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 14/24

18 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache shared cache 3. Efficient Implementation of the Scheduler 4. Experiments m Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 15/24

19 Implementing the Shared Cache Aware Schedule Using a standard parallel for loop: StaticWindow Divide the iterations in n/m chunks of size m Each chunk is processed in parallel with a standard parallel for Two versions: pthread and Xkaapi Pthread version each thread processes 1/p of a chunk wait on a barrier after each chunk shared cache Xkaapi version m each chunk is processed in parallel with a parallel for Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 16/24

20 Implementing the Shared Cache Aware Schedule Optimized implementation using Xkaapi: SlidingWindow Processing iteration i enables iteration i + m Master thread is at the beginning of the sequence On a steal, the master can give work In the interval [ibeg, iend[ like the other workers In the interval [ilast, ibeg + m[ enabled since the last steal ibeg iend ilast Processed Elements Master Work Stolen Work Remaining Elements m-size window typedef struct { InputIterator ibeg; InputIterator iend; } Work_t ; // Task typedef struct { InputIterator ibeg; InputIterator iend; InputIterator ilast; } Master_Work_t ; // Master Task Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 17/24

21 Overview Cache Efficient Work Stealing Scheduling for Parallel Loops 1. Standard Schedulers for Parallel Loops steal core 1 core 2 core 3 core 4 2. New Scheduler Optimized for Shared Cache shared cache 3. Efficient Implementation of the Scheduler m ibeg iend ilast Processed Elements Master Work Stolen Work Remaining Elements m-size window 4. Experiments Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 18/24

4 2MB M seq(c) 8MB Iterate through all cells in the mesh Interpolate surface inside each cell

22 Application: Isosurface Extraction Isosurface extraction Common scientific visualization filter Memory bounded Algorithm ( C ) M seq p MB M seq(c) 8MB Iterate through all cells in the mesh Interpolate surface inside each cell Cells close in the mesh share points Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 19/24

23 Experiments 2 processors Opteron: 2 Dualcores with private L 1 and L 2 caches Nehalem: Quadcore with private L 1, L 2 caches and shared L 3 2 schedules NoWindow: classic schedule (Static or Sliding) Window: shared cache aware schedule 7 implementations NoWindow: Pthread, TBB and Xkaapi StaticWindow: Pthread, TBB and Xkaapi SlidingWindow: Xkaapi Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 20/24

24 Synchronization overhead Speedup on 4 cores on Opteron NoWindow StaticWindow 0 Pthread TBB Xkaapi Pthread < TBB < Xkaapi On Opteron: no shared cache Window < NoWindow more synchronizations without gain in cache misses Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 21/24

25 Window Size m L3 cache misses Static Seq. No window size m On 4 cores of Nehalem Shared 8MB L 3 cache For small m: M window M seq (C) M no window M seq ( C p M no window = ( C ) M seq = p ) Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 22/24

26 Speedup and Cache Misses on Nehalem Speedup T seq /T par L 3 misses ratio M par /M seq No Static No StaticSliding 0 No Static No StaticSliding Pthread Xkaapi Pthread Xkaapi 4 cores of Nehalem with 8MB of shared cache L 3 Best performance: SlidingWindow Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 23/24

27 Conclusion Shared Cache Aware Scheduler Shared cache aware scheduler for parallel loops Efficient implementation using work stealing For application with good sequential locality the window strategy is as good as if each core had its own copy of the L 3 cache shared cache m Future work Experiment with other applications Automatically find window size with reuse distance histogram Cache-oblivious version (cache size unknown)? Marc Tchiboukdjian A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores 24/24

A Work Stealing Algorithm for Parallel Loops on Shared Cache Multicores

A Work Stealing Algorithm for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Lementec, and Bruno Raffin MOAIS Project, INRIA- LIG ZIRST 51, avenue