GNU libstdc++ parallel mode: Algorithms
The GNU libstdc++ parallel mode: Algorithms
Institute for Theoretical Computer Science, University of Karlsruhe
Talk Outline
- Introduction
- Library Overview
- Algorithms
- Conclusion
Motivation
How to benefit from multi-core systems?
- automatic parallelization is not sufficient
- manual/explicit parallelization is needed, but expensive and beyond the qualification of most programmers
Our approach
- provide a parallelized library of basic algorithms for shared-memory systems
- provide implementations of all worthwhile STL algorithms
- included with GCC under the name libstdc++ parallel mode, formerly known as the Multi-Core Standard Template Library (MCSTL)
- make the usage of (data-)parallel algorithms very easy: the actual parallelism is not visible to the user, but encapsulated
- build on an established base
- bring multi-core performance to the end user in every program
Basic Approach
Make the usage of (data-)parallel algorithms as easy as winking.
Starting point
- provide the functionality of the C++ Standard Template Library
- run the algorithms in parallel
Why STL?
- many useful algorithms and data structures included
- simple interface, very well known among developers
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language, and standardized
Goals
Ease of use
- no new language, no language extension
- no (binary) library to be installed on the target system
- just a few compiler options
Good performance
- some speedup already for small inputs (scale down)
- full speedup for larger inputs (scale up)
- co-exist with other forms of parallelization; respect machine load
- dynamic load balancing
Competitors
STAPL
- abstracts from the memory model/communication, so it must incorporate distributed-memory issues
- no code publicly available; interface only similar to STL
Intel Threading Building Blocks
- mostly on a more abstract level, a parallel programming framework
- the only parallel combinatorial generic algorithm is the sorter
Technical Foundations
- based on OpenMP (fork-join parallelism)
- switching parallelism on/off both at compile time and at run time
Software stack (top to bottom): applications; STL interface; serial STL algorithms alongside the parallel STL algorithms of the MCSTL; OpenMP extensions and atomic operations; OS thread scheduling; multi-core hardware
Atomic Operations
- a few operations are executed atomically, without any chance of interference
- fetch_and_add(x, i): t := x; x := x + i; return t
  - allows concurrent iteration over a sequence
- compare_and_swap(x, c, r): if x = c then { x := r; return c [true] } else { return r [false] }
  - secure state transition; can emulate fetch_and_add and others by using it in a loop
- slower than the usual operation, in particular when concurrent
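These primitives map directly onto std::atomic in modern C++. A minimal sketch (here compare_and_swap returns a success flag rather than the old value; the function names mirror the slide's pseudocode, not a library API):

```cpp
#include <atomic>
#include <cassert>

// fetch_and_add(x, i): atomically read x, add i, and return the old value.
int fetch_and_add(std::atomic<int>& x, int i) {
    return x.fetch_add(i);
}

// compare_and_swap(x, c, r): if x == c, set x := r; returns whether it succeeded.
bool compare_and_swap(std::atomic<int>& x, int c, int r) {
    return x.compare_exchange_strong(c, r);
}

// fetch_and_add emulated with compare_and_swap in a retry loop, as the slide notes.
int fetch_and_add_via_cas(std::atomic<int>& x, int i) {
    int old = x.load();
    while (!x.compare_exchange_strong(old, old + i)) {
        // on failure, old now holds the current value of x; retry
    }
    return old;
}
```

The CAS-in-a-loop emulation illustrates the slide's point that compare_and_swap subsumes the other primitives, at the cost of retries under contention.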
Overview of Important Algorithms
Strictly STL (mostly <algorithm>)
- for_each and friends (embarrassingly parallel)
- find
- partial_sum (prefix sum)
- partition, partial_sort
- merge
- sort
- random_shuffle
- bulk construction and bulk insert for set and map [1]
Extensions to STL
- multiway_merge
- multiseq_partition (helper)
MCSTL Development Status
(Each entry: algorithm class: function call(s); status; with load balancing (w/ lb); without load balancing (w/o lb).)
- Embarrassingly parallel: for_each, generate(_n), fill(_n), count(_if), transform, replace(_if), min_element, max_element, adjacent_difference, unique_copy; implemented; w/ lb: yes; w/o lb: yes
- Find: find(_if), find_first_of, adjacent_find, mismatch, equal, lexicographical_compare; implemented; w/ lb: yes; w/o lb: not worthwhile
- Search: search(_n); implemented; w/ lb: yes; w/o lb: not worthwhile
- Numerical algorithms: accumulate, partial_sum, inner_product; implemented; w/ lb: planned; w/o lb: yes
- Partition: partition, stable_partition; implemented; w/ lb: yes; w/o lb: not worthwhile
- Merge: merge, multiway_merge, inplace_merge; implemented; w/ lb: planned; w/o lb: yes
- Partial sort: nth_element, partial_sort; implemented; w/ lb: yes; w/o lb: planned
- Sort: sort, stable_sort; implemented; w/ lb: yes; w/o lb: yes
- Random permutation: random_shuffle; implemented; w/ lb: yes; w/o lb: not worthwhile
- Dictionaries: (multi_)map/set bulk operations; w/ lb: yes; w/o lb: yes
- Complex set operations: set_union, set_intersection, set_(symmetric_)difference, ...; implemented; w/ lb: no; w/o lb: yes
- Vector arithmetic: valarray operations; worked on; w/ lb: yes; w/o lb: yes
- Heap construction: make_heap, sort_heap; planned
- Priority queues: amortized update operations; planned
for_each
Problem definition
- execute a certain function on a range of elements
- many similar functions like transform, generate
- parallelization is easy only for uniform execution time per element on an exclusive machine
for_each: Implementation
Multiple implementations, depending on the purpose.
Static load balancing
- divide work into parts of almost equal size
- used for accumulate, since the ends of chunks can easily be spliced (not commutative)
Dynamic load balancing
- initially divide work into parts of almost equal size
- allow unemployed threads to take work from others (work-stealing)
for_each: Work-Stealing
- additional synchronization is done only by threads that are out of work
- steal from a random victim; steal half of the remaining jobs
- the atomic operation fetch_and_add is used to reserve job(s); it is efficiently supported by today's hardware
- the chunk size C allows a compromise between the two worst cases: uniformly distributed, little work vs. skewedly distributed, much work
- maximal slowdown 10 for C = 1 and no work; neutral for C = 10 and no work
- full speedup for hard work, no matter what the distribution
- a logarithmic number of steals suffices with high probability
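The chunk-reservation step can be illustrated with std::atomic. This single-threaded sketch shows only how a worker claims C jobs at a time with one fetch_and_add on a shared counter; the stealing between threads is omitted, and dynamic_for_each is an illustrative name, not the library's:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

// Each worker repeatedly claims the next `chunk` jobs with a single
// fetch_and_add on a shared counter; no locks are required. In the library,
// every OpenMP thread runs such a claiming loop (and may additionally steal
// half the remaining jobs of a random victim); here the loop runs
// single-threaded for clarity.
template <typename Func>
void dynamic_for_each(std::vector<int>& data, Func f, long chunk = 10) {
    std::atomic<long> next(0);
    const long n = static_cast<long>(data.size());
    for (;;) {
        long begin = next.fetch_add(chunk);  // atomically reserve a chunk
        if (begin >= n) break;
        long end = std::min(begin + chunk, n);
        for (long i = begin; i < end; ++i) f(data[i]);
    }
}
```

Because the counter is atomic, several threads could run the same loop concurrently and each index would still be processed exactly once.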
for_each: Performance Results
[Figure: speedup of a Mandelbrot computation (at most 1000 iterations per pixel) on a 4-way Opteron vs. number of pixels, for 2-4 threads with balanced and unbalanced work, against sequential.]
find
Problem definition
- find the first position in a sequence that matches/satisfies a predicate
- find_if also immediately covers find, adjacent_find, mismatch, equal, lexicographical_compare
Considerations
- the sequential algorithm needs O(m) time if the first hit is at position m
- a naive parallel algorithm needs Ω(n/p) = Ω(m) time if m = n/p - 1 (worst case)
- parallelization overhead makes the situation even worse for small m
find: Algorithm
Solution
- start sequentially up to position m0
- only then start assigning blocks to the p threads, dynamically load-balancing using the fetch-and-add primitive
- the first thread that is successful signals success to all others by grabbing the remaining part
Tradeoff
- small blocks lower the termination latency, but increase overhead
- solution: exponentially grow the block size from the starting point m0 up to m
[Diagram: a sequential prefix up to m0, followed by blocks assigned in parallel to threads p0, p1, p2, ...]
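The block schedule described above can be sketched sequentially. The parameters m0, s0 and the growth factor are illustrative, not the library's actual tuning, and in the parallel version the blocks would be claimed by threads via fetch_and_add:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the index of the first element satisfying p, or v.size() if none:
// a sequential prefix of m0 elements, then blocks whose size grows
// exponentially, bounding both termination latency (small early blocks)
// and per-block overhead (few large late blocks).
template <typename Pred>
long find_blocked(const std::vector<int>& v, Pred p,
                  long m0 = 4, long s0 = 2, double growth = 2.0) {
    const long n = static_cast<long>(v.size());
    for (long i = 0; i < m0 && i < n; ++i)      // 1. sequential start
        if (p(v[i])) return i;
    long pos = std::min(m0, n);
    double size = static_cast<double>(s0);
    while (pos < n) {                           // 2. exponentially growing blocks
        long end = std::min(pos + static_cast<long>(size), n);
        for (long i = pos; i < end; ++i)
            if (p(v[i])) return i;
        pos = end;
        size *= growth;
    }
    return n;                                   // not found, as with std::find
}
```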
find: Performance Results
[Figure: speedup vs. sequence length (up to 10^7) for 2-32 threads, comparing growing blocks (gb), fixed-size blocks (fsb), and the naive algorithm against sequential.]
merge, multiway_merge
Problem definitions
- merge: combine two sorted sequences into one sorted sequence
- multiway merge: combine k > 2 sorted sequences into one sorted sequence; important for (external memory) sorting
How to divide the problem?
- find slabs, i.e. consistent sets of ranges from the sequences
- two possibilities: (randomized) splitting by sampling, or exact partitioning into slabs of equal size (using multi-sequence selection)
merge, multiway_merge: Sequence Partitioning
Exact splitting vs. sampling
- performance guarantee: no bad inputs such as long sequences of equal elements
- complicated algorithm for multi-sequence selection [4]
- first generic implementation provided; explicitly handles degenerate inputs
Algorithm
- divide the sequences into p slabs of almost equal size
- determine the target positions; in parallel, merge the slabs to the target positions
- total running time O((m/p) log k + k log k log max_j |S_j|), where m = sum_j |S_j| is the accumulated length of all sequences
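As an illustration of the merge step itself, here is a plain sequential k-way merge using a min-heap of (value, sequence) pairs. This is not the library's implementation; the parallel algorithm first splits all sequences into p equal-size slabs via multi-sequence selection and runs such a merge per slab:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sequential k-way merge: the heap always holds the current front element
// of every non-exhausted sequence, so each pop yields the globally
// smallest remaining element.
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& seqs) {
    using Item = std::pair<int, std::size_t>;        // (value, sequence index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(seqs.size(), 0);
    for (std::size_t k = 0; k < seqs.size(); ++k)
        if (!seqs[k].empty()) heap.push({seqs[k][0], k});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, k] = heap.top();
        heap.pop();
        out.push_back(val);
        if (++pos[k] < seqs[k].size())
            heap.push({seqs[k][pos[k]], k});         // advance that sequence
    }
    return out;
}
```

Each of the m output elements costs O(log k) heap work, matching the m/p * log k term in the running time above.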
merge, multiway_merge: Diagram
[Diagram: k sorted sequences partitioned into slabs, one slab per thread t0, ..., t3.]
sort, stable_sort
Parallel multiway mergesort
+ less communication necessary
+ stable variant easy to derive
- needs twice the space
Parallel load-balanced quicksort
+ in-place
± dynamic load balancing to compensate for unequal splitting
- concurrent access to memory
- not stable
Both variants are implemented in the MCSTL; the user can choose.
Parallel Multiway Mergesort
Procedure
1. divide the sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy the result back to the original position
[Diagram: threads t0-t3 sorting their parts locally, then merging.]
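The four steps can be sketched sequentially. This is only a sketch of the structure: the local sorts would run on one thread per part, and the parallel p-way merge of step 3 is replaced here by repeated two-way std::inplace_merge calls for brevity:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Split into p parts of (almost) equal size, sort each part, then merge
// the sorted parts. Returns the sorted copy (mirroring step 4's copy-back).
std::vector<int> multiway_mergesort(std::vector<int> v, std::size_t p = 4) {
    const std::size_t n = v.size();
    if (n == 0 || p == 0) return v;
    const std::size_t part = (n + p - 1) / p;        // 1. part boundaries
    std::vector<std::size_t> bounds;
    for (std::size_t b = 0; b < n; b += part) {
        std::size_t e = std::min(b + part, n);
        std::sort(v.begin() + b, v.begin() + e);     // 2. local sorts
        bounds.push_back(e);
    }
    std::size_t merged = bounds[0];                  // 3. merge the parts
    for (std::size_t i = 1; i < bounds.size(); ++i) {
        std::inplace_merge(v.begin(), v.begin() + merged, v.begin() + bounds[i]);
        merged = bounds[i];
    }
    return v;
}
```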
partition
Partition the sequence into elements < pivot and elements > pivot.
Sequential algorithm
- scan from both ends
- swap to the desired order when contrary elements are found
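The sequential two-ended scan can be sketched as follows (the helper name partition_pivot is illustrative):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Two-ended scan: the left cursor advances over elements already smaller
// than the pivot, the right cursor retreats over elements already >= pivot,
// and each contrary pair found is swapped. Returns the split point.
// In the parallel version, threads run this scan over blocks of size B
// claimed from the two ends with fetch_and_add.
std::size_t partition_pivot(std::vector<int>& v, int pivot) {
    std::size_t lo = 0, hi = v.size();
    for (;;) {
        while (lo < hi && v[lo] < pivot) ++lo;
        while (lo < hi && v[hi - 1] >= pivot) --hi;
        if (lo >= hi) return lo;                // fully partitioned
        std::swap(v[lo], v[hi - 1]);            // contrary pair
    }
}
```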
Parallel Partitioning [Tsigas, Zhang 2003]
1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the middle
3. recurse on the middle part
- time complexity O(n/p + B log p)
[Diagram: threads p0, p1, p2 scanning blocks from both ends of the input, swapping in parallel; the rest is handled recursively or sequentially.]
partition: Example
[Worked example: 3 processors, B = 3, pivot 50, no special cases.]
Partitioning of 32-bit integers on a Sun T1
[Figure: speedup vs. n for 1-32 threads against sequential.]
Parallel Balanced Quicksort
Procedure [3]
1. split the sequence using parallel partition; descend recursively with the appropriate number of threads
2. as soon as there is only one processor left per partition, start local sorting
3. each processor sorts locally and pushes parts to be processed later into a lock-free deque
4. other processors can steal parts when out of work
[Diagram: the input is partitioned in parallel by p0-p2; then each processor sorts sequentially, and idle processors steal queued parts.]
Sorting Performance Results
[Figure: speedup sorting pairs of 64-bit integers on the Sun T1 vs. number of elements, for 2-32 threads, comparing multiway mergesort (mwms) and balanced quicksort (bqs) against sequential.]
Sorting Performance Results
[Figure: speedup of multiway mergesort for 32-bit integers on 2 quad-core Xeons vs. input size, for 1-8 threads, against sequential.]
Random Permutation (random_shuffle)
Standard sequential algorithm (e.g. STL)
- for 0 <= i < n: swap(a[i], a[rand(i, n - 1)])
Cache-efficient (parallel) algorithm
1. distribute the elements randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute each bucket
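The standard sequential algorithm above is the Fisher-Yates shuffle; a sketch using <random> (the function name and the choice of std::mt19937 are ours, not the library's):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Fisher-Yates: position i is swapped with a uniformly random position
// in [i, n-1], which yields a uniformly random permutation.
void fisher_yates(std::vector<int>& a, std::mt19937& rng) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i + 1 < n; ++i) {
        std::uniform_int_distribution<std::size_t> d(i, n - 1);
        std::swap(a[i], a[d(rng)]);
    }
}
```

The cache-efficient parallel variant avoids these scattered single-element swaps by first distributing elements into buckets and then permuting bucket-locally.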
Random Permutation (random_shuffle)
- time complexity O(n/p + p), global communication volume n
- cache efficiency is very important (a factor of 2)
[Figure: speedup of cache-aware random shuffling of integers on a 4-way Opteron vs. n (up to 10^8), for 1-4 threads, against sequential.]
Dictionary Bulk Operations
Algorithmic problem
- construct/insert into a red-black tree
- complicated splitting and balancing of the work
- the bulk algorithm already brings a sequential speedup
- not yet in the parallel mode, only in the MCSTL
Memory management
- memory allocation takes a considerable share of the time
- C++ does not allow asymmetric allocation/deallocation, i.e. allocating several nodes at once and later deallocating them one by one
Dictionary Bulk Operations Performance
[Figure: speedup of insertion on a 2-way quad-core Xeon vs. number of inserted elements (in thousands), for 1-8 threads, against sequential.]
Effect of Core Mapping (Tree Construction)
[Figure: speedup vs. number of inserted elements (in thousands) for 2 threads on different sockets, on the same socket but different dice, and on the same die, against 1 thread and sequential.]
Usage Example Code

  #include <algorithm>
  #include <vector>

  std::vector<double> v;
  // ... fill v ...
  std::random_shuffle(v.begin(), v.end());

Recompiling with the parallel mode enabled (in GCC: -D_GLIBCXX_PARALLEL -fopenmp) makes such calls run in parallel.
Applications
- in combination with STXXL (the external memory STL): sort, multiway merge, suffix array construction
- additional integer sort routine; merge, for_each
Release to the Public
- integration into libstdc++ has started; will ship as a part of GCC
- GPL with runtime exception
- open source, everybody can contribute
Conclusions
Benefits
- the MCSTL provides an easy way to incorporate data parallelism into programs on an algorithmic level
- fully generic
- performance is excellent for large inputs
- speedup is at hand for small inputs as well, depending on circumstances
- could transparently support new paradigms, e.g. transactional memory
- a repository for parallel algorithm implementations
Use more (MC)STL!
Demands on Language Spec, OS, and Hardware
OS
- allow specifying affinity between threads, e.g. "threads 0, 2 and 4 should run close together (shared cache), otherwise widely separated (maximum bandwidth)"
- define penalties for switching cores (cache locality)
Hardware
- more memory bandwidth
- faster communication
- larger shared caches
Future Work
- performance estimation, automatic switching point detection
  [Figure: preliminary results for switching the number of threads in balanced quicksort; speedup vs. input size for sequential and 1-8 threads.]
- more application studies
- updates of complex data structures like priority queues
Performance Estimation Issues
- circumstances can hardly be determined at compile time
- execution times of functors, comparators, overloaded assignment operators, and copy constructors are crucial for performance
References
[1] L. Frias and J. Singler. Parallelization of Bulk Operations for STL Dictionaries. In Workshop on Highly Parallel Processing on a Chip (HPPC).
[2] J. Singler, P. Sanders, and F. Putze. MCSTL: The Multi-Core Standard Template Library. In Euro-Par 2007: Parallel Processing, volume 4641 of LNCS. Springer-Verlag.
[3] P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. In 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, page 372, 2003.
[4] P. J. Varman, S. D. Scheufler, B. R. Iyer, and G. R. Ricard. Merging Multiple Lists on Hierarchical-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2), 1991.
Chap. 5 Part 2 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Static work allocation Where work distribution is predetermined, but based on what? Typical scheme Divide n size data into P
More informationRecursive Sorts. Recursive Sorts. Divide-and-Conquer. Divide-and-Conquer. Divide-and-conquer paradigm:
Recursive Sorts Recursive Sorts Recursive sorts divide the data roughly in half and are called again on the smaller data sets. This is called the Divide-and-Conquer paradigm. We will see 2 recursive sorts:
More informationThe Limits of Sorting Divide-and-Conquer Comparison Sorts II
The Limits of Sorting Divide-and-Conquer Comparison Sorts II CS 311 Data Structures and Algorithms Lecture Slides Monday, October 12, 2009 Glenn G. Chappell Department of Computer Science University of
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2018 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2017 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More information17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer
Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are
More informationData Structures and Algorithms
Data Structures and Algorithms Autumn 2018-2019 Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Quicksort
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More information8 Introduction to Distributed Computing
CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 8, 4/26/2017. Scribed by A. Santucci. 8 Introduction
More informationCOMP Data Structures
COMP 2140 - Data Structures Shahin Kamali Topic 5 - Sorting University of Manitoba Based on notes by S. Durocher. COMP 2140 - Data Structures 1 / 55 Overview Review: Insertion Sort Merge Sort Quicksort
More informationAlgorithms: Design & Practice
Algorithms: Design & Practice Deepak Kumar Bryn Mawr College Spring 2018 Course Essentials Algorithms Design & Practice How to design Learn some good ones How to implement practical considerations How
More informationIn-place Super Scalar. Tim Kralj. Samplesort
In-place Super Scalar Tim Kralj Samplesort Outline Quicksort Super Scalar Samplesort In-place Super Scalar Samplesort (IPS 4 o) Analysis Results Further work/questions Quicksort Finds pivots in the array
More informationAlgorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II
Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language
More informationKey question: how do we pick a good pivot (and what makes a good pivot in the first place)?
More on sorting Mergesort (v2) Quicksort Mergesort in place in action 53 2 44 85 11 67 7 39 14 53 87 11 50 67 2 14 44 53 80 85 87 14 87 80 50 29 72 95 2 44 80 85 7 29 39 72 95 Boxes with same color are
More informationBalanced Binary Search Trees. Victor Gao
Balanced Binary Search Trees Victor Gao OUTLINE Binary Heap Revisited BST Revisited Balanced Binary Search Trees Rotation Treap Splay Tree BINARY HEAP: REVIEW A binary heap is a complete binary tree such
More informationRandomized Algorithms, Quicksort and Randomized Selection
CMPS 2200 Fall 2017 Randomized Algorithms, Quicksort and Randomized Selection Carola Wenk Slides by Carola Wenk and Charles Leiserson CMPS 2200 Intro. to Algorithms 1 Deterministic Algorithms Runtime for
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationBuffer Heap Implementation & Evaluation. Hatem Nassrat. CSCI 6104 Instructor: N.Zeh Dalhousie University Computer Science
Buffer Heap Implementation & Evaluation Hatem Nassrat CSCI 6104 Instructor: N.Zeh Dalhousie University Computer Science Table of Contents Introduction...3 Cache Aware / Cache Oblivious Algorithms...3 Buffer
More informationIn multiprogramming systems, processes share a common store. Processes need space for:
Memory Management In multiprogramming systems, processes share a common store. Processes need space for: code (instructions) static data (compiler initialized variables, strings, etc.) global data (global
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationChapter 9 Memory Management
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationSorting. Data structures and Algorithms
Sorting Data structures and Algorithms Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++ Goodrich, Tamassia and Mount (Wiley, 2004) Outline Bubble
More informationMerge Sort
Merge Sort 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 Divide-and-Conuer Divide-and conuer is a general algorithm design paradigm: n Divide: divide the input data S in two disjoint subsets S 1 and
More informationRun Times. Efficiency Issues. Run Times cont d. More on O( ) notation
Comp2711 S1 2006 Correctness Oheads 1 Efficiency Issues Comp2711 S1 2006 Correctness Oheads 2 Run Times An implementation may be correct with respect to the Specification Pre- and Post-condition, but nevertheless
More informationParallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs
Lecture 16 Treaps; Augmented BSTs Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Margaret Reid-Miller 8 March 2012 Today: - More on Treaps - Ordered Sets and Tables
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationSorting. Bubble Sort. Pseudo Code for Bubble Sorting: Sorting is ordering a list of elements.
Sorting Sorting is ordering a list of elements. Types of sorting: There are many types of algorithms exist based on the following criteria: Based on Complexity Based on Memory usage (Internal & External
More informationScan Primitives for GPU Computing
Scan Primitives for GPU Computing Shubho Sengupta, Mark Harris *, Yao Zhang, John Owens University of California Davis, *NVIDIA Corporation Motivation Raw compute power and bandwidth of GPUs increasing
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationSorting. Divide-and-Conquer 1
Sorting Divide-and-Conquer 1 Divide-and-Conquer 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 Divide-and-Conquer 2 Divide-and-Conquer Divide-and conquer is a general algorithm design paradigm: Divide:
More informationThe Cost of Address Translation
The Cost of Address Translation Tomasz Jurkiewicz Kurt Mehlhorn Pat Nicholson Max Planck Institute for Informatics full version of paper by TJ and KM available at arxiv preliminary version presented at
More informationOverview: The OpenMP Programming Model
Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP
More informationCache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms
Cache-Oblivious Algorithms A Unified Approach to Hierarchical Memory Algorithms Aarhus University Cache-Oblivious Current Trends Algorithms in Algorithms, - A Unified Complexity Approach to Theory, Hierarchical
More information