Parallelization on Multi-Core CPUs

Size: px

Start display at page:

Download "Parallelization on Multi-Core CPUs"

Imogene Atkins
5 years ago
Views:

1 1 / 30

2 Amdahl s Law suppose we parallelize an algorithm using n cores and p is the proportion of the task that can be parallelized (1 p cannot be parallelized) the speedup of the algorithm is assuming infinite parallelism, the speedup is 1 (1 p)+ p n 1 (1 p) for example, if 90% of the work is parallelized, the maximum speedup is only 10 one should make sure that eery phase of one s algorithm that depends on the input data size is parallelized 2 / 30

3 Parallelization Constructs and Libraries low-leel: C++ threads, pthreads (threads, mutexes, barriers, condition ariables) parallel patterns: parallel reduce, parallel for, fork/join parallelism parallel frameworks: TBB, OpenMP, Cilk Plus 3 / 30

4 Intel Thread Building Blocks Open Source library for parallelism and concurrency fairly nice for prototyping manages a pool of worker threads implements work stealing proides high-leel abstractions enables nested parallelism large systems (e.g., database systems) will hae their own framework 4 / 30

5 Thread-Local Storage in C++ ariables can be annotated as thread_local (each thread has its own copy) howeer, sometimes it would be conenient to access the thread-local state of other threads tbb::enumerable_thread_specific allows this 5 / 30

6 Parallel Reduce tbb :: parallel_ reduce ( tbb :: blocked_range < uint64_t >(0, n), // range 0ull, // identity [&]( const tbb :: blocked_range < uint64_t >& r, uint64_ t init ) { // accumulate for ( uint64_t i=r. begin (); i!=r. end (); i ++) init += array [i]; return init ; }, [] ( uint64_t x, uint64_t y) { return x+y; }); // combine tbb :: blocked_ range ( Value begin, Value end, size_ type grainsize =1); 6 / 30

7 Parallel For tbb :: parallel_for ( tbb :: blocked_range < uint64_t >(0, [&]( const tbb :: blocked_range < uint64_t >& r) { for ( uint64_t i=r. begin (); i!=r. end (); i ++) array [i] *= 2; }); n), 7 / 30

8 Partitioners parallel_for and parallel_reduce split the gien range to enable parallel execution there are multiple builtin partitioners: static_partitioner splits work equally among threads up-front (no dynamic work stealing) simple_partitioner splits the range as much as possible (e.g., until grainsize is reached) auto_partitioner heuristic similar to simple_partitioner, but tries to aoid creating too many ranges (default) 8 / 30

9 Fork/Join Parallelism sometimes the amount of work to parallelize is not known upfront fork/join allows one to perform work on other threads ( fork ), and then to wait until these tasks are finished ( join ) often recursie parallelism structure 9 / 30

10 Naie Merge Sort with Fork/Join (TBB) const ptrdiff_ t limit = 1024; template < class Iter > oid merge_ sort ( Iter first, Iter last ) { if ( last - first > limit ) { Iter middle = first + ( last - first ) / 2; tbb :: task_ group g; // alternatie : tbb :: parallel_ inoke g. run ([&]{ merge_sort ( first, middle ); } ); merge_sort ( middle, last ); g. wait (); std :: inplace_merge ( first, middle, last ); } else { merge_sort_serial ( first, last ); } } 10 / 30

11 Analysis What is the speedup for sorting n elements with infinite cores? serial execution: log 2 (n) n rough upper bound (using Amdahl s law): the final merge is serial: n n lower bound for fraction of serial part log 2 (n) n = 1 log 2 (n) 1 using Amdahl s law the maximum speedup is 1 = log 2 (n) log 2 (n) for example, if n = 2 20 the upper bound is log 2 (n) = 20 tighter upper bound: parallel execution: log 2 (n) 1 n i=0 2 = n + n i 2 + n 4 + < 2n for example, if n = 2 20 the upper bound is 20n 2n = 10 (both analyses assume that each leel recursion leel takes the same amount of time) 11 / 30

12 Speedup, n = speedup threads 12 / 30

13 Speedup with 10 Threads speedup data size [log scale] 13 / 30

14 Parallelization Oerhead, n = time [s] 0.10 threads limit [log scale] 14 / 30

15 Parallel Merge (1) template < typename It > oid parallelmerge ( It begin1, It end1, It begin2, It end2, It out ) { tbb :: parallel_ for ( ParallelMergeRange <It >( begin1, end1, begin2, end2, out ), [&]( ParallelMergeRange <It >& r) { std :: merge (r. begin1, r.end1, r. begin2, r.end2, r. out ); }, tbb :: simple_ partitioner ()); } template < typename It > struct ParallelMergeRange { // TBB range concept It begin1, end1, begin2, end2, out ; bool empty () const { return ( end1 - begin1 ) + ( end2 - begin2 )==0; } bool is_ diisible () const { return std :: min (end1 - begin1, end2 - begin2 ) > limit ; } ParallelMergeRange ( It begin1_, It end1_, It begin2_, It end2_, It out_ ) : begin1 ( begin1_ ), end1 ( end1_ ), begin2 ( begin2_ ), end2 ( end2_ ), out ( out_ ) {} 15 / 30

16 Parallel Merge (2) // Splitting constructor ( splits r into two subranges ) ParallelMergeRange ( ParallelMergeRange & r, tbb :: split ) { if ((r.end1 -r. begin1 ) < (r.end2 -r. begin2 )) { // first range should be the larger one std :: swap (r. begin1, r. begin2 ); std :: swap (r.end1, r. end2 ); } It m1 = r. begin1 + (r.end1 -r. begin1 )/2; It m2 = std :: lower_bound (r. begin2, r.end2, *m1 ); begin1 = m1; begin2 = m2; end1 = r. end1 ; end2 = r. end2 ; out = r. out + (m1 -r. begin1 ) + (m2 -r. begin2 ); r. end1 = m1; r. end2 = m2; } }; // struct ParallelMergeRange 16 / 30

17 Parallel Out-Of-Place Merge Sort template < class It > oid parallelmergesort ( It first, It last, It out, bool inplace = false ) { if (( last - first ) < limit ) { merge_sort_serial ( first, last ); if (! inplace ) std :: moe ( first, last, out ); } else { It mid = first + ( last - first )/2; It outmid = out + ( mid - first ); It outlast = out + ( last - first ); tbb :: parallel_ inoke ( [&]() { parallelmergesort ( first, mid, out,! inplace ); }, [&]() { parallelmergesort (mid, last, outmid,! inplace ); }); if ( inplace ) parallelmerge ( out, outmid, outmid, outlast, first ); else parallelmerge ( first, mid, mid, last, out ); } } 17 / 30

18 Scalability, n = M 75M items/s 50M 25M method in-place out-of-place threads 18 / 30

19 HyPer s Parallel Merge Sort 1. diide input data statically, each thread sorts its fraction 2. determine separators, compute output positions (prefix sums) 3. merge into output array in-place sort sort sort global 1/3 global 2/3 local 1/3 Compute global separators from the local separators local 2/3 merge merge merge 19 / 30

20 Pitfalls in Parallel Code non-scalable algorithm re-think algorithm load imbalance break work into smaller tasks, dynamically schedule these between threads task oerhead: managing tasks takes more time than the actual work set a minimum per-thread tasks size (not too small, not to large) 20 / 30

21 Volcano-Style Parallelism plan-drien approach: optimizer statically determines at query compile time how many threads should run instantiates one query operator plan for each thread connects these with exchange operators, which encapsulate parallelism and manage threads Elegant model which is used by many systems Xchg(3:1) r r r r XchgHashSplit(3:3) R R 1 R 2 R 3 21 / 30

22 Volcano-Style Parallelism (2) + operators are largely obliious to parallelism static work partitioning can cause load imbalances degree of parallelism cannot easily be changed mid-query oerhead: thread oersubscription causes context switching hash re-partitioning often does not pay off exchange operators create additional copies of the tuples 22 / 30

23 Morsel-Drien Query Execution (1) break input into constant-sized work units ( morsels ) dispatcher assigns morsels to worker threads # worker threads = # hardware threads operators are designed for parallel execution Z a Z b Result A B 16 8 A 27 B 10 C C y store store HT(T) B C 8 33 x 10 y 5 23 z u Dispatcher probe(8) probe(10) HT(S) A B morsel morsel probe(16) R A Z 16 a 7 c probe(27) 10 i 27 b 18 e 5 j 7 d 5 f 23 / 30

24 Pipeplines each pipeline is parallelized indiidually using all threads BB BA T S R 24 / 30

25 Pipeplines each pipeline is parallelized indiidually using all threads BB T S BA R Build HT(T) Pipe 1 Pipe 1 Pipe 1 Scan T Scan T Scan T 24 / 30

26 Pipeplines each pipeline is parallelized indiidually using all threads BB T BA Build HT(T) Build HT(S) Pipe 2 Pipe 2 Pipe 2 Scan S S R Scan S Scan S 24 / 30

27 Pipeplines each pipeline is parallelized indiidually using all threads Probe HT(T) BB Probe HT(T) Probe HT(T) Probe HT(S) Probe HT(S) T BA Build HT(T) Probe HT(T) Build HT(S) Probe HT(S) Probe HT(S) Pipe 3 Pipe 3 Scan R Pipe 3 Scan R Pipe 3 Scan R S R Scan R 24 / 30

28 Parallel Hash Table Construction Phase 2: scan NUMA-local storage area and insert pointers into HT Phase 1: process T morsel-wise and store NUMA-locally next morsel Storage area of red core global Hash Table T morsel Storage area of green core Storage area of blue core scan Insert the pointer into HT 25 / 30

29 Hash Tagging unused bits in pointers act as a cheap bloom filter hashtable bit tag for early filtering d 48 bit pointer e f 26 / 30

30 Aggregation/Group By parallel aggregation is one of the most difficult relational operators main challenge: behaes ery differently depending on whether there are few or many distinct keys morsel morsel next red morsel K V group group K 8 13 ht V ht K V spill when ht becomes full Partition 0 (12,7) (8,3) (41,4) (13,7) Partition 3 Partition 0 (8,9) (4,30) (13,14) (33,5) Partition 3 group group group group K HT V HT K V Result ptn 0 Result ptn 1 Phase 1: local pre-aggregation Phase 2: aggregate partition-wise 27 / 30

31 Window Functions: Semantics select location, time, ag(alue) oer (partition by location order by time range between interal 2 day preceding and interal 1 day following) from measurement order by frame partition by 28 / 30

32 Window Functions: Semantics hash partitioning (thread-local) thread 1 thread 2 combine hash groups sort/ealuation inter-partition parallelism 3.2. intra-partition parallelism 29 / 30

33 References Structured Parallel Programming: Patterns for Efficient Computation, McCool and Robison and Reinders, Morgan Kaufmann, 2012 Morsel-Drien Parallelism: A NUMA-Aware Query Ealuation Framework for the Many-Core Age, Leis and Boncz and Kemper and Neumann, SIGMOD 2014 Encapsulation of Parallelism in the Volcano Query Processing System, Graefe, SIGMOD / 30

Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age. Presented by Dennis Grishin

Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age Presented by Dennis Grishin What is the problem? Efficient computation requires distribution of processing between