OpenMP in the Real World
1 OpenMP in the Real World
Christian Terboven, Dieter an Mey {terboven, …}
Center for Computing and Communication, RWTH Aachen University, Germany
Tutorial, June 3rd, TU Dresden, Germany
2 Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o …
3 FIRE: Image Retrieval System
o FIRE = Flexible Image Retrieval Engine
- Compare the performance of common features on different databases
- Analysis of the correlation of different features
Thomas Deselaers and Daniel Keysers, RWTH I6: Chair for Human Language Technology and Pattern Recognition
4 FIRE: Image Retrieval System
o Q: query image, X: set of database images
o Q_m, X_m: m-th feature of Q and X
o d_m: distance measure, w_m: weighting coefficient
o Return the k images with the lowest distance to the query image
o Well suited for shared-memory parallelization: data mining in a large image database!
o Three levels to exploit parallelism:
- Process multiple query images in parallel
- Process the database comparison for one query image in parallel
- The computation of the distances might be parallelized as well
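The weighted distance and the middle level of parallelism (one query against the whole database) can be sketched as follows. This is an illustrative reconstruction, not the original FIRE code; all names (Feature, d_m, rank_database, ...) and the squared-Euclidean stand-in for the distance measures are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical data model: each image is a set of M feature vectors,
// one per feature type m.
using Feature = std::vector<double>;
using Image   = std::vector<Feature>;

// Squared Euclidean distance as a stand-in for the per-feature measure d_m.
static double d_m(const Feature& a, const Feature& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// D(Q, X) = sum_m w_m * d_m(Q_m, X_m)
// Innermost level: this loop over features could itself be parallelized.
static double distance(const Image& q, const Image& x,
                       const std::vector<double>& w) {
    double d = 0.0;
    for (std::size_t m = 0; m < q.size(); ++m)
        d += w[m] * d_m(q[m], x[m]);
    return d;
}

// Middle level: compare one query against all database images in parallel;
// sorting afterwards puts the k nearest images at the front.
std::vector<std::pair<double, std::size_t>>
rank_database(const Image& q, const std::vector<Image>& db,
              const std::vector<double>& w) {
    std::vector<std::pair<double, std::size_t>> scored(db.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)db.size(); ++i)
        scored[i] = {distance(q, db[i], w), (std::size_t)i};
    std::sort(scored.begin(), scored.end());
    return scored;
}
```

The outermost level (multiple query images in parallel) would simply wrap another parallel loop around calls to `rank_database`.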
5 FIRE: OpenMP improves scalability
[Chart: speedup over number of threads on a Sun Fire E25K (72 dual-core UltraSPARC-IV processors); only the outer level vs. only the inner level parallelized]
o How can the scalability be improved?
- Scalability on the outer level is limited because of output synchronization; the overhead increases with the number of threads
- The dataset might fit the number of threads better
6 ScaleMP: a virtual shared-memory architecture
o ScaleMP vSMP: Versatile SMP architecture
- Aggregation of multiple x86 boards into one single system image
- Cache-coherent connection through InfiniBand
- Modified IB stack and BIOS, virtualization, caching strategies
- Aggregation of all I/O resources to the OS
o Technology by ScaleMP
o Maximum (end of 2008): 16 boards with 32 processors / 128 cores, up to 1 TB of main memory
o Our system: 13 boards with 2x Intel Xeon E5450 (dual-core, 2.50 GHz)
7 FIRE: Scalability on a ScaleMP System
o Nesting improves the speedup by reducing the total overhead. Best effort: speedup of … on 104 cores.
[Chart: speedup over number of threads; linear speedup vs. only outer, only inner, and the nested 4x26 configuration]
o Explicit thread binding via the KMP_AFFINITY environment variable (Intel compiler)
8 CP: Parallel Critical Point Extraction
o VR in Aachen: analysis of large-scale flow simulations
- Feature extraction from raw data
- Interactive analysis in a virtual environment (e.g. a cave)
o Critical point: a point in the vector field with zero velocity
Andreas Gerndt, Virtual Reality Center, RWTH Aachen
9 CP: Addressing Load Imbalance
o Algorithmic sketch of the critical point extraction:
- Loop over the time steps of unsteady datasets
- Loop over the blocks of multi-block datasets
- Loop checking the cells within the block for CPs
[Chart: basic load and total runtime over the time level]
o The time needed to check a cell may vary considerably!
10 CP: Addressing Load Imbalance
o The solution in OpenMP is rather simple:

    #pragma omp parallel for num_threads(nTimeThreads) \
            schedule(dynamic,1)
    for (curT = 1; curT <= maxT; ++curT) {
        #pragma omp parallel for num_threads(nBlockThreads) \
                schedule(dynamic,1)
        for (curB = 1; curB <= maxB; ++curB) {
            #pragma omp parallel for num_threads(nCellThreads) \
                    schedule(guided)
            for (curC = 1; curC <= maxC; ++curC) {
                findCriticalPoints(curT, curB, curC);
            }
        }
    }
11 CP: Addressing Load Imbalance
o The achievable speedup heavily depends on the dataset
o No load imbalance: almost perfect scalability
o Speedup on a Sun Fire E25K (72 dual-core UltraSPARC-IV processors), execution with 128 threads:
- Without load balancing: 10.3 (static scheduling)
- With load balancing: 33.9 (dynamic scheduling)
- Sun extension to guided scheduling: 55.3 (weight factor = 20)
12 FLOWer: A Navier-Stokes Solver
o FLOWer: Navier-Stokes solver, German Aerospace Center
o PHOENIX: a small-scale prototype of the space launch vehicle HOPPER (takes off horizontally, places cargo in orbit, glides back to earth)
o MPI + OpenMP / autoparallelization hybrid parallel program
o A dynamic thread balancing library is used to automatically adjust the number of threads to improve the load balance of MPI
Birgit Reinartz and Michael Hesse, Laboratory of Mechanics, RWTH Aachen University
13 FLOWer: The MPI parallelization is not balanced
[Chart: per-process load; process 5 has too much work to do]
14 FLOWer: Dynamic Thread Balancing
o Sun Fire E25K, 23 MPI processes starting with 2 threads each
[Chart: processors per process over the iterations]
o Warm-up phase (iterations 1-12): artificially vary the number of threads per process
o Steering phase (13-30): increase the number of threads of busy processes
o Production phase (31-): freeze the thread numbers + next-touch
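The steering idea can be sketched as a simple redistribution policy: within a fixed thread budget, hand the remaining threads to the processes with the highest measured per-thread load. This is an illustrative sketch, not the actual balancing library used with FLOWer; the function name and the proportional heuristic are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical steering step: given the measured busy time of each MPI
// process, distribute a fixed budget of threads so that busy processes
// get more of them. Every process keeps at least one thread.
std::vector<int> balance_threads(const std::vector<double>& busy_time,
                                 int total_threads) {
    std::vector<int> threads(busy_time.size(), 1);
    int remaining = total_threads - (int)busy_time.size();
    // Greedily assign each remaining thread to the process whose
    // per-thread load is currently the highest.
    for (int r = 0; r < remaining; ++r) {
        std::size_t best = 0;
        double best_load = -1.0;
        for (std::size_t p = 0; p < busy_time.size(); ++p) {
            double load = busy_time[p] / threads[p];
            if (load > best_load) { best_load = load; best = p; }
        }
        ++threads[best];
    }
    return threads;
}
```

Each process would then open its parallel regions with a `num_threads(threads[rank])` clause; the next-touch step afterwards migrates the data to the new thread placement.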
15 FLOWer: Dynamic Thread Balancing
o Sun Fire E25K, ~65 MFlop/s per thread = 3% of peak performance (high MPI communication overhead)
[Chart: L2 cache misses per second (local and remote accesses) and total GFlop/s over the runtime in seconds; remote accesses drop once the next-touch mechanism is applied]
16 Agenda (recap)
17 DROPS: A Navier-Stokes Solver in C++
o Numerical simulation of two-phase flow
o Modeled by the instationary and non-linear Navier-Stokes equation
o A level-set function is used to describe the interface between the two phases
- Example: silicone oil drop in D2O (fluid/fluid)
o Written in C++: object-oriented, uses nested templates, STL types, and compile-time polymorphism
o (Adaptive) tetrahedral grid hierarchy
o Finite Element Method (FEM)
18 Naive Approach with OpenMP

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL p(n), z(n), q(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Option 1: Replace the operator calls, e.g.
    y_ax_par(&q.raw()[0], A.num_rows(), A.raw_val(), A.raw_row(),
             A.raw_col(), Addr(p.raw()));
o Option 2: Place the parallelization inside the operator calls
o Problems of both options:
- Code changes are not applicable to complex expressions
- May introduce additional overhead
19 Possible Problem: Temporaries
User's code:
    laperf::vector<double> x(dim), a(dim), b(dim);
    x = (a * 2.0) + b;
Ideal code for this vector-type operation:
    for (int i = 0; i < dim; ++i)
        x[i] = a[i] * 2.0 + b[i];
But in C++ it translates to:
    laperf::vector<double> _t1 = operator*(a, 2.0);
    laperf::vector<double> _t2 = operator+(_t1, b);
    x.operator=(_t2);
- Two temporary vector copies and unnecessary overhead
- Impossible to implement an efficient parallelization
- Bad placement of the temporaries on cc-NUMA architectures
o Solution: OpenMP in an Expression Template mechanism.
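The expression-template idea can be shown in a few dozen lines: operators build lightweight expression objects instead of result vectors, and a single loop in `operator=` evaluates the whole right-hand side, with no temporaries and one natural place for a parallel region. This is a minimal sketch, greatly simplified compared to a real library; all type names here are invented.

```cpp
#include <cstddef>
#include <vector>

// Expression nodes: they only hold references and compute element i on demand.
template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};
template <class L>
struct Scale {
    const L& l; double s;
    double operator[](std::size_t i) const { return l[i] * s; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // The single evaluation loop: x = a*2.0 + b runs exactly once over the
    // data, and this is where the (complete) parallel region lives.
    template <class E>
    Vec& operator=(const E& e) {
        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i)
            data[i] = e[i];
        return *this;
    }
};

// Operators return expression objects, not vectors: no temporary copies.
template <class L>
Scale<L> operator*(const L& l, double s) { return {l, s}; }
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With a NUMA-aware allocator behind `Vec` and a static schedule in `operator=`, the evaluation loop also touches each element on the node that owns it, which is the combination the following slides build on.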
20 Handling of cc-NUMA Architectures
o All x86-based multi-socket systems will be cc-NUMA!
- Current operating systems apply first-touch placement
- If cc-NUMA is ignored, the speedup will typically be zero
o The STL provides the concept of an allocator to encapsulate memory management; we build on the same concept to optimize for cc-NUMA
o Two possible allocators (they may even be plugged into your own data types):
- dist_allocator: distribute the data according to the schedule type (same scheduling as in the computation)
- chunked_allocator: distribute the data according to an explicitly precalculated scheme to improve the load balancing
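The first-touch policy the allocators exploit can be demonstrated without any library support: initialize the data in a parallel loop with the same static schedule as the compute loop, so each page is first written, and therefore placed, by the thread that will later work on it. A minimal sketch, with invented function names; raw `new[]` is used deliberately, because `std::vector` would already touch every element on construction from the master thread.

```cpp
#include <cstddef>

// First-touch sketch (cc-NUMA, Linux-style placement): the page backing
// x[i] lands on the NUMA node of the thread that first writes it.
double* alloc_and_first_touch(std::size_t n, double value) {
    double* x = new double[n];          // uninitialized: pages not yet placed
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)  // each thread places "its" chunk
        x[i] = value;
    return x;
}

// Compute loops must use the same schedule(static) so every thread works
// on the chunk it initialized, i.e. on local memory.
void scale(std::size_t n, double* x, double alpha) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        x[i] *= alpha;
}
```

This is exactly what dist_allocator automates: the allocator remembers the schedule so that data distribution and computation stay aligned.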
21 DROPS: Iteration Loop of the CG-type Solver

    typedef SparseMatrixBaseCL<...> MatrixCL;
    typedef VectorBaseCL<double> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]
22 DROPS: Parallel Iteration Loop of the CG-type Solver

    typedef SparseMatrixBaseCL<..., MyAllocator> MatrixCL;
    typedef VectorBaseCL<double, InternalPar> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Expression Templates allow the parallelization of a whole line.
o Here: a complete parallel region inside the operator calls.
23 Parallel Performance Measurements (2/2)
o Library-parallelized GMRES solver on a cc-NUMA machine:
[Chart: speedup over number of threads; optimal, TBB-std, OMP-std, OMP-distributed, OMP-chunked]
o The cc-NUMA architecture provides good memory bandwidth
o The allocator concept is successful; TBB tasks have no affinity
24
o A clean C++ object-oriented coding style may be very helpful for parallelization:
- Encapsulation prohibits unintended data dependencies
- Encapsulation may improve data locality (think cc-NUMA)
o But: OpenMP's C++ support is limited:
- Non-POD types are not well supported (e.g. with reductions)
- Pragmas are not language constructs: you might have to restructure / extend your code
o Nevertheless: we like the combination, and it can be used successfully!
25 Agenda (recap)
26 Comparing Processors and Boxes

    Metric \ Server   | SF V40z                  | FSC RX200 S4          | Sun T5120
    Processor chip    | AMD Opteron, 2.2 GHz     | Intel Xeon, 3.0 GHz   | UltraSPARC T2, 1.4 GHz
    # sockets         | 4                        | 2                     | 1
    # cores           | 8 (dual-core)            | 8 (quad-core)         | 8 (octo-core)
    # threads         | 8                        | 8                     | 64
    Accumulated L2 $  | 8 MB                     | 16 MB                 | 4 MB
    L2 $ strategy     | separate per core        | shared by 2 cores     | shared by 8 cores
    Technology        | 90 nm                    | 45 nm                 | 65 nm
    Peak performance  | 35.2 GFLOPS              | 96 GFLOPS             | 11.2 GFLOPS
    Dimension         | 3 units                  | 1 unit                | 1 unit

Note: Here we compare machines of different ages, which can be seen as unfair! For example, newer Opteron-based machines provide similar settings in 1 unit.
27 Measuring Memory Bandwidth
o Do not look at the CPU performance only; the memory subsystem's performance is crucial for your HPC application!

    long long *x, *xstart, *xend, mask;
    for (x = xstart; x < xend; x++)
        *x ^= mask;

o Each loop iteration: one load + one store
o We ran this kernel with multiple threads, each working on private data, using OpenMP (large memory footprint >> L2)
o Explicit processor binding to control the thread placement:
- Linux: taskset command
- Solaris: SUNW_MP_PROCBIND environment variable
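A self-contained version of the kernel might look as follows. The buffer size, repeat count, and function name are illustrative choices, not those of the original study; for a real measurement the per-thread footprint must far exceed the L2 cache, and the threads must be bound as described above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Runnable sketch of the slide's kernel: each iteration performs one
// 64-bit load and one 64-bit store. Every thread streams over its own
// private buffer, as in the original experiment. Returns the per-thread
// bandwidth in GB/s (16 bytes moved per iteration: load + store).
double xor_stream_gb_per_s(std::size_t words, int repeats) {
    double seconds = 0.0;
    #pragma omp parallel reduction(max : seconds)
    {
        std::vector<long long> buf(words, 0x0123456789abcdefLL); // private data
        const long long mask = 0x5555555555555555LL;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            // volatile keeps the optimizer from deleting the unused stream
            for (volatile long long* x = buf.data(); x < buf.data() + words; ++x)
                *x = *x ^ mask;             // one load + one store
        seconds = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();
    }
    if (seconds <= 0.0) seconds = 1e-9;     // guard against timer resolution
    return words * 16.0 * repeats / seconds / 1e9;
}
```

Compiled without OpenMP the pragma is ignored and the same kernel runs with one thread, which is how the single-thread numbers on the next slide would be obtained.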
28 Selected Results: 1 Thread
- 2x Clovertown, 2.66 GHz, 1 thread: … GB/s
- 4x Opteron 875, 2.2 GHz, 1 thread: … GB/s
- 1x Niagara 2, 1.4 GHz, 1 thread: … GB/s
29 Memory Bandwidth: Dual-Socket Quad-Core Clovertown
[Diagram: measured bandwidth per thread placement; 1 thread, 2 threads (three different placements), and 8 threads: … GB/s each]
30 Memory Bandwidth: Quad-Socket Dual-Core Opteron
[Diagram: measured bandwidth per thread placement; 1 thread, 2 threads (three different placements), and 8 threads: … GB/s each]
o cc-NUMA!
31 Memory Bandwidth: 8-Core Niagara-2
[Diagram: measured bandwidth for 1, 2, 4, 8, 16, and 32 threads in different placements: … GB/s each]
32 What Does That Mean for My Application?
o Ouch, these boxes all behave differently...
o Let's look at a sparse matrix-vector multiplication:
[Table: sparse MatVec (small) and sparse MatVec (large) GFLOPS on SF V40z (Opteron), FSC RX200 S4 (Xeon), and SF??? (Niagara2)]
o Which architecture is best suited depends on:
- Can your application profit from big caches?
- Can your application profit from shared / separate caches?
- Can your application profit from a high clock rate?
- Is your application memory-bound anyway?
- ... and more factors
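The benchmarked kernel is the classic CSR sparse matrix-vector product; a sketch like the following (not the benchmark code itself, all names invented) makes clear why it is memory-bound and therefore tracks the bandwidth results above more closely than the peak-GFLOPS column of the table:

```cpp
#include <cstddef>
#include <vector>

// Sparse matrix in compressed sparse row (CSR) format.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double>      val;      // size nnz
};

// y = A * x: two flops per nonzero, but also two double loads plus an
// index load, i.e. very few flops per byte moved: memory-bound.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    // Rows are independent, so the outer loop parallelizes directly;
    // schedule(dynamic) or guided helps when row lengths vary a lot.
    #pragma omp parallel for
    for (long i = 0; i < (long)A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
    return y;
}
```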
33 Agenda (recap)
34 Case Study: Kegeltoleranzen (WZL)
o Contact analysis simulation of bevel gears
- Written in Fortran, using the Intel Fortran 10.1 compiler
- Very cache-friendly, runs at high MFlop/s rates
[Images: bevel gear pair, differential gear]
Laboratory for Machine Tools and Production Engineering (WZL), RWTH Aachen
35 Case Study: Kegeltoleranzen (WZL)
o Comparing Linux and Windows Server 2008:
[Chart: MFlop/s over number of threads for Linux 2.6 and Windows Server 2008, each on a 4x Opteron dual-core 2.2 GHz node and a 2x Harpertown quad-core 3.0 GHz node; annotation: 7% better]
o Performance gain for the user: speedup of 5.6 on one node. Even better compared to the starting point (desktop: 220 MFlop/s).
36 Agenda (recap)
37
o OpenMP is often a good alternative to multiple MPI processes per node, as OpenMP parallelization might require less work!
- Adding levels of parallelism can help to increase the scalability!
o Tools for performance analysis and verification are critical ingredients of the program development environment.
o OpenMP is useful for multi-core architectures!
- But: current support for C++ is limited.
- But: support for architecture aspects is still missing.
38 The End
Thank you for your attention!
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationC++ and OpenMP. 1 ParCo 07 Terboven C++ and OpenMP. Christian Terboven. Center for Computing and Communication RWTH Aachen University, Germany
C++ and OpenMP Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University, Germany 1 ParCo 07 Terboven C++ and OpenMP Agenda OpenMP and objects OpenMP and
More informationWindows-HPC Environment at RWTH Aachen University
Windows-HPC Environment at RWTH Aachen University Christian Terboven, Samuel Sarholz {terboven, sarholz}@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University PPCES 2009 March
More informationSequoia. Mattan Erez. The University of Texas at Austin
Sequoia Mattan Erez The University of Texas at Austin EE382N: Parallelism and Locality, Fall 2015 1 2 Emerging Themes Writing high-performance code amounts to Intelligently structuring algorithms [compiler
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationGraS. An Environmental Application on Windows-HPC. Christian Terboven
An Environmental Application on Windows-HPC Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University, Germany Windows-HPC User Group March 12th, Fraunhofer
More informationDESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID
I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort
More informationExperiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems
Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Ashay Rane and Dan Stanzione Ph.D. {ashay.rane, dstanzi}@asu.edu Fulton High Performance Computing Initiative, Arizona
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationGeilo Winter School Henrik Löf Uppsala Universitet
Geilo Winter School 2008 Henrik Löf Uppsala Universitet 1 Optimizing parallel applications Low latencies (locality of reference) Cache memories Remote Accesses (NUMA) Low parallel overhead Minimize communication
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationOpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono
OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationEvaluation of Intel Memory Drive Technology Performance for Scientific Applications
Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationDay 6: Optimization on Parallel Intel Architectures
Day 6: Optimization on Parallel Intel Architectures Lecture day 6 Ryo Asai Colfax International colfaxresearch.com April 2017 colfaxresearch.com/ Welcome Colfax International, 2013 2017 Disclaimer 2 While
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationA Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh
A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory
More informationPerformance Tuning and OpenMP
Performance Tuning and OpenMP mueller@hlrs.de University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Höchstleistungsrechenzentrum Stuttgart Outline Motivation Performance
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationScaling PostgreSQL on SMP Architectures
Scaling PostgreSQL on SMP Architectures Doug Tolbert, David Strong, Johney Tsai {doug.tolbert, david.strong, johney.tsai}@unisys.com PGCon 2007, Ottawa, May 21-24, 2007 Page 1 Performance vs. Scalability
More information