OpenMP in the Real World
1 OpenMP in the Real World
Christian Terboven, Dieter an Mey {terboven, …}
Center for Computing and Communication, RWTH Aachen University, Germany
Tutorial, June 3rd, TU Dresden, Germany
2 Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o …
3 FIRE: Image Retrieval System
o FIRE = Flexible Image Retrieval Engine
- Compare the performance of common features on different databases
- Analysis of the correlation of different features
Thomas Deselaers and Daniel Keysers, RWTH I6: Chair for Human Language Technology and Pattern Recognition
4 FIRE: Image Retrieval System
o Q: query image, X: set of database images
o Q_m, X_m: m-th feature of Q and X
o d_m: distance measure, w_m: weighting coefficient
o Return the k images with the lowest distance to the query image
o Well suited for shared-memory parallelization: data mining in a large image database!
o Three levels to exploit parallelism:
- Process multiple query images in parallel
- Process the database comparison for one query image in parallel
- The computation of the distances might be parallelized as well
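The weighted distance and the middle level of parallelism (one query against the whole database) can be sketched as follows. This is an illustrative reconstruction, not the original FIRE code; all names (Feature, d_m, rank_database, ...) and the squared-Euclidean stand-in for the distance measures are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical data model: each image is a set of M feature vectors,
// one per feature type m.
using Feature = std::vector<double>;
using Image   = std::vector<Feature>;

// Squared Euclidean distance as a stand-in for the per-feature measure d_m.
static double d_m(const Feature& a, const Feature& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// D(Q, X) = sum_m w_m * d_m(Q_m, X_m)
// Innermost level: this loop over features could itself be parallelized.
static double distance(const Image& q, const Image& x,
                       const std::vector<double>& w) {
    double d = 0.0;
    for (std::size_t m = 0; m < q.size(); ++m)
        d += w[m] * d_m(q[m], x[m]);
    return d;
}

// Middle level: compare one query against all database images in parallel;
// sorting afterwards puts the k nearest images at the front.
std::vector<std::pair<double, std::size_t>>
rank_database(const Image& q, const std::vector<Image>& db,
              const std::vector<double>& w) {
    std::vector<std::pair<double, std::size_t>> scored(db.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)db.size(); ++i)
        scored[i] = {distance(q, db[i], w), (std::size_t)i};
    std::sort(scored.begin(), scored.end());
    return scored;
}
```

The outermost level (multiple query images in parallel) would simply wrap another parallel loop around calls to `rank_database`.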
5 FIRE: OpenMP improves scalability
[Chart: speedup over number of threads on a Sun Fire E25K (72 dual-core UltraSPARC-IV processors); only the outer level vs. only the inner level parallelized]
o How can the scalability be improved?
- Scalability on the outer level is limited because of output synchronization; the overhead increases with the number of threads
- The dataset might fit the number of threads better
6 ScaleMP: a virtual shared-memory architecture
o ScaleMP vSMP: Versatile SMP architecture
- Aggregation of multiple x86 boards into one single system image
- Cache-coherent connection through InfiniBand
- Modified IB stack and BIOS, virtualization, caching strategies
- Aggregation of all I/O resources to the OS
o Technology by ScaleMP
o Maximum (end of 2008): 16 boards with 32 processors / 128 cores, up to 1 TB of main memory
o Our system: 13 boards with 2x Intel Xeon E5450 (dual-core, 2.50 GHz)
7 FIRE: Scalability on a ScaleMP System
o Nesting improves the speedup by reducing the total overhead. Best effort: speedup of … on 104 cores.
[Chart: speedup over number of threads; linear speedup vs. only outer, only inner, and the nested 4x26 configuration]
o Explicit thread binding via the KMP_AFFINITY environment variable (Intel compiler)
8 CP: Parallel Critical Point Extraction
o VR in Aachen: analysis of large-scale flow simulations
- Feature extraction from raw data
- Interactive analysis in a virtual environment (e.g. a cave)
o Critical point: a point in the vector field with zero velocity
Andreas Gerndt, Virtual Reality Center, RWTH Aachen
9 CP: Addressing Load Imbalance
o Algorithmic sketch of the critical point extraction:
- Loop over the time steps of unsteady datasets
- Loop over the blocks of multi-block datasets
- Loop checking the cells within the block for CPs
[Chart: basic load and total runtime over the time level]
o The time needed to check a cell may vary considerably!
10 CP: Addressing Load Imbalance
o The solution in OpenMP is rather simple:

    #pragma omp parallel for num_threads(nTimeThreads) \
            schedule(dynamic,1)
    for (curT = 1; curT <= maxT; ++curT) {
        #pragma omp parallel for num_threads(nBlockThreads) \
                schedule(dynamic,1)
        for (curB = 1; curB <= maxB; ++curB) {
            #pragma omp parallel for num_threads(nCellThreads) \
                    schedule(guided)
            for (curC = 1; curC <= maxC; ++curC) {
                findCriticalPoints(curT, curB, curC);
            }
        }
    }
11 CP: Addressing Load Imbalance
o The achievable speedup heavily depends on the dataset
o No load imbalance: almost perfect scalability
o Speedup on a Sun Fire E25K (72 dual-core UltraSPARC-IV processors), execution with 128 threads:
- Without load balancing: 10.3 (static scheduling)
- With load balancing: 33.9 (dynamic scheduling)
- Sun extension to guided scheduling: 55.3 (weight factor = 20)
12 FLOWer: A Navier-Stokes Solver
o FLOWer: Navier-Stokes solver, German Aerospace Center
o PHOENIX: a small-scale prototype of the space launch vehicle HOPPER (takes off horizontally, places cargo in orbit, glides back to earth)
o MPI + OpenMP / autoparallelization hybrid parallel program
o A dynamic thread balancing library is used to automatically adjust the number of threads to improve the load balance of MPI
Birgit Reinartz and Michael Hesse, Laboratory of Mechanics, RWTH Aachen University
13 FLOWer: The MPI parallelization is not balanced
[Chart: per-process load; process 5 has too much work to do]
14 FLOWer: Dynamic Thread Balancing
o Sun Fire E25K, 23 MPI processes starting with 2 threads each
[Chart: processors per process over the iterations]
o Warm-up phase (iterations 1-12): artificially vary the number of threads per process
o Steering phase (13-30): increase the number of threads of busy processes
o Production phase (31-): freeze the thread numbers + next-touch
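The steering idea can be sketched as a simple redistribution policy: within a fixed thread budget, hand the remaining threads to the processes with the highest measured per-thread load. This is an illustrative sketch, not the actual balancing library used with FLOWer; the function name and the proportional heuristic are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical steering step: given the measured busy time of each MPI
// process, distribute a fixed budget of threads so that busy processes
// get more of them. Every process keeps at least one thread.
std::vector<int> balance_threads(const std::vector<double>& busy_time,
                                 int total_threads) {
    std::vector<int> threads(busy_time.size(), 1);
    int remaining = total_threads - (int)busy_time.size();
    // Greedily assign each remaining thread to the process whose
    // per-thread load is currently the highest.
    for (int r = 0; r < remaining; ++r) {
        std::size_t best = 0;
        double best_load = -1.0;
        for (std::size_t p = 0; p < busy_time.size(); ++p) {
            double load = busy_time[p] / threads[p];
            if (load > best_load) { best_load = load; best = p; }
        }
        ++threads[best];
    }
    return threads;
}
```

Each process would then open its parallel regions with a `num_threads(threads[rank])` clause; the next-touch step afterwards migrates the data to the new thread placement.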
15 FLOWer: Dynamic Thread Balancing
o Sun Fire E25K, ~65 MFlop/s per thread = 3% of peak performance (high MPI communication overhead)
[Chart: L2 cache misses per second (local and remote accesses) and total GFlop/s over the runtime in seconds; remote accesses drop once the next-touch mechanism is applied]
16 Agenda (recap)
17 DROPS: A Navier-Stokes Solver in C++
o Numerical simulation of two-phase flow
o Modeled by the instationary and non-linear Navier-Stokes equation
o A level-set function is used to describe the interface between the two phases
- Example: silicone oil drop in D2O (fluid/fluid)
o Written in C++: object-oriented, uses nested templates, STL types, and compile-time polymorphism
o (Adaptive) tetrahedral grid hierarchy
o Finite Element Method (FEM)
18 Naive Approach with OpenMP

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL p(n), z(n), q(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Option 1: Replace the operator calls, e.g.
    y_ax_par(&q.raw()[0], A.num_rows(), A.raw_val(), A.raw_row(),
             A.raw_col(), Addr(p.raw()));
o Option 2: Place the parallelization inside the operator calls
o Problems of both options:
- Code changes are not applicable to complex expressions
- May introduce additional overhead
19 Possible Problem: Temporaries
User's code:
    laperf::vector<double> x(dim), a(dim), b(dim);
    x = (a * 2.0) + b;
Ideal code for this vector-type operation:
    for (int i = 0; i < dim; ++i)
        x[i] = a[i] * 2.0 + b[i];
But in C++ it translates to:
    laperf::vector<double> _t1 = operator*(a, 2.0);
    laperf::vector<double> _t2 = operator+(_t1, b);
    x.operator=(_t2);
- Two temporary vector copies and unnecessary overhead
- Impossible to implement an efficient parallelization
- Bad placement of the temporaries on cc-NUMA architectures
o Solution: OpenMP in an Expression Template mechanism.
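The expression-template idea can be shown in a few dozen lines: operators build lightweight expression objects instead of result vectors, and a single loop in `operator=` evaluates the whole right-hand side, with no temporaries and one natural place for a parallel region. This is a minimal sketch, greatly simplified compared to a real library; all type names here are invented.

```cpp
#include <cstddef>
#include <vector>

// Expression nodes: they only hold references and compute element i on demand.
template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};
template <class L>
struct Scale {
    const L& l; double s;
    double operator[](std::size_t i) const { return l[i] * s; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // The single evaluation loop: x = a*2.0 + b runs exactly once over the
    // data, and this is where the (complete) parallel region lives.
    template <class E>
    Vec& operator=(const E& e) {
        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i)
            data[i] = e[i];
        return *this;
    }
};

// Operators return expression objects, not vectors: no temporary copies.
template <class L>
Scale<L> operator*(const L& l, double s) { return {l, s}; }
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With a NUMA-aware allocator behind `Vec` and a static schedule in `operator=`, the evaluation loop also touches each element on the node that owns it, which is the combination the following slides build on.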
20 Handling of cc-NUMA Architectures
o All x86-based multi-socket systems will be cc-NUMA!
- Current operating systems apply first-touch placement
- If cc-NUMA is ignored, the speedup will typically be zero
o The STL provides the concept of an allocator to encapsulate memory management; we build on the same concept to optimize for cc-NUMA
o Two possible allocators (they may even be plugged into your own data types):
- dist_allocator: distribute the data according to the schedule type (same scheduling as in the computation)
- chunked_allocator: distribute the data according to an explicitly precalculated scheme to improve the load balancing
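The first-touch policy the allocators exploit can be demonstrated without any library support: initialize the data in a parallel loop with the same static schedule as the compute loop, so each page is first written, and therefore placed, by the thread that will later work on it. A minimal sketch, with invented function names; raw `new[]` is used deliberately, because `std::vector` would already touch every element on construction from the master thread.

```cpp
#include <cstddef>

// First-touch sketch (cc-NUMA, Linux-style placement): the page backing
// x[i] lands on the NUMA node of the thread that first writes it.
double* alloc_and_first_touch(std::size_t n, double value) {
    double* x = new double[n];          // uninitialized: pages not yet placed
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)  // each thread places "its" chunk
        x[i] = value;
    return x;
}

// Compute loops must use the same schedule(static) so every thread works
// on the chunk it initialized, i.e. on local memory.
void scale(std::size_t n, double* x, double alpha) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        x[i] *= alpha;
}
```

This is exactly what dist_allocator automates: the allocator remembers the schedule so that data distribution and computation stay aligned.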
21 DROPS: Iteration Loop of the CG-type Solver

    typedef SparseMatrixBaseCL<...> MatrixCL;
    typedef VectorBaseCL<double> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]
22 DROPS: Parallel Iteration Loop of the CG-type Solver

    typedef SparseMatrixBaseCL<..., MyAllocator> MatrixCL;
    typedef VectorBaseCL<double, InternalPar> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i = 1; i <= max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Expression Templates allow the parallelization of a whole line.
o Here: a complete parallel region inside the operator calls.
23 Parallel Performance Measurements (2/2)
o Library-parallelized GMRES solver on a cc-NUMA machine:
[Chart: speedup over number of threads; optimal, TBB-std, OMP-std, OMP-distributed, OMP-chunked]
o The cc-NUMA architecture provides good memory bandwidth
o The allocator concept is successful; TBB tasks have no affinity
24
o A clean C++ object-oriented coding style may be very helpful for parallelization:
- Encapsulation prohibits unintended data dependencies
- Encapsulation may improve data locality (think cc-NUMA)
o But: OpenMP's C++ support is limited:
- Non-POD types are not well supported (e.g. with reductions)
- Pragmas are not language constructs: you might have to restructure / extend your code
o Nevertheless: we like the combination, and it can be used successfully!
25 Agenda (recap)
26 Comparing Processors and Boxes

    Metric \ Server   | SF V40z                  | FSC RX200 S4          | Sun T5120
    Processor chip    | AMD Opteron, 2.2 GHz     | Intel Xeon, 3.0 GHz   | UltraSPARC T2, 1.4 GHz
    # sockets         | 4                        | 2                     | 1
    # cores           | 8 (dual-core)            | 8 (quad-core)         | 8 (octo-core)
    # threads         | 8                        | 8                     | 64
    Accumulated L2 $  | 8 MB                     | 16 MB                 | 4 MB
    L2 $ strategy     | separate per core        | shared by 2 cores     | shared by 8 cores
    Technology        | 90 nm                    | 45 nm                 | 65 nm
    Peak performance  | 35.2 GFLOPS              | 96 GFLOPS             | 11.2 GFLOPS
    Dimension         | 3 units                  | 1 unit                | 1 unit

Note: Here we compare machines of different ages, which can be seen as unfair! For example, newer Opteron-based machines provide similar settings in 1 unit.
27 Measuring Memory Bandwidth
o Do not look at the CPU performance only; the memory subsystem's performance is crucial for your HPC application!

    long long *x, *xstart, *xend, mask;
    for (x = xstart; x < xend; x++)
        *x ^= mask;

o Each loop iteration: one load + one store
o We ran this kernel with multiple threads, each working on private data, using OpenMP (large memory footprint >> L2)
o Explicit processor binding to control the thread placement:
- Linux: taskset command
- Solaris: SUNW_MP_PROCBIND environment variable
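A self-contained version of the kernel might look as follows. The buffer size, repeat count, and function name are illustrative choices, not those of the original study; for a real measurement the per-thread footprint must far exceed the L2 cache, and the threads must be bound as described above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Runnable sketch of the slide's kernel: each iteration performs one
// 64-bit load and one 64-bit store. Every thread streams over its own
// private buffer, as in the original experiment. Returns the per-thread
// bandwidth in GB/s (16 bytes moved per iteration: load + store).
double xor_stream_gb_per_s(std::size_t words, int repeats) {
    double seconds = 0.0;
    #pragma omp parallel reduction(max : seconds)
    {
        std::vector<long long> buf(words, 0x0123456789abcdefLL); // private data
        const long long mask = 0x5555555555555555LL;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            // volatile keeps the optimizer from deleting the unused stream
            for (volatile long long* x = buf.data(); x < buf.data() + words; ++x)
                *x = *x ^ mask;             // one load + one store
        seconds = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - t0).count();
    }
    if (seconds <= 0.0) seconds = 1e-9;     // guard against timer resolution
    return words * 16.0 * repeats / seconds / 1e9;
}
```

Compiled without OpenMP the pragma is ignored and the same kernel runs with one thread, which is how the single-thread numbers on the next slide would be obtained.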
28 Selected Results: 1 Thread
- 2x Clovertown, 2.66 GHz, 1 thread: … GB/s
- 4x Opteron 875, 2.2 GHz, 1 thread: … GB/s
- 1x Niagara 2, 1.4 GHz, 1 thread: … GB/s
29 Memory Bandwidth: Dual-Socket Quad-Core Clovertown
[Diagram: measured bandwidth per thread placement; 1 thread, 2 threads (three different placements), and 8 threads: … GB/s each]
30 Memory Bandwidth: Quad-Socket Dual-Core Opteron
[Diagram: measured bandwidth per thread placement; 1 thread, 2 threads (three different placements), and 8 threads: … GB/s each]
o cc-NUMA!
31 Memory Bandwidth: 8-Core Niagara-2
[Diagram: measured bandwidth for 1, 2, 4, 8, 16, and 32 threads in different placements: … GB/s each]
32 What Does That Mean for My Application?
o Ouch, these boxes all behave differently...
o Let's look at a sparse matrix-vector multiplication:
[Table: sparse MatVec (small) and sparse MatVec (large) GFLOPS on SF V40z (Opteron), FSC RX200 S4 (Xeon), and SF??? (Niagara2)]
o Which architecture is best suited depends on:
- Can your application profit from big caches?
- Can your application profit from shared / separate caches?
- Can your application profit from a high clock rate?
- Is your application memory-bound anyway?
- ... and more factors
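The benchmarked kernel is the classic CSR sparse matrix-vector product; a sketch like the following (not the benchmark code itself, all names invented) makes clear why it is memory-bound and therefore tracks the bandwidth results above more closely than the peak-GFLOPS column of the table:

```cpp
#include <cstddef>
#include <vector>

// Sparse matrix in compressed sparse row (CSR) format.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double>      val;      // size nnz
};

// y = A * x: two flops per nonzero, but also two double loads plus an
// index load, i.e. very few flops per byte moved: memory-bound.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    // Rows are independent, so the outer loop parallelizes directly;
    // schedule(dynamic) or guided helps when row lengths vary a lot.
    #pragma omp parallel for
    for (long i = 0; i < (long)A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
    return y;
}
```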
33 Agenda (recap)
34 Case Study: Kegeltoleranzen (WZL)
o Contact analysis simulation of bevel gears
- Written in Fortran, using the Intel Fortran 10.1 compiler
- Very cache-friendly, runs at high MFlop/s rates
[Images: bevel gear pair, differential gear]
Laboratory for Machine Tools and Production Engineering (WZL), RWTH Aachen
35 Case Study: Kegeltoleranzen (WZL)
o Comparing Linux and Windows Server 2008:
[Chart: MFlop/s over number of threads for Linux 2.6 and Windows Server 2008, each on a 4x Opteron dual-core 2.2 GHz node and a 2x Harpertown quad-core 3.0 GHz node; annotation: 7% better]
o Performance gain for the user: speedup of 5.6 on one node. Even better compared to the starting point (desktop: 220 MFlop/s).
36 Agenda (recap)
37
o OpenMP is often a good alternative to multiple MPI processes per node, as OpenMP parallelization might require less work!
- Adding levels of parallelism can help to increase the scalability!
o Tools for performance analysis and verification are critical ingredients of the program development environment.
o OpenMP is useful for multi-core architectures!
- But: current support for C++ is limited.
- But: support for architecture aspects is still missing.
38 The End
Thank you for your attention!
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationC++ and OpenMP. 1 ParCo 07 Terboven C++ and OpenMP. Christian Terboven. Center for Computing and Communication RWTH Aachen University, Germany
C++ and OpenMP Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University, Germany 1 ParCo 07 Terboven C++ and OpenMP Agenda OpenMP and objects OpenMP and
More informationWindows-HPC Environment at RWTH Aachen University
Windows-HPC Environment at RWTH Aachen University Christian Terboven, Samuel Sarholz {terboven, sarholz}@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University PPCES 2009 March
More informationSequoia. Mattan Erez. The University of Texas at Austin
Sequoia Mattan Erez The University of Texas at Austin EE382N: Parallelism and Locality, Fall 2015 1 2 Emerging Themes Writing high-performance code amounts to Intelligently structuring algorithms [compiler
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationGraS. An Environmental Application on Windows-HPC. Christian Terboven
An Environmental Application on Windows-HPC Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University, Germany Windows-HPC User Group March 12th, Fraunhofer
More informationDESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID
I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort
More informationExperiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems
Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Ashay Rane and Dan Stanzione Ph.D. {ashay.rane, dstanzi}@asu.edu Fulton High Performance Computing Initiative, Arizona
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationGeilo Winter School Henrik Löf Uppsala Universitet
Geilo Winter School 2008 Henrik Löf Uppsala Universitet 1 Optimizing parallel applications Low latencies (locality of reference) Cache memories Remote Accesses (NUMA) Low parallel overhead Minimize communication
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationOpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono
OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationEvaluation of Intel Memory Drive Technology Performance for Scientific Applications
Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationDay 6: Optimization on Parallel Intel Architectures
Day 6: Optimization on Parallel Intel Architectures Lecture day 6 Ryo Asai Colfax International colfaxresearch.com April 2017 colfaxresearch.com/ Welcome Colfax International, 2013 2017 Disclaimer 2 While
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationA Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh
A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory
More informationPerformance Tuning and OpenMP
Performance Tuning and OpenMP mueller@hlrs.de University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Höchstleistungsrechenzentrum Stuttgart Outline Motivation Performance
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationScaling PostgreSQL on SMP Architectures
Scaling PostgreSQL on SMP Architectures Doug Tolbert, David Strong, Johney Tsai {doug.tolbert, david.strong, johney.tsai}@unisys.com PGCon 2007, Ottawa, May 21-24, 2007 Page 1 Performance vs. Scalability
More information