OpenMP in the Real World


1  OpenMP in the Real World
Christian Terboven, Dieter an Mey
{terboven, anmey}@rz.rwth-aachen.de
Center for Computing and Communication, RWTH Aachen University, Germany
Tutorial, June 3rd, TU Dresden, Germany

2  Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o Comparing Processors and Boxes
o Case Study: Kegeltoleranzen (WZL)

3  FIRE: Image Retrieval System
o FIRE = Flexible Image Retrieval Engine
o Compare the performance of common features on different databases
o Analysis of correlation of different features
Thomas Deselaers and Daniel Keysers, RWTH I6: Chair for Human Language Technology and Pattern Recognition

4  FIRE: Image Retrieval System
o Q: query image, X: set of database images
o Q_m, X_m: m-th feature of Q and X
o d_m: distance measure, w_m: weighting coefficient
o Return the k images with the lowest distance to the query image
o Well-suited for Shared-Memory parallelization: Data Mining in a large image database!
o Three levels to exploit parallelism (sketched below):
  - Process multiple query images in parallel
  - Process the database comparison for one query image in parallel
  - Computation of distances might be parallelized as well
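The three levels map naturally onto (nested) OpenMP loops. A minimal sketch under assumed names: Image, d_m, and scoreQueries are illustrative stand-ins, not FIRE's actual API, and a simple Euclidean distance stands in for the per-feature measures.

    #include <cstddef>
    #include <cmath>
    #include <vector>

    using Feature = std::vector<double>;
    using Image   = std::vector<Feature>;   // one Feature per feature type m

    // Illustrative stand-in for the per-feature distance measure d_m.
    static double d_m(const Feature& a, const Feature& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // dist[q][x] = sum_m w[m] * d_m(Q_m, X_m), parallel on the two outer levels.
    void scoreQueries(const std::vector<Image>& queries,
                      const std::vector<Image>& database,
                      const std::vector<double>& w,
                      std::vector<std::vector<double>>& dist)
    {
        #pragma omp parallel for                  // level 1: query images
        for (int q = 0; q < (int)queries.size(); ++q) {
            #pragma omp parallel for              // level 2: database comparisons
            for (int x = 0; x < (int)database.size(); ++x) {
                double s = 0.0;                   // level 3 (distances) left serial here
                for (std::size_t m = 0; m < w.size(); ++m)
                    s += w[m] * d_m(queries[q][m], database[x][m]);
                dist[q][x] = s;
            }
        }
    }

Returning the k best matches is then a partial sort of dist[q]; the inner parallel for only fans out when nested parallelism is enabled.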

5  FIRE: Nested OpenMP improves scalability
[Chart: Speedup vs. number of threads on a Sun Fire E25K (72 dual-core UltraSPARC IV processors); outer level only vs. inner level only]
o How can nested OpenMP improve the scalability?
  - Scalability on the outer level is limited because of output synchronization; the overhead increases with the number of threads
  - The dataset might better fit the number of threads

6  ScaleMP: virtual Shared-Memory Architecture
o ScaleMP Versatile SMP Architecture:
  - Aggregation of multiple x86 boards into one single system image
  - Cache-coherent connection through InfiniBand
  - Modified InfiniBand stack and BIOS, virtualization, caching strategies
  - Aggregation of all I/O resources to the OS
o Technology by ScaleMP
o Maximum (end of 2008): 16 boards with 32 processors / 128 cores, up to 1 TB of main memory
o Our system: 13 boards with 2x Intel Xeon E5450 (dual-core, 2.50 GHz)

7  FIRE: Scalability on a ScaleMP System
o Nested OpenMP improves speedup by reducing the total overhead. Best effort: speedup of ... on 104 cores.
[Chart: Speedup vs. number of threads; linear speedup, outer level only, inner level only, and nested configurations such as 4x26]
o Explicit thread binding via the KMP_AFFINITY env. var. (Intel Compiler)

8  CP: Parallel Critical Point Extraction
o Virtual Reality Center in Aachen: analysis of large-scale flow simulations
  - Feature extraction from raw data
  - Interactive analysis in a virtual environment (e.g. a cave)
o Critical Point: point in the vector field with zero velocity
Andreas Gerndt, Virtual Reality Center, RWTH Aachen

9  CP: Addressing Load Imbalance
o Algorithmic sketch of Critical Point extraction:
  - Loop over the time steps of unsteady datasets
  - Loop over the blocks of multi-block datasets
  - Loop checking the cells within the block for CPs
[Chart: basic load and total runtime [s] per time level]
o The time needed to check a cell may vary considerably!

10  CP: Addressing Load Imbalance
o The solution in OpenMP is rather simple:

    #pragma omp parallel for num_threads(nTimeThreads) \
            schedule(dynamic,1)
    for (curT = 1; curT <= maxT; ++curT) {
        #pragma omp parallel for num_threads(nBlockThreads) \
                schedule(dynamic,1)
        for (curB = 1; curB <= maxB; ++curB) {
            #pragma omp parallel for num_threads(nCellThreads) \
                    schedule(guided)
            for (curC = 1; curC <= maxC; ++curC) {
                findCriticalPoints(curT, curB, curC);
            }
        }
    }
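One runtime detail the slide leaves implicit: nested parallelism must be switched on, or the inner regions run with a single thread each. A minimal sketch using the OpenMP API of that era (the function name is illustrative):

    #include <omp.h>

    void enableNesting() {
        // Without this (or OMP_NESTED=true in the environment), the inner
        // parallel regions above are serialized to one thread each.
        omp_set_nested(1);
        omp_set_dynamic(0);   // keep the requested thread counts fixed
        // Pick nTimeThreads * nBlockThreads * nCellThreads close to the core count.
    }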

11  CP: Addressing Load Imbalance
o The achievable speedup heavily depends on the dataset
o No load imbalance: almost perfect scalability
o Speedup on a Sun Fire E25K (72 dual-core UltraSPARC IV processors), execution with 128 threads:
  - Without load balancing: 10.3 (static scheduling)
  - With load balancing: 33.9 (dynamic scheduling)
  - Sun extension to guided scheduling: 55.3 (weight factor = 20)

12  FLOWer: A Navier-Stokes Solver
o FLOWer: Navier-Stokes solver, German Aerospace Center
o PHOENIX: a small-scale prototype of the space launch vehicle HOPPER (takes off horizontally, places cargo in orbit, glides back to earth)
o MPI + OpenMP / autoparallelization hybrid parallel program
o DTB (Dynamic Thread Balancing) library used to automatically adjust the number of threads to improve the load balance of MPI
Birgit Reinartz and Michael Hesse, Laboratory of Mechanics, RWTH Aachen University

13  FLOWer: MPI parallelization is not balanced
[Chart: work distribution across MPI processes]
o Process 5 has too much work to do

14  FLOWer: Dynamic thread balancing
o Sun Fire E25K, 23 MPI procs start with 2 threads each
[Chart: threads per process over iterations]
o Warm-up phase (iterations 1-12): artificially vary the number of threads per process
o Steering phase (iterations 13-30): increase the number of threads of busy procs
o Production phase (iterations 31-): freeze thread numbers + next-touch
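The slides describe the steering only at this high level; the following is a hedged sketch of the idea, not the actual DTB library code. The function name and the thresholds are illustrative assumptions.

    #include <mpi.h>
    #include <omp.h>

    // Called once per solver iteration with the time this process spent.
    void steerThreads(double myTime, int& myThreads, int maxThreads, MPI_Comm comm) {
        double maxTime = 0.0;
        MPI_Allreduce(&myTime, &maxTime, 1, MPI_DOUBLE, MPI_MAX, comm);
        if (myTime > 0.9 * maxTime && myThreads < maxThreads)
            ++myThreads;                    // busy process: grow
        else if (myTime < 0.5 * maxTime && myThreads > 1)
            --myThreads;                    // underloaded process: shrink
        omp_set_num_threads(myThreads);     // takes effect at the next parallel region
    }

A real implementation additionally has to respect the total core budget of the node, and, as the next slide shows, migrate data via next-touch once the thread counts are frozen.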

15  FLOWer: Dynamic thread balancing
o Sun Fire E25K, ~65 MFlop/s per thread = 3% of peak performance (high MPI communication overhead)
[Chart: L2 cache misses per second (local and remote accesses) and total GFlop/s over runtime [secs]; next-touch mechanism applied]

16  Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o Comparing Processors and Boxes
o Case Study: Kegeltoleranzen (WZL)

17  DROPS: A Navier-Stokes Solver in C++
o Numerical simulation of two-phase flow
o Modeled by the instationary and non-linear Navier-Stokes equation
o A Level Set function is used to describe the interface between the two phases
  - Example: silicon oil drop in D2O (fluid/fluid)
o Written in C++: object-oriented, uses nested templates, uses STL types, uses compile-time polymorphism, ...
o (Adaptive) Tetrahedral Grid Hierarchy
o Finite Element Method (FEM)

18  Naive Approach w/ OpenMP

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL p(n), z(n), q(n), r(n);
        [...]
        for (int i=1; i<=max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Option 1: Replace operator calls:

    y_ax_par(&q.raw()[0], A.num_rows(), A.raw_val(), A.raw_row(),
             A.raw_col(), Addr(p.raw()));

o Option 2: Place parallelization inside operator calls
o Problems of both options:
  - Code changes not applicable to complex expressions
  - May introduce additional overhead

19  Possible problem: Temporaries
User's code:

    laperf::vector<double> x(dim), a(dim), b(dim);
    x = (a * 2.0) + b;

Ideal code for this vector-type operation:

    for (int i = 0; i < dim; ++i)
        x[i] = a[i] * 2.0 + b[i];

But in C++ it translates to:

    laperf::vector<double> _t1 = operator*(a, 2.0);
    laperf::vector<double> _t2 = operator+(_t1, b);
    x.operator=(_t2);

o Two temporary vector copies and unnecessary overhead
o Impossible to implement efficient parallelization
o Bad placement of temporaries on cc-NUMA architectures
o Solution: built-in Expression Template Mechanism (sketched below)
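For illustration, here is a minimal expression-template sketch. laperf's actual implementation is not shown on the slide, so all types here (Expr, Vec, Add, Scale) are assumptions; the point is that the right-hand side is assembled as a lazy object and the single evaluation loop in operator= is also the natural place for the parallel region.

    #include <cstddef>
    #include <vector>

    template <class E>
    struct Expr {                     // CRTP base: any node of the expression tree
        const E& self() const { return static_cast<const E&>(*this); }
    };

    struct Vec : Expr<Vec> {
        std::vector<double> d;
        explicit Vec(std::size_t n) : d(n) {}
        double operator[](std::size_t i) const { return d[i]; }
        template <class E>
        Vec& operator=(const Expr<E>& e) {   // single fused evaluation loop
            #pragma omp parallel for
            for (long i = 0; i < (long)d.size(); ++i)
                d[i] = e.self()[i];
            return *this;
        }
    };

    template <class L, class R>
    struct Add : Expr<Add<L, R> > {          // lazy l + r
        const L& l; const R& r;
        Add(const L& l, const R& r) : l(l), r(r) {}
        double operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    template <class L>
    struct Scale : Expr<Scale<L> > {         // lazy l * s
        const L& l; double s;
        Scale(const L& l, double s) : l(l), s(s) {}
        double operator[](std::size_t i) const { return l[i] * s; }
    };

    template <class L, class R>
    Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
        return Add<L, R>(l.self(), r.self());
    }
    template <class L>
    Scale<L> operator*(const Expr<L>& l, double s) {
        return Scale<L>(l.self(), s);
    }

    // Usage: Vec x(n), a(n), b(n);  x = (a * 2.0) + b;
    // evaluates in one parallel loop, with no temporary vectors.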

20  Handling of cc-NUMA architectures
o All x86-based multi-socket systems will be cc-NUMA!
o Current operating systems apply first-touch placement
  - If cc-NUMA is ignored, the speedup will typically be zero
o The STL provides the concept of an allocator to encapsulate memory management; we build on the same concept to optimize for cc-NUMA
o Two possible allocators (the first-touch principle they build on is sketched below):
  - dist_allocator: distribute data according to schedule type (same scheduling as in the computation)
  - chunked_allocator: distribute data according to an explicitly precalculated scheme to improve load balancing
  - May even be plugged into your own data types
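Both allocators rest on the first-touch rule: a page ends up on the NUMA node of the thread that first writes it. A minimal sketch of the principle (not the allocators' internals): initialize the data with the same static schedule the compute loop uses.

    #include <omp.h>

    // v points to uninitialized memory, e.g. from new double[n]; a plain
    // std::vector would already have been first-touched by a single thread.
    void firstTouchInit(double* v, long n) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            v[i] = 0.0;                  // first write places the page locally
    }

    void compute(double* v, long n) {
        #pragma omp parallel for schedule(static)   // same schedule: threads
        for (long i = 0; i < n; ++i)                // hit "their own" pages
            v[i] *= 2.0;
    }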

21  DROPS: Iteration Loop of CG-type solver

    typedef SparseMatrixBaseCL<...> MatrixCL;
    typedef VectorBaseCL<double> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i=1; i<=max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

22  DROPS: Parallel Iteration Loop of CG-type solver

    typedef SparseMatrixBaseCL<..., MyAllocator> MatrixCL;
    typedef VectorBaseCL<double, InternalPar> VectorCL;

    PCG(const MatrixCL& A, VectorCL& x, const VectorCL& b,
        const PreCon&, int& max_iter, double& tol)
    {
        VectorCL q(n), p(n), r(n);
        [...]
        for (int i=1; i<=max_iter; ++i) {
            [...]
            q = A * p;
            double alpha = rho / (p*q);
            x += alpha * p;
            r -= alpha * q;
            [...]

o Expression Templates allow parallelization of a whole line.
o Here: complete Parallel Region inside operator calls (one such operator is sketched below).
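As one concrete example of such an operator, a hedged sketch (not DROPS' actual code) of the inner product used in "double alpha = rho / (p*q);". Overloading on std::vector is only for illustration; the parallel region lives inside the call, so the solver line needs no pragma.

    #include <vector>

    double operator*(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)   // region inside the operator
        for (long i = 0; i < (long)a.size(); ++i)
            sum += a[i] * b[i];
        return sum;
    }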

23  Parallel Performance Measurements (2/2)
o Library-parallelized GMRES solver on a cc-NUMA machine:
[Chart: Speedup vs. number of threads; optimal, TBB-std, OMP-std, OMP-distributed, OMP-chunked]
o The cc-NUMA architecture provides good memory bandwidth
o Allocator concept successful; TBB tasks have no affinity

24
o A clean C++ object-oriented coding style may be very helpful for parallelization:
  - Encapsulation prohibits unintended data dependencies
  - Encapsulation may improve data locality (think cc-NUMA)
o But: OpenMP's C++ support is limited:
  - Non-POD types are not well supported, e.g. with reductions (see the sketch below)
  - Pragmas are not language constructs: you might have to restructure / extend your code
o Nevertheless: We like the combination, and it can be used successfully!
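To make the non-POD limitation concrete: reduction clauses of this OpenMP generation accept only built-in scalar types, so class types must be reduced by hand. A minimal sketch; Vec3 and sumAll are illustrative names, not DROPS code.

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };   // class type: not allowed in reduction()

    Vec3 sumAll(const std::vector<Vec3>& v) {
        Vec3 total;
        #pragma omp parallel
        {
            Vec3 local;                            // per-thread partial sum
            #pragma omp for nowait
            for (long i = 0; i < (long)v.size(); ++i) {
                local.x += v[i].x; local.y += v[i].y; local.z += v[i].z;
            }
            #pragma omp critical                   // manual reduction step
            { total.x += local.x; total.y += local.y; total.z += local.z; }
        }
        return total;
    }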

25  Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o Comparing Processors and Boxes
o Case Study: Kegeltoleranzen (WZL)

26  Comparing Processors and Boxes

Metric \ Server    | SF V40z                | FSC RX200 S4         | Sun T5120
Processor chip     | AMD Opteron, 2.2 GHz   | Intel Xeon, 3.0 GHz  | UltraSPARC T2, 1.4 GHz
# sockets          | 4                      | 2                    | 1
# cores            | 8 (dual-core)          | 8 (quad-core)        | 8 (octo-core)
# threads          | 8                      | 8                    | 64
Accumulated L2 $   | 8 MB                   | 16 MB                | 4 MB
L2 $ strategy      | separate per core      | shared by 2 cores    | shared by 8 cores
Technology         | 90 nm                  | 45 nm                | 65 nm
Peak performance   | 35.2 GFLOPS            | 96 GFLOPS            | 11.2 GFLOPS
Dimension          | 3 units                | 1 unit               | 1 unit

Note: Here we compare machines of different ages, which can be seen as unfair! For example, newer Opteron-based machines provide similar settings in 1 unit.

27  Measuring Memory Bandwidth
o Do not look at CPU performance only: the memory subsystem's performance is crucial for your HPC application!

    long long *x, *xstart, *xend, mask;
    for (x = xstart; x < xend; x++)
        *x ^= mask;

o Each loop iteration: one load + one store
o We ran this kernel with multiple threads, each working on private data at a time, using OpenMP (large memory footprint >> L2); a self-contained version follows below
o Explicit processor binding to control the thread placement:
  - Linux: taskset command
  - Solaris: SUNW_MP_PROCBIND environment variable
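A self-contained version of this measurement, runnable as sketched; the buffer size, single repetition, and mask value are illustrative assumptions rather than the original benchmark settings.

    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const long N = 64L * 1024 * 1024 / sizeof(long long);  // 64 MB per thread, >> L2
        int nt = 1;
        double t0 = 0.0, t1 = 0.0;
        #pragma omp parallel
        {
            long long *xstart = (long long *)malloc(N * sizeof(long long));
            long long *xend = xstart + N, mask = 0x5555555555555555LL;
            for (long long *x = xstart; x < xend; x++)
                *x = 0;                          // warm-up and first touch
            #pragma omp barrier
            #pragma omp master
            { nt = omp_get_num_threads(); t0 = omp_get_wtime(); }
            #pragma omp barrier
            for (long long *x = xstart; x < xend; x++)
                *x ^= mask;                      // one load + one store per iteration
            #pragma omp barrier
            #pragma omp master
            { t1 = omp_get_wtime(); }
            free(xstart);
        }
        // 16 bytes (8 read + 8 written) per iteration and thread
        printf("aggregate bandwidth: %.2f GB/s\n", 16.0 * N * nt / (t1 - t0) / 1e9);
        return 0;
    }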

28  Selected results: 1 thread
o 2x Clovertown, 2.66 GHz: ... GB/s
o 4x Opteron 875, 2.2 GHz: ... GB/s
o 1x Niagara 2, 1.4 GHz: ... GB/s

29  Memory Bandwidth: Dual-Socket Quad-Core Clovertown
[Chart: measured bandwidth in GB/s for 1 thread, for 2 threads in three different placements, and for 8 threads]

30  Memory Bandwidth: Quad-Socket Dual-Core Opteron
[Chart: measured bandwidth in GB/s for 1 thread, for 2 threads in three different placements, and for 8 threads]
o cc-NUMA!

31  Memory Bandwidth: 8-Core Niagara 2
[Chart: measured bandwidth in GB/s for 1, 2, 4, 8, 16, and 32 threads]

32  What does that mean for my application?
o Ouch, these boxes all behave differently...
o Let's look at a Sparse Matrix-Vector multiplication (the kernel is sketched below):
[Table: Sparse MatVec (small) and Sparse MatVec (large) GFLOPS on SF V40z (Opteron), FSC RX200 S4 (Xeon), SF??? (Niagara 2)]
o Which architecture is best-suited depends on:
  - Can your application profit from big caches?
  - Can your application profit from shared / separate caches?
  - Can your application profit from a high clock rate?
  - Is your application memory-bound anyway?
  - ... and more factors
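The kernel behind such a benchmark is typically a plain CSR (compressed sparse row) matrix-vector product; a hedged sketch, not the exact benchmark code. The streaming accesses to val and col are why the large case is memory-bound on every box, while the small case rewards whatever fits into the accumulated L2 cache from the table on slide 26.

    #include <vector>

    struct CSR {                       // compressed sparse row storage
        std::vector<double> val;       // nonzero values
        std::vector<int>    col;       // column index of each nonzero
        std::vector<int>    row;       // row pointers, size nrows+1
    };

    void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
        const int nrows = (int)A.row.size() - 1;
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (int k = A.row[i]; k < A.row[i + 1]; ++k)
                sum += A.val[k] * x[A.col[k]];
            y[i] = sum;
        }
    }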

33  Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o Comparing Processors and Boxes
o Case Study: Kegeltoleranzen (WZL)

34  Case Study: Kegeltoleranzen (WZL)
o Contact analysis simulation of bevel gears
  - Written in Fortran, using the Intel Fortran 10.1 compiler
  - Very cache-friendly: runs at high MFlop/s rates
[Figures: bevel gear pair; differential gear]
Laboratory for Machine Tools and Production Engineering (WZL), RWTH Aachen

35  Case Study: Kegeltoleranzen (WZL)
o Comparing Linux and Windows Server 2008:
[Chart: Performance of Kegeltoleranzen, MFlop/s vs. number of threads; Linux 2.6 and Windows Server 2008 on 4x Opteron dual-core 2.2 GHz and on 2x Harpertown quad-core 3.0 GHz; Windows about 7% better]
o Performance gain for the user: speedup of 5.6 on one node
o Even better from the starting point (desktop: 220 MFlop/s)

36  Agenda
o FIRE: Pattern Recognition
o CP: Computation of Critical Points
o Dynamic Thread Balancing in FLOWer
o DROPS: Navier-Stokes Solver
o Comparing Processors and Boxes
o Case Study: Kegeltoleranzen (WZL)

37
o OpenMP is often a good alternative to multiple MPI processes per node, as the parallelization might require less work!
  - Adding levels of parallelism can help to increase scalability!
o Tools for performance analysis and verification are critical ingredients of the program development environment.
o OpenMP is useful for multi-core architectures!
  - But: current support for C++ is limited.
  - But: support for architecture aspects is still missing.

38  The End
Thank you for your attention!
