Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
G. Jost*, H. Jin*, D. an Mey**, F. Hatay***
* NASA Ames Research Center
** Center for Computing and Communication, University of Aachen
*** Sun Microsystems
EWOMP03, 09/22/03

SMP Cluster
- Cluster of Symmetric Multi-Processor (SMP) nodes
- Mix of shared and distributed memory architecture
- Programming paradigms:
  - OpenMP within one SMP node (shared memory)
  - All message-passing
  - Hybrid programming: OpenMP within one SMP node, message-passing between SMP nodes
[diagram: SMP nodes with shared memory connected by a network; successive builds highlight OpenMP within a node, message-passing within and between nodes, and the hybrid combination]
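
To make the hybrid paradigm concrete, here is a minimal, hypothetical sketch (not taken from the benchmark codes discussed later) of a Fortran program that uses MPI between processes, placed one or more per SMP node, and OpenMP threads within each process; the program name and all variables are assumptions.

      program hybrid_hello
        use mpi
        use omp_lib
        implicit none
        integer :: ierr, rank, nprocs, provided, tid

        ! Request FUNNELED support: only the master thread will make MPI calls.
        call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! OpenMP parallelism inside the MPI process, i.e. within one SMP node.
!$omp parallel private(tid)
        tid = omp_get_thread_num()
        print '(a,i4,a,i4,a,i4)', 'MPI rank ', rank, ' of ', nprocs, &
              ', OpenMP thread ', tid
!$omp end parallel

        call MPI_Finalize(ierr)
      end program hybrid_hello

Launched, for example, with one MPI process per SMP node and as many OpenMP threads as CPUs per node, this reproduces the hybrid layout sketched above.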

Sun Fire SMP Cluster
- 4 x Sun Fire 15K: 72 UltraSPARC III Cu 900 MHz CPUs, 144 GB RAM each
- 16 x Sun Fire 6800: 24 UltraSPARC III Cu 900 MHz CPUs, 24 GB RAM each
- Located at the Computing Center of the University of Aachen (Germany)

Sun Fire Link Configuration
[diagram: Sun Fire Link network connecting the 4 Sun Fire 15K nodes and the 16 (2 x 8) Sun Fire 6800 nodes]

Hardware Details
- UltraSPARC-III Cu processors:
  - Superscalar 64-bit processor, 900 MHz
  - L1 cache (on chip): 64 KB data and 32 KB instructions
  - L2 cache (off chip): 8 MB for data and instructions
- Sun Fire 6800 node:
  - 24 UltraSPARC-III Cu 900 MHz CPUs
  - 24 GB of shared main memory
  - Flat memory system: approx. 270 ns latency, 9.6 GB/s bandwidth
- Sun Fire 15K node:
  - 72 UltraSPARC-III Cu 900 MHz CPUs
  - 144 GB of shared main memory
  - NUMA memory system: latency 270 ns on board to 600 ns off board; bandwidth 173 GB/s on board to 43 GB/s off board (worst case)

Testbed Configurations
- Gigabit Ethernet (GE) MPI: 100 us latency, 100 MB/s bandwidth
- Sun Fire Link (SFL) MPI: 4 us latency, 2 GB/s bandwidth
- 4 Sun Fire 6800 nodes connected with GE: 96 (4x24) CPUs total
- 4 Sun Fire 6800 nodes connected with SFL: 96 (4x24) CPUs total
- 1 Sun Fire 15K node: 72 (1x72) CPUs total

The NAS Parallel Benchmark BT
- Simulated CFD application
- Uses the ADI method to solve the Navier-Stokes equations in 3D:
  - Decouples the three spatial dimensions
  - Solves a tridiagonal system of equations in each dimension (see the scalar sketch below)
- Four different implementations are used for this study:
  - One pure MPI version
  - One pure OpenMP version
  - Two different hybrid MPI/OpenMP versions
[diagram: iteration structure initialize -> compute_rhs -> x_solve -> y_solve -> z_solve]
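
BT actually solves block-tridiagonal systems, but the structure of the per-dimension line solve can be illustrated with the scalar Thomas algorithm. The following is a minimal, hypothetical sketch (not taken from the NPB source); the routine name and all arguments are assumptions.

      subroutine thomas(n, a, b, c, d, x)
        ! Solve the tridiagonal system a(i)*x(i-1) + b(i)*x(i) + c(i)*x(i+1) = d(i).
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: a(n), b(n), c(n), d(n)
        real(8), intent(out) :: x(n)
        real(8) :: cp(n), dp(n), m
        integer :: i

        ! Forward elimination.
        cp(1) = c(1) / b(1)
        dp(1) = d(1) / b(1)
        do i = 2, n
          m     = b(i) - a(i) * cp(i-1)
          cp(i) = c(i) / m
          dp(i) = (d(i) - a(i) * dp(i-1)) / m
        end do

        ! Back substitution.
        x(n) = dp(n)
        do i = n - 1, 1, -1
          x(i) = dp(i) - cp(i) * x(i+1)
        end do
      end subroutine thomas

In an ADI sweep, such a line solve is applied along x, then y, then z, which corresponds to the x_solve, y_solve, and z_solve routines above.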

BT MPI
- Message-passing implementation based on MPI (as in NPB 2.3)
- Domain decomposition employing a 3-D multipartition scheme
- For each sweep direction, all processes can start work in parallel
- Before each iteration: exchange of boundary data (see the sketch below)
- Within each sweep: partially updated results are sent to neighbors
[diagram: 4x4 multipartition layout of blocks 0-3 with synchronization points along the sweep direction]
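
As an illustration of the boundary-data exchange mentioned above, here is a minimal, hypothetical sketch of exchanging one ghost plane with each of two neighbors along a single dimension using MPI_Sendrecv; the routine name, array shape, and neighbor arguments are assumptions, and the actual BT multipartition exchange is more involved.

      subroutine exchange_boundaries(u, nx, ny, left, right)
        use mpi
        implicit none
        integer, intent(in)    :: nx, ny, left, right   ! neighbor ranks (MPI_PROC_NULL at domain ends)
        real(8), intent(inout) :: u(nx, 0:ny+1)         ! interior planes 1..ny plus one ghost plane per side
        integer :: ierr, status(MPI_STATUS_SIZE)

        ! Send the last interior plane to the right neighbor,
        ! receive the left neighbor's plane into the left ghost cells.
        call MPI_Sendrecv(u(1, ny), nx, MPI_DOUBLE_PRECISION, right, 0, &
                          u(1, 0),  nx, MPI_DOUBLE_PRECISION, left,  0, &
                          MPI_COMM_WORLD, status, ierr)

        ! Send the first interior plane to the left neighbor,
        ! receive the right neighbor's plane into the right ghost cells.
        call MPI_Sendrecv(u(1, 1),    nx, MPI_DOUBLE_PRECISION, left,  1, &
                          u(1, ny+1), nx, MPI_DOUBLE_PRECISION, right, 1, &
                          MPI_COMM_WORLD, status, ierr)
      end subroutine exchange_boundaries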

BT OMP
- Loop-level parallelism based on OpenMP (as in NPB 3.0)
- OpenMP directives placed on the outermost loops of the time-consuming routines
- One-dimensional parallelism
- Different spatial dimensions parallelized in different routines

BT OMP code segment:

!$omp parallel do
      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            ... = u(i,j,k-1) + u(i,j,k+1)

BT Hybrid V1
- MPI/OpenMP: 1-D data partition in the z-dimension (k-loop), OpenMP directives in the y-dimension (j-loop)

Routine y_solve (data dependence in the y-dimension, pipelining of thread execution):

!$omp parallel
      do k = k_low, k_high
        synchronize neighbor threads
!$omp do
        do j = 1, ny
          do i = 1, nx
            rhs(i,j,k) = rhs(i,j-1,k) + ...
        synchronize neighbor threads
!$omp end parallel

Routine z_solve (data partitioned in k, communication within the parallel region):

!$omp parallel do
      do j = 1, ny
        call receive
        do k = k_low, k_high
          do i = 1, nx
            rhs(i,j,k) = rhs(i,j,k-1) + ...
        call send

BT Hybrid V2
- MPI/OpenMP: employs the 3-D multi-partition scheme from BT MPI
- OpenMP directives placed on the outermost loops of the time-consuming routines
- Hybrid V2 without OpenMP is identical to BT MPI
- Differences to BT Hybrid V1:
  - Data decomposition in 3 dimensions
  - MPI and OpenMP employed on the same dimension
  - All communication outside of parallel regions

Routine z_solve:

      do ib = 1, nblock
        call receive
!$omp parallel do
        do j = j_low, j_high
          do i = i_low, i_high
            do k = k_low, k_high
              rhs(i,j,k,ib) = rhs(i,j,k-1,ib) + ...
        call send
      end do

Speed-up for BT Class A (Fast Networks)
[charts: "Speed-up of BT on SF 6800 (SFL) (4x24 CPUs)" and "Speed-up of BT on SF 15K (72 CPUs)"; speed-up vs. number of CPUs (up to 81 and 64, respectively) for BT OMP, BT MPI, BT Hybrid V1, and BT Hybrid V2]
- Problem size: 64 x 64 x 64 grid points
- Speed-up is measured against the time of the fastest implementation on 1 CPU (BT OMP)
- For the hybrid codes the best combination of MPI processes and threads is reported
- BT OMP scalability is limited to 24 CPUs on the SF 6800 (available number of CPUs) and to 62 CPUs on the SF 15K (due to the problem size)
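
Restated as a formula (the notation is an assumption, with T_BT-OMP(1) the single-CPU time of the fastest implementation, BT OMP, and T_impl(p) the time of a given implementation on p CPUs):

$$ S_{\mathrm{impl}}(p) = \frac{T_{\mathrm{BT\,OMP}}(1)}{T_{\mathrm{impl}}(p)} $$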

Speed-up for BT Class A (Fast Networks), continued
[charts: "Speed-up of BT on SF 6800 (SFL)" and "Speed-up of BT on SF 15K"; speed-up vs. number of CPUs for BT OMP, BT MPI, BT Hybrid V1, and BT Hybrid V2; dashed lines indicate the speed-up relative to 1 CPU of the same implementation]
- BT MPI shows the best scalability
- BT Hybrid V2: best performance with only 1 thread per MPI process, i.e. no benefit from the hybrid programming paradigm
- BT Hybrid V1: best performance with 16 MPI processes and 4 or 5 threads; the limitation on the number of MPI processes is due to the implementation

Speed-up Comparison on Fast and Slow Networks
[charts: "Speed-up of BT on SF 6800 (SFL) (4x24 CPUs)" and "Speed-up of BT on SF 6800 (GE) (4x24 CPUs)"; speed-up vs. number of CPUs for BT OMP, BT MPI, BT Hybrid V1, and BT Hybrid V2]
Performance using Gigabit Ethernet (GE):
- The hybrid implementations outperform the pure MPI implementation
- BT Hybrid V1 shows the best scalability
- BT Hybrid V1 achieves its best performance with 16 MPI processes and 4 or 5 threads

Characteristics of the Hybrid Codes

BT Hybrid V1:
- Message exchange with 2 neighbor processes
- Many short messages; the message length remains the same
- Increasing the number of threads increases the OpenMP barrier time (threads from different MPI processes have to synchronize) and the MPI time (MPI calls within parallel regions are serialized; see the sketch below)

BT Hybrid V2:
- Message exchange with 6 neighbor processes
- Few long messages; the message length decreases with an increasing number of processes
- Increasing the number of threads slightly increases the OpenMP barrier time; the MPI time is not affected since all communication occurs outside of parallel regions
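
A minimal, hypothetical sketch of the serialization issue noted for BT Hybrid V1 (not taken from the benchmark source): with MPI_THREAD_SERIALIZED the application must ensure that only one thread at a time enters MPI, here via a named OpenMP critical section, so threads queue up behind each other's communication. The program name, buffer size, and rank pairing are assumptions, and both ranks of a pair are assumed to run the same number of threads.

      program serialized_comm
        use mpi
        use omp_lib
        implicit none
        integer :: ierr, rank, nprocs, provided, partner
        integer :: tid, nth, chunk, lo
        real(8) :: buf(1024)

        call MPI_Init_thread(MPI_THREAD_SERIALIZED, provided, ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
        partner = ieor(rank, 1)          ! pair ranks 0<->1, 2<->3, ...
        buf = real(rank, 8)

        if (provided >= MPI_THREAD_SERIALIZED .and. partner < nprocs) then
!$omp parallel private(tid, nth, chunk, lo)
          tid   = omp_get_thread_num()
          nth   = omp_get_num_threads()
          chunk = size(buf) / nth        ! each thread exchanges its own slice (remainder ignored)
          lo    = tid * chunk + 1
          ! ... thread-parallel computation on this rank's slice ...
!$omp critical (mpi_calls)
          ! Only one thread at a time may call MPI; the others wait here,
          ! which is the serialization overhead seen in BT Hybrid V1.
          call MPI_Sendrecv_replace(buf(lo), chunk, MPI_DOUBLE_PRECISION, &
                                    partner, tid, partner, tid,           &
                                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$omp end critical (mpi_calls)
!$omp end parallel
        end if

        call MPI_Finalize(ierr)
      end program serialized_comm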

Performance of the Hybrid Codes
[charts: "BT Class A on 64 CPUs" (time in seconds for the 64x1, 16x4, and 4x16 MPI processes x OpenMP threads combinations over SFL and GE) and "BT Class A on 4 SMP Nodes (GE)" (time in seconds for 4x1, 16x1, 36x1, and 64x1), each for Hybrid V1 and Hybrid V2]
- BT Hybrid V1: the tight interaction between MPI and OpenMP limits the number of threads that can be used efficiently
- BT Hybrid V2: saturates the slow network with a small number of MPI processes

Using Multiple Nodes
- Sun Fire Link imposes little overhead when using multiple SMP nodes
[chart: "Hybrid V2 on 16 CPUs on SF 6800"; time in seconds on 1, 2, and 4 SMP nodes for the 16x1 (SFL), 4x4 (SFL), 16x1 (GE), and 4x4 (GE) configurations]

Conclusions

For the BT benchmark:
- The pure MPI implementation shows the best scalability when a fast network or shared memory is available:
  - All CPUs within the cluster can be used
  - Coarse-grained, well-balanced workload distribution
- The hybrid programming paradigm is advantageous for a slow network
- The tight interaction between MPI and OpenMP is problematic
- The OpenMP paradigm introduces limitations to scalability:
  - Problem size in one dimension (nested parallelism?)
  - CPUs within one SMP node (software DSM?)
  - Mandatory barrier synchronization at the end of parallel regions (OpenMP extensions?)
- The hybrid programming paradigm is more suitable for multi-zone codes

Related Work
- Comparison of MPI and shared memory programming (H. Shan et al.)
- Aspects of hybrid programming (R. Rabenseifner)
- Evaluation of MPI on the Sun Fire Link (S. J. Sistare et al.)