Binding Nested OpenMP Programs on Hierarchical Memory Architectures


Binding Nested OpenMP Programs on Hierarchical Memory Architectures
Dirk Schmidl, Christian Terboven, Dieter an Mey, and Martin Bücker
{schmidl, terboven, anmey}@rz.rwth-aachen.de, buecker@sc.rwth-aachen.de

Agenda: thread binding (special issues concerning nested OpenMP, complexity of manual binding); our approach to binding threads for nested OpenMP (strategy, implementation); performance tests (kernel benchmarks and the production code SHEMAT-Suite). Slide 2

Advantages of thread binding: first-touch data placement only makes sense when threads are not moved during the program run; communication and synchronization are faster through shared caches; program performance becomes reproducible. Slide 3
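
As background (not on the slides), a minimal C/OpenMP sketch of why binding matters for first-touch placement: each thread writes the part of the array it later reads, so the pages land in that thread's local NUMA memory, but the benefit survives only if the threads are not migrated afterwards. Array size and schedule are arbitrary choices for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (64 * 1024 * 1024)

    int main(void) {
        double *a = malloc(N * sizeof(double));

        /* First touch: each page is placed on the NUMA node of the thread
         * that writes it first. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 1.0;

        /* The later sweep uses the same static schedule, so every thread
         * reads local memory -- provided the threads were not migrated. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("%f\n", sum);
        free(a);
        return 0;
    }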

Thread binding and OpenMP today: (1) compiler-dependent environment variables (KMP_AFFINITY, SUNW_MP_PROCBIND, GOMP_CPU_AFFINITY, ...) are not uniform and do not support nesting well; (2) manual binding through API calls (sched_setaffinity, ...) can only bind system threads, and hardware knowledge is needed to find the best binding. Slide 4
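
A hedged C sketch of the manual approach mentioned in point (2): pinning the calling thread to one logical core with sched_setaffinity. The core ID (0 here) is an arbitrary example; choosing it well requires knowing the machine's core numbering, which is exactly the burden the authors' library is meant to remove.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling (system) thread to one logical core. */
    static int bind_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
    }

    int main(void) {
        if (bind_to_core(0) != 0)   /* core 0 chosen only for illustration */
            perror("sched_setaffinity");
        return 0;
    }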

Numbering of cores on the Nehalem processor differs between vendors. (Figure: on a 2-socket system from Sun the logical CPUs appear as 0 8 1 9 2 10 3 11 on one socket and 4 12 5 13 6 14 7 15 on the other, while on a 2-socket system from HP they appear as 0 8 2 10 4 12 6 14 and 1 9 3 11 5 13 7 15.) Slide 5
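
One way (not shown on the slides) to recover the physical layout behind such logical IDs on Linux is to read the standard sysfs topology files; the loop bound of 16 matches the systems pictured and would need adjusting elsewhere.

    #include <stdio.h>

    /* Print socket and core ID for the first 16 logical CPUs. */
    int main(void) {
        for (int cpu = 0; cpu < 16; cpu++) {
            char path[128];
            int pkg = -1, core = -1;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
            if ((f = fopen(path, "r"))) {
                if (fscanf(f, "%d", &pkg) != 1) pkg = -1;
                fclose(f);
            }

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            if ((f = fopen(path, "r"))) {
                if (fscanf(f, "%d", &core) != 1) core = -1;
                fclose(f);
            }

            printf("cpu %2d: socket %d, core %d\n", cpu, pkg, core);
        }
        return 0;
    }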

Nested OpenMP: most compilers use a thread pool, so it is not always the same system threads that are taken out of this pool; there is more synchronization and data sharing within the inner teams, which makes the placement of an inner team's threads even more important. (Figure: four inner teams mapped onto the node/socket/core hierarchy.) Slide 6

Thread binding approach, goals: easy to use; no detailed hardware knowledge needed; user interaction possible in an understandable way; support for nested OpenMP. Slide 7

Thread binding approach, solution: the user provides simple binding strategies (scatter, compact, subscatter, subcompact), either via the environment variable OMP_NESTING_TEAM_SIZE=4,scatter,2,subcompact or via the function call omp_set_nesting_info("4,scatter,2,subcompact"). Hardware information and the mapping of threads to the hardware are handled automatically, and the affinity mask of the process is taken into account. Slide 8
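
A sketch of how this interface would be used. The environment variable and the function name are taken from the slide; the exact prototype (a single strategy string) is an assumption, and the program links only against the authors' binding library.

    #include <omp.h>

    /* Assumed prototype; only the name appears on the slides. */
    extern void omp_set_nesting_info(const char *strategy);

    int main(void) {
        /* 4 outer threads spread across NUMA nodes, inner teams of 2 kept
         * close to their master thread. */
        omp_set_nesting_info("4,scatter,2,subcompact");

        #pragma omp parallel num_threads(4)
        {
            #pragma omp parallel num_threads(2)
            {
                /* nested work */
            }
        }
        return 0;
    }

Alternatively, the strategy can be set before program start, e.g. export OMP_NESTING_TEAM_SIZE=4,scatter,2,subcompact (nested parallelism itself still has to be enabled in the OpenMP runtime).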

Binding strategies: compact binds the threads of a team close together so that, if possible, the team can use shared caches and is connected to the same local memory; scatter spreads the threads far away from each other, maximizing memory bandwidth by using as many NUMA nodes as possible; subcompact/subscatter run the team close to its master thread, e.g. on the same board or socket, and apply the compact or scatter strategy within that board or socket, so that data initialized by the master can still be found in the local memory. Slide 9
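
A purely hypothetical illustration (not the paper's implementation) of what compact and scatter mean as a mapping from team-thread number to core, for a machine with four sockets of four cores and a simple socket-by-socket core numbering -- an assumption that, as slide 5 shows, does not hold on real systems and is exactly what the binding library hides.

    #include <stdio.h>

    #define SOCKETS 4
    #define CORES_PER_SOCKET 4

    /* compact: fill one socket after the other */
    static int compact_core(int t) {
        return t;
    }

    /* scatter: round-robin over the sockets first */
    static int scatter_core(int t) {
        int socket = t % SOCKETS;
        int slot   = t / SOCKETS;
        return socket * CORES_PER_SOCKET + slot;
    }

    int main(void) {
        for (int t = 0; t < 4; t++)
            printf("thread %d: compact -> core %d, scatter -> core %d\n",
                   t, compact_core(t), scatter_core(t));
        return 0;
    }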

Binding strategies, illustration: with 4,compact all four threads occupy the cores of a single socket, while with 4,scatter one thread is placed on each socket (figure: used cores vs. free cores). Slide 10

Binding strategies, illustration: with 4,scatter,4,subscatter each inner team of four stays on its master's socket, while with 4,scatter,4,scatter the inner teams are spread across all sockets again. Slide 11

Binding library: (1) automatically detect the hardware information from the system; (2) read the environment variables and map OpenMP threads to cores respecting the specified strategies; (3) since it is not clear which system threads are used, binding has to be redone every time a team is forked -- the code is instrumented with OPARI, and a library binds the threads in the pomp_parallel_begin() function using the computed mapping; (4) the mapping is updated after omp_set_nesting_info(). Slide 12
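
A minimal sketch of the idea behind step (3): a hook that runs at every parallel-region entry and pins the calling thread according to a precomputed mapping. The hook name and the mapping function are placeholders, not the actual OPARI/POMP interface or the authors' data structures.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <omp.h>

    /* Placeholder mapping from (nesting level, thread number) to a core ID.
     * A real implementation would fill this from the detected topology and
     * the strategy string; the identity mapping is illustration only. */
    static int mapping_lookup(int level, int thread_num) {
        (void)level;
        return thread_num;
    }

    /* Executed at every parallel-region entry (via instrumentation), so the
     * binding is re-applied no matter which pool thread the runtime handed
     * out this time. */
    void binding_hook_parallel_begin(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(mapping_lookup(omp_get_level(), omp_get_thread_num()), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }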

Hardware used:
(1) Tigerton (Fujitsu-Siemens RX600): 4 x Intel Xeon X7350 @ 2.93 GHz, 1 x 64 GB RAM -- an SMP.
(2) Barcelona (IBM eServer LS42): 4 x AMD Opteron 8356 @ 2.3 GHz, 4 x 8 = 32 GB RAM -- cc-NUMA.
(3) ScaleMP: 13 boards connected via InfiniBand, each with 2 x Intel Xeon E5420 @ 2.5 GHz; cache coherency is provided by the vSMP virtualization software; 13 x 16 = 208 GB RAM, of which ~38 GB are reserved for vSMP, leaving 170 GB available -- cc-NUMA with a high NUMA ratio. Slide 13

Nested Stream: a modification of the STREAM benchmark. The outer threads start different teams; each inner team computes the STREAM benchmark on its own separate data arrays, and the total reached memory bandwidth is computed. Procedure: start the outer threads, initialize the data, compute the triad, compute the reached bandwidth. Slide 14
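
A hedged C/OpenMP sketch of the structure just described; array size, team sizes, and the single timed triad are assumptions for illustration, not the benchmark's actual parameters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000          /* per-team array length: an assumption */
    #define OUTER 4
    #define INNER 4

    int main(void) {
        double bw_total = 0.0;
        omp_set_nested(1);

        #pragma omp parallel num_threads(OUTER) reduction(+:bw_total)
        {
            /* Each outer thread (= one inner team) owns separate arrays. */
            double *a = malloc(N * sizeof(double));
            double *b = malloc(N * sizeof(double));
            double *c = malloc(N * sizeof(double));

            #pragma omp parallel for num_threads(INNER) schedule(static)
            for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

            double t = omp_get_wtime();
            #pragma omp parallel for num_threads(INNER) schedule(static)
            for (long i = 0; i < N; i++)          /* STREAM triad */
                a[i] = b[i] + 3.0 * c[i];
            t = omp_get_wtime() - t;

            /* The triad touches three arrays of N doubles. */
            bw_total += 3.0 * N * sizeof(double) / t / 1e9;
            free(a); free(b); free(c);
        }
        printf("total bandwidth: %.1f GB/s\n", bw_total);
        return 0;
    }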

Nested Stream results: memory bandwidth in GB/s of the nested STREAM benchmark; the Y,scatter,Z,subscatter strategy was used for YxZ threads.

    threads             1x1    1x4    4x1    4x4    6x1    6x4   13x1   13x4
    Barcelona unbound   4.4    4.9   15.0   10.7
              bound     4.4    7.6   15.8   13.1
    Tigerton  unbound   2.3    6.0    4.8    8.7
              bound     2.3    3.0    8.2    8.5
    ScaleMP   unbound   3.8   10.7   11.2    1.7    9.0    1.6    3.4    2.4
              bound     3.8    5.9   14.4   18.8   20.4   15.8   43.0   27.8

Slides 15-18

Nested EPCC syncbench: a modification of the EPCC microbenchmarks. The outer threads start different teams; each inner team uses the synchronization constructs, and the average synchronization overhead is computed. Procedure: start the outer threads, use the synchronization constructs, compute the average overhead. Slide 19
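
A rough C/OpenMP sketch in the spirit of the nested syncbench, timing a barrier inside the inner teams; repetition count and team sizes are assumptions, and the real benchmark also subtracts a reference time without the construct.

    #include <stdio.h>
    #include <omp.h>

    #define REPS 10000
    #define OUTER 4
    #define INNER 4

    int main(void) {
        omp_set_nested(1);
        double t = omp_get_wtime();

        #pragma omp parallel num_threads(OUTER)
        {
            #pragma omp parallel num_threads(INNER)
            {
                for (int r = 0; r < REPS; r++) {
                    #pragma omp barrier
                }
            }
        }

        t = omp_get_wtime() - t;
        printf("rough time per inner barrier: %.2f us\n", t / REPS * 1e6);
        return 0;
    }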

Nested EPCC syncbench results: overhead of nested OpenMP constructs in microseconds; the Y,scatter,Z,subcompact strategy was used for YxZ threads. The first group of columns is for 1x2 threads, the second for 4x4 threads (2x8 on ScaleMP).

                          parallel  barrier  reduction    parallel  barrier  reduction
    Barcelona  unbound      19.25    18.12     19.80        117.90   100.27    119.35
               bound        23.53    20.97     24.14         70.05    69.26     69.29
    Tigerton   unbound      23.88    20.84     24.17         74.15    54.90     77.00
               bound         9.48     7.35      9.77         58.96    34.75     58.24
    ScaleMP    unbound      63.53    65.00     42.74       2655.09  2197.42   2642.03
               bound        34.11    33.47     41.99        425.63   323.71    444.77

For the larger thread configurations, binding reduces the overhead by roughly 1.7x on Barcelona, 1.3x on Tigerton, and 6x on ScaleMP. Slides 20 and 22

SHEMAT-Suite: a geothermal simulation package from the Institute for Applied Geophysics that simulates groundwater flow, heat transport, and the transport of reactive solutes in porous media at high temperatures in 3D. Written in Fortran, with two levels of parallelism: the independent computations of the directional derivatives (at most 16 for the dataset used), and the setup and solving of the linear equation systems. Slide 23
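
Since SHEMAT-Suite itself is Fortran and its code is not shown on the slides, the following is only a schematic C/OpenMP stand-in for the two nesting levels described above; the routine, the loop bounds, and the dummy work are invented for illustration.

    #include <stdio.h>
    #include <omp.h>

    #define NUM_DERIVATIVES 16   /* maximum for the dataset used, per the slides */
    #define N 1000000            /* dummy system size, illustration only */

    /* Stand-in for setting up and solving one linear system; the inner
     * parallel region is the second level of parallelism. */
    static double solve_one(int d) {
        double sum = 0.0;
        #pragma omp parallel for num_threads(2) reduction(+:sum) schedule(static)
        for (int i = 0; i < N; i++)
            sum += (double)(i % (d + 1));
        return sum;
    }

    int main(void) {
        omp_set_nested(1);

        /* Outer level: independent directional derivatives. */
        #pragma omp parallel for num_threads(4) schedule(dynamic)
        for (int d = 0; d < NUM_DERIVATIVES; d++) {
            double r = solve_one(d);
            #pragma omp critical
            printf("derivative %2d done (checksum %.0f)\n", d, r);
        }
        return 0;
    }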

SHEMAT-Suite on Tigerton: only small differences; sometimes the unbound run is even advantageous; for larger thread numbers the differences remain very small, and binding brings only 2.3% in the best case. (Chart: speedup for thread configurations from 1x1 up to 16x1, bound vs. unbound; Y,scatter,Z,subscatter strategy used for YxZ threads.) Slide 24

SHEMAT-Suite on Barcelona: larger differences for larger thread numbers and noticeable differences in the nested cases, e.g. for 8x2 the improvement is 55%. (Chart: speedup for thread configurations from 1x1 up to 16x1, bound vs. unbound; Y,scatter,Z,subscatter strategy used for YxZ threads.) Slide 25

SHEMAT-Suite on ScaleMP: very large differences as soon as multiple boards are used; binding brings a 2.5x performance improvement in the best case, and the speedup is about 2 times higher than on the Barcelona and 3 times higher than on the Tigerton. (Chart: speedup for thread configurations from 1x1 up to 10x8, bound vs. unbound; Y,scatter,Z,subscatter strategy used for YxZ threads.) Slide 26

Recent results: SHEMAT-Suite on the new ScaleMP box at Bielefeld. ScaleMP Bielefeld: 15 (16) boards, each with 2 x Intel Xeon E5550 @ 2.67 GHz (Nehalem); 320 GB RAM, of which ~32 GB are reserved for vSMP, leaving 288 GB available. Slide 27

SHEMAT-Suite on ScaleMP Bielefeld: a speedup of about 8 without binding and of up to 38 when binding is used; comparing the best effort in both cases, the improvement is 4x. (Chart: speedup for thread configurations from 1x1 up to 60x2 and 15x8, bound vs. unbound; Y,scatter,Z,subscatter strategy used for YxZ threads.) Slide 28

Summary: binding is important, especially for nested OpenMP programs; on SMPs the advantage is low, but on cc-NUMA architectures it really pays off; hardware details are confusing and should be hidden from the user. Slide 29

End: thank you for your attention! Questions? Slide 30