Hybrid Computing. Lars Koesterke. University of Porto, Portugal May 28-29, 2009



Why Hybrid?
- Eliminates domain decomposition at the node level
- Automatic coherency within the node
- Lower memory latency and less data movement within the node
- Can synchronize on memory instead of barriers

Why Hybrid? (cont.)
Hybrid is only profitable if the on-node aggregation of MPI parallel components runs faster as a single SMP algorithm (or as one SMP algorithm per socket).

Hybrid - Motivation
Load balancing and reduced memory traffic. [Figure: memory-bound behavior.]

Node Views
[Figure contrasting the OpenMP and MPI views of a node; the two concerns are process affinity and memory allocation.]

NUMA Operations
Where do threads/processes and memory allocations go? If memory were completely uniform there would be no need to worry about these two questions; only on NUMA (non-uniform memory access) systems does the (re)placement of processes and allocated memory (NUMA control) matter.
Default control: decided by policy when a process is exec'd or a thread is forked, and when memory is allocated; directed from within the kernel.

NUMA Operations
Ways process affinity and memory policy can be changed:
- Dynamically on a running process (knowing the process id)
- At process execution (with a wrapper command)
- Within the program through the F90/C API
Users can alter kernel policies (setting Process Affinity and Memory Policy == PAMPer):
- Users can PAMPer their own processes.
- Root can PAMPer any process.
- Careful: libraries may PAMPer, too!

NUMA Operations
Process affinity and memory policy can be controlled at the socket and core level with numactl. [Figure: a four-socket node (cores 0-3, 4-7, 8-11, 12-15) annotated with numactl options: -N for socket-level process assignment, -C for core-level process assignment, and -l / -i / --preferred / -m for local, interleaved, preferred, and mandatory memory allocation.]

Modes of MPI/Thread Operation
- SMP Nodes: a single MPI task is launched per node; the parallel threads share all node memory, e.g. 16 threads/node on Ranger.
- SMP Sockets: a single MPI task is launched on each socket; the parallel thread set shares the socket memory, e.g. 4 threads/socket on Ranger.
- MPI Cores: each core on a node is assigned an MPI task (not really hybrid, but in a master/slave paradigm the master could use threads).

Modes of MPI/Thread Operation
[Figure: a pure MPI node (16 MPI tasks, one per core), a hybrid node (4 MPI tasks with 4 threads per task), and a pure SMP node (1 MPI task with 16 threads); the legend distinguishes MPI tasks on cores, master threads, and slave threads of MPI tasks.]

SMP Nodes: Hybrid Batch Script, 16 threads/node
- Make sure one task is created on each node.
- Set the total number of cores (nodes x 16).
- Set the number of threads for each node.
- PAMPering at job level: controls behavior for ALL tasks; there is no simple/standard way to control thread-core affinity.

job script (Bourne shell):
  ...
  #$ -pe 1way 192
  ...
  export OMP_NUM_THREADS=16
  ibrun numactl -i all ./a.out

job script (C shell):
  ...
  #$ -pe 1way 192
  ...
  setenv OMP_NUM_THREADS 16
  ibrun numactl -i all ./a.out

SMP Sockets: Hybrid Batch Script, 4 tasks/node, 4 threads/task
- Example script setup for a square (6x6 = 36) processor topology.
- Create a task for each socket (4 tasks per node).
- Set the total number of cores allocated by batch (nodes x 16 cores/node).
- Set the actual number of cores used with MY_NSLOTS.
- Set the number of threads for each task.
- PAMPering at task level: create a script to extract the rank for the numactl options and the a.out execution (TACC MPI systems always assign sequential ranks on a node); there is no simple/standard way to control thread-core affinity.

job script (Bourne shell):
  ...
  #$ -pe 4way 48
  ...
  export MY_NSLOTS=36
  export OMP_NUM_THREADS=4
  ibrun numa.sh

job script (C shell):
  ...
  #$ -pe 4way 48
  ...
  setenv MY_NSLOTS 36
  setenv OMP_NUM_THREADS 4
  ibrun numa.csh

SMP Sockets: Hybrid Batch Script, 4 tasks/node, 4 threads/task (numa.sh / numa.csh for MVAPICH2)

numa.sh:
  #!/bin/bash
  export MV2_USE_AFFINITY=0
  export MV2_ENABLE_AFFINITY=0
  # TasksPerNode
  TPN=`echo $PE | sed 's/way//'`
  [ ! $TPN ] && echo TPN NOT defined!
  [ ! $TPN ] && exit 1
  socket=$(( $PMI_RANK % $TPN ))
  numactl -N $socket -m $socket ./a.out

numa.csh:
  #!/bin/tcsh
  setenv MV2_USE_AFFINITY 0
  setenv MV2_ENABLE_AFFINITY 0
  # TasksPerNode
  set TPN = `echo $PE | sed 's/way//'`
  if(! ${%TPN}) echo TPN NOT defined!
  if(! ${%TPN}) exit
  @ socket = $PMI_RANK % $TPN
  numactl -N $socket -m $socket ./a.out

Hybrid Program Model
- Start with MPI initialization.
- Create OpenMP parallel regions within each MPI task (process); serial regions run on the master thread of the MPI task.
- The MPI rank is known to all threads.
- Call the MPI library in serial and/or parallel regions.
- Finalize MPI.

MPI with OpenMP -- Messaging
- Single-threaded messaging: rank to rank (node to node); MPI is called from the serial region or from a single thread within a parallel region.
- Multi-threaded messaging: rank-thread ID to any rank-thread ID (node to node); MPI is called from multiple threads within a parallel region; requires a thread-safe MPI implementation.

Threads Calling MPI
Use MPI_Init_thread to select/determine MPI's level of thread support (in lieu of MPI_Init); MPI_Init_thread is supported in MPI-2. Thread safety is controlled by the provided types: single, funneled, serialized, and multiple.
- Single means there is no multi-threading.
- Funneled means only the master thread calls MPI.
- Serialized means multiple threads can call MPI, but only one call can be in progress at a time (serialized).
- Multiple means MPI is thread safe.
Monotonically increasing values are assigned to the parameters: MPI_THREAD_SINGLE < MPI_THREAD_FUNNELED < MPI_THREAD_SERIALIZED < MPI_THREAD_MULTIPLE.

Syntax: MPI-2 MPI_Init_thread
  Fortran: call MPI_Init_thread(irequired, iprovided, ierr)
  C:       int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
  C++:     int MPI::Init_thread(int& argc, char**& argv, int required)

Support levels and description:
  MPI_THREAD_SINGLE     - Only one thread will execute.
  MPI_THREAD_FUNNELED   - Process may be multi-threaded, but only the main thread will make MPI calls (calls are "funneled" to the main thread). Default.
  MPI_THREAD_SERIALIZED - Process may be multi-threaded and any thread can make MPI calls, but threads cannot execute MPI calls concurrently (MPI calls are "serialized").
  MPI_THREAD_MULTIPLE   - Multiple threads may call MPI, with no restrictions.

If supported, the call will return provided = required; otherwise the highest available level of support will be provided.
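As a concrete illustration (a minimal C sketch, not part of the original slides), a hybrid code would typically request a level and verify what the library actually granted before relying on it:

  /* Minimal sketch (assumption: illustrative, not from the slides):
   * request funneled support and check what the library provides. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank;

      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* The monotonic ordering of the constants makes this comparison valid. */
      if (provided < MPI_THREAD_FUNNELED) {
          if (rank == 0)
              fprintf(stderr, "Insufficient MPI thread support: %d\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }

      /* ... hybrid MPI + OpenMP work ... */

      MPI_Finalize();
      return 0;
  }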

Hybrid Coding

Fortran:
  include 'mpif.h'
  program hybsimp
  call MPI_Init(ierr)
  call MPI_Comm_rank(..., irank, ierr)
  call MPI_Comm_size(..., isize, ierr)
  ! Setup shared memory, compute & communicate
  !$OMP parallel do
  do i=1,n
    <work>
  enddo
  ! compute & communicate
  call MPI_Finalize(ierr)
  end

C:
  #include <mpi.h>
  int main(int argc, char **argv){
    int rank, size, ierr, i;
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(..., &rank);
    ierr = MPI_Comm_size(..., &size);
    // Setup shared memory, compute & communicate
    #pragma omp parallel for
    for(i=0; i<n; i++){
      <work>
    }
    // compute & communicate
    ierr = MPI_Finalize();
  }
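To make the skeleton above concrete, a small self-contained hybrid "hello world" in C (a sketch under the same model, not taken from the slides) lets every thread of every MPI task report its placement; compile with something like mpicc -fopenmp:

  /* Sketch (assumption: illustrative). Each OpenMP thread of each MPI task
   * prints its rank, thread id, and host name. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank, size, namelen;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(name, &namelen);

      #pragma omp parallel
      {
          printf("Rank %d of %d, thread %d of %d, on %s\n",
                 rank, size, omp_get_thread_num(), omp_get_num_threads(), name);
      }

      MPI_Finalize();
      return 0;
  }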

MPI Call through Master (MPI_THREAD_FUNNELED)
Use an explicit OpenMP barrier, since there is no implicit barrier in the master construct (OMP MASTER). All other threads will be sleeping during the MPI call.

Funneling through Master

Fortran:
  include 'mpif.h'
  program hybmas
  !$omp parallel
  !$omp barrier
  !$omp master
  call MPI_<Whatever>(..., ierr)
  !$omp end master
  !$omp barrier
  !$omp end parallel
  end

C:
  #include <mpi.h>
  int main(int argc, char **argv){
    int rank, size, ierr, i;
    #pragma omp parallel
    {
      #pragma omp barrier
      #pragma omp master
      {
        ierr = MPI_<Whatever>(...);
      }
      #pragma omp barrier
    }
  }
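For a concrete instance of the funneled pattern (an illustrative sketch, not the slide's code), the master thread below broadcasts a value that all threads then read; the explicit barriers order the threads around the MPI call exactly as described above:

  /* Sketch (assumption: illustrative). Requires MPI_THREAD_FUNNELED. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank;
      double value = 0.0;               /* shared by all threads */

      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) value = 3.14;      /* data to distribute */

      #pragma omp parallel
      {
          #pragma omp barrier           /* all threads arrive before the MPI call */
          #pragma omp master
          {
              MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
          }
          #pragma omp barrier           /* master has no implicit barrier */

          printf("rank %d thread %d sees %f\n",
                 rank, omp_get_thread_num(), value);
      }

      MPI_Finalize();
      return 0;
  }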

MPI Call within Single (MPI_THREAD_SERIALIZED)
Only an OpenMP barrier at the beginning is needed, since there is an implicit barrier at the end of the SINGLE worksharing construct (OMP SINGLE). All other threads will be sleeping. (The simplest case is for any one thread to execute a single MPI call, e.g. with the single OpenMP construct; see the next slide.)

Serialize through Single

Fortran:
  include 'mpif.h'
  program hybsing
  call MPI_Init_thread(MPI_THREAD_SERIALIZED, iprovided, ierr)
  !$omp parallel
  !$omp barrier
  !$omp single
  call MPI_<Whatever>(..., ierr)
  !$omp end single
  !!$omp barrier      ! not needed: single has an implicit barrier
  !$omp end parallel
  end

C:
  #include <mpi.h>
  int main(int argc, char **argv){
    int rank, size, ierr, i, iprovided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &iprovided);
    #pragma omp parallel
    {
      #pragma omp barrier
      #pragma omp single
      {
        ierr = MPI_<Whatever>(...);
      }
      // #pragma omp barrier    // not needed: single has an implicit barrier
    }
  }

Overlapping Communication and Work
- One core can saturate the PCIe (network) bus, so why use all cores to communicate?
- Communicate with one or several cores; let the others work during communication.
- Needs at least MPI_THREAD_FUNNELED support.
- Can be difficult to manage and load balance!

Overlapping Communication and Work

Fortran:
  include 'mpif.h'
  program hybover
  !$omp parallel
  if (ithread .eq. 0) then
     call MPI_<Whatever>(..., ierr)
  else
     <work>
  endif
  !$omp end parallel
  end

C:
  #include <mpi.h>
  int main(int argc, char **argv){
    int rank, size, ierr, i;
    #pragma omp parallel
    {
      if (thread == 0){
        ierr = MPI_<Whatever>(...);
      }
      if (thread != 0){
        <work>
      }
    }
  }
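A slightly more concrete sketch of the same idea (illustrative only; the halo[] array and the ring exchange are assumptions, not from the slides) lets thread 0 exchange boundary data with neighboring ranks while the remaining threads update work that does not depend on it; at least MPI_THREAD_FUNNELED is required:

  /* Sketch (assumption: illustrative). Thread 0 communicates while the
   * other threads compute. */
  #include <mpi.h>
  #include <omp.h>

  #define N 1024
  static double halo[N];                /* boundary data to exchange */

  void exchange_and_compute(int rank, int size)
  {
      #pragma omp parallel
      {
          int tid = omp_get_thread_num();

          if (tid == 0 && size > 1) {
              int right = (rank + 1) % size;          /* simple ring exchange */
              int left  = (rank - 1 + size) % size;
              MPI_Sendrecv_replace(halo, N, MPI_DOUBLE,
                                   right, 0, left, 0,
                                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else {
              /* ... interior work that does not depend on halo ... */
          }
          /* The implicit barrier at the end of the parallel region ensures
           * the exchanged halo data is visible to all threads afterwards. */
      }
  }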

Thread-rank Communication
Both the thread id and the MPI rank can be used in communication. The following example illustrates the technique with a multi-threaded ping (send/receive).

Thread-rank Communication (Fortran)
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, iprovided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
  !$omp parallel private(i, ithread, nthreads)
    nthreads = OMP_GET_NUM_THREADS()
    ithread  = OMP_GET_THREAD_NUM()
    call pwork(ithread, irank, nthreads, nranks)
    ! Communicate between ranks; threads use tags to differentiate.
    if (irank == 0) then
      call MPI_Send(ithread, 1, MPI_INTEGER, 1, ithread, MPI_COMM_WORLD, ierr)
    else
      call MPI_Recv(j, 1, MPI_INTEGER, 0, ithread, MPI_COMM_WORLD, istatus, ierr)
      print*, "Yep, this is ", irank, " thread ", ithread, " I received from ", j
    endif
  !$omp end parallel
  end
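The same ping pattern in C (a sketch, not the slide's pwork code; it assumes exactly two MPI ranks with equal OMP_NUM_THREADS) shows the thread id doubling as the message tag and requires MPI_THREAD_MULTIPLE:

  /* Sketch (assumption: illustrative, expects exactly 2 ranks). Thread i of
   * rank 0 sends to thread i of rank 1, using the thread id as the tag. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank;

      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (provided < MPI_THREAD_MULTIPLE) {
          if (rank == 0) fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
          MPI_Finalize();
          return 1;
      }

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();
          if (rank == 0) {
              MPI_Send(&tid, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
          } else if (rank == 1) {
              int j;
              MPI_Recv(&j, 1, MPI_INT, 0, tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("Rank %d thread %d received from thread %d\n", rank, tid, j);
          }
      }

      MPI_Finalize();
      return 0;
  }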

NUMA in Code: Scheduling Affinity and Memory Policy
Affinity and memory policy can be changed within code through:
- sched_getaffinity / sched_setaffinity
- get_mempolicy / set_mempolicy
Scheduling: bits in a mask are set for core assignments, e.g. a mask of ...0001 assigns to core 0, a mask with only bit 15 set assigns to core 15, and a mask with bits 0 and 15 set allows core 0 or core 15.

NUMA in Code: Scheduling Affinity

  #include <sched.h>
  int icore = 3;                         /* set core number      */
  cpu_set_t cpu_mask;                    /* allocate mask        */
  CPU_ZERO(&cpu_mask);                   /* set mask to zero     */
  CPU_SET(icore, &cpu_mask);             /* set mask with core # */
  err = sched_setaffinity((pid_t)0,      /* set the affinity     */
                          sizeof(cpu_mask), &cpu_mask);
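The memory-policy side mentioned above can also be reached through libnuma (an assumption here: the slides name only the low-level memory-policy interface, and libnuma is one common wrapper around it; link with -lnuma). A hedged sketch that restricts execution to one NUMA node and asks that future allocations prefer that node's memory:

  /* Sketch (assumption: illustrative use of libnuma, not from the slides). */
  #include <numa.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      int node = 0;                      /* illustrative target NUMA node */

      if (numa_available() < 0) {
          fprintf(stderr, "NUMA policy is not supported on this system\n");
          return 1;
      }

      numa_run_on_node(node);            /* restrict execution to this node */
      numa_set_preferred(node);          /* prefer memory from this node    */

      /* Allocations made after this point will prefer this node's memory. */
      double *buf = malloc(1024 * 1024 * sizeof(double));
      if (buf == NULL) return 1;

      /* ... work on buf ... */

      free(buf);
      return 0;
  }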

Conclusion
- Placement and binding of processes, and the location of memory allocations, are important performance considerations in pure MPI/OpenMP and hybrid codes.
- Simple numactl commands and APIs allow users to control process and memory assignments.
- With 8-core and 16-core socket systems on the way, even more effort will be focused on process scheduling and memory locality.
- Expect to see more multi-threaded libraries.

References
- Hybrid OpenMP and MPI Programming and Tuning (NUG2004), Yun (Helen) He and Chris Ding, Lawrence Berkeley National Laboratory, June 24, 2004. www.nersc.gov/nusers/services/training/classes/nug/Jun04/NUG2004_yhe_hybrid.ppt
- www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node162.htm#node162
- www.tacc.utexas.edu/services/userguides/ranger (see the NUMA section)
- www.mpi-forum.org/docs/mpi2-report.pdf
- www.intel.com/software/products/compilers/docs/fmac/doc_files/source/extfile/optaps_for/common/optaps_openmp_thread_affinity.htm