High Performance Computing: Tools and Applications
Edmond Chow
School of Computational Science and Engineering
Georgia Institute of Technology
Lecture 20

Shared memory computers and clusters

On a shared memory computer, should you use MPI or a shared memory mechanism (e.g., OpenMP)?

Disadvantages of MPI:
- may need to replicate common data structures, so memory use is not scalable (it scales with the number of processes); a shared memory approach also lets you use larger blocks in algorithms that are more efficient that way, e.g., Jacobi-Schwarz with larger and fewer blocks
- cannot continue to use MPI alone for compute nodes as the number of cores increases while memory and network bandwidth per core decrease
- MPI uses data copying in its protocols; copying should not be necessary with shared memory

Advantages of MPI:
- no threads that might interfere with each other (e.g., by sharing the heap)
- forces you to decompose problems for locality

Shared memory computers and clusters (continued)

In general, on clusters, and even on Intel Xeon Phi, MPI may be combined with OpenMP:
- to reduce the number of MPI processes (MPI processes have overhead)
- because the maximum number of MPI processes may be limited

MPI shared memory
- MPI-3 provides for shared memory programming within a compute node (use load/store instead of get/put)
- This is an alternative to MPI+X, which might not interoperate well, e.g., X = OpenMP
- MPI+MPI has no interoperability issues

Recall MPI RMA:
- define memory regions
- use put/get
- synchronize

MPI-3 shared memory programming extends some RMA ideas, but uses load/store instead of get/put.

Shared memory programming with processes?
- processes have their own address space
- a shared memory region may have different addresses in each process (but it is physically the same memory, so copies are avoided)

Recall creating memory windows for RMA:

int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win);

- base is the local address of the beginning of the window
- this memory can be any memory, including memory allocated by MPI_Alloc_mem (and freed by MPI_Free_mem)
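Below is a minimal sketch of this recipe (create a window over memory from MPI_Alloc_mem, put into a neighbor, synchronize with fences); the one-double window and the ring-style neighbor pattern are illustrative choices, not from the lecture.

/* Minimal RMA sketch: define a memory region, use put, synchronize. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate the window memory with MPI_Alloc_mem (any memory would do). */
    double *buf;
    MPI_Alloc_mem(sizeof(double), MPI_INFO_NULL, &buf);
    *buf = -1.0;

    /* Expose one double per process as an RMA window. */
    MPI_Win win;
    MPI_Win_create(buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Each rank puts its rank (as a double) into the next rank's window. */
    double val = (double) rank;
    int target = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    printf("rank %d received %.0f\n", rank, *buf);

    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}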

Allocating memory and creating a memory window at the same time:

int MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win *win);

Call it this way:

double *baseptr;
MPI_Win win;
MPI_Win_allocate(..., &baseptr, &win);

Free the window (this also frees the allocated memory):

MPI_Win_free(&win);
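A hedged fragment showing the combined call; it could replace the MPI_Alloc_mem / MPI_Win_create / MPI_Free_mem steps in the sketch above (the one-double size is again an illustrative assumption).

/* Sketch: allocate the window memory and create the window in one call. */
double *baseptr;
MPI_Win  win;
MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &baseptr, &win);
baseptr[0] = -1.0;        /* local access through the returned pointer */
/* ... the same fence / MPI_Put epochs as above ... */
MPI_Win_free(&win);       /* also frees the memory allocated by MPI_Win_allocate */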

Philosophy: shared memory windows
- data is private by default (as in MPI programs running as separate processes)
- data is made public explicitly through shared memory windows
- allows graceful migration of pure MPI programs to use multicore processors more efficiently
- communication (load/store) via shared memory does not involve extra copies
- creating a shared memory window is a collective operation (done at the same time on all ranks, allowing optimization of the memory layout)

Allocating shared memory windows

A window on a process's memory that can be accessed by other processes on the same node:

int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info,
                            MPI_Comm comm, void *baseptr, MPI_Win *win);

- called collectively by the processes in comm
- the processes in comm must be those that can access shared memory (e.g., processes on the same compute node)
- by default, a contiguous region of memory is allocated and shared (noncontiguous allocation is also possible, and may be more efficient since each contributed region can be page aligned)
- each process contributes size bytes to the contiguous region; size can be different for each process and can be zero
- the contributions to the shared region are ordered by rank
- baseptr is the pointer to the process's own contributed memory (not the beginning of the shared region) in the address space of that process
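A hedged sketch of the collective allocation; shmcomm is assumed to be a communicator containing only processes on one node (such a communicator can be built with MPI_Comm_split_type, shown at the end of the lecture), and one double per process is an illustrative size.

/* Sketch: each process in shmcomm contributes one double to a shared,
   contiguous (by default) region; baseptr points to this process's segment. */
MPI_Win  shmwin;
double  *baseptr;
MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                        shmcomm, &baseptr, &shmwin);
baseptr[0] = 0.0;        /* ordinary store into my own segment */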

Allocating shared memory windows (continued)
- the shared region can now be accessed with load/store, with all the usual caveats about race conditions when accessing shared memory from different processes
- the shared region can also be accessed using RMA operations (particularly for synchronization)
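Continuing the sketch above, one possible way to order the loads and stores is a fence epoch; the pointer arithmetic used to reach a neighbor's segment relies on the default contiguous, rank-ordered allocation with equal segment sizes (an assumption of this sketch), and MPI_Win_shared_query, described next, is the general alternative.

/* Sketch (continues the allocation above): store my value, synchronize,
   then read the next rank's value with an ordinary load. */
int shmrank, shmsize;
MPI_Comm_rank(shmcomm, &shmrank);
MPI_Comm_size(shmcomm, &shmsize);

MPI_Win_fence(0, shmwin);                       /* open an access epoch */
baseptr[0] = (double) shmrank;                  /* store into my own segment */
MPI_Win_fence(0, shmwin);                       /* complete and expose all stores */

/* Default allocation is contiguous and ordered by rank, and every process
   contributed exactly one double here, so segments are adjacent doubles. */
double *seg0 = baseptr - shmrank;               /* rank 0's segment */
double *next = seg0 + (shmrank + 1) % shmsize;  /* next rank's segment */
printf("rank %d sees %.0f from its neighbor\n", shmrank, next[0]);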

How it works
- the process with rank 0 in comm allocates the entire shared memory region for all processes
- the other processes attach to this shared memory region
- the entire memory region may reside in a single locality domain, which may not be desirable
- therefore, using noncontiguous allocation may be advantageous (set the alloc_shared_noncontig info key to true)
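A hedged sketch of requesting noncontiguous allocation through the info key named on this slide; shmcomm, baseptr, and shmwin are assumed from the earlier allocation sketch.

/* Sketch: request noncontiguous (e.g., per-process, page-aligned) allocation
   by setting the alloc_shared_noncontig info key before allocating. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "alloc_shared_noncontig", "true");

MPI_Win_allocate_shared(sizeof(double), sizeof(double), info,
                        shmcomm, &baseptr, &shmwin);
MPI_Info_free(&info);

/* With noncontiguous allocation, neighbors' addresses must not be computed
   by pointer arithmetic; use MPI_Win_shared_query (next slide) instead. */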

Address of shared memory contributed by another process:

int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size,
                         int *disp_unit, void *baseptr);

- baseptr returns the address (in the local address space) of the beginning of the shared memory segment contributed by another process, the target rank
- the size of the segment and the displacement unit are also returned
- if rank is MPI_PROC_NULL, the address of the beginning of the first memory segment is returned
- this function is useful if processes contribute segments of different sizes (so addresses cannot be computed locally), or if noncontiguous allocation is used
- in many programs, knowing the owner of each segment may not be necessary
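A hedged sketch of querying a neighbor's segment; shmwin, shmcomm, shmrank, and shmsize are assumed from the shared-window sketches above, and the choice of neighbor is illustrative.

/* Sketch: obtain the local address of another rank's segment. */
int       nbr = (shmrank + 1) % shmsize;   /* illustrative target rank */
MPI_Aint  nbr_size;
int       nbr_disp_unit;
double   *nbr_ptr;
MPI_Win_shared_query(shmwin, nbr, &nbr_size, &nbr_disp_unit, &nbr_ptr);

/* After suitable synchronization, the neighbor's data can be read directly. */
double first = nbr_ptr[0];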

Extension to MPI+MPI

Function for determining which ranks are on the same compute node:

MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmcomm);

Function for mapping group ranks to global ranks: MPI_Group_translate_ranks
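A hedged, self-contained sketch combining both calls: split MPI_COMM_WORLD by node, then translate the node-local ranks back to global ranks (the printing and variable names are illustrative).

/* Sketch: per-node communicator plus node-local-to-global rank mapping. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of the processes that can share memory (same node). */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int shmrank, shmsize;
    MPI_Comm_rank(shmcomm, &shmrank);
    MPI_Comm_size(shmcomm, &shmsize);

    /* Translate every node-local rank to its rank in MPI_COMM_WORLD. */
    MPI_Group world_group, shm_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Comm_group(shmcomm, &shm_group);

    int *local  = malloc(shmsize * sizeof(int));
    int *global = malloc(shmsize * sizeof(int));
    for (int i = 0; i < shmsize; i++) local[i] = i;
    MPI_Group_translate_ranks(shm_group, shmsize, local, world_group, global);

    if (shmrank == 0)
        for (int i = 0; i < shmsize; i++)
            printf("node-local rank %d = global rank %d\n", i, global[i]);

    free(local); free(global);
    MPI_Group_free(&shm_group);
    MPI_Group_free(&world_group);
    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}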