COMP528: Multi-core and Multi-Processor Computing

COMP528: Multi-core and Multi-Processor Computing
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k.bane@liverpool.ac.uk
https://cgi.csc.liv.ac.uk/~mkbane/comp528

So far
- Why and what: HPC / multi-core / multi-processing [#1 - #3]
- Use of HPC facility: batch, timing variations
- Theory: Amdahl's Law, Gustafson; deadlock, livelock
- Message Passing Interface (MPI): distributed memory; many nodes, scaling up nodes & memory; wrappers for compiling and launching
- OpenMP: shared (globally addressable) memory, single node

So far... but not only CPU
- GPU [#18 - #24]
- CUDA: <<<blocks, threads>>> & writing kernels
- Directives: OpenACC [#23], OpenMP 4.0+ [#24]
- OpenCL [#24]

Still to come
- Vectorisation, including some optimisation
- Hybrid programming: how to use OpenMP from MPI
- Libraries
- Black box codes: using what we have learned to understand how these can help
- Profiling & Optimisation

Still to come
- a summary lecture: what's key to remember, opportunity to ask Qs
- what might be interesting but would need another course
- and did somebody say cloud?

HYBRID

Today's High End Architectures
- processors: many cores, each with a vector unit; maybe specialised units, e.g. TPU, Tensor cores (etc) for Machine Learning
- nodes: one or more processors; zero or more GPUs; potentially the likes of Xeon Phi, FPGA, custom ASIC, ...

Today's High End Architectures
- i.e. an eclectic mix, needing appropriate programming for max performance:
  - MPI for inter-node
  - MPI or OpenMP for intra-node
  - CUDA / OpenCL / OpenACC / OpenMP for accelerators
- BUT heterogeneous arch ==> heterogeneous use of languages

MPI across nodes, OpenMP on a node? Or MPI per processor & OpenMP across cores?
Already done (assignment #3): OpenMP for CPU + CUDA for GPU
- a single thread calls the CUDA kernel for the GPU to run
- (calling a CUDA kernel in a parallel region would launch many instances of the kernel, each requesting <<<blocks, threads>>>)

MPI + OMP: Simple Case
- MPI code => runs a copy on each process; put one process per node
- when we need to accelerate (e.g. a for loop), use OpenMP: the master OpenMP thread is the MPI process, and the other cores run the slave OpenMP threads
- (inter-process) comms is only via MPI
- why may we wish to use OMP rather than MPI? e.g. for dynamic load balancing via schedule(dynamic)
- consider each OpenMP team independent of (and without any knowledge of) other OpenMP teams

[Diagrams: an OpenMP program (no MPI); MPI with 1 process launching OpenMP parallel regions; MPI with 4 processes, each launching OpenMP parallel regions]

MPI with 4 processes, each launching OpenMP parallel regions: REDUCTION TO ROOT
- there is no reason we have to have the same size OpenMP team on each MPI process
- data exchange between MPI processes is via MPI comms: pt-to-pt, collectives
- easiest if OUTSIDE of the OpenMP regions

Example / DEMO
~/MPI/hybrid/ex1.c
- how to compile hybrid? run to illustrate
~/MPI/hybrid/ex2.c
- v. simple example of summation over MPI*OMP
- MPI_Scatter, #pragma omp parallel for reduction, MPI_Reduce
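The course files ex1.c / ex2.c are not reproduced here, but the summation demo described above follows a standard pattern. A minimal sketch, assuming a root-initialised array whose size divides evenly among the processes (names and sizes are illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000              /* illustrative problem size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;    /* assumes N divides evenly by nprocs */
    double *data = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {           /* root initialises the full array */
        data = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) data[i] = 1.0;
    }

    /* distribute one chunk to each MPI process */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each process sums its chunk with its own OpenMP team */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < chunk; i++)
        local_sum += local[i];

    /* combine the per-process partial sums on the root */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", global_sum);

    free(local);
    if (rank == 0) free(data);
    MPI_Finalize();
    return 0;
}

Compiling "hybrid" typically just means combining the MPI compiler wrapper with the OpenMP flag, e.g. mpicc -fopenmp ex2.c (or the equivalent flag for the compiler in use), then launching with mpirun while setting OMP_NUM_THREADS for the per-process team size.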

Other Options: HANDLE WITH CARE
- a single OMP thread (e.g. in a master or single region) sends info via MPI
- generally okay: it will be to another master thread, and is pretty much like sending outside the OMP region

Other Options: HANDLE WITH CARE
- one or more OMP threads in an OMP parallel region doing MPI comms (or at the same time as them)
- threaded MPI requires MPI_Init_thread rather than MPI_Init:
  MPI_Init_thread(&argc, &argv, required, &provided)
- requires provided support (implementation dependent) of one of: MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE
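A minimal sketch of the threaded initialisation; requesting MPI_THREAD_FUNNELED here is an illustrative choice, matching the "only the master thread calls MPI" model of the simple case:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided;

    /* FUNNELED: only the thread that called MPI_Init_thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* the library may provide less than requested, so check before relying on it */
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library only provides thread level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... hybrid MPI + OpenMP work ... */

    MPI_Finalize();
    return 0;
}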

Performance, Batch etc
- how many cores to request in a batch job? different batch systems would require a request for:
  - 4 processors * 7 cores (MPI: per processor, OMP: per core), or
  - 24 cores (& then worry re placement)
- Chadwick: 24 cores, place MPI per node via mpirun (SHOW)
- is it an efficient use of resources? depends if it runs faster, but there is dead time (cf Amdahl)
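A common sanity check for placement is a tiny hybrid "hello world" that reports which host each MPI rank and OpenMP thread ended up on. The following is an illustrative sketch, not a course file:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &namelen);

    /* every thread reports its MPI rank, OpenMP thread id and host name */
    #pragma omp parallel
    {
        printf("host %s: MPI rank %d, OMP thread %d of %d\n",
               host, rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Running it with, say, OMP_NUM_THREADS=7 mpirun -np 4 ./a.out (exact flags depend on the MPI and batch system in use) shows whether the 4 x 7 layout actually placed one process per node.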

Can think of some tricks:

#pragma omp parallel
{
    if (omp_get_thread_num() == 0) {
        MPI_Send( )   // or other MPI, e.g. MPI_Recv on a different MPI process
    } else {
        // do some OMP work on remaining threads
    }
}

what can we NOT do here?
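Filled out, the trick might look like the sketch below (the exchanged value, counts and ranks are illustrative, and it assumes the library provides at least MPI_THREAD_FUNNELED). It also hints at the closing question: worksharing constructs such as #pragma omp for, and barriers, cannot appear inside either branch, because not all threads of the team reach them.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* needs at least MPI_THREAD_FUNNELED (check 'provided' as on the earlier slide) */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double halo = 0.0;          /* illustrative value exchanged between ranks 0 and 1 */
    double work[1000] = {0.0};

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* only the master thread touches MPI */
            if (rank == 0)
                MPI_Send(&halo, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&halo, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* remaining threads overlap computation with the communication;
               no "#pragma omp for" or barrier here - not all threads reach this branch */
            int tid = omp_get_thread_num(), nth = omp_get_num_threads();
            for (int i = tid - 1; i < 1000; i += nth - 1)
                work[i] += 1.0;
        }
    }

    if (rank == 0) printf("work[0] = %f\n", work[0]);
    MPI_Finalize();
    return 0;
}

Run with at least two MPI processes (e.g. mpirun -np 2) so the send has a matching receive.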

Further Reading
https://www.intertwineproject.eu/sites/default/files/images/intertwine_best_practice_guide_mpi%2bopenmp_1.2.pdf (Archer?)