
Introduction to Parallel Programming
Martin Čuma, Center for High Performance Computing, University of Utah
m.cuma@utah.edu

Overview
- Types of parallel computers.
- Parallel programming options.
- How to write parallel applications.
- How to compile.
- How to debug/profile.
- Summary, future expansion.
- Please give us feedback: https://www.surveymonkey.com/r/khvdc5h

Parallel architectures
- Single processor: SISD, single instruction single data.
- Multiple processors:
  - SIMD: single instruction multiple data.
  - MIMD: multiple instruction multiple data.
    - Shared memory
    - Distributed memory
- Current processors combine SIMD and MIMD:
  - Multi-core CPUs with SIMD instructions (AVX, SSE).
  - GPUs with many cores and SIMT.

Shared memory
- All processors have access to local memory.
- Simpler programming.
- Concurrent memory access.
- More specialized hardware.
- CHPC: Linux clusters with 12, 16, 20, 24, 28 core nodes; GPU nodes.
[Figure: dual quad-core node and many-core node (e.g. SGI), with CPUs sharing memory over a bus]

Distributed memory
- Each process has access only to its local memory.
- Data between processes must be communicated.
- More complex programming.
- Cheap commodity hardware.
- CHPC: Linux clusters.
[Figure: nodes connected by a network; example 8 node cluster (64 cores)]

Parallel programming options
Shared memory:
- Threads: POSIX Pthreads, OpenMP (CPU, MIC), OpenACC, CUDA (GPU).
  - A thread has its own execution sequence but shares the memory space of the original process.
- Message passing processes.
  - A process is an entity that executes a program and has its own memory space and execution sequence.
Distributed memory:
- Message passing libraries:
  - Vendor specific: non-portable.
  - General: MPI, PVM, language extensions (Co-array Fortran, UPC, ...).

OpenMP basics
- Compiler directives to parallelize:
  - Fortran: source code comments, !$omp parallel / !$omp end parallel
  - C/C++: #pragmas, #pragma omp parallel
- Small set of subroutines.
- Degree of parallelism specification: OMP_NUM_THREADS environment variable or omp_set_num_threads(integer n).
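As a minimal C illustration of these pieces (not from the original slides; a sketch assuming compilation with an OpenMP switch such as -fopenmp or -qopenmp):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      /* set the degree of parallelism from code; OMP_NUM_THREADS would also work */
      omp_set_num_threads(4);

      #pragma omp parallel
      {
          /* each thread executes this block */
          printf("Hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }
      return 0;
  }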

MPI Basics
- Communication library.
- Language bindings:
  - C/C++: int MPI_Init(int *argc, char ***argv)
  - Fortran: MPI_INIT(ierr), with ierr an INTEGER
- Quite complex (100+ subroutines), but only a small number is used frequently.
- User-defined parallel distribution.
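A minimal C skeleton around the calls above (a sketch for illustration only):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                 /* start MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (rank) */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

      printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();                         /* shut down MPI */
      return 0;
  }

Build it with an MPI compiler wrapper such as mpicc (see the compilation slides below).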

MPI vs. OpenMP
MPI:
- Complex to code.
- Slow data communication.
- Ported to many architectures.
- Many tune-up options for parallel execution.
OpenMP:
- Easy to code.
- Fast data exchange.
- Memory access (thread safety).
- Limited usability.
- Limited user's influence on parallel execution.

Program example
- saxpy vector addition: z = a*x + y
- Simple loop, no cross-iteration dependence, easy to parallelize.

      subroutine saxpy_serial(z, a, x, y, n)
      integer i, n
      real z(n), a, x(n), y(n)

      do i=1, n
        z(i) = a*x(i) + y(i)
      enddo
      return
      end
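For C programmers, an equivalent serial version might look like this (a sketch; the slides show only the Fortran variant):

  /* serial saxpy: z = a*x + y for arrays of length n */
  void saxpy_serial(float *z, float a, const float *x, const float *y, int n)
  {
      for (int i = 0; i < n; i++)
          z[i] = a * x[i] + y[i];
  }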

OpenMP program example

      subroutine saxpy_parallel_omp(z, a, x, y, n)
      integer i, n
      real z(n), a, x(n), y(n)

!$omp parallel do
      do i=1, n
        z(i) = a*x(i) + y(i)
      enddo
      return
      end

Set the number of threads before running, e.g.:
  setenv OMP_NUM_THREADS 16
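The corresponding C sketch differs from the serial version by a single pragma (again an illustration, not part of the slides; compile with the OpenMP switch):

  /* OpenMP saxpy: the pragma splits the loop iterations across threads */
  void saxpy_parallel_omp(float *z, float a, const float *x, const float *y, int n)
  {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          z[i] = a * x[i] + y[i];
  }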

MPI program example

      subroutine saxpy_parallel_mpi(z, a, x, y, n)
      include 'mpif.h'
      integer i, n, ierr, my_rank, nodes, i_st, i_end
      real z(n), a, x(n), y(n)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nodes, ierr)
      i_st = n/nodes*my_rank + 1
      i_end = n/nodes*(my_rank+1)

      do i=i_st, i_end
        z(i) = a*x(i) + y(i)
      enddo

      call MPI_Finalize(ierr)
      return
      end

Decomposition of the z(i) operation on 4 processes (tasks):
  rank 0: z(1 ... n/4)
  rank 1: z(n/4+1 ... 2*n/4)
  rank 2: z(2*n/4+1 ... 3*n/4)
  rank 3: z(3*n/4+1 ... n)

MPI program example - result on the first CPU

      include "mpif.h"
      integer status(MPI_STATUS_SIZE)

      if (my_rank .eq. 0) then
        do j = 1, nodes-1
          do i = n/nodes*j+1, n/nodes*(j+1)
!           receive one value: data z(i), count 1, type, sender j, tag, communicator
            call MPI_Recv(z(i), 1, MPI_REAL, j, 0, MPI_COMM_WORLD, &
                          status, ierr)
          enddo
        enddo
      else
        do i = i_st, i_end
!         send one value: data z(i), count 1, type, recipient 0, tag, communicator
          call MPI_Send(z(i), 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, ierr)
        enddo
      endif

[Figure: ranks P1, P2, P3 send their pieces of z to P0]

MPI program example - collective communication

      real zi(n)
      j = 1
      do i = i_st, i_end
        zi(j) = a*x(i) + y(i)
        j = j + 1
      enddo

!     gather the local send data zi into the receive data z on the root process (rank 0)
      call MPI_Gather(zi, n/nodes, MPI_REAL, z, n/nodes, MPI_REAL, &
                      0, MPI_COMM_WORLD, ierr)

!     result on all nodes - no root process
      call MPI_Allgather(zi, n/nodes, MPI_REAL, z, n/nodes, &
                         MPI_REAL, MPI_COMM_WORLD, ierr)

[Figure: each process 0-3 computes its local piece zi, which is gathered into z]
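Putting the decomposition and the gather together, a C sketch of the same idea might read as follows (illustration only; it assumes n is divisible by the number of processes and that x and y are available on every rank):

  #include <stdlib.h>
  #include <mpi.h>

  /* each rank computes its slice of z = a*x + y; rank 0 gathers the full z */
  void saxpy_parallel_mpi(float *z, float a, const float *x, const float *y, int n)
  {
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int chunk = n / nprocs;          /* local slice size */
      int start = chunk * rank;        /* first global index of this rank's slice */

      float *zi = malloc(chunk * sizeof(float));
      for (int i = 0; i < chunk; i++)
          zi[i] = a * x[start + i] + y[start + i];

      /* collect the local pieces into z on rank 0; MPI_Allgather would place the
         result on every rank instead */
      MPI_Gather(zi, chunk, MPI_FLOAT, z, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
      free(zi);
  }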

Clusters - login
- First log into one of the clusters:
  ssh lonepeak.chpc.utah.edu   (Ethernet)
  ssh ember.chpc.utah.edu      (Ethernet, InfiniBand)
  ssh kingspeak.chpc.utah.edu  (Ethernet, InfiniBand)
- May debug and do short test runs (< 15 min, <= 4 processes/threads) on interactive nodes.
- Then submit a job to get compute nodes:
  srun -N 2 -n 24 -p ember -A chpc -t 1:00:00 --pty=/bin/tcsh -l
  sbatch script.slr
- Useful scheduler commands:
  sbatch  - submit a job
  scancel - delete a job
  squeue  - show job queue
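The batch script script.slr is not shown in the slides; a hypothetical minimal version, reusing the partition and account from the srun line above and the module/mpirun commands from the "Running a parallel job" slide, could look like:

  #!/bin/tcsh
  #SBATCH -N 2
  #SBATCH -n 24
  #SBATCH -p ember
  #SBATCH -A chpc
  #SBATCH -t 1:00:00

  # load the same modules that were used to build the program
  module load intel mpich

  # run the MPI executable on all allocated tasks
  mpirun -np $SLURM_NTASKS ./program.exe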

Security Policies
- No clear-text passwords: use ssh and scp.
- You may not share your account under any circumstances.
- Don't leave your terminal unattended while logged into your account.
- Do not introduce classified or sensitive work onto CHPC systems.
- Use a good password and protect it.

Security Policies
- Do not try to break passwords, tamper with files, etc.
- Do not distribute or copy privileged data or software.
- Report suspicions to CHPC (security@chpc.utah.edu).
- Please see http://www.chpc.utah.edu/docs/policies/security.html for more details.

Compilation - OpenMP
- Different switches for different compilers: -qopenmp (Intel), -fopenmp (GNU), or -mp (PGI).
  module load intel
  module load pgi
  module load gcc
- e.g. pgf77 -mp source.f -o program.exe
- Nodes with up to 28 cores each.
- Further references:
  - Compiler man pages: man ifort
  - Compiler websites:
    http://www.intel.com/software/products/compilers
    http://gcc.gnu.org
    http://www.pgroup.com/doc/
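For example, the equivalent OpenMP compile lines for the three compiler families would be (illustrative; the switches are the compilers' documented OpenMP flags):

  ifort    -qopenmp source.f -o program.exe    (Intel)
  gfortran -fopenmp source.f -o program.exe    (GNU)
  pgf77    -mp      source.f -o program.exe    (PGI)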

Compilation - MPI
- Two common network interfaces: Ethernet, InfiniBand.
- Different MPI implementations:
  - MPICH: Ethernet, InfiniBand
  - OpenMPI: Ethernet, InfiniBand
  - MVAPICH2: InfiniBand
  - Intel MPI: commercial; Ethernet, InfiniBand

Compilation - MPI
- Clusters: MPICH, OpenMPI, MVAPICH2, Intel MPI.
  /MPI-path/bin/mpixx source.x -o program.exe
  where xx = cc, cxx, f77, f90 (icc, ifort for Intel MPI)
  and MPI-path = location of the distribution, set by module load:
  module load mpich      MPICH      (Ethernet, InfiniBand)
  module load openmpi    OpenMPI    (Ethernet, InfiniBand)
  module load mvapich2   MVAPICH2   (InfiniBand)
  module load impi       Intel MPI  (Ethernet, InfiniBand)
  After this, simply use mpixx.
- Ensure that the same module is loaded when running (using mpirun).

Running a parallel job
Clusters:
- MPICH interactive batch:
  srun -N 2 -n 24 -p ember -A chpc -t 1:00:00 --pty=/bin/tcsh -l
  ... wait for prompt ...
  module load intel mpich
  mpirun -np $SLURM_NTASKS program.exe
- MPICH batch:
  sbatch -N 2 -n 24 -p ember -A chpc -t 1:00:00 script.slr
- OpenMP batch:
  srun -N 1 -n 1 -p ember -A chpc -t 1:00:00 --pty=/bin/tcsh -l
  setenv OMP_NUM_THREADS 12
  program.exe

Compiling and running a parallel job - desktops
- Use MPICH or OpenMPI; MPICH is my preference.
  module load mpich
  mpixx source.x -o program.exe
  where xx = cc, cxx, f77, f90 (icc, ifort for Intel MPI)
- MPICH running:
  mpirun -np 4 ./program.exe
- OpenMP running:
  setenv OMP_NUM_THREADS 4
  ./program.exe
- See more details/combinations at https://www.chpc.utah.edu/documentation/software/mpilibraries.php

Single executable across desktops and clusters
- MPICH, MVAPICH2, and Intel MPI are cross-compatible, using the same ABI.
  - Can e.g. compile with MPICH on a desktop, and then run on the cluster using MVAPICH2 and InfiniBand.
- Intel and PGI compilers allow building a "unified binary" with optimizations for different CPU platforms.
  - But in reality it only works well with the Intel compilers.
- On a desktop:
  module load intel mpich
  mpicc -axCORE-AVX2,AVX,SSE4.2 program.c -o program.exe
  mpirun -np 4 ./program.exe
- On a cluster:
  srun -N 2 -n 24 ...
  module load intel mvapich2
  mpirun -np $SLURM_NTASKS ./program.exe
- https://www.chpc.utah.edu/documentation/software/singleexecutable.php

Debuggers
- Useful for finding bugs in programs.
- Several free:
  - gdb: GNU, text based, limited parallel support
  - ddd: graphical frontend for gdb
- Commercial, come with compilers:
  - pgdbg: PGI, graphical, parallel but not intuitive
  - pathdb, idb: PathScale, Intel, text based
- Specialized commercial:
  - TotalView: graphical, parallel; CHPC has a license
  - DDT: Distributed Debugging Tool
  - Intel Inspector XE: memory and threading error checker
- How to use: http://www.chpc.utah.edu/docs/manuals/software/par_devel.html
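As a simple serial example, one might compile with debug symbols and step through the code in gdb (a generic illustration, not a CHPC-specific recipe; the variable n is hypothetical):

  gcc -g -O0 program.c -o program.exe    # keep debug symbols, turn off optimization
  gdb ./program.exe
  (gdb) break main        # stop at the start of main
  (gdb) run               # run until the breakpoint
  (gdb) next              # execute the next source line
  (gdb) print n           # inspect a variable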

Debuggers - parallel
- Parallel debugging is more complex due to interaction between processes.
- DDT is the debugger of choice at CHPC.
  - Expensive, but academia gets a discount.
- How to run it:
  - compile with the -g flag
  - run the ddt command
  - fill in information about the executable, parallelism, ...
- Details: https://www.chpc.utah.edu/documentation/software/debugging.php
- Further information: https://www.allinea.com/products/ddt

Debuggers - parallel
[Figure: parallel debugger screenshot]

Profilers
- Measure performance of the code.
- Serial profiling:
  - discover inefficient programming
  - computer architecture slowdowns
  - compiler optimization evaluation
  - gprof, pgprof, pathopt2, Intel tools
- Parallel profiling:
  - target is inefficient communication
  - Intel Trace Collector and Analyzer, Advisor XE, VTune
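For instance, a basic serial profile with gprof could be collected like this (a generic sketch):

  gcc -pg program.c -o program.exe    # instrument the code for profiling
  ./program.exe                       # run; writes gmon.out in the working directory
  gprof ./program.exe gmon.out        # print the flat profile and call graph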

Profilers - parallel
[Figure: parallel profiler screenshot]

Libraries
- Serial:
  - BLAS, LAPACK: linear algebra routines
  - MKL, ACML: hardware vendor libraries
- Parallel:
  - ScaLAPACK, PETSc, NAG, FFTW
  - MKL: dense and sparse matrices
- http://www.chpc.utah.edu/docs/manuals/software/mat_l.html
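As an illustration, link lines for these libraries might look like the following (generic examples; -llapack/-lblas assume a system LAPACK/BLAS installation, and -mkl is the classic Intel compiler shortcut for MKL):

  gfortran program.f90 -llapack -lblas -o program.exe
  ifort    program.f90 -mkl -o program.exe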

Summary
- Shared vs. distributed memory.
- OpenMP:
  - limited to one cluster node
  - simple parallelization
- MPI:
  - clusters
  - must use communication
- http://www.chpc.utah.edu/docs/presentations/intro_par

References
- OpenMP:
  - http://www.openmp.org/
  - Chandra et al., Parallel Programming in OpenMP
  - Chapman, Jost, van der Pas, Using OpenMP
- MPI:
  - http://www-unix.mcs.anl.gov/mpi/
  - Pacheco, Parallel Programming with MPI
  - Gropp, Lusk, Skjellum, Using MPI 1, 2
- MPI and OpenMP:
  - Pacheco, An Introduction to Parallel Programming

Future Presentations
- Introduction to MPI
- Introduction to OpenMP
- Debugging with TotalView
- Profiling with TAU/Vampir
- Intermediate MPI and MPI-IO
- Mathematical Libraries at the CHPC
Feedback: https://www.surveymonkey.com/r/khvdc5h