Parallel Programming Overview

Size: px

Start display at page:

Download "Parallel Programming Overview"

Dina Whitehead
5 years ago
Views:

1 Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1

2 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator 2

3 Our Support Offerings 3

IT Center & JARA-CSD Support General & Effective Usage of HPC Systems at RWTH Service for Tier-2 system, Tier-3 system, and hosted clusters Account creation, login,

advice regarding the project-based access Collaboration with FZ Jülich within JARA Center for Simulation and Data Science (JARA-CSD) Cross-sectional group Parallel

4 IT Center & JARA-CSD Support General & Effective Usage of HPC Systems at RWTH Service for Tier-2 system, Tier-3 system, and hosted clusters Account creation, login, usage, batch system, installation of software, Performance analysis and optimization Extensive training (eg, PPCES & aixcelerate events) and documentation Guidance and advice regarding the project-based access Collaboration with FZ Jülich within JARA Center for Simulation and Data Science (JARA-CSD) Cross-sectional group Parallel Efficiency Performance and correctness analysis of parallel programs Development of performance and correctness tools MUST (correctness), Score-P (measurement), Scalasca (analysis) 4

5 3 rd -party Funded Projects wwwpop-coeeu Parallel application performance audit and plan Analysis measuring a range of performance metrics to assess quality of performance Further performance evaluation to identify root causes of issues found Proof-of-concept: experiments with customer codes to demonstrate actual benefits of proposed optimizations Structured performance engineering process: systematic bottleneck-centric performance analysis and optimization Job monitoring & analysis: automatic light-weight performance profile and bottleneck analysis for all applications running on the HPC system HPC wiki: central web offering, knowledge base 5

6 Programming for Clusters 6

7 Motivation n Clusters à HPC market is dominated by distributed memory multicomputers (clusters) à Many nodes with no direct access to other nodes memory Network 7

8 SPMD Identity n How to do useful work in parallel if source code is the same? à Each process receives a unique identifier à Multiple code paths based on the ID int my_id = get_my_id(); if (my_id == id_1) { // Code for process id_1 else if (my_id == id_2) { // Code for process id_2 else { // Code for other processes 8

9 SPMD Data Exchange n Serial program a[09] = a[100109]; n SPMD program à process 0: aa holds a[09] array aa[10]; if (my_id == 0) { recv(aa, 10); else if (my_id == 10) { send(aa, 0); process 10: aa holds a[100109] array aa[10]; if (my_id == 0) { recv(aa, 10); else if (my_id == 10) { send(aa, 0); 9

10 Hello, MPI! 1 2 n C #include <stdioh> #include <mpih> int main(int argc, char **argv) { int rank, nprocs; MPI_Init(&argc, &argv); 1 Header file inclusion makes available prototypes of all MPI functions 2 MPI library initialisation must be called before other MPI operations are called 3 MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); 3 MPI operations more on that later 4 printf( Hello, MPI! I am %d of %d\n, rank, nprocs); 4 Text output MPI programs also can print to the standard output 5 MPI_Finalize(); return 0; 5 MPI library clean-up no other MPI calls after this one allowed 10

11 Communicators n Each MPI communication happens within a context called communicator à Logical communication domains à A group of participating processes + context à Each MPI process has a unique numeric ID in the group rank

12 Query operations on communicators n How many processes are there in a given communicator? MPI_Comm_size (MPI_Comm comm, int *size) à Returns the total number of MPI processes when called on MPI_COMM_WORLD à Returns 1 when called on MPI_COMM_SELF n What is the rank of the calling process in a given communicator? MPI_Comm_rank (MPI_Comm comm, int *rank) à Returned rank will differ in each calling process given the same communicator à Ranks values are in [0, size-1] (always 0 for MPI_COMM_SELF) 12

13 Messages n MPI passes data around in the form of messages n Two components à Message content (user data) à Envelope Field Sender rank Receiver rank Tag Communicator Meaning Who sent the message To whom the message is addressed to Additional message identifier Communication context n MPI retains the logical order in which messages between any two ranks are sent (FIFO) à But the receiver have the option to peek further down the queue 13

14 Sending messages n Messages are sent using the MPI_Send family of operations MPI_Send (buf, count, datatype, dest, tag, comm) message content envelope Parameter Meaning buf Location of data in memory count Number of consecutive data elements to send datatype MPI data type handle dest Rank of the receiver tag Message tag comm Communicator handle n The MPI API is built on the idea that data structures are array-like à No fancy C++ objects supported 14

15 Example n Our earlier SPMD example written in MPI int aa[10]; MPI_Status status; if (rank == 0) { MPI_Recv(aa, 10, MPI_INT, 10, 0, MPI_COMM_WORLD, &status); else if (rank == 10) { MPI_Send(aa, 10, MPI_INT, 0, 0, MPI_COMM_WORLD); 10 MPI_Send 0 MPI_Recv 15

16 Develop with CLAIX n Two MPI implementations à Open MPI cluster$ module switch intelmpi openmpi à Intel MPI (default) n Universal environment variables à $MPICC C compiler wrapper à $MPICXX C++ compiler wrapper à $MPIF77 Fortran 77 compiler wrapper à $MPIFC Fortran 90+ compiler wrapper à $MPIEXEC MPI launcher à $FLAGS_MPI_BATCH Recommended launcher flags in batch mode 16

17 Hello, world! n 1 Type the code on slide 6 into a text file named helloc/f90 n 2 Compile: à C: à Fortran: mpicc -o helloexe helloc mpif90 -o helloexe hellof90 n 3 Run: à mpiexec -n 4 helloexe n You may also use the RWTH specific environment variables: à $MPICC -o helloexe helloc à $MPIEXEC -n 4 helloexe 17

18 Programming for Multi-Core Nodes 18

19 OpenMP Execution Model n OpenMP programs start with just one thread: The Master n Worker threads are spawned at Parallel Regions, together with the Master they form the Team of threads n In between Parallel Regions the Worker threads are put to sleep The OpenMP Runtime takes care of all thread management work Master Thread Slave Threads Slave Threads Worker Threads Serial Part Parallel Region Serial Part Parallel Region n Concept: Fork-Join n Allows for an incremental parallelization! 19

20 Fork-Join Execution Model #pragma omp parallel { Master Thread Fork of threads Slave Threads Slave Threads Worker Threads Join of threads Thread Team serial part parallel region serial part 20

21 Data Sharing Attributes (1/3) int a; #pragma omp parallel shared(a) { a same memory used for all threads serial part parallel region serial part 21

22 Data Sharing Attributes (2/3) int a,b; #pragma omp parallel shared(a) // { private(b) int c; uninitialized private copies of b and c for every thread a b b b b b c c c c serial part parallel region serial part 22

23 Data Sharing Attributes (3/3) int d=2; #pragma omp parallel firstprivate(d) { #pragma omp single {d=6; d=2 d=2 d=2 d=2 d=2 d=6 copy of the value of d serial part parallel region serial part 23

24 For Worksharing #pragma omp parallel #pragma omp for for (int i=0; i<100; i++){ distributes loop iterations across threads serial part parallel region serial part 24

to the shared variable a=0 a=0 a=0 a=0 a=0 300 925 1550 2175 4950 reduction

25 Reduction Operations int a=0; #pragma omp parallel #pragma omp for reduction(+:a) for (int i=0; i<100; i++) { a+=i; local copies for computation update is written to the shared variable a=0 a=0 a=0 a=0 a= reduction computes final result in the shared variable serial part parallel region serial part 25

26 Tasks #pragma omp parallel #pragma omp single while (work()){ 26 #pragma omp task { // implicit barrier here Taskqueue T1 T2 T3 T4 T5 T6 A task is some code together with a data environment Tasks can be executed by any thread in any order serial part parallel region serial part

27 OpenMP Programs on CLAIX n Compilation: add fopenmp flag to compiler (and linker) n From within a shell, global setting of the number of threads: export OMP_NUM_THREADS=4 /program n From within a shell, one-time setting of the number of threads: OMP_NUM_THREADS=4 /program 27

28 Programming for Accelerators 28

29 Offloading 29 MEMORY CPU host 3 PCI Bus n Weak memory model à Host + device memory = separate entities à No coherence between host + device àdata transfers needed n Host-directed execution model à Copy input data from CPU mem to device mem à Execute the device program MEMORY GPU device à Copy results from device mem to CPU mem 1 2 TIME We refer to discrete GPUs here processing flow (simplified) CPU GPU GPU

30 GPGPU programming n CUDA (Compute Unified Device Architecture) à C/C++ (NVIDIA): architecture + programming language, NVIDIA GPUs à Fortran (PGI): NVIDIA s CUDA for Fortran, NVIDIA GPUs n OpenCL à C (Khronos Group): open standard, portable, CPU/GPU/ n OpenACC à C/Fortran (PGI, Cray, CAPS, NVIDIA): Directive-based accelerator programming, industry standard published in Nov 2011 (NVIDIA GPUs) n OpenMP à C/C++, Fortran: Directive-based programming for hosts and accelerators, standard, portable, published in July 2013, implementations for Xeon Phis n 30

Overview OpenACC n Open industry standard Here, PGI s OpenACC compiler is used for examples Details à Portability may be compiler dependent n Introduced by CAPS, Cray, NVIDIA, PGI (Nov 2011) n Today

31 Overview OpenACC n Open industry standard Here, PGI s OpenACC compiler is used for examples Details à Portability may be compiler dependent n Introduced by CAPS, Cray, NVIDIA, PGI (Nov 2011) n Today s members and supporter organizations: 31 à Industry: AMD, Cray, NVIDIA, Total, à Universities/Labs: HZDR, EPCC, Virginia Tech, KAUST, n Support n Indiana Uni, ORNL, Supercomputing Center Wuxi, à C,C++ and Fortran à NVIDIA GPUs, (AMD GPUs), x86 CPU, OpenPOWER, Sunway, PEZY-SC Nov 11 Spec 10 Nov 16 TR deep copy attach/ detach Jun 13 Spec 20 Nov 17 Spec 26 Nov 15 Spec 25 Nov 18 Spec 27 Timeline Source: wwwopenaccorg

32 Overview CUDA = Compute Unified Device Architecture n CUDA C/C++ from NVIDIA CUDA Fortran available by PGI à Based on industry standard C/C++ (extensions & restrictions) à Driver API (low level), Runtime API (higher level) n Brief timeline à 2006: Introduction of CUDA, G80 GPU architecture à 2007: CUDA Toolkit 10 à 2008: GT200 GPU architecture à 2010: Fermi GPU architecture à 2012: Kepler K20 GPU architecture à 2015: Maxwell GPU architecture à 2016: Pascal GPU architecture à 2017: Volta GPU architecture 32

33 Example SAXPY CPU void saxpycpu(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i];! int main(int argc, const char* argv[]) { int n = 10240; float a = 20f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y for(int i=0; i<n; ++i){ x[i]=i; y[i]=50*i-10; // Invoke serial SAXPY kernel saxpycpu(n, a, x, y); SAXPY = Single-precision real Alpha X Plus Y! y = a x +! y 33 free(x); free(y); return 0;

34 Example SAXPY OpenACC void saxpyopenacc(int n, float a, float *x, float *y) { #pragma acc parallel loop for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; int main(int argc, const char* argv[]) { int n = 10240; float a = 20f; float *x = (float*) malloc(n * sizeof(float)); float *y = (float*) malloc(n * sizeof(float)); // Initialize x, y for(int i=0; i<n; ++i){ x[i]=i; y[i]=50*i-10; // Invoke serial SAXPY kernel saxpyopenacc(n, a, x, y); 34 free(x); free(y); return 0;

35 Example SAXPY CUDA global void saxpycuda(int n, float a, float *x, float *y) { int i = blockidxx * blockdimx + threadidxx; if (i < n){ y[i] = a*x[i] + y[i]; int main(int argc, char* argv[]) { int n = 10240; float a = 20f; float* h_x,*h_y; // Pointer to CPU memory h_x = (float*) malloc(n* sizeof(float)); h_y = (float*) malloc(n* sizeof(float)); // Initialize h_x, h_y for(int i=0; i<n; ++i){ h_x[i]=i; h_y[i]=50*i-10; float *d_x,*d_y; // Pointer to GPU memory cudamalloc(&d_x, n*sizeof(float)); cudamalloc(&d_y, n*sizeof(float)); cudamemcpy(d_x, h_x, n * sizeof(float), cudamemcpyhosttodevice); cudamemcpy(d_y, h_y, n * sizeof(float), cudamemcpyhosttodevice); // Invoke parallel SAXPY kernel dim3 threadsperblock(128); dim3 blockspergrid(n/threadsperblockx); saxpycuda<<<blockspergrid, threadsperblock>>>(n, 20, d_x, d_y); cudamemcpy(h_y, d_y, n * sizeof(float), cudamemcpydevicetohost); cudafree(d_x); cudafree(d_y); free(h_x); free(h_y); return 0; 35

36 Putting it all together 36

37 Hybrid programming n (Hierarchical) mixing of different programming paradigms CUDA / OpenMP GPGPU OpenMP Shared memory CUDA / OpenMP GPGPU OpenMP Shared memory MPI 37

Introduction to GPGPUs

Introduction to GPGPUs using CUDA Sandra Wienke, M.Sc. wienke@itc.rwth-aachen.de IT Center, RWTH Aachen University May 28th 2015 IT Center der RWTH Aachen University Links PPCES Workshop: http://www.itc.rwth-aachen.de/ppces