APIs for Parallel Programming


1 APIs for Parallel Programming. Wolfgang Welz, November 12, 2012. Part I: Programming for Shared Memory Systems (Threads and OpenMP)

2 What is Parallel Computing? "Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved in parallel." (Wikipedia) There are different levels of parallel computing: bit level (32/64-bit microprocessors), instruction level (pipelining, superscalar execution), and process and program level.

3 Classification of Parallel Computers. Flynn's taxonomy (1966) classifies computers by the number of concurrent instruction streams (single or multiple) and data streams (single or multiple), giving the four classes SISD, SIMD, MISD and MIMD. [Diagram: 2x2 grid of instruction pool versus data pool, with one or more processing units (PU) in each quadrant.]

4 Shared Memory Architecture. Pros: independent processors access globally shared memory; changes in a memory location are visible to all other processors; modern multicore systems are cache coherent; data sharing is fast and easy, which is user-friendly for the programmer. Cons: scaling the hardware is hard and expensive; access to global memory needs to be synchronized by the user. [Figure: several CPUs, each with its own cache, connected to a common memory.]

5 Distributed Memory Architecture. Processors have their own local memory and are connected by some communication network; data needs to be transferred explicitly. Pros: rapid access to own memory with no overhead for cache coherency; efficiently scalable. Cons: the programmer is responsible for communication; data structures need to be mapped to the existing network and memory topology. [Figure: several CPUs, each with its own memory, connected by a network.]

6 Shared Memory Programming Model. [Figure: processes P1 and P2 both access data items D1 and D2; the relation is symmetric.] All threads access the same shared memory; threads also have their own private data. Programmers are responsible for protecting globally shared data. We assume that all processors are identical (SMP).

7 POSIX Threads. The original Pthreads API was defined in 1995. It offers a standardized interface for UNIX threads. The Pthreads API can be grouped into four areas: 1. thread management, 2. mutexes, 3. condition variables, 4. read/write locks. Pthreads are implemented for C via pthread.h; native implementations for other languages exist that use exactly the same concepts.

8 Fork-Join Model. [Diagram: main() calls pthread_create(&id, ..., f, ...); main and the new thread running f() then execute in parallel; the thread ends with pthread_exit(...), main waits in pthread_join(id, ...) and finally calls pthread_exit(0).] pthread_create(thread, attr, start_routine, arg): create a new thread executing start_routine with arg as its argument. pthread_exit(status): terminate the thread and make status available to the corresponding join. pthread_join(thread, status): wait for the thread to finish; status is taken from pthread_exit.
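As a concrete illustration of the fork-join model, here is a minimal sketch in C (not from the slides; the worker function and its message are invented for the example). It creates one worker thread, passes it an argument, and joins it again:

    #include <pthread.h>
    #include <stdio.h>

    /* worker function: receives its argument via the void* parameter */
    void *f(void *arg) {
        int id = *(int *)arg;
        printf("worker %d running\n", id);
        pthread_exit(NULL);                      /* equivalent to returning from f */
    }

    int main(void) {
        pthread_t thread;
        int arg = 1;
        pthread_create(&thread, NULL, f, &arg);  /* fork: start the worker */
        /* ... main can do other work here, in parallel with f ... */
        pthread_join(thread, NULL);              /* join: wait for the worker */
        return 0;
    }

Compiled with, e.g., gcc example.c -pthread, this prints the worker's message exactly once before main returns.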

9 Mutex. [Diagram: two threads access a shared resource; each locks the mutex before the access and unlocks it afterwards.] The only safe way to concurrently access the same data without synchronization is when all threads only read. Enforcing mutual exclusion: pthread_mutex_init() initializes a mutex object; pthread_mutex_lock() acquires the lock on a mutex variable; pthread_mutex_trylock() is the non-blocking variant; pthread_mutex_unlock() unlocks a previously locked mutex. If a lock is requested on an already locked mutex, the calling thread is blocked until the mutex is unlocked.
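A minimal sketch of mutual exclusion with Pthreads (not part of the slides; the shared counter is an invented example): several threads increment a shared counter, and the mutex makes the read-modify-write atomic.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&counter_mutex);    /* enter critical section */
            counter++;                             /* protected update */
            pthread_mutex_unlock(&counter_mutex);  /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, increment, NULL);
        for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);        /* always 400000 with the mutex */
        return 0;
    }

Without the lock/unlock pair the increments of different threads would race and the final count would usually be too small.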

10 Condition Variables. Waiting for a condition set by a different thread with a busy-wait loop looks like this:

    while (!readyflag) {
        pthread_mutex_unlock(&mutex);
        sleep(10);
        pthread_mutex_lock(&mutex);
    }

Busy-waiting is bad; use condition variables instead: pthread_cond_init() initializes the condition variable; pthread_cond_wait() blocks until the condition is signaled; pthread_cond_signal() wakes the first waiting thread; pthread_cond_broadcast() wakes all waiting threads. Condition variables are used together with a mutex so that wait and signal happen inside the critical section. Spurious wakeups: a wakeup does not necessarily mean that the condition now holds, so re-check the condition yourself.

11 Example: MT-Queue

    std::queue<int> queue;
    pthread_mutex_t queuemutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t queuecondvar = PTHREAD_COND_INITIALIZER;

    void provide(int val) {
        pthread_mutex_lock(&queuemutex);
        queue.push(val);                     // make the new element available
        pthread_cond_signal(&queuecondvar);  // wake one waiting consumer
        pthread_mutex_unlock(&queuemutex);
    }

    int consume() {
        pthread_mutex_lock(&queuemutex);
        while (queue.empty())                // re-check: wakeups may be spurious
            pthread_cond_wait(&queuecondvar, &queuemutex);
        int val = queue.front();
        queue.pop();
        pthread_mutex_unlock(&queuemutex);
        return val;
    }

12 OpenMP. The Open specification for Multi-Processing is an API for writing parallel programs on SMP machines: a set of compiler directives and some library functions; a fork-join programming model; it simplifies writing multi-threaded programs in C, C++ and Fortran. OpenMP needs to be supported by the compiler and activated using a compiler switch (GCC: -fopenmp).

13 Using OpenMP. The starting point is a sequential program:

    int main() {
        cout << "Hello World" << endl;
        return 0;
    }

14 Using OpenMP. Adding a parallel region executes the statement once per thread:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        cout << "Hello World" << endl;
        return 0;
    }

15 Using OpenMP. Each thread queries its own id; without synchronization the output of the threads may interleave:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            cout << "Hello World from " << th_id << endl;
        }
        return 0;
    }

Output:
    Hello World from Hello World from 03
    Hello World from 2
    Hello World from 1

16 Using OpenMP. A critical section serializes the output statement:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
        }
        return 0;
    }

Output:
    Hello World from 0
    Hello World from 2
    Hello World from 3
    Hello World from 1

17 Using OpenMP. The master construct lets only the master thread report the number of threads, but it may run before the other threads have printed:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp master
            {
                int nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

Output:
    Hello World from 0
    There are 4 threads
    Hello World from 2
    Hello World from 3
    Hello World from 1

18 Using OpenMP. A barrier ensures that all threads have printed before the master reports the thread count:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp barrier
            #pragma omp master
            {
                int nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

Output:
    Hello World from 0
    Hello World from 2
    Hello World from 1
    Hello World from 3
    There are 4 threads

19 Using OpenMP. Variables declared outside the parallel region get explicit data-sharing clauses: th_id is private to each thread, nthreads is shared:

    #include <omp.h>
    int main() {
        int nthreads, th_id;
        #pragma omp parallel private(th_id) shared(nthreads)
        {
            th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp barrier
            #pragma omp master
            {
                nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

20 Sections. Independent code blocks can be distributed among the threads with the sections construct; TaskA, TaskB and TaskC then run in parallel:

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            TaskA();
            #pragma omp section
            TaskB();
            #pragma omp section
            TaskC();
        }
    }

21 Loop Parallelization

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 15; ++i) {
            int x = 42 + 2*i;
            a[i] = b[i] + x;
        }
    }

The omp for pragma specifies that the iterations will be distributed among the threads: the loop variable must be a signed integer; the comparison must be <, <=, > or >= against a loop-invariant integer; no jumps out of the loop are allowed. [Diagram: with three threads, the iterations are split into i=0..4, i=5..9 and i=10..14, each thread executing x=42+2*i and a[i]=b[i]+x for its chunk.] Avoid data dependencies in the loop: the program compiles but will fail.

22 Improving Performance 1. The first version contains a race on ave; the second uses a reduction clause:

    double ave = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < max; ++i)
        ave += A[i];
    ave = ave / max;

    double ave = 0.0;
    #pragma omp parallel for reduction(+:ave)
    for (int i = 0; i < max; ++i)
        ave += A[i];
    ave = ave / max;

Multiple values are combined into a single accumulation variable. This is called "reduction". Support for reduction operations is included in most parallel programming environments.

23 Improving Performance 2. The first version updates the shared array s in every iteration; the second accumulates into a private temporary and writes s only once:

    #pragma omp parallel shared(s)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < big; ++i)
            s[id] += foo(id, i);
    }

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int tmp = 0;
        for (int i = 0; i < big; ++i)
            tmp += foo(id, i);
        s[id] = tmp;
    }

If array elements happen to share the same cache line, this leads to false sharing: every update of a single element invalidates the entire cache line.

24 Comparison: APIs for shared memory programming on SMP systems. Threads: virtually all languages offer support for fork-join threads; easy parallelization of larger independent code segments; explicit synchronization is required. OpenMP: only for C, C++, Fortran and specific compilers; easy parallelization of segments and loops; explicit synchronization is often not needed.

25 APIs for Parallel Programming. Wolfgang Welz, November 19, 2012. Part II: Programming for GPUs and message passing systems (OpenCL and MPI)

26 Introduction. Last week: shared memory APIs. [Figure: processes P1 and P2 access data D1 and D2 symmetrically.] Today: graphics cards. GPUs are also based on a shared memory architecture, but graphics cards have their own memory, so explicit copy operations from the host are needed. There are fundamental differences between GPUs and (multi-core) CPUs.

27 Comparing CPU and GPU. [Figure: CPU die with a large control unit, a few ALUs and a large cache.] CPU: execute any sequential code as fast as possible; out-of-order execution, branch prediction and large caches; execute a small number of heavyweight threads in parallel. GPU: ?

28 The Graphics Card. Fixed-function pipeline: 3D data (vertices, colors, ...) is transformed, the projection is calculated, and the result is sent to the screen. Shaders: more advanced visual effects require a programmable pipeline!

29 Shader. Algorithm A_Vertex. Input: uniform variables, global attributes and vertex attributes. Result: clip coordinates and color of the vertex. Body: do fancy calculations. Vertex shader: foreach vertex v do: get the data of v; run A_Vertex; store the result.

30 Shader. Algorithm A_Pixel. Input: uniform variables, global attributes and interpolated vertex attributes. Result: color of the pixel. Body: do fancy color calculation. Pixel shader: foreach pixel p do: get the data of p by interpolation; run A_Pixel; store the color.

31 Shader for computations. The vertex and pixel shader loops have the same structure, so they can be generalized. Unified shader: foreach i in I do: get data I_i; execute algorithm A; save the result to O_i. Vector processor: SIMT (single instruction, multiple threads).

32 Available APIs. NVIDIA CUDA: works only on NVIDIA GPUs; designed for GPUs; language is C/C++. Khronos OpenCL: free standard (AMD, NVIDIA, Intel, IBM, ...); for all kinds of processors; language is C. Microsoft DirectCompute: only for DirectX 11; designed for GPUs; language is Direct3D HLSL.

33 The OpenCL Platform Model. [Figure: a host program (written in C/C++, Java, Python, C#, Fortran, ...) drives several OpenCL devices such as GPUs, CPUs and accelerators.] OpenCL platform: the host prepares and triggers device code execution: OpenCL code compilation at runtime; memory allocations and memory copies; OpenCL program launch. Devices execute OpenCL code written in OpenCL C. Contexts group devices and resources.
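To make the host's role concrete, here is a heavily abridged sketch of typical host-side setup using the standard OpenCL C API (not from the slides; error checking is omitted and the kernel source string is assumed to come from elsewhere):

    #include <CL/cl.h>

    /* pick a platform and device, build the program at runtime,
       and return a kernel object; 'src' holds the OpenCL C source */
    cl_kernel setup(const char *src, cl_context *ctx_out, cl_command_queue *q_out)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        *ctx_out = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        *q_out = clCreateCommandQueue(*ctx_out, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(*ctx_out, 1, &src, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);  /* compiled at runtime */
        return clCreateKernel(program, "BFSKernel", NULL);
    }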

34 The OpenCL Execution Model. Define an N-dimensional computation domain. Each independent element of execution in the N-D domain is called a work-item. The N-D domain defines the total number of work-items that execute in (quasi-)parallel: the global work size. Work-items can be grouped together into work-groups; work-items in a group can communicate with each other and can synchronize their execution. [Figure: a two-dimensional domain (work_dim = 2) of global size Sx by Sy split into work-groups; inside the kernel, the highlighted work-item sees get_global_id(0) = 3, get_global_id(1) = 1, get_local_id(0) = 1, get_local_id(1) = 1, get_group_id(0) = 1, get_group_id(1) = 0.]
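As a small illustration of work-item indexing (not from the slides), a one-dimensional vector-add kernel in OpenCL C might look like this; each work-item handles the element selected by its global id:

    // each work-item computes one element of c = a + b
    kernel void vec_add(global const float *a,
                        global const float *b,
                        global float *c)
    {
        const size_t i = get_global_id(0);  // position in the 1-D global range
        c[i] = a[i] + b[i];
    }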

35 The OpenCL Memory Model. Global memory: used for communicating data between host and device and for data exchange between kernels; its contents are visible to all threads; access has longer latency. In OpenCL a one-dimensional array in global memory is called a buffer object. A buffer belongs to a context and is dynamically allocated on a device; memory transfers are associated with a specific device. [Figure: the host with its host memory, and a context with global memory; each work-group has local memory and each work-item has private memory.]
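A hedged host-side sketch of how such a buffer might be created, filled and passed to a kernel (not from the slides; it reuses the context, queue and kernel from the setup sketch above, and the size N and host array h_data are invented):

    /* allocate a buffer of N uints in global memory within 'context' and
       copy the host array 'h_data' into it via 'queue' */
    cl_mem d_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  N * sizeof(cl_uint), NULL, NULL);
    clEnqueueWriteBuffer(queue, d_buf, CL_TRUE, 0,
                         N * sizeof(cl_uint), h_data, 0, NULL, NULL);

    /* the kernel then sees the buffer through a 'global' pointer argument */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buf);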

36 Parallelizing breadth-first search. Find a good representation for the NDRange: the elements correspond to the nodes, and the computational domain is the entire set of nodes. Algorithm BFSKernel(V, E, C, Dist): v <- get_global_id; if C[v] is gray then: C[v] <- black; foreach adjacent node w of v do: if C[w] is white then Dist[w] <- Dist[v] + 1 and C[w] <- gray.

37 OpenCL C. Differences between ISO C99 and OpenCL C: function qualifier kernel; address space qualifiers global and local; work-item functions; built-in functions for synchronization; built-in vectors with 2, 3, 4, 8 or 16 elements; built-in math functions; compilation at execution time; no recursion allowed; no built-in random number generator; dynamic allocation not allowed. A small kernel illustrating some of these features is sketched below.
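The following sketch (not from the slides) combines several of the listed features: the kernel and local qualifiers, a built-in vector type, a work-item function and a synchronization built-in. The staging buffer and the computation itself are invented for the example:

    // each work-item loads one float4 into local memory; the work-group
    // synchronizes on a barrier before the staged values are used
    kernel void demo(global const float4 *in,
                     global float4 *out,
                     local float4 *stage)           // local staging buffer
    {
        const size_t gid = get_global_id(0);        // work-item function
        const size_t lid = get_local_id(0);

        stage[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);               // built-in synchronization

        out[gid] = sqrt(fabs(stage[lid]));          // built-in math on a vector
    }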

38 The BFS kernel

    kernel void BFSKernel( global uint *V, global const uint *E,
                           global const uchar *C, global uchar *C_new,
                           global uint *D )
    {
        const uint v_id = get_global_id(0);
        if ( C[v_id] == GRAY ) {
            C_new[v_id] = BLACK;
            const uint end_edge = V[v_id + 1];
            for (uint out_edge = V[v_id]; out_edge != end_edge; out_edge++) {
                const uint w_id = E[out_edge];
                if ( C[w_id] == WHITE ) {
                    D[w_id] = D[v_id] + 1;
                    C_new[w_id] = GRAY;
                }
            }
        }
    }

39 Invoking the kernel

    clEnqueueWriteBuffer(queue, V_b, false, 0, sizeof(V_b), V_array);
    ...
    clFinish(queue);
    size_t global_work_size[] = { N };
    do {
        clEnqueueNDRangeKernel(queue, kernel, 1, 0, global_work_size, 0);
        clEnqueueCopyBuffer(queue, C_new_b, C_b, 0, 0, sizeof(C_b));
        clEnqueueReadBuffer(queue, C_b, true, 0, sizeof(C_b), C_array);
    } while (!hasGrayNode());
    clEnqueueReadBuffer(queue, D_b, true, 0, sizeof(D_b), D_array);

OpenCL events: each command also returns its event and takes an array of other events as a dependency (append 0, 0, 0 if not needed). When using more than one queue, it might be necessary to synchronize the execution of kernels with events!
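A hedged sketch of how such an event dependency could look with the full API signatures (not from the slides; the two-queue split and the buffer size c_size are invented for illustration):

    /* enqueue the kernel on queue1 and capture its completion event */
    cl_event kernel_done;
    clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, global_work_size, NULL,
                           0, NULL, &kernel_done);

    /* the read on queue2 only starts after the kernel has finished */
    clEnqueueReadBuffer(queue2, C_b, CL_TRUE, 0, c_size, C_array,
                        1, &kernel_done, NULL);
    clReleaseEvent(kernel_done);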

40 Performance of our OpenCL-BFS. [Plot: running time (in units of 0.1 s) against the number of nodes (millions) for the GPU and the CPU implementation. CPU: an Intel Core i-series processor; GPU: NVIDIA GeForce GTX 295.]

41 Performance Pitfalls. Data transfers between host and device are expensive. For a 500 MB vector of float (about 130M entries): transferring the data from the host takes 180 ms; computing ln(x) * arcsin(1 - x^2) on all 130M entries takes 18 ms; a copy on the same device takes 0.09 ms.

42 Performance Pitfalls. Global memory accesses should be coalesced! [Plot "Copy with Offset": time (ms) of a copy kernel as a function of the access offset.]

43 General Performance Optimization Strategies. From the NVIDIA OpenCL Best Practices Guide: focus first on finding ways to parallelize sequential code (crucial); minimize data transfer between the host and the device (very important); global memory accesses should be coalesced (very important); minimize the use of global memory and prefer shared memory access (important); avoid different execution paths within the same work-group (medium).

44 Multicomputer Programming: MPI. The Message Passing Interface (MPI) is a standardized message-passing system. MPI was designed for distributed memory architectures. MPI is a specification, not by itself a library. Message passing programming model: [Figure: process P1 holds data D1, process P2 holds data D2; data is exchanged by explicit copies, a directed relation.] All data needs to be distributed, which means more programming overhead.

45 General MPI Program Structure. [Diagram: every process runs the same program: #include "mpi.h"; main(); MPI_Init(); send/receive between the processes; MPI_Finalize(); exit().] Compile using the MPI compiler wrappers mpicc, mpif77 etc.
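A minimal sketch of this structure (not from the slides): each process initializes MPI, queries its rank and the communicator size, and shuts MPI down again.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                      /* start up MPI */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* my number in the group */
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes */
        printf("process %d of %d\n", rank, size);

        /* ... send/receive ... */

        MPI_Finalize();                              /* shut down MPI */
        return 0;
    }

Compiled with mpicc and started with, e.g., mpirun -np 4, every process prints its own rank.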

46 Communication in MPI. Groups of communicating processes in one MPI session are connected as communicator objects. The actual assignment of processes happens at runtime, e.g. through mpirun -np 128 myprog. Each process in a communicator is identified by a unique number, called its rank. The predefined communicator that contains all processes is called MPI_COMM_WORLD. [Figure: MPI_COMM_WORLD with numbered processes, split into smaller communicators.] Communicators should be partitioned to better match the topology of the problem; a sketch of such a partition follows below.
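One common way to partition a communicator is MPI_Comm_split; this sketch (not from the slides; the even/odd split is an invented example) puts processes with even and odd ranks into two separate communicators:

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same color end up in the same new communicator */
    int color = world_rank % 2;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);   /* rank within the new communicator */
    MPI_Comm_free(&sub_comm);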

47 Point-to-Point Communication. The basic point-to-point operations are send and receive. Blocking send: int MPI_Send(void* buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm), where dest is the rank of the target process, tag is an arbitrary integer to uniquely identify a message, and comm is the communicator of the process group. The send modes come in blocking and non-blocking variants: synchronous MPI_Ssend/MPI_Recv and MPI_Issend/MPI_Irecv; buffered MPI_Bsend/MPI_Recv and MPI_Ibsend/MPI_Irecv; standard MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv; ready mode MPI_Rsend/MPI_Recv and MPI_Irsend/MPI_Irecv.
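A hedged sketch of a simple exchange with the standard blocking calls (not from the slides; the message size and tag are invented): rank 0 sends an array of doubles to rank 1.

    double data[100];
    int rank;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send 100 doubles to process 1, message tag 42 */
        MPI_Send(data, 100, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matching receive: same count, type, tag and communicator */
        MPI_Recv(data, 100, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
    }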

48 Collective operations. broadcast: send the same data from one root process to all others; scatter/gather: distribute data from and collect data at the root process; allgather/alltoall: every process sends and receives data. [Diagram: alltoall with three processes: before, process 0 holds A0 A1 A2, process 1 holds B0 B1 B2, process 2 holds C0 C1 C2; afterwards process 0 holds A0 B0 C0, process 1 holds A1 B1 C1, process 2 holds A2 B2 C2.] Use collective operations: although MPI belongs in layer 5 of the OSI model, implementations may cover lower layers. A sketch using a broadcast and a gather is shown below.
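A hedged sketch of two of these operations (not from the slides; the buffer sizes are invented): the root broadcasts a parameter to everyone, and afterwards each process contributes one value that the root gathers.

    int param;                 /* set by the root before the broadcast */
    double my_result;          /* computed locally by every process */
    double all_results[128];   /* only meaningful on the root; assumes <= 128 processes */

    /* every process ends up with the root's value of 'param' */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... each process computes my_result from param ... */

    /* the root collects one double from every process, ordered by rank */
    MPI_Gather(&my_result, 1, MPI_DOUBLE, all_results, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);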
