PROGRAMMING IN C++: LABS
Michal Brabec
PARALLELISM CATEGORIES
- SIMD: vector instructions on the CPU (SSE)
- Multiprocessor: multiple CPU cores and threads
- SIMT: GPU
PARALLELISM IN C++
- Weak support in the language itself, but powerful libraries
- Many different parallelization libraries, compilers, tools, and environments
- Many approaches and strategies for achieving parallelism
- One of the best languages for developing parallel applications (along with Fortran and C)
- Many parallel tools depend strongly on the target platform!
- The language itself has built-in support for threads and nothing else!
SIMD: MMX & SSE & AVX
- Special instructions supported natively by Intel/AMD processors (MMX/SSE/AVX/...)
- Some compilers are able to generate them automatically
- Compilers support intrinsic functions that map to SIMD instructions (sketched below)
  - Requires explicit knowledge of these instructions
- Problem: alignment
  - Data for SIMD instructions are loaded automatically into registers
  - Data require alignment (SSE: 16 B)
- Problem: remaining iterations
  - Difficult loop management
- SIMD is not efficient for small or scattered data
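A minimal sketch of the intrinsic style, showing both problems named above (16 B alignment and the leftover iterations handled by a scalar tail); the function add_arrays is our own example, not from the slides:

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstddef>

    // Adds two float arrays four elements at a time using SSE.
    // a, b, and out must point to 16-byte-aligned data for _mm_load_ps/_mm_store_ps.
    void add_arrays(const float* a, const float* b, float* out, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {                    // vectorized part
            __m128 va = _mm_load_ps(a + i);             // load 4 floats into a register
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 additions at once
        }
        for (; i < n; ++i)                              // remaining iterations, scalar code
            out[i] = a[i] + b[i];
    }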
STD::THREAD
- Defined in the standard library
- Thread
  - Basic thread that executes a single function; the thread terminates when the function returns
  - Supports join (by only a single thread) and detach
- Mutex
  - Four types of locks used for synchronization
  - Simple functionality: lock, try_lock & unlock
- Atomic
  - Class template for synchronized access to the managed value
  - Defines a list of types that can be used without further synchronization
  - (both mutex and atomic are sketched after the example below)
STD::THREAD EXAMPLE

    #include <thread>
    #include <iostream>

    void f(int id)
    {
        for (int n = 0; n < 10000; ++n)
            std::cout << "Output from thread " << id << '\n';
    }

    int main()
    {
        std::thread t1(f, 1);            // launch a thread executing function f() with argument 1
        std::thread t2(f, 2), t3(f, 3);  // two more threads, also executing f()
        t1.join(); t2.join(); t3.join(); // wait for all three threads to finish before ending main()
    }
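A minimal sketch of the mutex and atomic facilities mentioned on the std::thread slide; the counters and their names are our own example:

    #include <thread>
    #include <mutex>
    #include <atomic>
    #include <iostream>

    std::mutex m;
    long guarded = 0;               // shared state protected by m
    std::atomic<long> counted{0};   // synchronized by the atomic type itself

    void work()
    {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // lock() in ctor, unlock() in dtor
            ++guarded;
        }
        for (int i = 0; i < 100000; ++i)
            ++counted;                            // no explicit lock needed
    }

    int main()
    {
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        std::cout << guarded << ' ' << counted << '\n';  // both print 200000
    }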
PTHREAD & WINDOWS THREADS
- POSIX and Windows threads, available on Windows & Linux
- Similar philosophy to standard threads (standard threads were modeled after POSIX threads)
- Less straightforward design; the API was designed for C
- Threads are supported and implemented by the respective OS
  - Standard threads use native threads underneath
- It is better to use standard threads, but native threads appear in many existing applications
NATIVE THREADS EXAMPLE

POSIX thread:

    #include <iostream>
    #include <pthread.h>

    void* print_message(void*)           // pthread entry points take and return void*
    {
        std::cout << "Threading\n";
        return nullptr;
    }

    int main()
    {
        pthread_t t1;
        pthread_create(&t1, NULL, &print_message, NULL);
        std::cout << "Hello";
        pthread_join(t1, NULL);          // wait for the thread before ending main()
        return 0;
    }

Windows thread:

    #include <windows.h>
    #include <iostream>

    DWORD WINAPI threadedfunction(LPVOID)   // Windows entry points use this signature
    {
        std::cout << "Hello World\n";
        return 0;
    }

    int main()
    {
        HANDLE h = CreateThread(NULL, 0, threadedfunction, NULL, 0, NULL);
        std::cout << "HelloWorld\n";
        WaitForSingleObject(h, INFINITE);   // wait for the thread before ending main()
    }
TBB
- C++ library
  - Namespace tbb
  - Templates
- Compatible with other threading libraries (pthreads, OpenMP, ...)
- Works with tasks, not threads
  - Tasks are processed by threads managed by the TBB runtime
  - Load balancing is managed by the runtime environment (see the sketch below)
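A minimal sketch of this task-based style using tbb::parallel_for; the runtime splits the iteration range into tasks and balances them across its worker threads (the vector and lambda are our own example):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>
    #include <vector>

    int main()
    {
        std::vector<float> v(1000, 1.0f);
        // The runtime recursively splits the range into tasks and
        // schedules the tasks on its internal thread pool.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, v.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    v[i] *= 2.0f;
            });
    }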
TBB CONT.
- A splittable object has the following constructor: X::X(X& x, Split)
  - Unlike a copy constructor, the first argument is not constant
- Divides the first argument into two parts
  - one part is stored back into the first argument
  - the other is stored in the newly constructed object
- Applies to both Range and Body (a Body sketch follows below)
  - splitting a range into two parts (first part into the argument, second part into the newly created object)
  - splitting a body into two instances executable in parallel
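A minimal sketch of a splittable Body with the X::X(X&, Split) constructor, as used by tbb::parallel_reduce; SumBody is our own example, and in current TBB the split type is spelled tbb::split:

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>

    struct SumBody {
        const float* data;
        float sum = 0;

        SumBody(const float* d) : data(d) {}
        // Splitting constructor: the runtime calls it to create a second
        // instance that can run in parallel with the original.
        SumBody(SumBody& other, tbb::split) : data(other.data) {}

        void operator()(const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                sum += data[i];
        }
        void join(const SumBody& rhs) { sum += rhs.sum; }  // merge partial results
    };

    // Usage: SumBody body(a);
    //        tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0, n), body);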
OPENMP
- Fork-join model
- Tailored mostly for large array operations
- Pragmas
  - #pragma omp
  - #pragma omp parallel for
  - only a few constructs
- Programs should run without OpenMP as well
  - possible but not enforced
  - #ifdef _OPENMP
OPENMP CONT.
- Requires support from the compiler
- A program should work without OpenMP, since the parallelism is introduced by #pragma declarations (see the sketch below)
- Iteration independence is the programmer's responsibility; the compiler trusts the pragma and does not verify it
- Fork-join model
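A minimal sketch of the #pragma style from the two slides above; without OpenMP the pragma is ignored and the loop runs serially (the array and loop are our own example):

    #include <cstdio>

    int main()
    {
        static float a[1000];
        // Iterations must be independent; the compiler trusts this pragma.
        #pragma omp parallel for
        for (int i = 0; i < 1000; ++i)
            a[i] = i * 2.0f;

    #ifdef _OPENMP
        std::printf("compiled with OpenMP support\n");
    #endif
        std::printf("a[999] = %f\n", a[999]);
    }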
MPI
- Message Passing Interface
- A library of functions
- The primary goals
  - Provide source-code portability: MPI programs should compile and run as-is on any platform
  - Allow efficient implementations across a range of architectures
- MPI also offers
  - A great deal of functionality, including a number of different types of communication, special routines for common collective operations, and the ability to handle user-defined data types and topologies
  - Support for heterogeneous parallel architectures
- (a minimal example follows)
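A minimal MPI sketch, the usual rank-and-size hello world; launched through a tool such as mpirun, each process prints its own rank:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);                 // set up the MPI runtime

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
        std::printf("Process %d of %d\n", rank, size);

        MPI_Finalize();                         // shut down the MPI runtime
    }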
GPU COMPUTING
- Very efficient for numerical problems (matrices, FFT, ...)
- Separate code for the host CPU and for the GPU (the kernel)
- CPU
  - Few cores per chip
  - General-purpose cores
  - Processing different threads
  - Huge caches to reduce memory latency
  - Locality-of-reference problem
- GPU
  - Many cores per chip
  - Cores specialized for numeric computations
  - SIMT thread processing
  - Huge number of threads and fast context switches
- Results in more complex memory transfers
OPENCL
- Universal framework for parallel computations
  - Specification created by the Khronos Group
  - Multiple implementations exist (AMD, NVIDIA, Apple, ...)
- API for different parallel architectures
  - Multi-core CPUs, many-core GPUs, IBM Cell cards, ...
  - Handles device detection, data transfers, and code execution
- Extended version of C99 for programming devices
  - Code is compiled at runtime for the selected device (see the kernel sketch below)
  - Theoretically, we may choose the best device for our application dynamically
  - However, we have to consider optimizations: code tuned for one device may run poorly on another
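A sketch of what that C99 dialect looks like: an OpenCL kernel held in the host program as a string and compiled at run time for the selected device (e.g. via clBuildProgram); the kernel name vec_add is our own example:

    // Kernel source kept as a C++ raw string; the OpenCL runtime
    // compiles it for whichever device was selected.
    const char* kernel_src = R"(
    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* out)
    {
        size_t i = get_global_id(0);   // one work-item per array element
        out[i] = a[i] + b[i];
    }
    )";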
CUDA
- OpenCL
  - Generic platform
  - By Khronos
  - Slower changes
  - Supported by various vendors and devices
  - Device code is compiled at runtime
  - Easier to write more portable applications
- CUDA
  - GPU-specific platform
  - By NVIDIA
  - Faster changes
  - Limited to NVIDIA hardware only
  - Host and device code are compiled together
  - Easier to tune for peak performance
INTEL XEON PHI
- The Xeon Phi device
  - Many simpler (Pentium-class) cores
  - Each equipped with a powerful 512-bit vector engine