PROGRAMMING IN C++ LAB
Michal Brabec

PARALLELISM CATEGORIES
- SSE (CPU)
- Multiprocessor
- SIMT (GPU)

PARALLELISM IN C++
- Weak support in the language itself, but powerful libraries
- Many different parallelization libraries, compilers, tools, and environments
- Many approaches and strategies for achieving parallelism
- One of the best languages for developing parallel applications (along with Fortran and C)
- Many parallel tools depend strongly on the target platform!
- Built-in support for threads and nothing else!

SIMD
- MMX & SSE & AVX: special instructions supported natively by Intel/AMD processors
- Some compilers are able to generate them automatically
- Compilers also provide intrinsic functions that map to SIMD instructions; using them requires explicit knowledge of the instructions
- Problem: alignment
  - Data for SIMD instructions are loaded automatically into registers
  - Data require alignment (SSE: 16 B)
- Problem: remaining iterations
  - Difficult loop management (see the sketch below)
- SIMD is not efficient for small or scattered data
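
To make the alignment and remainder issues concrete, here is a minimal sketch using SSE intrinsics; the function name add_sse is illustrative, and the arrays are assumed to be 16 B aligned (e.g. allocated with _mm_malloc):

    #include <cstddef>
    #include <xmmintrin.h>  // SSE intrinsics

    // Adds two float arrays four elements at a time with SSE.
    // _mm_load_ps/_mm_store_ps require 16 B aligned pointers.
    void add_sse(const float* a, const float* b, float* out, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_load_ps(a + i);            // aligned load of 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb)); // element-wise add + store
        }
        for (; i < n; ++i)        // the "remaining iterations" problem:
            out[i] = a[i] + b[i]; // leftover elements need a scalar tail loop
    }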

STD::THREAD
- Defined in the standard library
- Thread
  - Basic thread that executes a single function; the thread terminates when the function returns
  - Supports join (only by a single thread) and detach
- Mutex
  - Four types of locks used for synchronization (mutex, timed_mutex, recursive_mutex, recursive_timed_mutex)
  - Simple interface: lock, try_lock & unlock
- Atomic
  - Class template for synchronized access to a managed value
  - A set of types that can be used without explicit synchronization

STD::THREAD EXAMPLE

    #include <thread>
    #include <iostream>

    void f(int id)
    {
        for (int n = 0; n < 10000; ++n)
            std::cout << "Output from thread " << id << '\n';
    }

    int main()
    {
        std::thread t1(f, 1);             // launch a thread executing function f() with argument 1
        std::thread t2(f, 2), t3(f, 3);   // two more threads, also executing f()
        t1.join(); t2.join(); t3.join();  // wait for all three threads to finish before ending main()
    }
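
The mutex and atomic facilities from the previous slide are not used in the example above; the following minimal sketch (the worker function is illustrative) shows both:

    #include <atomic>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;                 // lock / try_lock / unlock, here used via lock_guard
    std::atomic<int> counter{0};  // synchronized access without an explicit lock

    void worker()
    {
        counter.fetch_add(1);              // atomic increment, no lock needed
        std::lock_guard<std::mutex> g(m);  // RAII wrapper: locks now, unlocks on scope exit
        std::cout << "counter = " << counter << '\n';
    }

    int main()
    {
        std::thread a(worker), b(worker);
        a.join(); b.join();
    }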

PTHREAD & WINDOWS THREADS
- POSIX and Windows threads, available on Windows & Linux
- Similar philosophy to standard threads (standard threads were modeled after POSIX threads)
- Less straightforward design; the API was designed for C
- Threads are supported and implemented by the respective OS
- Standard threads use native threads underneath
- It is better to use standard threads, but native threads are used in many applications

NATIVE THREADS EXAMPLE

PThread:

    #include <iostream>
    #include <pthread.h>
    using namespace std;

    void* print_message(void*)  // pthread start routines take and return void*
    {
        cout << "Threading\n";
        return nullptr;
    }

    int main()
    {
        pthread_t t1;
        pthread_create(&t1, NULL, &print_message, NULL);
        cout << "Hello";
        pthread_join(t1, NULL);  // wait for the thread before exiting
        return 0;
    }

Windows thread:

    #include <windows.h>
    #include <iostream>

    DWORD WINAPI threadedfunction(LPVOID)  // signature required by CreateThread
    {
        std::cout << "Hello World\n";
        return 0;
    }

    int main()
    {
        HANDLE h = CreateThread(NULL, 0, threadedfunction, NULL, 0, NULL);
        std::cout << "HelloWorld\n";
        WaitForSingleObject(h, INFINITE);  // wait for the thread to finish
        CloseHandle(h);
    }

TBB
- C++ library
  - Namespace tbb
  - Template-based
- Compatible with other threading libraries (pthreads, OpenMP, ...)
- Works with tasks, not threads (see the sketch below)
  - Tasks are processed by threads managed by the TBB runtime
  - Load balancing is handled by the runtime environment
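
A minimal sketch of the task-based model, assuming the classic TBB headers; the function scale is illustrative. parallel_for splits the range into tasks, and the runtime schedules them onto its worker threads:

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    // The runtime recursively splits the range into tasks and
    // load-balances them across its thread pool.
    void scale(float* data, std::size_t n, float factor)
    {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= factor;
            });
    }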

TBB CONT.
- A splittable object has the following constructor: X::X(X& x, Split)
  - Unlike a copy constructor, the first argument is not constant
  - Divides the first argument into two parts: one is stored back into the first argument, the other into the newly constructed object
- Applies to both Range and Body
  - Splitting a range into two parts (first part into the argument, second part into the newly created object)
  - Splitting a body into two instances executable in parallel (see the sketch below)
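
For a Body, the splitting constructor together with a join method is what tbb::parallel_reduce expects; a minimal sketch with the illustrative names SumBody and parallel_sum:

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>

    struct SumBody {
        const float* data;
        float sum = 0;

        SumBody(const float* d) : data(d) {}

        // Splitting constructor: the new instance starts with an empty
        // sum and runs in parallel with the original.
        SumBody(SumBody& other, tbb::split) : data(other.data) {}

        // Accumulate one sub-range.
        void operator()(const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                sum += data[i];
        }

        // Merge the partial result of a split-off instance.
        void join(const SumBody& other) { sum += other.sum; }
    };

    float parallel_sum(const float* data, std::size_t n)
    {
        SumBody body(data);
        tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0, n), body);
        return body.sum;
    }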

OPENMP
- Fork-join model
- Tailored mostly for large array operations
- Pragmas
  - #pragma omp
  - #pragma omp parallel for
  - Only a few constructs
- Programs should run without OpenMP
  - Possible but not enforced
  - #ifdef _OPENMP

OPENMP CONT.
- Requires support from the compiler
- The program should work without OpenMP, since the parallelism is introduced by #pragma declarations
- Iteration independence is not verified by the compiler; the programmer must guarantee it
- Fork-join model (see the sketch below)
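
A minimal sketch; compile with an OpenMP flag such as -fopenmp, otherwise the pragma is ignored and the loop runs sequentially, which is exactly the "works without OpenMP" property:

    #include <cstdio>

    int main()
    {
        const int N = 1000000;
        static double a[N], b[N];

        // Every iteration is independent; the pragma asserts this,
        // the compiler does not verify it.
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 2.0 * b[i];

    #ifdef _OPENMP
        std::printf("built with OpenMP support\n");  // _OPENMP is defined by OpenMP compilers
    #endif
        return 0;
    }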

MPI
- Message Passing Interface: a library of functions (see the sketch below)
- The primary goals
  - Provide source-code portability: MPI programs should compile and run as-is on any platform
  - Allow efficient implementations across a range of architectures
- MPI also offers
  - A great deal of functionality, including a number of different types of communication, special routines for common collective operations, and the ability to handle user-defined data types and topologies
  - Support for heterogeneous parallel architectures
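
A minimal sketch of the library-of-functions style; build with an MPI compiler wrapper such as mpicxx and launch with mpirun:

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);                // start the MPI runtime

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
        std::cout << "process " << rank << " of " << size << '\n';

        MPI_Finalize();                        // shut the runtime down
        return 0;
    }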

GPU COMPUTING
- Very efficient for numerical problems (matrices, FFT, ...)
- Separate code for the host CPU and for the GPU (kernel)
- CPU
  - Few cores per chip
  - General-purpose cores
  - Processing different threads
  - Huge caches to reduce memory latency
  - Locality-of-reference problem
- GPU
  - Many cores per chip
  - Cores specialized for numeric computations
  - SIMT thread processing
  - Huge number of threads and fast context switching
  - Results in more complex memory transfers

OPENCL
- Universal framework for parallel computations
  - Specification created by the Khronos group
  - Multiple implementations exist (AMD, NVIDIA, Apple, ...)
- API for different parallel architectures
  - Multi-core CPUs, many-core GPUs, IBM Cell cards, ...
  - Handles device detection, data transfers, and code execution
- Extended version of C99 for programming devices
  - Code is compiled at runtime for the selected device (see the sketch below)
  - Theoretically, we may choose the best device for our application dynamically
  - However, we have to consider device-specific optimizations
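
Because device code is compiled at run time (clCreateProgramWithSource + clBuildProgram), the host program typically carries it as a string; a minimal sketch with an illustrative vec_add kernel:

    // OpenCL C (extended C99) kernel kept as a C++ raw string literal;
    // the host hands it to clCreateProgramWithSource + clBuildProgram,
    // which compile it for whatever device was selected at run time.
    const char* kernel_source = R"CLC(
    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* out)
    {
        size_t i = get_global_id(0);  // one work-item per element
        out[i] = a[i] + b[i];
    }
    )CLC";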

CUDA

OpenCL:
- Generic platform, by Khronos
- Slower changes
- Supported by various vendors and devices
- Device code is compiled at runtime
- Easier for writing more portable applications

CUDA:
- GPU-specific platform, by NVIDIA
- Faster changes
- Limited to NVIDIA hardware only
- Host and device code compiled together
- Easier to tune for peak performance

INTEL XEON PHI
- The Xeon Phi device
  - Many simpler (Pentium) cores
  - Each equipped with a powerful 512-bit vector engine