APIs for Parallel Programming


1 APIs for Parallel Programming. Wolfgang Welz, November 12, 2012. Part I: Programming for Shared Memory Systems (Threads and OpenMP)

2 What is Parallel Computing? "Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved in parallel." (Wikipedia) There are different levels of parallel computing: bit level (32/64-bit microprocessors), instruction level (pipelining, superscalar execution), and process and program level.

3 Classification of Parallel Computers. Flynn's taxonomy (1966) classifies computers by the number of concurrent instruction streams (single or multiple) and data streams (single or multiple), giving the four classes SISD, SIMD, MISD and MIMD. [Diagram: 2x2 grid of instruction pool versus data pool, with one or more processing units (PU) in each quadrant.]

4 Shared Memory Architecture. Pros: independent processors access globally shared memory; changes in a memory location are visible to all other processors; modern multicore systems are cache coherent; data sharing is fast and easy, which is user-friendly for the programmer. Cons: scaling the hardware is hard and expensive; access to global memory needs to be synchronized by the user. [Figure: several CPUs, each with its own cache, connected to a common memory.]

5 Distributed Memory Architecture. Processors have their own local memory and are connected by some communication network; data needs to be transferred explicitly. Pros: rapid access to own memory with no overhead for cache coherency; efficiently scalable. Cons: the programmer is responsible for communication; data structures need to be mapped to the existing network and memory topology. [Figure: several CPUs, each with its own memory, connected by a network.]

6 Shared Memory Programming Model. [Figure: processes P1 and P2 both access data items D1 and D2; the relation is symmetric.] All threads access the same shared memory; threads also have their own private data. Programmers are responsible for protecting globally shared data. We assume that all processors are identical (SMP).

7 POSIX Threads. The original Pthreads API was defined in 1995. It offers a standardized interface for UNIX threads. The Pthreads API can be grouped into four areas: 1. thread management, 2. mutexes, 3. condition variables, 4. read/write locks. Pthreads are implemented for C via pthread.h; native implementations for other languages exist that use exactly the same concepts.

8 Fork-Join Model. [Diagram: main() calls pthread_create(&id, ..., f, ...); main and the new thread running f() then execute in parallel; the thread ends with pthread_exit(...), main waits in pthread_join(id, ...) and finally calls pthread_exit(0).] pthread_create(thread, attr, start_routine, arg): create a new thread executing start_routine with arg as its argument. pthread_exit(status): terminate the thread and make status available to the corresponding join. pthread_join(thread, status): wait for the thread to finish; status is taken from pthread_exit.
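As a concrete illustration of the fork-join model, here is a minimal sketch in C (not from the slides; the worker function and its message are invented for the example). It creates one worker thread, passes it an argument, and joins it again:

    #include <pthread.h>
    #include <stdio.h>

    /* worker function: receives its argument via the void* parameter */
    void *f(void *arg) {
        int id = *(int *)arg;
        printf("worker %d running\n", id);
        pthread_exit(NULL);                      /* equivalent to returning from f */
    }

    int main(void) {
        pthread_t thread;
        int arg = 1;
        pthread_create(&thread, NULL, f, &arg);  /* fork: start the worker */
        /* ... main can do other work here, in parallel with f ... */
        pthread_join(thread, NULL);              /* join: wait for the worker */
        return 0;
    }

Compiled with, e.g., gcc example.c -pthread, this prints the worker's message exactly once before main returns.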

9 Mutex. [Diagram: two threads access a shared resource; each locks the mutex before the access and unlocks it afterwards.] The only safe way to concurrently access the same data without synchronization is when all threads only read. Enforcing mutual exclusion: pthread_mutex_init() initializes a mutex object; pthread_mutex_lock() acquires the lock on a mutex variable; pthread_mutex_trylock() is the non-blocking variant; pthread_mutex_unlock() unlocks a previously locked mutex. If a lock is requested on an already locked mutex, the calling thread is blocked until the mutex is unlocked.
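A minimal sketch of mutual exclusion with Pthreads (not part of the slides; the shared counter is an invented example): several threads increment a shared counter, and the mutex makes the read-modify-write atomic.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&counter_mutex);    /* enter critical section */
            counter++;                             /* protected update */
            pthread_mutex_unlock(&counter_mutex);  /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, increment, NULL);
        for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);        /* always 400000 with the mutex */
        return 0;
    }

Without the lock/unlock pair the increments of different threads would race and the final count would usually be too small.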

10 Condition Variables. Waiting for a condition set by a different thread with a busy-wait loop looks like this:

    while (!readyflag) {
        pthread_mutex_unlock(&mutex);
        sleep(10);
        pthread_mutex_lock(&mutex);
    }

Busy-waiting is bad; use condition variables instead: pthread_cond_init() initializes the condition variable; pthread_cond_wait() blocks until the condition is signaled; pthread_cond_signal() wakes the first waiting thread; pthread_cond_broadcast() wakes all waiting threads. Condition variables are used together with a mutex so that wait and signal happen inside the critical section. Spurious wakeups: a wakeup does not necessarily mean that the condition now holds, so re-check the condition yourself.

11 Example: MT-Queue

    std::queue<int> queue;
    pthread_mutex_t queuemutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t queuecondvar = PTHREAD_COND_INITIALIZER;

    void provide(int val) {
        pthread_mutex_lock(&queuemutex);
        queue.push(val);                     // make the new element available
        pthread_cond_signal(&queuecondvar);  // wake one waiting consumer
        pthread_mutex_unlock(&queuemutex);
    }

    int consume() {
        pthread_mutex_lock(&queuemutex);
        while (queue.empty())                // re-check: wakeups may be spurious
            pthread_cond_wait(&queuecondvar, &queuemutex);
        int val = queue.front();
        queue.pop();
        pthread_mutex_unlock(&queuemutex);
        return val;
    }

12 OpenMP. The Open specification for Multi-Processing is an API for writing parallel programs on SMP machines: a set of compiler directives and some library functions; a fork-join programming model; it simplifies writing multi-threaded programs in C, C++ and Fortran. OpenMP needs to be supported by the compiler and activated using a compiler switch (GCC: -fopenmp).

13 Using OpenMP. The starting point is a sequential program:

    int main() {
        cout << "Hello World" << endl;
        return 0;
    }

14 Using OpenMP. Adding a parallel region executes the statement once per thread:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        cout << "Hello World" << endl;
        return 0;
    }

15 Using OpenMP. Each thread queries its own id; without synchronization the output of the threads may interleave:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            cout << "Hello World from " << th_id << endl;
        }
        return 0;
    }

Output:
    Hello World from Hello World from 03
    Hello World from 2
    Hello World from 1

16 Using OpenMP. A critical section serializes the output statement:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
        }
        return 0;
    }

Output:
    Hello World from 0
    Hello World from 2
    Hello World from 3
    Hello World from 1

17 Using OpenMP. The master construct lets only the master thread report the number of threads, but it may run before the other threads have printed:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp master
            {
                int nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

Output:
    Hello World from 0
    There are 4 threads
    Hello World from 2
    Hello World from 3
    Hello World from 1

18 Using OpenMP. A barrier ensures that all threads have printed before the master reports the thread count:

    #include <omp.h>
    int main() {
        #pragma omp parallel
        {
            int th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp barrier
            #pragma omp master
            {
                int nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

Output:
    Hello World from 0
    Hello World from 2
    Hello World from 1
    Hello World from 3
    There are 4 threads

19 Using OpenMP. Variables declared outside the parallel region get explicit data-sharing clauses: th_id is private to each thread, nthreads is shared:

    #include <omp.h>
    int main() {
        int nthreads, th_id;
        #pragma omp parallel private(th_id) shared(nthreads)
        {
            th_id = omp_get_thread_num();
            #pragma omp critical
            cout << "Hello World from " << th_id << endl;
            #pragma omp barrier
            #pragma omp master
            {
                nthreads = omp_get_num_threads();
                cout << "There are " << nthreads << " threads" << endl;
            }
        }
        return 0;
    }

20 Sections. Independent code blocks can be distributed among the threads with the sections construct; TaskA, TaskB and TaskC then run in parallel:

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            TaskA();
            #pragma omp section
            TaskB();
            #pragma omp section
            TaskC();
        }
    }

21 Loop Parallelization

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 15; ++i) {
            int x = 42 + 2*i;
            a[i] = b[i] + x;
        }
    }

The omp for pragma specifies that the iterations will be distributed among the threads: the loop variable must be a signed integer; the comparison must be <, <=, > or >= against a loop-invariant integer; no jumps out of the loop are allowed. [Diagram: with three threads, the iterations are split into i=0..4, i=5..9 and i=10..14, each thread executing x=42+2*i and a[i]=b[i]+x for its chunk.] Avoid data dependencies in the loop: the program compiles but will fail.

22 Improving Performance 1. The first version contains a race on ave; the second uses a reduction clause:

    double ave = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < max; ++i)
        ave += A[i];
    ave = ave / max;

    double ave = 0.0;
    #pragma omp parallel for reduction(+:ave)
    for (int i = 0; i < max; ++i)
        ave += A[i];
    ave = ave / max;

Multiple values are combined into a single accumulation variable. This is called "reduction". Support for reduction operations is included in most parallel programming environments.

23 Improving Performance 2. The first version updates the shared array s in every iteration; the second accumulates into a private temporary and writes s only once:

    #pragma omp parallel shared(s)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < big; ++i)
            s[id] += foo(id, i);
    }

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int tmp = 0;
        for (int i = 0; i < big; ++i)
            tmp += foo(id, i);
        s[id] = tmp;
    }

If array elements happen to share the same cache line, this leads to false sharing: every update of a single element invalidates the entire cache line.

24 Comparison: APIs for shared memory programming on SMP systems. Threads: virtually all languages offer support for fork-join threads; easy parallelization of larger independent code segments; explicit synchronization is required. OpenMP: only for C, C++, Fortran and specific compilers; easy parallelization of segments and loops; explicit synchronization is often not needed.

25 APIs for Parallel Programming. Wolfgang Welz, November 19, 2012. Part II: Programming for GPUs and message passing systems (OpenCL and MPI)

26 Introduction. Last week: shared memory APIs. [Figure: processes P1 and P2 access data D1 and D2 symmetrically.] Today: graphics cards. GPUs are also based on a shared memory architecture, but graphics cards have their own memory, so explicit copy operations from the host are needed. There are fundamental differences between GPUs and (multi-core) CPUs.

27 Comparing CPU and GPU. [Figure: CPU die with a large control unit, a few ALUs and a large cache.] CPU: execute any sequential code as fast as possible; out-of-order execution, branch prediction and large caches; execute a small number of heavyweight threads in parallel. GPU: ?

28 The Graphics Card. Fixed-function pipeline: 3D data (vertices, colors, ...) is transformed, the projection is calculated, and the result is sent to the screen. Shaders: more advanced visual effects require a programmable pipeline!

29 Shader. Algorithm A_Vertex. Input: uniform variables, global attributes and vertex attributes. Result: clip coordinates and color of the vertex. Body: do fancy calculations. Vertex shader: foreach vertex v do: get the data of v; run A_Vertex; store the result.

30 Shader. Algorithm A_Pixel. Input: uniform variables, global attributes and interpolated vertex attributes. Result: color of the pixel. Body: do fancy color calculation. Pixel shader: foreach pixel p do: get the data of p by interpolation; run A_Pixel; store the color.

31 Shader for computations. The vertex and pixel shader loops have the same structure, so they can be generalized. Unified shader: foreach i in I do: get data I_i; execute algorithm A; save the result to O_i. Vector processor: SIMT (single instruction, multiple threads).

32 Available APIs. NVIDIA CUDA: works only on NVIDIA GPUs; designed for GPUs; language is C/C++. Khronos OpenCL: free standard (AMD, NVIDIA, Intel, IBM, ...); for all kinds of processors; language is C. Microsoft DirectCompute: only for DirectX 11; designed for GPUs; language is Direct3D HLSL.

33 The OpenCL Platform Model. [Figure: a host program (written in C/C++, Java, Python, C#, Fortran, ...) drives several OpenCL devices such as GPUs, CPUs and accelerators.] OpenCL platform: the host prepares and triggers device code execution: OpenCL code compilation at runtime; memory allocations and memory copies; OpenCL program launch. Devices execute OpenCL code written in OpenCL C. Contexts group devices and resources.
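To make the host's role concrete, here is a heavily abridged sketch of typical host-side setup using the standard OpenCL C API (not from the slides; error checking is omitted and the kernel source string is assumed to come from elsewhere):

    #include <CL/cl.h>

    /* pick a platform and device, build the program at runtime,
       and return a kernel object; 'src' holds the OpenCL C source */
    cl_kernel setup(const char *src, cl_context *ctx_out, cl_command_queue *q_out)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        *ctx_out = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        *q_out = clCreateCommandQueue(*ctx_out, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(*ctx_out, 1, &src, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);  /* compiled at runtime */
        return clCreateKernel(program, "BFSKernel", NULL);
    }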

34 The OpenCL Execution Model. Define an N-dimensional computation domain. Each independent element of execution in the N-D domain is called a work-item. The N-D domain defines the total number of work-items that execute in (quasi-)parallel: the global work size. Work-items can be grouped together into work-groups; work-items in a group can communicate with each other and can synchronize their execution. [Figure: a two-dimensional domain (work_dim = 2) of global size Sx by Sy split into work-groups; inside the kernel, the highlighted work-item sees get_global_id(0) = 3, get_global_id(1) = 1, get_local_id(0) = 1, get_local_id(1) = 1, get_group_id(0) = 1, get_group_id(1) = 0.]
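As a small illustration of work-item indexing (not from the slides), a one-dimensional vector-add kernel in OpenCL C might look like this; each work-item handles the element selected by its global id:

    // each work-item computes one element of c = a + b
    kernel void vec_add(global const float *a,
                        global const float *b,
                        global float *c)
    {
        const size_t i = get_global_id(0);  // position in the 1-D global range
        c[i] = a[i] + b[i];
    }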

35 The OpenCL Memory Model. Global memory: used for communicating data between host and device and for data exchange between kernels; its contents are visible to all threads; access has longer latency. In OpenCL a one-dimensional array in global memory is called a buffer object. A buffer belongs to a context and is dynamically allocated on a device; memory transfers are associated with a specific device. [Figure: the host with its host memory, and a context with global memory; each work-group has local memory and each work-item has private memory.]
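A hedged host-side sketch of how such a buffer might be created, filled and passed to a kernel (not from the slides; it reuses the context, queue and kernel from the setup sketch above, and the size N and host array h_data are invented):

    /* allocate a buffer of N uints in global memory within 'context' and
       copy the host array 'h_data' into it via 'queue' */
    cl_mem d_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  N * sizeof(cl_uint), NULL, NULL);
    clEnqueueWriteBuffer(queue, d_buf, CL_TRUE, 0,
                         N * sizeof(cl_uint), h_data, 0, NULL, NULL);

    /* the kernel then sees the buffer through a 'global' pointer argument */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buf);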

36 Parallelizing breadth-first search. Find a good representation for the NDRange: the elements correspond to the nodes, and the computational domain is the entire set of nodes. Algorithm BFSKernel(V, E, C, Dist): v <- get_global_id; if C[v] is gray then: C[v] <- black; foreach adjacent node w of v do: if C[w] is white then Dist[w] <- Dist[v] + 1 and C[w] <- gray.

37 OpenCL C. Differences between ISO C99 and OpenCL C: function qualifier kernel; address space qualifiers global and local; work-item functions; built-in functions for synchronization; built-in vectors with 2, 3, 4, 8 or 16 elements; built-in math functions; compilation at execution time; no recursion allowed; no built-in random number generator; dynamic allocation not allowed. A small kernel illustrating some of these features is sketched below.
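The following sketch (not from the slides) combines several of the listed features: the kernel and local qualifiers, a built-in vector type, a work-item function and a synchronization built-in. The staging buffer and the computation itself are invented for the example:

    // each work-item loads one float4 into local memory; the work-group
    // synchronizes on a barrier before the staged values are used
    kernel void demo(global const float4 *in,
                     global float4 *out,
                     local float4 *stage)           // local staging buffer
    {
        const size_t gid = get_global_id(0);        // work-item function
        const size_t lid = get_local_id(0);

        stage[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);               // built-in synchronization

        out[gid] = sqrt(fabs(stage[lid]));          // built-in math on a vector
    }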

38 The BFS kernel

    kernel void BFSKernel( global uint *V, global const uint *E,
                           global const uchar *C, global uchar *C_new,
                           global uint *D )
    {
        const uint v_id = get_global_id(0);
        if ( C[v_id] == GRAY ) {
            C_new[v_id] = BLACK;
            const uint end_edge = V[v_id + 1];
            for (uint out_edge = V[v_id]; out_edge != end_edge; out_edge++) {
                const uint w_id = E[out_edge];
                if ( C[w_id] == WHITE ) {
                    D[w_id] = D[v_id] + 1;
                    C_new[w_id] = GRAY;
                }
            }
        }
    }

39 Invoking the kernel

    clEnqueueWriteBuffer(queue, V_b, false, 0, sizeof(V_b), V_array);
    ...
    clFinish(queue);
    size_t global_work_size[] = { N };
    do {
        clEnqueueNDRangeKernel(queue, kernel, 1, 0, global_work_size, 0);
        clEnqueueCopyBuffer(queue, C_new_b, C_b, 0, 0, sizeof(C_b));
        clEnqueueReadBuffer(queue, C_b, true, 0, sizeof(C_b), C_array);
    } while (!hasGrayNode());
    clEnqueueReadBuffer(queue, D_b, true, 0, sizeof(D_b), D_array);

OpenCL events: each command also returns its event and takes an array of other events as a dependency (append 0, 0, 0 if not needed). When using more than one queue, it might be necessary to synchronize the execution of kernels with events!
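A hedged sketch of how such an event dependency could look with the full API signatures (not from the slides; the two-queue split and the buffer size c_size are invented for illustration):

    /* enqueue the kernel on queue1 and capture its completion event */
    cl_event kernel_done;
    clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, global_work_size, NULL,
                           0, NULL, &kernel_done);

    /* the read on queue2 only starts after the kernel has finished */
    clEnqueueReadBuffer(queue2, C_b, CL_TRUE, 0, c_size, C_array,
                        1, &kernel_done, NULL);
    clReleaseEvent(kernel_done);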

40 Performance of our OpenCL-BFS. [Plot: running time (in units of 0.1 s) against the number of nodes (millions) for the GPU and the CPU implementation. CPU: an Intel Core i-series processor; GPU: NVIDIA GeForce GTX 295.]

41 Performance Pitfalls. Data transfers between host and device are expensive. For a 500 MB vector of float (about 130M entries): transferring the data from the host takes 180 ms; computing ln(x) * arcsin(1 - x^2) on all 130M entries takes 18 ms; a copy on the same device takes 0.09 ms.

42 Performance Pitfalls. Global memory accesses should be coalesced! [Plot "Copy with Offset": time (ms) of a copy kernel as a function of the access offset.]

43 General Performance Optimization Strategies. From the NVIDIA OpenCL Best Practices Guide: focus first on finding ways to parallelize sequential code (crucial); minimize data transfer between the host and the device (very important); global memory accesses should be coalesced (very important); minimize the use of global memory and prefer shared memory access (important); avoid different execution paths within the same work-group (medium).

44 Multicomputer Programming: MPI. The Message Passing Interface (MPI) is a standardized message-passing system. MPI was designed for distributed memory architectures. MPI is a specification, not by itself a library. Message passing programming model: [Figure: process P1 holds data D1, process P2 holds data D2; data is exchanged by explicit copies, a directed relation.] All data needs to be distributed, which means more programming overhead.

45 General MPI Program Structure. [Diagram: every process runs the same program: #include "mpi.h"; main(); MPI_Init(); send/receive between the processes; MPI_Finalize(); exit().] Compile using the MPI compiler wrappers mpicc, mpif77 etc.
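A minimal sketch of this structure (not from the slides): each process initializes MPI, queries its rank and the communicator size, and shuts MPI down again.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                      /* start up MPI */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* my number in the group */
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes */
        printf("process %d of %d\n", rank, size);

        /* ... send/receive ... */

        MPI_Finalize();                              /* shut down MPI */
        return 0;
    }

Compiled with mpicc and started with, e.g., mpirun -np 4, every process prints its own rank.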

46 Communication in MPI. Groups of communicating processes in one MPI session are connected as communicator objects. The actual assignment of processes happens at runtime, e.g. through mpirun -np 128 myprog. Each process in a communicator is identified by a unique number, called its rank. The predefined communicator that contains all processes is called MPI_COMM_WORLD. [Figure: MPI_COMM_WORLD with numbered processes, split into smaller communicators.] Communicators should be partitioned to better match the topology of the problem; a sketch of such a partition follows below.
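One common way to partition a communicator is MPI_Comm_split; this sketch (not from the slides; the even/odd split is an invented example) puts processes with even and odd ranks into two separate communicators:

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same color end up in the same new communicator */
    int color = world_rank % 2;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);   /* rank within the new communicator */
    MPI_Comm_free(&sub_comm);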

47 Point-to-Point Communication. The basic point-to-point operations are send and receive. Blocking send: int MPI_Send(void* buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm), where dest is the rank of the target process, tag is an arbitrary integer to uniquely identify a message, and comm is the communicator of the process group. The send modes come in blocking and non-blocking variants: synchronous MPI_Ssend/MPI_Recv and MPI_Issend/MPI_Irecv; buffered MPI_Bsend/MPI_Recv and MPI_Ibsend/MPI_Irecv; standard MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv; ready mode MPI_Rsend/MPI_Recv and MPI_Irsend/MPI_Irecv.
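A hedged sketch of a simple exchange with the standard blocking calls (not from the slides; the message size and tag are invented): rank 0 sends an array of doubles to rank 1.

    double data[100];
    int rank;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send 100 doubles to process 1, message tag 42 */
        MPI_Send(data, 100, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matching receive: same count, type, tag and communicator */
        MPI_Recv(data, 100, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
    }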

48 Collective operations. broadcast: send the same data from one root process to all others; scatter/gather: distribute data from and collect data at the root process; allgather/alltoall: every process sends and receives data. [Diagram: alltoall with three processes: before, process 0 holds A0 A1 A2, process 1 holds B0 B1 B2, process 2 holds C0 C1 C2; afterwards process 0 holds A0 B0 C0, process 1 holds A1 B1 C1, process 2 holds A2 B2 C2.] Use collective operations: although MPI belongs in layer 5 of the OSI model, implementations may cover lower layers. A sketch using a broadcast and a gather is shown below.
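A hedged sketch of two of these operations (not from the slides; the buffer sizes are invented): the root broadcasts a parameter to everyone, and afterwards each process contributes one value that the root gathers.

    int param;                 /* set by the root before the broadcast */
    double my_result;          /* computed locally by every process */
    double all_results[128];   /* only meaningful on the root; assumes <= 128 processes */

    /* every process ends up with the root's value of 'param' */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... each process computes my_result from param ... */

    /* the root collects one double from every process, ordered by rank */
    MPI_Gather(&my_result, 1, MPI_DOUBLE, all_results, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);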
