A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs


Prakash Raghavendra, Akshay Kumar Behki, K Hariprasad, Madhav Mohan, Praveen Jain, Srivatsa S Bhat, VM Thejus, Vishnumurthy Prabhu

Department of Information Technology, National Institute of Technology Karnataka, Surathkal
srp@nitk.ac.in, aksbeks@gmail.com, harinitk2007@gmail.com, madhavmon@gmail.com, prgu jain@yahoo.com, bhatsrivatsa@gmail.com, thejusvm@gmail.com, prabhuvishnumurthy@gmail.com

Abstract. Today, the challenge is to exploit in software the parallelism offered by multi-core architectures. This can be done by re-writing the application, by exploiting the hardware capabilities, or by expecting the compiler and software runtime tools to do the job for us. With the advent of multi-core architectures ([1], [2]), this problem is becoming more and more relevant. Even today, there are not many run-time tools to analyze the behavioural patterns of performance-critical applications and re-compile them. So, techniques like OpenMP for shared-memory programs are still useful for exploiting the parallelism in the machine. This work studies whether loop parallelization (both with and without applying transformations) is a good way to run scientific programs efficiently on such multi-core architectures. We have found the results to be encouraging, and we strongly feel that this approach could give good results if implemented fully in a production compiler for multi-core architectures.

1 Introduction

Parallel processing requires the program logic to have no dependences between the successive iterations of a loop. To run a program in parallel, we can divide the work between multiple threads or processes executing in parallel; we can even go to the extent of running these parallel pieces of code simultaneously on different nodes of a high-speed network. However, the amount of parallelization possible depends on the program structure as well as on the hardware configuration.

OpenMP [3] is an Application Program Interface (API) specification for C/C++ and FORTRAN that may be used to explicitly direct multi-threaded, shared-memory parallelism. It is a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization. It has been implemented in most of the popular compilers, such as the GNU (gcc), Intel, IBM, HP, and Sun Microsystems compilers.

C.-H. Hsu et al. (Eds.): ICA3PP 2010, Part I, LNCS 6081, pp. 476-486, 2010.
© Springer-Verlag Berlin Heidelberg 2010

Most OpenMP parallelism is specified through compiler directives embedded in the C/C++ or FORTRAN source code. The pre-processor directive (starting with #) together with the OpenMP directive instructs the compiler, during pre-processing, to implement parallel execution of the code following the directive. There are various directives available, one of them being #pragma omp parallel for, whose implementation was our prime interest in this project. The #pragma directive is the method specified by the C standard for providing additional information to the compiler beyond what is conveyed in the language itself. The #pragma omp parallel for directive tells the compiler that all iterations of the for loop following the directive can be executed in parallel. In that case, an OpenMP compiler will generate code to spawn an appropriate number of threads based on the number of cores available. Consider the following example of a typical C/C++ program using OpenMP pragmas:

    #include <omp.h>

    int main () {
        int var1, var2, var3;

        /*** Serial code ***/

        /*** Beginning of parallel section: fork a team of threads ***/
        /*** Specify variable scoping ***/
        #pragma omp parallel private(var1, var2) shared(var3)
        {
            /*** Parallel section executed by all threads ***/
            /*** All threads join the master thread and disband ***/
        }

        /*** Resume serial code ***/
    }

The variables var1 and var2 are private to each of the threads spawned, and the variable var3 is shared among all the threads.

The intent of this work is to study some OpenMP programs and see whether they scale well on multi-core architectures. Further, we would also like to parallelize non-parallel loops by applying transformations (using known techniques such as unimodular and GCD transformations [5] [6] [7]) and see whether they too scale well on such architectures. We used some known OpenMP pragmas as case studies and implemented them in our compiler to study the performance. In Section 2, we describe the way we implemented these OpenMP pragmas. In Section 3, we discuss the unimodular transformations we used to parallelize non-parallel loops. In Section 4, we tabulate and explain all the results. Section 5 concludes the paper and suggests some directions for future work.
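As a concrete illustration of the #pragma omp parallel for directive discussed above, the following minimal sketch (our own example, not taken from the paper; the array size and variable names are arbitrary) lets the compiler split the iterations of an independent loop across the available cores. With gcc, such a program is built with the -fopenmp flag, and the thread count can be overridden through the OMP_NUM_THREADS environment variable.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        int i;

        for (i = 0; i < N; i++)          /* serial initialization */
            b[i] = (double) i;

        /* The iterations are independent, so the directive divides them
           among the threads of the team it creates. */
        #pragma omp parallel for private(i) shared(a, b)
        for (i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[N-1] = %f (max threads = %d)\n", a[N-1], omp_get_max_threads());
        return 0;
    }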

2 Implementation of OpenMP Pragmas in gcc

The parallel portions of the program are run on different threads, and for this we have to generate pthread library calls in C. We considered two approaches: first, using a runtime library, and second, using a wrapper which calls gcc. In the first approach, we planned to have our own library functions that would be called in the same way as the OpenMP ones. We felt that this would not have much impact on performance, and therefore we resolved to do the transformation at the source level and then call gcc on the resultant program for further code generation and optimization for the target platform. Hence we opted for the second approach, in which we developed a wrapper over gcc that takes a file containing OpenMP directives as input and automatically generates the code that implements the multithreading.

The wrapper has to search for the OpenMP directives in the input file (program) and replace them with the appropriate thread calls. This is easily achieved with a shell script, since we were working in a Linux environment. The script searches for #pragma omp, removes it from the code, and places the loop that follows the directive in a function to be called by a thread. The loop to be parallelized is divided into several segments, each of which is executed by one thread; the kernel implicitly assigns the individual threads to the available cores.

Work can be assigned to threads in two ways: static and dynamic assignment. In static assignment, the total number of iterations of the loop is divided equally among the threads. In dynamic assignment, loop iterations are handed to threads as and when they complete their previous tasks. The static implementation was observed to be faster than its dynamic counterpart, since there is no explicit book-keeping of the tasks given to each thread; on the other hand, the number of loop iterations must then be divisible by the number of threads we create, at compile time itself.

We used the following standard POSIX thread calls:

    pthread_create(&thread_id, NULL, function, &value);

This call creates a thread with thread ID thread_id and default attributes (specified by NULL as the second parameter), executes function() in that thread, and passes it the value value (usually a structure).

    pthread_join(thread_id, &exit_status);

This call waits for the thread with thread ID thread_id to complete its execution and collects its exit status (the return value, if any) in exit_status. It also helps in synchronizing more than one thread.
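The paper does not show the code that the wrapper script emits; the following sketch (our reconstruction, with hypothetical names such as loop_body and struct range) illustrates the static assignment described above, partitioning the iterations of the same kind of simple loop as in the previous sketch equally among a fixed number of POSIX threads using pthread_create and pthread_join. It is built with gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1800
    #define NTHREADS 4                 /* assumed; N must be divisible by it */

    static double a[N], b[N];

    struct range { int start, end; };  /* hypothetical per-thread argument */

    static void *loop_body(void *arg) {
        struct range *r = (struct range *) arg;
        for (int i = r->start; i < r->end; i++)   /* this thread's share */
            a[i] = 2.0 * b[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        struct range rng[NTHREADS];
        int chunk = N / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {      /* fork one thread per chunk */
            rng[t].start = t * chunk;
            rng[t].end   = (t + 1) * chunk;
            pthread_create(&tid[t], NULL, loop_body, &rng[t]);
        }
        for (int t = 0; t < NTHREADS; t++)        /* wait for all threads */
            pthread_join(tid[t], NULL);

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }

The equal-sized chunks correspond to the static assignment discussed above, which is also why the iteration count has to be divisible by the number of threads.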

When the OpenMP pragmas are not nested, the case is very simple: all we have to do is put the code following each directive in a separate function and call that function from a thread. When there are nested OpenMP directives, we have to create threads again inside the outer thread functions, resulting in nested thread functions.

As an added feature of the script, the number of threads to be created can be specified as a command-line argument, the default being two. The script is run as follows:

    Syntax: ./script.sh <input_file> [no_of_threads]

3 Unimodular Transformations

A unimodular transformation is a loop transformation defined by a unimodular matrix. To test the validity of a general unimodular transformation, we need to know the distance vectors of the loop. After a valid transformation, if a loop in the transformed program carries no dependence, it can execute in parallel. Many researchers have applied unimodular transformations for the parallelization of loops [11] [12]. Although such transformations have limitations on applicability due to the model on which they work, they can be applied very elegantly to loops and yield various forms of parallel-executable loops. These forms can be generalized as a set of non-parallel loops sandwiched between a set of outer parallel loops and a set of inner parallel loops. The beauty of this technique is that we can control how many such inner and outer loops we want, depending on the number of schedulable resources we have. If the input loop does not fit the model, or if it has more dependences than the model allows, we will not be able to obtain completely parallel loops. In that case, we have to run the loops with explicit communication, as explained in works like [12]: the loops are allowed to keep their dependences but are still run in parallel, with explicit synchronization inserted wherever necessary so that the loops execute correctly, honouring all data dependences. An extension of this technique is the one in [13], where the layout of arrays is optimized for such loops.

Program Model. Our model program [6] is the loop nest L = (L1, L2, ..., Lm):

    L1: do I1 = p1, q1
    L2:   do I2 = p2, q2
              ...
    Lm:         do Im = pm, qm
                  H(I1, I2, ..., Im)
                enddo
              enddo
            enddo

An iteration H(i) is executed before another iteration H(j) in L iff i < j. Take any m*m unimodular matrix U, and let LU denote the program consisting of the iterations of L such that an iteration H(i) is executed before another iteration H(j) in LU iff iU < jU. LU can be written as a nest of m loops with an index vector K = (K1, K2, ..., Km) defined by K = IU. The body of the new program is H(KU^-1), which we write as HU(K). The program LU is the transformed program defined by U, and the transformation L -> LU is the unimodular transformation of L defined by U. The idea is to transform a non-parallel loop nest into another basis in which the same loop becomes parallel (and can therefore be executed in parallel on the target machine).
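As a small worked example (ours, not from the paper, but it is the transformation implicitly behind the wavefront code in Section 4.2): consider a double loop with body a[i][j] = a[i-1][j] + a[i][j-1], whose distance vectors are d_1 = (1, 0) and d_2 = (0, 1). Choosing the skewing matrix

    U = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \qquad \det U = 1,

gives the new index vector K = I\,U = (i + j,\; j) and the transformed distance vectors

    d_1 U = (1, 0), \qquad d_2 U = (1, 1).

Both transformed vectors have a positive first component, so every dependence is carried by the outer K1 loop and the inner K2 loop carries no dependence; the inner loop can therefore run in parallel, which is exactly the wavefront form obtained in Section 4.2.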

4 Results

We studied two problems: first, OpenMP-ready programs with explicitly parallel loops, and second, loops that are not parallel as given but need some transformation (such as a unimodular transformation) to run in parallel. We show the results for the first case in Section 4.1 and for the second case in Section 4.2.

4.1 Study of OpenMP Parallel Programs

We took OpenMP programs such as matrix multiplication (array size 1800 x 1800) and LU factorization (array size 1800 x 1800). We tested these programs with different numbers of threads (2, 4, 6, 8, 16, etc.) on an IBM Power5 server and noted the performance. The IBM Power5 server we used (IBM A) has 4 physical processing units (8 cores); each processing unit has 2 cores. The maximum number of physical processing units available to us was 3.8, since the rest was used by the VIOS (Virtual I/O Server). The virtual machine on which the programs were tested was configured with two settings:

    Profile 1: Virtual Processing Units: 3 (6 cores); Physical Processing Units: 3 (6 cores)
    Profile 2: Virtual Processing Units: 8 (16 cores); Physical Processing Units: 3.8 (approx. 8 cores)

Matrix multiplication. The program snippet which performs matrix multiplication, implemented with threads, is as follows:

    #pragma omp parallel for private(i,j,k)
    for (i = 0; i < 1800; i++) {
        for (j = 0; j < 1800; j++) {
            c[i][j] = 0;
            for (k = 0; k < 1800; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

Table 1 shows the statistics for this program under Profile-1 and Profile-2; Graph-1 (Fig. 1) plots the Profile-1 results and Graph-2 (Fig. 2) the Profile-2 results.
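The tables and graphs below report "scale-up". The paper does not define the term explicitly, but it appears to be used in the usual sense of speedup, i.e. (our reading)

    S(N) = \frac{T_1}{T_N},

where T_1 is the single-thread execution time and T_N the time with N threads; the ideal value is N as long as N does not exceed the number of physical cores.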

Table 1. Scale-ups for the above program with Profile-1 and Profile-2 (columns: Threads, Time in s, Scale-up for each profile).

Fig. 1. Graph-1: scale-up of matrix multiplication under Profile-1.

From the observations, it can be concluded that the scale-up increases with the number of threads as long as the number of threads is less than or equal to the number of cores available. We can never reach the ideal speed-up of N, where N is the number of cores, since there is always some synchronization to be done, which slows down the execution. Comparing the maximum scale-ups obtained in the two cases (Profile-1: 3.96, Profile-2: 4.59), a considerably higher scale-up is not obtained even though the number of cores in the second case is twice that of the first. This is because in the second case 16 is the number of virtual cores, not physical cores (of which there are actually 8), so each core services two threads, which introduces overhead due to thread switching.

LU factorization. The code snippet which performs LU factorization, implemented with threads, is as follows:

    #pragma omp parallel for private(i,j,k)
    for (k = 0; k < 1800; k++) {
        for (i = k+1; i < 1800; i++) {
            a[i][k] = a[i][k] / a[k][k];
        }
        for (i = k+1; i < 1800; i++) {
            for (j = k+1; j < 1800; j++) {
                a[i][j] = a[i][j] - a[i][k] * a[k][j];
            }
        }
    }

Fig. 2. Graph-2: scale-up of matrix multiplication under Profile-2.

Table 2. Scale-ups for the above program with Profile-1 and Profile-2 (columns: Threads, Time in s, Scale-up for each profile).

Table 2 shows the statistics for this program under Profile-1 and Profile-2; Graph-3 (Fig. 3) plots the Profile-1 results and Graph-4 (Fig. 4) the Profile-2 results.

It is clearly observed in all of the graphs that when the number of threads is increased beyond a certain limit (the number of cores available), the scale-up remains almost constant.

But if the number of threads is increased to a much higher value, the scale-up decreases. This is because of the delay (overhead) caused by switching among the large number of threads.

Fig. 3. Graph-3: scale-up of LU factorization under Profile-1.

Fig. 4. Graph-4: scale-up of LU factorization under Profile-2.

As we know, the loop that follows an OpenMP directive is executed in parallel. We therefore studied the effect of parallelizing different loops of a nest. Consider the following code snippet:

    for (i = 0; i < 1800; i++) {
        for (j = 0; j < 1800; j++) {
            c[i][j] = 0;
            for (k = 0; k < 1800; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

The time taken for a single thread to complete the above loops was [...] seconds. When the outermost for loop (loop counter i) was parallelized with 4 threads, the time taken was 7972 seconds. When the second for loop (loop counter j) was parallelized with 4 threads, the time taken was [...] seconds. When the innermost for loop (loop counter k) was parallelized with 4 threads, the time taken was 5266 seconds.

The parallelization achieved by multithreading the outer for loop is called coarse-granular parallelization. It takes the least time because each thread executes a large chunk of the computation, so the overhead due to thread switching is minimized: each thread, once created, handles a significant part of the program. The parallelization achieved by multithreading the inner for loop is called fine-granular parallelization. It takes more time because each thread executes only a small chunk of the computation, and the overhead is larger due to the repeated creation and destruction of threads for very small tasks.

When the outermost and the second for loop were parallelized with two threads each, the time taken was 6347 seconds, which is slightly better than that obtained by parallelizing the outermost loop alone with four threads (7972 s). When the outermost and the innermost for loop were parallelized with two threads each, the time taken was 1975 seconds. When the second and the innermost for loop were parallelized with two threads each, the time taken was [...] seconds. These data further demonstrate the performance gain of coarse-granular parallelization over fine-granular parallelization.

There are a number of limitations to multithreading. There is an upper limit on the number of parallel threads we can create, since all these threads belong to a single process and each process has an upper limit on the memory it can use, depending on the hardware architecture. Each thread has its own program stack, so a large number of threads in a single process may lead to memory insufficiency because of the large number of stacks required. Moreover, the best performance is obtained when the number of threads equals the number of actual cores available on the system.
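The combined parallelizations measured above (for example, two threads on the outermost loop and two on the middle loop) are not shown as code in the paper; a minimal OpenMP sketch of that configuration, assuming nested parallelism is enabled at run time, could look like this:

    #include <omp.h>

    #define N 1800
    static double a[N][N], b[N][N], c[N][N];

    void matmul_nested(void) {
        omp_set_nested(1);                 /* allow nested parallel regions */

        /* two threads share the outermost (coarse-granular) loop ... */
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < N; i++) {
            /* ... and each of them forks two more threads for the middle loop */
            #pragma omp parallel for num_threads(2)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }

With both levels limited to two threads, at most four workers are active at a time, mirroring the two-plus-two experiments above; whether this beats a single four-thread outer loop depends on the nesting overhead, which is what the measurements compare.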

4.2 Study of Loop Transformations on Multi-core

Loops with independent iterations can be parallelized easily, unlike loops with dependences. But some loops with dependences can still be made (partly) independent by applying mathematical transformations, after which they can be parallelized [7]. The transformations are usually specific to the function that the body of the loop performs. Some specific examples are discussed below.

Inner-loop parallelization [6]. Consider the following double loop, which has dependences:

    for (i = 1; i < 1000; i++) {
        for (j = 1; j < 1000; j++) {
            for (l = 0; l < 10000; l++)   // this loop is just to increase the amount of computation
                a[i][j] = a[i-1][j] + a[i][j-1];
        }
    }

This code takes 8751 seconds to complete. It can be transformed into a nest whose inner loop is independent:

    for (k = 2; k < 1999; k++) {
        k_1 = 1 > (k-999) ? 1 : (k-999);
        k_2 = 999 > (k-1) ? (k-1) : 999;
        for (k_1 = k_1; k_1 <= k_2; k_1++) {
            for (l = 0; l < 10000; l++)   // this loop is just to increase the amount of computation
                a[k-k_1][k_1] = a[k-k_1-1][k_1] + a[k-k_1][k_1-1];
        }
    }

In this code, only the innermost loop is independent and hence can be run in parallel. When executed with two threads, the time taken is 5217 seconds, which is faster than the dependent version even though the parallelization is fine-granular. Here, the number of iterations executed by the second for loop is smallest for the first and last iterations of the outermost loop and highest in between, with a linear gradient. Hence, to obtain higher efficiency, the first and last few iterations of the outermost loop (depending on the amount of work being done) can be executed sequentially and the rest in parallel, or we can make use of dynamic threads. The results with dynamic threads are given in Table 3.

Table 3. Scale-ups for the dynamic-threads program (columns: Threads, Time in s).

Outer-loop parallelization [6]. Consider the following code, which has dependences:

    for (i = 6; i <= 500; i++) {
        for (j = i; j <= (2*i)+4; j++) {
            for (l = 0; l < 10000; l++)   // this loop is just to increase the amount of computation
                a[i][j] = a[i-2][j-3] + a[i][j-6];
        }
    }

This code takes 3296 seconds to complete. It can be transformed into code whose outer loops are independent:

    for (y_1 = 0; y_1 <= 1; y_1++) {          // parallelizable loop
        for (y_2 = 0; y_2 <= 1; y_2++) {      // parallelizable loop
            for (k_1 = ceil((6-y_1)/2.0); k_1 < floor((10-y_1)/2.0); k_1++) {
                for (k_2 = ceil(y_1+2*k_1-y_2); k_2 < floor((2*y_1+4*k_1-4-y_2)/3.0); k_2++) {
                    a[y_1+2*k_1][y_2+3*k_2] = a[y_1+2*k_1-2][y_2+3*k_2-3] + a[y_1+2*k_1][y_2+3*k_2-6];
                }
            }
        }
    }

When executed with two threads, the time taken is 2028 seconds, which is faster than the dependent version. Since this is a case of coarse-granular parallelization, increasing the number of threads gives better performance. With a maximum of 6 threads, the above program gives a scale-up of 3 (1335 s), which is below expectations because the transformed program executes more inner-loop iterations than the original.

5 Conclusion

In this study we did two things. First, we studied the performance of a few OpenMP-ready parallel programs on multi-core machines (up to 8 physical cores); for this, we implemented the OpenMP pragmas as a wrapper over the gcc compiler. Second, we extended the work to non-parallel loops and used transformations to make these loops run in parallel. We were delighted to see that both cases showed that running such loops (with or without transformations) can give significant performance benefits on multi-core architectures. The trend today is towards more and more cores, and in that context this work is quite relevant. In future, we would like to extend the work to more non-parallel loops, which may be run in parallel with some explicit synchronization on multi-core machines.

References

1. AMD Multi-core Products (2006)
2. Multi-core from Intel: Products and Platforms (2006)
3. OpenMP
4. Wolfe, M.J.: Techniques for Improving the Inherent Parallelism in Programs. Technical Report, Department of Computer Science, University of Illinois at Urbana-Champaign (July 1990)
5. Wolfe, M.: High Performance Compilers for Parallel Computing. Addison-Wesley, Reading
6. Banerjee, U.K.: Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, Norwell (1993)
7. Banerjee, U.K.: Loop Parallelization. Kluwer Academic Publishers, Norwell (1994)
8. Pthreads reference
9. D'Hollander, E.H.: Partitioning and Labelling of Loops by Unimodular Transformations. IEEE Transactions on Parallel and Distributed Systems 3(4) (1992)
10. Sass, R., Mutka, M.: Enabling Unimodular Transformations. In: Supercomputing 1994 (November 1994)
11. Banerjee, U.: Unimodular Transformations of Double Loops. In: Advances in Languages and Compilers for Parallel Processing (1991)
12. Prakash, S.R., Srikant, Y.N.: An Approach to Global Data Partitioning for Distributed Memory Machines. In: IPPS/SPDP (1999)
13. Prakash, S.R., Srikant, Y.N.: Communication Cost Estimation and Global Data Partitioning for Distributed Memory Machines. In: Fourth International Conference on High Performance Computing, Bangalore (1997)
