A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs
Prakash Raghavendra, Akshay Kumar Behki, K. Hariprasad, Madhav Mohan, Praveen Jain, Srivatsa S. Bhat, V.M. Thejus, Vishnumurthy Prabhu
Department of Information Technology, National Institute of Technology Karnataka, Surathkal
srp@nitk.ac.in, aksbeks@gmail.com, harinitk2007@gmail.com, madhavmon@gmail.com, prgu_jain@yahoo.com, bhatsrivatsa@gmail.com, thejusvm@gmail.com, prabhuvishnumurthy@gmail.com

Abstract. Today, the challenge is to exploit, in software, the parallelism made available by multi-core architectures. This can be done by re-writing the application, by exploiting the hardware capabilities, or by expecting the compiler and software runtime tools to do the job for us. With the advent of multi-core architectures ([1] [2]), this problem is becoming more and more relevant. Even today, there are not many run-time tools that analyze the behavioral patterns of performance-critical applications and re-compile them. So, techniques like OpenMP for shared-memory programs are still useful for exploiting the parallelism in the machine. This work studies whether loop parallelization (both with and without applying transformations) is a good approach for running scientific programs efficiently on such multi-core architectures. We found the results encouraging, and we believe this could lead to good results if implemented fully in a production compiler for multi-core architectures.

1 Introduction

Parallel processing requires program logic to have zero dependency between successive iterations of a loop. To run a program in parallel, we can divide the task between multiple threads or processes executing in parallel. We can even go to the extent of running these parallel pieces of code simultaneously on different nodes of a high-speed network. However, the amount of parallelization possible depends on the program structure as well as on the hardware configuration. OpenMP [3]
is an Application Program Interface (API) specification for C/C++ and FORTRAN that may be used to explicitly direct multi-threaded, shared-memory parallelism. It is a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms ranging from the desktop to the supercomputer. OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization. It has been implemented in most of the popular compilers, such as those from GNU (gcc), Intel, IBM, HP, and Sun Microsystems.

C.-H. Hsu et al. (Eds.): ICA3PP 2010, Part I, LNCS 6081, 2010. (c) Springer-Verlag Berlin Heidelberg 2010
Most OpenMP parallelism is specified through compiler directives embedded in C/C++ or FORTRAN source code. The pre-processor directive (starting with #), together with the OpenMP directive, instructs the compiler during pre-processing to implement parallel execution of the code that follows. Various directives are available; the one whose implementation was our prime interest in this project is #pragma omp parallel for. The #pragma directive is the method specified by the C standard for providing additional information to the compiler, beyond what is conveyed in the language itself. The #pragma omp parallel for directive tells the compiler that all iterations of the for loop that follows can be executed in parallel. In that case, the OpenMP compiler generates code to spawn an optimized number of threads based on the number of cores available. Consider the following example of a typical C/C++ program using OpenMP pragmas:

#include <omp.h>

main()
{
    int var1, var2, var3;

    /*** Serial code ***/

    /*** Beginning of parallel section: fork a team of threads ***/
    /*** and specify variable scoping ***/
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /*** Parallel section executed by all threads ***/
    }
    /*** All threads join the master thread and disband ***/

    /*** Resume serial code ***/
}

The variables var1 and var2 are private to each of the spawned threads, and the variable var3 is shared among all the threads. The intent of this work is to study some OpenMP programs and see whether they scale well on multi-core architectures. Further, we would also like to parallelize non-parallel loops by applying transformations (using known techniques like unimodular and GCD transformations [5] [6] [7]) and see whether they too scale well on such architectures. We used some known OpenMP pragmas as case studies and implemented them in our compiler to study the
performance. In Section 2, we describe the way we implemented these OpenMP pragmas. In Section 3, we discuss the unimodular transformations we used to parallelize non-parallel loops. In Section 4, we tabulate and explain all the results. Section 5 concludes the paper and suggests some directions for future work.
2 Implementation of OpenMP Pragmas in gcc

The parallel portions of the program can be run on different threads; for this we have to implement pthread library function calls in C. We considered two approaches: first, using a runtime library, and second, using a wrapper that calls gcc. In the first approach, we planned to have our own library functions, which would be called in the same way as OpenMP. We felt that this might not have much impact on performance, and therefore we resolved to do the job at the source level: we would call gcc on the resultant program for further code generation and optimization for the target platform. Hence we opted for the second approach, in which we developed a wrapper on gcc that takes the file containing OpenMP directives as input and automatically generates code that implements multithreading. The wrapper should be able to search for the OpenMP directives in the input program and replace them with appropriate thread calls. This could easily be achieved using a shell script, since we were working in a Linux environment. The script searches for #pragma omp and removes it from the code, and the loop that follows the directive is placed in a function to be called by a thread. The loop to be parallelized is divided into several segments, each executed by a thread; the kernel implicitly assigns the individual threads to the available cores.

Threads can be assigned work in two ways: static and dynamic assignment. In static assignment, the total number of iterations of the loop is divided equally among the threads. In dynamic assignment, loop iterations are handed to threads as and when the threads complete their previous tasks. The static implementation was observed to be faster than its dynamic counterpart, since there is no explicit need to manage the tasks given to each thread. But in this
case the number of iterations of the loop must be divisible, at compile time itself, by the number of threads we create. We used the following standard POSIX thread calls:

pthread_create(&thread_id, NULL, function, &value);

This call creates a thread with thread ID thread_id and default attributes (specified by NULL as the second parameter), executes function() in the thread, and passes value (usually a structure) to the function.

pthread_join(thread_id, &exit_status);

This call waits for the thread with thread ID thread_id to complete its execution and collects its exit status (the return value, if any) in exit_status. It also helps in synchronizing more than one thread. When the OpenMP pragmas are non-nested, the case is very simple: all we have to do is put the code following the directives into separate functions and call those functions from threads. Whereas when there are nested
OpenMP directives, we have to create threads again inside the outer thread functions, resulting in nested thread functions. As an added feature, the number of threads to be used can be specified as a command-line argument to the script, the default being two. We run the script as follows:

    ./script.sh <input_file> [no_of_threads]

3 Unimodular Transformations

A unimodular transformation is a loop transformation defined by a unimodular matrix. To test the validity of a general unimodular transformation, we need to know the distance vectors of the loop. After a valid transformation, if a loop in the transformed program carries no dependency, then it can execute in parallel. Many researchers have applied unimodular transformations for the parallelization of loops [11] [12]. Such transformations, though limited in applicability by the model on which they work, can be applied very elegantly to loops, yielding various forms of loops executable in parallel. These forms can be generalized as a set of non-parallel loops sandwiched between a set of outer parallel loops and a set of inner parallel loops. The beauty of this technique is that we can control how many such inner and outer loops we want, depending on the number of schedulable resources we have. If the input loop does not fit the model, or if the number of dependencies is more than the model allows, we will not be able to obtain completely parallel loops. In that case, we have to run the loops with explicit communication, as explained in works like [12]: the loops are allowed to have dependencies, yet we still run them in parallel, inserting explicit synchronization wherever necessary so that the loops run correctly, honoring all data dependencies. An extension of this technique is the one in [13], where the layout of arrays is optimized for such loops.

Program Model. Our
model program [6] is the loop nest L = (L1, L2, ..., Lm):

L1: do I1 = p1, q1
L2:   do I2 = p2, q2
        ...
Lm:     do Im = pm, qm
          H(I1, I2, ..., Im)
        enddo
      enddo
    enddo

An iteration H(i) is executed before another iteration H(j) in L iff i < j. Take any m*m unimodular matrix U, and let LU denote the program consisting of the iterations of L such that an iteration H(i) is executed before another iteration
H(j) in LU iff iU < jU. LU can be written as a nest of m loops with an index vector K = (K1, K2, ..., Km) defined by K = IU. The body of the new program is H(KU^-1), which we write as HU(K). The program LU is the transformed program defined by U, and the transformation L -> LU is the unimodular transformation of L defined by U. The idea here is to transform the non-parallel loop into another basis, so that the same loop becomes parallel (and can be executed in parallel on the target machine).

4 Results

We studied two problems: first, OpenMP-ready programs with explicitly parallel loops; second, loops which are not parallel as given, but which may need some transformations (like unimodular transformations) to run in parallel. We show the results for the first case in Section 4.1 and the results for the second case in Section 4.2.

4.1 Study of OpenMP Parallel Programs

We took OpenMP programs for matrix multiplication (array sizes of 1800 * 1800) and LU factorization (array sizes of 1800 * 1800). We tested these programs with different numbers of threads (2, 4, 6, 8, 16, etc.) on an IBM Power 5 server and noted the performance. The IBM Power 5 server we used (IBM A) has 4 physical processing units (8 cores); each processing unit has 2 cores. The maximum available physical processing capacity was 3.8 units, since the rest was used for the VIOS (Virtual I/O Server). The virtual machine on which the programs were tested was configured with two settings:

Profile 1: Virtual Processing Units: 3 (6 cores); Physical Processing Units: 3 (6 cores)
Profile 2: Virtual Processing Units: 8 (16 cores); Physical Processing Units: 3.8 (approx. 8 cores)

Matrix multiplication. The program snippet which performs matrix multiplication, implemented with threads, is as follows:

#pragma omp parallel for private(i,j,k)
for (i = 0; i < 1800; i++) {
    for (j = 0; j < 1800; j++) {
        c[i][j] = 0;
        for (k = 0; k < 1800; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table 1 shows the statistics for
the above program using Profile-1 and Profile-2. Graph-1 shows the graph corresponding to Profile-1, and Graph-2 shows the graph corresponding to Profile-2.
Table 1. Scale-ups for the above program with Profile-1 and Profile-2 (columns per profile: Threads, Time in s, Scale-up; numeric entries not reproduced).

Fig. 1. Graph-1.

From the observations, it can be concluded that the scale-up increases with the number of threads as long as the number of threads is less than or equal to the number of cores available. We can never approach the ideal speed-up of N, where N is the number of cores, since there is always some synchronization to be done, which slows down program execution. Comparing the maximum scale-ups in the two cases (Profile-1: 3.96, Profile-2: 4.59), even though the number of cores in the second case is twice that of the first, a considerable gain in scale-up is not obtained. This is because in the second case 16 is the number of virtual cores, not physical cores (of which there are actually 8); hence one core serves two threads, which introduces overhead due to thread switching.

LU factorization. The code snippet which performs LU factorization, implemented with threads, is as follows:
#pragma omp parallel for private(i,j,k)
for (k = 0; k < 1800; k++) {
    for (i = k+1; i < 1800; i++) {
        a[i][k] = a[i][k] / a[k][k];
    }
    for (i = k+1; i < 1800; i++) {
        for (j = k+1; j < 1800; j++) {
            a[i][j] = a[i][j] - a[i][k] * a[k][j];
        }
    }
}

Table 2 shows the statistics for the above program using Profile-1 and Profile-2. Graph-3 shows the graph corresponding to Profile-1, and Graph-4 shows the graph corresponding to Profile-2.

Fig. 2. Graph-2.

Table 2. Scale-ups for the above program with Profile-1 and Profile-2 (columns per profile: Threads, Time in s, Scale-up; numeric entries not reproduced).

Fig. 3. Graph-3.

Fig. 4. Graph-4.

It is clearly observed in all of the graphs that when the number of threads is increased beyond a certain limit (the number of cores available), the scale-up remains almost constant. But if it is increased to a much higher value, the scale-up decreases, because of the delay (overhead) produced by switching among the large number of threads.

As we know, the loop following an OpenMP directive is executed in parallel. We studied the effects of parallelizing different loops of a nest. Consider the following code snippet:

for (i = 0; i < 1800; i++) {
    for (j = 0; j < 1800; j++) {
        c[i][j] = 0;
        for (k = 0; k < 1800; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The time taken for a single thread to complete the above loops was ... seconds. When the outermost for loop (loop counter i) was parallelized with 4 threads, the time taken was found to be 7972 seconds. When the second for loop (loop counter j) was parallelized with 4 threads, the time taken was found to be ... seconds. When the innermost for loop (loop counter k) was parallelized with 4 threads, the time taken was found to be 5266 seconds.
The parallelization achieved by multithreading the outer for loop is called coarse-granular parallelization. It takes the least time because each thread executes a large chunk of the computation, so the overhead of thread switching is minimized: each thread, once created, handles a significant part of the program. The parallelization achieved by multithreading the inner for loop is called fine-granular parallelization. It takes more time because each thread executes a smaller chunk of the computation, so the overhead of repeatedly creating and destroying threads for very small tasks is higher.

When the outermost and the second for loops were parallelized with two threads each, the time taken was found to be 6347 seconds, slightly better than parallelizing the outermost loop alone with four threads (7972 s). When the outermost and the innermost for loops were parallelized with two threads each, the time taken was found to be 1975 seconds. When the second and the innermost for loops were parallelized with two threads each, the time taken was found to be ... seconds. These data further demonstrate the performance advantage of coarse-granular over fine-granular parallelization.

Multithreading has a number of limitations. There is an upper limit on the number of parallel threads we can create, since all the threads belong to a single process and each process has an upper limit on the memory it can use, depending on the hardware architecture. Each thread has its own program stack, so having a large number of threads in a single process may exhaust memory because of the many stacks required. Moreover, the best performance is obtained when the number of threads equals the number of actual cores available on the system.

4.2 Study of Loop Transformations on Multi-core

Loops with independent iterations can be easily parallelized, unlike ones with dependencies. But some loops with
dependencies can still be made independent to some extent by applying mathematical transformations, after which they can be parallelized [7]. The transformations are usually specific to the function the loop body performs. Some specific examples are discussed below.

Inner loop parallelization [6]. Consider the following double loop, which has dependencies:

for (i = 1; i < 1000; i++) {
    for (j = 1; j < 1000; j++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[i][j] = a[i-1][j] + a[i][j-1];
    }
}

This code takes 8751 seconds to complete. It can be transformed into a loop nest with an independent inner loop, given by:
for (k = 2; k < 1999; k++) {
    k_1 = (1 > k-999) ? 1 : (k-999);    /* k_1 = max(1, k-999) */
    k_2 = (999 > k-1) ? (k-1) : 999;    /* k_2 = min(999, k-1) */
    for (k_1 = k_1; k_1 <= k_2; k_1++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[k-k_1][k_1] = a[k-k_1-1][k_1] + a[k-k_1][k_1-1];
    }
}

In the above code, only the innermost loop is independent and hence can be run in parallel. When executed with two threads, the time taken is 5217 seconds, which is faster than the dependent version even though the parallelization is fine-granular. In this case, the number of iterations executed by the second for loop is smallest for the first and last iterations of the outermost for loop, highest in between, and the gradient is linear. Hence, to obtain higher efficiency, the first and last few iterations of the outermost for loop (depending on the amount of work being done) can be executed sequentially and the rest in parallel, or we can make use of dynamic threads. The results for dynamic threads are in Table 3.

Table 3. Scale-ups for the dynamic-threads program (columns: Threads, Time in s; numeric entries not reproduced).

Outer loop parallelization [6]. Consider the following code, which has dependencies:

for (i = 6; i <= 500; i++) {
    for (j = i; j <= (2*i)+4; j++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[i][j] = a[i-2][j-3] + a[i][j-6];
    }
}

This code takes 3296 seconds to complete. It can be transformed into code with independent outer loops, given by:

for (y_1 = 0; y_1 <= 1; y_1++) {        /* parallelizable loop */
    for (y_2 = 0; y_2 <= 1; y_2++) {    /* parallelizable loop */
        for (k_1 = ceil((6-y_1)/2.0); k_1 < floor((10-y_1)/2.0); k_1++) {
            for (k_2 = ceil(y_1+2*k_1-y_2); k_2 < floor((2*y_1+4*k_1-4-y_2)/3.0); k_2++) {
                a[y_1+2*k_1][y_2+3*k_2] = a[y_1+2*k_1-2][y_2+3*k_2-3] + a[y_1+2*k_1][y_2+3*k_2-6];
            }
        }
    }
}
When executed with two threads, the time taken is 2028 seconds, which is faster than the dependent version. Since this is a case of coarse-granular parallelization, increasing the number of threads gives better performance. With a maximum of 6 threads, the above program gives a scale-up of 3 (1335 s), which is below expectations because the number of inner-loop iterations is greater than in the original program.

5 Conclusion

In this study, we did two things. First, we studied the performance of a few OpenMP-ready parallel programs on multi-core machines (up to 8 physical cores); for this, we implemented the OpenMP compiler on top of the gcc compiler. Second, we extended our work to non-parallel loops and used transformations to make these loops run in parallel. Both cases showed that running such loops (with or without transformations) can give significant performance benefits on multi-core architectures. The trend today is towards more and more cores, and in that context this work is quite relevant. In future, we would like to extend our work to more non-parallel loops, which may run in parallel with some explicit synchronization on multi-core.

References

1. AMD Multi-core Products (2006)
2. Multi-core from Intel - Products and Platforms (2006)
3. OpenMP
4. Wolfe, M.J.: Techniques for Improving the Inherent Parallelism in Programs. Technical Report, Department of Computer Science, University of Illinois at Urbana-Champaign (July 1990)
5. Wolfe, M.: High Performance Compilers for Parallel Computing. Addison-Wesley, Reading
6. Banerjee, U.K.: Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, Norwell (1993)
7. Banerjee, U.K.: Loop Parallelization. Kluwer Academic Publishers, Norwell (1994)
8. Pthreads reference
9. D'Hollander, E.H.: Partitioning and Labelling of Loops by Unimodular Transformation. IEEE Transactions on Parallel and Distributed Systems 3(4) (1992)
10. Sass, R., Mutka, M.: Enabling Unimodular Transformations. In: Supercomputing 1994 (November 1994)
11. Banerjee, U.: Unimodular Transformations of Double Loops. In: Advances in Languages and Compilers for Parallel Processing (1991)
12. Prakash, S.R., Srikant, Y.N.: An Approach to Global Data Partitioning for Distributed Memory Machines. In: IPPS/SPDP (1999)
13. Prakash, S.R., Srikant, Y.N.: Communication Cost Estimation and Global Data Partitioning for Distributed Memory Machines. In: Fourth International Conference on High Performance Computing, Bangalore (1997)
More informationAn Introduction to OpenMP
An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
More informationhttps://www.youtube.com/playlist?list=pllx- Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG
https://www.youtube.com/playlist?list=pllx- Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG OpenMP Basic Defs: Solution Stack HW System layer Prog. User layer Layer Directives, Compiler End User Application OpenMP library
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationChap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1
Chap. 6 Part 3 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 OpenMP popular for decade Compiler-based technique Start with plain old C, C++, or Fortran Insert #pragmas into source file You
More informationShared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP
Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor
More informationParallel Computing. Prof. Marco Bertini
Parallel Computing Prof. Marco Bertini Shared memory: OpenMP Implicit threads: motivations Implicit threading frameworks and libraries take care of much of the minutiae needed to create, manage, and (to
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationCS691/SC791: Parallel & Distributed Computing
CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationIntroduction to OpenMP
Introduction to OpenMP Le Yan Objectives of Training Acquaint users with the concept of shared memory parallelism Acquaint users with the basics of programming with OpenMP Memory System: Shared Memory
More informationParallel Programming: OpenMP
Parallel Programming: OpenMP Xianyi Zeng xzeng@utep.edu Department of Mathematical Sciences The University of Texas at El Paso. November 10, 2016. An Overview of OpenMP OpenMP: Open Multi-Processing An
More informationCOMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP
COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including
More informationOpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer
OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.
More informationCSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden
CSE 160 Lecture 8 NUMA OpenMP Scott B. Baden OpenMP Today s lecture NUMA Architectures 2013 Scott B. Baden / CSE 160 / Fall 2013 2 OpenMP A higher level interface for threads programming Parallelization
More informationJukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples
Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationJANUARY 2004 LINUX MAGAZINE Linux in Europe User Mode Linux PHP 5 Reflection Volume 6 / Issue 1 OPEN SOURCE. OPEN STANDARDS.
0104 Cover (Curtis) 11/19/03 9:52 AM Page 1 JANUARY 2004 LINUX MAGAZINE Linux in Europe User Mode Linux PHP 5 Reflection Volume 6 / Issue 1 LINUX M A G A Z I N E OPEN SOURCE. OPEN STANDARDS. THE STATE
More informationCS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012
CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night
More informationAdvanced C Programming Winter Term 2008/09. Guest Lecture by Markus Thiele
Advanced C Programming Winter Term 2008/09 Guest Lecture by Markus Thiele Lecture 14: Parallel Programming with OpenMP Motivation: Why parallelize? The free lunch is over. Herb
More informationProgramming Shared-memory Platforms with OpenMP. Xu Liu
Programming Shared-memory Platforms with OpenMP Xu Liu Introduction to OpenMP OpenMP directives concurrency directives parallel regions loops, sections, tasks Topics for Today synchronization directives
More informationAlfio Lazzaro: Introduction to OpenMP
First INFN International School on Architectures, tools and methodologies for developing efficient large scale scientific computing applications Ce.U.B. Bertinoro Italy, 12 17 October 2009 Alfio Lazzaro:
More informationParallel Computing Parallel Programming Languages Hwansoo Han
Parallel Computing Parallel Programming Languages Hwansoo Han Parallel Programming Practice Current Start with a parallel algorithm Implement, keeping in mind Data races Synchronization Threading syntax
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 2 OpenMP Shared address space programming High-level
More informationParallel Processing Top manufacturer of multiprocessing video & imaging solutions.
1 of 10 3/3/2005 10:51 AM Linux Magazine March 2004 C++ Parallel Increase application performance without changing your source code. Parallel Processing Top manufacturer of multiprocessing video & imaging
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University 1. Introduction 2. System Structures 3. Process Concept 4. Multithreaded Programming
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationProgramming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen
Programming with Shared Memory PART II HPC Fall 2012 Prof. Robert van Engelen Overview Sequential consistency Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationOpenMP Shared Memory Programming
OpenMP Shared Memory Programming John Burkardt, Information Technology Department, Virginia Tech.... Mathematics Department, Ajou University, Suwon, Korea, 13 May 2009.... http://people.sc.fsu.edu/ jburkardt/presentations/
More informationDistributed Systems + Middleware Concurrent Programming with OpenMP
Distributed Systems + Middleware Concurrent Programming with OpenMP Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it http://home.dei.polimi.it/cugola
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationOpenMP threading: parallel regions. Paolo Burgio
OpenMP threading: parallel regions Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks,
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...
ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure
More informationQuestions from last time
Questions from last time Pthreads vs regular thread? Pthreads are POSIX-standard threads (1995). There exist earlier and newer standards (C++11). Pthread is probably most common. Pthread API: about a 100
More informationPoint-to-Point Synchronisation on Shared Memory Architectures
Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:
More informationShared Memory Parallelism using OpenMP
Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत SE 292: High Performance Computing [3:0][Aug:2014] Shared Memory Parallelism using OpenMP Yogesh Simmhan Adapted from: o
More informationExample of a Parallel Algorithm
-1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software
More informationChapter 4: Threads. Operating System Concepts 9 th Edit9on
Chapter 4: Threads Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads 1. Overview 2. Multicore Programming 3. Multithreading Models 4. Thread Libraries 5. Implicit
More informationIntroduction to. Slides prepared by : Farzana Rahman 1
Introduction to OpenMP Slides prepared by : Farzana Rahman 1 Definition of OpenMP Application Program Interface (API) for Shared Memory Parallel Programming Directive based approach with library support
More informationCOSC 6374 Parallel Computation. Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel)
COSC 6374 Parallel Computation Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) Edgar Gabriel Fall 2014 Introduction Threads vs. processes Recap of
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationOperating Systems 2 nd semester 2016/2017. Chapter 4: Threads
Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationThreads. CS3026 Operating Systems Lecture 06
Threads CS3026 Operating Systems Lecture 06 Multithreading Multithreading is the ability of an operating system to support multiple threads of execution within a single process Processes have at least
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More information[Potentially] Your first parallel application
[Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel
More information