Parallelization of Sequential Programs

Alecu Felician, Pre-Assistant Lecturer, Economic Informatics Department, A.S.E. Bucharest

Abstract

The main reason for parallelizing a sequential program is to make it run faster. The easiest way to parallelize a sequential program is to use a compiler that automatically detects the parallelism of the program and generates the parallel version. If the user decides to parallelize the sequential program manually, he has to explicitly define the desired communication and synchronization mechanisms. There are several ways to parallelize a program containing input/output operations, but one of the most important challenges is parallelizing loops, because loops usually spend the most CPU time and the code they contain involves arrays whose indices are associated with the loop variable. Typically, loops are parallelized by dividing arrays into blocks and distributing iterations among processes.

1. Introduction

According to Amdahl's law, even on an ideal parallel system it is very difficult to obtain a speedup equal to the number of processors, because every program has, in terms of running time, a fraction α that cannot be parallelized and has to be executed sequentially by a single processor. The remaining fraction (1 - α) can be executed in parallel, so on N processors the achievable speedup is

    speedup = 1 / (α + (1 - α) / N)

This is why the maximum speedup that can be obtained by running a program with a non-parallelizable fraction α on a parallel system is 1/α, regardless of the number of processors in the system. Figure 1 shows how the speedup depends on the number of processors for a few given values of the fraction α.

Figure 1  Speedup depending on the number of processors for a few given values of fraction α

When running a parallel program on a real parallel system there is an overhead coming from processor load imbalance and from the communication time needed for exchanging data between processors and for synchronization. This is why the measured execution time of the program will be greater than the theoretical value.
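
As a quick numerical illustration of this limit, the following small C program (a minimal sketch; the values of α and N are arbitrary examples, not taken from the paper) evaluates Amdahl's formula for several processor counts and shows the speedup approaching the 1/α ceiling.

#include <stdio.h>

/* Theoretical speedup predicted by Amdahl's law for a program whose
   fraction alpha must run sequentially, executed on n processors. */
static double amdahl_speedup(double alpha, int n)
{
    return 1.0 / (alpha + (1.0 - alpha) / n);
}

int main(void)
{
    const double alphas[] = { 0.05, 0.10, 0.25 };   /* example serial fractions */
    const int procs[] = { 2, 4, 8, 16, 1024 };      /* example processor counts */

    for (int a = 0; a < 3; a++) {
        for (int p = 0; p < 5; p++)
            printf("alpha = %.2f  N = %4d  speedup = %6.2f\n",
                   alphas[a], procs[p], amdahl_speedup(alphas[a], procs[p]));
        /* As N grows the speedup approaches 1/alpha and never exceeds it. */
        printf("limit 1/alpha = %.2f\n\n", 1.0 / alphas[a]);
    }
    return 0;
}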

So, in order to have a faster program we should:

- Decrease the fraction of the program that cannot be parallelized; sometimes we have to change the algorithm used by the program in order to achieve a lower value of the fraction α;
- Minimize the processor load imbalance;
- Minimize the communication time by decreasing the amount of data transmitted or by decreasing the number of times data is transmitted.

Let us presume that a process has to send a large matrix to another process. If it sends the matrix element by element, the communication overhead will be very high. Instead, if the matrix is copied into a contiguous memory area, the process can send the entire memory area at once and the communication overhead will be acceptable, because copying the data usually takes less time than the communication latency.

2. How to parallelize a sequential program

The easiest way to parallelize a sequential program is to use a compiler that detects, automatically or based on compiler directives specified by the user, the parallelism of the program and generates the parallel version with the aid of the interdependencies found in the source code. The automatically generated parallel version of the program can then be executed on a parallel system. The executables generated by such compilers run in parallel using multiple threads that communicate with each other through the shared address space or through explicit message passing statements. The user does not have to be concerned about which parts of the program are parallelized, because the compiler takes this decision when the automatic parallelization facility is used. On the other hand, the user has very limited control over parallelization. The capabilities of such a compiler are restricted, because it is very difficult to find and analyze the dependencies in complex programs that use nested loops, procedure calls and so on.

If the user decides to parallelize his program manually, he can freely decide which parts have to be parallelized and which not. Also, the user has to explicitly define in the parallel program the desired communication mechanisms (message passing or shared memory) and synchronization methods (locks, barriers, semaphores and so on).

Parallelizing loops is one of the most important challenges, because loops usually spend the most CPU time even if the code they contain is very small. Figure 2 shows an example of nested loops.

    Loop 1, n
        ...
        Loop 1, k
            ...

Figure 2  Sequential Program - Nested Loops Example

Let us presume that the first loop is not time-consuming but the second one uses most of the CPU time.

This is why it is reasonable to parallelize the second loop by distributing iterations among processes while the main loop remains unchanged. This is called partial parallelization of a loop. If data affected by the inner loop are then referenced in the main loop, we need to synchronize the data just after the end of the inner loop in order to be sure that the values accessed by the main loop are the updated ones.

Figure 3 shows how the sequential program code was parallelized. The data synchronization was added just after the second loop, and the inner loop was distributed over multiple processes and processors. Every process will execute just a subset of the inner loop iteration range.

    Loop 1, n
        ...
        Loop Start, End
            ...
        Synchronize data

Figure 3  Parallel Program - Nested Loops Example

Using partial parallelization of a loop we can minimize the load imbalance, but the communication overhead will be higher as a result of synchronization. The program has to perform more data transmissions in order to synchronize the data affected by the inner loop.

If the inner loop is located in a function/procedure called by the main loop, we can parallelize either the main program or the subroutine, depending on the desired level of granularity (figure 4).

    Loop 1, n
        ...
        Call Subroutine

    Subroutine
        Loop 1, k
            ...

Figure 4  Sequential Program Using Loops

If the subroutine is parallelized, the workload will be balanced due to the fine-grained parallelization.
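
As an illustration of partial parallelization, the following C sketch uses a shared-memory (OpenMP) variant rather than the explicit message passing discussed above: only the inner loop is distributed among threads, and the implicit barrier at the end of the parallel for plays the role of the data synchronization step, so the outer loop always sees updated values. The array size and the loop body are placeholders of my own.

/* compile with e.g.: cc -fopenmp partial.c */
#include <omp.h>
#include <stdio.h>

#define N 100          /* outer loop iterations (placeholder size)        */
#define K 1000000      /* inner loop iterations (the expensive part)      */

int main(void)
{
    static double a[K];

    for (int i = 0; i < N; i++) {          /* main loop stays sequential   */
        #pragma omp parallel for           /* inner loop is distributed    */
        for (int j = 0; j < K; j++)        /* among the available threads  */
            a[j] = a[j] * 0.5 + i;
        /* implicit barrier here: all of a[] is up to date before the
           next iteration of the outer loop references it */
        printf("iteration %d done, a[0] = %f\n", i, a[0]);
    }
    return 0;
}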

3. Input/Output Blocks Parallelization

The following methods are used to parallelize source code containing I/O operations.

When a program uses an input file, all processes have to read the file content concurrently. This can be done in the following ways:

1. Locate the input file on a shared file system: each process reads data from the same file. The file system can be a classical one (like NFS) or a dedicated one (like GPFS, the General Parallel File System).

Figure 5  Input File Located on a Shared File System

2. Each process uses a local copy of the input file: the input file is copied to each node before running the program. In this way processes can read the input file concurrently using the local copy. Copying the input file to every node requires supplementary time and disk space, but the performance achieved is better than reading from a shared file system.

Figure 6  Local Copy of the Input File

3. A single process reads the input file and then broadcasts the file content to the other processes: every process receives only the amount of data it needs.

Figure 7  One Process Reads and Distributes the Input File
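
The third method maps naturally onto collective communication. The following sketch assumes MPI; the file name input.dat and the buffer size are placeholder choices of mine. Rank 0 reads the whole file and broadcasts it with MPI_Bcast; MPI_Scatter could be used instead when each process only needs its own portion of the data.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_INPUT 1048576   /* assumed upper bound on the input size */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MAX_INPUT);
    long len = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                          /* only rank 0 touches the disk */
        FILE *f = fopen("input.dat", "rb");   /* hypothetical input file      */
        if (f != NULL) {
            len = (long)fread(buf, 1, MAX_INPUT, f);
            fclose(f);
        }
    }

    /* every process receives the length, then the file content */
    MPI_Bcast(&len, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    MPI_Bcast(buf, (int)len, MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("rank %d received %ld bytes\n", rank, len);

    free(buf);
    MPI_Finalize();
    return 0;
}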

If a program generates an output file, it can be created using the following methods:

1. Output file located on a shared file system: every process writes its data sequentially into an output file located on a shared file system. While a process is writing its data the file is locked and no other process can write to it. In this way the file content will not be corrupted by the time the execution of the program ends.

Figure 8  Output File Located on a Shared File System

2. Gathering of data by one process: one of the processes gathers the data from all the other concurrent processes and then writes the data to the output file. In this case the output file can be located locally or on a shared file system. Locks, barriers, semaphores and other methods are used to synchronize the processes.

Figure 9  Gathering of Data by One Process
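
The gathering method can be expressed with a collective operation as well. In the following sketch (again assuming MPI, with a placeholder per-process result array and the hypothetical file name output.dat), MPI_Gather collects every process's partial results at rank 0, which is the only process that writes the output file.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1000   /* number of result values produced by each process */

int main(int argc, char **argv)
{
    int rank, size;
    double part[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < CHUNK; i++)          /* placeholder computation */
        part[i] = rank * CHUNK + i;

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * CHUNK * sizeof(double));

    /* rank 0 gathers every process's block, in rank order */
    MPI_Gather(part, CHUNK, MPI_DOUBLE,
               all, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                          /* only rank 0 writes the file */
        FILE *f = fopen("output.dat", "wb");  /* hypothetical output file    */
        if (f != NULL) {
            fwrite(all, sizeof(double), (size_t)size * CHUNK, f);
            fclose(f);
        }
        free(all);
    }

    MPI_Finalize();
    return 0;
}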

4. Loops Parallelization

Parallelizing loops is one of the most important challenges because loops usually spend the most CPU time even if the code they contain is very small. A loop can be parallelized by distributing iterations among processes: every process executes just a subset of the loop iteration range. Usually the code contained in a loop involves arrays whose indices are associated with the loop variable. This is why distributing iterations means dividing arrays and assigning chunks to processes.

4.1. Block Distribution

If there are p processes to be executed in parallel, the iterations have to be divided into p parts. If we have a loop with 100 iterations executed by 5 processes, each process will execute 20 consecutive iterations. If the number of iterations, n, is not divisible by the number of processes, p, we have to distribute the remainder, r (n = p * q + r), among the processes. In such a case, the first r processes will execute q + 1 iterations each and the last p - r processes will execute just q iterations each. This distribution corresponds to expressing n as:

    n = r * (q + 1) + (p - r) * q

Given the loop iteration range and the number of processes, it is possible to compute the iteration range for a given rank:

    sub range_ver1(n1, n2, p, rank, Start, End)
        q = (n2 - n1 + 1) / p
        r = MOD(n2 - n1 + 1, p)
        Start = rank * q + n1 + MIN(rank, r)
        End = Start + q - 1
        If r > rank then End = End + 1

or

    sub range_ver2(n1, n2, p, rank, Start, End)
        q = (n2 - n1) / p + 1
        Start = MIN(rank * q + n1, n2 + 1)
        End = MIN(Start + q - 1, n2)

where:

    n1, n2       lowest and highest value of the iteration variable (IN)
    p            number of processes (IN)
    rank         the rank for which we want to know the range of iterations (IN)
    Start, End   lowest and highest value of the iteration variable that process rank executes (OUT)

Table 1 shows how these subroutines block-distribute 14 iterations to four processes.

    Iteration    Rank (range_ver1)    Rank (range_ver2)
        1                0                    0
        2                0                    0
        3                0                    0
        4                0                    0
        5                1                    1
        6                1                    1
        7                1                    1
        8                1                    1
        9                2                    2
       10                2                    2
       11                2                    2
       12                3                    2
       13                3                    3
       14                3                    3

Table 1  Block Distribution

Using the block distribution method it is possible for some processes to be assigned no iterations at all, so the workload can be unbalanced among the processes.
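
For reference, a C version of range_ver1 could look like the following sketch (the original pseudocode is language-neutral; 0-based ranks and integer division are used here). The small test program reproduces the assignment shown in Table 1.

#include <stdio.h>

/* Block distribution: compute the sub-range [*start, *end] of the loop
   range [n1, n2] assigned to the given rank out of p processes
   (C translation of the range_ver1 pseudocode above). */
void range_ver1(int n1, int n2, int p, int rank, int *start, int *end)
{
    int q = (n2 - n1 + 1) / p;        /* base number of iterations per rank */
    int r = (n2 - n1 + 1) % p;        /* leftover iterations                */

    *start = rank * q + n1 + (rank < r ? rank : r);
    *end   = *start + q - 1;
    if (r > rank)                     /* first r ranks take one extra iteration */
        *end = *end + 1;
}

int main(void)
{
    /* Reproduce Table 1: 14 iterations (1..14) over 4 processes. */
    for (int rank = 0; rank < 4; rank++) {
        int start, end;
        range_ver1(1, 14, 4, rank, &start, &end);
        printf("rank %d: iterations %d..%d\n", rank, start, end);
    }
    return 0;
}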

4.2. Cyclic Distribution

Cyclic distribution means that the iterations are assigned to processes in a round-robin fashion. Let us presume that we have the following loop:

    for i = n1, n2
        computations

In such a case, the process with rank rk will perform the following iterations:

    for i = n1 + rk, n2, p
        computations

Table 2 shows an example of cyclic distribution of 11 iterations to three processes.

    Iteration    Rank
        1          0
        2          1
        3          2
        4          0
        5          1
        6          2
        7          0
        8          1
        9          2
       10          0
       11          1

Table 2  Cyclic Distribution

The cyclic distribution method provides a more balanced workload among processes than block distribution.

4.3. Block-Cyclic Distribution

The block-cyclic distribution combines block and cyclic distributions in a single method. The iterations are partitioned into equally sized blocks and these blocks are assigned to processes in a round-robin manner. Dividing the following loop

    for i = n1, n2
        computations

the process with rank rk will perform the following iterations:

    for ii = n1 + rk * block, n2, p * block
        for i = ii, MIN(ii + block - 1, n2)
            computations

where block is the block size. Table 3 shows how 10 iterations are assigned to 3 processes using the block-cyclic distribution method with a block size of two.

    Iteration    Rank
        1          0
        2          0
        3          1
        4          1
        5          2
        6          2
        7          0
        8          0
        9          1
       10          1

Table 3  Block-Cyclic Distribution
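
In C, both distributions reduce to a choice of loop bounds and stride. The following stand-alone sketch (the parameter values are example choices close to the tables above) prints the iterations each rank would execute under cyclic and block-cyclic distribution; in a real parallel program each process would run only its own loop.

#include <stdio.h>

/* Show which iterations of the inclusive range n1..n2 each of the p
   processes executes under cyclic and block-cyclic distribution. */
int main(void)
{
    const int n1 = 1, n2 = 11, p = 3, block = 2;   /* example parameters */

    for (int rk = 0; rk < p; rk++) {
        printf("rank %d cyclic      :", rk);
        for (int i = n1 + rk; i <= n2; i += p)          /* stride p */
            printf(" %d", i);
        printf("\n");

        printf("rank %d block-cyclic:", rk);
        for (int ii = n1 + rk * block; ii <= n2; ii += p * block) {
            int last = (ii + block - 1 < n2) ? ii + block - 1 : n2;  /* MIN */
            for (int i = ii; i <= last; i++)
                printf(" %d", i);
        }
        printf("\n");
    }
    return 0;
}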

4.4. Nested Loops

When parallelizing loops it is necessary to consider that it is more efficient to access arrays in the same order in which they are stored in memory than to access them in any other order. For example, if we have an array stored in column-major order, LOOP1 will be slower than LOOP2:

    LOOP1
    for i = 1, n
        for j = 1, n
            a(i, j) = ...

    LOOP2
    for j = 1, n
        for i = 1, n
            a(i, j) = ...

LOOP2 can be parallelized in the following ways:

    LOOP21
    for j = jsta, jend
        for i = 1, n
            a(i, j) = ...

    LOOP22
    for j = 1, n
        for i = ista, iend
            a(i, j) = ...

The same rule has to be applied when distributing the work, so it is preferable to use LOOP21, in which each process accesses a contiguous block of whole columns, because it is faster than LOOP22.
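
For completeness, the same rule expressed in C: since C stores arrays in row-major order, the roles of the two indices are swapped with respect to the column-major example above, so the cache-friendly order varies the last index in the inner loop and the parallel version should block-distribute the outer (row) loop. The following sketch (the array size and the rank 0 bounds are placeholder values of mine) illustrates this.

#include <stdio.h>

#define N 512

static double a[N][N];   /* row-major: a[i][0..N-1] is contiguous in memory */

int main(void)
{
    /* Cache-friendly order in C: i outer, j inner (last index varies fastest). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + 0.001 * j;

    /* Parallel analogue of LOOP21: distribute the outer loop so that each
       process or thread gets a contiguous block of rows [ista, iend].
       Here only the block of a hypothetical rank 0 out of 4 workers is shown. */
    int ista = 0, iend = N / 4 - 1;
    for (int i = ista; i <= iend; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * a[i][j];

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}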