Designing Parallel Programs

This review was developed from Introduction to Parallel Computing.
Author: Blaise Barney, Lawrence Livermore National Laboratory
Reference: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

There are two ways to develop parallel programs:
1. Automatic: the compiler finds the possible parallelization points.
2. Manual: the programmer finds the possible parallelization points.

Manual Method
1. Determine whether the problem is one that can be solved in parallel.
Example of a parallelizable problem: calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum-energy conformation.
Example of a non-parallelizable problem: calculation of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) using the formula F(n) = F(n-1) + F(n-2), because each term depends on the terms computed just before it.
2. Identify program hotspots. Hotspots are areas where most of the program's work is done. Concentrate on parallelizing the hotspots.
3. Identify bottlenecks. Are there areas, such as I/O operations, that slow down or even halt the parallel portion of the program?
4. Identify inhibitors to parallelism. Data dependence (as in the Fibonacci series) is a common inhibitor.
5. If possible, look at other algorithms to see whether they can be parallelized.
6. Take advantage of parallel programs and libraries from third-party vendors.
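
As a concrete illustration of the two examples above, here is a minimal C sketch (mine, not the tutorial's) that assumes OpenMP and a placeholder potential_energy() function: the conformation energies are independent, so a parallel loop with a min reduction applies directly, while the Fibonacci recurrence is inherently serial.

    /* Illustrative sketch: independent conformation energies parallelize
       naturally; the Fibonacci recurrence does not. */
    #include <float.h>
    #include <omp.h>

    /* Placeholder stand-in for a real energy model (an assumption, not
       part of the tutorial). */
    static double potential_energy(int conformation)
    {
        return (double)((conformation * 37) % 101);
    }

    double min_energy(int n_conformations)
    {
        double best = DBL_MAX;
        /* Each iteration is independent, so the loop can be split across
           threads; the reduction keeps the running minimum correct. */
        #pragma omp parallel for reduction(min:best)
        for (int i = 0; i < n_conformations; i++) {
            double e = potential_energy(i);
            if (e < best) best = e;
        }
        return best;
    }

    /* By contrast, F(n) = F(n-1) + F(n-2) forces a serial order: each
       value depends on the two values computed just before it. */
    long fib(int n)
    {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) { long t = a + b; a = b; b = t; }
        return a;
    }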

Designing Parallel Programs: Partitioning
Partitioning divides the work to be done into discrete chunks (tasks). There are two methods for partitioning:
1. Domain decomposition
2. Functional decomposition

Domain Decomposition
The data is decomposed, and each task then works on a portion of the data.
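
As a minimal sketch of domain decomposition (mine, not the tutorial's), the following C program computes the contiguous block of a 1-D data set that each task owns; the names block_range, lo, and hi are illustrative.

    /* Illustrative 1-D block domain decomposition: given ntasks tasks and
       n data elements, each task owns the half-open index range [lo, hi). */
    #include <stdio.h>

    static void block_range(int rank, int ntasks, int n, int *lo, int *hi)
    {
        int base = n / ntasks;       /* minimum block size              */
        int rem  = n % ntasks;       /* the first 'rem' tasks get one extra */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int n = 10, ntasks = 4;
        for (int rank = 0; rank < ntasks; rank++) {
            int lo, hi;
            block_range(rank, ntasks, n, &lo, &hi);
            printf("task %d owns elements [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }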

Functional Decomposition
The problem is broken into smaller tasks, and each task then does part of the work that needs to be done. In functional decomposition the emphasis is on the computation rather than on the data. Examples of functional decomposition:
1. Ecosystem modeling
2. Signal processing
3. Climate modeling
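
A minimal sketch of functional decomposition, assuming OpenMP sections and hypothetical component functions loosely modeled on the climate-modeling example above: each section performs a different part of the computation rather than a different slice of the data.

    /* Illustrative functional decomposition: three different components
       of a model run concurrently as OpenMP sections.  The component
       functions are placeholders, not part of the tutorial. */
    #include <stdio.h>
    #include <omp.h>

    static void model_atmosphere(void) { printf("atmosphere step\n"); }
    static void model_ocean(void)      { printf("ocean step\n"); }
    static void model_land(void)       { printf("land surface step\n"); }

    int main(void)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            model_atmosphere();   /* one task: atmosphere component */
            #pragma omp section
            model_ocean();        /* another task: ocean component  */
            #pragma omp section
            model_land();         /* another task: land component   */
        }
        return 0;
    }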

Communications
Which tasks need to communicate with each other?

Communication is not needed when two tasks do not share data and are independent of each other; such tasks do not need to exchange information. An example: a black-and-white image needs to have its colors reversed. Each pixel can be reversed independently, so there is no need for the tasks to exchange information.

Communication is needed when tasks require information from other tasks to complete their work. An example: a 3-D heat diffusion problem. Each task needs to know what the neighboring portions of the problem have calculated so that it can adjust its own values.

Facts to Consider with Communication
1. Cost of communication.
2. Latency vs. bandwidth: many small messages cause latency to dominate the communication overhead, while larger messages make fuller use of the available bandwidth.
3. Visibility of communication: with message passing, communication is explicit and under the control of the programmer. With the data-parallel model, the programmer does not know when communication is occurring and has no control over it.
4. Synchronous vs. asynchronous communication: synchronous (also called blocking) communication requires handshaking between sender and receiver. Asynchronous (also called non-blocking) communication does not need to wait for the transfer to finish; the ability to interleave computation with communication is its biggest advantage. (A sketch follows this list.)
5. Scope of communication, at two levels: point-to-point, where one task is the sender and one task is the receiver, and collective, where data is shared among two or more tasks. Both can be synchronous or asynchronous.

Types of collective communication include broadcast, scatter, gather, and reduction.
6. Efficiency of communications: choosing the right type of communication affects its efficiency.
7. Overhead and complexity increase with communication.
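
The following MPI sketch (mine, assuming exactly two ranks and an illustrative halo exchange) contrasts a blocking exchange, a non-blocking exchange overlapped with computation, and a collective reduction; do_local_work() and the buffer sizes are assumptions for illustration only.

    #include <mpi.h>

    static void do_local_work(void) { /* computation that does not need the halo data */ }

    int main(int argc, char **argv)
    {
        int rank;
        double send_halo[100] = {0}, recv_halo[100];
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int partner = 1 - rank;                  /* assumes exactly two ranks */

        /* Blocking (synchronous-style) exchange: the call does not return
           until the buffers are safe to reuse, so nothing overlaps the
           transfer. */
        MPI_Sendrecv(send_halo, 100, MPI_DOUBLE, partner, 0,
                     recv_halo, 100, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Non-blocking exchange: start the transfer, do useful computation,
           then wait.  Communication and computation are interleaved. */
        MPI_Irecv(recv_halo, 100, MPI_DOUBLE, partner, 1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_halo, 100, MPI_DOUBLE, partner, 1, MPI_COMM_WORLD, &reqs[1]);
        do_local_work();
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* Collective communication: every rank contributes a value and
           every rank receives the global sum. */
        double global_sum;
        MPI_Allreduce(&send_halo[0], &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }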

Synchronization
Managing the work and the tasks required to complete it is a critical design consideration, and program performance can be significantly affected. Portions of the program may need to be serialized. There are three kinds of synchronization:

Barrier: when each task reaches the barrier it stops. When the last task reaches the barrier, all the tasks are synchronized and released. This is usually done to synchronize the tasks before a serial portion of the program.

Lock/semaphore: any number of tasks may be involved in a lock or semaphore. The purpose of the lock is to allow only one task at a time access to a global variable or a serial section of code. The first task to reach the lock sets it; all other tasks must then wait until the lock is released before they can acquire the variable or access the code.

Synchronous communication operations: used by tasks that are communicating. Before a task can send a message, it must get acknowledgement that the receiving task is ready for it.
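
A minimal OpenMP sketch (mine) of the first two mechanisms, using a barrier and an explicit lock around an illustrative shared counter; synchronous communication operations correspond to the blocking MPI calls shown in the communication sketch above.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int counter = 0;           /* illustrative shared variable */
        omp_lock_t lock;
        omp_init_lock(&lock);

        #pragma omp parallel
        {
            /* Lock/semaphore: only one thread at a time may update the
               shared counter; the others wait until the lock is released. */
            omp_set_lock(&lock);
            counter++;
            omp_unset_lock(&lock);

            /* Barrier: every thread stops here until the last one arrives,
               so the printout below sees the final counter value. */
            #pragma omp barrier

            #pragma omp single
            printf("all %d threads checked in\n", counter);
        }

        omp_destroy_lock(&lock);
        return 0;
    }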

Data Dependencies
A dependence occurs in a program when the order of execution affects the outcome of the program. A data dependence results when more than one task uses the same variable. Dependencies are important in parallel programming because they are the main inhibitor to parallelism.

Data Dependence Examples

Loop-carried dependence: A(J-1) must be calculated before A(J), so A(J) is data dependent on A(J-1). To properly compute the value of A(J), assume task 2 calculates A(J) and task 1 calculates A(J-1). In a distributed memory architecture this means task 1 must finish before task 2 can begin. In a shared memory architecture, task 1 must write the value of A(J-1) to memory before task 2 can retrieve it.

Loop-independent data dependence: in this case, the order in which the tasks complete affects the outcome of the program. In a distributed memory architecture, the value of Y depends on when the value of X is transferred between the two tasks. In a shared memory architecture, the value of Y depends on the last update of X.

Loop-carried data dependencies are particularly important because loops are often the primary target of parallelization efforts.
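
The same contrast in C, as an illustrative sketch: the first loop mirrors the A(J)/A(J-1) example and cannot safely be split across tasks as written, while the second loop has independent iterations.

    /* Loop-carried dependence: iteration j reads a[j-1], which iteration
       j-1 produces, so the iterations must run in serial order. */
    void carried(double *a, int n)
    {
        for (int j = 1; j < n; j++)
            a[j] = a[j - 1] * 2.0;
    }

    /* No loop-carried dependence: each iteration touches only its own
       elements, so the iterations may run in any order or in parallel. */
    void independent(double *a, const double *b, int n)
    {
        for (int j = 0; j < n; j++)
            a[j] = b[j] * 2.0;
    }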

How to Handle Data Dependencies
In distributed memory architectures, communicate the required data at synchronization points. In shared memory architectures, synchronize the read/write operations between tasks.

Load Balancing
The goal of load balancing is to keep all tasks busy all of the time; it can also be seen as minimizing task idle time. In parallel programs the slowest task determines the overall performance.

How to Achieve Load Balance

Equally partition the work each task receives:
For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks.
For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks.
If a heterogeneous mix of machines with varying performance characteristics is being used, be sure to use some type of performance analysis tool to detect load imbalances.

Dynamic work assignment (see the sketch below): certain classes of problems result in load imbalances even if the data is evenly distributed among the tasks.
Sparse arrays: some tasks will have actual data to work on while others have mostly zeros.
Adaptive grid methods: some tasks may need to refine their mesh while others don't.
N-body simulations: particles may migrate from their original task domain to another task's, and particles owned by some tasks may require more work than those owned by other tasks.

When the amount of work each task will perform is variable or unpredictable, it may help to use a scheduler/task-pool approach: as each task finishes its work, it returns to the pool to be assigned new work. For some situations, it may be necessary to design an algorithm that detects and handles load imbalances as they occur.
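
A minimal sketch of dynamic work assignment using OpenMP's dynamic loop schedule; work_for_row() is a deliberately uneven placeholder standing in for an irregular problem such as a sparse array or adaptive grid.

    #include <omp.h>

    /* Placeholder workload (an assumption): some rows cost far more than
       others, which is what creates the load imbalance. */
    static double work_for_row(int row)
    {
        double s = 0.0;
        for (int i = 0; i < (row % 100) * 10000; i++)
            s += i * 1e-9;
        return s;
    }

    double process_rows(int nrows)
    {
        double total = 0.0;
        /* schedule(dynamic, 4): each thread takes the next 4 iterations
           from a shared pool as soon as it finishes its current chunk, so
           expensive rows no longer pin one thread while others sit idle. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int row = 0; row < nrows; row++)
            total += work_for_row(row);
        return total;
    }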

Granularity
Computation/communication ratio: granularity is the ratio of computation to communication. Periods of computation are typically separated from periods of communication by synchronization events.

Fine-grain parallelism:
The work is divided into many small pieces between communication events.
Low computation-to-communication ratio.
Facilitates load balancing.
Results in high communication costs and less opportunity for performance enhancement.
If granularity is too fine, the communication and synchronization overhead can exceed the computation itself.

Coarse-grain parallelism:
Large amounts of computational work are performed between communication/synchronization events.
High computation-to-communication ratio.
More opportunity for performance increases.
Harder to load balance effectively.

Which is Best?
The best granularity is the most efficient one; the algorithm and the hardware environment both play important roles in the efficiency of a run. Coarse granularity is usually advantageous because the overhead associated with communication and synchronization is high relative to execution speed. Fine-grain parallelism can help reduce load-imbalance overheads.
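
As an illustrative sketch of how granularity shows up in code (assuming MPI, two ranks, and a matching receive posted on the partner rank), the fine-grained version below pays the per-message latency n times, while the coarse-grained version moves the same data in a single message.

    #include <mpi.h>

    /* Fine-grained: n messages of one element each; latency dominates. */
    void exchange_fine(const double *buf, int n, int partner)
    {
        for (int i = 0; i < n; i++)
            MPI_Send(&buf[i], 1, MPI_DOUBLE, partner, i, MPI_COMM_WORLD);
    }

    /* Coarse-grained: one message of n elements; bandwidth is used fully
       and the latency cost is paid only once. */
    void exchange_coarse(const double *buf, int n, int partner)
    {
        MPI_Send(buf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }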

I/O

The bad news:
I/O operations are generally considered to be inhibitors to parallelism.
I/O operations require much more time than memory operations (by orders of magnitude).
Parallel I/O systems may not be available on all platforms.
In environments where all tasks see the same file space, write operations can result in files being overwritten.
Read operations can be limited by the file server's ability to handle multiple simultaneous reads.
I/O that must travel over a network can cause severe bottlenecks and even crash file servers.

The good news:
Parallel file systems are available.
MPI has had an I/O programming interface since 1996.

I/O pointers:
Reduce I/O as much as possible.
Use a parallel file system if you have one.
Writing large blocks of data is more efficient than writing many small blocks.
Confine I/O to the serial portions of the code.
Have a small subset of tasks perform the I/O (see the sketch below).
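
A hypothetical MPI sketch of the last two pointers: each rank computes its own chunk, and rank 0 gathers everything and performs a single large write. The file name results.bin and the chunk size are illustrative assumptions.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        enum { CHUNK = 1024 };                 /* elements produced per rank */
        double local[CHUNK];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < CHUNK; i++)        /* each rank fills its own chunk */
            local[i] = rank * CHUNK + i;

        double *all = NULL;
        if (rank == 0)
            all = malloc((size_t)size * CHUNK * sizeof(double));

        /* Collect every rank's chunk on rank 0. */
        MPI_Gather(local, CHUNK, MPI_DOUBLE,
                   all,   CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {                       /* one task, one large write */
            FILE *f = fopen("results.bin", "wb");
            if (f) {
                fwrite(all, sizeof(double), (size_t)size * CHUNK, f);
                fclose(f);
            }
            free(all);
        }

        MPI_Finalize();
        return 0;
    }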

Debugging
Debugging parallel code can be very difficult. Good debuggers are available for:
Threaded programs (pthreads and OpenMP)
MPI programs
GPU / accelerator code
Hybrid codes

Performance Analysis and Tuning
Analyzing and tuning parallel program performance is also very difficult. The use of performance analysis and tuning tools is highly recommended.