Introduction to Parallel Programming

Linda Woodard, Cornell Center for Advanced Computing (CAC), 19 May 2010
Introduction to Parallel Computing on Ranger

What is Parallel Programming?
- Using more than one processor or computer to complete a task
- Each processor works on its section of the problem (functional parallelism)
- Each processor works on its section of the data (data parallelism)
- Processors can exchange information
[Figure: a 2-D grid representing the problem to be solved, divided into four areas, each worked on by one of CPUs #1 through #4]

Forest Inventory with Ranger Bob

Why Do Parallel Programming?
Limits of single-CPU computing:
- performance
- available memory
Parallel computing allows one to:
- solve problems that don't fit on a single CPU
- solve problems that can't be solved in a reasonable time
We can solve larger problems, solve them faster, and run more cases.

Forest Inventory with Ranger Bob

Terminology (1)
- serial code: a single thread of execution working on a single data item at any one time
- parallel code: more than one thing happening at a time. This could be:
  - a single thread of execution operating on multiple data items simultaneously
  - multiple threads of execution in a single executable
  - multiple executables all working on the same problem
  - any combination of the above
- task: the name we use for an instance of an executable. Each task has its own virtual address space and may have multiple threads.

Terminology (2)
- node: a discrete unit of a computer system that typically runs its own instance of the operating system
- core: a processing unit on a computer chip that is able to support a thread of execution; can refer either to a single core or to all of the cores on a particular chip
- cluster: a collection of machines or nodes that function in some way as a single resource
- grid: the software stack designed to handle the technical and social challenges of sharing resources across networking and institutional boundaries; grid also applies to the groups that have reached agreements to share their resources

Limits of Parallel Computing
Theoretical upper limits:
- Amdahl's Law
Practical limits:
- load balancing (waiting)
- conflicts (accesses to shared memory)
- communications
- I/O (file system access)

Theoretical Upper Limits to Performance
All parallel programs contain:
- parallel sections (we hope!)
- serial sections (unfortunately)
Serial sections limit the parallel effectiveness.
[Figure: run time for 1, 2, and 4 tasks, each bar divided into a serial portion and a parallel portion]
Amdahl's Law states this formally.

Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time:
  t_N = (f_p / N + f_s) * t_1
where
- f_s = serial fraction of code
- f_p = parallel fraction of code
- N = number of processors
- t_1 = time to run on one processor

Limit Cases of Amdahl's Law
Speedup formula:
  S = 1 / (f_s + f_p / N)
where
- f_s = serial fraction of code
- f_p = parallel fraction of code
- N = number of processors
Cases:
1. f_s = 0, f_p = 1: S = N
2. N -> infinity: S = 1 / f_s; if 10% of the code is sequential, you will never speed up by more than 10, no matter the number of processors.
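
To make the limit concrete, here is a minimal sketch in C (added here, not part of the original slides) that evaluates the speedup formula for a 10% serial fraction; the processor counts chosen are illustrative.

  #include <stdio.h>

  /* Amdahl speedup: S = 1 / (f_s + f_p / N), with f_p = 1 - f_s */
  static double amdahl_speedup(double serial_fraction, int nprocs)
  {
      double parallel_fraction = 1.0 - serial_fraction;
      return 1.0 / (serial_fraction + parallel_fraction / nprocs);
  }

  int main(void)
  {
      double f_s = 0.10;   /* 10% of the code is serial */
      int procs[] = { 1, 2, 4, 16, 64, 256, 1024 };
      int n = sizeof procs / sizeof procs[0];

      for (int i = 0; i < n; i++)
          printf("N = %4d  ->  speedup = %.2f\n", procs[i], amdahl_speedup(f_s, procs[i]));
      /* As N grows, the speedup approaches 1/f_s = 10 and never exceeds it. */
      return 0;
  }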

Illustration of Amdahl's Law
[Figure: speedup S (0 to 250) versus number of processors (0 to 250) for parallel fractions f_p = 1.000, 0.999, 0.990, and 0.900]

More Terminology
- synchronization: the temporal coordination of parallel tasks; it involves waiting until two or more tasks reach a specified point (a sync point) before continuing any of the tasks
- parallel overhead: the amount of time required to coordinate parallel tasks, as opposed to doing useful work, including time to start and terminate tasks, communication, and moving data
- granularity: a measure of the ratio of the amount of computation done in a parallel task to the amount of communication
  - fine-grained: very little computation per communication byte
  - coarse-grained: extensive computation per communication byte

Practical Limits: Amdahl's Law vs. Reality
Amdahl's Law shows a theoretical upper limit on speedup. In reality, the situation is even worse than predicted by Amdahl's Law, due to:
- load balancing (waiting)
- scheduling (shared processors or memory)
- communications
- I/O
[Figure: speedup S versus number of processors for f_p = 0.99, comparing the Amdahl's Law curve with the lower "Reality" curve]

Forest Inventory with Ranger Bob

Is it Really Worth it to Go Parallel?
Writing effective parallel applications is difficult!
- Load balance is important
- Communication can limit parallel efficiency
- Serial time can dominate
Is it worth your time to rewrite your application?
- Do the CPU requirements justify parallelization?
- Is your problem really "large"?
- Is there a library that does what you need (parallel FFT, linear system solving)?
- Will the code be used more than once?

Types of Parallel Computers (Flynn's Taxonomy)
Classification by instruction stream and data stream:

                           Single Instruction Stream    Multiple Instruction Stream
  Single Data Stream       SISD                         MISD
  Multiple Data Stream     SIMD                         MIMD

(SISD = Single Instruction, Single Data; MISD = Multiple Instruction, Single Data; SIMD = Single Instruction, Multiple Data; MIMD = Multiple Instruction, Multiple Data)

Types of Parallel Computers (Memory Model)
- Nearly all parallel machines these days are multiple instruction, multiple data (MIMD)
- A much more useful way to classify modern parallel computers is by their memory model:
  - shared memory
  - distributed memory

Shared and Distributed Memory Models
[Diagram: shared memory — processors (P) connected to a single Memory over a Bus; distributed memory — processor/memory (P/M) pairs connected by a Network]
Shared memory: single address space. All processors have access to a pool of shared memory; easy to build and program, good price-performance for small numbers of processors; predictable performance due to UMA (example: SGI Altix). Methods of memory access: bus, crossbar.
Distributed memory: each processor has its own local memory; message passing must be used to exchange data between processors. cc-NUMA enables a larger number of processors and a shared memory address space than SMPs; still easy to program, but harder and more expensive to build (example: clusters). Methods of memory access: various topological interconnects.

Ranger

Shared Memory vs. Distributed Memory
- Tools can be developed to make any system appear to look like a different kind of system:
  - distributed memory systems can be programmed as if they have shared memory, and vice versa
  - such tools do not produce the most efficient code, but might enable portability
- HOWEVER, the most natural way to program any machine is to use tools and languages that express the algorithm explicitly for the architecture.

Programming Parallel Computers
- Programming single-processor systems is (relatively) easy because they have a single thread of execution and a single address space.
- Programming shared memory systems can benefit from the single address space.
- Programming distributed memory systems is the most difficult, due to multiple address spaces and the need to access remote data.
- Both shared memory and distributed memory parallel computers can be programmed in a data-parallel, SIMD fashion; they can also perform independent operations on different data (MIMD) and implement task parallelism.

Data vs. Functional Parallelism
Partition by data (data parallelism):
- Each process does the same work on a unique piece of data
- First divide the data; each process then becomes responsible for whatever work is needed to process its data
- Data placement is an essential part of a data-parallel algorithm
- Usually more scalable than functional parallelism
- Can be programmed at a high level with OpenMP, or at a lower level (subroutine calls) using a message-passing library like MPI
Partition by task (functional parallelism):
- Each process performs a different "function" or executes a different code section
- First identify functions, then look at the data requirements
- Commonly programmed with message-passing libraries

Data Parallel Programming Example
One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split in two.

program:
  if CPU=a then
    low_limit=1
    upper_limit=50
  elseif CPU=b then
    low_limit=51
    upper_limit=100
  end if
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

CPU A program:
  low_limit=1
  upper_limit=50
  do I = low_limit, upper_limit
    work on A(I)
  end do
end program

CPU B program:
  low_limit=51
  upper_limit=100
  do I = low_limit, upper_limit
    work on A(I)
  end do
end program
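
The same index-range split can be written for a distributed-memory machine. The following is a minimal C/MPI sketch (added here, not from the slides) that uses the MPI rank to pick each process's portion of a 100-element array; the array size and the per-element "work" are made up for illustration.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      double a[100];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Divide the 100 iterations evenly among the processes
         (assumes 100 is divisible by the number of processes). */
      int chunk = 100 / size;
      int low   = rank * chunk;     /* first index owned by this process */
      int high  = low + chunk;      /* one past the last index it owns   */

      for (int i = low; i < high; i++)
          a[i] = 2.0 * i;           /* stands in for "work on A(I)" */

      printf("rank %d handled indices %d..%d\n", rank, low, high - 1);
      MPI_Finalize();
      return 0;
  }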

Forest Inventory with Ranger Bob

Single Program, Multiple Data (SPMD)
- SPMD is the dominant programming model for shared and distributed memory machines
- One source code is written
- The code can have conditional execution based on which processor is executing the copy
- All copies of the code are started simultaneously and communicate and synchronize with each other periodically

SPMD Programming Model
[Diagram: a single source.c is replicated as one copy running on each of Processor 0 through Processor 3]

Task Parallel Programming Example
One code will run on 2 CPUs. The program has 2 tasks (a and b) to be done by the 2 CPUs.

program.f:
  initialize
  ...
  if CPU=a then
    do task a
  elseif CPU=b then
    do task b
  end if
  ...
end program

CPU A program.f:
  initialize
  do task a
end program

CPU B program.f:
  initialize
  do task b
end program
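
For comparison, here is a minimal C/MPI sketch of the same idea (added here, not part of the original slides): every process runs the same executable, and the rank decides which task it performs. The two task functions are hypothetical placeholders.

  #include <stdio.h>
  #include <mpi.h>

  static void do_task_a(void) { printf("doing task a\n"); }  /* placeholder work */
  static void do_task_b(void) { printf("doing task b\n"); }  /* placeholder work */

  int main(int argc, char **argv)
  {
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Functional parallelism: rank 0 plays the role of "CPU a",
         every other rank plays the role of "CPU b". */
      if (rank == 0)
          do_task_a();
      else
          do_task_b();

      MPI_Finalize();
      return 0;
  }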

Forest Inventory with Ranger Bob

Shared Memory Programming: OpenMP
- Shared memory systems (SMPs and cc-NUMA systems) have a single address space:
  - applications can be developed in which loop iterations (with no dependencies) are executed by different processors
  - shared memory codes are mostly data-parallel, SIMD kinds of codes
- OpenMP is the new standard for shared memory programming (compiler directives)
- Vendors offer native compiler directives

Accessing Shared Variables
If multiple processors want to write to a shared variable at the same time, there could be conflicts:
1. Processes 1 and 2 both read X
2. Each computes X+1
3. Each writes X
The programmer, language, and/or architecture must provide ways of resolving conflicts.
[Figure: a shared variable X in memory, with X+1 computed independently in processor 1 and processor 2]
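
As an added illustration (not from the slides), the C/OpenMP sketch below shows this conflict as a race on a shared counter and one standard way to resolve it, an atomic update; a critical section or a reduction clause would work as well. It assumes the code is compiled with OpenMP enabled (e.g., gcc -fopenmp).

  #include <stdio.h>

  int main(void)
  {
      int x_racy = 0, x_safe = 0;

      #pragma omp parallel for
      for (int i = 0; i < 100000; i++)
          x_racy = x_racy + 1;      /* unsynchronized read-modify-write: result is unpredictable */

      #pragma omp parallel for
      for (int i = 0; i < 100000; i++)
      {
          #pragma omp atomic
          x_safe = x_safe + 1;      /* atomic update resolves the conflict */
      }

      printf("racy = %d, safe = %d (expected 100000)\n", x_racy, x_safe);
      return 0;
  }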

OpenMP Example #1: Parallel Loop

!$OMP PARALLEL DO
do i = 1, 128
   b(i) = a(i) + c(i)
end do
!$OMP END PARALLEL DO

The first directive specifies that the loop immediately following should be executed in parallel. The second directive specifies the end of the parallel section (optional). For codes that spend the majority of their time executing the content of simple loops, the PARALLEL DO directive can result in significant parallel performance.
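
The C equivalent (added here for reference, not in the original slides) uses a #pragma in place of the Fortran sentinel; compile with OpenMP enabled (e.g., gcc -fopenmp).

  #include <stdio.h>

  int main(void)
  {
      double a[128], b[128], c[128];

      for (int i = 0; i < 128; i++) { a[i] = i; c[i] = 2.0 * i; }

      /* Plays the role of !$OMP PARALLEL DO: the iterations of the
         loop immediately following are divided among the threads. */
      #pragma omp parallel for
      for (int i = 0; i < 128; i++)
          b[i] = a[i] + c[i];

      printf("b[127] = %g\n", b[127]);
      return 0;
  }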

OpenMP Example #2: Private Variables

!$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
do I = 1, N
   TEMP = A(I) / B(I)
   C(I) = TEMP + SQRT(TEMP)
end do
!$OMP END PARALLEL DO

In this loop, each processor needs its own private copy of the variable TEMP. If TEMP were shared, the result would be unpredictable, since multiple processors would be writing to the same memory location.
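
In C (again an added sketch, not from the slides), the same effect is obtained with a private clause, or simply by declaring the temporary inside the loop body; compile with OpenMP enabled and link the math library.

  #include <math.h>

  void scale(const double *a, const double *b, double *c, int n)
  {
      double temp;

      /* temp is listed as private so each thread gets its own copy;
         the loop index i is private automatically. */
      #pragma omp parallel for shared(a, b, c, n) private(temp)
      for (int i = 0; i < n; i++)
      {
          temp = a[i] / b[i];
          c[i] = temp + sqrt(temp);
      }
  }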

Distributed Memory Programming: MPI
- Distributed memory systems have separate address spaces for each processor
- Local memory is accessed faster than remote memory
- Data must be manually decomposed
- MPI is the standard for distributed memory programming (a library of subprogram calls)

Data Decomposition
- For distributed memory systems, the whole grid or sum of particles is decomposed among the individual nodes
- Each node works on its section of the problem
- Nodes can exchange information
[Figure: the same 2-D problem grid as before, divided into four areas, each worked on by one of CPUs #1 through #4]

MPI Example

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    char message[20];
    int i, rank, size, type = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 sends the 13-byte greeting to every other rank */
        strcpy(message, "Hello, world");
        for (i = 1; i < size; i++)
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
    } else {
        /* every other rank receives the greeting from rank 0 */
        MPI_Recv(message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);
    }

    printf("Message from process = %d : %.13s\n", rank, message);
    MPI_Finalize();
    return 0;
}

MPI: Sends and Receives
- MPI programs must send and receive data between the processors (communication)
- The most basic calls in MPI (besides the initialization, rank/size, and finalization calls) are:
  - MPI_Send
  - MPI_Recv
- These calls are blocking: the source processor issuing the send/receive cannot move to the next statement until the target processor issues the matching receive/send.
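
For reference (added here, not in the original slides), the C prototypes of these two calls are as follows.

  /* Blocking send: delivers 'count' elements of 'datatype' from 'buf'
     to the process with rank 'dest' in communicator 'comm'. */
  int MPI_Send(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm);

  /* Blocking receive: waits for a matching message from rank 'source'
     (same tag and communicator) and stores it in 'buf'. */
  int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
               int source, int tag, MPI_Comm comm, MPI_Status *status);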

Programming Multi-tiered Systems
- Systems with multiple shared memory nodes are becoming common for reasons of economics and engineering
- Memory is shared at the node level, distributed above that:
  - applications can be written using OpenMP + MPI
  - developing applications with only MPI is usually possible
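
A minimal hybrid sketch (added for illustration, not part of the slides): MPI distributes blocks of data across processes, and OpenMP threads share the work within each process. MPI_Init_thread with MPI_THREAD_FUNNELED is the usual way to request thread support when only the main thread makes MPI calls; the block size is an arbitrary example value.

  #include <stdio.h>
  #include <mpi.h>

  #define N_LOCAL 1000   /* elements owned by each MPI process (illustrative) */

  int main(int argc, char **argv)
  {
      int rank, provided;
      double local[N_LOCAL], local_sum = 0.0, global_sum = 0.0;

      /* Request thread support: only the main thread will call MPI. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* OpenMP threads share the work on this process's block of data. */
      #pragma omp parallel for reduction(+:local_sum)
      for (int i = 0; i < N_LOCAL; i++) {
          local[i] = rank + 0.001 * i;
          local_sum += local[i];
      }

      /* MPI combines the per-process results across the distributed memory. */
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("global sum = %f\n", global_sum);

      MPI_Finalize();
      return 0;
  }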