AMSC/CMSC 662 Computer Organization and Programming for Scientific Computing, Fall 2011. Introduction to Parallel Programming. Dianne P. O'Leary, © 2011.

Introduction to Parallel Programming

Why parallel computing: Get results faster than we can with a single processor. Take advantage of hardware that is in our laptop, or lab, or cloud. You've already had an informal (but important) nontechnical introduction. Now we have to program parallel computers using what you have learned.

Parallel Computing Models

How to communicate: two basic approaches. Shared-memory computing: a set of processes operate independently on their own data and on shared data, communicating through the shared memory. Overhead of sharing: the overhead of moving data to and from memory. Distributed-memory computing: a set of processes operate independently on their own data, communicating by passing messages to other processes. Overhead of sharing: the overhead of doing input/output (maybe 10,000 times the cost of a memory access) plus the delay of passing the information over a bus or wireless network.

How to organize the work: two basic approaches. Master/slave processes: One process spawns and controls all the others, handing out work and collecting the results. This is vulnerable to the health of the master process: if it fails, the job fails. The master process also becomes a bottleneck, and often the spawned processes have to wait for its attention. Even in computing, slavery is not a good idea. But often there is a parent process that spawns the child processes, and this can be a convenient way to organize the program. Communicating processes: Each process operates somewhat independently, grabbing undone work and producing and communicating the results. This is more robust to hardware failures, and load balancing is done dynamically.

How to do the work: two basic approaches. SIMD: Single Instruction, Multiple Data. Each process has the same set of instructions (the same computer code), and it operates on a set of data determined by the ID of the process. This is much easier than writing a different program for every process! The processors run in lockstep synchronization; this leads to inefficiency when one process has to wait while others do something that it doesn't need to do. They may be able to share a single fetch/decode hardware component, so the hardware tends to be cheaper and faster (smaller).

MIMD: Multiple Instruction, Multiple Data. Each process can have a different set of instructions, and it operates on a set of data determined by the ID of the process. Writing the program can be much more complicated. The processors are in much looser synchronization: each can work freely until it needs to get or give information to another process. This tends to be much more efficient, with less waiting. But the processors take more space, so communication is slower.

How to write parallel programs: there are a few competing models. MPI: Message Passing Interface. This system provides functions in C, Fortran, etc. to allow operations such as spawning processes, passing a message to a process, passing a message to all processes, and gathering information from all processes (see the sketch below). It does not support shared memory. Pthreads: POSIX Threads. This system provides low-level support so that shared-memory parallel computing can be programmed in C.
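For concreteness, here is a minimal sketch in C of the last three kinds of operations, using standard MPI calls (MPI_Send/MPI_Recv, MPI_Bcast, MPI_Reduce). The values passed around are arbitrary placeholders, and in practice the processes are usually launched all at once by mpirun or mpiexec rather than spawned one by one from inside the program.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);                  /* processes were started by mpirun/mpiexec */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */

        /* pass a message to a process: process 0 sends an integer to process 1 */
        if (rank == 0 && nprocs > 1) {
            int token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int token;
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", token);
        }

        /* pass a message to all processes: broadcast a value from process 0 */
        double h = 0.01;
        MPI_Bcast(&h, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* gather information from all processes: sum one value per process onto process 0 */
        double mypart = rank * h, total;
        MPI_Reduce(&mypart, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum over %d processes = %f\n", nprocs, total);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run with, say, mpirun -np 4, each copy of the program takes its role from its rank.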

OpenMP: Open MultiProcessing. This system provides somewhat higher-level support for shared-memory parallel computing in C. It can also be used for message passing. CUDA: Compute Unified Device Architecture. This functions as a virtual machine model for a set of graphics processing units (GPUs). Hardware that satisfies these assumptions can be programmed using an extension to C or other languages. Parallel computing can also be done from high-level languages such as Matlab; see http://www.mathworks.com/help/toolbox/distcomp/distcomp_product_page.html. The university does not have enough licenses to support us playing with this as a class.
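Since we will program in OpenMP, a minimal sketch of the shared-memory style it supports may help. Every thread sees the same arrays x and y and the same variable dot (the names and sizes here are chosen only for illustration), and the pragmas split the loop iterations among the threads.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N], y[N];   /* shared by all threads */
        double dot = 0.0;

        /* the iterations of each loop are divided among the threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }

        /* reduction(+:dot) combines the threads' partial sums safely */
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < N; i++)
            dot += x[i] * y[i];

        printf("dot = %f using up to %d threads\n", dot, omp_get_max_threads());
        return 0;
    }

Compiled with a flag such as gcc -fopenmp, this runs with as many threads as the runtime provides; without the flag the pragmas are ignored and the same code runs serially.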

The long-established custom of lying about parallel computing: "My new algorithm..." "My algorithm gets 90% of peak performance..." (running an unacceptable algorithm). "On n processors, my speed-up over a single processor is n + 1..." (running an unacceptable algorithm). "My algorithm has 90% efficiency..." (on a problem of size 10,000 using 8 processors). "Of course my program will run after I leave..."

Fact 1: Most new parallel algorithms are old. The oldest parallel algorithms I know of were developed for rooms of women (called "computers") computing tables for ballistics and other applications. See http://en.hnf.de.

In the 1970s and 1980s there was a very large number of experimental and commercial parallel computers (SIMD: ILLIAC IV, DAP, Goodyear MPP, Connection Machine, CDC 8600, etc.; MIMD: Denelcor HEP, IBM RP3, etc.; MIMD message passing: Caltech Cosmic Cube, nCUBE, Transputer (multiple processors on a single chip), etc.; vector pipelining: Cray-1, Alliant, etc.) and an even larger number of researchers publishing parallel algorithms. If you think of it, they quite possibly thought of it.

Why almost all of these companies failed. Hardware issues: the mean time to failure for the hardware was too short. Software issues: the computers were very, very difficult to program. Cost: the computers were very expensive and consumed a tremendous amount of power.

Fact 2: Efficiency can be deceiving. All you need to do is perform a lot of operations that don't interfere with one another. A program like this can get a large fraction of peak efficiency, say, perform 90% of the maximum GFLOPS. It is much harder to get high-efficiency programs to do useful work. So don't be impressed by graphs showing large GFLOPS. Example: If I want to solve a linear system Ax = b for a typically used model problem, where A is the 5-point finite difference approximation to -u_xx - u_yy, I can easily get 90% of peak performance on SIMD machines using the Jacobi algorithm. (To be continued.)
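To see why Jacobi is so tempting for this kind of benchmark, here is a sketch of one Jacobi sweep for that model problem in C (the row-major grid layout with a layer of boundary padding, the array names, and the OpenMP pragma are my own illustrative choices): every new value depends only on old values, so all the updates are independent and proceed in parallel with no interference.

    /* One Jacobi sweep for the 5-point discretization of -u_xx - u_yy = f on an
     * n-by-n interior grid with mesh width h.  The arrays hold (n+2)*(n+2) values
     * in row-major order, including a layer of boundary values that is not updated.
     * Every new value uses only the *old* array, so all updates are independent.   */
    void jacobi_sweep(int n, double h, const double *uold, const double *f, double *unew)
    {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                int k = i * (n + 2) + j;
                unew[k] = 0.25 * (uold[k - 1] + uold[k + 1]
                                + uold[k - (n + 2)] + uold[k + (n + 2)]
                                + h * h * f[k]);
            }
    }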

Fact 3: Speed-ups can be deceiving. Speed-up is (time on a single processor) / (time on the parallel computer). Example (continued): The speed-up of Jacobi's method will also be terrific: almost a factor of n if we use n processors on a problem with n unknowns. But we are not computing the right performance measures. Jacobi's method is not the best sequential algorithm, so it is much more reasonable to define speed-up to be (time for best algorithm on a single processor) / (your algorithm's time on the parallel computer). Similarly, efficiency should really be measured as (time for best algorithm on a single processor) / (number of processors * your algorithm's time on the parallel computer).
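A worked example with made-up numbers: suppose your parallel algorithm takes 5 seconds on 50 processors and 400 seconds when run on one processor, while the best sequential algorithm takes 100 seconds. The naive speed-up is 400/5 = 80, which sounds superb; the honest speed-up is 100/5 = 20, and the honest efficiency is 100/(50 * 5) = 0.4, that is, 40%.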

Fact 4: Scalability should not be neglected. Efficiency and speed-up need to be measured over a whole range of problem sizes and numbers of processors. There are great debates about whether we should plot efficiency for various problem sizes with a fixed number of processors, or whether we should increase the number of processors as we increase the problem size.

Fact 5: Portability should not be neglected. Don't invest a year of your effort into designing for hardware that will disappear in another 6 months. Use programming constructs from MPI, OpenMP, etc., even if they slow your program down, because then your code can be used on other computers, even after your computer disappears.

Our study of parallel programming: We will program in OpenMP. We will talk about GPU programming and do an in-class exercise.

An excellent reference: Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann, 2011. The Pacheco book's website: http://textbooks.elsevier.com/web/product_details.aspx?isbn=9780123742605. We will cover Chapter 5. You can find notes on other chapters, and all of the code from the book, on the book's website.