Introduction to Parallel Programming

January 14, 2015
www.cac.cornell.edu

What is Parallel Programming?
- Theoretically a very simple concept: use more than one processor to complete a task.
- Operationally much more difficult to achieve: the tasks must be independent, and the order of execution can't matter.
- How to define the tasks:
  - Each processor works on its section of the problem (functional parallelism).
  - Each processor works on its section of the data (data parallelism).
- How and when can the processors exchange information?

Why Do Parallel Programming?
- Solve problems faster: 1 day is better than 30 days.
- Solve bigger problems: model stress on an entire machine, not just one nut.
- Solve a problem on more datasets: find all maximum values for one month, not one day.
- Solve problems that are too large to run on a single CPU.
- Solve problems in real time.

Is it worth it to go Parallel?
- Writing effective parallel applications is difficult!
  - Load balancing is critical.
  - Communication can limit parallel efficiency.
  - Serial time can dominate.
- Is it worth your time to rewrite your application?
  - Do the CPU requirements justify parallelization?
  - Is your problem really large?
  - Is there a library that already does what you need (parallel FFT, linear system solvers)?
  - Will the code be used more than once?

Terminology
- node: a discrete unit of a computer system that typically runs its own instance of the operating system. Stampede has 6,400 nodes.
- processor: a chip that shares a common memory and local disk. Stampede has two Sandy Bridge processors per node.
- core: a processing unit on a computer chip able to support a thread of execution. Stampede has 8 cores per processor, or 16 cores per node.
- coprocessor: a lightweight processor. Stampede has one Phi coprocessor per node, with 61 cores per coprocessor.
- cluster: a collection of nodes that function as a single resource.

[Figure: schematic of a node, showing its processors, the cores within each processor, and the attached coprocessor.]

Functional Parallelism
Definition: each process performs a different "function" or executes a different, independent section of code.
Examples:
- 2 brothers do yard work (1 edges and 1 mows)
- 8 farmers build a barn
[Figure: tasks A through E, each assigned to a different process.]
Commonly programmed with message-passing libraries, as in the sketch below.
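
A minimal sketch of functional parallelism with MPI (not part of the original slides): each rank executes a different function. The function names and the work they do are hypothetical placeholders standing in for the independent code sections.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical independent "functions" -- one per worker. */
    static void edge_lawn(void) { printf("edging the lawn\n"); }
    static void mow_lawn(void)  { printf("mowing the lawn\n"); }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process executes a different code section. */
        if (rank == 0)
            edge_lawn();
        else
            mow_lawn();

        MPI_Finalize();
        return 0;
    }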

Data Parallelism
Definition: each process does the same work on its own unique, independent piece of the data.
Examples:
- 2 brothers mow the lawn
- 8 farmers paint a barn
[Figure: the same task B applied to different pieces of the data by different processes.]
Usually more scalable than functional parallelism. Can be programmed at a high level with OpenMP, at a lower level using a message-passing library like MPI, or with hybrid programming. A data decomposition sketch follows.
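
A minimal sketch of a block data decomposition (not part of the original slides): each worker applies the same code to its own contiguous range of array indices. The names, array contents, and worker count are illustrative, and the blocks are processed serially here for clarity; in practice each block would go to a different thread or MPI rank.

    #include <stdio.h>

    /* Each worker processes its own contiguous block of the array. */
    static void process_block(const double *data, int lo, int hi, int worker)
    {
        double sum = 0.0;
        for (int i = lo; i < hi; i++)
            sum += data[i];
        printf("worker %d handled [%d,%d), partial sum %g\n", worker, lo, hi, sum);
    }

    int main(void)
    {
        enum { N = 1000, NWORKERS = 4 };
        double data[N];
        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        /* Same code, different data: worker w gets indices [w*N/NWORKERS, (w+1)*N/NWORKERS). */
        for (int w = 0; w < NWORKERS; w++)
            process_block(data, w * N / NWORKERS, (w + 1) * N / NWORKERS, w);
        return 0;
    }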

Embarrassing Parallelism
A special case of data parallelism. Definition: each process performs the same function but does not communicate with the others, only with a master process. Such codes are often called "embarrassingly parallel."
Examples:
- Independent Monte Carlo simulations
- ATM transactions
Stampede has a special wrapper for submitting this type of job; see https://www.xsede.org/news/-/news/item/5778
A sketch of independent Monte Carlo tasks follows.
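
A minimal illustration of embarrassingly parallel work (not part of the original slides): each task estimates pi by Monte Carlo using its own random seed and never talks to the other tasks; only the finished results would be collected by a master. The seeds, sample counts, and task loop are assumptions, and the tasks run serially here; in practice each would be a separate process or job.

    #include <stdio.h>
    #include <stdlib.h>

    /* One independent Monte Carlo task: estimate pi from 'samples' random points. */
    static double estimate_pi(unsigned seed, long samples)
    {
        long hits = 0;
        srand(seed);
        for (long i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
        return 4.0 * hits / samples;
    }

    int main(void)
    {
        /* Each task uses a different seed and requires no communication. */
        for (int task = 0; task < 4; task++)
            printf("task %d: pi ~ %f\n", task, estimate_pi(1000u + task, 1000000));
        return 0;
    }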

Flynn's Taxonomy
A classification of computer architectures based on the number of concurrent instruction streams and data streams:

                    Single Instruction    Multiple Instruction    Single Program          Multiple Program
    Single Data     SISD (serial)         MISD (custom)           --                      --
    Multiple Data   SIMD (vector, GPU)    MIMD (superscalar)      SPMD (data parallel)    MPMD (task parallel)

Theoretical Upper Limits to Performance
All parallel programs contain:
- parallel sections (we hope!)
- serial sections (unfortunately)
Serial sections limit the parallel effectiveness.
[Figure: run time for 1, 2, and 4 tasks, split into a serial portion that stays fixed and a parallel portion that shrinks as tasks are added.]
Amdahl's Law states this formally.

Amdahl's Law
Amdahl's Law places a limit on the speedup gained by using multiple processors.

Effect of multiple processors on run time:

    t_N = (f_p / N + f_s) * t_1

where
    f_s = serial fraction of the code
    f_p = parallel fraction of the code (f_s + f_p = 1)
    N   = number of processors
    t_1 = time to run on one processor

Speedup formula:

    S = 1 / (f_s + f_p / N)

If f_s = 0 and f_p = 1, then S = N. As N goes to infinity, S approaches 1 / f_s: if 10% of the code is serial, you will never speed up by more than a factor of 10, no matter how many processors you use. A small numerical check of this limit follows.
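
A small worked check of the speedup formula (not part of the original slides): for an assumed serial fraction f_s = 0.1, the program below evaluates S = 1 / (f_s + f_p / N) for a few processor counts and shows that S never reaches 10.

    #include <stdio.h>

    /* Amdahl's Law: predicted speedup on N processors for serial fraction fs. */
    static double amdahl_speedup(double fs, int n)
    {
        double fp = 1.0 - fs;              /* parallel fraction */
        return 1.0 / (fs + fp / n);
    }

    int main(void)
    {
        const double fs = 0.1;             /* 10% of the code is serial */
        const int counts[] = { 1, 2, 4, 16, 64, 1024 };

        for (int i = 0; i < 6; i++)
            printf("N = %4d  ->  S = %5.2f\n", counts[i], amdahl_speedup(fs, counts[i]));
        /* S approaches, but never exceeds, 1/fs = 10 as N grows. */
        return 0;
    }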

Practical Limits: Amdahl's Law vs. Reality
Amdahl's Law gives a theoretical upper limit on speedup. In reality, the situation is even worse than Amdahl's Law predicts, due to:
- Load balancing (waiting)
- Scheduling (shared processors or memory)
- Communications
- I/O
[Figure: speedup versus number of processors (0 to 250) for f_p = 0.99, comparing the Amdahl's Law curve with reality; the observed curve falls well below the theoretical one.]

High Performance Computing Architectures

HPC Systems Continue to Evolve Over Time
[Figure: timeline from 1970 to 2010 showing the shift from centralized big iron (mainframes, minicomputers, specialized parallel computers, MPPs) to decentralized collections (RISC workstations, NOWs, PCs, clusters, grids + clusters, hybrid clusters).]

Cluster Computing Environment
A typical cluster environment includes:
- Login node(s) and access control
- File server(s) and scratch space
- Batch schedulers
- Compute nodes
[Figure: users connect through the login nodes; the batch scheduler dispatches jobs to the compute nodes, which share the file servers and scratch space.]

Types of Parallel Computers (Memory Model)
It is useful to classify modern parallel computers by their memory model:
- shared memory architecture: memory is addressable by all cores and/or processors
- distributed memory architecture: memory is split into separate pools, where each pool is addressable only by the cores and/or processors on the same node
- cluster: a mixture of shared and distributed memory; memory is shared among the cores within a single node and distributed between nodes
Most parallel machines today are multiple instruction, multiple data (MIMD).

Shared and Distributed Memory Models
Shared memory: a single address space. All processors have access to a pool of shared memory. Such systems are easy to build and program and give good price-performance for small numbers of processors, with predictable performance due to uniform memory access (UMA). Methods of memory access: bus, crossbar.
Distributed memory: each processor has its own local memory, and message passing must be used to exchange data between processors. cc-NUMA designs enable a larger number of processors and a shared memory address space than SMPs; they are still easy to program, but harder and more expensive to build. (Example: clusters.) Methods of memory access: various topological interconnects.

Programming Parallel Computers
- Programming single-processor systems is (relatively) easy because they have a single thread of execution.
- Programming shared memory systems can likewise benefit from the single address space.
- Programming distributed memory systems is more difficult due to multiple address spaces and the need to access remote data.
- Hybrid programming for combined distributed and shared memory is even more difficult, but it gives the programmer much greater flexibility.

Single Program, Multiple Data (SPMD)
- One source code is written.
- The code can contain conditional execution based on which processor is executing the copy.
- All copies of the code are started simultaneously and communicate and synchronize with each other periodically.

SPMD Programming Model
[Figure: a single source file, source.c, is compiled into one executable, a.out; identical copies of a.out run simultaneously on Processor 0, Processor 1, Processor 2, and Processor 3.]
A minimal skeleton of this model is sketched below.
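
A minimal SPMD skeleton in MPI (not part of the original slides): every rank runs the same executable, and a conditional on the rank decides what each copy does. The work inside the branches is a placeholder.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are running? */

        if (rank == 0)
            printf("rank 0 of %d: coordinating\n", size);   /* conditional execution */
        else
            printf("rank %d of %d: computing\n", rank, size);

        MPI_Barrier(MPI_COMM_WORLD);            /* periodic synchronization point */
        MPI_Finalize();
        return 0;
    }

The code would typically be compiled once (e.g., mpicc source.c) and launched as several identical copies with a job launcher such as mpirun -np 4 ./a.out.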

Shared Memory Programming: OpenMP
- Shared memory systems have a single address space.
- Applications can be developed in which loop iterations (with no dependencies) are executed by different processors.
- The application runs as a single process with multiple parallel threads.
- OpenMP is the standard for shared memory programming (compiler directives).
- Vendors also offer native compiler directives.
A loop-level example is sketched below.
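
A minimal OpenMP sketch (not part of the original slides): independent loop iterations are shared among threads with a single compiler directive. The array and its size are illustrative.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];

        /* Iterations have no dependencies, so the threads can split them up. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("ran with up to %d threads, a[N-1] = %g\n",
               omp_get_max_threads(), a[N - 1]);
        return 0;
    }

Build with an OpenMP-enabled compiler, e.g. gcc -fopenmp example.c.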

Distributed Memory Programming: Message Passing Interface (MPI)
- Distributed memory systems have a separate pool of memory for each processor.
- The application runs as multiple processes with separate address spaces.
- Processes communicate data to each other using MPI.
- Data must be manually decomposed.
- MPI is the standard for distributed memory programming (a library of subprogram calls).
A minimal message-passing example is sketched below.
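
A minimal message-passing sketch (not part of the original slides): rank 0 sends one value to rank 1 with MPI_Send / MPI_Recv. The message contents and tag are arbitrary placeholders.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 3.14;   /* data lives in rank 0's address space... */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ...and must be explicitly communicated to rank 1. */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.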

Hybrid Programming
Systems with multiple shared-memory nodes: memory is shared at the node level and distributed above that.
- Applications can be written to run on one node using OpenMP.
- Applications can be written using MPI alone.
- Applications can be written using both OpenMP and MPI, as sketched below.
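
A minimal hybrid skeleton (not part of the original slides): MPI distributes the work across processes (typically one per node), while OpenMP threads share the work within each process. The loop body and problem size are placeholders.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Threads share memory within this MPI process. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0;

        /* Message passing combines the per-process results across distributed memory. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g\n", global_sum);

        MPI_Finalize();
        return 0;
    }

A common configuration is one MPI process per node, with OMP_NUM_THREADS set to the number of cores on the node.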

References
- Virtual Workshop module: Parallel Programming Concepts and High-Performance Computing
- HPC terms glossary
- Jim Demmel, Applications of Parallel Computers. This set of lectures is an online rendition of Applications of Parallel Computers, taught at U.C. Berkeley in Spring 2012. The online course is sponsored by the Extreme Science and Engineering Discovery Environment (XSEDE) and is available only through the XSEDE User Portal.
- Ian Foster, Designing and Building Parallel Programs, Addison-Wesley, 1995. http://www-unix.mcs.anl.gov/dbpp/ This book is a fine introduction to parallel programming and is available online in its entirety. Some of the languages covered in later chapters, such as Compositional C++ and Fortran M, are rarer these days, but the concepts have not changed much since the book was written.