CS 475: Parallel Programming Introduction


Wim Bohm, Sanjay Rajopadhye
Colorado State University, Fall 2014

Course Organization
- Let's take a tour of the course website.
- Main pages:
  - Home, the front page.
  - Syllabus: instructor and GTA, the rules of the game, grading, tests.
  - Progress, the heart of the course: the time table.
  - Assignments: details of Labs, Discussions, and PAs.
  - Checkin: PAs are checked to make sure they make and compile.

Why Parallel Programming
- Need for speed: many applications require orders of magnitude more compute power than we have right now.
- For a long time, speed came from technology improvements, mainly increased clock speed and chip density.
- Those technology improvements have slowed down, and the only way to get more speed is to exploit parallelism.
- All new computers are now parallel computers.

Technology: Moore's Law
- Empirical observation (1965) by Gordon Moore:
  - Chip density, the number of transistors that can be inexpensively placed on an integrated circuit, increases exponentially, doubling approximately every two years.
  - en.wikipedia.org/wiki/moore's_law
- This has held true until now and is expected to hold until at least 2015 (news.cnet.com/new-life-for-Moores-Law/2009-1006_3-5672485.html).

[Figure: Moore's Law transistor-count plot; source: Wikipedia]

Corollary of exponential growth
- When two quantities grow exponentially, but at different rates, their ratio also grows exponentially: if y1 = a^x and y2 = b^x with a > b > 1, then y1 / y2 = (a/b)^x = r^x with r > 1.
- 1.1^n is O(2^n), but 2^n grows a lot faster than 1.1^n.
- Consequence for computer architecture: the growth rate for, e.g., memory is not as high as for processors, so memory gets slower and slower (in terms of clock cycles) compared to processors. This gives rise to the so-called gaps or walls.

Gaps or walls of Moore's Law
- Memory gap/wall: memory bandwidth and latency improve much more slowly than processor speeds (since the mid 80s this has been addressed by ever increasing on-chip caches).
- (Brick) Power wall: higher frequency means more power, which means more heat. Microprocessor chips are getting exceedingly hot. Consequence: we can no longer increase the clock frequency of our processors.

[Figure: CPU clock-speed and performance trends flattening; source: "The Free Lunch Is Over", Herb Sutter, Dr. Dobb's, 2005]

Enter Parallel Computing
- The goal is to deliver performance. The only solution to the power wall is to put multiple processor cores on a chip, because we can still increase the chip density.
- Nowadays there are no more single-core processors. The processor is the new transistor.
- 2005 prediction: the number of cores will continue to grow at an exponential rate (at least until 2015). This has not happened as much as we thought it would; we are still in the ~16 cores per microprocessor range.

Implicit Parallel Programming
- Let the compiler / runtime system detect parallelism and do task and data allocation and scheduling.
- This has been a holy grail of compiler and parallel computing research for a long time and, after more than 40 years, still requires research on automatic parallelization, languages, OS support, fault tolerance, etc.
- We (still) don't have a general-purpose, high-performance implicit parallel programming language that compiles to a general-purpose parallel machine.

Explicit Parallel Programming
- Let the programmer express parallelism, task and data partitioning, allocation, synchronization, and scheduling, using programming languages extended with explicit parallel programming constructs.
- This makes the programming task harder, but also a lot of fun, and a large source of income.
- Very large, actually (VLSI = very large source of income :-)).

Explicit parallel programming
- Explicit parallel programming models:
  - Multi-threading: OpenMP
  - Data-parallel (multi-threaded) programming: CUDA
  - Message passing: MPI
- Explicit parallelism complicates programming:
  - Creation, allocation, and scheduling of processes
  - Data partitioning
  - Synchronization (semaphores, locks, messages)

Computer Architectures
- Sequential: John von Neumann. Stored program, single instruction stream, single data stream.
- Parallel (Flynn's taxonomy, Michael Flynn):
  - SIMD: Single instruction, multiple data (data parallel)
  - MIMD: Multiple instruction, multiple data

SIMD
- One control unit
- Many processing elements (PEs)
  - All executing the same instruction
  - Local memories
- Conditional execution:
  - where ocean: perform ocean step
  - where land: perform land step
  - Gives rise to idle processors (optimized in GPUs)
- PE interconnection: usually a 2D mesh
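As a rough illustration (not from the slides) of how the "where" construct plays out: conceptually every lane visits both branches and a mask selects which lanes do useful work, so lanes that fail the test sit idle. The functions ocean_step and land_step below are made-up placeholders.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-cell update rules; placeholders for illustration only. */
static double ocean_step(double x) { return x * 0.5; }
static double land_step(double x)  { return x + 1.0; }

/* SIMD-style "where" execution: all PEs step through BOTH branches in
 * lockstep; the mask decides which lanes actually update, and the other
 * lanes are idle for that branch. */
static void simd_where(const int *is_ocean, double *cell, size_t n)
{
    for (size_t i = 0; i < n; i++)       /* where ocean: perform ocean step */
        if (is_ocean[i])
            cell[i] = ocean_step(cell[i]);
    for (size_t i = 0; i < n; i++)       /* where land: perform land step */
        if (!is_ocean[i])
            cell[i] = land_step(cell[i]);
}

int main(void)
{
    int    mask[6] = {1, 0, 1, 1, 0, 0}; /* 1 = ocean cell, 0 = land cell */
    double grid[6] = {4, 4, 4, 4, 4, 4};
    simd_where(mask, grid, 6);
    for (int i = 0; i < 6; i++)
        printf("%.1f ", grid[i]);        /* prints 2.0 5.0 2.0 2.0 5.0 5.0 */
    printf("\n");
    return 0;
}
```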

SIMD machines
- Bit level:
  - ILLIAC IV (very early research machine, Illinois)
  - CM2 (Thinking Machines)
- Now GPUs: a Cartesian grid of SMs (Streaming Multiprocessors)
  - Each SM has a collection of processing elements
  - Each SM has a shared memory (shared among the PEs) and a register file
  - One global memory

MIMD
- Multiple processors, each executing its own code
- Distributed memory MIMD: complete PE + memory pairs connected to a network
  - Cosmic Cube, N-Cube
  - Extreme: network of workstations (our labs)
  - Data centers: racks of processors with high-speed bus interconnects
- Programming model: message passing (see the MPI sketch below)
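As a minimal taste of the message-passing model (not part of the original slides), here is a two-process MPI program in C where rank 1 sends a value to rank 0. The tag and the payload value are arbitrary choices for the example.

```c
/* Minimal message-passing sketch in C with MPI (illustrative only).
 * Compile with: mpicc hello_msg.c -o hello_msg
 * Run with:     mpirun -np 2 ./hello_msg
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;                       /* arbitrary payload */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* Processor 1 explicitly sends its data to processor 0. */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* Processor 0 explicitly receives it; there is no shared memory. */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1\n", value);
    }

    MPI_Finalize();
    return 0;
}
```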

Shared memory MIMD
- Shared memory: CPUs or cores, a bus, and shared memory
  - All our processors are now multi-core
  - Programming model: OpenMP (see the sketch below)
- NUMA: Non-Uniform Memory Access
  - Memory hierarchy (registers, cache, local, global)
  - Potential for better performance, but problem: memory (cache) coherence (resolved by the architecture)
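A minimal shared-memory sketch (not from the slides): a C loop parallelized with OpenMP and timed with omp_get_wtime, so the measured time can later be turned into a speedup. The array size is an arbitrary choice for the example.

```c
/* Shared-memory sketch in C with OpenMP (illustrative only).
 * Compile with: gcc -fopenmp sum_omp.c -o sum_omp
 * Run with:     OMP_NUM_THREADS=4 ./sum_omp
 */
#include <omp.h>
#include <stdio.h>

#define N 10000000   /* arbitrary problem size for the example */

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (long i = 0; i < N; i++)
        a[i] = 1.0;                 /* all threads share this array */

    double t0 = omp_get_wtime();
    /* The pragma splits the iterations over the available threads;
     * the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];
    double t1 = omp_get_wtime();

    printf("sum = %.0f, parallel time T_p = %.4f s on up to %d threads\n",
           sum, t1 - t0, omp_get_max_threads());
    return 0;
}
```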

Speedup
- Why do we write parallel programs again? To get speedup: go faster than sequential.
- What is speedup?
  - T_1 = sequential time to execute a program (sometimes called T or S)
  - T_p = time to execute the same program with p processors (cores, PEs)
  - S_p = T_1 / T_p, the speedup for p processors

Ideal Speedup, linear speedup
- Ideal speedup: p-fold speedup, S_p = p. Ideal is not always possible. WHY?
  - Certain parts of the computation are inherently sequential.
  - Tasks are data dependent, so not all processors are always busy, and they need to synchronize.
  - Remote data needs communication: memory wall PLUS communication wall.
- Linear speedup: S_p = f * p, with f usually less than 1.

Efficiency
- Speedup is usually neither ideal nor linear.
- We express this in terms of efficiency E_p: E_p = S_p / p.
  - E_p measures the average utilization of the p processors. Range?
- What does E_p = 1 signify?
- What does E_p = f (0 < f < 1) signify?
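A quick worked example with made-up timings (not from the slides), just to connect the two definitions:

```latex
% Hypothetical measurements: T_1 = 10\,\mathrm{s} sequential, T_8 = 2.5\,\mathrm{s} on 8 cores.
S_8 = \frac{T_1}{T_8} = \frac{10}{2.5} = 4, \qquad
E_8 = \frac{S_8}{8} = \frac{4}{8} = 0.5
% i.e. on average each of the 8 cores does useful work only half the time.
```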

More realistic speedups
- T_1 = 1, T_p = σ + (ο + π)/p
- S_p = 1 / T_p = 1 / (σ + (ο + π)/p)
  - σ is the sequential fraction of the program
  - π is the parallel fraction of the program, with σ + π = 1
  - ο is the parallel overhead (it does not occur in sequential execution); we assume here that ο can be parallelized
- Draw speedup curves for p = 1, 2, 4, 8, 16, 32 with σ = 1/4, 1/2 and ο = 1/4, 1/2.
- When p goes to ∞, what does S_p go to?
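The curves on the next slide can be reproduced with a few lines of C; this is a small sketch of the model above, not course-provided code.

```c
/* Sketch: evaluate S_p = 1 / (sigma + (o + pi)/p) for the four
 * (sigma, o) combinations on the next slide; note pi = 1 - sigma. */
#include <stdio.h>

static double speedup(double sigma, double o, int p)
{
    if (p == 1) return 1.0;          /* no parallel overhead in sequential execution */
    double pi = 1.0 - sigma;
    return 1.0 / (sigma + (o + pi) / p);
}

int main(void)
{
    const double sigma[] = {0.25, 0.25, 0.5, 0.5};
    const double ovh[]   = {0.25, 0.5,  0.25, 0.5};
    const int    procs[] = {1, 2, 4, 8, 16, 32};

    printf("  p");
    for (int c = 0; c < 4; c++)
        printf("  (s:%.2f o:%.2f)", sigma[c], ovh[c]);
    printf("\n");
    for (int i = 0; i < 6; i++) {
        printf("%3d", procs[i]);
        for (int c = 0; c < 4; c++)
            printf("  %15.2f", speedup(sigma[c], ovh[c], procs[i]));
        printf("\n");
    }
    return 0;
}
```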

Plotting speedups
T_1 = 1, T_p = σ + (ο + π)/p, S_p = 1/(σ + (ο + π)/p)

  p | S_p (σ:1/4, ο:1/4) | S_p (σ:1/4, ο:1/2) | S_p (σ:1/2, ο:1/4) | S_p (σ:1/2, ο:1/2)
  1 |        1.00        |        1.00        |        1.00        |        1.00
  2 |        1.33        |        1.14        |        1.14        |        1.00
  4 |        2.00        |        1.78        |        1.45        |        1.33
  8 |        2.67        |        2.46        |        1.68        |        1.60
 16 |        3.20        |        3.05        |        1.83        |        1.78
 32 |        3.56        |        3.46        |        1.91        |        1.88

[Plot: speedup S_p versus number of processors p (0 to 40) for the four (σ, ο) combinations above.]

This class: explicit parallel programming
Class Objectives
- Study the foundations of parallel programming.
- Write explicit parallel programs:
  - in C plus libraries,
  - using OpenMP (shared memory), CUDA (data parallel), and MPI (message passing).
- Execute and measure, plot speedups, interpret / try to understand the outcomes (σ, π, ο), and document performance.

Class Format
- Hands on: write efficient parallel code. This often starts with writing efficient sequential code.
- Four versions of a code: naïve sequential, naïve parallel, smart sequential, smart parallel.
- Labs cover many details; discussions follow up on what's learned in labs.
- PAs: write effective and scalable parallel programs; report and interpret their performance.

Parallel Programming steps
- Partition: break the program into (smallest possible) parallel tasks and partition the data accordingly.
- Communicate: decide which tasks need to communicate (exchange data).
- Agglomerate: group tasks into larger tasks, reducing communication overhead and taking LOCALITY into account.
- High Performance Computing is all about locality.

Locality
A process (running on a processor) needs to have its data as close to it as possible. Processors have non-uniform memory access (NUMA):
- registers: 0 (extra) cycles
- cache: 1, 2, 3 cycles (there are multiple caches on a multicore chip)
- local memory: 50s to 100s of cycles
- global memory, processor-processor interconnect, disk

Process:
- COARSE, in the OS world: a program in execution
- FINER GRAIN, in the HPC world: (a number of) loop iterations, or a function invocation

Exercise: summing numbers
- 10x10 grid of 100 processors. Add up 4000 numbers initially located at the corner processor [0,0]; processors can only communicate with their NEWS (north/east/west/south) neighbors.
- Different cost models:
  - Each communication event may carry an arbitrarily large set of numbers (cost independent of volume).
    - Alternate: communication cost is proportional to volume.
  - Cost: communication takes 100x the time of adding.
    - Alternate: communication and computation take the same time.

Outline
- Scatter: distribute the numbers to the processors row- or column-wise.
  - Each processor along the edge gets a stack:
    - keeps whatever is needed for its row/column,
    - forwards the rest to the processor to the south (or east),
    - initiates an independent scatter along its row/column.
- Each processor adds up its local set of numbers.
- Gather: the sums are sent back in reverse order of the scatter and added up along the way. (A library-level version of this pattern is sketched below.)
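For intuition only (the exercise itself assumes a 2D grid with NEWS-neighbor communication, which MPI collectives hide), here is what the same scatter / local-sum / gather-and-add pattern looks like with MPI collectives. N = 4000 comes from the exercise; the data values are placeholders.

```c
/* Scatter / local sum / reduce pattern with MPI collectives (sketch only;
 * the exercise's hand-routed grid version must route the data itself). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4000                       /* problem size from the exercise */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int chunk = N / p;               /* assume p divides N for simplicity */
    double *all = NULL;
    if (rank == 0) {                 /* all numbers start at the "corner" */
        all = malloc(N * sizeof *all);
        for (int i = 0; i < N; i++) all[i] = 1.0;   /* placeholder data */
    }

    double *mine = malloc(chunk * sizeof *mine);
    /* scatter: each processor receives its share of the numbers */
    MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = 0.0, total = 0.0;
    for (int i = 0; i < chunk; i++)  /* each processor adds its local set */
        local += mine[i];

    /* "gather and add along the way" is exactly a sum reduction */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %.0f\n", total);
    free(mine); free(all);
    MPI_Finalize();
    return 0;
}
```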

Broadcast / Reduction
- Broadcast: one processor has a value, and we want every processor to have a copy of that value.
- Grid: the time for a processor to receive the value must be at least equal to its distance (number of hops in the grid graph) from the source. For the opposite corner this is the diameter of the grid, ~2*sqrt(p) hops.
- Reduction: the dual of broadcast.
  - First add n/p numbers on each processor.
  - Then do the broadcast in reverse, adding along the way.
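For the 10x10 grid from the exercise, the corner-to-corner hop count works out as follows (a lower bound on broadcast time, ignoring computation):

```latex
% Distance between opposite corners of a \sqrt{p} \times \sqrt{p} mesh:
d = 2\,(\sqrt{p} - 1) \approx 2\sqrt{p}
\qquad\Rightarrow\qquad
p = 100:\; d = 2\,(10 - 1) = 18 \text{ hops}
```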

Analysis
- Scatter, add locally, communicate back to the corner, interspersing the communication with one addition operation.
- Problem size N = 4000, machine size P = 100, machine topology: square grid, communication cost C = 100.
  - Scatter: 18 steps, i.e. 2(sqrt(P) - 1) (get data, keep what you need, send the rest onwards).
  - Add 40 numbers locally.
  - Reduce the local answers: 18 steps (one send, add 2 numbers per step).
- Is P = 100 good for this problem? Which P is optimal, assuming a square grid?
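One way to explore the final question is to plug a simple cost model into a few lines of C. The model below (scatter and reduce each cost 2(sqrt(P) - 1) communication steps at C time units apiece, plus N/P local additions) is only my reading of the slide, so treat the output as a sketch, not the official answer.

```c
/* Sketch: estimated time to sum N numbers on a sqrt(P) x sqrt(P) grid,
 * under an ASSUMED cost model:
 *   T(P) = 2 * 2(sqrt(P)-1) * C   communication (scatter + reduce)
 *        + N / P                  local additions
 * Sequential reference: N - 1 additions.
 */
#include <stdio.h>

#define N 4000.0
#define C 100.0     /* one communication step costs 100 additions */

static double est_time(int side)            /* side = sqrt(P) */
{
    double P = (double)side * side;
    return 4.0 * (side - 1) * C + N / P;
}

int main(void)
{
    printf("  P   est. time   est. speedup\n");
    for (int side = 1; side <= 10; side++) {
        double t = est_time(side);
        printf("%3d   %9.1f   %12.2f\n", side * side, t, (N - 1.0) / t);
    }
    return 0;
}
```

Under this particular model the sweep favors a grid much smaller than 10x10, which is presumably why the slide asks whether P = 100 is a good choice; with a different cost model (e.g., volume-proportional communication) the answer can change.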