CS 475: Parallel Programming Introduction
Sanjay Rajopadhye (with updates by Wim Bohm, Cathie Olschanowski)
Colorado State University, Fall 2016

Name card info (inside)
- Name: Sanjay Rajopadhye
- Pronunciation (optional): In Indian names, "a" is almost always pronounced as a short "u" sound as in gun, fun, etc., or a long "aa" sound as in calm, bard, etc. Sanjay is pronounced Sun-juy; Rajopadhye is Raaj-Oh-paath-yay (in the "paath", make the "t" sound like a "d"). Don't worry if you don't get it right; it is almost always mispronounced, even in India. These pronunciation rules are used in many parts of Asia: e.g., how do you pronounce Baghdad?
- Major dept: CS/ECE
- Status: Associate Professor (e.g., 3rd year Ph.D.)
- Interesting fact: I only recently discovered how thin my hair was in the back

Bio sketch
- Education
  - Undergrad: B.Tech, IIT Kharagpur, India, 1980
  - Grad: Ph.D., University of Utah, 1986
- Professional:
  - Assistant Prof, CS, University of Oregon (86-91)
  - Assistant Prof, ECE, Oregon State University (91-92)
  - Researcher / Prof Associé, IRISA, Rennes (1993-2001)
  - Colorado State University (2001- )
- Research contributions/interests:
  - High-performance computing, embedded systems, VLSI, algorithms, compilers, programming languages
  - Polyhedral model: a mathematical framework for describing, transforming, and compiling massively parallel computations

Teaching Assistant
- Name: Swetha Varadarajan
- What you want to be called: Swetha
- Major: ECE
- Status: Grad (MS)
- Interesting fact: I can play Candy Crush for 15 hours straight

Topics
- Answer: Why parallel computing?
- Overview of the types of parallel computing
- Overview of speedup and efficiency
- Review the plan for this class
- Interactive/illustrative example

Why Parallel Programming
- Need for speed
  - Many applications require orders of magnitude more compute power than we have right now.
  - Speed came for a long time from technology improvements, mainly increased clock speed and chip density.
  - The technology improvements have slowed down, and the only way to get more speed is to exploit parallelism.
  - All new computers are now parallel computers.

Technology: Moore's Law
- Empirical observation (1965) by Gordon Moore:
  - The chip density, i.e., the number of transistors that can be inexpensively placed on an integrated circuit, is increasing exponentially, doubling approximately every two years.
  - en.wikipedia.org/wiki/moore's_law
- This has held true until now and is expected to hold until at least 2015 (news.cnet.com/new-life-for-Moores-Law/2009-1006_3-5672485.html).

(figure: transistor counts over time; source: Wikipedia)

Corollary of exponential growth
- When two quantities grow exponentially, but at different rates, their ratio also grows exponentially: y_1 = a^x and y_2 = b^x with a > b >= 1, therefore y_1 / y_2 = (a/b)^x = r^x with r > 1.
- For example, 1.1^n is O(2^n), but 2^n grows a lot faster than 1.1^n.
- Consequence for computer architecture: the growth rate for, e.g., memory is not as high as for processors; therefore, memory gets slower and slower (in terms of clock cycles) compared to processors. This gives rise to so-called gaps or walls.

Gaps or walls of Moore's Law
- Memory gap/wall: memory bandwidth and latency improve much more slowly than processor speeds (since the mid 80s, this has been addressed by ever-increasing on-chip caches).
- (Brick) Power wall: higher frequency means more power, i.e., more heat, so chips are getting exceedingly hot. Consequence: we can no longer increase the clock frequency of our processors.
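As a concrete instance of the corollary (the numbers here are chosen only for illustration):

    \frac{2^n}{1.1^n} = \left(\frac{2}{1.1}\right)^n \approx 1.82^n

so the ratio of the two curves itself grows exponentially, with r ≈ 1.82.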

(figure source: "The Free Lunch Is Over", Herb Sutter, Dr. Dobb's Journal, 2005)

Enter Parallel Computing
- The goal is to deliver performance. The only solution to the power wall is to have multiple processor cores on a chip, because we can still increase the chip density.
- Nowadays, there are no more processors with one core. The processor is the new transistor.
- 2005 prediction: the number of cores will continue to grow at an exponential rate (at least until 2015). This has not happened as much as we thought it would; we are still in the ~16 cores per microprocessor range.

Implicit Parallel Programming
- Let the compiler / runtime system detect parallelism, do task and data allocation, and do the scheduling.
- This has been a holy grail of compiler and parallel computing research for a long time and, after more than 40 years, still requires research on automatic parallelization, languages, OS, fault tolerance, etc.
- We still don't have a general-purpose, high-performance implicit parallel programming language that compiles to a general-purpose parallel machine.

Explicit Parallel Programming
- Let the programmer express parallelism, task and data partitioning, allocation, synchronization, and scheduling, using programming languages extended with explicit parallel programming constructs.
- This makes the programming task harder, but a lot of fun and a large source of income (very large, actually).

Explicit parallel programming
- Explicit parallel programming
  - Multithreading: OpenMP
  - Message passing: MPI
  - Data parallel programming: CUDA
- Explicit parallelism complicates programming
  - Creation, allocation, scheduling of processes
  - Data partitioning
  - Synchronization (semaphores, locks, messages)

Computer Architectures
- Sequential: John von Neumann
  - Stored program, single instruction stream, single data stream
- Parallel: Michael Flynn
  - SIMD: Single Instruction, Multiple Data (data parallel)
  - MIMD: Multiple Instruction, Multiple Data
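To make the multithreading item above concrete, here is a minimal OpenMP sketch (an illustrative example, not taken from the course materials; the array, its size, and the printed message are made up):

    /* Minimal OpenMP sketch: the programmer explicitly marks the loop as
       parallel; the runtime creates threads and splits the iterations
       among them.  Build with, e.g., gcc -fopenmp. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double a[1000];

        #pragma omp parallel for          /* explicit parallel construct */
        for (int i = 0; i < 1000; i++)
            a[i] = 2.0 * i;               /* independent iterations      */

        printf("up to %d threads, a[999] = %.1f\n",
               omp_get_max_threads(), a[999]);
        return 0;
    }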

SIMD
- One control unit
- Many processing elements
  - All executing the same instruction
  - Local memories
- Conditional execution
  - Where ocean: perform ocean step
  - Where land: perform land step
  - Gives rise to idle processors (optimized in GPUs)
- PE interconnection: usually a 2D mesh

SIMD machines
- Bit level:
  - ILLIAC IV (very early research machine, Illinois)
  - CM-2 (Thinking Machines)
- Now GPUs: a Cartesian grid of SMs (Streaming Multiprocessors)
  - Each SM has a collection of processing elements
  - Each SM has a shared memory and a register file
  - One global memory
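The ocean/land bullets can be pictured with a small C sketch (all names and data here are made up for illustration): on a SIMD machine or a GPU warp, every PE steps through both branches in lockstep, and PEs whose predicate is false are masked off (idle) for that branch.

    /* SIMD-style conditional execution, sketched sequentially.
       Conceptually each index i belongs to one PE, and all PEs run in lockstep. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        int    is_ocean[N] = {1, 0, 1, 1, 0, 0, 1, 0};  /* predicate per cell */
        double cell[N]     = {0};

        for (int i = 0; i < N; i++) {
            if (is_ocean[i])
                cell[i] += 1.0;   /* "ocean step": land PEs would sit idle here  */
            else
                cell[i] -= 1.0;   /* "land step": ocean PEs would sit idle here  */
        }

        for (int i = 0; i < N; i++)
            printf("%+.1f ", cell[i]);
        printf("\n");
        return 0;
    }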

MIMD
- Multiple processors, each executing its own code
- Distributed memory MIMD
  - Complete PE + memory connected to a network
  - Cosmic Cube, N-Cube, our Cray
  - Extreme: networks of workstations; data centers: racks of processors with high-speed bus interconnects
  - Programming model: message passing

Shared memory MIMD
- Shared memory: CPUs or cores, bus, shared memory
  - All our processors are now multi-core
  - Programming model: OpenMP
- NUMA: Non-Uniform Memory Access
  - Memory hierarchy (registers, cache, local, global)
  - Potential for better performance, but problem: memory (cache) coherence (resolved by the architecture)
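As a minimal sketch of the message-passing model on distributed-memory MIMD (a generic example, not from the slides): each rank has its own private memory, so data moves only through explicit messages.

    /* Minimal MPI sketch: rank 1 sends one double to rank 0.
       Build with mpicc; run with, e.g., mpirun -np 2 ./a.out. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double x = 0.0;
        if (rank == 1) {
            x = 3.14;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);   /* explicit message */
        } else if (rank == 0 && size > 1) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %.2f from rank 1\n", x);
        }
        MPI_Finalize();
        return 0;
    }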

Clarifying: Reading assignments
1. Inside front cover
2. Chapter 3, Parallel Algorithm Design

Speedup
- Why do we write parallel programs again? To get speedup: go faster than sequential.
- What is speedup?
  - T_1 = sequential time to execute a program (sometimes called T or S)
  - T_p = time to execute the same program with p processors (cores, PEs)
  - S_p = T_1 / T_p, the speedup for p processors

Ideal speedup, linear speedup
- Ideal speedup: p-fold speedup, S_p = p. Ideal is not always possible. Why?
  - Certain parts of the computation are inherently sequential.
  - Tasks are data dependent, so not all processors are always busy, and they need to synchronize.
  - Remote data needs communication: memory wall plus communication wall.
- Linear speedup: S_p = β p, where β is usually less than 1.

Efficiency
- Speedup is usually neither ideal nor linear.
- We express this in terms of efficiency E_p: E_p = S_p / p.
  - E_p defines the average utilization of the p processors. Range?
- What does E_p = 1 signify?
- What does E_p = β (0 < β < 1) signify?
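A small worked example (the timings are made up): if a program takes T_1 = 32 s sequentially and T_16 = 10 s on 16 processors, then

    S_{16} = \frac{T_1}{T_{16}} = \frac{32\,\mathrm{s}}{10\,\mathrm{s}} = 3.2,
    \qquad
    E_{16} = \frac{S_{16}}{16} = \frac{3.2}{16} = 0.2

i.e., on average each of the 16 processors does useful work only 20% of the time.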

More realistic speedups
- T_1 = 1, T_p = σ + (o + π)/p
- S_p = 1 / (σ + (o + π)/p)
  - σ is the sequential fraction of the program
  - π is the parallel fraction of the program; σ + π = 1
  - o is the parallel overhead (it does not occur in sequential execution)
- Draw speedup curves for p = 1, 2, 4, 8, 16, 32, with σ = 1/4, 1/2 and o = 1/4, 1/2.
- When p goes to infinity, S_p goes to?

Amdahl's Law (from your reading)

Plotting speedups
T_1 = 1, T_p = σ + (o + π)/p, S_p = 1 / (σ + (o + π)/p)

      p   S_p (σ=1/4, o=1/4)   S_p (σ=1/4, o=1/2)   S_p (σ=1/2, o=1/4)   S_p (σ=1/2, o=1/2)
      1          1                    1                    1                    1
      2          1.33                 1.14                 1.14                 1.00
      4          2.00                 1.78                 1.45                 1.33
      8          2.67                 2.46                 1.68                 1.60
     16          3.2                  3.05                 1.83                 1.78
     32          3.56                 3.46                 1.91                 1.88

(plot: speedup S_p versus number of processors p for the four cases above)

This class: explicit parallel programming

Class Objectives
- Study the foundations of parallel programming
- Write explicit parallel programs
  - In C plus libraries
  - Using OpenMP (shared memory), MPI (message passing), and CUDA (data parallel)
- Execute and measure, plot speedups, interpret / try to understand the outcomes (σ, π, o), document performance
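A small C sketch that reproduces the table above under the stated model (this treats p = 1 as a purely sequential run with no parallel overhead, which is how the first row is obtained):

    /* S_p = 1 / (sigma + (o + pi)/p) with sigma + pi = 1.
       p = 1 is treated as sequential execution (S_1 = 1), since the
       overhead o does not occur in a sequential run. */
    #include <stdio.h>

    static double speedup(int p, double sigma, double o) {
        double pi = 1.0 - sigma;                 /* parallel fraction      */
        if (p == 1) return 1.0;                  /* no overhead sequentially */
        return 1.0 / (sigma + (o + pi) / p);
    }

    int main(void) {
        const double sigma[] = {0.25, 0.25, 0.5, 0.5};
        const double ovh[]   = {0.25, 0.5,  0.25, 0.5};
        const int    ps[]    = {1, 2, 4, 8, 16, 32};

        printf("  p   s=1/4,o=1/4  s=1/4,o=1/2  s=1/2,o=1/4  s=1/2,o=1/2\n");
        for (int i = 0; i < 6; i++) {
            printf("%3d", ps[i]);
            for (int c = 0; c < 4; c++)
                printf("  %11.2f", speedup(ps[i], sigma[c], ovh[c]));
            printf("\n");
        }
        return 0;
    }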

Class Format
- Hands on: write efficient parallel code. This often starts with writing efficient sequential code.
  - Four versions of a code: naive sequential, naive parallel, smart sequential, smart parallel
- Lab covers many details; discussions follow up on what is learned in labs.
- PAs: write effective and scalable parallel programs; report and interpret their performance.

Parallel Programming steps
- Partition: break the program into (the smallest possible) parallel tasks and partition the data accordingly.
- Communicate: decide which tasks need to communicate (exchange data).
- Agglomerate: group tasks into larger tasks, reducing communication overhead and taking LOCALITY into account.
- High Performance Computing is all about locality.
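One way to picture these steps is a simple shared-memory sum (an illustrative sketch, not a course assignment; the array and its size are made up). Here the iterations are the partitioned tasks, the static schedule agglomerates them into one contiguous chunk per thread (good locality), and in shared memory the "communicate" step reduces to combining the per-thread partial sums.

    /* Partition / agglomerate / communicate, mapped onto an OpenMP sum:
       - Partition:   each iteration i is a (tiny) task on a[i].
       - Agglomerate: schedule(static) gives each thread one contiguous chunk.
       - Communicate: reduction(+:sum) combines the per-thread partial sums. */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f\n", sum);
        return 0;
    }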

Locality
- A process (running on a processor) needs to have its data as close to it as possible. Processors have non-uniform memory access (NUMA):
  - Registers: 0 (extra) cycles
  - Cache: 1, 2, 3 cycles; there are multiple caches on a multicore chip
  - Local memory: 50s to 100s of cycles
  - Global memory, processor-processor interconnect, disk

Process
- COARSE, in the OS world: a program in execution
- FINER GRAIN, in the HPC world: (a number of) loop iterations, or a function invocation

Exercise: summing numbers
- 10x10 grid of 100 processors. Add up 4000 numbers initially at the corner processor [0,0]; processors can only communicate with NEWS (north/east/west/south) neighbors.
- Different cost models:
  - Each communication event may carry an arbitrarily large set of numbers (cost independent of volume).
    - Alternate: communication cost is proportional to volume.
  - Cost: communication takes 100x the time of adding.
    - Alternate: communication and computation take the same time.

Outline
- Scatter: the numbers go to processors row- or column-wise.
  - Each processor along the edge gets a stack:
    - Keeps whatever is needed for its row/column
    - Forwards the rest to the processor to the south (or east)
  - Initiates an independent scatter along its row/column.
- Each processor adds up its local set of numbers.
- Gather: sums are sent back in the reverse order of the scatter and added up along the way.

Broadcast / Reduction
- One processor has a value, and we want every processor to have a copy of that value.
- Grid: the time for a processor to receive the value must be at least equal to its distance (number of hops in the grid graph) from the source. For the opposite corner, this is half the perimeter of the grid, ~2 sqrt(p) hops.
- Reduction: the dual of broadcast.
  - First add n/p numbers on each processor.
  - Then do the broadcast in reverse and add along the way.

Analysis
- Scatter, add locally, communicate back to the corner; intersperse with one addition operation (problem size N = 4000, machine size P = 100, machine topology: square grid, communication cost C = 100).
  - Scatter: 18 steps (get data, keep what you need, send the rest onwards), i.e., 2(sqrt(P) - 1).
  - Add 40 numbers.
  - Reduce the local answers: 18 steps (one send, add 2 numbers).
- Is P = 100 good for this problem? Which P is optimal, assuming a square grid? (A small cost sketch appears after the weekly roundup below.)

Weekly Roundup
- Moore's law (a.k.a. why parallel?)
- SIMD vs MIMD
- 3 major types of parallel computing
- Explicit versus implicit parallelism
- Locality
- Calculating speedup
  - Measured
  - Theoretical ideal
  - Cost of communication
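As a follow-up to the "which P is optimal?" question, here is a minimal C sketch under one possible reading of the cost model above (illustrative only, not the official solution): N numbers start at a corner of a sqrt(P) x sqrt(P) grid, a communication step costs C = 100 additions regardless of volume, the scatter takes 2(sqrt(P) - 1) steps, each processor then adds its N/P numbers, and the reduce takes another 2(sqrt(P) - 1) steps of one send plus one addition each.

    /* Scan square machine sizes P = q*q and report the cheapest one
       under the assumed cost model. */
    #include <stdio.h>

    #define N 4000.0
    #define C 100.0

    int main(void) {
        double best_t = 1e30;
        int    best_q = 1;

        for (int q = 1; q <= 20; q++) {          /* P = q*q processors   */
            double steps = 2.0 * (q - 1);
            double t = steps * C                 /* scatter              */
                     + N / (q * q)               /* local additions      */
                     + steps * (C + 1.0);        /* reduce: send + add   */
            printf("P = %3d  T = %8.1f\n", q * q, t);
            if (t < best_t) { best_t = t; best_q = q; }
        }
        printf("best under this model: P = %d (T = %.1f)\n",
               best_q * best_q, best_t);
        return 0;
    }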