
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)
Parallel Programming (Chapters 2, 3, & 4)

Copyright 2001 Mark D. Hill, University of Wisconsin-Madison. Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Parallel Programming

"To understand and evaluate design decisions in a parallel machine, we must get an idea of the software that runs on a parallel machine."
-- Introduction to Culler et al.'s Chapter 2, beginning 192 pages on software

Outline

- Review
- Applications
- Creating Parallel Programs
- Programming for Performance
- Scaling

Review: Separation of Model and Architecture

Shared Memory
- Single shared address space
- Communicate and synchronize using load/store
- Can support message passing

Message Passing
- Send/Receive
- Communication and synchronization combined
- Can support shared memory (with some effort)

Data Parallel
- Lock-step execution on regular data structures
- Often requires global operations (sum, max, min, ...)
- Can be supported on either SM or MP (SPMD)

Review: A Generic Parallel Machine

[Figure: four nodes, each with processor(s) (P), cache ($), memory (Mem), and a communication assist (CA), connected by an interconnect.]

- Separation of programming models from architectures
- All models require communication
- Node with processor(s), memory, communication assist

Review: Fundamental Architectural Issues

- Naming: how is communicated data and/or the partner node referenced?
- Operations: what operations are allowed on named data?
- Ordering: how can producers and consumers of data coordinate their activities?
- Performance
  - Latency: how long does it take to communicate in a protected fashion?
  - Bandwidth: how much data can be communicated per second? How many operations per second?

Applications

- Ocean Current Simulation
  - Typical of 3-D simulations
  - Has a major effect on world climate
- N-Body
  - Evolution of galaxies
- Ray Tracing
  - Shoot rays through a three-dimensional scene (let them bounce off objects)
- Data Mining
  - Finding associations: consumers that are college students and buy beer tend to buy chips

Ocean

[Figure: ocean basin discretized in space.]

- Simulate ocean currents
- Discretize in space and time

Barnes-Hut

- Computing the mutual interactions of N bodies
- N-body problems: stars, planets, molecules
- Can approximate the influence of distant bodies

Creating a Parallel Program

- Can be done by programmer, compiler, run-time system, or OS
- A task is a piece of work
  - Ocean: grid point, row, plane
  - Raytrace: 1 ray or a group of rays
- Task grain
  - small => fine-grain task
  - large => coarse-grain task
- Process (thread) performs tasks
  - According to the OS: process = thread(s) + address space
- Process (threads) executed on processor(s)

Steps for Creating a Parallel Program

1. Decomposition into tasks
2. Assignment of tasks to processes
3. Orchestration of data access, communication, etc.
4. Mapping processes to processors

[Figure: sequential computation decomposed into tasks, tasks assigned to processes P0..P3, orchestration producing the parallel program, mapping onto processors.]

Decomposition

- Decompose the computation into a set of tasks
- Task availability can be dynamic
- Maximize concurrency
- Minimize the overhead of managing tasks

Concurrency Profile

- Plot the number of concurrent tasks over time
- Example: operate on n^2 parallel data points, then sum them
  - sum sequentially, or
  - first sum blocks in parallel, then sum the p partial sums sequentially

Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

[Figure (Morgan Kaufmann, 1999): concurrency profile for the example.]

Concurrency Profile: Speedup

Let f_k = fraction of work with concurrency k, and p = number of processors. Then

    Speedup(p) = area under concurrency profile / horizontal extent of concurrency profile
               = (sum_{k=1..p} f_k) / (sum_{k=1..p} f_k / k)
               = 1 / (sum_{k=1..p} f_k / k)    [total work normalized to 1]

- Make the concurrency either serial or completely parallel (serial fraction s) and this reduces to Amdahl's Law:

    Speedup(p) = 1 / (s + (1 - s) / p)

- Note: the book has an unusual formulation of Amdahl's law (based on fraction of time rather than fraction of work)
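To make the formulas concrete, here is a small C sketch (not from the slides): it computes speedup from a concurrency profile and checks that a two-level profile reproduces Amdahl's law. The function names and the example profile are illustrative.

    /* Minimal sketch (not from the slides): speedup from a concurrency
       profile, plus Amdahl's law as the two-level special case. */
    #include <stdio.h>

    /* f[k-1] = fraction of total work done with concurrency k, sum = 1 */
    double profile_speedup(const double *f, int kmax, int p) {
        double time = 0.0;               /* parallel time, work normalized to 1 */
        for (int k = 1; k <= kmax; k++)
            time += f[k - 1] / (k < p ? k : p);  /* concurrency capped at p */
        return 1.0 / time;
    }

    double amdahl(double s, int p) {     /* s = serial fraction of work */
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        /* 20% serial work, 80% work with concurrency 4 (illustrative) */
        double f[4] = {0.2, 0.0, 0.0, 0.8};
        printf("profile: %.3f\n", profile_speedup(f, 4, 4));  /* 2.500 */
        printf("amdahl : %.3f\n", amdahl(0.2, 4));            /* 2.500 */
        return 0;
    }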

Assignment

- Assign tasks to processes (static or dynamic)
- Balance the workload
- Reduce communication (while keeping the load balanced)
- Minimize overhead
- Assignment + Decomposition = Partitioning

Orchestration

- Choreograph data access, communication, and synchronization
- Reduce the cost of communication and synchronization
- Preserve data locality (data layout)
- Schedule tasks (order of execution)
- Reduce the overhead of managing parallelism
- Must have good primitives (architecture and model)

Mapping

- Map processes to physical processors
- Space sharing
- Static mapping (pin process to processor)
- Dynamic: processes migrate under OS control
  - what about orchestration (data locality)?
- Task queues under program control

OS Effects on Mapping

- Ability to bind a process to a processor
- Space sharing: physical partitioning of the machine
- Gang scheduling: all processes context switched simultaneously

Example: Ocean Equation Solver

- Kernel = small piece of heavily used code
- Update each point based on its NEWS (north/east/west/south) neighbors
- Gauss-Seidel (update in place)
- Compute average difference per element
- Convergence when the difference is small => exit

Equation Solver Decomposition

    while (!converged)
        for i ...
            for j ...

- The loops are not independent (linear recurrence)
- Exploit properties of the problem
  - Don't really need up-to-date values (approximation)
  - May take more steps to converge, but exposes parallelism
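For reference, here is a minimal C sketch of the sequential kernel just described, assuming the five-point weighted average used in the book's example; the grid size and tolerance are illustrative placeholders.

    /* Sequential Gauss-Seidel sweep over an (n+2) x (n+2) grid with a
       boundary border; a minimal sketch of the kernel described above. */
    #include <math.h>

    #define N   256      /* interior grid size, illustrative */
    #define TOL 1e-3     /* convergence threshold, illustrative */

    double a[N + 2][N + 2];

    void solve(void) {
        int done = 0;
        while (!done) {
            double diff = 0.0;
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++) {
                    double old = a[i][j];
                    /* in-place update from NEWS neighbors (Gauss-Seidel) */
                    a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                     + a[i][j - 1] + a[i][j + 1]);
                    diff += fabs(a[i][j] - old);
                }
            /* converged when the average difference per element is small */
            done = (diff / ((double)N * N) < TOL);
        }
    }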

Recurrence in Inner Loop: Sequential Solver

[Figure (Culler et al., Morgan Kaufmann, 1999): dependence structure of the sequential in-place sweep.]

Parallel Solver

- Identical computation as the original code
- Parallelism along the anti-diagonal
- Different degrees of parallelism as the diagonal grows and shrinks

[Figure (Culler et al., Morgan Kaufmann, 1999): anti-diagonal wavefronts of independent points.]

The FORALL Statement

    while (!converged)
        forall i
            forall j
                ...

- Can execute the iterations in parallel
- Each grid point is its own computation (n^2 parallelism)

    while (!converged)
        forall i
            for j
                ...

- Computation for rows is independent (n parallelism), with less overhead

Asynchronous Parallel Solver

- No assignment or orchestration
- Each processor updates its region independent of the others' values
- Global synch at the end of each iteration, to keep things somewhat up-to-date
- Non-deterministic

[Figure (Culler et al., Morgan Kaufmann, 1999): asynchronous version of the solver.]
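A hedged sketch of the coarser, row-level forall in C with OpenMP; the slides use a generic forall, so OpenMP is only a concrete stand-in, and the grid declaration repeats the earlier sketch.

    /* Row-level forall (n-way parallelism): compile with -fopenmp.
       A sketch, not the slides' notation. */
    #include <math.h>

    #define N 256

    double a[N + 2][N + 2];   /* grid with boundary border */

    double sweep(void) {
        double diff = 0.0;
        /* parallel i loop, sequential j loop: less scheduling overhead;
           adding collapse(2) would expose the n^2 point-level form instead */
        #pragma omp parallel for reduction(+:diff)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                double old = a[i][j];
                a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                 + a[i][j - 1] + a[i][j + 1]);
                diff += fabs(a[i][j] - old);
            }
        return diff;          /* caller tests convergence and iterates */
    }

Because the update is in place, parallel sweeps read a mix of old and new neighbor values, which is exactly the non-determinism the asynchronous-solver slide points out.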

Red-Black Parallel Solver

- Red-black: like a checkerboard
- Update of a red point depends only on black points
- Alternate iterations over red points, then black points
- Convergence behavior may change
- Deterministic

[Figure (Culler et al., Morgan Kaufmann, 1999): red-black checkerboard ordering.]

Equation Solver Assignment

- Each process gets a contiguous block of rows

[Figure: contiguous blocks of rows assigned to P0, P1, P2, ...]
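A minimal sketch of one red-black iteration over the same grid; coloring points by the parity of i + j is a common convention, not something the slides specify.

    /* One red-black iteration: sweep all red points, then all black points.
       Each phase reads only the other color, so points within a phase can
       be updated in parallel, and the result is deterministic. */
    #include <math.h>

    #define N 256

    double a[N + 2][N + 2];   /* grid with boundary border */

    double red_black_sweep(void) {
        double diff = 0.0;
        for (int color = 0; color < 2; color++)      /* 0 = red, 1 = black */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++) {
                    if (((i + j) & 1) != color)      /* checkerboard coloring */
                        continue;
                    double old = a[i][j];
                    a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                     + a[i][j - 1] + a[i][j + 1]);
                    diff += fabs(a[i][j] - old);
                }
        return diff;          /* caller tests convergence and iterates */
    }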

PARMACS

- Macro package; the runtime system must implement it (portability)
- CREATE(p, proc, args): create p processes executing proc(args)
- G_MALLOC(size): allocate shared data of size bytes
- LOCK(name) / UNLOCK(name)
- BARRIER(name, number): wait for number processes to arrive
- WAIT_FOR_END(number): wait for number processes to terminate
- WAIT(flag): while (!flag);
- SIGNAL(flag): flag = 1;

Shared Memory Orchestration

[Figure (Culler et al., Morgan Kaufmann, 1999): shared-memory version of the solver.]

Equation Solver: The Ugly Code

    main()
        parse_arguments();
        A = G_MALLOC(size of big array);
        CREATE(nprocs-1, Solve, A);    /* create the other workers */
        Solve(A);                      /* the main process works too */
        WAIT_FOR_END(nprocs-1);
    end main

    Solve(A)
        while (!done)
            mydiff = 0;
            for i = my_start to my_end           /* this process's rows */
                for j = 1 to n
                    temp = A[i,j];
                    A[i,j] = 0.2 * (A[i,j] + A[i-1,j] + A[i+1,j]
                                    + A[i,j-1] + A[i,j+1]);
                    mydiff += abs(A[i,j] - temp);
            LOCK(diff_lock);
            diff += mydiff;
            UNLOCK(diff_lock);
            if (convergence_test) done = 1;
            BARRIER

Message Passing Primitives

[Figure (Culler et al., Morgan Kaufmann, 1999): message-passing primitives.]

Sends and Receives

Synchronous
- send: returns control to the sending process after the receive is performed
- receive: returns control when the data is written into the address space
- can deadlock on pairwise data exchanges

Asynchronous
- Blocking
  - send: returns control when the message has been sent from the source data structure (but may still be in transit)
  - receive: like synchronous, but no ack is sent to the sender
- Non-blocking
  - send: returns control immediately
  - receive: posts intent to receive
  - probe operations determine completion

Message Passing Orchestration

- Create separate processes
- Each process gets a portion of the array: n/nprocs (+2) rows
- Boundary rows are passed in messages
  - deterministic because boundaries only change between iterations
- To test for convergence, each process computes mydiff and sends it to proc 0
  - synchronized via send/receive
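The orchestration the slides describe maps naturally onto MPI. Below is a minimal sketch, not the slides' code: the MPI calls are real, but the grid size, tolerance, and the use of an allreduce in place of "send mydiff to proc 0" are illustrative choices.

    /* Message-passing solver sketch: each process owns n/nprocs rows plus
       two ghost rows, exchanges boundary rows each iteration, and reduces
       the local diffs for the convergence test. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <math.h>

    #define N   256      /* grid size, illustrative */
    #define TOL 1e-3     /* convergence threshold, illustrative */

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int rows = N / nprocs;                  /* assumes nprocs divides N */
        /* owned rows 1..rows, plus ghost rows 0 and rows+1 */
        double (*a)[N + 2] = calloc(rows + 2, sizeof *a);
        /* real code would set boundary conditions here */

        int done = 0;
        while (!done) {
            /* exchange boundary rows with neighbors; ghost rows change only
               between iterations, so the computation stays deterministic */
            if (rank > 0) {
                MPI_Send(a[1], N + 2, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
                MPI_Recv(a[0], N + 2, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
            if (rank < nprocs - 1) {
                MPI_Recv(a[rows + 1], N + 2, MPI_DOUBLE, rank + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(a[rows], N + 2, MPI_DOUBLE, rank + 1, 0,
                         MPI_COMM_WORLD);
            }

            double mydiff = 0.0, diff;
            for (int i = 1; i <= rows; i++)
                for (int j = 1; j <= N; j++) {
                    double old = a[i][j];
                    a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                     + a[i][j - 1] + a[i][j + 1]);
                    mydiff += fabs(a[i][j] - old);
                }

            /* the slides send each mydiff to proc 0; an allreduce that
               gives every process the global sum is a common equivalent */
            MPI_Allreduce(&mydiff, &diff, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            done = (diff / ((double)N * N) < TOL);
        }

        free(a);
        MPI_Finalize();
        return 0;
    }

The send/receive ordering avoids the pairwise-exchange deadlock noted above, at the cost of serializing the exchange along the chain of processes; MPI_Sendrecv or non-blocking calls would overlap it.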

Message Passing Orchestration

[Figures (Culler et al., Morgan Kaufmann, 1999): code for the message-passing version of the solver, spread over three slides.]

Example: Programmer / Software

LocusRoute (standard cell router)

    while (route_density_improvement > threshold) {
        for (i = 1 to num_wires) do {
            rip old wire out
            explore new route
            place wire using best new route
        }
    }

Shared Memory Implementation

Shared memory algorithm:
- Divide the cost array into regions
  - logically assign regions to PEs
- Assign wires to PEs based on the region in which their center lies
- Load balance by stealing when the local queue is empty

Pros:
- Good load balancing
- Mostly local accesses
- High cache hit ratio

Cons:
- Non-deterministic
- Potential hot spots
- Amount of parallelism

Message Passing Implementations

Method 1:
- Distribute wires and cost-array regions as in the SM implementation
- When a wire path crosses into a remote region,
  - send the computation to the remote PE, or
  - send a message to access the remote data

Method 2:
- Distribute only wires as in the SM implementation
- Fully replicate the cost array on each PE
  - one owned region, plus potentially stale copies of the others
  - send updates so copies are not too stale
- Consequences:
  - wasted memory in replication
  - stale data => poorer-quality results or more iterations

Both methods require lots of thought from the programmer.
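The stealing discipline in the shared-memory version can be made concrete with a small pthreads sketch; everything here (the queue layout, NPES, route_wire) is an illustration rather than LocusRoute's actual code.

    /* Illustrative work-stealing sketch (not LocusRoute's code): each PE
       takes wires from its own queue and steals from others when empty. */
    #include <pthread.h>
    #include <stddef.h>

    #define NPES 4          /* number of worker threads ("PEs"), illustrative */
    #define QCAP 1024

    typedef struct {
        pthread_mutex_t lock;
        int wires[QCAP];    /* ids of wires whose center lies in this region */
        int head, tail;
    } queue_t;

    static queue_t q[NPES];

    /* rip old wire out, explore new route, place using best route (elided) */
    static void route_wire(int wire) { (void)wire; }

    static int pop(queue_t *qu, int *wire) {
        int ok = 0;
        pthread_mutex_lock(&qu->lock);
        if (qu->head < qu->tail) { *wire = qu->wires[qu->head++]; ok = 1; }
        pthread_mutex_unlock(&qu->lock);
        return ok;
    }

    static void *worker(void *arg) {
        int me = (int)(size_t)arg, wire;
        for (;;) {
            if (pop(&q[me], &wire)) { route_wire(wire); continue; }
            int stolen = 0;               /* local queue empty: try to steal */
            for (int v = 0; v < NPES && !stolen; v++)
                if (v != me && pop(&q[v], &wire)) { route_wire(wire); stolen = 1; }
            if (!stolen) return NULL;     /* no work left anywhere */
        }
    }

    int main(void) {
        pthread_t t[NPES];
        for (int i = 0; i < NPES; i++) {
            pthread_mutex_init(&q[i].lock, NULL);
            q[i].head = q[i].tail = 0;
            /* ... fill q[i].wires[] from the region assignment, set tail ... */
        }
        for (int i = 0; i < NPES; i++)
            pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
        for (int i = 0; i < NPES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }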

Review: Creating a Parallel Program

- Can be done by programmer, compiler, run-time system, or OS
- Steps for creating a parallel program:
  1. Decomposition
  2. Assignment of tasks to processes
  3. Orchestration
  4. Mapping