CS4961 Parallel Programming. Lecture 4: Data and Task Parallelism. Mary Hall, September 3, 2009.

CS4961 Parallel Programming
Lecture 4: Data and Task Parallelism
Mary Hall, September 3, 2009

Administrative
- Homework 2 posted, due September 10 before class
  - Use the handin program on the CADE machines
  - Use the following command: handin cs4961 hw2 <prob1file>
- Mailing list set up: cs4961@list.eng.utah.edu
- Sriram office hours: MEB 3115, Mondays and Wednesdays, 2-3 PM

Going over Homework 1
- Problem 2: Provide pseudocode (as in the book and class notes) for a correct and efficient parallel implementation in C of the parallel sums code, based on the tree-based concept in slides 26 and 27 of Lecture 2. Assume that you have an array of 128 elements and you are using 8 processors. (A hedged C sketch appears below.)
  - Hints: Use an iterative algorithm similar to count3s, but use the tree structure to accumulate the final result. Use the book to show you how to add threads to what we derived for count3s.
- Problem 3: Now show how the same algorithm can be modified to find the maximum element of an array (problem 2 in text). Is this also a reduction computation? If so, why?

Homework 2
- Problem 1 (#10 in text on p. 111): The Red/Blue computation simulates two interactive flows. An n x n board is initialized so cells have one of three colors: red, white, and blue, where white is empty, red moves right, and blue moves down. Colors wrap around to the opposite side when reaching the edge. In the first half step of an iteration, any red color can move right one cell if the cell to the right is unoccupied (white). On the second half step, any blue color can move down one cell if the cell below it is unoccupied. The case where red vacates a cell (first half) and blue moves into it (second half) is okay. Viewing the board as overlaid with t x t tiles (where t divides n evenly), the computation terminates if any tile's colored squares are more than c% of one color. Use Peril-L to write a solution to the Red/Blue computation. (A serial sketch of the half-step rules also appears below.)
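For Homework 1, Problem 2 above, the following is only a minimal sketch of one possible shape for the tree-based parallel sum, written in C with POSIX threads rather than the pseudocode the assignment asks for. The names (NTHREADS, partial, tree_sum) and the all-ones input are illustrative assumptions, not from the lecture.

    /* Hedged sketch: tree-based parallel sum of 128 elements on 8 threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define N        128
    #define NTHREADS 8

    static int data[N];                 /* input array */
    static long partial[NTHREADS];      /* per-thread partial sums */
    static pthread_barrier_t barrier;

    static void *tree_sum(void *arg)
    {
        long id = (long)arg;
        int chunk = N / NTHREADS;

        /* Step 1: each thread sums its own contiguous chunk locally. */
        long sum = 0;
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;

        /* Step 2: combine partial sums up a binary tree (log2(8) = 3 levels).
         * At stride s, thread id adds partial[id + s] if id is a multiple of 2s. */
        for (int stride = 1; stride < NTHREADS; stride *= 2) {
            pthread_barrier_wait(&barrier);
            if (id % (2 * stride) == 0)
                partial[id] += partial[id + stride];
        }
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) data[i] = 1;   /* sample input: all ones */

        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, tree_sum, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("sum = %ld\n", partial[0]);         /* expect 128 */
        pthread_barrier_destroy(&barrier);
        return 0;
    }

The barrier before each combining level plays the role the tree structure requires: no thread adds a partial sum until the previous level has finished producing it.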

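For Homework 2, Problem 1 above, the rules are easier to internalize from a plain serial sketch of the red half step; this is only an illustration of the board update with wraparound (the names RED, WHITE, BLUE, board, and the board size NN are my own choices), not the requested Peril-L solution.

    #include <stdio.h>

    enum { WHITE = 0, RED = 1, BLUE = 2 };

    #define NN 8                          /* illustrative board size */
    static int board[NN][NN];
    static int next_board[NN][NN];

    /* First half step: every RED whose right neighbor (with wraparound)
     * is WHITE moves one cell to the right. Reads come from board and
     * writes go to next_board, so all cells update "simultaneously". */
    static void half_step_red(void)
    {
        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++)
                next_board[i][j] = board[i][j];

        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++) {
                int right = (j + 1) % NN;              /* wrap at the edge */
                if (board[i][j] == RED && board[i][right] == WHITE) {
                    next_board[i][j] = WHITE;
                    next_board[i][right] = RED;
                }
            }

        for (int i = 0; i < NN; i++)
            for (int j = 0; j < NN; j++)
                board[i][j] = next_board[i][j];
    }

    /* The BLUE half step is symmetric: test board[(i + 1) % NN][j] == WHITE
     * and move BLUE down; the t x t tile check for termination then follows. */

    int main(void)
    {
        board[0][0] = RED;   /* tiny demo: one red with a white cell to its right */
        half_step_red();
        printf("cell (0,1) is now %s\n", board[0][1] == RED ? "RED" : "not RED");
        return 0;
    }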
Homework 2, cont.
- Problem 2: For the following task graphs (shown on the slide), determine:
  (1) Maximum degree of concurrency.
  (2) Critical path length.
  (3) Maximum achievable speedup over one process, assuming that an arbitrarily large number of processes is available.
  (4) The minimum number of processes needed to obtain the maximum possible speedup.
  (5) The maximum achievable speedup if the number of processes is limited to (a) 2 and (b) 8.

Today's Lecture
- Data parallel and task parallel constructs and how to express them
- Peril-L syntax
  - An abstract programming model based on C, used to illustrate concepts
- Task parallel concepts
- Sources for this lecture:
  - Larry Snyder, http://www.cs.washington.edu/education/courses/524/08wi/
  - Grama et al., Introduction to Parallel Computing, http://www.cs.umn.edu/~karypis/parbook

Brief Recap of Tuesday's Lecture
- We looked at a lot of different kinds of parallel architectures
  - Diverse! How to write software for a moving hardware target?
  - Abstract away specific details
  - Want to write machine-independent code
- Candidate Type Architecture (CTA) Model
  - Captures inherent tradeoffs without details of hardware choices
  - Summary: Locality is Everything!

Candidate Type Architecture (CTA) Model
- A model with P standard processors, node degree d, and non-local latency λ
- Node == processor + memory + NIC
  - All memory is associated with a specific node!
- Key property: a local memory reference costs 1; a global (non-local) memory reference costs λ
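The task graphs for Problem 2 are figures on the slide and are not reproduced in this transcription, but a small hypothetical example (my own, not from the homework) shows how the five quantities relate. Consider a binary reduction tree of 7 unit-time tasks: 4 independent leaves, 2 interior tasks that each depend on a pair of leaves, and 1 root that depends on both interior tasks.
- Maximum degree of concurrency: 4 (all four leaves can run at once).
- Critical path length: 3 (leaf, interior, root).
- Maximum achievable speedup: total work / critical path length = 7/3, about 2.33.
- Minimum processes for that speedup: 4 (with 3 processes, one leaf is delayed and the best schedule takes 4 steps).
- Speedup limited to 2 processes: the best schedule takes 4 steps (2 leaves, 2 leaves, 2 interior tasks, root), so 7/4 = 1.75; with 8 processes, nothing beyond 4 helps, so the speedup stays 7/3.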

Estimated Values for Lambda
- Captures the inherent property that data locality is important.
- But different values of λ can lead to different algorithm strategies.

Locality Rule
- Definition, p. 53: Fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references.
- Locality Rule in practice: It is usually more efficient to add a fair amount of redundant computation to avoid non-local accesses (e.g., the random number generator example; a brief sketch follows below).
- This is the most important thing you need to learn in this class!

Conceptual: CTA for Shared Memory Architectures?
- CTA is not capturing global memory in SMPs
- Forces a discipline
  - Application developer should think about locality even if remote data is referenced identically to local data!
  - Otherwise, performance will suffer
  - Anecdotally, codes written for distributed memory have been shown to run faster on shared memory architectures than shared memory programs
  - Similarly, GPU codes (which require a partitioning of memory) were recently shown to run well on conventional multi-core

Definitions of Data and Task Parallelism
- Data parallel computation: Perform the same operation on different items of data at the same time; the parallelism grows with the size of the data.
- Task parallel computation: Perform distinct computations -- or tasks -- at the same time; with the number of tasks fixed, the parallelism is not scalable.
- Summary
  - Mostly we will study data parallelism in this class
  - Data parallelism facilitates very high speedups and scaling to supercomputers
  - Hybrid (mixing of the two) is increasingly common
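The random number generator example mentioned under the Locality Rule can be made concrete with a small C/pthreads sketch (my own illustration, not code from the lecture): instead of every thread drawing from one shared generator, each thread redundantly keeps its own generator state, seeded differently, entirely in local memory, so no draw requires a non-local access or any serialization.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define DRAWS    1000000

    /* Each thread keeps its own generator state: redundant work, but
     * every access is local. One shared generator would force non-local,
     * serialized accesses on a CTA-like machine. */
    static void *local_rng_work(void *arg)
    {
        long id = (long)arg;
        unsigned int seed = 12345u ^ (unsigned int)id;  /* per-thread seed (arbitrary) */
        long hits = 0;

        for (int i = 0; i < DRAWS; i++) {
            /* rand_r uses only the caller's state: no sharing, no locking */
            if (rand_r(&seed) % 2 == 0)
                hits++;
        }
        printf("thread %ld: %ld even draws\n", id, hits);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, local_rng_work, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }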

Parallel Formulation vs. Parallel Algorithm
- Parallel Formulation: refers to a parallelization of a serial algorithm.
- Parallel Algorithm: may represent an entirely different algorithm than the one used serially.
- In this course, we primarily focus on Parallel Formulations.

Steps to Parallel Formulation (refined from Lecture 2)
- Computation decomposition/partitioning:
  - Identify pieces of work that can be done concurrently
  - Assign tasks to multiple processors (processes used equivalently)
- Data decomposition/partitioning:
  - Decompose input, output and intermediate data across different processors
- Manage access to shared data and synchronization:
  - Coherent view, safe access for input or intermediate data
- UNDERLYING PRINCIPLES: Maximize concurrency and reduce overheads due to parallelization! Maximize potential speedup!

Peril-L Notation
- This week's homework was made more difficult because we didn't have a concrete way of expressing the parallelism features of our code!
- Peril-L, introduced as a neutral language for describing parallel programming constructs
  - Abstracts away details of existing languages
  - Architecture independent
  - Data parallel
  - Based on C, for universality
- We can map other language features to Peril-L features as we learn them

Peril-L Threads
- The basic form of parallelism is a thread
- Threads are specified in the following (data parallel) way:

    forall <int var> in ( <index range spec> ) { <body> }

- Semantics: spawn k threads, each executing body

    forall thid in (1..12) {
        printf("Hello, World, from thread %i\n", thid);
    }

- <index range spec> is any reasonable (ordered) naming
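As a point of reference only (OpenMP comes later in the course and is not part of Peril-L), the forall example above corresponds roughly to the following C/OpenMP sketch; the mapping, thread count, and numbering offset are my own assumptions.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Roughly analogous to: forall thid in (1..12) { printf(...); }
         * OpenMP numbers threads from 0, so we print thread_num + 1. */
        #pragma omp parallel num_threads(12)
        {
            int thid = omp_get_thread_num() + 1;
            printf("Hello, World, from thread %i\n", thid);
        }
        return 0;
    }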

Thread Model is Asynchronous (MIMD)
- Threads execute at their own rate
- The execution relationships among threads are not known or predictable
- To cause threads to synchronize, we have

    barrier;

- Threads arriving at a barrier suspend execution until all threads in the same forall arrive there; then they're all released

Common Notions of Task-Parallel Thread Creation (not in Peril-L)
- (Figure shown on the slide; not captured in this transcription.)

Memory Model in Peril-L
- Two kinds of memory: local and global
  - Variables declared in a thread are local by default
  - Any variable w/ underlined_name is global
- Names (usually indexes) work as usual
  - Local variables use local indexing
  - Global variables use global indexing
- Memory is based on CTA, so performance:
  - Local memory references are unit time
  - Global memory references take λ time

Memory Read-Write Semantics
- Local memory behaves like the RAM model
- Global memory
  - Reads are concurrent, so multiple processors can read a memory location at the same time
  - Writes must be exclusive, so only one processor can write a location at a time
  - The possibility of multiple processors writing to a location is not checked, and if it happens the result is unpredictable
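The "unpredictable result" of unchecked concurrent writes can be seen in a few lines of C with pthreads; this is my own illustrative example, not from the lecture. Run with several threads, the final count is usually less than the expected total because concurrent read-modify-write updates are lost.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define PER_THREAD 1000000

    static long counter = 0;        /* shared ("global" in Peril-L terms) */

    static void *racy_increment(void *arg)
    {
        (void)arg;
        for (int i = 0; i < PER_THREAD; i++)
            counter++;              /* unsynchronized read-modify-write: a data race */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, racy_increment, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* Expected 4,000,000; typically prints something smaller. */
        printf("counter = %ld (expected %d)\n", counter, NTHREADS * PER_THREAD);
        return 0;
    }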

Shared Memory or Distributed Memory?
- Peril-L is not a shared memory model because
  - It distinguishes between local and global memory costs; that's why it's called global
- Peril-L is not distributed memory because
  - The code does not use explicit communication to access remote data
- Peril-L is not a PRAM because
  - It is founded on the CTA
  - By distinguishing between local and global memory, it distinguishes their costs
  - It is asynchronous

Serializing Global Writes
- To ensure the exclusive write, Peril-L has

    exclusive { <body> }

- As compared to a lock, this is an atomic block of code. The semantics are that
  - a thread can execute <body> only if no other thread is doing so
  - if some thread is executing, then it must wait for access
  - sequencing through exclusive may not be fair

Reductions (and Scans) in Peril-L
- Aggregate operations use APL syntax
  - Reduce: <op>/<operand> for <op> in {+, *, &&, ||, max, min}; as in +/priv_sum
  - Scan: <op>\<operand> for <op> in {+, *, &&, ||, max, min}; as in +\local_finds
- To be portable, use reduce & scan rather than programming them

    exclusive { count += priv_count; }    WRONG
    count = +/priv_count;                 RIGHT

- Reduce/scan imply synchronization

Peril-L Example: Try3 from Lecture 2
- (Code shown on the slide; not captured in this transcription.)
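The WRONG/RIGHT contrast above has a rough analogue in C with OpenMP (my own mapping, shown only for comparison; the lecture itself stays in Peril-L): combining private counts through a critical section works, but expressing the combine as a reduction lets the runtime choose an efficient, portable implementation, much like count = +/priv_count. The data array and its contents are illustrative.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static int data[N];
        for (int i = 0; i < N; i++) data[i] = (i % 10 == 3);  /* sample: mark the 3s */

        /* Hand-coded combine: each thread adds its private count in a
         * critical section, analogous to exclusive { count += priv_count; } */
        long count_slow = 0;
        #pragma omp parallel
        {
            long priv_count = 0;
            #pragma omp for
            for (int i = 0; i < N; i++)
                priv_count += data[i];
            #pragma omp critical
            count_slow += priv_count;
        }

        /* Reduction clause: the runtime combines private partial sums,
         * analogous to count = +/priv_count; */
        long count_fast = 0;
        #pragma omp parallel for reduction(+:count_fast)
        for (int i = 0; i < N; i++)
            count_fast += data[i];

        printf("%ld %ld\n", count_slow, count_fast);
        return 0;
    }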

Connecting Global and Local Memory
- The CTA model does not have a global memory. Instead, global data is distributed across local memories
- But the #1 thing to learn is the importance of locality. We want to be able to place the data the current thread will use in local memory
- Construct for data partitioning/placement:

    localize();

- Meaning
  - Return a reference to the portion of the global data structure allocated locally (not a copy)
  - Thereafter, modifications to the local portion using the local name do not incur the non-local (λ) penalty

Scalable Parallel Solution using Localize
- (Code shown on the slide; not captured in this transcription.)

Summary of Lecture
- How to develop a parallel code
  - Added in data decomposition
- Peril-L syntax
- Next time:
  - More on task parallelism and task decomposition
  - Peril-L examples
  - Start on SIMD, possibly
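Peril-L's localize() has no direct C equivalent, but the underlying idea (work through a local reference into your own portion of a global structure instead of repeatedly indexing the global structure) can be sketched as follows; this is only an analogy under my own assumptions (array name, block distribution, thread function), and on a real shared-memory machine the cost model is of course not literally λ.

    #include <pthread.h>
    #include <stdio.h>

    #define N        128
    #define NTHREADS 8

    static double global_data[N];     /* conceptually "global", block-distributed */
    static double partial[NTHREADS];

    static void *scale_local_block(void *arg)
    {
        long id = (long)arg;
        int chunk = N / NTHREADS;

        /* Rough analogue of localize(): take a reference to (not a copy of)
         * this thread's block of the global array, then index it locally. */
        double *local = &global_data[id * chunk];

        double sum = 0.0;
        for (int i = 0; i < chunk; i++) {   /* local indexing: 0..chunk-1 */
            local[i] *= 2.0;
            sum += local[i];
        }
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) global_data[i] = 1.0;

        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, scale_local_block, (void *)i);

        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("total = %g\n", total);   /* expect 256 */
        return 0;
    }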