Writing Parallel Programs; Cost Model.


CSE341T 08/30/2017 Lecture 2

Due to physical and economic constraints, a typical machine we can buy now has 4 to 8 computing cores, and soon this number will be 16, 32, and 64. While the number of cores grows at a rapid pace, per-core speed hasn't increased much over the past several years. Additionally, graphics processing units (GPUs) are highly parallel platforms with hundreds of cores, readily available in commodity machines today. This gives a compelling reason to study parallel algorithms. In this course, we will learn algorithmic techniques that allow you to design and analyze parallel algorithms. In this lecture, we will learn how to write the pseudocode for parallel algorithms used in this class, work through some examples, and look at how one analyzes a parallel program written in this pseudocode.

When we analyze the cost of an algorithm formally, we need to be reasonably precise about the model in which we perform the analysis. As in CSE241/CSE247, we will use asymptotic costs. These costs can then be used to compare algorithms in terms of how they scale to large inputs. For example, as you know, some sorting algorithms use O(n log n) work and others O(n^2). Clearly the O(n log n) algorithm scales better, but perhaps the O(n^2) algorithm is actually faster on small inputs. In this class we are concerned with how algorithms scale, and therefore asymptotic analysis is indeed what we want.

1 Pseudocode for Parallel Programs

We are going to use pseudocode to write algorithms. This pseudocode is based on the programming language Cilk Plus. Cilk Plus is an extension of the C++ programming language and simply adds a few extra keywords so that you can specify the parallelism in your program. There are many aspects of this language, but we are mostly interested in just three keywords: spawn, sync, and parallel for. (In the language itself, they are cilk_spawn, cilk_sync, and cilk_for, but we will just use spawn and sync in the pseudocode.) Let us look at a code example.

FIB(n)
  if n ≤ 1
    then return 1
  x ← spawn FIB(n − 1)
  y ← FIB(n − 2)
  sync
  return x + y

Adding the spawn keyword before a function call means that the runtime is allowed to execute that function in parallel with the continuation, the remainder of the parent function. The sync keyword means that all functions that have been spawned so far have to return before control can go beyond this point. Therefore, in the code above, the values of FIB(n − 1) and FIB(n − 2) can be calculated in parallel. (A Cilk-style C++ version of this function appears at the end of this section.)

Another important concept is the parallel for loop. This construct gives the runtime system permission to execute all the iterations of the for loop in parallel with each other. For example, if you have an array A[1..n] and you have to create an array B such that B[i] = A[i] + 1 for all i, then you can simply write:

  parallel for i ← 1 to n
    do B[i] ← A[i] + 1

In general, two things can be done in parallel if they don't depend on each other. Let's say I want to calculate the sum of all the elements of an array. Can I use the following program?

  parallel for i ← 1 to n
    do s ← s + A[i]

These iterations are not independent, since the variable s is changed in every iteration. The program can lead to several different values of s depending on how the instructions are interleaved (scheduled) when run in parallel. These are called race conditions. A race condition occurs when more than one (logically) parallel operation accesses the same memory location and at least one of the accesses is a write. In this course, we are concerned with deterministic parallel programs, that is, programs that give the same result regardless of the schedule. While there are many definitions of determinism, we will concentrate on programs that have no race conditions. You must be careful to make sure that when you use the spawn keyword, you are not introducing race conditions into your program.

Exercise 1 If you want to do the sum computation in parallel without determinacy races, how would you do it?
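
For concreteness, here is a minimal sketch of how the FIB pseudocode above could be written in real Cilk-style C++, assuming a compiler that supports the Cilk keywords (e.g., an OpenCilk or legacy Cilk Plus toolchain); the function name and the use of 64-bit integers are just illustrative choices:

  #include <cilk/cilk.h>   // provides cilk_spawn and cilk_sync
  #include <cstdint>
  #include <cstdio>

  // Parallel Fibonacci, mirroring the FIB pseudocode above.
  std::int64_t fib(std::int64_t n) {
      if (n <= 1)
          return 1;
      // The spawned call may run in parallel with the rest of this function.
      std::int64_t x = cilk_spawn fib(n - 1);
      std::int64_t y = fib(n - 2);
      cilk_sync;  // wait for the spawned call to finish before reading x
      return x + y;
  }

  int main() {
      std::printf("fib(30) = %lld\n", static_cast<long long>(fib(30)));
  }

If the keywords are ignored, this is exactly the ordinary recursive Fibonacci; the keywords only grant the runtime permission to execute the two recursive calls in parallel.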

2 Let's look at some examples

We can now look at some examples of sequential programs you have written in the past and see how we can parallelize them simply.

Matrix Multiplication. Say we have two square (n × n) matrices A and B. To compute their product and store it into another n × n matrix C, simply compute

  $C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$,

which can easily be done by writing a triply nested for loop. That means it is also easily parallelizable using parallel for loops (a Cilk-style C++ sketch of this loop nest appears at the end of this section):

  let C be a new n × n matrix
  parallel for i ← 1 to n
    do parallel for j ← 1 to n
         do C_ij ← 0
            for k ← 1 to n
              do C_ij ← C_ij + A_ik · B_kj

Merge Sort. Recall that given an array of numbers, a sorting algorithm permutes the array so that its elements are arranged in increasing order. You are already familiar with merge sort; let's try to make it parallel. The most obvious way to make merge sort parallel is to make the recursive calls in parallel.

  MERGESORT(A, n)
    if n = 1
      then return A
    Divide A into two halves, A_left and A_right, each of size n/2
    A_left ← spawn MERGESORT(A_left, n/2)
    A_right ← MERGESORT(A_right, n/2)
    sync
    Merge the two halves into A
    return A

So how can we tell if these are good parallel algorithms?
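
As promised above, here is a rough Cilk-style C++ sketch of the parallel matrix-multiplication loop nest, again assuming an OpenCilk-style compiler with cilk_for; the flat row-major std::vector layout and the name matmul are my own illustrative choices:

  #include <cilk/cilk.h>   // provides cilk_for
  #include <cstddef>
  #include <vector>

  // C = A * B for n x n matrices stored row-major in flat vectors.
  // Each (i, j) entry is written by exactly one iteration of the two
  // parallel loops, so there are no race conditions; the innermost
  // k-loop stays sequential, as in the pseudocode.
  void matmul(const std::vector<double>& A,
              const std::vector<double>& B,
              std::vector<double>& C,
              std::size_t n) {
      cilk_for (std::size_t i = 0; i < n; ++i) {
          cilk_for (std::size_t j = 0; j < n; ++j) {
              double sum = 0.0;
              for (std::size_t k = 0; k < n; ++k)
                  sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] = sum;
          }
      }
  }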

3 The RAM model for sequential computation

Traditionally, algorithms have been analyzed in the Random Access Machine (RAM) model. This model assumes a single processor executing one instruction after another, with no parallel operations, and that the processor has access to unbounded memory. The model contains instructions commonly found in real computers, including basic arithmetic and logical operations (e.g., +, -, *, and, or, not), reads from and writes to arbitrary memory locations, and conditional and unconditional jumps to other locations in the code. Each instruction is assumed to take a constant amount of time, and the cost of a computation is measured in terms of the number of instructions executed by the machine.

This model has served well for analyzing the asymptotic runtime of sequential algorithms, and most work on sequential algorithms to date has used this model. One reason for its success is that there is an easy mapping from algorithmic pseudocode and sequential languages such as C and C++ to the model, so it is reasonably easy to reason about the cost of algorithms. That said, this model should only be used to derive asymptotic bounds (i.e., using big-O, big-Theta, and big-Omega), not to predict exact running times. One reason for this is that on a real machine not all instructions take the same time, and furthermore not all machines have the same instructions.

4 Performance measures

For a parallel algorithm, we are concerned with the amount of parallelism in the algorithm, or in other words, how it scales with the number of processors. We will measure the parallelism using two metrics: work and span. Roughly speaking, the work corresponds to the total number of instructions we perform, and the span (also called depth or critical path length) to the number of instructions along the longest chain of dependences. Another way of thinking about it: work is the running time of the program on one core, while span is the running time of the program on an infinite number of cores (here we assume that the scheduler is perfect and has no overheads). Henceforth, we will use $T_p$ to denote the time it takes to execute a computation on p processors, so we will use $T_1$ to denote work and $T_\infty$ to denote span.

Just as an example, let's analyze the work and span of the following code (note that any function can be represented in this manner):

FOO()
  e_1
  spawn BAR()
  spawn BAZ()
  e_2
  sync
  e_3

The work of a program is its running time on one core (as measured in the RAM model), so we simply ignore all the spawns and syncs in the program:

  $T_1(\mathrm{FOO}) = T_1(e_1) + T_1(\mathrm{BAR}) + T_1(\mathrm{BAZ}) + T_1(e_2) + T_1(e_3)$.

The span calculation is more involved:

  $T_\infty(\mathrm{FOO}) = T_\infty(e_1) + \max\{T_\infty(\mathrm{BAR}), T_\infty(\mathrm{BAZ}), T_\infty(e_2)\} + T_\infty(e_3)$.

Exercise 2 What is the work of FIB(n)?
Hint: The closed form for the Fibonacci numbers is $\mathrm{Fib}_i = \frac{\phi^i - \hat{\phi}^i}{\sqrt{5}}$, where $\phi = \frac{1+\sqrt{5}}{2}$ and $\hat{\phi} = \frac{1-\sqrt{5}}{2}$ (i.e., the two roots of the equation $x^2 = x + 1$).

Exercise 3 What is the span of FIB(n)?

Once we know the work and span of an algorithm, its parallelism is simply defined as the work divided by the span:

  $P = \frac{T_1}{T_\infty}$

One way of thinking about parallelism is that it denotes the average amount of work that can be performed in parallel for each step along the span. This measure of parallelism also represents roughly how many processors we can use efficiently when we run the algorithm.

The parallelism of an algorithm is dictated by both the work term and the span term. For example, suppose n = 10,000. If $T_1(n) = \Theta(n^3) \approx 10^{12}$ and $T_\infty(n) = \Theta(n \log n) \approx 10^5$, then $P(n) \approx 10^7$, which is a lot of parallelism. But if $T_1(n) = \Theta(n^2) \approx 10^8$, then $P(n) \approx 10^3$, which is much less parallelism. The decrease in parallelism is not because the span got larger, but because the work was reduced.
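
To see where these rough numbers come from, the arithmetic for n = 10,000 works out as follows (treating the constant factors hidden by the Θ-bounds as 1, which is of course only a back-of-the-envelope assumption):

  \begin{align*}
  T_1(n)      &\approx n^3 = (10^4)^3 = 10^{12}, \\
  T_\infty(n) &\approx n \log_2 n = 10^4 \cdot \log_2(10^4) \approx 10^4 \cdot 13.3 \approx 1.3 \times 10^5, \\
  P(n)        &= \frac{T_1(n)}{T_\infty(n)} \approx \frac{10^{12}}{1.3 \times 10^5} \approx 7.5 \times 10^6 \approx 10^7, \\
  \text{whereas if } T_1(n) &\approx n^2 = 10^8, \text{ then }
  P(n) \approx \frac{10^8}{1.3 \times 10^5} \approx 10^3.
  \end{align*}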

Goals:

In parallel algorithm design, we would like to keep the parallelism as high as possible, but without increasing the work. In general, the goals in designing efficient parallel algorithms are:

1. First priority: keep the work as low as possible.
2. Second priority: keep the parallelism as high as possible (and hence the span as low as possible).

In this course we will mostly cover parallel algorithms that are work efficient, meaning that their work is the same as, or close to (within a constant factor of), the work of the best sequential algorithm. Among the algorithms that are work efficient, we will then try to minimize the span to achieve the greatest parallelism. (A concrete Cilk-style sketch of a work-efficient, low-span algorithm appears after the exercises below.)

Exercise 4 We said that cilk_for is just syntactic sugar for cilk_spawn and cilk_sync. How would you go about converting a for loop to spawns and syncs? Think about minimizing the span while doing so.

Exercise 5 What are the work and span of the following code?

  parallel for i ← 1 to n
    do B[i] ← A[i] + 1
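
To make the work-efficiency goal concrete, here is one possible sketch (not the only one, again assuming an OpenCilk-style compiler; the interface is my own) of summing an array by divide and conquer: the work is Θ(n), the same as a sequential loop, while the span is only Θ(log n). The same pattern is a useful starting point for Exercises 1 and 4.

  #include <cilk/cilk.h>   // cilk_spawn / cilk_sync
  #include <cstddef>
  #include <vector>

  // Sum A[lo, hi) by divide and conquer.
  // Work: Theta(n), matching the sequential loop (work efficient).
  // Span: Theta(log n), since the recursion depth is logarithmic.
  long long sum(const std::vector<long long>& A, std::size_t lo, std::size_t hi) {
      if (hi - lo <= 1)
          return (hi > lo) ? A[lo] : 0;   // empty or single-element range
      std::size_t mid = lo + (hi - lo) / 2;
      // The two halves read disjoint parts of A and write to separate
      // local variables, so spawning introduces no race conditions.
      long long left  = cilk_spawn sum(A, lo, mid);
      long long right = sum(A, mid, hi);
      cilk_sync;
      return left + right;
  }

In practice one would coarsen the base case (e.g., sum a few thousand elements sequentially) so that spawn overhead does not dominate, but that only changes constant factors, not the asymptotic work or span.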