
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 2: Analysis of Fork-Join Parallel Programs
Dan Grossman
Last Updated: January 2016
For more information, see http://www.cs.washington.edu/homes/djg/teachingmaterials/

Outline
Done:
- How to use fork and join to write a parallel algorithm
- Why using divide-and-conquer with lots of small tasks is best
  - Combines results in parallel
- Some Java and ForkJoin Framework specifics
  - More pragmatics (e.g., installation) in separate notes
Now:
- More examples of simple parallel programs
- Arrays & balanced trees support parallelism better than linked lists
- Asymptotic analysis for fork-join parallelism
- Amdahl's Law

What else looks like this?
- Saw summing an array went from O(n) sequential to O(log n) parallel (assuming a lot of processors and very large n!)
- Exponential speed-up in theory (n / log n grows exponentially)
  (diagram: a binary tree of + operations combining the array elements pairwise)
- Anything that can use results from two halves and merge them in O(1) time has the same property

Examples
- Maximum or minimum element
- Is there an element satisfying some property (e.g., is there a 17)?
- Left-most element satisfying some property (e.g., first 17)
  - What should the recursive tasks return? How should we merge the results?
- Corners of a rectangle containing all points (a "bounding box")
- Counts, for example, number of strings that start with a vowel
  - This is just summing with a different base case
  - Many problems are!
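A minimal sketch of the first example above (maximum element) written as a ForkJoin reduction. This is not from the original slides: the class name MaxTask and the SEQUENTIAL_CUTOFF value are illustrative assumptions, and the sketch assumes a non-empty array.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: maximum element of an int array via divide-and-conquer.
class MaxTask extends RecursiveTask<Integer> {
  static final int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff; tune in practice
  final int[] arr; final int lo; final int hi;
  MaxTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }

  protected Integer compute() {
    if (hi - lo < SEQUENTIAL_CUTOFF) {         // base case: sequential scan
      int max = arr[lo];
      for (int i = lo + 1; i < hi; i++) max = Math.max(max, arr[i]);
      return max;
    }
    int mid = (lo + hi) / 2;
    MaxTask left  = new MaxTask(arr, lo, mid);
    MaxTask right = new MaxTask(arr, mid, hi);
    left.fork();                               // run left half in another task
    int rightMax = right.compute();            // do the right half ourselves
    int leftMax  = left.join();
    return Math.max(leftMax, rightMax);        // O(1) combine step
  }

  static int parallelMax(int[] arr) {
    return ForkJoinPool.commonPool().invoke(new MaxTask(arr, 0, arr.length));
  }
}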

Reductions
- Computations of this form are called reductions (or reduces?)
- Produce single answer from collection via an associative operator
  - Examples: max, count, leftmost, rightmost, sum, product, ...
  - Non-examples: median, subtraction, exponentiation
- (Recursive) results don't have to be single numbers or strings. They can be arrays or objects with multiple fields.
  - Example: Histogram of test results is a variant of sum
- But some things are inherently sequential
  - How we process arr[i] may depend entirely on the result of processing arr[i-1]
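To illustrate that a reduction's result need not be a single number, here is a rough sketch of the histogram example; it is not from the slides, and the HistogramTask name, the cutoff, and the 0-99-scores-bucketed-by-tens scheme are assumptions. The per-task result is an int[] of bucket counts, and the combine step is element-wise addition.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: histogram of test scores in 0-99, bucketed by tens.
class HistogramTask extends RecursiveTask<int[]> {
  static final int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff
  static final int BUCKETS = 10;               // assumed bucket scheme
  final int[] scores; final int lo; final int hi;
  HistogramTask(int[] scores, int lo, int hi) { this.scores = scores; this.lo = lo; this.hi = hi; }

  protected int[] compute() {
    if (hi - lo < SEQUENTIAL_CUTOFF) {         // base case: count sequentially
      int[] counts = new int[BUCKETS];
      for (int i = lo; i < hi; i++) counts[scores[i] / 10]++;
      return counts;
    }
    int mid = (lo + hi) / 2;
    HistogramTask left  = new HistogramTask(scores, lo, mid);
    HistogramTask right = new HistogramTask(scores, mid, hi);
    left.fork();
    int[] rightCounts = right.compute();
    int[] leftCounts  = left.join();
    for (int b = 0; b < BUCKETS; b++)          // combine: element-wise sum of histograms
      leftCounts[b] += rightCounts[b];
    return leftCounts;
  }

  static int[] parallelHistogram(int[] scores) {
    return ForkJoinPool.commonPool().invoke(new HistogramTask(scores, 0, scores.length));
  }
}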

Even easier: Maps (Data Parallelism)
- A map operates on each element of a collection independently to create a new collection of the same size
- No combining results
- For arrays, this is so trivial some hardware has direct support
- Canonical example: Vector addition (FORALL below is pseudocode for a parallel for-loop)

int[] vector_add(int[] arr1, int[] arr2) {
  assert (arr1.length == arr2.length);
  int[] result = new int[arr1.length];
  FORALL (int i = 0; i < arr1.length; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return result;
}

Maps in ForkJoin Framework
- Even though there is no result-combining, it still helps with load balancing to create many small tasks
  - Maybe not for vector-add, but for more compute-intensive maps
  - The forking is O(log n), whereas theoretically other approaches to vector-add are O(1)

class VecAdd extends RecursiveAction {
  int lo; int hi; int[] res; int[] arr1; int[] arr2;
  VecAdd(int l, int h, int[] r, int[] a1, int[] a2) {
    lo = l; hi = h; res = r; arr1 = a1; arr2 = a2;
  }
  protected void compute() {
    if (hi - lo < SEQUENTIAL_CUTOFF) {   // SEQUENTIAL_CUTOFF: a constant defined elsewhere
      for (int i = lo; i < hi; i++)
        res[i] = arr1[i] + arr2[i];
    } else {
      int mid = (hi + lo) / 2;
      VecAdd left  = new VecAdd(lo, mid, res, arr1, arr2);
      VecAdd right = new VecAdd(mid, hi, res, arr1, arr2);
      left.fork();
      right.compute();
      left.join();
    }
  }
}

int[] add(int[] arr1, int[] arr2) {
  assert (arr1.length == arr2.length);
  int[] ans = new int[arr1.length];
  ForkJoinPool.commonPool().invoke(    // needs Java 8+
      new VecAdd(0, arr1.length, ans, arr1, arr2));
  return ans;
}

Maps and reductions
Maps and reductions: the workhorses of parallel programming
- By far the two most important and common patterns
  - Two more-advanced patterns in next lecture
- Learn to recognize when an algorithm can be written in terms of maps and reductions
- Use maps and reductions to describe (parallel) algorithms
- Programming them becomes trivial with a little practice
  - Exactly like sequential for-loops seem second-nature

Digression: MapReduce on clusters
- You may have heard of Google's map/reduce
  - Or the open-source version Hadoop
- Idea: Perform maps/reduces on data using many machines
  - The system takes care of distributing the data and managing fault tolerance
  - You just write code to map one element and reduce elements to a combined result
- Separates how to do recursive divide-and-conquer from what computation to perform
  - Old idea in higher-order functional programming transferred to large-scale distributed computing
  - Complementary approach to declarative queries for databases

Trees
- Maps and reductions work just fine on balanced trees
  - Divide-and-conquer each child rather than array subranges
  - Correct for unbalanced trees, but won't get much speed-up
- Example: minimum element in an unsorted but balanced binary tree in O(log n) time given enough processors
- How to do the sequential cut-off?
  - Store number-of-descendants at each node (easy to maintain)
  - Or could approximate it with, e.g., AVL-tree height
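A rough sketch of the tree-minimum example, assuming a hypothetical TreeNode type that stores the number-of-descendants count mentioned above; the names and the cutoff are illustrative, not from the slides.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical node type: each node caches the size of its subtree,
// which is what lets us apply the sequential cut-off.
class TreeNode {
  int value;
  int size;              // number of nodes in this subtree (maintained on insert/delete)
  TreeNode left, right;
}

class TreeMin extends RecursiveTask<Integer> {
  static final int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff; tune in practice
  final TreeNode node;
  TreeMin(TreeNode node) { this.node = node; }

  protected Integer compute() {
    if (node.size < SEQUENTIAL_CUTOFF)         // small subtree: just walk it sequentially
      return sequentialMin(node);
    TreeMin left  = node.left  == null ? null : new TreeMin(node.left);
    TreeMin right = node.right == null ? null : new TreeMin(node.right);
    int min = node.value;
    if (left != null) left.fork();             // process left child in parallel
    if (right != null) min = Math.min(min, right.compute());
    if (left != null) min = Math.min(min, left.join());
    return min;
  }

  static int sequentialMin(TreeNode n) {
    if (n == null) return Integer.MAX_VALUE;
    return Math.min(n.value, Math.min(sequentialMin(n.left), sequentialMin(n.right)));
  }

  static int parallelMin(TreeNode root) {
    return ForkJoinPool.commonPool().invoke(new TreeMin(root));
  }
}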

Linked lists
- Can you parallelize maps or reduces over linked lists?
  - Example: Increment all elements of a linked list
  - Example: Sum all elements of a linked list
  - Parallelism still beneficial for expensive per-element operations
  (diagram: a linked list with nodes b, c, d, e, f and front/back pointers)
- Once again, data structures matter!
- For parallelism, balanced trees generally better than lists so that we can get to all the data exponentially faster: O(log n) vs. O(n)
  - Trees have the same flexibility as lists compared to arrays

Analyzing algorithms
- Like all algorithms, parallel algorithms should be:
  - Correct
  - Efficient
- For our algorithms so far, correctness is obvious, so we'll focus on efficiency
  - Want asymptotic bounds
  - Want to analyze the algorithm without regard to a specific number of processors
- The key magic of the ForkJoin Framework is getting expected run-time performance asymptotically optimal for the available number of processors
  - So we can analyze algorithms assuming this guarantee

Work and Span
- Let T_P be the running time if there are P processors available
- Two key measures of run-time:
  - Work: How long it would take 1 processor = T_1
    - Just "sequentialize" the recursive forking
  - Span: How long it would take infinity processors = T_∞
    - The longest dependence-chain
    - Example: O(log n) for summing an array
      (notice having > n/2 processors is no additional help)
    - Also called "critical path length" or "computational depth"

The DAG
- A program execution using fork and join can be seen as a DAG
  - Nodes: Pieces of work
  - Edges: Source must finish before destination starts
- A fork ends a node and makes two outgoing edges
  - New thread
  - Continuation of current thread
- A join ends a node and makes a node with two incoming edges
  - Node just ended
  - Last node of thread joined on

Our simple examples
- fork and join are very flexible, but divide-and-conquer maps and reductions use them in a very basic way: a tree on top of an upside-down tree
  (diagram: top tree labeled "divide", leaves labeled "base cases", inverted tree below labeled "combine results")

More interesting DAGs?
- The DAGs are not always this simple
- Example: Suppose combining two results might be expensive enough that we want to parallelize each one
  - Then each node in the inverted tree on the previous slide would itself expand into another set of nodes for that parallel computation

Connecting to performance
- Recall: T_P = running time if there are P processors available
- Work = T_1 = sum of run-time of all nodes in the DAG
  - That lonely processor does everything
  - Any topological sort is a legal execution
  - O(n) for simple maps and reductions
- Span = T_∞ = sum of run-time of all nodes on the most-expensive path in the DAG
  - Note: costs are on the nodes, not the edges
  - Our infinite army can do everything that is ready to be done, but still has to wait for earlier results
  - O(log n) for simple maps and reductions

Definitions
A couple more terms:
- Speed-up on P processors: T_1 / T_P
- If speed-up is P as we vary P, we call it perfect linear speed-up
  - Perfect linear speed-up means doubling P halves running time
  - Usually our goal; hard to get in practice
- Parallelism is the maximum possible speed-up: T_1 / T_∞
  - At some point, adding processors won't help
  - What that point is depends on the span
Designing parallel algorithms is about decreasing span without increasing work too much

Optimal T_P: Thanks, ForkJoin library!
- So we know T_1 and T_∞, but we want T_P (e.g., P = 4)
- Ignoring memory-hierarchy issues (caching), T_P can't beat
  - T_1 / P   (why not?)
  - T_∞       (why not?)
- So an asymptotically optimal execution would be: T_P = O((T_1 / P) + T_∞)
  - First term dominates for small P, second for large P
- The ForkJoin Framework gives an expected-time guarantee of asymptotically optimal!
  - Expected time because it flips coins when scheduling
  - How? For an advanced course (few need to know)
  - Guarantee requires a few assumptions about your code

Division of responsibility
- Our job as ForkJoin Framework users:
  - Pick a good algorithm, write a program
  - When run, program creates a DAG of things to do
  - Make all the nodes a small-ish and approximately equal amount of work
- The framework-writer's job:
  - Assign work to available processors to avoid idling
  - Let framework-user ignore all scheduling issues
  - Keep constant factors low
  - Give the expected-time optimal guarantee assuming framework-user did his/her job: T_P = O((T_1 / P) + T_∞)

Examples
T_P = O((T_1 / P) + T_∞)
- In the algorithms seen so far (e.g., sum an array):
  - T_1 = O(n)
  - T_∞ = O(log n)
  - So expect (ignoring overheads): T_P = O(n/P + log n)
- Suppose instead:
  - T_1 = O(n^2)
  - T_∞ = O(n)
  - So expect (ignoring overheads): T_P = O(n^2/P + n)
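As a rough sanity check of the first bound, here is a tiny back-of-the-envelope calculation; it is not from the slides, the value of n and the choices of P are illustrative assumptions, and each DAG node is treated as one unit of cost.

public class TpEstimate {
  public static void main(String[] args) {
    // Illustrative only: treat T_1 as n node-costs and T_inf as log2(n) for an array sum.
    long n = 1_000_000;
    double span = Math.log(n) / Math.log(2);   // about 20
    for (int p : new int[]{1, 4, 64, 100_000}) {
      double tp = (double) n / p + span;       // T_P ~ n/P + log n
      System.out.printf("P = %6d   T_P ~ %.0f%n", p, tp);
    }
    // The n/P term dominates until P approaches n / log n (about 50,000 here);
    // beyond that the span term keeps T_P near 20 no matter how many processors are added.
  }
}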

Amdahl's Law (mostly bad news)
- So far: analyze parallel programs in terms of work and span
- In practice, programs typically have parts that parallelize well
  - Such as maps/reductions over arrays and trees
- ...and parts that don't parallelize at all
  - Such as reading a linked list, getting input, or computations where each step needs the result of the previous one
- "Nine women can't make a baby in one month"

Amdahl's Law (mostly bad news)
- Let the work (time to run on 1 processor) be 1 unit of time
- Let S be the portion of the execution that can't be parallelized
  - Then: T_1 = S + (1-S) = 1
- Suppose we get perfect linear speedup on the parallel portion
  - Then: T_P = S + (1-S)/P
- So the overall speedup with P processors is (Amdahl's Law):
  - T_1 / T_P = 1 / (S + (1-S)/P)
- And the parallelism (infinite processors) is:
  - T_1 / T_∞ = 1 / S

Why such bad news
T_1 / T_P = 1 / (S + (1-S)/P)
T_1 / T_∞ = 1 / S
- Suppose 33% of a program's execution is sequential
  - Then a billion processors won't give a speedup over 3
- Suppose you miss the good old days (1980-2005) where 12ish years was long enough to get 100x speedup
  - Now suppose in 12 years, clock speed is the same but you get 256 processors instead of 1
  - For 256 processors to get at least 100x speedup, we need 100 ≤ 1 / (S + (1-S)/256)
  - Which means S ≤ 0.0061 (i.e., 99.4% perfectly parallelizable)
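A small stand-alone check of these numbers, plugging values into Amdahl's Law; this is illustrative and not part of the original slides (the class and method names are made up for the example).

public class Amdahl {
  // Speedup predicted by Amdahl's Law for sequential fraction s and p processors.
  static double speedup(double s, double p) {
    return 1.0 / (s + (1.0 - s) / p);
  }

  public static void main(String[] args) {
    // 33% sequential: even a billion processors gives a speedup of barely 3.
    System.out.println(speedup(1.0 / 3.0, 1e9));     // ~3.0

    // Sequential fraction allowed for a 100x speedup on 256 processors:
    // 100 <= 1 / (S + (1-S)/256)  rearranges to  S <= (256/100 - 1) / 255 ~ 0.0061
    double s = (256.0 / 100.0 - 1.0) / 255.0;
    System.out.println(s);                           // ~0.0061
    System.out.println(speedup(s, 256));             // ~100
  }
}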

Plots you have to see
1. Assume 256 processors
   - x-axis: sequential portion S, ranging from 0.01 to 0.25
   - y-axis: speedup T_1 / T_P (will go down as S increases)
2. Assume S = 0.01 or 0.1 or 0.25 (three separate lines)
   - x-axis: number of processors P, ranging from 2 to 32
   - y-axis: speedup T_1 / T_P (will go up as P increases)
Do this as a homework problem!
- Chance to use a spreadsheet or other graphing program
- Compare against your intuition
- A picture is worth 1000 words, especially if you made it

All is not lost
- Amdahl's Law is a bummer!
  - Unparallelized parts become a bottleneck very quickly
  - But it doesn't mean additional processors are worthless
- We can find new parallel algorithms
  - Some things that seem sequential are actually parallelizable
- We can change the problem or do new things
  - Example: Video games use tons of parallel processors
    - They are not rendering 10-year-old graphics faster
    - They are rendering more beautiful(?) monsters

Moore and Amdahl
- Moore's Law is an observation about the progress of the semiconductor industry
  - Transistor density doubles roughly every 18 months
- Amdahl's Law is a mathematical theorem
  - Diminishing returns of adding more processors
- Both are incredibly important in designing computer systems