Algorithms PART I: Embarrassingly Parallel HPC Fall 2012 Prof. Robert van Engelen

Overview
- Ideal parallelism
- Master-worker paradigm
- Processor farms
- Examples: geometrical transformations of images, Mandelbrot set, Monte Carlo methods
- Load balancing of independent tasks
- Further reading

Ideal Parallelism
- An ideal parallel computation can be immediately divided into completely independent parts
- Embarrassingly parallel, also called naturally parallel
- No special techniques or algorithms required
(Figure: the input is split over workers P0..P3, whose outputs are combined into the result)

Ideal Parallelism and the Master-Worker Paradigm
- Ideally there is no communication, giving maximum speedup
- Practical embarrassingly parallel applications have initial communication and (sometimes) a final communication
- Master-worker paradigm: the master submits jobs to the workers
- No communication between workers
(Figure: the master sends initial data to workers P0..P3 and collects the results)

Parallel Tradeoffs
- Embarrassingly parallel with perfect load balancing: t_comp = t_s / P, assuming P workers and sequential execution time t_s
- The master-worker paradigm gives speedup only if the workers have to perform a reasonable amount of work: sequential time > total communication time + one worker's time, i.e. t_s > t_P = t_comm + t_comp
- Speedup S_P = t_s / t_P = P t_comp / (t_comm + t_comp) = P / (r^-1 + 1), where r = t_comp / t_comm
- Thus S_P → P as r → ∞
- However, communication t_comm can be expensive: typically t_s < t_comm for small tasks, that is, the time to send/recv data to the workers is more expensive than doing all the work sequentially
- Try to overlap computation with communication to hide the t_comm latency
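To make the speedup model concrete, here is a small C program (an added sketch, not from the slides) that tabulates S_P = P / (r^-1 + 1) for a few computation/communication ratios r; the worker count P = 8 and the ratios are arbitrary example values.

  /* Sketch: evaluate the speedup model S_P = P / (1/r + 1), r = t_comp / t_comm,
   * for a few ratios, showing S_P approaching P as r grows. */
  #include <stdio.h>

  int main(void)
  {
      const double P = 8.0;                        /* number of workers (example value) */
      const double ratios[] = { 0.1, 1.0, 10.0, 100.0, 1000.0 };
      for (int i = 0; i < 5; i++) {
          double r = ratios[i];
          double S = P / (1.0 / r + 1.0);          /* S_P = P / (r^-1 + 1) */
          printf("r = %8.1f  ->  S_P = %6.3f (upper bound %g)\n", r, S, P);
      }
      return 0;
  }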

Example 1: Geometrical Transformations of Images
- Partition the pixmap into regions: by block (row & column blocks) or by row
- Pixmap operations:
  - Shift: x' = x + Δx, y' = y + Δy
  - Scale: x' = S_x x, y' = S_y y
  - Rotation: x' = x cos θ + y sin θ, y' = -x sin θ + y cos θ
  - Clip: keep a point only if x_l < x < x_h and y_l < y < y_h
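For illustration, a minimal C sketch (added here, not from the slides) of the rotation operation applied to one pixel coordinate; the pixel_t type, the function name, and the 30-degree test angle are made up for the example.

  /* Sketch: rotate a pixel coordinate (x, y) about the origin by angle theta,
   * using the rotation formulas above. Compile with -lm. */
  #include <math.h>
  #include <stdio.h>

  typedef struct { int x, y; } pixel_t;

  static pixel_t rotate_pixel(pixel_t p, double theta)
  {
      pixel_t q;
      q.x = (int)lround( p.x * cos(theta) + p.y * sin(theta));
      q.y = (int)lround(-p.x * sin(theta) + p.y * cos(theta));
      return q;
  }

  int main(void)
  {
      const double theta = 3.141592653589793 / 6.0;   /* rotate by 30 degrees */
      pixel_t p = { 100, 50 };
      pixel_t q = rotate_pixel(p, theta);
      printf("(%d, %d) -> (%d, %d)\n", p.x, p.y, q.x, q.y);
      return 0;
  }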

Example 1: Master and Worker Naïve Implementation
Each worker computes, for row <= x < row + 480/P and 0 <= y < 640: x' = x + Δx, y' = y + Δy

Master:
  row = 0;
  for (p = 0; p < P; p++) {            /* send the first row of each strip to worker p */
    send(row, p);
    row += 480/P;
  }
  for (i = 0; i < 480; i++)
    for (j = 0; j < 640; j++)
      temp_map[i][j] = 0;
  for (i = 0; i < 480 * 640; i++) {    /* one message per pixel */
    recv(&oldrow, &oldcol, &newrow, &newcol, anyp);
    if (!(newrow < 0 || newrow >= 480 || newcol < 0 || newcol >= 640))
      temp_map[newrow][newcol] = map[oldrow][oldcol];
  }
  for (i = 0; i < 480; i++)
    for (j = 0; j < 640; j++)
      map[i][j] = temp_map[i][j];

Worker:
  recv(&row, master);
  for (oldrow = row; oldrow < row + 480/P; oldrow++) {
    for (oldcol = 0; oldcol < 640; oldcol++) {
      newrow = oldrow + delta_x;
      newcol = oldcol + delta_y;
      send(oldrow, oldcol, newrow, newcol, master);
    }
  }

Example 1: Geometrical Transformation Speedups?
- Assume in the general case that the pixmap has n^2 points
- Sequential time of the pixmap shift: t_s = 2n^2
- Communication: t_comm = P(t_startup + t_data) + n^2(t_startup + 4 t_data) = O(P + n^2)
- Computation: t_comp = 2n^2 / P = O(n^2 / P)
- Computation/communication ratio: r = O((n^2 / P) / (P + n^2)) = O(n^2 / (P^2 + n^2 P))
- This is not good! The asymptotic computation time should be an order higher than the asymptotic communication time, e.g. O(n^2) versus O(n), or there must be a very large constant in the computation time
- Performance on a shared-memory machine can be good: no communication time

Example 2: Mandelbrot Set
(Figure: the Mandelbrot set in the complex plane; axes labeled Real and Imaginary)
- The number of iterations it takes for z to end up at a point outside the complex circle with radius 2 determines the pixmap color
- A pixmap is generated by iterating the complex-valued recurrence z_{k+1} = z_k^2 + c, with z_0 = 0 and c = x + yi, until |z| > 2
- The Mandelbrot set is shifted and scaled for display: x = x_min + x_scale * row and y = y_min + y_scale * col for each of the pixmap's pixels at (row, col)

Example 2: Mandelbrot Set Color Computation

  int pix_color(float x0, float y0)
  {
    float x = x0, y = y0;
    int i = 0;
    while (x*x + y*y < (2*2) && i < maxiter) {
      float xtemp = x*x - y*y + x0;
      float ytemp = 2*x*y + y0;
      x = xtemp;
      y = ytemp;
      i++;
    }
    return i;
  }
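For context, a small serial driver (an added sketch, not part of the slides) that calls pix_color over the whole pixmap and writes a plain PGM image to stdout; the 640x480 size, the xmin/ymin/xscale/yscale bounds, and maxiter = 255 are illustrative choices.

  /* Sketch: serial Mandelbrot driver. Usage: ./mandel > mandel.pgm */
  #include <stdio.h>

  static int maxiter = 255;

  int pix_color(float x0, float y0)            /* as on the slide above */
  {
      float x = x0, y = y0;
      int i = 0;
      while (x*x + y*y < (2*2) && i < maxiter) {
          float xtemp = x*x - y*y + x0;
          float ytemp = 2*x*y + y0;
          x = xtemp; y = ytemp; i++;
      }
      return i;
  }

  int main(void)
  {
      const int W = 640, H = 480;
      const float xmin = -2.0f, ymin = -1.5f;
      const float xscale = 3.0f / W, yscale = 3.0f / H;

      printf("P2\n%d %d\n%d\n", W, H, maxiter); /* PGM header */
      for (int y = 0; y < H; y++) {
          for (int x = 0; x < W; x++)
              printf("%d ", pix_color(xmin + x * xscale, ymin + y * yscale));
          printf("\n");
      }
      return 0;
  }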

Example 2: Mandelbrot Set Simple Master and Worker
Master and worker send/recv individual (x, y) pixel colors.

Master:
  row = 0;
  for (p = 0; p < P; p++) {          /* assign a strip of rows to each worker */
    send(row, p);
    row += 480/P;
  }
  for (i = 0; i < 480 * 640; i++) {  /* one message per pixel */
    recv(&x, &y, &color, anyp);
    display(x, y, color);
  }

Worker:
  recv(&row, master);
  for (y = row; y < row + 480/P; y++) {
    for (x = 0; x < 640; x++) {
      x0 = xmin + x * xscale;
      y0 = ymin + y * yscale;
      color = pix_color(x0, y0);
      send(x, y, color, master);
    }
  }

Example 2: Mandelbrot Set Better Master and Worker
Master and worker send/recv an array color[x] of pixel colors for each row y.

Master:
  row = 0;
  for (p = 0; p < P; p++) {          /* assign a strip of rows to each worker */
    send(row, p);
    row += 480/P;
  }
  for (i = 0; i < 480; i++) {        /* one message per row */
    recv(&y, &color, anyp);
    for (x = 0; x < 640; x++)
      display(x, y, color[x]);
  }

Worker:
  recv(&row, master);
  for (y = row; y < row + 480/P; y++) {
    for (x = 0; x < 640; x++) {
      x0 = xmin + x * xscale;
      y0 = ymin + y * yscale;
      color[x] = pix_color(x0, y0);
    }
    send(y, color, master);
  }

Exercise: assume an n x n pixmap, n iterations on average per pixel, and P workers. What are the communication time, the computation time, the computation/communication ratio, and the speedup?
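A rough MPI rendering of this row-per-message scheme (an added sketch, not part of the slides): rank 0 plays the master, each worker derives its row strip from its rank instead of receiving it, and the tag, bounds, and maxiter values are illustrative. It assumes at least two ranks and that the number of workers divides 480 evenly (left-over rows are simply skipped otherwise).

  /* Sketch: Mandelbrot master/worker in MPI, one message per finished row. */
  #include <mpi.h>
  #include <stdio.h>

  #define W 640
  #define H 480

  static int maxiter = 255;

  static int pix_color(float x0, float y0)     /* as on the earlier slide */
  {
      float x = x0, y = y0;
      int i = 0;
      while (x*x + y*y < (2*2) && i < maxiter) {
          float xtemp = x*x - y*y + x0;
          float ytemp = 2*x*y + y0;
          x = xtemp; y = ytemp; i++;
      }
      return i;
  }

  int main(int argc, char **argv)
  {
      const float xmin = -2.0f, ymin = -1.5f;
      const float xscale = 3.0f / W, yscale = 3.0f / H;
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size < 2) { MPI_Finalize(); return 1; }   /* need a master and workers */

      int P = size - 1;                /* number of workers */
      int strip = H / P;               /* rows per worker */

      if (rank == 0) {                 /* master: collect one message per row */
          static int image[H][W];
          int buf[W + 1];              /* buf[0] = row index, buf[1..W] = colors */
          for (int i = 0; i < P * strip; i++) {
              MPI_Recv(buf, W + 1, MPI_INT, MPI_ANY_SOURCE, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              for (int x = 0; x < W; x++)
                  image[buf[0]][x] = buf[1 + x];    /* "display" the row */
          }
          printf("received %d rows\n", P * strip);
      } else {                         /* worker: compute its assigned row strip */
          int buf[W + 1];
          int row = (rank - 1) * strip;
          for (int y = row; y < row + strip; y++) {
              buf[0] = y;
              for (int x = 0; x < W; x++)
                  buf[1 + x] = pix_color(xmin + x * xscale, ymin + y * yscale);
              MPI_Send(buf, W + 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
          }
      }
      MPI_Finalize();
      return 0;
  }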

Processor Farms
- Processor farms (also called the work-pool approach): a collection of workers, where each worker repeats:
  - Take a new task from the pool
  - Compute the task
  - Return the result to the pool
- Achieves load balancing when tasks differ in the amount of work and workers differ in execution speed (viz. a heterogeneous cluster)
(Figure: the master puts tasks into the work pool and gets results; workers P0..P3 get tasks from the pool and return results)

Example 2: Mandelbrot Set with Processor Farm
Assuming synchronous send/recv; count keeps track of how many workers are still active.

Master:
  count = 0;
  row = 0;
  for (p = 0; p < P; p++) {          /* send one row to each worker */
    send(row, p);
    count++;
    row++;
  }
  do {
    recv(&y, &color, anyp);          /* recv colors for row y */
    count--;
    if (row < 480) {
      send(row, anyp);               /* send next row */
      row++;
      count++;
    } else
      send(-1, anyp);                /* send sentinel */
    for (x = 0; x < 640; x++)
      display(x, y, color[x]);
  } while (count > 0);

Worker:
  recv(&y, master);                  /* recv first row */
  while (y != -1) {
    for (x = 0; x < 640; x++) {      /* compute color for (x, y) */
      x0 = xmin + x * xscale;
      y0 = ymin + y * yscale;
      color[x] = pix_color(x0, y0);
    }
    send(y, color, master);          /* send colors */
    recv(&y, master);                /* recv next row (or sentinel) */
  }

Example 2: Mandelbrot Set with Processor Farm
Assuming asynchronous send/recv: the worker sends its rank along with the colors, and the master receives a row y plus the worker's rank and replies to that worker (using rank).

Master:
  count = 0;
  row = 0;
  for (p = 0; p < P; p++) {          /* send one row to each worker */
    send(row, p);
    count++;
    row++;
  }
  do {
    recv(&rank, &y, &color, anyp);   /* recv colors for row y and the worker's rank */
    count--;
    if (row < 480) {
      send(row, rank);               /* send next row to that worker */
      row++;
      count++;
    } else
      send(-1, rank);                /* send sentinel */
    for (x = 0; x < 640; x++)
      display(x, y, color[x]);
  } while (count > 0);

Worker:
  recv(&y, master);
  while (y != -1) {
    for (x = 0; x < 640; x++) {
      x0 = xmin + x * xscale;
      y0 = ymin + y * yscale;
      color[x] = pix_color(x0, y0);
    }
    send(myrank, y, color, master);  /* send colors and own rank */
    recv(&y, master);
  }
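In MPI the explicit rank in the message is not needed, because the MPI_Status filled in by MPI_Recv already identifies the sender. Below is a sketch of the master's farm loop under that assumption (added here, not from the slides); the tags, buffer layout, and image array are illustrative, the workers are assumed to run the slide's loop with MPI_Send/MPI_Recv, and the sketch assumes at least as many rows as workers.

  /* Sketch: processor-farm master in MPI. The worker that finished a row is
   * identified via status.MPI_SOURCE rather than a rank field in the message. */
  #include <mpi.h>

  #define W 640
  #define H 480
  #define TAG_WORK 1
  #define TAG_STOP 2

  void farm_master(int nworkers)
  {
      static int image[H][W];
      int buf[W + 1];                  /* buf[0] = row index, buf[1..W] = colors */
      int row = 0, active = 0;
      MPI_Status status;

      for (int p = 1; p <= nworkers && row < H; p++) {   /* one row per worker */
          MPI_Send(&row, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
          row++; active++;
      }
      while (active > 0) {
          MPI_Recv(buf, W + 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          int worker = status.MPI_SOURCE;                /* who finished a row */
          active--;
          if (row < H) {                                 /* hand out the next row */
              MPI_Send(&row, 1, MPI_INT, worker, TAG_WORK, MPI_COMM_WORLD);
              row++; active++;
          } else {                                       /* no work left: sentinel */
              int stop = -1;
              MPI_Send(&stop, 1, MPI_INT, worker, TAG_STOP, MPI_COMM_WORLD);
          }
          for (int x = 0; x < W; x++)                    /* "display" finished row */
              image[buf[0]][x] = buf[1 + x];
      }
  }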

Example 3: Monte Carlo Methods
- Perform random selections to sample the solution; each sample is independent
- Example: compute π by sampling the [-1..1, -1..1] square that contains a circle with radius 1
- The probability of hitting the circle is π/4
(Figure: a circle of radius 1, area π, inscribed in a square with side 2)
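A minimal serial sketch of this estimator (added here, not from the slides); the use of rand(), the fixed seed, and the sample count are arbitrary choices for illustration.

  /* Sketch: estimate pi by sampling the [-1,1] x [-1,1] square and counting
   * hits inside the unit circle; hits/samples approximates pi/4. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      const long n = 10000000;          /* number of samples (illustrative) */
      long hits = 0;
      srand(12345);                     /* fixed seed for reproducibility */
      for (long i = 0; i < n; i++) {
          double x = 2.0 * rand() / RAND_MAX - 1.0;
          double y = 2.0 * rand() / RAND_MAX - 1.0;
          if (x * x + y * y <= 1.0)
              hits++;
      }
      printf("pi is approximately %f\n", 4.0 * hits / n);
      return 0;
  }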

Example 3: Monte Carlo Methods
- General Monte Carlo methods sample inside and outside the solution space
- Many Monte Carlo methods do not sample outside the solution space
- Example: function integration by sampling the function values over the integration domain
(Figure: a function f sampled at random points x_r within the integration domain [x_1, x_2])
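For instance, the mean-value Monte Carlo estimate of an integral is (x2 - x1) times the average of f at uniformly random points. The sketch below is added here (not from the slides) and integrates an example integrand f(x) = x*x over [0, 1]; the integrand, seed, and sample count are arbitrary choices.

  /* Sketch: Monte Carlo integration of f over [x1, x2] by averaging f at
   * uniformly random sample points: integral ~ (x2 - x1) * mean(f(x_r)). */
  #include <stdio.h>
  #include <stdlib.h>

  static double f(double x) { return x * x; }   /* example integrand */

  int main(void)
  {
      const double x1 = 0.0, x2 = 1.0;          /* integration domain */
      const long n = 1000000;                   /* number of samples */
      double sum = 0.0;
      srand(12345);
      for (long i = 0; i < n; i++) {
          double x = x1 + (x2 - x1) * rand() / RAND_MAX;
          sum += f(x);
      }
      printf("integral is approximately %f (exact 1/3)\n", (x2 - x1) * sum / n);
      return 0;
  }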

Example 3: Monte Carlo Methods and Parallel RNGs
- Approach 1: the master sends random number sequences to the workers
  - Uses one random number generator (RNG)
  - Lots of communication
- Approach 2: the workers produce independent random number sequences
  - Communication of sample parameters only
  - Cannot use a standard pseudo RNG (the sequences would be the same)
  - Needs a parallel RNG
- Parallel RNGs (e.g. the SPRNG library): parallel pseudo RNGs and parallel quasi-random RNGs

Example 3: Monte Carlo Methods and Parallel RNGs
- Linear congruential generator (pseudo RNG): x_{i+1} = (a x_i + c) mod m with a choice of a, c, and m
  - A good choice of a, c, and m is crucial!
  - Cannot easily segment the sequence (over processors)
- A parallel pseudo RNG with a jump constant k: x_{i+k} = (A x_i + C) mod m, where A = a^k mod m and C = c(a^{k-1} + a^{k-2} + ... + a^1 + a^0) mod m
  - Allows parallel computation of the sequence
  - The subsequences per processor: P_0 generates x_0, x_k, x_2k, ..., x_ik, ...; P_1 generates x_1, x_{k+1}, ..., x_{ik+1}, ...; up to P_{k-1}, which generates x_{k-1}, x_{2k-1}, ..., x_{ik+k-1}, ...
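A minimal sketch of this leapfrog scheme (added here, not from the slides); the LCG constants a, c, m are the classic ANSI-C-style values chosen only for illustration, the processor count k = 4 is arbitrary, and 64-bit arithmetic avoids overflow.

  /* Sketch: leapfrog parallelization of a linear congruential generator.
   * Each of k logical processors steps x_{i+k} = (A*x_i + C) mod m, where
   * A = a^k mod m and C = c*(a^{k-1} + ... + a + 1) mod m, so together they
   * reproduce the serial sequence x_{i+1} = (a*x_i + c) mod m, interleaved. */
  #include <stdio.h>
  #include <stdint.h>

  static const uint64_t a = 1103515245, c = 12345, m = 2147483648ULL;  /* illustrative */

  int main(void)
  {
      const int k = 4;                 /* number of logical processors */
      const uint64_t x0 = 1;           /* seed */

      /* Compute the jump constants A = a^k mod m and C = c*(a^{k-1}+...+1) mod m. */
      uint64_t A = 1, C = 0;
      for (int j = 0; j < k; j++) {
          C = (C * a + c) % m;
          A = (A * a) % m;
      }

      /* Processor p starts from x_p (the p-th element of the serial sequence)
       * and then jumps k steps at a time. */
      uint64_t start = x0;
      for (int p = 0; p < k; p++) {
          uint64_t x = start;
          printf("P%d:", p);
          for (int i = 0; i < 5; i++) {            /* first few values per rank */
              printf(" %llu", (unsigned long long)x);
              x = (A * x + C) % m;                 /* leapfrog step */
          }
          printf("\n");
          start = (a * start + c) % m;             /* x_{p+1} for the next processor */
      }
      return 0;
  }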

Load Balancing
- Load balancing attempts to spread tasks evenly across processors
- Load imbalance is caused by:
  - Tasks of different execution cost, e.g. the Mandelbrot example
  - Processors that operate at different execution speeds or are busy
- When tasks and processors are not load balanced:
  - Some processes finish early and sit idle waiting
  - The global computation is finished when the slowest processor(s) completes its task
(Figure: load-balanced versus non-load-balanced schedules of P0..P3 from start to finish)

Static Load Balancing
- Load balancing can be viewed as a form of bin packing
- Static scheduling of tasks amounts to optimal bin packing:
  - Round-robin algorithm
  - Randomized algorithms
  - Recursive bisection
  - Optimized scheduling with simulated annealing and genetic algorithms
- Problem: it is difficult to estimate the amount of work per task and to deal with changes in processor utilization and communication latencies
(Figure: the non-load-balanced versus load-balanced schedules of P0..P3 from start to finish)

Centralized Dynamic Load Balancing
- Centralized: a work pool with replicated workers
- A master process or central queue holds the incomplete tasks
  - First-in-first-out or priority queue (e.g. priority based on task size)
- Terminates when the queue is empty or when the workers receive a termination signal
(Figure: a task queue forming the work pool; workers P0..P3 get tasks from the queue and return results or add new tasks)
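One possible shape for such a master (an added sketch, not from the slides; the task representation, tags, reply layout, and queue size are assumptions) keeps a FIFO queue of task ids, hands one to each idle worker, and lets a worker's reply add new tasks to the queue. It assumes at least as many initial tasks as workers and that a worker stops after receiving -1.

  /* Sketch: centralized work pool. The master keeps a FIFO queue of task ids;
   * a worker's reply is {result, n_new, new task ids...}, so replies may grow
   * the queue. Terminates when the queue is empty and no tasks are outstanding. */
  #include <mpi.h>

  #define QMAX 4096
  #define TAG_TASK 1

  void pool_master(int nworkers, int ntasks)
  {
      static int queue[QMAX];          /* FIFO: queue[head..tail) holds pending tasks */
      int head = 0, tail = 0;
      int outstanding = 0;
      MPI_Status status;

      for (int t = 0; t < ntasks && tail < QMAX; t++)
          queue[tail++] = t;           /* initial task ids */

      for (int p = 1; p <= nworkers && head < tail; p++) {   /* seed the workers */
          MPI_Send(&queue[head++], 1, MPI_INT, p, TAG_TASK, MPI_COMM_WORLD);
          outstanding++;
      }
      while (outstanding > 0) {
          int reply[2 + QMAX];         /* reply[0] = result, reply[1] = #new tasks */
          MPI_Recv(reply, 2 + QMAX, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          outstanding--;
          for (int i = 0; i < reply[1] && tail < QMAX; i++)
              queue[tail++] = reply[2 + i];                  /* worker-added tasks */
          int next = (head < tail) ? queue[head++] : -1;     /* -1 = terminate */
          MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
          if (next != -1) outstanding++;
      }
  }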

Decentralized Dynamic Load Balancing
- Disadvantage of the centralized approach: the central queue, through which tasks move one by one
- Decentralized: distributed work pools
(Figure: the master hands the initial tasks to mini-master processes, each holding its own pool of tasks for its workers)

Fully Distributed Work Pool
- Receiver-initiated poll method: an (idle) worker process requests a task from another worker process
- Sender-initiated push method: an (overloaded) worker process sends a task to another (idle) worker
- Workers maintain local task queues
- Process selection:
  - Topology-based: select nearest neighbors
  - Round-robin: try each of the other workers in turn
  - Random polling/pushing: pick an arbitrary worker
- Absence of starvation: how can we guarantee that each task is eventually executed?
(Figure: two workers with local task queues; one selects the other and they send/recv tasks)

Worker Pipeline
- Workers are organized in an array (or ring) with the master at one end (or in the middle)
- The master feeds the pipeline
- When a worker's buffer is empty, it sends a request to the left
- When a worker's buffer is full, incoming tasks are shifted to the worker on the right (passing tasks along until an empty slot is found)
(Figure: the master feeding a pipeline of workers p0, p1, ..., p_{n-1})

Further Reading
[PP2] pages 79-99, 201-210