Partitioning and Divide-and-Conquer Strategies


Partitioning and Divide-and-Conquer Strategies

Partitioning Strategies

Partitioning simply divides the problem into parts.

Example: adding a sequence of numbers. We might consider dividing the sequence into m parts of n/m numbers each,

    (x_0 ... x_(n/m)-1), (x_(n/m) ... x_(2n/m)-1), ..., (x_((m-1)n/m) ... x_(n-1)),

at which point m processors (or processes) can each add one part independently to create a partial sum; the partial sums are then added to form the final sum.

Figure 4.1 Partitioning a sequence of numbers into parts and adding the parts.

Using separate send()s and recv()s

Master:

    s = n/m;                           /* number of numbers for slaves */
    for (i = 0, x = 0; i < m; i++, x = x + s)
        send(&numbers[x], s, P_i);     /* send s numbers to slave */
    sum = 0;
    for (i = 0; i < m; i++) {          /* wait for results from slaves */
        recv(&part_sum, P_any);
        sum = sum + part_sum;          /* accumulate partial sums */
    }

Slave:

    recv(numbers, s, P_master);        /* receive s numbers from master */
    part_sum = 0;
    for (i = 0; i < s; i++)            /* add numbers */
        part_sum = part_sum + numbers[i];
    send(&part_sum, P_master);         /* send sum to master */
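The send()/recv() pseudocode above is tied to a message-passing library, but the partitioning idea itself can be sketched in plain C by looping over the m parts in place of the m slaves. A minimal sequential sketch; the function names are illustrative, not from the text, and n is assumed divisible by m:

```c
#include <stddef.h>

/* Sum one slave's region of s numbers starting at offset x. */
static long partial_sum(const int *numbers, size_t x, size_t s) {
    long part_sum = 0;
    for (size_t i = 0; i < s; i++)
        part_sum += numbers[x + i];
    return part_sum;
}

/* Master: partition n numbers into m parts of n/m each and accumulate
   the partial sums, mirroring the send/recv structure above. */
long partitioned_sum(const int *numbers, size_t n, size_t m) {
    size_t s = n / m;                  /* number of numbers per slave */
    long sum = 0;
    for (size_t i = 0, x = 0; i < m; i++, x += s)
        sum += partial_sum(numbers, x, s);  /* stands in for send/recv */
    return sum;
}
```

The result is independent of m, which is what lets each part be computed by a separate process.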

Using a Broadcast/Multicast Routine

Master:

    s = n/m;                           /* number of numbers for slaves */
    bcast(numbers, s, P_slave_group);  /* send all numbers to slaves */
    sum = 0;
    for (i = 0; i < m; i++) {          /* wait for results from slaves */
        recv(&part_sum, P_any);
        sum = sum + part_sum;          /* accumulate partial sums */
    }

Slave:

    bcast(numbers, s, P_master);       /* receive all numbers from master */
    start = slave_number * s;          /* slave number obtained earlier */
    end = start + s;
    part_sum = 0;
    for (i = start; i < end; i++)      /* add numbers */
        part_sum = part_sum + numbers[i];
    send(&part_sum, P_master);         /* send sum to master */

Using Scatter and Reduce Routines

Master:

    s = n/m;                                         /* number of numbers */
    scatter(numbers, &s, P_group, root=master);      /* send numbers to slaves */
    reduce_add(&sum, &s, P_group, root=master);      /* results from slaves */

Slave:

    scatter(numbers, &s, P_group, root=master);      /* receive s numbers */
    .                                                /* add numbers */
    reduce_add(&part_sum, &s, P_group, root=master); /* send sum to master */

Analysis

Sequential: requires n - 1 additions, with a time complexity of O(n).

Parallel, using individual send and receive routines:

Phase 1 - Communication:
    t_comm1 = m(t_startup + (n/m)t_data)

Phase 2 - Computation:
    t_comp1 = n/m - 1

Phase 3 - Communication (returning partial results using individual send and receive routines):
    t_comm2 = m(t_startup + t_data)

Phase 4 - Computation (final accumulation):
    t_comp2 = m - 1

Overall:
    t_p = (t_comm1 + t_comm2) + (t_comp1 + t_comp2)
        = 2m t_startup + (n + m)t_data + m + n/m - 2

or t_p = O(n + m). We see that the parallel time complexity is worse than the sequential time complexity.

Divide and Conquer

Characterized by dividing a problem into subproblems that are of the same form as the larger problem. Further divisions into still smaller subproblems are usually done by recursion.

A sequential recursive definition for adding a list of numbers is

    int add(int *s)                    /* add list of numbers, s */
    {
        if (number(s) <= 2) return (n1 + n2);  /* see explanation */
        else {
            Divide(s, s1, s2);         /* divide s into two parts, s1 and s2 */
            part_sum1 = add(s1);       /* recursive calls to add sublists */
            part_sum2 = add(s2);
            return (part_sum1 + part_sum2);
        }
    }
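The recursive definition above leaves Divide() and number() abstract. A compilable sketch of the same two-way recursion, with index ranges standing in for the sublists s1 and s2 (illustrative names, not from the text):

```c
#include <stddef.h>

/* Recursively add numbers[lo..hi) by dividing the range in two. */
long add_recursive(const int *numbers, size_t lo, size_t hi) {
    size_t count = hi - lo;
    if (count == 0) return 0;
    if (count <= 2) {                      /* base case: at most two numbers */
        long s = numbers[lo];
        if (count == 2) s += numbers[lo + 1];
        return s;
    }
    size_t mid = lo + count / 2;           /* divide s into s1 and s2 */
    return add_recursive(numbers, lo, mid) /* recursive calls to add sublists */
         + add_recursive(numbers, mid, hi);
}
```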

Figure 4.2 Tree construction: the initial problem is divided repeatedly until the final tasks are reached.

Parallel Implementation

Figure 4.3 Dividing a list into parts: the original list x_0 ... x_(n-1) starts at P_0; it is split between P_0 and P_4, then among P_0, P_2, P_4, P_6, and finally among all of P_0 through P_7.

Figure 4.4 Partial summation: partial sums flow back from P_0 ... P_7 through P_0, P_2, P_4, P_6, then P_0 and P_4, to the final sum at P_0.

Parallel Code

Suppose we statically create eight processors (or processes) to add a list of numbers.

Process P_0:

    /* division phase */
    divide(s1, s1, s2);                /* divide s1 into two, s1 and s2 */
    send(s2, P_4);                     /* send one part to another process */
    divide(s1, s1, s2);
    send(s2, P_2);
    divide(s1, s1, s2);
    send(s2, P_1);
    part_sum = *s1;
    /* combining phase */
    recv(&part_sum1, P_1);
    part_sum = part_sum + part_sum1;
    recv(&part_sum1, P_2);
    part_sum = part_sum + part_sum1;
    recv(&part_sum1, P_4);
    part_sum = part_sum + part_sum1;

The code for process P_4 might take the form

Process P_4:

    recv(s1, P_0);                     /* division phase */
    divide(s1, s1, s2);
    send(s2, P_6);
    divide(s1, s1, s2);
    send(s2, P_5);
    part_sum = *s1;
    /* combining phase */
    recv(&part_sum1, P_5);
    part_sum = part_sum + part_sum1;
    recv(&part_sum1, P_6);
    part_sum = part_sum + part_sum1;
    send(&part_sum, P_0);

Similar sequences are required for the other processes.

Analysis

Assume that n is a power of 2. The communication startup time, t_startup, is not included in the following, for simplicity.

Communication

Division phase:
    t_comm1 = (n/2)t_data + (n/4)t_data + (n/8)t_data + ... + (n/p)t_data = (n(p - 1)/p) t_data

Combining phase:
    t_comm2 = t_data log p

Total communication time:
    t_comm = t_comm1 + t_comm2 = (n(p - 1)/p) t_data + t_data log p

Computation:
    t_comp = n/p + log p

Total parallel execution time:
    t_p = (n(p - 1)/p) t_data + t_data log p + n/p + log p

Figure 4.5 Part of a search tree: found/not-found results are combined through OR nodes.

M-ary Divide and Conquer

Divide and conquer can also be applied where a task is divided into more than two parts at each stage. For example, if the task is broken into four parts, the sequential recursive definition would be

    int add(int *s)                    /* add list of numbers, s */
    {
        if (number(s) <= 4) return (n1 + n2 + n3 + n4);
        else {
            Divide(s, s1, s2, s3, s4); /* divide s into s1, s2, s3, s4 */
            part_sum1 = add(s1);       /* recursive calls to add sublists */
            part_sum2 = add(s2);
            part_sum3 = add(s3);
            part_sum4 = add(s4);
            return (part_sum1 + part_sum2 + part_sum3 + part_sum4);
        }
    }

Figure 4.6 Quadtree.

Figure 4.7 Dividing an image: the image area is divided into four parts, each of which is divided into four again.

Divide-and-Conquer Examples

Sorting Using Bucket Sort

Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1. This interval is divided into m equal regions, 0 to a/m - 1, a/m to 2a/m - 1, 2a/m to 3a/m - 1, ..., and one bucket is assigned to hold the numbers that fall within each region. The numbers are simply placed into the appropriate buckets, and the numbers in each bucket are then sorted using a sequential sorting algorithm.

Figure 4.8 Bucket sort: unsorted numbers are placed into buckets, the contents of each bucket are sorted, and the lists are merged into the sorted numbers.

Sequential time:
    t_s = n + m((n/m)log(n/m)) = n + n log(n/m) = O(n log(n/m))

Parallel Algorithm

Bucket sort can be parallelized by assigning one processor to each bucket, which reduces the second term in the preceding equation to (n/p)log(n/p) for p processors (where p = m).

Figure 4.9 One parallel version of bucket sort: p processors, one per bucket.
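The sequential algorithm above can be sketched concretely: place n numbers from the interval [0, a) into m buckets, then sort each bucket, with the C library's qsort standing in for the sequential sort. The counted-offset bucket layout is an implementation choice, not from the text:

```c
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Sort n numbers in [0, a) using m buckets. */
void bucket_sort(int *numbers, size_t n, int a, size_t m) {
    size_t *count = calloc(m + 1, sizeof *count);
    size_t *next = malloc(m * sizeof *next);
    int *out = malloc(n * sizeof *out);
    size_t i, b;
    for (i = 0; i < n; i++)                      /* size each bucket */
        count[(size_t)numbers[i] * m / (size_t)a + 1]++;
    for (b = 1; b <= m; b++)                     /* bucket start offsets */
        count[b] += count[b - 1];
    for (b = 0; b < m; b++) next[b] = count[b];
    for (i = 0; i < n; i++) {                    /* place into buckets */
        b = (size_t)numbers[i] * m / (size_t)a;
        out[next[b]++] = numbers[i];
    }
    for (b = 0; b < m; b++)                      /* sort each bucket */
        qsort(out + count[b], count[b + 1] - count[b], sizeof *out, cmp_int);
    memcpy(numbers, out, n * sizeof *numbers);   /* buckets are already in order */
    free(count); free(next); free(out);
}
```

Because the buckets cover disjoint, ordered regions, concatenating the sorted buckets yields the sorted sequence with no merge step.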

Further Parallelization

Partition the sequence into m regions, one region for each processor. Each processor maintains p small buckets and separates the numbers in its region into its own small buckets. These small buckets are then emptied into the p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i).

Figure 4.10 Parallel version of bucket sort: n/m unsorted numbers per processor are separated into small buckets, the small buckets are emptied into the large buckets, the contents of each large bucket are sorted, and the lists are merged into the sorted numbers.

Analysis

The following phases are needed:

1. Partition numbers.
2. Sort into small buckets.
3. Send to large buckets.
4. Sort large buckets.

Phase 1 - Computation and communication:
    t_comp1 = n
    t_comm1 = t_startup + n t_data

Phase 2 - Computation:
    t_comp2 = n/p

Phase 3 - Communication. If all the communications could overlap:
    t_comm3 = (p - 1)(t_startup + (n/p^2)t_data)

Phase 4 - Computation:
    t_comp4 = (n/p)log(n/p)

Overall:
    t_p = t_startup + n t_data + n/p + (p - 1)(t_startup + (n/p^2)t_data) + (n/p)log(n/p)

It is assumed that the numbers are uniformly distributed to obtain these formulas. The worst case would occur when all the numbers fell into one bucket!

All-to-All Routine

For Phase 3: an all-to-all routine sends data from each process to every other process.

Figure 4.11 All-to-all broadcast: the send buffer of each of the n processes is distributed, one block per destination, across the receive buffers of processes 0 to n - 1.

The all-to-all routine will in effect transfer the rows of an array to columns:

Figure 4.12 Effect of all-to-all on an array: for processes P_0 to P_3, element A_i,j moves from process i to process j, so the array is transposed.
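The effect in Figure 4.12 can be simulated in plain C: if each of p processes holds p items (its row), an all-to-all exchange delivers item j of process i to process j, transposing the p x p array. A minimal sketch with an illustrative function name (not an MPI call):

```c
/* Simulate an all-to-all exchange among p processes. sendbuf holds the
   p rows back to back; after the exchange, recvbuf row j holds item j
   from every process, i.e. column j of the original array. */
void all_to_all(const int *sendbuf, int *recvbuf, int p) {
    for (int i = 0; i < p; i++)        /* source process */
        for (int j = 0; j < p; j++)    /* destination process */
            recvbuf[j * p + i] = sendbuf[i * p + j];
}
```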

Numerical Integration

A general divide-and-conquer technique divides the region continually into parts and lets some optimization function decide when certain regions are sufficiently divided.

Example: numerical integration

    I = ∫[a, b] f(x) dx

We can divide the area into separate parts, each of which can be calculated by a separate process. Each region could be calculated using an approximation given by rectangles:

Figure 4.13 Numerical integration using rectangles of width δ over [a, b].

A Better Approximation

Aligning the rectangles:

Figure 4.14 More accurate numerical integration using rectangles.

Figure 4.15 Numerical integration using the trapezoidal method.

Static Assignment

SPMD pseudocode, process P_i:

    if (i == master) {                 /* read number of intervals required */
        printf("Enter number of intervals ");
        scanf("%d", &n);
    }
    bcast(&n, P_group);                /* broadcast interval to all processes */
    region = (b - a)/p;                /* length of region for each process */
    start = a + region * i;            /* starting x coordinate for process */
    end = start + region;              /* ending x coordinate for process */
    d = (b - a)/n;                     /* size of interval */
    area = 0.0;
    for (x = start; x < end; x = x + d)
        area = area + f(x) + f(x+d);
    area = 0.5 * area * d;
    reduce_add(&integral, &area, P_group);  /* form sum of areas */

A reduce operation is used to add the areas computed by the individual processes. The calculation can be simplified somewhat by algebraic manipulation (see text).
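The per-process loop above is a trapezoidal approximation over that process's region. Run over the whole interval, the same loop looks like this in plain C (illustrative names; f_example is a stand-in integrand, not from the text):

```c
/* Trapezoidal approximation of the integral of f over [a, b] with n
   intervals; the same loop each process runs over its own region. */
double integrate(double (*f)(double), double a, double b, int n) {
    double d = (b - a) / n;            /* size of interval */
    double area = 0.0;
    for (int i = 0; i < n; i++) {
        double x = a + i * d;
        area += f(x) + f(x + d);       /* two sides of each trapezoid */
    }
    return 0.5 * area * d;
}

static double f_example(double x) { return x * x; }  /* sample integrand */
```

Summing integrate() over p disjoint subranges gives the same result as one call over [a, b], which is why the reduce_add suffices.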

Adaptive Quadrature

A method whereby the solution adapts to the shape of the curve. Example: use three areas, A, B, and C. The computation is terminated when the area computed for the largest of the A and B regions is sufficiently close to the sum of the areas computed for the other two regions.

Figure 4.16 Adaptive quadrature construction.

Some care might be needed in choosing when to terminate.

Figure 4.17 Adaptive quadrature with false termination: the two large regions A and B are the same (i.e., C = 0), which might cause us to terminate early.
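The termination test described above can be sketched recursively: compare the one-trapezoid estimate for an interval with the sum of the two half-interval estimates, and subdivide while they disagree. A minimal sketch; the tolerance-halving policy is an illustrative choice, not prescribed by the text:

```c
#include <math.h>

/* Recursive adaptive quadrature over [a, b]: subdivide until the coarse
   and refined trapezoid estimates agree to within tol. */
double adapt(double (*f)(double), double a, double b, double tol) {
    double m = 0.5 * (a + b);
    double whole = 0.5 * (b - a) * (f(a) + f(b));    /* one trapezoid */
    double left  = 0.5 * (m - a) * (f(a) + f(m));    /* region A */
    double right = 0.5 * (b - m) * (f(m) + f(b));    /* region B */
    if (fabs(left + right - whole) <= tol)           /* refinement gained little */
        return left + right;
    return adapt(f, a, m, tol / 2) + adapt(f, m, b, tol / 2);
}

static double g_example(double x) { return x * x * x; }  /* sample integrand */
```

Note the false-termination hazard from Figure 4.17 applies here too: a curve whose halves cancel can pass the test prematurely.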

Gravitational N-Body Problem

The objective is to find the positions and movements of bodies in space (say, planets) that are subject to gravitational forces from other bodies, using Newtonian laws of physics. The gravitational force between two bodies of masses m_a and m_b is given by

    F = G m_a m_b / r^2

where G is the gravitational constant and r is the distance between the bodies. Subject to forces, a body will accelerate according to Newton's second law:

    F = ma

where m is the mass of the body, F is the force it experiences, and a is the resultant acceleration.

Let the time interval be Δt. Then, for a particular body of mass m, the force is given by

    F = m(v_(t+1) - v_t) / Δt

and a new velocity by

    v_(t+1) = v_t + F Δt / m

where v_(t+1) is the velocity of the body at time t + 1 and v_t is the velocity at time t. If a body is moving at velocity v over the time interval Δt, its position changes by

    x_(t+1) - x_t = v Δt

where x_t is its position at time t. Once bodies move to new positions, the forces change and the computation has to be repeated.

Three-Dimensional Space

Since the bodies are in a three-dimensional space, all values are vectors and have to be resolved into three directions, x, y, and z. In a three-dimensional space with coordinate system (x, y, z), the distance between the bodies at (x_a, y_a, z_a) and (x_b, y_b, z_b) is given by

    r = sqrt((x_b - x_a)^2 + (y_b - y_a)^2 + (z_b - z_a)^2)

The forces are resolved into the three directions, using, for example,

    F_x = (G m_a m_b / r^2)((x_b - x_a) / r)
    F_y = (G m_a m_b / r^2)((y_b - y_a) / r)
    F_z = (G m_a m_b / r^2)((z_b - z_a) / r)

where the particles are of mass m_a and m_b and have the coordinates (x_a, y_a, z_a) and (x_b, y_b, z_b).

Sequential Code

The overall gravitational N-body computation can be described by the algorithm

    for (t = 0; t < tmax; t++) {           /* for each time period */
        for (i = 0; i < N; i++) {          /* for each body */
            F = Force_routine(i);          /* compute force on ith body */
            v[i]_new = v[i] + F * dt / m;  /* compute new velocity and */
            x[i]_new = x[i] + v[i]_new * dt;  /* new position (leap-frog) */
        }
        for (i = 0; i < N; i++) {          /* for each body */
            x[i] = x[i]_new;               /* update velocity and position */
            v[i] = v[i]_new;
        }
    }

Parallel Code

The algorithm is an O(N^2) algorithm (for one iteration), as each of the N bodies is influenced by each of the other N - 1 bodies. It is not feasible to use this direct algorithm for most interesting N-body problems, where N is very large. The time complexity can be reduced using the observation that a cluster of distant bodies can be approximated as a single distant body with the total mass of the cluster, sited at the center of mass of the cluster.

Figure 4.18 Clustering distant bodies: a distant cluster at distance r is replaced by its center of mass.
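The O(N^2) double loop above can be sketched compactly. To keep the example short it works in one dimension, so the force reduces to a signed scalar; this is a simplification of the text's three-dimensional resolution, and bodies are assumed to sit at distinct positions:

```c
#include <math.h>

#define G 6.674e-11   /* gravitational constant (SI units) */

/* One time step of the direct O(N^2) algorithm in one dimension:
   update every velocity from the pairwise forces, then every position. */
void nbody_step(int N, const double m[], double x[], double v[], double dt) {
    for (int i = 0; i < N; i++) {
        double F = 0.0;
        for (int j = 0; j < N; j++) {      /* force on body i from body j */
            if (j == i) continue;
            double dx = x[j] - x[i];
            double r = fabs(dx);
            F += G * m[i] * m[j] * dx / (r * r * r);  /* signed 1-D force */
        }
        v[i] += F * dt / m[i];             /* new velocity */
    }
    for (int i = 0; i < N; i++)
        x[i] += v[i] * dt;                 /* new position */
}
```

Updating all velocities before any position mirrors the two-phase update in the pseudocode, so each force is computed from the positions at the start of the step.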

Barnes-Hut Algorithm

Starts with the whole space, in which one cube contains the bodies (or particles). First, this cube is divided into eight subcubes. If a subcube contains no particles, it is deleted from further consideration. If a subcube contains more than one body, it is recursively divided until every subcube contains one body. This process creates an octtree, that is, a tree with up to eight edges from each node; the leaves represent cells, each containing one body.

After the tree has been constructed, the total mass and center of mass of the bodies in the subcube are stored at each node. The force on each body can then be obtained by traversing the tree starting at the root, stopping at a node when the clustering approximation can be used, e.g. when

    d/r ≤ θ

where d is the dimension of the subcube, r is the distance to its center of mass, and θ is a constant, typically 1.0 or less (θ is called the opening angle).

Constructing the tree requires a time of O(n log n), and so does computing all the forces, so that the overall time complexity of the method is O(n log n).

Figure 4.19 Recursive division of two-dimensional space: particles, subdivision directions, and the corresponding partial quadtree.

Orthogonal Recursive Bisection

Example for a two-dimensional square area. First, a vertical line is found that divides the area into two areas, each with an equal number of bodies. For each area, a horizontal line is found that divides it into two areas, each with an equal number of bodies. This is repeated until there are as many areas as processors, and then one processor is assigned to each area.

Figure 4.20 Orthogonal recursive bisection method.
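One level of the bisection described above amounts to finding the coordinate that splits the bodies into two equal halves along one axis. A sketch that sorts to find the median split; the text does not prescribe how the line is found, and n is assumed even:

```c
#include <stdlib.h>

static int cmp_dbl(const void *p, const void *q) {
    double a = *(const double *)p, b = *(const double *)q;
    return (a > b) - (a < b);
}

/* Coordinate of the dividing line along one axis: half the bodies lie on
   each side. Recursion would apply this alternately to vertical and
   horizontal axes within each subarea. */
double bisect_line(const double *coords, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++) tmp[i] = coords[i];
    qsort(tmp, n, sizeof *tmp, cmp_dbl);
    double line = 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);  /* between the halves */
    free(tmp);
    return line;
}
```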