COMP4300/8300: Parallelisation via Data Partitioning. Alistair Rendell
Chapter 5: Lin and Snyder; Chapter 4: Wilkinson and Allen
Copyright (c) 2015 The Australian National University
7.1 Parallelisation Strategies
- Replicated data approach: each process has an entire copy of the data but does a subset of the computation
- Partitioning of program data across processes (most common): domain decomposition, divide and conquer
- Partitioning of program functionality (much less common): functional decomposition
- Running example: addition of n numbers, sum = \sum_{i=0}^{n-1} x_i
7.2 D&C Example #1: Simple Summation of a Vector
Divide the n numbers into m equal parts:
    x(0) ... x(n/m - 1)
    x(n/m) ... x(2n/m - 1)
    ...
    x((m-1)n/m) ... x(n - 1)
Each part contributes a partial sum; the partial sums are then combined into the final sum.
7.3 Master/Slave Send/Recv Approach
Master:
    s = n/m;
    for (i = 0, x = 0; i < m; i++, x = x + s)
        send(&numbers[x], s, i);
    sum = 0;
    for (i = 0; i < m; i++) {
        recv(&part_sum, any_processor);
        sum = sum + part_sum;
    }
Slave:
    recv(numbers, s, master);
    part_sum = 0;
    for (i = 0; i < s; i++)
        part_sum = part_sum + numbers[i];
    send(&part_sum, master);
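For concreteness, a minimal MPI rendering of this master/slave pattern might look as follows. This is a sketch, not the course's code: it assumes rank 0 is the master and also sums its own part, that the slide's generic send()/recv() map onto MPI_Send/MPI_Recv, and that n divides evenly among the processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int n = 1 << 20;               /* assumed divisible by nproc */
    int s = n / nproc;

    if (rank == 0) {                     /* master */
        double *numbers = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) numbers[i] = 1.0;
        for (int i = 1; i < nproc; i++)  /* send each slave its part */
            MPI_Send(&numbers[i * s], s, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
        double sum = 0.0, part_sum;
        for (int i = 0; i < s; i++)      /* master sums its own part */
            sum += numbers[i];
        for (int i = 1; i < nproc; i++) {
            MPI_Recv(&part_sum, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += part_sum;
        }
        printf("sum = %f\n", sum);
        free(numbers);
    } else {                             /* slave */
        double *numbers = malloc(s * sizeof(double));
        MPI_Recv(numbers, s, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double part_sum = 0.0;
        for (int i = 0; i < s; i++) part_sum += numbers[i];
        MPI_Send(&part_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        free(numbers);
    }
    MPI_Finalize();
    return 0;
}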
7.4 Using MPI_Scatter and MPI_Reduce
sendcount = n/m;
MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
            void *recvbuf, int recvcount, MPI_Datatype recvtype,
            int root, MPI_Comm comm);

for (i = 0; i < sendcount; i++)
    part_sum = part_sum + numbers[i];

MPI_Reduce(void *sendbuf, void *recvbuf, int count,
           MPI_Datatype datatype, MPI_Op op /* here MPI_SUM */,
           int root, MPI_Comm comm);

- NOT master/slave: the root sends data to all processes in the communicator, including itself
- Related MPI calls: MPI_Scatterv scatters variable-length pieces; MPI_Allreduce returns the result to all processes
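The same summation with these collectives, as a runnable sketch (assumptions: the data starts on rank 0 and n is divisible by the number of processes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int n = 1 << 20;               /* assumed divisible by nproc */
    int chunk = n / nproc;

    double *numbers = NULL;
    if (rank == 0) {                     /* only the root holds all data */
        numbers = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) numbers[i] = 1.0;
    }
    double *mine = malloc(chunk * sizeof(double));
    MPI_Scatter(numbers, chunk, MPI_DOUBLE,
                mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double part_sum = 0.0, sum = 0.0;
    for (int i = 0; i < chunk; i++)      /* local partial sum */
        part_sum += mine[i];

    MPI_Reduce(&part_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);

    free(mine);
    free(numbers);
    MPI_Finalize();
    return 0;
}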
7.5 Analysis
Sequential: n - 1 additions, thus O(n)
Parallel:
    Communication #1 (distribute parts):    t_comm1 = m(t_startup + (n/m) t_data)
    Computation #1 (partial sums):          t_comp1 = n/m - 1
    Communication #2 (return partial sums): t_comm2 = m(t_startup + t_data)
    Computation #2 (final sum):             t_comp2 = m - 1
Overall: t_p = 2m t_startup + (n + m) t_data + m + n/m - 2 = O(n + m)
Worse than the sequential code!
7.6 Domain Decomposition via Divide and Conquer
- For problems that can be recursively divided into smaller problems of the same type
- Recursive implementation of the summation problem:

int add(int *s)
{
    if (numbers(s) <= 2)
        return (s[0] + s[1]);
    else {
        divide(s, s1, s2);
        part_sum1 = add(s1);
        part_sum2 = add(s2);
        return (part_sum1 + part_sum2);
    }
}
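A runnable version of the recursive sketch, under the assumption that divide() splits the array in half; the explicit length argument and the single-element base case are added here and are not in the slide:

#include <stdio.h>

long add(const int *s, int n)
{
    if (n == 1) return s[0];            /* base cases: 1 or 2 elements */
    if (n == 2) return s[0] + s[1];
    int half = n / 2;                   /* divide ... */
    return add(s, half) + add(s + half, n - half);  /* ... and conquer */
}

int main(void)
{
    int x[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    printf("%ld\n", add(x, 8));         /* prints 28 */
    return 0;
}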
7.7 Binary Tree
- Divide and conquer with binary partitioning
- Note that the number of working processors decreases going up the tree
[Figure: binary tree over the processors; with 1 processor only P0 works, with 2 processors P0 and P1, with 4 processors (P = 4) P0-P3, with 8 processors P0-P7]
7.8 Simple Binary Tree Code
/* Binary tree broadcast:
   a) 0->1
   b) 0->2, 1->3
   c) 0->4, 1->5, 2->6, 3->7
   d) 0->8, 1->9, 2->10, 3->11, 4->12, 5->13, 6->14, 7->15 */
lo = 1;
while (lo < nproc) {
    if (me < lo) {
        id = me + lo;
        if (id < nproc) send(type, buf, lenbuf, &id);
    } else if (me < 2*lo) {
        id = me - lo;
        recv(type, buf, lenbuf, &lenmes, &id);
    }
    lo *= 2;
}
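The same tree broadcast expressed in MPI, as a minimal sketch (in practice MPI_Bcast already uses a tree internally; the int payload and the value 42 are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int me, nproc, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (me == 0) buf = 42;                    /* value to broadcast */

    for (int lo = 1; lo < nproc; lo *= 2) {
        if (me < lo) {                        /* I already have the data */
            int dest = me + lo;
            if (dest < nproc)
                MPI_Send(&buf, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        } else if (me < 2 * lo) {             /* my turn to receive */
            MPI_Recv(&buf, 1, MPI_INT, me - lo, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    printf("rank %d has %d\n", me, buf);
    MPI_Finalize();
    return 0;
}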
7.9 Analysis
Assume n is a power of 2 and ignore t_startup
Communication #1 (division):
    t_comm1 = (n/2) t_data + (n/4) t_data + (n/8) t_data + ... + (n/p) t_data = (n(p-1)/p) t_data
Communication #2 (combining):
    t_comm2 = t_data log p
Computation:
    t_comp = n/p + log p
Total:
    t_p = (n(p-1)/p) t_data + t_data log p + n/p + log p
Slightly better than before, but the division step still moves O(n) data, so even as p approaches n the cost remains O(n)
7.10 Higher Order Trees
- It is possible to divide the data into higher-order trees, e.g. a quadtree
[Figure: a square region shown as the initial area, after the first division into 4 quadrants, and after the second division into 16 cells]
7.11 The Reduce and Scan Abstractions
- Summation of a vector already partitioned between processes is an example of the reduce abstraction
- Reduce: combines a set of values to produce a single value
- Scan: performs a sequential operation in parts and carries along the intermediate results
- Reduce usually involves mapping a binary (or higher-order) tree communication pattern onto the processes
- Examples (to discuss) include:
    - finding the second smallest array element
    - computing a k-way histogram
    - length of the longest run of 1s
    - index of the first occurrence of x
- See Lin and Snyder for further details
7.12 Scan
Consider the vector X = [0,1,2,3,4,5,6,7]. A scan operation replaces each element with the cumulative sum of all preceding elements (either inclusive or exclusive of the current element):
    Inclusive SCAN(X) = [0,1,3,6,10,15,21,28]
    Exclusive SCAN(X) = [0,0,1,3,6,10,15,21]
Inclusive sequential scan code (clear dependency!):
    ScanX[0] = X[0];
    for (i = 1; i < n; i++)
        ScanX[i] = ScanX[i-1] + X[i];
Parallel PRAM code (why is PRAM important?):
    for (j = 1; j <= log2(n); j++) {
        forall (k = 0; k < n; k++) in parallel {
            if (k >= POWER(2, j-1))
                X[k] = X[k - POWER(2, j-1)] + X[k];
        }
    }
7.13 Scan Example: 8 Values on 8 Nodes
for (j = 1; j <= log2(n); j++) {
    forall (k = 0; k < n; k++) in parallel {
        if (k >= POWER(2, j-1))
            X[k] = X[k - POWER(2, j-1)] + X[k];
    }
}

Node:             0      1      2      3       4       5        6        7
Initial values:   0      1      2      3       4       5        6        7
j=1 (offset 1):   0  0+1=1  1+2=3  2+3=5   3+4=7   4+5=9   5+6=11   6+7=13
j=2 (offset 2):   0      1  0+3=3  1+5=6  3+7=10  5+9=14  7+11=18  9+13=22
j=3 (offset 4):   0      1      3      6 0+10=10 1+14=15  3+18=21  6+22=28
Final values:     0      1      3      6      10      15       21       28
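MPI provides this abstraction directly: MPI_Scan computes an inclusive scan and MPI_Exscan the exclusive variant. A sketch with one value per rank, mirroring the example above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x, incl, excl;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank;   /* X = [0,1,2,...,p-1], one element per rank */

    MPI_Scan(&x, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Exscan(&x, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) excl = 0;   /* MPI_Exscan leaves rank 0's result undefined */

    printf("rank %d: inclusive %d, exclusive %d\n", rank, incl, excl);
    MPI_Finalize();
    return 0;
}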
7.14 D&C Example #2: Bucket Sort
- Divide the number range [0, a] into m equal regions: [0, a/m - 1], [a/m, 2a/m - 1], [2a/m, 3a/m - 1], ...
- Assign one bucket to each region
- Stage 1: numbers are placed into the appropriate buckets
- Stage 2: each bucket is sorted using a traditional sorting algorithm
- Works best if the numbers are evenly distributed over the range a
- Sequential time: t_s = n + m((n/m) log(n/m)) = n + n log(n/m) = O(n log(n/m))
7.15 Sequential Bucket Sort
[Figure: unsorted numbers are distributed into buckets, and each bucket is sorted to produce the sorted numbers]
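A minimal sequential sketch of the two stages (assumptions: values lie in [0, A) and are reasonably uniform; qsort stands in for the "traditional sorting algorithm"; all sizes are illustrative):

#include <stdio.h>
#include <stdlib.h>

#define N 16       /* number of values          */
#define M 4        /* number of buckets         */
#define A 100      /* value range [0, A)        */

static int cmp(const void *p, const void *q)
{
    return (*(const int *)p - *(const int *)q);
}

int main(void)
{
    int x[N] = {42, 7, 93, 18, 55, 3, 71, 29, 88, 12, 64, 37, 99, 21, 50, 6};
    int bucket[M][N], count[M] = {0};

    for (int i = 0; i < N; i++) {       /* stage 1: place into buckets */
        int b = x[i] * M / A;           /* bucket index for this value */
        bucket[b][count[b]++] = x[i];
    }
    int k = 0;
    for (int b = 0; b < M; b++) {       /* stage 2: sort each bucket   */
        qsort(bucket[b], count[b], sizeof(int), cmp);
        for (int i = 0; i < count[b]; i++) x[k++] = bucket[b][i];
    }
    for (int i = 0; i < N; i++) printf("%d ", x[i]);
    printf("\n");
    return 0;
}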
7.16 Parallel Bucket Sort #1
- Assign one bucket to each process
[Figure: unsorted numbers are distributed to the processes, each process fills and sorts its one bucket, giving the sorted numbers]
7.17 Parallel Bucket Sort #2
- Assign p small buckets to each process: each process places its share of the unsorted numbers into p small buckets, one per destination process
- The small buckets are then exchanged so that each process holds one big bucket covering its region, which it sorts locally
[Figure: unsorted numbers -> processes -> small buckets -> big buckets -> sorted numbers]
- Note the possible use of MPI_Alltoall:
MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
             void *recvbuf, int recvcount, MPI_Datatype recvtype,
             MPI_Comm comm);
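A sketch of the small-bucket exchange with MPI_Alltoall. This is a simplification: real small buckets differ in size, which is what MPI_Alltoallv handles; here every small bucket is padded to a fixed CHUNK and the data is a stand-in:

#include <mpi.h>
#include <stdlib.h>

#define CHUNK 4    /* padded size of each small bucket (assumption) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* small bucket i (length CHUNK) is destined for process i */
    int *small = malloc(p * CHUNK * sizeof(int));
    int *big   = malloc(p * CHUNK * sizeof(int));
    for (int i = 0; i < p * CHUNK; i++)
        small[i] = rank * 1000 + i;     /* stand-in for bucketed data */

    /* every process sends small bucket i to process i and receives one
       small bucket from every process: together these form its big bucket */
    MPI_Alltoall(small, CHUNK, MPI_INT, big, CHUNK, MPI_INT, MPI_COMM_WORLD);

    /* big[] would now be sorted locally (stage 2) */
    free(small);
    free(big);
    MPI_Finalize();
    return 0;
}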
7.18 Analysis
Initial partitioning and distribution:
    t_comp1 = n
    t_comm1 = t_startup + t_data n
Sort into small buckets:
    t_comp2 = n/p
Send to large buckets (overlapping communications):
    t_comm3 = (p-1)(t_startup + (n/p^2) t_data)
Sort of large buckets:
    t_comp4 = (n/p) log(n/p)
Total:
    t_p = t_startup + t_data n + n/p + (p-1)(t_startup + (n/p^2) t_data) + (n/p) log(n/p)
At best O(n). What would be the worst-case scenario?
7.19 D&C Example #3: Integration
Consider evaluation of the integral I = \int_a^b f(x) dx using the trapezoidal rule
[Figure: f(x) over [a, b], divided into strips of width \delta between points p and q]
7.20 Static Distribution: SPMD Model
if (i == master) {
    printf("Enter number of regions\n");
    scanf("%d", &n);
}
broadcast(&n, process_group);
region = (b-a)/p;
start = a + region*i;
end = start + region;
d = (b-a)/n;
area = 0;
for (x = start; x < end; x = x + d)
    area = area + f(x) + f(x+d);
area = 0.5 * area * d;
reduce_add(&integral, &area, processor_group);
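A runnable MPI version of this SPMD sketch (assumptions: f(x) = x*x, limits a = 0 and b = 1, and MPI_Bcast/MPI_Reduce in place of the slide's generic broadcast and reduce_add):

#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }   /* example integrand */

int main(int argc, char **argv)
{
    int me, p, n = 0;
    double a = 0.0, b = 1.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (me == 0) {
        printf("Enter number of regions\n");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* everyone needs n */

    double region = (b - a) / p;     /* each rank integrates one region */
    double start = a + region * me;
    double end = start + region;
    double d = (b - a) / n;          /* strip width */

    double area = 0.0, integral = 0.0;
    for (double x = start; x < end; x += d)
        area += f(x) + f(x + d);
    area *= 0.5 * d;

    MPI_Reduce(&area, &integral, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (me == 0) printf("integral = %f (exact 1/3)\n", integral);
    MPI_Finalize();
    return 0;
}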
7.21 Adaptive Quadrature
- Not all areas require the same number of points
- When to terminate division into smaller areas is an issue
- Parallel code will have an uneven work load
[Figure: f(x) over [a, b] with regions of varying curvature requiring different resolution]
7.22 D&C Example #4: N-Body Problems
Summing long-range pairwise interactions, e.g. gravitation:
    F = G m_a m_b / r^2
where G is the gravitational constant, m_a and m_b are the masses of the two bodies, and r is the distance between them. In Cartesian space:
    F_x = (G m_a m_b / r^2) ((x_b - x_a)/r)
    F_y = (G m_a m_b / r^2) ((y_b - y_a)/r)
    F_z = (G m_a m_b / r^2) ((z_b - z_a)/r)
- What is the total force on the sun due to all other stars in the Milky Way?
- Given the force on each star we can calculate their motions
- Molecular dynamics is very similar, but the long-range forces are electrostatic
7.23 Simple Sequential Force Code
for (i = 0; i < nstars; i++)
    for (j = 0; j < nstars; j++) {
        if (i != j) {
            rij2 = (x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j])
                 + (z[i]-z[j])*(z[i]-z[j]);
            Fx[i] = Fx[i] + G*m[i]*m[j]/rij2 * (x[i]-x[j]) / sqrt(rij2);
            Fy[i] = Fy[i] + G*m[i]*m[j]/rij2 * (y[i]-y[j]) / sqrt(rij2);
            Fz[i] = Fz[i] + G*m[i]*m[j]/rij2 * (z[i]-z[j]) / sqrt(rij2);
        }
    }
Aside: how could you improve this sequential code?
O(n^2): this will get very expensive for large N. Is there a better way?
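One common answer to the aside, as a sketch: exploit Newton's third law (the force of j on i is minus the force of i on j) to visit each pair once, and hoist the repeated subexpressions. The work halves, but the algorithm remains O(n^2):

#include <math.h>

void forces(int nstars, const double *x, const double *y, const double *z,
            const double *m, double *Fx, double *Fy, double *Fz, double G)
{
    for (int i = 0; i < nstars; i++)
        Fx[i] = Fy[i] = Fz[i] = 0.0;
    for (int i = 0; i < nstars; i++)
        for (int j = i + 1; j < nstars; j++) {   /* each pair once */
            double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            double rij2 = dx*dx + dy*dy + dz*dz;
            double s = G * m[i] * m[j] / (rij2 * sqrt(rij2));
            Fx[i] += s * dx;  Fx[j] -= s * dx;   /* F_ji = -F_ij */
            Fy[i] += s * dy;  Fy[j] -= s * dy;
            Fz[i] += s * dz;  Fz[j] -= s * dz;
        }
}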
7.24 Clustering
Idea: for a body at large distance r, the interactions with several bodies that are clustered together can be replaced by a single interaction with the centre of mass of the cluster.
[Figure: a body at distance r from the centre of mass of a distant cluster of stars]
7.25 Barnes-Hut Algorithm
- Start with the whole space in one cube
- Divide the cube into 8 subcubes
- Delete subcubes that contain no particles
- Subcubes with more than 1 particle are divided into 8 again
- Continue until each cube contains at most one particle
- The process creates an octree
- The total mass and centre of mass of each subcube are stored at its node
- The force on each particle is evaluated by starting at the root and traversing the tree, BUT stopping at a node if the clustering approximation can be applied
- Scaling: O(n log n)
- Load balancing is likely to be an issue for a parallel code
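A compact sketch of the tree-walk force evaluation. The slide leaves the stopping criterion open; the usual choice, assumed here, is the opening-angle test size/distance < theta. The Node layout and THETA value are illustrative:

#include <math.h>

#define THETA 0.5                /* opening-angle threshold (assumption) */

typedef struct Node {
    double mass;                 /* total mass of bodies in this cell   */
    double cx, cy, cz;           /* centre of mass of this cell         */
    double size;                 /* side length of the cell             */
    struct Node *child[8];       /* NULL children for empty subcubes    */
    int nbodies;                 /* 1 => leaf holding a single particle */
} Node;

void force(const Node *n, double px, double py, double pz, double G,
           double pm, double *Fx, double *Fy, double *Fz)
{
    if (n == NULL || n->mass == 0.0) return;
    double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
    double r2 = dx*dx + dy*dy + dz*dz;
    if (r2 == 0.0) return;                   /* the particle itself */
    double r = sqrt(r2);
    if (n->nbodies == 1 || n->size / r < THETA) {
        /* leaf, or cluster far enough away: use the centre of mass */
        double s = G * pm * n->mass / (r2 * r);
        *Fx += s * dx;  *Fy += s * dy;  *Fz += s * dz;
    } else {
        for (int i = 0; i < 8; i++)          /* open the cell: recurse */
            force(n->child[i], px, py, pz, G, pm, Fx, Fy, Fz);
    }
}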
7.26 Barnes-Hut Algorithm: 2D Illustration
[Figure: 2D illustration of the recursive subdivision (a quadtree) around a set of particles]