Parallel Algorithms. 6/12/2012 6:42 PM gdeepak.com 1



1 Parallel Algorithms

2 Deliverables
Parallel Programming Paradigm
Matrix Multiplication
Merge Sort

3 Introduction
Parallel programming uses the power of multiple processors working in parallel to improve the running time of an algorithm.
There are various models for describing parallel programs.
We have around 500 machines with 10,000 processors each. Within a few years, the number of processors in the largest system is expected to reach 1 million.

4 Introduction
Parallel programming requires some form of synchronization so that a value in memory is accessed only when it is valid.
Different architectures support different approaches.
Performance matters most in scientific computing.
Memory access is the critical issue in high performance computing.

5 Performance of systems
[Figure: Problem Size vs. Performance. Axis label: size of problem being solved. Regions shown: problem fits into registers, problem fits into cache, problem fits into RAM, problem requires hard disk access, problem too big for memory.]

6 Work/Memory Ratio
Suppose that a given algorithm has a work/memory ratio $P_{WM}$ and is implemented on a system with a maximum memory bandwidth of $x$ billion floating point words per second. Then the maximum performance that can be achieved is $x \cdot P_{WM}$ GFLOPS.
This is an upper bound on the number of operations per unit time, obtained by assuming that each floating point operation blocks until its data is available to the CPU.

7 Work/Memory Ratio
$P_{WM}$: the number of floating point operations divided by the number of memory locations referenced (either reads or writes).
Example: $A = \sum_{i=1}^{n} a_i$
Computing $A$ requires $n-1$ floating point additions and involves $n+1$ memory locations, one for $A$ and $n$ for the $a_i$.
Therefore the work/memory ratio is $\frac{n-1}{n+1} \approx 1$ for large $n$.
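The summation example above can be checked numerically. The following is a minimal sketch; the function names and the 4 Gwords/s bandwidth figure are illustrative assumptions, not values from the slides.

```python
def work_memory_ratio(n: int) -> float:
    """Work/memory ratio P_WM for summing n values a_1..a_n into A."""
    flops = n - 1        # n - 1 floating point additions
    locations = n + 1    # n locations for the a_i plus one for A
    return flops / locations


def peak_gflops(bandwidth_gwords: float, ratio: float) -> float:
    """Upper bound x * P_WM on achievable GFLOPS.

    bandwidth_gwords is x, in billions of floating point words per second.
    """
    return bandwidth_gwords * ratio


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, work_memory_ratio(n))
    # Hypothetical machine with x = 4 Gwords/s of memory bandwidth.
    print(peak_gflops(4.0, work_memory_ratio(1000)))
```

As n grows the ratio approaches 1, so under this model the summation can never run much faster than the memory system delivers words.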

8 Memory Latency and Access
$$\frac{\text{average cycles}}{\text{word access}} = i \cdot \frac{\text{cache cycles}}{\text{word access}} + j \cdot \frac{\text{RAM cycles}}{\text{word access}} + (1 - i - j) \cdot \frac{\text{hard disk cycles}}{\text{word access}}$$
where $i$ is the fraction of accesses that hit in cache and $j$ is the fraction that hit in RAM.
As the problem size increases, the problem migrates into ever slower levels of the memory system. Eventually it reaches a point where it cannot be completed for lack of space.
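The weighted average above is easy to evaluate directly. This sketch assumes three memory levels as on the slide; the cycle counts in the usage section are made-up placeholders chosen only to show how quickly disk accesses dominate.

```python
def average_cycles(i: float, j: float,
                   cache_cycles: float,
                   ram_cycles: float,
                   disk_cycles: float) -> float:
    """Average cycles per word access.

    i: fraction of accesses served by cache
    j: fraction of accesses served by RAM
    The remaining fraction (1 - i - j) goes to the hard disk.
    """
    if i < 0 or j < 0 or i + j > 1:
        raise ValueError("i and j must be non-negative fractions with i + j <= 1")
    return i * cache_cycles + j * ram_cycles + (1 - i - j) * disk_cycles


if __name__ == "__main__":
    # Hypothetical costs: 2 cycles (cache), 100 (RAM), 1,000,000 (disk).
    print(average_cycles(0.95, 0.05, 2, 100, 1_000_000))
    print(average_cycles(0.90, 0.09, 2, 100, 1_000_000))
```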

9 Performance
For summing a series, we can always compute partial sums in parallel because addition is associative and commutative.
The iteration space of a given set of loops is a subset of the Cartesian product of the integers, consisting of the set of all possible values of the loop indices. The dimension of the Cartesian product is the number of nested loops.
For granularity k = 5, data of size N is divided among x processors, each summing a block of five elements. This is known as data parallelism.
[Figure: processors P laid out across the data elements, up to processor x and element N.]

10 Load Balancing
If the work is not distributed equally, then one processor may end up taking longer than the others.
Suppose that a set of parallel tasks (indexed by $i = 1, \ldots, p$) execute in times $t_i$.
Average execution time $= \frac{1}{p} \sum_{i=1}^{p} t_i$
The load balance of this set of parallel tasks is
$$\frac{\operatorname{average}(t_i : 1 \le i \le p)}{\max(t_i : 1 \le i \le p)}$$
This ratio should be close to one for proper load balance.
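The load balance ratio can be computed from measured task times. A minimal sketch follows; the function name and the sample timings are illustrative assumptions.

```python
def load_balance(times: list[float]) -> float:
    """Return average(t_i) / max(t_i) for a set of parallel task times."""
    if not times or max(times) <= 0:
        raise ValueError("need at least one task with positive run time")
    average = sum(times) / len(times)
    return average / max(times)


if __name__ == "__main__":
    print(load_balance([2.0, 2.0, 2.0]))   # perfectly balanced
    print(load_balance([1.0, 1.0, 4.0]))   # one slow task dominates
```

Only the average and the maximum enter the ratio, so the slowest task alone determines how far the value falls below one.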

11 Important
The minimum run time among the processors in a set of parallel tasks is irrelevant for the purpose of load balancing.
In extreme cases there may be a few processors that do nothing because we have no work to give them. Or, as in the previous example of summing a series, if N is 56 then the last processor will have only one number to sum, which is trivial, because we are giving five numbers to every other processor.
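The N = 56, k = 5 case can be reproduced with a small block-partitioned sum. This is a sketch only: it uses a thread pool from the Python standard library to stand in for the processors, and the helper names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def block_partition(data: list, k: int) -> list[list]:
    """Split data into consecutive blocks of at most k elements."""
    return [data[start:start + k] for start in range(0, len(data), k)]


def blocked_sum(data: list, k: int = 5):
    """Sum data by blocks of size k; return (total, block sizes)."""
    blocks = block_partition(data, k)
    with ThreadPoolExecutor() as pool:
        # Each block's partial sum is independent of the others.
        partials = list(pool.map(sum, blocks))
    return sum(partials), [len(block) for block in blocks]


if __name__ == "__main__":
    total, sizes = blocked_sum(list(range(1, 57)), k=5)
    print(total, sizes)
```

With 56 elements the partition yields eleven blocks of five and a final block holding a single element, which is the imbalance the slide describes.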

12 Parallel Performance
Parallel performance is more complex than sequential performance in many ways.
It is difficult to achieve good parallel performance, and it is also more difficult to predict and measure it.
This is because essentially independent computers are cooperating on a single task.
Analysis becomes difficult due to issues of synchronization, load balancing and communication costs.

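13 System Performance
$T(\text{System A}) = T(\text{CPU}) + T(\text{Memory Access}) = f \cdot T(A) + (1-f) \cdot T(A)$, with $f$ in the range $[0, 1]$.
$T(\text{System B}) = f \cdot \frac{T(A)}{n} + (1-f) \cdot \frac{T(A)}{k}$, where the System B processor is $n$ times faster and its memory access is $k$ times faster.
So when memory access dominates the running time (small $f$), it is more important to make the memory faster.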
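The System B formula can be explored with a few numbers. In this sketch the values f = 0.2 and T(A) = 100 are hypothetical, chosen to represent a memory-bound program.

```python
def time_system_b(t_a: float, f: float, n: float, k: float) -> float:
    """Run time on System B.

    t_a: run time on System A
    f:   fraction of t_a spent in the CPU (the rest is memory access)
    n:   CPU speedup factor of System B
    k:   memory access speedup factor of System B
    """
    return f * t_a / n + (1 - f) * t_a / k


if __name__ == "__main__":
    t_a, f = 100.0, 0.2   # hypothetical: 20% CPU time, 80% memory time
    print("CPU twice as fast:   ", time_system_b(t_a, f, n=2, k=1))
    print("memory twice as fast:", time_system_b(t_a, f, n=1, k=2))
```

For these inputs, doubling the CPU speed saves far less time than doubling the memory speed. With f close to 1 the comparison reverses.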

14 Other factors affecting the performance
Data bandwidth
Address bandwidth
I/O
Number of processors
Memory latency
Libraries
Compilers
Operating system

15 Fibonacci Series
To understand the model, take the example of the Fibonacci series.
Fib(n)
    if n < 2
        then return n
    x ← spawn Fib(n-1)
    y ← spawn Fib(n-2)
    sync
    return (x + y)
Here spawn means that the subroutine can execute at the same time as the parent thread.
Sync means wait until all children are done.
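The spawn and sync keywords above can be imitated with operating system threads. This is a rough sketch under stated assumptions: one `threading.Thread` per spawn (a real multithreaded runtime would schedule spawns onto a fixed set of workers instead), and standard CPython, where the global interpreter lock prevents real speedup for this CPU-bound code. It shows the control structure only.

```python
import threading


def fib(n: int) -> int:
    """Fibonacci with thread-per-spawn; keep n small (about 15 or less)."""
    if n < 2:
        return n

    results = {}

    def run(key: str, arg: int) -> None:
        results[key] = fib(arg)

    # "spawn": each child may execute at the same time as the parent.
    x_thread = threading.Thread(target=run, args=("x", n - 1))
    y_thread = threading.Thread(target=run, args=("y", n - 2))
    x_thread.start()
    y_thread.start()

    # "sync": wait until all children are done.
    x_thread.join()
    y_thread.join()
    return results["x"] + results["y"]


if __name__ == "__main__":
    print(fib(10))
```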

16 Logical Parallelism
This is called logical parallelism, not actual parallelism, because we are not taking into consideration the actual execution conditions, which are determined by the scheduler.
The scheduler solves the problem of mapping a dynamically unfolding execution onto processors.
The concept is also known as dynamic multithreading.
A multithreaded computation in a parallel instruction stream is equivalent to a directed acyclic graph whose vertices are threads.
A single vertex represents a maximal sequence of instructions not containing parallel control (spawn, sync, etc.).

17 Fib(4)
[Figure: computation DAG for Fib(4). Calls Fib(4), Fib(3) and both calls of Fib(2) each consist of threads A, B, C; calls Fib(1) and Fib(0) each consist of a single thread A. Edge types: spawn edge, continuation edge, return edge.]
$T_P$ = running time on P processors
$T_1$ = running time on one processor
$T_\infty$ = critical path length = longest path in the DAG (directed acyclic graph)
Fib(4) has $T_1 = 17$ and $T_\infty = 8$ (path A, A, A, B, A, C, C, C).
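The figures for Fib(4) can be reproduced by counting threads in the DAG. This sketch assumes the structure shown in the figure: every thread costs one unit, a call with n >= 2 consists of threads A (up to the first spawn), B (up to the second spawn) and C (after the sync), and a base-case call is a single thread A. The function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def work(n: int) -> int:
    """T_1: total number of threads in the DAG of Fib(n)."""
    if n < 2:
        return 1                      # single thread A
    return 3 + work(n - 1) + work(n - 2)   # A, B, C plus both children


@lru_cache(maxsize=None)
def span(n: int) -> int:
    """T_inf: number of threads on the longest path in the DAG of Fib(n)."""
    if n < 2:
        return 1
    via_first_child = 1 + span(n - 1)    # A, then the spawned Fib(n-1)
    via_second_child = 2 + span(n - 2)   # A, B, then the spawned Fib(n-2)
    return max(via_first_child, via_second_child) + 1   # C runs after sync


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, work(n), span(n), work(n) / span(n))
```

The last column is the parallelism, work divided by critical path length.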

18 Speedup by Parallelism
Lower bounds on $T_P$:
$T_P \ge T_1/P$, because P processors can do at most P work in one step.
$T_P \ge T_\infty$, because P processors cannot do more work than infinitely many processors.
Speedup on P processors is $T_1/T_P$:
if $T_1/T_P = \Theta(P)$, the speedup is linear;
if $T_1/T_P = P$, the speedup is perfect linear;
if $T_1/T_P > P$, the speedup is superlinear.
Superlinear speedup is not possible in this model; however, other models show superlinear speedup by accounting for cache effects and similar factors.

19 Scheduling
The maximum possible speedup is $T_1/T_\infty$, called the parallelism.
Parallelism = work / critical path length = the average amount of work that can be done in parallel along each step of the critical path. We denote it P#.
It gives us the limit on how far we can go.
Scheduling maps the computation onto P processors; this is done by the runtime system. An algorithm running below the language layer schedules the spawns and syncs.
Online schedulers are complex, so our illustrations will be based on an offline scheduler using greedy scheduling.

20 Greedy Scheduler
Do as much as possible on every step.
Complete step: the number of threads ready to run is > P. In that case, execute any P of them.
Incomplete step: the number of threads ready to run is ≤ P. In that case, execute all of them.
Theorem: A greedy scheduler executes any computation G with work $T_1$ and critical path length $T_\infty$ in time $T_P \le T_1/P + T_\infty$ on a computer with P processors.
This is the foundation of parallel scheduling. The two terms on the right-hand side bound the number of complete steps and the number of incomplete steps, respectively, and each term is individually a lower bound on $T_P$.
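The theorem can be checked on small examples by simulating a greedy scheduler. The sketch below assumes unit-cost threads and a DAG given as a dictionary from each vertex to its list of successors; `EXAMPLE_DAG` is a made-up fork-join graph, not one from the slides.

```python
def greedy_schedule(dag: dict, p: int) -> int:
    """Number of steps a greedy scheduler needs on p processors."""
    indegree = {v: 0 for v in dag}
    for successors in dag.values():
        for s in successors:
            indegree[s] += 1
    ready = [v for v in dag if indegree[v] == 0]
    steps = 0
    while ready:
        # Run up to p ready threads; threads that become ready during
        # this step are only eligible from the next step on.
        running, ready = ready[:p], ready[p:]
        for v in running:
            for s in dag[v]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
        steps += 1
    return steps


def critical_path_length(dag: dict) -> int:
    """T_inf: number of vertices on the longest path in the DAG."""
    memo = {}

    def longest_from(v):
        if v not in memo:
            memo[v] = 1 + max((longest_from(s) for s in dag[v]), default=0)
        return memo[v]

    return max(longest_from(v) for v in dag)


# Hypothetical fork-join DAG: a forks into b, c, d, e, which join at f.
EXAMPLE_DAG = {"a": ["b", "c", "d", "e"],
               "b": ["f"], "c": ["f"], "d": ["f"], "e": ["f"], "f": []}

if __name__ == "__main__":
    t1, t_inf = len(EXAMPLE_DAG), critical_path_length(EXAMPLE_DAG)
    for p in range(1, 6):
        print(p, greedy_schedule(EXAMPLE_DAG, p), "<=", t1 / p + t_inf)
```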

21 Proof
The number of complete steps is at most $T_1/P$, because the total amount of work is $T_1$ and otherwise more than $T_1$ work would be done.
Consider an incomplete step, and let G' be the subgraph of G that remains to be executed. The threads with in-degree zero in G' are the ones that are ready to be executed.

22 Execution steps
[Figure: execution steps, with threads labelled E and R.]

23 Critical Path Length
An incomplete step executes all of the ready threads, so the critical path length of the graph that remains to be executed is reduced by one. This implies that the number of incomplete steps is at most $T_\infty$.
Corollary: Linear speedup when P = O(P#).
P# $= T_1/T_\infty$, so P = O($T_1/T_\infty$) implies $T_\infty = O(T_1/P)$. Thus $T_P \le T_1/P + T_\infty = O(T_1/P)$.
With fewer processors than the parallelism, we get linear speedup. With more processors than the parallelism, we get no additional speedup beyond that given by a number of processors equal to the parallelism.

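24 Multithreaded Algorithms
Matrix multiplication (n×n), C = A·B, using divide and conquer:
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}$$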

25 Mult(C, A, B, n)   // n is an exact power of 2
    temp matrix T[1..n, 1..n]
    if n = 1
        then C[1,1] ← A[1,1] · B[1,1]
    else
        <partition matrices>   // O(1) time
        spawn Mult(C11, A11, B11, n/2)
        spawn Mult(C12, A11, B12, n/2)
        spawn Mult(C21, A21, B11, n/2)
        spawn Mult(C22, A21, B12, n/2)
        spawn Mult(T11, A12, B21, n/2)
        spawn Mult(T12, A12, B22, n/2)
        spawn Mult(T21, A22, B21, n/2)
        spawn Mult(T22, A22, B22, n/2)
        sync
        Add(C, T, n)
    return

26 Algorithm
Add(C, T, n)   // C ← C + T
    <base case and partitioning>
    spawn Add(C11, T11, n/2)
    spawn Add(C12, T12, n/2)
    spawn Add(C21, T21, n/2)
    spawn Add(C22, T22, n/2)
    sync
Analysis
Let $M_P(n)$ = P-processor execution time for the Mult code.
Let $A_P(n)$ = P-processor execution time for the Add code.

27 Analysis
Work:
$A_1(n) = 4A_1(n/2) + \Theta(1) = \Theta(n^2)$   // 4 subproblems of size n/2
$M_1(n) = 8M_1(n/2) + \Theta(n^2) = \Theta(n^3)$   // 8 subproblems of size n/2
Critical path length:
$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\lg n)$   // the 4 subproblems of size n/2 run in parallel; all spawns are at the same level, so we only need to look at one
$M_\infty(n) = M_\infty(n/2) + \Theta(\lg n) = \Theta(\lg^2 n)$

28 Analysis Contd.
Parallelism: P# $= M_1(n)/M_\infty(n) = \Theta(n^3/\lg^2 n)$
For 1000×1000 matrices, P# $\approx 10^9/10^2 = 10^7$, i.e. 10 million processors.
Till today we have systems that support maximum processors

29 Improved Algorithm
P# is much bigger than the typical P, so we trade parallelism for space efficiency.
Mult-Add(C, A, B, n)   // C ← C + A·B
    <base case and partitioning>
    spawn Mult-Add(C11, A11, B11, n/2)
    spawn Mult-Add(C12, A11, B12, n/2)
    spawn Mult-Add(C21, A21, B11, n/2)
    spawn Mult-Add(C22, A21, B12, n/2)
    sync
    spawn Mult-Add(C11, A12, B21, n/2)
    spawn Mult-Add(C12, A12, B22, n/2)
    spawn Mult-Add(C21, A22, B21, n/2)
    spawn Mult-Add(C22, A22, B22, n/2)
    sync
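The Mult-Add recursion can be written out concretely to see which quadrants each call touches. The sketch below runs the eight recursive calls sequentially and only marks in comments where the spawns and syncs would go; the representation (lists of lists addressed by row and column offsets) and all parameter names are assumptions made for illustration.

```python
def mult_add(C, A, B, n, ci=0, cj=0, ai=0, aj=0, bi=0, bj=0):
    """C block += A block * B block, for n-by-n blocks, n a power of 2.

    (ci, cj), (ai, aj), (bi, bj) are the top-left corners of the blocks.
    """
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # First group: four calls writing disjoint quadrants of C (spawn each).
    mult_add(C, A, B, h, ci, cj, ai, aj, bi, bj)              # C11 += A11*B11
    mult_add(C, A, B, h, ci, cj + h, ai, aj, bi, bj + h)      # C12 += A11*B12
    mult_add(C, A, B, h, ci + h, cj, ai + h, aj, bi, bj)      # C21 += A21*B11
    mult_add(C, A, B, h, ci + h, cj + h, ai + h, aj, bi, bj + h)  # C22 += A21*B12
    # sync: the second group updates the same quadrants, so it must wait.
    mult_add(C, A, B, h, ci, cj, ai, aj + h, bi + h, bj)          # C11 += A12*B21
    mult_add(C, A, B, h, ci, cj + h, ai, aj + h, bi + h, bj + h)  # C12 += A12*B22
    mult_add(C, A, B, h, ci + h, cj, ai + h, aj + h, bi + h, bj)  # C21 += A22*B21
    mult_add(C, A, B, h, ci + h, cj + h, ai + h, aj + h, bi + h, bj + h)  # C22 += A22*B22
    # sync


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[0, 0], [0, 0]]
    mult_add(C, A, B, 2)
    print(C)
```

The sync between the two groups is what removes the need for a temporary matrix: both groups accumulate into the same quadrants of C, so they cannot overlap in time.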

30 Analysis
Work: $MA_1(n) = \Theta(n^3)$
Critical path length: $MA_\infty(n) = 2MA_\infty(n/2) + \Theta(1) = \Theta(n)$
Parallelism: P# $= MA_1(n)/MA_\infty(n) = \Theta(n^2)$
For 1000×1000 matrix P#=

31 Merge Sorting
Merge-Sort(A, p, r)   // sort A[p..r]
    if p < r
        then q ← ⌊(p+r)/2⌋
            spawn Merge-Sort(A, p, q)
            spawn Merge-Sort(A, q+1, r)
            sync
            Merge(A, p, q, r)   // merge A[p..q] with A[q+1..r]

32 Merge Sorting: Analysis
Work: $T_1(n) = 2T_1(n/2) + \Theta(n) = \Theta(n \lg n)$
Critical path length: $T_\infty(n) = T_\infty(n/2) + \Theta(n) = \Theta(n)$
Parallelism: P# $= T_1(n)/T_\infty(n) = \Theta(\lg n)$
This is not much, especially if we also consider constants and overheads, so we must parallelize the merge step.

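33 Improvement
The first array A has length l and the second array B has length m, with l > m.
[Figure: array A split at its middle element A[l/2] into the elements ≤ A[l/2] and the elements ≥ A[l/2]; array B split the same way into the elements ≤ A[l/2] and the elements ≥ A[l/2].]
Perform a binary search for A[l/2] in the second array; it falls between positions j and j+1.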

34 Improved Algorithm
P_Merge(A[1..l], B[1..m], C[1..n])   // merge A and B into C; n = l + m
    // assume l > m
    <base case>
    find j such that B[j] ≤ A[l/2] ≤ B[j+1]
    spawn P_Merge(A[1..l/2], B[1..j], C[1..l/2+j])
    spawn P_Merge(A[l/2+1..l], B[j+1..m], C[l/2+j+1..n])
    sync
Because we split at the middle of the larger array, each recursive call receives at least n/4 elements, and therefore at most 3n/4.
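The divide step of P_Merge can be sketched with a binary search from the Python standard library. Two things differ from the pseudocode above and are assumptions of this sketch: indices are 0-based with half-open ranges, and the middle element of the larger array is written straight to its final position (a common textbook variant) instead of being carried into the first sub-merge. The two recursive calls are run sequentially; comments mark where spawn and sync would go.

```python
from bisect import bisect_left


def p_merge(A, B, C, a_lo, a_hi, b_lo, b_hi, c_lo):
    """Merge sorted A[a_lo:a_hi] and B[b_lo:b_hi] into C from index c_lo."""
    len_a, len_b = a_hi - a_lo, b_hi - b_lo
    if len_a < len_b:
        # Keep the larger array first, as the slide assumes l > m.
        p_merge(B, A, C, b_lo, b_hi, a_lo, a_hi, c_lo)
        return
    if len_a == 0:
        return                       # both ranges are empty
    mid = (a_lo + a_hi) // 2
    # Binary search: B[b_lo:j] < A[mid] <= B[j:b_hi]
    j = bisect_left(B, A[mid], b_lo, b_hi)
    c_mid = c_lo + (mid - a_lo) + (j - b_lo)
    C[c_mid] = A[mid]
    # The two sub-merges write disjoint parts of C: spawn both, then sync.
    p_merge(A, B, C, a_lo, mid, b_lo, j, c_lo)
    p_merge(A, B, C, mid + 1, a_hi, j, b_hi, c_mid + 1)


def parallel_merge(A, B):
    """Convenience wrapper returning a new merged list."""
    C = [None] * (len(A) + len(B))
    p_merge(A, B, C, 0, len(A), 0, len(B), 0)
    return C


if __name__ == "__main__":
    print(parallel_merge([1, 3, 5, 7, 9], [2, 4, 6]))
```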

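35 Analysis: Improved Merge Sort
Merge step:
Work: $PM_1(n) = PM_1(\alpha n) + PM_1((1-\alpha)n) + \Theta(\lg n) = \Theta(n)$, with $1/4 \le \alpha \le 3/4$   // can be proved by substitution
Critical path length: $PM_\infty(n) = PM_\infty(3n/4) + \Theta(\lg n) = \Theta(\lg^2 n)$
Total merge sort:
Work: $T_1(n) = \Theta(n \lg n)$
Critical path length: $T_\infty(n) = T_\infty(n/2) + \Theta(\lg^2 n) = \Theta(\lg^3 n)$
Parallelism: P# $= T_1(n)/T_\infty(n) = \Theta(n \lg n / \lg^3 n) = \Theta(n/\lg^2 n)$
Best known to date: $\Theta(n/\lg n)$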

36 Questions, Comments and Suggestions

37 Question 1
Give any three limitations of parallel processing.

38 Question 2
Give three sorting algorithms that perform better when parallelized.

39 Question 3
Should we go for the maximum possible parallelism?
A) Yes. Why?
B) No. Why?
