Parallel optimal numerical computing (on GRID) Belgrade October, 4, 2006 Rodica Potolea

Size: px

Start display at page:

Download "Parallel optimal numerical computing (on GRID) Belgrade October, 4, 2006 Rodica Potolea"

Gervase Cummings
5 years ago
Views:

1 Parallel optimal numerical computing (on GRID) Belgrade October, 4, 2006 Rodica Potolea

2 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 2

3 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 3

4 Parallel solutions to problems Why parallel solutions? Increase speed Process huge amount of data Solve problems in real time 4

5 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 5

6 Efficiency Parameters to be evaluated Time Memory Cases to be considered (depend on the algorithm implementing the problem) Best Worst Average Optimal algorithms Ω() = O() An algorithm is optimal if the running time in the worst case is of the same order of magnitude with function Ω of the problem to be solved Order of magnitude: a function which express the running time and Ω respectively if polynomial function, the polynom should have the same maximum degree, regardless the multiplicative constant Ex: Ω() = 2n2 and O()= 3n2 it is considered Ω = O, written as Ω(n2) = O(n2) Algorithms have O() >=Ω in the worst case O() <= Ω could be true for best case scenario, or even for the average case Ex: Sorting has Ω(n lgn), while several sorting algorithms have O(1) best case scenario 6

7 Efficiency cont. What kind of algorithms should be considered for parallel implementation? Alg1: O(nk) Oper. Time M1: nk T M2: nk T/V Vnk T For O(nk) alg: (n2)k= Vnk=(v1/k n)k Alg2: O(2n) Oper. 2n 2n V2n Hence n2= For O(2n) alg: (2) n2 = V 2 n =2 lgv + n Hence n2= Time T T/V T v1/k n n + lgv GOOD! BAD! Unhappy conclusion: regardless the number of times we improve the speed of the machine, the max. dimension for which we can run the algorithm 2 increases only with an additive constant!!! 7

8 Efficiency cont. Parameters to be evaluated for parallel processing Time Memory Number of processors involved Cost = Time x Number of processors (c = t x n) E = O()/cost (<=1) E=1 Cost = O() A parallel algorithm is optimal if the cost is of the same order of magnitude with the best known algorithm to solve the problem in the worst case Efficiency Optimal algorithms Other 8

9 Efficiency cont Parallel running time Computation time Communication time (data transfer, partial results transfer, information communication) Computation time As the number of processors involved increases, the computation time decreases Communication time Quite the opposite!!! 9

10 Efficiency cont Diagram of running time T N 10

11 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 11

12 Optimal Parallel Solution Parallel Architecture Flynn s taxonomy According to instruction/data flow single/multiple SISD von Newmann architecture SIMD -single control unit dispatches instructions to each processing unit MISD - the same data is analyzed from different points of view (MI) MIMD - different instructions run on different data 12

13 Optimal Parallel Solution Parallel Algorithms Data parallel Input data should be split to be processed by different CPU Control parallel Different sequences are processed by different CPU Code should be parallelized Conflicts should be avoided (removed, where possible, or not parallelize the given sequence) 13

14 Optimal Parallel Solution Parallel Algorithms cont. Constraints within the data/code Refer to precedence relations of computation to be satisfied to compute the problem correctly Read/Write conflicts Read before Write Write before Read Write before Write Define dependencies that limit parallel potential Some of them can be removed such that allow to increase the parallelism 14

15 Parallel Algorithms cont. Dependencies analysis Data Flow dependence Indicates a write before read ordering of operations which must be satisfied Can NOT be avoided Limits possible parallelism Ex: (assume sequence in a loop on i) S1: a(i) = x(i) 3 //assign first S2: b(i) = a(i) /c(i) //use (read) it second a(i) should be first calculated in S1, and only then used in S2. Therefore, S1 and S2 cannot be processed in parallel!!! 15

16 Parallel Algorithms cont. Dependencies analysis Data Antidependence Indicates a read before write ordering of operations which must be satisfied Can be avoided (techniques to remove this type of dependence) Ex: (assume sequence in a loop on i) S1: a(i) = x(i) 3 //assign it later S2: b(i) = c(i) + a(i+1) //use (read) first a(i+1) should be first used (with its former value) in S2, and only then computed in S1, at the next execution of the loop. Therefore, several iterations of the loop cannot be processed in parallel! 16

17 Parallel Algorithms cont. Dependencies analysis Output dependence Indicates a write before write ordering of operations which must be satisfied Can be avoided (techniques to remove this type of dependence) Ex: (assume sequence in a loop on i) S1: c(i+4) = b(i) + a(i+1) //assign first here S2: c(i+1) = x(i) //and here after 3 //executions of the loop c(i+4) is assigned in S1; after 3 executions of the loop, the same element in the array is assigned in S2. Therefore, several iterations of the loop cannot be processed in parallel 17

18 Parallel Algorithms cont. Dependencies analysis Each dependency defines a dependency vector They are defined by the distance between the loop where they interfere (data flow) S1: a(i) = x(i) 3 d=0 S2: b(i) = a(i) /c(i) (data antidep.) S1: a(i) = x(i) 3 d=1 S2: b(i) = c(i) + a(i+1) (output dep.) S1: c(i+4) = b(i) + a(i+1) d=3 S2: c(i+1) = x(i) All dependency vectors within a sequence define a dependency matrix 18

19 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 19

20 GRID A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities. (Foster) It provides access to computational power 20

21 GRID cont. support for High Throughput Computing Support = computing power + storage (TB) 21

22 GRID cont. RO 09 UTCN W.N.01 C.E. ce01.mosigrid.utcluj.ro SEE GRID W.N.02 S.E. se01.mosigrid.utcluj.ro W.N.10 CE = computing element SE = storage element WN = work node 22

23 GRID cont. SEE-GRID 23

24 Outline Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations Conclusions Review 24

25 Parallel implementations optimal approach Matrix multiplication Matrix power Gaussian elimination Sorting Graph algorithms FFT Cryptographic algorithms Randomness tests I/O optimal approach Remarks (general and particular) Tests run by Alex Mascasan Ioana Gligan 25

26 Parallel implementations - optimal approach Matrix multiplication C=AxB procedure MAT_MULT (A, B, C) begin for i := 0 to n - 1 do for j := 0 to n - 1 do begin C[i, j] := 0; for k := 0 to n - 1 do C[i, j] := C[i, j] + A[i, k] x B[k, j]; endfor; end MAT_MULT Sequential implementation: O(n3) 26

or column) drawback: all processor need entire A (lot of broadcast, and system overload) 2-D

27 Parallel implementations - optimal approach Matrix multiplication cont. Parallel solutions - Data parallel 1-D Partitioning A&B decomposed one dimension (either row or column) drawback: all processor need entire A (lot of broadcast, and system overload) 2-D Partitioning A and B partitioned into p blocks Dimension of each block n n x p p submatrix mult. Ci,j :Ai,k and Bk,j 27

28 Parallel implementations - optimal approach Matrix multiplication cont. Analysis 1D (absolute analysis) Matrix Multiplication 1D Time Processors Optimal sol: p=f(n) 28

29 Parallel implementations - optimal approach Matrix multiplication cont. Analysis 1D (relative analysis) Matrix Multiplication 1D Time % Processors Computation time (for small dim) is small compared to communication time: hence parallel implementation with 1D efficient on very large matrix dim. 29

30 Parallel implementations - optimal approach Matrix multiplication cont. Analysis 2D (absolute analysis) Matrix Multiplication 2D Time x x x x Processors 30

31 Parallel implementations - optimal approach Matrix multiplication cont. Analysis 2D (relative analysis) Matrix Multiplication 2D Time % Processors 31

32 Parallel implementations - optimal approach Matrix multiplication cont. Comparison 1D- 2D Comparison 1D 2D D Time 300 1D 700 1D D 100 2D D 700 2D D Processors 32

33 Parallel implementations - optimal approach Matrix power (derived from matrix multiplication) Matrix Power Time power 12 power power 4 power Processors 33

34 Parallel implementations - optimal approach Gaussian elimination Solving a system of equations: A x = B procedure GAUSSIAN_ELIMINATION (A, b, y) begin for k := 0 to n - 1 do begin for j := k + 1 to n - 1 do A[k, j] := A[k, j]/a[k, k]; y[k] := b[k]/a[k, k]; A[k, k] := 1; for i := k + 1 to n - 1 do begin for j := k + 1 to n - 1 do A[i, j] := A[i, j] - A[i, k] x A[k, j]; b[i] := b[i] - A[i, k] x y[k]; A[i, k] := 0; endfor; endfor; end GAUSSIAN_ELIMINATION O(n3) 34

35 Parallel implementations - optimal approach Gaussian elimination cont. 1D implementation: p processors (p<n), each receives n/p rows; Due to the un-even nature of the algorithm, most of the processors (actually more than half for half of the time) are idle (as they finished their work); t = n3/p without considering communication, hence the cost =n3 (tseq= 2n3/3), NOT cost optimal; To be used only when n>>p. Cyclic 1D mapping Balances the load of processors (difference at most one row) 35

36 Parallel implementations - optimal approach Gaussian elimination cont. Gauss Elimination Time (ms) Processors 36

37 Parallel implementations - optimal approach Sorting Ω(nlgn) QuickSort O(n2) worst case, O(nlgn) average case Data parallel approach (data split evenly among p processors) procedure QUICKSORT (A, q, r) begin if q < r then begin x := A[q]; s := q; for i := q + 1 to r do if A[i] <= x then begin s := s + 1; swap(a[s], A[i]); end if swap(a[q], A[s]); QUICKSORT (A, q, s); QUICKSORT (A, s + 1, r); end if end QUICKSORT 37

38 Parallel implementations - optimal approach Sorting cont. Analysis QuickSort (absolute analysis) QuickSort Time Processors Interpretation of results: after the optimal point is reach, the running time grows rapidly; it has to be a massive process communication! 38

39 Parallel implementations - optimal approach Sorting cont. Analysis QuickSort (relative analysis) QuickSort Time % Processors Improvement of 1 order of magnitude for large dim. for optimum nb. of proc. 39

40 Parallel implementations - optimal approach Sorting cont. Ω(nlgn) Enumeration Sort: for each element determine its rank in the ordered array, then move it directly in its final position. Data parallel implementation: split data among processors, run the sequential code on each processor merge the results 40

41 Parallel implementations - optimal approach Sorting cont. Relative Analysis Absolute Analysis Enumeration Sort Enumeration Sort Time Time % Processors Processors Parallel running time = 1.7% of the Note: as opposite to QS, the values mono processor running time, are not growing so fast to the right! hence, almost 2 order of magnitude!!!! 41

42 Parallel implementations - optimal approach Graph algorithms Minimum Spanning Trees Graph with 7 vertices: MST will have 6 edges with the sum of their weight minimum Possible solutions: MST weight = 20 42

43 Parallel implementations - optimal approach Graph algorithms cont. Prim s alg. procedure PRIM_MST(V, E, w, r) begin VT := {r}; d[r] := 0; for all v (V - VT ) do if edge(r, v) exists set d[v]:= w(r,v); else set d[v]:= ; while VT V do begin find a vertex u such that d[u]:=min{d[v] v (V - VT )}; VT := VT {u}; for all v (V VT ) do d[v]:= min{d[v],w(u, v)}; endwhile end PRIM_MST 43

44 Parallel implementations - optimal approach Graph algorithms cont. MST Absolute Analysis MST Relative Analysis Minimum Spannig Tree Minumum Spanning Tree Time Time % Observe the min value shifts to the right as graph s dimension increases Processors Processors 8 Observe the running time improvement with one order of magnitude (compared to sequential approach) for graphs with n =

45 Parallel implementations - optimal approach Graph algorithms cont. Transitive Closure: search for the existence of a path between any 2 vertices in a graph Possible solutions: nth power of the matrix Dijkstra s SPSP (O(n2)) Parallel solutions: source-partitioned formulation source-parallel formulation 45

46 Parallel implementations - optimal approach Graph algorithms cont. Transitive Closure: source-partitioned Each vertex vi is assigned to a process, and each process computes shortest path from each vertex in its set to any other vertex in the graph, by using the sequential algorithm (hence, data partitioning) Requires no inter process communication Transitive closure source partition Time Processors 46

47 Parallel implementations - optimal approach Graph algorithms cont. Transitive Closure: source-parallel assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (similar to MST) Requires inter process communication Transitive closure source parallel Time Processors 47

48 Parallel implementations - optimal approach Graph algorithms cont. Graph Isomorphism: exponential running time! search if 2 graphs matches trivial solution: generate all permutations of a graph (n!) and match with the other graph n! (n 1)! (n 1)! (n 1)! Each process is responsible for computing (n-1)! (data parallel approach) 48

49 Parallel implementations - optimal approach Graph algorithms cont. Graph Isomorphism Time logarithmic scale vertices 10 vertices vertices 12 vertices Processors 49

50 Parallel implementations - optimal approach FFT Based on Cooley-Tukey algorithm n-point, one-dimensional, unordered, radix-2 FFT requires inter-process interaction only during the first d = log p of the log n iterations FFT Time Processors 50

51 Parallel implementations -optimal approach Cryptographic algorithms 51

52 Parallel implementations -optimal approach Cryptographic algorithms cont. Decrypt MB MB MB MB speed increase is not linear, especially for small input sizes as input size increases, a constancy can be detected the 8 MB decryption takes half as long for 2 processors than for one, and approximately a quarter for 4 processors. The time remains aprox. constant as while doubling the dimension, we double the number of processors 52

53 Parallel implementations - optimal approach Randomness tests Most tests are based nm counting the number of 0 s and 1 s For a sequence to be random, they should be close to ½ Runs Test: how many runs of 0 and 1 bits there are in the input sequence. The focus of this test is the total number of runs in the sequence, where a run is an uninterrupted sequence of identical bits Implementation: at byte level Original alg.: traversing each byte and counting the passes from 0 to 1 and from 1 to 0, the test uses XOR operation, which returns a 1 only in the case when the two input bits have different values. A byte is therefore shifted one position to the right and XOR-ed with itself 53

54 Parallel implementations - optimal approach Randomness tests 54

55 Parallel implementations optimal approach Remarks (general and particular) Matrix multiplication Matrix power Gaussian elimination Sorting Graph algorithms FFT Cryptographic algorithms Randomness tests I/O optimal approach 55

56 Parallel implementations optimal approach I/O Read/write buffer (optimal) size Questions: Access method to use (local/se) R/W buffer size to get best results Network delays involved 56

57 Parallel implementations optimal approach I/O cont. 57

58 Parallel implementations optimal approach I/O cont. 58

59 Parallel implementations optimal approach I/O cont. Access method to use (local/global) As expected, local read is more efficient 59

60 Parallel implementations optimal approach I/O cont. 60

61 Parallel implementations optimal approach I/O cont Access method to use (local/se) R/W buffer size to get best results approximately bytes Network delays involved For 1-4 CPU to sec. For 5-8 CPU to sec. 61

62 Conclusions Parallelization process Parallelization Parabolic shape: more pregnant when Parabolic communication is involved Shift of optimal number of processors Shift How parallelization is done matters How Time up to 2 orders of magnitude Time decrease 62

63 Review Parallel solutions to problems Efficiency Optimal parallel solutions GRID Parallel implementations 63

Parallel Processing: October, 5, 2010

Parallel Processing: October, 5, 2010 Parallel Processing: Why, When, How? SimLab2010, Belgrade October, 5, 2010 Rodica Potolea Parallel Processing Why, When, How? Why? Problems too costly to be solved with the classical approach The need