Recovering Numerical Reproducibility in Hydrodynamics Simulations


1 23rd IEEE Symposium on Computer Arithmetic, July, Santa Clara, USA. Recovering Numerical Reproducibility in Hydrodynamics Simulations. Philippe Langlois, Rafife Nheili, Christophe Denis. DALI, Université de Perpignan Via Domitia; LIRMM, UMR 5506, CNRS, Université de Montpellier; CMLA, ENS Cachan, France. 1 / 31

2-8 Recovering numerical reproducibility
Reproducibility: bitwise identical results for every p-parallel run, p ≥ 1. Reproducibility vs. accuracy.
Failures reported in numerical simulation for energy [10], weather science [2], molecular dynamics [9] and fluid dynamics [8].
How to debug? How to test? How to validate? How to obtain legal agreements?
Execution: sequential, 2 procs, 4 procs, ..., p procs. The original code is non-reproducible; the goal is a reproducible code. 2 / 31

9 Hydrodynamics simulation
One industrial-scale simulation code: simulation of free-surface flows in 1D-2D-3D hydrodynamics; ... lines of open-source Fortran; ... years of development, 4000 registered users; EDF R&D + international consortium.
Telemac2D [3]: 2D hydrodynamics, Saint-Venant equations. Finite element method, triangular element mesh, sub-domain decomposition for parallel resolution. Mesh node unknowns: water depth (H) and velocity (U, V). 3 / 31

10 Telemac2D: the simplest gouttedo simulation
The gouttedo test case: 2D simulation of a water drop falling in a square basin. Unknown: the water depth, for a 0.2 s time step. Triangular mesh: 8978 elements and 4624 nodes. Expected: numerical reproducibility (time step = 1, 2, ...). Sequential vs. parallel p = 2. 4 / 31

11-25 Numerical reproducibility? A white plot displays a non-reproducible value. Sequential vs. parallel p = 2, animated over time steps 1 to 15. At time step 15: NO numerical reproducibility! 5 / 31

26 Telemac2D: gouttedo NO numerical reproducibility! Sequential Parallel p = 2 6 / 31

27-29 Today's issues
Case study. Industrial-scale software: opentelemac-mascaret. Finite element simulation, domain decomposition, linear system solving. 2 modules: Tomawac, Telemac2D.
Feasibility. How to recover reproducibility? What are the sources of non-reproducibility? Do existing techniques apply, and how easily? Compensation yields reproducibility here!
Efficiency. How much to pay for reproducibility? An extra cost which decreases as the problem size increases: OK to debug, to validate and even to simulate! 7 / 31

30 Outline
1. Motivation
2. Reproducibility failure in a finite element simulation: sequential and parallel FE assembly; sources of non-reproducibility in Telemac2D
3. Recovering reproducibility: reproducible parallel FE assembly; reproducible algebraic operations; reproducible conjugate gradient; reproducible Telemac2D
4. Efficiency
5. Conclusion
8 / 31

31-33 Parallel reduction and compensation techniques
Non-associative floating-point addition: the computed value depends on the operation order. A parallel reduction of undefined order therefore generates reproducibility failures: summing a, b, c, d as (a ⊕ b) ⊕ (c ⊕ d) or as ((a ⊕ b) ⊕ c) ⊕ d may give different results.
Compensate the rounding errors with error-free transformations: with e_1, e_2, e_3 the rounding errors of the reduction-tree additions and f_1, f_2, f_3 those of the sequential sum,
((a ⊕ b) ⊕ (c ⊕ d)) + ((e_1 + e_2) + e_3) = (((a ⊕ b) ⊕ c) ⊕ d) + ((f_1 + f_2) + f_3).
This should be repeated for too ill-conditioned sums. 9 / 31
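The error-free transformation behind this compensation is Knuth's TwoSum. A minimal Python sketch (illustrative function names, not the deck's actual routines): two summation orders give two different floating-point results, while the compensated results coincide here because the error terms capture the lost bits exactly.

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns (s, e) with
    s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def comp_sum(values):
    """Compensated sum: accumulate the running sum with TwoSum and keep
    the rounding errors in a separate accumulator, added back at the end."""
    s, err = 0.0, 0.0
    for v in values:
        s, e = two_sum(s, v)
        err += e
    return s + err

xs = [1e16, 1.0, -1e16, 1.0]   # exact sum is 2.0
plain_fwd = sum(xs)            # left-to-right order: 1.0
plain_rev = sum(xs[::-1])      # reversed order: 0.0
comp_fwd = comp_sum(xs)        # 2.0
comp_rev = comp_sum(xs[::-1])  # 2.0: both orders agree
```

As the slide notes, one level of compensation is not always enough: for too ill-conditioned sums the transformation must be repeated.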

34-35 Finite element assembly: the sequential case
The assembly step V(i) = Σ_elements W_el(i): compute the inner node values V(i) by accumulating the local contribution W_el for every element el that contains node i. [mesh figure]
The assembly loop:
for p = 1, np          // p: triangular local number (np = 3)
  for el = 1, nel
    i = IKLE(el, p)    // loop index indirection; i: domain global number
    V(i) = V(i) + W(el, p)
10 / 31
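The same loop can be rendered in Python to make the indirection explicit (a toy sketch; `IKLE` mirrors the Fortran connectivity table and the data values are invented):

```python
def fe_assembly(W, IKLE, n_nodes):
    """V(i) = sum of W(el, p) over every element el that contains node i;
    IKLE maps (element, local vertex) to the global node number."""
    V = [0.0] * n_nodes
    for p in range(3):             # np = 3 local vertices per triangle
        for el in range(len(IKLE)):
            i = IKLE[el][p]        # loop index indirection
            V[i] += W[el][p]
    return V

# two triangles sharing nodes 1 and 2
IKLE = [(0, 1, 2), (1, 3, 2)]
W = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
V = fe_assembly(W, IKLE, 4)        # node 1 accumulates 2.0 + 4.0
```

The indirection through `IKLE` is what makes the accumulation order of each `V(i)` depend on how the elements are distributed, which matters once the mesh is split across sub-domains.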

36-37 Finite element assembly: the parallel case
Interface point assembly: communications and reductions. V(i) = Σ_{D_k} V_{D_k}(i) over the sub-domains D_k, k = 1..p. Inner nodes stay local; interface points are shared between sub-domains.
Exact arithmetic: V_{D_1}(i) = b, V_{D_2}(i) = c, and the interface point assembly yields V(i) = b + c = a, the sequential value.
Floating-point arithmetic: V_{D_1}(i) = b̂, V_{D_2}(i) = ĉ, and the interface point assembly yields V(i) = b̂ ⊕ ĉ, which may differ from the sequential value â. 11 / 31

38 Basic ingredients (outline recap) 11 / 31

39-40 Sources of non-reproducibility in Telemac2D
Culprits in theory: 1. the building step: interface point assembly; 2. the solving step (conjugate gradient): parallel matrix-vector and dot products.
Culprits in practice (= optimization): interface point assembly and linear system solving are merged. Element-by-element (EBE) storage of the FE matrix; the EBE parallel matrix-vector product uses no BLAS: everything is a vector, no matrix! Wave equation, mass-lumping and the associated algebraic transformations; system decoupling and many diagonal matrices: everything is a vector, no matrix! 12 / 31

41-42 Sources of non-reproducibility in Telemac2D: the Telemac2D FE steps
From the mesh (elements, nodes), discretisation and FE assembly + algebraic computation build the system AX = C, with blocks A_hh, A_hu, A_hv, A_uh, A_uu, A_vh, A_vv (the (u,v) cross blocks are 0) and right-hand sides B_h, B_u, B_v, for the unknowns H, U, V.
Wave equation: A_2, A_3 are diagonal matrices and
A_1 = A_hh − A_hu A_2⁻¹ A_uh − A_hv A_3⁻¹ A_vh,
C_1 = B_h − A_hu A_2⁻¹ B_u − A_hv A_3⁻¹ B_v.
Interface point assembly of A_2, A_3, C_1.
Solution H: conjugate gradient on A_1 H = C_1, with an interface point assembly of A_1 d in each iteration.
Solution U, V: C_2 = B_u − A_uh H, C_3 = B_v − A_vh H, interface point assembly of C_2, C_3, then diagonal resolutions with the diagonal matrices A_2, A_3.
Objective: correct the sources of non-reproducibility to compute a reproducible system and reproducible solutions. 13 / 31

43 Recovering reproducibility in a finite element resolution (outline recap) 13 / 31

44 Recovering reproducibility in Telemac2D
Sources: the FE assembly (diagonal of the matrices and second members); the resolution (EBE matrix-vector and dot products); the wave equation (algebraic transformations and diagonal resolutions).
Reproducible resolution, principles: replace every vector V by the pair [V, E_V], with V + E_V the compensated value. Compute E_V in the FE assembly of V; propagate E_V through each operation on V; compensate all nodes while assembling the interface points; compensate the MPI parallel dot products that include an MPI reduction. 14 / 31

45 Reproducible FE assembly
i: any node; W_el(i): the contribution of every element el that contains i.
Original FE assembly: V(i) = Σ_elements W_el(i), computed as V(i) = W_el1(i) ⊕ W_el2(i) ⊕ ... ⊕ W_elni(i), with rounding errors e_1, e_2, ..., e_{ni−1}.
Modified FE assembly: [V(i), E_V(i)] = ReprodAss(Σ_elements W_el(i)), where V(i) is the same floating-point sum and E_V(i) = e_1 + e_2 + ... + e_{ni−1} accumulates its rounding errors. 15 / 31

46 Reproducible interface point assembly
i: one interface point shared by D_1, D_2, ..., D_{k−1}, D_k.
Original IP assembly: V(i) = Σ_{D_k} V_{D_k}(i). Communicate the V_{D_i} to D_1, D_2, ... and compute V(i) = V_{D_1}(i) ⊕ V_{D_2}(i) ⊕ ... ⊕ V_{D_{k−1}}(i) ⊕ V_{D_k}(i), with rounding errors δ_1, δ_2, ..., δ_{k−1}.
Modified IP assembly: [V(i), E_V(i)] = ReprodAss(Σ_{D_k} [V_{D_k}(i), E_{V_{D_k}}(i)]). Communicate the pairs [V_{D_i}, E_{V_{D_i}}] to D_1, D_2, ..., compute the same floating-point sum for V(i), and accumulate E_V(i) = (E_{V_{D_1}}(i) + E_{V_{D_2}}(i) + δ_1) + ... + (E_{V_{D_{k−1}}}(i) + E_{V_{D_k}}(i) + δ_{k−1}).
Reproducibility: at the IP assembly step, [V, E_V] → V + E_V compensates every node value. 16 / 31
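A Python sketch of both assemblies (hypothetical helper names, not Telemac's actual routines): each sub-domain accumulates its local pair (V_Dk, E_VDk) with TwoSum, and the interface-point assembly merges the pairs, folding the new rounding errors δ_k into E_V. Two different decompositions of the same contributions then yield the same compensated value.

```python
def two_sum(a, b):
    """Error-free transformation: a + b = s + e exactly."""
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def reprod_assembly(contribs):
    """Local FE assembly for one node: returns the pair (V, E_V)."""
    V, E = 0.0, 0.0
    for w in contribs:
        V, e = two_sum(V, w)
        E += e
    return V, E

def ip_assembly(parts):
    """Interface-point assembly: merge the per-sub-domain pairs
    (V_Dk, E_VDk); the deltas are the rounding errors of the merge."""
    V, E = parts[0]
    for v_k, e_k in parts[1:]:
        V, delta = two_sum(V, v_k)
        E += e_k + delta
    return V + E          # compensation at the interface point

contribs = [1e16, 1.0, -1e16, 1.0]   # exact sum is 2.0
# the same contributions split over two sub-domains, two different ways
split_a = [reprod_assembly([1e16, 1.0]), reprod_assembly([-1e16, 1.0])]
split_b = [reprod_assembly([1e16, 1.0, -1e16]), reprod_assembly([1.0])]
```

Both `ip_assembly(split_a)` and `ip_assembly(split_b)` return 2.0, matching the compensated sequential assembly: the decomposition no longer changes the result.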

47 Reproducible algebraic operations (outline recap) 16 / 31

48 Algebraic operations: the vector case
Reproducible algebraic vector operations in opentelemac's included library BIEF: entry-wise vector operations (copy, opposite/inverse, add/sub, Hadamard product, ...), which also apply to diagonal matrices. Propagate the rounding errors so they can be compensated while assembling the interface points.
Example: the Hadamard product. Original version: X, Y → V = X.Y with V(i) = X(i) ⊗ Y(i). Modified version: [X, E_X], [Y, E_Y] → [V, E_V] with (V, e_V) = 2Prod(X, Y) and E_V = X.E_Y + Y.E_X + e_V. 17 / 31

49 What is reproducible now?
Most of the linear system, except the conjugate gradient: the FE assembly, the algebraic vector operations and the interface point assembly are reproducible, hence so are the matrix of the H system and its dependencies, the second members of the U and V systems. Next step: the conjugate gradient.
Partially reproducible Telemac2D: the wave-equation step (A_2, A_3 diagonal matrices; A_1, C_1 and their interface point assembly) and the diagonal resolutions C_2 = B_u − A_uh H, C_3 = B_v − A_vh H with the interface point assembly of C_2, C_3; the conjugate gradient for H, with its interface point assembly of A_1 d in each iteration, remains. 18 / 31

50 Recovering reproducibility in a finite element resolution (outline recap) 18 / 31

51-52 Towards a reproducible conjugate gradient
Solve A X = B with A = [A_1, E_{A_1}], B = C_1, X = H.
Initialization, with a given d_0: r_0 = A X_0 − B; ρ_0 = (r_0, d_0) / (A d_0, d_0); X_1 = X_0 − ρ_0 d_0.
Iterations until the stopping criteria:
r_m = r_{m−1} − ρ_{m−1} A d_{m−1};
d_m = r_m + ((r_m, r_m) / (r_{m−1}, r_{m−1})) d_{m−1};
ρ_m = (r_m, d_m) / (d_m, A d_m);
X_{m+1} = X_m − ρ_m d_m.
Non-reproducibility sources: the EBE matrix-vector product; the dot-product MPI reduction; the weighted dot products for interface points shared by p sub-domains. 19 / 31
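The dot products in these iterations are where the MPI reduction order leaks into the result. The compensated dot product named later in the deck is Ogita, Rump and Oishi's dot2 [6], which is as accurate as a classic dot product evaluated in twice the working precision. A self-contained Python sketch:

```python
def two_sum(a, b):
    """a + b = s + e exactly (Knuth)."""
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def two_prod(a, b):
    """a * b = p + e exactly (Dekker/Veltkamp splitting, no FMA)."""
    p = a * b
    c = 134217729.0 * a               # 2**27 + 1
    ah = c - (c - a); al = a - ah
    c = 134217729.0 * b
    bh = c - (c - b); bl = b - bh
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e

def dot2(x, y):
    """Compensated dot product (Ogita-Rump-Oishi dot2): the products and
    running sums are transformed error-free; errors go to accumulator c."""
    s, c = 0.0, 0.0
    for a, b in zip(x, y):
        p, ep = two_prod(a, b)
        s, es = two_sum(s, p)
        c += ep + es
    return s + c

x, y = [1e16, 2.0, -1e16], [1.0, 0.5, 1.0]   # exact dot product: 1.0
naive = sum(a * b for a, b in zip(x, y))     # cancellation loses it: 0.0
accurate = dot2(x, y)                        # 1.0
```

In the parallel setting, each process would presumably run dot2 on its local slice and the reduction would combine the partial (sum, error) pairs; the slide's "parallel compensated dot2" points at this pattern.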

53 The EBE storage
M = D + Σ_{el=1}^{nel} X_el, for nodes i ∈ [1, np] and elements el ∈ [1, nel], with j, k, l the vertices of element el. M is decomposed into:
1. the assembled diagonal D[np]: D = [D(1), ..., D(np)];
2. the elementary extra-diagonal terms X_el[6]:
X_el = [ X_jk(el), X_jl(el); X_kj(el), X_kl(el); X_lj(el), X_lk(el) ] = [X_el(1), ..., X_el(6)]. 20 / 31

54 The EBE matrix-vector product
Steps of the EBE matrix-vector product R = M V = D V + Σ_{el=1}^{nel} X_el V_el:
1. R_1(i) = D(i) × V(i), i ∈ [1, np];
2. X_el.V_el = [X_el(1) V(k), X_el(2) V(l), ..., X_el(6) V(k)], el ∈ [1, nel];
3. FE assembly of R_2[np]: R_2 = Σ_{el=1}^{nel} X_el V_el;
4. R = R_1 + R_2;
5. IP assembly: R(i) = Σ_{D_k} R(i) for every interface point i. 21 / 31
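Steps 1-4 can be sketched in Python as follows (a toy rendering; `IKLE` holds the vertices (j, k, l) of each element, and the values are invented so the result can be checked against a dense product):

```python
def ebe_matvec(D, X, IKLE, V):
    """EBE matrix-vector product, steps 1-4: diagonal part R1, then the
    FE assembly of the six extra-diagonal elementary contributions R2."""
    R = [D[i] * V[i] for i in range(len(D))]        # step 1: R1 = D.V
    for (j, k, l), x in zip(IKLE, X):
        # steps 2-3: x = [Xjk, Xjl, Xkj, Xkl, Xlj, Xlk] for this element
        R[j] += x[0] * V[k] + x[1] * V[l]
        R[k] += x[2] * V[j] + x[3] * V[l]
        R[l] += x[4] * V[j] + x[5] * V[k]
    return R                                        # step 4: R = R1 + R2

# one triangle (0, 1, 2); numbers chosen to be exact in binary64
D = [1.0, 2.0, 3.0]
X = [[0.5, 0.25, 0.125, 1.5, 2.5, 0.75]]
IKLE = [(0, 1, 2)]
V = [1.0, 2.0, 4.0]
R = ebe_matvec(D, X, IKLE, V)
```

Step 5, the interface-point assembly, then sums the per-sub-domain R(i) for every interface point i, which is where the order-dependence enters in parallel.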

55 Reproducible EBE matrix-vector product
Original EBE matrix-vector product: R = D V + Σ_{el=1}^{nel} X_el V_el; R(i) = Σ_{D_k} R(i).
Reproducible EBE matrix-vector product: [R, E_R] = [D, E_D] V + ReprodAss(Σ_{el=1}^{nel} X_el V_el); R(i) = ReprodAss_{D_k}([R(i), E_R(i)]). Compensation: R + E_R. 22 / 31

56-57 A reproducible conjugate gradient
Same algorithm as above, with A = [A_1, E_{A_1}], B = C_1, X = H.
Non-reproducibility: sources and solutions. The EBE matrix-vector product → the reproducible EBE matrix-vector product. The dot-product MPI reduction → a parallel compensated dot2. The interface-point weights (1/k, 1/k, ..., 1/k) → (1, 0, ..., 0). Reproducible operations → reproducible results: the same errors for both sequential and parallel executions. 23 / 31
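The weight change (1/k, ..., 1/k) → (1, 0, ..., 0) means that an interface point shared by k sub-domains is counted once, by a single owning sub-domain, instead of fractionally by all of them (1/k is itself inexact for k = 3, for instance). A toy Python sketch of the resulting partial dot products (hypothetical helper, not Telemac's code):

```python
def subdomain_dot(x, y, owned):
    """Partial dot product on one sub-domain: a node contributes only if
    this sub-domain owns it (weight 1 here, weight 0 on the other
    sub-domains that merely share the interface point)."""
    return sum(a * b for a, b, o in zip(x, y, owned) if o)

# global vectors on nodes 0..2; node 1 is the interface point
x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]             # global dot = 32.0
d1 = subdomain_dot(x[0:2], y[0:2], [True, True])    # owns node 1: 14.0
d2 = subdomain_dot(x[1:3], y[1:3], [False, True])   # node 1 not owned: 18.0
total = d1 + d2                                     # the reduction: 32.0
```

Each interface term now appears exactly once in the global sum; combined with the compensated reduction, the partial results no longer depend on how the mesh was cut.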

58 Recovering reproducibility in a finite element resolution (outline recap) 23 / 31

59 Reproducible Telemac2D
Execution: sequential, 2 procs, 4 procs, ..., p procs: the previously non-reproducible original code is replaced by the reproducible code, with all the FE steps (wave equation, interface point assemblies, conjugate gradient, diagonal resolutions) compensated. [Plot: maximum relative error vs. the sequential computation over the time steps, gouttedo test case, for the compensated runs COMP P=2, COMP P=4, COMP P=...] 24 / 31

60-75 Reproducible gouttedo! Time steps 1 to 15, p = 1, 2, 4, 8: the sequential and parallel results now match at every time step. 25-26 / 31

76 Efficiency (outline recap) 26 / 31

77 Runtime extra-cost for reproducible simulations
Measures, test cases and mesh sizes: hardware cycle counter (rdtsc); gouttedo mesh sizes 4624, 18225, ... nodes (1, 4, 16); numbers of interface points per #nodes and #proc [table elided].
Hardware and software environment: opentelemac v7.2; Intel Xeon E..., ... GHz socket (L3 cache = 20 MB), 2 sockets of 8 cores each; GNU Fortran 4.6.3, -O3; OpenMPI; generic Linux kernel. 27 / 31

78 The core runtime extra-cost for reproducible gouttedo
gouttedo core: no input/output steps, just building + solving. [Plot: #cycles vs. #processors, Telemac v7, original vs. reproducible versions for #nodes = 4624, 18225 and ...; the measured extra-cost factors range from ×1.16 up to about ×2.4, decreasing as the mesh size grows.] 28 / 31

79 Time to conclude (outline recap) 28 / 31

80-81 Conclusion: recovering numerical reproducibility
Industrial-scale software: opentelemac-mascaret. Finite element simulation, domain decomposition, linear system solving. 2 reproducible modules: Tomawac, Telemac2D. Integration in the next opentelemac version: current work.
Feasibility. How to recover reproducibility? What are the sources of non-reproducibility? Do existing techniques apply, and how easily? A hand-made analysis of the computing workflow; compensation yields reproducibility here and fits well with opentelemac's vector library. Other existing techniques also apply, more or less easily [4]. 29 / 31

82-83 Conclusion: efficiency
How much to pay for reproducibility? An extra cost which decreases as the problem size increases: OK to debug, to validate and even to simulate!
Reproducibility at a larger scale: the whole opentelemac software suite. Does it still work for complex, large, real-life simulations? The two FE test cases are significant enough to validate the methodology. Localizing the failure sources is difficult to automate, but applying the methodology is easy for the software developers. 30 / 31

84 Acknowledgment
Jean-Michel Hervouet, LNHE, EDF R&D, Chatou. Chemseddine Chohra, DALI/LIRMM, UPVD, Perpignan. 30 / 31

85 References I
[1] J. W. Demmel and H. D. Nguyen. Fast reproducible floating-point summation. In Proc. 21st IEEE Symposium on Computer Arithmetic, Austin, Texas, USA.
[2] Y. He and C. Ding. Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications. J. Supercomput., 18.
[3] J.-M. Hervouet. Hydrodynamics of free surface flows: Modelling with the finite element method. John Wiley & Sons.
[4] P. Langlois, R. Nheili, and C. Denis. Numerical reproducibility: feasibility issues. In NTMS 2015: 7th IFIP International Conference on New Technologies, Mobility and Security, pages 1-5, Paris, France, July. IEEE, IEEE COMSOC & IFIP TC6.5 WG.
[5] P. Langlois, R. Nheili, and C. Denis. Recovering numerical reproducibility in hydrodynamic simulations. In J. Hormigo, S. Oberman, and N. Revol, editors, 23rd IEEE International Symposium on Computer Arithmetic. IEEE Computer Society, July (Silicon Valley, USA). 30 / 31

86 References II
[6] T. Ogita, S. M. Rump, and S. Oishi. Accurate sum and dot product. SIAM J. Sci. Comput., 26(6).
[7] Open TELEMAC-MASCARET. v.7.0, release notes.
[8] R. W. Robey, J. M. Robey, and R. Aulwes. In search of numerical consistency in parallel programming. Parallel Comput., 37(4-5).
[9] M. Taufer, O. Padron, P. Saponaro, and S. Patel. Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs. In IPDPS, pages 1-9. IEEE.
[10] O. Villa, D. G. Chavarría-Miranda, V. Gurumoorthi, A. Márquez, and S. Krishnamoorthy. Effects of floating-point non-associativity on numerical computations on massively multithreaded systems. In CUG 2009 Proceedings, pages 1-11. 31 / 31


Optimizing TELEMAC-2D for Large-scale Flood Simulations Available on-line at www.prace-ri.eu Partnership for Advanced Computing in Europe Optimizing TELEMAC-2D for Large-scale Flood Simulations Charles Moulinec a,, Yoann Audouin a, Andrew Sunderland a a STFC

More information

Effects of Floating-Point non-associativity on Numerical Computations on Massively Multithreaded Systems

Effects of Floating-Point non-associativity on Numerical Computations on Massively Multithreaded Systems Effects of Floating-Point non-associativity on Numerical Computations on Massively Multithreaded Systems Oreste Villa 1, Daniel Chavarría-Miranda 1, Vidhya Gurumoorthi 2, Andrés Márquez 1, and Sriram Krishnamoorthy

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

Estimation d arrondis, analyse de stabilité des grands codes de calcul numérique

Estimation d arrondis, analyse de stabilité des grands codes de calcul numérique Estimation d arrondis, analyse de stabilité des grands codes de calcul numérique Jean-Marie Chesneaux, Fabienne Jézéquel, Jean-Luc Lamotte, Jean Vignes Laboratoire d Informatique de Paris 6, P. and M.

More information

Modelling and implementation of algorithms in applied mathematics using MPI

Modelling and implementation of algorithms in applied mathematics using MPI Modelling and implementation of algorithms in applied mathematics using MPI Lecture 1: Basics of Parallel Computing G. Rapin Brazil March 2011 Outline 1 Structure of Lecture 2 Introduction 3 Parallel Performance

More information

Lecture 27: Fast Laplacian Solvers

Lecture 27: Fast Laplacian Solvers Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall

More information

3D Finite Difference Time-Domain Modeling of Acoustic Wave Propagation based on Domain Decomposition

3D Finite Difference Time-Domain Modeling of Acoustic Wave Propagation based on Domain Decomposition 3D Finite Difference Time-Domain Modeling of Acoustic Wave Propagation based on Domain Decomposition UMR Géosciences Azur CNRS-IRD-UNSA-OCA Villefranche-sur-mer Supervised by: Dr. Stéphane Operto Jade

More information

VeriTracer: Context-enriched tracer for floating-point arithmetic analysis

VeriTracer: Context-enriched tracer for floating-point arithmetic analysis VeriTracer: Context-enriched tracer for floating-point arithmetic analysis ARITH 25, Amherst MA USA, June 2018 Yohan Chatelain 1,5 Pablo de Oliveira Castro 1,5 Eric Petit 2,5 David Defour 3 Jordan Bieder

More information

Augmented Arithmetic Operations Proposed for IEEE

Augmented Arithmetic Operations Proposed for IEEE Augmented Arithmetic Operations Proposed for IEEE 754-2018 E. Jason Riedy Georgia Institute of Technology 25th IEEE Symposium on Computer Arithmetic, 26 June 2018 Outline History and Definitions Decisions

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

Thermomechanical and hydraulic industrial simulations using MUMPS at EDF

Thermomechanical and hydraulic industrial simulations using MUMPS at EDF K u f Thermomechanical and hydraulic industrial simulations using MUMPS at EDF MUMPS User group meeting 15 april 2010 O.Boiteau, C.Denis, F.Zaoui MUltifrontal Massively Parallel sparse direct Solver 1

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course

More information

2.7 Numerical Linear Algebra Software

2.7 Numerical Linear Algebra Software 2.7 Numerical Linear Algebra Software In this section we will discuss three software packages for linear algebra operations: (i) (ii) (iii) Matlab, Basic Linear Algebra Subroutines (BLAS) and LAPACK. There

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

ReproBLAS: Reproducible BLAS

ReproBLAS: Reproducible BLAS ReproBLAS: Reproducible BLAS http://bebop.cs.berkeley.edu/reproblas/ James Demmel, Nguyen Hong Diep SC 13 - Denver, CO Nov 22, 2013 1 / 15 Reproducibility Reproducibility: obtaining bit-wise identical

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer

More information

Modelling, migration, and inversion using Linear Algebra

Modelling, migration, and inversion using Linear Algebra Modelling, mid., and inv. using Linear Algebra Modelling, migration, and inversion using Linear Algebra John C. Bancroft and Rod Blais ABSRAC Modelling migration and inversion can all be accomplished using

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

High-Performance Scientific Computing

High-Performance Scientific Computing High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

Approaches to Parallel Implementation of the BDDC Method

Approaches to Parallel Implementation of the BDDC Method Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague

More information

Efficient implementation of interval matrix multiplication

Efficient implementation of interval matrix multiplication Efficient implementation of interval matrix multiplication Hong Diep Nguyen To cite this version: Hong Diep Nguyen. Efficient implementation of interval matrix multiplication. Para 2010: State of the Art

More information

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R.

Q. Wang National Key Laboratory of Antenna and Microwave Technology Xidian University No. 2 South Taiba Road, Xi an, Shaanxi , P. R. Progress In Electromagnetics Research Letters, Vol. 9, 29 38, 2009 AN IMPROVED ALGORITHM FOR MATRIX BANDWIDTH AND PROFILE REDUCTION IN FINITE ELEMENT ANALYSIS Q. Wang National Key Laboratory of Antenna

More information

Lecture 15: More Iterative Ideas

Lecture 15: More Iterative Ideas Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!

More information

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor Intel K. K. E-mail: hirokazu.kobayashi@intel.com Yoshifumi Nakamura RIKEN AICS E-mail: nakamura@riken.jp Shinji Takeda

More information

Matching. Compare region of image to region of image. Today, simplest kind of matching. Intensities similar.

Matching. Compare region of image to region of image. Today, simplest kind of matching. Intensities similar. Matching Compare region of image to region of image. We talked about this for stereo. Important for motion. Epipolar constraint unknown. But motion small. Recognition Find object in image. Recognize object.

More information

ExBLAS: Reproducible and Accurate BLAS Library

ExBLAS: Reproducible and Accurate BLAS Library ExBLAS: Reproducible and Accurate BLAS Library Roman Iakymchuk, Sylvain Collange, David Defour, Stef Graillat To cite this version: Roman Iakymchuk, Sylvain Collange, David Defour, Stef Graillat. ExBLAS:

More information

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC

Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma

More information

MAT 275 Laboratory 2 Matrix Computations and Programming in MATLAB

MAT 275 Laboratory 2 Matrix Computations and Programming in MATLAB MATLAB sessions: Laboratory MAT 75 Laboratory Matrix Computations and Programming in MATLAB In this laboratory session we will learn how to. Create and manipulate matrices and vectors.. Write simple programs

More information

An Improved Measurement Placement Algorithm for Network Observability

An Improved Measurement Placement Algorithm for Network Observability IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs

Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs Mark van Heeswijk 1 and Yoan Miche 2 and Erkki Oja 1 and Amaury Lendasse 1 1 Helsinki University of Technology - Dept. of Information

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

Domain Decomposition: Computational Fluid Dynamics

Domain Decomposition: Computational Fluid Dynamics Domain Decomposition: Computational Fluid Dynamics July 11, 2016 1 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3 UMEÅ UNIVERSITET Institutionen för datavetenskap Lars Karlsson, Bo Kågström och Mikael Rännar Design and Analysis of Algorithms for Parallel Computer Systems VT2009 June 2, 2009 Exam Design and Analysis

More information

Chapter 14: Matrix Iterative Methods

Chapter 14: Matrix Iterative Methods Chapter 14: Matrix Iterative Methods 14.1INTRODUCTION AND OBJECTIVES This chapter discusses how to solve linear systems of equations using iterative methods and it may be skipped on a first reading of

More information

Understanding the performances of sparse compression formats using data parallel programming model

Understanding the performances of sparse compression formats using data parallel programming model 2017 International Conference on High Performance Computing & Simulation Understanding the performances of sparse compression formats using data parallel programming model Ichrak MEHREZ, Olfa HAMDI-LARBI,

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Scanning Real World Objects without Worries 3D Reconstruction

Scanning Real World Objects without Worries 3D Reconstruction Scanning Real World Objects without Worries 3D Reconstruction 1. Overview Feng Li 308262 Kuan Tian 308263 This document is written for the 3D reconstruction part in the course Scanning real world objects

More information

SPH: Why and what for?

SPH: Why and what for? SPH: Why and what for? 4 th SPHERIC training day David Le Touzé, Fluid Mechanics Laboratory, Ecole Centrale de Nantes / CNRS SPH What for and why? How it works? Why not for everything? Duality of SPH SPH

More information

Fast Primitives for Irregular Computations on the NEC SX-4

Fast Primitives for Irregular Computations on the NEC SX-4 To appear: Crosscuts 6 (4) Dec 1997 (http://www.cscs.ch/official/pubcrosscuts6-4.pdf) Fast Primitives for Irregular Computations on the NEC SX-4 J.F. Prins, University of North Carolina at Chapel Hill,

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Domain Decomposition: Computational Fluid Dynamics

Domain Decomposition: Computational Fluid Dynamics Domain Decomposition: Computational Fluid Dynamics December 0, 0 Introduction and Aims This exercise takes an example from one of the most common applications of HPC resources: Fluid Dynamics. We will

More information

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC 2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy

More information

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar

Acknowledgements. Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn. SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar Philipp Hahn Acknowledgements Prof. Dan Negrut Prof. Darryl Thelen Prof. Michael Zinn SBEL Colleagues: Hammad Mazar, Toby Heyn, Manoj Kumar 2 Outline Motivation Lumped Mass Model Model properties Simulation

More information

Corrected/Updated References

Corrected/Updated References K. Kashiyama, H. Ito, M. Behr and T. Tezduyar, "Massively Parallel Finite Element Strategies for Large-Scale Computation of Shallow Water Flows and Contaminant Transport", Extended Abstracts of the Second

More information

Exercise Set Decide whether each matrix below is an elementary matrix. (a) (b) (c) (d) Answer:

Exercise Set Decide whether each matrix below is an elementary matrix. (a) (b) (c) (d) Answer: Understand the relationships between statements that are equivalent to the invertibility of a square matrix (Theorem 1.5.3). Use the inversion algorithm to find the inverse of an invertible matrix. Express

More information

x = 12 x = 12 1x = 16

x = 12 x = 12 1x = 16 2.2 - The Inverse of a Matrix We've seen how to add matrices, multiply them by scalars, subtract them, and multiply one matrix by another. The question naturally arises: Can we divide one matrix by another?

More information

Estimation de la reproductibilité numérique dans les environnements hybrides CPU-GPU

Estimation de la reproductibilité numérique dans les environnements hybrides CPU-GPU Estimation de la reproductibilité numérique dans les environnements hybrides CPU-GPU Fabienne Jézéquel 1, Jean-Luc Lamotte 1 & Issam Said 2 1 LIP6, Université Pierre et Marie Curie 2 Total & LIP6, Université

More information

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today

More information

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster YALES2: Semi-industrial code for turbulent combustion and flows Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application

More information

computational Fluid Dynamics - Prof. V. Esfahanian

computational Fluid Dynamics - Prof. V. Esfahanian Three boards categories: Experimental Theoretical Computational Crucial to know all three: Each has their advantages and disadvantages. Require validation and verification. School of Mechanical Engineering

More information

Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models. C. Aberle, A. Hakim, and U. Shumlak

Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models. C. Aberle, A. Hakim, and U. Shumlak Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models C. Aberle, A. Hakim, and U. Shumlak Aerospace and Astronautics University of Washington, Seattle American Physical Society

More information

A parallel computing framework and a modular collaborative cfd workbench in Java

A parallel computing framework and a modular collaborative cfd workbench in Java Advances in Fluid Mechanics VI 21 A parallel computing framework and a modular collaborative cfd workbench in Java S. Sengupta & K. P. Sinhamahapatra Department of Aerospace Engineering, IIT Kharagpur,

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000

Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Parallel Implementation of QRD Algorithms on the Fujitsu AP1000 Zhou, B. B. and Brent, R. P. Computer Sciences Laboratory Australian National University Canberra, ACT 0200 Abstract This paper addresses

More information

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- How to Build a gtsv for Performance

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

NOISE PROPAGATION FROM VIBRATING STRUCTURES

NOISE PROPAGATION FROM VIBRATING STRUCTURES NOISE PROPAGATION FROM VIBRATING STRUCTURES Abstract R. Helfrich, M. Spriegel (INTES GmbH, Germany) Noise and noise exposure are becoming more important in product development due to environmental legislation.

More information

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura

Multicore Challenge in Vector Pascal. P Cockshott, Y Gdura Multicore Challenge in Vector Pascal P Cockshott, Y Gdura N-body Problem Part 1 (Performance on Intel Nehalem ) Introduction Data Structures (1D and 2D layouts) Performance of single thread code Performance

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011 Dan Negrut, 2011 ME964

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

Parallel edge-based implementation of the finite element method for shallow water equations

Parallel edge-based implementation of the finite element method for shallow water equations Parallel edge-based implementation of the finite element method for shallow water equations I. Slobodcicov, F.L.B. Ribeiro, & A.L.G.A. Coutinho Programa de Engenharia Civil, COPPE / Universidade Federal

More information

SIPE: Small Integer Plus Exponent

SIPE: Small Integer Plus Exponent SIPE: Small Integer Plus Exponent Vincent LEFÈVRE AriC, INRIA Grenoble Rhône-Alpes / LIP, ENS-Lyon Arith 21, Austin, Texas, USA, 2013-04-09 Introduction: Why SIPE? All started with floating-point algorithms

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

Solving Sparse Linear Systems. Forward and backward substitution for solving lower or upper triangular systems

Solving Sparse Linear Systems. Forward and backward substitution for solving lower or upper triangular systems AMSC 6 /CMSC 76 Advanced Linear Numerical Analysis Fall 7 Direct Solution of Sparse Linear Systems and Eigenproblems Dianne P. O Leary c 7 Solving Sparse Linear Systems Assumed background: Gauss elimination

More information

A Parallel Implementation of the BDDC Method for Linear Elasticity

A Parallel Implementation of the BDDC Method for Linear Elasticity A Parallel Implementation of the BDDC Method for Linear Elasticity Jakub Šístek joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík Institute of Mathematics of the AS CR, Prague

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information