ScaFaCoS and P³M: Recent Developments. Olaf Lenz. June 3, 2013


1 ScaFaCoS and P³M: Recent Developments. Olaf Lenz, June 3, 2013

2 Outline
- ScaFaCoS
- ScaFaCoS Methods
- Performance Comparison
- Recent P³M Developments

3 ScaFaCoS: Scalable Fast Coulomb Solver
- Highly scalable, MPI-parallelized library of different Coulomb solvers
- Common interface for all methods
- Developed by groups from Jülich, Wuppertal, Chemnitz, Bonn... and Stuttgart
- BMBF project, officially ended in 2011
- Source code has been on GitHub for two months (yay!)
- First publication "will be submitted soon" (for six months now)

4 Interface

    #include <fcs.h>

    FCS handle = NULL;

    /* Initialize P3M */
    fcs_init(&handle, "p3m", MPI_COMM_WORLD);

    /* Set common parameters */
    fcs_set_common(handle, near_field, box_a, box_b, box_c,
                   offset, periodicity, total_particles);

    /* Set method-specific parameters */
    fcs_p3m_set_r_cut(handle, r_cut);

    /* Tune the method (optional) */
    fcs_tune(handle, N, max_particles, positions, charges);

    /* Run the method */
    fcs_run(handle, N, max_particles, positions, charges, fields, potentials);

    /* Finally, destroy the handle */
    fcs_destroy(handle);
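The slide shows only the call sequence. A minimal, self-contained driver built around exactly these calls might look as follows; the concrete values (box vectors, cutoff, particle data) and the fcs_int/fcs_float type names are assumptions for illustration, not taken from the slide.

    #include <stdio.h>
    #include <mpi.h>
    #include <fcs.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const fcs_int N = 2;                          /* local number of particles */
        fcs_float box_a[3] = {10, 0, 0};              /* box vectors (assumed)     */
        fcs_float box_b[3] = {0, 10, 0};
        fcs_float box_c[3] = {0, 0, 10};
        fcs_float offset[3] = {0, 0, 0};
        fcs_int periodicity[3] = {1, 1, 1};           /* fully periodic            */
        fcs_float positions[6] = {1, 1, 1, 2, 2, 2};  /* x,y,z per particle        */
        fcs_float charges[2] = {1.0, -1.0};
        fcs_float fields[6], potentials[2];

        FCS handle = NULL;
        fcs_init(&handle, "p3m", MPI_COMM_WORLD);
        fcs_set_common(handle, 1, box_a, box_b, box_c, offset, periodicity, N);
        fcs_p3m_set_r_cut(handle, 3.0);               /* cutoff value assumed      */
        fcs_tune(handle, N, N, positions, charges);
        fcs_run(handle, N, N, positions, charges, fields, potentials);
        printf("potential at particle 0: %f\n", potentials[0]);
        fcs_destroy(handle);

        MPI_Finalize();
        return 0;
    }

Compile with an MPI compiler wrapper and link against the installed ScaFaCoS library (e.g. mpicc ... -lfcs, assuming the usual library name).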

5 Methods
ScaFaCoS currently provides 11 methods: DIRECT, EWALD, P3M, P2NFFT, VMG, PP3MG, PEPC, FMM, MEMD, MMM1D, MMM2D.
- In the following comparison, only P3M, P2NFFT, VMG, PP3MG, FMM, and MEMD are considered.
- Distinguish splitting methods, hierarchical methods, and local methods (i.e. MEMD).
- The other methods are for reference purposes only (DIRECT, EWALD), are for different periodicities only (MMM1D/MMM2D; here, only fully periodic systems are considered), or performed too badly (PEPC).

6 Splitting Methods
Problems of the electrostatic potential:
- slow decay: bad for direct summation
- singularity: bad for convergence-accelerating methods
Idea of splitting methods: split the potential into a fast-decaying near field and a non-singular far field (see the example below).
- The near field can be computed directly (O(N)).
- For the far field, other methods can be used.
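The pictorial splitting equation on the slide did not survive transcription. A standard concrete instance, and the one underlying the Ewald-type methods on the next slide, splits the Coulomb potential with the error function:

    \frac{1}{r} \;=\; \underbrace{\frac{\operatorname{erfc}(\alpha r)}{r}}_{\text{near field: decays fast}}
    \;+\; \underbrace{\frac{\operatorname{erf}(\alpha r)}{r}}_{\text{far field: smooth at } r = 0}

The splitting parameter α controls where the near field hands over to the far field and thus balances the cost of the two parts.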

7 Splitting Methods: Ewald and Particle-Mesh Ewald
Ewald's idea: compute the far field in Fourier space.
- Ewald summation: O(N^{3/2})
- Particle-mesh Ewald: O(N log N)
  - discretize the far-field charge distribution onto a mesh
  - use an FFT to Fourier-transform it
  - solve the Poisson equation in Fourier space
  - back-FFT to obtain the potential on the mesh
  - compute potentials or fields by interpolating the mesh potential
In ScaFaCoS: P3M (ICP), P2NFFT (Chemnitz; uses a non-equidistant FFT).
[Portrait: P. P. Ewald (1888–1985)]
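The mesh steps above map almost one-to-one onto code. The following sketch (using FFTW) implements steps 1–4 for a cubic box under deliberately crude assumptions: nearest-grid-point charge assignment instead of the higher-order assignment P3M uses, the plain continuum Green's function instead of P3M's optimized influence function, and a fixed splitting parameter alpha. The final interpolation back to the particles is omitted.

    /* Minimal sketch of the particle-mesh Ewald far-field step: cubic box of
     * edge L, N x N x N mesh, nq charges. Illustrative only. */
    #include <complex.h>   /* before fftw3.h, so fftw_complex is C99 complex */
    #include <fftw3.h>
    #include <math.h>

    void pm_far_field(int N, double L, double alpha, int nq,
                      const double *pos /* 3*nq */, const double *q,
                      double *phi /* N*N*N mesh potential, output */)
    {
        const int M = N * N * N;
        fftw_complex *mesh = fftw_alloc_complex(M);
        fftw_plan fwd = fftw_plan_dft_3d(N, N, N, mesh, mesh, FFTW_FORWARD,  FFTW_ESTIMATE);
        fftw_plan bwd = fftw_plan_dft_3d(N, N, N, mesh, mesh, FFTW_BACKWARD, FFTW_ESTIMATE);

        /* 1. assign charges to the mesh (nearest grid point) */
        for (int i = 0; i < M; i++) mesh[i] = 0.0;
        double h3 = (L / N) * (L / N) * (L / N);        /* mesh cell volume */
        for (int p = 0; p < nq; p++) {
            int ix = ((int)floor(pos[3*p+0] / L * N) % N + N) % N;
            int iy = ((int)floor(pos[3*p+1] / L * N) % N + N) % N;
            int iz = ((int)floor(pos[3*p+2] / L * N) % N + N) % N;
            mesh[(ix*N + iy)*N + iz] += q[p] / h3;      /* charge density   */
        }

        /* 2. Fourier-transform the mesh charge density */
        fftw_execute(fwd);

        /* 3. solve the Poisson equation in Fourier space:
         *    phi(k) = 4*pi/k^2 * exp(-k^2/(4 alpha^2)) * rho(k) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    int idx = (i*N + j)*N + k;
                    if (idx == 0) { mesh[0] = 0.0; continue; }  /* drop k = 0 (neutral system) */
                    double kx = 2.0*M_PI/L * (i <= N/2 ? i : i - N);
                    double ky = 2.0*M_PI/L * (j <= N/2 ? j : j - N);
                    double kz = 2.0*M_PI/L * (k <= N/2 ? k : k - N);
                    double k2 = kx*kx + ky*ky + kz*kz;
                    mesh[idx] *= 4.0*M_PI/k2 * exp(-k2 / (4.0*alpha*alpha));
                }

        /* 4. back-FFT to obtain the far-field potential on the mesh */
        fftw_execute(bwd);
        for (int i = 0; i < M; i++)
            phi[i] = creal(mesh[i]) / M;   /* FFTW transforms are unnormalized */

        fftw_destroy_plan(fwd); fftw_destroy_plan(bwd); fftw_free(mesh);
    }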

8 Splitting Methods: Multigrid
- Solve the Poisson equation in the far field with a multigrid PDE solver:
  - use different levels of successively coarser meshes
  - solve the Poisson equation on these meshes by recursively improving the solution from the coarser mesh (see the 1D sketch below)
- Complexity O(N)
- Can be extended to handle periodic BC
- In ScaFaCoS: PP3MG (Wuppertal), VMG (Bonn)
[Figure: V-cycle over levels l = 4, 3, 2, 1, with restriction, prolongation, and smoothing/solving]
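To make the recursion concrete, here is a minimal 1D V-cycle for -u'' = f on [0,1] with zero Dirichlet boundaries. Every choice in it (damped Gauss-Seidel smoother, full-weighting restriction, linear prolongation) is a textbook default, not the PP3MG/VMG implementation.

    #include <stdlib.h>

    /* v[i] with zero Dirichlet boundary values outside the interior range [0, n) */
    static double at(const double *v, int n, int i)
    {
        return (i < 0 || i >= n) ? 0.0 : v[i];
    }

    /* a few damped Gauss-Seidel sweeps on -u'' = f, mesh width h */
    static void smooth(double *u, const double *f, int n, double h, int sweeps)
    {
        for (int s = 0; s < sweeps; s++)
            for (int i = 0; i < n; i++) {
                double gs = 0.5 * (at(u, n, i-1) + at(u, n, i+1) + h*h*f[i]);
                u[i] += (2.0/3.0) * (gs - u[i]);
            }
    }

    /* one V-cycle on n = 2^l - 1 interior points, mesh width h */
    static void vcycle(double *u, const double *f, int n, double h)
    {
        if (n == 1) { u[0] = 0.5 * h*h * f[0]; return; }  /* coarsest level: solve exactly */
        smooth(u, f, n, h, 3);                            /* pre-smoothing                 */

        int nc = (n - 1) / 2;                             /* next coarser mesh             */
        double *r  = malloc(n  * sizeof *r);
        double *fc = malloc(nc * sizeof *fc);
        double *uc = calloc(nc,  sizeof *uc);             /* zero initial coarse guess     */

        for (int i = 0; i < n; i++)                       /* residual r = f - A u          */
            r[i] = f[i] - (2.0*u[i] - at(u, n, i-1) - at(u, n, i+1)) / (h*h);
        for (int j = 0; j < nc; j++)                      /* full-weighting restriction    */
            fc[j] = 0.25 * (r[2*j] + 2.0*r[2*j+1] + r[2*j+2]);

        vcycle(uc, fc, nc, 2.0*h);                        /* improve on the coarser mesh   */

        for (int i = 0; i < n; i++)                       /* linear prolongation + correction */
            u[i] += (i % 2) ? uc[i/2]
                            : 0.5 * (at(uc, nc, i/2 - 1) + at(uc, nc, i/2));
        smooth(u, f, n, h, 3);                            /* post-smoothing                */

        free(r); free(fc); free(uc);
    }

Repeating vcycle until the residual is small costs O(N) work per cycle at a mesh-independent convergence rate, which is where the O(N) complexity quoted on the slide comes from.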

9 Hierarchical Methods: Barnes-Hut Tree Code
- Multipole-expand successively larger clusters of particles
- Compute the interaction with far-away clusters instead of with single particles
- Complexity O(N log N)
- Can be extended to handle periodic BC
- In ScaFaCoS: PEPC (Jülich)
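The idea in formulas (the standard expansion, truncated here after the dipole term; not specific to PEPC): a cluster of charges q_j at positions r_j, viewed from a distant point r, is replaced by an expansion about the cluster center R:

    \Phi(\mathbf{r}) \;=\; \sum_j \frac{q_j}{|\mathbf{r}-\mathbf{r}_j|}
    \;\approx\; \frac{Q}{|\mathbf{r}-\mathbf{R}|}
    \;+\; \frac{\mathbf{P}\cdot(\mathbf{r}-\mathbf{R})}{|\mathbf{r}-\mathbf{R}|^3}
    \;+\; \cdots,
    \qquad Q = \sum_j q_j, \quad \mathbf{P} = \sum_j q_j\,(\mathbf{r}_j - \mathbf{R}).

A cluster of extent s is treated as a whole only when it is far enough away, typically when s/|r - R| < θ for a fixed opening angle θ; otherwise the tree code descends into the cluster's children.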

10 Hierarchical Methods: Fast Multipole Method
- Extends Barnes-Hut: let clusters interact with each other
- Put everything on a grid
- Complexity O(N)
- Can be extended to handle periodic BC
- In ScaFaCoS: FMM (Jülich)

11 Local Methods: MEMD
- See the talk of Florian F.
- Purely local: should show very nice parallel scaling
- Complexity O(N)

12 Benchmark Systems
[Images: the cloud-wall system (ESPResSo test system, 300 charges) and a silica melt]

13 Benchmark Systems (2)
- When larger systems were needed, the systems were replicated
- PEPC was removed (performed too badly)
- Periodic systems, relatively homogeneous density, charge-neutral
- JUROPA (Linux cluster) for small to intermediate numbers of cores
- JUGENE (BlueGene/P HPC machine) for intermediate to large numbers of cores
Accuracies are given by the relative RMS potential error

    \varepsilon_{\text{pot}} :=
    \left( \frac{\sum_{j=1}^{N} |\Phi_{\text{ref}}(\mathbf{x}_j) - \Phi_{\text{method}}(\mathbf{x}_j)|^2}
                {\sum_{j=1}^{N} |\Phi_{\text{ref}}(\mathbf{x}_j)|^2} \right)^{1/2}
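This formula translates directly into code; the function and argument names below are mine:

    #include <math.h>

    /* Relative RMS potential error between a reference solver and the
     * method under test, over the potentials of n particles. */
    double eps_pot(const double *phi_ref, const double *phi_method, int n)
    {
        double num = 0.0, den = 0.0;
        for (int j = 0; j < n; j++) {
            double d = phi_ref[j] - phi_method[j];
            num += d * d;
            den += phi_ref[j] * phi_ref[j];
        }
        return sqrt(num / den);
    }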

14 Complexity
- P2NFFT, P³M and FMM are fastest
- MEMD and the multigrid methods are about 10× slower
- All algorithms show (close-to-)linear behavior
- The log N term of P2NFFT and P³M is invisible: no cross-over with FMM
[Plot: time per charge t/#charges vs. #charges for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; silica melt, ε_pot ≈ 10⁻³, P = 1 (JUROPA)]

15 Accuracy
- FMM and P2NFFT scale very well with the required accuracy and can achieve very high accuracy
- P³M cannot (due to tuning)
- The multigrid methods suffer from the steep potential (or from bad tuning)
- MEMD cannot influence its accuracy to any great extent
[Plot: time per charge vs. relative RMS potential error ε_pot for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall, P = 1 (JUROPA)]

16 Scaling: Timing
Execution time t vs. number of cores P is often used to display parallel scaling:
- it shows the actual execution times,
- but it hides the actual scaling behavior,
- and it hides the differences between the algorithms.
[Plot: time t vs. #cores P for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall, ε_pot ≈ 10⁻³, JUROPA]

17 Scaling: Relative Parallel Efficiency
Parallel efficiency can be used to show scaling:

    e(P) = \frac{t_1}{P \, t_P}, \qquad e(P) \in [0, 1], \quad e(P) = 1 \text{ for optimal scaling}

It can be thought of as the effective fraction of the CPU power that is actually used in parallel. To compare algorithms, use the relative parallel efficiency, normalized to the best core-time product instead of the single-core time:

    e(P) = \frac{P_{\text{best}} \, t_{P_{\text{best}}}}{P \, t_P}

[Plot: relative parallel efficiency e(P) vs. #cores P for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall, ε_pot ≈ 10⁻³, JUROPA]
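As a small worked example (function names and timing numbers are made up): the relative efficiency normalizes each timing sample (P, t_P) by the smallest core-time product observed.

    #include <stdio.h>

    /* Relative parallel efficiency of each sample (P[i], t[i]) with respect
     * to the sample with the smallest core-time product P*t. */
    void rel_efficiency(const int *P, const double *t, int n, double *e)
    {
        double best = P[0] * t[0];
        for (int i = 1; i < n; i++)
            if (P[i] * t[i] < best) best = P[i] * t[i];
        for (int i = 0; i < n; i++)
            e[i] = best / (P[i] * t[i]);
    }

    int main(void)
    {
        int    P[] = {1, 16, 256};        /* core counts (made-up)       */
        double t[] = {100.0, 7.0, 0.6};   /* execution times in seconds  */
        double e[3];
        rel_efficiency(P, t, 3, e);
        for (int i = 0; i < 3; i++)
            printf("P = %4d  e = %.2f\n", P[i], e[i]);
        return 0;
    }

Here the single-core run has the smallest P*t, so e(1) = 1.0 while e(16) ≈ 0.89 and e(256) ≈ 0.65.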

18 Scaling: Comparing Methods
- Again: P2NFFT, FMM and P³M lie within a factor of 2 of each other
- The scaling of P³M is better than that of P2NFFT and FMM
- Issue of P³M: tuning
- MEMD performs OK
- The scaling of the multigrid methods is very smooth and flat... but e(P) < 10%!
[Plot: relative parallel efficiency e(P) vs. #cores P for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall, ε_pot ≈ 10⁻³, JUROPA]

19 Scaling: Small Systems
[Plots: relative parallel efficiency e(P) vs. #cores P for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall with N = 8100 (left) and a larger N (right), ε_pot ≈ 10⁻³, JUROPA]

20 Scaling: HPC Machine
[Plots: relative parallel efficiency e(P) vs. #cores P for MEMD, P2NFFT, P3M, FMM, VMG, PP3MG; cloud-wall, ε_pot ≈ 10⁻³, on JUROPA (left) and JUGENE (right)]
- The JUGENE results use older versions of both P³M and P2NFFT (JUGENE is dead)
- All algorithms show better scaling on JUGENE
- JUGENE has slower cores but better interconnects

21 Scaling: Large Systems
- Many algorithms can't handle large systems
- For really large systems, FMM seems to be good: FMM has done time steps with 3 trillion charges!... whatever that's good for
[Plots: relative parallel efficiency e(P) vs. #cores P for two large cloud-wall systems, ε_pot ≈ 10⁻³, JUGENE; fewer and fewer methods can run as N grows]

22 ScaFaCoS: Conclusions
- Performance depends heavily on architecture, compiler, and implementation... and on tuning!
- Factor-of-2 differences between algorithms are normal
- Within these limits, FMM, P³M and P2NFFT perform equally well
- MEMD is slightly worse (~4×), but performs better on larger systems
- The multigrid methods seem to be worse (~10×)... apparently due to the large variation in the potential

23 P³M: Recent Developments
- Determined the optimal P³M components, gained a factor of ~4 (Florian W.)
- Improved tuning (Florian W.)
- CUDA P³M: coming to ESPResSo really soon (Florian W.)
- First interface to ScaFaCoS (with problems) (Andreas M.)
- Improved P³M code (Florian W., Olaf)
- In progress: improved code organization, a common code base for ScaFaCoS and ESPResSo (Florian W., Olaf)
- In progress: further improvements in tuning (April, Olaf)
