Parallelization of DQMC Simulations for Strongly Correlated Electron Systems

Size: px

Start display at page:

Download "Parallelization of DQMC Simulations for Strongly Correlated Electron Systems"

Beatrice Dalton
5 years ago
Views:

1 Parallelization of DQMC Simulations for Strongly Correlated Electron Systems Che-Rung Lee Dept. of Computer Science National Tsing-Hua University Taiwan joint work with I-Hsin Chung (IBM Research), Zhaojun Bai (UCDavis) IEEE International Parallel and Distributed Processing Symposium 00 Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

2 Outline DQMC simulations DQMC parallelization Algorithmic approaches System approaches Experiment results Conclusion Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

3 Computational Material Science Understanding and exploiting the properties of solid-state materials: magnetism, metal-insulator transition, high temperature superconductivity,... A. Density B. Density fluctuations C. Spin correlations D. Pairing correlations Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

4 Hubbard Model and DQMC Simulations Many body simulation on multi-layer lattices using Hubbard model and quantum monte carlo method. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for Determinant

5 Hubbard Model and DQMC Simulations Many body simulation on multi-layer lattices using Hubbard model and quantum monte carlo method. QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for Determinant Quantum Monte Carlo (DQMC) simulations. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

6 DQMC Algorithm Two stages: Random HS field Warmup stage Sampling stage A DQMC step Propose a local change: h h. Throw a random number 0 < r <. Accept the change if r < det(e βh(h ) ) det(e βh(h) ). warmup sampling DQMC step thermalized yes DQMC step Measurements enough samples yes Aggregation no no Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 5 /

7 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

8 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) The unequal time Green s function I B G τ B I = B L I Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

9 Computational Kernels The equal time Green s function G k = (I + B k B k+ B B L B k ) The unequal time Green s function I B G τ B I = B L I Physical measurements Operations on G k and G τ, Fourier Transform, etc. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

10 Computational Challenges For simulating strongly correlated electron systems The size of lattices need be large. A longer warmup stage is required. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

11 Computational Challenges For simulating strongly correlated electron systems The size of lattices need be large. A longer warmup stage is required. Numerical stability issues. Additional stabilizing steps are required. Most calculations need double precision. Many fast updating methods and parallel algorithms cannot be used. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

12 DQMC Parallelization Algorithmic approaches Parallel Markov chain Rolling feeder algorithm Parallel matrix computations System approaches Task decomposition Communication and computation overlapping Message compression Load balance Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 8 /

13 DQMC step Measurements DQMC step Measurements DQMC step Measurements Parallel Markov Chain The sampling stage can be parallelized embarrassingly. warmup sampling Random HS field DQMC step thermalized yes no Aggregation Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 9 /

14 DQMC step Measurements DQMC step Measurements DQMC step Measurements Parallel Markov Chain The sampling stage can be parallelized embarrassingly. The speedup of parallelization is limited by the time of the warmup stage. (Amdahl s law) ρ speedup = T warmup + T sampling T warmup + T sampling /N p < T warmup + T sampling T warmup warmup sampling Random HS field DQMC step thermalized yes no Aggregation Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 9 /

15 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

16 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

17 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

18 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

19 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Parallel reduction (takes O(N log L) time.)... DQMC step Compute G DQMC step Compute G DQMC step Compute G... Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

20 Green s Function Calculation Matrix G k need be computed cyclically with B k updated. G = (I + B B B L B L ). G = (I + B B B L B ). G = (I + B B B B ). Parallel reduction (takes O(N log L) time.)... DQMC step Compute G DQMC step Compute G DQMC step Compute G... Numerically unstable! Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

21 Rolling Feeder Algorithm The matrix product can be stably computed sequentially. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

22 Rolling Feeder Algorithm The matrix product can be stably computed sequentially.... DQMC step DQMC step DQMC step... Compute G Compute G Compute G Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

23 Rolling Feeder Algorithm The matrix product can be stably computed sequentially.... DQMC step DQMC step DQMC step... Compute G Compute G Compute G Tasks to get one G k Sequential Parallel reduction Rolling feeder. Matrix multiplication L log L. Stabilization step O(L) O(log L). Inverting (I + B... B L ). Data transmission N O(LN ) N Comparisons on resources and stability Processor O() O(L) O(L) Numerically stable Y N Y Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

24 Parallel Matrix Computations Two matrix computation kernels are parallelized. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

25 Parallel Matrix Computations Two matrix computation kernels are parallelized. The unequal time Green s function is computed by blocks in parallel (I +B k B B L B k+ ) B k B l+ k >l Gk,l τ = (I +B k B B L B k+ ) k =l (I +B k B k+ ) B k B B L B l+ k <l Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

26 Parallel Matrix Computations Two matrix computation kernels are parallelized. The unequal time Green s function is computed by blocks in parallel (I +B k B B L B k+ ) B k B l+ k >l Gk,l τ = (I +B k B B L B k+ ) k =l (I +B k B k+ ) B k B B L B l+ k <l The matrix-matrix multiplication of G k and each block matrix of G τ is speeded up using multicore. The matrix size of G k, , is too small such that the matrix computation cannot be benefited by using MPI-style parallelization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

27 System Design The system contains several simulators for parallel Markov chain. Each simulator consists of a walker and a M-server. MC walker DQMC steps M-server Physical measurements Feeder Feeder Equal time measurator Feeder Iterator HS field GC GC GC Feeder Feeder Unequal time measurator Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

28 Implementation Techniques System is implemented for hybrid systems (cluster+multicore) Task MPI OpenMP Comm/comp Message Load overlapping compression balance Parallel Markov chain Rolling feeder algorithm Unequal time Green s fn Physical measure- ment Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

29 Communication/Computation Overlapping Iterator... HS field for time slice 0 G 0 Feeder Multiply B 0 and compute G 0 Iterator... Get G 0 by FUA HS field for time slice 0 MC iteration MC starts here iterates on G 0 G 0 Feeder Multiply B 0 and compute G 0 MC iterates on G 0 MC iteration starts here HS field for time slice Without overlapping Get G 0 and convert it to G by FUA MC iterates on G HS field for time slice Multiply B Using fast update algorithm (FUA) to reduce waiting time Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 5 /

30 Load Balance Iterators are fully occupied the bottleneck of speedup. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 6 /

31 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

32 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. The load balance problem: how many block submatrices should one processor compute? Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

33 Load Balance Iterators are fully occupied the bottleneck of speedup. Processor utilization can be enhanced by merging tasks. For example, when computing unequal time Green s function, each processor can take care of more than one block submatrix. The load balance problem: how many block submatrices should one processor compute? Using the queueing theory (Little s law) to estimate. P n C = max. P λt λt λ: arrival rate; T : processing time; P: processor utilization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 6 /

34 System and Benchmark System Run on the IBM Blue Gene/P Each compute node is equipped with 850MHz PowerPC 50 quad-core processor and GB memory. IBM XL compilers with IBM BLAS and LAPACK libraries. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 7 /

35 System and Benchmark System Run on the IBM Blue Gene/P Each compute node is equipped with 850MHz PowerPC 50 quad-core processor and GB memory. IBM XL compilers with IBM BLAS and LAPACK libraries. Benchmark DQMC simulation on a two-dimensional periodic lattice. The lattice size is N = 6 6 = 56. The ratio of DQMC steps for the warmup stage and the sampling stages is : 0. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 7 /

36 Communication Pattern Iterator Feeder. Feeder. Feeder. Feeder. Feeder.5 Feeder.6 Measurator Iterator Feeder. Feeder. Feeder. Feeder. Feeder.5 Feeder.6 Measurator Green bands show the waiting time of MPI RECV. Iterators are fully occupied after started. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 8 /

37 Speedup for Different L Speedup np Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 9 /

38 Effect of Load Balance (L = 96) Speedup np nc: number of block submatrices computed per processor. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 0 /

39 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

40 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. We targeted the hybrid massive parallel systems, and explored the parallelism of DQMC simulations on different levels of granularity. Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

41 Summary DQMC simulation for strongly correlated materials is a computationally intensive task, which is eager for parallelization. We targeted the hybrid massive parallel systems, and explored the parallelism of DQMC simulations on different levels of granularity. Our implementation shows over 80x speedup on thousand processors, which is much better than embarrassing parallelization (speedup < ). Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallelization of DQMC Simulations IPDPS 00 /

42 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

43 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

44 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Different physics models and methods. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

45 Future Works More fine-grain parallel matrix computation kernels (pivoted QR, QR, matrix inversion) to fully utilize the computational power of multicores. Better system design to enhance the processor utilization. Different physics models and methods. Code is still in the experimental stage. Further development is required for practical use. Che-Rung Lee Parallelization of DQMC Simulations IPDPS 00 /

Parallelization of DQMC Simulation for Strongly Correlated Electron Systems

Parallelization of DQMC Simulation for Strongly Correlated Electron Systems Che-Rung Lee Dept. of Computer Science National Tsing-Hua University Hsin-Chu, Taiwan cherung@cs.nthu.edu.tw I-Hsin Chung IBM