Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

1 International Conference on Energy-Aware High Performance Computing (ENAHPC), Hamburg, Germany, Sept. Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems. George Bosilca, Hatem Ltaief, Jack Dongarra. Innovative Computing Laboratory, University of Tennessee, Knoxville; KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia; Oak Ridge National Lab; University of Manchester. Bosilca, Ltaief, Dongarra (KAUST, UTK), Power Profiling of DLA Algorithms, ENAHPC.

2 Outline: 1. Motivations; 2. From LAPACK to PLASMA; 3. From PLASMA to DPLASMA; 4. Power Measurements Technique; 5. Power Measurements Results; 6. Summary and Future Work.

3 Motivations: The TOP500 List. Table columns: Rank, Name, Vendor, Cores, R max (Tflop/s), Power (kW). Systems listed, in rank order: Sequoia (IBM BG/Q); K Computer (Fujitsu SPARC); Mira (IBM BG/Q); SuperMUC (Intel Xeon E); Tianhe-A (Intel Xeon X567 + M); Jaguar (Cray XK6 Opteron + M); Fermi (IBM BG/Q); JuQUEEN (IBM BG/Q); Curie thin nodes (Intel Xeon E); Nebulae (Intel Xeon X567 + M). [Numeric columns and per-system percentages not recoverable.]

4 Motivations: The TOP500 List (same table as the previous slide, with one addition). Human brain: PetaFLOPS! (cf. Kurzweil)

5 Motivations: Today's Special Meal. 8 MW needed to feed the baby. Exascale roadmap says up to [...] MW power envelope. Huge challenge: achieving orders of magnitude in performance by only doubling the power rate. High level of concurrency. Ingredients: fine-grain parallelism, dynamic runtime systems, power efficiency. Flops are cheap, data movement is expensive. Co-designed hardware and software solutions.

6 From LAPACK to PLASMA: Block Algorithms. Panel-Update sequence. Transformations are blocked/accumulated within the panel (Level 2 BLAS). Transformations are applied at once on the trailing submatrix (Level 3 BLAS). Parallelism is hidden inside the BLAS. Fork-join model.
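
The panel-update sequence on this slide can be sketched in a few lines. The following is a minimal, pure-Python illustration of a right-looking blocked Cholesky factorization, not LAPACK's actual routine: the function name `blocked_cholesky` and the block size parameter `nb` are assumptions made for this example, and the input is assumed to be a symmetric positive definite matrix given as a list of rows.

```python
import math


def blocked_cholesky(a, nb):
    """Return lower-triangular L with L * L^T = a, working nb columns at a time.

    Illustrative sketch only: `a` is assumed symmetric positive definite.
    """
    n = len(a)
    l = [[float(x) for x in row] for row in a]  # work on a copy
    for k in range(0, n, nb):
        kend = min(k + nb, n)
        # Panel: factor columns k..kend-1 over all rows below the diagonal.
        for j in range(k, kend):
            s = l[j][j] - sum(l[j][p] * l[j][p] for p in range(k, j))
            l[j][j] = math.sqrt(s)
            for i in range(j + 1, n):
                s = l[i][j] - sum(l[i][p] * l[j][p] for p in range(k, j))
                l[i][j] = s / l[j][j]
        # Update: apply the whole panel at once to the trailing submatrix
        # (lower triangle only).
        for i in range(kend, n):
            for j in range(kend, i + 1):
                l[i][j] -= sum(l[i][p] * l[j][p] for p in range(k, kend))
    # Clear the strictly upper triangle so the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            l[i][j] = 0.0
    return l
```

Each outer iteration must finish its panel before the trailing update starts, and the update must finish before the next panel starts. Those two synchronization points per iteration are what the fork-join model exposes.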

7 From LAPACK to PLASMA: One-Sided Block Algorithms: LU

8 From LAPACK to PLASMA: Block Algorithms: Fork-Join Paradigm

9 From LAPACK to PLASMA: Tile Data Layout Format. LAPACK: column-major format. PLASMA: tile format.
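
To make the difference between the two layouts concrete, here is a small sketch that regroups a column-major matrix into tiles, each stored contiguously. The function name `to_tile_layout` and the dictionary-of-tiles representation are assumptions chosen for readability; PLASMA's own layout translation routines are not shown here.

```python
def to_tile_layout(a, m, n, nb):
    """Regroup a flat column-major m x n matrix into nb x nb tiles.

    Returns a dict mapping (tile_row, tile_col) to a flat list holding that
    tile in column-major order. Edge tiles may be smaller than nb x nb.
    """
    tiles = {}
    for tj in range(0, n, nb):
        for ti in range(0, m, nb):
            rows = min(nb, m - ti)
            cols = min(nb, n - tj)
            # Element (r, c) of the full matrix lives at a[c * m + r].
            tiles[(ti // nb, tj // nb)] = [
                a[(tj + c) * m + (ti + r)]
                for c in range(cols)
                for r in range(rows)
            ]
    return tiles
```

In column-major storage the entries of one tile are spread across `nb` separate columns, whereas in tile storage they form one contiguous block that a single task can work on.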

10 From LAPACK to PLASMA: PLASMA Tile Algorithms. PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures. Parallelism is brought to the fore. May require the redesign of linear algebra algorithms. Tile data layout translation. Remove unnecessary synchronization points between Panel-Update sequences. DAG execution, where nodes represent tasks and edges define dependencies between them. Dynamic runtime system environment: QUARK.
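
As an illustration of how a tile algorithm turns a factorization into fine-grained tasks, the sketch below enumerates the kernel calls of a tile Cholesky factorization on an `nt` x `nt` grid of tiles. The kernel names (POTRF, TRSM, SYRK, GEMM) follow the usual LAPACK/BLAS naming; the function `tile_cholesky_tasks` and the tuple format are assumptions made for this example, not PLASMA's interface.

```python
def tile_cholesky_tasks(nt):
    """List the tasks of a tile Cholesky on an nt x nt tile matrix.

    Each task is (kernel, step k, tile written), in sequential program order.
    """
    tasks = []
    for k in range(nt):
        # Factor the diagonal tile (k, k).
        tasks.append(("POTRF", k, (k, k)))
        # Triangular solves on the tiles below it in tile column k.
        for m in range(k + 1, nt):
            tasks.append(("TRSM", k, (m, k)))
        # Trailing updates, one task per tile of the lower triangle.
        for m in range(k + 1, nt):
            # Diagonal tile (m, m) is updated using tile (m, k).
            tasks.append(("SYRK", k, (m, m)))
            # Off-diagonal tile (m, n) is updated using tiles (m, k), (n, k).
            for n in range(k + 1, m):
                tasks.append(("GEMM", k, (m, n)))
    return tasks
```

These tasks are the nodes of the DAG. The edges follow from which tiles each task reads and writes, so tasks from different steps `k` can overlap once their input tiles are ready.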

11 From LAPACK to PLASMA: Dynamic Runtime QUARK. Basic ideas: conceptually similar to out-of-order processor scheduling; dynamic runtime DAG scheduler; out-of-order execution flow of fine-grained tasks; task scheduling as soon as dependencies are satisfied; producer-consumer. Similar projects: SuperMatrix, OmpSs, StarPU.
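
The idea of scheduling tasks as soon as their dependencies are satisfied can be sketched as follows. This is a toy model, not QUARK's API: the function `schedule_waves`, the `(name, reads, writes)` task format, and the demo list `TILE_CHOLESKY_3X3` are all assumptions made for illustration. Dependencies are inferred from the data each task reads and writes, in the spirit of out-of-order processor scheduling, and tasks are grouped into "waves" that could run concurrently.

```python
def schedule_waves(tasks):
    """Group tasks into waves of mutually independent tasks.

    tasks: list of (name, reads, writes) in sequential program order.
    A task lands in the earliest wave after all tasks it depends on.
    """
    last_writer = {}  # data item -> index of the last task that wrote it
    readers = {}      # data item -> tasks that read it since that write
    level = []
    for idx, (_, reads, writes) in enumerate(tasks):
        deps = set()
        for d in reads:
            if d in last_writer:
                deps.add(last_writer[d])      # read-after-write
        for d in writes:
            if d in last_writer:
                deps.add(last_writer[d])      # write-after-write
            deps.update(readers.get(d, []))   # write-after-read
        level.append(1 + max((level[i] for i in deps), default=-1))
        for d in reads:
            readers.setdefault(d, []).append(idx)
        for d in writes:
            last_writer[d] = idx
            readers[d] = []
    waves = [[] for _ in range(max(level, default=-1) + 1)]
    for idx, (name, _, _) in enumerate(tasks):
        waves[level[idx]].append(name)
    return waves


# Hypothetical demo input: tile Cholesky on a 3 x 3 tile matrix.
TILE_CHOLESKY_3X3 = [
    ("POTRF0 A00", ["A00"], ["A00"]),
    ("TRSM0 A10", ["A00", "A10"], ["A10"]),
    ("TRSM0 A20", ["A00", "A20"], ["A20"]),
    ("SYRK0 A11", ["A10", "A11"], ["A11"]),
    ("SYRK0 A22", ["A20", "A22"], ["A22"]),
    ("GEMM0 A21", ["A20", "A10", "A21"], ["A21"]),
    ("POTRF1 A11", ["A11"], ["A11"]),
    ("TRSM1 A21", ["A11", "A21"], ["A21"]),
    ("SYRK1 A22", ["A21", "A22"], ["A22"]),
    ("POTRF2 A22", ["A22"], ["A22"]),
]
```

Calling `schedule_waves(TILE_CHOLESKY_3X3)` groups the two first-step TRSM tasks into one wave and the three first-step update tasks into the next, which is the kind of concurrency a fork-join execution hides behind its barriers. A real runtime dispatches ready tasks to idle cores continuously rather than in discrete waves.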

12 From LAPACK to PLASMA: DataFlow Programming Model. Five-decade-old concept. Programming paradigm that models a program as a directed graph of the data flowing between operations (cf. Wikipedia). Think "how things connect" rather than "how things happen". Assembly line. Inherently parallel.

13 From PLASMA to DPLASMA: 2D Block Cyclic Distribution. Figure: Two-Dimensional Block Cyclic Data Distribution. (a) Column-major data layout format within a block. (b) Tile data layout format within a block.
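
A two-dimensional block cyclic distribution can be expressed as a simple index mapping. The sketch below assumes a `p` x `q` process grid with the first tile owned by process (0, 0); the function names `owner` and `local_tiles` are hypothetical and do not correspond to ScaLAPACK or DPLASMA routines.

```python
def owner(i, j, p, q):
    """Process grid coordinates owning tile (i, j) on a p x q grid."""
    return (i % p, j % q)


def local_tiles(proc_row, proc_col, mt, nt, p, q):
    """Tiles of an mt x nt tile matrix stored on process (proc_row, proc_col)."""
    return [
        (i, j)
        for i in range(proc_row, mt, p)
        for j in range(proc_col, nt, q)
    ]
```

For example, on a 2 x 2 grid, `owner(5, 3, 2, 2)` gives process (1, 1), and each process holds every second tile in both dimensions. This spreads the shrinking trailing submatrix of a factorization across all processes.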

14 From PLASMA to DPLASMA: DAGuE Dynamic Runtime Scheduler (Bosilca et al., UTK)

15 From PLASMA to DPLASMA: DAGuE Dynamic Runtime Scheduler (Bosilca et al., UTK)

16 Power Measurements Technique: The PowerPack Framework. Dual-socket quad-core Intel Xeon system from Virginia Tech, clocked at [...].8 GHz with 8 GB of memory. Measurements come from power meters attached to the hardware of the system. Fine-grain measurement (ms) allows power consumption to be measured on a per-component basis (memory, hard disk, motherboard) and as a whole. N = 4 for all experiments.
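
Sampled power readings such as these can be turned into an energy figure in joules by integrating power over time. The sketch below uses the trapezoidal rule; the function name `energy_joules` and the input format (timestamps in seconds, power in watts) are assumptions for illustration and are not part of the PowerPack framework.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (watts) over time (seconds) into joules.

    Uses the trapezoidal rule between consecutive samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for k in range(1, len(timestamps_s)):
        dt = timestamps_s[k] - timestamps_s[k - 1]
        total += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return total
```

Because energy is power integrated over time, a run that draws more power but finishes sooner can still consume less energy overall, so both the height and the length of a power profile matter.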

17 Power Measurements Technique: The PowerPack Framework (K. Cameron et al., Virginia Tech)

18 Power Measurements Results. Figure: Impact of the block size on the power profiles (Watts) of the ScaLAPACK Cholesky; panels (a) to (c), one per block size. Figure: Impact of the tile size on the power profiles (Watts) of the DPLASMA Cholesky; panels (a) to (c), one per tile size, with (c) tile size = 768.

19 Power Measurements Results. Figure: Impact of the block size on the power profiles (Watts) of the ScaLAPACK QR Factorization; panels (a) to (c), one per block size. Figure: Impact of the tile size on the power profiles (Watts) of the DPLASMA QR Factorization; panels (a) to (c), one per tile size, with (c) tile size = 768.

20 Power Measurements Results. Figure: Impact of the number of cores on the power profiles (Watts) of the ScaLAPACK Cholesky Factorization; panels (a) and (b), one per core count. Figure: Impact of the number of cores on the power profiles (Watts) of the DPLASMA Cholesky Factorization; panels (a) and (b), one per core count.

21 Power Measurements Results. Figure: Impact of the number of cores on the power profiles (Watts) of the ScaLAPACK QR; panels (a) and (b), one per core count. Figure: Impact of the number of cores on the power profiles (Watts) of the DPLASMA QR; panels (a) and (b), one per core count.

22 Power Measurements Results. Figure: Power Profiles of the Cholesky Factorization. (a) ScaLAPACK. (b) DPLASMA.

23 Power Measurements Results. Figure: Power Profiles of the QR Factorization. (a) ScaLAPACK. (b) DPLASMA.

24 Power Measurements Results. Figure: Total amount of energy (joule) used for each test based on the number of cores. Table columns: # Cores, Library, Cholesky, QR. Rows: ScaLAPACK and DPLASMA for each of three core counts. [Numeric values not recoverable.]

25 Summary and Future Work: Conclusion. Stressing the system's components. DPLASMA Cholesky algorithms decrease the energy consumption by up to 6% compared to ScaLAPACK Cholesky. DPLASMA QR algorithms decrease the energy consumption by up to 4% compared to ScaLAPACK QR. Asynchronous execution runtimes and adapted algorithms can lead to significantly improved efficiency and power savings.

26 Summary and Future Work: What's next? Power analysis of advanced numerical algorithms on distributed systems (two-sided transformations, tree reduction, mixed precision). Comparisons with other DLA libraries: Elemental, Eigen-K, ELPA. Distributed heterogeneous architectures. Scheduler interaction through DVFS/Intel RAPL technology. Running on IBM BG P/Q and exploiting embedded power collection hardware tools.
