Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters
Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University
Presented by: Hari Subramoni

Outline
Introduction and Background
Motivation
Problem Statement
Designing Power-Aware Collective Algorithms
Experimental Evaluation
Conclusions and Future Work

Introduction
Scientific research is being driven by multi-core processors and high-speed networks.
Supercomputers typically comprise hundreds of thousands of compute cores.
The power consumed by such systems has also increased sharply.
Power is deemed one of the major challenges in designing next-generation exascale systems.

[Chart] Average Power Consumption of Current Generation Supercomputers (kW): average power of Top500, Top 50, and Top 10 systems, based on the power measurements of the LINPACK benchmark as reported on the Top500 site.

Introduction
Modern architectures offer fine-grained schemes for saving power during idle periods: Dynamic Voltage and Frequency Scaling (DVFS) and CPU throttling.
Power conservation techniques are typically associated with a delay, leading to performance overheads.
Broad challenge: is it possible to design software and middleware stacks in a power-aware manner to minimize overall system power consumption with negligible performance overheads?

Power Saving Innovations in Current Generation Processors and Networks
Processors (Intel Nehalem):
DVFS: dynamically scale the frequency and voltage of the CPU; core-level DVFS; CPU frequency range 1.6 to 2.4 GHz.
CPU throttling: insert brief idle periods to save power; socket-level CPU throttling; multiple throttling states (T0 through T7), where T0 corresponds to 100% CPU activity and T7 to 12% CPU activity.
Networks (InfiniBand):
Allow offloading significant parts of communication.
Offer polling and blocking modes of message progression.
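
To make these knobs concrete, here is a minimal sketch (not from the slides) of applying DVFS from user space through the Linux cpufreq sysfs interface; it assumes the "userspace" governor is active and sufficient permissions, and the helper name is hypothetical. T-state throttling is exposed through platform-specific interfaces (e.g. ACPI) and is omitted here.

```c
/* Illustrative sketch: scale one core's frequency via the Linux cpufreq
 * sysfs interface. Assumes the "userspace" governor and write permission;
 * the helper name is hypothetical, not an MVAPICH2 API. */
#include <stdio.h>

int set_core_frequency_khz(int core, long freq_khz)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    f = fopen(path, "w");
    if (!f)
        return -1;                  /* no cpufreq support or no permission */
    fprintf(f, "%ld\n", freq_khz);  /* e.g. 1600000 for 1.6 GHz */
    fclose(f);
    return 0;
}
```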

State of the Art in Power Conservation Techniques
Where can we conserve power? Communication phases.
Cameron et al. proposed and demonstrated the utility of PowerPack.
Lowenthal et al. dynamically detect communication phases and scale the CPU frequency to save power.
Liu et al. studied power consumption with RDMA-based networks, comparing blocking and polling modes.

Collective Communication in MVAPICH2
Scientific parallel applications spend a considerable amount of time in collective communication operations.
Multi-core-aware and network-topology-aware algorithms optimize the communication costs.
Current power saving methods treat communication phases as a black box and use DVFS.
Is it possible to re-design collective communication algorithms to deliver fine-grained power savings with very little performance overhead?

Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2
[Diagram: Node1 through NodeN, each with Socket A and Socket B]
Intra-node phase: cores within each node exchange data through shared memory.
Inter-leader (network) phase: one leader core per node communicates over the leader communicator while the remaining CPU cores sit idle.
Shared-memory algorithms significantly improve performance, but many cores remain idle during collective operations, which is not power efficient.
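
To illustrate the two-phase structure, the sketch below organizes a leader-based broadcast with standard MPI calls; it is only a rough illustration (it uses MPI-3's MPI_Comm_split_type for node detection and assumes the data originates at world rank 0), not MVAPICH2's internal implementation.

```c
/* Sketch of a leader-based (hierarchical) broadcast from world rank 0:
 * an inter-leader phase over the network, then an intra-node fan-out
 * over shared memory. Illustrative only. */
#include <mpi.h>

void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Group ranks that share a node (MPI-3 feature, used here for brevity). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per node joins the leader communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Inter-leader (network) phase: only the leaders move data between nodes;
     * the non-leader cores have nothing to do until the fan-out below. */
    if (node_rank == 0)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Intra-node phase: each leader fans the data out within its node. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```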

MPI Collective Performance: Total vs. Network Time
[Charts: Broadcast and Reduce latency (us) vs. message size (bytes), showing total time and network-phase time]
The network phase dominates the total transfer time, and the non-leader cores sit idle, wasting power.

MPI Collective Performance: Impact of Network Congestion
[Chart: Alltoall latency (us) vs. message size (bytes) for 8-way and 4-way configurations]
With the same system size, the 8-way (4-node) configuration performs about 50% worse than the 4-way (8-node) configuration: performance suffers when more cores are simultaneously involved in network transfers.

Problem Statement
Existing designs use DVFS to save power without considering the nature of the collective algorithms.
Modern architectures allow DVFS and CPU throttling operations to be performed within a few microseconds.
Can we re-design collective communication algorithms in a power-aware manner, taking into consideration the nature of the collective operation?
What is the impact on performance?
What architectural features can allow for more power savings with smaller performance overheads?

Design Space of Power-Aware Algorithms
Default (no power savings): run each core at the peak frequency/throttling state.
Frequency scaling only: dynamically detect communication phases, treat them as a black box, and scale the CPU frequency.
Proposed: consider the communication characteristics of different collectives and intelligently use both DVFS and CPU throttling to deliver fine-grained power savings.

Proposed Approach
Apply DVFS to scale down the frequency of all cores to the lowest level at the start of the collective operation; a sketch of this enter/exit pattern follows below.
Look for opportunities within the algorithms to apply CPU throttling to specific sets of cores to save more power.
Reset the state of the cores to the normal power settings when exiting the collective operation.
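
A minimal sketch of the enter/exit pattern, assuming the set_core_frequency_khz() helper from the earlier snippet; the frequency constants and the wrapper itself are illustrative, not the actual MVAPICH2 code path.

```c
/* Sketch: wrap a collective with a DVFS scale-down on entry and a reset on
 * exit. set_core_frequency_khz() is the illustrative helper shown earlier;
 * the constants and the my_core argument are assumptions. */
#include <mpi.h>

int set_core_frequency_khz(int core, long freq_khz);   /* from earlier sketch */

#define MIN_FREQ_KHZ 1600000L   /* lowest frequency on the Nehalem testbed */
#define MAX_FREQ_KHZ 2400000L   /* nominal frequency */

int power_aware_bcast(void *buf, int count, MPI_Datatype dtype,
                      int root, MPI_Comm comm, int my_core)
{
    int rc;

    /* Enter: drop the calling core to the lowest frequency. */
    set_core_frequency_khz(my_core, MIN_FREQ_KHZ);

    /* Body: the collective itself; selective CPU throttling of idle cores
     * would be applied inside the algorithm (see the following slides). */
    rc = MPI_Bcast(buf, count, dtype, root, comm);

    /* Exit: restore the normal power settings before returning. */
    set_core_frequency_khz(my_core, MAX_FREQ_KHZ);
    return rc;
}
```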

Proposed Power-Aware Shared-Memory Based Collective Algorithms in MVAPICH2
[Diagram: Node1 through NodeN, each with Socket A and Socket B]
Intra-node phase: DVFS scales down the frequency of every core.
Inter-leader (network) phase: the idle CPU cores are throttled while the leaders communicate over the leader communicator (leader socket A at T4, socket B at T7).
Core-level CPU throttling would allow the leader core to stay at a high power state, giving a smaller performance overhead. A sketch of this throttle placement follows below.
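
The sketch below shows where such throttling calls could sit in the leader-based broadcast sketched earlier; set_socket_tstate() is a hypothetical helper (T-state control is platform-specific, e.g. via ACPI), and the T4/T7 levels follow the slide.

```c
/* Sketch of a power-aware inter-leader phase. set_socket_tstate() is a
 * hypothetical, platform-specific helper (T0 = full activity, T7 = ~12%). */
#include <mpi.h>

int set_socket_tstate(int socket, int tstate);   /* assumed helper */

void power_aware_leader_phase(void *buf, int count, MPI_Datatype dtype,
                              int node_rank, int my_socket,
                              MPI_Comm node_comm, MPI_Comm leader_comm)
{
    if (node_rank == 0) {
        /* Leader: its socket is throttled moderately (T4); the network
         * transfer is largely offloaded to the InfiniBand HCA. */
        set_socket_tstate(my_socket, 4);
        MPI_Bcast(buf, count, dtype, 0, leader_comm);
    } else {
        /* Non-leaders: throttle their socket hard (T7) while they wait,
         * inside the node-level broadcast, for the network phase to end. */
        set_socket_tstate(my_socket, 7);
    }

    /* Intra-node fan-out; non-leaders block here until the leader arrives. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    /* Restore full activity (T0) before returning to the application. */
    set_socket_tstate(my_socket, 0);
}
```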

Proposed Power-Aware Alltoall Algorithms
Phase 1: complete all the intra-node exchanges and apply DVFS across all nodes.
Phase 2: the active socket remains at T0 while the passive socket is powered down to T7 (Socket A at T0, Socket B at T7).
Phase 3: repeat for the other socket (Socket A at T7, Socket B at T0).
Phase 4: choose a pair of nodes i and j (i < j) and schedule the inter-node communication in two steps, throttling the sockets as follows:
Phase 4(a): socket A of node i and socket B of node j are active (T0); socket B of node i and socket A of node j are throttled (T7).
Phase 4(b): socket B of node i and socket A of node j are active (T0); socket A of node i and socket B of node j are throttled (T7).
Only half the cores communicate at any one time, so CPU throttling leads to better power savings. (A rough scheduling sketch follows below.)
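
A rough sketch, under stated assumptions, of how the phase-4 pairwise schedule could be expressed; exchange_with_node() is a hypothetical data-movement routine and the synchronization between steps is omitted, so this only illustrates the throttle placement, not the full MVAPICH2 algorithm.

```c
/* Sketch of the phase-4 schedule: for each pair of nodes (i < j), two steps
 * are run, and in each step only one socket per node is at T0, so only half
 * the cores drive the network at a time. Helpers are assumed; barriers
 * between steps are omitted for brevity. */
void exchange_with_node(int peer_node, int my_socket);   /* assumed helper */
int  set_socket_tstate(int socket, int tstate);          /* assumed helper */

void power_aware_alltoall_phase4(int my_node, int my_socket, int num_nodes)
{
    for (int peer = 0; peer < num_nodes; peer++) {
        if (peer == my_node)
            continue;
        for (int step = 0; step < 2; step++) {
            /* Step 0 pairs socket A of the lower node with socket B of the
             * higher node (phase 4(a)); step 1 mirrors it (phase 4(b)). */
            int active_socket = (my_node < peer) ? step : 1 - step;

            set_socket_tstate(my_socket, my_socket == active_socket ? 0 : 7);
            if (my_socket == active_socket)
                exchange_with_node(peer, my_socket);
        }
    }
    set_socket_tstate(my_socket, 0);   /* restore full activity */
}
```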

Experimental Setup
Compute platform: Intel Nehalem, with dual quad-core Intel Xeon E5530 processors operating at 2.40 GHz, 12 GB RAM, 8 MB cache, and a PCIe 2.0 interface.
Network equipment: MT26428 QDR ConnectX HCAs; a 36-port Mellanox QDR switch connects all the nodes.
Software: Red Hat Enterprise Linux Server release 5.3 (Tikanga), OFED-1.4.2.
Power measurement: MASTECH MS2205 clamp power meter for instantaneous power consumption.

Experimental Setup (cont.)
MVAPICH2: a high-performance MPI implementation over InfiniBand and other RDMA networks (v1.5.1), http://mvapich.cse.ohio-state.edu/, used by more than 1,255 organizations worldwide.
Benchmarks: micro-benchmarks from the OSU Microbenchmark Suite; application benchmarks CPMD and NAS.
Power savings with applications are estimated based on observations with the micro-benchmarks.

Alltoall Latency with InfiniBand Blocking and Polling Progression Modes
[Chart: Alltoall latency (us) vs. message length (8 KB to 1 MB) for polling and blocking modes]
Blocking mode performs worse than polling mode for Alltoall.

Alltoall Power Consumption with InfiniBand Blocking and Polling Progression Modes
[Chart: instantaneous power consumption (kW) over execution time (s) for polling and blocking modes]
Blocking mode does allow for instantaneous power savings compared to the polling mode.
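
For context on what the two progression modes mean at the verbs level, here is a minimal sketch contrasting busy-polling a completion queue with sleeping on a completion channel; it assumes a CQ created with an associated completion channel and is not MVAPICH2 code.

```c
/* Sketch: polling vs. blocking completion with InfiniBand verbs. Assumes
 * the CQ was created with a completion channel; error handling and event
 * batching are simplified. */
#include <infiniband/verbs.h>

/* Polling mode: spin on the CQ; lowest latency, but the core stays busy. */
void wait_polling(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;   /* busy-wait until a work completion arrives */
}

/* Blocking mode: sleep in the kernel until the HCA raises a completion
 * event; the core can idle, trading some latency for power. */
void wait_blocking(struct ibv_cq *cq, struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    ibv_req_notify_cq(cq, 0);                     /* arm the notification  */
    ibv_get_cq_event(channel, &ev_cq, &ev_ctx);   /* blocks until an event */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_poll_cq(ev_cq, 1, &wc);                   /* drain the completion  */
}
```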

Latency for Collectives with Different Power Conservation Techniques
[Charts: Alltoall and Broadcast latency (us) vs. message size (bytes) for the Default, Freq-Scaling, and Proposed schemes; annotations: 14% (Alltoall), 11% (Broadcast)]
The additional performance impact due to CPU throttling is minimal.

Power Consumption for Collectives with Different Power Conservation Techniques
[Charts: power consumption (kW) over execution time (s) for the Default, Freq-Scaling, and Proposed schemes; annotations: 25% and 33% (Alltoall), 18% and 28% (Broadcast)]
The proposed approach yields more power savings.

CPMD with 32 and 64 Processes (ta-inp-md data set)
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
Estimated power savings: 7.7%; performance degradation: 2.6%.

NAS FT (Class C) with 32 and 64 Processes
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
Estimated power savings: 5%; performance degradation: 10%.

Conclusions & Future Work
Proposed and designed novel power-aware collective algorithms.
The designs deliver fine-grained power savings with little performance overhead: up to 33% savings in instantaneous power consumption with micro-benchmarks, and overall estimated energy savings of up to 8% with applications.
Core-level CPU throttling could potentially lead to better power savings with smaller performance overheads.
Future work: extend these designs to other collective operations and study the potential for saving power at larger scales.
These designs will be made available in future MVAPICH2 releases.

Thank you!
http://mvapich.cse.ohio-state.edu
{kandalla, mancini, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory, The Ohio State University

CPMD with 32 and 64 Processes (Wat-32-inp-2 data set)
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
About 3% power savings with a performance degradation of 2.8% at 32 processes with the proposed approach.