Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Size: px

Start display at page:

Download "Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters"

Roderick Cummings
6 years ago
Views:

1 Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering, The Ohio State University Presented by: Hari Subramoni

2 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 2

3 Introduction Scientific research being driven by multi-cores and highspeed networks Supercomputers typically comprise of hundreds of thousands of compute cores The power consumed by such systems has also increased sharply Power is deemed as one of the major challenges in designing next generation exascale systems 3

4 Power Consumption (KW) Average Power Consumption of Current Generation Supercomputers Top500 System Top 50 System Top 10 System Based on the power measurements of LINPACK benchmark, as reported on the Top500 site 4

5 Introduction Modern architectures offer fine grained schemes for saving power during idle periods Frequency, Voltage Scaling(DVFS) and CPU Throttling Power conservation techniques are typically associated with a delay leading to performance overheads Broad Challenge: Is it possible to design software and middleware stacks in a power-aware manner to minimize the overall system power consumption, with negligible performance overheads? 5

6 Power Saving Innovations in Current Generation Processors and Networks Processors (Intel Nehalem ) DVFS Dynamically scale the Frequency and Voltage of the CPU Core-level DVFS CPU Frequency range [ ] GHz CPU Throttling Insert brief idle periods to save power Socket-level CPU Throttling Multiple Throttling states (T0 T7) T0-100% CPU activity T7-12% CPU activity Networks (InfiniBand) Allow offloading significant parts of communication Offer polling and blocking modes of message progression 6

7 State-Of-The-Art in Power Conservation Techniques Where can we conserve power? Communication phases Cameron et al, have proposed and demonstrated the utility of PowerPack Lowenthal et al - dynamically detect communication phases and scale the CPU frequency to save power Liu et al, studied power consumption with RDMA based networks comparing blocking and polling modes 7

8 Collective Communication in MVAPICH2 Scientific parallel applications spend a considerable amount of time in collective communication operations Multi-core aware and network topology-aware algorithms optimize the communication costs Current power saving methods treat communication phases as a black-box and use DVFS Is it possible to re-design collective communication algorithms to deliver fine-grained power savings with very little performance overheads? 8

9 Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Socket A Intra- Node Phase Socket B Node1 Node2 NodeN 9

10 Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Socket A Intra- Node Phase Socket B Node1 Node2 NodeN 10

11 Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Inter Leader (network) phase Leader Communicator Socket A Idle CPU cores Socket B Node1 Node2 NodeN 11

12 Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Inter Leader (network) phase Leader Communicator Idle Socket A Shared memory algorithms significantly CPU improves performance Many cores remain idle during collective cores operations Not power efficient Socket B Node1 Node2 NodeN 12

13 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 13

14 Latency (us) Latency (us) MPI Collective Performance Total Vs Network Time Broadcast Total Network Reduce Total Network Message Size (Bytes) Message Size (Bytes) 14

15 Latency (us) Latency (us) MPI Collective Performance Total Vs Network Time Broadcast Total Network Reduce Total Network Network phase dominate total transfer time Non-leader cores idle waste of power! Message Size (Bytes) Message Size (Bytes) 15

16 Latency (us) MPI Collective Performance Impact of Network Congestion Alltoall-8way Alltoall-4way 50% Message Size (Bytes) Performance in 8-way (4 nodes) configuration 50% worse than the 4-way (8-nodes) with the same system size 16

17 Latency (us) MPI Collective Performance Impact of Network Congestion Alltoall-8way Alltoall-4way 50% Performance hit when more cores simultaneously involved in network transfer Message Size (Bytes) Performance in 8-way (4 nodes) configuration 50% worse than the 4-way (8-nodes) with the same system size 17

18 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 18

19 Problem Statement Existing designs use DVFS to save power without considering the nature of the collective algorithms Modern architectures allow DVFS and CPU Throttling operations to be performed within a few micro-seconds Can we re-design collective communication algorithms in a power-aware manner taking into consideration the nature of the collective operation? What is the impact on performance? What architectural features can allow for more power savings with smaller performance overheads? 19

20 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 20

21 Design Space of Power-Aware Algorithms Default (No Power Savings): Run each core at peak frequency/throttling state. Frequency Scaling Only: Dynamically detect communication phases. Treat them as a black-box and scale the CPU frequency Proposed: Consider the communication characteristics of different collectives, intelligently use both DVFS and CPU Throttling to deliver fine-grained power-savings 21

22 Proposed Approach Apply DVFS to scale down the frequency of all cores to the lowest level at the start of the collective operation Look for opportunities within the algorithms to use CPU Throttling to specific sets of cores of save more power Reset the state of the cores to the normal power settings while exiting collective operations 22

23 Proposed Power- Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Socket A Intra- Node Phase Socket B DVFS: Scale down the frequency of each core Node1 Node2 NodeN 23

24 Proposed Power- Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Socket A Intra- Node Phase Socket B Node1 Node2 NodeN 24

25 Proposed Power- Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Inter Leader (network) phase Leader Communicator Socket A (T4) Throttle Idle CPU cores Socket B (T7) Node1 Node2 NodeN 25

26 Proposed Power- Aware Shared-Memory Based Collective Algorithms in MVAPICH2 Inter Leader (network) phase Leader Communicator Core-level CPU throttling would Throttle allow leaving leader core at high power state Idle CPU Lesser performance overhead cores Socket A (T4) Socket B (T7) Node1 Node2 NodeN 26

27 Proposed Power- Aware Alltoall Algorithms (Phase 1) Complete all the intra-node exchanges Apply DVFS across all nodes Node1 Node2 NodeN 27

28 Proposed Power- Aware Alltoall Algorithms (Phase 2) Active socket remains at T0 Passive socket powered down to T7 Socket A (T0) Socket B (T7) Node1 Node2 NodeN 28

29 Proposed Power- Aware Alltoall Algorithms (Phase 3) Repeat for the other socket Socket A (T7) Socket B (T0) Node1 Node2 NodeN 29

30 Proposed Power- Aware Alltoall Algorithms (Phase 4(a)) Choose a pair of nodes i and j (i < j) Schedule communication and throttle sockets in the following manner Socket Ai (T0) Socket Aj (T7) Socket Bj (T7) Socket Bj (T0) Node i Nodej ( i < j) 30

31 Proposed Power- Aware Alltoall Algorithms (Phase 4(b)) Socket A (T7) Socket Aj (T0) Socket B (T0) Socket Bj (T7) Node i Nodej ( i < j) 31

32 Proposed Power- Aware Alltoall Algorithms (Phase 4(b)) Socket A (T7) Socket Aj (T0) Only half the cores communicate at one time CPU throttling leads to better power savings Socket B (T0) Socket Bj (T7) Node i Nodej ( i < j) 32

33 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 33

34 Experimental Setup Compute platforms Intel Nehalem Intel Xeon E5530 Dual quad-core processors operating at 2.40 GHz 12GB RAM, 8MB cache PCIe 2.0 interface Network Equipments MT26428 QDR ConnectX HCAs 36-port Mellanox QDR switch used to connect all the nodes Red Hat Enterprise Linux Server release 5.3 (Tikanga) OFED MASTECH MS2205 Clamp Power Meter to measure instantaneous power consumption 34

35 Experimental Setup (Cont) MVAPICH2 A High Performance MPI implementation over InfiniBand and other RDMA networks (v1.5.1) Used by more than 1255 organizations world-wide Benchmarks Micro-Benchmarks: OSU Microbenchmark Suite Application Benchmarks: CPMD and NAS Estimated power savings with applications based on observations with micro-benchmarks. 35

36 Latency (us) Alltoall Latency with InfiniBand Blocking and Polling Progression Modes Alltoall-Polling Alltoall-Blocking K 16K 32K 64K 128K 256K 512K 1M Message Length (Bytes) Blocking mode performs worse than polling mode for Alltoall 36

37 Power Consumption (KW) Alltoall Power Consumption with InfiniBand Blocking and Polling Progression Modes Alltoall-Polling Alltoall-Blocking Execution Time (s) Blocking mode does allow for instantaneous power savings when compared to the polling mode 37

38 8K 16K 32K 64K 128K 256K 512K 1M 2M 8K 16K 32K 64K 128K 256K 512K 1M Latency (us) Latency (us) Latency for Collectives with Different Power Conservation Techniques Alltoall Alltoall-Default Alltoall-Freq-Scaling Alltoall-Proposed % Broadcast Bcast-Default Bcast-Freq-Scaling Bcast-Proposed 11% Message Size (Bytes) Message Size (Bytes) Additional performance impact due to CPU throttling is minimal 38

39 Power Consumption (KW) Power Consumption (KW) Power Consumption for Collectives with Different Power Conservation Techniques 2.5 Alltoall 2.5 Broadcast 2 25% 33% 2 18% 28% Alltoall-Default 1 Bcast-Default Alltoall-Freq-Scaling Alltoall-Proposed Bcast-Freq-Scaling Bcast-Proposed Execution Time (s) Execution Time (s) More power savings with proposed approach 39

40 Application Run-Time (s) Power Consumption (KJ) 300 CPMD with 32 and 64 Processes (ta-inp-md Data set) Processes 64 Processes Estimated power savings - 7.7% Performance degradation - 2.6% Default Freq-Scaling Proposed Approach 40

41 Application Run-Time (s) Power Consumption (KJ) NAS FT (Class C) 32 and 64 Processes Processes 64 Processes Estimated power savings - 5% Performance degradation of - 10% Default Freq-Scaling Proposed Approach 41

42 Outline Introduction and Background Motivation Problem Statement Designing Power-Aware Collective Algorithms Experimental Evaluation Conclusions and Future Work 42

43 Conclusions & Future Work Proposed and designed novel power-aware collective algorithms Designs deliver fine-grained power savings with little performance overheads 33% savings in instantaneous power consumption with micro-benchmarks Overall estimated energy savings of up to 8% with applications Core-level CPU throttling could potentially lead to better power-savings with smaller performance overheads Extend these designs to other collective operations and study the potential for saving power at larger scales These designs will be made available in future MVAPICH2 releases 43

Thank you! http://mvapich.cse.ohio-state.

44 Thank you! {kandalla, mancini, surs, Network-Based Computing Laboratory, Ohio State University 44

45 Application Run-Time (s) Power Consumption (KJ) CPMD with 32 and 64 Processes (Wat-32-inp-2 Data set) Default Freq-Scaling Proposed Approach Processes 64 Processes About 3% power savings with a performance degradation of 2.8% with 32 processes with the proposed approach 45

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of