Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters
Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University
Presented by: Hari Subramoni

Outline
Introduction and Background
Motivation
Problem Statement
Designing Power-Aware Collective Algorithms
Experimental Evaluation
Conclusions and Future Work

Introduction
Scientific research is being driven by multi-core processors and high-speed networks.
Supercomputers typically comprise hundreds of thousands of compute cores.
The power consumed by such systems has also increased sharply.
Power is deemed one of the major challenges in designing next-generation exascale systems.

[Chart] Average Power Consumption of Current Generation Supercomputers (kW): average power of Top500, Top 50, and Top 10 systems, based on the power measurements of the LINPACK benchmark as reported on the Top500 site.

Introduction
Modern architectures offer fine-grained schemes for saving power during idle periods: Dynamic Voltage and Frequency Scaling (DVFS) and CPU throttling.
Power conservation techniques are typically associated with a delay, leading to performance overheads.
Broad challenge: is it possible to design software and middleware stacks in a power-aware manner to minimize overall system power consumption with negligible performance overheads?

Power Saving Innovations in Current Generation Processors and Networks
Processors (Intel Nehalem):
DVFS: dynamically scale the frequency and voltage of the CPU; core-level DVFS; CPU frequency range 1.6 to 2.4 GHz.
CPU throttling: insert brief idle periods to save power; socket-level CPU throttling; multiple throttling states (T0 through T7), where T0 corresponds to 100% CPU activity and T7 to 12% CPU activity.
Networks (InfiniBand):
Allow offloading significant parts of communication.
Offer polling and blocking modes of message progression.
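
To make these knobs concrete, here is a minimal sketch (not from the slides) of applying DVFS from user space through the Linux cpufreq sysfs interface; it assumes the "userspace" governor is active and sufficient permissions, and the helper name is hypothetical. T-state throttling is exposed through platform-specific interfaces (e.g. ACPI) and is omitted here.

```c
/* Illustrative sketch: scale one core's frequency via the Linux cpufreq
 * sysfs interface. Assumes the "userspace" governor and write permission;
 * the helper name is hypothetical, not an MVAPICH2 API. */
#include <stdio.h>

int set_core_frequency_khz(int core, long freq_khz)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    f = fopen(path, "w");
    if (!f)
        return -1;                  /* no cpufreq support or no permission */
    fprintf(f, "%ld\n", freq_khz);  /* e.g. 1600000 for 1.6 GHz */
    fclose(f);
    return 0;
}
```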

State of the Art in Power Conservation Techniques
Where can we conserve power? Communication phases.
Cameron et al. proposed and demonstrated the utility of PowerPack.
Lowenthal et al. dynamically detect communication phases and scale the CPU frequency to save power.
Liu et al. studied power consumption with RDMA-based networks, comparing blocking and polling modes.

Collective Communication in MVAPICH2
Scientific parallel applications spend a considerable amount of time in collective communication operations.
Multi-core-aware and network-topology-aware algorithms optimize the communication costs.
Current power saving methods treat communication phases as a black box and use DVFS.
Is it possible to re-design collective communication algorithms to deliver fine-grained power savings with very little performance overhead?

Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2
[Diagram: Node1 through NodeN, each with Socket A and Socket B]
Intra-node phase: cores within each node exchange data through shared memory.
Inter-leader (network) phase: one leader core per node communicates over the leader communicator while the remaining CPU cores sit idle.
Shared-memory algorithms significantly improve performance, but many cores remain idle during collective operations, which is not power efficient.
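
To illustrate the two-phase structure, the sketch below organizes a leader-based broadcast with standard MPI calls; it is only a rough illustration (it uses MPI-3's MPI_Comm_split_type for node detection and assumes the data originates at world rank 0), not MVAPICH2's internal implementation.

```c
/* Sketch of a leader-based (hierarchical) broadcast from world rank 0:
 * an inter-leader phase over the network, then an intra-node fan-out
 * over shared memory. Illustrative only. */
#include <mpi.h>

void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Group ranks that share a node (MPI-3 feature, used here for brevity). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per node joins the leader communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Inter-leader (network) phase: only the leaders move data between nodes;
     * the non-leader cores have nothing to do until the fan-out below. */
    if (node_rank == 0)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Intra-node phase: each leader fans the data out within its node. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```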

MPI Collective Performance: Total vs. Network Time
[Charts: Broadcast and Reduce latency (us) vs. message size (bytes), showing total time and network-phase time]
The network phase dominates the total transfer time, and the non-leader cores sit idle, wasting power.

MPI Collective Performance: Impact of Network Congestion
[Chart: Alltoall latency (us) vs. message size (bytes) for 8-way and 4-way configurations]
With the same system size, the 8-way (4-node) configuration performs about 50% worse than the 4-way (8-node) configuration: performance suffers when more cores are simultaneously involved in network transfers.

Problem Statement
Existing designs use DVFS to save power without considering the nature of the collective algorithms.
Modern architectures allow DVFS and CPU throttling operations to be performed within a few microseconds.
Can we re-design collective communication algorithms in a power-aware manner, taking into consideration the nature of the collective operation?
What is the impact on performance?
What architectural features can allow for more power savings with smaller performance overheads?

Design Space of Power-Aware Algorithms
Default (no power savings): run each core at the peak frequency/throttling state.
Frequency scaling only: dynamically detect communication phases, treat them as a black box, and scale the CPU frequency.
Proposed: consider the communication characteristics of different collectives and intelligently use both DVFS and CPU throttling to deliver fine-grained power savings.

Proposed Approach
Apply DVFS to scale down the frequency of all cores to the lowest level at the start of the collective operation; a sketch of this enter/exit pattern follows below.
Look for opportunities within the algorithms to apply CPU throttling to specific sets of cores to save more power.
Reset the state of the cores to the normal power settings when exiting the collective operation.
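
A minimal sketch of the enter/exit pattern, assuming the set_core_frequency_khz() helper from the earlier snippet; the frequency constants and the wrapper itself are illustrative, not the actual MVAPICH2 code path.

```c
/* Sketch: wrap a collective with a DVFS scale-down on entry and a reset on
 * exit. set_core_frequency_khz() is the illustrative helper shown earlier;
 * the constants and the my_core argument are assumptions. */
#include <mpi.h>

int set_core_frequency_khz(int core, long freq_khz);   /* from earlier sketch */

#define MIN_FREQ_KHZ 1600000L   /* lowest frequency on the Nehalem testbed */
#define MAX_FREQ_KHZ 2400000L   /* nominal frequency */

int power_aware_bcast(void *buf, int count, MPI_Datatype dtype,
                      int root, MPI_Comm comm, int my_core)
{
    int rc;

    /* Enter: drop the calling core to the lowest frequency. */
    set_core_frequency_khz(my_core, MIN_FREQ_KHZ);

    /* Body: the collective itself; selective CPU throttling of idle cores
     * would be applied inside the algorithm (see the following slides). */
    rc = MPI_Bcast(buf, count, dtype, root, comm);

    /* Exit: restore the normal power settings before returning. */
    set_core_frequency_khz(my_core, MAX_FREQ_KHZ);
    return rc;
}
```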

Proposed Power-Aware Shared-Memory Based Collective Algorithms in MVAPICH2
[Diagram: Node1 through NodeN, each with Socket A and Socket B]
Intra-node phase: DVFS scales down the frequency of every core.
Inter-leader (network) phase: the idle CPU cores are throttled while the leaders communicate over the leader communicator (leader socket A at T4, socket B at T7).
Core-level CPU throttling would allow the leader core to stay at a high power state, giving a smaller performance overhead. A sketch of this throttle placement follows below.
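
The sketch below shows where such throttling calls could sit in the leader-based broadcast sketched earlier; set_socket_tstate() is a hypothetical helper (T-state control is platform-specific, e.g. via ACPI), and the T4/T7 levels follow the slide.

```c
/* Sketch of a power-aware inter-leader phase. set_socket_tstate() is a
 * hypothetical, platform-specific helper (T0 = full activity, T7 = ~12%). */
#include <mpi.h>

int set_socket_tstate(int socket, int tstate);   /* assumed helper */

void power_aware_leader_phase(void *buf, int count, MPI_Datatype dtype,
                              int node_rank, int my_socket,
                              MPI_Comm node_comm, MPI_Comm leader_comm)
{
    if (node_rank == 0) {
        /* Leader: its socket is throttled moderately (T4); the network
         * transfer is largely offloaded to the InfiniBand HCA. */
        set_socket_tstate(my_socket, 4);
        MPI_Bcast(buf, count, dtype, 0, leader_comm);
    } else {
        /* Non-leaders: throttle their socket hard (T7) while they wait,
         * inside the node-level broadcast, for the network phase to end. */
        set_socket_tstate(my_socket, 7);
    }

    /* Intra-node fan-out; non-leaders block here until the leader arrives. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    /* Restore full activity (T0) before returning to the application. */
    set_socket_tstate(my_socket, 0);
}
```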

Proposed Power-Aware Alltoall Algorithms
Phase 1: complete all the intra-node exchanges and apply DVFS across all nodes.
Phase 2: the active socket remains at T0 while the passive socket is powered down to T7 (Socket A at T0, Socket B at T7).
Phase 3: repeat for the other socket (Socket A at T7, Socket B at T0).
Phase 4: choose a pair of nodes i and j (i < j) and schedule the inter-node communication in two steps, throttling the sockets as follows:
Phase 4(a): socket A of node i and socket B of node j are active (T0); socket B of node i and socket A of node j are throttled (T7).
Phase 4(b): socket B of node i and socket A of node j are active (T0); socket A of node i and socket B of node j are throttled (T7).
Only half the cores communicate at any one time, so CPU throttling leads to better power savings. (A rough scheduling sketch follows below.)
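
A rough sketch, under stated assumptions, of how the phase-4 pairwise schedule could be expressed; exchange_with_node() is a hypothetical data-movement routine and the synchronization between steps is omitted, so this only illustrates the throttle placement, not the full MVAPICH2 algorithm.

```c
/* Sketch of the phase-4 schedule: for each pair of nodes (i < j), two steps
 * are run, and in each step only one socket per node is at T0, so only half
 * the cores drive the network at a time. Helpers are assumed; barriers
 * between steps are omitted for brevity. */
void exchange_with_node(int peer_node, int my_socket);   /* assumed helper */
int  set_socket_tstate(int socket, int tstate);          /* assumed helper */

void power_aware_alltoall_phase4(int my_node, int my_socket, int num_nodes)
{
    for (int peer = 0; peer < num_nodes; peer++) {
        if (peer == my_node)
            continue;
        for (int step = 0; step < 2; step++) {
            /* Step 0 pairs socket A of the lower node with socket B of the
             * higher node (phase 4(a)); step 1 mirrors it (phase 4(b)). */
            int active_socket = (my_node < peer) ? step : 1 - step;

            set_socket_tstate(my_socket, my_socket == active_socket ? 0 : 7);
            if (my_socket == active_socket)
                exchange_with_node(peer, my_socket);
        }
    }
    set_socket_tstate(my_socket, 0);   /* restore full activity */
}
```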

Experimental Setup
Compute platform: Intel Nehalem, with dual quad-core Intel Xeon E5530 processors operating at 2.40 GHz, 12 GB RAM, 8 MB cache, and a PCIe 2.0 interface.
Network equipment: MT26428 QDR ConnectX HCAs; a 36-port Mellanox QDR switch connects all the nodes.
Software: Red Hat Enterprise Linux Server release 5.3 (Tikanga), OFED-1.4.2.
Power measurement: MASTECH MS2205 clamp power meter for instantaneous power consumption.

Experimental Setup (cont.)
MVAPICH2: a high-performance MPI implementation over InfiniBand and other RDMA networks (v1.5.1), http://mvapich.cse.ohio-state.edu/, used by more than 1,255 organizations worldwide.
Benchmarks: micro-benchmarks from the OSU Microbenchmark Suite; application benchmarks CPMD and NAS.
Power savings with applications are estimated based on observations with the micro-benchmarks.

Alltoall Latency with InfiniBand Blocking and Polling Progression Modes
[Chart: Alltoall latency (us) vs. message length (8 KB to 1 MB) for polling and blocking modes]
Blocking mode performs worse than polling mode for Alltoall.

Alltoall Power Consumption with InfiniBand Blocking and Polling Progression Modes
[Chart: instantaneous power consumption (kW) over execution time (s) for polling and blocking modes]
Blocking mode does allow for instantaneous power savings compared to the polling mode.
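
For context on what the two progression modes mean at the verbs level, here is a minimal sketch contrasting busy-polling a completion queue with sleeping on a completion channel; it assumes a CQ created with an associated completion channel and is not MVAPICH2 code.

```c
/* Sketch: polling vs. blocking completion with InfiniBand verbs. Assumes
 * the CQ was created with a completion channel; error handling and event
 * batching are simplified. */
#include <infiniband/verbs.h>

/* Polling mode: spin on the CQ; lowest latency, but the core stays busy. */
void wait_polling(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;   /* busy-wait until a work completion arrives */
}

/* Blocking mode: sleep in the kernel until the HCA raises a completion
 * event; the core can idle, trading some latency for power. */
void wait_blocking(struct ibv_cq *cq, struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    ibv_req_notify_cq(cq, 0);                     /* arm the notification  */
    ibv_get_cq_event(channel, &ev_cq, &ev_ctx);   /* blocks until an event */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_poll_cq(ev_cq, 1, &wc);                   /* drain the completion  */
}
```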

Latency for Collectives with Different Power Conservation Techniques
[Charts: Alltoall and Broadcast latency (us) vs. message size (bytes) for the Default, Freq-Scaling, and Proposed schemes; annotations: 14% (Alltoall), 11% (Broadcast)]
The additional performance impact due to CPU throttling is minimal.

Power Consumption for Collectives with Different Power Conservation Techniques
[Charts: power consumption (kW) over execution time (s) for the Default, Freq-Scaling, and Proposed schemes; annotations: 25% and 33% (Alltoall), 18% and 28% (Broadcast)]
The proposed approach yields more power savings.

CPMD with 32 and 64 Processes (ta-inp-md data set)
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
Estimated power savings: 7.7%; performance degradation: 2.6%.

NAS FT (Class C) with 32 and 64 Processes
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
Estimated power savings: 5%; performance degradation: 10%.

Conclusions & Future Work
Proposed and designed novel power-aware collective algorithms.
The designs deliver fine-grained power savings with little performance overhead: up to 33% savings in instantaneous power consumption with micro-benchmarks, and overall estimated energy savings of up to 8% with applications.
Core-level CPU throttling could potentially lead to better power savings with smaller performance overheads.
Future work: extend these designs to other collective operations and study the potential for saving power at larger scales.
These designs will be made available in future MVAPICH2 releases.

Thank you!
http://mvapich.cse.ohio-state.edu
{kandalla, mancini, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory, The Ohio State University

CPMD with 32 and 64 Processes (Wat-32-inp-2 data set)
[Charts: application run-time (s) and power consumption (kJ) for 32 and 64 processes, comparing Default, Freq-Scaling, and the Proposed Approach]
About 3% power savings with a performance degradation of 2.8% at 32 processes with the proposed approach.