Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters. Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Outline Introduction and Motivation Proposed Design for Hybrid HPL Performance Evaluation Conclusion and Future Work 2

High-End Computing (HEC): PetaFlop to ExaFlop. 100 PFlops in 2015, 1 EFlops in 2018; expected to have an ExaFlop system in 2019-2022! 3

Drivers of Heterogeneous HPC Clusters. Multi-core processors are ubiquitous, and High Performance Linpack (HPL) is used to measure peak performance. NVIDIA GPUs are becoming common in high-end systems, pushing the envelope for heterogeneous computing. Accelerators/coprocessors offer high compute density and high performance/watt, with >1 TFlop DP on a chip. Examples: Tianhe-1A, Stampede, Oakley (OSC), Blue Waters (NCSA). 4

GPU Cluster Configurations. Homogeneous CPU clusters have no GPU accelerators (e.g., BlueGene). Homogeneous GPU clusters have the same configuration on every node (Titan, Keeneland, Wilkes). Heterogeneous GPU clusters combine CPU-only nodes and GPU nodes with a CPU/GPU node ratio > 1, e.g., Oakley@OSC (634 CPU + 64 GPU nodes) and BlueWaters@NCSA (22,500 XE + 4,200 XK nodes). [Node-layout diagrams omitted.] 5

HPL - High Performance Linpack 6

Current Execution of HPL on Heterogeneous GPU Clusters. Current HPL support for GPU clusters handles heterogeneity inside a node (CPU+GPU) but assumes homogeneity across nodes. On a heterogeneous GPU cluster, HPL is therefore run either on the CPU nodes only (using all the CPU cores) or on the GPU nodes only (using CPU+GPU on just those nodes); when the CPU/GPU node ratio is high, the CPU-only result is the one reported. This motivates an HPL design that supports heterogeneity both inside a node (CPU+GPU) and across nodes (nodes with and without GPUs). 7

Outline Introduction and Motivation Proposed Design for Hybrid HPL Two-level Workload Partitioning (Inter-node and Intra-node) Process-grid Reordering Performance Evaluation Conclusion and Future Work 8

Two-Level Workload Partitioning: Inter-node Static Partitioning. The original design distributes the workload uniformly across nodes (e.g., a 2x2 process grid), making the CPU nodes the bottleneck. The new MPI process-scaling based partitioning keeps an identical block size but schedules more MPI processes on the GPU nodes (e.g., a 2x3 grid): MPI_GPU = ACTUAL_PEAK_GPU / ACTUAL_PEAK_CPU + β, subject to NUM_CPU_CORES mod MPI_GPU = 0 so the cores are split evenly; see the sketch below. 9
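
A minimal sketch of this inter-node rule, assuming illustrative function names and a simple round-down policy to satisfy the even-core-split constraint (the exact rounding used in the design is not spelled out on the slide):

```c
/* Sketch of the inter-node static partitioning rule from the slide:
 * schedule roughly ACTUAL_PEAK_GPU / ACTUAL_PEAK_CPU + beta MPI
 * processes on each GPU node, rounded down until the node's cores
 * split evenly among them. Names and rounding policy are assumptions. */
#include <stdio.h>

static int processes_per_gpu_node(double actual_peak_gpu,
                                  double actual_peak_cpu,
                                  double beta,
                                  int num_cpu_cores)
{
    int mpi_gpu = (int)(actual_peak_gpu / actual_peak_cpu + beta);
    if (mpi_gpu < 1)
        mpi_gpu = 1;
    /* Enforce NUM_CPU_CORES mod MPI_GPU == 0 so the cores split evenly. */
    while (mpi_gpu > 1 && num_cpu_cores % mpi_gpu != 0)
        mpi_gpu--;
    return mpi_gpu;
}

int main(void)
{
    /* Example: a GPU node whose measured GPU peak is ~4x its CPU peak. */
    int n = processes_per_gpu_node(1000.0, 250.0, 0.0, 12);
    printf("MPI processes per GPU node: %d (%d cores each)\n", n, 12 / n);
    return 0;
}
```

With these example numbers the sketch places 4 MPI processes on a 12-core GPU node, 3 cores each, while CPU nodes keep the original one-process-per-core layout.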

Two-Level Workload Partitioning: Intra-node Dynamic Partitioning. The MPI-to-device mapping changes from 1:1 in the original design to M:N (M > N) in the new design, where N is the number of GPUs per node and M is the number of MPI processes. The initial split ratio is tuned as alpha = GPU_LEN / (GPU_LEN + CPU_LEN), i.e., the fraction of the trailing update assigned to the GPU. The trade-offs are fewer CPU cores per MPI process and the overhead of scheduling multiple MPI processes on GPU nodes; a sketch of the split follows. 10
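
A minimal sketch of the GPU/CPU split, assuming illustrative names; the throughput-based rebalancing shown here is an assumed heuristic in the spirit of the adaptive split ratio tuning mentioned later, not necessarily the exact rule used in the design:

```c
/* Sketch of the intra-node dynamic split between GPU and CPU work.
 * The initial ratio follows the slide: alpha = GPU_LEN / (GPU_LEN +
 * CPU_LEN). The rebalance step is an assumed heuristic that keeps
 * both sides finishing the trailing-matrix update at the same time. */
#include <stdio.h>

static double initial_alpha(double gpu_len, double cpu_len)
{
    return gpu_len / (gpu_len + cpu_len);
}

/* Re-derive alpha from the measured times of the previous iteration
 * so the GPU and CPU portions would take equal time. */
static double rebalance_alpha(double alpha, double t_gpu, double t_cpu)
{
    double gpu_rate = alpha / t_gpu;          /* columns per second */
    double cpu_rate = (1.0 - alpha) / t_cpu;
    return gpu_rate / (gpu_rate + cpu_rate);
}

int main(void)
{
    int n_cols = 4096;                          /* trailing-matrix width */
    double alpha = initial_alpha(3.0, 1.0);     /* e.g. GPU_LEN=3, CPU_LEN=1 */
    int gpu_cols = (int)(alpha * n_cols);
    printf("alpha=%.2f -> GPU updates %d cols, CPU updates %d cols\n",
           alpha, gpu_cols, n_cols - gpu_cols);
    /* After timing one iteration (say GPU took 0.8 s, CPU took 1.2 s): */
    alpha = rebalance_alpha(alpha, 0.8, 1.2);
    printf("rebalanced alpha=%.2f\n", alpha);
    return 0;
}
```

With GPU_LEN = 3 and CPU_LEN = 1 the initial alpha is 0.75; if the GPU then finishes its share faster than the CPU, the rebalance shifts additional columns to the GPU.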

Process Grid Reordering. With multiple MPI processes per GPU node, the default (blocking) 2 x 3 process grid places the processes as G1 G1 G2 / G2 C1 C2. This adds synchronization overhead to the panel broadcast and yields an unbalanced workload: G1 might get more blocks than G2, and C1 more than C2. The new design reorders the grid to G1 G1 C1 / G2 G2 C2, so every node receives a balanced share of the workload; see the sketch below. 11
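
A minimal sketch of generating such a reordered launch order, assuming row-major rank placement in the P x Q grid and a packing policy (each GPU node's processes first in a row, then CPU processes) chosen to reproduce the 2 x 3 example on the slide:

```c
/* Sketch of process-grid reordering: pack each GPU node's M processes
 * into the same grid row and fill the remaining slots of that row with
 * CPU-node processes, so no node straddles rows. The packing policy and
 * node counts are illustrative assumptions matching the slide example. */
#include <stdio.h>

#define P 2            /* grid rows    */
#define Q 3            /* grid columns */
#define M 2            /* MPI processes per GPU node */

int main(void)
{
    const char *gpu_nodes[] = { "G1", "G2" };   /* one GPU node per row here */
    const char *cpu_nodes[] = { "C1", "C2" };
    const char *order[P * Q];
    int g = 0, c = 0;

    for (int r = 0; r < P; r++) {
        int q = 0;
        /* Pack one GPU node's M processes into this row... */
        for (int k = 0; k < M && q < Q; k++)
            order[r * Q + q++] = gpu_nodes[g];
        g++;
        /* ...then fill the rest of the row with CPU-node processes. */
        while (q < Q)
            order[r * Q + q++] = cpu_nodes[c++];
    }

    printf("Reordered launch order (row-major ranks):\n");
    for (int i = 0; i < P * Q; i++)
        printf("%s%s", order[i], (i + 1) % Q ? " " : "\n");
    return 0;
}
```

The output reproduces the balanced grid from the slide (G1 G1 C1 / G2 G2 C2), in contrast to the default placement G1 G1 G2 / G2 C1 C2.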

Overview of the Hybrid HPL Design. A pre-process analysis phase performs heterogeneity analysis (pure CPU nodes, pure GPU nodes, hybrid CPU+GPU nodes), inter-node static partitioning, and process grid reordering, generating an efficient node topology for the hybrid launcher. During runtime execution, GPU nodes use the m:n MPI-to-device mapping, asynchronous memory copy, and intra-node dynamic partitioning with adaptive split ratio tuning, calling MKL and cuBLAS; CPU nodes call MKL. A sketch of the device mapping follows. 12
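
A minimal sketch of the m:n MPI-to-device mapping on a GPU node, assuming a round-robin assignment of local ranks to devices (the design's actual mapping policy is not detailed on the slide); it uses only standard MPI-3 and CUDA runtime calls:

```c
/* Sketch of mapping m MPI processes on a node to its n GPUs
 * (round-robin assumed). Compile with an MPI compiler wrapper and
 * link against the CUDA runtime. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Determine the local rank on this node via an MPI-3
     * shared-memory communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0) {
        /* m:n mapping: several local ranks may share each GPU. */
        cudaSetDevice(local_rank % num_devices);
        printf("local rank %d -> GPU %d of %d\n",
               local_rank, local_rank % num_devices, num_devices);
    } else {
        printf("local rank %d on a CPU-only node (MKL path)\n", local_rank);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Ranks that find no CUDA device simply take the CPU (MKL-only) path, mirroring the CPU-node branch of the hybrid launcher.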

Outline Introduction and Motivation Proposed Design for Hybrid HPL Performance Evaluation Conclusion and Future Work 13

Experimental Setup. MPI library: MVAPICH2, a high-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE), with support for heterogeneous GPU clusters; used by more than 2,150 organizations (HPC centers, industry, and universities) in 72 countries. http://mvapich.cse.ohio-state.edu/ 14

Actual Peak Performance. Configurations (CPU/GPU node ratio = 4): 1G-CONFIG-A has 8 GPU nodes (1 GPU accelerator each) + 32 CPU nodes; 1G-CONFIG-Oakley has 32 GPU nodes (1 GPU accelerator each) + 128 CPU nodes; 2G-CONFIG-Oakley has 32 GPU nodes (2 GPU accelerators each) + 128 CPU nodes. [Figure: actual peak performance in GFLOPS of PURE_GPU, PURE_CPU, and OSU-HYBRID on the three configurations.] The hybrid design outperforms PURE_CPU by 32%. R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko and D. K. Panda, "A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters," IEEE Cluster (Cluster '13), Best Student Paper Award. 15

Efficiency. [Figure: THEO_PEAK, ACTUAL_PEAK, and OSU-HYBRID performance in GFLOPS for PURE_GPU, PURE_CPU, and HYBRID on the 1G-CONFIG and 2G-CONFIG setups.] 1G-CONFIG: PPE = 88%, TPE = 63%. 2G-CONFIG: PPE = 85%, TPE = 51%. PPE: Peak Performance Efficiency; TPE: Theoretical Performance Efficiency. 16
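
One plausible reading of these two metrics, consistent with the conclusion's "80% of the combined actual peak" statement but not spelled out on the slide, is:

```latex
% Assumed definitions (not explicit in the slides): PPE compares the
% hybrid result with the combined measured peaks of the pure CPU and
% pure GPU runs; TPE compares it with the theoretical peak.
\mathrm{PPE} = \frac{P_{\mathrm{hybrid}}}{P_{\mathrm{actual}}^{\mathrm{CPU}} + P_{\mathrm{actual}}^{\mathrm{GPU}}},
\qquad
\mathrm{TPE} = \frac{P_{\mathrm{hybrid}}}{P_{\mathrm{theoretical}}}
```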

Peak Performance Efficiency (PPE) Scalability. [Figure: three panels of performance efficiency (%) for the 1g-config and 2g-config setups, plotted against 1-16 GPU nodes (with CPU nodes = 16, and with CPU/GPU ratio = 4) and against 1-16 CPU nodes (with GPU nodes = 4).] PPE remains constant for fixed CPUs, fixed GPUs, and a fixed ratio. 17

Outline Introduction and Motivation Proposed Design for Hybrid HPL Performance Evaluation Conclusion and Future Work 18

Conclusion. We propose a novel approach to HPL on GPU clusters that handles heterogeneity in both the intra-node and inter-node configuration, achieving 80% of the combined actual peak performance of the pure CPU and pure GPU nodes. We are studying the impact of such designs on other benchmarks and applications. 19

Thank You! panda@cse.ohio-state.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page http://mvapich.cse.ohio-state.edu/ 20