Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures
Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

On-chip Networks: The Wave of the Future
- Node architectures are changing rapidly, especially with multi-core processors, introducing a new memory hierarchy
- Core-to-core communication
  - Within a socket (on-chip)
  - Across sockets (off-chip)
- Inter-node communication (off-chip) bandwidth
- New challenges in designing on-chip networks
- Also need better software (data placement and optimization) to extract the maximum performance
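The distinction between on-chip and off-chip core-to-core communication can be made concrete by pinning processes to specific cores and measuring between them. Below is a minimal sketch in C using Linux's sched_setaffinity; the core-to-socket mapping in the comments is a hypothetical example, and on a real node it should be read from /proc/cpuinfo or a topology tool such as hwloc.

```c
/* Minimal sketch: pin the calling process to one core so that two
 * communicating processes can be placed either on the same socket
 * (on-chip) or on different sockets (off-chip).  The core<->socket
 * mapping assumed below is illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_core(int core_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv)
{
    /* Assume a dual dual-core node with cores 0,1 on socket 0 and
     * cores 2,3 on socket 1 (hypothetical layout).  Pinning one process
     * to core 0 and its peer to core 1 exercises the on-chip path;
     * pinning the peer to core 2 exercises the off-chip path. */
    int core = (argc > 1) ? atoi(argv[1]) : 0;
    pin_to_core(core);
    printf("process pinned to core %d\n", core);
    return 0;
}
```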

Emerging Clusters with Multi-Level Memory Hierarchy and Different Communication Bandwidth
[Diagram: three node types - a bus-based SMP node (intra-node communication over the memory bus), a NUMA node, and a multicore NUMA node - connected by a network for inter-node communication. Within the multicore NUMA node, each multicore chip/socket holds cores with caches attached to local memory, so intra-node communication is either on-chip (within a socket) or off-chip (across sockets).]

Designing Efficient Communication Middleware
- How critical is the message distribution across the three levels?
- Can we optimize on-chip communication by exploiting the multi-core architecture?
- What are the benefits to the overall application?
- A study with the MPI library and applications
  - Intel and Opteron multi-core systems
  - InfiniBand as the cluster interconnect

Enhancing MPI Library for Efficient Intra-node (Within-Socket and Across-Socket) Communication
- High-performance MPI library for InfiniBand clusters
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  - Used by more than 435 organizations in 30 countries
  - Empowering many TOP500 clusters
  - Available with the software stacks of many InfiniBand and server vendors, including OpenIB and the OpenFabrics Enterprise Distribution (OFED)
  - http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
- Already has good support for intra-node MPI point-to-point communication (with copying) over shared memory
- Recently designed an efficient and scalable scheme, with associated data structures, for emerging multi-core and NUMA systems
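As a rough illustration of what copy-based intra-node message passing over shared memory involves, the sketch below moves one message between two processes through a POSIX shared-memory segment: the sender copies its payload into a shared slot and the receiver copies it out again. The segment name, single 4 KB slot, and spin-wait flag are illustrative simplifications and do not reflect MVAPICH's actual internal data structures.

```c
/* Minimal sketch of copy-based intra-node message passing over shared
 * memory.  Run one instance with the argument "send" and one with "recv"
 * on the same node. */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLOT_SIZE 4096                 /* illustrative slot size */

typedef struct {
    atomic_int full;                   /* 0 = empty, 1 = message ready */
    char       data[SLOT_SIZE];
} shm_slot_t;

int main(int argc, char **argv)
{
    int sender = (argc > 1 && strcmp(argv[1], "send") == 0);
    int fd = shm_open("/intra_node_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(shm_slot_t)); /* newly created segment is zero-filled */
    shm_slot_t *slot = mmap(NULL, sizeof(shm_slot_t),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (sender) {
        /* First copy: sender writes the payload into the shared slot. */
        strcpy(slot->data, "hello from the same node");
        atomic_store(&slot->full, 1);
    } else {
        char local[SLOT_SIZE];
        while (atomic_load(&slot->full) == 0)
            ;                          /* spin until a message arrives */
        /* Second copy: receiver pulls the payload into private memory. */
        memcpy(local, slot->data, SLOT_SIZE);
        printf("received: %s\n", local);
        shm_unlink("/intra_node_demo");
    }
    return 0;
}
```

On older glibc versions, link with -lrt for shm_open.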

Cumulative Message Distribution
[Two plots: cumulative number-of-messages distribution and data-volume distribution (0-100%) vs. message size (64 bytes to 4 MB), each broken down into inter-node, inter-CMP, intra-CMP, and total.]
High Performance Linpack (HPL) on a dual dual-core Opteron cluster, 16 nodes, 64 processes
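A profile like this can be collected without modifying the application, for example by intercepting sends through the standard PMPI profiling interface and binning each message by the destination's locality. The sketch below is a simplified illustration: it assumes the application communicates on MPI_COMM_WORLD, caps the job at 4096 ranks, and only separates intra-node from inter-node traffic; splitting intra-node messages into intra-CMP and inter-CMP would additionally require each peer's socket ID from the node topology.

```c
/* Minimal PMPI-based sketch for classifying MPI_Send traffic by locality.
 * Assumes messages are sent on MPI_COMM_WORLD and at most 4096 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static char peer_host[4096][MPI_MAX_PROCESSOR_NAME]; /* hostname of each rank */
static long n_intra_node, n_inter_node;

int MPI_Init(int *argc, char ***argv)
{
    int ret = PMPI_Init(argc, argv);
    char name[MPI_MAX_PROCESSOR_NAME] = { 0 };
    int  len;
    PMPI_Get_processor_name(name, &len);
    /* Every rank learns which node every other rank runs on. */
    PMPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                   peer_host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                   MPI_COMM_WORLD);
    return ret;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int me;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (strcmp(peer_host[me], peer_host[dest]) == 0)
        n_intra_node++;   /* same node: intra-CMP or inter-CMP */
    else
        n_inter_node++;   /* different node: crosses the network */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int me;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    printf("rank %d: %ld intra-node, %ld inter-node sends\n",
           me, n_intra_node, n_inter_node);
    return PMPI_Finalize();
}
```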

Enhanced Message Passing Design for Improving On-Chip and Off-Chip Communication
[Three plots: small-, medium-, and large-message latency (us) vs. message size (1 byte to 1 MB), comparing the original and new designs for intra-CMP and inter-CMP communication.]
- Dual dual-core AMD Opteron processors, 2.0 GHz, 1 MB L2 cache
- The new design improves
  - inter-CMP latency for all message sizes
  - intra-CMP latency for small and medium messages
L. Chai, A. Hartono and D. K. Panda, "Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters", IEEE Cluster 2006.
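The latency curves above come from ping-pong style microbenchmarks; a minimal sketch of one is shown below, assuming it is compiled against an MPI library and launched with two ranks. Whether it measures intra-CMP, inter-CMP, or inter-node latency depends only on where the launcher places the two ranks. The iteration count and default message size are illustrative.

```c
/* Minimal ping-pong latency sketch: rank 0 and rank 1 bounce a message
 * back and forth; one-way latency is half the averaged round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int   size = (argc > 1) ? atoi(argv[1]) : 8;  /* message size in bytes */
    char *buf  = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size, (t1 - t0) * 1e6 / (2.0 * ITERS));

    free(buf);
    MPI_Finalize();
    return 0;
}
```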

Performance Improvement in Bandwidth and Cache Miss Rates
[Two plots: bandwidth (MB/s) vs. message size (1 byte to 1 MB) for the original and new designs, and L2 cache miss rates (0-14%) for the small/medium/large latency and bandwidth benchmarks, broken down by sender and receiver for the original and new designs.]
- Bandwidth is improved by up to 50%
- The benefits come mainly from the reduced L2 cache miss rate
- High-performance implementations of programming models are needed
- End applications need to optimize data placement for good performance

Inter-node Communication Performance
[Setup diagram: two nodes, each with four processes (P0-P3) and two HCAs (HCA 0 and HCA 1); the four processes on each node communicate concurrently over dual-rail InfiniBand DDR (Mellanox).]
[Two plots: uni-directional and bi-directional bandwidth vs. message size (1 byte to 1 MB) with 4 concurrent processes per node, peaking at 2880 MB/sec uni-directional and 4553 MB/sec bi-directional.]
M. J. Koop, W. Huang, A. Vishnu and D. K. Panda, "Memory Scalability Evaluation of the Next Generation Intel Bensley Platform with InfiniBand", Hot Interconnects Symposium, Aug. 2006.
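These multi-pair numbers come from having every process stream data to a peer on the other node at the same time. The sketch below shows one way to structure such a test, with a window of non-blocking sends and receives per iteration and the assumption that the launcher places the first half of the ranks on one node and the second half on the other; the window size and message size are illustrative, and this is not the exact benchmark used in the cited paper.

```c
/* Minimal multi-pair bi-directional bandwidth sketch.  Launch with an even
 * number of ranks split across two nodes; rank i pairs with rank i+N/2. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64          /* outstanding messages per direction */
#define MSG    (1 << 20)   /* 1 MB messages */
#define ITERS  20

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int peer = (rank < nprocs / 2) ? rank + nprocs / 2 : rank - nprocs / 2;
    char *sbuf = malloc(MSG), *rbuf = malloc(MSG);
    MPI_Request req[2 * WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        /* Post a window of receives and sends (one buffer per direction is
         * reused, as throughput benchmarks commonly do). */
        for (int w = 0; w < WINDOW; w++) {
            MPI_Irecv(rbuf, MSG, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Isend(sbuf, MSG, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[WINDOW + w]);
        }
        MPI_Waitall(2 * WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    /* Each process sends and receives ITERS * WINDOW messages. */
    double mbytes = (double)ITERS * WINDOW * MSG * 2 / 1e6;
    if (rank == 0)
        printf("per-process bi-directional bandwidth: %.1f MB/s\n",
               mbytes / (t1 - t0));

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```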

Inter-node Messaging Rate
- Design based on MVAPICH 0.9.8
- Preliminary performance results
  - Dual dual-core Intel Woodcrest
  - Single DDR card
[Plot: messaging rate (messages/sec) vs. message size (1 byte to 256 KB) for 1, 2, and 4 pairs of communicating processes.]

Open Issues for Enhancing Communication Performance with Multi-core and On-Chip/Off-Chip Interconnects
- Can one or more cores be dedicated to communication?
  - To achieve better overlap of computation and communication
- Can one-sided communication and multi-threading be well supported on multi-core systems?
- Can collective communication (broadcast, multicast, all-to-all, all-reduce, etc.) be implemented efficiently and in a non-blocking manner?
  - To have better overlap of computation with collective communication
- Can we aim for fine-grained synchronization?
- The answers will depend on the capabilities and features of both the on-chip and off-chip networks
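The first two questions revolve around overlapping computation with communication. The sketch below shows the basic pattern such support is meant to accelerate: post non-blocking transfers, compute on data that does not depend on them, and complete the communication afterwards. How much overlap is actually achieved depends on the MPI library's progress engine and on whether cores or the network adapter can drive the transfers in the background; the halo-exchange shape and buffer sizes here are illustrative.

```c
/* Minimal sketch of computation/communication overlap with non-blocking
 * point-to-point operations in a ring-style halo exchange. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *halo_out = calloc(N, sizeof(double));
    double *halo_in  = calloc(N, sizeof(double));
    double *interior = calloc(N, sizeof(double));
    int right = (rank + 1) % nprocs, left = (rank - 1 + nprocs) % nprocs;
    MPI_Request req[2];

    /* 1. Start the exchange (a non-blocking collective could be used instead). */
    MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* 2. Overlap: update interior points that do not need the incoming halo. */
    for (int i = 0; i < N; i++)
        interior[i] = interior[i] * 0.5 + 1.0;

    /* 3. Complete the communication; boundary points can then use halo_in. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    free(halo_out);
    free(halo_in);
    free(interior);
    MPI_Finalize();
    return 0;
}
```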