Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand
Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
Network-Based Computing Laboratory, Department of Computer Science & Engineering, The Ohio State University

Introduction
Computer systems have increased significantly in processing capability over the last few years in various ways: multi-core architectures are becoming more prevalent, and high-speed I/O interfaces such as PCI-Express have enabled high-speed interconnects such as InfiniBand to deliver higher performance. The area that has improved the least during this time is the memory controller.

Traditional Memory Design
Traditional memory controller designs have limited the number of DIMMs per memory channel as signal rates have increased. Because of the high pin count (240 pins) required for each channel, adding channels is costly. The end result has been equal or lower memory capacity in recent years.

Fully-Buffered DIMMs (FB-DIMMs)
[Image courtesy of Intel Corporation]
FB-DIMM uses serial lanes with a buffer chip on each DIMM to eliminate this tradeoff. Each channel requires only 69 pins, so, for example, six FB-DIMM channels need roughly 6 x 69 = 414 pins, fewer than two traditional 240-pin channels. Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism.

Evaluation
With multi-core systems coming, a scalable memory subsystem is increasingly important. Our goal is to compare FB-DIMM against a traditional design and evaluate its scalability.
Evaluation process: test the memory subsystem on a single node, then evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs).

Outline
Introduction & Goals
Memory Subsystem Evaluation: experimental testbed, latency and throughput
Network results
Conclusions and Future work

Evaluation Testbed
Intel Bensley system: two 3.2 GHz dual-core Intel Xeon Dempsey processors, FB-DIMM-based memory subsystem.
Intel Lindenhurst system: two 3.4 GHz Intel Xeon processors, traditional memory subsystem (2 channels).
Both systems contain two 8x PCI-Express slots, DDR2-533 memory, and two dual-port Mellanox MT25208 InfiniBand HCAs.

Bensley Memory Configurations
[Diagram: two branches (Branch 0, Branch 1), each with two channels (Channels 0-3); each channel has four DIMM slots (Slots 0-3), serving Processor 0 and Processor 1.]
The FB-DIMM standard allows up to 6 channels with 8 DIMMs per channel, for 192 GB. Our systems have 4 channels, each with 4 DIMM slots. To fill 4 DIMM slots there are 3 possible combinations (spreading the DIMMs across 1, 2, or 4 channels, as evaluated below).

Subsystem Evaluation Tool
lmbench 3.0-a5: an open-source benchmark suite for evaluating system-level performance.
Latency: memory read latency benchmark.
Throughput: memory read, memory write, and memory copy benchmarks.
Aggregate performance is obtained by running multiple long-running processes and reporting the sum of their averages.
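
To make the aggregation methodology concrete, here is a minimal C sketch (not lmbench itself) that forks several reader processes, measures each one's average read throughput over a private buffer, and reports the sum of the per-process averages; the buffer size, iteration count, and output format are assumptions.

    /* aggregate_bw.c: simplified sketch of the "sum of averages" methodology
     * described above -- NOT lmbench.  Each forked worker streams through a
     * private buffer, reports its average read bandwidth through a pipe, and
     * the parent sums the per-process averages. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <time.h>

    #define BUF_BYTES   (256UL << 20)   /* 256 MB per worker (assumption) */
    #define ITERATIONS  16              /* passes over the buffer (assumption) */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Read every cache line of the buffer; return average MB/s over ITERATIONS. */
    static double read_bandwidth(volatile char *buf)
    {
        unsigned long sum = 0;
        double start = now_sec();
        for (int it = 0; it < ITERATIONS; it++)
            for (unsigned long i = 0; i < BUF_BYTES; i += 64)
                sum += buf[i];
        double elapsed = now_sec() - start;
        if (sum == 42) printf(" ");     /* keep the reads from being optimized away */
        return (double)ITERATIONS * BUF_BYTES / (1 << 20) / elapsed;
    }

    int main(int argc, char **argv)
    {
        int nproc = (argc > 1) ? atoi(argv[1]) : 1;  /* number of concurrent readers */
        int fd[2];
        pipe(fd);

        for (int p = 0; p < nproc; p++) {
            if (fork() == 0) {                       /* worker process */
                char *buf = malloc(BUF_BYTES);
                memset(buf, 1, BUF_BYTES);           /* touch pages before timing */
                double mbs = read_bandwidth(buf);
                write(fd[1], &mbs, sizeof(mbs));
                _exit(0);
            }
        }

        double total = 0.0, mbs;
        for (int p = 0; p < nproc; p++) {            /* sum of per-process averages */
            read(fd[0], &mbs, sizeof(mbs));
            total += mbs;
        }
        while (wait(NULL) > 0) ;
        printf("%d process(es): aggregate read throughput = %.1f MB/s\n", nproc, total);
        return 0;
    }

Running it with 1, 2, and 4 as the argument mirrors the 1-, 2-, and 4-process measurements reported below.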

Bensley Memory Throughput
[Charts: aggregate read, write, and copy throughput (MB/sec) vs. number of processes (1, 2, 4) for 1, 2, and 4 channels.]
To study the impact of additional channels we evaluated configurations using 1, 2, and 4 channels. Throughput increases significantly from one to two channels for all operations.

Access Latency Comparison
[Chart: read latency (ns), unloaded vs. loaded, for Bensley with 4 GB, 8 GB, and 16 GB and Lindenhurst with 2 GB and 4 GB.]
"Loaded" means a memory read throughput test is run in the background while the latency test is running. From unloaded to loaded, latency increases by 40% on Lindenhurst but only 10% on Bensley.
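
The loaded-latency methodology above can be sketched as follows; this is an illustrative C program, not the lmbench code, and the chain size, background-buffer size, and hop count are assumptions. A pointer chase measures dependent read latency, and passing "loaded" forks a child that streams through a large buffer in the background.

    /* loaded_latency.c: sketch of the unloaded vs. loaded latency methodology. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/wait.h>
    #include <time.h>

    #define CHAIN_BYTES (64UL << 20)          /* 64 MB pointer chain (assumption)    */
    #define LOAD_BYTES  (256UL << 20)         /* background load buffer (assumption) */
    #define NODES       (CHAIN_BYTES / sizeof(void *))
    #define HOPS        (20 * 1000 * 1000UL)  /* dependent loads to time             */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(int argc, char **argv)
    {
        int loaded = (argc > 1 && strcmp(argv[1], "loaded") == 0);
        pid_t bg = 0;

        if (loaded && (bg = fork()) == 0) {   /* background memory-read load */
            volatile char *load = malloc(LOAD_BYTES);
            memset((char *)load, 1, LOAD_BYTES);
            for (;;)                           /* stream reads until the parent kills us */
                for (unsigned long i = 0; i < LOAD_BYTES; i += 64)
                    (void)load[i];
        }

        /* Build a shuffled pointer chain so each load depends on the previous one. */
        void **chain = malloc(CHAIN_BYTES);
        unsigned long *order = malloc(NODES * sizeof(unsigned long));
        for (unsigned long i = 0; i < NODES; i++) order[i] = i;
        srand(1);
        for (unsigned long i = NODES - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            unsigned long j = rand() % (i + 1);
            unsigned long t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (unsigned long i = 0; i < NODES; i++)
            chain[order[i]] = &chain[order[(i + 1) % NODES]];

        void **p = &chain[order[0]];
        double start = now_sec();
        for (unsigned long i = 0; i < HOPS; i++)
            p = (void **)*p;                  /* dependent load: cannot overlap */
        double ns = (now_sec() - start) / HOPS * 1e9;
        if (!p) return 1;                     /* use p so the chase is not optimized out */
        printf("%s read latency: %.1f ns per load\n", loaded ? "loaded" : "unloaded", ns);

        if (bg > 0) { kill(bg, SIGKILL); wait(NULL); }
        return 0;
    }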

Memory Throughput Comparison
[Charts: aggregate read, write, and copy throughput (MB/sec) vs. number of processes (1, 2, 4) for Bensley with 4 GB, 8 GB, and 16 GB and Lindenhurst with 2 GB and 4 GB.]
Comparison of the Lindenhurst and Bensley platforms with increasing memory size. Performance increases with two concurrent read or write operations on the Bensley platform.

Outline
Introduction & Goals
Memory Subsystem Evaluation: experimental testbed, latency and throughput
Network results
Conclusions and Future work

OSU MPI over InfiniBand
Open-source, high-performance implementations: MPI-1 (MVAPICH) and MPI-2 (MVAPICH2).
They have enabled a large number of production IB clusters all over the world to take advantage of InfiniBand, the largest being the Sandia Thunderbird cluster (4512 nodes with 9024 processors).
They have been directly downloaded and used by more than 395 organizations worldwide (in 30 countries).
Time-tested and stable code base with novel features.
Available in the software stack distributions of many vendors, and in the OpenFabrics (OpenIB) Gen2 stack and OFED.
More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

Experimental Setup
[Diagram: round-robin vs. process-binding assignment of processes P0-P3 to HCA 0 and HCA 1 on the two nodes.]
The evaluation uses two InfiniBand DDR HCAs per node, exercising the multi-rail feature of MVAPICH. Results with one process use both rails in a round-robin pattern (see the sketch below); the 2- and 4-process-pair results use a process-binding assignment of processes to HCAs.
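
As referenced above, here is a conceptual C sketch of round-robin rail selection for a single process; the rail structure and selection function are hypothetical and only illustrate the scheduling idea, not MVAPICH's actual multi-rail code.

    /* rail_rr.c: conceptual round-robin scheduling over multiple rails.
     * Types and functions are hypothetical. */
    #include <stdio.h>

    #define NUM_RAILS 2            /* two DDR HCAs, as in this testbed */

    struct rail { int hca; int port; };

    static struct rail rails[NUM_RAILS] = { { 0, 1 }, { 1, 1 } };
    static int next_rail = 0;

    /* Pick the next rail in round-robin order, one message per rail. */
    static struct rail *select_rail(void)
    {
        struct rail *r = &rails[next_rail];
        next_rail = (next_rail + 1) % NUM_RAILS;
        return r;
    }

    int main(void)
    {
        for (int msg = 0; msg < 6; msg++) {
            struct rail *r = select_rail();
            printf("message %d -> HCA %d, port %d\n", msg, r->hca, r->port);
        }
        return 0;
    }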

Uni-Directional Bandwidth
[Chart: throughput (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes.]
Comparison of Lindenhurst and Bensley with dual DDR HCAs. Due to its higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages.
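
For reference, the following minimal MPI/C sketch shows a window-based uni-directional bandwidth test of the kind used to produce such numbers; the window size, iteration count, and message-size range are assumptions rather than the exact benchmark parameters.

    /* bw_sketch.c: minimal window-based uni-directional bandwidth test between
     * rank 0 (sender) and rank 1 (receiver).  Parameters are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define WINDOW   64          /* outstanding messages per iteration (assumption) */
    #define ITERS    100
    #define MAX_MSG  (1 << 20)   /* up to 1 MB */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Request req[WINDOW];
        char *buf = malloc(MAX_MSG);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (size_t size = 1; size <= MAX_MSG; size *= 2) {
            MPI_Barrier(MPI_COMM_WORLD);
            double start = MPI_Wtime();

            for (int it = 0; it < ITERS; it++) {
                if (rank == 0) {                   /* sender posts a window of sends */
                    for (int w = 0; w < WINDOW; w++)
                        MPI_Isend(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
                    MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                    MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {            /* receiver posts matching recvs */
                    for (int w = 0; w < WINDOW; w++)
                        MPI_Irecv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
                    MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                    MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);   /* ack */
                }
            }

            if (rank == 0) {
                double secs = MPI_Wtime() - start;
                double mb = (double)size * WINDOW * ITERS / 1e6;
                printf("%8zu bytes: %10.2f MB/s\n", size, mb / secs);
            }
        }

        MPI_Finalize();
        free(buf);
        return 0;
    }

Compiled with an MPI C compiler and run with one rank on each node, it sweeps message sizes from 1 byte to 1 MB.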

Bi-Directional Bandwidth
[Chart: throughput (MB/sec) vs. message size (1 byte to 1 MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes.]
At a 1 KB message size, going from 1 to 2 processes improves throughput by 15% on Lindenhurst versus 75% on Bensley; Bensley gains another 45% from 2 to 4 processes. Lindenhurst's peak bi-directional bandwidth is only about 100 MB/sec greater than its uni-directional bandwidth.

Messaging Rate
[Chart: messaging rate (millions of messages/sec) vs. message size (1 byte to 256 KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes.]
For very small messages both platforms show similar performance. At 512 bytes, the Lindenhurst 2-process rate is only 52% higher than its 1-process rate, while Bensley still shows a 100% improvement.

Outline
Introduction & Goals
Memory Subsystem Evaluation: experimental testbed, latency and throughput
Network results
Conclusions and Future work

Conclusions and Future Work
We performed a detailed analysis of the memory subsystem scalability of the Bensley and Lindenhurst platforms. Bensley shows a significant advantage in throughput scalability and capacity in all measures tested.
Future work: profile real-world applications on a larger cluster to observe the effects of contention on multi-core architectures, and expand the evaluation to include NUMA-based architectures.

Acknowledgements
Our research is supported by the following organizations: [logos of current funding and equipment sponsors shown on the original slide].

Web Pointers
{koop, huanwei, vishnu, panda}@cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu/
MVAPICH web page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/