Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation


Akira Nishida
Department of Computer Science, the University of Tokyo / CREST, JST

Abstract: The recent development of commodity hardware technologies makes building a software distributed shared memory computing environment a more practical approach for scientific computing that requires the repetitive solution of large linear systems. In this study, we build a tightly connected PC cluster with PCI Express and InfiniBand technology for the scalable implementation of parallel iterative linear solvers, and evaluate the performance and bottlenecks of commodity hardware technologies. The scalability of an implementation of parallel sparse matrix computations is evaluated, considering both the local memory bandwidth of the compute nodes and their internode communication bandwidth.

1. Introduction

The progress of commodity hardware has made PC clusters a practical platform for large-scale scientific computing 2),6). For applications such as iterative linear solvers, however, both the memory bandwidth within each node and the internode communication bandwidth limit scalability. In this study we focus on the combination of InfiniBand and PCI Express as the basis of a software distributed shared memory environment, build a cluster using these technologies, and evaluate its performance and bottlenecks.

2. Cluster Hardware

PCI Express, developed by Intel, NEC and other vendors, became available in commodity systems around 2004. While conventional PCI provides on the order of 1GB/s, PCI Express provides 2.5Gb/s per lane and up to 32 lanes, for an aggregate bandwidth of 16GB/s. This makes it possible to exploit the bandwidth of InfiniBand host channel adapters (HCAs). We use the Mellanox Technologies InfiniHost HCA, a single-port 4x DDR InfiniBand adapter (5Gb/s per lane, 20Gb/s per direction; Fig. 1) attached through an x8 PCI Express interface.

Each compute node of the new cluster, IBQ, is a quad Opteron server based on the NVIDIA nForce Professional chipset (Uniwide/Appro, Iwill Uniserver 3346): four 2.2GHz Opteron 848 processors with 1MB of L2 cache each, 16 CPUs in total over the four nodes, and eight 512MB PC3200 DDR SDRAM (ECC Registered) modules per node. The InfiniHost HCA sits in a PCI Express x16 slot and is used for internode communication.

Fig. 1 InfiniHost HCA MHGS18-XT DDR for PCI Express bus from Mellanox Technologies, Inc.

Fig. 2 System diagram of dual Opteron cluster node (DDR400 SDRAM, 64-bit AMD CPUs, nForce Professional 2200/2250, AMD-8132 PCI-X bridge, InfiniHost HCA on PCI Express x16 for internode communication).

The operating system is SuSE Linux 9.3 with the 2.6.11 kernel; the 2.6 Linux kernel supports the NUMA memory architecture of the 64-bit Opteron. Each of the four 4-way Opteron nodes of IBQ is equipped with a Mellanox Technologies single-port x8 PCI Express DDR InfiniHost HCA (MHGS18-XT), and the nodes are connected with the 24-port DDR InfiniBand switch MTS2400 (Fig. 3). For comparison we also use our previously built 2-way, 8-node Opteron cluster IBD 9), whose local memory bandwidth is about 1.5GB/s per CPU. On these clusters we evaluate MPI, OpenMP and SCASH-MPI over InfiniBand.

Fig. 3 Opteron cluster IBQ, connected with InfiniBand switch MTS2400 DDR.

3. Basic Performance Evaluation

SCASH-MPI is implemented on top of MPI. As the MPI implementation for InfiniBand we use MVAPICH 4), which also supports multirail configurations 5). Here we use MVAPICH 0.9.6 with the MHGS18-XT HCAs on the Uniserver 3346 nodes; each node has a single DDR InfiniBand HCA. Figures 4 and 5 show the MPI latency and the bidirectional bandwidth of MVAPICH 0.9.6 on the Opteron/MHGS18-XT configuration, for both intranode and internode communication. The characteristics of intranode MPI communication are discussed in 3).

Fig. 4 MPI latency of MVAPICH 0.9.6 (nodes connected with DDR InfiniHost HCA (Mellanox MHGS18-XT)); intranode and internode latency in microseconds versus message size in bytes.

Fig. 5 MPI bidirectional bandwidth of MVAPICH 0.9.6 (nodes connected with DDR InfiniHost HCA (Mellanox MHGS18-XT)); intranode and internode bandwidth in MB/s versus message size in bytes.

3.1 STREAM Benchmark

We measure the local memory bandwidth of the nodes with the STREAM benchmark 1) on the Uniwide platform; for the MPI runs we also tried LAM and MPICH in addition to MVAPICH. Table 1 lists the four STREAM kernels, and Fig. 6 shows the main loops of the benchmark.

/* --- MAIN LOOP --- repeat NTIMES times --- */
scalar = 3.0;
for (k=0; k<NTIMES; k++) {
    times[0][k] = second();
    for (j=0; j<N; j++)
        c[j] = a[j];
    times[0][k] = second() - times[0][k];
    times[1][k] = second();
    for (j=0; j<N; j++)
        b[j] = scalar*c[j];
    times[1][k] = second() - times[1][k];
    times[2][k] = second();
    for (j=0; j<N; j++)
        c[j] = a[j]+b[j];
    times[2][k] = second() - times[2][k];
    times[3][k] = second();
    for (j=0; j<N; j++)
        a[j] = b[j]+scalar*c[j];
    times[3][k] = second() - times[3][k];
}

Fig. 6 Main loops of STREAM benchmark.

For the shared memory measurements we use the OpenMP version of the benchmark, stream_d_omp.c.
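The per-node scaling reported below depends on how stream_d_omp.c spreads these loops over the four Opteron CPUs. As a rough illustration (not the exact code used in this study), the triad kernel is typically parallelized with OpenMP as follows; the values of N and NTIMES, the use of omp_get_wtime(), and the first-touch initialization are assumptions of this sketch:

/* Minimal sketch of an OpenMP-parallel STREAM triad, in the spirit of
 * stream_d_omp.c; array size, iteration count and timer are assumed. */
#include <stdio.h>
#include <omp.h>

#define N      4000000      /* elements per array (assumed) */
#define NTIMES 10

int main(void)
{
    static double a[N], b[N], c[N];
    double scalar = 3.0, best = 1e30;
    int j, k;

    /* First-touch initialization so that pages are placed near the
     * threads that will use them on a NUMA Opteron node. */
    #pragma omp parallel for
    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    for (k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];   /* Triad: 24 bytes per iteration */
        t = omp_get_wtime() - t;
        if (t < best) best = t;
    }
    printf("Triad bandwidth: %.1f MB/s\n", 24.0 * N / best / 1e6);
    return 0;
}

On a NUMA node such as the 4-way Opteron, where each CPU has its own memory controller, the placement achieved by this first touch is a large part of what the aggregate bandwidth measurement reflects.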

Table 1 STREAM benchmark types.

  Benchmark   Operation               Bytes per iteration
  Copy        a[i] = b[i]             16
  Scale       a[i] = q * b[i]         16
  Add         a[i] = b[i] + c[i]      24
  Triad       a[i] = b[i] + q * c[i]  24

The STREAM benchmark is run with up to four processes per node; the three working arrays of each process total about 91.6MB. Figure 7 shows the speedup of the STREAM benchmark with up to four MPI processes per node, and Figs. 8 and 9 show the results with up to two processes per node over InfiniBand and over Gigabit Ethernet, respectively. With GbE the measured MPI latency is 24.27us and the bandwidth 226MB/s, far below those of InfiniBand, and the scaling of the MPI STREAM results suffers accordingly. The results also depend on how processes are mapped to CPUs: when the process-to-CPU mapping is specified explicitly and the barrier synchronization between the for loops of Fig. 6 is removed, the scalability improves as shown in Fig. 10.

Fig. 7 Speedup of STREAM benchmark results with up to four processes per node.

Fig. 8 Speedup of STREAM benchmark results with up to two processes per node.

Fig. 9 Speedup of STREAM benchmark results with up to two processes per node on GbE.

Fig. 10 Speedup of STREAM benchmark results with up to four processes per node (mapping specified and barrier synchronization removed).
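The mapping used for Fig. 10 is specified explicitly, but the mechanism is not described here; one common way to pin each MPI rank to one of the four Opteron CPUs of a node is sched_setaffinity(), sketched below purely for illustration (the rank-to-CPU formula and the surrounding structure are assumptions, not the setup used in this study):

/* Hypothetical sketch: pin each MPI rank to one CPU of a 4-way node. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(rank % 4, &mask);          /* assumes 4 consecutive ranks per node */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... run the STREAM loops of Fig. 6 here; for the Fig. 10 runs the
     * barrier synchronization between the kernels was removed ... */

    MPI_Finalize();
    return 0;
}

Pinning also keeps each process close to the memory it first touches, which is one reason an explicit mapping can help on a NUMA node.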

4. NAS Parallel Benchmarks

Next we evaluate the clusters with the CG kernel of the NAS Parallel Benchmarks 7), version 3.2, for classes S through C (classes S through B for the OpenMP version). The CG kernel repeatedly solves a sparse linear system with the conjugate gradient method; its main loop is shown in Fig. 11.

vector r,p,q,z;   /* working vectors */
/* solve A*x */
real CGSOLVE(matrix A, vector x, real zeta)
{
    int it;
    real alpha, beta, rho, rho0;
    r = x; p = x; z = 0.;
    rho = x * x;
    for (it=0; it < NITCG; it++) {
        q = A*p;
        alpha = rho/(p * q);
        z += alpha * p;
        rho0 = rho;
        r += -alpha * q;
        rho = r*r;
        beta = rho / rho0;
        p += beta*r;
    }
    r = A*z - x;
    zeta = 1./(x*z);
    x = (1./sqrt(z*z))*z;
    return sqrt(r*r);
}

Fig. 11 Algorithm of NPB CG main loop.

The problem sizes of the CG kernel are n = 1400 for class S, 7000 for class W, 14000 for class A, 75000 for class B and 150000 for class C, with 7, 8, 11, 13 and 15 nonzero elements per row, respectively. We run the MPI version of the NPB CG kernel over the DDR InfiniHost HCAs; the evaluation of the CG kernel on the earlier cluster is reported in 9). Figure 12 shows the results of the MPI CG kernel on IBD with two ports of SDR InfiniHost HCA, and Fig. 13 shows those of the 4-way cluster IBQ with the DDR InfiniBand HCA, up to class C. Figure 14 shows the results of the hybrid version, with up to two MPI processes per node and OpenMP within each process. Results for NPB BT and SP are reported in 8); here we concentrate on the CG kernel.

Fig. 12 Speedup of NPB CG with two ports of SDR InfiniHost HCA (performance in MFLOPS versus number of processes, classes S to C).

Fig. 13 Speedup of NPB CG (MPI version) with DDR InfiniHost HCA (performance in MFLOPS versus number of processes, classes S to C).

Fig. 14 Speedup of NPB CG (OpenMP version) with DDR InfiniHost HCA, up to two processes per node (performance in MFLOPS versus number of processes, classes S to B).
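In the parallel CG iteration of Fig. 11 each process holds a block of the vectors, and the two dot products per iteration (p*q and r*r) require a global reduction. The following is a minimal sketch of such a reduction with MPI_Allreduce; it illustrates where the interconnect latency enters the kernel and is not the actual NPB CG code (the function name and the block distribution are assumptions):

#include <mpi.h>

/* Global dot product over block-distributed vectors: each rank sums its
 * local part, then MPI_Allreduce combines the partial sums.  Hypothetical
 * helper, not taken from the NPB CG source. */
double cg_dot(const double *x, const double *y, int nlocal)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

Because these reductions carry only a few bytes, the small-message latency of Fig. 4 tends to matter more than the bandwidth of Fig. 5 for the smaller classes.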

5. Conclusion

We have built a 4-way Opteron cluster connected with PCI Express and DDR InfiniBand as a platform for software distributed shared memory, and evaluated its local memory bandwidth and internode communication performance with the STREAM benchmark and the NPB CG kernel, examining how the memory bandwidth of the 4-way nodes and the InfiniBand communication performance affect the scalability of parallel sparse matrix computations.

Acknowledgments  The InfiniBand hardware used in this study was provided with the cooperation of Mellanox Technologies, Inc. This work was supported in part by a Grant-in-Aid for Scientific Research (No. 1616225) and by CREST, JST.

References
1) STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/.
2) Hennessy, J. L. and Patterson, D. A.: Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann (2003).
3) Jin, H.-W., Sur, S., Chai, L. and Panda, D. K.: LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster, Proceedings of the International Conference on Parallel Processing (2005).
4) Liu, J., Chandrasekaran, B., Wu, J., Jiang, W., Kini, S. P., Yu, W., Buntinas, D., Wyckoff, P. and Panda, D. K.: Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics, Proceedings of the International Conference on Supercomputing '03 (2003).
5) Liu, J., Vishnu, A. and Panda, D. K.: Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation, Proceedings of the International Conference on Supercomputing '04 (2004).
6) Pfister, G. F.: In Search of Clusters, Prentice-Hall, second edition (1998).
7) van der Wijngaart, R.: The NAS Parallel Benchmarks 2.4, Technical Report NAS-02-007, NASA (2002).
8) ...: ... MPI ... (in Japanese), Vol. 46, No. SIG 7, pp. 63-73 (2005).
9) ...: ... InfiniBand ... (in Japanese), Vol. 2005, No. 81, pp. 97-102 (2005).