MVAPICH User's Group Meeting, August 20, 2015


MVAPICH User's Group Meeting, August 20, 2015. LLNL-PRES-676271. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (Lawrence Livermore National Security, LLC).

LLNL mission areas: Bio-Security, Counterterrorism, Defense, Energy, Intelligence, Nonproliferation, Science, Weapons. https://missions.llnl.gov

LLNL production systems (Top500 rank shown where listed):

Unclassified Network (OCF)
System | Top500 Rank | Program | Manufacturer/Model | OS | Interconnect | Nodes | Cores | Memory (GB) | Peak TFLOP/s
Vulcan | 9 | ASC+M&IC+HPCIC | IBM BGQ | RHEL/CNK | 5D Torus | 24,576 | 393,216 | 393,216 | 5,033.2
Sierra | 417 | M&IC | Dell | TOSS | IB QDR | 1,944 | 23,328 | 46,656 | 243.7
Cab (TLCC2) | 145 | ASC+M&IC+HPCIC | Appro | TOSS | IB QDR | 1,296 | 20,736 | 41,472 | 426.0
Ansel | - | M&IC | Dell | TOSS | IB QDR | 324 | 3,888 | 7,776 | 43.5
RZMerl (TLCC2) | - | ASC+ICF | Appro | TOSS | IB QDR | 162 | 2,592 | 5,184 | 53.9
RZZeus | - | M&IC | Appro | TOSS | IB DDR | 267 | 2,144 | 6,408 | 20.6
Catalyst | - | ASC+M&IC | Cray | TOSS | IB QDR | 324 | 7,776 | 41,472 | 149.3
Syrah | - | ASC+M&IC | Cray | TOSS | IB QDR | 324 | 5,184 | 20,736 | 107.8
Surface | - | ASC+M&IC | Cray | TOSS | IB FDR | 162 | 2,592 | 41,500 | 451.9
Aztec | - | M&IC | Dell | TOSS | N/A | 96 | 1,152 | 4,608 | 12.9
Herd | - | M&IC | Appro | TOSS | IB DDR | 9 | 256 | 1,088 | 1.6
OCF totals: 11 systems, 6,544.4 peak TFLOP/s

Classified Network (SCF)
System | Top500 Rank | Program | Manufacturer/Model | OS | Interconnect | Nodes | Cores | Memory (GB) | Peak TFLOP/s
Pinot (TLCC2, SNSI) | - | M&IC | Appro | TOSS | IB QDR | 162 | 2,592 | 10,368 | 53.9
Sequoia | 3 | ASC | IBM BGQ | RHEL/CNK | 5D Torus | 98,304 | 1,572,864 | 1,572,864 | 20,132.7
Zin (TLCC2) | 66 | ASC | Appro | TOSS | IB QDR | 2,916 | 46,656 | 93,312 | 961.1
Juno (TLCC) | - | ASC | Appro | TOSS | IB DDR | 1,152 | 18,432 | 36,864 | 162.2
Muir | - | ICF | Dell | TOSS | IB QDR | 1,296 | 15,552 | 31,104 | 168.0
Graph | - | ASC | Appro | TOSS | IB DDR | 576 | 13,824 | 72,960 | 107.5
Max | - | ASC | Appro | TOSS | IB FDR | 324 | 5,184 | 82,944 | 107.8
Inca | - | ASC | Dell | TOSS | N/A | 100 | 1,216 | 5,120 | 13.5
SCF totals: 8 systems, 21,706.7 peak TFLOP/s

Combined totals: 19 systems, 28,251.1 peak TFLOP/s

Why MVAPICH at LLNL:
- First MPI available for InfiniBand
- Reliable and proven
- Fastest for many users
- Familiarity with the MPICH code base
- Acceptance of feedback and patches
- Good ties and communication with OSU

Interns: Matt Koop, Hari Subramoni, Krishna Kandalla, Raghunath Rajachandrasekar, Sourav Chakraborty
Compute resources: LC Collaborative Zone, Hyperion

Consider reshaping the laser pulse to reduce mixing. Computer simulations using MVAPICH support the theory.

A major step forward on the path to ignition: experiments agree exceptionally well with simulations. "Fuel gain exceeding unity in an inertially confined fusion implosion," Hurricane et al., Nature, Feb. 2014. https://str.llnl.gov/june-2014/hurricane

Option 1: storage in batteries and capacitors
- Short-term storage
- Short distances from the generation source
Option 2: storage in chemical bonds
- Longer-term storage
- Transportable over long distances
- Produces usable fuel
The goal is a carbon-free (or closed-carbon) reaction cycle.

[Diagram: 24-hour H2 production at a Ga0.5In0.5P (001) photocathode; Pt catalyst and oxide/nitride coatings on the GaInP semiconductor; photogenerated electrons reduce H+ from water to H2.]

Brandon Wood, Tadashi Ogitsu, Wooni Choi. DOE's 2014 Hydrogen Production R&D Award, presented by EERE's Fuel Cell Technologies Office. https://str.llnl.gov/april-2015/wood. Simulations at the semiconductor-water interface explain differences in efficiency and corrosion.

Application areas: fusion science, hydrogen fuel, additive manufacturing.

New Linux clusters will replace existing systems:
- Roughly the same number of nodes, with updated hardware (CPUs, memory, network)
- Delivery in 2016 through 2018
- Systems to last about 5 years
- The contract is not finalized, but MVAPICH stands a good chance of continuing as the primary MPI, potentially for many more years

Components:
- IBM POWER NVLink compute node: POWER architecture processor, NVIDIA Volta GPU, NVMe-compatible PCIe 800 GB SSD, > 512 GB DDR4 + HBM, coherent shared memory
- NVIDIA Volta: HBM, NVLink
- Compute rack: standard 19-inch, warm-water cooling
- Compute system: 2.1-2.7 PB memory, 120-150 PFLOPS, 10 MW
- Mellanox interconnect: dual-rail EDR InfiniBand
- GPFS file system: 120 PB usable storage, 1.2/1.0 TB/s read/write bandwidth

The MVAPICH team is already expert with Mellanox and NVIDIA; it just needs a port to POWER. The CORAL systems (Sierra and Summit) will be two of the world's fastest when they come online. Let's get MVAPICH on these systems!

Compute-node storage is driving requirements for asynchronous services. There is a desire to use threads, or a second MPI job running in the background of the main application (see the sketch below). BurstFS is a research prototype by Teng Wang.
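A minimal sketch of the thread-based variant, assuming pthreads and MPI_THREAD_MULTIPLE; the service_thread body and keep_running flag are illustrative placeholders, not the BurstFS design:

/* Sketch: run an asynchronous service thread alongside the main application. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static volatile int keep_running = 1;

/* Hypothetical background thread that could drain I/O to node-local storage. */
static void *service_thread(void *arg)
{
    (void)arg;
    while (keep_running) {
        /* ... poll a work queue and issue writes to the node-local SSD ... */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    /* MPI_THREAD_MULTIPLE lets both the app and the service thread call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");

    pthread_t tid;
    pthread_create(&tid, NULL, service_thread, NULL);

    /* ... main application work and MPI communication ... */

    keep_running = 0;
    pthread_join(tid, NULL);
    MPI_Finalize();
    return 0;
}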

Large LLNL codes take about a day to recompile the application and all of its libraries, which must then be tested and verified. ABI changes break compatibility between versions (MV1 1.2, MV2 1.7, MV2 1.9, MV2 2.1). Example solutions: MPI-Adapter (Japan) or MorphMPI.
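As a small aid when chasing ABI mismatches, an application can report which MPI library it is actually running against; this sketch relies only on the standard MPI-3 call MPI_Get_library_version:

/* Sketch: print the MPI library and version string at startup. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_library_version(version, &len);
    if (rank == 0)
        printf("MPI library: %s\n", version);
    MPI_Finalize();
    return 0;
}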

Make MV2 faster than MV1: some users still report a slowdown when switching to MV2, possibly due to instruction bloat between the two. For example, use HPCToolkit and other profiling tools to find and remove bottlenecks and unnecessary code (a profiling sketch follows).
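One way such per-call overheads can be measured is the standard PMPI profiling interface; this minimal sketch times MPI_Allreduce (chosen arbitrarily) and reports at finalize. Real tools such as HPCToolkit are far more complete:

/* Sketch: PMPI wrapper that accumulates time spent in MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time = 0.0;
static long   allreduce_calls = 0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Allreduce: %ld calls, %.3f s total\n",
               allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}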

https://github.com/hpc/lwgrp
Group representation memory:
- MPI: O(P)
- LWGRP: O(log P)
Split time:
- MPI: O(P * log P)
- LWGRP: O(log G * log P)
log(P) collectives: Barrier, Bcast, Allreduce, Scan, Allgather, Alltoall, etc.
At P = 1M and G = 1K, the first term of the split time is reduced by five orders of magnitude (100,000x faster); at P = 1M, group memory is reduced by roughly five orders of magnitude (about 100,000x less memory).
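A quick check of that arithmetic (assuming base-2 logarithms): with P = 2^20, about 1M processes, MPI's group representation stores on the order of 10^6 entries versus roughly log2(P) = 20 for LWGRP, approximately five orders of magnitude less memory. For split time with G = 2^10, about 1K groups, the leading term drops from P = 2^20 to log2(G) = 10, a factor of about 10^5, so O(P * log P) ~ 2x10^7 shrinks to O(log G * log P) ~ 200.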

Wishlist for the MVAPICH team:
- Port to the IBM POWER processor
- Efficient MPI_THREAD_MULTIPLE and efficient blocking mode
- ABI compatibility between versions
- Make MV2 faster than MV1
- Find and eliminate every O(P) scaling term
- Faster MPI_Comm create and destroy (a timing sketch follows)
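A minimal sketch of the kind of microbenchmark one might use to track that last item; the iteration count and split pattern are arbitrary choices, not a prescribed test:

/* Sketch: time repeated communicator create/destroy with MPI_Comm_split. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int iters = 100;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Comm newcomm;
        /* Split ranks into two halves, then immediately destroy the result. */
        MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2, rank, &newcomm);
        MPI_Comm_free(&newcomm);
    }
    double t = (MPI_Wtime() - t0) / iters;

    if (rank == 0)
        printf("avg split+free time: %.6f s over %d ranks\n", t, size);
    MPI_Finalize();
    return 0;
}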
