Building Multi-Petaflop Systems with MVAPICH2 and Global Arrays

Building Multi-Petaflop Systems with MVAPICH2 and Global Arrays
Abhinav Vishnu*, Jeffrey Daily, Bruce Palmer, Hubertus van Dam, Karol Kowalski, Darren Kerbyson, and Adolfy Hoisie
Pacific Northwest National Laboratory
MVAPICH2 User Group Meeting, August 26, 2013

The Power Barrier
Power consumption is growing dramatically: ~100 MW datacenters are already real at Facebook, Google, Apple, and the DOE labs, and datacenters account for about 2% of total US electricity consumption. At roughly $1M per MW-year, power is a significant portion of the total cost of ownership; the upcoming (2020+) exascale machine is expected to draw 20 MW or more, i.e. on the order of $20M per year in electricity alone. Many architectural innovations are aimed at this barrier.

Workloads and Programming Models
Applications increasingly program in multiple models (MPI + X + Y), evolving existing codes to newer architectures. Newer execution models are also emerging, such as task models (PLASMA, MAGMA, DaGuE), possibly with an MPI runtime as their backend. At the same time, workloads themselves are changing dramatically; Graph500 is one example.

The Role of MPI Moving Forward: Least Common Denominator
(Stack diagram: applications and solvers, together with new execution models and runtimes, sit on top of MPI + X, Y, Z, which in turn targets network layers such as OFA/RoCE, uGNI/DMAPP, and PAMI.)

Examples for X: Global Arrays, UPC, CAF
These global address space models keep data physically distributed but expose a shared data view: a distributed-shared object programming model built on one-sided communication. Global Arrays in particular is used in a wide variety of applications: computational chemistry (NWChem, MOLCAS, Molpro), bioinformatics (ScalaBLAST), and machine learning (the Extreme Scale Data Analysis Library, xDAL).
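For readers unfamiliar with the Global Arrays model, here is a minimal sketch (using the standard GA C bindings on top of MPI; the array size, patch layout, and MA_init memory budget are illustrative choices, not taken from the talk) of the shared data view with a one-sided put and synchronization:

```c
/* Minimal Global Arrays sketch: a physically distributed 2-D array with a
 * shared view and one-sided put/get.  Assumes the standard GA C bindings
 * (ga.h, macdecls.h) built on top of MPI; error handling is omitted and
 * the toy patch layout assumes at most 512 ranks. */
#include <mpi.h>
#include <stdio.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);   /* memory allocator used by GA */

    int dims[2]  = {1024, 1024};
    int chunk[2] = {-1, -1};            /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
    GA_Zero(g_a);

    /* Every process writes a small patch with a one-sided put; no matching
     * receive is posted on the owner of that patch. */
    int me = GA_Nodeid();
    double patch[4] = {me, me, me, me};
    int lo[2] = {0, 2 * me}, hi[2] = {1, 2 * me + 1}, ld[1] = {2};
    NGA_Put(g_a, lo, hi, patch, ld);

    GA_Sync();                          /* complete all outstanding one-sided ops */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```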

Global Arrays and MVAPICH2 at PNNL
The PNNL Institutional Computing (PIC) cluster has 604 AMD Barcelona nodes with InfiniBand QDR. MVAPICH2 is the default MPI in many cases and is continuously updated with stable and beta releases. Global Arrays uses InfiniBand Verbs for Put/Get and MPI for collectives. Primary workloads: NWChem, WRF, and climate (CCSM).

HPCS4: Upcoming Multi-Petaflop System at PNNL
1440 dual-socket Intel Sandy Bridge nodes at 2.6 GHz with 128 GB of memory per node, driven by requirements from compute-intensive and data-intensive applications such as NWChem. Each node carries 2 Xeon Phi cards with 8 GB per Xeon Phi, connected by Mellanox InfiniBand FDR. Theoretical peak is 2.35 TF per node and 3.4 PF/s for the system, with an expected LINPACK efficiency above 70% and power consumption below 1.5 MW. Software ecosystem: MVAPICH2, Global Arrays, Intel MPI. Applications: NWChem, WRF.
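As a quick check on the quoted figures (assuming the 2.35 TF/node number includes both Xeon Phi cards), 1440 nodes x 2.35 TF/node is about 3.38 PF, which rounds to the stated 3.4 PF/s system peak; at the expected >70% LINPACK efficiency, that would correspond to roughly 2.4 PF sustained.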

As a User of MVAPICH2, What I Like (with Feedback from Others at the Lab)
Job startup time: very important at large scale, with consistent improvement over the releases. Collective communication performance: Allreduce scales very well, and it is the primary collective operation in many applications. Low memory footprint: numerous optimizations over the years. A remark on collectives: collective performance suffers under OS noise, for example with a non-customized kernel. Earlier work (Liu, IPDPS '04; Mamidala, Cluster '05) studied collectives with process skew; OS noise introduces similar issues.

As a User of MVAPICH2, What Would I Like?
(Diagram: MVAPICH2 at the center, surrounded by the goals expected of it: scalable performance (2001-), reliability, power efficiency, and efficient integration with X, Y, Z.)

Position Statement 1: Power
Why should software do it? By an Amdahl's-law argument, once power-optimized hardware is available, the relative power inefficiency of software grows proportionally. Why should MPI do it (as well)? It indirectly knows a lot about the application, and the key is to automatically discover communication patterns. Previous work is an important step, albeit with limitations: in Kandalla et al. (ICPP '09) the efficiency gain depends on the size of the data movement, yet most collectives are small, and some applications drive their state transitions with point-to-point messages.
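The "automatically discover patterns" idea can be prototyped outside the MPI library itself using the standard PMPI profiling interface. The sketch below is only an illustration of that idea, not MVAPICH2 code: it intercepts MPI_Allreduce and builds a histogram of message sizes, the kind of information a power-aware runtime could use to decide which collectives are small enough to tolerate slower, lower-power resources.

```c
/* Hedged sketch: observing collective usage at runtime via the standard
 * PMPI profiling interface.  Link this in front of any MPI library; it only
 * records a size histogram and delegates to the real implementation. */
#include <mpi.h>
#include <stdio.h>

#define NBINS 32
static long allreduce_hist[NBINS];      /* histogram over log2(message bytes) */

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    int tsize;
    MPI_Type_size(datatype, &tsize);
    long bytes = (long)count * tsize;

    int bin = 0;
    while ((1L << (bin + 1)) <= bytes && bin < NBINS - 1) bin++;
    allreduce_hist[bin]++;              /* most applications cluster in a few small bins */

    return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        for (int b = 0; b < NBINS; b++)
            if (allreduce_hist[b])
                printf("allreduce <= %ld bytes: %ld calls\n",
                       1L << (b + 1), allreduce_hist[b]);
    }
    return PMPI_Finalize();
}
```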

Example: Neutron Transport at Los Alamos
Sweep3D is a neutron-transport code that sweeps blocks of the grid in wavefronts along four diagonal directions (NE-to-SW, SE-to-NW, NW-to-SE, SW-to-NE). (Figure: on the left, the block/wavefront decomposition over the I, J, K dimensions; on the right, the percentage of active cores versus step number for each sweep direction, with the idle fraction labeled "The Slack". From Cluster '11.)

Shock Hydrodynamics at LLNL: LULESH
(Figure: the grid partitioned into untouched, useful-work, and stable regions.) LULESH solves the Sedov blast problem: a force is deposited at the origin and its impact on the material is observed. Only a small subset of processes performs useful work at any time, and it is difficult to define a static model for the deformation. There is significant potential for power efficiency here, and the behavior is very different from the Sweep3D problem.

+ Non-Determinism
Hardware: lots of dynamism and uncertainty from near-threshold-voltage execution and the dark silicon effect; different chips perform varying levels of bit correction. Software: behavior is difficult to capture in a static performance model; in MPI + X, X is expected to be an adaptive programming model, which implies that MPI needs to handle non-temporal patterns in communication. OS noise adds further variability. Possible R&D: there are clear examples, the key is to automatically discover communication patterns, and there is potential for significant energy savings without one-off solutions for a single application.

Position 2: Effective Integration with Other Programming Models
In MPI + X (+ Y + ...), shared-address-space examples of X include OpenMP, Pthreads, and Intel TBB. With classical data and work decomposition, MPI_THREAD_MULTIPLE is not required; lower thread-support levels suffice. But when multiple threads make MPI calls (symmetric communication), the challenge for the MPI runtime is making progress across multiple threads and handling the critical section. (Figure: two processes, each with two threads, exchanging messages with MPI_Sendrecv.)
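A minimal sketch of that symmetric case, assuming Pthreads and illustrative tags (not code from the talk): the process requests MPI_THREAD_MULTIPLE and several threads issue MPI_Sendrecv concurrently, which is exactly where the runtime has to manage progress and its internal critical sections.

```c
/* Hedged sketch of MPI_THREAD_MULTIPLE usage: several threads per process
 * call MPI_Sendrecv concurrently.  Thread count and tags are illustrative,
 * and the rank pairing assumes an even number of ranks. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int rank, size;

static void *worker(void *arg) {
    int tid = (int)(long)arg;
    int partner = rank ^ 1;             /* pair up neighbouring ranks */
    int sendval = rank * NTHREADS + tid, recvval = -1;

    /* Each thread exchanges with the same-numbered thread on the partner
     * rank; the tag keeps the per-thread message streams apart. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, partner, /*tag=*/tid,
                 &recvval, 1, MPI_INT, partner, /*tag=*/tid,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d thread %d received %d\n", rank, tid, recvval);
    return NULL;
}

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    MPI_Finalize();
    return 0;
}
```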

Multi-threaded MVAPICH2 Performance
Dinan et al. (EuroMPI '13) propose providing an endpoint to each thread, i.e. a separate communicator per thread, which prevents conflicts between threads on shared state such as thread-local storage. The problem is space complexity, which becomes even harder with oversubscription (Charm++). A possible solution lies partly in the specification and partly in the runtime. (Figures: bandwidth in MB/s versus message size, and time in microseconds versus number of processes, comparing the MT, RMA, PR, and NAT variants.)
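One way to approximate the per-thread endpoint idea with today's MPI is to duplicate the communicator once per thread, so concurrent operations from different threads never match against each other. The fragment below is a hedged sketch of that workaround (not the endpoints proposal itself); the per-thread communicator state is precisely the space cost the slide flags as a problem.

```c
/* Hedged sketch of the "separate communicator per thread" workaround:
 * message matching is isolated per thread, at the cost of one communicator
 * (and its runtime state) per thread. */
#include <mpi.h>

#define NTHREADS 4
static MPI_Comm thread_comm[NTHREADS];

void setup_thread_comms(void) {
    /* Called once by the main thread, before the workers start communicating. */
    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[i]);
}

void teardown_thread_comms(void) {
    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_free(&thread_comm[i]);
}
/* Thread i then passes thread_comm[i] to its MPI calls, so its sends and
 * receives can never be matched by another thread's operations. */
```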

Conclusions and Future Directions
MPI will continue to be the least common denominator: there is significant investment in applications, and newer execution models are adapting MPI to build their runtimes. MVAPICH2 has many features for scalable performance, but larger-scale systems will need MPI to be power efficient; the algorithms differ widely from one another and the architecture is expected to be radically different, a perfect recipe for advanced research and development. Efficient support for the MPI + X model will be critical, new execution models are looking to use the MPI runtime in revolutionary ways, and the research space is very open.

Thanks for Listening
Abhinav Vishnu, abhinav.vishnu@pnnl.gov