Open MPI for Cray XE/XK Systems

Slide 1: Open MPI for Cray XE/XK Systems
Samuel K. Gutierrez (LANL), Nathan T. Hjelm (LANL), Manjunath Gorentla Venkata (ORNL), Richard L. Graham (Mellanox)
Cray User Group (CUG) 2012, May 2, 2012

Slide 2: A Collaborative Effort

Slide 3: First Things First - Open MPI Overview
Open-Source Implementation of the MPI-2 Standard
Developed and Maintained by Academia, Industry, and National Laboratories
Supports a Range of High-Performance Network Interfaces: InfiniBand, Cray SeaStar, and Now Cray Gemini

Slide 4: The Gemini System Interconnect [3] - An Overview
Network Used by the Cray XE and XK System Families
Successor to the Cray SeaStar Network Interconnect
3D Torus Network Built of Gemini ASICs
Each Gemini ASIC Provides 2 NICs and a 48-Port Router, Connects 2 Opteron Nodes, and Provides 10 Torus Connections: 2 x (+X, -X, +Z, -Z) and 1 x (+Y, -Y)

Slide 5: Open MPI's Plugin Architecture - A High-Level Overview [1]
[Diagram: a User Application calls the MPI API, which sits atop the Modular Component Architecture (MCA); the MCA hosts frameworks, and each framework hosts components]

Slide 6: Open MPI's Plugin Architecture - A High-Level Overview [1]
MPI API: E.g. MPI_Send, MPI_Recv, MPI_Bcast
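
As a reminder of what sits at the very top of this stack, the MPI API layer is what application code actually calls. A minimal point-to-point example (illustrative only, not taken from the slides) looks like this:

    /* Minimal MPI example exercising the API layer named above.
     * Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
        } else if (1 == rank) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);          /* everyone */

        MPI_Finalize();
        return 0;
    }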

Slide 7: Open MPI's Plugin Architecture - A High-Level Overview [1]
Modular Component Architecture (MCA): Backbone of Open MPI's Plugin System
Finds, Loads, and Parameterizes Components
Open MPI Loves MCA Parameters

Slide 8: Open MPI's Plugin Architecture - A High-Level Overview [1]
Frameworks: Functionality Specification
E.g. Resource Manager, Point-to-Point, Collective Algorithm

Slide 9: Open MPI's Plugin Architecture - A High-Level Overview [1]
Components: Implementation of a Framework Type (a Plugin)
E.g. SLURM RAS, openib BTL
What a Developer Typically Creates to Support New Functionality
Module: an Instance of a Component

Slide 10: Open MPI's Plugin Architecture - Main Code Sections [1]
Open MPI Layer (OMPI): MPI API and Support Logic
Open Run-Time Environment (ORTE): Run-Time System
Open Portable Access Layer (OPAL): OS-Specific/Utility Code
[Diagram: OMPI atop ORTE atop OPAL atop the Operating System]

Slide 11: The Port - ORTE
Environment-Specific Services (ESS): Run-Time Environment (RTE) Setup - Messaging, Routing, Module Exchange (ModEx), Process Naming, Job Size and Locality Information
Process Lifecycle Management (PLM): Central Switchyard for All Process Management - Resource Allocation, Process Mapping, Process Launch, Process Monitoring
Resource Allocation Subsystem (RAS): Job Resource Availability and Allocation
RML Routing Table (ROUTED): Next-Hop Routing Services (De Bruijn)

Slide 12: OMPI Point-to-Point Overview [1]
[Diagram: the MPI API calls into the PML, which uses the BML to manage BTL 1 through BTL n; each BTL has an associated Memory Pool (MPool) and Registration Cache (RCache)]

Slide 13: Byte Transfer Layers (BTLs) [1]
Transport Interface Support Plugins - Think: Byte Transfer Driver
Thin Abstraction Layer Above the Target Device
Source/Destination Preparation
Protocol Definition: Short, Medium, Long
Send, SendI, Put, Get
No Notion of MPI Semantics
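
The BTL interface is essentially a small set of data-movement entry points exposed through function pointers. The sketch below is a simplified, hypothetical rendering of that idea, not the actual Open MPI mca_btl.h definitions; the type and member names are illustrative only.

    /* Simplified sketch of a BTL-style byte-transfer interface.
     * Illustrative only -- the real definitions in ompi/mca/btl/ are
     * considerably richer. */
    #include <stddef.h>

    struct btl_endpoint;                 /* opaque per-peer connection state */

    typedef struct btl_module {
        size_t eager_limit;              /* largest "short" (eager) message  */
        size_t max_send_size;            /* largest single send fragment     */

        /* Move bytes; no MPI tags, communicators, or matching here. */
        int (*btl_send)(struct btl_module *btl, struct btl_endpoint *ep,
                        const void *buf, size_t len);
        int (*btl_sendi)(struct btl_module *btl, struct btl_endpoint *ep,
                         const void *buf, size_t len);  /* immediate send    */
        int (*btl_put)(struct btl_module *btl, struct btl_endpoint *ep,
                       void *remote_addr, const void *local_addr, size_t len);
        int (*btl_get)(struct btl_module *btl, struct btl_endpoint *ep,
                       void *local_addr, const void *remote_addr, size_t len);
    } btl_module_t;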

Slide 14: The Port - New BTLs
Kernel-Assisted (Single-Copy) Shared-Memory BTL: Used Exclusively for Intra-Node Communication; Leverages XPMEM; Currently Named vader in the Development Trunk
Gemini BTL: Used Exclusively for Inter-Node Communication; Leverages Cray's User-Level Generic Network Interface (ugni); Currently Named ugni in the Development Trunk
[Diagram: the vader BTL with its Memory Pool and Registration Cache alongside the ugni BTL with its Memory Pool and Registration Cache]
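
For context, the single-copy idea behind the vader BTL can be illustrated with the XPMEM user-level API: one process exports a region of its address space, and a peer attaches that region and copies from it directly, avoiding the intermediate copy-in/copy-out step. This is a minimal sketch assuming the commonly shipped xpmem.h interface; error handling, the permit mode, and the out-of-band exchange of the segment id (which Open MPI would perform via the ModEx) are simplified placeholders.

    /* Minimal single-copy sketch using XPMEM (assumes <xpmem.h>). */
    #include <xpmem.h>
    #include <string.h>
    #include <stddef.h>

    /* Exporting side: make a buffer visible to other processes. */
    xpmem_segid_t export_buffer(void *buf, size_t len)
    {
        /* permit mode 0666 is illustrative */
        return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
    }

    /* Importing side: attach the peer's buffer and copy from it directly. */
    int single_copy_recv(xpmem_segid_t segid, size_t len, void *dst)
    {
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
        if (apid < 0) return -1;

        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        void *src = xpmem_attach(addr, len, NULL);
        if (src == (void *)-1) return -1;

        memcpy(dst, src, len);          /* the single copy */

        xpmem_detach(src);
        xpmem_release(apid);
        return 0;
    }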

Slide 15: BTL Management Layer (BML) [1]
Manages Multiple BTLs Within a Single Process
No Modifications Needed for the Port
[Diagram: the BML sitting above the vader and ugni BTLs and their Memory Pools / Registration Caches]

Slide 16: Point-to-Point Management Layer (PML) [1]
Provides the Point-to-Point Functionality Required by the MPI Layer
Minor Modification Required for the Port
[Diagram: the PML above the BML, which manages the vader and ugni BTLs and their Memory Pools / Registration Caches]

Slide 17: More About the XPMEM BTL - Vader
MPICH Nemesis-Like Design: Lock-Free Message Queues; Fast Boxes, I.e. Per-Peer Receive Queues for Short Messages
Copy Backend Changes Based on Message Size - E.g. bcopy for Sizes in [a, b), memcpy Otherwise; User Tunable with Good Defaults
Cross-Process Memory Mapping Allows for RDMA-Like Semantics: Copy-In/Copy-Out (CICO) Avoided; No Backing Store Required; Heavy Use of the Registration Cache
XPMEM Support Requires a Kernel Patch and a User-Level Library - Already Available and Leveraged by Cray's Native MPI Implementation
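
A hedged sketch of the size-based dispatch described above: very short messages go through a per-peer fast box, mid-sized ones through a copy path, and large ones through the single-copy (XPMEM) path. The threshold names, default values, and helper functions here are hypothetical; the real vader component's tunables and defaults differ.

    /* Illustrative size-based protocol selection for an intra-node send.
     * fastbox_send(), bcopy_send(), and xpmem_single_copy_send() are
     * hypothetical helpers; the thresholds stand in for the user-tunable
     * MCA parameters. */
    #include <stddef.h>

    #define FBOX_MAX   256        /* placeholder: largest fast-box message      */
    #define BCOPY_MAX  (32*1024)  /* placeholder: largest copy-in/copy-out send */

    int fastbox_send(int peer, const void *buf, size_t len);
    int bcopy_send(int peer, const void *buf, size_t len);
    int xpmem_single_copy_send(int peer, const void *buf, size_t len);

    int vader_like_send(int peer, const void *buf, size_t len)
    {
        if (len <= FBOX_MAX)
            return fastbox_send(peer, buf, len);        /* per-peer receive queue */
        if (len <= BCOPY_MAX)
            return bcopy_send(peer, buf, len);          /* shared-memory bcopy    */
        return xpmem_single_copy_send(peer, buf, len);  /* single copy via XPMEM  */
    }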

Slide 18: More About the ugni BTL
Protocols:
Short Messages - Fast Memory Access (FMA) Short Messaging (SMSG)
Medium Messages - FMA RDMA
Long Messages - Block Transfer Engine (BTE) RDMA
Lazy Connection Establishment: Resource Utilization is Directly Related to Application Communication Characteristics
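
The lazy-connection point can be made concrete with a small sketch: per-peer endpoint state is allocated and wired up only on first use, so a nearest-neighbor code never pays for endpoints to ranks it never talks to. Everything below is hypothetical scaffolding for that idea, not the actual ugni component code.

    /* Illustrative lazy connection establishment: per-peer state is created
     * only when a process first communicates with that peer. */
    #include <stdlib.h>

    typedef struct endpoint {
        int connected;
        /* ... SMSG mailbox, uGNI endpoint handle, etc. would live here ... */
    } endpoint_t;

    static endpoint_t **peers;   /* one slot per peer rank, filled lazily */
    static int          npeers;

    int endpoint_table_init(int n)
    {
        npeers = n;
        peers  = calloc((size_t)n, sizeof(*peers));
        return peers ? 0 : -1;
    }

    /* Return the endpoint for 'rank', creating and connecting it on first use. */
    endpoint_t *endpoint_get(int rank)
    {
        if (NULL == peers[rank]) {
            peers[rank] = calloc(1, sizeof(endpoint_t));
            if (NULL == peers[rank]) return NULL;
            /* connection setup (e.g. mailbox exchange) would happen here */
            peers[rank]->connected = 1;
        }
        return peers[rank];
    }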

Slide 19: Improved Collectives - Cheetah [2]
ORNL's Cheetah: A Framework for Collective Operations
Collectives Implemented with Collective Primitives
Each Primitive is Optimized for a Particular Communication Path
Progressed Asynchronously and Independently When Semantics Permit
[Diagram: within OMPI, the COLL framework's ML component coordinates BCOL components (ugni, UMA, PTPColl) and SBGP components (UMA, Socket, IBNET, P2P, Default)]

Slide 20: Improved Collectives - Cheetah [2]
Base Collectives (BCOL): Implements the Collective Primitives
Subgrouping (SBGP): Provides Process Grouping Rules
Multilevel (ML): Coordinates Collective Primitive Execution
For Design and Implementation Details, See the Cheetah Publications [2]
[Diagram: as on the previous slide]

Slide 21: Improved Collectives - ugni BCOL
Barrier Implemented
ugni Cheetah Barrier: Fan-In/Fan-Out Algorithm
Atomic Barrier: Leverages Atomic Operations Provided by the ugni Library
Currently Only Supports MPI_Barrier
[Diagram: as on slide 19]
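
To make the fan-in/fan-out structure concrete, here is a generic sketch of that style of barrier over a shared counter: a flat fan-in of atomic increments followed by a broadcast-style fan-out release. It uses C11 atomics as a stand-in for the Gemini atomic memory operations the actual ugni BCOL uses; the real implementations (tree fan-in/fan-out and the remote-atomic barrier) differ in structure and synchronization detail.

    /* Generic sense-reversing barrier sketch using C11 atomics as a
     * stand-in for Gemini atomic memory operations. Initialize both
     * fields to 0; each caller must start with my_sense = 1 and flip
     * it (0/1) between successive barrier calls. */
    #include <stdatomic.h>

    typedef struct {
        atomic_int arrived;   /* fan-in counter                 */
        atomic_int sense;     /* flipped by the last arrival    */
        int        nprocs;
    } barrier_t;

    void barrier_wait(barrier_t *b, int my_sense)
    {
        /* fan-in: every participant atomically bumps the counter */
        if (atomic_fetch_add(&b->arrived, 1) + 1 == b->nprocs) {
            /* last arrival: reset the counter and release everyone */
            atomic_store(&b->arrived, 0);
            atomic_store(&b->sense, my_sense);       /* fan-out */
        } else {
            while (atomic_load(&b->sense) != my_sense)
                ;                                    /* spin until released */
        }
    }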

Slide 22: Performance Evaluation - Setup
Test Beds: Cielo, a 142,304-Core XE6; Enhanced Jaguar, a 299,008-Core XK6
Point-to-Point Latency: OSU's MPI Micro-Benchmark Suite (osu_latency and osu_multi_lat)
Point-to-Point Bandwidth: OSU's MPI Micro-Benchmark Suite (osu_bibw and osu_mbw_mr)
Barrier Latency: MPI_Barrier in a Tight Loop, Average Latency Reported
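
The barrier-latency methodology is simple enough to show directly. A minimal version of "MPI_Barrier in a tight loop" with the average reported might look like the following; the iteration and warm-up counts are choices of this sketch, not taken from the slides.

    /* Minimal barrier-latency microbenchmark: time MPI_Barrier in a tight
     * loop and report the average. Iteration/warm-up counts are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int warmup = 100, iters = 10000;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < warmup; i++)
            MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double avg = (MPI_Wtime() - t0) / iters;

        if (0 == rank)
            printf("average MPI_Barrier latency: %.3f us\n", avg * 1e6);

        MPI_Finalize();
        return 0;
    }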

Slide 23: Vader Latency on AMD Magny-Cours [figure]

Slide 24: Vader Bandwidth on AMD Magny-Cours [figure]

Slide 25: ugni BTL Latency on XE6 [figure]

Slide 26: ugni BTL Bandwidth on XE6 [figure]

Slide 27: Performance of Cheetah Barriers on XK6 [figure]

Slide 28: Ongoing/Future Work
Point-to-Point Stabilization/Optimization: Already Tested at 128K Processors (Cielo); Investigating New Protocols
Continue Collectives Work: Evaluate the Performance and Scalability Characteristics of the Atomic Collective Operations at Larger Scales; Evaluate the Potential for Implementing Other Collective Operations Using the Atomic Operations
Work with Friendly Testers
Prepare for General Release

Slide 29: Thanks!

Slide 30: Questions? Comments?

Slide 31: References
[1] Open MPI. Accessed 13 Feb. <open-mpi.org>.
[2] R. Graham, et al., "Cheetah: A Framework for Scalable Hierarchical Collective Operations," CCGRID 2011.
[3] R. Alverson, et al., "The Gemini System Interconnect," in High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on, Aug. 2010.
