Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies

Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
François Tessier, Venkatram Vishwanath, Paul Gressier
Argonne National Laboratory, USA
Wednesday 23rd August, 2017

Data Movement at Scale

Computational science simulations in domains such as materials science, high-energy physics, and engineering have large I/O needs; typically around 10% to 20% of the wall time is spent in I/O.

Table: Examples of I/O from large simulations

Scientific domain      Simulation     Data size
Cosmology              Q Continuum    2 PB / simulation
High-Energy Physics    Higgs Boson    10 PB / year
Climate / Weather      Hurricane      240 TB / simulation

There is an increasing disparity between computing power and I/O performance in the largest supercomputers.

Figure: Ratio of I/O (TB/s) to Flops (TF/s), in percent, for the #1 system in the Top 500, 1997-2017.

Complex Interconnect Hierarchies

On BG/Q, data movement needs to fully exploit the 5D-torus topology for improved performance. Additionally, we need to exploit the placement of the I/O nodes. Cray supercomputers face similar challenges with dragonfly-based interconnects combined with the placement of the LNET nodes used for I/O.

Figure: I/O path on Mira (IBM BG/Q): compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3) on a 5D-torus network (2 GBps per link), grouped into Psets of 128 nodes; bridge nodes (2 per I/O node, 2 GBps links); 4 GBps links and a QDR InfiniBand switch to the I/O nodes (I/O forwarding daemon, GPFS client); GPFS storage.

Mira: 49,152 nodes / 786,432 cores, 768 TB of memory, 27 PB of storage at 330 GB/s (GPFS), 5D-torus network, peak performance of 10 PetaFLOPS.
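
To make the topology discussion concrete, the minimal sketch below (not part of TAPIOCA, and the dimension sizes are placeholders) shows how the hop count between two nodes can be estimated on a torus network when their coordinates are known, assuming shortest-path, dimension-ordered routing with wraparound links.

#include <stdlib.h>

#define NDIMS 5  /* e.g., the 5 torus dimensions of BG/Q */

/* Hypothetical helper: hop count between two nodes on a torus,
 * assuming minimal dimension-ordered routing and wraparound links. */
int torus_hops(const int coordA[NDIMS], const int coordB[NDIMS],
               const int dimSize[NDIMS])
{
    int hops = 0;
    for (int d = 0; d < NDIMS; d++) {
        int dist = abs(coordA[d] - coordB[d]);
        int wrap = dimSize[d] - dist;          /* going the other way around */
        hops += (dist < wrap) ? dist : wrap;   /* take the shortest direction */
    }
    return hops;
}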

Deep Memory Hierarchies and Filesystem Characteristics

We need to exploit the deep memory hierarchy tiers for improved performance, including effective ways to seamlessly use HBM, DRAM, NVRAM, burst buffers, etc. We also need to leverage filesystem-specific features such as OSTs and striping in Lustre, among others.

Figure: Memory and interconnect hierarchy of a Cray XC40. Each compute node hosts an Intel KNL 7250 (Knights Landing) processor with 36 tiles (2 cores and shared L2 each), 16 GB of MCDRAM, 192 GB of DDR4 and a 128 GB SSD. Four nodes share an Aries router; routers form a 2D all-to-all structure inside a 2-cabinet group (96 routers per group, 16 x 6 routers hosted), with level-1 electrical links at 14 GBps, level-2 optical links at 12.5 GBps across the 9 groups (18 cabinets), and service nodes (LNET, gateway; irregular mapping) reaching the Lustre filesystem over IB FDR at 56 GBps (level 3).
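
One concrete way to act on such filesystem-specific features is to pass Lustre striping hints through MPI-IO, as in the hedged sketch below. The hint names "striping_factor" and "striping_unit" are the ones commonly honored by ROMIO/Cray MPI; the values are illustrative and match the 48-OST / 8 MB configuration used later in the evaluation.

#include <mpi.h>

/* Hedged example: request a Lustre striping layout through MPI-IO hints
 * when creating a file. Hint names follow the ROMIO/Cray MPI convention;
 * the values here are only illustrative. */
MPI_File open_striped(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");      /* number of OSTs */
    MPI_Info_set(info, "striping_unit", "8388608");   /* 8 MB stripe size */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}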

TAPIOCA

Library based on the two-phase I/O scheme for topology-aware data aggregation at scale, on IBM BG/Q with GPFS and Cray XC40 with Lustre (Cluster'17):
- Topology-aware aggregator placement taking into account the topology of the architecture and the data access pattern
- Captures the data model and data layout to optimize the I/O scheduling
- Pipelining (RMA, non-blocking calls) of the aggregation and I/O phases
- Interconnect architecture abstraction

Figure: Two-phase I/O in TAPIOCA. In each round, processes 0-3 push their data (variables X, Y, Z) into the aggregator's buffers with RMA operations; the aggregator then writes the aggregated buffer to the file with non-blocking MPI calls, overlapping aggregation and I/O.
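
The following is a minimal, self-contained sketch in plain MPI (not TAPIOCA's actual implementation) of one two-phase round: processes deposit their chunk into the aggregator's buffer with one-sided RMA, and the aggregator flushes the aggregated buffer to the shared file. TAPIOCA additionally pipelines rounds with multiple buffers and non-blocking I/O, which this sketch omits; two_phase_write, its arguments, and the single-buffer layout are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Sketch of a two-phase I/O round: every process in 'group' deposits
 * 'chunkSize' bytes into a buffer exposed by the aggregator (group rank 0)
 * with MPI_Put, and the aggregator then writes the aggregated buffer to
 * the shared file at 'fileOffset'. */
void two_phase_write(MPI_Comm group, MPI_File fh, MPI_Offset fileOffset,
                     const char *chunk, int chunkSize)
{
    int rank, size;
    MPI_Comm_rank(group, &rank);
    MPI_Comm_size(group, &size);

    char *aggBuffer = NULL;
    MPI_Win win;

    if (rank == 0)   /* the aggregator exposes a buffer large enough for the group */
        aggBuffer = malloc((size_t)size * chunkSize);

    MPI_Win_create(aggBuffer, rank == 0 ? (MPI_Aint)size * chunkSize : 0,
                   1, MPI_INFO_NULL, group, &win);

    /* Phase 1: aggregation through one-sided puts */
    MPI_Win_fence(0, win);
    MPI_Put((void *)chunk, chunkSize, MPI_CHAR, 0,
            (MPI_Aint)rank * chunkSize, chunkSize, MPI_CHAR, win);
    MPI_Win_fence(0, win);

    /* Phase 2: the aggregator writes the whole aggregated buffer */
    if (rank == 0)
        MPI_File_write_at(fh, fileOffset, aggBuffer,
                          size * chunkSize, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_Win_free(&win);
    free(aggBuffer);
}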

Abstractions for Interconnect Topology

Topology characteristics include:
- Spatial coordinates
- Distance between nodes: number of hops, routing policy
- Location of the I/O nodes, depending on the filesystem (bridge nodes, LNET, ...)
- Network performance: latency, bandwidth

We also need to model some unknowns and uncertainties, such as routing and contention.

Figure: 5D torus on BG/Q and intra-chassis dragonfly network on Cray XC30 (Credit: LLNL / LBNL)
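
One way such characteristics can be grouped behind a common interface is sketched below. This is an illustrative C interface, not TAPIOCA's actual API; the type and function names are assumptions, and each platform (BG/Q 5D torus, Cray dragonfly, ...) would provide its own implementation behind these calls.

#include <stdint.h>

/* Illustrative interconnect-abstraction interface (hypothetical names). */
typedef struct {
    int    ndims;           /* number of topology dimensions           */
    int    coords[8];       /* spatial coordinates of the calling node */
    double latency_us;      /* base network latency                    */
    double bandwidth_GBps;  /* link bandwidth                          */
} topo_node_t;

int    topoGetNode(topo_node_t *self);   /* where am I in the topology?       */
int    topoDistance(int rankA, int rankB);   /* number of hops between nodes  */
int    topoNearestIONode(int rank);          /* closest bridge/LNET node      */
double topoLatency(int rankA, int rankB);    /* estimated latency             */
double topoBandwidth(int rankA, int rankB);  /* estimated bandwidth           */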

Abstractions for Interconnect Topology - Our current approach

TAPIOCA features a topology-aware aggregator placement. This approach is based on quantitative information that is easy to gather: latency, bandwidth, distance between nodes.

- ω(u, v): amount of data exchanged between nodes u and v
- d(u, v): number of hops from node u to node v
- l: the interconnect latency
- B_{i→j}: the bandwidth from node i to node j

C1 = Σ_{i ∈ V_C, i ≠ A} ( l · d(i, A) + ω(i, A) / B_{i→A} )
C2 = l · d(A, IO) + ω(A, IO) / B_{A→IO}
TopoAware(A) = min_A (C1 + C2)

where V_C is the set of compute nodes, IO the I/O node and A the candidate aggregator (C1: cost of aggregating the data from the compute nodes onto A; C2: cost of moving the aggregated data from A to the I/O node).

Contention-aware algorithm: static and dynamic routing policies, unknown vendor information such as the routing policy or the data distribution among I/O nodes, ...
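
A direct transcription of this cost model into code could look like the sketch below (illustrative only; TAPIOCA's real implementation gathers d, l and B from the platform and handles groups of aggregators). It assumes one I/O node and one aggregator per group; topoDistance and topoBandwidth are the hypothetical abstraction calls sketched earlier, and omega[i] holds the data volume produced by node i.

#include <float.h>

/* Hypothetical topology calls from the abstraction sketched earlier. */
extern int    topoDistance(int rankA, int rankB);
extern double topoBandwidth(int rankA, int rankB);

/* Hedged sketch of the TopoAware placement: pick, among the compute nodes
 * of a group, the aggregator A minimizing C1 + C2 from the cost model above.
 * 'l' is the interconnect latency, 'aggregatedSize' stands for ω(A, IO). */
int topoAwareAggregator(const int *nodes, int nNodes, int ioNode,
                        const double *omega, double aggregatedSize, double l)
{
    int best = nodes[0];
    double bestCost = DBL_MAX;

    for (int a = 0; a < nNodes; a++) {
        int A = nodes[a];
        double C1 = 0.0;

        for (int i = 0; i < nNodes; i++) {          /* C1: aggregation cost */
            if (nodes[i] == A) continue;
            C1 += l * topoDistance(nodes[i], A)
                + omega[i] / topoBandwidth(nodes[i], A);
        }
        double C2 = l * topoDistance(A, ioNode)     /* C2: aggregator-to-I/O cost */
                  + aggregatedSize / topoBandwidth(A, ioNode);

        if (C1 + C2 < bestCost) { bestCost = C1 + C2; best = A; }
    }
    return best;
}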

TAPIOCA Outperforms MPI I/O on the I/O Kernel of HACC

TAPIOCA is compared to MPI I/O on the I/O kernel of HACC with two data layouts, on a Cray XC40 + Lustre and a BG/Q + GPFS.
- HACC: large-scale simulation of the mass evolution of the universe with particle-mesh techniques (a particle is defined by 9 variables); 1024 nodes, 16 ranks per node
- Best PFS configuration for MPI I/O:
  - Lustre: 48 OSTs, 8 MB stripe size, 192 aggregators
  - GPFS: 16 aggregators per Pset (128 aggregators), 16 MB buffer size

Figure: I/O bandwidth (GBps) vs. data size per process (0.5-4 MB) for reads and writes, array-of-structures data layout. (a) Cray XC40 + Lustre: TAPIOCA reaches up to 2.7x the MPI-IO bandwidth for reads and 3.8x for writes. (b) BG/Q + GPFS: up to 6.7x for reads and 10.6x for writes.

TAPIOCA Outperforms MPI I/O on the I/O Kernel of HACC (continued)

Same experimental setup as above (HACC I/O kernel, 1024 nodes, 16 ranks per node, best PFS configuration for MPI I/O), with the structure-of-arrays data layout.

Figure: I/O bandwidth (GBps) vs. data size per process (0.5-4 MB) for reads and writes, structure-of-arrays data layout. (a) Cray XC40 + Lustre: TAPIOCA reaches up to 2.3x the MPI-IO bandwidth for reads and 3.6x for writes. (b) BG/Q + GPFS: up to 1.3x for reads and 1.2x for writes.

TAPIOCA - Ongoing Research

Move toward a generic data movement library for data-intensive applications, exploiting the deep memory/storage hierarchy as well as the interconnect, to facilitate I/O, in-transit analysis, data transformation, data/code coupling, workflows, ...

Figure: Generalized data path. Application processes P0-P3 hold their data (X, Y, Z) in DRAM, MCDRAM, ...; the data moves over the network (dragonfly, torus, ...) to aggregators buffering in DRAM, MCDRAM, NVRAM, burst buffers, ...; and then again over the network to the target (NVRAM, PFS, burst buffers, ...).

Abstractions for Memory and Storage

- Topology characteristics, including spatial location and distance
- Performance characteristics: bandwidth, latency, capacity
- Access characteristics, such as byte/block-based access and concurrency
- Persistency

Figure: Layered design. The application calls a memory API (alloc, store, load, free, ...) built on an abstraction layer (mmap, pmem, ...) that targets DRAM, HBM, NVRAM, the PFS and burst buffers.

Listing 1: Function prototypes for memory/storage data movements

void memAlloc(void *buff, int64_t buffSize, mem_t mem);
void memFree(void *buff, mem_t mem);
int  mem{Write,Store}(void *srcBuffer, int64_t srcSize, void *destBuffer, mem_t mem, int64_t offset);
int  mem{Read,Load}(void *srcBuffer, int64_t srcSize, void *destBuffer, mem_t mem, int64_t offset);
void memFlush(void *buff, mem_t mem);

Work in progress, with open questions:
- Blurring boundary between memory and storage (MCDRAM, 3D XPoint memory, ...)
- Some data movements need one or more processes involved at the destination (RMA window, flushing thread, ...)
- Scope of memory/storage tiers (PFS vs. node-local SSD)
- Data partitioning to take advantage of fast memories with smaller capacities
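
As a usage illustration only: the API above is still work in progress, and the mem_t values MEM_NVRAM, the memWrite variant and the call sequence below are assumptions for this sketch, not defined by the slides; only the Listing 1 prototypes come from the talk. Staging a DRAM-resident buffer into an NVRAM tier might then look like:

#include <stdint.h>

/* Hypothetical usage of the Listing 1 API (assumed available through its
 * header): stage a DRAM-resident buffer into an NVRAM tier. */
void stage_to_nvram(const char *data, int64_t size)
{
    void *nvBuff = NULL;

    memAlloc(&nvBuff, size, MEM_NVRAM);                 /* reserve space on the NVRAM tier */
    memWrite((void *)data, size, nvBuff, MEM_NVRAM, 0); /* copy DRAM -> NVRAM at offset 0  */
    memFlush(nvBuff, MEM_NVRAM);                        /* make the data persistent        */
    memFree(nvBuff, MEM_NVRAM);
}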

Conclusion

TAPIOCA, an optimized data-movement library incorporating:
- Topology-aware aggregator placement
- Optimized data movement with I/O scheduling and pipelining
- A hardware abstraction ensuring performance portability

Performance portability on two leadership-class supercomputers, Mira (IBM BG/Q + GPFS) and Theta (Cray XC40 + Lustre):
- Same application code running on both platforms
- Same optimization algorithms using an interconnect abstraction

Promising preliminary results with the memory/storage abstraction, although an appropriate level of abstraction is hard to define:
- A specific abstraction for every system, covering the architecture, the filesystems, every phase of deployment, the relevant software versions, ...
- A generalized abstraction that maps to current and expected future deep memory hierarchies and interconnects

Future work: come up with a model that helps make smart decisions about data movement.

Acknowledgments: Argonne Leadership Computing Facility at Argonne National Laboratory; DOE Office of Science, ASCR; Proactive Data Containers (PDC) project.

Thank you for your attention! ftessier@anl.gov

MPI-IO and TAPIOCA - Data layout

Figure: Data layout through aggregation with MPI I/O and TAPIOCA, for two processes (P0, P1) each holding variables X, Y, Z: the data passes from the processes through the aggregator's buffer to the file, with MPI I/O grouping it by variable in the aggregator while TAPIOCA preserves the per-process X Y Z layout.