Lock Ahead: Shared File Performance Improvements

Patrick Farrell, Cray Lustre Developer. Steve Woods, Senior Storage Architect, woods@cray.com. September 2016.

Agenda
The shared file problem. The sequence of events that leads to the problem. The solution. Usage and performance. Availability.

Shared File vs. File-Per-Process (FPP)
There are two principal approaches to writing out data: file-per-process, which scales well on Lustre, and shared file, which does not.

Shared File IO
HPC applications commonly use the MPI-IO library to do efficient shared file IO, using a technique called collective buffering: IO is aggregated to a set of nodes, each of which handles part of the file. Writes are strided and non-overlapping. For example, client 1 is responsible for writes to block 0, block 2, block 4, etc., while client 2 is responsible for block 1, block 3, etc. This is currently arranged so that there is one aggregator client per OST.
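
A minimal sketch of this strided, collective write pattern using standard MPI-IO calls. The file name, block size, and block count are illustrative values, not taken from the slides; real collective-buffering details (aggregator selection, striping alignment) live inside the MPI-IO layer.

    /* Strided, non-overlapping shared-file writes via collective MPI-IO. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MiB per block, illustrative */
    #define NBLOCKS    4           /* blocks written by each rank   */

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        char *buf = malloc(BLOCK_SIZE);
        memset(buf, rank, BLOCK_SIZE);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_file.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank r writes blocks r, r + nranks, r + 2*nranks, ...  The writes
         * are strided and non-overlapping; the collective call lets the
         * MPI-IO layer aggregate them onto a subset of nodes. */
        for (int i = 0; i < NBLOCKS; i++) {
            MPI_Offset off = (MPI_Offset)(i * nranks + rank) * BLOCK_SIZE;
            MPI_File_write_at_all(fh, off, buf, BLOCK_SIZE, MPI_BYTE,
                                  MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }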

Shared File Scalability
Bandwidth is best at one client per OST. Going from one to two clients reduces bandwidth dramatically, and adding more beyond two does not help much. In real systems, an OST can handle the full bandwidth of several clients (file-per-process hits these limits).

[Performance graph: shared-file bandwidth vs. number of clients per OST, showing the sharp drop from one client to two or more.]

Why Doesn't Shared File IO Scale?
In good shared file IO, writes are strided and non-overlapping. Since the writes don't overlap, it should be possible to have multiple clients per OST without lock contention. In practice, with more than one client per OST, writes are serialized by the LDLM* extent lock design in Lustre, so two or more clients are slower than one due to lock contention. (*LDLM locks are Lustre's distributed locks, used on clients and servers.)

Extent Lock Contention
Consider a single-OST view of a file (the same applies to the individual OSTs of a striped file). Two clients are doing strided writes. Client 1 asks to write segment 0 (assume stripe-size segments); since no other locks exist, the OST expands the request and grants client 1 a lock covering the whole file.

Extent Lock Contention
Client 2 then asks to write segment 1, which conflicts with client 1's expanded lock, so that lock is called back and there are currently no locks on the file. The OST expands the lock request from client 2 and grants it a lock on the rest of the file.

Extent Lock Contention
Client 1 now asks to write segment 2. This conflicts with the expanded lock granted to client 2, so the lock for client 2 is called back. And so on; this continues throughout the IO.

Extent Lock Contention
Multiple clients per OST are completely serialized; there is no parallel writing at all. Even worse, there is additional latency to exchange each lock. Mitigation: clients are generally able to write more than one segment before giving up the lock.

Extent Lock Contention
What about simply not expanding locks? That avoids contention and lets clients write in parallel. Surprise: it is actually worse, because it requires a lock for every write, and the latency kills performance. That was the blue line at the bottom of the performance graph.
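
A rough back-of-the-envelope illustration of why per-write locking is so costly. The 1 MiB transfer size and 1 ms lock round trip below are assumed numbers for illustration, not figures from this talk:

    max per-client bandwidth <= transfer size / lock round-trip latency
                              = 1 MiB / 1 ms
                              ~ 1 GiB/s

Even before any data moves, per-write lock latency caps each client's throughput, and the cap shrinks linearly with the transfer size, which is why the unexpanded-lock case sits at the bottom of the graph.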

Proposal: Lock Ahead
Lock ahead allows clients to request locks on file regions in advance of IO. Pros: a lock on part of a file is requested with an ioctl, and the server grants the lock only on the requested extent (no expansion); the mechanism is flexible and can optimize other IO patterns; it is relatively easy to implement. Cons: large files drive up the lock count, which can hurt performance, and it pushes the LDLM into new areas, which exposes bugs.
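
For illustration only, a sketch of what an aggregator might do before its writes. The ioctl command and struct below (LL_IOC_LOCKAHEAD_REQUEST, struct lockahead_request) are hypothetical placeholders, not the actual Cray/Lustre interface described in these slides.

    /* Hypothetical sketch: request non-expanded extent locks ahead of IO.
     * The ioctl number and struct layout are placeholders. */
    #include <stdint.h>
    #include <sys/ioctl.h>

    #define LL_IOC_LOCKAHEAD_REQUEST 0  /* placeholder ioctl number */

    struct lockahead_request {          /* placeholder layout */
        uint64_t start;                 /* first byte of extent */
        uint64_t end;                   /* last byte of extent  */
        uint32_t mode;                  /* read or write lock   */
    };

    /* Request write locks on the strided segments this aggregator will
     * write: first, first + stride, first + 2*stride, ... */
    static int request_lockahead(int fd, uint64_t first, uint64_t stride,
                                 uint64_t seg_size, int nsegs)
    {
        for (int i = 0; i < nsegs; i++) {
            struct lockahead_request req = {
                .start = first + (uint64_t)i * stride,
                .end   = first + (uint64_t)i * stride + seg_size - 1,
                .mode  = 1 /* write */,
            };
            if (ioctl(fd, LL_IOC_LOCKAHEAD_REQUEST, &req) < 0)
                return -1;              /* fall back to normal locking */
        }
        return 0;
    }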

Lock Ahead: Request Locks to Match the IO Pattern
Imagine requesting locks ahead of time. Same situation as before: client 1 wants to write segment 0, but before doing so, it requests locks.

Lock Ahead: Request Locks to Match the IO Pattern
The client requests locks on the segments it intends to do IO on: 0, 2, 4, etc. Lock ahead locks are not expanded.

Lock Ahead: Request Locks to Match the IO Pattern
Client 2 requests locks on its segments: 1, 3, 5, etc.

Lock Ahead: Request Locks to Match the IO Pattern
With the locks issued, the clients can do IO in parallel, with no lock conflicts.

What About Group Locks?
Lustre has an existing solution: group locks. A group lock basically turns off LDLM locking on a file for group members, allowing FPP-like performance for them. It is tricky, though: since the lock is shared between clients, there are write visibility issues (each client assumes it is the only one holding a lock, so it does not notice file updates until the lock is released and cancelled). The lock must be released to get write visibility between clients.
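
For comparison, a rough sketch of how an application takes a group lock today. This assumes the LL_IOC_GROUP_LOCK / LL_IOC_GROUP_UNLOCK ioctls and the lustre_user.h header; treat the exact names and header path as assumptions that may vary by Lustre version.

    /* Sketch: cooperating clients take a group lock with the same group id,
     * write without extent-lock traffic, then flush and release the lock,
     * which is also when write visibility between clients is restored. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <lustre/lustre_user.h>   /* assumed header location */

    int write_under_group_lock(const char *path, int gid,
                               const void *buf, size_t len, off_t off)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;

        if (ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {  /* join the lock group */
            close(fd);
            return -1;
        }

        ssize_t rc = pwrite(fd, buf, len, off);       /* no LDLM contention  */

        fsync(fd);                                    /* flush before release */
        ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);          /* others now see data  */
        close(fd);
        return rc < 0 ? -1 : 0;
    }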

What About Group Locks?
This works for some workloads, but not for many others. It is not really compatible with HDF5 and similar file formats: in-file metadata updates require write visibility between clients during the IO. It is possible to fsync and release the lock after every write, but then the speed benefits are lost.

Lock Ahead: Performance
Early results show performance equal to file-per-process or group locks. Lock ahead is intended to match up with the MPI-IO collective buffering feature described earlier (measured with IOR, -a MPIIO -c). Cray has made a patch available; codes need to be rebuilt but not modified.

Lock Ahead: Performance (preliminary)
[Bar chart: write and read bandwidth (0 to 9,000, in MiB/s) for group lock vs. lockahead at 64, 80, 160, 320, and 640 ranks.]

Lock Ahead: Performance (preliminary)

Job                        Standard Lock (MiB/s)   Lock Ahead (MiB/s)
CL_IOR_all_hdf5_cb_w_64m   1666.582                2541.630
CL_IOR_all_hdf5_cb_w_32m   1730.573                2650.620
CL_IOR_all_hdf5_cb_w_1m     689.620                1781.704

Lock Ahead: Performance (preliminary)

Job                         Group Lock (MiB/s)   Lock Ahead (MiB/s)
CL_IOR_all_mpiio_cb_w_64m   2528                 2530
CL_IOR_all_mpiio_cb_w_32m   2446                 2447
CL_IOR_all_mpiio_cb_w_1m    2312                 2286

Lock Ahead: Implementation
The implementation re-uses much of the Asynchronous Glimpse Lock (AGL) code and adds an LDLM flag that tells the server not to expand lock ahead locks.

Lock Ahead: When Will It Be Available?
Targeted as a feature for Lustre 2.10. Currently available in Cray Lustre 2.7. Available soon with Cray Sonexion (Neo 2.0 SU22).

Questions?