Toward Understanding Life-Long Performance of a Sonexion File System

CUG 2015
Mark Swan, Doug Petesch, Cray Inc.
dpetesch@cray.com

Safe Harbor Statement
This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts. These statements are only predictions, and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

Agenda
- Introduction of concepts
- Experiment to reduce I/O variation
- Test method and results
- Summary

Sources of I/O Performance Variation Over Time
- Software updates: Lustre client and servers, I/O libraries
- Transient hardware issues: IB cables, degraded connections; failing disks, bad sectors, RAID rebuilds
- Disk position and fragmentation: can it be controlled? This is the focus of this presentation.

Lustre File System Terms
- Cray Sonexion 1600 file system: 1 SSU contains 2 servers with 4 OSTs each.
- Files in /proc/fs/ldiskfs/<device>:
  - mb_groups: bitmap of free space
  - mb_last_group: current allocation pointer
  - prealloc_table: disk space allocation sizes
- Each multi-block allocation group is 128 MiB = 32768 (4096-byte) disk blocks.
- There are over 160k groups in a 23 TB OST (3 TB disks).
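These proc files can be inspected directly on an OSS node. A minimal sketch, assuming ldiskfs devices named md* as on the later slides (device names vary by system), together with the group arithmetic from the slide:

    # Inspect the ldiskfs allocation state of one OST (device name varies).
    cat /proc/fs/ldiskfs/md0/mb_last_group    # group index of the allocation pointer
    cat /proc/fs/ldiskfs/md0/prealloc_table   # preallocation sizes, in 4 KiB blocks

    # Group arithmetic: a 128 MiB group holds 32768 blocks of 4 KiB, and a
    # 23 TB OST contains roughly 171,000 such groups.
    echo $(( 128 * 1024 * 1024 / 4096 ))          # 32768 blocks per group
    echo $(( 23 * 10**12 / (128 * 1024 * 1024) )) # ~171k groups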

Allocation Bitmap of an OST [figure]

1M OST pre-allocation, 4 files, 8M per file, write [animation: incoming Lustre packets (1 MiB); empty 1 MiB blocks in the file system]

1M OST pre-allocation, 4 files, 8M per file, read [animation: outgoing Lustre buffers (8 MiB each); user data in the file system]

8M OST pre-allocation, 4 files, 8M per file, write [animation: incoming Lustre packets (1 MiB); empty 1 MiB blocks in the file system, allocated 8 at a time]

8M OST pre-allocation, 4 files, 8M per file, read [animation: outgoing Lustre buffers (8 MiB each); user data in the file system]

Hypothesis
Using larger values of OST pre-allocation will:
- create more contiguous space when files are created
- leave behind fewer, bigger holes when files are removed
- increase read rates due to more contiguous space
- decrease write rates due to seeking
The increase in read rates will be larger than the decrease in write rates, providing better balance for large-block sequential I/O.

Testing Scenario
- Set 4 SSUs to 4 different OST pre-allocation sizes.
- Choose a clean section of each OST.
- Run optimal write and optimal read IOR jobs in each SSU to determine base rates at 100% free (sketched below).
- Simultaneously write 20 files on each OST so that data from all 20 files is intermingled.
- Show the distribution of contiguous space in each SSU.
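The slides do not give the exact IOR command lines; the following is a hypothetical pair of runs in that spirit, using standard IOR options (-w/-r select write or read, -t the transfer size, -b the amount per task, -F file per process, -e fsync after write, -k keep files) launched with aprun on the XE6. The task count, sizes, and path are illustrative assumptions:

    # Hypothetical "optimal write" pass: large sequential transfers, file per
    # process, files kept (-k) so the read pass below can reuse them.
    aprun -n 32 ior -w -k -e -F -t 8m -b 16g -o /lus/snx/ior/testfile

    # Matching "optimal read" pass over the same files.
    aprun -n 32 ior -r -F -t 8m -b 16g -o /lus/snx/ior/testfile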

Testing Scenario (continued)
Repeat until 100% free again:
- Delete 1 file from each OST, which releases 5% of the used space.
- Reset the allocation pointer to the beginning of the test section on each OST (see the sketch below).
- Run optimal write and optimal read IOR jobs.
- Record and plot results.
Test system:
- Cray XE6, 12 LNET routers, Lustre 2.5.1 client
- 4 SSU Sonexion 1600, NEO software level 1.2.1 (MDRAID), 3 TB disk drives
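One iteration of that loop might look like the following sketch. It assumes, as the slides imply, that mb_last_group accepts writes to reset the allocation pointer; the file name and START_GROUP value are hypothetical:

    # Delete one of the 20 fill files (hypothetical name): frees ~5% of used space.
    rm /lus/snx/fill/file_05

    # On each OSS: rewind the allocation pointer of every OST to the start of
    # the chosen test section (the group index here is a hypothetical example).
    START_GROUP=100000
    for g in /proc/fs/ldiskfs/md*/mb_last_group; do
        echo "$START_GROUP" > "$g"
    done

    # Then re-run the optimal write and read IOR jobs and record the rates.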

Bitmap of an OST (before) [figure]

Bitmap of an OST (after) [figure]

Histograms of extent sizes of IOR files [figure]

1 MiB OST Pre-allocation Fragmentation [figure]

2 MiB OST Pre-allocation Fragmentation [figure]

4 MiB OST Pre-allocation Fragmentation [figure]

8 MiB OST Pre-allocation Fragmentation [figure]


Summary
- The performance variation due to fragmentation is largest for 1 MiB OST pre-allocation, which is the default for the Sonexion 1600.
- Using 8 MiB OST pre-allocation, at any level of fragmentation:
  - the optimal write rate is at least 89% of the clean-file-system rate
  - the optimal read rate is at least 82% of the clean-file-system rate
  - optimal write rates are always above 4 GB/s/SSU
  - optimal read rates are always above 5 GB/s/SSU

Checking Settings on OSS Nodes

    cat /proc/fs/ldiskfs/md*/prealloc_table
    32 64 128             # default prealloc
    128 256 512           # prealloc of 2 MiB
    256 512 1024          # prealloc of 4 MiB
    256 512 1024 2048     # prealloc of 8 MiB

    cat /proc/fs/ldiskfs/md*/mb_last_group

mb_last_group reports which of the multi-block allocation groups was last used. Example: a value of 8000 means about 1 TB from the start of the OST.
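The slide only shows how to read the current tables; a hypothetical way to switch the setting, assuming prealloc_table accepts writes (the values are in units of 4 KiB blocks, so 2048 blocks = 8 MiB):

    # Hypothetical: select the 8 MiB preallocation table on every OST of this OSS.
    # (Assumes prealloc_table is writable; values are in 4 KiB blocks.)
    for t in /proc/fs/ldiskfs/md*/prealloc_table; do
        echo "256 512 1024 2048" > "$t"
    done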

Thank You. Questions?