Network Request Scheduler Scale Testing Results. Nikitas Angelinas

Network Request Scheduler Scale Testing Results
Nikitas Angelinas, nikitas_angelinas@xyratex.com

Agenda
- NRS background
- Aim of test runs
- Tools used
- Test results
- Future tasks

NRS motivation
- Increased read and write throughput across the filesystem.
- Increased fairness amongst filesystem nodes, and better utilization of resources.
- Deliberate and controlled unfairness amongst filesystem nodes; QoS semantics:
  - Client or export prioritization.
  - Guaranteed minimum and maximum throughput for a client.
[Slide diagram: Clients, Network, OSTs.]

Foreword
- NRS is a collaborative project between Whamcloud and Xyratex.
- Code is in the git://git.whamcloud.com/fs/lustre-dev.git repo, branch liang/b_nrs; Jira ticket LU-398. It is waiting for some large-scale testing.
- NRS allows the PTLRPC layer to reorder the servicing of incoming RPCs; we are mostly interested in bulk I/O RPCs.
- Predominantly server-based, although the clients could play a part in some use cases.

NRS policies
- A binary heap data type is added to libcfs and used to implement prioritized queues of RPCs at the server. It sorts large numbers of RPCs (10,000,000+) with minimal insertion/removal time.
- FIFO: a logical wrapper around existing PTLRPC functionality; the default policy for all RPC types.
- CRR-E: Client Round Robin, round-robin over exports.
- CRR-N: Client Round Robin, round-robin over NIDs.
- ORR: Object Round Robin, round-robin over backend-fs objects, with request ordering according to logical or physical offsets.
- TRR: Target Round Robin, round-robin over OSTs, with request ordering according to logical or physical offsets.
- Client prioritization policy (not yet implemented).
- QoS, or guaranteed availability, policy (not yet implemented).
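The binary heap is what lets the server keep a very large number of queued RPCs ordered cheaply. Below is a minimal userspace sketch of such a min-heap used as a prioritized RPC queue; the structure layout, field names and the round-robin sequence key are illustrative assumptions, not the actual libcfs/PTLRPC code.

```c
#include <stdlib.h>

struct nrs_request {
	unsigned long	 rq_sequence;	/* scheduling key, e.g. a RR sequence number */
	void		*rq_payload;	/* opaque RPC handle */
};

struct rpc_heap {
	struct nrs_request	**h_elems;	/* array of element pointers */
	size_t			  h_count;	/* elements currently queued */
	size_t			  h_size;	/* allocated capacity */
};

/* true if a should be serviced before b */
static int rq_before(struct nrs_request *a, struct nrs_request *b)
{
	return a->rq_sequence < b->rq_sequence;
}

static void heap_swap(struct rpc_heap *h, size_t i, size_t j)
{
	struct nrs_request *tmp = h->h_elems[i];
	h->h_elems[i] = h->h_elems[j];
	h->h_elems[j] = tmp;
}

/* O(log n) insertion: append at the end and sift up towards the root */
int rpc_heap_insert(struct rpc_heap *h, struct nrs_request *rq)
{
	size_t i;

	if (h->h_count == h->h_size) {
		size_t nsize = h->h_size ? 2 * h->h_size : 64;
		void *p = realloc(h->h_elems, nsize * sizeof(*h->h_elems));
		if (p == NULL)
			return -1;
		h->h_elems = p;
		h->h_size = nsize;
	}
	i = h->h_count++;
	h->h_elems[i] = rq;
	while (i > 0 && rq_before(h->h_elems[i], h->h_elems[(i - 1) / 2])) {
		heap_swap(h, i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
	return 0;
}

/* O(log n) removal of the highest-priority request: pop the root, sift down */
struct nrs_request *rpc_heap_pop(struct rpc_heap *h)
{
	struct nrs_request *top;
	size_t i = 0;

	if (h->h_count == 0)
		return NULL;
	top = h->h_elems[0];
	h->h_elems[0] = h->h_elems[--h->h_count];
	for (;;) {
		size_t l = 2 * i + 1, r = 2 * i + 2, min = i;

		if (l < h->h_count && rq_before(h->h_elems[l], h->h_elems[min]))
			min = l;
		if (r < h->h_count && rq_before(h->h_elems[r], h->h_elems[min]))
			min = r;
		if (min == i)
			break;
		heap_swap(h, i, min);
		i = min;
	}
	return top;
}
```

Both insertion and removal are O(log n), which is what keeps the per-RPC scheduling overhead small even with 10,000,000+ queued requests.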

NRS features
- Allows a different policy to be selected for each PTLRPC service; potentially separate policies for high-priority and normal requests in the future.
- Policies can be hot-swapped via lprocfs while the system is handling I/O.
- Policies can fail the handling of a request, intentionally or unintentionally; a failed request is handled by the FIFO policy, which cannot fail the processing of an RPC (see the sketch below).
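As a rough illustration of the fallback behaviour described above, the sketch below hands a request to the FIFO policy whenever the active policy declines or fails it. The structure and callback names are assumptions for illustration, not the real NRS interfaces.

```c
#include <errno.h>
#include <stddef.h>

struct nrs_request;			/* opaque RPC handle */

struct nrs_policy {
	/* return 0 on success, negative errno to decline the request */
	int (*pol_enqueue)(struct nrs_policy *policy, struct nrs_request *rq);
};

struct nrs_service {
	struct nrs_policy *srv_active;	/* policy selected via lprocfs */
	struct nrs_policy *srv_fifo;	/* always present, never fails */
};

static void nrs_request_enqueue(struct nrs_service *svc, struct nrs_request *rq)
{
	int rc = -ENOENT;

	if (svc->srv_active != NULL)
		rc = svc->srv_active->pol_enqueue(svc->srv_active, rq);

	/* a request the active policy failed or declined falls back to FIFO */
	if (rc != 0)
		(void)svc->srv_fifo->pol_enqueue(svc->srv_fifo, rq);
}
```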

Questions to be answered
- Does the NRS framework with the FIFO policy introduce any performance regressions?
- Does it scale to a large number of clients?
- Are the algorithms implemented effectively?
- Are other policies besides FIFO useful? A given policy may aid performance in particular situations while hindering it in others, or may benefit more generic aspects of the filesystem workload.
- Provide quantified answers to the above via a series of tests performed at large scale.

Benchmarking environment
- NRS code rebased on the same Git commit as the unpatched baseline, for an apples-to-apples comparison.
- IOR for single-shared-file (SSF) and file-per-process (FPP) sequential I/O tests.
- mdtest for file and directory metadata performance.
- IOzone in clustered mode.
- Multi-client test script using groups of dd processes.
- 1 Xyratex CS3000: 2 x OSS, with 4 x OSTs each.
- 10-128 physical clients, depending on test case and resource availability.
- InfiniBand QDR fabric.
- Larger scale tests performed at the University of Cambridge; smaller scale tests performed in-house.

Performance regression with FIFO policy
- IOR SSF and FPP, and mdtest runs.
- Looking for major performance regressions; minor regressions would be hidden by the variance between test runs, so these tests give indications rather than definite assurance.
- IOR FPP: IOR -v -a POSIX -i3 -g -e -w -W -r -b 16g -C -t 4m -F -o /mnt/lustre/testfile.fpp -O lustrestripecount=1
- IOR SSF: IOR -v -a POSIX -i3 -g -e -w -W -r -b 16g -C -t 4m -o /mnt/lustre/testfile.ssf -O lustrestripecount=-1
- mdtest: mdtest -u -d /mnt/lustre/mdtest{1-128} -n 32768 -i 3

IOR FPP regression testing
[Charts: IOR FPP sequential 4 MB I/O, 128 clients and 64 clients (1 thread per client); write and read throughput (MB/sec, 0-4000) for the baseline and the NRS FIFO policy.]

IOR SSF regression testing
[Charts: IOR SSF sequential 4 MB I/O, 128 clients and 64 clients (1 thread per client); write and read throughput (MB/sec, 0-4000) for the baseline and the NRS FIFO policy.]

mdtest file and directory operations regression testing
[Charts: mdtest create/stat/unlink IOPS, 128 clients, 1 thread per client, 4.2 million files and 4.2 million directories; baseline vs NRS FIFO.]

mdtest file and directory operations regression testing
[Charts: mdtest create/stat/unlink IOPS, 64 clients, 1 thread per client, 2.1 million files and 2.1 million directories; baseline vs NRS FIFO.]

mdtest file and directory operations regression testing
[Charts: mdtest create/stat/unlink IOPS, 12 clients, 2 threads per client, 196,608 files and 196,608 directories; baseline vs NRS FIFO.]

CRR-N dd-based investigation
- Groups of dd processes.
- Read test: dd if=/mnt/lustre/dd_client*/bigfile* of=/dev/null bs=1m
- Write test: dd if=/dev/zero of=/mnt/lustre/dd_client*/outfile* bs=1m
- Two series of test runs: 10 clients with 10 dd processes each; 9 clients with 11 dd processes and 1 client with 1 dd process.
- Observe the effect of NRS CRR-N versus the baseline for the above test runs.
- Measure throughput at each client, and calculate the standard deviation of the per-client throughput (see the sketch below).
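Purely for illustration, here is a tiny standalone program of the kind one might use to reduce the per-client dd results to the mean and (population) standard deviation quoted on the following slides; the throughput values are placeholders, not measured data.

```c
/* build: cc -std=c99 stdev.c -lm */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
	/* placeholder per-client write throughput in MB/sec */
	double mbps[] = { 345.0, 347.5, 344.2, 346.1, 345.8,
			  346.9, 344.7, 345.3, 346.4, 345.6 };
	size_t n = sizeof(mbps) / sizeof(mbps[0]);
	double sum = 0.0, var = 0.0, mean;

	for (size_t i = 0; i < n; i++)
		sum += mbps[i];
	mean = sum / n;

	for (size_t i = 0; i < n; i++)
		var += (mbps[i] - mean) * (mbps[i] - mean);
	var /= n;

	printf("aggregate %.1f MB/sec, per-client mean %.2f MB/sec, stdev %.3f MB/sec\n",
	       sum, mean, sqrt(var));
	return 0;
}
```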

[Chart: CRR-N brw RPC distribution (dd test, 14 clients).]

dd write test: 10 clients, each with 10 processes, baseline vs CRR-N
[Chart: per-client write throughput (MB/sec, roughly 280-380) for clients 1-10, baseline vs CRR-N.]

  handler   stdev           total throughput
  baseline  19.361 MB/sec   3469 MB/sec
  CRR-N     0.425 MB/sec    3537.5 MB/sec

dd write test: 9 clients with 11 processes, client 4 with 1 process, baseline vs CRR-N
[Chart: per-client write throughput (MB/sec) for clients 1-10, baseline vs CRR-N.]

  handler   stdev (client 4 excluded)   client 4      total throughput
  baseline  22.756 MB/sec               49.2 MB/sec   3473.9 MB/sec
  CRR-N     0.491 MB/sec                167 MB/sec    3444 MB/sec

dd read test: 10 clients, each with 10 processes, 128 threads, baseline vs CRR-N
[Chart: per-client read throughput (MB/sec) for clients 1-10, baseline vs CRR-N.]

  handler   stdev           total throughput
  baseline  6.156 MB/sec    2193.8 MB/sec
  CRR-N     7.976 MB/sec    2054.6 MB/sec

dd read test: 9 clients with 11 processes, client 3 with 1 process, 128 threads, baseline vs CRR-N
[Chart: per-client read throughput (MB/sec) for clients 1-10, baseline vs CRR-N.]

  handler   stdev (client 3 excluded)   client 3      total throughput
  baseline  7.490 MB/sec                161 MB/sec    2229.2 MB/sec
  CRR-N     8.455 MB/sec                169 MB/sec    2106.6 MB/sec

IOR FPP: baseline vs NRS with CRR-N policy
[Charts: IOR FPP sequential 4 MB I/O, 128 clients and 64 clients (1 thread per client); write and read throughput (MB/sec, 0-4000) for the baseline and CRR-N.]

IOR SSF: baseline vs NRS with CRR-N policy
[Charts: IOR SSF sequential 4 MB I/O, 128 clients and 64 clients (1 thread per client); write and read throughput (MB/sec, 0-4000) for the baseline and CRR-N.]

CRR-N comments
- CRR-N causes a significant lowering of the standard deviation of write throughput, i.e. it 'evens things out'. Many users will want this.
- CRR-N shows a negative effect on dd test read operations, but the IOR regression tests are fine. Worst case, reads could be routed to FIFO or another policy.
- CRR-N may improve compute cluster performance when used with real jobs that do some processing.
- No performance regressions on IOR tests; this gives confidence to deploy in real clusters and get real-world feedback.
- A future testing task is to see whether these results scale.

ORR/TRR policies
- ORR serves bulk I/O RPCs (only OST_READs by default) in a round-robin manner over the available backend-fs objects.
- RPCs are grouped in per-object groups of 'RR quantum' size (an lprocfs tunable), and sorted within each group by logical or physical disk offset.
- Physical offsets are calculated using extent information obtained via fiemap (see the example below).
- TRR performs the same scheduling, but in a round-robin manner over the available OSTs.
- The main aim is to minimize drive seek operations, thus increasing read performance.
- TRR should be able to help in cases where an OST is underutilized; this was not straightforward to test.
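The fiemap interface used for physical ordering is a standard Linux ioctl. The short userspace example below retrieves the first physical extent of a file with FS_IOC_FIEMAP; it only illustrates the logical-to-physical mapping, and is not the in-kernel code path the NRS policies actually use.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* room for the fiemap header plus one extent descriptor */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (fm == NULL) {
		close(fd);
		return 1;
	}
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = 1;		/* only fetch the first extent */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
	} else if (fm->fm_mapped_extents > 0) {
		printf("logical %llu -> physical %llu (%llu bytes)\n",
		       (unsigned long long)fm->fm_extents[0].fe_logical,
		       (unsigned long long)fm->fm_extents[0].fe_physical,
		       (unsigned long long)fm->fm_extents[0].fe_length);
	}

	free(fm);
	close(fd);
	return 0;
}
```

On a rotating-disk OST, sorting the requests of a round-robin quantum by such physical offsets rather than by file offset is what reduces seeking.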

[Chart: ORR (phys, 8) brw RPC distribution (IOR FPP test).]

[Chart: TRR (phys, 8) brw RPC distribution (IOR FPP test).]

ORR/TRR policy tests
- IOR used to perform read tests; each IOR process reads 16 GB of data, and kernel caches are cleared between reads.
- Performance is compared against the baseline and NRS FIFO for different ORR/TRR policy parameters.
- Tests with 1 process per client and with 8 processes per client.
- Only 14 clients were available, and read operations generate relatively few RPCs.
- ost_io.threads_max=128 on both OSS nodes; the OSS nodes are not totally saturated with this number of clients.

ORR/TRR policy: IOR FPP read
IOR FPP sequential read, 1 MB I/O, 14 clients, 1 thread per client, 16 GB file per thread:

  baseline      3092.91 MB/sec
  FIFO          3102.17 MB/sec
  ORR log 256   3146.97 MB/sec
  ORR phys 256  3150.86 MB/sec
  TRR log 256   3164.66 MB/sec
  TRR phys 256  3268.98 MB/sec

ORR/TRR policy: IOR SSF read
IOR SSF sequential read, 1 MB I/O, 14 clients, 1 thread per client, 16 GB file per thread:

  baseline      2744.15 MB/sec
  FIFO          2741.78 MB/sec
  ORR log 256   2689.12 MB/sec
  ORR phys 256  2728.89 MB/sec
  TRR log 256   2684.42 MB/sec
  TRR phys 256  2720.96 MB/sec

ORR/TRR policy: IOR FPP read
IOR FPP sequential read, 1 MB I/O, 14 clients, 8 threads per client, 16 GB file per thread:

  baseline      2274.55 MB/sec
  FIFO          2248.62 MB/sec
  ORR log 256   2432.78 MB/sec
  ORR phys 256  2424.69 MB/sec
  TRR log 256   1540.82 MB/sec
  TRR phys 256  1778.56 MB/sec

ORR/TRR policy: IOR SSF read
IOR SSF sequential read, 1 MB I/O, 14 clients, 8 threads per client, 16 GB file per thread:

  baseline      2260.37 MB/sec
  FIFO          2257.9 MB/sec
  ORR log 256   1089.385 MB/sec
  ORR phys 256  1236.72 MB/sec
  TRR log 256   1086.18 MB/sec
  TRR phys 256  1258.907 MB/sec

IOzone read test, all policies
[Chart: IOzone read, 1 MB I/O, 16 GB per process, 14 clients, 1 process per client; aggregate throughput (MB/sec, 3000-3800) for baseline, FIFO, CRR-N, ORR log 256, ORR phys 256, TRR log 256 and TRR phys 256.]

IOzone read test, throughput per process
14 clients, 1 process per client:

  handler          min (MB/sec)   max (MB/sec)
  baseline         190.18         606.87
  FIFO             191.51         618.05
  CRR-N            188.79         513.87
  ORR (log, 256)   198.8          425.27
  ORR (phys, 256)  198.8          418.85
  TRR (log, 256)   208.48         476.55
  TRR (phys, 256)  217.56         488.25

ORR/TRR policy tests, large readahead
- With only 14 clients, for the 512 ost_io thread tests, the number of outstanding RPCs was increased by setting max_read_ahead_mb=256 and max_read_ahead_per_file_mb=256.
- These settings lead to curious results.
- Tests were run without the LU-983 fix for readahead.

ORR/TRR policy: IOR FPP read, large readahead
IOR FPP sequential read, 1 MB I/O, 14 clients, 1 thread per client, 32 GB file per thread:

  baseline      3219.73 MB/sec
  FIFO          3252.52 MB/sec
  ORR log 256   3396.07 MB/sec
  ORR phys 256  3398.86 MB/sec
  TRR log 256   3512.55 MB/sec
  TRR phys 256  3602.22 MB/sec

TRR (physical): IOR FPP read, large readahead, RR quantum sweep
[Chart: IOR FPP sequential read, 1 MB I/O, 14 clients, 1 thread per client, 32 GB file per thread; throughput (MB/sec, 2900-3700) for RR quantum sizes 8, 16, 32, 64, 128, 256, 512 and 1024.]
Performance is highest at a quantum size of ~512; the exact number may vary between workloads.

ORR/TRR policy: IOR SSF read, large readahead
IOR SSF sequential read, 1 MB I/O, 14 clients, 1 thread per client, 448 GiB file:

  baseline      3404.8 MB/sec
  FIFO          3364.2 MB/sec
  ORR log 256   3138.4 MB/sec
  ORR phys 256  3293.1 MB/sec
  TRR log 256   3165.3 MB/sec
  TRR phys 256  3285.5 MB/sec

Notes on ORR and TRR policies
- ORR/TRR increase performance in some test cases, but decrease it in others.
- ORR/TRR may improve the performance of small and/or random reads. Random reads produce a small number of RPCs with few clients, so this was not tested.
- TRR may improve the performance of reads of widely striped files. Only 8 OSTs were available for these tests, so this was not tested.
- ORR/TRR may improve the performance of backward reads. Again, too few RPCs were generated for this test, so this was not tested.
- In a multi-layered NRS policy environment, TRR can be simplified.
- The ORR policy will need an LRU-based or similar method for object destruction; TRR much less so.
- ORR/TRR should be less (if at all) beneficial on SSD-based OSTs.

Test results summary
- The NRS framework with the FIFO policy shows no significant performance regressions; data and metadata operations were tested at reasonably large scale.
- The CRR-N and ORR/TRR policies look promising for some use cases: CRR-N tends to even out per-client throughput, and ORR/TRR improve read performance in some test cases. They may be useful in specific scenarios, or for more generic usage.
- We may get the best of the policies when they are combined in a multi-layer NRS arrangement.
- Further testing may be required to quantify CRR-N and ORR/TRR benefits at larger scale. We ran out of large-scale resources for LUG, but will follow up on this presentation with more results.

Future tasks
- Decide whether the NRS framework, with the FIFO policy and perhaps others, can land soon.
- Work out a test plan with the community if further testing is required; we should be able to perform testing at larger scale soon.
- Two policies operating at the same time should be useful.
- NRS policies as separate kernel modules?
- QoS policy.

Thank you! Nikitas Angelinas nikitas_angelinas@xyratex.com