Lustre Metadata Fundamental Benchmark and Performance

09/22/2014
Lustre Metadata Fundamental Benchmark and Performance
DataDirect Networks Japan, Inc.
Shuichi Ihara

Lustre Metadata Performance

Lustre metadata is a crucial performance metric for many Lustre users:
- LU-56 SMP scaling (Lustre 2.3)
- DNE (Lustre 2.4)

Metadata performance is closely related to small-file performance on Lustre, but it is still a little mysterious:
- How does performance differ by metadata operation type and access pattern?
- What is the impact of hardware resources on metadata performance?

This presentation uses standard metadata benchmark tools to analyze metadata performance on Lustre today.

Lustre Metadata Benchmark Tools

mds-survey
- Built into the Lustre code, similar to obdfilter-survey
- Generates load directly on the MDS to simulate Lustre metadata performance

mdtest
- The major metadata benchmark tool, used by many large HPC sites
- Runs on clients using MPI
- Supports several metadata operations and access patterns
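
For reference, typical invocations of the two tools look roughly like this (a sketch only; the thread counts, file counts, target name, and paths are illustrative, not the settings used in this study, and the mds-survey variables should be checked against your lustre-iokit version):

  # mds-survey ships with lustre-iokit, runs on the MDS itself, and is
  # driven by environment variables:
  thrlo=4 thrhi=16 file_count=100000 dir_count=4 \
      targets=lustre-MDT0000 sh mds-survey

  # mdtest is launched over MPI from the clients; -n sets the number of
  # items per process, -d the test directory:
  mpirun -np 16 mdtest -n 10000 -d /lustre/mdtest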

Single Client Metadata Performance Limitation

Single-client metadata performance does not scale with the number of threads.

[Chart: Single Client Metadata Performance (Unique) - file creation, stat, and removal in ops/sec (y-axis up to 60,000) stay nearly flat from 1 to 16 threads]

lustre/include/lustre_mdc.h:

/**
 * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
 *
 * This mutex is used to implement execute-once semantics on the MDT.
 * The MDT stores the last transaction ID and result for every client in
 * its last_rcvd file. If the client doesn't get a reply, it can safely
 * resend the request and the MDT will reconstruct the reply being aware
 * that the request has already been executed. Without this lock,
 * execution status of concurrent in-flight requests would be
 * overwritten.
 *
 * This design limits the extent to which we can keep a full pipeline of
 * in-flight requests from a single client. This limitation could be
 * overcome by allowing multiple slots per client in the last_rcvd file.
 */
struct mdc_rpc_lock {
        /** Lock protecting in-flight RPC concurrency. */
        struct mutex rpcl_mutex;
        /** Intent associated with currently executing request. */
        struct lookup_intent *rpcl_it;
        /** Used for MDS/RPC load testing purposes. */
        int rpcl_fakes;
};

A client can send many metadata requests to the MDS simultaneously, but the MDS must record each client's last transaction ID, and that update is serialized. LU-5319 adds support for multiple slots per client in the last_rcvd file (under development by Intel and Bull).

Modified mdtest for Lustre

Basic function
- Supports multiple mount points of a single filesystem on one client
- Helps generate a heavy metadata load from a single client

Background
- Originally developed by Liang Zhen for the LU-56 work
- We rebased and cleaned up the code and made a few enhancements

Enables metadata benchmarks on a small number of clients, useful for:
- Regression testing
- MDS server sizing
- Performance optimization

Performance Comparison

A single Lustre client mounts the same filesystem at /lustre_0, /lustre_1, ... /lustre_31:

# mdtest -n 10000 -u -d /lustre_{0-15}

[Charts: Single Client Metadata Performance (Unique), single mount point vs. multiple mount points; file creation, stat, and removal in ops/sec for 1 to 16 threads, both y-axes up to 60,000 ops/sec]
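
A minimal sketch of the multi-mountpoint setup, assuming an MGS reachable at mds01@o2ib and a filesystem named lustre (both names hypothetical):

  # Mount the same filesystem at 32 mount points so each mount gets its
  # own client instance, and therefore its own serialized RPC slot:
  for i in $(seq 0 31); do
      mkdir -p /lustre_$i
      mount -t lustre mds01@o2ib:/lustre /lustre_$i
  done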

Benchmark Configuration

- 2 x MDS (2 x E5-2676v2, 128GB memory)
- 4 x OSS (2 x E5-2680v2, 128GB memory)
- 32 x client (2 x E5-2680, 128GB memory)
- SFA12K-40: 400 x NL-SAS for OSTs, 8 x SSD for MDTs
- Lustre 2.5 on the servers; Lustre 2.6.52 and Lustre 1.8.9 on the clients

[Diagram: benchmark topology; clients, MDSs, and OSSs connected to the SFA12K-40 (controllers RP0/RP1)]

Metadata Benchmark Method

Tested metadata operations
- Directory/file creation
- Directory/file stat
- Directory/file removal

Access patterns (mdtest invocations are sketched below)
- Unique directory vs. shared directory
  o P0 -> /lustre/dir0/file.0.0, P1 -> /lustre/dir1/file.0.1 (Unique)
  o P0 -> /lustre/dir/file.0.0, P1 -> /lustre/dir/file.1.0 (Shared)
- Stride pattern
  o P0 creates files at /lustre/dir0/file.0.0, P1 calls stat() on the files P0 created, and finally P2 calls unlink() on them
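
The three patterns map onto mdtest options as follows (a sketch; the process and file counts are illustrative):

  # Unique: -u gives every MPI rank its own working directory.
  mpirun -np 1024 mdtest -n 1250 -u -d /lustre/mdtest

  # Shared: without -u, all ranks operate in one shared directory.
  mpirun -np 1024 mdtest -n 1250 -d /lustre/mdtest

  # Stride: -N <n> offsets stat()/unlink() by n ranks, so each file is
  # stat'ed and removed by a different rank than the one that created it.
  mpirun -np 1024 mdtest -n 1250 -N 1 -d /lustre/mdtest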

Lustre Metadata Performance: Impact of MDS CPU Speed (Unique Directory)

Metadata performance comparison (unique directory):
- 32 clients (1024 mount points), 1024 processes, 1.28M files
- Tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3, and 3.6GHz

[Charts: relative performance vs. MDS CPU speed, normalized to 2.1GHz, for directory operations (creation, stat, removal) and file operations (creation, stat, removal); annotated per-operation gains range from roughly 20% to 70%]

Lustre Metadata Performance: Impact of MDS CPU Speed (Shared Directory)

Metadata performance comparison (shared directory):
- 32 clients (1024 mount points), 1024 processes, 1.28M files
- Tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3, and 3.6GHz

[Charts: relative performance vs. MDS CPU speed, normalized to 2.1GHz, for directory operations (creation, stat, removal) and file operations (creation, stat, removal); annotated per-operation gains range from roughly 22% to 58%]

Lustre Metadata Performance: Impact of MDS CPU Cores (Unique Directory)

Metadata performance comparison (unique directory):
- 32 clients (1024 mount points), 1024 processes, 1.28M files
- Tested at 3.3GHz with 8, 12, and 16 CPU cores, with and without logical processors (Hyper-Threading)

[Charts: relative performance for directory and file operations (creation, stat, removal) across 8/12/16 cores with and without HT; annotated per-operation gains range from roughly 25% up to 80-120%]

Lustre Metadata Performance: Impact of MDS CPU Cores (Shared Directory)

Metadata performance comparison (shared directory):
- 32 clients (1024 mount points), 1024 processes, 1.28M files
- Tested at 3.3GHz with 8, 12, and 16 CPU cores, with and without logical processors (Hyper-Threading)

[Charts: relative performance for directory and file operations (creation, stat, removal) across 8/12/16 cores with and without HT; creation and stat in a shared directory do not scale, and there is no gain from 12 to 16 cores]

Lustre Metadata Performance: MDS Scalability (Unique Directory)

[Chart: Lustre Metadata Scalability (Unique), 32 clients with 1024 mount points, 16 to 1024 threads; file and directory creation/removal in ops/sec, y-axis up to 250,000. Annotations: file creation and removal roughly double (+100%) with DNE; directory creation reaches only about 50% of file creation throughput.]
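
The DNE results rely on distributing directories across MDTs; a sketch of how a remote directory is placed on a second MDT (the directory name is hypothetical; requires Lustre 2.4 or later):

  # Create a directory whose metadata lives on MDT0001, so load from
  # this subtree is served by the second MDS:
  lfs mkdir -i 1 /lustre/dne_dir1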

Lustre Metadata Performance: MDS Scalability (Shared Directory)

[Chart: Lustre Metadata Scalability (Shared), 32 clients with 1024 mount points, 16 to 1024 threads; file and directory creation/removal in ops/sec, y-axis up to 100,000. Annotation: shared-directory patterns deliver about 80% of the unique-directory performance.]

Why is Lustre directory creation slower than file creation?

ext4_mkdir()
  -> ext4_init_new_dir()
    -> ext4_try_create_inline_dir()
    -> ext4_append()
      -> ext4_getblk()
        -> ext4_map_blocks()

Average costs (inline_data vs. default):
- ext4_getblk(): 1.4us vs. 9.8us
- ext4_map_blocks(): 0.5us vs. 4.8us

[Chart: mdtest against ext4 with and without inline_data, ops/sec up to ~140,000; directory creation vs. file creation for the default and inline_data cases]

- The inline_data feature is available in newer kernels; RHEL7 also supports it.
- inline_data is not available in ldiskfs today because the ldiskfs dir_data feature and inline_data are currently incompatible.
- A similar idea might be worthwhile for ldiskfs; this is being investigated in LU-5603.
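
To reproduce the ext4 comparison outside Lustre, the feature can be enabled at format time (a sketch; the device and mount point are hypothetical, and inline_data requires a recent kernel such as RHEL7 plus an e2fsprogs new enough to set the feature):

  # Format plain ext4 with inline_data, mount it, and rerun the
  # directory-creation workload against it:
  mkfs.ext4 -O inline_data /dev/sdX
  mount /dev/sdX /mnt/ext4test
  mpirun -np 16 mdtest -n 10000 -u -d /mnt/ext4test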

Lustre Metadata Performance: File Creation and Removal for Small Files

Files are created with actual data (4K, 8K, 16K, 32K, 64K, and 128K), stripe count = 1.

[Charts: Small File Performance, unique and shared directory (32 clients, 1024 mount points); file creation, read, and removal in ops/sec, y-axes up to 180,000, for file sizes from 0 to 128KB]

Small-file performance is bounded by metadata performance, but the file size itself has no performance impact.
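
In mdtest, non-zero file sizes come from writing and reading a payload per file (a sketch; the counts and paths are illustrative):

  # Force stripe count 1, then create/read/remove files with real data;
  # -w writes and -e reads the given number of bytes per file.
  lfs setstripe -c 1 /lustre/mdtest
  for sz in 4096 8192 16384 32768 65536 131072; do
      mpirun -np 1024 mdtest -n 1250 -u -w $sz -e $sz -d /lustre/mdtest
  done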

Lustre Metadata Performance: Stride Access Pattern

Stride ('-N' in mdtest) ensures the stat() and unlink() operations that follow file creation are not served from the creating client's local lock cache.

[Chart: File Removal with Stride (32 clients, 64 processes), unique vs. shared directory, ops/sec up to 100,000; Lustre 1.8 and 2.5 clients, with and without stride]

LU-5608 tracks regressions in the 2.6 client for the stride metadata access pattern.

Summary of Observations

MDS server resources significantly affect Lustre metadata performance:
- Performance scales well with CPU core count and CPU speed for unique-directory access, but the shared-directory access pattern is not CPU bound.
- Baseline results were collected with 16 CPU cores; more tests with higher core counts are needed.

Performance is highly dependent on the metadata access pattern:
- Example: directory creation vs. file creation.
- The "stride" option helps avoid hitting the local lock cache.

With actual file sizes (instead of zero-byte files), the impact is small for a modest number of OSTs (e.g., up to 40), but testing with a larger number of OSTs is needed.

Metadata Performance: Future Work

Known issues and optimizations:
- Client-side metadata optimization, especially single-client metadata performance
- Various performance regressions in Lustre 2.5/2.6 that need to be addressed (e.g., LU-5608)

Areas of future investigation:
- Real-world metadata use scenarios and metadata problems
- Real-world small-file performance (e.g., life sciences)
- Impact of OST data structures on real-world metadata performance
- DNE scalability on very large systems with many MDSs/MDTs and many OSSs/OSTs