
Scalability Test Plan on Hyperion
For the Imperative Recovery Project
On the ORNL Scalable File System Development Contract

Revision History

Date        Revision    Author
07/19/2011  Original    J. Xiong
08/16/2011  Rev A       J. Xiong, I. Colle (minor formatting edits)

Table of Contents

I. Introduction
II. Test cases
    Procfs entries to be used
    Test cases
        ir.scalability.ost
        ir.scalability.mdt
        ir.scalability.multiple_targets
        ir.scalability.scalability

I. Introduction

The following test plan applies to the Imperative Recovery (IR) project within the ORNL Scalable File System Development contract signed 11/23/2010 and modified on 03/18/2011.

ORNL is experiencing exceedingly long service outages during server failover and recovery. Currently this process takes unacceptably long because it depends on timeouts scaled to accommodate congested service latency. Large-scale Lustre installations have historically had problems recovering in a timely manner after a server failure. This is due to the way that clients detect the server failure and how the servers perform their recovery.

From the client side, the only way to detect the failure of a server is by the timeout of Remote Procedure Calls (RPCs). If a server has crashed, the RPCs to the targets associated with that server will time out and the client will attempt to reconnect to the failed server, or to a failover server if one is configured. After reconnection, the client participates in recovery and then continues with normal service.

A restarting target will try to recover its state by replaying RPCs that were executed but not committed by its previous incarnation. These RPCs must be executed in the original order to ensure the target reconstructs its state consistently. The target must therefore wait for all clients to reconnect before recovery completes and new requests can be serviced, because a missed replay transaction could result in recovery failure and a late replay transaction could be invalidated by new requests. This wait, called the recovery window, is bounded to ensure a failing client cannot delay resumption of service indefinitely.

The above processes are time consuming, as most are driven by the RPC timeout, which must be scaled with system size to prevent false diagnosis of server death. This is especially difficult in an environment of ORNL's scale. The purpose of imperative recovery is to reduce the recovery window by actively informing clients of server failure. The resulting reduction in the recovery window will minimize target downtime and therefore increase overall system availability.

II. Test cases

There are currently two recovery modes in the system: standard recovery and imperative recovery. We are going to compare their performance under different situations. There are several control entries under procfs for imperative recovery; the following is the list of proc entries we will use during the test.

Procfs entries to be used

NODE: MGS
PATH: mgs.mgs.live.lustre
SETTINGS:
    Read: print out the stats and state of IR.
        [root@wolf6 tests]# lctl get_param mgs.mgs.live.lustre
        Imperative Recovery Status:
            state: Full, nonir clients: 0
            nidtbl version: 5
            notify total/max/count: 0/0/3
    Write: 0 - clear stats; status=disabled - disable IR; status=full - set IR to the Full state.

NODE: CLIENT
PATH: osc.lustre-OST0000-osc-ffff*.pinger_recov
SETTINGS:
    This controls whether standard (pinger-driven) recovery is used. If it is disabled, the corresponding OST can be recovered only by imperative recovery.
    Read: current setting.
    Write: 0 - disable pinger recovery; 1 - enable pinger recovery.

NODE: CLIENT
PATH: osc.lustre-OST0000-osc-ffff*.import
SETTINGS:
    Read:
        target: lustre-ost0000_uuid
        state: FULL
        instance: 1859057999
    Check state: FULL to make sure the OSC has reconnected to the OST successfully.
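For convenience, the reads and writes above can be exercised directly with lctl. The commands below are only a minimal sketch based on the entries listed in this section; the filesystem name (lustre) and the OST index are examples, and the exact set_param syntax for the status writes is an assumption inferred from the Write values above.

    # On the MGS: show IR state and notification stats
    lctl get_param mgs.mgs.live.lustre

    # On the MGS: clear IR stats / disable IR / re-enable IR (Full state)
    # (the status= forms assume lctl passes the text after the first '=' through verbatim)
    lctl set_param mgs.mgs.live.lustre=0
    lctl set_param mgs.mgs.live.lustre=status=disabled
    lctl set_param mgs.mgs.live.lustre=status=full

    # On a client: force recovery of OST0000 to rely on IR only
    lctl set_param osc.lustre-OST0000-osc-ffff*.pinger_recov=0

    # On a client: confirm the OSC import is back to FULL after a failover
    lctl get_param osc.lustre-OST0000-osc-ffff*.import | grep state: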

Test cases

ir.scalability.ost

Purpose: To measure and compare how quickly IR can recover a failing OST.
Components involved: OST, MGS and client.
Description: Shut down and restart an OST in a running cluster, and investigate the time for the OST to be recovered.

Reviewer comment (Sarp Oral, 8/22/11): For this and the other cases, does it make sense to test both with and without failover pairs declared?
Reply (Jinshan Xiong): No, we don't have such a test environment, but we have done this test as a unit test. The test cases here focus on scalability testing.

Requirements:
1. There must be workload on the failing OST. I recommend running IOR on each client node to write objects on this OST.
2. When the system is up, make sure the MGS is in the Full state by checking, at the MGS:
       grep state: /proc/fs/lustre/mgs/mgs/live/{fsname}

Failure induced: OST shutdown and restarted.

Preconditions:
1. MGS is running;
2. all clients have connected;
3. IOR is running on the clients.

Expected results: Clients should be able to reconnect to the failing target. Investigate the time for all clients to reconnect to the failing target. We shall calculate the maximum and average recovery time from each client.
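The reconnect-time measurement can be scripted. The fragment below is only an illustrative sketch and makes several assumptions not stated in this plan: pdsh is available for fan-out to the client nodes, the failing target is lustre-OST0000 served by a known OSS node, and the OST device and mount point on that OSS are known.

    #!/bin/bash
    # Hypothetical driver for ir.scalability.ost: restart one OST and time
    # how long it takes every client to report state: FULL for that OST.
    OSS=oss01                               # assumed OSS hostname
    OST_DEV=/dev/sdb                        # assumed OST block device on that OSS
    CLIENTS='client[001-128]'               # assumed pdsh host list of client nodes
    PARAM='osc.lustre-OST0000-osc-ffff*.import'

    ssh "$OSS" "umount /mnt/ost0 && mount -t lustre $OST_DEV /mnt/ost0"
    start=$(date +%s)

    # Poll until no client reports a non-FULL import state for the OST.
    # (A more careful harness would first wait for the imports to leave FULL,
    # and record a per-client timestamp to compute maximum and average times.)
    while pdsh -w "$CLIENTS" "lctl get_param -n $PARAM" 2>/dev/null \
            | grep 'state:' | grep -qv FULL; do
        sleep 1
    done
    echo "All clients reconnected after $(( $(date +%s) - start )) seconds"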

ir.scalability.mdt

Purpose: To measure how quickly IR can recover a failing MDT.
Components involved: MDT, MGS and client.
Description: Shut down and restart an MDT in a running cluster, and investigate the time for the MDT to be recovered.

Requirements:
1. Configure the MGS and MDT separately;
2. There must be workload on the failing MDT. I recommend running mdsrate on each client node to create objects on this MDT;
3. When the system is up, make sure the MGS is in the Full state by checking, at the MGS:
       grep state: /proc/fs/lustre/mgs/mgs/live/{fsname}

Failure induced: MDT shutdown and restarted.

Preconditions:
1. MGS is running;
2. all clients have connected;
3. mdsrate has been running for a while, to make sure there is enough unflushed dirty data on the MDT.

Expected results: Clients should be able to reconnect to the failing target. Investigate the time for all clients to reconnect to the failing target. We shall calculate the maximum and average recovery time from each client.

ir.scalability.multiple_targets

Purpose: To make sure imperative recovery works well if multiple targets fail at the same time.
Components involved: OST, MGS and client.
Description: Assuming there are tens of OSTs on each OSS, restart as many OSSes as possible at the same time, to investigate whether IR can handle this case smoothly.

Requirements: N/A

Failure induced: OSSes shutdown and restarted.

Preconditions: MGS is running and all clients have connected.

Expected results: Clients should be able to reconnect to all restarting OSTs in a short time.
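Restarting many OSSes at (approximately) the same time is easiest to approximate with a parallel fan-out tool. The fragment below is a rough sketch, not part of the plan proper: it assumes pdsh access to the OSS nodes, that each OSS mounts its OST devices under /mnt/ost*, and that /etc/fstab entries exist for the remount, none of which is specified in this document.

    #!/bin/bash
    # Hypothetical helper for ir.scalability.multiple_targets:
    # unmount and remount every OST on a set of OSS nodes in parallel.
    OSS_NODES='oss[01-16]'                  # assumed pdsh host list of OSS nodes

    # Stop all OSTs on every OSS at roughly the same time
    pdsh -w "$OSS_NODES" 'for m in /mnt/ost*; do umount "$m" & done; wait'

    # Restart them (remount by mount point, relying on assumed fstab entries)
    pdsh -w "$OSS_NODES" 'for m in /mnt/ost*; do mount "$m" & done; wait'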

ir.scalability.scalability

Purpose: To verify that imperative recovery can notify many clients in a reasonable time.
Components involved: OST, MGS and client.
Description: Mount thousands of mount points on each client node, and then restart an OST.

Requirements: N/A

Failure induced: OSSes shutdown and restarted.

Preconditions:
1. MGS is running and all clients have connected.
2. Make sure you have this line in your modprobe.conf file: options obdclass lu_cache_percent=1; otherwise, you cannot mount more than 200 mount points on one node.
3. Clear the IR stats at the MGS node: lctl set_param mgs.mgs.live.{fsname}=0

Expected results: Clients should be able to reconnect to the restarting OST in a short time. Collect the output of lctl get_param mgs.mgs.live.{fsname}.

Notes: One of IR's potential problems is that the MGS has to notify each client of the target registration update, so it may have a scalability problem if there are too many clients in the cluster. This test is to verify that IR works well with at least 100K clients.

Reviewer comment (Sarp Oral, 8/22/11): A version of this test case with ongoing I/O to other OSTs while the tested OST is restarted/tested would be nice to have.
Reply (Jinshan Xiong): Since we don't have that many nodes, we have simulated 75K clients by mounting 600 mount points on each client node. In this case, I don't think it will make much sense to add workload. The serious scalability test still has to be done at ORNL.

Reviewer comment (Sarp Oral, 8/22/11): Is the CPU load on the MGS measured during this test?
Reply (Jinshan Xiong): In our test, the MGS could notify 75K clients in a really short time, 8 seconds on average (with no workload in the cluster), so we did not record the CPU load on the MGS. What I can say from a quick look is that the CPU load on the MGS was high.
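Setting up the simulated client population and collecting the IR statistics can be scripted roughly as follows. This is a sketch under assumptions not fixed by the plan: the MGS NID (mgs@o2ib), the filesystem name (lustre), and 600 mount points per node (the figure mentioned in the reply above) are placeholders to be adjusted for the actual configuration.

    #!/bin/bash
    # Hypothetical per-client setup for ir.scalability.scalability.
    # Prerequisite (per precondition 2 above): modprobe.conf contains
    #     options obdclass lu_cache_percent=1
    MGSNID=mgs@o2ib                         # assumed MGS NID
    FSNAME=lustre                           # assumed filesystem name
    NMOUNTS=600                             # mount points per client node (example)

    for i in $(seq 1 "$NMOUNTS"); do
        mkdir -p /mnt/${FSNAME}_$i
        mount -t lustre ${MGSNID}:/${FSNAME} /mnt/${FSNAME}_$i
    done

    # At the MGS, before restarting the OST: clear the IR stats
    #     lctl set_param mgs.mgs.live.${FSNAME}=0
    # After all clients have reconnected: collect the notification stats
    #     lctl get_param mgs.mgs.live.${FSNAME}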