Lustre High Availability Configuration at CETA-CIEMAT

Author: Alfonso Pardo Díaz. Event: LAD (Lustre Administrators and Developers Workshop). Place / Date: Paris, September.

INDEX: High availability (HA) issues; Lustre HA at CETA-CIEMAT.

The CETA-CIEMAT data center (MICINN) is a joint initiative with the regional government of Extremadura: a public institution, financed by PGE & FEDER. Mission: consolidate and disseminate e-science and IT, especially Grid and e-infrastructures; offer our resources (Grid, Cloud, and HPC: GPUs, clusters, ...); contribute to the effective expansion of e-science; facilitate the usage of resources (new sites, new applications). Located in the San Francisco Convent, Trujillo.

Our Lustre setup: just now updating from version 1.8.x to version 2.x! Different storage machines behind the metadata servers (heterogeneous environment). Separated MDS and MGS. Tape library for backup or HSM (Tivoli Storage Manager? RobinHood?). Different clients: CentOS, Debian, RedHat, Scientific Linux, ... Ethernet (at least for now!).
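
Because the MDS and MGS are separated, the MGS gets its own small target. A minimal sketch of how such a split could be formatted and mounted (the device paths and mount directories follow the Pacemaker resource definitions shown later in this talk; the fsname cetafs and the NIDs mds1/mds2 are illustrative placeholders, not our exact values):

    # Dedicated MGS target, with no MDT co-located on it
    mkfs.lustre --mgs /dev/mapper/mgs

    # MDT for the cetafs filesystem, registering both HA nodes as MGS NIDs
    mkfs.lustre --mdt --fsname=cetafs \
        --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 /dev/mapper/MDT

    # Pacemaker mounts these targets through the ocf:heartbeat:Filesystem agent:
    mount -t lustre /dev/mapper/mgs /mgs
    mount -t lustre /dev/mapper/MDT /cetafs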

Design concerns: our MDS/MGS/MDT.
MDS/MGS: IBM System x servers, MDS/MGS in active/passive (HA); 6 GB RAM (yes, I know, little RAM!); Xeon CPUs; Gbit Ethernet with LACP bonding.
MDT: IBM DS-series array as metadata target; Gbit Fibre Channel connection from the MDT to the MDS; RAID with one LUN per filesystem; hot spare for the RAID.

Design concerns: our OSS/OST.
Supermicro servers as OSS/OST; ...TB raw => 7...TB of Lustre space; RAID 6 per OST; hot spare to keep the RAID healthy; 8 GB RAM, Xeon CPUs; several Gbit Ethernet interfaces = one Gbit Ethernet LACP bond plus one active/passive failover bond.
How do we reach the different bonding interfaces? Two different IP interfaces per OSS; the second IP bond acts as the active or passive OSS path for the first interface (a bonding sketch follows).
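
A minimal sketch of what the two bonds could look like on an OSS, assuming RHEL/CentOS-style ifcfg files (interface names, which mode goes on which bond, and the addresses are illustrative placeholders, not our exact configuration):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- LACP aggregate for Lustre traffic
    DEVICE=bond0
    BONDING_OPTS="mode=802.3ad miimon=100"
    IPADDR=<first OSS IP>
    NETMASK=255.255.255.0
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-bond1 -- active/passive failover path
    DEVICE=bond1
    BONDING_OPTS="mode=active-backup miimon=100"
    IPADDR=<second OSS IP>
    NETMASK=255.255.255.0
    ONBOOT=yes

    # Each slave NIC is enslaved to its bond, e.g. /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes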

Design concerns: our OSS/OST (IBM-based).
IBM System x servers (same class as the MDS) in active/passive, with Xeon CPUs and Gbit Ethernet LACP bonding.
IBM DS-series arrays with expansion enclosures as OSTs; Gbit Fibre Channel connections from the OSTs to the OSS; RAID per OST; hot spare for the RAID.

What happens if an MDS or MGS server fails? The second server is in passive mode. Pacemaker keeps the MDT mounted on only one MDS server and decides where the active service must run. Clients detect MDS errors by timeout and hold a list of possible MDS servers (see the mount sketch below). The newly active MDS server enters RECOVERY, and the filesystem is stuck during the recovery window!
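
A hedged example of how a client can be given that list of MDS/MGS servers at mount time; the colon-separated NIDs are the standard Lustre failover syntax, while the hostnames, fsname and mount point are placeholders:

    # Mount listing both MGS/MDS nodes; if the active one fails, the client
    # retries the other NID once the timeout expires and waits for recovery.
    mount -t lustre mds1@tcp0:mds2@tcp0:/cetafs /mnt/cetafs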

What happens if an OSS server fails? The second server is in active/active or active/passive mode, and clients hold a list of possible OSS servers. GOOD AND TRANSPARENT! The filesystem keeps working without interruption.
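
A hedged sketch of how an OST can be formatted so that both OSS nodes are known to clients (device path, fsname and NIDs are placeholders):

    # Register the partner OSS as failover node for this OST; clients learn
    # both NIDs from the MGS and switch over without interrupting the filesystem.
    mkfs.lustre --ost --fsname=cetafs \
        --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 \
        --failnode=oss2@tcp0 /dev/mapper/ost00

    # For an already formatted OST the failover NID can be added afterwards:
    tunefs.lustre --failnode=oss2@tcp0 /dev/mapper/ost00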

But what happens if an OST fails? A disk crashed: parity control and data redundancy leave the RAID degraded. A higher RAID level gives more reliability, but also less usable space. And what if the disk controller fails, or several disks crash at once? A RAID failure means lost data, and the filesystem gets stuck during RAID reconstruction and during disk controller replacement.

Design concerns: Lustre metadata.
The Lustre metadata server only permits one active server, so we use an active/passive architecture. MGS and MDS run on separate servers, which is more efficient for high I/O request rates: on server 1 the MGS is active and the MDS passive; on server 2 the MDS is active and the MGS passive. Pacemaker kills the failed server through the STONITH service, using STONITH with PDU management (APC7000). A crm-shell sketch follows; the actual CIB XML is shown afterwards.
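
A hedged crm-shell sketch of how resources like these could be defined; the resource names, agents and fence parameters follow the CIB XML on the next slides, while the node names, PDU address and timeout values are placeholders, not our real ones:

    # Filesystem resources that mount the MGS and MDT targets on the active node
    crm configure primitive resmgs ocf:heartbeat:Filesystem \
        params device="/dev/mapper/mgs" directory="/mgs" fstype="lustre" \
        op monitor interval="60s" timeout="120s"
    crm configure primitive resmds ocf:heartbeat:Filesystem \
        params device="/dev/mapper/MDT" directory="/cetafs" fstype="lustre" \
        op monitor interval="60s" timeout="120s"

    # STONITH through the APC PDU: the surviving node powers off the failed one
    crm configure primitive st-apc stonith:fence_apc_snmp \
        params ipaddr="<pdu-address>" login="apc" passwd="apc" action="off" \
        pcmk_host_check="static-list" pcmk_host_list="<node1> <node2>"
    crm configure property stonith-enabled=true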

Resource definitions (Pacemaker CIB XML):

<primitive class="ocf" id="resmgs" provider="heartbeat" type="Filesystem">
  <instance_attributes id="resmgs-instance_attributes">
    <nvpair id="resmgs-instance_attributes-device" name="device" value="/dev/mapper/mgs"/>
    <nvpair id="resmgs-instance_attributes-directory" name="directory" value="/mgs"/>
    <nvpair id="resmgs-instance_attributes-fstype" name="fstype" value="lustre"/>
  </instance_attributes>
  <operations id="resmgs-operations">
    <op id="resmgs-start-0" interval="0" name="start" timeout="0"/>
    <op id="resmgs-stop-0" interval="0" name="stop" timeout="0"/>
    <op id="resmgs-monitor-60" interval="" name="monitor" start-delay="0" timeout="0"/>
  </operations>
  <meta_attributes id="resmgs-meta_attributes">
    <nvpair id="resmgs-meta_attributes-target-role" name="target-role" value="Started"/>
  </meta_attributes>
</primitive>

<primitive id="resmds" class="ocf" provider="heartbeat" type="Filesystem">
  <instance_attributes id="resmds-instance_attributes">
    <nvpair id="resmds-instance_attributes-device" name="device" value="/dev/mapper/MDT"/>
    <nvpair id="resmds-instance_attributes-directory" name="directory" value="/cetafs"/>
    <nvpair id="resmds-instance_attributes-fstype" name="fstype" value="lustre"/>
  </instance_attributes>
  <operations id="resmds-operations">
    <op id="resmds-start-0" interval="0" name="start" timeout="0"/>
    <op id="resmds-stop-0" interval="0" name="stop" timeout="0"/>
    <op id="resmds-monitor-60" name="monitor" interval="" timeout="0" start-delay="0"/>
  </operations>
  <meta_attributes id="resmds-meta_attributes">
    <nvpair id="resmds-meta_attributes-target-role" name="target-role" value="Started"/>
  </meta_attributes>
</primitive>

Fence definition (Pacemaker CIB XML, one fence_apc_snmp primitive per node):

<primitive id="stonith_fence_apc_snmp_fence_apc" class="stonith" type="fence_apc_snmp">
  <instance_attributes id="stonith_fence_apc_snmp_fence_apc-instance_attributes">
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-action" name="action" value="off"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-ipaddr" name="ipaddr" value="9.68.9."/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-login" name="login" value="apc"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-passwd" name="passwd" value="apc"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-port" name="port" value=""/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_list" name="pcmk_host_list" value="ic-d-0 ic-d-0"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_map" name="pcmk_host_map" value="ic-d-0: ic-d-0:"/>
  </instance_attributes>
  <meta_attributes id="stonith_fence_apc_snmp_fence_apc-meta_attributes"/>
</primitive>

<primitive id="stonith_fence_apc_snmp_fence_apc" class="stonith" type="fence_apc_snmp">
  <instance_attributes id="stonith_fence_apc_snmp_fence_apc-instance_attributes">
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-action" name="action" value="off"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-ipaddr" name="ipaddr" value="9.68.9."/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-login" name="login" value="apc"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-passwd" name="passwd" value="apc"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-port" name="port" value=""/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_list" name="pcmk_host_list" value="ic-d-0 ic-d-0"/>
    <nvpair id="nvpair-stonith_fence_apc_snmp_fence_apc-pcmk_host_map" name="pcmk_host_map" value="ic-d-0: ic-d-0:"/>
  </instance_attributes>
  <meta_attributes id="stonith_fence_apc_snmp_fence_apc-meta_attributes"/>
</primitive>

Resources and fence allocation (location constraints):

<rsc_location id="loc_resmgs_ic-d-0" node="ic-d-0" rsc="resmgs" score=""/>
<rsc_location id="loc_resmds_ic-d-0" node="ic-d-0" rsc="resmds" score=""/>
<rsc_location id="loc_resmds_ic-d-0" node="ic-d-0" rsc="resmds" score="0"/>
<rsc_location id="loc_resmgs_ic-d-0" node="ic-d-0" rsc="resmgs" score="0"/>
<rsc_location id="loc_stonith_fence_apc_snmp_fence_apc_ic-d-0" rsc="stonith_fence_apc_snmp_fence_apc" node="ic-d-0" score=""/>
<rsc_location id="loc_stonith_fence_apc_snmp_fence_apc_ic-d-0" rsc="stonith_fence_apc_snmp_fence_apc" node="ic-d-0" score="0"/>
<rsc_location id="loc_stonith_fence_apc_snmp_fence_apc_ic-d-0" rsc="stonith_fence_apc_snmp_fence_apc" node="ic-d-0" score=""/>
<rsc_location id="loc_stonith_fence_apc_snmp_fence_apc_ic-d-0" rsc="stonith_fence_apc_snmp_fence_apc" node="ic-d-0" score="0"/>

When an OST fails: RAID reconstruction can take a long time; a failed RAID can lose a lot of files in a striped filesystem; and a failure in a disk controller can stall the entire filesystem.

(Blueprint) The idea is basically to bring RAID-style parity up to the striped filesystem level: clients calculate a parity block for each file's slices, and that parity is written to an additional OST. If an OSS/OST fails, the client can reconstruct the file at runtime, at the cost of increased client CPU usage. Paranoid about data safety, or does your company require it? Use double parity. We are working on it!

Write path (client, metadata server, OSSs): the Lustre client wants to write a file and asks the metadata server which OSSs it must write to. The answer: split the file into slices on OSS1, OSS2 and OSS3, with parity on OSS4. The client calculates the parity of the slices, sends the slices to the OSSs, the OSSs write them, and each one answers "file written".

Read path (client, metadata server, OSSs): the Lustre client wants to read a file and asks the metadata server where it must read from. The answer: the file is split into slices on OSS1, OSS2 and OSS3, with parity on OSS4. The client reads the slices and the parity from those OSSs; each OSS receives the client request, and the client waits n seconds for the slow one. The surviving OSSs send their slices, but one OSS is down: the client receives the available slices, misses the slice from the failed OSS, and reconstructs the file using the parity.
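
As a small worked illustration of the arithmetic behind this blueprint (assuming, as in the flows above, three data slices plus one XOR parity slice):

    P  = S1 ⊕ S2 ⊕ S3        parity slice written to the fourth OST
    S2 = P  ⊕ S1 ⊕ S3        any single missing slice is recovered from the others

So losing one slice, or the OST holding it, costs only the XOR over the surviving slices on the client side.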

Appendix A: Grid software incompatibilities. Grid middleware troubles: Quotas: when the user quota is exhausted, the grid middleware still tries to write. Free disk space report: the grid software sees the full filesystem, but the usable space is limited by quota; the grid middleware should see the free quota space instead. XRootD.
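
A hedged illustration of the number the middleware would actually need (user name and mount point are placeholders): the per-user quota report rather than the filesystem totals.

    # Remaining block/inode quota for a grid user on the Lustre mount;
    # this, not a df of the whole filesystem, reflects the space it can still write.
    lfs quota -u griduser /mnt/cetafs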

THANKS!!! AND THANKS TO:

Conventual de San Francisco, Sola, Trujillo. Telephone: 97 6 9 7. Fax: 97 7. www.ceta-ciemat.es