NAS, SANs and Parallel File Systems

Similar documents
Programming with MPI

Programming with MPI

Programming with MPI

Parallel File Systems Compared

Parallel File Systems for HPC

CS370 Operating Systems

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

iscsi Technology Brief Storage Area Network using Gbit Ethernet The iscsi Standard

Programming with MPI

Programming with MPI

Distributed File Systems

What is a file system

An Introduction to GPFS

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3.

Programming with MPI

Distributed File Systems II

Beyond Petascale. Roger Haskin Manager, Parallel File Systems IBM Almaden Research Center

The File Systems Evolution. Christian Bandulet, Sun Microsystems

Caching and reliability

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches

Google Cluster Computing Faculty Training Workshop

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

The Google File System. Alexandru Costan

CSE 153 Design of Operating Systems

Programming with MPI

System that permanently stores data Usually layered on top of a lower-level physical storage medium Divided into logical units called files

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

VERITAS SANPoint Storage Appliance Overview of an Open Platform for the Implementation of Intelligent Storage Server

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments

Announcements. Persistence: Crash Consistency

Chapter 11: File System Implementation. Objectives

Veritas NetBackup on Cisco UCS S3260 Storage Server

pnfs, POSIX, and MPI-IO: A Tale of Three Semantics

Distributed File Systems. Directory Hierarchy. Transfer Model

CSE506: Operating Systems CSE 506: Operating Systems

3.1. Storage. Direct Attached Storage (DAS)

CA485 Ray Walshe Google File System

Real-time Protection for Microsoft Hyper-V

File System Consistency. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Introduction to OpenMP

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

Physical-to-Virtual Migration with Portlock Storage Manager

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

Ext3/4 file systems. Don Porter CSE 506

AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi, Akshay Kanwar, Lovenish Saluja

Computer Systems Laboratory Sungkyunkwan University

Installing Ubuntu Server

File System Consistency

STORAGE PROTOCOLS. Storage is a major consideration for cloud initiatives; what type of disk, which

IT Certification Exams Provider! Weofferfreeupdateserviceforoneyear! h ps://

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Inside the PostgreSQL Shared Buffer Cache

General Purpose Storage Servers

HPC File Systems and Storage. Irena Johnson University of Notre Dame Center for Research Computing

Open Source Storage. Ric Wheeler Architect & Senior Manager April 30, 2012

Current Topics in OS Research. So, what s hot?

Chapter 10: Mass-Storage Systems

ICS Principles of Operating Systems

Chapter 10: Mass-Storage Systems. Operating System Concepts 9 th Edition

Symantec NetBackup PureDisk Compatibility Matrix Created August 26, 2010

The General Parallel File System

File systems CS 241. May 2, University of Illinois

Operating System Concepts Ch. 11: File System Implementation

EMC Celerra CNS with CLARiiON Storage

IOPStor: Storage Made Easy. Key Business Features. Key Business Solutions. IOPStor IOP5BI50T Network Attached Storage (NAS) Page 1 of 5

Distributed Systems COMP 212. Revision 2 Othon Michail

CS5460: Operating Systems Lecture 20: File System Reliability

Storage Area Networks: Performance and Security

GFS: The Google File System. Dr. Yingwu Zhu

Tricky issues in file systems

Linux Filesystems Ext2, Ext3. Nafisa Kazi

Exam : Title : Storage Sales V2. Version : Demo

Research on Implement Snapshot of pnfs Distributed File System

Protect enterprise data, achieve long-term data retention

Feedback on BeeGFS. A Parallel File System for High Performance Computing

The Google File System

PeerStorage Arrays Unequalled Storage Solutions

XP: Backup Your Important Files for Safety

TECHNICAL SPECIFICATIONS

DISTRIBUTED FILE SYSTEMS & NFS

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University

Storage Systems. Storage Systems

CyberStore DSS. Multi Award Winning. Broadberry. CyberStore DSS. Open-E DSS v7 based Storage Appliances. Powering these organisations

Lecture 11: Linux ext3 crash recovery

SMD149 - Operating Systems - File systems

Backup Solutions with (DSS) July 2009

G54ADM Sample Exam Questions and Answers

Chapter 11: Implementing File-Systems

Introduction to OpenMP

NFS: What s Next. David L. Black, Ph.D. Senior Technologist EMC Corporation NAS Industry Conference. October 12-14, 2004

An Overview of CORAID Technology and ATA-over-Ethernet (AoE)

1 / 23. CS 137: File Systems. General Filesystem Design

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEMS CS136

Red Hat Global File System

ò Very reliable, best-of-breed traditional file system design ò Much like the JOS file system you are building now

Lustre A Platform for Intelligent Scale-Out Storage

Assistance in Lustre administration

Operating Systems. File Systems. Thomas Ropars.

Transcription:

NAS, SANs and Parallel File Systems p. 1/?? NAS, SANs and Parallel File Systems Nick Maclaren nmm1@cam.ac.uk February 2008

NAS, SANs and Parallel File Systems p. 2/?? Summary of Presentation Background and default technology What the terms mean (insofar as they do) How they are constructed and operate What they can do and what they will not do An overview of the more common software This is an anti--salesman innoculation :--)

NAS, SANs and Parallel File Systems p. 3/?? Absolutely Basic File Server Bog standard Workstation NFS/CIFS GbE

NAS, SANs and Parallel File Systems p. 4/?? And Beyond That? More than one disk (perhaps up to 6 10) Limit is terabytes in 2008 (2 10 per box) Mirroring (halves the space, but provides safety) Software RAID--5 (needs 3+ disks etc.) Multiple Ethernet ports (watch out here) Or even an InfiniBand network! More memory (ECC of course), SMP CPUs UPS, remote power management etc.

NAS, SANs and Parallel File Systems p. 5/?? Much Fancier File Server Multi CPU SMP Server Usually RAID or Mirrored NFS/CIFS GbE/InfiniBand

NAS, SANs and Parallel File Systems p. 6/?? Tens/Hundreds of Terabytes You need some sort of multi--disk shelf/box All major server vendors sell them Can choose to use hardware RAID--5 or not Recovering tens of TB is incredibly tedious Devices may be local (SCSI, SATA etc.) Or networked (Ethernet, InfiniBand etc.) Generally, manage them much like disks Attached as devices or point--to--point network

NAS, SANs and Parallel File Systems p. 7/?? Using External s Server SCSI SATA iscsi Fibre Channel Almost Always RAID or Mirrored NFS/CIFS GbE/InfiniBand

NAS, SANs and Parallel File Systems p. 8/?? Aside: a Word to the Wise RAID--5 and UPS help only sometimes, not with: File system corruption caused by software Finger trouble by administrators and users Fire, flood, theft, malicious damage etc. No substitute for backups to somewhere else Even if just to a separate set of disks You regularly check recovery, of course?

NAS, SANs and Parallel File Systems p. 9/?? NAS Network Attached Storage It s just a complete file server in a box Can be easier to administer but less flexible You usually pay extra for the pre--configuration Remote management facilities of some sort Occasionally the term is used differently External disks connected over a network Those are described under Simple SAN

NAS, SANs and Parallel File Systems p. 10/?? Network Attached Storage Server Almost Always RAID or Mirrored NFS/CIFS/Netware GbE/InfiniBand

NAS, SANs and Parallel File Systems p. 11/?? SAN Storage Area Network What on earth does that gibberish mean? As far as I can tell, almost nothing Simplest use is consolidated external disks Connected to servers by a network (LAN) Can usually boot from and swap/dump to them Sometimes used for parallel file server/system Will come back to this later

NAS, SANs and Parallel File Systems p. 12/?? Simple SAN Embedded Device Server FibreChannel iscsi over GbE etc. (or InfiniBand)

NAS, SANs and Parallel File Systems p. 13/?? Simple SANs (1) Purpose is ease of administration Centralise disk management in one place A single rack contains many servers disks Can save money in UPS, space in racks Can sometimes save effort in taking backups Commercial sites seem to find them useful Few academic ones do, as far as I can tell The PWF does it with NetWare, though Staff effort is cheap or free in academia :--(

NAS, SANs and Parallel File Systems p. 14/?? Simple SANs (2) Interface is generally as virtual disks (LUNs) Connected to server as devices Via a device--level interface (e.g. FibreChannel) iscsi is SCSI over TCP/IP Can use Ethernet or InfiniBand space divided into partitions A partition mounted by just one server Avoids interlocking/consistency problems

NAS, SANs and Parallel File Systems p. 15/?? Sharing Filesystems So far, so good but also So What? Need to share filesystems between servers Current solutions all have major disadvantages NFS, AFS/DFS, CIFS/SMB, Netware,... Use a SAN to share at the device level? Here be dragons!

NAS, SANs and Parallel File Systems p. 16/?? Shared SAN Embedded Device Server FibreChannel iscsi over GbE etc. (or InfiniBand)

NAS, SANs and Parallel File Systems p. 17/?? Why is That a Problem? A version of the distributed database problem Has been intractable for 40+ years Parallel accesses introduce race conditions Caching then introduces inconsistency If done wrong, can even corrupt the filesystem Same problem occurs with crashes and hangs Including when a user flips the power switch All problems are probabilistic (not repeatable)

NAS, SANs and Parallel File Systems p. 18/?? Overview of Next Slides Start by describing original Unix I/O model Was used by Microsoft until recently Then describe hacks used on SMP systems Then why simply sharing devices doesn t work Will given only a few examples of failures Then onto how SAN filesystems actually work And what their constraints and uses are

NAS, SANs and Parallel File Systems p. 19/?? Original Unix I/O Model In this context, Microsoft systems are Unix--like I don t program them, so I shall describe POSIX All system calls are atomic The kernel is a single thread All I/O uses a single file cache applications see a consistent filesystem fsck restores filesystem integrity after a crash This does not restore application--level consistency

NAS, SANs and Parallel File Systems p. 20/?? SMP Issues Simplest to use a single CPU for all I/O That is too slow, so need to use several CPUs SMP kernels include a variety of ad hoc locks Modern filesystems are often journalled Reasonable protection against filesystem corruption Still application--level inconsistencies and races Rare on small SMP systems say, 8 Typically seen only by hard--core HPC people

NAS, SANs and Parallel File Systems p. 21/?? Distributed systems Problems may be thousands of times more likely Global locks are completely unacceptable Each system maintains its own cache Latencies are 10 1,000 times larger Just about works for co--operating applications Ones designed to avoid inconsistencies and races Even then, filesystem corruption does occur Quite often when one system crashes or hangs

NAS, SANs and Parallel File Systems p. 22/?? Thousands of Times? Race conditions need 2+ simultaneous events Probability proportional to square of frequency One event lasting 0.003 seconds every 5 minutes 2 clients involved in causing events a couple of race conditions a year One event lasting 0.03 seconds every 30 seconds 20 clients involved in causing events a race condition every few minutes The above is a 100,000--fold increase

NAS, SANs and Parallel File Systems p. 23/?? Appending to a File Standard sequence of operations: Remove block from free block list Write data to new block on disk Update inode with block ptr and timestamp Chaos if two servers do first step in parallel Block allocated twice and one might be a directory That leads to serious filesystem corruption

NAS, SANs and Parallel File Systems p. 24/?? Updating a File Standard sequence of operations: Read previous block contents from disk Write new data to block on disk Update inode with address and timestamp No possibility of filesystem corruption Except when the block is in a directory Race conditions in a dozen ways High chance of application--level inconsistency

NAS, SANs and Parallel File Systems p. 25/?? Deleting a File Standard sequence of operations: Remove entry from parent directory Check use count in inode; if zero: Restore blocks to free block list Release inode for reuse Chaos if the file is in use at the time The inode and free blocks may be reallocated That leads to serious filesystem corruption

NAS, SANs and Parallel File Systems p. 26/?? Directories No locking of directory access and update No way of synchronising all directories in path Chaos if a directory changes while being read POSIX implies sanity, but is just plain wrong Most directories fit in a single block Most utilities read directories in a burst Problems rarely occur on SMP systems

NAS, SANs and Parallel File Systems p. 27/?? Pathname resolution Implemented in kernel and usually quasi--atomic On SANs, even pathname resolution is risky Each directory level needs a separate read What happens if a multi--block directory is changed? Especially in B--tree directory implementations That is Bad News potentially even for security

NAS, SANs and Parallel File Systems p. 28/?? Horrible Example Master node: mv /dump /dump.old ; mkdir /dump Spawn dump request to all clients On each client node: cpio --o /local > /dump/cpio.$$ Return success indicator On master, wait for all clients to finish Only if all of the client dumps succeeded: rm --r /dump.old Now what if the directory update was delayed?

NAS, SANs and Parallel File Systems p. 29/?? The Usual Resolution Use a SAN filesystem (i.e. a distributed one) Every system must run the client code Contains enough handshaking to avoid corruption This may or may not involve global locking It may not prevent application--level inconsistency Different SAN filesystems have different rules Not all POSIX--conforming programs will work Most mailers and job schedulers don t

NAS, SANs and Parallel File Systems p. 30/?? SAN Filesystem iscsi / FibreChannel SAN FS?? SAN FS SAN FS Server Server Server

NAS, SANs and Parallel File Systems p. 31/?? Common Restrictions Parallel access to same object must be read--only And that includes access to directory trees Major restriction on find, make, tar etc. Timestamps may not be consistent with data Don t trust fsync on SAN filesystems There may be a delay as updates become visible Don t synchronise I/O by using message--passing Or vice versa...

NAS, SANs and Parallel File Systems p. 32/?? Horrible Warning! Remember the Horrible Example? It can still happen and I have seen it do so You often need explicit synchronisation E.g. pass the inode number of /dump Clients then loop until ls --i matches Or whatever... And leave make --j to masochists...

NAS, SANs and Parallel File Systems p. 33/?? POSIX Conformance Some have levels of POSIX conformance For example, Lustre / HP SFS does The more conformance, the less scalability Bad news for hard--core HPC people Remember that POSIX doesn t specify much Most applications also rely on de facto behaviour [ E.g. parallel reading and modifying a directory ] And you won t get de facto conformance

NAS, SANs and Parallel File Systems p. 34/?? Administration Constraints Identical SAN FS versions and configurations Potential chaos if you get them out of step Whole SAN may fail if one system crashes or hangs One person had better manage all systems Avoiding corruption leads to space leaks Fairly frequent recompaction may be needed OK for dedicated clusters in machine rooms Not good news for workstations on users desks

NAS, SANs and Parallel File Systems p. 35/?? Can We Simplify? The general case is always complicated Are there simpler cases that work better? The answer is, of course, yes, but... Very few are of much use in academia Use them in a complicated way once, and... [ Probabilistically, of course ] Single--writer systems are the most useful [ Including most forms of replication ] Hot failover is commonly touted by salesmen

NAS, SANs and Parallel File Systems p. 36/?? Single-Writer Systems (1) Only one system with update privileges All others mount the filesystem read--only Avoids the worst of the problems No filespace corruption or data inconsistency [ At least in theory... ] The read--only systems may see inconsistencies Solution is to remount the shared filesystem

NAS, SANs and Parallel File Systems p. 37/?? Single-Writer Systems (2) Very limited experience around the University Suitable for high--profile Web servers etc. The Sanger did (does?) this with GPFS All really big servers (like Google) do it You are advised to proceed with caution! Please tell me of your experiences

NAS, SANs and Parallel File Systems p. 38/?? Hot Failover A SAN is an important component of this But it is not a solution on its own Need the whole system to be designed for failover Reason is that critical state is kept at all levels In the actual disk blocks (obviously) In the filesystem cache (obviously) In the kernel (e.g. state in file descriptors) In the application (more state, locks etc.) Most experiences with this are not positive :--(

NAS, SANs and Parallel File Systems p. 39/?? Hot Failover iscsi / FibreChannel SAN FS Kernel Application iscsi / FibreChannel / GbE / wet string /... SAN FS Kernel Application

NAS, SANs and Parallel File Systems p. 40/?? Parallel File Servers For when a single file server is inadequate Needed for many clients and heavy I/O only Provides scalability, sometimes hot failover Hardware is cheaper than a large SMP server Perhaps not when including software and support This is major and growing use in the outside world Not just in research HPC but in commerce

NAS, SANs and Parallel File Systems p. 41/?? SAN Based Fileserving iscsi / FibreChannel SAN FS SAN FS SAN FS NFS/CIFS/etc. GbE

NAS, SANs and Parallel File Systems p. 42/?? True Parallel Filesystems I mean ones designed to increase performance Especially that of parallel applications Directly connected to each node of a cluster Extreme HPC for I/O--bound applications Most SANs don t support this type of use Genuinely parallel (non--posix) filesystem Need to design application for the filesystem As an example of this, look at MPI--2 I/O

NAS, SANs and Parallel File Systems p. 43/?? Common Parallel Filesystems Lustre is the HPC market leader (so they say) Experiences with IBM GPFS are very mixed Good: 1 Linux + 1 AIX; bad: 1 Linux + 1 AIX No personal experience with RedHat GFS, Panasas, SGI VxFS, Lustre or others TerraScale looks to be a good design I was not impressed by Ibrix to be polite

NAS, SANs and Parallel File Systems p. 44/?? Lustre / HP SFS I spent some time investigating this HP SFS is a Lustre offshoot Both fiendishly complicated in all respects Sun have bought Lustre and CFS So, bye, bye QFS few people will cry Must have full support contract or dedicated expert Probably only from Sun or HP

NAS, SANs and Parallel File Systems p. 45/?? RedHat GFS and GFS2 From http:/ /www.redhat.com/gfs/: Fully POSIX--compliant, meaning applications don t have to be rewritten to use GFS ********! er, sorry, I meant twaddle! It seems to provide block--level consistency It surely must provide filesystem integrity But I can find no references as to how... It seems to be still a research project

NAS, SANs and Parallel File Systems p. 46/?? pnfs (Parallel NFS) Extensions to NFSv4 by Internet RFCs http:/ /tools.ietf.org/wg/nfsv4/ Object extension more interesting to commerce Semi--standard Object Storage Devices protocol Clients have direct block access to storage Massive scope for inconsistency and confusion Possibly too clever by half and may flop, horribly Partly driven by Panasas, to compete with Lustre http:/ /www.pnfs.com/

NAS, SANs and Parallel File Systems p. 47/?? pnfs Block Access Design NFS Server Block Server Block Server Block Server Client Client Client

NAS, SANs and Parallel File Systems p. 48/?? Single-Writer Filesystems No known users of Novell, Oracle CFS etc. The PWF uses Novell in non--parallel mode The Google file system seems to be private There are doubtless many others I haven t found Most people use a more general SAN filesystem Which may even work, when in single--writer mode

NAS, SANs and Parallel File Systems p. 49/?? Microsoft DFS and FRS One (?) department uses Microsoft DFS, happily My comments are based on Microsoft documents Like a Simple SAN, but provides a single view Generally, each directory lives on one server Client access is via CIFS/SMB to servers only Assumes and relies on a single--writer model Uses replication for multiple servers Not efficient for heavily updated directories