Today: Coda, xfs


Today: Coda, xfs!
- Case study: the Coda file system
- Brief overview of other file systems
  - xfs
  - Log-structured file systems

Distributed File System Requirements!
- Transparency: access, location, mobility, performance, scaling
- Concurrent file updates
- File replication
- Hardware and OS heterogeneity
- Fault tolerance
- Consistency
- Security
- Efficiency

AFS Background!
- Alternative to NFS
- Originally a research project at CMU
- Later a commercial product from Transarc (IBM), now defunct; public versions are now available
- Goal: a single namespace for global files (sound familiar? DNS? The Web?)
- Designed for wide-area file sharing and scalability
- Example of a stateful file system design
  - Server keeps track of clients accessing files
  - Uses callback-based invalidation
  - Eliminates the need for timestamp checking, increases scalability
  - Complicates the design

Coda Overview!
- Offshoot of AFS designed for mobile clients
- Observation: AFS already does the work of migrating popular/necessary files to your machine on access
  - Nice model for mobile clients, who are often disconnected
- Use the file cache to make disconnection transparent
  - At home, on the road, away from a network connection
- Coda supplements the AFS file cache with user preferences
  - E.g., "always keep this file in the cache"
  - Supplemented by the system learning user behavior
- Consistency issues?

Coda Consistency!
- How to keep cached copies on disjoint hosts consistent?
- In a mobile environment, simultaneous writes can be separated by hours/days/weeks
- Callbacks cannot work, since no network connection is available
- Coda approach (in order):
  - Assume that write sharing is the rare case
  - Attempt an automatic patch
  - Fall back to manual (user) intervention

File Identifiers!
- Each file in Coda belongs to exactly one volume
  - A volume may be replicated across several servers
  - A logical (replicated) volume maps to multiple physical volumes, one per replica
- 96-bit file identifier = 32-bit RVID + 64-bit file handle
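A minimal sketch (in Python, with hypothetical function names, not Coda's actual code) of how a 96-bit file identifier of this shape could be packed from its 32-bit RVID and 64-bit file handle and split back apart:

```python
import struct

def pack_fid(rvid: int, file_handle: int) -> bytes:
    """Pack a 32-bit replicated volume ID and a 64-bit file handle
    into a 12-byte (96-bit) file identifier (big-endian layout assumed)."""
    return struct.pack(">IQ", rvid, file_handle)

def unpack_fid(fid: bytes):
    """Split a 96-bit file identifier back into (RVID, file handle)."""
    return struct.unpack(">IQ", fid)

fid = pack_fid(0x2A, 0x0000_0001_0000_BEEF)
assert len(fid) * 8 == 96
assert unpack_fid(fid) == (0x2A, 0x0000_0001_0000_BEEF)
```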

Sharing Files in Coda!
- Transactional behavior for sharing files: similar to share reservations in NFS
- File open: transfer the entire file to the client machine [similar to delegation]
- Uses session semantics: each session is like a transaction (see the sketch below)
  - Updates are sent back to the server only when the file is closed

Transactional Semantics!
- Network partition: part of the network is isolated from the rest
- Allow conflicting operations on replicas across partitions
- Reconcile upon reconnection
- Transactional semantics => operations must be serializable
  - Ensure that operations were serializable after they have executed
- Conflict => force manual reconciliation
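A minimal sketch of the whole-file session semantics described in the "Sharing Files in Coda" slide above (Python; the server object and its fetch/store methods are illustrative assumptions, not Coda's API): open fetches the entire file, all reads and writes are local, and the server sees the update only at close.

```python
class CodaStyleSession:
    """Illustrative whole-file session: fetch on open, write back on close."""

    def __init__(self, server, path):
        self.server = server                         # assumed to offer fetch()/store()
        self.path = path
        self.data = bytearray(server.fetch(path))    # entire file moves to the client on open
        self.dirty = False

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])   # served from the local copy

    def write(self, offset, new_bytes):
        self.data[offset:offset + len(new_bytes)] = new_bytes
        self.dirty = True                            # the server is NOT contacted here

    def close(self):
        if self.dirty:                               # updates propagate only at close
            self.server.store(self.path, bytes(self.data))
```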

Client Caching!
- Cache consistency is maintained using callbacks
  - The server tracks all clients that have a copy of the file [provides a callback promise]
  - Upon modification: send invalidations to those clients

Server Replication!
- Use replicated writes: read-one, write-all
  - Writes are sent to the AVSG (the accessible volume storage group, i.e., all accessible replicas)
- How to handle network partitions?
  - Use an optimistic strategy for replication
  - Detect conflicts using a Coda version vector
  - Example: [2,2,1] and [1,1,2] is a conflict => manual reconciliation
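A minimal sketch of the version-vector conflict test mentioned above (Python; the function name is illustrative, not Coda's actual interface): two vectors conflict exactly when each is ahead of the other on at least one replica.

```python
def compare_version_vectors(v1, v2):
    """Compare two version vectors (one update counter per replica).
    Returns 'equal', 'v1 dominates', 'v2 dominates', or 'conflict'."""
    assert len(v1) == len(v2)
    v1_ahead = any(a > b for a, b in zip(v1, v2))
    v2_ahead = any(b > a for a, b in zip(v1, v2))
    if v1_ahead and v2_ahead:
        return "conflict"          # concurrent updates in different partitions
    if v1_ahead:
        return "v1 dominates"      # v1's replica may simply overwrite v2's
    if v2_ahead:
        return "v2 dominates"
    return "equal"

# The example from the slide: neither vector dominates the other.
assert compare_version_vectors([2, 2, 1], [1, 1, 2]) == "conflict"
```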

Disconnected Operation!
- [Figure: the state-transition diagram of a Coda client with respect to a volume (hoarding, emulation, reintegration)]
- Use hoarding to provide file access during disconnection
  - Prefetch all files that may be accessed and cache (hoard) them locally
- If AVSG = 0, go to emulation mode; reintegrate upon reconnection

Overview of xfs!
- Key idea: a fully distributed file system [serverless file system]
  - Removes the bottleneck of a centralized system
- xfs: the "x" in xfs => no server
- Designed for high-speed LAN environments

xfs Summary!
- Distributes data storage across disks using software RAID and log-based network striping
  - RAID = Redundant Array of Independent Disks
- Dynamically distributes control processing across all servers at per-file granularity
  - Utilizes a serverless management scheme
- Eliminates central server caching using cooperative caching
  - Harvests portions of client memory as a large, global file cache

RAID Overview!
- Basic idea: files are "striped" across multiple disks
- Redundancy yields high data availability
  - Availability: service is still provided to the user, even if some components have failed
- Disks will still fail
  - Contents are reconstructed from data redundantly stored in the array
  - Capacity penalty to store the redundant info
  - Bandwidth penalty to update the redundant info
- Slides courtesy of David Patterson

Replace a Small Number of Large Disks with a Large Number of Small Disks! (1988 disks)

               IBM 3390K      IBM 3.5" 0061    x70 array
  Capacity     20 GBytes      320 MBytes       23 GBytes
  Volume       97 cu. ft.     0.1 cu. ft.      11 cu. ft.    (9X)
  Power        3 KW           11 W             1 KW          (3X)
  Data Rate    15 MB/s        1.5 MB/s         120 MB/s      (8X)
  I/O Rate     600 I/Os/s     55 I/Os/s        3900 I/Os/s   (6X)
  MTTF         250 KHrs       50 KHrs          ??? Hrs
  Cost         $250K          $2K              $150K

- Disk arrays have the potential for large data and I/O rates, high MB per cu. ft., and high MB per KW, but what about reliability?

Array Reliability!
- Reliability (MTTF) of N disks = reliability of 1 disk / N
  - 50,000 hours / 70 disks ≈ 700 hours
  - Disk system MTTF drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
- Hot spares support reconstruction in parallel with access: very high media availability can be achieved
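A quick check of the arithmetic above (Python; the division assumes independent, roughly exponential disk failures, which is implicit in the slide):

```python
mttf_single_disk_hours = 50_000
n_disks = 70

# Time to the *first* disk failure in the array: single-disk MTTF divided by N.
mttf_array_hours = mttf_single_disk_hours / n_disks
print(f"{mttf_array_hours:.0f} hours ~= {mttf_array_hours / 24 / 30:.1f} months")
# -> 714 hours ~= 1.0 months (the slide rounds this to ~700 hours)
```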

Redundant Arrays of Inexpensive Disks RAID 0: Striping!
- Stripe data at the block level across multiple disks
- High read and write bandwidth
- Not a true RAID, since there is no redundancy
- Failure of any one drive causes the entire array to become unavailable

Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing!
- Each disk is fully duplicated onto its mirror (a recovery group)
- Very high availability can be achieved
- Bandwidth sacrifice on write: one logical write = two physical writes
- Reads may be optimized (read from either copy)
- Most expensive solution: 100% capacity overhead
- (RAID 2 involves Hamming codes and is not interesting in practice, so we skip it)

RAID-I (1989)!
- Consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks, and specialized disk-striping software

Redundant Array of Inexpensive Disks RAID 3: Parity Disk!
- [Figure: a logical record striped across the physical data disks, plus a parity disk P]
- P contains the sum (mod 2, i.e., XOR) of the other disks per stripe ("parity")
- If a disk fails, subtract P from the sum of the other disks to find the missing information
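A minimal sketch of the parity scheme just described (Python; the block contents are made up for illustration): the parity block is the byte-wise XOR of the data blocks, and any single lost block can be rebuilt by XOR-ing the surviving blocks with the parity.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]   # one stripe across three data disks
parity = xor_blocks(data)                        # contents of the parity disk P

# Disk 1 fails: reconstruct its block from the surviving data disks plus P.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```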

RAID 3!
- The sum is computed across a recovery group to protect against hard disk failures and is stored in the P disk
- Logically a single high-capacity, high-transfer-rate disk: good for large transfers
- But byte-level striping is bad for small files (all disks are involved in every access)
- The parity disk is still a bottleneck

Inspiration for RAID 4!
- RAID 3 stripes data at the byte level
- RAID 3 relies on the parity disk to discover errors on read
- But every sector on a disk already has an error-detection field
- Rely on the error-detection field to catch errors on read, not on the parity disk
- This allows independent reads to different disks simultaneously
  - Increases the read I/O rate, since only one disk is accessed rather than all disks for a small read

Redundant Arrays of Inexpensive Disks RAID 4: High I/O Rate Parity!
- [Figure: the insides of 5 disks; data blocks D0-D23 are striped block-wise across four data disks (the "disk columns"), with every stripe's parity block P on a dedicated fifth disk; logical disk addresses increase down each column]
- Example: a small read touches only D0 and D5; a large write covers D12-D15 plus their parity block

Inspiration for RAID 5!
- RAID 4 works well for small reads
- Small writes (write to one disk):
  - Option 1: read the other data disks, compute the new parity, and write it to the parity disk
  - Option 2: since P holds the old sum, compare the old data to the new data and add the difference into P
- Small writes are still limited by the parity disk: writes to D0 and D5 must both also write to the P disk

Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity!
- [Figure: data blocks D0-D23 and parity blocks P interleaved across all five disk columns; the parity block rotates from disk to disk, stripe by stripe; logical disk addresses increase down the columns]
- Independent writes are possible because of the interleaved parity
- Example: a write to D0 and D5 uses disks 0, 1, 3, and 4

Problems of Disk Arrays: Small Writes!
- RAID-5 small-write algorithm: 1 logical write = 2 physical reads + 2 physical writes
  1. Read the old data block D0
  2. Read the old parity block P
  3. Write the new data block D0'
  4. Write the new parity block P' = D0' XOR D0 XOR P
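A minimal sketch of the small-write update above (Python; the read_block/write_block callables and other names are illustrative assumptions): the new parity is old parity XOR old data XOR new data, so only two blocks are read and two are written.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(read_block, write_block, data_disk, parity_disk, stripe, new_data):
    """Update one data block and its parity block: 2 reads + 2 writes.
    read_block(disk, stripe) -> bytes and write_block(disk, stripe, bytes)
    are assumed helpers for accessing the array."""
    old_data = read_block(data_disk, stripe)        # 1. read old data
    old_parity = read_block(parity_disk, stripe)    # 2. read old parity
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    write_block(data_disk, stripe, new_data)        # 3. write new data
    write_block(parity_disk, stripe, new_parity)    # 4. write new parity
```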

xfs uses software RAID!
- Two limitations:
  - The overhead of parity management hurts performance for small writes
    - OK if overwriting all N-1 data blocks of a stripe
    - Otherwise, must read the old parity and data blocks to calculate the new parity
    - Small writes are common in UNIX-like systems
  - Software parity is very expensive, since hardware RAIDs add special hardware to compute parity

Log-structured FS!
- Provides fast writes, simple recovery, and a flexible file-location method
- Key idea: buffer writes in memory and commit them to disk in large, contiguous, fixed-size log segments
- Complicates reads, since data can be anywhere in the log
  - Use per-file inodes that move to the end of the log to handle reads
  - Use an in-memory imap to track the mobile inodes
  - Periodically checkpoint the imap to disk
    - Enables roll-forward failure recovery
- Drawback: must clean the holes created by new writes
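A minimal sketch of the log-structured write path described above (Python; class name, sizes, and record formats are illustrative, not LFS's on-disk layout): writes accumulate in an in-memory segment, the segment is appended to the log as one large contiguous write, and an in-memory imap records where each inode currently lives.

```python
class TinyLFS:
    """Illustrative log-structured FS core: segment buffering plus an imap."""

    SEGMENT_SIZE = 8            # blocks per segment (tiny, for illustration)

    def __init__(self):
        self.log = []           # the on-disk log, modeled as a list of blocks
        self.segment = []       # in-memory segment buffer
        self.imap = {}          # inode number -> log address of the latest inode copy

    def write_file_block(self, inode_no, data_block):
        # Append the data block and a fresh copy of the inode to the segment;
        # the inode "moves" toward the end of the log on every update.
        self.segment.append(("data", inode_no, data_block))
        self.segment.append(("inode", inode_no))
        if len(self.segment) >= self.SEGMENT_SIZE:
            self.flush_segment()

    def flush_segment(self):
        base = len(self.log)                        # log address of the segment start
        self.log.extend(self.segment)               # one large contiguous write
        for offset, entry in enumerate(self.segment):
            if entry[0] == "inode":
                self.imap[entry[1]] = base + offset  # imap tracks the mobile inodes
        self.segment = []

    def checkpoint(self):
        # Periodically persisting the imap is what bounds roll-forward recovery.
        return dict(self.imap)
```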

Combine LFS with software RAID!
- Addresses the small-write problem
  - Each log segment spans a full RAID stripe
  - Avoids the parity recomputation
- xfs maintains 4 maps:
  - File directory map: maps human-readable names to index numbers (file identifiers)
  - Manager map: determines which manager to contact for a file
  - Imap: determines where to look for the file in the on-disk log
  - Stripe group map: maps to the set of physical machines storing the segments

Processes in xfs!
- [Figure: the principle of log-based striping in xfs]
- Combines striping and logging

Reading a File Block!
- [Figure: reading a block of data in xfs]

xfs Naming!
Main data structures used in xfs:

  Data structure      Description
  Manager map         Maps a file ID to its manager
  Imap                Maps a file ID to the log address of the file's inode
  Inode               Maps a block number (i.e., offset) to the log address of the block
  File identifier     Reference used to index into the manager map
  File directory      Maps a file name to a file identifier
  Log address         Triplet of stripe group ID, segment ID, and segment offset
  Stripe group map    Maps a stripe group ID to the list of storage servers
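A minimal sketch (Python; all names and map representations are illustrative, not xfs's actual interfaces) of how these structures combine on a read: the directory yields a file ID, the manager map names the responsible manager, the imap and inode give the block's log address, and the stripe group map finally names the storage servers holding that segment.

```python
def read_file_block(name, block_no, file_directory, manager_map, imap, inodes, stripe_group_map):
    """Follow the xfs naming structures above to locate one file block.
    Maps are passed in as plain dicts/lists purely for illustration."""
    file_id = file_directory[name]                      # file directory: name -> file identifier
    manager = manager_map[file_id % len(manager_map)]   # file ID indexes the manager map (a list here)
    inode_log_addr = imap[file_id]                      # imap: file ID -> log address of the inode
    stripe_group_id, segment_id, offset = inodes[inode_log_addr][block_no]  # inode: block -> log address
    servers = stripe_group_map[stripe_group_id]         # stripe group map: group ID -> storage servers
    return manager, servers, (stripe_group_id, segment_id, offset)
```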

Transactional Semantics!
- File-associated data read and modified during a session:

  File-associated data      Read?   Modified?
  File identifier           Yes     No
  Access rights             Yes     No
  Last modification time    Yes     Yes
  File length               Yes     Yes
  File contents             Yes     Yes

- Network partition: part of the network is isolated from the rest
- Allow conflicting operations on replicas across partitions
- Reconcile upon reconnection
- Transactional semantics => operations must be serializable
  - Ensure that operations were serializable after they have executed
- Conflict => force manual reconciliation