Lecture 18: Reliable Storage

CS 422/522 Design & Implementation of Operating Systems
Lecture 18: Reliable Storage
Zhong Shao, Dept. of Computer Science, Yale University

Acknowledgement: some slides are taken from previous versions of the CS422/522 lectures taught by Prof. Bryan Ford and Dr. David Wolinsky, and also from the official set of slides accompanying the OSPP textbook by Anderson and Dahlin.

Main points
- Problem posed by machine/disk failures
- Transaction concept
- Reliability
  * Careful sequencing of file system operations
  * Copy-on-write (WAFL, ZFS)
  * Journaling (NTFS, Linux ext4)
  * Log structure (flash storage)
- Availability
  * RAID

File system reliability
- What can happen if the disk loses power or the machine software crashes?
  * Some operations in progress may complete
  * Some operations in progress may be lost
  * An overwrite of a block may only partially complete
- The file system wants durability (as a minimum!): data previously stored can be retrieved (maybe after some recovery step), regardless of failure

File system reliability (cont'd)
- For performance, everything must be cached! This is OK for reads, but what about writes?
- Options for writing data:
  * Write-through: write the change immediately to disk
    Problem: slow! Have to wait for the write to complete before going on
  * Write-back: delay writing modified data back to disk (for example, until the block is replaced)
    Problem: can lose data on a crash!

Multiple updates
- If multiple updates are needed to perform some operation, a crash can occur between them!
- Moving a file between directories:
  * Delete file from old directory
  * Add file to new directory
- Creating a new file:
  * Allocate space on disk for header, data
  * Write new header to disk
  * Add the new file to the directory
- What if there is a crash in the middle? Even with write-through, there can still be problems.

Storage reliability problem
- A single logical file operation can involve updates to multiple physical disk blocks: inode, indirect block, data block, bitmap, ...
- With remapping, a single update to a physical disk block can require multiple (even lower-level) updates
- At the physical level, operations complete one at a time, yet we want concurrent operations for performance
- How do we guarantee consistency regardless of when the crash occurs?

Transaction concept
- A transaction is a group of operations that is:
  * Atomic: operations appear to happen as a group, or not at all (at the logical level)
    At the physical level, only a single disk/flash write is atomic
  * Durable: operations that complete stay completed
    Future failures do not corrupt previously stored data
  * Isolation: other transactions do not see the results of earlier transactions until they are committed
  * Consistency: sequential memory model

Reliability approach #1: careful ordering
- Sequence operations in a specific order
- Careful design to allow the sequence to be interrupted safely
- Post-crash recovery:
  * Read data structures to see if there were any operations in progress
  * Clean up/finish as needed
- Approach taken in FAT, FFS (fsck), and many app-level recovery schemes (e.g., Word)

FAT: Append data to file
- Normal operation:
  * Add data block
  * Add pointer to data block
  * Update file tail to point to the new FAT entry
  * Update access time at head of file
- Recovery:
  * Scan the FAT
  * If an entry is unlinked, delete the data block
  * If the access time is incorrect, update it

FAT: Create new file
- Normal operation:
  * Allocate data block
  * Update FAT entry to point to the data block
  * Update directory with file name -> file number
  * Update modify time for directory
- Recovery:
  * Scan the FAT
  * If there are any unlinked files (not in any directory), delete them
  * Scan directories for missing update times

Unix approach (careful ordering)
- Try to achieve consistency on both meta-data and user data
- Meta-data: needed to keep the file system logically consistent
  * Directories
  * Bitmap
  * File headers
  * Indirect blocks
- Data: user bytes

Meta-data consistency (ad hoc)
- For meta-data, Unix uses synchronous write-through
- If multiple updates are needed, Unix does them in a specific order
- If the system crashes, run the special program fsck, which scans the entire disk for internal consistency, checks for in-progress operations, and then fixes up anything left incomplete
- Examples:
  * File created, but not yet put in any directory => delete the file
  * Blocks allocated, but not in the bitmap => update the bitmap

FFS: Create a file
- Normal operation:
  * Allocate data block
  * Write data block
  * Allocate inode
  * Write inode block
  * Update bitmap of free blocks
  * Update directory with file name -> file number
  * Update modify time for directory
- Recovery:
  * Scan the inode table
  * If there are any unlinked files (not in any directory), delete them
  * Compare the free block bitmap against the inode trees
  * Scan directories for missing update/access times
- Recovery time is proportional to the size of the disk
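
To make the recovery scan concrete, here is a minimal sketch of the two fsck-style checks above, run over toy in-memory structures. The inode table, bitmap, and directory layout are invented for illustration; this is not the real FFS fsck.

```python
# Toy fsck-style scan (illustration only, not the real FFS fsck).
# Hypothetical structures: an inode table, a block allocation bitmap
# (True = in use), and directories mapping names to inode numbers.

def fsck(inodes, bitmap, directories):
    """Delete unlinked inodes and rebuild the allocation bitmap from
    the surviving inode trees; return the repaired structures."""
    linked = {ino for d in directories.values() for ino in d.values()}

    # Check 1: file created but never put in any directory -> delete it.
    for ino in list(inodes):
        if ino not in linked:
            del inodes[ino]

    # Check 2: blocks allocated but not marked in the bitmap -> rebuild it.
    used = {blk for inode in inodes.values() for blk in inode["blocks"]}
    repaired_bitmap = [blk in used for blk in range(len(bitmap))]
    return inodes, repaired_bitmap

# Example: inode 2 was allocated (block 5) but the crash happened before
# it was linked into a directory, and block 7 of inode 1 never made it
# into the bitmap.
inodes = {1: {"blocks": [3, 7]}, 2: {"blocks": [5]}}
bitmap = [False, False, False, True, False, False, False, False]
dirs = {"/": {"a.txt": 1}}
print(fsck(inodes, bitmap, dirs))
```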

FFS: Move a file
- Normal operation:
  * Remove filename from the old directory
  * Add filename to the new directory
- Recovery:
  * Scan all directories to determine the set of live files
  * Consider files that have valid inodes but are not in any directory: a new file being created? a file move? a file deletion?

User data consistency
- For user data, Unix uses write-back, forced to disk every 30 seconds (or the user can call sync to force writes to disk immediately)
- No guarantee blocks are written to disk in any order
- Sometimes meta-data consistency is good enough
- How should vi save changes to a file to disk? (See the sketch below.)
  * Write the new version to a temp file
  * Move the old version to another temp file
  * Move the new version into the real file
  * Unlink the old version
- If there is a crash, look at the temp area; if any files are out there, send email to the user that there might be a problem
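
The vi recipe is essentially "write a complete new copy, then atomically swap it in". A minimal sketch of that pattern, assuming POSIX rename semantics; the file names are made up.

```python
import os, tempfile

def save_atomically(path: str, data: bytes) -> None:
    """Write a new version to a temp file in the same directory, force it
    to disk, then atomically rename it over the old file. A crash leaves
    either the old version or the new one, never a partial write."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)              # make the new bytes durable before the swap
        os.close(fd)
        os.replace(tmp, path)     # rename is atomic within one file system
    except BaseException:
        try:
            os.close(fd)
        except OSError:
            pass                  # fd may already be closed
        if os.path.exists(tmp):
            os.unlink(tmp)        # clean up the leftover temp file
        raise

save_atomically("example.txt", b"new contents\n")
```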

Application level
- Normal operation:
  * Write the name of each open file to the app folder
  * Write changes to a backup file
  * Rename the backup file to be the file (an atomic operation provided by the file system)
  * Delete the list in the app folder on clean shutdown
- Recovery:
  * On startup, see if any files were left open
  * If so, look for the backup file
  * If so, ask the user to compare versions

Careful ordering
- Pros:
  * Works with minimal support in the disk drive
  * Works for most multi-step operations
- Cons:
  * Can require time-consuming recovery after a failure
  * Difficult to reduce every operation to a safely interruptible sequence of writes
  * Difficult to achieve consistency when multiple operations occur concurrently

Reliability approach #2: copy-on-write file layout
- To update the file system, write a new version of the file system containing the update
  * Never update in place
  * Reuse existing unchanged disk blocks
- Seems expensive! But
  * Updates can be batched
  * Almost all disk writes can occur in parallel
- Approach taken in network file server appliances (WAFL, ZFS)

Copy on write / write anywhere
[Figure: a tree of blocks - root inode slots, the inode file's indirect blocks, the inode array (in the inode file), per-file indirect blocks, and data blocks. Only the root inode slots live at a fixed location; everything else can be written anywhere.]
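
A minimal sketch of the write-anywhere idea over a toy block tree (not the actual WAFL or ZFS layout): updating one data block creates new copies of every block on the path up to the root, and the old tree stays intact until the root pointer is switched.

```python
# Toy copy-on-write update over a simple block tree (illustration only).

class Node:
    def __init__(self, children=None, data=None):
        self.children = list(children) if children else []
        self.data = data

def cow_update(root, path, new_data):
    """Return a new root reflecting the update; nothing is modified in
    place. `path` lists child indexes from the root down to a data block."""
    if not path:
        return Node(data=new_data)              # new copy of the data block
    new_child = cow_update(root.children[path[0]], path[1:], new_data)
    children = list(root.children)
    children[path[0]] = new_child               # copy the indirect node and
    return Node(children=children)              # point it at the fresh child

# Tiny tree: root -> indirect block -> two data blocks.
old_root = Node(children=[Node(children=[Node(data="A"), Node(data="B")])])
new_root = cow_update(old_root, [0, 1], "B'")
print(old_root.children[0].children[1].data)    # "B"  (old tree untouched)
print(new_root.children[0].children[1].data)    # "B'" (new tree sees update)
```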

Copy on write / write anywhere: update the last block of a file
[Figure: the same block tree as before; updating the last data block of a file writes new copies of that data block, the file's indirect blocks, its entry in the inode array, the inode file's indirect blocks, and finally a new root inode.]

Copy on write batch update
[Figure: a batch of updates writes new data blocks, new indirect nodes, new data blocks and indirect nodes of the inode file, and a single new root inode; the old root and unchanged blocks remain on disk.]

Copy on write garbage collection
- For write efficiency, want contiguous sequences of free blocks
  * Spread across all block groups
  * Updates leave dead blocks scattered
- For read efficiency, want data read together to be in the same block group
  * Write anywhere leaves related data scattered
- => Background coalescing of live/dead blocks

Copy on write
- Pros:
  * Correct behavior regardless of failures
  * Fast recovery (root block array)
  * High throughput (best if updates are batched)
- Cons:
  * Potential for high latency
  * Small changes require many writes
  * Garbage collection is essential for performance

Logging file systems
- Instead of modifying data structures on disk directly, write the changes to a journal/log
  * Intention list: the set of changes we intend to make
  * The log/journal is append-only
- Once the changes are in the log, it is safe to apply them to the data structures on disk
  * Recovery can read the log to see what changes were intended
- Once the changes are copied, it is safe to remove them from the log

Transaction concept
- Transactions: group actions together so that they are
  * Atomic: either happens or it does not (no partial operations)
  * Serializable: transactions appear to happen one after the other
  * Durable: once it happens, it stays happened
- Critical sections are atomic and serializable, but not durable
- Need two more items:
  * Commit: when the transaction is done (durable)
  * Rollback: if there is a failure during a transaction (it didn't happen at all)
- Metaphor: do a set of operations tentatively. If you get to commit, OK. Otherwise, roll back the operations as if the transaction never happened.

Transaction implementation
- Key idea: fix the problem of how to make multiple updates to disk by turning the multiple updates into a single disk write!
- Example: money transfer from account x to account y:
    Begin transaction
      x = x + 1
      y = y - 1
    Commit
- Keep a redo log on disk of all changes in the transaction
  * A log is like a journal: never erased, a record of everything you've done
- Once both changes are in the log, the transaction is committed
- Then the changes can be written behind to disk: if we crash after commit, replay the log to make sure the updates get to disk

Transaction implementation (cont'd)
[Figure: the memory cache and the disk both hold X: 0 and Y: 2; the write-ahead log lives on disk, tape, or non-volatile RAM.]
- Sequence of steps to execute the transaction X=1, Y=1, commit:
  1. Write the new value of X to the log
  2. Write the new value of Y to the log
  3. Write commit
  4. Write X to disk
  5. Write Y to disk
  6. Reclaim space on the log
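
A minimal sketch of those six steps, assuming an append-only log file and a single data file; the record format and file names are invented for illustration, and a real system would also protect the data file itself against torn writes.

```python
import json, os

LOG = "redo.log"        # append-only write-ahead log (hypothetical name)
DATA = "data.json"      # the "disk" copy of X and Y

def transfer(updates):
    # Steps 1-3: append every new value to the log, then a commit record.
    with open(LOG, "a") as log:
        for key, value in updates.items():
            log.write(json.dumps({"set": [key, value]}) + "\n")
        log.write(json.dumps({"commit": True}) + "\n")
        log.flush()
        os.fsync(log.fileno())          # commit point: the log is on disk

    # Steps 4-5: write-behind the new values to the data itself.
    data = json.load(open(DATA)) if os.path.exists(DATA) else {}
    data.update(updates)
    with open(DATA, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())

    os.remove(LOG)                      # step 6: reclaim the log space

transfer({"X": 1, "Y": 1})
print(json.load(open(DATA)))            # {'X': 1, 'Y': 1}
```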

Transaction implementation (cont'd)
- X=1, Y=1, commit:
  1. Write the new value of X to the log
  2. Write the new value of Y to the log
  3. Write commit
  4. Write X to disk
  5. Write Y to disk
  6. Reclaim space on the log
- What if we crash after 1? No commit, nothing on disk, so just ignore the changes
- What if we crash after 2? Ditto
- What if we crash after 3, before 4 or 5? The commit is written to the log, so replay those changes back to disk
- What if we crash while we are writing the commit? As with concurrency, we need some primitive atomic operation or else we can't build anything (e.g., writing a single sector on disk is atomic!)

Another example: before the transaction starts
[Figure: the cache holds Tom = $200 and Mike = $100; nonvolatile storage holds Tom = $200 and Mike = $100; the log is empty.]

After updates are logged
[Figure: the cache holds Tom = $100 and Mike = $200; nonvolatile storage still holds Tom = $200 and Mike = $100; the log now contains Tom = $100 and Mike = $200.]

After commit logged
[Figure: the cache holds Tom = $100 and Mike = $200; nonvolatile storage still holds Tom = $200 and Mike = $100; the log contains Tom = $100, Mike = $200, COMMIT.]

After copy back
[Figure: the cache holds Tom = $100 and Mike = $200; nonvolatile storage now holds Tom = $100 and Mike = $200; the log still contains Tom = $100, Mike = $200, COMMIT.]

After garbage collection
[Figure: the cache holds Tom = $100 and Mike = $200; nonvolatile storage holds Tom = $100 and Mike = $200; the log is empty.]

Redo logging
- Prepare: write all changes (in the transaction) to the log
- Commit: a single disk write makes the transaction durable
- Redo: copy the changes to disk
- Garbage collection: reclaim space in the log
- Recovery:
  * Read the log
  * Redo any operations for committed transactions
  * Garbage collect the log

Performance
- The log is written sequentially
  * Often kept in flash storage
- Asynchronous write-back
  * Any order is fine, as long as all changes are logged before commit, and all write-backs occur after commit
- Can process multiple transactions
  * Transaction ID in each log entry
  * A transaction is completed iff its commit record is in the log
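
A matching recovery sketch for the earlier redo-log example: replay the log only if a commit record made it to disk, otherwise discard the partial transaction. The file names and record format are the same invented ones as before.

```python
import json, os

LOG = "redo.log"        # same hypothetical log file as the earlier sketch
DATA = "data.json"

def recover():
    """Redo recovery: read the log; if (and only if) a commit record is
    present, re-apply every logged update, then garbage collect the log."""
    if not os.path.exists(LOG):
        return                                  # nothing was in flight
    updates, committed = {}, False
    with open(LOG) as log:
        for line in log:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break                           # torn tail of the log: stop
            if "set" in record:
                key, value = record["set"]
                updates[key] = value
            elif record.get("commit"):
                committed = True
    if committed:                               # redo the committed updates
        data = json.load(open(DATA)) if os.path.exists(DATA) else {}
        data.update(updates)
        with open(DATA, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
    os.remove(LOG)                              # garbage collect the log

recover()   # safe to call on every boot; a no-op if there is no log
```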

Redo log implementation
[Figure: volatile memory holds the log head pointer, the log tail pointer, and the set of pending write-backs; persistent storage holds the log itself. From older to newer, log records pass through the states: free (garbage collected), write-back complete (eligible for GC), committed (in use), uncommitted/mixed (in use), and free space available for new records.]

Transaction isolation
- Process A: move a file from x to y
    mv x/file y/
- Process B: grep across x and y
    grep x/* y/* > log
- What if the grep starts after the changes are logged, but before the commit?

Two-phase locking
- Don't allow unlock before commit
- First phase: only allowed to acquire locks (this avoids deadlock concerns)
- Second phase: all unlocks happen at commit
- Thread B can't see any of A's changes until A commits and releases its locks
- This provides serializability (see the sketch below)

Transaction isolation
- Process A:
    Lock x, y
    move file from x to y: mv x/file y/
    Commit and release x, y
- Process B:
    Lock x, y, log
    grep across x and y: grep x/* y/* > log
    Commit and release x, y, log
- The grep occurs either before or after the move
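
A minimal sketch of the locking discipline using ordinary Python locks: phase one acquires everything the transaction will touch, and phase two releases it all at commit. Acquiring in a fixed order here is only to keep the toy deadlock-free; it is not part of two-phase locking itself.

```python
import threading

locks = {name: threading.Lock() for name in ("x", "y", "log")}

class Transaction:
    """Toy two-phase locking: phase 1 only acquires locks, phase 2
    releases everything at commit."""
    def __init__(self, names):
        self.held = [locks[n] for n in sorted(names)]   # fixed order (toy)

    def __enter__(self):                 # phase 1: acquire all locks
        for lock in self.held:
            lock.acquire()
        return self

    def __exit__(self, *exc):            # phase 2: commit, then release all
        for lock in reversed(self.held):
            lock.release()

def mover():
    with Transaction(["x", "y"]):
        pass                             # mv x/file y/ would happen here

def grepper():
    with Transaction(["x", "y", "log"]):
        pass                             # grep x/* y/* > log would happen here

a = threading.Thread(target=mover)
b = threading.Thread(target=grepper)
a.start(); b.start(); a.join(); b.join()
print("grep ran entirely before or entirely after the move")
```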

Serializability
- With two-phase locking and redo logging, transactions appear to occur in a sequential order (serializability)
- Either: grep then move, or move then grep
- Other implementations can also provide serializability
  * Optimistic concurrency control: abort any transaction that would conflict with serializability

Caveat
- Most file systems implement a transactional model internally
  * Copy on write
  * Redo logging
- Most file systems provide a transactional model for individual system calls
  * File rename, move, ...
- Most file systems do NOT provide a transactional model for user data

Question
- Do we need the copy back?
- What if update in place is very expensive?
  * Ex: flash storage, RAID

Log structure
- The log is the data storage; there is no copy back
- Storage is split into contiguous fixed-size segments
  * Flash: the size of an erasure block
  * Disk: an efficient transfer size (e.g., 1 MB)
- Log new blocks into an empty segment
  * Garbage collect dead blocks to create empty segments
- Each segment contains an extra level of indirection
  * Which blocks are stored in that segment
- Recovery: find the last successfully written segment
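
A minimal sketch of the segment idea, assuming a toy in-memory store: each segment carries a summary of which logical blocks it holds (the extra level of indirection), and reads scan from the newest segment backward.

```python
SEGMENT_SIZE = 4                       # blocks per segment (toy value)

class LogStructuredStore:
    """Toy log-structured store: new block versions are appended to the
    current segment, and each segment records which logical blocks it holds."""
    def __init__(self):
        self.segments = []             # each: {"summary": {...}, "blocks": [...]}
        self._open_segment()

    def _open_segment(self):
        self.current = {"summary": {}, "blocks": []}
        self.segments.append(self.current)

    def write(self, logical_block, data):
        if len(self.current["blocks"]) == SEGMENT_SIZE:
            self._open_segment()       # segment full: start a new empty one
        self.current["summary"][logical_block] = len(self.current["blocks"])
        self.current["blocks"].append(data)

    def read(self, logical_block):
        # Newest segment wins: scan back for the latest copy of the block.
        for seg in reversed(self.segments):
            if logical_block in seg["summary"]:
                return seg["blocks"][seg["summary"][logical_block]]
        raise KeyError(logical_block)

store = LogStructuredStore()
store.write(7, "v1")
store.write(7, "v2")
print(store.read(7))                   # "v2" -- the latest logged version
```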

Storage availability
- Storage reliability: the data fetched is what you stored
  * Transactions, redo logging, etc.
- Storage availability: the data is there when you want it
  * More disks => higher probability of some disk failing
  * Data available ~ Prob(disk working)^k, if failures are independent and the data is spread across k disks
  * For large k, the probability the system works -> 0

RAID
- Replicate data for availability
  * RAID 0: no replication
  * RAID 1: mirror data across two or more disks
    (the Google File System replicated its data on three disks, spread across multiple racks)
  * RAID 5: split data across disks, with redundancy to recover from a single disk failure
  * RAID 6: RAID 5, with extra redundancy to recover from two disk failures
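
A quick check of the availability claim for data striped across k disks with no redundancy; the per-disk probability used here is an assumption for illustration.

```python
# Striped data (no redundancy) is available only if every one of the
# k disks is working; p is an assumed per-disk probability.
p = 0.99
for k in (1, 10, 100, 1000):
    print(k, round(p ** k, 3))
# 1 0.99, 10 0.904, 100 0.366, 1000 ~0.0  -> tends to 0 for large k
```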

RAID 1: Mirroring
- Replicate writes to both disks
- Reads can go to either disk
[Figure: Disk 0 and Disk 1 each hold identical copies of data blocks 0 through 19.]

Parity
- Parity block: block1 xor block2 xor block3

    10001101  block1
    01101100  block2
    11000110  block3
    --------
    00100111  parity block

- Can reconstruct any missing block from the others
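
A minimal sketch of the parity computation from the example above, plus reconstruction of a missing block from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR byte strings of equal length together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

block1 = bytes([0b10001101])
block2 = bytes([0b01101100])
block3 = bytes([0b11000110])
parity = xor_blocks([block1, block2, block3])
print(format(parity[0], "08b"))           # 00100111, as in the example

# Reconstruct a missing block (say block2) from the others plus parity.
rebuilt = xor_blocks([block1, block3, parity])
print(rebuilt == block2)                  # True
```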

RAID 5: Rotating parity
[Figure: five disks, each divided into strips of four blocks; parity strips rotate across the disks. In stripe 0 the parity strip is on disk 0 and data blocks 0-15 fill disks 1-4; in stripe 1 the parity strip is on disk 1; in stripe 2 it is on disk 2; and so on.]

RAID update
- Mirroring: write every mirror
- RAID-5, to write one block (see the sketch below):
  * Read the old data block
  * Read the old parity block
  * Write the new data block
  * Write the new parity block
    (new parity = old data xor old parity xor new data)
- RAID-5, to write an entire stripe: write the data blocks and parity directly
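
A minimal sketch of the RAID-5 small write described above: the new parity is old data xor old parity xor new data, which keeps the stripe's parity consistent without reading the other data blocks.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """RAID-5 small write: two reads (old data, old parity) and two writes
    (new data, new parity)."""
    new_parity = xor(xor(old_data, old_parity), new_data)
    return new_data, new_parity

# Toy stripe of three data blocks plus parity.
d = [bytes([0b10001101]), bytes([0b01101100]), bytes([0b11000110])]
parity = xor(xor(d[0], d[1]), d[2])

new_d1 = bytes([0b00000000])
d[1], parity = raid5_small_write(d[1], parity, new_d1)
print(parity == xor(xor(d[0], d[1]), d[2]))   # True: parity still consistent
```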

Non-recoverable read errors
- Disk devices can lose data
  * One sector per 10^15 bits read
  * Causes: physical wear; repeated writes to nearby tracks
- What impact does this have on RAID recovery?

Read errors and RAID recovery
- Example: 10 disks of 1 TB each, and 1 fails
- Read the remaining disks to reconstruct the missing data
- Probability of recovery = (1 - 10^-15)^(9 disks * 8 bits * 10^12 bytes/disk) ~= 93%
- Solutions:
  * RAID-6: two redundant disk blocks per stripe (parity, linear feedback shift register)
  * Scrubbing: read disk sectors in the background to find and fix latent errors
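
A quick check of the 93% figure, assuming a 10^-15 per-bit non-recoverable read error rate and nine surviving 1 TB disks, as in the example.

```python
# Probability that a rebuild reads all remaining bits without hitting a
# non-recoverable read error (rates and sizes from the example above).
per_bit_error = 1e-15
bits_to_read = 9 * 8 * 1e12          # 9 surviving disks * 8 bits * 10^12 bytes
p_success = (1 - per_bit_error) ** bits_to_read
print(round(p_success, 3))           # ~0.931
```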