Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Similar documents
Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

File Systems: Consistency Issues

COS 318: Operating Systems. Journaling, NFS and WAFL

Topics. " Start using a write-ahead log on disk " Log all updates Commit

Journaling. CS 161: Lecture 14 4/4/17

Lecture 18: Reliable Storage

EECS 482 Introduction to Operating Systems

File System Consistency. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Consistency

Computer Systems Laboratory Sungkyunkwan University

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

Operating Systems. File Systems. Thomas Ropars.

FS Consistency & Journaling

File System Internals. Jo, Heeseung

Chapter 17: Recovery System

Failure Classification. Chapter 17: Recovery System. Recovery Algorithms. Storage Structure

November 9 th, 2015 Prof. John Kubiatowicz

Chapter 11: File System Implementation. Objectives

CS5460: Operating Systems Lecture 20: File System Reliability

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

CS 318 Principles of Operating Systems

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

Chapter 17: Recovery System

Announcements. Persistence: Crash Consistency

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

Outline. Failure Types

Ext3/4 file systems. Don Porter CSE 506

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

CSE 153 Design of Operating Systems

Transactions. A Banking Example

File Systems Management and Examples

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Recoverability. Kathleen Durant PhD CS3200

C13: Files and Directories: System s Perspective

ò Very reliable, best-of-breed traditional file system design ò Much like the JOS file system you are building now

Chapter 11: Implementing File Systems

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed McGraw-Hill by

Caching and reliability

Chapter 16: Recovery System. Chapter 16: Recovery System

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

OPERATING SYSTEM. Chapter 12: File System Implementation

Chapter 12 File-System Implementation

Chapter 11: Implementing File Systems

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

[537] Journaling. Tyler Harter

CS510 Operating System Foundations. Jonathan Walpole

Announcements. Persistence: Log-Structured FS (LFS)

Operating Systems. Operating Systems Professor Sina Meraji U of T

Transaction Management. Pearson Education Limited 1995, 2005

Lecture 10: Crash Recovery, Logging

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #19: Logging and Recovery 1

Review: FFS background

Synchronization Part 2. REK s adaptation of Claypool s adaptation oftanenbaum s Distributed Systems Chapter 5 and Silberschatz Chapter 17

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. General Overview NOTICE: Faloutsos CMU SCS

CS162 Operating Systems and Systems Programming Lecture 20. Reliability, Transactions Distributed Systems. Recall: File System Caching

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

T ransaction Management 4/23/2018 1

CS3600 SYSTEMS AND NETWORKS

Review: FFS [McKusic] basics. Review: FFS background. Basic FFS data structures. FFS disk layout. FFS superblock. Cylinder groups

Database Technology. Topic 8: Introduction to Transaction Processing

CSE506: Operating Systems CSE 506: Operating Systems

CS122 Lecture 15 Winter Term,

Lecture X: Transactions

NTFS Recoverability. CS 537 Lecture 17 NTFS internals. NTFS On-Disk Structure

Chapter 12: File System Implementation

Percona Live September 21-23, 2015 Mövenpick Hotel Amsterdam

CSE380 - Operating Systems

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

ò Server can crash or be disconnected ò Client can crash or be disconnected ò How to coordinate multiple clients accessing same file?

File Systems. What do we need to know?

NFS. Don Porter CSE 506

Typical File Extensions File Structure

Problems Caused by Failures

The Google File System

File Systems II. COMS W4118 Prof. Kaustubh R. Joshi hdp://

V. File System. SGG9: chapter 11. Files, directories, sharing FS layers, partitions, allocations, free space. TDIU11: Operating Systems

Chapter 11: Implementing File

2. PICTURE: Cut and paste from paper

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Databases: transaction processing

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

Recovery from failures

CSE 451: Operating Systems Winter Module 17 Journaling File Systems

Consistency and Scalability

The Google File System

Chapter 10: File System Implementation

Case study: ext2 FS 1

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

Synchronization Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Reminder from last time

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

Case study: ext2 FS 1

XI. Transactions CS Computer App in Business: Databases. Lecture Topics

Transcription:

Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What to Cache? Cache s in main memory Check the buffer cache first Hit will read from or write to the buffer cache Miss will read from the disk to the buffer cache Usual questions What to cache? How to size? What to prefetch? How and what to replace? Which write policies? Kernel buffer Buffer cache Disk Things to consider s and inect blocks of ectories Directory s I-nodes and inect blocks of s Files What is a good strategy? Cache s and inect blocks if they are in use? Cache only the s and inect blocks of the current ectory? Cache an entire vs. referenced blocks of s 3 4 1

How to Size? Challenges: Multiple Processes An important issue is how to partition memory between the buffer cache and VM cache Early systems use fixed-size buffer cache It does not adapt to workloads Later systems use variable size cache But, large s are common, how do we make adjustment? Solution Basically, we solve the problem using the working set idea Buffer cache (90MB) VM (110MB) Buffer cache (120MB) VM (80MB) Kernel All processes share the same buffer cache Global LRU may not be fair Solution Working set idea again Questions Can each process use a different replacement strategy? Can we move the buffer cache to the user level? What about duplications? process process... Buffer cache process 5 6 What to Prefetch? How and What to Replace? Optimal The blocks are fetched in just enough time to use them But, too hard to do The good news is that s also have locality Temporal locality Spatial locality Common strategies Prefetch next k blocks together (typically > 64KB) Some discard unreferenced blocks Cluster blocks of the same ectory and s if possible (to the same cylinder group and neighborhood) to make prefetching efficient Page replacement theory Use past to predict future LRU is good Buffer cache with LRU replacement mechanism If b is in buffer cache, move it to front and return b Otherwise, replace the tail block, get b from disk, insert b to the front Use double linked list with a hash table LRU (front) Hash table 7 8 2

Which Write Policies? Write Back Complications Write through Whenever modify cached block, write block to disk Cache is always consistent Simple, but cause more I/Os Write back When modifying a block, mark it as ty & write to disk later Fast writes, absorbs writes, and enables batching So, what s the problem? Kernel buffer Buffer cache Disk Fundamental tension On crash, all modified data in cache is lost. The longer you postpone write backs, the faster you are but the worst the damage is on a crash When to write back When a block is evicted When a is closed On an explicit flush When a time interval elapses (30 seconds in Unix) Issues These write back options have no guarantees A solution is consistent updates (later) 9 10 File Recovery Tools What fsck does Physical backup (dump) and recovery Dump disk blocks by blocks to a backup system Backup only changed blocks since the last backup as an incremental Recovery tool built accordingly Logical backup (dump) and recovery Traverse the logical structure from the root Selectively dump what you want to backup Verify logical structures as you backup Recovery tool selectively move s back Consistency check (e.g. fsck) Start from the root Traverse the whole tree and mark reachable s Verify the logical structure Figure out what blocks are free / man u cos318 Get default list of systems to check from /etc/fstab Inconsistencies checked: Blocks claimed by more than one inode or the free map Blocks claimed by an inode outside range of the system Incorrect link counts Size checks (ectory size etc) Bad inode format Blocks not accounted for anywhere Directory checks: File pointing to unallocated inode; Inode number out of range;. or.. not first two entries of a ectory or have wrong inode number Super Block checks More blocks for inodes than are in the system; Bad free block map format; Total free block and/or free inode count incorrect Put orphaned s and ectories in lost+found ectory 11 12 3

Recovery from Disk Block Failures Persistency and Crashes Boot block File system promise: Persistency Create a utility to replace the boot block Use a flash memory to duplicate the boot block and kernel Super block If there is a duplicate, remake system bitmap File system will hold a until its owner explicitly deletes it Why is this hard? A crash will destroy memory content Memory Free block data structure Search all reachable s from the root Cache more better performance Cache more lose more on a crash Unreachable blocks are free blocks How to recover? Inect or data blocks Inect Inect Data Data Data A operation often requires modifying multiple blocks, but the system can only atomically modify one at a time Systems can crash anytime? 13 14 What Is A Crash? Crash is like a context switch Think about a system as a thread before the context switch and another after the context switch Two threads read or write same shared state? Crash is like time travel Current volatile state lost; suddenly go back to old state Example: move a Place it in a ectory Delete it from old Crash happens and both ectories have problems Before Crash After Time Crash Approaches Throw everything away and start over Done for most things (e.g., make again) Not what you want to happen to your email Reconstruction Figure out where you are and make the system consistent and go from there Try to fix things after a crash ( fsck ) Make updates consistent Either new data or old data, but not garbage data Make multiple updates appear atomic Build arbitrary sized atomic units from smaller atomic ones Similar to how we built critical sections from locks, and locks from atomic instructions 15 16 4

Consistent Updates: Problem Modify /u/cos318/foo Consistent Updates: Data Before Metadata Modify /u/cos318/foo Traverse to /u/cos318/ Allocate data block Write pointer into Crash Inconsistent Write new data to foo Crash Consistent / cos318 u foo Old data New data Traverse to /u/cos318/ Allocate data block Write new data to foo Write pointer into Crash Consistent / cos318 u foo Old data New data Writing metadata first can cause inconsistency 17 18 Consistent Updates: Bottom-Up Order The general approach is to use a bottom up order File data blocks,, ectory, ectory, What about buffer cache Write back all data blocks Update and write it to disk Update ectory and write it to disk Update ectory and write it to disk (if necessary) Continue until no ectory update exists Does this solve the write back problem? Updates are consistent but leave garbage blocks around May need to run fsck to clean up once a while Ideal approach: consistent update without leaving garbage Transactions E.g. move money from Savings to Checking in bank BEGIN_TRANSACTION Success = Withdraw ($300, Savings); Success = Success AND Deposit ($300, Checking); IF (!Success) THEN ABORT_TRANSACTION ELSE END_TRANSACTION 19 20 5

Transaction Properties Group multiple operations together so that they have ACID property: Atomicity Consistency Isolation (Serializability) Durability (Persistence) Once it happens, stays happened Question Do critical sections have ACID property? Transaction Properties Atomicity Either entire transaction happens, or none of it does If transaction happens, it appears to have happened as a single atomic action (across all resources it touches) Consistency A transaction results in a valid transformation of the system state; i.e. the results of the transaction do not violate the rules or invariants of the system If they held before, they hold after E.g. total money in bank is not changed by internal transaction Invariants need not be maintained within transaction, only before and after 21 22 Transaction Properties Isolation (aka Serializability) If two or more transactions are running concurrently, then to each of them and to all external processes, the final result looks as if all trnasactions ran sequentially in some (system-dependent) order. i.e. they appear to have run serially, not concurrently From outside the transaction, cannot see intermediate results Durability (aka Persistence) Once a transaction has successfully committed, its results are permanent, regardless of what failures happen after that No later failure can undo the results or cause them to be lost Transactions Bundle many operations into a transaction One of the first transaction systems was the Sabre American Airline reservation system, developed by IBM Primitives BeginTransaction Mark the beginning of the transaction Commit (End transaction) When transaction is done Rollback (Abort transaction) Undo all the actions since Begin transaction. Rules Transactions can run concurrently Rollback can execute anytime Sophisticated transaction systems allow nested transactions 23 24 6

Implementation An Example: Atomic Money Transfer BeginTransaction Start using a write-ahead log on disk Log all updates Commit Write commit at the end of the log Then write-behind to disk by writing updates to disk Clear the log Rollback Clear the log Crash recovery If there is no commit in the log, do nothing If there is commit, replay the log and clear the log Assumptions Writing to disk is correct (recall the error detection and correction) Disk is in a good state before we start 25 Move $100 from account S to C (1 thread): BeginTransaction S = S - $100; C = C + $100; Commit Steps: 1: Write new value of S to log 2: Write new value of C to log 3: Write commit 4: Write S to disk 5: Write C to disk 6: Clear the log Possible crashes After 1 After 2 After 3 before 4 and 5 Questions Can we swap 3 with 4? Can we swap 4 and 5? C = 110 S = 700 C = 10 110 S = 800 700 S=700 C=110 Commit 26 Revisit The Implementation Two Threads Run Transactions BeginTransaction Start using a write-ahead log on disk Log all updates Commit Write commit at the end of the log Then write-behind to disk by writing updates to disk Clear the log Rollback Clear the log Crash recovery If there is no commit in the log, do nothing If there is commit, replay the log and clear the log Apply to the mid-term AtomicTransfer program 1: BeginTransaction 2: if ( a1->id < a2->id ) { Acquire( a1->lock ); Acquire( a2->lock ); } else { Acquire( a2->lock ); Acquire( a1->lock ); } 3: if ((a1->balance - $100 ) < 0) { Release( a2->lock ); Release( a1->lock ); goto 7; } 4: a1->balance -= $100; 5: a2->balance += $100; 6: Release( a2->lock ); Release( a1->lock ); 7: Commit Questions What is commit? What if there is a crash during the recovery? What happens if Thread A performs 1-6; context switch Thread B performs 1-7; crash! 27 28 7

Two-Phase Locking for Transactions First phase Acquire all locks Second phase Commit operation releases all locks (no individual release operations) Rollback operation always undo the changes first and then release all locks Use Transactions in File Systems Make a operation a transaction Create a Move a Write a chunk of data Would this eliminate any need to run fsck after a crash? Make arbitrary number of operations a transaction Just keep logging but make sure that things are idempotent: making a very long transaction Recovery by replaying the log and correct the system This is called logging system or journaling system Almost all new systems are journaling (Windows NTFS, Veritas system, systems on Linux) 29 30 Issue with Logging: Performance For every disk write, we now have two disk writes (on different parts of the disk)? It is not so bad because once written to the log, it is safe to do real writes later Performance tricks Changes made in memory and then logged to disk Log writes are sequential (synchronous writes can be fast if on a separate disk) Merge multiple writes to the log with one write Use NVRAM (Non-Volatile RAM) to keep the log Log Management How big is the log? Same size as the system? Observation Log what s needed for crash recovery Management method Checkpoint operation: flush the buffer cache to disk After a checkpoint, we can truncate log and start again Log needs to be big enough to hold changes in memory Some logging systems log only metadata ( descriptors and ectories) and not data to keep log size down 31 32 8

What to Log? Physical blocks (ectory blocks and inode blocks) Easy to implement but takes more space Which block image? Before operation: Easy to go backward during recovery After operation: Easy to go forward during recovery. Both: Can go either way. Logical operations Example: Add name foo to ectory #41 More compact But more work at recovery time Log-structured File System (LFS) Structure the entire system as a log with segments A segment has s, inect blocks, and data blocks All writes are sequential (no seeks) There will be holes when deleting s Questions What about read performance? How would you clean (garbage collection)? Used Unused Log structured 33 34 Summary File buffer cache True LRU is possible Simple write back is volnerable to crashes Disk block failures and system recovery tools Individual recovery tools Top down traversal tools Consistent updates Transactions and ACID properties Logging or Journaling systems 35 9