Block Device Scheduling

Don Porter, CSE 506

[Logical diagram: the course's layered map of the system: binary formats and system calls in user mode; memory management, memory allocators, file system, RCU, threads, sync, CPU scheduler, networking, device drivers, and interrupts in the kernel; consistency and hardware below. Today's lecture: the block device layer.]

Quick Recap
- CPU scheduling: balance competing concerns with heuristics
- What were some goals? No perfect solution
- Today: block device scheduling
- How is it different from the CPU?
- Focus primarily on a traditional hard drive
- Extend to new storage media

Block device goals
- Throughput
- Latency
- Safety: the file system can be recovered after a crash
- Fairness: surprisingly, very little attention is given to storage access fairness
  - Hard problem; solutions usually just prevent starvation
  - Quotas for space fairness

Big Picture
[Figure: the storage stack: VFS, low-level FS (ext4, BTRFS, etc.), page cache, block device I/O scheduler, driver.]

OS Model of a Block Device
- Simple array of blocks
- Blocks are usually 512 bytes or 4 KB
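To make the array-of-blocks model concrete, here is a minimal C sketch that reads one block by computing its byte offset from the block number. The device path, block number, and 512-byte block size are illustrative assumptions (and opening a raw device typically requires root); nothing here comes from the slides themselves.

```c
/* Minimal sketch of the OS's model of a block device: a flat array of
 * fixed-size blocks addressed by block number. Device path, block
 * number, and block size are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 512 /* sector-sized blocks; could just as well be 4096 */

int main(void)
{
    char buf[BLOCK_SIZE];
    /* A real block device would be something like /dev/sda (needs
     * privileges); any large file also works for the demo. */
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t block_nr = 42; /* hypothetical block number */
    /* The "array index" is just block_nr * BLOCK_SIZE. */
    ssize_t n = pread(fd, buf, BLOCK_SIZE, block_nr * BLOCK_SIZE);
    if (n != BLOCK_SIZE) { perror("pread"); close(fd); return 1; }

    printf("read block %lld\n", (long long)block_nr);
    close(fd);
    return 0;
}
```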

Recall: Page Cache
- Obviously, the number 1 trick in the OS designer's toolbox is caching disk contents in RAM
- Remember the page cache?
[Figure: a page (blue) with 3 buffer heads (green); buffer heads map disk blocks on the block device.]

Caching
- Latency can be hidden by pre-reading data into RAM
- And by keeping any free RAM full of disk contents
- Doesn't help synchronous reads (that miss in the RAM cache) or synchronous writes

Caching + throughput
- Assume that most reads and writes to disk are asynchronous
  - Dirty data can be buffered and written at the OS's leisure
  - Most reads hit in the RAM cache; most disk reads are readahead optimizations
- Key problem: how to optimally order pending disk I/O requests?
  - Hint: it isn't first-come, first-served

Another view of the problem
- Between the page cache and the disk, you have a queue of pending requests
- Requests are a tuple of (block #, read/write, buffer addr), sketched in code after these notes
- You can reorder these as you like to improve throughput
- What reordering heuristic to use? If any?
- The heuristic is called the I/O scheduler

A simple disk model
- Disks are slow. Why? Moving parts are much slower than circuits
- Programming interface: simple array of sectors (blocks)
- Physical layout: concentric circular tracks of sectors on a platter
  - E.g., sectors 0-9 on the innermost track, 10-19 on the next track, etc.
- The disk arm moves between tracks
- The platter rotates under the disk head to align with the requested sector
- The disk spins at a constant speed; sectors rotate underneath the head
- The head reads at the granularity of an entire sector
[Figure: one track of a platter holding sectors 0-7, with the disk head over it.]
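A minimal sketch of the pending-request queue referenced above, with the (block #, read/write, buffer addr) tuple as struct fields. The names and linked-list layout are hypothetical simplifications, not Linux's actual struct bio/request machinery.

```c
/* Sketch of an I/O scheduler's pending queue. Each request carries the
 * (block #, read/write, buffer addr) tuple from the slides; everything
 * else is a hypothetical simplification. */
#include <stdbool.h>
#include <stdio.h>

struct io_request {
    unsigned long block;     /* target block number */
    bool write;              /* true = write, false = read */
    void *buf;               /* memory to copy from or into */
    struct io_request *next; /* link in the pending queue */
};

struct io_queue {
    struct io_request *head; /* requests awaiting dispatch */
};

/* First-come, first-served: append at the tail. A real scheduler
 * reorders on insertion (or at dispatch) per its heuristic. */
static void enqueue_fcfs(struct io_queue *q, struct io_request *r)
{
    struct io_request **p = &q->head;
    while (*p)
        p = &(*p)->next;
    r->next = NULL;
    *p = r;
}

int main(void)
{
    struct io_queue q = { NULL };
    struct io_request r1 = { 42, false, NULL, NULL };
    struct io_request r2 = { 7, true, NULL, NULL };
    enqueue_fcfs(&q, &r1);
    enqueue_fcfs(&q, &r2);
    for (struct io_request *r = q.head; r; r = r->next)
        printf("block %lu (%s)\n", r->block, r->write ? "write" : "read");
    return 0;
}
```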

Many Tracks
- Concentric tracks; the head seeks to different tracks
- The gap between sectors 7 and 8 accounts for seek time
[Figure: two concentric tracks holding sectors 0-21, with the disk head over one of them.]

Several (~4) Platters
- Platters spin together at the same speed
- Each platter has a head; all heads seek together

Implications of multiple platters
- Blocks are actually striped across platters
- Example: sector 0 on platter 0; sector 1 on platter 1 at the same position; sector 2 on platter 2 and sector 3 on platter 3, also at the same position
- 4 heads can read all 4 sectors simultaneously

3 key latencies
- I/O delay: the time it takes to read/write a sector
- Rotational delay: the time the disk head waits for the platter to rotate the desired sector under it
  - Note: the disk rotates continuously at constant speed
- Seek delay: the time the disk arm takes to move to a different track

Observations
- The latency of a given operation is a function of the current disk arm and platter position
- Each request changes these values
- Idea: build a model of the disk
  - Maybe use delay values from measurements or manuals
  - Use simple math to evaluate the latency of each pending request
- Greedy algorithm: always select the lowest-latency request
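A sketch of the greedy policy under a toy latency model. The geometry (100 sectors per track) and the per-track seek, per-sector rotation, and transfer costs are made-up illustration values; the model also ignores rotation that happens during the seek, the simplification that the formula notes below call out.

```c
/* Greedy selection with a toy disk model: latency = seek + rotation + I/O.
 * All constants and the block-to-track mapping are assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define SECTORS_PER_TRACK 100
#define SEEK_MS_PER_TRACK 0.5 /* s: seek time per track crossed */
#define ROT_MS_PER_SECTOR 0.1 /* r: rotation time per sector */
#define IO_MS             0.2 /* i: transfer time for one sector */

struct head_pos { long track, sector; };

static double est_latency(struct head_pos h, long block)
{
    long track = block / SECTORS_PER_TRACK;
    long sector = block % SECTORS_PER_TRACK;
    long dtracks = labs(track - h.track);
    /* Simplification: ignores platter rotation during the seek; the
     * formula below notes that a real model must account for it. */
    long dsectors = (sector - h.sector + SECTORS_PER_TRACK) % SECTORS_PER_TRACK;
    return dtracks * SEEK_MS_PER_TRACK + dsectors * ROT_MS_PER_SECTOR + IO_MS;
}

/* Greedy: always pick the pending block with the lowest estimated latency. */
static long pick_greedy(struct head_pos h, const long *pending, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (est_latency(h, pending[i]) < est_latency(h, pending[best]))
            best = i;
    return pending[best];
}

int main(void)
{
    struct head_pos h = { 5, 10 };
    long pending[] = { 520, 1480, 530, 90 };
    printf("next block: %ld\n", pick_greedy(h, pending, 4));
    return 0;
}
```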

Example formula
- s = seek latency, in time/track
- r = rotational latency, in time/sector
- i = I/O latency, in seconds
- Time = (Δtracks * s) + (Δsectors * r) + i
- Note: Δsectors must factor in the head's position after the seek is finished. Why?

Problem with greedy?
- Far requests will starve
- The head may just hover around the middle tracks

Elevator Algorithm
- Require the disk arm to move in continuous sweeps in and out
- Reorder requests within a sweep (see the sketch after these notes)
- Ex: if the disk arm is moving out, order requests between the current track and the outside of the disk in ascending order (by block number)
- A request for a sector the arm has already passed is ordered after the outermost request, in descending order
- This approach prevents starvation
  - Sectors at the inside or outside get service after a bounded time

Elevator Algo, pt. 2
- Reasonably good throughput: sorting requests minimizes seek latency
- Can get hit with rotational-latency pathologies (How?)
  - The programming model hides low-level details; fine-grained optimizations are difficult in practice
- Simple to code up!

Pluggable Schedulers
- Linux allows the disk scheduler to be replaced, just like the CPU scheduler
- Can choose a different heuristic that favors:
  - Fairness
  - Real-time constraints
  - Performance

Completely Fair Queuing (CFQ)
- Idea: add a second layer of queues (one per process)
  - Round-robin promote requests from them to the real queue
- Goal: fairly distribute disk bandwidth among tasks
- Problems?
  - Overall throughput is likely reduced
  - Ping-pongs the disk head around
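The sketch referenced in the Elevator Algorithm notes above: given the head position and an outward sweep, pending blocks at or beyond the head are ordered ascending, and blocks already passed are appended in descending order for the return sweep. The fixed array sizes and qsort-based implementation are assumptions for brevity.

```c
/* Elevator (SCAN) ordering sketch: outward sweep first, then the
 * return sweep for blocks the arm has already passed. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_asc(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Partition pending blocks around the head position, then order each
 * side to match the sweep direction. Capacity of 64 is assumed >= n. */
static void elevator_order(long head, long *pending, int n)
{
    long ahead[64], behind[64];
    int na = 0, nb = 0, k = 0;
    for (int i = 0; i < n; i++) {
        if (pending[i] >= head) ahead[na++] = pending[i];
        else                    behind[nb++] = pending[i];
    }
    qsort(ahead, na, sizeof(long), cmp_asc);
    qsort(behind, nb, sizeof(long), cmp_asc);
    /* Outward sweep in ascending order... */
    for (int i = 0; i < na; i++) pending[k++] = ahead[i];
    /* ...then the passed blocks in descending order for the return. */
    for (int i = nb - 1; i >= 0; i--) pending[k++] = behind[i];
}

int main(void)
{
    long pending[] = { 90, 520, 1480, 530, 250 };
    elevator_order(500 /* current head block */, pending, 5);
    for (int i = 0; i < 5; i++)
        printf("%ld ", pending[i]); /* prints: 520 530 1480 250 90 */
    printf("\n");
    return 0;
}
```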

Deadline Scheduler
- Associate expiration times with requests
- As requests get close to expiration, make sure they are dispatched
- Constrains reordering to ensure some forward progress
- Good for real-time applications

Anticipatory Scheduler
- Idea: try to anticipate locality of requests
- If process P tends to issue bursts of requests for close disk blocks:
  - When you see a request from P, hold it in the disk queue for a while
  - See if more nearby requests come in
  - Then schedule all the requests at once
- And coalesce adjacent requests

Optimizations at Cross-purposes
- The disk itself does some optimizations:
  - Caching: write requests can sit in a volatile cache for longer than expected
  - Reordering requests internally
    - Can't assume that requests are serviced in order
    - Dependent operations must wait until the first finishes
  - Remapping bad sectors to spares
    - Problem: the disk arm flailing on an old disk

A note on safety
- In Linux, and other OSes, the I/O scheduler can reorder requests arbitrarily
- It is the file system's job to keep unsafe I/O requests out of the scheduling queues

Dangerous I/Os
- What can make an I/O request unsafe?
- File system bookkeeping has invariants on disk
  - Example: inodes point to file data blocks, and a bitmap tracks which blocks are free
- Updates must uphold these invariants
  - Ex: write an update to the inode, then to the bitmap
  - What if the system crashes between the writes? A block can end up in two files!

3 Simple Rules (courtesy of Ganger and McKusick, Soft Updates paper)
- Never write a pointer to a structure until it has been initialized
  - Ex: don't write a directory entry to disk until the inode has been written to disk
- Never reuse a resource before nullifying all pointers to it
  - Ex: before re-allocating a block to a file, write an update to the inode that references it
- Never reset the last pointer to a live resource before a new pointer has been set
  - Ex: when renaming a file, write the new directory entry before removing the old one (better 2 links than none)
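An illustrative way to enforce rule 1 from user space: flush the pointed-to structure before writing the pointer to it. The file name and block offsets are hypothetical stand-ins for an inode block and a directory-entry block; real file systems enforce this ordering inside the kernel, not with fsync.

```c
/* Rule 1 sketch: never write a pointer (directory entry) until the
 * structure it points to (inode) is durable. Offsets are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("fs.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char inode[512]  = "initialized inode";
    char dirent[512] = "directory entry -> inode 7";

    /* Step 1: write and flush the inode ("the structure"). */
    if (pwrite(fd, inode, sizeof(inode), 4096) != (ssize_t)sizeof(inode)) {
        perror("pwrite inode"); return 1;
    }
    fsync(fd); /* barrier: inode must be durable before the pointer */

    /* Step 2: only now write the directory entry that points to it. */
    if (pwrite(fd, dirent, sizeof(dirent), 8192) != (ssize_t)sizeof(dirent)) {
        perror("pwrite dirent"); return 1;
    }
    fsync(fd);

    close(fd);
    return 0;
}
```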

A note on safety, continued
- It is the file system's job to keep unsafe I/O requests out of the scheduling queues
- While these constraints are simple, enforcing them in the average file system is surprisingly difficult
- Journaling helps by creating a log of what you are in the middle of doing, which can be replayed
  - (Simpler) constraint: journal updates must go to disk before FS updates

Disks aren't everything
- Flash is increasing in popularity
  - Different types with slight variations (NAND, NOR, etc.)
- No moving parts: who cares about block ordering anymore?
- Can only write to a block of flash ~100k times
  - Can read as much as you want

More in a Flash
- Flash reads are generally fast; writes are more expensive
  - Prefetching has little benefit
  - Queuing optimizations can take longer than a read
- New issue: wear leveling; writes need to be distributed evenly (a toy sketch follows at the end of these notes)
- Flash devices usually have a custom, log-structured FS
  - Groups random writes

Even newer hotness
- Byte-addressable, persistent RAMs (BPRAM)
  - Phase-Change Memory (PCM), memristors, etc.
- Splits the difference between RAM and flash:
  - Byte-granularity writes (vs. blocks)
  - Fast reads; slower, high-energy writes
  - Doesn't need energy to hold state (no DRAM-style refresh)
  - Wear is an issue (bytes get stuck at their last value)
- Still in the lab, but getting close
- An important research topic

Summary
- Most work on optimizing storage access is tailored to hard drives
  - These heuristics are not easily adapted to new media
- Future systems will have a mix of disks, flash, BPRAM, and DRAM
  - The performance characteristics of disks, flash, and BPRAM drive the scheduling heuristics
  - Safety constraints for file systems
  - Does it even make sense to treat them all the same?
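To close, the toy wear-leveling sketch mentioned in the flash notes: a translation layer remaps each logical write to the least-worn free physical block, spreading erase cycles. Everything here (sizes, names, the erase accounting) is a hypothetical simplification; real flash translation layers also handle garbage collection, mapping persistence, and much more.

```c
/* Toy flash wear-leveling sketch: remap every logical-block write to
 * the least-worn free physical block. Purely illustrative. */
#include <stdio.h>

#define NBLOCKS 8

static int map[NBLOCKS];         /* logical -> physical (-1 = unmapped) */
static int erase_count[NBLOCKS]; /* erases per physical block */

static int in_use(int p)
{
    for (int l = 0; l < NBLOCKS; l++)
        if (map[l] == p)
            return 1;
    return 0;
}

/* Remap a logical write to the least-worn physical block not in use. */
static void write_logical(int lba)
{
    map[lba] = -1; /* the old physical block becomes free */
    int best = -1;
    for (int p = 0; p < NBLOCKS; p++)
        if (!in_use(p) && (best < 0 || erase_count[p] < erase_count[best]))
            best = p;
    erase_count[best]++; /* each rewrite costs an erase cycle here */
    map[lba] = best;
    printf("logical %d -> physical %d (erases: %d)\n",
           lba, best, erase_count[best]);
}

int main(void)
{
    for (int l = 0; l < NBLOCKS; l++)
        map[l] = -1;
    for (int i = 0; i < 6; i++) /* hammer one "hot" logical block */
        write_logical(3);       /* writes rotate across physical blocks */
    return 0;
}
```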