HP AutoRAID (Lecture 5, cs262a)

HP AutoRAID (Lecture 5, cs262a) Ion Stoica, UC Berkeley, September 13, 2016 (based on a presentation by John Kubiatowicz, UC Berkeley)

Array Reliability
Reliability of N disks = Reliability of 1 disk / N. Example: 50,000 hours / 70 disks ≈ 700 hours, so the disk-system MTTF drops from 6 years to about 1 month! Arrays without redundancy are too unreliable to be useful. Hot spares support reconstruction in parallel with access: very high media availability can be achieved.
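
A minimal sketch of the slide's arithmetic. The 50,000-hour MTTF and 70-disk array are the slide's example numbers; the assumption of independent, identically failing disks is mine.

# Rough MTTF arithmetic for an array with no redundancy
# (assumes independent, identically distributed disk failures).
disk_mttf_hours = 50_000   # per-disk MTTF from the slide
num_disks = 70

array_mttf_hours = disk_mttf_hours / num_disks   # ~714 hours
print(f"array MTTF ~= {array_mttf_hours:.0f} hours "
      f"(~{array_mttf_hours / 24 / 30:.1f} months)")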

RAID Basics: Levels* of RAID
RAID 0: striping with no parity (just bandwidth). High read/write throughput, theoretically Nx where N is the number of disks; no data redundancy.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)

RAID Basics: Levels* of RAID
RAID 0: striping with no parity (just bandwidth).
RAID 1: mirroring; simple, fast, but 2x storage. Each disk is fully duplicated onto its "shadow" → high availability. Read throughput (2x) assuming N disks; writes (1x): two physical writes.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)
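
A minimal sketch of the address arithmetic behind RAID 0 striping and RAID 1 mirroring. The one-block stripe unit, the four-disk array, and two-way mirroring are assumptions for illustration, not from the slides.

NUM_DISKS = 4

def raid0_map(logical_block):
    """RAID 0: stripe blocks round-robin across NUM_DISKS disks."""
    disk = logical_block % NUM_DISKS
    offset = logical_block // NUM_DISKS
    return disk, offset

def raid1_map(logical_block):
    """RAID 1 (a mirrored pair): every block lives on both disks."""
    return [(0, logical_block), (1, logical_block)]

# Example: logical block 9 lands on disk 1 at offset 2 under RAID 0.
print(raid0_map(9))      # (1, 2)
print(raid1_map(9))      # [(0, 9), (1, 9)]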

RAID Basics: Levels* of RAID
RAID 2: bit-level interleaving with Hamming error-correcting codes. A code with Hamming distance d can detect d-1 bit errors and correct ⌊(d-1)/2⌋. Needs synchronized disks, i.e., spindles rotating in lockstep at the same angular position.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)

RAID Basics: Levels* of RAID
RAID 3: byte-level striping with a dedicated parity disk. The dedicated parity disk is the write bottleneck: every write also writes the parity disk.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)

RAID Basics: Levels* of RAID
RAID 4: block-level striping with a dedicated parity disk. Same bottleneck problems as RAID 3.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)

RAID Basics: Levels* of RAID
RAID 5: block-level striping with rotating (distributed) parity. Most popular; spreads out the parity load; space 1-1/N, read/write throughput (N-1)x.
https://en.wikipedia.org/wiki/standard_raid_levels (*Levels in RED are used in practice)
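
A minimal sketch of the parity idea behind RAID 3/4/5: the parity block is the XOR of the data blocks in a stripe, so any single missing block can be rebuilt from the survivors. The block contents below are made-up bytes.

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]  # D0..D3
parity = xor_blocks(stripe)                                    # P

# The disk holding D2 fails: rebuild it from the other data blocks plus parity.
survivors = [stripe[0], stripe[1], stripe[3], parity]
assert xor_blocks(survivors) == stripe[2]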

RAID Basics: Levels* of RAID
RAID 5: block-level striping with rotating (distributed) parity. Most popular; spreads out the parity load; space 1-1/N, read/write throughput (N-1)x.
RAID 6: RAID 5 with two parity blocks (tolerates two drive failures). Use it with today's drive sizes! Why? Correlated drive failures (2x expected in a 10-hour recovery) [Schroeder and Gibson, FAST '07], and failures during multi-hour/day rebuilds in high-stress environments.
(*Levels in RED are used in practice)

Redundant Arrays of Disks: RAID 5+, High I/O Rate Parity
Independent writes are possible because of the interleaved parity. Reed-Solomon codes ("Q") for protection during reconstruction. A logical write becomes four physical I/Os. Targeted for mixed applications.
Layout (5 disk columns; logical disk addresses increase across each row; each row is a stripe, each cell a stripe unit, P the parity block):
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
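
A minimal sketch of the rotating-parity placement shown above (5 disks). The formula simply reproduces the right-to-left rotation in the layout; it is one common convention, not necessarily HP's.

NUM_DISKS = 5

def parity_disk(stripe_number):
    """Parity moves one disk to the left on each successive stripe."""
    return (NUM_DISKS - 1) - (stripe_number % NUM_DISKS)

# Stripes 0..5 place parity on disks 4, 3, 2, 1, 0, 4 -- as in the layout above.
print([parity_disk(s) for s in range(6)])   # [4, 3, 2, 1, 0, 4]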

Problems of Disk Arrays: Small Writes
RAID-5 small-write algorithm: 1 logical write = 2 physical reads + 2 physical writes.
To replace old data D0 with new data D0' in the stripe (D0, D1, D2, D3, P):
1. Read old data D0
2. Read old parity P
3. Write new data D0'
4. Write new parity P' = D0 XOR D0' XOR P
The stripe becomes (D0', D1, D2, D3, P').
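
A minimal sketch of the small-write parity update: the new parity can be computed from just the old data, the old parity, and the new data, which is why a one-block logical write costs two reads and two writes. The block values are made up.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    """Return the new parity: P' = P xor D_old xor D_new."""
    return xor(xor(old_parity, old_data), new_data)

# 1. read old data  2. read old parity  3. write new data  4. write new parity
d0, d1, d2, d3 = b"\x01", b"\x02", b"\x03", b"\x04"
p = xor(xor(xor(d0, d1), d2), d3)
d0_new = b"\x09"
p_new = raid5_small_write(d0, p, d0_new)
assert p_new == xor(xor(xor(d0_new, d1), d2), d3)   # same as recomputing from scratch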

System Availability: Orthogonal RAIDs
[Figure: an array controller fanning out to several string controllers, each driving a string of disks; data recovery groups are laid out orthogonally to the strings.]
Redundant support components: fans, power supplies, controller, cables.
Data Recovery Group: unit of data redundancy.
End-to-end data integrity: internal parity-protected data paths.

System-Level Availability
Goal: no single points of failure.
[Figure: fully dual-redundant configuration: two host I/O controllers and two array controllers, with duplicated paths down to the recovery groups of disks.]
With duplicated paths, higher performance can be obtained when there are no failures.

HP AutoRAID Motivation
Goal: automate the efficient replication of data in a RAID. RAIDs are hard to set up and optimize, and different RAID levels provide different trade-offs, so automate the migration between levels.
RAID 1 and 5: what are the trade-offs? RAID 1: excellent read performance, good write performance, performs well under failures, but expensive. RAID 5: cost effective, good read performance, bad write performance, expensive recovery.

HP AutoRAID Motivation
Each kind of replication has a narrow range of workloads for which it is best... and a mistake means 1) poor performance, and 2) changing the layout, which is expensive and error prone.
Difficult to add storage: a new disk means changing the layout and rearranging the data... Difficult to add disks of different sizes.

HP AutoRAID Key Ideas
Key idea: mirror active (hot) data, use RAID 5 for cold data. Assumes only part of the data is in active use at any one time, and that the working set changes slowly (to allow migration).

HP AutoRAID Key Ideas
How to implement this idea?
Sys-admin: make a human move the files around... BAD: painful and error prone.
File system: hard to implement/deploy; can't work with existing systems.
Smart array controller ("magic disk") behind a block-level device interface: easy to deploy because there is a well-defined abstraction; enables easy use of NVRAM (why?).

AutoRAID Details
PEX (Physical Extent): 1 MB chunk of disk space.
PEG (Physical Extent Group): a group of PEXes assigned to one storage class; size depends on the number of disks.
Stripe: one row of parity and data segments in a RAID 5 storage class; size depends on the number of disks.
Segment: 128 KB; the stripe unit (RAID 5) or half of a mirroring unit.
Relocation Block (RB): 64 KB; the client-visible unit of space.
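
A minimal sketch of how the slide's space units relate by size, just to fix the arithmetic. The constants are from the slide; the ratios are pure size arithmetic, not claims about on-disk layout.

KB, MB = 1024, 1024 * 1024

PEX_SIZE     = 1 * MB      # Physical Extent
SEGMENT_SIZE = 128 * KB    # stripe unit (RAID 5) or half of a mirroring unit
RB_SIZE      = 64 * KB     # Relocation Block: client-visible unit

print(PEX_SIZE // SEGMENT_SIZE)   # 8 segments fit in one PEX
print(SEGMENT_SIZE // RB_SIZE)    # 2 RBs fit in one segment
print(PEX_SIZE // RB_SIZE)        # 16 RBs fit in one PEX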

Closer Look

Putting Everything Together
Virtual Device Tables: map RBs to PEGs; an RB is allocated to a PEG only when written.
PEG Tables: map PEGs to physical disk addresses; unused slots in a PEG are marked free until an RB is allocated to them.
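
A minimal sketch of the two-level lookup described above, with lazy allocation of an RB into a PEG on first write. The table shapes, names, and disk labels are mine for illustration, not the paper's on-disk format.

# virtual_device_table: RB number -> (PEG id, slot), filled in on first write
# peg_table:            PEG id    -> list of (disk, physical offset) per slot
virtual_device_table = {}
peg_table = {0: [("disk0", 0), ("disk1", 0), ("disk2", 0)]}

def write_rb(rb_number, peg_id, free_slot):
    """Allocate the RB to a PEG slot the first time it is written."""
    if rb_number not in virtual_device_table:
        virtual_device_table[rb_number] = (peg_id, free_slot)
    return virtual_device_table[rb_number]

def locate_rb(rb_number):
    """Resolve an RB to a physical (disk, offset), or None if never written."""
    entry = virtual_device_table.get(rb_number)
    if entry is None:
        return None
    peg_id, slot = entry
    return peg_table[peg_id][slot]

write_rb(7, peg_id=0, free_slot=1)
print(locate_rb(7))    # ('disk1', 0)
print(locate_rb(8))    # None (never written, so no space allocated)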

HP AutoRAID Features
Promote/demote in 64 KB chunks (8-16 blocks).
Hot swap disks, etc. (a hot swap is just a controlled failure).
Add storage easily (it goes into the mirror pool); useful to allow different-size disks (why?).
No need for an active hot spare per se; just keep enough working space around.
Log-structured RAID 5 writes: nice big streams, no need to read old parity for partial writes.

Questions
When to demote? When there is too much mirrored storage (>10%).
Demotion leaves a hole (64 KB). What happens to it? It is moved to a free list and reused. Demoted RBs are written to the RAID 5 log: one write for data, a second for parity.
Why is logging to RAID 5 better than update in place? Updating in place requires reading the old data to recalculate parity; the log ignores the old data (which becomes garbage) and writes only new data/parity stripes.
How to promote? When a RAID 5 block is written, just write it to mirrored storage and the old version becomes garbage.
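
A minimal sketch contrasting the I/O cost of the two RAID 5 write styles discussed above. This is pure bookkeeping for illustration; it ignores batching, caching, and NVRAM.

def update_in_place_ios(blocks_written):
    """Per block: read old data + old parity, write new data + new parity."""
    return 4 * blocks_written

def logged_write_ios(blocks_written):
    """Append to the RAID 5 log: one data write plus one parity write per
    block, and no reads of old data or old parity."""
    return 2 * blocks_written

print(update_in_place_ios(8))   # 32 I/Os (must read old data and parity)
print(logged_write_ios(8))      # 16 I/Os (no reads at all)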

Questions
How big should an RB be? Bigger: less mapping information, fewer seeks. Smaller: finer-grained mapping information.
How do you find where an RB is? Convert the address to (LUN, offset) and then look up the RB in a table indexed by this pair. Map size = number of RBs, so it must be proportional to the size of total storage.
How to handle thrashing (too much active write data)? Automatically revert to writing RBs directly to RAID 5!
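
A minimal sketch of the address conversion and of how the map grows with total storage. The 64 KB RB size is from the slides; the array size and per-entry byte count are arbitrary illustrative assumptions.

RB_SIZE = 64 * 1024    # bytes per Relocation Block

def to_rb(lun, byte_offset):
    """Convert a (LUN, byte offset) host address into an RB key plus the
    offset inside that RB."""
    return (lun, byte_offset // RB_SIZE), byte_offset % RB_SIZE

total_storage = 100 * 1024**3          # e.g. a 100 GB array (assumed)
num_rbs = total_storage // RB_SIZE     # one map entry per RB
bytes_per_entry = 8                    # assumed entry size, for illustration
print(num_rbs, num_rbs * bytes_per_entry / 1024**2, "MB of map")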

Issues
Disk writes go to two disks (since newly written data is "hot"). Must wait for both to complete: why? Does the host have to wait for both? No, just for the NVRAM.
The controller uses a cache for reads, and uses NVRAM for fast commit, then moves the data to disks.
What if the NVRAM is full? Block until the NVRAM is flushed to disk, then write to NVRAM.
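
A minimal sketch of the write path described above: the host is acknowledged once the data is in NVRAM, and the controller destages to the two mirror disks later; if the NVRAM is full, the write blocks until a flush frees space. The capacity and the queue-based model are assumptions for illustration.

from collections import deque

NVRAM_CAPACITY = 4          # number of buffered writes, assumed for illustration
nvram = deque()

def write_to_disk(disk, block):
    pass                     # stand-in for the real disk I/O

def flush_nvram():
    """Destage buffered writes to the two mirror disks."""
    while nvram:
        block = nvram.popleft()
        write_to_disk("mirror0", block)
        write_to_disk("mirror1", block)

def host_write(block):
    """Fast commit: ack as soon as the data is in NVRAM."""
    if len(nvram) >= NVRAM_CAPACITY:
        flush_nvram()        # host blocks here until space is freed
    nvram.append(block)
    return "ack"             # host does not wait for the mirror writes

for i in range(10):
    host_write(f"block{i}")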

Issues
What happens in the background? 1) compaction, 2) migration, 3) balancing.
Compaction: clean the RAID 5 log and plug holes in the mirrored disks. Do mirrored disks get cleaned? Yes, when a PEG is needed for RAID 5: pick a disk with lots of holes and move its used RBs to other disks; the resulting empty PEG is now usable by RAID 5. What if there aren't enough holes? Write the excess RBs to RAID 5, then reclaim the PEG.
Migration: which RBs to demote? Least recently written (not LRU).
Balancing: make sure data is spread evenly across the disks (most important when you add a new disk).
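
A minimal sketch of the least-recently-written demotion policy mentioned above. The data structure and logical clock are illustrative, not the paper's implementation.

last_written = {}   # RB number -> logical timestamp of its last write
clock = 0

def record_write(rb_number):
    global clock
    clock += 1
    last_written[rb_number] = clock

def pick_demotion_victims(count):
    """Pick the RBs written longest ago (least recently WRITTEN, not least
    recently used, so read-hot data can stay mirrored)."""
    return sorted(last_written, key=last_written.get)[:count]

for rb in (3, 7, 9, 3):      # RB 3 is rewritten last, so it stays mirrored
    record_write(rb)
print(pick_demotion_victims(2))   # [7, 9]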