Database Systems II. Secondary Storage


Database Systems II: Secondary Storage
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester

The Memory Hierarchy
(Figure: the memory hierarchy, from the CPU and its L1/L2 cache (256 KB to 4 MB; ~5 cycles initial latency, then burst mode; ~16 GB/s at 64 bit @ 2 GHz) over main memory (~200 cycles latency; 3,200 MB/s for DDR-SDRAM @ 200 MHz up to 6,400-12,800 MB/s for dual-channel DDR2 @ 800 MHz) and the disk cache (2-16 MB) down to disk (300 MB/s, SATA-300) and tertiary storage (tape, network backup); the figure also labels virtual memory, swapping, the file system and main-memory DBMSs.)

The Memory Hierarchy
Cache: holds data and instructions when they are needed by the CPU. The on-board (L1) cache is on the same chip as the CPU; the L2 cache is on a separate chip. Capacity ~1 MB, access time a few nanoseconds.
Main memory: all active programs and their data need to be in main memory. Capacity ~1 GB, access time 10-100 nanoseconds.
Secondary storage: used for permanent storage of large amounts of data, typically a magnetic disk. Capacity up to 1 TB, access time ~10 milliseconds.
Tertiary storage: stores data collections that do not fit onto secondary storage, e.g. magnetic tapes or optical disks. Capacity ~1 PB, access time seconds to minutes.

The Memory Hierarchy: trade-off
The larger the capacity of a storage device, the slower the access (and vice versa). A volatile storage device forgets its contents when power is switched off; a non-volatile device retains them. Secondary and tertiary storage are non-volatile, all other levels are volatile. A DBS needs non-volatile (secondary) storage devices to store data permanently.

Typical use of the levels in a DBS:
- RAM (main memory) for the subset of the database used by current transactions.
- Disk (secondary storage) to store the current version of the entire database.
- Tapes (tertiary storage) for archiving older versions of the database.

The Memory Hierarchy: virtual memory
Typically, programs are executed in a virtual memory whose size equals the address space of the processor. Virtual memory is managed by the operating system, which keeps the most relevant part in main memory and the rest on disk. A DBS manages its data itself and does not rely on virtual memory. However, main-memory DBSs do manage their data through virtual memory.

Moore's Law
In 1965, Gordon Moore observed that the density of integrated circuits (i.e., the number of transistors per unit area) increases at an exponential rate, roughly doubling every 18 months. Parameters that follow Moore's law:
- the number of instructions per second that can be executed for unit cost,
- the number of main memory bits that can be bought for unit cost,
- the number of bytes on a disk that can be bought for unit cost.

Moore's Law
(Figure: number of transistors on an integrated circuit over time.)
(Figure: disk capacity over time.)

Moore's Law
Some other important hardware parameters do not follow Moore's law and grow much more slowly. These are, in particular:
- the speed of main memory access, and
- the speed of disk access.
For example, disk latencies (seek times) have almost stagnated for the past 5 years. Thus, the relative cost of moving data from one level of the memory hierarchy to the next keeps growing.

Disks
The secondary storage device of choice. Data is stored and retrieved in units called disk blocks or pages. Main advantage over tapes: random access vs. sequential access. Unlike RAM, the time to retrieve a disk page varies depending on its location on disk. Therefore, the relative placement of pages on disk has a major impact on DBMS performance!

Disks
A disk consists of two main moving parts: the disk assembly and the head assembly. The disk assembly stores information; the head assembly reads and writes information.
(Figure: platters on a spindle, tracks, disk heads, the arm assembly and its direction of movement.)
The platters rotate around the central spindle. The upper and lower platter surfaces are covered with magnetic material, which is used to store bits. The arm assembly is moved in or out to position a head on a desired track. All tracks that are under the heads at the same time form a cylinder (imaginary!). Only one head reads/writes at any one time.

Disks
(Figure: top view of a platter surface, showing tracks, sectors and gaps.)
Block size is a multiple of the sector size (which is fixed). The time to access (read/write) a disk block (disk latency) consists of three components:
- seek time: moving the arm to position the disk head on the track,
- rotational delay: waiting for the block to rotate under the head, and
- transfer time: actually moving the data to/from the disk surface.
Seek time and rotational delay dominate.

Disks
(Figure: seek time as a function of the number of cylinders traveled, from 1 to N; the y-axis marks the values x and 3x to 5x.)

Average seek time:
S = ( Σ_{i=1..N} Σ_{j=1..N, j≠i} SEEKTIME(i → j) ) / ( N (N − 1) )
Typical average seek time = 5 ms.
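
To make the formula concrete, here is a small Python sketch (not part of the lecture); the linear seek_time model and its constants are purely hypothetical stand-ins for SEEKTIME(i → j):

```python
# Sketch: brute-force average seek time over all (i, j) cylinder pairs, i != j.
# The linear seek_time model below is a hypothetical stand-in for SEEKTIME(i -> j).

def seek_time(i: int, j: int) -> float:
    """Hypothetical model: 1 ms start-up cost plus 0.01 ms per cylinder traveled."""
    return 0.0 if i == j else 1.0 + 0.01 * abs(i - j)

def average_seek_time(n_cylinders: int) -> float:
    total = sum(seek_time(i, j)
                for i in range(1, n_cylinders + 1)
                for j in range(1, n_cylinders + 1)
                if i != j)
    return total / (n_cylinders * (n_cylinders - 1))

print(average_seek_time(1000))   # ~4.3 ms with this toy model
```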

Disks: average rotational delay
(Figure: platter with the current head position and the desired block marked.)
Average rotational delay R = 1/2 revolution. Typical R = 5 ms.

Disks: transfer time
Typical transfer rate: 100 MB/sec. Typical block size: 16 KB.
Transfer time = block size / transfer rate
Typical transfer time = 0.16 ms.
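
A small sketch (assuming the three components simply add up, using the typical values quoted above) that estimates the latency of one block access:

```python
# Sketch: estimate the latency of a single block access from the three
# components named on the slides, using the typical values quoted there.

def block_access_time_ms(seek_ms: float = 5.0,
                         rotational_delay_ms: float = 5.0,
                         block_bytes: int = 16 * 1024,
                         transfer_rate_bytes_per_s: float = 100e6) -> float:
    transfer_ms = block_bytes / transfer_rate_bytes_per_s * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms

print(block_access_time_ms())   # ~10.16 ms: seek and rotational delay dominate
```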

Disks
Typical average disk latency is 10 ms, maximum latency 20 ms. In 10 ms, a modern microprocessor can execute millions of instructions. Thus, the time for a block access by far dominates the time typically needed for processing the data in memory. The number of disk I/Os (block accesses) is therefore a good approximation of the cost of a database operation.

Accelerating Disk Access
Organize data by cylinders to minimize seek time and rotational delay. 'Next block' concept:
- blocks on the same track, followed by
- blocks on the same cylinder, followed by
- blocks on an adjacent cylinder.
If the blocks of a file are placed sequentially on disk (by 'next'), effective disk throughput can approach the raw transfer rate.

Accelerating Disk Access: example
Assume 10 ms average seek time, no rotational delay, and a 40 MB/s transfer rate (see the sketch below).
Read a single 4 KB block: random I/O 10 ms, sequential I/O 10 ms.
Read 4 MB in 4 KB blocks (amortized): random I/O 10 s, sequential I/O 0.1 s.
Speedup factor of 100.

Accelerating Disk Access: block size selection
Bigger blocks amortize the I/O cost, but they also read in more useless data and take longer to transfer. A good trade-off is a block size from 4 KB to 16 KB. With decreasing memory costs, blocks are becoming bigger!
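
The example above can be recomputed with a short sketch; the cost model (one seek per block for random I/O, a single seek for a whole sequential run) is an assumption that matches the slide's numbers only approximately:

```python
# Sketch: the random vs. sequential I/O example from the slides, recomputed.
SEEK_MS = 10.0           # average seek time; rotational delay assumed zero
TRANSFER_RATE = 40e6     # bytes per second
BLOCK = 4 * 1024         # 4 KB blocks

def read_time_s(total_bytes: int, sequential: bool) -> float:
    n_blocks = total_bytes // BLOCK
    transfer_s = total_bytes / TRANSFER_RATE
    # sequential: one seek for the whole run; random: one seek per block
    seeks = 1 if sequential else n_blocks
    return seeks * SEEK_MS / 1000.0 + transfer_s

random_s = read_time_s(4 * 1024 * 1024, sequential=False)     # ~10.3 s
sequential_s = read_time_s(4 * 1024 * 1024, sequential=True)  # ~0.11 s
print(random_s / sequential_s)   # speedup ~90x, roughly the factor of 100 on the slide
```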

Accelerating Disk Access: using multiple disks
Replace one disk (with one independent head) by many disks (with many independent heads). Striping a relation R: divide its blocks over n disks in a round-robin fashion. Assuming that the disk controller, bus and main memory can handle n times the transfer rate, striping a relation across n disks can lead to a speedup factor of up to n.

Accelerating Disk Access: disk scheduling
For I/O requests from different processes, let the disk controller choose the processing order. According to the elevator algorithm, the disk controller keeps sweeping from the innermost to the outermost cylinder (and back), stopping at every cylinder for which there is an I/O request. It can reverse the sweep direction as soon as there is no request ahead in the current direction. This optimizes throughput and average response time. A sketch of both ideas follows below.
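
A minimal sketch of both ideas, assuming cylinders are plain integers and ignoring request arrival times; this is an illustration, not the controller's actual implementation:

```python
# Sketch: round-robin striping and the elevator algorithm as described above.

def stripe_disk(block_no: int, n_disks: int) -> int:
    """Round-robin striping: block i of relation R goes to disk i mod n."""
    return block_no % n_disks

def elevator_order(head: int, requests: list[int], direction: int = +1) -> list[int]:
    """Serve cylinder requests by sweeping in one direction and reversing
    only when no request lies ahead in the current direction."""
    pending = sorted(requests)
    order = []
    while pending:
        ahead = [c for c in pending if (c - head) * direction >= 0]
        if not ahead:
            direction = -direction            # reverse the sweep direction
            continue
        nxt = min(ahead, key=lambda c: abs(c - head))
        order.append(nxt)
        pending.remove(nxt)
        head = nxt
    return order

print([stripe_disk(i, 4) for i in range(8)])     # [0, 1, 2, 3, 0, 1, 2, 3]
print(elevator_order(50, [10, 60, 55, 90, 20]))  # [55, 60, 90, 20, 10]
```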

Accelerating Disk Access: double buffering
In some scenarios, we can predict the order in which blocks will be requested from disk by a process. Prefetching (double buffering) is the method of fetching the necessary blocks into the buffer in advance. It requires enough buffer space. Speedup factor up to n, where n is the number of blocks requested by the process.

Single buffering: (1) read B1 into the buffer, (2) process the data in the buffer, (3) read B2 into the buffer, (4) process the data in the buffer, ...
Execution time = n(P + R), where
P = time to process one block,
R = time to read in one block,
n = number of blocks read.
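
A minimal numeric sketch (not from the slides; the block count and per-block times are made up) comparing the single-buffering cost n(P + R) above with the double-buffered cost R + nP derived on the next slide:

```python
# Sketch: compare the two execution-time formulas given on the slides for
# single vs. double buffering, with hypothetical timings.

def single_buffering_ms(n: int, p_ms: float, r_ms: float) -> float:
    return n * (p_ms + r_ms)      # slides: Execution time = n(P + R)

def double_buffering_ms(n: int, p_ms: float, r_ms: float) -> float:
    return r_ms + n * p_ms        # slides: Execution time = R + nP

n, p, r = 1000, 0.1, 10.0         # hypothetical: P = 0.1 ms, R = 10 ms per block
print(single_buffering_ms(n, p, r))   # 10100.0 ms
print(double_buffering_ms(n, p, r))   # 110.0 ms
```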

Accelerating Disk Access: double buffering
(1) Read B1, ..., Bn into the buffer, (2) process B1 in the buffer, (3) process B2 in the buffer, ...
Execution time = R + nP, as opposed to n(P + R); remember that R >> P.

Disk Failures
- Intermittent failure: a read or write operation is unsuccessful but succeeds with repeated tries. Parity checks are used to detect intermittent failures.
- Media decay: a permanent corruption of one or more bits that makes the corresponding sector impossible to read/write. Stable storage is used to recover from media decay.
- Disk crash: the entire disk becomes permanently unreadable. RAID is used to recover from disk crashes.

Disk Failures: checksums
Add n parity bits for every m data bits. The number of 1s among a collection of bits and their parity bit is always even; the parity bit is the modulo-2 sum of its data bits.
Example (m = 8, n = 1):
Block A: 01101000 : 1 (the data has an odd number of 1s)
Block B: 11101110 : 0 (the data has an even number of 1s)
If Block A instead contains Block A': 01100000 : 1, the block has an odd number of 1s, so the error is detected.

But what if multiple bits are corrupted? E.g., if Block A instead contains Block A': 01000000 : 1, the block has an even number of 1s and the error cannot be detected. The probability that a single parity bit cannot detect a corrupt block is 1/2, assuming that disk failures involving an odd or an even number of bits are equally likely.
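
A small sketch of the single-parity-bit scheme, with blocks represented as bit strings for readability:

```python
# Sketch: one parity bit per block, as in the slides' m = 8, n = 1 example.

def parity_bit(bits: str) -> int:
    """Modulo-2 sum of the data bits."""
    return bits.count("1") % 2

def is_consistent(bits: str, parity: int) -> bool:
    """Accept a block only if data bits plus parity bit contain an even number of 1s."""
    return (bits.count("1") + parity) % 2 == 0

block_a = "01101000"
p = parity_bit(block_a)                 # 1
print(is_consistent("01100000", p))     # False: single-bit corruption detected
print(is_consistent("01000000", p))     # True: two flipped bits go undetected
```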

Disk Failures: checksums
More parity bits decrease the probability of an undetected failure. With n independent parity bits, this probability is only 1/2^n. E.g., we can use eight parity bits: one for the first bit of every byte, a second one for the second bit of every byte, and so on. The chance of not detecting a disk failure is then only 1/256.

Disk Failures: stable storage
Sectors are paired, and information X is written to both sectors X_L and X_R. Assume that both copies are written with a sufficient number of parity bits so that bad sectors can be detected. Writing: if a sector is bad (according to its checksum), write to the alternative sector. Reading: alternately read X_L and X_R until a good value is returned. The probability of both X_L and X_R failing is very low.
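
A sketch of the stable-storage read policy; read_sector and checksum_ok are hypothetical stand-ins for the real sector read and the parity check described above:

```python
# Sketch: stable-storage read - alternate between the paired sectors X_L and
# X_R until one of them returns a value with a good checksum.

def stable_read(read_sector, checksum_ok, left="X_L", right="X_R", max_tries=100):
    for attempt in range(max_tries):
        sector = left if attempt % 2 == 0 else right
        data = read_sector(sector)
        if checksum_ok(data):
            return data
    raise IOError("both copies of the sector appear to be bad")

# Demo with trivially good stand-ins for the two helpers:
print(stable_read(lambda s: b"data", lambda d: True))   # b'data'
```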

Disk Failures: disk arrays
So far, we cannot recover from disk crashes. To address this problem, use Redundant Arrays of Independent Disks (RAID): arrangements of several disks that give the abstraction of a single, large disk. Goals: increase reliability (and performance). Redundant information allows reconstruction of the data if a disk fails; data striping improves disk performance.

Disk Failures: failure models for disks
What is the expected time until a disk crash? We assume a uniform distribution of failures over time. Mean time to failure: the time by which 50% of a population of disks have failed (crashed). A typical mean time to failure is 10 years. In this case, 5% of the disks crash in the first year, 5% in the second year, ..., 5% in the tenth year, ..., and 5% in the twentieth year.

Disk Failures: failure models for disks
Given the mean time to failure (mtf) in years, we can derive the probability p that a particular disk fails in a given year:
p = 1 / (2 * mtf)
Example: mtf = 10 years, so p = 1/20 = 5%.
Mean time to data loss: the time by which 50% of a population of disks have had a crash that resulted in data loss. The mean time to disk failure is not necessarily the same as the mean time to data loss.

Failure rate: the percentage of disks of a population that have failed up to a certain point in time. Survival rate: the percentage of disks of a population that have not failed up to a certain point in time. While it simplifies the analysis, the assumption of a uniform failure distribution is unrealistic: disks tend to fail early (undetected manufacturing defects) or late (wear-and-tear).

Disk Failures: failure models for disks
(Figure: realistic survival rate as a function of time.)

Disk Failures: mirroring
The data disk is copied onto a second disk, the mirror disk. When one of the disks crashes, we replace it by a new disk and copy the surviving disk to the new one. Data loss can only occur if the second disk crashes while the first one is being replaced; this probability is negligible. Mirroring is referred to as RAID level 1.

Disk Failures: parity blocks
Mirroring doubles the number of disks needed. The parity block approach needs only one redundant disk for n (arbitrarily many) data disks. On the redundant disk, the i-th block stores the parity checks for the i-th blocks of all n data disks.
(Figure: data disks A, B, C and parity disk P.)
The parity block approach is called RAID level 4.

Reading blocks is the same as without parity blocks. When writing a block on a data disk, we also need to update the corresponding block of the redundant disk. This can be done with four (three additional) disk I/Os: read the old value of the data block, read the corresponding block of the redundant (parity) disk, write the new data block, then recompute and write the new redundant block (see the sketch below).
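
A sketch of this write path, assuming blocks are modeled as equal-length byte strings; the new parity is old parity XOR old data XOR new data:

```python
# Sketch: the RAID 4 write path described above - four I/Os per block write.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid4_write(data_disk: list[bytes], parity_disk: list[bytes],
                block_no: int, new_block: bytes) -> None:
    old_data = data_disk[block_no]        # I/O 1: read old data block
    old_parity = parity_disk[block_no]    # I/O 2: read old parity block
    data_disk[block_no] = new_block       # I/O 3: write new data block
    parity_disk[block_no] = xor_blocks(xor_blocks(old_parity, old_data), new_block)
                                          # I/O 4: write recomputed parity block
```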

Disk Failures Parity blocks If one of the disks crashes, we bring in a new disk. The content of this disk can be computed, bit by bit, using the remaining n disks. No difference between data disks and parity disk. Computation based on the definition of parity, i.e. total number of ones is even. CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 71 Disk Failures n = 3 data disks Disk 1, block 1: 11110000 Disk 2, block 1: 10101010 Disk 3, block 1: 00111000 and one parity disk Disk 4, block 1: 01100010 Example Sum over each column is always an even number of 1 s Mod-2 sum can recover any missing single row CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 72

Disk Failures: example
Suppose we have:
Disk 1, block 1: 11110000
Disk 2, block 1: ????????
Disk 3, block 1: 00111000
Disk 4, block 1: 01100010 (parity)
Using mod-2 sums for block 1 over disks 1, 3 and 4, we recover block 1 of the failed disk 2:
Disk 2, block 1: 10101010

Disk Failures: RAID level 5
In the RAID 4 scheme, the parity disk is the bottleneck: on average, it receives n times as many writes as each data disk. However, the failure recovery method does not distinguish between the n + 1 disks. RAID level 5 therefore does not use a fixed parity disk; instead, block i of disk j is used as the redundant (parity) block if i mod (n + 1) = j.
(Figure: four disks A, B, C, D with the parity blocks rotating across them.)
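
A minimal sketch of the RAID 5 placement rule, assuming the n + 1 disks are numbered 0 to n, consistent with the i mod (n + 1) = j rule above:

```python
# Sketch: RAID 5 rotates the parity block across all disks instead of using a
# fixed parity disk, so no single disk becomes a write bottleneck.

def parity_disk_for_block(block_no: int, n: int) -> int:
    """Block block_no stores its parity on disk (block_no mod (n + 1))."""
    return block_no % (n + 1)

n = 3   # three data blocks per stripe, four disks in total
for i in range(8):
    print(f"block {i}: parity on disk {parity_disk_for_block(i, n)}")
# Parity rotates over disks 0, 1, 2, 3, 0, 1, ...
```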