CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

CSCI-GA.2433-001 Database Systems Lecture 8: Physical Schema: Storage Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

View 1 View 2 View 3 Conceptual Schema Physical Schema 1. Create a model of the enterprise (e.g., using ER) 2. Create a logical implementation (using a relational model and normalization) Disk What happens under the hood?

Requirements Analysis Conceptual Design Uses a file system to store the relations Requires knowledge of hardware and operating systems characteristics Logical Design Schema Refinement Application & Security Design Physical Design

First, Let s Look at a Typical Hierarchy Level Access time Typical size Registers "instantaneous" under 1KB Level 1 Cache 1-3 ns 64KB per core Level 2 Cache 3-10 ns 256KB per core Level 3 Cache 10-20 ns 2-20 MB per chip Main Memory 30-60 ns 4-32 GB per system Hard Disk 3,000,000-10,000,000 ns over 1TB In DB, we care about those two levels

File Structures Indexes Physical Design Criteria Storage media Performance Hard-disk drives Solid-state Disks

Disk Drives Platters Tracks Platter Sectors Track To access data: seek time: position head over the proper track rotational latency: wait for desired sector (RPM) transfer time: grab the data (one or more sectors)

Hard Disks spinning platter of special material mechanical arm with read/write head must be close to the platter to read/write data data is stored magnetically disks are random access meaning data can be read/written anywhere on the disk

A Conventional Hard Disk Structure

Hard Disk Architecture Surface = group of tracks Track = group of sectors Sector = group of bytes Cylinder: several tracks on corresponding surfaces

Disk Sectors and Access Each sector records Sector ID Data (512 bytes, 4096 bytes proposed) Error correcting code (ECC) Used to hide defects and recording errors Synchronization fields and gaps Access to a sector involves Queuing delay if other accesses are pending Seek: move the heads Rotational latency Data transfer Controller overhead

Example of a Real Disk: MKxx59GSM from Toshiba Size: 1TB 5,400 RPM 2.5 Number of platters: 3 Number of data heads: 6 Interface: SATA Transfer rate to host: 3GB/sec Average seek time: 12ms Track-to-Track seek: 2ms

Disks: Other Issues Average seek and rotation times are helped by locality. Disk performance improves about 10%/year Capacity increases about 60%/year Common disk interfaces/controllers: SCSI, IDE, SATA

The Disk: A View from the Top A sequence of blocks A physical unit of access is always a block All blocks are of the same size Actually, a block consists of 1 or more sectors A file: Logically: a series of records, of similar or different sizes Physically: a series of, no necessarily contiguous blocks The file system helps us to: Find the first block Find the last block Find the next block Find the previous block

Relation File Each tuple is a record in the file. Logical view Records There is one extra layer that we will discuss shortly! Physical view Assumptions: There can be several records in a block No record spans more than one block Blocks Sectors

Example Relation E# Salary 1 1200 3 2100 4 1800 2 1200 6 2300 9 1400 8 1900 Records 2 1200 4 1800 1 1200 3 2100 8 1900 Records First block of the file 2 1200 9 4 1800 1400 1 1200 3 2100 2 1200 8 1900 9 1400 Blocks 6 2300 4 1800 Left-over Space 9 1400 6 2300 6 2300 1 1200 3 2100 8 1900

Processing a Query: What Happens Under The Hood? SELECT E# FROM R WHERE SALARY > 1500; Read from disk into RAM all relevant blocks Get the relevant information from blocks What is the cost of this? Additional processing then produce the answer

A Simple Cost Model Assumptions Reading or Writing a block costs one time unit Processing is free Justification Accessing the disk is much more expensive than any reasonable CPU processing of queries Implication Goal: minimize number of block accesses Good Heuristic: Organize the physical database so that you make as much use as possible from any block you read/write

Why Not Store Everything in Main Memory?

Example What is the cost of accessing: E# 2 and E# 9? What is the cost of accessing: E# 2 and E# 4? Array in RAM 2 1200 4 1800 1 1200 3 2100 8 1900 9 1400 6 2300 Blocks on disk 9 1400 6 2300 2 1200 4 1800 1 1200 3 2100 8 1900

What Is The Best Place for the Next Block?

RAID Disk Array: Arrangement of several disks that gives abstraction of a single, large disk. Goals: Increase performance and reliability. Two main techniques: Data striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks. Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.

RAID Levels Level 0: No redundancy Level 1: Mirrored (two identical copies) Each disk has a mirror image Parallel reads, a write involves two disks. Maximum transfer rate = transfer rate of one disk Better reliability but no protection against data corruption of viruses

RAID Levels Level 0+1: Striping and Mirroring Parallel reads, a write involves two disks. Maximum transfer rate = aggregate bandwidth Level 1+0 is the same as 0+1 with reverse in mirroring and stripping.

RAID Levels Level 2: Bit-Interleaved Uses hamming code for error correction Can recover data from single-bit corruption Out of fashion! For Error correction (Hamming code)

RAID Levels Level 3: Byte-Interleaved with Parity Striping Unit: One byte. One check disk. Each read and write request involves all disks; disk array can process one request at a time.

RAID Levels Level 4: Block-Interleaved Parity Striping Unit: One disk block. One check disk. Parallel reads possible for small requests, large requests can utilize full bandwidth Writes involve modified block and check disk

RAID Levels Level 5: Block-Interleaved Distributed Parity Similar to RAID Level 4, but parity blocks are distributed over all disks Level 6 is similar to 5 but with extra parity block

What Is This Story of Solid State Disks (SSD)? No moving parts hence called solid state reads and writes to a medium called NAND flash memory Faster startup: no spinning Extremely low read latency Deterministic: performance does not depend on the location of the data

BUT Much more expensive than hard disks (~3$/GB vs. 0.15$/GB) Limited write erase time Slower write speeds High capacity SSDs may have significant higher power requirements SSD can get slower as it ages

SSD Organization SSD has its own page SSD pages are getting larger: 8KB, 16KB, 32KB Sector addressable Most SSD uses 4KB sectors

Comparison of typical hard-drives and SSD More up-to-date- SSD: Samsung SSD 840 Pro Series: 256 GB Read XFR Rate: ~510 MB/s Write XFR Rate: ~500 MB/s

Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB Application DBMS OS DISK Controller Important: Tow flavors of DBMS Relying on OS file systems Uses its own or extend OS Data must be in RAM for DBMS to operate on it!

Disk Space Manager Abstractly: deals with pages as a unit of data (read, write, allocate, deallocate) Page size usually chosen as disk block Keeps track of: Which blocks are in use List of bitmap Which pages are on which page blocks Hides the details of the hardware and OS and makes higher level layers think in term of pages

Buffer Manager Memory may not be able to hold all needed pages the software layer responsible for bringing pages from disk to main memory as needed Implements replacement policy Higher levels of the DBMS code can be written without worrying about whether data pages are in memory or not.

Buffer Manager It does its job as follows: Partitions the memory into frames (1 frame holds 1 page) The collection of all frames/pages is called buffer pool Page Requests from Higher Levels BUFFER POOL disk page free frame MAIN MEMORY DISK DB choice of frame dictated by replacement policy

Hierarchy of Data Tuples/Records Relations Files Pages Blocks Sectors

We can try to minimize the number of block accesses for frequent queries. Indexing Oversimplification: Tries to provide where blocks containing useful records are File Organization Oversimplification: Tries to provide When you read a block you get many useful records

Conclusions From the application to the disk, each layer sees the data differently Disks provide cheap, non-volatile storage. Performance and cost are the main issues