Chapter 11: Storage and File Structure

  Overview of Storage Media
  Magnetic Disks
    Characteristics
    RAID
  Database Buffers
  Structure of Records
  Organizing Records within Files
  Data-Dictionary Storage

Classifying Physical Storage Media

  Speed
    Initial delay (latency)
    Sustained rate
  Cost per bit
  Reliability
    Data loss on power failure or system crash
    Physical failure, occasional or permanent
  Volatile/Non-volatile
    Volatile: loses contents when power goes off

Storage Hierarchy

  [Figure: storage hierarchy below the processor. Direction labels: "Faster" and "Lower cost/bit, more capacity". Level annotations: Volatile, Sometimes portable, Portable.]

Storage Hierarchy (Cont.)

  Primary storage
    E.g. cache, main memory (semiconductor RAM)
    Fastest, but volatile and expensive
    For programs and files in recent active use
  Secondary storage
    E.g. on-line magnetic hard disk
    Default location for programs and data/databases
  Tertiary storage
    E.g. removable magnetic tape, optical disk
    Slowest but cheapest
    For archives and back-ups

Magnetic Hard Disk Mechanism

  A sector is the smallest addressable unit (~512 bytes)
  To access a sector:
    1. Move the head to the correct track (seek time)
    2. Spin the disk to the right angular position (rotational latency)
  NOTE: The diagram is schematic and simplifies the structure of actual disk drives

Performance Measures of Disks

  Magnetic disk access is ~1 million times slower than main-memory access
  Access time = seek time + rotational latency
    Initial delay from request to first bit of data
  Seek time
    Average seek time is ~1/2 the worst-case seek time
    8 to 20 ms on typical disks
  Rotational latency
    Average latency is ~1/2 of the worst-case latency (one full rotation)
    A full rotation takes 4 to 11 ms on typical disks (15000 down to 5400 r.p.m.), so the average is about 2 to 5.6 ms
  Data-transfer rate (bandwidth)
    From first byte to last byte of one block request
    25 to 100 MB/s max, lower for inner tracks
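The access-time formula above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the sample seek time and block size in the usage note are illustrative values, not measurements of any particular drive.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one full rotation, in milliseconds."""
    full_rotation_ms = 60_000.0 / rpm  # 60,000 ms per minute / rotations per minute
    return full_rotation_ms / 2.0


def access_time_ms(avg_seek_ms: float, rpm: float) -> float:
    """Access time = seek time + rotational latency (both taken as averages)."""
    return avg_seek_ms + avg_rotational_latency_ms(rpm)


def transfer_time_ms(block_bytes: int, rate_mb_per_s: float) -> float:
    """Time to stream one block once the head is positioned (1 MB = 10**6 bytes)."""
    return block_bytes / (rate_mb_per_s * 1_000_000) * 1000.0
```

For example, avg_rotational_latency_ms(15000) evaluates to 2.0, and access_time_ms(8, 15000) to 10.0. Transferring an assumed 4096-byte block at 100 MB/s adds only about 0.04 ms, which is why reducing the number of accesses matters more than raw bandwidth for small requests.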
Optimization of Disk-Block Access

  Minimize the number of disk accesses
    Larger blocks?
    Arrange file contents by blocks
  Minimize physical distance traveled
    Cluster blocks together
    Reorder the sequence of disk accesses
  Use fast caches and buffers to eliminate or parallelize disk accesses

RAID for Reliability and Speed

  Redundant Arrays of Independent Disks
  With N disks, the failure rate increases by a factor of N
  Reliability through redundancy
    Mirroring (or shadowing): duplicate set of disks
      Tolerates a complete failure of one disk per pair
    Parity bit: detects single-bit errors, and corrects them when the failed disk is known
  Performance through parallelism
    Can complete one request more quickly
    Can sometimes do multiple requests in parallel
    Bit-level striping: spread the bits of each byte across 8 disks
    Block-level striping: spread consecutive blocks across N disks

RAID Levels 1 and 5 (Most Popular)

  Level 1: Mirrored disks with block striping
    Offers best write performance
    Popular for applications such as storing log files in a database system
  Level 5: Block-Interleaved Distributed Parity
    Partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk

Choice of RAID Level

  Decision factors
    Monetary cost
    Throughput and bandwidth of normal operation
    Performance during failure
    Performance during rebuild of failed disk
  Level 1 has better write performance than level 5
    Level 5 requires at least 2 block reads and 2 block writes to write a single block; level 1 requires only 2 block writes
  Level 5 is preferred for applications with a low update rate and large amounts of data
  Level 1 is preferred for all other applications

Database Buffer

  The DBMS uses part of main memory as a disk cache (the buffer)
  When a block is needed:
    1. If the block is already in the buffer, get it from the buffer instead of from disk (cache hit)
    2. If the block is not in the buffer (cache miss), allocate space in the buffer for the block
         A. Replace some block already in the buffer to make space for the new block
         B. The replaced block is first written back to disk if it was modified since it was buffered (write-back policy)

Buffer-Replacement Policies

  Most operating systems replace the block least recently used (LRU strategy)
  LRU isn't always a good strategy
    Example: nested join
      for each tuple t_r of r do
        for each tuple t_s of s do
          if the tuples t_r and t_s match
    Each block of s is needed again only on the next pass over s, so when the buffer is smaller than s, LRU tends to evict exactly the block that will be needed soonest
  Better strategy: a query optimizer provides hints on the replacement strategy
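To see why LRU can behave badly on the nested-join access pattern, the sketch below counts buffer misses for repeated scans of the inner relation under two policies. Everything here is an assumption for illustration: the function name count_misses, the block labels, the buffer size of 3 frames, and the choice of "most recently used" (MRU) as the alternative policy. It is a toy simulation, not how any particular DBMS buffer manager is implemented.

```python
from collections import OrderedDict


def count_misses(accesses, frames, policy):
    """Count buffer misses for a sequence of block accesses.

    policy is "LRU" (evict least recently used) or "MRU" (evict most recently used).
    """
    buffer = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in accesses:
        if block in buffer:
            buffer.move_to_end(block)  # cache hit: now the most recently used
            continue
        misses += 1
        if len(buffer) >= frames:
            # last=False pops the LRU entry, last=True pops the MRU entry
            buffer.popitem(last=(policy == "MRU"))
        buffer[block] = True
    return misses


# Hypothetical workload: relation s occupies 4 blocks and is scanned 3 times,
# once per outer tuple, with only 3 buffer frames available.
s_blocks = ["s0", "s1", "s2", "s3"]
accesses = s_blocks * 3
lru_misses = count_misses(accesses, frames=3, policy="LRU")
mru_misses = count_misses(accesses, frames=3, policy="MRU")
```

With this workload, lru_misses is 12 (every access misses) while mru_misses is 6. The gap is the kind of information a query optimizer hint can exploit, since the optimizer knows the scan will repeat.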
File Organization

  A database is stored as a collection of files
    Each file is a sequence of records
    A record is a sequence of fields
  Two approaches
    Fixed-length records, homogeneous files
      Each file has records of one particular type only
      Each table has a separate file
    Variable-length records, heterogeneous files
      Variable-length fields (VARCHAR)
      Records from multiple tables in one file

Case 1: Fixed-Length Records

  Store record i starting from byte n × (i − 1), where n is the size of each record
  But what if we need to delete record i? 3 options (here n is the number of the last record):
    Move records i + 1, ..., n to i, ..., n − 1
    Move record n to i
    Do not move records, but link all free records on a free list

Free Lists

  Store the address of the first deleted record in the file header
  Use this first record to point to the second deleted record, and so on
  The figure shows a pointer field in each record, but there is a way to eliminate this. How?
  Q: How do we insert a record? How do we then update the free list?

Case 2: Variable-Length Fields

  Variable-length fields can be represented by a pair (offset, length)
    offset is the location within the record
    length is the field length
  All fields start at a predefined location, but extra indirection is required for variable-length fields
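The (offset, length) representation can be made concrete with a small encoder and decoder. This is a sketch under stated assumptions: the exact byte layout (two pairs of unsigned 16-bit integers followed by a 64-bit float), the choice of which fields are variable-length, and the function names encode_account and decode_account are all hypothetical.

```python
import struct

# Assumed layout: (offset, length) for account_number, (offset, length) for
# branch_name, then balance as a 64-bit float. "<" means no padding, so the
# fixed part is 2 + 2 + 2 + 2 + 8 = 16 bytes. Variable-length bytes follow it.
FIXED_PART = struct.Struct("<HHHHd")


def encode_account(account_number: str, branch_name: str, balance: float) -> bytes:
    acct = account_number.encode("utf-8")
    branch = branch_name.encode("utf-8")
    acct_offset = FIXED_PART.size              # first variable field follows the fixed part
    branch_offset = acct_offset + len(acct)    # second one follows the first
    fixed = FIXED_PART.pack(acct_offset, len(acct), branch_offset, len(branch), balance)
    return fixed + acct + branch


def decode_account(record: bytes):
    acct_off, acct_len, branch_off, branch_len, balance = FIXED_PART.unpack_from(record, 0)
    # The extra indirection: follow each (offset, length) pair into the record.
    account_number = record[acct_off:acct_off + acct_len].decode("utf-8")
    branch_name = record[branch_off:branch_off + branch_len].decode("utf-8")
    return account_number, branch_name, balance
```

Every field is reachable from a predefined position in the fixed part, and only the variable-length fields cost one additional lookup.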
  [Figure: example record structure of an account record. Recoverable labels: account_number = A-102, balance = 400, branch_name = Perryridge, plus an "offset" entry and the value 10.]

Slotted Page Structure for Mixed Files

  Header contains:
    Number of record entries
    End of free space in the block
    Location and size of each record
  Records can be moved around within a page to keep them contiguous, with no empty space between them; the entry in the header must be updated

Organizing Records within a File

  Sequential: store records in sequential order, based on the value of the search key of each record
  Heap: a record can be placed anywhere in the file where there is space
  Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed
  Multitable clustering: stores related records from several different relations adjacent to one another
    Like doing part of the work of a JOIN ahead of time
    More in Chapter 12
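The slotted-page header described above (entry count, end of free space, location and size of each record) can be sketched as a small class. This is a toy model with assumptions of its own: the class name SlottedPage is hypothetical, the header is kept in Python attributes rather than serialized into the page bytes, and records are packed from the end of the block toward the front.

```python
class SlottedPage:
    """Toy slotted page: records are packed at the end of the block."""

    def __init__(self, size: int = 4096):
        self.data = bytearray(size)
        self.free_end = size   # header field: end of free space in the block
        self.slots = []        # header field: (offset, length) per record, None if deleted

    def insert(self, record: bytes) -> int:
        """Store the record at the end of free space and return its slot number."""
        if len(record) > self.free_end:
            raise ValueError("not enough free space (header size is ignored in this toy)")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        """Delete a record and compact the page so records stay contiguous."""
        offset, length = self.slots[slot]
        # Records stored at lower addresses slide up by `length` bytes to close the gap.
        self.data[self.free_end + length:offset + length] = self.data[self.free_end:offset]
        self.free_end += length
        self.slots[slot] = None
        # Moved records get their header entries updated; slot numbers never change.
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0] < offset:
                self.slots[i] = (entry[0] + length, entry[1])
```

Because outside references use the slot number rather than a byte address, records can move inside the page without invalidating those references. Only the header entry changes.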
Sequential File Organization

  Suitable for applications that require sequential processing of the entire file
  The records in the file are ordered by a search key

Maintaining Sequential File Organization

  Deletion: use pointer chains
  Insertion: locate the position where the record is to be inserted
    If there is free space, insert there
    If there is no free space, insert the record in an overflow block
    In either case, the pointer chain must be updated
  Need to reorganize the file from time to time to restore sequential order

Multitable Clustering File Organization

  Store several relations in one file using a multitable clustering file organization

Multitable Clustering File Organization (Cont.)

  [Figure: multitable clustering of customer and depositor, with the source tables labeled Depositor and Customer.]
  Good for queries involving depositor ⋈ customer, and for queries on one single customer and that customer's accounts
  Bad for queries involving only customer
  Can add pointer chains to link records of a particular relation

Data Dictionary (aka System Catalog)

  Stores metadata, that is, data about data, such as:
  Logical schema
    Names of relations
    Names and types of attributes of each relation
    Names and definitions of views
    Integrity constraints
  Physical schema
    Physical location of relation
    How relation is organized (sequential/hash/…)
    Information about indices (Chapter 12)
  User and accounting information
  Statistical and descriptive data

Data Dictionary Storage (Cont.)

  Catalog structure
    Relational representation on disk, or
    Specialized data structures designed for efficient access
    Q: What does MySQL use?
  A possible catalog representation:
    Relation_metadata = (relation_name, number_of_attributes, storage_organization, location)
    Attribute_metadata = (attribute_name, relation_name, domain_type, position, length)
    User_metadata = (user_name, encrypted_password, group)
    Index_metadata = (index_name, relation_name, index_type, index_attributes)
    View_metadata = (view_name, definition)
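The "relational representation" option means the catalog is itself stored as ordinary tables. The sketch below builds two of the catalog relations listed above in an in-memory SQLite database. The column types, the primary keys, the sample account relation, its file name account.dat and the helper describe are all assumptions made for illustration; the slides specify only the attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Relation_metadata (
    relation_name        TEXT PRIMARY KEY,
    number_of_attributes INTEGER,
    storage_organization TEXT,
    location             TEXT
);
CREATE TABLE Attribute_metadata (
    attribute_name TEXT,
    relation_name  TEXT REFERENCES Relation_metadata(relation_name),
    domain_type    TEXT,
    position       INTEGER,
    length         INTEGER,
    PRIMARY KEY (attribute_name, relation_name)
);
""")

# Hypothetical catalog entries describing an account relation.
conn.execute(
    "INSERT INTO Relation_metadata VALUES (?, ?, ?, ?)",
    ("account", 3, "sequential", "account.dat"),
)
conn.executemany(
    "INSERT INTO Attribute_metadata VALUES (?, ?, ?, ?, ?)",
    [
        ("balance", "account", "numeric", 3, 8),
        ("account_number", "account", "char", 1, 10),
        ("branch_name", "account", "varchar", 2, 20),
    ],
)


def describe(conn, relation_name):
    """Return (attribute_name, domain_type, length) rows in declared order."""
    rows = conn.execute(
        "SELECT attribute_name, domain_type, length FROM Attribute_metadata "
        "WHERE relation_name = ? ORDER BY position",
        (relation_name,),
    )
    return rows.fetchall()
```

Calling describe(conn, "account") returns the three attributes ordered by position, which is roughly the lookup a system performs before it can interpret the bytes of a stored record.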
Figure 11.4 RAID Levels

Figure 11.100 More recent RAID Levels

Different Ways to Handle Deleting Record 2

  Shift all later records to fill the gap
  Move last record to fill the gap

Clustering File Structure With Pointer Chains

Byte-String Representation of Variable-Length Records

  Non-1NF schema: ( branch_name, {account(s)} )
  Attach an end-of-record control character to the end of each record
    Difficulty with deletion
    Difficulty with growth

Byte-String Representation (Cont.)

  Allow two kinds of block in the file:
    Anchor block: contains the first records of chains
    Overflow block: contains records other than those that are the first records of chains
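The two deletion figures above both move records. The third option for fixed-length records, a free list, leaves records in place. The sketch below is one way it could work, under several assumptions: the class name FixedLengthFile is hypothetical, a Python list stands in for the file, and the free-list pointer is stored in the deleted slot itself, which is one common way to avoid a separate pointer field in every record.

```python
class FixedLengthFile:
    """In-memory stand-in for a file of fixed-length records with a free list."""

    NIL = -1  # marks the end of the free list

    def __init__(self):
        self.slots = []            # slot i holds a record, or a free-list pointer
        self.is_free = []          # bookkeeping for this demo only
        self.free_head = self.NIL  # "file header": address of the first deleted record

    def insert(self, record):
        """Reuse the first free slot if there is one, otherwise append."""
        if self.free_head != self.NIL:
            i = self.free_head
            self.free_head = self.slots[i]  # header now points at the next free slot
            self.slots[i] = record
            self.is_free[i] = False
        else:
            i = len(self.slots)
            self.slots.append(record)
            self.is_free.append(False)
        return i

    def delete(self, i):
        """Push slot i onto the front of the free list without moving any record."""
        self.slots[i] = self.free_head  # the freed slot stores the old head pointer
        self.is_free[i] = True
        self.free_head = i

    def records(self):
        return [r for r, free in zip(self.slots, self.is_free) if not free]
```

Insertion pops the head of the free list and deletion pushes onto it, so both take constant time and no other record changes position.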