Lecture II: Storage Layer. Kyumars Sheykh Esmaili
Course Syllabus
Core Topics: Storage Layer; Query Processing and Optimization; Transaction Management and Recovery
Advanced Topics: Cloud Computing and Web Databases; Parallel Databases and MapReduce; Distributed Databases; Data Stream Management Systems; Security in Databases
Outline: DBMS Architecture; Storage Systems; Storage Management
DBMS Architecture
Database's Main Job
What? Input: SQL statement. Output: {tuples}
How?
1. Translate SQL into a set of get/put requests to backend storage
2. Extract, process, and transform tuples from blocks
End-to-End Query Processing
SQL -> {tuples}
Compiler pipeline: Parser -> QGM -> Rewrite -> QGM -> Optimizer -> QGM++ -> CodeGen -> Plan -> Interpreter (Runtime System)
Parser
Generates a relational algebra (RA) tree for each sub-query and constructs a graph of trees: the Query Graph Model (QGM)
- nodes are subqueries
- edges represent relationships between subqueries
Extended RA, because SQL offers more than RA: GROUP BY, ORDER BY, DISTINCT
The parser needs schema information. Why?
SQL => RA - Example
SQL: select Title from Professor, Lecture where Name = 'Popper' and Date = 1979
RA tree: π Title ( σ Name = 'Popper' and Date = 1979 ( Professor × Lecture ) )
Query Rewrite
There are many equivalent query plans; finding the right one can dramatically impact performance
Query Optimization
Mainly based on statistics. There are many, many techniques; they will be discussed in a separate lecture
Query Execution
Once the final query plan is identified, execution is rather straightforward: code is generated based on the plan, and the interpreter produces the output
Components of a DB System
(Architecture diagram.) Users: a naive user works through an application, an expert user issues ad-hoc queries, an app developer uses the compilers, and the DB admin uses management tools. The DBMS contains the DML and DDL compilers, the query processor/optimizer, transaction (TA) management, recovery, and the runtime, all sitting on the storage manager. The underlying storage system holds the schema, logs, indexes, and the DB catalogue.
Storage Systems
Memory Hierarchy
Fast but expensive and small memory sits close to the CPU; larger, slower memory sits at the periphery. We'll try to hide latency by using the fast memory as a cache.
Magnetic Disks
A stepper motor positions an array of disk heads on the requested track. The platters (disks) rotate steadily. Disks are managed in blocks: the system reads/writes data one block at a time.
Access Time
A magnetic disk's design has implications for the access time to read/write a given block:
- move the disk arm to the desired track (seek time t_s)
- wait for the desired block to rotate under the disk head (rotational delay t_r)
- read/write the data (transfer time t_tr)
Access time: t = t_s + t_r + t_tr
Access Time - Example
Notebook drive: Hitachi Travelstar 7K200
- rotational speed: 7,200 rpm
- average seek time: 10 ms
- transfer rate: 50 MB/s
- 512 bytes per sector, 63 sectors per track
- track-to-track seek time: 1 ms
What is the access time to read an 8 KB data block? What about 1,000 blocks of size 8 KB, read randomly vs. sequentially?
Disks: Sequential vs. Random I/O
Random access: t_rnd = 1000 * t = 1000 * (t_s + t_r + t_tr) = 1000 * (10 + 4.17 + 0.16) ms = 1000 * 14.33 ms = 14,330 ms
Sequential access: t_seq = t_s + t_r + 1000 * t_tr + (number of track switches) * (track-to-track seek time) = 10 ms + 4.17 ms + 1000 * 0.16 ms + (16 * 1000)/63 * 1 ms = 10 ms + 4.17 ms + 160 ms + 254 ms ≈ 428 ms
Algorithms need to take this gap into account!
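The arithmetic above can be reproduced with a short script (a sketch using the Travelstar parameters from the example slide; it takes 8 KB = 8,192 bytes, so the results differ by a few milliseconds from the rounded slide numbers):

```python
# Drive parameters from the Hitachi Travelstar 7K200 example.
RPM = 7200
SEEK_MS = 10.0                          # average seek time
TRANSFER_BYTES_PER_MS = 50e6 / 1000     # 50 MB/s
SECTOR_BYTES = 512
SECTORS_PER_TRACK = 63
TRACK_SWITCH_MS = 1.0                   # track-to-track seek time

BLOCK_BYTES = 8 * 1024
N_BLOCKS = 1000

rot_delay_ms = 0.5 * 60_000 / RPM       # half a rotation on average
transfer_ms = BLOCK_BYTES / TRANSFER_BYTES_PER_MS

# Random: pay seek + rotational delay for every single block.
t_rnd = N_BLOCKS * (SEEK_MS + rot_delay_ms + transfer_ms)

# Sequential: one seek + one rotational delay, then pure transfer,
# plus a 1 ms track-to-track seek per crossed track boundary.
total_sectors = (BLOCK_BYTES // SECTOR_BYTES) * N_BLOCKS
track_switches = total_sectors / SECTORS_PER_TRACK
t_seq = (SEEK_MS + rot_delay_ms + N_BLOCKS * transfer_ms
         + track_switches * TRACK_SWITCH_MS)

print(f"random:     {t_rnd:10.0f} ms")  # roughly 14.3 seconds
print(f"sequential: {t_seq:10.0f} ms")  # roughly 0.43 seconds
```

The two orders of magnitude between the results are exactly the gap the slide warns about.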
Performance Tricks
Track skewing: offset sector 0 of adjacent tracks so that, after a track-to-track seek during a sequential scan, the head arrives just in time for the next sector (no extra rotational delay)
Request scheduling: choose the pending request that requires the smallest arm movement
Zoning: divide outer tracks into more sectors than inner ones (outer tracks are physically longer)
Evolution of Hard Disk Technology
Disk latencies have only marginally improved over the last years (about 10% per year). But throughput (i.e., transfer rates) improves by about 50% per year, and hard disk capacity grows by about 50% every year. Therefore, random access cost hurts even more as time progresses.
Ways to Improve I/O Performance
The latency penalty is hard to avoid. But throughput can be increased rather easily by exploiting parallelism. Idea: use multiple disks and access them in parallel. RAID: Redundant Array of Inexpensive Disks.
Disk Mirroring
Replicate the same data onto multiple disks. I/O parallelism only for reads. Also known as RAID 1 (mirroring without parity). Failure risk?
Disk Striping
Distribute the data over multiple disks. Full I/O parallelism. Also known as RAID 0 (striping without parity). Failure risk?
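The striping layout can be expressed as a simple mapping from a logical block number to a (disk, offset) pair (a sketch, assuming round-robin striping over N disks):

```python
def stripe(block_no: int, n_disks: int) -> tuple[int, int]:
    """Round-robin striping: logical block -> (disk number, block offset on that disk)."""
    return block_no % n_disks, block_no // n_disks

# Four consecutive logical blocks over two disks alternate between them,
# so a sequential scan can read from both disks in parallel.
layout = [stripe(b, 2) for b in range(4)]
assert layout == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Because consecutive blocks land on different disks, both reads and writes parallelize; the price is that losing any one disk loses the whole array.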
Disk Striping with Parity
Distribute data and parity information over the disks. High I/O parallelism. Also known as RAID 5 (striping with distributed parity). Failure risk?
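The parity idea behind RAID 5 can be illustrated with XOR: the parity block of a stripe is the byte-wise XOR of its data blocks, so any single lost block can be reconstructed from the remaining ones (a minimal sketch, not an actual RAID implementation):

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR same-sized blocks byte-wise to obtain the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

# A stripe of three data blocks (normally one fixed-size block per disk).
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(stripe)

# The disk holding stripe[1] fails: its block is recoverable as the
# XOR of the surviving data blocks and the parity block.
recovered = parity([stripe[0], stripe[2], p])
assert recovered == stripe[1]
```

This is why RAID 5 survives exactly one disk failure: with two blocks of a stripe missing, the XOR equation no longer has a unique solution.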
Solid-State Disks
Solid-state disks (SSDs) have emerged as an alternative to conventional hard disks: faster random reads, but slower random writes (pages have to be erased before they can be rewritten). Once erased, sequential writes are almost as fast as reads. Adapting databases to these characteristics is a current research topic.
Network-Based Storage
The network is not a bottleneck any more. Disk bandwidths: hard disk 50-100 MB/s, Serial ATA 375 MB/s. Network bandwidths: 10-gigabit Ethernet 1,250 MB/s, InfiniBand QDR 12,000 MB/s. Why not use the network for database storage?
Grid or Cloud Storage
Some big enterprises (e.g., Google, Amazon) run clusters with thousands of commodity PCs: spare CPU cycles and disk space can be sold as a service, and massive replication is used for data storage.
Amazon's Elastic Compute Cloud (EC2): use Amazon's compute cluster by the hour (about 10 ¢ per hour).
Amazon's Simple Storage Service (S3): "infinite" store for objects between 1 byte and 5 GB in size.
Components of a DB System (recap)
(Architecture diagram, shown again.) Users: a naive user works through an application, an expert user issues ad-hoc queries, an app developer uses the compilers, and the DB admin uses management tools. The DBMS contains the DML and DDL compilers, the query processor/optimizer, transaction (TA) management, recovery, and the runtime, all sitting on the storage manager. The underlying storage system holds the schema, logs, indexes, and the DB catalogue.
Storage Manager
Interface to the storage system:
- buffer management
- handles the storage hierarchy
- data management (files and blocks)
- outsmarts the OS (Oracle, Google, etc. implement their own file systems)
- keeps track of recovery logs
Buffer Manager
The buffer manager mediates between external storage and main memory; it manages a designated main memory area, the buffer pool, for this task. Disk pages are brought into memory as needed, and a replacement policy decides which page to evict when the buffer is full.
Replacement Policies
The effectiveness of the buffer manager's caching depends on the replacement policy it uses, e.g., Least Recently Used (LRU), LRU-k, Most Recently Used (MRU), Random. What could be the rationale behind each of these strategies?
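A minimal sketch of a buffer pool with LRU replacement (the names `BufferPool` and `read_page` are illustrative, not from any particular DBMS):

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU buffer pool: page_id -> page contents, evicting the
    least recently used page when the pool is full."""
    def __init__(self, capacity: int, read_page):
        self.capacity = capacity
        self.read_page = read_page    # callback that fetches a page from disk
        self.pool = OrderedDict()     # insertion order doubles as recency order
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pool:
            self.hits += 1
            self.pool.move_to_end(page_id)     # mark as most recently used
        else:
            self.misses += 1
            if len(self.pool) >= self.capacity:
                self.pool.popitem(last=False)  # evict the least recently used
            self.pool[page_id] = self.read_page(page_id)
        return self.pool[page_id]

pool = BufferPool(capacity=2, read_page=lambda pid: f"page-{pid}")
for pid in [1, 2, 1, 3, 1]:   # page 2 is evicted when page 3 comes in
    pool.get(pid)
```

LRU bets that recently used pages will be used again soon; the other policies on this slide make different bets, e.g. MRU targets cyclic scans (see the access-pattern slide below).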
Data Manager
Maps records to pages and implements record identifiers (RIDs). Implementation of indexes (B+ trees, etc.). Free space management (various schemes).
Database = { files }
A file = a variable-sized sequence of blocks; a block is the unit of transfer to disk. A page = a fixed-sized sequence of blocks; a page contains records or index entries. Typical page size: 8 KB. The page is the logical unit of transfer and the unit of buffering; blocks of the same page are prefetched and stored on the same track on disk.
Heap Files
The most important file type in a database: a linked list of pages that stores records in no particular order (in line with, e.g., SQL's unordered tables). Problems?
Heap Files
Alternative: a directory of pages, used as a space map with information about which pages have free space.
Free Space Management
Find a page for a new record. Many different heuristics are conceivable, all based on a list of pages with free space:
- Append Only: try to insert into the last page of the free-space list; if there is no room in the last page, create a new page
- Best Fit: scan the list and find the page with the least free space that still fits
- First Fit / Next Fit: scan the list and find the first / next page that fits
Advantages and disadvantages?
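The heuristics can be sketched over a list of free-byte counts per page (a toy model; real systems track this in a space map, and the 8 KB default page size is taken from the earlier slide):

```python
def first_fit(free, need):
    """Return the index of the first page with enough free space, else None."""
    for i, space in enumerate(free):
        if space >= need:
            return i
    return None

def best_fit(free, need):
    """Return the index of the fitting page that leaves the least slack."""
    candidates = [i for i, space in enumerate(free) if space >= need]
    return min(candidates, key=lambda i: free[i]) if candidates else None

def append_only(free, need, page_size=8192):
    """Use the last page if it fits, otherwise append a fresh page."""
    if not free or free[-1] < need:
        free.append(page_size)
    return len(free) - 1

free = [100, 4000, 800, 2500]          # free bytes per page
assert first_fit(free, 700) == 1       # first page with >= 700 bytes free
assert best_fit(free, 700) == 2        # page 2 leaves the least slack
```

The trade-off: append-only is cheapest but wastes space; best fit packs tightly but scans the whole list on every insert.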
Inside a Page
Record identifier (RID): <pageno, slotno>; indexes use RIDs to reference records. Record position (within the page): slotno × bytes per slot.
Inside a Page - Variable-Sized Fields
Variable-sized fields are moved to the end of each record. A slot directory points to the start of each record. Create a forward address if a record no longer fits on its page.
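A slotted page with a slot directory can be sketched as follows (a simplification: slot entries here are (offset, length) pairs, and forwarding and free-space compaction are left out):

```python
PAGE_SIZE = 8192

class SlottedPage:
    """Toy slotted page: records are appended to the data area; the slot
    directory maps slot numbers to (offset, length) within the page."""
    def __init__(self):
        self.data = bytearray()
        self.slots = []               # slotno -> (offset, length)

    def insert(self, record: bytes):
        if len(self.data) + len(record) > PAGE_SIZE:
            return None               # caller must pick another page
        offset = len(self.data)
        self.data += record
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1    # the record's slot number

    def fetch(self, slotno: int) -> bytes:
        offset, length = self.slots[slotno]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
rid = (7, page.insert(b"Popper|1979"))    # RID = <pageno, slotno>
assert page.fetch(rid[1]) == b"Popper|1979"
```

The indirection matters: records can move within the page (e.g., during compaction) without invalidating RIDs stored in indexes, since only the slot entry changes.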
DBMS vs. OS
Buffer management and data management look very much like virtual memory and file management in operating systems. But a DBMS may be much more aware of the access patterns of certain operators (-> prefetching); concurrency control often calls for a defined order of write operations; and technical reasons may make OS tools unsuitable for a database (e.g., file size limitations, platform independence).
Access Patterns of Databases
- Sequential (table scan): P1, P2, P3, P4, P5, ...
- Hierarchical (index navigation): P1, P4, P11; P1, P4, P12; P1, P3, P8; P1, P2, P7; P1, P3, P9; ...
- Random (index lookup): P13, P27, P3, P43, P15, ...
- Cyclic (nested-loops join): P1, P2, P3, P4, P5; P1, P2, P3, P4, P5; P1, P2, P3, P4, P5; ...
DBMS vs. OS
In fact, databases and operating systems sometimes interfere: the operating system and the buffer manager effectively buffer the same data twice, and things get really bad if parts of the DBMS buffer get swapped out to disk by the OS VM manager. Therefore, databases try to turn off OS functionality as much as possible (e.g., raw disk access instead of OS files).