2.1.2 Accelerating Disk-I/O


Module 2: Storing Data: Disks and Files

Module Outline
2.1 Memory hierarchy
2.2 Disk space management
2.3 Buffer manager
2.4 File and record organization
2.5 Page formats
2.6 Record formats
2.7 Addressing schemes

(Figure: DBMS architecture overview: Web Forms, Applications, SQL Interface; Query Processor with Parser, Optimizer, Plan Executor, Operator Evaluator; Files and Index Structures; Buffer Manager; Disk Space Manager; Transaction, Lock, Recovery and Concurrency Control components; System Catalog, Index Files, Data Files; Database.)

2.1 Memory hierarchy

Memory in off-the-shelf computer systems is arranged in a hierarchy:

    CPU, CPU cache (L1, L2), main memory (RAM)   primary storage
    Magnetic disk                                secondary storage
    Tape, CD-ROM, DVD                            tertiary storage

- The cost of primary memory is roughly 100 times the cost of secondary storage space of the same size.
- The size of the address space in primary memory (e.g., 2^32 bytes = 4 GB) may not be sufficient to map the whole database (we might even have 2^32 records).
- The DBMS needs to make data persistent across DBMS (or host) shutdowns or crashes; only secondary/tertiary storage is non-volatile.
- The DBMS needs to bring in data from the lower levels of the memory hierarchy as needed for processing.

Magnetic disks

Tapes store vast amounts of data (around 100 GB; more for robotic tape farms), but they are sequential devices. Magnetic disks (hard disks) allow direct access to any desired location; hard disks dominate database system scenarios by far.

(Figure: disk anatomy: disk arm, arm movement, disk heads, rotation, track, cylinder, platter.)

1 Data on a hard disk is arranged in concentric rings (tracks) on one or more platters,
2 tracks can be recorded on one or both surfaces of a platter,
3 the set of tracks with the same diameter forms a cylinder,
4 an array (disk arm) of disk heads, one per recorded surface, is moved as a unit,
5 a stepper motor moves the disk heads from track to track; the platters rotate steadily.

(Figure: track, sector, block.)

1 Each track is divided into arc-shaped sectors (a characteristic of the disk's hardware),
2 data is written to and read from disk block by block (the block size is set to a multiple of the sector size when the disk is formatted),
3 typical disk block sizes are 4 KB or 8 KB.

Data blocks can only be written and read if disk heads and platters are positioned accordingly. This has implications on the disk access time:

1 The disk heads have to be moved to the desired track (seek time),
2 the disk controller waits for the desired block to rotate under the disk head (rotational delay),
3 the disk block data has to be actually written/read (transfer time).

    access time = seek time + rotational delay + transfer time

Access time for the IBM Deskstar 14GPX

    3.5 inch hard disk, 14.4 GB capacity
    5 platters of 3.35 GB of user data each, platters rotate at 7200/min
    average seek time 9.1 ms (min: 2.2 ms [track-to-track], max: 15.5 ms)
    average rotational delay 4.17 ms
    data transfer rate 13 MB/s

    access time for an 8 KB block = 9.1 ms + 4.17 ms + 1 s / (13 MB / 8 KB) ≈ 13.9 ms

N.B. Accessing a main memory location typically takes < 60 ns.

The unit of data transfer between disk and main memory is a block; if a single item (e.g., record, attribute) is needed, the whole containing block must be transferred. Reading or writing a disk block is called an I/O operation. The time for I/O operations dominates the time taken for database operations, so DBMSs take the geometry and mechanics of hard disks into account. Current disk designs can transfer a whole track in one platter revolution, and the active disk head can be switched after each revolution.

This implies a closeness measure for data records r1, r2 on disk:

1 Place r1 and r2 inside the same block (single I/O operation),
2 place r2 inside a block adjacent to r1's block on the same track,
3 place r2 in a block somewhere on r1's track,
4 place r2 in a track of the same cylinder as r1's track,
5 place r2 in a cylinder adjacent to r1's cylinder.

2.1.2 Accelerating Disk-I/O

Performance gains with parallel I/Os. Goals:

- Reduce the number of I/Os: DBMS buffer, physical DB design.
- Reduce the duration of I/Os:
  - Access neighboring disk blocks (clustering) and bulk-I/O.
    Advantage: optimized seek time, optimized rotational delay, minimized overhead (e.g., interrupt handling).
    Disadvantage: the I/O path is busy for a long time (concurrency).
    Bulk I/Os can be implemented on top of or inside the disk controller; used for mid-sized data accesses (prefetching, sector buffering).
  - Different I/O paths (declustering) with parallel access.
    Advantage: parallel I/Os; transfer time minimized by multiplied bandwidth.
    Disadvantage: average seek time and rotational delay increase, more hardware is needed, parallel transactions may block each other.
    Requires advanced hardware or disk arrays (RAID systems); used for large data accesses.

Partition files into equally sized areas of consecutive blocks ("striping").

(Figure: striping across multiple disks: intra-I/O parallelism vs. inter-I/O parallelism.)

The striping unit (the number of logically consecutive bytes placed on one disk) determines the degree of parallelism for a single I/O and the degree of parallelism between different I/Os:

- small chunks: high intra-access parallelism, but many devices are busy, so not many I/Os can run in parallel;
- large chunks: low intra-access parallelism, but many I/Os in parallel.
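To make the effect of the striping unit concrete, here is a small, illustrative Python sketch (not part of the lecture notes); the function name map_block and the simple round-robin placement of stripe units are assumptions chosen only for the example.

```python
# Minimal sketch: map a logical block number onto (disk, physical block) for a
# striped file. Assumes block-granular striping where `stripe_unit` consecutive
# blocks are placed on one disk before moving on to the next disk.

def map_block(logical_block: int, num_disks: int, stripe_unit: int):
    """Return (disk number, block number on that disk)."""
    stripe = logical_block // stripe_unit          # which stripe unit overall
    offset_in_unit = logical_block % stripe_unit   # position inside that unit
    disk = stripe % num_disks                      # round-robin across disks
    unit_on_disk = stripe // num_disks             # units already on this disk
    return disk, unit_on_disk * stripe_unit + offset_in_unit

if __name__ == "__main__":
    # Small striping unit: consecutive blocks are spread over all disks.
    print([map_block(b, num_disks=4, stripe_unit=1) for b in range(8)])
    # Large striping unit: consecutive blocks stay on one disk longer.
    print([map_block(b, num_disks=4, stripe_unit=4) for b in range(8)])
```

With stripe_unit=1 a single multi-block request keeps all four disks busy (high intra-I/O parallelism); with stripe_unit=4 it touches only one or two disks, leaving the others free for concurrent requests (inter-I/O parallelism).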

RAID Systems: Improving Availability

Goal: maximize availability

    availability = MTTF / (MTTF + MTTR)

where MTTF = mean time to failure and MTTR = mean time to repair.

Problem: with N disks, we have an N times higher probability of failures. Thus

    MTTDL = MTTF / N

where MTTDL = mean time to data loss.

Solution: RAID = Redundant Array of Inexpensive (Independent) Disks. Now we get

    MTTDL = (MTTF / N) × (MTTF / ((N - 1) × MTTR))

i.e., we only suffer data loss if a second disk fails before the first failed disk has been replaced.

Principle of Operation

Use data redundancy to be able to reconstruct lost data, e.g., compute parity information during normal operation (⊕ denotes logical xor, exclusive or). When one of the disks fails, use the parity to reconstruct the lost data during failure recovery. Typically, one extra disk is reserved as a hot spare to replace the failed one immediately.

Executing I/Os from/to a RAID System

Read access: to read block number k from disk j, execute a read(B_k^j, disk_j) operation.

Write access: to write block number k back to disk j, we have to update the parity information, too (let p be the number of the parity disk for block k):
    for all i ≠ j: read(B_k^i, disk_i);
    compute the new parity block B_k^p from the contents of all B_k^i;
    write(B_k^j, disk_j); write(B_k^p, disk_p).

We can do better (i.e., more efficient), though: read only the old block B_k^j and the old parity block B_k^p, then
    compute the new parity block B_k^p := B_k^p ⊕ B_k^j(old) ⊕ B_k^j(new);
    write(B_k^j, disk_j); write(B_k^p, disk_p).

Write access to blocks on all disks i ≠ p (a full stripe):
    compute the new parity block B_k^p from the contents of all B_k^i, i ≠ p;
    for all i: write(B_k^i, disk_i).

Reconstruction of a block on a failed disk j (let r be the number of the replacement disk):
    for all i ≠ j: read(B_k^i, disk_i);
    reconstruct B_k^j as the parity of all B_k^i, i ≠ j;
    write(B_k^j, disk_r).
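To make the parity arithmetic concrete, here is a minimal sketch (not from the lecture notes) that computes a parity block with XOR, updates it after a small write using old parity ⊕ old data ⊕ new data, and reconstructs a lost block; block contents are modeled as Python bytes, which is of course a simplification.

```python
# Minimal sketch of RAID parity arithmetic with XOR over equally sized blocks.
# Illustrative only; real controllers operate on raw sectors, not Python bytes.

def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# Data blocks on disks 0..2, parity block on disk 3
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# Small write: update the block on disk 1 without reading the whole stripe
old, new = data[1], b"XYZW"
parity = xor_blocks(parity, old, new)   # new parity = old parity ^ old data ^ new data
data[1] = new

# Disk 2 fails: reconstruct its block from the surviving blocks and the parity
reconstructed = xor_blocks(data[0], data[1], parity)
assert reconstructed == data[2]
print("reconstructed:", reconstructed)
```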

Recovery Strategies

- Off-line: if a disk has failed, suspend normal I/O; reconstruct all blocks of the failed disk and write them to the replacement disk; only afterwards resume normal I/O traffic.
- On-line (needs a hot spare disk): resume normal I/O immediately and start reconstructing all blocks not yet reconstructed since the crash (in the background);
  allow parallel normal writes: write the block to the replacement disk and update the parity;
  allow parallel normal reads: if the block has not yet been repaired, reconstruct it; if the block has already been reconstructed, read it from the replacement disk or reconstruct it anyway (a load-balancing decision).
  N.B. We can even wait with all reconstruction until the first normal read access.

RAID Levels

There are a number of variants ("RAID levels") differing w.r.t. the following characteristics:

- Striping unit (data interleaving): how to scatter the (primary) data across the disks? Fine (bits/bytes) or coarse (blocks) grain?
- How to compute and distribute the redundant information: what kind of redundant information (parity, ECCs)? Where to allocate the redundant information (on separate/few disks, or on all disks of the array)?

Originally, 5 RAID levels were introduced; later, more levels have been defined.

RAID Level 0: no redundancy, just striping.
  Least storage overhead and no extra effort for write access, but not the best read performance.

RAID Level 1: mirroring.
  Doubles the necessary storage space and doubles the write accesses; optimized read performance due to the alternative I/O path.

RAID Level 2: memory-style ECC.
  Compute error-correcting codes for the data of n disks and store them on n - 1 additional disks. Failure recovery: determine the lost disk by using the n - 1 extra disks; correct (reconstruct) its contents from one of those.

RAID Level 3: bit-interleaved parity.
  One parity disk suffices, since the controller can easily identify the faulty disk. The (primary) data is distributed bit-wise onto the data disks. Read and write accesses go to all disks; therefore there is no inter-I/O parallelism, but high bandwidth.

RAID Level 4: block-interleaved parity.
  Like RAID 3, but the data is distributed block-wise (variable block size). A small read I/O goes to only one disk. Bottleneck: all write I/Os go to the one parity disk.

RAID Level 5: block-interleaved striped parity.
  Like RAID 4, but the parity blocks are distributed across all disks (load balancing). Best performance for small and large reads as well as large writes; variants differ w.r.t. the distribution of the parity blocks.

More recent levels combine aspects of the ones listed here, or add multiple parity blocks, e.g., RAID 6: two parity blocks per group.
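As a small illustration of distributing the parity blocks across all disks (RAID 5), here is one possible layout function. It is only a sketch: real controllers use various left-/right-symmetric rotation conventions, and the function name and rotation rule are assumptions made for the example.

```python
# Minimal sketch: one possible RAID-5 parity rotation. Layout conventions vary
# between implementations; this only illustrates spreading parity over all disks.

def raid5_layout(stripe: int, num_disks: int):
    """Return (parity disk, list of data disks) for a given stripe number."""
    parity_disk = (num_disks - 1 - stripe) % num_disks   # parity rotates backwards
    data_disks = [d for d in range(num_disks) if d != parity_disk]
    return parity_disk, data_disks

if __name__ == "__main__":
    for stripe in range(6):
        p, data = raid5_layout(stripe, num_disks=4)
        print(f"stripe {stripe}: parity on disk {p}, data on disks {data}")
```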

Parity groups

Parity is not necessarily computed across all disks within an array; it is also possible to define parity groups (of the same or of different sizes).

(Figure: RAID layouts on five disks, shading = redundant information; data interleaving on the byte level or on the block level; the canonical layouts are non-redundant (RAID 0), mirroring (RAID 1), memory-style ECC (RAID 2), bit-interleaved parity (RAID 3), block-interleaved parity (RAID 4), block-interleaved striped parity (RAID 5), and P+Q redundancy (RAID 6). A second figure shows parity groups 1 to 5 with their parity blocks rotated across disks 1 to 5.)

Selecting RAID levels

- RAID level 0 improves overall performance at the lowest cost; there is no provision against data loss. Best write performance, since there is no redundancy.
- RAID level 0+1 (aka level 10) is superior to level 1; its main application area is small storage subsystems, sometimes also write-intensive applications.
- RAID level 1 is the most expensive version; the two necessary write I/Os are typically serialized to avoid data loss in case of power failures, etc.
- RAID levels 2 and 4 are always inferior to levels 3 and 5, respectively. Level 3 is appropriate for workloads with large requests for contiguous blocks; it is bad for many small requests of single blocks.
- RAID level 5 is a good general-purpose solution: the best performance (with redundancy) for small and large reads as well as large writes.
- RAID level 6 is the choice for a higher level of reliability.

RAID logic can be implemented inside the disk subsystem/controller ("hardware RAID") or in the OS ("software RAID").

2.2 Disk space management

(Figure: DBMS architecture, "you are here": the disk space manager, below the buffer manager and the files and index structures.)

- The disk space manager (DSM) encapsulates the gory details of hard disk access for the DBMS.
- The DSM talks to the disk controller and initiates I/O operations.
- Once a block has been brought in from disk, it is referred to as a page.*
- Sequences of data pages are mapped onto contiguous sequences of blocks by the DSM.
- The DBMS issues allocate/deallocate and read/write commands to the DSM, which, internally, uses a mapping block# <-> page# to keep track of page locations and block usage (a minimal interface sketch follows below).

* Disk blocks and pages are of the same size.
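The following is a minimal, illustrative sketch of such a DSM interface. The class and method names and the dict-based page table are assumptions made for the example; they are not taken from the lecture notes, and the "read" merely reports which block would be accessed.

```python
# Minimal sketch of a disk space manager (DSM): allocate/deallocate pages and
# keep the page# -> block# mapping. Illustrative only.

class DiskSpaceManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = set(range(num_blocks))  # blocks not currently in use
        self.page_to_block = {}                    # mapping page# -> block#
        self.next_page = 0

    def allocate_page(self) -> int:
        block = min(self.free_blocks)              # naive: pick the lowest free block
        self.free_blocks.remove(block)
        page = self.next_page
        self.next_page += 1
        self.page_to_block[page] = block
        return page

    def deallocate_page(self, page: int) -> None:
        self.free_blocks.add(self.page_to_block.pop(page))

    def read_page(self, page: int) -> bytes:
        block = self.page_to_block[page]
        # a real DSM would now issue an I/O for this block; we just report it
        return f"<contents of block {block}>".encode()

if __name__ == "__main__":
    dsm = DiskSpaceManager(num_blocks=8)
    p = dsm.allocate_page()
    print(p, dsm.read_page(p))
    dsm.deallocate_page(p)
```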

2.2.1 Keeping track of free blocks

During database (or table) creation it is likely that blocks can indeed be arranged contiguously on disk. Subsequent deallocations and new allocations, however, will in general create holes. To reclaim space that has been freed, the disk space manager uses either

a free block list:
1 keep a pointer to the first free block in a known location on disk,
2 when a block is no longer needed, append/prepend this block to the free block list for future use,
3 the next pointers may be stored in the disk blocks themselves,

or a free block bitmap:
1 reserve a block whose bytes are interpreted bit-wise (bit n = 0: block n is free),
2 toggle bit n whenever block n is (de-)allocated.

Free block bitmaps allow for fast identification of contiguous sequences of free blocks.

2.3 Buffer manager

(Figure: DBMS architecture, "you are here": the buffer manager, between the files and index structures and the disk space manager.)

The size of the database on secondary storage far exceeds the size of the primary memory available to hold user data. To scan all pages of a 20 GB table (SELECT * FROM ...), the DBMS therefore needs to

1 bring in pages as they are needed for in-memory processing,
2 overwrite (replace) such pages when they become obsolete for query processing and new pages require the in-memory space.

The buffer manager manages a collection of pages in a designated main memory area, the buffer pool. Once all slots (frames) in this pool have been occupied, the buffer manager uses a replacement policy to decide which frame to overwrite when a new page needs to be brought in.

N.B. Simply overwriting a page in the buffer pool is not sufficient if this page has been modified after it was brought in (i.e., if the page is dirty).

(Figure: buffer pool frames in main memory; pinpage/unpinpage requests from above; pages are read from and written back to the database on disk, using a free frame for a newly requested page.)

Simple interface for a typical buffer manager

Indicate that page p is needed for further processing:

    function pinpage(p):
      if the buffer pool already contains p then
        pincount(p) ← pincount(p) + 1;
        return the address of the frame holding p;
      select a victim frame holding some page p' using the replacement policy;
      if dirty(p') then write p' to disk;
      read p from disk into the selected frame;
      pincount(p) ← 1; dirty(p) ← false;
      return the address of the frame holding p;

Indicate that page p is no longer needed, as well as whether p has been modified by the transaction (flag d):

    function unpinpage(p, d):
      pincount(p) ← pincount(p) - 1;
      dirty(p) ← d;
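The following Python sketch mirrors this interface. It is illustrative only and not the lecture's implementation: the "disk" is a dict, the victim choice is simply the first unpinned frame instead of a real replacement policy, and unpinpage here remembers earlier modifications instead of overwriting the dirty flag.

```python
# Minimal sketch of pinpage/unpinpage with per-frame pin counts and dirty flags.

class BufferManager:
    def __init__(self, num_frames, disk):
        self.disk = disk                                   # page id -> page contents
        self.frames = [None] * num_frames                  # page id held by each frame
        self.contents = [None] * num_frames
        self.pincount = [0] * num_frames
        self.dirty = [False] * num_frames

    def pinpage(self, p):
        if p in self.frames:                               # page already buffered
            f = self.frames.index(p)
            self.pincount[f] += 1
            return f
        # choose a frame: prefer an empty one, else the first unpinned frame
        if None in self.frames:
            f = self.frames.index(None)
        else:
            f = next(i for i, c in enumerate(self.pincount) if c == 0)
        if self.frames[f] is not None and self.dirty[f]:
            self.disk[self.frames[f]] = self.contents[f]   # write back a dirty victim
        self.frames[f], self.contents[f] = p, self.disk[p]
        self.pincount[f], self.dirty[f] = 1, False
        return f

    def unpinpage(self, p, d):
        f = self.frames.index(p)
        self.pincount[f] -= 1
        self.dirty[f] = self.dirty[f] or d                 # remember modifications

if __name__ == "__main__":
    bm = BufferManager(2, disk={"A": "a0", "B": "b0", "C": "c0"})
    f = bm.pinpage("A"); bm.contents[f] = "a1"; bm.unpinpage("A", True)
    bm.pinpage("B"); bm.unpinpage("B", False)
    bm.pinpage("C")                 # evicts the dirty page A and writes it back
    print(bm.disk)                  # A now shows the modified contents "a1"
```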

N.B.
- The pincount of a page indicates how many users (e.g., transactions) are currently working with that page.
- Clean victim pages are not written back to disk.
- A call to unpinpage does not trigger any I/O operation, even if the pincount for this page goes down to 0 (the page might become a suitable victim, though).
- A database transaction is required to properly bracket any page operation using pinpage and unpinpage, i.e.,

      a ← pinpage(p);
      ... read data (records) on the page at address a ...
      unpinpage(p, false);

  or

      a ← pinpage(p);
      ... read and modify data (records) on the page at address a ...
      unpinpage(p, true);

A buffer manager typically offers at least one more interface call: flushpage(p), to force page p (synchronously) back to disk (for transaction management purposes).

Two strategic questions

1 How much precious buffer space should be allocated to each of the active transactions (buffer allocation problem)? Two principal approaches:
  - static assignment
  - dynamic assignment
2 Which page should be replaced when a new request arrives and the buffer is full (page replacement problem)? Again, two approaches can be followed:
  - decide without knowledge of the reference pattern
  - presume knowledge of the (expected) reference pattern

Additional complexity is introduced when we take into account that the DBMS may manage segments with different page sizes:
- one buffer pool: good space utilization, but a fragmentation problem;
- many buffer pools: no fragmentation, but worse utilization; a global replacement/assignment strategy may get complicated.
A possible solution could be to allow for set-oriented pinpages({p}) calls.

Buffer allocation policies

Problem: shall we allocate parts of the buffer pool to each transaction (TX), or let the replacement strategy alone decide who gets how much buffer space?

Properties of a local policy:
- one TX cannot hurt others, and TXs are treated equally;
- possibly bad overall utilization of buffer space: some TXs may have vast amounts of buffer space occupied by old pages, while others experience internal thrashing, i.e., suffer from too little space.

Problem with a global policy: consider a TX executing a sequential read on a huge relation. All its accesses are references to newly loaded pages; hence, almost all other pages are likely to be replaced (following a standard replacement strategy), and other TXs cannot proceed without loading their pages again ("external thrashing").

Typical allocation strategies include:
- global: one buffer pool for all transactions;
- local, based on different kinds of data (e.g., catalog, index, data pages, ...);
- local, each transaction gets a certain fraction of the buffer pool:
  - static partitioning: assign the buffer budget once for each TX;
  - dynamic partitioning: adjust a TX's buffer budget according to its past reference pattern or some kind of semantic information.

It is also possible to apply mixed strategies, e.g., have different pools working with different approaches. This complicates matters significantly, though.

Examples for dynamic allocation strategies

1 Local LRU (cf. LRU replacement, later)
  Keep a separate LRU stack for each active TX, and a global freelist for pages not pinned by any TX.
  Strategy:
  i. replace a page from the freelist;
  ii. otherwise, replace a page from the LRU stack of the requesting TX;
  iii. otherwise, replace a page from the TX with the largest LRU stack.

2 Working Set Model (cf. operating systems' virtual memory management)
  Goal: avoid thrashing by allocating just enough buffer space to each TX.
  Approach: observe the number of different pages requested by each TX within a certain interval of time (window size τ); deduce an optimal buffer budget from this observation; allocate the buffer budgets according to the ratio between those optimal sizes.

Implementation of the Working Set Model

Let WS(T, τ) be the working set of TX T for window size τ, i.e.,

    WS(T, τ) = { pages referenced by T in the interval [now - τ, now] }.

The strategy is to keep, for each transaction Ti, its working set WS(Ti, τ) in the buffer. Possible implementation: keep two counters, per TX and per page, respectively:

    trc(Ti)      ... TX-specific reference counter,
    lrc(Ti, Pj)  ... TX-specific last reference counter for each referenced page Pj.

Idea of the algorithm (a code sketch follows at the end of this subsection): whenever Ti references Pj, increment trc(Ti) and copy trc(Ti) to lrc(Ti, Pj). If a page has to be replaced for Ti, select among those pages with trc(Ti) - lrc(Ti, Pj) > τ.

Buffer replacement policies

The choice of the victim frame selection (or buffer replacement) policy can considerably affect DBMS performance; a large number of policies has been proposed for operating systems and database management systems. Criteria for victim selection used in some strategies are the references considered (none, only the last reference, or all references) and the age of the page in the buffer (none, age since the last reference, or total age); well-known representatives are Random, FIFO, LFU, LRU, CLOCK, GCLOCK (V1/V2), DGCLOCK, and LRD (V1/V2).

(Figure: schematic behaviour of LRU, GCLOCK ("used" bits, possibly initialized with weights) and LRD (reference counts rc and age) for a reference to a page already in the buffer and to a page not in the buffer.)
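Returning to the working-set bookkeeping described above, here is a minimal Python sketch of the trc/lrc counters (not from the lecture notes; the class and method names are illustrative, and a real buffer manager would combine this bookkeeping with actual frame management).

```python
# Minimal sketch of working-set tracking: per-transaction reference counters (trc)
# and per-page last-reference counters (lrc), as described above.

from collections import defaultdict

class WorkingSetTracker:
    def __init__(self, tau: int):
        self.tau = tau
        self.trc = defaultdict(int)     # trc(T): references issued by transaction T
        self.lrc = defaultdict(dict)    # lrc(T, P): trc(T) at T's last reference to P

    def reference(self, tx: str, page: int) -> None:
        self.trc[tx] += 1
        self.lrc[tx][page] = self.trc[tx]

    def replacement_candidates(self, tx: str):
        """Pages outside T's working set: trc(T) - lrc(T, P) > tau."""
        return [p for p, last in self.lrc[tx].items()
                if self.trc[tx] - last > self.tau]

if __name__ == "__main__":
    ws = WorkingSetTracker(tau=3)
    for page in [1, 2, 3, 1, 4, 5]:
        ws.reference("T1", page)
    print(ws.replacement_candidates("T1"))   # page 2 has left the working set
```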

Two policies found in a number of DBMSs:

1 LRU ("least recently used")
  Keep a queue (often described as a stack) of pointers to frames. In unpinpage(p, d), append p to the tail of the queue if pincount(p) has been decremented to 0. To find the next victim, search through the queue from its head and take the first page p with pincount(p) = 0.

2 Clock ("second chance")
  Number the N frames in the buffer pool 0 ... N - 1, initialize a counter current ← 0, and maintain a bit array referenced[0 ... N - 1], initialized to all 0. In pinpage(p), set referenced[p] ← 1. To find the next victim, consider frame current: if pincount(current) = 0 and referenced[current] = 0, current is the victim. Otherwise, set referenced[current] ← 0, current ← (current + 1) mod N, and repeat. (A code sketch follows at the end of this subsection.)

Generalization: LRU(k) takes the timestamps of the last k references into account; standard LRU is LRU(1).

N.B. LRU as well as Clock are heuristics only. Any heuristic can fail miserably in certain scenarios.

A challenge for LRU
A number of transactions want to scan the same sequence of pages (e.g., SELECT * FROM R), one after the other. Assume a buffer pool with a capacity of 10 pages.
1 Let the size of relation R be 10 pages or fewer. How many I/Os do you expect?
2 Let the size of relation R be 11 pages. What about the number of I/O operations in this case?

Other well-known replacement policies are, e.g., FIFO ("first in, first out"), LIFO ("last in, first out"), LFU ("least frequently used"), MRU ("most recently used"), GCLOCK ("generalized clock"), DGCLOCK ("dynamic GCLOCK"), LRD ("least reference density"), WS and HS ("working set", "hot set"; see above), and Random.

LRD (least reference density)

Record the following three parameters:
    trc(t)  ... total reference count of transaction t,
    age(p)  ... value of trc(t) at the time of loading p into the buffer,
    rc(p)   ... reference count of page p.
Update these parameters during a transaction's page references (pinpage calls). From those, compute the mean reference density of a page p at time t as

    rd(p, t) := rc(p) / (trc(t) - age(p))      where trc(t) - age(p) ≥ rc(p) ≥ 1.

Strategy for victim selection: choose the page with the least reference density rd(p, t). There are many variants, e.g., for gradually disregarding old references.

Exploiting semantic knowledge

The query compiler/optimizer selects the access plan (e.g., sequential scan vs. index) and estimates the number of I/Os for cost-based optimization. Idea: use this information to determine a query-specific, optimal buffer budget (the "Query Hot Set" model). Goals: optimize the overall system throughput; avoiding thrashing is the most important goal.
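Tying back to the Clock policy described above, the following is a minimal Python sketch of the victim search (not from the lecture notes; the function name is illustrative, and the helper assumes that at least one frame is unpinned).

```python
# Minimal sketch of the Clock ("second chance") victim search over N frames,
# each with a pin count and a referenced bit.

def clock_victim(pincount, referenced, current):
    """Return (victim frame, new clock hand). Assumes some frame is unpinned."""
    n = len(pincount)
    while True:
        if pincount[current] == 0 and referenced[current] == 0:
            return current, (current + 1) % n
        referenced[current] = 0          # take away the "second chance"
        current = (current + 1) % n

if __name__ == "__main__":
    pincount   = [0, 2, 0, 0]
    referenced = [1, 1, 1, 0]
    victim, hand = clock_victim(pincount, referenced, current=0)
    print(victim, hand)   # frame 3 is the first unpinned, unreferenced frame
```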

Hot Set with disjoint page sets

1 Only those queries are activated whose Hot Set buffer budget can be satisfied immediately.
2 Queries with higher demands have to wait until their budget becomes available.
3 Within its own buffer budget, each transaction applies a local LRU policy.

Properties:
- No sharing of buffered pages between transactions.
- Risk of internal thrashing when the Hot Set estimates are wrong.
- Queries with large Hot Sets block subsequent small queries. (Or, if bypassing is permitted, many small queries can lead to starvation of large ones.)

Hot Set with non-disjoint page sets

1 Queries allocate their budget stepwise, up to the size of their Hot Set.
2 Local LRU stacks are used for replacement.
3 Request for a page p:
  (i) if found in the own LRU stack: update the LRU stack;
  (ii) if found in another transaction's LRU stack: access the page, but do not update the other LRU stack;
  (iii) if found in the freelist: push the page onto the own LRU stack.
4 unpinpage: push the page onto the freelist stack.
5 Empty buffer frames are filled from the bottom of the freelist stack.

N.B. As long as a page is in a local LRU stack, it cannot be replaced. If a page drops out of a local LRU stack, it is pushed onto the freelist stack. A page is replaced only if it reaches the bottom of the freelist stack before some transaction pins it again.

Priority Hints

Idea: with unpinpage, a transaction gives one of two possible indications to the buffer manager:
- preferred pages ... managed in a TX-local partition,
- ordinary pages ... managed in a global partition.

Strategy (see the sketch below): when a page needs to be replaced,
1. try to replace an ordinary page from the global partition using LRU;
2. otherwise, replace a preferred page of the requesting TX according to MRU.

Advantages: much simpler than DBMIN ("Hot Set"), similar performance, easy to deal with too-small partitions.

Prefetching

When the buffer manager receives requests for (single) pages, it may decide to (asynchronously) read ahead:
- on-demand, asynchronous read-ahead: e.g., when traversing the sequence set of an index, or during a sequential scan of a relation;
- heuristic (speculative) prefetching: e.g., sequential n-block lookahead (cf. drive or controller buffers in hard disks), semantically determined supersets, index prefetch, ...
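Returning to the priority-hints strategy above, here is a minimal Python sketch of the two-step victim selection (not from the lecture notes; the list-based partitions and function name are assumptions made for the example, with lists kept in oldest-to-newest order).

```python
# Minimal sketch of priority-hints victim selection: LRU on the global "ordinary"
# partition first, otherwise MRU on the requesting TX's "preferred" partition.

def choose_victim(ordinary_global, preferred_by_tx, requesting_tx):
    if ordinary_global:                      # 1. LRU on ordinary pages
        return ordinary_global.pop(0)        #    oldest unpinned ordinary page
    preferred = preferred_by_tx[requesting_tx]
    return preferred.pop()                   # 2. MRU on the requester's preferred pages

if __name__ == "__main__":
    ordinary = [101, 102]                            # global partition (LRU order)
    preferred = {"T1": [7, 8, 9], "T2": [42]}        # per-transaction partitions
    print(choose_victim(ordinary, preferred, "T1"))  # -> 101 (ordinary, LRU)
    print(choose_victim([], preferred, "T1"))        # -> 9   (preferred, MRU)
```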

2.3.3 Buffer management in DBMSs vs. OSs

Buffer management for a DBMS curiously tastes like the virtual memory (1) concept of modern operating systems: both techniques provide access to more data than will fit into primary memory. So why don't we use the OS virtual memory facilities to implement DBMSs?

A DBMS can predict certain reference patterns for the pages in its buffer a lot better than a general-purpose OS, mainly because page references in a DBMS are initiated by higher-level operations (sequential scans, relational operators) of the DBMS itself.

Reference pattern examples in a DBMS:
1 Sequential scans call for prefetching.
2 Nested-loop joins call for page fixing and hating.

Furthermore, concurrency control protocols often prescribe the order in which pages are written back to disk. Operating systems usually do not provide hooks for that.

(1) Generally implemented using a hardware interrupt mechanism called page faulting.

Double Buffering

If the DBMS uses its own buffer manager (within the virtual memory of the DBMS server process), independently of the OS VM manager, we may experience the following:

- Virtual fault: the page resides in the DBMS buffer, but its frame has been swapped out of physical memory by the OS VM manager. An I/O operation is necessary that is not visible to the DBMS.
- Buffer fault: the page does not reside in the DBMS buffer, but the frame is in physical memory. Regular DBMS page replacement, requiring an I/O operation.
- Double page fault: the page does not reside in the DBMS buffer, and the frame has been swapped out of physical memory by the OS VM manager. Two I/O operations are necessary: one to bring in the frame (OS) (2), and another one to replace the page in that frame (DBMS).

=> The DBMS buffer needs to be memory-resident in the OS.

(2) The OS VM does not know about dirty flags and hence brings in pages that could simply be overwritten.

2.4 File and record organization

(Figure: DBMS architecture, "you are here": files and index structures, on top of the buffer manager.)

We will now turn away from page management and instead focus on page usage in a DBMS. On the conceptual level, a relational DBMS manages tables of tuples (3), e.g.

    A     B    C
    true  foo  ...

On the physical level, such tables are represented as files of records (a tuple corresponds to a record): each page holds one or more records (in general, record size << page size), and a file is a collection of records that may reside on several pages.

(3) More precisely, "table" actually means bag here (a set of elements with multiplicity >= 0).

Heap files

The simplest file structure is the heap file, which represents an unordered collection of records. As in any file structure, each record has a unique record identifier (rid). A typical heap file interface supports the following operations:

- create/destroy a heap file f named n:             createfile(n) / deletefile(f)
- insert record r and return its rid:               insertrecord(f, r)
- delete a record with a given rid:                 deleterecord(f, rid)
- get a record with a given rid:                    getrecord(f, rid)
- initiate a sequential scan over the whole file:   openscan(f)

N.B. Record ids (rids) are used like record addresses (or pointers). Internally, the heap file structure must be able to map a given rid to the page containing the record.

To support openscan(f), the heap file structure has to keep track of all pages in file f; to support insertrecord(f, r) efficiently, we need to keep track of all pages with free space in file f. Let us have a look at two simple structures which can offer this support.

Linked list of pages

When createfile(n) is called,
1 the DBMS allocates a free page (the file header) and writes an appropriate entry <n, header> to a known location on disk;
2 the header page is initialized to point to two doubly linked lists of pages: a linked list of pages with free space and a linked list of full pages;
3 initially, both lists are empty.

(Figure: header page pointing to the doubly linked list of data pages with free space and to the doubly linked list of full data pages.)

Remarks:

For insertrecord(f, r),
1 try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p;
2 record r is written to page p;
3 since, in general, |r| << |p|, page p will belong to the list of pages with free space;
4 a unique rid for r is computed and returned to the caller.

For openscan(f), both lists have to be traversed.

A call to deleterecord(f, rid)
1 may result in moving the containing page from the full list to the free list,
2 or may even lead to page deallocation if the page is completely free after the deletion.

Finding a page with sufficient free space is an important problem to solve inside insertrecord(f, r). How does the heap file structure support this operation? (How many pages of a file do you expect to be in the list of free pages?)

Directory of pages

An alternative to the linked-list approach is to maintain a directory of pages in a file. The header page contains the first page of a chain of directory pages; each entry in a directory page identifies a page of the file.

(Figure: header page, chain of directory pages, each directory entry pointing to one data page.)

Remarks: Free space management is also done via the directory: each directory entry is actually of the form <addr(p), nfree>, where nfree indicates the actual amount of free space (e.g., in bytes) on page p. (A sketch of this organization follows below.)

I/O operations and free space management

For a file of 10,000 pages, give lower and upper bounds for the number of I/O operations during an insertrecord(f, r) call for a heap file organized using
1 a linked list of pages,
2 a directory of pages (1000 directory entries per page).

Linked list:
    lower bound: header page + first page in free list + write r = 3 I/Os
    upper bound: header page + all pages in the free list + write r (up to 10,002 I/Os)

Directory (1000 entries per page):
    lower bound: directory header page + write r = 2 I/Os
    upper bound: 10 directory pages + write r = 11 I/Os
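Here is a minimal, illustrative Python sketch of the directory-of-pages organization (not from the lecture notes): each directory entry holds a page number and its remaining free space, and insert_record scans the directory for a page with enough room; PAGE_SIZE, the class name, and the list-of-records page model are assumptions made for the example.

```python
# Minimal sketch of a heap file with a directory of pages: each directory entry
# is <page number, free bytes>, and records are appended to the chosen page.

PAGE_SIZE = 8192

class HeapFile:
    def __init__(self):
        self.directory = []        # list of [page_no, free_bytes]
        self.pages = []            # page_no -> list of records (simplified)

    def _allocate_page(self) -> int:
        self.pages.append([])
        self.directory.append([len(self.pages) - 1, PAGE_SIZE])
        return len(self.pages) - 1

    def insert_record(self, record: bytes):
        """Return the rid <page_no, slot_no> of the inserted record."""
        for entry in self.directory:
            if entry[1] >= len(record):        # a page with enough free space
                page_no = entry[0]
                break
        else:
            page_no = self._allocate_page()
            entry = self.directory[page_no]
        self.pages[page_no].append(record)
        entry[1] -= len(record)
        return (page_no, len(self.pages[page_no]) - 1)

if __name__ == "__main__":
    f = HeapFile()
    print(f.insert_record(b"hello"), f.insert_record(b"world"))
```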

2.5 Page formats

Locating the page containing the data for a given rid is not the whole story when the DBMS needs to access a specific record: the internal structure of pages plays a crucial role. For the following discussion, we consider a page to consist of a sequence of slots, each of which contains a record. A complete record identifier then has the unique form <page address, nslot>, where nslot denotes the slot number on the page.

Fixed-length records

Life is particularly easy if all records on the page (in the file) are of the same size s:

- getrecord(f, <p, n>): given the rid <p, n>, we know that the record is to be found at (byte) offset n × s on page p.
- deleterecord(f, <p, n>): copy the bytes of the last occupied slot on page p to offset n × s and mark the last slot as free; all occupied slots thus appear together at the start of the page (the page is packed).
- insertrecord(f, r): find a page p with free space ≥ s (see the previous section); copy r to the first free slot on p and mark that slot as occupied.

Packed pages and deletions

One problem with packed pages remains, though: calling deleterecord(f, <p, n>) modifies the rid of a different record <p, n'> on the same page. If any external reference to this record exists, we need to chase the whole database and update the rid references <p, n'> -> <p, n>. Bad!

To avoid record copying (and thus rid modifications), we could simply use a free-slot bitmap on each page (a code sketch follows below):

(Figure: packed page with slots 0 ... N-1 followed by free space, the page header storing N = number of records; unpacked page with M slots, a free-slot bitmap, and M = number of slots in the page header.)

Calling deleterecord(f, <p, n>) then simply means setting bit n in the bitmap to 0; no other rids are affected.

Variable-length records

If the records on a page are of varying size (cf. the SQL datatype VARCHAR(n)), we have to deal with fragmentation:

- In insertrecord(f, r) we have to find an empty slot of size ≥ |r|; at the same time we want to minimize waste.
- To get rid of the holes produced by deleterecord(f, rid), compact the remaining records to maintain a contiguous area of free space on the page.

A solution is to maintain a slot directory on each page (compare this with a heap file directory):

(Figure: page p with variable-length records addressed via a slot directory at the end of the page; each directory entry stores the offset of the record from the start of the data area and its length (e.g., 24 bytes); the directory also holds a pointer to the start of the free space and N = the number of entries; rids are <p, 0>, <p, 1>, ..., <p, N-1>.)

Page header or trailer?
In both organization schemes we have positioned the page header at the end of its page. How would you justify this design decision?
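The following minimal Python sketch illustrates a page for fixed-length records with a free-slot bitmap, as described above: deleting slot n only clears bit n, so no other rid changes. It is illustrative only (a real page would be a single byte array on disk; RECORD_SIZE and the class name are assumptions for the example).

```python
# Minimal sketch of a fixed-length-record page with a free-slot bitmap.

RECORD_SIZE = 16

class FixedLengthPage:
    def __init__(self, num_slots: int):
        self.bitmap = [0] * num_slots                 # 1 = slot occupied
        self.data = bytearray(num_slots * RECORD_SIZE)

    def insert(self, record: bytes) -> int:
        assert len(record) == RECORD_SIZE
        n = self.bitmap.index(0)                       # first free slot
        self.data[n * RECORD_SIZE:(n + 1) * RECORD_SIZE] = record
        self.bitmap[n] = 1
        return n                                       # slot number, part of the rid

    def delete(self, n: int) -> None:
        self.bitmap[n] = 0                             # no records are moved

    def get(self, n: int) -> bytes:
        assert self.bitmap[n] == 1
        return bytes(self.data[n * RECORD_SIZE:(n + 1) * RECORD_SIZE])

if __name__ == "__main__":
    page = FixedLengthPage(num_slots=4)
    a = page.insert(b"A" * RECORD_SIZE)
    b = page.insert(b"B" * RECORD_SIZE)
    page.delete(a)
    print(page.get(b))    # the rid of record B is unaffected by the deletion
```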

Remarks:

- The slot directory contains entries <offset, length>, where the offset is measured in bytes from the start of the data area.
- In deleterecord(f, <p, n>), simply set the offset of directory entry n to -1; such an entry can be reused by subsequent insertrecord(f, r) calls which hit page p.
- Directory compaction is not allowed in this scheme (again, this would modify the rids of all records <p, n'> with n' > n). If insertions are much more common than deletions, the directory size will nevertheless stay close to the actual number of records stored on the page.

N.B. Record compaction (defragmentation) is performed, of course.

2.6 Record formats

This section zooms into the record internals themselves and thus discusses access to single record fields (conceptually: attributes). Attribute values are considered atomic by an RDBMS. Depending on the field types, we are dealing with fixed-length or variable-length fields in a record, e.g.:

    SQL datatype   fixed length?   length (# of bytes) (4)
    INTEGER        yes             4
    BIGINT         yes             8
    CHAR(n)        yes             n, 1 <= n <= 254
    VARCHAR(n)     no              1 ... n
    CLOB(n)        no              1 ... n, 1 <= n <= 2^31
    DATE           yes             4

The DBMS computes and then saves the field size information for the records of a file in the system catalog when it processes the corresponding CREATE TABLE command.

(4) Datatype lengths valid for DB2 V7.

Fixed-length fields

If all fields in a record are of fixed length, the offsets for field access can simply be read off the DBMS system catalog (field fi of size li):

    f1 | f2 | f3 | f4      with lengths l1, l2, l3, l4;
    base address b, address of f3 = b + l1 + l2.

Variable-length fields

If a record contains one or more variable-length fields, other record representations are needed:

1 Use a special delimiter symbol ($) to separate the record fields. Accessing field fi then means scanning over the bytes of fields f1 ... f(i-1):

      f1 $ f2 $ f3 $ f4 $

2 For a record of n fields, use an array of n + 1 offsets pointing into the record; the last array entry marks the end of field fn (a code sketch follows below):

      [offset array] f1 f2 f3 f4

Final remarks: Variable-length record formats seem to be more complicated but, among other advantages, they allow for a compact representation of SQL NULL values (field f3 is NULL below):

      f1 $ f2 $ $ f4

Growing a record
Consider an update of a field (e.g., of type VARCHAR(n)) which lets the record grow beyond the size of the free space on its containing page. How could the DBMS handle this situation efficiently?

Really growing a record
For fields of type VARCHAR(n) or CLOB(n) with n > page size, we are in trouble whenever a record actually grows beyond the page size (the record won't fit on any one page). How could the DBMS file manager cope with this?
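The following minimal Python sketch illustrates the offset-array record format described above; it represents a NULL field by a zero-length span (coinciding start and end offsets), which is an assumption of this example, and keeps the offsets as a Python list instead of on-page integers. Function names are illustrative.

```python
# Minimal sketch of the offset-array record format: n+1 offsets followed by the
# concatenated field bytes, so any field can be located without scanning.

def encode(fields):
    """fields: list of bytes or None (SQL NULL). Return (offsets, data)."""
    offsets, data = [0], b""
    for f in fields:
        data += b"" if f is None else f
        offsets.append(len(data))          # the offset array has n+1 entries
    return offsets, data

def get_field(offsets, data, i):
    start, end = offsets[i], offsets[i + 1]
    return None if start == end else data[start:end]   # empty span models NULL

if __name__ == "__main__":
    offs, data = encode([b"f1", b"field2", None, b"f4"])
    print(offs, get_field(offs, data, 1), get_field(offs, data, 2))
```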

2.7 Addressing schemes

What makes a good record ID (rid)?
- Given a rid, it should ideally not take more than 1 I/O to get to the record itself.
- rids should be stable under all circumstances, such as a record being moved within a page or a record being moved across pages.

Why are these goals important to achieve? Consider the fact that rids are used as persistent pointers in a DBMS (indexes, directories, implementation of CODASYL sets, ...).

Direct addressing

RBA (relative byte address): consider the disk file as a persistent virtual address space and use the byte offset as the rid.
    pro: very efficient access to the page and to the record within the page
    con: no stability at all w.r.t. moving the record

PP (page pointers): use disk page numbers as the rid.
    pro: very efficient access to the page; locating the record within the page is cheap (an in-memory operation)
    con: stable w.r.t. moving records within a page, but not when moving them across pages

Conflicting goals: efficiency calls for direct disk addresses, while stability calls for some kind of indirection.

Indirect addressing

LSN (logical sequence numbers): assign logical numbers to records; an address translation table maps LSNs to PPs (or even RBAs).
    pro: full stability w.r.t. all relocations of records
    con: an additional I/O to the translation table (which is often in the buffer)
CODASYL systems call this table the DBTT (database key translation table).

(Figure: DBTT translating logical database keys to physical addresses.)

Indirect addressing, a fancy variant

LSN/PPP (LSN with probable pointers): try to avoid the extra I/O by adding a probable PP (PPP) to each LSN. The PPP is the PP at the time of insertion into the database. If the record is later moved across pages, PPPs are not updated.
    pro: full stability w.r.t. all record relocations; the PPP can save the extra I/O to the translation table, iff it is still correct
    con: 2 additional I/Os in case the PPP is no longer valid: one to the old page to notice that the record has moved, and a second I/O to the translation table to look up the new page number

(Figure: LSNs with probable page pointers and the translation table.)
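The following minimal Python sketch illustrates the LSN/PPP lookup described above (not from the lecture notes): try the probable page first; only if the record has moved do we pay the extra accesses to the translation table and the record's current page. The function name and the dict-based pages/DBTT are assumptions made for the example.

```python
# Minimal sketch of LSN addressing with probable page pointers (PPP).

def lookup(lsn, ppp, pages, translation_table):
    """Return (record, number of page/table accesses)."""
    ios = 1                                 # read the probable page
    record = pages[ppp].get(lsn)
    if record is not None:                  # probable pointer still correct
        return record, ios
    ios += 1                                # extra access: translation table (DBTT)
    current_page = translation_table[lsn]
    ios += 1                                # extra access: the record's current page
    return pages[current_page][lsn], ios

if __name__ == "__main__":
    pages = {7: {"r1": "payload-r1"}, 9: {"r2": "payload-r2"}}
    dbtt = {"r1": 7, "r2": 9}
    print(lookup("r1", ppp=7, pages=pages, translation_table=dbtt))  # hit: 1 access
    print(lookup("r2", ppp=7, pages=pages, translation_table=dbtt))  # moved: 3 accesses
```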

TID addressing

TID (tuple identifier) with forwarding: use a <PP, slot#> pair as the rid (see above). To guarantee stability, leave a forward address on the original page if the record has to be moved across pages. For example, access the record with the given rid = <17, 2>:

(Figure: page 17, slot 2 contains a forward address that points to the record's current page and slot.)

Avoid chains of forward addresses: when the record has to be moved again, do not leave another forward address; rather, update the forward address on the original page.
    pro: full stability w.r.t. all relocations of records; no extra I/O due to indirection as long as the record has not moved
    con: 1 additional I/O in case of a forward pointer on the original page

Bibliography

Brown, K., Carey, M., and Livny, M. (1996). Goal-oriented buffer management revisited. In Proc. ACM SIGMOD Conference on Management of Data.

Chen, P., Lee, E., Gibson, G., Katz, R. H., and Patterson, D. (1994). RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2).

Denning, P. (1968). The working-set model for program behaviour. Communications of the ACM, 11(5).

Elmasri, R. and Navathe, S. (2000). Fundamentals of Database Systems. Addison-Wesley, Reading, MA, 3rd edition. Title of the 2002 German edition: Grundlagen von Datenbanken.

Härder, T. (1987). Realisierung von operationalen Schnittstellen, chapter 3 in (Lockemann and Schmidt, 1987). Springer.

Härder, T. (1999). Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer.

Härder, T. and Rahm, E. (2001). Datenbanksysteme: Konzepte und Techniken der Implementierung. Springer, Berlin, Heidelberg, 2nd edition.

Heuer, A. and Saake, G. (1999). Datenbanken: Implementierungstechniken. Int'l Thomson Publishing, Bonn.

Lockemann, P. and Dittrich, K. (1987). Architektur von Datenbanksystemen, chapter 2 in (Lockemann and Schmidt, 1987). Springer.

Lockemann, P. and Schmidt, J., editors (1987). Datenbank-Handbuch. Springer-Verlag.

Mitschang, B. (1995). Anfrageverarbeitung in Datenbanksystemen - Entwurfs- und Implementierungsaspekte. Vieweg.

O'Neil, E., O'Neil, P., and Weikum, G. (1999). An optimality proof of the LRU-K replacement algorithm. Journal of the ACM, 46(1).

Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.

Stonebraker, M. (1981). Operating system support for database management. Communications of the ACM, 24(7).


CS 416: Opera-ng Systems Design March 23, 2012 Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

CSE380 - Operating Systems

CSE380 - Operating Systems CSE380 - Operating Systems Notes for Lecture 17-11/10/05 Matt Blaze, Micah Sherr (some examples by Insup Lee) Implementing File Systems We ve looked at the user view of file systems names, directory structure,

More information

Unit 2 Buffer Pool Management

Unit 2 Buffer Pool Management Unit 2 Buffer Pool Management Based on: Sections 9.4, 9.4.1, 9.4.2 of Ramakrishnan & Gehrke (text); Silberschatz, et. al. ( Operating System Concepts ); Other sources Original slides by Ed Knorr; Updates

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access File File System Implementation Operating Systems Hebrew University Spring 2009 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Chapter 4 Memory Management. Memory Management

Chapter 4 Memory Management. Memory Management Chapter 4 Memory Management 4.1 Basic memory management 4.2 Swapping 4.3 Virtual memory 4.4 Page replacement algorithms 4.5 Modeling page replacement algorithms 4.6 Design issues for paging systems 4.7

More information

Disk Scheduling COMPSCI 386

Disk Scheduling COMPSCI 386 Disk Scheduling COMPSCI 386 Topics Disk Structure (9.1 9.2) Disk Scheduling (9.4) Allocation Methods (11.4) Free Space Management (11.5) Hard Disk Platter diameter ranges from 1.8 to 3.5 inches. Both sides

More information

Chapter 11. I/O Management and Disk Scheduling

Chapter 11. I/O Management and Disk Scheduling Operating System Chapter 11. I/O Management and Disk Scheduling Lynn Choi School of Electrical Engineering Categories of I/O Devices I/O devices can be grouped into 3 categories Human readable devices

More information

CS 405G: Introduction to Database Systems. Storage

CS 405G: Introduction to Database Systems. Storage CS 405G: Introduction to Database Systems Storage It s all about disks! Outline That s why we always draw databases as And why the single most important metric in database processing is the number of disk

More information

Chapter 4 File Systems. Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved

Chapter 4 File Systems. Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved Chapter 4 File Systems File Systems The best way to store information: Store all information in virtual memory address space Use ordinary memory read/write to access information Not feasible: no enough

More information

Unit 2 Buffer Pool Management

Unit 2 Buffer Pool Management Unit 2 Buffer Pool Management Based on: Pages 318-323, 541-542, and 586-587 of Ramakrishnan & Gehrke (text); Silberschatz, et. al. ( Operating System Concepts ); Other sources Original slides by Ed Knorr;

More information

Database design and implementation CMPSCI 645. Lecture 08: Storage and Indexing

Database design and implementation CMPSCI 645. Lecture 08: Storage and Indexing Database design and implementation CMPSCI 645 Lecture 08: Storage and Indexing 1 Where is the data and how to get to it? DB 2 DBMS architecture Query Parser Query Rewriter Query Op=mizer Query Executor

More information

OS and Hardware Tuning

OS and Hardware Tuning OS and Hardware Tuning Tuning Considerations OS Threads Thread Switching Priorities Virtual Memory DB buffer size File System Disk layout and access Hardware Storage subsystem Configuring the disk array

More information

Storage and File Structure

Storage and File Structure CSL 451 Introduction to Database Systems Storage and File Structure Department of Computer Science and Engineering Indian Institute of Technology Ropar Narayanan (CK) Chatapuram Krishnan! Summary Physical

More information

Memory Allocation. Copyright : University of Illinois CS 241 Staff 1

Memory Allocation. Copyright : University of Illinois CS 241 Staff 1 Memory Allocation Copyright : University of Illinois CS 241 Staff 1 Recap: Virtual Addresses A virtual address is a memory address that a process uses to access its own memory Virtual address actual physical

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Storage Devices for Database Systems

Storage Devices for Database Systems Storage Devices for Database Systems 5DV120 Database System Principles Umeå University Department of Computing Science Stephen J. Hegner hegner@cs.umu.se http://www.cs.umu.se/~hegner Storage Devices for

More information

Virtual Memory COMPSCI 386

Virtual Memory COMPSCI 386 Virtual Memory COMPSCI 386 Motivation An instruction to be executed must be in physical memory, but there may not be enough space for all ready processes. Typically the entire program is not needed. Exception

More information

Operating Systems (2INC0) 2017/18

Operating Systems (2INC0) 2017/18 Operating Systems (2INC0) 2017/18 Virtual Memory (10) Dr Courtesy of Dr I Radovanovic, Dr R Mak System rchitecture and Networking Group genda Recap memory management in early systems Principles of virtual

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006 November 21, 2006 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds MBs to GBs expandable Disk milliseconds

More information

Chapter 14: Mass-Storage Systems

Chapter 14: Mass-Storage Systems Chapter 14: Mass-Storage Systems Disk Structure Disk Scheduling Disk Management Swap-Space Management RAID Structure Disk Attachment Stable-Storage Implementation Tertiary Storage Devices Operating System

More information

OS and HW Tuning Considerations!

OS and HW Tuning Considerations! Administração e Optimização de Bases de Dados 2012/2013 Hardware and OS Tuning Bruno Martins DEI@Técnico e DMIR@INESC-ID OS and HW Tuning Considerations OS " Threads Thread Switching Priorities " Virtual

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Datenbanksysteme II: Caching and File Structures. Ulf Leser Datenbanksysteme II: Caching and File Structures Ulf Leser Content of this Lecture Caching Overview Accessing data Cache replacement strategies Prefetching File structure Index Files Ulf Leser: Implementation

More information

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo PAGE REPLACEMENT Operating Systems 2015 Spring by Euiseong Seo Today s Topics What if the physical memory becomes full? Page replacement algorithms How to manage memory among competing processes? Advanced

More information

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement"

CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" October 3, 2012 Ion Stoica http://inst.eecs.berkeley.edu/~cs162 Lecture 9 Followup: Inverted Page Table" With

More information

Operating System Concepts

Operating System Concepts Chapter 9: Virtual-Memory Management 9.1 Silberschatz, Galvin and Gagne 2005 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped

More information

Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management

Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management Lecture Overview Mass storage devices Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management Operating Systems - June 28, 2001 Disk Structure Disk drives are

More information

Module 13: Secondary-Storage

Module 13: Secondary-Storage Module 13: Secondary-Storage Disk Structure Disk Scheduling Disk Management Swap-Space Management Disk Reliability Stable-Storage Implementation Tertiary Storage Devices Operating System Issues Performance

More information

Operating Systems Design Exam 2 Review: Fall 2010

Operating Systems Design Exam 2 Review: Fall 2010 Operating Systems Design Exam 2 Review: Fall 2010 Paul Krzyzanowski pxk@cs.rutgers.edu 1 1. Why could adding more memory to a computer make it run faster? If processes don t have their working sets in

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management CSE 120 July 18, 2006 Day 5 Memory Instructor: Neil Rhodes Translation Lookaside Buffer (TLB) Implemented in Hardware Cache to map virtual page numbers to page frame Associative memory: HW looks up in

More information

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy (Cont.) Storage Hierarchy. Magnetic Hard Disk Mechanism

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy (Cont.) Storage Hierarchy. Magnetic Hard Disk Mechanism Chapter 11: Storage and File Structure Overview of Storage Media Magnetic Disks Characteristics RAID Database Buffers Structure of Records Organizing Records within Files Data-Dictionary Storage Classifying

More information

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy. Storage Hierarchy (Cont.) Speed

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy. Storage Hierarchy (Cont.) Speed Chapter 11: Storage and File Structure Overview of Storage Media Magnetic Disks Characteristics RAID Database Buffers Structure of Records Organizing Records within Files Data-Dictionary Storage Classifying

More information

3/3/2014! Anthony D. Joseph!!CS162! UCB Spring 2014!

3/3/2014! Anthony D. Joseph!!CS162! UCB Spring 2014! Post Project 1 Class Format" CS162 Operating Systems and Systems Programming Lecture 11 Page Allocation and Replacement" Mini quizzes after each topic Not graded Simple True/False Immediate feedback for

More information