UNIT III DATA STORAGE AND QUERY PROCESSING

Although a database system provides a high-level view of data, ultimately data have to be stored as bits on one or more storage devices. The vast majority of databases today store data on magnetic disk and fetch data into main memory for processing, or copy data onto tapes and other backup devices for archival storage. The physical characteristics of storage devices play a major role in the way data are stored, in particular because access to a random piece of data on disk is much slower than memory access: disk access takes tens of milliseconds, whereas memory access takes a tenth of a microsecond. Many queries reference only a small proportion of the records in a file. An index is a structure that helps locate desired records of a relation quickly, without examining all records. User queries have to be executed on the database contents, which reside on storage devices. It is usually convenient to break up queries into smaller operations, roughly corresponding to the relational algebra operations. There are many alternative ways of processing a query, and these can have widely varying costs. Query optimization refers to the process of finding the lowest-cost method of evaluating a given query.

Storage Devices
Computer Storage Medium (Hierarchy) Factors: cost, capacity, speed. Primary Storage: data processed directly by the CPU; main memory, cache memory. Secondary (on-line) Storage: data must first be copied into primary storage for processing; magnetic disks. Tertiary (off-line) Storage: optical disks (direct access), magnetic tapes (sequential).
Classification of Physical Storage Media
Speed with which data can be accessed Cost per unit of data Reliability: data loss on power failure or system crash; physical failure of the storage device Can differentiate storage into: volatile storage: loses contents when power is switched off non-volatile storage: contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed-up main memory.
Physical Storage Media
Cache: fastest and most costly form of storage; volatile; managed by the computer system hardware. Main memory: o fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds) o generally too small (or too expensive) to store the entire database; capacities of up to a few gigabytes are widely used currently

o Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2 to 3 years) o Volatile: contents of main memory are usually lost if a power failure or system crash occurs.
Flash memory o Data survives power failure o Data can be written at a location only once, but the location can be erased and written to again o Can support only a limited number of write/erase cycles; erasing of memory has to be done to an entire bank of memory o Reads are roughly as fast as main memory, but writes are slow (a few microseconds), and erase is slower o Cost per unit of storage roughly similar to main memory o Widely used in embedded devices such as digital cameras o Also known as EEPROM (Electrically Erasable Programmable Read-Only Memory)
Magnetic-disk o Data is stored on a spinning disk, and read/written magnetically o Primary medium for the long-term storage of data; typically stores the entire database. o Data must be moved from disk to main memory for access, and written back for storage Much slower access than main memory direct-access: possible to read data on disk in any order, unlike magnetic tape Hard disks vs floppy disks Capacities range up to roughly 100 GB currently Much larger capacity and much lower cost/byte than main memory/flash memory

Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2 years) Survives power failures and system crashes; disk failure can destroy data, but is very rare Optical storage non-volatile, data is read optically from a spinning disk using a laser CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms Write-once, read-many (WORM) optical disks used for archival storage (CD-R and DVD-R) Multiple-write versions also available (CD-RW, DVD-RW, and DVD-RAM) Reads and writes are slower than with magnetic disk Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, available for storing large volumes of data Tape storage non-volatile, used primarily for backup (to recover from disk failure), and for archival data sequential-access: much slower than disk very high capacity (40 to 300 GB tapes available) tape can be removed from drive; storage costs much cheaper than disk, but drives are expensive Tape jukeboxes available for storing massive amounts of data: hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes) Storage Hierarchy

primary storage: Fastest media but volatile (cache, main memory). secondary storage: next level in hierarchy, non-volatile, moderately fast access time o also called on-line storage o E.g. flash memory, magnetic disks tertiary storage: lowest level in hierarchy, non-volatile, slow access time o also called off-line storage o E.g. magnetic tape, optical storage Magnetic Hard Disk Mechanism

Read-write head o Positioned very close to the platter surface (almost touching it) o Reads or writes magnetically encoded information. Surface of platter divided into circular tracks o Over 16,000 tracks per platter on typical hard disks Each track is divided into sectors. o A sector is the smallest unit of data that can be read or written. o Sector size typically 512 bytes o Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks) To read/write a sector o disk arm swings to position head on right track o platter spins continually; data is read/written as sector passes under head Head-disk assemblies o multiple disk platters on a single spindle (typically 2 to 4) o one head per platter, mounted on a common arm. Cylinder i consists of ith track of all the platters Earlier generation disks were susceptible to head-crashes o Surface of earlier generation disks had metal-oxide coatings which would disintegrate on head crash and damage all data on disk

o Current generation disks are less susceptible to such disastrous failures, although individual sectors may get corrupted Disk controller interfaces between the computer system and the disk drive hardware. o accepts high-level commands to read or write a sector o initiates actions such as moving the disk arm to the right track and actually reading or writing the data o Computes and attaches checksums to each sector to verify that data is read back correctly If data is corrupted, with very high probability stored checksum won t match recomputed checksum o Ensures successful writing by reading back sector after writing it o Performs remapping of bad sectors Performance Measures of Disks Access time the time it takes from when a read or write request is issued to when data transfer begins. Consists of: o Seek time time it takes to reposition the arm over the correct track. Average seek time is 1/2 the worst case seek time. Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement o 4 to 10 milliseconds on typical disks Rotational latency time it takes for the sector to be accessed to appear under the head. Average latency is 1/2 of the worst case latency. 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.) Data-transfer rate the rate at which data can be retrieved from or stored to the disk. o 4 to 8 MB per second is typical o Multiple disks may share a controller, so rate that controller can handle is also important

E.g., ATA-5: 66 MB/s, SCSI-3: 40 MB/s, Fiber Channel: 256 MB/s Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure. o Typically 3 to 5 years o Probability of failure of new disks is quite low, corresponding to a theoretical MTTF of 30,000 to 1,200,000 hours for a new disk E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours o MTTF decreases as the disk ages RAID RAID: Redundant Arrays of Independent Disks o disk organization techniques that manage a large number of disks, providing a view of a single disk of high capacity and high speed by using multiple disks in parallel, and high reliability by storing data redundantly, so that data can be recovered even if a disk fails The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. o E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days) o Techniques for using redundancy to avoid data loss are critical with large numbers of disks Originally a cost-effective alternative to large, expensive disks o The I in RAID originally stood for "inexpensive" o Today RAIDs are used for their higher reliability and bandwidth. The I is interpreted as independent.
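To make the figures above concrete, here is a minimal Python sketch of the back-of-the-envelope arithmetic; the seek time, r.p.m., transfer rate, and block size are illustrative assumptions in the ranges quoted above, not measurements of any particular drive.

```python
# Rough disk figures; parameter values are example numbers from the notes above.

def access_time_ms(seek_ms=8.0, rpm=7200, transfer_mb_s=40.0, block_kb=4):
    """Average time to read one block: seek + rotational latency + transfer."""
    rotational_latency_ms = 0.5 * (60_000 / rpm)          # half a rotation, in ms
    transfer_ms = (block_kb / 1024) / transfer_mb_s * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

def system_mttf_hours(disk_mttf_hours=100_000, num_disks=100):
    """Expected time until the first failure among num_disks independent disks."""
    return disk_mttf_hours / num_disks

print(f"one random block read ~ {access_time_ms():.2f} ms")
print(f"100-disk array MTTF   ~ {system_mttf_hours():.0f} hours")  # ~1000 h, about 41 days
```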

RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits o Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics RAID Level 0: Block striping; non-redundant. o Used in high-performance applications where data loss is not critical. RAID Level 1: Mirrored disks with block striping o Offers best write performance. o Popular for applications such as storing log files in a database system. RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping. RAID Level 3: Bit-Interleaved Parity o a single parity bit is enough for error correction, not just detection, since we know which disk has failed When writing data, corresponding parity bits must also be computed and written to a parity bit disk To recover data in a damaged disk, compute XOR of bits from the other disks (including the parity bit disk)
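A minimal sketch of this XOR arithmetic in Python, operating on in-memory byte strings standing in for disk blocks (the block contents are made up for illustration); real RAID controllers perform the same XOR over physical sectors:

```python
# Illustrative parity arithmetic over in-memory "blocks" (byte strings).
def parity_block(blocks):
    """XOR all blocks together to form (or recompute) the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover_block(surviving_blocks, parity):
    """Rebuild the block of a failed disk from the surviving blocks plus parity."""
    return parity_block(list(surviving_blocks) + [parity])

stripe = [b"disk0data", b"disk1data", b"disk2data"]    # hypothetical data blocks
parity = parity_block(stripe)
assert recover_block(stripe[1:], parity) == stripe[0]  # "disk 0" reconstructed
```

The parity-update shortcut mentioned for RAID Level 4 below (new parity = old parity XOR old block XOR new block) follows from the same arithmetic.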

RAID Level 3 (Cont.) o Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O. o Subsumes Level 2 (provides all its benefits, at lower cost). RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks. o When writing a data block, the corresponding block of parity bits must also be computed and written to the parity disk o To find the value of a damaged block, compute XOR of bits from corresponding blocks (including the parity block) from the other disks. RAID Level 4 (Cont.) o Provides higher I/O rates for independent block reads than Level 3: a block read goes to a single disk, so blocks stored on different disks can be read in parallel o Provides higher transfer rates for reads of multiple blocks than no striping o Before writing a block, parity data must be computed

Can be done by using the old parity block, the old value of the current block and the new value of the current block (2 block reads + 2 block writes), or by recomputing the parity value using the new values of blocks corresponding to the parity block, which is more efficient for writing large amounts of data sequentially o Parity block becomes a bottleneck for independent block writes since every block write also writes to the parity disk RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk. o E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
Storage of Databases
Main Memory Databases: the entire database is kept in main memory; main memory is volatile storage and requires a backup copy (on magnetic disk). Most databases are stored permanently on magnetic disk because they

are too large to fit entirely in main memory, and magnetic disk is less expensive.
File Records on Disk
Records: a file is a sequence of records; record type = field names + data types Fixed-Length Records: records with the same size in a file Variable-Length Records (with separators): records of different sizes, caused by multi-valued fields, optional fields, or variable-length fields File Blocks on Disk Disk Block: unit of data transfer between disk and memory; records of a file are allocated to disk blocks; usually 512 to 4K bytes (K = 1024) Blocking Factor (bfr): number of (fixed-length) records in a block; bfr = ⌊B/R⌋, where B = block size, R = record size (in bytes) Spanned vs. Unspanned File Org. Unspanned: records do not cross block boundaries, leaving the remaining space in each block unused Spanned: a record may span blocks, utilizing the unused space Contiguous vs. Linked Allocation Contiguous: file blocks are allocated to consecutive disk blocks Linked: each file block contains a pointer to the next block Operations on Files Types of Operations Retrieval: do not change data in the file (open/close a file, find/read records)

Update: change the file by insertion, deletion or modification of records Record-at-a-time: operations are applied to a single record Set-at-a-time: operations are applied to a set of records or to the whole file File Open/Close Operations Open: readies the file for access, allocates buffers to hold file blocks, sets the file pointer to the beginning of the file Close: terminates access to the file Record-at-a-time Operations Find: searches for the first file record that satisfies a certain condition (selection condition), and makes it the current file record FindNext: searches for the next file record (from the current record) that satisfies the condition and makes it the current file record Read: reads the current file record Insert: inserts a new record into the file and makes it the current file record Delete: removes the current file record from the file by marking the record to indicate that it is no longer valid Modify: changes the values of some fields of the current file record Set-at-a-time Operations FindAll: locates all the records satisfying a search condition FindOrdered: retrieves all the records in a specific order Reorganize: reorganizes the records after update operations Operation Factors Access Type: attribute value (=) or range (>) Access Time: to find a particular record(s) Insertion Time: to insert a new record (find the place to insert + index structure update) Deletion Time: to delete a record (find the record(s) to delete + index structure update) Space Overhead: additional space occupied by an index structure

Primary vs. Secondary File Organizations Primary File Organizations Heap Files Sorted Files Hashing Secondary File Organizations (Index) Single-level or Multi-level Indexes B-trees B+-trees Heap Files Files of Unordered Records simplest and most basic file organization new records are inserted at the end of the file Access: linear search requires searching through the file block by block (N/2 file blocks on average if the record exists, N file blocks if not), very inefficient (it takes O(N) time) Insertion: very efficient (random order) Deletion: must first find its block, inefficient Direct File allows direct access by the position of a record in a file applies only to fixed-length records, contiguous allocation, and unspanned blocks file records are numbered 0, 1, ..., r-1 (e.g., r = 120); records in each block are numbered 0, 1, ..., bfr-1 (e.g., bfr = 15); for the ith record of a file (e.g., i = 43): block position = ⌊i/bfr⌋, record position in the block = i mod bfr Files of Ordered Records file records are kept sorted by the values of an ordering field (sequential file): Access: binary search (on its ordering field) requires reading and searching log2(b) of the file blocks on average, where b is the number of file blocks (O(log N) time), an improvement over linear search

Insertion: records must be inserted in the correct order, very inefficient Deletion: inefficient, less expensive with a deletion marker and periodic reorganization FindOrdered: reading the records in order of the ordering key values is extremely efficient Overflow: temporary unordered file for new records to improve insertion efficiency, periodically merged with the main ordered file Hashing Hash Functions records in the file are unordered determine the address (B) of a record based on the value of the hash field (K) in the record: h(K) -> B, e.g., h(K) = K mod M (yielding values 0, 1, ..., M-1) allow direct access to the target disk block; searching for the record within the block is done in main memory Internal Hashing hashing for an internal file hash table as an array of records a noninteger hash field value, such as a name, can be transformed into an integer (e.g., using ASCII codes) Collision (of hash addresses) occurs when two hash field values are mapped into the same hash address Collision Resolution Open Addressing: checks the subsequent positions in order until an empty position is found Chaining: extends the array with a number of overflow positions

use a linked list of overflow records for each hash address; the overflow pointer refers to the position of the next record Multiple Hashing applies a second hash function if the first hash function results in a collision uses open addressing or applies a third hash function if another collision results Good Hashing Function uniform and random distribution of records keep the hash table 70-90% full to minimize collisions while leaving fewer unused locations External Hashing Hashing Function target address space is made of buckets (one disk block or a cluster of contiguous blocks) maps a hash field value into a bucket number the bucket number is then converted to the corresponding disk block address collision is less severe with buckets because a bucket can hold as many records as will fit in it Bucket Overflow occurs when a bucket is filled to capacity can be solved by the chaining method: a pointer is maintained in each bucket to a linked list of overflow records for the bucket record pointers include both a block address and a relative record position within the block Static Hashing very fast access to records by the hash field a fixed number of buckets M is allocated not suitable for dynamic files (which grow and shrink dynamically) difficult to determine the number of buckets in advance
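A minimal Python sketch of static external hashing with overflow chaining as described above; buckets are plain lists standing in for disk blocks, and the bucket capacity and number of buckets M are illustrative values:

```python
# Static external hashing with overflow chaining (toy, in-memory version).
M = 8                 # fixed number of primary buckets (static hashing)
BUCKET_CAPACITY = 4   # assumed number of records that fit in one bucket/block

buckets = [{"records": [], "overflow": []} for _ in range(M)]

def h(key):
    """Hash function: map a hash-field value to a bucket number."""
    return key % M

def insert(key, record):
    b = buckets[h(key)]
    if len(b["records"]) < BUCKET_CAPACITY:
        b["records"].append((key, record))
    else:
        # bucket overflow: chain the record in the bucket's overflow list
        b["overflow"].append((key, record))

def lookup(key):
    b = buckets[h(key)]
    return [r for k, r in b["records"] + b["overflow"] if k == key]

for i in range(40):
    insert(i, f"record-{i}")
print(lookup(17))     # -> ['record-17']
```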

Dynamic Hashing dynamic files require a dynamic hashing technique Extendible Hashing maintains a directory of 2^d bucket addresses uses the first d bits of a hash value to determine a directory entry and then a bucket address d = global depth of the directory, d' = local depth of a bucket the directory expands and shrinks dynamically bucket doubling (split) vs. halving (merge) update directory and local depth appropriately Indexing and Hashing Basic Concepts Indexing mechanisms are used to speed up access to desired data. E.g., author catalog in library Search Key - attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form: search-key value, pointer Index files are typically much smaller than the original file Two basic kinds of indices: Ordered indices: search keys are stored in sorted order Hash indices: search keys are distributed uniformly across buckets using a hash function. Index Evaluation Metrics Access types supported efficiently. E.g., records with a specified value in the attribute or records with an attribute value falling in a specified range of values. Access time Insertion time Deletion time

Space overhead Ordered Indices In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in library. Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called clustering index The search key of a primary index is usually but not necessarily the primary key. Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index. Index-sequential file: ordered sequential file with a primary index. Dense Index Files Dense index: an index record appears for every search-key value in the file. Sparse Index Files Sparse Index: contains index records for only some search-key values. Applicable when records are sequentially ordered on the search key To locate a record with search-key value K we: Find the index record with the largest search-key value less than or equal to K Search the file sequentially starting at the record to which the index record points
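A minimal sketch of the sparse-index lookup just described; the block layout and key values are made up for illustration:

```python
import bisect

# Sparse index: one entry per block, holding the least search-key value in it.
blocks = [
    [10, 20, 30],      # block 0
    [40, 50, 60],      # block 1
    [70, 80, 90],      # block 2
]
index = [(blk[0], i) for i, blk in enumerate(blocks)]   # (least key, block no.)

def sparse_lookup(k):
    """Find the index entry with the largest key <= k, then scan that block."""
    keys = [key for key, _ in index]
    pos = bisect.bisect_right(keys, k) - 1
    if pos < 0:
        return None
    _, block_no = index[pos]
    return k if k in blocks[block_no] else None

print(sparse_lookup(50))   # -> 50 (found via block 1)
print(sparse_lookup(55))   # -> None
```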

Compared to dense indices: Less space and less maintenance overhead for insertions and deletions. Generally slower than a dense index for locating records. Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block. Multilevel Index If the primary index does not fit in memory, access becomes expensive. Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it. outer index: a sparse index of the primary index; inner index: the primary index file If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. Single-level index deletion: Dense indices: deletion of the search-key is similar to file record deletion. Sparse indices: if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced. Index Update: Insertion Single-level index insertion:

Perform a lookup using the search-key value appearing in the record to be inserted. Dense indices if the search-key value does not appear in the index, insert it. Sparse indices if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If a new block is created, the first search-key value appearing in the new block is inserted into the index. Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms Secondary Indices Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary index) satisfy some condition. Example 1: In the account relation stored sequentially by account number, we may want to find all accounts in a particular branch Example 2: as above, but where we want to find all accounts with a specified balance or range of balances We can have a secondary index with an index record for each search-key value Secondary Indices Example Secondary index on balance field of account Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. Secondary indices have to be dense
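A minimal sketch of a dense secondary index whose entries point to buckets of record pointers, in the spirit of the balance example above; the relation contents are made up, and "record pointers" are simulated with list positions:

```python
# Dense secondary index on a non-ordering field ("balance").
accounts = [
    ("A-101", "Downtown",   500),
    ("A-215", "Mianus",     700),
    ("A-102", "Perryridge", 400),
    ("A-305", "Round Hill", 700),
]

secondary_index = {}                       # balance -> bucket of record pointers
for rid, (_, _, balance) in enumerate(accounts):
    secondary_index.setdefault(balance, []).append(rid)

def find_by_balance(balance):
    """Follow the bucket of pointers for this search-key value."""
    return [accounts[rid] for rid in secondary_index.get(balance, [])]

print(find_by_balance(700))   # both accounts with balance 700
```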

Primary and Secondary Indices Indices offer substantial benefits when searching for records. BUT: Updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated. Sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive Each record access may fetch a new block from disk Block fetch requires about 5 to 10 milliseconds versus about 100 nanoseconds for memory access B+-Tree Index Files B+-tree indices are an alternative to indexed-sequential files. Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required. Advantage of B+-tree index files: automatically reorganizes itself with small, local changes, in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance. (Minor) disadvantage of B+-trees: extra insertion and deletion overhead, space overhead. Advantages of B+-trees outweigh disadvantages B+-trees are used extensively A B+-tree is a rooted tree satisfying the following properties: All paths from root to leaf are of the same length Each node that is not a root or a leaf has between ⌈n/2⌉ and n children. A leaf node has between ⌈(n-1)/2⌉ and n-1 values Special cases: If the root is not a leaf, it has at least 2 children. If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and n-1 values.
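A small sketch making the ⌈n/2⌉-style occupancy bounds concrete (pure arithmetic; the n = 5 case matches the worked example further below):

```python
import math

def bplus_tree_bounds(n):
    """Occupancy bounds for a B+-tree of order n (at most n pointers per node)."""
    return {
        "nonleaf_children": (math.ceil(n / 2), n),            # non-root internal nodes
        "leaf_values":      (math.ceil((n - 1) / 2), n - 1),  # leaf nodes
        "root_children_min": 2,                               # if the root is not a leaf
    }

print(bplus_tree_bounds(5))
# {'nonleaf_children': (3, 5), 'leaf_values': (2, 4), 'root_children_min': 2}
```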

B+-Tree Node Structure Typical node: Ki are the search-key values; Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). The search-keys in a node are ordered: K1 < K2 < K3 < ... < Kn-1 Leaf Nodes in B+-Trees Properties of a leaf node: For i = 1, 2, ..., n-1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. Only need the bucket structure if the search key does not form a primary key. If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values Pn points to the next leaf node in search-key order Non-Leaf Nodes in B+-Trees Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers: All the search-keys in the subtree to which P1 points are less than K1 For 2 <= i <= n-1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki-1 and less than Ki All the search-keys in the subtree to which Pn points have values greater than or equal to Kn-1
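A minimal two-level illustration of how a lookup walks from a non-leaf node to the chained leaf nodes; the nodes are hand-built and there is no insertion or splitting logic:

```python
import bisect

# Hand-built two-level B+-tree fragment: one non-leaf (root) node over three
# leaf nodes.  Each leaf keeps its keys plus a pointer to the next leaf, so a
# range scan can follow the leaf chain.
leaf1 = {"keys": [5, 10],  "next": None}
leaf2 = {"keys": [20, 30], "next": None}
leaf3 = {"keys": [40, 55], "next": None}
leaf1["next"], leaf2["next"] = leaf2, leaf3

root = {"keys": [20, 40], "children": [leaf1, leaf2, leaf3]}   # non-leaf node

def find_leaf(node, k):
    """Descend: follow the child whose key range covers k (>= Ki-1 and < Ki)."""
    i = bisect.bisect_right(node["keys"], k)
    return node["children"][i]

def lookup(k):
    return k in find_leaf(root, k)["keys"]

def range_scan(lo, hi):
    """Find the first relevant leaf, then follow the leaf chain."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        out += [k for k in leaf["keys"] if lo <= k <= hi]
        if leaf["keys"] and leaf["keys"][-1] >= hi:
            break
        leaf = leaf["next"]
    return out

print(lookup(30), range_scan(10, 45))   # True [10, 20, 30, 40]
```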

Example of a B+-tree B+-tree for account file (n = 3) B+-tree for account file (n = 5) Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n = 5). Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5). Root must have at least 2 children. B-Tree Index Files Similar to B+-tree, but a B-tree allows search-key values to appear only once; eliminates redundant storage of search keys. Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included. Generalized B-tree leaf node Nonleaf node: pointers Bi are the bucket or file record pointers.

B-Tree Index File Example B-tree (above) and B+-tree (below) on the same data Advantages of B-Tree indices: May use fewer tree nodes than a corresponding B+-Tree. Sometimes possible to find the search-key value before reaching a leaf node. Disadvantages of B-Tree indices: Only a small fraction of all search-key values are found early Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth than the corresponding B+-Tree Insertion and deletion are more complicated than in B+-Trees Implementation is harder than B+-Trees. Typically, the advantages of B-Trees do not outweigh the disadvantages. Query Processing Overview

Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation Parsing and translation translate the query into its internal form. This is then translated into relational algebra. Parser checks syntax, verifies relations Evaluation The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. Basic Steps in Query Processing : Optimization

A relational algebra expression may have many equivalent expressions E.g., σbalance<2500(Πbalance(account)) is equivalent to Πbalance(σbalance<2500(account)) Each relational algebra operation can be evaluated using one of several different algorithms Correspondingly, a relational-algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called an evaluation plan. E.g., can use an index on balance to find accounts with balance < 2500, or can perform a complete relation scan and discard accounts with balance >= 2500 Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. Cost is estimated using statistical information from the database catalog, e.g., number of tuples in each relation, size of tuples, etc.

For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures: tT = time to transfer one block, tS = time for one seek; cost for b block transfers plus S seeks = b * tT + S * tS We ignore CPU costs for simplicity Real systems do take CPU cost into account We do not include the cost of writing output to disk in our cost formulae Several algorithms can reduce disk I/O by using extra buffer space Amount of real memory available to the buffer depends on other concurrent queries and OS processes, known only during execution We often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available Required data may be buffer-resident already, avoiding disk I/O But hard to take into account for cost estimation Selection Operation File scan: search algorithms that locate and retrieve records that fulfill a selection condition. Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition. Cost estimate = br block transfers + 1 seek br denotes the number of blocks containing records from relation r If selection is on a key attribute, can stop on finding the record: cost = (br/2) block transfers + 1 seek Linear search can be applied regardless of selection condition or ordering of records in the file, or availability of indices
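A small sketch of these cost formulas for linear search; the values of tT and tS are assumptions chosen only for illustration:

```python
# Cost model: b block transfers plus S seeks -> b*tT + S*tS (in milliseconds).
tT = 0.1    # assumed time to transfer one block
tS = 4.0    # assumed time for one seek

def cost_ms(block_transfers, seeks):
    return block_transfers * tT + seeks * tS

def linear_search_cost(br, key_attribute=False):
    """A1: scan the whole file, or on average half of it for a key attribute."""
    transfers = br / 2 if key_attribute else br
    return cost_ms(transfers, seeks=1)

print(linear_search_cost(10_000))                      # full scan
print(linear_search_cost(10_000, key_attribute=True))  # stop at first match, on average
```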

A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered. Assume that the blocks of a relation are stored contiguously Cost estimate (number of disk blocks to be scanned): cost of locating the first tuple by a binary search on the blocks: log2(br) * (tT + tS) If there are multiple records satisfying the selection Add the transfer cost of the number of blocks containing records that satisfy the selection condition Selections Using Indices Index scan: search algorithms that use an index; the selection condition must be on the search key of the index. A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition Cost = (hi + 1) * (tT + tS), where hi is the height of the index A4 (primary index on nonkey, equality) Retrieve multiple records. Records will be on consecutive blocks Let b = number of blocks containing matching records Cost = hi * (tT + tS) + tS + tT * b A5 (equality on search-key of secondary index). Retrieve a single record if the search key is a candidate key Cost = (hi + 1) * (tT + tS) Retrieve multiple records if the search key is not a candidate key: each of n matching records may be on a different block Cost = (hi + n) * (tT + tS) Can be very expensive! Join Operation Several different algorithms to implement joins Nested-loop join

Block nested-loop join Indexed nested-loop join Merge-join Hash-join Choice based on cost estimate Examples use the following information Number of records of customer: 10,000; depositor: 5000 Number of blocks of customer: 400; depositor: 100 Merge-Join 1. Sort both relations on their join attribute (if not already sorted on the join attributes). 2. Merge the sorted relations to join them 1. Join step is similar to the merge stage of the sort-merge algorithm. 2. Main difference is handling of duplicate values in the join attribute: every pair with the same value on the join attribute must be matched 3. Detailed algorithm in book Can be used only for equi-joins and natural joins Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory) Thus the cost of merge join is: br + bs block transfers + ⌈br/bb⌉ + ⌈bs/bb⌉ seeks (where bb buffer blocks are allocated to each relation) + the cost of sorting if relations are unsorted. hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute

Merge the sorted relation with the leaf entries of the B+-tree. Sort the result on the addresses of the unsorted relation's tuples Scan the unsorted relation in physical address order and merge with the previous result, to replace addresses by the actual tuples Sequential scan more efficient than random lookup Hash-Join Applicable for equi-joins and natural joins. A hash function h is used to partition tuples of both relations h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join. r0, r1, ..., rn denote partitions of r tuples Each tuple tr in r is put in partition ri where i = h(tr [JoinAttrs]). s0, s1, ..., sn denote partitions of s tuples Each tuple ts in s is put in partition si, where i = h(ts [JoinAttrs]). Note: In the book, ri is denoted as Hri, si is denoted as Hsi and n is denoted as nh. r tuples in ri need only be compared with s tuples in si Need not be compared with s tuples in any other partition, since: an r tuple and an s tuple that satisfy the join condition will have the same value for the join attributes.

If that value is hashed to some value i, the r tuple has to be in ri and the s tuple in si. Hash-Join Algorithm The hash-join of r and s is computed as follows. 1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition. 2. Partition r similarly. 3. For each i: (a) Load si into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one h. (b) Read the tuples in ri from the disk one by one. For each tuple tr locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes. The value n and the hash function h are chosen such that each si should fit in memory. Typically n is chosen as ⌈bs/M⌉ * f, where f is a fudge factor, typically around 1.2 The probe relation partitions ri need not fit in memory Recursive partitioning required if the number of partitions n is greater than the number of pages M of memory. Instead of partitioning n ways, use M - 1 partitions for s Further partition the M - 1 partitions using a different hash function Use the same partitioning method on r Rarely required: e.g., recursive partitioning not needed for relations of 1GB or less with memory size of 2MB, with block size of 4KB.
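A minimal in-memory sketch of the partitioning idea behind hash join; everything fits in memory here, so there is no recursive partitioning, and the relation contents and the number of partitions n are illustrative:

```python
from collections import defaultdict

# Toy relations: tuples whose first field is the join attribute.
r = [(1, "r-a"), (2, "r-b"), (2, "r-c"), (5, "r-d")]
s = [(2, "s-x"), (5, "s-y"), (7, "s-z")]

def hash_join(r, s, n=4):
    # Partition both relations with the same hash function h.
    h = lambda v: hash(v) % n
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[h(t[0])].append(t)
    for t in s:
        s_parts[h(t[0])].append(t)

    out = []
    for i in range(n):
        # Build an in-memory hash index on partition s_i (the build input) ...
        build = defaultdict(list)
        for ts in s_parts[i]:
            build[ts[0]].append(ts)
        # ... then probe it with the tuples of partition r_i (the probe input).
        for tr in r_parts[i]:
            for ts in build.get(tr[0], []):
                out.append(tr + ts)
    return out

print(hash_join(r, s))   # the matching pairs on join-attribute values 2 and 5
```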

Objective Questions 1. The disk surface is logically divided into sectors, which are subdivided into tracks a. True b. False 2. SCSI is a. System Computer Small Interconnect b. System Connection Small Interface c. Small Computer System Interconnect d. Small Connection System Interface 3. The time from when a read or write request is issued to when data transfer begins is: a. Seek time b. Access time c. Transfer rate d. None of the above 4. Pick out the true statement regarding "Blocks" a. Data is transferred between disk and main memory in units called blocks b. A block is a contiguous sequence of bytes from a single track of one platter c. Block sizes range from 512 bytes to several thousand d. All of the above e. None of the above 5. Commonly used disk-arm scheduling algorithm is: a. Relavator Algorithm b. Elevator Algorithm c. None of the above 6. RAID is a. Removable Array of Inexpensive Disks b. Reliable Array of Inexpensive Disks c. Rewritable Array of Inexpensive Disks d. Redundant Array of Inexpensive Disks 7. Which of the following is correct in the case of buffer manager? a. The subsystem responsible for the allocation of buffer space b. Handles some of the requests for blocks of the database

c. If the block is already in main memory, the data in main memory is given to the requester d. None of the above 8. Which symbol is used as the end of record? a. % b. # c. d. @ e. } 9. Which of the following contains first records of a chain? a. Anchor block b. Overflow block c. Reserved space d. Pointers 10. In which type of file organization, records can be placed anywhere in the file where there is space for the record a. sequential file organization b. hashing file organization c. heap file organization d. clustering file organization 11. Indices whose search key specifies an order different from the sequential order of the file are called a. non clustering index b. primary index c. clustering index 12. In which indices,an index record appears for every search key value in file a. secondary index b. dense index c. sparse index d. multi level index 13. How many children should each nonleaf node in the tree must have? a. between n and n +n / 2 children b. between 2+ n / 2 and n children c. between 2 n* n / 2 and n / 2 children d. between n / 2 and n children

14. In processing a query, we traverse a path from the root to a leaf node. If there are K search key values in the file, this path is no longer than a. log (n) K b. log (n/2) K c. log (2n) K d. log (n/2) + K e. None of the above 15. The number of pointers in a node is called the fan-in of the node a. True b. False 16. In B+ trees file organization, if m nodes are involved in redistribution, each node can be guaranteed to contain a. at least [ ( m - 1 ) n / m ] entries b. more than [ ( m - 1 ) n / m ] entries c. less than [ ( m - 1 ) n / m ] entries d. None of the above 17. Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinking of the database which are called as a. static hash functions b. dynamic hash functions c. Both a and b 18. Out of the following which involves computing the address of a data item by computing a function on the search key value. a. Secondary indices b. B-tree c. Hashing d. B+ -tree 19. In B -tree file organization, the leaf nodes of the tree store records instead of storing pointers to records a. True b. False 20. In a hash function if K1 and K2 are search key values and K1 < K2 then a. h ( K1 ) < h ( K2 ) b. h ( K1 ) > h ( K2 ) c. h ( K1 ) = h ( K2 ) d. h ( K1 ) / h ( K2 )

PART-A 1. What is fixed length record? 2. What is variable length record? 3. What are the methods used to implement variable length record? 4. What are the various types of file organization? 5. How clustering file organization differs from other file organization? 6. What is index? What are its uses? 7. List out the types of indices? 8. What is sparse & dense index? 9. What is static and dynamic hashing? 10. What are the qualities of hash function? 11. Difference between B-tree and B+ tree? 12. What is query processing? 13. What is the use of parser and translator? 14. What is query evaluation plan? 15.What are the factors to be considered in estimating the cost of query evaluation plan? 16. What are the methods used for evaluating an expression? Two marks Questions and answers 1. What is an index? An index is a structure that helps to locate desired records of a relation quickly, without examining all records. 2. Define query optimization. Query optimization refers to the process of finding the lowest cost method of evaluating a given query. 3. What are called jukebox systems? Jukebox systems contain a few drives and numerous disks that can be loaded into one of the drives automatically. 4. What are the types of storage devices? Primary storage Secondary storage

Tertiary storage Volatile storage Nonvolatile storage 5. What is called remapping of bad sectors? If the controller detects that a sector is damaged when the disk is initially formatted, or when an attempt is made to write the sector, it can logically map the sector to a different physical location. 6. Define access time. Access time is the time from when a read or write request is issued to when data transfer begins. 7. Define seek time. The time for repositioning the arm is called the seek time, and it increases with the distance that the arm must move. 8. Define average seek time. The average seek time is the average of the seek times, measured over a sequence of random requests. 9. Define rotational latency time. The time spent waiting for the sector to be accessed to appear under the head is called the rotational latency time. 10. Define average latency time. The average latency time of the disk is one-half the time for a full rotation of the disk. 11. What is meant by data-transfer rate? The data-transfer rate is the rate at which data can be retrieved from or stored to the disk. 12. What is meant by mean time to failure? The mean time to failure is the amount of time that the system could run continuously without failure. 13. What is a block and a block number? A block is a contiguous sequence of sectors from a single track of one platter. Each request specifies the address on the disk to be referenced. That address is in the form of a block number.

14. What are called journaling file systems? File systems that support log disks are called journaling file systems. 15. What is the use of RAID? A variety of disk-organization techniques, collectively called redundant arrays of independent disks, are used to improve performance and reliability. 16. What is called mirroring? The simplest approach to introducing redundancy is to duplicate every disk. This technique is called mirroring or shadowing. 17. What is called mean time to repair? The mean time to repair is the time it takes to replace a failed disk and to restore the data on it. 18. What is called bit-level striping? Data striping consists of splitting the bits of each byte across multiple disks. This is called bit-level striping. 19. What is called block-level striping? Block-level striping stripes blocks across multiple disks. It treats the array of disks as a large disk, and gives blocks logical numbers. 20. What are the two main goals of parallelism? Load balance multiple small accesses, so that the throughput of such accesses increases. Parallelize large accesses so that the response time of large accesses is reduced. 21. What are the factors to be taken into account when choosing a RAID level? Monetary cost of extra disk storage requirements. Performance requirements in terms of number of I/O operations. Performance when a disk has failed. Performance during rebuild. 22. What is meant by software and hardware RAID systems? RAID can be implemented with no change at the hardware level, using only software modification. Such RAID implementations are called software RAID systems, and the systems with special hardware support are called hardware RAID systems.

23. Define hot swapping? Hot swapping permits the removal of faulty disks and replaces it by new ones without turning power off. Hot swapping reduces the mean time to repair. 24. What are the ways in which the variable-length records arise in database systems? Storage of multiple record types in a file. Record types that allow variable lengths for one or more fields. Record types that allow repeating fields. 25. What is the use of a slotted-page structure and what is the information present in the header? The slotted-page structure is used for organizing records within a single block. The header contains the following information. The number of record entries in the header. The end of free space An array whose entries contain the location and size of each record. 26. What are the two types of blocks in the fixed length representation? Define them. Anchor block: Contains the first record of a chain. Overflow block: Contains the records other than those that are the first record of a chain. 27. What is known as heap file organization? In the heap file organization, any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. There is a single file for each relation. 28. What is known as sequential file organization? In the sequential file organization, the records are stored in sequential order, according to the value of a search key of each record. 29. What is hashing file organization? In the hashing file organization, a hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed. 30. What is known as clustering file organization?

In the clustering file organization, records of several different relations are stored in the same file. 31. What are the types of indices? Ordered indices Hash indices 32. What are the techniques to be evaluated for both ordered indexing and hashing? Access types Access time Insertion time Deletion time Space overhead 33. What is known as a search key? An attribute or set of attributes used to look up records in a file is called a search key. 34. What is a primary index? A primary index is an index whose search key also defines the sequential order of the file. 35. What are called index-sequential files? The files that are ordered sequentially with a primary index on the search key are called index-sequential files. 36. What are the two types of indices? Dense index Sparse index 37. What are called multilevel indices? Indices with two or more levels are called multilevel indices. 38. What is a B-Tree? A B-tree eliminates the redundant storage of search-key values. It allows search-key values to appear only once. 39. What is a B+-Tree index? A B+-Tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length.

40. What is a hash index? A hash index organizes the search keys, with their associated pointers, into a hash file structure. 41. What is called query processing? Query processing refers to the range of activities involved in extracting data from a database. 42. What are the steps involved in query processing? The basic steps are: Parsing and translation Optimization Evaluation 43. What is called an evaluation primitive? A relational algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive. 44. What is called a query evaluation plan? A sequence of primitive operations that can be used to evaluate a query is a query evaluation plan or a query execution plan. 45. What is called a query execution engine? The query execution engine takes a query evaluation plan, executes that plan, and returns the answers to the query. 46. What are called as index scans? Search algorithms that use an index are referred to as index scans. 47. What is called as external sorting? Sorting of relations that do not fit into memory is called external sorting. 48. What is called as recursive partitioning? The system repeats the splitting of the input until each partition of the build input fits in memory. Such partitioning is called recursive partitioning. 49. What is called as an N-way merge? The merge operation is a generalization of the two-way merge used by the standard in-memory sort-merge algorithm. It merges N runs, so it is called an N-way merge.

The number of partitions is increased by a small value called the fudge factor, which is usually 20 percent of the number of hash partitions computed. PART-B 1. Explain the implementation of fixed length & variable length records? 2. Explain sequential, clustering, heap file organization? 3. Explain Primary index, secondary index and multilevel indices? 4. Write short notes on B-tree and B+tree? 5. Explain the various algorithms used to implement selection and join operation? 6. Dense indices are faster in general, but sparse indices require less space and impose less maintenance for insertions and deletions. Why? 7. State the difference between B+trees and B trees? 8. Draw the structure of a B +tree and explain. 9. Compare and describe Indexing and Hashing 10. What is an Ordered index? Explain the two types of Ordered indices. 11. What is meant by the term hash function? 12.State the difference between dynamic and static hashing. How does these work? 13. What are the characteristics of a magnetic disk?

14. Why are RAIDs used? Why are they called so? 15. What is a buffer? Why are buffers used? What is the role of a buffer manager in buffer management? 16. State buffer replacement policies 17. What is a file? 18. What is clustering file organisation and sequential file organisation? Book References: 1. Abraham Silberschatz, Henry F. Korth and S. Sudarshan, Database System Concepts, Fourth Edition, McGraw-Hill, 2002. 2. Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Third Edition, Pearson Education, 2003. 3. Raghu Ramakrishnan, Database Management System, Tata McGraw-Hill Publishing Company, 2003. Web Resources: http://www.cs.sfu.ca/cc/354/zaiane/material/notes/chapter10/node1.html http://www.cs.sfu.ca/cc/354/zaiane/material/notes/chapter11/node1.html http://www.cs.uku.fi/%7ekilpelai/dbms01/lectures/b-trees.ppt http://www.cs.ualberta.ca/%7ezaiane/courses/cmput391-02/slides/lect3/

http://www-courses.cs.uiuc.edu/%7ecs411/lectures/16/311-16-100.html