Overview of Storage and Indexing

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Data on External Storage Disks: Can retrieve random page at fixed cost But reading several consecutive pages is much cheaper than reading them in random order File organization: Method of arranging a file of records on external storage. Record id (rid) is sufficient to physically locate record Indexes are data structures that allow us to find the record ids of records with given values in index search key fields Architecture: Buffer manager stages pages from external storage to main memory buffer pool. File and index layers make calls to the buffer manager. 2

Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). An index supports efficient retrieval of all data entries with a given key value k. 3 Index Classification Primary vs. secondary: If search key contains primary key, then called primary index. Unique index: Search key contains a candidate key. Clustered vs. unclustered: If order of data records is the same as, or `close to, order of data entries, then called clustered index. A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly based on whether index is clustered or not! Dense vs. sparse: If there is an index data entry for each data records then called dense index. 4

B+ Tree Indexes Non-leaf Pages Leaf Pages Leaf pages contain data entries, and are chained (prev & next) Non-leaf pages contain index entries and direct searches: index entry P 0 K 1 P 1 K 2 P 2 K m P m 5 B+ Tree Insert/delete at log F N cost; keep tree heightbalanced. (F = fanout, N = # leaf pages) Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. Supports equality and range-searches efficiently. Index Entries (Direct search) Data Entries ("Sequence set") 6

Queries on B + -Trees Find all records with a search-key value of k. Start with the root node Examine the node for the smallest search-key value > k. If such a value exists, assume it is K j. The follow P i to the child node Otherwise k K m 1, where there are m pointers in the node. Then follow P m to the child node. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer. Eventually reach a leaf node. If key K i = k, follow pointer P i to the desired record or bucket. Else no record with search-key value k exists. 7 Example B+ Tree Root 17 Entries < 17 Entries >= 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* Find 28*? 29*? All > 15* and < 30* Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes. And change sometimes bubbles up the tree 8

B+ Trees in Practice Typical order: 100. Typical fill-factor: 67%. average fanout = 133 Typical capacities: Height 4: 133 4 = 312,900,700 records Height 3: 133 3 = 2,352,637 records Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 MBytes 9 Index Definition in SQL Create an index create index <index-name> on <relation-name> (<attribute-list>) E.g.: create index b-index on branch(branch-name) Use create unique index to indirectly specify and enforce the condition that the search key is a candidate keyis a candidate key. To drop an index drop index <index-name> 10

Multiple-Key Access Use multiple indices for certain types of queries. Example: select account-number from account where branch-name = Perryridge and balance = 1000 Possible strategies for processing query using indices on single attributes: 1. Use index on branch-name to find accounts with branch-name = Perryridge.; test balance = $1000. 2. Use index on balance to find accounts with balances of $1000; test branch-name = Perryridge. 3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained. 11 Indices on Multiple Attributes Suppose we have an index on combined search-key (branch-name, balance). With the where clause where branch-name = Perryridge and balance = 1000 the index on the combined search-key will fetch only records that satisfy both conditions. Using separate indices in less efficient we may fetch many records (or pointers) that satisfy only one of the conditions. 12

Choice of Indexes What indexes should we create? Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? For each index, what kind of an index should it be? Clustered? Hash/tree? 13 Hash-Based Indexes Good for equality selections. Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages. Hashing function h: h(r) = bucket in which record r belongs. h looks at the search key fields of r. 14

Summary Many alternative file organizations exist, each appropriate in some situation. If selection queries are frequent, sorting the file or building an index is important. Hash-based indexes only good for equality search. Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.) Index is a collection of data entries plus a way to quickly find entries with given key values. 15 Summary (Contd.) Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs. Choice orthogonal to indexing technique used to locate data entries with a given key value. Can have several indexes on a given file of data records, each with a different search key. Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance. 16