Introduction to Data Management. Lecture #13 (Indexing)

Similar documents
Introduction to Data Management. Lecture 14 (Storage and Indexing)

CS122A: Introduction to Data Management. Lecture #14: Indexing. Instructor: Chen Li

Principles of Data Management. Lecture #2 (Storing Data: Disks and Files)

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Storing Data: Disks and Files

Project is due on March 11, 2003 Final Examination March 18, pm to 10.30pm

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Disks and Files. Jim Gray s Storage Latency Analogy: How Far Away is the Data? Components of a Disk. Disks

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Storing Data: Disks and Files

Storing Data: Disks and Files

Storing Data: Disks and Files

Review 1-- Storing Data: Disks and Files

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:

L9: Storage Manager Physical Data Organization

Disks & Files. Yanlei Diao UMass Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7

Storing Data: Disks and Files

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

Storing Data: Disks and Files. Administrivia (part 2 of 2) Review. Disks, Memory, and Files. Disks and Files. Lecture 3 (R&G Chapter 7)

Introduction to Data Management. Lecture 15 (More About Indexing)

Disks and Files. Storage Structures Introduction Chapter 8 (3 rd edition) Why Not Store Everything in Main Memory?

CS 405G: Introduction to Database Systems. Storage

Disks, Memories & Buffer Management

Tree-Structured Indexes

Storing Data: Disks and Files

Overview of Storage and Indexing

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Storing Data: Disks and Files

CPSC 421 Database Management Systems. Lecture 11: Storage and File Organization

Kathleen Durant PhD Northeastern University CS Indexes

Overview of Storage and Indexing

Overview of Storage and Indexing

Overview of Storage and Indexing

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Database Systems. November 2, 2011 Lecture #7. topobo (mit)

Tree-Structured Indexes

RAID in Practice, Overview of Indexing

Storage and Indexing

Tree-Structured Indexes. Chapter 10

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Step 4: Choose file organizations and indexes

Tree-Structured Indexes

Database design and implementation CMPSCI 645. Lecture 08: Storage and Indexing

Lecture 8 Index (B+-Tree and Hash)

Tree-Structured Indexes

Unit 3 Disk Scheduling, Records, Files, Metadata

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Data on External Storage

Introduction. Choice orthogonal to indexing technique used to locate entries K.

CompSci 516: Database Systems

Tree-Structured Indexes

Database Applications (15-415)

CSE 444: Database Internals. Lectures 5-6 Indexing

Overview of Storage and Indexing. Data on External Storage

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Tree-Structured Indexes

Modern Database Systems Lecture 1

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

Physical Data Organization. Introduction to Databases CompSci 316 Fall 2018

Principles of Data Management. Lecture #3 (Managing Files of Records)

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Database Applications (15-415)

STORING DATA: DISK AND FILES

Normalization, Generated Keys, Disks

Oracle on RAID. RAID in Practice, Overview of Indexing. High-end RAID Example, continued. Disks and Files: RAID in practice. Gluing RAIDs together

Tree-Structured Indexes

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Overview of Storage and Indexing

Lecture 12. Lecture 12: The IO Model & External Sorting

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Overview of Storage & Indexing (i)

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Outlines. Chapter 2 Storage Structure. Structure of a DBMS (with some simplification) Structure of a DBMS (with some simplification)

Database Systems II. Secondary Storage

CSE 544 Principles of Database Management Systems

Professor: Pete Keleher! Closures, candidate keys, canonical covers etc! Armstrong axioms!

Evaluation of Relational Operations

Instructor: Amol Deshpande

Tree-Structured Indexes (Brass Tacks)

Database Applications (15-415)

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

EXTERNAL SORTING. Sorting

Chapter 1: overview of Storage & Indexing, Disks & Files:

Single Record and Range Search

Introduction to Data Management. Lecture #25 (Transactions II)

Evaluation of Relational Operations: Other Techniques

Introduction to Data Management. Lecture #18 (Transactions)

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks.

Review of Storage and Indexing

Managing the Database

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Storage Devices for Database Systems

Transcription:

Introduction to Data Management Lecture #13 (Indexing) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements v Homework info: HW #5 (SQL): Due now. HW #6 (More SQL): Available now! v Today s and Thursday s plans: Intro to indexing On to the ubiquitous B+ tree! v Midterms graded, pick up from TA/Reader Average 69, median = 69, range = [38, 90] Rough model: Add 10, convert to letter grade...? (J ) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

All done with SQL...! v ANY LINGERING QUESTIONS...? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3 Disks and Files DBMSs store information on ( hard ) disks. v This has major implications for DBMS design! v READ: transfer data from disk to main memory (RAM). WRITE: transfer data from RAM to disk. Both are high-cost operations, relative to in-memory operations, so must be considered carefully! Query Compiler -----------------------File & Index Mgmt DBMS code P7 Read P5 Buffer pool P2 P5 P3 Write P3 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke P3 P1 P7 Stored data 4

Storage Hierarchy & Latency (Jim Gray): How Far Away is the Data? 10 9 Tape /Optical Robot Andromeda 2,000 years 10 6 Disk Pluto 2 years 100 Memory 1.5 hr 10 2 1 On Board Cache On Chip Cache Registers This Room My Head 10 min 1 min Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5 Why Not Store Data in Main Memory?? v Costs too much. Dell was recently asking $65 for 500GB of disk, $600 for 256GB of SSD, and $57 for 4GB of RAM à $0.13, $2.34, $14.25 per GB v Main memory is volatile. We want data to be saved between runs. (Obviously!!) v Your typical (basic) storage hierarchy: Main memory (RAM) for currently used data Disk for the main database (secondary storage) Tapes for archiving the data (tertiary storage) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Components of a Disk v The platters spin (10,000 rpm) v The arm assembly is moved in or out to position a head on a desired track Tracks under heads form a cylinder (imaginary!) v Only one head reads/writes at any one time. v Block size is a multiple of sector size (which is fixed) Arm assembly Disk head Arm movement Spindle Tracks Sector Platters Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7 Accessing a Disk Page v Time to access (read/write) a disk block: Seek time (moving arms to position disk head on track) Rotational delay (waiting for block to rotate under head) Transfer time (actually moving data to/from disk surface) v Seek time and rotational delay dominate! Seek time varies from about 1 to 20msec Rotational delay varies from 0 to 10msec Transfer rate is < 1 msec per 4KB page Key to lowering I/O cost: Reduce seek/rotation delays! à Bottom line: Random vs. sequential I/O Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Ex: Emp(eid, ename, sal, deptid) Underlying Emp file pages P1 P2 P10000 1 2 3 4 1 2 3 4 1 2 3 4 111 Smith 3K 1 222 Lee 100K 3 333 Carey 80K 1 444 Smith 12K 7 555 Smith 18K 3 666 Jones 90K 5 777 Smith 23K 4 888 Krishan 60K 8.................................... 9999999 Smith 18K 11 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9... ß Record id (RID) is (P2,3) Processing a Query v Suppose someone asks a simple SQL query: SELECT * FROM Emp WHERE eid = 12345; v Processing options include: Option 1: Sequentially scan the data file (and stop, if we know eid is a key) à 5000 page reads (avg.) Option 2: Binary search the data file (and stop, if we know eid is a key) à log 2 (10,000) 15 page reads (avg.) Even though Option 2 is 30x faster, we d like to do even better (especially) for large data sets!!) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Indexing is the Answer! Key k or key range (k 1, k 2 ) Index on k I(k) or I(k 1 ),..., I(k 2 ) 888 (P2,4) v Index maps from keys to associated info I I(k) can be the data record with key k, or I(k) can be the RID of the data record with key k, or I(k) can be a list of RIDs of data records with key k! Alternatively, we could map from data field values to the PK value(s) of the associated record(s) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11 Indexes v An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can serve as the search key for an index on the relation. Search key is not the same as a key (i.e., it s not the primary key, it s a field we re very interested in). v An index contains a collection of data entries, and it supports efficient retrieval of all data entries k* with a given key value k. Given a data entry k*, we can find 1 st record with key k with just more disk I/O. (Details soon ) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Ex: Emp(eid, ename, sal, deptid) v One simple approach would be to have another file, sorted on k, for each k that we want to index Hundreds of (key, RID) entries will fit on a single page Index is thus much smaller than the data file Less data (fewer reads) to search to locate the RIDs of interest eid eid RID sal RID 111 (P1,1) 222 (P1,2) 333 (P1,3) 444 (P1,4) 555 (P2,1) 666 (P2,2)......... (RID) sal Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13 3K 12K 18K 18K 23K 60K (P1,1) (P1,4) (P2,1) (P10000,1) (P2,3) (P2,4)......... (RID) Note: Can have multiple of these! Even Better: Tree Indexes! Non-leaf Pages of Index Recursive application of indexing concept! (w/ pages) Leaf Pages of Index (Sorted by search key) v Leaf pages contain data entries, and are chained v Non-leaf pages have index entries; role is to guide searches v Query processing steps become: 1. Choose a good index to use (if one is available) 2. Search the index to determine the interesting RID(s) 3. Use the RID(s) to fetch the corresponding record(s) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

An Example (B+ Tree) 29...? Root 17 Notice that data entries at leaf level are sorted Entries < 17 Entries >= 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* Note: Just 3 page reads to get from root to (any) leaf here! k* = (k, I(k)) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15 Index Classification v Primary vs. secondary: If search key contains the primary key, then called the primary index. Unique index: Search key contains a candidate key. v Clustered vs. unclustered: If order of data records is the same as, or `close to, the order of stored data records, then called a clustered index. A table can be clustered on at most one search key. Cost of retrieving data records via an index varies greatly based on whether index is clustered or not! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

Clustered vs. Unclustered Indexes CLUSTERED Index entries direct search for data entries UNCLUSTERED Data entries (Index File) (Data file) Data entries Data Records Data Records Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17 Indexing in MySQL (w/innodb) CREATE [UNIQUE FULLTEXT SPATIAL] INDEX index_name [index_type] ON tbl_name (index_col_name,...) [index_option] [algorithm_option lock_option]... index_col_name: col_name [(length)] [ASC DESC] index_option: index_type KEY_BLOCK_SIZE [=] value WITH PARSER parser_name COMMENT 'string index_type: USING {BTREE HASH} algorithm_option: ALGORITHM [=] {DEFAULT INPLACE COPY} lock_option: LOCK [=] {DEFAULT NONE SHARED EXCLUSIVE} ß Ex: CREATE INDEX salidx ON Emp (sal) USING BTREE; (http://dev.mysql.com/doc/refman/5.7/en/create-index.html) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18

Next Up: ISAM! v Note: ISAM = prehistoric B+ tree (J ) v We ll cover... Non-leaf and leaf page contents of an index Search algorithms (exact match & range) Insert algorithms (incl. overflow chains) Delete algorithms (ditto) Any lingering questions first...? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19 Tree Indexes: Over(re)view v As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k> v This data entry choice is orthogonal to the indexing technique used to locate data entries k*. v Tree-structured indexing techniques support both range searches and equality searches. v ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 20

Range Searches v Find all students with gpa > 3.0 v If records are in a sorted file, do binary search to find first such student, then scan to find others. Cost of file binary search can be quite high. v Simple idea to do better: add an index file. k1 k2 kn Index File Page 1 Page 2 Page 3 Page N Data File Can now binary search the (smaller) index file! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 21 ISAM index entry P 0 K 1 P 1 K 2 P 2 K m P m v Index file may still be quite large. But, we can apply the same idea recursively to address that! Non-leaf Pages Leaf Pages Overflow page Leaf pages contain data entries. Primary pages Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 22

Comments on ISAM v File creation: Leaf (data) pages first allocated sequentially, sorted by search key; then index pages allocated, and then overflow pages. v Index entries: <search key value, page id>; they direct searches for data entries, which are in leaf pages. v Search: Start at root; use key comparisons to go to leaf. I/O cost log F N ; F = # entries/index pg, N = # leaf pgs v Insert: Find leaf data entry belongs to, and put it there. v Delete: Find and remove from leaf; if empty overflow page, de-allocate. Static tree structure: inserts/deletes affect only leaf pages. Data Pages Index Pages Overflow pages Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 23 Example ISAM Tree v Suppose each node can hold 2 entries (really more like 200 since the nodes are disk pages) Root 40 20 33 51 63 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 24

After Inserting 23*, 48*, 41*, 42*... Index Pages 40 20 33 51 63 Primary Leaf Pages 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* Overflow Pages 23* 48* 41* 42* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25... Then Deleting 42*, 51*, 97* 40 20 33 51 63 10* 15* 20* 27* 33* 37* 40* 46* 55* 63* 23* 48* 41* Note that 51* appears in index levels, but not in leaf! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26

Next Time: Innards of B+ Trees! v Stay tuned... Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 27