Overview of Storage and Indexing

Similar documents
Data on External Storage

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Overview of Storage and Indexing

Overview of Storage and Indexing

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Overview of Storage and Indexing. Data on External Storage

Chapter 1: overview of Storage & Indexing, Disks & Files:

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Storage and Indexing

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

The use of indexes. Iztok Savnik, FAMNIT. IDB, Indexes

CAS CS 460/660 Introduction to Database Systems. File Organization and Indexing

Tree-Structured Indexes

Overview of Storage and Indexing

Query Processing: The Basics. External Sorting

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

Data Management for Data Science

CompSci 516: Database Systems

Overview of Storage and Indexing

RAID in Practice, Overview of Indexing

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

ACCESS METHODS: FILE ORGANIZATIONS, B+TREE

Modern Database Systems Lecture 1

CSE 544 Principles of Database Management Systems

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Lecture 8 Index (B+-Tree and Hash)

Database design and implementation CMPSCI 645. Lecture 08: Storage and Indexing

CSE 444: Database Internals. Lectures 5-6 Indexing

Review of Storage and Indexing

External Sorting Implementing Relational Operators

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Kathleen Durant PhD Northeastern University CS Indexes

Step 4: Choose file organizations and indexes

Overview of Storage and Indexing

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Database Applications (15-415)

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

CS 443 Database Management Systems. Professor: Sina Meraji

Chapter 5: Physical Database Design. Designing Physical Files

amiri advanced databases '05

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Lecture 13. Lecture 13: B+ Tree

Storage hierarchy. Textbook: chapters 11, 12, and 13

Indexing Methods. Lecture 9. Storage Requirements of Databases

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Unit 3 Disk Scheduling, Records, Files, Metadata

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Hash-Based Indexing 165

EXTERNAL SORTING. Sorting

CSE 530A. B+ Trees. Washington University Fall 2013

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

(i) It is efficient technique for small and medium sized data file. (ii) Searching is comparatively fast and efficient.

Records in a file are grouped into buckets. Search key values are organized in a tree. The highest level is the root

Evaluation of Relational Operations

Single Record and Range Search

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

CSC 261/461 Database Systems Lecture 17. Fall 2017

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

COMP 430 Intro. to Database Systems. Indexing

Tree-Structured Indexes

Chapter 12: Indexing and Hashing. Basic Concepts

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

Tree-Structured Indexes

Chapter 12: Indexing and Hashing

Chapter 17 Indexing Structures for Files and Physical Database Design

Chapter 11: Indexing and Hashing

Database Management Systems (COP 5725) Homework 3

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

COSC-4411(M) Midterm #1

Database Applications (15-415)

Database Applications (15-415)

Database Applications (15-415)

Physical Database Design: Outline

CSIT5300: Advanced Database Systems

Overview of Storage & Indexing (i)

MySQL B+ tree. A typical B+tree. Why use B+tree?

Advanced Database Systems

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

CS 245 Midterm Exam Winter 2014

Important Note. Today: Starting at the Bottom. DBMS Architecture. General HeapFile Operations. HeapFile In SimpleDB. CSE 444: Database Internals

Evaluation of Relational Operations

Module 4: Tree-Structured Indexing

Tree-Structured Indexes. Chapter 10

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Database System Concepts

Chapter 13: Query Processing

Lecture 31 11/16/15. CMPSC431W: Database Management Systems. Instructor: Yu- San Lin

Dtb Database Systems. Announcement

Midterm Review. March 27, 2017

Evaluation of Relational Operations: Other Techniques

Introduction to Data Management. Lecture 14 (Storage and Indexing)

Chapter 12: Query Processing. Chapter 12: Query Processing

Transcription:

Overview of Storage and Indexing UVic C SC 370 Dr. Daniel M. German Department of Computer Science July 2, 2003 Version: 1.1.1 7 1 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Overview How does the DBMS storage data? Why is I/O is so important for database operations? Why are indices useful? 7 2 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Introduction I/O is way more expensive than any in-memory operation Disks are the most important external storage device The basic DBMS storage abstraction is a file of records Each record in a file should have a unique identifier (record id or rid) in this chapter, a record is a page, do not confuse with a row) The buffer manager is responsible for loading and writing a record from/to disk The file and access methods layer sits on top of the buffer manager, asking it to retrieve or write a given page rid 7 3 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

File Organization and Indexing A relation is typically stored as a file of records Rows are packed into records (pages) The simplest file structure is an unordered file or heap file Records are stored in random order The file manager is able to retrieve and store a particular record by its rid An index is a data structure that organizes data records to optimize certain kinds of retrieval operations They allow the efficient retrieval of records that satisfy a search condition on the search key Multiple indexes can be created on the same file (i.e table) 7 4 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Data Entries A data entry is a record stored in an index file A data entry with search value k (denoted as k*) contains enough information to locate all records with search key value k There are 3 ways to store a data entry: 1. A data entry k* is an actual data record (with search key value k) 2. A data entry is a k,rid pair, where rid is the record id of the data record with search key k 3. A data entry is a k,rid list pair, where rid list is a list of record ids with search key value k If we want multiple indexes on a table, we will need to use 1 and 2 or 3. 7 5 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Clustered Indexes If the ordering of the data records is the same or close to the ordering of the data entries in some index, it is called clustered (indexing alternative 1) is clustered by definition Otherwise it is unclustered CLUSTERED Index entries direct search for data entries UNCLUSTERED Data entries (Index File) (Data file) Data entries Data Records Data Records 7 6 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Primary and Secondary Indices An index on the primary key is a primary index Other indexes are secondary indexes Two data entries are duplicates if the have same search key A primary index does not have duplicates A secondary index could have no duplicates if is specified as UNIQUE 7 7 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Index Data Structures Indexes are usually implemented in one of two ways: Hash-based Indexing Tree-based Indexing 7 8 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Hash-based Indexing Use a hashing function to cluster records (and find them) Records in a file are clustered in buckets A bucket is composed of primary page and additional pages linked in a chain. The DBMS uses a hashing function to determine, for a given record, its bucket number Given a bucket number the DBMS should be able to retrieve the record in 1 or 2 I/Os (for the sake of this chapter, we will assume 1 I/O) If alternative (1) is used, the buckets contain the records, otherwise they contain the k,rid or k,rid list 7 9 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Tree-based Indexing The tree organizes the keys in order Non-leaf Pages Leaf Pages The inside nodes contain only keys, while the leafs contains the data entries 7 10 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

B+Tree Root 17 Entries <= 17 Entries > 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* B+ trees are the most commonly used They are guaranteed to be balanced The height of the tree is the length of a path from the root to the leaf (guaranteed to the always the same) Each node can have up to n children (fan-out of the tree), so a tree of fan-out n has in the order of n h 7 11 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

B+ Trees In practice, height is 3 or 4, fan-out > 100 A full tree with height=3,fan-out=100, can hold 100 4 == 10 8 records So we can retrieve either one of them with 4 I/Os! 7 12 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Example Clustered B+ Tree Fan out: 2, Height: 3 (length from root to leaf) 7 13 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

What do we compare? For the sake of comparison we will use the following queries: 1. Scan: fetch all records that satisfy a given condition: fetch pages from disk into the buffer pool. 2. Search with equality: find all records that satisfy a given equality condition. Fetch pages from disk into buffer pool and retrieve matching records. 3. Search with Range Selection: find all records within certain range: employees with age > 10 AND age < 100. Similar to previous. 4. Insert Record: Insert a new record. Identify the page where it should be inserted, fetch it, modify it, and write it back. Depending on file organization other pages might need to be changed. 5. Delete a record. Similar to previous. 7 14 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

The contenders Heap file of employee records File of employee records, sorted on age,sal Cluster B+ tree with search key age,sal Heap file with an unclustered B+ tree index on age,sal Heap file with an unclustered hash index on age,sal 7 15 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

What we compare We will use a very simple cost model B is the number of data pages when pages are packed and there is no waste R is the number of records per page D is the avg. time to read or write a page C is the cost of comparing a key F is the fan out of the B+ trees H is the time to apply the hash function D = 15ms, while C and H = 100ns, so we will concentrate on I/O cost. We ignore other details (such as contiguity of pages) 7 16 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Heap Files Scan: BD Search with Equality: BD 2 Range Search: BD Insert: 2D Delete: Search + D 7 17 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Sorted Files Scan: BD Search with Equality: Dlog 2 B Range Search:Dlog 2 B+cost of retrieving sequential matching records Insert: Search + BD Delete: Search + BD 7 18 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Clustered B+ Tree Fan out: F (F keys per inner node) In leave nodes, on average only 67% of the space is used, hence we need 1.5 times the number of leaves we would need if each node was full (1.5 B, as each node uses one page) Scan: 1.5BD (scan all the leave nodes) Search with Equality: Dlog F 1.5B Use the tree, with depth proportional to log F (number of leave nodes) Range Search:Dlog F 1.5B+ cost of retrieving sequential matching rows (find first, scan sequential) Insert: Search + D (search and insert) Delete: Search + D (search and destroy) 7 19 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Heap File with Unclustered Tree Index Assume Index data is 1/10th of the size of the data record So 10 times more index records < key,rid > than data records per leave page Total number of leave pages 0.15B (assuming 67% occupancy rate) Scan: Cost of scanning all index leave pages 0.15BD + Per each key in the index, retrieve one data page (heap file) BRD. Total: BD(0.15 + R) Search with Equality: Search in tree + 1 data page D(log F 0.15B + 1) 7 20 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Range Search: Search + cost of matching records D(log F 0.15B + matching rows) Insert: Insert into heap + insert into tree: 2D + D(log F 0.15B + 1) Delete: Search + 2D 7 21 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

Heap File with Unclustered Hash Index Assume: 80% occupancy rate and simple hash structures (no overflow). Hence and 8 index entries per page in index: total size of index 0.125B pages Scan: Read of rids: 0.125BD + cost of retrieval of each data page: BRD. Total: BD(0.125 + R) Search with Equality: 2D Range Search: Search: BD Insert: 4D Delete: Search + 2D 7 22 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca

The Results File Type Scan Eq. Search Range S. Insert Delete Heap BD 0.5BD BD 2D Search + D Sorted Dlog Files BD Dlog 2 B 2 B + #MP Search+BD Search+BD Clustered log B+ tree 1.5BD Dlog F B F 1.5B + #MP Search + D Search + D Unclusted BD(0.15+ D(log F 0.15B+ D(log F 0.15B+ 3D + B+ tree R) 1) #MP) Dlog F 0.15B Search + 2D Unclusted BD(0.125+ hash 2D R) BD 4D Search + 2D #MP is the number of pages that contain matching records in a range search. So, which one is better? 7 23 Overview of Storage and Indexing (1.1.1) CSC 370 dmgerman@uvic.ca