COMP 430 Intro. to Database Systems. Indexing

Similar documents
Indexing: Overview & Hashing. CS 377: Database Systems

Chapter 12: Indexing and Hashing. Basic Concepts

Kathleen Durant PhD Northeastern University CS Indexes

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Chapter 12: Indexing and Hashing

Chapter 11: Indexing and Hashing

CSE 562 Database Systems

Database index structures

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Chapter 12: Indexing and Hashing

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Outline. Database Management and Tuning. Index Data Structures. Outline. Index Tuning. Johann Gamper. Unit 5

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing

CSIT5300: Advanced Database Systems

Intro to DB CHAPTER 12 INDEXING & HASHING

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

(i) It is efficient technique for small and medium sized data file. (ii) Searching is comparatively fast and efficient.

Chapter 11: Indexing and Hashing

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS


Indexing and Hashing

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

Database Management and Tuning

Outline. Database Management and Tuning. What is an Index? Key of an Index. Index Tuning. Johann Gamper. Unit 4

amiri advanced databases '05

Physical Level of Databases: B+-Trees

Index Tuning. Index. An index is a data structure that supports efficient access to data. Matching records. Condition on attribute value

Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Hash-Based Indexes. Chapter 11

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Remember. 376a. Database Design. Also. B + tree reminders. Algorithms for B + trees. Remember

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

SUMMARY OF DATABASE STORAGE AND QUERYING

Chapter 17 Indexing Structures for Files and Physical Database Design

Single Record and Range Search

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

CS 245: Database System Principles

RAID in Practice, Overview of Indexing

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

CS34800 Information Systems

Indexing Methods. Lecture 9. Storage Requirements of Databases

Chapter 5: Physical Database Design. Designing Physical Files

Datenbanksysteme II: Caching and File Structures. Ulf Leser

TUTORIAL ON INDEXING PART 2: HASH-BASED INDEXING

Data Storage and Query Answering. Indexing and Hashing (5)

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

QUIZ: Buffer replacement policies

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

CS 525: Advanced Database Organization 04: Indexing

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Outline. Query Types

Administração e Optimização de Bases de Dados 2012/2013 Index Tuning

Overview of Storage and Indexing

File Structures and Indexing

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CMSC 424 Database design Lecture 13 Storage: Files. Mihai Pop

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Chapter 12: Indexing and Hashing (Cnt(

Hashing file organization

2, 3, 5, 7, 11, 17, 19, 23, 29, 31

7RSLFV ,QGH[HV 7HUPVÃDQGÃ'LVWLQFWLRQV. Conventional Indexes B-Tree Indexes Hashing Indexes

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Introduction to Data Management. Lecture 21 (Indexing, cont.)

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Tree-Structured Indexes. Chapter 10

Data and File Structures Chapter 11. Hashing

CSE 544 Principles of Database Management Systems

The physical database. Contents - physical database design DATABASE DESIGN I - 1DL300. Introduction to Physical Database Design

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Material You Need to Know

File Organization and Storage Structures

Lecture 8 Index (B+-Tree and Hash)

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Overview of Storage and Indexing

Data Organization B trees

Hashing IV and Course Overview

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Module 4: Tree-Structured Indexing

Hashed-Based Indexing

CS 245 Midterm Exam Winter 2014

Physical Database Design: Outline

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Indexing: B + -Tree. CS 377: Database Systems

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Hash-Based Indexing 1

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Advanced Application Indexing and Index Tuning

Storage and Indexing

Database Applications (15-415)

CSC 261/461 Database Systems Lecture 17. Fall 2017

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

CAS CS 460/660 Introduction to Database Systems. File Organization and Indexing

Storage hierarchy. Textbook: chapters 11, 12, and 13

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Transcription:

COMP 430 Intro. to Database Systems Indexing

How does DB find records quickly? Various forms of indexing An index is automatically created for primary key. SQL gives us some control, so we should understand the options. Concerned with user-visible effects, not underlying implementation. CREATE INDEX index_city_salary ON Employees (city, salary); CREATE UNIQUE CLUSTERED INDEX idx ON MyTable (attr1 DESC, attr2 ASC); Options vary.

Phone book model Data is stored with search key. Clustered index. Organized by one search key: Last name, first name. Searching by any other key is slow.

Library card catalog model Index stores pointers to data. Non-clustered index. Organized by search key(s). Author last name, first name. Title Subject

Evaluating an indexing scheme Access type flexibility Specific key value e.g., John, Smith Key value range e.g., salary between $50K and $60K Advantage: Access time Disadvantages: Update time Space overhead Creating an index can slow down the system!

Ordered indices Sorted, as in previous human-oriented examples

Types of indices we ll see Dense vs. sparse Primary vs. secondary Unique vs. not-unique Single- vs. multi-level These ideas can be combined in various ways.

Primary index the clustering index Index 10 10 Data Index 10 10 Data 30 20 20 20 80 30 30 30 100 40 40 40 140 50 50 50 60 60 Sparse Dense Typically unique index also, since primary search key typically same as primary key. What were primary search keys in phone book & library card catalog?

Primary index trade-off dense vs. sparse Dense faster access Sparse less update time, less space overhead Index 10 10 40 70 100 130 20 30 40 50 60 70 Data Good trade-off: Sparse, but link to first key of each file block.

Secondary index a non-clustering index Index Data 10 30 20 30 50 100 Needs to be dense, otherwise we can t find all search keys efficiently. 100 20 10 50 140 What were secondary search keys in phone book & library card catalog?

Secondary index typically not unique Index Buckets Data 10 30 20 30 80 100 100 20 10 30 10

Index size Dense index search key + pointer per record Sparse index search key + pointer per file block (typically) Many records, but want index to fit in memory Solution: Multi-level index Same problem & solution as for page tables in virtual memory.

Multi-level primary index Sparse Index 10 130 250 250 (Semi-)dense Index 10 10 40 70 100 130 160 190 220 250 20 30 40 50 60 70 80 90 100 Data

Multi-level secondary index Sparse Index 10 50 90 130 Dense Index Data 10 70 20 30 40 50 60 70 80 90 40 30 100 10 90 60 20 80 50

Multi-level index summary Access time now very locality-dependent Update time increased must update multiple indices Total space overhead increased

Updating database & indices inserting Index 10 10 Data 10 Data 40 20 20 70 30 25 100 130 40 50 60 30 70 Moving all records is too expensive. Have same issue when updating indices. 40 50 60 70

Updating database & indices deleting Index 10 10 Data 10 Data 40 20 20 70 30 25 100 130 40 50 60 30 70 40 60 Moving all records is too expensive. Have same issue when updating indices. 70

Updating file leads to fragmentation Performance degrades with time. Need to periodically reorganize data & indices. Solution: B+-trees

B+-tree indices

B+-trees Balanced search trees optimized for disk-based access Reorganizes a little on every update Shallow & wide (high fan-out) Typically, node size = disk block Slight differences from more commonly-known B-trees All data in leaf nodes Leaf nodes sequentially linked Easily get all data in order. CREATE INDEX ON USING BTREE; Or, is the default in many DBMSs.

3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 30 120 150 180 100 B+-tree example Data records

B+-tree performance Very similar to multi-level index structure Slightly higher per-operation access & update time + No degradation over time + No periodic reorganization B+-tree widely used in relational DBMSs

What about non-unique search keys? Allow duplicates in tree. Maintain order instead of <. Slightly complicates tree operations. Make unique by adding record-id. Extra storage. But, record-id useful for other purposes, too. List of duplicate records for each key. Trivial to get all duplicates. Inefficient when lists get long.

Indexing on VARCHAR keys Key size variable, so number of keys that fit into a node also varies. Techniques to maximize fan-out: Key values at internal nodes can be prefixes of full key. E.g., Johnson and Jones can be separated by Jon. Key values at leaf nodes can be compressed by sharing common prefixes. E.g., Johnson and Jones can be stored as Jo + hnson / nes

Hash indices

Basic idea of a hash function value1 value2 h() distributes values uniformly among the buckets. value3 value4 h() distributes typical subsets of values uniformly among the buckets.

Using hash function for indexing value1 value1 value4 Motivation: Constant-time hash instead of multi-level/tree-based. value3 value4 value3 Use hash function to find bucket. Search/insert/delete from bucket. Bucket items possibly sorted. CREATE INDEX ON USING HASH;

Buckets can overflow Overflow some buckets skewed usage Overflow many buckets not enough space reserved Solutions: Chain additional buckets degrades to linear search Stop and reorganize Dynamic hashing (extensible or linear hashing) techniques that allow the number of buckets to grow without rehashing existing data

Advantages & disadvantages of hash indexing + One hash function vs. multiple levels of indexing or B+-tree Storage about the same. + Don t need multiple levels. But, need buckets about half empty for good performance. - Can t easily access a data range. - Simple hashing degrades poorly when data is skewed relative to the hash function.

Multiple indices & Multiple keys

Using indices for multiple attributes What if queries like the following are common? SELECT FROM Employee WHERE job_code = 2 AND performance_rating = 5;

Strategy 1 Index one attribute CREATE INDEX idx_job_code on Employee (job_code); SELECT FROM Employee WHERE job_code = 2 AND performance_rating = 5; Internal strategy: 1. Use index to find Employees with job_code = 2. 2. Linear search of those to check performance_rating = 5.

Strategy 2 Index both attributes CREATE INDEX idx_job_code on Employee (job_code); CREATE INDEX idx_perf_rating on Employee (performance_rating); SELECT FROM Employee WHERE job_code = 2 AND performance_rating = 5; Internal strategy chooses between Use job_code index, then linear search on performance_rating. Use performance_rating index, then linear search on job_code. Use both indices, then intersect resulting sets of pointers.

Strategy 3 Index attribute set CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); SELECT FROM Employee WHERE job_code = 2 AND performance_rating = 5; Attribute sets ordered lexicographically: (jc1, pr1) < (jc2, pr2) iff either jc1 < jc2 jc1 = jc2 and pr1 < pr2 This strategy typically uses ordered index, not hashing. Note that this prioritizes job_code over performance_rating!

Strategy 3 Index attribute set CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); Efficient SELECT FROM Employee WHERE job_code = 2 AND performance_rating = 5; SELECT FROM Employee WHERE job_code = 2 AND performance_rating < 5; CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); Inefficient SELECT FROM Employee WHERE job_code < 2 AND performance_rating = 5; SELECT FROM Employee WHERE job_code = 2 OR performance_rating = 5;

Strategy 4 Grid indexing n attributes viewed as being in n-dimensional grid Numerous implementations Grid file, K-D tree, R-tree, Mainly used for spatial data CREATE SPATIAL INDEX ON ;

Strategy 5 Bitmap indices Coming next

Bitmap indices

Basic idea of bitmap indices Assume records numbered 0, 1,, and easy to find record #i. Record # id gender income_level 0 51351 M 1 1 73864 F 2 2 13428 F 1 3 53718 M 4 4 83923 F 3 Bitmaps M F 1 2 3 4 5 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 F s bitmap: 01101 CREATE BITMAP INDEX ON ;

Queries use standard bitmap operations SELECT FROM WHERE gender = F AND income_level = 1; F s bitmap 01101 1 s bitmap & = 00100 10100 SELECT FROM WHERE gender = F OR income_level = 1; F s bitmap 01101 1 s bitmap = 11101 10100 SELECT FROM WHERE income_level <> 1; 1 s bitmap ~ = 01011 10100

Space overhead Normal: #records (#values + 1) bits, if nullable Typically used when # of values for attribute is low. Encoded: #records log(#values) But need to use more bitmaps Compressed: ~50% of normal Can do bitmap operations on compressed form.

A couple details When deleting records, either Delete bit from every bitmap for that table, or Have an existence bitmap indicating whether that row exists. Can use idea in B+-tree leaves.

Summary of index types Single-level Multi-level B+-tree Spatial Hash Bitmap Index generally too large to fit in memory Degrades due to fragmentation General-purpose choice Good for spatial coordinates Good for equality tests; potentially degrades due to skew & overflow Good for multiple attributes each with few values