Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

Similar documents
Query Processing. Introduction to Databases CompSci 316 Fall 2017

EECS 647: Introduction to Database Systems

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

Advanced Database Systems

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

Database System Concepts

Chapter 13: Query Processing

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing

University of Waterloo Midterm Examination Sample Solution

Chapter 12: Query Processing

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Introduction to Database Systems CSE 344

Database Systems CSE 414

Storage hierarchy. Textbook: chapters 11, 12, and 13

Midterm Review. March 27, 2017

Chapter 12: Query Processing

Query Processing: The Basics. External Sorting

CS54100: Database Systems

Query Processing and Advanced Queries. Query Optimization (4)

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Fundamentals of Database Systems

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

CMSC424: Database Design. Instructor: Amol Deshpande

Chapters 15 and 16b: Query Optimization

From SQL-query to result Have a look under the hood

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Database Applications (15-415)

Database Applications (15-415)

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Principles of Database Management Systems

CSE 344 FEBRUARY 14 TH INDEXING

Chapter 12: Indexing and Hashing. Basic Concepts

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Query Execution [15]

Chapter 12: Indexing and Hashing

Review. Support for data retrieval at the physical level:

Database Management Systems (COP 5725) Homework 3

Database Applications (15-415)

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

CS698F Advanced Data Management. Instructor: Medha Atre. Aug 11, 2017 CS698F Adv Data Mgmt 1

Chapter 17: Parallel Databases

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS 245 Midterm Exam Winter 2014

CSIT5300: Advanced Database Systems

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

CMSC 424 Database design Lecture 13 Storage: Files. Mihai Pop

Column Stores vs. Row Stores How Different Are They Really?

Kathleen Durant PhD Northeastern University CS Indexes

7. Query Processing and Optimization

Query Evaluation and Optimization

IMPORTANT: Circle the last two letters of your class account:

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

CSE344 Midterm Exam Winter 2017

Datenbanksysteme II: Implementing Joins. Ulf Leser

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Statistics. Duplicate Elimination

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Database Systems CSE 414

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Database Systems CSE 414

Chapter 3. Algorithms for Query Processing and Optimization

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Advanced Oracle SQL Tuning v3.0 by Tanel Poder

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Database Applications (15-415)

CS 245 Midterm Exam Solution Winter 2015

Chapter 12: Indexing and Hashing

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

CS-245 Database System Principles

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

CSIT5300: Advanced Database Systems

Query Processing & Optimization

Introduction to Database Systems CSE 344

XML Query Processing. Announcements (March 31) Overview. CPS 216 Advanced Database Systems. Course project milestone 2 due today

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

CSE 190D Spring 2017 Final Exam Answers

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Database Applications (15-415)

CSE Midterm - Spring 2017 Solutions

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

CSC 261/461 Database Systems Lecture 19

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Transcription:

Query Processing with Indexes CPS 216 Advanced Database Systems Announcements (February 24) 2 More reading assignment for next week Buffer management (due next Wednesday) Homework #2 due next Thursday Course project proposal due in 1½ weeks Midterm in two weeks Christos Faloutsos (CMU) talk Data Mining Using Fractals and Power Laws 4-5pm, Monday, February 28 130A North Building (telecast from UNC) Review 3 Many different ways of processing the same query Scan (e.g., nested-loop join) Sort (e.g., sort-merge join) Hash (e.g., hash join) Index

Selection using index 4 Equality predicate: σ A = v (R) Use an ISAM, B + -tree, or hash index on R(A) Range predicate: σ A > v (R) Use an ordered index (e.g., ISAM or B + -tree) on R(A) Hash index is not applicable Indexes other than those on R(A) may be useful Example: B + -tree index on R(A, B) How about B + -tree index on R(B, A)? Index versus table scan 5 Situations where index clearly wins: Index-only queries which do not require retrieving actual tuples Example: π A (σ A > v (R)) Primary index clustered according to search key One lookup leads to all result tuples in their entirety Index versus table scan (cont d) 6 BUT(!): Consider σ A > v (R) and a secondary, non-clustered index on R(A) Need to follow pointers to get the actual result tuples Say that 20% of R satisfies A > v Could happen even for equality predicates I/O s for index-based selection: lookup + 20% R I/O s for scan-based selection: B(R) Table scan wins if a block contains more than 5 tuples

Index nested-loop join 7 R R.A = S.B S Idea: use the value of R.A to probe the index on S(B) For each block of R, and for each r in the block: Use the index on S(B) to retrieve s with s.b = r.a Output rs I/O s: B(R) + R (index lookup) Typically, the cost of an index lookup is 2-4 I/O s Beats other join methods if R is not too big Better pick R to be the smaller relation Memory requirement: 2 Tricks for index nested-loop join 8 Goal: reduce R (index lookup) For tree-based indexes, keep the upper part of the tree in memory For extensible hash index, keep the directory in memory Sort or partition R according to the join attribute Improves locality: subsequent lookup may follow the same path or go to the same bucket Zig-zag join using ordered indexes 9 R R.A = S.B S Idea: use the ordering provided by the indexes on R(A) and S(B) to eliminate the sorting step of sort-merge join Trick: use the larger key to probe the other index Possibly skipping many keys that do not match B + -tree on R(A) 1 2 3 4 7 9 18 1 7 9 11 12 17 19 B + -tree on S(B)

More indexes ahead! 10 Bitmap index Generalized value-list index Projection index Bit-sliced index Search key values tuples 11 Search key values 8 9 10 26 108 Looks familiar? Keywords documents Tuples 0 1 2 n 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 means tuple has the particular search key value 0 means otherwise Bitmap index 12 Value-list index stores the matrix by rows Traditionally list contains pointers to tuples B + -tree: tuples with same search key values Inverted list: documents with same keywords If there are not many search key values, and there are lots of 1 s in each row, pointer list is not spaceefficient How about a bitmap? Still a B + -tree, except leaves have a different format

Technicalities 13 How do we go from a bitmap index (0 to n 1) to the actual tuple? One more level of indirection solves everything Or, given a bitmap index, directly calculate the physical block number and the slot number within the block for the tuple In either case, certain block/slot may be invalid Because of deletion, or variable-length tuples Keep an existence bitmap: bit set to 1 if tuple exists Bitmap versus traditional value-list 14 Operations on bitmaps are faster than pointer lists Bitmap AND: bit-wise AND Value-list AND: sort-merge join Bitmap is more efficient when the matrix is sufficiently dense; otherwise, pointer list is more efficient Smaller means more in memory and fewer I/O s Generalized value-list index: with both bitmap and pointer list as alternatives Projection index 15 Just store π A (R) and use it as an index! Could be implicit and not explicitly stored TID A B 0 8 1 8 2 26 3 108 n -1 10 Projection index

Why projection index? 16 Idea: still a table scan, but we are scanning a much smaller table (project index) Savings could be substantial for long tuples with lots of attributes Looks familiar? Bit-sliced index 17 If a column stores binary numbers, then slice their bits vertically Basically a projection index by slices Projection index TID A 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 2 0 0 0 1 1 0 1 0 3 0 1 1 0 1 1 0 0 n -1 0 0 0 0 1 0 1 0 Slice 7 Slice 0 Bit-sliced index Aggregate query processing example 18 SELECT SUM(dollar_sales) FROM Sales WHERE condition; Already found B f (a bitmap or a sorted list of TID s that point to Sales tuples that satisfy condition) Probably used a secondary index Need to compute SUM(dollar_sales) for tuples in B f

SUM without any index 19 For each tuple in B f, go fetch the actual tuple, and add dollar_sales to a running sum I/O s: number of Sales blocks with B f tuples Assuming we fetch them in sorted order SUM with a value-list index 20 Assume a value-list index on Sales(dollar_sales) Idea: the index stores dollar_sales values and their counts (in a pretty compact form) sum = 0; Scan Sales(dollar_sales) index; for each indexed value v with value-list B v : sum += v count-1-bits(b v AND B f ); I/Os: number of blocks taken by the value-list index Bitmaps can possibly speed up AND and reduce the size of the index SUM with a projection index 21 Assume a project index on Sales(dollar_sales) Idea: merge join B f and the projection index, add joining tuples dollar_sales to a running sum Assuming both B f and the index are sorted on TID I/O s: number of blocks taken by the projection index Compared with a value-list index, the projection index may be more compact (no empty space or pointers), but it does store duplicate dollar_sales values Also: simpler algorithm, fewer CPU operations

SUM with a bit-sliced index 22 Assume a bit-sliced index on Sales(dollar_sales), with slices B k 1,, B 1, B 0 sum = 0; for i = 0 to k 1: sum += 2 i count-1-bits(b i AND B f ); I/O s: number of blocks taken by the bit-sliced index Conceptually a bit-sliced index contains the same information as a projection index But the bit-sliced index does not keep TID Bitmap AND is faster Summary of SUM 23 Best: bit-sliced index Index is small B f can be applied fast! Good: projection index Not bad: value-list index Full-fledged index carries a bigger overhead The fact that we have counts of values helped But we did not really need values to be ordered MEDIAN 24 SELECT MEDIAN(dollar_sales) FROM Sales WHERE condition; Same deal: already found B f (a bitmap or a sorted list of TID s that point to Sales tuples that satisfy condition) Need to find the dollar_sales value that is greater than or equal to ½ count-1-bits(b f ) dollar_sales values among B f tuples

MEDIAN with an ordered value-list index 25 Idea: take advantage of the fact that the index is ordered by dollar_sales Scan the index in order, count the number of tuples that appeared in B f until the count reaches ½ count-1-bits(b f ) I/O s: roughly half of the index MEDIAN with a projection index 26 In general, need to sort the index by dollar_sales Well, when you sort, you more or less get back an ordered value-list index! Not useful unless B f is small MEDIAN with a bit-sliced index 27 Tough at the first glance index is not sorted Think of it as sorted We won t actually make use of the this fact Look at B k 1 first More than half are 0 s? 0 0 0 0 0 1 1 0 0 1 1 0 1 1 1 Yes; continue searching for median here No; continue searching for median here By looking at B k 1 we know the (k 1)-th bit of the median

MEDIAN with a bit-sliced index 28 median = 0; B current = B f ; sofar = 0; // which tuples we are considering // number of tuples whose values are less // than what we are considering for i = k 1 to 0: if (sofar + count-1-bits(b current AND NOT(B i )) ½ count-1-bits(b f )): B current = B current AND B i ; sofar += count-1-bits(b current AND NOT(B i ); median += 2 i ; else: B current = B current AND NOT(B i ); I/O s: still need to scan the entire index Summary of MEDIAN 29 Best: ordered value-list index It helps to be ordered! Pretty good: bit-sliced index Could beat ordered value-list index if B f is clustered Only need to retrieve the corresponding segment More variant indexes 30 Improved Query Performance with Variant Indexes, by O Neil and Quass. SIGMOD, 1997 MIN/MAX, and range query using bit-sliced index Join indexes for star schema Traditional: one for each combination of foreign columns Bitmap: one for each foreign column Precomputed query results (materialized views)?

Variant vs. traditional indexes 31 What is the more glaring problem of these variant indexes that makes them not as widely applicable as the B + -tree? How did the paper get away with that?