Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Similar documents
Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Introduction to Database Systems CSE 344

Introduction to Database Systems CSE 344

Database Systems CSE 414

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Database Systems CSE 414

CSE 344 FEBRUARY 14 TH INDEXING

Database Systems CSE 414

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Introduction to Database Systems CSE 344

Database Systems CSE 414

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Introduction to Database Systems CSE 444, Winter 2011

CSE 344 APRIL 27 TH COST ESTIMATION

Review: Query Evaluation Steps. Example Query: Logical Plan 1. What We Already Know. Example Query: Logical Plan 2.

CSE 544 Principles of Database Management Systems

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

CSE 444 Final Exam. August 21, Question 1 / 15. Question 2 / 25. Question 3 / 25. Question 4 / 15. Question 5 / 20.

RELATIONAL OPERATORS #1

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

Advanced Database Systems

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Principles of Database Management Systems

L20: Joins. CS3200 Database design (sp18 s2) 3/29/2018

Implementation of Relational Operations: Other Operations

EECS 647: Introduction to Database Systems

Principles of Data Management. Lecture #9 (Query Processing Overview)

University of Waterloo Midterm Examination Sample Solution

CSC 261/461 Database Systems Lecture 19

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Evaluation of Relational Operations: Other Techniques

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Query Processing: The Basics. External Sorting

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

Fundamentals of Database Systems

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Overview of Implementing Relational Operators and Query Evaluation

CSE 544 Principles of Database Management Systems

Evaluation of Relational Operations: Other Techniques

Query Execution [15]

CSE544: Principles of Database Systems. Lectures 5-6 Database Architecture Storage and Indexes

Database Applications (15-415)

Hash-Based Indexing 165

CSE344 Midterm Exam Winter 2017

Database Applications (15-415)

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Query Processing & Optimization

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Query Processing & Optimization. CS 377: Database Systems

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

Database Management System

Evaluation of Relational Operations

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building

Evaluation of Relational Operations: Other Techniques

Relational Query Optimization. Highlights of System R Optimizer

Implementation of Relational Operations

CSE 444: Database Internals. Lectures 5-6 Indexing

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

Evaluation of Relational Operations

Modern Database Systems Lecture 1

Querying Data with Transact SQL

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Database System Concepts

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Relational Query Optimization

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Lecture 12. Lecture 12: Access Methods

Database Tuning and Physical Design: Basics of Query Execution

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

CS-245 Database System Principles

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Lecture 20: Query Optimization (2)

Storage and Indexing

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Evaluation of relational operations

Project. CIS611 Spring 2014 SS Chung Due by April 15. Performance Evaluation Experiment on Query Rewrite Optimization

Evaluation of Relational Operations

Relational Database Management Systems for Epidemiologists: SQL Part II

CSE344 Midterm Exam Fall 2016

External Sorting Implementing Relational Operators

CompSci 516 Data Intensive Computing Systems

Overview of Query Evaluation. Chapter 12

Chapter 12: Query Processing. Chapter 12: Query Processing

15-415/615 Faloutsos 1

Project. Building a Simple Query Optimizer with Performance Evaluation Experiment on Query Rewrite Optimization

Chapter 12: Query Processing

Transcription:

Student Introduction to Database Systems CSE 414 Hash table example Index Student_ID on Student.ID Data File Student 10 Tom Hanks 10 20 20 Amy Hanks ID fname lname 10 Tom Hanks 20 Amy Hanks Lecture 26: More Indexes and Operator Costs 50 200 220 240 420 800 50 200 220 240 420 800 Index File (preferably in memory) Data file (on disk) 1 Hash table indexes CSE are 414 -good Spring 2018 for point queries 2 B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered d = 2 Find the key 40 80 40 <= 80 20 60 100 120 140 20 < 40 <= 60 Index File (preferably in memory) B+ Tree B+ Tree Index entries (Index File) (Data file) Index entries 10 15 18 20 30 40 50 60 65 80 85 90 CLUSTERED Data Records Data Records UNCLUSTERED 30 < 40 <= 40 10 15 18 20 30 40 50 60 65 80 85 90 B+ indexes are good for range queries Data file (on disk) 3 Every table can have only one clustered and many unclustered indexes Why? 4 Student(ID, fname, lname) Takes(studentID, courseid) Example FROM Student x, Takes y WHERE x.id=y.studentid AND y.courseid > 300 Which Indexes? Student ID fname lname 10 Tom Hanks 20 Amy Hanks for y in Takes if courseid > 300 then for x in Student if x.id=y.studentid output * Assume the database has indexes on these attributes: Takes_courseID = index on Takes.courseID Student_ID = index on Student.ID Index selection How many indexes could we create? Which indexes should we create? Index join studentid=id σcourseid>300 Index selection for y in Takes_courseID where y.courseid > 300 y = fetch the Takes record pointed to by y for x in Student_ID where x.id = y.studentid x = fetch the Student record pointed to by x output * In general this is a very hard problem Takes Student 5 6 1

Student Which Indexes? ID fname lname 10 Tom Hanks 20 Amy Hanks The index selection problem Given a table, and a workload (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!) Who does index selection: The database administrator DBA Index Selection: Which Search Key Make some attribute K a search key if the WHERE clause contains: An exact match on K A range predicate on K Semi-automatically, using a database administration tool 7 8 The Index Selection Problem 1 The Index Selection Problem 1 What indexes? 9 10 The Index Selection Problem 1 The Index Selection Problem 2 WHERE N>? and N<? 100000 queries: INSERT INTO V VALUES (?,?,?) A: V(N) and V(P) (hash tables or B-trees) What indexes? 11 12 2

The Index Selection Problem 2 The Index Selection Problem 3 WHERE N>? and N<? 100000 queries: INSERT INTO V VALUES (?,?,?) 100000 queries: 1000000 queries: and P>? 100000 queries: INSERT INTO V VALUES (?,?,?) A: definitely V(N) (must B-tree); unsure about V(P) What indexes? 13 14 The Index Selection Problem 3 The Index Selection Problem 4 100000 queries: 1000000 queries: and P>? 100000 queries: INSERT INTO V VALUES (?,?,?) 1000 queries: 100000 queries: WHERE N>? and N<? WHERE P>? and P<? A: V(N, P) How does this index differ from: 1. Two indexes V(N) and V(P)? 2. An index V(P, N)? 15 What indexes? 16 The Index Selection Problem 4 Two typical kinds of queries 1000 queries: 100000 queries: WHERE N>? and N<? WHERE P>? and P<? A: V(N) secondary, V(P) primary index FROM Movie WHERE year =? FROM Movie WHERE year >=? AND year <=? Point queries What data structure should be used for index? Range queries What data structure should be used for index? 17 18 3

Basic Index Selection Guidelines Consider queries in workload in order of importance Consider relations accessed by query No point indexing other relations Cost FROM R WHERE R.K>? and R.K<? Look at WHERE clause for possible search key Try to choose indexes that speed-up multiple queries Range queries benefit mostly from clustering 0 100 Percentage tuples retrieved 19 20 FROM R WHERE R.K>? and R.K<? FROM R WHERE R.K>? and R.K<? Cost Sequential scan Cost Sequential scan Clustered index 0 100 Percentage tuples retrieved 0 100 Percentage tuples retrieved 21 22 Cost Unclustered index FROM R WHERE R.K>? and R.K<? Sequential scan Choosing Index is Not Enough To estimate the cost of a query plan, we still need to consider other factors: Clustered index How each operator is implemented The cost of each operator 0 100 Percentage tuples retrieved 23 Let s start with the basics 24 4

Cost Parameters Cost of Reading Data From Disk Cost = I/O + CPU + Network BW We will focus on I/O in this class Parameters (a.k.a. statistics): B(R) = # of blocks (i.e., pages) for relation R T(R) = # of tuples in relation R V(R, a) = # of distinct values of attribute a When a is a key, V(R,a) = T(R) When a is not a key, V(R,a) can be anything <= T(R) DBMS collects statistics about base tables must infer them for intermediate results 25 26 Selectivity Factors for Conditions A = c /* σ A=c (R) */ Selectivity = 1/V(R,A) Cost of Reading Data From Disk Sequential scan for relation R costs B(R) A < c /* σ A<c (R)*/ Selectivity = (c - min(r, A))/(max(R,A) - min(r,a)) c1 < A < c2 /* σ c1<a<c2 (R)*/ Selectivity = (c2 c1)/(max(r,a) - min(r,a)) 27 28 Table scan: 29 30 5

If index is clustered: If index is unclustered: If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: 31 32 If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os Note: we ignore I/O cost for index pages 33 If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os Lesson: Don t build unclustered indexes when V(R,a) is small! 34 Outline Cost of Executing Operators (Focus on Single Node Joins) Join operator algorithms One-pass algorithms (Sec. 15.2 and 15.3) Index-based algorithms (Sec 15.6) Note about readings: In class, we discuss only algorithms for joins Other operators are easier: read the book 35 36 6

Nested loop join Hash join Sort-merge join Join Algorithms Nested Loop Joins Tuple-based nested loop R S R is the outer relation, S is the inner relation for each tuple t 1 in R do for each tuple t 2 in S do if t 1 and t 2 join then output (t 1,t 2 ) What is the Cost? 37 38 Nested Loop Joins Tuple-based nested loop R S R is the outer relation, S is the inner relation for each tuple t 1 in R do for each tuple t 2 in S do if t 1 and t 2 join then output (t 1,t 2 ) Page-at-a-time Refinement for each page of tuples r in R do for each page of tuples s in S do for all pairs of tuples t 1 in r, t 2 in s if t 1 and t 2 join then output (t 1,t 2 ) What is the Cost? Cost: B(R) + T(R) B(S) Multiple-pass since S is read many times Cost: B(R) + B(R)B(S) What is the Cost? 39 40 Page-at-a-time Refinement Page-at-a-time Refinement 1 2 Input buffer for Patient 1 2 Input buffer for Patient Disk Patient Insurance 1 2 2 4 6 6 3 4 4 3 1 3 9 6 2 8 2 4 Input buffer for Insurance 2 2 Output buffer Disk Patient Insurance 1 2 2 4 6 6 3 4 4 3 1 3 9 6 2 8 4 3 Input buffer for Insurance Output buffer 8 5 8 9 41 8 5 8 9 42 7

Page-at-a-time Refinement 1 2 Input buffer for Patient Patient 1 2 3 4 9 6 8 5 Disk Insurance 2 4 4 3 2 8 8 9 6 6 1 3 2 8 Input buffer for Insurance Keep going until read all of Insurance 2 2 Output buffer Then repeat for next page of Patient until end of Patient Cost: B(R) + B(R)B(S) 43 8