Storage and Indexing

CompSci 516: Data Intensive Computing Systems
Lecture 5: Storage and Indexing
Instructor: Sudeepa Roy
Duke CS, Spring 2016

Announcement
- Homework 1 is due on Feb 9, 11:59 pm.
- Please post any questions you have on Piazza.
- Lecture slides:
  - Book chapters will be posted on the webpage two days before the lecture (you don't have to look at the chapters before the class!).
  - Slides will be posted after the class.

Reading Material
- [RG] Chapters 8.1, 8.2, 8.3, 8.5
- Additional reading: [GUW] Chapters 8.3, 14.1
Acknowledgement: The following slides were created by adapting the instructor material of the [RG] book provided by the authors, Dr. Ramakrishnan and Dr. Gehrke.

Where are we now?
- We learnt:
  - Relational model and query languages: RA, RC, SQL, Postgres (DBMS)
  - Schema refinement: normalization (for schema design with E/R diagrams, see the book or CompSci 316)
- Next: database internals
  - Architecture, storage, indexing

What will we learn?
- How does a DBMS organize files?
- What is an index?
- What are the different types of indexes?
- How do we use indexes to optimize performance?

Data on External Storage
- Data must persist on disk across program executions in a DBMS, but it has to be fetched into main memory when the DBMS processes it.
- The unit of information for reading data from or writing data to disk is a page.
- Disks can retrieve a random page at a fixed cost, but reading several consecutive pages is much cheaper than reading them in random order.

File Organization
- File organization: a method of arranging a file of records on external storage.
- A record id (rid) is sufficient to physically locate the page containing the record on disk.
- Indexes are data structures that allow us to find the record ids of records with given values in the index search key fields.

Alternative File Organizations
Many alternatives exist, each ideal for some situations and not so good in others:
- Heap (random order) files: suitable when the typical access is a file scan retrieving all records.
- Sorted files: best if records must be retrieved in some order, or if only a "range" of records is needed.
- Indexes: data structures that organize records via trees or hashing.
  - Like sorted files, they speed up searches for a subset of records, based on values in certain ("search key") fields.
  - Updates are much faster than in sorted files.

Indexes
- An index on a file speeds up selections on the search key fields for the index.
- Any subset of the fields of a relation can be the search key for an index on the relation.
- A search key is not the same as a key (key = minimal set of fields that uniquely identify a tuple).
- An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.

Alternatives for Data Entry k* in Index
In a data entry k* we can store:
1. The actual data record with key value k, or
2. <k, rid>, where rid is the record id of a data record with search key value k, or
3. <k, rid-list>, where rid-list is a list of record ids of data records with search key value k.
The choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k.

Alternatives for Data Entries: Alternative 1
Alternative 1: the data entry k* is the actual data record with key value k.
Advantages/disadvantages?
- The index structure is itself a file organization for data records (instead of a heap file or sorted file).
- How many different indexes can use Alternative 1? At most one index on a file can use Alternative 1. Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.
- If data records are very large, the number of pages containing data entries is high. This implies that the size of the auxiliary information in the index is also large.

Alternatives for Data Entries: Alternatives 2 and 3
Alternative 2: the data entry is <k, rid>. Alternative 3: the data entry is <k, rid-list>.
Advantages/disadvantages?
- Data entries are typically much smaller than data records, so these alternatives are better than Alternative 1 when data records are large, especially if search keys are small.
- Alternative 3 is more compact than Alternative 2, but leads to variable-size data entries even if search keys have fixed length.
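The difference between Alternatives 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the `records` list, its (page, slot) record ids, and the key values are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical data: each record id is a (page, slot) pair, paired with
# an illustrative search key value (e.g. an age).
records = [
    ((0, 0), 25),
    ((0, 1), 30),
    ((1, 0), 25),
    ((1, 1), 40),
    ((2, 0), 25),
]

# Alternative 2: one <k, rid> data entry per record, kept sorted by key.
alt2 = sorted((key, rid) for rid, key in records)

# Alternative 3: one <k, rid-list> data entry per distinct key value.
grouped = defaultdict(list)
for rid, key in records:
    grouped[key].append(rid)
alt3 = sorted(grouped.items())

print(len(alt2), "data entries under Alternative 2")
print(len(alt3), "data entries under Alternative 3")
```

Alternative 3 stores each key once, which is where the space saving comes from, but the rid-lists have different lengths, so the data entries are no longer fixed-size.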

Index Classification
- Primary vs. secondary
- Clustered vs. unclustered

Primary vs. Secondary Index
- If the search key contains the primary key, the index is called a primary index; otherwise it is a secondary index.
- Unique index: the search key contains a candidate key.
- Duplicate data entries: two data entries are duplicates if they have the same value for the search key field k.
  - A primary/unique index never has duplicates.
  - Other secondary indexes can have duplicates.

Clustered vs. Unclustered Index
- If the order of data records in a file is the same as, or "close to", the order of data entries in an index, then the index is clustered; otherwise it is unclustered.
- Alternative 1 implies clustered; Alternatives 2 and 3 are typically unclustered.
- In practice, clustered also implies Alternative 1 (since sorted files are rare).
- A file can be clustered on at most one search key.
- The cost of retrieving data records through an index (for range queries) varies greatly based on whether the index is clustered or not.

Clustered vs. Unclustered Index
Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.
- To build a clustered index, first sort the heap file, leaving some free space on each page for future inserts.
- Overflow pages may be needed for inserts. Thus, data records are "close to", but not identical to, sorted.
[Figure: CLUSTERED vs. UNCLUSTERED index. In both panels, index entries direct the search for data entries (index file), and the data entries point to data records (data file).]

Methods for indexing
- Tree-based
- Hash-based
Covered in detail in the next lecture; only two examples today.

B+ Tree Indexes
- Leaf pages contain data entries, are sorted by search key, and are chained (prev & next).
- Non-leaf pages contain index entries, which are only used to direct searches.
- Index entry layout: P0 K1 P1 K2 P2 ... Km Pm (page pointers P0..Pm separated by key values K1..Km)
[Figure: non-leaf pages above a chain of leaf pages sorted by search key.]

Example B+ Tree
- Root: 17 (entries <= 17 to the left, entries > 17 to the right)
- Second level: 5, 13 (left subtree) and 27, 30 (right subtree)
- Leaf level, left to right: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*
- Note how the data entries in the leaf level are sorted.
- Exercises: Find 28*? Find 29*? Find all entries > 15* and < 30*.
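The three searches on the example tree can be traced with a small Python sketch. It is not a B+ tree implementation: the two-level tree is hard-coded, the grouping of data entries into leaf pages is inferred from the separator keys, and the names `find_leaf`, `search`, and `range_search` are hypothetical. It also assumes that a key equal to a separator goes to the right child, which makes no difference at the root here because no data entry equals 17.

```python
from bisect import bisect_right

# Leaf pages, left to right (data entries k* shown as plain keys).
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Non-leaf pages as (keys, children); children of the second level are
# positions in the `leaves` list.
left = ([5, 13], [0, 1, 2])
right = ([27, 30], [3, 4, 5])
root = ([17], [left, right])


def find_leaf(key):
    """Descend from the root to the leaf number that could hold `key`."""
    keys, children = root
    keys, children = children[bisect_right(keys, key)]
    return children[bisect_right(keys, key)]


def search(key):
    """Equality search: descend the tree, then look inside one leaf page."""
    return key in leaves[find_leaf(key)]


def range_search(low, high):
    """Return all keys k with low < k < high by walking the leaf chain."""
    result = []
    leaf_no = find_leaf(low)
    while leaf_no < len(leaves):
        for k in leaves[leaf_no]:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        leaf_no += 1  # follow the "next" pointer to the following leaf
    return result


print(search(28), search(29), range_search(15, 30))
```

The range search descends the tree only once, to the leaf for the lower bound, and then follows the leaf chain, which is why sorted, chained leaves make range queries cheap.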

Hash-Based Indexes
- Records are grouped into buckets.
  - Bucket = primary page plus zero or more overflow pages.
- Hashing function h: h(r) = bucket in which (the data entry for) record r belongs.
  - h looks at the search key fields of r.
- There is no need for index entries in this scheme.
- Good for equality selections, e.g. find all records with name = 'Joe'.
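A minimal Python sketch of an equality selection through a hash-based file follows. The bucket count, the toy hash function `h`, and the sample records are all assumptions made for illustration; each Python list stands in for a bucket's primary page plus any overflow pages.

```python
NUM_BUCKETS = 4  # illustrative bucket count, not taken from the slides


def h(search_key):
    """Toy hash function: maps a name to a bucket number."""
    return sum(ord(c) for c in search_key) % NUM_BUCKETS


# Hypothetical (name, age, sal) records.
records = [
    ("Joe", 31, 4000),
    ("Ann", 45, 5200),
    ("Joe", 27, 3100),
    ("Raj", 38, 4700),
]

# Place every record in the bucket chosen by hashing its search key (name).
buckets = [[] for _ in range(NUM_BUCKETS)]
for record in records:
    buckets[h(record[0])].append(record)


def equality_search(name):
    """Examine only the single bucket that `name` hashes to."""
    return [r for r in buckets[h(name)] if r[0] == name]


print(equality_search("Joe"))
```

Because the hash function scatters neighbouring key values across buckets, the same structure gives no help for a range selection, which would have to examine every bucket.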

Example: Hash-based index
Index-organized file hashed on AGE, with an auxiliary index on SAL.
Employee file hashed on AGE (hash function h1):
- h1(age) = 00: Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004
- h1(age) = 01: Ashby, 25, 3000; Basu, 33, 4003; Bristow, 29, 2007
- h1(age) = 10: Cass, 50, 5004; Daniels, 22, 6003
File of <SAL, rid> pairs hashed on SAL (hash function h2):
- h2(sal) = 00: 3000, 3000, 5004, 5004
- h2(sal) = 01: 4003, 2007, 6003, 6003

Different File Organizations
We need to understand the importance of an appropriate file organization and index. With search key = <age, sal>:
- Heap files: random order; insert at end-of-file
- Sorted files: sorted on <age, sal>
- Clustered B+ tree file: search key <age, sal>
- Heap file with unclustered B+ tree index on search key <age, sal>
- Heap file with unclustered hash index on search key <age, sal>

Possible Operations
- Scan: fetch all records from disk to the buffer pool.
- Equality search: find all employees with age = 23 and sal = 50. Fetch the page from disk, then locate the qualifying record in the page.
- Range selection: find all employees with age > 35.
- Insert a record: identify the page, fetch that page from disk, insert the record, write back to disk (possibly other pages as well).
- Delete a record: similar to insert.

Understanding the Workload
A workload is a mix of queries and updates.
- For each query in the workload:
  - Which relations does it access?
  - Which attributes are retrieved?
  - Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
- For each update in the workload:
  - Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
  - The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.

Choice of Indexes
- What indexes should we create?
  - Which relations should have indexes?
  - What field(s) should be the search key?
  - Should we build several indexes?
- For each index, what kind of an index should it be?
  - Clustered?
  - Hash/tree?

More on Choice of Indexes
- One approach:
  - Consider the most important queries.
  - Consider the best plan using the current indexes.
  - See if a better plan is possible with an additional index. If so, create it.
- This implies that we must understand how a DBMS evaluates queries and creates query evaluation plans.
  - We will learn query execution and optimization later. For now, we discuss simple 1-table queries.
- Before creating an index, we must also consider the impact on updates in the workload!
  - Trade-off: indexes can make queries go faster and updates slower. They require disk space, too.

Index Selection Guidelines 1/3
- Attributes in the WHERE clause are candidates for index keys.
  - An exact match condition suggests a hash index.
  - A range query suggests a tree index.
  - Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.

Index Selection Guidelines 2/3
- Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
  - The order of attributes is important for range queries.
- Such indexes can sometimes enable index-only strategies for important queries.
  - For index-only strategies, clustering is not important.

Index Selection Guidelines 3/3
- Try to choose indexes that benefit as many queries as possible.
- Since only one index can be clustered per relation, choose it based on the important queries that would benefit the most from clustering.
- Note: a clustered index should be used judiciously, because it makes updates expensive (although cheaper than with sorted files).

Examples of Clustered Indexes
Query: SELECT E.dno FROM Emp E WHERE E.age > 40
What is a good indexing strategy?
- A B+ tree index on E.age can be used to get the qualifying tuples.
- How selective is the condition? If everyone is older than 40, the index is not of much help, and a scan is as good.
- Suppose 10% of the tuples have age > 40. Then it depends on whether the index is clustered:
  - If it is unclustered, using the index can be more expensive than a linear scan.
  - If it is clustered, the cost is about 10% of the I/O of a scan (plus index pages).
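A back-of-the-envelope cost comparison makes the clustered/unclustered gap concrete. The sketch below counts page I/Os under simplifying assumptions that are not from the slides: a file of 1,000 pages with 100 records per page, 3 index pages read, and, for the unclustered case, the worst case of one page I/O per qualifying record. The function names are hypothetical.

```python
import math


def scan_cost(data_pages):
    """A full scan reads every data page once."""
    return data_pages


def clustered_cost(data_pages, selectivity_pct, index_pages_read):
    """Qualifying records are contiguous, so read that fraction of the pages."""
    return index_pages_read + math.ceil(data_pages * selectivity_pct / 100)


def unclustered_cost(num_records, selectivity_pct, index_pages_read):
    """Worst case: every qualifying record lives on a different page fetch."""
    return index_pages_read + math.ceil(num_records * selectivity_pct / 100)


# Illustrative numbers only (assumptions, not measurements).
data_pages = 1000
records_per_page = 100
num_records = data_pages * records_per_page

print("scan:       ", scan_cost(data_pages))
print("clustered:  ", clustered_cost(data_pages, 10, 3))
print("unclustered:", unclustered_cost(num_records, 10, 3))
```

Under these assumptions the clustered index reads roughly a tenth of the pages a scan does, while the unclustered index performs about ten times as many I/Os as the scan, which is why an optimizer may ignore an unclustered index at this selectivity.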

Examples of Clustered Indexes
Group-By query: SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age > 10 GROUP BY E.dno
What is a good indexing strategy?
- Use E.age as the search key? This is bad if many tuples have E.age > 10 or if the index is not clustered: using the E.age index and then sorting the retrieved tuples by E.dno may be costly.
- A clustered E.dno index may be better: first group by E.dno, then count the tuples with age > 10. This is good when the condition age > 10 is not too selective.
- Note: the first option is good when the WHERE condition is highly selective (few tuples have age > 10); the second is good when it is not highly selective.

Examples of Clustered Indexes
Equality queries and duplicates. What is a good indexing strategy?
- SELECT E.dno FROM Emp E WHERE E.hobby = 'Stamps'
  - Clustering on E.hobby helps: hobby is not a candidate key, so several tuples may qualify.
- SELECT E.dno FROM Emp E WHERE E.eid = 50
  - Does clustering help now? Not much: at most one tuple satisfies the condition.

Indexes with Composite Search Keys
- Composite search keys: search on a combination of fields.
- Equality query: every field value is equal to a constant value. E.g. with respect to a <sal, age> index: age = 20 and sal = 75.
- Range query: some field value is not a constant. E.g. sal > 10: an <age, sal> index does not help, because the field in the condition has to be a prefix of the search key.
Examples of composite key indexes using lexicographic order:
- Data records sorted by name (name, age, sal): (bob, 12, 10), (cal, 11, 80), (joe, 12, 20), (sue, 13, 75)
- Data entries sorted by <age, sal>: (11, 80), (12, 10), (12, 20), (13, 75)
- Data entries sorted by <sal, age>: (10, 12), (20, 12), (75, 13), (80, 11)
- Data entries sorted by <age>: 11, 12, 12, 13
- Data entries sorted by <sal>: 10, 20, 75, 80
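The prefix rule can be checked directly on the four example records, since Python tuples compare lexicographically just like composite search keys. This is a sketch on sorted lists rather than a real index; the variable names are illustrative.

```python
from bisect import bisect_left, bisect_right

# The example records: (name, age, sal).
emp = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries in lexicographic order for the two composite search keys.
age_sal = sorted((age, sal) for _, age, sal in emp)
sal_age = sorted((sal, age) for _, age, sal in emp)

# Condition on the prefix field of <age, sal>: age = 12.
# The matches form one contiguous run, found with two binary searches.
age_12 = age_sal[bisect_left(age_sal, (12,)):bisect_left(age_sal, (13,))]

# Condition sal > 10 on <sal, age>: sal is the prefix, so again one run.
sal_gt_10 = sal_age[bisect_right(sal_age, (10, float("inf"))):]

# The same condition on <age, sal>: sal is not a prefix, so the matches
# are scattered and every data entry has to be examined.
positions = [i for i, (_, sal) in enumerate(age_sal) if sal > 10]

print(age_sal)
print(sal_age)
print(age_12, sal_gt_10, positions)
```

The scattered positions in the last step are what "has to be a prefix" means in practice: without a condition on the leading field, the sort order of the index gives no contiguous range to search.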

Composite Search Keys
- To retrieve Emp records with age = 30 AND sal = 4000, an index on <age, sal> would be better than an index on age or an index on sal alone.
  - With a single-field index: first find age = 30, then search among those records for sal = 4000.
- If the condition is 20 < age < 30 AND 3000 < sal < 5000:
  - A clustered tree index on <age, sal> or <sal, age> is best.
- If the condition is age = 30 AND 3000 < sal < 5000:
  - A clustered <age, sal> index is much better than a <sal, age> index, because more index entries are retrieved for the latter.
- Composite indexes are larger and are updated more often.

Index-Only Plans
A number of queries can be answered without retrieving any tuples from one or more of the relations involved, if a suitable index is available.
- SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno
  - Index on <E.dno>
- SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno
  - Tree index on <E.dno, E.sal>
- SELECT AVG(E.sal) FROM Emp E WHERE E.age = 25 AND E.sal BETWEEN 3000 AND 5000
  - Tree index on <E.age, E.sal>
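The following Python sketch mimics the three index-only plans using nothing but sorted lists of data entries, so no "data record" is ever touched. The entries are hypothetical sample values, and sorted lists stand in for the leaf level of a tree index.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Hypothetical data entries of a tree index on <dno, sal>, in sorted order.
dno_sal = [(1, 3000), (1, 4500), (2, 2800), (2, 5200), (2, 6100), (3, 4000)]

# SELECT dno, COUNT(*) ... GROUP BY dno: only the dno field is needed.
counts = Counter(dno for dno, _ in dno_sal)

# SELECT dno, MIN(sal) ... GROUP BY dno: because entries are sorted by
# <dno, sal>, the first entry seen for each dno carries its minimum sal.
min_sal = {}
for dno, sal in dno_sal:
    min_sal.setdefault(dno, sal)

# Hypothetical data entries of a tree index on <age, sal>.
age_sal = sorted([(25, 2500), (25, 3200), (25, 4800), (25, 5600), (30, 4000)])

# SELECT AVG(sal) ... WHERE age = 25 AND sal BETWEEN 3000 AND 5000:
# the qualifying entries form one contiguous run in the index.
lo = bisect_left(age_sal, (25, 3000))
hi = bisect_right(age_sal, (25, 5000))
matching = [sal for _, sal in age_sal[lo:hi]]
avg_sal = sum(matching) / len(matching)

print(dict(counts), min_sal, avg_sal)
```

The MIN and AVG plans rely on the sort order of the entries, which is why those two indexes need to be tree indexes rather than hash indexes.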