Databasesystemer, forår 2005 IT Universitetet i København. Forelæsning 8: Database effektivitet. 31. marts Forelæser: Rasmus Pagh

Similar documents
DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Databasesystemer, forår 2006 IT Universitetet i København. Forelæsning 9: Mere om SQL. 30. marts Forelæser: Esben Rune Hansen

Inputs. Decisions. Leads to

Handout 6 CS-605 Spring 18 Page 1 of 7. Handout 6. Physical Database Modeling

Database Systems CSE 414

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 2: Relations and SQL. September 5, Lecturer: Rasmus Pagh

Physical DB design and tuning: outline

Database Systems CSE 414

CMSC 424 Database design Lecture 13 Storage: Files. Mihai Pop

Modern Database Systems Lecture 1

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

File Structures and Indexing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Relational Database Index Design and the Optimizers

Storage and Indexing

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

Introduction to Database Systems CSE 344

Database Systems CSE 414

CSIT5300: Advanced Database Systems

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Chapter 12: Indexing and Hashing. Basic Concepts

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 16-1

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Chapter 12: Indexing and Hashing

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 10: Transaction processing. November 14, Lecturer: Rasmus Pagh

Databases IIB: DBMS-Implementation Exercise Sheet 13

Column Stores vs. Row Stores How Different Are They Really?

OBJECTIVES. How to derive a set of relations from a conceptual data model. How to validate these relations using the technique of normalization.

CSE 544 Principles of Database Management Systems

CS317 File and Database Systems

COMP102: Introduction to Databases, 14

Fundamentals of Physical Design: State of Art

Physical Database Design

Database Applications (15-415)

Wentworth Institute of Technology COMP2670 Databases Spring 2016 Derbinsky. Physical Tuning. Lecture 12. Physical Tuning

Unit 3 Disk Scheduling, Records, Files, Metadata

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Database Architectures

CS317 File and Database Systems

Physical DB Issues, Indexes, Query Optimisation. Database Systems Lecture 13 Natasha Alechina

CSE 344 FEBRUARY 14 TH INDEXING

Wentworth Institute of Technology COMP570 Database Applications Fall 2014 Derbinsky. Physical Tuning. Lecture 10. Physical Tuning

Project Revision. just links to Principles of Information and Database Management 198:336 Week 13 May 2 Matthew Stone

RELATIONAL OPERATORS #1

Physical Database Design and Performance (Significant Concepts)

CSC 261/461 Database Systems Lecture 19

Bases de Dades: introduction to SQL (indexes and transactions)

QUIZ: Is either set of attributes a superkey? A candidate key? Source:

LAB 2 Notes. Conceptual Design ER. Logical DB Design (relational) Schema Refinement. Physical DD

Introduction to Database Systems CSE 344

Index Tuning. Index. An index is a data structure that supports efficient access to data. Matching records. Condition on attribute value

Managing the Database

CISC 3140 (CIS 20.2) Design & Implementation of Software Application II

CompSci 516 Data Intensive Computing Systems

CSE544: Principles of Database Systems. Lectures 5-6 Database Architecture Storage and Indexes

Processing of Very Large Data

Relational Database Index Design and the Optimizers

Database Systemer, Forår 2006 IT Universitet i København. Lecture 10: Transaction processing. 6 april, Forelæser: Esben Rune Hansen

COMP 430 Intro. to Database Systems. Indexing

Administrivia Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Administrivia. Physical Database Design. Review: Optimization Strategies. Review: Query Optimization. Review: Database Design

Physical Database Design and Tuning. Review - Normal Forms. Review: Normal Forms. Introduction. Understanding the Workload. Creating an ISUD Chart

Administração e Optimização de Bases de Dados 2012/2013 Index Tuning

Kathleen Durant PhD Northeastern University CS Indexes

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Department of Information Technology B.E/B.Tech : CSE/IT Regulation: 2013 Sub. Code / Sub. Name : CS6302 Database Management Systems

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Database Applications (15-415)

Course Book Academic Year

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Goals for Today. CS 133: Databases. Final Exam: Logistics. Why Use a DBMS? Brief overview of course. Course evaluations

Optimization Overview

15-415/615 Faloutsos 1

Advanced Database Systems

Introduction to Databases, Fall 2003 IT University of Copenhagen. Lecture 4: Normalization. September 16, Lecturer: Rasmus Pagh

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

CS W Introduction to Databases Spring Computer Science Department Columbia University

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

Chapter 11: Indexing and Hashing

Writing High Performance SQL Statements. Tim Sharp July 14, 2014

SQL: Part III. Announcements. Constraints. CPS 216 Advanced Database Systems

RAID in Practice, Overview of Indexing

Introduction to Database Design, fall 2011 IT University of Copenhagen. Normalization. Rasmus Pagh

Questions about the contents of the final section of the course of Advanced Databases. Version 0.3 of 28/05/2018.

Ext3/4 file systems. Don Porter CSE 506

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Database Applications (15-415)

CSE 3241: Database Systems I Databases Introduction (Ch. 1-2) Jeremy Morris

Physical Database Design and Tuning

Administrivia Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

CPSC 310: Database Systems / CSPC 603: Database Systems and Applications Exam 2 November 16, 2005

TIM 50 - Business Information Systems

Spring 2013 CS 122C & CS 222 Midterm Exam (and Comprehensive Exam, Part I) (Max. Points: 100)

Database Optimization

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

Transcription:

Databasesystemer, forår 2005 IT Universitetet i København Forelæsning 8: Database effektivitet. 31. marts 2005 Forelæser: Rasmus Pagh

Today s lecture Database efficiency Indexing Schema tuning 1

Database efficiency This lecture is on database efficiency: The property that the database uses a (mostly) small amount of computational and storage resources. Here, small is relative to the best use of resources we could hope for: We don t want the database to use 100 times more storage or computation time than the best that the computer could be programmed to do. 2

RDBMS efficiency is largely automatic One of the reasons for the success of RDBMSs is that, to a large extent, they are efficient automatically. One of the great dividends of investing in an RDBMS is that you don t have to think too much about the computer s inner life. You re the programmer and say what kinds of data you want. The computer s job is to fetch it and you don t really care how. Philip Greenspun in SQL for Web Nerds This lecture is about some of the things that you, as a database programmer, might have to do to help the RDBMS improve efficiency: Define suitable indexes. Tune the database schema. 3

Hard drives Relations of large databases are usually stored on hard drives. Hard drives can store large amounts of data, but work rather slowly compared to the memory of a modern computer: The time to access a specific piece of data is on the order of 10 6 times slower. The rate at which data can be read is on the order of 10 100 times slower. For databases that do not fit in the computer s main memory, the time used for accessing disks is usually the main performance bottleneck. 4

Using several hard drives Many database systems use several hard drives to: Enable several pieces of data to be fetched in parallel. Increase the total rate of data from disk. Systems of several disks are often arranged in so-called RAID systems, which support various levels of performance improvement and error resilience. Even in systems with many hard drives, the time used for accessing disks is usually the main performance bottleneck. 5

Next: Indexing

Full table scans When a DBMS encounters a query of the form SELECT * FROM R WHERE <condition> the obvious thing to do is read through the tuples of R and report those tuples that satisfy the condition. This is called a full table scan. 7

Selective queries Consider the selection query from before: SELECT * FROM R WHERE <condition> If we have to report 80% of the tuples in R, it makes sense to do a full table scan. On the other hand, if the query is very selective, and returns just a small percentage of the tuples we might hope to do better: Is there a way of skipping over tuples that will not be selected? 8

Non-computer search for information The following are well-known examples that there are ways of going directly to the desired information : A phone book. A dictionary. A cookbook. Database systems use similar principles to find information quickly. Sometimes, the equivalent of a full table scan is needed, e.g. when looking for recipe combining eggs and mustard at 220 degrees in the oven. 9

Point queries Consider a selection query with a single equality in the condition: SELECT * FROM R WHERE grade = 11 This is a so-called point query: We report all grades at the point 11. Point queries are often very selective. Suppose the tuples of R were sorted by grade. Then the DBMS could: 1. Search for the first tuple with grade 11. 2. Report this and all the following tuples with grade 11. 10

Range queries Consider a selection query with a single inequality in the condition: SELECT * FROM R WHERE grade > 6 AND grade < 10 This is an example of a range query, since we report all grades in the range 7 to 9. Suppose that the tuples of R were sorted according to grade. Then the DBMS could: 1. Search for the first tuple with grade > 6. 2. Report this and all the following tuples with grade < 10. 11

Indexes To be able to quickly find the first tuple with a specific grade, the DBMS may build an index on the grade attribute. A database index is similar to an index in the back of a book: 1. For every piece of data you might be interested in (e.g., the attribute value 9), the index says where to find it. 2. The index itself is organized such that one can quickly do the lookup. Looking for information in a relation with the help of an index is called an index scan. 12

Primary indexes If the tuples of a relation are stored sorted according to some attribute, an index on this attribute is called primary. Primary indexes make point and range queries on the key very efficient. Many DBMSs automatically build a primary index on the primary key of each relation. (In Oracle the programmer must explicitly specify that the table should be index organized.) A primary index is sometimes referred to as a clustering or sparse index. 13

Secondary indexes It is possible to have more than one index on a relation. While not part of the SQL standard, additional indexes can usually be created by writing statements such as: CREATE INDEX gradeindex ON projects (grade); The non-primary indexes are called secondary indexes. Secondary indexes make most point queries on the key more efficient. Secondary indexes make some range queries on the key more efficient. A secondary index is sometimes referred to as non-clustering or dense index. 14

Index scan versus full scan Point and range queries on the attribute(s) of the primary index are almost always best performed using an index scan. Secondary indexes should be used with high selectivity queries: As a rule of thumb, a secondary index scan is faster than a full table scan for queries returning less than 10-20% of a relation. 15

Choosing whether to use the index Even if an index scan is possible, a good DBMS sometimes chooses to do a full table scan. The decision is usually based on statistics on the data in the relation, which allows the selectivity of the query to be estimated. Often computation of the statistics is controlled manually. In Oracle, statistics can be computed as follows: ANALYZE TABLE <relation> COMPUTE STATISTICS; 16

Indexes on several attributes An index can be defined on one or more attributes, e.g. CREATE INDEX myindex ON projects (grade,start,exam); Such an index speeds up point queries such as: SELECT * FROM projects WHERE grade = 13 AND start= 2003-11-24 AND exam= 2004-01-28 ; Due to the way indexes are (usually) implemented (as a B-tree), an index on several attributes automatically gives index for any prefix of these attributes. Example: myindex also gives an index for (grade,start) and for (grade). 17

Indexes on several attributes, intuition An index on the attributes (grade,start,exam) has a similar effect as storing the tuples sorted first according to grade, secondly according to start, and thirdly sorted according to exam. grade start exam......... 11 2003-11-24 2004-01-28 11 2004-11-27 2005-01-07 13 2003-11-24 2004-01-28 13 2003-11-24 2004-01-28 13 2003-11-24 2005-01-07 13 2004-11-27 2005-01-07 13 2004-11-27 2005-01-07 18

Problem session (5 minutes) What kinds of point and range queries are easy when the relation is stored as in the previous example: A range query on the first attribute? A range query on the second attribute? A point query on the second attribute? A point query on the first attribute combined with a range query on the second attribute? A point query on the second attribute combined with a range query on the first attribute? 19

Other uses of indexes Indexes are used by the DBMS to speed up other operations than point and range queries. Join operations are often considerably faster when the join attributes are indexed. Indexes are used by DBMSs to efficiently check referential integrity constraints (such indexes are usually automatically created). 20

Time usage of indexes Based on what you have seen until now, a natural question would be: Why not make indexes for all possible sets of attributes? The main reason is that indexes need to be updated when the relation changes, and index updates take time. Thus we have the following trade-off for every index: It speeds up certain database queries, but slows down every update to the relation. Whether an index is a good idea is thus a matter of weighing the time saved on queries against the additional time spent on updates. 21

Space usage of indexes In addition to taking time to update, an index has a space cost. Primary indexes usually have modest space requirements, i.e., considerably less than the space for the relation itself. Secondary indexes use space similar to that required for the attributes indexed. For space reasons, one should therefore be careful with creating secondary indexes on large amounts of data. 22

Other types of indexes Indexing is a science of it own: There are special index types such as bitmap indexes and hash indexes that are more efficient that B-trees for some types of data. There are join indexes that speed up join operations (but are expensive to update). There are indexes for geometric and multidimensional data. There are indexes for textual data.... MDM has a few short descriptions, but a deeper understanding of indexes is beyond the scope of this course. 23

Join Advanced database technology If you have taken an introductory course in algorithms (at ITU or elsewhere), or will take the course in the fall semester, there is the possibility to learn about the inner workings of indexes, and other things touched upon by MDM 6, in spring 2006. Advanced database technology Taught by Anna Östlin and Rasmus Pagh The course contains explanation of disk based algorithms, indexes, query optimization, and many other things to make you a database expert! Understand performance rules of thumb and know when they apply! 24

Next: Schema tuning

Use the smallest possible data type DBMSs often offer many different data types. Among other things, this is to allow programmers to choose a data type with the smallest possible space usage. Space may have an impact on time, e.g., if tuples are twice as long as needed, a full table scan takes twice as long as needed. In Oracle, space can often be saved be replacing fixed length strings such as CHAR(20) with variable length strings such as VARCHAR(20). 26

Use of a code table For non-standard data types with few possible values (e.g. types of wood ), space can be saved by introducing a relation with short integer codes for each possible value (encoded in a less space efficient data type). The drawback of this approach is that a join operation is necessary to combine the information in the two relations. 27

Normalization and efficiency The good: Normalization can be used to eliminate redundancy in a database design, and thus avoid all sorts of problems. Also, less redundancy means that e.g. full table scans are faster. The bad: Normalization implies that more join operations must be performed when answering database queries. Performance may not be adequate. Sometimes a database designer may choose to denormalize a database schema, i.e., join several relation schemas into one. This may make some queries run faster. However, updates must then be handled more carefully, and the space usage may rise. 28

Denormalization Typical denormalization scenarios: 1-1 relationship. Joining the relations of the two entity types will result in no (or little) redundancy, and will usually still be normalized. M-1 relationship. Joining in this case will not result in a much larger relation if an entity instance on the one side is typically related to few instances on the many side, or if the total size of the attributes on the one side is small. Some dangers of denormalization: Destroys the advantages of a normalized design. Requires more programming and is more error-prone. May speed up some queries, but slow others down. Recent research results on join processing suggest that in future database systems the effect of denormalization will diminish. 29

Denormalization vs materialized views An alternative to denormalizing the database schema is to create a materialized view that maintains the join of some of the relations. The good: Accessing the materialized view is the same as accessing the corresponding relation in a denormalized design. Updates can be done in the normalized schema (no anomalies). The bad: Space usage is even larger than in denormalized design. The DBMS may not handle updates as efficiently as updates in a hand-coded denormalized design. 30

Vertical and horizontal partitioning Related to, but independent of, denormalization, the DB programmer can instruct the DBMS to use a particular physical layout of a relation. Horizontal partitioning. If some (frequent) queries deal only with some attributes of a relation, it can speed up processing to physically split the relation into two parts one with those attributes, and one with the rest. Vertical partitioning. If some (frequent) queries deal with tuples having a particular value on some attribute, it can speed up processing to physically split the relation into parts depending on the value of that attribute. This is a light alternative to having a primary index. The cost of horizontal partitioning is that it takes more time to retrieve an entire tuple. The cost of vertical partitioning is that insertions of new tuples take more time. 31

Most important points in this lecture As a minimum, you should after this week: Remember that disk accesses are the performance bottleneck of most large databases. Be able to estimate the value of a particular index based on: The selectivity of typical queries. The frequency of queries and updates. Understand the basic techniques of schema tuning. In this lecture some details were given that you will not find in the course book or supplementary material: You are expected to know this to the detail given by the slides. 32