Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Similar documents
Query Evaluation (i)

Overview of Query Processing

The Database Language SQL (i)

Database Management System. Relational Algebra and operations

Database Management Systems. Chapter 4. Relational Algebra. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Relational Algebra. [R&G] Chapter 4, Part A CS4320 1

Relational Algebra. Chapter 4, Part A. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

SQL: Queries, Programming, Triggers

Basic form of SQL Queries

CAS CS 460/660 Introduction to Database Systems. Relational Algebra 1.1

Introduction to Data Management. Lecture #14 (Relational Languages IV)

Relational Algebra. Relational Query Languages

Introduction to Data Management. Lecture #13 (Relational Calculus, Continued) It s time for another installment of...

Relational Algebra. Note: Slides are posted on the class website, protected by a password written on the board

Relational Query Languages. Relational Algebra. Preliminaries. Formal Relational Query Languages. Relational Algebra: 5 Basic Operations

Relational Algebra 1

SQL: Queries, Constraints, Triggers

CIS 330: Applied Database Systems

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Relational algebra. Iztok Savnik, FAMNIT. IDB, Algebra

Relational Query Languages

CSCC43H: Introduction to Databases. Lecture 3

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

SQL: Queries, Programming, Triggers. Basic SQL Query. Conceptual Evaluation Strategy. Example of Conceptual Evaluation. A Note on Range Variables

Relational Algebra. Study Chapter Comp 521 Files and Databases Fall

CS330. Query Processing

Overview of Query Evaluation. Overview of Query Evaluation

Database Applications (15-415)

Evaluation of relational operations

CompSci 516 Data Intensive Computing Systems

Relational Algebra Homework 0 Due Tonight, 5pm! R & G, Chapter 4 Room Swap for Tuesday Discussion Section Homework 1 will be posted Tomorrow

Introduction to Data Management. Lecture #11 (Relational Algebra)

MIS Database Systems Relational Algebra

SQL. Chapter 5 FROM WHERE

Relational Query Languages. Preliminaries. Formal Relational Query Languages. Example Schema, with table contents. Relational Algebra

Database Applications (15-415)

Overview of Query Evaluation

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Database Systems. Course Administration. 10/13/2010 Lecture #4

Database Applications (15-415)

Implementation of Relational Operations

Database Systems. October 12, 2011 Lecture #5. Possessed Hands (U. of Tokyo)

15-415/615 Faloutsos 1

CIS 330: Applied Database Systems. ER to Relational Relational Algebra

Database Systems ( 資料庫系統 )

Relational Algebra and Calculus. Relational Query Languages. Formal Relational Query Languages. Yanlei Diao UMass Amherst Feb 1, 2007

v Conceptual Design: ER model v Logical Design: ER to relational model v Querying and manipulating data

Evaluation of Relational Operations. Relational Operations

Evaluation of Relational Operations

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

SQL Part 2. Kathleen Durant PhD Northeastern University CS3200 Lesson 6

SQL: Queries, Constraints, Triggers (iii)

Lecture 3 SQL - 2. Today s topic. Recap: Lecture 2. Basic SQL Query. Conceptual Evaluation Strategy 9/3/17. Instructor: Sudeepa Roy

Evaluation of Relational Operations

SQL: The Query Language Part 1. Relational Query Languages

Lecture 3 More SQL. Instructor: Sudeepa Roy. CompSci 516: Database Systems

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Review: Where have we been?

Principles of Data Management. Lecture #9 (Query Processing Overview)

Lecture 2 SQL. Instructor: Sudeepa Roy. CompSci 516: Data Intensive Computing Systems

Introduction to Data Management. Lecture 14 (SQL: the Saga Continues...)

Evaluation of Relational Operations

Today s topics. Null Values. Nulls and Views in SQL. Standard Boolean 2-valued logic 9/5/17. 2-valued logic does not work for nulls

Overview of Implementing Relational Operators and Query Evaluation

CompSci 516: Database Systems

Relational Algebra 1. Week 4

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Relational Algebra 1

Database Applications (15-415)

Relational Query Optimization. Highlights of System R Optimizer

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Relational Query Optimization

Experimenting with bags (tables and query answers with duplicate rows):

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Keys, SQL, and Views CMPSCI 645

The SQL Query Language. Creating Relations in SQL. Adding and Deleting Tuples. Destroying and Alterating Relations. Conceptual Evaluation Strategy

Enterprise Database Systems

Evaluation of Relational Operations. SS Chung

ECS 165B: Database System Implementa6on Lecture 7

Overview of Query Evaluation. Chapter 12

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Data Science 100. Databases Part 2 (The SQL) Slides by: Joseph E. Gonzalez & Joseph Hellerstein,

CSIT5300: Advanced Database Systems

Database Applications (15-415)

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Implementation of Relational Operations: Other Operations

SQL - Lecture 3 (Aggregation, etc.)

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Databases - Relational Algebra. (GF Royle, N Spadaccini ) Databases - Relational Algebra 1 / 24

This lecture. Projection. Relational Algebra. Suppose we have a relation

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Database Applications (15-415)

Data Science 100 Databases Part 2 (The SQL) Previously. How do you interact with a database? 2/22/18. Database Management Systems

This lecture. Step 1

Database Applications (15-415)

ATYPICAL RELATIONAL QUERY OPTIMIZER

Principles of Data Management. Lecture #12 (Query Optimization I)

Transcription:

ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 1

Example Relations Sailors( sid: integer, sname: string, rating: integer, age: real) Boats( bid: integer, bname: string, color: string) Reserves( sid: integer, bid: string, day: date) S1 B1 R1 sid bid day 22 101 10/10/96 58 103 11/12/96 sid sname rating age 22 Dustin 7 45.0 31 Lubber 8 55.5 58 Rusty 10 35.0 bid bname color 101 Interlake Blue 102 Interlake Red 103 Clipper green 104 Marine Red 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 2

Basic SQL Query SELECT [ DISTINCT ] target-list FROM relation-list WHERE qualification relation-list A list of relation names (possibly with a range-variable after each name). target-list A list of attributes of relations in relation-list qualification Comparisons (Attr op const or Attr1 op Attr2, where op is one of <, >,,, =, ) combined using AND, OR and NOT. DISTINCT is an optional keyword indicating that the answer should not contain duplicates. Default is that duplicates are not eliminated! 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 3

Example Q1 SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 Without range variables SELECT sname FROM Sailors, Reserves WHERE Sailors.sid=Reserves.sid AND bid=103 Range variables really needed only if the same relation appears twice in the FROM clause. Good style to always use range variables 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 4

Conceptual Evaluation Strategy Semantics of an SQL query defined in terms of the following conceptual evaluation strategy: 1. Compute the cross-product of relation-list. 2. Discard resulting tuples if they fail qualifications. 3. Delete attributes that are not in target-list. 4. If DISTINCT is specified, eliminate duplicate rows. This strategy is probably the least efficient way to compute a query! An optimizer will find more efficient strategies to compute the same answers. 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 5

Example Q1: conceptual evaluation SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 S.sid sname rating age R.sid bid day 22 Dustin 7 45 22 101 10/10/96 22 Dustin 7 45 58 103 11/12/96 31 Lubber 8 55.5 22 101 10/10/96 31 Lubber 8 55.5 58 103 11/12/96 58 Rusty 10 35.0 22 101 10/10/96 58 Rusty 10 35.0 58 103 11/12/96 S.sid sname rating age R.sid bid day 58 Rusty 10 35.0 58 103 11/12/96 Conceptual Evaluation Steps: 1. Compute cross-product 2. Discard disqualified tuples 3. Delete unwanted attributes 4. If DISTINCT is specified, eliminate duplicate rows. sname Rusty 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 6

Relational Algebra Basic operations: Selection (σ) Selects a subset of rows from relation. Projection (π) Deletes unwanted columns from relation. Cross-product ( ) Allows us to combine two relations. Set-difference ( ) Tuples in reln. 1, but not in reln. 2. Union (U) Tuples in reln. 1 and in reln. 2. Additional operations: Intersection, join, division, renaming: Not essential, but (very!) useful. Since each operation returns a relation, operations can be composed! (Algebra is closed.) 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 7

Projection Deletes attributes that are not in projection list. Schema of result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation. Projection operator has to eliminate duplicates! (Why??) Note: real systems typically don t do duplicate elimination unless the user explicitly asks for it. (Why not?) π sname, rating (S2) sname rating Yuppy 9 Lubber 8 Guppy 5 Rusty 10 π age (S2) age 35.0 55.5 35.0 35.0 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 8

Selection Selects rows that satisfy selection condition. No duplicates in result! (Why?) Schema of result identical to schema of (only) input relation. Result relation can be the input for another relational algebra operation! (Operator composition.) σ rating > 8 (S2) sid sname rating age 28 Yuppy 9 35.0 31 Lubber 8 55.5 44 Guppy 5 35.0 58 Rusty 10 35.0 π sname, rating (σrating>8 (S2)) sid sname rating age 28 Yuppy 9 35.0 31 Lubber 8 55.5 44 Guppy 5 35.0 58 Rusty 10 35.0 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 9

Union, Intersection, Set-Difference All of these operations take two input relations, which must be union-compatible: Same number of fields. `Corresponding fields have the same type. What is the schema of result? S1 sid sname rating age 22 Dustin 7 45.0 31 Lubber 8 55.5 58 Rusty 10 35.0 S2 S1 U S2 sid sname rating age 22 Dustin 7 45.0 28 Yuppy 9 35.0 31 Lubber 8 55.5 44 Guppy 5 35.0 58 Rusty 10 35.0 sid sname rating age 28 Yuppy 9 35.0 31 Lubber 8 55.5 44 Guppy 5 35.0 58 Rusty 10 35.0 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 10

Intersection & Set-Difference S1 S2 S1 S2 sid sname rating age 31 Lubber 8 55.5 58 Rusty 10 35.0 sid sname rating age 22 Dustin 7 45.0 S1 sid sname rating age 22 Dustin 7 45.0 31 Lubber 8 55.5 58 Rusty 10 35.0 S2 sid sname rating age 28 Yuppy 9 35.0 31 Lubber 8 55.5 44 Guppy 5 35.0 58 Rusty 10 35.0 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 11

S1 Cross-Product Consider the cross product of S1 with R1 Each row of S1 is paired with each row of R1. Result schema has one field per field of S1 and R1, with field names `inherited if possible. Conflict: Both S1 and R1 have a field called sid. Rename to sid1 and sid2 R1 sid bid day 22 101 10/10/96 58 103 11/12/96 sid sname rating age 22 Dustin 7 45.0 31 Lubber 8 55.5 58 Rusty 10 35.0 S1 R1 sid sname rating age sid bid day 22 Dustin 7 45 22 101 10/10/96 22 Dustin 7 45 58 103 11/12/96 31 Lubber 8 55.5 22 101 10/10/96 31 Lubber 8 55.5 58 103 11/12/96 58 Rusty 10 35.0 22 101 10/10/96 58 Rusty 10 35.0 58 103 11/12/96 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 12

Joins Condition Join: R c S c( R S) Result schema same as that of cross-product. Fewer tuples than cross-product, might be able to compute more efficiently Sometimes called a theta-join. S 1 R 1 S1. sid R1. sid sid sname rating age sid bid day 22 Dustin 7 45 58 103 11/12/96 31 Lubber 8 55.5 58 103 11/12/96 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 13

Equi-Joins & Natural Joins Equi-join: A special case of condition join where the condition c contains only equalities. Result schema similar to cross-product, but only one copy of fields for which equality is specified. Natural Join: Equi-join on all common fields. S1 R1 sid sid sname rating age bid day 22 Dustin 7 45 101 10/10/96 58 Rusty 10 35.0 103 11/12/96 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 14

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result A SCAN (sid=101) Optimizer Reserves SELECT * FROM Reserves WHERE sid=101 Sid=101 Reserves IDXSCAN (sid=101) fetch Reserves 32.0 Index(sid) 25.0 Pick B Evaluate Plan A B 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 15

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Input : SQL Parse Query Eg. SELECT-FROM-WHERE, CREATE TABLE, DROP TABLE statements Output: Some data structure to represent the query Relational algebra? Also checks syntax, resolves aliases, binds names in SQL to objects in the catalog How? 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 16

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Enumerate Plans Input : a data structure representing the query Output: a collection of equivalent query evaluation plans Query Execution Plan (QEP): tree of database operators. high-level: RA operators are used low-level: RA operators with particular implementation algorithm. Plan enumeration: find equivalent plans Different QEPs that return the same results Query rewriting : transformation of one QEP to another equivalent QEP. 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 17

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Estimate Cost Input : a collection of equivalent query evaluation plans Output: a cost estimate for each QEP in the collection Cost estimation: a mapping of a QEP to a cost Cost Model: a model of what counts in the cost estimate. Eg. Disk accesses, CPU cost Statistics about the data and the hardware are used. 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 18

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Choose Best Plan Input : a collection of equivalent query evaluation plans and their cost estimate Output: best QEP in the collection The steps: enumerate plans, estimate cost, choose best plan collectively called the: Query Optimizer: Explores the space of equivalent plan for a query Chooses the best plan according to a cost model 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 19

Query Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Evaluate Query Plan Input : a QEP (hopefully the best) Output: Query results Often includes a code generation step to generate a lower level QEP in executable code. Query evaluation engine is a virtual machine that executes some code representing low level QEP. 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 20

Query Execution Plans (QEPs) A tree of database operators: each operator is a RA operator with specific implementation Selection : Index Scan or Table Scan Projection π: Without DISTINCT : Table Scan With DISTINCT : requires sorting or index scan Join : Nested loop joins (naïve) Index nested loop joins Sort merge joins Sort : In-memory sort External sort 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 21

QEP Examples SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 π S.sname π S.sname S.rating>5 AND R.bid=100 Reserves R.sid=S.sid Sailors π S.sname S.rating>5 R.bid=100 R.sid=S.sid Reserves Sailors π S.sname S.rating>5 AND R.bid=100 R.sid=S.sid Reserves (SCAN) On the fly On the fly Nested Loop Join Sailors (SCAN) R.bid=100 Reserves R.sid=S.sid (SCAN) On the fly Nested Loop Join Temp T1 S.rating>5 Sailors (SCAN) 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 22

Access Paths An access path is a method of retrieving tuples. Eg. Given a query with a selection condition: File or table scan Index scan Index matching problem: given a selection condition, which indexes can be used for the selection, i.e., matches the selection? Selection condition normalized to conjunctive normal form (CNF), where each term is a conjunct Eg. (day<8/9/94 AND rname= Paul ) OR bid=5 OR sid=3 CNF: (day<8/9/94 OR bid=5 OR sid=3 ) AND (rname= Paul OR bid=5 OR sid=3) R.bid=100 π S.sname R.sid=S.sid On the fly Nested Loop Join S.rating>5 (SCAN) (SCAN) Reserves Sailors R.bid=100 Fetch Index(R.bid) (IDXSCAN) Temp T1 Reserves 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 23

Index Matching Q1: a=5 AND b=3 Q2: a=5 AND b>6 Q3: b=3 Q4: a=5 AND b=3 AND c=5 Q5: a>5 AND b=3 AND c=5 I1: Tree Index (a,b,c) I2: Tree Index (b,c,d) I3: Hash Index (a,b,c) A tree index matches a selection condition if the selection condition is a prefix of the index search key. A hash index matches a selection condition if the selection condition has a term attribute=value for every attribute in the index search key 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 24

Unstructured Text Data Field of Information Retrieval Data Model Collection of documents Each document is a bag of words (aka terms) Query Model Keyword + Boolean Combinations Eg. DBMS and SQL and tutorial Details: Not all words are equal. Stop words (eg. the, a, his...) are ignored. Stemming : convert words to their basic form. Eg. Surfing, surfed becomes surf 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 25

Inverted Indexes Recall: an index is a mapping of search key to data entries What is the search key? What is the data entry? Inverted Index: For each term store a list of postings A posting consists of <docid,position> pairs lexicon Posting lists DBMS doc01 10 18 20 doc02 5 38 doc01 13 SQL doc06 1 12 doc09 4 9 doc20 12 trigger doc01 12 15 doc09 14 21 25 doc10 11 55...... What is the data in an inverted index sorted on? 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 26

lexicon Lookups using Inverted Indexes Posting lists DBMS doc01 10 18 20 doc02 5 38 doc01 13 SQL doc06 1 12 doc09 4 9 doc20 12 trigger doc01 12 15 doc09 14 21 25 doc10 11 55...... Given a single keyword query k (eg. SQL) Find k in the lexicon Retrieve the posting list for k Scan posting list for document IDs [and positions] What if the query is k1 and k2? Retrieve document IDs for k1 and k2 Perform intersection 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 27

Too Many Matching Documents Rank the results by relevance! Vector-Space Model Documents are vectors in hidimensional space Each dimension in the vector represents a term Queries are represented as vectors similarly Vector distance (dot product) between query vector and document vector gives ranking criteria Weights can be used to tweak relevance PageRank (later) Star Doc about movie stars Doc about astronomy Doc about behavior Diet 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 28

How good are the retrieved docs? Precision : Fraction of retrieved docs that are relevant to user s information need Recall : Fraction of relevant docs in collection that are retrieved 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 29