Announcements Introduction to Data Management CSE 344 Webquiz 3 is due tomorrow Lectures 9: Relational Algebra (part 2) and Query Evaluation 1 2 Query Evaluation Steps Translate query string into internal representation Logical plan à physical plan SQL query Parse & Check Query Decide how best to answer query: query optimization Query Execution Return Results Check syntax, access control, table names, etc. Query Evaluation 3 Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city) From SQL to RA SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z WHERE x.pid = y.pid and y.cid = y.cid and x.price > 100 and z.city = Seattle 4 Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city) From SQL to RA SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z δ WHERE x.pid = y.pid and y.cid = y.cid and x.price > 100 and z.city = Seattle Π x.name,z.name An Equivalent Expression Query optimization = finding cheaper, equivalent expressions SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z δ WHERE x.pid = y.pid and y.cid = y.cid and x.price > 100 and z.city = Seattle Π x.name,z.name σ price>100 and city= Seattle cid=cid Product pid=pid Purchase cid=cid Customer 5 pid=pid σ price>100 Product Purchase σ city= Seattle Customer 6 1
Extended RA: Operators on Bags Duplicate elimination δ Grouping γ Sorting τ Logical Query Plan SELECT city, count(*) FROM sales GROUP BY city HAVING sum(price) > 100 Π city, c σ p > 100 T3(city, c) T2(city,p,c) T1(city,p,c) γ city, sum(price) p, count(*) c T1, T2, T3 = temporary tables sales(product, city, price) 7 8 Typical Plan for Block (1/2)... Typical Plan For Block (2/2) having condition π fields γ fields, sum/count/min/max(fields) σ selection condition join condition SELECT-PROJECT-JOIN Query π fields σ selection condition join condition join condition R S 9 10 (sno,sname,scity,sstate) How about Subqueries? (sno,sname,scity,sstate) How about Subqueries? and not exists (SELECT * WHERE P.sno = Q.sno and P.price > 100) and not exists (SELECT * WHERE P.sno = Q.sno and P.price > 100) Correlation! 11 12 2
How about Subqueries? and not exists (SELECT * WHERE P.sno = Q.sno and P.price > 100) (sno,sname,scity,sstate) De-Correlation and Q.sno not in (SELECT P.sno WHERE P.price > 100) 13 How about Subqueries? ( ) EXCEPT (SELECT P.sno WHERE P.price > 100) EXCEPT = set difference Un-nesting (sno,sname,scity,sstate) and Q.sno not in (SELECT P.sno WHERE P.price > 100) 14 (sno,sname,scity,sstate) How about Subqueries? ( ) EXCEPT (SELECT P.sno WHERE P.price > 100) Finally π sno σ sstate= WA π sno σ Price > 100 From Logical Plans to Physical Plans 15 16 Query Evaluation Steps Review SQL query (sid, sname, scity, sstate) Example Query optimization Parse & Rewrite Query Select Logical Plan Select Physical Plan Query Execution Logical plan Physical plan Give a relational algebra expression for this query Disk 17 18 3
(sid, sname, scity, sstate) Relational Algebra (σ scity= Seattle sstate= WA pno=2 ( sid = sid )) (sid, sname, scity, sstate) Relational algebra expression is also called the logical query plan Relational Algebra σ scity= Seattle sstate= WA pno=2 sid = sid 19 20 (sid, sname, scity, sstate) Physical Query Plan 1 σ scity= Seattle sstate= WA pno=2 (Nested loop) sid = sid A physical query plan is a logical query plan annotated with physical implementation details 21 (sid, sname, scity, sstate) Physical Query Plan 2 σ scity= Seattle sstate= WA pno=2 (Hash join) sid = sid Same logical query plan Different physical plan 22 (sid, sname, scity, sstate) (Sort-merge join) (Scan & write to T1) Physical Query Plan 3 sid = sid (a) σ scity= Seattle sstate= WA (d) (c) Different but equivalent logical query plan; different physical plan (b) σ pno=2 (Scan & write to T2) Query Optimization Problem For each SQL query many logical plans For each logical plan many physical plans How do find a fast physical plan? Will discuss in a few lectures 23 24 4
Demonstration with SQL Server Management Studio Query Execution 25 26 Iterator Interface Initializes operator state Sets parameters such as selection condition Operator invokes get_ recursively on its inputs Performs processing and produces an output tuple close(): clean-up state Pipelined Query Execution (Nested loop) σ sscity= Seattle sstate= WA pno=2 s sno = sno 27 Supplies 28 Pipelined Query Execution (Nested loop) σ sscity= Seattle sstate= WA pno=2 s sno = sno nex() Supplies 29 Pipelined Execution Tuples generated by an operator are immediately sent to the parent Benefits: No operator synchronization issues No need to buffer tuples between operators Saves cost of writing intermediate data to disk Saves cost of reading intermediate data from disk This approach is used whenever possible 30 5
Intermediate Tuple Materialization Tuples generated by an operator are written to disk an in intermediate table No direct benefit Necessary: For certain operator implementations When we don t have enough memory Intermediate Tuple Materialization (Sort-merge join) (Scan: write to T1) s sno = sno σ sscity= Seattle sstate= WA 31 σ pno=2 Supplies (Scan: write to T2) 32 Query Execution Bottom Line SQL query transformed into physical plan Access path selection for each relation Scan the relation or use an index (see next lecture) Implementation choice for each operator Nested loop join, hash join, etc. Scheduling decisions for operators Pipelined execution or intermediate materialization Execution of the physical plan is pull-based Physical Data Independence Means that applications are insulated from changes in physical storage details E.g., can add/remove indexes without changing apps Can do other physical tunings for performance SQL and relational algebra facilitate physical data independence because both languages are set-at-a-time : Relations as input and output 33 34 Database Management System Deployments Architectures 1. Serverless 2. Two tier: client/server 3. Three tier: client/app-server/db-server 35 36 6
Desktop DBMS Application (SQLite) Serverless User SQLite: One data file One user One DBMS application But only a limited number of scenarios work with such model Client Applications File Data file Disk 37 38 Client Applications Server Machine Client Applications Connection (JDBC, ODBC) Connection (JDBC, ODBC) 39 One server running the database Many clients, connecting via the ODBC or JDBC (Java Database Connectivity) protocol 40 Server Machine Supports many apps and many users simultaneously Client Applications Connection (JDBC, ODBC) One server that runs the DBMS (or RDBMS): Your own desktop, or Some beefy system, or A cloud service (SQL Azure) One server running the database Many clients, connecting via the ODBC or JDBC (Java Database Connectivity) protocol 41 42 7
One server that runs the DBMS (or RDBMS): Your own desktop, or Some beefy system, or A cloud service (SQL Azure) Many clients run apps and connect to DBMS Microsoft s Management Studio (for SQL Server), or psql (for postgres) Some Java program (HW7) or some C++ program One server that runs the DBMS (or RDBMS): Your own desktop, or Some beefy system, or A cloud service (SQL Azure) Many clients run apps and connect to DBMS Microsoft s Management Studio (for SQL Server), or psql (for postgres) Some Java program (HW5) or some C++ program Clients talk to server using JDBC/ODBC protocol 43 44 3-Tiers DBMS Deployment Browser 3-Tiers DBMS Deployment Browser Connection (e.g., JDBC) 45 46 3-Tiers DBMS Deployment Browser 3-Tiers DBMS Deployment Connection (e.g., JDBC) Connection (e.g., JDBC) Web-based applications 47 48 8
Replicate 3-Tiers App DBMS server Deployment for scaleup DBMS Deployment: Cloud Easy scaleup, scaledown Users Why don t we replicate the DB server too? Connection (e.g., JDBC) Developers 49 Web & App Server 50 Using a DBMS Server 1. Client application establishes connection to server 2. Client must authenticate self 3. Client submits SQL commands to server 4. Server executes commands and returns results 51 9