Avoiding Sorting and Grouping In Processing Queries

Outline
- Motivation
- Simple Example
- Order Properties
- Grouping followed by ordering
- Order Property Optimization
- Performance Results
- Conclusion

Motivation
- Previous presentation: Fundamental Techniques for Order Optimization
  - Using FDs and selection predicates
  - Determining order propagation from input to output
  - Infer from ordering
- Current presentation: aside from orderings, we also infer how relations are grouped (i.e., how records in relations are clustered according to the values of certain attributes)
  - Infer from grouping
  - Infer from secondary ordering

Motivation (cont.)
- Inferred orderings
  - Make it possible to avoid sorting when processing the ORDER BY clause of an SQL query
- Inferred groupings
  - Avoid sorting or hashing prior to computing aggregates for GROUP BY clauses
  - Reduce the cost of projection with duplicate elimination: projection and duplicate elimination complete in a single pass
  - Reduce the cost of evaluating selection queries of the form $\sigma_{A=k}(R)$ in the absence of indexes or an ordering on A
- Inference of secondary orderings and groupings
  - Avoids unnecessary sorting or grouping over multiple attributes
  - Infers new primary orderings or groupings (example follows)

Simple Example
- Benefits of inferring grouping and secondary ordering
- TPC-H query:
    SELECT c_custkey, COUNT(*)
    FROM Customer, Supplier
    WHERE c_nationkey = s_nationkey
    GROUP BY c_custkey
- The query asks how many suppliers could supply each customer directly, without having to go through customs

Simple Example (cont.)
- The Postgres plan first sorts the join result on the grouping attribute c_custkey so as to be able to aggregate over groups in a single pass
- But one-pass aggregation requires data only to be grouped, not sorted!
- The sort-merge join result is sorted (and hence grouped) on c_nationkey, and within each c_nationkey group the output tuples are themselves grouped on the key of the outer relation (c_custkey): $c\_nationkey^{G} \to c\_custkey^{G}$ => no sort
- Postgres QEP of the query (root first):
    group c_custkey, count(*)
      sort c_custkey
        merge-join c_nationkey = s_nationkey
          sort c_nationkey
            table scan customer
          sort s_nationkey
            table scan supplier
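The claim that one-pass aggregation needs grouped input, not sorted input, can be illustrated with a minimal Python sketch. The function name `one_pass_count` and the sample rows are hypothetical, not taken from the slides; the rows are clustered on c_custkey but deliberately not in sorted order.

```python
from itertools import groupby

def one_pass_count(tuples, key):
    """Count tuples per group in a single pass.

    Correct only when tuples with equal keys are adjacent (grouped);
    the groups themselves may appear in any order.
    """
    return [(k, sum(1 for _ in run)) for k, run in groupby(tuples, key=key)]

# Hypothetical join output: (c_custkey, supplier) pairs, grouped but not sorted.
joined = [(7, "s1"), (7, "s2"), (3, "s1"), (9, "s4"), (9, "s5"), (9, "s6")]
counts = one_pass_count(joined, key=lambda t: t[0])
print(counts)  # [(7, 2), (3, 1), (9, 3)]
```

If equal keys were not adjacent, the same key would be reported more than once, which is why some guarantee of grouping (though not of ordering) is needed before the aggregate.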

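Order Properties
- Order properties have the form $A_1^{\alpha_1} \to A_2^{\alpha_2} \to \dots \to A_n^{\alpha_n}$
- Each $A_i$ is an attribute; each $\alpha_i$ specifies either an ordering ($\alpha_i = O$) or a grouping ($\alpha_i = G$)
- $A_1^{\alpha_1}$ is the primary ordering or grouping and $A_2^{\alpha_2}$ the secondary
- Signatures for ordering constructs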

Order Properties (cont.)
- General form of an order property and its shorthand [formulas not preserved]
- Concatenation of order properties $o_1$ and $o_2$: $o_1 \to o_2$

Order Properties (cont.)
Identities:
1. For any order property that holds of a physical relation R, all prefixes of that order property also hold of R
2. An ordering on any attribute implies a grouping on that attribute
3. If X functionally determines B, and an order property includes all attributes in X (ordered or grouped) before $B^{\alpha}$, then $B^{\alpha}$ is superfluous

Order Properties (cont.)
Identities (cont.):
4. A special case of identity #3, covering the case where X consists of a single attribute
5. The grouping of an attribute that is functionally determined by the attribute that follows it in the order property is superfluous
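As a rough illustration of the first three identities, the sketch below represents an order property as a list of (attribute, "O" or "G") pairs. The representation, the function names, and the sample FD c_custkey -> c_name are assumptions made for this example; they are not the authors' data structures.

```python
from typing import FrozenSet, List, Tuple

OrderProperty = List[Tuple[str, str]]  # [(attribute, "O" or "G"), ...]
FD = Tuple[FrozenSet[str], str]        # (determinant set X, determined attribute B)

def prefixes(op: OrderProperty) -> List[OrderProperty]:
    """Identity 1: every non-empty prefix of a satisfied order property holds."""
    return [op[:i] for i in range(1, len(op) + 1)]

def ordering_to_grouping(op: OrderProperty, i: int) -> OrderProperty:
    """Identity 2: an ordering on an attribute implies a grouping on it."""
    attr, _ = op[i]
    return op[:i] + [(attr, "G")] + op[i + 1:]

def drop_determined(op: OrderProperty, fds: List[FD]) -> OrderProperty:
    """Identity 3: drop B^alpha when some X -> B has all of X earlier in op."""
    kept, earlier = [], set()
    for attr, kind in op:
        if not any(b == attr and x <= earlier for x, b in fds):
            kept.append((attr, kind))
        earlier.add(attr)
    return kept

# Hypothetical order property and FD, for illustration only.
op = [("c_nationkey", "O"), ("c_custkey", "G"), ("c_name", "G")]
fds = [(frozenset({"c_custkey"}), "c_name")]
reduced = drop_determined(op, fds)
print(reduced)  # [('c_nationkey', 'O'), ('c_custkey', 'G')]
```

Identity 4 is covered by `drop_determined` when X has a single attribute. Identity 5 is left out of the sketch on purpose.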

Grouping followed by ordering
- Suppose that R = (A, B) consists of 10 tuples, $t_1, \dots, t_{10}$, and that its physical representation satisfies the order property $A^{O} \to B^{G}$
- This order property is illustrated on the next slide

Grouping followed by ordering (cont.)
- The primary ordering ($A^{O}$) says that the group of tuples with A=1 precedes the group of tuples with A=2, which precedes the group with A=3
- The secondary grouping ($B^{G}$) says that within each group of tuples with like values of A, tuples are clustered together if they have the same value for B: $t_1$ can precede $t_2$ or $t_2$ can precede $t_1$, but they must be adjacent
- Figure: an illustration of $A^{O} \to B^{G}$, with tuples $t_1, \dots, t_{10}$ partitioned by A (A=1 < A=2 < A=3) and, within each A group, by B
- Two example permutations that satisfy the order property:
    t2, t1, t3, t10, t8, t9, t6, t7, t4, t5
    t1, t2, t3, t9, t8, t10, t4, t5, t6, t7
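One way to make the meaning of an order property such as $A^{O} \to B^{G}$ concrete is a checker that tests whether a given tuple sequence satisfies it. The sketch below is an illustration under assumptions: orderings are taken to be ascending, the function name `satisfies` is hypothetical, and the sample rows are made up rather than copied from the figure.

```python
from itertools import groupby

def satisfies(rows, op):
    """Return True if rows (a list of dicts) satisfy the order property op.

    op is a list of (attribute, kind) pairs, kind being "O" (ascending
    ordering) or "G" (grouping); later pairs apply within each run of
    equal values of the earlier attributes.
    """
    if not op:
        return True
    (attr, kind), rest = op[0], op[1:]
    runs = [(k, list(g)) for k, g in groupby(rows, key=lambda r: r[attr])]
    values = [k for k, _ in runs]
    if len(set(values)) != len(values):   # a value recurs: not grouped
        return False
    if kind == "O" and values != sorted(values):
        return False
    return all(satisfies(run, rest) for _, run in runs)

# Hypothetical data: ordered on A; within each A group, clustered on B.
rows = [
    {"A": 1, "B": 2}, {"A": 1, "B": 1}, {"A": 1, "B": 1},
    {"A": 2, "B": 1},
    {"A": 3, "B": 3}, {"A": 3, "B": 2},
]
grouped_ok = satisfies(rows, [("A", "O"), ("B", "G")])   # True
ordered_ok = satisfies(rows, [("A", "O"), ("B", "O")])   # False: B = 2 precedes B = 1
```

The second call fails because a secondary grouping does not imply a secondary ordering, which is the distinction the slide draws.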

Order Property Optimization: Postgres Plan Operators Summarized
The data structures for all plan nodes in Postgres include the following fields:
- inp_1, ..., inp_n: the fields contained in all input tuples to the node
- left: the left subtree of the node (set to Null for leaf nodes and Append)
- right: the right subtree of the node (set to Null for leaf nodes, unary operators and Append)

Order Property Optimization Postgres Plan Operators Summarized(cont.) Additional operator-specific fields provided by Postgres and used by our refinement algorithm

Order Property Optimization: Postgres Plan Operators Summarized (cont.)
- Group: performs two passes over its input:
  1. insert Null values between pairs of consecutive tuples with different values for attributes att_1, ..., att_k
  2. apply functions F_{k+1}, ..., F_n to the collections of values of attributes att_{k+1}, ..., att_n respectively, for each set of tuples separated by Nulls
- Hash: builds a hash table over its input using a predetermined hash function over attribute att
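The two passes of Group described above can be sketched in Python as follows. This is an illustrative model only, not Postgres code: `None` stands in for the Null separators, and the function names, the `aggregates` mapping, and the sample rows are assumptions.

```python
def insert_separators(rows, group_attrs):
    """Pass 1: put None between consecutive rows that differ on group_attrs."""
    out, prev = [], None
    for row in rows:
        key = tuple(row[a] for a in group_attrs)
        if prev is not None and key != prev:
            out.append(None)
        out.append(row)
        prev = key
    return out

def aggregate_runs(separated, group_attrs, aggregates):
    """Pass 2: apply each aggregate function to every None-delimited run."""
    results, run = [], []
    for row in separated + [None]:        # trailing None flushes the last run
        if row is None:
            if run:
                out = {a: run[0][a] for a in group_attrs}
                for name, (attr, fn) in aggregates.items():
                    out[name] = fn([r[attr] for r in run])
                results.append(out)
                run = []
        else:
            run.append(row)
    return results

# Hypothetical input, already grouped (but not sorted) on c_custkey.
rows = [
    {"c_custkey": 7, "s_suppkey": 1},
    {"c_custkey": 7, "s_suppkey": 2},
    {"c_custkey": 3, "s_suppkey": 1},
]
separated = insert_separators(rows, ["c_custkey"])
result = aggregate_runs(separated, ["c_custkey"], {"count": ("s_suppkey", len)})
print(result)  # [{'c_custkey': 7, 'count': 2}, {'c_custkey': 3, 'count': 1}]
```

Because pass 1 compares only consecutive tuples, the operator gives correct results exactly when its input is grouped on the grouping attributes.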

Order Property Optimization: Postgres Plan Operators Summarized (cont.)
- HJoin: performs a (non-order-preserving) simple hash equijoin (att_1 = att_2) with the relation produced by left as the probe relation, and the relation produced by right as the build relation
- Merge: performs a merge equijoin (att_1 = att_2) with the relation produced by left as the outer relation, and the relation produced by right as the inner relation
- NOP: a dummy plan operator that we have added and that we temporarily make the root of a Postgres plan prior to its refinement

Order Property Optimization: A Plan Refinement Algorithm
- Input: a query plan tree generated by Postgres
- Output: an equivalent plan tree with unnecessary Sort operators (used either to order or to group) removed
- Requires: 4 new attributes associated with every node in a query plan tree

Order Property Optimization: A Plan Refinement Algorithm (cont.)
New attributes of each node n:
- keys: a set of attribute sets that are guaranteed to be keys of inputs to n
- fds: a set of functional dependencies (attribute set → attribute) that are guaranteed to hold of inputs to n
- req: a single order property that is required to hold of inputs either to n or to some ancestor node of n for that node to execute
- sat: a set of order properties that are guaranteed to be satisfied by outputs of n

Order Property Optimization: A Plan Refinement Algorithm (cont.)
Idea:
- Decorate the input plan with the attributes above
- Remove any Sort operator n whose child node produces a result that is guaranteed to satisfy an order property required by its parent node
- Accomplished in 3 passes

Order Property Optimization: A Plan Refinement Algorithm (cont.)
Refinement of the query plan (root first):
    NOP
      group c_custkey, count(*)
        sort c_custkey
          merge-join c_nationkey = s_nationkey
            sort c_nationkey
              table scan customer
            sort s_nationkey
              table scan supplier

Order Property Optimization: A Plan Refinement Algorithm (cont.)
Pass 1: Functional Dependencies and Keys
- A bottom-up pass: FDs and keys are propagated upwards when they are inferred to hold of intermediate query results
Pass 2: Required Order Properties
- A top-down pass that sets required order properties (req), which are propagated downwards from the root of the tree
- The operations are captured in the pseudocode SetReq
- New required order properties are generated by:
  - NOP: Node Order Property (called on the root of the plan to trigger the top-down pass)
  - Group and Unique
  - Join operators
- All other nodes pass the required order properties they inherit from their parent nodes to their child nodes, except for Hash and Append, which propagate the empty order property to their child nodes

Order Property Optimization

Order Property Optimization: A Plan Refinement Algorithm (cont.)
Pass 3: Sort Elimination
- A bottom-up pass of the query plan tree that determines what order properties are guaranteed to be satisfied by the outputs of each node (sat), and that concurrently removes any Sort operator n for which n.req ∈ n.left.sat
- Algorithm: InferSat
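The removal step of this pass can be sketched as a post-order traversal. The sketch below is not InferSat: it assumes that the req and sat decorations have already been computed and simply filled in by hand, and the `PlanNode` class, its field layout, and the operator names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

OrderProperty = Tuple[Tuple[str, str], ...]   # ((attribute, "O" or "G"), ...)

@dataclass
class PlanNode:
    op: str
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None
    req: OrderProperty = ()                           # required order property
    sat: Set[OrderProperty] = field(default_factory=set)  # satisfied properties

def eliminate_sorts(node):
    """Bottom-up: splice out every Sort whose child already satisfies its req."""
    if node is None:
        return None
    node.left = eliminate_sorts(node.left)
    node.right = eliminate_sorts(node.right)
    if node.op == "Sort" and node.left is not None and node.req in node.left.sat:
        return node.left
    return node

# Hand-decorated version of the customer/supplier plan (sat values assumed).
sort_c = PlanNode("Sort", left=PlanNode("TableScan"), req=(("c_nationkey", "O"),))
sort_s = PlanNode("Sort", left=PlanNode("TableScan"), req=(("s_nationkey", "O"),))
merge = PlanNode("Merge", left=sort_c, right=sort_s,
                 sat={(("c_nationkey", "O"),), (("c_custkey", "G"),)})
top_sort = PlanNode("Sort", left=merge, req=(("c_custkey", "G"),))
root = eliminate_sorts(PlanNode("NOP", left=PlanNode("Group", left=top_sort)))
print(root.left.left.op)  # Merge
```

In this example only the Sort above the merge join is removed; the Sorts over the table scans stay, because an unordered scan guarantees no order property.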

Order Property Optimization A Plan Refinement Algorithm (cont.) InferSat

Order Property Optimization A Plan Refinement Algorithm (cont.) InferSat (cont.)

Example: TPC-D (now TPC-H) Query 3
TPC-D Query 3:
    select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as rev,
           o_orderdate, o_shippriority
    from customer, order, lineitem
    where o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and c_mktsegment = 'building'
      and o_orderdate < date('1998-11-30')
      and l_shipdate > date('1998-11-30')
    group by l_orderkey, o_orderdate, o_shippriority
    order by rev desc, o_orderdate

Example: TPC-D (now TPC-H) Query 3
- Previous presentation: we showed that the optimized plan outperformed the original plan by a factor of 2
- Now: improve further using our approach to plan refinement, which reasons about groupings and secondary orderings

Example: TPC-D (now TPC-H) Query 3
Figure: the Query 3 plan annotated with inferred order properties. Plan operators: table scan customer and table scan order, sorted on c_custkey and o_custkey respectively; merge-join c_custkey = o_custkey; nested-loops o_orderkey = l_orderkey with index scan lineitem; group by o_orderkey; sort on rev, o_orderdate. Annotations recoverable from the figure, bottom-up:
- $O_{c\_custkey^{O}}(R) \Rightarrow O_{c\_custkey^{G}}(R)$ for the sorted customer input R
- $O_{o\_custkey^{O}}(S) \Rightarrow O_{o\_custkey^{G}}(S)$ for the sorted order input S
- MJ rule, with c_custkey = o_custkey, and Identity #4 => $O_{o\_custkey^{G} \to o\_orderkey^{G}}(T)$ for the merge-join result T
- Identity #5 => $O_{o\_orderkey^{G}}(T)$
- NLJ rule => $O_{o\_orderkey^{G}}(U)$ for the nested-loops join result U

Performance Results: TPC-D (now TPC-H) Results
Database:
- Customer table: 150,000 rows
- Supplier table: 10,000 rows
- Order table: 1,500,000 rows
- LineItem table: 6,000,000 rows
PC:
- 1 GHz Pentium III, Linux, with 512 MB RAM, 120 GB HDD

Performance Results: Experiment #1 (our example)
- Postgres plan: 6384.9 sec
- Refined plan: 487.9 sec
- Ratio: 13.08
- Postgres plan (root first):
    group c_custkey, count(*)
      sort c_custkey
        merge-join c_nationkey = s_nationkey
          sort c_nationkey
            table scan customer
          sort s_nationkey
            table scan supplier

Performance Results: Experiment #2 (TPC-H Query 3)
- Postgres plan: 126.8 sec
- Refined plan: 2729.9 sec
- Ratio: 0.05
- Tuples with the same value of o_orderkey were consecutive, which increased the likelihood of finding joining tuples from lineitem in the cache
- Plan figure (operators): sort on rev, o_orderdate; group by o_orderkey; sort on o_orderkey; nested-loops o_orderkey = l_orderkey; index scan lineitem; merge-join c_custkey = o_custkey; sort on c_custkey over table scan customer; sort on o_custkey over table scan order

Performance Results: Experiment #2 (TPC-H Query 3, with table scan on lineitem)
- Postgres plan: 121.4 sec
- Refined plan: 113.3 sec
- Ratio: 1.07
- Plan figure (operators): sort on rev, o_orderdate; group by o_orderkey; sort on o_orderkey; nested-loops o_orderkey = l_orderkey; table scan lineitem; merge-join c_custkey = o_custkey; sort on c_custkey over table scan customer; sort on o_custkey over table scan order

Conclusion
- We present a formal approach to order optimization that integrates both orderings and groupings within the same comprehensive framework
- We also consider secondary orderings and groupings
  - By inferring secondary orderings and groupings, it is possible to avoid unnecessary sorting or grouping over multiple attributes
  - We can use secondary orderings known of an operator's input to infer primary orderings of its output

Any Questions?