Incremental Query Optimization

Similar documents
CS448/648 Database Systems Implementation. Volcano: A top-down optimizer generator

Backtracking. Chapter 5

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

The Volcano Optimizer Generator: Extensibility and Efficient Search

Chapter 11 Search Algorithms for Discrete Optimization Problems

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Basic Search Algorithms

HEURISTIC SEARCH. 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and References 4.6 Exercises

Process a program in execution; process execution must progress in sequential fashion. Operating Systems

Parallel Programming. Parallel algorithms Combinatorial Search

Efficient and Extensible Algorithms for Multi Query Optimization

Arnab Bhattacharya, Shrikant Awate. Dept. of Computer Science and Engineering, Indian Institute of Technology, Kanpur

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Geometric data structures:

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

Red-Black-Trees and Heaps in Timestamp-Adjusting Sweepline Based Algorithms

Query Processing & Optimization

Tail Recursion. ;; a recursive program for factorial (define fact (lambda (m) ;; m is non-negative (if (= m 0) 1 (* m (fact (- m 1))))))

Lecture #14 Optimizer Implementation (Part I)

Optimizing Join Enumeration in Transformation- based Query Optimizers

Efficient memory-bounded search methods

Traditional Query Optimization

Stochastic propositionalization of relational data using aggregates

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Constraint Satisfaction Problems

Search Algorithms for Discrete Optimization Problems

Checking: Trading Space for Time. Divesh Srivastava. AT&T Research. 10]). Given a materialized view dened using database

Search and Optimization

Select (SumSal > Budget) Join (DName) E4: Aggregate (SUM Salary by DName) Emp. Select (SumSal > Budget) Aggregate (SUM Salary by DName, Budget)

Evaluating XPath Queries

Alpha-Beta Pruning in Mini-Max Algorithm An Optimized Approach for a Connect-4 Game

Chronological Backtracking Conflict Directed Backjumping Dynamic Backtracking Branching Strategies Branching Heuristics Heavy Tail Behavior

Software Speculative Multithreading for Java

Stream Joins and Dynamic Query Evaluation. WANG, Qichen HUANG, Ziyue

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

Search Algorithms for Discrete Optimization Problems

Chapter 11: Indexing and Hashing

15.083J Integer Programming and Combinatorial Optimization Fall Enumerative Methods

XML Data Stream Processing: Extensions to YFilter

arxiv: v1 [cs.db] 22 Sep 2014

Chapter 3: Processes. Operating System Concepts 8th Edition,

Distributed Tree Searc and its Application to Alpha-Beta Pruning*

Efficient AND/OR Search Algorithms for Exact MAP Inference Task over Graphical Models

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013

Exploiting Upper and Lower Bounds in Top-Down Query Optimization

Stackless BVH Collision Detection for Physical Simulation

CSE 214 Computer Science II Introduction to Tree

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Artificial Intelligence (part 4c) Strategies for State Space Search. (Informed..Heuristic search)

Efficient in-network evaluation of multiple queries

Lecture 5: Search Algorithms for Discrete Optimization Problems

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

. The problem: ynamic ata Warehouse esign Ws are dynamic entities that evolve continuously over time. As time passes, new queries need to be answered

A Framework for Space and Time Efficient Scheduling of Parallelism

kd-trees Idea: Each level of the tree compares against 1 dimension. Let s us have only two children at each node (instead of 2 d )

Spatial Queries. Nearest Neighbor Queries

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI

Accelerated Raytracing

Appropriate Item Partition for Improving the Mining Performance

APHID: Asynchronous Parallel Game-Tree Search

: Principles of Automated Reasoning and Decision Making Midterm

Optimization of Queries with User-Defined Predicates

Course Review. Cpt S 223 Fall 2009

Learning Objectives. c D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 3.5, Page 1

Performance impact of dynamic parallelism on different clustering algorithms

UNIT 4 Branch and Bound

Spatial Data Structures

Chapter 12: Indexing and Hashing. Basic Concepts

The Legion Mapping Interface

Advanced Database Systems

Parallel Approach for Implementing Data Mining Algorithms

Scribes: Romil Verma, Juliana Cook (2015), Virginia Williams, Date: May 1, 2017 and Seth Hildick-Smith (2016), G. Valiant (2017), M.

IMPROVED A* ALGORITHM FOR QUERY OPTIMIZATION

Figure 1. A breadth-first traversal.

Spatial Data Structures

Effective Modular Design

Chapter 12: Indexing and Hashing

Data Warehousing & Data Mining

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

Massively Parallel Approximation Algorithms for the Traveling Salesman Problem

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Parallel Query Optimisation

Register Allocation via Hierarchical Graph Coloring

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves

Tree traversals and binary trees

A Re-examination of Limited Discrepancy Search

CS301 - Data Structures Glossary By

18.3 Deleting a key from a B-tree

ADVANCED DATABASE SYSTEMS. Lecture #15. Optimizer Implementation (Part // // Spring 2018

Flow Graph Theory. Depth-First Ordering Efficiency of Iterative Algorithms Reducible Flow Graphs

Data Mining Chapter 8: Search and Optimization Methods Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Artificial Intelligence (part 4d)

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

Chapter-6 Backtracking

Proceedings of the IE 2014 International Conference AGILE DATA MODELS

Algorithms. Ch.15 Dynamic Programming

CSE 326: Data Structures B-Trees and B+ Trees

Inheritance Metrics: What do they Measure?

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps

Transcription:

Incremental Query Optimization Vipul Venkataraman Dr. S. Sudarshan Computer Science and Engineering Indian Institute of Technology Bombay

Outline Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization 2 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization Introduction 3 / 46

Query Optimization - Overview Active area of research - spanning more then four decades [1] Determine the most efficient way to execute a given query by considering the space of all possible query plans The optimization process should not be costly in itself - typical issue in web related applications that handle huge data More data = bigger search space Can we find a good enough plan without exploring the whole search space? Incremental Query Optimization Introduction 4 / 46

Logical Query DAG Logical Query DAG for the query A B C Incremental Query Optimization Introduction 5 / 46 Image source: Prasan Roy s thesis

Physical Query DAG Physical Query DAG for the query A B Incremental Query Optimization Introduction 6 / 46 Image source: Prasan Roy s thesis

Steps in Optimization Incremental Query Optimization Introduction 7 / 46 Image source: Prasan Roy s thesis

Volcano Volcano, Graefe [2] Top-down optimizer, uses dynamic programming and memoization Expands whole Logical Query DAG initially to create all possible logical expressions Enumerates physical search space in depth-first order Uses branch and bound pruning to prune the search space Cost upper bound is propagated down the DAG, if upper bound < lower bound, plan is pruned Incremental Query Optimization Introduction 8 / 46

Cascades Cascades, Graefe et al [3] Adheres to the Object Oriented Paradigm Optimization algorithm broken into task objects which are collected in a data structure A task is scheduled by popping it from the stack and invoking the perform function Only explores a group on demand, while Volcano exhaustively generates all logically equivalent expressions before the actual optimization begins Ordering of transformations to be applied is based on a promise value assigned to each task Incremental Query Optimization Introduction 9 / 46

Columbia Columbia, Xu et al [4] Based on the Cascades optimizer generator Lower bound group pruning A group lower bound associated with a group of logically equivalent expressions, represents the minimal cost of the group Based on the logical properties of the group Global epsilon pruning A plan is considered optimal plan for an equivalence node if its cost is close enough (within epsilon) to the cost of optimal plan If the cost of a plan is found less than epsilon, it is good enough Epsilon values affect pruning greatly Incremental Query Optimization Introduction 10 / 46

Columbia Columbia, Xu et al [4] Based on the Cascades optimizer generator Lower bound group pruning A group lower bound associated with a group of logically equivalent expressions, represents the minimal cost of the group Based on the logical properties of the group Global epsilon pruning A plan is considered optimal plan for an equivalence node if its cost is close enough (within epsilon) to the cost of optimal plan If the cost of a plan is found less than epsilon, it is good enough Epsilon values affect pruning greatly Incremental Query Optimization Introduction 10 / 46

Columbia Columbia, Xu et al [4] Based on the Cascades optimizer generator Lower bound group pruning A group lower bound associated with a group of logically equivalent expressions, represents the minimal cost of the group Based on the logical properties of the group Global epsilon pruning A plan is considered optimal plan for an equivalence node if its cost is close enough (within epsilon) to the cost of optimal plan If the cost of a plan is found less than epsilon, it is good enough Epsilon values affect pruning greatly Incremental Query Optimization Introduction 10 / 46

Limitations of The Cascades Framework Data model representation is tightly coupled with optimizer Requires database implementer to define logical & physical operators and logical & physical properties Difficult to integrate a new query optimizer to an existing database, as it requires a database implementer to write new class definitions [5] Depth-first search algorithm Implements depth-first search algorithm; no plan generated unless whole optimization process finishes Huge search space of query optimizer leads to inefficiency Incremental Query Optimization Introduction 11 / 46

Limitations of The Cascades Framework Data model representation is tightly coupled with optimizer Requires database implementer to define logical & physical operators and logical & physical properties Difficult to integrate a new query optimizer to an existing database, as it requires a database implementer to write new class definitions [5] Depth-first search algorithm Implements depth-first search algorithm; no plan generated unless whole optimization process finishes Huge search space of query optimizer leads to inefficiency Incremental Query Optimization Introduction 11 / 46

Limitations of The Cascades Framework Ideas for improvements The current best plan can be used in the search algorithm to enable efficient exploration This will enable incremental view maintenance Depth first search can be replaced with a more efficient Best first search strategy Tasks can be operated depending on their promise Pruning can be done more aggressively, at the risk of a near optimal plan Incremental Query Optimization Introduction 12 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization Incremental Optimization 13 / 46

Incremental Optimization Maintaining current best plan The optimal plan for a query is converged upon by incrementally improving a current best plan through application of logical and physical transformations At each step of the optimization, a current best plan is available, which is the plan with the best cost among all plans that have been explored The order of exploring alternative plans becomes important as it controls the time required to reach the optimal plan, and the amount of search space that can be pruned Benefits Pruning can be done aggressively, at the risk of returning a near-optimal plan Optimization can be aborted if the current best plan cost reaches a certain threshold, allowing for quickly returning near-optimal plans Incremental Query Optimization Incremental Optimization 14 / 46

Incremental Optimization Maintaining current best plan The optimal plan for a query is converged upon by incrementally improving a current best plan through application of logical and physical transformations At each step of the optimization, a current best plan is available, which is the plan with the best cost among all plans that have been explored The order of exploring alternative plans becomes important as it controls the time required to reach the optimal plan, and the amount of search space that can be pruned Benefits Pruning can be done aggressively, at the risk of returning a near-optimal plan Optimization can be aborted if the current best plan cost reaches a certain threshold, allowing for quickly returning near-optimal plans Incremental Query Optimization Incremental Optimization 14 / 46

Incremental Optimization: Issues Prioritization of Transformations The performance of the incremental optimizer is greatly affected by the order in which the search space is explored Enumeration of search space is done by applying transformations at each operator in the logical plan At each step, multiple valid transformations may exist If promising plans are expanded first, the less promising ones will be pruned and not expanded at all Difficult to ascertain which plans are promising, before expanding them Optimal selection depends on various factors, e.g. the type of operator being substituted, the transformation rule, logical and physical properties of the required context, database statistics, etc Incremental Query Optimization Incremental Optimization 15 / 46

Incremental Optimization: Issues Heuristic Ordering of Nodes The order in which nodes are selected for optimization determines the performance of the incremental optimizer But the best order depends on the query as well as dataset characteristics Developing a set of heuristics for choosing the order of exploration of nodes, that take into account the operator, child nodes, logical and physical properties of the query, should improve the performance of the optimizer Converging on a set of good heuristics would require an analysis of the performance of the optimizer under various heuristics Incremental Query Optimization Incremental Optimization 16 / 46

Priority Metrics Priority Metrics Depth from the root of the DAG Input Blocks Cost upper bound Incremental Query Optimization Incremental Optimization 17 / 46

Priority Metrics Depth from the root of the DAG Length of the path from the root of the DAG to that particular node Depth assigned to explore the DAG in depth-first-search fashion Priority assigned during exploration : root.priority = 0 child.priority = root.priority + 1; Incremental Query Optimization Incremental Optimization 18 / 46

Priority Metrics Input Blocks Size For an equivalence node, it is equal to the number of blocks in which it will reside in memory inputblocks(equivalence node) = (number of tuples)/(number of tuples per block)* For an operator node, it is equal to the number of blocks that need to be read to perform that operation inputblocks(operator node) = inputblocks(children equivalence node of that operator node) * this approximation does not hold in the case of nested loop joins. We plan to reformulate the cost model soon Incremental Query Optimization Incremental Optimization 19 / 46

Priority Metrics Cost Maintenance Cost of a node is equal to the best plan cost to evaluate that node Any new plan obtained at a node results in propagating it up DAG to modify the best plan costs of the parent nodes Cost computations is done as: Operator Node: The cost of an operator node is the sum of costs of all its children equivalence nodes in addition to its local cost Equivalence Node: The cost of an equivalence node is the minimum of the costs of its children operator nodes Incremental Query Optimization Incremental Optimization 20 / 46

Priority Metrics Upper Bound Cost Maintenance After updating the best plan cost for the required nodes, upper bound is updated for all the children recursively in top down fashion Upper Bound Cost computations is done as: Operator Node: The upper bound of an operator node is the upper bound of its parent equivalence node Equivalence Node: = Consider a hierarchy of nodes like e1-p1-e2. Upper bound of e2 is calculated as: U e2 = U e1 L i Local p1 iϵsiblings(e2) where U x is the upper bound cost of x, L x is the lower bound cost of x Incremental Query Optimization Incremental Optimization 21 / 46

Deferred Cost Propagation Deferred cost propagation Whenever cost changes at a node, it results in updating best plan cost at all parents and upper bounds at all children Updates can be ignored if change in best plan cost is negligible at the node Following things are expected if revisiting large portions of the graph due to small changes in local cost is avoided Faster optimization: Faster computation of the plan is expected since graph exploration is avoided when not necessary Sub-optimal plans: Sub-optimal plan is expected since all changes in the cost are not updated Incremental Query Optimization Incremental Optimization 22 / 46

Deferred Cost Propagation Fudge factor Change is propagated only when the cost at a node differs atleast by a factor α - fudge factor The fudge factor will essentially balance the trade-off between faster optimization and obtaining near-optimal plans Fudge factor needs to be tuned to ensure: The time taken for optimization is faster A near-optimal plan is generated Incremental Query Optimization Incremental Optimization 23 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization Experiments 24 / 46

Implementation PyroJ optimizer [6] Evaluate quality of various search strategies Heuristics 1 Depth first search 2 Maximum cost upper bound 3 Maximum input blocks 4 Minimum cost upper bound 5 Minimum input blocks Incremental Query Optimization Experiments 25 / 46

Implementation PyroJ optimizer [6] Evaluate quality of various search strategies Heuristics 1 Depth first search 2 Maximum cost upper bound 3 Maximum input blocks 4 Minimum cost upper bound 5 Minimum input blocks Incremental Query Optimization Experiments 25 / 46

Current Best Cost Plots Tracking improvements in current best cost We present a series of experiments measuring the changes in the current best cost plan as the query optimization process progresses The 5 implemented heuristics are evaluated on a fixed set of 10 queries on the TPC-DS dataset Current best cost plot x axis shows the time in milliseconds y axis shows the cost of the current best plan, which improves until it reaches the optimal cost Each plot shows the results for a fixed query, under different heuristics There is a curve for each heuristic under consideration Incremental Query Optimization Experiments 26 / 46

Current Best Cost Plots Tracking improvements in current best cost We present a series of experiments measuring the changes in the current best cost plan as the query optimization process progresses The 5 implemented heuristics are evaluated on a fixed set of 10 queries on the TPC-DS dataset Current best cost plot x axis shows the time in milliseconds y axis shows the cost of the current best plan, which improves until it reaches the optimal cost Each plot shows the results for a fixed query, under different heuristics There is a curve for each heuristic under consideration Incremental Query Optimization Experiments 26 / 46

Current Best Cost Plots Incremental Query Optimization Experiments 27 / 46

Current Best Cost Plots Incremental Query Optimization Experiments 28 / 46

Current Best Cost Plots Incremental Query Optimization Experiments 29 / 46

Area Under Curve Plot Selecting the best heuristic The heuristic has a reasonable effect on the performance of the optimizer We would prefer a heuristics that provides optimal (or near-optimal) plans in as few iterations as possible Next slide shows a plot of the area-under-curve of each heuristic under consideration Area under curve plot There is a curve for each heuristic in the plot A heuristic with a low AOC value for a particular query, relative to other heuristic, implies that that heuristic is a better choice for that query We are interested in heuristics that perform consistently well on all queries Incremental Query Optimization Experiments 30 / 46

Area Under Curve Plot Selecting the best heuristic The heuristic has a reasonable effect on the performance of the optimizer We would prefer a heuristics that provides optimal (or near-optimal) plans in as few iterations as possible Next slide shows a plot of the area-under-curve of each heuristic under consideration Area under curve plot There is a curve for each heuristic in the plot A heuristic with a low AOC value for a particular query, relative to other heuristic, implies that that heuristic is a better choice for that query We are interested in heuristics that perform consistently well on all queries Incremental Query Optimization Experiments 30 / 46

Area Under Curve Plot Incremental Query Optimization Experiments 31 / 46

Initial plan Importance We have a plan to work with Incremental view maintenance Algorithm Operator Node: add all children equivalence nodes to the initial plan Equivalence Node: add any one child (operator node) to the initial plan Incremental Query Optimization Experiments 32 / 46

Initial plan Importance We have a plan to work with Incremental view maintenance Algorithm Operator Node: add all children equivalence nodes to the initial plan Equivalence Node: add any one child (operator node) to the initial plan Incremental Query Optimization Experiments 32 / 46

Area Under Curve Plot Incremental Query Optimization Experiments 33 / 46

Area Under Curve Plot Incremental Query Optimization Experiments 34 / 46

Effect of using an initial plan Observable difference among different heuristics Using initial plan improved performance Minimum cost upper bound heuristic + initial plan is the winner Incremental Query Optimization Experiments 35 / 46

Deferred Cost Propagation Fudge factors used 0.00 - returns best plan! 0.10 0.15 0.20 We hope that increasing fudge factor reduces optimization time we obtain good sub-optimal plans This indeed happens Incremental Query Optimization Experiments 36 / 46

Deferred Cost Propagation Incremental Query Optimization Experiments 37 / 46

Deferred Cost Propagation Incremental Query Optimization Experiments 38 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization Conclusion 39 / 46

Conclusion We presented a literature survey in the area of traditional database optimization techniques We explained the incremental optimization technique in the context of query optimization The overhead of storing the current best plan in incremental optimization can be offset by intelligently exploring the plan space We achieved this using various heuristics that lead to faster discovery of optimal or near-optimal plans, as well as greater pruning of the plan space We have implemented deferred cost propagation coupled with heuristic search Our experimental results show the significant benefits of our techniques. Incremental Query Optimization Conclusion 40 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization Future Work 41 / 46

Future Work Effective initial plan computation methods Theoretical guarantees for heuristic performance Adaptive heuristic optimization Global epsilon pruning Evaluating on different dataset/queries Incremental Query Optimization Future Work 42 / 46

Introduction Volcano Cascades Incremental Optimization Priority Metrics Deferred Cost Propagation Experiments Current Best Cost Plots Area Under Curve Plot Initial Plan Conclusion Future Work References Incremental Query Optimization References 43 / 46

References I [1] Surajit Chaudhuri. An Overview of Query Optimization. [2] G. Graefe and W.J. McKenna. The Volcano optimizer generator: extensibility and efficient search. Proceedings of IEEE 9th International Conference on Data Engineering, pages 209 218, 1993. [3] Goetz Graefe. The cascades framework for query optimization. Data Engineering Bulletin, pages 19 28, 1995. [4] Yongwen Xu. Efficiency in the columbia database query optimizer. Master s thesis, Portland State University, Portland, 1998. [5] Sapna Jain. Query optimization for massively parallel data processing. Master s thesis, IIT Bombay, Mumbai, 2012. Incremental Query Optimization References 44 / 46

References II [6] Prasan Roy. Multiquery optimization and applications. Technical report, Department of Computer Science, Indian Institute of Technology Bombay, 2000. Incremental Query Optimization References 45 / 46

Thank you Incremental Query Optimization References 46 / 46