Horizontal Aggregations for Mining Relational Databases

Similar documents
Generating Data Sets for Data Mining Analysis using Horizontal Aggregations in SQL

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

Horizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator

Preparation of Data Set for Data Mining Analysis using Horizontal Aggregation in SQL

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Horizontal Aggregations in SQL to Generate Data Sets for Data Mining Analysis in an Optimized Manner

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

A Better Approach for Horizontal Aggregations in SQL Using Data Sets for Data Mining Analysis

A Hybrid Approach for Horizontal Aggregation Function Using Clustering

Vertical and Horizontal Percentage Aggregations

Horizontal Aggregation in SQL to Prepare Dataset for Generation of Decision Tree using C4.5 Algorithm in WEKA

Fundamental methods to evaluate horizontal aggregation in SQL

Horizontal Aggregations for Building Tabular Data Sets

Relational Databases

Database System Concepts

Relational Model: History

RELATIONAL DATA MODEL

An Overview of Cost-based Optimization of Queries with Aggregates

Relational Model, Relational Algebra, and SQL

Chapter 12: Query Processing

CSIT5300: Advanced Database Systems

Unit Assessment Guide

Concepts of Database Management Eighth Edition. Chapter 2 The Relational Model 1: Introduction, QBE, and Relational Algebra

Hyperion Interactive Reporting Reports & Dashboards Essentials

Relational Algebra and SQL

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Chapter 12: Query Processing

T-SQL Training: T-SQL for SQL Server for Developers

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Optimizing Recursive Queries in SQL

SQL. Chapter 5 FROM WHERE

Chapter 13: Query Processing

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Horizontal Aggregation Function Using Multi Class Clustering (MCC) and Weighted (PCA)

7. Query Processing and Optimization

SQL: Queries, Programming, Triggers

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Relational Database: The Relational Data Model; Operations on Database Relations

Chapter 14: Query Optimization

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 7 Introduction to Structured Query Language (SQL)

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Query processing and optimization

Enterprise Database Systems

8) A top-to-bottom relationship among the items in a database is established by a

Database System Concepts

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Slides by: Ms. Shree Jaswal

Data Warehouse and Data Mining

Query Processing & Optimization

Query Optimization. Shuigeng Zhou. December 9, 2009 School of Computer Science Fudan University

DBMS Query evaluation

Data about data is database Select correct option: True False Partially True None of the Above

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Chapter 13: Query Optimization

SQL: Queries, Constraints, Triggers

Database Systems SQL SL03

Chapter 3. Algorithms for Query Processing and Optimization

Database Technology Introduction. Heiko Paulheim

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

SIT772 Database and Information Retrieval WEEK 6. RELATIONAL ALGEBRAS. The foundation of good database design

MTA Database Administrator Fundamentals Course

Database Languages and their Compilers

CIS 330: Applied Database Systems

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

EECS 647: Introduction to Database Systems

CS122 Lecture 4 Winter Term,

Database Systems SQL SL03

Advanced Database Systems

Ian Kenny. December 1, 2017

Parser: SQL parse tree

SQL Server Analysis Services

Overview of Query Processing and Optimization

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

Chapter 19 Query Optimization

SQL functions fit into two broad categories: Data definition language Data manipulation language

SQL: Queries, Programming, Triggers. Basic SQL Query. Conceptual Evaluation Strategy. Example of Conceptual Evaluation. A Note on Range Variables

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Optimized Query Plan Algorithm for the Nested Query

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

The SQL data-definition language (DDL) allows defining :

Data Warehouse and Data Mining

COSC6340: Database Systems Mini-DBMS Project evaluating SQL queries

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Query Processing: an Overview. Query Processing in a Nutshell. .. CSC 468 DBMS Implementation Alexander Dekhtyar.. QUERY. Parser.

Operator Implementation Wrap-Up Query Optimization

Relational Query Optimization. Highlights of System R Optimizer

Course Details Duration: 3 days Starting time: 9.00 am Finishing time: 4.30 pm Lunch and refreshments are provided.

1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.

Query Optimization Overview

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Transcription:

Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh, India. Abstract: Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group using group functions. A significant manual effort using a compliant programming language is required to build data sets, where a horizontal layout is required. Earlier a simple, yet powerful, methods(case,pivot,spj) to generate aggregated columns in a horizontal tabular layout were developed. Both CASE and PIVOT evaluation methods are significantly faster than the SPJ method. We propose to use a technique called generalized projections (GPs) to improve the performance of SPJ method. The proposed technique pushes down to the lowest levels of a query tree aggregation computation, function computation and duplicate elimination. It also creates aggregations in queries that did not use aggregation to begin with. It unifies set and duplicate semantics, and helps in better understanding aggregations. It improves SPJ performance significantly since applying aggregations early in query processing can provide significant performance improvements. I.INTRODUCTION Building a suitable data set for data mining purposes is a time-consuming task. This task generally requires writing long SQL statements or customizing SQL code if it is automatically generated by some tool. There are two main ingredients in such SQL code: joins and aggregations. The most widely-known aggregation is the sum of a column over groups of rows. There exist many aggregation functions and operators in SQL. Unfortunately, all these aggregations have limitations to build data sets for data mining purposes. The main reason is that, in general, data sets that are stored in a relational database (or a data warehouse) come from On-Line Transaction Processing (OLTP) systems where database schemas are highly normalized. Based on current available functions and clauses in SQL, a significant effort is required to compute aggregations. Such effort is due to the amount and complexity of SQL code that needs to be written, optimized and tested. Standard aggregations are hard to interpret when there are many result rows. To perform analysis of exported tables into spreadsheets it may be more convenient to have aggregations on the same group in one row. With such limitations in mind, we propose a new class of aggregate functions that aggregate numeric expressions and transpose results to produce a data set with a horizontal layout. Functions belonging to this class are called horizontal aggregations. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout instead of a single value per row. Horizontal aggregations provide several unique features and advantages. First, they represent a template to generate SQL code from a data mining tool. This SQL code reduces manual work in the data preparation phase in a data mining project. Second, since SQL code is automatically generated it is likely to be more efficient than SQL code written by an end user. Third, the data set can be created entirely inside the DBMS. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis. To perform horizontal aggregation, SPJ method is implemented with generalized Projections. GPs capture aggregations, group- conventional projection with duplicate elimination (distinct), and duplicate preserving projections. We develop a technique for pushing GPs down query trees of Selectproject-join may use aggregations like max, sum, etc. and that use arbitrary functions in their selection conditions. Our technique pushes down to the lowest levels of a query tree aggregation computation, duplicate elimination, and function computation. II. DEFINITIONS Let F be a table having a simple primary key K represented by an integer, p discrete attributes and one numeric attribute: F(K;D1;..;Dp;A). In OLAP terms, F is a fact table with one column used as primary key, p dimensions and one measure column passed to standard SQL aggregations. F is assumed to have a star schema 3483

to simplify exposition. Column K will not be used to compute aggregations. Dimension lookup tables will be based on simple foreign keys. That is, one dimension column Dj will be a foreign key linked to a lookup table that has Dj as primary key. Input table F size is called N. That is, F = N. Table F represents a temporary table or a view based on a,star join, query on several tables. III. HORIZONTAL AGGREGATIONS Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis. Our main goal is to define a template to generate SQL code combining aggregation and transposition (pivoting). A second goal is to extend the SELECT statement with a clause that combines transposition with aggregation. A method, SPJ method, is used to evaluate horizontal aggregations which relies on relational operations. That is, select, project, join and aggregation queries. In order to evaluate this query the query optimizer takes three input parameters: (1) the input table F, (2) the list of grouping columns L 1 ;. ;L m, (3) the column to aggregate (A). In a horizontal aggregation there are four input parameters to generate SQL code: 1) the input table F, 2) the list of GROUP BY columns L 1 ; ;L j, 3) the column to aggregate (A), 4) the list of transposing columns R 1 ; ; R k. we extend standard SQL aggregate functions with a.transposing. BY clause followed by a list of columns (i.e. R 1 ; ; R k ), to produce a horizontal set of numbers instead of one number. Our proposed syntax is as follows. SELECT L 1 ; ; L J, H(A BY R 1 ; ; R k ) FROM F GROUP BY L 1 ; ; L J ; Here,H() represents some SQL aggregation (e.g. sum(), count(), min(), max(), avg()). The function H() must have at least one argument represented by A, followed by a list of columns. The result rows are determined by columns L 1 ; ; L J in the GROUP BY clause if present. Result columns are determined by all potential combinations of columns R 1 ; ; R k, where k = 1 is the default. In order to get a consistent query evaluation it is necessary to use locking. The main reasons are that any insertion into F during evaluation may cause inconsistencies: (1) it can create extra columns in F H, for a new combination of R 1 ; ; R k ; (2) it may change the number of rows of F H, for a new combination of L 1 ; ; L J ; (3) it may change actual aggregation values in F H. The horizontal aggregation function H() returns not a single value, but a set of values for each group L 1 ; ; L J. Therefore, the result table FH must have as primary key the set of grouping columns { L 1 ; ; L J } and as non-key columns all existing combinations of values R 1 ; ; R k. A horizontal aggregation exhibits the following properties: 1) n= F H matches the number of rows in a vertical aggregation grouped by L1; ;Lj. 2) d = π R1,.,Rk (F) 3) Table F H may potentially store more aggregated values than F V due to nulls. That is, F V nd. DBMS limitations: There exist two DBMS limitations with horizontal aggregations: reaching the maximum number of columns in one table and reaching the maximum column name length when columns are automatically named. On the other hand, the second important issue is automatically generating unique column names. However, these are not important limitations because if there are many dimensions that is likely to correspond to a sparse matrix (having many zeroes or nulls) on which it will be difficult or impossible to compute a data mining model. The column name length issue can be solved by generating column identifiers with integers and creating a description table that maps identifiers to full descriptions, but the meaning of each dimension is lost. An alternative is the use of abbreviations, which may require manual input. IV SPJ METHOD The SPJ method is interesting from a theoretical point of view because it is based on relational operators only. The basic idea is to create one table with a vertical aggregation for each result column, and then join all those tables to produce FH. We aggregate from F into d projected tables with d Select- Project-Join-Aggregation queries (selection, projection, join, aggregation). Each table F I corresponds to one subgrouping combination and has {L1; ;Lj} as primary key and an aggregation on A as the only non-key column. It is necessary to introduce an additional table F 0, that will be outer joined with projected tables to get a complete result set. We propose two basic sub-strategies to compute F H. The first one directly aggregates from F. The second one computes the equivalent vertical aggregation in a temporary table F V grouping by L1; ;Lj ; R 1 ; ; R k. Then horizontal aggregations can be instead computed from F V, which is a compressed version of F, since standard aggregations are distributive. 3484

Select Distinct R1,,Rk SPJ d left joins Compute F H Fig : Main steps of methods based on FV (un optimized). Select Distinct R1,,Rk Compute F V SPJ d left joins Compute F H Fig : Main steps of methods based on FV (optimized). We now introduce the indirect aggregation based on the intermediate table F V, that will be used for the SPJ method. Let F V be a table containing the vertical aggregation, based on L 1 ; ; L J ; R 1 ; ; R k. Let V() represent the corresponding vertical aggregation for H(). The statement to compute FV gets a cube: INSERT INTO F V SELECT L 1 ; ; L J ; R 1 ; ; R k V(A) FROM F GROUP BY L 1 ; ; L J ; R 1 ; ; R k ; Table F0 de_nes the number of result rows, and builds the primary key. F0 is populated so that it contains every existing combination of L 1 ; ; L J. Table F0 has { L 1 ; ; L J } as primary key and it does not have any non-key column. INSERT INTO F 0 SELECT DISTINCT L 1 ; ; L J FROM {F F V }; In the following discussion I {1; ; d}. we use h to make writing clear, mainly to define boolean expressions. We need to get all distinct combinations of subgrouping columns R 1 ; ; R k, to create the name of dimension columns, to get d, the number of dimensions, and to generate the boolean expressions for WHERE clauses. Each WHERE clause consists of a conjunction of k equalities based on R 1 ; ;Rk. SELECT DISTINCT R 1 ; ;R k FROM {F F V }; Tables F 1 ; ; F d contain individual aggregations for each combination of R 1 ; ;R k. The primary key of table FI is { L 1 ; ; L J }. INSERT INTO F I SELECT L1; ;Lj ; V (A) FROM {F FV} WHERE R1 = v1i AND.. AND Rk = vki GROUP BY L1; ;Lj ; Then each table F I aggregates only those rows that correspond to the Ith unique combination of R 1 ; ;R k, given by the WHERE clause. A possible optimization is synchronizing table scans to compute the d tables in one pass. Finally, to get F H we need d left outer joins with the d + 1 tables so that all individual aggregations are properly assembled as a set of d dimensions for each group. Outer joins set result columns to null for missing combinations for the given group. In general, nulls should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default if there are no qualifying rows. Such approach should be considered on a per-case basis. INSERT INTO F H SELECT F 0.L 1 ; F 0.L 2 ; ; F 0.L J ; F 1.A; F 2.A; ; F d.a FROM F d LEFT OUTER JOIN F 1 ON F 0.L 1 = F 1.L 1 and and F 0.L J = F 1.L J LEFT OUTER JOIN F 2 ON F 0.L 1 = F 2.L 1 and and F 0.L J = F 2.L J. LEFT OUTER JOIN F d ON F 0.L 1 = F d.l 1 and. and F 0.L J = F d.l J ; This statement may look complex, but it is easy to see that each left outer join is based on the same columns L1; ;Lj. To improve the performance of SPJ, We introduce the notion of a generalized projection that unifies duplicate eliminating projections corresponds to the SQL distinct adjective, duplicate preserving projections, groupby, and aggregations, in a common 3485

framework. V. GENERALIZED PROJECTION Aggregations in SQL are closely related to the relational projection operator. Aggregations as defined in the SQL standard also produce a new relation given an input relation, by manipulating attributes of the input relation. We introduce a generalized projection operator, denoted by the symbol π, that is similar to aggregation operator. A GP takes as its argument a relation R and outputs a new relation based on the subscript of the GP. The subscript specifies the computation to be done on R. The subscript has two parts: 1. A set of groupby components. We refer to them as components and not attributes because they may be functions of attributes and not just attributes. For instance, the GP π A*B (R) is written as the following SQL query: select (A*B) from R groupby (A*B). 2. A set of aggregate components. For example, we can write the GP π D,max(S) (R) as the query: select D, max(s) from R groupby D. Here D is the only groupby component and max(s) is the only aggregate component. It is simple to observe that a GP has exactly one tuple for each value of the groupby components and thus does not produce any duplicates in its output. Here class of queries expressed in a qyery tree. The permitted query trees have ve types of nodes: selection nodes, projection nodes, cross-product nodes, groupby nodes, and aggregate-groupby node pairs. The topmost node in tree is always a projection. This projection is the GP that is pushed down the query tree. Projections may preserve duplicates or discard them.. Selection nodes eliminate tuples from the input relation, groupby nodes do projection+duplicate elimination, and cross-product nodes output the cross product of two input rela- tions. Aggregate- groupby node pairs have a groupby node followed by an aggregate node. An aggregate-groupby node pair produces as output a relation with one tuple for every distinct value in the input relation of the groupby attributes. GPs are incorporated into query trees using a two step process: 1. Push GPs down a query tree and annotate the query tree with a GP above each node in the tree. 2. Rewrite the annotated query tree to incorporate the GPs that the query optimizer chooses to evaluate and to eliminate all other GPs introduced in the push-down process. We implement top-down pass technique for pushing GPs down a query tree in the form of a table that gives the algebraic transformations needed for pushing GPs. After the top-down pass associates a GP with some or all nodes of the query tree, the query optimizer decides which GPs improve the query plan. The other GPs are removed from the tree. VI PERFORMANCE ISSUE No optimization technique reduces the cost of query execution in all cases. There are always cases where the cost of doing the optimization is greater than the benefit. Our algorithm, Generalized Projections, works best on queries when the groupby attributes we push down do not have too many distinct values in the underlying relation. Most queries are not interested in individual tuples of this relation, but rather aggregate properties of this relation. Thus in most cases, we need to do a groupby on a non-key attribute of this relation. When this relation is joined with some other relation, that need not be aggregated. In such cases, our technique would reduce considerably the size of the massive table before we did a join. It can be argued that in such cases a join algorithm like a hash join could be used to achieve a similar result. However, hash joins are dificult to implement in practice and not commonly implemented. Single table aggregations being a commonly used feature of SQL exist in most systems. Our technique only requires the use of these operators. In addition, our technique works in many cases where hash joins do not do well: for instance, if two very large tables were joined. Our optimization, when applied to query plans, potentially interferes with join ordering, since we reduce the size of the relations participating in the join. However, the technique can be used advantageously as a post join-ordering step. For greater performance gains our push-down algorithm should be integrated with the join ordering module. CONCLUSION This paper prescribes to improve the performance of SPJ in terms of speed and scalability. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout instead of a single value per row. Horizontal aggregations provide several unique features and advantages. Horizontal aggregation is performed by SPJ method with Generalized projection. The SPJ method is interesting because it is based on relational operators only. A technique called generalized projections (GPs) is proposed, to improve the performance of SPJ method. The technique pushes down to the lowest levels of a query tree aggregation computation, function computation and duplicate 3486

elimination. The permitted query trees have five types of nodes: selection nodes,projection nodes,crossproduct nodes, groupby nodes, aggregate-groupby node pairs. Efficiently evaluating horizontal aggregations using left outer joins presents opportunities for query optimization. Secondary indexes on common grouping columns, besides indexes on primary keys, can accelerate computation. Horizontal aggregations produce tables with fewer rows, but with more columns. REFERENCES [1] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, and S. Subramanian. Spreadsheets in RDBMS for OLAP. In Proc. ACM SIGMOD Coference, pages 52.63, 2003. [2] Venky Harinarayan,Ashish Guptay Generalized Projections: a Powerful Query-Optimization Technique [3] Vertical and Horizontal Percentage Aggregations, Carlos Ordonez Teradata, NCR San Diego, CA 92127, USA. [4] G. Bhargava, P. Goel, and B.R. Iyer. Hypergraph based reordering of outer join queries with complex predicates. In ACM SIGMOD Conference, pages 304.315, 1995. [5] U. Dayal, N. Goodman, and R. H. Katz. An Extended Relational Algebra with Control over Duplicate Elimination. In Proceedings of the ACM Symposium on Principles of Database Systems, 1982, pages 117-123. [6] Venky Harinarayan and Ashish Gupta. Optimization Using Tuple Subsumption. To appear in ICDT 95, January 1995. 3487