Lecture 2 Data Cube Basics


CompSci 590.6 Understanding Data: Theory and Applications
Lecture 2: Data Cube Basics
Instructor: Sudeepa Roy
Email: sudeepa@cs.duke.edu

Today's Papers
1. Gray, Chaudhuri, Bosworth, Layman, Reichart, Venkatrao, Pellow, Pirahesh. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals." ICDE 1996 / Data Mining and Knowledge Discovery 1997. (Thinking process at that time.)
2. Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, Sarawagi. "On the Computation of Multidimensional Aggregates." VLDB 1996. (Technical.)
(More than 2630 and 750 citations, respectively, on Google Scholar.)

Naïve Approach: Sales (Model, Year, Color, Units)
Data analysts are interested in exploring trends and anomalies, possibly by visualization (Excel): 2D or 3D plots. Dimensionality reduction by summarizing data and computing aggregates.
Find total unit sales for each:
1. Model
2. Model, broken into years
3. Year, broken into colors
4. Year
5. Model, broken into colors
6. ...
[Cube diagram: axes Model, Year, Color; measure = total unit sales]

Naïve Approach: Sales (Model, Year, Color, Units)
Run a number of queries:
SELECT sum(units) FROM Sales
SELECT Color, sum(units) FROM Sales GROUP BY Color
SELECT Year, sum(units) FROM Sales GROUP BY Year
SELECT Model, Year, sum(units) FROM Sales GROUP BY Model, Year
...
The data cube generalizes histograms, roll-ups, and cross-tabs; these are more complex to do with GROUP BY. How many sub-queries? How many sub-queries for 8 attributes?
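Each sub-query groups by one subset of the attributes, so a table with N attributes needs 2^N GROUP BY queries (the empty subset gives the grand total). A quick sketch to count them:

```python
from itertools import combinations

def cube_groupbys(attrs):
    """Enumerate every attribute subset a full data cube must GROUP BY."""
    subsets = []
    for k in range(len(attrs) + 1):
        subsets.extend(combinations(attrs, k))
    return subsets

print(len(cube_groupbys(["Model", "Year", "Color"])))  # 2^3 = 8 sub-queries
print(len(cube_groupbys(list("ABCDEFGH"))))            # 2^8 = 256 for 8 attributes
```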

Histograms: a tabulated frequency of computed values. Sales (Model, Year, Color, Units)
SELECT Year, COUNT(Units) as total FROM Sales GROUP BY Year ORDER BY Year
May require a nested SELECT to compute.

Roll-Ups: Sales (Model, Year, Color, Units)
Analysis reports start at a coarse level and go to finer levels. The order of attributes matters. Not relational data (empty cells, no keys). Roll-ups vs. drill-downs.
GROUP BY (Model, Year, Color) / (Model, Year) / (Model):
Chevy 1994 Black  50
Chevy 1994 White  40    90
Chevy 1995 Black 115
Chevy 1995 White  85   200   290

Roll-Ups, another representation (Chris Date '96): relational, but long attribute names, hard to express in SQL, and repetition. Sales (Model, Year, Color, Units)
GROUP BY (Model, Year, Color) / (Model, Year) / (Model):
Chevy 1994 Black  50    90   290
Chevy 1994 White  40    90   290
Chevy 1995 Black  85   200   290
Chevy 1995 White 115   200   290

ALL Construct: the roll-up is easier to visualize if we allow ALL to fill in the super-aggregates. Sales (Model, Year, Color, Units)

SELECT Model, Year, Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color
UNION
SELECT Model, Year, ALL, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION ...
UNION
SELECT ALL, ALL, ALL, SUM(Units)
FROM Sales
WHERE Model = 'Chevy';

Model Year Color Units
Chevy 1994 Black  50
Chevy 1994 White  40
Chevy 1994 ALL    90
Chevy 1995 Black  85
Chevy 1995 White 115
Chevy 1995 ALL   200
Chevy ALL  ALL   290

Traditional Roll-Up vs. ALL Roll-Up: Sales (Model, Year, Color, Units)

Traditional (GROUP BY Model, Year, Color / Model, Year / Model):
Chevy 1994 Black  50
Chevy 1994 White  40    90
Chevy 1995 Black 115
Chevy 1995 White  85   200   290

ALL representation:
Model Year Color Units
Chevy 1994 Black  50
Chevy 1994 White  40
Chevy 1994 ALL    90
Chevy 1995 Black  85
Chevy 1995 White 115
Chevy 1995 ALL   200
Chevy ALL  ALL   290

Roll-ups are asymmetric.

Cross Tabulation: Sales (Model, Year, Color, Units)
If we made the roll-up symmetric, we would get a cross-tabulation. Generalizes to higher dimensions.

SELECT Model, ALL, Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Color

Chevy        1994  1995  Total (ALL)
Black          50    85     135
White          40   115     155
Total (ALL)    90   200     290

Is the problem solved with Cross-Tab and GROUP-BYs with ALL? It requires a lot of GROUP BYs (64 for 6 dimensions) and is too complex to optimize (64 scans, 64 sort/hashes, slow).

Data Cube: Intuition. Sales (Model, Year, Color, Units)
SELECT ALL, ALL, ALL, sum(units) FROM Sales
UNION
SELECT ALL, ALL, Color, sum(units) FROM Sales GROUP BY Color
UNION
SELECT ALL, Year, ALL, sum(units) FROM Sales GROUP BY Year
UNION
SELECT Model, Year, ALL, sum(units) FROM Sales GROUP BY Model, Year
UNION ...

Data Cube Ack: from slides by Laurel Orr and Jeremy Hyrkas, UW

Data Cube: computes the aggregate on all possible combinations of the group-by columns. If there are N attributes, there are 2^N − 1 super-aggregates. If the cardinalities of the N attributes are C_1, ..., C_N, then there are a total of (C_1 + 1) · ... · (C_N + 1) values in the cube. ROLL-UP is similar but looks at just N aggregates.

Data Cube Syntax: Sales (Model, Year, Color, Units)
SQL Server:
SELECT Model, Year, Color, sum(units)
FROM Sales
GROUP BY Model, Year, Color WITH CUBE
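SQLite supports neither WITH CUBE nor the standard GROUP BY CUBE(...) shorthand, so as a rough sketch of what the operator expands to (using the Chevy rows from the earlier slides), the cube can be emulated by unioning one GROUP BY per attribute subset:

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INT, Color TEXT, Units INT)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [("Chevy", 1994, "Black", 50), ("Chevy", 1994, "White", 40),
     ("Chevy", 1995, "Black", 85), ("Chevy", 1995, "White", 115)],
)

dims = ["Model", "Year", "Color"]
parts = []
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        # Dimensions outside the subset collapse to the literal 'ALL'.
        cols = ", ".join(d if d in subset else "'ALL'" for d in dims)
        group = (" GROUP BY " + ", ".join(subset)) if subset else ""
        parts.append(f"SELECT {cols}, SUM(Units) FROM Sales{group}")

rows = conn.execute(" UNION ALL ".join(parts)).fetchall()
print(len(rows))  # 18 result rows across the 2^3 = 8 group-bys
print([r for r in rows if r[:3] == ("ALL", "ALL", "ALL")])  # grand total 290
```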

Types of Aggregates
Distributive: the input can be partitioned into disjoint sets and aggregated separately, e.g., COUNT, SUM, MIN.
Algebraic: can be composed of distributive aggregates, e.g., AVG (from SUM and COUNT).
Holistic: the aggregate must be computed over the entire input set, e.g., MEDIAN.

Types of Aggregates: efficient computation of the CUBE operator depends on the type of aggregate. Distributive and algebraic aggregates motivate optimizations.
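A minimal sketch of why the classification matters for partitioned computation (data is illustrative): distributive and algebraic aggregates can be recombined from per-partition state, while a holistic one like MEDIAN cannot:

```python
import statistics

data = [5, 10, 8, 2, 11]
parts = [[5, 10], [8, 2, 11]]  # two disjoint partitions of the input

# Distributive: SUM of per-partition SUMs equals the global SUM.
assert sum(sum(p) for p in parts) == sum(data) == 36

# Algebraic: AVG is composed from the distributive pair (SUM, COUNT).
s, c = sum(sum(p) for p in parts), sum(len(p) for p in parts)
assert s / c == sum(data) / len(data) == 7.2

# Holistic: the median of per-partition medians is NOT the global median.
assert statistics.median(statistics.median(p) for p in parts) != statistics.median(data)
print("distributive and algebraic recombine; holistic does not")
```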

Agarwal et al. paper: compute GROUP-BYs from previously computed GROUP-BYs. Which direction? Next, some generic optimizations.

Optimization 1: Smallest Parent
Compute a GROUP-BY from the smallest (by size) previously computed GROUP-BY as its parent. AB can be computed from ABC, ABD, or ABCD; ABC or ABD is better than ABCD. Even ABC and ABD may have different sizes.
[Lattice diagram: all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]

Optimization 2: Cache Results
Cache the result of one GROUP-BY in memory to reduce disk I/O. Compute AB from ABC while ABC is still in memory.
[Cube lattice diagram]

Optimization 3: Amortize Disk Scans
Amortize disk reads across multiple GROUP-BYs. Suppose the result for ABCD is stored on disk; compute all of ABC, ABD, ACD, BCD simultaneously in one scan of ABCD.
[Cube lattice diagram]

Optimizations 4, 5 (later)
4. Share-sort for sort-based algorithms
5. Shared-partition for hash-based algorithms
[Cube lattice diagram]

PipeSort Algorithm

PipeSort: Basic Idea
Share-sort optimization: sort the data in one order and compute all GROUP-BYs that are prefixes of that order.
Example: GROUP-BYs over attributes ABCD. Sort the raw data by ABCD; compute ABCD -> ABC -> AB -> A in pipelined fashion, with no additional sort needed: compute one tuple of ABCD and propagate it upward in the pipeline, all in a single scan.
BUT, this may conflict with the smallest-parent optimization: ABD -> AB could be a better choice.
The PipeSort algorithm combines the two optimizations, share-sort and smallest-parent, and also includes cache-results and amortize-scans.
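The pipelined share-sort idea can be sketched as follows (illustrative tuples, sorted by (A, B, C)): one scan emits every prefix GROUP-BY, flushing a group's running total as soon as the sort key leaves its prefix:

```python
# Tuples sorted by (A, B, C); the last field is the Units measure.
rows = [("a1", "b1", "c1", 5), ("a1", "b1", "c2", 10),
        ("a1", "b2", "c3", 8), ("a2", "b2", "c1", 2), ("a2", "b2", "c3", 11)]

def pipeline_prefix_groupbys(sorted_rows, n_dims):
    """One scan of rows sorted on all dims; a prefix group's total is flushed
    as soon as the sort key leaves that prefix (share-sort pipelining)."""
    results = {k: [] for k in range(n_dims + 1)}  # prefix length -> [(key, sum)]
    prev, running = None, [0] * (n_dims + 1)      # running[k]: open group's sum
    for *dims, units in sorted_rows:
        key = tuple(dims)
        if prev is not None:
            for k in range(n_dims, 0, -1):        # flush every changed prefix
                if key[:k] != prev[:k]:
                    results[k].append((prev[:k], running[k]))
                    running[k] = 0
        for k in range(n_dims + 1):
            running[k] += units
        prev = key
    for k in range(n_dims, -1, -1):               # flush the still-open groups
        results[k].append((prev[:k] if prev else (), running[k]))
    return results

cube = pipeline_prefix_groupbys(rows, 3)
print(dict(cube[1]))  # {('a1',): 23, ('a2',): 13}
print(dict(cube[0]))  # {(): 36}
```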

Search Lattice (Harinarayan et al. '96; Lecture 3)
A directed edge means one attribute less and a possible computation. Level k contains the k-attribute GROUP-BYs; all = 0 attributes.
Two possible costs for each edge e_ij : i ---> j:
A(e_ij): i is sorted for j
S(e_ij): i is NOT sorted for j
[Lattice diagram: level 0 all; level 1 A, B, C, D; level 2 AB, AC, AD, BC, BD, CD; level 3 ABC, ABD, ACD, BCD; level 4 ABCD; edges marked Sorted / Not Sorted]
Example (computing AB from ABC):
Sorted ABC (A B C sum): a1 b1 c1 5; a1 b1 c2 10; a1 b2 c3 8; a2 b2 c1 2; a2 b2 c3 11
Not sorted ABC: the same tuples in arbitrary order
AB (A B sum): a1 b1 15; a1 b2 8; a2 b2 13

PipeSort Output
A subgraph O in which each node has a single parent and a sorted order of attributes. If the parent's sorted order is a prefix of the child's, the edge cost is A(e_ij); otherwise it is S(e_ij). Mark each edge A (sorted) or S (not sorted); a node has at most one A-marked out-edge.
Goal: find O with minimum total cost.
Q. Should we always have an A-marked (green) out-edge?
[Lattice diagram with chosen sorted orders, e.g. ACB, BDC, ACBD, plus the sorted / not-sorted ABC tables and the AB aggregate from the previous slide]

Outline: PipeSort Algorithm (1)
Go from level 0 to N−1 (here N = 4). For each level k, find the best way to construct it from level k+1.
Weighted bipartite matching G(V1, V2, E): weights on the edges; each vertex in V1 should be connected to at most one vertex in V2; find a matching of maximum total weight. Here we want minimum total weight, so replace each weight w by max_weight − w. Requires |V2| >= |V1|.
[Cube lattice diagram]

Outline: PipeSort Algorithm (2)
Reduction to a weighted bipartite matching between levels k and k+1: make k new copies of each node in level k+1 (k+1 copies of each in total) and replicate the edges. The original copy gets cost A(e_ij): sorted, with the sort order of i fixed. The new copies get cost S(e_ij): not sorted, i needs to be sorted.
[Lattice diagram; for level 1 from level 2, make 2 new copies, 3 copies of each node in total]

Outline: PipeSort Algorithm (3)
Illustration with a smaller example: level k = 1 from level k+1 = 2, with one new copy (dotted edges) and one existing copy (solid edges). Assumption for simplicity: each node has the same cost on all its outgoing edges, i.e., A(e_ij) = A(e_ij') and S(e_ij) = S(e_ij').
[Lattice: all; A, B, C; AB, AC, BC; ABC, with (A, S) costs AB: (2, 10), AC: (5, 12), BC: (13, 20)]
Optimal on total cost, but not on the number of sorts, which can be suboptimal (size).
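The small example above (parents AB, AC, BC with (A, S) costs (2, 10), (5, 12), (13, 20)) is tiny enough to solve by brute force; this sketch encodes the copy trick as "slots", one sorted A slot plus unsorted S slots per parent:

```python
from itertools import permutations

# (A, S) edge costs per parent, taken from the slide's example:
# A = child reuses the parent's sort order, S = parent must be re-sorted.
parents = {"AB": (2, 10), "AC": (5, 12), "BC": (13, 20)}
children = ["A", "B", "C"]
edges = {("AB", "A"), ("AB", "B"), ("AC", "A"), ("AC", "C"),
         ("BC", "B"), ("BC", "C")}  # parent must contain the child's attribute

# Replicate each parent into one sorted A slot plus unsorted S slots.
slots = []
for p, (a_cost, s_cost) in parents.items():
    slots.append((p, "A", a_cost))
    slots.extend((p, "S", s_cost) for _ in range(len(children) - 1))

def best_assignment():
    """Brute-force the min-cost matching of children to parent slots."""
    best_cost, best_plan = float("inf"), None
    for perm in permutations(range(len(slots)), len(children)):
        cost, plan = 0, {}
        for child, i in zip(children, perm):
            p, kind, c = slots[i]
            if (p, child) not in edges:
                break
            cost += c
            plan[child] = (p, kind)
        else:
            if cost < best_cost:
                best_cost, best_plan = cost, plan
    return best_cost, best_plan

cost, plan = best_assignment()
print(cost, plan)  # 17 {'A': ('AB', 'A'), 'B': ('AB', 'S'), 'C': ('AC', 'A')}
```

Note that a cost-17 tie exists (A via AB's S slot and B via its A slot), which echoes the slide's remark: the matching is optimal on total cost but not necessarily on the number of sorts.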

Outline: PipeSort Algorithm (4)
After computing the plan, execute all pipelines:
1. The first pipeline is executed in one scan of the data.
2. Sort CBAD -> BADC, then compute the second pipeline.
3. ...
(edges annotated with their A, S costs)

Outline: PipeSort Algorithm (5)
Observations:
Finds the best plan for computing level k from level k+1, assuming the cost of sorting, say, BAD does not depend on how the GROUP-BY on BAD was computed; so generating the plan k+1 -> k does not prevent the plan k+2 -> k+1 from finding the best choice.
Not provably globally optimal: e.g., can the optimal plan compute AB from ABCD? Something to explore!
[Diagram: computing BC via ABC -> BCA; if that green (A-marked) edge is chosen, the sorted order of ABCD will be BCAD]

PipeHash Algorithm

PipeHash: Basic Idea (1) (N = 4)
Use hash tables to compute smaller GROUP-BYs. If the hash tables for AB and AC fit in memory, compute both in one scan of ABC.
With no memory restrictions:
for k = N ... 0:
    for each (k+1)-attribute GROUP-BY g:
        compute, in one scan of g, all k-attribute GROUP-BYs for which g is the smallest parent
        save g to disk and destroy the hash table of g
[Lattice diagram and example hash tables for AB and AC computed from ABC]
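The one-scan hash-table step can be sketched as follows (illustrative data: an already-computed ABC aggregate feeding both AB and AC):

```python
from collections import defaultdict

# ABC aggregate already computed: (A, B, C) -> sum of Units.
abc = {("a1", "b1", "c1"): 5, ("a1", "b1", "c2"): 10,
       ("a1", "b2", "c3"): 8, ("a2", "b2", "c1"): 2, ("a2", "b2", "c3"): 11}

def scan_children(parent, projections):
    """One scan of a parent GROUP-BY fills one hash table per child.
    projections maps a child name to the parent key positions it keeps."""
    tables = {name: defaultdict(int) for name in projections}
    for key, total in parent.items():
        for name, idxs in projections.items():
            tables[name][tuple(key[i] for i in idxs)] += total
    return tables

tables = scan_children(abc, {"AB": (0, 1), "AC": (0, 2)})
print(dict(tables["AB"]))  # {('a1', 'b1'): 15, ('a1', 'b2'): 8, ('a2', 'b2'): 13}
```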

PipeHash: Basic Idea (2) (N = 4)
But the data might be large, and the hash tables may not all fit in memory.
Solution: the shared-partition optimization: partition the data on one or more attributes. Suppose the data is partitioned on attribute A; all GROUP-BYs containing A (AB, AC, AD, ABC, ...) can be computed independently on each partition. The cost of partitioning is shared by multiple GROUP-BYs.
[Lattice diagram and example tables]

PipeHash: Basic Idea (3) (N = 4)
Input: the search lattice. For each GROUP-BY, select the smallest parent. Result: a Minimum Spanning Tree (MST), with weights given by the size of each GROUP-BY.
But all the hash tables (HTs) in the MST may not fit in memory together. To consider: Which GROUP-BYs to compute together? When to allocate and release memory for HTs? Which attributes to partition on?
[Lattice diagram annotated with GROUP-BY sizes]
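The smallest-parent selection that yields the MST can be sketched like this (the cuboid sizes are made up for illustration):

```python
attrs = "ABCD"
# Hypothetical estimated sizes (number of distinct groups) per cuboid.
size = {"": 1, "A": 4, "B": 3, "C": 5, "D": 2,
        "AB": 10, "AC": 18, "AD": 7, "BC": 12, "BD": 6, "CD": 9,
        "ABC": 40, "ABD": 25, "ACD": 30, "BCD": 35, "ABCD": 100}

def smallest_parent_tree(size, top=attrs):
    """Map each non-top cuboid to its smallest parent one attribute larger."""
    parent = {}
    for g in size:
        if len(g) == len(top):
            continue
        candidates = [p for p in size
                      if len(p) == len(g) + 1 and set(g) <= set(p)]
        parent[g] = min(candidates, key=lambda p: size[p])
    return parent

tree = smallest_parent_tree(size)
print(tree["AB"])  # ABD: size 25 beats ABC's 40
print(tree["A"])   # AD: smallest among AB (10), AC (18), AD (7)
```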

Outline: PipeHash Algorithm (1)
Once again, a combinatorial optimization problem. The paper conjectures it to be NP-complete (something to explore!), so heuristics are used.
Trade-offs:
1. Choose as large a sub-tree of the MST as possible (cache-results, amortize-scans).
2. The sub-tree must include the partitioning attribute(s).
Heuristic: choose a partitioning attribute that allows selection of the largest sub-tree of the MST.

Outline: PipeHash Algorithm (2)
Input: the search lattice.
worklist = {MST}
while worklist is not empty:
    select one tree T from the worklist
    T' = select-subtree(T)    (may add more sub-trees to the worklist)
    compute-subtree(T')
Next, select-subtree(T) and compute-subtree(T') through examples.
[Cube lattice diagram]

Outline: PipeHash Algorithm (3)
T' = Select-Subtree(T) = T_A: choose s = {A} such that T_s, per partition in P_s, fits in memory (P_s = #partitions) and T' = T_s is the largest; this may create new sub-trees to add to the worklist.
Compute-Subtree(T'): partition on A; for each partition, keep a hash table in memory until all its children are created:
Compute GROUP-BY ABCD; scan ABCD to compute ABC, ABD, ACD; save ABCD and ABD to disk.
Compute AD from ACD; save ACD and AD to disk.
Compute AB and AC from ABC; save ABC and AC to disk.
Compute A from AB; save AB and A to disk.

Experiments
Here sort-based is better than hash-based (a new hash table is built for each GROUP-BY). In another experiment on synthetic data (see paper), for less sparse data, hash-based is better than sort-based.

Summary
The similar Overlap algorithm by Deshpande et al. is covered in the paper. All these algorithms try to pick the best plan to compute the aggregates with fewer scans and maximal memory usage. Finding optimal decisions for each algorithm may be NP-complete; the algorithms use heuristics that work well in practice.
Next class: other efficient implementations and indexes for the cube.