MTAT.03.183 Data Mining. Week 7: Online Analytical Processing and Data Warehouses


MTAT.03.183 Data Mining
Week 7: Online Analytical Processing and Data Warehouses
Marlon Dumas, marlon.dumas ät ut.ee

Acknowledgment
This slide deck is a mashup of the following publicly available slide decks:
- http://www.postech.ac.kr/~swhwang/grass/datacube.ppt
- http://classweb.gmu.edu/kersch/inft864/readings/sho /k h/i di /Sh shani/datacube/cubenoteskerschberg.ppt
- http://ohr.gsfc.nasa.gov/wfstatistics/data_cube_training.ppt
- http://www.cs.uiuc.edu/homes/hanj/bk2/03.ppt

Outline
- The data cube abstraction
- Multidimensional data models
- Data warehouses

Typical Data Analysis Process
- Formulate a query to extract relevant information
- Extract aggregated data from the database
- Visualize the result to look for patterns
- Analyze the result and formulate new queries
Online Analytical Processing (OLAP) is about supporting such processes.
OLAP characteristics: no updates, lots of aggregation, need to visualize and to interact.
Let's first talk about aggregation.

Relational Aggregation Operators
- SQL has several aggregate operators: SUM(), MIN(), MAX(), COUNT(), AVG()
- The basic idea is: combine all values in a column into a single scalar value
- Syntax:

    SELECT AVG(Temp) FROM Weather;

The Relational GROUP BY Operator
- GROUP BY allows aggregates over table sub-groups:

    SELECT Time, Altitude, AVG(Temp)
    FROM Weather
    GROUP BY Time, Altitude;

[Tables: Weather (Time, Latitude, Longitude, Altitude (m), Temp) and the grouped result (Time, Altitude (m), AVG(Temp)). For example, the two readings taken at 07/9/5:1500 and altitude 20 m, with temperatures 24 and 22, collapse into one result row with AVG(Temp) = 23.]
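The two queries above can be tried out with Python's built-in `sqlite3` module. This is only a minimal sketch: the latitude and longitude values are made-up placeholders, the temperature rows loosely follow the slide's example rather than reproduce it, and the helper names are hypothetical.

```python
import sqlite3

# Illustrative rows only: latitude/longitude are made-up placeholders and the
# readings loosely follow the slide's Weather example.
ROWS = [
    ("07/9/5:1500", 37.0, 127.0, 20, 24),
    ("07/9/5:1500", 37.1, 127.1, 20, 22),
    ("07/9/5:1500", 37.2, 127.2, 100, 17),
    ("07/9/9:1500", 37.0, 127.0, 50, 19),
    ("07/9/9:1500", 37.1, 127.1, 50, 21),
]


def build_weather():
    """Create an in-memory Weather table and load the sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        'CREATE TABLE Weather ("Time" TEXT, Latitude REAL, Longitude REAL, '
        'Altitude INTEGER, "Temp" REAL)'
    )
    con.executemany("INSERT INTO Weather VALUES (?, ?, ?, ?, ?)", ROWS)
    return con


def overall_average(con):
    """Plain aggregate: the whole Temp column collapses into one scalar."""
    return con.execute('SELECT AVG("Temp") FROM Weather').fetchone()[0]


def average_by_time_and_altitude(con):
    """GROUP BY: one aggregate per (Time, Altitude) combination."""
    return con.execute(
        'SELECT "Time", Altitude, AVG("Temp") FROM Weather '
        'GROUP BY "Time", Altitude ORDER BY "Time", Altitude'
    ).fetchall()


if __name__ == "__main__":
    con = build_weather()
    print(overall_average(con))
    for row in average_by_time_and_altitude(con):
        print(row)
```

The column names `Time` and `Temp` are quoted only as a precaution against keyword clashes in SQLite; the queries are otherwise the ones from the slides.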

Limitations of the GROUP BY
- Group-by is one-dimensional: one group per combination of the selected attribute values
- Does not give sub-totals

    Model   Year   Color   Sales
    Chevy   1994   Black      50
    Chevy          Black      85
    Chevy   1994   White      40
    Chevy          White     115

1. Calculate total sales per year
2. Compute total sales per year and per color
3. Calculate sales per year, per color and per model

Grouping with Sub-Totals (Pivot table)
[Figure: Sales by Model by Year by Color]
Note that sub-totals by color are missing; if added, it becomes a cross-tabulation.

Grouping with Sub-Totals (cross-tab)
[Figure: cross-tabulation]

Grouping with Sub-Totals (Relational version)
[Figure: relational table with sub-total rows]
Sub-totals by color are still missing.

SQL Query
[Figure: SQL query producing the sub-totals]

Adding the colors
[Figure: SQL query extended with the sub-totals by color]
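The SQL on the last two slides did not survive transcription. One common way to get sub-total rows out of plain GROUP BY is to UNION several GROUP BY queries and pad the collapsed columns with 'ALL'. The sketch below assumes that approach, which may differ from the query shown on the slides, and uses six Chevy rows borrowed from the SALES table that appears later in the deck; `subtotal_report` is a hypothetical helper name.

```python
import sqlite3

# Six rows borrowed from the SALES table shown later in the deck.
ROWS = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "white", 87),
    ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54),
    ("Chevy", 1991, "white", 95),
    ("Chevy", 1991, "blue", 49),
]

# One GROUP BY per level of detail; collapsed columns are padded with 'ALL'.
# The last branch is the "adding the colors" step: sub-totals by color.
SUBTOTALS_SQL = """
SELECT Model, 'ALL' AS Year, 'ALL' AS Color, SUM(Sales) FROM Sales
GROUP BY Model
UNION ALL
SELECT Model, CAST(Year AS TEXT), 'ALL', SUM(Sales) FROM Sales
GROUP BY Model, Year
UNION ALL
SELECT Model, CAST(Year AS TEXT), Color, SUM(Sales) FROM Sales
GROUP BY Model, Year, Color
UNION ALL
SELECT Model, 'ALL', Color, SUM(Sales) FROM Sales
GROUP BY Model, Color
"""


def subtotal_report():
    """Return detail rows plus sub-total rows as (Model, Year, Color, Sales)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER)"
    )
    con.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", ROWS)
    return con.execute(SUBTOTALS_SQL).fetchall()


if __name__ == "__main__":
    for row in subtotal_report():
        print(row)
```

Every extra level of sub-totals needs another UNION branch, which is the verbosity the CUBE operator is meant to remove.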

CUBE and Roll Up Operators

The Cube: An Example of 3D Data Cube
[Figure: The Data Cube and The Sub-Space Aggregates. Panels: Aggregate; Group By (with total); Cross Tab (Chevy, Ford by RED, WHITE, BLUE); 3D data cube with faces By Make, By Year, By Make & Year, By Make & Color]

Cube: Each Attribute is a Dimension
The Data Cube Concept
- N-dimensional aggregate (sum(), max(), ...)
  Fits the relational model exactly:
    a1, a2, ..., aN, f()
- Super-aggregate over N-1 dimensional sub-cubes:
    ALL, a2, ..., aN, f()
    a1, ALL, a3, ..., aN, f()
    ...
    a1, a2, ..., ALL, f()
  This is the N-1 dimensional cross-tab.
- Super-aggregate over N-2 dimensional sub-cubes:
    ALL, ALL, a3, ..., aN, f()
    ...
    a1, a2, ..., ALL, ALL, f()
[Figure: data cube with axes COLOR (Black, White), MAKE (Ford, Chevy) and YEAR (1994, ...)]

Sub-cube Derivation
    <M,Y,C>
    <M,Y,*>  <M,*,C>  <*,Y,C>
    <M,*,*>  <*,Y,*>  <*,*,C>
    <*,*,*>
Dimension collapse; * denotes ALL.

CUBE Operator: Possible Syntax
Proposed syntax example:

    SELECT Model, Make, Year, SUM(Sales)
    FROM Sales
    WHERE Model IN {'Chevy', 'Ford'}
      AND Year BETWEEN 1990 AND 1994
    GROUP BY CUBE Model, Make, Year
    HAVING SUM(Sales) > 0;

Note: the GROUP BY clause repeats the attribute list that already appears in the select list.
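The sub-cube lattice above can be made concrete in a few lines of plain Python: a cube over N dimensions is the union of the GROUP BY results for all 2^N subsets of the dimensions, with each collapsed dimension replaced by ALL. The following is a minimal sketch of that idea, not an efficient cube algorithm; the function names are hypothetical, and the eight rows are taken from the SALES table that appears later in the deck.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"
DIMS = ("Model", "Year", "Color")

# Eight rows taken from the SALES table shown later in the deck.
ROWS = [
    {"Model": "Chevy", "Year": 1990, "Color": "red", "Sales": 5},
    {"Model": "Chevy", "Year": 1990, "Color": "white", "Sales": 87},
    {"Model": "Chevy", "Year": 1991, "Color": "red", "Sales": 54},
    {"Model": "Chevy", "Year": 1991, "Color": "white", "Sales": 95},
    {"Model": "Ford", "Year": 1990, "Color": "red", "Sales": 64},
    {"Model": "Ford", "Year": 1990, "Color": "white", "Sales": 62},
    {"Model": "Ford", "Year": 1991, "Color": "red", "Sales": 52},
    {"Model": "Ford", "Year": 1991, "Color": "white", "Sales": 9},
]


def grouping_sets(dims):
    """Every subset of the dimensions, from <M,Y,C> down to <*,*,*>."""
    return [
        set(kept)
        for size in range(len(dims), -1, -1)
        for kept in combinations(dims, size)
    ]


def cube(rows, dims, measure):
    """SUM the measure for each grouping set; collapsed dimensions become ALL."""
    totals = defaultdict(int)
    for kept in grouping_sets(dims):
        for row in rows:
            key = tuple(row[d] if d in kept else ALL for d in dims)
            totals[key] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    result = cube(ROWS, DIMS, "Sales")
    print(len(grouping_sets(DIMS)), "grouping sets,", len(result), "cells")
    print(result[(ALL, ALL, ALL)])
```

With two values per dimension the result has (2+1)^3 = 27 cells, since every dimension gains the extra value ALL.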

Rollup Operator
ROLLUP operator: a special case of the CUBE operator.
Return Sales Roll Up by Store by Quarter in 1994:

    SELECT Store, quarter, SUM(Sales)
    FROM Sales
    WHERE nation = 'Korea' AND Year = 1994
    GROUP BY ROLLUP Store, Quarter(Date) AS quarter;

Cube Operator Example

SALES
    Model   Year   Color   Sales
    Chevy   1990   red         5
    Chevy   1990   white      87
    Chevy   1990   blue       62
    Chevy   1991   red        54
    Chevy   1991   white      95
    Chevy   1991   blue       49
    Chevy   1992   red        31
    Chevy   1992   white      54
    Chevy   1992   blue       71
    Ford    1990   red        64
    Ford    1990   white      62
    Ford    1990   blue       63
    Ford    1991   red        52
    Ford    1991   white       9
    Ford    1991   blue       55
    Ford    1992   red        27
    Ford    1992   white      62
    Ford    1992   blue       39

CUBE

DATA CUBE
    Model   Year   Color   Sales
    ALL     ALL    ALL       942
    chevy   ALL    ALL       510
    ford    ALL    ALL       432
    ALL     1990   ALL       343
    ALL     1991   ALL       314
    ALL     1992   ALL       285
    ALL     ALL    red       165
    ALL     ALL    white     273
    ALL     ALL    blue      339
    chevy   1990   ALL       154
    chevy   1991   ALL       199
    chevy   1992   ALL       157
    ford    1990   ALL       189
    ford    1991   ALL       116
    ford    1992   ALL       128
    chevy   ALL    red        91
    chevy   ALL    white     236
    chevy   ALL    blue      183
    ford    ALL    red       144
    ford    ALL    white     133
    ford    ALL    blue      156
    ALL     1990   red        69
    ALL     1990   white     149
    ALL     1990   blue      125
    ALL     1991   red       107
    ALL     1991   white     104
    ALL     1991   blue      104
    ALL     1992   red        59
    ALL     1992   white     116
    ALL     1992   blue      110

Summary
Problems with GROUP BY: GROUP BY cannot directly construct
- Pivot tables / roll-up reports
- Cross-tabs
The CUBE operator generalizes GROUP BY, roll-up and cross-tabs.

Now let's have a look at some examples:
- NASA Workforce cubes: http://nasapeople.nasa.gov/workforce
- Btell demo reports: http://www.btell.de (follow the demo link and start a demo, then go to reports)

Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:

    Product     Location   Time
    Industry    Region     Year
    Category    Country    Quarter
    Product     City       Month / Week
                Office     Day

(Source: J. Han, Data Mining: Concepts and Techniques)

Multidimensional Data Modeling
Key concepts:
- Dimensions: entities along which data can be aggregated (e.g. car model, time, color)
- Measures: scalar values to be aggregated (e.g. number of sales)
Two types of tables:
- Dimension tables
- Fact table: stores measures and references to dimensions
(Source: J. Han, Data Mining: Concepts and Techniques)
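Returning to the ROLLUP operator and the SALES table above: ROLLUP is the special case of CUBE that keeps only the prefixes of the dimension list, so N dimensions give N+1 grouping sets instead of 2^N. The sketch below computes the roll-up of the 18 SALES rows by Model, then Year, then Color in plain Python; `rollup` is a hypothetical helper, not a library function. Its totals are derived from the 18 base rows only, and a few of them differ from the figures listed in the transcribed DATA CUBE table (for example, the base rows give 508 for chevy/ALL/ALL, where that table lists 510).

```python
from collections import defaultdict

ALL = "ALL"

# The 18 base rows of the SALES table: (Model, Year, Color, Sales).
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]


def rollup(rows, n_dims):
    """Sum the last column for every prefix of the first n_dims columns.

    Prefix length n_dims gives the detail rows, length 0 the grand total;
    dropped trailing dimensions are replaced by ALL.
    """
    totals = defaultdict(int)
    for keep in range(n_dims, -1, -1):
        for *dims, sales in rows:
            key = tuple(dims[:keep]) + (ALL,) * (n_dims - keep)
            totals[key] += sales
    return dict(totals)


if __name__ == "__main__":
    result = rollup(SALES, 3)
    for key in sorted(result, key=str):
        print(key, result[key])
```

The order of the dimensions matters for ROLLUP: rolling up by Color first would produce a different set of sub-totals, whereas CUBE produces all of them regardless of order.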

Star Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, state_or_province, country
(Source: J. Han, Data Mining: Concepts and Techniques)

Snowflake Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, state_or_province, country
(Source: J. Han, Data Mining: Concepts and Techniques)

OLTP vs. OLAP
OLTP: Online Transaction Processing
- Traditional database technology
- Many small transactions (point queries: UPDATE or INSERT)
- Avoid redundancy, normalize schemas
- Access to consistent, up-to-date database
OLTP examples:
- Flight reservation
- Banking and financial transactions
- Order management, procurement, ...
Goal: extremely fast response times

OLTP vs. OLAP
OLAP: Online Analytical Processing
- Big aggregate queries, no updates
- Redundancy a necessity (materialized views, special-purpose indexes, de-normalized schemas)
- Periodic refresh of data (daily or weekly)
OLAP examples:
- Decision support (sales per employee)
- Marketing (purchases per customer)
- Biomedical databases
Goal: response time of seconds / a few minutes

OLTP vs. OLAP (Water and Oil)
- Lock conflicts: OLAP blocks OLTP
- Database design: OLTP normalized, OLAP de-normalized
- Tuning, optimization:
  OLTP: inter-query parallelism, heuristic optimization
  OLAP: intra-query parallelism, full-fledged optimization
- Freshness of data:
  OLTP: serializability
  OLAP: reproducibility
- Integrity:
  OLTP: ACID
  OLAP: sampling, confidence intervals

Solution: Data Warehouse
- Special sandbox for OLAP
- Data input using OLTP systems
- Data warehouse aggregates and replicates data (special schema)
- New data is periodically uploaded to the warehouse
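To make the star schema described earlier in this section more tangible, here is a minimal sketch that builds a cut-down version of it in an in-memory SQLite database and runs one typical OLAP query (dollars sold per country and quarter). The table and column names follow the slide where the transcription shows them; the key columns `time_key` and `location_key`, the reduced column lists, and every data row are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
-- Dimension tables (reduced column lists; *_key names partly assumed).
CREATE TABLE "time"   (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: foreign keys into the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES "time"(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
"""


def build_star_schema():
    """Create the schema and load a handful of made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany('INSERT INTO "time" VALUES (?, ?, ?)',
                    [(1, "Q1", 1994), (2, "Q2", 1994)])
    con.executemany("INSERT INTO item VALUES (?, ?, ?)",
                    [(1, "sedan", "Chevy"), (2, "pickup", "Ford")])
    con.executemany("INSERT INTO location VALUES (?, ?, ?)",
                    [(1, "Tartu", "Estonia"), (2, "Seoul", "Korea")])
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
        (1, 1, 1, 10, 100.0),
        (1, 2, 1, 5, 80.0),
        (2, 1, 1, 7, 70.0),
        (1, 1, 2, 3, 30.0),
    ])
    return con


def sales_by_country_and_quarter(con):
    """Join the fact table to two dimensions and aggregate one measure."""
    return con.execute("""
        SELECT l.country, t.quarter, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN "time" AS t ON t.time_key = f.time_key
        JOIN location AS l ON l.location_key = f.location_key
        GROUP BY l.country, t.quarter
        ORDER BY l.country, t.quarter
    """).fetchall()


if __name__ == "__main__":
    for row in sales_by_country_and_quarter(build_star_schema()):
        print(row)
```

In the snowflake variant the `country` attribute would sit in a separate `city` table, so the same query would need one more join; that extra join is the usual price of the additional normalization.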

DW Architecture
[Figure: OLTP applications write to the operational databases DB1, DB2 and DB3; Extract, Transform, Load (ETL) moves their data into the Data Warehouse; OLAP tools (GUI, spreadsheets) query the warehouse]

DW Products and Tools
- Oracle 11g, IBM DB2, Microsoft SQL Server, ...
  All provide OLAP extensions
- SAP Business Information Warehouse
  ERP vendors
- MicroStrategy, Cognos (now IBM)
  Specialized vendors
  Kind of Web-based EXCEL
- Niche players (e.g., Btell)
  Vertical application domain

Reference (highly recommended)
- Jim Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1(1), 1997. http://citeseer.ist.psu.edu/old/392672.html
- Data Warehousing chapter of Jiawei Han's textbook (chapter 3)

Homework
- Exercises 1 and 4 at: http://www.systems.ethz.ch/education/courses/fs09/data-warehousing/ex2.pdf
- Multidimensional data modeling exercise in course Wiki pages