CS 245: Database System Principles. Warehousing. Outline. What is a Warehouse? What is a Warehouse? Notes 13: Data Warehousing

Similar documents
Acknowledgment. Sales data example. Outline. Excel pivot table. Example: Sales

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse

Data Mining Concepts & Techniques

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

Decision Support, Data Warehousing, and OLAP

Syllabus. Syllabus. Motivation Decision Support. Syllabus

collection of data that is used primarily in organizational decision making.

CHAPTER 3 Implementation of Data warehouse in Data Mining

Data Warehousing and OLAP

Evolution of Database Systems

Business Analytics and Big Data: the process and the tools

Data Warehouse and Data Mining

Information Management course

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A

Data Warehouses. Vera Goebel. Fall Department of Informatics, University of Oslo

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

Data Warehousing and Decision Support

What is a Data Warehouse?

Data Warehousing and Decision Support

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

DATA MINING AND WAREHOUSING

ETL and OLAP Systems

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Data Warehouses. Yanlei Diao. Slides Courtesy of R. Ramakrishnan and J. Gehrke

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Decision Support Systems aka Analytical Systems

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

On-Line Application Processing

An Overview of Data Warehousing and OLAP Technology

Data warehouses Decision support The multidimensional model OLAP queries

Data Mining & Data Warehouse

Data Warehousing and Decision Support (mostly using Relational Databases) CS634 Class 20

DATA MINING TRANSACTION

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

Introduction to Data Warehousing

Data Warehousing & OLAP

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

Data Warehouse and Data Mining

Decision Support. Chapter 25. CS 286, UC Berkeley, Spring 2007, R. Ramakrishnan 1

On-Line Analytical Processing (OLAP) Traditional OLTP

Information Management course

Chapter 4, Data Warehouse and OLAP Operations

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Data Warehouse and Data Mining

REPORTING AND QUERY TOOLS AND APPLICATIONS

Data Warehousing and OLAP Technologies for Decision-Making Process

~ Ian Hunneybell: DWDM Revision Notes (31/05/2006) ~

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)

Warehousing. Data Mining

Analytical data bases Database lectures for math

Data Warehousing (1)

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong

Data Mining. Associate Professor Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology

Data warehouse architecture consists of the following interconnected layers:

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

Chapter 13 Business Intelligence and Data Warehouses The Need for Data Analysis Business Intelligence. Objectives

DATA WAREHOUING UNIT I

A Multi-Dimensional Data Model

Data Warehousing ETL. Esteban Zimányi Slides by Toon Calders

Sql Fact Constellation Schema In Data Warehouse With Example

OLAP Introduction and Overview

1 DATAWAREHOUSING QUESTIONS by Mausami Sawarkar

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Data Warehouse and Data Mining

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Question Bank. 4) It is the source of information later delivered to data marts.

Data Warehousing 2. ICS 421 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Data Warehouses and OLAP. Database and Information Systems. Data Warehouses and OLAP. Data Warehouses and OLAP

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

The University of Iowa Intelligent Systems Laboratory The University of Iowa Intelligent Systems Laboratory

Call: SAS BI Course Content:35-40hours

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Adnan YAZICI Computer Engineering Department

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Data Warehouses Chapter 12. Class 10: Data Warehouses 1

Oracle 1Z0-515 Exam Questions & Answers

Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis

IDU0010 ERP,CRM ja DW süsteemid Loeng 5 DW concepts. Enn Õunapuu

Databases and Data Warehouses

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

Teradata Aggregate Designer

CSPP 53017: Data Warehousing Winter 2013! Lecture 7! Svetlozar Nestorov! Class News!

Data Analysis and Data Science

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

QUALITY MONITORING AND

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Q1) Describe business intelligence system development phases? (6 marks)

Rocky Mountain Technology Ventures

Table of Contents. Knowledge Management Data Warehouses and Data Mining. Introduction and Motivation

Knowledge Management Data Warehouses and Data Mining

Data Warehousing Conclusion. Esteban Zimányi Slides by Toon Calders

Transcription:

Recall : Database System Principles Notes 3: Data Warehousing Three approaches to information integration: Federated databases did teaser Data warehousing next Mediation Hector Garcia-Molina (Some modifications by Chris Olston) 2 Warehousing Outline Growing industry: $8 billion in 998 Range from desktop to huge: Walmart: 9-CPU, 2,7 disk, 23TB Teradata system Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot,... What is a data warehouse? Why a warehouse? Models & operations Implementing a warehouse Future directions 3 4 What is a Warehouse? What is a Warehouse? Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile more Collection of tools gathering data cleansing, integrating,... querying, reporting, analysis data mining monitoring, administering warehouse 5 6

Warehouse Architecture Motivating Examples Forecasting Query & Analysis Comparing performance of units Monitoring, detecting fraud Metadata Warehouse Visualization Integration Source Source Source 7 8 Why a Warehouse? Mediation Approach Two Approaches: Mediation (Lazy) Warehouse (Eager) Mediator? Wrapper Wrapper Wrapper Source Source Source Source Source 9 Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Easier to query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information Advantages of Mediation No need to copy data less storage no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources 2 2

OLTP vs. OLAP OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse OLTP Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical OLAP Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users 3 4 Data Marts Smaller warehouses Spans part of organization e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems? Warehouse Models & Operators Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other 5 6 Star Star Schema product prodid name price p bolt p2 nut 5 store storeid city c nyc c2 sfo c3 la sale oderid date custid prodid storeid qty amt o /7/97 53 p c 2 o2 2/7/97 53 p2 c 2 5 3/8/97 p c3 5 5 product prodid name price sale orderid date custid prodid storeid qty amt customer custid name address city customer custid name address city 53 joe main sfo 8 fred 2 main sfo sally 8 willow la store storeid city 7 8 3

Terms Dimension Hierarchies Fact table Dimension tables Measures product prodid name price sale orderid date custid prodid storeid qty amt customer custid name address city store stype city store storeid cityid tid mgr s5 sfo t joe s7 sfo t2 fred s9 la t nancy region stype tid size location t small downtown t2 large suburbs city cityid pop regid sfo M north la 5M south store storeid city Í snowflake schema Í constellations region regid weather north fog / heat south heat 9 2 Cube 3-D Cube Fact table view: sale prodid storeid amt p c 2 p2 c p c3 5 p2 c2 8 Multi-dimensional cube: p 2 5 Fact table view: p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 day 2 day Multi-dimensional cube: p 44 4 p2 p 2 5 dimensions = 2 dimensions = 3 2 22 Aggregates Aggregates Add up amounts for day In SQL: SELECT sum(amt) FROM SALE WHERE date = Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 8 p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 ans date sum 8 2 48 23 24 4

Another Example Aggregates Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodid p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 sale prodid date amt p 62 p2 9 p 2 48 Operators: sum, count, max, min, median, avg Group by clause Having clause rollup drill-down 25 26 Cube Aggregation Cube Operators day 2 day p 44 4 p2 p 2 5 Example: computing sums... day 2 day p 44 4 p2 p 2 5... sale(c,*,*) p 56 4 5 sum 67 2 5 29 p 56 4 5 sum 67 2 5 29 rollup drill-down sum p p2 9 sale(c2,p2,*) sum p p2 9 sale(*,*,*) 27 28 Extended Cube Pivoting * * p 56 4 5 9 day 2 c* c267 c3 2 * 5 29 p 44 4 48 day c p2 c2 c3 * p 2* 44 5 4 62 48 9 * 23 8 5 8 sale(*,p2,*) Fact table view: p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 Multi-dimensional cube: day 2 day p 44 4 p2 p 2 5 p 56 4 5 29 3 5

Query & Analysis Tools Query Building Report Writers (comparisons, growth, graphs, ) Spreadsheet Systems Web Interfaces Data Mining Time functions Other Operations e.g., time average Computed Attributes e.g., commission = sales * rate Text Queries e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z 3 32 Data Mining Clustering Clustering Association Rules Also: Decision trees, time-series, income education age 33 34 Another Example: Text Issues Each document is a vector e.g., <...> contains words,4,5,... Clusters contain similar documents Useful for understanding, searching documents international news business sports Given desired number of clusters? Finding best clusters Are clusters semantically meaningful? e.g., yuppies cluster? Using clusters for disk storage 35 36 6

Association Rule Mining Association Rule sales records: transaction id customer id tran cust33 p2, p5, p8 tran2 cust45 p5, p8, p tran3 cust2 p, p9 tran4 cust4 p5, p8, p tran5 cust2 p2, p9 tran6 cust2 p9 products bought market-basket data Rule: {p, p 3, p 8 } Support: number of baskets where these products appear High-support set: support threshold s Problem: find all high support sets Trend: Products p5, p8 often bough together Trend: Customer 2 likes product p9 37 38 Finding High-Support Pairs Example Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item < J.item WHY? GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s; basket item t p2 t p5 t p8 t2 p5 t2 p8 t2 p...... basket item item2 t p2 p5 t p2 p8 t p5 p8 t2 p5 p8 t2 p5 p t2 p8 p......... check if count s 39 4 Issues Performance for size 2 rules big basket item t p2 t p5 t p8 t2 p5 t2 p8 t2 p...... basket item item2 t p2 p5 t p2 p8 t p5 p8 t2 p5 p8 t2 p5 p t2 p8 p......... even bigger! Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,... Performance for size k rules 4 42 7

Monitoring Monitoring Techniques Source Types: relational, flat file, IMS, WWW, news-wire, Incremental vs. Periodic Refresh customer id name address city 53 joe main sfo 8 fred 2 main sfo sally 8 willow la new 43 Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring Advantages & Disadvantages!! Í 44 Monitoring Issues Integration Frequency periodic: daily, weekly, triggered: on big change, lots of changes,... Data transformation Data Cleaning Data Loading Derived Data Query & Analysis convert data to uniform format Metadata Warehouse remove & add fields (e.g., add date to get history) Integration Standards (e.g., ODBC) Gateways Source Source Source 45 46 Data Cleaning Loading Data Migration (e.g., yen Õ dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) billing DB customer(joe) merged_customer(joe) Incremental vs. periodic refresh Off-line vs. on-line Frequency of loading At night, x a week/month, continuously Parallel/Partitioned load service DB customer2(joe) Auditing: discover rules & relationships (like data mining) 47 48 8

Derived Data Materialized Views Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. periodic refresh 49 Define new warehouse relations using SQL expressions p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 product id name price p bolt p2 nut 5 jointb prodid name price storeid date amt p bolt c 2 p2 nut 5 c p bolt c3 5 p2 nut 5 c2 8 p bolt c 2 44 p bolt c2 2 4 does not exist at any source 5 Processing ROLAP Server ROLAP servers vs. MOLAP servers Index Structures Relational OLAP Server sale prodid date sum p 62 p2 9 p 2 48 What to Materialize? tools Algorithms Metadata Query & Analysis Warehouse utilities ROLAP server Special indices, tuning; Schema is denormalized Integration Source Source Source relational DBMS 5 52 MOLAP Server Index Structures Multi-Dimensional OLAP Server utilities M.D. tools multidimensional server Product milk soda eggs soap City A B 2 3 4 Date could also sit on relational DBMS Sales Traditional Access Methods B-trees, hash tables, R-trees, grids, Popular in Warehouses inverted lists bit map indexes join indexes text indexes 53 54 9

Inverted Lists Using Inverted Lists 2 23 8 9 2 2 22 23 25 26 r4 r8 r34 r35 r5 r9 r37 r4 rid name age r4 joe 2 r8 fred 2 r9 sally 2 r34 nancy 2 r35 tom 2 r36 pat 25 r5 dave 2 r4 jeff 26... Query: Get people with age = 2 and name = fred List for age = 2: r4, r8, r34, r35 List for name = fred : r8, r52 Answer is intersection: r8 age index inverted lists data records 55 56 Bit Maps Using Bit Maps 2 23 age index 8 9 2 2 22 23 25 26 bit maps id name age joe 2 2 fred 2 3 sally 2 4 nancy 2 5 tom 2 6 pat 25 7 dave 2 8 jeff 26... data records Query: Get people with age = 2 and name = fred List for age = 2: List for name = fred : Answer is intersection: Good if domain cardinality small Bit vectors can be compressed 57 58 Join Join Indexes Combine SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT p c 2 p2 c p c3 5 p2 c2 8 p c 2 44 p c2 2 4 product id name price p bolt p2 nut 5 jointb prodid name price storeid date amt p bolt c 2 p2 nut 5 c p bolt c3 5 p2 nut 5 c2 8 p bolt c 2 44 p bolt c2 2 4 product id name price jindex p bolt r,r3,r5,r6 p2 nut 5 r2,r4 join index sale rid prodid storeid date amt r p c 2 r2 p2 c r3 p c3 5 r4 p2 c2 8 r5 p c 2 44 r6 p c2 2 4 59 6

What to Materialize? Store in warehouse results useful for common queries Example: day 2 p 44 4 p2 day p 2 5... total sales Materialization Factors Type/frequency of queries Query response time Storage cost Update cost p 67 2 5 p 56 4 5 29 materialize c p p2 9 6 62 Cube Aggregates Lattice Managing 29 all Metadata Warehouse Design p 67 2 5 city product date Tools Query & Analysis city, product city, date product, date Metadata Warehouse p 56 4 5 Integration day 2 p 44 4 p2 day p 2 5 city, product, date Need an algorithm to decide what to materialize Source Source Source 63 64 Metadata Metadata Administrative definition of sources, tools,... schemas, dimension hierarchies, rules for extraction, cleaning, refresh, purging policies user profiles, access control,... Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails 65 66

Design Tools What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 67 Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management reliable scripts for cleaning & analyzing data 68 Current State of Industry Future Directions Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything Better performance Larger warehouses Easier to use What are companies & research labs working on? 69 7 Research () Research (2) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data 7 72 2

Conclusions Massive amounts of data and complexity of queries will push limits of current warehouses Need better systems: easier to use provide quality information 73 3