Data Mining Concepts & Techniques

Data Mining Concepts & Techniques. Lecture No. 01: Databases, Data Warehouse. Naeem Ahmed. Email: naeemmahoto@gmail.com. Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Database: A database is a large, integrated collection of data. A database management system (DBMS) is a software system designed to store, manage, and facilitate access to the database. A schema is a description of a database.

What is a data warehouse? Basically, a very large database. Not all very large databases are data warehouses, but all data warehouses are fairly large databases. Nowadays a warehouse is considered to start at around 800 GB and to go up to several TB. It spans several servers and needs an impressive amount of computing power.

What is a data warehouse? More specifically, it is a collective data repository that contains snapshots of the operational data (history), is obtained through data cleansing and ETL (Extract-Transform-Load), and is useful for analytics.

What is a data warehouse? Compared to other solutions, it is suitable for a tactical/strategic focus, implies a small number of transactions, and implies large transactions spanning a long period of time.

Some Definitions. Ralph Kimball: "a copy of transaction data specifically structured for query and analysis". Bill Inmon (father of data warehousing, in 1993): a data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions.

Data Warehouse, subject oriented: Data is arranged by subject area rather than by application. Data is organized so that all the data elements relating to the same real-world event or object are linked together. Typical subject areas in DWs are Customer, Product, Order, Claim, Account, and so on.

Data Warehouse, subject oriented: Example: customer as the subject in a DW. The DW is organized in this case by customer. It may consist of 10, 100, or more physical tables, all related.

Data Warehouse, integrated: Data is collected from multiple, diverse sources among an organization's operational systems and is stored in a consistent form. Typical points to reconcile are gender encodings, units of measurement, conflicting keys, and general consistency.
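The integration step can be illustrated with a minimal sketch. The source systems (`crm`, `billing`), the field names, and the encodings below are all hypothetical; the point is only that differently encoded records end up in one agreed convention with keys that cannot collide.

```python
# Hypothetical source encodings; real operational systems will differ.
GENDER_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

CM_PER_INCH = 2.54


def integrate_record(record):
    """Return a copy of a source record in the warehouse's agreed conventions."""
    gender = GENDER_MAP.get(str(record["gender"]).strip().lower(), "UNKNOWN")
    height = float(record["height"])
    if record["height_unit"] == "in":
        height = height * CM_PER_INCH  # unify the unit of measurement
    return {
        # Prefix the key with the source name so that keys from
        # different systems cannot conflict.
        "customer_key": f"{record['source']}:{record['id']}",
        "gender": gender,
        "height_cm": round(height, 1),
    }


crm_row = {"source": "crm", "id": 17, "gender": "male",
           "height": 180, "height_unit": "cm"}
billing_row = {"source": "billing", "id": 17, "gender": 0,
               "height": 65, "height_unit": "in"}

integrated = [integrate_record(r) for r in (crm_row, billing_row)]
for row in integrated:
    print(row)
```

Both source rows use the id 17, yet they receive distinct warehouse keys, and both genders and heights are expressed in the same encoding and unit.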

Data Warehouse, non-volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only, and retained for future reporting. Data is loaded, but not updated. When subsequent changes occur, a new snapshot record is written.

Data Warehouse, time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time. Different environments have different time horizons: while a 60-to-90-day horizon is normal for operational systems, a data warehouse has a 5-to-10-year horizon.
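The non-volatile and time-variant properties can be sketched together as an append-only store of dated snapshots. The class `SnapshotStore` and its methods are hypothetical names chosen for illustration, not part of any warehouse product.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, city, as_of):
        # A change in the source produces a new snapshot row, not an overwrite.
        self._rows.append(
            {"customer_id": customer_id, "city": city, "as_of": as_of}
        )

    def state_as_of(self, customer_id, when):
        """Return the latest snapshot for a customer taken on or before `when`."""
        candidates = [
            r for r in self._rows
            if r["customer_id"] == customer_id and r["as_of"] <= when
        ]
        return max(candidates, key=lambda r: r["as_of"]) if candidates else None

    def history(self, customer_id):
        """Return all snapshots for a customer, oldest first."""
        return sorted(
            (r for r in self._rows if r["customer_id"] == customer_id),
            key=lambda r: r["as_of"],
        )


store = SnapshotStore()
store.load(1, "Jamshoro", date(2013, 1, 1))
store.load(1, "Karachi", date(2014, 1, 20))  # customer moved: new snapshot

print(store.state_as_of(1, date(2013, 6, 30))["city"])  # state in mid-2013
print(store.state_as_of(1, date(2014, 6, 30))["city"])  # state in mid-2014
print(len(store.history(1)))                            # both snapshots kept
```

Because old snapshots are retained, a report can be produced for any point inside the time horizon, which an operational system that overwrites the current state cannot do.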

Data Warehouse vs. Operational Database

Data Warehouse   | Operational Database
-----------------|-------------------------
Subject oriented | Application oriented
Integrated       | Multiple diverse sources
Non-volatile     | Updateable
Time-variant     | Real-time, current

OLTP (OnLine Transaction Processing): Also known as operational data processing, it represents day-to-day operational business activities: purchasing, sales, production distribution, and so on. It is typically used for data entry and retrieval transaction processing, and it reflects only the current state of the data.

OLAP (OnLine Analytical Processing): Represents front-end analytics based on a DW repository. It provides information for activities like resource planning, capital budgeting, and marketing initiatives. It is decision oriented.

OLTP vs. DW: Properties

Operational DB          | DW
------------------------|-------------------------
Mostly updates          | Mostly reads
Many small transactions | Queries long, complex
MB-TB of data           | GB-PB of data
Raw data                | Summarized data
Clerical users          | Decision makers
Up-to-date data         | May be slightly outdated

OLTP vs. DW

                   | OLTP                                                    | Data Warehouse
-------------------|---------------------------------------------------------|-------------------------------------------------------------------
users              | clerk, IT professional                                  | knowledge worker
function           | day-to-day operations                                   | decision support
DB design          | application-oriented                                    | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                              | ad hoc
access             | read/write, index/hash on primary key                   | lots of scans
unit of work       | short, simple transaction                               | complex query
# records accessed | tens                                                    | millions
# users            | thousands                                               | hundreds
DB size            | 100 MB-GB                                               | 100 GB-TB
metric             | transaction throughput                                  | query throughput, response

Applications of DW: A DW is the base repository for front-end analytics: OLAP, KDD, data visualization, and reporting. KDD (Knowledge Discovery in Databases) is a data mining process.

Lifecycle of DW: Classical SDLC vs. DW SDLC. The DW SDLC is almost the opposite of the classical SDLC.

Lifecycle of DW: Because it is the opposite of the SDLC, the DW SDLC is also called CLDS (SDLC spelled backwards).

Basic Architecture. [Figure: Data warehouse architecture]

Data Mart: A data mart is a special-purpose subset of enterprise data for a particular function or application. It may contain detail data, summary data, or both. Data mart types:
Independent: created directly from operational systems to a separate physical data store.
Logical: exists as a subset of an existing data warehouse.
Dependent: created from the data warehouse to a separate physical data store.

Phases

Data Modeling
Conceptual design: Transforms data requirements into a conceptual model. The conceptual model describes data entities, relationships, constraints, etc. at a high level. It does not contain any implementation details and is independent of the software and hardware used.
Logical design: Maps the conceptual data model to the logical data model used by the DBMS, e.g., the relational model or the dimensional model. The technology-independent conceptual model is adapted to the DBMS software used.
Physical design: Creates the internal structures needed to store and manage data efficiently: table spaces, indexes, access paths, and so on. It depends on the hardware and DBMS software used.

Data Modeling: DW Modeling
Conceptual modeling: Multidimensional Entity Relationship (ME/R) model, Multidimensional UML (mUML).
Logical modeling: cubes, dimensions, hierarchies.
Physical modeling: star, snowflake, array storage.

DW Modeling Components
Facts: a fact is a focus of interest for decision-making, e.g., sales, shipments.
Measures: attributes that describe facts from different points of view, e.g., each sale is measured by its revenue.
Dimensions: discrete attributes that determine the granularity adopted to represent facts, e.g., product, store, date.
Hierarchies: are made up of dimension attributes and determine how facts may be aggregated and selected, e.g., day → month → quarter → year.
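A small star schema makes these four components concrete. The sketch below uses Python's built-in `sqlite3` module; the table names (`fact_sales`, `dim_product`, `dim_store`, `dim_date`), the columns, and the rows are hypothetical and chosen to mirror the examples named above (sales as the fact, revenue as the measure, product/store/date as dimensions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: discrete attributes; dim_date carries the hierarchy
-- day -> month -> quarter -> year as columns.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, product_group TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT,
                          quarter TEXT, year INTEGER);
-- Fact: one row per sale; revenue is the measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "Computers"), (2, "CellPhone", "Phones")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(1, "Karachi"), (2, "Jamshoro")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?, ?)",
                 [(1, "2014-01-20", "2014-01", "2014-Q1", 2014),
                  (2, "2014-04-05", "2014-04", "2014-Q2", 2014)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1000.0), (2, 1, 1, 300.0),
                  (1, 2, 2, 1200.0), (1, 1, 2, 900.0)])

# Aggregate the measure along the date hierarchy at the quarter level.
revenue_by_quarter = conn.execute("""
    SELECT d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY d.quarter
    ORDER BY d.quarter
""").fetchall()
print(revenue_by_quarter)
```

Changing the `GROUP BY` column from `d.quarter` to `d.month` or `d.year` moves the same query down or up the hierarchy without touching the fact table.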

OLAP: A decision support system (DSS) that supports ad hoc querying, i.e., it enables managers and analysts to manipulate data interactively. It is the analysis of information in a database for the purpose of making management decisions. The idea is to allow users to manipulate and visualize the data easily and quickly through multidimensional views (i.e., different perspectives). OLAP (OnLine Analytical Processing) analyzes historical data (terabytes) using complex queries.

OLAP Council definition of OLAP: A category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.

OLAP: OLAP primarily involves aggregating large amounts of diverse data. OLAP functionality provides dynamic multidimensional analysis, supporting analytical and navigational activities. OLAP functionality is provided by the OLAP server. The OLAP Council defines an OLAP server as a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures.

OLTP vs. OLAP

OLTP                            | OLAP
--------------------------------|-------------------------------------
Operational processing          | Informational processing
Transaction-oriented            | Analysis-oriented
For operational staff           | For managers, executives, and analysts
Daily operations                | Decision support
Current, up-to-date data        | Historical data
Primitive, highly detailed data | Summarized, consolidated data
Detailed, flat relational views | Summarized, multidimensional views
Short, simple transactions      | Complex aggregate queries
Read/write                      | Mostly read only
Index on keys                   | Many scans
Many users                      | Small number of users
Large databases                 | Very large databases

OLTP vs. OLAP
On-Line Transaction Processing: "Transfer $100 from my savings account to my checking account."
On-Line Analytical Processing: "What is the average balance of accounts by customer group, account type, area, account manager, and their combinations?"
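The contrast between the two requests can be sketched with Python's built-in `sqlite3` module. The `accounts` table, its columns, and its rows are hypothetical; only two of the grouping attributes from the question above are modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id     INTEGER PRIMARY KEY,
        customer_group TEXT,
        account_type   TEXT,
        balance        REAL
    )
""")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?, ?)", [
    (1, "retail", "saving", 500.0),
    (2, "retail", "checking", 200.0),
    (3, "corporate", "saving", 10000.0),
    (4, "corporate", "checking", 4000.0),
    (5, "retail", "saving", 300.0),
])
conn.commit()


def transfer(connection, source_id, target_id, amount):
    """OLTP style: a short read/write transaction touching two rows by key."""
    with connection:  # commits on success, rolls back on an exception
        connection.execute(
            "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
            (amount, source_id))
        connection.execute(
            "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
            (amount, target_id))


transfer(conn, 1, 2, 100.0)

# OLAP style: a read-only aggregate that scans every row.
average_balances = conn.execute("""
    SELECT customer_group, account_type, AVG(balance)
    FROM accounts
    GROUP BY customer_group, account_type
    ORDER BY customer_group, account_type
""").fetchall()
for row in average_balances:
    print(row)
```

The transfer locates two rows through the primary key and modifies them, whereas the analytical query reads the whole table and returns summarized values only.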

DW Queries: DW queries are big queries that involve a large portion of the data. They are read-only queries, with no updates. Redundancy is a necessity: materialized views, special-purpose indexes, de-normalized schemas. Data is refreshed periodically, e.g., daily or weekly. The purpose of these queries is to analyze data: OLAP (OnLine Analytical Processing).

Typical OLAP operations: roll-up, drill-down, slice and dice, pivot (rotate). Other operations: aggregate functions, ranking and comparing, drill-across, drill-through.

Roll-up (drill-up): Taking the current aggregation level of fact values and doing a further aggregation. Data is summarized by climbing up a hierarchy (hierarchical roll-up), by dimensional reduction, or by a mix of these two techniques. It is used to obtain an increased generalization, e.g., from Time.Week to Time.Year.

Roll-up, hierarchical roll-ups: Performed on the fact table and some dimension tables by climbing up the attribute hierarchies, e.g., climbing the Time hierarchy to Quarter and the Article hierarchy to Prod. group.

Roll-up, dimensional roll-ups: Done solely on the fact table by dropping one or more dimensions, e.g., dropping the Client dimension.

Roll-up, climbing above the top in a hierarchical roll-up: In the extreme case, a hierarchical roll-up above the top level of an attribute hierarchy (attribute "ALL") can be viewed as a dimensional roll-up.
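The three kinds of roll-up described above can be sketched with one small aggregation helper. The fact rows and the function name `aggregate` are hypothetical; the hierarchy levels (week and year, article and product group) are stored as plain columns to keep the example short.

```python
from collections import defaultdict

# Hypothetical fact rows at the finest grain (week, article, client).
SALES = [
    {"week": "2014-W03", "year": 2014, "article": "Laptop",
     "product_group": "Computers", "client": "c1", "amount": 10},
    {"week": "2014-W03", "year": 2014, "article": "Laptop",
     "product_group": "Computers", "client": "c2", "amount": 4},
    {"week": "2014-W03", "year": 2014, "article": "CellP",
     "product_group": "Phones", "client": "c2", "amount": 5},
    {"week": "2014-W04", "year": 2014, "article": "Laptop",
     "product_group": "Computers", "client": "c1", "amount": 7},
    {"week": "2015-W01", "year": 2015, "article": "Laptop",
     "product_group": "Computers", "client": "c1", "amount": 3},
]


def aggregate(rows, keys):
    """Sum the amount measure for each combination of the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)


# Hierarchical roll-up: climb Time to year and Article to product group,
# keeping the Client dimension.
hierarchical = aggregate(SALES, ("year", "product_group", "client"))

# Dimensional roll-up: stay at the same levels but drop the Client dimension.
dimensional = aggregate(SALES, ("week", "article"))

# Rolling up above the top of every hierarchy ("ALL") leaves no key columns,
# which is the same as dropping all dimensions.
grand_total = aggregate(SALES, ())

print(hierarchical)
print(dimensional)
print(grand_total)
```

Drill-down is the reverse direction: it would go from `hierarchical` back to the rows in `SALES`, which is only possible because the finer-grained data is still available.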

Drill-down (roll-down): The reverse of roll-up. Drill-down is a de-aggregation operation that goes from a higher level of summary to a lower level of summary or to detailed data, possibly by introducing new dimensions. It requires the existence of materialized finer-grained data: one can't drill down without having the data.

Roll-up drill-down example

Slice: a subset of the multidimensional array corresponding to a single value for one or more dimensions, projected on the remaining dimensions. E.g., project on Geo (store) and Time from the values corresponding to Laptops in the product dimension: π_{StoreId, TimeId, Amount}(σ_{ArticleId = LaptopId}(Sales))

Slice: Amounts to an equality select condition, i.e., a WHERE clause in SQL. E.g., slice Laptops.

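Slice: Slicing means taking a slice out of a cube for a given set of selected dimension values, e.g., sales where city = Karachi and date = 20/1/2014. [Figure: a cube of products (p1, p2) by stores (s1, s2, s3) with one layer per day (day 1, day 2), and the slice TIME = day 1, whose cells hold 12 and 50 for p1 and 11 and 8 for p2; the day 2 layer holds 44 and 4 for p1.]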

Dice: amounts to a range select condition on one dimension, or to an equality select condition on more than one dimension. E.g., range select: π_{StoreId, TimeId, Amount}(σ_{ArticleId ∈ {Laptop, CellP}}(Sales))

Dice: E.g., equality select on two dimensions, Product and Time: π_{StoreId, Amount}(σ_{ArticleId = Laptop ∧ MonthID = December}(Sales))
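The slice and dice expressions above translate directly into a selection followed by a projection. The sketch below uses hypothetical rows and two hypothetical helpers, `select` (σ) and `project` (π); note that `project` keeps duplicate rows, as SQL does, whereas relational-algebra projection would remove them.

```python
# Hypothetical Sales rows; the attribute names are simplified.
SALES = [
    {"store": "s1", "month": "November", "article": "Laptop", "amount": 12},
    {"store": "s2", "month": "December", "article": "Laptop", "amount": 50},
    {"store": "s1", "month": "December", "article": "CellP", "amount": 11},
    {"store": "s3", "month": "December", "article": "Printer", "amount": 8},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns (duplicates are retained)."""
    return [{c: row[c] for c in columns} for row in rows]


# Slice: equality condition on one dimension (article = Laptop).
slice_laptop = project(
    select(SALES, lambda r: r["article"] == "Laptop"),
    ("store", "month", "amount"))

# Dice, range form: membership condition on one dimension.
dice_range = project(
    select(SALES, lambda r: r["article"] in {"Laptop", "CellP"}),
    ("store", "month", "amount"))

# Dice, equality form: equality conditions on two dimensions.
dice_equal = project(
    select(SALES, lambda r: r["article"] == "Laptop"
           and r["month"] == "December"),
    ("store", "amount"))

print(slice_laptop)
print(dice_range)
print(dice_equal)
```

In SQL each predicate would appear in the WHERE clause and each projection list in the SELECT clause.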

Pivot: A pivot is a two-dimensional layout of the summary data. The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure. [Figure: a data cube with the dimensions Region, Product (Juice, Cola, Milk, Cream), and Date (3/1, 3/2, 3/3, 3/4); the visible cell values are 10, 47, 30, and 12.]

Pivot (rotate): rearranging data for viewing purposes. The simplest view of pivoting is that it selects two dimensions over which to aggregate the measure. The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure. The x and y coordinate values are the values of the two selected dimensions. The result of pivoting is also called a cross tabulation.

Pivot: Consider pivoting sample data on City and Day. [Figure: the sample data and the result of pivoting it on City and Day]
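A pivot on City and Day can be sketched as a cross tabulation in plain Python. The rows below are hypothetical and are not the lecture's sample data; `pivot` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical (city, day, amount) rows.
ROWS = [
    ("Karachi", "Mon", 10),
    ("Karachi", "Tue", 20),
    ("Lahore", "Mon", 5),
    ("Karachi", "Mon", 7),
    ("Lahore", "Tue", 3),
]


def pivot(rows):
    """Aggregate the measure over (city, day) and lay it out as a grid."""
    cells = defaultdict(int)
    for city, day, amount in rows:
        cells[(city, day)] += amount
    cities = sorted({city for city, _, _ in rows})
    days = sorted({day for _, day, _ in rows})
    # One grid row per city, one grid column per day; empty cells become 0.
    grid = [[cells.get((city, day), 0) for day in days] for city in cities]
    return cities, days, grid


cities, days, grid = pivot(ROWS)

print("City".ljust(10) + "".join(day.rjust(6) for day in days))
for city, values in zip(cities, grid):
    print(city.ljust(10) + "".join(str(v).rjust(6) for v in values))
```

Swapping the roles of the two dimensions (days as rows, cities as columns) gives the rotated view of the same aggregated values.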

OLAP query languages: Getting from OLAP operations to the data works as in the relational model, through queries. In OLTP, SQL is the standard query language. However, OLAP operations are hard to express in SQL, and there is no standard query language for OLAP. The choices are:
SQL-99 for ROLAP: the Grouping Sets, Roll-up, and Cube operators.
MDX (Multidimensional Expressions) for both MOLAP and ROLAP: similar to SQL and used especially in MOLAP solutions; in ROLAP it is mapped to SQL.
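To show what the SQL-99 grouping operators compute, the following sketch emulates them in Python rather than running SQL. The helper names (`grouping_sets`, `cube_sets`, `rollup_sets`) and the rows are hypothetical; the sets of grouping columns they generate correspond to what GROUP BY CUBE and GROUP BY ROLLUP produce for the same column list.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows.
ROWS = [
    {"article": "Laptop", "city": "Karachi", "amount": 10},
    {"article": "Laptop", "city": "Lahore", "amount": 5},
    {"article": "CellP", "city": "Karachi", "amount": 3},
]


def cube_sets(dims):
    """CUBE: every subset of the dimensions, from all of them down to none."""
    return [group
            for size in range(len(dims), -1, -1)
            for group in combinations(dims, size)]


def rollup_sets(dims):
    """ROLLUP: only the prefixes of the dimension list, longest first."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]


def grouping_sets(rows, sets, measure):
    """Sum the measure once per grouping set, like GROUP BY GROUPING SETS."""
    result = {}
    for group in sets:
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[d] for d in group)] += row[measure]
        result[group] = dict(totals)
    return result


dims = ("article", "city")
cube_result = grouping_sets(ROWS, cube_sets(dims), "amount")
rollup_result = grouping_sets(ROWS, rollup_sets(dims), "amount")

for group, totals in cube_result.items():
    print(group, totals)
```

For two dimensions CUBE yields four grouping sets and ROLLUP three; the set grouped by `city` alone appears only in the CUBE result.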