CS377: Database Systems
Data Warehouse and Data Mining
Li Xiong
Department of Mathematics and Computer Science, Emory University


Evolution of Database Technology
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: Data mining, data warehousing, multimedia databases, and Web databases
- 2000s: Stream data management and mining; data mining with a variety of applications; Web technology and global information systems

Knowledge Discovery (KDD) Process
[Figure: Databases -> data cleaning and data integration -> Data Warehouse -> selection and transformation -> Task-relevant Data -> data mining -> pattern evaluation]

What is a Data Warehouse?
- A data warehouse is a database used for reporting and analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Key aspects
  - A decision support database that is maintained separately from the organization's operational database
  - Supports OLAP (vs. OLTP)
- Data warehousing: the process of constructing and using data warehouses

Data Warehouse Approach
[Figure: several sources feed the warehouse (with its metadata) through extract, transform and load (ETL); clients reach the warehouse through query and analysis tools]
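The ETL step in the figure can be sketched as follows. This is a minimal illustration, not the course's implementation: the two sources, their field names, and the `orders` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source A: CSV export with amounts in dollars.
SOURCE_A = "order_id,customer,amount_usd\n1,Alice,10.50\n2,Bob,20.00\n"
# Hypothetical source B: records with different field names and amounts in cents.
SOURCE_B = [{"id": 3, "cust_name": "carol", "amount_cents": 1575}]


def extract_a(text):
    """Extract: parse the CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform_a(row):
    """Transform: convert source A rows to the warehouse layout."""
    return (int(row["order_id"]), row["customer"].title(), float(row["amount_usd"]))


def transform_b(row):
    """Transform: rename fields and convert cents to dollars."""
    return (row["id"], row["cust_name"].title(), row["amount_cents"] / 100)


def load(conn, rows):
    """Load: append the unified rows to the warehouse table."""
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
load(warehouse, [transform_a(r) for r in extract_a(SOURCE_A)])
load(warehouse, [transform_b(r) for r in SOURCE_B])

# Clients then query the integrated data rather than the individual sources.
total = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)
```

The point of the sketch is that the sources disagree on names and units, and the transform step reconciles them before anything reaches the warehouse.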

Multidimensional View: From Tables to Data Cubes
- Data warehouse model
  - The multidimensional data model views data in the form of a data (hyper)cube
  - Each cell corresponds to an aggregated value, such as total sales
- Data warehouse operations
  - Drill down
  - Roll up
  - Slice and dice
[Figure: data cube with axes including Product and Time]
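A rough sketch of these operations over a tiny cube, assuming hypothetical sales facts with a product dimension and a time dimension that has two levels (quarter and month). Rolling up means aggregating over fewer or coarser dimensions, drilling down is the reverse, and a slice fixes one dimension to a single value.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, quarter, month, sales)
FACTS = [
    ("laptop", "Q1", "Jan", 10),
    ("laptop", "Q1", "Feb", 12),
    ("laptop", "Q2", "Apr", 8),
    ("phone", "Q1", "Jan", 20),
    ("phone", "Q2", "Apr", 25),
    ("phone", "Q2", "May", 5),
]
DIMS = {"product": 0, "quarter": 1, "month": 2}


def aggregate(facts, dims):
    """Sum sales per combination of the requested dimensions (one cube cell each)."""
    cube = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in dims)
        cube[key] += row[3]
    return dict(cube)


def slice_cube(facts, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    return [row for row in facts if row[DIMS[dim]] == value]


by_month = aggregate(FACTS, ["product", "quarter", "month"])  # most detailed view
by_quarter = aggregate(FACTS, ["product", "quarter"])         # roll up: month -> quarter
by_product = aggregate(FACTS, ["product"])                    # roll up: drop time entirely
q1_only = aggregate(slice_cube(FACTS, "quarter", "Q1"), ["product"])  # slice on Q1

print(by_quarter)
print(by_product)
print(q1_only)
```

Going from `by_quarter` back to `by_month` is the drill-down direction; it needs the detailed facts, which is why the warehouse keeps them.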

Orders database: relational schema
[Figure: relational schema of the orders database]

Data warehouse for the orders database: star schema
- Dimensional tables
- Fact tables: each fact corresponds to a data cube
[Figure: star schema with Product and Time dimensions]
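A star schema of this shape might look like the sketch below. The slide's actual schema is not reproduced in the transcript, so the table names, columns, and rows here are assumptions chosen for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the axes of the cube (hypothetical columns).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The fact table holds the measures, keyed by the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    quantity INTEGER,
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'laptop', 'computers'), (2, 'phone', 'mobile');
INSERT INTO dim_time VALUES (1, 2009, 'Q1'), (2, 2009, 'Q2');
INSERT INTO fact_sales VALUES
    (1, 1, 2, 2000.0), (1, 2, 1, 1000.0), (2, 1, 3, 1500.0), (2, 2, 4, 2000.0);
""")

# A typical OLAP-style query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.name, t.quarter, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time t ON t.time_id = f.time_id
    GROUP BY p.name, t.quarter
    ORDER BY p.name, t.quarter
""").fetchall()
print(rows)
```

Each result row is one cell of the Product x Time cube; grouping by fewer columns rolls the cube up.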

OLTP vs. OLAP

OLTP                                  OLAP
Mostly updates                        Mostly reads
Many small transactions               Queries long, complex
MB-TB of data                         GB-TB of data
Raw data                              Summarized, consolidated data
Clerical users                        Decision-makers, analysts
Up-to-date data                       Historical data
Consistency, recoverability critical

Data warehouse construction
- Semantic data integration: reconciling semantic heterogeneity of information sources
- Levels
  - Schema matching (schema mapping)
  - Data matching (data deduplication, record linkage, entity/object matching)

Schema Matching
- Techniques
  - Rule based
  - Learning based
- Type of matches
  - 1-1 matches vs. complex matches (e.g. list-price = price * (1 + tax_rate))
- Information used
  - Schema information: element names, data types, structures, number of sub-elements, integrity constraints
  - Data information: value distributions, frequency of words
  - External evidence: past matches, corpora of schemas
  - Ontologies
- Multi-matcher architecture
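A minimal sketch of a rule-based matcher for 1-1 matches that uses only schema information (element names and data types). The two schemas, the normalization rule, and the 0.6 threshold are assumptions made for the example; string similarity comes from Python's standard `difflib.SequenceMatcher`, not from any particular schema-matching system.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop separators so 'cust_name' and 'CustName' compare equal."""
    return name.lower().replace("_", "").replace("-", "")


def name_similarity(a, b):
    """Similarity of two element names in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """Map each source element to the most similar target element of the same type."""
    matches = {}
    for s_name, s_type in source.items():
        best, best_score = None, 0.0
        for t_name, t_type in target.items():
            if s_type != t_type:  # rule: data types must agree
                continue
            score = name_similarity(s_name, t_name)
            if score > best_score:
                best, best_score = t_name, score
        if best_score >= threshold:
            matches[s_name] = best
    return matches


# Hypothetical schemas: element name -> data type
source_schema = {"cust_name": "text", "unit_price": "number", "phone": "text"}
target_schema = {"customer_name": "text", "price": "number", "telephone": "text"}

matches = match_schemas(source_schema, target_schema)
print(matches)
```

A single matcher like this finds only name-level 1-1 matches. Complex matches such as the list-price formula need data information or learned models, which is one motivation for combining several matchers.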

Record Linkage
- Rule based
  - Pair-wise similarity comparison
  - Varying weights for different fields, e.g. name, SSN, birthdate
- Machine learning based
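The rule-based approach can be sketched as a weighted sum of per-field similarities. The weights, the 0.85 threshold, and the sample records below are all made up for illustration; a real system would tune them or learn them from labeled pairs.

```python
from difflib import SequenceMatcher

# Hypothetical weights: SSN is treated as the most reliable field.
WEIGHTS = {"name": 0.3, "ssn": 0.5, "birthdate": 0.2}


def field_similarity(a, b):
    """String similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted pair-wise similarity across the compared fields."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def is_match(r1, r2, threshold=0.85):
    """Rule: declare two records the same entity when similarity is high enough."""
    return record_similarity(r1, r2) >= threshold


# Made-up records; 'a' and 'b' differ only by a typo in the name.
a = {"name": "John Smith", "ssn": "123-45-6789", "birthdate": "1980-01-02"}
b = {"name": "Jon Smith", "ssn": "123-45-6789", "birthdate": "1980-01-02"}
c = {"name": "Jane Smythe", "ssn": "987-65-4321", "birthdate": "1975-11-30"}

print(is_match(a, b), is_match(a, c))
```

Because the SSN carries half of the weight, a small name typo does not break the link, while a different SSN makes a match very unlikely.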


What Is Data Mining?
- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Data mining really means knowledge mining
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.

Data Mining Functionalities
- Predictive: predict the value of a particular attribute based on the values of other attributes
  - Classification
  - Regression
- Descriptive: derive patterns that summarize the underlying relationships in data
  - Cluster analysis
  - Association analysis

Classification: Fruit Identification

Skin    Color  Size   Flesh  Conclusion
Hairy   Brown  Large  Hard   Safe
Hairy   Green  Large  Hard   Safe
Smooth  Red    Large  Soft   Dangerous
Hairy   Green  Large  Soft   Safe
Smooth  Red    Small  Hard   Dangerous

Classification and prediction
- Classification: construct models (functions) that describe and distinguish classes for future prediction
- Prediction/regression: predict unknown or missing numerical values
- Derived models can be represented as rules, mathematical formulas, etc.
- Classification: decision trees, Bayesian classification, neural networks, support vector machines, kNN
- Regression: linear and non-linear regression
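As a small illustration of the regression side, here is a least-squares fit of a line y = slope * x + intercept, with the model represented as a mathematical formula. The data points are invented and lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted formula to an unseen x value."""
    slope, intercept = model
    return slope * x + intercept


# Invented training data following y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)
print(model, predict(model, 5))
```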

Classification
[Figure: Training Data (the fruit identification records: Skin, Color, Size, Flesh, Conclusion) -> Training/Learning -> Classifier (Model); Unseen Data -> Classification -> Classified Data]
Rule: if skin = smooth and color = red, then dangerous
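The training and classification flow can be sketched with a deliberately simple learner. The code below uses a one-attribute rule learner (often called OneR), which is a stand-in chosen for brevity and not the method used on the slide: it picks the single attribute whose values best predict the class on the training data. The training rows are the fruit records; the unseen fruit is hypothetical.

```python
from collections import Counter, defaultdict

ATTRS = ("skin", "color", "size", "flesh")
TRAIN = [
    ("Hairy", "Brown", "Large", "Hard", "Safe"),
    ("Hairy", "Green", "Large", "Hard", "Safe"),
    ("Smooth", "Red", "Large", "Soft", "Dangerous"),
    ("Hairy", "Green", "Large", "Soft", "Safe"),
    ("Smooth", "Red", "Small", "Hard", "Dangerous"),
]


def learn_one_rule(rows):
    """Training: return (attribute, value -> class map, training errors)."""
    best = None
    for i, attr in enumerate(ATTRS):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[i]][row[-1]] += 1
        # For each attribute value, predict its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[i]] != row[-1])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


def classify(model, fruit):
    """Classification: apply the learned rule to an unseen record."""
    attr, rule, _ = model
    return rule.get(fruit[attr], "Unknown")


model = learn_one_rule(TRAIN)
unseen = {"skin": "Smooth", "color": "Red", "size": "Large", "flesh": "Hard"}
prediction = classify(model, unseen)
print(model[0], model[1], prediction)
```

On this data both skin and color separate the classes without error, which agrees with the slide's rule that smooth, red fruit is dangerous; the learner simply keeps the first attribute that reaches zero errors.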

Classification example

Frequent pattern mining and association analysis
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  - Frequent sequential pattern
  - Frequent structured pattern
- Applications
  - Basket data analysis (beer and diapers)
  - Web log (click stream) analysis
- Challenge: efficient algorithms to handle the exponential size of the search space
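One common way to tame the exponential search space is a level-wise, Apriori-style search: an itemset can be frequent only if all of its subsets are frequent, so candidates are built only from itemsets that already passed. The sketch below follows that idea on made-up baskets; the support threshold of 2 transactions is arbitrary.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets contained in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if every (k-1)-subset is frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Made-up market baskets
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Here the triple {beer, chips, diapers} is never counted because its subset {chips, diapers} is not frequent, which is the pruning that keeps the search manageable.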

Cluster and outlier analysis
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Unsupervised learning (vs. supervised learning)
  - Maximizing intra-class similarity and minimizing inter-class similarity
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Noise or exception?
  - Useful in fraud detection and rare-event analysis, e.g. an extremely large purchase
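Both ideas can be sketched in a few lines. The example below uses k-means on one-dimensional data for clustering and a z-score rule for outliers; these are illustrative choices, not the only options, and the house prices, purchase amounts, starting centers, and threshold of 2.0 are all invented.

```python
from statistics import mean, pstdev


def kmeans_1d(values, centers, iterations=10):
    """Basic k-means on numbers: assign to the nearest center, then recompute centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]


# Invented house prices (in thousands) with two obvious groups
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])

# Invented purchase amounts with one extremely large purchase
purchases = [20, 25, 22, 24, 21, 23, 500]
outliers = zscore_outliers(purchases)

print(centers, clusters)
print(outliers)
```

Whether the flagged purchase is noise or a meaningful exception is a judgment the method cannot make; it only points at the record.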

For more: CS570: Introduction to Data Mining