Data warehouse and Data Mining


Data warehouse and Data Mining
Lecture No. 14: Data Mining and its techniques
Naeem A. Mahoto
Email: naeemmahoto@gmail.com
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Decision support progress to Data Mining
System | Data | Decision support
Early file-based systems | Basic accounting data | No decision support
Database systems | Operational systems data | Primitive decision support
Data warehouse | Data for decision support | True decision support
OLAP systems | Data for multidimensional analysis | Complex analysis & calculations
Data mining applications | Selected and extracted data | Knowledge discovery

Data Mining
- A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
- Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind

Data Mining: A KDD Process
[Figure: the KDD process as a pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]


Steps of KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Data Mining and Business Intelligence
[Figure: layered pyramid with increasing potential to support business decisions from bottom to top]
Layers, top to bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA

OLAP versus Data Mining
OLAP
- In an OLAP analysis session, the analyst starts from some prior knowledge of what to look for
- OLAP helps the user analyze the past and gain insights
- The analyst drives the process while using OLAP tools
- Complex queries
Data Mining
- In data mining, the analyst has no prior knowledge of what the results are likely to be
- Data mining helps the user predict the future
- The analyst prepares the data and sits back while the tools drive the process
- No SQL queries

OLAP versus Data Mining
Features | OLAP | Data Mining
Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening
Data granularity | Summary data | Detailed transaction-level data
Number of business dimensions | Limited number of dimensions | Large number of dimensions
Number of dimension attributes | Small number of attributes | Many dimension attributes
Sizes of datasets for the dimensions | Not large for each dimension | Usually very large for each dimension
Analysis approach | User-driven interactive analysis | Data-driven automatic knowledge discovery
Analysis techniques | Multidimensional, drill-down, and slice & dice | Prepare data, launch mining tool & sit back
State of the technology | Mature & widely used | Still emerging

Data Mining Applications
Database analysis and decision support
- Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and management
Other applications
- Text mining (news groups, email, documents)
- Stream data mining
- Web mining
- DNA data analysis

Data Mining Techniques
Data mining covers a broad range of techniques, including:
- Classification
- Clustering
- Sequential pattern mining
- Association rule mining
- Many more
Each technique is implemented by specific algorithms.

Association Rule Mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can one automatically classify web documents?

Association Rule Mining
- Itemset X = {x1, ..., xk}
- Find all the rules X → Y that satisfy minimum support and minimum confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y
- Let min_support = 50%, min_conf = 50%:
  - A → C (50%, 66.7%)
  - C → A (50%, 100%)

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diapers", overlapping in "Customer buys both"]

Mining Association Rules: an Example
Min. support 50%, min. confidence 50%

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For example, for the rule A → C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 50% / 75% = 66.7%
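The support and confidence figures in this example can be checked with a short Python sketch. The function names (support, confidence, frequent_itemsets) are illustrative and not from the lecture, and the itemset search is a plain brute-force enumeration over the four transactions, not the Apriori algorithm.

```python
from itertools import combinations

# The four transactions from the slide, keyed by transaction id.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force search: test every non-empty combination of items."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(transactions, 0.5).items():
        print(sorted(itemset), f"{s:.0%}")
    print("A -> C confidence:", f"{confidence({'A'}, {'C'}, transactions):.1%}")
    print("C -> A confidence:", f"{confidence({'C'}, {'A'}, transactions):.1%}")
```

Brute force is only workable here because there are six distinct items; real association rule miners prune the candidate space instead of enumerating it.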

Classification and Prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
- Presentation: decision tree, classification rule, neural network
- Prediction: predict some unknown or missing numerical values

Classification Process: Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model):
IF rank = professor OR years > 6 THEN tenured = yes

Classification Process: Use the Model in Prediction
The classifier is evaluated on testing data and then applied to unseen data.

Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4). Tenured?
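As a small illustration, the rule learned in the model construction step can be applied to the testing data and to the unseen record in Python. The names predict_tenured and accuracy are hypothetical helpers introduced for this sketch; the rule and the data rows are taken from the slides.

```python
def predict_tenured(rank, years):
    """The slide's rule: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, actual tenured label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def accuracy(rows):
    """Share of rows whose predicted label matches the actual label."""
    correct = sum(
        1 for _, rank, years, label in rows
        if predict_tenured(rank, years) == label
    )
    return correct / len(rows)


if __name__ == "__main__":
    for name, rank, years, label in test_data:
        print(name, "predicted:", predict_tenured(rank, years), "actual:", label)
    print("accuracy:", accuracy(test_data))
    print("Jeff:", predict_tenured("Professor", 4))
```

On this test set the rule misclassifies Merlisa (7 years, but not tenured), so it is right on three of four records, and it predicts tenured = yes for Jeff.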

Decision Trees
Training set:
age | income | student | credit_rating
<=30 | high | no | fair
<=30 | high | no | excellent
31..40 | high | no | fair
>40 | medium | no | fair
>40 | low | yes | fair
>40 | low | yes | excellent
31..40 | low | yes | excellent
<=30 | medium | no | fair
<=30 | low | yes | fair
>40 | medium | yes | fair
<=30 | medium | yes | excellent
31..40 | medium | no | excellent
31..40 | high | yes | fair
>40 | medium | no | excellent

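Decision Trees
[Figure: decision tree. Root node: age? with branches <=30, 30..40, and >40. The <=30 branch leads to the node student? (branches no and yes). The 30..40 branch leads directly to the leaf yes. The >40 branch leads to the node credit rating? (branches fair and excellent). All leaves are labeled yes or no.]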

Cluster and outlier analysis
Cluster Analysis
- Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
- Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but it is quite useful in fraud detection and rare events analysis
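The clustering principle and the notion of an outlier can be made concrete with a toy Python sketch. Everything here is an assumption made for illustration: the one-dimensional data, the two cluster assignments, the helper names, and the distance threshold are invented, and distance to a cluster centroid stands in for (dis)similarity.

```python
from statistics import mean

# Hypothetical 1-D data already grouped into two clusters.
clusters = {
    "c1": [1.0, 1.2, 0.8],
    "c2": [5.0, 5.1, 4.9],
}


def centroid(points):
    """Centre of a cluster: the mean of its points."""
    return mean(points)


def intra_cluster_spread(clusters):
    """Mean distance of each point to its own centroid (small = high intraclass similarity)."""
    return mean(
        abs(p - centroid(points))
        for points in clusters.values()
        for p in points
    )


def inter_cluster_gap(clusters):
    """Smallest distance between two centroids (large = low interclass similarity)."""
    centres = [centroid(points) for points in clusters.values()]
    return min(
        abs(a - b)
        for i, a in enumerate(centres)
        for b in centres[i + 1:]
    )


def is_outlier(point, clusters, threshold):
    """Flag a point that is farther than threshold from every centroid."""
    nearest = min(abs(point - centroid(points)) for points in clusters.values())
    return nearest > threshold


if __name__ == "__main__":
    print("intra-cluster spread:", intra_cluster_spread(clusters))
    print("inter-cluster gap:", inter_cluster_gap(clusters))
    print("3.0 is outlier:", is_outlier(3.0, clusters, threshold=1.0))
```

A good clustering in this sense has a small spread inside clusters relative to the gap between them; a point such as 3.0, far from both centroids, does not comply with either group.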

Clusters and Outliers
[Figure: scatter plot with regions labeled "Clusters" and isolated points labeled "Outliers"]

Sequential Pattern Mining
- Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns in a sequence database
- A sequence database stores a number of records, where all records are sequences of ordered events, with or without concrete notions of time
- Sequential patterns are used for targeted marketing and customer retention

Terminology for Sequence Mining
- Itemset: a non-empty set of items
- Sequence: an ordered list of itemsets
- Customer sequence: the list of a customer's transactions ordered by increasing transaction time
- A customer supports a sequence if the sequence is contained in the customer sequence
- Support for a sequence: the fraction of total customers that support the sequence
- Maximal sequence: a sequence that is not contained in any other sequence
- Closed sequence: a sequence that is not contained in any other sequence with the same support

Example: Sequence
A sequence database:
SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>

- A sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
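The two claims in this example can be verified with a minimal Python sketch. The representation (each sequence as a list of sets, one set per element) and the names is_subsequence and support_count are assumptions made for illustration; the containment test greedily matches each pattern element against the earliest possible element of the sequence.

```python
# The sequence database from the slide; each element is a set of items.
sequence_db = {
    10: [set("a"), set("abc"), set("ac"), set("d"), set("cf")],
    20: [set("ad"), set("c"), set("bc"), set("ae")],
    30: [set("ef"), set("ab"), set("df"), set("c"), set("b")],
    40: [set("e"), set("g"), set("af"), set("c"), set("b"), set("c")],
}


def is_subsequence(pattern, sequence):
    """True if the pattern's elements are subsets of distinct elements of
    sequence, in the same order (greedy earliest match)."""
    pos = 0
    for element in pattern:
        # Advance to the first remaining element that contains this itemset.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # The next pattern element must match a later element.
    return True


def support_count(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(1 for sequence in db.values() if is_subsequence(pattern, sequence))


if __name__ == "__main__":
    pattern_1 = [set("a"), set("bc"), set("d"), set("f")]  # <a(bc)df>
    pattern_2 = [set("ab"), set("c")]                      # <(ab)c>
    print("<a(bc)df> in SID 10:", is_subsequence(pattern_1, sequence_db[10]))
    print("support of <(ab)c>:", support_count(pattern_2, sequence_db))
```

Here <(ab)c> is contained in the sequences with SID 10 and SID 30 only, which gives a support count of 2 and meets min_sup = 2.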

Terms
- Data scrubbing: a process to upgrade the quality of data before it is moved into a data warehouse
- Transient data: data in which changes to existing records cause the previous version of the records to be eliminated