Data Mining and Data Warehousing


CSCI6405 Fall 2003: Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: qggao@cs.dal.ca
Teaching Assistant: Christopher Jordan, Email: cjordan@cs.dal.ca
Office Hours: TR, 1:30-3:00 PM
9 October 2003

Lectures Outline
Part I: Overview on DM and DW
1. Introduction (ch1) (Ass1 due: Tue, Sep 23)
2. Data preprocessing (ch3)
Part II: DW and OLAP
3. Data warehousing and OLAP (ch2) (Ass2: Sep 23 to Oct 14)
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4)
5. Classification data mining (ch7) (Ass3: Oct 7 to Oct 21)
6. Association data mining (ch6) (Ass4: Oct 21 to Nov 5)
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (ch9)
10. Mining spatial data (ch9)
Project Presentations
Project due: Dec 8

3. DATA PREPROCESSING (Ch3)
- Data Preprocessing (DPP) Concept
- Major Tasks of DPP
- A DPP Case Study
- Summary

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data.
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility


Why Data Preprocessing?
- Raw data have errors and inconsistencies (data cleaning).
- Data need to be integrated from different sources, and a uniform format is needed (data integration and transformation).
- Irrelevant data should be removed (data reduction).
- Domain knowledge should be added into the prepared data (discretization and concept hierarchy generation).

Major Tasks of DPP

Major Tasks of DPP (cont)
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results.
- Data discretization: part of data reduction but of particular importance, especially for numerical data.

Why data cleaning?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = ""
- noisy: containing errors or outliers, e.g., Salary = -1
- inconsistent: containing discrepancies in codes or names, e.g., Age = 42 with Birthday = 3/7/1997; e.g., was rating 1, 2, 3, now rating A, B, C
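The three kinds of dirty data above can each be detected with a simple rule. The sketch below is only an illustration: the field names, the record layout, and the reading of 3/7/1997 as 7 March 1997 are assumptions, not part of the slides.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: a required attribute is missing or empty.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is empty")
    # Noisy: a value outside its plausible domain.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied = today.year - birthday.year - (0 if had_birthday else 1)
        if implied != age:
            problems.append("inconsistent: age does not match birthday")
    return problems


# Hypothetical record combining the three examples from the slide.
record = {"occupation": "", "salary": -1, "age": 42, "birthday": date(1997, 3, 7)}
for problem in find_problems(record, today=date(2003, 10, 9)):
    print(problem)
```

Rules like these only flag suspect records; whether to correct, keep, or drop them remains a separate decision.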

Why is data dirty?
- Incomplete data comes from:
  - n/a data value when collected
  - different consideration between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data:
  - collection
  - entry
  - transmission
- Inconsistent data comes from:
  - different data sources
  - functional dependency violation

E.g. Data normalization for clustering mining
For clustering mining of a customer database: DB (Age, Income, Credit)
The distance between two data points C1 and C2 over the attributes a1 (Age), a2 (Income), a3 (Credit):
d = ((C1_a1 - C2_a1)^2 + (C1_a2 - C2_a2)^2 + (C1_a3 - C2_a3)^2)^(1/2)

             Age   Income   Credit
Customer1:   32    4,       1,
Customer2:   24    3,       2,
Difference:  8     1,       8,
Normalized:  1     1/1      1/1
Rescaled:    8     1        8

If we scale all the attributes to the same order of magnitude, we obtain a reliable distance measure between the different records.
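The effect of rescaling on the distance can be checked numerically. In the sketch below the customer values and the divisors are illustrative assumptions (the figures on the slide are not fully legible); only the Euclidean distance itself comes from the slide.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two records given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rescale(point, divisors):
    """Divide each attribute by its divisor to bring all attributes to one scale."""
    return tuple(v / d for v, d in zip(point, divisors))


# Hypothetical customers: (age, income, credit).
c1 = (32, 40000, 10000)
c2 = (24, 30000, 2000)
divisors = (1, 1000, 1000)  # assumed scaling: leave age, express money in thousands

raw = euclidean(c1, c2)
scaled = euclidean(rescale(c1, divisors), rescale(c2, divisors))
print(f"raw distance:    {raw:.2f}")     # dominated by income and credit
print(f"scaled distance: {scaled:.2f}")  # all three attributes contribute
```

Without rescaling, the age difference has almost no influence on the distance, so clusters would be driven by the monetary attributes alone.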

A DPP Case Study
Business background: The publisher sells five types of magazine: on cars, houses, sports, music, and comics. The aim of the data mining is to find new, interesting clusters of clients in order to set up a marketing exercise. The business is interested in questions such as "What is the typical profile of a reader of a car magazine?" and "Is there any correlation between an interest in cars and an interest in comics?"
Data mining task:
- Mining clusters of clients for a magazine publisher database.
- Data preparation for clustering: cleaned, integrated, normalized, numerical-valued data, etc.

1. Data Selection
- The database should contain the records of subscription data of the magazines. It should be a selection of operational data from the publisher's invoicing system and contain information about people who have subscribed to a magazine.
- The records consist of: client number, name, address, date of subscription, and type of magazine.
- In order to facilitate the DM process, a copy of this operational data is drawn and stored in a separate database (refer to Table 1).

Table 1. Original data (values listed by column)
Client number: 239, 2313, 2319
Name: Clinton, King, Jonson
Address: 2 Boulevard, 3 High Road
Date purchase: 4-15-94, 6-21-93, 5-3-92, 1-1-1, 2-3-95, 1-1-1
Magazine purchased: car, music, sports, house

2. Data Cleaning: remove duplications
Duplication of records: in an operational client database, some clients may be represented by several records. Possible causes include:
- negligence, such as people making typing errors
- clients moving from one place to another without notifying the company of the change of address
- cases in which people deliberately spell their names incorrectly or give incorrect information about themselves to avoid a negative decision
(Refer to Table 2)

Table 2. De-duplication (values listed by column)
Client number: 239, 2313
Name: Clinton, King
Address: 2 Boulevard, 3 High Road
Date purchase: 4-15-94, 6-21-93, 5-3-92, 1-1-1, 2-3-95, 1-1-1
Magazine purchased: car, music, sports, house

De-duplication
- The duplicated records may be identified by a pattern recognition algorithm and then corrected.
- E.g., the records of Mr. and Mr. Jonson in the database: they have different client numbers but the same address, which is a strong indication that they are the same person.
- This type of pollution gives a company the impression that it has more clients than is in fact the case.
- Of course, we can never be sure of this, but a de-duplication algorithm using pattern analysis techniques could identify the situation and present it to a user to make a decision.
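One simple way to surface such cases is to flag pairs of records that share an address and have similar names, then leave the decision to a user. The sketch below uses the standard-library difflib string similarity as a stand-in for the pattern recognition algorithm; the records, field names, and the 0.75 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_duplicates(clients, threshold=0.75):
    """Return pairs of client ids that may refer to the same person.

    A pair is flagged when the ids differ, the addresses match, and the
    names are similar. The result is a list for human review, not a verdict.
    """
    pairs = []
    for a, b in combinations(clients, 2):
        same_address = a["address"].strip().lower() == b["address"].strip().lower()
        similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if a["id"] != b["id"] and same_address and similarity >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical client records.
clients = [
    {"id": "1001", "name": "Smith", "address": "5 Main Street"},
    {"id": "1002", "name": "Smyth", "address": "5 Main Street"},
    {"id": "1003", "name": "Lee", "address": "9 Oak Road"},
]
print(candidate_duplicates(clients))
```

Comparing every pair is quadratic in the number of clients; for a large database one would first group records by address and compare only within groups.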

2. Data Cleaning: correct domain inconsistency
- Domain inconsistency: pollution caused by wrong domain values that are not consistent with the definitions.
- E.g., in the example table, the date 1-1-1 means 1 January 1901 (the company did not even exist at that time).
- In some databases, analysis shows an unexpectedly high number of people born on 11 November: when people were forced to fill in a birth date on a screen and either did not know it or did not want to divulge it, they were inclined to type in '11-11-11'.
- This kind of untrue value can be disastrous in a data mining context. If information is unknown, it should be represented as such in the database.
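A domain check of this kind can be sketched as a small parsing function that maps placeholder or impossible dates to an explicit "unknown" value. The month-day-two-digit-year format, the 19yy century, the placeholder list, and the cutoff year are all assumptions made for illustration.

```python
from datetime import date


def clean_date(text, earliest_valid_year, placeholders=("1-1-1", "11-11-11")):
    """Parse a 'month-day-yy' string; return None when the value is not credible.

    None stands for "unknown" and is returned for known placeholder strings,
    for dates that do not exist, and for dates before earliest_valid_year.
    """
    if text in placeholders:
        return None
    try:
        month, day, yy = (int(part) for part in text.split("-"))
        parsed = date(1900 + yy, month, day)  # assumes all years are 19yy
    except ValueError:
        return None  # malformed string or non-existent date such as 30 February
    if parsed.year < earliest_valid_year:
        return None
    return parsed


# earliest_valid_year=1950 is a hypothetical founding year for the company.
for raw in ("4-15-94", "1-1-1", "2-30-95"):
    print(raw, "->", clean_date(raw, earliest_valid_year=1950))
```

Returning None rather than a default date keeps unknown values visible to later steps instead of hiding them as plausible-looking data.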

Table 3. Domain consistency (values listed by column)
Client number: 239, 2313
Name: Clinton, King
Address: 2 Boulevard, 3 High Road
Date purchase: 4-15-94, 6-21-93, 5-3-92, 2-3-95, 12-2-94
Magazine purchased: car, music, sports, house

3. Data Integration (Enrichment)
- Suppose that we have purchased extra information about our clients consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house (refer to Table 4).
* You therefore have to make a deliberate decision either to overlook it or to delete it. A general rule states that any deletion of data must be a conscious decision, after a thorough analysis of the possible consequences.
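Enrichment amounts to joining the purchased attributes onto the existing client records. The sketch below assumes the two sources are matched on the client name and that clients without extra data keep explicit unknown values; the records and field names are hypothetical.

```python
def enrich(subscriptions, extra_by_name, extra_fields):
    """Left-join extra client attributes onto subscription rows by name.

    Every subscription row is kept. Fields with no purchased data are set
    to None so that "unknown" stays explicit in the enriched table.
    """
    enriched = []
    for row in subscriptions:
        extra = extra_by_name.get(row["name"], {})
        merged = dict(row)
        for field in extra_fields:
            merged[field] = extra.get(field)
        enriched.append(merged)
    return enriched


# Hypothetical data for illustration.
subscriptions = [
    {"client": 1, "name": "Clinton", "magazine": "car"},
    {"client": 2, "name": "King", "magazine": "sports"},
]
extra_by_name = {"Clinton": {"income": 36000, "car_owner": "yes"}}

enriched = enrich(subscriptions, extra_by_name, ("income", "car_owner"))
for row in enriched:
    print(row)
```

Matching on names is fragile in practice (see the de-duplication step), so a shared client key would be preferable when one exists.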

Table 4. Additional data available for enrichment (values listed by column)
Client name: Clinton
Date of birth: 1-2-71
Income: $36,
Credit: $26,6
Car owner / House owner: yes

Table 5. Enriched table (values listed by column)
Credit number: 239, 2313
Name: Clinton, King
Date of birth: 1-2-11
Income: $36,
Credit: $26.6
Car owner / House owner: yes
Address: 2 Boulevard
Date purchase made: 4-15-94, 6-21-93, 5-3-92, 2-3-95, 12-2-94
Magazine purchased: car, music, sports, house

4. Data Reduction
- Remove the columns and rows which are not valuable to the DM process. In Table 6, the column NAME and the row with multiple missing values are removed from the database.
- In a real DM project, most of the tables may be collected from the operational data, a lot of desirable data is missing, and most is possible to retrieve.
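Removing columns and rows can be sketched as a single filtering pass. The rule used here, dropping a row once it has more than a chosen number of missing fields, is an assumption about what makes a row "not valuable"; the rows and field names are hypothetical.

```python
def reduce_table(rows, drop_columns, max_missing):
    """Drop the given columns, then drop rows with too many missing values.

    A value of None counts as missing. Rows with at most max_missing
    missing fields are kept.
    """
    reduced = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in drop_columns}
        missing = sum(1 for value in kept.values() if value is None)
        if missing <= max_missing:
            reduced.append(kept)
    return reduced


# Hypothetical enriched rows.
rows = [
    {"client": 1, "name": "A", "income": 36.0, "credit": 26.6},
    {"client": 2, "name": "B", "income": None, "credit": None},
]
reduced = reduce_table(rows, drop_columns={"name"}, max_missing=1)
print(reduced)
```

Because missing information can itself be a useful signal, the threshold should be chosen deliberately rather than applied by default.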

Table 6. Table with column and row removed (values listed by column)
Credit number: 239
Date of birth: 1-2-11
Income: $36,
Credit: $26.6
Car owner / House owner: yes
Address: 2 Boulevard
Date purchase made: 4-15-94, 6-21-93, 5-3-92, 12-2-94
Magazine purchased: car, music, house

4. Data Reduction (cont)
- In some cases, especially fraud detection, lack of information can be a valuable indication of interesting patterns.
- Up to this point, the process has consisted mainly of simple SQL operations.

5. Data transformation
For most databases, the information provided is much too detailed to be used as input to data mining algorithms, as in Table 6.
Apply the following coding steps:
1. Address to region
2. Birth date to age
3. Divide income by 1000
4. Divide credit by 1000
5. Convert car ownership yes-no to 1-0
6. Convert purchase date to month numbers starting from 1990
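The six coding steps can be sketched as one function applied to each record. The field names, the month-day-two-digit-year date format, the 19yy century, the region lookup table, and the sample record are assumptions; January 1990 is counted as month 1, which reproduces the value 52 for a purchase in April 1994.

```python
from datetime import date


def encode(record, today, region_codes):
    """Apply the coding steps to one record and return the coded values."""
    month, day, yy = (int(p) for p in record["date_of_birth"].split("-"))
    born = date(1900 + yy, month, day)
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    purchase_month, _, purchase_yy = (int(p) for p in record["purchase_date"].split("-"))
    return {
        "region": region_codes[record["address"]],                   # 1. address to region
        "age": today.year - born.year - (0 if had_birthday else 1),  # 2. birth date to age
        "income": record["income"] / 1000,                           # 3. income in thousands
        "credit": record["credit"] / 1000,                           # 4. credit in thousands
        "car_owner": 1 if record["car_owner"] == "yes" else 0,       # 5. yes-no to 1-0
        "month": (1900 + purchase_yy - 1990) * 12 + purchase_month,  # 6. months since 1990
    }


# Hypothetical record and region table.
record = {
    "address": "2 Boulevard",
    "date_of_birth": "1-20-71",
    "income": 36000,
    "credit": 26600,
    "car_owner": "yes",
    "purchase_date": "4-15-94",
}
coded = encode(record, today=date(1996, 1, 1), region_codes={"2 Boulevard": 1})
print(coded)
```

After coding, every attribute is a small number, which is what distance-based clustering needs.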

Table 7. An intermediate coding stage (values listed by column)
Credit number: 239
Age: 2, 2, 2, 25, 2
Income: 18.5, 18.5, 18.5, 36., 18.5
Credit: 17.8, 17.8, 17.8, 26.6, 17.8
Car owner / House owner / Region: 1, 1, 1, 1, 1, 1
Month of purchase: 52, 42, 29, 48
Magazine purchased: car, music, house

Table 8. The final table (values listed by column)
Credit number: 239
Age: 2, 25
Income: 18.5, 36.
Credit: 17.8, 26.6
Car owner / House owner / Region / Car magazine / House / Sports / Music / Comic: 1, 1, 1, 1, 1, 1, 1, 1
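The step from the intermediate table (one row per purchase) to the final table (one row per client with a 0/1 column per magazine type) can be sketched as a flattening pass. The rows and field names are hypothetical; the five magazine types are those listed in the case study.

```python
MAGAZINES = ("car", "house", "sports", "music", "comic")


def flatten(rows):
    """Collapse purchase rows into one row per client with 0/1 magazine flags."""
    clients = {}
    for row in rows:
        # Client attributes are copied from the first purchase row seen;
        # the per-purchase fields (magazine, month) are left out.
        flat = clients.setdefault(
            row["client"],
            {
                **{k: v for k, v in row.items() if k not in ("magazine", "month")},
                **{magazine: 0 for magazine in MAGAZINES},
            },
        )
        flat[row["magazine"]] = 1
    return list(clients.values())


# Hypothetical coded purchase rows.
rows = [
    {"client": 1, "age": 25, "month": 52, "magazine": "car"},
    {"client": 1, "age": 25, "month": 42, "magazine": "music"},
    {"client": 2, "age": 40, "month": 29, "magazine": "house"},
]
final = flatten(rows)
for row in final:
    print(row)
```

Each client is now a single numeric vector, the form a clustering algorithm expects as input.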


Summary
- Data preparation is a big issue and the most time-consuming process for both mining and warehousing.
- Data preparation includes data cleaning, integration, transformation, reduction, discretization, etc.
- Many DPP tools have been developed, but it is still an active research area because of the effort needed for