Data Mining. 2.4 Data Integration. Fall 2008. Instructor: Dr. Masoud Yaghini.


Data Integration
Data integration combines data from multiple databases into a coherent store. It may involve denormalizing tables (often done to improve performance by avoiding joins). Integrating data from multiple sources may produce redundancies and inconsistencies in the resulting data set.
Tasks of data integration:
- Detecting and resolving data value and schema conflicts
- Handling redundancy in data integration

Outline
- Detecting and Resolving Data Value and Schema Conflicts
- Handling Redundancy in Data Integration
- References

Detecting and Resolving Data Value and Schema Conflicts

Schema Integration
Integrate metadata from different sources. The same attribute or object may have different names in different databases, e.g., customer_id in one database and cust_number in another. The metadata include the name, meaning, data type, and range of values permitted for the attribute.
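As a minimal sketch of this idea (using pandas; the table contents are invented, and only customer_id and cust_number come from the slide), schema integration can begin by mapping source-specific attribute names onto one canonical schema:

```python
import pandas as pd

# Two sources store the same attribute under different names
# (customer_id vs. cust_number, as in the slide's example).
source_a = pd.DataFrame({"customer_id": [101, 102], "name": ["Ann", "Bob"]})
source_b = pd.DataFrame({"cust_number": [103, 104], "name": ["Eve", "Dan"]})

# Rename each source's attributes onto one canonical schema,
# then combine the records into a single coherent store.
source_b = source_b.rename(columns={"cust_number": "customer_id"})
integrated = pd.concat([source_a, source_b], ignore_index=True)
print(integrated)
```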

Detecting and Resolving Data Value Conflicts
For the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. Examples:
- The data codes for pay_type in one database may be H and S, and 1 and 2 in another.
- A weight attribute may be stored in metric units in one system and British imperial units in another.
- For a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services (such as free breakfast) and taxes.
A sketch of resolving such conflicts follows.
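A hedged sketch of resolving the first two conflicts above; the mapping of numeric to letter codes and the pound-to-kilogram conversion are illustrative assumptions, not from the slides:

```python
# Resolve representation and encoding conflicts before merging sources.
# Assumption: "H"/"1" mean hourly and "S"/"2" mean salaried; the slides
# do not say which numeric code corresponds to which letter code.
PAY_TYPE_MAP = {"H": "hourly", "S": "salaried", "1": "hourly", "2": "salaried"}

def normalize_pay_type(code):
    # Map both encodings onto one shared vocabulary.
    return PAY_TYPE_MAP[str(code)]

def pounds_to_kg(weight_lb):
    # Convert British imperial weights to metric so all sources agree.
    return weight_lb * 0.45359237

print(normalize_pay_type("H"), normalize_pay_type(2))  # hourly salaried
print(round(pounds_to_kg(150), 2))                     # 68.04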

Detecting and Resolving Data Value Conflicts
This step also relates to data cleaning, as described earlier.

Handling Redundancy in Data Integration

Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated. One attribute may be derivable from another attribute or table, e.g., age = 19 can be derived from birth_year = 1990. Redundant attributes may be detected by correlation analysis.

Correlation Analysis (Numerical Data)
The correlation coefficient (also called Pearson's product-moment coefficient) is

\chi-free form: r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}

where N is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.
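A minimal sketch of this computation in Python, applying the formula directly; the sample arrays are invented for illustration, and NumPy's built-in corrcoef serves as a cross-check:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson product-moment coefficient r_{A,B} from the formula above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    # Sum of cross-products minus N * mean(A) * mean(B),
    # divided by N * sigma_A * sigma_B.
    # np.std defaults to the population standard deviation (ddof=0),
    # matching the sigma in the slide's formula.
    return (np.sum(a * b) - n * a.mean() * b.mean()) / (n * a.std() * b.std())

a = [2.0, 4.0, 6.0, 8.0]
b = [1.0, 3.0, 5.0, 9.0]
print(pearson_r(a, b))          # about 0.98: strongly positively correlated
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in agrees
```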

Correlation Analysis (Numerical Data)
- If r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
- If r_{A,B} = 0: A and B are independent; there is no linear correlation between them.
- If r_{A,B} < 0: A and B are negatively correlated.
Correlation does not imply causality. For example, the number of hospitals and the number of car thefts in a city are correlated; both are causally linked to a third variable: population.

Correlation Analysis (Categorical Data)
A correlation relationship between two categorical (discrete) attributes, A and B, can be discovered by a \chi^2 (chi-square) test.

Correlation Analysis (Categorical Data)
Suppose A has c distinct values, namely a_1, a_2, ..., a_c, and B has r distinct values, namely b_1, b_2, ..., b_r. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (A_i, B_j) denote the joint event that attribute A takes on value a_i and attribute B takes on value b_j, that is, (A = a_i, B = b_j). Each possible (A_i, B_j) joint event has its own cell (or slot) in the table.
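As a quick sketch, such a contingency table can be built with pandas.crosstab; the handful of sample tuples below are invented for illustration:

```python
import pandas as pd

# A few invented tuples described by two categorical attributes A and B.
data = pd.DataFrame({
    "A": ["a1", "a1", "a2", "a2", "a2", "a1"],
    "B": ["b1", "b2", "b1", "b1", "b2", "b1"],
})

# Rows are the r values of B, columns are the c values of A;
# each cell holds the observed count o_ij of the joint event (A_i, B_j).
table = pd.crosstab(data["B"], data["A"], margins=True, margins_name="Sum")
print(table)
```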

Correlation Analysis (Categorical Data)
The \chi^2 value (also known as the Pearson \chi^2 statistic) is computed as

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where o_{ij} is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_{ij} is the expected frequency of (A_i, B_j), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

Correlation Analysis (Categorical Data)
Here N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The larger the \chi^2 value, the more likely the variables are related. The cells that contribute the most to the \chi^2 value are those whose actual count is very different from the expected count.
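A minimal NumPy sketch of these two formulas, computing the expected frequencies from row and column totals and then the statistic; the observed counts reuse the worked example that follows:

```python
import numpy as np

def chi_square(observed):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    o = np.asarray(observed, dtype=float)
    n = o.sum()  # N, the total number of tuples
    # e_ij = (row total for b_i) * (column total for a_j) / N
    expected = np.outer(o.sum(axis=1), o.sum(axis=0)) / n
    return ((o - expected) ** 2 / expected).sum()

# Observed counts from the contingency table in the example below.
print(round(chi_square([[250, 200], [50, 1000]]), 2))
# 507.94 (the slide sums rounded terms to get 507.93)
```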

Chi-Square Calculation: An Example
Suppose that a group of 1,500 people was surveyed. The observed frequency (or count) of each possible joint event is summarized in the following contingency table, where the numbers in parentheses are the expected frequencies (calculated based on the data distribution for both attributes, using the equation for e_ij):

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Are like_science_fiction and play_chess correlated?

Chi-Square Calculation: An Example
For example, the expected frequency for the cell (play_chess, like_science_fiction) is

e_{11} = \frac{count(play\_chess) \times count(like\_science\_fiction)}{N} = \frac{300 \times 450}{1500} = 90

Notice that the sum of the expected frequencies in any row must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column.

Chi-Square Calculation: An Example
We can compute \chi^2 as

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93

For this 2 x 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the \chi^2 value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the \chi^2 distribution, typically available in any statistics textbook).
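As a cross-check, a sketch using SciPy: correction=False requests the plain Pearson statistic for this 2 x 2 table (no Yates continuity correction), and chi2.ppf reproduces the 10.828 critical value:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # does not like science fiction

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof)  # 507.94 1 (the slide sums rounded terms to 507.93)
print(expected)             # [[ 90. 360.] [210. 840.]]

# Critical value for 1 degree of freedom at the 0.001 significance level.
print(round(chi2.ppf(1 - 0.001, df=1), 3))  # 10.828
```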

Correlation Analysis (Categorical Data)
Since our computed value of 507.93 is well above this threshold, we can reject the hypothesis that play_chess and like_science_fiction are independent and conclude that the two attributes are (strongly) correlated for the given group of people.

References

J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Elsevier, 2006, Chapter 2.

The end