Data Mining and Data Warehousing

CSCI645 Fall 23 Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: qggao@cs.dal.ca
Teaching Assistant: Christopher Jordan, Email: cjordan@cs.dal.ca
Office Hours: TR, 1:3-3: PM
9 October 23

Lectures Outline
Part I: Overview on DM and DW
1. Introduction (Ch1) (Ass1 due: Sep 23, Tue)
2. Data preprocessing (Ch3)
Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2) (Ass2: Sep 23 - Oct 14)
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (Ch4)
5. Classification data mining (Ch7) (Ass3: Oct 7 - Oct 21)
6. Association data mining (Ch6) (Ass4: Oct 21 - Nov 5)
7. Characterization data mining (Ch5)
8. Clustering data mining (Ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations (Project due: Dec 8)

3. DATA PREPROCESSING (Ch3)
- Data Preprocessing (DPP) Concept
- Major Tasks of DPP
- A DPP Case Study
- Summary

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data.
- Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility


Why Data Preprocessing?
- Raw data have errors and inconsistencies (data cleaning)
- Data need to be integrated from different sources into a single common format (data integration and transformation)
- Irrelevant data should be removed (data reduction)
- Domain knowledge should be added to the prepared data (discretization and concept hierarchy generation)

Major Tasks of DPP

Major Tasks of DPP (cont)
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results
- Data discretization: part of data reduction but with particular importance, especially for numerical data

Why data cleaning?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = (empty)
- noisy: containing errors or outliers
  e.g., Salary = -1
- inconsistent: containing discrepancies in codes or names
  e.g., Age = 42, Birthday = 3/7/1997
  e.g., was rating 1, 2, 3; now rating A, B, C
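The three kinds of dirt listed above can each be detected with a simple rule. The following is a minimal sketch, not a general-purpose validator: the field names, the reference date, and the reading of 3/7/1997 as month/day/year are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: a required attribute is missing or empty.
    if not record.get("occupation"):
        problems.append("incomplete: occupation is missing")
    # Noisy: a value outside its plausible domain.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            problems.append("inconsistent: age does not match birthday")
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": "", "salary": -1, "age": 42, "birthday": date(1997, 3, 7)}
problems = find_problems(record, today=date(2003, 10, 9))
for problem in problems:
    print(problem)
```

Each rule only flags a record; deciding whether to correct, blank out, or drop the value is left to a later cleaning step.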

Why is data dirty?
- Incomplete data comes from:
  - n/a data value when collected
  - different consideration between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data:
  - collection
  - entry
  - transmission
- Inconsistent data comes from:
  - different data sources
  - functional dependency violation

E.g., Data normalization for clustering mining
For clustering mining of a customer database: DB (Age, Income, Credit)
The distance between two data points C1 and C2, with attributes a1 (Age), a2 (Income), a3 (Credit):
d = ((C1_a1 - C2_a1)^2 + (C1_a2 - C2_a2)^2 + (C1_a3 - C2_a3)^2)^(1/2)

            Age   Income   Credit
Customer1:  32    4,       1,
Customer2:  24    3,       2,
            8     1,       8,
Normalized: 1 1/1 1/1 8 1 8 (rescaled) (rescaled)

If we scale all the attributes to the same order of magnitude, we obtain a reliable distance measure between the different records.
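The effect of rescaling on the distance measure can be sketched in a few lines. This example assumes min-max scaling to [0, 1], which is only one common way to bring attributes to the same order of magnitude; the customer values are hypothetical and chosen only to show income and credit swamping age in the raw distance.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_normalize(rows):
    """Rescale every column to [0, 1] using that column's min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical (Age, Income, Credit) records.
customers = [
    [32, 40000, 10000],
    [24, 30000, 2000],
    [50, 31000, 9000],
]

raw_distance = euclidean(customers[0], customers[1])
scaled = min_max_normalize(customers)
scaled_distance = euclidean(scaled[0], scaled[1])
print(round(raw_distance, 2), round(scaled_distance, 3))
```

In the raw distance the 8-year age gap is negligible next to the income and credit gaps; after scaling, all three attributes contribute on a comparable footing.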

A DPP Case Study
Business background: The publisher sells five types of magazine: on cars, houses, sports, music, and comics. The aim of the data mining is to find new, interesting clusters of clients in order to set up a marketing exercise. The business is interested in questions such as "What is the typical profile of a reader of a car magazine?" and "Is there any correlation between an interest in cars and an interest in comics?"
Data mining task:
- Mining clusters of clients for a magazine publisher database.
- Data preparation for clustering: cleaned, integrated, normalized, numerical valued data, etc.

1. Data Selection
- The database should contain the subscription records for the magazines. It should be a selection of operational data from the publisher's invoicing system and contain information about people who have subscribed to a magazine.
- The records consist of: client number, name, address, date of subscription, and type of magazine.
- In order to facilitate the DM process, a copy of this operational data is drawn and stored in a separate database. (Refer to Table 1)

2. Data Cleaning: remove duplications
Duplication of records: In an operational client database some clients may be represented by several records. Possible causes include:
- negligence, such as people making typing errors
- clients moving from one place to another without notifying the company of the change of address
- cases in which people deliberately spell their names incorrectly or give incorrect information about themselves to avoid a negative decision
(Refer to Table 2)

Table 1. Original data
Client number | Name | Address | Date purchase | Magazine purchased
239 2313 2319 Clinton King Jonson 2 Boulevard 3 High Road 4-15-94 6-21-93 5-3-92 1-1-1 2-3-95 1-1-1 car music sports house

Table 2. De-duplication
Client number | Name | Address | Date purchase | Magazine purchased
239 2313 Clinton King 2 Boulevard 3 High Road 4-15-94 6-21-93 5-3-92 1-1-1 2-3-95 1-1-1 car music sports house

De-duplication
The duplicated records may be identified by a pattern recognition algorithm and then corrected.
E.g., two records in the database, one of them for Mr. Jonson, have different client numbers but the same address, which is a strong indication that they are the same person. This type of pollution gives a company the impression that it has more clients than is in fact the case. Of course, we can never be sure of this, but a de-duplication algorithm using pattern analysis techniques could identify the situation and present it to a user to make a decision.
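The "same address, different client number" signal can be sketched as a simple grouping step. This is a deliberately minimal stand-in for a pattern recognition algorithm: it only normalizes case and whitespace in the address, and the records, names, and client numbers are hypothetical.

```python
from collections import defaultdict

def duplicate_candidates(records):
    """Group client numbers that share the same normalized address."""
    by_address = defaultdict(set)
    for rec in records:
        # Normalize case and runs of whitespace before comparing addresses.
        key = " ".join(rec["address"].lower().split())
        by_address[key].add(rec["client"])
    # Only addresses shared by more than one client number are suspicious.
    return {addr: sorted(ids) for addr, ids in by_address.items() if len(ids) > 1}

# Hypothetical records: two spellings of one name at the same address.
records = [
    {"client": 1001, "name": "Johnson", "address": "1 Main Street"},
    {"client": 1002, "name": "Clinton", "address": "2 Boulevard"},
    {"client": 1003, "name": "Jonson", "address": "1  main street"},
]

candidates = duplicate_candidates(records)
print(candidates)
```

The function returns candidates rather than merging records, in line with the point that a user should make the final decision.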

2. Data Cleaning: correct domain inconsistency
Domain inconsistency: Pollution caused by wrong domain values, which are not consistent with the definitions.
E.g., in the example table, the date 1-1-1 means 1 January 1901 (the company did not even exist at that time).
In some databases, analysis shows an unexpectedly high number of people born on 11 November: when people were forced to fill in a birth date on a screen and either did not know it or did not want to divulge it, they were inclined to type in '11-11-11'.
Untrue values of this kind can be disastrous in a data mining context. If information is unknown, it should be represented as such in the database.
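One way to surface placeholder values such as '11-11-11' is to look for values that occur far more often than expected and then store them as unknown. The sketch below assumes a simple frequency threshold (`max_share`, a hypothetical parameter) and uses made-up birth dates; `None` stands in for the database's representation of an unknown value.

```python
from collections import Counter

def suspicious_values(values, max_share=0.2):
    """Return values whose share of all entries exceeds max_share."""
    counts = Counter(values)
    total = len(values)
    return {v for v, n in counts.items() if n / total > max_share}

def mark_unknown(values, suspects):
    """Replace suspect placeholder values with None (unknown)."""
    return [None if v in suspects else v for v in values]

# Hypothetical birth dates with an over-represented placeholder.
birth_dates = ["11-11-11", "3-7-62", "11-11-11", "9-21-58",
               "11-11-11", "5-2-70", "1-30-49", "11-11-11"]

suspects = suspicious_values(birth_dates)
cleaned = mark_unknown(birth_dates, suspects)
print(suspects, cleaned)
```

A frequency test only raises suspicion; a domain rule (for example, a date earlier than the company's founding) is a more direct check where one is available.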

Table 3. Domain consistency
Client number | Name | Address | Date purchase | Magazine purchased
239 2313 Clinton King 2 Boulevard 3 High Road 4-15-94 6-21-93 5-3-92 2-3-95 12-2-94 car music sports house

3. Data Integration (Enrichment)
Suppose that we have purchased extra information about our clients consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house. (Refer to Table 4)
* You therefore have to make a deliberate decision either to overlook it or to delete it. A general rule states that any deletion of data must be a conscious decision, after a thorough analysis of the possible consequences.

Table 4. Additional data available for enrichment
Client name | Date of birth | Income | Credit | Car owner | House owner
Clinton 1-2-71 $36, $26,6 yes
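Enrichment amounts to joining the purchased attributes onto the client records. The following is a minimal sketch of such a left join, assuming the client name is the join key (as the columns of Table 4 suggest); the field names and all values are hypothetical.

```python
EXTRA_FIELDS = ["birth_date", "income", "credit", "car_owner", "house_owner"]

def enrich(clients, extra_by_name):
    """Left-join purchased attributes onto client records by client name."""
    enriched = []
    for client in clients:
        extra = extra_by_name.get(client["name"], {})
        row = dict(client)
        for field in EXTRA_FIELDS:
            # None marks an attribute that could not be obtained.
            row[field] = extra.get(field)
        enriched.append(row)
    return enriched

# Hypothetical client records and purchased data.
clients = [
    {"client": 1001, "name": "Clinton", "address": "2 Boulevard"},
    {"client": 1002, "name": "King", "address": "3 High Road"},
]
extra_by_name = {
    "Clinton": {"birth_date": "1-2-71", "income": 36000, "credit": 26600,
                "car_owner": "no", "house_owner": "yes"},
}

enriched = enrich(clients, extra_by_name)
print(enriched[1])
```

Joining on a name is fragile (spelling variants, namesakes), which is one reason de-duplication comes before enrichment in this case study.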

4. Data Reduction
- Remove the columns and rows which are not valuable to the DM process. In Table 6, the column NAME and the row with multiple values are removed from the database.
- In a real DM project, a lot of desirable data is often missing from the tables collected from the operational data, and most of it is impossible to retrieve.

Table 5. Enriched table
Client number | Name | Date of birth | Income | Credit | Car owner | House owner | Address | Date purchase made | Magazine purchased
239 2313 Clinton King 1-2-11 $36, $26.6 yes 2 Boulevard 4-15-94 6-21-93 5-3-92 2-3-95 12-2-94 car music sports house

Table 6. Table with column and row removed
Client number | Date of birth | Income | Credit | Car owner | House owner | Address | Date purchase made | Magazine purchased
239 1-2-11 $36, $26.6 yes 2 Boulevard 4-15-94 6-21-93 5-3-92 12-2-94 car music house
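The step from Table 5 to Table 6 can be sketched as a column drop plus a row filter. This example assumes that the removed rows are those lacking a required enriched attribute, which is an interpretation rather than something the slide states; the field names and values are hypothetical.

```python
def reduce_table(rows, drop_columns, required):
    """Drop unwanted columns and rows missing any required attribute."""
    reduced = []
    for row in rows:
        # Skip rows where a required attribute is unknown (None).
        if any(row.get(col) is None for col in required):
            continue
        reduced.append({k: v for k, v in row.items() if k not in drop_columns})
    return reduced

# Hypothetical enriched rows; the second has no purchased attributes.
rows = [
    {"client": 1001, "name": "Clinton", "income": 36000, "magazine": "car"},
    {"client": 1002, "name": "King", "income": None, "magazine": "sports"},
]

reduced = reduce_table(rows, drop_columns={"name"}, required=["income"])
print(reduced)
```

Because deletion should be a conscious decision, the columns to drop and the attributes treated as required are explicit arguments rather than hard-coded rules.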

4. Data Reduction (cont)
- In some cases, especially fraud detection, lack of information can be a valuable indication of interesting patterns.
- Up to this point, the process has consisted mainly of simple SQL operations.

5. Data transformation
For most databases, the information provided is much too detailed to be used as input to data mining algorithms, as in Table 6.
Apply the following coding steps:
1. Address to region
2. Birth date to age
3. Divide income by 1000
4. Divide credit by 1000
5. Convert car ownership yes-no to 1-0
6. Convert purchase date to month numbers starting from 1990
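The coding steps above can be sketched as one function applied per record. Several details are assumptions made for illustration: income and credit are divided by 1000, months are counted with January 1990 as month 1, the address-to-region lookup is a hypothetical table, and the record values and reference date are made up.

```python
from datetime import date

# Hypothetical address-to-region lookup (step 1).
REGION_BY_ADDRESS = {"2 Boulevard": 1}

def age_on(birth, today):
    """Age in whole years on the given day (step 2)."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

def months_since_1990(d):
    """Month number with January 1990 as month 1 (step 6)."""
    return (d.year - 1990) * 12 + d.month

def code_record(rec, today):
    """Apply the six coding steps to one record."""
    return {
        "client": rec["client"],
        "region": REGION_BY_ADDRESS[rec["address"]],
        "age": age_on(rec["birth_date"], today),
        "income": rec["income"] / 1000,           # step 3
        "credit": rec["credit"] / 1000,           # step 4
        "car_owner": 1 if rec["car_owner"] == "yes" else 0,      # step 5
        "house_owner": 1 if rec["house_owner"] == "yes" else 0,  # same coding
        "month": months_since_1990(rec["purchase_date"]),
        "magazine": rec["magazine"],
    }

# Hypothetical input record.
raw = {
    "client": 1001, "birth_date": date(1971, 1, 2), "income": 36000,
    "credit": 26600, "car_owner": "no", "house_owner": "yes",
    "address": "2 Boulevard", "purchase_date": date(1994, 4, 15),
    "magazine": "car",
}

coded = code_record(raw, today=date(2003, 10, 9))
print(coded)
```

Under these assumptions a purchase in April 1994 is coded as month 52, and an income of 36000 becomes 36.0, so all numeric attributes end up within a similar range.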

Table 7. An intermediate coding stage
Client number | Age | Income | Credit | Car owner | House owner | Region | Month of purchase | Magazine purchased
239 2 2 2 25 2 18.5 18.5 18.5 36. 18.5 17.8 17.8 17.8 26.6 17.8 1 1 1 1 1 1 52 42 29 48 car music house

Table 8. The final table
Client number | Age | Income | Credit | Car owner | House owner | Region | Car magazine | House | Sports | Music | Comic
239 2 25 18.5 36. 17.8 26.6 1 1 1 1 1 1 1 1
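Going from one row per purchase to one row per client with a 0/1 column per magazine type can be sketched as follows. The magazine list mirrors the column headers of the final table; the client numbers and purchases are hypothetical.

```python
MAGAZINES = ["car", "house", "sports", "music", "comic"]

def flatten(purchases):
    """Build one row per client with a 0/1 flag for each magazine type."""
    table = {}
    for client, magazine in purchases:
        # Start every client with all flags at 0, then set the ones purchased.
        row = table.setdefault(client, {m: 0 for m in MAGAZINES})
        row[magazine] = 1
    return table

# Hypothetical (client number, magazine) purchase rows.
purchases = [(1001, "car"), (1001, "music"), (1002, "house")]

final = flatten(purchases)
print(final)
```

This layout gives a clustering algorithm purely numerical input, at the cost of discarding how often and when each magazine was bought.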


Summary
- Data preparation is a major issue and the most time-consuming process for both mining and warehousing
- Data preparation includes data cleaning, integration, transformation, reduction, discretization, etc.
- Many DPP tools have been developed, but it is still an active research area because of the effort needed for