DATA PREPROCESSING. Tzompanaki Katerina


1 DATA PREPROCESSING Tzompanaki Katerina

2 Background: Data storage formats
Data in a DBMS: accessed through the ODBC or JDBC protocols.
Data in flat files:
  Fixed-width format: each column has a specific number of characters, padded with special characters if needed.
  Delimited format: fields are separated by a tab, a comma, or another delimiter.
Attention:
  Convert (escape or quote) field delimiters that appear inside strings.
  Verify the number of attributes before and after conversion.
9/3/18 2
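A minimal sketch of the two checks above, using Python's standard csv module. The sample records and the choice of a tab as the target delimiter are illustrative assumptions, not part of the slides.

```python
import csv
import io

# Hypothetical comma-delimited input; the first data record has a comma inside a quoted string.
raw = 'name,diagnosis,prescription\n"Doe, Jane",flu,rest\nSmith,cancer,chemo\n'

rows = list(csv.reader(io.StringIO(raw)))    # the csv reader honours quoted delimiters
expected = len(rows[0])                       # number of attributes in the header

# Verify the number of attributes before conversion.
fields_before = [len(r) for r in rows]

# Convert to tab-delimited output; csv.writer quotes any field that contains the new delimiter.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)

# Verify the number of attributes after conversion.
converted = list(csv.reader(io.StringIO(out.getvalue()), delimiter="\t"))
fields_after = [len(r) for r in converted]
```

Splitting each line naively on "," would report four fields for the first data record, which is exactly the kind of error the attribute count check is meant to catch.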

3 Background: Data and attributes
Some frequently encountered terminology:
  Data objects are also called data points, samples, examples, vectors, instances, or data tuples. They are the entities of a given dataset in a given context, e.g., patients or products.
  Attributes are also called features, variables, or dimensions.
  An attribute vector is the set of attributes used to describe a given data object. E.g., the attribute vector <Name, Disease, Prescription> describes patient data objects.
  The (observed) values of an attribute are called observations. E.g., cancer, high blood pressure, and flu may be the observations for the Disease attribute in a given dataset.

4 Background: Attribute types
Nominal (or categorical) attributes refer to names of things or categories that normally have no order. E.g., marital status (single, married, divorced), color (blue, green, etc.), or userid (323, 235, etc.).
A binary attribute is a nominal attribute with two possible values, 0 or 1, stating absence or presence. E.g., for a patient we could have the binary attributes smoker (yes, no), sex (male, female), and test (positive, negative).
An ordinal attribute is an attribute whose values have an ordering or ranking. E.g., grades (A > B > C), sizes (large > medium > small).
Nominal, binary, and ordinal attributes are qualitative attributes: they describe a feature of an object without giving an actual size or quantity.

5 Background: Attribute types
Numeric attributes describe measurable quantities and are represented by numbers (integers or reals). They provide a ranking and allow mathematical operations. E.g., temperature (the difference 20 °C - 15 °C is meaningful), age (a 44-year-old is twice as old as a 22-year-old).
Numeric attributes are quantitative attributes: they describe measurable quantities.
Another categorisation:
  Discrete attributes have a finite or countably infinite set of values, which may or may not be represented as integers. E.g., hair color, smoker, size, age.
  Continuous attributes are attributes that are not discrete; their values are real numbers, typically represented as floating-point numbers. E.g., length, income, price.

6 Background: Basic Statistical Description of Data
Measures of central tendency
The mean of an attribute x over a multiset of N observations is the central value:
  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + \cdots + x_N}{N}$
The median is the middle value in an ordered set of values. If the number of values is even, the median is not unique; by convention the average of the two middle values is used. The median represents skewed (i.e., asymmetric) data better than the mean and is less sensitive to outliers.
The mode is the most frequent value. If several values share the highest frequency, the dataset is called multimodal. The mode can also be used for nominal attributes.
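The three measures can be computed with Python's standard statistics module. The observations below are made up for illustration; the single large value is included to show how the mean reacts to an outlier while the median does not.

```python
import statistics

ages = [22, 25, 25, 31, 44, 95]             # hypothetical observations, 95 is an outlier

mean = statistics.mean(ages)                 # sum of the observations divided by N
median = statistics.median(ages)             # even N: average of the two middle values
modes = statistics.multimode(ages)           # every value that reaches the highest frequency

colors = ["blue", "green", "blue", "red", "green"]
color_modes = statistics.multimode(colors)   # the mode is also defined for nominal data
```

Here the mean is pulled above 40 by the outlier, the median stays at 28, and the color list is bimodal (blue and green both occur twice).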

7 Background: Basic Statistical Description of Data
Measures of data dispersion
The range of a numeric attribute is the difference between the maximum and the minimum observation, max() - min().
Quantiles split an ordered numeric set into subsets of equal size, i.e., each containing the same fraction of the data. The k-th q-quantile of a given data distribution is the value v such that at most k/q of the data values are less than v and at most (q - k)/q of the data values are greater than v, where k is an integer with 0 < k < q. The 100-quantiles are called percentiles.
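A short sketch of range and quartiles (q = 4) on the price list used later in the binning slides. It relies on statistics.quantiles, which interpolates between neighbouring observations; that interpolation is a property of the library function, not of the definition above, so the sketch also checks the "at most k/q below" condition explicitly for this data.

```python
import statistics

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

value_range = max(prices) - min(prices)

# n=4 returns the three cut points between the quartiles; n=100 would return 99 percentiles.
quartiles = statistics.quantiles(prices, n=4)

def fraction_below(values, v):
    """Fraction of the data values strictly less than v."""
    return sum(1 for x in values if x < v) / len(values)

# For k = 1, 2, 3: at most k/4 of the values lie below the k-th cut point.
checks = [fraction_below(prices, v) <= k / 4 for k, v in enumerate(quartiles, start=1)]
```

The second quartile coincides with the median of the list.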

8 Background: Basic Statistical Description of Data
Measures of data dispersion
The variance (σ²) and standard deviation (σ) indicate how spread out the distribution of an attribute x is. A low standard deviation means that the observations tend to be very close to the mean, while a high standard deviation indicates that the observations are spread out over a large range of values.
  $\sigma^2(x) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$
The covariance cov(x, y) of two attributes shows how the attributes vary together. A positive covariance, cov(x, y) > 0, shows that y tends to increase as x increases, while a negative one, cov(x, y) < 0, indicates that y tends to decrease as x increases.
  $\operatorname{cov}(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$
Finally, we define the covariance matrix for x and y, with rows and columns ordered x, y (it can be extended to cover all data variables/attributes):
  $\begin{pmatrix} \sigma^2(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \sigma^2(y) \end{pmatrix}$
Note that cov(x, y) = cov(y, x), so the matrix is symmetric.
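The formulas can be checked on a tiny example. This sketch assumes the 1/(N-1) normalisation (which is also what statistics.variance uses); the two attribute lists are made up, and covariance is a helper written here for illustration.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance with the 1/(N-1) normalisation."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical attribute x
y = [2.0, 4.0, 6.0, 8.0, 10.0]       # y rises with x, so cov(x, y) is positive

var_x = statistics.variance(x)        # sample variance
std_x = statistics.stdev(x)           # its square root
cov_xy = covariance(x, y)

# Covariance matrix with rows and columns ordered x, y.
cov_matrix = [
    [covariance(x, x), covariance(x, y)],
    [covariance(y, x), covariance(y, y)],
]
```

The diagonal of the matrix holds the variances, because cov(x, x) is the variance of x, and the two off-diagonal entries are equal.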

9 Background: Displaying Data
Histograms summarize the distribution of observations. Each bar represents the frequency of an observation. For ordered numeric values, the range is split into equally sized buckets; the range covered by a bucket is called its width.
Scatter plots are used to observe correlations between pairs of numeric attributes.
[Figure: two scatter plots showing positive (left) and negative (right) correlation.]

10 Why Preprocessing?
Data in the real world is dirty:
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
  Quality decisions are based on quality data.
  Data warehouses need consistent integration of quality data.

11 Data understanding: Relevance
What data are available for the task?
Are these data relevant?
Are additional relevant data available?
How much historical data are available (provenance)?
Who is the data expert?

12 Data understanding: Quantity
Number of instances (records, objects)
  Rule of thumb: 5,000 or more desired.
  If fewer, results are less reliable; use special methods (like boosting, not covered in this course).
Number of attributes
  Rule of thumb: 10 or more instances for each attribute.
  If there are many attributes, use feature reduction and selection.
Number of targets
  Rule of thumb: more than 100 instances for each class.
  If the classes are very unbalanced, use stratified sampling.

13 Forms of data preprocessing

14 Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: integration of multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reduced data volume that still yields the same or similar analytical results.
Data discretization: part of data reduction, but of particular importance, especially for numerical data.


16 Data Cleaning
Reformat/convert data
Fill in missing values
Handle dates
Identify outliers and smooth out noisy data
Correct inconsistent data

17 Reformatting Data
Convert data to a standard format:
  Missing values
  Unified date format
  Binning of numeric data
  Fix errors and outliers
  Convert nominal fields whose values have an order to numeric ones. Why? To be able to use comparison operators (> and <) on these fields.

18 Missing Data
Data is not always available. E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
  equipment malfunction
  inconsistency with other recorded data
  data not entered due to misunderstanding
  certain data not being considered important at the time of entry
  history or changes of the data not being registered
Missing data may need to be inferred!


20 Handling Missing Data
Ignore the tuple: usually done in classification tasks when the tuple's class label (target value) is missing. (Ineffective.)
Fill in the missing value manually. (Inefficient and tedious.)
Use a global constant to fill in the missing value. (Not foolproof.)
Use a measure of central tendency: fill in the missing value with the attribute mean or median, or with the attribute mean of all samples belonging to the same class. (Smarter.)
Use the most probable value: in a supervised manner, find the most probable value using inference-based mechanisms such as a Bayesian formula or a decision tree. (Best choice.)
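The two central-tendency strategies can be sketched as follows. The records, the class labels, and the use of None as the missing-value marker are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [("gold", 90.0), ("gold", None), ("gold", 110.0),
           ("basic", 30.0), ("basic", 50.0), ("basic", None)]

# Strategy 1: attribute mean over all samples that have a value.
observed = [v for _, v in records if v is not None]
global_mean = mean(observed)

# Strategy 2: attribute mean over the samples of the same class.
by_class = defaultdict(list)
for label, v in records:
    if v is not None:
        by_class[label].append(v)
class_mean = {label: mean(vs) for label, vs in by_class.items()}

filled_global = [(c, global_mean if v is None else v) for c, v in records]
filled_by_class = [(c, class_mean[c] if v is None else v) for c, v in records]
```

With the global mean both gaps are filled with 70.0; with class means the gold customer gets 100.0 and the basic customer 40.0, which keeps the difference between the classes visible.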

21 Unified Date Format
We want to transform all dates to the same format internally.
Some systems accept dates in many formats, e.g., Sep 24, 2003 or 9/24/03; the dates are transformed internally to a standard value.
Frequently, just the year (YYYY) is sufficient. For more detail, we may need the month, the day, the hour, etc.
Representing a date as YYYYMM or YYYYMMDD can be OK, but has problems.
What are the problems with YYYYMMDD dates? YYYYMMDD does not preserve intervals: the difference between two encoded values does not, in general, equal the number of days between the two dates.

22 Unified Date Format Options
To preserve intervals, we can use:
  Unix system date: number of seconds since Jan 1, 1970
  Number of days since Jan 1, 1960 (SAS)
Problem: the values are non-obvious. They don't help intuition and knowledge discovery, and they are harder to verify, which makes errors easier to make.
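The interval problem of YYYYMMDD, and how an epoch-based count avoids it, can be seen on three consecutive days. The dates are chosen for illustration, yyyymmdd is a helper written for this sketch, and days (rather than seconds) since Jan 1, 1970 are counted to keep the numbers small.

```python
from datetime import date

def yyyymmdd(d):
    """Encode a date as the integer YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day

# Three consecutive calendar days around a month boundary.
a, b, c = date(2004, 1, 30), date(2004, 1, 31), date(2004, 2, 1)

diff_ab = yyyymmdd(b) - yyyymmdd(a)     # one day apart, inside a month
diff_bc = yyyymmdd(c) - yyyymmdd(b)     # one day apart, across the month boundary

# Counting days since a fixed epoch preserves intervals.
epoch = date(1970, 1, 1)
days = [(d - epoch).days for d in (a, b, c)]
```

Both pairs are one day apart, yet the encoded differences are 1 and 70. The epoch-based values differ by exactly 1 in both cases, at the cost of being hard to read.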

23 KSP Date Format
  $\text{KSP\_Date} = \text{YYYY} + \frac{\text{days\_starting\_Jan\_1} - 0.5}{365 + 1\_\text{if\_leap\_year}}$
Preserves intervals between days.
The year is obvious.
Sep 24, 2003 is 2003 + (267 - 0.5)/365 = 2003.7301 (rounded to 4 digits).
Can be extended to include time.
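A possible implementation of such a year-plus-fraction date. It assumes the fraction is (day of year - 0.5) divided by the number of days in the year, with Jan 1 counted as day 1; the half-day offset is an assumption about the intended definition, and ksp_date is a hypothetical helper name.

```python
import calendar
from datetime import date

def ksp_date(d):
    """YYYY + (day of year - 0.5) / (days in that year), rounded to 4 digits."""
    day_of_year = d.timetuple().tm_yday                     # Jan 1 is day 1
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return round(d.year + (day_of_year - 0.5) / days_in_year, 4)

sep24 = ksp_date(date(2003, 9, 24))      # day 267 of a non-leap year
new_year = ksp_date(date(2004, 1, 1))    # first day of a leap year
```

The integer part of the result is always the calendar year, so the value stays readable, and within one year consecutive days differ by the same amount (before rounding).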

24 Conversion: Nominal to Numeric
Some methods can deal with nominal values internally. Other methods (regression, nearest neighbor, neural networks) require numeric inputs only. To use nominal fields in such methods, we need to convert them to numeric values.
Different strategies apply to binary, ordered, and multi-valued nominal fields.

25 Conversion: Binary to Numeric
Binary fields, e.g., Gender = M, F
Convert to Field_0_1 with values 0, 1, e.g.:
  Gender = M → Gender_0_1 = 0
  Gender = F → Gender_0_1 = 1

26 Conversion: Ordered to Numeric
Ordered attributes (e.g., Grade) can be converted to numbers that preserve the natural order, e.g.:
  A → 4.0
  A- → 3.7
  B+ → 3.3
  B → 3.0
Why is it important to preserve the natural order? To allow meaningful comparisons, e.g., Grade > 3.5.
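Both conversions amount to a lookup table. The sketch below uses the grade and gender mappings from the slides; the student records are invented.

```python
# Mappings taken from the slides.
gender_flag = {"M": 0, "F": 1}
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

# Hypothetical records: (name, gender, grade).
students = [("ann", "F", "A-"), ("bob", "M", "B"), ("eve", "F", "A")]

encoded = [(name, gender_flag[g], grade_points[grade]) for name, g, grade in students]

# Because the encoding preserves the order, a numeric comparison is meaningful.
above_3_5 = [name for name, _, points in encoded if points > 3.5]
```

Comparing the original strings instead would not respect the order of the grades: as text, "A-" sorts after "A", although A is the better grade.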

27 Conversion: Nominal, Few Values
Multi-valued, unordered attributes with a small number of values (rule of thumb: < 20), e.g., Color = Red, Orange, Yellow, ..., Violet.
For each value v, create a binary flag variable C_v, which is 1 if Color = v and 0 otherwise. This is also called one-hot encoding or the dummy variable method.
  ID   color
  100  red
  101  yellow
becomes
  ID   C_red  C_orange  C_yellow
  100  1      0         0
  101  0      0         1
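The flag construction can be written in a few lines. one_hot is a helper defined for this sketch and the color column is made up; the flag names follow the C_v pattern of the slide.

```python
def one_hot(values, prefix):
    """Return one dict of 0/1 flags per input value, with one flag per distinct category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "yellow", "orange", "red"]     # hypothetical Color column
flags = one_hot(colors, "C")
```

Exactly one flag is set in every row, which is where the name one-hot comes from.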

28 Conversion: Nominal, Many Values
Examples:
  US State Code (50 values)
  Profession Code (7,000 values, but only a few frequent ones)
How to deal with such fields?
  Ignore ID-like fields whose values are unique for each record.
  For other fields, group values "naturally", e.g., 50 US States → 3 or 5 regions; Profession → select the most frequent ones and group the rest.
  Create binary flag fields (one-hot encoding) for the selected values.
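Selecting the most frequent values and grouping the rest might look like this. group_rare, the cut-off of two values, and the "OTHER" label are all choices made for this sketch.

```python
from collections import Counter

def group_rare(values, keep=2, other="OTHER"):
    """Keep the `keep` most frequent values and map every other value to `other`."""
    frequent = {v for v, _ in Counter(values).most_common(keep)}
    return [v if v in frequent else other for v in values]

# Hypothetical Profession column with two frequent and two rare values.
professions = ["nurse", "teacher", "nurse", "welder", "teacher", "nurse", "falconer"]
grouped = group_rare(professions)
```

The grouped column has only three distinct values, so it can then be turned into binary flag fields without creating thousands of columns.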

29 Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistency in naming conventions
Other data problems that require data cleaning:
  duplicate records
  incomplete data
  inconsistent data

30 How to Handle Noisy Data?
Binning method: first sort the data and partition it into bins, then smooth by bin means, bin medians, bin boundaries, etc.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and check them by putting a human in the loop.
Regression: smooth by fitting the data to (linear) regression functions.

31 Simple Discretization Methods: Binning
Equi-width (distance) partitioning: divides the range into N intervals of equal size. If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N. This is the most straightforward method.
Equi-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples.
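Both partitioning schemes in a small sketch, applied to the price list from the smoothing example. The two function names are made up, and the equi-width version assumes that the attribute has at least two distinct values (otherwise W would be 0).

```python
def equal_width_bins(values, n):
    """Return, for each value, the index of its interval of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    # The maximum B would land just past the last interval, so it is clamped into bin n - 1.
    return [min(int((v - a) / width), n - 1) for v in values]

def equal_depth_bins(values, n):
    """Split the sorted values into n bins holding (approximately) the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)   # the first `extra` bins get one more item
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_idx = equal_width_bins(prices, 3)     # W = (34 - 4) / 3 = 10
depth_bins = equal_depth_bins(prices, 3)    # 4 values per bin
```

On this data the equi-width bins hold 3, 3, and 6 values, while the equi-depth bins hold 4 values each: equal width does not imply equal counts.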

32 Binning Methods for Data Smoothing
Sorted data for product prices: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum and maximum):
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
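The smoothed bins above can be reproduced with two short functions. The function names are made up; ties between the two boundaries are resolved towards the lower one, which is an arbitrary choice that does not occur in this data.

```python
def smooth_by_means(bins):
    """Replace every value by the mean of its bin, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    result = []
    for b in bins:
        lo, hi = min(b), max(b)
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75, and 29.25, which round to the values 9, 23, and 29 shown on the slide.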


34 Data Integration
Data integration combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources.
  Entity identification problem: identify the same real-world entities in multiple data sources, e.g., A.cust-id ≡ B.cust-#.
Detecting and resolving data value conflicts:
  For the same real-world entity, attribute values from different sources may be different.
  Possible reasons: different representations or different scales, e.g., meter vs. foot.

35 Handling Redundant Data in Data Integration
Redundant data occur often when integrating multiple databases:
  The same attribute may have different names in different databases.
  One attribute may be a derived attribute in another table, e.g., monthly vs. annual revenue.
Redundant data can often be detected by correlation analysis.
Careful integration of multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
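One way to carry out such a correlation analysis for two numeric attributes is the Pearson correlation coefficient. The revenue figures and the 0.95 threshold are assumptions chosen for this sketch; pearson is a helper written here, not a library call.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly = [10.0, 12.0, 9.0, 15.0]          # hypothetical monthly revenue from source A
annual = [12 * m for m in monthly]          # annual revenue from source B, a derived attribute
r = pearson(monthly, annual)

redundant = abs(r) > 0.95                   # arbitrary threshold for "nearly perfectly correlated"
```

A coefficient close to 1 or -1 suggests that one attribute can be derived from the other, so keeping both adds little information. A high correlation is only a hint, though, and the decision to drop an attribute still needs domain knowledge.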


37 Data Reduction
Dimensionality reduction: reduce the number of attributes considered.
  Principal component analysis
  Wavelet transformation
Numerosity reduction: replace the data with a smaller but representative representation.
  Sampling: pick some of the data.
  Clustering: create clusters of similar items and use the clusters instead of their members.
  Histograms: binning method.
Data compression: compress the data in a lossless manner (the original data can be reconstructed) or a lossy manner (otherwise).

38 Dimensionality Reduction
Purpose:
  Avoid the curse of dimensionality.
  Reduce the time and memory required by data mining algorithms.
  Allow data to be more easily visualized.
  May help to eliminate irrelevant features or reduce noise.
Feature selection: select the most important features.
Feature extraction/engineering: find representative combinations of features to use instead.

39 Dimensionality reduction: Feature selection
Feature selection: select a minimum set of features such that the distribution of the different classes is as close as possible to the original distribution. For n features there are 2^n possible subsets!
Expert knowledge can be used to keep the most important features.
Automatic feature selection:
  Model-based selection: the most important features are selected using a supervised ML algorithm (e.g., a decision tree).
  Iterative selection: iteratively, the least important features are discarded (backward elimination) or the most important ones are added (forward selection) until the desired number is reached.


41 Dimensionality reduction: Feature extraction
Principal Component Analysis (PCA)
  Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data.
  The original dataset is reduced to a new one consisting of N data vectors on c principal components (reduced dimensions).
  Each data vector is (approximated by) a linear combination of the c principal component vectors.
  Works for numeric data only.
We will see PCA in detail when we study unsupervised learning methods.

42 Principal Component Analysis (PCA)
[Figure: the perpendicular (orthogonal) arrows show the principal components of the data. The blue arrow is the first principal component, the pink one is the second.]

43 Numerosity Reduction: Sampling
Simple random sample without replacement (SRSWOR) of size s: randomly pick s samples, all with equal probability.
Simple random sample with replacement (SRSWR) of size s: the same item can be picked more than once.


46 Numerosity Reduction: Sampling
Cluster sample: when the data are grouped into clusters, randomly pick s of the clusters. E.g., data retrieved in memory pages.
Stratified sample: create strata (levels) in the data to represent different categories, then pick a number of samples from each stratum accordingly. In this way, every stratum is guaranteed to be represented in the sample.
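The sampling schemes map directly onto Python's random module. The data, the seed, the sampling fraction, and the stratified_sample helper are assumptions for this sketch; drawing the same fraction from every stratum (with at least one item each) is one common way to read "accordingly".

```python
import random
from collections import defaultdict

rng = random.Random(0)                  # fixed seed so the sketch is repeatable
data = list(range(100))                 # hypothetical data objects

srswor = rng.sample(data, 10)           # without replacement: no item can repeat
srswr = rng.choices(data, k=10)         # with replacement: items may repeat

def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction from every stratum, but at least one item per stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical unbalanced categories: 90 "young" and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
strat = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 customers could easily contain no senior at all, whereas the stratified sample always contains 9 young customers and 1 senior.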

47 Numerosity Reduction: Clustering
Create partitions of data objects (clusters) so that objects within a cluster are similar to one another and dissimilar to objects in other clusters. Then use the clusters instead of the individual elements in them.

48 Numerosity Reduction: Histograms
Use histogram representations instead of the full data. As we saw before, histograms (binning method) partition the data distribution of an attribute A into disjoint buckets, which can be:
  Equal-width: the width of each bucket range is uniform.
  Equal-frequency (or equal-depth): the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).


50 Data Transformation
Smoothing: remove noise from the data.
Discretization: binning, histograms, clusters.
Normalization: scale values to fall within a small, specified range.
  min-max normalization
  z-score normalization
  normalization by decimal scaling
Concept hierarchy generalization: replace a value with a higher-level class.
Aggregation: summarization, data cube construction.
Several of these tasks (e.g., smoothing and binning) are shared with data cleaning.

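51 Normalization
min-max normalization:
  $v' = \frac{v - \min(v)}{\max(v) - \min(v)}$
z-score normalization (standardization):
  $v' = \frac{v - \bar{v}}{\sigma}$
  v' has zero mean and unit variance, matching the parameters of a standard Gaussian distribution (the shape of the distribution itself is unchanged).
scaling to unit length:
  $v' = \frac{v}{\lVert v \rVert}$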
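The three normalizations as plain functions. The income values are invented, the sample standard deviation (1/(N-1)) is used for the z-score, and the Euclidean norm is assumed for the unit-length scaling; the first two functions assume the attribute is not constant.

```python
from math import sqrt
from statistics import mean, stdev

def min_max(values):
    """Rescale the values of one attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def unit_length(vector):
    """Divide a data vector by its Euclidean norm."""
    norm = sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

incomes = [20.0, 30.0, 40.0, 50.0, 60.0]    # hypothetical attribute values
scaled = min_max(incomes)
standardized = z_score(incomes)
unit = unit_length([3.0, 4.0])              # one data vector with two attributes
```

Note the different scope: min-max and z-score normalization work on one attribute across all data objects, while scaling to unit length works on one data object across its attributes.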

52 Concept Hierarchies
For numerical data, building a concept hierarchy can be regarded as a discretization method. E.g., salaries fall into different ranges.
For nominal data, hierarchies can be implicitly or explicitly defined in schemas or by the data:
  Specification of a partial ordering of attributes explicitly at the schema level by users or experts. E.g., street < city < province or state < country.
  Specification of a set of attributes for the hierarchy, but not of their partial ordering. To find the ordering, use the number of distinct values of each attribute: the attribute with the fewest distinct values goes to the top of the hierarchy.
    country: 15 distinct values
    province: 365 distinct values
    city: 3567 distinct values
    street: [...],339 distinct values
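Ordering a set of attributes by their number of distinct values takes one sort. The address data and the function name are made up for this sketch.

```python
def hierarchy_by_cardinality(columns):
    """Order attribute names from most general (fewest distinct values) to most specific."""
    return sorted(columns, key=lambda name: len(set(columns[name])))

# Hypothetical address data, with the columns given in arbitrary order.
columns = {
    "street": ["1 Main St", "2 Oak Ave", "3 Elm Rd", "4 Pine Ln"],
    "country": ["CA", "CA", "CA", "CA"],
    "city": ["Vancouver", "Vancouver", "Toronto", "Ottawa"],
    "province": ["BC", "BC", "ON", "ON"],
}
levels = hierarchy_by_cardinality(columns)
```

This is a heuristic rather than a rule. For time attributes, for example, there are only 7 distinct days of the week but 12 months, so the resulting order should be reviewed by someone who knows the data.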

53 Sources
Han and Kamber: Data Mining: Concepts and Techniques
Nguyen Hung Son: Data cleaning and data preprocessing
Prof. Pier Luca Lanzi: Data Exploration and Preparation
Muller and Guido: Introduction to Machine Learning with Python

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

K236: Basis of Data Science

K236: Basis of Data Science Schedule of K236 K236: Basis of Data Science Lecture 6: Data Preprocessing Lecturer: Tu Bao Ho and Hieu Chi Dam TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai 1. Introduction to data science

More information

Data Preprocessing. Data Mining 1

Data Preprocessing. Data Mining 1 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

Data preprocessing Functional Programming and Intelligent Algorithms

Data preprocessing Functional Programming and Intelligent Algorithms Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute

More information

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad Data Analytics life cycle Discovery Data preparation Preprocessing requirements data cleaning, data integration, data reduction, data

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1 ECT7110 Data Preprocessing Prof. Wai Lam ECT7110 Data Preprocessing 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest,

More information

Data Mining and Analytics. Introduction
