Chapter 3: Data Mining:

Size: px
Start display at page:

Download "Chapter 3: Data Mining:"

Transcription

1 Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems provide users with query and reporting tools. To some extent the query and reporting tools can assist in answering questions like, where did the largest number of students come from last year? But these tools cannot provide any intelligence about why it happened. Taking an Example of University Database system: o The OLTP system will quickly be able to answer the query like how many students are enrolled in university o The OLAP system using data warehouse will be able to show the trends in students enrollments (ex: how many students are preferring BCA) o Data mining will be able to answer where the university should market. 3.2 Data Mining and Knowledge Discovery: Data mining is an integral part of Knowledge discovery in databases (KDD) it is the process converting raw data into useful information The input data is stored in various formats (flat files, spread sheet or relational tables) The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 1

2 3.3 Motivating Challenges: Traditional analysis techniques have often faced practical difficulties posed by new data sets. Challenges that motivated the development of data mining: 1) Scalability: Data sets are of the size of Terabytes and Petabytes, if data mining algorithm is handling this massive data it must be scalable, and it need to have parallel distributed algorithms to achieve this. 2) High Dimensionality: Large amount of data always contains thousands of attributes, Complexity increases as the dimensions grows, so the data sets need to have high dimensionality so that it can deal with data containing many dimensions. Traditional data analysis technique can only deal with low dimensional data. 3) Heterogeneous and complex Data: Traditional analysis methods can deal with homogeneous type of attributes; with businesses growing rapidly data mining techniques are required to deal with heterogeneous data 4) Data Ownership and Distribution: as the data is not always stored at one location and it might be scattered at different places in different organization, Distributed data mining technique is required and the challenges faced during this is: 1) How to reduce the amount of communication needed to perform distributed computing 2) How to combine the data mining results obtained from multiple sources 3) How to address data security issues. 5) Non Traditional Analysis: The traditional statistical approach is based on a hypothesized and test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to hypothesis. This process is extremely labor-intensive. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 2

3 3.4 The Origin of Data Mining: Draws ideas from machine learning/ai, pattern recognition, statistics, and database systems Traditional Techniques may be unsuitable due to Enormity of data High dimensionality of data Heterogeneous, distributed nature of data 3.5 Data Mining Tasks: Data mining tasks are generally divided into two major categories: Predictive tasks: Use some variables to predict unknown or future values of other variables. Ex: by seeing the behaviour of one variable we can decide the value of other variable. The attribute to be predicted is called: target or dependent Attribute used for making prediction are called: explanatory or independent variable Descriptive tasks: Here the objective is to derive patterns (correlations, anomalies, cluster etc..) that summarize the relationships in data. They are needed post processing the data to validate and explain the results. Cluster Analysis Predictive Modeling Association Analysis Anomaly Detection Lecturer: Syed Khutubuddin Ahmed Contact: Page 3

4 Four of the Core data Mining tasks: 1) Predictive Modeling 2) Association analysis 3) Cluster analysis 4) Anomaly detection 1) Predictive Modeling: refers to the task of building a model for the target variable as a function of the explanatory variable. There are two types of predictive modeling tasks: 1) Classification: used for discrete target variables ex: Web user will make purchase at an online bookstore is a classification task, because the target variable is binary valued. 2) Regression: used for continuous target variables. Ex: forecasting the future price of a stock is regression task because price is a continuous values attribute 2) Association Analysis: useful application of association is to find group of data that have related functionality. The Goal of associating analysis is to extract the most of interesting patterns in an efficient manner. Ex: market based analysis: We may discover the rule that {diapers} {Milk}, which suggests that customers who buy diapers also tend to buy milk. 3) Cluster Analysis: clustering has been used to group sets of related customers. EX: collection of news articles below table shows first 3 rows speak about economy and second 3 lines speak about health sector. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the article. Example: Article Words Dollar:1, industry:4, country:2, loan:3, government:2 Machinery:2, labor:3, market:4, industry:2, work:3, country:1 Job:5, inflation3, rise:2, jobless:2, market: 3, country:2 Patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2 Death:2, cancer:4, drug:3, public:4, health:4, director:1 Medical:2, cost:3, increase:2, patient:2, health:3, care:2 4) Anomaly Detection: is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 4

5 3.6 Data: Applications of anomalies are: fraud detection, network intrusion, unusual patterns of diseases, and ecosystem disturbances. What is Data? Collection of data objects and their attributes What is an Attribute? An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc. Attribute is also known as variable, field, characteristics, or feature What is an Object? A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance 3.7 Types of Data 1) Attributes and Measures: Attribute values are numbers or symbols assigned to an attribute Distinction between attributes and attribute values: Same attribute can be mapped to different attribute values Example: height can be measured in feet or meters Different attributes can be mapped to the same set of values Example: Attribute values for ID and age are integers Note: properties of attribute values can be different like you can find the average ages of persons but you cannot find the average ID s Types of Attributes: Nominal (particular identity) Examples: ID numbers, eye color, zip codes Ordinal (measurable) Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval (range between the two) Examples: calendar dates, temperatures in Celsius or Fahrenheit. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 5

6 Ratio () Examples: temperature in Kelvin, length, time, counts Properties of Attribute Values: The type of an attribute depends on which of the following properties it possesses: Distinctness: = Order: < > Addition: + - Multiplication: * / Nominal attribute: Uses only distinctness Ordinal attribute: Uses distinctness & order Interval attribute: Uses distinctness, order & addition Ratio attribute: Uses all 4 properties Describing Attributes by the nature of Values: Discrete Attribute (Integers) Has only a finite or countably infinite set of values Examples: zip codes, ID Numbers, or the set of words in a collection of documents Often represented as integer variables. Note: binary attributes are a special case of discrete attributes Continuous Attribute (Floating point) Has real numbers as attribute values Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables. 3.7 Types of Data Sets Types of data is grouped in to three groups 1) Record data Transaction or market based Data Data Matrix Document Data or sparse data matrix 2) Graph data Data with relationship among objects (World Wide Web Data with objects that are Graphs (Molecular Structures) 3) Ordered data Sequential Data Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 6

7 Sequence Data or Genetic Sequence Data Time series or Temporal Data Spatial Data Three characteristics that apply to many data sets are: i) Dimensionality The dimensionality of data set is the number of attributed that the objects in the data set possess. Data with small number of dimensions tends to be qualitatively different than moderate or high dimensional data. The difficulty associated with analyzing high dimensional data are sometimes referred to as the curse of dimensionality ii) Sparsity Data with asymmetric features, most of the attribute values are zero s, in practice terms, sparsity is an advantage because usually only non-zero values to be stored and manipulated. This results in significant savings with respect to computation time and storage. Some of the Data mining algorithms work well for sparse data. iii) Resolution It is frequently possible to obtain data different levels of resolution, often the properties of the data are different at different resolution. Ex: the surface of the earth seems very uneven at a resolution of few meters, but is relatively smooth at a resolution of tens of kilometers. Ex: Photo Pixels (higher the pixel resolution clears the image lesser the resolution image is blurred. Detailed view on three types of data: 1) Record data: Record data set is a collection of data objects, which consists of fixed set of data fields (attributes). Record data is usually stored in flat files Or relational tables Types of record data are: Transaction or market based data The data matrix The sparse data matrix Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 7

8 A) Transaction or Market Basket Data: Transaction data is a special type of record data, where each record (transaction) involves a set of items. B) The Data Matrix: If a set of objects have same set numeric attributes then the data objects will be known as points in a multidimensional space. A set of such data objects can be interpreted as M by N matrix. C) The Sparse Data Matrix: This matrix only contains non zero values Ex: fig(d) document term matrix. staff in the enterprise. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 8

9 2) Graph Based data: A Graph can sometime be convenient and powerful representation for data. A. Data with relationships among objects: relationships among objects frequently convey important information ex: web pages B. data with objects that are graphs: If objects have structure, that is the object contains sub objects that have relationship, then such objects are frequently represented as graphs. Ex: benzene molecule 3) Ordered Data: a. Sequential Data: Sequential data also referred as temporal data, can be thought of as an extension of record data, where each record has time associated with it. b. Sequence Data or Genetic Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as sequence of words or letters. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 9

10 C. Time Series Data: Time series data is a special type of sequential data in which each record is a time series, i.e. a series of measurements taken over time. D. Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data that is collected for a variety of geographical location. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 10

11 3.8 Data Processing Different Data Processing Techniques are: 1. Aggregation 2. Sampling 3. Dimensionality reduction 4. Feature creation 5. Discretization and binarization 6. Variable transformation 1. Aggregation: Combining two or more attributes (or objects) into a single attribute (or object) Purpose Data reduction Reduce the number of attributes or objects Change of scale Cities aggregated into regions, states, countries, etc More stable data Aggregated data tends to have less variability 2. Sampling Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming. ANALOGY: (Rice : to see whether the rice is cooked or not we only see one particle of it not all the rice particles) Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 11

12 Types of Sampling: Simple Random Sampling There is an equal probability of selecting any particular item» There are two variations on random sampling: 1) Sampling without replacement As each item is selected, it is removed from the population 2) Sampling with replacement Objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked up more than once Stratified sampling Split the data into several partitions; then draw random samples from each partition Progressive Sampling: If the proper sample size selection is difficult then adaptive or progressive sampling is used. Then these approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. 3. Dimensionality Reduction: Complexity of data increases as the dimensions grows in data. Curse of Dimensionality: When dimensionality increases, data becomes increasingly sparse in the space that it occupies For clustering, the definitions of density and distance between points, this is critical for clustering and outlier detection, become less meaningful. As a result many clustering and classification algorithms have trouble with high dimensional data, as it results in reduced classification accuracy and poor quality clusters. Purpose: Avoid curse of dimensionality Reduce amount of time and memory required by data mining algorithms Allow data to be more easily visualized May help to eliminate irrelevant features or reduce noise Linear Algebra Techniques for dimensionality reduction: Principle Component Analysis Singular Value Decomposition Others: supervised and non-linear techniques Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 12

13 Feature Subset Selection: Another way to reduce dimensionality of data is to use only subset of features. It might seems that such approach would lose information, this is not the case if redundant and irrelevant features are present. Redundant features duplicate much or all of the information contained in one or more other attributes Example: purchase price of a product and the amount of sales tax paid Irrelevant features contain no information that is useful for the data mining task at hand Example: students' ID is often irrelevant to the task of predicting students' Grade Point Averages Techniques for feature selection: Brute-force approach: Try all possible feature subsets as input to data mining algorithm Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm Filter approaches: Features are selected before data mining algorithm is run Wrapper approaches: Use the data mining algorithm as a black box to find best subset of attributes 4. Feature Creation: Create new attributes that can capture the important information in a data set much more efficiently than the original attributes Three general methodologies: Feature Extraction creation of new set of features from the original raw data is known as feature creation Mapping Data to New Space: A totally different view of data that can reveal important and interesting features. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 13

14 Feature Construction combining features to get better features than the original 5. Discretization and Binarization Some data mining algorithms need data to be in the form of categorical attributes. And algorithms that find association patters require that the data be in the form of binary attributes. Thus transforming continuous attributes into a categorical attribute is called discretization. And transforming continuous and discrete attributes into binary attributes is called as Binarization. 6. Variable transformation A variable transformation refers to a transformation that is applied to all the values of a variable, or even attributes. In each object the transformation is applied to the value of the variable for that object. Ex: converting a floating point value to an absolute value. two types of variable transformation: Simple function normalization two types of variable transformation: Simple function: for this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then example of such transformation include x k, log x, e x, 1/x, sinx or x Normalization: the goal of normalization of standardization is to make an entire set of values have a particular property. Standard deviation is one of the example of standardization where making entire set of values have a common property Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 14

15 3.9 Measures of Similarity and Dissimilarity: Basics: Similarity Numerical measure of how alike two data objects are. Is higher when objects are more alike. Often falls in the range [0,1] Dissimilarity Numerical measure of how different are two data objects Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies Similarity and Dissimilarity between Simple Attributes: If p and q are the attribute values for two data objects. With respect to ordinal attributes: Consider an attribute that measures the quality of the product: eg: candy bar on the scale {poor, fair, OK, good, wonderful} It would seem reasonable that a product, PI, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=l, OK=2. good=3, wonderful=4]. Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 15

16 Then, d ( Pl, P2 ) = 3 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d ( P l, P 2 ) = (3-2)/5-1 = A similarity for ordinal attributes can then be defined as s = 1- d. Possible Questions from This chapter: 1. What is data mining and why do we need data mining? Ans : Page-1 2. Write a note on Data mining and Knowledge discovery? Ans: Page-1 3. Explain the Challenges that motivated the use of data mining? Ans: page-2 4. Explain the data mining tasks in details? Ans: page-3 to 4 5. Write a note on Data, attribute and Object? Ans:Page-5 6. Explain in detail types of attributes? Ans:Page-5 to 6 7. Explain the different types of data sets in details with proper examples and figures? Ans: page-6 to Explain data processing in details with examples? Ans: Page-11 to Write a note on measures of similarities and dissimilarities? Ans: Page Lecturer: Syed Khutubuddin Ahmed Contact: khutub27@gmail.com Page 16

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes 0 Data Mining: Data What is Data? Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Collection of data objects and their attributes An attribute is a property or characteristic

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining 10 Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Data? Collection of data objects

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

CSE4334/5334 Data Mining 4 Data and Data Preprocessing. Chengkai Li University of Texas at Arlington Fall 2017

CSE4334/5334 Data Mining 4 Data and Data Preprocessing. Chengkai Li University of Texas at Arlington Fall 2017 CSE4334/5334 Data Mining 4 Data and Data Preprocessing Chengkai Li University of Texas at Arlington Fall 2017 10 What is Data? Collection of data objects and their attributes Attributes An attribute is

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 2 Sajjad Haider Spring 2010 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric

More information

CAP-359 PRINCIPLES AND APPLICATIONS OF DATA MINING. Rafael Santos

CAP-359 PRINCIPLES AND APPLICATIONS OF DATA MINING. Rafael Santos CAP-359 PRINCIPLES AND APPLICATIONS OF DATA MINING Rafael Santos rafael.santos@inpe.br www.lac.inpe.br/~rafael.santos/ Overview So far What is Data Mining? Applications, Examples. Let s think about your

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 02 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Proximity and Data Pre-processing

Proximity and Data Pre-processing Proximity and Data Pre-processing Slide 1/47 Proximity and Data Pre-processing Huiping Cao Proximity and Data Pre-processing Slide 2/47 Outline Types of data Data quality Measurement of proximity Data

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Question Bank. 4) It is the source of information later delivered to data marts.

Question Bank. 4) It is the source of information later delivered to data marts. Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

More information

Data Preprocessing UE 141 Spring 2013

Data Preprocessing UE 141 Spring 2013 Data Preprocessing UE 141 Spring 2013 Jing Gao SUNY Buffalo 1 Outline Data Data Preprocessing Improve data quality Prepare data for analysis Exploring Data Statistics Visualization 2 Document Data Each

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)

Data Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano) Data Exploration and Preparation Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining, : Concepts and Techniques", The Morgan Kaufmann

More information

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts Chapter 28 Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Data Mining Course Overview

Data Mining Course Overview Data Mining Course Overview 1 Data Mining Overview Understanding Data Classification: Decision Trees and Bayesian classifiers, ANN, SVM Association Rules Mining: APriori, FP-growth Clustering: Hierarchical

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler, Sanjay Ranka

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler, Sanjay Ranka BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler, Sanjay Ranka Topics What is data? Definitions, terminology Types of data and datasets Data preprocessing Data Cleaning Data integration

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

Data warehouse and Data Mining

Data warehouse and Data Mining Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000.

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000. Homework # 4 1. Attribute Types Classify the following attributes as binary, discrete, or continuous. Further classify the attributes as qualitative (nominal or ordinal) or quantitative (interval or ratio).

More information

COMP 465 Special Topics: Data Mining

COMP 465 Special Topics: Data Mining COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Machine Learning Feature Creation and Selection

Machine Learning Feature Creation and Selection Machine Learning Feature Creation and Selection Jeff Howbert Introduction to Machine Learning Winter 2012 1 Feature creation Well-conceived new features can sometimes capture the important information

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is

More information

COMP90049 Knowledge Technologies

COMP90049 Knowledge Technologies COMP90049 Knowledge Technologies Data Mining (Lecture Set 3) 2017 Rao Kotagiri Department of Computing and Information Systems The Melbourne School of Engineering Some of slides are derived from Prof Vipin

More information

BUSINESS DECISION MAKING. Topic 1 Introduction to Statistical Thinking and Business Decision Making Process; Data Collection and Presentation

BUSINESS DECISION MAKING. Topic 1 Introduction to Statistical Thinking and Business Decision Making Process; Data Collection and Presentation BUSINESS DECISION MAKING Topic 1 Introduction to Statistical Thinking and Business Decision Making Process; Data Collection and Presentation (Chap 1 The Nature of Probability and Statistics) (Chap 2 Frequency

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Jarek Szlichta

Jarek Szlichta Jarek Szlichta http://data.science.uoit.ca/ Open data Business Data Web Data Available at different formats 2 Data Scientist: The Sexiest Job of the 21 st Century Harvard Business Review Oct. 2012 (c)

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and attributes Data exploration Data pre-processing 2 10 What is Data?

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Knowledge Discovery in Data Bases

Knowledge Discovery in Data Bases Knowledge Discovery in Data Bases Chien-Chung Chan Department of CS University of Akron Akron, OH 44325-4003 2/24/99 1 Why KDD? We are drowning in information, but starving for knowledge John Naisbett

More information

Machine Learning - Clustering. CS102 Fall 2017

Machine Learning - Clustering. CS102 Fall 2017 Machine Learning - Fall 2017 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

1. Inroduction to Data Mininig

1. Inroduction to Data Mininig 1. Inroduction to Data Mininig 1.1 Introduction Universe of Data Information Technology has grown in various directions in the recent years. One natural evolutionary path has been the development of the

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Data Warehouse and Data Mining Li Xiong Department of Mathematics and Computer Science Emory University 1 1960s: Evolution of Database Technology Data collection, database creation,

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Data Mining: Exploring Data

Data Mining: Exploring Data Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar But we start with a brief discussion of the Friedman article and the relationship between Data

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

CS378 Introduction to Data Mining. Data Exploration and Data Preprocessing. Li Xiong

CS378 Introduction to Data Mining. Data Exploration and Data Preprocessing. Li Xiong CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data Mining: Concepts

More information

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce

More information

Data Preprocessing. Data Mining 1

Data Preprocessing. Data Mining 1 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor

CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Data mining fundamentals Data Base and Data Mining Group of Data analysis Most companies own huge databases containing operational data textual documents experiment results

More information

Data mining fundamentals

Data mining fundamentals Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2012 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt12 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

A Brief Introduction to Data Mining

A Brief Introduction to Data Mining A Brief Introduction to Data Mining L. Torgo ltorgo@dcc.fc.up.pt Departamento de Ciência de Computadores Faculdade de Ciências / Universidade do Porto Sept, 2014 Introduction Motivation for Data Mining?

More information

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L

Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L Introduction to Data Mining S L I D E S B Y : S H R E E J A S W A L Books 2 Which Chapter from which Text Book? Chapter 1: Introduction from Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann

More information

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)? Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely

More information

Machine Learning Chapter 2. Input

Machine Learning Chapter 2. Input Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA. Data Mining Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011 Important Note! This presentation was obtained from Dr. Vijay Raghavan

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-18) LEARNING FROM EXAMPLES DECISION TREES Outline 1- Introduction 2- know your data 3- Classification

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

Data warehouses Decision support The multidimensional model OLAP queries

Data warehouses Decision support The multidimensional model OLAP queries Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing

More information

Data Preprocessing. Komate AMPHAWAN

Data Preprocessing. Komate AMPHAWAN Data Preprocessing Komate AMPHAWAN 1 Data cleaning (data cleansing) Attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 2 Missing value

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data Data Statistics Population Census Sample Correlation... Voluntary Response Sample Statistical & Practical Significance Quantitative Data Qualitative Data Discrete Data Continuous Data Fewer vs Less Ratio

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

Applications and Trends in Data Mining

Applications and Trends in Data Mining Applications and Trends in Data Mining Data mining applications Data mining system products and research prototypes Additional themes on data mining Social impacts of data mining Trends in data mining

More information

DATA MINING TRANSACTION

DATA MINING TRANSACTION DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING INTRODUCTION TO DATA MINING 1 Chiara Renso KDDLab - ISTI CNR, Italy http://www-kdd.isti.cnr.it email: chiara.renso@isti.cnr.it Knowledge Discovery and Data Mining Laboratory, ISTI National Research Council,

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information