CS513-Data Mining. Lecture 2: Understanding the Data. Waheed Noor


1 CS513-Data Mining Lecture 2: Understanding the Data Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining March / 32

2 Outline
1 Patterns Class Activity
2 Types of Learning
3 Model
4 Data Mining Algorithms
5 Understanding your Data: Input
6 Issues with Real World Data


4 Pattern Example
Example: Consider contact lens prescription data from an optician. The task is to prescribe soft, hard, or no contact lenses to a patient based on the patient's information. We will analyze past data in order to find patterns, if possible.

5 Contact Lens Data

6 Finding Patterns: Illustration
if tear production rate = reduced then
    recommendation = none
else if age = young and astigmatic = no then
    recommendation = soft
else
    recommendation = hard
end if
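The rule list above can be sketched as a small function. This is only an illustration: the function name, parameter names, and the attribute values "normal" and "pre-presbyopic" are assumptions, since the contact lens table itself is not reproduced in this transcript.

```python
def recommend_lens(tear_rate: str, age: str, astigmatic: str) -> str:
    """Apply the three rules in order; the first rule that matches wins."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    # Default rule: reached only when neither rule above applies.
    return "hard"


# Hypothetical patients (attribute values are assumed, not taken from the slides).
print(recommend_lens("reduced", "young", "no"))
print(recommend_lens("normal", "young", "no"))
print(recommend_lens("normal", "pre-presbyopic", "yes"))
```

Because the rules are checked in sequence, the order matters: a young, non-astigmatic patient with reduced tear production is caught by the first rule and never reaches the second.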

7 What we get
These patterns may not be enough to generalize into rules, since the example is a simple one and we do not have enough data (i.e., the patterns may be incomplete).
We can say that this pattern just summarizes the data.
How many possible input values are required for extracting useful patterns? ( )
In practice, the data mining task needs to generalize to new examples as well.
Real-life data often contains examples in which the values of some features are noisy or missing, which can affect the performance of a data mining technique.
Misclassification can occur even on the dataset that was used to train the method.


9 Weather Problem Example

10 Some Complexity: Numeric Attributes

11 Classification
What we have seen so far are classification rules, i.e., rules for classifying examples.
We can also look for rules that associate the values of different attributes: association rules.
Example
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if windy = false and play = no then outlook = sunny and humidity = high
Can you identify one?
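One way to check an association rule against data is to count how many rows its condition covers and what fraction of those rows also satisfy its conclusion. The sketch below does this. The rows are hypothetical and are not the lecture's weather table, and the names `matches`, `evaluate_rule`, "coverage", and "accuracy" are chosen here for illustration.

```python
# Hypothetical rows for illustration only; not the lecture's weather data.
ROWS = [
    {"temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"temperature": "cool", "humidity": "normal", "windy": "true", "play": "no"},
    {"temperature": "hot", "humidity": "high", "windy": "false", "play": "no"},
    {"temperature": "mild", "humidity": "high", "windy": "false", "play": "yes"},
    {"temperature": "mild", "humidity": "normal", "windy": "false", "play": "yes"},
]


def matches(row, conditions):
    """True if the row has the required value for every listed attribute."""
    return all(row[attr] == value for attr, value in conditions.items())


def evaluate_rule(rows, antecedent, consequent):
    """Return (coverage, accuracy) for the rule: if antecedent then consequent."""
    covered = [r for r in rows if matches(r, antecedent)]
    correct = [r for r in covered if matches(r, consequent)]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), accuracy


print(evaluate_rule(ROWS, {"temperature": "cool"}, {"humidity": "normal"}))
print(evaluate_rule(ROWS, {"windy": "false"}, {"play": "yes"}))
```

Unlike a classification rule, the conclusion here can involve any attribute, which is why `consequent` is a dictionary rather than a single class label.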

12 Rules
Definition (Rules): A set of conditions/decisions that can be specifically and implicitly interpreted in some order.
They are helpful tools for classifying examples and finding associations among them, e.g., a decision list, which is interpreted in sequence, or a decision tree, which is interpreted hierarchically.
Sometimes we may get a rule set that gives a unique prescription for every conceivable example, as in the examples above.
However, this is generally not possible. There may be situations where no rule is applicable, or where more than one rule is applicable (i.e., a conflict arises, and we then resort to probabilities or weights).
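The two failure cases mentioned above (no rule applies, or several rules conflict) can be illustrated with an unordered rule set in which each rule carries a weight. This is one possible scheme rather than the lecture's method: the rules, the weights, and the "sum the weights per class" resolution are all assumptions made for the sketch.

```python
# (conditions, predicted class, weight): hypothetical rules and weights.
RULES = [
    ({"outlook": "sunny"}, "no", 0.6),
    ({"windy": "false"}, "yes", 0.8),
    ({"humidity": "high"}, "no", 0.7),
]


def classify(example, rules, default="unknown"):
    """Sum the weights of all matching rules per class and pick the heaviest."""
    votes = {}
    for conditions, label, weight in rules:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            votes[label] = votes.get(label, 0.0) + weight
    if not votes:
        # No rule is applicable: fall back to a default answer.
        return default
    return max(votes, key=votes.get)


conflict = {"outlook": "sunny", "windy": "false", "humidity": "normal"}
uncovered = {"outlook": "rainy", "windy": "true", "humidity": "normal"}
print(classify(conflict, RULES))
print(classify(uncovered, RULES))
```

In the first example two rules fire with opposite conclusions and the heavier one decides; in the second no rule fires, so the default is returned.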


14 Types of Learning in Data Mining
Classification Learning: The method is presented with classified examples (historical/training data) so that it can classify unseen examples (future/test data).
Association Learning: Associations among features are learned from historical data. Learning is not limited to one particular attribute or feature.
Clustering: Examples are grouped together based on some similarity or homogeneity.
Numeric Prediction: The outcome to be predicted is a numeric value rather than a discrete class.
Definition (Concept): Anything that is being learned is called the concept, and the output of the learning method is known as the concept description.


16 Model vs. Pattern
Definition (Model): A model describes a global summary of the dataset, i.e., it makes statements about any point in the full measurement space, for example predicting a value or assigning an example to a cluster, even if some points in this space are missing from the data.
Model Representation: In its simplest form, a model can be represented by Y = aX + c, where Y and X are variables (Y is the outcome), and a and c are model parameters. This is a linear model, since Y is a linear function of X.
Note: Unlike in statistics, linearity here is in terms of the variables rather than the model parameters.

17 Pattern
Definition (Pattern): A pattern describes a structure relating to a small (local) part of the data or measurement space. For example, mail-order purchase data may reveal a pattern that customers who buy a particular product also buy another product.
Example (Fraud Detection): Bank transaction data can be mined for fraud detection once the usual behaviors are described by patterns.
Once these structures are defined, their parameters can be estimated from the data. Models or patterns with parameter values are called fitted models or fitted patterns, respectively. Fitted models or patterns are then used on future data.
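As an illustration of estimating parameters from data, the sketch below fits the linear structure Y = aX + c from the previous slide and then applies the fitted model to a new point. Least squares is an assumption here, since the slides do not say which estimation method is used, and the data points are made up.

```python
def fit_line(xs, ys):
    """Estimate a and c in Y = aX + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, c = fit_line(xs, ys)


def fitted_model(x):
    """The structure Y = aX + c with its parameters fixed: a fitted model."""
    return a * x + c


print(a, c)
print(fitted_model(10.0))  # prediction for a point not in the training data
```

The structure (a straight line) is chosen before seeing the data; only the values of a and c come from the data.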


19 Why Algorithms
We have seen that data mining tasks arise in a variety of real-world applications, for example exploratory data analysis, descriptive modeling, predictive modeling, pattern and rule discovery, content retrieval, and so on.
To accomplish these tasks we need algorithms, termed data mining algorithms.
Readings: You should read about real-world applications of data mining from different resources to build an understanding of the different types of problems and data mining tasks.

20 Data Mining Algorithms
Although the division is not strict, a data mining algorithm generally has four basic components.
Components
1 Model or Pattern Structure: Describes the underlying structure or functional form that we seek from the data.
2 Score Function: Also known as the cost function, objective function, or performance measure. It is used to evaluate the learning capability and quality of the fitted structure (pattern or model).
3 Optimization or Searching: Optimizing the score function and searching through the possible model and pattern structures to find the best one.
4 Data Management Strategy: Effective management of large data during optimization and searching.
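The four components can be made concrete with a deliberately tiny sketch. Everything in it is an assumption chosen for illustration: the structure is a line, the score is the sum of squared errors, the search is an exhaustive grid, and "data management" is reduced to reading the data in small batches.

```python
from itertools import product

# Hypothetical data set of (x, y) pairs.
DATA = [(1, 3), (2, 5), (3, 7), (4, 9)]


def batches(size=2):
    """Data management: yield the data in small batches instead of all at once."""
    for i in range(0, len(DATA), size):
        yield DATA[i:i + size]


def model(x, a, c):
    """Model structure: the functional form we look for in the data."""
    return a * x + c


def score(params):
    """Score function: sum of squared errors of the model over all batches."""
    a, c = params
    total = 0.0
    for batch in batches():
        for x, y in batch:
            total += (y - model(x, a, c)) ** 2
    return total


def search(a_grid, c_grid):
    """Search: try every (a, c) pair on the grid and keep the lowest score."""
    return min(product(a_grid, c_grid), key=score)


best = search(a_grid=[0, 1, 2, 3], c_grid=[0, 1, 2])
print(best, score(best))
```

Swapping any one component (a different structure, a different score, a smarter search) gives a different algorithm while the others stay in place.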


22 Understanding Input
Definition (Example or Instance): A record or row in the data file is called an example, instance, or observation. Instances may be related to one another or independent of each other.
Definition (Attribute): The fixed, predefined columns or fields of the data file are known as features or attributes. An instance is characterized by the values of its attributes.
Attributes that are selected or used for a mining task are referred to as variables of the data mining algorithm.
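A minimal sketch of this terminology, with each instance held as a dictionary whose keys are the attributes. The two rows and the helper name `project` are hypothetical.

```python
# Each dictionary is one instance (row); its keys are the attributes (columns).
INSTANCES = [
    {"age": "young", "astigmatic": "no", "tear_rate": "normal", "lens": "soft"},
    {"age": "young", "astigmatic": "yes", "tear_rate": "reduced", "lens": "none"},
]

# The attributes are fixed and shared by every instance.
attributes = list(INSTANCES[0].keys())

# Attributes selected for a mining task become the algorithm's variables.
variables = ["age", "tear_rate"]


def project(instances, names):
    """Keep only the selected attributes of each instance."""
    return [{name: inst[name] for name in names} for inst in instances]


print(attributes)
print(project(INSTANCES, variables))
```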

23 Types of Data
Quantitative Data: Numerical data, either continuous (e.g., amount of sales, temperature) or integer (e.g., number of students in a class).
Qualitative Data: Data that approximates or characterizes but does not measure, e.g., present or absent, level of agreement.
Categorical Data: Data that represents one of several (limited) categories, e.g., color of an object, gender of the customer. Such data is also sometimes called discrete, as it represents well-separated categories.

24 Measurement Levels
Nominal: A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works). Examples of nominal variables include region, zip code, and religious affiliation.
Ordinal: A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied). Examples of ordinal variables include attitude scores representing degree of satisfaction or confidence and preference rating scores.
Scale: A variable can be treated as scale when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables include age in years and income in thousands of dollars.
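The practical difference between the three levels is which comparisons make sense. The sketch below shows one operation per level. The intermediate satisfaction levels are an assumption (the slide names only the two endpoints), and the function names are illustrative.

```python
# Ordinal: the order of the list defines the ranking (middle levels are assumed).
SATISFACTION_ORDER = [
    "highly dissatisfied",
    "dissatisfied",
    "neutral",
    "satisfied",
    "highly satisfied",
]
RANK = {level: i for i, level in enumerate(SATISFACTION_ORDER)}


def same_category(a, b):
    """Nominal: only equality is meaningful."""
    return a == b


def is_higher(a, b):
    """Ordinal: order is meaningful, but distances between levels are not."""
    return RANK[a] > RANK[b]


def difference(a, b):
    """Scale: differences between values are meaningful."""
    return a - b


print(same_category("north", "south"))
print(is_higher("satisfied", "dissatisfied"))
print(difference(42, 30))  # e.g. ages in years
```

Note that the integer ranks are only a device for comparing order; subtracting two ranks would treat the ordinal variable as if it were scale.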

25 Class Activity 1
Identify the different types of data and assign a measurement level to each:

26 Class Activity 2
Identify the different types of data and assign a measurement level to each:


28 Issues with Input Data
For many reasons, real-world data is sometimes inaccurate, inexact, or incomplete, as opposed to what data mining algorithms assume.
Sparse Data: Most attributes of the data may contain zero values. For example, if market basket data records the purchases made by customers, then the quantity will be zero for the many products that a given customer has not purchased.
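A common way to handle sparse rows is to store only the non-zero entries. The sketch below shows the idea for one hypothetical market basket; the representation (a dictionary from product index to quantity) and the function names are assumptions.

```python
def to_sparse(quantities):
    """Keep only non-zero quantities, keyed by product index."""
    return {i: q for i, q in enumerate(quantities) if q != 0}


def from_sparse(sparse, length):
    """Rebuild the full row; products that are not stored have quantity zero."""
    return [sparse.get(i, 0) for i in range(length)]


# One customer's purchases across eight products (hypothetical).
basket = [0, 0, 3, 0, 0, 0, 1, 0]
sparse = to_sparse(basket)

print(sparse)
print(from_sparse(sparse, len(basket)) == basket)
```

With thousands of products and a handful of purchases per customer, the sparse form stores a few entries per row instead of thousands of zeros.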

29 Missing Values
A respondent in a survey may refuse to answer a few questions, a malfunctioning instrument may fail to record data for some attributes, or the values of some attributes may not be measurable in some circumstances. Such datasets will then contain missing values for specific attributes.
Missing values may be represented in the dataset by an out-of-range value, by a negative value if the attribute cannot be negative, by a dash, by a question mark, etc.
When collecting or recording data, an organization may not find an attribute useful for its own operations even though that attribute is important for the mining task. We are then faced with missing attributes. For example, a university may not be interested in the parents' education or income, but these attributes may be significant when mining student data for possible financial aid offers.
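Before mining, the various missing-value markers are usually mapped to one explicit representation. The sketch below converts dashes, question marks, and negative values in attributes that cannot be negative to `None`, and then counts the missing values per attribute. The records, the marker set, and the function names are hypothetical, and the sketch assumes that every attribute listed as non-negative holds numeric text.

```python
MISSING_MARKERS = {"?", "-", ""}


def clean_record(record, non_negative_attrs=()):
    """Replace missing-value markers with None."""
    cleaned = {}
    for name, raw in record.items():
        text = raw.strip()
        if text in MISSING_MARKERS:
            cleaned[name] = None
        elif name in non_negative_attrs and float(text) < 0:
            # A negative value in a non-negative attribute encodes "missing".
            cleaned[name] = None
        else:
            cleaned[name] = text
    return cleaned


def count_missing(records):
    """Number of missing values per attribute."""
    counts = {}
    for record in records:
        for name, value in record.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


# Hypothetical raw records, all values read as text.
RAW = [
    {"age": "34", "income": "-1", "education": "?"},
    {"age": "-", "income": "52000", "education": "BSc"},
]
CLEAN = [clean_record(r, non_negative_attrs=("age", "income")) for r in RAW]
print(CLEAN)
print(count_missing(CLEAN))
```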

30 Example of Missing Values

31 Inaccurate Values
Since the data used for a data mining task was usually not collected or recorded explicitly for this purpose, one should carefully analyze it for rogue attributes or attribute values.
Inaccuracy may arise from:
Typographical errors
Measurement errors
Merging data from different sources
Deliberate misreporting
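Two simple checks often expose rogue values: listing categorical values that occur very rarely (possible typos) and listing numeric values outside a plausible range (possible measurement or entry errors). The sketch below shows both. The data, the thresholds, and the function names are hypothetical, and a flagged value is only a candidate for inspection, not a confirmed error.

```python
from collections import Counter


def rare_values(values, max_count=1):
    """Categorical values seen at most max_count times (possible typos)."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n <= max_count)


def out_of_range(values, low, high):
    """Numeric values outside the plausible interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Hypothetical attribute columns.
cities = ["Quetta", "Quetta", "Karachi", "Karachi", "Qetta", "Karachi"]
ages = [23, 31, 250, 45, -4]

print(rare_values(cities))
print(out_of_range(ages, 0, 120))
```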

32 References I Waheed Noor (CS&IT, UoB, Quetta) CS513-Data Mining March / 32

Input: Concepts, Instances, Attributes

Input: Concepts, Instances, Attributes Input: Concepts, Instances, Attributes 1 Terminology Components of the input: Concepts: kinds of things that can be learned aim: intelligible and operational concept description Instances: the individual,

More information

Machine Learning Chapter 2. Input

Machine Learning Chapter 2. Input Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Input: Concepts, instances, attributes Terminology What s a concept?

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Input: Concepts, instances, attributes Data ining Practical achine Learning Tools and Techniques Slides for Chapter 2 of Data ining by I. H. Witten and E. rank Terminology What s a concept z Classification,

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input Data Mining 1.3 Input Fall 2008 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be learned. Characterized

More information

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.3 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-18) LEARNING FROM EXAMPLES DECISION TREES Outline 1- Introduction 2- know your data 3- Classification

More information

22/10/16. Data Coding in SPSS. Data Coding in SPSS. Data Coding in SPSS. Data Coding in SPSS

22/10/16. Data Coding in SPSS. Data Coding in SPSS. Data Coding in SPSS. Data Coding in SPSS DATA CODING IN SPSS STAFF TRAINING WORKSHOP March 28, 2017 Delivered by Dr. Director of Applied Economics Unit African Heritage Institution Enugu Nigeria To code data in SPSS, Lunch the SPSS The Data Editor

More information

Data Representation Information Retrieval and Data Mining. Prof. Matteo Matteucci

Data Representation Information Retrieval and Data Mining. Prof. Matteo Matteucci Data Representation Information Retrieval and Data Mining Prof. Matteo Matteucci Instances, Attributes, Concepts 2 Instances The atomic elements of information from a dataset Also known as records, prototypes,

More information

Data Mining Algorithms: Basic Methods

Data Mining Algorithms: Basic Methods Algorithms: The basic methods Inferring rudimentary rules Data Mining Algorithms: Basic Methods Chapter 4 of Data Mining Statistical modeling Constructing decision trees Constructing rules Association

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

Research Data Analysis using SPSS. By Dr.Anura Karunarathne Senior Lecturer, Department of Accountancy University of Kelaniya

Research Data Analysis using SPSS. By Dr.Anura Karunarathne Senior Lecturer, Department of Accountancy University of Kelaniya Research Data Analysis using SPSS By Dr.Anura Karunarathne Senior Lecturer, Department of Accountancy University of Kelaniya MBA 61013- Business Statistics and Research Methodology Learning outcomes At

More information

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12 Association Rules Charles Sutton Data Mining and Exploration Spring 2012 Based on slides by Chris Williams and Amos Storkey The Goal Find patterns : local regularities that occur more often than you would

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

Chapter 3: Data Mining:

Chapter 3: Data Mining: Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

What Is Data Mining? CMPT 354: Database I -- Data Mining 2 Data Mining What Is Data Mining? Mining data mining knowledge Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data CMPT

More information

COMP33111: Tutorial and lab exercise 7

COMP33111: Tutorial and lab exercise 7 COMP33111: Tutorial and lab exercise 7 Guide answers for Part 1: Understanding clustering 1. Explain the main differences between classification and clustering. main differences should include being unsupervised

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten

Representing structural patterns: Reading Material: Chapter 3 of the textbook by Witten Representing structural patterns: Plain Classification rules Decision Tree Rules with exceptions Relational solution Tree for Numerical Prediction Instance-based presentation Reading Material: Chapter

More information

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM

DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM 1 Proceedings of SEAMS-GMU Conference 2007 DESIGN AND IMPLEMENTATION OF BUILDING DECISION TREE USING C4.5 ALGORITHM KUSRINI Abstract. Decision tree is one of data mining techniques that is applied in classification

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Data Mining Input: Concepts, Instances, and Attributes

Data Mining Input: Concepts, Instances, and Attributes Data Mining Input: Concepts, Instances, and Attributes Chapter 2 of Data Mining Terminology Components of the input: Concepts: kinds of things that can be learned Goal: intelligible and operational concept

More information

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation

Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Preprocessing Data Normalization and denormalization Missing values Outliers detection and removing Noisy Data Variants of Attributes Meta Data Data Transformation Reading material: Chapters 2 and 3 of

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

Classification and Regression

Classification and Regression Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan

More information

Data analysis using Microsoft Excel

Data analysis using Microsoft Excel Introduction to Statistics Statistics may be defined as the science of collection, organization presentation analysis and interpretation of numerical data from the logical analysis. 1.Collection of Data

More information

CS434 Notebook. April 19. Data Mining and Data Warehouse

CS434 Notebook. April 19. Data Mining and Data Warehouse CS434 Notebook April 19 2017 Data Mining and Data Warehouse Table of Contents The DM Process MS s view (DMX)... 3 The Basics... 3 The Three-Step Dance... 3 Few Important Concepts... 4 More on Attributes...

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k

More information

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000.

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000. Homework # 4 1. Attribute Types Classify the following attributes as binary, discrete, or continuous. Further classify the attributes as qualitative (nominal or ordinal) or quantitative (interval or ratio).

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

Chapter 1 Introduction to Statistics

Chapter 1 Introduction to Statistics Corresponds to ELEMENTARY STATISTICS USING THE TI 83/84 PLUS CALCULATOR 3rd ed. Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by Mario F. Triola Chapter 1 Introduction

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification Data Mining 3.3 Fall 2008 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rules With Exceptions Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats

Data Mining. Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka. I. Data sets. I.1. Data sets characteristics and formats Data Mining Lab 1: Data sets: characteristics, formats, repositories Introduction to Weka I. Data sets I.1. Data sets characteristics and formats The data to be processed can be structured (e.g. data matrix,

More information
