Data Mining. Part 1: Introduction. 1.4 Input. Spring 2010. Instructor: Dr. Masoud Yaghini


Data Mining. Part 1: Introduction. 1.4 Input. Spring 2010. Instructor: Dr. Masoud Yaghini

Outline
- Instances
- Attributes
- References

Instances

Instance: an individual, independent example of the concept to be learned, characterized by a predetermined set of attributes. The input to the learning process is a set of instances (a dataset). Each dataset is represented as a matrix of instances versus attributes, i.e., a single table or flat file. This is a rather restricted form of input: it captures no relationships between objects, yet problems often involve relationships between objects rather than separate, independent instances.

Example: a family tree is given, and we want to learn the concept "sister". The tree is the input to the learning process, along with a list of pairs of people and an indication of whether or not they are sisters.

Two ways of expressing the sister-of relation: neither table is of any use without the family tree itself.

Family tree represented as a table: these tables do not contain independent sets of instances, because values in the Name, Parent1, and Parent2 columns refer to rows of the family tree relation.

The sister-of relation represented in a table: each instance is an individual, independent example of the concept to be learned.

A simple rule for the sister-of relation can then be read off the flattened table: if the second person is female and the two people have the same parents, then sister-of = yes.

Denormalization (or flattening): several relations are joined together into one, recasting the data into a set of independent instances. This is possible with any finite set of finite relations. Problem: denormalization may produce false regularities that merely reflect the structure of the database. Example: a "rule" that the supplier predicts the supplier's address.
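A minimal sketch of this pitfall, using a hypothetical two-relation database (the table and column names are invented for illustration): after joining orders with suppliers, the supplier name perfectly predicts the supplier address, but only because the address is copied into every joined row.

```python
# Hypothetical two-relation database: orders reference suppliers by key.
suppliers = {
    "S1": {"name": "Acme", "address": "12 High St"},
    "S2": {"name": "Bolt Co", "address": "9 Low Rd"},
}
orders = [
    {"order": 1, "supplier": "S1", "amount": 100},
    {"order": 2, "supplier": "S1", "amount": 250},
    {"order": 3, "supplier": "S2", "amount": 80},
]

def denormalize(orders, suppliers):
    """Join the two relations into one flat table of 'independent' instances."""
    flat = []
    for o in orders:
        s = suppliers[o["supplier"]]
        flat.append({**o, "supplier_name": s["name"],
                     "supplier_address": s["address"]})
    return flat

flat = denormalize(orders, suppliers)
# In the flattened table, supplier_name determines supplier_address in every
# row -- a "regularity" a miner could find, yet it only mirrors the database
# key structure, not anything about the domain.
```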

The input to a data mining scheme is generally expressed as a table of independent instances of the concept to be learned. The instances are the rows of the table; the attributes are the columns.

Attributes

Each instance is described by a fixed, predefined set of features or attributes. Problem: the number of natural attributes may vary across instances; for example, if the instances are transportation vehicles, different kinds of vehicle have different natural attributes. Possible solution: make each possible feature an attribute, and use a special flag value to indicate that a particular attribute is not available for a particular case.

Another problem: the existence of one attribute may depend on the value of another. For example, the attribute spouse's name is meaningful only when the married-or-single attribute has the value married.

Attribute Types
Possible attribute types (levels of measurement):
- Numeric (interval-scaled) quantities
- Binary (Boolean) quantities
- Categorical (nominal) quantities
- Ordinal quantities

Numeric Quantities
Numeric (interval-scaled) quantities are continuous measurements on a roughly linear scale. They are not only ordered but also measured in fixed and equal units. Example 1: the attribute temperature expressed in degrees Fahrenheit. Example 2: the attribute date (year). The difference of two values makes sense.
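A tiny sketch of why differences are the meaningful operation on an interval scale (the temperature values are invented for illustration):

```python
# Interval-scaled attribute: fixed, equal units make differences meaningful.
temps_f = {"monday": 64.0, "tuesday": 72.0, "wednesday": 83.0}

warming = temps_f["wednesday"] - temps_f["monday"]  # 19.0 degrees F
# The difference (19.0) is a meaningful quantity. A ratio such as 83/64 is
# not, because 0 degrees F is an arbitrary point, not a true zero.
```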

Binary Quantities
A binary (Boolean) quantity has only two states, 0 or 1, where 0 means the variable is absent and 1 means it is present. Given the variable smoker describing a patient, 1 indicates that the patient smokes and 0 indicates that the patient does not.

Categorical Quantities
Categorical (nominal) attributes take values from a prespecified, finite set of possibilities. The values are distinct symbols and serve only as labels or names. Example: the attribute outlook from the weather data, with values sunny, overcast, and rainy. No relation is implied among nominal values (no ordering or distance measure). A special case is the binary attribute, e.g. true/false or yes/no.

Note: addition, subtraction, and order comparisons do not make sense for nominal values; only equality tests can be performed.
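A minimal sketch of the equality-only rule, using the outlook attribute from the slides (the helper function is hypothetical):

```python
# Nominal attribute: values are mere labels, so equality is the one
# meaningful test. The outlook values follow the weather data.
OUTLOOK_VALUES = {"sunny", "overcast", "rainy"}

def outlook_matches(instance_value, rule_value):
    """Test a nominal attribute against a rule condition: equality only."""
    assert instance_value in OUTLOOK_VALUES and rule_value in OUTLOOK_VALUES
    return instance_value == rule_value

# Note that "sunny" < "rainy" would merely compare strings alphabetically,
# which is meaningless for nominal data -- so no such test is offered.
```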

Ordinal Quantities
Ordinal quantities make it possible to rank-order the categories, but no distance between values is defined. Example: the attribute temperature in the weather data, with the ordering hot > mild > cool. It makes sense to compare two values, but addition and subtraction do not. Example rule: temperature < hot => play = yes. The distinction between nominal and ordinal is not always clear (e.g., the attribute outlook).
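The example rule above can be sketched by encoding the rank explicitly, so comparisons follow the domain ordering hot > mild > cool rather than alphabetical string order (the encoding and function names are illustrative):

```python
# Ordinal attribute: an explicit rank makes order comparisons meaningful,
# while still defining no distance between the values.
TEMPERATURE_RANK = {"cool": 0, "mild": 1, "hot": 2}

def temp_less_than(a, b):
    """Order comparison is legal for ordinal values; subtraction is not."""
    return TEMPERATURE_RANK[a] < TEMPERATURE_RANK[b]

def rule_fires(temperature):
    # The slide's example rule: temperature < hot  =>  play = yes
    return temp_less_than(temperature, "hot")
```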

References

Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Elsevier, 2005 (Chapter 2).

The end