Input: Concepts, Instances, Attributes


Terminology
Components of the input:
- Concepts: kinds of things that can be learned
  - Aim: intelligible and operational concept description
- Instances: the individual, independent examples of a concept
  - Note: more complicated forms of input are possible
- Attributes: measuring aspects of an instance
  - We will focus on nominal and numeric ones

Kinds of Learning
- Concept: thing to be learned
- Concept description: output of learning scheme
- Different kinds of learning:
  - Classification learning: predicting a discrete class
    - Numeric prediction: the variant in which the class is numeric
  - Clustering: grouping similar instances into clusters
  - Association learning: detecting associations between attributes

What's in an example?
- Instance: specific type of example
  - Thing to be classified, associated, or clustered
  - Individual, independent example of target concept
  - Characterized by a predetermined set of attributes
- Input to learning scheme: set of instances (dataset)
  - Represented as a single relation (flat file)
  - Rather restricted form of input
  - No relationships between objects

What's in an attribute?
- Each instance is described by a fixed, predefined set of features (attributes)
- The number of attributes may vary in practice
- Example of a dataset:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                Hard
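
As a rough illustration of the "single relation / flat file" idea, the four example rows above can be stored as one record per instance, each over the same fixed attribute set. This is only a sketch: the names `ATTRIBUTES`, `ROWS`, `instances`, and `has_fixed_schema` are made up for this example and do not come from the slides.

```python
# Attribute names of the relation; the last one plays the role of the class.
ATTRIBUTES = ("age", "spectacle_prescription", "astigmatism",
              "tear_production_rate", "recommended_lenses")

# One tuple per instance, values copied from the example dataset above.
ROWS = [
    ("young", "myope", "no", "reduced", "none"),
    ("young", "myope", "no", "normal", "soft"),
    ("young", "hypermetrope", "yes", "reduced", "none"),
    ("pre-presbyopic", "hypermetrope", "yes", "normal", "hard"),
]

# A flat file: every instance is a dict over the same fixed attribute set.
instances = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def has_fixed_schema(dataset, attributes):
    """Return True if every instance has exactly the given attributes."""
    return all(set(inst) == set(attributes) for inst in dataset)


print(has_fixed_schema(instances, ATTRIBUTES))
print(instances[1]["recommended_lenses"])
```

Nothing in this representation can express a relationship between two instances, which is the restriction the previous slide points out.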

Classification learning
- Classification learning is supervised
  - Scheme is provided with actual outcome
- Outcome is called the class of the example
- When the class type is numeric, it becomes a numeric prediction task
- Measure success on fresh data for which class labels are known (test data)
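
The idea of measuring success on labelled test data can be sketched in a few lines. The example below uses the six Iris rows that appear in the clustering table later in this deck, split by hand into three training and three test instances. The 1-nearest-neighbour rule is an assumption chosen only to keep the sketch short; the slides do not prescribe any particular learning scheme here, and three test instances are far too few for a meaningful estimate.

```python
import math

# (sepal length, sepal width, petal length, petal width), class label
train = [
    ((5.1, 3.5, 1.4, 0.2), "Iris setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris virginica"),
]
# Fresh instances whose true class is known but hidden from the learner.
test = [
    ((4.9, 3.0, 1.4, 0.2), "Iris setosa"),
    ((6.4, 3.2, 4.5, 1.5), "Iris versicolor"),
    ((5.8, 2.7, 5.1, 1.9), "Iris virginica"),
]


def predict(features, training_set):
    """1-nearest-neighbour: return the class of the closest training instance."""
    _, label = min(training_set, key=lambda inst: math.dist(features, inst[0]))
    return label


def accuracy(test_set, training_set):
    """Fraction of test instances whose predicted class matches the true class."""
    correct = sum(predict(x, training_set) == y for x, y in test_set)
    return correct / len(test_set)


print(accuracy(test, train))
```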

Classification Learning [figure not recovered]

Flower Classification: Iris Data Set
- Learn to distinguish 3 different kinds of iris flower: setosa, versicolor, and virginica
- Extract 4 features: sepal length, sepal width, petal length, petal width

Flower Classification: Iris Data Set
- Scatter plot for exploratory data analysis [figure not recovered]

Clustering
- Finding groups of items that are similar
- Clustering is unsupervised
  - The class of an example is not known
- Success sometimes measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
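
To make "unsupervised" concrete, the sketch below groups the six rows of the table above using only the four measurements; the Type column is never shown to the algorithm. k-means with a deterministic farthest-first seeding is an assumption made for this illustration, not a method named in the slides, and `farthest_first` and `k_means` are hypothetical helper names.

```python
import math

# The six Iris rows from the table above, WITHOUT the Type column:
# (sepal length, sepal width, petal length, petal width)
points = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]


def mean(group):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))


def farthest_first(data, k):
    """Deterministic seeding: start at data[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [data[0]]
    while len(centroids) < k:
        centroids.append(
            max(data, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def k_means(data, k, max_iter=20):
    """Return one cluster index per point."""
    centroids = farthest_first(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in data
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels


labels = k_means(points, k=3)
print(labels)
```

On these six rows the clusters happen to coincide with the three species, but the cluster indices themselves carry no meaning, which is one reason success is often judged subjectively.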

More on attributes
- Each instance is described by a fixed, predefined set of features (attributes)
- But: the number of attributes may vary in practice
- Related problem: the existence of an attribute may depend on the value of another one
- Possible attribute types ("levels of measurement"): nominal and ordinal

Nominal quantities
- Values are distinct symbols
  - Values themselves serve only as labels or names
  - "Nominal" comes from the Latin word for name
- Example: attribute "outlook" from weather data
  - Values: "sunny", "overcast", and "rainy"
- No relation is implied among nominal values (no ordering or distance measure)
- Only equality tests can be performed
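
One way to see "only equality tests" in code is a plain Python `Enum`, which supports `==` but deliberately rejects `<`. This is a sketch of the idea; the class name `Outlook` is chosen for illustration.

```python
from enum import Enum


class Outlook(Enum):
    """Nominal attribute: the values are just names."""
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


a, b = Outlook.SUNNY, Outlook.RAINY

# Equality is the only meaningful comparison.
print(a == b)

# A plain Enum defines no ordering, so '<' raises TypeError.
try:
    a < b
except TypeError:
    ordering_supported = False
else:
    ordering_supported = True
print(ordering_supported)
```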

Ordinal quantities
- Impose order on values
- But: no distance between values defined
- Example: attribute "temperature" in weather data
  - Values: "hot" > "mild" > "cool"
- Note: addition and subtraction don't make sense
- Example rule: temperature < hot => play = yes
- Distinction between nominal and ordinal not always clear (e.g. attribute "outlook")
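
An ordinal attribute can be sketched as an `Enum` that defines ordering but no arithmetic. The integer codes below are an implementation detail used only to rank the values; their differences mean nothing. The names `Temperature` and `play` are hypothetical, and the rule is the example rule from the slide.

```python
from enum import Enum


class Temperature(Enum):
    """Ordinal attribute: cool < mild < hot, with no distance defined."""
    COOL = 1
    MILD = 2
    HOT = 3

    def __lt__(self, other):
        # Only ordering is defined; the codes are ranks, not quantities.
        if isinstance(other, Temperature):
            return self.value < other.value
        return NotImplemented


def play(temperature):
    """Example rule: temperature < hot => play = yes.
    Returns None when the rule does not fire."""
    return "yes" if temperature < Temperature.HOT else None


print(play(Temperature.MILD))
print(play(Temperature.HOT))

# Subtraction is not defined for the attribute values themselves.
try:
    Temperature.HOT - Temperature.MILD
except TypeError:
    subtraction_supported = False
else:
    subtraction_supported = True
print(subtraction_supported)
```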

Attribute Types (Other terminology)
- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called categorical, enumerated, or discrete
  - Special case: dichotomy (Boolean attribute)
- Ordinal attributes are sometimes called numeric or continuous
  - But: "continuous" implies mathematical continuity

Inaccurate values
- Reason: the data was not collected with mining in mind
- Result: errors and omissions that don't affect the original purpose of the data (e.g. age of customer)
- Typographical errors in nominal attributes => values need to be checked for consistency
- Typographical and measurement errors in numeric attributes => outliers need to be identified
- Errors may be deliberate (e.g. wrong zip codes)
- Other problems: duplicates, stale data
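
The two checks named above can be sketched as follows. The column values are invented purely for illustration, and flagging numeric outliers with Tukey's 1.5 x IQR rule is an assumption; the slides only say that outliers need to be identified, not how.

```python
import statistics


def inconsistent_values(values, allowed):
    """Return the distinct values that are not in the allowed set, sorted."""
    return sorted(set(values) - set(allowed))


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical column values, made up for this example only.
outlook = ["sunny", "rainy", "suny", "overcast", "Rainy"]
ages = [23, 31, 35, 38, 41, 44, 52, 230]

# Nominal attribute: misspellings and case variants show up as unknown values.
print(inconsistent_values(outlook, {"sunny", "overcast", "rainy"}))

# Numeric attribute: implausible measurements show up as outliers.
print(iqr_outliers(ages))
```

Flagged values still need a human decision: a value outside the allowed set may be a typo or a legitimate new category, and an outlier may be an error or a genuine extreme.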

Noise in Data
- In general, errors in data are called noise
- A dataset that contains such errors is called noisy data
- Missing values and inaccurate values are examples of noise
- Practical machine learning algorithms should be able to handle noise
- If the amount of data is large, a small amount of noise usually does not obscure the true underlying patterns