Introduction to Artificial Intelligence


Introduction to Artificial Intelligence
COMP307 Machine Learning 2: 3-K Techniques
Yi Mei (yi.mei@ecs.vuw.ac.nz)

Outline
- K-Nearest Neighbour method: classification (supervised learning); basic NN (1-NN); K-Nearest Neighbour (KNN); distance/similarity measures
- K-fold cross validation; leave-one-out cross validation; K-fold cross validation vs validation set
- K-means clustering: unsupervised learning

Classification for the Iris Flower Dataset
Three classes of Iris flower: Setosa, Versicolor, Virginica
Four features: length (cm) and width (cm) of sepals and petals
150 instances (lines), 50 for each class:
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.3, 3.3, 6.0, 2.5, Iris-virginica
5.8, 2.7, 5.1, 1.9, Iris-virginica

Nearest Neighbour Classifier
Directly use the training instances to classify an unseen instance (e.g. a test instance): it shares the class of the most similar training instance.
1. Compare the unseen instance with all the training instances.
2. Find the nearest neighbour: the training instance that is closest to the unseen instance.
3. Classify the unseen instance as the class of its nearest neighbour.
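The three steps above can be sketched in Python; the function name and the Iris-style toy rows are illustrative, not from the slides:

```python
import math

def nearest_neighbour(train, query):
    """Classify `query` as the class of its closest training instance.
    `train` is a list of (feature_vector, label) pairs."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Steps 1-2: compare with all training instances, keep the nearest
    nearest = min(train, key=lambda pair: euclidean(pair[0], query))
    # Step 3: return the class of the nearest neighbour
    return nearest[1]

# Toy Iris-style data: (sepal length, sepal width, petal length, petal width)
train = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
]
print(nearest_neighbour(train, (5.0, 3.4, 1.5, 0.2)))  # → Iris-setosa
```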

K-Nearest Neighbour
Similar to the nearest neighbour method, but finds the k nearest instances in the training set, then assigns the majority class among them as the class label of the unseen instance.
- k is usually an odd number (to avoid tied votes)
- Results depend on the value of k
- Needs a distance/similarity measure
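A minimal sketch of the KNN idea above, using plain Euclidean distance and a majority vote; the function name and the toy 2-D data are illustrative:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort training instances by distance to the query and take the k closest
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    # Majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"),
]
print(knn_classify(train, (1.1, 1.0), k=3))  # → A
```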

Distance Measure
Given two feature vectors with numeric values, A = (a_1, ..., a_n) and B = (b_1, ..., b_n), a common distance measure is the range-normalised Euclidean distance:

d(A, B) = sqrt( sum_{i=1..n} ((a_i - b_i) / R_i)^2 )

where R_i is the range of the i-th feature. Why divide by the range? So that features measured on large scales do not dominate features measured on small scales.
Euclidean distance is easy to use and achieves good results in many cases, but comparing the unseen instance against every training instance raises efficiency concerns.
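The range-normalised distance in the formula above might be computed as follows; the function names and the toy two-feature data are illustrative:

```python
def feature_ranges(data):
    """Range (max - min) of each feature across the dataset."""
    return [max(col) - min(col) for col in zip(*data)]

def normalised_distance(a, b, ranges):
    """Euclidean distance with each feature difference divided by that
    feature's range, so all features contribute on a comparable scale."""
    return sum(((x - y) / r) ** 2 for x, y, r in zip(a, b, ranges)) ** 0.5

# Second feature is on a 100x larger scale; normalisation evens this out
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
ranges = feature_ranges(data)   # [2.0, 200.0]
d = normalised_distance(data[0], data[1], ranges)
print(round(d, 3))  # → 1.118
```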

K-fold Cross Validation
We usually divide the dataset into training and test sets: train a classifier on the training set, then calculate the test performance (e.g. test error) on the test (unseen) set. Different training/test divisions lead to different test performance.
Example: learning algorithms A and B, dataset of instances 1-100.
- Train on 1-70, test on 71-100: A gets 96% accuracy, B gets 93%.
- Train on 31-100, test on 1-30: A gets 91% accuracy, B gets 95%.
Which algorithm is better? K-fold cross validation makes the estimate less dependent on the division.

K-fold Cross Validation
Idea: split the data into K equal subsets. For each subset:
- Treat it as the test set, and the remaining K-1 subsets as the training set.
- Train the classifier on the training set and apply it to the test set.
The training/test process is repeated K times (the folds), with each of the K subsets used exactly once as the test set. The K fold results are then averaged (or otherwise combined) to produce a single estimate. This can be used to compare two algorithms, or to measure the performance of a particular algorithm when the dataset is small.

K-fold Cross Validation: Example
10-fold cross validation: divide the data into 10 equally-sized subsets. In each fold, one subset serves as the test set and the remaining nine form the training set.
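The procedure above can be sketched as follows; the fold-splitting helper and the scoring callback are illustrative names, not from the slides:

```python
def k_fold_indices(m, k):
    """Split indices 0..m-1 into k (nearly) equal consecutive folds."""
    fold_size, remainder = divmod(m, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    """Use each fold once as the test set, train on the rest,
    and average the k resulting scores into a single estimate."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in set(fold)]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Dummy scorer that just reports each fold's test-set size
avg = cross_validate(list(range(10)), 5, lambda train, test: len(test))
print(avg)  # → 2.0
```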

Leave-one-out Cross Validation
Very similar to K-fold cross validation, but each time only one instance is used as the test set: the special case where the subset size is 1. The process is repeated m times, where m is the number of instances in the dataset.
Note: K-fold cross validation (including leave-one-out) is an experimental design method for setting up experiments for supervised learning tasks (classification and regression). It is NOT a machine learning or classification method.

Cross Validation vs Validation Set
Validation set:
- A dataset, separate from the training and test datasets.
- Used to monitor the training process (e.g. to avoid overfitting), but not directly used to learn the classifier.
Cross validation:
- An experimental design method, NOT a dataset.
- Uses only training and test datasets; there is no validation set.

K-means Clustering
Given unlabeled data, we expect to obtain good partitions of the instances using learning techniques. This requires clustering techniques, i.e. unsupervised learning.

K-means Clustering
K-means clustering is a method of cluster analysis that aims to partition m instances into k clusters, in which each instance belongs to the cluster with the nearest mean.
- Needs some kind of distance measure, such as Euclidean distance.
- Needs the number of clusters k to be assumed in advance.

K-means Clustering: Algorithm
1. Set k initial means randomly from the data set.
2. Create k clusters by associating every instance with the nearest mean based on a distance measure:
   c(x) = argmin_{i = 1..k} dist(c_i, x)
3. Replace the old means with the centroid of each of the k clusters (the new means):
   c_i = (1 / |S_i|) * sum_{x in S_i} x
   where S_i is the set of instances in cluster i. Note that the centroid is generally not one of the instances.
4. Repeat steps 2 and 3 until convergence (no change in any cluster centre).
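The algorithm above can be sketched as a short Python function; this is a minimal implementation, assuming numeric tuples as instances and a fixed random seed for the initial means:

```python
import math
import random

def k_means(data, k, max_iters=100, seed=0):
    """Basic k-means: pick k initial means from the data, then alternate
    assignment and centroid-update steps until clusters stop changing."""
    rng = random.Random(seed)
    means = rng.sample(data, k)                     # step 1: initial means
    assignment = None
    for _ in range(max_iters):
        # Step 2: associate every instance with its nearest mean
        new_assignment = [
            min(range(k), key=lambda i: math.dist(means[i], x)) for x in data
        ]
        if new_assignment == assignment:            # step 4: converged
            break
        assignment = new_assignment
        # Step 3: replace each mean with the centroid of its cluster
        for i in range(k):
            cluster = [x for x, a in zip(data, assignment) if a == i]
            if cluster:  # keep the old mean if a cluster went empty
                means[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return means, assignment

data = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.5, 4.8)]
means, assignment = k_means(data, k=2)
```

With two well-separated groups like this, the first two points end up in one cluster and the last two in the other, regardless of which initial means are sampled.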

K-Means Clustering: An Example [figure]

Choosing K (the Number of Clusters)
A very important parameter, but hard to determine in the real world (needs domain knowledge). Many methods exist:
- Elbow method
- X-means clustering
- Silhouette method
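As one example, the elbow method mentioned above compares the within-cluster sum of squared distances (WCSS) across different values of k and looks for the "bend" where adding clusters stops helping much. A minimal sketch, assuming each clustering is given as a list of point lists:

```python
def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster's centroid.
    The elbow method plots this against k and looks for the bend."""
    total = 0.0
    for cluster in clusters:
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(point, centroid))
            for point in cluster
        )
    return total

# Two well-separated groups: k=2 drops WCSS sharply compared with k=1
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
k1 = wcss([points])                     # one cluster holding everything
k2 = wcss([points[:2], points[2:]])     # the natural two-cluster split
print(k1 > k2)  # → True
```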

Summary
- Nearest neighbour method: a classification method
- K-nearest neighbour method: a classification method
- Distance measures: ways of comparing two feature vectors
- K-fold cross validation: an experimental design method, NOT a learning method
- Validation set: a dataset, NOT a method
- K-means: a clustering method, NOT for classification