Clustering

- Unsupervised: no target value to predict
- Differences between models/algorithms:
  - Exclusive vs. overlapping
  - Deterministic vs. probabilistic
  - Hierarchical vs. flat
  - Incremental vs. batch learning
- Problem: evaluation? Usually done by inspection
- But: if clustering is treated as a density estimation problem, clusters can be evaluated on test data!

Hierarchical clustering

- Bottom up:
  - Start with single-instance clusters
  - At each step, join the two closest clusters
  - Design decision: distance between clusters, e.g. distance between the two closest instances in the clusters vs. distance between the cluster means
- Top down:
  - Start with one universal cluster
  - Find two clusters
  - Proceed recursively on each subset
  - Can be very fast
- Both methods produce a dendrogram (the slide shows one over instances a through k)
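To make the bottom-up procedure concrete, here is a minimal sketch (Python is assumed; the slides give no code) of single-link agglomerative clustering, where the distance between two clusters is the distance between their two closest instances. The function name `agglomerate` and the toy points are invented for illustration.

```python
# Minimal single-link agglomerative clustering sketch (illustrative only).
import numpy as np

def agglomerate(points, target_k=1):
    """Merge single-instance clusters until only target_k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start: one instance per cluster
    merges = []                                     # merge history (a dendrogram in list form)
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link: closest pair of instances across the two clusters
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]     # join the two closest clusters
        del clusters[b]
    return clusters, merges

points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])
final, history = agglomerate(points, target_k=2)
print(final)   # the lone point at (9, 0.5) ends up in a cluster of its own
```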

The k-means algorithm

To cluster data into k groups (k is predefined):
1. Choose k cluster centers, e.g. at random
2. Assign instances to clusters based on distance to cluster centers
3. Compute centroids of clusters
4. Go to step 2 until convergence
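A minimal sketch of this loop (Python assumed; the toy data and the function name `k_means` are invented for illustration):

```python
import numpy as np

def k_means(X, k, n_iter=100, rng=np.random.default_rng(0)):
    # 1. Choose k cluster centers, e.g. at random (here: k random instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign instances to clusters based on distance to cluster centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Compute centroids of clusters (keep the old center if a cluster is empty)
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        # 4. Go to step 2 until convergence (centroids stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
centers, assign = k_means(X, k=2)
print(centers, assign)
```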

Discussion

- Result can vary significantly based on initial choice of seeds
- Can get trapped in a local minimum
- Example: with an unlucky choice of initial cluster centres, k-means converges to a poor grouping of the instances (the slide illustrates this with a small example)
- To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below)
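A sketch of the restart idea, reusing the `k_means` function and the toy data `X` from the previous sketch; "best" here is taken to mean the lowest within-cluster sum of squared distances, which is one common choice rather than the only one.

```python
import numpy as np

def total_sq_distance(X, centers, assign):
    # within-cluster sum of squared distances to the assigned center
    return sum(np.sum((X[assign == j] - centers[j]) ** 2) for j in range(len(centers)))

best = None
for seed in range(10):                      # 10 restarts with different random seeds
    centers, assign = k_means(X, k=2, rng=np.random.default_rng(seed))
    score = total_sq_distance(X, centers, assign)
    if best is None or score < best[0]:
        best = (score, centers, assign)
print(best[0])
```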

Incremental clustering

- Heuristic approach (COBWEB/CLASSIT)
- Forms a hierarchy of clusters incrementally
- Start: the tree consists of an empty root node
- Then: add instances one by one, updating the tree appropriately at each stage
  - To update, find the right leaf for the instance
  - May involve restructuring the tree
- Update decisions are based on category utility

Clustering weather data

[The slide shows the 14 weather instances (IDs A-N, with attributes Outlook, Temp., Humidity and Windy) and the first stages of the incremental clustering tree; the table and tree diagrams do not survive the transcription.]

Clustering weather data (continued)

[The slide shows later stages of the tree as the weather instances A-N are added one at a time; the table itself is garbled in the transcription.]

- Merge best host and runner-up
- Consider splitting the best host if merging doesn't help

Final hierarchy

[The slide shows the final cluster hierarchy over the weather instances; the table is garbled in the transcription.]

- Oops! a and b are actually very similar

Example: the iris data (subset)

Clustering with cutoff

Category utility

- Category utility: a quadratic loss function defined on conditional probabilities:

  CU(C_1, C_2, \dots, C_k) = \frac{1}{k} \sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)

- If every instance is put into a different category, the numerator reaches its maximum value

  m - \sum_i \sum_j \Pr[a_i = v_{ij}]^2

  where m = number of attributes
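A direct transcription of the formula above for nominal attributes (Python assumed; the data layout is invented: each instance is a tuple of attribute values, and a clustering is a list of lists of instances):

```python
from collections import Counter

def category_utility(clusters):
    k = len(clusters)
    instances = [x for c in clusters for x in c]
    n = len(instances)
    m = len(instances[0])                              # number of attributes
    cu = 0.0
    for cluster in clusters:
        p_c = len(cluster) / n                         # Pr[C_l]
        inner = 0.0
        for i in range(m):
            within = Counter(x[i] for x in cluster)    # counts of v_ij within C_l
            overall = Counter(x[i] for x in instances) # counts of v_ij overall
            inner += sum((cnt / len(cluster)) ** 2 for cnt in within.values())
            inner -= sum((cnt / n) ** 2 for cnt in overall.values())
        cu += p_c * inner
    return cu / k

# Toy example with two invented clusters of two-attribute instances
c1 = [("sunny", "hot"), ("sunny", "mild")]
c2 = [("rainy", "cool"), ("rainy", "cool"), ("overcast", "cool")]
print(category_utility([c1, c2]))
```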

Numeric attributes

- Assume a normal distribution:

  f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(a-\mu)^2}{2\sigma^2} \right)

- Then:

  \sum_j \Pr[a_i = v_{ij}]^2 \;\longmapsto\; \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}

- Thus CU becomes:

  CU = \frac{1}{k} \sum_l \Pr[C_l] \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)

- Prespecified minimum variance: the acuity parameter
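A quick numeric sanity check of the integral used above (a crude Riemann sum in Python; the choice of sigma and the grid are arbitrary):

```python
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

sigma = 3.0
grid = [i * 0.001 for i in range(-20000, 20001)]       # Riemann sum over [-20, 20]
approx = sum(normal_density(a, 0.0, sigma) ** 2 for a in grid) * 0.001
print(approx, 1 / (2 * math.sqrt(math.pi) * sigma))     # both come out near 0.094
```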

Probability-based clustering

- Problems with the heuristic approach:
  - Division by k?
  - Order of examples?
  - Are the restructuring operations sufficient?
  - Is the result at least a local minimum of category utility?
- Probabilistic perspective: seek the most likely clusters given the data
- Also: an instance belongs to a particular cluster with a certain probability

Finite mixtures

- Model data using a mixture of distributions
- One cluster, one distribution: the distribution governs the probabilities of attribute values in that cluster
- "Finite mixtures": finite number of clusters
- Individual distributions are (usually) normal
- Combine the distributions using cluster weights
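To illustrate how a finite mixture generates data: pick a cluster according to the cluster weights, then sample the attribute value from that cluster's normal distribution. The sketch below (Python assumed) uses parameters mirroring the two-class example on the next slide; the function name is invented.

```python
import random

def sample_mixture(n, components):
    # components: list of (weight, mu, sigma); weights are assumed to sum to 1
    data = []
    for _ in range(n):
        r, acc = random.random(), 0.0
        for weight, mu, sigma in components:
            acc += weight
            if r <= acc:
                data.append(random.gauss(mu, sigma))
                break
    return data

random.seed(0)
print(sample_mixture(10, [(0.6, 50, 5), (0.4, 65, 2)]))
```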

Two-class mixture model

[The slide lists a sample of one-attribute instances, each labelled A or B, generated from the model below; the individual values are garbled in the transcription.]

Model:
µ_A = 50, σ_A = 5, p_A = 0.6
µ_B = 65, σ_B = 2, p_B = 0.4

Using the mixture model

- Probability that instance x belongs to cluster A:

  \Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{\Pr[x]}

  with

  f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

- Likelihood of an instance given the clusters:

  \Pr[x \mid \text{the distributions}] = \sum_i \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]
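A small numeric check of the posterior formula, plugging in the two-class model from the previous slide (µ_A=50, σ_A=5, p_A=0.6; µ_B=65, σ_B=2, p_B=0.4). The query value x=62 is just an illustrative choice.

```python
import math

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu_a, sigma_a, p_a = 50.0, 5.0, 0.6
mu_b, sigma_b, p_b = 65.0, 2.0, 0.4

x = 62.0
num_a = normal_density(x, mu_a, sigma_a) * p_a   # f(x; mu_A, sigma_A) * p_A
num_b = normal_density(x, mu_b, sigma_b) * p_b   # f(x; mu_B, sigma_B) * p_B
pr_x = num_a + num_b                              # Pr[x | the distributions]
print("Pr[A | x] =", num_a / pr_x, " Pr[B | x] =", num_b / pr_x)
```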

Learning the clusters

- Assume we know there are k clusters
- To learn the clusters, determine their parameters, i.e. means and standard deviations
- Performance criterion: likelihood of the training data given the clusters
- The EM algorithm finds a local maximum of the likelihood

EM algorithm

- EM = Expectation-Maximization
- Generalizes k-means to a probabilistic setting
- Iterative procedure:
  - E (expectation) step: calculate the cluster probability for each instance
  - M (maximization) step: estimate the distribution parameters from the cluster probabilities
- Store the cluster probabilities as instance weights
- Stop when the improvement is negligible

More on EM

- Estimate parameters from weighted instances:

  \mu_A = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}

  \sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \dots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \dots + w_n}

- Stop when the log-likelihood saturates
- Log-likelihood:

  \sum_i \log\bigl( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \bigr)
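A sketch of the EM loop for the two-class model: the E step computes Pr[A | x_i] as instance weights, the M step re-estimates the means, standard deviations and cluster weight from those weights, and iteration stops when the log-likelihood stops improving. Python, the initialization scheme and the toy data are assumptions, not part of the slides.

```python
import math
import random

def normal_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(xs, n_iter=200, tol=1e-6):
    mu_a, mu_b = min(xs), max(xs)                     # crude initial guesses
    sigma_a = sigma_b = (max(xs) - min(xs)) / 4 or 1.0
    p_a = 0.5
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E step: cluster probability Pr[A | x_i] for each instance (stored as weights)
        w = []
        for x in xs:
            a = p_a * normal_density(x, mu_a, sigma_a)
            b = (1 - p_a) * normal_density(x, mu_b, sigma_b)
            w.append(a / (a + b))
        # M step: weighted means, variances and the cluster weight
        sw = sum(w)
        mu_a = sum(wi * x for wi, x in zip(w, xs)) / sw
        mu_b = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - sw)
        sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, xs)) / sw)
        sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, xs)) / (len(xs) - sw))
        p_a = sw / len(xs)
        # log-likelihood: sum_i log(p_A Pr[x_i|A] + p_B Pr[x_i|B])
        ll = sum(math.log(p_a * normal_density(x, mu_a, sigma_a)
                          + (1 - p_a) * normal_density(x, mu_b, sigma_b)) for x in xs)
        if ll - prev_ll < tol:                        # stop when the log-likelihood saturates
            break
        prev_ll = ll
    return (mu_a, sigma_a, p_a), (mu_b, sigma_b, 1 - p_a), ll

random.seed(0)
data = [random.gauss(50, 5) for _ in range(60)] + [random.gauss(65, 2) for _ in range(40)]
print(em_two_gaussians(data))
```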

Extending the mixture model

- More than two distributions: easy
- Several attributes: easy, assuming independence!
- Correlated attributes: difficult
  - Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  - n attributes: need to estimate n + n(n+1)/2 parameters

More mixture model extensions

- Nominal attributes: easy if independent
- Correlated nominal attributes: difficult
  - Two correlated attributes require v_1 v_2 parameters
- Missing values: easy
- Distributions other than the normal can be used:
  - "log-normal" if a predetermined minimum is given
  - "log-odds" if bounded from above and below
  - Poisson for attributes that are integer counts
- Use cross-validation to estimate k!

Bayesian clustering

- Problem: many parameters, so EM overfits
- Bayesian approach: give every parameter a prior probability distribution
  - Incorporate the prior into the overall likelihood figure
  - Penalizes the introduction of parameters
  - E.g. Laplace estimator for nominal attributes
- Can also have a prior on the number of clusters!
- Implementation: NASA's AUTOCLASS

Discussion

- Clusters can be interpreted by using supervised learning as a post-processing step
- Decrease dependence between attributes? Pre-processing step, e.g. using principal component analysis
- Can be used to fill in missing values
- Key advantage of probabilistic clustering:
  - Can estimate the likelihood of the data
  - Use it to compare different models objectively (see the sketch below)
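A sketch of the "compare models objectively" point: fit each model on training data, then score held-out data by its log-likelihood under each model. This reuses `normal_density` and `em_two_gaussians` from the EM sketch above; the data and the single-Gaussian baseline are invented for illustration.

```python
import math
import random

random.seed(1)
train = [random.gauss(50, 5) for _ in range(60)] + [random.gauss(65, 2) for _ in range(40)]
test  = [random.gauss(50, 5) for _ in range(30)] + [random.gauss(65, 2) for _ in range(20)]

# One-cluster model: a single Gaussian fitted to the training data
mu = sum(train) / len(train)
sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / len(train))
ll_one = sum(math.log(normal_density(x, mu, sigma)) for x in test)

# Two-cluster model fitted with EM
a, b, _ = em_two_gaussians(train)
ll_two = sum(math.log(a[2] * normal_density(x, a[0], a[1])
                      + b[2] * normal_density(x, b[0], b[1])) for x in test)
print("held-out log-likelihood: one cluster", ll_one, " two clusters", ll_two)
```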