CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15


Today. New topic: Unsupervised Learning. Supervised vs. unsupervised learning. Unsupervised learning. Next time: parametric unsupervised learning. Today: nonparametric unsupervised learning = clustering. Proximity measures. Criterion functions. Flat clustering: k-means. Hierarchical clustering: divisive, agglomerative.

Supervised vs. Unsupervised Learning. Up to now we considered the supervised learning scenario, where we are given 1. samples $x_1, \dots, x_n$ and 2. class labels for all samples $x_1, \dots, x_n$. This is also called learning with a teacher, since the correct answer (the true class) is provided. In the next few lectures we consider the unsupervised learning scenario, where we are only given 1. samples $x_1, \dots, x_n$. This is also called learning without a teacher, since the correct answer is not provided; we do not split the data into training and test sets.

Unsupervised Learning. The data is not labeled. 1. Parametric approach: assume a parametric distribution of the data and estimate the parameters of this distribution; a lot is assumed known, so this is the easier option, though still much harder than the supervised case. 2. Nonparametric approach: group the data into clusters; each cluster (hopefully) says something about the categories (classes) present in the data; little is assumed known, so this is harder.

Why Unsupervised Learning? Unsupervised learning is harder: how do we know if the results are meaningful, when no answer labels are available? Either let an expert look at the results (external evaluation), or define an objective function on the clustering (internal evaluation). We nevertheless need it because 1. labeling large datasets is very costly (speech recognition); sometimes we can label only a few examples by hand; 2. we may have no idea what or how many classes there are (data mining); 3. we may want to use clustering to gain some insight into the structure of the data before designing a classifier (clustering as data description).

Clustering. Seek "natural" clusters in the data. What is a good clustering? Internal (within-cluster) distances should be small, and external (between-cluster) distances should be large. Clustering is a way to discover new categories (classes).

What We Need for Clustering. 1. A proximity measure, either a similarity measure $s(x_i, x_k)$, large if $x_i$ and $x_k$ are similar, or a dissimilarity (or distance) measure $d(x_i, x_k)$, small if $x_i$ and $x_k$ are similar. (Dissimilar pair: large $d$, small $s$; similar pair: large $s$, small $d$.) 2. A criterion function to evaluate a clustering (to tell a good clustering from a bad one). 3. An algorithm to compute the clustering, for example by optimizing the criterion function.

How Many Clusters? 3 clusters or 2 clusters? Possible approaches: 1. fix the number of clusters to k; 2. find the best clustering according to the criterion function (the number of clusters may vary).

Proximity Measures. A good proximity measure is VERY application dependent. Clusters should be invariant under the transformations natural to the problem. For example, for object recognition we should have invariance to rotation (a rotated object should be at distance 0 from the original); for character recognition, no invariance to rotation (a rotated character should be at a large distance, e.g. a rotated '6' vs. a '9').

Distance (Dissimilarity) Measures. Euclidean distance: $d(x_i, x_j) = \left[\sum_{k=1}^{d} \left(x_i^{(k)} - x_j^{(k)}\right)^2\right]^{1/2}$. Manhattan (city block) distance: $d(x_i, x_j) = \sum_{k=1}^{d} \left|x_i^{(k)} - x_j^{(k)}\right|$, an approximation to the Euclidean distance that is cheaper to compute. Chebyshev distance: $d(x_i, x_j) = \max_{1 \le k \le d} \left|x_i^{(k)} - x_j^{(k)}\right|$, an approximation to the Euclidean distance that is cheapest to compute. All three are translation invariant.
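
For concreteness, here is a minimal Python/NumPy sketch of the three distance measures above (the function names are mine, not from the lecture):

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    """Manhattan (city block) distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    """Chebyshev distance: largest absolute coordinate difference."""
    return np.max(np.abs(xi - xj))

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(xi, xj), manhattan(xi, xj), chebyshev(xi, xj))  # 3.605..., 5.0, 3.0
```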

Similarity Measures. Cosine similarity: $s(x_i, x_j) = \dfrac{x_i^{t} x_j}{\lVert x_i \rVert\,\lVert x_j \rVert}$; the smaller the angle, the larger the similarity; it is a scale-invariant measure, popular in text retrieval. Correlation coefficient, popular in image processing: $s(x_i, x_j) = \dfrac{\sum_{k=1}^{d} \left(x_i^{(k)} - \bar{x}_i\right)\left(x_j^{(k)} - \bar{x}_j\right)}{\left[\sum_{k=1}^{d} \left(x_i^{(k)} - \bar{x}_i\right)^2 \sum_{k=1}^{d} \left(x_j^{(k)} - \bar{x}_j\right)^2\right]^{1/2}}$. (Figure: an example pair with large similarity and one with small similarity.)
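
Similarly, a small sketch of the two similarity measures (my own Python helpers, not lecture code):

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between xi and xj; scale invariant."""
    return float(xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

def correlation_coefficient(xi, xj):
    """Correlation between the coordinates of xi and xj (mean-subtracted cosine)."""
    xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
    return float(xi_c @ xj_c) / np.sqrt((xi_c @ xi_c) * (xj_c @ xj_c))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])          # same direction as a, different scale
print(cosine_similarity(a, b))          # 1.0 (scale invariant)
print(correlation_coefficient(a, b))    # 1.0
```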

Feature Scale. An old problem: how to choose the appropriate relative scale for the features? [length (in meters or cm?), weight (in grams or kg?)] In supervised learning we can normalize to zero mean and unit variance with no problems; in clustering this is more problematic: if the variance in the data is due to cluster presence, then normalizing the features is not a good thing. (Figure: the same data before and after normalization.)
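
For reference, a minimal sketch (my own, in Python/NumPy) of the zero-mean, unit-variance normalization mentioned above; whether to apply it before clustering is exactly the judgment call the slide warns about:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X - mu) / sigma

X = np.array([[180.0, 70000.0],        # e.g. height in cm, weight in grams
              [160.0, 55000.0],
              [175.0, 80000.0]])
print(standardize(X))
```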

Simplest Clustering Algorithm. Having defined a proximity function, we can develop a simple clustering algorithm: go over all sample pairs, and put them in the same cluster if the distance between them is less than some threshold distance $d_0$ (or if the similarity is larger than $s_0$). Pros: simple to understand and implement. Cons: very dependent on $d_0$ (or $s_0$), and the automatic choice of $d_0$ (or $s_0$) is not an easily solved issue. If $d_0$ is too small: too many clusters; $d_0$ larger: reasonable clustering; $d_0$ too large: too few clusters.
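
One reasonable reading of this threshold rule is to treat "same cluster" as a transitive relation, i.e. to take connected components of the pairs closer than $d_0$; a minimal sketch under that assumption (my own Python, using a union-find):

```python
import numpy as np

def threshold_clustering(X, d0):
    """Put samples i, j in the same cluster whenever ||x_i - x_j|| < d0.

    Merging is transitive, so the clusters are the connected components of
    the graph whose edges are all pairs closer than d0.
    """
    n = len(X)
    parent = list(range(n))                      # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < d0:
                parent[find(i)] = find(j)        # merge the two clusters

    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}   # relabel as 0, 1, ...
    return [ids[r] for r in roots]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(threshold_clustering(X, d0=1.0))           # [0, 0, 1, 1]
```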

Criterion Functions for Clustering. We have samples $x_1, \dots, x_n$. Suppose we partition the samples into $c$ subsets $D_1, \dots, D_c$; there are approximately $c^n / c!$ distinct partitions. We can define a criterion function $J(D_1, \dots, D_c)$ which measures the quality of a partitioning $D_1, \dots, D_c$. Then the clustering problem is a well-defined problem: the optimal clustering is the partition which optimizes the criterion function.

SSE Criterion Function. Let $n_i$ be the number of samples in $D_i$, and define the mean of the samples in $D_i$ as $\mu_i = \frac{1}{n_i} \sum_{x \in D_i} x$. Then the sum-of-squared-errors criterion function (to minimize) is $J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$. Note that the number of clusters, $c$, is fixed.
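
A direct Python/NumPy sketch of computing $J_{SSE}$ for a given partition (my own helper, not lecture code):

```python
import numpy as np

def sse_criterion(X, labels):
    """J_SSE = sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)                 # cluster mean
        J += np.sum((cluster - mu) ** 2)
    return J

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(sse_criterion(X, labels))   # each point is 1 unit from its mean: 1+1+1+1 = 4.0
```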

SSE Criterion Function. $J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$. The SSE criterion is appropriate when the data forms compact clouds that are relatively well separated. The SSE criterion favors equally sized clusters, and may not be appropriate when the natural groupings have very different sizes. (Figure: a split with large $J_{SSE}$ vs. a split with small $J_{SSE}$.)

Failure Example for $J_{SSE}$. (Figure: the natural two-ring clustering has the larger $J_{SSE}$, while an unnatural split has the smaller $J_{SSE}$.) The problem is that one of the natural clusters is not compact (the outer ring).

Other Minimum Variance Criterion Functions. We can rewrite the SSE as $J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2 = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{d}_i$, where $\bar{d}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{y \in D_i} \lVert x - y \rVert^2$ is the average (squared) Euclidean distance between all pairs of samples in $D_i$. We can obtain other criterion functions by replacing $\lVert x - y \rVert^2$ by any other measure of distance between points in $D_i$. Alternatively, we can replace $\bar{d}_i$ by the median, maximum, etc. instead of the average distance.

Maximum Distance Criterion. Consider $J_{max} = \sum_{i=1}^{c} n_i \max_{x, y \in D_i} \lVert x - y \rVert^2$. This solves the previous case. However, $J_{max}$ is not robust to outliers. (Figure: the partition with the smallest $J_{max}$, before and after an outlier is added.)

Other Criterion Functions. Recall the definition of scatter matrices: the scatter matrix for the $i$-th cluster is $S_i = \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^{t}$, and the within-the-cluster scatter matrix is $S_W = \sum_{i=1}^{c} S_i$. The determinant of $S_W$ roughly measures the square of the volume. Assuming $S_W$ is nonsingular, define the determinant criterion function $J_d = |S_W| = \left|\sum_{i=1}^{c} S_i\right|$. $J_d$ is invariant to scaling of the axes, and is useful if there are unknown irrelevant linear transformations of the data.
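
A short sketch of the determinant criterion in Python/NumPy (my own helper; like the slide, it assumes $S_W$ is nonsingular):

```python
import numpy as np

def determinant_criterion(X, labels):
    """J_d = |S_W|, the determinant of the within-cluster scatter matrix."""
    d = X.shape[1]
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        cluster = X[labels == c]
        diff = cluster - cluster.mean(axis=0)      # x - mu_i for each sample
        S_W += diff.T @ diff                       # scatter matrix S_i of cluster i
    return np.linalg.det(S_W)

X = np.array([[0.0, 0.0], [1.0, 0.2], [10.0, 0.0], [11.0, -0.2]])
labels = np.array([0, 0, 1, 1])
print(determinant_criterion(X, labels))
```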

Iterative Optimization Algorithms. Now that we have both a proximity measure and a criterion function, we need an algorithm to find the optimal clustering. Exhaustive search is impossible, since there are approximately $c^n / c!$ possible partitions. Usually some iterative algorithm is used: 1. find a reasonable initial partition; 2. repeat: move samples from one group to another such that the objective function $J$ is improved. (Figure: moving samples to improve $J$, e.g. from $J = 777{,}777$ down to $J = 666{,}666$.)

Iterative Optimization Algorithms. Iterative optimization algorithms are similar to gradient descent: they move in the direction of descent (ascent), but not in the steepest descent direction, since we have no derivative of the objective function; the solution depends on the initial point; and they cannot be guaranteed to find the global minimum. Main issue: how to move from the current partitioning to one which improves the objective function.

K-means Clustering. We now consider an example of an iterative optimization algorithm for the special case of the $J_{SSE}$ objective function, $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$; for a different objective function, we need a different optimization algorithm, of course. Fix the number of clusters to k (c = k). k-means is probably the most famous clustering algorithm; it has a smart way of moving from the current partitioning to the next one.

K-means Clustering. 1. Initialize: pick k cluster centers arbitrarily and assign each example to the closest center (example in the slide: k = 3). 2. Compute the sample mean of each cluster. 3. Reassign all samples to the closest mean. 4. If the clusters changed at step 3, go to step 2.
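
A minimal Python/NumPy sketch of these four steps (my own implementation, not the lecture's code; initialization simply picks k distinct samples as the starting centers):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means for the SSE criterion: alternate between recomputing the
    cluster means (step 2) and reassigning samples to the closest mean
    (step 3) until the assignment stops changing (step 4)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k samples as arbitrary initial centers, assign to the closest.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.linalg.norm(X[:, None] - centers[None, :], axis=2).argmin(axis=1)
    for _ in range(n_iter):
        # Step 2: sample mean of each cluster (keep the old center if a cluster is empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
        # Step 3: reassign every sample to its closest mean.
        new_labels = np.linalg.norm(X[:, None] - centers[None, :], axis=2).argmin(axis=1)
        # Step 4: stop when nothing changed.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # roughly [0, 0] and [5, 5]
```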

K-means Clustering. Consider steps 2 and 3 of the algorithm, with $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$. Step 2: compute the sample mean of each cluster. Step 3: reassign all samples to the closest mean; if we still represent the clusters by their old means, the error has gotten smaller, since each sample is now at least as close to the mean it is assigned to.

K-means Clustering. 3. Reassign all samples to the closest mean. If we represent the clusters by their old means, the error has gotten smaller. However, we represent the clusters by their new means, and the mean is always the best representative of a cluster: $\frac{\partial}{\partial z} \sum_{x \in D_i} \lVert x - z \rVert^2 = \frac{\partial}{\partial z} \sum_{x \in D_i} \left( \lVert x \rVert^2 - 2 x^{t} z + \lVert z \rVert^2 \right) = \sum_{x \in D_i} \left( -2x + 2z \right) = 0 \;\Rightarrow\; z = \frac{1}{n_i} \sum_{x \in D_i} x = \mu_i$, so switching from the old means to the new means can only decrease the error further.

K-means Clustering. We just proved that by doing steps 2 and 3, the objective function goes down: in two steps, we found a smart move which decreases the objective function. Thus the algorithm converges after a finite number of iterations of steps 2 and 3. However, the algorithm is not guaranteed to find a global minimum. (Figure: an initialization where 2-means gets stuck, versus the global minimum of $J_{SSE}$.)

K-means Clustering. Finding the optimum of $J_{SSE}$ is NP-hard. In practice, k-means clustering usually performs well. It is very efficient, and its solution can be used as a starting point for other clustering algorithms. Hundreds of papers on variants and improvements of k-means clustering are still published every year.

Hierarchical Clustering. Up to now, we considered flat clustering. For some data, hierarchical clustering is more appropriate than flat clustering. (Figure: an example hierarchical clustering.)

Hierarchical Clustering: Biological Taxonomy. Example tree: animal (with spine: dog, cat; no spine: jellyfish) and plant (seed producing: apple, rose; spore producing: mushroom, mold).

Hierarchical Clustering: Dendrogram. The preferred way to represent a hierarchical clustering is a dendrogram: a binary tree in which level k corresponds to the partitioning with n-k+1 clusters. If we need k clusters, we take the clustering from level n-k+1. If samples are in the same cluster at level k, they stay in the same cluster at all higher levels. The dendrogram typically shows the similarity of the grouped clusters.
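
In practice dendrograms are usually built and drawn with library code; a minimal sketch using SciPy's hierarchical-clustering utilities (an external library, not something the lecture uses) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (10, 2)),
               np.random.default_rng(1).normal(3, 0.3, (10, 2))])

Z = linkage(X, method='average')                  # agglomerative merges, bottom up
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                     # merge heights show cluster similarity
plt.show()
```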

Hierarchical Clustering: Venn Diagram. We can also use a Venn diagram to show a hierarchical clustering, but similarity is not represented quantitatively.

Hierarchical Clustering. Algorithms for hierarchical clustering can be divided into two types: 1. Agglomerative (bottom-up) procedures: start with n singleton clusters and form the hierarchy by merging the most similar clusters. 2. Divisive (top-down) procedures: start with all samples in one cluster and form the hierarchy by splitting the worst clusters.

Divisive Hierarchical Clustering. Any flat algorithm which produces a fixed number of clusters can be used: set c = 2 and apply it recursively to split clusters.
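
A sketch of that recursion, using scikit-learn's 2-means as the flat algorithm (my choice of library and of k-means as the splitter is an assumption; any flat clustering with c = 2 would do):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, depth=2):
    """Recursively split the data in two with a flat algorithm (here 2-means),
    producing a hierarchy of the given depth. Returns nested lists of index arrays."""
    def split(indices, level):
        if level == 0 or len(indices) < 2:
            return indices
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
        return [split(indices[labels == 0], level - 1),
                split(indices[labels == 1], level - 1)]
    return split(np.arange(len(X)), depth)

X = np.vstack([np.full((4, 2), v) + np.random.default_rng(0).normal(0, 0.1, (4, 2))
               for v in (0.0, 5.0, 10.0, 15.0)])
print(divisive_clustering(X, depth=2))   # nested structure with 4 leaf clusters
```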

Agglomerative Hierarchical Clustering. Initialize with each example in a singleton cluster; while there is more than 1 cluster: 1. find the 2 nearest clusters, 2. merge them. Four common ways to measure cluster distance: 1. minimum distance $d_{min}(D_i, D_j) = \min_{x \in D_i, y \in D_j} \lVert x - y \rVert$; 2. maximum distance $d_{max}(D_i, D_j) = \max_{x \in D_i, y \in D_j} \lVert x - y \rVert$; 3. average distance $d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \lVert x - y \rVert$; 4. mean distance $d_{mean}(D_i, D_j) = \lVert \mu_i - \mu_j \rVert$.
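
A minimal Python/NumPy sketch of this merge loop with the four cluster-distance options above (my own implementation; it is the naive version, recomputing all pairwise cluster distances at every merge):

```python
import numpy as np

def cluster_distance(A, B, linkage="min"):
    """Distance between two clusters A, B (arrays of points) under the four
    measures from the slide: min, max, average, or distance between means."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if linkage == "min":
        return pairwise.min()
    if linkage == "max":
        return pairwise.max()
    if linkage == "avg":
        return pairwise.mean()
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # "mean" linkage

def agglomerative(X, k, linkage="min"):
    """Start from singleton clusters and repeatedly merge the two nearest
    clusters until only k clusters remain. Returns lists of sample indices."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(X[clusters[i]], X[clusters[j]], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 0.0]])
print(agglomerative(X, k=3, linkage="min"))       # [[0, 1], [2, 3], [4]]
```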

Single Linkage or Nearest Neighbor. Agglomerative clustering with the minimum distance $d_{min}(D_i, D_j) = \min_{x \in D_i, y \in D_j} \lVert x - y \rVert$. It generates a minimum spanning tree and encourages the growth of elongated clusters. Disadvantage: it is very sensitive to noise. (Figure: what we want vs. what we get when a single noisy sample bridges two clusters.)

Complete Linkage or Farthest Neighbor. Agglomerative clustering with the maximum distance $d_{max}(D_i, D_j) = \max_{x \in D_i, y \in D_j} \lVert x - y \rVert$. It encourages compact clusters, but does not work well if elongated clusters are present: it can happen that $d_{max}(D_1, D_2) < d_{max}(D_2, D_3)$, and thus $D_1$ and $D_2$ are merged instead of $D_2$ and $D_3$.

Average and Mean Agglomerative Clustering. Agglomerative clustering is more robust under the average or the mean cluster distance: $d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \lVert x - y \rVert$ and $d_{mean}(D_i, D_j) = \lVert \mu_i - \mu_j \rVert$. The mean distance is cheaper to compute than the average distance. Unfortunately, there is not much to say about agglomerative clustering theoretically, but it does work reasonably well in practice.

Agglomerative vs. Divisive. Agglomerative is faster to compute, in general. Divisive may be less blind to the global structure of the data: when taking the first step (a split), it has access to all the data and can find the best possible split into 2 parts. Agglomerative, when taking the first merging step, does not consider the global structure of the data, only the pairwise structure.

Summary. Clustering (nonparametric learning) is useful for discovering the inherent structure in data, and is immensely useful in different fields. Clustering comes naturally to humans (in up to 3 dimensions), but not so to computers. It is very easy to design a clustering algorithm, but it is very hard to say whether it does anything good. A general-purpose clustering algorithm is unlikely to exist; for best results, clustering should be tuned to the application at hand.