Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj

Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

Density-Based Algorithms One Motivation Find clusters with arbitrary shape The Key Idea Identify fine-grained dense regions Merge regions into clusters Representative Algorithms Grid-Based Methods DBSCAN DENCLUE

Grid-Based Methods The Algorithm: discretize each dimension into p ranges, so that the data space is partitioned into rectangular cells; mark a cell as dense if it contains at least τ data points; merge adjacent dense cells, and report each connected region of dense cells as a cluster. A sketch follows.
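
A minimal sketch of this grid procedure; the parameter names `p` and `tau`, the helper `grid_cluster`, and the use of SciPy connected components are illustrative choices, not from the slides:

```python
import numpy as np
from itertools import product
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components

def grid_cluster(X, p=10, tau=5):
    """Discretize into p ranges per dimension, keep cells with >= tau points,
    and merge adjacent dense cells into clusters."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * p).astype(int)
    keys, inv, counts = np.unique(cells, axis=0,
                                  return_inverse=True, return_counts=True)
    inv = inv.ravel()                           # cell index of each data point
    dense = np.where(counts >= tau)[0]          # indices of dense cells
    index = {tuple(keys[c]): i for i, c in enumerate(dense)}
    # Connect dense cells whose coordinates differ by at most 1 per dimension.
    adj = lil_matrix((len(dense), len(dense)))
    for cell, i in index.items():
        for off in product((-1, 0, 1), repeat=X.shape[1]):
            j = index.get(tuple(np.add(cell, off)))
            if j is not None:
                adj[i, j] = 1
    _, cell_label = connected_components(adj.tocsr(), directed=False)
    labels = np.full(len(X), -1)                # points in sparse cells stay noise
    for pos, d in enumerate(dense):
        labels[inv == d] = cell_label[pos]
    return labels
```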

Limitations: Two Parameters (1) The number of grids: too coarse a grid merges distinct clusters, while too fine a grid fragments them and leaves many sparse cells.

Limitations: Two Parameters (2) The level of density: the threshold τ trades off missing low-density clusters against merging distinct ones.

DBSCAN (1) 1. Classify data points into: Core point: a data point is defined as a core point if it contains at least $\tau$ data points within a radius $\varepsilon$. Border point: a data point is defined as a border point if it contains fewer than $\tau$ points within radius $\varepsilon$, but it contains at least one core point within radius $\varepsilon$. Noise point: a data point that is neither a core point nor a border point is defined as a noise point.

DBSCAN (2) 1. Classify data points into core points, border points, and noise points. 2. Construct a connectivity graph over the core points: two core points are connected if they are within $\varepsilon$ of one another. 3. Determine the connected components of this graph. 4. Assign each border point to the connected component with which it is best connected. (A sketch of these four steps follows.)
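
A compact sketch of the four steps; `eps` and `min_pts` stand in for the radius $\varepsilon$ and density level $\tau$, and the $O(n^2)$ distance matrix is for clarity only (a real implementation would use a spatial index):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def dbscan(X, eps=0.5, min_pts=5):
    # Step 1: classify points. A core point has >= min_pts points within eps.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = D <= eps
    core = within.sum(axis=1) >= min_pts       # counts include the point itself
    # Step 2: connectivity graph over core points (edges within eps).
    core_idx = np.where(core)[0]
    A = csr_matrix(within[np.ix_(core_idx, core_idx)])
    # Step 3: connected components give the clusters.
    _, comp = connected_components(A, directed=False)
    labels = np.full(len(X), -1)
    labels[core_idx] = comp
    # Step 4: assign each border point to the component of its closest core point.
    for i in np.where(~core)[0]:
        near_core = core_idx[within[i, core_idx]]
        if len(near_core) > 0:                 # otherwise it stays noise (-1)
            labels[i] = labels[near_core[np.argmin(D[i, near_core])]]
    return labels
```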

Limitations of DBSCAN Two Parameters: the radius $\varepsilon$ and the level of density $\tau$, and they are related to each other (enlarging the radius makes the same threshold easier to meet, so the two must be tuned jointly). High Computational Cost: identifying the $\varepsilon$-neighbors of all points takes $O(n^2)$ time unless a spatial index is used.

DENCLUE Preliminary: Kernel-density Estimation. Given data points $x_1, \dots, x_n$, the density at a point $x$ is estimated as $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$, where $K_h$ is a kernel function with bandwidth $h$, e.g., a Gaussian [Hinneburg and Keim, 1998].

DENCLUE The Key Idea: determine clusters by using a density threshold $\tau$; cutting the estimated density function at different levels yields different clusterings (e.g., 2 clusters at one threshold and 3 at another in the slide's illustration).

DENCLUE Procedure Density Attractors: the local maxima (peaks) of the estimated density. Identify a peak for each data point by an iterative gradient ascent. Post-Processing: attractors whose density is smaller than $\tau$ are excluded; density attractors that are connected to each other by a path of density at least $\tau$ are merged.

DENCLUE Implementation Gradient Ascent: with a Gaussian kernel, the gradient is $\nabla \hat{f}(x) \propto \sum_{i=1}^{n} K_h(x - x_i)\,(x_i - x)$. Mean-shift Method: setting the gradient to zero suggests the fixed-point iteration $x \leftarrow \sum_i K_h(x - x_i)\, x_i \,/\, \sum_i K_h(x - x_i)$, which converges much faster than fixed-step gradient ascent.
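
A sketch of this hill climbing under a Gaussian kernel; the bandwidth `h`, the tolerance, and the iteration cap are illustrative:

```python
import numpy as np

def gaussian_kernel(U, h):
    # K_h(u) = exp(-||u||^2 / (2 h^2)); the normalizing constant cancels below
    return np.exp(-np.sum(U ** 2, axis=-1) / (2 * h ** 2))

def density_attractor(x, X, h=1.0, tol=1e-5, max_iter=100):
    """Climb the kernel density estimate from x to its attractor (local peak)."""
    for _ in range(max_iter):
        w = gaussian_kernel(x - X, h)          # weight of each data point
        x_new = w @ X / w.sum()                # mean-shift update
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# DENCLUE post-processing would then discard attractors with density below tau
# and merge attractors linked by a path of density at least tau.
```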

Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

Graph Construction For a set of points $x_1, \dots, x_n$: a node is defined for each $x_i$. An edge exists between $x_i$ and $x_j$: if the distance $\|x_i - x_j\| \le \varepsilon$, or if either one is a $k$-nearest neighbor of the other (a better approach). If there is an edge, then its weight is 1, or is set by the heat kernel: $w_{ij} = \exp(-\|x_i - x_j\|^2 / t)$.
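
A sketch of the $k$-nearest-neighbor construction with heat-kernel weights; `k` and `t` are illustrative parameter choices:

```python
import numpy as np

def knn_graph(X, k=10, t=1.0, binary=False):
    """Symmetric k-NN similarity matrix with heat-kernel (or 0/1) weights."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # squared distances
    W = np.zeros_like(D2)
    for i in range(len(X)):
        nn = np.argsort(D2[i])[1:k + 1]                         # skip the point itself
        W[i, nn] = 1.0 if binary else np.exp(-D2[i, nn] / t)    # heat kernel w_ij
    return np.maximum(W, W.T)   # edge if either point is a k-NN of the other
```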

Spectral Clustering Dimensionality Reduction: find a low-dimensional representation for each node in the graph via the Laplacian Eigenmap [Belkin and Niyogi, 2002]. $k$-means: apply $k$-means to the new representations of the data.

Laplacian Eigenmap (1) The Objective Function (one-dimensional case): $\min_{y} \sum_{i,j} w_{ij} (y_i - y_j)^2$, where $y_i$ is a 1-dimensional representation of $x_i$ and $w_{ij}$ is the similarity between $x_i$ and $x_j$. Similar points will be mapped closer together, because similar points have larger weights.

Laplacian Eigenmap (2) The Objective Function in Vector Form: $\sum_{i,j} w_{ij} (y_i - y_j)^2 = 2\, y^\top L y$, where $L = D - W$ is the graph Laplacian and is positive semidefinite (PSD), $W$ is the similarity matrix, and $D$ is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$.
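
The identity behind the vector form follows by expanding the square and using the symmetry of $W$ together with $D_{ii} = \sum_j w_{ij}$:

$$\sum_{i,j} w_{ij}(y_i - y_j)^2 = \sum_{i,j} w_{ij} y_i^2 - 2 \sum_{i,j} w_{ij} y_i y_j + \sum_{i,j} w_{ij} y_j^2 = 2\, y^\top D y - 2\, y^\top W y = 2\, y^\top (D - W)\, y = 2\, y^\top L y.$$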

Laplacian Eigenmap (3) The Optimization Problem: $\min_{y} y^\top L y$ subject to $y^\top D y = 1$. The constraint is added to remove the scaling factor, and $D$ is introduced for normalization [Luxburg, 2007]. The Solution: the generalized eigenproblem $L y = \lambda D y$ [Luxburg, 2007]. The smallest eigenvector is the constant vector with eigenvalue $0$, which is useless since it maps all points to the same value, so use the second smallest eigenvector. The new representation for $x_i$ is $y_i$.

Laplacian Eigenmap (4) The Objective Function ($k$-dimensional case): $\min_{Y} \sum_{i,j} w_{ij} \|y_i - y_j\|^2$, where $y_i \in \mathbb{R}^k$ is the $i$-th row of $Y \in \mathbb{R}^{n \times k}$. Vector Form: $\sum_{i,j} w_{ij} \|y_i - y_j\|^2 = 2\,\operatorname{trace}(Y^\top L Y)$, where $L = D - W$ is the graph Laplacian, $W$ is the similarity matrix, and $D$ is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$.

Laplacian Eigenmap (5) The Optimization Problem: $\min_{Y} \operatorname{trace}(Y^\top L Y)$ subject to $Y^\top D Y = I$. The Solution: the generalized eigenproblem $L y = \lambda D y$ [Luxburg, 2007]; use $y_2, y_3, \dots, y_{k+1}$ as the optimal solution, where $y_i$ is the $i$-th generalized eigenvector in ascending order of eigenvalue (the first is again the useless constant vector). The new representation for $x_i$ is the $i$-th row of $Y = [y_2, \dots, y_{k+1}]$. Don't forget the normalization.
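
Putting the pieces together, a sketch of the full pipeline; it takes a similarity matrix such as the one from the `knn_graph` sketch above and solves the generalized eigenproblem with `scipy.linalg.eigh` (which assumes every node has positive degree, so that $D$ is positive definite):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, n_clusters):
    """Laplacian-eigenmap embedding of a similarity matrix W, then k-means."""
    D = np.diag(W.sum(axis=1))                 # degree matrix
    L = D - W                                  # unnormalized graph Laplacian (PSD)
    # Generalized eigenproblem L y = lambda D y; eigh returns ascending eigenvalues.
    _, vecs = eigh(L, D)
    # Skip the constant eigenvector (eigenvalue 0); keep the next n_clusters.
    Y = vecs[:, 1:n_clusters + 1]              # row i = new representation of x_i
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Y)

# Usage with the earlier graph sketch: labels = spectral_clustering(knn_graph(X), 3)
```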

Properties of Spectral Clustering Handles Varying Cluster Shape and Density: due to the nearest-neighbor graph, clusters need not be compact or linearly separable. High Computational Cost: building the graph and solving the eigenproblem are expensive for large $n$.

Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

Non-negative Matrix Factorization (NMF) Let $X \in \mathbb{R}^{n \times d}$ be a non-negative data matrix whose rows are the data points. NMF aims to factor $X$ as $X \approx U V^\top$, where $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{d \times k}$ are non-negative. The Optimization Problem: $\min_{U \ge 0,\, V \ge 0} \|X - U V^\top\|_F^2$, which is non-convex.

Interpretation of NMF (1) Matrix Approximation: element-wise, $x_{ij} \approx \sum_{r=1}^{k} u_{ir} v_{jr}$; equivalently, $X \approx \sum_{r=1}^{k} u_r v_r^\top$, where $u_r$ is the $r$-th column of $U$ and $v_r^\top$ is the $r$-th row of $V^\top$. Then the $i$-th row of $X$ satisfies $x_i \approx V u_i$, i.e., $x_{ij}$ is the $j$-th element of the vector $V u_i$.

Interpretation of NMF (2) Vector Approximation: $x_i \approx V u_i$, so the columns of $V$ can be treated as basis vectors. They may not be orthonormal, but they are non-negative. The row $u_i$ can be treated as a new $k$-dimensional representation of $x_i$.

Parts-Based Representations When each $x_i$ is a face image, the non-negative basis vectors learned by NMF correspond to localized parts of faces [Lee and Seung, 1999].

Clustering by NMF Vector Approximation: the basis vector $v_r$ can be treated as a representative of the $r$-th cluster, and $u_{ir}$ can be treated as the association between $x_i$ and the $r$-th cluster. The cluster label for $x_i$ is therefore $\operatorname{argmax}_{r} u_{ir}$ [Xu et al., 2003], as in the sketch below.
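
A sketch using scikit-learn's `NMF`; the restart loop anticipates the non-convexity discussed on the next slide, and the parameter values are illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_cluster(X, n_clusters, n_restarts=5, seed=0):
    """Cluster the rows of a non-negative matrix X via NMF + argmax."""
    best_err, best_labels = np.inf, None
    for r in range(n_restarts):                # non-convex: keep the best restart
        model = NMF(n_components=n_clusters, init='random',
                    random_state=seed + r, max_iter=500)
        U = model.fit_transform(X)             # U: n x k association matrix
        if model.reconstruction_err_ < best_err:
            best_err = model.reconstruction_err_
            best_labels = U.argmax(axis=1)     # label of x_i = strongest association
    return best_labels
```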

An Example: NMF can discover both row and column clusters of a data matrix, since $U$ groups the rows (data points) while $V$ groups the columns (features).

Optimization in NMF Alternate between updating $U$ (with $V$ fixed) and updating $V$ (with $U$ fixed). This yields only locally optimal solutions, so run the algorithm multiple times and choose the best one. Other optimization algorithms are also possible.
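
For concreteness, one standard choice for the alternating step (not spelled out on the slide) is the multiplicative update rule popularized by Lee and Seung, which keeps both factors non-negative and monotonically decreases $\|X - U V^\top\|_F^2$ under the conventions used above:

$$U_{ir} \leftarrow U_{ir} \, \frac{(X V)_{ir}}{(U V^\top V)_{ir}}, \qquad V_{jr} \leftarrow V_{jr} \, \frac{(X^\top U)_{jr}}{(V U^\top U)_{jr}}.$$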

Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

Concepts Cluster validation: evaluate the quality of a clustering. Internal Validation Criteria: do not need additional information, but are biased toward one algorithm or the other. External Validation Criteria: ground-truth clusters are known; however, the ground-truth may not reflect the natural clusters in the data.

Internal Validation Criteria Sum of square distances to centroids Intracluster to intercluster distance ratio Silhouette coefficient Probabilistic measure

External Validation Criteria Class Labels: the ground-truth. Confusion Matrix: each row corresponds to a class label; each column corresponds to an algorithm-determined cluster. An ideal clustering yields a confusion matrix that becomes diagonal after a suitable permutation of its columns.

Notations $m_{ij}$: number of data points from class $i$ (ground-truth cluster $i$) that are mapped to (algorithm-determined) cluster $j$. $N_i$: number of data points in true cluster $i$. $M_j$: number of data points in algorithm-determined cluster $j$.

Purity For a given algorithm-determined cluster $j$, define $P_j = \max_i m_{ij}$ as the number of data points in its dominant class. The overall purity is $\sum_j P_j \,/\, \sum_j M_j$. High values of the purity are desirable.

Gini index Limitation of Purity: it only accounts for the dominant label in the cluster and ignores the distribution of the remaining points. The Gini index for column $j$ (algorithm-determined cluster $j$) is $G_j = 1 - \sum_i (m_{ij} / M_j)^2$. The average Gini coefficient, weighted by cluster size, is $G = \sum_j G_j M_j \,/\, \sum_j M_j$. Low values are desirable. Both measures are computed in the sketch below.
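
A sketch computing both measures from a confusion matrix $M$ with rows as true classes and columns as algorithm-determined clusters (the example matrix is made up for illustration):

```python
import numpy as np

def purity_and_gini(M):
    """M[i, j] = number of points of true class i placed in cluster j."""
    M = np.asarray(M, dtype=float)
    M_j = M.sum(axis=0)                        # size of each algorithm cluster
    purity = M.max(axis=0).sum() / M_j.sum()   # dominant class count per column
    G_j = 1.0 - ((M / M_j) ** 2).sum(axis=0)   # Gini index of each column
    gini = (G_j * M_j).sum() / M_j.sum()       # cluster-size weighted average
    return purity, gini

M = [[50, 2, 0], [3, 40, 5], [0, 1, 49]]       # 3 classes x 3 clusters
print(purity_and_gini(M))                      # high purity, low Gini: good
```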

Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

Summary Grid-Based and Density-Based Algorithms Grid-Based Methods DBSCAN, DENCLUE Graph-Based Algorithms Laplacian Eigenmap Non-negative Matrix Factorization Cluster Validation Purity, Gini index

References
[Belkin and Niyogi, 2002] Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS 14, pages 585–591.
[Hinneburg and Keim, 1998] Hinneburg, A. and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58–65.
[Lee and Seung, 1999] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.
[Luxburg, 2007] Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.
[Xu et al., 2003] Xu, W., Liu, X., and Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In SIGIR, pages 267–273.