Behavioral Data Mining. Lecture 18 Clustering

Similar documents
Introduction to Data Science Lecture 8 Unsupervised Learning. CS 194 Fall 2015 John Canny

SGN (4 cr) Chapter 11

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Visual Representations for Machine Learning

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components

Clustering Algorithms for general similarity measures

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Unsupervised Learning

Machine Learning (BSMC-GA 4439) Wenke Liu

Methods for Intelligent Systems

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CHAPTER 4: CLUSTER ANALYSIS

Machine Learning (BSMC-GA 4439) Wenke Liu

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines

Clustering Lecture 5: Mixture Model

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

Clustering CS 550: Machine Learning

10-701/15-781, Fall 2006, Final

My favorite application using eigenvalues: partitioning and community detection in social networks

CSE 5243 INTRO. TO DATA MINING

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

CSE 494 Project C. Garrett Wolf

Machine Learning. Unsupervised Learning. Manfred Huber

Automatic Cluster Number Selection using a Split and Merge K-Means Approach

CS 229 Midterm Review

Lesson 3. Prof. Enza Messina

Favorites-Based Search Result Ordering

Chapter 9. Classification and Clustering

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering. Bruno Martins. 1 st Semester 2012/2013

Based on Raymond J. Mooney s slides

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003

Unsupervised Learning

Introduction to Mobile Robotics

Lecture 7: Segmentation. Thursday, Sept 20

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Segmentation and Grouping

Clustering: Overview and K-means algorithm

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Clustering: Overview and K-means algorithm

CSE 5243 INTRO. TO DATA MINING

Clustering Lecture 8. David Sontag New York University. Slides adapted from Luke Zettlemoyer, Vibhav Gogate, Carlos Guestrin, Andrew Moore, Dan Klein

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Cluster Analysis (b) Lijun Zhang

Spectral Methods for Network Community Detection and Graph Partitioning

Text Documents clustering using K Means Algorithm

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Text Modeling with the Trace Norm

Unsupervised Learning and Clustering

Lecture 13 Segmentation and Scene Understanding Chris Choy, Ph.D. candidate Stanford Vision and Learning Lab (SVL)

Machine Learning for OR & FE

6.801/866. Segmentation and Line Fitting. T. Darrell

CS249: ADVANCED DATA MINING

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

General Instructions. Questions

Clustering and Visualisation of Data

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

ECS 234: Data Analysis: Clustering ECS 234

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

9.1. K-means Clustering

DOCUMENT CLUSTERING USING HIERARCHICAL METHODS. 1. Dr.R.V.Krishnaiah 2. Katta Sharath Kumar. 3. P.Praveen Kumar. achieved.

Spectral Clustering. Presented by Eldad Rubinstein Based on a Tutorial by Ulrike von Luxburg TAU Big Data Processing Seminar December 14, 2014

K-Means Clustering 3/3/17

4. Ad-hoc I: Hierarchical clustering

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Supervised vs. Unsupervised Learning

Unsupervised Learning : Clustering

Clustering: Classic Methods and Modern Views

Effective Latent Space Graph-based Re-ranking Model with Global Consistency


Segmentation Computer Vision Spring 2018, Lecture 27

Finding Clusters 1 / 60

Object Classification Problem

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Data Clustering. Danushka Bollegala

Chapter DM:II. II. Cluster Analysis

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Unsupervised Learning. Clustering and the EM Algorithm. Unsupervised Learning is Model Learning

Segmentation. Bottom Up Segmentation

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Normalized cuts and image segmentation

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology

Machine learning - HT Clustering

Information Retrieval and Organisation

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Unsupervised Learning

Summary: A Tutorial on Learning With Bayesian Networks

Feature Selection for fmri Classification

Transcription:

Behavioral Data Mining Lecture 18 Clustering

Outline Why? Cluster quality K-means Spectral clustering Generative Models

Rationale Given a set {X_i} for i = 1, ..., n, a clustering is a partition of the X_i into subsets C_1, ..., C_k. Clustering has one or more goals: segment a large set of cases into small (even-sized) subsets that can be treated similarly (segmentation); model the variation in a large dataset using a smaller number of samples (condensation); generate a more compact description of a dataset (compression); model an underlying process that generates the data as a mixture of different, localized processes (representation).

Stereotypical Clustering

Model-based Clustering

Model-based Clustering

Clustering for Segmentation

Condensation/Compression

Cognitive Bias Human beings conceptualize the world through categories represented as exemplars (Rosch 73, Estes 94). We tend to see cluster structure whether it is there or not.

Cognitive Bias

Netflix

Outline Why? Cluster quality K-means Spectral clustering Generative Models

Terminology Hierarchical clustering: clusters form a hierarchy. Can be computed bottom-up or top-down. Flat clustering: no inter-cluster structure. Hard clustering: items assigned to a unique cluster. Soft clustering: cluster membership is a real-valued function, distributed across several clusters.

Clustering Basics At a minimum we need a similarity measure for items. From the similarity measure, we can specify a clustering quality measure Q (objective function), which increases with intra-cluster similarity and decreases with inter-cluster similarity. Choose an algorithm to (approximately) optimize the measure Q: greedy assignment/cluster update (e.g. k-means); expectation maximization; greedy merging (hierarchical agglomerative clustering, HAC); partitioning (spectral clustering). Cluster management: splitting, merging, or shattering clusters of low quality.

Normalized Cut measure Quality measure based on the weight of cuts (total similarity weight of edges across a cut). Ncut minimizes:
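
The formula itself did not survive the transcription; a standard reconstruction (the Shi-Malik form of the normalized cut, consistent with the description above) is:

\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}, \qquad \mathrm{cut}(A,B) = \sum_{i \in A,\, j \in B} w_{ij}, \quad \mathrm{assoc}(A,V) = \sum_{i \in A,\, j \in V} w_{ij}

Dividing each cut weight by the total association of its side discourages the trivial cuts that isolate a few weakly connected nodes.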

Outline Why? Cluster quality K-means Spectral clustering

K-means clustering The standard k-means algorithm is based on Euclidean inner products (kernels) <u, v>. The cluster quality measure is an intra-cluster measure only, equivalent to the sum of item-to-centroid kernels. A simple greedy algorithm locally optimizes this measure: find the best cluster (best item-centroid kernel) for each item, then recompute the cluster centroid from the newly-assigned items in each cluster. Item vectors are normally scaled to unit magnitude; otherwise large vectors will attract too many neighbors.
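
A minimal sketch of this greedy assign/update loop in Python/NumPy (not the lecture's own code); the unit-normalization and the inner-product assignment follow the slide, while the initialization and iteration count are choices of the sketch:

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # X: (n, d) array of item vectors. Rows are scaled to unit length so the
    # inner product <x, c> acts as the item-centroid similarity kernel.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: pick the best cluster (largest item-centroid kernel).
        assign = np.argmax(X @ centroids.T, axis=1)
        # Update step: recompute each centroid from its newly assigned items.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assign, centroids

With tf-idf document vectors as the rows of X, assign gives a hard, flat clustering in the sense of the Terminology slide.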

K-means clustering

K-means clustering

K-means properties Very simple convergence proofs. Performance is O(ntk) per iteration (n = number of docs, t = non-zeros per doc, k = number of clusters), which is not bad and can be heuristically improved. Many generalizations, e.g. a simple generalization to m-best soft clustering. As a local clustering method, it works well for data condensation/compression.

K-means properties No inter-cluster measure: splitting clusters always improves the clustering; merging and shattering always worsen it. The optimal clustering of N points is therefore pathological (each point in its own cluster); fortunately the greedy algorithm fails to find it, since it cannot increase the number of clusters.

Choosing clustering dimension AIC, or Akaike Information Criterion: K = dimension (number of clusters), L(K) is the likelihood (could be RSS) and q(K) is a measure of model complexity (cluster description complexity). AIC favors more compact (fewer-cluster) clusterings. For sparse data, AIC will incorporate the number of nonzeros in the cluster spec. Lower is better.
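
The slide's formula is not preserved; the usual definition, consistent with "lower is better", is

\mathrm{AIC}(K) = -2 \log L(K) + 2\, q(K)

and the number of clusters is chosen as the K minimizing AIC(K), trading goodness of fit against model complexity.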

Outline Why? Cluster quality K-means Spectral clustering Generative Models

Spectral Clustering Spectral clustering methods derive from graph partitioning. They simultaneously find partitions that (approximately) increase affinity within a cluster while decreasing it between clusters. We look for the second-largest eigenvalue/eigenvector of this matrix, where W is the original affinity matrix and D is a diagonal matrix of total weights for each node. The largest eigenvalue is 1, with the all-ones eigenvector.
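
The matrix itself is missing from the transcript; given the description (affinity W, degree matrix D, largest eigenvalue 1 with the all-ones eigenvector), it is presumably the degree-normalized affinity, i.e. the random-walk transition matrix

P = D^{-1} W, \qquad D_{ii} = \sum_j W_{ij}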

Spectral Clustering Standard spectral clustering uses only the second eigenvector: it sorts the nodes by the coefficients of this vector and then chooses the split point that minimizes the normalized cut value. In practice, enumeration of the cut values dominates the running time, so a reduced-resolution graph of the cut values can be used. This generates only a single binary split of the data; to complete the clustering, the two parts are recursively clustered.

Spectral Clustering with k-means
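
The slide's content is not preserved; below is a minimal sketch of the usual spectral-embedding-plus-k-means recipe, reusing the kmeans function sketched earlier. The symmetric normalization D^{-1/2} W D^{-1/2} and the use of the top k eigenvectors are choices of this sketch, not taken from the slide:

import numpy as np

def spectral_kmeans(W, k, iters=20, seed=0):
    # W: (n, n) symmetric affinity matrix with non-negative weights.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ W @ D_inv_sqrt        # symmetrically normalized affinity
    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    embedding = vecs[:, -k:]               # top-k eigenvectors as coordinates
    # Cluster the spectral embedding with the k-means sketch from above
    # (kmeans() row-normalizes the embedding internally).
    assign, _ = kmeans(embedding, k, iters=iters, seed=seed)
    return assign

Using k eigenvectors at once yields a k-way partition directly, instead of the recursive binary splits of the previous slide.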

Results

Results

Spectral, Kernel, k-means Defines an objective function equivalent to normalized cut which can be directly minimized with k-means.

Spectral, Kernel, k-means The Euclidean distance from φ(a) to center m_j can be computed purely from kernel values (see the reconstruction below), where φ(x) is the non-linear feature map implied by the kernel. Similar to Jordan et al., but allows general kernels and performs the final k-means in the original space.
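
The distance expression itself is missing from the transcript; the standard kernel k-means identity, consistent with the text, is, for cluster C_j with centroid m_j = \frac{1}{|C_j|}\sum_{b \in C_j}\varphi(b):

\|\varphi(a) - m_j\|^2 = K(a,a) - \frac{2}{|C_j|}\sum_{b \in C_j} K(a,b) + \frac{1}{|C_j|^2}\sum_{b,c \in C_j} K(b,c)

so assignments can be made without ever materializing φ(x).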

Spectral, Kernel, k-means Triangle inequality estimation: uses cluster-cluster distances to bound new point scores.

Spectral, Kernel, k-means

Outline Why? Cluster quality K-means Spectral clustering Generative Models

Generative Clustering Generative or probabilistic clustering includes a probabilistic model which generates samples of the observed data. Each model assigns a likelihood to an actual dataset. We replace the item-item similarity measure with a likelihood of membership in cluster i. Each cluster may be represented by a center c_i, and the (log) likelihood that d_j is in cluster i by <c_i, d_j> for some inner product <,>. Generative models add a prior probability on the cluster parameters, on one or both sides.

Clustering Meta-Formula We can write our observed data A as A ≈ W U, where A is typically sparse discrete data (e.g. A_ij = count of word i in document j, or of event i by user j), possibly scaled (by tf-idf); W are word-cluster weights (e.g. cluster center weights); U are user- (or document-) cluster weights.

Clustering Meta-Formula Key choices in A ≈ W U. Metric: the likelihood of A given WU, i.e. what ≈ means: multinomial for LDA, Poisson for GAP (roughly KL-divergence); L2 for k-means, NMF, spectral methods, LSA (SVD); von Mises-Fisher distribution; Laplacian approximation. Sign of W: are coefficients signed or non-negative? Signed: LSA, k-means* (depends on data). Non-negative: LDA, GAP, NMF.

Clustering Meta-Formula Range of U: signed for LSA; non-negative for k-means (soft), LDA, GAP, NMF; {0,1} for k-means (hard). Prior: whether there is a prior on W or U. K-means: no, or (sometimes) linear smoothing of center coefficients. LDA: Dirichlet on U, and in Monte-Carlo versions a Dirichlet on W. GAP: Gamma prior on U. NMF: no prior.
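
As one concrete instantiation of the A ≈ W U meta-formula (L2 metric, non-negative W and U, no prior, i.e. plain NMF), here is a minimal sketch using Lee-Seung multiplicative updates; the variable names match the slide's W and U, and everything else (initialization, iteration count) is a choice of the sketch:

import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    # Factor a non-negative matrix A (words x documents) as A ~ W @ U,
    # with W, U >= 0, minimizing squared (L2 / Frobenius) reconstruction error.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps   # word-cluster weights
    U = rng.random((k, n)) + eps   # cluster-document weights
    for _ in range(iters):
        # Lee-Seung multiplicative updates preserve non-negativity.
        U *= (W.T @ A) / (W.T @ W @ U + eps)
        W *= (A @ U.T) / (W @ U @ U.T + eps)
    return W, U

Swapping the metric, the sign constraints, or the priors in this template gives the other methods in the table (LSA, LDA, GAP, soft/hard k-means).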

Performance Comparison Cai et al. paper, TREC AP corpus. Scoring: cluster the docs into k clusters (oblivious to the actual document category labels) and then compute the best matching between the k clusters and the n category labels. Accuracy and mutual information measurements are based on this matching.
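
As a concrete illustration of this scoring scheme (not necessarily what the paper used), one common way to compute the best cluster-to-label matching is the Hungarian algorithm on the cluster/category contingency table; this sketch assumes SciPy is available and that labels and cluster ids are integers starting at 0:

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, assign):
    # labels: true category ids; assign: cluster ids produced by the clustering.
    labels = np.asarray(labels); assign = np.asarray(assign)
    k, n_cat = assign.max() + 1, labels.max() + 1
    # Contingency table: counts of (cluster, category) co-occurrences.
    table = np.zeros((k, n_cat), dtype=int)
    for c, y in zip(assign, labels):
        table[c, y] += 1
    # Hungarian algorithm finds the cluster-to-category matching that
    # maximizes the number of correctly matched documents.
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(labels)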

Performance Comparison Algorithms: k-means; LDA, Latent Dirichlet Allocation; PLSI, probabilistic Latent Semantic Indexing (Hofmann); AA, Average Association (spectral); NC, Normalized Cut; NMF, Non-negative Matrix Factorization; LapPLSI, Laplacian PLSI.

Performance Comparison

Performance Comparison

Performance Comparison

Performance Comparison

vMF distributions A d-dimensional unit random vector x is said to have a von Mises-Fisher (vMF) distribution if its pdf is given by the expression below, where μ is the mean direction and κ ≥ 0 is the concentration parameter. vMF distributions generalize Gaussian distributions to the unit sphere.
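
The pdf itself did not survive the transcription; in its standard form it is

f(x; \mu, \kappa) = c_d(\kappa)\, \exp(\kappa\, \mu^\top x), \qquad c_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}

with \|x\| = \|\mu\| = 1, \kappa \ge 0, and I_\nu the modified Bessel function of the first kind.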

Spherical Admixture (SAM) 1. Draw a set of T topics on the unit hypersphere; 2. For each document d, draw topic weights θ_d from a Dirichlet with hyperparameter α; 3. Draw a document vector v_d from a vMF with mean μ_d = Avg(topics, θ_d), the θ_d-weighted spherical average of the topic directions, and concentration κ. The model is:
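
A hedged reconstruction of the generative process sketched above (the vMF prior on the topics and the symbol names m, κ_0 are assumptions; only the overall structure is given in the slide):

\varphi_t \sim \mathrm{vMF}(m, \kappa_0), \quad t = 1, \ldots, T
\theta_d \sim \mathrm{Dirichlet}(\alpha)
v_d \sim \mathrm{vMF}(\mu_d, \kappa), \qquad \mu_d = \frac{\sum_t \theta_{dt}\, \varphi_t}{\big\|\sum_t \theta_{dt}\, \varphi_t\big\|}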

SAM performance

Google TouchGraph

Summary Why? Cluster quality K-means Spectral clustering Generative Models