Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

1 Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

2 Descriptive model. A descriptive model presents the main features of the data in a convenient form: it summarizes the data and reveals their important aspects. The two major kinds of descriptive modeling are describing the data by their distribution and describing the data by their cluster structure.

3 Modeling the probability density (I): Parametric approach. General steps for parametric modeling of a density:
1. Choose a parametric model by assuming the data are generated from it. This can be a single complex distribution, or the density can be approximated by a mixture of simple density functions.
2. Use the (regularized) likelihood function or the posterior probability of the parameters as the score function.
3. Use MLE or MAP to fit the model.
Problem: if the model assumption is incorrect, the fitted model can be far from the true properties of the data.
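The parametric steps above can be illustrated with a minimal sketch, assuming the chosen model is a single one-dimensional Gaussian and the score function is the plain (unregularized) log-likelihood. The function names and the toy data are hypothetical and are not taken from the slides.

```python
import math

def fit_gaussian_mle(xs):
    """Maximum-likelihood estimates (mean, variance) of a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # the MLE divides by n, not n - 1
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, mu, var):
    """Score function: log-likelihood of the data under the model."""
    return sum(math.log(gaussian_pdf(x, mu, var)) for x in xs)

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data, made up for illustration
mu, var = fit_gaussian_mle(data)
print(mu, var, log_likelihood(data, mu, var))
```

Any other choice of the mean gives a lower log-likelihood on the same data, which is what fitting by MLE means here.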

4 Modeling the probability density (II): Nonparametric approach. General properties: no parametric model is explicitly assumed; the observed data alone determine the density. Each data point is a piece of evidence about the underlying density and contributes to the estimate at every location in the space. Assumption: a data point is more likely to be observed in a high-density region than in a low-density region, so the density close to a data point is likely to be higher than the density far away from any data point.

5 Modeling the probability density (II): Nonparametric approach. The kernel estimate: $f(x) = \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $K$ is the kernel function and $h$ is the bandwidth. Pros: no model assumption is needed. Cons: requires a large number of data points for a good estimate; sensitive to dimensionality; high computation cost.
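A minimal sketch of the kernel estimate in one dimension, assuming a standard Gaussian kernel. The sketch divides by n·h rather than n so that the estimate integrates to one; the slide's formula can be read as absorbing the 1/h factor into K. The function names, data, and bandwidth are illustrative assumptions.

```python
import math

def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def kernel_density(x, data, h):
    """Kernel estimate at x: every data point contributes a bump of width h."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 1.2, 0.8, 5.0]  # toy data, made up for illustration
near = kernel_density(1.0, data, h=0.5)  # close to three data points
far = kernel_density(3.0, data, h=0.5)   # far from every data point
print(near, far)
```

Each query sums over all n points, which is where the high computation cost listed among the cons comes from.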

6 Describing the data by cluster structures. Clustering (cluster analysis) decomposes the data set into groups such that the points in one group are similar to each other and as different as possible from the points in other groups. Clustering is unsupervised learning in that there are no class labels for the training instances. It can be used to summarize the data by their cluster structure and to pre-process data for other data mining algorithms.

7 What is a good clustering? A good clustering produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity. The quality of a clustering result depends on the similarity measure used and on the clustering method, which determines how similar the points within one cluster are. Ultimately, the goodness of the clusters depends on the opinion of the user!

8 Different types of clustering: partition-based clustering, hierarchical clustering, density-based clustering, and model-based clustering.

9 Partition-based clustering (I): the model. Given a data set of n points, a partitioning method constructs k (k < n) partitions of the data, where each partition represents a cluster. The model of partition-based clustering is a set of k representatives, one for each of the k clusters. A representative can be the center (or mean) of the cluster, or the data point nearest to the center of the cluster.

10 Partition-based clustering (II): the score function. The score function can leverage:
Within-cluster variation (compactness): $wc(C) = \sum_{k=1}^{K} wc(C_k) = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, r_k)^2$, with matrix form $W = \sum_{k=1}^{K} \sum_{x \in C_k} (x - r_k)(x - r_k)^\top$.
Between-cluster variation (separation): $bc(C) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$, with matrix form $B = \sum_{k=1}^{K} n_k (r_k - \hat{\mu})(r_k - \hat{\mu})^\top$.
Widely used score functions include $wc(C)$, $bc(C)/wc(C)$, $\mathrm{trace}(W)$, $|W|$, and $\mathrm{trace}(B W^{-1})$.
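The scalar scores wc(C) and bc(C) can be computed directly from a given partition. The sketch below assumes squared Euclidean distance for d and the cluster mean as the representative r_k; the helper names and the toy clusters are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d(a, b)^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def within_cluster(clusters):
    """wc(C): sum of squared distances of points to their cluster representative."""
    total = 0.0
    for c in clusters:
        r = centroid(c)
        total += sum(sq_dist(x, r) for x in c)
    return total

def between_cluster(clusters):
    """bc(C): sum of squared distances over all pairs of representatives."""
    reps = [centroid(c) for c in clusters]
    return sum(sq_dist(reps[j], reps[k])
               for j in range(len(reps)) for k in range(j + 1, len(reps)))

# Two compact, well-separated toy clusters (made up for illustration).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
wc = within_cluster(clusters)
bc = between_cluster(clusters)
print(wc, bc, bc / wc)
```

A compact and well-separated partition gives a small wc and a large bc, so the ratio bc(C)/wc(C) is large.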

11 Partition-based clustering (III): search methods. Problems in search: there is NO closed-form solution for any score function of interest, and the number of possible partitions explodes combinatorially, so exhaustive search is infeasible. Iterative-improvement algorithms based on local search are particularly popular in clustering. Options include greedy search and systematic search (search tree, beam search, ...).

12 Partition-based clustering (III): search methods. General idea:
1. Start with a randomly chosen clustering of the points.
2. Reassign points to improve the score function.
3. Recalculate the updated cluster centers.
4. Repeat steps 2 and 3 until the score function no longer changes.
What is the cluster center? Candidates: the mean of the points in the cluster; the point nearest to the mean; the median of the points in the cluster; the mode vector (for 0-1 vectors).

13 Partition-based clustering (IV): typical method. K-means: each cluster is represented by the mean value of the data in the cluster.
Step 1: Randomly select k objects as the centers of the clusters.
Step 2: Assign each remaining object to the cluster whose center is nearest to the object.
Step 3: Compute the mean of each cluster and use the means as the new centers.
Step 4: If nothing changes, exit; otherwise go to Step 2.
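The four k-means steps can be sketched as follows, assuming squared Euclidean distance and points stored as tuples. The `seed` and `max_iter` parameters and the rule of keeping the old center for an empty cluster are assumptions added for the sketch; they are not part of the slide.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # Step 1: k random objects as centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # Step 2: assign to the nearest center
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: cluster means become the new centers
        # (assumption: an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:               # Step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]  # toy data
centers, clusters = kmeans(points, k=2)
print(centers)
```

On data this well separated the two groups are recovered regardless of the random start; on harder data the result depends on the initial centers, since the procedure is a local search.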

14 Partition-based clustering (IV): typical method. K-medoids: each cluster is represented by one of the objects near the center of the cluster.
Step 1: Randomly select k objects as the medoids of the clusters.
Step 2: Assign each remaining object to the cluster whose medoid is nearest to the object.
Step 3: If the quality of the clustering can be improved by swapping a non-medoid with a medoid, swap them and go to Step 2; otherwise exit.
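A minimal swap-based sketch of the k-medoids steps, assuming that clustering quality is measured by the total distance from each object to its nearest medoid and that every improving swap is accepted greedily. That quality measure, the `seed` parameter, and the function names are assumptions for illustration; the sketch also assumes all points are distinct.

```python
import random

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Assumed quality measure: total distance of each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def kmedoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # Step 1: k random objects as medoids
    improved = True
    while improved:
        improved = False
        best = total_cost(points, medoids)        # Step 2 is implicit in the cost
        for i in range(k):                        # Step 3: try medoid/non-medoid swaps
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:                   # keep the swap only if it helps
                    medoids, best, improved = trial, cost, True
    clusters = {m: [] for m in medoids}
    for p in points:                              # final assignment to nearest medoid
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]  # toy data
medoids, clusters = kmedoids(points, k=2)
print(medoids)
```

Because the representatives are always actual data objects, only pairwise distances are needed, not the ability to average points.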

15 Partition-based clustering (V): the variants. Stochastic variant: examine each point in turn and update the cluster centers whenever a point is reassigned. Online variant: examine each point only once. Changing cluster structures: split and/or merge clusters.
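One possible reading of the online variant is sketched below: each point is seen exactly once, assigned to its nearest center, and that center is moved toward the point with a running-mean update. The update rule and the convention that each initial center counts as one observed point are assumptions, not details given on the slide.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def online_kmeans(stream, init_centers):
    centers = [list(c) for c in init_centers]
    counts = [1] * len(centers)   # assumption: each initial center counts as one point
    for x in stream:              # every point is examined exactly once
        j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        counts[j] += 1
        # Running mean: move the winning center a 1/count step toward the point.
        centers[j] = [c + (xi - c) / counts[j] for c, xi in zip(centers[j], x)]
    return centers

stream = [(2.0,), (12.0,), (1.0,), (11.0,)]   # toy stream, made up for illustration
centers = online_kmeans(stream, init_centers=[(0.0,), (10.0,)])
print(centers)
```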

16 Hierarchical clustering (I). Hierarchical clustering gradually merges points or divides superclusters. Two schemes: bottom-up (agglomerative) and top-down (divisive). Properties of hierarchical clustering: it keeps a history of how the points are grouped; it spans both extremes, from one-point clusters to a single all-in-one cluster; the clustering process does NOT depend on the number of clusters K; there is no clear separation of model, score function, and search method.

17 Hierarchical clustering (II): typical method. AGNES (agglomerative):
Step 1: Place every object into a cluster of its own.
Step 2: Merge the two clusters with the minimum Euclidean distance between their nearest objects.
Step 3: If all objects are in a single cluster, exit; otherwise go to Step 2.
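The AGNES steps can be sketched with a naive implementation that recomputes all pairwise cluster distances at every merge. The `num_clusters` stopping parameter is an assumption added so the merging can be stopped early; with its default of 1 the loop runs until a single cluster remains, as on the slide.

```python
def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Distance between the nearest objects of two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def agnes(points, num_clusters=1):
    clusters = [[p] for p in points]               # Step 1: one cluster per object
    merge_distances = []                           # history of the merges
    while len(clusters) > num_clusters:            # Step 3: stop condition
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Step 2: pick the pair of clusters with the smallest nearest-object distance.
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merge_distances.append(single_link(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merge_distances

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]  # toy data
clusters, merge_distances = agnes(points, num_clusters=2)
print(clusters, merge_distances)
```

The recorded merge distances are the heights at which a dendrogram would join the corresponding clusters.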

18 Hierarchical clustering (II): typical method. DIANA (divisive):
Step 1: Place all the objects in one cluster.
Step 2: Split a cluster according to the maximum Euclidean distance between the nearest neighboring objects in the cluster.
Step 3: If each cluster contains only one object, exit; otherwise go to Step 2.

19 Hierarchical clustering (III). Key issue: defining the distance between clusters. Single link (nearest neighbor): the distance between two clusters is the distance between their two closest points, one from each cluster. Pros: can discover non-spherical clusters, because clusters grow by chaining nearby points; order-insensitive for equal-distance candidates. Cons: sensitive to data variation and outliers; of limited value for segmentation, since clusters connect too easily.

20 Hierarchical clustering (III). Complete link (furthest neighbor): the distance between two clusters is the distance between their two most distant points, one from each cluster. Pros: insensitive to data variance and outliers; good for segmentation; produces clusters of roughly equal volume. Cons: tends to produce spherical clusters; order-sensitive for equal-distance candidates. Other variants: distance between centroids, mean distance, median distance, ...
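The cluster-distance definitions can be compared side by side on one pair of clusters. The sketch assumes Euclidean distance between points and reads "mean distance" as the average over all cross-cluster pairs; the function names and the toy clusters are illustrative.

```python
def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Nearest neighbor: distance between the two closest points."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Furthest neighbor: distance between the two most distant points."""
    return max(dist(a, b) for a in c1 for b in c2)

def mean_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    """Distance between the two cluster centroids."""
    def centroid(c):
        return tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0])))
    return dist(centroid(c1), centroid(c2))

a = [(0.0, 0.0), (1.0, 0.0)]   # toy clusters, made up for illustration
b = [(3.0, 0.0), (7.0, 0.0)]
print(single_link(a, b), complete_link(a, b), mean_link(a, b), centroid_link(a, b))
```

Single link only looks at the closest pair, which is why a single stray point between two groups can connect them, while complete link is driven by the most distant pair.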

21 Hierarchical clustering (IV) Dendrogram representation

22 Density-based clustering (I). A density-based method creates clusters by continuing to grow a cluster as long as the density of the data objects in its neighborhood exceeds some threshold. Key points: defining the neighborhood (how much space should be considered) and determining the threshold (how many points should be regarded as dense).

23 Density-based clustering (II): typical method. DBSCAN key concepts:
An object P whose ε-neighborhood contains no fewer than MinPts objects is a core object with respect to ε and MinPts.
An object M is directly density-reachable from object P with respect to ε and MinPts if M is within the ε-neighborhood of P and that neighborhood contains at least MinPts points.
An object Q is density-reachable from object P with respect to ε and MinPts if there is a chain of objects p_1, ..., p_n with p_1 = P and p_n = Q such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts.
An object S is density-connected to object R with respect to ε and MinPts if there is an object O such that both S and R are density-reachable from O with respect to ε and MinPts.
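The concepts above translate into a compact clustering procedure: start from an unvisited core object and collect everything density-reachable from it. The sketch below is a naive quadratic-time version; it assumes Euclidean distance, counts an object as a member of its own ε-neighborhood, and labels noise as -1. These conventions and the function names are assumptions for illustration.

```python
def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def region_query(points, i, eps):
    """Indices of all objects in the eps-neighborhood of points[i] (itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    NOISE = -1
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core object (may become border later)
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                # core object: start a new cluster
        queue = list(neighbors)               # directly density-reachable objects
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:            # border object reached from a core object
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also core: keep following the chain
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]                       # two dense squares and one isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters is not specified in advance; it follows from ε and MinPts, and objects that are not density-reachable from any core object are left as noise.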

24 Model-based clustering (I). Model-based clustering hypothesizes a model for each of the clusters and finds the best fit of the data to that model. Widely used methods: probabilistic models; neural networks based on a competitive excitative mechanism; large-margin models with pseudo-label assignments.

25 Model-based clustering (II): typical method. Mixture model + EM. Basic idea: the data are generated from a mixture model with K components; use EM to fit the model, treating the cluster assignments as hidden variables; then derive each point's cluster assignment from the estimated probabilities. If Gaussian components are used, k-means can be regarded as a stepwise approximation of EM.
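A minimal sketch of the mixture-model idea, assuming one-dimensional data and Gaussian components. The fixed number of iterations, the small variance floor, and the hand-picked initial parameters are assumptions made to keep the example short and numerically safe.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, init_mus, init_vars, init_weights, n_iter=50):
    mus, variances, weights = list(init_mus), list(init_vars), list(init_weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: posterior probability of each hidden assignment (responsibilities).
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # assumption: floor to avoid zero variance
    # Cluster assignment: the most probable component for each point.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return mus, variances, weights, labels

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]          # toy data, made up for illustration
mus, variances, weights, labels = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(mus, labels)
```

Replacing the soft responsibilities with a hard 0/1 assignment to the nearest mean, and keeping the variances fixed and equal, turns this loop into the k-means iteration.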

26 Model-based clustering (II): typical method. Self-Organizing Maps (SOM): a SOM is a type of neural network trained in an unsupervised fashion to create a low-dimensional representation of the data.
Step 1: Arrange a number of neurons in the map, each associated with a weight vector.
Step 2: For each input vector, compute the distance between it and every weight vector to find the best matching neuron.
Step 3: Find the neighbors of that neuron and push them towards the input vector.
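The three SOM steps can be sketched on a one-dimensional map, that is, a chain of neurons where the neighbors of neuron i are the neurons within `radius` positions of it. The chain topology, the fixed learning rate and radius (practical implementations usually decay both), the random initialization, and all parameter names are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_matching(x, weights):
    """Step 2: index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))

def train_som(data, n_neurons, n_epochs=50, lr=0.5, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: a chain of neurons, each with a random weight vector in [0, 1)^dim.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(n_epochs):
        for x in data:
            bmu = best_matching(x, weights)
            # Step 3: push the best matching neuron and its chain neighbors toward x.
            for i in range(max(0, bmu - radius), min(n_neurons, bmu + radius + 1)):
                weights[i] = [w + lr * (xi - w) for w, xi in zip(weights[i], x)]
    return weights

data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.9)]  # toy data in the unit square
weights = train_som(data, n_neurons=4)
print([best_matching(x, weights) for x in data])
```

After training, the index of the best matching neuron gives each input a position on the map, which is the low-dimensional representation.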

27 Let's move to Chapter 10
