Big Data Analytics
Special Topics for Computer Science, CSE 4095-001 / CSE 5095-005
Feb 9
Fei Wang, Associate Professor, Department of Computer Science and Engineering
fei_wang@uconn.edu
Clustering I
What is Clustering
Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Cluster analysis: grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms
Examples
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- Urban planning: identifying groups of houses according to their house type, value, and geographical location
- Seismology: observed earthquake epicenters should be clustered along continent faults
A Dataset with Cluster Structure
How would you design an algorithm for finding the three clusters in this case?
What Is Good Clustering
A good clustering method will produce clusters with:
- High intra-class similarity
- Low inter-class similarity
A precise definition of clustering quality is difficult:
- Application-dependent
- Ultimately subjective
Google News
Yahoo Sports
Bing Image Search
Clustering Algorithms
Flat algorithms:
- Usually start with a random (partial) partitioning and refine it iteratively
- K-means clustering
- Spectral clustering
- Nonnegative matrix factorization
Hierarchical algorithms:
- Bottom-up (agglomerative)
- Top-down (divisive)
Hard vs. Soft Clustering
Hard clustering: each data point belongs to exactly one cluster
- More common and easier to do
Soft clustering: a data point can belong to more than one cluster
K-Means Clustering
Given k, the k-means algorithm consists of four steps:
1. Select k initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it.
4. Repeat steps 2-3 until no change.
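A minimal NumPy sketch of these four steps, assuming Euclidean distance (the function name, the random-point initialization, and the defaults are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random (here: k random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each object to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: repeat until no change in the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For instance, labels, centroids = kmeans(X, k=3) would target the three-cluster dataset shown earlier. Note the stopping test: the loop ends once the centroids stop moving, one of the termination conditions discussed below.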
K-Means Example
[Figure: alternating steps on a toy dataset: reassign clusters, compute centroids (marked x), reassign clusters, compute centroids, reassign clusters, converged]
Termination Condition
Several possibilities, e.g.:
- A fixed number of iterations is reached.
- The data partition is unchanged.
- Centroid positions don't change.
K-Means Pros and Cons
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
- Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined (what about categorical data?)
- Need to specify k, the number of clusters, in advance.
- Trouble with noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
Hierarchical Clustering
Uses a distance/similarity matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: objects a-e merged over steps 0-4 (a and b join, d and e join, then c joins {d, e}, then all merge); reading left to right is agglomerative, right to left is divisive]
Dendrogram
Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
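As an illustration, SciPy's hierarchy module can build a dendrogram and cut it at a chosen height; the toy data, the average-linkage choice, and the 0.5 threshold below are all assumptions for the sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 2)          # toy 2-D data
Z = linkage(X, method='average')   # build the dendrogram (bottom-up merges)
# Cut at height 0.5: points whose merges all happen below the threshold
# end up in the same connected component, i.e., the same cluster.
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)
```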
Hierarchical Agglomerative Clustering
Starts with each document in a separate cluster.
Then repeatedly joins the closest pair of clusters, until there is only one cluster.
The history of merging forms a binary tree or hierarchy.
Agglomerative Clustering
Input: a pairwise distance matrix over all data points in V.
Algorithm:
1. Place each point of V in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = {V_1, V_2, ..., V_n}.
2. Compute a merging cost function between every pair of elements in L to find the two closest clusters {V_i, V_j}, the cheapest pair to merge.
3. Remove V_i and V_j from L.
4. Merge V_i and V_j to create a new internal node V_ij in T, which will be the parent of V_i and V_j in the resulting tree; add V_ij to L.
5. Go to Step 2 until there is only one cluster remaining.
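A naive Python sketch of this loop, assuming a single-linkage merging cost (the algorithm above leaves the cost function generic); the function and variable names are illustrative:

```python
def agglomerative(D):
    """Naive HAC on a pairwise distance matrix D, using single linkage.

    Returns the merge history as a list of (members_i, members_j) pairs.
    """
    n = len(D)
    clusters = {i: [i] for i in range(n)}   # Step 1: one singleton per point
    merges = []
    while len(clusters) > 1:
        # Step 2: evaluate the merging cost for every pair of clusters
        # and keep the cheapest pair (single linkage: closest point pair).
        best = None
        for i in clusters:
            for j in clusters:
                if i < j:
                    cost = min(D[p][q] for p in clusters[i] for q in clusters[j])
                    if best is None or cost < best[0]:
                        best = (cost, i, j)
        _, i, j = best
        merges.append((clusters[i][:], clusters[j][:]))
        # Steps 3-4: remove the pair and replace it with the merged cluster.
        clusters[i] = clusters[i] + clusters.pop(j)
    return merges
```

Because every iteration rescans all cluster pairs, this sketch runs in roughly cubic time, which is exactly the naive cost discussed next.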
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of the N initial instances, which is O(N^2).
In each of the subsequent N - 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain an overall O(N^2) performance, computing the similarity to each other cluster must be done in constant time.
Often O(N^3) if done naively, or O(N^2 log N) if done more cleverly.
The Goodness of Clustering
Internal criterion: a good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
External Criteria
Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
Assesses a clustering with respect to ground truth; requires labeled data.
Assume the documents have C gold-standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, ..., ω_K, where cluster ω_i has n_i members.
Purity
Simple measure: purity, the ratio between the count of the dominant class in cluster ω_i and the size of cluster ω_i:

    Purity(ω_i) = (1/n_i) max_{j ∈ C} n_ij

where n_ij is the number of members of class j in cluster ω_i.
Biased, because having n clusters (one per data point) maximizes purity.
Alternatives are the entropy of classes in clusters (or the mutual information between classes and clusters).
Example
[Figure: three clusters containing points from three classes]
Cluster I: Purity = (1/6) · max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) · max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) · max(2, 0, 3) = 3/5
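A few lines suffice to reproduce these numbers from the per-class counts n_ij (the counts are read off the figure; the helper name is mine):

```python
def purity(class_counts):
    """Purity of one cluster given its per-class counts n_ij."""
    return max(class_counts) / sum(class_counts)

print(purity([5, 1, 0]))  # Cluster I   -> 5/6 ~= 0.833
print(purity([1, 4, 1]))  # Cluster II  -> 4/6 ~= 0.667
print(purity([2, 0, 3]))  # Cluster III -> 3/5 = 0.6
```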
Rand Index

Number of data pairs                Same cluster in clustering   Different clusters in clustering
Same class in ground truth                      A                              B
Different classes in ground truth               C                              D

RI = (A + D) / (A + B + C + D)
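A pair-counting sketch of this formula (the function and variable names are illustrative):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Rand index over all pairs of points: (A + D) / (A + B + C + D)."""
    agree = total = 0
    for (c1, t1), (c2, t2) in combinations(zip(labels_pred, labels_true), 2):
        same_cluster = (c1 == c2)
        same_class = (t1 == t2)
        # A-pairs (same cluster, same class) and D-pairs (different, different)
        # are the agreements; B- and C-pairs are the disagreements.
        if same_cluster == same_class:
            agree += 1
        total += 1
    return agree / total
```

For reference, scikit-learn's sklearn.metrics.adjusted_rand_score computes a chance-corrected variant of the same idea.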