
Big Data Analytics
Special Topics for Computer Science: CSE 4095-001 / CSE 5095-005
Feb 9
Fei Wang, Associate Professor
Department of Computer Science and Engineering
fei_wang@uconn.edu

Clustering I

What is Clustering
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications:
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Examples
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- Urban planning: identifying groups of houses according to their house type, value, and geographical location
- Seismology: observed earthquake epicenters should be clustered along continental faults

A Dataset with Cluster Structure
How would you design an algorithm for finding the three clusters in this case?

What Is Good Clustering
- A good clustering method will produce clusters with:
  - High intra-class similarity
  - Low inter-class similarity
- A precise definition of clustering quality is difficult:
  - Application-dependent
  - Ultimately subjective

Example applications in practice: Google News, Yahoo Sports, Bing Image Search

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
  - Spectral clustering
  - Nonnegative matrix factorization
- Hierarchical algorithms
  - Bottom-up, agglomerative

Hard vs. Soft Clustering
- Hard clustering: each data point belongs to exactly one cluster. More common and easier to do.
- Soft clustering: a data point can belong to more than one cluster.

K-Means Clustering
Given k, the k-means algorithm consists of four steps:
1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it.
4. Repeat the previous two steps until no change.
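The four steps map directly onto a few lines of NumPy. A minimal sketch (the function name, the fixed seed, and the empty-cluster guard are my additions; the slides do not prescribe them):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: repeat until the partition no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned objects.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # guard against empty clusters (my addition)
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```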

K-Means Example
[figure: starting from random centroids (marked x), the algorithm alternates between reassigning points to clusters and recomputing centroids until the assignment converges]

Termination Condition
Several possibilities, e.g.:
- A fixed number of iterations.
- Data partition unchanged.
- Centroid positions don't change.

K-Means Pros and Cons
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms.
Weaknesses:
- Applicable only when a mean is defined (what about categorical data?)
- Need to specify k, the number of clusters, in advance.
- Trouble with noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

Hierarchical Clustering
- Uses the distance/similarity matrix as the clustering criterion.
- Does not require the number of clusters k as an input, but needs a termination condition.
[figure: the standard diagram of points a, b, c, d, e merged step by step; reading the steps from 0 to 4 (merging) is agglomerative, reading from 4 to 0 (splitting) is divisive]

Dendrogram
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
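In practice, building a dendrogram and cutting it at a level takes a few lines with SciPy. A sketch, assuming average-link merging and an arbitrary cut height of 1.5 for the toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data
Z = linkage(X, method='average')                   # build the merge tree (dendrogram)
# Cut the dendrogram at distance 1.5: every connected component
# below the cut becomes one cluster.
labels = fcluster(Z, t=1.5, criterion='distance')
```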

Hierarchical Agglomerative Clustering
- Starts with each document in a separate cluster.
- Then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.

Agglomerative Clustering
Input: a pairwise matrix involving all data points in V
Algorithm:
1. Place each point of V in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = V_1, V_2, V_3, ..., V_(n-1), V_n.
2. Compute a merging cost function between every pair of elements in L to find the two closest clusters {V_i, V_j}, which will be the cheapest pair to merge.
3. Remove V_i and V_j from L.
4. Merge V_i and V_j to create a new internal node V_ij in T, which will be the parent of V_i and V_j in the resulting tree.
5. Go to Step 2 until there is only one cluster remaining.
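A from-scratch sketch of this loop. The slides leave the merging cost function open; single link (minimum pairwise distance) is assumed here, and the naive double loop over cluster pairs is exactly what makes this the O(N^3) case discussed on the next slide:

```python
import numpy as np

def naive_hac(D):
    """Agglomerative clustering over a pairwise distance matrix D,
    returning the merge history (the internal nodes of the tree T)."""
    clusters = [{i} for i in range(len(D))]  # Step 1: singletons
    merges = []
    while len(clusters) > 1:
        # Step 2: evaluate the merging cost for every pair of clusters;
        # single link (minimum pairwise distance) is assumed here.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # Steps 3-4: remove the cheapest pair and add their merged parent.
        merged = clusters[a] | clusters[b]
        merges.append(merged)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)  # Step 5: loop until one cluster remains
    return merges
```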

Computational Complexity
- In the first iteration, all HAC methods need to compute the similarity of all pairs of the N initial instances, which is O(N^2).
- In each of the subsequent N-2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
- To maintain an overall O(N^2) performance, computing the similarity to each other cluster must be done in constant time.
- Often O(N^3) if done naively, or O(N^2 log N) if done more cleverly.

The Goodness of Clustering
Internal criterion: a good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
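One concrete way to check this internal criterion is to compare the average within-cluster distance against the average between-cluster distance (low intra-cluster distance corresponds to high intra-class similarity). A sketch, with the function name my own:

```python
import numpy as np

def intra_inter_distance(X, labels):
    """Mean within-cluster and between-cluster distances: a good
    clustering keeps the first low (high intra-class similarity)
    and the second high (low inter-class similarity)."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    same = labels[:, None] == labels[None, :]                  # same-cluster pairs
    off_diag = ~np.eye(len(X), dtype=bool)                     # drop self-pairs
    return D[same & off_diag].mean(), D[~same].mean()
```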

External Criteria
- Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
- Assesses a clustering with respect to ground truth; requires labeled data.
- Assume the documents have C gold-standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, ..., ω_K with n_i members.

Purity
- Simple measure: purity, the ratio between the count of the dominant class in cluster ω_i and the size of cluster ω_i:

  Purity(ω_i) = (1/n_i) max_{j ∈ C} n_ij

  where n_ij is the number of members of class j in cluster ω_i.
- Biased, because having n clusters (one per point) maximizes purity.
- Alternatives are the entropy of classes in clusters (or the mutual information between classes and clusters).

Example
[figure: three clusters of points from three gold classes]
Cluster I: Purity = (1/6) · max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) · max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) · max(2, 0, 3) = 3/5
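The same computation as a sketch; the matrix of per-cluster class counts is taken from the example above, and the function name is my own:

```python
import numpy as np

def purity(cluster_class_counts):
    """Row i holds the per-class member counts of cluster i; purity is
    the dominant class count divided by the cluster size n_i."""
    counts = np.asarray(cluster_class_counts)
    return counts.max(axis=1) / counts.sum(axis=1)

# The three clusters from the example above:
print(purity([[5, 1, 0],
              [1, 4, 1],
              [2, 0, 3]]))  # -> [0.833... 0.666... 0.6]
```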

Rand Index
Count the number of data pairs, split by whether the pair falls in the same cluster under the clustering and whether it has the same class in the ground truth:

  Number of data pairs               | Same cluster | Different clusters
  Same class in ground truth         | A            | B
  Different classes in ground truth  | C            | D

  RI = (A + D) / (A + B + C + D)
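A direct sketch of this definition, counting all pairs of points (function name my own); note the naive pair loop is quadratic in the number of points:

```python
from itertools import combinations

def rand_index(pred, truth):
    """Counts A, B, C, D over all pairs of points as in the table above,
    then returns RI = (A + D) / (A + B + C + D)."""
    A = B = C = D = 0
    for i, j in combinations(range(len(pred)), 2):
        same_cluster = pred[i] == pred[j]
        same_class = truth[i] == truth[j]
        if same_cluster and same_class:
            A += 1      # agree: together in both
        elif not same_cluster and same_class:
            B += 1      # split by the clustering, same gold class
        elif same_cluster and not same_class:
            C += 1      # joined by the clustering, different gold classes
        else:
            D += 1      # agree: apart in both
    return (A + D) / (A + B + C + D)
```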