What is Unsupervised Learning?


Clustering

What is Unsupervised Learning? Unlike in supervised learning, in unsupervised learning there are no labels. We simply search for patterns in the data. Examples: clustering, density estimation, dimensionality reduction.

What is Clustering? A form of unsupervised learning used to separate a large group of observations into smaller subgroups of similar observations. Examples: topic modeling (clustering documents by subject: politics, sports, etc.), identifying hot spots of police or gang violence in urban areas, image segmentation, gene expression analysis, etc. Relevance to EDA: clustering is used to identify and visualize patterns in data, and helps identify outliers and/or formulate hypotheses.

k-means Clustering. k-means clustering is the most common type of clustering; the k in k-means refers to the number of clusters. Pros: it is easy to understand and to implement, so it can produce quick-and-dirty results. Cons: you must specify k in advance; if k is too large, it will find clusters where there are none; if k is too small, it will miss real clusters.

Remember This? A middle school dance. k-means clustering requires that you specify how many clusters you want to group the dance attendees into. Let's pick three clusters, so k = 3. (Image and example from John W. Foreman's Data Smart book.)

A Middle School Dance. (Image and example from John W. Foreman's Data Smart book.) k-means clustering starts with three initial points (cluster centers), one per cluster, spread out across the dance floor. Dancers are assigned to the cluster that's nearest to them. The algorithm then slides the cluster centers and their corresponding clusters around until it finds a good fit.

A Middle School Dance. The algorithm is initialized with three centers (three black circles). Each data point is then assigned to the nearest center. In effect, this operation divides the space into three clusters, which are depicted here as variously shaded regions.

A Middle School Dance. Cluster centers are then recomputed; observe how they move towards the data points. And the process continues (until some stopping criterion is met).

A Middle School Dance. Final form! Note the locations of the cluster centers and the divisions between clusters.

What do the clusters mean? It is never a good idea to take an algorithm's word for it; we must always apply human insight to interpret an algorithm's output. In the case of clustering, we ask what the clusters might signify. For a middle school dance, they could be cliques: the kids might be too timid to dance with kids outside their comfort zone! k-means allows us to cluster data, but we cannot accept a clustering if we cannot attribute meaning to the clusters. We must be able to understand the why behind the assignment.

How does it work?
Step 1: Choose a desired number of clusters, k.
Step 2: Randomly assign each data point to an initial cluster.
Step 3: Compute the cluster centroids (this is the k-centroids algorithm, right?).
Step 4: Re-assign each point to the closest cluster centroid.
Step 5: Re-compute the cluster centroids.
Step 6: Repeat steps 4 and 5 until a stopping criterion is met.
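
A minimal pure-Python sketch of these steps (illustrative only: the function name `kmeans`, the sample 2-D points, and the fixed random seed are my own choices, not from the lecture):

```python
import math
import random

def centroid(points):
    # Mean of a list of 2-D points.
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly assign each point to one of k initial clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iters):
        # Steps 3 and 5: compute the centroid of each cluster.
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Reseed an empty cluster with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 4: re-assign each point to the closest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Step 6: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.5), (8.5, 9.5)]
labels, centers = kmeans(points, k=2)
print(labels)  # e.g. [0, 0, 1, 1, 0, 1]; the exact cluster ids depend on the seed
```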

We're still missing something key! Goal: to group similar objects into meaningful subgroups, called clusters. Points on the Cartesian plane are similar if the distance between them is small. But more generally, how do we define the similarity between two observations? Answer: we use a metric. A note on nomenclature: sometimes we speak of the similarity of two observations, and other times we speak of their difference/distance. A term that captures both similarity and distance is proximity.

What is a metric? A way to gauge similarity (or dissimilarity) among things. A distance metric satisfies these three properties: Symmetry: A is the same distance from B as B is from A. Positivity: distances are always non-negative (what would a negative distance mean?). Triangle Inequality: each of the three pairwise distances A, B, C is at most the sum of the other two, i.e., A + B ≥ C, B + C ≥ A, and A + C ≥ B. Note: a measure of similarity does not have to be a distance metric to be useful!
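
As a toy illustration (my own, not from the slides), these properties can be checked numerically for Euclidean distance on a few sample points:

```python
import math

a, b, c = (1, 3), (4, 2), (0, 0)

assert math.dist(a, b) == math.dist(b, a)   # symmetry
assert math.dist(a, b) >= 0                 # positivity
# Triangle inequality: going directly from a to c is never longer
# than detouring through b.
assert math.dist(a, c) <= math.dist(a, b) + math.dist(b, c)
```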

Euclidean Distance. Distance "as the crow flies" is called Euclidean distance. In two dimensions, for points (x_1, y_1) and (x_2, y_2): D = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2). In n dimensions: D = sqrt(Σ_i (x_i - y_i)^2). Example: for (1, 3) and (4, 2), D = sqrt((1 - 4)^2 + (3 - 2)^2) = sqrt(9 + 1) = sqrt(10) ≈ 3.16.

Manhattan Distance. Euclidean distance probably isn't the best measure of distance in Manhattan, because the streets form a grid. A better idea: in two dimensions, D = |x_1 - x_2| + |y_1 - y_2|, or in n dimensions, D = Σ_i |x_i - y_i|. Example: for (1, 3) and (4, 2), D = |1 - 4| + |3 - 2| = 3 + 1 = 4.
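
A small sketch of both distances in Python (the function names are mine; the printed values match the worked examples above):

```python
import math

def euclidean(p, q):
    # Straight-line ("as the crow flies") distance in n dimensions.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # Grid ("city block") distance in n dimensions.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean((1, 3), (4, 2)))  # sqrt(10) ≈ 3.162
print(manhattan((1, 3), (4, 2)))  # |1-4| + |3-2| = 4
```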

Cosine Similarity. Metrics need not capture spatial/physical distance. Cosine similarity is the cosine of the angle between two vectors; it measures the extent to which two vectors point in the same direction: 1 if the vectors point in exactly the same direction, 0 if the vectors form a right angle, -1 if the vectors point in completely opposite directions. If θ is the angle between two vectors, then cosine similarity is just cos(θ).

Cosine Similarity. For a = (2, 2) and b = (1, -3): the dot product (the sum of the pointwise products) is (a_x)(b_x) + (a_y)(b_y) = (2)(1) + (2)(-3) = 2 - 6 = -4, and the lengths of a and b (their Euclidean distances from the origin) are sqrt(2^2 + 2^2) = sqrt(8) and sqrt(1^2 + (-3)^2) = sqrt(10). So cos(θ) = -4 / (sqrt(8) · sqrt(10)) ≈ -0.45.
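
A quick sketch in plain Python (function name mine) that reproduces this calculation:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a · b) / (||a|| * ||b||)
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    norm_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity((2, 2), (1, -3)))  # -4 / (sqrt(8) * sqrt(10)) ≈ -0.447
```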

Pearson's Correlation Coefficient. Remember correlation? Positive if, when X increases, Y also increases; negative if, when X increases, Y decreases (or vice versa). Let's apply the same idea to two n-dimensional observations.
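
For reference, the standard textbook formula (not spelled out on the slide) for Pearson's correlation between two n-dimensional observations x and y, with means x̄ and ȳ:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```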

Hamming Distance. The metrics we've discussed so far make sense for data with numerical features; there are other metrics for categorical data! Given two ordered vectors of the same length (i.e., representing the same features), the Hamming distance is the number of feature values that differ. Assume we have compiled a group of people's preferences for their favorite (food, movie, book, subject, season): Person A: (banana, The Stranger, Forrest Gump, CS, Winter); Person B: (apple, Harry Potter, Star Trek, CS, Winter). Persons A and B differ in the food, movie, and book categories, so their Hamming distance is 3, and their normalized Hamming distance is 3/5.
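
A minimal sketch (the function name `hamming` is my own) that reproduces the example:

```python
def hamming(u, v):
    # Number of positions at which two equal-length vectors differ.
    assert len(u) == len(v)
    return sum(1 for ui, vi in zip(u, v) if ui != vi)

a = ("banana", "The Stranger", "Forrest Gump", "CS", "Winter")
b = ("apple", "Harry Potter", "Star Trek", "CS", "Winter")
print(hamming(a, b))            # 3
print(hamming(a, b) / len(a))   # 0.6 (normalized: 3/5)
```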

Jaccard Similarity. Amazon might represent each of its users by a vector of the products they buy. But then, to measure similarity between users, almost all entries are 0, so all users look very similar. Jaccard similarity measures similarity among the non-zeros. E.g., for x = (1,0,0,0,0,0,0,0,0,0) and y = (1,0,0,0,0,1,0,0,0,1), where M_ab counts the positions at which x has value a and y has value b: M11 = 1, M10 = 0, M01 = 2, and M00 = 7. J = M11 / (M01 + M10 + M11) = 1/3. Simple Matching Coefficient = (M00 + M11) / (M00 + M01 + M10 + M11) = 8/10.
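
A small sketch (the helper name `binary_counts` is mine) computing both coefficients for the example vectors:

```python
def binary_counts(x, y):
    # Tally agreement/disagreement counts for two binary vectors.
    m11 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    m10 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    m01 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    m00 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return m11, m10, m01, m00

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (1, 0, 0, 0, 0, 1, 0, 0, 0, 1)
m11, m10, m01, m00 = binary_counts(x, y)
print(m11 / (m01 + m10 + m11))                  # Jaccard: 1/3
print((m00 + m11) / (m00 + m01 + m10 + m11))    # SMC: 8/10
```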

What makes a good clustering? The intra-cluster similarity is high; the inter-cluster similarity is low. All of the aforementioned metrics measure intra-cluster similarity (i.e., similarity within clusters). To evaluate a clustering, we also need metrics to measure inter-cluster similarity: i.e., similarity between one cluster and another.

Linkage Metrics. Given two clusters A and B:
Complete: find the maximum distance over all pairs a ∈ A, b ∈ B.
Single: find the minimum distance over all pairs a ∈ A, b ∈ B.
Centroid: find the distance between the centroid of A and the centroid of B.
Mean/Average: find the average of the distances over all pairs a ∈ A, b ∈ B.
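
A pure-Python sketch of the four linkage metrics (the function names and the two toy clusters are my own choices):

```python
import math
from itertools import product

def pairwise(A, B):
    # All distances between a point of A and a point of B.
    return [math.dist(a, b) for a, b in product(A, B)]

def complete_linkage(A, B):  # maximum pairwise distance
    return max(pairwise(A, B))

def single_linkage(A, B):    # minimum pairwise distance
    return min(pairwise(A, B))

def average_linkage(A, B):   # mean pairwise distance
    d = pairwise(A, B)
    return sum(d) / len(d)

def centroid_linkage(A, B):  # distance between the two centroids
    cA = tuple(sum(c) / len(A) for c in zip(*A))
    cB = tuple(sum(c) / len(B) for c in zip(*B))
    return math.dist(cA, cB)

A = [(0, 0), (1, 0)]
B = [(4, 0), (5, 0)]
print(single_linkage(A, B), complete_linkage(A, B))   # 3.0 5.0
print(average_linkage(A, B), centroid_linkage(A, B))  # 4.0 4.0
```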

What makes a good clustering? The intra-cluster similarity is high; the inter-cluster similarity is low. The quality of a clustering depends on the choice of similarity metric, and there are lots of different metrics to choose from: think about what makes the most sense for your data. Important: you can't compare distances without first normalizing your data. E.g., sleep (on average 8 hours) dominates GPA (on average 3), so a clustering by both would cluster only by sleep if the data were not first normalized.
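
A minimal sketch of one common normalization, z-scoring (the sample sleep and GPA numbers here are made up for illustration):

```python
import statistics

def zscore(column):
    # Rescale a feature to mean 0 and standard deviation 1.
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [(v - mu) / sigma for v in column]

sleep = [6.0, 7.5, 8.0, 8.5, 9.0]   # hours per night
gpa   = [2.8, 3.0, 3.2, 3.6, 3.9]
# After z-scoring, both features contribute on the same scale,
# so neither dominates the distance computation.
print(zscore(sleep))
print(zscore(gpa))
```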

Hierarchical Clustering

Two Approaches
Agglomerative: a bottom-up approach. Each observation starts in its own cluster, and clusters are merged as one moves up the hierarchy, until all observations are in the same cluster.
Divisive: a top-down approach. All observations start in the same cluster, and clusters are divided as one moves down the hierarchy, until each observation comprises its own cluster.

Agglomerative Algorithm
Step 1: Initialize the algorithm with each data point in its own cluster.
Step 2: Calculate the distances between all clusters.
Step 3: Combine the two closest clusters into one.
Step 4: Repeat steps 2 and 3 until all data points are in the same cluster.
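
A compact sketch of this loop in plain Python (single linkage by default; the function name `agglomerate` and the sample points are assumptions of mine):

```python
import math

def agglomerate(points, linkage=min):
    """Merge clusters until one remains, recording each merge.
    `linkage` aggregates the pairwise distances between two clusters
    (min = single linkage, max = complete linkage)."""
    clusters = [[p] for p in points]   # each point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Combine the two closest clusters into one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

for a, b, d in agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)]):
    print(f"merge {a} + {b} at distance {d:.2f}")
```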

Agglomerative Example.

Agglomerative vs. Divisive.
