Clustering and Dimensionality Reduction

Clustering and Dimensionality Reduction Some material on these slides is borrowed from Andrew Moore's excellent machine learning tutorials located at:

Data Mining Automatically extracting meaning from large, high-dimensional data sets. We will talk about two broad approaches today: dimensionality reduction and clustering.

High Dimensional Data??

Dimensionality Reduction... Re-representing high-dimensional data in a low-dimensional space, so that as much information as possible is preserved. Sometimes this can be done by hand; often, we would like to automate the process. Benefits: visualization, saving space, computational efficiency.

Principal Components Analysis Each point is represented by two numbers. How best to squeeze that down to one?

Informal PCA Summary The principal components of a data set are the directions of maximum variance in the data. Each principal component is perpendicular to all of the others. Dimensionality reduction is performed by projecting the data onto the top few principal components.

The Details... The principal components are computed by finding the eigenvectors of the covariance matrix for the data set. The eigenvalues indicate the amount of variance captured by the corresponding eigenvector.

More Details... If P is a matrix whose columns are the first m principal components, dimensionality reduction is as easy as: y = P^T x, where x is the data point and y is the low-dimensional projection. We can return (approximately) to the original space with: x ≈ P y
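A minimal sketch of these two slides in Python with NumPy: compute the eigenvectors of the covariance matrix, sort by eigenvalue, then project with y = P^T x. The data here is synthetic, purely for illustration.

```python
import numpy as np

# Synthetic 2-D data whose variance lies mostly along one direction
# (hypothetical data, purely for illustration).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# Center the data and compute eigenvectors of its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Sort by descending eigenvalue; the columns of P are the principal
# components, and each eigenvalue is the variance its component captures.
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

# Project onto the top m components (y = P^T x), then map back (x ~ P y).
m = 1
Y = Xc @ P[:, :m]           # low-dimensional representation
X_approx = Y @ P[:, :m].T   # approximate reconstruction
```

Note that the principal components come out orthonormal, which is why projecting and reconstructing are both simple matrix products.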

PCA For Image Compression

Clustering The problem of grouping unlabeled data on the basis of similarity. A key question of data mining: is there useful structure hidden in this data? Applications: image segmentation, document clustering, protein class discovery, compression. One problem: how do we define similarity? First thought: Euclidean distance.

K means
1. Ask the user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to. (Thus each center owns a set of datapoints.)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!
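The steps above can be sketched directly in Python with NumPy. This is a bare-bones version assuming Euclidean distance; the two-blob data at the bottom is synthetic, just to exercise the function.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Bare-bones k-means: repeat 'assign each point to the nearest
    center, then move each center to the centroid of its points'."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random guess
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 3: each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 6: stop when stable
            break
        labels = new_labels
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated synthetic blobs (purely for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = kmeans(X, k=2)
```

The stopping test mirrors the algorithm's convergence guarantee: once no point changes owner, the centers stop moving too.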

K means Start Advance apologies: in black and white this example will deteriorate. Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)

K means continues

K means terminates

Next Question: How Many Clusters?

This Looks Right

This Looks Wrong It is not possible to eyeball data in higher dimensions.

How Do We Define Success? We want two things: Compact clusters: all data points should be near their cluster centers. (We can calculate the total distance from each point to its cluster center.) Few clusters: it is not very useful if we have as many clusters as data points. These two goals are in conflict, so we need to find a good trade-off.

Looking For Elbows...
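One common way to look for the elbow is to plot the total within-cluster sum of squares (the "compactness" measure from the previous slide) against k: it drops sharply until k reaches the true number of clusters, then flattens. A rough sketch, using a small built-in k-means on synthetic three-blob data:

```python
import numpy as np

def kmeans_wcss(X, k, n_iters=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares
    (total squared distance from each point to its cluster center)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    d = np.linalg.norm(X - centers[labels], axis=1)
    return (d ** 2).sum()

# Three well-separated synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (0, 4, 8)])

# The elbow should appear around k=3 when this curve is plotted.
curve = {k: kmeans_wcss(X, k) for k in range(1, 7)}
```

Because k-means only finds a local optimum, in practice each k is usually run several times with different random starts and the best score kept.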

K Means Evaluation It is guaranteed to converge. It is not guaranteed to reach the global minimum; it can get stuck in a local minimum. Commonly used because: It is easy to code. It is efficient.

Single Linkage Hierarchical Clustering
1. Say every point is its own cluster.
2. Find the most similar pair of clusters.
3. Merge them into a parent cluster.
4. Repeat.

Single Linkage Hierarchical Clustering How do we define similarity between clusters? Minimum distance between points in clusters (in which case we're simply doing Euclidean minimum spanning trees). Maximum distance between points in clusters. Average distance between points in clusters. You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here). 1. Say every point is its own cluster. 2. Find the most similar pair of clusters. 3. Merge them into a parent cluster. 4. Repeat until you've merged the whole dataset into one cluster.
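A naive version of this merge loop in Python (cubic time or worse, fine for small data; the four example points are made up). The recorded merges are exactly the information a dendrogram displays.

```python
import numpy as np

def single_linkage(X, target_clusters=1):
    """Naive single-linkage agglomerative clustering: every point starts
    as its own cluster; repeatedly merge the two clusters whose closest
    pair of points is nearest, until target_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    merges = []                       # (cluster_a, cluster_b, distance)
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance between any two points.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Four 1-D points forming two tight pairs (made-up data for illustration).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
clusters, merges = single_linkage(X, target_clusters=2)
```

Swapping the `min` in the inner loop for `max` or a mean gives complete linkage and average linkage, the other two cluster-similarity definitions on this slide.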

Other Kinds of Similarity Sometimes minimizing Euclidean distance doesn't seem to be the right idea. On this data set, K Means gives us these clusters: Take COMP 486: Introduction to Machine Learning