
Summary so far. Supervised learning: learn to predict a target. Continuous target: regression (e.g., linear regression); categorical target: classification. Discriminative classification models: perceptron (linear), logistic regression (linear), decision tree (nonlinear), nearest neighbor (nonlinear). Generative models: Bayes classifier (not really practical), Naïve Bayes classifier (assumes conditional independence between features given the class label).

What to come. There are a few more topics we will cover on supervised learning, namely ensemble methods and (perhaps) neural nets, but we will delay these for a couple of weeks. We will first talk about unsupervised learning, so that you are not limited to supervised learning when thinking about your project.

Unsupervised learning, Clustering CS434

Unsupervised learning and pattern discovery. So far, our data has been labeled: a table of feature vectors $\mathbf{x}^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_m)$, $i = 1, \dots, n$, each paired with a label $y_i$. We will now be looking at unlabeled data: the same kind of table of feature vectors, but without the labels $y_1, \dots, y_n$. What do we expect to learn from such data? I have tons of data (web documents, gene data, etc.) and need to: organize it better, e.g., find subgroups; understand it better, e.g., understand interrelationships; find regular trends in it, e.g., "if A and B, then C".

Hierarchical clustering as a way to organize web pages

Finding association patterns in data

Dimension reduction

Clustering

A hypothetical clustering example: a scatter plot of monthly spending (vertical axis) versus monthly income (horizontal axis), where each point represents a person in the database.

In the plot, the points fall into three visible groups, which suggests there may be three distinct spending patterns among the people in our database.

What is clustering? In general, clustering can be viewed as an exploratory procedure for finding interesting subgroups in given data. A large portion of the effort focuses on a special kind of clustering: group all given examples (i.e., exhaustive) into disjoint clusters (i.e., partitional), such that examples within a cluster are (very) similar and examples in different clusters are (very) different.

Some example applications. Information retrieval: cluster retrieved documents to present more organized and understandable results. Consumer market analysis: cluster consumers into different interest groups so that marketing plans can be specifically designed for each group. Image segmentation: decompose an image into regions with coherent color and texture. Vector quantization for data (e.g., image) compression: group vectors into similar groups, and use the group mean to represent the group members. Computational biology: group genes into co-expressed families based on their expression profiles across different tissue samples and experimental conditions.

Image compression via vector quantization: group all pixels into self-similar groups; instead of storing all pixel values, store the mean of each group (in the slide's example, the image shrinks from 701,554 bytes to 127,292 bytes).
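To make the vector-quantization idea concrete, here is a minimal Python sketch (not from the slides; the pixel data and the 16-group assignment are made up, and in practice the assignment would come from a clustering algorithm such as the K-means introduced later). It simply replaces every pixel by the mean of its group, so only the group means and the per-pixel group indices need to be stored.

import numpy as np

pixels = np.random.randint(0, 256, size=(1000, 3)).astype(float)   # toy RGB pixel vectors
labels = np.random.randint(0, 16, size=1000)                       # toy assignment into 16 groups
# mean color of each group (fall back to zeros if a group happens to be empty)
means = np.array([pixels[labels == g].mean(axis=0) if np.any(labels == g) else np.zeros(3)
                  for g in range(16)])
quantized = means[labels]   # every pixel replaced by its group mean

Storing the means (16 x 3 values) plus one small group index per pixel is what yields the compression.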

Important components in clustering. Distance/similarity measure: how do we measure the similarity between two objects? Clustering algorithm: how do we find the clusters based on the distance/similarity measure? Evaluation of results: how do we know that our clustering result is good?

Distance/similarity measures. This is one of the most important questions in unsupervised learning, often more important than the choice of clustering algorithm. What is similarity? Similarity is hard to define, but we know it when we see it. More pragmatically, there are many mathematical definitions of distance/similarity.

Common distance/similarity measures. Euclidean distance: $L_2(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$, the straight-line distance (always $\geq 0$). Manhattan distance: $L_1(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} |x_i - x'_i|$, the city-block distance (always $\geq 0$). Cosine similarity: the cosine of the angle between two vectors, commonly used for measuring document similarity. More flexible measures: a weighted distance $D(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{d} w_i (x_i - x'_i)^2$ scales each feature differently using the weights $w_i$; one can learn the appropriate weights given user guidance. Note: we can always transform between distance and similarity using a monotonically decreasing function.
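As a small illustration (not part of the slides), the measures above can be written in a few lines of Python, assuming NumPy is available; the weight vector w in the last function is a hypothetical per-feature scaling as in the "more flexible" weighted measure.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))      # straight-line (L2) distance, >= 0

def manhattan(x, y):
    return np.sum(np.abs(x - y))              # city-block (L1) distance, >= 0

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))   # cosine of the angle between x and y

def weighted_sq_distance(x, y, w):
    return np.sum(w * (x - y) ** 2)           # each feature scaled by its own weight w_i

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 0.5, 2.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y), weighted_sq_distance(x, y, w))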

How to decide which to use? It is application dependent. Consider your application domain; you need to ask questions such as: what does it mean for two consumers to be similar to each other? Or for two genes to be similar to each other? For example, in the text domain we typically use cosine similarity. This may or may not give you the answer you want; it depends on your existing knowledge of the domain. When domain knowledge is not enough, one could consider a learning-based approach.

A learning approach. Ideally we'd like to learn a distance function from users' input: ask users to provide feedback such as "object A is similar to object B, and dissimilar to object C." For example, if a user wants to group marbles based on their pattern ("these belong together!", "these don't belong together!"), we can learn a distance function that correctly reflects these relationships, e.g., by weighing pattern-related features more than color features. This is a more advanced topic that we will not cover in this class, but it is nonetheless very important.

When we cannot afford to learn a distance measure, and don't have a clue which distance function is appropriate, what should we do? Clustering is an exploratory procedure, and it is important to explore, i.e., try different options.

Clustering. Now that we have seen different ways to measure distances or similarities among a set of unlabeled objects, how can we use them to group the objects into clusters? Many different approaches have been explored, and they can be categorized into two distinct types.

Hierarchical and non-hierarchical clustering. Hierarchical clustering builds a tree-based hierarchical taxonomy (dendrogram); for example, animal splits into vertebrate and invertebrate, vertebrate into fish, reptile, and mammal, and invertebrate into worm and insect. A non-hierarchical (flat) clustering produces a single partition of the unlabeled data, without resorting to any hierarchical procedure. A representative example of flat clustering is K-means clustering.

Goal of clustering. Given a data set $D = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$, we need to partition it into disjoint clusters such that examples in the same cluster are self-similar. How do we quantify this?

Objective: sum of squared errors. Given a partition of the data into $k$ clusters $C_1, \dots, C_k$, we can compute the center (i.e., mean, center of mass) $\mu_j$ of each cluster. For a well-formed cluster, its points should be close to its center. We measure this with the sum of squared errors (SSE): $\mathrm{SSE} = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \|\mathbf{x} - \mu_j\|^2$. Our objective is thus to find the partition $\arg\min_{C_1, \dots, C_k} \mathrm{SSE}$. This is a combinatorial optimization problem and is NP-hard; K-means finds a locally optimal solution using an iterative approach.
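The SSE is easy to compute for any given partition; the short Python sketch below (my own naming, NumPy assumed) takes a list of clusters, each an array of points, and returns the SSE as defined above.

import numpy as np

def sse(clusters):
    total = 0.0
    for points in clusters:                  # points: an (n_j, m) array for cluster j
        center = points.mean(axis=0)         # cluster center = mean of its points
        total += np.sum((points - center) ** 2)
    return total

clusters = [np.array([[0.0, 0.0], [1.0, 0.0]]),
            np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])]
print(sse(clusters))   # small value: both clusters are tight around their centers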

Basic idea. We assume that the number of desired clusters, k, is given. Randomly choose k examples as seeds, one per cluster. Form initial clusters based on these seeds. Iterate by repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when the clustering converges or after a fixed number of iterations.

The standard K-means algorithm.
Input: data set $D = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}\}$ and desired number of clusters $k$.
Output: clusters $C_1, \dots, C_k$ such that $D = C_1 \cup C_2 \cup \dots \cup C_k$.
Let $d$ be the distance function between examples.
1. Select $k$ random samples from $D$ as centers $\mu_1, \dots, \mu_k$  // initialization
2. Do
3.   for each example $\mathbf{x} \in D$,
4.     assign $\mathbf{x}$ to the cluster $C_j$ for which $d(\mathbf{x}, \mu_j)$, $1 \le j \le k$, is minimized  // the assignment step
5.   for each cluster $C_j$, update its cluster center:
6.     $\mu_j = \frac{1}{|C_j|} \sum_{\mathbf{x} \in C_j} \mathbf{x}$  // the update step
7. Until convergence
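The pseudocode above translates directly into a short runnable Python sketch (my own naming; Euclidean distance and NumPy are assumed, and empty clusters simply keep their old centers, a detail the pseudocode leaves open).

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random seeds
    for _ in range(max_iters):                                    # steps 2-7: iterate
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # steps 3-4: assignment step
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j]                   # keep the center of an empty cluster
                                for j in range(k)])               # steps 5-6: update step
        if np.allclose(new_centers, centers):                     # step 7: convergence test
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])   # two toy blobs
labels, centers = kmeans(X, k=2)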

K-means example (K=2), shown in the slides as a sequence of figures: pick two seeds; reassign points to clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged.

Monotonicity of K-means. Monotonicity property: each iteration of K-means strictly decreases the SSE until convergence. The following lemma is key to the proof. Lemma: given a finite set $C$ of data points, the value of $\mu$ that minimizes the SSE $\sum_{\mathbf{x} \in C} \|\mathbf{x} - \mu\|^2$ is the mean, $\mu = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}$.
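A one-line derivation of the lemma (a standard calculus argument; notation mine): setting the gradient of the SSE with respect to $\mu$ to zero gives
\[
\frac{\partial}{\partial \mu} \sum_{\mathbf{x}\in C} \|\mathbf{x}-\mu\|^2
  = -2\sum_{\mathbf{x}\in C} (\mathbf{x}-\mu) = 0
  \;\Longrightarrow\;
  \mu = \frac{1}{|C|}\sum_{\mathbf{x}\in C} \mathbf{x},
\]
and since the objective is a convex quadratic in $\mu$, this stationary point is the unique minimizer, i.e., the cluster mean.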

Proof of monotonicity. Given a current set of clusters $C_1, \dots, C_k$ with their means $\mu_1, \dots, \mu_k$, the SSE is $\sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \|\mathbf{x} - \mu_j\|^2$. Consider the assignment step: since each point is only reassigned if it is closer to some other cluster's center than to its current one, the reassignment step can only decrease the SSE. Consider the re-center step: from our lemma we know the updated $\mu_j$ minimizes the SSE of $C_j$, which implies that the resulting SSE is again decreased. Combining the two, we know K-means always decreases the SSE.

K-means properties. K-means always converges in a finite number of steps, and typically converges very fast (in fewer iterations than $n$). Time complexity: assume computing the distance between two instances is $O(m)$, where $m$ is the dimensionality of the vectors. Reassigning clusters takes $O(kn)$ distance computations, or $O(knm)$. Computing centers: each instance vector gets added once to some center, so $O(nm)$. Assume these two steps are each done once in each of $I$ iterations: $O(Iknm)$ total. This is linear in all relevant factors, assuming a fixed number of iterations.
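As a purely illustrative worked example (numbers made up), with $n = 10^6$ examples, $m = 100$ features, $k = 10$ clusters, and $I = 20$ iterations,
\[
O(Iknm) \approx 20 \times 10 \times 10^6 \times 100 = 2 \times 10^{10}
\]
basic operations, dominated by the $O(knm)$ reassignment work per iteration; the $O(nm) = 10^8$ center-update work per iteration is comparatively cheap.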

More comments. K-means is highly sensitive to the initial seeds. This is because the SSE has many locally minimal solutions, i.e., solutions that cannot be improved by locally reassigning any particular points.

Solutions. Run multiple trials and choose the one with the best (lowest) SSE; this is typically done in practice. Heuristics: try to choose initial centers that are far apart, using furthest-first traversal: start with a random initial center, set the second center to be the point furthest from the first center, the third center to be the point furthest from the first two centers, and so on; a small sketch of this heuristic follows.
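A minimal sketch of the furthest-first heuristic (my own naming, NumPy assumed); the returned seeds can be used in place of the random initialization in the K-means sketch above.

import numpy as np

def furthest_first_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                          # first center: a random point
    for _ in range(k - 1):
        # distance from every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])                          # next center: the furthest such point
    return np.array(centers)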

Even more comments. K-means is exhaustive: it clusters every data point, with no notion of outliers. Outliers may cause problems. Why?

K-medoids. Outliers strongly impact the cluster centers. K-medoids instead uses the medoid of each cluster, i.e., the data point that is closest to the other points in the cluster: $\mathrm{medoid}(C_j) = \arg\min_{\mathbf{z} \in C_j} \sum_{\mathbf{x} \in C_j} \|\mathbf{x} - \mathbf{z}\|^2$. K-medoids is computationally more expensive but more robust to outliers.
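A sketch of finding the medoid of one cluster (my own naming, NumPy assumed): it is the member point that minimizes the summed squared distance to all other members, found here by an exhaustive O(|C|^2) search, which is where the extra cost of K-medoids comes from.

import numpy as np

def medoid(points):
    # pairwise squared distances between all members of the cluster
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    return points[np.argmin(d2.sum(axis=1))]      # member with the smallest total distance

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])   # last point is an outlier
print(cluster.mean(axis=0))   # the mean is dragged toward the outlier
print(medoid(cluster))        # the medoid stays at an actual, central data point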

Deciding k: a model selection problem. What if we don't know how many clusters there are in the data? Can we use the SSE to decide k, by choosing the k that gives the smallest SSE? No: this will always favor larger values of k. Any quick solutions? Find the knee(s) of the SSE-versus-k curve; a small sketch follows.
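A quick way to look for the knee is to run the clustering for a range of k values and inspect how the SSE falls. The sketch below (illustrative; NumPy assumed) reuses the kmeans function from the earlier sketch and prints SSE versus k; for this toy data with three well-separated groups, the SSE should drop sharply up to k = 3 and then flatten.

import numpy as np

def sse_of(X, labels, centers):
    return float(np.sum((X - centers[labels]) ** 2))

# three toy groups of points
X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
for k in range(1, 8):
    labels, centers = kmeans(X, k)           # kmeans as sketched earlier
    print(k, sse_of(X, labels, centers))     # look for the knee in this SSE-versus-k curve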