Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms
1 Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine
2 Assignment 5 Refer to the Wiki page Due noon on Monday February 12th to EEE dropbox Note: due before class (by 2pm) Questions? Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2
3 What is Exploratory Data Analysis? EDA = {visualization, clustering, dimension reduction, ...} For small numbers of variables, EDA = visualization. For large numbers of variables, we need to be cleverer: clustering, dimension reduction, and embedding algorithms. These are techniques that essentially reduce high-dimensional data to something we can look at. Today's lecture: finish up visualization, then an overview of clustering algorithms
4 Tufte's Principles of Visualization Graphical excellence: is the well-designed presentation of interesting data; is a matter of substance, of statistics, and of design; consists of complex ideas communicated with clarity, precision, and efficiency; is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space; requires telling the truth about the data
5 Different Ways of Presenting the Same Data From Karl Broman
6 Principle of Proportional Ink (or How to Lie with Visualization)
7 Principle of Proportional Ink (or How to Lie with Visualization)
8 Potentially Misleading Scales on the X-axis
9 Example: Visualization of Napoleon's 1812 March Illustrates size of army, direction, location, temperature, and date, all on one chart
10 Data Journalism From the New York Times, Feb
11 Exploratory Data Analysis: Clustering
12 Example: Clustering Vectors in a 2-Dimensional Space (scatter plot, axes x1 and x2). Each point (a 2-d vector) represents a document
13 Example: Possible Clusters (the same scatter plot, partitioned into Cluster 1 and Cluster 2)
14 Example: How Many Clusters? (the same scatter plot, partitioned into Cluster 1, Cluster 2, and Cluster 3)
15 Cluster Structure in Real-World Data (scatter plot of signal C vs signal T: 1500 subjects, two measurements per subject). Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
16 Cluster Structure in Real-World Data (the same scatter plot of signal C vs signal T). Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
17 (The same scatter plot with the three clusters labeled CC, CT, and TT.) Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
18 Issues in Clustering Representation: how do we represent our examples as data vectors? Distance: how do we want to define distance between vectors? Algorithm: what type of algorithm do we want to use to search for clusters? What is the time and space complexity of the algorithm? Number of clusters: how many clusters do we want? No right answer to these questions in general; it depends on the application
19 Cluster Analysis vs Classification Cluster analysis: data are unlabeled; the number of clusters is unknown; unsupervised learning; goal: find unknown structures. Classification: the labels for training data are known; the number of classes is known; supervised learning; goal: allocate new observations, whose labels are unknown, to one of the known classes
20 Clustering: The K-Means Algorithm
21 Notation N documents. Represent each document as a vector of T terms (e.g., counts or tf-idf). The vector for the ith document is x_i = (x_i1, x_i2, ..., x_ij, ..., x_iT), i = 1, ..., N. Document-term matrix: x_ij is the entry in the ith row, jth column; columns correspond to terms, rows correspond to documents. We can think of our documents as being in a T-dimensional space, with clusters as clouds of points
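A document-term matrix like the one described above can be built with scikit-learn; a minimal sketch, with a made-up three-document toy corpus and assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three "documents" (illustrative, not from the lecture)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs and logs"]

# Rows correspond to documents, columns to terms; entries are tf-idf weights
X = TfidfVectorizer().fit_transform(docs)  # sparse N x T document-term matrix
```

Each row of `X` is then a point in T-dimensional space, ready for clustering.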
22 The K-Means Clustering Algorithm Input: N vectors x_1, ..., x_N of dimension D; K = number of clusters (K > 1). Output: K cluster centers c_1, ..., c_K, each center a vector of dimension D; (equivalently) a list of cluster assignments (values 1 to K) for each of the N input vectors. Note: in K-means each input vector x is assigned to one and only one cluster k, i.e., cluster center c_k. The K-means algorithm partitions the N data vectors into K disjoint groups
23 Example of K-Means Output with 2 Clusters (scatter plot, axes x1 and x2). Blue circles are examples of documents; red circles are examples of cluster centers, c_1 for Cluster 1 and c_2 for Cluster 2
24 Squared Error Distance Consider two vectors each with T components (i.e., dimension T): x = (x_1, x_2, ..., x_T) and y = (y_1, y_2, ..., y_T). A common distance metric is the squared error distance: d_E(x, y) = Σ_{j=1..T} (x_j - y_j)^2. In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
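The distance above is straightforward in plain Python (the function name is ours, for illustration):

```python
def squared_error_distance(x, y):
    """Squared error distance: sum over components of (x_j - y_j)^2."""
    assert len(x) == len(y), "vectors must have the same dimension"
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

# In 2-d, the square root recovers ordinary Euclidean distance:
# squared_error_distance((0, 0), (3, 4)) is 25, and sqrt(25) = 5
```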
25 Squared Errors and Cluster Centers Squared error (distance) between a data point x and a cluster center c: dist[x, c] = Σ_j (x_j - c_j)^2, where the index j runs over the D components/dimensions of the vectors
26 Squared Errors and Cluster Centers Total squared error between a cluster center c_k and all N_k points assigned to that cluster: S_k = Σ_i dist[x_i, c_k], where the sum is over the N_k points assigned to cluster k and the distance is the squared error (Euclidean) distance
27 Squared Errors and Cluster Centers Total squared error summed across the K clusters: SSE = Σ_k S_k, where the sum is over the K clusters
28 K-means Objective Function K-means: minimize the total squared error, i.e., find the K cluster centers c_k, and assignments, that minimize SSE = Σ_k S_k = Σ_k (Σ_i dist[x_i, c_k]). K-means seeks the cluster centers such that the sum-squared error is smallest, so it places cluster centers strategically to cover the data. This is similar to data compression (and in fact K-means is used in data compression algorithms)
29 K-Means Algorithm Random initialization: select the initial K centers randomly from the N input vectors, or assign each of the N vectors randomly to one of the K clusters. Iterate: Assignment step: assign each of the N input vectors to its closest center. Update step: compute the updated centers (K of them) as the average of the vectors assigned to each cluster: new c_k = (1/N_k) Σ_i x_i, where the sum is over the N_k points assigned to cluster k. Convergence check: did any points get reassigned? Yes: return to the Iterate step. No: terminate
30 Pseudocode for the K-means Algorithm From Chapter 16 of Manning, Raghavan, and Schütze
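The initialize/assign/update loop above can be sketched in NumPy; this is an illustrative implementation, not the Manning et al. pseudocode, and the function name and defaults are made up:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means on an (N, D) array X; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K of the N input vectors as initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest center (squared error)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # no reassignments: converged
            break
        assign = new_assign
        # Update step: each center moves to the mean of the vectors assigned to it
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```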
31 Example of K-Means Clustering: the original data (scatter plot, Dimension 1 vs Dimension 2)
32 Example of K-Means Clustering: an early iteration, with the mean squared error shown on the plot
33 Example of K-Means Clustering: a later iteration, with the mean squared error shown on the plot
34 Example of K-Means Clustering: a further iteration, with the mean squared error shown on the plot
35 Example of K-Means Clustering: Iteration 5 (converged), with the mean squared error shown on the plot
36 K-means 1. Pick number of clusters (e.g. K=5) 2. Randomly guess K cluster center locations Figure/slide from Andrew Moore, CMU
37 K-means 1. Pick number of clusters (e.g. K=5) 2. Randomly guess K cluster center locations 3. Each datapoint finds out which center it's closest to. Figure/slide from Andrew Moore, CMU
38 K-means 1. Pick number of clusters (e.g. K=5) 2. Randomly guess K cluster center locations 3. Each datapoint finds out which center it's closest to. 4. Each center finds the centroid of the points it owns Figure/slide from Andrew Moore, CMU
39 K-means 1. Pick number of clusters (e.g. K=5) 2. Randomly guess K cluster center locations 3. Each datapoint finds out which center it's closest to. 4. Each center finds the centroid of the points it owns 5. New centers => new boundaries 6. Repeat until no change Figure/slide from Andrew Moore, CMU
40 The Iris Data Collected by R.A. Fisher A famous early data set in multivariate data analysis Four features: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm Three different species: Setosa, Versicolor, Virginica
41 K-Means Clustering on the Iris Data
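A run like this can be reproduced with scikit-learn's KMeans on the iris data; a minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()                     # 150 flowers, 4 measurements each
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

centers = km.cluster_centers_          # 3 centers, each a 4-d vector
labels = km.labels_                    # cluster assignment for each flower
sse = km.inertia_                      # total within-cluster squared error
```

With K=3 the clusters can then be compared against the three known species, though the algorithm never sees the labels.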
42 K-Means for Image Compression
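One common version of this idea: treat each pixel's RGB value as a 3-d vector, cluster the pixels, and replace every pixel by its nearest cluster center, so the image needs only K colors. A sketch with a synthetic random "image" standing in for a real one (assumes NumPy and scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# A synthetic 64x64 "RGB image" stands in for a real photo (illustrative)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)

pixels = image.reshape(-1, 3)          # one row (3-d vector) per pixel
km = KMeans(n_clusters=16, n_init=3, random_state=0).fit(pixels)

# Replace each pixel by its nearest center: the image now uses only 16 colors
compressed = km.cluster_centers_[km.labels_].reshape(image.shape)
```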
43 An Example of Data where K-Means does not work well Ideal Clustering of Data in 2 Dimensions
44 An Example of Data where K-Means does not work well Ideal Clustering of Data in 2 Dimensions K-means Clustering Result, K = 2
45 From:
46 Properties of the K-Means Algorithm Time complexity? N = number of data points, K = number of clusters, D = dimension of the data points (number of variables). O(N K D) time per iteration. This is good: linear in each input parameter. Does K-means always find a global minimum, i.e., the set of K centers that minimizes the SSE? No: it always converges to *some* local minimum, but not necessarily the best one; the result depends on the starting point chosen. One can prove that the SSE on each iteration must either decrease or not change (in which case we have converged). [Think about how you might prove this]
47 Summary of K-means Input: N vectors. Output: K clusters, each represented by a cluster mean (a vector); each data point is assigned to its closest cluster center. Strengths: fast (time complexity is O(N D K), i.e., linear in N, D, and K); simple to implement. Weaknesses: not guaranteed to find the best solution (the global minimum of the SSE); assumes a fixed number of clusters K; uses Euclidean distance, which is not necessarily ideal
48 Number of Clusters? Generally no right answer; it depends on the application. We can think of clustering as a type of data compression: as K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error. But this does not mean larger K is always better: the larger the value of K, the harder it is for humans to understand the clustering results. Options? Pick a value of K based on intuition/heuristics, e.g., relatively small K (say K=5 or 10) if we are showing the results to a human. Or evaluate different values of K, if we have some ground truth for evaluation, and select the best value of K using the task-specific evaluation measure
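Comparing SSE across several values of K, as suggested above, might look like this; the synthetic blob locations and the K range are made up for illustration, and assume scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs stand in for real data (illustrative)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# SSE (inertia) shrinks as K grows; look for the "elbow" where it flattens out
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
```

Here the drop in SSE is dramatic up to K=3 (the true number of blobs) and small afterward, which is the pattern one hopes to see.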
49 Hierarchical Clustering Algorithms
50 (Dendrogram with the species Setosa, Virginica, and Versicolor labeled.)
51 Hierarchical Clustering The number of clusters is not required in advance. Gives a tree-based representation of the observations, a dendrogram: each leaf represents an observation; the leaves most similar to each other are merged; then the internal nodes most similar to each other are merged; the process continues recursively until all nodes are merged at the root node
52 Basic Concept of Hierarchical Clustering (diagram: points a, b, c, d, e are merged step by step, e.g. {a, b}, then {d, e}, then {c, d, e}, then {a, b, c, d, e}). Merge data points, and then clusters, in a bottom-up fashion, until all data points are in one cluster. Requires that we can define distance/similarity between sets of points
53 Simple Example of Hierarchical Clustering (scatter plot, Dimension 1 vs Dimension 2)
54 Complete-link clustering of Reuters news stories Figure from Chapter 17 of Manning, Raghavan, and Schütze
55 Distance between Two Branches/Clusters: single linkage, complete linkage, average linkage, and many other options
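These linkage choices are exposed directly in SciPy's hierarchical clustering routines; a sketch on two synthetic blobs (assumes NumPy and SciPy; the data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 10 points each (illustrative)
X = np.vstack([rng.normal((0, 0), 0.2, (10, 2)),
               rng.normal((4, 4), 0.2, (10, 2))])

# Bottom-up merging; 'method' selects the between-cluster distance
Z = linkage(X, method="average")       # also: "single", "complete", "ward"

# Cut the dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree itself.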
56 Complexity of Hierarchical Clustering Time complexity (N = number of docs, T = dimensionality): time to compute all pairwise distances is O(N^2 T); time to create the tree is O(N^3); so the overall time complexity is O(N^3 + N^2 T). Space complexity is O(N^2). This is a significant weakness of hierarchical clustering: it scales poorly in N. One practical option is to first run K-means with, e.g., K = 20 or 100 or 500 clusters, and then hierarchically cluster the resulting K-means clusters
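The pre-clustering trick described above, K-means first and then hierarchical clustering on the resulting centers, might be sketched as follows; the data sizes are illustrative, and the sketch assumes scikit-learn and SciPy:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))       # N = 10,000 points (illustrative)

# Step 1: compress the 10,000 points down to 100 K-means centers
centers = KMeans(n_clusters=100, n_init=3,
                 random_state=0).fit(X).cluster_centers_

# Step 2: hierarchical clustering on the 100 centers,
# which costs roughly O(100^3) instead of O(N^3)
Z = linkage(centers, method="complete")
```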
57 Automatically Clustering Languages in Linguistics
58 Hierarchical Clustering based on user votes for favorite beers Based on centroid method From data.ranker.com
59 Heat-Map Representation (human data)
60 Discovering Structure from a HeatMap of Brain Network Data From
61 Summary of Clustering Algorithms Used for exploring data; can answer questions such as: are there subgroups? Different clustering algorithms: K-means: simple, fast, easy to interpret; tends to find circular clusters, and can fail on complex structure; the number of clusters K is fixed ahead of time. Hierarchical agglomerative clustering: produces a tree of clusters (a dendrogram); the number of clusters is not fixed; computational complexity is high, so it does not scale well to large N. Clustering is useful for exploration, but one should be careful: there is no gold standard to compare against, and the many different methods can give different results
62 Assignment 5 Refer to the Wiki page Due noon on Monday February 12th to EEE dropbox Note change: due before class (by 2pm)
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationLecture on Modeling Tools for Clustering & Regression
Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Exploratory Data Analysis
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-
More informationK-means and Hierarchical Clustering
K-means and Hierarchical Clustering Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these
More informationK-Means Clustering 3/3/17
K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering
More informationBL5229: Data Analysis with Matlab Lab: Learning: Clustering
BL5229: Data Analysis with Matlab Lab: Learning: Clustering The following hands-on exercises were designed to teach you step by step how to perform and understand various clustering algorithm. We will
More informationData Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC
Data Exploration with PCA and Unsupervised Learning with Clustering Paul Rodriguez, PhD PACE SDSC Clustering Idea Given a set of data can we find a natural grouping? Essential R commands: D =rnorm(12,0,1)
More informationKernels and Clustering
Kernels and Clustering Robert Platt Northeastern University All slides in this file are adapted from CS188 UC Berkeley Case-Based Learning Non-Separable Data Case-Based Reasoning Classification from similarity
More informationExploratory Analysis: Clustering
Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationK-means and Hierarchical Clustering
K-means and Hierarchical Clustering Xiaohui Xie University of California, Irvine K-means and Hierarchical Clustering p.1/18 Clustering Given n data points X = {x 1, x 2,, x n }. Clustering is the partitioning
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 03/02/2016 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationCS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek
CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationCS 584 Data Mining. Classification 1
CS 584 Data Mining Classification 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for
More informationMIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018
MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationK-means Clustering & PCA
K-means Clustering & PCA Andreas C. Kapourani (Credit: Hiroshi Shimodaira) 02 February 2018 1 Introduction In this lab session we will focus on K-means clustering and Principal Component Analysis (PCA).
More informationData Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science
Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic
More informationMeasure of Distance. We wish to define the distance between two objects Distance metric between points:
Measure of Distance We wish to define the distance between two objects Distance metric between points: Euclidean distance (EUC) Manhattan distance (MAN) Pearson sample correlation (COR) Angle distance
More informationUninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall
Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes
More informationHsiaochun Hsu Date: 12/12/15. Support Vector Machine With Data Reduction
Support Vector Machine With Data Reduction 1 Table of Contents Summary... 3 1. Introduction of Support Vector Machines... 3 1.1 Brief Introduction of Support Vector Machines... 3 1.2 SVM Simple Experiment...
More informationPerformance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms
Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,
More informationCluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole
Cluster Analysis Summer School on Geocomputation 27 June 2011 2 July 2011 Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD. Faculty of Mathematics, Physics and Informatics Comenius University,
More informationThe Curse of Dimensionality
The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more
More informationNotes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)
1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationClustering algorithms
Clustering algorithms Machine Learning Hamid Beigy Sharif University of Technology Fall 1393 Hamid Beigy (Sharif University of Technology) Clustering algorithms Fall 1393 1 / 22 Table of contents 1 Supervised
More informationDD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today
Sec.14.1! Recap: Classification DD2475 Information Retrieval Lecture 10: Clustering Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 Data points have labels Classification task: Finding good separators
More informationCOSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.
COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More information