k-means Clustering (pp )
Notation: [pencil icon] means pencil-and-paper QUIZ; [keyboard icon] means coding QUIZ.

Explaining the initialization and iterations of the k-means clustering algorithm: let us understand the mechanics of k-means on a 1-dimensional example [1].

This is the random initialization of 2 clusters (k = 2). Show how the points are assigned to the clusters!

[1] Unlike higher-dimensional cases, data points in 1D can be sorted, which makes the clustering problem easier. k-means is not applied to 1D data in real life. In fact, clustering in 1D is usually called by a different name, segmentation. An example of a 1D clustering algorithm is Jenks natural breaks optimization.
Calculate the new cluster centers!
Show how the points are assigned to the new clusters!
Calculate the cluster centers in the next iteration!
Iterate until the centers are stable! (See the sketch below.)
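A minimal sketch of these steps in plain Python, on hypothetical 1D data (the point values and initial centers here are made up for illustration, not the in-class example):

```python
# Minimal 1D k-means sketch (hypothetical data, k = 2).
points = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
centers = [2.5, 4.0]  # an arbitrary "random" initialization

for iteration in range(10):
    # Assignment step: each point goes to its closest center.
    clusters = [[] for _ in centers]
    for p in points:
        distances = [abs(p - c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    # Update step: each center moves to the mean of its cluster.
    # (Empty-cluster guard omitted for brevity.)
    new_centers = [sum(cl) / len(cl) for cl in clusters]
    if new_centers == centers:   # centers are stable -> stop
        break
    centers = new_centers

print(centers)  # converges to [2.0, 9.0] for this data
```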
Now a similar problem in 2D:
Show how the points are assigned to the clusters!
Calculate the coordinates of the new cluster centers!
Iterate until the centers are stable!
Each cluster lies within its (linear) boundaries, which can later be used for classification.

How to do our own clustering with Scikit-learn, and how to use the trained clustering model to classify new points (the model is not changed by the new points!):
- Use Numpy vectorized operations to prove that the two arrays above are identical.
- Use the model trained above to classify these three data points: (-1, 42), (1, -42), (42, 42). Make only one call to predict( )!

The attribute cluster_centers_ stores the coordinates of the centers. We can print the data points and the centers on the same plot using mglearn:
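A minimal sketch of these steps, assuming a synthetic dataset from make_blobs (the dataset and variable names are placeholders, not the exact ones used in class):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data (placeholder for the in-class dataset).
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Fit the clustering model.
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

# labels_ (from training) and predict(X) agree on the training data;
# a vectorized way to prove the two arrays are identical:
print(np.all(kmeans.labels_ == kmeans.predict(X)))  # True

# Classify new points -- one call to predict(); the model is unchanged.
new_points = np.array([[-1, 42], [1, -42], [42, 42]])
print(kmeans.predict(new_points))

# cluster_centers_ stores the coordinates of the centers.
print(kmeans.cluster_centers_)

# Points and centers on the same plot (mglearn.discrete_scatter does
# this too; plain matplotlib is used here to stay self-contained).
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', s=100, c='red')
plt.show()
```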
To do for next time:
- Try more or fewer cluster centers (p. 174).
- Read Failure cases (pp. 175-8) and try out the code therein.
- Read more about shortcomings of k-means here: (linked on our webpage).
- Update memory-sheet!
Solutions:
Algorithm development case-study: Implementing the k-means clustering algorithm [2]

In this section, we are going to apply Python tools to implement and refine the k-means clustering algorithm in 1, 2, and 3 dimensions. Although we already know that k-means is not used in the one-dimensional (1D) case, we start here because it is a simpler problem:

A company wants to distribute goods to 30 cities located on the same highway. To do this, the company will build a number k (initially unknown) of distribution centers along the highway, and each center will serve a number of neighboring cities. The centers themselves need not be in a city; they can be placed anywhere on the highway. What is the optimal placement of the centers?

Whenever we ask for an optimal solution, we need to specify what exactly we want optimized: in general, we need a function whose minimum or maximum must be attained. For this problem, a good candidate is the sum of all distances from centers to the cities they serve [3], but, in order to make the math more tractable, we shall adopt a slightly different function, the sum of the squares of those distances; borrowing a concept from Coding Theory, this is called the distortion:

D = \sum_{i=0}^{k-1} \sum_{j=0}^{n_i - 1} d^2(\text{center}_i, \text{city}_{ij}), where n_i is the number of cities served by center i. (*)

The 1D data is simply the distance of each city from one end of the highway [4]. The instructor will provide the starter code shown here. Note that the set of data points is sorted, a simplification that will not be possible in higher dimensions. We now find the number of points, and decide on the number of centers/clusters.

We have to find initial candidates for the cluster centers. With 1D data, this is an easy task, due to the sorted order: we split the highway into k equal segments, and take the middle of each segment as the initial center. The code is shown below (see the sketch after the footnotes).

[2] This section is not in our text.
[3] Assuming the trucks drop off goods to only one city on each trip.
[4] This is Interstate 90, with Seattle (0 miles) at the Western end and Boston (3053 miles) at the Eastern end; distances were obtained from Google Maps.
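A minimal sketch of this initialization; the mile markers below are placeholders standing in for the starter code's data:

```python
# 1D initialization: split the highway into k equal segments and
# take the middle of each segment as the initial center.
points = sorted([5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053])  # placeholder miles
n = len(points)   # number of data points
k = 3             # number of centers/clusters

highway_length = points[-1] - points[0]
segment = highway_length / k
centers = [points[0] + segment * (i + 0.5) for i in range(k)]
print(centers)    # the middle of each of the k segments
```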
As mentioned, multi-dimensional data cannot be sorted, so we develop an alternate method for initialization: initial centers are chosen at random.

Warm-up problems (solve in a separate .py file): Create a list of 10 random elements, using one of the random generators in Python's random module: random( ), uniform( ), triangular( ). Apply the built-in function max( ), which returns the maximum (largest) element in its argument. How can we also find the index of the maximum? Hint: There are at least two possible solutions, one using the method index( ), another using the built-in function enumerate( ).

One iteration of the k-means algorithm consists of:
- For each data point, find the closest center. Label the data point with an identifier for the center.
- Group the data points into k sets, based on the labels; each set contains points assigned to the same center, the one closest to them. This is classification.
- Find the centroid ("center of mass") for each set. The centroids become the new centers.

Let us implement these steps. In the code above, the centers are numbered 0 to k-1, and these numbers are used as labels. The list closest, of length n, is used to store the number of the closest center. We use the built-in functions abs( ), which
returns the absolute value of a scalar argument, and min( ), which returns the minimum of its list argument. Note that the label in closest is the number of the center, whereas the minimum is calculated as a distance; this is why we use the index( ) method to find where the minimum occurred. The result is that the list closest now contains, for each data point, the number of the closest center, as seen in the output.

Once the points have been classified into sets with respect to the current centers, we calculate the centroid for each set. Centroids are averages, so we need two lists: one for the sums, and one for the counts. (A sketch of both steps follows below.)

It can be proven mathematically that the distortion function D defined in (*) above is reduced (or stays the same) at each iteration, and therefore the k-means algorithm is guaranteed to converge to a minimum. Unfortunately, that may not be the global minimum, as seen by running the algorithm multiple times with different initial centers [5]. There is no known method guaranteed to find the global minimum, short of exhaustive search. In practice, we just run the algorithm multiple times and take the solution with the minimum distortion.

In the output above, the distortion is denoted D_sqr, but we also included D_abs, the sum of only the absolute values of distances to centers; intuitively, we feel that a low value for D_sqr also means a low value for D_abs, but we can see that this is not always the case. In practice, however, the difference between the two distortions is small enough to neglect.

This is how we plot our 30 points on a line, using matplotlib and pyplot (see the sketch below).

[5] This is the more advanced algorithm, with multiple iterations, which will be developed subsequently.
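A minimal sketch of the classification step, the centroid step, and the 1D plot; the points and centers are placeholders standing in for the starter code's data:

```python
import matplotlib.pyplot as plt

points = [5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053]  # placeholder miles
centers = [500.0, 1500.0, 2500.0]
k = len(centers)

# Classification: for each point, store the number of the closest center.
closest = []
for p in points:
    distances = [abs(p - c) for c in centers]        # abs() gives 1D distance
    closest.append(distances.index(min(distances)))  # label = center number
print(closest)

# Centroids are averages: one list for the sums, one for the counts.
sums = [0.0] * k
counts = [0] * k
for p, label in zip(points, closest):
    sums[label] += p
    counts[label] += 1
centers = [sums[i] / counts[i] for i in range(k)]    # (empty-cluster guard omitted)
print(centers)

# Plotting the points and centers on a line:
plt.scatter(points, [0] * len(points))
plt.scatter(centers, [0] * len(centers), marker='x')
plt.show()
```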
Lab 6 - see handout

The starter code for this lab is available on our webpage, and/or can be provided by the instructor. The file is called 16_kmeans_random_centers_1_iteration.py. After the lab is completed, the final code will also be available on our webpage, and/or provided by the instructor. This code will be the starting point for the work in the next session (below).

We start this session from the final version of the code for k-means, developed in the previous lab. If you don't have it, it can be obtained from our webpage, or from the instructor. The file is called 17_kmeans_loop_threshold.py [6]

Discussion of the color list: what happens when we increase the number of clusters to 4? 5? 6? What are we supposed to do to fix the problem? Now revert k back to 3 until the end of today's session.

We mentioned that the algorithm may not converge to the global optimum. Change the random seed used in the program to several different values and see how the final centroids change. The question then is: which of the different solutions obtained above is better? To answer it, we have to first agree on a definition, or metric, for "better". One possible solution is to use the sum of the absolute values, as we did when we measured the change in the centroids' positions. Or we can take the average, by dividing the sum above by the number of clusters. The metric that works best in higher dimensions, however, is the sum of squared distances, a.k.a. distortion, introduced earlier. For convenience, the definition is repeated here:

D = \sum_{i=0}^{k-1} \sum_{j=0}^{n_i - 1} d^2(\text{center}_i, \text{city}_{ij}), where n_i is the number of cities served by center i. (*)

Modify the code to calculate and display the value of D for the solution. Write a Python function to do this (What arguments does this function need?). To distinguish the distortion from the sum of absolute values, name your function D_sqr( ). Modify the code to run in a for loop, with different random seeds (or drop the random.seed( ) call altogether). Keep track of the best solution, i.e. the one with the minimum distortion, and display it at the end. (A sketch follows below.)

[6] If you downloaded it from the webpage, it probably has the extension .txt, which you have to change to .py.
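A sketch of such a function and of the multi-restart loop; run_kmeans( ) here is a hypothetical stand-in for the full lab code, included only so the example runs:

```python
import random

def D_sqr(points, centers, closest):
    """Distortion: sum of squared distances from each point to its center."""
    return sum((p - centers[label]) ** 2 for p, label in zip(points, closest))

def run_kmeans(points, k, iterations=20):
    """Hypothetical stand-in for the full lab code: random init + iterations."""
    centers = random.sample(points, k)  # k distinct random points as centers
    for _ in range(iterations):
        closest = [min(range(k), key=lambda i: abs(p - centers[i])) for p in points]
        for i in range(k):
            members = [p for p, c in zip(points, closest) if c == i]
            if members:                 # empty-cluster guard
                centers[i] = sum(members) / len(members)
    return centers, closest

points = [5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053]  # placeholder miles
k = 3

# Multi-restart loop: keep the solution with minimum distortion.
best = None
for seed in range(20):
    random.seed(seed)
    centers, closest = run_kmeans(points, k)
    d = D_sqr(points, centers, closest)
    if best is None or d < best[0]:
        best = (d, centers, closest)

print("best distortion:", best[0])
print("best centers:", best[1])
```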
Hint: Use a new list that stores closest, centroids, and D_abs, for the current best. At the end, display and plot only the best solution.

Questions for the class:
- What was the minimum D_abs for k = 3 clusters? What are the centroids' positions corresponding to it?
- Do you think that increasing the number of clusters from 3 to, say, 4 or 5 will increase or decrease the distortion? After you have answered this question, experiment and see if you got it right!

We start this session from the final version of the code for k-means, developed previously. If you don't have it, it can be obtained from our webpage, or from the instructor. The file is called 18_kmeans_distortion.py [7]

First, let us make some small improvements:

A. We noticed last time that, especially when working with larger values for the number of centers k, some random choices for the centers resulted in an error condition, but the call to exit(0) didn't work as expected. We fix this by calling the stronger version of exit, from the sys module.

B. The creation of the list colors for plotting can be made more efficient using a dictionary and a list comprehension. (A sketch of both improvements follows below.)

[7] If you downloaded it from the webpage, it probably has the extension .txt, which you have to change to .py.
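A sketch of both improvements; the color choices and the error condition are assumptions, since the original screenshots are not reproduced here:

```python
import sys

# A. The stronger exit: sys.exit() raises SystemExit reliably, even in
# contexts where the built-in exit(0) does not behave as expected.
def check_clusters(counts):
    """Abort if any cluster ended up empty (hypothetical error condition)."""
    if 0 in counts:
        print("Empty cluster - aborting")
        sys.exit(1)

# B. Colors via a dictionary and a list comprehension:
color_dict = {0: 'red', 1: 'green', 2: 'blue', 3: 'orange', 4: 'purple'}
closest = [0, 2, 1, 1, 0, 2]                      # placeholder labels
colors = [color_dict[label] for label in closest]
print(colors)
```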
What happens now if we try to use more than 5 centers? Where else in the program can we use color_dict?

C. Finally, it is possible to include custom text information in matplotlib pyplots, with the method annotate( ), as shown here: The first argument is the text to be displayed, provided as a string. The second is the location (xy coordinates) of the tip of the arrow, provided as a tuple. The third is the location (xy coordinates) where the text starts to be displayed, also as a tuple. The fourth, arrowprops, allows us to customize the properties of the arrows [8]. (A sketch follows below.)

Now we do a major refactoring of the program, by replacing two (potentially) long lists with Numpy arrays:
- points becomes arr_points
- closest becomes arr_closest

Although distances is also of the same size, we are leaving it for the lab! The function d_sqr( ) is also left for the lab!

Make only one small change at a time, and test the program after each change! (Make sure you obtain exactly the same result!) Whenever possible, use Numpy-specific operations!

[8] See
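A minimal sketch of annotate( ); the data, coordinates, and text are placeholders:

```python
import matplotlib.pyplot as plt

plt.scatter([1, 2, 3], [1, 4, 9])
plt.annotate('a point of interest',            # 1st: the text, as a string
             xy=(2, 4),                        # 2nd: tip of the arrow (tuple)
             xytext=(1.2, 7),                  # 3rd: where the text starts (tuple)
             arrowprops={'arrowstyle': '->'})  # 4th: arrow properties
plt.show()
```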
Solutions:

What happens now if we try to use more than 5 centers?
A: We get a KeyError, because the key 5 doesn't exist in the dictionary color_dict.

Where else in the program can we use color_dict?
A: In the scatter plot with the centers, we can use the values in this dictionary:
Lab 7 - see handout

The starter code for this lab is available on our webpage, and/or can be provided by the instructor. The file is called 20_kmeans_NUMPY.py. After the lab is completed, the final code will also be available on our webpage, and/or provided by the instructor.

Failure cases of k-means (pp. 175-8)

A. The number of clusters is a hyper-parameter. There is no universal algorithm to find the best number. Many heuristics exist, like the elbow method, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the deviance information criterion (DIC).

B. Clusters are defined by diameter only; there is no way to account for density. Here is the text example:

C. Clusters are circular; there is no way to account for direction. Here is the text example:
In general, there is no way to account for cluster shape:

D. k-means will find clusters even in non-clustered data [9].

[9] Example from
E. k-means is very sensitive to scale [10].

F. Even on perfect data, k-means can get stuck in a local minimum [11].

[10] Example from
[11] Example from
Relationship between k-means and vector quantization (pp )

PCA and other dimensionality reduction algorithms express the original data points (vectors) as a sum of directions (parts, components) whose number is lower than the original dimensionality. k-means can be viewed as a dimensionality reduction algorithm where each point is represented by just one component: the center of a cluster.

Vector quantization is a technique from signal processing, originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms [12].

The text code on p. 179 shows a comparison between the first few PCA components and the first few clusters obtained with k-means on the dataset Labeled Faces in the Wild. PCA was done with n_components=100, and k-means with n_clusters=100. (Ignore the NMF decomposition, since we didn't cover it.)

And here are a few reconstructions. For k-means, the reconstruction is simply the closest center:
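A sketch of the k-means reconstruction idea (each point replaced by its closest center), on placeholder blob data rather than the faces dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

kmeans = KMeans(n_clusters=5, random_state=0).fit(X)

# Vector-quantization view: the "reconstruction" of each point is
# simply the center of its cluster.
X_reconstructed = kmeans.cluster_centers_[kmeans.labels_]
print(X[0], '->', X_reconstructed[0])
```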
An advantage of k-means over PCA: we are not restricted to the number of dimensions of the original dataset. Example: the Two Moons dataset has only 2 dimensions, so PCA is no help, but k-means can use, say, 10 clusters, which means 10 features:
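A minimal sketch of this idea, assuming make_moons: with 10 clusters, each 2D point can be re-expressed by its distances to the 10 centers, i.e. 10 features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # each point assigned to one of 10 clusters

# transform() gives the distance from each point to every center:
# 200 points x 10 centers -> 10 features per point.
distance_features = kmeans.transform(X)
print(distance_features.shape)   # (200, 10)
```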
Would the moon shapes be preserved with fewer than 10 clusters? Experiment to find out!

Aside from the cluster centers themselves, the transform( ) method of KMeans allows easy access to the distance between each point and each center (as in the sketch above).

Complexity: O(kndi), where i is the number of iterations and d is the number of features (the dimension of a data point).

End k-means. Next, we cover a clustering algorithm that offers support in choosing the right number of clusters.
Agglomerative Clustering (AC) (pp. 184-9)

Pseudocode:
Start by declaring each point its own cluster.
Repeat: merge the two closest clusters,
until a certain objective is reached.

The Big-Oh complexity of AC (not in text)

The nr. of data points is n and the number of features is d. Remark: the complexity depends on the exact way in which the distance between clusters is measured. Since clusters are merged, or linked, the distance is called the linkage. The discussion in our text implies the so-called single linkage, a.k.a. nearest-neighbor distance; it is the minimum distance over all pairs of points in the two clusters.

A relatively straightforward algorithm for single linkage uses an n x n proximity matrix to keep track of the distances between clusters at each step. It can be shown to have time complexity O(n²d + n³) = O(n³). Faster, but more intricate, algorithms exist for single linkage [13] that are only O(n²). Interestingly, the Scikit-learn implementation of AC does not use single linkage; see below for details.
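A didactic sketch of the straightforward approach; for clarity it recomputes single-linkage distances from a precomputed proximity matrix at every merge (the O(n³) version described above avoids this with extra bookkeeping). This illustrates the idea only; it is not the Scikit-learn algorithm:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single linkage (illustration only)."""
    n = len(X)
    # n x n proximity matrix of pairwise point distances.
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]   # each point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the two closest clusters: minimum distance over all point pairs.
        best, best_d = (0, 1), float('inf')
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(prox[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(single_linkage(X, 2))  # [[0, 1, 2, 3], [4]]
```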
Using AC in Scikit-learn

Unlike KMeans, AgglomerativeClustering does not have a predict( ) method, because the clusters do not naturally divide the space among themselves. We can either:
- apply fit( ) and then use the information in the attribute labels_, as with KMeans, or
- apply fit_predict( ) in one step, which returns the cluster labels (see the sketch below).

Nothing prevents us, of course, from using the labels to create our own prediction algorithm. As an example, we can find the centroid of each cluster, and then use the distance of a new point to those centroids, k-means-style.

The two most important hyper-parameters are:
- n_clusters: the number of clusters where the bottom-up process stops. Its default value is 2.
- linkage: how the distance between pairs of clusters is calculated, in order to decide which two will be merged next. Here is the official documentation [14]:

It is also possible to override the blind linkage if we have extra information about the structure of the data. In the parameter connectivity, we can specify a matrix of neighboring data points.
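A minimal sketch of both routes, on placeholder blob data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Route 1: fit, then read labels_ (as with KMeans).
agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X)
print(agg.labels_)

# Route 2: fit_predict in one step returns the same labels.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels)
```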
The dendrogram

It cannot be drawn (yet) in Scikit-learn, but it is easy to use the original SciPy. We can then use the usual matplotlib.pyplot tools to show cuts at various inter-cluster distances.

Note: gca = get current (figure's) axes (or create a new figure).

The vertical segments in the dendrogram show how far apart the clusters are. Intuitively, we want to stop ("cut" the dendrogram) at the largest branches, as the cut lines in the sketch below illustrate.
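A sketch along the lines of the book's example; the cut heights are illustrative, not exact:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=12, random_state=0)

# ward() returns the linkage array encoding the successive merges.
linkage_array = ward(X)
dendrogram(linkage_array)

# Mark cuts at two inter-cluster distances on the current axes.
ax = plt.gca()                               # gca = get current axes
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')   # cut into 2 clusters (illustrative height)
ax.plot(bounds, [4, 4], '--', c='k')         # cut into 3 clusters (illustrative height)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()
```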
Conclusions on AC:
- A little slower than k-means in practice, but still scales well.
- If we manually create a connectivity matrix based on the structure of the dataset, it can capture complex shapes of clusters... but it does not do it out of the box.
- It still has no concept of the density of the data points. (But the next and last clustering algorithm we present below does capture density!)

DBSCAN

The acronym stands for Density-Based Spatial Clustering of Applications with Noise.

Idea: clusters are regions of (relatively) high density of points, separated by regions of (relatively) low density. More detailed idea: identify core samples, then extend the cluster by finding other core samples.

What happened? The special label -1 means "no cluster", i.e. all points are classified as noise! Normal labels start at 0, as usual. We have to tweak two important hyper-parameters: min_samples (default 5) and eps (epsilon, default 0.5):
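A sketch of the situation described above, assuming a small blob dataset: with the defaults, every point can come out labeled -1 (noise); tweaking eps and min_samples recovers clusters:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=12, random_state=0)

# Default hyper-parameters: min_samples=5, eps=0.5.
labels = DBSCAN().fit_predict(X)
print(labels)    # can be all -1: every point classified as noise

# Tweaking the two hyper-parameters:
labels = DBSCAN(min_samples=3, eps=1.5).fit_predict(X)
print(labels)    # normal labels start at 0
```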
Performance on the two-moons dataset:
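A minimal sketch of this experiment, assuming make_moons; as in the book, rescaling the data first makes the default eps reasonable:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Rescale to zero mean and unit variance, so the default eps=0.5 works.
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN().fit_predict(X_scaled)
print(set(labels))   # ideally {0, 1}: the two moons, found without specifying k
```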
Big-Oh complexity: O(n²), but can be brought down to O(n log n).

Conclusions on DBSCAN:

Pros:
- Does not need the nr. of clusters as a hyper-parameter!
- Models density!
- Can find arbitrarily-shaped clusters!
- Robust to outliers: will leave a data point outside of any cluster if it is too far!
- In practice, it is a little slower than k-means and AC, but not by much.

Cons:
- Does not do well if the dataset has regions with large differences in density.
- Choosing a good threshold requires good understanding of the scaling of the data.
- Not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data are processed [15].

Read FYI: Comparing and Evaluating Clustering Algorithms

Homework #4 was assigned
More information