Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Medha Vidyotma
April 24, 2018

1 Contd. Random Forest

For example, if there are 50 scholars who measure the length of a rod, there will be 50 different measurements, so the noise in the data due to errors gets cancelled out. Similarly, a combination of learning models (an ensemble of classifiers) increases classification accuracy. A random forest is a large collection of decorrelated decision trees; a new tree is built for every random subset of the data. There can be a total of $2^{n+m} - 1$ trees in the forest, since each of the n attributes and each of the m data points has an equal chance of either being included or not being included in a tree. The forest takes a vote to predict the class of a data point.
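As a concrete illustration of this voting scheme, here is a minimal sketch (assuming scikit-learn, which the references below also point to; the dataset and parameter values are illustrative, not from the lecture):

# Each tree is trained on a random subset of data points (bootstrap
# samples) and sees a random subset of attributes at each split; the
# forest predicts by majority vote of the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,      # number of decorrelated trees in the forest
    max_features="sqrt",  # random subset of attributes per split
    bootstrap=True,       # random subset of data points per tree
    random_state=0,
)
forest.fit(X, y)

# Majority vote of the 50 trees predicts the class of a new data point.
print(forest.predict(X[:1]))

Bootstrap sampling of data points and random attribute subsets at each split are what decorrelate the trees.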

2 Clustering

The aim of clustering is to segregate groups with similar traits and assign them to clusters. Clustering is an example of unsupervised learning: it is used when data is available but without any labels, i.e., without any a priori information. For example, clustering customers can help us understand them better and market a product to them more effectively. Clustering is used for operations like 1) summarization, 2) compression, and 3) finding nearest neighbors.

Types of clustering are:

1. Hierarchical: where there is a hierarchical relationship between the clusters.
2. Partitional: where no hierarchical relationship exists between clusters.
3. Exclusive: where no data point belongs to more than one cluster.
4. Overlapping: where a data point may belong to more than one cluster.
5. Fuzzy: where the probability of a data point belonging to each cluster is evaluated.
6. Complete: where all data points are assigned to clusters.
7. Partial: where a few data points do not belong to any cluster. This happens due to outliers.

K-means is a partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids. To run a k-means algorithm, you have to randomly initialize three points, called the cluster centroids (see the figure below).

I have three cluster centroids because I want to group my data into three clusters. K-means is an iterative algorithm that alternates between two steps: 1) a cluster assignment step and 2) a move centroid step (see the figure below).[5] These steps are repeated until the centroids converge.[5]

Closeness is measured by:

1. Euclidean distance
   $d = \sqrt{\sum_i (q_i - p_i)^2}$   (1)
2. Cosine similarity
   $d = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$   (2)
3. Correlation
4. Bregman divergence
   $D_F(x, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle$   (3)

In the cluster assignment step, the algorithm goes through each of the data points and, depending on which cluster centroid is closer (the red, the blue, or the green one), assigns the data point to one of the three cluster centroids.[5] In the move centroid step, K-means moves each centroid to the average of the points in its cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location. This process is repeated until there is no change in the clusters (or possibly until some other stopping condition is met). The k initial centroids are chosen randomly, or the user supplies specific starting points.
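A minimal NumPy sketch of the two-step loop just described, using the Euclidean distance of equation (1); the function name and the choice to initialize centroids at k random data points are assumptions, not from the lecture:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: k randomly chosen data points become the centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid moves to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: no change
            break
        centroids = new_centroids
    return centroids, labels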

The three lines in the figure show the path from each centroid's initial location to its final location. For choosing an appropriate value of K, just run the experiment with different values of K and see which ones generate good results. The value of k can be decreased if some clusters are too small, and increased if the clusters are too broad.

3 Evaluation of K-means

The objective of k-means is to minimize the distance of points from their assigned centroids. Choose the clustering that minimizes the sum of squared error (SSE), written out after the list below. The red lines on the graph indicate that the value of SSE can be anywhere in the range of the length of the line, since average values have been taken to plot the graph. A good clustering will have a small number of clusters k; therefore, there is a trade-off between k and the SSE. For choosing k:

1. Iterate over a large range of k's to find the optimum k.
2. Pre-process the data with other algorithms.
3. Domain knowledge about the data may be very important for estimating the number of clusters that can exist.
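Written out explicitly (a standard form, with notation assumed rather than taken from the lecture: $C_i$ denotes the $i$-th cluster and $c_i$ its centroid), the SSE objective that k-means minimizes is

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2.$$

Adding more clusters always drives SSE down (with k equal to the number of points it reaches zero), which is why SSE alone cannot pick k; one looks for the "elbow" where further increases in k stop paying off.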

Initialization of Centers:

1. The centers can be random points in space.
2. A data point can be selected, and over the iterations the centroids jump from one data point to another.
3. A random point in a dense region.
4. Points spaced uniformly over the feature space.

Cluster Quality:

1. Depends on the relative sizes of clusters and the inter-cluster distance.
2. The distance between members of a cluster and the cluster center.
3. The diameter of the smallest sphere. In the ideal case, the diameter of the smallest sphere should be large, because this means that all cluster spheres are of similar sizes.
4. The ability of the model to discover hidden patterns in the data.

Limitations of K-means

The clustering is heavily dependent on where the centroids are placed initially, as we see from the figure. When the data was supposed to be clustered as per figure 1, it gets clustered as per figure 2 because of unfortunate placement of the centroids. The model has problems when the data has:

1. Different-sized clusters.
2. Different densities.
3. Non-globular shapes (see the sketch after this list).
4. Empty clusters: when a cluster becomes empty, which can occur in a continuously changing database, the concept needs to be dropped. If there are no items in a cluster, the model does not form that cluster.
5. Outliers.
6. Incremental centroid updates: every time a new data point is added, the model needs to be run again, which is computationally expensive.
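To make the non-globular limitation concrete, here is a minimal sketch (assuming scikit-learn; the dataset and parameters are illustrative, not from the lecture) where k-means fails to separate two crescent-shaped clusters, because it carves space into convex regions around the centroids:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-globular clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons: 1.0 would be a perfect match, and
# k-means typically scores well below that here. Density-based methods
# such as DBSCAN (Section 4) handle shapes like this.
print("k-means ARI:", adjusted_rand_score(y_true, labels))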

Note that K-NN and K-means are different: K-NN is a supervised approach for classification, whereas K-means is an unsupervised clustering algorithm.

4 Other Approaches

K-Medoids: a data point is chosen as the center, and it minimizes a sum of pairwise dissimilarities.

Agglomerative clustering: repeatedly merges the two closest clusters until a single cluster remains (Single Link).

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Pros: can handle dynamic datasets and clusters of arbitrary shape.
Cons: sensitive to parameters.
Parameters: Eps (ε) and the minimum number of points (MinPts).
Points are classified as core, border, or noise (see the sketch after this list):

1. A point p is a core point if at least MinPts points are within distance ε of it (ε is the maximum radius of the neighborhood around p), including p itself. Those points are said to be directly reachable from p.[5]
2. A point q is directly reachable from p if q is within distance ε of p and p is a core point.
3. A point q is reachable from p if there is a path p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable from p_i (all the points on the path must be core points, with the possible exception of q).
4. All points not reachable from any other point are outliers.
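A minimal sketch implementing these definitions (the function name and the brute-force O(n^2) distance computation are assumptions; real DBSCAN implementations use spatial indexes):

import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself (definition 1).
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & (dists[i] <= eps)):
            # Within eps of some core point: directly reachable (def. 2).
            labels.append("border")
        else:
            # Not reachable from any core point (definition 4).
            labels.append("noise")
    return labels

Border points belong to a cluster but cannot extend it; only core points grow clusters.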

In this diagram[6], MinPts = 4. Point A and the other red points are core points, because the area surrounding these points within an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.

Incremental DBSCAN (Addition). When a new point is added:

1. A new cluster can be created.
2. The new point can be absorbed into an existing cluster.
3. Two existing clusters can be merged due to the new point.

Similarly for Decremental DBSCAN (Deletion):

1. An existing cluster can be deleted due to point deletion.
2. An existing cluster can be split into two clusters because of the deletion.
3. A cluster can shrink in size because of point deletion.

All of these can be accomplished in O(1) order.

5 References

1. Tom M. Mitchell, Machine Learning, ch. 3. Decision Tree 1: how it works, https://www.youtube.com/watch?v=ekd5gxppey0
2. T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, pp. 881-892 (2002).
3. https://www-users.cs.umn.edu/kumar/dmbook/ch8.pdf
4. http://scikit-learn.org/stable/modules/generated/sklearn.cluster.dbscan.html
5. http://bigdatamadesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/
6. https://en.wikipedia.org/wiki/d