EPFL ENAC TRANSP-OR
Prof. M. Bierlaire
Gael Lederrey & Nikola Obrenovic
Decision Aids - Spring 2018

DATA MINING I - CLUSTERING - EXERCISES

Exercise 1

In this exercise, you will implement the k-means clustering algorithm. You can find the main pseudocode for the algorithm in Algorithm 1. You will also find the other functions used by Algorithm 1 in Algorithms 2 and 3.

Algorithm 1: k-means clustering
Result: Compute the k clusters using k-means.
Input : data, k, maxiters, threshold
Output: centroids, assignments
 1  Initialize the clusters by randomly selecting k samples as init_clusters
 2  Initialize an empty list to store the losses: loss
 3  foreach i = 1 to maxiters do
        /* See Algorithm 2 for the pseudo-code */
 4      Compute the updated k-means parameters
 5      Compute the average loss and add it to the list loss
        /* Stopping criterion. Be careful when you have only one value in loss. */
 6      if |loss(i) - loss(i-1)| < threshold then
 7          break
 8      end
 9      Update the centroids
10  end

Algorithm 2: Update k-means parameters
Result: Update the centroids, compute the loss and the new assignments.
Input : data, old_centroids
Output: losses, assignments, new_centroids
    /* See Algorithm 3 for the pseudo-code */
 1  Build the distance matrix between the centroids and the samples
 2  Compute the assignments based on the minimum distance between samples and centroids
 3  foreach k = 1 to #clusters do
 4      Get the indices corresponding to all samples assigned to cluster k
 5      Compute the new centroid of cluster k using the average position of all samples belonging to cluster k
 6  end
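To make the structure of Algorithm 1 concrete, here is a minimal MATLAB sketch of the main loop. The function name kmeans_sketch is a placeholder, the signature of update_kmeans_parameters is only a guess at the interface you will implement, and the loss of a sample is taken to be its distance to the closest centroid; your own files may use different conventions.

function [centroids, assignments] = kmeans_sketch(data, k, maxiters, threshold)
    % Line 1: randomly select k samples as the initial centroids
    centroids = data(randperm(size(data, 1), k), :);
    % Line 2: empty list to store the losses
    loss = [];

    for i = 1:maxiters                               % line 3
        % Line 4: one update step (Algorithm 2); signature assumed
        [losses, assignments, new_centroids] = update_kmeans_parameters(data, centroids);
        % Line 5: average loss over all samples
        loss(end + 1) = mean(losses);

        % Line 6: stopping criterion -- needs at least two values in loss
        if numel(loss) > 1 && abs(loss(end) - loss(end - 1)) < threshold
            break;                                   % line 7
        end
        centroids = new_centroids;                   % line 9: update the centroids
    end
end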

Algorithm 3: Build Euclidean distance matrix
Result: Build a distance matrix between all samples and centroids.
Input : data, centroids
Output: distance matrix
 1  Create an empty matrix D
 2  foreach k = 1 to #clusters do
 3      Compute the sum of squared differences between all the points in data and the centroid of cluster k
 4      Add it to the matrix D
 5  end
 6  Make sure the matrix D has the right shape.

Here are the steps you need to follow to implement the k-means clustering algorithm:

1. Run the file exercise0.m and analyze the data that we are giving you. More information on this dataset can be found on Wikipedia. Choose the dimensions along which you would like to cluster and add them in the file exercise1.m on lines 11-12.

2. Code the different parts of the k-means algorithm in the following files in the directory functions:

   - initialize_cluster.m
   - build_distance_matrix.m
   - update_kmeans_parameters.m
   - kmeans.m

   Do not forget to define the different parameters for the k-means clustering on lines 18 to 21 in the file exercise1.m. Test your implementation by running this file.

   Hints: You should code the files in the order they are given above. It can be helpful to test the functions you are implementing, so do not hesitate to create new scripts to test your implementations.

3. Test your implementation against ours using the function kmeans_vizu. (The code is not shared, but you can call the function.) This function is also used to show how the clusters are being updated. To do so, comment line 24, uncomment line 33 and run exercise1.m.

Exercise 2

In this exercise, you will implement the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. You can find the pseudo-code below in Algorithm 4.

Here are the steps you need to follow to implement the DBSCAN algorithm:

1. Report the dimensions you used in Exercise 1 in the file exercise2.m on lines 11-12.

2. Implement the algorithm based on the pseudo-code in Algorithm 4. Hint: It may be useful to look at your course material and to search the Internet for more information about this algorithm and how it should behave.

3. Test your algorithm by running the file exercise2.m. But first, you will have to find the two parameters of this algorithm: ε for the range and the minimum number of neighbours N_min,neigh for a point to be a cluster point. Here is how to do it:

   - Choice of N_min,neigh: A low value for this parameter means that DBSCAN will create more clusters. Aside from that, the choice is somewhat arbitrary or requires expert knowledge. Play with it to see how the clusters are computed depending on this parameter.

   - Choice of ε: You can use several methods to choose this parameter. One of them is the k-distance plot; the code for it is provided with the exercises. Indeed, in a clustering with N_min,neigh = k, we expect the core points to all lie within a certain range of their k-th nearest neighbour, while noise points can be much further away. Therefore, you will observe a knee point in the k-distance plot, as shown in Figure 1. (Sometimes you will not find any knee point, and you will then have to use other methods.)

   [Figure 1 plots the 5-NN distance of each sample (y-axis) against the points (samples) sorted by distance (x-axis).]
   Figure 1: k-distance plot for k = N_min,neigh = 5. The knee point is around 2-2.5, but other values are also acceptable since the knee point is not very clear.

   To produce the k-distance plot, you can uncomment line 23 in the file exercise2.m and run it.

4. Compare the results of this algorithm with the results you obtained with the k-means algorithm. What are the main differences? Which one do you think is better on this dataset, and why?
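For intuition, this is roughly what a k-distance plot computes. The variable X (an n-by-d matrix holding the selected data columns) and the value k = 5 are assumptions for illustration; the code shipped with the exercises may look different.

% Sketch of a k-distance plot; k plays the role of N_min,neigh (5 as in Figure 1).
k = 5;
n = size(X, 1);
kdist = zeros(n, 1);
for i = 1:n
    % Euclidean distances from sample i to all samples (implicit expansion, R2016b+)
    d = sort(sqrt(sum((X - X(i, :)) .^ 2, 2)));
    kdist(i) = d(k + 1);            % skip d(1) = 0, the distance to i itself
end
plot(sort(kdist));                  % look for the knee in the sorted curve
xlabel('Points (samples) sorted by distance');
ylabel(sprintf('%d-NN distance', k));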

Algorithm 4: DBSCAN
Result: Compute clusters and noise based on two parameters: ε and N_min,neigh.
Input : data, ε, N_min,neigh
Output: cluster labels (a label is given for every point)
 1  Initialize classified, a vector of zeros indicating whether each point has been classified or not.
 2  Initialize nbr_neigh_eps, a vector of zeros for the number of neighbours of each sample within range ε.
 3  Initialize clust_labels, a vector of NaNs giving the cluster label of each sample.
 4  Initialize num_clust, the integer index of the current cluster.
 5  Compute the matrix D of Euclidean distances between all the points.
    foreach i = 1 to #samples do
        /* You start coding from here. The initialization is given to you. */
 6      if i is not classified then
 7          Use the matrix D to find all points within range ε of sample i.
 8          Update the vector nbr_neigh_eps with the number of neighbours of i (without counting i itself).
 9          if nbr_neigh_eps(i) < N_min,neigh then
                /* This sample is noise */
10              This point is now classified and its label is 0.
11          else
                /* This sample belongs to a cluster */
12              Update the current cluster, update the label of sample i, and mark this point as classified.
13              Create an array neigh of the neighbours of i, without i itself.
14              while neigh is not empty do
15                  Take the first neighbour j in the list neigh, find all of its neighbours within range ε and compute the number of neighbours of this point.
16                  if nbr_neigh_eps(j) >= N_min,neigh then
                        /* j is also a core point. We need to do the same thing as for i. */
17                      Find all the unclassified neighbours of j and add them to the array neigh.
18                      Get the neighbours of j that were considered as noise.
19                      Give the unclassified and noisy neighbours the same label as j and i, and mark all the neighbours of j as classified.
20                  end
21                  Remove the first element of neigh.
22              end
23          end
24      end
25  end
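The pseudo-code states that the initialization (steps 1 to 5) is given to you. As a rough sketch of what that part can look like in MATLAB (variable names follow the pseudo-code, epsilon stands for ε; the code provided in the exercise files may differ):

% Sketch of the initialization assumed by Algorithm 4 (steps 1-5).
n = size(data, 1);
classified    = zeros(n, 1);     % 1 once a point has been classified
nbr_neigh_eps = zeros(n, 1);     % number of neighbours within range epsilon
clust_labels  = NaN(n, 1);       % cluster label of each sample (0 = noise)
num_clust     = 0;               % index of the current cluster

% Matrix D of Euclidean distances between all pairs of points
D = zeros(n, n);
for i = 1:n
    D(:, i) = sqrt(sum((data - data(i, :)) .^ 2, 2));   % implicit expansion
end

% A neighbourhood query (step 7) can then be written, for sample i, as:
%   neigh = find(D(i, :) <= epsilon & (1:n) ~= i);   % neighbours of i, i excluded
%   nbr_neigh_eps(i) = numel(neigh);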

Exercise 3

In this exercise, you do not have to implement anything. In the file exercise3.m, we compare the two algorithms you have implemented, as well as the hierarchical clustering algorithm implemented in Matlab, on different datasets. You can find an example of how it is used on the Mathworks website (a minimal call is also sketched at the end of this exercise).

Once you have implemented the first two exercises, you can run the file exercise3.m. This will give you 4 figures:

- Figure 1 shows the artificial data, composed of non-linear clusters as well as some noise.
- Figure 2 shows the results of the k-means algorithm with k = 6.
- Figure 3 shows the results of the hierarchical clustering.
- Figure 4 shows the results of the DBSCAN algorithm.

Your task is to understand the behaviour of these algorithms on this particular dataset and answer the following questions:

1. In your opinion, which algorithm performs better, and why?

2. Compared to the results of Exercises 1 and 2 (as done in point 4 of Exercise 2), where k-means and DBSCAN were clustering the Iris dataset, do you observe the same ranking between the two algorithms, i.e. are there any differences with your answer from point 4 of Exercise 2? When you try to rank the algorithms, you can use accuracy as the metric. Were you expecting the results you got? If not, why? Can you think of a way to fix the ranking so that it is the same on both datasets?

3. If you encounter a new dataset, which of these three algorithms would you use first, and why?
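For reference, Matlab's built-in hierarchical clustering is usually driven by the functions linkage and cluster from the Statistics and Machine Learning Toolbox. The sketch below assumes a data matrix X and 6 clusters as in Figure 2; the linkage type and options actually used in exercise3.m may differ.

% Minimal sketch of hierarchical (agglomerative) clustering in Matlab.
Z = linkage(X, 'ward');               % build the cluster tree (Ward linkage)
labels = cluster(Z, 'maxclust', 6);   % cut the tree into 6 clusters

% Quick visual check on the first two dimensions
scatter(X(:, 1), X(:, 2), 20, labels, 'filled');
title('Hierarchical clustering (Ward, 6 clusters)');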