CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16


REFERENCES
1. K-medoids: https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method and https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf
2. K-means: https://www.datascience.com/blog/k-means-clustering and https://en.wikipedia.org/wiki/Elbow_method_(clustering)
3. CLARA: http://www.sthda.com/english/articles/27-partitioning-clustering-essentials/89-clara-clustering-large-applications/
4. Book: Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2011, Chapter 10, pp. 445-454.

PART 1

OVERVIEW
- What is clustering?
- Similarity measures
- Requirements of a good clustering algorithm
- K-means clustering
- K-medoids clustering: PAM
- K-medoids clustering: CLARA
- Applications of K-means and K-medoids

WHAT IS CLUSTERING? A way of grouping together data samples that are similar in some way, according to some criterion that you pick. It is a form of unsupervised learning, and can also be seen as a method of data exploration.

SIMILARITY MEASURES A good clustering method will produce high-quality clusters with:
1. high intra-class similarity
2. low inter-class similarity
The quality of a clustering result depends on:
1. the similarity measure used by the method and its implementation
2. its ability to discover some or all of the hidden patterns

REQUIREMENTS OF A GOOD CLUSTERING ALGORITHM
- Scalability
- Discovery of clusters with arbitrary shape
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints

CLUSTERING ALGORITHMS
1. K-means
2. K-medoids
   2.1 Basic K-medoids
   2.2 PAM
   2.3 CLARA

K-MEANS CLUSTERING
First used by James MacQueen in 1967
Unsupervised learning
Goal: find the groups in the given data, where the number of groups is denoted by K
Groups are formed based on feature similarity
Expected results: centroids used to label data; labels for the training data
Uses: behavioral segmentation, inventory categorization, sorting sensor measurements

K-MEANS PROCESS
Input: the data set and K, i.e. the number of clusters
Data assignment step: each centroid represents one cluster; each data point is assigned to its nearest centroid based on squared Euclidean distance
Centroid update step: recompute each centroid as the mean of the data points assigned to it
The algorithm repeats the two steps until the end condition is met: no change in clusters

K-MEANS PROCESS [figure] Source: https://www.slideshare.net/edurekain/k-means-clustering

K-MEANS ALGORITHM
Input: K (the number of clusters to form) and the input data set x(1), ..., x(m)
Initialize: randomly choose K cluster centroids μ1, μ2, ..., μK ∈ R^n
Repeat {
  for i = 1 to m:
    c(i) := index (1 to K) of the cluster centroid closest to data point x(i)
  for k = 1 to K:
    μk := mean of the points assigned to cluster k
}
Stop when the convergence criterion is met.
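To make the loop concrete, here is a minimal NumPy sketch of the algorithm above (not part of the original slides); the data X, the value of K, and the random-point initialization are placeholder assumptions.

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(K)
        ])
        # Stop when the centroids (equivalently, the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```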

CHOOSING THE NUMBER OF CLUSTERS K
Elbow method: the metric used is the mean distance between data points and their cluster centroid. Plot it as a function of K and pick the K at the "elbow", where increasing K further yields only diminishing improvement (a sketch follows below).
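A short sketch of the elbow method under this definition; it reuses the k_means sketch above, and the synthetic data X is a placeholder, not from the slides.

```python
import numpy as np

def mean_distance(X, centroids, labels):
    """Mean Euclidean distance of each point to its cluster centroid."""
    return np.mean(np.linalg.norm(X - centroids[labels], axis=1))

# Evaluate K = 1..10 and look for the "elbow" where the curve flattens.
X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
for K in range(1, 11):
    centroids, labels = k_means(X, K)
    print(f"K={K}: mean distance {mean_distance(X, centroids, labels):.3f}")
```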

K-MEANS EXAMPLE [figure] Source: http://mnemstudio.org/clustering-k-means-example-1.htm

K-MEANS EXAMPLE [figure] Source: http://rossfarrelly.blogspot.com/2012/12/k-means-clustering.html

ADVANTAGES AND DISADVANTAGES
Advantages:
1. Easy to implement.
2. With a large number of variables, K-means may be computationally faster than hierarchical clustering (if K is small).
Disadvantages:
1. Difficult to predict the number of clusters (the value of K).
2. Can converge to a local minimum.
3. Sensitive to outliers.

K-MEDOIDS CLUSTERING The mean in k-means clustering is sensitive to outliers: an object with an extremely high value can substantially distort the distribution of the data. Hence we move to k-medoids. Instead of taking the mean of a cluster, we take the most centrally located point in the cluster as its center. These points are called medoids. Source: https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method

K-MEANS & K-MEDOIDS CLUSTERING: outliers comparison [figure]

K-MEDOIDS - BASIC ALGORITHM
Input: K (the number of clusters to form)
Initialize: select K points as the initial representative objects, i.e. the initial medoids of our K clusters
Repeat:
  Assign each point to the cluster with the closest medoid m
  Randomly select a non-representative object o_i
  Compute the total cost S of swapping the medoid m with o_i
  If S < 0: swap m with o_i to form the new set of medoids
Stop when the convergence criterion is met.
Source: https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method
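The swap test is the core of the loop above. Here is a minimal sketch of the cost computation, assuming Euclidean dissimilarity and representing medoids as indices into the data array X (both are our assumptions, not from the slides).

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Total dissimilarity: each point's distance to its closest medoid.
    medoid_idx is a list of row indices into X."""
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def swap_cost(X, medoids, m, o):
    """S = cost after swapping medoid m for non-medoid o, minus cost before.
    m is one of the indices in `medoids`, o a non-medoid index.
    A negative S means the swap improves the clustering."""
    swapped = [o if idx == m else idx for idx in medoids]
    return total_cost(X, swapped) - total_cost(X, medoids)
```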

K-MEDOIDS - BASIC FLOWCHART [figure] Source: https://www.researchgate.net/figure/flowchart-of-k-means-algorithm_fig1_314078156

K-MEDOIDS - PAM ALGORITHM
PAM stands for Partitioning Around Medoids.
GOAL: find clusters that have minimum average dissimilarity between objects that belong to the same cluster.
ALGORITHM:
1. Start with an initial set of medoids.
2. Iteratively replace one of the medoids with a non-medoid if it reduces the total SSE of the resulting clustering. The SSE is calculated as
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} |x - M_i|^2
where k is the number of clusters, x is a data point in cluster C_i, and M_i is the medoid of C_i.
Source: https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method
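A compact sketch of the PAM loop as described above, assuming Euclidean dissimilarity and random initial medoids; exhaustively trying every medoid/non-medoid swap is what gives PAM the per-iteration cost discussed later.

```python
import numpy as np
from itertools import product

def pam(X, K, seed=0):
    """Sketch of PAM: try medoid/non-medoid swaps until no swap
    lowers the total dissimilarity (Euclidean here; any metric works)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = list(rng.choice(n, size=K, replace=False))

    def cost(meds):
        d = np.linalg.norm(X[:, None, :] - X[meds][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        non_medoids = [i for i in range(n) if i not in medoids]
        for m, o in product(range(K), non_medoids):
            trial = medoids.copy()
            trial[m] = o                    # replace the m-th medoid with o
            c = cost(trial)
            if c < best:                    # accept only cost-reducing swaps
                medoids, best, improved = trial, c, True
                break                       # recompute candidates after a swap
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :],
                            axis=2).argmin(axis=1)
    return medoids, labels
```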

TYPICAL PAM EXAMPLE [figure] Source: https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method

K-MEDOIDS (PAM) EXAMPLE
For K = 2, randomly select the medoids m1 = o2 = (3,4) and m2 = o8 = (7,4).
Using Manhattan distance as the dissimilarity metric, we get:
C1 = (o1, o2, o3, o4)
C2 = (o5, o6, o7, o8, o9, o10)
Source: https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

K-MEDOIDS (PAM) EXAMPLE
Compute the absolute error as follows:
E = d(o1,o2) + d(o3,o2) + d(o4,o2) + d(o5,o8) + d(o6,o8) + d(o7,o8) + d(o9,o8) + d(o10,o8)
E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
Therefore, E = 20
Source: https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

K-MEDOIDS (PAM) EXAMPLE
Swapping o8 with o7, compute the absolute error as follows:
E = d(o1,o2) + d(o3,o2) + d(o4,o2) + d(o5,o7) + d(o6,o7) + d(o8,o7) + d(o9,o7) + d(o10,o7)
E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3)
Therefore, E = 22

K-MEDOIDS (PAM) EXAMPLE
Let us now calculate the cost S of this swap:
S = E for (o2, o7) - E for (o2, o8)
S = 22 - 20 = 2
Since S > 0, this swap is undesirable.
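The arithmetic above can be checked directly. The coordinates below are those used in the cited lesson8 example (they reproduce the distances on these slides exactly); the helper names are ours.

```python
# Coordinates of o1..o10 as used in the cited example (lesson8-clustering.pdf).
pts = {"o1": (2, 6), "o2": (3, 4), "o3": (3, 8), "o4": (4, 7), "o5": (6, 2),
       "o6": (6, 4), "o7": (7, 3), "o8": (7, 4), "o9": (8, 5), "o10": (7, 6)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def absolute_error(medoids):
    """Sum of each non-medoid point's Manhattan distance to its closest medoid."""
    return sum(min(manhattan(p, pts[m]) for m in medoids)
               for name, p in pts.items() if name not in medoids)

e_before = absolute_error(["o2", "o8"])   # 20
e_after = absolute_error(["o2", "o7"])    # 22
print(e_before, e_after, "S =", e_after - e_before)  # S = 2 > 0: reject swap
```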

ADVANTAGES AND DISADVANTAGES OF PAM
Advantages:
- PAM is more flexible, as it can use any similarity measure.
- PAM is more robust than k-means, as it handles noise better.
Disadvantages:
- PAM works well for small data sets but cannot scale to large data sets due to its high computational overhead.
- PAM COMPLEXITY: O(k(n-k)^2). This is because we compute the distance of the n-k non-medoid points to each of the k medoids to decide which cluster each falls into, and then try to replace each medoid with a non-medoid and recompute its distance to the n-k points.
To overcome this we make use of CLARA.

CLARA - CLUSTERING LARGE APPLICATIONS
Improvement over PAM: finds medoids in a sample drawn from the data set.
[Idea]: if the samples are sufficiently random, the medoids of the sample approximate the medoids of the data set.
[Heuristics]: 5 samples of size 40 + 2k give satisfactory results.
Works well for large data sets (n = 1000, k = 10).
Source: https://en.wikibooks.org/wiki/data_mining_algorithms_in_r/clustering/clara

CLARA ALGORITHM
1. Randomly split the data set into multiple subsets of fixed size (sampsize).
2. Run the PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
3. Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
4. Retain the sub-dataset for which the mean (or sum) is minimal. A further analysis is carried out on the final partition.
A sketch of these steps follows below.
Source: http://www.sthda.com/english/articles/27-partitioning-clustering-essentials/89-clara-clustering-large-applications/
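A sketch of the four steps above, reusing the pam function from the earlier sketch; the 5 samples and the size 40 + 2k follow the heuristic quoted above, while the sum-of-dissimilarities score and everything else are our assumptions.

```python
import numpy as np

def clara(X, K, n_samples=5, seed=0):
    """Sketch of CLARA: run PAM on several random subsets and keep the
    medoids that score best on the *entire* data set."""
    rng = np.random.default_rng(seed)
    sampsize = min(len(X), 40 + 2 * K)       # heuristic from the slide
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sampsize, replace=False)
        sub_medoids, _ = pam(X[idx], K)      # PAM on the subset
        medoids = X[idx][sub_medoids]        # medoid coordinates
        # Goodness: total distance of *all* points to their closest medoid.
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        cost = d.min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = np.linalg.norm(X[:, None, :] - best_medoids[None, :, :],
                            axis=2).argmin(axis=1)
    return best_medoids, labels
```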

COMPARISON: CLARA vs. PAM
Strengths:
- Deals with larger data sets than PAM.
- CLARA outperforms PAM in terms of running time and quality of clustering.
Weaknesses:
- Efficiency depends on the sample size.
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set.
Source: https://en.wikibooks.org/wiki/data_mining_algorithms_in_r/clustering/clara

APPLICATIONS [figures]: social networks; document clustering

GENERAL APPLICATIONS OF CLUSTERING
1. Pattern recognition
2. Spatial data analysis
   a. create thematic maps in GIS by clustering feature spaces
   b. detect spatial clusters and explain them in spatial data mining
3. Image processing
4. Economic science (especially market research)
5. WWW
   a. document classification
   b. cluster weblog data to discover groups of similar access patterns

PART 2

TRAFFIC ANOMALY DETECTION USING K-MEANS CLUSTERING
Authors: Gerhard Münz, Sa Li, Georg Carle
Computer Networks and Internet, Wilhelm Schickard Institute for Computer Science, University of Tuebingen, Germany
Published in the GI/ITG Workshop MMBnet, 2007.

NETWORK DATA MINING Extracts knowledge from monitoring data. Helps in determining dominant characteristics and outliers. Deployed to define rules or patterns that are typical for specific kinds of traffic, which helps to analyze new sets of monitoring data (labeling).

NOVEL NDM APPROACH
K-means clustering of monitoring data.
Aggregate and transform flow records into data sets for equally spaced time intervals.
Raw data and extracted features (a sketch of this aggregation follows below):
- total number of packets sent
- total number of bytes sent
- number of different source-destination pairs
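As an illustration of this aggregation step, here is a sketch; the flow-record fields, the 5-second interval, and all values are hypothetical, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical flow records: (timestamp, src, dst, packets, bytes).
flows = [
    (0.5, "10.0.0.1", "10.0.0.9", 12, 9000),
    (1.2, "10.0.0.2", "10.0.0.9", 3, 400),
    (5.7, "10.0.0.1", "10.0.0.3", 50, 61000),
]

def aggregate(flows, interval=5.0):
    """Aggregate flow records into one feature vector per time interval:
    (total packets, total bytes, number of distinct src-dst pairs)."""
    buckets = defaultdict(lambda: [0, 0, set()])
    for ts, src, dst, pkts, byts in flows:
        b = buckets[int(ts // interval)]
        b[0] += pkts
        b[1] += byts
        b[2].add((src, dst))
    return {t: (p, by, len(pairs)) for t, (p, by, pairs) in sorted(buckets.items())}

print(aggregate(flows))  # {0: (15, 9400, 2), 1: (50, 61000, 1)}
```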

K-MEANS CLUSTERING The distance metric used is: [formula shown in the original slide; not transcribed]

CLASSIFICATION AND OUTLIER DETECTION [figures: classification; outlier detection]

CLASSIFICATION AND OUTLIER DETECTION [figure: combined approach]

RESULTS: PING FLOOD DETECTION [figure]

CONCLUSIONS
- The resulting cluster centroids can be used to detect anomalies in new online monitoring data with a small number of distance calculations.
- Applying the clustering algorithm separately for different services improves the detection quality.
- The algorithm is scalable.
- The optimum number of clusters K is difficult to decide.