Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

1. Supervised Data Mining
   - Classification
   - Regression
   - Outlier detection
   - Frequent pattern mining
2. Unsupervised Data Mining
   - Clustering
   - Feature Extraction
For each method: definition, real use-cases, method, pros and cons

Unsupervised Data Mining
- descriptive or undirected
- finds hidden structure and relations within the data
- determines the existence of classes or clusters in the data
- exploratory analysis
- all variables are treated in the same way

Overview
- Clustering: main principles
  - Definition
  - Types of clustering
  - Applications
- Clustering techniques
  - Distance metrics
  - k-means clustering

Clustering

Definition
Clustering is a data mining function that partitions the data points into natural groups called clusters.
- The goal: points within a cluster are very similar, whereas points in different clusters are as dissimilar as possible.
- Unsupervised (requires data, not labels)
- Outcome: clusters

Types of Clustering
- partitional: divides data points into non-overlapping clusters; each point is in exactly one subset
- hierarchical: finds clusters using previously built clusters
  - agglomerative: start with single-element clusters and merge them
- exclusive: a data point belongs to a single cluster
- non-exclusive: a data point may belong to multiple clusters
- fuzzy, probabilistic: a point belongs to every cluster with a weight between 0 and 1

Applications
1. useful when you don't know what you're looking for
2. used as a stand-alone tool to get insight into the data
3. used as a preprocessing tool for other algorithms (outlier detection, data compression)
- Astronomy: aggregation of stars, galaxies, or super galaxies
- Spatial Data Analysis: create thematic maps in GIS by clustering feature spaces
- Image Processing

- Weblogs: discover groups of similar access patterns
- City planning: identify groups of houses according to their house type, value, and geographical location
- Land use: identify areas of similar land use in an earth observation database
- Earthquake studies: observed earthquake epicentres should be clustered along continental faults
- Summarisation: reduce the size of large data sets
- Marketing

[Figure: Google News]

What is a Cluster?
- a subset of objects which are similar
- a subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it
- a connected region of a multidimensional space containing a relatively high density of objects

What Makes a Clustering Good?
A good clustering method will produce high quality clusters in which:
- intra-cluster similarity is high
- inter-cluster similarity is low
The quality of the result:
- depends on the similarity metric and its implementation
- is measured by the ability to discover all or some hidden patterns
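One rough way to check "high intra-cluster similarity, low inter-cluster similarity" is to compare the mean pairwise distance inside a cluster with the mean distance between two clusters. The sketch below is only an illustration: the function names, the use of Euclidean distance, and the toy points are assumptions, not part of the slides.

```python
import math
from itertools import combinations, product

def mean_intra_distance(cluster):
    """Mean Euclidean distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Mean Euclidean distance over all pairs with one point in each cluster."""
    total = sum(math.dist(a, b) for a, b in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical toy data: two compact, well-separated groups.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Small distances inside a cluster and large distances between clusters
# correspond to high intra-cluster and low inter-cluster similarity.
looks_good = mean_intra_distance(a) < mean_inter_distance(a, b)
print(looks_good)
```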

Distance Metrics
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
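The slide only names the metrics. Under the standard definitions, the Minkowski distance of order p is $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$, with Manhattan (p = 1) and Euclidean (p = 2) as special cases. A minimal sketch, with function names chosen here for illustration:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """Manhattan distance: Minkowski distance with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance: Minkowski distance with p = 2."""
    return minkowski(x, y, 2)

x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y))  # 7.0
print(euclidean(x, y))  # 5.0
```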

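Calculating Cluster Distances
For clusters $k_i$ and $k_j$ with points $x_{i,p} \in k_i$ and $y_{j,q} \in k_j$:
1. Single link: $\mathrm{dist}(k_i, k_j) = \min_{p,q} \mathrm{dist}(x_{i,p}, y_{j,q})$
2. Complete link: $\mathrm{dist}(k_i, k_j) = \max_{p,q} \mathrm{dist}(x_{i,p}, y_{j,q})$
3. Average distance: $\mathrm{dist}(k_i, k_j) = \mathrm{avg}_{p,q}\, \mathrm{dist}(x_{i,p}, y_{j,q})$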
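The three cluster distances differ only in how they aggregate the point-to-point distances between the two clusters. A small sketch, assuming Euclidean distance between points and using illustrative function names and toy clusters:

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one point from each cluster."""
    return [math.dist(x, y) for x, y in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    """Distance between the two closest points of the clusters."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    """Distance between the two farthest points of the clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_distance(cluster_a, cluster_b):
    """Mean of all point-to-point distances between the clusters."""
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

k_i = [(0.0, 0.0), (1.0, 0.0)]
k_j = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(k_i, k_j))       # 3.0
print(complete_link(k_i, k_j))     # 6.0
print(average_distance(k_i, k_j))  # 4.5
```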

Centroid vs. Medoid
Centroid: the middle of a cluster $C$
  $\mathrm{centroid} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad n = |C|$
- does not have to be one of the data points in the cluster
Medoid: the central point of a cluster $C$
- the data point that is "least dissimilar" from all of the other data points
- has to be one of the data points in the cluster
Centroid distance: $\mathrm{dist}(k_i, k_j) = \mathrm{dist}(\mathrm{centroid}_i, \mathrm{centroid}_j)$
Medoid distance: $\mathrm{dist}(k_i, k_j) = \mathrm{dist}(\mathrm{medoid}_i, \mathrm{medoid}_j)$
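To make the difference concrete, the sketch below computes both for a small cluster. It assumes Euclidean distance as the dissimilarity and reads "least dissimilar" as "smallest total distance to all other points"; the function names and data are illustrative.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points; need not be a data point."""
    n = len(cluster)
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dims))

def medoid(cluster):
    """Data point with the smallest total distance to all points in the cluster."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (1.0, 1.0)]
print(centroid(cluster))  # (1.8, 1.8), not one of the data points
print(medoid(cluster))    # (1.0, 1.0), always one of the data points
```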

k-means
- very popular algorithm for clustering
- object = n-dimensional vector
- the user specifies k (the number of clusters)
- generic sketch:
  (1) pick k random vectors as centroids
  (2) assign each vector to the cluster of its closest centroid
  (3) compute the centroid of each cluster
  (4) repeat from (2) until the clusters converge or a maximum number of iterations is reached
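The generic sketch can be written out in a few lines of plain Python. This is a minimal illustration rather than a reference implementation. It assumes Euclidean distance, draws the initial centroids from the data points (one possible reading of "pick k random vectors"), and keeps the old centroid if a cluster ends up empty. The names `kmeans`, `max_iter` and `seed` are chosen here.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # (1) k random vectors (assumption: data points)
    labels = None
    for _ in range(max_iter):              # (4) stop after at most max_iter rounds
        # (2) assign every vector to its closest centroid
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:           # (4) assignments unchanged: converged
            break
        labels = new_labels
        # (3) recompute the centroid of each cluster
        for j in range(k):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:                    # assumption: keep old centroid if empty
                centroids[j] = mean(members)
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the starting centroids are random, the cluster numbering (and, on harder data, the clusters themselves) can change with `seed`.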

[Figure: k-means algorithm]

k-means in action (example from Introduction to Machine Learning, Xiaojin Zhu)
1. The dataset. Input: k = 5.
2. Randomly pick 5 positions as initial cluster centers (not necessarily data points).
3. Each point finds which cluster center it is closest to (very much like 1NN). The point belongs to that cluster.
4. Each cluster computes its new centroid, based on which points belong to it.
5. Repeat until convergence (cluster centers no longer move).
[Figures: initial cluster centers; successive iterations of k-means; k-means stops]

Why does k-means converge?
- Whenever an assignment is changed, the sum of squared distances of the data points from their assigned cluster centers is reduced.
- Whenever a cluster center is moved, the sum of squared distances of the data points from their currently assigned cluster centers is reduced.
- If the assignments do not change in the assignment step, we have converged.
- Since there are only finitely many ways to assign the points to k clusters and the sum never increases, the algorithm must stop after a finite number of steps.
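The quantity in this argument is the objective $J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $\mu_j$ is the center of cluster $C_j$. The sketch below records $J$ after every assignment step and every update step on random data, so the "never increases" claim can be checked numerically. The helper names, the random data and the seed are assumptions made for this illustration.

```python
import math
import random

def sse(points, centroids, labels):
    """Sum of squared distances of the points to their assigned centers."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, labels))

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Update step: move each center to the mean of its members."""
    new = list(centroids)
    for j in range(len(centroids)):
        members = [p for p, a in zip(points, labels) if a == j]
        if members:  # an empty cluster keeps its old center
            new[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return new

rng = random.Random(1)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(60)]
centroids = rng.sample(points, 3)
labels = assign(points, centroids)

history = []
for _ in range(50):
    history.append(sse(points, centroids, labels))  # after an assignment step
    centroids = update(points, labels, centroids)
    history.append(sse(points, centroids, labels))  # after an update step
    new_labels = assign(points, centroids)
    if new_labels == labels:                        # no change: converged
        break
    labels = new_labels

# The recorded objective values form a non-increasing sequence.
print(all(b <= a + 1e-9 for a, b in zip(history, history[1:])))
```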

k-means Convergence
1. assign each point to its nearest centroid
2. compute the centroid of each cluster
The algorithm terminates when neither (1) nor (2) results in a change of configuration.

Initial Centroids
- affect the final clusters (inter-cluster and intra-cluster distances)
- often chosen randomly, so the clusters vary from one run to another
- one solution:
  1. pick a random point $x_1$ from the dataset
  2. find the point $x_2$ farthest from $x_1$ in the dataset
  3. find the point $x_3$ whose distance to the closer of $x_1$, $x_2$ is largest
  4. continue until k points are picked; use them as starting centroids for the k clusters
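The selection rule above (each new point is the one farthest from its nearest already-chosen point) can be sketched as follows. Euclidean distance, the function name `farthest_first` and the toy data are assumptions for illustration.

```python
import math
import random

def farthest_first(points, k, seed=0):
    """Pick k starting centroids: one at random, then repeatedly the point
    whose distance to its nearest already-chosen centroid is largest."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                 # step 1: random first point
    while len(centers) < k:
        nxt = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)                        # steps 2-4
    return centers

# Hypothetical data with three visible groups.
points = [(0.0, 0.0), (0.0, 1.0),
          (10.0, 0.0), (10.0, 1.0),
          (5.0, 9.0), (5.0, 10.0)]
centers = farthest_first(points, k=3)
print(centers)
```

On this data the three chosen points fall in three different groups, whichever point is drawn first. Only the first pick is random, so runs differ much less than with fully random initial centroids.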

k-means Properties
- unsupervised, non-deterministic and iterative
- there are always k clusters
- there is always at least one point in each cluster
- clusters are non-hierarchical and they do not overlap

Pros and Cons
Pros:
- fast, robust and easy to understand
- relatively efficient
- best results when data are well separated from each other

Cons:
- requires a priori specification of k
- unable to handle noisy data and outliers
  - the centroid is the average of the cluster members
  - an outlier can dominate the average computation
  - solution: k-medoids
- different initial partitions can result in different final clusters
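A one-dimensional example shows how a single outlier pulls the centroid away while the medoid stays with the bulk of the data. The numbers are made up for illustration, and absolute difference is assumed as the dissimilarity.

```python
values = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is the outlier

# Centroid: the average of the cluster members.
centroid = sum(values) / len(values)

# Medoid: the member with the smallest total distance to all members.
medoid = min(values, key=lambda c: sum(abs(c - v) for v in values))

print(centroid)  # 22.0, far from every regular point
print(medoid)    # 3.0, an actual data point near the bulk of the data
```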