Clustering: Unsupervised data mining technique

Transcription:

Content
    Clustering examples
    Cluster analysis
    Partitional: K-Means clustering method
    Hierarchical clustering methods
    Data preparation in clustering
    Interpreting clusters
    Cluster validation

Clustering: Unsupervised data mining technique

Typical Applications
    Marketing: help marketers segment customers based on similar buying patterns, then use this knowledge to develop targeted marketing programs
    Insurance: identify groups of motor insurance policy holders with a high average claim cost
    Image processing: compress images
    Pre-processing step: identify groups for further modeling purposes, e.g. perform clustering and then regression by cluster

Clustering of people
Clustering of products
Clustering for better fit
Clustering of financial time series

Image compression: Kohonen vector quantization
    Example: Sir Ronald A. Fisher (1890-1962)
    Left = 1024 x 1024 greyscale image at 8 bits per pixel, with 1MB of storage
    Center = 2x2 block VQ, using k=200 clusters, with 245KB of storage
    Right = 2x2 block VQ, using k=4 clusters, with 64KB of storage

Cluster Analysis
    Also called segmentation
    To group a collection of cases into subsets, such that
        Cases within each cluster tend to be similar to each other
        Cases in different clusters tend to be dissimilar
    Cases are similar in what sense?
    What is the best answer for clustering analysis?
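The storage figures quoted for the Fisher image example can be reproduced with a short calculation. The sketch below assumes each 2x2 block is stored as one cluster label costing log2(k) bits and that the codebook itself is ignored; the function name `vq_storage_bytes` is made up for illustration.

```python
import math

def vq_storage_bytes(width, height, block, k):
    """Approximate storage for block VQ, ignoring the codebook (an assumption)."""
    n_blocks = (width // block) * (height // block)
    bits_per_block = math.log2(k)  # ideal code length for one of k cluster labels
    return n_blocks * bits_per_block / 8

original_kb = 1024 * 1024 * 8 / 8 / 1024  # 8 bits per pixel
for k in (200, 4):
    kb = vq_storage_bytes(1024, 1024, 2, k) / 1024
    print(f"k={k}: about {kb:.0f} KB (original {original_kb:.0f} KB)")
# k=200: about 245 KB (original 1024 KB)
# k=4: about 64 KB (original 1024 KB)
```

Fewer clusters mean fewer bits per block, which is why k=4 compresses far more than k=200, at the cost of image quality.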

Many ways of clustering

Types of Clustering Methods
    Two major clustering methods
        Hierarchical: a nested set of clusters is created
        Partitional: one set of clusters is created
    Other clustering methods
        Density-based: based on the notion of density
        Grid-based: based on a multiple-level grid structure
        Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to each model

K-Means clustering
Example 1: K=5
    Assignment
    Re-assignment
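The assignment and re-assignment steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the slides' own implementation: it uses Euclidean distance, takes the initial centers as an argument, and runs on a made-up six-point data set with K=2 rather than the K=5 of the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic K-means on tuples; `centers` are the initial cluster centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update: move each center to the mean of its current members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # nothing moved, so re-assignment would change nothing
        centers = new_centers  # the next pass re-assigns points to the moved centers
    return centers, labels

# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```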

Similarities: Distance measure
A good distance measure satisfies:
    Non-negativity: d(x,y) >= 0
    Symmetry: d(x,y) = d(y,x)
    Triangle inequality: d(x,y) <= d(x,z) + d(z,y)
    Identity: d(x,y) = 0 if and only if x = y

Some Distance Measures
    Euclidean distance
        Most widely used
        Clusters formed tend to be spherical in shape
    Manhattan (city-block) distance
        Clusters formed tend to be more cubical in shape

Euclidean Distance: d(x,y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)
Manhattan Distance: d(x,y) = |x_1 - y_1| + ... + |x_n - y_n|
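As a rough check of the properties listed above, the sketch below implements the Euclidean and Manhattan distances and asserts the four conditions on three sample points. Hamming distance, which appears on the next slide, is included for comparison. The sample points and strings are made up for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

x, y, z = (0, 0), (3, 4), (3, 0)
for d in (euclidean, manhattan):
    assert d(x, y) >= 0                  # non-negativity
    assert d(x, y) == d(y, x)            # symmetry
    assert d(x, y) <= d(x, z) + d(z, y)  # triangle inequality
    assert d(x, x) == 0                  # identity (checked in one direction only)

print(euclidean(x, y), manhattan(x, y), hamming("10110", "11100"))
# 5.0 7 2
```

Checking a handful of points does not prove that a function is a metric; it only catches obvious violations.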

Hamming distance
Exercise
Comments on K-Means
Importance of choosing initial cluster centers
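Why the initial centers matter can be seen on a tiny example. The sketch below is an illustration on made-up data, not the slides' own figure: it runs K-means on four corner points of a rectangle from two different starting positions and compares the resulting sum of squared errors (SSE). Running several starts and keeping the lowest SSE is one common remedy, named here as an assumption rather than as the slides' recommendation.

```python
import math

def kmeans_sse(points, centers, max_iter=100):
    """Run K-means from the given initial centers; return (centers, SSE)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Each center moves to the mean of its group (or stays put if the group is empty).
        updated = [tuple(sum(c) / len(g) for c in zip(*g)) if g else old
                   for g, old in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse

# Hypothetical data: a 4 x 1 rectangle whose natural clusters are its left and right sides.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
_, sse_good = kmeans_sse(points, [(0.0, 0.0), (4.0, 0.0)])
_, sse_poor = kmeans_sse(points, [(2.0, 0.0), (2.0, 1.0)])
print(sse_good, sse_poor)  # 1.0 16.0

# Keep the best of several starts.
starts = [[(0.0, 0.0), (4.0, 0.0)], [(2.0, 0.0), (2.0, 1.0)]]
best_centers, best_sse = min((kmeans_sse(points, s) for s in starts), key=lambda r: r[1])
```

The second start converges immediately to a top/bottom split, a stable but much worse solution than the left/right split.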

Limitation of K-Means: Different size

Limitation of K-Means: Different density
Limitation of K-Means: Non-convex shapes
Overcoming K-Means limitations

Other partitional methods

Hierarchical clustering methods
Distance between two clusters

Dendrogram
Example 2: Single linkage

Example 3: Average linkage

Example 4: Complete linkage
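The three linkage rules named in Examples 2 to 4 differ only in how the distance between two clusters is computed. The sketch below is a naive agglomerative procedure on made-up one-dimensional points, not the worked examples from the slides. The recorded merge distances correspond to the heights at which branches would join in a dendrogram.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)   # closest pair
    if linkage == "complete":
        return max(pairwise)   # farthest pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        merges.append(cluster_distance(clusters[i], clusters[j], linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters, merges

points = [(0.0,), (1.0,), (3.0,), (7.0,)]  # hypothetical data
results = {name: agglomerate(points, name)[1] for name in ("single", "average", "complete")}
print(results["single"])    # [1.0, 2.0, 4.0]
print(results["complete"])  # [1.0, 3.0, 7.0]
```

On this data all three rules merge the points in the same order, but at different heights; on other data the merge order itself can differ.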

Issues in hierarchical clustering

Data preparation in clustering
    Coding data
    Discrete inputs
    Interval inputs
    Mixed inputs
    Missing values
    Variable selection
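The detail of the data preparation slides did not survive in this transcript. One widely used step for interval inputs, shown here as an assumption rather than as the slides' own recommendation, is z-score standardization, so that a variable measured on a large scale does not dominate a distance such as the Euclidean one. The variable names and values are hypothetical.

```python
import statistics

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(v - mean) / sd for v in column]

income = [20000.0, 40000.0, 60000.0]  # hypothetical interval input, large scale
age = [25.0, 35.0, 45.0]              # hypothetical interval input, small scale
z_income = standardize(income)
z_age = standardize(age)
print([round(v, 3) for v in z_income])
print([round(v, 3) for v in z_age])
```

After standardization both columns have mean 0 and standard deviation 1, so they contribute comparably to the distance between two cases.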

Variable selection in clustering

Interpreting clusters
Example: Customer Segmentation on Air Miles Reward Program

Cluster Validity
    How to evaluate the goodness of the resulting clusters
    Why do we want to evaluate them?
        To avoid finding patterns in noise
        To compare clustering algorithms
        To compare two sets of clusters
        To compare two clusters

Comment on Cluster Validity
    "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
    Algorithms for Clustering Data, Jain and Dubes