Clustering Analysis Basics

Ke Chen
Reading: [Ch. 7, EA], [5., KPM]

Outline
- Introduction
- Data Types and Representations
- Distance Measures
- Major Clustering Methodologies
- Summary

Introduction
- Cluster: a collection (group) of data objects that are
  - similar (close, related) to one another within the same group
  - dissimilar (distant, unrelated) to the objects in other groups
- Cluster analysis: finding similarities between data according to the characteristics underlying the data, and then grouping similar data objects into clusters
- Clustering analysis is unsupervised learning: apart from the objects' features, there are no predefined classes for a training data set
- Two general tasks: identifying the natural number of clusters, and properly grouping objects into sensible clusters
- Typical applications
  - as a stand-alone tool to gain insight into the data distribution
  - as a preprocessing step for other algorithms in intelligent systems

Introduction
Illustrative example: how many clusters?

Introduction
Illustrative example: are they in the same cluster?
- Clustering criterion: how animals bear their progeny (two clusters)
  - Blue shark, sheep, cat, dog
  - Lizard, sparrow, viper, seagull, gold fish, frog, red mullet
- Clustering criterion: existence of lungs (two clusters)
  - Gold fish, red mullet, blue shark
  - Sheep, sparrow, dog, cat, seagull, lizard, frog, viper

Introduction
Real applications: Google News

Introduction
Real applications: genetics analysis

Introduction
Real applications: emerging applications

Introduction
A technique demanded by many real-world tasks
- Bank/Internet security: fraud and spam pattern discovery
- Biology: taxonomy of living things, such as kingdom, phylum, class, order, family, genus and species
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Climate change: understanding the Earth's climate, finding atmospheric and oceanic patterns
- Finance: stock clustering analysis to uncover correlations underlying shares
- Image compression/segmentation: grouping coherent pixels
- Information retrieval/organisation: Google search, topic-based news
- Land use: identifying areas of similar land use in an earth observation database
- Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
- Social network mining: automatic discovery of special interest groups

Quiz

Data Types and Representations
Discrete vs. continuous
- Discrete feature
  - Has only a finite set of values, e.g., zip codes, rank, or the set of words in a collection of documents
  - Sometimes represented as an integer variable
- Continuous feature
  - Has real numbers as feature values, e.g., temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous features are typically represented as floating-point variables
Metric vs. non-metric
- Metric: a quantity measured with a natural physical unit, e.g., distance, weight
- Non-metric: an abstract quantity not associated with a physical unit, e.g., counts

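Data Types and Representations
Data representations
- Data matrix (object-by-feature structure)
  - n data points (objects) with p dimensions (features)
  - Two modes: rows and columns represent different entities

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Distance/dissimilarity matrix (object-by-object structure)
  - n data points, but it registers only the distances
  - A symmetric/triangular matrix
  - Single mode: rows and columns stand for the same entity (distance)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$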

Data Types and Representations
Examples
[Figure: scatter plot of the four points p1 to p4]

Data matrix
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix (i.e., dissimilarity matrix) for Euclidean distance
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

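Distance Measures
Minkowski distance (http://en.wikipedia.org/wiki/minkowski_distance)
For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

$$
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \quad p > 0
$$

- p = 1: Manhattan (city block) distance

$$
d(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n|
$$

- p = 2: Euclidean distance

$$
d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2}
$$

- Do not confuse p with n: p is the order of the distance, and every one of these distances is computed over all n features (dimensions).
- A generic measure for metric data: use an appropriate p in different applications.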
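The Minkowski definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `minkowski_distance` and the two sample vectors are assumptions chosen for the example.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if p <= 0:
        raise ValueError("p must be positive")
    # Sum |x_i - y_i|^p over all n features, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative vectors (not taken from the slides).
x = (0.0, 2.0)
y = (3.0, 6.0)
manhattan = minkowski_distance(x, y, 1)  # |0-3| + |2-6|
euclidean = minkowski_distance(x, y, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Note that p only selects the order of the distance; the sum always runs over every feature of the two vectors.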

Distance Measures
Example: Manhattan and Euclidean distances
[Figure: scatter plot of the four points p1 to p4]

Data matrix
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix for Manhattan distance (L1)
        p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

Distance matrix for Euclidean distance (L2)
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
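The two distance matrices in this example can be rebuilt with a short script. The sketch below assumes the four points are p1 = (0, 2), p2 = (2, 0), p3 = (3, 1) and p4 = (5, 1), coordinates that are consistent with the distance values in the example. The helper names are illustrative.

```python
import math

# Assumed coordinates for p1..p4 (consistent with the distances in the example).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def distance_matrix(pts, dist):
    """Object-by-object matrix: entry [r][c] is the distance between points r and c."""
    names = list(pts)
    return [[dist(pts[r], pts[c]) for c in names] for r in names]


l1 = distance_matrix(points, manhattan)
l2 = distance_matrix(points, euclidean)

for name, row in zip(points, l2):
    print(name, " ".join(f"{value:6.3f}" for value in row))
```

Both matrices are symmetric with a zero diagonal, which is why only the lower triangle needs to be stored.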

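Distance Measures
Cosine measure (similarity vs. distance) for non-metric data
For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

$$
\cos(x, y) = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}
$$

$$
d(x, y) = 1 - \cos(x, y)
$$

- Property: $0 \le d(x, y) \le 2$
- Non-metric vector objects: keywords in documents, gene features in micro-arrays
- Applications: information retrieval, biological taxonomy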

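Distance Measures
Example: cosine measure
$x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)$, $y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)$

$$
x_1 y_1 + \cdots + x_{10} y_{10} = 3 \cdot 1 + 2 \cdot 1 = 5
$$

$$
\sqrt{3^2 + 2^2 + 5^2 + 2^2} = \sqrt{42} \approx 6.48, \qquad \sqrt{1^2 + 1^2 + 2^2} = \sqrt{6} \approx 2.45
$$

$$
\cos(x, y) = \frac{5}{6.48 \times 2.45} \approx 0.315, \qquad d(x, y) = 1 - \cos(x, y) \approx 0.685
$$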
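The cosine example can be checked numerically. The following sketch assumes the two vectors are x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0) and y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2), which matches the norms 6.48 and 2.45 in the example. The function names are illustrative.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    # Distance derived from the similarity: d(x, y) = 1 - cos(x, y).
    return 1.0 - cosine_similarity(x, y)


# Assumed example vectors (reconstructed from the slide's intermediate values).
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
sim = cosine_similarity(x, y)
dist = cosine_distance(x, y)
print(f"cos = {sim:.3f}, d = {dist:.3f}")
```

Because the measure depends only on the angle between the vectors, scaling either vector by a positive constant leaves the result unchanged, which is one reason it suits keyword-count data.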

Distance Measures
Distance for binary features
- For binary features, the values can be converted into 1 or 0.
- Contingency table for binary feature vectors x and y:

             y = 1   y = 0
    x = 1      a       b
    x = 0      c       d

  - a: number of features that equal 1 for both x and y
  - b: number of features that equal 1 for x but 0 for y
  - c: number of features that equal 0 for x but 1 for y
  - d: number of features that equal 0 for both x and y

Distance Measures
Distance for binary features
- Distance for symmetric binary features
  - Both states are equally valuable and carry the same weight, i.e., there is no preference on which outcome should be coded as 1 or 0, e.g., gender

$$
d(x, y) = \frac{b + c}{a + b + c + d}
$$

- Distance for asymmetric binary features
  - The outcomes of the two states are not equally important, e.g., the positive and negative outcomes of a disease test; the rarest one is set to 1 and the other to 0

$$
d(x, y) = \frac{b + c}{a + b + c}
$$

Distance Measures
Example: distance for binary features

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

- Y: Yes, P: Positive, N: Negative/No
- Gender is symmetric (not used in real applications unless there is prior knowledge)
- The remaining features are asymmetric binary: Y and P are coded as 1, N as 0
- Contingency counts (a, b, c) and distances:
  - Jack vs. Mary: (2, 0, 1), d(Jack, Mary) = 1/3 = 0.33
  - Jack vs. Jim: (1, 1, 1), d(Jack, Jim) = 2/3 = 0.67
  - Jim vs. Mary: (1, 1, 2), d(Jim, Mary) = 3/4 = 0.75
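The three patient distances can be reproduced with a small script. This sketch assumes Y and P are coded as 1 and N as 0, and that gender is left out, as the example suggests. The function names and the dictionary layout are illustrative.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) of the 2x2 contingency table for 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def symmetric_binary_distance(x, y):
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_binary_distance(x, y):
    # Joint absences (d) are ignored because the 0 state carries little information.
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c) if (a + b + c) else 0.0


# Fever, Cough, Test-1..Test-4 with Y/P coded as 1 and N as 0 (assumed coding).
patients = {
    "Jack": (1, 0, 1, 0, 0, 0),
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim": (1, 1, 0, 0, 0, 0),
}
d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under this coding Jack and Mary come out as the most similar pair, while Jim and Mary are the least similar.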

Distance Measures
Distance for nominal features
- A generalization of the binary feature so that it can take more than two states/values, e.g., red, yellow, blue, green
- There are two methods to handle variables of such features.
  - Simple mis-matching

$$
d(x, y) = \frac{\text{number of mis-matching features between } x \text{ and } y}{\text{total number of features}}
$$

  - Converting it into binary variables: create a new binary feature for each of its nominal states. E.g., if a feature has three possible nominal states (red, yellow and blue), then this feature is expanded into three binary features accordingly. Thus, distance measures for binary features become applicable.

Distance Measures
Distance for nominal features (cont.)
Example: Play tennis

      Outlook   Temperature  Humidity  Wind
D1    Overcast  High         High      Strong
D2    Sunny     High         Normal    Strong

- Simple mis-matching: d(D1, D2) = 2/4 = 0.5
- Creating new binary features, using as many bits as the number of states each feature can take
  - Outlook {Sunny, Overcast, Rain}: (100, 010, 001)
  - Temperature {High, Mild, Cool}: (100, 010, 001)
  - Humidity {High, Normal}: (10, 01)
  - Wind {Strong, Weak}: (10, 01)
  - d(D1, D2) = 4/10 = 0.4
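Both ways of handling nominal features can be sketched as follows. The code assumes the binary expansion uses one bit per state and that the symmetric binary distance is then applied, an assumption that reproduces the 0.4 in the example. The helper names `simple_mismatch` and `one_hot` are illustrative.

```python
def simple_mismatch(x, y):
    """Fraction of positions at which the two vectors differ."""
    mismatches = sum(1 for u, v in zip(x, y) if u != v)
    return mismatches / len(x)


def one_hot(record, domains):
    """Expand each nominal value into one bit per possible state."""
    bits = []
    for value, states in zip(record, domains):
        bits.extend(1 if value == state else 0 for state in states)
    return bits


domains = [
    ("Sunny", "Overcast", "Rain"),  # Outlook
    ("High", "Mild", "Cool"),       # Temperature
    ("High", "Normal"),             # Humidity
    ("Strong", "Weak"),             # Wind
]
d1 = ("Overcast", "High", "High", "Strong")
d2 = ("Sunny", "High", "Normal", "Strong")

mismatch = simple_mismatch(d1, d2)
b1, b2 = one_hot(d1, domains), one_hot(d2, domains)
# On 0/1 vectors the mismatch fraction equals (b + c) / (a + b + c + d).
binary_distance = simple_mismatch(b1, b2)
print(mismatch, binary_distance)
```

Each mismatching nominal feature contributes two differing bits after expansion, so the two methods generally give different values for the same pair of objects.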

Major Clustering Methodologies
Partitioning methodology
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared distances cost
- Typical methods: K-means, K-medoids, CLARANS
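As an illustration of the partitioning idea, here is a minimal K-means sketch that alternates between assigning points to the nearest centroid and recomputing centroids, and reports the sum of squared distances cost. The data set, the `initial_centroids` parameter and the function names are assumptions made for the example, and the initial centroids are fixed so the outcome is reproducible.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    if initial_centroids is None:
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # partition unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = centroid(members)
    cost = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, cost


# Illustrative data: two well-separated groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids, cost = kmeans(data, k=2, initial_centroids=[data[0], data[3]])
print(labels, round(cost, 3))
```

K-means only finds a local minimum of the cost, so in practice it is common to run it from several initializations and keep the partition with the lowest cost.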

Major Clustering Methodologies
Hierarchical methodology
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: Agglomerative, Diana, Agnes, BIRCH, ROCK
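A bottom-up (agglomerative) decomposition can be sketched as below. The example uses single linkage, i.e., the distance between two clusters is the smallest distance between their members; this is one possible merging criterion chosen for illustration, not necessarily the one used by the methods listed. The points and function names are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges


# Illustrative points (indices 0..3).
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
clusters, merges = single_linkage(points, 2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, f"at distance {d:.3f}")
```

The merge history is what a dendrogram visualizes: cutting it at different distances yields partitions with different numbers of clusters.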

Major Clustering Methodologies
Density-based methodology
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue
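To make the density and connectivity idea concrete, here is a compact sketch in the spirit of DBSCAN. A point with at least `min_pts` neighbours within radius `eps` is treated as a core point, clusters grow by following core points, and anything unreachable is labelled as noise. The parameter values, the data and the simplified quadratic neighbour search are assumptions for illustration.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep growing
        cluster_id += 1
    return labels


# Illustrative data: two dense groups and one isolated point.
data = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2),
    (8.0, 8.0), (8.2, 7.9), (7.9, 8.3), (8.1, 8.2),
    (4.5, 4.5),
]
labels = dbscan(data, eps=0.6, min_pts=3)
print(labels)
```

Unlike the partitioning approach, the number of clusters is not given in advance; it follows from the density parameters.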

Major Clustering Methodologies
Model-based methodology
- A generative model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to those models
- Typical methods: Gaussian Mixture Model (GMM), COBWEB
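A model-based fit can be illustrated with a two-component, one-dimensional Gaussian mixture trained by expectation-maximization (EM). This is a simplified sketch: the restriction to two components and one dimension, the initialization at the minimum and maximum of the data, the variance floor `min_var` and the sample data are all assumptions made to keep the example short.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([value / total for value in joint])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)
    return weights, means, variances, resp


# Illustrative data drawn around 1 and 8.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 7.8, 8.0, 8.2, 7.9, 8.1]
weights, means, variances, resp = fit_gmm_1d(data)
hard_labels = [0 if r[0] >= r[1] else 1 for r in resp]
print([round(m, 3) for m in means], hard_labels)
```

The posteriors in `resp` give a soft assignment of each point to each component; taking the larger posterior turns it into a hard clustering.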

Major Clustering Methodologies
Spectral clustering methodology
- Convert the data set into a weighted graph (vertices, edges), perform spectral analysis on the weighted distance matrix for feature extraction, and then apply a simple clustering method in the new feature space
- Typical methods: Normalised-Cuts

Major Clustering Methodologies
Clustering ensemble methodology
- Combine multiple clustering results (different partitions) obtained via multiple clustering analyses
- Typical methods: Evidence accumulation, Graph-based combination
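One way to combine several partitions is sketched below, loosely following the evidence accumulation idea. A co-association matrix records how often each pair of objects is placed in the same cluster, and a consensus partition is read off by linking pairs that co-occur in more than half of the partitions. The 0.5 threshold, the connected-components step and the three input partitions are assumptions for illustration.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_clusters(matrix, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(matrix)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three illustrative partitions of six objects; cluster ids need not agree.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
matrix = co_association(partitions)
consensus = consensus_clusters(matrix)
print(consensus)
```

Working with pairwise co-occurrence avoids having to match cluster labels across partitions, since label values are arbitrary in each individual clustering.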

Summary
- Clustering analysis groups objects based on their (dis)similarity and has a broad range of applications.
- The distance (or similarity) metric appropriate to the underlying data types plays a critical role in all clustering analysis and distance-based learning (e.g., k-NN).
- Clustering algorithms can be categorized into partitioning, hierarchical, density-based, model-based, spectral clustering and clustering ensemble methodologies.
- There are still many open research issues in cluster analysis:
  - finding the number of natural clusters with arbitrary shapes
  - dealing with mixed types of features (correlated features but of different types)
  - handling massive amounts of data, i.e., Big Data (efficiency vs. performance)
  - coping with data of high dimensionality (many features for an object)
  - performance evaluation (especially when no ground truth is available)