Introduction to Clustering

Similar documents
CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING NEAREST NEIGHBOR ALGORITHM. Mingon Kang, PhD Computer Science, Kennesaw State University

Unsupervised Learning : Clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Clustering CS 550: Machine Learning

Clustering Basic Concepts and Algorithms 1

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Cluster Analysis: Basic Concepts and Algorithms

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

DATA MINING - 1DL105, 1Dl111. An introductory class in data mining

What is Cluster Analysis?

Lecture Notes for Chapter 8. Introduction to Data Mining

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering Tips and Tricks in 45 minutes (maybe more :)

Cluster analysis. Agnieszka Nowak - Brzezinska

Unsupervised Learning

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Proximity and Data Pre-processing

Cluster Analysis. Ying Shen, SSE, Tongji University

数据挖掘 Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

University of Florida CISE department Gator Engineering. Clustering Part 2

Cluster Analysis: Basic Concepts and Algorithms

Non-Bayesian Classifiers Part I: k-nearest Neighbor Classifier and Distance Functions

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

CSE 5243 INTRO. TO DATA MINING

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning and Clustering

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Gene Clustering & Classification

Unsupervised Learning and Clustering

ECLT 5810 Clustering

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

CSE 5243 INTRO. TO DATA MINING

ECLT 5810 Clustering

MIS2502: Data Analytics Clustering and Segmentation. Jing Gong

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis. Angela Montanari and Laura Anderlucci

Chapter DM:II. II. Cluster Analysis

Information Retrieval and Web Search Engines

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

CSE4334/5334 DATA MINING

Chuck Cartledge, PhD. 23 September 2017

What to come. There will be a few more topics we will cover on supervised learning

Machine Learning (BSMC-GA 4439) Wenke Liu

Online Social Networks and Media. Community detection

CHAPTER 4: CLUSTER ANALYSIS

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Clustering. Content. Typical Applications. Clustering: Unsupervised data mining technique

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

CSE 494/598 Lecture-11: Clustering & Classification

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

9 Classification: KNN and SVM

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Statistics 202: Statistical Aspects of Data Mining

Clustering in Data Mining

Machine Learning (BSMC-GA 4439) Wenke Liu

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Hierarchical clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Clustering: Overview and K-means algorithm

CSE 5243 INTRO. TO DATA MINING

Data Mining Algorithms

Oliver Dürr. Statistisches Data Mining (StDM) Woche 5. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

Hierarchical clustering

Introduction to Mobile Robotics

Pattern Recognition Lecture Sequential Clustering

Getting to Know Your Data

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Information Retrieval and Web Search Engines

11/2/2017 MIST.6060 Business Intelligence and Data Mining 1. Clustering. Two widely used distance metrics to measure the distance between two records

Clustering: Overview and K-means algorithm

CHAPTER-6 WEB USAGE MINING USING CLUSTERING

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Clustering. Lecture 6, 1/24/03 ECS289A

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Road map. Basic concepts

Table Of Contents: xix Foreword to Second Edition

Clustering Part 4 DBSCAN

Based on Raymond J. Mooney s slides

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Contents. Foreword to Second Edition. Acknowledgments About the Authors

MIA - Master on Artificial Intelligence


Clustering & Classification (chapter 15)

Clustering. Bruno Martins. 1 st Semester 2012/2013

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

Transcription:

Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar)

What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

What is not Cluster Analysis? Supervised classification Have class label information Simple segmentation Dividing students into different registration groups alphabetically, by last name Results of a query Groupings are a result of an external specification

Notion of a Cluster can be Ambiguous

Types of Clustering A clustering is a set of clusters Important distinction between hierarchical and partitional sets of clusters Partitional Clustering A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset Hierarchical clustering A set of nested clusters organized as a hierarchical tree

Partitional Clustering

Hierarchical Clustering

Other Distinctions Between Sets of Clusters Exclusive versus non-exclusive In non-exclusive clusterings, points may belong to multiple clusters. Can represent multiple classes or border points Fuzzy versus non-fuzzy In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 Weights must sum to 1 Probabilistic clustering has similar characteristics Partial versus complete In some cases, we only want to cluster some of the data Heterogeneous versus homogeneous Cluster of widely different sizes, shapes, and densities

Types of Clusters Center-based clusters Contiguous clusters

Types of Clusters: Center-Based Center-based A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of a cluster, than to the center of any other cluster The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster

Types of Clusters: Contiguity-Based Contiguous Cluster (Nearest neighbor or Transitive) A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are. Is higher when objects are more alike. Often falls in the range [0,1] Dissimilarity Numerical measure of how different are two data objects Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

Dissimilarity between Data Objects: Euclidean Distance Euclidean Distance dist = σn k=1 (p k q k ) 2 Where n is the number of dimensions (attributes) and p k and q k are, respectively, the kth attributes (components) or data objects p and q. Standardization is necessary, if scales differ.

Euclidean Distance

Minkowski Distance Minkowski Distance is a generalization of Euclidean Distance n dist = k=1 p k q k r Where r is a parameter, n is the number of dimensions (attributes) and p k and q k are, respectively, the kth attributes (components) or data objects p and q 1/r

Minkowski Distance: Examples r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors r = 2. Euclidean distance r. supremum (L max norm, L norm) distance. This is the maximum difference between any component of the vectors Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

Cosine Similarity If d 1 and d 2 are two document vectors cos(d 1, d 2 )=(d 1 d 2 )/ d 1 d 2, Where indicates vector dot product and d is the length of vector d. Example: d 1 = 3 2 0 5 0 0 0 2 0 0 d 2 = 1 0 0 0 0 0 0 1 0 2 d 1 d 2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 d 1 = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5= (42)^0.5= 6.481 d 1 = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5= (6)^0.5= 2.245 cos( d1, d2) =.3150

Cosine Similarity 1: exactly the same cos(d 1, d 2 ) = 0: orthogonal 1: exactly opposite