# Chapter 6: Cluster Analysis

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each individual. Often in clustering the items are called objects. We wish to create clusters such that the objects within each cluster are similar and objects in different clusters are dissimilar. The dissimilarity between any two objects is typically quantified using a distance measure (like Euclidean distance). Cluster analysis is a type of unsupervised classification, because we do not know the nature of the groups (or the number of groups, typically) before we classify the objects into clusters. STAT J530 Page 1

2 Applications of Cluster Analysis In marketing, researchers attempt to find distinct clusters of the consumer population so that several distinct marketing strategies can be used for the clusters. In ecology, scientists classify plants or animals into various groups based on some measurable characteristics. Researchers in genetics may separate genes into several classes based on their expression ratios measured at different time points. STAT J530 Page 2

3 The Need for Clustering Algorithms We assume there are n objects that we wish we separate into a small number (say, k) of clusters, where k < n. If we know the true number of clusters k ahead of time, then the number of ways to partition the n objects into k clusters is a Stirling number of the second kind. Example: There are 48, 004, 081, 105, 038, 305 ways to separate n = 30 objects into k = 4 clusters. STAT J530 Page 3

4 The Need for Clustering Algorithms (Continued) If we don t know the value of k, the possible number of partitions is even more massive. Example: There are 35, 742, 549, 198, 872, 617, 291, 353, 508, 656, 626, 642, 567 possible partitions of n = 42 objects if we let the number of clusters k vary. (This is called the Bell number for n.) Clearly even a computer cannot investigate all the possible clustering partitions to see which is best. We need intelligently designed algorithms that will search among the best possible partitions relatively quickly. STAT J530 Page 4

5 Types of Clustering Algorithms There are three major classes of clustering methods from oldest to newest, they are: 1. Hierarchical methods 2. Partitioning methods 3. Model-based methods Hierarchical methods cluster the data in a series of n steps, typically joining observations together step by step to form clusters. Partitioning methods first determine k, and then typically attempt to find the partition into k clusters that optimizes some objective function of the data. Model-based clustering takes a statistical approach, formulating a model that categorizes the data into subpopulations and using maximum likelihood to estimate the model parameters. STAT J530 Page 5

6 Hierarchical Clustering Agglomerative hierarchical clustering begins with n clusters, each containing a single object. At each step, the two clusters that are closest are merged together. So as the steps iterate, there are n clusters, then n 1 clusters, then n 2, etc. By the last step, there is 1 cluster containing all n objects. The R function hclust will perform a variety of hierarchical clustering methods. STAT J530 Page 6

7 Defining Closeness of Clusters The key in a hierarchical clustering algorithm is specifying how to determine the two closest clusters at any given step. For the first step, it s easy: Join the two objects whose (Euclidean?) distance is smallest. After that, we have a choice: Do we join two individual objects together, or merge an object into a cluster that already has multiple objects? Intercluster dissimilarity is typically defined in one of three ways, which give rise to linkage methods. STAT J530 Page 7

8 Linkage Methods in Hierarchical Clustering The single linkage algorithm, at each step, joins the clusters whose minimum distance between objects is smallest, i.e., joins the clusters A and B with the smallest d AB = min i A,j B (d ij) Single linkage clustering is sometimes called nearest neighbor clustering. The complete linkage algorithm, at each step, joins the clusters whose maximum distance between objects is smallest, i.e., joins the clusters A and B with the smallest d AB = max i A,j B (d ij) Complete linkage clustering is sometimes called farthest neighbor clustering. STAT J530 Page 8

9 Linkage Methods in Hierarchical Clustering (Continued) The average linkage algorithm, at each step, joins the clusters whose average distance between objects is smallest, i.e., joins the clusters A and B with the smallest d AB = 1 n A n B d ij i A j B where n A and n B are the number of objects in clusters A and B, respectively. See Figure 6.4 in the Everitt textbook. STAT J530 Page 9

10 Dendrograms A hierarchical algorithm actually produces not one partition of the data, but lots of partitions. There is a clustering partition for each step 1, 2,...,n. The series of mergings can be represented at a glance by a treelike structure called a dendrogram. To get a single k-cluster solution, we would cut the dendrogram horizontally at a point that would produce k groups (The R function cutree can do this). It is strongly recommended to examine the full dendrogram first before determining where to cut it. A natural set of clusters may be apparent from a glance at the dendrogram. STAT J530 Page 10

11 Standardization of Observations If the variables in our data set are of different types or are measured on very different scales, then some variables may play an inappropriately dominant role in the clustering process. In this case, it is recommended to standardize the variables in some way before clustering the objects. Possible standardization approaches: 1. Divide each column by its sample standard deviation, so that all variables have standard deviation Divide each variable by its sample range (max min); Milligan and Cooper (1988) found that this approach best preserved the clustering structure. 3. Convert data to z-scores by (for each variable) subtracting the sample mean and then dividing by the sample standard deviation a common option in clustering software packages. STAT J530 Page 11

12 Pros and Cons of Hierarchical Clustering An advantage of hierarchical clustering methods is their computational speed for small data sets. Another advantage is that the dendrogram gives a picture of the clustering solution for a variety of choices of k (the number of clusters) at once. On the other hand, a major disadvantage is that once two clusters have been joined, they can never be split apart later in the algorithm, even if such a move would improve the clustering. The so-called partitioning methods of cluster analysis do not have this restriction. In addition, hierarchical methods can be less efficient than partitioning methods for large data sets, when n is much greater than k. STAT J530 Page 12

### Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

### Hierarchical clustering

Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### 5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

### Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

### Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

### Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

### Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

### 10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

### Midterm Examination CS 540-2: Introduction to Artificial Intelligence

Midterm Examination CS 54-2: Introduction to Artificial Intelligence March 9, 217 LAST NAME: FIRST NAME: Problem Score Max Score 1 15 2 17 3 12 4 6 5 12 6 14 7 15 8 9 Total 1 1 of 1 Question 1. [15] State

### DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

### K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

### Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

### University of Florida CISE department Gator Engineering. Clustering Part 2

Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

### Today s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ

Clustering CS498 Today s lecture Clustering and unsupervised learning Hierarchical clustering K-means, K-medoids, VQ Unsupervised learning Supervised learning Use labeled data to do something smart What

### Clustering & Classification (chapter 15)

Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

### Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

### Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

### Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes

### Finding Clusters 1 / 60

Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

### Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

### Clustering Part 3. Hierarchical Clustering

Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points

### Lecture 4 Hierarchical clustering

CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying

### CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 16

CS434a/541a: Pattern Recognition Prof. Olga Veksler Lecture 16 Today Continue Clustering Last Time Flat Clustring Today Hierarchical Clustering Divisive Agglomerative Applications of Clustering Hierarchical

### Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

### 9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

### INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

### 3. Cluster analysis Overview

Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

### Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

### Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015

Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs

### DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

### PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

### STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral

### Clustering Algorithms for general similarity measures

Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

### CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

### Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

### Hierarchical Clustering

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

### Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your

### Oliver Dürr. Statistisches Data Mining (StDM) Woche 5. Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften

Statistisches Data Mining (StDM) Woche 5 Oliver Dürr Institut für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften oliver.duerr@zhaw.ch Winterthur, 17 Oktober 2017 1 Multitasking

### An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two

### 4. Cluster Analysis. Francesc J. Ferri. Dept. d Informàtica. Universitat de València. Febrer F.J. Ferri (Univ. València) AIRF 2/ / 1

Anàlisi d Imatges i Reconeixement de Formes Image Analysis and Pattern Recognition:. Cluster Analysis Francesc J. Ferri Dept. d Informàtica. Universitat de València Febrer 8 F.J. Ferri (Univ. València)

### CS570: Introduction to Data Mining

CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

### Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the

### K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Xiaohui Xie University of California, Irvine K-means and Hierarchical Clustering p.1/18 Clustering Given n data points X = {x 1, x 2,, x n }. Clustering is the partitioning

### Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

### Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

### Unsupervised Learning

Unsupervised Learning A review of clustering and other exploratory data analysis methods HST.951J: Medical Decision Support Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision

### CS573 Data Privacy and Security. Li Xiong

CS573 Data Privacy and Security Anonymizationmethods Li Xiong Today Clustering based anonymization(cont) Permutation based anonymization Other privacy principles Microaggregation/Clustering Two steps:

### Clustering. Lecture 6, 1/24/03 ECS289A

Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

### MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

### Stats fest Multivariate analysis. Multivariate analyses. Aims. Multivariate analyses. Objects. Variables

Stats fest 7 Multivariate analysis murray.logan@sci.monash.edu.au Multivariate analyses ims Data reduction Reduce large numbers of variables into a smaller number that adequately summarize the patterns

### Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

### Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

### Data Clustering. Algorithmic Thinking Luay Nakhleh Department of Computer Science Rice University

Data Clustering Algorithmic Thinking Luay Nakhleh Department of Computer Science Rice University Data clustering is the task of partitioning a set of objects into groups such that the similarity of objects

### Data mining techniques for actuaries: an overview

Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of

### Unsupervised Learning

Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo 6.873/HST.951 Medical Decision

### Statistical Methods for Data Mining

Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Unsupervised Learning Unsupervised vs Supervised Learning: Most of this course focuses on supervised learning

### Hierarchical Clustering

Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set animal of documents. vertebrate invertebrate fish reptile amphib. mammal worm insect crustacean One approach: recursive

### Clustering k-mean clustering

Clustering k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: partition genes into distinct sets with high homogeneity and high separation Clustering (unsupervised)

### Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline

### Objective of clustering

Objective of clustering Discover structures and patterns in high-dimensional data. Group data with similar patterns together. This reduces the complexity and facilitates interpretation. Expression level

### How do microarrays work

Lecture 3 (continued) Alvis Brazma European Bioinformatics Institute How do microarrays work condition mrna cdna hybridise to microarray condition Sample RNA extract labelled acid acid acid nucleic acid

### A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

### A k-means Clustering Algorithm on Numeric Data

Volume 117 No. 7 2017, 157-164 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A k-means Clustering Algorithm on Numeric Data P.Praveen 1 B.Rama 2

### Stat 321: Transposable Data Clustering

Stat 321: Transposable Data Clustering Art B. Owen Stanford Statistics Art B. Owen (Stanford Statistics) Clustering 1 / 27 Clustering Given n objects with d attributes, place them (the objects) into groups.

### Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,

### Algorithms for Bioinformatics

Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

### Association Rule Mining and Clustering

Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

### COMP33111: Tutorial and lab exercise 7

COMP33111: Tutorial and lab exercise 7 Guide answers for Part 1: Understanding clustering 1. Explain the main differences between classification and clustering. main differences should include being unsupervised

### Unsupervised Learning

Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

### Clustering Part 2. A Partitional Clustering

Universit of Florida CISE department Gator Engineering Clustering Part Dr. Sanja Ranka Professor Computer and Information Science and Engineering Universit of Florida, Gainesville Universit of Florida

### slide courtesy of D. Yarowsky Splitting Words a.k.a. Word Sense Disambiguation Intro to NLP - J. Eisner 1

Splitting Words a.k.a. Word Sense Disambiguation 600.465 - Intro to NLP - J. Eisner Representing Word as Vector Could average over many occurrences of the word... Each word type has a different vector

### Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

### 21 The Singular Value Decomposition; Clustering

The Singular Value Decomposition; Clustering 125 21 The Singular Value Decomposition; Clustering The Singular Value Decomposition (SVD) [and its Application to PCA] Problems: Computing X > X takes (nd

### Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

### Clustering and Visualisation of Data

Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

### Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

### Clustering. Bruno Martins. 1 st Semester 2012/2013

Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

### Data Preprocessing. Komate AMPHAWAN

Data Preprocessing Komate AMPHAWAN 1 Data cleaning (data cleansing) Attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 2 Missing value

### Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

### A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING Susan Tony Thomas PG. Student Pillai Institute of Information Technology, Engineering, Media Studies & Research New Panvel-410206 ABSTRACT Data

### CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

CS 231A CA Session: Problem Set 4 Review Kevin Chen May 13, 2016 PS4 Outline Problem 1: Viewpoint estimation Problem 2: Segmentation Meanshift segmentation Normalized cut Problem 1: Viewpoint Estimation

### 11/17/2009 Comp 590/Comp Fall

Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 Problem Set #5 will be available tonight 11/17/2009 Comp 590/Comp 790-90 Fall 2009 1 Clique Graphs A clique is a graph with every vertex connected

### Overview. Non-Parametrics Models Definitions KNN. Ensemble Methods Definitions, Examples Random Forests. Clustering. k-means Clustering 2 / 8

Tutorial 3 1 / 8 Overview Non-Parametrics Models Definitions KNN Ensemble Methods Definitions, Examples Random Forests Clustering Definitions, Examples k-means Clustering 2 / 8 Non-Parametrics Models Definitions

### CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof. Ruiz Problem

### Computer Vision 5 Segmentation by Clustering

Computer Vision 5 Segmentation by Clustering MAP-I Doctoral Programme Miguel Tavares Coimbra Outline Introduction Applications Simple clustering K-means clustering Graph-theoretic clustering Acknowledgements:

### CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

### Cluster Analysis: Basic Concepts and Algorithms

7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

### Unsupervised Learning

Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

### 8. Clustering: Pattern Classification by Distance Functions

CEE 6: Digital Image Processing Topic : Clustering/Unsupervised Classification - W. Philpot, Cornell University, January 0. Clustering: Pattern Classification by Distance Functions The premise in clustering

### Spectral Classification

Spectral Classification Spectral Classification Supervised versus Unsupervised Classification n Unsupervised Classes are determined by the computer. Also referred to as clustering n Supervised Classes

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Problems 1 and 5 were graded by Amin Sorkhei, Problems 2 and 3 by Johannes Verwijnen and Problem 4 by Jyrki Kivinen. Entropy(D) = Gini(D) = 1

Problems and were graded by Amin Sorkhei, Problems and 3 by Johannes Verwijnen and Problem by Jyrki Kivinen.. [ points] (a) Gini index and Entropy are impurity measures which can be used in order to measure

### A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

### CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

CART Classification and Regression Trees Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology CART CART stands for Classification And Regression Trees.

### CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA by Esra Atescelik In partial fulfillment of the requirements for the degree of Master of Computer Science Department of Computer Science VU University Amsterdam