# Information Retrieval and Web Search Engines

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig

2 Homework Exercise 4.1 In language models one can estimate a model s probability parameters by the relative term frequencies as they appear in the respective document. What is the problem with this approach and how can it be fixed? 2

3 Homework Exercise 4.2 Using an example of your own, explain the difference between relevance and pertinence. 3

4 Homework Pertinence vs. relevance Relevance = Topical or subject relevance Aboutness Pertinence = Cognitive relevance or pertinence Cognitive state of knowledge, information need Example: A computer scientist is looking for technical information about Google s PageRank algorithm by asking the query pagerank algorithm Google s own technology overview (generally understandable) definitely is relevant But it does not fits the scientist s information need Information Retrieval and Web Search Engines Wolf-Tilo Balke with Joachim Selke Technische Universität Braunschweig 4

5 Homework Exercise 4.3 Last week we have seen an example information need as used in TREC ( endangered species ). Think of an information need of your own and describe it using the same structure (title, description, narrative). 5

6 Homework Title: Reactions to the Bush shoe incident Description: In December 2008, an Iraqi journalist threw his shoes at US president George W. Bush, who was giving a press conference then. How did international media comment on this event? Narrative: Any media commentary itself is relevant, as well as any document summarizing reactions of the media. Articles reporting only about the event itself are not relevant. Information Retrieval and Web Search Engines Wolf-Tilo Balke with Joachim Selke Technische Universität Braunschweig 6

7 Homework Exercise 4.4 What is the pooling method? 7

8 Homework Exercise 4.5 What is more important for an information retrieval system: precision or recall? 8

9 The Cluster Hypothesis The Cluster Hypothesis states: Closely associated documents tend to be relevant to the same requests Closely associated usually means similar (with respect to some kind of similarity measure) R R R R R 9

10 The Cluster Hypothesis (2) Experimental validation of the Cluster Hypothesis? Proved to be problematic Seems to be highly collection-specific Also depends on: Representation of documents Similarity measures Queries But: It sounds reasonable and holds often enough In addition, real-world collections usually have a clear cluster structure Can we exploit clustering for information retrieval? 10

11 Lecture 7: Document Clustering 1. Applications 2. Problem Statement 3. Flat Clustering 4. Hierarchical Clustering 11

12 Search Result Clustering In IR, results are typically presented by means of ranked lists What about clusters? 12

13 Search Result Clustering (2) Advantages: Scanning a few coherent groups often is easier than scanning many individual documents The cluster structure gives you an impression of what the result set looks like Disadvantages: Finding informative labels for clusters is difficult Good clusterings are hard to find (example on the next slide) 13

14 Search Result Clustering (3) Cluster structure found for query apple : 14

15 Search Result Clustering (4) Ideally, a clustering should look like this: 15

16 Scatter Gather Scatter-Gather is a navigational user interface Search without typing! Idea: 1. Cluster the whole document collection into a small number of clusters 2. Users formulate queries by selecting one or more of these clusters 3. Selected clusters are merged and clustered again 4. Return to step 2 if not finished 16

17 Scatter Gather (2) Example from (Manning et al., 2008): Collection: New York Times news stories 17

18 Collection Clustering Sometimes it makes sense to cluster the whole document collection hierarchically: 18

19 Collection Clustering (2) Collection clustering is especially useful if The collections contains only a small number of topics Each topic is covered by many documents in a similar fashion Advantages: Enables exploratory browsing Can be helpful even if the user is unsure about which query terms to use There s no clustering here! But dmoz is an example of using a global hierarchy for navigation 19

20 Collection Clustering (3) Collection clustering can also be used to extend small result lists If there is only a small number of documents matching the query, add similar documents from the clusters containing the matching documents Matching documents 20

21 Collection Clustering (4) Also interesting: Use collection clustering to speed-up retrieval Idea: Cluster the whole collection Represent each cluster by a (possibly virtual) document, e.g., a typical or average document contained in the cluster Speed-up query processing by first finding the clusters having best-matching representatives and then doing retrieval only on the documents in these clusters 1. Find best-matching clusters 2. Build the set of documents contained in these clusters 3. Find best-matching documents 21

22 Some Examples Formerly called Clusty Open source!

23 Lecture 7: Document Clustering 1. Applications 2. Problem Statement 3. Flat Clustering 4. Hierarchical Clustering 23

24 Issues in Clustering Clustering is more difficult than you might think How many clusters? Flat or hierarchical? Hard or soft? What s a good clustering? How to find it? 24

25 How Many Clusters? Let k denote the number of clusters from now on Basically, there are two different approaches regarding the choice of k Define k before searching for a clustering, then only consider clusterings having exactly k clusters Do not define a fixed k, i.e., let the number of clusters depend on some measure of clustering quality to be defined The right choice depends on the problem you want to solve 25

26 Flat or Hierarchical? Flat clustering: 26

27 Flat or Hierarchical? (2) Hierarchical:

28 Flat or Hierarchical? (3) Hierarchical: 28

29 Hard or Soft? Hard clustering: Every document is assigned to exactly one cluster (at the lowest level, if the clustering is hierarchical) More common and easier to do Soft clustering: A document s assignment is a distribution over all clusters (fuzzy, probabilistic, or something else) Better suited for creating browsable hierarchies (a knife can be a weapon as well as a tool) Example: LSI (k clusters/topics) 29

30 Abstract Problem Statement Given: A collection of n documents The type of clustering to be found (see previous slides) An objective function f that assigns a number to any possible clustering of the collection Task: Find a clustering that minimizes the objective function (or maximizes, respectively) Let s exclude a nasty special case: We don t want empty clusters! 30

31 What s a Good Clustering? The overall quality of a clustering is measured by f Usually, f is closely related to a measure of distance between documents (e.g. cosine similarity) Popular primary goals: Low inter-cluster similarity, i.e. documents from different clusters should be dissimilar High intra-cluster similarity, i.e. all documents within a cluster should be mutually similar 31

32 What s a Good Clustering? (2) Inter-cluster similarity and intra-cluster similarity: BAD: GOOD: 32

33 What s a Good Clustering? (3) Common secondary goals: Avoid very small clusters Avoid very large clusters All these goals are internal (structural) criteria External criterion: Compare the clustering against a hand-crafted reference clustering (later) 33

34 How to Find a Good Clustering? Naïve approach: Try all possible clusterings Choose the one minimizing/maximizing f Hmm, how many different clusterings are there? There are S(n, k) distinct hard, flat clusterings of a n-element set into exactly k clusters S(, ) are the Stirling numbers of the second kind Roughly: S(n, k) is exponential in n The naïve approach fails miserably Let s use some heuristics 34

35 Lecture 7: Document Clustering 1. Applications 2. Problem Statement 3. Flat Clustering 4. Hierarchical Clustering 35

36 K-Means Clustering K-means clustering: The most important (hard) flat clustering algorithm, i.e., every cluster is a set of documents The number of clusters k is defined in advance Documents usually are represented as unit vectors Objective: Minimize the average distance from cluster centers! Let s work out a more precise definition of the objective function 36

37 K-Means Clustering (2) Centroid of a cluster: Let A = {d 1,, d m } be a document cluster (a set of unit vectors) The centroid of A is defined as: RSS of a cluster: Again, let A be a document cluster The residual sum of squares (RSS) of A is defined as: 37

38 K-Means Clustering (3) In k-means clustering, the quality of the clustering into (disjoint) clusters A 1,, A k is measured by: K-means clustering tries to minimize this value Minimizing RSS(A 1,, A k ) is equivalent to minimizing the average squared distance between each document and its cluster s centroid 38

39 K-Means Clustering (4) The k-means algorithm (aka Lloyd s algorithm): 1. Randomly select k documents as seeds (= initial centroids) 2. Create k empty clusters 3. Assign exactly one centroid to each cluster 4. Iterate over the whole document collection: Assign each document to the cluster with the nearest centroid 5. Recompute cluster centroids based on contained documents 6. Check if clustering is good enough ; return to (2) if not What s good enough? Small change since previous iteration Maximum number of iterations reached RSS small enough 39

40 K-Means Clustering (5) Example from (Manning et al., 2008): 1. Randomly select k = 2 seeds (initial centroids) 40

41 K-Means Clustering (6) 4. Assign each document to the cluster having the nearest centroid 41

42 K-Means Clustering (7) 5. Recompute centroids 42

43 K-Means Clustering (8) Result after 9 iterations: 43

44 K-Means Clustering (9) Movement of centroids in 9 iterations: 44

45 Variants and Extensions of K-Means K-means clustering is a popular representative of the class of partitional clustering algorithms Start with an initial guess for k clusters, update cluster structure iteratively Similar approaches: K-medoids: Use document lying closest to the centroid instead of centroid Fuzzy c-means: Similar to k-means but soft clustering Model-based clustering: Assume that data has been generated randomly around k unknown source points ; find the k points that most likely have generated the observed data (maximum likelihood) 45

46 K-Means Clustering 46

47 Lecture 7: Document Clustering 1. Applications 2. Problem Statement 3. Flat Clustering 4. Hierarchical Clustering 47

48 Hierarchical Clustering Two major approaches: Agglomerative (bottom-up): Start with individual documents as initial clustering, create parent clusters by merging Divisive (top-down): Start with an initial large cluster containing all documents, create child clusters by splitting 48

49 Agglomerative Clustering Assume that we have some measure of similarity between clusters A simple agglomerative clustering algorithm: 1. For each document: Create a new cluster containing only this document 2. Compute the similarity between every pair of clusters (if there are m clusters, we get an m msimilarity matrix) 3. Merge the two clusters having maximal similarity 4. If there is more than one cluster left, go back to (2) 49

50 Agglomerative Clustering (2) Dendrogram from (Manning et al., 2008): Documents from Reuters-RCV1 collection Cosine similarity Cosine similarity of Fed holds and Fed to keep is around

51 Agglomerative Clustering (3) Get non-binary splits by cutting the dendrogram at prespecified levels of similarity Gives 17 clusters Gives a cluster of size 3 51

52 Similarity of Clusters We just assumed that we can measure similarity between clusters But how to do it? Typically, measures of cluster similarity are derived from some measure of document similarity (e.g. Euclidean distance) There are several popular definitions of cluster similarity: Single link Complete link Centroid Group average 52

53 Similarity of Clusters (2) Single-link clustering: Similarity of two clusters = similarity of their most similar members Problem: Single-link clustering often produces long chains 53

54 Similarity of Clusters (3) Complete-link clustering: Similarity of two clusters = similarity of their most dissimilar members Problem: Complete-link clustering is sensitive to outliers 54

55 Similarity of Clusters (4) Centroid clustering: Similarity of two clusters = average inter-similarity (= similarity of centroids) Problem: Similarity to other clusters can improve by merging (leads to overlaps in dendrogram) 55

56 Similarity of Clusters (5) Group average clustering: Similarity of two clusters = average of all similarities Problem: Computation is expensive 56

57 Divisive Clustering How does divisive clustering work? We won t go into details here But there is a simple method: Use a flat clustering algorithm as a subroutine to split up clusters (e.g. 2-means clustering) Again, there might be constraints on clustering quality: Avoid very small clusters Avoid splitting into clusters of extremely different cardinalities 57

58 Hierarchical Clustering 58

59 Evaluation Finally, how to evaluate clusterings? We already used internal criteria (e.g. the total centroid distance for k-means clustering) Compare against a manually built reference clustering involves external criteria Example: The Rand index Look at all pairs of documents! What percentage of pairs are in correct relationship? True positives: The pair is correctly contained in the same cluster True negatives: The pair is correctly contained in different clusters False positives: The pair is wrongly contained in the same cluster False negatives: The pair is wrongly contained in different clusters 59

60 Next Lecture Relevance Feedback Classification 60

### Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

### INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

### PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

### Introduction to Information Retrieval

Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap

### University of Florida CISE department Gator Engineering. Clustering Part 2

Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

### Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

### Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical

### Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

### Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

### Lesson 3. Prof. Enza Messina

Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

### Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

### 5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

### Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

### 10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

### Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

### Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

### Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

### CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

### Clustering Algorithms for general similarity measures

Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

### Expectation Maximization!

Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University and http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Steps in Clustering Select Features

### Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

### Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

### Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

### Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

### Association Rule Mining and Clustering

Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

### Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

### Clustering. Bruno Martins. 1 st Semester 2012/2013

Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

### 2. Background. 2.1 Clustering

2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning

### Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

### Hierarchical clustering

Hierarchical clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Description Produces a set of nested clusters organized as a hierarchical tree. Can be visualized

### Clustering Part 3. Hierarchical Clustering

Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points

### Summary. 4. Indexes. 4.0 Indexes. 4.1 Tree Based Indexes. 4.0 Indexes. 19-Nov-10. Last week: This week:

Summary Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Last week: Logical Model: Cubes,

### Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

### Hierarchical Clustering

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

### Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

### Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

### 21 The Singular Value Decomposition; Clustering

The Singular Value Decomposition; Clustering 125 21 The Singular Value Decomposition; Clustering The Singular Value Decomposition (SVD) [and its Application to PCA] Problems: Computing X > X takes (nd

### DD2475 Information Retrieval Lecture 10: Clustering. Document Clustering. Recap: Classification. Today

Sec.14.1! Recap: Classification DD2475 Information Retrieval Lecture 10: Clustering Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 Data points have labels Classification task: Finding good separators

### Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

### Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

### PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

### Introduction to Clustering

Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

### Finding Clusters 1 / 60

Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

### Objective of clustering

Objective of clustering Discover structures and patterns in high-dimensional data. Group data with similar patterns together. This reduces the complexity and facilitates interpretation. Expression level

### Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

### 3. Cluster analysis Overview

Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

### K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

### DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

### Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-

### Hierarchical Clustering

Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set animal of documents. vertebrate invertebrate fish reptile amphib. mammal worm insect crustacean One approach: recursive

### What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

// What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

### Today s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ

Clustering CS498 Today s lecture Clustering and unsupervised learning Hierarchical clustering K-means, K-medoids, VQ Unsupervised learning Supervised learning Use labeled data to do something smart What

### Single link clustering: 11/7: Lecture 18. Clustering Heuristics 1

Graphs and Networks Page /7: Lecture 8. Clustering Heuristics Wednesday, November 8, 26 8:49 AM Today we will talk about clustering and partitioning in graphs, and sometimes in data sets. Partitioning

### CSE 494 Project C. Garrett Wolf

CSE 494 Project C Garrett Wolf Introduction The main purpose of this project task was for us to implement the simple k-means and buckshot clustering algorithms. Once implemented, we were asked to vary

### MIA - Master on Artificial Intelligence

MIA - Master on Artificial Intelligence 1 Hierarchical Non-hierarchical Evaluation 1 Hierarchical Non-hierarchical Evaluation The Concept of, proximity, affinity, distance, difference, divergence We use

### Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Lecture No 08 Cluster Analysis Naeem Ahmed Email: naeemmahoto@gmailcom Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Outline

### Hierarchical Document Clustering

Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

### MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

### IBL and clustering. Relationship of IBL with CBR

IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

### Clustering and Visualisation of Data

Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

### Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

### Unsupervised Learning

Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

### Introduction to Computer Science

DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

### CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

### Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the

### Kapitel 4: Clustering

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

### UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter Introduction to Data Mining b Tan, Steinbach, Kumar What is Cluster Analsis? Finding groups of objects such that the

### Data Clustering. Algorithmic Thinking Luay Nakhleh Department of Computer Science Rice University

Data Clustering Algorithmic Thinking Luay Nakhleh Department of Computer Science Rice University Data clustering is the task of partitioning a set of objects into groups such that the similarity of objects

### CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA

CLUSTER ANALYSIS APPLIED TO EUROPEANA DATA by Esra Atescelik In partial fulfillment of the requirements for the degree of Master of Computer Science Department of Computer Science VU University Amsterdam

### k-means A classical clustering algorithm

k-means A classical clustering algorithm Devert Alexandre School of Software Engineering of USTC 30 November 2012 Slide 1/65 Table of Contents 1 Introduction 2 Visual demo Step by step Voronoi diagrams

### Clustering & Bootstrapping

Clustering & Bootstrapping Jelena Prokić University of Groningen The Netherlands March 25, 2009 Groningen Overview What is clustering? Various clustering algorithms Bootstrapping Application in dialectometry

### Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification

### COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.

COSC 6339 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 217 Clustering Clustering is a technique for finding similarity groups in data, called

### Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

### Knowledge Discovery in Databases

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

### I211: Information infrastructure II

Data Mining: Classifier Evaluation I211: Information infrastructure II 3-nearest neighbor labeled data find class labels for the 4 data points 1 0 0 6 0 0 0 5 17 1.7 1 1 4 1 7.1 1 1 1 0.4 1 2 1 3.0 0 0.1

### Chapter 18 Clustering

Chapter 18 Clustering CS4811 Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University 1 Outline Examples Centroid approaches Hierarchical approaches 2 Example:

### CSE 494/598 Lecture-11: Clustering & Classification

CSE 494/598 Lecture-11: Clustering & Classification LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **With permission, content adapted from last year s slides and from Intro to DM dmbook@cs.umn.edu

### STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral

### Supervised vs unsupervised clustering

Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

### Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

### Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015

Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs

### Measure of Distance. We wish to define the distance between two objects Distance metric between points:

Measure of Distance We wish to define the distance between two objects Distance metric between points: Euclidean distance (EUC) Manhattan distance (MAN) Pearson sample correlation (COR) Angle distance

### Cluster Analysis: Basic Concepts and Algorithms

7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/004 What

### COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called

### Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Data Mining: Classifier Evaluation CSCI-B490 Seminar in Computer Science (Data Mining) Predictor Evaluation 1. Question: how good is our algorithm? how will we estimate its performance? 2. Question: what

### Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

### Weka ( )

Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

### Lecture 4 Hierarchical clustering

CSE : Unsupervised learning Spring 00 Lecture Hierarchical clustering. Multiple levels of granularity So far we ve talked about the k-center, k-means, and k-medoid problems, all of which involve pre-specifying

### Clust Clus e t ring 2 Nov

Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most

### Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme

Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some