# Document Clustering: Comparison of Similarity Measures

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014

2 Outline 1 Introduction The Problem and the Motivation Approach 2 Methodology Document Representation Similarity Measures Clustering Algorithms Evaluation 3 Related Work Past Results References 4 The End

3 The Problem and the Motivation What is document clustering and why is it important? Document clustering is a method to classify the documents into a small number of coherent groups or clusters by using appropriate similarity measures. Document clustering plays a vital role in document organization, topic extraction and information retrieval. With the ever increasing number of high dimensional datasets over the internet, the need for efficient clustering algorithms has risen.

4 The Problem and the Motivation How can we solve this problem? A lot of these documents share a large proportion of lexically equivalent terms. We will exploit this feature by using a bag of words" model to represent the content of a document. We will group similar" documents together to form a coherent cluster. This similarity" can be defined in various ways. In the vector space, it is closely related to the notion of distance which can be defined in several ways. We will try to test which similarity measure performs the best across various domains of text articles in English and Hindi.

5 Approach How will we compare these similarity measures? We will first represent our document using the bag of words and the vector space model. Then we will cluster documents (now high dimensional vectors) by k-means and hierarchical clustering techniques using different similarity measures. Documents we will use are from varied domains from English and Hindi. We will then compare the performance of each similarity measure across the different kinds of documents. Entropy and Purity measure will be used for the purposes of evaluation.

6 Document Representation Bag of Words: Model Each word is assumed to be independent and the order in which they occur is immaterial. Each word corresponds to a dimension in the resulting data space. Each document then becomes a vector consisting of non-negative values on each dimension. Widely used in information retrieval and text mining.

7 Document Representation Bag of Words: Example Here are two simple text documents: Document 1 I don t know what I am saying. Document 2 I can t wait for this to get over.

8 Document Representation Bag of Words: Example Now, based on these two documents, a dictionary is constructed: I":1 don t"":2 know":3 what":4 am":5 saying":6 can t":7 wait":8 for":9 this":10 to":11 get":12 over":13

9 Document Representation Bag of Words: Example The dictionary has 13 distinct words. Using the indices of the dictionary, the document is represented by a 13-entry vector. Document 1 [2,1,1,1,1,1,0,0,0,0,0,0,0] Document 2 [1,0,0,0,0,0,1,1,1,1,1,1,1] Each entry of the vectors refers to count of the corresponding entry in the dictionary.

10 Document Representation Representing the document formally Let D = {d 1,..., d n } be a set of documents and T = {t 1,..., t m } be the set of distinct terms occurring in D. The document s representation in the vector space is given by an m-dimensional vector t d, t d = (tf (d, t 1 ),..., tf (d, t m )) where tf (d, t) denotes the frequency of the term t T in document d D.

11 Document Representation Pre-processing First, we will remove stop words (non-descriptive such as a, and, are and do). We will use the one implemented in the Weka machine learning workbench, which contains 527 stop words. Second, words will be stemmed using Porter?s suffix-stripping algorithm, so that words with different endings will be mapped into a single word. For example production, produce, produces and product will be mapped to the stem produc. Third, we considered the effect of including infrequent terms in the document representation on the overall clustering performance and decided to discard words that appear with less than a given threshold frequency. We select the top 2000 words ranked by their weights and use them in our experiments.

12 Document Representation TFIDF Some terms that appear frequently in a small number of documents but rarely in the other documents tend to be more relevant and specific for that particular group of documents, and therefore more useful for finding similar documents. To capture these terms, we transform the basic term frequencies tf (d, t) into the tfidf (term frequency and inversed document frequency) weighting scheme. Tfidf weighs the frequency of a term t in a document d with a factor that discounts its importance with its appearances in the whole document collection, which is defined as: tfidf (d, t) = tf (d, t) log( D df (t) ) Here df (t) is the number of documents in which term t appears.

13 Similarity Measures Metric A metric space (X, d) consists of a set X on which is defined a distance function which assigns to each pair of points of X a distance between them, and which satisfies the following four axioms: 1 d(x, y) 0 for all points x and y of X; 2 d(x, y) = d(y, x) for all points x and y of X; 3 d(x, z) d(x, y) + d(y, z) for all points x, y and z of X; 4 d(x, y) = 0 if and only if the points x and y coincide.

14 Similarity Measures Euclidean Distance Standard metric for geometric problems. Given two documents d a and d b represented by their term vectors t a and t b respectively, the Euclidean distance is defined as m D E ( t a, t b ) = ( w t,a w t,b 2 ) 1/2 t=1 where T = {t 1,..., t m } is the term set and the weights, w t,a = tfidf (d a, t).

15 Similarity Measures Cosine Similarity Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Given two documents t a and t b, the Cosine similarity is defined as t SIM C ( t a, t b ) = a t b t a t b where t a and t b are m-dimensional vectors over the term set T. Non-negative and bounded between [0, 1].

16 Similarity Measures Jaccard Coefficient The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. Given two documents t a and t b, the Jaccard Coefficient is defined as SIM J ( t a, t b ) = t a t b t a 2 + t b 2 t a t b where t a and t b are m-dimensional vectors over the term set T. Non-negative and bounded between [0, 1].

17 Similarity Measures Pearson Correlation Coefficient The Pearson Correlation coefficient is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and -1 inclusive, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. Given two documents t a and t b, the Pearson Correlation Coefficient is defined as SIM P ( t a, t b ) = m m t=1 w t,a w t,b TF a TF b [m m t=1 w t,a 2 TF a 2 ][m m t=1 w t,b 2 TF b 2] where t a and t b are m-dimensional vectors over the term set T and TF a = m t=1 w t,a, w t,a = tfidf (d a, t).

18 Similarity Measures Manhattan Distance The Manhattan Distance is the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the differences of their corresponding components. Given two documents t a and t b, the Manhattan Distance between them is defined as SIM M ( t a, t b ) = m w t,a w t,b t=1 where t a and t b are m-dimensional vectors over the term set T and w t,a = tfidf (d a, t).

19 Similarity Measures Chebychev Distance The Chebychev distance between two points is the maximum distance between the points in any single dimension. Given two documents t a and t b, the Chebychev Distance is defined as SIM Ch ( t a, t b ) = max w t,a w t,b t where t a and t b are m-dimensional vectors over the term set T and w t,a = tfidf (d a, t).

20 Clustering Algorithms Hierarchical Algorithms Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: Agglomerative (bottom up): This method starts with every single object (gene or sample) in a single cluster. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters by satisfying some similarity criteria, until all of the data is in one cluster. O(n 3 ) Divisive(top down): This method starts with a single cluster containing all objects, and then successively splits resulting clusters until only clusters of individual objects remain. O(n 2 )

21 Clustering Algorithms Hierarchical Clustering: In action slides/clustering.ppt Figure: Single Link

22 Clustering Algorithms Hierarchical Clustering: End Result Figure: Hierarchical Clustering

23 Clustering Algorithms k-means Algorithm Partitions observations into clusters resulting in a partitioning of the data space into Voronoi cells. First pick k, the number of clusters. Initialize clusters by picking one point per cluster.for instance, pick one point at random, then k 1 other points, each as far away as possible from the previous points. Try different values of k and choose the based on the average distance to the centroid.

24 Clustering Algorithms k-means: Populating Clusters For each point, place it in the cluster whose current centroid it is nearest. After all points are assigned, fix the centroids of the k clusters. Reassign all points to their closest centroid.

25 Clustering Algorithms k-means: In action Text-Documents-Clustering-using-K-Means-Algorithm Figure: k-means process

26 Clustering Algorithms Effective choice for k Effective heuristics for seed selection include: Excluding outliers from the seed set Trying out multiple starting points and choosing the clustering with the lowest cost; and Obtaining seeds from another method such as hierarchical clustering.

27 Evaluation Entropy Entropy measures the distribution of categories in a given cluster. The entropy of a cluster C i with size n i is defined as E(C i ) = 1 logc k h=1 n h i n i log( nh i n i ) where c is the total number of categories in the data set and n h i is the number of documents from the h th class that were assigned to this cluster C i.

28 Evaluation Purity Purity provides provides insight into the coherence of a cluster i.e. the degree to which a cluster contains documents from a single category. Purity for a given cluster C i of size n i is given by: P(C i ) = 1 n i max h (n h i ) where max h (ni h ) is the number of documents that are from the dominant category in cluster C i and ni h represents the number of documents from the cluster assigned to category h.

29 Evaluation Datasets We will use the following datasets for our project: 20news BBC/BBC Sport Wikipedia FIRE Classic r0

30 Past Results Anna Huang Table 1: Purity Results Data Euclidean Cosine Jaccard Pearson KLD 20news classic hitech re tr wap webkb

31 Past Results Anna Huang Table 2: Entropy Results Data Euclidean Cosine Jaccard Pearson KLD 20news classic hitech re tr wap webkb

32 References References Anna Huang. Similarity measures for document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pages 49 56, D. Arthur and S. Vassilvitskii. k-means++ the advantages of careful seeding. In Symposium on Discrete Algorithms, Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), 2004.

33 That s all folks! Thank You! Questions? Suggestions?

### Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

### An Appropriate Similarity Measure for K-Means Algorithm in Clustering Web Documents S Jaiganesh 1 Dr. P. Jaganathan 2

IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 02, 2015 ISSN (online): 2321-0613 An Appropriate Similarity Measure for K-Means Algorithm in Clustering Web Documents S

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Lesson 3. Prof. Enza Messina

Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

### DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

### CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

### Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

### Clustering. Distance Measures Hierarchical Clustering. k -Means Algorithms

Clustering Distance Measures Hierarchical Clustering k -Means Algorithms 1 The Problem of Clustering Given a set of points, with a notion of distance between points, group the points into some number of

### Clustering Part 4 DBSCAN

Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

### Information Retrieval. (M&S Ch 15)

Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

### An Improvement of Centroid-Based Classification Algorithm for Text Classification

An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,

### Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Flat Clustering Slides are mostly from Hinrich Schütze March 7, 07 / 79 Overview Recap Clustering: Introduction 3 Clustering in IR 4 K-means 5 Evaluation 6 How many clusters? / 79 Outline Recap Clustering:

### Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

### MIA - Master on Artificial Intelligence

MIA - Master on Artificial Intelligence 1 Hierarchical Non-hierarchical Evaluation 1 Hierarchical Non-hierarchical Evaluation The Concept of, proximity, affinity, distance, difference, divergence We use

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

### Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

### Finding Clusters 1 / 60

Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

### Clustering and Visualisation of Data

Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

### Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

### Kapitel 4: Clustering

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

### CS570: Introduction to Data Mining

CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

### University of Florida CISE department Gator Engineering. Clustering Part 2

Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

### Introduction to Information Retrieval

Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap

### PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv IIR 6: Flat Clustering Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University, Brno Center

### Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

### CSE 494 Project C. Garrett Wolf

CSE 494 Project C Garrett Wolf Introduction The main purpose of this project task was for us to implement the simple k-means and buckshot clustering algorithms. Once implemented, we were asked to vary

### CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

### University of Florida CISE department Gator Engineering. Clustering Part 5

Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

### Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

### Sergei Silvestrov, Christopher Engström. January 29, 2013

Sergei Silvestrov, January 29, 2013 L2: t Todays lecture:. L2: t Todays lecture:. measurements. L2: t Todays lecture:. measurements. s. L2: t Todays lecture:. measurements. s.. First we would like to define

### INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

### Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

### [7.3, EA], [9.1, CMB]

K-means Clustering Ke Chen Reading: [7.3, EA], [9.1, CMB] Outline Introduction K-means Algorithm Example How K-means partitions? K-means Demo Relevant Issues Application: Cell Neulei Detection Summary

### Boolean Model. Hongning Wang

Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

### Measure of Distance. We wish to define the distance between two objects Distance metric between points:

Measure of Distance We wish to define the distance between two objects Distance metric between points: Euclidean distance (EUC) Manhattan distance (MAN) Pearson sample correlation (COR) Angle distance

### Association Pattern Mining. Lijun Zhang

Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

### CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

### Approximation Algorithms for NP-Hard Clustering Problems

Approximation Algorithms for NP-Hard Clustering Problems Ramgopal R. Mettu Dissertation Advisor: Greg Plaxton Department of Computer Science University of Texas at Austin What Is Clustering? The goal of

### Association Rule Mining and Clustering

Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

### This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

This lecture: IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring 1 Ch. 6 Ranked retrieval Thus far, our queries have all

### Nearest Neighbor Search by Branch and Bound

Nearest Neighbor Search by Branch and Bound Algorithmic Problems Around the Web #2 Yury Lifshits http://yury.name CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html 1 / 30 Outline 1 Short Intro to

### Mining Web Data. Lijun Zhang

Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

### Clustering. Lecture 6, 1/24/03 ECS289A

Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

### Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

### CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

### Data Mining Clustering

Data Mining Clustering Jingpeng Li 1 of 34 Supervised Learning F(x): true function (usually not known) D: training sample (x, F(x)) 57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 0

Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

### Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

### Clustering (COSC 416) Nazli Goharian. Document Clustering.

Clustering (COSC 416) Nazli Goharian nazli@cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

### Clustering. Bruno Martins. 1 st Semester 2012/2013

Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

### Data Clustering Algorithms

SAINT MARY S COLLEGE OF CALIFORNIA DEPARTMENT OF MATHEMATICS SENIOR ESSAY Data Clustering Algorithms Mentor: Author: Bryce CLOKE Dr. Charles HAMAKER Supervisor: Dr. Kathryn PORTER May 22, 2017 Abstract:

### Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

### 11/2/2017 MIST.6060 Business Intelligence and Data Mining 1. Clustering. Two widely used distance metrics to measure the distance between two records

11/2/2017 MIST.6060 Business Intelligence and Data Mining 1 An Example Clustering X 2 X 1 Objective of Clustering The objective of clustering is to group the data into clusters such that the records within

### Section 7.12: Similarity. By: Ralucca Gera, NPS

Section 7.12: Similarity By: Ralucca Gera, NPS Motivation We talked about global properties Average degree, average clustering, ave path length We talked about local properties: Some node centralities

### CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

CS 231A CA Session: Problem Set 4 Review Kevin Chen May 13, 2016 PS4 Outline Problem 1: Viewpoint estimation Problem 2: Segmentation Meanshift segmentation Normalized cut Problem 1: Viewpoint Estimation

### Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

### 2. Design Methodology

Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily

### Unsupervised Learning

Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

### Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

### Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

### Cluster Analysis: Basic Concepts and Algorithms

7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015

Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs

### A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING Susan Tony Thomas PG. Student Pillai Institute of Information Technology, Engineering, Media Studies & Research New Panvel-410206 ABSTRACT Data

### Keywords: hierarchical clustering, traditional similarity metrics, potential based similarity metrics.

www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 14027-14032 Potential based similarity metrics for implementing hierarchical clustering

### Mining Web Data. Lijun Zhang

Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

### New Approach for K-mean and K-medoids Algorithm

New Approach for K-mean and K-medoids Algorithm Abhishek Patel Department of Information & Technology, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India Purnima Singh Department of

### Introduction to Computer Science

DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

### Unsupervised Learning

Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo 6.873/HST.951 Medical Decision

### Clustering: Centroid-Based Partitioning

Clustering: Centroid-Based Partitioning Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 29 Y Tao Clustering: Centroid-Based Partitioning In this lecture, we

### Clustering Part 2. A Partitional Clustering

Universit of Florida CISE department Gator Engineering Clustering Part Dr. Sanja Ranka Professor Computer and Information Science and Engineering Universit of Florida, Gainesville Universit of Florida

### 9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

### Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical

### Clust Clus e t ring 2 Nov

Clustering 2 Nov 3 2008 HAC Algorithm Start t with all objects in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, c i and c j, that are most

### Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

### Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

### Large-scale visual recognition Efficient matching

Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!

### 7. Nearest neighbors. Learning objectives. Centre for Computational Biology, Mines ParisTech

Foundations of Machine Learning CentraleSupélec Paris Fall 2016 7. Nearest neighbors Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning

### Chapter 18 Clustering

Chapter 18 Clustering CS4811 Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University 1 Outline Examples Centroid approaches Hierarchical approaches 2 Example:

### Hierarchical Clustering Algorithms for Document Datasets

Data Mining and Knowledge Discovery, 10, 141 168, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. Hierarchical Clustering Algorithms for Document Datasets YING ZHAO

### Mixture models and clustering

1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction

### 5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering

### 6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

### 3. Cluster analysis Overview

Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

### Clustering Algorithms for general similarity measures

Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

### Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

### Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

### Chuck Cartledge, PhD. 23 September 2017

Introduction Definitions Numerical data Hands-on Q&A Conclusion References Files Big Data: Data Analysis Boot Camp Agglomerative Clustering Chuck Cartledge, PhD 23 September 2017 1/30 Table of contents

### Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

### Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

### Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

### Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

### Hierarchical Document Clustering

Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

### CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

### 9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

Introduction ti to K-means Algorithm Wenan Li (Emil Li) Sep. 5, 9 Outline Introduction to Clustering Analsis K-means Algorithm Description Eample of K-means Algorithm Other Issues of K-means Algorithm

### Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

### Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

### PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

### Auto-assemblage for Suffix Tree Clustering

Auto-assemblage for Suffix Tree Clustering Pushplata, Mr Ram Chatterjee Abstract Due to explosive growth of extracting the information from large repository of data, to get effective results, clustering