Project C Report (Clustering Algorithms)

Similar documents
CSE494 Information Retrieval Project C Report

CSE 494 Project C. Garrett Wolf

Machine Learning: Symbol-based

Clustering (COSC 416) Nazli Goharian. Document Clustering.

A Comparison of Document Clustering Techniques

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Exploiting Parallelism to Support Scalable Hierarchical Clustering

Clustering Algorithms for general similarity measures

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Behavioral Data Mining. Lecture 18 Clustering

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning. BANANAS APPLES

CSE 5243 INTRO. TO DATA MINING

Cluster Analysis. Ying Shen, SSE, Tongji University

Unsupervised Learning Partitioning Methods

Clustering: Overview and K-means algorithm

What is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology

K-Means Clustering 3/3/17

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval and Web Search Engines

CSE 5243 INTRO. TO DATA MINING

Information Retrieval and Organisation

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

4. Ad-hoc I: Hierarchical clustering

Information Integration of Partially Labeled Data

Clustering: Overview and K-means algorithm

Unsupervised Learning Hierarchical Methods

Chapter 18 Clustering

Clustering Part 3. Hierarchical Clustering

Clustering. 2 Nov

Expectation Maximization!

Lesson 3. Prof. Enza Messina

Hierarchical Clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering Lecture 3: Hierarchical Methods

Clustering COMS 4771

University of Florida CISE department Gator Engineering. Clustering Part 2

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Unsupervised Learning

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Parallel Algorithms K means Clustering

Machine Learning. Unsupervised Learning. Manfred Huber

Information Retrieval and Web Search Engines

Supervised and Unsupervised Learning (II)

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

Based on Raymond J. Mooney's slides

CS 2750: Machine Learning. Clustering. Prof. Adriana Kovashka University of Pittsburgh January 17, 2017

Clustering Part 4 DBSCAN

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

Unsupervised Learning : Clustering

Data Mining. SPSS Clementine k-means Algorithm. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

CS570: Introduction to Data Mining

Hierarchical Clustering 4/5/17

Introduction to Information Retrieval

Text clustering based on a divide and merge strategy

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

Hierarchical Document Clustering

TRANSACTIONAL CLUSTERING. Anna Monreale University of Pisa

Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Clustering CS 550: Machine Learning

Exploratory Analysis: Clustering

Big Data Analytics! Special Topics for Computer Science CSE CSE Feb 9

CSE 5243 INTRO. TO DATA MINING

K-Means. Algorithmic Methods of Data Mining M. Sc. Data Science Sapienza University of Rome. Carlos Castillo

PV211: Introduction to Information Retrieval

Hierarchical Clustering

Clustering and Visualisation of Data

Introduction to Data Mining

Clustering Algorithms. Margareta Ackerman

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

University of Florida CISE department Gator Engineering. Clustering Part 4

Cluster Analysis: Basic Concepts and Algorithms

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

A Review of K-mean Algorithm

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Clustering. K-means clustering

Clustering Color/Intensity. Group together pixels of similar color/intensity.

INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.

L9: Hierarchical Clustering

Cluster Analysis. Summer School on Geocomputation. 27 June July 2011 Vysoké Pole

Hierarchical clustering

Clustering in Data Mining

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Data Analytics for. Transmission Expansion Planning. Andrés Ramos. January Estadística II. Transmission Expansion Planning GITI/GITT

Image Analysis Lecture Segmentation. Idar Dyrdal

Transcription:

Project C Report (Clustering Algorithms)

This report describes clustering algorithms. It mainly compares two clustering algorithms, K-Means and Buckshot, along with some variations of them. Clustering is performed on the results obtained by Vector Space similarity.

Overview of Clustering Algorithms

For a given query the results may fall into multiple categories. For example, for the query "Michael Jordan" some pages may be about the statistician and some about the basketball player, and the user is typically interested in only one of these. Clustering groups the retrieved documents so that the user can quickly find the relevant documents of interest. A good clustering has high intra-cluster similarity and low inter-cluster similarity. Intra-cluster similarity can be defined as

    S_a = \frac{1}{|C_i|} \sum_{d \in C_i} \mathrm{sim}(d, c_i)

where c_i is the centroid of the i-th cluster C_i. Inter-cluster similarity can be defined as

    S_e = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \mathrm{sim}(c_i, c_j)

where c_i and c_j are the centroids of the i-th and j-th clusters respectively.
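As a concrete illustration, here is a minimal Python sketch of these two measures, assuming the documents and centroids are unit-normalized NumPy vectors so that the dot product is the cosine similarity (the function and variable names are illustrative, not from the project code):

    import numpy as np

    def intra_cluster_similarity(cluster_docs, centroid):
        # S_a: average similarity of a cluster's documents to its centroid.
        return float(np.mean([d @ centroid for d in cluster_docs]))

    def inter_cluster_similarity(centroids):
        # S_e: average pairwise similarity between distinct centroids.
        k = len(centroids)
        total = sum(centroids[i] @ centroids[j]
                    for i in range(k) for j in range(k) if i != j)
        return total / (k * (k - 1))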

K-Means Algorithm

In the K-Means algorithm, K initial centroids are selected randomly. Each document is then assigned, using Vector Space similarity, to the cluster whose centroid it is most similar to, and the centroids are recomputed. All the documents are then reassigned to the closest centroids, and this process repeats until there is no change in the clusters between successive iterations. The algorithm is described below.

    findClusters(top N docs, k) {
        for i = 1 to k {
            do {
                randomly select a doc
            } while (doc already selected)
            cluster[i].centroid = selected doc
        }
        set change = true
        while (change) {
            assignClusters()
            computeCentroids()
        }
    }

    assignClusters() {
        set change = false
        for i = 1 to N {
            set maxSim = 0
            for j = 1 to k {
                if sim(cluster[j].centroid, doc i) > maxSim {
                    closestCluster = j
                    maxSim = sim(cluster[j].centroid, doc i)
                }
            }
            assign doc i to closestCluster
            if closestCluster != previousCluster then set change = true
        }
    }

    computeCentroids() {
        for i = 1 to k {
            set cluster[i].centroid = 0
            for each doc in cluster i {
                cluster[i].centroid += doc
            }
            cluster[i].centroid /= number of docs in cluster i
        }
    }

Let I, d, k and n be the number of iterations, the number of dimensions of a vector (average number of terms), the number of clusters and the number of documents in the base set respectively. Then the time complexity of assignClusters is O(nkd) and the time complexity of computeCentroids is O(nd), so for I iterations the time complexity is O(k + I(nkd + nd)) = O(Idkn). Storing the document vectors of the top n documents requires O(nt) space, where t is the average number of terms in a document; storing the centroid vectors requires O(kt) space, where k is the number of centroids; and storing each document's cluster assignment requires O(n) space. So the total space complexity is O(nt + kt + n) = O(nt), which makes this algorithm low in cost.
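For concreteness, the loop above can be sketched in Python roughly as follows, assuming docs is a NumPy matrix whose rows are unit-normalized document vectors and using the dot product as the similarity (a sketch, not the exact project implementation):

    import numpy as np

    def k_means(docs, k, rng=np.random.default_rng(0), max_iters=100):
        # Seed the centroids with k distinct randomly selected documents.
        centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()
        assignment = np.full(len(docs), -1)
        for _ in range(max_iters):              # guard against non-termination
            # assignClusters: each document joins its most similar centroid.
            new_assignment = (docs @ centroids.T).argmax(axis=1)
            if np.array_equal(new_assignment, assignment):
                break                           # no change -> converged
            assignment = new_assignment
            # computeCentroids: each centroid becomes the mean of its documents.
            for j in range(k):
                members = docs[assignment == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return assignment, centroids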

The performance of K-Means, however, depends on the initial choice of centroids. The results can change dramatically across multiple runs of K-Means because the initial centroids are picked randomly. Various extensions have been applied to K-Means to improve its performance.

Buckshot Algorithm

The Buckshot algorithm tries to improve on K-Means by choosing better initial cluster centroids. For that it uses the Hierarchical Agglomerative Clustering (HAC) algorithm. HAC starts by treating each point as a separate cluster and repeatedly combines the pair of clusters with the maximum similarity, where similarity between clusters is measured as the group average. The algorithm stops when the number of clusters left equals the number of required clusters, and the centroids of these clusters are taken as the initial centroids for the K-Means algorithm. For the Buckshot algorithm, HAC is performed on a random sample of √(kn) documents.

    findInitialSeeds(top N docs, k) {
        create a clusterVector containing √(kn) docs randomly selected from the top n docs
        while (clusterVector.size > k) {
            maxSim = 0
            for i = 1 to clusterVector.size {
                for j = i + 1 to clusterVector.size {
                    if sim(clusterVector[i], clusterVector[j]) > maxSim {
                        cluster1 = clusterVector[i]
                        cluster2 = clusterVector[j]
                        maxSim = sim(clusterVector[i], clusterVector[j])
                    }
                }
            }
            remove cluster1 and cluster2 from clusterVector
            combine the docs of cluster1 and cluster2 into a new cluster
            add the new cluster to clusterVector
        }
        return clusterVector
    }

Let n be the number of documents to be clustered, k the number of clusters and d the number of dimensions. Then the time complexity of HAC is O(√(kn) + (√(kn) − k) · kn · d) = O((kn)^(3/2) d), since each of the √(kn) − k merge steps scans all O(kn) pairs of clusters at a cost of d per similarity computation.
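A Python sketch of this seed-finding step might look as follows (group-average similarity is computed naively over all document pairs; the names and structure are illustrative). Buckshot then runs the K-Means loop from the previous sketch starting from the returned seeds instead of random documents:

    import numpy as np

    def find_initial_seeds(docs, k, rng=np.random.default_rng(0)):
        # Run HAC on a random sample of sqrt(k * n) documents.
        m = int(np.sqrt(k * len(docs)))
        clusters = [[i] for i in rng.choice(len(docs), size=m, replace=False)]
        while len(clusters) > k:
            best, best_sim = None, -np.inf
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # Group-average similarity between clusters a and b.
                    sim = np.mean([docs[i] @ docs[j]
                                   for i in clusters[a] for j in clusters[b]])
                    if sim > best_sim:
                        best, best_sim = (a, b), sim
            a, b = best
            clusters[a] += clusters[b]   # merge the most similar pair
            del clusters[b]              # safe: a < b, so index a is unchanged
        # The centroid of each remaining cluster seeds K-Means.
        return np.array([docs[idx].mean(axis=0) for idx in clusters])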

The time complexity of the Buckshot algorithm is therefore O((kn)^(3/2) d + Idkn). The space complexity of HAC is O(nt + kt + √(kn)), where n is the total number of documents, t is the average number of terms per document and k is the number of clusters. So the space complexity of the Buckshot algorithm (HAC followed by K-Means) is O(nt + kt + √(kn) + nt + kt + n) = O(nt).

Bisecting K-Means Algorithm

This algorithm tries to improve quality over K-Means. It starts with one large cluster containing all the data points and divides it into two clusters. The K-Means algorithm is run multiple times to find the split that produces the maximum intra-cluster similarity. The largest cluster is then picked to split further (the cluster with the minimum intra-cluster similarity could be chosen instead). This splitting step is run k − 1 times to obtain k clusters. Bisecting K-Means performs better than regular K-Means because it produces clusters of almost uniform size, whereas in regular K-Means the cluster sizes can differ notably. Since small clusters tend to have high intra-cluster similarity while large clusters have very low intra-cluster similarity, uneven sizes drag the overall intra-cluster similarity down. The algorithm is described below.

    findClusters(top N docs, k) {
        initialize clusterVector with one cluster containing all documents
        for i = 1 to k - 1 {
            find the largest cluster in clusterVector
            remove that cluster from clusterVector
            set maxSim = 0
            do ITERATIONS times {
                for j = 1 to 2 {
                    do {
                        randomly select a doc
                    } while (doc already selected)
                    cluster[j].centroid = selected doc
                }
                set change = true
                while (change) {
                    assignClusters()
                    computeCentroids()
                }
                sim = intra-cluster similarity of the resulting split
                if (sim > maxSim) {
                    maxSim = sim
                    store the two clusters in cluster1 and cluster2
                }
            }
            add cluster1 and cluster2 to clusterVector
        }
    }

    assignClusters() {
        set change = false
        for i = 1 to N {
            set maxSim = 0
            for j = 1 to 2 {
                if sim(cluster[j].centroid, doc i) > maxSim {
                    closestCluster = j
                    maxSim = sim(cluster[j].centroid, doc i)
                }
            }
            assign doc i to closestCluster
            if closestCluster != previousCluster then set change = true
        }
    }

    computeCentroids() {
        for i = 1 to 2 {
            set cluster[i].centroid = 0
            for each doc in cluster i {
                cluster[i].centroid += doc
            }
            cluster[i].centroid /= number of docs in cluster i
        }
    }

Let n be the number of documents to be clustered, k the number of clusters, i the number of iterations made on each run and d the number of dimensions. Then the time complexity of assignClusters is O(nd), and the time complexity of computeCentroids is O(nd) if the average cluster size is O(n). So the overall time complexity of bisecting K-Means is O((k − 1) · i · (nd + nd)) = O(nkid), which is linear in the number of documents. Storing the document vectors of the top n documents requires O(nt) space, where t is the average number of terms in a document; storing the centroid vectors requires O(kt) space, where k is the number of centroids; and storing each document's cluster assignment requires O(n) space. So the total space complexity is O(nt + kt + n) = O(nt).
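Reusing the k_means() sketch above, bisecting K-Means can be outlined in Python as follows (a sketch under the same unit-vector assumptions; degenerate empty splits are not handled):

    import numpy as np

    def bisecting_k_means(docs, k, trials=5, rng=np.random.default_rng(0)):
        clusters = [list(range(len(docs)))]      # one cluster holding everything
        while len(clusters) < k:
            # Pick the largest cluster and remove it from the list.
            big = max(range(len(clusters)), key=lambda c: len(clusters[c]))
            members = clusters.pop(big)
            sub = docs[members]
            best_split, best_sim = None, -np.inf
            for _ in range(trials):              # keep the best of several 2-way splits
                assignment, centroids = k_means(sub, 2, rng)
                # Overall intra-cluster similarity of this split.
                sim = np.mean([sub[i] @ centroids[assignment[i]]
                               for i in range(len(sub))])
                if sim > best_sim:
                    best_split, best_sim = assignment, sim
            for j in range(2):
                clusters.append([members[i] for i in range(len(sub))
                                 if best_split[i] == j])
        return clusters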

This algorithm is lower in cost than the Buckshot algorithm but still gives comparable performance.

Comparison of algorithms using similarity measures

To compare these algorithms, the inter-cluster and intra-cluster similarity measures defined above are used. The algorithms were run for k = 3 to k = 10 for two queries, "information retrieval" and "parking decal".

[Figure: Intra-cluster similarity of K-Means, Buckshot and bisecting K-Means on the "information retrieval" query, for k = 3 to 10]

[Figure: Inter-cluster similarity of K-Means, Buckshot and bisecting K-Means on the "information retrieval" query, for k = 3 to 10]
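The numbers behind such plots can be reproduced in outline with a small harness like the one below, reusing k_means() and inter_cluster_similarity() from the earlier sketches and using random unit vectors as stand-ins for the project's tf-idf document vectors (purely illustrative data):

    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(100, 50))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize rows

    for k in range(3, 11):
        assignment, centroids = k_means(docs, k)
        s_a = np.mean([docs[i] @ centroids[assignment[i]]
                       for i in range(len(docs))])
        s_e = inter_cluster_similarity(centroids)
        print(f"k={k}: intra={s_a:.3f} inter={s_e:.3f}")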

One algorithm performs better than another if it has higher intra-cluster similarity and lower inter-cluster similarity. The graphs show that the Buckshot algorithm consistently performs better than the other algorithms on both the intra-cluster and inter-cluster similarity measures. For k = 3 and k = 6, bisecting K-Means performs slightly better than Buckshot on the intra-cluster similarity measure, but overall Buckshot outperforms the other algorithms.

[Figure: Intra-cluster similarity of K-Means, Buckshot and bisecting K-Means on the "parking decal" query, for k = 3 to 10]

[Figure: Inter-cluster similarity of K-Means, Buckshot and bisecting K-Means on the "parking decal" query, for k = 3 to 10]

The first graph makes it clear that Buckshot and bisecting K-Means perform almost equally in terms of intra-cluster similarity. As already mentioned, the Buckshot algorithm performs well because it selects better initial centroids, and bisecting K-Means performs well compared to K-Means because it produces clusters of almost uniform size.

Effect of varying the number of clusters

To analyze the effect of varying the number of clusters, the intra-cluster and inter-cluster similarity results of the K-Means algorithm are presented for two queries, "software engineering" and "computer science".

[Figure: Intra-cluster similarity of K-Means for the "software engineering" and "computer science" queries, for k = 3 to 10]

[Figure: Inter-cluster similarity of K-Means for the "software engineering" and "computer science" queries, for k = 3 to 10]

The first graph shows that intra-cluster similarity increases as the number of clusters increases, and the second graph shows that inter-cluster similarity decreases as the number of clusters increases. This is because as the number of clusters increases the clusters become smaller, and smaller clusters are tighter, so intra-cluster similarity rises; for the same reason inter-cluster similarity falls.

Analysis of clusters with respect to natural category

information retrieval

Cluster 1
    www.public.asu.edu%%~hdavulcu%%cse591_semantic_web_mining.html
    www.public.asu.edu%%~hdavulcu%%cse591_semantic_web_mining2.html
    prism.asu.edu%%resources_links.asp

Cluster 2
    www.asu.edu%%copp%%justice%%courses%%index.htm
    www.asu.edu%%clas%%psych%%dinfo%%courses.htm

Cluster 3
    www.eas.asu.edu%%~csedept%%academic%%courses.shtml
    www.eas.asu.edu%%~csedept%%academic%%courses_pfv.html
    www.asu.edu%%aad%%catalogs%%spring_23%%cse.html

Cluster 4
    www.public.asu.edu%%~candan%%cv.htm
    www.public.asu.edu%%~candan%%research.htm
    aria.asu.edu%%people.htm

Cluster 5
    www.eas.asu.edu%%~gcss%%people%%nvf%%pubs.html
    wpcarey.asu.edu%%pubs%%is%%pub.cfm
    www.eas.asu.edu%%~gcss%%people%%nvf%%vita.html

The above are the results for the query "information retrieval" using the K-Means algorithm with k = 5. Cluster 1 represents semantic web mining course pages. Cluster 2 represents pages related to courses. Cluster 3 represents CSE course pages. Cluster 4 represents pages from Prof. Candan's directory. And Cluster 5 has pages from gcss. So almost all clusters have pages forming one category.

Computer Science

Cluster 1
    www.eas.asu.edu%%~wcs%%index.html
    www.eas.asu.edu%%~wcs%%events.htm
    www.eas.asu.edu%%~wcs%%members.htm

Cluster 2
    www.asu.edu%%provost%%smis%%ceas%%bse%%csebse.html
    www.asu.edu%%provost%%smis%%ceas%%bs%%csbs.html
    www.eas.asu.edu%%~gcss%%people%%nvf%%pubs.html

Cluster 3
    www.asu.edu%%lib%%noble%%library%%bestind.htm
    www.asu.edu%%provost%%smis%%clas%%bs%%psbs.html
    www.asu.edu%%provost%%smis%%clas%%ba%%psba.html

The above are the results for the "computer science" query using the Buckshot algorithm with k = 3. Cluster 1 represents pages from the wcs directory, Cluster 2 represents pages under the provost directory, and Cluster 3 represents pages from the clas directory. So the Buckshot algorithm also produces clusters that can be named.

Computer Science

Cluster 1
    www.asu.edu%%lrc%%computerlab.html
    www.asu.edu%%vpsa%%lrc%%computerlab.html

Cluster 2
    www.eas.asu.edu%%~csedept%%students%%internships%%internships.shtml
    www.eas.asu.edu%%~csedept%%students%%scholarships%%scholarships.shtml
    www.eas.asu.edu%%~csedept%%academicprograms%%academicprograms.shtml

Cluster 3
    www.eas.asu.edu%%~gcss%%people%%nvf%%pubs.html
    www.asu.edu%%provost%%smis%%ceas%%bs%%csbs.html
    www.asu.edu%%provost%%smis%%ceas%%bse%%csebse.html

The above are the results for the "computer science" query using the bisecting K-Means algorithm with k = 3. Cluster 1 represents pages related to the computer lab, Cluster 2 represents pages related to internships, scholarships and academic programs, and Cluster 3 represents pages under the ceas directory. Bisecting K-Means also produces clusters that can be named.