Exploiting Parallelism to Support Scalable Hierarchical Clustering


Transcription:

Exploiting Parallelism to Support Scalable Hierarchical Clustering
Rebecca Cathey, Eric Jensen, Steven Beitzel, Ophir Frieder, David Grossman
Information Retrieval Laboratory, http://ir.iit.edu

Background
Information Retrieval (IR) goal: return documents relevant to a user's query.
Terminology:
Document: a sequence of terms expressing ideas about some topic in a natural language.
Query: a request for documents pertaining to some topic.
Relevancy is measured by how close documents are to the user's query.

IR Example
Find documents close to the user's query.
D1: "The Dogs walked home" → <1, 1, 1, 0>
D2: "Home on the range" → <0, 0, 1, 1>
Q: "Dogs at home" → <1, 0, 1, 0>
Inverted index: Dogs → D1,1; Walked → D1,1; Home → D1,1 D2,1; Range → D2,1.
[Figure: vector-space plot showing Q closer to D1 than to D2.]
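
The ranking in this example can be reproduced with cosine similarity over the binary term vectors above (a sketch; the slide does not name its similarity measure, so cosine is an assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Term order: Dogs, Walked, Home, Range
d1 = [1, 1, 1, 0]  # "The Dogs walked home"
d2 = [0, 0, 1, 1]  # "Home on the range"
q = [1, 0, 1, 0]   # "Dogs at home"

print(cosine(q, d1))  # ~0.816: D1 ranks first
print(cosine(q, d2))  # 0.5
```

With these vectors the query shares two terms with D1 but only one with D2, which matches the plot's placement of Q nearer to D1.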

Document Clustering
Automatically group related documents into clusters.
Uses: if a collection is well clustered, we can search only the cluster that will contain relevant documents. Searching a smaller collection should improve effectiveness and efficiency.
Types: hierarchical agglomerative clustering; partitioning (iterative) clustering.

Document Clustering: Partitioning vs. Hierarchical
Hierarchical: computationally expensive; not order dependent (always gives the same division of clusters).
Partitioning: potential to cluster larger document collections; order dependent; less accurate.
Combining partitioning with hierarchical clustering, as in the Buckshot algorithm, is fast with reasonable accuracy.

Hierarchical Clustering
Produces a hierarchy of clusters.
Create an n x n similarity matrix: O(n²).
[Figure: dendrogram over items A–E, merging A with D, B with E, then C with BE, and finally all five.]

Sequential Hierarchical Algorithm
Phase 1:
Assign each document to its own cluster.
Build a similarity matrix for n documents.
Find the nearest neighbor and corresponding max similarity for each cluster.
Phase 2:
Search the nearest-neighbor array for the closest clusters.
Merge the two clusters.
Update the similarity matrix with the similarity to the new cluster.
Update α elements in the nearest-neighbor array.
Total Cost: n² + (n)(αn) = O(αn²)
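
The two phases can be sketched as follows (a minimal illustration; single-link merging is an assumption since the slides do not fix the merge criterion, and the nearest-neighbor array is kept implicit for brevity):

```python
def hac(points, k, sim):
    """Naive agglomerative clustering: merge the closest pair of
    clusters until k remain. sim(x, y) returns a similarity score;
    single-link: sim(merged, c) = max over cross-cluster pairs."""
    clusters = {i: [i] for i in range(len(points))}
    # Phase 1: build the pairwise similarity matrix (upper triangle)
    S = {(i, j): sim(points[i], points[j])
         for i in clusters for j in clusters if i < j}
    next_id = len(points)
    while len(clusters) > k:
        # Phase 2: find and merge the closest pair of clusters
        a, b = max(S, key=S.get)
        merged = clusters.pop(a) + clusters.pop(b)
        S = {pair: s for pair, s in S.items()
             if a not in pair and b not in pair}
        # update similarities between the new cluster and the rest
        for c in clusters:
            S[(c, next_id)] = max(sim(points[i], points[j])
                                  for i in merged for j in clusters[c])
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())

# 1-D toy example: similarity = negative distance
print(hac([0.0, 0.1, 5.0, 5.1], 2, lambda x, y: -abs(x - y)))
# → [[0, 1], [2, 3]]
```

In the sequential algorithm the update after each merge touches only the affected rows, which is where the α factor in the cost comes from; this sketch recomputes cross-cluster similarities directly to stay short.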

Phase 1: Build a Similarity Matrix
[Figure: worked example showing the similarity matrix alongside the nearest-neighbor (nn) and max-similarity arrays for a small collection.]

Phase 2: Create the Clusters
[Figures: a sequence of slides animating Phase 2 on the small example. At each step the closest pair of clusters is found via the nearest-neighbor array, the pair is merged, the similarity scores in the matrix are updated, and the nn and max arrays are refreshed, repeating until the full hierarchy is built.]

Partitioning (Iterative) Clustering
Divides the documents into several clusters according to some criterion.
Start with a random set of cluster centroids.
Iteratively adjust until some termination condition is met.
Linear time.
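
A minimal k-means-style sketch of this iterative scheme (the seeding and the fixed iteration budget are simplifications made here for reproducibility; the slide starts from random centroids):

```python
def kmeans(points, k, iters=20):
    """Partitioning (iterative) clustering sketch: seed centroids,
    then repeatedly assign points to the nearest centroid and
    recompute centroids until the iteration budget runs out."""
    # Seeded with the first k points so the sketch is deterministic;
    # real implementations pick random centroids.
    cents = [list(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, cents[c])))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a group empties
                cents[j] = [sum(col) / len(g) for col in zip(*g)]
    return groups

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
groups = kmeans(pts, 2)
```

Each pass is linear in the number of documents, which is why partitioning methods scale to collections that hierarchical methods cannot handle directly.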

Sequential Buckshot Algorithm
Phase 1:
Cluster √(kn) randomly selected documents using the hierarchical agglomerative algorithm.
Phase 2:
Calculate centroids of the clusters from Phase 1.
Assign every other document to the cluster it is closest to.
Total Cost: α(√(kn))² + kn = O(αkn)
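
A compact sketch of the two phases (the √(kn) sample size follows the slide; single-link merging, Euclidean distance, and the naive inner HAC are illustrative assumptions):

```python
import math
import random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hac(items, k):
    """Naive single-link agglomerative clustering on a small sample."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

def buckshot(docs, k, seed=0):
    rng = random.Random(seed)
    # Phase 1: hierarchically cluster a sample of about sqrt(k * n) docs
    m = max(k, math.isqrt(k * len(docs)))
    sample = rng.sample(range(len(docs)), m)
    seeds = hac([docs[i] for i in sample], k)
    cents = [[sum(docs[sample[i]][d] for i in c) / len(c)
              for d in range(len(docs[0]))] for c in seeds]
    # Phase 2: assign every document to the closest centroid
    out = [[] for _ in range(k)]
    for i, doc in enumerate(docs):
        out[min(range(k), key=lambda c: dist(doc, cents[c]))].append(i)
    return out
```

The expensive hierarchical step runs only on the √(kn)-document sample, so its quadratic cost collapses to O(kn), and the linear assignment pass handles the rest of the collection.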

Example Buckshot Clustering
n documents, want k clusters.
Randomly select √(kn) documents and cluster them hierarchically: O(kn).
[Figure: the sampled documents grouped into k clusters.]

Parallel Approach
Split the tasks equally among p machines, reducing time and memory complexity by a factor of p.
Parallel Hierarchical Agglomerative Algorithm: allows clustering of larger document collections while preserving accuracy; can be used with the parallel Buckshot algorithm to cluster larger collections.

Parallel Hierarchical Algorithm
Phase 1:
Partition the similarity matrix among p nodes; each node finds the similarities for n/p rows.
Each node finds the maximum similarity and nearest neighbor for its partition.
Phase 2:
Each node searches the nearest-neighbor array for the closest clusters.
A node is selected to manage the new cluster.
Every node calculates and sends the similarity between its clusters and the new cluster.
Every node updates α/p elements in the nearest-neighbor array.
Total Cost: n²/p + ((n)(αn))/p = O(αn²/p)
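
The Phase 1 decomposition can be sketched like this (a toy illustration of the data partitioning: threads stand in for the p nodes, and a round-robin row split is an assumption; real nodes would hold disjoint memory):

```python
from concurrent.futures import ThreadPoolExecutor

def node_phase1(rows, points, sim):
    """One node computes the similarities for its rows of the matrix,
    plus each row's maximum similarity and nearest neighbor."""
    block, nn = {}, {}
    for i in rows:
        best = None
        for j in range(len(points)):
            if i == j:
                continue
            s = sim(points[i], points[j])
            block[(i, j)] = s
            if best is None or s > best[0]:
                best = (s, j)
        nn[i] = best  # (max similarity, nearest neighbor) for row i
    return block, nn

def parallel_phase1(points, sim, p=2):
    # each worker gets about n/p rows of the similarity matrix
    parts = [list(range(len(points)))[r::p] for r in range(p)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        results = list(ex.map(lambda rows: node_phase1(rows, points, sim),
                              parts))
    # merge the per-node blocks into the global matrix and nn array
    matrix, nn = {}, {}
    for block, local_nn in results:
        matrix.update(block)
        nn.update(local_nn)
    return matrix, nn

pts = [0.0, 0.2, 5.0]
matrix, nn = parallel_phase1(pts, lambda x, y: -abs(x - y))
```

Because each node owns n/p rows, both the O(n²) matrix build and the per-merge updates divide evenly, giving the O(αn²/p) total above.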

Phase 1: Build Similarity Matrix — Partition Documents
[Figure: the rows of the similarity matrix, with the nearest-neighbor (nn) and max arrays, divided among nodes N1, N2, and N3.]

Phase 2: Create the Clusters
[Figures: N1 is chosen to manage the new cluster; each node then updates the similarity scores for its partition of the matrix.]

Parallel Buckshot Algorithm
Phase 1:
Cluster √(kn) randomly selected documents using the parallel hierarchical agglomerative algorithm.
Phase 2:
Calculate centroids of the clusters from Phase 1.
Partition the remaining n − √(kn) documents into p partitions.
Each node assigns the documents from its partition to the closest cluster.
Total Cost: α(√(kn))²/p + kn/p = O(αkn/p)
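
Phase 2's document assignment parallelizes the same way. A sketch of splitting the documents among p workers (the round-robin split and Euclidean distance are assumptions; threads stand in for nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def assign_partition(part, cents):
    """One node labels its partition of documents with the index of
    the closest centroid."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(cents)), key=lambda c: d2(doc, cents[c]))
            for doc in part]

def parallel_assign(docs, cents, p=2):
    parts = [docs[r::p] for r in range(p)]  # round-robin partitions
    with ThreadPoolExecutor(max_workers=p) as ex:
        results = list(ex.map(lambda part: assign_partition(part, cents),
                              parts))
    # stitch the round-robin partitions back into document order
    labels = [None] * len(docs)
    for r, part_labels in enumerate(results):
        for k, lab in enumerate(part_labels):
            labels[r + k * p] = lab
    return labels

docs = [(0, 0), (10, 10), (0, 1), (9, 10)]
labels = parallel_assign(docs, [(0, 0), (10, 10)])  # → [0, 1, 0, 1]
```

Since every document is compared against only k centroids, the kn assignment work splits cleanly into kn/p per node with no cross-node communication.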

Experiments
Parallel computer consisting of 1 nodes.
Collections:
GB SGML collection from TREC, 8,0 documents.
Subset of the GB collection, 0,000 documents.

Parallel Hierarchical Scalability according to the number of nodes

Parallel Hierarchical Scalability according to the collection size

Hierarchical Cluster Evaluation
[Table: cluster quality for K-means vs. hierarchical clustering at several cluster counts, with an X marking statistically significant differences.]

Parallel Buckshot Scalability according to the number of nodes

Parallel Buckshot Scalability according to the number of clusters

Parallel Buckshot Scalability according to the collection size

Buckshot Cluster Evaluation
[Table: cluster quality for K-means vs. Buckshot at several cluster counts, with an X marking statistically significant differences.]

Clustering Summary
Clustering can be used to organize data.
Parallel hierarchical agglomerative clustering can accurately cluster moderately large collections of documents.
Parallel Buckshot can be combined with the parallel hierarchical agglomerative clustering algorithm to effectively cluster large quantities of data.