Exploiting Parallelism to Support Scalable Hierarchical Clustering
Rebecca Cathey, Eric Jensen, Steven Beitzel, Ophir Frieder, David Grossman
Information Retrieval Laboratory
http://ir.iit.edu
Background
- Information Retrieval (IR) goal: return documents relevant to a user's query
- Terminology
  - Document: a sequence of terms expressing ideas about some topic in a natural language
  - Query: a request for documents pertaining to some topic
- Relevancy is measured by how close documents are to the user's query
IR Example
- Find documents close to the user's query
- D1: "The dogs walked home" -> <1, 1, 1, 0>
- D2: "Home on the range" -> <0, 0, 1, 1>
- Q: "Dogs at home" -> <1, 0, 1, 0>
- Inverted index: Dogs -> (D1, 1); Walked -> (D1, 1); Home -> (D1, 1), (D2, 1); Range -> (D2, 1)
- (figure: Q lies closer to D1 than to D2)
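As a quick illustration of this vector-space comparison (not part of the original slides), the Python sketch below computes cosine similarity between the term-frequency vectors above; the term order <dogs, walked, home, range> is assumed:

```python
import math

# Assumed term order: <dogs, walked, home, range>
d1 = [1, 1, 1, 0]   # D1: "The dogs walked home"
d2 = [0, 0, 1, 1]   # D2: "Home on the range"
q  = [1, 0, 1, 0]   # Q:  "Dogs at home"

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine(q, d1))  # ~0.82: D1 is the closer document
print(cosine(q, d2))  # ~0.50
```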
Document Clustering
- Automatically group related documents into clusters
- Uses: if a collection is well clustered, we can search only the clusters likely to contain relevant documents
  - Searching a smaller collection should improve effectiveness and efficiency
- Types
  - Hierarchical agglomerative clustering
  - Partitioning (iterative) clustering
Document Clustering: Partitioning vs. Hierarchical
- Hierarchical
  - Computationally expensive
  - Not order dependent: always gives the same division of clusters
- Partitioning
  - Potential to cluster larger document collections
  - Order dependent
  - Less accurate
- Combine partitioning with hierarchical: the Buckshot algorithm
  - Fast, with reasonable accuracy
Hierarchical Clustering
- Produces a hierarchy of clusters
- Create an n x n similarity matrix: O(n²)
- (figure: dendrogram over documents A-E; A and D merge into AD, B and E merge into BE, BE and C merge into BCE, and finally AD and BCE merge into ABCDE)
Sequential Hierarchical Algorithm
- Phase 1
  - Assign each document to its own cluster
  - Build a similarity matrix for the n documents
  - Find the nearest neighbor and corresponding maximum similarity for each cluster
- Phase 2 (repeated for each merge)
  - Search the nearest-neighbor array for the closest pair of clusters
  - Merge the two clusters
  - Update the similarity matrix with the similarities to the new cluster
  - Update α elements in the nearest-neighbor array
- Total cost: n² + (n)(αn) = O(αn²)
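To make the two phases concrete, here is a minimal single-machine sketch (not the authors' implementation) that takes a precomputed similarity matrix, maintains a nearest-neighbor array, and repeatedly merges the closest pair; single-link (maximum) similarity is an assumption, since the slides do not name the merge criterion:

```python
import numpy as np

def sequential_hac(sim, num_merges):
    """Sketch of the sequential hierarchical agglomerative algorithm.
    sim: precomputed n x n similarity matrix (Phase 1).
    Returns the sequence of merges as (cluster_i, cluster_j) pairs."""
    sim = sim.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)          # a cluster is never its own neighbor
    active = list(range(sim.shape[0]))
    # Nearest-neighbor array: for each cluster, its most similar cluster
    nn = {c: int(np.argmax(sim[c])) for c in active}
    merges = []
    for _ in range(num_merges):
        # Phase 2: use the nearest-neighbor array to find the closest pair
        i = max(active, key=lambda c: sim[c, nn[c]])
        j = nn[i]
        merges.append((i, j))
        # Merge j into i; single-link keeps the maximum similarity (assumption)
        sim[i, :] = np.maximum(sim[i, :], sim[j, :])
        sim[:, i] = sim[i, :]
        sim[i, i] = -np.inf
        sim[j, :] = sim[:, j] = -np.inf     # deactivate the absorbed cluster
        active.remove(j)
        del nn[j]
        # Update only the affected nearest-neighbor entries (the alpha elements)
        for c in active:
            if c == i or nn[c] in (i, j):
                nn[c] = int(np.argmax(sim[c]))
            elif sim[c, i] > sim[c, nn[c]]:
                nn[c] = i
    return merges
```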
Phase 1: Build a Similarity Matrix
- (figure: example similarity matrix for a small document set, with the nearest-neighbor (nn) array and maximum-similarity (max) array alongside)
Phase 2: Create the Clusters
- (figures, worked example over several slides: find the closest pair of clusters in the nn/max arrays, merge them (for example, clusters 1 and 2 become cluster "1,2"), update the similarity scores and nearest-neighbor entries for the merged cluster, and repeat until all clusters are joined)
Partitioning (Iterative) Clustering
- Divides the documents into several clusters according to some criterion
- Start with a random set of cluster centroids
- Iteratively adjust until some termination condition is met
- Linear time
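A minimal sketch of this kind of partitioning clustering (essentially k-means, which the evaluation slides later compare against) might look like the following; the Euclidean distance and fixed iteration count are assumptions:

```python
import numpy as np

def kmeans(docs, k, iters=10, seed=0):
    """Partitioning (iterative) clustering sketch: k-means over document vectors.
    docs: (n, d) array of document vectors. Returns (assignments, centroids)."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from a random set of cluster centroids
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()
    for _ in range(iters):                      # fixed iteration count as the stop rule
        # Assign each document to its nearest centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each centroid from its members (keep the old one if empty)
        for c in range(k):
            members = docs[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return assign, centroids
```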
Sequential Buckshot Algorithm
- Phase 1
  - Cluster √(kn) randomly selected documents using the hierarchical agglomerative algorithm
- Phase 2
  - Calculate the centroids of the clusters from Phase 1
  - Assign every other document to the cluster it is closest to
- Total cost: α(√(kn))² + kn = O(αkn)
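A sketch of the two Buckshot phases, under the assumption that a helper `hac_k_clusters(vectors, k)` (for example, built on the hierarchical sketch above) returns k groups of sample indices:

```python
import numpy as np

def buckshot(docs, k, hac_k_clusters, seed=0):
    """Sequential Buckshot sketch.
    hac_k_clusters(vectors, k) is an assumed helper that hierarchically
    clusters the sampled vectors and returns k lists of sample indices."""
    docs = np.asarray(docs, dtype=float)
    n = len(docs)
    m = int(np.sqrt(k * n))                         # Phase 1 sample size: sqrt(kn)
    rng = np.random.default_rng(seed)
    sample = rng.choice(n, size=m, replace=False)
    groups = hac_k_clusters(docs[sample], k)        # Phase 1: hierarchical clustering
    # Phase 2: centroids of the Phase 1 clusters seed the final assignment
    centroids = np.array([docs[sample[g]].mean(axis=0) for g in groups])
    dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids
```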
Example: Buckshot Clustering
- n documents, want k clusters
- Randomly select √(kn) documents and cluster them hierarchically: O(αkn)
- (figure: the √(kn) sampled documents are grouped into k clusters)
Parallel Approach
- Split the tasks equally among p machines
- Reduces time and memory complexity by a factor of p
- Parallel hierarchical agglomerative algorithm
  - Allows clustering of larger document collections while preserving accuracy
  - Can be used with the parallel Buckshot algorithm to cluster even larger collections
Parallel Hierarchical Algorithm
- Phase 1
  - Partition the similarity matrix among p nodes; each node computes the similarities for n/p rows
  - Each node finds the maximum similarity and nearest neighbor within its partition
- Phase 2
  - Each node searches its nearest-neighbor array for the closest pair of clusters
  - A node is selected to manage the new (merged) cluster
  - Every node calculates and sends the similarities between its clusters and the new cluster
  - Every node updates α/p elements in its nearest-neighbor array
- Total cost: n²/p + ((n)(αn))/p = O(αn²/p)
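The Phase 1 partitioning is straightforward to sketch with worker processes standing in for nodes; this is an illustrative sketch, not the authors' message-passing implementation, and it assumes the document vectors are already unit-length so a dot product gives cosine similarity:

```python
import numpy as np
from multiprocessing import Pool

def _partition_rows(args):
    """Worker ("node"): compute n/p rows of the similarity matrix plus the
    local nearest-neighbor and maximum-similarity entries for those rows."""
    docs, lo, hi = args
    rows = docs[lo:hi] @ docs.T                 # cosine similarity for unit-length rows
    np.fill_diagonal(rows[:, lo:hi], -np.inf)   # ignore self-similarity
    return lo, rows, rows.argmax(axis=1), rows.max(axis=1)

def parallel_similarity(docs, p):
    """Phase 1 of the parallel hierarchical algorithm (sketch): partition the
    n rows of the similarity matrix among p worker processes."""
    docs = np.asarray(docs, dtype=float)
    bounds = np.linspace(0, len(docs), p + 1, dtype=int)
    jobs = [(docs, bounds[i], bounds[i + 1]) for i in range(p)]
    with Pool(p) as pool:
        parts = pool.map(_partition_rows, jobs)
    sim = np.vstack([rows for _, rows, _, _ in parts])
    nn = np.concatenate([nn for _, _, nn, _ in parts])
    mx = np.concatenate([mx for _, _, _, mx in parts])
    return sim, nn, mx
```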
Phase 1: Build Similarity Matrix (Partition Documents)
- (figure: the similarity matrix rows are partitioned across nodes N1, N2, and N3, each holding its portion of the nn and max arrays)
Phase 2: Create the Clusters
- (figures: N1 is chosen to manage the new merged cluster; each node updates its similarity scores and nearest-neighbor entries for the new cluster)
Parallel Buckshot Algorithm
- Phase 1
  - Cluster √(kn) randomly selected documents using the parallel hierarchical agglomerative algorithm
- Phase 2
  - Calculate the centroids of the clusters from Phase 1
  - Partition the remaining n - √(kn) documents into p partitions
  - Each node assigns the documents in its partition to the closest cluster
- Total cost: α(√(kn))²/p + kn/p = O(αkn/p)
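Phase 2 here is an embarrassingly parallel assignment step; a minimal sketch (again with worker processes standing in for nodes, and Euclidean distance as an assumption):

```python
import numpy as np
from multiprocessing import Pool

def _assign_partition(args):
    """Worker ("node"): assign one partition of documents to the nearest centroid."""
    part, centroids = args
    dists = np.linalg.norm(part[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def parallel_assign(docs, centroids, p):
    """Phase 2 of the parallel Buckshot sketch: split the remaining documents
    into p partitions and assign each partition on its own worker."""
    docs = np.asarray(docs, dtype=float)
    parts = np.array_split(docs, p)
    with Pool(p) as pool:
        labels = pool.map(_assign_partition, [(part, centroids) for part in parts])
    return np.concatenate(labels)
```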
Experiments
- Parallel computer consisting of 1 nodes
- Collections
  - GB SGML collection from TREC (8,0 documents)
  - Subset of the GB collection (0,000 documents)
Parallel Hierarchical Scalability according to the number of nodes
Parallel Hierarchical Scalability according to the collection size
Hierarchical Cluster Evaluation
- (table: K-means vs. hierarchical clustering quality for several numbers of clusters, with statistical significance indicated)
Parallel Buckshot Scalability according to the number of nodes
Parallel Buckshot Scalability according to the number of clusters
Parallel Buckshot Scalability according to the collection size
Buckshot Cluster Evaluation
- (table: K-means vs. Buckshot clustering quality for several numbers of clusters, with statistical significance indicated)
Clustering Summary
- Clustering can be used to organize data
- Parallel hierarchical agglomerative clustering can be used to accurately cluster moderately large collections of documents
- Parallel Buckshot can be combined with the parallel hierarchical agglomerative clustering algorithm to effectively cluster large quantities of data