Efficient Parallel Hierarchical Clustering


Manoranjan Dash (1), Simona Petrutiu (2), and Peter Scheuermann (2)

(1) Department of Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore
(2) Department of Electrical & Computer Engineering, Northwestern University, Evanston, IL 60208

Abstract. Hierarchical agglomerative clustering (HAC) is a common clustering method that outputs a dendrogram showing all N levels of agglomerations, where N is the number of objects in the data set. High time and memory complexities are some of the major bottlenecks in its application to real-world problems. In the literature, parallel algorithms are proposed to overcome these limitations. But, as this paper shows, existing parallel HAC algorithms are inefficient due to ineffective partitioning of the data. We first show how HAC follows a rule where most agglomerations have very small dissimilarity and only a small portion towards the end have large dissimilarity. Partially overlapping partitioning (POP) exploits this principle and obtains efficient yet accurate HAC algorithms. The total number of dissimilarities is reduced by a factor close to the number of cells in the partition. We present pPOP, the parallel version of POP, which is implemented on a shared memory multiprocessor architecture. Extensive theoretical analysis and experimental results are presented and show that pPOP gives close to linear speedup and outperforms the existing parallel algorithms significantly both in CPU time and memory requirements.

Keywords: hierarchical agglomerative clustering, partitioning, parallel algorithm, shared memory architecture.

1 Introduction

Hierarchical agglomerative clustering (HAC) is often used in various applications due to its capability to output a dendrogram showing all agglomerations. Unlike K-means and other types of clustering where objects are clustered into a given number of clusters, a dendrogram can be used to get any number of clusters.
HAC algorithms are non-parametric, natural and simple in grouping objects, and capable of finding clusters of different shapes by using different similarity measures. However, they are limited in their application to real-world problems mainly due to high CPU time and memory complexities. Existing algorithms take $O(N^2 \log N)$ CPU time and require $O(N^2)$ memory. Parallel algorithms are proposed to alleviate this limitation. Existing parallel algorithms either parallelize other clustering methods such as K-means (Dhillon and Modha [1]) and subspace clustering (Nagesh et al. [2]), or are not very efficient due to lack of performance-enhancing partitioning [3].

In [4] we have shown that the complexities of the existing sequential HAC algorithms can be reduced significantly by an efficient partitioning scheme without losing accuracy. The proposed methods are based on an observation that in HAC most iterations agglomerate very small clusters separated by very small dissimilarity. Only a small number of iterations towards the end agglomerate the large clusters. Using this observation, a structure called partially overlapping partitioning (POP) divides the data into a number of overlapping cells. Analysis and experiments showed that POP-based sequential HAC algorithms reduce existing time and memory complexities by a factor close to the number of cells c.

In this paper we present parallel versions of POP, called pPOP. Due to the independent nature of each partitioned cell, parallelization is able to achieve a similar reduction in time and memory complexities as POP, i.e., by a factor close to the number of cells c. We implement pPOP over a shared memory architecture. Experimental evaluations show that for large data sets pPOP obtains near linear speedup. In addition, for stored matrix implementations, pPOP results in a two order of magnitude improvement in computation time over the existing parallel HAC algorithms.

Research of the third author on this project was supported by NSF grant IIS.
M. Danelutto, D. Laforenza, M. Vanneschi (Eds.): Euro-Par 2004, LNCS 3149, 2004. © Springer-Verlag Berlin Heidelberg 2004

2 Background

Let us assume that there are N objects, each with M attributes. We use real-type data and Euclidean ($L_2$) distance to measure dissimilarity. Other distance measures, e.g. Manhattan, can be used (see [4]).

The Rule: In an experiment we ran the centroid type HAC method over a 2-D data set with 100 clusters and some noise.
In the centroid type, each cluster is represented by a centroid and the pair with the closest centroids is merged in each iteration. In Figure 1, we plot the closest pair distance for each iteration. Notice that most agglomerations, except for a small portion towards the end, have a very small closest pair distance compared to the maximum closest pair distance. This maximum distance is taken over all agglomerations. If we plot the size of the clusters merged in each iteration, it shows a similar plot. We experimented with many data sets having varying characteristics. For varying M, N (typically large, at least a few thousand objects), and K (number of clusters), the general trend is as follows: if a majority of the objects are inside clusters, then the shape of the distance plot is as shown in Figure 1. We refer to this as a rule to convey the idea that in a dendrogram, most levels from the bottom merge pairs of very small clusters separated by a very small portion of the maximum closest pair distance. The rule extends to other HAC algorithms beyond the centroid method, for both the geometric and the graph metrics. Due to space constraints, we restrict all discussions in this paper to the centroid method.
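The distance plot described above can be reproduced on small inputs with a naive centroid-method HAC. The following is a minimal sketch, not the authors' implementation: it is a brute-force loop intended only for small data, and the function names `centroid_hac_merge_distances` and `fraction_below` are hypothetical helpers introduced here for illustration.

```python
import math


def centroid_hac_merge_distances(points):
    """Run naive centroid-method HAC and return the closest-pair
    distance recorded at each of the N-1 merges."""
    # Each cluster is stored as (centroid, size).
    clusters = [(tuple(map(float, p)), 1) for p in points]
    distances = []
    while len(clusters) > 1:
        # Brute-force search for the pair with the closest centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = tuple((a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj))
        del clusters[j]  # j > i, so delete j first to keep index i valid
        del clusters[i]
        clusters.append((merged, ni + nj))
        distances.append(d)
    return distances


def fraction_below(distances, ratio):
    """Fraction of merges whose distance is below ratio * max distance."""
    cutoff = ratio * max(distances)
    return sum(d < cutoff for d in distances) / len(distances)
```

Plotting the returned list against the iteration number gives the distance plot; `fraction_below(distances, 0.1)` reports how many merges fall under 10% of the maximum closest pair distance for a given data set.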

[Figure 1: distance plot. x-axis: Iteration Number (%); y-axis: Closest Pair Distance. Annotation: most iterations merge clusters with distance less than 6% of the maximum merging distance.]

Fig. 1. An important property of HAC: the distance plot shows that the closest pair distance is very small even until the last stage of agglomeration.

See [4] for a detailed discussion of the rule and other metrics. Next we show how to exploit this inherent characteristic of HAC.

2.1 Partially Overlapping Partitioning (POP)

An axis-parallel POP divides the data space uniformly into c overlapping cells. The overlapping region is called the δ-region, where δ is the overlapping distance between two cells. Figure 2 depicts the axis-parallel POP. For the centroid metric (and other geometric metrics), if the representative point of a cluster falls in a δ-region, then each affected cell that contains this δ-region holds it; otherwise only one cell holds it.

[Figure 2: Partially Overlapping Partitioning (POP).]

Fig. 2. The rule is exploited by POP for efficient HAC.
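The cell assignment can be sketched along a single axis. The code below is an illustration under stated assumptions rather than the paper's data structure: it assumes each cell is extended to the right by the overlap distance (called `delta` here), that `delta` does not exceed the cell width, and the function name `cells_holding` and its parameters are hypothetical.

```python
def cells_holding(x, x_min, x_max, c, delta):
    """Return the indices of the cells that hold coordinate x.

    Assumption: cell i nominally covers [x_min + i*w, x_min + (i+1)*w)
    and is extended to the right by delta, so neighbouring cells overlap
    in a region of width delta.
    """
    width = (x_max - x_min) / c
    if delta > width:
        raise ValueError("delta must not exceed the cell width")
    # Home cell: the nominal cell containing x (clamped at the right edge).
    home = min(int((x - x_min) / width), c - 1)
    # x is in an overlap region if it lies within delta of the left
    # boundary of its home cell; the previous cell then holds it too.
    if home > 0 and x - (x_min + home * width) < delta:
        return [home - 1, home]
    return [home]
```

With this layout, two coordinates closer than `delta` always share at least one cell: the left one sits in its home cell, and the right one is at most `delta` past that cell's nominal boundary, so the extended cell still contains it.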

Before discussing POP any further, we very briefly describe some existing HAC algorithms. HAC algorithms are mainly of two types: stored matrix (e.g., dissimilarity matrix and priority queues) and stored data (e.g., nearest neighbor). The dissimilarity matrix method stores dissimilarities between each pair of clusters. When a pair is merged, dissimilarities are computed for the new cluster and the matrix is updated. The memory complexity of this method is $O(N^2)$ and the time complexity is $O(N^3)$. In the priority queue method a heap-based priority queue is maintained for each cluster. Because a priority queue requires $O(\log n)$ time for each insert and delete operation for n elements, the time complexity reduces to $O(N^2 \log N)$ although the memory complexity stays at $O(N^2)$. The nearest neighbor array method maintains nearest neighbors for each cluster in an array. If after each iteration the average number of clusters whose nearest neighbors need to be changed is α, then the time complexity reduces to $O(\alpha N^2)$ and the memory complexity reduces to $O(N)$. An upper bound for α is (3 M ). When memory is enough to store $O(N^2)$ dissimilarities, stored matrix algorithms are preferred as they do fewer computations. Otherwise, the stored data type is preferred.

2-Phase Algorithm: In [4] we proposed a new 2-phase algorithm for HAC based on the axis-parallel POP. In phase 1, clusters are partitioned into c overlapping cells. The basic idea is that in each iteration the closest pair is found for each cell, and from those the overall closest pair is found. If the overall closest pair distance is less than δ, then the pair is merged and the priority queues (or the dissimilarity matrix or the nearest neighbor array) of only the container cell are updated. If the closest pair or the merged cluster is in a δ-region, then the priority queues of the affected cells are also updated. Phase 1 terminates when the closest pair distance exceeds δ.
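Phase 1 as described above can be sketched for the centroid method with cells laid out along the x-axis. This is a simplified illustration, not the algorithm of [4]: it recomputes cell membership and closest pairs by brute force in every iteration instead of maintaining priority queues, it assumes each cell is extended to the right by the overlap distance `delta`, and the names `cells_holding` and `phase1` are hypothetical.

```python
import math
from itertools import combinations


def cells_holding(x, x_min, width, c, delta):
    """Indices of the cells holding x; cells overlap by delta (assumed layout)."""
    home = min(max(int((x - x_min) / width), 0), c - 1)
    if home > 0 and x - (x_min + home * width) < delta:
        return [home - 1, home]
    return [home]


def phase1(points, x_min, x_max, c, delta):
    """Merge closest pairs found inside cells until the overall closest
    pair distance is no longer below delta.

    Returns (merge distances, remaining clusters as {id: (centroid, size)}).
    """
    width = (x_max - x_min) / c
    if delta > width:
        raise ValueError("delta must not exceed the cell width")
    clusters = {i: (tuple(map(float, p)), 1) for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        # Assign every cluster to the cell(s) holding its centroid.
        cells = {}
        for cid, (centroid, _) in clusters.items():
            for cell in cells_holding(centroid[0], x_min, width, c, delta):
                cells.setdefault(cell, []).append(cid)
        # Closest pair per cell, reduced to the overall closest pair.
        best = None
        for members in cells.values():
            for a, b in combinations(members, 2):
                d = math.dist(clusters[a][0], clusters[b][0])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None or best[0] >= delta:
            break  # phase 1 ends; the remaining clusters go to phase 2
        d, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        merged = tuple((u * na + v * nb) / (na + nb) for u, v in zip(ca, cb))
        clusters[next_id] = (merged, na + nb)
        next_id += 1
        merges.append(d)
    return merges, clusters
```

The clusters returned by `phase1` would then be handed to an ordinary HAC routine to finish the dendrogram. A real implementation would update only the data structures of the affected cells after each merge, which is where the savings come from.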
Phase 2 merges the remaining clusters of phase 1 using the existing algorithm, thus completing the dendrogram.

Accuracy: POP in phase 1 ensures that any pair with distance less than δ must reside together in at least one cell. Hence, as phase 2 is the existing algorithm itself, the 2-phase algorithm guarantees the correct dendrogram.

Complexity Analysis: By setting δ to the closest pair distance at the turning point of the distance plot (see Figure 1), a large number of small clusters are merged in phase 1 while only a small number of larger clusters are merged in phase 2. Recall that phase 1 uses POP, which is very efficient, whereas phase 2 uses the existing algorithm, which is not so efficient. In Figure 1, if δ is set to the turning point of the distance plot, 96% of the agglomerations from the beginning are merged in phase 1 and the remaining 4% in phase 2. Therefore, the overall computational time is reduced drastically. So, when δ is set to the turning point, the number of clusters remaining for phase 2 is very small and the total number of clusters in the δ-regions is also very small. To simplify the complexity analysis, we consider both quantities to be negligible. This is reasonable because the rule holds for all data sets that contain clusters. We assume equal cell size and equal δ-region size for each cell. In [4] we give the detailed complexity analysis comparison between the existing and the 2-phase algorithms. The following is a brief overview.

The stored matrix type, which requires $O(N^2)$ memory, now requires $O(N^2/c)$ in the 2-phase algorithm. Hence memory is reduced by a factor close to c. Because of this reduction, the 2-phase dissimilarity matrix algorithm, whose time complexity is dominated by the time required to create the matrix, enjoys a reduction by a factor close to c. The time complexity of the priority queue algorithm is dominated by the update effort required to maintain the priority queues. After each agglomeration of the closest pair, the priority queues of all other clusters are updated. But in the 2-phase algorithm this effort is restricted only to the cell that holds the closest pair, and if it happens to be in a δ-region then it is restricted only to the affected cells. So after simplification the reduction factor is $\frac{c \log N}{\log (N/c)}$, i.e., the time complexity reduces from $O(N^2 \log N)$ to $O\left(\frac{N^2}{c} \log \frac{N}{c}\right)$. In the stored data type there is no reduction in the memory complexity of $O(N)$. The time complexity is dominated by the time required to update the nearest neighbors of the affected clusters. For the existing algorithm the time required to find the nearest neighbor of one affected cluster is $O(N)$, but for the 2-phase algorithm it is $O(N/c)$. So, the overall reduction factor is close to c.

Setting δ and c (Nested Algorithm): The performance of the 2-phase algorithm depends on c and δ. As shown in the distance plot of Figure 1, there exists an ideal δ at the turning point, at which the total time taken by the 2-phase algorithm is minimum. But it is not straightforward to compute. So, we adopted a nested approach in which POP partitioning starts with a very small δ and gradually increases it until a few or just one cluster remains. As δ increases, c, which is set initially to a high value, is gradually reduced. The accuracy of this nested algorithm follows from the accuracy of the 2-phase algorithm. Experiments show that the nested algorithm is more efficient than the 2-phase algorithm even when δ is set ideally for the 2-phase algorithm.
For example, for the data set described in Section 2, the minimum time for the 2-phase algorithm is 15.4 cpu sec while the nested algorithm takes only 57.8 cpu sec.

Higher Dimensional Data: The above discussion focuses on 2-D data. For higher dimensions we proposed a very efficient data structure as a replacement for the axis-parallel partitioning. Due to space constraints we limit the scope of this paper to 2-D and refer the interested reader to [4].

3 pPOP Algorithms

Parallel HAC algorithms have been studied by Li [5], Li and Fang [6], Olson [3], and Wu et al. [7]. The common feature of these algorithms is: for the stored matrix type the task of computing and maintaining $O(N^2)$ dissimilarities is divided among the processors, whereas for the stored data type the task of computing and maintaining the $O(N)$ nearest neighbors is divided among the processors. For example, Olson used p processors to reduce the time complexity of the dissimilarity matrix method to $O(N^3/p)$ and that of the priority queue method to $O((N^2 \log N)/p)$ [3]. The time complexity for the nearest neighbor array method reduces to $O(\alpha N^2/p)$. These algorithms are not very efficient because they still require $O(N^2)$ total memory for the stored matrix type, and in each iteration they need to update all the priority queues or the dissimilarity matrix. For the stored data type the existing methods need to check all the clusters after each agglomeration to determine whether the newly merged cluster is nearer than the previous nearest. So, the reduction in these parallel algorithms is mostly because of parallelization, not because of efficient partitioning.

The advantage of pPOP is that each cell is sufficient by itself, and hence parallelization benefits by dividing the task of creating and maintaining the dissimilarities or priority queues or nearest neighbors of each cell among the processors. This drastically reduces the total computation of searching for the closest pair and maintaining the data structure. Below we give the complexities of the sequential, existing parallel, and pPOP algorithms. For the complexity analysis we select the 2-phase algorithm of the stored matrix type since, as we shall show later, this algorithm achieves larger speedups compared to the existing algorithms. As before, we assume equal cell sizes, negligible δ-region size, and negligible phase 2 time. Among existing algorithms, those described by Olson [3] are selected. The number of processors is denoted by p.

Table 1. Comparison of time complexities of sequential, existing parallel, and pPOP algorithms. RF = Reduction Factor (= Existing Parallel / pPOP).

Priority queues                  | Sequential        | Existing Parallel      | pPOP                         | RF
1. Create priority queues        | $O(N^2)$          | $O(N^2/p)$             | $O(N^2/(cp))$                |
2. for n = N to 2                | $O(N)$            | $O(N)$                 | $O(N)$                       |
3.   find smallest distance      | $O(n)$            | $O(n/p)$               | $O(n/p)$                     |
4.   merge and update P          | $O(n \log n)$     | $O((n \log n)/p)$      | $O((n \log (n/c))/p)$        |
Overall                          | $O(N^2 \log N)$   | $O((N^2 \log N)/p)$    | $O((N^2 \log (N/c))/p)$      | $\log N / \log (N/c)$
Overall (dissimilarity matrix)   | $O(N^3)$          | $O(N^3/p)$             | $O(N^3/(cp))$                | $c$

In Table 1 (priority queues), step 1 of pPOP computes the priority queues in $O(N^2/(cp))$ time.
Recall that POP reduces the memory by a factor of c, i.e., to $O(N^2/c)$. pPOP divides the total computation for the c cells among p processors, and hence, assuming no synchronization delays, the complexity becomes $O(N^2/(cp))$. Step 4 updates the priority queues of the affected clusters. In POP a priority queue holds $N/c$ elements in the beginning. Hence, due to parallelization the total time complexity of this step is $O((n \log (n/c))/p)$. So, the overall reduction factor is $\log N / \log (N/c)$.

Table 1 shows the overall complexities for the dissimilarity matrix type as well. It has a reduction factor close to c. The memory requirement for the priority queue and dissimilarity matrix types is reduced by a factor close to c. For the nearest neighbor type, the gain of pPOP over the existing parallel algorithms cannot be obtained directly from the complexity analysis. For the step where each cluster is checked to find whether it is affected by the agglomeration, pPOP needs to do it for one cell (or a few, if in a δ-region) whereas the existing algorithm needs to do it for all clusters. Similarly, the existing algorithm needs to check all the clusters to find the new nearest neighbor of each affected cluster, but pPOP requires only the container cell to be checked. Experimental results in the next section show that pPOP outperforms the existing algorithms substantially for all three of the above types of HAC.

4 Experimental Results

We performed a number of experiments to study the performance and scalability of our proposed pPOP algorithms. Both stored matrix (priority queues) and stored data (nearest neighbors) types of pPOP were implemented using the 2-phase algorithm. For comparison purposes we implemented the corresponding existing parallel algorithms, hereafter denoted as existing algorithms. These are described in [3]. The performance was measured in terms of CPU time, memory space, and speedup. We experimented using several real, benchmark, and artificial data sets. Due to space constraints we show the results over an artificial data set that is used in [8]. Other results are available from manoranj/research.html.

The experiments were run on the SGI Origin2000 multiprocessor system, which is a shared memory machine consisting of 8 R10000 CPUs running at a clock rate of 195 MHz. The secondary cache size is 4 MB. We used OpenMP, which is an API for directive-based parallel programming applications in a shared memory environment [9]. We decided to use it because it is designed for fine-grained parallelism, which was predominant in our algorithm. The pPOP implementation in OpenMP uses the guided self-scheduling clause in the assignment of iterations to threads, i.e., processors. During each iteration of HAC each processor is assigned in turn a chunk of cells to work on, with the chunk size being reduced as we proceed with the iteration.
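The shrinking chunk sizes of guided self-scheduling can be illustrated with a small simulation. The following is a simplified model, not the behavior of any particular OpenMP runtime: it assumes each chunk is the ceiling of the number of unassigned cells divided by the number of processors, with a minimum chunk size of 1. The function name `guided_chunks` and its parameters are hypothetical.

```python
def guided_chunks(num_cells, num_procs, min_chunk=1):
    """Simulate the sequence of chunk sizes handed out under a simple
    guided self-scheduling model: each chunk is ceil(remaining / num_procs),
    never smaller than min_chunk (except for the final leftover)."""
    if num_cells < 0 or num_procs < 1 or min_chunk < 1:
        raise ValueError("invalid arguments")
    chunks = []
    remaining = num_cells
    while remaining > 0:
        size = -(-remaining // num_procs)  # integer ceiling division
        size = min(max(size, min_chunk), remaining)
        chunks.append(size)
        remaining -= size
    return chunks
```

Early chunks are large, which keeps scheduling overhead low, and later chunks are small, which helps even out the load when cells contain very different numbers of clusters.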
After an iteration is finished, a critical region is established in order to find the overall closest pair of clusters and merge them. The priority queues of the cells affected by the agglomeration can be updated in parallel.

In Figure 3 we show the results over the synthetic data set whose size varies from 3K to 60K. The existing stored matrix algorithms require $O(N^2)$ memory, hence we could experiment only with a data size up to 5K; on the other hand, for pPOP we report results for data sets up to 30K. The number of processors varies from 1 to 8. In Figure 3 (a-b) we report the speedups of pPOP. Although the speedup of pPOP is small for smaller data sets, we observe that for larger data sets (30K or higher) the speedup of pPOP improves substantially and approaches linear speedup for data sets of 60K. Figure 3 (c-d) gives the relative speedup of pPOP over the existing algorithm. pPOP is always superior to the existing algorithm because of its efficient partitioning and the independent nature of each cell. The relative speedup increases with data size. Among the stored matrix and stored data types, pPOP's performance is much better for stored matrix. It achieves a two order of magnitude improvement in computation time over the existing algorithm.

[Figure 3: four panels, each plotted against Number Of Processors. (a) Stored Matrix pPOP Algorithm: speedup vs. ideal speedup for 3k, 5k, 10k, 15k, and 30k points. (b) Stored Data pPOP Algorithm: speedup vs. ideal speedup for 10k, 15k, 30k, and 60k points. (c) Stored Matrix Relative Speedup = (CPU Time Existing / CPU Time pPOP) for 3k and 5k points. (d) Stored Data Relative Speedup = (CPU Time Existing / CPU Time pPOP).]

Fig. 3. Synthetic data results: for stored matrix and stored data types, and for varying numbers of processors (1 to 8), (a-b) show the performance of pPOP, and (c-d) show Relative Speedup = Existing / pPOP.

As shown in Figure 3 (c-d), the relative speedup (existing CPU time divided by pPOP CPU time) decreases as the number of processors increases. This is because, for a small number of cells, when the number of processors is increased, some processors end up working on cells containing a very small number of clusters, and will therefore spend a lot of time being idle when they are done with the computation in a given iteration. However, as the data set size increases and/or the number of clusters increases, load balancing among the processors becomes better. This phenomenon can be observed in our figures. Although for both 3K and 5K sizes for stored

matrix type the relative speedup drops by approximately the same amount (85) when the number of processors increases from 1 to 8, the noticeable fact is that the relative speedup for 1 processor for 3K size is 375, but that for 5K size is 460. That is to say, as the number of processors increased, the rate of drop in speedup decreased with increasing data size. Although, due to the high memory requirement of the existing parallel algorithms, we could not test for larger data sizes, we postulate that for larger data sets this reduction in relative speedup for more processors will continue to slow down further.

We compared the memory for the stored matrix type. For 3K and 5K, pPOP reduced the memory requirement by a factor of 97 and 189 respectively. For the stored data type both algorithms require a similar amount of memory.

5 Conclusion and Future Directions

In this paper we proposed pPOP for efficient parallel HAC. Analysis and experiments showed that, for both stored matrix and stored data types, pPOP outperforms the existing algorithms significantly both in CPU time and memory requirements. This is achieved by exploiting a rule of HAC which states that in a dendrogram, most levels from the bottom merge pairs of very small clusters separated by a very small portion of the maximum closest pair distance. The data space was partitioned by partially overlapping cells, each of which could be processed independently of other such cells without affecting accuracy. Future work includes parallelizing the high-dimensional data structure.

References

1. Dhillon, I.S., Modha, D.M.: Large-scale parallel data mining. Lecture Notes in Artificial Intelligence 1759 (2000)
2. Nagesh, H., Goil, S., Choudhary, A.: PMAFIA: A scalable parallel subspace clustering algorithm for massive datasets. In: Proc. International Conference on Parallel Processing. (2000)
3. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Computing 21 (1995)
4. Dash, M., Liu, H., Scheuermann, P., Tan, K.L.: Fast hierarchical clustering and its validation. Data and Knowledge Engineering 44(1) (2003)
5. Li, X.: Parallel algorithms for hierarchical clustering and cluster validity. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990)
6. Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Computing 11 (1989)
7. Wu, C.H., Horng, S.J., Tsai, H.R.: Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. Journal of Parallel and Distributed Computing 60 (2000)
8. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD Conference on Management of Data, Montreal, Canada (1996)
9. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., Menon, R., eds.: Parallel Programming in OpenMP. Morgan Kaufmann Publishers (2000)
