Parallel Clustering on a Unidirectional Ring

Günter Rudolph¹
University of Dortmund, Department of Computer Science, LS XI, D-44221 Dortmund

¹ The author is supported under project 107 004 91 by the Research Initiative Parallel Computing of Nordrhein-Westfalen, Land of the Federal Republic of Germany.

Abstract. In this paper a parallel version of the squared error clustering method is proposed, which groups N data vectors of dimension M into K clusters such that the data vectors within each cluster are as homogeneous as possible, whereas the clusters are as heterogeneous as possible compared to each other. The algorithm requires O(NMK/P) time on P processors, assuming that P ≤ K. Speed up measurements for up to 40 processors are given. It is sketched how this algorithm could be applied to parallel global optimization over continuous variables.

1. Introduction

Cluster algorithms are a valuable tool for exploratory pattern analysis. They are used to group a set of data vectors (patterns) into clusters such that the data vectors within each cluster are as homogeneous as possible, whereas each cluster is as heterogeneous as possible compared to the other clusters, according to some similarity measure. The assignment of the data vectors to clusters depends on the choice of the similarity measure. Here, it is assumed that the data vectors are elements of the M-dimensional Euclidean space $\mathbb{R}^M$, so that the Euclidean norm can be used as the similarity measure.

Let N denote the number of M-dimensional data vectors which are to be grouped into K clusters. The data vectors are gathered in the data matrix X of size N × M. The sets $C_1, C_2, \ldots, C_K$ represent the clusters; they contain numbers in the range $1, 2, \ldots, N$, which identify the data vectors by their position in the data matrix. It is required that $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, 2, \ldots, N\}$ with $C_i \cap C_j = \emptyset$ for $i \neq j$ and $|C_i| \geq 1$ for all clusters. The tuple $C = (C_1, C_2, \ldots, C_K)$ is then called a partition of length K. For each cluster the center

    \bar{x}_k = \frac{1}{|C_k|} \sum_{j \in C_k} x_j

is exactly the point which minimizes the sum of squared distances to all points within the cluster (setting the gradient $\nabla S_k(y) = -2 \sum_{j \in C_k} (x_j - y)$ to zero yields the mean), i.e. $\bar{x}_k = \arg\min\{S_k(y) : y \in \mathbb{R}^M\}$, where

    S_k(y) := \sum_{j \in C_k} \|x_j - y\|^2

denotes the sum of squared distances between y and the elements of cluster $C_k$. The deviation sum of a partition C can be defined as the sum of the minimal squared distances of all clusters:

    D(C) = \sum_{k=1}^{K} S_k(\bar{x}_k)                                    (1)

Let P(K, N) denote the set of all feasible partitions of N data vectors into K clusters. A partition $C^0 \in P(K, N)$ is called optimal with respect to (1) if

    D(C^0) = \min\{D(C) : C \in P(K, N)\}                                   (2)

The task of determining a partition $C^0$ with property (2) is an NP-hard combinatorial optimization problem. Since the number of feasible partitions is

    \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N,

the choice of a solution method for (2) is restricted to iterative improvement methods, which do not guarantee the determination of the globally optimal partition. The method used in the following, however, has the property that its solution fulfills at least the necessary condition for an optimal partition (for more details see e.g. [1]).

In the next section the squared error clustering technique is described, before the parallelization is carried out in section 3. Numerical results concerning speed up and efficiency of the parallel version can be found in section 4. The use of this tool in parallel global optimization over continuous variables is discussed in section 5.

2. Squared error clustering

The squared error clustering technique starts with an initial partition C that can be chosen arbitrarily as long as the restrictions $|C_i| \geq 1$ are obeyed. The deviation sum of this partition is determined by computing the K centers and by calculating the squared distance between each of the N data vectors and the center of the cluster the vector has been assigned to. Both steps can be done in O(NM) time. A new partition is constructed as follows: for each data vector $x_i$, calculate the distance to each center $\bar{x}_k$ and assign $x_i$ to the cluster with the smallest distance. This assignment step requires O(NMK) time. Then the deviation sum of the new partition is computed as described before. The procedure is iterated until the deviation sum of a new partition is no better than that of the previous one, or until a maximal number of iterations has been performed, so that the time complexity of the squared error clustering method is O(NMKI), where I denotes the maximal number of iterations. Consequently, a parallelization of this method is desirable. The sequential algorithm is sketched in Figure 1.

    choose any initial partition
    determine the center matrix
    determine the deviation sum
    repeat
        for each x_i (i = 1, ..., N)
            for each center x_k (k = 1, ..., K)
                calculate distance ||x_i - x_k||^2
            endfor
            assign x_i to the cluster with minimal distance
        endfor
        determine new center matrix
        determine new deviation sum
    until termination criterion applies

Figure 1: Skeleton of the squared error clustering method
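For concreteness, the skeleton of Figure 1 can be rendered in C roughly as follows. This is a minimal sketch, not the paper's implementation: the row-major array layout and all identifiers are ours, the caller is assumed to seed the center matrix (e.g. with K distinct data vectors), the deviation sum is measured at assignment time, and empty clusters are simply left at the origin although the paper requires |C_k| >= 1.

    #include <float.h>
    #include <stdlib.h>

    /* One run of the squared error clustering method of Figure 1.
       x: N x M data matrix (row-major), c: K x M center matrix
       (row-major, seeded by the caller), label: cluster per vector. */
    void cluster(const double *x, double *c, int *label,
                 int N, int M, int K, int maxiter)
    {
        int *size = malloc(K * sizeof *size);
        double prev = DBL_MAX;

        for (int it = 0; it < maxiter; it++) {
            double dev = 0.0;
            /* assignment step, O(NMK): nearest center per data vector */
            for (int i = 0; i < N; i++) {
                double best = DBL_MAX;
                for (int k = 0; k < K; k++) {
                    double d = 0.0;
                    for (int m = 0; m < M; m++) {
                        double t = x[i * M + m] - c[k * M + m];
                        d += t * t;
                    }
                    if (d < best) { best = d; label[i] = k; }
                }
                dev += best;     /* deviation sum of the new partition */
            }
            if (dev >= prev)     /* no improvement: terminate */
                break;
            prev = dev;
            /* center update, O(NM): mean of each cluster */
            for (int k = 0; k < K * M; k++) c[k] = 0.0;
            for (int k = 0; k < K; k++) size[k] = 0;
            for (int i = 0; i < N; i++) {
                size[label[i]]++;
                for (int m = 0; m < M; m++)
                    c[label[i] * M + m] += x[i * M + m];
            }
            for (int k = 0; k < K; k++)
                if (size[k] > 0)     /* empty clusters stay at origin */
                    for (int m = 0; m < M; m++)
                        c[k * M + m] /= size[k];
        }
        free(size);
    }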

3. Parallelization on a unidirectional ring

The squared error clustering method has been parallelized on several parallel computers with different architectures (see [2] and the references therein). Ranka and Sahni (1991) designed a parallel algorithm for an MIMD parallel computer with a tree topology. Their algorithm requires O(NMK/P + MK log P) time, where P denotes the number of processors. In the following it is shown that the time complexity can be reduced to O(NMK/P + MK + P), although the topology of the processors is a unidirectional ring with a much larger diameter.

Assume that the data matrix is distributed over the P processors so that each of Q = N mod P processors gets ⌈N/P⌉ data vectors and the remaining P - Q processors get ⌊N/P⌋ data vectors each, as proposed in [2]. Moreover, each processor is assumed to hold a copy of the center matrix, which contains the center vectors of the K clusters. The assignment step can then be performed in parallel and requires O(NMK/P) time.
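In code, this block distribution might read as follows (a small self-contained sketch; the function names block_size and block_start are ours, not the paper's):

    /* number of data vectors held by processor p, 0 <= p < P */
    int block_size(int N, int P, int p)
    {
        return N / P + (p < N % P ? 1 : 0);
    }

    /* row index of processor p's first data vector in X */
    int block_start(int N, int P, int p)
    {
        int b = N / P, q = N % P;
        return p < q ? p * (b + 1) : q * (b + 1) + (p - q) * b;
    }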

The only remaining problem is how to build up the new center matrix in parallel. Each processor can compute the partial sums of the center vectors according to the assignment of the data vectors kept in its memory. Thus, each processor computes a partial center matrix in O(NM/P) time. The sum of these P partial center matrices then gives the final center matrix. Ranka and Sahni [2] solved this subproblem by passing and adding the partial center matrices from the leaf processors of their tree topology to the root processor, so that the final center matrix is known to the root processor after O(MK log P) time. A broadcast operation from the root returns the final center matrix to all other processors in O(MK log P) time. Using this method on a unidirectional ring topology would result in a time complexity of O(MKP). In the following a more sophisticated strategy is proposed which requires only O(MK) time, assuming that P ≤ K; this is a reasonable assumption for our application, as can be seen in section 5.

Assume that the partial center matrices have been computed. Each partial center matrix is divided into P submatrices so that the first K mod P submatrices contain ⌈K/P⌉ center vectors and the remaining ones contain ⌊K/P⌋ center vectors each. Now each processor p ∈ {0, 1, ..., P - 1} sends its pth submatrix to processor (p + 1) mod P, where the received submatrix is added to the processor's own submatrix. This can be done in O(KM/P) time. Then each processor sends the submatrix just updated to its neighbor. After P - 1 communication steps each processor holds one submatrix of the final center matrix. With the same communication scheme these submatrices are then sent through the ring. Again, each step requires O(KM/P) time. Therefore, the total time complexity is O(2 (P - 1) KM/P) = O(KM).

To illustrate the method, consider the case of P = 4 processors. Let $A_{ij}$ denote the jth submatrix on processor i. Figure 2 shows the communication scheme, where the number k above an arrow represents the kth parallel communication step, k = 1, ..., 2(P - 1); the numbers in parentheses denote steps followed by a store operation, whereas those without parentheses are steps followed by a submatrix addition. For example, submatrix $A_{00}$ is sent to processor 1 in the first communication step. Processor 1 computes the sum $A_{10} + A_{00}$ and stores the result as its new submatrix $A'_{10}$. In the second communication step the new submatrix $A'_{10}$ is sent to processor 2, which computes the sum $A_{20} + A'_{10} = A_{20} + A_{10} + A_{00}$ and stores it as its new submatrix $A'_{20}$. In the third communication step submatrix $A'_{20}$ is sent to processor 3, which computes the new submatrix $A'_{30} = A_{30} + A'_{20} = A_{30} + A_{20} + A_{10} + A_{00}$. Clearly, submatrix $A'_{30}$ is submatrix 0 of the final center matrix. These operations are performed for all submatrices in parallel, so that the final center submatrices are distributed over the ring after P - 1 = 3 communication steps. Table 1 gives a snapshot of the contents of the partial center matrices on each processor at this point. Now P - 1 = 3 additional communication steps are required to send and copy the final center submatrices to each processor in parallel.

[Figure 2, not reproducible in text form, shows the four processors with their submatrices $A_{i0}, \ldots, A_{i3}$ and arrows between ring neighbors labeled with the step numbers 1, ..., 2(P - 1) = 6, parenthesized where the step ends in a store operation.]

Figure 2: Communication scheme to compute the center matrix on 4 processors

    submatrix | processor 0                       | processor 1
    ----------+-----------------------------------+----------------------------------
        0     | A_{00}                            | A_{10} + A_{00}
        1     | A_{01} + A_{31} + A_{21} + A_{11} | A_{11}
        2     | A_{02} + A_{32} + A_{22}          | A_{12} + A_{02} + A_{32} + A_{22}
        3     | A_{03} + A_{33}                   | A_{13} + A_{03} + A_{33}

    submatrix | processor 2                       | processor 3
    ----------+-----------------------------------+----------------------------------
        0     | A_{20} + A_{10} + A_{00}          | A_{30} + A_{20} + A_{10} + A_{00}
        1     | A_{21} + A_{11}                   | A_{31} + A_{21} + A_{11}
        2     | A_{22}                            | A_{32} + A_{22}
        3     | A_{23} + A_{13} + A_{03} + A_{33} | A_{33}

Table 1: Snapshot of the contents of the partial center matrices on 4 processors after 3 communication steps
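The communication scheme is easy to verify in code. The following single-process C program is our own simulation of the 2(P - 1) steps, not the paper's transputer code (which used explicit channel communication); a single double stands in for each KM/P-sized submatrix. Printing A after step P - 1 reproduces Table 1; at the end every row holds the complete sums.

    #include <stdio.h>

    #define P 4   /* number of processors on the ring */

    int main(void)
    {
        double A[P][P], buf[P];

        /* dummy partial "submatrices": one value per submatrix */
        for (int p = 0; p < P; p++)
            for (int j = 0; j < P; j++)
                A[p][j] = 100.0 * p + j;

        for (int s = 1; s <= 2 * (P - 1); s++) {
            /* buf models the P simultaneous transfers of step s:
               processor p sends its pth submatrix at step 1 and
               thereafter the submatrix it updated in the previous step */
            for (int p = 0; p < P; p++)
                buf[p] = A[p][((p - s + 1) % P + P) % P];
            /* each processor p receives from its ring predecessor p-1 */
            for (int p = 0; p < P; p++) {
                int j = ((p - s) % P + P) % P;
                double v = buf[(p - 1 + P) % P];
                if (s <= P - 1)
                    A[p][j] += v;   /* steps 1..P-1: add (sum phase)    */
                else
                    A[p][j] = v;    /* steps P..2(P-1): store (copy phase) */
            }
        }

        /* every row now holds the complete sum 600 + 4j in column j */
        for (int p = 0; p < P; p++) {
            for (int j = 0; j < P; j++)
                printf("%8.1f", A[p][j]);
            printf("\n");
        }
        return 0;
    }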

It is still left open how to compute the deviation sum of the new partition. After the new center matrix is known to each processor, each processor computes a partial deviation sum by summing up the squared distances of its data vectors to the centers they have been assigned to. This can be done in O(NM/P) time. The partial deviation sum of each processor is then sent through the ring, so that the final deviation sum is known to every processor after O(P) time. Summarizing, the time complexities of the different steps within one iteration are:

    step | operation                           | time
    -----+-------------------------------------+----------
      1  | assign data vectors to clusters     | O(NMK/P)
      2  | compute partial center matrix       | O(NM/P)
      3  | build up final center matrix        | O(MK)
      4  | compute partial deviation sum value | O(NM/P)
      5  | build up final deviation sum value  | O(P)

Consequently, the total time complexity of one iteration is O(NMK/P + MK + P) = O(NMK/P + MK) = O(NMK/P) for P ≤ K.

4. Numerical results: speed up and efficiency

The parallel clustering algorithm has been implemented on a transputer system with 44 T805/25 and 4 T805/30 transputers with 4 MB of memory each, connected as a mesh of size 8 × 6. The code was written in the programming language C, and the communication is performed via the POSIX library under the operating system Helios. The program has been tested for N = 2048, 4096 and 8192 data vectors of dimension M = 20, uniformly distributed in the hypercube [0, 10]^20, which are to be grouped into K = 40 clusters. The use of random data vectors is justified because speed up and efficiency measurements are of primary concern here; for the same reason the algorithm was stopped after I = 20 iterations. Figures 3 and 4 summarize the speed up and efficiency results, respectively. As expected, the efficiency improves as more data vectors are to be clustered. For example, the time required for 8192 data vectors drops from 872.37 seconds on one processor to 28.38 seconds on 40 processors, a speed up of about 30.7 and hence an efficiency of about 77 percent.

The relation between computing and communication time could be improved in several ways. Firstly, the so-called task force manager (TFM) maps the parallel code onto the physical processor network, an 8 × 6 grid in this case. The strategy of the TFM is to minimize the distances to the root processor, which is connected to the host computer. This results in the following mapping of the logical ring topology used by the parallel clustering algorithm with 40 processors (shells of equal distance from the root run diagonally; '-' marks an unused node):

    root  38  35  31  26   -
      39  37  34  30  25  20
      36  33  29  24  19  14
      32  28  23  18  13   8
      27  22  17  12   7   3
      21  16  11   6   2   -
      15  10   5   1   -   -
       9   4   0   -   -   -

Note that the algorithm sends its data along the ring path 0 → 1 → ... → 39 → 0. Therefore, the time required to move data from one processor to its neighbor varies between 1 and 9 time units. This problem might be circumvented by an explicit assignment of processes to nodes by the user. Secondly, the use of the POSIX library for communication reduces the data transfer rate a transputer is capable of; low-level routines with better transfer rates are available. Despite these shortcomings, the efficiency measurements of the current implementation are not too bad, and they are expected to improve significantly after the modifications suggested above.

[Figures 3 and 4 are plots of speed up and of efficiency in percent, respectively, against the number of processors (1 to 40), with one curve each for N = 8192, 4096, and 2048.]

Figure 3: Speed up for N = 2048, 4096, 8192 data vectors of dimension M = 20 assigned to K = 40 clusters.

Figure 4: Efficiency for N = 2048, 4096, 8192 data vectors of dimension M = 20 assigned to K = 40 clusters.

5. Application to global optimization

One way to find the global minimum $f(x^*) = \min\{f(x) : x \in D \subseteq \mathbb{R}^M\}$ of an objective function f over the feasible region D is to sample a number of points uniformly distributed over D and to select the best point as the candidate solution. The multistart technique employs a local search algorithm which is applied several times with different starting points; the best local solution found is regarded as the candidate solution. Although these methods can be parallelized easily, they are not very efficient in terms of function evaluations. Better strategies are available which can be parallelized, too (see [3]).

The idea of using clustering algorithms as a tool in the field of global optimization dates back to the 1970s (see [4] for a survey). Usually, a number of points are sampled uniformly over D. Then only a fraction of, say, 10 percent of the best solutions is selected for further processing. It is assumed that this fraction of selected points reflects the shape of the regions of attraction of the local minimizers. Now the clustering method is used to identify these shapes by grouping the corresponding points into clusters. Finally, a local search technique is applied from each cluster center to locate the corresponding local minimum. Again, the best local minimum found is regarded as the candidate solution. Obviously, this method can be used only for moderate problem dimensions on a uniprocessor system.

A parallel version might be implemented as follows. The generation of the sample points as well as their evaluation can be done in parallel without communication. Then each processor selects the fraction of the best solutions among its own samples. Note that this is not the same as selecting this fraction of the best solutions over all sampled points; one can expect, however, that the difference becomes smaller the more points are sampled. The parallel clustering method described before can then be used to identify the shapes of the regions of attraction. Since only a rough approximation of these regions is necessary, few iterations of the clustering method have to be performed. Finally, each processor selects a center point from the center matrix and applies a local search. For an efficient use of the parallel computer it is desirable that there are at least as many center points as processors. This is why the assumption P ≤ K was termed reasonable in a previous section.
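As an illustration of this workflow, consider the following C sketch of the uniprocessor variant. Everything here is our own toy instantiation, not the paper's code: the sphere objective, the 10 percent fraction, and the single cluster-center mean standing in for the clustering and local search stages.

    #include <stdio.h>
    #include <stdlib.h>

    #define DIM 2          /* problem dimension M (toy value; the
                              printf below assumes DIM == 2)        */
    #define NSAMP 1000     /* number of uniform samples             */

    typedef struct { double x[DIM]; double f; } point;

    /* toy objective with global minimum f(0) = 0 on D = [-5, 5]^DIM */
    static double sphere(const double *x)
    {
        double s = 0.0;
        for (int m = 0; m < DIM; m++)
            s += x[m] * x[m];
        return s;
    }

    static int by_f(const void *a, const void *b)
    {
        double d = ((const point *)a)->f - ((const point *)b)->f;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        point s[NSAMP];
        srand(1);

        /* step 1: sample uniformly over D and evaluate */
        for (int i = 0; i < NSAMP; i++) {
            for (int m = 0; m < DIM; m++)
                s[i].x[m] = -5.0 + 10.0 * rand() / RAND_MAX;
            s[i].f = sphere(s[i].x);
        }

        /* step 2: keep the best 10 percent */
        qsort(s, NSAMP, sizeof(point), by_f);
        int n = NSAMP / 10;

        /* steps 3 and 4: cluster the n selected points (e.g. with the
           cluster() sketch from section 2) and start a local search
           from each center; one center and no search stand in here */
        double c[DIM] = {0.0};
        for (int i = 0; i < n; i++)
            for (int m = 0; m < DIM; m++)
                c[m] += s[i].x[m] / n;
        printf("local search start: (%g, %g), f = %g\n",
               c[0], c[1], sphere(c));
        return 0;
    }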

6. Conclusions

The squared error clustering method has been parallelized on an MIMD parallel computer with unidirectional ring communication with a time complexity of O(NMK/P) if P ≤ K. Speed up measurements indicate that improvements are to be expected as soon as some shortcomings of the Helios operating system are circumvented. The parallelized algorithm can be used as a tool in sophisticated parallel global optimization methods; it was sketched how such a method could be designed.

References

[1] H. Späth, Cluster-Formation und -Analyse. Oldenbourg, München and Wien, 1983.

[2] S. Ranka and S. Sahni, Clustering on a hypercube multicomputer. IEEE Transactions on Parallel and Distributed Systems, 2(2):129-137, 1991.

[3] G. Rudolph, Parallel approaches to stochastic global optimization. In W. Joosen and E. Milgrom (eds.), Parallel Computing: From Theory to Sound Practice, Proceedings of the European Workshop on Parallel Computing (EWPC 92). IOS Press, Amsterdam, 1992, pp. 256-267.

[4] A. Törn and A. Žilinskas, Global Optimization, Lecture Notes in Computer Science Vol. 350. Springer, Berlin and Heidelberg, 1989.