A Multi-threading in Prolog to Implement K-means Clustering


SURASITH TAOKOK, PRACH PONGPANICH, NITTAYA KERDPRASOP, KITTISAK KERDPRASOP
Data Engineering Research Unit, School of Computer Engineering
Suranaree University of Technology
111 University Avenue, Muang District, Nakhon Ratchasima 30000, THAILAND

Abstract: - Prolog, the language used in this paper, is a logic programming language with multi-threading support. Programmers often use multi-threading to reduce execution time, because several threads can split a task and work on it concurrently. In this paper we propose an algorithm, and its implementation in Prolog, for k-means clustering with multi-threading. The main objective is to speed up execution. The experiments compare k-means with multi-thread k-means and report the percentage of speedup; the results support our claim.

Key-Words: - Data Mining, Clustering, Modified k-means, Multi-thread k-means, Logic programming, Prolog

1 Introduction
Prolog is a general-purpose logic programming language, and many Prolog systems, such as SWI-Prolog, SICStus Prolog, CIAO Prolog and Qu-Prolog, support multi-threading. In this paper we use SWI-Prolog [2] to implement the k-means clustering algorithm. SWI-Prolog is open source, supports multi-threading, and is available for the Linux, Windows and Macintosh platforms. We can profit from multi-threaded Prolog by splitting a large task into subtasks that run faster on multi-core processors.

The k-means clustering algorithm is an unsupervised learning method that separates data points into groups. Its running time depends on the number of data points, the number of clusters and the number of iterations: the computational complexity is O(nkt), where n is the number of data points, k is the number of clusters and t is the number of iterations until the centroids become stable. When the data set is large the running time grows accordingly, and the traditional k-means algorithm becomes inefficient.

We therefore apply multi-threading in Prolog [1][4] to the k-means algorithm [6]; we call the result the multi-thread k-means (MTK) algorithm. MTK is not a parallel k-means algorithm [3][5][7]; instead, one sub-process of the original k-means algorithm, the computation of the new centroids, is redesigned to use multiple threads. We create threads to distribute this work so that all new centroids are computed concurrently (a short sketch of the SWI-Prolog threading primitives involved is given below).
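As background, the following minimal sketch (our own illustration, not taken from the paper) shows the SWI-Prolog primitives this style of programming relies on: thread_create/3 starts the workers, and a message queue collects their partial results. Here each worker sums one sublist, so the sublists are processed concurrently.

:- use_module(library(lists)).   % sum_list/2 (autoloaded in SWI-Prolog)

% Sum each sublist in a separate thread and collect the results.
% Note: results arrive in completion order, not in input order.
parallel_sums(ListOfLists, Sums) :-
    message_queue_create(Queue),
    forall(member(L, ListOfLists),
           thread_create(sum_worker(L, Queue), _, [detached(true)])),
    length(ListOfLists, N),
    collect(N, Queue, Sums),
    message_queue_destroy(Queue).

sum_worker(List, Queue) :-
    sum_list(List, Sum),
    thread_send_message(Queue, Sum).

collect(0, _, []) :- !.
collect(N, Queue, [Sum|Sums]) :-
    thread_get_message(Queue, Sum),
    N1 is N - 1,
    collect(N1, Queue, Sums).

For example, the query ?- parallel_sums([[1,2,3],[4,5,6]], Sums). binds Sums to [6,15] or [15,6], depending on which worker finishes first.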
The organization of the rest of this paper is as follows. Related work on multi-threading, k-means and parallel k-means is discussed in Section 2. Our proposed algorithm, multi-thread k-means, is explained in Section 3. The implementation (a complete source code is given in the appendix) and the experimental results are presented in Section 4. The conclusion and directions for future research appear in the last section.

2 Related Works
The k-means algorithm [6] was presented by J. MacQueen in 1967 and has since been applied in many areas. With the advent of multi-core processors, multi-threading has been brought to Prolog [4] and used to support parallel k-means algorithms [5]. J. Wielemaker [1] presents multi-threading in SWI-Prolog; his work shows a speedup when running multiple threads on a multi-core processor. M. Joshi [5] presents a parallel k-means algorithm using the message passing interface (MPI) on distributed-memory multiprocessor systems (Sun workstations); the method benefits from the multiprocessor environment. B. Hohlt [8] introduces a parallel k-means algorithm implemented in C++ with pthreads. N. Kerdprasop and K. Kerdprasop [3] propose a parallel k-means implemented in Erlang; their experiments show a speedup when clustering large data sets.

Hence, in this paper we propose the MTK algorithm and implement it in the logic programming language Prolog using its multi-threading facilities.

3 Proposed Algorithm
The k-means algorithm [6] starts by choosing k initial centroids at random. It then assigns each data point to the nearest cluster and re-computes the new centroid of each of the k clusters. If the new centroids are not yet stable, the algorithm iterates: it reassigns the data points to the nearest clusters and re-computes the centroids again, until the centroids no longer change. The k-means algorithm is shown in Algorithm 1.

Algorithm 1 K-means (KM)
Input: number of clusters and a set of data points
Output: k centroids and the members of each cluster
Steps
1. Select initial centroids C = <C1, C2, ..., Ck>
2. Repeat
   2.1 Assign each data point to its nearest cluster center
   2.2 Re-compute the cluster centers using the current cluster memberships
3. Until there is no further change in the assignment of the data points to cluster centers

The k-means algorithm leaves room to distribute some of its sequential work so that it is performed concurrently. We observe that, once all data points have been assigned to clusters, the new centroids can be computed concurrently. We therefore adapt the k-means algorithm to use multi-threading. The pseudo code of the modified k-means algorithm (multi-thread k-means), adapted to support this concurrency, is shown in Algorithm 2.

Algorithm 2 Multi-thread k-means (MTK)
Input: number of clusters and a set of data points
Output: k centroids and the members of each cluster
Steps
1. Select initial centroids C = <C1, C2, ..., Ck>
2. Assign each data point to its nearest cluster center
3. Create threads T = <T1, T2, ..., Tk>, one for each centroid in C = <C1, C2, ..., Ck>
4. For each thread Ti (i = 1 to k)
   4.1 Re-compute the cluster center <Ti : cal(Ci)>
   4.2 Return the new centroid Ci to the set of new centroids C'
5. Check the stability of the centroids
   5.1 If C != C' then set C = C' and go to step 2
   5.2 If C == C' then stop and return C and the cluster members

The MTK algorithm adds a multi-threaded process to the centroid re-computation. The re-computation step acts as the master process: it is responsible for creating the threads, sending each thread the set of data points of one cluster, and gathering the recalculated centroids. This step is repeated as long as the old and new centroids have not converged, and the worker threads are created anew each time the re-computation step starts. The communication between the master process and the threads is illustrated in Fig.1, and a short Prolog sketch of the concurrent re-computation step is given after the figure; the complete implementation appears in the appendix.

Fig.1 A diagram illustrating the communication between the master process and the threads
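The appendix implementation realizes step 4 by creating one thread per cluster with thread_create/3 and collecting the results afterwards. As an illustration only (this is not the paper's code), the same per-cluster centroid re-computation can be sketched with concurrent_maplist/3 from SWI-Prolog's library(thread), assuming three-dimensional points:

:- use_module(library(thread)).   % concurrent_maplist/3
:- use_module(library(apply)).    % foldl/4

% Step 4 of Algorithm 2: re-compute every cluster's centroid,
% running the per-cluster computations concurrently.
recompute_centroids(Clusters, Centroids) :-
    concurrent_maplist(centroid, Clusters, Centroids).

% Mean of a non-empty list of 3-D points [X,Y,Z].
centroid(Points, [Mx,My,Mz]) :-
    length(Points, N),
    N > 0,
    foldl(add_point, Points, [0,0,0], [Sx,Sy,Sz]),
    Mx is Sx / N,
    My is Sy / N,
    Mz is Sz / N.

add_point([X,Y,Z], [Ax,Ay,Az], [Bx,By,Bz]) :-
    Bx is Ax + X,
    By is Ay + Y,
    Bz is Az + Z.

Design-wise this matches step 4: concurrent_maplist/3 runs the per-cluster goals on a pool of worker threads, waits for all of them, and returns the new centroids in cluster order, which is what the stability check in step 5 needs.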

4 Experimental and Results
We implement the proposed algorithms in Prolog (standard SWI-Prolog). The implementation of the KM and MTK algorithms as Prolog programs is given in the appendix. A screenshot of running the program (SWI-Prolog Multi-threaded, 32 bits) is shown in Fig.2. The program is run with the command

cluster(K).

where the argument K is the number of clusters. Before running the program, the data file points.pl must exist in the working directory. The data are given as a single predicate holding a list of three-dimensional points, item([P1,P2,...,Pn]), for example:

item([ [-4,8,-7],[-9,0,-5],[8,4,4],
       [9,5,6],[-4,-5,-7],[-2,-1,3],
       [10,11,0],[0,-15,7],[2,-1,3] ]).

The predicate item([P1,P2,...,Pn]) is the set of data points clustered by both the KM and MTK programs.

Fig.2 A screenshot illustrating a run of the MTK program generating 2 centroids

The screenshot in Fig.2 shows the command that runs the MTK program for a 2-cluster classification; the running time is printed once clustering has finished.

4.1 Performance of Multi-thread k-means
We evaluate the performance of the proposed KM and MTK algorithms on a synthetic three-dimensional data set of 10,000 points. The computational speed of k-means compared with multi-thread k-means is given in Table 1. The experiments are performed on a laptop computer with an Intel(R) Core(TM) i-series processor, 4 GB of memory, and the Windows 7 32-bit operating system.

Table 1 Execution time of KM versus MTK with 10,000 data points (the number of clusters equals the number of threads); columns: number of clusters, KM time (sec), MTK time (sec), speedup (%)

The results in Table 1 show that the running time of MTK is lower than that of KM, and that the gap, and hence the speedup, grows as the number of clusters increases. The running-time comparison is shown in Fig.3.

Fig.3 Running time comparison of KM versus MTK with 10,000 data points

Over the tested numbers of clusters (2 to 10), the running times observed for KM and MTK on the 10,000 data points give an average speedup of more than 30%. The percentage of speedup of MTK over KM is shown in Fig.4.

Fig.4 Percentage of running-time speedup for different numbers of clusters with 10,000 data points

4.2 Speedup of Multi-thread k-means
In this section we evaluate the percentage of running-time speedup in more detail. We prepare a series of three-dimensional data sets of 500, 1,000, 2,000, 3,000, 4,000, 5,000, 8,000, 10,000, and 12,000 points. Each data set is clustered with different numbers of clusters: 2, 4, 6, 8, and 10. The running times are shown in Table 2 and Table 3, and the percentages of speedup are shown in Table 4.

Table 2 Execution time of KM versus MTK with 2, 4 and 6 clusters; columns: data size, KM and MTK running times (sec) for 2, 4 and 6 clusters

Table 3 Execution time of KM versus MTK with 8 and 10 clusters; columns: data size, KM and MTK running times (sec) for 8 and 10 clusters

Table 4 Speedup percentage for different numbers of clusters and data sizes; columns: data size, speedup (%) for 2, 4, 6, 8 and 10 clusters

The percentages of execution-time speedup measured in these experiments are plotted in Fig.5. It can be seen from the results that the percentage of speedup increases as the number of data points and the number of clusters increase.

Fig.5 Comparison of speedup percentages at different data sizes and numbers of threads

5 Conclusion
Most processors today are multi-core, and traditional programs and algorithms do not use this hardware efficiently or effectively. K-means is the best-known and most commonly used clustering algorithm; it is simple, but it is not very efficient when implemented in the traditional, sequential style. In this paper we propose the design and implementation of KM and MTK in logic programming. The MTK algorithm modifies KM by integrating a multi-threaded process into the algorithm. The experimental results show that the multi-threading method considerably speeds up the computation, especially when tested on multi-core processors. Our future work will focus on a fully parallel k-means algorithm and its applications.

6 Appendix
The source code presented in this section is in SWI-Prolog format. A line preceded by "%" is a comment. We provide two versions of the clustering program: k-means and multi-thread k-means. Each program starts with comments explaining how to run it (see also the short session sketch below).
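As a usage illustration (assuming the k-means listing below is saved as kmeans.pl next to points.pl; the file name is ours), a typical SWI-Prolog session looks like this:

% Start SWI-Prolog in the directory that contains the program and points.pl,
% load the program, and cluster the data into 3 groups:
%
%   ?- [kmeans].
%   true.
%
%   ?- cluster(3).
%   ... intermediate centroids are printed here ...
%   time-<elapsed seconds>
%   true.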

K-means Clustering

% The file "points.pl" must exist in the working directory.
% Example of data file:
% item([ [-4,8,-7], [-9,0,-5], [8,4,4], [9,5,6], [-4,-5,-7],
%        [-2,-1,3], [10,11,0], [0,-15,7], [2,-1,3] ]).
% Then test the program with this command:
% cluster(2).   %% use 2 or more

%% -- Reserve memory
:- set_prolog_stack(global, limit(3*10**9)),
   set_prolog_stack(local,  limit(4*10**9)).

%% -- Main program
cluster(K) :-
    ensure_loaded('points.pl'),
    pc_time(H1-M1-S1),
    item(Item),
    initial(Item, K, Mean),
    writeln(Mean),
    kmean(Item, Mean),
    pc_time(H2-M2-S2),
    TS1 is H1*60*60 + M1*60 + S1,
    TS2 is H2*60*60 + M2*60 + S2,
    DTS is TS2 - TS1,
    writeln(time-DTS).

%% -- Return the execution time
%% -- example: time-15.9
pc_time(CT) :-
    get_time(T),
    stamp_date_time(T, date(_, _, _, H, M, S, 0, 'UTC', -), 'UTC'),
    CT = H-M-S.

%% -- Initial centroids picked from the data list
initial(_, 0, []) :- !.
initial([Hitem|Titem], K, [Hitem|Tmean]) :-
    Nk is K - 1,
    initial(Titem, Nk, Tmean).

%% -- K-means main loop
kmean(Item, Mean) :-
    calculate_dist(Item, Mean, CaledItem),
    split_item(Mean, CaledItem, SplitItem),
    calculate_mean(SplitItem, NewMean),
    writeln(NewMean),
    (   Mean = NewMean
    ->  true, !
    ;   kmean(Item, NewMean)
    ), !.

%% -- Calculate distances and assign
%% -- each point to the nearest cluster
calculate_dist([], _, []) :- !.
calculate_dist([Hitem|Titem], Mean, [Hitem-SelMean|TSelMean]) :-
    calculating(Hitem, Mean, Dist),
    select_cluster(Dist, Mean, SelMean),
    calculate_dist(Titem, Mean, TSelMean).

%% -- Euclidean distance for 3-dimensional data
calculating(_, [], []) :- !.
calculating([Hi1,Hi2,Hi3], [[Hm1,Hm2,Hm3]|Tmean], [Dist|Tdist]) :-
    Caler is (Hi1-Hm1)^2 + (Hi2-Hm2)^2 + (Hi3-Hm3)^2,
    Dist is sqrt(Caler),
    calculating([Hi1,Hi2,Hi3], Tmean, Tdist).

%% -- Each point chooses the nearest cluster
select_cluster([_], [Mean], Mean) :- !.
select_cluster([Hd1,Hd2|Tdist], [Hm1,Hm2|Tmean], SelMean) :-
    (   Hd1 < Hd2
    ->  select_cluster([Hd1|Tdist], [Hm1|Tmean], SelMean)
    ;   select_cluster([Hd2|Tdist], [Hm2|Tmean], SelMean)
    ).

%% -- Split the data into their clusters
split_item([], _, []) :- !.
split_item([Hm|Mean], CaledItem, [Splited|SplitItem]) :-
    spliting(Hm, CaledItem, Splited),
    split_item(Mean, CaledItem, SplitItem).

spliting(_, [], []) :- !.
spliting(Mean, [Hitem-SelMean|Titem], Splited) :-
    spliting(Mean, Titem, TSplited),
    (   Mean = SelMean
    ->  Splited = [Hitem|TSplited]
    ;   Splited = TSplited
    ).

%% -- Re-compute the new centroid values
calculate_mean([], []) :- !.
calculate_mean([HS|SplitItem], [HR|NewMean]) :-
    cal_mean(HS, HR),
    calculate_mean(SplitItem, NewMean).

cal_mean(L, R) :-
    mean_me(0, [0,0,0], L, R).

mean_me(N, [Sx,Sy,Sz], [[X,Y,Z]|T], R) :-
    NN is N + 1,
    NSx is Sx + X,
    NSy is Sy + Y,
    NSz is Sz + Z,
    mean_me(NN, [NSx,NSy,NSz], T, R).
mean_me(N, [Sx,Sy,Sz], [], [RSx,RSy,RSz]) :-
    RSx is Sx / N,
    RSy is Sy / N,
    RSz is Sz / N.

% End of K-means %

Multi-thread K-means Clustering

% The file "points.pl" must exist in the working directory.
% Example of data file:
% item([ [-4,8,-7], [-9,0,-5], [8,4,4], [9,5,6], [-4,-5,-7],
%        [-2,-1,3], [10,11,0], [0,-15,7], [2,-1,3] ]).
% Then test the program with this command:
% cluster(2).   %% use 2 or more; the argument is the number of clusters

%% -- Reserve memory
:- set_prolog_stack(global, limit(2*10**9)),
   set_prolog_stack(local,  limit(2*10**9)).

%% -- Main program
cluster(K) :-
    ensure_loaded('points.pl'),
    pc_time(H1-M1-S1),
    item(Item),
    initial(Item, K, Mean),
    writeln(Mean),
    kmean(Item, Mean),
    pc_time(H2-M2-S2),
    TS1 is H1*60*60 + M1*60 + S1,
    TS2 is H2*60*60 + M2*60 + S2,
    DTS is TS2 - TS1,
    writeln(time-DTS).

%% -- Return the execution time
%% -- example: time-15.9
pc_time(CT) :-
    get_time(T),
    stamp_date_time(T, date(_, _, _, H, M, S, 0, 'UTC', -), 'UTC'),
    CT = H-M-S.

%% -- Initial centroids picked from the data list
initial(_, 0, []) :- !.
initial([Hitem|Titem], K, [Hitem|Tmean]) :-
    Nk is K - 1,
    initial(Titem, Nk, Tmean).

%% -- Multi-thread K-means main loop
kmean(Item, Mean) :-
    calculate_dist(Item, Mean, CaledItem),
    split_item(Mean, CaledItem, SplitItem),
    calculate_mean(SplitItem, TL),
    wait_for_threads(TL, NewMean),
    writeln(NewMean),
    (   intersection(Mean, NewMean, Mean)
    ->  true, !
    ;   kmean(Item, NewMean)
    ), !.

%% -- Calculate distances and assign each point
%% -- to the nearest cluster
calculate_dist([], _, []) :- !.
calculate_dist([Hitem|Titem], Mean, [Hitem-SelMean|TSelMean]) :-
    calculating(Hitem, Mean, Dist),
    select_cluster(Dist, Mean, SelMean),
    calculate_dist(Titem, Mean, TSelMean).

%% -- Euclidean distance for 3-dimensional data
calculating(_, [], []) :- !.
calculating([Hi1,Hi2,Hi3], [[Hm1,Hm2,Hm3]|Tmean], [Dist|Tdist]) :-
    Caler is (Hi1-Hm1)^2 + (Hi2-Hm2)^2 + (Hi3-Hm3)^2,
    Dist is sqrt(Caler),
    calculating([Hi1,Hi2,Hi3], Tmean, Tdist).

%% -- Each point chooses the nearest cluster
select_cluster([_], [Mean], Mean) :- !.
select_cluster([Hd1,Hd2|Tdist], [Hm1,Hm2|Tmean], SelMean) :-
    (   Hd1 < Hd2
    ->  select_cluster([Hd1|Tdist], [Hm1|Tmean], SelMean)
    ;   select_cluster([Hd2|Tdist], [Hm2|Tmean], SelMean)
    ).

%% -- Split the data into their clusters
split_item([], _, []) :- !.
split_item([Hm|Mean], CaledItem, [Splited|SplitItem]) :-
    spliting(Hm, CaledItem, Splited),
    split_item(Mean, CaledItem, SplitItem).

spliting(_, [], []) :- !.
spliting(Mean, [Hitem-SelMean|Titem], Splited) :-
    spliting(Mean, Titem, TSplited),
    (   Mean = SelMean
    ->  Splited = [Hitem|TSplited]
    ;   Splited = TSplited
    ).

%% -- Re-compute the new centroid values.
%% -- One thread is created per cluster to
%% -- re-compute that cluster's new centroid.
calculate_mean([], []) :- !.
calculate_mean([HS|SplitItem], [T0|TL1]) :-
    calculate_mean(SplitItem, TL1),
    thread_create(cal_mean(HS), T0, []).

cal_mean(L) :-
    mean_me(0, [0,0,0], L, R),
    assert(mean(R)).

mean_me(N, [Sx,Sy,Sz], [[X,Y,Z]|T], R) :-
    NN is N + 1,
    NSx is Sx + X,
    NSy is Sy + Y,
    NSz is Sz + Z,
    mean_me(NN, [NSx,NSy,NSz], T, R).
mean_me(N, [Sx,Sy,Sz], [], [RSx,RSy,RSz]) :-
    RSx is Sx / N,
    RSy is Sy / N,
    RSz is Sz / N.

%% -- Wait for all threads to complete their work.
wait_for_threads([], []) :- !.
wait_for_threads([T|TL], NewMean) :-
    (   thread_join(T, true)
    ->  mean(NM),
        retract(mean(NM)),
        wait_for_threads(TL, TM),
        NewMean = [NM|TM]
    ;   wait_for_threads([T|TL], NewMean)
    ).

% End of Multi-thread K-means %

References:
[1] J. Wielemaker, Native Preemptive Threads in SWI-Prolog, Proceedings of ICLP 2003, Lecture Notes in Computer Science, Vol. 2916, Springer, 2003.
[2] J. Wielemaker, T. Schrijvers, M. Triska, T. Lager, SWI-Prolog, accepted for publication in Theory and Practice of Logic Programming (TPLP).
[3] N. Kerdprasop and K. Kerdprasop, A Lightweight Method to Parallel K-means Clustering, International Journal of Mathematics and Computers in Simulation, Vol. 4, Issue 4, 2010.
[4] M. Carro and M. Hermenegildo, Concurrency in Prolog Using Threads and a Shared Database, International Conference on Logic Programming.
[5] M. Joshi, Parallel K-means Algorithm on Distributed Memory Multiprocessors, Technical Report, University of Minnesota, 2003.
[6] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[7] A. Prasad, Parallelization of K-means Clustering Algorithm, Project Report, University of Colorado, 2007.
[8] B. Hohlt, Pthread Parallel K-means, CS267 Applications of Parallel Computing, UC Berkeley, 2001.
