The Application of Clustering Algorithm Based on Improved Canopy-Kmeans in Operators Data

International Conference on Engineering Technology and Application (ICETA 2016)

Haoqian Mai & Lianglun Cheng*
College of Computer Science and Technology, Guangdong University of Technology, Guangzhou, Guangdong, China

ABSTRACT: The Kmeans algorithm is commonly used for user segmentation on operators' data, but its k value is difficult to determine. The Canopy algorithm can help Kmeans determine the k value, but its result depends heavily on the chosen radius. To address these problems, an improved Canopy-Kmeans algorithm is proposed. First, the initial data are divided into K1 coarse clusters by the Canopy algorithm with a small radius. Then, split and merge operations reconstruct the K1 coarse clusters into K2 convergent clusters (K1 ≥ K2). Finally, the K2 cluster centers are used as the initial centers of the Kmeans algorithm. In simulation experiments, the improved Canopy-Kmeans algorithm performs well in running time, clustering result and square error.

Keywords: Canopy-Kmeans; clustering; split; merge

1 INTRODUCTION

Data mining is a technology that finds latent rules in large, incomplete, noisy and fuzzy data [1]. With the advent of the big data era, traditional operators have started to adopt intelligent data management. To offer personalized services, operators must divide their user base into groups effectively and reasonably, and this has become a key issue in operator management. Segmenting customers according to their personal characteristics and designing marketing strategies for each segment is therefore a vital task for operators facing fierce market competition. At present, clustering is the main data mining technique used for customer segmentation.
Clustering algorithms are mainly divided into the following categories: partitional clustering [2], hierarchical clustering [3], density-based clustering [4] and grid-based clustering [5]. Among them, the Kmeans algorithm is frequently used; its time complexity is O(nkt), where n is the sample size, k is the number of categories, and t is the number of iterations [6]. However, the traditional Kmeans algorithm depends heavily on the selection of the initial cluster centers. If the initial centers are poorly chosen, the algorithm easily falls into a local optimum and produces poor clustering results [7]. Paper [8] proposed an improved Kmeans that selects the points closest to the center of the data set as the new cluster centers, and applied it to customer segmentation with good effect. Paper [9] proposed selecting the initial cluster centers by maximum distance, which reduces both the dependency on the initial center selection and the time cost. Paper [10] put forward an improved Kmeans algorithm that optimizes the initial cluster centers by minimum variance according to the distribution of the samples; its clustering results are stable and robust to noise. Besides the selection of the initial cluster centers, determining the K value is another problem of the Kmeans algorithm that also influences the clustering result [11].

This paper proposes an improved Canopy-Kmeans method. First, the initial data are divided into K1 coarse clusters by the Canopy algorithm with a small radius. Then, split and merge operations reconstruct the K1 coarse clusters into K2 convergent clusters (K1 ≥ K2). Finally, the K2 cluster centers are used as the initial centers of the Kmeans algorithm.

*Corresponding author: mhqian@yeah.net

2 CANOPY-KMEANS CLUSTERING ALGORITHM

2.1 Kmeans algorithm

Kmeans is a typical distance-based clustering algorithm in which distance is the measure of similarity: the shorter the distance between two objects, the greater their similarity. Kmeans treats a cluster as a group of neighboring objects, and its goal is to obtain compact, well-separated clusters, so that objects in the same cluster are highly similar while objects in different clusters are less similar [12]. The steps of the Kmeans algorithm are as follows:

(1) Randomly select K objects from the data set as the initial centers.
(2) Calculate the distance from each remaining object to every center, and assign the object to the nearest cluster. Similarity is usually measured by the Euclidean distance. Assuming that c is a cluster center and x is a sample point:

$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2} \tag{1}$$

(3) Recalculate the center of each cluster.
(4) Repeat steps (2) and (3) until the standard measure function converges. The mean square deviation is generally used as the standard clustering measure function:

$$J = \frac{1}{n - 1} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - c_i)^2 \tag{2}$$

where x_ij is the j-th sample of cluster i, c_i is the center of cluster i, and n_i is the number of samples in cluster i.

The Kmeans algorithm has the following drawbacks. The K value must be given in advance and is very difficult to estimate; most of the time we do not know how many categories a given data set should be divided into. Moreover, the initial cluster centers have a great influence on the final result. If the initial centers are poorly chosen, an effective clustering result will not be obtained.

2.2 Canopy algorithm

Canopy is an improvement on the Kmeans algorithm that can be used to determine the number of clusters. With Canopy clustering, the data set is divided into k subsets according to a preset radius, and the centers of these subsets can be used as the initial centers of Kmeans.
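The Kmeans steps (1) to (4) of Section 2.1 can be sketched as follows. This is a minimal illustration rather than the authors' Matlab implementation: the function names are hypothetical, the initial centers are passed in instead of being drawn at random (so that centers obtained by Canopy can be plugged in), and the measure function is assumed to be the sum of squared distances divided by n - 1, where n is taken to be the number of samples.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(x: Point, c: Point) -> float:
    """Equation (1): Euclidean distance between sample x and center c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Step (2): index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points]


def update_centers(points, labels, centers) -> List[List[float]]:
    """Step (3): mean of the members of each cluster."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:  # assumption: an empty cluster keeps its old center
            new_centers.append(list(old))
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def objective(points, labels, centers) -> float:
    """Assumed form of the measure function: squared error / (n - 1).

    Requires at least two points.
    """
    sse = sum(euclidean(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return sse / (len(points) - 1)


def kmeans(points, init_centers, max_iter: int = 100,
           tol: float = 1e-9) -> Tuple[List[List[float]], List[int], float]:
    """Step (4): iterate assignment and update until J stops changing."""
    centers = [list(c) for c in init_centers]
    previous = float("inf")
    for _ in range(max_iter):
        labels = assign(points, centers)
        centers = update_centers(points, labels, centers)
        current = objective(points, labels, centers)
        if abs(previous - current) <= tol:
            break
        previous = current
    labels = assign(points, centers)
    return centers, labels, objective(points, labels, centers)
```

Passing the centers explicitly keeps the sketch deterministic; a random choice of K objects, as in step (1), would only replace the `init_centers` argument.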
Because the Canopy algorithm reduces the number of distance comparisons, it shortens the running time of clustering and improves computational efficiency. The steps of the Canopy algorithm are as follows:

Step1: Put all data into a List, and initialize two distance radii, the loose threshold T1 and the tight threshold T2 (T1 > T2).
Step2: Randomly select a point as the first Canopy cluster center, and delete this point from the List.
Step3: Take a point from the List and calculate its distance d to each Canopy cluster center. If d < T2, the point belongs to that cluster; if T2 ≤ d ≤ T1, the point is marked with a weak label; if the distance d to every Canopy center is greater than T1, the point becomes a new Canopy cluster center. Finally, delete the point from the List.
Step4: Repeat Step3 until the List is empty, and then recalculate the cluster centers.

However, the execution efficiency of the Canopy algorithm is affected by the radii T1 and T2. When T1 is too large, one point belongs to multiple Canopy clusters, which increases the computing time; when T2 is too large, the number of clusters decreases. The initial radii T1 and T2 are therefore generally set from experience or by experimental testing, which influences the accuracy and efficiency of the classification. To solve these problems, an improved Canopy-Kmeans algorithm is proposed.

3 IMPROVED CANOPY-KMEANS CLUSTERING ALGORITHM

The improved algorithm in this paper is divided into three steps. First, the initial data are divided into K1 coarse clusters by the Canopy algorithm with small radii T1 and T2. Second, split and merge operations reconstruct the K1 coarse clusters into K2 convergent clusters (K1 ≥ K2). Finally, the K2 cluster centers are used as the initial centers of the Kmeans algorithm.
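The first of these three steps relies on the Canopy procedure of Section 2.2, which can be sketched as below. This is an illustrative reading of Step1 to Step4, not the authors' code: the function name `canopy` is hypothetical, the first point of the list is used as the first center instead of a random one (to keep the example reproducible), and the final centers are assumed to be the means of the points that fell within T2 of each center.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points: Sequence[Point], t1: float, t2: float
           ) -> Tuple[List[List[float]], List[List[Point]], List[List[Point]]]:
    """Return (centers, strong members, weakly labeled members) per canopy."""
    if not t1 > t2:
        raise ValueError("the loose threshold T1 must exceed the tight threshold T2")
    remaining = list(points)          # Step1: all data in a list
    if not remaining:
        return [], [], []
    first = remaining.pop(0)          # Step2 (assumption: first point, not random)
    seeds: List[Point] = [first]
    strong: List[List[Point]] = [[first]]
    weak: List[List[Point]] = [[]]
    while remaining:                  # Step4: repeat until the list is empty
        p = remaining.pop(0)          # Step3: take a point and remove it
        dists = [euclidean(p, s) for s in seeds]
        if min(dists) > t1:           # farther than T1 from every center
            seeds.append(p)
            strong.append([p])
            weak.append([])
            continue
        for k, d in enumerate(dists):
            if d < t2:                # inside the tight radius: member
                strong[k].append(p)
            elif d <= t1:             # between T2 and T1: weak label
                weak[k].append(p)
    # Step4, second half: recalculate each center from its strong members
    centers = [[sum(m[i] for m in members) / len(members)
                for i in range(len(members[0]))]
               for members in strong]
    return centers, strong, weak
```

The number of returned centers plays the role of K1, and the centers themselves can seed the later stages; a smaller T1 tends to create more canopies, which is the behaviour the improved algorithm exploits.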
The algorithm architecture is shown in Figure 1.

Figure 1. The architecture of improved Canopy-Kmeans. (Data from operator; data preprocessing; Canopy clustering: the first clustering, which generates K1 rough clusters; merging and splitting: the second clustering, which generates K2 clusters; Kmeans clustering: the third clustering, according to user requirements; output of the final clustering results.)

3.1 The initial Canopy algorithm

The Canopy algorithm is used to obtain a coarse cluster number K1 that is greater than the final cluster number K, which provides a good initial state for the splitting and merging of the second step. The Canopy clustering process is as follows:

A) Obtain data D from the operators, and preprocess it to handle missing values, outliers and the quantification of features.
B) Set small initial radii T1 and T2 according to expert knowledge and business background.
C) Obtain K1 rough clusters by running the Canopy algorithm on data D.

The initial Canopy clustering yields many compact clusters with wide coverage. This avoids the local optimum caused by an inappropriate selection of clustering centers and greatly reduces the running time of clustering.

3.2 Cluster splitting and merging

Starting from the K1 coarse clusters of the first step, merge and split operations adjust the clusters until they converge or the maximum number of iterations is reached. The steps are as follows:

Step1: Initialize the control parameters on the basis of expert knowledge and business background:
TE: the upper limit of the standard deviation of each feature component (when the standard deviation of a cluster is greater than TE, the cluster should be split);
TC: the minimum distance between two cluster centers (when the distance between two cluster centers is smaller than TC, the clusters should be merged);
NS: the maximum number of iterations.

Step2: Split operation. When the standard deviation of a cluster is greater than the specified threshold, the cluster is divided into two. Calculate the standard deviation vector of the samples of each cluster:

$$\sigma_j = (\sigma_{1j}, \sigma_{2j}, \ldots, \sigma_{nj})^{T} \tag{3}$$

Each component of the vector is

$$\sigma_{ij} = \sqrt{\frac{1}{N_j} \sum_{k=1}^{N_j} (x_{ik} - c_{ij})^2} \tag{4}$$

In the formula, i indexes the dimensions of the feature vector, j indexes the clusters, and N_j is the number of samples in cluster j. Calculate the maximum component σ_jmax of each standard deviation vector σ_j. If σ_jmax is greater than TE, the cluster is split into two new cluster centers C_k and C_k+1: C_k is obtained by adding σ_jmax/2 to the component of the original center that corresponds to σ_jmax, and C_k+1 by subtracting σ_jmax/2 from that component.
If no cluster meets the split condition, the algorithm proceeds to the merge operation; otherwise it goes to Step4.

Step3: Merge operation. Calculate the distance D_ij between the centers of each pair of clusters. When D_ij is less than TC, the two clusters should be merged into one. The center distance D_ij is calculated as follows:

$$D_{ij} = \sqrt{\sum_{k=1}^{n} (C_{ik} - C_{jk})^2} \tag{5}$$

Merge the two clusters that meet the merge condition to obtain the new center:

$$C_k^{*} = \frac{N_{ik} C_{ik} + N_{jk} C_{jk}}{N_{ik} + N_{jk}} \tag{6}$$

In the formula, the center vectors of the two merged clusters are weighted by their sample counts, so C_k* is the true mean vector.

Step4: If the process has converged (judged using equation (1)) or the number of iterations is greater than NS, the algorithm terminates. Otherwise, the iteration count is increased by one and the algorithm returns to Step2 to adjust the cluster centers.

Step5: After the clusters have been initialized, outlier clusters are analyzed using the square error, similarity and separability metrics.

3.3 Customizable Kmeans clustering

The K2 clusters from the second step can be used as the initial cluster centers of Kmeans. Users can also adjust the value of k according to their needs. To get more clusters, it is only necessary to split the cluster with the largest standard deviation; similarly, to get fewer clusters, it is only necessary to merge the two clusters whose centers are closest. Finally, the value k and the corresponding centers are passed to the traditional Kmeans algorithm for clustering. The flow chart of the algorithm is shown in Figure 2.
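The split and merge operations of Section 3.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, a cluster is represented by its member points and its center, and the merge helper combines only the single closest pair below TC per call, a choice the text does not specify.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def std_vector(members: Sequence[Point], center: Point) -> List[float]:
    """Per-dimension standard deviation of a cluster around its center."""
    n = len(members)
    return [math.sqrt(sum((m[i] - center[i]) ** 2 for m in members) / n)
            for i in range(len(center))]


def split_center(members: Sequence[Point], center: Point,
                 te: float) -> List[List[float]]:
    """Split rule: if the largest deviation exceeds TE, return two centers."""
    sigma = std_vector(members, center)
    i_max = max(range(len(sigma)), key=sigma.__getitem__)
    s_max = sigma[i_max]
    if s_max <= te:
        return [list(center)]          # split condition not met
    plus, minus = list(center), list(center)
    plus[i_max] += s_max / 2           # shift up along the widest dimension
    minus[i_max] -= s_max / 2          # shift down along the same dimension
    return [plus, minus]


def merge_centers(c_i: Point, n_i: int, c_j: Point, n_j: int) -> List[float]:
    """Merge rule: mean of two centers weighted by their sample counts."""
    return [(n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(c_i, c_j)]


def merge_closest(centers: List[List[float]], counts: List[int],
                  tc: float) -> Tuple[List[List[float]], List[int]]:
    """Merge the closest pair of centers if their distance is below TC.

    Assumption: one pair per call; the merged cluster is appended last.
    """
    best = None
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = math.dist(centers[i], centers[j])   # center distance
            if d < tc and (best is None or d < best[0]):
                best = (d, i, j)
    if best is None:
        return centers, counts
    _, i, j = best
    merged = merge_centers(centers[i], counts[i], centers[j], counts[j])
    keep = [k for k in range(len(centers)) if k not in (i, j)]
    return ([centers[k] for k in keep] + [merged],
            [counts[k] for k in keep] + [counts[i] + counts[j]])
```

The same two helpers also cover the customizable step of Section 3.3: calling `split_center` on the cluster with the largest deviation raises k by one, and calling `merge_closest` with a large enough `tc` lowers it by one.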

Figure 2. Algorithm flowchart. (Start; input the parameters: radius, maximum standard deviation, minimum center distance, etc.; Canopy clustering; calculate the cluster centers, the standard deviations and the distances between clusters; if the standard deviation of a cluster is greater than TE, perform the split operation; if the distance between two cluster centers is less than TC, perform the merge operation; repeat until convergence or the maximum number of iterations is reached; delete outlier clusters; customizable Kmeans clustering; end.)

4 EXPERIMENTS AND RESULT ANALYSIS

4.1 Experimental preparation

The experiments were run on a PC with the Windows 7 operating system and 8 GB of memory. The algorithm was implemented in Matlab, and the data were provided by the operators.

4.2 Experimental design

To achieve better user segmentation, we analyze data on user attributes, consumer behavior and communication records, and build a data mining model. This model is then applied to formulate marketing policies for different customer groups and to maximize the benefits. The experimental data are shown in Table 1. The experiments use the attributes from Table 1; the training data are randomly selected from the original data and divided into 10 groups (the amount of data ranges from 1 million to 10 million). Moreover, the training data follow a normal distribution, which ensures the randomness of the experimental data and the accuracy of the experimental results.

Experiment 1: With the same amount of data, the proposed algorithm is compared with the traditional Canopy algorithm on running efficiency. The results are shown in Table 2 and Figure 3.
In the parameter setting, the radii T1 and T2 of the traditional Canopy clustering algorithm were set to 0.5 and 0.75, and the radii T1 and T2 of the proposed algorithm were set to smaller values, about 0.25 and … According to the golden-section evaluation function of paper [13], the maximum cluster standard deviation TE and the minimum cluster center distance TC were calculated as 0.09 and 3. With a low data volume, the running times of the two algorithms show no obvious difference. Once the data volume reaches a certain scale, the running time of the proposed algorithm grows more slowly, and the algorithm converges significantly faster than the traditional Canopy algorithm.

Experiment 2: Taking the 10th data set as an example, we observe the influence of the initial radius on the clustering results. The results are shown in Table 3 and Figure 4, with the tight threshold T2 set to 0.05 and 0.5 respectively. As can be seen from Figure 4, the clustering result of the traditional Canopy algorithm is easily affected by the radius: as the radius T2 increases, the number of clusters decreases. In contrast, the proposed algorithm is very stable: it is not easily affected by the radius, and its number of clusters stays close to 5.

Experiment 3: The square error can be used to evaluate how concentrated a cluster is. The smaller the square error, the more concentrated and the more similar the objects within a cluster are. Table 4 and Figure 5 show that the proposed algorithm achieves a lower square error than the traditional Canopy algorithm.

Table 1. Main data and instruction in the experiment.

Attribute Name            Explanation
IMSI                      A unique identification number for mobile users
BRND_CD                   The operator's brand type
INNET_DUR                 The total service time of the user from first use until now. Unit: month
BI_AGE_CNT                The user's age as registered with the operator
F_ACCTBAL_AMT             Account balance for the month
ARPU                      A standard to measure the operator's income. Unit: month
NB_ARPU                   The part of ARPU that comes from the operator's main data business
GPRS_FLUX                 The total data traffic used by the user per month (2G + 3G + 4G)
G4_FLUX                   The 4G data traffic used by the user per month
G3_FLUX                   The 3G data traffic used by the user per month
MOU                       A measure of telecommunication usage. Unit: minute
INT_NORM_ROAM_CALL_CNT    The number of calls made outside the service area per month
F_NORM_ROAM_CALL_DUR      The duration of calls made outside the service area per month
INT_NORM_4G_FRD_PCT       The proportion of 4G customers among the Top 20 most frequently contacted friends
INT_NORM_HFLUX_FRD_PCT    The proportion of high-traffic customers among the Top 20 most frequently contacted friends
INT_NORM_HARPU_FRD_PCT    The proportion of high-ARPU customers among the Top 20 most frequently contacted friends
TERM_BRND_CD              The terminal brand
SI_INVOICE_FLAG           A flag indicating whether the customer has applied for a payment invoice

Table 2. The relationship between the amount of data and running time. (Columns: the amount of data / ten thousand; running time / s of the proposed algorithm and of traditional Canopy.)

Table 3. The relationship between radius T2 and the count of clusters. (Columns: radius T2; the count of clusters of the proposed algorithm and of traditional Canopy.)

Table 4. The relationship between the amount of data and square error. (Columns: the amount of data / ten thousand; square error / 10^3 of the proposed algorithm and of traditional Canopy.)

Figure 3. The chart of computing efficiency.

Figure 4. The comparison of clustering results with radius.

Figure 5. The change of square error with the different amount of data.

5 CONCLUSIONS

Kmeans is a common clustering method in data mining. To overcome the weaknesses of the Kmeans algorithm and the Canopy algorithm, an improved Canopy-Kmeans algorithm is proposed. The proposed method not only retains the advantages of the traditional Canopy-Kmeans algorithm, but can also adjust the clustering result according to actual needs. Experiments show that the algorithm is helpful for personalized operation based on customer segmentation. As a next step, we will combine this method with personalized recommendation for more in-depth data mining research and application.

ACKNOWLEDGEMENT

This work is supported by the national Joint Funds of Guangdong province support project (No. U ) and National Natural Science Foundation of China for Young Scholar (No ); all support is gratefully acknowledged.

REFERENCES

[1] Zhao C., Wu Y., Gao H. Study on knowledge acquisition of the telecom customers' consuming behavior based on data mining. Wireless Communications, Networking and Mobile Computing, WiCOM'08. 4th International Conference on. IEEE, pp: 1-5.
[2] Soua M., Kachouri R., Akil M. A new hybrid binarization method based on K-means. Communications, Control and Signal Processing (ISCCSP), th International Symposium on. IEEE, pp:
[3] Tang X.Q., Zhu P. Hierarchical clustering problems and analysis of fuzzy proximity relation on granular space. Fuzzy Systems, IEEE Transactions on, 21(5):
[4] Smiti A., Elouedi Z. DBSCAN-GM: An improved clustering method based on Gaussian Means and DBSCAN techniques. Intelligent Engineering Systems (INES), 2012 IEEE 16th International Conference on. IEEE, pp:
[5] Tsai C.F., Hu Y.C. Enhancement of efficiency by thrifty search of interlocking neighbor grids approach for grid-based data clustering. Machine Learning and Cybernetics (ICMLC), 2013 International Conference on. IEEE, 3:
[6] Qin X., Zheng S., Huang Y., et al. Improved K-Means algorithm and application in customer segmentation. Wearable Computing Systems (APWCS), 2010 Asia-Pacific Conference on. IEEE, pp:
[7] Han L.B., Wang Q., Jiang Z.F. Improved k-means initial clustering center selection algorithm. Computer Engineering and Applications, 46(17):
[8] Du W., Zhao C.R., Huang W.J. Application of improved Kmeans cluster algorithm to customer segmentation. Journal of Hebei University of Economics and Business, 35(1):
[9] Zhai D.H., Yu J., Gao F. K-means text clustering algorithm based on initial cluster centers selection according to maximum distance. Application Research of Computers, 31(3):
[10] Xie J.Y., Wang Y.E. K-means algorithm based on minimum deviation initialized clustering centers. Computer Engineering, 40(8):
[11] Mehar A.M., Matawie K., Maeder A. Determining an optimal value of K in K-means clustering. Bioinformatics and Biomedicine (BIBM), 2013 IEEE International Conference on. IEEE, pp:
[12] Na S., Xumin L., Yong G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. Intelligent Information Technology and Security Informatics (IITSI), 2010 Third International Symposium on. IEEE, pp:
[13] Zhang L.N., Jiang X.H., Na R.S. Research of large sample data clustering method based on improved ISODATA algorithm. Journal of Inner Mongolia Agricultural University: Natural Sciences Edition, (1):
