FuzzyShrinking: Improving Shrinking-Based Data Mining Algorithms Using Fuzzy Concept for Multi-Dimensional data


Kennesaw State University
DigitalCommons@Kennesaw State University
Faculty Publications, 2008

Yong Shi, Kennesaw State University

Recommended Citation: Yong Shi. FuzzyShrinking: improving shrinking-based data mining algorithms using fuzzy concept for multi-dimensional data. In Proceedings of the 46th Annual Southeast Regional Conference on XX (ACM-SE 46). ACM, New York, NY, USA.

Yong Shi
Department of Computer Science and Information Systems
Kennesaw State University
1000 Chastain Road, Kennesaw, GA

ABSTRACT

In this paper, we continue our research on data analysis, building on our previous work on the shrinking approach. Shrinking [19] is a novel data preprocessing technique that optimizes the inner structure of data, inspired by Newton's Universal Law of Gravitation [16]. It can be applied in many data mining fields. In this approach, data points are moved along the direction of the density gradient, which makes the inner structure of the data more prominent. The process is conducted on a sequence of grids with different cell sizes. In this paper, we apply the fuzzy concept to improve the performance of the shrinking approach, aiming at better decisions about the movement of individual data points in each iteration. This approach can help improve the performance of existing data analysis approaches.

1. INTRODUCTION

With the advance of modern technology, the generation of multi-dimensional data has proceeded at an explosive rate in many disciplines. Data preprocessing procedures can greatly benefit the utilization and exploration of real data. Shrinking [19] is a novel data preprocessing technique that optimizes the inner structure of data, inspired by Newton's Universal Law of Gravitation [16]. It can be applied in many data mining fields. In this paper, we first give a brief introduction to our previous work on the formation of the shrinking concept and its application to clustering approaches in full space. We then propose to use the fuzzy concept to improve the performance of the shrinking approach, aiming at better decisions about the movement of individual data points in each iteration.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM-SE 08, March 28-29, 2008, Auburn, AL, USA. Copyright 2008 ACM ISBN /08/03 $

1.1 Related work

Data preprocessing transforms data into a format that can be processed more easily and effectively for the users' purposes. It is commonly used as a preliminary data mining practice. There are a number of data preprocessing techniques [15, 4]. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store. Data transformation may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size. These data processing techniques, when applied prior to mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining [12].

Cluster analysis is used to identify homogeneous and well-separated groups of objects in databases. The basic steps in the development of a clustering process can be summarized as [6] feature selection, application of a clustering algorithm, validation of results, and interpretation of the results. Among these steps, the clustering algorithm is especially critical, and many methods have been proposed in the literature for this step. Existing clustering algorithms can be broadly classified into four types [11]: partitioning [10, 13, 14], hierarchical [21, 7, 8], grid-based [18, 17, 1], and density-based [5, 9, 2] algorithms.
1.2 Data shrinking

In this subsection, we briefly introduce our previous work on the shrinking approach [19]. Shrinking is a data preprocessing approach inspired by Newton's Universal Law of Gravitation, which states that any two objects exert a gravitational force of attraction on each other. It optimizes the inner structure of data by moving each data point in the direction in which it is most strongly attracted by other data points.

In the previous work [19], we proposed a shrinking-based approach for multi-dimensional data analysis that consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. This approach is based on grid separation of the data space. Since grid-based clustering approaches depend on the proper selection of the grid-cell size, we used a technique called density span generation to select a sequence of grids of different cell sizes and perform the data-shrinking and cluster-detection steps on these suitable grids.

In the data-shrinking step, each data point moves along the direction of the density gradient and the data set shrinks toward the inside of the clusters. Data points are attracted by their neighbors and move to create denser clusters. The neighboring relationship of the points in the data set is grid-based. The data space is first subdivided into grid cells. Data points in sparse cells are considered to be noise or outliers and are ignored in the data-shrinking process. Data shrinking proceeds iteratively; in each iteration, we treat the points in each cell as a rigid body that is pulled as a unit toward the data centroid of those surrounding cells that have more points. Therefore, all points in a single cell participate in the same movement. The iterations terminate if the average movement of all points is less than a threshold or if the number of iterations exceeds a threshold.

Following the data-shrinking step, the cluster-detection step is performed, in which neighboring dense cells are connected and a neighboring graph of the dense cells is constructed. A breadth-first search algorithm is applied to find the connected components of the neighboring graph, each of which is a cluster.

The clusters detected at multiple grid scales are compared by a cluster-wise evaluation measurement. We use the term compactness to measure the quality of a cluster on the basis of intra-cluster and inter-cluster relationships. Internal connecting distance (ICD) and external connecting distance (ECD) are defined to measure the closeness of the internal and external relationships, respectively. Compactness is then defined as the ratio of the external connecting distance over the internal connecting distance. This definition of compactness is used to evaluate clusters detected at different scales and then to select the best clusters as the final result.
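The cluster-detection step described above can be illustrated with a minimal Python sketch. It is not the implementation from [19]; several details are assumptions made only for illustration: cells are indexed by floor division with a single `cell_size`, a cell is "dense" when it holds at least `density_threshold` points, and two cells count as neighbors when their indices differ by at most one in every dimension. The function names `grid_cells` and `detect_clusters` are hypothetical.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cells(points, cell_size):
    """Map each grid-cell index (a tuple of ints) to the points inside it."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)
    return cells


def detect_clusters(points, cell_size, density_threshold):
    """Connect neighboring dense cells and return one point list per cluster."""
    cells = grid_cells(points, cell_size)
    # Assumption: a cell is dense if it holds at least density_threshold points;
    # sparse cells are treated as noise and ignored.
    dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}
    dim = len(next(iter(dense))) if dense else 0
    # Assumption: neighbors differ by at most one index step per dimension.
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        # Breadth-first search over the neighboring graph of dense cells.
        while queue:
            cell = queue.popleft()
            component.append(cell)
            for off in offsets:
                nb = tuple(c + d for c, d in zip(cell, off))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append([p for cell in component for p in cells[cell]])
    return clusters
```

Each connected component of dense cells becomes one cluster, and points in sparse cells never appear in the output. In the multi-scale setting of the paper, a routine like this would be run once per grid size before the compactness-based selection.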
1.3 Fuzzy concept

Some data in the real world are not naturally well organized, and clusters in the data may overlap each other. The fuzzy concept can be applied to further smooth the shrinking movement of the boundary data points. The concept of fuzzy sets was first introduced by Zadeh [20] to represent vagueness. Fuzzy set theory is becoming popular because it produces not only a crisp decision when necessary but also the corresponding degree of membership. Usually, membership functions are defined based on a distance function, such that membership degrees express the proximity of entities to cluster centers.

In conventional clustering, a sample is either assigned to a group or not. Assigning each data point to exactly one cluster often causes problems, because in real-world problems a crisp separation of clusters is rarely possible due to overlapping classes. There are also exceptions that cannot be suitably assigned to any cluster. Fuzzy sets extend clustering in that an object of the data set may be fractionally assigned to multiple clusters; that is, each point of the data set belongs to groups according to a membership function. This allows for ambiguity in the data and yields detailed information about its structure, and the algorithms adapt to noisy data and to classes that are not well separated. Most fuzzy cluster analysis methods optimize an objective function that evaluates a given fuzzy assignment of data to clusters. One of the classic fuzzy clustering approaches is the fuzzy C-means method designed by J. C. Bezdek [3].
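As an illustration of fractional assignment, the following is a minimal pure-Python sketch of fuzzy C-means. It alternates the standard membership and center updates (the ones the paper states as Equations 3 and 4) under some assumptions that are not taken from the paper: the distance is plain Euclidean (that is, $A$ is the identity), the weighting exponent satisfies $m > 1$, the caller supplies the starting centers, and the loop stops when no center moves more than `tol`. The name `fuzzy_c_means` and its parameters are hypothetical.

```python
import math


def fuzzy_c_means(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Return (centers, u), where u[i][k] is the membership of point k in cluster i."""
    centers = [tuple(v) for v in initial_centers]
    c, n, dim = len(centers), len(points), len(points[0])
    u = [[0.0] * n for _ in range(c)]
    exponent = 2.0 / (m - 1.0)  # requires m > 1

    for _ in range(max_iter):
        # Membership update from the distances to the current centers.
        for k, x in enumerate(points):
            dist = [math.dist(x, v) for v in centers]
            zero = [i for i, d in enumerate(dist) if d == 0.0]
            for i in range(c):
                if zero:
                    # The point coincides with a center: crisp membership.
                    u[i][k] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    u[i][k] = 1.0 / sum((dist[i] / dist[j]) ** exponent
                                        for j in range(c))

        # Center update: mean of the points weighted by u_ik ** m.
        new_centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            new_centers.append(tuple(
                sum(w[k] * points[k][d] for k in range(n)) / total
                for d in range(dim)))

        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # assumed termination condition
            break
    return centers, u
```

For every point the memberships over all clusters sum to one, so a point lying between two groups receives a split grade instead of a forced crisp label. The returned memberships are those computed in the last membership update, just before the final center update.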
In brief, for a data set $X$ of size $n$ and a cluster number $c$, FCM extends the classical within-groups sum of squared errors objective function to a fuzzy version by minimizing the objective function with weighting exponent $m$, $1 \le m < \infty$:

$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^{m} \, d^{2}(x_k, v_i), \tag{1}$$

where $U$ is a fuzzy partition of $X$ into $c$ parts, $V = (v_1, v_2, \ldots, v_c)$ are the cluster centers in $\mathbb{R}^p$, and $A$ is any $(p \times p)$ symmetric positive definite matrix that defines the distance

$$d(x_k, v_i) = \sqrt{(x_k - v_i)^{\top} A \, (x_k - v_i)}, \tag{2}$$

where $d(x_k, v_i)$ is an inner-product-induced norm on $\mathbb{R}^p$ and $u_{ik}$ is referred to as the grade of membership of $x_k$ in cluster $i$. Fuzzy C-means (FCM) uses an iterative optimization of the objective function, based on the weighted similarity measure between $x_k$ and the cluster center $v_i$. During each iteration $t$, it calculates the $c$ cluster centers $\{v_{i,t}\}$, $i = 1, \ldots, c$:

$$v_{i,t} = \frac{\sum_{k=1}^{n} u_{ik,t-1}^{m} \, x_k}{\sum_{k=1}^{n} u_{ik,t-1}^{m}}, \tag{3}$$

and, for those data points that do not coincide with any current cluster center, it calculates

$$u_{ik,t} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik,t}}{d_{jk,t}} \right)^{\frac{2}{m-1}}}. \tag{4}$$

When a predefined termination condition is satisfied, the algorithm terminates.

1.4 Fuzzy-based data movement

One advantage that the shrinking-based clustering approach has over other clustering approaches is the subtlety of its data point movement. In each iteration step, the data points move gradually according to the gravitation of neighboring data points. Data points do not move dramatically in a single step, which could easily cause improper movement such as an unsuitable direction or an unfavorable amplitude.

One of the crucial aspects of data point movement is the handling of boundary data points. Here, boundary means that the data points lie in the intersection area of multiple clusters. The movement accuracy of boundary data points matters more than that of center data points. Center data points, even if they move a little improperly during one iteration, will still be in the center area of a cluster, whereas for boundary data points an incorrect movement during one iteration might lead them to the wrong cluster. Thus more effort should be put into improving the movement of boundary data points. We use the fuzzy clustering concept to further smooth the boundary data point movement in shrinking-based approaches.

Data movement is an iterative process intended to move data gradually toward a goal of increased cluster density. In each iteration, points are attracted by their neighbors and move toward the inside of the clusters. Each point in a cell $C$ has neighboring points in $C$ or in the cells surrounding $C$. The movement of a point can be intuitively understood as analogous to the attraction of a mass point by its neighbors. Thus, the point moves toward the centroid of its neighbors. However, data movement thus defined can cause an evenly distributed data set to be condensed in a piece-wise manner.

Our solution to this problem is to treat the points in each cell as a rigid body that is pulled as a unit toward the data centroid of those surrounding cells that have more points. Therefore, all points in a single cell participate in the same movement. This approach not only solves the problem of piece-wise condensing but also saves time.

Formally, suppose the data set at the beginning of the $i$th iteration becomes

$$\{X_1^i, X_2^i, \ldots, X_n^i\},$$

and the set of dense cells is

$$\mathit{DenseCellSet}^i = \{C_1^i, C_2^i, \ldots, C_m^i\}. \tag{5}$$

We assume that the dense cells have $n_1^i, n_2^i, \ldots, n_m^i$ points, respectively, and that their data centroids are $\Phi_1^i, \Phi_2^i, \ldots, \Phi_m^i$. For a dense cell $C_j^i$, we suppose it has $P_j^i$ surrounding dense cells, which are $C_{j_l}^i$ for $l = 1, 2, \ldots, P_j^i$.
The movement for any data point $X_{j_o}^i$, $o = 1, 2, \ldots, n_j^i$, in cell $C_j^i$ in the $i$th iteration is

$$\mathit{Movement}(X_{j_o}^i) = \sum_{l=1}^{P_j^i} u_{j_l}^i \left( \Phi_{j_l}^i - \Phi_j^i \right), \tag{6}$$

where $\Phi_{j_l}^i - \Phi_j^i$ is the displacement between the two centroids $\Phi_{j_l}^i$ and $\Phi_j^i$, and $u_{j_l}^i$ is the grade of movement direction (instead of membership, as in the traditional fuzzy concept) of $X_{j_o}^i$ toward the neighboring dense cell $C_{j_l}^i$:

$$u_{j_l}^i = \frac{n_{j_l}^i}{\sum_{q=1}^{P_j^i} n_{j_q}^i}, \tag{7}$$

so the more data points a neighboring cell $C_{j_l}^i$, $l = 1, 2, \ldots, P_j^i$, has, the more strongly it attracts the data points in $C_j^i$. This grade of movement direction satisfies the following constraints:

$$0 \le u_{j_l}^i \le 1, \quad \text{for } 1 \le l \le P_j^i,\ 1 \le j \le m, \tag{8}$$

$$\sum_{l=1}^{P_j^i} u_{j_l}^i = 1, \quad \text{for } 1 \le j \le m. \tag{9}$$

To compute the movement for a dense cell $C_j^i$, we browse the set of dense cells to find its surrounding cells and then calculate the movement. This computation takes $O(m)$ time. It then takes $O(n_j)$ time to update the points in the cell. Summing over all $m$ dense cells, whose point counts total at most $n$, the time used to move all points in the $i$th iteration is $O(m^2 + n)$. Thus, the time required for the $i$th iteration is $O(m^2 + n \log n)$, where $O(n \log n)$ time is used for the subdivision of space.

2. REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages , Seattle, WA,
[2] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '99), Philadelphia, PA, pages 49-60,
[3] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA,
[4] D. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. H. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering,
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining,
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press,
[7] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 73-84, Seattle, WA,
[8] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the IEEE Conference on Data Engineering,
[9] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 58-65, New York, August
[10] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics,
[11] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3),
[12] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers,
[13] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons,
[14] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th VLDB Conference, pages , Santiago, Chile,
[15] T. Redman. Data Quality: Management and Technology. Bantam Books,
[16] M. A. Rothman. The Laws of Physics. Basic Books, New York,
[17] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases,
[18] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd VLDB Conference, pages , Athens, Greece,
[19] Y. Shi, Y. Song, and A. Zhang. A shrinking-based clustering approach for multidimensional data. IEEE Transactions on Knowledge and Data Engineering (TKDE), pages ,
[20] L. A. Zadeh. Fuzzy sets. Information and Control, 8(3): ,
[21] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages , Montreal, Canada,
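The fuzzy movement rule of Section 1.4 (Equations 6 and 7) can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the author's code: the names `fuzzy_movement` and `shrink_cell` are hypothetical, each surrounding dense cell is assumed to be given as a `(point_count, centroid)` pair, and a cell with no surrounding dense cells is assumed to stay where it is.

```python
def fuzzy_movement(centroid, neighbors):
    """Movement vector for one dense cell.

    centroid  -- data centroid of the cell being moved
    neighbors -- list of (point_count, centroid) pairs, one per surrounding
                 dense cell (assumed input layout)
    """
    move = [0.0] * len(centroid)
    total = sum(count for count, _ in neighbors)
    if total == 0:
        # Assumption: no surrounding dense cells means no movement.
        return tuple(move)
    for count, nb_centroid in neighbors:
        grade = count / total  # grade of movement direction, Eq. (7)
        for d in range(len(centroid)):
            # Weighted centroid displacement, one term of Eq. (6).
            move[d] += grade * (nb_centroid[d] - centroid[d])
    return tuple(move)


def shrink_cell(points, centroid, neighbors):
    """Shift every point of the cell by the same vector (rigid-body movement)."""
    move = fuzzy_movement(centroid, neighbors)
    return [tuple(x + dx for x, dx in zip(p, move)) for p in points]
```

Because the grades are the neighbors' shares of the surrounding point count, they lie in [0, 1] and sum to one, matching constraints (8) and (9). A neighbor holding three quarters of the surrounding points contributes three quarters of its centroid displacement to the cell's movement.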
