THE ENSEMBLE CONCEPTUAL CLUSTERING OF SYMBOLIC DATA FOR CUSTOMER LOYALTY ANALYSIS


Marcin Pełka
Wroclaw University of Economics, Faculty of Economics, Management and Tourism, Department of Econometrics and Computer Science (marcin.pelka@ue.wroc.pl)

KEYWORDS: Symbolic data analysis, Ensemble Clustering, Conceptual Clustering, Customer Loyalty

1 Introduction

The ensemble approach, which aggregates information provided by different models, has proved to be a very useful tool in supervised learning. Its main goal is to increase the accuracy and stability of classification. Recently the same techniques have been applied to cluster analysis, where combining a set of different clusterings can yield a better solution. Ensemble clustering means combining (aggregating) N base clustering results (models) P_1, ..., P_N into one model with P* clusters (see: Fred and Jain 2005). The same idea, that is combining (aggregating) the results of many base models, can also be applied to cluster analysis of symbolic data. There are several proposals for applying the ensemble approach in clustering: aggregating the results of different clustering algorithms, obtaining different partitions by resampling the data, applying different subsets of variables, or applying a given algorithm many times with different parameter values or different initializations.

2 Symbolic data

In classical multivariate data analysis, the basic units under analysis are usually single individuals described by a set of quantitative (for example numerical) and/or qualitative (also known as categorical) variables, each taking exactly one value. For example, a specific car can be described by its year of production, average fuel consumption, trunk capacity, color, etc. Data are often organized in a matrix or data array, where each cell contains the value of a variable for an individual.
Nevertheless, this kind of data representation is too restricted to take into account the variability and/or uncertainty of the data. Whether the data are obtained by contemporaneous or temporal aggregation of individual observations to obtain descriptions of the entities of interest, or whether we are facing concepts as such, specified by experts or put in evidence by clustering, we are dealing with elements that can no longer be described within the usual quantitative or qualitative framework without a loss of information. Symbolic data analysis (SDA) provides a framework in which the observed variability can be effectively considered in the data representation, and in which methods that take it into account can be developed. To describe groups of individuals or concepts, variables may now assume other forms of realizations (see Bock, Diday 2000 for details).

A numerical (quantitative) symbolic variable can be single-valued (real or integer) if it takes one single value, multivalued if its values are finite subsets of the domain, or interval-valued if its values are intervals. A categorical variable can be single-valued (ordinal or not) when it takes a single category from a given finite domain, or multivalued if its values are finite subsets of the domain. A categorical modal variable is a multistate variable where, for each element, we are given a category set and, for each category, a frequency or a probability which indicates how frequent or likely that category is for this element.

3 Ensemble conceptual clustering

There are two main approaches that can be applied in ensemble learning for symbolic interval-valued data (see: Ghaemi et al. 2009; De Carvalho et al. 2012; Hornik 2005):
1. The clustering algorithm for multiple relational matrices proposed by De Carvalho et al. This approach is based on different distance matrices.
2. Clustering ensembles that apply consensus functions. Five main types of consensus function are used: hypergraph partitioning, the voting approach, mutual information, finite mixture models, and co-association based functions.
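As an illustration of the last type of consensus function listed above, the following is a minimal Python sketch of a co-association (evidence accumulation) consensus in the spirit of Fred and Jain (2005). It is a sketch under stated assumptions, not the procedure used in this paper: the function names, the toy label vectors and the 0.5 threshold are hypothetical, and the thresholded connected components stand in for cutting a single-link dendrogram at that similarity level.

```python
def co_association(partitions):
    """Share of base partitions in which each pair of objects is co-clustered."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold.

    Assumption: 0.5 (a simple majority of base partitions) is an
    illustrative cut-off, not a value taken from the paper.
    """
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical base partitions of five objects. Label names are
# arbitrary, so only co-membership matters.
base = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
C = co_association(base)
labels = consensus_clusters(C)
print(labels)
```

Because the consensus is built from pairwise co-membership only, the base partitions may use different numbers of clusters and unrelated label names.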
However, these approaches do not produce concepts as the output of a clustering ensemble. Since Michalski described conceptual clustering as a new branch of machine learning (Michalski 1980), this task has received increasing attention. In conceptual clustering it is not only the inherent structure of the data that drives cluster formation, but also the description language available to the learner. A concept is an abstraction or generalization from experience, or the result of a transformation of existing concepts. The concept reifies all of its actual or potential instances, whether these are things in the real world or other ideas.

In order to obtain concepts as the result of ensemble clustering, an adaptation of bagging can be used. Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive and perhaps the simplest ensemble based algorithms, with a surprisingly good performance (Breiman 1996). Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data. That is, different training data subsets are randomly drawn with replacement from the entire training dataset. Each training data subset is used to train a different classifier of the same type. In clustering, the following adaptations of bagging have been proposed for classical data (Hornik 2005; Leisch 1999; Dudoit and Fridlyand 2003):
1. Leisch's (1999) adaptation of bagging, where usually a k-means like base method is used. Centres obtained from each clustering are used as the initial data set for some clustering method (e.g. hierarchical). Objects are assigned to the closest cluster centre. This kind of approach can be used to obtain concepts that describe clusters of symbolic objects. At the first stage subsets of objects (drawn with replacement) are obtained. Then for each subset the dynamic clustering for symbolic data (SCLUST) is used, and final cluster representatives (symbolic objects) are obtained. These objects are then used as the initial data for pyramidal or hierarchical clustering, which yields the final clustering (assertion objects).
2. Dudoit and Fridlyand's (2003) proposal, where a k-means like algorithm is used to cluster the entire data set and each of the subsets. Cluster labels are then permuted to obtain the best agreement between the labels for the entire data set and for the subsets.
3. Hornik's (2005) proposal, where a clustering is applied to each subset. The final solution is obtained by minimizing the distance between the elements of the ensemble and the set of all possible ensemble clusterings.

4 Short example

A short example is used to present the main idea of the paper. The data set contains 20 artificial symbolic objects described by two interval-valued variables, obtained with the cluster.gen function from the clusterSim package (see Table 1). 40 subsets, each containing 14 objects drawn with replacement, were obtained. For each subset dynamic clustering was carried out with the number of clusters drawn at random from the interval [2; 8].
Cluster representatives obtained at this stage were used as the initial data set for hierarchical clustering (see Bock, Diday 2000) in the SODAS 2.50 software, and assertion objects were obtained. Objects from the OOB (out-of-bag) data set were assigned to the closest final cluster. In the end a two-cluster structure was obtained (see Table 1).

5 Aim of the paper

The article proposes to apply conceptual clustering in ensemble learning of symbolic data. An adaptation of Leisch's bagging is used. In the first stage the data are divided into subsets (bags). For each of them the dynamic clustering algorithm is applied with a different number of clusters, and cluster representatives are obtained. These representatives are then used as the initial data set for hierarchical clustering, which is a conceptual clustering method. Assertion objects are obtained as the final result of this step. In the empirical part of the paper the results of the ensemble clustering are presented for customer loyalty data.
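The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SODAS workflow itself: plain k-means on the interval bounds stands in for the SCLUST dynamic clustering, centroid-linkage agglomeration stands in for the hierarchical step, each final cluster is described by the smallest intervals covering its prototypes, and all objects (not only out-of-bag ones) are assigned to the closest final cluster. The function names and the generated data are hypothetical.

```python
import random


def dist(a, b):
    """Squared Euclidean distance between two flattened interval vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of equally long vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def kmeans_prototypes(sample, k, rng, iters=20):
    """Plain k-means on interval bounds (assumed stand-in for SCLUST)."""
    distinct = sorted(set(sample))
    centres = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(iters):
        groups = [[] for _ in centres]
        for x in sample:
            nearest = min(range(len(centres)), key=lambda c: dist(x, centres[c]))
            groups[nearest].append(x)
        centres = [mean(g) for g in groups if g]  # empty clusters are dropped
    return centres


def agglomerate(points, n_clusters):
    """Centroid-linkage agglomeration of prototypes down to n_clusters groups."""
    clusters = [(p, [p]) for p in points]  # (centroid, member prototypes)
    while len(clusters) > n_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: dist(clusters[ab[0]][0], clusters[ab[1]][0]),
        )
        members = clusters[i][1] + clusters.pop(j)[1]
        clusters[i] = (mean(members), members)
    return clusters


def bagged_interval_clustering(objects, n_bags=40, bag_size=14,
                               k_range=(2, 8), n_final=2, seed=0):
    rng = random.Random(seed)
    # Flatten [(lo1, hi1), (lo2, hi2)] into (lo1, hi1, lo2, hi2).
    data = [tuple(b for interval in obj for b in interval) for obj in objects]
    prototypes = []
    for _ in range(n_bags):
        bag = [rng.choice(data) for _ in range(bag_size)]  # drawn with replacement
        prototypes += kmeans_prototypes(bag, rng.randint(*k_range), rng)
    final = agglomerate(prototypes, n_final)
    # Describe each final cluster by the intervals covering all its prototypes.
    descriptions = []
    for _, members in final:
        cols = list(zip(*members))
        descriptions.append([(min(cols[v]), max(cols[v + 1]))
                             for v in range(0, len(cols), 2)])
    labels = [min(range(len(final)), key=lambda c: dist(x, final[c][0]))
              for x in data]
    return labels, descriptions


def make_objects(rng, centre, n):
    """Hypothetical interval data: n objects with two variables near `centre`."""
    objects = []
    for _ in range(n):
        obj = []
        for _ in range(2):
            mid = centre + rng.uniform(-1, 1)
            half = rng.uniform(0.5, 1.5)
            obj.append((mid - half, mid + half))
        objects.append(obj)
    return objects


data_rng = random.Random(1)
objects = make_objects(data_rng, 0.0, 10) + make_objects(data_rng, 100.0, 10)
labels, descriptions = bagged_interval_clustering(objects)
print(labels)
for description in descriptions:
    print(description)
```

The defaults (40 bags of 14 objects, between 2 and 8 clusters per bag, two final clusters) follow the short example; the two well-separated groups of generated intervals only imitate the kind of data that cluster.gen would produce.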

Table 1. Symbolic objects

Object no.   Variable V1   Variable V2
1.           [ ; ]         [ ; ]
2.           [ ; ]         [ ; ]
3.           [ ; ]         [ ; ]
4.           [ ; ]         [ ; ]
5.           [ ; ]         [ ; ]
6.           [ ; ]         [ ; ]
7.           [ ; ]         [ ; ]
8.           [ ; ]         [ ; ]
9.           [ ; ]         [ ; ]
10.          [ ; ]         [ ; ]
11.          [ ; ]         [ ; ]
12.          [ ; ]         [ ; ]
13.          [ ; ]         [ ; ]
14.          [ ; ]         [ ; ]
15.          [ ; ]         [ ; ]
16.          [ ; ]         [ ; ]
17.          [ ; ]         [ ; ]
18.          [ ; ]         [ ; ]
19.          [ ; ]         [ ; ]
20.          [ ; ]         [ ; ]

Source: own research.

Table 1. Examples of symbolic variables

             Variable V1   Variable V2
Cluster 1    [ ; ]         [ ; ]
Cluster 2    [ ; ]         [ ; ]

Source: own research.

References

BOCK, H.-H., DIDAY, E. (Eds.), Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin-Heidelberg: Springer.
BREIMAN, L., Bagging predictors. Machine Learning, vol. 24, no. 2.
DUDOIT, S., FRIDLYAND, J., Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19 (9).
FRED, A.L.N., JAIN, A.K., Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27.
GHAEMI, R., SULAIMAN, N., IBRAHIM, H., MUSTAPHA, N., A survey: Clustering ensemble techniques [in:] Proceedings of World Academy of Science, Engineering and Technology, 38.
HARTIGAN, J.A., Clustering Algorithms. New York: Wiley.
HORNIK, K., A clue for clustering ensembles. Journal of Statistical Software, 14.
LEISCH, F., Bagged clustering. Adaptive Information Systems and Modeling in Economics and Management Science. Working Papers, SFB, 51.
MICHALSKI, R.S., Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, 4.
PEŁKA, M., Ensemble approach for clustering of interval-valued symbolic data. Statistics in Transition, 13 (2).
