A Pretopological Approach for Clustering (*)

T.V. Le and M. Lamure
University Claude Bernard Lyon 1
e-mail: than-van.le@univ-lyon1.fr, michel.lamure@univ-lyon1.fr

Abstract: The aim of this work is to define a clustering algorithm based on pretopological results about minimal closed subsets. The minimal closed subsets algorithm alone generally does not yield a clustering of the population of interest, so we propose a new clustering method built on it. Our method has two steps: the first structures the population with the minimal closed subsets method, and the second builds a clustering of the population from that structure. The method is tested on data from the CRAM of Lyon and from the emergency service of Hôpital Edouard Herriot of Lyon.

Keywords: binary relation, pretopology, minimal closed subset, germ, structuration, classification.

1. Introduction

Pretopology is a mathematical tool for modelling and analysis in a wide variety of fields: social sciences, game theory, graphs, networks, preferences, and the mathematisation of discrete spaces. It provides powerful tools for structure analysis and automatic classification (Hubert Emptoz 1983, Nicoloyannis N. 1988). It makes it possible to follow the development of processes such as dilation, alliance, adherence, closed subset and acceptability (Belmandt Z. 1993, Lamure M. 1987, Duru G. 1980, Auray J. P. 1983).

In data analysis, pretopology provides a structural method based on the concepts of adherence and minimal closed subsets (Belmandt Z. 1993, Bonnevay S. et al. 1999, 2000, Largeron C. et al. 1997). Given a finite set E, the adherence a(.) defined on its subsets can express extension phenomena. Contrary to what occurs in topology, a(.)
is not always a closure, but applying it repeatedly produces closed subsets that characterize homogeneous or interdependent parts of E. This process is the principal mechanism of the structural method detailed in Section 2. However, this method does not yield a partition of a given set, because the structure it produces contains nested groups.

We thus propose a new method of automatic classification based on the minimal closed subsets approach and on the idea of germs from the k-medoids method (P. Berkhin 2002, Raymond T. Ng et al. 2002). Our approach has two steps: producing the minimal closed subsets, and then separating the nested groups obtained in the first step into distinct groups. Its advantages are discussed in Section 3.

(*) We gratefully thank the CRAM of Lyon and Professor Robert (head of the emergency service) for giving us access to their data.

2. Pretopology: Basic Concepts

2.1. Pseudoclosure

Definition 1. A mapping a(.) from P(E) into P(E) is called a pseudoclosure iff

$$a(\emptyset) = \emptyset \tag{1}$$

$$\forall A \in \mathcal{P}(E),\ a(A) \supseteq A \tag{2}$$

By duality, we can define the interior mapping as follows.

Definition 2. Given a pseudoclosure on E, we define the interior mapping by putting

$$\forall A \in \mathcal{P}(E),\ i(A) = [a(A^c)]^c \tag{3}$$

where $A^c$ denotes the subset $E - A$. Then (E, i, a) is called a pretopological space.

Depending on the properties of a(.) (and i(.)), we obtain more or less complex pretopological spaces, from the most general spaces to topological spaces. Pretopological spaces of type V are the most interesting case. In that case, a(.) fulfills the following property:

$$\forall A \in \mathcal{P}(E),\ \forall B \in \mathcal{P}(E),\ A \subseteq B \Rightarrow a(A) \subseteq a(B) \tag{4}$$

Definition 3. Let (E, i, a) be a pretopological space. A subset A of E is called a closed subset iff a(A) = A.

Definition 4. Let (E, i, a) be a pretopological space. A subset A of E is called an open subset iff i(A) = A.

Definition 5. Let (E, i, a) be a pretopological space. For any subset A of E, consider the whole family of closed subsets of E that contain A. If this family has a smallest element for inclusion, that element is called the closure of A and is denoted F(A).

2.2. Pretopology and binary relationships

Suppose we have a family $\{R_i\}_{i=1..n}$ of binary relationships (quantitative or qualitative) on a set E. In this section, we show how to define a pretopological space of type V from the family $\{R_i\}$. For any $x \in E$, let us consider

$$B_i(x) = \{y \in E \mid x R_i y\} \cup \{x\} \tag{5}$$

The family $\mathcal{V}(x)$ of neighbourhoods of x is defined by

$$\mathcal{V}(x) = \{V \in \mathcal{P}(E) \mid \exists i,\ B_i(x) \subseteq V\} \tag{6}$$

We can prove that $\{\mathcal{V}(x) \mid x \in E\}$ is a prefilter of subsets of E, which means:

$$\forall x \in E,\ \forall V \in \mathcal{V}(x),\ V \neq \emptyset \tag{7}$$

$$\forall x \in E,\ \forall V \in \mathcal{V}(x),\ \forall W,\ W \supseteq V \Rightarrow W \in \mathcal{V}(x) \tag{8}$$

Then, from the family $\mathcal{V}(x)$, the pseudoclosure a(.) is defined by

$$\forall A \in \mathcal{P}(E),\ a(A) = \{x \in E \mid \forall V \in \mathcal{V}(x),\ V \cap A \neq \emptyset\} \tag{9}$$

or equivalently:

$$\forall A \in \mathcal{P}(E),\ a(A) = \{x \in E \mid \forall i,\ B_i(x) \cap A \neq \emptyset\} \tag{10}$$

Proposition 1. Given a family $\{R_i\}_{i=1..n}$ of binary relationships on a finite set E, the pretopological space (E, i, a) defined by the pseudoclosure a(.) above is of type V.

The reason for using spaces of type V is that we can build them from a family of reflexive binary relations on the finite set E. This makes it possible to take into account various points of view (various relations), expressed in a qualitative way, when determining the pretopological structure placed on E. The space of type V is the starting point for the definition of a classification of E.

2.3. Elementary closed subsets, minimal closed subsets

Recall that, in a pretopological space (E, i, a), a subset K of E is a closed subset of E if and only if a(K) = K, and that the smallest closed subset containing A is the closure of A. We get the following result:

Proposition 2. In any pretopological space of type V, given a subset A of E, the closure of A always exists.

We denote by $F_e$ the family of elementary closed subsets, that is, the set of closures of the singletons {x} of P(E). So in a pretopological space of type V, we have:
- $\forall x \in E$, $F_x$: closure of {x}
- $F_e = \{F_x \mid x \in E\}$

Definition 6. F is called a minimal closed subset if and only if F is a minimal element of $F_e$ for inclusion.

To determine how E is structured by the pseudoclosure a(.), we use the concept of minimal closed subsets according to the following algorithm. Given the pseudoclosure a(.), we first compute $F_e$; the following function provides the result.

    F_e = ∅;
    for all x ∈ E do {
        F_x = a({x});
        While (a(F_x) ≠ F_x) F_x = a(F_x);
        If (F_x ∉ F_e) F_e = F_e ∪ {F_x};
    }
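The construction of a(.) from a family of relations, and the loop that builds the elementary closed subsets, can be sketched in Python. This is an illustration only: it assumes that the quantifier in equation (10) reads "for all i", the set `E` and the relations `R1`, `R2` are hypothetical toy data, and the function names are chosen here rather than taken from the paper.

```python
# Toy data (hypothetical): a 4-element set and two binary relations,
# each stored as a map x -> {y : x R_i y}.
E = {1, 2, 3, 4}
R1 = {1: {2}, 2: set(), 3: {4}, 4: {3}}
R2 = {1: {2, 3}, 2: set(), 3: {4}, 4: set()}
RELATIONS = [R1, R2]


def ball(relation, x):
    """B_i(x) = {y : x R_i y} united with {x}, as in equation (5)."""
    return relation[x] | {x}


def pseudoclosure(subset):
    """a(A): points x whose every ball B_i(x) meets A (assumed reading of (10))."""
    return {x for x in E if all(ball(r, x) & subset for r in RELATIONS)}


def closure(subset):
    """Iterate a(.) until a(F) == F; this stops because A is included in a(A)."""
    current = set(subset)
    while pseudoclosure(current) != current:
        current = pseudoclosure(current)
    return current


def elementary_closed_subsets():
    """F_e: the distinct closures of the singletons {x}."""
    return {frozenset(closure({x})) for x in E}
```

On this toy data, `pseudoclosure({3})` is `{3}` while `pseudoclosure({4})` is `{3, 4}`: element 4 is not added to a({3}) because its ball under `R2` does not meet {3}, which shows the effect of requiring the condition for every relation.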

Then we determine the minimal closed subsets with the following function; we only need to extract them from $F_e$ (Bonnevay S. 1999, 2000).

    F_m = ∅;
    While (F_e ≠ ∅) {
        Choose F ∈ F_e;
        F_e = F_e − {F};
        minimal = true;
        K = F_e;
        While ((K ≠ ∅) && (minimal)) {
            Choose G ∈ K;
            If (G ⊂ F) minimal = false;
            Else if (F ⊂ G) F_e = F_e − {G};
            K = K − {G};
        }
        If ((minimal == true) && (F ∉ F_m)) F_m = F_m ∪ {F};
    }

Figure 1: Example of structuration method (sets $F_m$ and $F_e$).

Example: To illustrate the method, consider $E = \{1, \dots, 16\}$, $n = 16$. Given $x = (x_1, x_2)$ and $y = (y_1, y_2)$, we put $x R y$ iff $y_2 - x_2 \geq 0$ and $d(x, y) \leq \varepsilon$, for a given $\varepsilon$. Table 1 uses $\varepsilon = 2$:

$$R(x) = \{y \in E \mid y_2 - x_2 \geq 0,\ d(x, y) \leq 2\}$$

Table 1: Relation R(x), pseudoclosure $a(\{x\}) = \{y \in E \mid R(\{y\}) \cap \{x\} \neq \emptyset\}$, closure $F_x$ and cardinality of $a(\{x\})$ for each element x. Minimal closed subsets are marked with a star (*).

| x | x_1 | x_2 | R(x) | a({x}) | F_x | card a({x}) |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1,2,3 | 1 | 1* | 1 |
| 2 | 1 | 2 | 2,3 | 1,2,3 | 1,2,3 | 3 |
| 3 | 2 | 2 | 2,3 | 1,2,3 | 1,2,3 | 3 |
| 4 | 3 | 4 | 4,5,7 | 4,5 | 4,5* | 2 |
| 5 | 4 | 4 | 4,5,6 | 4,5 | 4,5* | 2 |
| 6 | 5 | 5 | 6 | 5,6 | 4,5,6 | 2 |
| 7 | 3 | 6 | 7 | 4,7 | 4,5,7 | 2 |
| 8 | 4 | 1 | 8,9 | 8,9 | 8,9* | 2 |
| 9 | 6 | 1 | 8,9,10,11 | 8,9 | 8,9* | 2 |
| 10 | 7 | 2 | 10,11,12 | 9,10 | 8,9,10 | 2 |
| 11 | 6 | 3 | 11,12 | 9,10,11 | 8,9,10,11 | 3 |
| 12 | 7 | 4 | 12,13,15 | 10,11,12,13 | 8,9,10,11,12,13,14 | 4 |
| 13 | 9 | 4 | 12,13 | 12,13,14 | 8,9,10,11,12,13,14 | 3 |
| 14 | 9 | 3 | 13,14 | 14 | 14* | 1 |
| 15 | 7 | 6 | 15,16 | 12,15,16 | 8,9,10,11,12,13,14,15,16 | 3 |
| 16 | 8 | 6 | 15,16 | 15,16 | 8,9,10,11,12,13,14,15,16 | 2 |

Using the structural method based on the minimal closed subset concept, we get the following result:

Figure 2: Structuring process.

The final structure shown in Figure 2 is obtained as follows. In the first step, we get the minimal closed subsets {{1}, {4, 5}, {8, 9}, {14}} (the darkest grey areas in Figure 2). Next, we get the smallest elementary closed subsets that contain the minimal ones: {{1, 2, 3}, {4, 5, 6}, {4, 5, 7}, {8, 9, 10}}. And so on: {8, 9, 10, 11}, {8, 9, 10, 11, 12, 13, 14}, {8, 9, 10, 11, 12, 13, 14, 15, 16}.

In Table 1 and Figure 2, the sets marked with a star (*) are the minimal elements; they form homogeneous groups of the population. They cannot transmit the poison to the other elements, but they are influenced by the elements of the group that contains them.

The advantage of this method is that it helps us analyze the connections between the elements of a discrete space. However, it only provides a clustering of E when the relationship between the elements of E is symmetric. In many practical situations this is not the case, so we propose using the minimal closed subsets algorithm as a pre-treatment for classical clustering methods, in particular for the K-medoids method. Two cases can thus occur at the end of the minimal closed subsets algorithm:
- $F_m$ provides a partition of E. The clustering is obtained.
- $F_m$ does not provide a partition of E. In this case, we must perform the second step in order to build a clustering based on the result of the first stage.
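The two functions of Section 2.3 can be checked against Table 1 with a short Python sketch. It is illustrative only: the dictionary `R` is copied from the R(x) column of Table 1, while the function names and the set-comprehension formulation (in place of the explicit While loops) are choices made here, not the authors' implementation.

```python
# R(x) column of Table 1 (a single relation, epsilon = 2).
R = {1: {1, 2, 3}, 2: {2, 3}, 3: {2, 3}, 4: {4, 5, 7}, 5: {4, 5, 6},
     6: {6}, 7: {7}, 8: {8, 9}, 9: {8, 9, 10, 11}, 10: {10, 11, 12},
     11: {11, 12}, 12: {12, 13, 15}, 13: {12, 13}, 14: {13, 14},
     15: {15, 16}, 16: {15, 16}}
E = set(R)


def pseudoclosure(subset):
    """a(A) = {x in E : (R(x) united with {x}) meets A}."""
    return {x for x in E if (R[x] | {x}) & subset}


def closure(subset):
    """Apply a(.) repeatedly until the subset is closed."""
    current = set(subset)
    while pseudoclosure(current) != current:
        current = pseudoclosure(current)
    return current


def elementary_closed_subsets():
    """F_e: distinct closures F_x of the singletons."""
    return {frozenset(closure({x})) for x in E}


def minimal_closed_subsets(f_e):
    """F_m: members of F_e that contain no other member of F_e."""
    return {f for f in f_e if not any(g < f for g in f_e)}


def is_partition(family):
    """True when the sets are pairwise disjoint and cover E."""
    covered = set().union(*family)
    return covered == E and sum(len(f) for f in family) == len(E)


F_e = elementary_closed_subsets()
F_m = minimal_closed_subsets(F_e)
```

Here `F_m` contains the four sets {1}, {4, 5}, {8, 9} and {14}, and `is_partition(F_m)` is false, so this example falls into the second case and requires the second step.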

3. Our pretopological clustering method

As previously said, the minimal closed subsets algorithm generally does not provide a partition of the whole set E. However, in its first step, it provides the minimal closed subsets, which play an important role in the structuring process. This gives us the idea of using these minimal closed subsets as germs for the K-medoids method. The problem is that K-medoids methods use singletons as germs. As minimal closed subsets are generally not singletons (see the previous example), the first thing we have to do is to select one and only one element in each minimal closed subset, if needed.

3.1. Determining germs

Let us recall some notations:
- $F_m$: family of minimal closed subsets.
- $F_e = \{F_x \mid x \in E\}$: family of elementary closed subsets.
- $a(\{x\}) = \{y \in E \mid R(\{y\}) \cap \{x\} \neq \emptyset\}$

Two possibilities can occur:
- $F(\{x\}) = F_m(\{x\}) = \{x\}$: x is the germ of the class.
- $F(\{x\}) = F_m(\{x\}) = \{x_1, x_2, \dots, x_p\}$: compute $|a(\{x_i\})|$ for $i = 1, \dots, p$, and select $x_o$ such that $|a(\{x_o\})| = \max_i |a(\{x_i\})|$, for $i = 1, \dots, p$.

If two such $x_o$ exist, the germ is chosen at random; otherwise, we continue by computing a dispersion measure $\tau_x$ associated with each possible germ x.

Case 1: The data are quantitative and we can define a metric d on the set E. For any subset A of E and any $x \in A$, we compute

$$\tau_x = \sum_{y \in A,\ y \neq x} d(x, y)$$

and we select the germ by taking $x_o$ such that $\tau_{x_o} = \min\{\tau_x,\ x \in A\}$.

Case 2: The data are qualitative and any element x can be represented by a binary string, by means of a completely disjunctive table. We compute

$$\tau_{xy} = \frac{\mathrm{BitOne}(x \,\&\&\, y)}{\mathrm{BitOne}(x) + \mathrm{BitOne}(y)}$$

where BitOne(x) returns the number of bits equal to 1 in the binary string that represents x. Then we select the germ by taking $x_o$ such that $\tau_{x_o} = \max\{\tau_{xy},\ y \in A,\ y \neq x\}$.

Once the germ e of each class influencing x has been found, x is assigned, by the assignment approach of Section 3.2, to the class for which the dispersion measure between x and e is the most favourable.

3.2. The assignment approach

Initialization:
- $G = F_m = \{G_j\}$, $j = 1, \dots, |F_m|$; $F_{em} = F_e - F_m = \{F_{em}^i\}$, $i = 1, \dots, |F_{em}|$;
- Sort $F_{em}$ in ascending order with respect to inclusion;
- i = 1;

1. Compute $K = \{G_j \mid G_j \subseteq F_{em}^i\}$ and $H = F_{em}^i - \bigcup_{G_j \in K} G_j$.

2. If $|K| = 1$, put $G_j = G_j \cup H$ and go to step 4; else go to step 3.
3. $\forall x \in H$, determine $e(G_j)$ and compute $\tau(x, e(G_j))$ for every $G_j \in K$; assign x to the class $G_k$ such that:
   - $\tau(x, e(G_k)) = \min\{\tau(x, e(G_j))\}$ in Case 1,
   - $\tau(x, e(G_k)) = \max\{\tau(x, e(G_j))\}$ in Case 2,
   where $e(G_j)$ is the germ determined in $G_j$.
4. $F_{em} = F_{em} - \{F_{em}^i\}$. If $F_{em} = \emptyset$, stop; else, i = i + 1 and go to step 1.

At the end, G is a partition of E.

For a better understanding of this algorithm, let us return to the previous example.

Initialization:
- G = F_m = {{1}, {14}, {4, 5}, {8, 9}}
- F_em = {{1, 2, 3}, {4, 5, 6}, {4, 5, 7}, {8, 9, 10}, {8, 9, 10, 11}, {8, 9, 10, 11, 12, 13, 14}, {8, 9, 10, 11, 12, 13, 14, 15, 16}}

1. $F_{em}^1$ = {1, 2, 3}, K = {{1}}, H = {2, 3}. Assign all elements of H to $G_1$: $G_1$ = {1, 2, 3}, $F_{em} = F_{em} - \{F_{em}^1\}$.
2. $F_{em}^1$ = {4, 5, 6}, K = {{4, 5}}, H = {6}. Assign all elements of H to $G_3$: $G_3$ = {4, 5, 6}, $F_{em} = F_{em} - \{F_{em}^1\}$.
3. ...
4. $F_{em}^1$ = {8, 9, 10, 11}, K = {{8, 9, 10}}, H = {11}. Assign all elements of H to $G_4$: $G_4$ = {8, 9, 10, 11}, $F_{em} = F_{em} - \{F_{em}^1\}$.
5. $F_{em}^1$ = {8, 9, 10, 11, 12, 13, 14}, K = {{14}, {8, 9, 10, 11}}.
   τ(12, 11) < τ(12, 14) => assign {12} to $G_4$.
   τ(13, 14) < τ(13, 11) => assign {13} to $G_2$.
   $G_2$ = {13, 14}, $G_4$ = {8, 9, 10, 11, 12}, $e(G_2)$ = {13}, $e(G_4)$ = {12}.

Result: G = {{1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11, 12, 15, 16}, {13, 14}} (see Figure 3).

Figure 3: Clustering process.

What are the advantages of this method?
- First, it provides a clustering of the population.
- Second, the number of classes (the number of minimal closed subsets) is computed by the method, whereas it must be chosen by the user in other methods such as the k-medoids method.
- Last, the germ of each class is easy to extract from the minimal closed subsets.
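The assignment approach can be sketched in Python for the quantitative case (Case 1) on the data of Table 1. The sketch rests on several assumptions that are not stated in the text: K is taken as the classes whose original minimal closed subset is contained in the current elementary closed subset (so that {4, 5, 7} is still attached to the class grown from {4, 5}), H is restricted to elements not yet assigned, germs are computed once per step, d is the Euclidean distance, and ties between equidistant germs go to the larger class (element 16 is such a tie). All function names are hypothetical.

```python
from math import dist

# Coordinates (x_1, x_2) and the cardinality column |a({x})| of Table 1.
P = {1: (1, 1), 2: (1, 2), 3: (2, 2), 4: (3, 4), 5: (4, 4), 6: (5, 5),
     7: (3, 6), 8: (4, 1), 9: (6, 1), 10: (7, 2), 11: (6, 3), 12: (7, 4),
     13: (9, 4), 14: (9, 3), 15: (7, 6), 16: (8, 6)}
A_SIZE = {1: 1, 2: 3, 3: 3, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2,
          11: 3, 12: 4, 13: 3, 14: 1, 15: 3, 16: 2}
F_M = [{1}, {14}, {4, 5}, {8, 9}]                      # G_1 .. G_4
F_EM = [{1, 2, 3}, {4, 5, 6}, {4, 5, 7}, {8, 9, 10}, {8, 9, 10, 11},
        set(range(8, 15)), set(range(8, 17))]


def sq_dist(x, y):
    """Squared Euclidean distance (exact on integer coordinates)."""
    (a1, a2), (b1, b2) = P[x], P[y]
    return (a1 - b1) ** 2 + (a2 - b2) ** 2


def germ(group):
    """Largest |a({x})|, then smallest dispersion tau_x, then smallest label."""
    def key(x):
        tau = sum(dist(P[x], P[y]) for y in group if y != x)
        return (-A_SIZE[x], tau, x)
    return min(group, key=key)


def assign(seeds, f_em):
    """Grow each minimal closed subset into a class of the final partition."""
    groups = [set(s) for s in seeds]
    for f in sorted(f_em, key=len):        # size order respects inclusion
        assigned = set().union(*groups)
        k = [j for j, s in enumerate(seeds) if s <= f]
        if not k:
            continue
        germs = {j: germ(groups[j]) for j in k}
        for x in sorted(f - assigned):     # the set H of the text
            # Nearest germ; ties go to the larger class (assumption).
            target = min(k, key=lambda j: (sq_dist(x, germs[j]),
                                           -len(groups[j])))
            groups[target].add(x)
    return groups


clusters = assign(F_M, F_EM)
```

With these choices, `clusters` holds the classes {1, 2, 3}, {13, 14}, {4, 5, 6, 7} and {8, 9, 10, 11, 12, 15, 16}, which is the partition stated in the example. A different tie-breaking rule would move element 16 to the class of 13 and 14.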

4. Conclusion

This article presents a new clustering method based on the concept of minimal closed subsets in pretopology. Pretopology helps us to analyze the structure of a finite set in a discrete space, but it generally does not provide a clustering. This restriction is overcome by the second stage of the method, which defines a concept of germ from which a clustering process is built. The new method offers a decision-making aid in the social sciences, where data are often complex and do not allow the population to be represented in a metric space. One typical application of this kind of method is the analysis of data from medico-economic databases such as DRGs.

References

Abdul-Amier Hashom (1982) Plus proches voisins et classification automatique. Applications à des données industrielles, Thèse Th. Doct. 3e cycle Mathématiques des systèmes: INSA Lyon.
Auray J. P. (1983) Contribution à l'analyse des structures pauvres, Thèse d'Etat, Université Lyon 1.
Belmandt Z. (1993) Manuel de prétopologie et ses applications, Edition Hermès.
Bonnevay S., Lamure M., Largeron-Leteno C., Nicoloyannis N. (1999) A pretopological approach for structuring data in non metric space, in: Electronic Notes in Discrete Mathematics, Melvin F. Janowitz, Elsevier Science Publishers, 2.
Bonnevay S., Largeron C. (2000) Data analysis based on minimal closed subsets, in: Data Analysis, Classification and Related Methods, Kiers et al. editors, Springer, 303-308.
Duru G. (1980) Contribution à l'étude des structures des systèmes complexes dans les sciences humaines, Thèse d'Etat, Université Lyon 1.
Hubert Emptoz (1983) Modèle prétopologique pour la reconnaissance des formes. Applications en neurophysiologie, Thèse: Th. Sc. Univ. Cl. Bernard. Lyon I.
Lamure M. (1987) Contribution à l'analyse des espaces abstraits - application aux images digitales, Thèse d'Etat, Université Lyon 1.
Largeron C., Bonnevay S. (1997) Une méthode de structuration par recherche de fermés minimaux. Application à la modélisation de flux de migrations inter-villes, in: Société Francophone de Classification 97, 111-118.
Nicoloyannis N. (1988) Structure prétopologiques et classification automatique: le logiciel DEMON, Thèse Lyon.
P. Berkhin (2002) Survey of clustering data mining techniques.
Raymond T. Ng and Jiawei Han (2002) CLARANS: A Method for Clustering Objects for Spatial Data Mining, in: IEEE Transactions on Knowledge and Data Engineering, 14(5).