Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

ATTILA GYENESEI
Turku Centre for Computer Science (TUCS)
University of Turku, Department of Computer Science
Lemminkäisenkatu 14A, FIN-20520 Turku
FINLAND

Abstract: The problem of mining association rules for fuzzy quantitative items was introduced and an algorithm proposed in [5]. However, the algorithm assumes that fuzzy sets are given. In this paper we propose a method to find the fuzzy sets for each quantitative attribute in a database by using clustering techniques. We present a scheme for finding the optimal partitioning of a data set during the clustering process regardless of the clustering algorithm used. More specifically, we present an approach for evaluation of clustering partitions so as to find the best number of clusters for each specific data set. This is based on a goodness index, which assesses the most compact and well-separated clusters. We use these clusters to classify each quantitative attribute into fuzzy sets and define their membership functions. These steps are combined into a concise algorithm for finding the fuzzy sets. Finally, we describe the results of using this approach to generate association rules from a real-life dataset. The results show that a higher number of interesting rules can be discovered, compared to partitioning the attribute values into equal-sized sets.

Key-Words: association rules, fuzzy items, quantitative attributes, clustering

1 Introduction

Since knowledge can often be expressed in a more natural way by using fuzzy sets, many decision support problems can be greatly simplified. We attempt to take advantage of fuzzy sets in knowledge discovery from databases. One important topic in knowledge discovery and decision support research is concerned with the discovery of interesting association rules [1]. An interesting association rule describes an interesting relationship among different attributes. Given a set of transactions where each transaction is a set of items, an association rule is an expression of the form X → Y, where X and Y are sets of items.
An example of an association rule is: 4% of transactions that contain beer and potato chips also contain diapers; 5% of all transactions contain all of these items. Here 4% is called the confidence of the rule, and 5% the support of the rule. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.

The problem of mining boolean association rules over supermarket data was introduced in [2], and later broadened in [3], for the case of databases consisting of categorical attributes alone. In practice the information in many, if not most, databases is not limited to categorical attributes, but also contains much quantitative data. The problem of mining quantitative association rules was introduced and an algorithm proposed in [4]. The algorithm involves discretizing the domains of quantitative attributes into intervals in order to reduce the domain into a categorical one. An example of a rule according to this definition would be: % of married people between age 50 and 70 have at least 2 cars. However, these intervals may not be concise and meaningful enough for human experts to easily obtain nontrivial knowledge from the rules discovered.

In [5], we showed a method to handle quantitative attributes using a fuzzy approach. Instead of using intervals, the method employs linguistic terms to represent the revealed regularities and exceptions. We assigned each quantitative attribute several fuzzy sets which characterize it. Fuzzy sets provide a smooth transition between a member and non-member of a set. The fuzzy association rule is also easily understandable to a human because of the linguistic terms associated with the fuzzy sets. Using the fuzzy set concept, the above example could be rephrased e.g. % of married old people have several cars.

However, the algorithm proposed in [5] for fuzzy association rule mining suffers from the following problem. The user or an expert must provide this algorithm the required fuzzy sets of the quantitative
attributes and their corresponding membership functions. It is unrealistic to assume that experts can always provide the fuzzy sets of the quantitative attributes in the database for fuzzy association rule mining. To deal with this problem, we intend to find the fuzzy sets by using clustering techniques.

In this paper, we present an approach for clustering scheme evaluation. It aims at evaluating the schemes produced by a specific clustering algorithm, assuming different input parameter values. These schemes are evaluated using a new clustering scheme validity index, which we define. Our goal is not to propose a new clustering algorithm or to evaluate a variety of clustering algorithms, but to produce the clustering scheme with the most compact and well-separated clusters for any given algorithm.

The remainder of the paper is organized as follows. In the next section we describe the proposed goodness index for clustering scheme evaluation. In Section 3, we exploit the discovered cluster centers to classify the quantitative attribute values into fuzzy sets, and show a method to find the corresponding membership function for each fuzzy set. Then we formulate our approach into a precise algorithm in Section 4. In Section 5 the experimental results are reported, comparing obtained association rules both qualitatively and quantitatively. The paper ends with a brief conclusion in Section 6.

2 Clustering Scheme Evaluation

The objective of the clustering methods is to provide in some sense optimal partitions of a data set. In general, they should search for well-separated clusters whose members are close to each other. Another problem in clustering is to decide the optimal number of clusters that best fits a data set. The majority of clustering algorithms produce a partitioning based on the input parameters (e.g. number of clusters, minimum density) that finally lead to a finite number of clusters. Thus, the application of an algorithm assuming different input parameter values results in different partitions of a particular data set, which are not easily comparable.
A solution to this problem is to run the algorithm repetitively with different input parameter values and compare the results against a well-defined validity index. A number of cluster validity indices are described in the literature. A cluster validity index for crisp clustering proposed in [6] attempts to identify compact and separated clusters. Other validity indices for crisp clustering have been proposed in [7] and [8]. The implementation of most of these measures is computationally very expensive, especially when the number of clusters and the number of objects in the data set grow very large [9]. Other validity measures are proposed in [10], [11]. We should mention that the evaluation of proposed measures and the analysis of their reliability have been quite limited.

In the following, we define a goodness index for evaluating clustering schemes based on the validity index defined for the fuzzy c-means method (FCM) in [11]. We use the same concepts for validation, but the goodness index can be used for any clustering method, not just for FCM. Assume that we study a quantitative attribute X.

Definition 1 The variance of an attribute X, denoted $\sigma(X)$, is defined as

$$\sigma(X) = \frac{1}{n} \sum_{k=1}^{n} (x_k - \bar{x})^2,$$

where $x_1, \ldots, x_n$ are the attribute instances, and $\bar{x}$ is the mean given by

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k.$$

Definition 2 The variance of cluster i containing elements $X_i = \{x_1, \ldots, x_{n_i}\}$ is given by

$$\sigma(X_i, r_i) = \frac{1}{n_i} \sum_{k=1}^{n_i} (x_k - r_i)^2,$$

where $r_i$ is the center of cluster i, having $n_i$ elements.

Definition 3 The average scattering (separation) for c clusters is defined as

$$Scat(X, R) = \frac{1}{c} \sum_{i=1}^{c} \frac{\sigma(X_i, r_i)}{\sigma(X)},$$

where R is the set of c cluster centers.

Scat(X,R) indicates the average compactness of clusters. A small value for this term indicates compact clusters, and as the scattering within clusters increases (they become less compact) the value of Scat(X,R) also increases.

Definition 4 The total separation between clusters is given by

$$Dis(R) = \frac{D_{max}}{D_{min}} \sum_{i=1}^{c} \left( \sum_{j=1}^{c} |r_i - r_j| \right)^{-1},$$

where $D_{max}$ is the maximum, and $D_{min}$ is the minimum distance between cluster centers.
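As a rough illustration, the quantities in Definitions 1 to 4 could be computed as in the following Python sketch. It assumes one-dimensional data, clusters given as lists of values with one center each, and at least two distinct centers. The function names (variance, scat, dis) are illustrative only and do not come from the paper.

```python
from itertools import combinations


def variance(values, center=None):
    """Mean squared deviation of values around center (their mean if omitted)."""
    if center is None:
        center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values) / len(values)


def scat(clusters, centers):
    """Average within-cluster variance relative to the variance of all values."""
    all_values = [x for cluster in clusters for x in cluster]
    total = variance(all_values)
    within = sum(variance(cl, r) for cl, r in zip(clusters, centers))
    return within / (len(centers) * total)


def dis(centers):
    """Total separation between cluster centers; a smaller value is better."""
    gaps = [abs(a - b) for a, b in combinations(centers, 2)]
    ratio = max(gaps) / min(gaps)  # D_max / D_min
    return ratio * sum(1.0 / sum(abs(r - s) for s in centers) for r in centers)
```

For centers 0, 10 and 30, for instance, D_max = 30 and D_min = 10, so dis gives 3 · (1/40 + 1/30 + 1/50) = 0.235.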
The term total separation sounds like a measure that we want to maximize. However, here the opposite holds: a smaller value is better. Dis(R) indicates the total separation (scattering) between the c clusters, and generally, this term will increase with the number of clusters. Now, we can define our goodness index based on the last two definitions.

Definition 5 The goodness index for the set of cluster centers R within set X is as follows:

$$G(X, R) = \alpha \, Scat(X, R) + Dis(R),$$

where $\alpha$ is a weighting factor equal to $Dis(c_{max})$, and $c_{max}$ is the maximum number of input clusters.

The goodness index uses cluster separation as an indication of the average scattering between clusters. Minimizing the separation thus also tends to minimize the possibility of selecting a cluster scheme with significant differences in cluster distances. Since the two terms of the goodness index are of different ranges, a weighting factor is needed in order to incorporate both terms in a balanced way. (Note that the influence of the weighting factor is an issue for further study as mentioned in [11].)

Fig. 1: Example of Goodness Index for the Attribute Age (Goodness Index versus Number of Clusters)

For example, Fig. 1 shows the values of the goodness index as a function of the number of clusters for attribute Age, which is given in Section 5. We can see that the best number of clusters is three for this dataset.

3 Determining Fuzzy Sets by Using the Discovered Cluster Scheme

After we have obtained the best cluster scheme (i.e. centers of clusters), we can use this to classify the quantitative attribute values into c fuzzy sets. We divide the attribute interval into c sub-intervals by using the discovered $r_i$ values, with a coverage of p percent between two adjacent ones, and give each sub-interval a symbolic name related to its position (Fig. 2).

Fig. 2: Example of the proposed fuzzy partitions (sets low, middle and high between MinValue and MaxValue)

To specify our heuristic method, we give the following definitions.
Definition 6 The effective upper bound, denoted $d_i^+$ for fuzzy set i, is given by:

$$d_i^+ = r_i + 0.5(1 + p)(r_{i+1} - r_i),$$

where p is the overlap parameter in % (entered in the formula as a fraction, e.g. 0.3 for 30%), and $r_i$ is the center of cluster i, $i = \{1, 2, \ldots, c-1\}$. $d_i^+$ is also the fuzzy lower bound of cluster i+1.

Definition 7 The effective lower bound, denoted $d_j^-$ for fuzzy set j, is as follows:

$$d_j^- = r_j - 0.5(1 + p)(r_j - r_{j-1}),$$

where p is the overlap parameter in %, and $r_j$ is the center of cluster j, $j = \{2, 3, \ldots, c\}$. $d_j^-$ is also the fuzzy upper bound of cluster j-1.

Notice that $0.5(1 + p) = 0.5 + p/2$. These definitions become clear by inspecting Fig. 2. To quote an example, we classify the attribute Age into three fuzzy sets as given in Table 1, where Age ranges from 15 to 90.

Table 1: The ranges of the fuzzy sets of Age (p = 30%)

Fuzzy set       Range             Cluster center
(Age,young)     15 to 43.95       31.65
(Age,middle)    38.28 to 65.80    50.58
(Age,high)      58.77 to 90       73.99

For instance, the upper bound of (Age,young) is 31.65 + 0.65 · (50.58 - 31.65) ≈ 43.95.

In the following, we describe how to generate the corresponding membership function for each fuzzy set of a quantitative attribute. Let $\{r_1, r_2, \ldots, r_i, \ldots, r_c\}$ be the cluster centers for a quantitative attribute. We use the following formulas to define the required membership functions for each fuzzy set. For the fuzzy set with cluster center $r_1$, the membership function for element x is given by
$$f(r_1, x) = \begin{cases} 1 & \text{if } x \le d_2^- \\ \dfrac{d_1^+ - x}{d_1^+ - d_2^-} & \text{if } d_2^- < x < d_1^+ \\ 0 & \text{if } x \ge d_1^+ \end{cases}$$

For the fuzzy set with cluster center $r_c$, the membership function for element x is given by

$$f(r_c, x) = \begin{cases} 0 & \text{if } x \le d_c^- \\ \dfrac{x - d_c^-}{d_{c-1}^+ - d_c^-} & \text{if } d_c^- < x < d_{c-1}^+ \\ 1 & \text{if } x \ge d_{c-1}^+ \end{cases}$$

For the fuzzy set with cluster center $r_i$, where $1 < i < c$, the membership function for element x is given by

$$f(r_i, x) = \begin{cases} 0 & \text{if } x \le d_i^- \\ \dfrac{x - d_i^-}{d_{i-1}^+ - d_i^-} & \text{if } d_i^- < x < d_{i-1}^+ \\ 1 & \text{if } d_{i-1}^+ \le x \le d_{i+1}^- \\ \dfrac{d_i^+ - x}{d_i^+ - d_{i+1}^-} & \text{if } d_{i+1}^- < x < d_i^+ \\ 0 & \text{if } x \ge d_i^+ \end{cases}$$

4 An Algorithm for Finding Fuzzy Sets by Using a Clustering Scheme Goodness Index

In Section 2 we have defined a goodness index for clustering scheme evaluation. We exploit this index during the clustering process in order to define the optimal number of clusters for a quantitative attribute. More specifically, we first define the range of input parameters (e.g. number of clusters) of a clustering algorithm. Let parameter c denote the number of clusters, to be optimized. The range of values for c is defined by an expert, so that the clustering schemes produced are compatible with expected attribute partitions. Then, a clustering algorithm is performed for each value of c and the results of clustering are evaluated using the goodness index G. We use the discovered most compact and well-separated clusters to classify each quantitative attribute into fuzzy sets. After that, we can generate the corresponding membership function for each fuzzy set.

The steps for finding fuzzy sets can be summarized as: (1) finding the best clustering scheme by using a goodness index for each quantitative attribute, (2) constructing fuzzy sets with the c cluster centers, and (3) deriving the corresponding membership functions.
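Steps (2) and (3) can be illustrated with a minimal Python sketch. It assumes the cluster centers are distinct and sorted in ascending order, and that the overlap p is passed as a positive fraction (0.3 for 30%). The names fuzzy_bounds and membership, and the 0-based indexing, are illustrative choices rather than the paper's notation. Instead of writing out the three piecewise cases separately, the sketch takes the minimum of a rising edge and a falling edge. This coincides with the piecewise definitions of Section 3 as long as the plateau of each middle set is non-empty.

```python
def fuzzy_bounds(centers, p):
    """Return (lower, upper) bound lists for sorted cluster centers.

    upper[i] is the effective upper bound d_i^+ (None for the last set) and
    lower[i] is the effective lower bound d_i^- (None for the first set).
    """
    c = len(centers)
    half = 0.5 * (1.0 + p)
    upper = [centers[i] + half * (centers[i + 1] - centers[i]) if i < c - 1 else None
             for i in range(c)]
    lower = [centers[i] - half * (centers[i] - centers[i - 1]) if i > 0 else None
             for i in range(c)]
    return lower, upper


def membership(i, x, centers, p):
    """Degree to which x belongs to the fuzzy set built around centers[i]."""
    lower, upper = fuzzy_bounds(centers, p)
    rising = 1.0   # left edge: climbs from d_i^- to d_{i-1}^+
    if i > 0:
        lo, hi = lower[i], upper[i - 1]
        rising = min(1.0, max(0.0, (x - lo) / (hi - lo)))
    falling = 1.0  # right edge: drops from d_{i+1}^- to d_i^+
    if i < len(centers) - 1:
        lo, hi = lower[i + 1], upper[i]
        falling = min(1.0, max(0.0, (hi - x) / (hi - lo)))
    return min(rising, falling)
```

With the Age cluster centers 31.65, 50.58 and 73.99 and p = 0.3, the sketch gives an upper bound of about 43.95 for the first set and a lower bound of about 58.77 for the last set, the values listed for Age in Table 1. Inside an overlap region the memberships of the two adjacent sets add up to 1.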
Main algorithm (C_alg, X, c_min, c_max, p)

(* First phase: finding the optimal number of clusters and cluster centers *)
Initialize: c ← c_max
repeat
    Run the clustering algorithm C_alg for data set X to produce c cluster centers R
    Compute the goodness index G(X, R)
    if (c = c_max) then
        α ← Dis(c_max)
        G_opt ← G(c)
        c_opt ← c
    endif
    else if G(c) < G_opt then
        c_opt ← c
        G_opt ← G(c)
    endif
    c ← c-1
until c = c_min-1

(* Second phase: constructing fuzzy sets with the c cluster centers *)
for i := 1 to c_opt do
    if i < c_opt then determine d_i^+ by using p
    if i > 1 then determine d_i^- by using p
endfor

(* Third phase: generating membership function for each fuzzy set *)
for each x_i ∈ X do
    for each r_i ∈ R do
        Compute the corresponding membership function f(r_i, x_i)
    endfor
endfor
End algorithm

Parameters:
C_alg = the clustering algorithm
X = {x_1, x_2, ..., x_n} the set of attribute values to be clustered
c_min = the minimum number of clusters
c_max = the maximum number of clusters
p = overlap parameter in %

5 Experimental Results

We assessed the effectiveness of our approach by experimenting with a real-life dataset. The data set comes from research by the U.S. Census Bureau. The data had 6 quantitative attributes for 63756 families: age of family head in years (head is the reference person in a family), number of persons, children in family, education level of head, head's personal income and family income.
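Before turning to the results, the first phase of the main algorithm in Section 4 can be sketched in Python. This is only an illustration under stated assumptions. The plain one-dimensional c-means routine kmeans_1d and its evenly spaced initialisation are stand-ins chosen here, since any clustering algorithm C_alg could be plugged in. The names best_scheme, variance and dis are hypothetical. The sketch also assumes c_min is at least 2 and that the clustering never returns two identical centers.

```python
from itertools import combinations


def variance(values, center):
    """Mean squared deviation of values around center (0 for an empty list)."""
    return sum((x - center) ** 2 for x in values) / len(values) if values else 0.0


def dis(centers):
    """Total separation Dis(R) between cluster centers; smaller is better."""
    gaps = [abs(a - b) for a, b in combinations(centers, 2)]
    return max(gaps) / min(gaps) * sum(
        1.0 / sum(abs(r - s) for s in centers) for r in centers)


def kmeans_1d(values, c, max_iter=100):
    """Plain 1-D c-means (Lloyd iteration), a stand-in for any algorithm C_alg."""
    data = sorted(values)
    n = len(data)
    # Assumed initialisation: pick c evenly spaced elements of the sorted data.
    centers = [data[(2 * k + 1) * n // (2 * c)] for k in range(c)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(c)]
        for x in data:
            nearest = min(range(c), key=lambda k: abs(x - centers[k]))
            clusters[nearest].append(x)
        updated = [sum(cl) / len(cl) if cl else centers[k]
                   for k, cl in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


def best_scheme(values, c_min, c_max, cluster=kmeans_1d):
    """First phase: return (c_opt, centers, G_opt), trying c from c_max down to c_min."""
    total_var = variance(values, sum(values) / len(values))
    alpha = None
    best = None
    for c in range(c_max, c_min - 1, -1):
        centers, clusters = cluster(values, c)
        if c == c_max:
            alpha = dis(centers)  # weighting factor alpha = Dis(c_max)
        scat = sum(variance(cl, r) for cl, r in zip(clusters, centers)) / (c * total_var)
        g = alpha * scat + dis(centers)  # goodness index G(X, R)
        if best is None or g < best[2]:
            best = (c, centers, g)
    return best
```

On the nine values 1, 2, 3, 11, 12, 13, 21, 22, 23, which form three tight groups, calling best_scheme with c_min = 2 and c_max = 4 selects three clusters with centers 2, 12 and 22.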
Fig. 3: Goodness index as a function of the number of clusters (attributes IncHead and IncFam)

First, we evaluate the proposed approach for finding the optimal clustering scheme using the above data set. The clustering schemes are discovered using the C-means algorithm while its input parameter (number of clusters) takes values between 2 and 5 for the attributes FamPers and NumKids, and between 2 and 9 for the others (see Fig. 3 for attributes IncHead and IncFam). Applying the first phase of our algorithm (see Section 4), Table 2 shows the best number of clusters for different attributes. After finding the best cluster scheme, we can create the fuzzy sets for each quantitative attribute by using the discovered cluster centers. For example, the ranges of the fuzzy sets of Age are shown in Table 1. These ranges include all values where the membership function is positive.

Table 2: The best number of clusters

Attr.      No. of clust.   Cluster centers
Age        3               31.65, 50.58, 73.99
FamPers    2               .73, 4.48
NumKids    3               .6, .4, 4.5
EdHead     3               34.8, 39.39, 43.35
IncHead    4               3436, 48, 84933, 7354
IncFam     4               656, 47396, 88938, 6794

In the following, we illustrate how the above concept (clustering-based partitioning) gives a larger number of frequent itemsets and interesting association rules than the case when we don't use the proposed approach. In the latter case we use the same number of attribute elements for each interval (quantile-based partitioning). Note that the same definitions of membership functions are used for both methods as described in Section 3. In deriving the association rules, we apply the algorithm described in [5], developed for fuzzy attributes. It is an extension of the well-known technique based on incrementally finding the frequent sets [3].

Fig. 4(a) and Fig. 4(b) show the average support and the number of frequent itemsets for different minimum support thresholds. As expected, the average support increases and the number of frequent itemsets decreases as the minimum support increases from 10% to 50%.
We can see that the clustering-based partitioning gives a higher number of frequent itemsets. However, the quantile-based partitioning gives higher average support values if the minimum support is between 0.35 and 0.5. Note, however, that this average is based on only two frequent itemsets.

Fig. 4: (a) Average Support and (b) Number of Frequent Itemsets as functions of Minimum Support, for the clustering method and the quantile method

Fig. 5(a) shows the average confidence for different minimum confidence thresholds. The result is quite similar to that in Fig. 4(a). We can see that the clustering-based partitioning gives in most cases higher average confidence values. Fig. 5(b) shows the number of generated rules as a function of the minimum confidence threshold, for both the clustering and the quantile case. The minimum support was set to 3%. The results are as expected: the numbers of rules for the clustering-based partitioning case are larger, but both decrease with increasing confidence threshold.
Fig. 5: (a) Average Confidence and (b) Number of Interesting Rules as functions of Minimum Confidence, for the clustering method and the quantile method

Finally, we show some interesting rules. The minimum support was set to 3% and the minimum confidence to 5%.

IF EdHead is medium THEN IncHead is low
IF IncHead is medium THEN IncFam is medium
IF FamPers is low AND NumKids is low THEN EdHead is low
IF FamPers is low AND NumKids is low AND IncHead is low THEN IncFam is low

We see that the rules are very easy to read and understand for anyone. This is our main goal in using fuzzy partitions for attributes. Of course, the usefulness of the rules can only be judged by a human.

6 Conclusion

The problem of mining association rules for fuzzy quantitative items was introduced in [5]. However, the algorithm assumes that the fuzzy sets are given. In this paper we have proposed a method to find the fuzzy sets for each quantitative attribute in a database by using clustering techniques. We defined the goodness index G for clustering scheme evaluation, based on two criteria: compactness and separation. The goodness index is a variant of the indices defined for the fuzzy c-means algorithm in [11], adapted to crisp clustering algorithms. Our approach is independent of the clustering algorithm used to partition the data set. After having obtained the best cluster scheme, we exploited the discovered cluster centers to classify the quantitative attribute values into fuzzy sets, and showed a method to find the corresponding membership function for each fuzzy set discovered. Then we combined the different steps into an explicit algorithm.

The experimental results demonstrated that by using the goodness index G as a basis for generating clusters (and thereby fuzzy sets), a higher number of fuzzy association rules can be discovered. According to our observations, we claim that the generated rules are very meaningful for real-life data sets.

References:
[1] G. Piatetsky-Shapiro, W.J.
Frawley, Knowledge Discovery in Databases. AAAI Press, 1991.
[2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases. Proc. of ACM SIGMOD, 1993, pp. 207-216.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases. Proc. of the 20th VLDB Conference, 1994, pp. 487-499.
[4] R. Srikant, R. Agrawal, Mining quantitative association rules in large relational tables. Proceedings of ACM SIGMOD, 1996, pp. 1-12.
[5] A. Gyenesei, Mining Weighted Association Rules for Fuzzy Quantitative Items. Proceedings of PKDD Conference, Lyon, 2000, pp. 416-423.
[6] J.C. Dunn, Well separated clusters and optimal fuzzy partitions. J. Cybern., 1974, pp. 95-104.
[7] R.N. Dave, Validating fuzzy partitions obtained through c-shells clustering. Pattern Recognition Letters, Vol. 17, 1996, pp. 613-623.
[8] Z. Huang, A Fast Clustering Algorithm to Cluster very Large Categorical Data Sets in Data Mining. DMKD, 1997.
[9] X.L. Xie, G. Beni, A Validity measure for Fuzzy Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, 1991.
[10] Gath, B. Geva, Unsupervised Optimal Fuzzy Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, 1989.
[11] R. Rezaee, Lelieveldt, Reiber, A new cluster validity index for the fuzzy c-mean. Pattern Recognition Letters, 19, 1998, pp. 237-246.