SSDR: An Algorithm for Clustering Categorical Data Using Rough Set Theory


Available online at Pelagia Research Library

Advances in Applied Science Research, 2011, 2 (3):
ISSN: CODEN (USA): AASRFC

B. K. Tripathy and *Adhir Ghosh
School of Computer Science and Engineering, VIT University, Vellore, Tamil Nadu, India

ABSTRACT

At present, a large number of clustering algorithms are available to group objects having similar characteristics. However, many of these algorithms are difficult to apply to categorical data. Some of the available algorithms cannot handle categorical data, while others are unable to handle uncertainty. Many of them also have stability problems and efficiency issues. This necessitated the development of algorithms that cluster categorical data and also deal with uncertainty. In 2007, an algorithm termed MMR was proposed [3], which uses rough set theory concepts to deal with the above problems in clustering categorical data. Later, in 2009, this algorithm was improved to develop the algorithm MMeR [2], which can handle hybrid data. Very recently, in 2011, MMeR was again improved to develop an algorithm called SDR [22], which can also handle hybrid data. The last two algorithms can handle uncertainty and categorical data at the same time, but SDR is more efficient than MMeR and MMR. In this paper, we propose a new algorithm in this sequence, which is better than all its predecessors MMR, MMeR and SDR, and we call it the SSDR (Standard deviation of Standard Deviation Roughness) algorithm. It handles numerical and categorical data simultaneously, besides taking care of uncertainty. This algorithm also gives better performance when tested on well known datasets.

Keywords: Clustering, MMeR, MMR, SDR, SSDR, uncertainty.

INTRODUCTION

The basic objective of clustering is to group data or objects having similar characteristics in the same cluster, dissimilar from the objects in other clusters. It has been used in data mining tasks such as unsupervised classification and data summation. It is also used in the segmentation of large heterogeneous data sets into smaller homogeneous subsets which are easily managed, separately modeled and analyzed [8]. The basic goal in cluster analysis is to discover natural groupings of objects [11]. Clustering techniques are used in many areas such as manufacturing, medicine, nuclear science, radar scanning, and research and development. For example, Wu et al. [21] developed a clustering algorithm specifically designed for handling the complexity of gene data. Jiang et al. [13] analyze a variety of cluster techniques which can be applied to gene expression data. Wong et al. [16] presented an approach used to segment tissues in a nuclear medical imaging method known as positron emission tomography (PET). Haimov et al. [20] used cluster analysis to segment radar signals in scanning land and marine objects. Finally, Mathieu and Gibson [19] used cluster analysis as part of a decision support tool for large scale research and development planning, to identify programs to participate in and to determine resource allocation.

The problem with all the above mentioned algorithms is that they mostly deal with numerical data sets, that is, databases having attributes with numeric domains. The basic reason for dealing with numerical attributes is that they are easy to handle and it is easy to define similarity on them. But categorical data have multi-valued attributes. Thus, similarity can be defined as common objects, common values for the attributes and the association between the two. In such cases horizontal co-occurrences (common value for the objects) as well as vertical co-occurrences (common value for the attributes) can be examined [2]. Other algorithms that can handle categorical data have been proposed, including work by Huang [3], Gibson et al. [4], Guha et al. [3] and Dempster et al. [1]. While these algorithms are very helpful in forming clusters from categorical data, they have the disadvantage that they cannot deal with uncertainty. However, in real world applications it has been found that there is often no sharp boundary between clusters. Recently, Huang [8] and Kim et al. [14] have developed clustering algorithms using fuzzy sets which can handle categorical data. But these algorithms suffer from a stability problem: they do not produce the same result on every run, so multiple runs are needed to obtain satisfactory values.

Therefore, there is a need for a robust algorithm that can handle uncertainty and categorical data together. In this sequence, Parmar et al. [3] in 2007 and B. K. Tripathy et al. in 2009 [2] and in 2011 [22] proposed three algorithms which can deal with both uncertainty and categorical attributes. The efficiency and stability of these algorithms come into play when the purity ratio is measured. The purity ratios of MMR, MMeR and SDR are in increasing order. In this paper, a new algorithm called the Standard Deviation of Standard Deviation Roughness (SSDR) algorithm is proposed, which has a higher purity ratio than all the previous algorithms in this series and those prior to it. We establish the superiority of this algorithm over the others by testing them on a familiar data base, the zoo data set taken from the UCI repository.

MATERIALS AND METHODS

2.1 Materials

In this section we first present the literature review as the basis of the proposed work, then the definitions of the concepts to be used in the work, and the notations to be used.

2.1.1 Literature Review

In this section we present the literature on existing categorical clustering algorithms.

Dempster et al. [1] present a partitional clustering method called the Expectation-Maximization (EM) algorithm. EM first randomly assigns different probabilities to each class or category, for each cluster. These probabilities are then successively adjusted to maximize the likelihood of the data given the specified number of clusters. Since the EM algorithm computes the classification probabilities, each observation belongs to each cluster with a certain probability. The actual assignment of observations to a cluster is determined based on the largest classification probability. After a large number of iterations, EM terminates at a locally optimal solution.

Han et al. [26] propose a clustering algorithm to cluster related items in a market database based on an association rule hypergraph. A hypergraph is used as a model for relatedness. The approach targets binary transactional data. It assumes that the item sets that define clusters are disjoint and that there is no overlap amongst them. However, this assumption may not hold in practice, as transactions in different clusters may have a few common items.

K-modes [8] extends K-means and introduces a new dissimilarity measure for categorical data. The dissimilarity measure between two objects is calculated as the number of attributes whose values do not match. The K-modes algorithm then replaces the means of clusters with modes, using a frequency based method to update the modes in the clustering process so as to minimize the clustering cost function. One advantage of K-modes is that the modes are useful in interpreting the results [8]. However, K-modes generates locally optimal solutions that depend on the initial modes and the order of objects in the data set. K-modes must be run multiple times with different starting values of modes to test the stability of the clustering solution.

Ralambondrainy [15] proposes a method to convert multiple-category attributes into binary attributes, using 0 and 1 to represent either the absence or the presence of a category, and to treat the binary attributes as numeric in the K-means algorithm. Huang [8] also proposes the K-prototypes algorithm, which allows clustering of objects described by a combination of numeric and categorical data.

CACTUS (Clustering Categorical Data Using Summaries) [23] is a summarization based algorithm.
In CACTUS, the authors cluster categorical data by generalizing the definition of a cluster for numerical attributes. Summary information constructed from the data set is assumed to be sufficient for discovering well-defined clusters. CACTUS finds clusters in subsets of all attributes and thus performs a subspace clustering of the data.

Guha et al. [6] propose a hierarchical clustering method termed ROCK (Robust Clustering using Links), which can measure the similarity or proximity between a pair of objects. In ROCK, the number of links is computed as the number of common neighbors between two objects. An agglomerative hierarchical clustering algorithm is then applied: first, the algorithm assigns each object to a separate cluster; clusters are then merged repeatedly according to the closeness between clusters, where the closeness is defined as the sum of the number of links between all pairs of objects.

Gibson et al. [4] propose an algorithm called STIRR (Sieving Through Iterated Relational Reinforcement), a generalized spectral graph partitioning method for categorical data. STIRR is an iterative approach which maps categorical data to non-linear dynamic systems. If the dynamic system converges, the categorical data can be clustered. Clustering naturally lends itself to combinatorial formulation. However, STIRR requires a nontrivial post-processing step to identify sets of closely related attribute values [23]. Additionally, certain classes of clusters are not discovered by STIRR [23]. Moreover, Zhang et al. [24] argue that STIRR cannot guarantee convergence and therefore propose a revised dynamic system algorithm that assures convergence.

He et al. [7] propose an algorithm called Squeezer, which is a one-pass algorithm. Squeezer puts the first tuple in a cluster, and each subsequent tuple is either put into an existing cluster or rejected to form a new cluster, based on a given similarity function. He et al. [25] explore categorical data clustering (CDC) and link clustering (LC) problems, propose LCBCDC (Link Clustering Based Categorical Data Clustering), and compare the results with Squeezer and K-modes.

In reviewing these algorithms, we find that some of the methods, such as STIRR and EM, cannot guarantee convergence, while others have scalability issues. In addition, all of the algorithms share one assumption: each object can be classified into only one cluster, and all objects have the same degree of confidence when grouped into a cluster [5]. However, in real world applications, it is difficult to draw clear boundaries between the clusters. Therefore, the uncertainty of the objects belonging to a cluster needs to be considered.

One of the first attempts to handle uncertainty is fuzzy K-means [9]. In this algorithm, each pattern or object is allowed to have membership functions to all clusters rather than a distinct membership to exactly one cluster. Krishnapuram and Keller [18] propose a probabilistic approach to clustering in which the membership of a feature vector in a class has nothing to do with its membership in other classes, and modified clustering methods are used to generate membership distributions. Krishnapuram et al. [17] present several fuzzy and probabilistic algorithms to detect linear and quadratic shell clusters. Note that the initial work in handling uncertainty was based on numerical data.

Huang [8] proposes a fuzzy K-modes algorithm with a new procedure to generate the fuzzy partition matrix from categorical data within the framework of the fuzzy K-means algorithm. The method finds fuzzy cluster modes when a simple matching dissimilarity measure is used for categorical objects. By assigning confidence to objects in different clusters, the core and boundary objects of the clusters can be decided. This provides more useful information for dealing with boundary objects. More recently, Kim et al. [14] have extended the fuzzy K-modes algorithm by using fuzzy centroids to represent the clusters of categorical data instead of the hard-type centroids used in the fuzzy K-modes algorithm. The use of fuzzy centroids makes it possible to fully exploit the power of fuzzy sets in representing the uncertainty in the classification of categorical data.

However, the fuzzy K-modes and fuzzy centroid algorithms suffer from the same problem as K-modes, that is, they require multiple runs with different starting values of modes to test the stability of the clustering solution. In addition, these algorithms have to adjust one control parameter for membership fuzziness to obtain better solutions.
This necessitates multiple runs of these algorithms to determine an acceptable value of this parameter. Therefore, there is a need for a categorical data clustering method that can handle uncertainty in the clustering process while providing stable results. One methodology with potential for handling uncertainty is Rough Set Theory (RST), which has received considerable attention in the computational intelligence literature since its development by Pawlak in the 1980s. Unlike fuzzy set based approaches, rough sets do not require domain expertise to assign fuzzy memberships, yet they may still provide satisfactory results for rough clustering. The objective of the proposed algorithm is to develop a rough set based approach for categorical data clustering. The approach, termed Standard deviation of Standard deviation roughness (SSDR), is presented and its performance is evaluated on large scale data sets.

2.1.2 Basics of rough sets

Most of our traditional tools for formal modeling, reasoning and computing are deterministic and precise in character. Real situations are very often not deterministic and cannot be described precisely. A complete description of a real system would often require far more detailed data than a human being could ever recognize simultaneously, process and understand. This observation led to the extension of the basic concept of sets so as to model imprecise data, which enhances their modeling power.

The fundamental concept of sets has been extended in many directions in the recent past. The notion of fuzzy sets, introduced by Zadeh [10], deals with approximate membership, and the notion of rough sets, introduced by Pawlak [12], captures indiscernibility of the elements in a set. These two theories have been found to complement each other instead of being rivals. The idea of a rough set consists of the approximation of a set by a pair of sets, called the lower and upper approximations of the set. The basic assumption in rough set theory is that knowledge depends upon the classification capabilities of human beings. Since every classification (or partition) of a universe and the concept of an equivalence relation are interchangeable notions, the definition of rough sets depends upon equivalence relations as its mathematical foundation [12].

Let $U (\neq \phi)$ be a finite set of objects, called the universe, and $R$ be an equivalence relation over $U$. By $U/R$ we denote the family of all equivalence classes of $R$ (or classification of $U$), referred to as categories or concepts of $R$, and $[x]_R$ denotes a category in $R$ containing an element $x \in U$. By a knowledge base we understand a relational system $K = (U, \mathbf{R})$, where $U$ is as above and $\mathbf{R}$ is a family of equivalence relations over $U$. For any subset $P (\neq \phi) \subseteq \mathbf{R}$, the intersection of all equivalence relations in $P$ is denoted by $IND(P)$ and is called the indiscernibility relation over $P$. The equivalence classes of $IND(P)$ are called P-basic knowledge about $U$ in $K$. For any $Q \in \mathbf{R}$, $Q$ is called a Q-elementary knowledge about $U$ in $K$, and the equivalence classes of $Q$ are called Q-elementary concepts of knowledge $\mathbf{R}$. The family of P-basic categories for all $P \subseteq \mathbf{R}$ will be called the family of basic categories in the knowledge base $K$. By $IND(K)$ we denote the family of all equivalence relations defined in $K$. Symbolically, $IND(K) = \{IND(P) : P \subseteq \mathbf{R}\}$.

For any $X \subseteq U$ and an equivalence relation $R \in IND(K)$, we associate two subsets,

$$\underline{R}X = \bigcup \{Y \in U/R : Y \subseteq X\} \quad \text{and} \quad \overline{R}X = \bigcup \{Y \in U/R : Y \cap X \neq \phi\},$$

called the R-lower and R-upper approximations of $X$ respectively. The R-boundary of $X$ is denoted by $BN_R(X)$ and is given by $BN_R(X) = \overline{R}X - \underline{R}X$. The elements of $\underline{R}X$ are those elements of $U$ which can be certainly classified as elements of $X$ employing the knowledge of $R$. The borderline region is the undecidable area of the universe. We say $X$ is rough with respect to $R$ if and only if $\underline{R}X \neq \overline{R}X$, equivalently $BN_R(X) \neq \phi$. $X$ is said to be R-definable if and only if $\underline{R}X = \overline{R}X$, or $BN_R(X) = \phi$. So, a set is rough with respect to $R$ if and only if it is not R-definable.

2.1.3 Definitions

Definition (Indiscernibility relation ($Ind(B)$)): $Ind(B)$ is a relation on $U$. Given two objects $x_i, x_j \in U$, they are indiscernible by the set of attributes $B$ in $A$ if and only if $a(x_i) = a(x_j)$ for every $a \in B$. That is, $(x_i, x_j) \in Ind(B)$ if and only if $\forall a \in B$, where $B \subseteq A$, $a(x_i) = a(x_j)$.

Definition (Equivalence class ($[x_i]_{Ind(B)}$)): Given $Ind(B)$, the set of objects $x_i$ having the same values for the set of attributes in $B$ forms an equivalence class, $[x_i]_{Ind(B)}$. It is also known as an elementary set with respect to $B$.

Definition (Lower approximation): Given the set of attributes $B$ in $A$ and a set of objects $X$ in $U$, the lower approximation of $X$ is defined as the union of all the elementary sets which are contained in $X$. That is,

$$\underline{X_B} = \bigcup \{x_i \mid [x_i]_{Ind(B)} \subseteq X\}.$$

Definition (Upper approximation): Given the set of attributes $B$ in $A$ and a set of objects $X$ in $U$, the upper approximation of $X$ is defined as the union of the elementary sets which have a nonempty intersection with $X$. That is,

$$\overline{X_B} = \bigcup \{x_i \mid [x_i]_{Ind(B)} \cap X \neq \phi\}.$$
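The definitions of equivalence class, lower approximation and upper approximation can be illustrated with a minimal Python sketch. The table, the object names `o1` to `o4`, the attribute names and the helper function names are all hypothetical and chosen only for illustration; they do not come from the paper.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Group object ids that agree on every attribute in attrs (the relation Ind(B))."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())

def lower_approximation(table, attrs, X):
    """Union of the elementary sets that are entirely contained in X."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c <= X])

def upper_approximation(table, attrs, X):
    """Union of the elementary sets that share at least one object with X."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c & X])

# Hypothetical four-object table used only for this illustration.
table = {
    "o1": {"colour": "red", "shape": "round"},
    "o2": {"colour": "red", "shape": "round"},
    "o3": {"colour": "blue", "shape": "round"},
    "o4": {"colour": "blue", "shape": "square"},
}
B = ["colour", "shape"]
X = {"o1", "o3"}

lower = lower_approximation(table, B, X)
upper = upper_approximation(table, B, X)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Here `o1` and `o2` are indiscernible under `B`, so `o1` cannot be certainly classified into `X`: it falls in the boundary region together with `o2`, while `o3` forms a singleton class and belongs to the lower approximation.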

Definition (Roughness): The ratio of the cardinality of the lower approximation to the cardinality of the upper approximation is defined as the accuracy of estimation; roughness is one minus this ratio. It is presented as

$$R_B(X) = 1 - \frac{|\underline{X_B}|}{|\overline{X_B}|}.$$

If $R_B(X) = 0$, $X$ is crisp with respect to $B$; in other words, $X$ is precise with respect to $B$. If $R_B(X) > 0$, $X$ is rough with respect to $B$; that is, $B$ is vague with respect to $X$.

Definition (Relative roughness): Given $a_i \in A$, let $X$ be the subset of objects having one specific value $\alpha$ of attribute $a_i$, and let $\underline{X_{a_j}}(a_i = \alpha)$ and $\overline{X_{a_j}}(a_i = \alpha)$ refer to the lower and upper approximations of $X$ with respect to $\{a_j\}$. Then $R_{a_j}(X)$ is defined as the roughness of $X$ with respect to $\{a_j\}$, that is,

$$R_{a_j}(X \mid a_i = \alpha) = 1 - \frac{|\underline{X_{a_j}}(a_i = \alpha)|}{|\overline{X_{a_j}}(a_i = \alpha)|}, \quad \text{where } a_i, a_j \in A \text{ and } a_i \neq a_j.$$

Definition (Mean roughness): Let $A$ have $n$ attributes and $a_i \in A$. Let $X$ be the subset of objects having a specific value $\alpha$ of the attribute $a_i$. Then we define the mean roughness for the equivalence class $a_i = \alpha$, denoted by $MeR(a_i = \alpha)$, as

$$MeR(a_i = \alpha) = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} R_{a_j}(X \mid a_i = \alpha).$$

Definition (Standard deviation): After calculating the mean for each $a_i \in A$, we apply the standard deviation to each $a_i$ by the formula

$$SD(a_i = \alpha) = \sqrt{\frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \left( R_{a_j}(X \mid a_i = \alpha) - MeR(a_i = \alpha) \right)^2}.$$

Definition (Distance of relevance): Given two objects $B$ and $C$ of categorical data with $n$ attributes, the distance of relevance $DR$ of the objects is defined as

$$DR(B, C) = \sum_{i=1}^{n} DR(b_i, c_i).$$

Here, $b_i$ and $c_i$ are the values of objects $B$ and $C$ respectively under the $i$-th attribute $a_i$. Also, we have

1. $DR(b_i, c_i) = 1$ if $b_i \neq c_i$
2. $DR(b_i, c_i) = 0$ if $b_i = c_i$
3. $DR(b_i, c_i) = \dfrac{|eq_B - eq_C|}{no}$ if $a_i$ is a numerical attribute, where $eq_B$ is the number assigned to the equivalence class that contains $b_i$, $eq_C$ is similarly defined, and $no$ is the total number of equivalence classes in the numerical attribute $a_i$.
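The three cases of the distance of relevance can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the `numeric_classes` argument (a mapping from a numerical attribute to the class number assigned to each of its values) and the two sample objects are assumptions made for the example.

```python
from fractions import Fraction

def distance_of_relevance(b, c, numeric_classes=None):
    """Sum the per-attribute distances DR(b_i, c_i) between two objects.

    b, c: dicts mapping attribute name -> value.
    numeric_classes: hypothetical helper input, mapping a numerical attribute
    name to a dict {value: number of the equivalence class containing it}.
    """
    numeric_classes = numeric_classes or {}
    total = Fraction(0)
    for attr in b:
        if attr in numeric_classes:
            classes = numeric_classes[attr]
            n_classes = len(set(classes.values()))
            # Case 3: |eq_B - eq_C| / no for a numerical attribute.
            total += Fraction(abs(classes[b[attr]] - classes[c[attr]]), n_classes)
        elif b[attr] != c[attr]:
            # Case 1: differing categorical values contribute 1.
            total += 1
        # Case 2: equal categorical values contribute 0.
    return total

# Illustrative objects; the age classes split seven ages into three sets.
age_classes = {5: 1, 7: 1, 9: 2, 16: 2, 25: 3, 28: 3, 30: 3}
b = {"size": "Small", "colour": "Black", "age": 30}
c = {"size": "Large", "colour": "Black", "age": 5}

dr = distance_of_relevance(b, c, {"age": age_classes})
print(dr)  # 1 (size) + 0 (colour) + 2/3 (age)
```

Exact fractions are used so that the numerical-attribute term stays readable as a ratio of class numbers rather than a rounded float.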

Definition (Purity ratio): In order to compare SSDR with MMeR, MMR, SDR and all other algorithms which handle categorical data, we developed an implementation. The traditional approach for calculating the purity of a cluster is given below.

$$Purity(i) = \frac{\text{the number of data occurring in both the } i\text{-th cluster and its corresponding class}}{\text{the number of data in the data set}}$$

$$Overall\ Purity = \frac{\sum_{i=1}^{\#\,of\,clusters} Purity(i)}{\#\,of\,clusters}$$

METHODS

In this section we present the main algorithm of the paper; the experimental part works through an example.

Proposed Algorithm

In this section we present our algorithm, which we call SSDR. The notations and definitions of concepts have been discussed in the previous section.

1. Procedure SSDR(U, k)
2. Begin
3. Set current number of clusters CNC = 1
4. Set ParentNode = U
5. Loop:
6. If CNC < k and CNC ≠ 1 then
7. ParentNode = Proc ParentNode (CNC)
8. End if
// Clustering the ParentNode
9. For each $a_i \in A$ (i = 1 to n, where n is the number of attributes in A)
10. Determine $[X_m]_{Ind(a_i)}$ (m = 1 to number of objects)
11. For each $a_j \in A$ (j = 1 to n, where n is the number of attributes in A, j ≠ i)
12. Calculate $Rough_{a_j}(a_i)$
13. Next
14. $MeR(a_i = \alpha) = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} R_{a_j}(X \mid a_i = \alpha)$
15. Next
16. Apply standard deviation $SD(a_i = \alpha) = \sqrt{\frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} (R_{a_j}(X \mid a_i = \alpha) - MeR(a_i = \alpha))^2}$
17. Next
18. Set SDR = SD{min{$SD(a_i = \alpha_1), \ldots, SD(a_i = \alpha_{k_i})$}}, where $k_i$ is the number of equivalence classes in $Dom(a_i)$
19. Determine the splitting attribute $a_i$ corresponding to the Standard deviation-Roughness
20. Do binary split on the splitting attribute $a_i$
21. CNC = the number of leaf nodes
22. Go to Loop:
23. End

24. Proc ParentNode (CNC)
25. Begin
26. Set i = 1
27. Do until i > CNC
28. If Avg-distance of cluster i is calculated
29. Goto label
30. else
31. n = Count (Set of Elements in Cluster i)
32. Avg-distance (i) = $2 \left( \sum_{j=1}^{n} \sum_{k=j+1}^{n} \text{(distance of relevance between objects } a_j \text{ and } a_k) \right) / (n (n-1))$
33. label:
34. increment i
35. Loop
36. Determine Max (Avg-distance (i))
37. Return (Set of Elements in cluster i) corresponding to Max (Avg-distance (i))
38. End

Experimental Part

In this section we present the experimental hybrid table, which characterizes various animals in terms of size, animality, colour and age. In a later section we show the efficiency of the algorithm. The experimental table is as follows:

Table 1

ANIMAL NAME   SIZE     ANIMALITY   COLOUR   AGE
A1            Small    Bear        Black    25
A2            Medium   Bear        Black    16
A3            Large    Dog         Brown    9
A4            Small    Cat         Black    30
A5            Medium   Horse       Black    28
A6            Large    Horse       Black    5
A7            Large    Horse       Brown    7

Let us take k = 3, which means the number of clusters will be 3. Initially the value of CNC is 1 and the value of ParentNode is U, which indicates that the initial ParentNode is the whole table. So, we need to apply our algorithm three times to get the desired clusters.

Computational Part

Initially the condition "CNC < k and CNC ≠ 1" is false. When it is true, the algorithm calculates the average distance of each candidate parent node, but initially we have only one table, so there is no need to calculate the average distance. We directly calculate the roughness of each attribute relative to the rest of the attributes, which is known as relative roughness. When i = 1, the value of $a_i$ is SIZE. This attribute has three distinct values, Small, Medium and Large, so considering α = Small

first we get X = {A1, A4} (where X is the subset of objects having the specific value α of attribute $a_i$), and considering j = 2 (as j ≠ i) we get $a_j$ = ANIMALITY. The equivalence classes of $a_j$ are {(A1, A2), A3, A4, (A5, A6, A7)}. The only class contained in X is {A4}, so the lower approximation is $\underline{X_{a_j}}(a_i = \alpha)$ = {A4}, and the classes meeting X are (A1, A2) and A4, so the upper approximation is $\overline{X_{a_j}}(a_i = \alpha)$ = {A1, A2, A4}. So, the roughness of $a_i$ (when $a_i$ = SIZE and α = Small) relative to ANIMALITY is given by

$$R_{a_j}(X \mid a_i = \alpha) = 1 - \frac{|\underline{X_{a_j}}(a_i = \alpha)|}{|\overline{X_{a_j}}(a_i = \alpha)|} = 1 - \frac{1}{3} = \frac{2}{3}.$$

Now, by changing the value of j (j = 3, 4) and keeping $a_i$ (= SIZE) and α (= Small) constant, we find the roughness of $a_i$ relative to the attributes COLOUR (j = 3) and AGE (j = 4):

$$R_{a_j}(X \mid a_i = \alpha) = 1 - \frac{0}{5} = 1 \quad \text{when } j = 3 \text{ and } a_j = \text{COLOUR},$$

$$R_{a_j}(X \mid a_i = \alpha) = 1 - \frac{2}{2} = 0 \quad \text{when } j = 4 \text{ and } a_j = \text{AGE}.$$

Now, to get the standard deviation of $a_i$ (= SIZE) when α = Small, we need the mean of these values, which is given by

$$MeR(a_i = \alpha) = \frac{2/3 + 1 + 0}{3} = \frac{5}{9}.$$

Applying the standard deviation formula to these three values gives SD(SIZE = Small), which is stored in a variable. The same process is repeated for the other values of α (Medium and Large), keeping $a_i$ constant. We thus get one standard deviation value for each α, and these values are again stored in a variable. After calculating the SD (standard deviation) for each α, we take the minimum of these values and store it in another variable. The above procedure is repeated for each $a_i$ (for $a_i$ = ANIMALITY, COLOUR and AGE when i = 2, 3 and 4) and the corresponding minimum values are stored. After completing this step we take those minimum values for the next calculation: we apply SD (standard deviation) to those minimum values to get the splitting attribute.
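The relative roughness, mean roughness and standard deviation for SIZE = Small can be recomputed from the definitions with a short Python sketch over the experimental table. The function names are hypothetical, and the sketch is an illustration under the paper's definitions rather than the authors' implementation. Computed directly from the definitions, the ANIMALITY class {A4} is contained in X = {A1, A4}, so the lower approximation has one element and the roughness relative to ANIMALITY comes out as 2/3.

```python
from collections import defaultdict
from fractions import Fraction
from math import sqrt

# Experimental table (Table 1 of the paper).
TABLE = {
    "A1": {"SIZE": "Small", "ANIMALITY": "Bear", "COLOUR": "Black", "AGE": 25},
    "A2": {"SIZE": "Medium", "ANIMALITY": "Bear", "COLOUR": "Black", "AGE": 16},
    "A3": {"SIZE": "Large", "ANIMALITY": "Dog", "COLOUR": "Brown", "AGE": 9},
    "A4": {"SIZE": "Small", "ANIMALITY": "Cat", "COLOUR": "Black", "AGE": 30},
    "A5": {"SIZE": "Medium", "ANIMALITY": "Horse", "COLOUR": "Black", "AGE": 28},
    "A6": {"SIZE": "Large", "ANIMALITY": "Horse", "COLOUR": "Black", "AGE": 5},
    "A7": {"SIZE": "Large", "ANIMALITY": "Horse", "COLOUR": "Brown", "AGE": 7},
}

def classes_of(attr):
    """Equivalence classes of the objects under a single attribute."""
    groups = defaultdict(set)
    for obj, row in TABLE.items():
        groups[row[attr]].add(obj)
    return list(groups.values())

def relative_roughness(a_i, alpha, a_j):
    """R_{a_j}(X | a_i = alpha) = 1 - |lower| / |upper|."""
    X = {obj for obj, row in TABLE.items() if row[a_i] == alpha}
    lower = sum(len(c) for c in classes_of(a_j) if c <= X)
    upper = sum(len(c) for c in classes_of(a_j) if c & X)
    return 1 - Fraction(lower, upper)

def mean_and_sd(a_i, alpha):
    """Roughness values, MeR and SD for a_i = alpha over the other n - 1 attributes."""
    others = [a for a in next(iter(TABLE.values())) if a != a_i]
    values = [relative_roughness(a_i, alpha, a_j) for a_j in others]
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return values, mean, sd

values, mean, sd = mean_and_sd("SIZE", "Small")
print(values, mean, round(sd, 4))
```

With four attributes, the divisor n - 1 in both formulas equals the number of other attributes, which is why the sketch divides by `len(values)`.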
If the value of SD does not match any of the minimum values, then we take the nearest minimum value to identify the splitting attribute and do the binary splitting, that is, we divide the table into two clusters. Suppose that after splitting we have two clusters C1 and C2, where C1 contains 2 elements and C2 contains 5 elements. We now need to calculate the average distance to choose the cluster to be used for further calculation. This is done by applying the distance of relevance formula. Let us see how DR (distance of relevance) is calculated. For example, let us take the two tuples A4 and A6, which are as follows:

Table 2

ANIMAL NAME   SIZE     ANIMALITY   COLOUR   AGE
A4            Small    Cat         Black    30
A6            Large    Horse       Black    5

Here B = A4 and C = A6, and DR(B, C) is defined as

$$DR(B, C) = \sum_{i=1}^{n} DR(b_i, c_i) = DR(b_{size}, c_{size}) + DR(b_{animality}, c_{animality}) + DR(b_{colour}, c_{colour}) + DR(b_{age}, c_{age}).$$

So, following the definition of the distance of relevance,

$DR(b_{size}, c_{size}) = 1$ as $b_{size} \neq c_{size}$
$DR(b_{animality}, c_{animality}) = 1$ as $b_{animality} \neq c_{animality}$
$DR(b_{colour}, c_{colour}) = 0$ as $b_{colour} = c_{colour}$

For $DR(b_{age}, c_{age})$ we need a different method, as AGE is a numerical attribute. To calculate the DR of a numerical attribute we exclude that attribute from the table and find the average number of equivalence classes over the remaining attributes. So, in this case we exclude the attribute AGE first and then find the average number of equivalence classes, which is (3 + 4 + 2)/3 = 3. In this case we get an integer value, but in general we can get a fraction, and then we take either its floor value or its ceiling value. Next we sort the values of the attribute AGE. After sorting in ascending order we get {5, 7, 9, 16, 25, 28, 30}. We distribute these numbers into three sets as follows:

Set 1 = {5, 7}
Set 2 = {9, 16}
Set 3 = {25, 28, 30}

Now we calculate $DR(b_{age}, c_{age})$. In our case $b_{age}$ = 30 and $c_{age}$ = 5. We put 3 and 1 in place of 30 and 5, as 30 belongs to Set 3 and 5 belongs to Set 1. So,

$$DR(b_{age}, c_{age}) = \frac{|3 - 1|}{\text{total number of sets}} = \frac{2}{3}.$$

Finally,

$$DR(B, C) = DR(b_{size}, c_{size}) + DR(b_{animality}, c_{animality}) + DR(b_{colour}, c_{colour}) + DR(b_{age}, c_{age}) = 1 + 1 + 0 + \frac{2}{3} = \frac{8}{3}.$$

In this way we calculate the average distance of C1 and C2, and the cluster having the larger average distance is taken as the input for further calculation.
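The choice of the next cluster to split can be sketched in Python as below. It follows the Avg-distance formula of the ParentNode procedure, 2 times the sum of pairwise distances divided by n(n - 1). The function names, the simple mismatch distance and the two sample clusters are hypothetical; returning 0 for a cluster with fewer than two objects is an assumption of this sketch, since the formula is undefined there.

```python
from itertools import combinations

def average_distance(cluster, distance):
    """Average pairwise distance: 2 * sum over pairs / (n * (n - 1))."""
    n = len(cluster)
    if n < 2:
        return 0.0  # assumption of this sketch: a singleton is never preferred
    total = sum(distance(x, y) for x, y in combinations(cluster, 2))
    return 2 * total / (n * (n - 1))

def pick_cluster_to_split(clusters, distance):
    """Return the cluster with the largest average distance."""
    return max(clusters, key=lambda c: average_distance(c, distance))

def mismatch(x, y):
    """Hypothetical categorical-only distance: number of differing attribute values."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical clusters of (size, animality) tuples.
c1 = [("Small", "Bear"), ("Small", "Cat")]
c2 = [("Large", "Horse"), ("Large", "Horse"), ("Medium", "Horse")]

print(average_distance(c1, mismatch))  # one pair at distance 1
print(average_distance(c2, mismatch))  # three pairs at distances 0, 1, 1
chosen = pick_cluster_to_split([c1, c2], mismatch)
```

Passing the distance as a function keeps the sketch independent of how numerical attributes are binned; a full distance of relevance could be supplied in place of `mismatch`.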

In this fashion we apply the algorithm until we get the desired number of clusters. In our case we stop when we get C3, because the total number of clusters is 3.

RESULTS AND DISCUSSION

In this section we present the original result, tested on the ZOO dataset, which was also used for the MMR, MMeR and SDR algorithms. The ZOO data has 18 attributes, of which 15 are Boolean attributes, 2 are numeric and 1 is the animal name, and it has 101 objects. The objects are divided into seven classes, so we stop when we get seven clusters. Taking the ZOO dataset as the input, we obtained the following output:

Table 3

Cluster Number   Class I   Class II   Class III   Class IV   Class V   Class VI   Class VII   Purity Ratio
Overall Purity

Comparison of SSDR with MMeR, MMR, SDR and Algorithms based on Fuzzy Set Theory

Until the development of MMR, the only algorithms which aimed at handling uncertainty in the clustering process were based upon fuzzy set theory [26]. These algorithms include fuzzy K-modes and fuzzy centroids. The K-modes algorithm replaces the means of the clusters (K-means) with modes and uses a frequency based method to update the modes in the clustering process to minimize the clustering cost function. Fuzzy K-modes generates a fuzzy partition matrix from categorical data. By assigning a confidence to objects in different clusters, the core and boundary objects of the clusters are determined for clustering purposes. The fuzzy centroids algorithm uses the concept of fuzzy set theory to derive fuzzy centroids to create clusters of objects which have categorical attributes.
MMR, MMeR and SDR, in contrast, are built on rough set concepts. In terms of efficiency, MMeR is more efficient than MMR and less efficient than SDR, and SSDR is more efficient than all of them.

Empirical Analysis

The earlier algorithms for classification with uncertainty, K-modes, fuzzy K-modes and fuzzy centroids on one hand and MMR, MMeR and SDR on the other, were applied to the ZOO data set. Table 4 below provides the comparison of purity for these algorithms on this data set. It is observed that SSDR has a better purity than all other algorithms when applied to the zoo data set. As mentioned earlier, all the fuzzy set based algorithms face a challenging problem, namely the problem of stability. These algorithms require great effort to adjust the parameter which controls the fuzziness of membership of each data point. At each value of this parameter, the algorithms need to be run multiple times to achieve a stable solution.

MMR, MMeR and SDR, on the other hand, have no such problem. SSDR retains the advantages of MMR, MMeR and SDR over the other algorithms mentioned above, and it has higher purity than MMR, MMeR and SDR, which establishes its superiority over them.

Table 4

DATA SET   K-modes   Fuzzy K-modes   Fuzzy centroids   MMR   MMeR   SDR   SSDR
ZOO                                                                        *

*In this case we obtained the same purity ratio as SDR, but as the standard deviation has better central tendency than the mean or the minimum, it is expected to give better results for other data sets. It has been checked manually for a small data set that it gives a much better result than MMR, MMeR and SDR.

CONCLUSION

In this paper, we proposed a new algorithm called SSDR, which is more efficient than most of the earlier algorithms, including MMR, MMeR and SDR, which are recent algorithms developed in this direction. It handles uncertain data using rough set theory. Firstly, we have provided a method in which both numerical and categorical data can be handled. Secondly, by using the distance of relevance we obtain much better results than MMR, where the table to be clustered is chosen according to the number of objects. The comparison of purity ratios shows its superiority over MMeR. Future enhancements of this algorithm may be possible by considering hybrid techniques like rough-fuzzy clustering or fuzzy-rough clustering.

REFERENCES

[1] A. Dempster, N. Laird, D. Rubin, Journal of the Royal Statistical Society 39 (1) (1977).
[2] B. K. Tripathy and M. S. Prakash Kumar Ch., International Journal of Rapid Manufacturing (special issue on Data Mining) (Switzerland), vol. 1, no. 2, (2009).
[3] D. Parmar, Teresa Wu, Jennifer B, Data & Knowledge Engineering (2007).
[4] D. Gibson, J. Kleinberg, P. Raghavan, The Very Large Data Bases Journal 8 (3-4) (2000).
[5] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Journal of Intelligent Information Systems 17 (2-3) (2001).
[6] S. Guha, R. Rastogi, K. Shim, Information Systems 25 (5) (2000).
[7] Z. He, X. Xu, S. Deng, Journal of Computer Science & Technology 17 (5) (2002).
[8] Z. Huang, Data Mining and Knowledge Discovery 2 (3) (1998).
[9] E. Ruspini, Information Control 15 (1) (1969).
[10] L. A. Zadeh, Information and Control, (1965).
[11] R. Johnson, W. Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, New York.
[12] Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data. Norwell: Kluwer Academic Publishers, (1992).
[13] D. Jiang, C. Tang, A. Zhang, IEEE Transactions on Knowledge and Data Engineering 16 (11) (2004).
[14] D. Kim, K. Lee, D. Lee, Pattern Recognition Letters 25 (11) (2004).
[15] H. Ralambondrainy, Pattern Recognition Letters 16 (11) (1995).
[16] K. Wong, D. Feng, S. Meikle, M. Fulham, IEEE Transactions on Nuclear Science 49 (1) (2002).
[17] R. Krishnapuram, H. Frigui, O. Nasraoui, IEEE Transactions on Fuzzy Systems 3 (1) (1995).
[18] R. Krishnapuram, J. Keller, IEEE Transactions on Fuzzy Systems 1 (2) (1993).
[19] R. Mathieu, J. Gibson, IEEE Transactions on Engineering Management 40 (3) (2004).
[20] S. Haimov, M. Michalev, A. Savchenko, O. Yordanov, IEEE Transactions on Geo Science and Remote Sensing 8 () (1989).
[21] S. Wu, A. Liew, H. Yan, M. Yang, IEEE Transactions on Information Technology in BioMedicine 8 (1) (2004) 5-15.
[22] B. K. Tripathy and A. Ghosh, SDR: An Algorithm for Clustering Categorical Data Using Rough Set Theory, communicated to the International IEEE conference to be held in Kerala, (2011).
[23] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS: clustering categorical data using summaries, in: Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (1999).
[24] Y. Zhang, A. Fu, C. Cai, P. Heng, Clustering categorical data, in: Proceedings of the 16th International Conference on Data Engineering, (2000).
[25] Z. He, X. Xu, S. Deng, A link clustering based approach for clustering categorical data, Proceedings of the WAIM Conference, (2004).
[26] E. Han, G. Karypis, V. Kumar, B. Mobasher, Clustering based on association rule hypergraphs, in: Workshop on Research Issues on Data Mining and Knowledge Discovery, (1997).
