A ROUGH SET APPROACH FOR CUSTOMER SEGMENTATION

Size: px

Start display at page:

Download "A ROUGH SET APPROACH FOR CUSTOMER SEGMENTATION"

James Harris
5 years ago
Views:

1 A ROUGH SET APPROACH FOR CUSTOMER SEGMENTATION Prabha Dhadayudam * ad Ilago Krishamurthi Departmet of CSE, Sri Krisha College of Egieerig ad Techology, Coimbatore, Idia * prabhadhadayudam@gmail.com ik@skcet.ac.i ABSTRACT Customer segmetatio is a process that divides a busiess s total customers ito groups accordig to their diversity of purchasig behavior ad characteristics. The data miig clusterig techique ca be used to accomplish this customer segmetatio. This techique clusters the customers i such a way that the customers i oe group behave similarly whe compared to the customers i other groups. The customer related data are categorical i ature. However, the clusterig algorithms for categorical data are few ad are uable to hadle ucertaity. Rough set theory (RST) is a mathematical approach that hadles ucertaity ad is capable of discoverig kowledge from a database. This paper proposes a ew clusterig techique called MADO (Miimum Average Dissimilarity betwee Obects) for categorical data based o elemets of RST. The proposed algorithm is compared with other RST based clusterig algorithms, such as MMR (Mi-Mi Roughess), MMeR (Mi Mea Roughess), SDR (Stadard Deviatio Roughess), SSDR (Stadard deviatio of Stadard Deviatio Roughess), ad MADE (Maximal Attributes DEpedecy). The results show that for the real customer data cosidered, the MADO algorithm achieves clusters with higher cohesio, lower couplig, ad less computatioal complexity whe compared to the above metioed algorithms. The proposed algorithm has also bee tested o a sythetic data set to prove that it is also suitable for high dimesioal data. Keywords: Customer Relatioship Maagemet, Customer segmetatio, Clusterig, Categorical data, Rough set theory 1 INTRODUCTION Customer Relatioship Maagemet (CRM) is a busiess methodology used to build a relatioship with log term profitable customers by aalyzig customer eeds ad behaviors. It is a importat techology i every busiess because busiess is customer cetric. CRM helps busiess leaders gai isight ito customer behavior ad life time value to icrease profit by actig accordig to the customer characteristics. Customer segmetatio plays a importat role i CRM. It divides customers ito groups accordig to their purchasig behavior, allowig busiess leaders to desig ad establish differet strategies for each group of customers ad thus maximize their value (Lig & Ye, 2001). Rececy (R), Frequecy (F), ad Moetary (M) are the attributes chose for describig the purchasig customer characteristics. The RFM model works very effectively i customer segmetatio (Wu & Li, 2005). R idicates the time iterval betwee the preset ad previous trasactio dates of a customer. F idicates the umber of trasactios that the customer has made i a particular iterval of time. M idicates the total value of the customer s trasactio (Cheg & Che, 2009). I this paper, the modified RFM model, called RFMP, is itroduced. This model cosiders the customers paymet details. P idicates the average time iterval betwee paymet ad purchase date. The customers paymet details are a importat attribute because ay two customers with the same R, F, M values but a differet P value caot be treated equally by the eterprise. The RFMP model esures that the customer segmetatio is doe obectively. The values for R, F, M, ad P attributes are cotiuous. These cotiuous values are ormalized to categorical values as beig very low, low, middle, high, ad very high for effective aalysis. The data available for customer segmetatio is ow categorical data. The data miig clusterig techique is widely used to accomplish customer segmetatio (Ngai, Xiu, & Chau, 2009). The clusterig techique is used to segmet customers i such a way that the customers i oe group behave similarly whe compared to the customers i other groups based o their trasactio details. The traditioal approaches for clusterig, such as partitioig ad hierarchical algorithms, deal with umerical data whose iheret geometric properties ca be exploited to aturally defie distace fuctios betwee data poits. Distace fuctios, such as Mahatta, Euclidea, ad Mikowski, are used for allocatig a data poit to the appropriate clusterig. However, customer data available for clusterig is categorical so the above procedures are ot feasible. The computatio of similarity or dissimilarity is essetial for categorical data (Che, Chuag, & Che 2008). 1

2 Simple matchig, co-occurrece, probabilistic ad distace hierarchy are the approaches for computig similarity or dissimilarity measures (Cao, Liag, Li, Bai, & Dag, 2011). Simple matchig is a commo approach i which the compariso of two idetical categorical values yields either zero or oe. k-modes, fuzzy k-modes, ad k-prototype algorithms are based o simple matchig. These algorithms produce clusters with weak itrasimilarity ad have stability problems (Cao, Liag, Li, Bai, & Dag, 2011). ROCK (Robust clusterig usig liks) ad CACTUS (Clusterig categorical data usig summaries) are algorithms based o the co-occurrece approach. ROCK uses the cocept of a lik to measure the similarity betwee categorical patters. Here lik is defied as the umber of commo eighbors betwee two patters (Guha, Rastogi, & Shim, 2000). CACTUS calculates the frequecy of two values appearig i the patters together. It fids clusters i subsets of all attributes ad thus performs subspace clusterig. It geeralizes the cluster defiitio of umerical data so that it is suitable for categorical data (Gati, Gehrke, & Ramakrisha, 1999). The limitatios of ROCK are that it is sesitive to threshold value ad sometimes does ot produce the required umber of clusters (Parmar, Wu, & Blackhurst, 2007). Probabilistic approaches use coditioal probability estimatio to defie relatios betwee clusters. COBWEB, AUTOCLASS, DECA, ad COOLCAT are based o probabilistic models ad require a log traiig time. COBWEB uses the category utility fuctio, ad AUTOCLASS uses a Bayesia method to derive probable class distributio. DECA is a discrete valued clusterig algorithm, ad COOLCAT is a etropy based algorithm. Distace hierarchy associates each lik with a weight ad requires the domai experts to icorporate kowledge (Cao, Liag, Li, Bai, & Dag, 2011). Rough set theory (RST) by Pawlak (1982) received a great deal of attetio for dealig with categorical data i clusterig algorithms due to its stable results ad o requiremet of domai expertise. It uses global data properties to establish similarity betwee the obects (Bea & Kambhampati, 2008). The existig clusterig algorithms (Parmar et al., 2007; Mazlack, He, Zhu, & Coppock, 2000; Kumar & Tripathy, 2009; Tripathy & Ghosh, 2011a; Tripathy & Ghosh, 2011b; Herawa, Deris, & Abaway, 2010; Herawa, Ghazali, Yato, & Deris, 2010) based o RST utilize the correlatio of attributes. If the attributes are depedet, the the clusterig algorithms based o the correlatio of attributes produce correct results. The attributes R, F, M, ad P chose for describig the purchasig characteristics of customers are idepedet so that there is o proportioal relatioship betwee the attributes. This cocept led to the proposal of MADO (Miimum Average Dissimilarity betwee Obects), a clusterig algorithm based o RST. The MADO algorithm calculates the dissimilarity betwee obects without cosiderig the depedecy betwee attributes. The rest of the paper is orgaized as follows: Sectio 2 discusses the basic cocepts of rough set theory. Sectio 3 summarizes rough set theory based clusterig algorithms. Sectio 4 explais the MADO clusterig algorithm. Sectio 5 compares the clusterig results obtaied for real customer data ad sythetic data. Fially, Sectio 6 provides cocludig remarks. 2 BACKGROUND Rough set theory (RST) by Pawlak classifies imprecise, ucertai, or icomplete iformatio or kowledge expressed by data acquired from experiece (Pawlak, 1982). It gaied importace i the areas of machie learig, kowledge acquisitio, decisio aalysis, kowledge discovery from databases, expert systems, decisio support systems, iductive reasoig, ad patter recogitio (Pawlak, 1992). It is suitable for processig qualitative iformatio that is difficult to aalyze by stadard statistical techiques. It maages vague ad ucertai data or problems related to iformatio systems (Shyg, Wag, Tzeg, & Wu, 2007). The iformatio system is a 4-tuple (quadruple) S = (U, A, V, f), where U ad A are a o-empty fiite sets of obects ad attributes, respectively, ad V is the set cotaiig the domai of each attribute, where V a deotes the domai of attribute a that belogs to A. The fuctio f: U A V is a total fuctio such that f(u, a) V a, for every (u, a) U A ad is called the iformatio fuctio. RST is maily based o the idisceribility relatio, equivalece class, lower approximatio, ad upper approximatio (Pawlak & Skowro, 2007). The idisceribility relatio (Id(B)) is a relatio o U. Give two obects, x i, x U, they are idiscerible by the set of attributes B i A if ad oly if a(x i ) = a (x ) for every a B. That is, (x i, x ) Id (B) if ad oly if a B where B A ad a(x i ) = a (x ). The equivalece class ([x i ] Id(B) ) is the set of obects x i havig the same values for the set of attributes i B. This is also kow as a elemetary set with respect to B. The lower approximatio (X LB ) is the uio of all the elemetary sets with respect to B that are cotaied i X. The upper approximatio (X UB ) is the 2

3 uio of all the elemetary sets with respect to B that have a o-empty itersectio with X. Eqs. (1) ad (2) calculate the lower ad upper approximatio, respectively. XLB { xi [ xi] Id(B) X} (1) XUB { xi [ xi] Id(B) X } (2) The lower approximatio cosists of all obects that defiitely belog to the cocept while the upper approximatio cotais all obects that possibly belog to the cocept. The differece betwee the upper ad the lower approximatio costitutes the boudary regio of the vague cocept. Approximatios are two basic operatios i rough set theory; thus it expresses vagueess ot by meas of membership but by employig a boudary regio of a set. If the boudary regio of a set is empty, the set is crisp. Otherwise the set is rough (iexact) (Pawlak & Skowro, 2007). The ratio of the cardiality of the lower approximatio ad the cardiality of the upper approximatio is defied as the accuracy of estimatio, a measure of roughess (Pawlak, 1992). May clusterig algorithms based o RST have bee developed, ad they are overviewed i the ext sectio. 3 RELATED WORK Mazlack et al. (2000) proposed two techiques based o rough set theory to select clusterig attributes. These are bi-clusterig (BC) ad total roughess (TR) techiques. BC is applicable oly for bi-valued attributes. TR hadles multi-valued attributes ad is based o the total average of the mea roughess of a attribute with respect to the set of all attributes i a iformatio system. The attribute with the higher TR is chose as the clusterig attribute. However, for partitioig, the method starts with biary valued attributes ad uses the total roughess criterio oly for multi-valued attributes. Therefore, partitioig is doe o a multi-valued attribute oly whe all the biary valued attributes have already bee partitioed. This reduces the efficiecy of the algorithm because the partitioig is doe o a biary valued attribute eve whe the total roughess value for the multi-valued attribute is high. The roughess, mea roughess, ad total roughess i TR are calculated usig Eqs. (3), (4), ad (5), respectively. XLB R(X) (3) XUB Rough(i) = R(X i ) i 1 (4) Total roughess(i) = m Rough(i) i 1 m (5) X idicates the set of obects i the data set; X i idicates the set of obects i the sub-partitio of attribute I; idicates the umber of sub-partitios of attribute I, ad m idicates the umber of attributes. I Eq. (3), X LB ad X UB are calculated usig Eqs. (1) ad (2), respectively. I Eq. (4), Rough(i) is the mea roughess of all sub-partitios of attribute i. I order to choose the partitioig attribute i, the Total roughess(i) towards all the attributes is calculated usig Eq. (5). The attribute with the highest total roughess is chose as the clusterig attribute. However, for partitioig, the method starts with biary valued attributes ad uses the total roughess criterio oly for multi-valued attributes. This creates a disadvatage due to the fact that the partitioig is doe o a biary attribute eve though the total roughess for a multi-valued attribute is higher. Parmar et al. (2007) proposed a ew techique called mi-mi roughess (MMR) for multi-valued attributes. The MMR techique is based o the miimum value of mea roughess ad does ot require total roughess for calculatig as i TR. The roughess ad mea roughess i MMR is calculated usig Eqs. (6) ad (7), respectively. Give a i, a A (set of attributes), V( ) ai is the set of values of attributes a i, ad the a i a. X is a subset of obects havig oe specific value for the attribute a i,, that is X (a i = ). ( X ( a )) a i is the lower approximatio of L 3

4 X with respect to {a }, ad ( Xa ( ai )) is the upper approximatio of X with respect to {a }. Thus, R (X) U a is defied as the roughess of X with respect to {a }, which is give by Eq. (6). Also, the mea roughess o attribute a i with respect to {a } is defied i Eq. (7). ( Xa ( ai )) L Ra (X ai ) 1 ( Xa ( ai )) U Rough a ( a i ) R a ( X ai 1)... R a ( X ai V( i ) (7) V( ai ) The maximum value of roughess is oe because the umber of obects i the lower approximatio is less tha or equal to the umber of obects i the upper approximatio. The mi-roughess (MR) of each attribute refers to the miimum of the mea roughess with respect to a sigle attribute. The mi-mi-roughess (MMR) is defied as the miimum MR of attributes. The MMR value determies the splittig attribute. The ode with the maximum umber of elemets is chose for further splittig. From the experimetal aalysis i Parmar et al. (2007), MMR achieves better results whe compared to BC, TR, fuzzy set based algorithms, ad other dissimilarity approaches. Kumar ad Tripathy (2009) proposed MMeR by makig a slight alteratio i MMR. I their algorithm, the mea-roughess (MeR) of each attribute is calculated as the mea of the mea roughess. The mi-mea-roughess (MMeR) is defied as the miimum of MeR of attributes. The MMeR value determies the splittig attribute. The ode that has the maximum average distace betwee its elemets is chose for further splittig. Tripathy ad Gosh (2011a) proposed a algorithm for clusterig categorical data where the stadard-deviatio-roughess (SDR) of each attribute is calculated as the stadard deviatio of the mea roughess. The attribute with the miimum SDR is chose as the splittig attribute. Tripathy ad Gosh (2011b) also proposed a algorithm called SSDR. Here the stadard deviatio of SDR (SSDR) is calculated, ad the splittig attribute is chose based o this value. The ode chose for further splittig i the SDR ad SSDR algorithm is the same as that of the MMeR algorithm. SSDR achieves better results tha SDR for small data sets while for large data sets it achieves the same results as SDR. The roughess used i MMeR, SDR, ad SSDR is calculated usig the same formula as i MMR. All the algorithms TR, MMR, MMeR, SDR, ad SSDR i the above examples (Parmar et al., 2007; Mazlack et al., 2000; Kumar & Tripathy, 2009; Tripathy & Ghosh, 2011a; Tripathy & Ghosh, 2011b) have bee tested o data sets available i the UCI machie learig repository. The purity of clusters was used as a measure to test the quality of the clusters. The purity ratios of MMR, MMeR, SDR, ad SSDR are i icreasig order. All these clusterig algorithms are based o the roughess of a attribute with other attributes. Herawa et al. (2010a) proposed a ew techique called maximal depedecy attributes (MDA) for selectig clusterig attributes. Based o MDA, Herawa et al. (2010b) proposed MADE for categorical data clusterig. MDA selects the clusterig attribute by determiig the depedecies betwee attributes. The attribute with the maximum degree of depedecy is selected as the partitioig attribute. 4 PROPOSED ALGORITHM The calculatio of roughess ad determiatio of depedecies betwee attributes i real customer data are ot appropriate due to the dyamic behavior of the customers. The MADO clusterig algorithm overcomes this disadvatage by utilizig the equivalece class property of rough set theory. The splittig attribute is chose based o the dissimilarity betwee the obects i the same equivalece class. Let A idicate the set cotaiig the attributes i the data ad B be aother set cotaiig the obects whose value for a particular attribute is same. Equivalece class is calculated for each attribute with a specific value. I each calculatio, the set B cotais the obects of that equivalece class. The dissimilarity betwee the obects withi the set B is calculated usig Eq. (8). The average dissimilarity betwee the obects i a equivalece class or set B is calculated usig Eq. (9). a ) (6) 4

5 V( xi, x) dissim( xi, x) 1 (8) Here x i, x belog to the same equivalece class; is the umber of attributes i A; ad V( xi, x ) attributes havig same value for the obects x i, x. m 1 m dissim( xi, x ) i 1 i 1 avgdissim(b) m(m 1) / 2 is the umber of (9) Here B idicates a equivalece class; m is the umber of obects i set B, ad x i, x belogs to set B. The equivalece class B of a attribute a i havig value v is represeted as [a i ]. Set B, which has the miimum average dissimilarity ad has at least two obects, determies the splittig attribute a i o its value v for the paret ode. Iitially, the paret ode cotais all the obects. After partitio, the leaf ode havig more obects is selected as the paret ode for further partitioig i subsequet iteratios. The algorithm termiates whe it reaches a pre-defied umber of clusters. The procedure for the MADO clusterig algorithm is give i Figure 1. Procedure (U,k) Lie 1 Begi Lie 2 Set curret umber of cluster CNC = 1 Lie 3 Set k as required umber of clusters Lie 4 Set ParetNode = U Lie 5 Do Lie 6 For each a i from A (i = 1 to, where is the umber of attributes i A) Lie 7 For =1 to l, where l is the umber of differet values i a i Lie 8 I the ParetNode, determie family of equivalece classes for a i with value which is deoted as set B Lie 9 Calculate avgdissim (B) Lie 10 Next Lie 11 Next Lie 12 Set Mi-Avg-Dissim = Mi (avgdissim (B)) for each B >1 Lie 13 Determie splittig attribute a i o its value v (a i ) correspodig to the Mi-Avg-Dissim Lie 14 CNC = CNC + 1 Lie 15 ParetNode = ProcParetNode (CNC) Lie 16 While (CNC<k) Lie 17 Ed Lie 18 ProcParetNode (CNC) Lie 19 Begi Lie 20 For i = 1 to CNC Lie 21 Size (i) = Cout (Set of Elemets i Cluster i) Lie 22 Next Lie 23 Determie Max (Size (i)) Lie 24 Retur (Set of Elemets i cluster i) correspodig to Max (Size (i)) Lie 25 Ed Figure 1. MADO algorithm 5 EXPERIMENTAL RESULTS I this sectio, real data sets of customer trasactio details are used for clusterig or segmetig the customers. Customer trasactio details for a period of six moths have bee collected from four differet eterprises. Data set1 cosists of records; data set2 cosists of records; data set3 cosists of records; ad data set4 cosists of records. For each trasactio, party id, date of purchase, amout of purchase, ad paymet of purchase are used to defie R, F, M, ad P values. The distict party id represets the idividual customer. For each distict party id, R is calculated as the iterval betwee the time that the latest cosumig behavior occurred ad the preset; F is calculated as the umber of trasactio records; M is calculated as the total purchase amout; ad P is calculated as the average time iterval (i terms of days) betwee the paymet date ad purchase date for 5

6 each trasactio i the data set. The data set ow has oly four attributes, amely R, F, M, ad P, for each customer. Dataset1 has 5062 customers; data set2 has 1420 customers; data set3 has 3811 customers; ad data set4 has 2675 customers. The values of R, F, M, ad P are ormalized as give below: For ormalizig R or P: 1) Sort the data set i descedig order of R or P. 2) Divide the data set ito five equal parts with 20% of the records i each. 3) Assig categorical values as very low, low, middle, high, ad very high to the first, secod, third, fourth, ad fifth parts of the records, respectively. For ormalizig F or M: 1) Sort the data set i ascedig order of F or M 2) Divide the data set ito five equal parts with 20% of the records i each. 3) Assig categorical values as very low, low, middle, high, ad very high to first, secod, third, fourth, ad fifth part of the records, respectively. The ormalized data set is ow used by the MADO clusterig algorithm to segmet the customers ito various groups. MMR, MMeR, SDR, SSDR, ad MADE algorithms are applied to the same four data sets so that the customers are divided ito various groups. Criteria cohesio ad couplig are used to measure the iteral quality of the cluster (Satos, Heuser, Moreira, & Wives, 2011). Cohesio expresses the average similarity betwee the elemets of a cluster. Couplig expresses the average similarity betwee all pairs of elemets, where oe elemet belogs to cluster C ad the other does ot. Ideally, cohesio should be high ad couplig should be low (Kuz & Black, 1995). The formulas for cohesio ad couplig for a cluster C are give by Eqs. (10) ad (11), respectively. The total cohesio ad total couplig of clusters are give by Eqs. (12) ad (13), respectively. sim( ci, c) i Cohesio(C) (10) m*(m 1) 2 sim( ci, q ) i, Couplig(C) (11) m* k Total cohesio Cohesio(C) (12) 1 C k Total couplig Couplig(C) (13) C 1 Here sim(c i,c ) is the similarity score betwee elemets c i ad c belogig to cluster C; sim(c i,q ) is the similarity betwee elemet c i from cluster C ad elemet q from aother cluster; m is the umber of elemets i C; is the umber of elemets outside C; ad k is the umber of clusters. The results of MMR, MMeR, SDR, SSDR, MADE, ad MADO algorithms are compared by varyig the umber of clusters to be produced from three to seve. I each case, the total couplig ad the total cohesio of the clusters are calculated usig Eqs. (12) ad (13). The results of four data sets for all the five cases produced by the clusterig algorithms are tabularized i Tables 1 through 8. The MADE algorithm produces the value zero for the degree of depedecy calculated for each attribute. This is because the lower approximatio of each attribute for each value is a ull set. Therefore this algorithm could ot be applied further to produce the clusterig result for the cosidered data set. The results of SDR ad SSDR are the same because for large data sets, SSDR achieves the same result as SDR (Tripathy & Ghosh, 2011b). Table 1. Total cohesio value for data set1 Total cohesio MADO MMR MMeR SDR & SSDR

7 Table 2. Total couplig value for data set1 Total couplig MADO MMR MMeR SDR & SSDR Table 3. Total cohesio value for data set2 Total cohesio MADO MMR MMeR SDR & SSDR Table 4. Total couplig value for data set2 Total couplig MADO MMR MMeR SDR & SSDR Table 5. Total cohesio value for data set3 Total cohesio MADO MMR MMeR SDR & SSDR Table 6. Total couplig value for data set3 Total couplig MADO MMR MMeR SDR & SSDR Table 7. Total cohesio value for data set4 Total cohesio MADO MMR MMeR SDR & SSDR

8 Table 8. Total couplig value for data set4 Total couplig MADO MMR MMeR SDR & SSDR Tables 1 through 8 illustrate that the clusters produced by the MADO algorithm have o average high cohesio ad low couplig values for all five cases with respect to the other algorithms. Because the clusterig algorithm performace depeds o producig clusters with high cohesio ad low couplig values, the proposed algorithm out performs the other rough set based clusterig algorithms for the cosidered four real customer data sets. The reaso behid this is that the proposed algorithm cosiders the dissimilarity betwee the obects i the same equivalece class whe choosig the splittig attribute istead of fidig the correlatio betwee attributes as the other rough set based clusterig algorithms do. The proposed clusterig algorithm is also applicable for high dimesioal data. It has bee tested for soyabea ad zoo data sets obtaied from the UCI Machie Learig Repository. The soybea data set cotais 47 obects with 35 categorical attributes. The zoo data set cotais 101 obects with 18 categorical attributes. The purity of clusters is used to test the quality of the clusters if the class label is already kow. I a real data set, because the class label is ot kow, cohesio ad couplig are used to test the quality of the clusters. I the sythetic data set obtaied from the repository, the class label is available i the data set ad so purity is used as a measure to test the quality of the clusters. The purity of a cluster i is defied by Eq. (14). The overall purity is defied by Eq. (15). i Purity(i) = r (14) r Overall purity = i 1 purity( i ) (15) Here, i r idicates the umbers of obects occurrig i cluster i ad its correspodig class r; r idicates the umber of obects i the class r; ad idicates the umber of clusters i the data set. A higher value of overall purity idicates a better clusterig result, with perfect clusterig yieldig a value of 1. I the zoo data set, each obect is classified as belogig to oe of the 7 classes. The MMR, MMeR, SDR, SSDR, ad proposed MADO clusterig algorithms are executed to obtai 7 classes. Purity for each cluster is calculated usig Eq. (14), ad the overall purity is calculated usig Eq. (15). The results obtaied from the MMR, SSDR, ad MADO algorithms are give i Tables 9, 10, ad 11, respectively. From Table 9, it is observed that out of 101 obects, 92 obects are classified correctly. From Table 10, it is observed that out of 101 obects, 79 obects are classified correctly. From Table 11, it is observed that out of 101 obects, 95 obects are classified correctly. Thus the overall purity of the clusters obtaied usig MMR, SSDR, ad the proposed algorithm is 0.91, , ad respectively. The overall purity of the clusters obtaied usig MMeR ad SDR is ad respectively (Tripathy & Ghosh, 2011b). Thus, the proposed clusterig algorithm produces good clusters for a high dimesioal sythetic data set. Table 9. MMR output for the zoo data set Cluster C1 C2 C3 C4 C5 C6 C7 Purity Number

9 Table 10. SSDR output for the zoo data set Cluster C1 C2 C3 C4 C5 C6 C7 Purity Number Table 11. MADO output for the zoo data set Cluster C1 C2 C3 C4 C5 C6 C7 Purity Number I the zoo data set, the MMR algorithm performs better whe compared to the MMeR, SDR, ad SSDR algorithms. Next, the MMR ad proposed clusterig algorithms are used for the soybea data set. I the soybea data set, each obect is classified as belogig to oe of the 4 diseases or 4 classes. The MMR ad proposed clusterig algorithms are executed to obtai 4 clusters. Purity for each cluster is calculated usig Eq. (14). The results obtaied from the MMR ad the proposed algorithm are give i Tables 12 ad 13, respectively. From Table 12, it is observed that out of 47 obects, 39 obects are classified correctly. From Table 13, it is observed that out of 47 obects, 46 obects are classified correctly. Thus the overall purity of the clusters obtaied usig the MMR ad the proposed algorithm is ad , respectively. Table 12. MMR output for soyabea data set Cluster Disease 1 Disease 2 Disease 3 Disease 4 Purity Number Table 13. MADO output for soybea data set Cluster Disease 1 Disease 2 Disease 3 Disease 4 Purity Number The results obtaied for the sythetic data set show that the proposed algorithm produces improved clusterig results i terms of purity whe compared to other clusterig algorithms. Therefore, the proposed clusterig algorithm ca cluster data i a better way irrespective of the umber of attributes cosidered. 9

10 6 CONCLUSION Customer segmetatio for CRM is achieved usig a data miig clusterig techique. This techique is tested i four various real data sets. Data related to customers are categorical i ature so the usual hierarchical ad partitioig algorithms are ot applicable. The importace of rough set theory for categorical clusterig is discussed, ad the algorithms based o this techique are studied. The proposed MADO algorithm cosiders the dissimilarity betwee obects i the same equivalece class. The cluster quality for a real data set is measured usig cohesio ad couplig. The experimetal results for the cosidered four customer data sets show that the MADO algorithm produces clusters with high cohesio ad low couplig values for all four cases with respect to other algorithms. The applicability of the MADO algorithm to high dimesioal data sets has bee prove by testig it with sythetic data sets. The experimetal results show that the MADO algorithm clusters high dimesioal data i a better way whe compared to other clusterig algorithms. I the future, the behavior or characteristics of customers i each segmet ca be aalyzed so that marketers ca employ differet strategies for each group. Furthermore, the life time value of a customer ca be icreased by adoptig suitable techiques for each customer segmet. 7 REFERENCES Bea, C. & Kambhampati, C. (2008) Autoomous clusterig usig rough set theory. Iteratioal Joural of Automatio ad Computig, 5(1), Cao, F., Liag, J., Li, D., Bai, L., & Dag, C. (2011) A dissimilarity measure for the k-modes clusterig algorithm. Kowledge Based Systems, 26, Che, H., Chuag, K., & Che, M. (2008) O data labelig for clusterig categorical data. IEEE Trasactios o Kowledge ad Data Egieerig, 20(11), Cheg, C. & Che, Y. (2009) Classifyig the segmetatio of customer value via RFM model ad RS theory. Expert Systems with Applicatios, 36(3), Gati, V., Gehrke, J., & Ramakrisha, R. (1999) CACTUS-Clusterig categorical data usig summaries. Proceedigs of 5th ACM SIGKDD Iteratioal coferece o Kowledge Discovery ad Data Miig (pp ). Sa Diego, CA, USA. Guha, S., Rastogi, R., & Shim, K. (2000) ROCK: A robust clusterig algorithm for categorical attributes. Iformatio Systems, 25(5), Herawa, T., Deris, M.M., & Abaway, J.H. (2010) A rough set approach for selectig clusterig attribute. Kowledge Based Systems, 23(3), Herawa, T., Ghazali, R., Yato, I.T.R, & Deris, M.M. (2010) Rough set approach for categorical data clusterig. Iteratioal Joural of Database Theory ad Applicatio, 3(1), Kumar, P. & Tripathy, B.K. (2009) MMeR: a algorithm for clusterig heterogeeous data usig rough set theory. Iteratioal Joural of Rapid Maufacturig, 1(2), Kuz, T. & Black, J.P. (1995) Usig automatic process clusterig for desig recovery ad distributed debuggig. IEEE Trasactios o Software Egieerig, 21(6), Lig, R. & Ye, D.C. (2001) Customer relatioship maagemet: A aalysis framework ad strategies. Joural of Computer Iformatio Systems, 41(3), implemetatio Mazlack, L.J., He, A., Zhu, Y., & Coppock, S. (2000) A rough set approach i choosig clusterig attributes. Proceedigs of the 13th iteratioal coferece o Computer Applicatios i Idustry ad Egieerig (pp. 1-6). Hoolulu, Hawaii, USA. Ngai, E.W.T., Xiu, L., & Chau, D.C.K. (2009) Applicatio of data miig techiques i customer relatioship maagemet: A literature review ad classificatio. Expert Systems with Applicatios, 36(2),

11 Data Sciece Joural, Volume 13, 9 April 2014 Parmar, D., Wu, T., & Blackhurst, J. (2007) MMR:a algorithm for clusterig categorical data usig rough set theory. Data ad Kowledge Egieerig, 63(3), Pawlak, Z. (1982) Rough set. Iteratioal Joural of Computer ad Iformatio Scieces, 11(5), Pawlak, Z. (1992) Rough sets: Theoretical aspects of reasoig about data, Norwell, MA, USA: Kluwer Academic Publishers. Pawlak, Z. & Skowro, A. (2007) Rudimets of rough sets. Iformatio Scieces, 177(1), Satos, J.B., Heuser, C.A., Moreira, V.P., & Wives, L.K. (2011) Automatic threshold estimatio for data matchig applicatios. Iformatio Scieces, 181(13), Shyg, J.Y., Wag, F.K., Tzeg, G.H., & Wu, K.S. (2007) Rough set theory i aalyzig the attributes of combiatio values for the isurace market. Expert Systems with Applicatios, 32(1), Tripathy, B.K. & Ghosh, A. (2011) SDR: A algorithm for clusterig categorical data usig rough set theory. Proceedigs of IEEE coferece o Recet Advaces i Itelliget Computatioal Systems (pp ). Trivadrum, Idia. Tripathy, B.K. & Ghosh, A. (2011) SSDR: A algorithm for clusterig categorical data usig rough set theory. Advaces i Applied Sciece Research, 2(3), Wu, J. & Li, Z. (2005) Research o Customer Segmetatio Model by Clusterig. Proceedigs of the ACM 7th iteratioal coferece o Electroic commerce (pp ). Xi-a, Chia. (Article history: Received 26 March 2013, Accepted 19 February 2014, Available olie 3 April 2014) 11

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history: