ABSTRACT

WEIQING, JIN. Fuzzy Classification Based On Fuzzy Association Rule Mining (Under the direction of Dr. Robert E. Young).

In fuzzy classification of high-dimensional datasets, the number of fuzzy rules increases exponentially with the number of attributes. Fuzzy association rule mining with appropriate threshold values can help to design a fuzzy classifier by significantly decreasing the number of interesting rules. In this dissertation, we investigate how to integrate fuzzy association rule mining and fuzzy classification. First, the framework of fuzzy association rule mining is presented, which incorporates fuzzy set modeling into an association rule mining technique. It avoids the sharp boundary problem caused by the arbitrary determination of intervals on the domains of quantitative attributes, while presenting natural and clear knowledge in the form of linguistic rules. We study the impact of different fuzzy aggregation operators on the rule mining result; the selection of the operator should depend on the application context. Based on the framework of fuzzy association rule mining, we propose a heuristic method to construct the fuzzy classifier based on the set of fuzzy class association rules. We call this method the FCBA approach, where FCBA stands for fuzzy classification based on association. The objective is to build a classifier with strong classification ability. In the FCBA approach, we use the composite criteria of fuzzy support and fuzzy confidence as the rule weight to indicate the significance of a rule. Through our study, we find it is important to find a good combination of these two rule interestingness threshold values. The classification of each record is achieved by applying the classic fuzzy reasoning method, in which each record is classified as the consequent of the rule with the maximum product of the compatibility grade and the rule weight. We use a well-known classification problem, the Iris dataset, and a high-dimensional classification problem, the Wine dataset, to compare the proposed FCBA approach with other non-fuzzy and fuzzy classification approaches. The empirical study shows that the FCBA approach performs well on these datasets in both accuracy and interpretability.

FUZZY CLASSIFICATION BASED ON FUZZY ASSOCIATION RULE MINING

by WEIQING JIN

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Industrial Engineering.

Raleigh, NC 2004

APPROVED BY:
Dr. Robert E. Young, Chair of Advisory Committee
Dr. Michael G. Kay, Advisory Committee
Dr. Denis R. Cormier, Advisory Committee
Dr. Laurie Williams, Advisory Committee

BIOGRAPHY

Weiqing Jin is a Ph.D. student in the Department of Industrial Engineering at North Carolina State University. He received dual B.S. degrees in Material Engineering and International Finance from Shanghai Jiao Tong University, Shanghai, P.R. China, in 1996. He then obtained an M.S. degree in Mechanical Engineering from the same university in 1999. In the fall of 2000, he came to North Carolina State University to start his Ph.D. study in the Department of Industrial Engineering. His research interests include data mining, machine learning, artificial intelligence, pattern classification, and fuzzy reasoning. While working on his Ph.D., he also served as a teaching assistant for the database applications course and was a database engineer (co-op) in industry.

ACKNOWLEDGEMENTS

I would like to express my appreciation to all who have supported this research. I would like to express my sincere gratitude to my advisor, Dr. Robert E. Young, for his encouragement, thoughtful advice, kindness, and patience throughout my Ph.D. study. His guidance was not only a good inspiration in the past, but will also be a very valuable asset to me in the future. Special thanks are extended to Dr. Michael G. Kay for his constructive suggestions and valuable advice. I am also grateful to Dr. Denis R. Cormier and Dr. Laurie Williams for their teaching, their service as my committee members, and their thoughtful comments. I would also like to thank Dr. Mark Walker for being the Graduate Representative. I am thankful to Dr. David Dickey in the Statistics department for his excellent data mining course that helped to expand my research horizon. I am deeply indebted to my family for their support. Their love and encouragement made this study possible.

Table of Contents

List of Tables
List of Figures

1. Introduction
   1.1 Fuzzy Classification
   1.2 Integration of Association Rule Mining and Classification
   1.3 Research Scope and Objective
   1.4 Dissertation Organization
2. Classification
   2.1 Definition of Classification
   2.2 Data Preparation for Classification
   2.3 Traditional Classification Techniques
       2.3.1 Statistical Classification Methods
       2.3.2 Decision Tree Method
   2.4 Classification Accuracy
3. Association Rule Mining
   3.1 Definition of Association Rule Mining
   3.2 Apriori Algorithm
       3.2.1 Downward Closure Property
       3.2.2 Iterative Procedure
       3.2.3 Hash Tree Implementation
   3.3 Quantitative Association Rule Mining
4. Fuzzy Aggregation Operators and Fuzzy Rules
   4.1 Fuzzy Sets
   4.2 Fuzzy Aggregation Operators
       4.2.1 t-norm and t-conorm Operators
       4.2.2 Compensatory Operators
   4.3 Fuzzy Relations and Fuzzy Rules
5. Fuzzy Association Rule Mining
   5.1 Introduction
   5.2 Fuzzy Association Rule Mining
       5.2.1 Syntax and Interestingness Measurement
       5.2.2 Mining Algorithm
   5.3 Empirical Study
       5.3.1 The Auto-Mpg Dataset
       5.3.2 The Page Blocks Dataset
       5.3.3 Result Analysis and Discussion
6. Fuzzy Classification
   6.1 Advantages of Using a Fuzzy Classifier
   6.2 General Fuzzy Classifier Model
   6.3 Rule Weight and Fuzzy Reasoning Method
7. Designing a Fuzzy Classifier from Fuzzy Association Rule Mining
   7.1 Literature Review
       7.1.1 Classification on Non-Fuzzy Data
       7.1.2 Classification on Fuzzy Data
   7.2 Problem Formulation
   7.3 Fuzzy Class Association Rule Mining Algorithm
   7.4 Fuzzy Classification Based On Association (FCBA)
       7.4.1 Review of the Adaptive Rule Weight Method
       7.4.2 FCBA Algorithm
8. Empirical Study of Fuzzy Classification Based On Association
   8.1 Fuzzy Partitioning
   8.2 The Iris Dataset
   8.3 The Wine Dataset
   8.4 Summary of Results
9. Conclusions and Future Research
   9.1 Conclusions
   9.2 Future Research
List of References

List of Tables

3.1 An example of a transaction database
3.2 People table and example of quantitative association rules
3.3 Mapping from quantitative attributes to Boolean attributes
3.4 Discrete interval method with overlap
4.1 Common t-norms and their dual t-conorms
4.2 Common fuzzy implication functions
4.3 Logical equivalence of A → B and ¬A ∨ B
5.1 Truth table of fuzzy association rules
5.2 Truth table of conjunction
5.3 Truth table of implication
5.4 A record of the mpg database
5.5 Attributes of the mpg dataset
5.6 First 10 records of the mpg database
5.7 The fuzzy set values transformed from the data in Table 5.6
5.8 The normalized fuzzy set membership values
5.9 First ten fuzzy conjunctive association rules
8.1 Accuracy on the Iris dataset by non-fuzzy classification methods
8.2 Accuracy on the Iris dataset by the adaptive method based on LV
8.3 Result on the Iris data by the FCBA method based on LV
8.4 Performance comparison between the GA and FCBA methods
8.5 Accuracy on the Iris dataset by fuzzy classification methods
8.6 Accuracy of non-fuzzy classification methods on the Wine dataset
8.7 10-CV accuracy on the Wine dataset by the GBML-based method
8.8 10-CV accuracy on the Wine dataset by the FCBA method

List of Figures

2.1 Classification process
2.2 Two-class separation by linear discrimination
2.3 Decision tree example for the concept of playing tennis
2.4 Simple-to-complex growth of a decision tree
2.5 Empirical vs. true accuracy
3.1 Apriori algorithm
3.2 Hash tree illustration diagram
3.3 Discrete interval method
4.1 Fuzzy set membership functions
4.2 α-cut, support, core, and height
4.3 Linguistic variable temperature
4.4 Min and max operators of fuzzy sets
5.1 Fuzzy set defined on an interval
5.2 An example of linguistic terms and its term set definition
5.3 Fuzzy association rule mining algorithm
5.4 Fuzzy sets of attribute mpg
5.5 Fuzzy sets of attribute acceleration
5.6 Fuzzy sets of attribute displacement
5.7 Fuzzy sets of attribute weight
5.8 Fuzzy sets of attribute horsepower
5.9 Number of rules vs. minimum support on un-normalized fuzzy set membership values
5.10 Number of rules vs. minimum support on normalized fuzzy set values
5.11 Fuzzy set membership values of attribute acceleration
5.12 Another fuzzification scheme for attribute acceleration
5.13 Number of rules vs. minimum support at different confidence values
5.14 Number of rules vs. minimum support (multiple linguistic terms)
5.15 Number of rules vs. minimum support (single most important linguistic term)
5.16 Effect of minimum support on the number of frequent term sets
5.17 The relationship between the minimum support and the number of rules
6.1 Fuzzy rule representation
6.2 Classification boundary by fuzzy rules and the look-up table method
6.3 Effect of rule weight
7.1 Fuzzy class association rule mining algorithm
7.2 FCBA algorithm
8.1 Attribute with five uniform fuzzy partitions
8.2 Effect of minimum support on accuracy on the Wine dataset for FCBA

Chapter 1

Introduction

1.1 Fuzzy Classification

Many real-world decision-making problems can be treated as classification [Weiss 1990]. For instance, when a financial bank plans to promote a new type of credit card, it will send marketing mail to its current customers. The purpose of the mail is to promote the new credit card. However, in order to minimize the marketing campaign cost, the bank wants to target the portion of customers who will respond to the promotion mail and apply for the credit card. This problem can be treated as a classification problem that classifies customers into two groups: one group will respond to the promotion mail and the other will not. It would be helpful to management if we could construct a classifier based on linguistically interpretable rules. One approach to solving this classification problem is to formulate a solution using a fuzzy rule-based classifier. A fuzzy classifier can extract rules in a linguistic format that is more interpretable than other nonlinear approaches such as neural networks. In fact, fuzzy rule-based classification systems have been widely applied in the pattern classification area [Kuncheva 2000]. Another characteristic of a fuzzy classifier is its smooth classification boundary, which results from the overlap between the fuzzy spaces. This helps in classification applications where a crisp partitioning boundary cannot be easily drawn; examples include face and voice recognition, and handwriting verification. This dissertation studies the following fuzzy classification problem: given a set of training records represented in fuzzy membership values and associated class labels, we need to generate a fuzzy classifier containing a set of fuzzy linguistic rules to predict the class labels of unseen (future) records.

Different approaches for fuzzy rule-based classification systems have been proposed in the past. Fuzzy if-then rules derived from numerical data have been obtained by heuristic procedures [Ishibuchi 1992], [Abe 1995], [Nozaki 1996], neuro-fuzzy techniques [Pedrycz 1992], [Mitra 1997], [Nauck 1997], and genetic algorithms [Ishibuchi 1995], [Cordon 1998], [Gonzalez 1999]. Generally speaking, extracting fuzzy classification rules from numerical data involves two phases: first, partitioning each attribute's domain based on linguistic term definitions, and second, selecting significant classification rules. This type of approach is often referred to as the simple fuzzy grid method [Ishibuchi 1992]. However, the fast growth of information technology has resulted in large datasets with many attributes. For high-dimensional datasets, the simple fuzzy grid method suffers from the curse of dimensionality, because the total number of fuzzy rules increases exponentially with the number of attributes. This usually results in a huge set of fuzzy rules to be examined. Thus, it becomes very difficult to construct a fuzzy classifier from datasets that contain many attributes. Compared to non-fuzzy data classification, the exponential combination effect is more severe for datasets with many attributes, since multiple fuzzy linguistic values are defined on each attribute's domain.

1.2 Integration of Association Rule Mining and Classification

Association rule mining aims to find certain association relationships among a set of attributes in a dataset. It finds all the rules in the dataset that satisfy pre-specified minimum support and minimum confidence values. By incorporating fuzzy set modeling into association rule mining, we can extract fuzzy association rules from datasets [Kuok 1998]. Fuzzy association rules containing linguistic terms can express more natural and interpretable knowledge. By extending the concepts from association rule mining, fuzzy support and fuzzy confidence can be used to measure rule interestingness in fuzzy association rule mining. Association rule mining and classification are two important topics in the data mining area [Chen 1996]. They can be viewed within a unifying framework of rule discovery [Agrawal 1993b].

The rapidly growing available computing power has facilitated the use of these two techniques on large datasets. There have been some studies on applying the concept of association rule mining in classification, though most of them were targeted at non-fuzzy data [Lent 1997], [Liu 1998], [Li 2001]. The main difference between a fuzzy classification rule and a fuzzy association rule is that a fuzzy classification rule contains only the class label as the rule consequent. Therefore, fuzzy classification rules can be treated as a subset of fuzzy association rules. By applying the fuzzy association rule mining technique, we can construct a fuzzy classifier from large datasets by choosing high-quality rules from among all possible rules. In order to select high-quality rules, a rule weight needs to be attached to each rule to indicate its significance. Different kinds of rule weights can be derived by individual or composite use of rule interestingness measurements such as fuzzy support and fuzzy confidence. Fuzzy support indicates the compatibility grade between all the data and the rule, while fuzzy confidence indicates the accuracy of the fuzzy rule. The goal of fuzzy classification is to construct a fuzzy classifier with strong classification ability. Classification accuracy and interpretability are the two major criteria for evaluating fuzzy classifiers. Accuracy is the ability to correctly classify unseen data, while interpretability is the level of understanding and insight that is provided by the model.

1.3 Research Scope and Objective

This dissertation studies the integration of fuzzy association rule mining and fuzzy classification. In fuzzy association rule mining, the fuzzy aggregation operator plays an important role in the rule mining process, and the selection of an appropriate operator depends on the application context. We will study the impact of different operators on fuzzy association rule mining. Fuzzy association rule mining allows us to extract rules from datasets with many attributes. Based on the results of fuzzy association rule mining, we propose a heuristic method to construct a fuzzy classifier. In the literature, the classification ability of a fuzzy classifier is mainly measured by the classification accuracy and the interpretability.

The accuracy of a fuzzy classifier on a dataset is defined as the percentage of correctly classified data among the total given data. The interpretability of a fuzzy classifier is measured by the total number of rules in the fuzzy classifier. Our objective is to design a fuzzy classifier with strong classification ability, meaning higher classification accuracy and a smaller number of rules. The rule weight indicates the significance of the fuzzy rules in fuzzy classifier design. Rule interestingness measurements such as fuzzy support and fuzzy confidence have been incorporated in the design of fuzzy classifiers to help pre-screen candidate fuzzy rules [Ishibuchi 2004]. The rule weight should offer a good trade-off between generality and specificity: a rule is general if it covers many records; a rule is specific if it covers few records with high accuracy. We propose the composite criteria of using both fuzzy support and fuzzy confidence as the rule weight in fuzzy classifier design. In the process of fuzzy classifier design, redundant rules need to be removed from the set of fuzzy association rules. In addition, the tie situation that arises when multiple rules can classify one record needs to be resolved as well. We will study these issues and propose effective methods to deal with them. In order to evaluate the proposed fuzzy classifier, we will compare our approach with various non-fuzzy and fuzzy classification methods on some well-known classification problems. Different evaluation techniques, such as 10-fold cross validation and leave-one-out, will be used to estimate our fuzzy classifier's performance.

1.4 Dissertation Organization

This dissertation contains nine chapters. Chapter 1 introduces the problem of fuzzy classification from association rule mining. It also gives the research objective and the dissertation organization. Chapter 2 introduces the classification problem, traditional classification methods, and performance evaluation techniques. Chapter 3 introduces the association rule mining problem and the Apriori algorithm. The limitation of mining association rules from quantitative data is also presented. Chapter 4 introduces the fuzzy aggregation operators and fuzzy rules that will be used in the framework of fuzzy association rule mining and fuzzy classification.

Then, in Chapter 5, the fuzzy association rule mining framework is presented, including the semantics, syntax, interestingness measurement, and mining algorithm. The results and discussion of an empirical study are also provided. Chapter 6 introduces a general fuzzy classification model and discusses the importance of the rule weight and the fuzzy reasoning method in designing a fuzzy classifier. Chapter 7 proposes the framework for designing a fuzzy classifier from fuzzy class association rules, a subset of fuzzy association rules. The problem of designing a fuzzy classifier from fuzzy class association rule mining is formulated, the fuzzy class association rule mining algorithm is presented, and the heuristic approach, fuzzy classification based on association (FCBA), is proposed. Chapter 8 presents the empirical study comparing the FCBA approach with other fuzzy and non-fuzzy classification approaches from the literature. Chapter 9 draws conclusions and proposes future research directions.

Chapter 2

Classification

This chapter provides a brief introduction to classification. The definition of the classification problem is given in Section 2.1. The data preparation procedure is explained in Section 2.2. Traditional classification methods are introduced in Section 2.3, including statistical classification methods (Section 2.3.1) and the decision tree method (Section 2.3.2). Estimation techniques for classification accuracy are introduced in Section 2.4.

2.1 Definition of Classification

Many real-world decision-making problems fall into the general category of classification [Weiss 1990], [Michalski 1998]. The rule-based classification approach provides a modularized, clearly explained format for a decision, which is compatible with a human being's reasoning procedure. In the machine learning literature, classification usually means establishing rules by which we can classify new data into existing classes that are known in advance [Michie 1994]. Each record is assumed to belong to a predetermined class, and this class is called its class label. Classification is also known as supervised learning, in the sense that the classification rules are established from given data whose class labels are known. This contrasts with unsupervised learning, such as clustering, in which the class label of each given record, and even the number of classes to be learned, may not be known in advance. We present the definition of a non-fuzzy data classifier as follows. Notice that the data used in the non-fuzzy classifier are all crisp values.

Definition 2.1 Non-fuzzy Classifier

Let us assume we have n records (patterns) and m attributes (features). Let C = {C_j} (j = 1, …, q) be the set of class labels, and let T = {T_i} (i = 1, …, n) denote the set of records. A non-fuzzy data classifier is any mapping

T → C

That is, each record T_i is mapped to one class label in C.

Depending on the value of the class label, classification can be divided into different categories. In the first type of classification, the class label's value is discrete or nominal, such as Yes/No or High/Medium/Low. In the second type of classification, the class label's value is continuous or ordered. Different classification methods can be applied to these two types of problems; for instance, a decision tree is a common technique for the first type of classification problem, while regression is used for the second type. In this dissertation, we only deal with the first type of classification problem, in which the class label has only discrete or nominal values.

The classification process usually involves two stages, as illustrated in Figure 2.1. In the first stage, a classification model is built to describe a given set of data with known class labels [Han 2001]. Notice that in the context of classification, a data record is also referred to as a pattern, instance, etc. The data is often divided into training data and testing data. The data used to build the classification model are called training data; they are randomly selected from the sample population. The knowledge output of the model is usually represented in the form of classification rules or a mathematical formula. In the second stage of the classification process, the classification ability of the model is evaluated. One of the most common criteria to evaluate the performance is to estimate the classification accuracy on the testing data. The accuracy is the percentage of the testing data that are correctly classified by the model. If the model has satisfactory classification ability, it will then be applied to classify future (unseen) data.

[Figure 2.1. Classification process: the given data is divided into training data and testing data; a classification model is built from the training data, its knowledge output is evaluated on the testing data, and if the classification ability is satisfactory, the model is applied to future data.]

2.2 Data Preparation for Classification

In order to improve the classification ability and the efficiency of the classification process, some data preprocessing procedures are required. These include data cleaning, feature selection, and data transformation. We will briefly introduce them here.

Data Cleaning

Data cleaning generally refers to removing noise and identifying anomalies in the data. How to deal with missing values in the data is also part of data cleaning. The purpose of data cleaning is to improve the data quality and help reduce confusion in the classification model.

Feature Selection

Some of the attributes of the dataset might be irrelevant to the classification task. Including such attributes may increase the unnecessary computational burden and mislead the classification effort.

It thus helps to identify and remove those redundant attributes in the classification process. One common approach to feature selection is to use some form of statistical test or distance metric to decide stepwise whether an attribute is significant or not. In addition, some stopping condition is needed to determine when to stop the feature selection [Weiss 1991].

Data Transformation

Normalization is one of the most common techniques for data transformation. It scales all values of a given attribute so that they fall within a small specified range, such as from -1.0 to 1.0. In distance-metric-based classification methods, normalization prevents attributes with a large range from outweighing attributes with a smaller range. Besides, concept hierarchies may be incorporated into the classification process to compress the original data. Such compression requires the generalization of the data based on a predetermined concept hierarchy.

2.3 Traditional Classification Techniques

We will briefly review two major categories of traditional classification techniques here. The first category is statistical classification methods, including Bayesian classifiers, the linear discriminant, and the k-nearest neighbor method. The second category is the decision tree method.

2.3.1 Statistical Classification Methods

The classification problem has been widely studied in pattern recognition and statistics, and many methods have been developed in these two areas.

Bayesian Classifiers

Bayesian classifiers are the application of Bayesian analysis to classification problems [Russell 1995], [Berthold 1999], [Madigan 2003]. Assume a record is denoted as t, and all records are assigned to q known class labels, so that we have a class label set C = {C_j} (j = 1, …, q). The basic principle of a Bayesian classifier is that it classifies the record as the class C_j with the greatest posterior probability for this record. The chosen class C_j for a record t therefore satisfies

P(C_j | t) ≥ P(C_k | t)   for all j, k = 1, …, q, k ≠ j   (2.1)

From Bayes' rule, we know that the posterior probability of a record t for class C_j is given as

P(C_j | t) = P(t | C_j) P(C_j) / P(t)   (2.2)

where P(t | C_j) is the conditional probability of the given record for a specific class C_j. Since P(t) does not depend on the class, a simple mathematical manipulation of Bayes' rule shows that an alternative formulation of classifying a record t is to choose the class C_j that satisfies

P(t | C_j) P(C_j) ≥ P(t | C_k) P(C_k)   for all k ≠ j   (2.3)

While we can use the proportion of each class in the given data to represent the prior probability of class C_j, it is fairly difficult to estimate the true population values of the conditional probability of a given record t for a specific class C_j. Many statistical classification methods can be viewed as approximations to the Bayes rule, with varying assumptions made to estimate the conditional probabilities. The assumption may be a characteristic distribution of the population, or a specific format for the decision solution itself [Weiss 1991].
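One such assumption is attribute independence given the class, which yields the well-known naive Bayes classifier: P(t | C_j) factors into a product of per-attribute probabilities estimated from class-wise frequencies. The dissertation does not use this classifier itself; the following minimal Python sketch, with purely illustrative data and names, only makes Equation (2.3) concrete:

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    # Estimate priors P(C) and per-attribute conditionals P(AT_i = v | C)
    # by relative frequency (illustrative; no smoothing).
    n = len(records)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(c, i)][v] += 1
    return priors, cond, Counter(labels)

def classify(t, priors, cond, class_counts):
    # Choose the class maximizing P(t | C) P(C), per Equation (2.3).
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(t):
            score *= cond[(c, i)][v] / class_counts[c]
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage with two nominal attributes and two classes (hypothetical data):
records = [("High", "Yes"), ("Low", "No"), ("High", "No"), ("Low", "Yes")]
labels = ["Respond", "Ignore", "Respond", "Ignore"]
model = train_naive_bayes(records, labels)
print(classify(("High", "Yes"), *model))   # -> "Respond"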

Linear Discriminant

The linear discriminant is one of the most common forms of classifier, and it has a quite simple structure [Hastie 2001]. Assume the dataset has m attributes; the linear discriminant then uses a linear combination of the attributes to separate, or discriminate among, the classes and to assign the class label for a new record. For a dataset with m attributes, this means geometrically that the separating surface between the records will be an (m-1)-dimensional hyperplane. Figure 2.2 represents an idealized plane that separates two classes C_1 and C_2 in 3-dimensional space.

[Figure 2.2. Two-class separation by linear discrimination: an idealized plane in the space of attributes AT_1, AT_2, AT_3 separates the records of classes C_1 and C_2.]

The major advantage of the linear discriminant is its simple classification structure. The general form of any linear discriminant is given as follows:

w_1 AT_1 + w_2 AT_2 + … + w_m AT_m + w_0   (2.4)

where AT_i (i = 1, …, m) are the m attributes of the dataset, and w_i (i = 0, 1, …, m) are the constant parameters to be estimated. The linear discriminant tends to perform well in practice, though it is true that different classes cannot always be separated by a simple linear combination of attributes.

Moreover, more than one plane or line can be used to separate two classes. The major issue in applying linear discriminants is deciding the constant parameters of the discriminant. The most common approach is to decide those parameters under a certain assumption about the data distribution, such as a normal or Gaussian distribution.

k-Nearest Neighbor Method

The k-nearest neighbor method first finds the k nearest neighbors of a new record, and then assigns the record to the class label that appears most frequently among the k neighbors [Mitchell 1997]. k is generally an odd number so that ties do not happen. This method needs to calculate the distance between the new record and every existing record. Distance metrics such as absolute distance, Euclidean distance, and various normalized distances are used in calculating the distance between a new record and an old record. Generally, the distance is computed attribute by attribute and then added together. For absolute distance, the differences between the values of each attribute are added together. For Euclidean distance, the difference between the values of each attribute is squared and summed over all attributes; the square root of the sum is the Euclidean distance. In some cases different attributes may be scaled differently, such as in different units or with different conventions; it is then more appropriate to normalize the distance metric. For example, we can measure the distance in terms of standard deviations from the mean of each attribute. The major computational effort of the k-nearest neighbor method lies in the classification stage: the new record must be compared with every existing record in the dataset. This increases the computational effort, especially for huge datasets. On the other hand, the k-nearest neighbor method does not need an underlying assumption on the data distribution; it is therefore a non-parametric method. These two characteristics differentiate this method from parametric methods such as the linear discriminant method.
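As a concrete illustration, a minimal k-nearest neighbor sketch using Euclidean distance and a majority vote could look as follows; the function names and toy data are illustrative, not from the dissertation:

import math
from collections import Counter

def euclidean(a, b):
    # Square the per-attribute differences, sum them, and take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(new_record, records, labels, k=3):
    # Assign the class label that appears most often among the k nearest records.
    nearest = sorted(zip(records, labels),
                     key=lambda rl: euclidean(new_record, rl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two numeric attributes, two classes (hypothetical data)
records = [(1.0, 1.2), (0.9, 1.0), (3.1, 2.9), (3.0, 3.2)]
labels = ["C1", "C1", "C2", "C2"]
print(knn_classify((1.1, 1.1), records, labels, k=3))   # -> "C1"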

2.3.2 Decision Tree Method

The decision tree method is one of the most widely used classification methods [Mitchell 1997]. A decision tree classifies data based on its top-down tree structure. Starting from the root node, each internal node in the tree specifies a test on a certain attribute of the dataset, and each branch from that node corresponds to one of the possible values of this attribute. A record is classified by first being tested at the root node attribute, and then moving down the tree branch corresponding to the value of that attribute in the given record. A well-known decision tree example for the concept Play Tennis is given in Figure 2.3. Based on conditions such as weather outlook, humidity, and wind, this example uses a decision tree to classify Saturday mornings according to whether or not they are suitable for playing tennis.

[Figure 2.3. Decision tree example for the concept of playing tennis: the root tests Outlook with branches Sunny, Overcast, and Rain; Sunny leads to a Humidity test (High: No, Normal: Yes), Overcast leads directly to Yes, and Rain leads to a Wind test (Strong: No, Weak: Yes).]

Since we will compare our fuzzy classification approach with the decision tree method in the empirical study section, we briefly explain the decision tree algorithm here. The decision tree method uses a statistical property, information gain, to measure how well a given attribute separates the training dataset according to the class label. Most decision tree algorithms use this information gain to select among the candidate attributes at each step while growing the tree. The information gain is based on the entropy concept that is commonly used in information theory.

Assume we have a given dataset T and the class label set C = {C_j} (j = 1, …, q). The entropy of T is defined as

Entropy(T) = -Σ_{j=1}^{q} p_j log2(p_j)

where p_j is the proportion of T belonging to class C_j. The information gain of each candidate attribute is the expected reduction in entropy resulting from partitioning the records according to this attribute. Starting from the root node with all the records, the decision tree algorithm selects the attribute with the largest information gain. The process of selecting a new attribute and partitioning the records is repeated for each node down the tree. The algorithm stops at a certain leaf node only when all the attributes have been included in the path down to that leaf node, or when the records associated with that leaf node all belong to the same class.

Some decision tree methods such as ID3 and C4.5 [Quinlan 1993] perform a simple-to-complex search through the search space. The search starts from an empty tree and then considers trees with more attributes, guided by the information gain heuristic. This process is illustrated in Figure 2.4.

[Figure 2.4. Simple-to-complex growth of a decision tree: starting from a single test node, candidate trees are expanded one attribute test at a time.]
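To make the entropy and information gain computations concrete, here is a small sketch; the weather-style data at the bottom is illustrative, not taken from the dissertation:

import math
from collections import Counter

def entropy(labels):
    # Entropy(T) = -sum_j p_j * log2(p_j), with p_j the proportion of class j.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Expected entropy reduction from partitioning the records by one attribute.
    n = len(labels)
    remainder = 0.0
    for v, size in Counter(values).items():
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (size / n) * entropy(subset)
    return entropy(labels) - remainder

# Illustrative data: how well does Outlook separate the class labels?
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play = ["No", "No", "Yes", "Yes", "No", "Yes"]
print(information_gain(outlook, play))   # about 0.667; higher gain = better split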

The search space of the decision tree method is the set of all possible decision trees. At each node, only the single best attribute is chosen to partition the records, though other attributes may also be consistent with the records. Once the method selects an attribute to test at a particular level of the tree, it does not backtrack to reconsider the alternative attributes at a higher level. Therefore, it loses the capability to examine all possible decision trees that are consistent with the records in the search space.

2.4 Classification Accuracy

The purpose of classification is to build a classifier from given data that predicts correctly on future data. The most commonly used performance measurement is the classification accuracy. For a given finite number of records, the empirical accuracy is defined as the ratio of the number of correct classifications to the number of given records:

Empirical accuracy = (number of correct classifications) / (number of given records)

If the number of given records approaches infinity, then the empirical accuracy statistically becomes the true accuracy of the classifier under the actual population distribution. However, there is only limited given data in a real situation. Consequently, techniques to estimate the true accuracy from the empirical accuracy become very important; they are at least as important as studies of the classification method itself. Figure 2.5 illustrates the relationship between empirical accuracy and true accuracy.

[Figure 2.5. Empirical vs. true accuracy: the empirical accuracy of the classification model is measured on the given data, while the true accuracy refers to its performance on new data.]

In order to honestly estimate the true classification accuracy, the given data should be a random sample. This means that the given data should not be pre-selected or sifted by any specific human-involved criteria. Without a random sample, the empirical accuracy based on the given data will not be a good estimate of the true accuracy. Assuming the given data is random, we can thus obtain the empirical accuracy. Since the given data is always limited compared to the whole population, a natural question arises: how much data do we need in order to be confident that the empirical accuracy is a good estimate of the true accuracy? There has been theoretical study of this specific problem, called probably approximately correct (PAC) analysis [Valiant 1985], [Kearns 1994]. This study gives theoretical bounds on applying the empirical accuracy to future data; however, it indicates that a huge number of records is needed for a guarantee of performance. Most of the time, a large population of data is inaccessible, and the PAC approach can be replaced by practical methods that effectively estimate the true accuracy. In this section, we briefly review some common accuracy evaluation techniques: the holdout method, k-fold cross-validation, and the leave-one-out method. In the empirical study section of this dissertation, we mainly use the k-fold cross-validation and leave-one-out techniques to estimate the true accuracy of the fuzzy classifier.

The Holdout Method

In the holdout method, the given data are divided into a training data set and a testing data set; thus, this method is also known as the train-and-test method. For instance, two thirds of the data are used for training, to derive the classification model, while the rest of the data are used for testing the classifier's accuracy. If the holdout method is repeated k times using different partition schemes, it is referred to as random subsampling, and the accuracy estimate is taken as the average of the accuracies from the iterations. If the size of the given data is small, it is crucial to divide the data randomly into a training and a testing set, and this is often implemented by computer. The accuracy estimate produced by the holdout method is a bit pessimistic, because not all the data are used in designing the classifier. Besides, a single train-and-test partition may bias the estimate.

k-fold Cross Valdaton k-fold cross valdaton belongs to the famly of resamplng methods. For k-fold cross valdaton, the cases are randomly dvded nto k mutually exclusve test parttons of approxmately equal sze. When we pck one test partton, the rest of the parttons are used for tranng of the classfer. The average accuracy rate over all k parttons s the cross valdaton accuracy rate. Ths method was extensvely tested wth varyng numbers of parttons and 0-fold cross valdaton seemed to be most approprate [Wess 990]. k-fold cross-valdaton can be extended to stratfed cross-valdaton, n whch each partton s stratfed so that the class dstrbuton of the data n each partton s approxmately the same as that n the orgnal data. Leavng-one-out Method Leavng-one-out method can be regarded as a specal case of k-fold cross valdaton, where k s set to be the sze of the dataset. In each teraton, ths method uses one record as the testng data, and the rest wll all be used for tranng of the classfer. The average accuracy rate over all records s used as the estmate of the accuracy. The maor advantage of the leavng-one-out technque s that t ntroduces less bas compared to other estmaton technques. However, the computatonal cost for applyng leavng-oneout s hgh, especally for a large dataset. 7

Chapter 3

Association Rule Mining

Association rule mining is an active area in data mining. This chapter provides a brief introduction to the association rule mining problem. We provide the definition of association rule mining and its classification scheme in Section 3.1. One of the most important association rule mining algorithms, Apriori, is explained in Section 3.2. Section 3.3 presents the quantitative association rule mining problem and its limitations.

3.1 Definition of Association Rule Mining

Association rule mining is a major research area of data mining. Its objective is to find certain association relationships among a set of data items in a database. The association relationships are described by association rules. Association rule mining was originally motivated by market basket analysis, which studies the buying habits of customers [Agrawal 1993a]. It provides the answer to the following question: which groups of items are usually associated, so that customers are likely to purchase them together when they shop at the supermarket? The analysis can be performed on the retail data of customer transactions at the store. The result can be used for store layout, cross-marketing advertisement, and direct mailing applications. Assume customers who buy a computer also tend to buy financial management software. In one layout strategy, those two items can be placed in close proximity in order to further promote the sale of such items together. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way [Han 2001]. Association rule mining has now been extended to other areas like geographical databases, network security, medical diagnosis, etc.

The formal definition of an association rule is given in [Agrawal 1994]: Let I = {i_1, i_2, …, i_m} be a set of literals, called items. Let D be the set of all transactions, where each transaction t is a set of items such that t ⊆ I. Each transaction is associated with a unique identifier, called a TID. Let A be a set of items in I. A transaction t is said to contain A if and only if A ⊆ t. An association rule is of the form A → B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. Within the rule A → B, A is called the antecedent while B is called the consequent of the rule.

Support and confidence are the two key measures of rule interestingness in association rule mining. A set of items X in I is referred to as an itemset, and X is called a k-itemset if it contains k items. Let t_X denote the set of transactions that contain the itemset X, and |D| the total number of transactions. The support Sup is the percentage of transactions in D that contain both A and B:

Sup(A → B) = |t_A ∩ t_B| / |D|   (3.1)

where |t_A ∩ t_B| is the number of transactions that contain both A and B. The confidence Conf is the percentage of transactions in D containing A that also contain B:

Conf(A → B) = Sup(A → B) / Sup(A) = |t_A ∩ t_B| / |t_A|   (3.2)

From these definitions, we can infer that the values of support and confidence range from 0 to 1. Given the example in Table 3.1, let us examine the support and confidence of a potential association rule: Bread → Butter.

Table 3.1. An example of a transaction database.

TID   Items
1     Bread, Salsa
2     Beer, Bread, Eggs, Jam
3     Bread, Butter
4     Bread, Butter, Lettuce, Salsa
5     Beer, Butter

Sup(Bread → Butter) = |t_Bread ∩ t_Butter| / |D| = 2/5 = 40%

Conf(Bread → Butter) = Sup(Bread → Butter) / Sup(Bread) = 2/4 = 50%

Therefore, the rule Bread → Butter has a support of 40% and a confidence of 50%.
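These two measures map directly onto set operations over the transaction list. A small Python sketch reproducing the Bread → Butter numbers from Table 3.1 (the function names are illustrative):

def support(antecedent, consequent, transactions):
    # Sup(A -> B) = |t_A ∩ t_B| / |D|, Equation (3.1).
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Conf(A -> B) = |t_A ∩ t_B| / |t_A|, Equation (3.2).
    containing_a = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / containing_a

# The five transactions of Table 3.1 (<= is the subset test for Python sets)
transactions = [
    {"Bread", "Salsa"},
    {"Beer", "Bread", "Eggs", "Jam"},
    {"Bread", "Butter"},
    {"Bread", "Butter", "Lettuce", "Salsa"},
    {"Beer", "Butter"},
]
print(support({"Bread"}, {"Butter"}, transactions))      # 0.4, i.e. 40%
print(confidence({"Bread"}, {"Butter"}, transactions))   # 0.5, i.e. 50%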

Given user-specified minimum support and minimum confidence thresholds, association rule mining finds the rules whose support and confidence are larger than the respective thresholds. Generally, association rule mining can be described as a two-step process. First, find all itemsets whose support is above the predetermined minimum support; these itemsets are called the frequent itemsets or the large itemsets. Second, generate interesting association rules from the frequent itemsets.

Market basket analysis is just one form of association rule mining; there are different kinds of association rules besides this. Based on the types of values handled in the rule, a rule can be classified as either a Boolean association rule or a quantitative association rule. Boolean association rules only concern the associations between the presence or absence of items; market basket analysis belongs to Boolean association rule mining. If either the antecedent or the consequent of the rule contains quantitative attributes, then it is a quantitative association rule. For instance, age(X, 30…39) AND income(X, 40K…50K) → NumberOfCars(X, 3) is a quantitative association rule. If the items or attributes in an association rule involve only one dimension, it is a single-dimensional rule, while a multi-dimensional association rule involves two or more dimensions; the above quantitative association rule is a multi-dimensional association rule. Based on the levels of abstraction involved in the rule set, a rule can be classified as a single-level rule or a multi-level association rule. If the rule finds associations at different levels of abstraction, then it is a multi-level association rule.

For instance, we may have Apple and Banana as items in the transactions, and Fruit as a category name for those items at a higher abstraction level. Hence, the rule Fruit → Juice is a multi-level association rule.

3.2 Apriori Algorithm

As mentioned previously, the process of association rule mining is divided into two phases. In the first phase, candidate itemsets are generated and counted by scanning the transaction data. If the number of times an itemset appears in the transactions is larger than the minimum support, the itemset is considered a frequent itemset. Itemsets containing only a single item are processed first. Large itemsets containing only single items are then combined to form candidate itemsets containing two items. This process is repeated until all large itemsets have been found. In the second phase, all possible association combinations for each large itemset are formed, and those with a calculated confidence value larger than the minimum confidence are output as association rules. In this section, we introduce the Apriori algorithm, the important algorithm used in binary association rule mining to find the frequent itemsets.

3.2.1 Downward Closure Property

The Apriori algorithm is based on a very important property: the downward closure property [Brin 1997]. The downward closure property states that if an itemset is frequent (its support value is above the minimum support), then all subsets of the itemset must be frequent as well. Assume an itemset X and an operation OP(X) on an itemset. If X_1, …, X_m are m subsets of the itemset X, the downward closure property holds as follows:

OP(X) ≤ min(OP(X_1), …, OP(X_m))   (3.3)

Notice the property holds for any conjunctive operation like minimum or multiplication.

3.2.2 Iterative Procedure

Apriori is an iterative, level-wise algorithm. It can be divided into two major steps: join and prune. In the join step, it first generates frequent itemsets with only one item (called frequent 1-itemsets), then frequent 2-itemsets, and so on. In the prune step, the downward closure property is applied to remove the candidate itemsets whose support value is below the minimum support. The following notation will be used hereafter:

k-itemset: an itemset having k items
L_k: the set of frequent k-itemsets (with minimum support)
CAND_k: the set of candidate k-itemsets (CAND_k is a superset of L_k)

Join Step

In this step, k-itemsets are used to form (k+1)-itemsets. For example, after one full scan of the dataset, the algorithm first finds the set of frequent 1-itemsets L_1. By joining L_1 with itself, L_2 can be found, and so forth. Denote l_i as the i-th itemset in L_{k-1} and l_i[j] as the j-th item in l_i. Two itemsets l_1 and l_2 in L_{k-1} are joinable only if their first (k-2) items are in common, that is:

(l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ … ∧ (l_1[k-2] = l_2[k-2]) ∧ (l_1[k-1] < l_2[k-1])

where ∧ represents a logical AND. The last condition ensures that no duplicate itemset is generated. The resulting candidate itemset will be l_1[1] l_1[2] … l_1[k-1] l_2[k-1].

Prune Step

By joining L_{k-1} to itself, we get the candidate set CAND_k. Finding the frequency of the itemsets in CAND_k involves heavy computation because CAND_k can become huge. By using the closure property, if any (k-1)-subset of an itemset in CAND_k is not in L_{k-1}, then that candidate itemset cannot be frequent either and therefore can be removed from CAND_k. This subset pruning can be implemented by maintaining a hash tree of all frequent itemsets. The Apriori algorithm terminates when there are no frequent k-itemsets [Han 2001].

k temsets [Han, 200]. Fgure 3. shows the pseudo-code for the Apror algorthm and ts related procedures. L ={frequent -temsets}; for (k=2; L k ; k++) do begn CAND k _ k apror gen( L,mn_ sup); //generate canddate temsets CANDk for all transacton t D do begn //scan D for counts end L k CANDt subset( CANDk, t) ; //get the subsets of t that are canddates for all canddates { CAND CAND CAND.count ; k CAND CANDt CAND. count mn_ sup} end return L k Lk procedure apror_gen( L k,mn_ sup) for each temset l L k for each temset l L f end end return CAND k do begn do begn ( l [] l []) ( l [ k 2] l [ k 2]) ( l [ k ] l [ k ]) CAND = l k l //on step, generate canddates then f nfrequent_subsets( CAND, L k ) then delete CAND ; //use closure property to prune else add CAND to CAND ; k Fgure 3.. Apror algorthm [Han 200]. 23

3.2.3 Hash Tree Implementation

The Apriori algorithm is implemented in the form of a hash tree in [Agrawal 1994]. The hash tree contains two types of nodes: leaf nodes and interior nodes. The itemsets are stored in the leaf nodes. In an interior node, each bucket of the hash table points to another node. The root of the hash tree is defined to be at depth 1; an interior node at depth d points to nodes at depth d+1. When a new itemset is inserted into the hash tree, the algorithm traverses from the root level to the leaf level. At depth d, a hash function is applied to the d-th item of the itemset. Once the number of itemsets stored in a leaf node reaches the specified limit, the leaf node is converted to an interior node. If the traversal is at a leaf node, the frequency of the itemsets in all the transactions is counted and recorded. If it is at an interior node and item i is hashed, then every item that comes after i will be hashed, and this procedure is applied recursively to the node in the corresponding bucket. Suppose there is a simple transaction database as shown in Figure 3.2. The figure has four transactions, and each transaction contains a series of items among A, B, C, D, E. By scanning the database, we form the 1-item candidate set ({A},{B},{C},{D},{E}) and store the itemsets in a 1-item hash tree based on the given hash function. Assume the maximum number of itemsets that one leaf node can hold is two; then obviously one leaf node cannot hold all the 1-item candidate sets. Further, assume each internal node has three hash buckets. According to the hash function, if the d-th element of the itemset is A or B, the itemset goes to the left branch; if the d-th element is C or D, the itemset goes to the middle branch; similarly, E goes to the right branch. The value in parentheses is the frequency count of each item. Each itemset stored in the hash tree is compared with the transaction being scanned; if the transaction contains the itemset, then the count value is increased by 1. This repeats for each transaction (record) in the dataset. In Figure 3.2, we assume the minimum support value is two. By applying the closure property, we can remove item {E} from the 1-item candidate set because its frequency count is one. Thus, we get the 1-item frequent set ({A},{B},{C},{D}). By

joining the 1-item frequent set with itself, we can get the 2-item candidate set: ({A,B},{A,C},{A,D},{B,C},{B,D},{C,D}). By examining the 2-itemset hash tree and the frequency count value of each itemset, we know that only {A,B},{B,C},{B,D} remain in the 2-item frequent set.

[Figure 3.2. Hash tree illustration diagram. The example database contains four transactions: TID 1: {A,B}; TID 2: {A,B,C}; TID 3: {A,B,D,E}; TID 4: {B,C,D}; the minimum support is 2. Hash function: if the d-th element is A or B, take the left branch; if C or D, the middle branch; if E, the right branch. Each leaf node holds at most two itemsets and each internal node has three hash buckets. The 1-itemset hash tree holds A(3), B(4), C(2), D(2), E(1), giving the 1-item frequent set {A},{B},{C},{D}; the 2-itemset hash tree holds AB(3), AC(1), AD(1), BC(2), BD(2), CD(1).]
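Running the Python sketch given at the end of Section 3.2.2 on the four transactions of Figure 3.2, with a minimum support of 2, reproduces the figure's frequent itemsets:

transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "D", "E"}, {"B", "C", "D"}]
for itemset in sorted(apriori(transactions, min_support=2), key=len):
    print(set(itemset))
# 1-item: {A}, {B}, {C}, {D}   ({E} is removed: its count is 1)
# 2-item: {A, B}, {B, C}, {B, D}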

3.3 Quantitative Association Rule Mining

There has been much research on Boolean association rule mining. For instance, besides Apriori, the AprioriTID and AprioriHybrid algorithms were also proposed for mining Boolean association rules [Agrawal 1994]. However, real-world transaction data usually consist of quantitative data, and how to extract useful rules from quantitative data presents a challenge in this research field. The problem of mining quantitative association rules was introduced by Srikant and Agrawal [Srikant 1996]. They proposed mining quantitative association rules by partitioning the attribute domain and combining the adjacent partitions to transform the problem into a binary association rule mining problem. Several other approaches based on the idea of mining association rules from interval clusters were also proposed [Miller 1997], [Zhang 1997]. This category of methods is often referred to in the literature as the discrete interval method. The discrete interval method divides the attribute domain into discrete intervals and measures their importance based on the frequency of the interval items. For illustration, a database table with two quantitative attributes and one categorical attribute is shown in Table 3.2. It shows the relationship between people's age, marital status, and the number of cars they buy.

Table 3.2. People table and example of quantitative association rules.

RecordID   Age   Married   NumCars
100        23    No        1
200        25    Yes       1
300        29    No        0
400        34    Yes       2
500        38    Yes       2

Rules (Sample)                                     Support   Confidence
<Age: 30…39> and <Married: Yes> → <NumCars: 2>     40%       100%
<NumCars: 0…1> → <Married: No>                     40%       66.6%
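The transformation performed by the discrete interval method can be sketched as a simple preprocessing step: each quantitative value is replaced by a Boolean item naming its interval, after which a Boolean miner such as the Apriori sketch above applies directly. The interval boundaries and names below are illustrative assumptions, not the partitioning used in [Srikant 1996]:

def to_interval_item(attribute, value, boundaries):
    # Map a quantitative value to a Boolean item such as "Age:30..39".
    for lo, hi in boundaries:
        if lo <= value <= hi:
            return "{}:{}..{}".format(attribute, lo, hi)
    raise ValueError("value outside all intervals for " + attribute)

# Illustrative partitioning of the Age domain from Table 3.2
age_intervals = [(20, 29), (30, 39)]
people = [(100, 23), (200, 25), (300, 29), (400, 34), (500, 38)]
transactions = [{to_interval_item("Age", age, age_intervals)} for _, age in people]
print(transactions)
# [{'Age:20..29'}, {'Age:20..29'}, {'Age:20..29'}, {'Age:30..39'}, {'Age:30..39'}]

The sharp interval boundaries in this mapping are exactly what the fuzzy association rule mining framework of Chapter 5 replaces with overlapping fuzzy sets.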