A Heuristic for Mining Association Rules In Polynomial Time*


Complete reference information: Yilmaz, E., E. Triantaphyllou, J. Chen, and T.W. Liao, (2003), "A Heuristic for Mining Association Rules In Polynomial Time," Computer and Mathematical Modelling, No. 37.

E. YILMAZ
General Electric Card Services, Inc.
A unit of General Electric Capital Corporation
6 Summer Street, MS -39C, Stamford, CT, 697, U.S.A.
egemen.yilmaz@gecapital.com

E. TRIANTAPHYLLOU
Department of Industrial and Manufacturing Systems Engineering
Louisiana State University, 38 CEBA Building, Baton Rouge, LA, 783, U.S.A.
Email: trianta@lsu.edu

J. CHEN
Department of Computer Science
Louisiana State University, 98 Coates Hall, Baton Rouge, LA, 783, U.S.A.

T. W. LIAO
Department of Industrial and Manufacturing Systems Engineering
Louisiana State University, 38 CEBA Building, Baton Rouge, LA, 783, U.S.A.

(Last Revision: April)

Abstract: Mining association rules from databases has attracted great interest because of its potentially very practical applications. Given a database, the problem of interest is how to mine association rules (which could describe patterns of consumers' behaviors) in an efficient and effective way. The databases involved in today's business environment can be very large. Thus, fast and effective algorithms are needed to mine association rules out of large databases. Previous approaches may cause an exponential computing resource consumption. A combinatorial explosion occurs because existing approaches exhaustively mine all the rules. The proposed algorithm takes a previously developed approach, called the Randomized Algorithm (or RA), and adapts it to mine association rules out of a database in an efficient way. The original RA approach was primarily developed for inferring logical clauses (i.e., a Boolean function) from examples. Numerous computational results suggest that the new approach is very promising.

Key words: Data mining, association rules, algorithm analysis, the One Clause At a Time (OCAT) approach, randomized algorithms, heuristics, Boolean functions.

*: Corresponding Author: Dr. Evangelos Triantaphyllou

1. INTRODUCTION

Mining of association rules from databases has attracted great interest because of its potentially very useful applications. Association rules are derived from a type of analysis that extracts information from coincidence []. Sometimes called market basket analysis, this methodology allows a data analyst to discover correlations, or co-occurrences of transactional events. In the classic example, consider the items contained in a customer's shopping cart on any one trip to the grocery store. Chances are that the customer's own shopping patterns tend to be internally consistent, and that he/she tends to buy certain items on certain days, for example milk on Mondays and beer on Fridays. There might be many examples of pairs of items that are likely to be purchased together. For instance, one might always buy champagne and strawberries together on Saturdays, although one only rarely purchases either of these items separately. This is the kind of information the store manager could use to make decisions about where to place items in the store so as to increase sales. This information can be expressed in the form of association rules. From the example given above, the manager might decide to place a special champagne display near the strawberries in the fruit section on the weekends in the hope of increasing sales.

Purchase records can be captured by using the bar codes on the products. The technology to read them has enabled businesses to efficiently collect vast amounts of data, commonly known as market basket data []. Typically, a purchase record contains the items bought in a single transaction, and a database may contain many such transactions. Analyzing such databases by extracting association rules may offer some unique opportunities for businesses to increase their sales, since such association rules can be used in designing effective marketing strategies. The sizes of the databases involved can be very large. Thus, fast and effective algorithms are needed to mine association rules out of them.
For a more formal definition of association rules, some notation and definitions are introduced as follows. Let $I = \{A_1, A_2, A_3, \ldots, A_n\}$ be the set with the names of the items (also called attributes, hence the notation $A_i$) among which association rules will be searched [-3]. This set is often called the item domain. Then, a transaction is a set of one or more items obtained from the set $I$. This means that for each transaction $T$, the relation $T \subseteq I$ holds. Let $D$ be the set of all transactions. Also, let $X$ be defined as a set of some of the items in $I$. The set $X$ is contained in a transaction $T$ if the relation $X \subseteq T$ holds. Using these definitions, an association rule is a relationship of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The set $X$ is the antecedent part, while the set $Y$ is the consequent part of the rule.

Such an association rule holds with some confidence level denoted as CL. The confidence level is the conditional probability (as it can be inferred from the available transactions in the target database) of having the consequent part $Y$ given that we already have the antecedent part $X$. Moreover, an association rule has support $S$, where $S$ is the number of transactions in $D$ that contain $X$ and $Y$ together. A frequent item set is a set of items that occur frequently in the database. That is, their support is above a predetermined minimum support level. A candidate item set is a set of items, possibly frequent, that has not yet been checked against the minimum support criterion. The association rule analysis in our approach will be restricted to those association rules which have only one item in the consequent part of the rule. However, a generalization can be made easily.

Example 1: Consider the following illustrative database:

[database matrix $D$ missing from the source]

This database is defined on five items, so $I = \{A_1, A_2, A_3, A_4, A_5\}$. Each row represents a transaction. For instance, the second row represents a transaction in which only two items, one of them $A_3$, were bought. Consider next a rule with two items in the antecedent part and $A_5$ as the consequent part. The support of this rule is equal to 3, because its three items occur simultaneously in 3 transactions (i.e., the fifth, eighth, and ninth transactions). The confidence level of this rule is 100%, because the number of transactions in which the two antecedent items appear together is equal to the number of transactions in which all three items appear (both are equal to three).

Given the previous definitions, the problem of interest is how to mine association rules out of a database $D$ that meet some pre-established minimum support and confidence level requirements. Mining of association rules was first introduced by Agrawal, Imielinski and Swami in []. Their algorithm is called AIS (which stands for Agrawal, Imielinski, and Swami). Another study used a different approach to solve the problem of mining association rules [5]. That study presented a new algorithm called SETM (for Set Oriented Mining). The new algorithm was proposed to mine association rules by using relational operations in a relational database environment. This was motivated by the desire to use the SQL system to compute frequent item sets.

The next study [] received a lot more recognition than the previous ones. Three new algorithms were presented: the Apriori, the AprioriTid, and the AprioriHybrid. The Apriori and AprioriTid approaches are fundamentally different from the AIS and the SETM algorithms. As the name AprioriHybrid suggests, this approach is a hybrid between the Apriori and the AprioriTid algorithms.

Another major study in the field of mining of association rules is described in [6]. These authors presented an algorithm called Partition. Their approach reduces the search by first computing all frequent item sets in two passes over the database. Another major study on association rules takes a sampling approach [7]. These algorithms make only one full pass over the database. The main idea is to select a random sample, and use it to determine representative association rules that are very likely to also occur in the whole database. These association rules are in turn validated in the entire database.

This paper is organized as follows. The next section presents a formal description of the research problem under consideration. The third section starts with a brief description of the OCAT (One Clause At a Time) approach that played a critical role in the development of the new approach. The new approach is described in the second half of the third section. The fourth section presents an extensive computational study that compared the proposed approach for the mining of association rules with some existing ones. Finally, the paper ends with a conclusions section.

2. PROBLEM DESCRIPTION

Previous work on mining of association rules focused on extracting all conjunctive rules, provided that these rules meet the criteria set by the user. Such criteria can be the minimum support and confidence levels.
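The support and confidence definitions from the Introduction can be made concrete with a small sketch. The transaction data and the function names `support` and `confidence` below are hypothetical and chosen only for illustration; they are not the database of Example 1.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# Hypothetical market-basket data (an assumption, not the paper's database).
TRANSACTIONS: List[Transaction] = [
    frozenset({"milk", "bread"}),
    frozenset({"champagne", "strawberries"}),
    frozenset({"milk", "champagne", "strawberries"}),
    frozenset({"bread", "strawberries"}),
    frozenset({"champagne", "strawberries", "bread"}),
]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimated conditional probability of the consequent given the antecedent."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    covered = support(x, transactions)
    if covered == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return support(x | y, transactions) / covered


rule_support = support({"champagne", "strawberries"}, TRANSACTIONS)
rule_confidence = confidence({"champagne"}, {"strawberries"}, TRANSACTIONS)
```

In this toy data every basket with champagne also holds strawberries, so the rule champagne → strawberries has full confidence, while the reverse rule does not.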
Although previous algorithms mainly considered databases from the domain of market basket analysis, they have been applied to the fields of telecommunication data analysis, census data analysis, and to classification and predictive modeling tasks in general [3]. These applications differ from market basket analysis in the sense that they contain dense data. That is, such data might possess all or some of the following properties:
(i) Have many frequently occurring items;
(ii) Have strong correlations between several items;
(iii) Have many items in each record.

When standard association rule mining techniques are used (such as the Apriori approach [] and its variants), they may cause exponential resource consumption in the worst case. Thus, it may take too much CPU time for these algorithms to mine the association rules. The combinatorial explosion is a natural result of these algorithms, because they mine exhaustively all the rules that satisfy the minimum support constraint as specified by the analyst. Furthermore, this characteristic may lead to the generation of an excessive number of rules. Then, the end user will have to determine which rules are worthwhile. Therefore, the higher the number of derived association rules, the more difficult it is to review them. In addition, if the target database contains dense data, then the previous situation may become even worse.

The size of the database also plays a vital role in data mining algorithms [7]. Large databases are desired for obtaining accurate results, but unfortunately, the efficiency of the algorithms depends heavily on the size of the database. The core of today's algorithms is the Apriori algorithm [], and it is the algorithm against which the proposed approach is compared in this paper. Therefore, it is highly desirable to develop an algorithm that has polynomial complexity and is still capable of finding a few rules of good quality.

3. METHODOLOGY

3.1 The One Clause At a Time (OCAT) Approach

The proposed approach is based on a heuristic, called the Randomized Algorithm (or RA), that was developed in [8]. This heuristic infers logical clauses (Boolean functions) from two mutually exclusive collections of binary examples. The main ideas of this heuristic are briefly described next.

Let $\{A_1, A_2, \ldots, A_n\}$ be a set of $n$ binary attributes. Also, let $F$ be a Boolean function over these binary attributes. That is, $F$ is a mapping from $\{0, 1\}^n$ to $\{0, 1\}$. The input of the RA heuristic is two sets of mutually exclusive training examples. Each example is a vector of size $n$ defined in the space $\{0, 1\}^n$. The training examples somehow have been classified as either positive or negative. Then, the Boolean function to be inferred should evaluate to true (1) when it is fed with a positive example and to false (0) when it is fed with a negative example. Hopefully, this function is an accurate estimation of the hidden logic that has classified the training examples.
Another goal is for the inferred Boolean function (when it is expressed in conjunctive normal form (CNF) or disjunctive normal form (DNF)) to have a very small, ideally minimum, number of disjunctions or conjunctions (also known as terms in the literature). A Boolean function is in CNF if it is of the form:

$$\bigwedge_{j=1}^{k} \Big( \bigvee_{i \in \rho_j} a_i \Big)$$

Similarly, a Boolean function is in DNF if it is of the form:

$$\bigvee_{j=1}^{k} \Big( \bigwedge_{i \in \rho_j} a_i \Big)$$

where $a_i$ is either a binary attribute $A_i$ or its negation $\bar{A}_i$, and the variable $\rho_j$ is the set of the indices of the attributes in the $j$-th conjunction or disjunction. As it is shown in [9], any Boolean function can be transformed into the CNF or DNF form. Also, in [] a simple transformation scheme is presented for inferring CNF functions with algorithms that initially infer DNF functions and vice-versa.

In order to help fix ideas of how the RA algorithm operates, consider the following positive and negative example sets, denoted as $E^+$ and $E^-$, respectively.

[example sets $E^+$ and $E^-$ missing from the source]

Now consider the following Boolean expression (in CNF):

[three-clause CNF expression, garbled in the source]

It can be easily verified that this Boolean expression correctly classifies the previous training examples. In [-] the authors present a strategy called the One Clause At a Time (OCAT) approach (see also Figure 1) for inferring a Boolean function from two classes of binary examples.

    i = 0; C = ∅; {initializations}
    DO WHILE (E^- ≠ ∅)
        Step 1: i ← i + 1; {where i indicates the i-th clause}
        Step 2: Find a clause c_i which accepts all members of E^+ while it rejects as many members of E^- as possible;
        Step 3: Let E^-(c_i) be the set of members of E^- which are rejected by c_i;
        Step 4: Let C ← C ∧ c_i;
        Step 5: Let E^- ← E^- − E^-(c_i);
    REPEAT;

Figure 1: The One Clause At a Time (OCAT) Approach, for the CNF Case [].

As it is indicated in Figure 1, the OCAT approach attempts to minimize the number of CNF clauses that will finally form the target function $F$. A key task in the OCAT approach is Step 2 (in Figure 1). At Step 2 a single clause is constructed. In [] a branch-and-bound approach is developed that infers a clause (for the CNF case) that accepts all the positive examples while it rejects as many negative examples as possible. Later, in [8] the RA heuristic is proposed that returns a clause that now rejects many (as opposed to as many as possible) negative examples (and still accepts all the positive examples).

Next are some definitions that are used in these approaches and are going to be used in the new approach as well.

    C          the set of attributes in the current clause (a disjunction for the CNF case);
    a_k        an attribute such that a_k ∈ A, where A is the set of the attributes A_1, A_2, ..., A_n and their negations;
    POS(a_k)   the number of all positive examples in E^+ which would be accepted if attribute a_k were included in the current CNF clause;
    NEG(a_k)   the number of all negative examples in E^- which would be accepted if attribute a_k were included in the current clause;
    l          the size of the candidate list;
    ITRS       the number of times the clause forming procedure is repeated.

The RA algorithm is described next in Figure 2. Its time complexity is of order $O(D^2 n \log n)$ (where $D$ is the number of transactions in the database and $n$ is the number of items or attributes). This follows by observing that the innermost loop requires $n \log n$ operations when a quick sort approach is applied for sorting the POS/NEG values. The innermost loop is repeated $O(|E^+|)$ times, which is of the same order as $O(D)$. Similarly, the outer loop is also of order $O(D)$. Thus, the time complexity of the RA algorithm is of order $O(D^2 n \log n)$.

For illustrative purposes, this algorithm is applied on the two sets of binary vectors given earlier in this section. When the previous definitions are used, the following can be easily derived. The set of the attributes (items) for these positive and negative examples is:

$A = \{A_1, A_2, A_3, A_4, \bar{A}_1, \bar{A}_2, \bar{A}_3, \bar{A}_4\}$.

Therefore, the POS($a_k$) and NEG($a_k$) values are as follows (only the entries that survive in the source are shown):

    a_k      POS(a_k)   NEG(a_k)
    A_1
    A_2
    A_3                 3
    A_4
    Ā_1
    Ā_2
    Ā_3      3          3
    Ā_4
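The POS and NEG definitions above, and the check that a CNF expression classifies the training examples, can be sketched as follows. The example sets `E_POS` and `E_NEG` and the expression `CNF` are assumptions of this sketch: the paper's own example is not recoverable from the text, so the data were chosen only so that the counts for $A_3$ and its negation agree with the values that survive above. All function names are hypothetical.

```python
from typing import List, Tuple

Example = Tuple[int, ...]     # one binary vector, e.g. (0, 1, 0, 0)
Literal = Tuple[int, bool]    # (zero-based attribute index, True if negated)
Clause = List[Literal]        # a disjunction of literals

# Assumed example sets (not taken from the source).
E_POS: List[Example] = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG: List[Example] = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
                        (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]

# Assumed CNF: (A2 or A4) and (not A2 or not A3) and (A1 or A3 or not A4).
CNF: List[Clause] = [
    [(1, False), (3, False)],
    [(1, True), (2, True)],
    [(0, False), (2, False), (3, True)],
]


def literal_true(lit: Literal, x: Example) -> bool:
    """A literal holds on x if the attribute is 1 (or 0 when negated)."""
    index, negated = lit
    return x[index] == (0 if negated else 1)


def clause_accepts(clause: Clause, x: Example) -> bool:
    """A disjunction accepts x if at least one of its literals holds."""
    return any(literal_true(lit, x) for lit in clause)


def cnf_accepts(cnf: List[Clause], x: Example) -> bool:
    """A CNF expression accepts x if every clause accepts x."""
    return all(clause_accepts(c, x) for c in cnf)


def pos_neg(lit: Literal, e_pos: List[Example],
            e_neg: List[Example]) -> Tuple[int, int]:
    """POS and NEG counts for one literal, before any clause has been started."""
    pos = sum(literal_true(lit, x) for x in e_pos)
    neg = sum(literal_true(lit, x) for x in e_neg)
    return pos, neg
```

With these assumed sets, `pos_neg((2, True), E_POS, E_NEG)` gives the counts for the negation of $A_3$, and `cnf_accepts` confirms that the assumed expression separates the two sets.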

    DO for ITRS number of iterations
    BEGIN
        DO WHILE (E^- ≠ ∅)
            C = ∅; {initializations}
            DO WHILE (E^+ ≠ ∅)
                Step 1: Rank in descending order all attributes a_i ∈ A (where a_i is either A_i or Ā_i) according to their POS(a_i)/NEG(a_i) value. If NEG(a_i) = 0, then POS(a_i)/NEG(a_i) is set to an arbitrarily high value;
                Step 2: Form a candidate list of the attributes which have the l top highest POS(a_i)/NEG(a_i) values;
                Step 3: Randomly choose an attribute a_k from the candidate list;
                Step 4: Let the set of atoms in the current clause be C ← C ∨ a_k;
                Step 5: Let E^+(a_k) be the set of members of E^+ accepted when a_k is included in the current CNF clause;
                Step 6: Let E^+ ← E^+ − E^+(a_k);
                Step 7: Let A ← A − {a_k};
                Step 8: Calculate the new POS(a_k) values for all a_k ∈ A;
            REPEAT
            Step 9: Let E^-(C) be the set of members of E^- which are rejected by C;
            Step 10: Let E^- ← E^- − E^-(C);
            Step 11: Reset E^+ to the original value;
        REPEAT
    END
    CHOOSE the final Boolean system among the previous ITRS systems that has the smallest number of clauses.

Figure 2: The RA Heuristic for the CNF Case [8].

By examining the previous definitions, some key observations can be made at this point. When an attribute with a high POS function value is chosen to be included in the CNF clause currently being formed, it is very likely that this will cause some additional positive examples to be accepted. The reverse is true for atoms with a small NEG function value in terms of the negative examples. Therefore, attributes that have high POS function values and low NEG function values are a good choice for inclusion in the current CNF clause. This key observation leads to the following alternatives for defining an evaluative criterion for Step 1 in Figure 2 for including a new atom in the CNF clause under consideration: POS/NEG, or POS-NEG, or some type of a weighted version of these two expressions. In [8] it was shown through some empirical experiments that the POS/NEG ratio is an effective evaluative criterion, since it is very likely to lead to Boolean functions with few clauses.

In terms of the previous illustrative data, the POS/NEG ratios are as follows:

[POS/NEG ratio values for the eight attributes missing from the source]

Next suppose that $l$ in Step 2, Figure 2, was chosen to be equal to 3. Then the candidate list consists of the three attributes with the highest POS/NEG values. Let one of them be randomly selected out of this candidate list. The selected atom accepts (please note that the current CNF clause is now nil) the first and the second examples in the $E^+$ set. This means that more attributes are required in the current CNF clause being built for all positive examples to be accepted. Next, suppose (after the POS/NEG ratios have been recalculated) that a second attribute is included in the clause, and that the two attributes together accept all the positive examples in the $E^+$ set. Therefore, the first CNF clause of the Boolean expression is ready.

Next one can observe which negative examples are not rejected by this clause: this clause fails to reject the second, third and the sixth examples in the $E^-$ set. Therefore, the updated $E^-$ set should contain the second, third and the sixth examples from the original $E^-$ set. This process is repeated until the $E^-$ set is empty (Figure 2), meaning that all the negative examples are rejected. Recall that RA is a randomized algorithm (it repeats the function generation task ITRS times) and thus it does not return a deterministic solution. One Boolean expression that it could return, accepting all the positive examples and rejecting all the negative examples, is the CNF expression presented earlier in this section.
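A minimal sketch of the randomized clause-forming loop of Figure 2 is given below, under stated assumptions. The example sets are the same assumed data as before (the paper's own sets are not recoverable), `HIGH` is an assumed stand-in for the "arbitrarily high" ratio, and all function names are hypothetical. The sketch also adds one guard that is not part of Figure 2: a literal is skipped if it accepts no remaining positive example or if it would leave no negative example rejected, which guarantees termination when the two example sets are disjoint.

```python
import random
from typing import List, Tuple

Example = Tuple[int, ...]
Literal = Tuple[int, bool]    # (zero-based attribute index, True if negated)
Clause = List[Literal]

HIGH = 1000.0  # assumed value used for the ratio when NEG = 0

# Assumed example sets (not taken from the source).
E_POS: List[Example] = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG: List[Example] = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
                        (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]


def holds(lit: Literal, x: Example) -> bool:
    index, negated = lit
    return x[index] == (0 if negated else 1)


def build_clause(e_pos, e_neg, n, list_size, rng) -> Clause:
    """Form one disjunction that accepts every positive example."""
    literals = [(i, neg) for i in range(n) for neg in (False, True)]
    uncovered = list(e_pos)   # positives not yet accepted by the clause
    rejected = list(e_neg)    # negatives still rejected by the clause
    clause: Clause = []
    while uncovered:
        scored = []
        for lit in literals:
            pos = sum(holds(lit, x) for x in uncovered)
            kept = sum(not holds(lit, x) for x in rejected)
            if pos == 0 or kept == 0:
                continue      # guard added in this sketch, not in Figure 2
            neg = sum(holds(lit, x) for x in e_neg)
            scored.append((HIGH if neg == 0 else pos / neg, lit))
        if not scored:
            raise ValueError("positive and negative example sets overlap")
        scored.sort(key=lambda s: s[0], reverse=True)
        chosen = rng.choice(scored[:list_size])[1]   # random pick from top l
        clause.append(chosen)
        literals.remove(chosen)
        uncovered = [x for x in uncovered if not holds(chosen, x)]
        rejected = [x for x in rejected if not holds(chosen, x)]
    return clause


def ra_cnf(e_pos, e_neg, list_size=3, iterations=20, seed=0) -> List[Clause]:
    """Repeat the OCAT loop `iterations` times and keep the shortest CNF."""
    rng = random.Random(seed)
    n = len(e_pos[0])
    best = None
    for _ in range(iterations):
        remaining_neg, cnf = list(e_neg), []
        while remaining_neg:
            clause = build_clause(e_pos, remaining_neg, n, list_size, rng)
            cnf.append(clause)
            # keep only the negatives that this clause failed to reject
            remaining_neg = [x for x in remaining_neg
                             if any(holds(l, x) for l in clause)]
        if best is None or len(cnf) < len(best):
            best = cnf
    return best


cnf = ra_cnf(E_POS, E_NEG)
```

Because the choice among the top candidates is random, different seeds may return different expressions; by construction each one accepts all positive examples and rejects all negative ones.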

3.2 Proposed Alterations to the RA Algorithm

For a Boolean expression to reveal information about associations in a database, it is more convenient to be expressed in DNF. The first step is to select an attribute about which associations will be sought. This attribute will form the consequent part of the desired association rules. By selecting an attribute, the database can be partitioned into two mutually exclusive sets of records (binary vectors). Vectors that have a value equal to 1 in terms of the selected attribute can be seen as the positive examples. A similar interpretation holds true for records that have a value of 0 for that attribute. These vectors will be the set of the negative examples.

Given the above way of partitioning (dichotomizing) a database of transactions, it follows that each conjunction (logical clause or term) of the target function will reject all the negative examples, while on the other hand, it will accept some of the positive examples. Of course, when all the conjunctions are considered together, they will accept all the positive examples. In terms of association rules, each clause in the Boolean expression (which now is expressed in DNF) can be thought of as a frequent item set. Thus, this clause can be checked further to determine whether it meets the preset minimum support and confidence level criteria.

The requirement of having Boolean expressions in DNF does not mean that the RA algorithm has to be altered to produce Boolean expressions in DNF. It will have to be altered in order to make it compatible with mining of association rules, but its original CNF producing nature (as described in Figure 2) will be kept as it is. As it is shown in [], if one forms the complements of the positive and negative sets and then swaps their roles, then a CNF producing algorithm will produce a DNF expression (and vice-versa). The last alteration is to swap, in the CNF (or DNF) expression, the logical operators AND ($\wedge$) and OR ($\vee$).
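The partitioning step described above can be sketched in a few lines. The function name `partition_by_consequent` and the toy transactions are hypothetical; removing the consequent item from the positive examples is a choice of this sketch, made so that the consequent cannot appear in its own antecedent.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]


def partition_by_consequent(
    transactions: List[Transaction], consequent: str
) -> Tuple[List[Transaction], List[Transaction]]:
    """Split a database into positive and negative examples for one consequent.

    Positive examples are the transactions that contain the consequent
    (returned without it); negative examples are those that do not.
    """
    positives = [t - {consequent} for t in transactions if consequent in t]
    negatives = [t for t in transactions if consequent not in t]
    return positives, negatives


# Hypothetical toy database, for illustration only.
TOY: List[Transaction] = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"b", "c"}),
    frozenset({"a", "c"}),
]

positives, negatives = partition_by_consequent(TOY, "c")
```

Any item set that is contained in no negative example yields a rule with the chosen consequent that holds in every transaction where the item set occurs.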
Another interesting issue is to observe that the confidence level of the association rules produced by processing frequent item sets (i.e., clauses of a Boolean expression in DNF when the OCAT / RA approach is used) will always be equal to 100%. This happens because each DNF clause rejects all the negative examples while it accepts some of the positive examples when a database with transactions is partitioned as described above.

A critical change in the RA heuristic is that for deriving association rules, it should only consider the attributes themselves and not their negations. This is not always the case, since some authors have also proposed to use association rules with negations [3]. However, association rules are usually defined on the attributes themselves and not on their negations.

Some changes also need to be made to the selection process of the single attribute to be included in the clause being formed (Step 1 in Figure 2). If NEG($a_i$) = 0 at Step 1, then the value of the ratio POS($a_i$)/NEG($a_i$) for that particular $a_i$ is set to be equal to an arbitrarily high positive number multiplied by the POS($a_i$) value. However, this number may still be small and thus it should be changed according to the size of the database. There are four cases regarding the value of the POS/NEG ratio that need to be considered when selecting an attribute. These cases are:

Case #1: Multiple attributes (items) with NEG($a_i$) = 0 and equal values of the POS($a_i$)/NEG($a_i$) ratio.
Case #2: No attributes (items) with value NEG($a_i$) = 0 exist, but when all the attributes are ranked according to their POS($a_i$)/NEG($a_i$) values in descending order, the highest value occurs multiple times.
Case #3: A single attribute with NEG($a_i$) = 0 exists.
Case #4: There are no attributes with NEG($a_i$) = 0, but when all the attributes are ranked according to their POS($a_i$)/NEG($a_i$) values in descending order, the highest value occurs only once.

For cases #1 and #2, the attribute to be included in the clause being formed is randomly selected among the candidates. The candidates for case #1 are those attributes with NEG($a_i$) = 0 and equal values of POS($a_i$)/NEG($a_i$). The candidates for case #2, on the other hand, are those attributes that share the same POS($a_i$)/NEG($a_i$) value (and this is the highest value). For cases #3 and #4 there is no need for a random selection process, since there is a single attribute with the highest POS($a_i$)/NEG($a_i$) value. Thus, that particular attribute is included in the clause being formed.

Furthermore, considering only the attributes themselves and excluding their negations may cause certain problems, because some degenerative situations could occur. These degenerative situations are as follows:

Degenerative Case #1: If only one item is bought in a transaction, and if that particular item is selected to be the consequent part of the association rules sought, then the $E^+$ set will have an example (i.e., the one that corresponds to that transaction) with only zero elements. Thus, the RA heuristic (or any variant of it) will never terminate. Hence, for simplicity it will be assumed that such degenerative transactions do not occur in our databases.
Degenerative Case #2: After forming a clause, and after the $E^-$ set is updated (Step 10 in Figure 2), the new POS/NEG values may be such that the new clause may be one of those that have already been produced earlier (i.e., it is possible to have cycling).
Degenerative Case #3: A newly generated clause may not be able to reject any of the negative examples.

The previous is an exhaustive list of all possible degenerative situations when the original RA algorithm is used. Thus, the original RA algorithm needs to be altered in order to avoid them. Degenerative case #1 can be easily avoided by simply discarding all one-item transactions (which occur very rarely in reality anyway). Degenerative cases #2 and #3 can be avoided by establishing some upper limits on the number of times a Boolean function is generated without being able to reject all the negative examples (please recall the randomized characteristic of the RA heuristic).

In order to mine association rules that have different consequents, the altered RA should be run for each one of the attributes $A_1, A_2, \ldots, A_n$. After determining the frequent item sets for each one of these attributes, one needs to calculate the support level for each frequent item set, and check whether the (preset) minimum support criterion is met. If it is, then the current association rule is reported. The proposed altered RA (to be denoted as ARA) heuristic is summarized in Figure 3. Finally, it should be stated here that the new heuristic is also of time complexity $O(D^2 n \log n)$, as is the case with the original RA algorithm. This follows easily from a computational analysis similar to the one described in the previous sub-section for the RA algorithm.
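The four selection cases can be illustrated with a short sketch. The function name `choose_attribute`, the `multiplier` parameter (standing in for the arbitrarily high number), and the rule that the case label is decided by looking at the top-ranked candidates are all assumptions of this sketch rather than details confirmed by the text.

```python
import random
from typing import Dict, Tuple


def choose_attribute(pos: Dict[str, int], neg: Dict[str, int],
                     multiplier: float, rng: random.Random) -> Tuple[str, int]:
    """Return the chosen attribute and the case number (1 to 4) that applied."""

    def score(a: str) -> float:
        # NEG = 0: use an arbitrarily high multiple of POS instead of POS/NEG.
        return multiplier * pos[a] if neg[a] == 0 else pos[a] / neg[a]

    best = max(score(a) for a in pos)
    candidates = sorted(a for a in pos if score(a) == best)
    zero_neg = any(neg[a] == 0 for a in candidates)
    if len(candidates) > 1:
        # Cases #1 and #2: a tie at the top, so pick at random among the tied.
        return rng.choice(candidates), (1 if zero_neg else 2)
    # Cases #3 and #4: a unique best attribute, so no randomness is needed.
    return candidates[0], (3 if zero_neg else 4)


rng = random.Random(0)
tie_zero = choose_attribute({"A1": 3, "A2": 3, "A3": 1},
                            {"A1": 0, "A2": 0, "A3": 2}, 1000.0, rng)
tie_ratio = choose_attribute({"A1": 2, "A2": 4, "A3": 1},
                             {"A1": 1, "A2": 2, "A3": 2}, 1000.0, rng)
single_zero = choose_attribute({"A1": 3, "A2": 1}, {"A1": 0, "A2": 1}, 1000.0, rng)
single_ratio = choose_attribute({"A1": 3, "A2": 1}, {"A1": 1, "A2": 1}, 1000.0, rng)
```

Only the tied cases consume randomness, which matches the remark that cases #3 and #4 need no random selection.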

    DO for each consequent A_1, A_2, ..., A_n
    BEGIN
        Form the E^+ and E^- sets according to the presence or absence of the current A_i attribute.
        Calculate the initial POS and NEG values.
        Let A = {A_1, A_2, ..., A_n}.
        DO WHILE (E^- ≠ ∅)
            START: C = ∅; {initializations}
            DO WHILE (E^+ ≠ ∅)
                Step 1: Rank in descending order all attributes a_i ∈ A (where a_i is the attribute currently under consideration) according to their POS(a_i)/NEG(a_i) value. If NEG(a_i) = 0, then POS(a_i)/NEG(a_i) is set to an arbitrarily high number multiplied by POS(a_i);
                Step 2: Evaluate the current POS/NEG case;
                Step 3: Choose an attribute a_k accordingly;
                Step 4: Let the set of atoms in the current clause be C ← C ∨ {a_k};
                Step 5: Let E^+(a_k) be the set of members of E^+ accepted when a_k is included in the current CNF clause;
                Step 6: Let E^+ ← E^+ − E^+(a_k);
                Step 7: Let A ← A − {a_k};
                Step 8: Calculate the new POS(a_k) values for all a_k ∈ A;
                Step 9: If A = ∅ (i.e., checking for failure case #1), then go to START;
            REPEAT
            Step 10: Let E^-(C) be the set of members of E^- which are rejected by C;
            Step 11: If E^-(C) = ∅, determine the failure case (i.e., case #2 or #3). Check whether the corresponding counter has hit the preset limit. If yes, then go to START;
            Step 12: Let E^- ← E^- − E^-(C);
            Step 13: Calculate the new NEG values;
            Step 14: Let C be the antecedent and A_i be the consequent of the rule. Check the candidate rule C → A_i for minimum support. If it meets the minimum support level criterion, then output the rule;
            Step 15: Reset the E^+ set (i.e., select the examples which have A_i equal to 1 and store them in set E^+);
        REPEAT
    END

Figure 3: The Proposed Altered Randomized Algorithm (ARA) for Mining Association Rules (for the CNF Case).
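The overall idea of Figure 3 can be sketched directly in terms of item sets. This is a simplified restatement under several assumptions, not the authors' procedure: it works in the DNF view (each antecedent is grown until no transaction without the consequent contains it), it breaks ties deterministically instead of randomly, it stops at the first failure instead of using counters, and the names `grow_itemset`, `mine_rules` and `multiplier` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

Transaction = FrozenSet[str]
Rule = Tuple[FrozenSet[str], str, int]   # (antecedent, consequent, support)


def grow_itemset(positives: List[Transaction], negatives: List[Transaction],
                 multiplier: float) -> Optional[FrozenSet[str]]:
    """Greedily add items until no negative example contains the item set."""
    if not negatives:
        return None                      # consequent occurs everywhere
    itemset: FrozenSet[str] = frozenset()
    alive_pos, alive_neg = list(positives), list(negatives)
    while alive_neg:
        candidates = set().union(*alive_pos) - itemset
        if not candidates:
            return None                  # failure: negatives cannot be rejected

        def score(a: str) -> float:
            pos = sum(a in t for t in alive_pos)
            neg = sum(a in t for t in alive_neg)
            return multiplier * pos if neg == 0 else pos / neg

        best = max(sorted(candidates), key=score)
        itemset = itemset | {best}
        alive_pos = [t for t in alive_pos if best in t]
        alive_neg = [t for t in alive_neg if best in t]
    return itemset


def mine_rules(transactions: List[Transaction], min_support: int,
               multiplier: float = 1000.0) -> List[Rule]:
    """Run the greedy search once per consequent item and keep supported rules."""
    rules: List[Rule] = []
    for consequent in sorted(set().union(*transactions)):
        positives = [t - {consequent} for t in transactions if consequent in t]
        negatives = [t for t in transactions if consequent not in t]
        uncovered = list(positives)
        while uncovered:
            itemset = grow_itemset(uncovered, negatives, multiplier)
            if itemset is None:
                break
            covered = sum(itemset <= t for t in positives)
            if covered >= min_support:
                rules.append((itemset, consequent, covered))
            uncovered = [t for t in uncovered if not itemset <= t]
    return rules


# Hypothetical market-basket data, for illustration only.
BASKETS: List[Transaction] = [
    frozenset({"champagne", "strawberries", "bread"}),
    frozenset({"champagne", "strawberries"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk", "strawberries"}),
    frozenset({"champagne", "strawberries", "milk"}),
]

rules = mine_rules(BASKETS, min_support=2)
```

Every rule returned this way holds in all transactions that contain its antecedent, which mirrors the 100% confidence property discussed above; only the support threshold filters the output.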

4. COMPUTATIONAL EXPERIMENTS

In order to compare the altered RA (ARA) heuristic with some of the existing association rule methods, we applied them on several synthetic databases that were generated by using the data generation programs described in []. These databases contain transactions that would reflect the real world, where people tend to buy sets of certain items together.

Several databases were used in making these comparisons. The sizes of the databases used are as follows:

Database #1: , items with , transactions (the minimum support was set to 5).
Database #2: , items with , transactions (the minimum support was set to 5).
Database #3: 5 items with 5, transactions (the minimum support was set to ).
Database #4: 5 items with ,5 transactions (the minimum support was set to ).
Database #5: 5 items with , transactions (the minimum support was set to ).

The first results are from the densest database used in [], that is, database #1. The Apriori algorithm was still in the process of generating the frequent item sets of length after 8 hours minutes and 8 seconds when database #1 was used. Therefore, the experiment with the Apriori algorithm was aborted. However, the ARA algorithm completed mining the very same database in only hours minutes and second. The ARA algorithm mined a single rule for each one of the following support levels: 59, 63, 38,, 535, 63, 6, 756, 78, 98, and,93. All the experiments were run on an IBM 967/R53 computer. This processor is a -engine box with each engine being rated at 6 MIPS (millions of instructions per second).

For the experiments with database #2, however, some parallel computing techniques were utilized for the Apriori algorithm. The frequent item sets were gathered into smaller groups, making it possible to build the next frequent item sets in shorter time. As a result, each group was analyzed separately, and the CPU times for each one of these jobs were added together at the end. The Apriori algorithm completed mining this database in 59 hours 5 minutes and 3 seconds.
Figure 4 illustrates the number of rules for this case. On the other hand, the ARA algorithm mined database #2 in only hours 5 minutes and 57 seconds. These results are depicted in Figure 5.

[Figure 4: Histogram of the Results When the Apriori Approach Was Used on Database #. Axes: support level versus number of rules mined.]

[Figure 5: Histogram of Results When the ARA Approach Was Used on Database #. Axes: support level versus number of rules mined.]

It should be noted here that the CPU times recorded for the Apriori experiments for this research were higher than the similar results reported in []. For instance, it was reported in [] that the Apriori algorithm took approximately 5 seconds to mine database #1. That result was obtained on an IBM RS/6 53H workstation with a main memory of 6 MB, and running AIX 3.. On the other hand, for database #1, the Apriori program written for this research was in the process of generating item sets of length after 8 hours minutes and 8 seconds. The only difference between the approach taken in [] and the one in this research is that the candidate item sets in [] were stored in a hash tree. Hashing is a data storage technique that provides fast direct access to a specific stored record on the basis of a given value for some field []. In this research, hash trees were not used in storing candidate item sets; instead they were kept in the main memory of the computer. This made it faster to access candidate item sets because direct access is generally very expensive CPU-wise. It is believed that the programming techniques and the type of the computers used in [] are causing the CPU time difference. Additionally, the Apriori code in this research was run under a time-sharing option, which again could make a big difference. As it was mentioned earlier, the computer codes for the Apriori and the ARA algorithms were run on an IBM 967/R53 computer.

The results obtained by using database #2 suggest that ARA produced a reasonable number of rules fast. Also, these rules were of high quality, since by construction, all had a 100% confidence level. After obtaining these results, it was decided to mine the remaining databases by also using a commercial software package, namely MineSet by Silicon Graphics. MineSet is one of the most commonly used data mining computer packages. Unfortunately, MineSet works with transactions of a fixed length. Therefore, the transactions were coded as zeros and ones, zeros representing that the corresponding item was not bought, and ones representing that the corresponding item was bought.
However, this causes MineSet to mine negative association rules, too. Negative association rules are rules based on the absence of items in the transactions, rather than the presence of them, and negations of attributes may appear in the rule structure. Another drawback of MineSet is that only a single item is supported in both the left and the right hand sides of the rules to be mined. Also, the current version of MineSet allows for a maximum of 5 items in each transaction. The MineSet software used for this study was installed on a Silicon Graphics workstation, which had a CPU clock rate of 5 MHz and a RAM of 5MB.

The fact that MineSet supports only a single item in both the left and the right hand sides of the association rules suggests that MineSet also uses a search procedure of polynomial time complexity. Such an approach would first have to count the support of each item when it is compared with every other item, and store these supports in a triangular matrix of dimension $n$ (i.e., equal to the number of attributes). During the pass over the database, the supports of the individual items could be counted, and the rest will only be a matter of checking whether the result is above the preset minimum confidence level. For instance, when checking the candidate association rule $A_i \rightarrow A_6$, the confidence level would be the support of $A_i A_6$ divided by the support of $A_i$. On the other hand, when doing the same for the rule $A_6 \rightarrow A_i$, the confidence level would be the support of $A_i A_6$ divided by the support of $A_6$. Therefore, such an approach requires $n(n-1)/2$ operations (where $n$ is the number of attributes or items). If $D$ is the number of transactions (records) in the database, then the time complexity of this approach is equal to $O(D n^2)$.

This is almost of the same time complexity that the ARA approach has (which, recall, is of order $O(D^2 n \log n)$). However, for the ARA case, this complexity is for the worst-case scenario. The ARA algorithm will stop as soon as it has produced a Boolean function that accepts all the positive and rejects all the negative examples. In addition, the ARA approach is able to mine rules with multiple items in the antecedent part of an association rule. The ARA approach can also be easily adapted to mine association rules with multiple items in the consequent part. The only change that has to be made is in the partitioning (dichotomization) of the original database into the sets of the positive and negative examples. On the other hand, the Apriori approach has an exponential time complexity because it follows a combinatorial search approach.

When database #3 was used, it took MineSet 3 minutes and seconds to mine the association rules. On the other hand, it took ARA just 6 minutes and 5 seconds to mine the same database. Figures 6 and 7 provide the number of the mined rules from database #3. When database #4 was used, it took MineSet 8 minutes and 3 seconds to mine association rules. For the ARA approach, the required time was 5 minutes and 6 seconds only.
These results are depcted n Fgures 8 and 9. For database #5, t took MneSet 5 mnutes and seconds to mne the rules. On the other hand, t took only mnutes and 3 seconds when the ARA approach was used on the same database. The correspondng results are depcted n Fgures and. Table presents a summary of all the above. From these results t becomes evdent that the ARA approach derves assocaton rules faster and also these rules have much hgher support levels. 7
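The single-item counting procedure hypothesized above for MineSet can be sketched as follows. This is only an illustration of the idea, not MineSet's actual implementation, which is not documented here. The function name `pairwise_rules` and the sample rows are assumptions, and the confidence of a rule is taken in the standard way as the support of both items together divided by the support of the antecedent.

```python
from itertools import combinations


def pairwise_rules(rows, min_confidence):
    """Mine single-item rules (lhs -> rhs) from 0/1 rows in one database pass."""
    n = len(rows[0])
    single = [0] * n   # support count of each individual item
    pair = {}          # triangular store: key (i, j) with i < j

    for row in rows:   # one pass over the D transactions
        present = [i for i, bit in enumerate(row) if bit]
        for i in present:
            single[i] += 1
        # At most n(n-1)/2 pair updates per transaction.
        for i, j in combinations(present, 2):
            pair[(i, j)] = pair.get((i, j), 0) + 1

    rules = []
    for (i, j), both in pair.items():
        # Each counted pair yields two candidate rules, one per direction.
        for lhs, rhs in ((i, j), (j, i)):
            confidence = both / single[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return sorted(rules)


# Hypothetical 0/1 transactions over three items (columns 0, 1, 2).
rows = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 0]]
strong = pairwise_rules(rows, 0.9)
```

With these sample rows, only the rule from item 2 to item 0 reaches a confidence of 0.9 or more, because item 2 never appears without item 0. The nested loop over pairs inside the single pass is what gives the O(Dn^2) bound discussed above.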

Figure 6: Histogram of the Results When the MineSet Software Was Used on Database #3. (Axes: Support Level; Number of rules mined.)

Figure 7: Histogram of the Results When the ARA Approach Was Used on Database #3. (Axes: Support Level; Number of rules mined.)

Figure 8: Histogram of the Results When the MineSet Software Was Used on Database #4. (Axes: Support Level; Number of rules mined.)

Figure 9: Histogram of the Results When the ARA Approach Was Used on Database #4. (Axes: Support Level; Number of rules mined.)

Figure 10: Histogram of the Results When the MineSet Software Was Used on Database #5. (Axes: Support Level; Number of rules mined.)

Figure 11: Histogram of the Results When the ARA Approach Was Used on Database #5. (Axes: Support Level; Number of rules mined.)

Table: Summary of the Required CPU Times Under Each Method.

Database       Apriori CPU (hh:mm:ss)    ARA CPU (hh:mm:ss)    MineSet CPU (hh:mm:ss)
Database #1    Not completed             ::                    N/A
Database #2    59:5:3                    :5:57                 N/A
Database #3    N/A                       :6:5                  :3:
Database #4    N/A                       :5:6                  :8:3
Database #5    N/A                       ::3                   :5:

5. CONCLUSIONS

This paper presented the development of a new approach for deriving association rules from databases. The new approach is called ARA and it is based on a previous algorithm (i.e., the RA approach) that was developed by one of the authors and his associates in [8]. Both the old and the new approach are randomized algorithms. The proposed ARA approach produces a small set of association rules in polynomial time. Furthermore, these rules are of high quality with 100% support levels. The 100% support level of the derived rules is a characteristic of the way the ARA approach constructs association rules. The ARA approach can be further extended to handle cases with less than 100% support levels. This can be done by introducing stopping rules that terminate the appropriate loops in Figure 3. That is, each clause (in the CNF case) would be required to accept only a predetermined lower limit (i.e., a percentage less than 100%) of the positive examples, and only a predetermined percentage of the negative examples would have to be rejected, instead of seeking for all the positive examples to be accepted and all the negative examples to be rejected, as is currently the case.

An extensive empirical study was also undertaken. The Apriori approach and the MineSet software by Silicon Graphics were compared with the proposed ARA algorithm. The computational results demonstrated that the new approach can be both highly efficient and effective. The above observations strongly suggest that the proposed ARA algorithm is very promising for mining association rules in today's world of ever-increasing and diverse databases.
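The relaxed stopping rules proposed above can be sketched as two threshold tests. This is a hypothetical illustration, not the loops of Figure 3. The function names, the literal encoding (attribute index paired with the required truth value), the thresholds, and the sample examples are all assumptions.

```python
def clause_accepts(clause, example):
    """A CNF clause is a disjunction: it accepts an example if any literal holds.

    A literal (i, True) requires attribute i to be 1; (i, False) requires 0.
    """
    return any(bool(example[i]) == wanted for i, wanted in clause)


def clause_is_acceptable(clause, positives, min_positive_fraction):
    """Relaxed test: the clause must accept at least the given share of positives."""
    accepted = sum(clause_accepts(clause, e) for e in positives)
    return accepted >= min_positive_fraction * len(positives)


def enough_negatives_rejected(clauses, negatives, min_negative_fraction):
    """Relaxed stop: a negative is rejected if at least one clause does not accept it."""
    rejected = sum(
        1 for e in negatives if not all(clause_accepts(c, e) for c in clauses)
    )
    return rejected >= min_negative_fraction * len(negatives)


# Hypothetical examples over three binary attributes.
positives = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
negatives = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
clause = [(0, True), (2, True)]   # (attribute 0) OR (attribute 2)

strict_ok = clause_is_acceptable(clause, positives, 1.0)
relaxed_ok = clause_is_acceptable(clause, positives, 0.75)
may_stop = enough_negatives_rejected([clause], negatives, 0.6)
```

Setting both thresholds to 1.0 gives back the current behavior, in which all positive examples must be accepted and all negative examples rejected. Lower thresholds trade exactness for earlier termination.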

REFERENCES

1. T. Blaxton and C. Westphal, Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley & Sons, Inc., 86-89, New York, NY, (1998).
2. R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proceedings of the 20th VLDB Conference, Santiago, Chile, (1994).
3. R.J. Bayardo Jr., R. Agrawal and D. Gunopulos, Constraint-based rule mining in large, dense databases, Proceedings of the 15th International Conference on Data Engineering, (1999).
4. R. Agrawal, T. Imielinski and A. Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD Conference, Washington, DC, May, (1993).
5. M. Houtsma and A. Swami, Set oriented mining of association rules, Technical Report RJ 9567, IBM, October, (1993).
6. A. Savasere, E. Omiecinski and S. Navathe, An efficient algorithm for mining association rules in large databases, Data Mining Group, Tandem Computers, Inc., Austin, TX, (1995).
7. H. Toivonen, Sampling large databases for association rules, Proceedings of the 22nd VLDB Conference, Bombay, India, (1996).
8. A.S. Deshpande and E. Triantaphyllou, A greedy randomized adaptive search procedure (GRASP) for inferring logical clauses from examples in polynomial time and some extensions, Mathematical and Computer Modelling 27, 75-99, (1998).
9. J. Peysakh, A fast algorithm to convert Boolean expressions into CNF, IBM Computer Science RC 93(#5797), Watson, NY, (1987).
10. E. Triantaphyllou and A.L. Soyster, A relationship between CNF and DNF systems derivable from examples, ORSA Journal on Computing 7, (1995).
11. E. Triantaphyllou, Inference of a minimum size Boolean function from examples by using a new efficient branch and bound approach, Journal of Global Optimization 5, 69-94, (1994).
12. E. Triantaphyllou, A.L. Soyster and S.R.T. Kumara, Generating logical expressions from positive and negative examples via a branch and bound approach, Computers and Operations Research, (1994).
13. A. Savasere, E. Omiecinski and S. Navathe, Mining for strong negative associations in a large database of customer transactions, Proceedings of the IEEE 14th International Conference on Data Engineering, Orlando, FL, (1998).
14. C.J. Date, An Introduction to Database Systems, Addison-Wesley Publishing Company, Reading, MA, (1995).
