A Heuristic for Mining Association Rules In Polynomial Time

E. YILMAZ
General Electric Card Services, Inc., A unit of General Electric Capital Corporation, 6 Summer Street, MS -39C, Stamford, CT, 697, U.S.A.
egemen.yilmaz@gecapital.com

E. TRIANTAPHYLLOU
Department of Industrial and Manufacturing Systems Engineering, Louisiana State University, 38 CEBA Building, Baton Rouge, LA, 783, U.S.A.
trianta@lsu.edu
Web: http://www.imse.lsu.edu/vangelis

J. CHEN
Department of Computer Science, Louisiana State University, 98 Coates Hall, Baton Rouge, LA, 783, U.S.A.

T. W. LIAO
Department of Industrial and Manufacturing Systems Engineering, Louisiana State University, 38 CEBA Building, Baton Rouge, LA, 783, U.S.A.

(Last Revision: September 6, )

Abstract: Mining association rules from databases has attracted great interest because of its potentially very practical applications. Given a database, the problem of interest is how to mine association rules (i.e., patterns of consumers' behaviors) in an efficient and effective way. The databases involved in today's information society can be very large. Thus, fast and effective algorithms are needed to mine association rules out of large databases. Previous approaches may cause an exponential computing resource consumption. A combinatorial explosion occurs because existing approaches exhaustively mine all the rules. The proposed algorithm takes a previously developed approach, called the Randomized Algorithm (or RA), and adapts it to mine association rules out of a database in an efficient way. The RA approach was primarily developed for inferring logical clauses from examples. Numerous computational results suggest that the new approach is very promising.

Key words: Data mining, association rules, algorithm analysis, the one clause at a time (OCAT) approach, randomized algorithms, heuristics.

Corresponding Author: Evangelos Triantaphyllou

1. INTRODUCTION

Mining association rules from databases has attracted great interest because of its potentially very useful results. Association rules are derived from a type of analysis that extracts information from coincidence [Blaxton and Westphal, 1998]. Sometimes called market basket analysis, this methodology allows a data analyst to discover correlations, or co-occurrences of transactional events. In the classic example, consider the items contained in a customer's shopping cart on any one trip to the grocery store. Chances are that his/her own shopping patterns tend to be internally consistent, and that he/she tends to buy certain items on certain days, for example milk on Mondays and beer on Fridays. There might be many examples of pairs of items that are likely to be purchased together. For instance, one might always buy champagne and strawberries together on Saturdays, although one only rarely purchases either of the items separately. This is the kind of information the store manager could use to make decisions about where to place items in the store so as to increase sales. This information can be expressed in the form of association rules. From the example given above, the manager might decide to place a special champagne display near the strawberries in the fruit section on the weekends in the hope of increasing sales.

Purchase records can be captured by using the bar codes on the products. The technology to read them has enabled businesses to collect vast amounts of data, commonly known as market basket data [Agrawal and Srikant, 1994]. Typically, a purchase record contains the items bought in a single transaction, and a database may contain many such transactions. Analyzing such databases by extracting association rules may offer some unique opportunities for businesses to increase their sales, since such information can be used in designing marketing strategies. The databases involved can be very large. Thus, fast and effective algorithms are needed to mine association rules out of them.

For a more formal definition of association rules, some notation and definitions need to be introduced next. Let $I = \{A_1, A_2, A_3, \ldots, A_n\}$ be the set with the names of the items (also called attributes, hence the notation $A_i$) among which association rules will be searched [Bayardo, Agrawal, et al., 1999], [Agrawal and Srikant, 1994]. This set is often called the item domain. Then, a transaction is a set of one or more items obtained from the set $I$. This means that for each transaction $T$, the relation $T \subseteq I$ holds. Let $D$ be the set of all transactions. Also, let $X$ be defined as a set of some of the items in $I$. The set $X$ is contained in a transaction $T$ if the relation $X \subseteq T$ holds. Using these definitions, an association rule is a relationship of the form $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The set $X$ is the antecedent part, while the set $Y$ is the consequent part of the rule. Such an association rule holds with some confidence level denoted as CL. The confidence level is the conditional probability (as it can be inferred from the available transactions in the target database) of having the consequent part $Y$ given that we already have the antecedent part $X$. Moreover, an association rule has support $S$, where $S$ is the number of transactions in $D$ that contain $X$ and $Y$ together. A frequent item set is a set of items that occur frequently in the database. That is, their support is above a predetermined minimum support level. A candidate item set is a set of items, possibly frequent, that has not yet been checked against the minimum support criterion. The association rule analysis in our approach will be restricted to those association rules which have only one item in the consequent part of the rule. However, a generalization can be made easily.

Example 1: Consider the following illustrative database:

[Matrix D: a binary matrix with one row per transaction and one column per item. Its entries, and some item subscripts in this example, are not recoverable from the source.]

This database is defined on five items, so $I = \{A_1, A_2, A_3, A_4, A_5\}$. Each row represents a transaction. For instance, the second row represents a transaction in which only items $A_3$ and $A_4$ were bought. The support of the rule A A → $A_5$ is 3, because the items A, A, and $A_5$ occur together in 3 transactions (i.e., the fifth, eighth, and ninth transactions). The confidence level of the rule A A → $A_5$ is 100%, because the number of transactions in which A and A appear together is equal to the number of transactions in which A, A, and $A_5$ appear (both are equal to three).

Given the previous definitions, the problem of interest is how to mine association rules out of a database $D$ that meet some pre-established minimum support and confidence level requirements. Mining of association rules was first introduced in [Agrawal, Imielinski, et al., 1993]. Their algorithm is called AIS (which stands for Agrawal, Imielinski, and Swami). Another study used a different approach to solve the problem of mining association rules [Houtsma and Swami, 1993]. That study presented a new algorithm called SETM (for Set Oriented Mining). The new algorithm was proposed to mine association rules by using relational operations in a relational database environment. This was motivated by the desire to use SQL to compute frequent item sets. The next study [Agrawal and Srikant, 1994] received a lot more recognition than the previous ones. Three new algorithms were presented: the Apriori, the AprioriTid, and the AprioriHybrid. The Apriori and AprioriTid approaches are fundamentally different from the AIS and the SETM algorithms. As the name AprioriHybrid suggests, this approach is a hybrid between the Apriori and the AprioriTid algorithms. Another major study in the field of mining of association rules is described in [Savasere, Omiecinski, et al., 1995]. These authors presented an algorithm called Partition. Their approach reduces the search by first computing all frequent item sets in two passes over the database. Another major study on association rules takes a sampling approach [Toivonen, 1996]. These algorithms make only one full pass over the database. The main idea is to select a random sample, and use it to determine representative association rules that are very likely to also occur in the whole database. These association rules are in turn validated in the entire database.

This paper is organized as follows. The next section presents a formal description of the research problem under consideration. The third section starts with a brief description of the OCAT (one clause at a time) approach that played a critical role in the development of the new approach. The new approach is described in the second half of the third section. The fourth section presents an extensive computational study that compared the proposed approach for mining of association rules with some existing ones. Finally, the paper ends with a conclusions section.

2. PROBLEM DESCRIPTION

Previous work on mining of association rules focused on extracting all conjunctive rules, provided that these rules meet the criteria set by the user. Such criteria can be the minimum support and confidence levels.
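
The support and confidence definitions from the introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names and the six transactions are hypothetical (they are not the database of Example 1), and the helper names `support`, `confidence`, and `rule_holds` are introduced here for the sketch.

```python
# Hypothetical market-basket data; the item names and transactions are
# illustrative assumptions, not the database D of Example 1.
TRANSACTIONS = [
    {"milk", "bread"},
    {"champagne", "strawberries"},
    {"champagne", "strawberries", "bread"},
    {"beer", "chips"},
    {"champagne", "strawberries", "milk"},
    {"strawberries"},
]


def support(transactions, items):
    """Number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.

    Returns None when the antecedent never occurs.
    """
    base = support(transactions, antecedent)
    if base == 0:
        return None
    return support(transactions, set(antecedent) | set(consequent)) / base


def rule_holds(transactions, antecedent, consequent, min_support, min_confidence):
    """Check a candidate rule X -> Y against user-set thresholds."""
    s = support(transactions, set(antecedent) | set(consequent))
    c = confidence(transactions, antecedent, consequent)
    return c is not None and s >= min_support and c >= min_confidence


s = support(TRANSACTIONS, {"champagne", "strawberries"})
c_fwd = confidence(TRANSACTIONS, {"champagne"}, {"strawberries"})
c_rev = confidence(TRANSACTIONS, {"strawberries"}, {"champagne"})
print(s, c_fwd, c_rev)
```

In this toy data the two rules share the same support but differ in confidence, which is why both thresholds are normally checked.
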
Although previous algorithms mainly considered databases from the domain of market basket analysis, they have been applied to the fields of telecommunication data analysis, census data analysis, and to classification and predictive modeling tasks in general [Bayardo, Agrawal, et al., 1999]. These applications differ from market basket analysis in the sense that they contain dense data. That is, such data might possess all or some of the following properties:
(i) Have many frequently occurring items;
(ii) Have strong correlations between several items;
(iii) Have many items in each record.

When standard association rule mining techniques are used (such as the Apriori approach [Agrawal and Srikant, 1994] and its variants), they may cause exponential resource consumption in the worst case. Thus, it may take too much CPU time for these algorithms to mine the association rules. The combinatorial explosion is a natural result of these algorithms, because they mine exhaustively all the rules that satisfy the minimum support constraint as specified by the analyst. Furthermore, this characteristic may lead to the generation of an excessive number of rules. Then, the end user will have to determine which rules are worthwhile. Therefore, the higher the number of association rules produced, the more difficult it is to review them. In addition, if the target database contains dense data, then the previous situation may become even worse.

The size of the database also plays a vital role in data mining algorithms [Toivonen, 1996]. Large databases are desired for obtaining accurate results, but unfortunately, the efficiency of the algorithms depends heavily on the size of the database. The core of today's algorithms is the Apriori algorithm [Agrawal and Srikant, 1994], and this is the algorithm that the proposed approach will be compared with in this paper. Therefore, it is highly desirable to develop an algorithm that has polynomial complexity and is still capable of finding a few rules of good quality.

3. METHODOLOGY

3.1 The One Clause At a Time (OCAT) Approach

The proposed approach is based on a heuristic, called the Randomized Algorithm (or RA), that was developed in [Deshpande and Triantaphyllou, 1998]. This heuristic infers logical clauses from two mutually exclusive collections of binary examples. The main ideas of this heuristic are briefly described next. Let $\{A_1, A_2, \ldots, A_n\}$ be a set of $n$ binary attributes. Also, let $F$ be a Boolean function over these binary attributes. That is, $F$ is a mapping from $\{0, 1\}^n$ to $\{0, 1\}$. The input of the RA heuristic is two sets of training examples. Each example is a vector of size $n$ defined in the space $\{0, 1\}^n$. The training examples somehow have been classified as either positive or negative. Then, the Boolean function to be inferred should evaluate to true (1) when it is fed with a positive example and to false (0) when it is fed with a negative example.
Hopefully, this function is an accurate estimation of the hidden logic that has classified the training examples. Another goal is for the inferred Boolean function (in conjunctive normal form (CNF) or disjunctive normal form (DNF)) to have a very small, ideally minimum, number of disjunctions or conjunctions (also known as terms in the literature). A Boolean function is in CNF if it is of the form:

$$\bigwedge_{j=1}^{k} \Big( \bigvee_{i \in \rho_j} a_i \Big)$$

Similarly, a Boolean function is in DNF if it is of the form:

$$\bigvee_{j=1}^{k} \Big( \bigwedge_{i \in \rho_j} a_i \Big)$$

where $a_i$ is either a binary attribute $A_i$ or its negation $\bar{A}_i$, and the variable $\rho_j$ is the set of the indices of the attributes in the $j$-th conjunction or disjunction. As it is shown in [Peysakh, 1987], any Boolean function can be transformed into the CNF or DNF form. Also, in [Triantaphyllou and Soyster, 1995] a simple transformation scheme is presented for inferring CNF functions with algorithms that initially infer DNF functions, and vice-versa.

In order to help fix ideas of how the RA algorithm operates, consider the following positive and negative example sets, denoted as $E^+$ and $E^-$, respectively.

$$E^+ = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}, \qquad E^- = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$

Now consider the following Boolean expression (in CNF):

$$(A_2 \vee A_4) \wedge (\bar{A}_2 \vee \bar{A}_3) \wedge (A_1 \vee A_3 \vee \bar{A}_4)$$

This Boolean expression correctly classifies the previous training examples. In [Triantaphyllou, Soyster, et al., 1994] and [Triantaphyllou, 1994] the authors present a strategy called the One Clause At a Time (OCAT) approach (see also Figure 1) for inferring a Boolean function from two classes of binary examples.

  $i = 0$; $C = \emptyset$; {initializations}
  DO WHILE ($E^- \neq \emptyset$)
    Step 1: $i \leftarrow i + 1$; {where $i$ indicates the $i$-th clause}
    Step 2: Find a clause $c_i$ which accepts all members of $E^+$ while it rejects as many members of $E^-$ as possible;
    Step 3: Let $E^-(c_i)$ be the set of members of $E^-$ which are rejected by $c_i$;
    Step 4: Let $C \leftarrow C \wedge c_i$;
    Step 5: Let $E^- \leftarrow E^- - E^-(c_i)$;
  REPEAT;

Figure 1: The One Clause At a Time (OCAT) Approach, for the CNF Case [Triantaphyllou, 1994].
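
The OCAT loop of Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: Step 2 is done here by exhaustive enumeration of all clauses (feasible only for a handful of attributes) instead of the branch-and-bound method of [Triantaphyllou, 1994]. The example sets `E_POS` and `E_NEG`, the expression `EXAMPLE_CNF`, the literal encoding, and all function names are assumptions of this sketch; the data were chosen to agree with the counts quoted in this section.

```python
from itertools import product

# Example sets assumed for this sketch (binary vectors over A1..A4).
E_POS = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
         (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]

# A literal is (index, positive): (1, True) is A2 and (1, False) is NOT A2.
# (A2 v A4) ^ (~A2 v ~A3) ^ (A1 v A3 v ~A4)
EXAMPLE_CNF = [[(1, True), (3, True)],
               [(1, False), (2, False)],
               [(0, True), (2, True), (3, False)]]


def clause_accepts(clause, x):
    """A CNF clause (a disjunction) accepts x if any literal is true on x."""
    return any((x[i] == 1) == positive for i, positive in clause)


def cnf_accepts(cnf, x):
    """A CNF expression accepts x only if every clause accepts x."""
    return all(clause_accepts(c, x) for c in cnf)


def best_clause(e_pos, e_neg):
    """Step 2 by brute force: accept all of e_pos, reject the most of e_neg."""
    n = len(e_pos[0])
    best, best_rejected = None, 0
    # Each attribute is absent (None), present (True), or negated (False).
    for choice in product((None, True, False), repeat=n):
        clause = [(i, p) for i, p in enumerate(choice) if p is not None]
        if not all(clause_accepts(clause, x) for x in e_pos):
            continue
        rejected = sum(1 for y in e_neg if not clause_accepts(clause, y))
        if rejected > best_rejected:
            best, best_rejected = clause, rejected
    return best


def ocat(e_pos, e_neg):
    """Add one clause at a time until every negative example is rejected."""
    cnf, remaining = [], list(e_neg)
    while remaining:                            # DO WHILE E- is not empty
        clause = best_clause(e_pos, remaining)  # Step 2
        if clause is None:
            raise ValueError("an example appears in both sets")
        cnf.append(clause)                      # Step 4
        # Steps 3 and 5: keep only the negatives that this clause accepts.
        remaining = [y for y in remaining if clause_accepts(clause, y)]
    return cnf


learned = ocat(E_POS, E_NEG)
print(len(learned), "clauses:", learned)
```

The enumeration visits $3^n$ clauses, so it only serves to make the loop concrete; the heuristics discussed next exist precisely to avoid that cost.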

As it is indicated in Figure 1, the OCAT approach attempts to minimize the number of CNF clauses that will finally form the target function $F$. A key task in the OCAT approach is Step 2 (in Figure 1). At Step 2 a single clause is constructed. In [Triantaphyllou, 1994] a branch-and-bound approach is developed that infers a clause (for the CNF case) that accepts all the positive examples while it rejects as many negative examples as possible. Later, in [Deshpande and Triantaphyllou, 1998] the RA heuristic is proposed, which returns a clause that rejects many (as opposed to as many as possible) negative examples (and still accepts all the positive examples). Next are some definitions that are used in these approaches and are going to be used in the new approach as well.

  $C$: the set of attributes in the current clause (a disjunction for the CNF case);
  $a_k$: an attribute such that $a_k \in A$, where $A$ is the set of the attributes $A_1, A_2, \ldots, A_n$ and their negations;
  POS($a_k$): the number of all positive examples in $E^+$ which would be accepted if attribute $a_k$ were included in the current CNF clause;
  NEG($a_k$): the number of all negative examples in $E^-$ which would be accepted if attribute $a_k$ were included in the current clause;
  $l$: the size of the candidate list;
  ITRS: the number of times the clause forming procedure is repeated.

The RA algorithm is of time complexity O(Dn ) (where D is the number of transactions in the database and n is the number of items) and it is next described in Figure 2. For illustrative purposes, this algorithm is next applied on the two sets of binary vectors given earlier in this section. When the previous definitions are used, the following can be easily derived. The set of attributes (items) for these positive and negative examples is:

$$A = \{A_1, A_2, A_3, A_4, \bar{A}_1, \bar{A}_2, \bar{A}_3, \bar{A}_4\}.$$

Therefore, the POS($a_k$) and NEG($a_k$) values are:

  POS($A_1$) = 2   NEG($A_1$) = 4   POS($\bar{A}_1$) = 2   NEG($\bar{A}_1$) = 2
  POS($A_2$) = 2   NEG($A_2$) = 2   POS($\bar{A}_2$) = 2   NEG($\bar{A}_2$) = 4
  POS($A_3$) = 1   NEG($A_3$) = 3   POS($\bar{A}_3$) = 3   NEG($\bar{A}_3$) = 3
  POS($A_4$) = 2   NEG($A_4$) = 2   POS($\bar{A}_4$) = 2   NEG($\bar{A}_4$) = 4
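
The POS and NEG counts above can be reproduced with a short script. It is a sketch under assumptions: the example sets `E_POS` and `E_NEG` are written out in the code (chosen to agree with the counts quoted in this section), and the names `pos_neg_counts` and `ratio` are introduced here for illustration.

```python
# Example sets assumed for this sketch (binary vectors over A1..A4).
E_POS = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
         (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]


def pos_neg_counts(e_pos, e_neg):
    """Map each literal (index, positive) to its (POS, NEG) pair.

    POS counts the positive examples on which the literal is true, NEG the
    negative ones. (i, True) stands for A_{i+1}, (i, False) for its negation.
    """
    counts = {}
    for i in range(len(e_pos[0])):
        for positive in (True, False):
            value = 1 if positive else 0
            pos = sum(1 for x in e_pos if x[i] == value)
            neg = sum(1 for y in e_neg if y[i] == value)
            counts[(i, positive)] = (pos, neg)
    return counts


def ratio(pos, neg):
    """POS/NEG ratio; infinity stands in for 'an arbitrarily high value'."""
    return float("inf") if neg == 0 else pos / neg


COUNTS = pos_neg_counts(E_POS, E_NEG)
for (i, positive), (pos, neg) in sorted(COUNTS.items()):
    name = f"A{i + 1}" if positive else f"not A{i + 1}"
    print(f"{name:7s} POS={pos} NEG={neg} POS/NEG={ratio(pos, neg):.2f}")
```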

  DO for ITRS number of iterations
  BEGIN
    DO WHILE ($E^- \neq \emptyset$)
      $C = \emptyset$; {initializations}
      DO WHILE ($E^+ \neq \emptyset$)
        Step 1: Rank in descending order all attributes $a_i \in a$ (where $a_i$ is either $A_i$ or $\bar{A}_i$) according to their POS($a_i$)/NEG($a_i$) value. If NEG($a_i$) = 0, then POS($a_i$)/NEG($a_i$) is set to an arbitrarily high value;
        Step 2: Form a candidate list of the attributes which have the $l$ top highest POS($a_i$)/NEG($a_i$) values;
        Step 3: Randomly choose an attribute $a_k$ from the candidate list;
        Step 4: Let the set of atoms in the current clause be $C \leftarrow C \cup \{a_k\}$;
        Step 5: Let $E^+(a_k)$ be the set of members of $E^+$ accepted when $a_k$ is included in the current CNF clause;
        Step 6: Let $E^+ \leftarrow E^+ - E^+(a_k)$;
        Step 7: Let $a \leftarrow a - \{a_k\}$;
        Step 8: Calculate the new POS($a_k$) values for all $a_k \in a$;
      REPEAT
      Step 9: Let $E^-(C)$ be the set of members of $E^-$ which are rejected by $C$;
      Step 10: Let $E^- \leftarrow E^- - E^-(C)$;
      Step 11: Reset $E^+$ to the original value;
    REPEAT
  END
  CHOOSE the final Boolean system among the previous ITRS systems that has the smallest number of clauses.

Figure 2: The RA Heuristic, for the CNF Case [Deshpande and Triantaphyllou, 1998].

By examining the previous definitions, some key observations can be made at this point. When an attribute with a high POS function value is chosen to be included in the CNF clause currently being formed, it is very likely that this will cause some additional positive examples to be accepted. The reverse is true for atoms with a small NEG function value in terms of the negative examples. Therefore, attributes that have high POS function values and low NEG function values are a good choice for inclusion in the current CNF clause. This key observation leads to the following alternatives for defining an evaluative criterion in Figure 2 for including a new atom in the CNF clause under consideration: POS/NEG, or POS-NEG, or some type of a weighted version of these two expressions. In [Deshpande and Triantaphyllou, 1998], it was shown through some empirical experiments that the POS/NEG ratio is an effective evaluative criterion, since it is very likely to lead to Boolean functions with few clauses. In terms of the previous illustrative data, the POS/NEG ratios are as follows:

  POS($A_1$)/NEG($A_1$) = 0.5    POS($\bar{A}_1$)/NEG($\bar{A}_1$) = 1.0
  POS($A_2$)/NEG($A_2$) = 1.0    POS($\bar{A}_2$)/NEG($\bar{A}_2$) = 0.5
  POS($A_3$)/NEG($A_3$) = 0.33   POS($\bar{A}_3$)/NEG($\bar{A}_3$) = 1.0
  POS($A_4$)/NEG($A_4$) = 1.0    POS($\bar{A}_4$)/NEG($\bar{A}_4$) = 0.5

Next suppose that $l$ in Step 2, Figure 2, was chosen to be equal to 3. Then, the highest POS/NEG values for this case are: {1.0, 1.0, 1.0}. These values correspond to the attributes $\bar{A}_1$, $A_2$, and $A_4$, respectively. Let $A_2$ be the one to be randomly selected out of this candidate list. The atom $A_2$ accepts the first and the second examples in the $E^+$ set (note that the current CNF clause was empty before this selection). This means that more attributes are required in the current CNF clause being built for all positive examples to be accepted. Next, suppose (after the POS/NEG ratios have been recalculated) that $A_4$ was the second attribute to be included in the clause. Note that $A_2$ and $A_4$ together can accept all the positive examples in the $E^+$ set. Therefore, the first CNF clause (i.e., $(A_2 \vee A_4)$) of the Boolean expression is ready.

Next one can observe which negative examples are not rejected by this clause: this clause fails to reject the second, third and the sixth examples in the $E^-$ set. Therefore, the updated $E^-$ set should contain the second, third and the sixth examples from the original $E^-$ set. This process is repeated until the $E^-$ set is empty (Figure 2), meaning that all the negative examples are rejected. Recall that RA is a randomized algorithm (it repeats the function generation task ITRS times) and thus it does not return a deterministic solution. A Boolean expression accepting all the positive examples and rejecting all the negative examples could be:

$$(A_2 \vee A_4) \wedge (\bar{A}_2 \vee \bar{A}_3) \wedge (A_1 \vee A_3 \vee \bar{A}_4).$$
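
A simplified sketch of the RA heuristic of Figure 2 is given below. Several points are assumptions of this sketch rather than part of the figure: infinity stands in for the "arbitrarily high value", attributes that accept no uncovered positive example are skipped, and an attribute is also skipped if it would leave the clause rejecting no negative example at all (a safeguard so that the loop always makes progress). The example sets and all function names are likewise illustrative. The sketch assumes that no vector appears in both example sets.

```python
import random

# Example sets assumed for this sketch (binary vectors over A1..A4).
E_POS = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
         (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]


def holds(lit, x):
    """Literal (i, True) is A_{i+1}; (i, False) is its negation."""
    i, positive = lit
    return (x[i] == 1) == positive


def accepts(cnf, x):
    return all(any(holds(lit, x) for lit in clause) for clause in cnf)


def ra_clause(e_pos, e_neg, l, rng):
    """Build one disjunction that accepts every example in e_pos."""
    n = len(e_pos[0])
    available = [(i, p) for i in range(n) for p in (True, False)]
    clause, uncovered, rejected = [], list(e_pos), list(e_neg)
    while uncovered:
        scored = []
        for lit in available:
            pos = sum(holds(lit, x) for x in uncovered)
            still_rejected = [y for y in rejected if not holds(lit, y)]
            if pos == 0 or not still_rejected:
                continue  # safeguard added in this sketch
            neg = sum(holds(lit, y) for y in e_neg)
            scored.append((float("inf") if neg == 0 else pos / neg, lit))
        scored.sort(key=lambda item: item[0], reverse=True)  # Step 1
        _, lit = rng.choice(scored[:l])                      # Steps 2 and 3
        clause.append(lit)                                   # Step 4
        available.remove(lit)                                # Step 7
        uncovered = [x for x in uncovered if not holds(lit, x)]  # Steps 5, 6
        rejected = [y for y in rejected if not holds(lit, y)]
    return clause


def ra(e_pos, e_neg, l=3, itrs=20, seed=0):
    """Repeat the randomized construction and keep the shortest CNF."""
    rng = random.Random(seed)
    best = None
    for _ in range(itrs):
        cnf, remaining = [], list(e_neg)
        while remaining:
            clause = ra_clause(e_pos, remaining, l, rng)
            cnf.append(clause)
            # Steps 9 and 10: drop the negatives rejected by this clause.
            remaining = [y for y in remaining
                         if any(holds(lit, y) for lit in clause)]
        if best is None or len(cnf) < len(best):
            best = cnf
    return best


result = ra(E_POS, E_NEG)
print(len(result), "clauses:", result)
```

Because the attribute choice is random, different seeds may return different but equally valid expressions, which mirrors the remark that RA does not return a deterministic solution.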

3.2 Proposed Alterations to the RA Algorithm

For a Boolean expression to reveal information about associations in a database, it is more convenient for it to be expressed in DNF. The first step is to select an attribute about which associations will be sought. This attribute will form the consequent part of the desired association rules. By selecting an attribute, the database can be partitioned into two mutually exclusive sets of records (binary vectors). Vectors that have a value equal to 1 in terms of that attribute can be seen as the positive examples. A similar interpretation holds true for records that have a value of 0 for that attribute. These vectors will be the set of negative examples.

Given the above way for partitioning (dichotomizing) a database of transactions, it follows that each conjunction (logical clause or term) of the target function will reject all the negative examples, while on the other hand, it will accept some of the positive examples. Of course, when all the conjunctions are considered together, they will accept all the positive examples. In terms of association rules, each clause in the Boolean expression (in DNF) can be thought of as a frequent item set. Thus, this clause can be checked further for whether it meets the preset minimum support and confidence level criteria.

The requirement of having Boolean expressions in DNF does not mean that the RA algorithm has to be altered to produce Boolean expressions in DNF. It will have to be altered in order to make it compatible with mining of association rules, but its original CNF producing nature (as described in Figure 2) will be kept as it is. As it is shown in [Triantaphyllou and Soyster, 1995], if one forms the complements of the positive and negative sets and then swaps their roles, then a CNF producing algorithm will produce a DNF expression (and vice-versa). The last alteration is to swap the logical operators AND ($\wedge$) and OR ($\vee$) in the CNF (or DNF) expression.
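
The complement-and-swap transformation can be checked on a small case. The sketch below is an illustration under assumptions: `trivial_cnf` is a deliberately naive CNF learner (one clause per negative example) used only as a stand-in for OCAT or RA, and the example sets and function names are introduced here. The same literal lists that the learner returns as a CNF are then read as a DNF, which amounts to swapping AND and OR.

```python
# Example sets assumed for this sketch (binary vectors over A1..A4).
E_POS = [(0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
E_NEG = [(1, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1),
         (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]


def holds(lit, x):
    """Literal (i, True) is A_{i+1}; (i, False) is its negation."""
    i, positive = lit
    return (x[i] == 1) == positive


def eval_cnf(groups, x):
    """Read the groups as an AND of ORs."""
    return all(any(holds(lit, x) for lit in g) for g in groups)


def eval_dnf(groups, x):
    """Read the same groups as an OR of ANDs (operators swapped)."""
    return any(all(holds(lit, x) for lit in g) for g in groups)


def complement(examples):
    return [tuple(1 - v for v in x) for x in examples]


def trivial_cnf(e_pos, e_neg):
    """Stand-in CNF learner: one clause per negative example, built from
    the literals that are false on that example. e_pos is unused here but
    kept so that any CNF learner can be plugged in."""
    return [[(i, v == 0) for i, v in enumerate(y)] for y in e_neg]


def dnf_from_cnf_learner(e_pos, e_neg, learner):
    """Complement both sets, swap their roles, learn a CNF, reuse as DNF."""
    return learner(complement(e_neg), complement(e_pos))


dnf = dnf_from_cnf_learner(E_POS, E_NEG, trivial_cnf)
print(dnf)
```

With this naive learner each DNF term is simply the full description of one positive example; a learner such as RA would instead tend to return fewer and shorter terms.
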
Another interesting issue is to observe that the confidence level of the association rules produced by processing frequent item sets (i.e., clauses of a Boolean expression in DNF when the OCAT / RA approach is used) will always be equal to 100%. This happens because each DNF clause rejects all the negative examples while it accepts some of the positive examples when a database with transactions is partitioned as described above.

A critical change in the RA heuristic is that for deriving association rules, it should only consider the attributes themselves and not their negations. This is not always the case, since some authors have also proposed to use association rules with negations [Savasere, Omiecinski, et al., 1998]. However, association rules are usually defined on the attributes themselves and not on their negations.

Some changes also need to be made to the selection process of the single attribute to be included in the clause being formed (see Figure 2). If NEG($a_i$) = 0 at Step 1, then the value of the ratio POS($a_i$)/NEG($a_i$) for that particular $a_i$ is set to be equal to an arbitrarily high positive number multiplied by the POS($a_i$) value. However, this number may still be too small, and thus it should be changed according to the size of the database. There are four cases regarding the value of the POS/NEG ratio that need to be considered when selecting an attribute. These cases are:

Case #1: Multiple attributes (items) with NEG($a_i$) = 0 and equal values of the POS($a_i$)/NEG($a_i$) ratio.
Case #2: No attributes (items) with value NEG($a_i$) = 0 exist, but when all the attributes are ranked according to their POS($a_i$)/NEG($a_i$) values in descending order, the highest POS($a_i$)/NEG($a_i$) value occurs multiple times.
Case #3: A single attribute with NEG($a_i$) = 0 exists.
Case #4: There are no attributes with NEG($a_i$) = 0, but when all the attributes are ranked according to their POS($a_i$)/NEG($a_i$) values in descending order, the highest POS($a_i$)/NEG($a_i$) value occurs only once.

For cases #1 and #2, the attribute to be included in the clause being formed is randomly selected among the candidates. The candidates for case #1 are those attributes with NEG($a_i$) = 0 and equal values of POS($a_i$)/NEG($a_i$). The candidates for case #2, on the other hand, are those attributes that share the same POS($a_i$)/NEG($a_i$) value (and this is the highest value). For cases #3 and #4 there is no need for a random selection process, since there is a single attribute with the highest POS($a_i$)/NEG($a_i$) value. Thus, that particular attribute is included in the clause being formed.

Furthermore, considering only the attributes themselves and excluding their negations may cause problems, because certain degenerative situations could occur. These degenerative situations occur as follows:

Degenerative Case #1: If only one item is bought in a transaction, and if that particular item is selected to be the consequent part of the association rules sought, then the $E^+$ set will have an example (i.e., the one that corresponds to that transaction) with only zero elements. Thus, the RA heuristic (or any variant of it) will never terminate.
Hence, for simplicity it will be assumed that such degenerative transactions do not occur in our databases.

Degenerative Case #2: After forming a clause, and after the $E^-$ set is updated (Step 10 in Figure 2), the new POS/NEG values may be such that the new clause may be one of those that have been already produced earlier (i.e., it is possible to have "cycling").

Degenerative Case #3: A newly generated clause may not be able to reject any of the negative examples.

The previous is an exhaustive list of all possible degenerative situations when the original RA algorithm is used. Thus, the original RA algorithm needs to be altered in order to avoid them. Degenerative case #1 can be easily avoided by simply discarding all one-item transactions (which rarely occur in reality anyway). Degenerative cases #2 and #3 can be avoided by establishing some upper limits on the number of times a Boolean function is generated without being able to reject all the negative examples (please recall the randomized characteristic of the RA heuristic).

In order to mine association rules that have different consequents, the altered RA should be run for each one of the attributes $A_1, A_2, \ldots, A_n$. After determining the frequent item sets for each one of these attributes, one needs to calculate the support level for each frequent item set, and check whether the (preset) minimum support criterion is met. If it is, then the current association rule is reported. The proposed altered RA (to be denoted as ARA) heuristic is summarized in Figure 3.
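
The four selection cases described above can be sketched as a single function. This is one possible reading, not the authors' code: the multiplier used when NEG is zero is taken here to be the number of transactions plus one (an assumption that simply scales it with the database size), ties are detected on the resulting scores, and the names `select_attribute` and `stats` are hypothetical. Several zero-NEG attributes with different POS values are reported as case #3, since one of them has a unique highest score.

```python
import random


def select_attribute(stats, n_transactions, rng):
    """Pick the next attribute from {name: (POS, NEG)}.

    Returns (case, attribute), where case is 1..4 as in the text.
    """
    big = n_transactions + 1  # assumed size-dependent multiplier
    scores = {a: (big * pos if neg == 0 else pos / neg)
              for a, (pos, neg) in stats.items()}
    top = max(scores.values())
    tied = sorted(a for a, s in scores.items() if s == top)
    zero_neg = stats[tied[0]][1] == 0
    if len(tied) > 1:
        # Cases #1 and #2: random choice among the tied candidates.
        return (1 if zero_neg else 2), rng.choice(tied)
    # Cases #3 and #4: a single best attribute, no randomness needed.
    return (3 if zero_neg else 4), tied[0]


rng = random.Random(0)
print(select_attribute({"A1": (3, 0), "A2": (3, 0), "A3": (5, 2)}, 10, rng))
print(select_attribute({"A1": (4, 2), "A2": (2, 1), "A3": (1, 1)}, 10, rng))
print(select_attribute({"A1": (3, 0), "A2": (5, 1)}, 10, rng))
print(select_attribute({"A1": (4, 2), "A2": (3, 3)}, 10, rng))
```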

  DO for each consequent $A_1, A_2, \ldots, A_n$
  BEGIN
    Form the $E^+$ and $E^-$ sets according to the presence or absence of the current $A_i$ attribute.
    Calculate the initial POS and NEG values.
    START: DO WHILE ($E^- \neq \emptyset$)
      $C = \emptyset$; {initializations}
      DO WHILE ($E^+ \neq \emptyset$)
        Step 1: Rank in descending order all attributes $a_i \in a$ (where $a_i$ is the attribute currently under consideration) according to their POS($a_i$)/NEG($a_i$) value. If NEG($a_i$) = 0, then POS($a_i$)/NEG($a_i$) is set to an arbitrarily high number multiplied by POS($a_i$);
        Step 2: Evaluate the current POS/NEG case;
        Step 3: Choose an attribute $a_k$ accordingly;
        Step 4: Let the set of atoms in the current clause be $C \leftarrow C \cup \{a_k\}$;
        Step 5: Let $E^+(a_k)$ be the set of members of $E^+$ accepted when $a_k$ is included in the current CNF clause;
        Step 6: Let $E^+ \leftarrow E^+ - E^+(a_k)$;
        Step 7: Let $a \leftarrow a - \{a_k\}$;
        Step 8: Calculate the new POS($a_k$) values for all $a_k \in a$;
        Step 9: If $a = \emptyset$ (i.e., checking for failure case #1), then go to START;
      REPEAT
      Step 10: Let $E^-(C)$ be the set of members of $E^-$ which are rejected by $C$;
      Step 11: If $E^-(C) = \emptyset$, determine the failure case (i.e., case #2, or #3). Check whether the corresponding counter has hit the preset limit. If yes, then go to START;
      Step 12: Let $E^- \leftarrow E^- - E^-(C)$;
      Step 13: Calculate the new NEG values;
      Step 14: Let $C$ be the antecedent and $A_i$ be the consequent of the rule. Check the candidate rule $C \rightarrow A_i$ for minimum support. If it meets the minimum support level criterion, then output the rule;
      Step 15: Reset the $E^+$ set (i.e., select the examples which have $A_i$ equal to 1 and store them in set $E^+$);
    REPEAT
  END

Figure 3: The Proposed Altered Randomized Algorithm (ARA) for Mining Association Rules (for the CNF Case).
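
The overall flow (partition the database by consequent, grow an antecedent from un-negated items only, then check minimum support) can be illustrated with the sketch below. It is not a line-by-line implementation of Figure 3: it works directly in the DNF view, where an antecedent is a conjunction of items that no negative example contains, and it simply gives up on a consequent when no such antecedent can be grown instead of keeping failure counters. The transactions, the size-dependent multiplier, and the function names are assumptions of this sketch.

```python
import random


def grow_antecedent(pos, neg, candidates, rng):
    """Greedily add items until no negative example contains all of them."""
    big = len(pos) + len(neg) + 1  # assumed size-dependent multiplier
    antecedent, cover_pos, cover_neg = set(), list(pos), list(neg)
    available = set(candidates)
    while cover_neg:
        scores = {}
        for a in available:
            p = sum(1 for t in cover_pos if a in t)
            q = sum(1 for t in cover_neg if a in t)
            if p > 0:
                scores[a] = big * p if q == 0 else p / q
        if not scores:
            return None  # no usable item is left: give up
        top = max(scores.values())
        pick = rng.choice(sorted(a for a, s in scores.items() if s == top))
        antecedent.add(pick)
        available.discard(pick)
        cover_pos = [t for t in cover_pos if pick in t]
        cover_neg = [t for t in cover_neg if pick in t]
    return frozenset(antecedent) or None


def mine_rules(transactions, min_support, seed=0):
    """Return (antecedent, consequent, support) triples with 100% confidence."""
    rng = random.Random(seed)
    items = sorted(set().union(*transactions))
    rules = []
    for consequent in items:
        pos = [t - {consequent} for t in transactions if consequent in t]
        pos = [t for t in pos if t]  # discard one-item transactions
        neg = [t for t in transactions if consequent not in t]
        candidates = [a for a in items if a != consequent]
        remaining = pos
        while remaining:
            antecedent = grow_antecedent(remaining, neg, candidates, rng)
            if antecedent is None:
                break
            sup = sum(1 for t in transactions
                      if antecedent <= t and consequent in t)
            if sup >= min_support:
                rules.append((antecedent, consequent, sup))
            remaining = [t for t in remaining if not antecedent <= t]
    return rules


# Hypothetical transactions, for illustration only.
DATA = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"},
        {"a"}, {"b"}, {"a", "d"}]
RULES = mine_rules(DATA, min_support=3)
for antecedent, consequent, sup in RULES:
    print(sorted(antecedent), "->", consequent, "support", sup)
```

Every reported rule has a confidence of 100% by construction, because growth stops only when no transaction without the consequent contains the whole antecedent.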

4. COMPUTATIONAL EXPERIMENTS

In order to compare the altered RA (ARA) heuristic with some of the existing association rule methods, we applied them on several synthetic databases that were generated by using the data generation programs described in [Agrawal and Srikant, 1994]. The web address (URL) of these codes is: http://www.almaden.ibm.com/cs/quest/syndata.html. These databases contain transactions that reflect the real world, where people tend to buy sets of certain items together. Several databases were used in making these comparisons. The sizes of the databases used are as follows:

Database #1: , items with , transactions (the minimum support was set to 5).
Database #2: , items with , transactions (the minimum support was set to 5).
Database #3: 5 items with 5, transactions (the minimum support was set to ).
Database #4: 5 items with ,5 transactions (the minimum support was set to ).
Database #5: 5 items with , transactions (the minimum support was set to ).

The first results are from the densest database used in [Agrawal and Srikant, 1994], that is, database #1. The Apriori algorithm was still in the process of generating the frequent item sets of length after 8 hours minutes and 8 seconds when database #1 was used. Therefore, the experiment with the Apriori algorithm was aborted. However, the ARA algorithm completed mining the very same database in only hours minutes and second. The ARA algorithm mined a single rule for each one of the following support levels: 59, 63, 38,, 535, 63, 6, 756, 78, 98, and,93. All the experiments were run on an IBM 967/R53 computer. This processor is a - engine box with each engine being rated at 6 MIPS (millions of instructions per second).

For the experiments with database #2, however, some parallel computing techniques were utilized for the Apriori algorithm. The frequent item sets were gathered into smaller groups, making it possible to build the next frequent item sets in a shorter time. Each group was analyzed separately, and the CPU times for each one of these jobs were added together at the end.
The Apriori algorithm completed mining this database in 59 hours 5 minutes and 3 seconds. Figure 4 illustrates the number of rules for this case. On the other hand, the ARA algorithm mined database #2 in only hours 5 minutes and 57 seconds. These results are depicted in Figure 5.

Figure 4: Histogram of the Results When the Apriori Approach Was Used on Database #2 (number of rules mined versus support level).

Figure 5: Histogram of the Results When the ARA Approach Was Used on Database #2 (number of rules mined versus support level).

It should be noted here that the CPU times recorded for the Apriori experiments in this research were higher than the similar results reported in [Agrawal and Srikant, 1994]. For instance, it was reported in [Agrawal and Srikant, 1994] that the Apriori algorithm took approximately 5 seconds to mine database #1. That result was obtained on an IBM RS/6000 530H workstation with a main memory of 64 MB, running AIX 3.2. On the other hand, for database #1, the Apriori program written for this research was still in the process of generating item sets of length after 8 hours minutes and 8 seconds. The only difference between the approach taken in [Agrawal and Srikant, 1994] and the one in this research is that the candidate item sets in [Agrawal and Srikant, 1994] were stored in a hash tree. Hashing is a data storage technique that provides fast direct access to a specific stored record on the basis of a given value for some field [Date, 1995]. In this research, hash trees were not used for storing candidate item sets; instead, they were kept in the main memory of the computer. This made it faster to access candidate item sets because direct access is generally very expensive CPU-wise. It is believed that the programming techniques and the type of computers used in [Agrawal and Srikant, 1994] are causing the CPU time difference. Additionally, the Apriori code in this research was run under a time-sharing option, which again could make a big difference. As mentioned earlier, the computer codes for the Apriori and the ARA algorithms were run on an IBM 967/R53 computer.

The results obtained by using database # suggest that ARA produced a reasonable number of rules fast. Also, these rules were of high quality since, by construction, all had a 100% confidence level. After obtaining these results, it was decided to mine the remaining databases by also using a commercial software package, namely MineSet by Silicon Graphics. MineSet is one of the most commonly used data mining computer packages. Unfortunately, MineSet works with transactions of a fixed length.
Therefore, the transactions were coded as zeros and ones, with zeros representing that the corresponding item was not bought and ones representing that the corresponding item was bought. However, this causes MineSet to mine negative association rules, too. Negative association rules are rules based on the absence of items in the transactions rather than their presence, so negations of attributes may appear in the rule structure. Another drawback of MineSet is that only a single item is supported in both the left and the right hand sides of the rules to be mined. Also, the current version of MineSet allows for a maximum of 5 items in each transaction. The MineSet software used for this study was installed on a Silicon Graphics workstation, which had a CPU clock rate of 5 MHz and a RAM of 5MB.

The fact that MineSet supports only a single item in both the left and the right hand sides of the association rules suggests that it uses a search procedure of quadratic time complexity. Such an approach would first have to count the support of each item when it is compared with every other item, and store these supports in a triangular matrix of dimension n (i.e., equal to the number of attributes). During the same pass over the database, the supports of the individual items could be counted, and the rest would only be a matter of checking whether the result is above the preset minimum confidence level. For instance, when checking the candidate association rule A1 → A6, the confidence level would be the support of A1A6 divided by the support of A1. On the other hand, when doing the same for the rule A6 → A1, the confidence level would be the support of A1A6 divided by the support of A6. Therefore, such an approach requires n(n-1)/2 operations (where n is the number of attributes or items). If D is the number of transactions (records) in the database, then the time complexity of this approach is O(Dn^2). This is the same time complexity that the ARA approach has. However, for the ARA case, this complexity is for the worst-case scenario. The ARA algorithm will stop as soon as it has produced a Boolean function that accepts all the positive and rejects all the negative examples. In addition, the ARA approach is able to mine rules with multiple items in the antecedent part of an association rule. The ARA approach can also be easily adapted to mine association rules with multiple items in the consequent part. The only change that has to be made is in the partitioning (dichotomization) of the original database into the sets of the positive and negative examples. On the other hand, the Apriori approach has an exponential time complexity because it follows a combinatorial search approach.

When database #3 was used, it took MineSet 3 minutes and seconds to mine the association rules. On the other hand, it took ARA just 6 minutes and 5 seconds to mine the same database. Figures 6 and 7 provide the number of the mined rules from database #3. When database #4 was used, it took MineSet 8 minutes and 3 seconds to mine association rules.
For the ARA approach, the required time was only 5 minutes and 6 seconds. These results are depicted in Figures 8 and 9. For database #5, it took MineSet 5 minutes and seconds to mine the rules. On the other hand, it took only minutes and 3 seconds when the ARA approach was used on the same database. The corresponding results are depicted in Figures 10 and 11. Table 1 presents a summary of all the above. From these results it becomes evident that the ARA approach derives association rules faster and that these rules have much higher support levels.
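The quadratic pairwise counting procedure discussed above for single-item rules can be sketched as follows. This is only an illustration of the idea, under stated assumptions: it is not MineSet's actual implementation (which is not documented here), the function names are hypothetical, only the presence of items is counted (so no negative rules are produced), and the upper triangle of the pair matrix is stored in a dictionary keyed by index pairs (i, j) with i < j.

```python
from itertools import combinations


def encode(transactions, items):
    """Code each transaction as a fixed-length 0/1 row (1 = item bought)."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


def pairwise_rules(rows, items, min_confidence):
    """One pass over the 0/1 rows counts single-item and pair supports,
    then both rule directions are checked for each co-occurring pair."""
    single = [0] * len(items)
    pair = {}  # upper triangle only: (i, j) with i < j -> co-occurrence count
    for row in rows:
        present = [i for i, bit in enumerate(row) if bit]
        for i in present:
            single[i] += 1
        for i, j in combinations(present, 2):
            pair[(i, j)] = pair.get((i, j), 0) + 1

    rules = []
    for (i, j), both in pair.items():
        for a, b in ((i, j), (j, i)):
            # confidence(a -> b) = support(a and b) / support(a)
            confidence = both / single[a]
            if confidence >= min_confidence:
                rules.append((items[a], items[b], both, confidence))
    return rules


if __name__ == "__main__":
    items = ["A1", "A6", "A7"]
    db = [{"A1", "A6"}, {"A1", "A6"}, {"A1"}, {"A6", "A7"}]
    for rule in pairwise_rules(encode(db, items), items, 0.6):
        print(rule)
```

Each transaction contributes at most n(n-1)/2 pair updates, which is where the O(Dn^2) bound quoted above comes from.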

Figure 6: Histogram of the Results When the MineSet Software Was Used on Database #3 (number of rules mined versus support level).

Figure 7: Histogram of the Results When the ARA Approach Was Used on Database #3 (number of rules mined versus support level).

Figure 8: Histogram of the Results When the MineSet Software Was Used on Database #4 (number of rules mined versus support level).

Figure 9: Histogram of the Results When the ARA Approach Was Used on Database #4 (number of rules mined versus support level).

Figure 10: Histogram of the Results When the MineSet Software Was Used on Database #5 (number of rules mined versus support level).

Figure 11: Histogram of the Results When the ARA Approach Was Used on Database #5 (number of rules mined versus support level).

Table 1: Summary of the Required CPU Times Under Each Method.

               Apriori CPU      ARA CPU        MineSet CPU
               (hh:mm:ss)       (hh:mm:ss)     (hh:mm:ss)
Database #1    Not completed    ::             N/A
Database #2    59:5:3           :5:57          N/A
Database #3    N/A              :6:5           :3:
Database #4    N/A              :5:6           :8:3
Database #5    N/A              ::3            :5:

5. CONCLUSIONS

This paper presented the development of a new approach for deriving association rules from databases. The new approach is called ARA, and it is based on a previous algorithm (i.e., the RA approach) that was developed by one of the authors and his associates in [Deshpande and Triantaphyllou, 1998]. Both the old and the new approach are randomized algorithms. The proposed ARA approach produces a small set of association rules in quadratic time. Furthermore, these rules are of high quality with 100% support levels. The 100% support level of the derived rules is a characteristic of the way the ARA approach constructs association rules. The ARA approach can be further extended to handle cases with less than 100% support levels. This can be done by introducing stopping rules that terminate the appropriate loops in Figure 3. That is, one can set a predetermined lower limit (i.e., a percentage less than 100%) on the positive examples to be accepted by each clause (in the CNF case), and also a predetermined percentage of the negative examples to be rejected, instead of requiring all the positive examples to be accepted and all the negative examples to be rejected, as is the case now.

An extensive empirical study was also undertaken. The Apriori approach and the MineSet software by Silicon Graphics were compared with the proposed ARA algorithm. The computational results demonstrated that the new approach can be both highly efficient and effective. The above observations strongly suggest that the proposed ARA algorithm is very promising for mining association rules in today's world of ever-growing and diverse databases.
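The relaxed stopping rule outlined in the conclusions can be illustrated with a small predicate. This is a hypothetical sketch only: the function name, its parameters, and the default thresholds of 1.0 (i.e., the current all-or-nothing behavior) are assumptions made for illustration and do not come from the ARA implementation.

```python
def clause_meets_thresholds(pos_accepted, pos_total, neg_rejected, neg_total,
                            min_accept=1.0, min_reject=1.0):
    """Return True when a clause accepts at least `min_accept` of the
    positive examples and rejects at least `min_reject` of the negative
    examples. An empty example set is treated as trivially satisfied."""
    accept_ratio = pos_accepted / pos_total if pos_total else 1.0
    reject_ratio = neg_rejected / neg_total if neg_total else 1.0
    return accept_ratio >= min_accept and reject_ratio >= min_reject


if __name__ == "__main__":
    # Strict setting (current behavior): 9 of 10 positives is not enough.
    print(clause_meets_thresholds(9, 10, 10, 10))
    # Relaxed setting: accept the clause once 90% of the positives are covered.
    print(clause_meets_thresholds(9, 10, 10, 10, min_accept=0.9))
```

With both thresholds left at 1.0 the predicate reduces to the loop conditions of Figure 3; lowering either threshold would let the corresponding loop terminate earlier.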

REFERENCES

1. Agrawal, R., Imielinski, T., and Swami, A., Mining Association Rules between Sets of Items in Large Databases, Proceedings of the 1993 ACM SIGMOD Conference, Washington, DC, May 1993.
2. Agrawal, R., and Srikant, R., Fast Algorithms for Mining Association Rules, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
3. Bayardo Jr., R. J., Agrawal, R., and Gunopulos, D., Constraint-Based Rule Mining in Large, Dense Databases, Proceedings of the 15th International Conference on Data Engineering, 1999.
4. Blaxton, T., and Westphal, C., Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley & Sons, Inc., pp. 86-89, New York, NY, 1998.
5. Date, C. J., An Introduction to Database Systems, Addison-Wesley Publishing Company, Reading, MA, 1995.
6. Deshpande, A. S., and Triantaphyllou, E., A Greedy Randomized Adaptive Search Procedure (GRASP) for Inferring Logical Clauses from Examples in Polynomial Time and Some Extensions, Mathematical and Computer Modelling, Vol. 27, pp. 75-99, 1998.
7. Houtsma, M., and Swami, A., Set Oriented Mining of Association Rules, Technical Report RJ 9567, IBM, October 1993.
8. Peysakh, J., A Fast Algorithm to Convert Boolean Expressions into CNF, IBM Computer Science RC 93(#5797), Watson, NY, 1987.
9. Savasere, A., Omiecinski, E., and Navathe, S., An Efficient Algorithm for Mining Association Rules in Large Databases, Data Mining Group, Tandem Computers, Inc., Austin, TX, 1995.
10. Savasere, A., Omiecinski, E., and Navathe, S., Mining for Strong Negative Associations in a Large Database of Customer Transactions, Proceedings of the IEEE 14th International Conference on Data Engineering, Orlando, FL, 1998.
11. Triantaphyllou, E., Inference of a Minimum Size Boolean Function from Examples by Using a New Efficient Branch and Bound Approach, Journal of Global Optimization, Vol. 5, pp. 69-94, 1994.
12. Triantaphyllou, E., Soyster, A. L., and Kumara, S. R. T., Generating Logical Expressions from Positive and Negative Examples via a Branch and Bound Approach, Computers and Operations Research, Vol. 21, pp. 185-197, 1994.
13. Triantaphyllou, E., and Soyster, A. L., A Relationship Between CNF and DNF Systems Derivable from Examples, ORSA Journal on Computing, Vol. 7, pp. 283-285, 1995.
14. Toivonen, H., Sampling Large Databases for Association Rules, Proceedings of the 22nd VLDB Conference, Bombay, India, 1996.