METHODS FOR BATCH PROCESSING OF DATA MINING QUERIES

Marek Wojciechowski and Maciej Zakrzewicz
Institute of Computing Science, Poznan University of Technology
ul. Piotrowo 3a, Poznan, Poland

Abstract: Data mining is a useful decision support technique, which can be used to find trends and regularities in warehouses of corporate data. A serious problem of its practical applications is the long processing time required by data mining algorithms. Current systems consume minutes or hours to answer single requests, while typically batches of requests are delivered to the systems. In this paper we present the problem of batch processing of data mining requests. We introduce methods that analyze similarities between separate requests to reduce the processing cost. We also perform a comparative performance analysis of the proposed methods.

Key words: data mining, multiple query optimization.

1. INTRODUCTION

Data mining, also referred to as database mining or knowledge discovery in databases (KDD), aims at discovery of useful patterns from large databases or warehouses [1][2][4][6][10][11][12]. Currently we are observing the evolution of data mining environments from specialized tools to multi-purpose data mining systems offering some level of integration with existing database management systems. From a user's point of view data mining can be seen as advanced querying: a user specifies the source data set and the requested class of patterns, the system chooses the right data mining algorithm and returns discovered patterns to the user [3][5][7][8][9].

The most serious problem concerning data mining queries is a long response time. Current systems consume minutes or hours to answer single queries. Data mining applications typically execute data mining queries during nights, when system activity is low. Sets of data mining queries are scheduled and then automatically evaluated by a data mining system. It is possible that the data mining queries delivered to the system are somehow similar, e.g. their source data sets overlap. Unfortunately, none of the proposed data mining algorithms has tried to exploit such similarity of data mining requests to reduce their processing cost.

In this paper we present the problem of batch processing of data mining queries. We describe and analyze three methods of executing batches of data mining queries in a more efficient way. We illustrate our methods with many examples expressed in MineSQL, which is a declarative, multi-purpose SQL-like language for interactive and iterative data mining in relational databases, developed by us over the last couple of years [8][9].

1.1 Basic Definitions

Frequent itemsets. Let L = {l_1, l_2, ..., l_m} be a set of literals, called items. Let a non-empty set of items T be called an itemset. Let D be a set of variable length itemsets, where each itemset T ⊆ L. We say that an itemset T supports an item x ∈ L if x is in T. We say that an itemset T supports an itemset X ⊆ L if T supports every item in the set X. The support of the itemset X is the percentage of T in D that support X. The problem of mining frequent itemsets in D consists in discovering all itemsets whose support is above a user-defined support threshold.

Apriori algorithm. Apriori is an example of a level-wise algorithm for association discovery. It makes multiple passes over the input data to determine all frequent itemsets. Let L_k denote the set of frequent itemsets of size k and let C_k denote the set of candidate itemsets of size k. Before making the k-th pass, Apriori generates C_k using L_{k-1}. Its candidate generation process ensures that all subsets of size k-1 of every itemset in C_k are members of the set L_{k-1}. In the k-th pass, it then counts the support for all the itemsets in C_k. At the end of the pass all itemsets in C_k with a support greater than or equal to the minimum support form the set of frequent itemsets L_k. Figure 1 provides the pseudocode for the general level-wise algorithm, and its Apriori implementation. The subset(t, k) function gives all the subsets of size k in the set t. This method of pruning the C_k set using L_{k-1} results in a much more efficient support counting phase for Apriori when compared to the earlier algorithms. In addition, the usage of a hash-tree data structure for storing the candidates provides a very efficient support-counting process.
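The level-wise procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `apriori` and `generate_candidates` are our own, the minimum support is assumed to be an absolute row count (as in the MineSQL examples of this paper) rather than a percentage, and candidates are counted with a plain scan instead of a hash-tree.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build size-k candidates whose (k-1)-subsets are all in prev_frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(database, minsup):
    """Return {itemset: count} for itemsets found in at least minsup rows."""
    database = [frozenset(t) for t in database]
    candidates = {frozenset([item]) for t in database for item in t}  # C_1
    answer = {}
    k = 1
    while candidates:
        # One pass over the data per level: count every candidate.
        counts = {c: sum(1 for t in database if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}  # L_k
        answer.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)  # C_{k+1}
    return answer


# Hypothetical toy data: three baskets, minimum support of three rows.
rows = [{"d", "f", "g"}, {"f", "g", "k", "m"}, {"e", "f", "g"}]
result = apriori(rows, minsup=3)
```

Each iteration of the `while` loop corresponds to one pass over the input data, which is why the number of passes, and therefore the amount of data read, grows with the size of the longest frequent itemset.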

General level-wise algorithm:

    C_1 = {all 1-itemsets from D}
    for (k = 1; C_k ≠ ∅; k++)
        count(C_k, D);
        L_k = {c ∈ C_k | c.count ≥ minsup};
        C_{k+1} = generate_candidates(L_k);
    Answer = ∪_k L_k;

Apriori implementation:

    L_1 = {frequent 1-itemsets}
    for (k = 2; L_{k-1} ≠ ∅; k++)
        C_k = generate_candidates(L_{k-1});
        forall tuples t ∈ D
            C_t = C_k ∩ subset(t, k);
            forall candidates c ∈ C_t
                c.count++;
        L_k = {c ∈ C_k | c.count ≥ minsup}
    Answer = ∪_k L_k;

Figure 1. Level-wise algorithm for association discovery and its Apriori implementation

1.2 MineSQL Data Mining Query Language

MineSQL is a SQL language extension we presented in [9] as a tool to formulate data mining queries. The main MineSQL statement is MINE, designed to discover frequent patterns from a result of a SELECT query. The discovered patterns may be filtered by means of user-defined conditions. We also introduced new datatypes to allow storing itemsets in database relations: SET OF CHAR, SET OF INTEGER, etc., the SET() grouping function, as well as the CONTAINS operator used to determine if one set of items contains another set of items. In [8][13] we extended MineSQL with data mining materialized views and sequential pattern processing operators.

The following example statements illustrate MineSQL capabilities to create a database relation to hold sets of integers and to discover all frequent itemsets with support greater than 0 in the first 00 tuples of the relation.

    create table mysets ( integer, s set of integer)

    mine itemset
    from (select s from mysets where <00)
    where support(itemset) > 0

The next example illustrates MineSQL capabilities to store results of a data mining query:

    create table mypatterns (s set of integer)

    insert into mypatterns
    mine itemset
    from (select s from mysets where <00)
    where support(itemset) > 0

2. PRELIMINARIES AND PROBLEM STATEMENT

Data mining query. A data mining query is a tuple DMQ = (R, a, Σ, Φ), where R is a relation, a is an attribute of R, Σ is a condition involving the attributes of the relation R, and Φ is a condition involving discovered patterns.

The result of the data mining query is a set of patterns discovered in π_a σ_Σ R and satisfying Φ.

Example. Given the relation R shown in Fig. 2a, the result of the data mining query DMQ = (R, "set", "id > 5 AND id < 0", "minsup = 3") is shown in Fig. 2b.

    mine itemset
    from (select items from R where id > 5 and id < 0)
    where support(itemset) > 3;

    R:
    id    set
    ----  --------
          a,b,c
    4     a,c
    6     d,f,g
    7     f,g,k,m
    8     e,f,g
    5     a,f

Figure 2a. Example relation R

    Result of DMQ:
    {f}
    {g}
    {f,g}

Figure 2b. DMQ query result

Problem statement. Given a set S = {DMQ_1, DMQ_2, ..., DMQ_n} of data mining queries, where DMQ_i = (R_i, a_i, Σ_i, Φ_i) and σ_Σi(R_i) ∩ σ_Σj(R_j) ≠ ∅ for some i ≠ j, the goal is to minimize the I/O cost and the CPU cost of executing S.

2.1 Motivating example

Consider a relation Sales(uad, basket, time) to store purchases made by users of an internet shop. Since data sets of this kind tend to be very large, there is a need for automated analysis of their contents. Assume a shop manager is interested in finding sets of products that frequently co-occurred in the users' purchases. The shop manager plans to create two reports: one showing the frequent sets that appeared in more than 350 purchases in Jan 00 and one showing the frequent sets that appeared in more than 0 purchases made by clients from France. The two required data mining queries are shown below.

    DMQ1: mine itemset
          from (select basket from sales
                where time between '0-0-0' and '0-3-0')
          where support(itemset) > 350

    DMQ2: mine itemset
          from (select basket from sales
                where uad like '%.fr')
          where support(itemset) > 0

If the size of the Sales relation is very large, each of the above data mining queries can take a significant amount of time to execute. Part of this time will be spent on reading the Sales relation from disk in order to count occurrences of candidate itemsets. Notice that the sets of blocks to be read by the two data mining queries may overlap. If we merge the processing of the two data mining queries, we can reduce the redundancy resulting from this overlap. In the remainder of this paper we will use this example to illustrate particular methods.
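To make the definition of a data mining query concrete, the sketch below models the tuple (R, a, Σ, Φ) as a small Python class and evaluates it by brute force. It is an illustration under stated assumptions, not MineSQL semantics: the class name `DMQ` and the functions `source_dataset` and `evaluate` are our own, Φ is reduced to a single absolute support threshold, and the row ids and the selection condition in the toy relation are hypothetical values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable


@dataclass
class DMQ:
    relation: list      # R: list of row dictionaries
    attribute: str      # a: name of the set-valued attribute
    sigma: Callable     # Σ: selection predicate over a row
    minsup: int         # Φ, reduced here to "support >= minsup" (assumption)


def source_dataset(q):
    """π_a σ_Σ R: the itemsets the query actually mines."""
    return [frozenset(row[q.attribute]) for row in q.relation if q.sigma(row)]


def evaluate(q):
    """Brute force: count every subset of every selected row, keep frequent ones."""
    counts = Counter()
    for itemset in source_dataset(q):
        for size in range(1, len(itemset) + 1):
            for sub in combinations(sorted(itemset), size):
                counts[frozenset(sub)] += 1
    return {s for s, n in counts.items() if n >= q.minsup}


# Hypothetical relation; ids and the id range are illustrative assumptions.
R = [
    {"id": 1, "set": {"a", "b", "c"}},
    {"id": 4, "set": {"a", "c"}},
    {"id": 6, "set": {"d", "f", "g"}},
    {"id": 7, "set": {"f", "g", "k", "m"}},
    {"id": 8, "set": {"e", "f", "g"}},
    {"id": 15, "set": {"a", "f"}},
]
query = DMQ(R, "set", lambda row: 5 < row["id"] < 10, minsup=3)
patterns = evaluate(query)
```

Two queries over the same relation overlap exactly when their `source_dataset` results share rows, which is the situation the batch methods in this paper try to exploit.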

3. MODEL OF LEVEL-WISE ASSOCIATION DISCOVERY ALGORITHM

In order to describe methods for batch processing of data mining queries, we first need to introduce a notation to express steps of a level-wise association discovery algorithm. We decided to use the extended relational algebra to model the level-wise algorithm processing in the following way. Each candidate counting step is represented as a relational join, followed by grouping and selection operations. Figure 3 shows the SQL query and the relational algebra graph for the candidate counting step; C(s) is the candidates relation, R(s) is the database relation. The candidate generation step is represented as a simple relational join. Figure 4 shows the SQL query and the relational algebra graph for this case.

    select c.s, count(r.s)
    from c, r
    where r.s contains c.s
    group by c.s
    having count(r.s) >= minsup

    σ_{COUNT(R.s) ≥ minsup} ( γ_{C.s, COUNT(R.s)} ( C ⋈_{C.s ⊆ R.s} R ) )

Figure 3. Candidate counting-pruning step modeled with relational algebra

    select union(l1.s, l2.s) as cand
    from l l1, l l2
    where size(difference(l1.s, l2.s)) = 1
    group by cand
    having count(*) = k*(k-1)/2

    π_{L1.s ∪ L2.s} ( σ_{COUNT(*) = k(k-1)/2} ( γ_{L1.s ∪ L2.s, COUNT(*)} ( L1 ⋈_{|L1.s − L2.s| = 1} L2 ) ) )

Figure 4. Candidate generation step for C_k modeled with relational algebra

In order to analyze the general model of the level-wise association discovery algorithm, we make the following assumptions: (1) the size of the database is much larger than the size of all candidate itemsets, (2) the size of all candidate itemsets is larger than the memory size, and (3) frequent itemsets fit in memory. The notation we use is given in Table 1.
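The two relational steps modeled in Figures 3 and 4 can be mimicked directly in Python, which may help in reading the join, grouping, and selection operators. This is a sketch with hypothetical function names (`count_and_prune`, `generate`). One assumption is worth stating: the generation step below joins unordered pairs of frequent (k-1)-itemsets, so that a k-itemset whose k subsets are all frequent is produced by exactly k(k-1)/2 pairs, which is the threshold used in the HAVING clause.

```python
from collections import Counter
from itertools import combinations


def count_and_prune(candidates, database, minsup):
    """Join on 'r.s contains c.s', group by c.s, keep groups with count >= minsup."""
    counts = Counter(c for c in candidates for r in database if c <= r)
    return {c for c, n in counts.items() if n >= minsup}


def generate(frequent, k):
    """Self-join frequent (k-1)-itemsets that differ in one item, group by union."""
    pairs = Counter(
        l1 | l2
        for l1, l2 in combinations(frequent, 2)   # unordered pairs (assumption)
        if len(l1 - l2) == 1
    )
    # A candidate survives only if every pair of its (k-1)-subsets was joined.
    return {cand for cand, n in pairs.items() if n == k * (k - 1) // 2}


# Hypothetical toy data.
db = [frozenset("dfg"), frozenset("fgkm"), frozenset("efg")]
frequent_1 = count_and_prune({frozenset("f"), frozenset("g"), frozenset("d")}, db, 3)
candidates_3 = generate(
    {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}, 3
)
```

In the second call only {a,b,c} is generated: {a,b,d} and {b,c,d} each arise from a single pair, so they fail the count condition, which plays the role of the subset-based pruning in Apriori.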

Table 1. Notation used in cost models

    M          main memory size (blocks)
    |D|        number of itemsets in the database
    ||D||      size of the database (blocks)
    |C_i|      number of candidate itemsets for step i
    ||C_i||    size of all candidate itemsets for step i (blocks), ||C_i|| << ||D||, M < ||C_i||
    |L_i|      number of frequent itemsets for step i, |L_i| < |C_i|
    ||L_i||    size of all frequent itemsets for step i (blocks), ||L_i|| < M

The cost of performing the general level-wise association discovery algorithm is as follows:

1. Candidate counting-pruning. Candidate itemsets must be read from disk in portions equal to the available memory size. For each portion, the database must be scanned to join itemsets from C_i with itemsets from D. Next, the candidate itemsets with support greater than or equal to minsup become frequent itemsets and must be written to disk. The I/O cost of a single iteration is the following:

    $$cost_{I/O} = \|C_i\| + \frac{\|C_i\|}{M}\,\|D\| + \|L_i\|$$

The dominant part of the CPU cost is join condition verification. For simplicity, we assume the cost of comparing two itemsets does not depend on their sizes and equals 1. Thus, the CPU cost of a single iteration is the following:

    $$cost_{CPU} = |C_i| \cdot |D|$$

2. Candidate generation. Frequent itemsets from the previous iteration must be read from disk, joined in memory, and saved as new candidate itemsets. The I/O cost of a single iteration is the following:

    $$cost_{I/O} = \|L_i\| + \|C_{i+1}\|$$

The CPU cost of this phase of the algorithm is the following:

    $$cost_{CPU} = |L_i| \cdot |L_i|$$

Therefore, if n is the number of iterations, the overall cost of the level-wise algorithm is as follows:

    $$cost_{I/O} = \sum_{i=1}^{n} \left( \|C_i\| + \frac{\|C_i\|}{M}\,\|D\| + 2\,\|L_i\| + \|C_{i+1}\| \right)$$

    $$cost_{CPU} = \sum_{i=1}^{n} \left( |C_i| \cdot |D| + |L_i|^2 \right)$$
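As a worked illustration of the cost model, the sketch below adds up the I/O and CPU costs over all iterations, reading the prose literally: counting reads the candidates, scans the database once per memory-sized portion of candidates, and writes the frequent itemsets; generation reads the frequent itemsets, joins them with themselves, and writes the next candidates. The function name `levelwise_cost`, the rounding of the number of portions up to a whole number, and all input figures are assumptions made for this example only.

```python
import math


def levelwise_cost(M, D_blocks, D_count, steps):
    """Total (I/O, CPU) cost of a level-wise run.

    steps holds one tuple per iteration:
    (number of candidates, candidate blocks, number of frequent, frequent blocks).
    """
    io = cpu = 0
    for i, (c_count, c_blocks, l_count, l_blocks) in enumerate(steps):
        # Counting-pruning: one database scan per memory-sized candidate portion.
        portions = math.ceil(c_blocks / M)  # rounding up is our assumption
        io += c_blocks + portions * D_blocks + l_blocks
        cpu += c_count * D_count            # unit cost per itemset comparison
        # Generation: read frequent itemsets, write the next candidates.
        next_c_blocks = steps[i + 1][1] if i + 1 < len(steps) else 0
        io += l_blocks + next_c_blocks
        cpu += l_count * l_count
    return io, cpu


# Hypothetical figures: 10 blocks of memory, a 1000-block database of 5000 itemsets.
io_cost, cpu_cost = levelwise_cost(
    M=10, D_blocks=1000, D_count=5000,
    steps=[(100, 20, 40, 5), (60, 15, 10, 2)],
)
```

With these figures the database scans dominate the I/O total, which is the term the batch methods of the next section try to share between queries.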

4. METHODS FOR BATCH PROCESSING OF DATA MINING QUERIES

In this Section we present three methods for processing batches of data mining queries. The first one represents a trivial approach, where we execute each DMQ separately. We call this method Sequential Processing. The second method, called Common Counting, integrates the counting phase of the level-wise algorithm to reduce the I/O cost. The third method, called Mine Merge, splits DMQs into a new set of disjoint DMQs. Their results are used to answer the original queries.

4.1 Sequential Processing

In the Sequential Processing method, each DMQ is executed separately. We do not try to benefit from using common disk blocks by two separate data mining queries. Figure 5 gives the model and pseudocode for this method (a superscript denotes the query for which a set is generated, e.g. C^1 means C generated for DMQ1). The cost of this method, in terms of both I/O and CPU, is equal to the sum of the costs of independent execution of each of the queries.

    C^1_1 = {all 1-itemsets from D^1}
    for (k = 1; C^1_k ≠ ∅; k++)
        count(C^1_k, D^1);
        L^1_k = {c ∈ C^1_k | c.count ≥ minsup1};
        C^1_{k+1} = generate_candidates(L^1_k);
    Answer^1 = ∪_k L^1_k;

    C^2_1 = {all 1-itemsets from D^2}
    for (k = 1; C^2_k ≠ ∅; k++)
        count(C^2_k, D^2);
        L^2_k = {c ∈ C^2_k | c.count ≥ minsup2};
        C^2_{k+1} = generate_candidates(L^2_k);
    Answer^2 = ∪_k L^2_k;

Figure 5. Model of the Sequential Processing method

4.2 Common Counting

When two or more different DMQs count their candidate itemsets in the same part of the database, the common part of their counting steps is integrated and requires only one scan of the involved part of the database. A model of a single step of the Common Counting algorithm and its procedural implementation are shown in Fig. 6.

    C^1_1 = {all 1-itemsets from D^1}
    C^2_1 = {all 1-itemsets from D^2}
    for (k = 1; C^1_k ∪ C^2_k ≠ ∅; k++)
        if C^1_k ≠ ∅ count(C^1_k, D^{1-2});
        if C^2_k ≠ ∅ count(C^2_k, D^{2-1});
        count(C^1_k ∪ C^2_k, D^{1∩2});
        L^1_k = {c ∈ C^1_k | c.count ≥ minsup1};
        L^2_k = {c ∈ C^2_k | c.count ≥ minsup2};
        C^1_{k+1} = generate_candidates(L^1_k);
        C^2_{k+1} = generate_candidates(L^2_k);
    Answer^1 = ∪_k L^1_k;
    Answer^2 = ∪_k L^2_k;

Figure 6. Model of the Common Counting method

Example. Using the original database selection conditions, we construct three separate dataset definitions:

1. select basket from sales where time between '0-0-0' and '0-3-0' and NOT uad like '%.fr'
2. select basket from sales where time between '0-0-0' and '0-3-0' and uad like '%.fr'
3. select basket from sales where NOT time between '0-0-0' and '0-3-0' and uad like '%.fr'

Next, we scan the first query's result in order to count DMQ1 candidate itemsets, then we scan the second query's result in order to count both DMQ1 and DMQ2 candidate itemsets, and finally we scan the third query's result in order to count DMQ2 candidate itemsets. Notice that none of the database blocks needs to be read twice, if the candidate itemsets fit in memory.

Let us analyze the cost of this method. Here D^1 and D^2 denote the source data sets of DMQ1 and DMQ2, D^{1-2} and D^{2-1} the parts read by only one of the queries, and D^{1∩2} their common part. Candidate itemsets of DMQ1 must be read, joined with D^{1-2}, counted, and saved to disk. Also, candidate itemsets of DMQ2 must be read, joined with D^{2-1}, counted, and saved to disk. Next, all candidates of DMQ1 and DMQ2 must be read, joined with D^{1∩2}, counted, and saved to disk. The candidate itemsets with support greater than or equal to, respectively, minsup1 or minsup2, become frequent itemsets and are written to disk. In order to generate new candidate itemsets, all frequent itemsets must be read from disk and new candidate itemsets must be written to disk. Therefore, the I/O cost of this method is the following:

    [I/O cost formula not recoverable]

Similarly, the CPU cost is as follows:

    [CPU cost formula not recoverable]

4.3 Mine Merge

This method employs the property that an itemset which is frequent in a whole data set must also be frequent in at least one portion of it [4][13]. In the Mine Merge method, each pair of overlapping DMQs is divided into three separate DMQs. Next, the new DMQs are executed sequentially. The results of the new DMQs are candidates to determine the results of the original DMQs. Therefore, an additional counting step is needed to finally answer the original DMQs. The pseudocode of the method and a model of the additional step are given in Fig. 7.
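The Mine Merge idea can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `mine`, `verify`, and `mine_merge` are hypothetical, a brute-force miner stands in for the level-wise algorithm, and support is taken as a fraction of the rows (the percentage definition of Section 1.1), because the partition property cited above is stated for relative support. The overlapping part is mined with the smaller of the two thresholds.

```python
from collections import Counter
from itertools import combinations


def mine(rows, minsup):
    """Brute-force miner: itemsets whose relative support in rows is >= minsup."""
    counts = Counter()
    for items in rows:
        for size in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(sorted(items), size))
    return {s for s, n in counts.items() if n >= minsup * len(rows)}


def verify(candidates, rows, minsup):
    """Additional counting step: recount candidates on a full source data set."""
    return {
        c for c in candidates
        if sum(1 for items in rows if c <= items) >= minsup * len(rows)
    }


def mine_merge(d1, d2, minsup1, minsup2):
    """d1, d2 map row id -> frozenset of items (the two source data sets)."""
    only1 = [d1[i] for i in d1.keys() - d2.keys()]
    only2 = [d2[i] for i in d2.keys() - d1.keys()]
    both = [d1[i] for i in d1.keys() & d2.keys()]
    a1 = mine(only1, minsup1)
    a2 = mine(only2, minsup2)
    a12 = mine(both, min(minsup1, minsup2))
    return (
        verify(a1 | a12, list(d1.values()), minsup1),
        verify(a2 | a12, list(d2.values()), minsup2),
    )


# Hypothetical toy data: rows 2 and 3 belong to both source data sets.
d1 = {1: frozenset("ab"), 2: frozenset("abc"), 3: frozenset("ac")}
d2 = {2: frozenset("abc"), 3: frozenset("ac"), 4: frozenset("bc"), 5: frozenset("ab")}
answer1, answer2 = mine_merge(d1, d2, 0.5, 0.5)
```

On this toy input the two answers coincide with what mining each source data set directly would return; the partition results only serve as candidates, and the final recount decides which of them are frequent in the original data sets.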

    C^{1-2}_1 = {all 1-itemsets from D^{1-2}}
    for (k = 1; C^{1-2}_k ≠ ∅; k++)
        count(C^{1-2}_k, D^{1-2});
        L^{1-2}_k = {c ∈ C^{1-2}_k | c.count ≥ minsup1};
        C^{1-2}_{k+1} = generate_candidates(L^{1-2}_k);
    Answer^{1-2} = ∪_k L^{1-2}_k;

    C^{2-1}_1 = {all 1-itemsets from D^{2-1}}
    for (k = 1; C^{2-1}_k ≠ ∅; k++)
        count(C^{2-1}_k, D^{2-1});
        L^{2-1}_k = {c ∈ C^{2-1}_k | c.count ≥ minsup2};
        C^{2-1}_{k+1} = generate_candidates(L^{2-1}_k);
    Answer^{2-1} = ∪_k L^{2-1}_k;

    C^{1∩2}_1 = {all 1-itemsets from D^{1∩2}}
    for (k = 1; C^{1∩2}_k ≠ ∅; k++)
        count(C^{1∩2}_k, D^{1∩2});
        L^{1∩2}_k = {c ∈ C^{1∩2}_k | c.count ≥ min(minsup1, minsup2)};
        C^{1∩2}_{k+1} = generate_candidates(L^{1∩2}_k);
    Answer^{1∩2} = ∪_k L^{1∩2}_k;

    count(Answer^{1-2} ∪ Answer^{1∩2}, D^1);
    Answer^1 = {c ∈ Answer^{1-2} ∪ Answer^{1∩2} | c.count ≥ minsup1};
    count(Answer^{2-1} ∪ Answer^{1∩2}, D^2);
    Answer^2 = {c ∈ Answer^{2-1} ∪ Answer^{1∩2} | c.count ≥ minsup2};

Figure 7. Model of the Mine Merge method

Example. Using the original database selection conditions, we construct three new data mining queries. Assume the intermediate results are written to the relation Intermediate(label, itemset).

    Q1: insert into intermediate
        mine 'Q1', itemset
        from (select basket from sales
              where time between '0-0-0' and '0-3-0'
              and NOT uad like '%.fr')
        where support(itemset) > 350

    Q2: insert into intermediate
        mine 'Q2', itemset
        from (select basket from sales
              where time between '0-0-0' and '0-3-0'
              and uad like '%.fr')
        where support(itemset) > 0

    Q3: insert into intermediate
        mine 'Q3', itemset
        from (select basket from sales
              where NOT time between '0-0-0' and '0-3-0'
              and uad like '%.fr')
        where support(itemset) > 0

The above queries discover frequent itemsets in the three partitions of the original data sets. In the next step, we have to merge the partitions and verify the itemsets' final supports:

1. select itemset
   from (select distinct itemset from intermediate) i, sales s
   where label in ('Q1', 'Q2') and s.itemset contains i.itemset
   group by i.itemset
   having count(*) > 350;

2. select itemset
   from (select distinct itemset from intermediate) i, sales s
   where label in ('Q2', 'Q3') and s.itemset contains i.itemset
   group by i.itemset
   having count(*) > 0;

The itemsets selected by the first Select query form the result of DMQ1, and the itemsets selected by the second Select query form the result of DMQ2.

Let us analyze the cost of this method. The I/O cost of executing the three new data mining queries is the following:

    [I/O cost formula not recoverable]

The I/O cost of verifying the discovered itemsets' supports is the cost of performing the join operation:

    [I/O cost formula not recoverable]

The CPU cost of the complete method is the following:

    [CPU cost formula not recoverable]
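For comparison with the sequential baseline, the Common Counting scheme of Section 4.2 can be sketched as below. The sketch is illustrative only: `common_counting` and `generate_candidates` are hypothetical names, minimum supports are absolute row counts, all candidates are assumed to fit in memory, and the number of rows read is used as a crude stand-in for the I/O cost. Calling the function with one empty data set degenerates to a single sequential run, which gives a baseline to compare against.

```python
from itertools import combinations


def generate_candidates(frequent, k):
    """Size-k unions of frequent itemsets whose (k-1)-subsets are all frequent."""
    prev = set(frequent)
    return {
        a | b
        for a in prev for b in prev
        if len(a | b) == k
        and all(frozenset(s) in prev for s in combinations(a | b, k - 1))
    }


def common_counting(d1, d2, minsup1, minsup2):
    """d1, d2 map row id -> frozenset of items; returns (answer1, answer2, rows_read)."""
    only1 = [d1[i] for i in d1.keys() - d2.keys()]
    only2 = [d2[i] for i in d2.keys() - d1.keys()]
    both = [d1[i] for i in d1.keys() & d2.keys()]
    c1 = {frozenset([x]) for t in d1.values() for x in t}
    c2 = {frozenset([x]) for t in d2.values() for x in t}
    answer1, answer2, rows_read, k = set(), set(), 0, 1
    while c1 or c2:
        n1, n2 = dict.fromkeys(c1, 0), dict.fromkeys(c2, 0)
        # Each partition is scanned at most once per level; the shared
        # partition updates the counters of both queries in the same scan.
        for rows, targets in ((only1, (n1,)), (only2, (n2,)), (both, (n1, n2))):
            if not any(targets):
                continue  # no candidates left that need this partition
            for t in rows:
                rows_read += 1
                for counts in targets:
                    for c in counts:
                        if c <= t:
                            counts[c] += 1
        l1 = {c for c, n in n1.items() if n >= minsup1}
        l2 = {c for c, n in n2.items() if n >= minsup2}
        answer1 |= l1
        answer2 |= l2
        k += 1
        c1, c2 = generate_candidates(l1, k), generate_candidates(l2, k)
    return answer1, answer2, rows_read


# Hypothetical toy data: rows 2 and 3 are shared by both queries.
d1 = {1: frozenset("ab"), 2: frozenset("abc"), 3: frozenset("ac")}
d2 = {2: frozenset("abc"), 3: frozenset("ac"), 4: frozenset("bc"), 5: frozenset("ab")}
a1, a2, shared_reads = common_counting(d1, d2, 2, 2)
s1, _, reads1 = common_counting(d1, {}, 2, 2)   # DMQ1 alone
_, s2, reads2 = common_counting({}, d2, 2, 2)   # DMQ2 alone
```

On this toy input the integrated run reads 14 rows, while the two independent runs read 6 and 12 rows, and both approaches return the same answers. The saving comes entirely from the shared partition being scanned once per level instead of twice.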

5. CONCLUSIONS

In this paper we have presented the problem of efficiently executing batches of data mining queries. We have built a relational algebra model for a level-wise association discovery algorithm and we used this model to describe our methods of executing batched data mining queries. For the three described methods, we analyzed their performance in terms of I/O cost and CPU cost.

REFERENCES

1. Agrawal R., Imielinski T., Swami A.: Mining Association Rules Between Sets of Items in Large Databases. Proc. of the 1993 ACM SIGMOD Conf. on Management of Data (1993)
2. Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conf. on Very Large Data Bases (1994)
3. Ceri S., Meo R., Psaila G.: A New SQL-like Operator for Mining Association Rules. Proc. of the 22nd Int'l Conference on Very Large Data Bases (1996)
4. Cheung D.W., Han J., Ng V., Wong C.Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. Proc. of the 12th ICDE (1996)
5. Han J., Fu Y., Wang W., Chiang J., Gong W., Koperski K., Li D., Lu Y., Rajan A., Stefanovic N., Xia B., Zaiane O.R.: DBMiner: A System for Mining Knowledge in Large Relational Databases. Proc. of the 2nd KDD Conference (1996)
6. Imielinski T., Mannila H.: A Database Perspective on Knowledge Discovery. Communications of the ACM, Vol. 39, No. 11 (1996)
7. Imielinski T., Virmani A., Abdulghani A.: Datamine: Application programming interface and query language for data mining. Proc. of the 2nd KDD Conference (1996)
8. Morzy T., Wojciechowski M., Zakrzewicz M.: Data Mining Support in Database Management Systems. Proc. of the 2nd DaWaK Conference (2000)
9. Morzy T., Zakrzewicz M.: SQL-like Language for Database Mining. ADBIS'97 Symposium (1997)
10. Nag B., Deshpande P.M., DeWitt D.J.: Using a Knowledge Cache for Interactive Discovery of Association Rules. Proc. of the 5th KDD Conference (1999)
11. Thomas S., Bodagala S., Alsabti K., Ranka S.: An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases. Proc. of the 3rd KDD Conference (1997)
12. Toivonen H.: Sampling Large Databases for Association Rules. Proc. of the 22nd Int'l Conference on Very Large Data Bases (1996)
13. Wojciechowski M., Zakrzewicz M.: Itemset Materializing for Fast Mining of Association Rules. Proc. of the 2nd ADBIS Conference (1998)