Bleach: A Distributed Stream Data Cleaning System

Yongchao Tian, Eurecom, Biot, France. Email: yongchao.tian@eurecom.fr
Pietro Michiardi, Eurecom, Biot, France. Email: pietro.michiardi@eurecom.fr
Marko Vukolić, IBM Research Zurich, Switzerland. Email: mvu@zurich.ibm.com

Abstract

Existing scalable data cleaning approaches have focused on batch data cleaning. However, batch data cleaning is not suitable for streaming big data systems, in which dynamic data is generated continuously. Despite the increasing popularity of stream-processing systems, no stream data cleaning techniques have been proposed so far. In this paper, we bridge this gap by addressing the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and the ability to cope with the continuous nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a cumulative sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using both synthetic and real data streams and experimentally validate its high throughput, low latency and high cleaning accuracy, which are preserved even with rule dynamics. In the absence of an existing comparable stream-cleaning baseline, we compared Bleach to a baseline system built on the micro-batch streaming paradigm, and experimentally show the superior performance of Bleach.

I. INTRODUCTION

Modern big data and machine learning applications critically rely on data, and derivative representations thereof, to meet certain quality criteria. Issues with data quality can lead to misleading analysis outcomes on the "garbage in, garbage out" basis. To address this issue, a range of data cleaning techniques were proposed recently [25], [2], [17]. However, existing data cleaning solutions have focused on batch data cleaning, by processing static data stored in data warehouses.
This is in sharp contrast with the requirements (and popularity) of distributed systems for streaming data processing (e.g., [23], [2]). In streaming data processing systems, data is continuously and simultaneously generated from potentially thousands of data sources. Examples of streaming data include log files generated by customers of mobile and web applications, online purchases, online gaming, social networks, financial trading floors, telemetry from sensors or other connected devices, etc. In contrast to the popularity of such systems, data cleaning solutions for cleaning streaming data have not received adequate attention.

In this paper, we address this gap and focus on stream data cleaning. The challenges and requirements in stream data cleaning that make it different from established batch data cleaning are manifold, but we highlight the following ones:

- Stream cleaning requires both real-time guarantees and high accuracy, requirements that are often at odds. A naïve approach to stream data cleaning could simply extend existing batch techniques, by buffering data records in a temporary data store and cleaning them periodically before feeding them into downstream components. Although likely to achieve high accuracy, such a method clearly violates the real-time requirements of streaming applications. The problem is exacerbated by the volume of data that cleaning systems need to process, which prohibits centralized solutions. Therefore, our goal is to design a distributed stream data cleaning system, which achieves efficient and accurate cleaning in real-time.
- Specific challenges that a stream cleaning system needs to address arise due to the long-term and dynamic nature of data streams. The dynamics of streaming systems may cause the very definition of dirty data to change over time.

In this paper, we specifically focus on rule-based data cleaning, whereby a set of domain-specific rules defines how data should be cleaned: in particular, we consider functional dependencies (FDs) and conditional functional dependencies (CFDs).
Despite extensive work on rule-based data cleaning [1], [6], [9], [1], [12], [18], [7], [17], we are not aware of any precedent rule-based stream data cleaning system. Our system, called Bleach, proceeds in two phases: violation detection, to find rule violations, and violation repair, to repair data based on such violations. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state (e.g., summaries of past data) to repair data, using an incremental equivalence class algorithm. To address the long-term and dynamic nature of streams, Bleach supports dynamic rules, which can be added and deleted without requiring idle time. Additionally, Bleach implements a sliding window operation that trades modest additional storage requirements, used to temporarily store cumulative statistics, for increased cleaning accuracy.

The experimental performance evaluation of our Bleach prototype is two-fold. First, we study the performance, in terms of throughput, latency and accuracy, focusing on the impact of Bleach parameters, and on the effects of rule dynamics. Then we compare Bleach to an alternative approach. In the absence of an existing comparable baseline, we compared Bleach to a system that we implemented using micro-batch streaming. Our experiments indicate superior performance of Bleach. Beyond demonstrating the superior performance of our system, our comparative analysis attests that designing an efficient stream-cleaning solution goes beyond a naïve application of batch cleaning techniques to stream processing.

The paper is organized as follows. Section II gives a more precise problem statement. We present the Bleach system design in Section III and Section IV; dynamic rule management and windowing are discussed respectively in Section V and Section VI. Section VII presents our experimental results. Section VIII overviews related work and Section IX concludes.

II. PRELIMINARIES

Next, we introduce the basic notation we use throughout the paper, then we define the problem statement we consider.

A. Background and Definitions

In this paper we assume that a stream data cleaning system ingests a data stream and outputs a cleaned data stream instance. We consider an input data stream instance D_in with schema S(A_1, A_2, ..., A_m) where A_j is an attribute in schema S. We assume the existence of unique tuple identifiers for every tuple in D_in: thus given a tuple t_i, id(t_i) is the identifier of t_i. In general we define a function id(e) which returns the identifier (ID) of e, where e can be any element. A list of IDs [id(e_1), id(e_2), ..., id(e_n)] is expressed as id(e_1, e_2, ..., e_n) for brevity. The output data stream instance D_out complies with schema S and has the same tuple identifiers as in D_in, i.e., with no tuple loss or duplication.

The basic unit, a cell c_{i,j}, is the concatenation of a tuple ID, an attribute and the projection of the tuple on the attribute: c_{i,j} = (id(t_i), A_j, t_i(A_j)). Note that t_i(A_j) is the value of c_{i,j}, which can also be expressed as v(c_{i,j}). Sometimes, we may simply express c_{i,j} as c_i when the cell attribute is not relevant to the discussion.

In our work, when we point at a specific tuple t_i, we also refer to this tuple as the current tuple. Tuples appearing earlier than t_i in the data stream are referred to as earlier tuples and those appearing after t_i are referred to as later tuples.

To perform data cleaning, we define a set of rules Σ = [r_1, ..., r_n], in which r_k is either a functional dependency (FD) rule or a conditional FD rule (CFD). Each rule has a unique rule identifier id(r_k).
A CFD rule r_k is represented by (X → A, cond(Y)), in which cond(Y) is a boolean function on a set of attributes Y where Y ⊆ S. X and A are respectively referred to as the set of left-hand side (LHS) attributes and the right-hand side (RHS) attribute: LHS(r_k) = X, RHS(r_k) = A. When the rule is clear from the context, we omit r_k so that LHS = X, RHS = A. Cells of LHS (RHS) attributes are also referred to as LHS (RHS) cells. Y is referred to as the set of conditional attributes. If there exists a pair of tuples t_1 and t_2 satisfying the condition cond(t_1(Y)) = cond(t_2(Y)) = true, where t_1(B) = t_2(B) for all B ∈ X but t_1(A) ≠ t_2(A), then we say t_1 and t_2 violate r_k. A data stream instance D satisfies r_k, denoted as D ⊨ r_k, when no violations of r_k exist in D. An FD rule can be seen as a special case of a CFD rule where cond(Y) is always true and Y is ∅. We refer to an attribute as an intersecting attribute if it is involved in multiple rules. If D satisfies a set of rules Σ, denoted D ⊨ Σ, then D ⊨ r_k for all r_k ∈ Σ. If D does not satisfy Σ, D is a dirty data stream instance.

B. Challenges and Goals

An ideal stream data cleaning system should accept a dirty input stream D_in and output a clean stream D_out, in which all rule violations in D_in are repaired (D_out ⊨ Σ). However, this is not possible in reality due to:

- Real-time constraint: As the data cleaning is incremental, the cleaning decision for a tuple (repair or not repair) can only be made based on itself and earlier tuples in the data stream, which is different from data cleaning in data warehouses where the entire dataset is available. In other words, if a dirty tuple only has violations with later tuples in the data stream, it cannot be cleaned. A late update for a tuple in the output data stream cannot be accepted.
- Dynamic rules: In a stream data cleaning system, the rule set is not static. A new rule may be added or an obsolete rule may be deleted at any time. A processed data tuple cannot be cleaned again with an updated rule set. Reprocessing the whole data stream whenever the rule set is updated is not realistic.
- Unbounded data: A data stream produces an unbounded amount of data, which cannot be stored completely. Thus, stream data cleaning cannot afford to perform cleaning on the full data history. Namely, if a dirty tuple only has violations with tuples that appear much earlier in the data stream, it is likely that such a tuple will not be cleaned.

Fig. 1. Illustrative example of a data stream consisting of on-line transactions (tuples ordered by time).

      item          category  clientid  city    zipcode
t_1   MacBook       computer  …         France  751
t_2   bike          sports    …         Lyon    null
t_3   Interstellar  movies    …         Paris   751
t_4   bike          toys      …         Nice    6
t_5   Titanic       movies    …         Paris   null

Consider the example in Figure 1, which is a data stream of on-line shopping transactions. Each tuple represents a purchase record, which contains a purchased item (item), the category of that item (category), a client identifier (clientid), the city of the client (city) and the zip code of that city (zipcode). In the example, we show an extract of five data tuples of the data stream, from t_1 to t_5. Now, assume we are given two FD rules and one CFD rule stating how a clean data stream should look: (r_1) the same items can only belong to the same category; (r_2) two records with the same clientid must have the same city; (r_3) two records with the same non-null zip code must have the same city:

(r_1) item → category
(r_2) clientid → city
(r_3) zipcode → city, zipcode ≠ null

In our example, there are three violations of rules r_1, r_2 and/or r_3: (v_1) t_1 and t_3 have the same non-null zip code (t_1(zipcode) = t_3(zipcode) ≠ null) but different city names (t_1(city) ≠ t_3(city)); (v_2) t_2 claims bikes belong to category sports while t_4 classifies bikes as toys (t_2(item) = t_4(item), t_2(category) ≠ t_4(category)); and (v_3) t_1 and t_5 have the same clientid but different city names (t_1(clientid) = t_5(clientid), t_1(city) ≠ t_5(city)).

Note that when a stream data cleaning system receives tuple t_1, no violation can be detected, as in our example t_1 only has violations with the later tuples t_3 and t_5. Thus, no modification can be made to t_1. Furthermore, delaying the cleaning process for t_1 is not a feasible option, not only because of real-time constraints, but also because it is difficult to predict for how long this tuple should be buffered for it to be cleaned. Therefore, stream data cleaning must be incremental: whenever a new piece of data arrives, the data cleaning process starts immediately.

Although performing incremental violation detection seems straightforward, incremental violation repair is much more complex to achieve. Coming back to the example in Figure 1, assume that the stream cleaning system receives tuple t_5 and successfully detects the violation v_3 between t_5 and t_1. Such detection is not sufficient to make the correct repair decision, as the tuple t_1 also conflicts with another tuple, t_3. An incremental repair in a stream data cleaning system should also take the violations among earlier tuples into account.

To account for the intricacies of the violation repair process, we use the concept of a violation graph [17]. A violation graph is a data structure containing the detected violations, in which each node represents a cell. If some violations share a common cell, they will be grouped into a single subgraph. Therefore, the violation graph is partitioned into smaller independent subgraphs. A single cell can only be in one subgraph. If two subgraphs share a common cell, they need to merge. The repair decision of a tuple is only relevant to the subgraphs in which its cells are involved.

Fig. 2. An example of a violation graph, derived from our running example. The upper subgraph (v_1, v_3) contains the cells t_3(city), t_1(city) and t_5(city); the lower subgraph (v_2) contains the cells t_2(category) and t_4(category).
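The violation definition of Section II-A can be illustrated with a minimal Python sketch applied to the running example. The `Rule` class, the `violates` and `find_violations` helpers, and the clientid and zipcode values (`"A"`, `"Z1"`, ...) are hypothetical placeholders introduced only for illustration; the sketch checks all pairs in a finite extract and is not the incremental procedure Bleach uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def always_true(row: Row) -> bool:
    """Condition of a plain FD rule: cond(Y) is always true."""
    return True


@dataclass(frozen=True)
class Rule:
    rule_id: str
    lhs: Tuple[str, ...]                      # X, the LHS attributes
    rhs: str                                  # A, the RHS attribute
    cond: Callable[[Row], bool] = always_true


def violates(rule: Rule, t1: Row, t2: Row) -> bool:
    """t1 and t2 violate the rule if both satisfy cond, agree on X, differ on A."""
    if not (rule.cond(t1) and rule.cond(t2)):
        return False
    same_lhs = all(t1[a] == t2[a] for a in rule.lhs)
    return same_lhs and t1[rule.rhs] != t2[rule.rhs]


def find_violations(stream: Dict[str, Row], rules: List[Rule]):
    """Check every pair of tuples against every rule (batch-style, for illustration)."""
    ids = list(stream)
    found = []
    for rule in rules:
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if violates(rule, stream[a], stream[b]):
                    found.append((rule.rule_id, a, b))
    return found


# Running example; clientid and zipcode values are assumed placeholders.
stream = {
    "t1": {"item": "MacBook", "category": "computer", "clientid": "A", "city": "France", "zipcode": "Z1"},
    "t2": {"item": "bike", "category": "sports", "clientid": "B", "city": "Lyon", "zipcode": None},
    "t3": {"item": "Interstellar", "category": "movies", "clientid": "C", "city": "Paris", "zipcode": "Z1"},
    "t4": {"item": "bike", "category": "toys", "clientid": "D", "city": "Nice", "zipcode": "Z2"},
    "t5": {"item": "Titanic", "category": "movies", "clientid": "A", "city": "Paris", "zipcode": None},
}
rules = [
    Rule("r1", ("item",), "category"),
    Rule("r2", ("clientid",), "city"),
    Rule("r3", ("zipcode",), "city", cond=lambda t: t["zipcode"] is not None),
]

violations = find_violations(stream, rules)
```

Under these assumed values, the three reported pairs correspond to v_2 (rule r_1), v_3 (rule r_2) and v_1 (rule r_3).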
A violation graph for our example can be seen in Figure 2. Given this violation graph, to make the repair decision for tuple t_5, the cleaning system can only rely on the upper subgraph, which consists of violations v_1 and v_3 with the common cell t_1(city). We now give our problem statement as follows.

Problem statement: Given an unbounded data stream with an associated schema¹ and a dynamic set of rules, how can we design an incremental and real-time data cleaning system, including violation detection and violation repair mechanisms, using bounded computing and storage resources, to output a cleaned data stream?

¹ Note that although we restrict the data stream to have a fixed schema in this work, it is easy to extend our work to support a dynamic schema.

In the following sections, we overview the Bleach architecture and provide details about its components. As shown in Figure 3, the input data stream first enters the detect module (Section III), which reveals violations against defined rules. The intermediate data stream is enriched with violation information, which the repair module (Section IV) uses to make repair decisions. Finally, the system outputs a cleaned data stream. The rule controller module is discussed in Section V, and the windowing operation in Section VI.

Fig. 3. Stream data cleaning overview.

Fig. 4. The detect module.

III. VIOLATION DETECTION

The violation detection module aims at finding input tuples that violate rules. To do so, it stores the tuples in memory, in an efficient and compact data structure that we call the data history. Input tuples are thus compared to those in the data history to detect violations. Figure 4 illustrates the internals of the detect module: it consists of an ingress router, an egress router and multiple detect workers (DWs). Bleach maps rules to such DWs: each worker is in charge of finding violations for a specific rule.

A. The Ingress Router

The role of the ingress router is to partition and distribute incoming tuples to DWs.
As discussed in Section II, only a subset of the attributes of an input tuple is relevant when verifying data validity against a given rule. For example, an FD rule only requires its LHS and RHS attributes to be verified, ignoring the rest of the input tuple attributes. Therefore, when the ingress router receives an input tuple, it partitions the tuple based on the current rule set, and only sends the relevant information to the DW in charge of each specific rule. As such, an input tuple is broken into multiple sub-tuples, which all share the identifier of the corresponding input tuple. Note that some attributes of an input tuple might be required by multiple rules: in this case, sub-tuples will contain redundant information, allowing each DW to work independently. An example of tuple partitioning can be found in Figure 4, where we reuse the input data schema and the rules from Section II.
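The partitioning step performed by the ingress router can be sketched in a few lines of Python. The `RULE_ATTRS` mapping and the `partition` function are hypothetical names, and the clientid and zipcode values are assumed placeholders; the sketch only shows how one input tuple becomes one sub-tuple per rule, each carrying the original tuple ID.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, Optional[str]]

# Attributes each rule needs (LHS, RHS and conditional attributes),
# following rules r_1 to r_3 of the running example.
RULE_ATTRS: Dict[str, Tuple[str, ...]] = {
    "r1": ("item", "category"),
    "r2": ("clientid", "city"),
    "r3": ("zipcode", "city"),
}


def partition(tuple_id: str, row: Row,
              rule_attrs: Dict[str, Tuple[str, ...]] = RULE_ATTRS):
    """Return one sub-tuple per rule; all sub-tuples share the input tuple ID."""
    return {
        rule_id: (tuple_id, {attr: row[attr] for attr in attrs})
        for rule_id, attrs in rule_attrs.items()
    }


# Tuple t_4 of the running example (clientid and zipcode are assumed values).
t4 = {"item": "bike", "category": "toys", "clientid": "D",
      "city": "Nice", "zipcode": "Z2"}
sub_tuples = partition("t4", t4)
```

Here the attribute city is copied into the sub-tuples of both r_2 and r_3, which is the redundancy that lets each detect worker operate independently.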

B. The Detect Worker

Each DW is assigned a rule, and receives the relevant sub-tuples stemming from the input stream. For each sub-tuple, a DW performs a lookup operation in the data history, and emits a message to downstream components when a rule violation is detected. To achieve efficiency and performance, lookup operations need to be fast, and the intermediate data stream should avoid redundant information. Next, we describe how the data history is represented and materialized in memory; then, we describe the output messages a DW generates, and finally outline the DW algorithm.

Data history representation. A DW accumulates relevant input sub-tuples in a compact data structure that enables an efficient lookup process, which makes it similar to a traditional indexing mechanism. The structure² of the data history is illustrated in Figure 5. First, to speed up the lookup process, sub-tuples are grouped by the value of the LHS attributes used by a given rule: we call such a group a cell group (CG). Thus, a CG stores all RHS cells whose sub-tuples share the same LHS value. The identifier of a cell group cg_l is the combination of the rule assigned to the DW and the value of the LHS attributes, expressed as id(cg_l) = (id(r_k), t(LHS)), where r_k is the rule assigned to the DW.

Next, to achieve a compact data representation, all cells in a CG sharing the same RHS value are grouped into a super cell (SC): sc_m = [c_{1,j}, c_{2,j}, ..., c_{n,j}]. From Section II, recall that a cell is made of a tuple ID, an attribute and a value: (id(t_i), A_j, t_i(A_j)). Therefore, a super cell can be compressed as a list of tuple IDs, an attribute and their common value: sc_m = (id(t_1, t_2, ..., t_n), A_j, t(A_j)) where t(A_j) = t_1(A_j) = ... = t_n(A_j). Hence, within an individual DW, sub-tuples whose cells are compressed in the same SC are equivalent, as they have the same LHS attribute values (the identity of the cell group) and the same RHS attribute value (the value of the super cell). A cell group cg_l can now be expressed as cg_l = ((id(r_k), t(LHS)), [sc_1, sc_2, ...]), including an identifier and a list of super cells.
In summary, the lookup process for a given input sub-tuple is as follows. Cell groups are stored in a hash-map using their identifiers as keys: therefore the DW first finds the CG corresponding to the current sub-tuple. Cells in the corresponding CG are the only cells that might be in conflict with the current cell. Overall, the complexity of the lookup process for a sub-tuple is O(1).

Fig. 5. The structure of the data history in a detect worker: cell groups are indexed by v(LHS); within a cell group, super cells are indexed by v(RHS).

Violation messages. DWs generate an intermediate data stream of violation messages, which help downstream components to eventually repair input tuples. The goal of the DW is to generate as few messages as possible, while allowing effective data repair. When the lookup process reveals that the current tuple does not violate a rule, the DW emits a non-violation message (msg_nvio). Instead, when a violation is detected, the DW constructs a message with all the necessary information to repair it, including the ID of the cell group corresponding to the current tuple and the RHS cells of the current and earlier tuples in the data history: msg_vio = (id(cg_l), c_cur, c_old).

Now, to reduce the number of violation messages, the DW can use a super cell in place of a single cell (c_old) in conflict with the current tuple. In addition, recall that a single CG can contain multiple super cells, thus possibly requiring multiple messages for each group. However, we observe that two cells in the same CG must also conflict with each other, as long as their values are different. Since the data repair module in Bleach is stateful, it is safe to omit some violation messages.

² The techniques we use are similar to the notion of partitions and compression introduced in Nadeef [1].
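The data history and the message scheme described above can be sketched as a small Python class that mirrors the steps of Algorithm 1. The `DetectWorker` class, its tuple-based message encoding (`("nvio",)` and `("vio", ...)`) and the later tuple `t6` are hypothetical choices made for this illustration; the sketch assumes every received sub-tuple already satisfies the rule condition.

```python
from typing import Dict, List, Optional, Tuple


class DetectWorker:
    """Sketch of a detect worker for one rule X -> A (condition assumed true)."""

    def __init__(self, rule_id: str, lhs: Tuple[str, ...], rhs: str):
        self.rule_id = rule_id
        self.lhs = lhs
        self.rhs = rhs
        # Data history: cell-group ID -> {RHS value -> list of tuple IDs}.
        # Each (value, tuple IDs) entry plays the role of one super cell.
        self.history: Dict[tuple, Dict[Optional[str], List[str]]] = {}

    def receive(self, tuple_id: str, sub_tuple: Dict[str, Optional[str]]):
        cg_id = (self.rule_id, tuple(sub_tuple[a] for a in self.lhs))
        value = sub_tuple[self.rhs]
        cur = (tuple_id, self.rhs, value)
        group = self.history.get(cg_id)       # O(1) hash-map lookup
        if group is None:
            # No cell group yet: create it, nothing to conflict with.
            group = self.history[cg_id] = {}
            msg = ("nvio",)
        elif len(group) == 1:
            old_value, old_ids = next(iter(group.items()))
            if old_value == value:
                msg = ("nvio",)
            else:
                # Complete message: current cell plus the old super cell.
                msg = ("vio", cg_id, cur, (tuple(old_ids), self.rhs, old_value))
        else:
            # Group already in violation: append-only message, old cells omitted.
            msg = ("vio", cg_id, cur, None)
        # Add the current cell to an existing super cell or start a new one.
        group.setdefault(value, []).append(tuple_id)
        return msg


dw = DetectWorker("r1", ("item",), "category")
m1 = dw.receive("t2", {"item": "bike", "category": "sports"})
m2 = dw.receive("t4", {"item": "bike", "category": "toys"})
m3 = dw.receive("t6", {"item": "bike", "category": "sports"})  # hypothetical later tuple
```

Exactly one message is produced per sub-tuple, and the third call shows the compact form in which the old super cells are left out because the repair side already holds them.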
Algorithm 1 Violation Detection
 1: given rule r = (X → A_j, cond(Y))
 2: procedure RECEIVE(sub-tuple t_i)          ▷ cond(t_i(Y)) = true
 3:     if ∃ cg_l : id(cg_l) = (id(r), t_i(X)) then
 4:         if |cg_l| = 1 then                ▷ cg_l contains sc_old
 5:             if v(sc_old) = t_i(A_j) then
 6:                 Emit msg_nvio
 7:             else
 8:                 Emit msg_vio(id(cg_l), c_cur, sc_old)
 9:             end if
10:         else
11:             Emit msg_vio(id(cg_l), c_cur, null)
12:         end if
13:     else
14:         Create cg_l                       ▷ Create new cell group
15:         Emit msg_nvio
16:     end if
17:     Add c_cur to cg_l
18: end procedure

Algorithm details. Next, we present the details of the DW violation detection algorithm, as illustrated in Algorithm 1. The algorithm starts by treating FD rules as a special case of CFD rules (line 1). Then, when a DW receives a sub-tuple t_i satisfying the rule condition (line 2), it performs a lookup in the data history to check if the corresponding cell group cg_l exists (line 3). If yes, it determines the number of SCs contained in cg_l (line 4). If there is only one SC, sc_old, violation detection works as follows. If the RHS cell of the current sub-tuple, c_cur, has the same value as sc_old, it emits a non-violation message (lines 5-6). Otherwise, a violation has been detected: the DW emits a complete violation message, containing both the current cell and the old cell (line 8). If the CG contains more than one SC, the DW emits a single append-only violation message, which only contains the cell of the current sub-tuple (line 11). Such compact messages omit the SCs from the data history, since they must be contained in earlier violation messages. Finally, if the lookup procedure (line 3) fails, the DW creates a new cell group and emits a non-violation message (lines 14-15). At this point, the current cell c_cur is added to the corresponding group cg_l (line 17), either in an existing SC, or as a new distinct cell.

It is worth noticing that, following Algorithm 1, a DW emits a single message for each input sub-tuple, no matter how many tuples in the data history it conflicts with.

C. The Egress Router

The egress router gathers the (violation or non-violation) messages for a given data tuple, as received from all DWs, and sends them downstream to the repair module.

IV. VIOLATION REPAIR

The goal of this module is to take the repair decisions for dirty data tuples, based on an intermediate stream of violation messages generated by the detect module. To achieve this, Bleach uses a data structure called the violation graph. Violation messages contribute to the creation and dynamics of the violation graph, which essentially groups those cells that, together, are used to perform data repair. Figure 6 sketches the internals of the repair module: it consists of an ingress router, the repair workers (RWs), and an aggregation component that emits clean data. An additional component, called the coordinator, steers violation graph management, with the contribution of RWs.

Fig. 6. Violation repair.

A. The Ingress Router

The ingress router broadcasts all incoming violation messages to all RWs. As opposed to its counterpart in the detection module, it does not perform data partitioning. Although each RW receives all violation messages, a cell in a violation message will only be stored in one RW, with the goal of creating and maintaining the violation graph.

B. The Repair Worker

Next, we describe the operation of a RW. First, we focus on the violation graph and the data repair algorithm. Then, we move to the key challenge that RWs address, that is, how to maintain a distributed violation graph. As such, we focus on graph partitioning and maintenance. Due to violation graph dynamics, coordination issues might arise in a distributed setting: such problems are addressed by the coordinator component.

The repair algorithm. Current data repair algorithms use a violation graph to repair dirty data based on user-defined rules.
A violation graph is a succinct representation of cells (both current and historical) that are in conflict according to some rules. A violation graph is composed of subgraphs. As incoming data streams in, the violation graph evolves: specifically, its subgraphs might merge or split, depending on the contents of violation messages. Using the violation graph, several algorithms can perform data cleaning, such as the equivalence class algorithm [5] or the holistic data cleaning algorithm [9]. Currently, Bleach uses an incremental version of the equivalence class algorithm that supports streaming input data, although alternative approaches can be easily plugged into our system. Thus, a subgraph in the violation graph can be interpreted as an equivalence class, in which all cells are supposed to have the same value.

The Bleach violation graph is built using the violation messages output by the detect module. We say that a subgraph sg intersects with a violation message msg_vio, denoted by msg_vio ∩ sg ≠ ∅, either when any of the current or old cells encapsulated in msg_vio are already contained in sg, or when sg has cells which are in the same cell group as any of the cells in msg_vio.

When there is only one RW, upon receiving a violation message msg_vio, the RW checks if there is a subgraph intersecting with msg_vio:

- If exactly one such sg exists, the RW adds msg_vio to sg, denoted by msg_vio →[add] sg, by adding both cells in msg_vio.
- If none of the subgraphs intersects with msg_vio, a new subgraph is created with the two cells in msg_vio, denoted by msg_vio →[add] null.
- If more than one such subgraph exists, Bleach merges these subgraphs into a single subgraph, and then adds msg_vio to it: msg_vio →[add] (sg_1, sg_2, ...)_merged.

We define the subgraph identifier id(sg_k) to be the list of cell group IDs comprised in the violation messages added to the subgraph: id(cg_1, cg_2, ...). A subgraph can be expressed as sg_k = (id(cg_1, cg_2, ...), [sc_1, sc_2, ...]): it consists of a group of SCs, stored in a compressed format, as shown in Section III-B. Note that when two subgraphs merge, their identifiers are also merged by concatenating both CG ID lists.
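The single-worker case described above (add to an intersecting subgraph, create a new one, or merge several) can be sketched as follows. The `ViolationGraph` class is a hypothetical illustration: cells are simplified to (tuple ID, attribute) pairs rather than compressed super cells, and the cell-group values `"Z1"` and `"A"` are assumed placeholders for the zip code and client ID of the running example.

```python
from typing import List, Optional, Tuple

Cell = Tuple[str, str]          # (tuple ID, attribute); values omitted for brevity


class ViolationGraph:
    """Sketch of a violation graph held by a single repair worker."""

    def __init__(self):
        # Each subgraph: {"ids": list of cell-group IDs, "cells": set of cells}.
        self.subgraphs: List[dict] = []

    def add(self, cg_id, cur: Cell, old: Optional[Cell] = None) -> dict:
        cells = {c for c in (cur, old) if c is not None}
        # A subgraph intersects the message if it already holds one of the
        # message's cells, or holds cells of the same cell group.
        hits = [sg for sg in self.subgraphs
                if cg_id in sg["ids"] or sg["cells"] & cells]
        merged = {"ids": [], "cells": set()}
        for sg in hits:                       # zero, one or several subgraphs
            merged["ids"] += [i for i in sg["ids"] if i not in merged["ids"]]
            merged["cells"] |= sg["cells"]
        if cg_id not in merged["ids"]:
            merged["ids"].append(cg_id)       # identifiers are concatenated
        merged["cells"] |= cells
        self.subgraphs = [sg for sg in self.subgraphs
                          if all(sg is not h for h in hits)] + [merged]
        return merged


graph = ViolationGraph()
graph.add(("r3", "Z1"), ("t3", "city"), ("t1", "city"))            # v_1
graph.add(("r1", "bike"), ("t4", "category"), ("t2", "category"))  # v_2
upper = graph.add(("r2", "A"), ("t5", "city"), ("t1", "city"))     # v_3
```

After the three messages, the graph holds two subgraphs, matching Figure 2: v_1 and v_3 share the cell t_1(city) and therefore end up in the same subgraph, whose identifier lists both cell groups.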
To make the subgraph ID explicit, sg_k can be written as sg_{id(cg_1,cg_2,...)}.

Distributed violation graph. Due to the unbounded nature of streaming data, it is reasonable to expect the violation graph to grow to sizes exceeding the capacity of a single RW. As such, in Bleach, the violation graph is a distributed data structure, partitioned across all RWs. However, unlike for DWs, the partitioning scheme cannot be simply rule-based, because a cell may violate multiple rules, creating issues related to coordination and load balancing. More generally, no partitioning scheme can guarantee that the cells from a single violation message or a single subgraph are placed in a single RW. Therefore, Bleach partitions the violation graph based on cells, using the cells' tuple IDs (e.g., hash partitioning). Since violation messages are broadcast to all RWs, a violation message msg_vio is partially added to subgraph sg in each RW, denoted by msg_vio →[p-add] sg, such that only the cells matching the partitioning scheme are added to sg. Hence, a subgraph spans several RWs, each storing a fraction of the cells comprised in the subgraph. We use the subgraph identifier to recognize partitions of the same subgraph.

An illustrative example is in order. Let us assume there are two RWs, rw1 and rw2, and the current violation graph consists of two subgraphs: sg_{id(cg_1)}, containing cells c_1, c_2, c_3, and sg_{id(cg_2)}, containing cells c_4, c_5. In our example, the violation graph is partitioned as in Figure 7(a): both RWs have a portion of the cells of every subgraph.

Fig. 7. Violation graph build example. (a) Initial state: rw1 holds c_1, c_3 of sg_{id(cg_1)} and c_5 of sg_{id(cg_2)}; rw2 holds c_2 of sg_{id(cg_1)} and c_4 of sg_{id(cg_2)}. (b) Merge only in rw1: rw1 holds sg_{id(cg_1,cg_3)} and sg_{id(cg_2)}; rw2 holds sg_{id(cg_1)}, sg_{id(cg_3)} (with c_6) and sg_{id(cg_2)}. (c) Merge in rw1 and rw2: both RWs hold sg_{id(cg_1,cg_3)} and sg_{id(cg_2)}.

Algorithm 2 RW-basic
 1: procedure RECEIVE([msg_vio1, msg_vio2, ...])
 2:     Initialize merge proposal mp
 3:     for msg_vioi in [msg_vio1, msg_vio2, ...] do
 4:         Find [sg_i1, sg_i2, ...] where msg_vioi ∩ sg_ij ≠ ∅
 5:         msg_vioi →[p-add] (sg_i1, sg_i2, ...)_merged
 6:         Add (Attr(msg_vioi), id((sg_i1, sg_i2, ...)_merged)) to mp
 7:     end for
 8:     Send mp to the coordinator
 9: end procedure
10: procedure RECEIVE(md)                     ▷ merge decision
11:     for (A_i, id(sg_i)) in md do
12:         Find [sg_i1, sg_i2, ...] where id(sg_ij) ⊆ id(sg_i)
13:         Merge subgraphs [sg_i1, sg_i2, ...] such that id((sg_i1, sg_i2, ...)_merged) = id(sg_i)
14:     end for
15:     Send repair proposal to the aggregator
16: end procedure

C. The Coordinator

The problem we address now stems from the dynamics of the violation graph, which evolves as new violation messages stream into the repair module. As each subgraph is partitioned among all RWs, subgraph partitions must be identified by the same ID. Continuing with the example from Figure 7(a), suppose a new violation message {id(cg_3), c_6, c_1} is received by both RWs. Now, in rw1, the new violation is added to subgraph sg_{id(cg_1)}, since both the message and the subgraph share the same cell c_1: as such, the new subgraph becomes sg_{id(cg_1,cg_3)}. Instead, in rw2, the new violation triggers the creation of a new subgraph sg_{id(cg_3)}, since no common cells are shared between the message and the existing subgraphs in rw2. The violation graph becomes inconsistent, as shown in Figure 7(b): this is a consequence of the independent operation of RWs. Instead, the repair algorithm requires the violation graph to be in a consistent state, as shown in Figure 7(c), where both RWs use the same subgraph identifier for the same equivalence class.
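The inconsistency of Figure 7(b) can be reproduced with a small Python sketch of the partial add. The `RepairWorker` class is hypothetical, cell c_i is represented simply by the integer i, and the odd/even ownership rule is an assumption chosen only so that the partitions match the figure; the source only states that cells are partitioned by tuple ID.

```python
from typing import Callable, List, Optional


class RepairWorker:
    """Sketch of a repair worker holding one partition of the violation graph."""

    def __init__(self, owns: Callable[[int], bool], subgraphs: List[dict]):
        self.owns = owns              # partitioning scheme over tuple IDs
        self.subgraphs = subgraphs    # each: {"ids": [...], "cells": set()}

    def partial_add(self, cg_id: str, cur: int, old: Optional[int] = None):
        cells = {c for c in (cur, old) if c is not None}
        # Intersection is evaluated against the local partition only.
        hits = [sg for sg in self.subgraphs
                if cg_id in sg["ids"] or sg["cells"] & cells]
        merged = {"ids": [], "cells": set()}
        for sg in hits:
            merged["ids"] += [i for i in sg["ids"] if i not in merged["ids"]]
            merged["cells"] |= sg["cells"]
        if cg_id not in merged["ids"]:
            merged["ids"].append(cg_id)
        # Only cells matching the partitioning scheme are stored locally.
        merged["cells"] |= {c for c in cells if self.owns(c)}
        self.subgraphs = [sg for sg in self.subgraphs
                          if all(sg is not h for h in hits)] + [merged]

    def ids(self):
        return sorted(tuple(sg["ids"]) for sg in self.subgraphs)


# Initial state of Figure 7(a).
rw1 = RepairWorker(lambda t: t % 2 == 1,
                   [{"ids": ["cg1"], "cells": {1, 3}}, {"ids": ["cg2"], "cells": {5}}])
rw2 = RepairWorker(lambda t: t % 2 == 0,
                   [{"ids": ["cg1"], "cells": {2}}, {"ids": ["cg2"], "cells": {4}}])

# Both workers receive the broadcast message {id(cg_3), c_6, c_1}.
for rw in (rw1, rw2):
    rw.partial_add("cg3", cur=6, old=1)
```

In this sketch rw1 ends with the identifiers (cg1, cg3) and (cg2), while rw2 ends with (cg1), (cg2) and (cg3): the two partitions of the same equivalence class no longer carry the same identifier, which is the situation the coordinator has to resolve.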
To guarantee the consistency of the violation graph among independent RWs, Bleach uses a stateless coordinator component that helps RWs agree on subgraph identifiers. In what follows we present three variants of the simple protocol RWs use to communicate with the coordinator.

RW-basic. Algorithm 2 demonstrates how RWs work with the coordinator in the RW-basic approach. When a RW receives the violation messages for a tuple, it adds the cells in the messages to the violation graph, according to its local state and the partitioning scheme. Note that in Algorithm 2 (lines 4-6), (sg_i1, sg_i2, ...)_merged is the general case, which includes the cases where there is no intersecting subgraph or only one. Then, the RW creates a merge proposal containing the subgraph IDs for each conflicting attribute, and sends it to the coordinator. Once the coordinator receives the merge proposals from all RWs, it merges the subgraph IDs for each attribute from the various merge proposals and produces a merge decision, which is sent back to all RWs. With the merge decision, RWs merge their local subgraphs and converge to a globally consistent state. Then, RWs are ready to generate repair proposals (more details in Section IV-D).

Clearly, such a simple approach to coordination harms Bleach performance. Indeed, the RW-basic scheme requires one round-trip message for every incoming data tuple, from all RWs. However, we note that coordination is not necessarily needed for every tuple. For example, when every cell violates at most one rule, every subgraph would only have a single CG ID. Thus, coordination is not necessary. More generally, given the violation messages for a tuple, coordination is only necessary when there is a complete violation message containing an old cell which already exists in the violation graph because of a different rule.

Figure 8 gives an example, where the initial state (Figure 8(a)) is the same as in Figure 7(a). Then, two violation messages, {id(cg_1), c_6, null} and {id(cg_2), c_6, null}, are received. Cell c_6 is the current cell contained in the current tuple.
Obviously, sg_{id(cg_1)} and sg_{id(cg_2)} should merge into sg_{id(cg_1,cg_2)}. This can be accomplished without coordination by both repair workers, as shown in Figure 8(b). Indeed, each RW is aware that c_6 is involved in two subgraphs, although c_6 is only stored in rw2 because of the partitioning scheme. Next, we use the above observations and propose two variants of the coordination mechanism that aim at bypassing the coordinator component to improve performance.

Fig. 8. Example of a violation graph built without coordination. (a) Initial state: rw1 holds c_1, c_3 of sg_{id(cg_1)} and c_5 of sg_{id(cg_2)}; rw2 holds c_2 and c_4. (b) After independent processing: both RWs hold sg_{id(cg_1,cg_2)}; rw1 stores c_1, c_3, c_5 and rw2 stores c_2, c_4, c_6.

RW-dr. In RW-dr, coordination is only conducted if it is necessary: in that case, the repair worker sends a merge proposal to the coordinator and waits for the merge decision. However, this approach is not exempt from drawbacks: it may cause some data tuples in the stream to be delivered out of order. This is because the repair worker waits for the merge decision in a non-blocking way. The violation messages of a tuple which do not require coordination may be processed in the coordination gap of an earlier tuple.

RW-ir. With this variant, no matter whether the violation messages of a tuple require coordination or not, a RW immediately updates its local subgraphs, executes the repair algorithm and emits a repair proposal downstream to the aggregator component. Then, if necessary, the RW lazily executes the coordination protocol. Clearly, this approach caters to system performance and avoids tuples being delivered out of order, but might harm cleaning accuracy. Indeed, individual data repair proposals from a RW are based on a local view prior to finishing all necessary merge operations on subgraphs, which has a direct impact on equivalence classes.

D. The Aggregator

With the consistent distributed violation graph, each RW emits a data repair proposal, which includes all³ candidate values and their frequencies computed in the local subgraph partition. The aggregator component collects all repair proposals and selects, as the value to repair a given cell, the candidate having the highest aggregate frequency. Finally, the aggregator modifies the current data tuple and outputs a clean data stream. Note that the aggregator only modifies current tuples in the output stream. Instead, cells stored in the violation graph are not modified regardless of the repair decision: this allows frequency counts to be updated as new data streams into the system, thus steering the aggregator to make different repair decisions as the violation graph evolves.

To avoid potential bottlenecks, Bleach can have multiple coordinators and aggregators, so that their workload can be distributed based on current tuple IDs.

V. DYNAMIC RULE MANAGEMENT

In stream data cleaning the rule set is usually not immutable but dynamic. Therefore, we now introduce a new component, the rule controller, shown in Figure 3, which allows Bleach to adapt to rule dynamics.
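The frequency-based selection performed by the aggregator (Section IV-D) can be sketched with Python's `collections.Counter`. The `aggregate` function and the example frequencies are hypothetical; the source does not specify how ties between candidates are broken, so the sketch simply keeps the first candidate encountered among equally frequent ones.

```python
from collections import Counter
from typing import Dict, List


def aggregate(proposals: List[Dict[str, int]]) -> str:
    """Pick the candidate value with the highest aggregate frequency.

    Each proposal maps candidate values to the frequency observed in one
    repair worker's local partition of the subgraph.
    """
    total: Counter = Counter()
    for proposal in proposals:
        total.update(proposal)        # adds the per-worker frequencies
    # Tie-breaking is an assumption of this sketch (first candidate seen wins).
    value, _ = max(total.items(), key=lambda kv: kv[1])
    return value


# Two repair workers report candidate values for the cell t_5(city)
# (frequencies are made up for illustration).
repaired_city = aggregate([
    {"Paris": 2, "France": 1},
    {"Paris": 1, "France": 1},
])
```

Because the stored cells are never overwritten, later proposals for the same subgraph are computed from updated counts and may lead to a different choice.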
The rule controller ccepts rule updtes s input nd guides the detect nd the repir module to dpt to rule dynmics without stopping the clening process nd without loosing stte. Rule updtes cn e of two types: one for dding new rule nd one for deleting n existing rule. Detect. In the detect module, the ddition of rule triggers the instntition of new DW, s input tuples re prtitioned y the rule. The new DW strts with no stte, which is uilt upon receiving new input tuples. As such, violtion detection using pst tuples cnnot e chieved, which is consistent with the Blech design gols. Insted, the deletion of n existing rule simply triggers the removl of DW, with its own locl dt history. Repir. In the repir module, the ddition of new rule is not prolemtic with respect to violtion grph mintennce opertions. Insted, the removl of rule implies violtion grph dynmics (sugrphs might shrink or split) which re more chllenging to ddress. Thus, in sugrph, we further group cells y cell groups. A sugrph now cn lso e expressed 3 In cse there re too mny cndidte vlues, we only send the top-k vlues, where k = 5. Fig. 9. (c) incorrect if delete cg 3 Sugrph split exmple (d) correct if delete cg 3 s: sg k = (id(cg 1, cg 2,...), [cg 1, cg 2,...]), where ech cell group gthers super cells. Some cells might spn multiple groups, s they my violte multiple rules. We lel such peculir cells s hinge cells. For ech hinge cell, the sugrph keeps the IDs of its connecting cell groups: c i,j = (c i,j, id(cg i1, cg i2,...)). Hinge cells with the sme vlue nd the sme connecting cell groups re lso compressed into super cells. With the new orgniztion of cells in sugrphs, the violtion grph updtes s following upon the removl of rule. If sugrph contins single cell group relted to the deleted rule, RWs re simply instructed to remove it. If sugrph contins multiple cell groups, RWs remove the cell groups relted to the deleted rule nd updte the hinge cells. 
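The update on rule deletion can be illustrated with a small sketch. It is not Bleach's actual implementation: the function name `delete_cell_group`, the dictionary layout used for hinge cells, and the union-find pass are assumptions made for illustration. Cell groups are treated as nodes, and hinge cells are the only links between them, so the remaining cell groups stay in one subgraph only if some hinge cell still connects them.

```python
from collections import defaultdict


def delete_cell_group(cell_groups, hinge_cells, deleted):
    """Remove one cell group from a subgraph and return the resulting subgraphs.

    cell_groups: iterable of cell-group ids in the subgraph (hypothetical layout)
    hinge_cells: dict mapping a hinge-cell id to the set of cell groups it connects
    deleted:     id of the cell group tied to the deleted rule
    Returns a sorted list of subgraphs, each a sorted list of cell-group ids.
    """
    remaining = set(cell_groups) - {deleted}

    # Update hinge cells: forget the deleted group. A cell that now touches
    # fewer than two groups is no longer a hinge cell.
    hinges = {}
    for cell, groups in hinge_cells.items():
        kept = set(groups) & remaining
        if len(kept) >= 2:
            hinges[cell] = kept

    # Union-find over the remaining cell groups, linked by surviving hinge cells.
    parent = {g: g for g in remaining}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for groups in hinges.values():
        first, *rest = sorted(groups)
        for other in rest:
            parent[find(other)] = find(first)

    # Each connected component becomes one subgraph.
    components = defaultdict(set)
    for g in remaining:
        components[find(g)].add(g)
    return sorted(sorted(c) for c in components.values())


# Hinge cells as the Figure 9 discussion implies: c1 links cg1 and cg3,
# c7 links cg2 and cg3 (an inference from the text, not stated data).
groups = {"cg1", "cg2", "cg3"}
hinges = {"c1": {"cg1", "cg3"}, "c7": {"cg2", "cg3"}}
after_cg2 = delete_cell_group(groups, hinges, "cg2")
after_cg3 = delete_cell_group(groups, hinges, "cg3")
```

Under these assumptions, deleting cg2 leaves a single subgraph containing cg1 and cg3, whereas deleting cg3 removes every hinge link and yields two separate subgraphs.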
With the remaining hinge cells, RWs check the connectivity of the remaining cell groups in the subgraph and decide whether to split the subgraph. An example of a split operation can be seen in Figure 9. The initial state of a subgraph is shown in Figure 9(a): the subgraph is sg_id(cg1,cg2,cg3), and its contents are three cell groups. Cells c1 and c7 are hinge cells, which work as bridges, connecting different cell groups together. Now, as a simple case, assume we want to remove the rule pertaining to cg2: the subgraph should become sg_id(cg1,cg3), as shown in Figure 9(b). Note that cell c7 loses its status of hinge cell. A more involved case arises when we delete the rule pertaining to cg3 instead of the rule pertaining to cg2. In this case, the subgraph should not become sg_id(cg1,cg2) as shown in Figure 9(c). Indeed, removing cg3 eliminates all existing hinge cells connecting the remaining cell groups. Thus, the subgraph must split into two separate subgraphs, sg_id(cg1) and sg_id(cg2), as shown in Figure 9(d).

VI. WINDOWING

Bleach provides windowed computations, which allow expressing data cleaning over a sliding window of data. Although windowing is a common operation in most streaming systems, in data cleaning it specifically addresses the unbounded nature of streaming data: without windowing, the data structures Bleach uses to detect and repair a dirty stream would grow indefinitely, which is impractical. In this section, we discuss two windowing strategies: a basic, tuple-based windowing strategy and an advanced strategy that aims at improving cleaning accuracy.

A. Basic Windowing

The underlying idea of the basic windowing strategy is to only use tuples within the sliding window to populate the data structures used by Bleach to achieve its tasks. Next, we outline the basic windowing strategy for both DW and RW operation.

Windowed Detection. We now focus on how DWs maintain their local data history. Clearly, the data history only contains cells that fall within the current window. When the window slides forward, DWs update the data history as follows: i) if a cell group ends up having no cells in the new window, DWs simply delete it; ii) for the remaining cell groups, DWs drop all cells that fall outside the new window, and update the remaining super cells accordingly. Note that, if implemented naively, the first operation above can be costly as it involves a linear scan of all cell groups. To improve the efficiency of data history updates, Bleach uses the following approach. It creates a FIFO queue of k lists, which store cell groups. In case the sliding step is half the window size, k = 2; more generally, we set k to be the window size divided by the sliding step. Any new cell group from the current window enters the queue in the k-th list. Any cell group update, e.g., due to a new cell added to the cell group, promotes it from list j to list k. As the window slides forward, the queue drops the list (and its cell groups) in the first position and acquires a new empty list in position k + 1.

Windowed Repair. Now we focus on how to maintain the violation graph in RWs. Again, the violation graph only stores cells within the current window. When the window slides forward, RWs update the violation graph as follows:
i) If a subgraph has no cells in the new window, RWs delete the subgraph;
ii) For the remaining subgraphs, if a cell group has no cells in the new window, RWs delete the cell group;
iii) RWs also delete hinge cells that are outside of the new window. This could require subgraphs to split, as they could miss a bridge cell to connect their cell groups;
iv) For the remaining cell groups, RWs drop all cells outside of the new window, and update the remaining super cells accordingly.
For efficiency reasons, Bleach uses the same k-list approach described for DWs to manage violation graph updates due to a sliding window.

B. Bleach Windowing

The basic windowing strategy only relies on the data within the current window to perform data cleaning, which may limit the cleaning accuracy. We begin with a motivating example, then describe the Bleach windowing strategy, which aims at improving cleaning accuracy. Note that here we only focus on the repair module and its violation graph, since Bleach windowing does not modify the operation of the detect module.

[Figure 10: Motivating example: Basic vs. Bleach windowing. (a) input data; (b) output data with basic windowing; (c) output data with Bleach windowing.]

Figure 10(a) illustrates a data stream of two-attribute tuples. Assume we use a single FD rule (A → B), a window size of 4 tuples, a sliding step of 2 tuples, and the basic windowing strategy. When t4 arrives, the window covers tuples [1, 4]. According to the repair algorithm, Bleach repairs t4(B) and sets it to the value b in the output stream. Note that, as we described in Section IV, t4(B) remains unchanged in the violation graph. Now, when tuple t5 arrives, the window moves to cover tuples [3, 6], even though t6 has yet to arrive. With only three tuples in the current window, the algorithm determines t5(B) is correct, because now the majority of tuples have value c. The output stream produced using basic windowing is shown in Figure 10(b). Clearly, cleaning accuracy is sacrificed, since it is easy to see that t5(B) should have been repaired to value b, which is the most frequent value overall. Hence the need for a different windowing strategy to overcome such problems.
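The motivating example can be replayed in a few lines. This is only a sketch: the helper name `repair_with_basic_window` is hypothetical, the stream contents (three tuples with B = "b" followed by two with B = "c", all sharing the same A value) are reconstructed from the description above, and the repair is reduced to a plain majority vote over the cells visible in the window.

```python
from collections import Counter


def repair_with_basic_window(rhs_by_tuple_id, window):
    """Return the most frequent RHS value among tuples inside the window."""
    first, last = window
    visible = [value for tuple_id, value in rhs_by_tuple_id.items()
               if first <= tuple_id <= last]
    return Counter(visible).most_common(1)[0][0]


# Assumed stream: attribute B of t1..t5, all tuples in one equivalence class.
stream = {1: "b", 2: "b", 3: "b", 4: "c", 5: "c"}

# When t4 arrives, the window is [1, 4] and only t1..t4 have been seen.
at_t4 = repair_with_basic_window({t: v for t, v in stream.items() if t <= 4}, (1, 4))

# When t5 arrives, the window is [3, 6]; t6 has not arrived yet.
at_t5 = repair_with_basic_window(stream, (3, 6))

# Most frequent value over the whole stream seen so far.
overall = Counter(stream.values()).most_common(1)[0][0]
```

With these inputs the vote at t4 selects "b", while the vote at t5 selects "c" even though "b" remains the most frequent value overall, which is the accuracy loss the example points out.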
Bleach windowing relies on an extension of a super cell, which we call a cumulative super cell. The idea is for the violation graph to accumulate past state, to complement the view Bleach builds using tuples from the current window. Hence, a cumulative super cell is represented as a super cell, with an additional field that stores the number of occurrences of cells with the same RHS value, including those that have been dropped because they fall outside the sliding window boundaries.

When using Bleach windowing, RWs maintain the violation graph by storing cumulative super cells instead of super cells. When the window slides forward, RWs update the violation graph as follows. The first two steps are equivalent to those for the basic strategy. The last two steps are modified as follows:
iii) For the remaining subgraphs, RWs delete hinge cells that do not bridge cell groups anymore because of the update. Also, RWs split subgraphs according to the remaining hinge cells;
iv) For the remaining cell groups and hinge cells, RWs update cumulative super cells, flushing cells which fall outside the new window while keeping their count.
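A cumulative super cell can be sketched as a small record that separates the cells still inside the window from the running count. The class and function names below (`CumulativeSuperCell`, `slide`, `best_candidate`) are illustrative assumptions rather than Bleach's API, and the tuple IDs and values follow the two-value example of Figure 10 as reconstructed from the text.

```python
from dataclasses import dataclass, field


@dataclass
class CumulativeSuperCell:
    value: str                                      # shared RHS value
    tuple_ids: list = field(default_factory=list)   # cells still inside the window
    count: int = 0                                  # occurrences seen so far, dropped cells included

    def add(self, tuple_id):
        """Record a new cell carrying this value."""
        self.tuple_ids.append(tuple_id)
        self.count += 1

    def slide(self, window_start):
        """Flush cells that fall outside the new window while keeping their count."""
        self.tuple_ids = [t for t in self.tuple_ids if t >= window_start]


def best_candidate(cells):
    """Pick the value with the highest accumulated count."""
    return max(cells, key=lambda c: c.count).value


csc_b = CumulativeSuperCell("b")
csc_c = CumulativeSuperCell("c")
for t in (1, 2, 3):
    csc_b.add(t)
for t in (4, 5):
    csc_c.add(t)

# The window moves to [3, 6]: t1 and t2 leave the window but still count.
for cell in (csc_b, csc_c):
    cell.slide(window_start=3)

repair_value = best_candidate([csc_b, csc_c])
```

After the slide, the cell for "b" holds only tuple 3 but keeps a count of 3, so "b" still outweighs "c" (count 2) when a repair value is chosen. Note that the record is kept even when its cell list becomes empty, which is where the extra storage of this strategy comes from.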

TABLE I. EXAMPLE RULE SETS USED IN OUR EXPERIMENTS.
r0: ss_item_sk → i_brand, (ss_item_sk ≠ null)
r1: ss_item_sk → i_category, (ss_item_sk ≠ null)
r2: ca_state, ca_city → ca_zip, (ca_state, ca_city ≠ null)
r3: ss_promo_sk → p_promo_name, (ss_promo_sk ≠ null)
r4: ss_store_sk → s_store_name, (ss_store_sk ≠ null)
r5: ss_ticket_num → s_store_name, (ss_ticket_num ≠ null)
r6: file_extension → mime_type, (file_extension ≠ null)

Now, going back to the example in Figure 10(a), when tuple t5 arrives, Bleach stores two cumulative super cells: csc1(id(t) = [3], value = b, count = 3) and csc2(id(t) = [4, 5], value = c, count = 2). Although t1 and t2 have been deleted because they are outside the sliding window, they still contribute to the count field in csc1. Therefore, tuple t5(B) is correctly repaired to value b, as shown in Figure 10(c).

Additional notes: When using cumulative super cells, Bleach keeps track of candidate values to be used in the repair algorithm as long as cell groups remain. By using cumulative super cells for hinge cells, subgraphs only split if some cell groups are removed when the window moves forward. Note that the introduction of cumulative super cells does not interfere with dynamic rule management: in particular, when deleting a rule, subgraphs update correctly when hinge cells use the cumulative format. Overall, to compute the count of a candidate value in a subgraph, Bleach accumulates the counts of relevant cumulative super cells from all cell groups, taking into account any duplicate contributions from hinge cells.

Obviously, Bleach windowing requires more storage than basic windowing, as cumulative super cells store additional information, and keep the super cell structure, even when they have an empty cell list. Section VII demonstrates that such additional overhead is well balanced by superior cleaning accuracy, making Bleach windowing truly desirable.

VII. EVALUATION

We built a Bleach prototype implementation using Apache Storm [24].⁴ Input streams, including both the data stream and rule updates, are fed into Bleach using Apache Kafka [16].
We conducted all experiments in a cluster of 18 machines, with 4 cores, 8 GB RAM and 1 Gbps network interface each. We evaluate Bleach using both synthetic and real-life datasets. The synthetic dataset is generated from TPC-DS (with scale factor 100 GB), where we join the fact table store_sales with its dimension tables to build a single table (288 million tuples). We manually design six CFD rules, from r0 to r5, as shown in Table I. Among these rules, r4 and r5 have the same RHS attribute, s_store_name. We generate a dirty data stream as follows: we modify the values of RHS attributes with probability 1% and replace the values of LHS attributes with NULL with probability 1%.⁵ The real-life dataset we use is the result of merging all the log files of Ubuntu One servers [15] for 3 days (773 GB of CSV text). Here we do not modify any values, since the dataset already contains dirty records. We design a CFD rule, r6, as shown in Table I. With rule r6, the dirty ratio of the dataset is roughly [...]. By exporting the datasets to Kafka, we simulate unbounded data streams.

⁴ Nothing prevents Bleach from being built using alternative systems such as Apache Flink, for example.
⁵ In our experiments we also used BART [3], which is a well-accepted dirty data generator. However, BART fails to scale to hundreds of millions of tuples due to memory reasons. Thus, we present results obtained using our custom process, which mimics that of BART but scales to large data streams.

In all the experiments, we use Bleach windowing as the default windowing strategy and set the window size to 2M tuples and the sliding step to 1M tuples, unless otherwise specified. The synthetic dataset is used in all experiments except in our last battery of experiments, where the real-life dataset is used.

Our goal is to demonstrate that Bleach achieves efficient stream data cleaning under real-time constraints. Our evaluation uses throughput, latency and dirty ratio as performance metrics. We express the dirty ratio as the fraction of dirty data remaining in the output data stream: the smaller the dirty ratio, the higher the cleaning accuracy.
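As a rough illustration of the metric, the dirty ratio for one attribute can be computed by comparing the cleaned output with a known-clean reference. The function name `dirty_ratio` and the cell-by-cell comparison against ground truth are assumptions made here; the text does not spell out how the measurement is implemented.

```python
def dirty_ratio(output_values, clean_values):
    """Fraction of output cells that still differ from the clean reference."""
    if len(output_values) != len(clean_values):
        raise ValueError("output and reference must have the same length")
    if not clean_values:
        return 0.0
    dirty = sum(1 for out, ref in zip(output_values, clean_values) if out != ref)
    return dirty / len(clean_values)


# Hypothetical RHS column after cleaning, with one cell left unrepaired.
ratio = dirty_ratio(["b", "b", "c", "b"], ["b", "b", "b", "b"])
```

In this toy input one of four cells is still wrong, so the ratio is 0.25.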
The processing latency is measured from uniformly sampled tuples (1 per 1).

Comparing Coordination Approaches. In this experiment we compare the three RW approaches discussed in Section IV, according to our performance metrics, as shown in Figure 11: RW-basic requires coordination among repair workers for each tuple; RW-dr omits coordination for tuples when possible; RW-ir is similar to RW-dr, but allows repair decisions to be made before finishing coordination.

Figure 11(a) shows how Bleach throughput evolves with processed tuples. The throughput with both RW-dr and RW-ir is around 15K tuples/second, whereas RW-basic achieves roughly 13K tuples/second. The inferior performance of RW-basic is due to the large number of coordination messages required to converge to global subgraph identifiers, while RW-dr and RW-ir only require 7% of the coordination messages of RW-basic.

Figure 11(b) shows the CDF of the tuple processing latency for the three RW approaches. RW-basic has the highest processing latency, on average 364 ms. The processing latency of RW-ir is on average 316 ms. The average latency of RW-dr is slightly higher, about 323 ms. This difference is again due to the additional round-trip messages required by coordination: with RW-ir, RWs make their repair proposals without waiting for coordination to complete, hence the small processing latency.

Figure 11(c) illustrates the cleaning accuracy. All three approaches lower the ratio of dirty data significantly to at most .5% (even 0% for rules r3 and r4). For the first five rules, the three approaches achieve similar cleaning accuracy. For rule r5, instead, the RW-ir method suffers and the dirty ratio is larger. Indeed, for rule r5, whose cleaning accuracy is heavily linked to rule r4, RW-ir fails to correctly update some of its subgraphs because it eagerly emits repair proposals without waiting for coordination to complete. In the following experiments, we use RW-dr as the default mechanism.

Comparing Windowing Strategies. In this experiment, we compare the performance of the basic and Bleach windowing strategies.
Additionally, for stress testing, we increase the input dirty data ratio by modifying the values of RHS attributes with probability 5% for data in the interval from 4M to 42M tuples. As shown in Figure 13 and Figure 14, the two windowing strategies are essentially equivalent in terms of throughput and latency: this is good news, as it implies that the requirement for cumulative super cells imposes a negligible toll on performance. Next, we focus on a detailed view of the cleaning accuracy, which is shown in Figure 12. Bleach windowing achieves superior cleaning accuracy: in general, the dirty ratio is one order of magnitude smaller than that of basic windowing. This advantage is also kept in the presence of a dirty ratio spike in the input data. In particular, for rules r3 and r4, Bleach windowing achieves a 0% dirty ratio, irrespective of the dirty ratio spike. Overall, Bleach windowing reveals that keeping state from past windows can indeed dramatically improve cleaning accuracy, with little to no performance overhead.

[Figure 11: Comparison of Bleach coordination mechanisms: RW-basic, RW-dr and RW-ir. (a) Throughput; (b) Latency; (c) Dirty Ratio.]

[Figure 12: Comparison of cleaning accuracy for basic and Bleach windowing strategies. (a) r0; (b) r1; (c) r2; (d) r3; (e) r4; (f) r5.]

[Figure 16: Comparison of cleaning accuracy for different Bleach window sizes.]

Comparing Different Window Sizes. In this experiment, we evaluate Bleach with different window sizes. We set the window size to 2K, 5K, 1M and 2M respectively (the sliding step is half of the window size), and the experiment results are shown in Figure 15 and Figure 16. We see that Bleach has a higher chance to clean the data stream with more tuples in the window. Figure 15 shows that the throughput decreases as the size of the window increases. With a larger window, there are more tuples to be checked for violations in the data history. Hence, more violations are detected and sent to the repair module, and the violation graph in the RWs becomes larger. As a consequence, any subgraph operations, including merging and splitting, take more time to finish. With our implementation the throughput drops 23% when the window size increases 1 times.
In contrast, Figure 16 demonstrates that the cleaning accuracy may increase more than 1 times when the window size increases 1 times.⁶

Dynamic Rule Management. Next, we study the performance of Bleach in the presence of rule dynamics, as shown in Figure 17. To do this, we initially use the same input data stream and rule set as in the first set of experiments. However, while Bleach is cleaning the input stream, we delete rule r5 and add two new rules, r7 (ss_ticket_num → c_email_addr, (ss_ticket_num ≠ null)) and r8 (ss_customer_sk → c_email_addr, (ss_customer_sk ≠ null)), as indicated in the figure. Figure 17(a) and Figure 17(b) show the evolution in time of throughput and latency, whereas Figure 17(c) gives the CDF of the processing latency. Figure 17(a) shows that rule dynamics can result in an increase in throughput. Indeed, removing r5 (at the 6M tuple) implies that Bleach needs to manage fewer rules; in addition, r4 becomes simpler to manage, as there are no more intersections with

⁶ The cleaning accuracy of rules r3 and r4 is not shown in Figure 16, as their cleaning accuracy is always 100% with the four different window sizes.

arXiv:1609.05113v1 [cs.DB] 16 Sep 2016

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Dt Mining y I. H. Witten nd E. Frnk Simplicity first Simple lgorithms often work very well! There re mny kinds of simple structure, eg: One ttriute does ll the work All ttriutes contriute eqully

More information

Fig.25: the Role of LEX

Fig.25: the Role of LEX The Lnguge for Specifying Lexicl Anlyzer We shll now study how to uild lexicl nlyzer from specifiction of tokens in the form of list of regulr expressions The discussion centers round the design of n existing

More information

In the last lecture, we discussed how valid tokens may be specified by regular expressions.

In the last lecture, we discussed how valid tokens may be specified by regular expressions. LECTURE 5 Scnning SYNTAX ANALYSIS We know from our previous lectures tht the process of verifying the syntx of the progrm is performed in two stges: Scnning: Identifying nd verifying tokens in progrm.

More information

COMP 423 lecture 11 Jan. 28, 2008

COMP 423 lecture 11 Jan. 28, 2008 COMP 423 lecture 11 Jn. 28, 2008 Up to now, we hve looked t how some symols in n lphet occur more frequently thn others nd how we cn sve its y using code such tht the codewords for more frequently occuring

More information

2 Computing all Intersections of a Set of Segments Line Segment Intersection

2 Computing all Intersections of a Set of Segments Line Segment Intersection 15-451/651: Design & Anlysis of Algorithms Novemer 14, 2016 Lecture #21 Sweep-Line nd Segment Intersection lst chnged: Novemer 8, 2017 1 Preliminries The sweep-line prdigm is very powerful lgorithmic design

More information

10.5 Graphing Quadratic Functions

10.5 Graphing Quadratic Functions 0.5 Grphing Qudrtic Functions Now tht we cn solve qudrtic equtions, we wnt to lern how to grph the function ssocited with the qudrtic eqution. We cll this the qudrtic function. Grphs of Qudrtic Functions

More information

Network Interconnection: Bridging CS 571 Fall Kenneth L. Calvert All rights reserved

Network Interconnection: Bridging CS 571 Fall Kenneth L. Calvert All rights reserved Network Interconnection: Bridging CS 57 Fll 6 6 Kenneth L. Clvert All rights reserved The Prolem We know how to uild (rodcst) LANs Wnt to connect severl LANs together to overcome scling limits Recll: speed

More information

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Lecture 10 Evolutionry Computtion: Evolution strtegies nd genetic progrmming Evolution strtegies Genetic progrmming Summry Negnevitsky, Person Eduction, 2011 1 Evolution Strtegies Another pproch to simulting

More information

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have Rndom Numers nd Monte Crlo Methods Rndom Numer Methods The integrtion methods discussed so fr ll re sed upon mking polynomil pproximtions to the integrnd. Another clss of numericl methods relies upon using

More information

From Dependencies to Evaluation Strategies

From Dependencies to Evaluation Strategies From Dependencies to Evlution Strtegies Possile strtegies: 1 let the user define the evlution order 2 utomtic strtegy sed on the dependencies: use locl dependencies to determine which ttriutes to compute

More information

UT1553B BCRT True Dual-port Memory Interface

UT1553B BCRT True Dual-port Memory Interface UTMC APPICATION NOTE UT553B BCRT True Dul-port Memory Interfce INTRODUCTION The UTMC UT553B BCRT is monolithic CMOS integrted circuit tht provides comprehensive MI-STD- 553B Bus Controller nd Remote Terminl

More information

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining EECS150 - Digitl Design Lecture 23 - High-level Design nd Optimiztion 3, Prllelism nd Pipelining Nov 12, 2002 John Wwrzynek Fll 2002 EECS150 - Lec23-HL3 Pge 1 Prllelism Prllelism is the ct of doing more

More information

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs. Lecture 5 Wlks, Trils, Pths nd Connectedness Reding: Some of the mteril in this lecture comes from Section 1.2 of Dieter Jungnickel (2008), Grphs, Networks nd Algorithms, 3rd edition, which is ville online

More information

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards A Tutology Checker loosely relted to Stålmrck s Algorithm y Mrtin Richrds mr@cl.cm.c.uk http://www.cl.cm.c.uk/users/mr/ University Computer Lortory New Museum Site Pemroke Street Cmridge, CB2 3QG Mrtin

More information

Presentation Martin Randers

Presentation Martin Randers Presenttion Mrtin Rnders Outline Introduction Algorithms Implementtion nd experiments Memory consumption Summry Introduction Introduction Evolution of species cn e modelled in trees Trees consist of nodes

More information

OUTPUT DELIVERY SYSTEM

OUTPUT DELIVERY SYSTEM Differences in ODS formtting for HTML with Proc Print nd Proc Report Lur L. M. Thornton, USDA-ARS, Animl Improvement Progrms Lortory, Beltsville, MD ABSTRACT While Proc Print is terrific tool for dt checking

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distriuted Systems Principles nd Prdigms Chpter 11 (version April 7, 2008) Mrten vn Steen Vrije Universiteit Amsterdm, Fculty of Science Dept. Mthemtics nd Computer Science Room R4.20. Tel: (020) 598 7784

More information

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis CS143 Hndout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexicl Anlysis In this first written ssignment, you'll get the chnce to ply round with the vrious constructions tht come up when doing lexicl

More information

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011 CSCI 3130: Forml Lnguges nd utomt Theory Lecture 12 The Chinese University of Hong Kong, Fll 2011 ndrej Bogdnov In progrmming lnguges, uilding prse trees is significnt tsk ecuse prse trees tell us the

More information

Agilent Mass Hunter Software

Agilent Mass Hunter Software Agilent Mss Hunter Softwre Quick Strt Guide Use this guide to get strted with the Mss Hunter softwre. Wht is Mss Hunter Softwre? Mss Hunter is n integrl prt of Agilent TOF softwre (version A.02.00). Mss

More information

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5 CS321 Lnguges nd Compiler Design I Winter 2012 Lecture 5 1 FINITE AUTOMATA A non-deterministic finite utomton (NFA) consists of: An input lphet Σ, e.g. Σ =,. A set of sttes S, e.g. S = {1, 3, 5, 7, 11,

More information

Systems I. Logic Design I. Topics Digital logic Logic gates Simple combinational logic circuits

Systems I. Logic Design I. Topics Digital logic Logic gates Simple combinational logic circuits Systems I Logic Design I Topics Digitl logic Logic gtes Simple comintionl logic circuits Simple C sttement.. C = + ; Wht pieces of hrdwre do you think you might need? Storge - for vlues,, C Computtion

More information

A dual of the rectangle-segmentation problem for binary matrices

A dual of the rectangle-segmentation problem for binary matrices A dul of the rectngle-segmenttion prolem for inry mtrices Thoms Klinowski Astrct We consider the prolem to decompose inry mtrix into smll numer of inry mtrices whose -entries form rectngle. We show tht

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

Graphs with at most two trees in a forest building process

Graphs with at most two trees in a forest building process Grphs with t most two trees in forest uilding process rxiv:802.0533v [mth.co] 4 Fe 208 Steve Butler Mis Hmnk Mrie Hrdt Astrct Given grph, we cn form spnning forest y first sorting the edges in some order,

More information

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties, Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1):

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1): Overview (): Before We Begin Administrtive detils Review some questions to consider Winter 2006 Imge Enhncement in the Sptil Domin: Bsics of Sptil Filtering, Smoothing Sptil Filters, Order Sttistics Filters

More information

Tries. Yufei Tao KAIST. April 9, Y. Tao, April 9, 2013 Tries

Tries. Yufei Tao KAIST. April 9, Y. Tao, April 9, 2013 Tries Tries Yufei To KAIST April 9, 2013 Y. To, April 9, 2013 Tries In this lecture, we will discuss the following exct mtching prolem on strings. Prolem Let S e set of strings, ech of which hs unique integer

More information

Dr. D.M. Akbar Hussain

Dr. D.M. Akbar Hussain Dr. D.M. Akr Hussin Lexicl Anlysis. Bsic Ide: Red the source code nd generte tokens, it is similr wht humns will do to red in; just tking on the input nd reking it down in pieces. Ech token is sequence

More information

Transparent neutral-element elimination in MPI reduction operations

Transparent neutral-element elimination in MPI reduction operations Trnsprent neutrl-element elimintion in MPI reduction opertions Jesper Lrsson Träff Deprtment of Scientific Computing University of Vienn Disclimer Exploiting repetition nd sprsity in input for reducing

More information

Midterm 2 Sample solution

Midterm 2 Sample solution Nme: Instructions Midterm 2 Smple solution CMSC 430 Introduction to Compilers Fll 2012 November 28, 2012 This exm contins 9 pges, including this one. Mke sure you hve ll the pges. Write your nme on the

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

What are suffix trees?
