arxiv: v1 [cs.db] 16 Sep 2016

Bleach: A Distributed Stream Data Cleaning System

Yongchao Tian, Eurecom, yongchao.tian@eurecom.fr
Pietro Michiardi, Eurecom, pietro.michiardi@eurecom.fr
Marko Vukolić, IBM Research - Zurich, mvu@zurich.ibm.com

ABSTRACT

In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a cumulative sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the micro-batch streaming paradigm.

1. INTRODUCTION

Today, we live in a world where decisions are often based on analytics applications that process continuous streams of data. Typically, data streams are combined and summarized to obtain a succinct representation thereof: analytics applications rely on such representations to make predictions, and to create reports, dashboards and visualizations [15, 23, 26]. All these applications expect the data, and their representation, to meet certain quality criteria. Data quality issues interfere with these representations and distort the data, leading to misleading analysis outcomes and potentially bad decisions. As such, a range of data cleaning techniques were proposed recently [28, 22, 19]. However, most of them focus on batch data cleaning, by processing static data stored in data warehouses, thus neglecting the important class of streaming data. In this paper, we address this gap and focus on stream data cleaning.
The challenge in stream cleaning is that it requires both real-time guarantees as well as high accuracy, requirements that are often at odds. A naïve approach to stream data cleaning could simply extend existing batch techniques, by buffering data records in a temporary data store and cleaning it periodically before feeding it into downstream components. Although likely to achieve high accuracy, such a method clearly violates the real-time requirements of streaming applications. The problem is exacerbated by the volume of data cleaning systems need to process, which prohibits centralized solutions. Therefore, our goal is to design a distributed stream data cleaning system, which achieves efficient and accurate cleaning in real-time.

In this paper, we focus on rule-based data cleaning, whereby a set of domain-specific rules define how data should be cleaned: in particular, we consider functional dependencies (FDs) and conditional functional dependencies (CFDs). Our system, called Bleach, proceeds in two phases: violation detection, to find rule violations, and violation repair, to repair data based on such violations. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state (e.g., summaries of past data) to repair data, using an incremental equivalence class algorithm.

We further address the complications due to the long-term and dynamic nature of data streams: the definition of dirty data could change to follow such dynamics. Bleach supports dynamic rules, which can be added and deleted without requiring idle time. Additionally, Bleach implements a sliding window operation that trades modest additional storage requirements to temporarily store cumulative statistics, for increased cleaning accuracy.

Our experimental performance evaluation of Bleach is twofold. First, we study the performance, in terms of throughput, latency and accuracy, of our prototype and focus on the impact of its parameters. Then we compare Bleach to an alternative baseline system, which we implement using a micro-batch streaming architecture. Our results indicate the benefits of a system like Bleach, which hold even with rule dynamics.
Despite extensive work on rule-based data cleaning [1, 6, 9, 1, 13, 2, 7, 19], we are not aware of any other stream data cleaning system.

The paper is organized as follows. Section 2 introduces our problem statement. The system design of Bleach is discussed in Section 3; dynamic rule management and windowing are discussed respectively in Section 4 and Section 5. Section 6 presents our experimental results. Section 7 overviews related work. Finally, Section 8 concludes our work.

2. PRELIMINARIES

Next, we introduce some basic notations we use throughout the paper, then we define the problem statement we consider in this work.

2.1 Background and Definitions

Similar to data cleaning systems in data warehouses, that read a dirty dataset and write back a cleaned dataset, in this paper we assume that a stream data cleaning system ingests a data stream and outputs a cleaned data stream instance. We consider an input data stream instance $D_{in}$ with schema $S(A_1, A_2, \ldots, A_m)$ where $A_j$ is an attribute in schema $S$. We assume the existence of unique tuple identifiers for every tuple in $D_{in}$: thus given a tuple $t_i$, $id(t_i)$ is $t_i$'s identifier. In general we define a function $id(e)$ which returns the identifier (ID) of $e$ where $e$ can be any element. A list of IDs $[id(e_1), id(e_2), \ldots, id(e_n)]$ is expressed as $id(e_1, e_2, \ldots, e_n)$ for brevity. The output data stream instance $D_{out}$ complies to schema $S$ and has the same tuple identifiers as in $D_{in}$, without any loss or duplication.

The basic unit, a cell $c_{i,j}$, is the concatenation of a tuple id, an attribute and the projection of the tuple on the attribute: $c_{i,j} = (id(t_i), A_j, t_i(A_j))$. Note that $t_i(A_j)$ is the value of $c_{i,j}$, which can also be expressed as $v(c_{i,j})$. Sometimes, we may simply express $c_{i,j}$ as $c_i$ when the cell attribute is not relevant to the discussion.

In our work, when we point at a specific tuple $t_i$, we also refer to this tuple as the current tuple. Tuples appearing earlier than $t_i$ in the data stream are referred to as earlier tuples and those appearing after $t_i$ are referred to as later tuples.

To perform data cleaning, we define a set of rules $\Sigma = [r_1, \ldots, r_n]$, in which $r_k$ is either a functional dependency (FD) rule or a conditional FD rule (CFD). Each rule has a unique rule identifier $id(r_k)$. A CFD rule $r_k$ is represented by $(X \rightarrow A, cond(Y))$, in which $cond(Y)$ is a boolean function on a set of attributes $Y$ where $Y \subseteq S$. $X$ and $A$ are respectively referred to as the set of left-hand side (LHS) attributes and the right-hand side (RHS) attribute: $LHS(r_k) = X$, $RHS(r_k) = A$.
When the rule is clear in the context, we omit $r_k$ so that $LHS = X$, $RHS = A$. Cells of LHS (RHS) attributes are also referred to as LHS (RHS) cells. $Y$ is referred to as the set of conditional attributes. For a pair of tuples $t_1$ and $t_2$ satisfying the condition $cond(t_1(Y)) = cond(t_2(Y)) = true$, if $t_1(B) = t_2(B)$ for all $B \in X$ but $t_1(A) \neq t_2(A)$, then it is a violation for $r_k$. A data stream instance $D$ satisfies $r_k$, denoted as $D \models r_k$, when no violations for $r_k$ exist in $D$. An FD rule can be seen as a special case of a CFD rule where $cond(Y)$ is always true and $Y$ is $\emptyset$. We refer to an attribute as an intersecting attribute if it is involved in multiple rules. If $D$ satisfies a set of rules $\Sigma$, denoted $D \models \Sigma$, then $D \models r_k$ for all $r_k \in \Sigma$. If $D$ does not satisfy $\Sigma$, $D$ is a dirty data stream instance.

2.2 Challenges and Goals

An ideal stream data cleaning system should accept a dirty input stream $D_{in}$ and output a clean stream $D_{out}$, in which all rule violations in $D_{in}$ are repaired ($D_{out} \models \Sigma$). However, this is not possible in reality due to:

Real-time constraint: As the data cleaning is incremental, the cleaning decision for a tuple (repair or not repair) can only be made based on itself and earlier tuples in the data stream, which is different from data cleaning in data warehouses where the entire dataset is available. In other words, if a dirty tuple only has violations with later tuples in the data stream, it cannot be cleaned. A late update for a tuple in the output data stream is not accepted.

time  item          category  clientid   city    zipcode
t1    MacBook       computer  (missing)  France  751
t2    bike          sports    (missing)  Lyon    null
t3    Interstellar  movies    (missing)  Paris   751
t4    bike          toys      (missing)  Nice    6
t5    Titanic       movies    (missing)  Paris   null

Figure 1: Illustrative example of a data stream consisting of on-line transactions.

Dynamic rules: In a stream data cleaning system, the rule set is not static. A new rule may be added or an obsolete rule may be deleted at any time. A processed data tuple cannot be cleaned again with an updated rule set. Reprocessing the whole data stream whenever the rule set is updated is not realistic.

Unbounded data: A data stream produces an unbounded amount of data, that cannot be stored completely.
Thus, stream data cleaning cannot afford to perform cleaning on the full data history. Namely, if a dirty tuple only has violations with tuples that appear much earlier in the data stream, it is likely that it will not be cleaned.

Consider the example in Figure 1, which is a data stream of on-line shopping transactions. Each tuple represents a purchase record, which contains a purchased item (item), the category of that item (category), a client identifier (clientid), the city of the client (city) and the zip code of that city (zipcode). In the example, we show an extract of five data tuples of the data stream, from $t_1$ to $t_5$. Now, assume we are given two FD rules and one CFD rule stating what a clean data stream should look like: (r1) the same items can only belong to the same category; (r2) two records with the same clientid must have the same city; (r3) two records with the same non-null zip code must have the same city:

($r_1$) item $\rightarrow$ category
($r_2$) clientid $\rightarrow$ city
($r_3$) zipcode $\rightarrow$ city, zipcode $\neq$ null

If we focus on the detection of tuples that violate such rules, we recognize three violations among the five tuples: (v1) $t_1$ and $t_3$ have the same non-null zip code ($t_1(zipcode) = t_3(zipcode) \neq null$) but different city names ($t_1(city) \neq t_3(city)$); (v2) $t_2$ claims bikes belong to category sports while $t_4$ classifies bikes as toys ($t_2(item) = t_4(item)$, $t_2(category) \neq t_4(category)$); and (v3) $t_1$ and $t_5$ have the same clientid but different city names ($t_1(clientid) = t_5(clientid)$, $t_1(city) \neq t_5(city)$).

Note that when a stream data cleaning system receives tuple $t_1$, no violation can be detected, as in our example $t_1$ only has violations with later tuples $t_3$ and $t_5$. Thus, no modification can be made to $t_1$. Furthermore, delaying the cleaning process for $t_1$ is not an option, not only because of real-time constraints, but also because it is difficult to predict for how long this tuple should be buffered for it to be cleaned.

Although performing incremental violation detection seems straightforward, incremental violation repair is much more complex to achieve. Coming back to the example in Figure 1, assume that the stream cleaning system receives tuple $t_5$ and successfully detects the violation $v_3$ between $t_5$ and $t_1$. Such detection is not sufficient to make the correct repair decision, as the tuple $t_1$ also conflicts with another tuple, $t_3$. An incremental repair in a stream data cleaning system should also take the violations among earlier tuples into account.

To account for the intricacies of the violation repair process, we introduce the concept of a violation graph [19]. A violation graph is a data structure containing the detected violations, in which each node represents a cell. If some violations share a common cell, they will be grouped into a single subgraph (sg). Therefore, the violation graph is partitioned into smaller independent subgraphs. A single cell can only be in one subgraph. If two subgraphs share a common cell, they need to merge. The repair decision of a tuple is only relevant to the subgraphs in which its cells are involved. A violation graph for our example can be seen in Figure 2.

Figure 2: An example of a violation graph, derived from our running example. The upper subgraph (v1, v3) connects $t_3(city)$, $t_1(city)$ and $t_5(city)$; the lower subgraph (v2) connects $t_2(category)$ and $t_4(category)$.
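To make the running example concrete, the following minimal Python sketch checks the three rules pairwise over the five tuples of Figure 1 and groups the resulting violations into subgraphs with a union-find structure. It is an offline illustration of the definitions only, not Bleach's incremental method. The clientid values ("A" to "D") are hypothetical placeholders, chosen only so that t1 and t5 share a clientid as stated in the text, and the names `find_violations` and `violation_subgraphs` are introduced here for illustration.

```python
from itertools import combinations

# Tuples of Figure 1 in arrival order. The clientid values are hypothetical
# placeholders; zip codes are copied as they appear in the figure.
STREAM = [
    ("t1", {"item": "MacBook", "category": "computer", "clientid": "A",
            "city": "France", "zipcode": "751"}),
    ("t2", {"item": "bike", "category": "sports", "clientid": "B",
            "city": "Lyon", "zipcode": None}),
    ("t3", {"item": "Interstellar", "category": "movies", "clientid": "C",
            "city": "Paris", "zipcode": "751"}),
    ("t4", {"item": "bike", "category": "toys", "clientid": "D",
            "city": "Nice", "zipcode": "6"}),
    ("t5", {"item": "Titanic", "category": "movies", "clientid": "A",
            "city": "Paris", "zipcode": None}),
]

# rule id -> (LHS attributes, RHS attribute, condition); FDs use a constant-true condition.
RULES = {
    "r1": (("item",), "category", lambda t: True),
    "r2": (("clientid",), "city", lambda t: True),
    "r3": (("zipcode",), "city", lambda t: t["zipcode"] is not None),
}

def find_violations(stream, rules):
    """Return (rule id, cell, cell) for every violating pair; a cell is (tuple id, attribute)."""
    found = []
    for rule_id, (lhs, rhs, cond) in rules.items():
        for (id1, t1), (id2, t2) in combinations(stream, 2):
            if not (cond(t1) and cond(t2)):
                continue
            if all(t1[a] == t2[a] for a in lhs) and t1[rhs] != t2[rhs]:
                found.append((rule_id, (id1, rhs), (id2, rhs)))
    return found

def violation_subgraphs(violations):
    """Group cells into subgraphs: violations sharing a cell end up together."""
    parent = {}

    def find(cell):
        parent.setdefault(cell, cell)
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path halving
            cell = parent[cell]
        return cell

    for _, c1, c2 in violations:
        parent[find(c1)] = find(c2)
    groups = {}
    for cell in list(parent):
        groups.setdefault(find(cell), set()).add(cell)
    return sorted(sorted(group) for group in groups.values())

violations = find_violations(STREAM, RULES)
subgraphs = violation_subgraphs(violations)
```

Under these assumptions, `violations` holds the three violations v1 to v3 and `subgraphs` holds the two subgraphs of Figure 2: the city cells of t1, t3 and t5, and the category cells of t2 and t4.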
Given this violation graph, to make the repair decision for tuple $t_5$, the cleaning system can only rely on the upper subgraph, which consists of violations $v_1$ and $v_3$ with the common cell $t_1(city)$. We now give our problem statement as follows.

Problem statement: Given an unbounded data stream with an associated schema^1 and a dynamic set of rules, how can we design an incremental and real-time data cleaning system, including violation detection and violation repair mechanisms, using bounded computing and storage resources, to output a cleaned data stream?

^1 Note that although we restrict the data stream to have a fixed schema in this work, it is easy to extend our work to support a dynamic schema.

In the next three sections, we give a detailed description of our distributed stream data cleaning system, that we call Bleach.

3. BLEACH DESIGN AND ALGORITHMS

In this section, we overview the Bleach architecture and provide details about its components. As shown in Figure 3, Bleach consists of two main blocks, namely the detect and repair modules, and a rule controller module, which is discussed in Section 4. The input data stream first enters the detect module, which reveals violations against defined rules. The intermediate data stream, output from the first module, is enriched with violation information, which the repair module uses to make repair decisions. Finally, the system outputs a cleaned data stream. Next, we delve into the details of the first two modules and outline several optimizations that aim at achieving efficiency and performance.

Figure 3: Stream data cleaning overview. The data stream enters the detect module, which keeps the data history; the intermediate data stream with violations feeds the repair module, which keeps the violation graph and outputs the cleaned data stream; rule updates enter through the rule controller.

Figure 4: The Detect Module. The ingress router splits each tuple into the sub-tuples (item, category), (clientid, city) and (zipcode, city) for the detect workers of rule r1: (item -> category), rule r2: (clientid -> city) and rule r3: (zipcode -> city), (zipcode != null); each detect worker keeps its own data history and sends its output to the egress router.

3.1 Violation Detection

The violation detection module aims at finding input tuples that violate rules. To do so, it stores them in an in-memory, efficient and compact data structure that we call the data history.
Input tuples are thus compared to those in the data history to detect violations. Figure 4 illustrates the internals of the detect module: it consists of an ingress router, an egress router and multiple detect workers (DW). Bleach maps violation rules to such DWs: each worker is in charge of finding violations for a specific rule.

3.1.1 The Ingress Router

The goal of the ingress router is to partition and distribute incoming tuples to DWs. Now, as discussed in Section 2, only a subset of the attributes of an input tuple are relevant when verifying data validity against a given rule. For example, an FD rule only requires its LHS and RHS attributes to be verified, ignoring the rest of the input tuple attributes. Therefore, when the ingress router receives an input tuple, it partitions the tuple based on the current rule set, and only sends the relevant information to each DW in charge of each specific rule. As such, an input tuple is broken into multiple sub-tuples, which all share the same identifier of the corresponding input tuple. Note that some attributes of an input tuple might be required by multiple rules: in this case, sub-tuples will contain redundant information, allowing each DW to work independently. An example of tuple partitioning can be found in Figure 4, where we reuse the input data schema and the rules from Section 2.

3.1.2 The Detect Worker

Each DW is assigned a rule, and receives the relevant sub-tuples stemming from the input stream. For each sub-tuple, a DW needs to perform a lookup operation in the data history, and eventually emit a message (that is part of an intermediate data stream) to downstream components when a rule violation is detected. To achieve efficiency and performance, lookup operations on the data history need to be fast, and the intermediate data stream should avoid redundant information. Next, we first describe how the data history is represented and materializes in memory; then, we describe the output messages a DW generates, and finally outline the DW algorithm.

Data history representation. A DW accumulates relevant input sub-tuples in a compact data structure that enables an efficient lookup process, which makes it similar to a traditional indexing mechanism. The structure^2 of the data history is illustrated in Figure 5.

First, to speed up the lookup process, sub-tuples are grouped by the value of the LHS attributes used by a given rule: we call such a group a cell group (cg). Thus, a cg stores all RHS cells whose sub-tuples share the same LHS value. The identifier of a cell group $cg_l$ is the combination of the rule assigned to the DW and the value of the LHS attributes, expressed as $id(cg_l) = (id(r_k), t(LHS))$ where $r_k$ is the rule assigned to the DW.

Next, to achieve a compact data representation, all cells in a cg sharing the same RHS value are grouped into a super cell (sc): $sc_m = [c_{1,j}, c_{2,j}, \ldots, c_{n,j}]$. From Section 2, recall that a cell is made of a tuple ID, an attribute and a value: $(id(t_i), A_j, t_i(A_j))$.
Therefore, a super cell can be compressed as a list of tuple IDs, an attribute and their common value: $sc_m = (id(t_1, t_2, \ldots, t_n), A_j, t(A_j))$ where $t(A_j) = t_1(A_j) = \ldots = t_n(A_j)$. Hence, within an individual DW, sub-tuples whose cells are compressed in the same sc are equivalent, as they have the same LHS attributes value (the identity of the cell group) and the same RHS attribute value (the value of the super cell). A cell group $cg_l$ now can be expressed as $cg_l = ((id(r_k), t(LHS)), [sc_1, sc_2, \ldots])$, including an identifier and a list of super cells.

In summary, the lookup process for a given input sub-tuple is as follows. Cell groups are stored in a hash-map using their identifier as keys: therefore the DW first finds the cg corresponding to the current sub-tuple. Cells in the corresponding cg are the only cells that might be in conflict with the current cell. Overall, the complexity of the lookup process for a sub-tuple is O(1).

Figure 5: The structure of the data history in a detect worker. Cell groups are indexed by v(LHS); within a cell group, super cells are indexed by v(RHS).

^2 The techniques we use are similar to the notion of partitions and compression introduced in Nadeef [1].

Violation messages. DWs generate an intermediate data stream of violation messages, which help downstream components to eventually repair input tuples. The goal of the DW is to generate as few messages as possible, while allowing effective data repair. When the lookup process reveals that the current tuple does not violate a rule, DWs emit a non-violation message ($msg_{nvio}$). Instead, when a violation is detected, a DW constructs a message with all the necessary information to repair it, including the ID of the cell group corresponding to the current tuple and the RHS cells of the current and earlier tuples in the data history: $msg_{vio} = (id(cg_l), c_{cur}, c_{old})$. Now, to reduce the number of violation messages, the DW can use a super cell in place of a single cell ($c_{old}$) in conflict with the current tuple.
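The data history described above can be pictured as a two-level hash map. The sketch below is a minimal illustration, assuming plain Python dictionaries: the outer key is the cell group identifier (rule ID, LHS value) and the inner key is the RHS value, so that each inner entry plays the role of a compressed super cell. The class name `DataHistory`, its methods and the extra tuple t6 are hypothetical and only serve the illustration.

```python
class DataHistory:
    """Data history of one detect worker (illustrative sketch)."""

    def __init__(self, rule_id):
        self.rule_id = rule_id
        # cell group id -> {RHS value -> list of tuple IDs}; each inner entry
        # is one super cell: tuple IDs sharing the same LHS and RHS values.
        self.cell_groups = {}

    def lookup(self, lhs_value):
        """Return the super cells of the matching cell group, or None (one hash lookup)."""
        return self.cell_groups.get((self.rule_id, lhs_value))

    def add(self, tuple_id, lhs_value, rhs_value):
        """Store the RHS cell of a sub-tuple in its cell group and super cell."""
        group = self.cell_groups.setdefault((self.rule_id, lhs_value), {})
        group.setdefault(rhs_value, []).append(tuple_id)

# Rule r1 (item -> category) on the bike tuples of the running example;
# t6 is a hypothetical later tuple added to show the compression.
history = DataHistory("r1")
history.add("t2", ("bike",), "sports")
history.add("t4", ("bike",), "toys")
history.add("t6", ("bike",), "sports")
bike_group = history.lookup(("bike",))
```

Here `bike_group` contains two super cells, one for "sports" holding t2 and t6 and one for "toys" holding t4, while a lookup for an unseen LHS value returns `None`.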
In addition, recall that a single cg can contain multiple super cells, thus possibly requiring multiple messages for each group. However, we observe that two cells in the same cg must also conflict with each other, as long as their values are different. Since the data repair module in Bleach is stateful, it is safe to omit multiple violation messages for such cells.

Algorithm details. Next, we present the DW violation detection algorithm details, as illustrated in Algorithm 1.

Algorithm 1 Violation Detection
 1: given rule r = (X → A_j, cond(Y))
 2: procedure Receive(sub-tuple t_i)             ▷ cond(t_i(Y)) = true
 3:     if ∃ cg_l : id(cg_l) = (id(r), t_i(X)) then
 4:         if |cg_l| = 1 then                   ▷ cg_l contains sc_old
 5:             if v(sc_old) = t_i(A_j) then
 6:                 Emit msg_nvio
 7:             else
 8:                 Emit msg_vio(id(cg_l), c_cur, sc_old)
 9:             end if
10:         else
11:             Emit msg_vio(id(cg_l), c_cur, null)
12:         end if
13:     else
14:         Create cg_l                          ▷ Create new cell group
15:         Emit msg_nvio
16:     end if
17:     Add c_cur to cg_l
18: end procedure

The algorithm starts by treating FD rules as a special case of CFD rules (line 1). Then, when a DW receives a sub-tuple $t_i$ satisfying the rule condition (line 2), it performs a lookup in the data history to check if the corresponding cell group $cg_l$ exists (line 3). If yes, it determines the number of sc contained in $cg_l$ (line 4). If there is only one sc, $sc_{old}$, violation detection works as follows. If the RHS cell of the current sub-tuple, $c_{cur}$, has the same value as $sc_{old}$, it emits a non-violation message (lines 5-6). Otherwise, a violation has been detected: the DW emits a complete violation message, containing both the current cell and the old cell (line 8). If the cg contains more than one sc, the DW emits a single append-only violation message, which only contains the cell of the current sub-tuple (line 11). Such compact messages omit the sc from the data history, since they must be contained in earlier violation messages. Finally, if the lookup procedure (line 3) fails, the DW creates a new cell group and emits a non-violation message (lines 14-15). At this point, the current cell $c_{cur}$ is added to the corresponding group $cg_l$ (line 17), either in an existing sc, or as a new distinct cell. It is worth noticing that, following Algorithm 1, a DW emits a single message for each input sub-tuple, no matter how many tuples in the data history it conflicts with.

3.1.3 The Egress Router

The egress router gathers (violation or non-violation) messages for a given data tuple, as received from all DWs. Such messages are then sent together downstream towards the repair module.

3.2 Violation Repair

The goal of this module is to take the repair decisions for dirty data tuples, based on an intermediate stream of violation messages generated by the detect module. To achieve data repair, Bleach uses a data structure called the violation graph, as outlined in Section 2. Violation messages contribute to the creation and dynamics of the violation graph, which essentially groups those cells that, together, are used to perform data repair. Figure 6 sketches the internals of the repair module: it consists of an ingress router, the repair workers (RW), and an aggregation component that emits clean data.

Figure 6: Violation Repair.
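Looking back at Algorithm 1 of Section 3.1, its control flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Bleach implementation: the class name `DetectWorker`, the tuple encoding of messages (`("nvio",)` and `("vio", cg_id, c_cur, sc_old)`) and the choice to return `None` for sub-tuples that do not satisfy the rule condition are all hypothetical, and t6 is an invented later tuple.

```python
class DetectWorker:
    """Detect worker for one rule (X -> A_j, cond(Y)), following Algorithm 1."""

    def __init__(self, rule_id, lhs, rhs, cond=lambda sub_tuple: True):
        self.rule_id, self.lhs, self.rhs, self.cond = rule_id, lhs, rhs, cond
        self.history = {}  # cell group id -> {RHS value -> [tuple IDs]} (super cells)

    def receive(self, tuple_id, sub_tuple):
        if not self.cond(sub_tuple):
            return None  # assumption: the rule does not apply, nothing is emitted
        cg_id = (self.rule_id, tuple(sub_tuple[a] for a in self.lhs))
        value = sub_tuple[self.rhs]
        c_cur = (tuple_id, self.rhs, value)
        group = self.history.get(cg_id)                    # line 3: lookup
        if group is None:                                  # lines 13-15
            group = self.history[cg_id] = {}
            msg = ("nvio",)
        elif len(group) == 1:                              # line 4: single super cell
            ((old_value, old_ids),) = group.items()
            if old_value == value:                         # lines 5-6
                msg = ("nvio",)
            else:                                          # line 8: complete message
                sc_old = (tuple(old_ids), self.rhs, old_value)
                msg = ("vio", cg_id, c_cur, sc_old)
        else:                                              # line 11: append-only message
            msg = ("vio", cg_id, c_cur, None)
        group.setdefault(value, []).append(tuple_id)       # line 17
        return msg

# Rule r1 (item -> category) on the bike tuples of the running example,
# followed by a hypothetical later tuple t6.
dw = DetectWorker("r1", ("item",), "category")
m1 = dw.receive("t2", {"item": "bike", "category": "sports"})
m2 = dw.receive("t4", {"item": "bike", "category": "toys"})
m3 = dw.receive("t6", {"item": "bike", "category": "sports"})
```

In this run, `m1` is a non-violation message, `m2` is a complete violation message carrying the super cell of t2, and `m3` is an append-only message because the cell group already holds two super cells.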
An additional component, called the coordinator, is used to steer violation graph management, with the contribution of RWs.

3.2.1 The Ingress Router

The ingress router is a simple component that broadcasts all incoming violation messages to all RWs. As opposed to its counterpart in the detection module, the ingress router does not perform data partitioning: instead, RWs are in charge of using only the relevant information contained in the violation messages they receive, with the goal of creating and maintaining the violation graph.

3.2.2 The Repair Worker

Next, we delve into the details of the operation of a RW. First, we focus on the violation graph and the data repair algorithm we implement in Bleach. Then, we move to the key challenge that RWs address, that is how to maintain a distributed violation graph. As such, we focus on graph partitioning and maintenance. Due to violation graph dynamics, coordination issues might arise in a distributed setting: such problems are addressed by the coordinator component.

The repair algorithm. Current data repair algorithms use the concept of a violation graph to repair dirty data based on user-defined rules. As outlined in Section 2, a violation graph is a succinct representation of cells (both current and historical) that are in conflict according to some rules. A violation graph is composed of subgraphs. As incoming data streams in, the violation graph evolves: specifically, its subgraphs might merge or split, depending on the contents of violation messages. Using the violation graph, several algorithms can perform data cleaning, such as the equivalence class algorithm [5] or the holistic data cleaning algorithm [9]. Currently, Bleach relies on an incremental version of the equivalence class algorithm that supports streaming input data, although alternative approaches can be easily plugged into our system. Thus, a subgraph in the violation graph can be interpreted as an equivalence class, in which all cells are supposed to have the same value.

Distributed violation graph. Due to the unbounded nature of streaming data, it is reasonable to expect the violation graph to grow to sizes exceeding the capacity of a single RW.
As such, in Bleach, the violation graph is a distributed data structure, partitioned across all RWs. However, unlike for DWs, the partitioning scheme cannot be simply rule based, because a cell may violate multiple rules, creating issues related to coordination and load balancing. More generally, no partitioning scheme can guarantee that the cells of a single violation message are placed in the same partition, so that each subgraph is stored in a single RW.

Next, we describe how Bleach builds and maintains a distributed violation graph. The graph is built using the $msg_{vio}$ output by the egress router, which all RWs receive. Upon receiving a violation message, RWs process it independently, according to the following rules:
i) if any of the current or old cells encapsulated in the message are already contained in an existing subgraph, both cells are added to the existing subgraph;
ii) if an existing subgraph has cells which are in the same cell group as any of the cells in the message, the cells in the message are both added to the existing subgraph;
iii) if any of these two cells are contained in multiple subgraphs, these subgraphs need to merge;
iv) if none of these two cells is already contained in any subgraph, a new subgraph will be created with these two cells.

We define a subgraph identifier $id(sg_k)$ to be the list of cell group IDs comprised in $msg_{vio}$: $id(cg_1, cg_2, \ldots)$. A subgraph can be expressed as $sg_k = (id(cg_1, cg_2, \ldots), [sc_1, sc_2, \ldots])$: it consists of a group of sc, stored in a compressed format, as shown in Section 3.1. Note that when two subgraphs merge, their identifiers are also merged by concatenating both cg ID lists. To make the subgraph ID clear, $sg_k$ can be presented as $sg_{id(cg_1, cg_2, \ldots)}$.

Since subgraphs are collections of cells, we distribute the latter across all RWs, using the cells' tuple IDs for partitioning. Then, we use the subgraph identifier to recognize partitions from the same subgraph. As a consequence, a subgraph spans several RWs, each storing a fraction of the cells comprised in the subgraph.

Finally, we note that the violation graph, and in particular the subgraph partitions stored by each RW, materializes as a data structure stored in RAM. Such a data structure is organized similarly to that of the data history presented in Section 3.1, which allows an efficient execution of the repair algorithm, and a compact data representation.

An illustrative example is in order. Let us assume there are two RWs, rw1 and rw2, and the current violation graph consists of two subgraphs: $sg_{id(cg_1)}$, containing cells $c_1, c_2, c_3$, and $sg_{id(cg_2)}$, containing cells $c_4, c_5$. In our example, the violation graph is partitioned as in Figure 7(a): both RWs have a portion of the cells of every subgraph.

Figure 7: Violation graph build example. (a) initial state; (b) merge only in rw1; (c) merge in rw1 and rw2.

3.2.3 The Coordinator

The problem we address now is due to the dynamics of the violation graph, which evolves as new violation messages stream into the repair module. As each subgraph is partitioned among all RWs, subgraph partitions must be identified by the same ID: this is because a subgraph is a proxy for an equivalence class, and all its cells contribute to the correct functioning of the repair algorithm. Continuing with the example from Figure 7(a), suppose a new violation message $\{id(cg_3), c_6, c_1\}$ is received by all RWs.
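The effect of this message on two independent workers can be reproduced with a small Python sketch of rules i) to iv). It rests on several assumptions that are our reading of Figure 7 rather than statements of the paper: rw1 stores cells c1, c3, c5 and rw2 stores c2, c4, c6; a worker only stores the cells it owns; and the class `RepairWorker` with its methods is hypothetical. The sketch handles one violation message at a time and does not attempt to reproduce the coordination-free merges discussed later.

```python
class RepairWorker:
    """One repair worker holding its partition of the violation graph (sketch)."""

    def __init__(self, owned_cells, subgraphs):
        self.owned = set(owned_cells)  # cells whose tuple IDs map to this worker
        # each subgraph: {"cgs": set of cell group IDs, "cells": set of local cells}
        self.subgraphs = [{"cgs": set(g), "cells": set(c)} for g, c in subgraphs]

    def receive(self, cg_id, c_cur, c_old=None):
        cells = {c for c in (c_cur, c_old) if c is not None}
        # Rules i) and ii): subgraphs sharing a cell or the cell group of the message.
        hits = [sg for sg in self.subgraphs
                if sg["cells"] & cells or cg_id in sg["cgs"]]
        # Rule iv) when hits is empty; rule iii) when several subgraphs match.
        merged = {"cgs": {cg_id}, "cells": cells & self.owned}
        for sg in hits:
            merged["cgs"] |= sg["cgs"]
            merged["cells"] |= sg["cells"]
        self.subgraphs = [sg for sg in self.subgraphs
                          if not any(sg is h for h in hits)]
        self.subgraphs.append(merged)

    def ids(self):
        """Subgraph identifiers as sorted tuples of cell group IDs."""
        return sorted(tuple(sorted(sg["cgs"])) for sg in self.subgraphs)

# Initial state of Figure 7(a), as we read it.
rw1 = RepairWorker({"c1", "c3", "c5"}, [({"cg1"}, {"c1", "c3"}), ({"cg2"}, {"c5"})])
rw2 = RepairWorker({"c2", "c4", "c6"}, [({"cg1"}, {"c2"}), ({"cg2"}, {"c4"})])

# Both workers receive the same violation message {id(cg3), c6, c1}.
for rw in (rw1, rw2):
    rw.receive("cg3", "c6", "c1")
```

After the message, `rw1.ids()` lists the subgraphs (cg1, cg3) and (cg2), while `rw2.ids()` lists (cg1), (cg2) and (cg3): the two workers no longer agree on subgraph identifiers.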
Now, in rw1, the new violation is added to subgraph $sg_{id(cg_1)}$ since both the message and the subgraph share the same cell $c_1$: as such, the new subgraph becomes $sg_{id(cg_1, cg_3)}$. Instead, in rw2, the new violation triggers the creation of a new subgraph $sg_{id(cg_3)}$, since no common cells are shared between the message and existing subgraphs in rw2. The violation graph becomes inconsistent, as shown in Figure 7(b): this is a consequence of the independent operation of RWs. Instead, the repair algorithm requires the violation graph to be in a consistent state, as shown in Figure 7(c), where both RWs use the same subgraph identifier for the same equivalence class.

To guarantee the consistency of the violation graph among independent RWs, Bleach uses a stateless coordinator component that helps RWs agree on subgraph identifiers. In what follows we present three variants of the simple protocol RWs use to communicate with the coordinator.

RW-basic. When a RW receives violation messages for a tuple, it adds the cells in the messages to the violation graph, according to its local state. Then, the RW creates a merge proposal containing the subgraph ID for each conflicting attribute, and sends it to the coordinator. Once the coordinator receives merge proposals from all RWs, it produces a merge decision, which contains a list of all cg IDs contained in the various merge proposals, and broadcasts it to all RWs. RWs merge their local subgraphs and converge to a globally consistent state.

Clearly, such a simple approach to coordination harms Bleach performance. Indeed, the RW-basic scheme requires one round-trip message for every incoming data tuple, from all RWs. However, we note that coordination is not necessarily needed for every tuple. For example, when every cell violates at most one rule, every subgraph would only have a single cg ID. Thus, coordination is not necessary. More generally, given the violation messages for a tuple, coordination is only necessary when there is a complete violation message containing an old cell which already exists in the violation graph because of a different violation rule.
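The RW-basic exchange can be illustrated with a short Python sketch. It assumes a single conflicting attribute per tuple, represents a subgraph as a pair (set of cg IDs, set of locally stored cells), and starts from the inconsistent state of Figure 7(b) as we read it; the function names `merge_decision` and `apply_decision` are hypothetical.

```python
def merge_decision(proposals):
    """Coordinator: union of all cell group IDs found in the proposals of all RWs."""
    decision = set()
    for subgraph_ids in proposals.values():
        for sg_id in subgraph_ids:
            decision |= set(sg_id)
    return decision

def apply_decision(subgraphs, decision):
    """RW: merge every local subgraph whose ID shares a cell group with the decision."""
    merged_ids, merged_cells, untouched = set(decision), set(), []
    for cgs, cells in subgraphs:
        if cgs & decision:
            merged_ids |= cgs
            merged_cells |= cells
        else:
            untouched.append((cgs, cells))
    return untouched + [(merged_ids, merged_cells)]

# Inconsistent local states after message {id(cg3), c6, c1}, as in Figure 7(b).
rw1 = [({"cg1", "cg3"}, {"c1", "c3"}), ({"cg2"}, {"c5"})]
rw2 = [({"cg1"}, {"c2"}), ({"cg2"}, {"c4"}), ({"cg3"}, {"c6"})]

# Each RW proposes the ID of the subgraph that received the cells of the message.
proposals = {"rw1": [("cg1", "cg3")], "rw2": [("cg3",)]}
decision = merge_decision(proposals)

rw1 = apply_decision(rw1, decision)
rw2 = apply_decision(rw2, decision)
```

With these inputs the decision is the set {cg1, cg3}; after applying it, both workers hold one subgraph identified by (cg1, cg3) and one identified by (cg2), which corresponds to the consistent state of Figure 7(c).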
Figure 8 gives an example, where the initial state (Figure 8(a)) is the same as in Figure 7(a). Then, two violation messages, $\{id(cg_1), c_6, null\}$ and $\{id(cg_2), c_6, null\}$, are received. Cell $c_6$ is a current cell contained in the current tuple. Obviously $sg_{id(cg_1)}$ and $sg_{id(cg_2)}$ should merge into $sg_{id(cg_1, cg_2)}$. This can be accomplished without coordination by both repair workers, as shown in Figure 8(b). Indeed, each RW is aware that $c_6$ is involved in two subgraphs, although $c_6$ is only stored in rw2 because of the partitioning scheme.

Figure 8: Example of violation graph built without coordination. (a) initial state; (b) after independent processing.

Next, we use the above observations and propose two variants of the coordination mechanism that aim at bypassing the coordinator component to improve performance. In both variants, if there exist subgraphs which can merge correctly without coordination, like in the example of Figure 8, they merge immediately.

RW-dr. In RW-dr, coordination is only conducted if it is necessary: in that case the repair worker sends a merge proposal to the coordinator and waits for the merge decision. However, this approach is not exempt from drawbacks: it may cause some data tuples in the stream to be delivered out of order. This is because the repair worker waits for the merge decision in a non-blocking way. The violation messages of a tuple which do not require coordination may be processed in the coordination gap of an earlier tuple.

RW-ir. With this variant, no matter if the violation messages of a tuple require coordination or not, a RW immediately updates its local subgraphs, executes the repair algorithm and emits a repair proposal downstream to the aggregator component. Then, if necessary, the RW lazily executes the coordination protocol. Clearly, this approach caters to system performance and avoids delivering tuples out of order, but might harm cleaning accuracy. Indeed, individual data repair proposals are based on a local view prior to finishing all necessary merge operations on subgraphs, which has a direct impact on equivalence classes.

3.2.4 The Aggregator

Using the distributed violation graph, each RW independently executes the Bleach repair algorithm and emits a data repair proposal, which includes all^3 candidate values and their frequency computed in the local subgraph partition. The aggregator component collects all repair proposals and selects the candidate value to repair a given cell as the one having the highest aggregate frequency. Finally, the aggregator modifies the current data tuple and outputs a clean data stream. Note that the aggregator only modifies current tuples. Instead, more importantly, the cells stored in the violation graph are not modified regardless of the repair decision: this allows frequency counts to be updated as new data streams into the system, thus steering the aggregator to make different repair decisions as the violation graph evolves.

4. DYNAMIC RULE MANAGEMENT

In stream data cleaning the rule set is usually not immutable but dynamic. Therefore, we now introduce a new component, the rule controller, shown in Figure 3, which allows Bleach to adapt to rule dynamics. The rule controller accepts rule updates as input and guides the detect and the repair modules to adapt to rule dynamics without stopping the cleaning process and without losing state. Rule updates can be of two types: one for adding a new rule and one for deleting an existing rule.

Detect. In the detect module, the addition of a rule triggers the instantiation of a new DW, as input tuples are partitioned by rule. The new DW starts with no state, which is built upon receiving new input tuples.
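As a brief aside on the aggregator of Section 3.2, its selection step amounts to summing candidate frequencies over all repair proposals and keeping the most frequent value. The sketch below assumes that each proposal is a mapping from candidate value to local frequency; the function name `aggregate` and the split of the city cells across two workers are hypothetical, and ties are resolved arbitrarily (here, by first occurrence).

```python
from collections import Counter

def aggregate(proposals):
    """Sum candidate frequencies over all repair proposals and pick the top value."""
    total = Counter()
    for proposal in proposals:
        total.update(proposal)  # adds the local counts to the running totals
    if not total:
        return None             # no candidate: leave the cell unchanged
    value, _ = total.most_common(1)[0]
    return value

# City subgraph of the running example when t5 arrives: t1 says France, t3 and
# t5 say Paris. The assignment of cells to the two workers is hypothetical.
proposal_rw1 = {"France": 1, "Paris": 1}
proposal_rw2 = {"Paris": 1}
repaired_city = aggregate([proposal_rw1, proposal_rw2])
```

With these proposals the aggregate frequency of Paris is 2 against 1 for France, so `repaired_city` is "Paris".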
Because a new DW starts with no state, violation detection using past tuples cannot be achieved, which is consistent with the Bleach design goals. The deletion of an existing rule, instead, simply triggers the removal of a DW, together with its own local data history.

Repair. In the repair module, the addition of a new rule is not problematic with respect to violation graph maintenance operations. The removal of a rule, instead, implies violation graph dynamics (subgraphs might shrink or split), which are more challenging to address. Thus, in a subgraph, we further group cells by cell groups. A subgraph can now also be expressed as: sg_k = (id(cg1, cg2, ...), [cg1, cg2, ...]), where each cell group gathers super cells. Some cells might span multiple groups, as they may violate multiple rules. We label such peculiar cells as hinge cells. For each hinge cell, the subgraph keeps the IDs of its connecting cell groups: c_i,j = (c_i,j, id(cg_i1, cg_i2, ...)). Hinge cells with the same value and the same connecting cell groups are also compressed into super cells.

Footnote 3: In case there are too many candidate values, we only send the top-k values, where k = 5.

Figure 9: Subgraph split example. (a) initial sg_id(cg1,cg2,cg3); (b) if delete cg2; (c) incorrect if delete cg3; (d) correct if delete cg3.

With the new organization of cells in subgraphs, the violation graph is updated as follows upon the removal of a rule. If a subgraph contains a single cell group, related to the deleted rule, RWs are simply instructed to remove it. If a subgraph contains multiple cell groups, RWs remove the cell groups related to the deleted rule and update the hinge cells. With the remaining hinge cells, RWs check the connectivity of the remaining cell groups in the subgraph and decide whether to split the subgraph or not (see footnote 4). An example of a split operation can be seen in Figure 9. The initial state of the subgraph is shown in Figure 9(a): the subgraph is sg_id(cg1,cg2,cg3), and its contents are three cell groups.
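The connectivity check performed after a rule removal can be sketched as a small union-find over cell groups. This is an illustrative sketch only, not the algorithm from the technical report: the function name `split_after_rule_removal` and its data layout are hypothetical, and the hinge-cell membership used in the usage note (c1 bridging cg1 and cg3, c7 bridging cg2 and cg3) is an assumption modeled on the Figure 9 example.

```python
def split_after_rule_removal(cell_groups, hinge_cells, removed):
    """Drop one cell group and regroup the rest by hinge-cell connectivity.

    cell_groups: set of cell-group ids in the subgraph
    hinge_cells: dict mapping a hinge cell to the set of cell groups it connects
    removed:     id of the cell group tied to the deleted rule
    Returns (list of sets of cell-group ids, remaining hinge cells).
    """
    remaining = set(cell_groups) - {removed}

    # A cell stays a hinge cell only if it still connects two or more groups.
    hinges = {}
    for cell, groups in hinge_cells.items():
        kept = set(groups) - {removed}
        if len(kept) >= 2:
            hinges[cell] = kept

    # Union-find over the remaining cell groups.
    parent = {g: g for g in remaining}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for groups in hinges.values():
        first, *rest = sorted(groups)
        for g in rest:
            parent[find(g)] = find(first)

    # Each connected component becomes one subgraph.
    components = {}
    for g in remaining:
        components.setdefault(find(g), set()).add(g)
    return sorted(components.values(), key=sorted), hinges
```

With `cell_groups = {"cg1", "cg2", "cg3"}` and the hinge cells assumed above, removing `"cg2"` leaves a single subgraph containing cg1 and cg3, whereas removing `"cg3"` leaves no hinge cells and yields two separate subgraphs.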
In Figure 9(a), cells c1 and c7 are hinge cells, which work as bridges, connecting different cell groups together. Now, as a simple case, assume we want to remove the rule pertaining to cg2: the subgraph should become sg_id(cg1,cg3), as shown in Figure 9(b). Note that cell c7 loses its status of hinge cell. A more involved case arises when we delete the rule pertaining to cg3 instead of the rule pertaining to cg2. In this case, the subgraph should not become sg_id(cg1,cg2) as shown in Figure 9(c). Indeed, removing cg3 eliminates all existing hinge cells connecting the remaining cell groups. Thus, the subgraph must split into two separate subgraphs, sg_id(cg1) and sg_id(cg2), as shown in Figure 9(d).

Footnote 4: A detailed algorithm can be found in our technical report.

5. WINDOWING

Bleach provides windowed computations, which allow expressing data cleaning over a sliding window of data. Although windowing is a common operation in most streaming systems, in stream data cleaning it specifically addresses the challenge of the unbounded nature of streaming data: without windowing, the data structures Bleach uses to detect and repair a dirty stream would grow indefinitely, which is impractical. In this section, we discuss two windowing strategies: a basic, tuple-based windowing strategy and an advanced strategy that aims at improving cleaning accuracy.

5.1 Basic Windowing

The underlying idea of the basic windowing strategy is to only use tuples within the sliding window to populate the data structures used by Bleach to achieve its tasks. Next, we outline the basic windowing strategy for both DW and RW operation.

Windowed Detection. We now focus on how DWs maintain their local data history. Clearly, the data history only contains cells that fall within the current window. When the window slides forward, DWs update the data history as follows: i) if a cell group ends up having no cells in the new window, DWs simply delete it; ii) for the remaining cell groups, DWs drop all cells that fall outside the new window, and update the remaining super cells accordingly. Note that, if implemented naively, the first operation above can be costly as it involves a linear scan of all cell groups. To improve the efficiency of data history updates, Bleach uses the following approach. It creates a FIFO queue of k lists, which store cell groups. In case the sliding step is half the window size, k = 2; more generally, we set k to be the window size divided by the sliding step. Any new cell group from the current window enters the queue in the k-th list. Any cell group update, e.g. due to a new cell added to the cell group, promotes it from list j to list k. As the window slides forward, the queue drops the list (and its cell groups) in the first position and acquires a new empty list in position k + 1, which thereby becomes the new k-th list.

Windowed Repair. Now we focus on how to maintain the violation graph in RWs. Again, the violation graph only stores cells within the current window.
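The FIFO queue of k lists described for DWs can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `KListQueue`, the use of sets instead of lists, and the `position` index are choices made for the example, and the window size is assumed to be an exact multiple of the sliding step.

```python
from collections import deque


class KListQueue:
    """Hypothetical FIFO queue of k lists holding cell-group ids."""

    def __init__(self, window_size, slide_step):
        # k is the window size divided by the sliding step.
        self.k = window_size // slide_step
        self.lists = deque(set() for _ in range(self.k))
        # Index from a cell-group id to the list currently holding it.
        self.position = {}

    def touch(self, group_id):
        """Insert a new cell group, or promote an updated one to list k."""
        old = self.position.get(group_id)
        if old is not None:
            old.discard(group_id)
        newest = self.lists[-1]
        newest.add(group_id)
        self.position[group_id] = newest

    def slide(self):
        """Slide the window: drop the first list and append an empty one.

        Returns the expired cell groups, i.e. those with no cells left
        in the new window.
        """
        expired = self.lists.popleft()
        for group_id in expired:
            del self.position[group_id]
        self.lists.append(set())
        return expired
```

With this layout, finding the cell groups to delete on a slide costs time proportional to the number of expired groups, instead of a linear scan over all cell groups.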
When the window slides forward, RWs update the violation graph as follows:
i) If a subgraph has no cells in the new window, RWs delete the subgraph;
ii) For the remaining subgraphs, if a cell group has no cells in the new window, RWs delete the cell group;
iii) RWs also delete hinge cells that are outside of the new window. This could require subgraphs to split, as they could miss a bridge cell to connect their cell groups;
iv) For the remaining cell groups, RWs drop all cells outside of the new window, and update the remaining super cells accordingly.

For efficiency reasons, Bleach uses the same k-list approach described for DWs to manage violation graph updates due to a sliding window.

5.2 Bleach Windowing

The basic windowing strategy only relies on the data within the current window to perform data cleaning, which may limit the cleaning accuracy. We begin with a motivating example, then describe the Bleach windowing strategy, which aims at improving cleaning accuracy. Note that here we only focus on the repair module and its violation graph, since Bleach windowing does not modify the operation of the detect module.

Figure 10: Motivating example: Basic vs. Bleach windowing. (a) input data; (b) output data with basic windowing; (c) output data with Bleach windowing.

Figure 10(a) illustrates a data stream of two-attribute tuples. Assume we use a single FD rule (A → B), a window size of 4 tuples, a sliding step of 2 tuples, and the basic windowing strategy. When t4 arrives, the window covers tuples [1, 4]. According to the repair algorithm, Bleach repairs t4(B) and sets it to the value b. Now, when tuple t5 arrives, the window moves to cover tuples [3, 6], even though t6 has yet to arrive. With only three tuples in the current window, the algorithm determines that t5(B) is correct, because now the majority of tuples have value c. The output stream produced using basic windowing is shown in Figure 10(b).
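The motivating example can be replayed with a few lines of code. The sketch below is a simplification: it reduces the repair algorithm to a majority vote over the B values of tuples sharing one A value, and the concrete values (b for t1 to t3, c for t4 and t5) are assumptions consistent with the counts reported for this example. For contrast, the second function keeps counting tuples that have left the window, which is the idea that Bleach windowing builds on.

```python
from collections import Counter

# Assumed B values of t1..t5; every tuple shares the same A value.
STREAM = {1: "b", 2: "b", 3: "b", 4: "c", 5: "c"}


def most_frequent(values):
    """Most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]


def repair_basic(stream, window, current):
    """Majority value among tuples inside the window that have arrived."""
    lo, hi = window
    return most_frequent(
        v for t, v in stream.items() if lo <= t <= min(hi, current)
    )


def repair_with_history(stream, current):
    """Majority value when expired tuples still contribute to the counts."""
    return most_frequent(v for t, v in stream.items() if t <= current)
```

Here `repair_basic(STREAM, (1, 4), 4)` counts three b values against one c, while `repair_basic(STREAM, (3, 6), 5)` only sees t3, t4 and t5, so c wins the vote; `repair_with_history(STREAM, 5)` still counts t1 and t2 and therefore prefers b.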
Clearly, cleaning accuracy is sacrificed, since it is easy to see that t5(B) should have been repaired to value b, which is the most frequent value overall. Hence the need for a different windowing strategy to overcome such problems.

Bleach windowing relies on an extension of the super cell, which we call a cumulative super cell. The idea is for the violation graph to accumulate past state, to complement the view Bleach builds using tuples from the current window. Hence, a cumulative super cell is represented as a super cell, with an additional field that stores the number of occurrences of cells (both hinge cells and super cells within a cell group) with the same RHS value, including those that have been dropped because they fall outside the sliding window boundaries.

When using Bleach windowing, RWs maintain the violation graph by storing cumulative super cells. When the window slides forward, RWs update the violation graph as follows. The first two steps are equivalent to those for the basic strategy. The last two steps are modified as follows:
iii) For the remaining subgraphs, RWs update hinge cells, and flush those that do not bridge cell groups anymore because of the update. Also, RWs split subgraphs according to the remaining hinge cells;
iv) For the remaining cell groups and hinge cells, RWs update compressed super cells, flushing cells which fall outside the new window while keeping their count.

The flush operation used above only drops the content of a super cell, but keeps its structure and its count field.

Now, going back to the example in Figure 10(a), when tuple t5 arrives, Bleach stores two cumulative super cells: csc1(id(t) = 3, value = b, count = 3) and csc2(id(t) = [4, 5], value = c, count = 2). Although t1 and t2 have been deleted because they are outside the sliding window, they still contribute to the count field in csc1. Therefore, tuple t5(B) is correctly repaired to value b, as shown in Figure 10(c).

Additional notes: When using cumulative super cells, Bleach keeps track of candidate values to be used in the repair algorithm as long as cell groups remain. By using cumulative super cells for hinge cells, subgraphs only split if some cell groups are removed when the window moves forward. Note that the introduction of cumulative super cells does not interfere with dynamic rule management: in particular, when deleting a rule, subgraphs update correctly when hinge cells use the cumulative format. Overall, to compute the count of a candidate value in a subgraph, cumulative super cells accumulate the counts of relevant compressed super cells from all cell groups, taking into account any duplicate contributions from hinge cells.

Obviously, Bleach windowing requires more storage than basic windowing, as cumulative super cells store additional information, and because of the flush operation described earlier, which keeps a super cell structure even when it has an empty cell list. Section 6 demonstrates that such additional overhead is well balanced by superior cleaning accuracy, making Bleach windowing truly desirable.

6. EVALUATION

The Bleach prototype implementation is built using Apache Storm [27] (see footnote 5). Input streams, including both the data stream and rule updates, are fed into Bleach using Apache Kafka [18]. Our goal is to demonstrate that Bleach achieves efficient stream data cleaning under real-time constraints. Our evaluation uses throughput, latency and dirty ratio as performance metrics.
We express the dirty ratio as the fraction of dirty data remaining in the output data stream: the smaller the dirty ratio, the higher the cleaning accuracy. The processing latency is measured from uniformly sampled tuples (1 per 1). All our experiments are conducted in a cluster of 18 machines, with 4 cores, 8 GB RAM and a 1 Gbps network interface each.

We evaluate Bleach using a synthetic dataset generated from TPC-DS (with scale factor 100 GB). To do so, we join the fact table store_sales with its dimension tables in TPC-DS to build a single table (288 million tuples). By exporting this table to Kafka, we simulate an unbounded data stream. We manually design eight CFD rules, as shown in Table 1. Among these rules, r4 and r5 have the same RHS attribute s_store_name, while r6 and r7 have the same RHS attribute c_email_addr, as intersecting attributes. We generate a dirty data stream, according to our rules, as follows: we modify the values of RHS attributes with probability 1% and replace the values of LHS attributes with NULL with probability 1% (see footnote 6).

Footnote 5: Nothing prevents Bleach from being built using alternative systems such as Apache Flink, for example.

Table 1: Rule Sets
r0: ss_item_sk → i_brand, (ss_item_sk ≠ null)
r1: ss_item_sk → i_category, (ss_item_sk ≠ null)
r2: ca_state, ca_city → ca_zip, (ca_state, ca_city ≠ null)
r3: ss_promo_sk → p_promo_name, (ss_promo_sk ≠ null)
r4: ss_store_sk → s_store_name, (ss_store_sk ≠ null)
r5: ss_ticket_num → s_store_name, (ss_ticket_num ≠ null)
r6: ss_ticket_num → c_email_addr, (ss_ticket_num ≠ null)
r7: ss_customer_sk → c_email_addr, (ss_customer_sk ≠ null)

In all the experiments, we set the window size to 2 M and the sliding step to 1 M tuples respectively, regardless of which windowing strategy we use. If not otherwise specified, we use Bleach windowing as the default strategy.
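The dirty stream generation and the dirty ratio metric can be illustrated with a short sketch. It is not the custom process used in the experiments: the function names, the string suffix used to corrupt a value, and the choice to measure the dirty ratio as the fraction of tuples whose RHS attribute differs from a known clean copy are all assumptions made for the example.

```python
import random


def make_dirty(tuples, lhs, rhs, p_rhs, p_lhs, seed=0):
    """Return a dirty copy of the stream for one rule lhs -> rhs.

    With probability p_rhs the RHS value is modified; independently,
    with probability p_lhs the LHS attributes are replaced with None.
    """
    rng = random.Random(seed)
    dirty = []
    for tup in tuples:
        t = dict(tup)  # leave the clean tuple untouched
        if rng.random() < p_rhs:
            t[rhs] = f"{t[rhs]}#dirty"  # hypothetical corruption
        if rng.random() < p_lhs:
            for attr in lhs:
                t[attr] = None
        dirty.append(t)
    return dirty


def dirty_ratio(output, truth, rhs):
    """Fraction of output tuples whose RHS value differs from the truth."""
    wrong = sum(1 for out, ref in zip(output, truth) if out[rhs] != ref[rhs])
    return wrong / len(truth)
```

Calling `dirty_ratio` on the cleaned output against the clean source gives the metric for one rule; a lower value means higher cleaning accuracy.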
6.1 Comparing Coordination Approaches

In this experiment we compare the three RW approaches discussed in Section 3.2, according to our performance metrics, as shown in Figure 11: RW-basic requires coordination among repair workers for each tuple; RW-dr omits coordination for tuples when possible; RW-ir is similar to RW-dr, but allows repair decisions to be made before finishing coordination. Next, we use our synthetic dataset and rules r0 to r5.

Figure 11(a) shows how Bleach throughput evolves with processed tuples. The throughput with both RW-dr and RW-ir is around 15K tuples/second, whereas RW-basic achieves roughly 13K tuples/second. The inferior performance of RW-basic is due to the large number of coordination messages required to converge to global subgraph identifiers, while RW-dr and RW-ir only require 7% of the coordination messages of RW-basic.

Figure 11(b) shows the CDF of the tuple processing latency for the three RW approaches. RW-basic has the highest processing latency, on average 364 ms. The processing latency of RW-ir is on average 316 ms. The average latency of RW-dr is slightly higher, about 323 ms. This difference is again due to the additional round-trip messages required by coordination: with RW-ir, RWs make their repair proposals without waiting for coordination to complete, hence the small processing latency.

Figure 11(c) illustrates the cleaning accuracy. All three approaches lower the ratio of dirty data significantly, from the initial 1% to at most 0.5% (even 0% for rules r3 and r4). For the first five rules, the three approaches achieve similar cleaning accuracy. For rule r5, instead, the RW-ir method suffers and the dirty ratio is larger. Indeed, for rule r5, whose cleaning accuracy is heavily linked to rule r1, RW-ir fails to correctly update some of its subgraphs because it eagerly emits repair proposals without waiting for coordination to complete.

6.2 Comparing Windowing Strategies

In this experiment, we compare the performance of the basic and Bleach windowing strategies, and use the RW-dr mechanism. As for the previous experiment, we use rules r0 to r5.
Additionally, for stress testing, we increase the input dirty data ratio from 1% to 5% for data in the interval from 4 M to 42 M tuples.

Footnote 6: In our experiments we also use BART [3], which is a well accepted dirty data generator. However, due to the sheer size of our data stream, we present results obtained using our custom process, which mimics that of BART but scales to large data streams.

Figure 11: Comparison of coordination mechanisms: RW-basic, RW-dr and RW-ir. (a) Throughput; (b) Latency; (c) Dirty ratio.

Figure 12: Throughput of two windowing strategies.

Figure 13: Processing latency CDF of two windowing strategies.

As shown in Figure 12 and Figure 13, the two windowing strategies are essentially equivalent in terms of throughput and latency: this is good news, as it implies that the requirement for cumulative super cells takes a negligible toll on performance. Next, we focus on a detailed view of the cleaning accuracy, which is shown in Figure 14. Bleach windowing achieves superior cleaning accuracy: in general, the dirty ratio is one order of magnitude smaller than that of basic windowing. This advantage also holds in the presence of a 5% dirty ratio spike in the input data. In particular, for rules r3 and r4, Bleach windowing achieves 0% dirty ratio, irrespective of the dirty ratio spike. Overall, Bleach windowing reveals that keeping state from past windows can indeed dramatically improve cleaning accuracy, with little to no performance overhead.

6.3 Dynamic Rule Management

Next, we study the performance of Bleach in the presence of rule dynamics, as shown in Figure 15. To do this, we initially use the same input data stream and rule set as in Section 6.1. However, while Bleach is cleaning the input stream, we delete rule r5 and add rules r6 and r7, as indicated in the figure. In this experiment, we use the Bleach windowing strategy and RW-dr coordination. Figure 15(a) and Figure 15(b) show the evolution in time of throughput and latency, whereas Figure 15(c) gives the CDF of the processing latency. Figure 15(a) shows that rule dynamics can result in an increase in throughput.
Indeed, removing r5 (at the 6M tuple) implies that Bleach needs to manage fewer rules; in addition, r4 becomes simpler to manage, as there are no more intersections with r5. Similarly, Figure 15(b) shows that latency also decreases upon r5 removal. When rules r6 and r7 are added (at the 9 M tuple), the throughput drops and the latency grows, as Bleach has more rules to manage and because the new rules have intersecting attributes, requiring more work from RWs. Figure 15(c) shows the latency distribution computed from output tuple samples. While the average latency is roughly 32 ms, we notice a tail in the distribution, indicating that some (few) tuples experience latencies up to seconds. This has been observed across all our experiments, and is due to the sliding window mechanism, which imposes computationally demanding operations when updating the violation graph, resulting also in rather low-level garbage collection problems. Overall, we conclude that Bleach supports dynamic rule management seamlessly, with essentially no impact on performance, and no system restart required.

6.4 Comparing Bleach to a Baseline Approach

We conclude our evaluation with a comparative analysis of Bleach and a baseline approach, which follows the ideas discussed in Section 1. To do so, we design and implement a new system that is based on the micro-batch streaming paradigm: essentially, such a system buffers input data records and performs data cleaning periodically, as determined by a sliding window. Our implementation uses Apache Spark and its Streaming API, which supports all necessary operators (see footnote 7). We refer to the baseline approach as micro-batch cleaning.

To demonstrate the performance of micro-batch cleaning and compare it to Bleach, we perform a series of experiments whereby we increase the sliding window size. We use the same stream data input from our previous experiments, but only use a single rule, r0. Here, we focus on performance analysis expressed in terms of latency and dirty ratio, thus we feed the input stream at a constant throughput of 15 tuples/second.

Footnote 7: To be precise, note that window processing in Spark Streaming is time-based and not tuple-based. For our experiment, this difference is negligible.

Figure 14: Cleaning accuracy of two windowing strategies. (a) r0; (b) r1; (c) r2; (d) r3; (e) r4; (f) r5.

Figure 15: Bleach performance with dynamic rule management. (a) Throughput; (b) Latency in time; (c) Latency CDF.

Figure 16: Micro-batch cleaning result.

Figure 16 illustrates the performance of both systems. As expected, for the micro-batch baseline approach, the average latency is proportional to the window size: larger sliding windows entail higher latencies. Indeed, since the data in the input stream is uniformly distributed, the average latency equals the sum of half of the window size and the average execution time for cleaning data in each window. As for the cleaning accuracy, the larger the sliding window, the more opportunities the micro-batch system has to clean the input stream, hence a smaller output stream dirty ratio. In particular, we notice that to achieve the same cleaning accuracy as Bleach, micro-batch cleaning requires the sliding window to be larger than 12 seconds, which incurs an average latency larger than 1 minute. Instead, in Bleach, the average latency is about 32 ms. We conclude that Bleach represents a valuable tool in the data cleaning landscape if real-time requirements must be met, while achieving high cleaning accuracy.

7. RELATED WORK

In recent years, data cleaning systems have flourished. Many approaches [6, 2, 4, 9, 1, 19, 16] tackle the problem of detecting and repairing dirty data based on predefined data quality rules. [9] proposes a way to combine multiple rules together and to perform data cleaning work holistically. [7] focuses on functional dependency violations in a horizontally partitioned database, aiming to minimize data shipment and parallel computation time.
NADEEF [1] is an extensible and generic data cleaning system, and BigDansing [19] is a large-scale version of NADEEF, which executes data cleaning jobs in frameworks like Hadoop and Spark. These approaches are effective when data is static. [12, 13] provide incremental algorithms to detect errors in distributed data when data is updated, which is similar

Bleach: A Distributed Stream Data Cleaning System

Bleach: A Distributed Stream Data Cleaning System Blech: A Distriuted Strem Dt Clening System Yongcho Tin Eurecom Biot, Frnce Emil: yongcho.tin@eurecom.fr Pietro Michirdi Eurecom Biot, Frnce Emil: pietro.michirdi@eurecom.fr Mrko Vukolić IBM Reserch Zurich,

More information

In the last lecture, we discussed how valid tokens may be specified by regular expressions.

In the last lecture, we discussed how valid tokens may be specified by regular expressions. LECTURE 5 Scnning SYNTAX ANALYSIS We know from our previous lectures tht the process of verifying the syntx of the progrm is performed in two stges: Scnning: Identifying nd verifying tokens in progrm.

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Dt Mining y I. H. Witten nd E. Frnk Simplicity first Simple lgorithms often work very well! There re mny kinds of simple structure, eg: One ttriute does ll the work All ttriutes contriute eqully

More information

2 Computing all Intersections of a Set of Segments Line Segment Intersection

2 Computing all Intersections of a Set of Segments Line Segment Intersection 15-451/651: Design & Anlysis of Algorithms Novemer 14, 2016 Lecture #21 Sweep-Line nd Segment Intersection lst chnged: Novemer 8, 2017 1 Preliminries The sweep-line prdigm is very powerful lgorithmic design

More information

Fig.25: the Role of LEX

Fig.25: the Role of LEX The Lnguge for Specifying Lexicl Anlyzer We shll now study how to uild lexicl nlyzer from specifiction of tokens in the form of list of regulr expressions The discussion centers round the design of n existing

More information

Network Interconnection: Bridging CS 571 Fall Kenneth L. Calvert All rights reserved

Network Interconnection: Bridging CS 571 Fall Kenneth L. Calvert All rights reserved Network Interconnection: Bridging CS 57 Fll 6 6 Kenneth L. Clvert All rights reserved The Prolem We know how to uild (rodcst) LANs Wnt to connect severl LANs together to overcome scling limits Recll: speed

More information

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining EECS150 - Digitl Design Lecture 23 - High-level Design nd Optimiztion 3, Prllelism nd Pipelining Nov 12, 2002 John Wwrzynek Fll 2002 EECS150 - Lec23-HL3 Pge 1 Prllelism Prllelism is the ct of doing more

More information

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Lecture 10 Evolutionry Computtion: Evolution strtegies nd genetic progrmming Evolution strtegies Genetic progrmming Summry Negnevitsky, Person Eduction, 2011 1 Evolution Strtegies Another pproch to simulting

More information

COMP 423 lecture 11 Jan. 28, 2008

COMP 423 lecture 11 Jan. 28, 2008 COMP 423 lecture 11 Jn. 28, 2008 Up to now, we hve looked t how some symols in n lphet occur more frequently thn others nd how we cn sve its y using code such tht the codewords for more frequently occuring

More information

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards A Tutology Checker loosely relted to Stålmrck s Algorithm y Mrtin Richrds mr@cl.cm.c.uk http://www.cl.cm.c.uk/users/mr/ University Computer Lortory New Museum Site Pemroke Street Cmridge, CB2 3QG Mrtin

More information

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have Rndom Numers nd Monte Crlo Methods Rndom Numer Methods The integrtion methods discussed so fr ll re sed upon mking polynomil pproximtions to the integrnd. Another clss of numericl methods relies upon using

More information

Presentation Martin Randers

Presentation Martin Randers Presenttion Mrtin Rnders Outline Introduction Algorithms Implementtion nd experiments Memory consumption Summry Introduction Introduction Evolution of species cn e modelled in trees Trees consist of nodes

More information

From Dependencies to Evaluation Strategies

From Dependencies to Evaluation Strategies From Dependencies to Evlution Strtegies Possile strtegies: 1 let the user define the evlution order 2 utomtic strtegy sed on the dependencies: use locl dependencies to determine which ttriutes to compute

More information

10.5 Graphing Quadratic Functions

10.5 Graphing Quadratic Functions 0.5 Grphing Qudrtic Functions Now tht we cn solve qudrtic equtions, we wnt to lern how to grph the function ssocited with the qudrtic eqution. We cll this the qudrtic function. Grphs of Qudrtic Functions

More information

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs. Lecture 5 Wlks, Trils, Pths nd Connectedness Reding: Some of the mteril in this lecture comes from Section 1.2 of Dieter Jungnickel (2008), Grphs, Networks nd Algorithms, 3rd edition, which is ville online

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distriuted Systems Principles nd Prdigms Chpter 11 (version April 7, 2008) Mrten vn Steen Vrije Universiteit Amsterdm, Fculty of Science Dept. Mthemtics nd Computer Science Room R4.20. Tel: (020) 598 7784

More information

OUTPUT DELIVERY SYSTEM

OUTPUT DELIVERY SYSTEM Differences in ODS formtting for HTML with Proc Print nd Proc Report Lur L. M. Thornton, USDA-ARS, Animl Improvement Progrms Lortory, Beltsville, MD ABSTRACT While Proc Print is terrific tool for dt checking

More information

UT1553B BCRT True Dual-port Memory Interface

UT1553B BCRT True Dual-port Memory Interface UTMC APPICATION NOTE UT553B BCRT True Dul-port Memory Interfce INTRODUCTION The UTMC UT553B BCRT is monolithic CMOS integrted circuit tht provides comprehensive MI-STD- 553B Bus Controller nd Remote Terminl

More information

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis CS143 Hndout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexicl Anlysis In this first written ssignment, you'll get the chnce to ply round with the vrious constructions tht come up when doing lexicl

More information

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1):

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1): Overview (): Before We Begin Administrtive detils Review some questions to consider Winter 2006 Imge Enhncement in the Sptil Domin: Bsics of Sptil Filtering, Smoothing Sptil Filters, Order Sttistics Filters

More information

Systems I. Logic Design I. Topics Digital logic Logic gates Simple combinational logic circuits

Systems I. Logic Design I. Topics Digital logic Logic gates Simple combinational logic circuits Systems I Logic Design I Topics Digitl logic Logic gtes Simple comintionl logic circuits Simple C sttement.. C = + ; Wht pieces of hrdwre do you think you might need? Storge - for vlues,, C Computtion

More information

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011 CSCI 3130: Forml Lnguges nd utomt Theory Lecture 12 The Chinese University of Hong Kong, Fll 2011 ndrej Bogdnov In progrmming lnguges, uilding prse trees is significnt tsk ecuse prse trees tell us the

More information

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5 CS321 Lnguges nd Compiler Design I Winter 2012 Lecture 5 1 FINITE AUTOMATA A non-deterministic finite utomton (NFA) consists of: An input lphet Σ, e.g. Σ =,. A set of sttes S, e.g. S = {1, 3, 5, 7, 11,

More information

Engineer To Engineer Note

Engineer To Engineer Note Engineer To Engineer Note EE-186 Technicl Notes on using Anlog Devices' DSP components nd development tools Contct our technicl support by phone: (800) ANALOG-D or e-mil: dsp.support@nlog.com Or visit

More information

A dual of the rectangle-segmentation problem for binary matrices

A dual of the rectangle-segmentation problem for binary matrices A dul of the rectngle-segmenttion prolem for inry mtrices Thoms Klinowski Astrct We consider the prolem to decompose inry mtrix into smll numer of inry mtrices whose -entries form rectngle. We show tht

More information

Dr. D.M. Akbar Hussain

Dr. D.M. Akbar Hussain Dr. D.M. Akr Hussin Lexicl Anlysis. Bsic Ide: Red the source code nd generte tokens, it is similr wht humns will do to red in; just tking on the input nd reking it down in pieces. Ech token is sequence

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties, Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Small Business Networking




