A Holistic View of Stream Partitioning Costs


Nikos R. Katsipoulakis, Alexandros Labrinidis, Panos K. Chrysanthis
University of Pittsburgh, Pittsburgh, Pennsylvania, USA
{katsip, labrinid,

ABSTRACT

Stream processing has become the dominant processing model for monitoring and real-time analytics. Modern Parallel Stream Processing Engines (pSPEs) have made it feasible to increase the performance of both monitoring and analytical queries by parallelizing a query's execution and distributing the load on multiple workers. A determining factor for the performance of a pSPE is the partitioning algorithm used to disseminate tuples to workers. Until now, partitioning methods in pSPEs have been similar to the ones used in parallel databases, and only recently have load-aware algorithms been employed to improve the effectiveness of parallel execution. We identify and demonstrate the need to incorporate aggregation costs in the partitioning model when executing stateful operations in parallel, in order to minimize overall latency and/or maximize throughput. Towards this, we propose new stream partitioning algorithms that consider both tuple imbalance and aggregation cost. We evaluate our proposed algorithms and show that they can achieve up to an order of magnitude better performance, compared to the current state of the art.

1. INTRODUCTION

Monitoring and real-time temporal analytic queries are widely used in a variety of services, whose quality relies on successfully capturing topic drift or trend fluctuation. Examples of such services include high-frequency algorithmic stock trading, social network analysis, targeted advertising, and click stream analysis. In order to match the speed of data production, stream processing is deemed the most promising model. At a high level, it requires (a) one pass over the data, (b) constant processing time, and (c) continuous execution.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Proceedings of the VLDB Endowment, Vol. 10, No. 11. Copyright 2017 VLDB Endowment.

After many flavors of single-thread stream processing engines [2, 6, 7, 11], Parallel Stream Processing Engines (pSPEs) have been introduced as a solution for processing high-volume data streams, in both single-node multi-threaded (scale-up) [1, 24] and multiple-node (scale-out) [1, 35, 27, 3, 37, 34, 25, 8, 3] environments. pSPEs have dominated the streaming landscape because of their ability to scale processing capability by dividing load into parallel data-flows, handled by many workers. At their core, pSPEs are critically affected by the partitioning algorithm that produces the sub data-flows delegated to workers: the more evenly the load is partitioned, the more scalable a pSPE is. Therefore, partitioning is paramount, especially in stateful operations, which involve windows of computation, complex delivery semantics (e.g., exactly-once), and window synchronization [8].

The focal point of our work is on stateful operations, which require the collocation of tuples with similar characteristics and produce an aggregated result for each user-defined logical window. A window contains tuples based on count, time, or even session. The need for strict delivery semantics (e.g., exactly-once) imposes additional overheads for guaranteeing correctness and timely delivery of results; these overheads are generated by checkpointing, out-of-order window alignment, window barriers, etc. Often, pSPEs rely on dynamic re-partitioning of tuples to achieve better load balance [32, 38, 17, 13, 18, 35, 9, 15]. However, re-partitioning comes with the additional burden of state migration, which is a heavyweight task that involves complex synchronization protocols and state integration policies (subject to window semantics), and can potentially lead to delayed tuple delivery.
Therefore, in our study we chose to take a step back and focus on the partitioning algorithm to make it more efficient, so that the need for re-partitioning arises less often. Of course, re-partitioning solutions, such as Flux [32], still require partitioning (Flux uses FLD as its partition algorithm), and can complement our work to enhance a pSPE's performance further.

Until recently, pSPEs adopted partition algorithms used in Massively Parallel Processing Database Management Systems. The most popular among them are the shuffle (or round-robin) (SH) and field (or hash) (FLD) algorithms [8, 34]. The former blindly sends tuples to workers in a circular fashion, akin to shuffling a deck of cards; the latter exploits a random process, usually a hash function, to disperse tuples to workers. Each partition algorithm has its merits and its drawbacks: SH manages to balance load evenly, but forces a computationally heavy aggregation step (Fig. 1a); FLD underperforms on skewed streams (i.e., when some keys appear more often than others) but does not require an aggregation step (Fig. 1b).

The state of the art partition algorithm is partial key grouping (PK) [28]. It focuses on improving performance by keeping track of the number of tuples sent to each worker in an online fashion. PK leverages the idea of key splitting [5], which dictates that tuples with the same attribute(s) can be split among two workers for the benefit of overall performance (Fig. 1c). Recently, an extension of PK that uses more than two choices was proposed [29], and was shown to further balance load among workers. The decision about which worker will receive a tuple is determined by the total number of tuples already sent to each one of them at the time of the decision. This way, the merits of FLD and SH are combined by overcoming skewness through the use of multiple choices and, at the same time, reducing aggregation cost.

Figure 1: Existing stream partitioning algorithms lack a unified model that limits imbalance while keeping aggregation cost low. (a) SH's aggregation runtime is proportional to V times the number of distinct groups. (b) FLD doesn't require aggregation, but fails to balance load under skewed input. (c) PK's aggregation runtime is proportional to M times the number of distinct groups (M is the number of choices).

Partition algorithms like PK (and its multi-choice variant [29]) focus on the aspect of imbalance, in terms of tuples sent to each worker on the parallel step of a stateful operation. Nevertheless, every stateful operation requires a step in which partial results are combined (Figs. 1a and 1c). In our work, we argue that an important factor for performance is the aggregation cost required to produce the final result, which is not considered by any other partitioning algorithm. In fact, to the best of our knowledge, no other stream partitioning algorithm incorporates both imbalance and aggregation cost.

In this paper we propose that tracking the aggregation cost of a stateful operation reduces to counting the number of distinct keys sent to each worker on every window. Hence, we introduce a new class of partitioning algorithms, which leverage such information during the decision process. Our contributions are:

- We introduce a novel cost model for stream partitioning that considers both load imbalance and aggregation cost on every window of a stateful operation.
- We propose novel stream partitioning algorithms that incorporate our cost model to improve performance.
- We demonstrate the benefits of our cost model in real world benchmarks and present an empirical rule for choosing a partitioning algorithm for a stateful query.
Section 2 presents our model and existing partitioning algorithms. Section 3 shows mechanisms for keeping track of cardinality, and Section 4 presents our proposed algorithms. Sections 5 and 6 demonstrate the details of our experiments, followed by Section 7, which offers a discussion on picking a partition algorithm. Finally, Section 8 presents related work, and our work concludes in Section 9.

2. PROPOSED MODEL

We focus on pSPEs for either scale-up or scale-out architectures. A scale-up architecture is a single multi-core machine, in which multiple cores are used to accommodate concurrent threads. A scale-out architecture is a multi-node environment in which a cluster of machines is at the disposal of a central managing authority of the pSPE.

2.1 Preliminaries

A query Q is submitted to the pSPE in either declarative or imperative form. For the rest of this section we are going to use the example query depicted in Fig. 2a, using CQL [4].

    SELECT R.a, COUNT(*)
    FROM R JOIN S on R.a = S.b [ Range 5 minutes slide 3 seconds ]
    WHERE S.c < 1
    GROUP BY R.a

Figure 2: Query submission and evaluation on a pSPE. (a) Input query Q. (b) Evaluation tree (nodes: output, group-by, map, join, filter; leaves: S, R; each operator is marked as stateful or stateless).

Table 1: Model Symbol Overview

    Symbol                                                      Overview
    $V$                                                         # of workers
    $S_i$                                                       streams, $1 \le i \le N$
    $X_i$                                                       schema of $S_i$
    $e_{X_i}$                                                   tuple of $S_i$
    $W : S_i \to \{S_i^1, \ldots, S_i^w\}$                      window for $S_i$
    $P : S_i^w \to \{L^1_{S_i^w}, \ldots, L^P_{S_i^w}\}$        partition function
    $L^P_{S_i^w}$                                               window load of worker
    $f : L^P_{S_i^w} \to \{\ldots, (k_x, v_x), \ldots\}$        partial evaluation function
    $\Gamma : \{\ldots, f(L^j_{S_i^w}), \ldots\} \to R$         aggregation function

The pSPE transforms Q into a logical plan, which is often modeled as an evaluation tree (Fig. 2b). The root of the tree represents output, which can either be an external system consuming the result or external storage. The leaves of the tree are streams, each one represented by $S_i$, where $1 \le i \le N$ (N is the number of input streams). Each $S_i$ is abstracted as a sequence of tuples $e_{X_i}$ with a predefined schema $X_i$. From this point on we are going to describe our model based on a single input stream S.
However, without any loss of generality, this model is capable of accommodating multiple streams as well. The attributes of a tuple $e_X$ can be represented as a triplet $(\tau_X, k_X, p_X)$. $\tau_X$ is the attribute responsible for ordering tuples in S and is used to assign each $e_X$ to a logical window (either time- or count-based). A logical window is abstracted as a function $W : S \to \{S^1, \ldots, S^w\}$, where $w \in \mathbb{N}$. Each $S^w$ represents the tuples of S that belong to window w according to W. $k_X \subseteq \{X - \tau_X\}$ are the attributes that identify a tuple, and $p_X \subseteq \{X - (\tau_X + k_X)\}$ are the remaining attributes, which comprise the payload of $e_X$. Often, those appear in predicates, projection lists, or are used by aggregate functions. In our example query, $S_1 = R$ and $S_2 = S$.

$X_1 = (t, a)$ and $X_2 = (t, b, c)$. Each tuple from R is modeled as a triplet where $\tau_{X_1} = \{R.t\}$, $k_{X_1} = \{R.a\}$, and $p_{X_1} = \emptyset$. Similarly, $\tau_{X_2} = \{S.t\}$, $k_{X_2} = \{S.b\}$, and $p_{X_2} = \{S.c\}$.

Turning to the evaluation tree, internal nodes represent algebraic operations, which work as transformations of an input stream $S$ to another stream $S'$. Each operation can be either stateless or stateful. The former are pure functions (as defined by functional programming principles) and are easily parallelized by arbitrarily partitioning their input stream. The latter can be either a relational algebra operation or any user-defined function that produces a result on every window $S^w$. Our work focuses only on partitioning tuples for stateful operations. In the tree illustrated in Fig. 2b, map and filter are stateless, whereas join and group-by are stateful operations.

2.2 A new formulation for Parallel Stateful Operations

By the time a stateful operation is scheduled to execute in parallel, it transforms into a 3-stage process for each window $S^w$. Its input consists of $S_1^w, \ldots, S_N^w$ and the 3 stages are, in order: (i) partition, (ii) partial evaluation, and (iii) aggregation. Figure 3 depicts the windowed group-by count between streams R and S of the sample query as the 3-stage process.

Figure 3: The windowed group-by count of the sample query (Fig. 2) as a 3-stage process: partition $P((R \bowtie S)^w)$, partial evaluation $f(L^1_{(R \bowtie S)^w}), \ldots, f(L^V_{(R \bowtie S)^w})$, and aggregation $\Gamma(\{f(L^1), \ldots, f(L^V)\})$.

Partition can be modeled as a function that takes a subsequence $S^w$ and produces another sequence of equal length that indicates the worker to which each $e_X \in S^w$ is going to be sent. In other words, partition is a function $P : S^w \to \{o_1, \ldots, o_{|S^w|}\}$, where $1 \le o_l \le V$ ($|x|$ represents the length of a stream/sequence x). The resulting sequence consists of elements $o_l$, where $1 \le l \le |S^w|$, each one mapping the tuple $e_X \in S^w$ indexed by l to a number in $[1, V]$. V represents the pSPE's parallelism degree for a particular stateful operation and is materialized by V workers, which are responsible for processing the partial result in window w.
Each worker is either a thread or a process. $L^o_{S^w} = \{e \mid e \in S^w \wedge P(S^w)[e] = o\}$ denotes the sequence of tuples from $S^w$ that will be sent to worker o by the partition process.

Partial evaluation is executed by V workers in parallel. Each worker receives its corresponding $L^o_{S^w}$ sequence and applies the user-defined transformation f. f produces a set of key-value pairs, $f : L^o_{S^w} \to \{(k_1, v_1), \ldots, (k_m, v_m)\}$, of arbitrary size m. m is naturally bounded by the cardinality of $S^w$, which is defined as the number of distinct values of $k_X$ in $S^w$. For the rest of this paper, the cardinality of a stream/sequence x will be represented by $\|x\|$, which is used to refer to the number of distinct keys (groups) held by a worker.

Finally, aggregation combines all the key-value pairs $f(L^o_{S^w})$ produced by each worker o, $o \in [1, V]$, into a final result, using an aggregation function $\Gamma(\{f(L^1_{S^w}), \ldots, f(L^V_{S^w})\})$.

Going back to the query shown in Fig. 3, partition would be a function $P((R \bowtie S)^w)$ that partitions each windowed stream based on R.a. Partial evaluation would be the partial count of the group-by operator: each worker would produce a sequence of key-value pairs, in which the keys are distinct R.a values and the values are the number of tuples for each corresponding R.a key. Finally, the aggregation stage would combine partial results by adding partial counts for every matching R.a key. In essence, if 2 workers are used, with worker #1 producing {(x, 12), (y, 123)} and worker #2 producing {(x, 43), (y, 1), (z, 4)}, then aggregation ($\Gamma$) produces the result {(x, 55), (y, 124), (z, 4)} (similar to the processing model of [33]).

2.3 Proposed partitioning cost model

Partition aims to: (i) divide $S^w$ as evenly as possible among V workers, while (ii) keeping the aggregation load ($\Gamma$) low. This way, execution can benefit from employing multiple workers: the more $\max_{1 \le j \le V} |L^j_{S^w}|$ is reduced, the faster the partial evaluation step is going to progress.
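The 3-stage process of Sec. 2.2 can be illustrated with a minimal Python sketch. It assumes tuples are reduced to their key only and that f is a group-by count; the function names (`partition`, `partial_eval`, `aggregate`) and the `choose` callback are hypothetical and not part of the original model.

```python
from collections import Counter

def partition(window, num_workers, choose):
    """Stage 1 (P): split one window into per-worker tuple sequences."""
    loads = [[] for _ in range(num_workers)]
    for key in window:
        loads[choose(key) % num_workers].append(key)
    return loads

def partial_eval(load):
    """Stage 2 (f): per-worker group-by count, as key -> count pairs."""
    return Counter(load)

def aggregate(partials):
    """Stage 3 (Gamma): add the partial counts of matching keys."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds counts
    return dict(result)

# The two partial results used as an example in the text.
worker1 = Counter({"x": 12, "y": 123})
worker2 = Counter({"x": 43, "y": 1, "z": 4})
final = aggregate([worker1, worker2])

# End-to-end run on a tiny window with a hypothetical routing function.
window = ["a", "b", "a", "c", "a", "b"]
loads = partition(window, 2, choose=lambda key: ord(key))
end_to_end = aggregate(partial_eval(load) for load in loads)
```

Whatever routing function is plugged in, the aggregated result equals the group-by count of the whole window; only the per-worker load and the number of partial results change.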
In our work, we adopt the assumption that there exists a monotonic relation between the number of tuples and load increase (similar to previous work on stream partitioning [28]). This entails that when a tuple is assigned to a worker, its load will either increase or stay the same. [28] introduced tuple imbalance as a metric for quantifying a partitioner's efficiency in terms of balancing load among workers. However, [28] expressed imbalance on the entire stream (i.e., counting from the beginning of time), which we believe is limiting, given the dynamic nature and characteristics of data streams. In this work, we extend imbalance to cover the window aspect of a streaming query:

$$I(P(S^w)) = \max_j\left(|L^j_{S^w}|\right) - \mathrm{avg}_j\left(|L^j_{S^w}|\right), \quad j = 1, \ldots, V \qquad (1)$$

Equation 1 defines tuple imbalance as the difference between the maximum $|L^j_{S^w}|$ and the average $|L^j_{S^w}|$, as partitioned by a partition algorithm P. The lower the tuple imbalance achieved by P, the lower the maximum runtime among workers.

We propose a new model for measuring the effectiveness of a partitioner by incorporating aggregation cost, which has been ignored in the past. As we discussed in Sec. 2.2, the aggregation stage will have to ingest all $f(L^j_{S^w})$ and combine every pair of (k, v) tuples with a matching key k. Hence, the number of operations for processing partial results is proportional to the sum of the sizes of all partial aggregations $f(L^j_{S^w})$:

$$\Gamma(S^w) = O\left(\sum_{o=1}^{V} \|f(L^o_{S^w})\|\right) \qquad (2)$$

Equation 2 captures both the processing and the memory cost of the aggregation, since partial results need to be stored until they are processed. In fact, the larger $\Gamma(S^w)$ is, the more memory is required to accommodate partial results. Therefore, we model stream partitioning as the following minimization problem:

$$\text{minimize } I(P(S^w)) \quad \text{while} \quad \Gamma(S^w) \le \max_{1 \le j \le V}\left(|L^j_{S^w}|\right) \qquad (3)$$

$\Gamma(S^w)$ should be less than or equal to the maximum $|L^j_{S^w}|$ so that execution benefits from parallelizing the workload, and aggregation does not become more expensive than the maximum partial processing cost.
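As a worked illustration of Eqs. 1 to 3, the following sketch evaluates the cost model on hand-made windows. It assumes that each worker's load is given as a list of keys and that f is a group-by, so that the size of a partial result equals the number of distinct keys in the load; the function names are hypothetical.

```python
def tuple_imbalance(loads):
    """Eq. 1: maximum minus average number of tuples per worker."""
    sizes = [len(load) for load in loads]
    return max(sizes) - sum(sizes) / len(sizes)

def aggregation_cost(loads):
    """Eq. 2: sum over workers of the number of distinct keys held."""
    return sum(len(set(load)) for load in loads)

def satisfies_constraint(loads):
    """Eq. 3 constraint: Gamma(S^w) <= max_j |L^j|."""
    return aggregation_cost(loads) <= max(len(load) for load in loads)

# Two workers; key "a" is split between them.
skewed = [["a", "a", "a", "b"], ["a", "c"]]
# Perfectly balanced, but every key reaches every worker.
spread = [["a", "b"], ["a", "b"]]
```

For `skewed`, the imbalance is 4 - 3 = 1 and the aggregation cost is 2 + 2 = 4, which meets the constraint (4 <= 4). For `spread`, the imbalance is 0 but the aggregation cost (4) exceeds the maximum load (2), so the constraint of Eq. 3 is violated.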
Finally, in scale-out architectures, workers' load might diverge due to external factors (e.g., communication, multi-tenancy). Our model (Eq. 1) focuses on identifying load generated by the stateful operation and acting accordingly to balance it. To the best of our knowledge, any method for broader load monitoring in a pSPE involves architectural interventions, such as monitoring modules and feedback loops [17, 38, 13, 18, 23, 31]. If a pSPE features the aforementioned components to detect load divergence caused by external factors, our cost model (Eq. 3) can be extended to incorporate that information as well (see footnote 2).

Figure 4: Stream partitioning algorithms' expected performance. Axes: imbalance (low to high) and aggregation cost (low to high). Shuffle (SH), partial-key (PK), and field (FLD) are placed on the plane, together with the regions labeled "best" and "worst".

2.4 The pitfall of ignoring aggregation costs

To better understand the inherent trade-offs among existing partition algorithms, we present Fig. 4, which illustrates the two dimensions along which each algorithm is measured. The horizontal axis represents the ability of an algorithm to balance the load among workers, and the vertical axis represents an algorithm's ability to keep the aggregation cost low, based on our model (Eq. 3). In Fig. 4 we have placed previously proposed partition algorithms based on their expected behavior in terms of imbalance and aggregation cost. As indicated by Eq. 3, partitioning becomes a trade-off between tuple imbalance and aggregation cost: the more tuples are spread, the more aggregation time increases.

Consider S to be an input stream with schema $X = (t, a, b)$, where $\tau_X = \{t\}$, $k_X = \{a\}$ and $p_X = \{b\}$. In a stateful operation, a partitioning algorithm has to make a choice of where all tuples with a particular key a will be sent. Partitioning algorithms can be categorized based on how many worker options are presented for a given $k_X$.

A 1-choice partitioner offers no mechanism to balance skewness in the input data. As a result, the workers that happen to be assigned the most frequent keys will always have more work compared to the others. That leads to higher imbalance (Eq. 1). On the other hand, when a single option for each $k_X$ is presented, aggregation cost (Eq. 2) is minimal, because each worker will produce a disjoint subset of the full result.
On the other hand, an M-choice partitioner ($M \le V$) presents M candidate workers for each $k_X$. Thus, the load for $k_X$ is divided into M equal parts and handled by M workers. As a result, imbalance (Eq. 1) is reduced, and the pSPE takes better advantage of parallelism. Unfortunately, the partial results produced by the M workers handling a particular $k_X$ have to be gathered and combined. That entails an inflated aggregation cost, which is expected to increase by a factor of M. For example, in a single window $S^w$, if tuples with $k_X = a_x$ are assigned to 4 workers, then the aggregation stage will process 4 partial results (i.e., one from each worker).

Footnote 2: By changing Eq. 1 to multiply $|L^j_{S^w}|$ with load-divergence coefficients, produced periodically by monitoring components.

Shuffle Partitioning (round-robin) - SH blindly sends tuples to workers, without making any attempt to balance load and collocate keys (Fig. 1a). Therefore, SH is categorized as an M-choice partitioner, because an aggregation stage is required to produce the final result. SH manages to minimize tuple imbalance (Eq. 1), since each worker receives the same number of tuples in a given window $S^w$: if V workers exist, each one will receive $|S^w| / V$ tuples. Turning to aggregation cost $\Gamma(S^w)$ (Eq. 2), SH becomes computationally expensive, because tuples are partitioned without an attempt to collocate keys. Therefore, in a worst case scenario, each worker will produce a partial result ($f(L^j_{S^w})$) with all the keys that exist in $S^w$ (illustrated in Fig. 1a). In that case, $\Gamma(S^w)$ will become equal to M times $\|S^w\|$ (with M = V for SH). As far as our cost model is concerned (Eq. 3), SH minimizes imbalance, but does not act to limit the aggregation cost.

Hash Partitioning (field) - FLD follows a different approach than SH, by collocating tuples with the same $k_X$ on the same worker (Fig. 1b). FLD feeds $k_X$ to a hash function and selects a worker based on the result. It guarantees that keys from the same group will be collocated, resulting in minimal aggregation cost (Eq. 2). Hence, FLD is characterized as a 1-choice partitioner.
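The contrast between SH and FLD can be reproduced on a small skewed window. The sketch below is only an illustration under stated assumptions: tuples are reduced to string keys, `zlib.crc32` stands in for the unspecified hash function of FLD, and the helper names are hypothetical.

```python
import zlib

def shuffle_partition(window, num_workers):
    """SH: round-robin over workers, ignoring the key."""
    loads = [[] for _ in range(num_workers)]
    for position, key in enumerate(window):
        loads[position % num_workers].append(key)
    return loads

def field_partition(window, num_workers):
    """FLD: one hash function, so each key has exactly one worker."""
    loads = [[] for _ in range(num_workers)]
    for key in window:
        worker = zlib.crc32(key.encode()) % num_workers  # assumed hash
        loads[worker].append(key)
    return loads

def tuple_imbalance(loads):
    """Eq. 1: maximum minus average tuple count."""
    sizes = [len(load) for load in loads]
    return max(sizes) - sum(sizes) / len(sizes)

def aggregation_cost(loads):
    """Eq. 2: sum of per-worker distinct-key counts."""
    return sum(len(set(load)) for load in loads)

# A skewed window: one hot key and four rare keys, 12 tuples in total.
window = ["hot"] * 8 + ["a", "b", "c", "d"]
sh = shuffle_partition(window, 4)
fld = field_partition(window, 4)
```

With 4 workers, SH gives every worker 3 tuples (imbalance 0), but each worker holds two distinct keys, so the aggregation cost is 8. FLD keeps the aggregation cost at the 5 distinct keys of the window, yet the worker that owns "hot" receives at least 8 of the 12 tuples.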
Nevertheless, FLD fails to balance the load effectively when input is skewed and some keys appear more often than others (i.e., there is tuple imbalance, Eq. 1). Matters can get exacerbated if initial expectations (or assumptions) on input load do not hold true over time. Under such circumstances, struggling workers with excess load will hinder the progress of a query and may even compromise the correctness of the result. In conclusion, FLD imposes minimal aggregation cost but does not act on limiting tuple imbalance (based on the cost model, Eq. 3).

Partial Key Partitioning - PK is the current state of the art algorithm [28]. It adopted the idea of key splitting [5] to alleviate the load of processing keys that are part of the skew. PK was the first to incorporate load in terms of the number of tuples assigned to each worker (i.e., $|L^j_{S^w}|$). Key splitting is materialized by using a pair of independent hash functions (i.e., $H_1$, $H_2$) and feeding $k_X$ to both. Also, PK maintains an array of size V with the total tuple count sent to each worker. Every time a tuple arrives, its $k_X$ is fed to $H_1$ and $H_2$ to identify 2 candidate workers. The partitioner will forward the tuple to the candidate that has received the least number of tuples up to that point. PK was extended to more than two candidates [29], for cases in which two are not sufficient to handle skew. Even though PK succeeded in improving imbalance (Eq. 1) compared to FLD, it did so by adding an essential aggregation step (Eq. 2). Therefore, PK is expected to incur aggregation cost proportional to the number of candidates. Turning to our cost model (Eq. 3), PK can potentially violate the aggregation cost constraint, when $\Gamma(S^w)$ exceeds the maximum workload experienced by any worker.

Summary: Our goal is to propose partitioning algorithms that belong to the best quadrant (Fig. 4) and use our cost model (Eq. 3). To achieve this, we have to keep the aggregation cost low and achieve better imbalance.

3. MINIMIZING IMBALANCE WITH LOW AGGREGATION COST

Designing a partitioning algorithm that achieves low aggregation cost entails keeping track of the number of keys produced by each worker on every window $S^w$ (i.e., $\|f(L^j_{S^w})\|$, for $1 \le j \le V$). Equation 2 indicates that if the sum of $\|f(L^j_{S^w})\|$ is reduced, then the aggregation cost gets reduced as well. However, the boundaries of the aggregation cost need to be identified first.

PROPOSITION 1. For a given stream $S^w$, a stateful operation f, and V workers, $\Gamma(S^w)$ is bounded by: $\|S^w\| \le \Gamma(S^w) \le V \cdot \|S^w\|$.

PROOF. $\Gamma(S^w)$ will always be greater than or equal to $\|S^w\|$, with equality when the partitioning algorithm sends each key to a single worker only. In this case, $L^i_{S^w} \cap L^j_{S^w} = \emptyset$ for all $i \ne j$, $1 \le i, j \le V$. Hence, $\|L^1_{S^w}\| + \ldots + \|L^V_{S^w}\| = \|S^w\|$. Similarly, if the partition algorithm sends at least one tuple for each key to every worker (i.e., $k \in L^j_{S^w}$, $\forall k \in S^w$ and $1 \le j \le V$), then

$$\|L^1_{S^w}\| + \ldots + \|L^V_{S^w}\| = \underbrace{\|S^w\| + \ldots + \|S^w\|}_{V} = V \cdot \|S^w\|$$

A mechanism for monitoring the value of $\Gamma(S^w)$ has to be established. Eq. 2 can be expanded to the sum of its operands as:

$$\Gamma(S^w) = \|f(L^1_{S^w})\| + \ldots + \|f(L^V_{S^w})\| \qquad (4)$$

Hence, in order to monitor aggregation cost, the partition algorithm has to keep track of the number of distinct keys sent to each worker, for each $S^w$.

3.1 Incorporating Cardinality in Partitioning

Assuming a mechanism for keeping track of workers' cardinalities has been established, the cost model (Eq. 3) can be extended to incorporate the knowledge of the number of distinct keys sent to each worker. As indicated by Eq. 3, the information about workers' cardinalities can be used in two places: (a) imbalance (Eq. 1), and (b) aggregation cost (Eq. 2).

3.1.1 Cardinality in imbalance

The load of each worker has been modeled in terms of the number of tuples. In the same manner, a worker's load can be expressed in terms of cardinality using the following formula:

$$CL^j_{S^w} = \|L^j_{S^w}\|, \quad 1 \le j \le V \qquad (5)$$

Equation 5 expresses the load of a worker in terms of the number of distinct keys sent to it. Therefore, cardinality imbalance can be expressed as the difference between the maximum and the mean cardinality of all workers for a given window $S^w$, as a result of a partitioning algorithm P:

$$CI(P(S^w)) = \max_j\left(CL^j_{S^w}\right) - \mathrm{avg}_j\left(CL^j_{S^w}\right), \quad 1 \le j \le V \qquad (6)$$

At this point, imbalance is determined by tuple count and cardinality. However, different stateful operations are affected by each metric differently. Hence, there is a need for a more flexible load estimation formula, which combines tuple count and cardinality.
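Proposition 1 and the cardinality metrics of Eqs. 4 to 6 can be checked on small hand-made windows. The sketch below assumes that each worker's load is a list of keys and that a partial result holds one entry per distinct key; the function names are hypothetical.

```python
def cardinality_loads(loads):
    """Eq. 5: CL^j is the number of distinct keys sent to worker j."""
    return [len(set(load)) for load in loads]

def cardinality_imbalance(loads):
    """Eq. 6: maximum minus average cardinality load."""
    cl = cardinality_loads(loads)
    return max(cl) - sum(cl) / len(cl)

def gamma(loads):
    """Eq. 4: aggregation cost as the sum of per-worker cardinalities."""
    return sum(cardinality_loads(loads))

def within_bounds(loads):
    """Proposition 1: ||S^w|| <= Gamma(S^w) <= V * ||S^w||."""
    distinct = len({key for load in loads for key in load})
    return distinct <= gamma(loads) <= len(loads) * distinct

# Lower bound: every key lives on exactly one worker (FLD-like).
collocated = [["a", "a"], ["b"], ["c", "c", "c"]]
# Upper bound: every key reaches every worker (worst case for SH).
replicated = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
# In between: only key "a" is split across workers.
mixed = [["a", "b", "c"], ["a"], ["a"]]
```

With three distinct keys and V = 3, `collocated` attains the lower bound (Gamma = 3), `replicated` attains the upper bound (Gamma = 9), and `mixed` falls in between (Gamma = 5) with a cardinality imbalance of 3 - 5/3.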
In order to avoid one metric dominating the other, the initial values should be scaled accordingly:

$$\overline{L^j_{S^w}} = \frac{|L^j_{S^w}| - \min_{1 \le k \le V}\left(|L^k_{S^w}|\right)}{\max_{1 \le k \le V}\left(|L^k_{S^w}|\right) - \min_{1 \le k \le V}\left(|L^k_{S^w}|\right)} \qquad (7)$$

$$\overline{CL^j_{S^w}} = \frac{CL^j_{S^w} - \min_{1 \le k \le V}\left(CL^k_{S^w}\right)}{\max_{1 \le k \le V}\left(CL^k_{S^w}\right) - \min_{1 \le k \le V}\left(CL^k_{S^w}\right)} \qquad (8)$$

$$H^j_{S^w} = p\,\overline{L^j_{S^w}} + (1 - p)\,\overline{CL^j_{S^w}}, \quad \text{where } 1 \le j \le V \qquad (9)$$

Equation 9 combines the normalized loads in terms of both tuples (Eq. 7) and distinct keys (Eq. 8) into a unified score. That score is adjustable through a parameter p, set by the user (or the query optimizer), which controls the bias towards each load: the smaller the p, the less the load in terms of tuples affects Eq. 9; the higher the p, the less the load in terms of distinct keys affects Eq. 9. Finally, imbalance can be expanded to a hybrid form that incorporates load in terms of both tuple count and cardinality:

$$HI(P(S^w)) = \max_j\left(H^j_{S^w}\right) - \mathrm{avg}_j\left(H^j_{S^w}\right), \quad 1 \le j \le V \qquad (10)$$

3.1.2 Cardinality in aggregation

Aggregation cost is determined by $\Gamma(S^w)$ (Eq. 2), and reducing it amounts to reducing the sum of distinct keys sent to each worker. Its minimum value is $\|S^w\|$, attained when each key is sent to only a single worker. This behavior resembles FLD and it might result in imbalance among workers. To avoid this, we employ key splitting for keys that have not been sent to a worker before in a particular window $S^w$. By sending each newly encountered key to the worker with either the least keys or the least number of tuples up to that point, the aggregation cost remains low. Also, imbalance is expected to be lower compared to the one achieved by FLD.

3.2 Cardinality Estimation data structures

The partitioner needs to keep track of each worker's cardinality every time a new tuple arrives. Hence, it should maintain an array of V cardinality estimation structures (C), which offer two methods: (i) update($k_X$), for updating the count of distinct keys; and (ii) estimate(), for returning the count of distinct keys.

3.2.1 Naive

The naive approach for estimating a worker's cardinality involves keeping track of the exact number of distinct keys.
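A minimal sketch of the naive structure, assuming a Python `set` as the counterpart of an unordered set. The `update` and `estimate` methods follow the interface of Sec. 3.2; `contains` and `reset` are included because the partitioning algorithms later check key membership and reset C when a window expires. The class name is hypothetical.

```python
class NaiveCardinality:
    """Exact distinct-key counter backed by a hash set."""

    def __init__(self):
        self._keys = set()

    def update(self, key):
        """Record that `key` was sent to this worker (expected O(1))."""
        self._keys.add(key)

    def estimate(self):
        """Return the exact number of distinct keys seen (O(1))."""
        return len(self._keys)

    def contains(self, key):
        """Exact membership test, which the set makes possible."""
        return key in self._keys

    def reset(self):
        """Forget all keys, e.g. when the current window expires."""
        self._keys.clear()

# The array C: one structure per downstream worker (V = 3 here).
V = 3
C = [NaiveCardinality() for _ in range(V)]
C[0].update("a")
C[0].update("a")  # a repeated key does not change the count
C[0].update("b")
```

After these calls, `C[0].estimate()` returns 2, while the other two structures still report 0.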
Therefore, for a partitioner responsible for V downstream workers, V unordered set structures are needed. This way, the update and the estimate methods offer constant execution time (O(1)). One caveat of using an unordered set structure for each worker is the memory overhead. Depending on the algorithm used, a key can end up in multiple workers (e.g., with SH). In that case, the memory required for maintaining the number of keys on each worker can become $O(V \cdot \|S^w\|)$, since all unordered sets can end up holding every key. The memory cost of a naive cardinality estimation structure is related to the cardinality of $S^w$ and the choice of the partitioning policy: if $\|S^w\|$ remains low and the partitioner does not send the same keys to multiple workers, the memory requirements for C will remain low. However, if $\|S^w\|$ is high and the partitioner tends to send tuples with the same key to multiple workers, then memory load can hinder the partition process.

3.2.2 HyperLogLog

HyperLogLog (HLL), introduced by Flajolet et al. [14], is an algorithm for estimating the number of distinct elements in databases with a bounded error. HLL requires $O(\log_2 \log_2 N)$ memory for a relation expected to have N distinct elements. Every time a new tuple arrives, its $k_X$ is extracted and fed through a hash function. HLL extracts the m most significant bits of the hash result, and uses them to identify which register (out of $2^m$) to update. Each register is $\log_2 \log_2 N$ bits long, and its value is updated depending on the position of the left-most 1-bit in the remaining bits of the hashed value. HLL has been shown to present an accuracy of $1.04 / \sqrt{m}$. Recent work from Heule et al. [2] presented a number of improvements that need to take place so that cardinalities in the order of billions can be estimated efficiently.

For cardinality estimation, a partition algorithm is required to use V HLLs to measure the number of distinct keys sent to each of the V workers. HLL can be used to limit the memory cost, but it uses irreversible operations to update its internal buckets.
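For illustration, a compact HyperLogLog sketch in the standard formulation of Flajolet et al. is given below. Several choices are assumptions made only for this example: SHA-1 truncated to 64 bits as the hash function, 10 index bits (1024 registers), the textbook bias constant for 128 or more registers, and the linear-counting correction for small cardinalities. It is not the implementation used in the paper.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counter with 2**index_bits registers."""

    def __init__(self, index_bits=10):
        self.p = index_bits
        self.m = 1 << index_bits
        self.registers = [0] * self.m

    def update(self, key):
        digest = hashlib.sha1(str(key).encode()).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash value
        index = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of left-most 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank              # registers only ever grow

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting
        return raw

hll = HyperLogLog()
for i in range(10000):
    hll.update(f"key-{i}")
first_estimate = hll.estimate()
for i in range(10000):
    hll.update(f"key-{i}")  # re-sending known keys leaves registers unchanged
second_estimate = hll.estimate()
```

Because a register only keeps the largest rank it has seen, an update cannot be undone and the structure cannot report which keys it has absorbed, which is the limitation discussed in the surrounding text.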
Because HLL's updates are irreversible, it cannot check whether a key has been previously sent to a worker. We introduce an optimistic mechanism for checking if a key has been forwarded to a particular worker before: upon the arrival of a key that hashes to a worker, its cardinality c is estimated (call to estimate). A trial update is performed and the new cardinality c' is estimated. If the cardinality estimate difference is zero ($|c' - c| = 0$), then our mechanism optimistically assumes that the key has already been sent to the corresponding worker. HLL is expected to occasionally make wrong decisions, in exchange for a constant memory cost.

4. PROPOSED CARDINALITY-AWARE PARTITIONING ALGORITHMS

PK [28, 29] has motivated the merits of key splitting [5] for reducing imbalance among workers. Therefore, all the variations of our proposed algorithms leverage key splitting, which materializes with the use of multiple hash functions for identifying candidate workers. For simplicity, our algorithms are presented with only two candidates, but they can be extended to accommodate more.

The basis of our algorithms is presented in Alg. 1 and is called by the pSPE's partitioner when a new tuple arrives. The partitioner maintains two arrays of size V: one with cardinality estimation structures (C), and one with tuple counters (L). L is identical to the one used by PK and gets updated the same way in all our proposed algorithms. The key of $e_X$ is extracted and fed to the two hash functions, $H_1$ and $H_2$. If the partitioner uses more than two candidates (i.e., M > 2), then an equal number of hash functions is used in the decision process. The resulting choices ($c_1$ and $c_2$), along with the arrays C and L, are passed to decide().

    Algorithm 1: Partition.
    input : e_X, t_w, C, L
    output: worker to which e_X will be sent
    1  k = GetKey(e_X);
    2  t = GetTimestamp(e_X);
    3  if t ≥ t_w then
    4      Reset(C);
    5      Reset(L);
    6  c1 = H1(k);
    7  c2 = H2(k);
    8  return decide(C, L, k, c1, c2);

During query execution, a pSPE might have multiple instances of partitioners running on different machines (especially in a scale-out setting, where thousands of threads are involved in a query). The advantage of using hash functions is that no exchange of information is required among different instances of partitioners.
In addition, C and L have their counts monotonically increasing within each window. Therefore, if each of the partitioners tries to reduce imbalance and/or aggregation cost, then (through the additive property) the overall imbalance and/or aggregation cost are reduced. Finally, C and L need to be reset when a window expires (Alg. 1 line 3). This guarantees that decisions reflect the temporal nature of stream processing. Algorithm 1 receives $t_w$ as an argument, which is the expiration timestamp of the current window. In the following sections we go over our variations for decide(): (i) Cardinality imbalance Minimization (CM), (ii) Group Affinity with imbalance Minimization (AM & cAM), and (iii) Hybrid imbalance Minimization (LM).

4.1 Cardinality Imbalance Minimization (CM)

The first partitioning algorithm aims at limiting cardinality imbalance (Eq. 6), and the decision is made based on the cardinality estimate retrieved from the C array structure (Eq. 5). The newly arrived tuple $e_X$ is sent to the candidate worker that has the least cardinality.

    Algorithm 2: Cardinality imbalance minimization (CM)
    input : C, L, k, c1, c2
    output: worker to which the tuple is going to be sent
    l1 = C[c1].estimate();
    l2 = C[c2].estimate();
    if l1 ≤ l2 then
        C[c1].update(k);
        L[c1] += 1;
        return c1;
    else
        C[c2].update(k);
        L[c2] += 1;
        return c2;

    Algorithm 3: Group affinity combined with cardinality imbalance minimization (AM)
    input : C, L, k, c1, c2
    output: worker to which the tuple is going to be sent
    if C[c1].contains(k) then
        L[c1] += 1;
        return c1;
    else if C[c2].contains(k) then
        L[c2] += 1;
        return c2;
    else
        l1 = C[c1].estimate();
        l2 = C[c2].estimate();
        if l1 ≤ l2 then
            C[c1].update(k);
            L[c1] += 1;
            return c1;
        else
            C[c2].update(k);
            L[c2] += 1;
            return c2;

Algorithm 2 illustrates the cardinality imbalance minimization algorithm (CM), which works as a counterpart to PK. CM can have its cardinality estimation structure be either the Naive one (Section 3.2.1) or the HLL with our optimistic mechanism (Section 3.2.2).
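Algorithm 1 and the CM variant of decide() (Alg. 2) can be sketched in Python as follows, with a PK-style decide() added for contrast. The sketch makes several assumptions that are not in the original: `zlib.crc32` and `zlib.adler32` stand in for H1 and H2, the naive set-based structure plays the role of C, and windows are tumbling, so the partitioner advances the expiration timestamp itself instead of receiving it as an argument. All names are hypothetical.

```python
import zlib

class Partitioner:
    """Alg. 1: pick two candidate workers and delegate to decide()."""

    def __init__(self, num_workers, decide, window_length):
        self.V = num_workers
        self.decide = decide
        self.window_length = window_length
        self.window_end = window_length               # t_w of the current window
        self.C = [set() for _ in range(num_workers)]  # naive cardinality structures
        self.L = [0] * num_workers                    # tuple counters

    def partition(self, key, timestamp):
        if timestamp >= self.window_end:              # Alg. 1, line 3
            self.C = [set() for _ in range(self.V)]
            self.L = [0] * self.V
            while timestamp >= self.window_end:       # assumed tumbling windows
                self.window_end += self.window_length
        data = str(key).encode()
        c1 = zlib.crc32(data) % self.V                # stand-in for H1
        c2 = zlib.adler32(data) % self.V              # stand-in for H2
        return self.decide(self.C, self.L, key, c1, c2)

def _send(C, L, key, worker):
    """Bookkeeping shared by both variants: update C and L, return the worker."""
    C[worker].add(key)
    L[worker] += 1
    return worker

def decide_pk(C, L, key, c1, c2):
    """PK-style choice: the candidate with the fewest tuples so far."""
    return _send(C, L, key, c1 if L[c1] <= L[c2] else c2)

def decide_cm(C, L, key, c1, c2):
    """CM (Alg. 2): the candidate with the fewest distinct keys so far."""
    return _send(C, L, key, c1 if len(C[c1]) <= len(C[c2]) else c2)
```

Calling `decide_cm` twice for the same key with empty candidates 0 and 1 sends the first tuple to worker 0 and the second to worker 1: CM balances the number of distinct keys per worker, but it may split a key across both candidates and thereby add aggregation work.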
This algorithm is expected to be used in operations in which processing cost is dominated by the number of distinct keys. This way, imbalance in terms of cardinality will be minimal. However, imbalance in terms of tuple counts will be increased, since CM is tuple-count agnostic and makes no effort to limit aggregation cost.

4.2 Group Affinity and Imbalance Minimization (AM & cAM)

Group Affinity algorithms try to impose no additional aggregation cost, while balancing load with the use of key splitting. The name affinity comes from keeping track of whether a key has been encountered before in S_w; if it has, then it is forwarded to the worker that received it previously. The first variation of affinity-based algorithms is AM, which focuses on cardinality imbalance (Alg. 3). AM tries to minimize aggregation cost by not splitting keys among workers. First, it checks

if one of the candidate workers has encountered key k previously. If one of them did, then the tuple is forwarded to that worker; otherwise, it is sent to the worker with the least cardinality up to that point. A different variation of AM, named cAM (Alg. 4), behaves similarly, but it forwards the tuple to the worker with the least tuple count up to that point. This way, both aggregation cost and imbalance are considered during partitioning. Despite the fact that AM and cAM resemble FLD, they are expected to perform better because of the multiple choices that are presented to them.

Algorithm 4: Group affinity with imbalance minimization (cAM)
  input : C, L, k, c_1, c_2
  output: worker to which the tuple is going to be sent
  if C[c_1].contains(k) then
      L[c_1] += 1;
      return c_1;
  else if C[c_2].contains(k) then
      L[c_2] += 1;
      return c_2;
  else
      l_1 = L[c_1].estimate();
      l_2 = L[c_2].estimate();
      if l_1 ≤ l_2 then
          C[c_1].update(k);
          L[c_1] += 1;
          return c_1;
      else
          C[c_2].update(k);
          L[c_2] += 1;
          return c_2;

Algorithm 5: Hybrid imbalance minimization (LM)
  input : C, L, k, c_1, c_2
  output: worker to which the tuple is going to be sent
  hl_1 = p · l(c_1, S_w) + (1 − p) · cl(c_1, S_w);
  hl_2 = p · l(c_2, S_w) + (1 − p) · cl(c_2, S_w);
  if hl_1 ≤ hl_2 then
      C[c_1].update(k);
      L[c_1] += 1;
      return c_1;
  else
      C[c_2].update(k);
      L[c_2] += 1;
      return c_2;

4.3 Hybrid Imbalance Minimization (LM)

For stateful operations equally affected by tuple count and cardinality, we propose the hybrid load imbalance minimization algorithm (LM). It combines a worker's tuple count with cardinality and calculates hybrid load as indicated in Eq. 9. A tuple is forwarded to the worker with the least load, and LM's main goal is to minimize hybrid load imbalance (Eq. 1). LM is depicted in Algorithm 5.

5. EXPERIMENTAL SETUP

Our experiments were conducted on an AWS c4.8xlarge instance, running Ubuntu v14.04. For all experiments, we used our own multi-threaded stream partitioning library, developed in C++11 and compiled with GCC v[…]. Our performance analysis involved a varying number of concurrent worker threads (8 up to 32) and data partitions (from 8 to 256). We did not experiment with more threads because we did not want to pollute results with context-switching overheads. All reported runtimes are the averages of 7 runs, after removing minimum and maximum reported times, to compensate for anomalies related to running concurrent processes.

Table 2: Stream partitioning algorithms. w is the total number of workers.

  Symbol   Algorithm          Choices   Cardinality estimation structure used
  SH       shuffle            w         None
  FLD      field              1         None
  […]      partial-key [28]   2         None
  PK-5     partial-key [29]   5         None
  […]      Alg. 2             2         Naive, Sec. 3.2.1
  […]      Alg. 3             2         Naive, Sec. 3.2.1
  AM-5     Alg. 3             5         Naive, Sec. 3.2.1
  c[…]     Alg. 4             2         Naive, Sec. 3.2.1
  […]      Alg. 4             5         Naive, Sec. 3.2.1
  […]      Alg. 5             2         Naive, Sec. 3.2.1
  […]H     Alg. 2             2         HLL, Sec. 3.2.2
  […]H     Alg. 3             2         HLL, Sec. 3.2.2
  […]      Alg. 5             2         HLL, Sec. 3.2.2

5.1 Stream Partitioning Algorithms

We evaluated algorithms shuffle (SH), field (FLD), and partial key (PK) [28] (with 2 and 5 candidate workers), along with different variations of our proposed algorithms: Cardinality Imbalance Minimization (CM), Group Affinity with Cardinality Imbalance Minimization (AM), Group Affinity with Imbalance Minimization (cAM), and Hybrid Imbalance Minimization (LM). For all variations of LM we set p to 0.5 to achieve unbiased load estimation. As a reference implementation for SH, FLD, and PK we used the ones found in Apache Storm. In addition, we used the open source implementation of MurmurHash v3. All our proposed algorithms appear in two versions: one with naive and one with HLL as the cardinality estimation structure. For the former, we used the C++ STL's implementation of unordered set, and for HLL, we implemented our version with 4096 (= 2^12) registers and a register size of 5 bits. The number and size of registers were chosen to accommodate up to 10^7 distinct keys of 32 bits, with an error lower than 2%, as instructed in [14]. Table 2 explains the algorithm symbols we use in our graphs.

5.2 Data sets and Workloads

Table 3 summarizes the characteristics of each data set/benchmark used in our experiments. Below, we go over each data set and the queries we used in our study.
TPC-H (TPCH): TPCH has been extensively used for throughput-oriented streaming scenarios [13, 3, 1, 8, 12]. Out of 22 TPCH queries, 16 feature a grouping statement: half maintain a constant number of groups and half a scaling number of groups that increases when the scale factor grows. Because our work addresses stateful operations, we focused on grouping TPCH queries and picked Query 1 (as a constant grouping query) and Query 3 (as a scaling grouping query). Those two differ significantly in the number of resulting groups, which enabled us to document the performance of different partition algorithms when the aggregation cost varies in terms of size. As indicated in Table 3, Query 1 presents 4 and Query 3 presents up to 11 resulting groups. Data were generated using the dbgen tool (v2.17) with a scale factor of 1.

ACM DEBS 2015 Grand Challenge (DEBS): DEBS totals 32 GB in raw size, and comes with two sliding window queries [21]: (i) the Top-10 most frequent routes (Query 1), and (ii) the Top-10 most profitable areas (Query 2). DEBS presents a per-window latency-oriented data set, and its two queries offer group numbers that can potentially range from 62.5 thousand to 8.1 million.

Google Cluster Monitoring (GCM): This data set contains execution traces from one of Google's cluster management software systems, and for it we used two sliding window queries, which, like DEBS, are per-window latency oriented. GCM Query 1 was taken from [24], and scans the task event table to calculate every 6 seconds (with a slide of 1 second) the total CPU cores requested by each scheduling class. In addition, we introduced GCM Query 2, which calculates every 45 minutes (with a slide of 1 second) the average CPU cores requested by each job ID. There are more than 6 thousand job IDs in the whole data set.

Table 3: Summary of data characteristics.

  Dataset   Size    Groups               Window    Metric
  TPC-H     1GB     4 up to 1k           N/A       throughput
  DEBS      32GB    62.5K up to 8.1M     sliding   latency
  GCM       16GB    4 to 67K             sliding   latency

Figure 5: TPCH Query 1 performance (throughput). [Plot: throughput (GB/s) vs. number of workers.]

6. EXPERIMENTAL RESULTS

Our experiments evaluate the impact of a partition algorithm on performance (Sec. 6.1), in terms of throughput (using the TPCH data set) and window latency (using the DEBS data set). Moreover, we evaluate the scalability (Sec. 6.2) of our algorithms compared to the state of the art (using the GCM data set). For all experiments, data were loaded in main memory before execution. The time to load data and write output to storage was not included in the reported times. Finally, for the experiments of Sec.
6.1 and 6.2, the time it takes to partition tuples is not included, because it is analyzed in Sec. 6.3 in terms of both processing and memory costs.

6.1 Performance

In this set of experiments, we used the TPCH and DEBS benchmarks to evaluate performance.

TPCH Query 1 (Fig. 5)

Figure 5 indicates that for TPCH Query 1, SH performs the best. This behavior is expected since there are only 4 groups for Query 1. Therefore, aggregation cost is negligible and performance is affected only by tuple imbalance. SH is expected to offer optimal tuple imbalance ([…] 1), which is reflected in the results shown in Fig. 5. Those agree with our model (Eq. 3), which identifies that SH offers minimal imbalance with a constant aggregation runtime of O(4V) (V is the number of workers). In addition, PK-5 offers the next best throughput, since it reduces tuple imbalance compared to all other algorithms with 2 and 5 alternative choices per group. CM and LM do not scale well with two choices, since they are affected by cardinality imbalance. LM is expected to perform similarly to PK if p takes a value of 1 (as indicated in Eq. 9). Turning to 1-choice partitioners (i.e., FLD, AM, and cAM), they present constant performance and do not scale when the number of workers increases. This happens because each group is presented to a single candidate worker.

Take-away: If the number of groups is constant and much smaller than the size of the aggregation, SH performs the best.

TPCH Query 3 (Figs. 6-7)

The query plan of TPCH Query 3 consists of a parallel hash join for the customer and orders tables, followed by a broadcast join with the lineitem table. Then, a parallel computation of the group by follows, and execution concludes with a final aggregation step to materialize the result. Figure 6 illustrates the performance of 1-choice partitioners (i.e., FLD, AM, and cAM), […], […], and PK-5. M-choice partitioners performed from 2.5x up to an order of magnitude worse (LM and CM offered similar performance to […]).
As shown in Table 3, TPCH Query 3 involves 11 thousand groups (before applying the limit statement), and aggregation can take up to 6% of total execution time for M-choice partitioners. As a result, M-choice partitioners (i.e., SH, PK, CM, and LM) experience a substantial performance overhead on the final aggregation step. Figure 7 illustrates the tuple imbalance, relative to FLD, achieved by different variations of AM and cAM. Even though […]-H achieves better tuple imbalance compared to […], it fails to perform at the same level as […]. Tuple imbalance results justify the throughput shown in Fig. 6, in which (apart from […]-H) all variations of our proposed algorithms perform significantly better than […]. By adopting key splitting, throughput increases with the use of multiple candidate workers. AM-5 and […] offer up to 47% higher throughput compared to FLD.

Take-away: For throughput-oriented queries with a large number of groups, cAM and AM perform the best. They achieve up to an order of magnitude better throughput compared to PK, and outperform FLD by up to 47%.

DEBS Query 1 (Figs. 8a - 8c)

Turning to DEBS, both queries involve window semantics and performance is measured in window latency. Figure 8 shows the mean and 99th percentile window latency achieved by each partition algorithm. It is clear that in all worker settings, […], […], c[…], […]-H, AM-5, and […] perform the best. This emanates from a lack of aggregation overhead, which makes those algorithms scalable when the number of workers increases. In fact, aggregation cost amounts to more than 7%, 84%, and 88% of total runtime for […], […], PK-5, […], and […]. Finally, […]-H achieves identical performance to […], which leads us to believe that our optimistic mechanism for cardinality estimation maintains a low error.

Take-away: For latency-oriented stateful queries, AM and cAM perform from 4.5x up to 11.6x better compared to PK.
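The gap between M-choice and affinity-based partitioners can be illustrated with a small simulation. The Python sketch below is an assumption-laden illustration rather than the evaluated implementation: `candidates` and `simulate` are hypothetical names, the MD5-based hashing stands in for the two hash functions, and the number of per-worker partial groups is used as a rough proxy for the work of the final aggregation step. With `affinity=False` it behaves like a two-choice, tuple-count-based partitioner in the spirit of PK; with `affinity=True` it follows the cAM rule (Alg. 4).

```python
import hashlib


def candidates(key, num_workers):
    """Two candidate workers per key (hypothetical stand-in for H_1 and H_2)."""
    digest = hashlib.md5(str(key).encode()).digest()
    c1 = int.from_bytes(digest[:4], "big") % num_workers
    c2 = int.from_bytes(digest[4:8], "big") % num_workers
    return c1, c2


def simulate(keys, num_workers, affinity):
    """Route one window of keys; return (partial groups to merge, max tuple count)."""
    seen = [set() for _ in range(num_workers)]   # distinct keys held by each worker
    load = [0] * num_workers                     # tuples sent to each worker
    for k in keys:
        c1, c2 = candidates(k, num_workers)
        if affinity and k in seen[c1]:
            w = c1                               # key already lives on c1
        elif affinity and k in seen[c2]:
            w = c2                               # key already lives on c2
        else:
            w = c1 if load[c1] <= load[c2] else c2   # least tuple count wins
        seen[w].add(k)
        load[w] += 1
    partial_groups = sum(len(s) for s in seen)
    return partial_groups, max(load)


if __name__ == "__main__":
    window = [i % 50 for i in range(5000)]       # 50 distinct keys, repeated
    print("key splitting:", simulate(window, 8, affinity=False))
    print("affinity     :", simulate(window, 8, affinity=True))
```

Under the affinity rule every key stays on a single worker, so the number of partial groups equals the number of distinct keys; without it, a key may be held by both of its candidates, and the final aggregation has up to twice as many partial results to merge.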

Figure 6: TPCH Query 3 performance (throughput). [Plot: throughput (GB/s) vs. number of workers.]

DEBS Query 2 (Figs. 9a - 9c)

The execution plan starts by partitioning incoming tuples based on the medallion of each ride, and each worker has to create two local indices: one for accumulating fares for each pickup cell, and one for keeping track of the latest drop-off cell. Then, an aggregation step follows, which gathers each pickup cell's fares and determines the latest cell for each medallion. The two resulting streams are partitioned based on pickup and drop-off cell IDs. Next, a gather step is executed, in which the median fare and the number of vacant taxis are processed to calculate the profit of each cell. Finally, partial results are merged and ordered to produce the Top-10 most profitable cells. This query represents a problematic case for our model (Eq. 3), because partitioning does not necessarily affect the workload imposed on each worker (i.e., the partitioning key is the medallion but each worker's state is affected by the number of distinct cells).

Fig. 9 depicts the window latency achieved by each algorithm, and it is apparent that […], […], AM-5, c[…], […], and […]-H offer the best performance. AM and cAM in all their variations outperform FLD in both mean (from 1.2x up to 1.5x) and 99th percentile (from 1.3x to 1.9x) latency. This is justified by AM's and cAM's ability to partition data more evenly compared to FLD. M-choice partitioners underperform because they do not act on limiting aggregation overhead. In comparison with PK (in both […] and PK-5), AM and cAM perform up to 5.7x faster. In order to examine AM's and cAM's scalability, we also ran DEBS Query 2 with 64 and 128 workers. They performed up to 6.2x better than PK and up to 2.3x better than FLD.

Take-away: For latency-oriented complex queries with more than one stateful operation, AM and cAM have window latency between 1.2x and 1.9x lower than FLD, and up to 5.7x lower than PK.

6.2 Scalability

We used the GCM dataset to measure the scalability of AM and cAM compared to SH and PK.
The reason for picking GCM for scalability experiments is that it presents a conventional monitoring scenario, in which groups are not significantly more than the number of tuples in a window (like in TPCH and DEBS), and the queries consist of a single stateful operation. This way, M-choice partitioners would not be impeded by the aggregation cost.

Figure 7: TPCH Query 3 relative imbalance to FLD. [Plot: relative imbalance vs. number of workers.]

GCM Query 1 (Figs. 10 & 11)

GCM Query 1 features up to 4 groups and differs from TPCH Query 1 because the number of tuples in every window is comparable to the number of groups (the average window size is 42 groups). For this query, we measured SH's, AM's, and cAM's scalability compared to […], which is the current state of the art and is expected to be scalable due to the small number of groups (as in Sec. […]). Fig. 10 presents the percentage improvement in window latency of SH, AM, and cAM compared to […]. […] has its latency improvement decrease, because aggregation cost increases when more workers are employed. In contrast, AM and cAM have their latency decrease when the number of workers increases, and they exhibit lower latency than […]. As Fig. 11 indicates, AM's and cAM's scalability results from their constant aggregation cost while the partial evaluation latency decreases. The former is not the case with […] and […], which have the aggregation cost percentage increase with the number of workers.

Take-away: AM and cAM are scalable, maintain a constant aggregation cost, and outperform […] by up to 1.3x.

GCM Query 2 (Fig. 12)

Fig. 12 illustrates the percentage improvement in window latency of […], AM, and cAM over […]. Even though this query contains a large number of groups (Table 3), its average window size is only 181 tuples and group repetition is scarce. Therefore, M-choice partitioners will not have their performance deteriorate due to an overwhelming aggregation cost (the case in TPCH Query 3, Sec. […]). However, […] is not scalable, because when additional workers are employed its aggregation cost becomes higher.
Turning to […], it manages to be scalable, but it underperforms compared to AM and cAM in all worker settings.

Take-away: AM and cAM are scalable and present more than 1.4x better latency compared to PK.

6.3 Partition algorithm cost

In this set of experiments, we measure the overhead imposed by each algorithm in terms of processing and memory cost. To that end, we picked DEBS Q1, because it features the longest group identifier (15 bytes), and the number of groups can reach up to 8.1M.

Partition latency (Figs. 13a - 13c)

To measure processing time, we marshaled DEBS data to each algorithm and measured partition latency on each window. As described in Sec. 3.2, the cardinality estimation structure size relies on the number of workers. Therefore, we measured partition latency for 8, 16, and 32 workers. Figure 13 illustrates the total time spent on each window with each partition algorithm. We included 9 and 99 percentile window latency. Most of the algorithms present constant values for 8, 16, and 32 workers. However, a noticeable

difference can be seen with 32 workers (Fig. 13c) for the 99th percentile window latency of […], […]-H, […], and […]. The increase is a result of additional processing required for those algorithms.

Take-away: Using our proposed algorithms does not incur any noticeable overhead in latency.

Figure 8: DEBS Query 1 performance (window latency). [Panels: (a) 8 workers, (b) 16 workers, (c) 32 workers; window latency (msec), mean and 99th percentile.]

Figure 9: DEBS Query 2 performance (window latency). [Panels: (a) 8 workers, (b) 16 workers, (c) 32 workers; window latency (msec), mean and 99th percentile.]

Figure 10: Latency percentage improvement over PK for GCM Query 1. [Plot: latency improvement over PK (%) vs. number of workers.]

Figure 11: Aggregation percentage of runtime for GCM Query 1. [Plot: aggregation (%) of runtime vs. number of workers.]

Partition Memory (Table 4)

Our proposed algorithms make use of cardinality estimation structures. Hence, we ran a micro-benchmark, in which we produced each possible key and replicated it to both available candidate workers. This experiment aims at examining an extreme scenario, in which all of the 8.1M groups appear in a single window. We measured memory consumed in MBs (Table 4). The size of the naive cardinality estimation structure quickly increases with the number of keys. Since each key is sent to both of the two candidates, the naive structure's size increases further. Conversely, when HLL is used, memory consumption increases only with the number of workers, and its size is affected by neither the number of keys nor the number of candidates. However, if the expected cardinality of the input stream is more than 1 million, then each HLL structure needs to double its number of buckets.

Take-away: Memory requirements of the cardinality estimation structure can be significantly limited with the use of HLL.

7. DISCUSSION

In conclusion, a pSPE's performance can be affected by both imbalance and aggregation cost. According to our experimental results, the state of the art solution (i.e., PK) fails to perform well when a large number of groups appears, and 1-choice partitioners like FLD can make use of key splitting [5] to achieve better performance. Maintaining low imbalance does not necessarily lead to limiting aggregation cost. Even if an improved and diverse load metric is used (i.e., CM with Eq. 6 and LM with Eq. 1), performance will degrade when the number of groups increases. In fact, M-choice partitioners underperform when a large number of keys appear, because they focus solely on minimizing imbalance. After conducting a sensitivity analysis on M-choice partitioners and their
