Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior


Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter
yoonguk@ece.cmu.edu papamix@cs.cmu.edu onur@cmu.edu harchol@cs.cmu.edu
Carnegie Mellon University

Abstract — In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput; 2) we introduce a niceness metric that captures a thread's propensity to interfere with other threads; 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such light threads only use a small fraction of the total available memory bandwidth.
On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).

1. Introduction

High latency of off-chip memory accesses has long been a critical bottleneck in thread performance. This has been further exacerbated in chip-multiprocessors where memory is shared among concurrently executing threads; when a thread accesses memory, it contends with other threads and, as a result, can be slowed down compared to when it has the memory entirely to itself. Inter-thread memory contention, if not properly managed, can have devastating effects on individual thread performance as well as overall system throughput, leading to system underutilization and potentially thread starvation [11]. The effectiveness of a memory scheduling algorithm is commonly evaluated based on two objectives: fairness [1, 13, 1] and system throughput [1, 13, 5]. On the one hand, no single thread should be disproportionately slowed down, while on the other hand, the throughput of the overall system should remain high. Intuitively, fairness and high system throughput ensure that all threads progress at a relatively even and fast pace. Previously proposed memory scheduling algorithms are biased towards either fairness or system throughput. In one extreme, by trying to equalize the amount of bandwidth each thread receives, some notion of fairness can be achieved, but at a large expense to system throughput [1].
In the opposite extreme, by strictly prioritizing certain favorable (memory-non-intensive) threads over all other threads, system throughput can be increased, but at a large expense to fairness [5]. As a result, such relatively single-faceted approaches cannot provide the highest fairness and system throughput at the same time. Our new scheduling algorithm exploits differences in threads' memory access behavior to optimize for both system throughput and fairness, based on several key observations. First, prior studies have demonstrated the system throughput benefits of prioritizing light (i.e., memory-non-intensive) threads over heavy (i.e., memory-intensive) threads [5, 1, ]. Memory-non-intensive threads only seldom generate memory requests and have greater potential for making fast progress in the processor. Therefore, to maximize system throughput, it is clear that a memory scheduling algorithm should prioritize memory-non-intensive threads. Doing so also does not degrade fairness because light threads rarely interfere with heavy threads. Second, we observe that unfairness problems usually stem from interference among memory-intensive threads. The most memory-intensive threads become vulnerable to starvation when less memory-intensive threads are statically prioritized over them (e.g., by forming a priority order based on a metric that corresponds to memory intensity, as done in [5]). As a result, the most memory-intensive threads can experience disproportionately large slowdowns which lead to unfairness. Third, we observe that periodically shuffling the priority order among memory-intensive threads allows each thread a chance to gain prioritized access to the memory banks, thereby reducing unfairness. However, how to best perform the shuffling is not obvious. We find that shuffling in a symmetric manner, which gives each thread equal possibility to be at all priority levels, causes unfairness because not all threads are equal in terms of their propensity to interfere with others; some threads are more likely to slow down other threads.
Hence, thread priority order should be shuffled such that threads with higher propensity to interfere with others have a smaller chance of being at higher priority. Finally, as previous work has shown, it is desirable that scheduling decisions are made in a synchronized manner across all banks [5, 1, ], so that concurrent requests of each thread are serviced in parallel, without being serialized due to interference from other threads.

Overview of Mechanism. Based on the above observations, we propose Thread Cluster Memory scheduling (TCM), an algorithm that detects and exploits differences in memory access behavior across threads. TCM dynamically groups threads into two clusters based on their memory intensity: a latency-sensitive cluster comprising memory-non-intensive threads and a bandwidth-sensitive cluster comprising memory-intensive threads. Threads in the latency-sensitive cluster are always prioritized over threads in the bandwidth-sensitive cluster to maximize system throughput. To ensure that no thread is disproportionately slowed down, TCM periodically shuffles the priority order among threads in the bandwidth-sensitive cluster. TCM's intelligent shuffling algorithm ensures that threads that are likely to slow down others spend less time at higher priority levels, thereby reducing the probability of large slowdowns. By having a sufficiently long shuffling period and performing shuffling in a synchronized manner across all banks, threads are able to exploit both row-buffer locality and bank-level parallelism. Combined, these mechanisms allow TCM to outperform any previously proposed memory scheduler in terms of both fairness and system throughput.

Contributions. In this paper, we make the following contributions: We introduce the notion of thread clusters for memory scheduling, which are groups of threads with similar memory intensity. We show that by dynamically dividing threads into two separate clusters (latency-sensitive and bandwidth-sensitive), a memory scheduling algorithm can satisfy the disparate memory needs of both clusters simultaneously. We propose a simple, dynamic clustering algorithm that serves this purpose. We show that threads in different clusters should be treated differently to maximize both system throughput and fairness. We observe that prioritizing latency-sensitive threads leads to high system throughput, while periodically perturbing the prioritization order among bandwidth-sensitive threads is critical for fairness.
We propose a new metric for characterizing a thread's memory access behavior, called niceness, which reflects a thread's susceptibility to interference from other threads. We observe that threads with high row-buffer locality are less nice to others, whereas threads with high bank-level parallelism are nicer, and monitor these metrics to compute thread niceness. Based on the proposed notion of niceness, we introduce a shuffling algorithm, called insertion shuffle, which periodically perturbs the priority ordering of threads in the bandwidth-sensitive cluster in a way that minimizes inter-thread interference by ensuring nicer threads are prioritized more often over others. This reduces unfairness within the bandwidth-sensitive cluster. We compare TCM against four previously proposed memory scheduling algorithms and show that it outperforms all existing memory schedulers in terms of both fairness (maximum slowdown) and system throughput (weighted speedup) for a 24-core system where the results are averaged across 96 workloads of varying levels of memory intensity. Compared to ATLAS [5], the best previous algorithm in terms of system throughput, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6%. Compared to PAR-BS [1], the best previous algorithm in terms of fairness, TCM improves system throughput and reduces maximum slowdown by 7.6%/4.6%. We show that TCM is configurable and can be tuned to smoothly and robustly transition between fairness and system throughput goals, something which previous schedulers, optimized for a single goal, are unable to do.

2. Background and Motivation

2.1. Defining Memory Access Behavior

TCM defines a thread's memory access behavior using three components as identified by previous work: memory intensity [5], bank-level parallelism [1], and row-buffer locality [19]. Memory intensity is the frequency at which a thread misses in the last-level cache and generates memory requests. It is measured in the unit of (cache) misses per thousand instructions, or MPKI. Memory is not a monolithic resource but consists of multiple memory banks that can be accessed in parallel.
It is the existence of multiple memory banks and their particular internal organization that give rise to bank-level parallelism and row-buffer locality, respectively. Bank-level parallelism (BLP) of a thread is the average number of banks to which there are outstanding memory requests, when the thread has at least one outstanding request. In the extreme case where a thread concurrently accesses all banks at all times, its bank-level parallelism would be equal to the total number of banks in the memory subsystem. A memory bank is internally organized as a two-dimensional structure consisting of rows and columns. The column is the smallest addressable unit of memory and multiple columns make up a single row. When a thread accesses a particular column within a particular row, the memory bank places that row in a small internal memory called the row-buffer. If a subsequent memory request accesses the same row that is in the row-buffer, it can be serviced much more quickly; this is called a row-buffer hit. The row-buffer locality (RBL) of a thread is the average hit-rate of the row-buffer across all banks.

2.2. Latency- vs. Bandwidth-Sensitive Threads

From a memory intensity perspective, we classify threads into one of two distinct groups: latency-sensitive or bandwidth-sensitive. Latency-sensitive threads spend most of their time at the processor and issue memory requests sparsely. Even though the number of generated memory requests is low, the performance of latency-sensitive threads is very sensitive to the latency of the memory subsystem; every additional cycle spent waiting on memory is a wasted cycle that could have been spent on computation. Bandwidth-sensitive threads experience frequent cache misses and thus spend a large portion of their time waiting on pending memory requests. Therefore, their rate of progress is greatly affected by the throughput of the memory subsystem. Even if a memory request is quickly serviced, subsequent memory requests will once again stall execution.
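The BLP definition above can be made concrete with a small sketch. This is an illustration, not the paper's mechanism; the function name and sampling representation are hypothetical.

```python
# Hypothetical sketch of the BLP definition in Section 2.1: the average
# number of banks with outstanding requests, counted only over sampling
# points where the thread has at least one outstanding request.

def avg_blp(samples):
    """samples: list of sets of bank ids with an outstanding request
    from the thread, one set per sampling point."""
    active = [s for s in samples if s]          # ignore idle samples
    if not active:
        return 0.0
    return sum(len(s) for s in active) / len(active)
```

For example, a thread observed with requests at banks {0, 1, 2} in one sample and bank {3} in another, and idle in a third, has an average BLP of 2.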

2.3. Our Goal: Best of Both System Throughput and Fairness

A multiprogrammed workload can consist of a diverse mix of threads including those which are latency-sensitive or bandwidth-sensitive. A well-designed memory scheduling algorithm should strive to maximize overall system throughput, but at the same time bound the worst case slowdown experienced by any one of the threads. These two goals are often conflicting and form a trade-off between system throughput and fairness. Intuitively, latency-sensitive threads (which cannot tolerate high memory latencies) should be prioritized over others to improve system throughput, while bandwidth-sensitive threads (which can tolerate high memory latencies) should be scheduled in a fairness-aware manner to limit the amount of slowdown they experience. Applying a single memory scheduling policy across all threads, an approach commonly taken by existing memory scheduling algorithms, cannot address the disparate needs of different threads. Therefore, existing algorithms are unable to decouple the system throughput and fairness goals and achieve them simultaneously. To illustrate this problem, Figure 1 compares the unfairness (maximum thread slowdown compared to when run alone on the system) and system throughput (weighted speedup) of four state-of-the-art memory scheduling algorithms (FR-FCFS [19], STFM [13], PAR-BS [1], and ATLAS [5]) averaged over 96 workloads.1

Figure 1. Performance and fairness of state-of-the-art scheduling algorithms (maximum slowdown vs. system throughput). Lower right corner is the ideal operation point.

An ideal memory scheduling algorithm would be placed towards the lower (better fairness) right (better system throughput) part of the plot in Figure 1. Unfortunately, no previous scheduling algorithm achieves the best fairness and the best system throughput at the same time. While PAR-BS provides the best fairness, it has lower system throughput than the highest-performance algorithm, ATLAS.
On the other hand, ATLAS provides the highest system throughput but its maximum slowdown is 55.3% higher than the most fair algorithm, PAR-BS. Hence, existing scheduling algorithms are good at either system throughput or fairness, but not both. Our goal in this paper is to design a memory scheduling algorithm that achieves the best of both worlds: highest system throughput and highest fairness at the same time.

1 Our evaluation methodology and baseline system configuration are described in Section 6.

2.4. Varying Susceptibility of Bandwidth-Sensitive Threads to Interference

We motivate the importance of differentiating between threads' memory access behavior by showing that not all bandwidth-sensitive threads are equal in their vulnerability to interference. To illustrate this point, we ran experiments with two bandwidth-sensitive threads that were specifically constructed to have the same memory intensity, but very different bank-level parallelism and row-buffer locality. As shown in Table 1, the random-access thread has low row-buffer locality and high bank-level parallelism, while the streaming thread has low bank-level parallelism and high row-buffer locality.

                 Memory intensity   Bank-level parallelism   Row-buffer locality
Random-access    High (0 MPKI)      High (7.7% of max.)      Low (0.1%)
Streaming        High (0 MPKI)      Low (0.3% of max.)       High (99%)

Table 1. Two examples of bandwidth-sensitive threads: random-access vs. streaming

Which of the two threads is more prone to large slowdowns when run together? Figure 2 shows the slowdown experienced by these two threads for two different scheduling policies: one where the random-access thread is strictly prioritized over the streaming thread and one where the streaming thread is strictly prioritized over the random-access thread.
Clearly, as shown in Figure 2(b), the random-access thread is more susceptible to being slowed down since it experiences a slowdown of more than 11x when it is deprioritized, which is greater than the slowdown of the streaming thread when it is deprioritized.

Figure 2. Effect of prioritization choices between the random-access thread and the streaming thread: (a) strictly prioritizing the random-access thread, (b) strictly prioritizing the streaming thread.

This is due to two reasons. First, the streaming thread generates a steady stream of requests to a bank at a given time, leading to temporary denial of service to any thread that accesses the same bank. Second, a thread with high bank-level parallelism is more susceptible to memory interference from another thread since a bank conflict leads to the loss of bank-level parallelism, resulting in the serialization of otherwise parallel requests. Therefore, all else being the same, a scheduling algorithm should favor the thread with higher bank-level parallelism when distributing the memory bandwidth among bandwidth-sensitive threads. We will use this insight to develop a new memory scheduling algorithm that intelligently prioritizes between bandwidth-sensitive threads.

3. Mechanism

3.1. Overview of TCM

Clustering Threads. To accommodate the disparate memory needs of concurrently executing threads sharing the memory, TCM dynamically groups threads into two clusters based on their memory intensity: a latency-sensitive cluster containing lower memory intensity threads and a bandwidth-sensitive cluster containing higher memory intensity threads. By employing different scheduling policies within each cluster, TCM is able to decouple the system throughput and fairness goals and optimize for each one separately.

Prioritizing the Latency-Sensitive Cluster. Memory requests from threads in the latency-sensitive cluster are always strictly prioritized over requests from threads in the bandwidth-sensitive cluster. As shown previously [5, 1, ], prioritizing latency-sensitive threads (which access memory infrequently) increases overall system throughput, because they have greater potential for making progress. Servicing memory requests from such light threads allows them to continue with their computation. To avoid starvation issues and ensure sufficient bandwidth is left over for the bandwidth-sensitive cluster, TCM limits the number of threads placed in the latency-sensitive cluster, such that they consume only a small fraction of the total memory bandwidth.

Different Clusters, Different Policies. To achieve high system throughput and to minimize unfairness, TCM employs a different scheduling policy for each cluster. The policy for the latency-sensitive cluster is geared towards high performance and low latency, since threads in that cluster have the greatest potential for making fast progress if their memory requests are serviced promptly. By contrast, the policy for the bandwidth-sensitive cluster is geared towards maximizing fairness, since threads in that cluster have heavy memory bandwidth demand and are susceptible to detrimental slowdowns if not given a sufficient share of the memory bandwidth. Within the latency-sensitive cluster, TCM enforces a strict priority, with the least memory-intensive thread receiving the highest priority.
Such a policy ensures that requests from threads spending most of their time at the processor (i.e., accessing memory infrequently) are always promptly serviced; this allows them to quickly resume their computation and ultimately make large contributions to overall system throughput. Within the bandwidth-sensitive cluster, threads share the remaining memory bandwidth, so that no thread is disproportionately slowed down or, even worse, starved. TCM accomplishes this by periodically shuffling the priority ordering among the threads in the bandwidth-sensitive cluster. To minimize thread slowdown, TCM introduces a new shuffling algorithm, called insertion shuffle, that tries to reduce the amount of inter-thread interference and at the same time maximize row-buffer locality and bank-level parallelism. To monitor inter-thread interference, we introduce a new composite metric, called niceness, which captures both a thread's propensity to cause interference and its susceptibility to interference. TCM monitors the niceness values of threads and adapts its shuffling decisions to ensure that nice threads are more likely to receive higher priority. Niceness and the effects of shuffling algorithms for the bandwidth-sensitive cluster are discussed in Section 3.3.

3.2. Grouping Threads into Two Clusters

TCM periodically ranks all threads based on their memory intensity at fixed-length time intervals called quanta. The least memory-intensive threads are placed in the latency-sensitive cluster while the remaining threads are placed in the bandwidth-sensitive cluster. Throughout each quantum, TCM monitors the memory bandwidth usage of each thread in terms of the memory service time it has received: summed across all banks in the memory subsystem, a thread's memory service time is defined to be the number of cycles that the banks were kept busy servicing its requests. The total memory bandwidth usage is defined to be the sum of each thread's memory bandwidth usage across all threads.
TCM groups threads into two clusters at the beginning of every quantum by using a parameter called ClusterThresh to specify the amount of bandwidth to be consumed by the latency-sensitive cluster (as a fraction of the previous quantum's total memory bandwidth usage). Our experimental results show that for a system with N threads, a ClusterThresh value ranging from 2/N to 6/N, i.e., forming the latency-sensitive cluster such that it consumes 2/N to 6/N of the total memory bandwidth usage, can provide a smooth transition between different good performance-fairness trade-off points. A thorough analysis of the effect of different ClusterThresh values is presented in Section 7.1. Grouping of threads into clusters happens in a synchronized manner across all memory controllers to better exploit bank-level parallelism [5, 1]. In order for all memory controllers to agree upon the same thread clustering, they periodically exchange information, every quantum. The length of our time quantum is set to one million cycles, which, based on experimental results, is short enough to detect phase changes in the memory behavior of threads and long enough to minimize the communication overhead of synchronizing multiple memory controllers. Algorithm 1 shows the pseudocode for the thread clustering algorithm used by TCM.

3.3. Bandwidth-Sensitive Cluster: Fairly Sharing the Memory

Bandwidth-sensitive threads should fairly share memory bandwidth to ensure no single thread is disproportionately slowed down. To achieve this, the thread priority order for the bandwidth-sensitive cluster needs to be periodically shuffled. As mentioned earlier, to preserve bank-level parallelism, this shuffling needs to happen in a synchronized manner across all memory banks, such that at any point in time all banks agree on a global thread priority order.

The Problem with Round-Robin. Shuffling the priority order in a round-robin fashion among bandwidth-sensitive threads would appear to be a simple solution to this problem, but our experiments revealed two problems.
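The quantum-boundary clustering step described above (given as pseudocode in Algorithm 1) can be sketched in Python. The function and variable names here are hypothetical illustrations, not from the paper.

```python
# Hypothetical sketch of TCM's quantum-boundary clustering: admit
# threads into the latency-sensitive cluster in ascending MPKI order
# until they would consume more than ClusterThresh of the previous
# quantum's total bandwidth usage.

def cluster_threads(threads, cluster_thresh):
    """threads: list of (mpki, bw_usage) tuples, indexed by thread id.
    Returns (latency_cluster_ids, bandwidth_cluster_ids)."""
    total_bw = sum(bw for _, bw in threads)
    by_mpki = sorted(range(len(threads)), key=lambda i: threads[i][0])
    latency, sum_bw = [], 0.0
    for j in by_mpki:                       # lowest-MPKI thread first
        sum_bw += threads[j][1]
        if sum_bw <= cluster_thresh * total_bw:
            latency.append(j)
        else:
            break                           # threshold exceeded
    bandwidth = [i for i in range(len(threads)) if i not in latency]
    return latency, bandwidth
```

With three threads whose (MPKI, bandwidth usage) are (1, 10), (50, 60), and (5, 30) and a threshold of 0.5, the two light threads fit under half of the total bandwidth and form the latency-sensitive cluster.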
The first problem is that a round-robin shuffling algorithm is oblivious to inter-thread interference: it is not aware of which threads are more likely

Algorithm 1 Clustering Algorithm

  Initialization:
    LatencyCluster ← ∅; BandwidthCluster ← ∅
    Unclassified ← {thread_i : 1 ≤ i ≤ N_threads}
    SumBW ← 0
  Per-thread parameters:
    MPKI_i: misses per kiloinstruction of thread i
    BWusage_i: bandwidth used by thread i during previous quantum
  Clustering: (beginning of quantum)
    TotalBWusage ← Σ_i BWusage_i
    while Unclassified ≠ ∅ do
      j = arg min_i MPKI_i   // find thread with lowest MPKI
      SumBW ← SumBW + BWusage_j
      if SumBW ≤ ClusterThresh × TotalBWusage then
        Unclassified ← Unclassified − {thread_j}
        LatencyCluster ← LatencyCluster ∪ {thread_j}
      else
        break
      end if
    end while
    BandwidthCluster ← Unclassified

to slow down others. The second problem is more subtle and is tied to the way memory banks handle thread priorities: when choosing which memory request to service next, each bank first considers the requests from the highest priority thread according to the current priority order. If that thread has no requests, then the next highest priority thread is considered and so forth. As a result, a thread does not necessarily have to be at the top priority position to get some of its requests serviced. In other words, memory service leaks from the highest priority levels to lower ones. In fact, in our experiments we often encountered cases where memory service was leaked all the way to the fifth or sixth highest priority thread in a 24-thread system. This memory service leakage effect is the second reason the simple round-robin algorithm performs poorly. In particular, the problem with round-robin is that a thread always maintains its relative position with respect to other threads. This means lucky threads scheduled behind leaky threads will consistently receive more service than other threads that are scheduled behind non-leaky threads, resulting in unfairness. This problem becomes more evident if one considers the different memory access behavior of threads. For instance, a streaming thread that exhibits high row-buffer locality and low bank-level parallelism will severely leak memory service time at all memory banks except for the single bank it is currently accessing.

Thread Niceness and Insertion Shuffle.
To alleviate the problems stemming from memory service leakage and to minimize inter-thread interference, TCM employs a new shuffling algorithm, called insertion shuffle, that reduces memory interference and increases fairness by exploiting heterogeneity in the bank-level parallelism and row-buffer locality among different threads. (The name is derived from the similarity to the insertion sort algorithm. Each intermediate state during an insertion sort corresponds to one of the permutations in insertion shuffle.) We introduce a new metric, called niceness, that captures a thread's propensity to cause interference and its susceptibility to interference.

Algorithm 2 Insertion Shuffling Algorithm

  Definitions:
    N: number of threads in bandwidth-sensitive cluster
    threads[N]: array of bandwidth-sensitive threads; we define a thread's rank as its position in the array (Nth position occupied by highest-ranked thread)
    incsort(i, j): sort subarray threads[i..j] in increasing niceness
    decsort(i, j): sort subarray threads[i..j] in decreasing niceness
  Initialization: (beginning of quantum)
    incsort(1, N)   // nicest thread is highest ranked
  Shuffling: (throughout quantum)
    while true do
      for i = N to 1 do   // each iteration occurs every ShuffleInterval
        decsort(i, N)
      end for
      for i = 1 to N do   // each iteration occurs every ShuffleInterval
        incsort(1, i)
      end for
    end while

A thread with high row-buffer locality is likely to make consecutive accesses to a small number of banks and cause them to be congested. Under such circumstances, another thread with high bank-level parallelism becomes vulnerable to memory interference since it is subject to transient high loads at any of the many banks it is concurrently accessing. Hence, a thread with high bank-level parallelism is fragile (more likely to be interfered with by others), whereas one with high row-buffer locality is hostile (more likely to cause interference to others), as we have empirically demonstrated in Section 2.4. We define a thread's niceness to increase with the relative fragility of a thread and to decrease with its relative hostility.
Within the bandwidth-sensitive cluster, if thread i has the b_i-th highest bank-level parallelism and the r_i-th highest row-buffer locality, we formally define its niceness as follows: Niceness_i ≡ r_i − b_i. Every quantum, threads are sorted based on their niceness value to yield a ranking, where the nicest thread receives the highest rank. Subsequently, every ShuffleInterval cycles, the insertion shuffle algorithm perturbs this ranking in a way that reduces the time during which the least nice threads are prioritized over the nicest threads, ultimately resulting in less interference. Figure 3 visualizes successive permutations of the priority order for both the round-robin and the insertion shuffle algorithms for four threads. It is interesting to note that in the case of insertion shuffle, the least nice thread spends most of its time at the lowest priority position, while the remaining nicer threads are at higher priorities and are thus able to synergistically leak their memory service time among themselves. Algorithm 2 shows the pseudocode for the insertion shuffle algorithm. Note that the pseudocode does not reflect the actual hardware implementation; the implementation is simple because the permutation is regular.

Handling Threads with Similar Behavior. If the bandwidth-sensitive cluster consists of homogeneous

threads with very similar memory behavior, TCM disables insertion shuffle and falls back to random shuffle to prevent unfair treatment of threads based on marginal differences in niceness values.

Figure 3. Visualizing two shuffling algorithms: (a) round-robin shuffle, (b) insertion shuffle.

To do this, TCM inspects whether threads exhibit a sufficient amount of diversity in memory access behavior before applying insertion shuffling. First, TCM calculates the largest difference between any two threads in terms of bank-level parallelism (maxBLP) and row-buffer locality (maxRBL). Second, if both values exceed a certain fraction (ShuffleAlgoThresh) of their maximum attainable values, then insertion shuffling is applied. Specifically, maxBLP must exceed ShuffleAlgoThresh × NumBanks and maxRBL must exceed ShuffleAlgoThresh. In our experiments we set ShuffleAlgoThresh to be 0.1, which intuitively means that TCM falls back to random shuffling if BLP and RBL differ by less than 10% across all threads in the system.

Random Shuffling. When random shuffling is employed, a random permutation of threads is generated every shuffling interval, which serves as the thread ranking for the next shuffling interval. In contrast to insertion shuffling, random shuffling is oblivious to thread niceness and does not follow a predetermined shuffling pattern. Random shuffling is also different from round-robin in that it does not preserve the relative position of threads across shuffles, thereby preventing cases where a nice thread remains stuck behind a highly interfering or malicious thread. The major advantage of random shuffling over insertion shuffling is the significantly lower implementation complexity; it does not require the monitoring of BLP and RBL or the calculation of niceness values for each thread. However, random shuffling pays the penalty of increased unfairness, since it is unable to successfully minimize the interference among heterogeneous threads with large differences in niceness, as we empirically show in Section 7.3. TCM can be forced to always employ random shuffling by setting ShuffleAlgoThresh to 1.
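The niceness-based ranking that seeds insertion shuffle can be sketched as below. This is a hypothetical illustration consistent with the definition in the text (niceness rises with BLP-derived fragility and falls with RBL-derived hostility); the names are not from the paper.

```python
# Hypothetical sketch: compute niceness from per-thread BLP and RBL.
# If a thread has the b-th highest BLP and the r-th highest RBL, its
# niceness is r - b: high BLP (fragile) raises niceness, while high
# RBL (hostile) lowers it.

def niceness_ranking(blp, rbl):
    """blp, rbl: per-thread measurements. Returns thread ids ordered
    from nicest (highest rank) to least nice."""
    n = len(blp)
    b_rank = {t: k + 1 for k, t in
              enumerate(sorted(range(n), key=lambda t: -blp[t]))}
    r_rank = {t: k + 1 for k, t in
              enumerate(sorted(range(n), key=lambda t: -rbl[t]))}
    nice = {t: r_rank[t] - b_rank[t] for t in range(n)}
    return sorted(range(n), key=lambda t: -nice[t])
```

For two threads where thread 0 has high BLP and low RBL (fragile) and thread 1 the opposite (hostile), thread 0 is ranked nicest.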
Section 7.5 provides sensitivity results for ShuffleAlgoThresh; Section 7.3 evaluates the effect of different shuffling algorithms.

3.4. Monitoring Memory Access Behavior of Threads

To implement TCM, the L2 cache and memory controller collect statistics for each thread by continuously monitoring its memory intensity, row-buffer locality (RBL), and bank-level parallelism (BLP) over time. If there are multiple memory controllers, this information is sent to a centralized meta-controller at the end of a quantum, similarly to what is done in [5]. The meta-controller aggregates the information, computes thread clusters and ranks as described previously, and communicates them to each of the memory controllers to ensure that the thread prioritization order is the same in all controllers.

Memory intensity. A thread's L2 MPKI (L2 cache misses per kiloinstruction) is computed at the L2 cache controller and serves as the measure of memory intensity.

Row-buffer locality. Each memory controller estimates the inherent row-buffer locality of a thread. Doing so requires the memory controller to keep track of a shadow row-buffer index [11] for each thread for each bank, which keeps track of the row that would have been open in that bank if the thread were running alone on the system. RBL is simply calculated as the number of shadow row-buffer hits divided by the number of accesses during a quantum.

Bank-level parallelism. Each memory controller counts the number of banks that have at least one memory request from a thread as an estimate of the thread's instantaneous BLP had it been running alone. Throughout a quantum, each controller takes samples of a thread's instantaneous BLP and computes the average BLP for that thread, which is sent to the meta-controller at the end of the quantum. The meta-controller then computes the average BLP for each thread across all memory controllers.

3.5. Summary: Thread Cluster Memory Scheduling (TCM) Prioritization Rules

Algorithm 3 summarizes how TCM prioritizes memory requests from threads.
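The shadow row-buffer bookkeeping used for RBL estimation in Section 3.4 can be sketched as follows. The class and method names are hypothetical; the real mechanism is a small per-thread, per-bank hardware structure.

```python
# Hypothetical sketch of shadow row-buffer tracking: per bank, record
# the row that would be open if the thread ran alone; RBL is shadow
# hits divided by total accesses in the quantum.

class ShadowRowBuffer:
    def __init__(self):
        self.open_row = {}      # bank id -> row that would be open
        self.hits = 0
        self.accesses = 0

    def access(self, bank, row):
        self.accesses += 1
        if self.open_row.get(bank) == row:
            self.hits += 1      # would have been a row-buffer hit
        self.open_row[bank] = row

    def rbl(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

A thread that repeatedly touches the same row of a bank accumulates shadow hits even if, in the shared system, other threads keep closing that row.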
When requests from multiple threads compete to access a bank, the higher ranked thread (where ranking depends on the thread cluster) is prioritized as we have described previously. If two requests share the same priority, row-buffer hit requests are favored. All else being equal, older requests are favored.

3.6. System Software Support

Thread Weights. TCM supports thread weights (or priorities) as assigned by the operating system, such that threads with larger weights are prioritized in the memory. Unlike previous scheduling algorithms, TCM prioritizes a thread based on its weight while also striving to preserve the performance of other threads. Given a thread with a very large thread weight, blindly prioritizing it over all other threads without regard to both its and others'

Algorithm 3: TCM request prioritization
1. Highest-rank first: Requests from higher-ranked threads are prioritized. Latency-sensitive threads are ranked higher than bandwidth-sensitive threads (Section 3.1). Within the latency-sensitive cluster, lower-MPKI threads are ranked higher than others (Section 3.1). Within the bandwidth-sensitive cluster, rank order is determined by insertion shuffling (Section 3.3).
2. Row-hit first: Row-buffer hit requests are prioritized over others.
3. Oldest first: Older requests are prioritized over others.

memory access behavior would lead to destruction of the performance of all other threads and, as a result, severely degrade system throughput and fairness. TCM solves this problem by honoring thread weights within the context of thread clusters. For example, even if the operating system assigns a large weight to a bandwidth-sensitive thread, TCM does not prioritize it over the latency-sensitive threads, because doing so would significantly degrade the performance of all latency-sensitive threads without significantly improving the performance of the higher-weight thread (as latency-sensitive threads rarely interfere with it). To enforce thread weights within the latency-sensitive cluster, TCM scales down each thread's MPKI by its weight. Thus, a thread with a larger weight is more likely to be ranked higher than other latency-sensitive threads because its scaled MPKI appears to be low. Within the bandwidth-sensitive cluster, TCM implements weighted shuffling, where the time a thread spends at the highest priority level is proportional to its weight.

Fairness/Performance Trade-off Knob. TCM's ClusterThresh is exposed to the system software such that the system software can select a value that favors its desired metric. We discuss the effect of ClusterThresh on fairness and performance in Section 7.1.

3.7. Multithreaded Workloads

Multithreaded applications can be broadly categorized into two types: those whose threads execute mostly independently of each other and those whose threads require frequent synchronization.
Since the first type of multithreaded applications resembles, to a certain extent, multiprogrammed workloads, they are expected to perform well under TCM. In contrast, the execution time of the second type of multithreaded applications is determined by slow-running critical threads [, 1, ]. For such applications, TCM can be extended to incorporate the notion of thread criticality to properly identify and prioritize critical threads. Furthermore, we envision TCM to be applicable to composite workloads that consist of an assortment of different applications (e.g., multiple multithreaded applications), by reducing inter-application memory interference.

4. Implementation and Hardware Cost

TCM requires hardware support to 1) monitor threads' memory access behavior and 2) schedule memory requests as described. Table 2 shows the major hardware storage cost incurred in each memory controller to monitor threads' memory access behavior. The required additional storage cost within a controller on our baseline 24-core system is less than 4 Kbits. (If pure random shuffling is employed, it is less than 0.5 Kbits.) TCM requires additional logic to rank threads by aggregating monitored thread metrics. Both ranking and aggregation logic are utilized only at the end of each quantum and are not on the critical path of the processor. Ranking can be implemented using priority encoders, as was done in [5]. At the end of every quantum, a central meta-controller (similar to [5]) gathers data from every memory controller to cluster threads and to calculate niceness. Subsequently, the central meta-controller broadcasts the results to all the memory controllers so that they can make consistent scheduling decisions throughout the next quantum. At any given point in time, each memory controller prioritizes threads according to their ranking (Algorithm 3). Even though the ranking of the bandwidth-sensitive cluster is shuffled, it is consistent across all memory controllers, since shuffling is deterministic and occurs at regular time intervals.
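The storage claim above can be sanity-checked with back-of-the-envelope arithmetic over the counters of Table 2. The specific parameter values below (24 hardware threads, 4 banks per controller, 10-bit MPKI counters, a 64-entry request queue, 16K rows per bank, 16-bit hit counters) are our assumptions for illustration:

```python
import math

# Back-of-the-envelope check of per-controller monitoring storage.
# All parameter values below are illustrative assumptions.
N_THREAD = 24                        # hardware threads in the baseline system
N_BANK = 4                           # banks per memory controller
LOG_MPKI_MAX = 10                    # MPKI counter width
LOG_QUEUE_MAX = int(math.log2(64))   # request-queue occupancy counter width
LOG_N_ROWS = 14                      # log2 of rows per bank (16K rows)
LOG_COUNT_MAX = 16                   # shadow row-buffer hit counter width

bits = {
    "MPKI-counter":            N_THREAD * LOG_MPKI_MAX,
    "Load-counter":            N_THREAD * N_BANK * LOG_QUEUE_MAX,
    "BLP-counter":             N_THREAD * int(math.log2(N_BANK)),
    "BLP-average":             N_THREAD * int(math.log2(N_BANK)),
    "Shadow row-buffer index": N_THREAD * N_BANK * LOG_N_ROWS,
    "Shadow row-buffer hits":  N_THREAD * N_BANK * LOG_COUNT_MAX,
}
total_bits = sum(bits.values())
print(total_bits)  # 3792 bits, i.e. under 4 Kbits per controller
```

Under these assumptions the total comes out just below 4 Kbits, consistent with the text's storage estimate.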
The meta-controller exists only to reduce hardware complexity by consolidating parts of the processing logic at a single location rather than replicating it across separate memory controllers. Although the meta-controller is centralized, it is unlikely to impact scalability since only small amounts of data (a few bytes per hardware context per controller) are exchanged infrequently (once every million cycles). Furthermore, the communication is not latency-critical because the previous ranking can be used in the controllers while the next ranking is being computed or transferred.

5. Related Work: Comparison with Other Memory Schedulers

We describe related work on memory scheduling and qualitatively compare TCM to several previous designs. Section 7 compares TCM quantitatively with four state-of-the-art schedulers [19, 13, 1, 5].

Thread-Unaware Memory Schedulers. Memory controller designs that do not distinguish between different threads [8, 19, 5, 9, 3, 0, 15] have been examined within the context of single-threaded, vector, or streaming architectures. The FR-FCFS scheduling policy [19], which prioritizes row-hit requests over other requests, is commonly employed in existing processors. Recent work [] explored reducing the cost of the FR-FCFS design for accelerators. The goal of these policies is to maximize DRAM throughput. Thread-unaware scheduling policies have been shown to be low-performance and prone to starvation when multiple competing threads share the memory controller in general-purpose multicore/multithreaded systems [11, 1, 18, , 13, 1, 5].

Thread-Aware Memory Schedulers. Recent work designed thread-aware memory schedulers with the goal of improving fairness and providing QoS. Fair queueing memory schedulers [1, 18] adapted variants of the fair queueing algorithm from computer networks

to build a memory scheduler that provides QoS to each thread. The stall-time fair memory scheduler (STFM) [13] uses heuristics to estimate the slowdown of each thread, compared to when it is run alone, and prioritizes the thread that has been slowed down the most. These algorithms aim to maximize fairness, although they can also lead to throughput improvements by improving system utilization.

Memory intensity
  MPKI-counter (monitored by processor) | A thread's cache misses per kilo-instruction | N_thread x log2(MPKI_max) = 240
Bank-level parallelism
  Load-counter | Number of outstanding thread requests to a bank | N_thread x N_bank x log2(Queue_max) = 576
  BLP-counter | Number of banks for which load-counter > 0 | N_thread x log2(N_bank) = 48
  BLP-average | Average value of BLP-counter | N_thread x log2(N_bank) = 48
Row-buffer locality
  Shadow row-buffer index | Index of a thread's last accessed row | N_thread x N_bank x log2(N_rows) = 1344
  Shadow row-buffer hits | Number of row-buffer hits if a thread were running alone | N_thread x N_bank x log2(Count_max) = 1536

Table 2. Storage required for monitoring threads' memory access behavior

Processor pipeline | 128-entry instruction window; fetch/exec/commit width of 3 instructions per cycle in each core; only 1 can be a memory operation
L1 caches | 32 KB per core, 4-way set associative, 32-byte block size
L2 caches | 512 KB per core, 8-way set associative, 32-byte block size
DRAM controller (on-chip) | 128-entry request buffer, 64-entry write data buffer, reads prioritized over writes
DRAM chip parameters | Micron DDR2-800 timing parameters (see []); tCL = 15ns, tRCD = 15ns, tRP = 15ns, BL/2 = 10ns; 4 banks, 2 KB row-buffer per bank
DIMM configuration | Single-rank, 8 DRAM chips put together on a DIMM
Round-trip L2 miss latency | For a 32-byte cache block, uncontended: row-buffer hit 40ns (200 cycles), closed 60ns (300 cycles), conflict 80ns (400 cycles)
Cores and DRAM controllers | 24 cores, 4 independent DRAM controllers (each controller has 6.4 GB/s peak DRAM bandwidth)
Parallelism-aware batch scheduling (PAR-BS) [1] aims to achieve a balance between fairness and throughput. To avoid unfairness, PAR-BS groups memory requests into batches and prioritizes older batches over younger ones. To improve system throughput, PAR-BS prioritizes less-intensive threads over others to exploit bank-level parallelism. As we will show in Section 7, PAR-BS's batching policy implicitly penalizes memory-non-intensive threads because memory-intensive threads usually insert many more requests into a batch, leading to long delays for memory-non-intensive threads and hence relatively low system throughput.

ATLAS [5] aims to maximize system throughput by prioritizing threads that have attained the least service from the memory controllers. However, as shown in [5], this increase in system throughput comes at the cost of fairness, because the most memory-intensive threads receive the lowest priority and incur very high slowdowns.

Ipek et al. [] leverage machine learning techniques to implement memory scheduling policies that maximize DRAM throughput. Zhu and Zhang [7] describe memory scheduling optimizations for SMT processors to improve DRAM throughput. Neither of these considers fairness or system throughput in the presence of competing threads. Lee et al. [] describe a mechanism to adaptively prioritize between prefetch and demand requests in a memory scheduler; their mechanism can be combined with ours.

Table 3. Baseline CMP and memory system configuration

Comparison with TCM. Overall, previous thread-aware memory scheduling algorithms have three major shortcomings, which we address in TCM. First, they are mainly biased towards either fairness or system throughput: no previous algorithm achieves the best system throughput and fairness at the same time. We will show that TCM achieves this by employing multiple different prioritization algorithms, each tailored for system throughput or fairness. Second, previous algorithms do not provide a knob that allows a smooth and gradual trade-off between system throughput and fairness.
TCM's ability to group threads into two clusters with different policies, optimized for fairness or system throughput respectively, allows it to trade off between fairness and system throughput by varying the clustering threshold. Third, previous algorithms do not distinguish different threads' propensity for causing interference to others. As a result, they cannot customize their prioritization policies to the specific needs/behavior of different threads. TCM, by tracking the memory access characteristics of threads, determines a prioritization order that favors threads that are likely to cause less interference to others, leading to improvements in fairness and system throughput.

6. Methodology and Metrics

We evaluate TCM using an in-house cycle-level x86 CMP simulator, the front-end of which is based on Pin [7]. The memory subsystem is modeled using DDR2 timing parameters [], which were verified using DRAMSim [3] and measurements from real hardware. Table 3 shows the major DRAM and processor parameters in the baseline configuration. Unless stated otherwise, we assume a 24-core CMP with 4 memory controllers.

Workloads. We use the SPEC CPU2006 benchmarks for evaluation. We compiled each benchmark using gcc with -O3 optimizations and chose a representative simulation phase using PinPoints [17]. From these benchmarks, we formed multiprogrammed workloads of varying memory intensity, which were run for 20 million cycles.

Table 4. Individual benchmark characteristics (MPKI: misses per kilo-instruction, RBL: row-buffer locality, BLP: bank-level parallelism)

Workload A:
  Memory-intensive: mcf, soplex(2), lbm(2), leslie3d, sphinx3, xalancbmk, omnetpp, astar, hmmer(2)
  Memory-non-intensive: calculix(3), dealII, gcc, gromacs(2), namd, perl, povray, sjeng, tonto
Workload B:
  Memory-intensive: bzip2(2), cactusADM(3), GemsFDTD, h264ref(2), hmmer, libquantum(2), sphinx3
  Memory-non-intensive: gcc(2), gobmk(3), namd(2), perl(3), sjeng, wrf
Workload C:
  Memory-intensive: GemsFDTD(2), libquantum(3), cactusADM, astar, omnetpp, bzip2, soplex(3)
  Memory-non-intensive: calculix(2), dealII(2), gromacs(2), namd, perl(2), povray, tonto, wrf
Workload D:
  Memory-intensive: omnetpp, bzip2(2), h264ref, cactusADM, astar, soplex, lbm(2), leslie3d, xalancbmk(2)
  Memory-non-intensive: calculix, dealII, gcc, gromacs, perl, povray(2), sjeng(2), tonto(3)

Table 5. Four representative workloads (figure in parentheses is the number of instances spawned)

We classify benchmarks based on their memory intensity; benchmarks with an average MPKI greater than one are labeled as memory-intensive, while all other benchmarks are labeled as memory-non-intensive. The memory intensity of a workload is defined as the fraction of memory-intensive benchmarks in that workload. Unless stated otherwise, results are for workloads that are 50% memory-intensive (i.e., consisting of 50% memory-intensive benchmarks). For each memory intensity category (50%, 75% and 100%), we simulate 32 multiprogrammed workloads, for a total of 96 workloads.

Evaluation Metrics. We measure system throughput using weighted speedup [1] and fairness using maximum slowdown. We also report harmonic speedup [8], which measures a balance of fairness and throughput.
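These metrics are computed from each thread's IPC when running in the shared system versus running alone. A small sketch of the three metrics, with made-up per-thread IPC values:

```python
# Evaluation metrics from per-thread IPCs; the IPC values are made up
# purely for illustration.

def weighted_speedup(alone, shared):
    # System throughput: sum of per-thread normalized performance.
    return sum(s / a for a, s in zip(alone, shared))

def harmonic_speedup(alone, shared):
    # Harmonic mean of speedups: balances throughput and fairness.
    return len(alone) / sum(a / s for a, s in zip(alone, shared))

def maximum_slowdown(alone, shared):
    # Fairness: slowdown of the worst-off thread (lower is fairer).
    return max(a / s for a, s in zip(alone, shared))

ipc_alone  = [2.0, 1.0]
ipc_shared = [1.0, 0.25]  # thread 1 is slowed down 4x by interference

ws = weighted_speedup(ipc_alone, ipc_shared)   # 0.5 + 0.25 = 0.75
ms = maximum_slowdown(ipc_alone, ipc_shared)   # max(2.0, 4.0) = 4.0
hs = harmonic_speedup(ipc_alone, ipc_shared)   # 2 / (2.0 + 4.0) = 1/3
```

Note how maximum slowdown isolates the single most-penalized thread, which is why a strict-priority scheduler can score well on weighted speedup yet poorly on this metric.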
Weighted Speedup = sum over threads i of (IPC_i^shared / IPC_i^alone)
Harmonic Speedup = N / (sum over threads i of (IPC_i^alone / IPC_i^shared))
Maximum Slowdown = max over threads i of (IPC_i^alone / IPC_i^shared)

Parameters of Evaluated Schemes. Unless stated otherwise, we use a BatchCap of 5 for PAR-BS [1], a QuantumLength of 10M cycles and a HistoryWeight of 0.875 for ATLAS [5], and a FairnessThreshold of 1.1 and an IntervalLength of 2^24 for STFM [13]. FR-FCFS [19] has no parameters. For TCM, we set ClusterThresh to 4/24, ShuffleInterval to 800, and ShuffleAlgoThresh to 0.1.

7. Results

We compare TCM's performance against four previously proposed memory scheduling algorithms: FR-FCFS [19], STFM [13], PAR-BS [1] (the best previous algorithm for fairness), and ATLAS [5] (the best previous algorithm for system throughput). Figure 4 shows where each scheduling algorithm lies with regard to fairness and system throughput, averaged across all 96 workloads of varying memory intensity. The lower right part of the figure corresponds to better fairness (lower maximum slowdown) and better system throughput (higher weighted speedup). TCM achieves both the best system throughput and the best fairness, outperforming every algorithm with regard to weighted speedup, maximum slowdown, and harmonic speedup (the last shown in Fig. 4(b)).

Figure 4. Performance and fairness of TCM vs. other algorithms across all 96 workloads

Compared to ATLAS, the highest-performance previous algorithm, TCM provides significantly better fairness (38.6% lower maximum slowdown) and better system throughput (4.6% higher weighted speedup). ATLAS suffers from unfairness because it is a strict priority-based scheduling algorithm, where the thread with the lowest priority can access memory only when no other threads have outstanding memory requests to the same bank. As a result, the most deprioritized threads (those which are the most memory-intensive) become vulnerable to starvation and large slowdowns. TCM avoids this problem by using shuffling to ensure that no memory-intensive thread is disproportionately deprioritized. The performance of TCM shown here is for just a single operating point.
As we will show in Section 7.1, TCM provides the flexibility of smoothly transitioning along a wide range of different performance-fairness trade-off points.

Compared to PAR-BS, the most fair previous algorithm, TCM provides significantly better system throughput (7.6% higher weighted speedup) and better fairness (4.6% lower maximum slowdown). PAR-BS suffers from relatively low system throughput, since memory requests from memory-intensive threads can block those from memory-non-intensive threads. PAR-BS periodically forms batches of memory requests and strictly prioritizes older batches. Batch formation implicitly favors memory-intensive threads because such threads have more requests that can be included in the batch. As a result, memory-non-intensive threads are slowed down because their requests (which arrive infrequently) have to wait for the previous batch of requests, mostly full of memory-intensive threads' requests, to be serviced. TCM avoids this problem by ensuring that memory-non-intensive threads are always strictly prioritized over memory-intensive ones.

TCM outperforms STFM in weighted speedup by 11.1% and in maximum slowdown by 3.5%. TCM also outperforms the thread-unaware FR-FCFS in both system throughput and maximum slowdown (50.1% lower). We conclude that TCM provides the best fairness and system performance across all examined previous scheduling algorithms.

Individual Workloads. Figure 5 shows individual results for four randomly selected, representative workloads described in Table 5. We find that the performance and fairness improvements of TCM over all other algorithms are consistent across different workloads.

Figure 5. TCM vs. other algorithms for sample workloads ((a) weighted speedup, (b) maximum slowdown for individual workloads) and averaged across 32 workloads

7.1. Trading off between Performance and Fairness

To study the robustness of each memory scheduler, as well as its ability to adapt to different performance and fairness goals, we varied the most salient configuration parameters of each scheduler.
We evaluated ATLAS for a QuantumLength ranging from 1K (conservative) to 10M cycles (aggressive), PAR-BS for a range of BatchCap values, STFM for a FairnessThreshold ranging from 1 (conservative) to 5 (aggressive), and FR-FCFS (which has no parameters). Finally, for TCM, we vary the ClusterThresh from 2/24 to 6/24 in 1/24 increments. The performance and fairness results are shown in Figure 6. The lower right and upper right parts of Figures 6(a) and 6(b), respectively, correspond to better operating points in terms of both performance and fairness.

In contrast to previous memory scheduling algorithms, TCM exposes a smooth continuum between system throughput and fairness. By adjusting the clustering threshold between the latency- and bandwidth-sensitive clusters, system throughput and fairness can be gently traded off for one another. As a result, TCM has a wide range of balanced operating points that provide both high system throughput and fairness. None of the previously proposed algorithms provides nearly the same degree of flexibility as TCM. For example, ATLAS always remains biased towards system throughput (i.e., its maximum slowdown changes little), regardless of its QuantumLength setting. Similarly, PAR-BS remains biased towards fairness (i.e., its weighted speedup changes little).

For TCM, an aggressive (large) ClusterThresh value provides more bandwidth for the latency-sensitive cluster and allows relatively lighter threads among the bandwidth-sensitive cluster to move into the latency-sensitive cluster. As a result, system throughput is improved, since the lighter threads are prioritized over the heavier threads. But the remaining threads in the bandwidth-sensitive cluster now compete for a smaller fraction of the memory bandwidth and experience larger slowdowns, leading to higher unfairness. In contrast, a conservative (small) ClusterThresh value provides only a small fraction of the memory bandwidth for the latency-sensitive cluster, so that most threads are included in the bandwidth-sensitive cluster and, as a result, take turns sharing the memory.
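The effect of ClusterThresh can be illustrated with a simplified clustering model in which a thread's bandwidth usage is approximated by its MPKI; threads are taken in increasing MPKI order into the latency-sensitive cluster until their combined share of total traffic would exceed the threshold. The function and thread names, and the MPKI-as-bandwidth approximation, are illustrative assumptions:

```python
# Simplified threshold-based clustering sketch. A thread's bandwidth usage
# is approximated here by its MPKI (an assumption for illustration).

def cluster_threads(mpki, cluster_thresh):
    total = sum(mpki.values())
    latency_sensitive, used = [], 0.0
    # Take threads in increasing order of memory intensity...
    for tid in sorted(mpki, key=mpki.get):
        # ...until their combined traffic share would exceed the threshold.
        if used + mpki[tid] > cluster_thresh * total:
            break
        latency_sensitive.append(tid)
        used += mpki[tid]
    bandwidth_sensitive = [t for t in mpki if t not in latency_sensitive]
    return latency_sensitive, bandwidth_sensitive

mpki = {"t0": 0.5, "t1": 1.0, "t2": 30.0, "t3": 70.0}
ls, bs = cluster_threads(mpki, cluster_thresh=4 / 24)
# total traffic = 101.5; budget = (4/24) * 101.5 ~ 16.9
# t0 and t1 (1.5 combined) fit; adding t2 (30.0) would exceed the budget.
```

Raising cluster_thresh pulls more of the lighter bandwidth-sensitive threads into the latency-sensitive cluster (improving throughput), while the heavier leftovers share less bandwidth (hurting fairness), matching the trade-off described above.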
We conclude that TCM provides an effective knob for trading off between fairness and performance, enabling operation at different desirable operating points depending on system requirements.

7.2. Effect of Workload Memory Intensity

Figure 7 compares the performance of TCM to previously proposed scheduling algorithms for four sets of 32 workloads that are 25%, 50%, 75% and 100% memory-intensive. (We include the 25%-intensity workloads for completeness, even though memory is not a large bottleneck for them.) TCM's relative advantage over PAR-BS and ATLAS becomes greater as the workload becomes more memory-intensive and memory becomes more heavily contended. When all the threads in the workload are memory-intensive, TCM provides its largest gains over both PAR-BS and ATLAS, in weighted speedup as well as in maximum slowdown. TCM provides higher gains for very memory-intensive workloads because previous algorithms are either unable to prioritize less memory-intensive threads (due to the batching policy in PAR-BS) or cause severe deprioritization of the most memory-intensive threads (due to strict ranking in ATLAS) in such heavily contended systems.

More information

Internet Traffic Managers

Internet Traffic Managers Internet Traffc Managers Ibrahm Matta matta@cs.bu.edu www.cs.bu.edu/faculty/matta Computer Scence Department Boston Unversty Boston, MA 225 Jont work wth members of the WING group: Azer Bestavros, John

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Real-time Scheduling

Real-time Scheduling Real-tme Schedulng COE718: Embedded System Desgn http://www.ee.ryerson.ca/~courses/coe718/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty Overvew RTX

More information

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control Quantfyng Responsveness of TCP Aggregates by Usng Drect Sequence Spread Spectrum CDMA and Its Applcaton n Congeston Control Mehd Kalantar Department of Electrcal and Computer Engneerng Unversty of Maryland,

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Achieving class-based QoS for transactional workloads

Achieving class-based QoS for transactional workloads Achevng class-based QoS for transactonal workloads Banca Schroeder Mor Harchol-Balter Carnege Mellon Unversty Department of Computer Scence Pttsburgh, PA USA @cs.cmu.edu Arun Iyengar Erch

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm Desgn of a Real Tme FPGA-based Three Dmensonal Postonng Algorthm Nathan G. Johnson-Wllams, Student Member IEEE, Robert S. Myaoka, Member IEEE, Xaol L, Student Member IEEE, Tom K. Lewellen, Fellow IEEE,

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Gateway Algorithm for Fair Bandwidth Sharing

Gateway Algorithm for Fair Bandwidth Sharing Algorm for Far Bandwd Sharng We Y, Rupnder Makkar, Ioanns Lambadars Department of System and Computer Engneerng Carleton Unversty 5 Colonel By Dr., Ottawa, ON KS 5B6, Canada {wy, rup, oanns}@sce.carleton.ca

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms On Achevng Farness n the Jont Allocaton of Buffer and Bandwdth Resources: Prncples and Algorthms Yunka Zhou and Harsh Sethu (correspondng author) Abstract Farness n network traffc management can mprove

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

Efficient QoS Provisioning at the MAC Layer in Heterogeneous Wireless Sensor Networks

Efficient QoS Provisioning at the MAC Layer in Heterogeneous Wireless Sensor Networks Effcent QoS Provsonng at the MAC Layer n Heterogeneous Wreless Sensor Networks M.Soul a,, A.Bouabdallah a, A.E.Kamal b a UMR CNRS 7253 HeuDaSyC, Unversté de Technologe de Compègne, Compègne Cedex F-625,

More information

WITH rapid improvements of wireless technologies,

WITH rapid improvements of wireless technologies, JOURNAL OF SYSTEMS ARCHITECTURE, SPECIAL ISSUE: HIGHLY-RELIABLE CPS, VOL. 00, NO. 0, MONTH 013 1 Adaptve GTS Allocaton n IEEE 80.15.4 for Real-Tme Wreless Sensor Networks Feng Xa, Ruonan Hao, Je L, Naxue

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan #, Praveen Yalagandula, Mke Dahln #, Yn Zhang # # Unversty of Texas at Austn HP Labs Abstract We present, a self-tunng, bandwdth-aware

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

A Genetic Algorithm Based Dynamic Load Balancing Scheme for Heterogeneous Distributed Systems

A Genetic Algorithm Based Dynamic Load Balancing Scheme for Heterogeneous Distributed Systems Proceedngs of the Internatonal Conference on Parallel and Dstrbuted Processng Technques and Applcatons, PDPTA 2008, Las Vegas, Nevada, USA, July 14-17, 2008, 2 Volumes. CSREA Press 2008, ISBN 1-60132-084-1

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment TrpS: Automated Mult-tered Data Placement n a Geo-dstrbuted Cloud Envronment Kwangsung Oh, Abhshek Chandra, and Jon Wessman Department of Computer Scence and Engneerng Unversty of Mnnesota Twn Ctes Mnneapols,

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources On the Farness-Effcency Tradeoff for Packet Processng wth Multple Resources We Wang, Chen Feng, Baochun L, and Ben Lang Department of Electrcal and Computer Engneerng, Unversty of Toronto {wewang, cfeng,

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

Performance Analysis of a Reconfigurable Shared Memory Multiprocessor System for Embedded Applications

Performance Analysis of a Reconfigurable Shared Memory Multiprocessor System for Embedded Applications J. ICT Res. Appl., Vol. 7, No. 1, 213, 15-35 15 Performance Analyss of a Reconfgurable Shared Memory Multprocessor System for Embedded Applcatons Darcy Cook 1 & Ken Ferens 2 1 JCA Electroncs, 118 Kng Edward

More information

QoS-aware composite scheduling using fuzzy proactive and reactive controllers

QoS-aware composite scheduling using fuzzy proactive and reactive controllers Khan et al. EURASIP Journal on Wreless Communcatons and Networkng 2014, 2014:138 http://jwcn.euraspjournals.com/content/2014/1/138 RESEARCH Open Access QoS-aware composte schedulng usng fuzzy proactve

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

A Holistic View of Stream Partitioning Costs

A Holistic View of Stream Partitioning Costs A Holstc Vew of Stream Parttonng Costs Nkos R. Katspoulaks, Alexandros Labrnds, Panos K. Chrysanths Unversty of Pttsburgh Pttsburgh, Pennsylvana, USA {katsp, labrnd, panos}@cs.ptt.edu ABSTRACT Stream processng

More information

Hybrid Job Scheduling Mechanism Using a Backfill-based Multi-queue Strategy in Distributed Grid Computing

Hybrid Job Scheduling Mechanism Using a Backfill-based Multi-queue Strategy in Distributed Grid Computing IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.12 No.9, September 2012 39 Hybrd Job Schedulng Mechansm Usng a Backfll-based Mult-queue Strategy n Dstrbuted Grd Computng Ken Park

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping Mxed-Crtcalty Schedulng on Multprocessors usng Task Groupng Jankang Ren Lnh Th Xuan Phan School of Software Technology, Dalan Unversty of Technology, Chna Computer and Informaton Scence Department, Unversty

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Dynamic Bandwidth Allocation Schemes in Hybrid TDM/WDM Passive Optical Networks

Dynamic Bandwidth Allocation Schemes in Hybrid TDM/WDM Passive Optical Networks Dynamc Bandwdth Allocaton Schemes n Hybrd TDM/WDM Passve Optcal Networks Ahmad R. Dhan, Chad M. Ass, and Abdallah Sham Concorda Insttue for Informaton Systems Engneerng Concorda Unversty, Montreal, Quebec,

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Array transposition in CUDA shared memory

Array transposition in CUDA shared memory Array transposton n CUDA shared memory Mke Gles February 19, 2014 Abstract Ths short note s nspred by some code wrtten by Jeremy Appleyard for the transposton of data through shared memory. I had some

More information

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks MobleGrd: Capacty-aware Topology Control n Moble Ad Hoc Networks Jle Lu, Baochun L Department of Electrcal and Computer Engneerng Unversty of Toronto {jenne,bl}@eecg.toronto.edu Abstract Snce wreless moble

More information

Burst Round Robin as a Proportional-Share Scheduling Algorithm

Burst Round Robin as a Proportional-Share Scheduling Algorithm Burst Round Robn as a Proportonal-Share Schedulng Algorthm Tarek Helmy * Abdelkader Dekdouk ** * College of Computer Scence & Engneerng, Kng Fahd Unversty of Petroleum and Mnerals, Dhahran 31261, Saud

More information

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems An Investgaton nto Server Parameter Selecton for Herarchcal Fxed Prorty Pre-emptve Systems R.I. Davs and A. Burns Real-Tme Systems Research Group, Department of omputer Scence, Unversty of York, YO10 5DD,

More information