Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies


Appears in the Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies
Po-An Tsai, Changping Chen, Daniel Sanchez
Massachusetts Institute of Technology
{poantsai, cchen, sanchez}@csail.mit.edu

Abstract: Conventional multicores rely on deep cache hierarchies to reduce data movement. Recent advances in die stacking have enabled near-data processing (NDP) systems that reduce data movement by placing cores close to memory. NDP cores enjoy cheaper memory accesses and are more area-constrained, so they use shallow cache hierarchies instead. Since neither shallow nor deep hierarchies work well for all applications, prior work has proposed systems that incorporate both. These asymmetric memory hierarchies can be highly beneficial, but they require scheduling computation to the right hierarchy.

We present AMS, an adaptive scheduler that automatically finds high-quality thread-to-hierarchy mappings. AMS monitors threads, accurately models their performance under different hierarchies and core types, and adapts algorithms first proposed for cache partitioning to produce high-quality schedules. AMS is cheap enough to use online, so it adapts to program phases, and performs within 1% of an exhaustive-search scheduler. As a result, AMS outperforms asymmetry-oblivious schedulers by up to 37% and by 18% on average.

Index Terms: Cache hierarchies, near-data processing, asymmetric systems, scheduling, analytical performance modeling.

I. INTRODUCTION

Data movement has become a key bottleneck for computer systems. For example, an off-chip main memory access costs 1000x more energy and takes 100x more time than a double-precision multiply-add [19]. Without a drastic reduction in data movement, memory accesses and communication will limit the scalability of future systems [31].

Conventional systems rely on deep multi-level cache hierarchies to reduce data movement. These hierarchies often take over half of chip area and are dominated by a multi-megabyte last-level cache (LLC). Deep hierarchies avoid costly memory accesses when they can accommodate the program's working set. But when the working set does not fit in any cache level, deep hierarchies add latency and energy for no benefit [67].

Recently, placing cores closer to main memory has become a feasible alternative to deep hierarchies. Advances in die-stacking technology [12] allow tightly integrating memory banks and cores or specialized processors, an approach known as near-data processing (NDP). NDP cores enjoy lower latency and energy to the memory stacked above them, but have limited area and power budgets [26, 62]. These factors naturally bias NDP systems not only towards efficient cores [22, 25], but also towards shallow hierarchies with few cache levels between cores and memories.

Fig. 1: A system with an asymmetric memory hierarchy (memory stacks with near-memory cores, connected to a conventional processor with a deep cache hierarchy).

Shallow hierarchies substantially outperform deep ones when the working set does not fit in a large on-chip LLC, but they work poorly for cache-friendly applications. Consequently, prior work [4, 25, 34, 71, 73] has proposed asymmetric memory hierarchies that combine deep and shallow hierarchies within a single system. For example, Google recently proposed to use such asymmetric systems for consumer workloads [14]. Fig. 1 shows an example system. This system includes a conventional processor die with a deep cache hierarchy, connected to several memory stacks, each with a small number of NDP cores and a shallow cache hierarchy in its logic layer. The processor die and memory stacks are connected using a silicon interposer.
Asymmetric hierarchies provide ample opportunity to improve the performance and efficiency of memory-intensive applications. We find that mapping threads to the correct hierarchy improves their performance per Joule by up to 2.8x and by 40% on average (Sec. IV). However, achieving this potential requires mapping threads to the right hierarchy dynamically. As shown in prior work, the same application can prefer different hierarchies depending on its input [4]. Moreover, colocated applications can compete for resources in either hierarchy, which affects their preferences. Thus, it is unreasonable to expect programmers or users to make this choice manually. Instead, the system should automatically schedule threads to the right hierarchy.

Nonetheless, this scheduling problem is quite challenging, as it has a large, non-convex decision space (i.e., which threads use the shallow hierarchy, which threads share the deep hierarchy). Much prior work has studied dynamic resource management and scheduling for systems with symmetric memory hierarchies [9, 37, 50, 69, 74]. And prior work on asymmetric systems [16, 69] focuses only on asymmetric cores, not memory hierarchies.

To address this problem, we introduce AMS, a novel thread scheduler for systems with asymmetric memory hierarchies (Sec. V). The key insight behind AMS is that the problem of modeling a thread's preferences for different hierarchies under contention bears a strong resemblance to the cache partitioning problem. Therefore, AMS leverages both working set profiling techniques and allocation algorithms from previous partitioning techniques, even though AMS does not partition any cache.

Specifically, we show that by sampling a thread's miss curve, i.e., the number of misses a thread would incur at different cache sizes, we can effectively model a thread's performance over different hierarchies under contention without trial and error. We then extend this model to handle other asymmetries (i.e., core types), proposing a novel analytical model that integrates both memory hierarchy and core asymmetries.

AMS uses the proposed model to remap threads periodically, improving performance and efficiency. We contribute two different mapping algorithms. First, AMS-Greedy is a simple scheduler that performs multiple rounds of cache partitioning and greedily maps threads to hierarchies. Second, AMS-DP leverages dynamic programming to explore the full space of configurations efficiently, finding the optimal schedule given the performance model. AMS-Greedy is cheap, scales well to large systems, and performs within 1% of AMS-DP. While AMS-DP is more expensive, it is still practical in small systems and serves as the upper bound of AMS-Greedy.

Evaluation results (Sec. VII) show that, on a 16-core system, AMS outperforms an asymmetry-oblivious baseline by up to 37% and by 18% on average. AMS adapts to program phases and handles core and cache contention in asymmetric hierarchies well, outperforming state-of-the-art schedulers. Specifically, AMS outperforms a scheduler that extends LLC-aware CRUISE [37] to NDP systems by up to 18% and by 7% on average; and AMS outperforms the PIE [69] heterogeneous-core-aware scheduler by up to 13% and by 6% on average.

II. BACKGROUND AND RELATED WORK

We now review related work in NDP systems and scheduling algorithms, the areas that AMS draws from.

A. PIM and NDP systems

Processing-in-memory (PIM) systems proposed to integrate processors and DRAM arrays in the same die. PIM systems were studied extensively in the 1990s. J-Machine [2], EXECUBE [44], and IRAM [45] proposed to integrate processors and main memory, while Active Pages [53], DIVA [28], and FlexRAM [39] instead proposed to enhance traditional processors using memory chips with coprocessors. Though compelling, PIM was unsuccessful due to the difficulties of integrating high-speed logic and DRAM [66].

With the success of 3D integration using through-silicon vias [12], the idea of processing-in-memory has been revisited recently in the context of die-stacked DRAM. Recent near-data processing (NDP) research has focused on two directions: (i) how to exploit the massive bandwidth of NDP systems within their limited area and power budgets, and (ii) how to integrate NDP systems with conventional systems.

Die-stacking technology offers lower latency, lower energy, and much higher bandwidth between the logic layer and the memory stack than conventional off-chip memories. However, it also imposes limited area and thermal budgets in the logic layer. This new tradeoff is attractive for data-intensive applications. However, without careful engineering, it is difficult to saturate the available bandwidth to fully utilize the potential of NDP systems. Thus, one important research question in recent NDP work is what form of computation to put in the logic layer to best balance programmability, performance/efficiency, and design constraints. On the one hand, several designs focus on general-purpose NDP systems that use simple cores [22, 25, 55], GPUs [72, 73], and reconfigurable logic [24, 26]. On the other hand, multiple projects design NDP systems tailored to important emerging workloads, such as graph analytics [3, 52], neural networks [27, 42], and sparse data structures [33, 35].
Although die-stacking technology has made NDP systems practical, not all applications can benefit from NDP. Therefore, another research direction has been how to support an asymmetric system composed of both NDP and conventional chips. For example, LazyPIM [15] studies how to provide coherence within asymmetric systems. PIM-enabled instructions [4], TOM [34], and Pattnaik et al. [54] focus on how to map computation across systems with asymmetric hierarchies. PIM-enabled instructions proposes new instructions and hardware support to decide when to offload specific instructions to in-memory, fixed-function accelerators to maximize locality. TOM proposes a combination of compiler, runtime, and hardware techniques to offload computation and place data to balance bandwidth in GPU-based asymmetric systems. Similarly, Pattnaik et al. propose to combine compiler techniques and a runtime affinity prediction model to schedule kernels for asymmetric systems.

Like this prior work, AMS focuses on how to schedule threads across an asymmetric system to maximize system-wide performance. Unlike this prior work, AMS aims to schedule threads with no program modifications and transparently to users, similar to how OS-level schedulers manage symmetric systems, as recent work on OSes for NDP systems advocates [7].

B. Cache, NUMA, and heterogeneity-aware thread schedulers

Scheduling applications under different constraints has been studied extensively in many contexts. The closest techniques to AMS are cache-contention-aware, NUMA-aware, and heterogeneous-core-aware schedulers.

Contention-aware schedulers [37, 49, 74] classify and colocate compatible applications under the same memory domain to avoid interference. For example, CRUISE [37] dynamically schedules single-thread applications in systems with multiple LLCs (e.g., multi-socket systems). CRUISE classifies applications into four categories according to their LLC behavior (insensitive, thrashing, fitting, and friendly). It then applies fixed scheduling policies to each class. As we will see, classification-based techniques do not work well in asymmetric systems, where application preferences (and thus classes) are affected by contention from other colocated applications. They also fail to handle same-class applications with different preference degrees (strong/weak).

NUMA-aware schedulers have different goals and constraints than asymmetry-aware schedulers. Since memory bandwidth is scarce in NUMA systems, prior work focuses on how to schedule threads across NUMA nodes to reduce bandwidth contention, similar to TOM for GPU-based asymmetric systems.

Tam et al. [64] profile which threads have frequent sharing and place them in the same socket. DINO [13] clusters single-thread processes to equalize memory intensity, places clusters in different sockets, and migrates pages along with their threads. AsymSched [46] studies NUMA systems with asymmetric interconnects, migrating threads and pages to use the best-connected nodes. These NUMA schemes focus on off-chip memory bandwidth utilization, while AMS focuses on the asymmetry between deep and shallow hierarchies.

Finally, scheduling techniques for systems with heterogeneous cores [16, 69] focus on making the best use of asymmetric core microarchitectures like big.LITTLE. Due to the area and power limits of memory stacks, asymmetric systems often employ heterogeneous cores [25, 55], where the processor die has not only a deeper hierarchy but more powerful cores than the NDP stacks. AMS focuses on asymmetric memory hierarchies, but its performance model can be easily extended to consider other asymmetries. Specifically, we extend it with PIE's model [69] to handle asymmetry in both core types and memory hierarchies (Sec. V-B).

III. BASELINE ASYMMETRIC SYSTEM

To make our discussion concrete, we first describe the asymmetric system we target in this work, shown in Fig. 1 and Fig. 2. The processor die is similar to current multicores: each core has its own private caches, and all cores share a multi-megabyte last-level cache (LLC). The processor die is connected to several memory stacks using high-speed SerDes links. Each stack has multiple DRAM dies and a logic layer with several memory controllers and NDP cores. These NDP cores have only private caches due to the area and power constraints of the logic layer [2]. This system uses an interposer, but AMS would also work with other configurations, e.g., using off-package stacks.

Fig. 2: Baseline system with an asymmetric memory hierarchy (DRAM dies and NDP cores with private caches in each stack, connected over SerDes links to the processor-die cores with private caches and a shared LLC).

A. Memory stacks with NDP cores

We assume a memory stack design similar to HMC 2.0's [36]. Memory is organized in several vertical slices called vaults. Each DRAM vault is managed by and accessed through a vault controller in the logic layer. Vault controllers are connected via an all-to-all crossbar, as Fig. 3 shows. In addition to vault controllers, we assume the logic layer also has multiple low-power, lean OOO cores, such as Silvermont [40] or Cortex A57 [6]. Those cores have the same ISA as the processor die, so they can run programs without help from the main processor. Like prior work in NDP systems using die-stacking techniques [25, 42, 55, 73], we conservatively assume the logic layer has a power and area budget of 10 W and 50 mm2 for components other than vault controllers and interconnect. This budget supports up to 4 NDP cores in the logic layer, connected to the system via the crossbar.

Fig. 3: Logic layer of each memory stack (vault controllers and NDP cores with private caches, connected by an all-to-all crossbar).

Why general-purpose cores? General-purpose cores make it easy for programmers to adapt their applications to this asymmetric memory hierarchy [25, 55]. Since both NDP cores and conventional cores use the same ISA, threads can migrate between hierarchies without recompilation or dynamic binary translation. This enables a smooth transition from traditional systems to asymmetric systems.

B. Coherence in NDP private caches

Deep and shallow hierarchies share the same physical address space, so their caches must be kept coherent to ensure correctness.
However, using conventional directory-based coherence would either require NDP cores to check a remote directory even when performing local memory accesses, or require processor-die cores to check a memory-side directory on the memory stacks, adding area and traffic overheads that would limit the benefits of NDP [15]. To avoid these overheads, we perform software-assisted coherence similar to prior work [25]. Each virtual memory page is classified as either thread-private, shared read-only, or shared read-write. NDP cores can cache data from thread-private and shared read-only pages without violating coherence. For simplicity, shared read-write pages are considered uncacheable by NDP cores, which access them through the LLC to preserve coherence with processor-die caches.

This classification technique has also been used to reduce coherence traffic [18] and to improve data placement in NUCA caches [8, 30]. We use the same dynamic classification mechanism as this prior work: Pages start private to the thread that allocates them. Upon a read from any other thread, the page is reclassified as shared read-only, and upon a write from any other thread, the page is reclassified as shared read-write. Reclassifications are done through TLB shootdowns [8, 30], which flush the page from private caches. Finally, when a thread moves from the processor die to an NDP core, its dirty LLC lines are flushed.
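To make the classification rules concrete, the following is a minimal sketch of the per-page state machine just described. The names (PageState, Page.access) and the Python form are ours for illustration; a real implementation lives in the OS page-fault and TLB-shootdown paths, not in application code.

```python
from enum import Enum

class PageState(Enum):
    PRIVATE = 0      # cacheable by NDP cores
    SHARED_RO = 1    # cacheable by NDP cores
    SHARED_RW = 2    # NDP cores must access it through the LLC

class Page:
    def __init__(self, owner_tid):
        self.state = PageState.PRIVATE
        self.owner = owner_tid

    def access(self, tid, is_write):
        """Reclassify on accesses from threads other than the owner.
        Each downgrade would trigger a TLB shootdown that flushes the
        page from private caches; here we only track the state."""
        if tid == self.owner:
            return
        if is_write and self.state != PageState.SHARED_RW:
            self.state = PageState.SHARED_RW
        elif not is_write and self.state == PageState.PRIVATE:
            self.state = PageState.SHARED_RO

page = Page(owner_tid=1)
page.access(tid=2, is_write=False)   # PRIVATE -> SHARED_RO
page.access(tid=3, is_write=True)    # SHARED_RO -> SHARED_RW
print(page.state)                    # PageState.SHARED_RW
```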

IV. MOTIVATION

Although technology advances have enabled systems with asymmetric memory hierarchies, what is their potential benefit? Moreover, how critical is it to schedule threads to the right hierarchy? In this section, we answer these questions by characterizing the benefits of an asymmetric system for memory-intensive applications. We also show that the ideal scheduler should (i) identify the right hierarchy for each thread, (ii) adapt to execution phases, and (iii) consider resource contention among threads.

Fig. 4: Latency and energy of deep-hierarchy LLC hits, shallow-hierarchy memory accesses, and deep-hierarchy memory accesses (load latency in ns and energy per 64 B in nJ, broken down into private caches, shared cache, on-chip NoC, SerDes link, logic layer, and DRAM).

A. Asymmetry in access latency and energy

One of the key differences between deep and shallow hierarchies is the multi-megabyte LLC in the processor die. The performance offered by deep and shallow hierarchies largely depends on how frequently accesses hit in the LLC when using the deep hierarchy. Fig. 4 shows the latency and energy breakdowns of a memory reference in three situations with increasing costs: an LLC hit in the deep hierarchy, a stacked memory access in the shallow hierarchy, and an LLC miss (and off-chip stacked memory access) in the deep hierarchy (Sec. VII-A details the methodology for these costs).

Fig. 4 shows that an LLC miss from the deep hierarchy is the worst-case scenario: the system incurs the latency of an LLC lookup for no benefit, then it must traverse the on-chip network and off-chip SerDes link, reach the DRAM vault memory controller, wait for DRAM to serve the data, and finally wait for the response to make its way back. By contrast, a memory access from the shallow hierarchy (i.e., an NDP core) is 40% faster, because it is not subject to the LLC lookup, on-chip network, or SerDes link latencies. Nevertheless, stacked DRAM is significantly slower than on-chip SRAM, so an LLC hit in the deep hierarchy is 65% faster than a DRAM access from the shallow hierarchy. Energy breakdowns follow similar trends as latency breakdowns.

These costs show that shallow hierarchies complement deep ones, but do not uniformly outperform them. If a thread's working set does not fit in the LLC but fits in a local memory stack, a shallow hierarchy works best. But if the LLC can satisfy a substantial number of accesses, a deep hierarchy will be more attractive.

B. Effect of asymmetry on application preferences

We now simulate several memory-intensive applications to see how they can exploit memory asymmetry. We model a deep hierarchy with 32 KB private L1s, 256 KB L2s, and a shared 16 MB LLC in the conventional processor. The shallow hierarchy only has private L1 and L2 caches (see Sec. VII-A for methodology details). Both hierarchies use 2-wide OOO cores. We later evaluate heterogeneous cores and multithreaded applications; our goal here is to study memory asymmetry independently. These cores with their private caches consume less than 2.5 W and 10 mm2 per core, which is practical to fabricate in the logic layer of 3D-stacked DRAM [2].

We simulate the 18 memory-intensive SPEC CPU2006 benchmarks that have >5 L2 MPKI and 8 benchmarks from the Problem-Based Benchmark Suite [61], which contains memory-intensive graph algorithms. Since the choice of hierarchy affects both performance and efficiency, we use performance per Joule (Perf/J), i.e., the inverse of energy-delay product, to characterize the differences across hierarchies.

Applications have strong hierarchy preferences. Fig. 5 shows the Perf/J of representative applications when running on the shallow hierarchy, relative to the Perf/J when running on the deep hierarchy. Some applications strongly prefer the deep hierarchy. For example, xalancbmk has a working set of about 6 MB, so it benefits significantly from the 16 MB LLC in the deep hierarchy.
xalancbmk's Perf/J on the shallow hierarchy is almost 2x (-50%) worse than on the deep hierarchy. By contrast, soplex has a much larger working set that cannot fit in the 16 MB LLC. It thus always prefers the shallow hierarchy, which provides 2x higher Perf/J than the deep hierarchy.

Across all applications, always using the shallow hierarchy improves gmean Perf/J over the deep hierarchy by 15%. However, always using the hierarchy that offers the best average Perf/J for each application improves gmean Perf/J by 31% (Fig. 5, right), doubling the improvement achieved by always using the shallow hierarchy. This result shows that it is important to schedule applications to the right hierarchy.

Fig. 5: Performance per Joule (Perf/J) relative to the deep hierarchy for each application, plus the gmean Perf/J improvement of the shallow, best static, and best dynamic choices. Higher is better.

Dynamic scheduling unlocks the full potential of asymmetric hierarchies. Some applications have multiple phases, each with different memory behaviors and working sets. For example, as shown in Fig. 6, GemsFDTD prefers the shallow hierarchy before it reaches 53 billion instructions, and prefers the deep hierarchy afterward. Therefore, running GemsFDTD on either hierarchy statically does not yield major benefits. To show the impact of these dynamic preferences, we implement a dynamic scheduler that always runs the application on the best hierarchy for each 5 ms phase. This substantially improves applications like GemsFDTD and refine. Of the 26 applications, 12 (46%) prefer different hierarchies over different phases. Overall, our dynamic scheduler improves gmean Perf/J by 40%, more than the 31% achieved by static decisions.

Fig. 6: Perf/J traces of GemsFDTD on each hierarchy, relative to the average Perf/J of the deep hierarchy.
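The best-static and best-dynamic oracles above reduce to a simple computation over per-phase Perf/J traces, sketched below with made-up numbers: best-static picks one hierarchy for the whole run, while best-dynamic picks the better hierarchy in every phase.

```python
from math import prod

def gmean(xs):
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical per-phase Perf/J of one app, normalized to the deep
# hierarchy's average (so deep is 1.0 in every phase).
deep = [1.0, 1.0, 1.0, 1.0]
shallow = [1.5, 1.4, 0.6, 0.5]       # prefers shallow early, deep late

best_static = max(gmean(deep), gmean(shallow))
best_dynamic = gmean([max(d, s) for d, s in zip(deep, shallow)])
print(best_static, best_dynamic)     # dynamic beats either static choice
```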

Fig. 7: Performance per Joule of deep hierarchies with different LLC sizes, relative to the shallow hierarchy.

Applications' preferences are sensitive to contention. The above results consider a single application, but in real-world workloads, multiple applications are colocated in a single system and compete for shared resources, such as LLC capacity. To study this effect, we sweep the LLC size of the deep hierarchy to mimic capacity contention among applications. Fig. 7 shows the Perf/J improvement of deep hierarchies with different LLC capacities over the shallow hierarchy. We select 7 representative benchmarks. The first application (bzip2) always benefits from deep hierarchies due to its cache-friendly working sets. The next application (soplex) instead always prefers a shallow hierarchy due to its streaming behavior. By contrast, the other applications have very different preferences across LLC capacities. For example, omnetpp benefits from LLCs of at least 4 MB, while sphinx3 benefits from LLCs of at least 8 MB, and mcf only benefits from a 16 MB LLC. And even when applications prefer the deep hierarchy, their degree of preference also changes significantly with available capacity (e.g., 8 MB vs. 16 MB for astar).

This result shows that when applications are colocated, resource contention can dramatically change their preferences. It also shows why prior classification-based schedulers can cause pathologies with asymmetric hierarchies. For example, CRUISE can first classify and schedule mcf to the deep hierarchy, then later schedule others that cause capacity contention and make mcf strongly prefer the shallow hierarchy.

In summary, these results show that applications have strong preferences for the type of hierarchy, and that these preferences change over time and with available resources. These observations guide AMS's design.

V. AMS: ADAPTIVE SCHEDULING FOR ASYMMETRIC MEMORY SYSTEMS

AMS realizes the potential of asymmetric memory hierarchies by accounting for contention and dynamic behavior when mapping threads to cores. The key insight behind AMS is that the problem of modeling a thread's preferences for different memory hierarchies on-the-fly and under contention bears a strong resemblance to the dynamic cache partitioning problem. Therefore, unlike other schedulers, AMS leverages both working set profiling techniques and allocation algorithms that were originally proposed for cache partitioning, even though AMS does not perform cache partitioning.

Fig. 8 shows an overview of AMS: hardware utility monitors sample accesses and produce miss curves; the first phase (Sec. V-A) uses them to estimate performance under different hierarchies; the second phase finds a thread placement with AMS-Greedy (Sec. V-C) or AMS-DP (Sec. V-E) and schedules threads. AMS has both hardware and software components.
In the second phase, AMS software uses these estmates to fnd a thread placement that acheves hgh system-wde performance. We present two thread placement algorthms: AMS-Greedy performs multple rounds of cache parttonng and uses ts outcomes to progressvely and greedly map threads to herarches (Sec. V-C), whle AMS-DP leverages dynamc programmng to explore the full space of confguratons effcently, fndng the optmal schedule gven the predcted preferences (Sec. V-E). Though AMS-DP s more expensve than AMS-Greedy, t s practcal to use n small systems and serves as AMS-Greedy s upper bound. To smplfy the explanaton, we frst focus on systems wth homogeneous cores runnng sngle-thread applcatons. Sec. V-B extends AMS to heterogeneous cores, Sec. V-D extends AMS to multthreaded workloads, and Sec. V-F dscusses other scenaros, such as oversubscrbed systems. A. Estmatng performance under asymmetrc herarches To model thread preferences, t s crucal to understand the utlty of the processor de s LLC for each thread. To ths end, AMS leverages UMONs [56] to produce mss curves. Each UMON s a set-assocatve tag array wth per-way ht counters. UMONs leverage LRU s stack property to profle dfferent cache szes smultaneously. AMS adds a 4 KB UMON to each core. Each UMON samples prvate cache msses and produces a mss curve that covers the range of possble capactes avalable to the thread (from no capacty to the full LLC). We choose UMONs for ther low overhead and hgh accuracy, but AMS could use other mss curve proflng technques [9, 23, 65]. AMS models thread performance usng total memory access latency, a cost functon derved wth mss curves. AMS can also optmze other cost functons, such as core cycles, as we wll show n Sec. V-B. AMS uses mss curves to derve cost 5

AMS uses miss curves to derive cost functions for all relevant scenarios. Since NDP and processor-die cores have the same private caches, we focus on memory references after the private cache levels. If a thread runs on a processor-die core, its latency depends on how much LLC capacity is available. Specifically, the total latency in cycles as a function of LLC capacity s, which we call the latency curve, is:

$$L_{proc}(s) = A \cdot Lat_{LLC} + M(s) \cdot Lat_{mem,proc}$$

where A is the number of accesses that miss in the private cache levels (i.e., the number of LLC accesses in this case), $Lat_{LLC}$ is the average latency of a single LLC access, M(s) is the number of LLC misses given capacity s, and $Lat_{mem,proc}$ is the average latency of a single access to off-chip main memory. M(s) is the miss curve, and A, the number of LLC accesses, is simply A = M(0) (with no LLC capacity, all LLC accesses miss).

Note that this formula covers the total number of cycles spent in memory references in a given interval, not the average latency. This is because it is important to account for the rate at which accesses happen, not only their unit cost. For example, a thread that has infrequent misses from its private caches will have low values for A and M(s), and thus will incur a low penalty from different thread placements, even if most of the few accesses it performs miss in the LLC.

If the thread runs on an NDP core, all A private cache misses go to memory. Thus, the thread's latency curve is simply $L_{NDP} = A \cdot Lat_{mem,NDP}$. Because NDP cores do not access a shared LLC, this curve does not change with s. However, because the system has multiple memory stacks, the average latency per memory access, $Lat_{mem,NDP}$, depends on the core's stack as well as the placement of the application's data. We use a simple algorithm that makes most NDP memory accesses local by biasing data placement to particular stacks. We describe this algorithm in Sec. VI. AMS simply computes $Lat_{mem,NDP}$ as the weighted average of the number of application pages on each stack, times the latency to access that stack.

Fig. 9: Example latency curves for processor-die and NDP cores (the miss curve from the UMON, the processor-die latency curve vs. LLC capacity, and flat latency curves for an NDP core in the same stack as the data and one in a different stack).

Fig. 9 shows three example latency curves for a particular thread: the curve for processor-die cores and two curves for two NDP cores on different stacks. These latency curves encode a thread's preferences under different scenarios. For example, if the LLC is very contended and leaves no capacity, this thread prefers NDP cores. But with 2 MB of LLC capacity, only the NDP core closest to its data is better (i.e., has a lower latency). Finally, if the thread can use over 4 MB of LLC capacity, it prefers to run on a processor-die core with the deep hierarchy. Moreover, the latency difference between $L_{proc}(s)$ and $L_{NDP}$ at each point also indicates how strong the preference is.

We find that this model matches Fig. 7's results. Therefore, this model lets AMS predict how applications perform under different decisions without directly profiling or sampling their performance under various colocation combinations.
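In code, these cost functions are direct translations of the formulas above; the latency constants below are placeholders, not the paper's measured values.

```python
LAT_LLC = 30            # one LLC access (cycles), assumed
LAT_MEM_PROC = 250      # off-chip access from the processor die, assumed

def latency_curve_proc(miss_curve):
    """L_proc(s) = A*Lat_LLC + M(s)*Lat_mem,proc, with A = M(0)."""
    a = miss_curve[0]
    return [a * LAT_LLC + m * LAT_MEM_PROC for m in miss_curve]

def latency_ndp(miss_curve, pages_per_stack, lat_to_stack):
    """L_NDP = A*Lat_mem,NDP, where Lat_mem,NDP is the page-weighted
    average latency from this NDP core to each stack holding data."""
    a = miss_curve[0]
    total = sum(pages_per_stack)
    lat = sum(p * l for p, l in zip(pages_per_stack, lat_to_stack)) / total
    return a * lat

m = [1000, 800, 500, 200, 150]     # misses vs. LLC capacity steps
print(latency_curve_proc(m))
print(latency_ndp(m, pages_per_stack=[900, 100],
                  lat_to_stack=[150, 210]))   # mostly-local data
```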
B. Handling heterogeneous cores

Although AMS focuses on asymmetric hierarchies, we must also consider core asymmetry, as NDP cores are typically simpler than processor-die cores. Fortunately, it is easy to extend AMS to handle heterogeneous cores. We combine AMS's model with the cycles-per-instruction (CPI) estimation techniques from PIE [69], which targets heterogeneous cores but assumes a symmetric memory hierarchy.

To map threads across heterogeneous cores, PIE estimates each thread's CPI on different core types. Its model consists of a memory component, estimated with the core's memory-level parallelism (MLP), and a non-memory component, estimated with the core's instruction-level parallelism (ILP).

AMS with PIE models the total cycles spent across core types and LLC sizes. It thus works on core cycle curves instead of memory latency curves. Fig. 10 shows this transformation. The memory component of each curve comes from AMS's latency curve weighted using PIE's estimated MLP, and the non-memory component uses PIE's estimated ILP. This requires collecting non-memory stall cycles using standard hardware counters (as in PIE).

Fig. 10: To handle heterogeneous cores, AMS transforms the latency curves into CPI curves using PIE's performance model (latency curves are weighted by MLP into memory stall curves, then a non-memory component weighted by ILP is added to produce core cycle curves).

Core cycle curves unify asymmetries in both cores and memory hierarchies. They can be transformed into other cost curves as needed (e.g., using time instead of cycles to model cores running at different frequencies).
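A sketch of this transformation under our own assumptions (the exact PIE estimators are in [69]; the MLP/ILP numbers below are illustrative): the latency curve is scaled down by the core's estimated MLP to get memory stall cycles, and a non-memory cycle component derived from the ILP estimate is added.

```python
def core_cycle_curve(latency_curve, mlp, nonmem_cycles):
    """latency_curve: total memory latency (cycles) vs. LLC capacity.
    mlp: PIE-style estimate of memory-level parallelism on this core.
    nonmem_cycles: non-memory cycles from PIE's ILP-based estimate."""
    return [lat / mlp + nonmem_cycles for lat in latency_curve]

# The same thread on a 4-wide processor-die core (higher MLP/ILP)
# vs. a lean 2-wide NDP core. All numbers are made up.
lat_proc = [900_000, 700_000, 400_000, 300_000]   # vs. LLC share
lat_ndp = [600_000] * 4                           # capacity-independent
cycles_proc = core_cycle_curve(lat_proc, mlp=4.0, nonmem_cycles=250_000)
cycles_ndp = core_cycle_curve(lat_ndp, mlp=2.0, nonmem_cycles=500_000)
print(cycles_proc, cycles_ndp)
```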

C. AMS-Greedy: Mapping threads via cache partitioning

Given the cost function (total latency or core cycle) curves of all threads in the system, we can evaluate a schedule by calculating the total cost it incurs. Finding the best mapping in asymmetric hierarchies can be modeled as minimizing the total cost over all possible thread mappings. We present two optimizers for this problem, one based on greedy optimization, and another based on dynamic programming (Sec. V-E).

AMS-Greedy works by performing multiple rounds of cache partitioning. On each round, the algorithm identifies the threads that benefit the least from the deep hierarchy and schedules them away from the processor die. Fig. 11 illustrates AMS-Greedy's algorithm with a 4-thread example.

Fig. 11: An example of how AMS-Greedy schedules 4 threads with different latency curves. (a) Find threads that always want the NDP hierarchy. (b) Partition the LLC to find threads with minimal opportunity cost to use the NDP hierarchy. (c) Schedule threads in the NDP core group ordered by maximum opportunity cost.

AMS-Greedy begins with all threads mapped to the deep hierarchy (the processor die). We denote the cost curves for thread i as $C_i^{proc}(s)$ for the processor die and $C_i^{bestNDP}$ for the best NDP stack. First, AMS-Greedy finds threads that always prefer the shallow hierarchy. These are the threads for which $C_i^{proc}(s) > C_i^{bestNDP}$ across all possible LLC capacities s (e.g., thread 4 in Fig. 11a). AMS-Greedy moves these threads off the processor die.

The remaining threads can benefit from the LLC if they have enough capacity available. But there may not be enough LLC capacity or enough processor-die cores to satisfy all threads. Therefore, AMS-Greedy progressively moves threads to NDP cores, stopping either when the remaining threads have sufficient LLC capacity and cores or when NDP cores fill up. AMS-Greedy uses cache partitioning for this goal. Specifically, it partitions the LLC using the Peekahead algorithm [8] (a linear-time implementation of the quadratic-time UCP Lookahead [56]). AMS-Greedy uses the processor-die cost curves ($C_i^{proc}(s)$ for thread i) to drive the partitioning. This way, the partitioning algorithm finds a set of partition sizes $s_i$ that seeks to minimize total cost ($\sum_i C_i^{proc}(s_i)$). For example, in Fig. 11b, threads 1, 2, and 3 receive partition sizes of 3, 1, and 4 MB.

Intuitively, partitioning naturally finds threads that should give up the processor die. For example, if a thread has no capacity after partitioning the LLC, that means fitting its working set is too costly compared to other options. We should thus move it to an NDP core and let others share the LLC. Thus, AMS-Greedy ranks threads by their opportunity cost, the extra cost they incur when moving to the best NDP core:

$$\text{OpportunityCost}_i = C_i^{bestNDP} - C_i^{proc}(s_i)$$

AMS-Greedy moves all threads with a negative opportunity cost to NDP cores as long as NDP cores are not oversubscribed. These threads have lower cost on NDP cores than with $s_i$ LLC capacity (e.g., thread 2 in Fig. 11b). If there is no such thread but the processor die is still oversubscribed, AMS-Greedy moves the thread with the smallest opportunity cost.

If after a round of partitioning and moving threads there are still more threads than the number of processor-die cores, AMS-Greedy performs another round of partitioning and movement among the remaining threads. This process repeats until the processor die is not oversubscribed.

Finally, AMS-Greedy tries to map the threads on NDP cores to their most favorable stack. Threads are again prioritized by opportunity cost: threads with the largest difference between their latencies in the best and next-best NDP stacks are placed first. For example, in Fig. 11c, thread 4 has a larger opportunity cost than thread 2 and is mapped first.

AMS-Greedy works well because it shares the same goal as partitioning: identifying the threads that benefit the least (or not at all) from the LLC. AMS-Greedy leverages these algorithms to minimize total cost at each round. Greedily moving threads out may not yield the optimal solution because the problem is not convex. Nonetheless, we find AMS-Greedy generates high-quality results because opportunity cost captures the degree of preference accurately.
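The sketch below captures AMS-Greedy's control flow end to end. It is runnable but simplified: a marginal-utility greedy allocator stands in for the Peekahead/Lookahead partitioners [8, 56], the final per-stack placement step is omitted, and the cost curves are toy numbers.

```python
def partition_llc(curves, segments):
    """Stand-in partitioner: repeatedly give one LLC segment to the
    thread whose cost drops the most (cf. UCP Lookahead [56])."""
    alloc = {t: 0 for t in curves}
    for _ in range(segments):
        gain = {t: c[alloc[t]] - c[alloc[t] + 1]
                for t, c in curves.items() if alloc[t] < segments}
        alloc[max(gain, key=gain.get)] += 1
    return alloc

def ams_greedy(c_proc, c_ndp, n_proc, n_ndp, segments):
    """c_proc[t][s]: cost of t on the processor die with s LLC segments
    (s = 0..segments); c_ndp[t]: cost of t on its best NDP stack."""
    proc, ndp = set(c_proc), set()

    def move(t):
        proc.discard(t)
        ndp.add(t)

    # Round 0: threads that prefer NDP at every LLC size leave the die.
    for t in list(proc):
        if len(ndp) < n_ndp and all(c > c_ndp[t] for c in c_proc[t]):
            move(t)

    # Rounds of partitioning: evict threads by opportunity cost until
    # the processor die is not oversubscribed or NDP cores fill up.
    while proc and len(ndp) < n_ndp:
        alloc = partition_llc({t: c_proc[t] for t in proc}, segments)
        opp = {t: c_ndp[t] - c_proc[t][alloc[t]] for t in proc}
        movers = [t for t in proc if opp[t] < 0]   # better off on NDP
        if not movers:
            if len(proc) <= n_proc:
                break                              # everyone left fits
            movers = [min(proc, key=opp.get)]      # least to lose
        for t in sorted(movers, key=opp.get):
            if len(ndp) < n_ndp:
                move(t)
    return proc, ndp

# Toy example: 3 threads, 4 LLC segments, 2 cores per group.
c_proc = {"A": [900, 500, 300, 250, 240],   # cache-friendly
          "B": [800, 780, 760, 740, 720],   # streaming
          "C": [600, 580, 400, 380, 370]}   # cache-fitting
c_ndp = {"A": 700, "B": 480, "C": 450}
print(ams_greedy(c_proc, c_ndp, n_proc=2, n_ndp=2, segments=4))
# A stays on the processor die; B and C move to NDP cores.
```

A final step, omitted here, would order the NDP group by the cost gap between each thread's best and next-best stacks and place the most sensitive threads first.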
AMS-Greedy also scales well: its runtime complexity is $O(N^2 S)$, where N is the number of threads and S is the number of LLC segments (O(NS) per round of cache partitioning [8], and up to N rounds).

Prior work has also leveraged partitioning algorithms for other purposes, such as tailoring the cache hierarchy to each application [67] and performing dynamic data replication in NUCA caches [68]. AMS-Greedy shares similar insights in using miss curves and partitioning, but it focuses on scheduling in asymmetric systems and does not partition the cache.

D. Handling multithreaded workloads

We have so far considered only single-threaded processes. AMS can handle multithreaded processes with extra UMONs and simple modifications to AMS-Greedy.

Multithreaded workloads share data within the process, so per-core UMONs may overestimate the size of the working set. To solve this, AMS adds an extra UMON per core to profile shared data. To distinguish private vs. shared data, we leverage the per-page data classification scheme used for coherence (Sec. III-B). Cache misses to private data are sampled to the per-core UMON, and misses to shared data are sampled to a UMON shared by all threads in the process (although this UMON is not local to the core, this imposes negligible traffic because only 1% of the misses are sampled).

Using the number of private cache misses to thread-private data and shared data, AMS first classifies processes as thread-private-intensive or shared-intensive. AMS then treats thread-private-intensive processes as multiple independent threads. This is sensible because these processes have little data sharing, so they behave similarly to multiple single-threaded processes.

By contrast, AMS-Greedy groups all the threads of each shared-intensive process into a single unit when making decisions. The algorithm considers the miss curve for shared data only, and performs placement decisions for all its threads at once (considering the opportunity cost of all threads). This ensures that threads that share data intensively stay together.

NDP cores access shared read-write data pages through the LLC for coherence. This makes the processor die preferable under our model for workloads dominated by shared read-write data. However, many multithreaded applications are well-structured: threads write mostly disjoint data and mainly use thread-private or shared read-only pages. These applications often prefer NDP cores.

E. AMS-DP: Mapping threads via dynamic programming

Dynamic programming (DP) [1, 17] is an optimization technique that solves a problem recursively, by dividing it into smaller subproblems. Each subproblem is solved only once, and its result is memoized and reused whenever the subproblem is encountered again. Memoization lets DP explore the full space of possible choices efficiently. Because DP considers all possible choices, it finds the globally optimal solution. By contrast, greedy algorithms take locally optimal decisions but may end up with a globally suboptimal one. However, not all problems are amenable to DP: the problem must have the property that an optimal solution can be computed efficiently given optimal solutions of its subproblems. Often, the difficulty lies in casting the problem in a way that meets this property.

Our second AMS variant, AMS-DP, leverages dynamic programming to find the optimal solution. We again exploit the similarities between scheduling and cache partitioning by building on Sasinowski et al. [59], who show that DP can solve cache partitioning optimally in polynomial time.

Cache partitioning can be solved with DP because it has discrete decisions, at the size of cache segments (e.g., cache ways or lines). This property allows dividing the partitioning problem into subproblems. For example, partitioning a 4 MB cache among eight threads can be divided into partitioning two caches (e.g., of 2 MB each, or of 1 MB + 3 MB) among two groups of four threads. The smallest subproblem is just allocating some amount of capacity to a single thread.

Similarly, scheduling threads to cores also has discrete decisions. One thread can occupy only one core and leave the rest to other threads. This property allows dividing a scheduling problem into subproblems that schedule smaller groups of threads across smaller systems. The smallest subproblem is scheduling a thread to a single core, given $C_i^{NDP}$, $C_i^{proc}(s)$, and some amount s of remaining LLC capacity.

Our insight is that since these two problems have discrete decisions, we can combine them and solve a bigger DP problem that partitions the cache and schedules threads at the same time, which closely matches scheduling in asymmetric systems as we discussed. Thus, solving this DP problem leads to the optimal partitioning and scheduling. For the rest of the section, we use the same terminology as Sasinowski et al. [59]. See [1] for more details on DP.

The key recurrence relation that lets Sasinowski et al. use DP is as follows. If $M_{i,j}$ is the minimum cost achieved by partitioning j segments among the first i threads, and $C_i(s_i)$ is the cost of the i-th thread when allocated $s_i$ segments, then:

$$M_{i,j} = \min_{s_i} \{ M_{i-1,\, j-s_i} + C_i(s_i) \}$$

This recurrence shows that the minimum cost $M_{i,j}$ is the minimum over all possible combinations of subproblems: the cost of thread i with $s_i$ cache segments plus the minimum cost of using $j-s_i$ cache segments for the first $i-1$ threads. By solving each $M_{i,j}$ bottom-up, we reach the optimal partitioning at $M_{N,S}$, where N is the number of threads and S is the number of cache segments in the system.
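This base recurrence is easy to run directly; the sketch below (our code, with illustrative curves) solves it bottom-up and walks the choice table back to recover each thread's allocation.

```python
def dp_partition(curves, S):
    """curves[i][s]: cost of thread i with s segments (s = 0..S).
    Returns (minimum total cost, per-thread allocation)."""
    N = len(curves)
    INF = float("inf")
    M = [[INF] * (S + 1) for _ in range(N + 1)]
    choice = [[0] * (S + 1) for _ in range(N + 1)]
    M[0] = [0] * (S + 1)            # no threads: zero cost at any budget
    for i in range(1, N + 1):
        for j in range(S + 1):
            for s in range(j + 1):  # segments given to thread i
                c = M[i - 1][j - s] + curves[i - 1][s]
                if c < M[i][j]:
                    M[i][j], choice[i][j] = c, s
    # Walk back through the choice table to recover allocations.
    alloc, j = [0] * N, S
    for i in range(N, 0, -1):
        alloc[i - 1] = choice[i][j]
        j -= alloc[i - 1]
    return M[N][S], alloc

curves = [[90, 50, 30, 25, 24],     # thread 0: cache-friendly
          [80, 78, 76, 74, 72],     # thread 1: streaming
          [60, 58, 40, 38, 37]]     # thread 2: cache-fitting
print(dp_partition(curves, S=4))    # (150, [2, 0, 2])
```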
By solvng each M, j bottom-up, we reach the optmal parttonng for M N,S, where N s the number of threads and S s the number of cache segments n the system. In our case, we want to not only partton the cache (conceptually, to prevent cache contenton) but also to schedule threads. Therefore, we extend the recurrence by addng dmensons for processor-de cores and NDP cores. Cores are just another type of dscrete resource to allocate. However, dfferent cores, even NDP cores n dfferent stacks, should be treated dfferently. Suppose the system has one processor de and one NDP stack. We defne M, j,kproc,k nd p as the mnmum cost when parttonng j cache segments to the frst threads and schedulng them wth exactly k proc processor-de cores and k nd p NDP cores. The recurrence above becomes: M, j,kproc,k nd p = mn{mn{m 1, j s,k s proc 1,k nd p +C proc (s )}, M 1, j,kproc,k nd p 1+C NDP } Ths recurrence states that, f we were to schedule the th thread on a processor-de core, we can allocate some LLC capacty s to t and leave the remanng capacty to the frst 1 threads. Ths decson makes total cost to be the cost C proc (s ) for thread plus the mnmum cost of schedulng the frst 1 threads wth j s capacty. If the thread s nstead scheduled on an NDP core, t takes no LLC capacty, and ncurs cost C NDP for the thread plus the mnmum cost to schedule the frst 1 threads wth 1 fewer NDP core avalable. Fnally, the mnmum cost to schedule threads s smply the mnmum of those two schedulng choces. Ths recurrence consders a sngle NDP stack, but addng more stacks as extra dmensons s straghtforward (e.g., M, j,kproc,k nd p,1,k nd p,2 wth two stacks). Ths lets each thread use ts dfferent costs to each stack. Usng ths recurrence, AMS-DP performs standard bottomup DP to fnd the optmal thread-to-core mappng. Whle conceptually smple, AMS-DP scales poorly: every group k (.e., processor-de or NDP stack) adds a new dmenson to the DP algorthm. Ths causes O(N S k proc t k nd p,t ) runnng tme, where N s the number of threads and S s the number of cache segments. Thus, AMS-DP s practcal only n small systems. On larger systems, AMS-DP serves as the upper bound, but smpler technques lke AMS-Greedy are needed. F. Dscusson Our evaluaton focuses on long-runnng, memory-ntensve batch workloads, but AMS should work n other scenaros wth mnor changes. Frst, n oversubscrbed systems wth more 8

F. Discussion

Our evaluation focuses on long-running, memory-intensive batch workloads, but AMS should work in other scenarios with minor changes. First, in oversubscribed systems with more runnable threads than cores, AMS only needs to consider the active threads in each quantum. A thread's miss curve can be saved when it is descheduled so that the thread can be mapped to the right core when it is rescheduled later. Second, kernel threads and short-lived threads or processes can evict any long-running thread in the system. Since they run for a fraction of the scheduling quantum, their impact is minimal. Finally, to handle latency-critical workloads with real-time needs, AMS can be combined with techniques that partition the cache to maintain SLOs instead of maximizing throughput, such as Heracles [48] or Ubik [41].

VI. DATA PLACEMENT FOR ASYMMETRIC HIERARCHIES

NDP cores are most effective when they access their local memory stack. This requires adopting a data placement scheme that minimizes remote accesses. Data placement is a widely studied topic in non-uniform memory access (NUMA) systems. Prior work [13, 21, 70] has proposed various data migration and replication techniques to reduce remote accesses. Other NUMA work [10, 46] focuses on balancing available bandwidth among applications.

The key difference between NUMA and NDP systems is bandwidth. Because memory bandwidth is scarce, NUMA systems are limited by bandwidth to local memory, and prior work finds that it is important to spread pages evenly across NUMA nodes to reduce bandwidth contention [21, 46]. By contrast, NDP systems suffer a different problem: the NDP cores in each stack enjoy plentiful bandwidth to the memory stacked directly atop them [26], but the bandwidth across stacks is very limited [34]. In this case, it is more important to reduce inter-stack traffic than intra-stack traffic, so the key constraint is ensuring that NDP cores have local accesses.

Since relocating pages is expensive, our data placement algorithm avoids migrating pages and uses simple heuristics to keep data local. Its goal is to keep pages from the same thread in as few stacks as possible, so that NDP cores have mostly local accesses. When a thread starts, the system builds up a dynamic preference list of memory stacks in the order from which we fulfill memory allocations. This preference list is refreshed when a memory stack is depleted.

When a new thread starts, AMS picks the memory stack with the greatest remaining capacity as the most preferred source. This ensures that threads that can benefit from the shallow hierarchy are able to leverage it and those that prefer the deep hierarchy are not penalized. Next on the list are the nearby stacks. In Fig. 1, these are those on the same side of the processor die. In the example in Fig. 12, an NDP-friendly application can be scheduled on an NDP core in stacks 1 or 2 to have high-bandwidth and low-latency accesses. Finally, if stacks on one side are exhausted, we allocate pages to stacks on the opposite side of the chip.

Fig. 12: Example data placement preference list over the system's NDP stacks (cores+DRAM). Memory stacks with free pages are shown in blue and full stacks in red.

Adopting more sophisticated data placement techniques as in prior NUMA work [21, 46] could increase AMS's benefits. For example, the system could dynamically migrate data to reduce cross-stack accesses from NDP threads. These techniques are orthogonal to AMS, so we leave them to future work.
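A minimal sketch of this heuristic (the names and topology are ours): build the preference list when a thread starts, then serve each page allocation from the first stack on the list that still has free capacity.

```python
def preference_list(stacks, free, side):
    """stacks: stack ids; free[s]: free pages; side[s]: which side of
    the processor die the stack sits on ('L' or 'R')."""
    head = max(stacks, key=lambda s: free[s])   # emptiest stack first
    near = [s for s in stacks if s != head and side[s] == side[head]]
    far = [s for s in stacks if side[s] != side[head]]
    return [head] + near + far

def allocate_page(prefs, free):
    """Allocate one page from the first stack with free capacity."""
    for s in prefs:
        if free[s] > 0:
            free[s] -= 1
            return s
    raise MemoryError("all stacks are full")

free = {1: 400, 2: 900, 3: 0, 4: 700}
side = {1: "L", 2: "L", 3: "R", 4: "R"}
prefs = preference_list([1, 2, 3, 4], free, side)
print(prefs)                                   # [2, 1, 3, 4]
print([allocate_page(prefs, free) for _ in range(3)])
```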
VII. EVALUATION

A. Methodology

Modeled system: We perform microarchitectural, execution-driven simulation using zsim [58], and model a 16-core system. The processor die has 8 cores, with private 32 KB L1 and 256 KB L2 caches. All 8 cores share a 16 MB LLC that uses the TA-DRRIP [38] thread-aware replacement policy. The processor die is connected to four NDP stacks via SerDes links, as shown in Fig. 1. Each stack has 4 GB of DRAM and 2 NDP cores with only private caches. Table I details the system's configuration.

TABLE I: CONFIGURATION OF THE SIMULATED SYSTEM.
Cores: 16 cores (8 processor die + 8 NDP), x86-64, 2.5 GHz. Silvermont-like OOO [40]: 8B-wide fetch, 2-level bpred (BHSRs + 1K 2-bit PHT), 2-way issue, 36-entry IQ, 32-entry ROB, 32-entry LQ/SQ. Haswell-like OOO [29]: 16B-wide fetch, 2-level bpred (1K 18-bit BHSRs + 4K 2-bit PHT), 4-way issue, 60-entry IQ, 192-entry ROB, 72-entry LQ, 42-entry SQ
L1 caches: 32 KB, 8-way set-associative, split data and instruction caches, 3-cycle latency; 15/33 pJ per hit/miss [51]
L2 caches: 256 KB private per-core, 8-way set-associative, inclusive, 7-cycle latency; 46/93 pJ per hit/miss [51]
Coherence: MESI, 64 B lines, no silent drops; sequential consistency
Last-level cache: 16 MB, 2 MB bank per core, 32-way set-associative, inclusive, 30-cycle latency, TA-DRRIP [38] replacement; 945/1940 pJ per hit/miss [51]
Stacked DRAM: 4 GB die, HMC 2.0-like organization, 8 vaults per stack, 64-bit data bus, 6-cycle all-to-all crossbar in the logic layer [36], 2 pJ/bit internal, 8 pJ/bit logic layer [25, 73]
Stack links: 16 GBps bidirectional, 10-cycle latency, including 3.2 ns for SerDes [43], 2 pJ/bit [43, 55]
3D DRAM timings: tCK = 1.6 ns, tCAS = 11.2 ns, tRCD = 11.2 ns, tRAS = 22.4 ns, tRP = 11.2 ns, tWR = 14.4 ns

We consider systems with both homogeneous and heterogeneous cores. Our homogeneous-core system (Secs. VII-B to VII-D) uses 2-wide OOO cores similar to Silvermont [40]. Our heterogeneous-core system (Sec. VII-E) instead uses 4-wide OOO cores similar to Haswell [29] in the processor die.

Schedulers: We first compare AMS-Greedy against three simple schedulers in Sec. VII-B and Sec. VII-C. First, we use Random scheduling as the baseline to which we compare other schedulers. This is a better baseline than the WAS (worst application scheduler) baseline in prior work [37, 69]. Second, All proc always runs threads on processor-die cores. Third, All NDP always runs threads on NDP cores.

In Sec. VII-D, we compare AMS-Greedy against AMS-DP and a more sophisticated scheduler, CRUISE-NDP. We derive CRUISE-NDP by adapting CRUISE [37] to asymmetric hierarchies. Each scheduling quantum, CRUISE-NDP classifies threads as either cache-insensitive, cache-friendly, cache-fitting, or thrashing using the same heuristics as CRUISE (all the necessary information for CRUISE is gathered using UMONs too). CRUISE-NDP then maps thrashing threads to NDP cores, cache-friendly and fitting threads to processor-die cores (prioritizing friendly over fitting), and finally backfills the remaining cores with insensitive threads.
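To illustrate how a classification-based scheduler differs from AMS, here is a toy version of CRUISE-NDP's decision rule built on the same miss curves. The thresholds are invented for the example (CRUISE's actual heuristics are in [37]).

```python
def classify(miss_curve):
    """Toy four-class classifier over a miss curve (thresholds made up)."""
    a = miss_curve[0]                          # accesses = misses at size 0
    saved = (a - miss_curve[-1]) / max(a, 1)   # fraction the full LLC saves
    if saved < 0.05:
        return "insensitive" if a < 100 else "thrashing"
    # "fitting": most of the benefit is already reached at half the LLC
    at_half = (a - miss_curve[len(miss_curve) // 2]) / (a - miss_curve[-1])
    return "fitting" if at_half > 0.9 else "friendly"

def cruise_ndp(curves, n_proc):
    """Friendly/fitting threads get the processor die, thrashing ones go
    to NDP cores, and insensitive threads backfill whatever is left."""
    rank = {"friendly": 0, "fitting": 1, "insensitive": 2, "thrashing": 3}
    ranked = sorted(curves, key=lambda t: rank[classify(curves[t])])
    return set(ranked[:n_proc]), set(ranked[n_proc:])   # die, NDP

curves = {"A": [900, 500, 300, 250, 240],   # cache-friendly
          "B": [800, 780, 760, 740, 720]}   # streaming
print(cruise_ndp(curves, n_proc=1))
```

Note how the fixed thresholds collapse each curve into a class and discard the degree of preference; a streaming thread near a class boundary (B above) can end up on the die, which is exactly the brittleness under contention that Sec. IV described and that AMS's opportunity costs avoid.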

We model migration overheads and find remapping every 5 ms causes negligible overheads, similar to PIE's findings.

Workloads: Our workload setup mirrors prior work [9]. We simulate mixes of SPEC CPU2006 apps and multithreaded apps from SPEC OMP2012 and PARSEC [11]. We evaluate scheduling policies under 50% load (8 cores occupied) and 100% load (16 cores occupied). We use the 18 SPEC CPU2006 apps with >5 L2 MPKI (as in Sec. IV) and fast-forward all apps in each mix for 3 B instructions. We use a fixed-work methodology and equalize sample lengths to avoid sample imbalance, similar to FIESTA [32]. Each application is then simulated for 2 B instructions. Each experiment runs the mix until all apps execute at least 2 B instructions, and we consider only the first 2 B instructions of each app to report performance. For multithreaded apps, since IPC is not a valid measure of work [5], to perform a fixed amount of work we instrument each application with heartbeats that report global progress (e.g., when each timestep or transaction finishes) and run each application for as many heartbeats as All proc completes in 2 B cycles after the serial region.

Metrics and energy model: We follow prior work in scheduling techniques and use weighted speedup [63] as our performance metric. We use McPAT 1.3 [47] to derive the energy of cores at 22 nm, and CACTI [51] for caches at 22 nm. We model the energy of 3D-stacked DRAM using numbers reported in prior work [34, 43, 60]. Dynamic energy for NDP accesses is about 10 pJ/bit. We assume that each SerDes link consumes 2 pJ/bit [43, 55].

B. AMS finds the right hierarchy

We first evaluate AMS-Greedy in an undercommitted system with homogeneous cores (8 apps on 16 Silvermont cores) to focus on the effect of memory asymmetry. Fig. 13a shows the distribution of weighted speedups over 40 mixes of 8 randomly chosen memory-intensive SPEC CPU2006 apps. Each line shows the results for a single scheduler over the Random baseline. For each scheme, workload mixes (the x-axis) are sorted according to the improvement achieved.

Fig. 13: Simulation results for different schedulers on 8-app mixes: (a) weighted speedup, (b) normalized memory accesses, (c) normalized cross-stack traffic, and (d) dynamic data movement energy, broken into private caches, shared LLC, memory, and links.

In each mix, different applications want different hierarchies. All proc improves only 7 mixes and hurts performance on the other 33 because it never leverages the NDP capability of the asymmetric system. On average, its weighted speedup is 8% worse than the Random baseline. All NDP benefits some applications significantly (e.g., soplex in Fig. 5). However, it sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. On average, All NDP improves weighted speedup by 9% over the baseline. AMS-Greedy finds the best hierarchy for each application and schedules them accordingly. It thus never hurts performance and improves weighted speedup by up to 37% and by 18% on average over the Random baseline.

AMS-Greedy achieves significant speedups because it leverages both hierarchies efficiently. Figs. 13b-d give more insight on this.
AMS-Greedy uses the LLC as effectively as All proc and reduces memory accesses by 26% over the baseline (Fig. 13b). AMS-Greedy also schedules applications to leverage the system's NDP cores when beneficial. It thus eliminates 80% of the cross-stack traffic (Fig. 13c). Overall, AMS-Greedy reduces dynamic data movement energy by 25% over the baseline, while All proc increases it by 2% and All NDP reduces it by 18% (Fig. 13d).

C. AMS adapts to application phases

Next, we show how AMS-Greedy adapts to phase changes by examining a 4-app mix. In this workload, we include two applications, astar and xalancbmk, that have distinct memory behaviors across two long phases, and two other applications, bzip2 and sphinx3, that have short and fine-grained variations over time. To observe time-varying behavior, we simulate this mix for 250 Bcycles.

Fig. 14a shows the traces of scheduling and capacity allocation decisions of AMS-Greedy for all four apps. The upper two traces show that AMS-Greedy takes different decisions for astar and xalancbmk before and after 100 Bcycles. Before 100 Bcycles, astar is mapped to the processor die and xalancbmk is mapped to an NDP core. After 100 Bcycles, both change the hierarchy they prefer. The other two apps fluctuate more, but prefer the deep hierarchy more often.

To explain this phenomenon, Fig. 14b and Fig. 14c show the sampled miss curves for astar and xalancbmk at 50 and 150 Bcycles. At 50 Bcycles, astar has a small working set (the sharp drop around 1 MB), but xalancbmk has a large working set (14 MB). Therefore, AMS-Greedy fits the working sets of astar, bzip2, and sphinx3 in the LLC, and schedules xalancbmk to an NDP core because it prefers the shallow hierarchy when capacity is limited. At 150 Bcycles,

More information