Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies


Appears in the Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies
Po-An Tsai, Changping Chen, Daniel Sanchez
Massachusetts Institute of Technology
{poantsai, cchen, sanchez}@csail.mit.edu

Abstract: Conventional multicores rely on deep cache hierarchies to reduce data movement. Recent advances in die stacking have enabled near-data processing (NDP) systems that reduce data movement by placing cores close to memory. NDP cores enjoy cheaper memory accesses and are more area-constrained, so they use shallow cache hierarchies instead. Since neither shallow nor deep hierarchies work well for all applications, prior work has proposed systems that incorporate both. These asymmetric memory hierarchies can be highly beneficial, but they require scheduling computation to the right hierarchy.

We present AMS, an adaptive scheduler that automatically finds high-quality thread-to-hierarchy mappings. AMS monitors threads, accurately models their performance under different hierarchies and core types, and adapts algorithms first proposed for cache partitioning to produce high-quality schedules. AMS is cheap enough to use online, so it adapts to program phases, and performs within 1% of an exhaustive-search scheduler. As a result, AMS outperforms asymmetry-oblivious schedulers by up to 37% and by 18% on average.

Index Terms: Cache hierarchies, near-data processing, asymmetric systems, scheduling, analytical performance modeling.

I. INTRODUCTION

Data movement has become a key bottleneck for computer systems. For example, an off-chip main memory access costs 1000x more energy and takes 100x more time than a double-precision multiply-add [19]. Without a drastic reduction in data movement, memory accesses and communication will limit the scalability of future systems [31].

Conventional systems rely on deep multi-level cache hierarchies to reduce data movement. These hierarchies often take over half of chip area and are dominated by a multi-megabyte last-level cache (LLC). Deep hierarchies avoid costly memory accesses when they can accommodate the program's working set. But when the working set does not fit in any cache level, deep hierarchies add latency and energy for no benefit [67].

Recently, placing cores closer to main memory has become a feasible alternative to deep hierarchies. Advances in die-stacking technology [12] allow tightly integrating memory banks and cores or specialized processors, an approach known as near-data processing (NDP). NDP cores enjoy lower latency and energy to the memory stacked above them, but have limited area and power budgets [26, 62]. These factors naturally bias NDP systems not only towards efficient cores [22, 25], but also towards shallow hierarchies with few cache levels between cores and memories.

Fig. 1: A system with an asymmetric memory hierarchy (memory stacks with near-memory cores, connected to a conventional processor with a deep cache hierarchy).

Shallow hierarchies substantially outperform deep ones when the working set does not fit in a large on-chip LLC, but they work poorly for cache-friendly applications. Consequently, prior work [4, 25, 34, 71, 73] has proposed asymmetric memory hierarchies that combine deep and shallow hierarchies within a single system. For example, Google recently proposed to use such asymmetric systems for consumer workloads [14]. Fig. 1 shows an example system. This system includes a conventional processor die with a deep cache hierarchy, connected to several memory stacks, each with a small number of NDP cores and a shallow cache hierarchy in its logic layer. The processor die and memory stacks are connected using a silicon interposer.
Asymmetric hierarchies provide ample opportunity to improve the performance and efficiency of memory-intensive applications. We find that mapping threads to the correct hierarchy improves their performance per Joule by up to 2.8x and by 40% on average (Sec. IV). However, achieving this potential requires mapping threads to the right hierarchy dynamically. As shown in prior work, the same application can prefer different hierarchies depending on its input [4]. Moreover, colocated applications can compete for resources in either hierarchy, which affects their preferences. Thus, it is unreasonable to expect programmers or users to make this choice manually. Instead, the system should automatically schedule threads to the right hierarchy.

Nonetheless, this scheduling problem is quite challenging, as it has a large, non-convex decision space (i.e., which threads use the shallow hierarchy, which threads share the deep hierarchy). Much prior work has studied dynamic resource management and scheduling for systems with symmetric memory hierarchies [9, 37, 50, 69, 74]. And prior work on asymmetric systems [16, 69] focuses only on asymmetric cores, not memory hierarchies.

To address this problem, we introduce AMS, a novel thread scheduler for systems with asymmetric memory hierarchies (Sec. V). The key insight behind AMS is that the problem of modeling a thread's preferences for different hierarchies under contention bears a strong resemblance to the cache partitioning problem. Therefore, AMS leverages both working set profiling techniques and allocation algorithms from previous partitioning techniques, even though AMS does not partition any cache.

Specifically, we show that by sampling a thread's miss curve, i.e., the number of misses a thread would incur at different cache sizes, we can effectively model a thread's performance over different hierarchies under contention without trial and error. We then extend this model to handle other asymmetries (i.e., core types), proposing a novel analytical model that integrates both memory hierarchy and core asymmetries.

AMS uses the proposed model to remap threads periodically, improving performance and efficiency. We contribute two different mapping algorithms. First, AMS-Greedy is a simple scheduler that performs multiple rounds of cache partitioning and greedily maps threads to hierarchies. Second, AMS-DP leverages dynamic programming to explore the full space of configurations efficiently, finding the optimal schedule given the performance model. AMS-Greedy is cheap, scales well to large systems, and performs within 1% of AMS-DP. While AMS-DP is more expensive, it is still practical in small systems and serves as the upper bound of AMS-Greedy.

Evaluation results (Sec. VII) show that, on a 16-core system, AMS outperforms an asymmetry-oblivious baseline by up to 37% and by 18% on average. AMS adapts to program phases and handles core and cache contention in asymmetric hierarchies well, outperforming state-of-the-art schedulers. Specifically, AMS outperforms a scheduler that extends LLC-aware CRUISE [37] to NDP systems by up to 18% and by 7% on average; and AMS outperforms the PIE [69] heterogeneous-core-aware scheduler by up to 13% and by 6% on average.

II. BACKGROUND AND RELATED WORK

We now review related work in NDP systems and scheduling algorithms, the areas that AMS draws from.

A. PIM and NDP systems

Processing-in-memory (PIM) systems proposed to integrate processors and DRAM arrays in the same die. PIM systems were studied extensively in the 1990s. J-Machine [2], EXECUBE [44], and IRAM [45] proposed to integrate processors and main memory, while Active Pages [53], DIVA [28], and FlexRAM [39] instead proposed to enhance traditional processors using memory chips with coprocessors. Though compelling, PIM was unsuccessful due to the difficulties of integrating high-speed logic and DRAM [66].

With the success of 3D integration using through-silicon vias [12], the idea of processing-in-memory has been revisited recently in the context of die-stacked DRAM. Recent near-data processing (NDP) research has focused on two directions: (i) how to exploit the massive bandwidth of NDP systems within their limited area and power budgets, and (ii) how to integrate NDP systems with conventional systems.

Die-stacking technology offers lower latency, lower energy, and much higher bandwidth between the logic layer and the memory stack than conventional off-chip memories. However, it also imposes limited area and thermal budgets in the logic layer. This new tradeoff is attractive for data-intensive applications. However, without careful engineering, it is difficult to saturate the available bandwidth to fully utilize the potential of NDP systems. Thus, one important research question in recent NDP work is what form of computation to put in the logic layer to best balance programmability, performance/efficiency, and design constraints. On the one hand, several designs focus on general-purpose NDP systems that use simple cores [22, 25, 55], GPUs [72, 73], and reconfigurable logic [24, 26]. On the other hand, multiple projects design NDP systems tailored to important emerging workloads, such as graph analytics [3, 52], neural networks [27, 42], and sparse data structures [33, 35].
Although die-stacking technology has made NDP systems practical, not all applications can benefit from NDP. Therefore, another research direction has been how to support an asymmetric system composed of both NDP and conventional chips. For example, LazyPIM [15] studies how to provide coherence within asymmetric systems. PIM-enabled instructions [4], TOM [34], and Pattnaik et al. [54] focus on how to map computation across systems with asymmetric hierarchies. PIM-enabled instructions proposes new instructions and hardware support to decide when to offload specific instructions to in-memory, fixed-function accelerators to maximize locality. TOM proposes a combination of compiler, runtime, and hardware techniques to offload computation and place data to balance bandwidth in GPU-based asymmetric systems. Similarly, Pattnaik et al. propose to combine compiler techniques and a runtime affinity prediction model to schedule kernels for asymmetric systems.

Like this prior work, AMS focuses on how to schedule threads across an asymmetric system to maximize system-wide performance. Unlike this prior work, AMS aims to schedule threads with no program modifications and transparently to users, similar to how OS-level schedulers manage symmetric systems, as recent work on OSes for NDP systems advocates [7].

B. Cache, NUMA, and heterogeneity-aware thread schedulers

Scheduling applications under different constraints has been studied extensively in many contexts. The closest techniques to AMS are cache-contention-aware, NUMA-aware, and heterogeneous-core-aware schedulers.

Contention-aware schedulers [37, 49, 74] classify and colocate compatible applications under the same memory domain to avoid interference. For example, CRUISE [37] dynamically schedules single-thread applications in systems with multiple LLCs (e.g., multi-socket systems). CRUISE classifies applications into four categories according to their LLC behavior (insensitive, thrashing, fitting, and friendly). It then applies fixed scheduling policies to each class. As we will see, classification-based techniques do not work well in asymmetric systems, where application preferences (and thus classes) are affected by contention from other colocated applications. They also fail to handle same-class applications with different preference degrees (strong/weak).

NUMA-aware schedulers have different goals and constraints than asymmetry-aware schedulers. Since memory bandwidth is scarce in NUMA systems, prior work focuses on how to schedule threads across NUMA nodes to reduce bandwidth contention, similar to TOM for GPU-based asymmetric systems.

Tam et al. [64] profile which threads have frequent sharing and place them in the same socket. DINO [13] clusters single-thread processes to equalize memory intensity, places clusters in different sockets, and migrates pages along with their threads. AsymSched [46] studies NUMA systems with asymmetric interconnects, migrating threads and pages to use the best-connected nodes. These NUMA schemes focus on off-chip memory bandwidth utilization, while AMS focuses on the asymmetry between deep and shallow hierarchies.

Finally, scheduling techniques for systems with heterogeneous cores [16, 69] focus on making the best use of asymmetric core microarchitectures like big.LITTLE. Due to the area and power limits of memory stacks, asymmetric systems often employ heterogeneous cores [25, 55], where the processor die has not only a deeper hierarchy but more powerful cores than the NDP stacks. AMS focuses on asymmetric memory hierarchies, but its performance model can be easily extended to consider other asymmetries. Specifically, we extend it with PIE's model [69] to handle asymmetry in both core types and memory hierarchies (Sec. V-B).

III. BASELINE ASYMMETRIC SYSTEM

To make our discussion concrete, we first describe the asymmetric system we target in this work, shown in Fig. 1 and Fig. 2. The processor die is similar to current multicores: each core has its own private caches, and all cores share a multi-megabyte last-level cache (LLC). The processor die is connected to several memory stacks using high-speed SerDes links. Each stack has multiple DRAM dies and a logic layer with several memory controllers and NDP cores. These NDP cores have only private caches due to the area and power constraints of the logic layer [2]. This system uses an interposer, but AMS would also work with other configurations, e.g., using off-package stacks.

Fig. 2: Baseline system with an asymmetric memory hierarchy (DRAM dies and NDP cores with private caches in each stack, connected over SerDes links to the processor-die cores with private caches and a shared LLC).

A. Memory stacks with NDP cores

We assume a memory stack design similar to HMC 2.0's [36]. Memory is organized in several vertical slices called vaults. Each DRAM vault is managed by and accessed through a vault controller in the logic layer. Vault controllers are connected via an all-to-all crossbar, as Fig. 3 shows. In addition to vault controllers, we assume the logic layer also has multiple low-power, lean OOO cores, such as Silvermont [40] or Cortex A57 [6]. Those cores have the same ISA as the processor die, so they can run programs without help from the main processor. Like prior work in NDP systems using die-stacking techniques [25, 42, 55, 73], we conservatively assume the logic layer has a power and area budget of 10 W and 50 mm2 for components other than vault controllers and interconnect. This budget supports up to 4 NDP cores in the logic layer, connected to the system via the crossbar.

Fig. 3: Logic layer of each memory stack (vault controllers and NDP cores with private caches, connected by an all-to-all crossbar).

Why general-purpose cores? General-purpose cores make it easy for programmers to adapt their applications to this asymmetric memory hierarchy [25, 55]. Since both NDP cores and conventional cores use the same ISA, threads can migrate between hierarchies without recompilation or dynamic binary translation. This enables a smooth transition from traditional systems to asymmetric systems.

B. Coherence in NDP private caches

Deep and shallow hierarchies share the same physical address space, so their caches must be kept coherent to ensure correctness.
However, using conventional directory-based coherence would either require NDP cores to check a remote directory even when performing local memory accesses, or require processor-die cores to check a memory-side directory on the memory stacks, adding area and traffic overheads that would limit the benefits of NDP [15]. To avoid these overheads, we perform software-assisted coherence similar to prior work [25]. Each virtual memory page is classified as either thread-private, shared read-only, or shared read-write. NDP cores can cache data from thread-private and shared read-only pages without violating coherence. For simplicity, shared read-write pages are considered uncacheable by NDP cores, which access them through the LLC to preserve coherence with processor-die caches.

This classification technique has also been used to reduce coherence traffic [18] and to improve data placement in NUCA caches [8, 30]. We use the same dynamic classification mechanism as this prior work: Pages start private to the thread that allocates them. Upon a read from any other thread, the page is reclassified as shared read-only, and upon a write from any other thread, the page is reclassified as shared read-write. Reclassifications are done through TLB shootdowns [8, 30], which flush the page from private caches. Finally, when a thread moves from the processor die to an NDP core, its dirty LLC lines are flushed.
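To make the classification rules concrete, the following is a minimal sketch of the per-page state machine just described. The names (PageState, Page.access) and the Python form are ours for illustration; a real implementation lives in the OS page-fault and TLB-shootdown paths, not in application code.

```python
from enum import Enum

class PageState(Enum):
    PRIVATE = 0      # cacheable by NDP cores
    SHARED_RO = 1    # cacheable by NDP cores
    SHARED_RW = 2    # NDP cores must access it through the LLC

class Page:
    def __init__(self, owner_tid):
        self.state = PageState.PRIVATE
        self.owner = owner_tid

    def access(self, tid, is_write):
        """Reclassify on accesses from threads other than the owner.
        Each downgrade would trigger a TLB shootdown that flushes the
        page from private caches; here we only track the state."""
        if tid == self.owner:
            return
        if is_write and self.state != PageState.SHARED_RW:
            self.state = PageState.SHARED_RW
        elif not is_write and self.state == PageState.PRIVATE:
            self.state = PageState.SHARED_RO

page = Page(owner_tid=1)
page.access(tid=2, is_write=False)   # PRIVATE -> SHARED_RO
page.access(tid=3, is_write=True)    # SHARED_RO -> SHARED_RW
print(page.state)                    # PageState.SHARED_RW
```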

IV. MOTIVATION

Although technology advances have enabled systems with asymmetric memory hierarchies, what is their potential benefit? Moreover, how critical is it to schedule threads to the right hierarchy? In this section, we answer these questions by characterizing the benefits of an asymmetric system for memory-intensive applications. We also show that the ideal scheduler should (i) identify the right hierarchy for each thread, (ii) adapt to execution phases, and (iii) consider resource contention among threads.

Fig. 4: Latency and energy of deep-hierarchy LLC hits, shallow-hierarchy memory accesses, and deep-hierarchy memory accesses (load latency in ns and energy per 64 B in nJ, broken down into private caches, shared cache, on-chip NoC, SerDes link, logic layer, and DRAM).

A. Asymmetry in access latency and energy

One of the key differences between deep and shallow hierarchies is the multi-megabyte LLC in the processor die. The performance offered by deep and shallow hierarchies largely depends on how frequently accesses hit in the LLC when using the deep hierarchy. Fig. 4 shows the latency and energy breakdowns of a memory reference in three situations with increasing costs: an LLC hit in the deep hierarchy, a stacked memory access in the shallow hierarchy, and an LLC miss (and off-chip stacked memory access) in the deep hierarchy (Sec. VII-A details the methodology for these costs).

Fig. 4 shows that an LLC miss from the deep hierarchy is the worst-case scenario: the system incurs the latency of an LLC lookup for no benefit, then it must traverse the on-chip network and off-chip SerDes link, reach the DRAM vault memory controller, wait for DRAM to serve the data, and finally wait for the response to make its way back. By contrast, a memory access from the shallow hierarchy (i.e., an NDP core) is 40% faster, because it is not subject to the LLC lookup, on-chip network, or SerDes link latencies. Nevertheless, stacked DRAM is significantly slower than on-chip SRAM, so an LLC hit in the deep hierarchy is 65% faster than a DRAM access from the shallow hierarchy. Energy breakdowns follow similar trends as latency breakdowns.

These costs show that shallow hierarchies complement deep ones, but do not uniformly outperform them. If a thread's working set does not fit in the LLC but fits in a local memory stack, a shallow hierarchy works best. But if the LLC can satisfy a substantial number of accesses, a deep hierarchy will be more attractive.

B. Effect of asymmetry on application preferences

We now simulate several memory-intensive applications to see how they can exploit memory asymmetry. We model a deep hierarchy with 32 KB private L1s, 256 KB L2s, and a shared 16 MB LLC in the conventional processor. The shallow hierarchy only has private L1 and L2 caches (see Sec. VII-A for methodology details). Both hierarchies use 2-wide OOO cores. We later evaluate heterogeneous cores and multithreaded applications; our goal here is to study memory asymmetry independently. These cores with their private caches consume less than 2.5 W and 10 mm2 per core, which is practical to fabricate in the logic layer of 3D-stacked DRAM [2].

We simulate the 18 memory-intensive SPEC CPU2006 benchmarks that have >5 L2 MPKI and 8 benchmarks from the Problem-Based Benchmark Suite [61], which contains memory-intensive graph algorithms. Since the choice of hierarchy affects both performance and efficiency, we use performance per Joule (Perf/J), i.e., the inverse of energy-delay product, to characterize the differences across hierarchies.

Applications have strong hierarchy preferences. Fig. 5 shows the Perf/J of representative applications when running on the shallow hierarchy, relative to the Perf/J when running on the deep hierarchy. Some applications strongly prefer the deep hierarchy. For example, xalancbmk has a working set of about 6 MB, so it benefits significantly from the 16 MB LLC in the deep hierarchy.
xalancbmk's Perf/J on the shallow hierarchy is almost 2x (-50%) worse than on the deep hierarchy. By contrast, soplex has a much larger working set that cannot fit in the 16 MB LLC. It thus always prefers the shallow hierarchy, which provides 2x higher Perf/J than the deep hierarchy.

Across all applications, always using the shallow hierarchy improves gmean Perf/J over the deep hierarchy by 15%. However, always using the hierarchy that offers the best average Perf/J for each application improves gmean Perf/J by 31% (Fig. 5, right), doubling the improvement achieved by always using the shallow hierarchy. This result shows that it is important to schedule applications to the right hierarchy.

Fig. 5: Performance per Joule (Perf/J) relative to the deep hierarchy for each application, plus the gmean Perf/J improvement of the shallow, best static, and best dynamic choices. Higher is better.

Dynamic scheduling unlocks the full potential of asymmetric hierarchies. Some applications have multiple phases, each with different memory behaviors and working sets. For example, as shown in Fig. 6, GemsFDTD prefers the shallow hierarchy before it reaches 53 billion instructions, and prefers the deep hierarchy afterward. Therefore, running GemsFDTD on either hierarchy statically does not yield major benefits. To show the impact of these dynamic preferences, we implement a dynamic scheduler that always runs the application on the best hierarchy for each 5 ms phase. This substantially improves applications like GemsFDTD and refine. Of the 26 applications, 12 (46%) prefer different hierarchies over different phases. Overall, our dynamic scheduler improves gmean Perf/J by 40%, more than the 31% achieved by static decisions.

Fig. 6: Perf/J traces of GemsFDTD on each hierarchy, relative to the average Perf/J of the deep hierarchy.
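The best-static and best-dynamic oracles above reduce to a simple computation over per-phase Perf/J traces, sketched below with made-up numbers: best-static picks one hierarchy for the whole run, while best-dynamic picks the better hierarchy in every phase.

```python
from math import prod

def gmean(xs):
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical per-phase Perf/J of one app, normalized to the deep
# hierarchy's average (so deep is 1.0 in every phase).
deep = [1.0, 1.0, 1.0, 1.0]
shallow = [1.5, 1.4, 0.6, 0.5]       # prefers shallow early, deep late

best_static = max(gmean(deep), gmean(shallow))
best_dynamic = gmean([max(d, s) for d, s in zip(deep, shallow)])
print(best_static, best_dynamic)     # dynamic beats either static choice
```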

Fig. 7: Performance per Joule of deep hierarchies with different LLC sizes, relative to the shallow hierarchy.

Applications' preferences are sensitive to contention. The above results consider a single application, but in real-world workloads, multiple applications are colocated in a single system and compete for shared resources, such as LLC capacity. To study this effect, we sweep the LLC size of the deep hierarchy to mimic capacity contention among applications. Fig. 7 shows the Perf/J improvement of deep hierarchies with different LLC capacities over the shallow hierarchy. We select 7 representative benchmarks. The first application (bzip2) always benefits from deep hierarchies due to its cache-friendly working sets. The next application (soplex) instead always prefers a shallow hierarchy due to its streaming behavior. By contrast, the other applications have very different preferences across LLC capacities. For example, omnetpp benefits from LLCs of at least 4 MB, while sphinx3 benefits from LLCs of at least 8 MB, and mcf only benefits from a 16 MB LLC. And even when applications prefer the deep hierarchy, their degree of preference also changes significantly with available capacity (e.g., 8 MB vs. 16 MB for astar).

This result shows that when applications are colocated, resource contention can dramatically change their preferences. It also shows why prior classification-based schedulers can cause pathologies with asymmetric hierarchies. For example, CRUISE can first classify and schedule mcf to the deep hierarchy, then later schedule others that cause capacity contention and make mcf strongly prefer the shallow hierarchy.

In summary, these results show that applications have strong preferences for the type of hierarchy, and that these preferences change over time and with available resources. These observations guide AMS's design.

V. AMS: ADAPTIVE SCHEDULING FOR ASYMMETRIC MEMORY SYSTEMS

AMS realizes the potential of asymmetric memory hierarchies by accounting for contention and dynamic behavior when mapping threads to cores. The key insight behind AMS is that the problem of modeling a thread's preferences for different memory hierarchies on-the-fly and under contention bears a strong resemblance to the dynamic cache partitioning problem. Therefore, unlike other schedulers, AMS leverages both working set profiling techniques and allocation algorithms that were originally proposed for cache partitioning, even though AMS does not perform cache partitioning.

Fig. 8 shows an overview of AMS: hardware utility monitors sample accesses and produce miss curves; the first phase (Sec. V-A) uses them to estimate performance under different hierarchies; the second phase finds a thread placement with AMS-Greedy (Sec. V-C) or AMS-DP (Sec. V-E) and schedules threads. AMS has both hardware and software components.
In the second phase, AMS software uses these estmates to fnd a thread placement that acheves hgh system-wde performance. We present two thread placement algorthms: AMS-Greedy performs multple rounds of cache parttonng and uses ts outcomes to progressvely and greedly map threads to herarches (Sec. V-C), whle AMS-DP leverages dynamc programmng to explore the full space of confguratons effcently, fndng the optmal schedule gven the predcted preferences (Sec. V-E). Though AMS-DP s more expensve than AMS-Greedy, t s practcal to use n small systems and serves as AMS-Greedy s upper bound. To smplfy the explanaton, we frst focus on systems wth homogeneous cores runnng sngle-thread applcatons. Sec. V-B extends AMS to heterogeneous cores, Sec. V-D extends AMS to multthreaded workloads, and Sec. V-F dscusses other scenaros, such as oversubscrbed systems. A. Estmatng performance under asymmetrc herarches To model thread preferences, t s crucal to understand the utlty of the processor de s LLC for each thread. To ths end, AMS leverages UMONs [56] to produce mss curves. Each UMON s a set-assocatve tag array wth per-way ht counters. UMONs leverage LRU s stack property to profle dfferent cache szes smultaneously. AMS adds a 4 KB UMON to each core. Each UMON samples prvate cache msses and produces a mss curve that covers the range of possble capactes avalable to the thread (from no capacty to the full LLC). We choose UMONs for ther low overhead and hgh accuracy, but AMS could use other mss curve proflng technques [9, 23, 65]. AMS models thread performance usng total memory access latency, a cost functon derved wth mss curves. AMS can also optmze other cost functons, such as core cycles, as we wll show n Sec. V-B. AMS uses mss curves to derve cost 5

AMS uses miss curves to derive cost functions for all relevant scenarios. Since NDP and processor-die cores have the same private caches, we focus on memory references after the private cache levels. If a thread runs on a processor-die core, its latency depends on how much LLC capacity is available. Specifically, the total latency in cycles as a function of LLC capacity s, which we call the latency curve, is:

$$L_{proc}(s) = A \cdot Lat_{LLC} + M(s) \cdot Lat_{mem,proc}$$

where A is the number of accesses that miss in the private cache levels (i.e., the number of LLC accesses in this case), $Lat_{LLC}$ is the average latency of a single LLC access, M(s) is the number of LLC misses given capacity s, and $Lat_{mem,proc}$ is the average latency of a single access to off-chip main memory. M(s) is the miss curve, and A, the number of LLC accesses, is simply A = M(0) (with no LLC capacity, all LLC accesses miss).

Note that this formula covers the total number of cycles spent in memory references in a given interval, not the average latency. This is because it is important to account for the rate at which accesses happen, not only their unit cost. For example, a thread that has infrequent misses from its private caches will have low values for A and M(s), and thus will incur a low penalty from different thread placements, even if most of the few accesses it performs miss in the LLC.

If the thread runs on an NDP core, all A private cache misses go to memory. Thus, the thread's latency curve is simply $L_{NDP} = A \cdot Lat_{mem,NDP}$. Because NDP cores do not access a shared LLC, this curve does not change with s. However, because the system has multiple memory stacks, the average latency per memory access, $Lat_{mem,NDP}$, depends on the core's stack as well as the placement of the application's data. We use a simple algorithm that makes most NDP memory accesses local by biasing data placement to particular stacks. We describe this algorithm in Sec. VI. AMS simply computes $Lat_{mem,NDP}$ as the weighted average of the number of application pages on each stack, times the latency to access that stack.

Fig. 9: Example latency curves for processor-die and NDP cores (the miss curve from the UMON, the processor-die latency curve vs. LLC capacity, and flat latency curves for an NDP core in the same stack as the data and one in a different stack).

Fig. 9 shows three example latency curves for a particular thread: the curve for processor-die cores and two curves for two NDP cores on different stacks. These latency curves encode a thread's preferences under different scenarios. For example, if the LLC is very contended and leaves no capacity, this thread prefers NDP cores. But with 2 MB of LLC capacity, only the NDP core closest to its data is better (i.e., has a lower latency). Finally, if the thread can use over 4 MB of LLC capacity, it prefers to run on a processor-die core with the deep hierarchy. Moreover, the latency difference between $L_{proc}(s)$ and $L_{NDP}$ at each point also indicates how strong the preference is.

We find that this model matches Fig. 7's results. Therefore, this model lets AMS predict how applications perform under different decisions without directly profiling or sampling their performance under various colocation combinations.
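In code, these cost functions are direct translations of the formulas above; the latency constants below are placeholders, not the paper's measured values.

```python
LAT_LLC = 30            # one LLC access (cycles), assumed
LAT_MEM_PROC = 250      # off-chip access from the processor die, assumed

def latency_curve_proc(miss_curve):
    """L_proc(s) = A*Lat_LLC + M(s)*Lat_mem,proc, with A = M(0)."""
    a = miss_curve[0]
    return [a * LAT_LLC + m * LAT_MEM_PROC for m in miss_curve]

def latency_ndp(miss_curve, pages_per_stack, lat_to_stack):
    """L_NDP = A*Lat_mem,NDP, where Lat_mem,NDP is the page-weighted
    average latency from this NDP core to each stack holding data."""
    a = miss_curve[0]
    total = sum(pages_per_stack)
    lat = sum(p * l for p, l in zip(pages_per_stack, lat_to_stack)) / total
    return a * lat

m = [1000, 800, 500, 200, 150]     # misses vs. LLC capacity steps
print(latency_curve_proc(m))
print(latency_ndp(m, pages_per_stack=[900, 100],
                  lat_to_stack=[150, 210]))   # mostly-local data
```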
B. Handling heterogeneous cores

Although AMS focuses on asymmetric hierarchies, we must also consider core asymmetry, as NDP cores are typically simpler than processor-die cores. Fortunately, it is easy to extend AMS to handle heterogeneous cores. We combine AMS's model with the cycles-per-instruction (CPI) estimation techniques from PIE [69], which targets heterogeneous cores but assumes a symmetric memory hierarchy.

To map threads across heterogeneous cores, PIE estimates each thread's CPI on different core types. Its model consists of a memory component, estimated with the core's memory-level parallelism (MLP), and a non-memory component, estimated with the core's instruction-level parallelism (ILP).

AMS with PIE models the total cycles spent across core types and LLC sizes. It thus works on core cycle curves instead of memory latency curves. Fig. 10 shows this transformation. The memory component of each curve comes from AMS's latency curve weighted using PIE's estimated MLP, and the non-memory component uses PIE's estimated ILP. This requires collecting non-memory stall cycles using standard hardware counters (as in PIE).

Fig. 10: To handle heterogeneous cores, AMS transforms the latency curves into CPI curves using PIE's performance model (latency curves are weighted by MLP into memory stall curves, then a non-memory component weighted by ILP is added to produce core cycle curves).

Core cycle curves unify asymmetries in both cores and memory hierarchies. They can be transformed into other cost curves as needed (e.g., using time instead of cycles to model cores running at different frequencies).
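A sketch of this transformation under our own assumptions (the exact PIE estimators are in [69]; the MLP/ILP numbers below are illustrative): the latency curve is scaled down by the core's estimated MLP to get memory stall cycles, and a non-memory cycle component derived from the ILP estimate is added.

```python
def core_cycle_curve(latency_curve, mlp, nonmem_cycles):
    """latency_curve: total memory latency (cycles) vs. LLC capacity.
    mlp: PIE-style estimate of memory-level parallelism on this core.
    nonmem_cycles: non-memory cycles from PIE's ILP-based estimate."""
    return [lat / mlp + nonmem_cycles for lat in latency_curve]

# The same thread on a 4-wide processor-die core (higher MLP/ILP)
# vs. a lean 2-wide NDP core. All numbers are made up.
lat_proc = [900_000, 700_000, 400_000, 300_000]   # vs. LLC share
lat_ndp = [600_000] * 4                           # capacity-independent
cycles_proc = core_cycle_curve(lat_proc, mlp=4.0, nonmem_cycles=250_000)
cycles_ndp = core_cycle_curve(lat_ndp, mlp=2.0, nonmem_cycles=500_000)
print(cycles_proc, cycles_ndp)
```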

C. AMS-Greedy: Mapping threads via cache partitioning

Given the cost function (total latency or core cycle) curves of all threads in the system, we can evaluate a schedule by calculating the total cost it incurs. Finding the best mapping in asymmetric hierarchies can be modeled as minimizing the total cost over all possible thread mappings. We present two optimizers for this problem, one based on greedy optimization, and another based on dynamic programming (Sec. V-E).

AMS-Greedy works by performing multiple rounds of cache partitioning. On each round, the algorithm identifies the threads that benefit the least from the deep hierarchy and schedules them away from the processor die. Fig. 11 illustrates AMS-Greedy's algorithm with a 4-thread example.

Fig. 11: An example of how AMS-Greedy schedules 4 threads with different latency curves. (a) Find threads that always want the NDP hierarchy. (b) Partition the LLC to find threads with minimal opportunity cost to use the NDP hierarchy. (c) Schedule threads in the NDP core group ordered by maximum opportunity cost.

AMS-Greedy begins with all threads mapped to the deep hierarchy (the processor die). We denote the cost curves for thread i as $C_i^{proc}(s)$ for the processor die and $C_i^{bestNDP}$ for the best NDP stack. First, AMS-Greedy finds threads that always prefer the shallow hierarchy. These are the threads for which $C_i^{proc}(s) > C_i^{bestNDP}$ across all possible LLC capacities s (e.g., thread 4 in Fig. 11a). AMS-Greedy moves these threads off the processor die.

The remaining threads can benefit from the LLC if they have enough capacity available. But there may not be enough LLC capacity or enough processor-die cores to satisfy all threads. Therefore, AMS-Greedy progressively moves threads to NDP cores, stopping either when the remaining threads have sufficient LLC capacity and cores or when NDP cores fill up. AMS-Greedy uses cache partitioning for this goal. Specifically, it partitions the LLC using the Peekahead algorithm [8] (a linear-time implementation of the quadratic-time UCP Lookahead [56]). AMS-Greedy uses the processor-die cost curves ($C_i^{proc}(s)$ for thread i) to drive the partitioning. This way, the partitioning algorithm finds a set of partition sizes $s_i$ that seeks to minimize total cost ($\sum_i C_i^{proc}(s_i)$). For example, in Fig. 11b, threads 1, 2, and 3 receive partition sizes of 3, 1, and 4 MB.

Intuitively, partitioning naturally finds threads that should give up the processor die. For example, if a thread has no capacity after partitioning the LLC, that means fitting its working set is too costly compared to other options. We should thus move it to an NDP core and let others share the LLC. Thus, AMS-Greedy ranks threads by their opportunity cost, the extra cost they incur when moving to the best NDP core:

$$\text{OpportunityCost}_i = C_i^{bestNDP} - C_i^{proc}(s_i)$$

AMS-Greedy moves all threads with a negative opportunity cost to NDP cores as long as NDP cores are not oversubscribed. These threads have lower cost on NDP cores than with $s_i$ LLC capacity (e.g., thread 2 in Fig. 11b). If there is no such thread but the processor die is still oversubscribed, AMS-Greedy moves the thread with the smallest opportunity cost.

If after a round of partitioning and moving threads there are still more threads than the number of processor-die cores, AMS-Greedy performs another round of partitioning and movement among the remaining threads. This process repeats until the processor die is not oversubscribed.

Finally, AMS-Greedy tries to map the threads on NDP cores to their most favorable stack. Threads are again prioritized by opportunity cost: threads with the largest difference between their latencies in the best and next-best NDP stacks are placed first. For example, in Fig. 11c, thread 4 has a larger opportunity cost than thread 2 and is mapped first.

AMS-Greedy works well because it shares the same goal as partitioning: identifying the threads that benefit the least (or not at all) from the LLC. AMS-Greedy leverages these algorithms to minimize total cost at each round. Greedily moving threads out may not yield the optimal solution because the problem is not convex. Nonetheless, we find AMS-Greedy generates high-quality results because opportunity cost captures the degree of preference accurately.
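The sketch below captures AMS-Greedy's control flow end to end. It is runnable but simplified: a marginal-utility greedy allocator stands in for the Peekahead/Lookahead partitioners [8, 56], the final per-stack placement step is omitted, and the cost curves are toy numbers.

```python
def partition_llc(curves, segments):
    """Stand-in partitioner: repeatedly give one LLC segment to the
    thread whose cost drops the most (cf. UCP Lookahead [56])."""
    alloc = {t: 0 for t in curves}
    for _ in range(segments):
        gain = {t: c[alloc[t]] - c[alloc[t] + 1]
                for t, c in curves.items() if alloc[t] < segments}
        alloc[max(gain, key=gain.get)] += 1
    return alloc

def ams_greedy(c_proc, c_ndp, n_proc, n_ndp, segments):
    """c_proc[t][s]: cost of t on the processor die with s LLC segments
    (s = 0..segments); c_ndp[t]: cost of t on its best NDP stack."""
    proc, ndp = set(c_proc), set()

    def move(t):
        proc.discard(t)
        ndp.add(t)

    # Round 0: threads that prefer NDP at every LLC size leave the die.
    for t in list(proc):
        if len(ndp) < n_ndp and all(c > c_ndp[t] for c in c_proc[t]):
            move(t)

    # Rounds of partitioning: evict threads by opportunity cost until
    # the processor die is not oversubscribed or NDP cores fill up.
    while proc and len(ndp) < n_ndp:
        alloc = partition_llc({t: c_proc[t] for t in proc}, segments)
        opp = {t: c_ndp[t] - c_proc[t][alloc[t]] for t in proc}
        movers = [t for t in proc if opp[t] < 0]   # better off on NDP
        if not movers:
            if len(proc) <= n_proc:
                break                              # everyone left fits
            movers = [min(proc, key=opp.get)]      # least to lose
        for t in sorted(movers, key=opp.get):
            if len(ndp) < n_ndp:
                move(t)
    return proc, ndp

# Toy example: 3 threads, 4 LLC segments, 2 cores per group.
c_proc = {"A": [900, 500, 300, 250, 240],   # cache-friendly
          "B": [800, 780, 760, 740, 720],   # streaming
          "C": [600, 580, 400, 380, 370]}   # cache-fitting
c_ndp = {"A": 700, "B": 480, "C": 450}
print(ams_greedy(c_proc, c_ndp, n_proc=2, n_ndp=2, segments=4))
# A stays on the processor die; B and C move to NDP cores.
```

A final step, omitted here, would order the NDP group by the cost gap between each thread's best and next-best stacks and place the most sensitive threads first.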
AMS-Greedy also scales well: its runtime complexity is $O(N^2 S)$, where N is the number of threads and S is the number of LLC segments (O(NS) per round of cache partitioning [8], and up to N rounds).

Prior work has also leveraged partitioning algorithms for other purposes, such as tailoring the cache hierarchy to each application [67] and performing dynamic data replication in NUCA caches [68]. AMS-Greedy shares similar insights in using miss curves and partitioning, but it focuses on scheduling in asymmetric systems and does not partition the cache.

D. Handling multithreaded workloads

We have so far considered only single-threaded processes. AMS can handle multithreaded processes with extra UMONs and simple modifications to AMS-Greedy.

Multithreaded workloads share data within the process, so per-core UMONs may overestimate the size of the working set. To solve this, AMS adds an extra UMON per core to profile shared data. To distinguish private vs. shared data, we leverage the per-page data classification scheme used for coherence (Sec. III-B). Cache misses to private data are sampled to the per-core UMON, and misses to shared data are sampled to a UMON shared by all threads in the process (although this UMON is not local to the core, this imposes negligible traffic because only 1% of the misses are sampled).

Using the number of private cache misses to thread-private data and shared data, AMS first classifies processes as thread-private-intensive or shared-intensive. AMS then treats thread-private-intensive processes as multiple independent threads. This is sensible because these processes have little data sharing, so they behave similarly to multiple single-threaded processes.

By contrast, AMS-Greedy groups all the threads of each shared-intensive process into a single unit when making decisions. The algorithm considers the miss curve for shared data only, and performs placement decisions for all its threads at once (considering the opportunity cost of all threads). This ensures that threads that share data intensively stay together.

NDP cores access shared read-write data pages through the LLC for coherence. This makes the processor die preferable under our model for workloads dominated by shared read-write data. However, many multithreaded applications are well-structured: threads write mostly disjoint data and mainly use thread-private or shared read-only pages. These applications often prefer NDP cores.

E. AMS-DP: Mapping threads via dynamic programming

Dynamic programming (DP) [1, 17] is an optimization technique that solves a problem recursively, by dividing it into smaller subproblems. Each subproblem is solved only once, and its result is memoized and reused whenever the subproblem is encountered again. Memoization lets DP explore the full space of possible choices efficiently. Because DP considers all possible choices, it finds the globally optimal solution. By contrast, greedy algorithms take locally optimal decisions but may end up with a globally suboptimal one. However, not all problems are amenable to DP: the problem must have the property that an optimal solution can be computed efficiently given optimal solutions of its subproblems. Often, the difficulty lies in casting the problem in a way that meets this property.

Our second AMS variant, AMS-DP, leverages dynamic programming to find the optimal solution. We again exploit the similarities between scheduling and cache partitioning by building on Sasinowski et al. [59], who show that DP can solve cache partitioning optimally in polynomial time.

Cache partitioning can be solved with DP because it has discrete decisions, at the size of cache segments (e.g., cache ways or lines). This property allows dividing the partitioning problem into subproblems. For example, partitioning a 4 MB cache among eight threads can be divided into partitioning two caches (e.g., of 2 MB each, or of 1 MB + 3 MB) among two groups of four threads. The smallest subproblem is just allocating some amount of capacity to a single thread.

Similarly, scheduling threads to cores also has discrete decisions. One thread can occupy only one core and leave the rest to other threads. This property allows dividing a scheduling problem into subproblems that schedule smaller groups of threads across smaller systems. The smallest subproblem is scheduling a thread to a single core, given $C_i^{NDP}$, $C_i^{proc}(s)$, and some amount s of remaining LLC capacity.

Our insight is that since these two problems have discrete decisions, we can combine them and solve a bigger DP problem that partitions the cache and schedules threads at the same time, which closely matches scheduling in asymmetric systems as we discussed. Thus, solving this DP problem leads to the optimal partitioning and scheduling. For the rest of the section, we use the same terminology as Sasinowski et al. [59]. See [1] for more details on DP.

The key recurrence relation that lets Sasinowski et al. use DP is as follows. If $M_{i,j}$ is the minimum cost achieved by partitioning j segments among the first i threads, and $C_i(s_i)$ is the cost of the i-th thread when allocated $s_i$ segments, then:

$$M_{i,j} = \min_{s_i} \{ M_{i-1,\, j-s_i} + C_i(s_i) \}$$

This recurrence shows that the minimum cost $M_{i,j}$ is the minimum over all possible combinations of subproblems: the cost of thread i with $s_i$ cache segments plus the minimum cost of using $j-s_i$ cache segments for the first $i-1$ threads. By solving each $M_{i,j}$ bottom-up, we reach the optimal partitioning at $M_{N,S}$, where N is the number of threads and S is the number of cache segments in the system.
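This base recurrence is easy to run directly; the sketch below (our code, with illustrative curves) solves it bottom-up and walks the choice table back to recover each thread's allocation.

```python
def dp_partition(curves, S):
    """curves[i][s]: cost of thread i with s segments (s = 0..S).
    Returns (minimum total cost, per-thread allocation)."""
    N = len(curves)
    INF = float("inf")
    M = [[INF] * (S + 1) for _ in range(N + 1)]
    choice = [[0] * (S + 1) for _ in range(N + 1)]
    M[0] = [0] * (S + 1)            # no threads: zero cost at any budget
    for i in range(1, N + 1):
        for j in range(S + 1):
            for s in range(j + 1):  # segments given to thread i
                c = M[i - 1][j - s] + curves[i - 1][s]
                if c < M[i][j]:
                    M[i][j], choice[i][j] = c, s
    # Walk back through the choice table to recover allocations.
    alloc, j = [0] * N, S
    for i in range(N, 0, -1):
        alloc[i - 1] = choice[i][j]
        j -= alloc[i - 1]
    return M[N][S], alloc

curves = [[90, 50, 30, 25, 24],     # thread 0: cache-friendly
          [80, 78, 76, 74, 72],     # thread 1: streaming
          [60, 58, 40, 38, 37]]     # thread 2: cache-fitting
print(dp_partition(curves, S=4))    # (150, [2, 0, 2])
```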
By solvng each M, j bottom-up, we reach the optmal parttonng for M N,S, where N s the number of threads and S s the number of cache segments n the system. In our case, we want to not only partton the cache (conceptually, to prevent cache contenton) but also to schedule threads. Therefore, we extend the recurrence by addng dmensons for processor-de cores and NDP cores. Cores are just another type of dscrete resource to allocate. However, dfferent cores, even NDP cores n dfferent stacks, should be treated dfferently. Suppose the system has one processor de and one NDP stack. We defne M, j,kproc,k nd p as the mnmum cost when parttonng j cache segments to the frst threads and schedulng them wth exactly k proc processor-de cores and k nd p NDP cores. The recurrence above becomes: M, j,kproc,k nd p = mn{mn{m 1, j s,k s proc 1,k nd p +C proc (s )}, M 1, j,kproc,k nd p 1+C NDP } Ths recurrence states that, f we were to schedule the th thread on a processor-de core, we can allocate some LLC capacty s to t and leave the remanng capacty to the frst 1 threads. Ths decson makes total cost to be the cost C proc (s ) for thread plus the mnmum cost of schedulng the frst 1 threads wth j s capacty. If the thread s nstead scheduled on an NDP core, t takes no LLC capacty, and ncurs cost C NDP for the thread plus the mnmum cost to schedule the frst 1 threads wth 1 fewer NDP core avalable. Fnally, the mnmum cost to schedule threads s smply the mnmum of those two schedulng choces. Ths recurrence consders a sngle NDP stack, but addng more stacks as extra dmensons s straghtforward (e.g., M, j,kproc,k nd p,1,k nd p,2 wth two stacks). Ths lets each thread use ts dfferent costs to each stack. Usng ths recurrence, AMS-DP performs standard bottomup DP to fnd the optmal thread-to-core mappng. Whle conceptually smple, AMS-DP scales poorly: every group k (.e., processor-de or NDP stack) adds a new dmenson to the DP algorthm. Ths causes O(N S k proc t k nd p,t ) runnng tme, where N s the number of threads and S s the number of cache segments. Thus, AMS-DP s practcal only n small systems. On larger systems, AMS-DP serves as the upper bound, but smpler technques lke AMS-Greedy are needed. F. Dscusson Our evaluaton focuses on long-runnng, memory-ntensve batch workloads, but AMS should work n other scenaros wth mnor changes. Frst, n oversubscrbed systems wth more 8

F. Discussion

Our evaluation focuses on long-running, memory-intensive batch workloads, but AMS should work in other scenarios with minor changes. First, in oversubscribed systems with more runnable threads than cores, AMS only needs to consider the active threads in each quantum. A thread's miss curve can be saved when it is descheduled so that the thread can be mapped to the right core when it is rescheduled later. Second, kernel threads and short-lived threads or processes can evict any long-running thread in the system. Since they run for a fraction of the scheduling quantum, their impact is minimal. Finally, to handle latency-critical workloads with real-time needs, AMS can be combined with techniques that partition the cache to maintain SLOs instead of maximizing throughput, such as Heracles [48] or Ubik [41].

VI. DATA PLACEMENT FOR ASYMMETRIC HIERARCHIES

NDP cores are most effective when they access their local memory stack. This requires adopting a data placement scheme that minimizes remote accesses. Data placement is a widely studied topic in non-uniform memory access (NUMA) systems. Prior work [13, 21, 70] has proposed various data migration and replication techniques to reduce remote accesses. Other NUMA work [10, 46] focuses on balancing available bandwidth among applications.

The key difference between NUMA and NDP systems is bandwidth. Because memory bandwidth is scarce, NUMA systems are limited by bandwidth to local memory, and prior work finds that it is important to spread pages evenly across NUMA nodes to reduce bandwidth contention [21, 46]. By contrast, NDP systems suffer a different problem: the NDP cores in each stack enjoy plentiful bandwidth to the memory stacked directly atop them [26], but the bandwidth across stacks is very limited [34]. In this case, it is more important to reduce inter-stack traffic than intra-stack traffic, so the key constraint is ensuring that NDP cores have local accesses.

Since relocating pages is expensive, our data placement algorithm avoids migrating pages and uses simple heuristics to keep data local. Its goal is to keep pages from the same thread in as few stacks as possible, so that NDP cores have mostly local accesses. When a thread starts, the system builds up a dynamic preference list of memory stacks in the order from which we fulfill memory allocations. This preference list is refreshed when a memory stack is depleted.

When a new thread starts, AMS picks the memory stack with the greatest remaining capacity as the most preferred source. This ensures that threads that can benefit from the shallow hierarchy are able to leverage it and those that prefer the deep hierarchy are not penalized. Next on the list are the nearby stacks. In Fig. 1, these are those on the same side of the processor die. In the example in Fig. 12, an NDP-friendly application can be scheduled on an NDP core in stacks 1 or 2 to have high-bandwidth and low-latency accesses. Finally, if stacks on one side are exhausted, we allocate pages to stacks on the opposite side of the chip.

Fig. 12: Example data placement preference list over the system's NDP stacks (cores+DRAM). Memory stacks with free pages are shown in blue and full stacks in red.

Adopting more sophisticated data placement techniques as in prior NUMA work [21, 46] could increase AMS's benefits. For example, the system could dynamically migrate data to reduce cross-stack accesses from NDP threads. These techniques are orthogonal to AMS, so we leave them to future work.
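A minimal sketch of this heuristic (the names and topology are ours): build the preference list when a thread starts, then serve each page allocation from the first stack on the list that still has free capacity.

```python
def preference_list(stacks, free, side):
    """stacks: stack ids; free[s]: free pages; side[s]: which side of
    the processor die the stack sits on ('L' or 'R')."""
    head = max(stacks, key=lambda s: free[s])   # emptiest stack first
    near = [s for s in stacks if s != head and side[s] == side[head]]
    far = [s for s in stacks if side[s] != side[head]]
    return [head] + near + far

def allocate_page(prefs, free):
    """Allocate one page from the first stack with free capacity."""
    for s in prefs:
        if free[s] > 0:
            free[s] -= 1
            return s
    raise MemoryError("all stacks are full")

free = {1: 400, 2: 900, 3: 0, 4: 700}
side = {1: "L", 2: "L", 3: "R", 4: "R"}
prefs = preference_list([1, 2, 3, 4], free, side)
print(prefs)                                   # [2, 1, 3, 4]
print([allocate_page(prefs, free) for _ in range(3)])
```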
VII. EVALUATION

A. Methodology

Modeled system: We perform microarchitectural, execution-driven simulation using zsim [58], and model a 16-core system. The processor die has 8 cores, with private 32 KB L1 and 256 KB L2 caches. All 8 cores share a 16 MB LLC that uses the TA-DRRIP [38] thread-aware replacement policy. The processor die is connected to four NDP stacks via SerDes links, as shown in Fig. 1. Each stack has 4 GB of DRAM and 2 NDP cores with only private caches. Table I details the system's configuration.

TABLE I: CONFIGURATION OF THE SIMULATED SYSTEM.
Cores: 16 cores (8 processor die + 8 NDP), x86-64, 2.5 GHz. Silvermont-like OOO [40]: 8B-wide fetch, 2-level bpred (BHSRs + 1K 2-bit PHT), 2-way issue, 36-entry IQ, 32-entry ROB, 32-entry LQ/SQ. Haswell-like OOO [29]: 16B-wide fetch, 2-level bpred (1K 18-bit BHSRs + 4K 2-bit PHT), 4-way issue, 60-entry IQ, 192-entry ROB, 72-entry LQ, 42-entry SQ
L1 caches: 32 KB, 8-way set-associative, split data and instruction caches, 3-cycle latency; 15/33 pJ per hit/miss [51]
L2 caches: 256 KB private per-core, 8-way set-associative, inclusive, 7-cycle latency; 46/93 pJ per hit/miss [51]
Coherence: MESI, 64 B lines, no silent drops; sequential consistency
Last-level cache: 16 MB, 2 MB bank per core, 32-way set-associative, inclusive, 30-cycle latency, TA-DRRIP [38] replacement; 945/1940 pJ per hit/miss [51]
Stacked DRAM: 4 GB die, HMC 2.0-like organization, 8 vaults per stack, 64-bit data bus, 6-cycle all-to-all crossbar in the logic layer [36], 2 pJ/bit internal, 8 pJ/bit logic layer [25, 73]
Stack links: 16 GBps bidirectional, 10-cycle latency, including 3.2 ns for SerDes [43], 2 pJ/bit [43, 55]
3D DRAM timings: tCK = 1.6 ns, tCAS = 11.2 ns, tRCD = 11.2 ns, tRAS = 22.4 ns, tRP = 11.2 ns, tWR = 14.4 ns

We consider systems with both homogeneous and heterogeneous cores. Our homogeneous-core system (Secs. VII-B to VII-D) uses 2-wide OOO cores similar to Silvermont [40]. Our heterogeneous-core system (Sec. VII-E) instead uses 4-wide OOO cores similar to Haswell [29] in the processor die.

Schedulers: We first compare AMS-Greedy against three simple schedulers in Sec. VII-B and Sec. VII-C. First, we use Random scheduling as the baseline to which we compare other schedulers. This is a better baseline than the WAS (worst application scheduler) baseline in prior work [37, 69]. Second, All proc always runs threads on processor-die cores. Third, All NDP always runs threads on NDP cores.

In Sec. VII-D, we compare AMS-Greedy against AMS-DP and a more sophisticated scheduler, CRUISE-NDP. We derive CRUISE-NDP by adapting CRUISE [37] to asymmetric hierarchies. Each scheduling quantum, CRUISE-NDP classifies threads as either cache-insensitive, cache-friendly, cache-fitting, or thrashing using the same heuristics as CRUISE (all the necessary information for CRUISE is gathered using UMONs too). CRUISE-NDP then maps thrashing threads to NDP cores, cache-friendly and fitting threads to processor-die cores (prioritizing friendly over fitting), and finally backfills the remaining cores with insensitive threads.
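To illustrate how a classification-based scheduler differs from AMS, here is a toy version of CRUISE-NDP's decision rule built on the same miss curves. The thresholds are invented for the example (CRUISE's actual heuristics are in [37]).

```python
def classify(miss_curve):
    """Toy four-class classifier over a miss curve (thresholds made up)."""
    a = miss_curve[0]                          # accesses = misses at size 0
    saved = (a - miss_curve[-1]) / max(a, 1)   # fraction the full LLC saves
    if saved < 0.05:
        return "insensitive" if a < 100 else "thrashing"
    # "fitting": most of the benefit is already reached at half the LLC
    at_half = (a - miss_curve[len(miss_curve) // 2]) / (a - miss_curve[-1])
    return "fitting" if at_half > 0.9 else "friendly"

def cruise_ndp(curves, n_proc):
    """Friendly/fitting threads get the processor die, thrashing ones go
    to NDP cores, and insensitive threads backfill whatever is left."""
    rank = {"friendly": 0, "fitting": 1, "insensitive": 2, "thrashing": 3}
    ranked = sorted(curves, key=lambda t: rank[classify(curves[t])])
    return set(ranked[:n_proc]), set(ranked[n_proc:])   # die, NDP

curves = {"A": [900, 500, 300, 250, 240],   # cache-friendly
          "B": [800, 780, 760, 740, 720]}   # streaming
print(cruise_ndp(curves, n_proc=1))
```

Note how the fixed thresholds collapse each curve into a class and discard the degree of preference; a streaming thread near a class boundary (B above) can end up on the die, which is exactly the brittleness under contention that Sec. IV described and that AMS's opportunity costs avoid.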

We model migration overheads and find remapping every 5 ms causes negligible overheads, similar to PIE's findings.

Workloads: Our workload setup mirrors prior work [9]. We simulate mixes of SPEC CPU2006 apps and multithreaded apps from SPEC OMP2012 and PARSEC [11]. We evaluate scheduling policies under 50% load (8 cores occupied) and 100% load (16 cores occupied). We use the 18 SPEC CPU2006 apps with >5 L2 MPKI (as in Sec. IV) and fast-forward all apps in each mix for 3 B instructions. We use a fixed-work methodology and equalize sample lengths to avoid sample imbalance, similar to FIESTA [32]. Each application is then simulated for 2 B instructions. Each experiment runs the mix until all apps execute at least 2 B instructions, and we consider only the first 2 B instructions of each app to report performance. For multithreaded apps, since IPC is not a valid measure of work [5], to perform a fixed amount of work we instrument each application with heartbeats that report global progress (e.g., when each timestep or transaction finishes) and run each application for as many heartbeats as All proc completes in 2 B cycles after the serial region.

Metrics and energy model: We follow prior work in scheduling techniques and use weighted speedup [63] as our performance metric. We use McPAT 1.3 [47] to derive the energy of cores at 22 nm, and CACTI [51] for caches at 22 nm. We model the energy of 3D-stacked DRAM using numbers reported in prior work [34, 43, 60]. Dynamic energy for NDP accesses is about 10 pJ/bit. We assume that each SerDes link consumes 2 pJ/bit [43, 55].

B. AMS finds the right hierarchy

We first evaluate AMS-Greedy in an undercommitted system with homogeneous cores (8 apps on 16 Silvermont cores) to focus on the effect of memory asymmetry. Fig. 13a shows the distribution of weighted speedups over 40 mixes of 8 randomly chosen memory-intensive SPEC CPU2006 apps. Each line shows the results for a single scheduler over the Random baseline. For each scheme, workload mixes (the x-axis) are sorted according to the improvement achieved.

Fig. 13: Simulation results for different schedulers on 8-app mixes: (a) weighted speedup, (b) normalized memory accesses, (c) normalized cross-stack traffic, and (d) dynamic data movement energy, broken into private caches, shared LLC, memory, and links.

In each mix, different applications want different hierarchies. All proc improves only 7 mixes and hurts performance on the other 33 because it never leverages the NDP capability of the asymmetric system. On average, its weighted speedup is 8% worse than the Random baseline. All NDP benefits some applications significantly (e.g., soplex in Fig. 5). However, it sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. On average, All NDP improves weighted speedup by 9% over the baseline. AMS-Greedy finds the best hierarchy for each application and schedules them accordingly. It thus never hurts performance and improves weighted speedup by up to 37% and by 18% on average over the Random baseline.

AMS-Greedy achieves significant speedups because it leverages both hierarchies efficiently. Figs. 13b-d give more insight on this.
AMS-Greedy uses the LLC as effectively as All proc and reduces memory accesses by 26% over the baseline (Fig. 13b). AMS-Greedy also schedules applications to leverage the system's NDP cores when beneficial. It thus eliminates 80% of the cross-stack traffic (Fig. 13c). Overall, AMS-Greedy reduces dynamic data movement energy by 25% over the baseline, while All proc increases it by 2% and All NDP reduces it by 18% (Fig. 13d).

C. AMS adapts to application phases

Next, we show how AMS-Greedy adapts to phase changes by examining a 4-app mix. In this workload, we include two applications, astar and xalancbmk, that have distinct memory behaviors across two long phases, and two other applications, bzip2 and sphinx3, that have short and fine-grained variations over time. To observe time-varying behavior, we simulate this mix for 250 Bcycles.

Fig. 14a shows the traces of scheduling and capacity allocation decisions of AMS-Greedy for all four apps. The upper two traces show that AMS-Greedy takes different decisions for astar and xalancbmk before and after 100 Bcycles. Before 100 Bcycles, astar is mapped to the processor die and xalancbmk is mapped to an NDP core. After 100 Bcycles, both change the hierarchy they prefer. The other two apps fluctuate more, but prefer the deep hierarchy more often.

To explain this phenomenon, Fig. 14b and Fig. 14c show the sampled miss curves for astar and xalancbmk at 50 and 150 Bcycles. At 50 Bcycles, astar has a small working set (the sharp drop around 1 MB), but xalancbmk has a large working set (14 MB). Therefore, AMS-Greedy fits the working sets of astar, bzip2, and sphinx3 in the LLC, and schedules xalancbmk to an NDP core because it prefers the shallow hierarchy when capacity is limited. At 150 Bcycles,

More information