Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior


Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter
yoonguk@ece.cmu.edu papamix@cs.cmu.edu onur@cmu.edu harchol@cs.cmu.edu
Carnegie Mellon University

Abstract — In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput; 2) we introduce a niceness metric that captures a thread's propensity to interfere with other threads; 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such light threads only use a small fraction of the total available memory bandwidth.
On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).

1. Introduction

High latency of off-chip memory accesses has long been a critical bottleneck in thread performance. This has been further exacerbated in chip-multiprocessors where memory is shared among concurrently executing threads; when a thread accesses memory, it contends with other threads and, as a result, can be slowed down compared to when it has the memory entirely to itself. Inter-thread memory contention, if not properly managed, can have devastating effects on individual thread performance as well as overall system throughput, leading to system underutilization and potentially thread starvation [11]. The effectiveness of a memory scheduling algorithm is commonly evaluated based on two objectives: fairness [1, 13, 1] and system throughput [1, 13, 5]. On the one hand, no single thread should be disproportionately slowed down, while on the other hand, the throughput of the overall system should remain high. Intuitively, fairness and high system throughput ensure that all threads progress at a relatively even and fast pace. Previously proposed memory scheduling algorithms are biased towards either fairness or system throughput. In one extreme, by trying to equalize the amount of bandwidth each thread receives, some notion of fairness can be achieved, but at a large expense to system throughput [1].
In the opposite extreme, by strictly prioritizing certain favorable (memory-non-intensive) threads over all other threads, system throughput can be increased, but at a large expense to fairness [5]. As a result, such relatively single-faceted approaches cannot provide the highest fairness and system throughput at the same time. Our new scheduling algorithm exploits differences in threads' memory access behavior to optimize for both system throughput and fairness, based on several key observations. First, prior studies have demonstrated the system throughput benefits of prioritizing light (i.e., memory-non-intensive) threads over heavy (i.e., memory-intensive) threads [5, 1, ]. Memory-non-intensive threads only seldom generate memory requests and have greater potential for making fast progress in the processor. Therefore, to maximize system throughput, it is clear that a memory scheduling algorithm should prioritize memory-non-intensive threads. Doing so also does not degrade fairness because light threads rarely interfere with heavy threads. Second, we observe that unfairness problems usually stem from interference among memory-intensive threads. The most memory-intensive threads become vulnerable to starvation when less memory-intensive threads are statically prioritized over them (e.g., by forming a priority order based on a metric that corresponds to memory intensity, as done in [5]). As a result, the most memory-intensive threads can experience disproportionately large slowdowns which lead to unfairness. Third, we observe that periodically shuffling the priority order among memory-intensive threads allows each thread a chance to gain prioritized access to the memory banks, thereby reducing unfairness. However, how to best perform the shuffling is not obvious. We find that shuffling in a symmetric manner, which gives each thread equal possibility to be at all priority levels, causes unfairness because not all threads are equal in terms of their propensity to interfere with others; some threads are more likely to slow down other threads.
Hence, thread priority order should be shuffled such that threads with higher propensity to interfere with others have a smaller chance of being at higher priority. Finally, as previous work has shown, it is desirable that scheduling decisions are made in a synchronized manner across all banks [5, 1, ], so that concurrent requests of each thread are serviced in parallel, without being serialized due to interference from other threads.

Overview of Mechanism. Based on the above observations, we propose Thread Cluster Memory scheduling (TCM), an algorithm that detects and exploits differences in memory access behavior across threads. TCM dynamically groups threads into two clusters based on their memory intensity: a latency-sensitive cluster comprising memory-non-intensive threads and a bandwidth-sensitive cluster comprising memory-intensive threads. Threads in the latency-sensitive cluster are always prioritized over threads in the bandwidth-sensitive cluster to maximize system throughput. To ensure that no thread is disproportionately slowed down, TCM periodically shuffles the priority order among threads in the bandwidth-sensitive cluster. TCM's intelligent shuffling algorithm ensures that threads that are likely to slow down others spend less time at higher priority levels, thereby reducing the probability of large slowdowns. By having a sufficiently long shuffling period and performing shuffling in a synchronized manner across all banks, threads are able to exploit both row-buffer locality and bank-level parallelism. Combined, these mechanisms allow TCM to outperform any previously proposed memory scheduler in terms of both fairness and system throughput.

Contributions. In this paper, we make the following contributions: We introduce the notion of thread clusters for memory scheduling, which are groups of threads with similar memory intensity. We show that by dynamically dividing threads into two separate clusters (latency-sensitive and bandwidth-sensitive), a memory scheduling algorithm can satisfy the disparate memory needs of both clusters simultaneously. We propose a simple, dynamic clustering algorithm that serves this purpose. We show that threads in different clusters should be treated differently to maximize both system throughput and fairness. We observe that prioritizing latency-sensitive threads leads to high system throughput, while periodically perturbing the prioritization order among bandwidth-sensitive threads is critical for fairness.
We propose a new metric for characterizing a thread's memory access behavior, called niceness, which reflects a thread's susceptibility to interference from other threads. We observe that threads with high row-buffer locality are less nice to others, whereas threads with high bank-level parallelism are nicer, and monitor these metrics to compute thread niceness. Based on the proposed notion of niceness, we introduce a shuffling algorithm, called insertion shuffle, which periodically perturbs the priority ordering of threads in the bandwidth-sensitive cluster in a way that minimizes inter-thread interference by ensuring nicer threads are prioritized more often over others. This reduces unfairness within the bandwidth-sensitive cluster. We compare TCM against four previously proposed memory scheduling algorithms and show that it outperforms all existing memory schedulers in terms of both fairness (maximum slowdown) and system throughput (weighted speedup) for a 24-core system where the results are averaged across 96 workloads of varying levels of memory intensity. Compared to ATLAS [5], the best previous algorithm in terms of system throughput, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6%. Compared to PAR-BS [1], the best previous algorithm in terms of fairness, TCM improves system throughput and reduces maximum slowdown by 7.6%/4.6%. We show that TCM is configurable and can be tuned to smoothly and robustly transition between fairness and system throughput goals, something which previous schedulers, optimized for a single goal, are unable to do.

2. Background and Motivation

2.1. Defining Memory Access Behavior

TCM defines a thread's memory access behavior using three components as identified by previous work: memory intensity [5], bank-level parallelism [1], and row-buffer locality [19]. Memory intensity is the frequency at which a thread misses in the last-level cache and generates memory requests. It is measured in the unit of (cache) misses per thousand instructions, or MPKI. Memory is not a monolithic resource but consists of multiple memory banks that can be accessed in parallel.
It is the existence of multiple memory banks and their particular internal organization that give rise to bank-level parallelism and row-buffer locality, respectively. Bank-level parallelism (BLP) of a thread is the average number of banks to which there are outstanding memory requests, when the thread has at least one outstanding request. In the extreme case where a thread concurrently accesses all banks at all times, its bank-level parallelism would be equal to the total number of banks in the memory subsystem. A memory bank is internally organized as a two-dimensional structure consisting of rows and columns. The column is the smallest addressable unit of memory and multiple columns make up a single row. When a thread accesses a particular column within a particular row, the memory bank places that row in a small internal memory called the row-buffer. If a subsequent memory request accesses the same row that is in the row-buffer, it can be serviced much more quickly; this is called a row-buffer hit. The row-buffer locality (RBL) of a thread is the average hit-rate of the row-buffer across all banks.

2.2. Latency- vs. Bandwidth-Sensitive Threads

From a memory intensity perspective, we classify threads into one of two distinct groups: latency-sensitive or bandwidth-sensitive. Latency-sensitive threads spend most of their time at the processor and issue memory requests sparsely. Even though the number of generated memory requests is low, the performance of latency-sensitive threads is very sensitive to the latency of the memory subsystem; every additional cycle spent waiting on memory is a wasted cycle that could have been spent on computation. Bandwidth-sensitive threads experience frequent cache misses and thus spend a large portion of their time waiting on pending memory requests. Therefore, their rate of progress is greatly affected by the throughput of the memory subsystem. Even if a memory request is quickly serviced, subsequent memory requests will once again stall execution.
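The BLP definition above can be made concrete with a small sketch. This is an illustration, not the paper's mechanism; the function name and sampling representation are hypothetical.

```python
# Hypothetical sketch of the BLP definition in Section 2.1: the average
# number of banks with outstanding requests, counted only over sampling
# points where the thread has at least one outstanding request.

def avg_blp(samples):
    """samples: list of sets of bank ids with an outstanding request
    from the thread, one set per sampling point."""
    active = [s for s in samples if s]          # ignore idle samples
    if not active:
        return 0.0
    return sum(len(s) for s in active) / len(active)
```

For example, a thread observed with requests at banks {0, 1, 2} in one sample and bank {3} in another, and idle in a third, has an average BLP of 2.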

2.3. Our Goal: Best of Both System Throughput and Fairness

A multiprogrammed workload can consist of a diverse mix of threads including those which are latency-sensitive or bandwidth-sensitive. A well-designed memory scheduling algorithm should strive to maximize overall system throughput, but at the same time bound the worst case slowdown experienced by any one of the threads. These two goals are often conflicting and form a trade-off between system throughput and fairness. Intuitively, latency-sensitive threads (which cannot tolerate high memory latencies) should be prioritized over others to improve system throughput, while bandwidth-sensitive threads (which can tolerate high memory latencies) should be scheduled in a fairness-aware manner to limit the amount of slowdown they experience. Applying a single memory scheduling policy across all threads, an approach commonly taken by existing memory scheduling algorithms, cannot address the disparate needs of different threads. Therefore, existing algorithms are unable to decouple the system throughput and fairness goals and achieve them simultaneously. To illustrate this problem, Figure 1 compares the unfairness (maximum thread slowdown compared to when run alone on the system) and system throughput (weighted speedup) of four state-of-the-art memory scheduling algorithms (FR-FCFS [19], STFM [13], PAR-BS [1], and ATLAS [5]) averaged over 96 workloads.1

Figure 1. Performance and fairness of state-of-the-art scheduling algorithms (maximum slowdown vs. system throughput). Lower right corner is the ideal operation point.

An ideal memory scheduling algorithm would be placed towards the lower (better fairness) right (better system throughput) part of the plot in Figure 1. Unfortunately, no previous scheduling algorithm achieves the best fairness and the best system throughput at the same time. While PAR-BS provides the best fairness, it has lower system throughput than the highest-performance algorithm, ATLAS.
On the other hand, ATLAS provides the highest system throughput but its maximum slowdown is 55.3% higher than the most fair algorithm, PAR-BS. Hence, existing scheduling algorithms are good at either system throughput or fairness, but not both. Our goal in this paper is to design a memory scheduling algorithm that achieves the best of both worlds: highest system throughput and highest fairness at the same time.

1 Our evaluation methodology and baseline system configuration are described in Section 6.

2.4. Varying Susceptibility of Bandwidth-Sensitive Threads to Interference

We motivate the importance of differentiating between threads' memory access behavior by showing that not all bandwidth-sensitive threads are equal in their vulnerability to interference. To illustrate this point, we ran experiments with two bandwidth-sensitive threads that were specifically constructed to have the same memory intensity, but very different bank-level parallelism and row-buffer locality. As shown in Table 1, the random-access thread has low row-buffer locality and high bank-level parallelism, while the streaming thread has low bank-level parallelism and high row-buffer locality.

                 Memory intensity   Bank-level parallelism   Row-buffer locality
Random-access    High (0 MPKI)      High (7.7% of max.)      Low (0.1%)
Streaming        High (0 MPKI)      Low (0.3% of max.)       High (99%)

Table 1. Two examples of bandwidth-sensitive threads: random-access vs. streaming

Which of the two threads is more prone to large slowdowns when run together? Figure 2 shows the slowdown experienced by these two threads for two different scheduling policies: one where the random-access thread is strictly prioritized over the streaming thread and one where the streaming thread is strictly prioritized over the random-access thread.
Clearly, as shown in Figure 2(b), the random-access thread is more susceptible to being slowed down since it experiences a slowdown of more than 11x when it is deprioritized, which is greater than the slowdown of the streaming thread when it is deprioritized.

Figure 2. Effect of prioritization choices between the random-access thread and the streaming thread: (a) strictly prioritizing the random-access thread, (b) strictly prioritizing the streaming thread.

This is due to two reasons. First, the streaming thread generates a steady stream of requests to a bank at a given time, leading to temporary denial of service to any thread that accesses the same bank. Second, a thread with high bank-level parallelism is more susceptible to memory interference from another thread since a bank conflict leads to the loss of bank-level parallelism, resulting in the serialization of otherwise parallel requests. Therefore, all else being the same, a scheduling algorithm should favor the thread with higher bank-level parallelism when distributing the memory bandwidth among bandwidth-sensitive threads. We will use this insight to develop a new memory scheduling algorithm that intelligently prioritizes between bandwidth-sensitive threads.

3. Mechanism

3.1. Overview of TCM

Clustering Threads. To accommodate the disparate memory needs of concurrently executing threads sharing the memory, TCM dynamically groups threads into two clusters based on their memory intensity: a latency-sensitive cluster containing lower memory intensity threads and a bandwidth-sensitive cluster containing higher memory intensity threads. By employing different scheduling policies within each cluster, TCM is able to decouple the system throughput and fairness goals and optimize for each one separately.

Prioritizing the Latency-Sensitive Cluster. Memory requests from threads in the latency-sensitive cluster are always strictly prioritized over requests from threads in the bandwidth-sensitive cluster. As shown previously [5, 1, ], prioritizing latency-sensitive threads (which access memory infrequently) increases overall system throughput, because they have greater potential for making progress. Servicing memory requests from such light threads allows them to continue with their computation. To avoid starvation issues and ensure sufficient bandwidth is left over for the bandwidth-sensitive cluster, TCM limits the number of threads placed in the latency-sensitive cluster, such that they consume only a small fraction of the total memory bandwidth.

Different Clusters, Different Policies. To achieve high system throughput and to minimize unfairness, TCM employs a different scheduling policy for each cluster. The policy for the latency-sensitive cluster is geared towards high performance and low latency, since threads in that cluster have the greatest potential for making fast progress if their memory requests are serviced promptly. By contrast, the policy for the bandwidth-sensitive cluster is geared towards maximizing fairness, since threads in that cluster have heavy memory bandwidth demand and are susceptible to detrimental slowdowns if not given a sufficient share of the memory bandwidth. Within the latency-sensitive cluster, TCM enforces a strict priority, with the least memory-intensive thread receiving the highest priority.
Such a policy ensures that requests from threads spending most of their time at the processor (i.e., accessing memory infrequently) are always promptly serviced; this allows them to quickly resume their computation and ultimately make large contributions to overall system throughput. Within the bandwidth-sensitive cluster, threads share the remaining memory bandwidth, so that no thread is disproportionately slowed down or, even worse, starved. TCM accomplishes this by periodically shuffling the priority ordering among the threads in the bandwidth-sensitive cluster. To minimize thread slowdown, TCM introduces a new shuffling algorithm, called insertion shuffle, that tries to reduce the amount of inter-thread interference and at the same time maximize row-buffer locality and bank-level parallelism. To monitor inter-thread interference, we introduce a new composite metric, called niceness, which captures both a thread's propensity to cause interference and its susceptibility to interference. TCM monitors the niceness values of threads and adapts its shuffling decisions to ensure that nice threads are more likely to receive higher priority. Niceness and the effects of shuffling algorithms for the bandwidth-sensitive cluster are discussed in Section 3.3.

3.2. Grouping Threads into Two Clusters

TCM periodically ranks all threads based on their memory intensity at fixed-length time intervals called quanta. The least memory-intensive threads are placed in the latency-sensitive cluster while the remaining threads are placed in the bandwidth-sensitive cluster. Throughout each quantum, TCM monitors the memory bandwidth usage of each thread in terms of the memory service time it has received: summed across all banks in the memory subsystem, a thread's memory service time is defined to be the number of cycles that the banks were kept busy servicing its requests. The total memory bandwidth usage is defined to be the sum of each thread's memory bandwidth usage across all threads.
TCM groups threads into two clusters at the beginning of every quantum by using a parameter called ClusterThresh to specify the amount of bandwidth to be consumed by the latency-sensitive cluster (as a fraction of the previous quantum's total memory bandwidth usage). Our experimental results show that for a system with N threads, a ClusterThresh value ranging from 2/N to 6/N, i.e., forming the latency-sensitive cluster such that it consumes 2/N to 6/N of the total memory bandwidth usage, can provide a smooth transition between different good performance-fairness trade-off points. A thorough analysis of the effect of different ClusterThresh values is presented in Section 7.1. Grouping of threads into clusters happens in a synchronized manner across all memory controllers to better exploit bank-level parallelism [5, 1]. In order for all memory controllers to agree upon the same thread clustering, they periodically exchange information, every quantum. The length of our time quantum is set to one million cycles, which, based on experimental results, is short enough to detect phase changes in the memory behavior of threads and long enough to minimize the communication overhead of synchronizing multiple memory controllers. Algorithm 1 shows the pseudocode for the thread clustering algorithm used by TCM.

3.3. Bandwidth-Sensitive Cluster: Fairly Sharing the Memory

Bandwidth-sensitive threads should fairly share memory bandwidth to ensure no single thread is disproportionately slowed down. To achieve this, the thread priority order for the bandwidth-sensitive cluster needs to be periodically shuffled. As mentioned earlier, to preserve bank-level parallelism, this shuffling needs to happen in a synchronized manner across all memory banks, such that at any point in time all banks agree on a global thread priority order.

The Problem with Round-Robin. Shuffling the priority order in a round-robin fashion among bandwidth-sensitive threads would appear to be a simple solution to this problem, but our experiments revealed two problems.
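The quantum-boundary clustering step described above (given as pseudocode in Algorithm 1) can be sketched in Python. The function and variable names here are hypothetical illustrations, not from the paper.

```python
# Hypothetical sketch of TCM's quantum-boundary clustering: admit
# threads into the latency-sensitive cluster in ascending MPKI order
# until they would consume more than ClusterThresh of the previous
# quantum's total bandwidth usage.

def cluster_threads(threads, cluster_thresh):
    """threads: list of (mpki, bw_usage) tuples, indexed by thread id.
    Returns (latency_cluster_ids, bandwidth_cluster_ids)."""
    total_bw = sum(bw for _, bw in threads)
    by_mpki = sorted(range(len(threads)), key=lambda i: threads[i][0])
    latency, sum_bw = [], 0.0
    for j in by_mpki:                       # lowest-MPKI thread first
        sum_bw += threads[j][1]
        if sum_bw <= cluster_thresh * total_bw:
            latency.append(j)
        else:
            break                           # threshold exceeded
    bandwidth = [i for i in range(len(threads)) if i not in latency]
    return latency, bandwidth
```

With three threads whose (MPKI, bandwidth usage) are (1, 10), (50, 60), and (5, 30) and a threshold of 0.5, the two light threads fit under half of the total bandwidth and form the latency-sensitive cluster.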
The first problem is that a round-robin shuffling algorithm is oblivious to inter-thread interference: it is not aware of which threads are more likely

Algorithm 1 Clustering Algorithm

  Initialization:
    LatencyCluster ← ∅; BandwidthCluster ← ∅
    Unclassified ← {thread_i : 1 ≤ i ≤ N_threads}
    SumBW ← 0
  Per-thread parameters:
    MPKI_i: misses per kiloinstruction of thread i
    BWusage_i: bandwidth used by thread i during previous quantum
  Clustering: (beginning of quantum)
    TotalBWusage ← Σ_i BWusage_i
    while Unclassified ≠ ∅ do
      j = arg min_i MPKI_i   // find thread with lowest MPKI
      SumBW ← SumBW + BWusage_j
      if SumBW ≤ ClusterThresh × TotalBWusage then
        Unclassified ← Unclassified − {thread_j}
        LatencyCluster ← LatencyCluster ∪ {thread_j}
      else
        break
      end if
    end while
    BandwidthCluster ← Unclassified

to slow down others. The second problem is more subtle and is tied to the way memory banks handle thread priorities: when choosing which memory request to service next, each bank first considers the requests from the highest priority thread according to the current priority order. If that thread has no requests, then the next highest priority thread is considered and so forth. As a result, a thread does not necessarily have to be at the top priority position to get some of its requests serviced. In other words, memory service leaks from the highest priority levels to lower ones. In fact, in our experiments we often encountered cases where memory service was leaked all the way to the fifth or sixth highest priority thread in a 24-thread system. This memory service leakage effect is the second reason the simple round-robin algorithm performs poorly. In particular, the problem with round-robin is that a thread always maintains its relative position with respect to other threads. This means lucky threads scheduled behind leaky threads will consistently receive more service than other threads that are scheduled behind non-leaky threads, resulting in unfairness. This problem becomes more evident if one considers the different memory access behavior of threads. For instance, a streaming thread that exhibits high row-buffer locality and low bank-level parallelism will severely leak memory service time at all memory banks except for the single bank it is currently accessing.

Thread Niceness and Insertion Shuffle.
To alleviate the problems stemming from memory service leakage and to minimize inter-thread interference, TCM employs a new shuffling algorithm, called insertion shuffle, that reduces memory interference and increases fairness by exploiting heterogeneity in the bank-level parallelism and row-buffer locality among different threads. (The name is derived from the similarity to the insertion sort algorithm. Each intermediate state during an insertion sort corresponds to one of the permutations in insertion shuffle.) We introduce a new metric, called niceness, that captures a thread's propensity to cause interference and its susceptibility to interference.

Algorithm 2 Insertion Shuffling Algorithm

  Definitions:
    N: number of threads in bandwidth-sensitive cluster
    threads[N]: array of bandwidth-sensitive threads; we define a thread's rank as its position in the array (Nth position occupied by highest-ranked thread)
    incsort(i, j): sort subarray threads[i..j] in increasing niceness
    decsort(i, j): sort subarray threads[i..j] in decreasing niceness
  Initialization: (beginning of quantum)
    incsort(1, N)   // nicest thread is highest ranked
  Shuffling: (throughout quantum)
    while true do
      for i = N to 1 do   // each iteration occurs every ShuffleInterval
        decsort(i, N)
      end for
      for i = 1 to N do   // each iteration occurs every ShuffleInterval
        incsort(1, i)
      end for
    end while

A thread with high row-buffer locality is likely to make consecutive accesses to a small number of banks and cause them to be congested. Under such circumstances, another thread with high bank-level parallelism becomes vulnerable to memory interference since it is subject to transient high loads at any of the many banks it is concurrently accessing. Hence, a thread with high bank-level parallelism is fragile (more likely to be interfered with by others), whereas one with high row-buffer locality is hostile (more likely to cause interference to others), as we have empirically demonstrated in Section 2.4. We define a thread's niceness to increase with the relative fragility of a thread and to decrease with its relative hostility.
Within the bandwidth-sensitive cluster, if thread i has the b_i-th highest bank-level parallelism and the r_i-th highest row-buffer locality, we formally define its niceness as follows: Niceness_i ≡ r_i − b_i. Every quantum, threads are sorted based on their niceness value to yield a ranking, where the nicest thread receives the highest rank. Subsequently, every ShuffleInterval cycles, the insertion shuffle algorithm perturbs this ranking in a way that reduces the time during which the least nice threads are prioritized over the nicest threads, ultimately resulting in less interference. Figure 3 visualizes successive permutations of the priority order for both the round-robin and the insertion shuffle algorithms for four threads. It is interesting to note that in the case of insertion shuffle, the least nice thread spends most of its time at the lowest priority position, while the remaining nicer threads are at higher priorities and are thus able to synergistically leak their memory service time among themselves. Algorithm 2 shows the pseudocode for the insertion shuffle algorithm. Note that the pseudocode does not reflect the actual hardware implementation; the implementation is simple because the permutation is regular.

Handling Threads with Similar Behavior. If the bandwidth-sensitive cluster consists of homogeneous

threads with very similar memory behavior, TCM disables insertion shuffle and falls back to random shuffle to prevent unfair treatment of threads based on marginal differences in niceness values.

Figure 3. Visualizing two shuffling algorithms: (a) round-robin shuffle, (b) insertion shuffle.

To do this, TCM inspects whether threads exhibit a sufficient amount of diversity in memory access behavior before applying insertion shuffling. First, TCM calculates the largest difference between any two threads in terms of bank-level parallelism (maxBLP) and row-buffer locality (maxRBL). Second, if both values exceed a certain fraction (ShuffleAlgoThresh) of their maximum attainable values, then insertion shuffling is applied. Specifically, maxBLP must exceed ShuffleAlgoThresh × NumBanks and maxRBL must exceed ShuffleAlgoThresh. In our experiments we set ShuffleAlgoThresh to be 0.1, which intuitively means that TCM falls back to random shuffling if BLP and RBL differ by less than 10% across all threads in the system.

Random Shuffling. When random shuffling is employed, a random permutation of threads is generated every shuffling interval, which serves as the thread ranking for the next shuffling interval. In contrast to insertion shuffling, random shuffling is oblivious to thread niceness and does not follow a predetermined shuffling pattern. Random shuffling is also different from round-robin in that it does not preserve the relative position of threads across shuffles, thereby preventing cases where a nice thread remains stuck behind a highly interfering or malicious thread. The major advantage of random shuffling over insertion shuffling is the significantly lower implementation complexity; it does not require the monitoring of BLP and RBL or the calculation of niceness values for each thread. However, random shuffling pays the penalty of increased unfairness, since it is unable to successfully minimize the interference among heterogeneous threads with large differences in niceness, as we empirically show in Section 7.3. TCM can be forced to always employ random shuffling by setting ShuffleAlgoThresh to 1.
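The niceness-based ranking that seeds insertion shuffle can be sketched as below. This is a hypothetical illustration consistent with the definition in the text (niceness rises with BLP-derived fragility and falls with RBL-derived hostility); the names are not from the paper.

```python
# Hypothetical sketch: compute niceness from per-thread BLP and RBL.
# If a thread has the b-th highest BLP and the r-th highest RBL, its
# niceness is r - b: high BLP (fragile) raises niceness, while high
# RBL (hostile) lowers it.

def niceness_ranking(blp, rbl):
    """blp, rbl: per-thread measurements. Returns thread ids ordered
    from nicest (highest rank) to least nice."""
    n = len(blp)
    b_rank = {t: k + 1 for k, t in
              enumerate(sorted(range(n), key=lambda t: -blp[t]))}
    r_rank = {t: k + 1 for k, t in
              enumerate(sorted(range(n), key=lambda t: -rbl[t]))}
    nice = {t: r_rank[t] - b_rank[t] for t in range(n)}
    return sorted(range(n), key=lambda t: -nice[t])
```

For two threads where thread 0 has high BLP and low RBL (fragile) and thread 1 the opposite (hostile), thread 0 is ranked nicest.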
Section 7.5 provides sensitivity results for ShuffleAlgoThresh; Section 7.3 evaluates the effect of different shuffling algorithms.

3.4. Monitoring Memory Access Behavior of Threads

To implement TCM, the L2 cache and memory controller collect statistics for each thread by continuously monitoring its memory intensity, row-buffer locality (RBL), and bank-level parallelism (BLP) over time. If there are multiple memory controllers, this information is sent to a centralized meta-controller at the end of a quantum, similarly to what is done in [5]. The meta-controller aggregates the information, computes thread clusters and ranks as described previously, and communicates them to each of the memory controllers to ensure that the thread prioritization order is the same in all controllers.

Memory intensity. A thread's L2 MPKI (L2 cache misses per kiloinstruction) is computed at the L2 cache controller and serves as the measure of memory intensity.

Row-buffer locality. Each memory controller estimates the inherent row-buffer locality of a thread. Doing so requires the memory controller to keep track of a shadow row-buffer index [11] for each thread for each bank, which keeps track of the row that would have been open in that bank if the thread were running alone on the system. RBL is simply calculated as the number of shadow row-buffer hits divided by the number of accesses during a quantum.

Bank-level parallelism. Each memory controller counts the number of banks that have at least one memory request from a thread as an estimate of the thread's instantaneous BLP had it been running alone. Throughout a quantum, each controller takes samples of a thread's instantaneous BLP and computes the average BLP for that thread, which is sent to the meta-controller at the end of the quantum. The meta-controller then computes the average BLP for each thread across all memory controllers.

3.5. Summary: Thread Cluster Memory Scheduling (TCM) Prioritization Rules

Algorithm 3 summarizes how TCM prioritizes memory requests from threads.
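The shadow row-buffer bookkeeping used for RBL estimation in Section 3.4 can be sketched as follows. The class and method names are hypothetical; the real mechanism is a small per-thread, per-bank hardware structure.

```python
# Hypothetical sketch of shadow row-buffer tracking: per bank, record
# the row that would be open if the thread ran alone; RBL is shadow
# hits divided by total accesses in the quantum.

class ShadowRowBuffer:
    def __init__(self):
        self.open_row = {}      # bank id -> row that would be open
        self.hits = 0
        self.accesses = 0

    def access(self, bank, row):
        self.accesses += 1
        if self.open_row.get(bank) == row:
            self.hits += 1      # would have been a row-buffer hit
        self.open_row[bank] = row

    def rbl(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

A thread that repeatedly touches the same row of a bank accumulates shadow hits even if, in the shared system, other threads keep closing that row.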
When requests from multiple threads compete to access a bank, the higher ranked thread (where ranking depends on the thread cluster) is prioritized as we have described previously. If two requests share the same priority, row-buffer hit requests are favored. All else being equal, older requests are favored.

3.6. System Software Support

Thread Weights. TCM supports thread weights (or priorities) as assigned by the operating system, such that threads with larger weights are prioritized in the memory. Unlike previous scheduling algorithms, TCM prioritizes a thread based on its weight while also striving to preserve the performance of other threads. Given a thread with a very large thread weight, blindly prioritizing it over all other threads without regard to both its and others'

Algorithm 3: TCM request prioritization
1. Highest-rank first: Requests from higher-ranked threads are prioritized. Latency-sensitive threads are ranked higher than bandwidth-sensitive threads (Section 3.1). Within the latency-sensitive cluster, lower-MPKI threads are ranked higher than others (Section 3.1). Within the bandwidth-sensitive cluster, rank order is determined by insertion shuffling (Section 3.3).
2. Row-hit first: Row-buffer hit requests are prioritized over others.
3. Oldest first: Older requests are prioritized over others.

memory access behavior would lead to destruction of the performance of all other threads and, as a result, severely degrade system throughput and fairness. TCM solves this problem by honoring thread weights within the context of thread clusters. For example, even if the operating system assigns a large weight to a bandwidth-sensitive thread, TCM does not prioritize it over the latency-sensitive threads, because doing so would significantly degrade the performance of all latency-sensitive threads without significantly improving the performance of the higher-weight thread (as latency-sensitive threads rarely interfere with it). To enforce thread weights within the latency-sensitive cluster, TCM scales down each thread's MPKI by its weight. Thus, a thread with a larger weight is more likely to be ranked higher than other latency-sensitive threads because its scaled MPKI appears to be low. Within the bandwidth-sensitive cluster, TCM implements weighted shuffling, where the time a thread spends at the highest priority level is proportional to its weight.

Fairness/Performance Trade-off Knob. TCM's ClusterThresh is exposed to the system software such that the system software can select a value that favors its desired metric. We discuss the effect of ClusterThresh on fairness and performance in Section 7.1.

3.7. Multithreaded Workloads

Multithreaded applications can be broadly categorized into two types: those whose threads execute mostly independently of each other and those whose threads require frequent synchronization.
Since the first type of multithreaded applications resembles, to a certain extent, multiprogrammed workloads, they are expected to perform well under TCM. In contrast, the execution time of the second type of multithreaded applications is determined by slow-running critical threads [, 1, ]. For such applications, TCM can be extended to incorporate the notion of thread criticality to properly identify and prioritize critical threads. Furthermore, we envision TCM to be applicable to composite workloads that consist of an assortment of different applications (e.g., multiple multithreaded applications), by reducing inter-application memory interference.

4. Implementation and Hardware Cost

TCM requires hardware support to 1) monitor threads' memory access behavior and 2) schedule memory requests as described. Table 2 shows the major hardware storage cost incurred in each memory controller to monitor threads' memory access behavior. The required additional storage cost within a controller on our baseline 24-core system is less than 4 Kbits. (If pure random shuffling is employed, it is less than 0.5 Kbits.) TCM requires additional logic to rank threads by aggregating monitored thread metrics. Both ranking and aggregation logic are utilized only at the end of each quantum and are not on the critical path of the processor. Ranking can be implemented using priority encoders, as was done in [5]. At the end of every quantum, a central meta-controller (similar to [5]) gathers data from every memory controller to cluster threads and to calculate niceness. Subsequently, the central meta-controller broadcasts the results to all the memory controllers so that they can make consistent scheduling decisions throughout the next quantum. At any given point in time, each memory controller prioritizes threads according to their ranking (Algorithm 3). Even though the ranking of the bandwidth-sensitive cluster is shuffled, it is consistent across all memory controllers, since shuffling is deterministic and occurs at regular time intervals.
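The storage claim above can be sanity-checked with back-of-the-envelope arithmetic over the counters of Table 2. The specific parameter values below (24 hardware threads, 4 banks per controller, 10-bit MPKI counters, a 64-entry request queue, 16K rows per bank, 16-bit hit counters) are our assumptions for illustration:

```python
import math

# Back-of-the-envelope check of per-controller monitoring storage.
# All parameter values below are illustrative assumptions.
N_THREAD = 24                        # hardware threads in the baseline system
N_BANK = 4                           # banks per memory controller
LOG_MPKI_MAX = 10                    # MPKI counter width
LOG_QUEUE_MAX = int(math.log2(64))   # request-queue occupancy counter width
LOG_N_ROWS = 14                      # log2 of rows per bank (16K rows)
LOG_COUNT_MAX = 16                   # shadow row-buffer hit counter width

bits = {
    "MPKI-counter":            N_THREAD * LOG_MPKI_MAX,
    "Load-counter":            N_THREAD * N_BANK * LOG_QUEUE_MAX,
    "BLP-counter":             N_THREAD * int(math.log2(N_BANK)),
    "BLP-average":             N_THREAD * int(math.log2(N_BANK)),
    "Shadow row-buffer index": N_THREAD * N_BANK * LOG_N_ROWS,
    "Shadow row-buffer hits":  N_THREAD * N_BANK * LOG_COUNT_MAX,
}
total_bits = sum(bits.values())
print(total_bits)  # 3792 bits, i.e. under 4 Kbits per controller
```

Under these assumptions the total comes out just below 4 Kbits, consistent with the text's storage estimate.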
The meta-controller exists only to reduce hardware complexity by consolidating parts of the processing logic at a single location rather than replicating it across separate memory controllers. Although the meta-controller is centralized, it is unlikely to impact scalability since only small amounts of data (a few bytes per hardware context per controller) are exchanged infrequently (once every million cycles). Furthermore, the communication is not latency-critical because the previous ranking can be used in the controllers while the next ranking is being computed or transferred.

5. Related Work: Comparison with Other Memory Schedulers

We describe related work on memory scheduling and qualitatively compare TCM to several previous designs. Section 7 compares TCM quantitatively with four state-of-the-art schedulers [19, 13, 1, 5].

Thread-Unaware Memory Schedulers. Memory controller designs that do not distinguish between different threads [8, 19, 5, 9, 3, 0, 15] have been examined within the context of single-threaded, vector, or streaming architectures. The FR-FCFS scheduling policy [19], which prioritizes row-hit requests over other requests, is commonly employed in existing processors. Recent work [] explored reducing the cost of the FR-FCFS design for accelerators. The goal of these policies is to maximize DRAM throughput. Thread-unaware scheduling policies have been shown to be low-performance and prone to starvation when multiple competing threads share the memory controller in general-purpose multicore/multithreaded systems [11, 1, 18, , 13, 1, 5].

Thread-Aware Memory Schedulers. Recent work designed thread-aware memory schedulers with the goal of improving fairness and providing QoS. Fair queueing memory schedulers [1, 18] adapted variants of the fair queueing algorithm from computer networks

to build a memory scheduler that provides QoS to each thread. The stall-time fair memory scheduler (STFM) [13] uses heuristics to estimate the slowdown of each thread, compared to when it is run alone, and prioritizes the thread that has been slowed down the most. These algorithms aim to maximize fairness, although they can also lead to throughput improvements by improving system utilization.

Memory intensity
  MPKI-counter (monitored by processor) | A thread's cache misses per kilo-instruction | N_thread x log2(MPKI_max) = 240
Bank-level parallelism
  Load-counter | Number of outstanding thread requests to a bank | N_thread x N_bank x log2(Queue_max) = 576
  BLP-counter | Number of banks for which load-counter > 0 | N_thread x log2(N_bank) = 48
  BLP-average | Average value of BLP-counter | N_thread x log2(N_bank) = 48
Row-buffer locality
  Shadow row-buffer index | Index of a thread's last accessed row | N_thread x N_bank x log2(N_rows) = 1344
  Shadow row-buffer hits | Number of row-buffer hits if a thread were running alone | N_thread x N_bank x log2(Count_max) = 1536

Table 2. Storage required for monitoring threads' memory access behavior

Processor pipeline | 128-entry instruction window; fetch/exec/commit width of 3 instructions per cycle in each core; only 1 can be a memory operation
L1 caches | 32 KB per core, 4-way set associative, 32-byte block size
L2 caches | 512 KB per core, 8-way set associative, 32-byte block size
DRAM controller (on-chip) | 128-entry request buffer, 64-entry write data buffer, reads prioritized over writes
DRAM chip parameters | Micron DDR2-800 timing parameters (see []); tCL = 15ns, tRCD = 15ns, tRP = 15ns, BL/2 = 10ns; 4 banks, 2 KB row-buffer per bank
DIMM configuration | Single-rank, 8 DRAM chips put together on a DIMM
Round-trip L2 miss latency | For a 32-byte cache block, uncontended: row-buffer hit 40ns (200 cycles), closed 60ns (300 cycles), conflict 80ns (400 cycles)
Cores and DRAM controllers | 24 cores, 4 independent DRAM controllers (each controller has 6.4 GB/s peak DRAM bandwidth)
Parallelism-aware batch scheduling (PAR-BS) [1] aims to achieve a balance between fairness and throughput. To avoid unfairness, PAR-BS groups memory requests into batches and prioritizes older batches over younger ones. To improve system throughput, PAR-BS prioritizes less-intensive threads over others to exploit bank-level parallelism. As we will show in Section 7, PAR-BS's batching policy implicitly penalizes memory-non-intensive threads because memory-intensive threads usually insert many more requests into a batch, leading to long delays for memory-non-intensive threads and hence relatively low system throughput.

ATLAS [5] aims to maximize system throughput by prioritizing threads that have attained the least service from the memory controllers. However, as shown in [5], this increase in system throughput comes at the cost of fairness, because the most memory-intensive threads receive the lowest priority and incur very high slowdowns.

Ipek et al. [] leverage machine learning techniques to implement memory scheduling policies that maximize DRAM throughput. Zhu and Zhang [7] describe memory scheduling optimizations for SMT processors to improve DRAM throughput. Neither of these considers fairness or system throughput in the presence of competing threads. Lee et al. [] describe a mechanism to adaptively prioritize between prefetch and demand requests in a memory scheduler; their mechanism can be combined with ours.

Table 3. Baseline CMP and memory system configuration

Comparison with TCM. Overall, previous thread-aware memory scheduling algorithms have three major shortcomings, which we address in TCM. First, they are mainly biased towards either fairness or system throughput: no previous algorithm achieves the best system throughput and fairness at the same time. We will show that TCM achieves this by employing multiple different prioritization algorithms, each tailored for system throughput or fairness. Second, previous algorithms do not provide a knob that allows a smooth and gradual trade-off between system throughput and fairness.
TCM's ability to group threads into two clusters with different policies, optimized for fairness or system throughput respectively, allows it to trade off between fairness and system throughput by varying the clustering threshold. Third, previous algorithms do not distinguish different threads' propensity for causing interference to others. As a result, they cannot customize their prioritization policies to the specific needs/behavior of different threads. TCM, by tracking the memory access characteristics of threads, determines a prioritization order that favors threads that are likely to cause less interference to others, leading to improvements in fairness and system throughput.

6. Methodology and Metrics

We evaluate TCM using an in-house cycle-level x86 CMP simulator, the front-end of which is based on Pin [7]. The memory subsystem is modeled using DDR2 timing parameters [], which were verified using DRAMSim [3] and measurements from real hardware. Table 3 shows the major DRAM and processor parameters in the baseline configuration. Unless stated otherwise, we assume a 24-core CMP with 4 memory controllers.

Workloads. We use the SPEC CPU2006 benchmarks for evaluation. We compiled each benchmark using gcc with -O3 optimizations and chose a representative simulation phase using PinPoints [17]. From these benchmarks, we formed multiprogrammed workloads of varying memory intensity, which were run for 20 million cycles.

Table 4. Individual benchmark characteristics (MPKI: misses per kilo-instruction, RBL: row-buffer locality, BLP: bank-level parallelism)

Workload A:
  Memory-intensive: mcf, soplex(2), lbm(2), leslie3d, sphinx3, xalancbmk, omnetpp, astar, hmmer(2)
  Memory-non-intensive: calculix(3), dealII, gcc, gromacs(2), namd, perl, povray, sjeng, tonto
Workload B:
  Memory-intensive: bzip2(2), cactusADM(3), GemsFDTD, h264ref(2), hmmer, libquantum(2), sphinx3
  Memory-non-intensive: gcc(2), gobmk(3), namd(2), perl(3), sjeng, wrf
Workload C:
  Memory-intensive: GemsFDTD(2), libquantum(3), cactusADM, astar, omnetpp, bzip2, soplex(3)
  Memory-non-intensive: calculix(2), dealII(2), gromacs(2), namd, perl(2), povray, tonto, wrf
Workload D:
  Memory-intensive: omnetpp, bzip2(2), h264ref, cactusADM, astar, soplex, lbm(2), leslie3d, xalancbmk(2)
  Memory-non-intensive: calculix, dealII, gcc, gromacs, perl, povray(2), sjeng(2), tonto(3)

Table 5. Four representative workloads (figure in parentheses is the number of instances spawned)

We classify benchmarks based on their memory intensity; benchmarks with an average MPKI greater than one are labeled as memory-intensive, while all other benchmarks are labeled as memory-non-intensive. The memory intensity of a workload is defined as the fraction of memory-intensive benchmarks in that workload. Unless stated otherwise, results are for workloads that are 50% memory-intensive (i.e., consisting of 50% memory-intensive benchmarks). For each memory intensity category (50%, 75% and 100%), we simulate 32 multiprogrammed workloads, for a total of 96 workloads.

Evaluation Metrics. We measure system throughput using weighted speedup [1] and fairness using maximum slowdown. We also report harmonic speedup [8], which measures a balance of fairness and throughput.
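These metrics are computed from each thread's IPC when running in the shared system versus running alone. A small sketch of the three metrics, with made-up per-thread IPC values:

```python
# Evaluation metrics from per-thread IPCs; the IPC values are made up
# purely for illustration.

def weighted_speedup(alone, shared):
    # System throughput: sum of per-thread normalized performance.
    return sum(s / a for a, s in zip(alone, shared))

def harmonic_speedup(alone, shared):
    # Harmonic mean of speedups: balances throughput and fairness.
    return len(alone) / sum(a / s for a, s in zip(alone, shared))

def maximum_slowdown(alone, shared):
    # Fairness: slowdown of the worst-off thread (lower is fairer).
    return max(a / s for a, s in zip(alone, shared))

ipc_alone  = [2.0, 1.0]
ipc_shared = [1.0, 0.25]  # thread 1 is slowed down 4x by interference

ws = weighted_speedup(ipc_alone, ipc_shared)   # 0.5 + 0.25 = 0.75
ms = maximum_slowdown(ipc_alone, ipc_shared)   # max(2.0, 4.0) = 4.0
hs = harmonic_speedup(ipc_alone, ipc_shared)   # 2 / (2.0 + 4.0) = 1/3
```

Note how maximum slowdown isolates the single most-penalized thread, which is why a strict-priority scheduler can score well on weighted speedup yet poorly on this metric.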
Weighted Speedup = sum over threads i of (IPC_i^shared / IPC_i^alone)
Harmonic Speedup = N / (sum over threads i of (IPC_i^alone / IPC_i^shared))
Maximum Slowdown = max over threads i of (IPC_i^alone / IPC_i^shared)

Parameters of Evaluated Schemes. Unless stated otherwise, we use a BatchCap of 5 for PAR-BS [1], a QuantumLength of 10M cycles and a HistoryWeight of 0.875 for ATLAS [5], and a FairnessThreshold of 1.1 and an IntervalLength of 2^24 for STFM [13]. FR-FCFS [19] has no parameters. For TCM, we set ClusterThresh to 4/24, ShuffleInterval to 800, and ShuffleAlgoThresh to 0.1.

7. Results

We compare TCM's performance against four previously proposed memory scheduling algorithms: FR-FCFS [19], STFM [13], PAR-BS [1] (the best previous algorithm for fairness), and ATLAS [5] (the best previous algorithm for system throughput). Figure 4 shows where each scheduling algorithm lies with regard to fairness and system throughput, averaged across all 96 workloads of varying memory intensity. The lower right part of the figure corresponds to better fairness (lower maximum slowdown) and better system throughput (higher weighted speedup). TCM achieves both the best system throughput and the best fairness, outperforming every algorithm with regard to weighted speedup, maximum slowdown, and harmonic speedup (the last shown in Fig. 4(b)).

Figure 4. Performance and fairness of TCM vs. other algorithms across all 96 workloads

Compared to ATLAS, the highest-performance previous algorithm, TCM provides significantly better fairness (38.6% lower maximum slowdown) and better system throughput (4.6% higher weighted speedup). ATLAS suffers from unfairness because it is a strict priority-based scheduling algorithm, where the thread with the lowest priority can access memory only when no other threads have outstanding memory requests to the same bank. As a result, the most deprioritized threads (those which are the most memory-intensive) become vulnerable to starvation and large slowdowns. TCM avoids this problem by using shuffling to ensure that no memory-intensive thread is disproportionately deprioritized. The performance of TCM shown here is for just a single operating point.
As we will show in Section 7.1, TCM provides the flexibility of smoothly transitioning along a wide range of different performance-fairness trade-off points.

Compared to PAR-BS, the most fair previous algorithm, TCM provides significantly better system throughput (7.6% higher weighted speedup) and better fairness (4.6% lower maximum slowdown). PAR-BS suffers from relatively low system throughput, since memory requests from memory-intensive threads can block those from memory-non-intensive threads. PAR-BS periodically forms batches of memory requests and strictly prioritizes older batches. Batch formation implicitly favors memory-intensive threads because such threads have more requests that can be included in the batch. As a result, memory-non-intensive threads are slowed down because their requests (which arrive infrequently) have to wait for the previous batch of requests, mostly full of memory-intensive threads' requests, to be serviced. TCM avoids this problem by ensuring that memory-non-intensive threads are always strictly prioritized over memory-intensive ones.

TCM outperforms STFM in weighted speedup by 11.1% and in maximum slowdown by 3.5%. TCM also outperforms the thread-unaware FR-FCFS in both system throughput and maximum slowdown (50.1% lower). We conclude that TCM provides the best fairness and system performance across all examined previous scheduling algorithms.

Individual Workloads. Figure 5 shows individual results for four randomly selected, representative workloads described in Table 5. We find that the performance and fairness improvements of TCM over all other algorithms are consistent across different workloads.

Figure 5. TCM vs. other algorithms for sample workloads ((a) weighted speedup, (b) maximum slowdown for individual workloads) and averaged across 32 workloads

7.1. Trading off between Performance and Fairness

To study the robustness of each memory scheduler, as well as its ability to adapt to different performance and fairness goals, we varied the most salient configuration parameters of each scheduler.
We evaluated ATLAS for a QuantumLength ranging from 1K (conservative) to 10M cycles (aggressive), PAR-BS for a range of BatchCap values, STFM for a FairnessThreshold ranging from 1 (conservative) to 5 (aggressive), and FR-FCFS (which has no parameters). Finally, for TCM, we vary the ClusterThresh from 2/24 to 6/24 in 1/24 increments. The performance and fairness results are shown in Figure 6. The lower right and upper right parts of Figures 6(a) and 6(b), respectively, correspond to better operating points in terms of both performance and fairness.

In contrast to previous memory scheduling algorithms, TCM exposes a smooth continuum between system throughput and fairness. By adjusting the clustering threshold between the latency- and bandwidth-sensitive clusters, system throughput and fairness can be gently traded off for one another. As a result, TCM has a wide range of balanced operating points that provide both high system throughput and fairness. None of the previously proposed algorithms provides nearly the same degree of flexibility as TCM. For example, ATLAS always remains biased towards system throughput (i.e., its maximum slowdown changes little), regardless of its QuantumLength setting. Similarly, PAR-BS remains biased towards fairness (i.e., its weighted speedup changes little).

For TCM, an aggressive (large) ClusterThresh value provides more bandwidth for the latency-sensitive cluster and allows relatively lighter threads among the bandwidth-sensitive cluster to move into the latency-sensitive cluster. As a result, system throughput is improved, since the lighter threads are prioritized over the heavier threads. But the remaining threads in the bandwidth-sensitive cluster now compete for a smaller fraction of the memory bandwidth and experience larger slowdowns, leading to higher unfairness. In contrast, a conservative (small) ClusterThresh value provides only a small fraction of the memory bandwidth for the latency-sensitive cluster, so that most threads are included in the bandwidth-sensitive cluster and, as a result, take turns sharing the memory.
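The effect of ClusterThresh can be illustrated with a simplified clustering model in which a thread's bandwidth usage is approximated by its MPKI; threads are taken in increasing MPKI order into the latency-sensitive cluster until their combined share of total traffic would exceed the threshold. The function and thread names, and the MPKI-as-bandwidth approximation, are illustrative assumptions:

```python
# Simplified threshold-based clustering sketch. A thread's bandwidth usage
# is approximated here by its MPKI (an assumption for illustration).

def cluster_threads(mpki, cluster_thresh):
    total = sum(mpki.values())
    latency_sensitive, used = [], 0.0
    # Take threads in increasing order of memory intensity...
    for tid in sorted(mpki, key=mpki.get):
        # ...until their combined traffic share would exceed the threshold.
        if used + mpki[tid] > cluster_thresh * total:
            break
        latency_sensitive.append(tid)
        used += mpki[tid]
    bandwidth_sensitive = [t for t in mpki if t not in latency_sensitive]
    return latency_sensitive, bandwidth_sensitive

mpki = {"t0": 0.5, "t1": 1.0, "t2": 30.0, "t3": 70.0}
ls, bs = cluster_threads(mpki, cluster_thresh=4 / 24)
# total traffic = 101.5; budget = (4/24) * 101.5 ~ 16.9
# t0 and t1 (1.5 combined) fit; adding t2 (30.0) would exceed the budget.
```

Raising cluster_thresh pulls more of the lighter bandwidth-sensitive threads into the latency-sensitive cluster (improving throughput), while the heavier leftovers share less bandwidth (hurting fairness), matching the trade-off described above.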
We conclude that TCM provides an effective knob for trading off between fairness and performance, enabling operation at different desirable operating points depending on system requirements.

7.2. Effect of Workload Memory Intensity

Figure 7 compares the performance of TCM to previously proposed scheduling algorithms for four sets of 32 workloads that are 25%, 50%, 75% and 100% memory-intensive. (We include the 25%-intensity workloads for completeness, even though memory is not a large bottleneck for them.) TCM's relative advantage over PAR-BS and ATLAS becomes greater as the workload becomes more memory-intensive and memory becomes more heavily contended. When all the threads in the workload are memory-intensive, TCM provides its largest gains over both PAR-BS and ATLAS, in weighted speedup as well as in maximum slowdown. TCM provides higher gains for very memory-intensive workloads because previous algorithms are either unable to prioritize less memory-intensive threads (due to the batching policy in PAR-BS) or cause severe deprioritization of the most memory-intensive threads (due to strict ranking in ATLAS) in such heavily contended systems.

More information

Internet Traffic Managers

Internet Traffic Managers Internet Traffc Managers Ibrahm Matta matta@cs.bu.edu www.cs.bu.edu/faculty/matta Computer Scence Department Boston Unversty Boston, MA 225 Jont work wth members of the WING group: Azer Bestavros, John

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Real-time Scheduling

Real-time Scheduling Real-tme Schedulng COE718: Embedded System Desgn http://www.ee.ryerson.ca/~courses/coe718/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty Overvew RTX

More information

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control Quantfyng Responsveness of TCP Aggregates by Usng Drect Sequence Spread Spectrum CDMA and Its Applcaton n Congeston Control Mehd Kalantar Department of Electrcal and Computer Engneerng Unversty of Maryland,

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Achieving class-based QoS for transactional workloads

Achieving class-based QoS for transactional workloads Achevng class-based QoS for transactonal workloads Banca Schroeder Mor Harchol-Balter Carnege Mellon Unversty Department of Computer Scence Pttsburgh, PA USA @cs.cmu.edu Arun Iyengar Erch

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm Desgn of a Real Tme FPGA-based Three Dmensonal Postonng Algorthm Nathan G. Johnson-Wllams, Student Member IEEE, Robert S. Myaoka, Member IEEE, Xaol L, Student Member IEEE, Tom K. Lewellen, Fellow IEEE,

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Gateway Algorithm for Fair Bandwidth Sharing

Gateway Algorithm for Fair Bandwidth Sharing Algorm for Far Bandwd Sharng We Y, Rupnder Makkar, Ioanns Lambadars Department of System and Computer Engneerng Carleton Unversty 5 Colonel By Dr., Ottawa, ON KS 5B6, Canada {wy, rup, oanns}@sce.carleton.ca

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms On Achevng Farness n the Jont Allocaton of Buffer and Bandwdth Resources: Prncples and Algorthms Yunka Zhou and Harsh Sethu (correspondng author) Abstract Farness n network traffc management can mprove

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

Efficient QoS Provisioning at the MAC Layer in Heterogeneous Wireless Sensor Networks

Efficient QoS Provisioning at the MAC Layer in Heterogeneous Wireless Sensor Networks Effcent QoS Provsonng at the MAC Layer n Heterogeneous Wreless Sensor Networks M.Soul a,, A.Bouabdallah a, A.E.Kamal b a UMR CNRS 7253 HeuDaSyC, Unversté de Technologe de Compègne, Compègne Cedex F-625,

More information

WITH rapid improvements of wireless technologies,

WITH rapid improvements of wireless technologies, JOURNAL OF SYSTEMS ARCHITECTURE, SPECIAL ISSUE: HIGHLY-RELIABLE CPS, VOL. 00, NO. 0, MONTH 013 1 Adaptve GTS Allocaton n IEEE 80.15.4 for Real-Tme Wreless Sensor Networks Feng Xa, Ruonan Hao, Je L, Naxue

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan #, Praveen Yalagandula, Mke Dahln #, Yn Zhang # # Unversty of Texas at Austn HP Labs Abstract We present, a self-tunng, bandwdth-aware

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

A Genetic Algorithm Based Dynamic Load Balancing Scheme for Heterogeneous Distributed Systems

A Genetic Algorithm Based Dynamic Load Balancing Scheme for Heterogeneous Distributed Systems Proceedngs of the Internatonal Conference on Parallel and Dstrbuted Processng Technques and Applcatons, PDPTA 2008, Las Vegas, Nevada, USA, July 14-17, 2008, 2 Volumes. CSREA Press 2008, ISBN 1-60132-084-1

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment TrpS: Automated Mult-tered Data Placement n a Geo-dstrbuted Cloud Envronment Kwangsung Oh, Abhshek Chandra, and Jon Wessman Department of Computer Scence and Engneerng Unversty of Mnnesota Twn Ctes Mnneapols,

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources On the Farness-Effcency Tradeoff for Packet Processng wth Multple Resources We Wang, Chen Feng, Baochun L, and Ben Lang Department of Electrcal and Computer Engneerng, Unversty of Toronto {wewang, cfeng,

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

Performance Analysis of a Reconfigurable Shared Memory Multiprocessor System for Embedded Applications

Performance Analysis of a Reconfigurable Shared Memory Multiprocessor System for Embedded Applications J. ICT Res. Appl., Vol. 7, No. 1, 213, 15-35 15 Performance Analyss of a Reconfgurable Shared Memory Multprocessor System for Embedded Applcatons Darcy Cook 1 & Ken Ferens 2 1 JCA Electroncs, 118 Kng Edward

More information

QoS-aware composite scheduling using fuzzy proactive and reactive controllers

QoS-aware composite scheduling using fuzzy proactive and reactive controllers Khan et al. EURASIP Journal on Wreless Communcatons and Networkng 2014, 2014:138 http://jwcn.euraspjournals.com/content/2014/1/138 RESEARCH Open Access QoS-aware composte schedulng usng fuzzy proactve

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

A Holistic View of Stream Partitioning Costs

A Holistic View of Stream Partitioning Costs A Holstc Vew of Stream Parttonng Costs Nkos R. Katspoulaks, Alexandros Labrnds, Panos K. Chrysanths Unversty of Pttsburgh Pttsburgh, Pennsylvana, USA {katsp, labrnd, panos}@cs.ptt.edu ABSTRACT Stream processng

More information

Hybrid Job Scheduling Mechanism Using a Backfill-based Multi-queue Strategy in Distributed Grid Computing

Hybrid Job Scheduling Mechanism Using a Backfill-based Multi-queue Strategy in Distributed Grid Computing IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.12 No.9, September 2012 39 Hybrd Job Schedulng Mechansm Usng a Backfll-based Mult-queue Strategy n Dstrbuted Grd Computng Ken Park

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping Mxed-Crtcalty Schedulng on Multprocessors usng Task Groupng Jankang Ren Lnh Th Xuan Phan School of Software Technology, Dalan Unversty of Technology, Chna Computer and Informaton Scence Department, Unversty

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Dynamic Bandwidth Allocation Schemes in Hybrid TDM/WDM Passive Optical Networks

Dynamic Bandwidth Allocation Schemes in Hybrid TDM/WDM Passive Optical Networks Dynamc Bandwdth Allocaton Schemes n Hybrd TDM/WDM Passve Optcal Networks Ahmad R. Dhan, Chad M. Ass, and Abdallah Sham Concorda Insttue for Informaton Systems Engneerng Concorda Unversty, Montreal, Quebec,

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Array transposition in CUDA shared memory

Array transposition in CUDA shared memory Array transposton n CUDA shared memory Mke Gles February 19, 2014 Abstract Ths short note s nspred by some code wrtten by Jeremy Appleyard for the transposton of data through shared memory. I had some

More information

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks MobleGrd: Capacty-aware Topology Control n Moble Ad Hoc Networks Jle Lu, Baochun L Department of Electrcal and Computer Engneerng Unversty of Toronto {jenne,bl}@eecg.toronto.edu Abstract Snce wreless moble

More information

Burst Round Robin as a Proportional-Share Scheduling Algorithm

Burst Round Robin as a Proportional-Share Scheduling Algorithm Burst Round Robn as a Proportonal-Share Schedulng Algorthm Tarek Helmy * Abdelkader Dekdouk ** * College of Computer Scence & Engneerng, Kng Fahd Unversty of Petroleum and Mnerals, Dhahran 31261, Saud

More information

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems

An Investigation into Server Parameter Selection for Hierarchical Fixed Priority Pre-emptive Systems An Investgaton nto Server Parameter Selecton for Herarchcal Fxed Prorty Pre-emptve Systems R.I. Davs and A. Burns Real-Tme Systems Research Group, Department of omputer Scence, Unversty of York, YO10 5DD,

More information