Cache Sharing Management for Performance Fairness in Chip Multiprocessors


Xing Zhou, Wenguang Chen, Weimin Zheng
Dept. of Computer Science and Technology, Tsinghua University, Beijing, China
zhoux07@mails.tsinghua.edu.cn, {cwg, zwm-dcs}@tsinghua.edu.cn

Abstract

Resource sharing can cause unfair and unpredictable performance of concurrently executing applications in Chip Multiprocessors (CMP). The shared last-level cache is one of the most important shared resources because off-chip request latency may take a significant part of total execution cycles for data intensive applications. Instead of enforcing ideal performance fairness directly, prior work addressing the fairness issue of cache sharing mainly focuses on fairness metrics based on cache miss numbers or miss rates. However, because of the variation of cache miss penalty, fairness on cache misses cannot guarantee ideal fairness. Cache sharing management which directly addresses ideal performance fairness is needed for CMP systems. This paper introduces a model to analyze the performance impact of cache sharing, and proposes a cache sharing management mechanism to provide performance fairness for concurrently executing applications. The proposed mechanism monitors the actual penalty of all cache misses and uses an Auxiliary Tag Directory (ATD) to dynamically estimate the cache misses with a dedicated cache while the applications are actually running with the shared cache. The estimated relative slowdown of each core from the dedicated environment to the shared environment is used to guide cache sharing in order to guarantee performance fairness. The experiment results show that the proposed mechanism always improves the performance fairness metric, and provides no worse throughput than the scenario without any management mechanism.

1. Introduction

Data intensive applications usually spend a large proportion of their total execution cycles on memory accesses because of the long latency of off-chip requests. The on-chip last level cache (usually the L2 or L3 cache), which is the key component for hiding off-chip request latency, is typically shared by multiple cores in a Chip Multiprocessor (CMP) architecture. As a result, the sharing of the last level cache has a significant impact on the performance of applications executing concurrently on different cores. The technologies of server consolidation and virtual machines are creating the demand to schedule heterogeneous workloads together. The different memory access characteristics of heterogeneous workloads lead to unfair cache sharing, which breaks the hardware fairness assumption of the scheduler and may bring thread starvation, priority inversion and other problems to the operating system's process scheduler [6].

Prior work has noticed the problem of unfair cache sharing and the consequential result of unfair performance. [6] proposed a cache partitioning mechanism to enhance the fairness of cache sharing. Although intuitively the ideal fairness of cache sharing should be equal slowdowns relative to running with a dedicated cache for all co-scheduled applications, [6] uses the equal increase of cache miss numbers or miss rates as fairness metrics, because [6] points out that ideal fairness is hard to measure and cache miss fairness usually correlates highly with ideal fairness. The mechanism proposed in [6] is an improvement over unmanaged caches. However, because cache miss fairness is not identical to ideal fairness, an improvement in cache miss fairness cannot guarantee the same degree of improvement in ideal fairness, and enforcing fairness on cache misses does not necessarily lead to ideal fairness.
To explain the above, we should note that the correlation between cache miss fairness and ideal (performance) fairness actually depends on two factors: (1) The performance sensitivity of each application to cache misses varies, as applications spend differing fractions of execution time stalled on cache misses. The performance of applications with a large fraction of execution cycles stalled on misses is more sensitive to variations in cache misses. This fraction is diverse; for example, data centric applications spend many more cycles on memory accesses (stalled by cache misses) than computation oriented applications. (2) The stall arising from each miss varies as a function of Memory Level Parallelism (MLP) [13][2] and ILP. Clustered cache misses have their latency cycles overlapped with those of recent misses, so the average penalty of each cache miss can be reduced.
So the actual penalty of each cache miss is smaller than the memory access latency and varies with different memory access characteristics. Besides, computation cycles and memory access latency cycles may overlap, which can also reduce the actual penalty of cache misses. Thus, in order to enforce ideal fairness on cache sharing, we must first build an analytic model to account for the performance impact of cache sharing. The analytic model should help divide an application's whole execution cycles into a private part, which is not related to cache sharing, and a vulnerable part, which is susceptible to additional cache misses caused by cache sharing. Then, according to the analytic performance model, we can develop a mechanism that provides a reference point for the ideal fairness metric by measuring or estimating the application's performance with a dedicated cache while it is actually running with a shared cache.

This paper makes two contributions: (1) We propose a model to analyze the performance impact of cache sharing for CMPs, considering not only the additional cache misses caused by cache interference between concurrent workloads, but also the variation of the actual penalty of each cache miss. (2) To the best of our knowledge, this paper proposes the first mechanism that partitions the shared cache and enforces fairness with the goal of ideal performance fairness. The mechanism is dynamic and adaptive, and no static profiling is needed. The proposed mechanism always improves the performance fairness metric, and provides no worse throughput than a cache without any management mechanism.

The rest of the paper is organized as follows. Section 2 gives a definition of ideal fairness as well as the cache miss fairness used in prior work. Section 3 introduces an accurate model, which connects cache miss rates and overall performance, to estimate the performance impact of cache sharing. In Section 4, the hardware mechanism for enforcing fairness on a shared cache in a typical CMP architecture is described in detail. Section 5 describes the experiment methodology and discusses experiment results. Section 6 briefly introduces related work and Section 7 concludes.

2. Defining Performance Fairness

We define performance fairness, which is the ideal fairness discussed above, as identical slowdown for each concurrent workload running with the shared cache compared to running with a separate, dedicated cache; this is called execution time fairness in [6]. Let T_i denote the execution time (count of execution cycles) of workload i with a dedicated cache and T'_i the execution time with the shared cache. Performance fairness is achieved when:

    T'_i / T_i = T'_j / T_j    (1)

for every pair of concurrent workloads i and j. T_i and T'_i can be replaced by CPI_i and CPI'_i for a certain slice of running instructions, and then Equation 1 can be transformed to:

    CPI'_i / CPI_i = CPI'_j / CPI_j    (2)

Actually, CPI'_i / CPI_i is the slowdown of workload i running with the shared cache compared to the dedicated cache. We define the performance fairness metric of the concurrently running workloads under a certain cache partition as M_perf:

    M_perf = Σ_{i,j} | CPI'_i / CPI_i − CPI'_j / CPI_j |    (3)

Intuitively, M_perf is the sum of the slowdown differences between every two co-scheduled workloads. The smaller M_perf is, the smaller the slowdown difference among all co-scheduled workloads, and thus the better the performance fairness. If M_perf equals zero, perfect performance fairness is achieved.
For the purpose of comparison, we also define that cache miss fairness is achieved when:

    MPKI'_i / MPKI_i = MPKI'_j / MPKI_j    (4)

for every pair of concurrent workloads i and j, where MPKI_i and MPKI'_i denote the miss count per thousand instructions when workload i is running with a dedicated cache and with the shared cache, respectively. The cache miss fairness metric is defined as:

    M_miss = Σ_{i,j} | MPKI'_i / MPKI_i − MPKI'_j / MPKI_j |    (5)

Similar to M_perf, the smaller M_miss is, the better the cache miss fairness, and perfect cache miss fairness is achieved when M_miss equals zero. M_miss is the same as M_3 in [6] and FM3 in [7]. [6] proposed five fairness metrics, but M_2 has been shown to have a poor correlation with performance fairness; M_1 and M_3 correlate the most highly with performance fairness for most cases. If the workload runs the same number of instructions with the dedicated cache and the shared cache, M_miss is also the same as M_1. So M_miss is representative enough.
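To make these two metrics concrete, the following minimal sketch (ours, not from the paper; the two-core numbers are purely illustrative) computes M_perf and M_miss from per-core CPI and MPKI values measured with dedicated and shared caches:

    from itertools import combinations

    def m_perf(cpi_shared, cpi_dedicated):
        """Sum of pairwise slowdown differences (Equation 3).
        cpi_shared[i] / cpi_dedicated[i] is the slowdown of core i."""
        slowdown = [s / d for s, d in zip(cpi_shared, cpi_dedicated)]
        return sum(abs(slowdown[i] - slowdown[j])
                   for i, j in combinations(range(len(slowdown)), 2))

    def m_miss(mpki_shared, mpki_dedicated):
        """Sum of pairwise MPKI-increase differences (Equation 5)."""
        ratio = [s / d for s, d in zip(mpki_shared, mpki_dedicated)]
        return sum(abs(ratio[i] - ratio[j])
                   for i, j in combinations(range(len(ratio)), 2))

    # Illustrative two-core example: core 0 slows down 1.5x, core 1 slows down 1.2x.
    print(m_perf([1.5, 0.6], [1.0, 0.5]))   # |1.5 - 1.2| ≈ 0.3
    print(m_miss([12.0, 3.0], [8.0, 2.5]))  # |1.5 - 1.2| ≈ 0.3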

3. Modeling Performance Impact of Cache Sharing

We use a typical CMP architecture configuration for the following analysis: N cores on chip; each core has a private L1 instruction cache and data cache; a unified on-chip L2 cache shared by all on-chip cores is the last level cache, and a miss in the L2 cache issues an off-chip request. All caches are set-associative.

To model the performance impact of cache sharing, the cycles consumed during an application's execution can be categorized into two classes: private operation cycles (T_pr) and vulnerable operation cycles (T_vul). Intuitively, private operation cycles are the part of execution cycles which only depends on the characteristics of the workload itself. Vulnerable operation cycles are sensitive to the workloads co-scheduled on other cores and may vary widely because of resource sharing. Private operation cycles are consumed by those operations which only need the private resources of the core, such as computation time and the latency of fetching instructions and data from the private L1 caches. An L2 request is a hybrid operation. The tag lookup latency is accounted in private operation cycles because it is constant no matter whether the request misses or not. However, if the L2 cache misses, the cycles of off-chip request latency are vulnerable operation cycles, because this cache miss may be caused by cache sharing and the off-chip request costs extra cycles.

However, the whole execution time is not simply equal to the sum of T_pr and T_vul, because there are cycles in which private operations and vulnerable operations overlap. For example, in an out-of-order processor, even if an instruction is stalled by an L2 cache miss, other instructions in the scheduling window can still enter the pipeline if there is no data dependence. As a result, the overlap cycles (T_ovl) must be introduced, and then:

    T = T_pr + T_vul − T_ovl    (6)

The example in Figure 1 illustrates Equation 6.

[Figure 1. Illustration of Equation 6: pipeline activity, L1 misses and L2 misses over time; the horizontal axis represents elapsed time (cycles).]

In this example, during the execution time T, two L2 cache misses occur. After the first miss, the instruction window still has instructions with no data dependence on this miss, so the pipeline is not stalled until the second miss occurs. The computation time of the pipeline is certainly part of T_pr, and the off-chip request latency of the L2 misses is T_vul. Note that when L1 misses occur, the L1 request latency (the tag lookup latency in the L2 cache), such as slice A and slice B in the figure, is part of T_pr; however, if it ends up as an L2 miss, the latency of the off-chip request does not belong to T_pr any more. T_ovl is formed by those cycles in which T_pr and T_vul overlap.

Because the vulnerable operation cycles are mainly the cycles of off-chip request latency, T_vul can be approximated by the total penalty of off-chip requests. The average MLP needs to be considered to estimate the average penalty of each cache miss. We derive the average MLP definition in [2] as the average number of useful outstanding off-chip requests when there is at least one outstanding off-chip request, denoted by MLP_avg. We have the following equation:

    T_vul = Σ_{i=1}^{N_miss} Miss_Penalty_i = N_miss × Miss_Latency / MLP_avg    (7)

Then Equation 6 becomes:

    T = T_pr − T_ovl + N_miss × Miss_Latency / MLP_avg    (8)

In Equation 8 the latency of an off-chip request (Miss_Latency) can be treated as a constant (ignoring other factors such as bus congestion). Equations 6 and 8 show that the total execution time of an application includes three parts: T_pr is the inherent part that does not change with cache sharing; T_ovl and T_vul are related to cache sharing; they are the cause of unfair performance and the part of execution time that cache partition mechanisms aim to adjust.
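As a quick numeric illustration of Equations 7 and 8 (a sketch with assumed values, not measurements from the paper), the execution time can be assembled from the private cycles, the overlap cycles and the off-chip miss component:

    def total_cycles(t_pr, t_ovl, n_miss, miss_latency, mlp_avg):
        """Equation 8: T = T_pr - T_ovl + N_miss * Miss_Latency / MLP_avg."""
        t_vul = n_miss * miss_latency / mlp_avg   # Equation 7: total off-chip penalty
        return t_pr - t_ovl + t_vul

    # Illustrative numbers: 1M private cycles, 50K overlapped cycles,
    # 2000 L2 misses, 400-cycle memory round trip, average MLP of 1.6.
    print(total_cycles(1_000_000, 50_000, 2_000, 400, 1.6))  # 1,450,000 cycles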
According to Equation 8, if we can obtain the parameters T_ovl, N_miss and MLP_avg through a hardware profiler, the total execution time can be estimated dynamically.

4. Hardware Mechanism

The necessary hardware support includes two interdependent parts: the cache partition mechanism and the hardware profiler. Figure 2 shows how they work together. The hardware profiler gathers statistics from each core's pipeline, each private L1 instruction and data cache, and the unified L2 cache during the last period. At the end of the period, the cache partition decision is made and the cache partition mechanism applies the decided partition to the cache.
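The per-period control flow can be summarized in a short software sketch (ours, not the paper's hardware; the profiler and partitioner objects are hypothetical stand-ins for the monitoring circuits and the overalloc flags described in the following subsections):

    def repartition_each_period(cores, profiler, partitioner):
        """One profiling period: read end-of-period statistics for every core,
        estimate each core's slowdown, and mark the least-slowed core as
        over-allocated so it yields cache space during the next period."""
        slowdowns = {}
        for core in cores:
            stats = profiler.read(core)                    # profiled shared-cache statistics
            t_shared = stats.t_shared                      # measured execution cycles this period
            t_dedicated = stats.estimated_t_dedicated      # estimated via the ATD (Section 4.2)
            slowdowns[core] = t_shared / t_dedicated
        least_slowed = min(slowdowns, key=slowdowns.get)
        partitioner.set_overalloc_flag(least_slowed)       # guides victim selection next period
        profiler.reset()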

[Figure 2. Hardware Framework: a profiler is attached to each core and its private L1D/L1I caches; all cores share the unified L2 cache, which is controlled by the cache partition mechanism.]

4.1 Cache Partition Mechanism

The cache partition mechanism is the simpler part. Assuming there are N cores on chip sharing the L2 cache, we add log2(N) bits to each cache block to mark which core the cache block belongs to, and each core has an overalloc flag bit to indicate whether this core has been allocated too much cache space. When a cache miss from core i occurs, we first find the correct cache set; if the overalloc flag of core i shows that core i is not over-allocated, the original cache replacement policy is used to find a victim cache block, and the mark bits of this cache block are changed to core i; if the overalloc flag shows that core i has been allocated too much cache space, one cache block marked as belonging to core i is randomly selected as the victim. This mechanism guarantees that over-allocated cores cannot gain more cache lines, so their additional cache blocks are gradually taken over by the under-allocated cores, moving toward the partition target set by the overalloc flag of each core. A special situation is that, although the flag of core i shows it is over-allocated, there is no cache block marked as belonging to core i in the set; in this case we still allocate a cache line for core i.

[Figure 3. Cache Partition Mechanism.]
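A software sketch of the victim-selection rule above (ours; the Block class, the overalloc table and the lru_victim callback are simplified stand-ins for the per-line owner bits, the per-core overalloc flags and the baseline Pseudo-LRU logic):

    import random
    from dataclasses import dataclass

    @dataclass
    class Block:
        owner: int    # log2(N)-bit mark identifying the core that owns this line

    def choose_victim(cache_set, requesting_core, overalloc, lru_victim):
        """Pick a victim block for a miss from `requesting_core` in one cache set."""
        own = [b for b in cache_set if b.owner == requesting_core]
        if overalloc[requesting_core] and own:
            victim = random.choice(own)     # over-allocated: recycle one of this core's own blocks
        else:
            victim = lru_victim(cache_set)  # not over-allocated (or owns nothing here): normal LRU
        victim.owner = requesting_core      # the incoming block is marked for the requesting core
        return victim

    # Illustrative use: a 4-way set where core 1 is over-allocated.
    ways = [Block(0), Block(1), Block(1), Block(0)]
    print(choose_victim(ways, 1, {0: False, 1: True}, lambda s: s[0]).owner)   # 1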

4.2 Hardware Profiler

The hardware profiler is more complex. To achieve performance fairness for a shared-cache CMP, identical slowdowns compared to running with dedicated caches is the ideal goal. The key problem is to estimate the cycles needed by a workload with a dedicated cache while it is actually running with the shared cache. T_i and CPI_i as well as T'_i and CPI'_i for each core are needed to evaluate M_perf defined in Equation 3. The job of the hardware profiler is to profile the necessary runtime and statistical parameters while the applications are executing concurrently, and to use Equation 6 and Equation 8 to dynamically estimate the situation if running with a dedicated cache. According to Equation 6 and Equation 8:

    T' = T'_pr − T'_ovl + T'_vul    (9)

    T = T_pr − T_ovl + N_miss × Miss_Latency / MLP_avg    (10)

According to the definition of T_pr we have T'_pr = T_pr. So in order to enforce performance fairness, T', T'_ovl and T'_vul should be profiled in shared mode, and then T_pr can be calculated; at the same time, T_ovl, N_miss and MLP_avg for the dedicated-cache case need to be estimated; finally we obtain T and are able to evaluate M_perf. The data flow for the analysis is shown in Figure 4.

[Figure 4. Data Flow for Analysis: the profiled T', T'_ovl and T'_vul yield T_pr; T_pr together with the estimated T_ovl, N_miss and MLP_avg yields T; T and T' together yield M_perf.]

If we only want to enforce the cache miss fairness defined by Equation 4, things are much simpler: only N_miss and N'_miss are needed, and then M_miss defined by Equation 5 can be evaluated.

4.2.1 Profiling parameters for shared cache

The profiler calculates statistics every P_T cycles (the profiling period), so T' = P_T. To get T'_vul, we add a log2(N)-bit flag to each entry of the L2 MSHR (Miss Status Holding Register) to identify which core caused the miss; we then monitor the count of entries with the specified flag to see whether there is a request of this core in the MSHR. At the beginning of a profiling period, T'_vul is initialized to zero. If the MSHR has at least one entry for this core, T'_vul is increased because there is at least one outstanding off-chip request in this cycle.

    Algorithm 1: Calculating T'_vul
    /* at the beginning of each profiler period */
    T'_vul = 0;
    /* for each cycle */
    if the L2 MSHR has entries for this core then T'_vul++;

To decide whether a cycle of the whole execution time belongs to T'_ovl, three conditions should be considered: (1) whether the pipeline is stalled in this cycle; (2) whether there is at least one L1 request in this cycle; (3) whether there is at least one L2 request for this core in this cycle. Condition (2) only refers to the on-chip request; if the L1 request ends up as an L2 miss, the off-chip latency cycles do not count toward condition (2) (see the example shown in Figure 1). Condition (1) can easily be obtained by monitoring the status of the pipeline, and condition (3) by monitoring whether the L2 MSHR has entries for this core. To decide condition (2), we add a timer to each L1 cache, including the instruction cache and the data cache. Let L2_LAT denote the lookup latency of the L2 cache (L2_LAT is always a fixed value). If an L1 cache misses and issues a request to L2, the timer is set to L2_LAT. In each cycle, the timer is decreased by 1. Condition (2) can be decided by checking whether the timer is zero or not. Algorithm 2 describes the logic in detail.

    Algorithm 2: Calculating T'_ovl
    /* at the beginning of each profiler period */
    T'_ovl = 0;
    /* for each cycle; for the L1I and L1D of this core */
    if an L1 miss occurs in this cycle then timer of this L1 = L2_LAT; else timer of this L1 --;
    /* for each cycle */
    if the pipeline is not stalled /*(1)*/ then pipelineNotStall = true; else pipelineNotStall = false;
    if the L2 MSHR has entries for this core /*(3)*/ then hasL2Request = true; else hasL2Request = false;
    if the L1I or L1D timer is not zero /*(2)*/ then hasL1Request = true; else hasL1Request = false;
    if (pipelineNotStall or hasL1Request) and hasL2Request then T'_ovl++;

4.2.2 Profiling and estimating parameters for dedicated cache

The three parameters T_ovl, N_miss and MLP_avg shown in Figure 4 correspond to the application running with a dedicated cache. However, because the workload is actually running with the shared cache, we need special hardware to help estimate the situation when it is running with a dedicated cache. To achieve this, we use the Auxiliary Tag Directory (ATD) technique [13] to attach a virtual private L2 cache to each core. An ATD has the same associativity as the main tag directory of the shared L2 cache and uses the same replacement policy. However, an ATD only contains the tag and other functional bits of each cache block and does not keep data. We also add an auxiliary MSHR to each ATD, so that an ATD can act just as a private L2 cache would. Each entry in the auxiliary MSHR has a timer to simulate the memory access request. When a memory request is issued to an auxiliary MSHR, the timer of the inserted entry is set to the round-trip cycles of the memory access latency. In every cycle, all entries' timers are automatically decreased by 1. A timer value of 0 means the memory request has returned and the requested block is ready for the ATD.

Because there is an ATD for each core, the total hardware cost is substantial. We employ the set sampling technique [13] to reduce the storage cost. An ATD with set sampling only keeps cache sets sampled at a specified interval, and the behavior of the whole cache is approximated by the sampled cache sets. An ATD with a large sampling interval requires less hardware overhead, but gives less accurate statistics. So there is a tradeoff between hardware cost and accuracy in choosing the sampling interval.

With an ATD and an auxiliary MSHR, it is possible to simulate a private L2 cache for each core. When a request from the upper-level L1 cache arrives, it is forwarded to the real L2 cache as well as to the ATD attached to the core which the request sender (the L1 cache) belongs to. The ATD then responds to this request just as the L2 cache does, including touching the MRU bits, replacement, counting misses and issuing requests to the auxiliary MSHR.
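Algorithms 1 and 2 can be read as per-cycle counter updates; the following sketch (ours) mirrors that logic in software, with the MSHR, pipeline and timer state passed in as plain values and a single L1 timer standing in for the separate L1I/L1D timers:

    def update_profiler(state, l1_miss_this_cycle, pipeline_stalled,
                        l2_mshr_has_entries, l2_lookup_latency):
        """One cycle of the shared-mode profiler for a single core.
        `state` holds t_vul, t_ovl and the L1 request timer."""
        # Algorithm 1: count cycles with at least one outstanding off-chip request.
        if l2_mshr_has_entries:
            state["t_vul"] += 1

        # Condition (2): an L1->L2 lookup is in flight while the timer is non-zero.
        if l1_miss_this_cycle:
            state["timer"] = l2_lookup_latency
        elif state["timer"] > 0:
            state["timer"] -= 1
        has_l1_request = state["timer"] > 0

        # Algorithm 2: overlap cycle = (pipeline busy or L1 request) and an L2 miss outstanding.
        if (not pipeline_stalled or has_l1_request) and l2_mshr_has_entries:
            state["t_ovl"] += 1

    state = {"t_vul": 0, "t_ovl": 0, "timer": 0}   # reset at the start of each profiling period
    update_profiler(state, True, False, True, 12)  # one example cycle with a 12-cycle L2 lookup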
So N_miss and T_ovl can be obtained from the ATD and the auxiliary MSHR attached to each core, using mechanisms similar to those used to get T'_vul and T'_ovl. MLP_avg can also be obtained with the help of the auxiliary MSHR. MLP_avg is defined as the average number of useful outstanding off-chip requests when there is at least one outstanding off-chip request. Two counters, MLP_sum and MLP_cycles, are used to store the accumulated off-chip latency and the count of cycles during which there is at least one outstanding off-chip request. The procedure is described in Algorithm 3 below.

    Algorithm 3: Calculating MLP_avg
    /* at the beginning of each profiler period */
    MLP_sum = 0; MLP_cycles = 0;
    /* for each cycle */
    if the auxiliary MSHR is not empty then { MLP_cycles++; MLP_sum += count of auxiliary MSHR entries; }
    /* at the end of each profiler period */
    MLP_avg = MLP_sum / MLP_cycles;

Note that T_ovl and MLP_avg are both estimated values, while N_miss can be treated as an accurate value if set sampling is not employed. Because L2 cache misses have an impact on the pipeline, the relative order of L2 cache misses may differ between running with the shared cache and running with a dedicated cache, which makes the mechanism for calculating T_ovl and MLP_avg inaccurate compared to really running the application with a dedicated cache. However, if the application is actually running with a dedicated cache, the same algorithm can be used to obtain the accurate MLP by replacing the auxiliary MSHR with the main L2 MSHR.

Now the hardware profiler can provide all the parameters needed to calculate T_i and T'_i for each core. The slowdown of core i can be calculated at the end of the profiling period by:

    Slowdown_i = T'_i / T_i    (11)

The core with the least slowdown is selected as having been allocated too much cache space, by setting this core's overalloc flag. Through the cache partition mechanism, the slowdowns of the cores will move closer to each other and performance fairness will be improved.

4.3 Hardware cost

For each cache line, log2(N) bits are added to indicate which core the cache line belongs to, where N is the number of cores sharing the cache. In addition, there are some monitoring circuits in the L2 MSHR, the L1 caches and the pipeline. For each core, there are 2 additional L1 timers and 1 auxiliary MSHR, typically with 32 entries. The ATDs are the major part of the cost. Set sampling can significantly reduce the storage of an ATD. For the configuration of a 2-way CMP with a 1MB, 64-byte line size, 8-way associative L2 cache, sampling one of every 16 cache sets, the hardware cost is:

    For each ATD entry: (24-bit tag) + (4-bit LRU) + (1-bit valid) + (2 bits in addition) = 31 bits
    Number of ATD entries: (1MB L2) / (64-byte block) / (16-set sampling interval) = 1024
    Total cost for 1 ATD: 1024 × 31 bits < 4KB

The original L2 cost is 1MB + (24+4+1) bits × (1MB / 64B) = 1.45MB, so the additional cost for the 2 ATDs of the 2 cores is less than 1%: (2 × 4KB) / 1.45MB < 1%.
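The ATD storage accounting above can be reproduced in a few lines (a sketch of the same arithmetic; the 31-bit entry size and the 1-in-16 sampling interval are the values quoted in Section 4.3):

    def atd_bits(l2_bytes, line_bytes, sampling_interval, entry_bits):
        """Storage (in bits) of one set-sampled Auxiliary Tag Directory."""
        lines = l2_bytes // line_bytes           # tag entries in the full L2: 16384
        sampled = lines // sampling_interval     # entries kept in the sampled ATD: 1024
        return sampled * entry_bits

    bits = atd_bits(1 << 20, 64, 16, 31)         # 1MB L2, 64B lines, 1-in-16 sets, 31-bit entries
    print(bits, "bits =", bits // 8, "bytes")    # 31744 bits = 3968 bytes, just under 4KB per ATD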

5. Experimental Methodology

5.1 Configuration

The evaluation is performed using Simics [8], a whole-system simulator supporting CMPs. The memory timing model is derived from GEMS [9], which enables detailed cycle-accurate simulation of multiprocessor systems for Simics. Table 1 shows the basic parameters of the simulated architecture. The simulated CMP cores are out-of-order superscalar processors with private L1 instruction and data caches, sharing a unified L2 cache and all lower levels of the memory hierarchy. All caches are set-associative and use the Pseudo LRU policy for replacement decisions.

    Table 1. Basic configuration of the simulated architecture
    CMP:             2 cores on chip, sharing the on-chip L2 cache
    Processor core:  4-issue out-of-order; instruction window size: 64; re-order buffer size: 128;
                     private Icache and Dcache for each core
    L1 cache:        Icache: 32KB, 64B line size, 4-way, PLRU; Dcache: 32KB, 64B line size, 4-way, PLRU
    Shared L2 cache: unified, 1MB, 64B line size, 8-way, PLRU, 12-cycle hit, 32-entry MSHR,
                     shared by all cores on chip
    Memory:          400-cycle round-trip latency

5.2 Metrics

To evaluate the fairness improvement of different schemes against the baseline of uncontrolled L2 cache sharing (the original Pseudo LRU policy), we define:

    F_perf^scheme = M_perf^scheme / M_perf^PLRU    (12)

    F_miss^scheme = M_miss^scheme / M_miss^PLRU    (13)

M_perf^scheme and M_perf^PLRU are performance fairness metrics defined by Equation 3; M_miss^scheme and M_miss^PLRU are cache miss fairness metrics defined by Equation 5. Obviously F_perf^PLRU = 1 and F_miss^PLRU = 1. These are normalized fairness metrics: the smaller the value, the better the fairness of the scheme, and zero means perfect fairness. If the baseline already achieved perfect fairness, F_perf^scheme and F_miss^scheme would be infinite unless the cache partition scheme also achieved perfect fairness.

5.3 Benchmarks

We select a set of the most memory-intensive benchmarks from the SPEC CPU 2000 benchmark suite and run them concurrently in pairs on the simulated dual-core CMP system to see the benefit of different cache partition schemes, including uncontrolled L2 cache sharing using the original Pseudo LRU policy, enforcing cache miss fairness through cache partitioning, and enforcing performance fairness through cache partitioning. We compare the normalized cache miss fairness metric (F_miss^scheme) and the normalized performance fairness metric (F_perf^scheme) of all cache partition schemes against the baseline of uncontrolled L2 cache sharing. For each selected benchmark, a representative slice of instructions is obtained using the SimPoint tool [10] for simulation.

To categorize the selected benchmarks, we consider two features of each benchmark that can affect the correlation between cache miss fairness and performance fairness: (1) the portion of the whole execution time taken by vulnerable time, which is T_vul/T according to Equation 6; (2) the average MLP during the execution of the benchmark. Workloads with a high value of T_vul/T are more sensitive to interleaving, while workloads with high average MLP are more tolerant of last level cache misses. These two factors must be combined in the analysis.

    Table 2. Benchmark classification (T_vul/T, average MLP)
                High T_vul/T                                   Low T_vul/T
    High MLP:   mcf (55.2%, 1.53), art (56.5%, 1.82),          ammp (29.2%, 1.47), gzip (24.1%, 1.26)
                applu (72.3%, 1.21)
    Low MLP:    swm (63.3%, 1.10), equake (47.6%, 1.19)        apsi (28.5%, 1.08), vpr (28.3%, 1.28),
                                                               sixtrack (39.2%, 1.09)

Table 2 shows the classification of the ten selected benchmarks based on the values of T_vul/T and average MLP: mcf, ammp, art, applu, gzip, swm, apsi, equake, vpr and sixtrack. The pair of numbers shown in parentheses denotes the benchmark's values of T_vul/T and average MLP. These data are collected by running a single benchmark with a dedicated cache using the original Pseudo-LRU replacement policy. The benchmarks are categorized into four classes: high-vulnerability/high-MLP, high-vulnerability/low-MLP, low-vulnerability/high-MLP and low-vulnerability/low-MLP.

We pair the benchmarks presented above and run them concurrently in our simulated system with the shared cache to evaluate the fairness metrics of different schemes. Table 3 shows the classification parameters when the benchmarks are running concurrently with the shared cache under the original replacement policy. We can see that although the two parameters may vary widely (especially the portion of cache miss latency), the relationship between the parameter values is kept. That is, if a benchmark from the high/high class is running concurrently with a benchmark from the low/low class, the first benchmark usually still has a larger portion of cache miss latency and a larger MLP. The same holds for the other combinations of benchmarks from the remaining classes.

    Table 3. Benchmark pairs and parameters when running concurrently
    Pair              T_vul/T (App1)   T_vul/T (App2)
    gzip+mcf          32.0%            82.7%
    gzip+art          26.7%            62.2%
    apsi+art          36.1%            61.9%
    swm+apsi          87.1%            31.1%
    vpr+applu         31.2%            78.1%
    ammp+applu        44.0%            86.7%
    mcf+swm           83.8%            67.8%
    swm+art           85.0%            73.6%
    equake+mcf        55.0%            68.8%
    ammp+sixtrack     40.0%            41.6%
    equake+sixtrack   64.6%            48.2%
5.4 Results and Analysis

5.4.1 Model Accuracy Verification

Before evaluating the fairness metrics, we need to verify how accurate the model described in the sections above is. In this verification experiment, we ran the benchmarks in two passes. In the first pass, we used the hardware configuration in Table 1, but only applied the profiler component of the hardware mechanism to the shared L2 cache; the cache replacement policy was not modified. We ran the benchmark pairs in Table 3 in this shared-cache configuration and got the necessary statistics for estimating the benchmarks' performance (the estimated IPC, denoted IPC_est) when running with a dedicated cache. In the second pass, we ran these benchmarks again on identical hardware except that each core used a dedicated L2 cache (the dedicated cache has the same configuration as the shared cache).

In this pass, we obtained the benchmarks' actual performance (IPC) when running with a dedicated cache. Then we can compare IPC_est with the actual IPC and get the relative error:

    E = | 1 − IPC_est / IPC |    (14)

To review the effect of employing the set sampling technique in the ATD, we compared the estimation error for different sampling intervals. From Figure 5 we can see that if set sampling is not used in the ATD (sampling interval 1), the average E is 4.7% and the maximum E is 10.9% (equake running with sixtrack). As the sampling interval increases, E increases as well. When sampling ATD sets with interval 16, the average E is 6.7% and the maximum E is 13.3% (art running with gzip), which is still accurate enough. So the following experiments used interval 16 for set sampling in the ATD.

[Figure 5. Verifying accuracy of the model. The Y axis represents the relative error E defined in Equation 14; the X axis represents the set sampling interval in the ATD; interval 1 means no set sampling (every tag set in the ATD is sampled).]

5.4.2 Evaluation of fairness metrics

We use F_perf^scheme and F_miss^scheme described in Section 5.2 to measure the fairness improvement of two cache partition schemes: the scheme that enforces cache miss fairness (SECF) and the scheme that enforces performance fairness (SEPF). The baseline is uncontrolled cache sharing using the original Pseudo LRU policy, which is widely used in production processors.

Figure 6 shows the experiment results for the cache miss fairness metric (F_miss^scheme) and the performance fairness metric (F_perf^scheme). For each benchmark pair, the two benchmarks are executed concurrently for three passes: in the first pass, the shared L2 cache uses the original, uncontrolled Pseudo LRU replacement policy; in the second and third passes, the cache partition schemes of enforcing cache miss fairness (SECF) and enforcing performance fairness (SEPF) are used. The two fairness metrics F_miss^scheme and F_perf^scheme are measured for each pass of execution.

[Figure 6. Comparing fairness metrics of selected benchmark pairs with uncontrolled Pseudo LRU and the two cache partition schemes: (a) cache miss fairness metric (F_miss^scheme); (b) performance fairness metric (F_perf^scheme). Note that a shorter bar means better fairness.]

In Figure 6 (a), we can see that when using SECF to enforce cache miss fairness on the shared cache, the cache miss fairness metric (F_miss^scheme) is improved significantly compared to the uncontrolled Pseudo LRU replacement policy (note that a shorter bar means better fairness). However, when using SEPF to enforce performance fairness on the shared cache, although there is still an improvement in the cache miss fairness metric (F_miss^scheme) compared to the uncontrolled cache for most benchmarks, F_miss^scheme is not improved as much as with SECF. There are even benchmark pairs which show worse cache miss fairness than the uncontrolled Pseudo LRU cache when using SEPF (ammp+equake and equake+mcf). On average, SECF gains a cache miss fairness improvement of F_miss^scheme = 0.14, while SEPF's average improvement is smaller.

Figure 6 (b) shows the performance fairness metric (F_perf^scheme) when using the different cache partition schemes. The result is different from the result for the cache miss fairness metric. Although SECF always shows a better F_miss^scheme than SEPF, it fails to gain a better performance fairness metric (F_perf^scheme) than SEPF. What is more, three benchmark pairs suffer great performance fairness degradation when using SECF to enforce cache miss fairness: swm+art (3.87), equake+mcf (18.3) and equake+sixtrack (3.32), which means that SECF may make the problem of unfair performance of concurrent workloads even worse in some cases. In contrast, although SEPF gets a poor F_miss^scheme for ammp+equake and equake+mcf, SEPF ends up gaining better performance fairness on these benchmark pairs compared to SECF. On average, SECF yields F_perf^scheme = 1.04, while SEPF achieves a smaller (better) value.

From the comparison of the results shown in Figure 6 (a) and Figure 6 (b), we can get a deeper insight into the correlation between cache miss fairness and performance fairness. Merely enforcing cache miss fairness on concurrent workloads can improve performance fairness in most cases, but cannot guarantee ideal performance fairness and sometimes may hurt performance fairness instead of improving it, especially when the co-scheduled workloads have different features. For example, when benchmarks of type 4 (low-vulnerability/low-MLP) are running with benchmarks of type 1 (high-vulnerability/high-MLP), such as the benchmark pairs gzip+mcf, gzip+art and apsi+art, SECF gains a much poorer performance fairness than SEPF; when benchmarks of type 2 (high-vulnerability/low-MLP) are running with benchmarks of type 1 (high-vulnerability/high-MLP), such as the benchmark pairs swm+art, equake+mcf and equake+sixtrack, SECF even hurts performance fairness a lot. The reason is that when a low-MLP workload is running concurrently with a high-MLP workload, more shared cache resources should be allocated to the low-MLP workload to guarantee overall performance fairness, because the low-MLP workload has a relatively larger cache miss penalty, and enforcing cache miss fairness cannot guarantee performance fairness. And if both of the concurrently running workloads are highly sensitive to cache interference (type 1 benchmarks running with type 2 benchmarks), the degradation of performance fairness is more obvious when using SECF to enforce cache miss fairness. By taking more factors into account, SEPF achieves better performance fairness in most cases.

Figure 7 shows the throughput of the selected benchmark pairs with uncontrolled Pseudo LRU and the two cache partition schemes.

[Figure 7. Comparing throughput of selected benchmark pairs with uncontrolled Pseudo LRU and the two cache partition schemes. Note that a taller bar means higher throughput.]
We use the fair speedup metric to measure the throughput of the concurrently running benchmark pairs. Note that a taller bar in the figure means higher throughput. We can see that SEPF gains a competitive throughput for most benchmark pairs, and for gzip+mcf, ammp+applu and swm+mcf, SEPF provides higher throughput than the original Pseudo LRU and SECF. On average, SECF gains a fair speedup of 1.04 while SEPF gains a fair speedup of 1.07 over the baseline of the original Pseudo LRU replacement policy.

6. Related Work

Prior work has noticed the impact of cache sharing on concurrently running threads. [15] proposed to use hardware counters to estimate the cache miss rate as a function of cache size, which can be used to optimize the cache partition for minimum overall miss rate. [13] designed a runtime mechanism that partitions a shared cache according to the cache utility of the concurrently running applications. [15] and [13] both focused on optimization of the overall miss rate. [6] pointed out the necessity of enforcing fairness for co-scheduled workloads. [6] defined a set of fairness metrics for cache sharing and proposed both a static strategy and a dynamic mechanism to improve fairness. [4] proposed a framework to provide QoS for resources including shared caches. [14] designed architectural support for the OS to manage shared caches.

Some research indicates that a decrease in cache miss rate does not necessarily lead to a performance improvement. [12] indicated that not every cache miss has an equal penalty because of the existence of MLP. By taking the MLP-related cost of each cache miss into account, [12] modified the standard LRU replacement policy and achieved higher performance. [2] analyzed the microarchitecture impact on MLP and developed a detailed model relating MLP to overall performance. There is also research on performance models for other architectures. [3] proposed a cycle accounting architecture for SMT processors to estimate the performance each co-scheduled thread would have had if it had run alone. [5] proposed a performance model for superscalar processors.

7. Conclusion

Uncontrolled sharing usually leads to unfair performance of concurrent workloads. That is, some workloads suffer a much more significant slowdown than other workloads. This phenomenon brings problems such as priority inversion and thread starvation to the operating system's process scheduler. Instead of enforcing ideal performance fairness directly, prior work addressing the fairness issue of cache sharing mainly focuses on fairness metrics based on cache miss numbers or miss rates. However, because of the variation of cache miss penalty, fairness on cache misses cannot guarantee ideal fairness. Cache sharing management which directly addresses ideal performance fairness is needed for CMP systems.

This paper proposed the concept of a performance fairness metric. We built a detailed model to analyze the performance impact of cache sharing. Guided by this model, we designed a hardware mechanism to enforce performance fairness on the shared cache. The hardware mechanism proposed in this paper is adaptive and hardware efficient. For comparison, the concept of a cache miss fairness metric and a hardware mechanism to enforce cache miss fairness were also introduced. We implemented these two cache partition schemes in a simulator. The experiment results showed that the proposed mechanism always improves the performance fairness metric, and provides no worse throughput than the scenario without any management mechanism.

References

[1] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, New York, NY, USA. ACM.
[2] Y. Chou, B. Fahs, and S. G. Abraham. Microarchitecture optimizations for exploiting memory-level parallelism. In ISCA, pages 76-89.
[3] S. Eyerman and L. Eeckhout. Per-thread cycle accounting in SMT processors. In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA. ACM.
[4] F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A framework for providing quality of service in chip multi-processors. In MICRO.
[5] T. S. Karkhanis and J. E. Smith. A first-order superscalar processor model. In ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 338, Washington, DC, USA. IEEE Computer Society.
[6] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In IEEE PACT.
[7] J. Lin, Q. Lu, X. Ding, Z. Zhang, and X. Zhang. Gaining insights into multi-core cache partitioning: Bridging the gap between simulation and real systems. In HPCA '08: Proceedings of the 14th International Symposium on High-Performance Computer Architecture.
[8] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58.
[9] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33, 2005.
[10] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder. Using SimPoint for accurate and efficient simulation. In ACM SIGMETRICS Performance Evaluation Review.
[11] M. Qureshi. Adaptive spill-receive for robust high-performance caching in CMPs. In HPCA 2009: IEEE 15th International Symposium on High Performance Computer Architecture, pages 45-54, Feb. 2009.
[12] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA-33.
[13] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO-39.
[14] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 2-12, New York, NY, USA. ACM.
[15] G. E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In HPCA '02: Proceedings of the International Symposium on High-Performance Computer Architecture, 2002.


More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

Verification by testing

Verification by testing Real-Tme Systems Specfcaton Implementaton System models Executon-tme analyss Verfcaton Verfcaton by testng Dad? How do they know how much weght a brdge can handle? They drve bgger and bgger trucks over

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Modeling Multiple Input Switching of CMOS Gates in DSM Technology Using HDMR

Modeling Multiple Input Switching of CMOS Gates in DSM Technology Using HDMR 1 Modelng Multple Input Swtchng of CMOS Gates n DSM Technology Usng HDMR Jayashree Srdharan and Tom Chen Dept. of Electrcal and Computer Engneerng Colorado State Unversty, Fort Collns, CO, 8523, USA (jaya@engr.colostate.edu,

More information

Improved Resource Allocation Algorithms for Practical Image Encoding in a Ubiquitous Computing Environment

Improved Resource Allocation Algorithms for Practical Image Encoding in a Ubiquitous Computing Environment JOURNAL OF COMPUTERS, VOL. 4, NO. 9, SEPTEMBER 2009 873 Improved Resource Allocaton Algorthms for Practcal Image Encodng n a Ubqutous Computng Envronment Manxong Dong, Long Zheng, Kaoru Ota, Song Guo School

More information

SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB

SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB V. Hotař, A. Hotař Techncal Unversty of Lberec, Department of Glass Producng Machnes and Robotcs, Department of Materal

More information

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems

Reliability and Energy-aware Cache Reconfiguration for Embedded Systems Relablty and Energy-aware Cache Reconfguraton for Embedded Systems Yuanwen Huang and Prabhat Mshra Department of Computer and Informaton Scence and Engneerng Unversty of Florda, Ganesvlle FL 326-62, USA

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

A fast algorithm for color image segmentation

A fast algorithm for color image segmentation Unersty of Wollongong Research Onlne Faculty of Informatcs - Papers (Arche) Faculty of Engneerng and Informaton Scences 006 A fast algorthm for color mage segmentaton L. Dong Unersty of Wollongong, lju@uow.edu.au

More information

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search Can We Beat the Prefx Flterng? An Adaptve Framework for Smlarty Jon and Search Jannan Wang Guolang L Janhua Feng Department of Computer Scence and Technology, Tsnghua Natonal Laboratory for Informaton

More information

A Predictable Execution Model for COTS-based Embedded Systems

A Predictable Execution Model for COTS-based Embedded Systems 2011 17th IEEE Real-Tme and Embedded Technology and Applcatons Symposum A Predctable Executon Model for COTS-based Embedded Systems Rodolfo Pellzzon, Emlano Bett, Stanley Bak, Gang Yao, John Crswell, Marco

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

FAHP and Modified GRA Based Network Selection in Heterogeneous Wireless Networks

FAHP and Modified GRA Based Network Selection in Heterogeneous Wireless Networks 2017 2nd Internatonal Semnar on Appled Physcs, Optoelectroncs and Photoncs (APOP 2017) ISBN: 978-1-60595-522-3 FAHP and Modfed GRA Based Network Selecton n Heterogeneous Wreless Networks Xaohan DU, Zhqng

More information

Two-Stage Data Distribution for Distributed Surveillance Video Processing with Hybrid Storage Architecture

Two-Stage Data Distribution for Distributed Surveillance Video Processing with Hybrid Storage Architecture Two-Stage Data Dstrbuton for Dstrbuted Survellance Vdeo Processng wth Hybrd Storage Archtecture Yangyang Gao, Hatao Zhang, Bngchang Tang, Yanpe Zhu, Huadong Ma Bejng Key Lab of Intellgent Telecomm. Software

More information

A Generic and Compositional Framework for Multicore Response Time Analysis

A Generic and Compositional Framework for Multicore Response Time Analysis A Generc and Compostonal Framework for Multcore Response Tme Analyss Sebastan Altmeyer Unversty of Luxembourg Unversty of Amsterdam Clare Maza Grenoble INP Vermag Robert I. Davs Unversty of York INRIA,

More information

Avoiding congestion through dynamic load control

Avoiding congestion through dynamic load control Avodng congeston through dynamc load control Vasl Hnatyshn, Adarshpal S. Seth Department of Computer and Informaton Scences, Unversty of Delaware, Newark, DE 976 ABSTRACT The current best effort approach

More information

Study of Data Stream Clustering Based on Bio-inspired Model

Study of Data Stream Clustering Based on Bio-inspired Model , pp.412-418 http://dx.do.org/10.14257/astl.2014.53.86 Study of Data Stream lusterng Based on Bo-nspred Model Yngme L, Mn L, Jngbo Shao, Gaoyang Wang ollege of omputer Scence and Informaton Engneerng,

More information

Real-time Scheduling

Real-time Scheduling Real-tme Schedulng COE718: Embedded System Desgn http://www.ee.ryerson.ca/~courses/coe718/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty Overvew RTX

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity

Corner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity Journal of Sgnal and Informaton Processng, 013, 4, 114-119 do:10.436/jsp.013.43b00 Publshed Onlne August 013 (http://www.scrp.org/journal/jsp) Corner-Based Image Algnment usng Pyramd Structure wth Gradent

More information

COMPARISON OF TWO MODELS FOR HUMAN EVACUATING SIMULATION IN LARGE BUILDING SPACES. University, Beijing , China

COMPARISON OF TWO MODELS FOR HUMAN EVACUATING SIMULATION IN LARGE BUILDING SPACES. University, Beijing , China COMPARISON OF TWO MODELS FOR HUMAN EVACUATING SIMULATION IN LARGE BUILDING SPACES Bn Zhao 1, 2, He Xao 1, Yue Wang 1, Yuebao Wang 1 1 Department of Buldng Scence and Technology, Tsnghua Unversty, Bejng

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems S. J and D. Shn: An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems 2355 An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems Seunggu J and Dongkun Shn, Member,

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities

Multiple Sub-Row Buffers in DRAM: Unlocking Performance and Energy Improvement Opportunities Multple Sub-Row Buffers n DRAM: Unlockng Performance and Energy Improvement Opportuntes ABSTRACT Nagendra Gulur Texas Instruments (Inda) nagendra@t.com Mahesh Mehendale Texas Instruments (Inda) m-mehendale@t.com

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

A Parallelization Design of JavaScript Execution Engine

A Parallelization Design of JavaScript Execution Engine , pp.171-184 http://dx.do.org/10.14257/mue.2014.9.7.15 A Parallelzaton Desgn of JavaScrpt Executon Engne Duan Huca 1,2, N Hong 2, Deng Feng 2 and Hu Lnln 2 1 Natonal Network New eda Engneerng Research

More information

Pricing Network Resources for Adaptive Applications in a Differentiated Services Network

Pricing Network Resources for Adaptive Applications in a Differentiated Services Network IEEE INFOCOM Prcng Network Resources for Adaptve Applcatons n a Dfferentated Servces Network Xn Wang and Hennng Schulzrnne Columba Unversty Emal: {xnwang, schulzrnne}@cs.columba.edu Abstract The Dfferentated

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

Shared Running Buffer Based Proxy Caching of Streaming Sessions

Shared Running Buffer Based Proxy Caching of Streaming Sessions Shared Runnng Buffer Based Proxy Cachng of Streamng Sessons Songqng Chen, Bo Shen, Yong Yan, Sujoy Basu Moble and Meda Systems Laboratory HP Laboratores Palo Alto HPL-23-47 March th, 23* E-mal: sqchen@cs.wm.edu,

More information

Hybrid Non-Blind Color Image Watermarking

Hybrid Non-Blind Color Image Watermarking Hybrd Non-Blnd Color Image Watermarkng Ms C.N.Sujatha 1, Dr. P. Satyanarayana 2 1 Assocate Professor, Dept. of ECE, SNIST, Yamnampet, Ghatkesar Hyderabad-501301, Telangana 2 Professor, Dept. of ECE, AITS,

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information