Resource Sharing in Multicore Processors
1 Uppsala Programming for Multicore Architectures Research Center. Resource Sharing in Multicore Processors. David Black-Schaffer, Assistant Professor, Department of Information Technology, Uppsala University. Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant
2 Resource Sharing in Multicores
- Die shots of Sandy Bridge and Nehalem: cores alongside the shared memory controller, I/O, and L3 cache
- Most people focus on the core count: parallelism, synchronization, deadlock, etc.
- I am going to focus on the shared memory system: cache, bandwidth, prefetching, etc.
- The shared memory system significantly impacts performance, efficiency, and scalability.
3 Goals For Today
- Overview of shared memory system resources
- Impact on application performance
- Importance of application sensitivity
- Research tools and directions
4 Not just cache and bandwidth: SHARED RESOURCES
5 Multicore Memory Systems
- Intel Nehalem memory hierarchy: small, fast private L1/L2 caches, a shared last-level cache, and off-chip DRAM
- Latency grows sharply from the private caches to DRAM, while per-core bandwidth shrinks from the private caches to shared DRAM (the specific cycle and bytes/cycle figures were lost in transcription)
- D. Molka, et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.
6 Other Resources
- SMT (Hyper-Threading) shares the private caches (instruction cache, L1/L2 data caches), branch predictors, TLBs, and instruction decode between threads
- AMD's Bulldozer: each module shares its floating point units, instruction fetch/decode, L2 cache, and branch predictor between two cores
7 Energy As A Resource
- Probably the 2nd most important resource today, soon to be the 1st
- Power (controlled by DVFS: Dynamic Voltage/Frequency Scaling)
- Thermal capacity: AMD's Turbo CORE and Intel's Turbo Boost spend it to boost frequency
8 Aside: Why Share Resources?
- Parallel performance: data sharing across threads; a shared cache simplifies synchronization and coherency (efficient data sharing, fast synchronization)
- Flexibility: cache partitioning can be adapted to different workloads (flexible resource allocation)
- Amdahl's law: you always need single-threaded performance. Asymptotically, a 99%-parallel program is limited to 100x speedup, a 90%-parallel one to 10x, and a 75%-parallel one to 4x, no matter how many cores you add.
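The Amdahl's-law limit in the last bullet is easy to check numerically. A minimal sketch (the function name and core counts are illustrative, not from the talk):

```python
# Amdahl's law: speedup on n cores for a program whose fraction p
# is perfectly parallel is S(n) = 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup of a program with parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.75, 0.90, 0.99):
    # The limit as n grows is 1 / (1 - p): 4x, 10x, 100x respectively.
    print(f"p={p}: 64 cores -> {amdahl_speedup(p, 64):.1f}x, "
          f"limit -> {1 / (1 - p):.0f}x")
```

Even at 99% parallel, 64 cores yield only about 39x, which is why single-threaded performance (and the shared resources that feed it) still matters.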
9 Cache Sharing, Cache Pollution, and Bandwidth Sharing: PERFORMANCE IMPACTS OF SHARING
10 Cache Sharing: The Cache Pirate Experiment (471.omnetpp)
- Run independent instances of the same program on a multicore Nehalem and measure throughput as the number of instances grows
- Measured throughput falls well short of the expected linear scaling: each added instance shrinks every instance's share of the shared L3 cache
- The Cache Pirate's prediction, built from performance measured as a function of shared cache size, matches the measured throughput closely
- With only a quarter of the shared cache, omnetpp runs noticeably slower
- Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability
11 Cache Sharing: What's Going On?
- 471.omnetpp's cache sensitivity: its performance drops steeply as its share of the shared cache shrinks, which directly produces the reduced scalability seen in the experiment
12 Cache Sharing: What's Going On?
- Less shared cache space -> more DRAM accesses -> increased latency -> reduced performance
- How much slower with a smaller share? Read it off the performance-vs-cache-size curve
- We need performance as a function of shared resources to understand multicore scalability.
13 Cache Pollution
- Applications put data in the cache but do not reuse it
- No benefit, but also no problem, unless you are sharing: pollution wastes space other applications could benefit from
- Effects of cache pollution in a 4-process workload (bzip2, lbm, libquantum, gamess): a large fraction of the slowdown versus running alone is due to cache pollution, and throughput recovers when the pollution is eliminated
14 Cache Pollution: What's Going On?
- Does an application benefit from using the shared cache? Look at its miss ratio as a function of cache size
- If the miss ratio stays flat beyond the private cache sizes, more cache space means no fewer misses: the application gets no benefit from the shared cache space it occupies
- If the miss ratio keeps falling, fewer misses come with more space: the application benefits significantly from shared cache space
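The pollution effect can be reproduced with a toy LRU cache model (a sketch with invented sizes and workloads, not data from the talk): a streaming "polluter" that never reuses its lines still evicts the lines of a co-running "victim" that does reuse them.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal fully-associative LRU cache holding `capacity` lines."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict LRU line

def run(shared, n_rounds=1000):
    """Hit ratio of a 64-line cache, with or without a co-running polluter."""
    cache = LRUCache(capacity=64)
    for t in range(n_rounds):
        # Victim: repeatedly reuses a 48-line working set that fits alone.
        cache.access(('victim', t % 48))
        if shared:
            # Polluter: streams through memory, never reusing a line.
            cache.access(('stream', t))
    return cache.hits / (cache.hits + cache.misses)

print(f"alone: {run(shared=False):.2f}, shared: {run(shared=True):.2f}")
```

Alone, the victim hits almost always; with the polluter interleaved, the victim's reuse distances exceed the cache size and every access misses, even though the polluter itself gains nothing from the space it takes.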
15 Bandwidth Sharing: Predicting Multicore Scaling (470.lbm)
- The Cache Pirate gives both performance and required bandwidth as functions of shared cache size
- lbm loses almost no performance as its cache share shrinks, but the reduction in cache causes a large increase in required DRAM bandwidth (about 57% in the measured range)
- With more cores, each instance gets less cache space and so demands more bandwidth; throughput scales until the combined demand reaches the system's maximum bandwidth
- No performance loss from cache sharing, until you run out of shared bandwidth
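The scaling argument on this slide can be sketched numerically (all names, the demand curve, and the bandwidth numbers below are invented for illustration): give each identical instance an equal cache share, derive its bandwidth demand from that share, and cap throughput when aggregate demand exceeds the system maximum.

```python
def predict_throughput(n_cores, total_cache_mb, system_bw_gbs, bw_demand):
    """Each of n identical instances gets an equal cache share; scaling is
    linear until their combined bandwidth demand exceeds the system max,
    after which all instances stall proportionally."""
    share = total_cache_mb / n_cores
    demand = n_cores * bw_demand(share)
    scale = min(1.0, system_bw_gbs / demand)
    return n_cores * scale

# Invented demand curve: less cache -> more DRAM traffic per instance.
bw = lambda mb: 2.0 + 6.0 / mb   # GB/s per instance given `mb` MB of cache

for n in (1, 2, 4, 8):
    print(n, "cores ->", round(predict_throughput(n, 8.0, 24.0, bw), 2))
```

With these made-up numbers, throughput scales linearly to 4 cores and then collapses at 8, when 64 GB/s of aggregate demand hits the 24 GB/s ceiling: exactly the "fine until you run out of shared bandwidth" behavior the slide describes.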
16 Bandwidth Sharing: What Is Going On?
- The obvious shared resources are cache and bandwidth, but queues, DRAM row buffers, prefetchers, etc. are shared too
- This results in complex application sensitivities to bandwidth, capacity, latency, and access pattern, and complex interactions with shared bandwidth
- The Bandwidth Bandit methodology measures slowdown under memory contention. In DRAM, an access is a page-hit if the accessed row is already in the bank's row buffer, and a page-miss if another row is cached there and must first be written back; page-hits have the shortest latency and page-misses the longest
- Applications with similar baseline bandwidth consumption (e.g., mcf, lbm, soplex, streamcluster) slow down very differently at 90% of saturation bandwidth: contention effects come from several levels of the hierarchy and depend on the locality of the access stream
- Latency and bandwidth are intimately related through Little's law: the bandwidth an application can sustain is limited by its available memory parallelism divided by the memory latency
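The Little's-law relation in the last bullet can be sketched numerically (the latency and line-size values below are illustrative assumptions, not measurements from the talk): sustained bandwidth equals outstanding requests times the line size, divided by the memory latency.

```python
def sustained_bandwidth_gbs(mlp, latency_ns, line_bytes=64):
    """Little's law: bytes in flight divided by the time each request
    spends in flight bounds the sustainable bandwidth (bytes/ns == GB/s)."""
    return mlp * line_bytes / latency_ns

# Assumed numbers: 64-byte cache lines, 70 ns DRAM latency.
for mlp in (1, 4, 10):
    print(f"{mlp} outstanding misses -> "
          f"{sustained_bandwidth_gbs(mlp, 70):.2f} GB/s")
```

This is why low-MLP (latency-sensitive) applications suffer most under contention: when queuing raises the effective latency, their sustainable bandwidth falls in direct proportion, while high-MLP streaming applications can hide much of the increase.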
17 Shared Resource Sensitivity Is Critical for Scaling
- Understanding shared resource sensitivity is how you understand scaling
- Across SPEC benchmarks (gromacs, calculix, bzip2, libquantum, sphinx3, mcf, lbm, soplex, gcc, omnetpp), cache sensitivity ranges from essentially none to severe: sensitivity varies from application to application, and its impact varies from platform to platform
- As long as you have enough bandwidth, an application's shared-cache sensitivity determines how its throughput scales with core count
18 Aside: Implementation (Tools)
- Cache Pirating: performance as a function of cache size. A pirate thread steals shared cache while the target's performance is measured; monitoring the pirate's miss rate verifies how much cache it actually holds. This captures performance data for all cache sizes in one run, fully automatically, with an average target slowdown of about 5% (versus orders-of-magnitude slowdown for simulation). A similar technique works for bandwidth.
- Reducing cache pollution: sample memory access behavior (reuse distances), model misses for different cache sizes, then identify polluting instructions and change them to non-caching accesses
- D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
- A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.
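The "model misses for different cache sizes" step can be sketched with a classic stack-distance model (a simplification of the cited sampling-based techniques, with a made-up address trace): for a fully associative LRU cache, an access hits exactly when its stack distance, the number of distinct lines touched since the last access to the same line, is smaller than the cache size. One pass over a trace therefore yields miss ratios for every cache size at once.

```python
def stack_distances(trace):
    """Stack distance of each access: number of distinct addresses touched
    since the previous access to the same address (None = cold miss)."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses in the window since the last use.
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            dists.append(None)  # first touch: cold miss at any size
        last_seen[addr] = i
    return dists

def miss_ratio(trace, cache_lines):
    """Miss ratio of a fully associative LRU cache with `cache_lines` lines:
    an access misses iff it is cold or its stack distance >= capacity."""
    dists = stack_distances(trace)
    misses = sum(1 for d in dists if d is None or d >= cache_lines)
    return misses / len(trace)

trace = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3] * 10
for size in (2, 3, 4):
    print(f"{size} lines -> miss ratio {miss_ratio(trace, size):.2f}")
```

This naive version is O(n^2); the cited tools get the same curve cheaply by sampling reuse distances instead of processing the full trace. Note also that a larger cache can only lower the modeled miss ratio, matching the monotone curves on the earlier slides.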
19 What is all of this good for? WRAP UP
20 Why Is This Important?
- Performance and scaling are impacted by resource sharing
- How can we use this information? To predict and understand workload performance: useful for schedulers (static and dynamic), optimization, and compilation
- This needs new modeling technology (ongoing research): we can model workload cache behavior for simple cores (StatCC), but that is not good enough for modern machines (no MLP, prefetchers, out-of-order execution, etc.)
- We are developing new techniques for modeling real hardware
21 The Future
- More threads (many more cores, many more threads per chip)
- More complex hierarchies (NoCs, distributed caches)
- Heterogeneous systems (CPU + GPU + accelerators): AMD Fusion, Intel's integrated graphics
- Shared I/O
- More focus on energy (per-core DVFS, dark silicon, Turbo Boost)
22 Resource Sharing in Multicores
- Important: significant impact on performance (and efficiency) and on scaling; basic data lets us understand the interactions
- Complex: sensitivity varies across applications, and the hardware is hard to understand
- Moving forward: research on predicting performance and understanding scaling; MS thesis projects to port our tools and analyze your code; we're open (and eager) for collaboration!
- Uppsala Programming for Multicore Architectures Research Center
23 QUESTIONS?
24 Average behavior is not good enough: APPLICATION PHASES
25 Application Phases
- Everything so far has been average application behavior (e.g., 470.lbm's average bandwidth as a function of cache size, measured and averaged over time)
- Q: Is that good enough? Is application behavior constant over time?
- A: No. Applications have distinct phases with varying behaviors: cycles per instruction and branch mispredictions both vary across a phase sequence such as A B B C B D E, and the phases can be detected online from basic block vectors
- A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.
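The basic-block-vector idea can be sketched as follows (a toy version of the cited approach; the intervals, counts, and threshold are invented): execution is split into intervals, each summarized by a vector of basic-block execution counts, and an interval joins an existing phase when its normalized vector is close to that phase's representative.

```python
def normalize(bbv):
    """Scale a basic block vector so its entries sum to 1."""
    total = sum(bbv)
    return [c / total for c in bbv]

def classify_phases(intervals, threshold=0.5):
    """Greedy online phase classification: an interval joins the first
    known phase whose representative BBV is within `threshold`
    (Manhattan distance); otherwise it starts a new phase."""
    phases = []   # representative normalized BBV per phase
    labels = []
    for bbv in intervals:
        v = normalize(bbv)
        for pid, rep in enumerate(phases):
            if sum(abs(a - b) for a, b in zip(v, rep)) < threshold:
                labels.append(pid)
                break
        else:
            phases.append(v)
            labels.append(len(phases) - 1)
    return labels

# Invented per-interval basic-block counts forming phases A B B A C:
intervals = [
    [90, 5, 5], [10, 80, 10], [12, 78, 10], [88, 6, 6], [30, 30, 40],
]
print(classify_phases(intervals))  # -> [0, 1, 1, 0, 2]
```

Intervals that execute the same code regions get the same label even if raw counts differ slightly, which is what lets per-phase profiling (next slide) beat averaging over the whole run.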
26 Cache Miss Ratio Per Phase
- The cache miss ratio varies significantly between phases; profiling per phase instead of on average enables fast cache modeling at low overhead
- A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.
27 A brief look at application sensitivity: SHARING IN MORE DETAIL
28 More Detail: Cache Sharing
- Three SPEC benchmark applications with different behaviors due to different properties: each responds differently to cache sharing
- One extreme: the full working set fits in the cache at every allocation, so performance barely changes with cache share (insensitive)
- The other extreme: 462.libquantum streams through memory and its working set never fits, so it is also insensitive to cache size
- In between, benchmarks such as 450.soplex are sensitive: as their cache share shrinks, miss ratio, fetch ratio, and bandwidth all rise and performance falls
- (The slide reproduces Fig. 8 from the paper: performance, bandwidth requirements in GB/s, and fetch/miss ratios versus cache size for several benchmarks, collected with hardware prefetching enabled.)
29 More Detail: The Impact of Prefetching
- Applications see different degrees of hardware prefetching depending on their access pattern and on the hardware
- Prefetching reduces an application's sensitivity to memory latency
- The gap between the fetch ratio and the miss ratio is the prefetch rate: some benchmarks show no prefetching (fetch and miss ratios identical), some a minimal amount, and some significant prefetching that they clearly benefit from
- 470.lbm shows an 8x difference between its fetch and miss ratios, i.e., roughly 8 prefetched memory accesses for each miss; with hardware prefetching disabled its performance drops by about a third at all cache sizes, showing that prefetching was helping to compensate for reduced cache space
30 More Detail: Cache Pollution
- We can measure both how greedy an application is (how much shared cache it takes) and how sensitive it is (how much it benefits)
- By changing the code to use non-caching instructions we can make an application less greedy without hurting its performance: libquantum and lbm keep essentially the same miss ratio while occupying far less shared cache
31 More Detail: Bandwidth Sharing
- Sensitivity is a function of the application: latency sensitivity (memory-level parallelism) and bandwidth requirement (data rate)
- And of the hardware: the ability to handle out-of-order requests (queue sizes) and access pattern costs (streaming vs. random accesses in DRAM banks)
- Bandwidth consumption is not a good indicator of sensitivity: lbm, streamcluster, soplex, and mcf consume similar baseline bandwidth but slow down very differently at 90% of saturation bandwidth
(Source: http://uu.diva-portal.org. This is an author-produced version of a paper presented at the 4th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden.)
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationA Hybrid Adaptive Feedback Based Prefetcher
A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationChapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in
More informationCase Study 1: Optimizing Cache Performance via Advanced Techniques
6 Solutions to Case Studies and Exercises Chapter 2 Solutions Case Study 1: Optimizing Cache Performance via Advanced Techniques 2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationBias Scheduling in Heterogeneous Multi-core Architectures
Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationCray XE6 Performance Workshop
Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationAdvanced Computer Architecture (CS620)
Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).
More informationPerceptron Learning for Reuse Prediction
Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationSCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS
SCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS 1 JUNG KYU PARK, 2* JAEHO KIM, 3 HEUNG SEOK JEON 1 Department of Digital Media Design and Applications, Seoul Women s University,
More informationProcessors, Performance, and Profiling
Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode
More informationPerformance of Multithreaded Chip Multiprocessors and Implications for Operating System Design
Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip
More informationStealing the shared cache for fun and profit
IT 13 048 Examensarbete 30 hp Juli 2013 Stealing the shared cache for fun and profit Moncef Mechri Institutionen för informationsteknologi Department of Information Technology Abstract Stealing the shared
More informationIntroduction to OpenMP. Lecture 10: Caches
Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More information5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction
5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner Topic 1: Introduction These slides are mostly taken verbatim, or with minor changes, from
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationChapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY
Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored
More informationChapter 18 - Multicore Computers
Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationScheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationComputer Architecture Spring 2016
omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,
More informationDynamic Cache Pooling in 3D Multicore Processors
Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationDRAM Main Memory. Dual Inline Memory Module (DIMM)
DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationMultiprocessor scheduling, part 1 -ChallengesPontus Ekberg
Multiprocessor scheduling, part 1 -ChallengesPontus Ekberg 2017-10-03 What is a multiprocessor? Simplest answer: A machine with >1 processors! In scheduling theory, we include multicores in this defnition
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationAS the processor-memory speed gap continues to widen,
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 7, JULY 2004 843 Design and Optimization of Large Size and Low Overhead Off-Chip Caches Zhao Zhang, Member, IEEE, Zhichun Zhu, Member, IEEE, and Xiaodong Zhang,
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationImproving Real-Time Performance on Multicore Platforms Using MemGuard
Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved
More information