Resource Sharing in Multicore Processors


1 Uppsala Programming for Multicore Architectures Research Center
Resource Sharing in Multicore Processors
David Black-Schaffer, Assistant Professor, Department of Information Technology, Uppsala University
Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant

2 Resource Sharing in Multicores
Sandy Bridge and Nehalem: the cores share the memory controller, the I/O, and the L3 cache. Most people focus on the core count: parallelism, synchronization, deadlock, etc. I am going to focus on the shared memory system: cache, bandwidth, prefetching, etc. The shared memory system significantly impacts performance, efficiency, and scalability.

3 Goals For Today
1. Overview of shared memory system resources
2. Impact on application performance
3. Importance of application sensitivity
4. Research tools and directions

4 SHARED RESOURCES
Not just cache and bandwidth

5 Multicore Memory Systems
Intel Nehalem memory hierarchy: latency (in cycles) grows from the private L1 and L2 caches through the shared last-level cache to off-chip DRAM, while the available bandwidth (in bytes/cycle) shrinks. The private L1 bandwidth is per core; the DRAM bandwidth is shared by all cores, so each core's share is only a fraction of the total.
D. Molka, et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.
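Latencies like these are typically measured with small microbenchmarks. Below is a minimal pointer-chasing latency sketch, not from the talk or the cited paper; the buffer size and iteration count are illustrative, and the random cyclic permutation keeps the hardware prefetcher from hiding the latency.

```c
/* Minimal pointer-chasing latency sketch (illustrative; not from the talk).
 * Each load depends on the previous one, so the elapsed time divided by the
 * number of loads approximates the load-to-use latency of whatever level of
 * the hierarchy the buffer fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 1 << 20;                    /* 1M pointers: spills past L2 into L3/DRAM */
    void **buf = malloc(n * sizeof *buf);
    size_t *idx = malloc(n * sizeof *idx);
    /* Build a random cyclic permutation so the prefetcher cannot follow it. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n * 16; i++)    /* chase: each load depends on the last */
        p = *p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%p avg latency: %.1f ns/load\n", (void *)p, ns / (n * 16));
    free(buf); free(idx);
    return 0;
}
```

Sizing the buffer to fit in the L1, the L2, the shared cache, or none of them isolates the latency of each level in turn.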

6 Other Resources
SMT (Hyper-Threading) shares the private caches between threads: the instruction cache, the L1/L2 data caches, the branch predictors, the TLBs, and instruction decode.
AMD's Bulldozer shares the floating point units, instruction fetch/decode, the L2 cache, and the branch predictor.

7 Energy As A Resource
Probably the 2nd most important resource today, soon to be the 1st.
Power (controlled by DVFS: Dynamic Voltage/Frequency Scaling)
Thermal capacity (exploited by AMD's Turbo CORE and Intel's Turbo Boost)

8 Aside: Why Share Resources?
Parallel performance: efficient data sharing across threads and fast synchronization through a shared cache simplify synchronization and coherency.
Flexibility: cache partitioning allows flexible resource allocation for different workloads.
Amdahl's law: you always need single-threaded performance. (Figure: maximum speedup vs. number of cores for 99%, 90%, and 75% parallel code, showing the limits of Amdahl's law.)
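The Amdahl's law limit in that figure follows from the standard formula; a quick sketch of the math (a standard result, not specific to the talk):

```latex
% Amdahl's law: with parallel fraction p running on n cores, the speedup is
% bounded by 1/(1-p) no matter how many cores are added. For example,
% p = 0.75 caps the speedup at 4x, and p = 0.99 caps it at 100x -- hence the
% continuing need for single-threaded performance.
S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n\to\infty} S(n) = \frac{1}{1-p}
```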

9 PERFORMANCE IMPACTS OF SHARING
Cache sharing, cache pollution, and bandwidth sharing

10 Cache Sharing
Cache Pirate experiment with 471.omnetpp: run four independent instances of the same program on a 4-core Nehalem. Each instance gets about ¼ of the shared cache and runs measurably slower than it would with the whole cache to itself: performance is affected by the shared cache allocation. (Figure: throughput of 471.omnetpp as a function of shared cache size, 1M-7M, expected vs. measured.)
Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability.

11 Cache Sharing: What's Going On?
The reduced scalability comes from cache sensitivity: as an instance's fraction of the shared cache shrinks, more of its accesses fall past the L1, L2, and shared cache to DRAM, and its average memory latency rises. (Figures: throughput of 471.omnetpp vs. shared cache size, and memory latency in cycles for different fractions of the shared cache.)

12 Cache Sharing: What's Going On?
How much slower? Read it off the curve: less shared cache space means more DRAM accesses, which means increased latency and reduced performance. We need performance as a function of shared resources to understand multicore scalability.

13 Cache Pollution
Applications put data in the cache but do not reuse it. That brings no benefit, but it is also no problem, unless you are sharing cache space: pollution wastes space that other applications could benefit from. (Figure: effects of cache pollution in a 4-process workload of bzip2, lbm, libquantum, and gamess; throughput alone, in the mixed workload, and with pollution eliminated. A substantial fraction of the slowdown is due to cache pollution.)

14 Cache Pollution: What's Going On?
Does an application benefit from using the shared cache? Look at its miss ratio as a function of cache size. If more cache space yields no fewer misses beyond the private cache size, the application gets no benefit from the shared cache space; its data is pure pollution. If the miss ratio keeps dropping as the cache grows beyond the private size, the application benefits significantly from the shared cache space.
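That test can be automated directly from a measured miss-ratio curve. A minimal sketch below; the helper name, the 10% threshold, and the example curves are mine and purely illustrative:

```c
/* Classify whether an application benefits from shared cache space, given
 * its miss ratio at several cache sizes (illustrative sketch; the threshold
 * and the example curves are made up, not from the talk). */
#include <stdio.h>

/* Returns 1 if growing the cache from the private size to the full shared
 * size reduces the miss ratio by more than `threshold` (relative). */
int benefits_from_shared_cache(const double *miss_ratio, int n_sizes,
                               int private_idx, double threshold) {
    double at_private = miss_ratio[private_idx];
    double at_shared  = miss_ratio[n_sizes - 1];
    return (at_private - at_shared) / at_private > threshold;
}

int main(void) {
    /* Miss ratios at 1M..7M; index 0 is roughly the private cache size. */
    double flat[7]     = {0.20, 0.20, 0.19, 0.19, 0.19, 0.19, 0.19}; /* polluter */
    double dropping[7] = {0.20, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02}; /* benefits */
    printf("flat:     %s\n", benefits_from_shared_cache(flat, 7, 0, 0.10)
                                 ? "benefits" : "no benefit (polluter)");
    printf("dropping: %s\n", benefits_from_shared_cache(dropping, 7, 0, 0.10)
                                 ? "benefits" : "no benefit (polluter)");
    return 0;
}
```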

15 Bandwidth Sharing: Predicting Multicore Scaling
Using the Cache Pirate data for 470.lbm, we get both performance and required bandwidth as a function of shared cache size. Running more instances means less cache space per instance, which means more bandwidth required (here, the reduction in cache causes a 57% increase in bandwidth). There is no performance loss as long as the total required bandwidth stays below the system maximum: scaling is fine until you run out of shared bandwidth. (Figures: throughput and bandwidth in GB/s vs. cache size, 1M-7M, for one to four cores; cache pirate predicted vs. measured.)
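A sketch of the scaling model this implies, in my own notation rather than the paper's:

```latex
% Let P(c) and B(c) be one instance's throughput and bandwidth when it gets
% c bytes of shared cache (both measured with the Cache Pirate), C the total
% shared cache, and B_max the saturated memory bandwidth. Running n identical
% instances, each gets roughly C/n of the cache, so the predicted throughput
% holds only while the combined bandwidth demand still fits; past that point
% the shared bandwidth, not the number of cores, limits scaling.
T(n) \approx n\,P\!\left(\tfrac{C}{n}\right)
\quad \text{while} \quad
n\,B\!\left(\tfrac{C}{n}\right) \le B_{\max}
```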

16 Bandwidth Sharing: What Is Going On?
Shared resources: cache and bandwidth, but also queues, DRAM row buffers, prefetchers, etc. The result is complex application sensitivities (to bandwidth, capacity, latency, and access pattern) and complex interactions with shared bandwidth.
Bandwidth Bandit: understanding memory contention. Applications co-scheduled on a multicore compete for shared resources such as cache capacity and memory bandwidth, and the resulting performance degradation can be substantial, which makes it important to manage these resources; that, in turn, requires understanding how applications are impacted by contention. While cache-sharing effects have been studied extensively, memory bandwidth sharing is less well explored, in large part because of its complex nature. A DRAM access is a page-hit when the accessed row is already in the bank's row buffer, and a page-miss when a row other than the one accessed is cached there; on a page-miss the cached row must first be written back to the memory bank before the newly accessed row is copied into the row buffer, so a page-hit has the shortest latency and a page-miss the longest. An application's sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory hierarchy and on the parallelism and locality of its access stream; latency and bandwidth are the two intimately related metrics describing the hierarchy (related through Little's law). (Figure: baseline bandwidth in GB/s without contention, and slowdown in % under contention at 90% of saturation bandwidth, for benchmarks including lbm, mcf, soplex, and streamcluster: despite similar baseline bandwidth consumption, their sensitivities to memory contention vary widely, demonstrating the need for a more detailed understanding of memory contention.)
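The Little's law connection mentioned above can be made concrete; the following is a standard formulation, with notation of my choosing:

```latex
% Little's law applied to the memory system: with MLP outstanding misses on
% average (memory-level parallelism), each fetching a cache line of L bytes,
% and an average memory latency of T, the sustained bandwidth is bounded by
% MLP * L / T. Contention increases T, so only applications with enough
% memory-level parallelism can sustain their bandwidth -- low-MLP,
% latency-bound applications suffer the largest slowdowns.
\mathrm{BW} = \frac{\mathrm{MLP}\cdot L}{T}
```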

17 Sensitivity Critical for Scaling
Shared resource sensitivity is the key to understanding scaling. Sensitivity varies from application to application, and the impact varies from platform to platform; losing shared cache only hurts once you no longer have enough bandwidth. (Figure: performance, bandwidth requirements in GB/s, and fetch/miss ratios as a function of cache size for 435.gromacs, 454.calculix, 401.bzip2, 429.mcf, 462.libquantum, 482.sphinx3, 470.lbm, 450.soplex, and 403.gcc, with sensitive and insensitive applications marked.)

18 Aside: Implementation (Tools)
Performance as a function of cache size (Cache Pirating): a Pirate thread steals a controlled amount of shared cache while we measure the Target's performance, and monitoring the Pirate's miss rate verifies how much cache it actually holds. (A similar approach works for bandwidth.) This captures performance data for all cache sizes in one run, with an average Target slowdown of about 5%, far below the slowdown of simulation, and it is fully automatic.
Reducing cache pollution: sample memory access behavior (reuse distances), model misses for different cache sizes, and identify polluting instructions to change to non-caching accesses.
D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.
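The heart of the pirate is just a thread that keeps a working set of the desired size hot in the shared cache. A minimal sketch of the idea, leaving out the miss-rate feedback the real tool uses to verify how much cache the pirate actually holds; the sizes here are illustrative:

```c
/* Minimal "cache pirate" sketch: a thread that occupies a chosen amount of
 * shared cache by continuously re-touching a working set of that size.
 * Illustration of the idea only; the real tool also monitors the pirate's
 * own miss rate, and the sizes here are assumptions. */
#include <pthread.h>
#include <stdlib.h>

#define LINE 64                    /* cache line size in bytes (assumed) */
static volatile int stop = 0;

static void *pirate(void *arg) {
    size_t steal_bytes = *(size_t *)arg;      /* how much cache to steal */
    char *ws = malloc(steal_bytes);
    while (!stop)                             /* keep the working set hot */
        for (size_t i = 0; i < steal_bytes; i += LINE)
            ws[i]++;                          /* touch one byte per line */
    free(ws);
    return NULL;
}

int main(void) {
    size_t steal = 2u << 20;                  /* steal ~2 MB of shared cache */
    pthread_t t;
    pthread_create(&t, NULL, pirate, &steal);
    /* ... run and time the Target application here ... */
    stop = 1;
    pthread_join(t, NULL);
    return 0;
}
```

Sweeping `steal` from zero up to the full shared cache size yields the Target's performance at every effective cache allocation in a single pass per point.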

19 WRAP UP
What is all of this good for?

20 Why Is This Important?
Performance and scaling are impacted by resource sharing. How can we use this information? To predict and understand workload performance, which is useful for schedulers (static and dynamic), optimization, and compilation. This needs new modeling technology (on-going research): we can model workload cache behavior for simple cores (StatCC), but that is not good enough for modern machines (no MLP, prefetchers, OoO, etc.), so we are developing new techniques for modeling real hardware.

21 The Future
More threads (100s of cores, 1000s of threads). More complex hierarchies (NoCs, distributed caches). Heterogeneous systems (CPU+GPU+accelerators) with shared I/O, such as AMD Fusion and Intel's integrated graphics. More focus on energy (per-core DVFS, Turbo Boost, dark silicon).

22 Resource Sharing in Multicores
Important: significant impact on performance (and efficiency) scaling; basic data lets us understand the interactions.
Complex: sensitivity varies across applications, and the hardware is hard to understand.
Moving forward: research on predicting performance and understanding scaling; MS thesis projects to port our tools and analyze your code. We're open (and eager) for collaboration!
Uppsala Programming for Multicore Architectures Research Center

23 QUESTIONS?

24 APPLICATION PHASES
Average behavior is not good enough

25 Application Phases
Everything so far has been average application behavior. Q: Is that good enough? Is application behavior constant? A: No. Applications have distinct phases with varying behaviors: measured over time, 470.lbm's bandwidth looks nothing like its average over time. (Figures: bandwidth in GB/s vs. cache size for 470.lbm, measured over time vs. averaged; cycles per instruction and branch mispredictions over time, with phases A B B C B D E identified from basic block vectors.)
A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.
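Phase classification with basic block vectors boils down to summarizing each execution window by how often each basic block ran, and grouping windows whose vectors are close. A minimal sketch of that core idea only; the distance threshold, vector size, and clustering policy are mine, and the paper's online algorithm is considerably more refined:

```c
/* Illustrative phase-classification sketch using basic block vectors (BBVs):
 * execution is split into fixed-size instruction windows, each summarized by
 * a normalized vector of basic block execution counts. A window joins an
 * existing phase if its vector is within a Manhattan-distance threshold of
 * that phase's signature; otherwise it starts a new phase. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS   8      /* tracked basic blocks (assumed) */
#define MAXPHASES 16

static double signatures[MAXPHASES][NBLOCKS];
static int nphases = 0;

static double manhattan(const double *a, const double *b) {
    double d = 0;
    for (int i = 0; i < NBLOCKS; i++)
        d += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return d;
}

/* Classify one window's normalized BBV; returns a phase id (A=0, B=1, ...). */
static int classify(const double *bbv, double threshold) {
    for (int p = 0; p < nphases; p++)
        if (manhattan(bbv, signatures[p]) < threshold)
            return p;                          /* close enough: same phase */
    memcpy(signatures[nphases], bbv, sizeof signatures[0]);
    return nphases++;                          /* new behavior: new phase */
}

int main(void) {
    double w1[NBLOCKS] = {.50, .50, 0, 0, 0, 0, 0, 0};
    double w2[NBLOCKS] = {.48, .52, 0, 0, 0, 0, 0, 0};
    double w3[NBLOCKS] = {0, 0, .90, .10, 0, 0, 0, 0};
    printf("%c%c%c\n", 'A' + classify(w1, .2), 'A' + classify(w2, .2),
           'A' + classify(w3, .2));  /* prints "AAB" */
    return 0;
}
```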

26 Cache Miss Ratio Per Phase
(Figure: cache miss ratio modeled per phase, and the profiling overhead.)
A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.

27 SHARING IN MORE DETAIL
A brief look at application sensitivity

28 More Detail: Cache Sharing
Three SPEC benchmark applications, three different behaviors due to different properties, and each responds differently to cache sharing. An application whose full working set fits in the cache barely notices losing shared cache space; one whose working set mostly fits sees little increased latency; a cache-sensitive application such as 450.soplex slows down markedly as its share of the cache shrinks. (Figure: performance, bandwidth requirements in GB/s, and fetch/miss ratios as a function of cache size, 1M-7M, for benchmarks including 435.gromacs, 401.bzip2, 429.mcf, 454.calculix, 482.sphinx3, 470.lbm, 462.libquantum, 450.soplex, and 403.gcc, with hardware prefetching enabled and sensitive vs. insensitive applications marked.)

29 More Detail: The Impact of Prefetching
The degree of prefetching depends on the application's access pattern and on the hardware, and prefetching reduces an application's sensitivity to latency. The difference between the fetch ratio and the miss ratio is the prefetch rate: some applications show no prefetching, some minimal prefetching, and some significant prefetching that they clearly benefit from. 470.lbm, for example, shows a large gap between its fetch and miss ratios, indicating many prefetched memory accesses per miss; with hardware prefetching disabled its fetch and miss ratios are identical and its performance drops at all cache sizes, showing that prefetching was helping to compensate for the reduced cache space. (Figure: miss ratio, fetch ratio, and bandwidth as a function of cache size for several benchmarks, with hardware prefetching enabled and, for 470.lbm, disabled.)

30 More Detail: Cache Pollution
We can measure how greedy an application is and how sensitive it is. By changing the code to use non-caching instructions, we can make an application less greedy without hurting its performance. (Figure: miss ratio as a function of cache size, from private to shared sizes, for libquantum and lbm.)
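On x86, the non-caching accesses are non-temporal store instructions. A minimal sketch of rewriting a streaming write with them, assuming SSE2; the intrinsics are real SSE2 intrinsics, but the surrounding function is illustrative:

```c
/* Minimal sketch of making a streaming write "less greedy" with x86
 * non-temporal stores (SSE2). _mm_stream_si128 writes combine in buffers
 * and go to memory without allocating lines in the cache, so the copy no
 * longer pollutes the shared cache. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* dst and src must be 16-byte aligned; n must be a multiple of 16 bytes. */
void copy_noncaching(void *dst, const void *src, size_t n) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_load_si128(s + i);  /* normal (cached) load */
        _mm_stream_si128(d + i, v);         /* non-temporal store: bypasses cache */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

This targets exactly the greedy pattern above: the stored data will not be reused, so keeping it out of the cache costs the writer nothing and frees shared space for others.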

31 More Detail: Bandwidth Sharing
Sensitivity is a function of the application: its latency sensitivity (memory level parallelism) and its bandwidth requirement (data rate). And of the hardware: its ability to handle out-of-order requests (queue sizes) and its access pattern costs (streaming vs. random in DRAM banks). Bandwidth consumption is not a good indicator of sensitivity. (Figure: baseline bandwidth in GB/s without contention, and slowdown in % at 90% of saturation bandwidth, for lbm, streamcluster, soplex, and mcf.)
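Contention like this is typically generated by running "bandit" threads alongside the target. A minimal sketch of the idea; the thread count and buffer size are illustrative, and the real Bandwidth Bandit additionally controls DRAM bank placement and memory-level parallelism:

```c
/* Minimal "bandwidth bandit" sketch: threads that stream through buffers
 * far larger than the last-level cache to consume memory bandwidth while
 * a target application runs on the remaining core(s). */
#include <pthread.h>
#include <stdlib.h>

#define LINE 64
#define BUF  (256u << 20)       /* 256 MB: far larger than any LLC */
static volatile int stop = 0;
static volatile long sink;      /* defeat dead-code elimination */

static void *bandit(void *arg) {
    (void)arg;
    char *buf = malloc(BUF);
    long sum = 0;
    while (!stop)                           /* stream: every access misses the LLC */
        for (size_t i = 0; i < BUF; i += LINE)
            sum += buf[i];
    sink = sum;
    free(buf);
    return NULL;
}

int main(void) {
    enum { NBANDITS = 3 };      /* e.g., bandits on 3 cores, target on the 4th */
    pthread_t t[NBANDITS];
    for (int i = 0; i < NBANDITS; i++) pthread_create(&t[i], NULL, bandit, NULL);
    /* ... run and time the target application here ... */
    stop = 1;
    for (int i = 0; i < NBANDITS; i++) pthread_join(t[i], NULL);
    return 0;
}
```

Varying the number of bandits (and pinning them with taskset or pthread affinity) sweeps the contention level, which is how slowdown curves like the one in the figure are produced.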
