Resource Sharing in Multicore Processors
1 Uppsala Programming for Multicore Architectures Research Center. Resource Sharing in Multicore Processors. David Black-Schaffer, Assistant Professor, Department of Information Technology, Uppsala University. Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant
2 Resource Sharing in Multicores
- Die shots of Sandy Bridge and Nehalem: cores alongside the shared memory controller, I/O, and L3 cache
- Most people focus on the core count: parallelism, synchronization, deadlock, etc.
- I am going to focus on the shared memory system: cache, bandwidth, prefetching, etc.
- The shared memory system significantly impacts performance, efficiency, and scalability.
3 Goals For Today
- Overview of shared memory system resources
- Impact on application performance
- Importance of application sensitivity
- Research tools and directions
4 Not just cache and bandwidth: SHARED RESOURCES
5 Multicore Memory Systems
- Intel Nehalem memory hierarchy: small, fast private L1/L2 caches, a shared last-level cache, and off-chip DRAM
- Latency grows sharply from the private caches to DRAM, while per-core bandwidth shrinks from the private caches to shared DRAM (the specific cycle and bytes/cycle figures were lost in transcription)
- D. Molka, et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.
6 Other Resources
- SMT (Hyper-Threading) shares the private caches (instruction cache, L1/L2 data caches), branch predictors, TLBs, and instruction decode between threads
- AMD's Bulldozer: each module shares its floating point units, instruction fetch/decode, L2 cache, and branch predictor between two cores
7 Energy As A Resource
- Probably the 2nd most important resource today, soon to be the 1st
- Power (controlled by DVFS: Dynamic Voltage/Frequency Scaling)
- Thermal capacity: AMD's Turbo CORE and Intel's Turbo Boost spend it to boost frequency
8 Aside: Why Share Resources?
- Parallel performance: data sharing across threads; a shared cache simplifies synchronization and coherency (efficient data sharing, fast synchronization)
- Flexibility: cache partitioning can be adapted to different workloads (flexible resource allocation)
- Amdahl's law: you always need single-threaded performance. Asymptotically, a 99%-parallel program is limited to 100x speedup, a 90%-parallel one to 10x, and a 75%-parallel one to 4x, no matter how many cores you add.
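The Amdahl's-law limit in the last bullet is easy to check numerically. A minimal sketch (the function name and core counts are illustrative, not from the talk):

```python
# Amdahl's law: speedup on n cores for a program whose fraction p
# is perfectly parallel is S(n) = 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup of a program with parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.75, 0.90, 0.99):
    # The limit as n grows is 1 / (1 - p): 4x, 10x, 100x respectively.
    print(f"p={p}: 64 cores -> {amdahl_speedup(p, 64):.1f}x, "
          f"limit -> {1 / (1 - p):.0f}x")
```

Even at 99% parallel, 64 cores yield only about 39x, which is why single-threaded performance (and the shared resources that feed it) still matters.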
9 Cache Sharing, Cache Pollution, and Bandwidth Sharing: PERFORMANCE IMPACTS OF SHARING
10 Cache Sharing: The Cache Pirate Experiment (471.omnetpp)
- Run independent instances of the same program on a multicore Nehalem and measure throughput as the number of instances grows
- Measured throughput falls well short of the expected linear scaling: each added instance shrinks every instance's share of the shared L3 cache
- The Cache Pirate's prediction, built from performance measured as a function of shared cache size, matches the measured throughput closely
- With only a quarter of the shared cache, omnetpp runs noticeably slower
- Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability
11 Cache Sharing: What's Going On?
- 471.omnetpp's cache sensitivity: its performance drops steeply as its share of the shared cache shrinks, which directly produces the reduced scalability seen in the experiment
12 Cache Sharing: What's Going On?
- Less shared cache space -> more DRAM accesses -> increased latency -> reduced performance
- How much slower with a smaller share? Read it off the performance-vs-cache-size curve
- We need performance as a function of shared resources to understand multicore scalability.
13 Cache Pollution
- Applications put data in the cache but do not reuse it
- No benefit, but also no problem, unless you are sharing: pollution wastes space other applications could benefit from
- Effects of cache pollution in a 4-process workload (bzip2, lbm, libquantum, gamess): a large fraction of the slowdown versus running alone is due to cache pollution, and throughput recovers when the pollution is eliminated
14 Cache Pollution: What's Going On?
- Does an application benefit from using the shared cache? Look at its miss ratio as a function of cache size
- If the miss ratio stays flat beyond the private cache sizes, more cache space means no fewer misses: the application gets no benefit from the shared cache space it occupies
- If the miss ratio keeps falling, fewer misses come with more space: the application benefits significantly from shared cache space
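The pollution effect can be reproduced with a toy LRU cache model (a sketch with invented sizes and workloads, not data from the talk): a streaming "polluter" that never reuses its lines still evicts the lines of a co-running "victim" that does reuse them.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal fully-associative LRU cache holding `capacity` lines."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict LRU line

def run(shared, n_rounds=1000):
    """Hit ratio of a 64-line cache, with or without a co-running polluter."""
    cache = LRUCache(capacity=64)
    for t in range(n_rounds):
        # Victim: repeatedly reuses a 48-line working set that fits alone.
        cache.access(('victim', t % 48))
        if shared:
            # Polluter: streams through memory, never reusing a line.
            cache.access(('stream', t))
    return cache.hits / (cache.hits + cache.misses)

print(f"alone: {run(shared=False):.2f}, shared: {run(shared=True):.2f}")
```

Alone, the victim hits almost always; with the polluter interleaved, the victim's reuse distances exceed the cache size and every access misses, even though the polluter itself gains nothing from the space it takes.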
15 Bandwidth Sharing: Predicting Multicore Scaling (470.lbm)
- The Cache Pirate gives both performance and required bandwidth as functions of shared cache size
- lbm loses almost no performance as its cache share shrinks, but the reduction in cache causes a large increase in required DRAM bandwidth (about 57% in the measured range)
- With more cores, each instance gets less cache space and so demands more bandwidth; throughput scales until the combined demand reaches the system's maximum bandwidth
- No performance loss from cache sharing, until you run out of shared bandwidth
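The scaling argument on this slide can be sketched numerically (all names, the demand curve, and the bandwidth numbers below are invented for illustration): give each identical instance an equal cache share, derive its bandwidth demand from that share, and cap throughput when aggregate demand exceeds the system maximum.

```python
def predict_throughput(n_cores, total_cache_mb, system_bw_gbs, bw_demand):
    """Each of n identical instances gets an equal cache share; scaling is
    linear until their combined bandwidth demand exceeds the system max,
    after which all instances stall proportionally."""
    share = total_cache_mb / n_cores
    demand = n_cores * bw_demand(share)
    scale = min(1.0, system_bw_gbs / demand)
    return n_cores * scale

# Invented demand curve: less cache -> more DRAM traffic per instance.
bw = lambda mb: 2.0 + 6.0 / mb   # GB/s per instance given `mb` MB of cache

for n in (1, 2, 4, 8):
    print(n, "cores ->", round(predict_throughput(n, 8.0, 24.0, bw), 2))
```

With these made-up numbers, throughput scales linearly to 4 cores and then collapses at 8, when 64 GB/s of aggregate demand hits the 24 GB/s ceiling: exactly the "fine until you run out of shared bandwidth" behavior the slide describes.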
16 Bandwidth Sharing: What Is Going On?
- The obvious shared resources are cache and bandwidth, but queues, DRAM row buffers, prefetchers, etc. are shared too
- This results in complex application sensitivities to bandwidth, capacity, latency, and access pattern, and complex interactions with shared bandwidth
- The Bandwidth Bandit methodology measures slowdown under memory contention. In DRAM, an access is a page-hit if the accessed row is already in the bank's row buffer, and a page-miss if another row is cached there and must first be written back; page-hits have the shortest latency and page-misses the longest
- Applications with similar baseline bandwidth consumption (e.g., mcf, lbm, soplex, streamcluster) slow down very differently at 90% of saturation bandwidth: contention effects come from several levels of the hierarchy and depend on the locality of the access stream
- Latency and bandwidth are intimately related through Little's law: the bandwidth an application can sustain is limited by its available memory parallelism divided by the memory latency
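The Little's-law relation in the last bullet can be sketched numerically (the latency and line-size values below are illustrative assumptions, not measurements from the talk): sustained bandwidth equals outstanding requests times the line size, divided by the memory latency.

```python
def sustained_bandwidth_gbs(mlp, latency_ns, line_bytes=64):
    """Little's law: bytes in flight divided by the time each request
    spends in flight bounds the sustainable bandwidth (bytes/ns == GB/s)."""
    return mlp * line_bytes / latency_ns

# Assumed numbers: 64-byte cache lines, 70 ns DRAM latency.
for mlp in (1, 4, 10):
    print(f"{mlp} outstanding misses -> "
          f"{sustained_bandwidth_gbs(mlp, 70):.2f} GB/s")
```

This is why low-MLP (latency-sensitive) applications suffer most under contention: when queuing raises the effective latency, their sustainable bandwidth falls in direct proportion, while high-MLP streaming applications can hide much of the increase.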
17 Shared Resource Sensitivity Is Critical for Scaling
- Understanding shared resource sensitivity is how you understand scaling
- Across SPEC benchmarks (gromacs, calculix, bzip2, libquantum, sphinx3, mcf, lbm, soplex, gcc, omnetpp), cache sensitivity ranges from essentially none to severe: sensitivity varies from application to application, and its impact varies from platform to platform
- As long as you have enough bandwidth, an application's shared-cache sensitivity determines how its throughput scales with core count
18 Aside: Implementation (Tools)
- Cache Pirating: performance as a function of cache size. A pirate thread steals shared cache while the target's performance is measured; monitoring the pirate's miss rate verifies how much cache it actually holds. This captures performance data for all cache sizes in one run, fully automatically, with an average target slowdown of about 5% (versus orders-of-magnitude slowdown for simulation). A similar technique works for bandwidth.
- Reducing cache pollution: sample memory access behavior (reuse distances), model misses for different cache sizes, then identify polluting instructions and change them to non-caching accesses
- D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
- A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.
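The "model misses for different cache sizes" step can be sketched with a classic stack-distance model (a simplification of the cited sampling-based techniques, with a made-up address trace): for a fully associative LRU cache, an access hits exactly when its stack distance, the number of distinct lines touched since the last access to the same line, is smaller than the cache size. One pass over a trace therefore yields miss ratios for every cache size at once.

```python
def stack_distances(trace):
    """Stack distance of each access: number of distinct addresses touched
    since the previous access to the same address (None = cold miss)."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses in the window since the last use.
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            dists.append(None)  # first touch: cold miss at any size
        last_seen[addr] = i
    return dists

def miss_ratio(trace, cache_lines):
    """Miss ratio of a fully associative LRU cache with `cache_lines` lines:
    an access misses iff it is cold or its stack distance >= capacity."""
    dists = stack_distances(trace)
    misses = sum(1 for d in dists if d is None or d >= cache_lines)
    return misses / len(trace)

trace = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3] * 10
for size in (2, 3, 4):
    print(f"{size} lines -> miss ratio {miss_ratio(trace, size):.2f}")
```

This naive version is O(n^2); the cited tools get the same curve cheaply by sampling reuse distances instead of processing the full trace. Note also that a larger cache can only lower the modeled miss ratio, matching the monotone curves on the earlier slides.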
19 What is all of this good for? WRAP UP
20 Why Is This Important?
- Performance and scaling are impacted by resource sharing
- How can we use this information? To predict and understand workload performance: useful for schedulers (static and dynamic), optimization, and compilation
- This needs new modeling technology (ongoing research): we can model workload cache behavior for simple cores (StatCC), but that is not good enough for modern machines (no MLP, prefetchers, out-of-order execution, etc.)
- We are developing new techniques for modeling real hardware
21 The Future
- More threads (many more cores, many more threads per chip)
- More complex hierarchies (NoCs, distributed caches)
- Heterogeneous systems (CPU + GPU + accelerators): AMD Fusion, Intel's integrated graphics
- Shared I/O
- More focus on energy (per-core DVFS, dark silicon, Turbo Boost)
22 Resource Sharing in Multicores
- Important: significant impact on performance (and efficiency) and on scaling; basic data lets us understand the interactions
- Complex: sensitivity varies across applications, and the hardware is hard to understand
- Moving forward: research on predicting performance and understanding scaling; MS thesis projects to port our tools and analyze your code; we're open (and eager) for collaboration!
- Uppsala Programming for Multicore Architectures Research Center
23 QUESTIONS?
24 Average behavior is not good enough: APPLICATION PHASES
25 Application Phases
- Everything so far has been average application behavior (e.g., 470.lbm's average bandwidth as a function of cache size, measured and averaged over time)
- Q: Is that good enough? Is application behavior constant over time?
- A: No. Applications have distinct phases with varying behaviors: cycles per instruction and branch mispredictions both vary across a phase sequence such as A B B C B D E, and the phases can be detected online from basic block vectors
- A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.
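The basic-block-vector idea can be sketched as follows (a toy version of the cited approach; the intervals, counts, and threshold are invented): execution is split into intervals, each summarized by a vector of basic-block execution counts, and an interval joins an existing phase when its normalized vector is close to that phase's representative.

```python
def normalize(bbv):
    """Scale a basic block vector so its entries sum to 1."""
    total = sum(bbv)
    return [c / total for c in bbv]

def classify_phases(intervals, threshold=0.5):
    """Greedy online phase classification: an interval joins the first
    known phase whose representative BBV is within `threshold`
    (Manhattan distance); otherwise it starts a new phase."""
    phases = []   # representative normalized BBV per phase
    labels = []
    for bbv in intervals:
        v = normalize(bbv)
        for pid, rep in enumerate(phases):
            if sum(abs(a - b) for a, b in zip(v, rep)) < threshold:
                labels.append(pid)
                break
        else:
            phases.append(v)
            labels.append(len(phases) - 1)
    return labels

# Invented per-interval basic-block counts forming phases A B B A C:
intervals = [
    [90, 5, 5], [10, 80, 10], [12, 78, 10], [88, 6, 6], [30, 30, 40],
]
print(classify_phases(intervals))  # -> [0, 1, 1, 0, 2]
```

Intervals that execute the same code regions get the same label even if raw counts differ slightly, which is what lets per-phase profiling (next slide) beat averaging over the whole run.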
26 Cache Miss Ratio Per Phase
- The cache miss ratio varies significantly between phases; profiling per phase instead of on average enables fast cache modeling at low overhead
- A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.
27 A brief look at application sensitivity: SHARING IN MORE DETAIL
28 More Detail: Cache Sharing
- Three SPEC benchmark applications with different behaviors due to different properties: each responds differently to cache sharing
- One extreme: the full working set fits in the cache at every allocation, so performance barely changes with cache share (insensitive)
- The other extreme: 462.libquantum streams through memory and its working set never fits, so it is also insensitive to cache size
- In between, benchmarks such as 450.soplex are sensitive: as their cache share shrinks, miss ratio, fetch ratio, and bandwidth all rise and performance falls
- (The slide reproduces Fig. 8 from the paper: performance, bandwidth requirements in GB/s, and fetch/miss ratios versus cache size for several benchmarks, collected with hardware prefetching enabled.)
29 More Detail: The Impact of Prefetching
- Applications see different degrees of hardware prefetching depending on their access pattern and on the hardware
- Prefetching reduces an application's sensitivity to memory latency
- The gap between the fetch ratio and the miss ratio is the prefetch rate: some benchmarks show no prefetching (fetch and miss ratios identical), some a minimal amount, and some significant prefetching that they clearly benefit from
- 470.lbm shows an 8x difference between its fetch and miss ratios, i.e., roughly 8 prefetched memory accesses for each miss; with hardware prefetching disabled its performance drops by about a third at all cache sizes, showing that prefetching was helping to compensate for reduced cache space
30 More Detail: Cache Pollution
- We can measure both how greedy an application is (how much shared cache it takes) and how sensitive it is (how much it benefits)
- By changing the code to use non-caching instructions we can make an application less greedy without hurting its performance: libquantum and lbm keep essentially the same miss ratio while occupying far less shared cache
31 More Detail: Bandwidth Sharing
- Sensitivity is a function of the application: latency sensitivity (memory-level parallelism) and bandwidth requirement (data rate)
- And of the hardware: the ability to handle out-of-order requests (queue sizes) and access pattern costs (streaming vs. random accesses in DRAM banks)
- Bandwidth consumption is not a good indicator of sensitivity: lbm, streamcluster, soplex, and mcf consume similar baseline bandwidth but slow down very differently at 90% of saturation bandwidth
(Source: http://uu.diva-portal.org. This is an author-produced version of a paper presented at the 4th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden.)
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationA Hybrid Adaptive Feedback Based Prefetcher
A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationChapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in
More informationCase Study 1: Optimizing Cache Performance via Advanced Techniques
6 Solutions to Case Studies and Exercises Chapter 2 Solutions Case Study 1: Optimizing Cache Performance via Advanced Techniques 2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationMemory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez
Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationBias Scheduling in Heterogeneous Multi-core Architectures
Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationCray XE6 Performance Workshop
Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationAdvanced Computer Architecture (CS620)
Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).
More informationPerceptron Learning for Reuse Prediction
Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationSCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS
SCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS 1 JUNG KYU PARK, 2* JAEHO KIM, 3 HEUNG SEOK JEON 1 Department of Digital Media Design and Applications, Seoul Women s University,
More informationProcessors, Performance, and Profiling
Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode
More informationPerformance of Multithreaded Chip Multiprocessors and Implications for Operating System Design
Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip
More informationStealing the shared cache for fun and profit
IT 13 048 Examensarbete 30 hp Juli 2013 Stealing the shared cache for fun and profit Moncef Mechri Institutionen för informationsteknologi Department of Information Technology Abstract Stealing the shared
More informationIntroduction to OpenMP. Lecture 10: Caches
Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationFlexible Cache Error Protection using an ECC FIFO
Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead
More information5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction
5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner Topic 1: Introduction These slides are mostly taken verbatim, or with minor changes, from
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationChapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY
Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored
More informationChapter 18 - Multicore Computers
Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationScheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationComputer Architecture Spring 2016
omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,
More informationDynamic Cache Pooling in 3D Multicore Processors
Dynamic Cache Pooling in 3D Multicore Processors TIANSHENG ZHANG, JIE MENG, and AYSE K. COSKUN, BostonUniversity Resource pooling, where multiple architectural components are shared among cores, is a promising
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationDRAM Main Memory. Dual Inline Memory Module (DIMM)
DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationMultiprocessor scheduling, part 1 -ChallengesPontus Ekberg
Multiprocessor scheduling, part 1 -ChallengesPontus Ekberg 2017-10-03 What is a multiprocessor? Simplest answer: A machine with >1 processors! In scheduling theory, we include multicores in this defnition
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationAS the processor-memory speed gap continues to widen,
IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO. 7, JULY 2004 843 Design and Optimization of Large Size and Low Overhead Off-Chip Caches Zhao Zhang, Member, IEEE, Zhichun Zhu, Member, IEEE, and Xiaodong Zhang,
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationImproving Real-Time Performance on Multicore Platforms Using MemGuard
Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved
More information