Announcements. Lecture Caching Issues for Multi-core Processors. Shared Vs. Private Caches for Small-scale Multi-core



Lecture 17: Caching Issues for Multi-core Processors
Department of Electrical Engineering, Stanford University

Announcements
- Your focus should be on the class project now.
- This week: status update and meeting.
  - A short presentation on: project description (problem, importance, solution), methodology (experiments & tools), prelim results, remaining work (division of labor), open issues.
  - Will schedule a short meeting with Christos to provide feedback.
- Project papers due on 12/3.
- Project presentations on 12/3.

Shared vs. Private Caches for Small-scale Multi-core

[Figure: cores connected through an intra-chip switch to a cache]

Shared caches
- Pros: better utilization of limited resources
- Pros: instruction/data sharing across cores
- Cons: limited bandwidth and ports to the shared cache
- Cons: destructive interference

Private caches
- Pros: isolation (capacity, bandwidth, ports)
- Cons: underutilization of resources, data duplication
- Cons: additional coherence traffic

Practical approach
- Private L1 caches & shared last-level caches (L2 or L3)
- Intermediate levels may be shared by a cluster of cores

Performance Isolation & QoS on a Multi-Core
- Multi-core chips provide scalable performance within a chip.
  - Can use to accelerate multi-programmed workloads, e.g., a three-tier server.
  - Can use for service consolidation, e.g., virtual machines.
- Challenge: can we guarantee isolation and QoS across apps?
  - Shared resources can be an issue: caches, on-chip and off-chip bandwidth, memory, ...
- Simple example: one app streams through large memory regions.
  - Effectively flushes the shared caches for all other apps.
  - Commonly observed when TCP/IP runs in parallel with other servers.
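The streaming example above can be illustrated with a small simulation. This is a minimal sketch, not taken from the lecture: it assumes a fully associative shared cache with LRU replacement, and the names `LRUCache` and `hit_rate_of_app_a` as well as all sizes are hypothetical choices made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement (an assumed simplification)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # iteration order: LRU first, MRU last

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # promote to MRU
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[addr] = None
        return False


def hit_rate_of_app_a(cache_lines, working_set, rounds, stream_per_access):
    """App A loops over a small working set; app B streams through new lines.

    stream_per_access is the (hypothetical) number of streaming accesses
    that app B issues between two consecutive accesses of app A.
    """
    cache = LRUCache(cache_lines)
    hits = accesses = 0
    next_stream_addr = 0
    for _ in range(rounds):
        for line in range(working_set):
            hits += cache.access(("A", line))
            accesses += 1
            for _ in range(stream_per_access):
                cache.access(("B", next_stream_addr))  # never reused
                next_stream_addr += 1
    return hits / accesses


alone = hit_rate_of_app_a(cache_lines=64, working_set=32, rounds=10, stream_per_access=0)
shared = hit_rate_of_app_a(cache_lines=64, working_set=32, rounds=10, stream_per_access=4)
print(alone, shared)
```

With these assumed sizes, app A hits on every access after its first pass when it runs alone, and misses on every access once the streaming app shares the cache, even though A's working set is only half the cache.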


Example of Cache Interference (DasCMP 06)

[Figure: slowdown, from baseline to 5x, for Spec 05 apps when running in parallel with swim while sharing the cache in a multi-core system]

Can OS Priorities Solve the Problem?

[Figure: slowdown, from baseline to 5x]

- What is the problem with OS priority mechanisms?

Is Interference a Common Problem?
- Have to build some system mechanisms for isolation & QoS.

Approaches to Resource Management
- Capitalist (most systems today)
  - No management of resources
  - If you can generate the requests, you take over resources
- Communist
  - Equal distribution of resources across all apps
  - Guarantees isolation but not best utilization or even equal performance
- Elitist
  - Highest priority for one application through biased resource allocation
  - Best effort for the rest of the apps
- Utilitarian
  - Focus on overall efficiency (e.g., throughput)
  - Provide resources to whoever needs them the most


System Platform for QoS

[Figure: Resource Monitoring, QoS Analysis, Resource Enforcement]

- Hardware features
  - Mechanisms for resource monitoring
  - Mechanism for policy resource enforcement
- Software features
  - Analysis of operating state
  - Set HW policies based on high-level approach

Hardware Features for QoS in Caching
- Resource monitoring
  - Measure occupancy, interference, and sharing for each thread
  - [...] requests and cache lines
  - [...]s can be short and defined by software
  - [...]s can be used to identify threads, priority level, etc.
  - Count usage, allocations, evictions, replacements, etc.
- Resource enforcement
  - Per-way cache partitioning
  - Capacity-based partitioning
  - Multi-core aware replacement management

Dealing with Cache Interference
- Scenario: running Art (high priority) and Iperf from Spec 05
  - Measuring misses per instruction (MPI)
  - Policies for shared (no management), fair (two halves), dedicated (individual runs)
- Static QoS: iperf can use up to 10% of the cache
- Dynamic QoS: target 0.5*MPI of the shared case
  - Periodically measure MPI and adjust the threshold for cache occupancy

Cache Replacement for Multi-core
- The two aspects of cache replacement
  - Pick a victim to replace AND decide the priority of the newly inserted line
- Options
  - LRU: pick the LRU line as victim and make the new line the MRU
    - Suboptimal for streaming data or when the working set is larger than the cache
    - Allocates resources based on rate of demand, not on benefit from caching
  - LRU insertion: new line inserted at LRU, promoted to MRU later
    - Does not necessarily age older lines as LRU does
  - Bimodal insertion: randomly insert a few lines at MRU, the rest at LRU
  - Dynamic insertion policy: chooses between bimodal and LRU
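The insertion options listed above can be compared on a thrashing access pattern. The following is a minimal sketch under stated assumptions: a single fully associative set, promotion to MRU on every hit, and a hypothetical `epsilon` of 1/32 for the "few lines" that bimodal insertion places at MRU. The function name `simulate` and the policy labels are illustrative, not from the lecture.

```python
import random


def simulate(policy, capacity, trace, epsilon=1 / 32, seed=0):
    """Count hits for one cache set under an insertion policy.

    policy: "lru" (insert at MRU), "lip" (insert at LRU),
            or "bip" (insert at MRU with probability epsilon, else at LRU).
    The victim is always the line in the LRU position; a hit promotes to MRU.
    """
    rng = random.Random(seed)
    stack = []  # index 0 is the LRU position, the last index is the MRU position
    hits = 0
    for addr in trace:
        if addr in stack:
            hits += 1
            stack.remove(addr)
            stack.append(addr)  # promote to MRU on a hit
            continue
        if len(stack) >= capacity:
            stack.pop(0)  # evict the line in the LRU position
        insert_at_mru = policy == "lru" or (policy == "bip" and rng.random() < epsilon)
        if insert_at_mru:
            stack.append(addr)
        else:
            stack.insert(0, addr)
    return hits


# Cyclic sweep over 6 lines with room for only 4: the working set exceeds the cache.
trace = [line for _ in range(100) for line in range(6)]
for policy in ("lru", "lip", "bip"):
    print(policy, simulate(policy, capacity=4, trace=trace))
```

On this trace, plain LRU never hits because every line is evicted just before it is reused, while LRU insertion keeps three of the six lines resident and hits on them in every sweep after the first. Bimodal insertion behaves similarly here, with occasional MRU insertions that would let it adapt if the working set changed.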


Physical Layout of Caches for Larger-scale Systems

[Figure: cores connected through intra-chip switches to a cache built from many slices]

- Distributed implementation
  - Lower access latency, lower access power, easier to turn off selectively
- Challenge: non-uniform access to different slices

Dynamic Non-Uniform Access Caches (NUCA) (ASPLOS 02)
- Motivation: allow cache lines to move close to the requesting CPU
  - Without using a directory architecture
- Approach: organize cache banks into bank sets
  - Bank set determined by some bits in the address
  - Banks within the set provide cache associativity (search "in series")
  - Cache lines can move within a set to get closer to the requesting CPU
- Mechanisms: mapping, searching, migration
  - Mapping: simple, fair, shared
  - Searching: incremental, multicast, smart
  - Migration: data moves closer as it is accessed, evicted data moves further

NUCA Challenges for Multi-core (MICRO 04)

[Figure: access distribution for OLTP (on-line transaction processing) and Ocean (scientific code); dark = more accesses]

Design Choices for Large Distributed Caches: Private Caching
- One L2 slice per core
  - Pros: low latency to data
  - Cons: reduced capacity
- L2 miss operation
  - Search other private caches
    - Through snooping or a directory
    - Centralized or distributed directory
    - 2 to 3 hops
  - Alternatively, fetch from off-chip
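The bank-set organization described for dynamic NUCA above can be sketched as follows. This is an illustrative model only: the line size, the number of bank-set bits, the one-line-per-bank simplification, and the choice to place new lines in the farthest bank are all assumptions, and the names `bank_set_index` and `BankSet` are hypothetical.

```python
LINE_BITS = 6      # assumed 64-byte cache lines
BANK_SET_BITS = 2  # assumed 4 bank sets


def bank_set_index(addr):
    """Select the bank set from a few address bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << BANK_SET_BITS) - 1)


class BankSet:
    """One cache set spread over banks ordered from closest (index 0) to farthest.

    Each bank holds a single line here, so the banks together provide the
    associativity of the set.
    """

    def __init__(self, num_banks):
        self.banks = [None] * num_banks

    def access(self, line_addr):
        """Incremental search, closest bank first.

        Returns the number of banks probed on a hit, or None on a miss.
        A hit migrates the line one bank closer by swapping it with its
        neighbor, which in turn moves the displaced line one bank further.
        """
        for i, resident in enumerate(self.banks):
            if resident == line_addr:
                if i > 0:
                    self.banks[i - 1], self.banks[i] = self.banks[i], self.banks[i - 1]
                return i + 1
        # Miss: assumed policy is to fill the farthest bank, replacing its line.
        self.banks[-1] = line_addr
        return None


s = BankSet(num_banks=4)
probes = [s.access(7) for _ in range(6)]
print(bank_set_index(0x1040), probes)
```

Repeated accesses to the same line need fewer probes each time until the line settles in the closest bank, which is the migration behavior the slide describes. Multicast or smart search would change how the banks are probed, not where the line may live.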


Design Choices for Large Distributed Caches: Shared Caching
- Slices form a distributed, shared L2
  - Pros: good utilization of capacity
  - Cons: variable & high latency
- L2 hit operation: search other caches
  - Direct access if banked cache
    - Static placement of data in L2
  - Or through a directory
    - Dynamic placement of data in L2
    - Possibility for replication, migration, ...
    - Additional hops
- L2 miss operation
  - Fetch data from off-chip
  - Place in proper slice & update directory if any

Victim Replication (ISCA 05)
- A compromise between shared & private designs
  - Capacity utilization of a shared cache with the low latency of a private one
- Idea: start with the shared design and use the local slice as a victim cache
  - When evicting from L1, write data in the local slice
  - Victim allowed to overwrite invalid blocks and other replicas
  - Not allowed to overwrite actively shared blocks that have local as home
- Implementation: simple modifications to the shared design
  - On a miss, search the local slice before remote slices
  - Directory or banking structure does not change
    - Victim does not change sharing pattern
  - Invalidations handled locally a little differently

NuRAPID: Decoupling Tags from Data
- Motivation: provide a mechanism for caching optimization
  - Controlled replication, communication w/o movement, capacity stealing, ...
- Basic idea: decouple tag storage from data storage
  - Private tag arrays & shared data arrays
  - Access the tags first, get a pointer to data
    - May be to another slice

Using NuRAPID for CMP Optimization
- Controlled replication
  - No copy on first access to on-chip data; just set pointer
  - Copying on second access
- In-situ communication
  - Read-write sharing without copying/moving data
  - Keep data close to reader, adjust pointer to perform writes
  - Requires a new cache state (C for communication)
- Capacity stealing
  - Use remote slice as a victim cache for a processor's slice
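The controlled-replication idea above (pointer on first access, copy on second) can be sketched with decoupled tags and data. This is a simplified illustration, not the NuRAPID design itself: it assumes unlimited capacity, ignores writes and coherence states, and locates on-chip data by scanning all slices where real hardware would use tags or a directory. The class name `NuRapidSketch` and its fields are hypothetical.

```python
class NuRapidSketch:
    """Private tag arrays that point into shared data slices (read path only)."""

    def __init__(self, num_cores):
        # Private tag array per core: line address -> [slice holding the data, access count]
        self.tags = [dict() for _ in range(num_cores)]
        # Shared data arrays, one slice per core: set of line addresses stored there
        self.data = [set() for _ in range(num_cores)]

    def read(self, core, line):
        """Return the slice that this core's tag points to after the access."""
        tag = self.tags[core].get(line)
        if tag is None:
            # Assumption: a scan over the slices stands in for the on-chip lookup.
            owner = next((s for s, d in enumerate(self.data) if line in d), None)
            if owner is None:
                # Not on chip: fetch from off-chip into the local slice.
                self.data[core].add(line)
                owner = core
            # First access: no copy, just set the pointer.
            self.tags[core][line] = [owner, 1]
            return owner
        tag[1] += 1
        if tag[0] != core and tag[1] == 2:
            # Second access to remote data: replicate into the local slice.
            self.data[core].add(line)
            tag[0] = core
        return tag[0]


cmp_cache = NuRapidSketch(num_cores=2)
served_from = [
    cmp_cache.read(0, 100),  # off-chip fetch into slice 0
    cmp_cache.read(1, 100),  # first access by core 1: pointer to slice 0
    cmp_cache.read(1, 100),  # second access by core 1: copy into slice 1
]
print(served_from)
```

Deferring the copy until the second access means data touched only once by a remote core never consumes space in that core's slice, which is the capacity benefit the slide attributes to controlled replication.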
