CS521 CSE IITG 11/23/2012


LSPS: Logically Shared Physically Shared
LSPD: Logically Shared Physically Distributed
  Same size
  Unequal size
  Without replication
  With replication
Access scheduling

Design Space of Shared Memory Architectures
  Extent of address space sharing
  Location of memory modules
  Uniformity of memory access

Extent of address space sharing
  Each processor sees an exclusive address space
  Each processor sees a partly exclusive and partly shared address space
  Each processor sees the same shared address space

Location of memory modules
  [Figure: Centralized: processors (P) and memory modules (M) connected by an interconnection network]
  [Figure: Distributed: processors (P) and memory modules (M) connected by an interconnection network]
  [Figure: Mixed: processors (P) and memory modules (M) connected by an interconnection network]
  [Figure: groups of processors, each with its own interconnection network and memory modules, joined by a global interconnection network]

Uniformity of memory access
  UMA (Uniform Memory Access)
    Uniformity across the memory address space
    Uniformity across processors
  NUMA (Non Uniform Memory Access)
  CC-NUMA (Cache Coherent NUMA)
  COMA (Cache Only Memory Architecture)

UMA: Symmetrical Shared Memory Multiprocessor (SMP)
NUMA: Distributed Shared Memory Multiprocessor

[Figure: UMA and NUMA placed in the design space. LOCATION: centralized, mixed, distributed. SHARING: full, partial, none.]

Logically Shared Physically Shared (LSPS)
  [Figure: LSPS (1) and LSPS (M)]

Logically Shared Physically Distributed (LSPD)
  [Figure: LSPD (Equal) and LSPD (Non Equal)]
  [Figure: LSPD (Equal with replication): memory modules with copies C1, C2, C3, C4]
  Replication raises a coherence problem

Logically Partitioned Physically Shared (LPPS)
  [Figure: one memory M shared by processors p1, p2, p3, p4]
  The partitioning may be static or dynamic

Logically Partitioned Physically Shared (LPPS)
  [Figure: LPPS: the bandwidth of one memory is partitioned among processors p1, p2, p3, p4, each with an L1]
  Bandwidth scheduling: may be static or dynamic

Logically Shared Physically Shared (LSPS)
  LSPS (1)
    Data placement plays a less important role
    Try to cluster the accesses into pages
    Page mode: accesses that belong to the same page are faster
    Contiguous chunk access is faster
    Simple queue: serve all requests to the same page at a time
  LSPS (M)
    [Figure: processor connected to an interleaved memory]
    8 memory banks, bank busy time = 6 clocks, memory latency = 12 cycles
    How long does it take to load a 64-element vector?
    Stride = 1: 12 + 64 = 76 clocks, 1.2 cycles/element
    Stride that is a multiple of the number of banks (e.g. 32), so every access hits the same bank: 12 + 1 + 6*63 = 391 clocks, 6.1 cycles/element

Placement: how to place the data so that it can be accessed efficiently, with less interference, across multiple banks
Scheduling: with the data already placed in memory, how to schedule the accesses to reduce interference

What to do first: scheduling or placement?
  Placement, then scheduling
  Scheduling, then placement
  Alternate scheduling and placement, iterating till convergence
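The interleaved-memory arithmetic above can be checked with a small bank model. The sketch below is an illustration under assumptions that are not on the slides: at most one new request is issued per clock, element i goes to bank (i * stride) mod banks, and a bank accepts its next request only after its busy time has elapsed. The function name vector_load_cycles and the MAX_BANKS limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64  /* hypothetical upper bound for this sketch */

/* Clocks needed to load n elements from an interleaved memory.
 * Assumed model: one new request per clock at most; a bank that has
 * accepted a request is busy for `busy` clocks; the load completes
 * `latency` clocks after the last request is issued. */
long vector_load_cycles(int n, int stride, int banks, int busy, int latency)
{
    long free_at[MAX_BANKS];  /* earliest clock each bank accepts a request */
    long issue = 0;           /* clock at which the previous request issued */
    int b, i;

    assert(n > 0 && stride > 0 && busy >= 0 && latency >= 0);
    assert(banks > 0 && banks <= MAX_BANKS);

    for (b = 0; b < banks; b++)
        free_at[b] = 1;

    for (i = 0; i < n; i++) {
        int bank = (int)(((long)i * stride) % banks);
        long t = issue + 1;          /* next issue slot */
        if (t < free_at[bank])
            t = free_at[bank];       /* stall until the bank is free */
        free_at[bank] = t + busy;
        issue = t;
    }
    return latency + issue;
}
```

With 8 banks, busy time 6 and latency 12, the model gives 76 clocks for stride 1 and 391 clocks for stride 32, matching the figures in the example.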

LSPS (2): placement example
  Suppose application App is running on this architecture
  [Figure: processors P1, P2, P3, P4 sharing two memory modules]
  App has 5 arrays, with sizes [1], [1], [2], [2] and [1]
  Both memory modules, including M2, have size = 4: they can hold 4 elements each

  Step 1: draw the conflict graph
    Example: the accesses to two of the arrays overlap 5% of the time
  Step 2: take the maximum spanning tree
  Step 3: place the data in the memories, including the memory size constraints
    One placement: three arrays with total size 3 (<= 4) in the first memory, two arrays with total size 4 (<= 4) in M2
    Another placement: total size 4 in one memory and total size 3 in the other

  Variant: three memories (..., M2, M3) of size 3 each; array sizes 2, 2, 1, 1, ...
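One plausible reading of the "conflict graph, then maximum spanning tree" steps is sketched below; the slides do not spell out the procedure, so this is an assumption. The sketch builds a maximum spanning tree with Prim's algorithm and alternates the two memories along tree edges, which puts the heaviest-conflict pairs in different memories. It ignores the memory size constraints, which would need a separate check. The names place_by_max_spanning_tree and MAX_ARRAYS, and any edge weights used with it, are hypothetical.

```c
#include <assert.h>

#define MAX_ARRAYS 16  /* hypothetical upper bound for this sketch */

/* w[i][j] >= 0 is the conflict weight between arrays i and j
 * (symmetric, 0 when they never overlap). On return bank[i] is 0 or 1.
 * Returns the total weight of the maximum spanning tree.
 * Assumption: two memories, size constraints not modeled. */
int place_by_max_spanning_tree(int n, int w[MAX_ARRAYS][MAX_ARRAYS],
                               int bank[MAX_ARRAYS])
{
    int in_tree[MAX_ARRAYS] = {0};
    int best[MAX_ARRAYS];    /* heaviest edge linking v to the tree */
    int parent[MAX_ARRAYS];  /* tree neighbor reached through that edge */
    int total = 0;
    int step, u, v;

    assert(n > 0 && n <= MAX_ARRAYS);
    for (v = 0; v < n; v++) {
        best[v] = -1;
        parent[v] = -1;
    }
    best[0] = 0;  /* array 0 is the root of the tree */

    for (step = 0; step < n; step++) {
        /* Prim: pick the array outside the tree with the heaviest link. */
        u = -1;
        for (v = 0; v < n; v++)
            if (!in_tree[v] && (u == -1 || best[v] > best[u]))
                u = v;
        in_tree[u] = 1;

        if (parent[u] == -1) {
            bank[u] = 0;
        } else {
            bank[u] = 1 - bank[parent[u]];  /* opposite memory to parent */
            total += w[u][parent[u]];
        }

        for (v = 0; v < n; v++)
            if (!in_tree[v] && w[u][v] > best[v]) {
                best[v] = w[u][v];
                parent[v] = u;
            }
    }
    return total;
}
```

Because a tree is always two-colorable, every tree edge ends up crossing the two memories; conflicts on non-tree edges may still land in the same memory.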

Logically Shared Physically Distributed (LSPD): LSPD (Equal)
  Shared memory: any processor can access any memory module
  One module is local and the others are remote
  Place the data in the memories so as to reduce the overall memory access time
  Address remapping
  [Figure: processors, each with a local memory module, connected to the other modules]
  Case I: simple mapping, module i holds the addresses A with i*256 <= A < (i+1)*256
  Case II: the mapping depends on the number of requests to an area (affinity)

Suppose 4 processors and 1024 memory locations
  Shared memory: any processor can access any location
  [Figure: the 1024 locations split into four ranges of 256, one per memory module, each module local to one processor]
  Suppose that in an application Pi accesses Mj Aij times
  A local access takes less time (Tl)
  A remote access takes more time (Tr)
  Optimize the memory access time by remapping addresses to memory modules

Stable Matching Problem
  A processor that accesses a memory more frequently should be nearer to it
  A memory accessed mostly by one processor should be nearer to that processor
  Algorithm: Kleinberg and Tardos book, Chapter 1
  Every processor has a memory preference list and every memory has a processor preference list

    P1: , M3, , M2          : P1, P3, P2, P4
    P2: , M2, M3, M2      M2: P2, P1, P3, P4
    P3: , , M3, M2        M3: P3, P1, P4, P2
    P4: M2, M3, ,           : P2, P4, P1, P3

  We have to assign one processor to one memory, and one memory to one processor
  The match should be stable (the preferred one should be allocated as far as possible)
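The stable matching step can be sketched with the Gale-Shapley proposal algorithm, which is the algorithm Chapter 1 of the Kleinberg and Tardos book presents. The code below is a minimal illustration, assuming processors propose to memories and both sides hold complete preference lists; the function name stable_match and the 0-based numbering are assumptions, not taken from the slides. In the slides' setting a preference list would be derived from the access counts Aij.

```c
#include <assert.h>

#define NPAIRS 4  /* 4 processors and 4 memory modules, as in the example */

/* proc_pref[p][k]: the k-th most preferred memory of processor p.
 * mem_pref[m][k]:  the k-th most preferred processor of memory m.
 * On return mem_of[p] is the memory assigned to processor p.
 * Processors propose, so the result is processor-optimal. */
void stable_match(int proc_pref[NPAIRS][NPAIRS],
                  int mem_pref[NPAIRS][NPAIRS],
                  int mem_of[NPAIRS])
{
    int rank[NPAIRS][NPAIRS];  /* rank[m][p]: position of p in m's list */
    int proc_of[NPAIRS];       /* processor currently held by memory m */
    int next[NPAIRS];          /* next list position processor p tries */
    int free_count = NPAIRS;
    int m, p, k;

    for (m = 0; m < NPAIRS; m++)
        for (k = 0; k < NPAIRS; k++)
            rank[m][mem_pref[m][k]] = k;
    for (p = 0; p < NPAIRS; p++) {
        proc_of[p] = -1;
        mem_of[p] = -1;
        next[p] = 0;
    }

    while (free_count > 0) {
        int q;
        p = 0;
        while (mem_of[p] != -1)     /* find an unassigned processor */
            p++;
        assert(next[p] < NPAIRS);   /* complete lists: never runs out */
        m = proc_pref[p][next[p]++];
        q = proc_of[m];
        if (q == -1) {              /* memory is free: accept */
            proc_of[m] = p;
            mem_of[p] = m;
            free_count--;
        } else if (rank[m][p] < rank[m][q]) {
            proc_of[m] = p;         /* memory prefers p: q becomes free */
            mem_of[p] = m;
            mem_of[q] = -1;
        }                           /* otherwise p stays free and retries */
    }
}
```

The result is stable in the sense used above: no processor and memory both prefer each other over their assigned partners.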

Suppose 4 processors and 1024 memory locations
  All memory is shared: any processor can access any location
  [Figure: the 1024 locations split across the memory modules, each module local to one processor]

Variations of this problem
  Remote access time depends on the remote location relative to the source processor
  Each memory module can be of a different size (Min to Max)
  The access pattern changes from time to time
  Caching and copying

Remote access time depends on the remote location relative to the source
  At the time of grouping (P1 and ...), the cost weights each access count by a distance: (1*D21 + 5*D31 + 3*D14)

Each memory module can be of a different size (Min to Max)
  Example: MinSize = 32, MaxSize = 512, Total = 1024
  Memory is divided into slices (assume 1 slice per address)
  Slice preference list of a processor: SPij = number of accesses to slice i from processor j
  If slice i is mapped to memory j, SMij = 1
  ILP formulation
    Sum of Mi = 1024
    Min <= Mi <= Max
    Sum of SMij = 1 (each slice is mapped to exactly one memory)
    Minimize the total access latency: sum of SPij * SMij * Dij

LSPD (Non Equal)
  Assume contiguous mapping: the arrays A[256], B[256], C[256] are mapped at consecutive starting addresses

    for (i = 0; i < 256; i++) {
        A[i] = B[i] + C[i];
    }

  Direct-mapped cache, size a power of 2 (1024 B)
  A[i], B[i] and C[i] map to the same cache location: miss, miss, miss
  Associativity 2?
  How to remove the conflicts? Padding
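The conflict pattern in the loop above, and the effect of padding, can be reproduced with a tiny direct-mapped cache model. The sketch below assumes 4-byte elements (so each 256-element array is exactly as large as the 1024 B cache) and a 16-byte cache line; neither value is given on the slides. The names count_misses and access_cache are hypothetical.

```c
#include <assert.h>

#define CACHE_BYTES 1024  /* cache size from the example */
#define LINE_BYTES  16    /* assumed line size, not given in the source */
#define ELEM_BYTES  4     /* assumed element size, not given in the source */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)
#define NELEMS      256

static long tags[NUM_LINES];  /* block number held by each cache line */

/* Direct-mapped lookup: returns 1 on a miss, 0 on a hit. */
static int access_cache(long addr)
{
    long block = addr / LINE_BYTES;
    int index = (int)(block % NUM_LINES);
    if (tags[index] == block)
        return 0;
    tags[index] = block;
    return 1;
}

/* Misses of A[i] = B[i] + C[i] for i = 0..255 when the three arrays are
 * laid out back to back with pad_bytes of padding between them. */
long count_misses(long pad_bytes)
{
    long a = 0;
    long b = a + (long)NELEMS * ELEM_BYTES + pad_bytes;
    long c = b + (long)NELEMS * ELEM_BYTES + pad_bytes;
    long misses = 0;
    int i;

    assert(pad_bytes >= 0);
    for (i = 0; i < NUM_LINES; i++)
        tags[i] = -1;  /* start with an empty cache */

    for (i = 0; i < NELEMS; i++) {
        misses += access_cache(b + (long)i * ELEM_BYTES);  /* read B[i]  */
        misses += access_cache(c + (long)i * ELEM_BYTES);  /* read C[i]  */
        misses += access_cache(a + (long)i * ELEM_BYTES);  /* write A[i] */
    }
    return misses;
}
```

Under these assumptions every one of the 768 accesses misses without padding, while one line of padding between the arrays leaves only the 192 first-touch misses.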

Cache hit/miss probability
  Behavior is not completely predictable

Scratchpad (SP)
  A high-speed memory near the processor, but not a cache
  Uses a simple memory address mapping to this area
  No miss for this area: always a hit, and predictable
  Guaranteed performance
  [Figure: Memory, Cache, Core, SP]

    // 16.1 MP image
    // Histogram calculation
    for (i = 0; i < 4680; i++)
        for (j = 0; j < 3456; j++)
            H[R[i][j]]++;

Data placement in distributed caches
[Hardavellas et al, IEEE Micro Top Picks 2010] [Hardavellas et al, ISCA 2009]

As caches become bigger, they get slower
  [Figure: hit latency (cycles) against cache size (KB), by year: large caches, slow access]
  Split the cache into smaller slices
  Increasing access latency forces caches to be distributed

Balance cache slice access with network latency
  [Figure: cache slices]
  Split the cache into slices, distribute them across the die

Goal: place data on chip close to where they are used

Data may exhibit arbitrarily complex behaviors... but few that matter!
  Learn the behaviors at run time and exploit their characteristics
  Make the common case fast, the rare case correct
  Resolve conflicting requirements

Cache accesses can be classified at run time
  Each class is amenable to a different placement
  Per-class block placement
    Simple, scalable, transparent
    No need for HW coherence mechanisms at the LLC
    Up to 32% speedup (17% on average)
    5% on average from an ideal cache organization
  Rotational interleaving
    Data replication and fast single-probe lookup

[Figure: access classes: Private (read or write), Shared Read-Only (read), Shared Read-Write (read and write)]

Shared placement: address mod <#slices>
  Unique location for any block (private or shared)
  Maximum capacity, but slow access (30+ cycles)

Methodology
  Flexus: full-system cycle-accurate timing simulation
  Workloads
    OLTP: TPC-C 3.0, 100 WH
      IBM DB2 v8
      Oracle 10g
    DSS: TPC-H Qry 6, 8, 13
      IBM DB2 v8
    SPECweb99 on Apache 2.0
    Multiprogrammed: Spec2K
    Scientific: em3d
  Model parameters
    Tiled, LLC = L2
    Server/Scientific workloads: 16 cores, 1MB/core
    Multiprogrammed workloads: 8 cores, 3MB/core
    OoO, 2GHz, 96-entry ROB
    Folded 2D torus: 2-cycle router, 1-cycle link
    45ns memory

[Figure: bubble chart. Each bubble: cache blocks shared by x cores. Size of bubble proportional to % accesses. y axis: % of blocks in the bubble that are read-write. x axis: number of sharers. Series: Instructions, Data-Private, Data-Shared.]

[Figure: the same bubble charts (% read-write blocks in bubble against number of sharers; series: Instructions, Data-Private, Data-Shared) for Server Apps and for Scientific/MP Apps, annotated "migrate locally", "share (addr interleave)" and "replicate"]

Accesses naturally form 3 clusters

Instruction working set too large for one cache slice
  Distribute in a cluster of neighbors, replicate across clusters

Per-block classification
  High area/power overhead (cut size by half)
  High latency (indirection through a directory)
Per-page classification (utilize the OS page table)
  Persistent structure
  The core accesses the page table for every access anyway (TLB)
  Utilizes already existing SW/HW structures and events
  Page classification is accurate (<% error)

Classify entire data pages, page table/TLB for bookkeeping
  Instruction classification: all accesses from L1-I (per block)
  Data classification: private/shared, per page, at TLB miss
  On 1st access: Core i issues Ld A, takes a TLB miss, and the OS marks A: private to i
  On access by another core: Core j issues Ld A, takes a TLB miss, and the OS changes A from private to i to A: shared

Bookkeeping through the OS page table and TLB
  Private data: place in the local slice
    Page table entry: P, id, vpage, ppage
    TLB entry: P, vpage, ppage
  Shared data: place in the aggregate (addr interleave)
    Page table entry: S, id, vpage, ppage
    TLB entry: S, vpage, ppage
    Physical addr.: tag, id, cache index, offset

Reactive NUCA placement guarantee
  Each R/W datum is in a unique and known location
  Shared data: addr interleave
  Private data: local slice
  Fast access, eliminates HW overhead, SIMPLE
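The page-level bookkeeping described above can be sketched as a small state machine. This is an illustration only, not the actual OS or hardware implementation: it covers data pages, omits the TLB shootdown and cache invalidation that a private-to-shared transition would need, and models address interleaving as block address mod number of slices. All names (page_entry, classify_on_tlb_miss, home_slice) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum page_class { UNTOUCHED, PRIVATE, SHARED };

/* Hypothetical stand-in for the classification bits and owner id that
 * the slides keep in the page table entry. */
struct page_entry {
    enum page_class cls;
    int owner;  /* core id, meaningful while cls == PRIVATE */
};

/* Called when `core` takes a TLB miss on the page.
 * First access: private to that core.
 * Access by a different core: reclassified as shared.
 * Assumption: invalidating the old owner's TLB entry and cached blocks
 * on that transition is omitted here. */
enum page_class classify_on_tlb_miss(struct page_entry *pe, int core)
{
    assert(pe != NULL && core >= 0);
    if (pe->cls == UNTOUCHED) {
        pe->cls = PRIVATE;
        pe->owner = core;
    } else if (pe->cls == PRIVATE && pe->owner != core) {
        pe->cls = SHARED;
    }
    return pe->cls;
}

/* Slice that holds a block of the page: the owner's local slice for
 * private data, address interleaving across all slices for shared data. */
int home_slice(const struct page_entry *pe, unsigned long block_addr,
               int num_slices)
{
    assert(pe != NULL && pe->cls != UNTOUCHED && num_slices > 0);
    if (pe->cls == SHARED)
        return (int)(block_addr % (unsigned long)num_slices);
    return pe->owner;
}
```

Either way each block has exactly one possible location, which is what lets the lookup proceed without a coherence mechanism at the LLC.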

Instructions: size-4 clusters (local slice + 3 neighbors)
  Identification: all accesses from L1-I
  But the working set is too large to fit in one cache slice
  Share within a cluster of neighbors, replicate across clusters

Rotational interleaving
  [Figure: tile grid labeled with rotational IDs (RID); log2(k) address bits of the PC are used for the lookup]
  Each slice caches the same blocks on behalf of any cluster
  Destination = (Addr + RID + 1) & (n - 1)
  Fast access (nearest neighbor, simple lookup)
  Balance access latency with capacity constraints
  Equal capacity pressure at overlapped slices

Rotational interleaving, in terms of tile IDs
  [Figure: Addr, RotationalID, TileID: from RotationalID_center and RotationalID_dest to D, and from TileID_center and D to TileID_dest]
  D = (RotationalID_dest + RotationalID_center + 1) & (n - 1)
  Fast access (nearest neighbor, simple lookup)
  Equalize capacity pressure at overlapping slices

Nearest neighbor, size-8 clusters
  [Figure: size-8 cluster]

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed