CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago


1 CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li, University of Chicago

2 Course Evaluation. Very important. Please fill out!


3 Lab3 Branch Prediction Competition. 8 teams entered the competition; extra credit given to all. Evaluated based on correctness, performance gain, and writeup quality. Ross Rauber and Oliver Tsag, 39.32% improvement. Zaye Khouja and Aviash Rao, 32.72% improvement. Owe Frazier and Jaseph Maues, 31.97% improvement.

4 Lecture Outline. Multi-core continued.

5 Topics in Parallel Computer Architecture. Cache coherence: ensure correct operation in the presence of private caches. Memory consistency: ordering of memory operations; what should the programmer expect the hardware to provide? Shared memory synchronization: instructions to perform atomic operations (e.g., for locks).

6 Cache Coherence

7 The VI (Valid/Invalid) Protocol. Write-through, no-write-allocate cache. Actions of the local processor on the cache block: PrRd, PrWr. Actions on the bus to communicate to memory and other processors: BusRd, BusWr. Transitions, labeled ObservedEvent / Action: Invalid to Valid on PrRd / BusRd; Valid stays Valid on PrRd / -- and on PrWr / BusWr; Valid to Invalid on an observed BusWr; Invalid stays Invalid on PrWr / BusWr.

8 A More Sophisticated Protocol: MSI. Used with writeback caches. Extend the metadata per block to encode three states. M(odified): cache line is the only cached copy and is dirty. S(hared): cache line is potentially one of several cached copies. I(nvalid): cache line is not present in this cache.

9 MSI State Machine. Write-back, write-allocate cache. Transitions are labeled ObservedEvent / Action; the figure marks upgrades and downgrades (bus initiated). Abbreviations: PrRd: processor read. PrWr: processor write. BusRd: bus read. BusRdX: bus read exclusive (read with intent to modify; must invalidate all other cache copies). Flush: puts dirty data on the bus to update memory and supply data to other processors.

10 MSI Protocol Walkthrough. 1. If the cache block is modified: a. PrRd or PrWr: this is a cache hit. Just return the value or update the cached value. No need to go to memory or talk to other processors, and the block remains modified.

11 MSI Protocol Walkthrough. 1. If the cache block is modified: b. BusRd: others wish to read the block; put the dirty data on the bus; the block is downgraded to shared.

12 MSI Protocol Walkthrough. 1. If the cache block is modified: c. BusRdX: others wish to write to the block; put the dirty data on the bus; the block is downgraded to invalid.

13 MSI Protocol Walkthrough. 2. If the cache block is shared: a. PrRd: cache hit. BusRd: others are just reading the data. In both cases there is nothing to be done.

14 MSI Protocol Walkthrough. 2. If the cache block is shared: b. PrWr: we wish to write but other cores are sharing this block, so generate a BusRdX operation to invalidate the other copies; the block is upgraded to modified.

15 MSI Protocol Walkthrough. 2. If the cache block is shared: c. BusRdX: another core wants to write to the block, so we must invalidate our copy; the block is downgraded to invalid.

16 MSI Protocol Walkthrough. 3. If the cache block is invalid: a. PrRd: cache miss and we just want to read. Generate a BusRd operation to get the data (from memory or another core). The block is upgraded to shared.

17 MSI Protocol Walkthrough. 3. If the cache block is invalid: b. PrWr: cache miss and we want to write. Generate a BusRdX operation to get the data (from memory or another core) and invalidate other copies. The block is upgraded to modified.
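
The walkthrough in slides 10 to 17 can be collected into one transition table. The following is a toy sketch for a single block in a single cache, not the lab simulator: the table layout and the names `MSI`, `msi_step`, and `run_msi` are assumptions made for illustration, and bus events observed in the Invalid state are assumed to need no reaction.

```python
# Toy model of the MSI walkthrough: one cache block, seen from one cache.
# (current state, observed event) -> (next state, action issued by this cache)
MSI = {
    ("M", "PrRd"):   ("M", None),      # 1a: hit, block stays modified
    ("M", "PrWr"):   ("M", None),      # 1a: hit, block stays modified
    ("M", "BusRd"):  ("S", "Flush"),   # 1b: supply dirty data, downgrade
    ("M", "BusRdX"): ("I", "Flush"),   # 1c: supply dirty data, invalidate
    ("S", "PrRd"):   ("S", None),      # 2a: hit
    ("S", "BusRd"):  ("S", None),      # 2a: others only read
    ("S", "PrWr"):   ("M", "BusRdX"),  # 2b: invalidate other copies
    ("S", "BusRdX"): ("I", None),      # 2c: another core wants to write
    ("I", "PrRd"):   ("S", "BusRd"),   # 3a: read miss
    ("I", "PrWr"):   ("M", "BusRdX"),  # 3b: write miss
}


def msi_step(state, event):
    """Return (next_state, action); unlisted pairs leave the state unchanged."""
    return MSI.get((state, event), (state, None))


def run_msi(events, state="I"):
    """Apply a sequence of events and collect the actions this cache issues."""
    actions = []
    for event in events:
        state, action = msi_step(state, event)
        if action is not None:
            actions.append(action)
    return state, actions
```

For example, `run_msi(["PrRd", "PrWr"])` starts from Invalid and issues a BusRd followed by a BusRdX, which is the pattern slide 18 goes on to criticize.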

18 The Problem with MSI. A block is in no cache to begin with. Problem: on a read, the block immediately goes to the Shared state although it may be the only copy cached (i.e., no other processor will cache it). Why is this a problem? Suppose the cache that read the block wants to write to it at some point. It needs to broadcast an invalidate even though it has the only cached copy! If the cache knew it had the only cached copy in the system, it could have written to the block without notifying any other cache → saves unnecessary broadcasts of invalidations.

19 The Solution: MESI. Idea: add another state indicating that this is the only cached copy and it is clean: the Exclusive state. A block is placed into the Exclusive state if, during BusRd, no other cache had it. This requires a shared signal to detect whether other caches have a copy of the block; caches assert the signal if they have a copy. A silent transition Exclusive → Modified is possible on a write! MESI is also called the Illinois protocol. Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA

20 MESI State Machine. PrRd and cache miss: depending on whether other caches have a copy, transition from I to S or E. E to M occurs if PrWr is observed. E to S occurs if BusRd is observed. E to I occurs if BusRdX is observed. [Culler, David 97]
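
To make the benefit of the Exclusive state concrete, here is a small sketch of a MESI step function for one block in one cache. The I and E rows follow slide 20; the S and M rows are assumed to be unchanged from MSI, and the `others_have_copy` flag stands in for the shared signal. The names `mesi_step` and `mesi_actions` are hypothetical.

```python
def mesi_step(state, event, others_have_copy=False):
    """Return (next_state, bus_action) for one cache block in one cache."""
    if state == "I":
        if event == "PrRd":
            # The shared signal decides between Shared and Exclusive.
            return ("S" if others_have_copy else "E"), "BusRd"
        if event == "PrWr":
            return "M", "BusRdX"
        return "I", None  # bus events for a block we do not hold
    if state == "E":
        return {"PrRd": ("E", None), "PrWr": ("M", None),  # silent upgrade
                "BusRd": ("S", None), "BusRdX": ("I", None)}[event]
    if state == "S":  # assumed identical to MSI
        return {"PrRd": ("S", None), "PrWr": ("M", "BusRdX"),
                "BusRd": ("S", None), "BusRdX": ("I", None)}[event]
    # state == "M", assumed identical to MSI
    return {"PrRd": ("M", None), "PrWr": ("M", None),
            "BusRd": ("S", "Flush"), "BusRdX": ("I", "Flush")}[event]


def mesi_actions(events, others_have_copy=False):
    """Start from Invalid, apply the events, and collect the bus actions."""
    state, actions = "I", []
    for event in events:
        state, action = mesi_step(state, event, others_have_copy)
        if action is not None:
            actions.append(action)
    return state, actions
```

With no other sharer, a read followed by a write costs a single BusRd in this model; when another cache holds the block, the write still needs a BusRdX, as in MSI.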

21 Even More Sophisticated Cache Coherence Protocols? The protocol can be optimized with more states and prediction mechanisms to (+) reduce unnecessary invalidates and transfers of blocks. However, more states and optimizations (--) are more difficult to design and verify (they lead to more cases to take care of, race conditions) and (--) provide diminishing returns.

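22 False Sharing. P1 repeatedly executes ld word0, st word0; P2 repeatedly executes ld word3, st word3. Both words live in the same cache block/line (word0 word1 word2 word3), so every store by one processor invalidates the other processor's copy even though the two never touch the same word.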

23 Quick Tip to Avoid False Sharing. DO: map variables written by different processors onto different cache blocks; group variables written by the same processor into the same cache block. DON'T: group variables written by different processors into the same cache block.

24 Which Is Better?
    First version:
        int sum[NUM_PROCS];
        int product[NUM_PROCS];
        sum[mynum]++;
        product[mynum] *= 2;
    Second version:
        typedef struct { int sum; int product; } Proc;
        Proc x[NUM_PROCS];
        x[mynum].sum++;
        x[mynum].product *= 2;
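
One way to reason about the two layouts on slide 24 is to compute which cache block each processor's writes land in. The sketch below is only address arithmetic under stated assumptions: 4-byte ints, 64-byte cache blocks, and data placed back to back starting at address 0. All function names are hypothetical, and the `stride` parameter is an added illustration of padding that does not appear on the slide.

```python
from collections import defaultdict

INT_SIZE = 4     # bytes per int (assumption)
BLOCK_SIZE = 64  # bytes per cache block (assumption)


def separate_arrays(num_procs):
    """int sum[N]; int product[N]; -> addresses written by each processor."""
    product_base = num_procs * INT_SIZE
    return {p: [p * INT_SIZE, product_base + p * INT_SIZE]
            for p in range(num_procs)}


def array_of_structs(num_procs, stride=2 * INT_SIZE):
    """Proc x[N]; -> addresses written by each processor.
    A stride larger than 8 models padding each struct."""
    return {p: [p * stride, p * stride + INT_SIZE] for p in range(num_procs)}


def writers_per_block(layout):
    """Map block index -> set of processors that write into that block."""
    blocks = defaultdict(set)
    for proc, addresses in layout.items():
        for address in addresses:
            blocks[address // BLOCK_SIZE].add(proc)
    return dict(blocks)


def max_sharers(layout):
    """Largest number of distinct writers found in any one block."""
    return max(len(procs) for procs in writers_per_block(layout).values())


def blocks_touched(layout, proc):
    """Number of distinct blocks one processor writes to."""
    return len({address // BLOCK_SIZE for address in layout[proc]})
```

Under these assumptions and with 16 processors, the separate arrays put all 16 writers into each of two blocks and make every processor touch both, while the array of structs keeps each processor in one block shared by 8 writers. In this model, false sharing disappears only when each struct is padded to a full block (`stride=BLOCK_SIZE`).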

25 Takeaway. Cache coherence is critical for ensuring correctness. Software-managed cache coherence is very difficult. Hardware coherence protocols help programmers write correct and high-performance programs. Snooping cache protocols: VI, MSI, MESI (lab5), MOESI (common in practice). Directory-based cache coherence: more scalable.

26 Topics in Parallel Computer Architecture. Cache coherence: ensure correct operation in the presence of private caches. Memory consistency: ordering of memory operations; what should the programmer expect the hardware to provide? Shared memory synchronization: instructions to perform atomic operations (e.g., for locks).

27 Memory Consistency

28 Motivational Example. Dekker's algorithm for critical sections [Adve WRL Research Report 95]. With Flag1 and Flag2 initially 0, P1 executes A: Flag1 = 1, then B: enter the critical section if Flag2 is 0; P2 executes X: Flag2 = 1, then Y: enter the critical section if Flag1 is 0. Can the two processors be in the critical section at the same time, given that they both obey the von Neumann model?

29 Motivational Example. Intuition: assume P1 is in the critical section, which means Flag2 must be 0, which means P2 cannot have executed Flag2 = 1, which means P2 cannot be in the critical section. [Adve WRL Research Report 95]

30 Both Processors in Critical Section! Consider a store buffer (aka write buffer). Remember this from OoO? It can also be used with in-order execution! (Figure: a processor and a cache, connected by a load path and by the store buffer, which is labeled "store (and load bypassing)".)

31 Both Processors in Critical Section! Cycle 1 (A): the value is written into P1's store buffer; P1 thinks A is executed, but memory is not updated until cycle 51. Cycle 1 (X): the value is written into P2's store buffer; P2 thinks X is executed, but memory is not updated until cycle 52. Cycle 2 (B): P1 still sees 0 in Flag2, so it enters the critical section. Cycle 2 (Y): P2 still sees 0 in Flag1, so it enters the critical section. (Figure: operations A, B, X, Y of Dekker's algorithm [Adve WRL Research Report 95].)
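
The timeline on slide 31 can be replayed with a toy store-buffer model. This is a sketch, not a description of any real microarchitecture: each processor has a FIFO buffer whose entries reach memory at a given cycle, and a load checks the processor's own buffer before memory. The function name and the `drain_a` / `drain_x` parameters are hypothetical; their defaults are the cycle numbers from the slide.

```python
def dekker_with_store_buffers(drain_a=51, drain_x=52):
    """Replay A, X at cycle 1 and B, Y at cycle 2 with buffered stores.
    drain_a / drain_x: cycle at which store A / X finally updates memory."""
    memory = {"Flag1": 0, "Flag2": 0}
    buffers = {"P1": [], "P2": []}  # FIFO entries: (drain_cycle, address, value)

    def store(proc, address, value, drain_cycle):
        buffers[proc].append((drain_cycle, address, value))

    def load(proc, address):
        # A load sees the processor's own buffered stores first, then memory.
        for _, buffered_address, value in reversed(buffers[proc]):
            if buffered_address == address:
                return value
        return memory[address]

    def drain(cycle):
        for proc in buffers:
            while buffers[proc] and buffers[proc][0][0] <= cycle:
                _, address, value = buffers[proc].pop(0)
                memory[address] = value

    in_critical = {}
    for cycle in range(1, max(drain_a, drain_x, 2) + 1):
        drain(cycle)
        if cycle == 1:
            store("P1", "Flag1", 1, drain_a)  # A
            store("P2", "Flag2", 1, drain_x)  # X
        elif cycle == 2:
            in_critical["P1"] = load("P1", "Flag2") == 0  # B
            in_critical["P2"] = load("P2", "Flag1") == 0  # Y
    return in_critical, memory
```

With the default latencies both processors read 0 and enter the critical section, even though memory eventually holds 1 in both flags. If both stores are made visible before cycle 2 (for example `drain_a=1, drain_x=1`), neither processor enters in this model.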

32 Both Processors in Critical Section! What happened? P1's view of memory operations: A (cycle 1), B (cycle 2), X (cycle 52); A appeared to happen before X. P2's view of memory operations: X (cycle 1), Y (cycle 2), A (cycle 51); X appeared to happen before A.

33 The Problem. The two processors did NOT see the same order of operations to memory. The "happened before" relationship between multiple updates to memory was inconsistent between the two processors' points of view. As a result, each processor thought the other was not in the critical section.

34 How Can We Solve The Problem? Idea: sequential consistency. I. All processors see the same order of operations to memory, i.e., all memory operations happen in an order (called the global total order) that is consistent across all processors. II. Within this global order, each processor's operations appear in sequential order with respect to its own operations. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers,

35 Another Way of Interpreting SC. The whole system (all processors and memory) sees the same order for all four memory operation combinations performed by any processor: load → load, load → store, store → store, store → load.

36 Sequentially Consistent Operation Orders. Potential correct global orders (all are correct): A B X Y; A X B Y; A X Y B; X A B Y; X A Y B; X Y A B. [Adve WRL Research Report 95] Which order (interleaving) is observed depends on the implementation and on dynamic latencies.
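
The six orders on slide 36 can be generated mechanically: they are the interleavings of A, B, X, Y that keep each processor's program order (A before B, X before Y). The sketch below enumerates them and replays Dekker's algorithm under each order, assuming the reading of the operations used in slides 29 and 31 (A sets Flag1, X sets Flag2, B tests Flag2, Y tests Flag1). The function names are hypothetical.

```python
from itertools import permutations


def sc_orders():
    """All global orders of A, B, X, Y that keep A before B and X before Y."""
    orders = []
    for order in permutations("ABXY"):
        if (order.index("A") < order.index("B")
                and order.index("X") < order.index("Y")):
            orders.append("".join(order))
    return orders


def who_enters(order):
    """Execute the operations one at a time against a single shared memory."""
    flags = {"Flag1": 0, "Flag2": 0}
    entered = set()
    for op in order:
        if op == "A":
            flags["Flag1"] = 1
        elif op == "X":
            flags["Flag2"] = 1
        elif op == "B" and flags["Flag2"] == 0:
            entered.add("P1")
        elif op == "Y" and flags["Flag1"] == 0:
            entered.add("P2")
    return entered
```

In every one of the six orders at most one processor enters the critical section, which matches the intuition on slide 29.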

37 Issues with Sequential Consistency (SC)? Nice abstraction for programming, intuitive. Two issues: the ordering requirements are too conservative, and they limit the aggressiveness of performance enhancement techniques. E.g., can't use a store buffer.

38 Total Store Order (TSO). Remember, for sequential consistency, the whole system (all processors and memory) sees the same order for all four memory operation combinations performed by any processor: load → load, load → store, store → store, store → load. TSO relaxes the store → load ordering requirement. Major benefit: a FIFO-based store buffer can be used. Modern ISAs that use the TSO model: SPARC; x86 is also similar.

39 Total Store Order (TSO) Example. TSO allows both P1 and P2 to be in the critical section: P2 is allowed to see B (load) before A (store), and P1 is allowed to see Y (load) before X (store). How should a programmer fix Dekker's algorithm? (Figure: operations A, B, X, Y of Dekker's algorithm [Adve WRL Research Report 95].)

40 Memory Fence. All memory operations before a fence must complete and be visible to other processors before the fence is executed. All memory operations after the fence must wait for the fence to complete. Fences complete in program order. (Figure: operations A, B, X, Y of Dekker's algorithm [Adve WRL Research Report 95].)
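
The effect of a fence on Dekker's algorithm can be sketched with a toy core model in which a fence simply drains the store buffer. This is an illustration under assumptions, not an ISA definition: the `Core` class, its methods, and the placement of the fence between each store and the following load are all hypothetical, and only one interleaving (A, X, then B, Y) is replayed.

```python
class Core:
    """Toy core with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []  # FIFO of (address, value)

    def store(self, address, value):
        # Buffered: not yet visible to other cores.
        self.store_buffer.append((address, value))

    def load(self, address):
        # Own buffered stores are visible to this core before memory is.
        for buffered_address, value in reversed(self.store_buffer):
            if buffered_address == address:
                return value
        return self.memory[address]

    def fence(self):
        # Drain in FIFO order so all earlier stores become globally visible.
        while self.store_buffer:
            address, value = self.store_buffer.pop(0)
            self.memory[address] = value


def dekker(use_fence):
    memory = {"Flag1": 0, "Flag2": 0}
    p1, p2 = Core(memory), Core(memory)
    p1.store("Flag1", 1)  # A
    p2.store("Flag2", 1)  # X
    if use_fence:
        p1.fence()  # assumed placement: between A and B
        p2.fence()  # assumed placement: between X and Y
    p1_enters = p1.load("Flag2") == 0  # B
    p2_enters = p2.load("Flag1") == 0  # Y
    return p1_enters, p2_enters
```

Without the fences both loads read 0 and both processors enter; with them, each store is visible before the other processor's load, so in this interleaving neither enters.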

41 The General Problem of Memory Ordering. A contract between software and hardware, specified by the ISA. The ISA specifies what programmers can assume about memory ordering, e.g., whether sequential consistency (or another memory consistency model) is provided. Preserving an intuitive model (e.g., sequential consistency) simplifies the programmer's life, but makes the hardware designer's life difficult (it limits the performance optimizations that can be used). Another example of the programmer-microarchitect tradeoff.

42 Topics in Parallel Computer Architecture. Cache coherence: ensure correct operation in the presence of private caches. Memory consistency: ordering of memory operations; what should the programmer expect the hardware to provide? Shared memory synchronization: instructions to perform atomic operations (e.g., for locks).

43 Synchronization

44 Race Condition. Unpredictable results, called race conditions, can happen if we don't control access to shared variables. This is a concurrency problem; it can occur on single processors also. E.g., x++ from multiple threads; assume x is initialized to 0. What is the value of x after the following execution?
    CPU 1              CPU 2
    Ld  r1, x          Ld  r1, x
    Add r1, r1, 1      Add r1, r1, 1
    St  r1, x          St  r1, x
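
The question on slide 44 can be explored by brute force: interleave the two three-instruction sequences in every possible way and record the final value of x. This is a sketch that assumes each instruction executes atomically and that each CPU has its own register r1; `run_schedule` and `all_final_values` are hypothetical names.

```python
from itertools import combinations

PROGRAM = ["Ld", "Add", "St"]  # x++ as three instructions per CPU


def run_schedule(schedule):
    """schedule lists CPU ids (0 or 1), one entry per executed instruction."""
    x = 0
    r1 = [0, 0]  # private register r1 of each CPU
    pc = [0, 0]  # next instruction index of each CPU
    for cpu in schedule:
        op = PROGRAM[pc[cpu]]
        if op == "Ld":
            r1[cpu] = x
        elif op == "Add":
            r1[cpu] += 1
        else:  # "St"
            x = r1[cpu]
        pc[cpu] += 1
    return x


def all_final_values():
    """Final x over every interleaving that keeps each CPU's program order."""
    results = set()
    for slots in combinations(range(6), 3):  # steps that belong to CPU 0
        schedule = [0 if step in slots else 1 for step in range(6)]
        results.add(run_schedule(schedule))
    return results
```

When one CPU finishes before the other starts, x ends up as 2; when both load before either stores, one increment is lost and x ends up as 1.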

45 Coordinating Access to Shared Data. Locks: a simple primitive to ensure updates to single variables occur within a critical section. Many variations (spinlocks, semaphores, ...). CPU 1: LOCK x; Ld r1, x; Add r1, r1, 1; St r1, x; UNLOCK x. CPU 2: LOCK x; wait; wait; lock acquired; Ld r1, x; Add r1, r1, 1.


46 Locks / Critical Sections. Enforce mutually exclusive access to shared data: only one thread can be executing it at a time. Contended critical sections make threads wait → threads causing serialization can be on the critical path. Each thread: loop { Compute; lock(a); Update shared data; unlock(a) }

47 How NOT To Implement Locks.
    Lock:   while (lock_var == 1);
            lock_var = 1;
    Unlock: lock_var = 0;
What's the problem? Testing whether lock_var is 1 and setting it to 1 are not atomic, i.e., another processor can set lock_var to 1 in between → multiple processors acquire the lock!
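
The failure described on slide 47 can be replayed step by step. The sketch below hard-codes one bad interleaving, in which both processors finish the test before either performs the set; the function name is hypothetical and the step ordering is the assumption being illustrated.

```python
def naive_lock_race():
    """Replay one interleaving of the broken lock for processors P1 and P2."""
    lock_var = 0
    holders = []

    # Step 1: both processors run the test `while (lock_var == 1);`
    # and both fall through because lock_var is still 0.
    p1_sees_free = lock_var == 0
    p2_sees_free = lock_var == 0

    # Step 2: each processor that saw 0 now sets lock_var and proceeds.
    if p1_sees_free:
        lock_var = 1
        holders.append("P1")
    if p2_sees_free:
        lock_var = 1
        holders.append("P2")
    return holders
```

The returned list contains both processors, i.e., mutual exclusion is violated in this interleaving.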

48 Atomic Read & Write Instructions. Aka read-modify-write. Specify a memory location and a register. I. The value in the memory location is read into the register. II. Another value is stored into the location. Many variants, based on what values are allowed in II. Simple example: test&set. Read the memory location into the specified register, then store the constant 1 into the location.

49 Using Test&Set to Implement a Lock. Initialize location to 0.
    lock:   t&s register, location   // atomic read-modify-write
            bnz lock                 // if not 0, try again
            ret                      // locked; value in location is 1
    unlock: st location, #0          // write 0 to location
            ret
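
The lock on slide 49 works because the read and the write of test&set cannot be separated. The toy model below captures that by treating `test_and_set` as one indivisible step of a simulated memory; Python itself gives no such hardware guarantee, so the atomicity is an assumption of the model. The class and function names are hypothetical.

```python
class Memory:
    """Simulated memory location with a test&set operation."""

    def __init__(self):
        self.location = 0  # 0 = free, 1 = locked

    def test_and_set(self):
        """Modeled as one indivisible step: return the old value, store 1."""
        old = self.location
        self.location = 1
        return old


def try_lock(mem):
    # One iteration of the spin loop: acquired only if the old value was 0.
    return mem.test_and_set() == 0


def unlock(mem):
    mem.location = 0


def contend(mem, procs):
    """Each processor executes t&s once, in the given order; return winners."""
    return [proc for proc in procs if try_lock(mem)]
```

However the attempts are ordered, only the first test&set on a free lock observes 0, so `contend` never returns more than one winner until `unlock` is called.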

50 Many Others. Other read-modify-write primitives: swap, compare&swap. More fancy implementations avoid spinning, reduce memory traffic, promote fairness, etc. All details are defined in the ISA.

51 Course Summary. ISA. Uarch: datapath, control; single cycle, multi cycle; pipelining (basic, dependency handling, branch prediction); advanced uarch (OOO, SIMD, VLIW, superscalar). Caches (advanced). Virtual memory. DRAM. Multi-core. ALL DONE!
