EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors



1 EN164: Design of Computing Systems
Lecture 34: Misc Multi-cores and Multi-processors
Professor Sherief Reda
Electrical Sciences and Computer Engineering, School of Engineering, Brown University
Spring 2011
[Material from Patterson & Hennessy, 4th ed., and Harris, 1st ed.]
S. Reda, EN164 Sp 11

2 Parallel computing
- Need for improved performance in high-performance computing (HPC) applications (e.g., weather simulation, protein folding) and planetary-scale web services (e.g., Google search, YouTube)
- Hope: speed up code by a factor equal to the number of cores or processors
- Difficulties:
  - Partitioning
  - Coordination
  - Communication overhead
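The gap between the hoped-for speedup and what overhead allows can be illustrated with a toy model. This is only a sketch: it assumes the work divides evenly across the cores and that coordination and communication add a fixed overhead to the parallel run time. The function name `speedup` and the numbers are illustrative assumptions, not from the lecture.

```python
def speedup(t_serial, cores, overhead=0.0):
    """Speedup under a toy model: parallel time = t_serial / cores + overhead."""
    t_parallel = t_serial / cores + overhead
    return t_serial / t_parallel


ideal = speedup(100.0, 4)               # no overhead: speedup equals the core count
with_overhead = speedup(100.0, 4, 25.0)  # fixed overhead reduces the speedup
```

With zero overhead the model returns the core count; any overhead pulls the result below it.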

3 Factorial program example
- 10! = (10 x 9 x 8 x 7 x 6) x (5 x 4 x 3 x 2 x 1): the two partial products can be computed separately and then multiplied together
- Need to partition the algorithm and re-code it to leverage more than one core or processor
- Need to communicate data
- Need to coordinate and synchronize
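The partitioning of 10! into two partial products can be sketched in code. The example below is a minimal illustration, not the lecture's implementation: the names `partial_product` and `parallel_factorial` are hypothetical, and it uses Python's standard `concurrent.futures.ThreadPoolExecutor` to stand in for the cores.

```python
from concurrent.futures import ThreadPoolExecutor
from math import prod


def partial_product(lo, hi):
    """Multiply the integers lo..hi inclusive."""
    return prod(range(lo, hi + 1))


def parallel_factorial(n, workers=2):
    """Partition 1..n into contiguous chunks, multiply each chunk in a
    worker, then combine the partial products."""
    if n < 2:
        return 1
    workers = min(workers, n)
    step = -(-n // workers)  # ceiling division: chunk length
    chunks = [(lo, min(lo + step - 1, n)) for lo in range(1, n + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Partition: each worker receives its own (lo, hi) range.
        partials = list(pool.map(lambda c: partial_product(*c), chunks))
    # Communicate and combine: the partial results are multiplied together.
    return prod(partials)
```

With `n = 10` and two workers the chunks are 1..5 and 6..10, matching the split on the slide. In CPython the threads illustrate partitioning and combining rather than a real speedup, because the global interpreter lock serializes pure-Python computation.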

4 Multi-core processors
- Moore's Law + ILP wall + power wall -> multi-core processors
- Each core has private L1 (data and instruction) and L2 caches; the cores share the L3 cache and the bus to DRAM

5 Multi-processor systems
- Two processor slots:
  - Processors share DRAM
  - Memory controller (north bridge) arbitrates memory requests
- Servers do not share memory
- Servers are connected by a network


6 1. Communication using shared memory
- SMP: shared memory multiprocessor
- Hardware provides a single physical address space for all processors
- Shared variables are synchronized using locks
- Memory access time: UMA (uniform) vs. NUMA (nonuniform)

7 Cache coherence problem
- Suppose two CPU cores share a physical address space
- Write-through caches

Time step  Event                CPU A's cache  CPU B's cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1
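The problem on this slide can be reproduced with a small simulation. The sketch below is a toy model under stated assumptions: the class name `WriteThroughCache` is hypothetical, each cache line holds a single value, X starts at 0, and there is deliberately no coherence protocol.

```python
class WriteThroughCache:
    """Private cache with write-through and no coherence protocol."""

    def __init__(self, memory):
        self.memory = memory  # shared dict: address -> value
        self.lines = {}       # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]         # hit: the copy may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value       # write-through: memory is updated too


memory = {"X": 0}                       # assumed initial value of X
cache_a = WriteThroughCache(memory)
cache_b = WriteThroughCache(memory)

cache_a.read("X")                       # step 1: CPU A reads X
cache_b.read("X")                       # step 2: CPU B reads X
cache_a.write("X", 1)                   # step 3: CPU A writes 1 to X
stale = cache_b.read("X")               # CPU B still hits its old copy
```

After step 3, memory and CPU A's cache hold 1, while CPU B's read hits its own old copy and returns 0.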

8 Defining cache coherency
- Informally: reads return the most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes) => read returns the written value
  - P1 writes X; P2 reads X (sufficiently later) => read returns the written value (c.f. CPU B reading X after step 3 in the example)
  - P1 writes X, P2 writes X => all processors see the writes in the same order and end up with the same final value for X

9 Maintaining cache coherence with snooping protocol
- A cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - A subsequent read in another cache misses
  - The owning cache supplies the updated value

CPU activity         Bus activity       CPU A's cache  CPU B's cache  Memory
CPU A reads X        Cache miss for X   0                             0
CPU B reads X        Cache miss for X   0              0              0
CPU A writes 1 to X  Invalidate for X   1                             0
CPU B reads X        Cache miss for X   1              1              1
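The sequence of events on this slide can be traced with a toy write-invalidate model. This is a sketch under simplifying assumptions, not a full protocol: `Bus` and `SnoopingCache` are hypothetical names, each block holds one value, and a cache holding a modified copy writes it back to memory when another cache misses on it.

```python
class Bus:
    """Shared bus that every cache snoops."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def read_miss(self, requester, addr):
        # A cache owning a modified copy supplies it; memory is updated too.
        for cache in self.caches:
            if cache is not requester and cache.dirty.get(addr):
                self.memory[addr] = cache.lines[addr]
                cache.dirty[addr] = False
        return self.memory[addr]

    def invalidate(self, requester, addr):
        # All other caches drop their copy of the block.
        for cache in self.caches:
            if cache is not requester:
                cache.lines.pop(addr, None)
                cache.dirty.pop(addr, None)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}   # address -> cached value
        self.dirty = {}   # address -> True if modified locally
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                  # cache miss
            self.lines[addr] = self.bus.read_miss(self, addr)
            self.dirty[addr] = False
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(self, addr)             # gain exclusive access
        self.lines[addr] = value
        self.dirty[addr] = True
```

Replaying the four events with X initially 0 (an assumed starting value) leaves memory at 0 after CPU A's write, and CPU B's later read misses and receives 1 from CPU A's cache.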

10 False sharing
- Block size plays an important role in cache coherency
- Large blocks can cause false sharing: when two unrelated shared variables are located in the same cache block, the full block is exchanged between processors even though the processors are accessing different variables
- False sharing can be eliminated through careful coding and compilation
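Whether two variables can falsely share comes down to address arithmetic. The sketch below is an illustration only: the function names, the 64-byte block size, and the addresses are assumptions chosen for the example. It checks whether two addresses land in the same block and shows how padding a variable to the next block boundary separates them.

```python
def block_index(address, block_size):
    """Index of the cache block that contains a byte address."""
    return address // block_size


def falsely_share(addr_a, addr_b, block_size):
    """True if two distinct variables fall in the same cache block."""
    return addr_a != addr_b and (
        block_index(addr_a, block_size) == block_index(addr_b, block_size)
    )


def pad_to_block(address, block_size):
    """Next block-aligned address at or after `address`."""
    return -(-address // block_size) * block_size


BLOCK = 64                                   # assumed block size in bytes
counter_a = 0x1000                           # hypothetical variable addresses
counter_b = 0x1004
padded_b = pad_to_block(counter_b, BLOCK)    # moves counter_b to 0x1040
```

Here `counter_a` and `counter_b` sit in the same 64-byte block, while `counter_a` and `padded_b` do not. Larger blocks make such collisions more likely, which is the slide's point about block size.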

11 2. Communication using message passing
- Each processor has a private physical address space
- Hardware sends/receives messages between cores/processors using on-chip and/or off-chip interconnection networks
- More scalable than shared memory, but harder to program
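The message-passing style can be sketched in software. The example below is only an analogy under stated assumptions: Python threads do share an address space, so `queue.Queue` objects stand in for the hardware send/receive channels, and each worker touches only the data it receives in a message. The names `worker` and `message_passing_sum` are hypothetical.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive lists of numbers and send back their sums; stop on None."""
    while True:
        msg = inbox.get()            # receive
        if msg is None:
            break
        outbox.put(sum(msg))         # send the partial result


def message_passing_sum(data, n_workers=2):
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for t in threads:
        t.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::n_workers])  # each worker gets its own copy of a slice
        inbox.put(None)                # end-of-work message
    total = sum(results.get() for _ in range(n_workers))
    for t in threads:
        t.join()
    return total
```

No locks appear in the worker code: all coordination happens through explicit sends and receives, which is why the data has to be partitioned and shipped by the programmer.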

12 Interconnection networks (on-chip between cores, or off-chip between processors or servers)
- Network topologies: arrangements of processors, switches, and links
  - Bus
  - Ring
  - N-cube (N = 3)
  - 2D mesh
  - Fully connected
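One way to compare these topologies is by the number of links they need. The sketch below counts node-to-node links for each arrangement; the function names are hypothetical, the ring formula assumes at least three nodes, and the bus is left out because it is a single shared medium rather than a set of point-to-point links.

```python
def ring_links(nodes):
    """Each node connects to its successor, closing the loop (nodes >= 3)."""
    return nodes


def mesh_2d_links(rows, cols):
    """Horizontal links plus vertical links in a rows x cols grid."""
    return rows * (cols - 1) + cols * (rows - 1)


def n_cube_links(n):
    """An n-cube has 2**n nodes, each with n links, and each link is shared."""
    return n * 2 ** (n - 1)


def fully_connected_links(nodes):
    """One link for every pair of nodes."""
    return nodes * (nodes - 1) // 2


# Link counts for 8 nodes under each topology.
links_for_8_nodes = {
    "ring": ring_links(8),
    "2D mesh (2 x 4)": mesh_2d_links(2, 4),
    "N-cube (N = 3)": n_cube_links(3),
    "fully connected": fully_connected_links(8),
}
```

For 8 nodes the counts are 8, 10, 12, and 28 links respectively, which shows how quickly the cost of a fully connected network grows.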

13 3. Coordination using synchronization primitives
- Two processors sharing an area of memory
  - P1 writes, then P2 reads
  - Data race if P1 and P2 don't synchronize: the result depends on the order of accesses
- Example: two threads want to increment a global variable
- Mutual exclusion (mutex) algorithms are used in parallel programming to prevent the simultaneous use of a common resource, such as a global variable, by pieces of code called critical sections
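The two-threads-incrementing-a-global example can be sketched with a software mutex. This is a minimal illustration using Python's standard `threading.Lock`; the names `counter`, `counter_lock`, `increment`, and `run` are assumptions made for the example.

```python
import threading

counter = 0                          # shared global variable
counter_lock = threading.Lock()      # mutex guarding the counter


def increment(times):
    global counter
    for _ in range(times):
        with counter_lock:           # critical section: one thread at a time
            counter += 1             # read-modify-write on shared state


def run(n_threads=2, times=10_000):
    """Reset the counter, increment it from several threads, return the total."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=increment, args=(times,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock held around the increment, the final count always equals `n_threads * times`. Without it, the read-modify-write steps of different threads may interleave and updates can be lost.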

14 Hardware support for mutex
- Hardware support required
  - Atomic read/write memory operation
  - No other access to the location allowed between the read and the write
- Could be a single instruction
  - E.g., atomic swap of a register and a memory location
- Or an atomic pair of instructions

15 Synchronization in MIPS
- Load linked: ll rt, offset(rs)
- Store conditional: sc rt, offset(rs)
  - Succeeds if the location has not changed since the ll: returns 1 in rt
  - Fails if the location has changed: returns 0 in rt
- Example: atomic swap (to test/set lock variable)

    try: add  $t0,$zero,$s4   # copy exchange value
         ll   $t1,0($s1)      # load linked
         bne  $t1,$zero,try   # retry while the loaded value is nonzero
         sc   $t0,0($s1)      # store conditional
         beq  $t0,$zero,try   # branch if store fails
         add  $s4,$zero,$t1   # put loaded value in $s4
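The success and failure rules for ll/sc can be traced with a toy software model. The sketch below is not how the hardware is built: `LinkedMemory`, `store`, and `try_acquire` are hypothetical names, the model is driven sequentially to imitate an interleaving of two CPUs, and any write to a linked address is assumed to break every link to it.

```python
class LinkedMemory:
    """Toy model of load-linked / store-conditional on word addresses."""

    def __init__(self):
        self.words = {}   # address -> value (unset words read as 0)
        self.links = {}   # cpu id -> address it has load-linked

    def store(self, addr, value):
        """Ordinary store: write and break every link to this address."""
        self.words[addr] = value
        for cpu in [c for c, a in self.links.items() if a == addr]:
            del self.links[cpu]

    def ll(self, cpu, addr):
        """Load linked: return the value and remember the link."""
        self.links[cpu] = addr
        return self.words.get(addr, 0)

    def sc(self, cpu, addr, value):
        """Store conditional: 1 on success, 0 if the link was broken."""
        if self.links.get(cpu) != addr:
            return 0
        self.store(addr, value)
        return 1


def try_acquire(mem, cpu, lock_addr):
    """One pass of the test/set loop: True if this CPU took the lock."""
    if mem.ll(cpu, lock_addr) != 0:   # lock already held
        return False
    return mem.sc(cpu, lock_addr, 1) == 1
```

If two CPUs both load-link a free lock, only the first store conditional succeeds; the second returns 0, which is the case the `beq $t0,$zero,try` branch handles by retrying.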

16 Summary of HW design issues for parallel processing
1. Need HW support for memory coherence
2. Need HW support for synchronization
3. Need HW support for communication (on-chip and off-chip)
