Multiprocessor Architecture. CS 5334/4390 Spring 2014 Shirley Moore, Instructor February 4, 2014


1 Multiprocessor Architecture CS 5334/4390 Spring 2014 Shirley Moore, Instructor February 4, 2014

2 Agenda
- Announcements (5 min)
- Quick quiz (10 min)
- Analyze results of STREAM benchmark (15 min)
- Multiprocessor architecture (20 min)
- Group exercise on cache coherence (15 min)
- Wrapup and homework (5 min)

3 Announcements
- XSEDE Workshop on Optimizing Code for the Xeon Phi, 8:00-4:00, Friday, February 21
  - Register individually to get a training account on Stampede and access to the hands-on exercises (see the course website under Announcements for registration info)
  - Meet as a group in CCSB (or view the webcast on your own)
  - Attend as much as you can
  - Extra credit on homework grade for completing the hands-on exercises (equivalent of one homework)
- See the course website under Resources for a sample batch script for running jobs on hpc.utep.edu compute nodes (thanks to Luis Gutierrez)
- Quiz 1 answer posted under Quizzes and Tests on the course website

4 STREAM Benchmark
- Run on hpc.utep.edu with 1, 2, 4, 6, 8, 10, 12 threads
- Plot bandwidth vs. number of threads
- What is the maximum memory bandwidth?
- Can you achieve this memory bandwidth with one thread?
- At what number of threads is the memory bandwidth saturated?
- What are the implications for how many threads you should use in a multithreaded program?

5 Learning Objectives
After completing today's lesson, you should be able to
- Explain the difference between the logical communication model and the physical memory architecture
- Explain the need for a cache coherence protocol on a shared memory architecture
- Trace the effect of a given cache coherence protocol on cache and memory values and states
- Explain how false sharing can affect the performance of a multithreaded program

6 Flynn's Taxonomy
M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, , Dec
- Single Instruction Single Data (SISD): uniprocessor
- Multiple Instruction Single Data (MISD): ????
- Single Instruction Multiple Data (SIMD): vector processors, GPUs
- Multiple Instruction Multiple Data (MIMD): SMPs, clusters

7 Parallel Architecture
- A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.
- Parallel Architecture = Processor Architecture + Communication Architecture
- Two classes of multiprocessors with respect to memory:
  1. Centralized-memory multiprocessor: fewer than a few dozen processor chips, small enough to share a single, centralized memory
  2. Physically distributed-memory multiprocessor: a larger number of cores and chips, with memory distributed among the processors

8 Physically Centralized Memory

9 Physically Distributed Memory

10 Two Logical Models for Communication and Memory Architecture
1. Communication occurs through a shared address space.
  - Centralized-memory multiprocessors use this type of communication: symmetric shared-memory multiprocessors (SMP), uniform memory access (UMA).
  - Physically separate memories can also be addressed as one logically shared address space (in practice only within a node). Any processor can then reference any memory location, assuming it has the access rights. This is non-uniform memory access (NUMA).
2. Communication occurs by explicitly passing messages among the processors.
  - Message-passing multiprocessor (logical)
  - Distributed-memory multiprocessor (physical)
Question: In the case of logical shared memory, does the physical memory architecture make a difference in the way you program?

11 What is Cache Coherence?
- In a shared memory multiprocessor with private caches, there may be multiple copies of a given data item in different processors' caches as well as in main memory.
- Need to enforce coherency and consistency
  - Coherency: a read sees the result of the most recent write.
  - Consistency: what value is reflected in processor caches and memory at any given time?
- Enforcement is carried out by means of a cache coherence protocol

12 Single Processor Caching
- Hit: data in the cache
- Miss: data not in the cache
- Hit rate: h
- Miss rate: m = (1-h)
[Figure: processor P with a cache and a memory, each holding x]

13 Writing to the cache
- Write through vs. write back
[Figure: three panels (Before, Write through, Write back), each showing processor P, the cache copy of x, and the memory copy of x]
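
The difference between the two policies can be sketched in a few lines of C. This is a minimal illustration, not taken from the slides: it assumes a hypothetical one-word cache in front of a one-word memory, and the names `one_word_system`, `policy_write`, and `policy_evict` are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a single word x, cached by a single processor. */
typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;

typedef struct {
    int memory;          /* value of x in main memory */
    int cache;           /* value of x in the cache */
    bool dirty;          /* cache holds a newer value than memory */
    write_policy policy;
} one_word_system;

one_word_system policy_init(int x, write_policy policy) {
    one_word_system s = { x, x, false, policy };
    return s;
}

/* The processor writes x. The cache is always updated; memory is
 * updated immediately only under write through. */
void policy_write(one_word_system *s, int value) {
    assert(s != NULL);
    s->cache = value;
    if (s->policy == WRITE_THROUGH) {
        s->memory = value;
    } else {
        s->dirty = true;   /* memory is now stale until eviction */
    }
}

/* The line is evicted. Under write back, a dirty value is flushed. */
void policy_evict(one_word_system *s) {
    assert(s != NULL);
    if (s->dirty) {
        s->memory = s->cache;
        s->dirty = false;
    }
}
```

After `policy_write(&s, 9)` on a system initialized with 3, a write-through system has 9 in both cache and memory, while a write-back system still has 3 in memory until `policy_evict` runs.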

14 Need for Cache Coherence
[Figure: processors P1, P2, P3, ..., Pn, each holding a cached copy x:3]
- Multiple copies of x
- What if P1 updates x and then P3 reads x?

15 Cache Write Policies
Writing to cache in the n-processor case:
- Write Update - Write Through
- Write Invalidate - Write Back
- Write Update - Write Back
- Write Invalidate - Write Through

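16 Write-Invalidate
[Figure: three panels (Before, Write through, Write back) showing the copies of x held by P1, P2, and P3; after the write, the other processors' copies are marked invalid (I)]

17 Write-Update
[Figure: three panels (Before, Write through, Write back) showing the copies of x held by P1, P2, and P3; after the write, all copies are updated]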

18 Write Back - Write Invalidate
State: Description
- Shared (Read-Only) [RO]: Data is valid and can be read safely. Multiple copies can be in this state.
- Exclusive (Read-Write) [RW]: Only one valid cache copy exists and can be read from and written to safely. Copies in other caches are invalid.
- Invalid [INV]: The copy is inconsistent.

19 Write Back - Write Invalidate (cont.)
Event: Action
- Read Hit: Use the local copy from the cache.
- Read Miss: If no Exclusive (Read-Write) copy exists, supply a copy from global memory and set the state of this copy to Shared (Read-Only). If an Exclusive (Read-Write) copy exists, take the copy from the cache that holds it in the Exclusive (Read-Write) state, update global memory and the local cache with that copy, and set the state to Shared (Read-Only) in both caches.

20 Write Back - Write Invalidate (cont.)
Event: Action
- Write Hit: If the copy is Exclusive (Read-Write), perform the write locally. If the state is Shared (Read-Only), broadcast an Invalid command to all caches and set the state to Exclusive (Read-Write).
- Write Miss: Get a copy from either a cache with an Exclusive (Read-Write) copy or from global memory itself. Broadcast an Invalid command to all caches. Update the local copy and set its state to Exclusive (Read-Write).
- Block Replacement: If a copy is in the Exclusive (Read-Write) state, it has to be written back to main memory when the block is replaced. If the copy is in the Invalid or Shared (Read-Only) state, no write back is needed when the block is replaced.
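
The event table above can be traced with a small simulation. The following is a sketch under simplifying assumptions that are not in the slides: a single data item x, three caches, and a cache line of exactly one word (so the copy fetched on a write miss would be overwritten at once and the fetch is omitted). All identifiers (`coherent_system`, `wbwi_read`, and so on) are hypothetical.

```c
#include <assert.h>

#define NCACHES 3

/* Invalid, Shared (Read-Only), Exclusive (Read-Write) */
typedef enum { INV, RO, RW } line_state;

typedef struct {
    int memory;                 /* value of x in global memory */
    int value[NCACHES];         /* cached value of x, per processor */
    line_state state[NCACHES];  /* protocol state, per cache */
} coherent_system;

void wbwi_init(coherent_system *s, int x) {
    s->memory = x;
    for (int i = 0; i < NCACHES; i++) {
        s->value[i] = 0;
        s->state[i] = INV;
    }
}

/* Index of the cache holding an Exclusive copy, or -1 if none. */
static int find_exclusive(const coherent_system *s) {
    for (int i = 0; i < NCACHES; i++)
        if (s->state[i] == RW) return i;
    return -1;
}

int wbwi_read(coherent_system *s, int p) {
    assert(p >= 0 && p < NCACHES);
    if (s->state[p] != INV) return s->value[p];   /* read hit */
    int owner = find_exclusive(s);                /* read miss */
    if (owner >= 0) {
        s->memory = s->value[owner];  /* update global memory */
        s->state[owner] = RO;         /* owner drops to Shared */
    }
    s->value[p] = s->memory;
    s->state[p] = RO;
    return s->value[p];
}

void wbwi_write(coherent_system *s, int p, int v) {
    assert(p >= 0 && p < NCACHES);
    if (s->state[p] != RW) {
        /* Write hit on a Shared copy, or write miss: broadcast
         * Invalid to all other caches, then take ownership. */
        for (int i = 0; i < NCACHES; i++)
            if (i != p) s->state[i] = INV;
        s->state[p] = RW;
    }
    s->value[p] = v;                  /* perform the write locally */
}

void wbwi_replace(coherent_system *s, int p) {
    assert(p >= 0 && p < NCACHES);
    if (s->state[p] == RW) s->memory = s->value[p];  /* write back */
    s->state[p] = INV;
}
```

For example, after `wbwi_init(&s, 3)`, reads by processors 0 and 2 leave both copies Shared. A call `wbwi_write(&s, 0, 9)` then invalidates processor 2's copy while memory still holds 3, and a later `wbwi_read(&s, 2)` returns 9 and updates memory to 9.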

21 Example 1
[Figure: memory M shared by two caches C, one for processor P and one for processor Q]
Group exercise: Complete the table in the handout, assuming a write-back write-invalidate coherence protocol.

22 Snoopy vs. Directory-Based Protocols

23 What is a directory?

24 Centralized vs. Distributed

25 Snoopy Cache-Coherence Protocols
- The cache controller snoops all transactions on the shared medium (bus or switch)

26 Directory-Based Cache Coherence Protocols
To implement the operations, a directory must track the state of each cache block:
- Shared (S): one or more processors have the block cached, and the value is up to date
- Uncached (U): no processor has a copy of the cache block
- Modified/Exclusive (E): exactly one processor has a copy of the cache block. That processor is called the owner of the block
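
A directory entry for one block can be sketched as a state plus a record of which processors hold a copy. The sketch below assumes a presence bit vector with one bit per CPU, and it tracks only the directory bookkeeping, not the data movement between caches and memory. The names `dir_entry`, `dir_read_miss`, `dir_write_miss`, `dir_write_back`, and `dir_owner` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Uncached, Shared, Exclusive (owned by exactly one processor) */
typedef enum { DIR_U, DIR_S, DIR_E } dir_state;

typedef struct {
    dir_state state;
    uint32_t sharers;  /* assumed layout: bit i set if CPU i has a copy */
} dir_entry;

dir_entry dir_init(void) {
    dir_entry d = { DIR_U, 0 };
    return d;
}

/* Read miss from cpu: the block becomes Shared and cpu is recorded.
 * If the block was Exclusive, the previous owner stays recorded as a
 * sharer (the protocol would first fetch its data to update memory). */
void dir_read_miss(dir_entry *d, int cpu) {
    assert(cpu >= 0 && cpu < 32);
    d->sharers |= (uint32_t)1 << cpu;
    d->state = DIR_S;
}

/* Write miss from cpu: every other copy is invalidated or taken away,
 * and cpu becomes the sole owner. */
void dir_write_miss(dir_entry *d, int cpu) {
    assert(cpu >= 0 && cpu < 32);
    d->sharers = (uint32_t)1 << cpu;
    d->state = DIR_E;
}

/* The owner writes the block back and gives up its copy. */
void dir_write_back(dir_entry *d, int cpu) {
    assert(d->state == DIR_E && d->sharers == ((uint32_t)1 << cpu));
    d->sharers = 0;
    d->state = DIR_U;
}

/* Owner of an Exclusive block, or -1 if the block has no owner. */
int dir_owner(const dir_entry *d) {
    if (d->state != DIR_E) return -1;
    for (int i = 0; i < 32; i++)
        if (d->sharers & ((uint32_t)1 << i)) return i;
    return -1;
}
```

Starting from `dir_init()`, read misses from CPU 0 and CPU 2 give state S with bits 0 and 2 set, and a subsequent write miss from CPU 0 gives state E with only bit 0 set.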

27 Directory-Based Protocol [Figure: CPU 0, CPU 1, and CPU 2, each with a cache, a local memory, and a directory, connected by an interconnection network]

28 Directory-Based Protocol [Figure: each directory entry holds a state and a bit vector. Directory state: U. Memory: 7. Caches: no copies]

29 CPU 0 Reads X [Figure: Read Miss message on the interconnection network. Directory state: U. Memory: 7. Caches: no copies]

30 CPU 0 Reads X [Figure: Directory state: S. Memory: 7. Caches: no copies]

31 CPU 0 Reads X [Figure: Directory state: S. Memory: 7. Caches: 7]

32 CPU 2 Reads X [Figure: Read Miss message. Directory state: S. Memory: 7. Caches: 7]

33 CPU 2 Reads X [Figure: Directory state: S. Memory: 7. Caches: 7]

34 CPU 2 Reads X [Figure: Directory state: S. Memory: 7. Caches: 7, 7]

35 CPU 0 Writes 6 to X [Figure: Write Miss message on the interconnection network. Directory state: S. Memory: 7. Caches: 7, 7]

36 CPU 0 Writes 6 to X [Figure: Invalidate message. Directory state: S. Memory: 7. Caches: 7, 7]

37 CPU 0 Writes 6 to X [Figure: Directory state: E. Memory: 7. Caches: 6]

38 CPU 1 Reads X [Figure: Read Miss message on the interconnection network. Directory state: E. Memory: 7. Caches: 6]

39 CPU 1 Reads X [Figure: Switch to Shared message on the interconnection network. Directory state: E. Memory: 7. Caches: 6]

40 CPU 1 Reads X [Figure: Directory state: E. Memory: 6. Caches: 6]

41 CPU 1 Reads X [Figure: Directory state: S. Memory: 6. Caches: 6, 6]

42 CPU 2 Writes 5 to X [Figure: Write Miss message. Directory state: S. Memory: 6. Caches: 6, 6]

43 CPU 2 Writes 5 to X [Figure: Directory state: S. Memory: 6. Caches: 6, 6]

44 CPU 2 Writes 5 to X (Write back) [Figure: Directory state: E. Memory: 6. Caches: 5]

45 CPU 0 Writes 4 to X [Figure: Directory state: E. Memory: 6. Caches: 5]

46 CPU 0 Writes 4 to X [Figure: Take Away message. Directory state: E. Memory: 6. Caches: 5]

47 CPU 0 Writes 4 to X [Figure: Directory state: E. Memory: 5. Caches: 5]

48 CPU 0 Writes 4 to X [Figure: Directory state: E. Memory: 5. Caches: no copies]

49 CPU 0 Writes 4 to X [Figure: Directory state: E. Memory: 5. Caches: 5]

50 Problem: False Sharing
- Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
- Leads to a ping-pong effect.

51 False Sharing Example
for( i=0; i<n; i++ ) a[i] = b[i];
Let's assume we parallelize the code:
- p = 2
- Each element of a takes 1 word
- A cache line has 8 words

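52 False Sharing Example (cont.)
[Figure: one cache line holding a[0] through a[7], with each element marked as written by processor 0 or written by processor 1]

53 False Sharing Example (cont.)
[Figure: P0 writes a[0], a[2], a[4], ... and P1 writes a[1], a[3], a[5], ...; invalidations (inv) and data pass back and forth between the two caches]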
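
How much of the array is exposed to false sharing depends on how iterations are assigned to processors. The sketch below counts the cache lines of a that are written by more than one processor, comparing the interleaved (cyclic) assignment shown in the figure with a contiguous block assignment. It assumes that a starts on a cache-line boundary and that each element is one word. The function names are hypothetical, and the code only counts affected lines; it does not measure performance.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE 8   /* from the example: a cache line has 8 words */

/* Cyclic assignment: processor (i mod p) writes a[i]. */
static int owner_cyclic(size_t i, int p) {
    return (int)(i % (size_t)p);
}

/* Block assignment: each processor writes one contiguous chunk
 * of ceil(n/p) elements. */
static int owner_block(size_t i, size_t n, int p) {
    size_t chunk = (n + (size_t)p - 1) / (size_t)p;
    return (int)(i / chunk);
}

/* Count the cache lines of a[0..n-1] written by more than one
 * processor. Pass cyclic != 0 for the cyclic assignment and
 * cyclic == 0 for the block assignment. */
size_t falsely_shared_lines(size_t n, int p, int cyclic) {
    assert(p > 0);
    size_t shared = 0;
    for (size_t start = 0; start < n; start += WORDS_PER_LINE) {
        int first = cyclic ? owner_cyclic(start, p)
                           : owner_block(start, n, p);
        for (size_t i = start + 1;
             i < start + WORDS_PER_LINE && i < n; i++) {
            int o = cyclic ? owner_cyclic(i, p)
                           : owner_block(i, n, p);
            if (o != first) {   /* a second writer touches this line */
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

With n = 64 and p = 2, the cyclic assignment leaves all 8 lines written by both processors, while the block assignment leaves none. If n is not a multiple of the line size times p, the block assignment can still share the one line that straddles the chunk boundary.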

54 False Sharing Effect on Performance
- What will be the effect on performance?
- How can we detect that false sharing is occurring?
- How can we fix the problem?

55 Wrapup and Homework
- Homework 1 should be ready by tomorrow and will be due Feb 13
- Read the remainder of Chapter 2 for next class
