EE382 Processor Design, Winter 1998
Chapter 8 Lectures: Multiprocessors, Part I
EE 382 Processor Design, Winter 98/99, Michael Flynn

Processor Issues for MP
- Initialization
- Interrupts
- Virtual memory: TLB coherency
- Physical memory: coherency, synchronization, consistency
- Emphasis here is on physical memory and the system interconnect
Outline
- Partitioning: granularity, overhead and efficiency
- Multi-threaded MP
- Shared-bus MP: coherency, synchronization, consistency
- Scalable MP: cache directories, interconnection networks
- Trends and tradeoffs

Additional References
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (CAQA), Chapter 8
- Culler, Singh, and Gupta, Parallel Computer Architecture: A Hardware/Software Approach,
  http://http.cs.berkeley.edu/~culler/book.alpha/index.html

Representative System
(Figure: CPUs, each containing pipelines, registers, an L1 I-cache, and an L1 D-cache, connect through an L2 cache and a chipset to memory and the I/O bus(es).)
Shared-Memory MP
- Consider systems with a single memory address space.
- Contrast with multi-computers:
  - separate memory address spaces
  - message passing for communication and synchronization
  - example: a network of workstations

Types of Shared-Memory MP
- multithreaded or shared-resource MP
- shared-bus MP (broadcast protocols)
- scalable MP (networked protocols)

Issues
- partitioning of the application into p parallel tasks
- scheduling of tasks to minimize the dependency overhead Tw (communication and synchronization)
Partitioning
- If a uniprocessor executes a program in time T1 with O1 operations, and a p-processor machine executes it in time Tp with Op operations, then Op > O1 due to task overhead.
- Also Sp = T1/Tp < p, where p is the number of processors in the system; p is also the amount of parallelism (the degree of partitioning) available in the program.

Granularity
(Figure: speedup Sp versus grain size. At fine grain, speedup is overhead-limited; at coarse grain it is limited by the available parallelism and by load balance.)
Task Scheduling
- Static: at compile time
- Dynamic: the run-time system performs load balancing
  - clustering of tasks with heavy inter-processor communication
  - schedule with compiler assistance

Overhead
- Limits Sp to less than p with p processors
- Efficiency = Sp/p = T1/(Tp * p)
- Lee's equal-work hypothesis: Sp < p/ln(p)
- Task overhead is due to communication delays, context switching, and cold-cache effects
Multi-threaded MP
- Multiple processors share many execution units
  - each processor has its own architectural state
  - function units, caches, TLBs, etc. are shared
- Types:
  - time-multiplex multiple processors onto the pipeline so that there are no pipeline breaks, etc.
  - switch context on any processor delay (cache miss, etc.)
- Optimizes multi-thread throughput, but limits single-thread performance
- See Study 8.1 on p. 537: processors share the D-cache

Shared-Bus MP
- Processors with their own D-caches require a cache-coherency protocol.
- In the simplest protocols, processors snoop on writes to memory that occur on a shared bus.
- If a write hits a line in the snooping processor's own cache, that processor either invalidates or updates the line.
Coherency, Synchronization, and Consistency
- Coherency: the property that a read returns the value of the latest write. Required for process migration, even without data sharing.
- Synchronization: instructions that control access to critical sections of data shared by multiple processors.
- Consistency: rules for allowing memory references to be reordered in ways that may lead to differences in the memory state observed by multiple processors.

Shared-Bus Cache Coherency Protocols
- Write-invalidate, simple: 3 states (V, I, D)
- Berkeley (write-invalidate): 4 states (V, S, D, I)
- Illinois (write-invalidate): 4 states (M, E, S, I)
- Dragon (write-update): 5 states (M, E, S, D, I)
- The simpler protocols generate somewhat more memory-bus traffic.
MESI Protocol
(Figure: MESI state-transition diagram.)

Coherence Overhead for Parallel Processing
- Results for 4 parallel programs with 16 CPUs and 64KB caches
- Coherence traffic is a substantial portion of bus demand
- Large blocks can lead to false sharing
(Hennessy and Patterson, CAQA, Fig 8.15)
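The per-line MESI transitions can be sketched as a small state machine. This is a minimal model of the standard protocol only; a real snooping controller also drives bus signals (the shared line, write-back of dirty data), which are reduced here to function arguments and comments:

```c
#include <stdbool.h>

/* Per-cache-line MESI states: Modified, Exclusive, Shared, Invalid. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Local processor read. other_sharers reports whether any other cache
 * asserted "shared" when the miss appeared on the bus. */
mesi_t on_local_read(mesi_t s, bool other_sharers) {
    if (s == INVALID)
        return other_sharers ? SHARED : EXCLUSIVE;
    return s;  /* S, E, M: read hit, no state change */
}

/* Local processor write: end in Modified. From I or S this requires a
 * bus invalidate of other copies first; from E it upgrades silently. */
mesi_t on_local_write(mesi_t s) {
    (void)s;
    return MODIFIED;
}

/* Another processor's read seen on the bus: a dirty (M) line supplies
 * its data / is written back, and the line is demoted to Shared. */
mesi_t on_snoop_read(mesi_t s) {
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;
    return s;
}

/* Another processor's write or invalidate seen on the bus. */
mesi_t on_snoop_invalidate(mesi_t s) {
    (void)s;
    return INVALID;
}
```

For example, a line read with no other sharers enters Exclusive, so a later local write upgrades it to Modified without any bus traffic; this silent E-to-M upgrade is the advantage of MESI over the 3-state protocol listed above.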
Synchronization Primitives
Communicating sequential processes:

    Process A                      Process B
    acquire semaphore              acquire semaphore
    access shared data             access shared data
      (read/modify/write)            (read/modify/write)
    release semaphore              release semaphore

- Acquiring the semaphore generally requires an atomic read-modify-write operation on a location, to ensure that only one process enters the critical section.
- Examples: Test&Set, Locked-Exchange, Compare&Exchange, Fetch&Add, Load-Locked/Store-Conditional
- Looping on a semaphore with a test-and-set or similar instruction is called a spin lock.
- Techniques to minimize overhead under spin contention: Test + Test&Set, exponential backoff
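A spin lock and its Test + Test&Set refinement can be sketched with C11 atomics (a modern stand-in for the machine instructions listed above; `atomic_exchange` plays the role of Test&Set):

```c
#include <stdatomic.h>

typedef atomic_int spinlock_t;   /* 0 = free, 1 = held */

/* Plain test-and-set spin lock: every attempt is an atomic
 * read-modify-write, so every spin iteration generates bus traffic. */
void spin_lock_ts(spinlock_t *l) {
    while (atomic_exchange(l, 1))
        ;  /* got 1 back: lock was already held, keep spinning */
}

/* Test + Test&Set: spin on an ordinary read, which hits in the local
 * cache, and attempt the atomic exchange only when the lock looks free. */
void spin_lock_tts(spinlock_t *l) {
    for (;;) {
        while (atomic_load(l))
            ;                       /* read-only spin, no bus traffic */
        if (!atomic_exchange(l, 1)) /* looked free: try to grab it */
            return;
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(l, 0);
}
```

Exponential backoff would add a delay, doubling after each failed exchange, before retrying; it is omitted here for brevity.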
Memory Consistency Problem
Can the tests at L1 and L2 below both succeed?

    Process A             Process B
    A = 0;                B = 0;
    ...                   ...
    A = 1;                B = 1;
    L1: if (B==0) ...     L2: if (A==0) ...

- Memory consistency model: rules for allowing memory references made by a program executing on one processor to be observed in a different order by a program executing on another processor.
- Memory-fence operations explicitly control the ordering of memory references.

Memory Consistency Models (Part I)
- Sequential consistency (strong ordering)
  - All memory operations execute in some single sequential order.
  - The memory operations of each processor appear in program order.
- Processor consistency (Total Store Ordering)
  - Writes are buffered and performed in order.
  - Reads are performed in order, but can bypass buffered writes.
  - The processor flushes the store buffer when a synchronization instruction executes.
- Weak consistency
  - Memory references are generally allowed in any order.
  - Programs enforce ordering, when required for shared data, by executing memory-fence instructions:
    - all memory references of previous instructions complete before the fence
    - no memory references of subsequent instructions issue before the fence
  - Synchronization instructions act like fences.
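The L1/L2 example above can be run with C11 atomics, whose default ordering is sequentially consistent. Under sequential consistency the two tests can never both succeed; under a weaker model (plain stores on a machine with store buffers, or relaxed atomics) both could observe 0:

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_int A, B;              /* default ordering: sequentially consistent */
int saw_b_zero, saw_a_zero;

static void *proc_a(void *arg) {
    (void)arg;
    atomic_store(&A, 1);                  /* A = 1 */
    saw_b_zero = (atomic_load(&B) == 0);  /* L1: if (B == 0) */
    return 0;
}

static void *proc_b(void *arg) {
    (void)arg;
    atomic_store(&B, 1);                  /* B = 1 */
    saw_a_zero = (atomic_load(&A) == 0);  /* L2: if (A == 0) */
    return 0;
}

/* Returns 1 only if BOTH tests succeeded, which sequential
 * consistency forbids: whichever store happened first in the single
 * global order must be seen by the other processor's load. */
int run_once(void) {
    pthread_t ta, tb;
    atomic_store(&A, 0);
    atomic_store(&B, 0);
    pthread_create(&ta, 0, proc_a, 0);
    pthread_create(&tb, 0, proc_b, 0);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    return saw_b_zero && saw_a_zero;
}
```

Replacing the operations with `memory_order_relaxed` would permit `run_once` to return 1 on a weakly ordered machine, which is exactly the behavior the fence instructions described above exist to prevent.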
Memory Consistency Models (Part II)
- Release consistency
  - Distinguishes between acquire and release of a semaphore, before and after access to shared data.
  - Acquire semaphore: ensure the semaphore is acquired before any reads or writes by subsequent instructions (which may access the shared data).
  - Release semaphore: ensure that any writes by previous instructions (which may access the shared data) are visible before the semaphore is released.
(Hennessy and Patterson, CAQA, Fig 8.39)

Pentium Processor Example
- 2-level cache hierarchy
  - inclusion enforced, so snoops on the system bus need interrogate only the L2 cache
- Policy
  - write-back supported; write-through optional, selected per page or line
  - write buffers used
- Cache coherence: MESI at both levels
- Memory consistency: processor ordering
  - issues: writes that hit an E line on-chip; writes that hit an E or M line while a write buffer is occupied
(Figure: CPU with pipelines, data cache, and cache write buffer; L2 cache with its own write buffer; system bus.)
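The acquire/release distinction maps directly onto C11's `memory_order_acquire` and `memory_order_release`, which can serve as a concrete sketch of release consistency (here an atomic flag plays the role of the semaphore):

```c
#include <pthread.h>
#include <stdatomic.h>

int shared_data;     /* ordinary, non-atomic shared data */
atomic_int flag;     /* the "semaphore": 0 = not released, 1 = released */

static void *producer(void *arg) {
    (void)arg;
    shared_data = 42;   /* write shared data with no ordering of its own */
    /* Release: all previous writes become visible before the flag is set. */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return 0;
}

static int consume(void) {
    /* Acquire: no subsequent reads may be performed before the flag
     * is observed set, so shared_data is guaranteed to read as 42. */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;
    return shared_data;
}
```

With relaxed ordering on the flag, the consumer could observe the flag set yet still read a stale `shared_data`; the acquire/release pair is precisely the ordering the slide requires.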
Shared-Bus Performance Models
- Null binomial model
  - resubmissions don't automatically occur (e.g., a multithreaded MP)
  - see Study 8.1, page 537
- Resubmissions model
  - requests remain on the bus until serviced
  - see pp. 413-415 and the cache example posted on the web
- Bus traffic usually limits the number of processors
  - a bus optimized for MP supports 10-20 processors, but at high cost for small systems
  - a bus that incrementally extends a uniprocessor is limited to 2-4 processors