EECS 470 Lecture 23: Multiprocessors. Prof. Thomas Wenisch. Portions: Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar.


Multiprocessors Fall 2007 Prof. Thomas Wenisch www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin. Slide 1

Announcements Course evaluations at end of lecture today. HW 6 posted, due 12/7. Project due 12/10. In-class presentations (~8 minutes + questions). Slide 2

Thread-Level Parallelism

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;

if (accts[id].bal >= amt) {     0: addi r1,accts,r3
    accts[id].bal -= amt;       1: ld 0(r3),r4
    spew_cash();                2: blt r4,r2,6
}                               3: sub r4,r2,r4
                                4: st r4,0(r3)
                                5: call spew_cash

Thread-level parallelism (TLP): a collection of asynchronous tasks, not started and stopped together, with data shared loosely and dynamically. Example: database/web server (each query is a thread). accts is shared, so it can't be register allocated even if it were scalar. id and amt are private variables, register allocated to r1 and r2. Slide 3
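To make the race concrete, here is a minimal runnable C sketch of the same withdrawal. The pthread driver, the MAX_ACCT value, and the spew_cash body are illustrative assumptions, not from the slides:

```c
/* Sketch: two threads race on the slide's withdrawal code.
 * MAX_ACCT, the pthread driver, and spew_cash() are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define MAX_ACCT 1024

struct acct_t { int bal; };
struct acct_t accts[MAX_ACCT];      /* shared: cannot live in a register */

void spew_cash(void) { puts("cash!"); }

void *withdraw(void *arg) {
    (void)arg;
    int id = 241, amt = 100;        /* private: register allocated (r1, r2) */
    if (accts[id].bal >= amt) {     /* ld + blt                             */
        accts[id].bal -= amt;       /* sub + st: not atomic with the load!  */
        spew_cash();                /* call                                 */
    }
    return NULL;
}

int main(void) {
    accts[241].bal = 500;
    pthread_t t0, t1;
    pthread_create(&t0, NULL, withdraw, NULL);
    pthread_create(&t1, NULL, withdraw, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Correct outcome is 300; an unlucky interleaving leaves 400. */
    printf("final balance: %d\n", accts[241].bal);
    return 0;
}
```

Run it in a loop and the lost-update outcome eventually appears; this is exactly the interleaving the upcoming coherence slides track.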

Shared-Memory Multiprocessors Shared memory: multiple execution contexts sharing a single address space. Multiple programs (MIMD), or more frequently multiple copies of one program (SPMD). Implicit (automatic) communication via loads and stores. [Figure: P1-P4 connected to a shared memory system] Slide 4

Why Shared Memory? Pluses: for applications, looks like a multitasking uniprocessor; for the OS, only evolutionary extensions required; easy to do communication without the OS; software can worry about correctness first, then performance. Minuses: proper synchronization is complex; communication is implicit, so harder to optimize; hardware designers must implement it. Result: traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever, and the first with multi-billion-dollar markets. Slide 5

In More Detail Efficient naming: virtual-to-physical mapping using TLBs; ability to name relevant portions of objects. Ease and efficiency of caching: caching is natural and well understood; can be done in HW automatically. Communication overhead: low, since protection is built into the memory system; easy for HW to packetize requests/replies. Integration of latency tolerance: demand driven (consistency models, prefetching, multithreading); can extend to push data to processors and use bulk transfer. Slide 6

Shared-Memory Multiprocessors Provide a shared memory abstraction. Familiar and efficient for programmers. [Figure: P1-P4, each with a cache and memory M1-M4 behind an interface, joined by an interconnection network] Slide 7

Paired vs. Separate Processor/Memory? Separate processor/memory: uniform memory access (UMA): equal latency to all memory. + Simple software: doesn't matter where you put data. - Lower peak performance. Bus-based UMAs common: symmetric multiprocessors (SMP). Paired processor/memory: non-uniform memory access (NUMA): faster to local memory. - More complex software: where you put data matters. + Higher peak performance, assuming proper data placement. [Figure: CPUs with caches on a shared bus to memory (UMA) vs. CPU+memory pairs joined by routers (NUMA)] Slide 8

Shared vs. Point-to-Point Networks Shared network, e.g., bus (left): + Low latency. - Low bandwidth: doesn't scale beyond ~16 processors. + The shared property simplifies cache coherence protocols (later). Point-to-point network, e.g., mesh or ring (right): - Longer latency: may need multiple hops to communicate. + Higher bandwidth: scales to 1000s of processors. - Cache coherence protocols are complex. [Figure: bus-connected CPU/memory nodes vs. a ring of CPU+memory+router nodes] Slide 9

Organizing Point-To-Point Networks Network topology: organization of the network. Trade off performance (connectivity, latency, bandwidth) against cost. Router chips: networks that require separate router chips are indirect; networks that use processor/memory/router packages are direct. + Fewer components, "glueless MP". Point-to-point network examples: indirect tree (left), direct mesh or ring (right). [Figure: tree of routers above CPU/memory nodes vs. a ring of CPU+memory+router nodes] Slide 10

Implementation #1: Snooping Bus MP Two basic implementations. Bus-based systems: typically small, 2-8 (maybe 16) processors. Typically processors split from memories (UMA). Sometimes multiple processors on a single chip (CMP). Symmetric multiprocessors (SMPs): common, I use one every day. [Figure: four CPUs with caches on a shared bus to memories] Slide 11

Implementation #2: Scalable MP General point-to-point network-based systems. Typically processor/memory/router blocks (NUMA). Glueless MP: no need for additional glue chips. Can be arbitrarily large: 1000s of processors. Massively parallel processors (MPPs): in reality only government (DoD) has MPPs; companies have much smaller systems, 32-64 processors: scalable multiprocessors. [Figure: four CPU+memory+router nodes in a point-to-point network] Slide 12

Issues for Shared Memory Systems Two in particular: cache coherence and the memory consistency model. Closely related to each other. Different solutions for SMPs and MPPs. Slide 13

An Example Execution Processor 0 and Processor 1 each run: 0: addi r1,accts,r3 / 1: ld 0(r3),r4 / 2: blt r4,r2,6 / 3: sub r4,r2,r4 / 4: st r4,0(r3) / 5: call spew_cash. Two $100 withdrawals from account #241 at two ATMs. Each transaction maps to a thread on a different processor. Track accts[241].bal (address is in r3) in CPU0's cache, CPU1's cache, and memory. Slide 14

No-Cache, No-Problem Scenario I: processors have no caches. Memory value after each access: 500 (initial) -> 500 (P0 ld) -> 400 (P0 st) -> 400 (P1 ld) -> 300 (P1 st). No problem: the final balance is 300, as expected. Slide 15

Cache Incoherence Scenario II: processors have write-back caches. Potentially 3 copies of accts[241].bal: memory, P0$, P1$.

  Step      P0$      P1$      Mem
  initial   --       --       500
  P0 ld     V:500    --       500
  P0 st     D:400    --       500
  P1 ld     D:400    V:500    500
  P1 st     D:400    D:400    500

Can get incoherent (inconsistent): P1 loads the stale 500 from memory, and one withdrawal is lost. Slide 16

Hardware Cache Coherence Absolute coherence: all copies have the same data at all times. Hard to implement and slow. + Not strictly necessary. Relative coherence: temporary incoherence is OK (e.g., write-back), as long as all loads get the right values. Coherence controller (CC): examines bus traffic (addresses and data) and executes the coherence protocol: what to do with the local copy when you see different things happening on the bus. [Figure: CPU above D$ tags and D$ data, with a coherence controller between the cache and the bus] Slide 17

Snooping Cache-Coherence Protocols The bus provides a serialization point. Each cache controller snoops all bus transactions; for relevant transactions (a block it contains), it takes action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol. Simultaneous operation of independent controllers. Slide 18

Snooping Design Choices The controller updates the state of blocks in response to processor and snoop events and generates bus transactions. Often have duplicate cache tags. Snoopy protocol: set of states, state transition diagram, actions. Basic choices: write-through vs. write-back; invalidate vs. update. [Figure: cache of State/Tag/Data entries, driven by processor ld/st on one side and snooped bus transactions on the other] Slide 19

The Simple Invalidate Snooping Protocol Write-through, no-write-allocate cache. Actions: PrRd, PrWr, BusRd, BusWr. Valid: PrRd / --; PrWr / BusWr; an observed BusWr moves the block to Invalid. Invalid: PrRd / BusRd moves to Valid; PrWr / BusWr stays Invalid. Slide 20
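As a sanity check on the diagram, here is a small C sketch of the same two-state transition function; the enum and function names are illustrative, not from the slides:

```c
/* Illustrative transition function for the two-state (Valid/Invalid)
 * write-through, no-write-allocate invalidate protocol. */
typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } vi_event_t;

/* Returns the next state; *issue is set to the bus transaction to
 * generate (BUS_RD or BUS_WR), or -1 for none. */
vi_state_t vi_next(vi_state_t s, vi_event_t e, int *issue) {
    *issue = -1;
    switch (e) {
    case PR_RD:                       /* read hit stays Valid; miss fetches  */
        if (s == INVALID) *issue = BUS_RD;
        return VALID;
    case PR_WR:                       /* write-through: every write on bus   */
        *issue = BUS_WR;
        return s;                     /* no-write-allocate: Invalid stays    */
    case BUS_WR:                      /* another cache wrote: drop our copy  */
        return INVALID;
    case BUS_RD:                      /* other readers don't affect us       */
    default:
        return s;
    }
}
```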

More Generally: MOESI [Sweazey & Smith, ISCA '86] M = Modified (dirty). O = Owned (dirty but shared) WHY? E = Exclusive (clean, unshared): only copy, not dirty. S = Shared. I = Invalid. Variants: MSI, MESI, MOSI, MOESI. [Figure: Venn diagram over the states: validity = {M, O, E, S}; ownership = {M, O}; exclusiveness = {M, E}] Slide 21
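The three overlapping properties in the Venn diagram can be written down directly as predicates; a minimal C sketch (enum and helper names are illustrative):

```c
/* Sketch: the three MOESI dimensions from the slide, as predicates. */
typedef enum { I, S, E, O, M } moesi_t;

int is_valid(moesi_t s)     { return s != I; }            /* validity: holds usable data   */
int is_owner(moesi_t s)     { return s == M || s == O; }  /* ownership: supplies data,     */
                                                          /* responsible for write-back    */
int is_exclusive(moesi_t s) { return s == M || s == E; }  /* exclusiveness: only copy, may */
                                                          /* write without a bus request   */
```

This also answers the slide's "WHY?" for O: it lets a dirty line be shared (served cache-to-cache) without first writing it back to memory.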

Base Snoopy Organization [Figure: cache data RAM with two copies of the tags and state, one for the processor-side controller and one (snoop state) for the bus-side controller; comparators on each side, a write-back buffer, and addr/cmd/data paths to the system bus] Slide 22

Non-Atomic State Transitions Operations involve multiple actions: look up cache tags, arbitrate for the bus, check for writeback. Even if the bus is atomic, the overall set of actions is not. Race conditions among multiple operations: suppose P1 and P2 attempt to write cached block A, and each decides to issue BusUpgr to allow S -> M. Issues: handle requests for other blocks while waiting to acquire the bus; must also handle requests for this block A. Slide 23

Non-Atomicity -> Transient States Two types of states: stable (e.g., MESI) and transient or intermediate. Increases complexity. [Figure: MESI diagram augmented with transient states: a PrWr in S or I issues BusReq and waits in S->M or I->M until BusGrant triggers BusUpgr or BusRdX; a PrRd in I issues BusReq and waits in I->S,E until BusGrant triggers BusRd; observed BusRd/BusRdX still force Flush or invalidation while waiting] Slide 24

Scalability Problems of Snoopy Coherence Prohibitive bus bandwidth: required bandwidth grows with the number of CPUs, but available BW per bus is fixed; adding busses makes serialization/ordering hard. Prohibitive processor snooping bandwidth: all caches do a tag lookup when ANY processor accesses memory; inclusion limits this to the L2, but that is still lots of lookups. Upshot: bus-based coherence doesn't scale beyond 8-16 CPUs. Slide 25

Scalable Cache Coherence Scalable cache coherence: a two-part solution. Part I: bus bandwidth. Replace the non-scalable bandwidth substrate (bus) with a scalable-bandwidth one (point-to-point network, e.g., mesh). Part II: processor snooping bandwidth. Interesting: most snoops result in no action. Replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (only spam processors that care). Slide 26

Directory Coherence Protocols Observe: the physical address space is statically partitioned. + Can easily determine which memory module holds a given line; that memory module is sometimes called the home. - Can't easily determine which processors have the line in their caches. Bus-based protocol: broadcast events to all processors/caches. ± Simple and fast, but non-scalable. Directories: a non-broadcast coherence protocol. Extend memory to track caching information. For each physical cache line whose home this is, track: Owner: which processor has a dirty copy (i.e., M state). Sharers: which processors have clean copies (i.e., S state). A processor sends coherence events to the home directory; the home directory only sends events to processors that care. Slide 27
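A directory entry per home line only needs the state plus the owner/sharers bookkeeping the slide lists. A minimal C sketch, assuming at most 64 processors and illustrative names:

```c
/* Sketch of a per-line directory entry at the line's home node.
 * Assumes <= 64 processors: bit i of sharers set => processor i
 * holds a clean copy. Names are illustrative. */
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

struct dir_entry {
    dir_state_t state;
    uint8_t     owner;    /* valid when state == DIR_MODIFIED (the M copy)   */
    uint64_t    sharers;  /* bitmask of processors with clean (S) copies     */
};

/* On a write request, the home sends invalidations only to set bits:
 * the "only spam processors that care" part of the protocol. */
static inline int needs_inval(const struct dir_entry *e, int p) {
    return (e->sharers >> p) & 1;
}
```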

Read Processing [Figure: Node #1 sends Read A (miss) to the directory; the directory responds with the data and records A: Shared, #1] Slide 28

Write Processing [Figure: Node #2 writes A; the directory, which had recorded A: Shared, #1 from Node #1's earlier read miss, invalidates #1's copy and records A: Mod., #2] Trade-offs: longer accesses (3-hop between processor, directory, and the other processor); lower bandwidth, since no snoops are necessary. Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP with > 32 nodes). Slide 29

Serialization and Ordering Let A and flag be 0.

P1:                  P2:
A += 5               while (flag == 0)
flag = 1             print A

Assume A and flag are in different cache blocks. What happens? How do you implement it correctly? Slide 30
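One portable answer to "how do you implement it correctly?" is to tag flag with release/acquire semantics so the store to A is ordered before the store to flag. A minimal C11 sketch; the pthread driver is an illustrative assumption, not part of the slide:

```c
/* Sketch: the slide's A/flag handoff, made correct on any memory
 * model by ordering the flag store/load explicitly. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

int A = 0;
atomic_int flag = 0;

void *p1(void *unused) {
    (void)unused;
    A += 5;
    atomic_store_explicit(&flag, 1, memory_order_release); /* A visible before flag */
    return NULL;
}

void *p2(void *unused) {
    (void)unused;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                  /* spin */
    printf("A = %d\n", A);                                 /* always prints 5 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```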

Coherence vs. Consistency Intuition says loads should return the latest value. What is "latest"? Coherence concerns only one memory location; consistency concerns the apparent ordering of all locations. A memory system is coherent if all operations to a given location can be serialized such that: operations performed by any processor appear in program order (program order = order defined by the program text or assembly code); and the value returned by a read is the value written by the last store to that location. Slide 31

Why Coherence != Consistency

/* initial A = B = flag = 0 */
P1:            P2:
A = 1;         while (flag == 0); /* spin */
B = 1;         print A;
flag = 1;      print B;

Intuition says it prints A = B = 1. Coherence doesn't say anything. Why? Your uniprocessor ordering mechanisms (ld/st queue) won't help. Why? Slide 32

Sequential Consistency (SC) Processors issue memory ops in program order. A switch, randomly set after each memory op, selects which processor accesses memory next. Provides a single sequential order among all operations. [Figure: P1-P3 funneled through a switch into a single memory] Slide 33

Sufficient Conditions for SC Every processor issues memory ops in program order. Memory ops happen (start and end) atomically: must wait for a store to complete before issuing the next memory op; after a load, the issuing processor waits for the load to complete before issuing the next op. Easily implemented with a shared bus. Slide 34

Relaxed Memory Models Motivation, with directory protocols: misses have longer latency (want to perform cache hits under a miss to get to the next miss); collecting acknowledgements can take even longer. Recall SC has: each processor generates a total order of its reads and writes (R->R, R->W, W->W, & W->R) that are interleaved into a global total order. Example relaxed models: PC: relax ordering from writes to (other processors') reads. RC: relax all read/write orderings (but add "fences"). Slide 35

Relax Write-to-Read Order

/* initial A = B = 0 */
P1:            P2:
A = 1;         B = 1;
r1 = B;        r2 = A;

Processor Consistent (PC) models allow r1 == r2 == 0 (precluded by SC). Examples: IBM 370, Sun TSO, & Intel IA-32. Why do this? Allows FIFO store buffers. Does not astonish programmers (too much). Slide 36
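This litmus test can be written in C11 directly. A sketch (thread driver omitted): with relaxed atomics the r1 == r2 == 0 outcome the slide describes is permitted, mirroring what PC/TSO allow, while memory_order_seq_cst on all four accesses would forbid it:

```c
/* Sketch of the slide's store-buffering litmus test in C11.
 * With relaxed ordering, r1 == r2 == 0 is a legal outcome;
 * memory_order_seq_cst everywhere restores the SC result. */
#include <stdatomic.h>

atomic_int A = 0, B = 0;
int r1, r2;   /* each written by exactly one thread */

void p1(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
}

void p2(void) {
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&A, memory_order_relaxed);
}
```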

Write Buffers w/ Read Bypass

P1:                        P2:
Flag1 = 1;                 Flag2 = 1;
if (Flag2 == 0)            if (Flag1 == 0)
    critical section           critical section

[Figure: with Flag1 = 0 and Flag2 = 0 in memory, each processor's write (t3, t4) sits in its FIFO store buffer while its read (t1, t2) bypasses to the shared bus first, so both reads return 0 and both processors enter the critical section] Slide 37

Why Not Relax All Order?

/* initially all 0 */
P1:            P2:
A = 1;         while (flag == 0); /* spin */
B = 1;         r1 = A;
flag = 1;      r2 = B;

Reordering of A = 1 / B = 1 or of r1 = A / r2 = B can happen via OoO processors, non-FIFO store buffers, delayed directory acks, etc. But we must order: A = 1 and B = 1 before flag = 1; and flag != 0 before r1 = A and r2 = B. Solution: the programmer indicates when ordering is needed. Slide 38
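"Programmer indicates when ordering is needed" is exactly what a fence is for. A minimal C11 sketch of this example using explicit fences (the driver is omitted; the fence placement is the point):

```c
/* Sketch: the slide's example under a relaxed model, with explicit
 * fences supplying exactly the two orderings that must hold. */
#include <stdatomic.h>

int A = 0, B = 0, r1, r2;
atomic_int flag = 0;

void producer(void) {
    A = 1;
    B = 1;
    atomic_thread_fence(memory_order_release);             /* A, B before flag */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                  /* spin */
    atomic_thread_fence(memory_order_acquire);             /* flag before A, B */
    r1 = A;
    r2 = B;                                                /* both guaranteed 1 */
}
```

All other accesses remain free to reorder, which is the performance argument for relaxed models.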

The Store Problem in MP Uniprocessors: never wait to write memory; the CPU puts St X into the store buffer and moves on to Ld Y. Multiprocessors: must wait, order matters! Producer: (1) St data, x; (2) St valid, 1. Consumer: spin until valid, then Ld data. How to know which stores must wait? 2007 Thomas Wenisch Slide 39

The Consistency Debate The programmer must choose: Strict memory ordering, e.g., Sequential Consistency (SC): intuitive, like a multitasked uniprocessor, but slow: wait for all stores. Relaxed memory ordering, e.g., Sun RMO: complex, SW enforces order, but fast: parallel / OoO accesses. Can we get intuitive and fast? 2007 Thomas Wenisch Slide 40

Observations Order only matters when CPUs race. Races are rare, so speculate! [Gniady 99] [von Praun 06]. Even under RMO, fences are often unnecessary: 5-10% performance loss in commercial apps. If an access sequence appears atomic, it is OK to reorder within the sequence. Enforce order over coarse-grain atomic sequences: strict memory order with RMO performance. 2007 Thomas Wenisch Slide 41

Store-Wait-Free Multiprocessors [ISCA '07] Atomic Sequence Ordering: practical CPU rollback via checkpoints; order is enforced over checkpointed instruction sequences. Scalable Store Buffer (SSB): unlike prior designs, which hold speculative values in an associatively searched buffer beside the L1, speculative and consistent values live in the on-chip memory hierarchy, with commit/rollback and no associative search. [Figure: CPU with checkpointed atomic sequences; prior design (store buffer before L1/L2) vs. SSB design (speculative and consistent state in L1/L2)] 2007 Thomas Wenisch Slide 42

Multiprocessors Are Here To Stay Moore's law is making the multiprocessor a commodity part: 500M transistors on a chip, what to do with all of them? Not enough ILP to justify a huge uniprocessor. Really big caches? t_hit increases, with diminishing %miss returns. Chip multiprocessors (CMPs): multiple full processors on a single chip. Example: IBM POWER4: two 1 GHz processors, 1MB L2, L3 tags. Multiprocessors are a huge part of computer architecture: the multiprocessor equivalent of H+P is 50% bigger than H+P. There is another entire course on multiprocessor architecture, hopefully every other year or so (not this year). Slide 43

Multiprocessing & Power Consumption Multiprocessing can be very power efficient. Recall: dynamic voltage and frequency scaling. Performance vs. power is NOT linear. Example: Intel's Xscale: 1 GHz -> 200 MHz reduces energy used by 30x. Impact of parallel execution: what if we used 5 Xscales at 200 MHz? Similar performance as a 1 GHz Xscale, but 1/6th the energy: 5 cores * 1/30th = 1/6th. Assumes parallel speedup (a difficult task). Remember Amdahl's law. Slide 44
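Writing the slide's arithmetic out (assuming perfect parallel speedup, so five 200 MHz cores match one 1 GHz core's throughput):

$$E_{5\times 200\,\mathrm{MHz}} \;=\; 5 \times \frac{E_{1\,\mathrm{GHz}}}{30} \;=\; \frac{E_{1\,\mathrm{GHz}}}{6}$$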

Shared Memory Summary Shared-memory multiprocessors: + Simple software: easy data sharing, handles both DLP and TLP. - Complex hardware: must provide the illusion of a global address space. Two basic implementations. Symmetric (UMA) multiprocessors (SMPs): underlying communication network is a bus (ordered). + Low latency, simple protocols that rely on global order. - Low bandwidth, poor scalability. Scalable (NUMA) multiprocessors (MPPs): underlying communication network is point-to-point (unordered). + Scalable bandwidth. - Higher latency, complex protocols. Slide 45

Shared Memory Summary Two aspects to the global memory space illusion. Coherence: a consistent view of individual cache lines. Absolute coherence is not needed; relative coherence is OK. VI and MSI protocols, cache-to-cache transfer optimization. Implementation? SMP: snooping; MPP: directories. Consistency: a consistent global view of all memory locations. Programmers intuitively expect sequential consistency (SC): a global interleaving of the individual processors' access streams. Not naturally provided by coherence; needs extra machinery. Relaxed ordering: consistency only at synchronization points. Slide 46