Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling)

Size: px

Start display at page:

Download "Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling)"

Milton Robbins
5 years ago
Views:

A Gentler, Kinder Guide to the Multi-core Galaxy Prof. Hsien-Hsin S.

Yalamanchili Reality Check Conventional processor designs run out of steam Complexity (verification) Power (thermal)

Moore happy (Moore s Law) Architects menace: kick the ball to the other side of the court?

1 A Gentler, Kinder Guide to the Multi-core Galaxy Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Guest lecture for ECE4100/6100 for Prof. Yalamanchili Reality Check Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling) Unanimous direction Multi-core Simple cores (massive number) Keep Wire communication on leash Gordon Moore happy (Moore s Law) Architects menace: kick the ball to the other side of the court? What do you (or your customers) want? Performance (and/or availability) Throughput > latency (turnaround time) Total cost of ownership (performance per dollar) Energy (performance per watt) Reliability and dependability, SPAM/spy free 2

Multi-core Processor Gala 3 Intel s Multicore Roadmap Desktop processors DC 2/4MB DC 4MB DC

DC 3 MB/6 MB shared (45nm) DC 2/4MB shared Enterprise processors DC 16MB DC 4MB DC 2MB 8C

2006 2007 2008 To extend Moore s Law To delay the ultimate limit of physics By 2010 all Intel

2 Multi-core Processor Gala 3 Intel s Multicore Roadmap Desktop processors DC 2/4MB DC 4MB DC 2/4MB shared DC 3MB /6MB shared (45nm) 8C 12MB shared (45nm) Mobile processors DC 2MB SC 1MB DC 3 MB/6 MB shared (45nm) DC 2/4MB shared Enterprise processors DC 16MB DC 4MB DC 2MB 8C 12MB shared (45nm) QC 8/16MB shared QC 4MB SC 512KB/ 1/ 2MB To extend Moore s Law To delay the ultimate limit of physics By 2010 all Intel processors delivered will be multicore Intel s 80-core processor Source: Adapted from Tom s Hardware 4

3 Is a Multi-core really better off? Well, it is hard to say in Computing World 5 Is a Multi-core really better off? DEEP BLUE 480 chess chips Can evaluate 200,000,000 moves per second!! 6

Computing Paradigm Evolution Unused Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Execution Time FU1 FU2 FU3 FU4 Conventional Superscalar Single Threaded Fine-grained Coarse-grained Multithreading

4 Computing Paradigm Evolution Unused Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Execution Time FU1 FU2 FU3 FU4 Conventional Superscalar Single Threaded Fine-grained Coarse-grained Multithreading Multithreading (cycle-by-cycle (Block Interleaving) Interleaving) Why SMT is failing? Simultaneous Chip Multithreading Multiprocessor (or Intel s HT) (CMP) or Multi-Core Processors 7 Major Challenges for Multi-Core Designs Communication Memory hierarchy Data allocation (you have a large shared L2/L3 now) Interconnection network Scalability Bus Bandwidth, how to get there? Power-Performance Win or lose? Borkar s multicore arguments 15% per core performance drop 50% power saving Giant, single core wastes power when task is small How about leakage? Process variation and yield Programming Model 8

Stations, Issue ports, Schedulers etc Source: Intel Corp.

5 Intel Core 2 Duo Homogeneous cores Bus based on chip interconnect Shared on-die Cache Memory Traditional I/O Classic OOO: Reservation Stations, Issue ports, Schedulers etc Source: Intel Corp. Large, shared set associative, prefetch, etc. 9 Core 2 Duo Microarchitecture 10

6 Why Sharing on-die L2? What happens when L2 is too large? 11 Intel Core 2 Duo (Merom) 12

7 Core TM µarch Wide Dynamic Execution 13 Core TM µarch Wide Dynamic Execution 14

Core TM µarch MACRO Fusion Common Intel 32 (once used to be called x86 or IA-32) instruction pairs are combined 15 Micro(-ops) Fusion (from Pentium M) A misnomer.

fuse Store address and store data µops Load-and-op µops (e.g.

8 Core TM µarch MACRO Fusion Common Intel 32 (once used to be called x86 or IA-32) instruction pairs are combined 15 Micro(-ops) Fusion (from Pentium M) A misnomer.. Instead of breaking up an Intel32 instruction into µop, they decide not to break it up A better naming scheme would call the previous techniques IA32 fission To fuse Store address and store data µops Load-and-op µops (e.g. ADD (%esp), %eax) Extend each RS entry to take 3 operands To reduce micro-ops (10% reduction in the OOO logic) Decoder bandwidth (simple decoder can decode fusion type instruction) Energy consumption Performance improved by 5% for INT and 9% for FP (Pentium M data) 16

Smart Memory Access 17 Sun UltraSparc T1 Eight cores,

thread-selection logic Round-robin cycle-by-cycle 1.

16K 4-way 32B L1-I 8K 4-way 16B L1-D Blocking cache

9 Smart Memory Access 17 Sun UltraSparc T1 Eight cores, each 4-way threaded Fine-grained multithreading a thread-selection logic Round-robin cycle-by-cycle 1.2 GHz (90nm) No OOO, 8 instructions per cycle Cache 16K 4-way 32B L1-I 8K 4-way 16B L1-D Blocking cache (reason for MT) 4-banked 12-way 3MB L2 + 4 memory controllers. Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput (200GB/s) 18

opposed to 6 for T1) 16 instructions per cycle One PCI Express port (x8 1.

10 Sun UltraSparc T2 A fatter version of T1 1.4GHz (65nm) 8 threads per core, 8 cores on-die 1 FPU per core (1 FPU per die in T1), 16 INT EU (8 in T1) L2 increased to 8-banked 16-way 4MB shared 8 stage integer pipeline ( as opposed to 6 for T1) 16 instructions per cycle One PCI Express port (x8 1.0) Two 10 Gigabit Ethernet ports with packet classification and filtering Eight encryption engines Four dual-channel FBDIMM memory controllers 711 signal I/O,1831 total 19 STI Cell Broadband Engine Heterogeneous! 64-bit PowerPC Eight SPEs In-order, Dual-issue 128-bit SIMD 128x128b RF 256KB LS Fast Local SRAM Globally coherent DMA (128B/cycle) 128+ concurrent transactions to memory per core High bandwidth EIB (96B/cycle) 20

11 Cell Chip Block Diagram Synergistic Memory flow controller 21 Non-Uniform Cache Architecture ASPLOS 2002 proposed by UT-Austin Facts Large shared on-die L2 Wire-delay dominating on-die cache 3 cycles 1MB 180nm, cycles 4MB 90nm, cycles 16MB 50nm,

12 Multi-banked L2 cache Bank=128KB 11 cycles 130nm Bank Access time = 3 cycles Interconnect delay = 8 cycles 23 Multi-banked L2 cache Bank=64KB 47 cycles 50nm Bank Access time = 3 cycles Interconnect delay = 44 cycles 24

13 Static NUCA-1 Sub-bank Bank Data Bus Address Bus Tag Array Predecoder Sense amplifier Wordline driver and decoder Use private per-bank channel Each bank has its distinct access latency Statically decide data location for its given address Average access latency =34.2 cycles Wire overhead = 20.9% an issue 25 Static NUCA-2 Bank Switch Tag Array Data bus Predecoder Wordline driver and decoder Use a 2D switched network to alleviate wire area overhead Average access latency =24.2 cycles Wire overhead = 5.9% 26

14 Dynamic NUCA Data can dynamically migrate Move frequently used cache lines closer to CPU 27 Dynamic NUCA way 0 bank 8 bank sets one set way 1 way 2 way 3 Simple Mapping All 4 ways of each bank set needs to be searched Farther bank sets longer access 28

15 Dynamic NUCA way 0 bank 8 bank sets one set way 1 way 2 way 3 Fair Mapping Average access time across all bank sets are equal 29 Dynamic NUCA way 0 way 1 way 2 way 3 bank 8 bank sets Shared Mapping Sharing the closet banks for farther banks 30

16 Transactional Memory Parallel Software Problems Parallel systems are often programmed with Synchronization through barriers Shared objects access control through locks Lock granularity and organization must balance performance and correctness Coarse-grain locking: Lock contention Fine-grain locking: Extra overhead Must be careful to avoid deadlocks or data races Must be careful not to leave anything unprotected for correctness Performance tuning is not intuitive Performance bottlenecks are related to low level events E.g. false sharing, coherence misses Feedback is often indirect (cache lines, rather than variables) 32

17 Parallel Hardware Complexity (TCC s view) Cache coherence protocols are complex Must track ownership of cache lines Difficult to implement and verify all corner cases Consistency protocols are complex Must provide rules to correctly order individual loads/stores Difficult for both hardware and software Current protocols rely on low latency, not bandwidth Critical short control messages on ownership transfers Latency of short messages unlikely to scale well in the future Bandwidth is likely to scale much better High speed interchip connections Multicore (CMP) = on-chip bandwidth 33 What do we want? A shared memory system with A simple, easy programming model (unlike message passing) A simple, low-complexity hardware implementation (unlike shared memory) Good performance 34

18 Lock Freedom Why lock is bad? Common problems in conventional locking mechanisms in concurrent systems Priority inversion: When low-priority process is preempted while holding a lock needed by a high-priority process Convoying: When a process holding a lock is de-scheduled (e.g. page fault, no more quantum), no forward progress for other processes capable of running Deadlock (or Livelock): Processes attempt to lock the same set of objects in different orders (could be bugs by programmers) Error-prone 35 Using Transactions What is a transaction? A sequence of instructions that is guaranteed to execute and complete only as an atomic unit Begin Transaction Inst #1 Inst #2 Inst #3 End Transaction End Transaction Satisfy the following properties Serializability: Transactions appear to execute serially. Atomicity (or Failure-Atomicity): A transaction either commits changes when complete, visible to all; or aborts, discarding changes (will retry again) 36

19 TCC (Stanford) [ISCA 2004] Transactional Coherence and Consistency Programmer-defined groups of instructions within a program Begin Transaction Inst #1 Inst #2 Inst #3 End Transaction Start Buffering Results Commit Results Now Only commit machine state at the end of each transaction Each must update machine state atomically, all at once To other processors, all instructions within one transaction appear to execute only when the transaction commits These commits impose an order on how processors may modify machine state 37 Transaction Code Example MIT LTM instruction set xstart: XBEGIN on_abort lw r1, 0(r2) addi r1, r1, on_abort: // back off j xstart // retry 38

20 Transactional Memory Transactions appear to execute in commit order Flow (RAW) dependency cause transaction violation and restart Transaction A Time ld 0xdddd... st 0xbeef Arbitrate Commit 0xbeef Transaction B ld 0xdddd... ld 0xbbbb Arbitrate 0xbeef Transaction C ld 0xbeef Violation! Commit ld 0xbeef Re-execute execute with new data 39 Transactional Memory Output and Anti-dependencies are automatically handled WAW are handled by writing buffers only in commit order (think about sequential consistency) Transaction A Store X Commit X Local buffer Shared Memory Transaction B Store X Commit X Local buffer 40

21 Transactional Memory Output and Anti-dependencies are automatically handled WAW are handled by writing buffers only in commit order WAR are handled by keeping all writes private until commit Transaction A Store X Commit X Local buffer Shared Memory Transaction B Store X Commit X Local buffer Transaction A ST X = 1 LD X (=1) Commit X X = 1 X = 3 Local stores supply data Transaction B ST X = 3 LD X (=3) LD X (=3) Commit X 41 TCC System Similar to prior thread-level speculation (TLS) techniques CMU Stampede Stanford Hydra Wisconsin Multiscalar UIUC speculative multithreading CMP Loosely coupled TLS system Completely eliminates conventional cache coherence and consistency models No MESI-style cache coherence protocol But require new hardware support 42

The TCC Cycle Transactions run in a cycle Speculatively execute code and buffer Wait for commit permission Phase provides synchronization, if necessary Arbitrate with other processors Commit stores

22 The TCC Cycle Transactions run in a cycle Speculatively execute code and buffer Wait for commit permission Phase provides synchronization, if necessary Arbitrate with other processors Commit stores together (as a packet) Provides a well-defined write ordering Can invalidate or update other caches Large packet utilizes bandwidth effectively And repeat 43 Advantages of TCC Trades bandwidth for simplicity and latency tolerance Easier to build Not dependent on timing/latency of loads and stores Transactions eliminate locks Transactions are inherently atomic Catches most common parallel programming errors Shared memory consistency is simplified Conventional model sequences individual loads and stores Now only have hardware sequence transaction commits Shared memory coherence is simplified Processors may have copies of cache lines in any state (no MESI!) Commit order implies an ownership sequence 44

How to Use TCC Divide code into potentially parallel tasks Usually loop iterations For initial division, tasks = transactions But can be subdivided up or grouped to match HW limits (buffering)

23 How to Use TCC Divide code into potentially parallel tasks Usually loop iterations For initial division, tasks = transactions But can be subdivided up or grouped to match HW limits (buffering) Similar to threading in conventional parallel programming, but: We do not have to verify parallelism in advance Locking is handled automatically Easier to get parallel programs running correctly Programmer then orders transactions as necessary Ordering techniques implemented using phase number Deadlock-free (At least one transaction is the oldest one) Livelock-free (watchdog HW can easily insert barriers anywhere) 45 How to Use TCC Three common ordering scenarios Unordered for purely parallel tasks Fully ordered to specify sequential task (algorithm level) Partially ordered to insert synchronization like barriers 46

24 Basic TCC Transaction Control Bits In each local cache Read bits (per cache line, or per word to eliminate false sharing) Set on speculative loads Snooped by a committing transaction (writes by other CPU) Modified bits (per cache line) Set on speculative stores Indicate what to rollback if a violation is detected Different from dirty bit Renamed bits (optional) At word or byte granularity To indicate local updates (WAR) that do not cause a violation Subsequent reads that read lines with these bits set, they do NOT set read bits because local WAR is not considered a violation 47 During A Transaction Commit Need to collect all of the modified caches together into a commit packet Potential solutions A separate write buffer, or An address buffer maintaining a list of the line tags to be committed Size? Broadcast all writes out as one single (large) packet to the rest of the system 48

Re-execute A Transaction Rollback is needed when a transaction cannot commit Checkpoints needed prior to a transaction Checkpoint memory Use local cache Overflow issue Conflict or capacity misses

25 Re-execute A Transaction Rollback is needed when a transaction cannot commit Checkpoints needed prior to a transaction Checkpoint memory Use local cache Overflow issue Conflict or capacity misses require all the victim lines to be kept somewhere (e.g. victim cache) Checkpoint register state Hardware approach: Flash-copying rename table / arch register file Software approach: extra instruction overheads 49 Sample TCC Hardware Write buffers and L1 Transaction Control Bits Write buffer in processor, before broadcast A broadcast bus or network to distribute commit packets All processors see the commits in a single order Snooping on broadcasts triggers violations, if necessary Commit arbitration/sequence logic 50

26 Ideal Speedups with TCC equake_l : long transactions equake_s : short transactions 51 Speculative Write Buffer Needs Only a few KB of write buffering needed Set by the natural transaction sizes in applications Small write buffer can capture 90% of modified state Infrequent overflow can be always handled by committing early 52

Broadcast Bandwidth Broadcast is bursty Average bandwidth Needs ~16 bytes/cycle @ 32 processors with whole modified lines Needs ~8

27 Broadcast Bandwidth Broadcast is bursty Average bandwidth Needs ~16 32 processors with whole modified lines Needs ~8 32 processors with dirty data only High, but feasible on-chip 53 TCC vs MESI [PACT 2005] Application, Protocol + Processor count 54

28 Implementation of MIT s LTM [HPCA 05] Transactional Memory should support transactions of arbitrary size and duration LTM Large Transactional Memory No change in cache coherence protocol Abort when a memory conflict is detected Use coherency protocol to check conflicts Abort (younger) transactions during conflict resolution to guarantee forward progress For potential rollback Checkpoint rename table and physical registers Use local cache for all speculative memory operations Use shared L2 (or low level memory) for non-speculative data storage 55 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, Saved Set {P1, } During instruction decode: Maintain rename table and saved bits in physical registers Saved bits track registers mentioned in current rename table Constant # of set bits: every time a register is added to saved set we also remove one 56

29 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } When XBEGIN is decoded Snapshots taken of current rename table and S bits This snapshot is not active until XBEGIN retires 57 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } 58

30 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } 59 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } Active snapshot When XBEGIN retires Snapshots taken at decode become active, which will prevent P1 from reuse 1 st transaction queued to become active in memory To abort, we just restore the active snapshot s rename table 60

31 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, R1 P3, Saved Set {P1, } {P2, } {P3, } Active snapshot We are only reserving registers in the active set This implies that exactly # of arch registers are saved This number is strictly limited, even as we speculatively execute through multiple transactions 61 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, R1 P3, Saved Set {P1, } {P2, } {P3, } Active snapshot Normally, P1 would be freed here Since it is in the active snapshot s saved set, we place it onto the register reserved list 62

32 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P2, R1 P3, Saved Set {P2, } {P3, } When retires: Reserved physical registers (e.g. P1) are freed, and active snapshot is cleared Store queue is empty 63 Multiple In-Flight Transactions retire Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P2, Saved Set {P2, } Active snapshot Second transaction becomes active in memory 64

33 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Need to keep Current (speculative) values Rollback values Common case is commit, so keep Current in cache Problem: uncommitted current values do not fit in local cache Solution Overflow hashtable as extension of cache 65 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 T bit per cache line Set if accessed during a transaction O bit per cache set Indicate set overflow Overflow storage in physical DRAM Allocate and resize by the OS Search when miss : complexity of a page table walk If a line is found, swapped with a line in the set 66

34 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Start with non-transactional data in the cache ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Transactional read sets the T bit ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1,

35 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Expect most transactional writes fit in the cache ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 A conflict miss Overflow sets O bit Replacement taken place (LRU) Old data spilled to DRAM (hashtable) 70

36 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Miss to an overflowed line, checks overflow table If found, swap (like a victim cache) Else, proceed as miss 71 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data L2 Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Abort Invalidate all lines with T set (assume L2 or lower level memory contains original values) Discard overflow hashtable Clear O and T bits Commit Write back hashtable; NACK interventions during this Clear O and T bits in the cache 72

LTM vs. Lock-based 73 Further Readings M. Herlihy and J. E. B. Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures, ISCA 1993. R. Rajwar and J. R. Goodman, Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution, MICRO 2001 R.

37 LTM vs. Lock-based 73 Further Readings M. Herlihy and J. E. B. Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures, ISCA R. Rajwar and J. R. Goodman, Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution, MICRO 2001 R. Rajwar and J. R. Goodman, Transactional Lock-Free Execution of Lock-Based Programs, ASPLOS 2002 J. F. Martinez and J. Torrellas, Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications, ASPLOS 2002 L. Hammond, V. Wong, M. Chen, B. D. Calrstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukoton Transactional Memory Coherence and Consistency, ISCA 2004 C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, S. Lie, Unbounded Transactional Memory, HPCA 2005 A. McDonald, J. Chung, H. Chaf, C. C. Minh, B. D. Calrstrom, L. Hammond, C. Kozyrakis and K. Olukotun, Characterization of TCC on a Chip- Multiprocessors, PACT

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency