Transactional Memory Coherence and Consistency

Size: px

Start display at page:

Download "Transactional Memory Coherence and Consistency"

Brett Summers
6 years ago
Views:

1 Transactional emory Coherence and Consistency all transactions, all the time Lance Hammond, Vicky Wong, ike Chen, rian D. Carlstrom, ohn D. Davis, en Hertzberg, anohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun Stanford University une 21, 24

2 The Need for Parallelism Transactional Coherence & Consistency 2 otivation Uniprocessor system scaling is hitting limits Power consumption increasing dramatically Wire delays becoming a limiting factor Design and verification complexity is now overwhelming Exploits limited instruction-level parallelism (ILP) So we need support for multiprocessors Inherently avoid many of the design problems Replicate small cores, don t design big ones Exploit thread-level parallelism (TLP) ut can still use ILP within cores ut now we have new problems...

3 Parallel Software Problems Transactional Coherence & Consistency 3 otivation Parallel systems are often programmed with: Synchronization through barriers Shared variable access control through locks... Lock granularity and organization must balance performance and correctness Coarse-grain locking: Lock contention Fine-grain locking: Extra overhead ust be careful to avoid deadlocks or races ust be careful not to leave anything unprotected for correctness Performance tuning is not intuitive Performance bottlenecks are related to low level events Such as: false sharing, coherence misses, Feedback is often indirect (cache lines, not variables)

4 Parallel Hardware Complexity Transactional Coherence & Consistency 4 otivation Cache coherence protocols are complex ust track ownership of cache lines Difficult to implement and verify all corner cases Consistency protocols are complex ust provide rules to correctly order individual loads/stores Difficult for both hardware and software Current protocols rely on low latency, not bandwidth Critical short control messages on ownership transfers (2-3 hops) Latency of short messages unlikely to scale well in the future andwidth likely to scale much better High-speed inter-chip connections Chip multiprocessors = on-chip bandwidth!

5 The Key Question Transactional Coherence & Consistency 5 otivation Is there a shared memory model with: A simple programming model? A simple hardware implementation? Good performance?

6 TCC: Using Transactions Transactional Coherence & Consistency 6 Overview Yes! Provide generalized transactions Programmer-defined groups of instructions within a program egin Transaction Instruction #1 Instruction #2... End Transaction Start uffering Results Commit Results Now Can only commit machine state at the end of each transaction Each must update machine state atomically, all at once To other processors, all instructions within a transaction appear to execute only when the transaction commits These commits impose an order on how processors may modify machine state ust requires: Register checkpointing mechanism Transactional memory support...

7 Transactional emory I Transactional Coherence & Consistency 7 Overview Transactions appear to execute in the commit order RAW dependence errors cause transaction violation & restart Transaction A LOAD X Transaction Time STORE X Violation! LOAD X STORE X Commit X Re-execute with new data LOAD X STORE X

8 Transactional emory II Transactional Coherence & Consistency 8 Overview Antidependencies are automatically handled WAW are handled by writing buffers only in commit order WAR are handled by keeping all writes private until commit Transaction A Transaction A Time STORE X Commit X Local uffer 1 2 Shared emory Transaction STORE X Commit X Local uffer Time ST X = 1 LD LOAD X = X1 Commit X X = 1 X = 2 Any Local Stores Supply Data Later Transactions Started / Updated with Shared State Transaction ST X = 2 LD LOAD X = X2 LD LOAD X = X2 Commit X

9 TCC s Difference Transactional Coherence & Consistency 9 Overview So what? Transactional memory is old news.... Herlihy, et.al., proposed to replace locks a decade ago Rajwar and Goodman / artinez and Torrellas proposed more automated versions of the same thing recently Thread-level speculation (TLS) uses transactional memory TCC s New Idea: Leave transactions on all of the time Provides ANY new benefits Completely eliminates conventional cache coherence and consistency models

10 The TCC Cycle Transactional Coherence & Consistency 1 Overview Transactions now run in a cycle Continues for all execution Transaction Starts P P1 P2 Speculatively execute code and buffer Execute Code Wait for commit permission Phase provides synchronization, if necessary Arbitrate with other CPUs Commit stores together, as a block Provides a well-defined write ordering Can invalidate or update other caches Large block utilizes bandwidth effectively Transaction Completes Requests Commit Starts Commit Finishes Commit Wait for Phase Arbitrate Commit And repeat!

11 Advantages of TCC Transactional Coherence & Consistency 11 Overview Trades bandwidth for simplicity & latency tolerance Easier to build Not dependent on timing/latency of loads/stores Transactions eliminate locks Transactions are inherently atomic Catches most common parallel programming errors Shared memory consistency is simplified Conventional model sequences individual loads and stores Now only have hardware sequence transaction commits Shared memory coherence is simplified Processors may have copies of cache lines in any state (no ESI) Commit order implies an ownership sequence

12 How to Use TCC I Transactional Coherence & Consistency 12 Overview Divide code into potentially parallel tasks Usually loop iterations, after function calls, etc. For initial division, tasks = transactions ut can be subdivided up or grouped to match hardware limits (buffering) Similar to threading in conventional parallel programming, but: We do not have to verify parallelism in advance Locking is handled automatically Therefore, much easier to get a parallel program running correctly! Programmer then orders transactions as necessary Ordering techniques implemented using phase numbers Assign an age number to each transaction Deadlock-free (at least one transaction is always oldest ) Livelock-free (watchdog hardware can easily insert barriers anywhere) Three common scenarios...

13 How to Use TCC II Transactional Coherence & Consistency 13 Overview Unordered for purely parallel code Fully ordered to specify sequential tasks Partially ordered to insert synchronization like barriers CPU CPU 1 CPU 2 Time Unordered Transactions arrier = Phase Transition Parallelized Sequential Code Transactions

14 Sample TCC Hardware Transactional Coherence & Consistency 14 Overview Stores Only Processor Core Local Cache Hierarchy Loads and Stores Node # Write uffer Transaction Control its Read, odified, etc. L1 Cache Snooping from other nodes Commits to other nodes Commit Control Node : Node 1: Node 2: Phase roadcast us or Network Write buffers and some L1 cache bits (TLS-like) Write buffer in processor, before broadcast A broadcast bus or network to distribute commit packets All processors see the commits in a single order Snooping on broadcasts triggers violations, if necessary Commit arbitration/sequencing logic

15 Evaluation ethodology Transactional Coherence & Consistency 15 Results We simulated a wide range of applications: SPLASH-2, SPEC, ava, Spec Divided into transactions using a preliminary TCC API Trace-based analysis Generated execution traces from sequential execution Then analyzed the traces while varying: Number of processors Interconnect bandwidth Communication overheads Simplifications Results shown assume infinite caches and write-buffers ut we track the amount of state stored in them Fixed one cycle/instruction Would require a reasonable superscalar processor for this rate

16 Speedups with TCC Transactional Coherence & Consistency 16 Results Speedup Over 1 Processor TCC speedups are similar to conventional ones And sometimes better: SPECjbb eliminates locking overhead within warehouses HF ÑÉ ÇÅ HF ÑÉ ÇÅ H F Ñ É Ç Å HÑ FÉ ÇÅ art equake_l equake_s lu radix SPECjbb swim tomcatv water HÑ ÇÅ F É H ÑÇ Å F É Number of Processors ÇÅ HÑ F É Speedup Over 1 Processor H F Ñ É Ç Å â euler fft jbyte_5 jbyte_6 jbyte_8 jbyte_9 moldyn mtrt raytrace shallow Fâ Å Ç H É 8 F H ÑÇ Åâ Ñ Ñ É H FÑ ÉÇ Åâ HF ÑÉ ÇÅ â HF ÑÉ ÇÅ â Number of Processors Fâ Å Ç H É Explicitly Parallel Applications TLS-ava Applications

17 Write uffering Needs Transactional Coherence & Consistency 17 Results Only a few K of write buffering needed Set by the natural transaction sizes in applications Occasional overflow can be handled by committing early Reasonable for on-chip buffers 7 5 Write State in K (with 64 lines) % 5% 1% moldyn equake_s euler jbyte_9 raytrace jbyte_5 jbyte_6 shallow equake_l jbyte_8 art water-8p fft swim mtrt Applications tomcatv lu-8p radix-m-8p radix-xs-8p radix-s-8p SPECjbb-8P radix-l-8p

18 roadcast andwidth Transactional Coherence & Consistency 18 Results Another issue is broadcast bandwidth If data is sent with commit, to avoid broadcast saturation: Needs about 16 32p with whole modified lines Needs only about 8 32p with dirty data only High, but feasible on-chip ytes per Cycle Whole Lines Dirty Data Only radix (xl, l, m, s, xs) SPECjbb tomcatv fft

19 andwidth Sensitivity Transactional Coherence & Consistency 19 Results ost parallel applications are tolerant of limited W SPECjbb shows some server-code noise speedup variation Normalized Speedup â â HF ÑÉ ÇÅ â Ö7 à HF ÑÉ ÇÅ â Ö7 à HF ÑÉ ÇÖ 7à F H ÑÖ 7à Å ÉÇ Ñâ 7à Ö H art Ç É equake_l H equake_s Å F F lu Ñ radix_xl Å É radix_l Ç radix_m Å radix_s radix_xs â SPECjbb Ö swim Normalized Speedup H ÑÉ ÇÅ â H H Ñ ÇÅ â ÇÅ â H Ç H É Åâ ÇÅ ÑÉ â É É Ñ euler fft H F jbyte_5 jbyte_6 Ñ Ñ jbyte_8 É jbyte_9 Ç moldyn Å mtrt raytrace 7 tomcatv â shallow à water! andwidth Limit, bytes/cycle 8 procs! andwidth Limit, ytes/cycle

20 Snoop andwidth Transactional Coherence & Consistency 2 Results Snooping requirements are quite reasonable Significantly less than 1 address/cycle on most systems Address-only commits could reduce W requirements Only broadcast addresses for an invalidation-based protocol Send full packets only to memory Needs only about p Addresses per Cycle radix (xl, l, m, s, xs) SPECjbb tomcatv fft

21 Conclusions Transactional Coherence & Consistency 21 Conclusions TCC simplifies shared memory control hardware Trades higher interconnect bandwidth for simpler protocols Eliminates traditional ESI coherence protocols ost communication in large, less latency-sensitive packets Scaling trends favor these trade-offs in the future TCC eases parallel programming Transactions provide error tolerance and free locking Allows all-manual to nearly automated parallelization ore on this at ASPLOS-XI in October

22 TCC all transactions, all the time ore info at:

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency