System Challenges and Opportunities for Transactional Memory

Size: px

Start display at page:

Download "System Challenges and Opportunities for Transactional Memory"

Eileen Stafford
5 years ago
Views:

1 System Challenges and Opportunities for Transactional Memory JaeWoong Chung Computer System Lab Stanford University

2 My thesis is about Computer system design that help leveraging hardware parallelism Transactional memory (TM) for easy parallel programming Contribution Challenges to building an efficient and practical TM system Opportunities to use TM beyond parallel programming 2

3 Multi Core Processors No more frequency race The era of multi cores Parallel programming is not easy Performance Split a sequential task into multiple sub tasks Pentium (1993) Pentium 4 (2000) Core Duo (2006) Year 3

4 Locking is hard to use Synchronize access to shared data Coarse-grain locking Easy to program The other task is blocked Fine-grain locking High concurrency Hard to use Dead lock, priority inversion, High locking overhead : Object : Reference Task1 Task Object reference graph (e.g. Java and C++) 4

5 Transactional Memory

6 Transactional Memory Atomic and isolated execution of instructions Atomicity : All or nothing Isolation : No intermediate results Programmer A transaction encloses instructions logically sequential execution of transactions TM system TX_Begin TX_Begin // Instructions // Instructions // for Task1 // for Task2 TX_End TX_End Transactions are executed in parallel without conflict If conflict, one of them is aborted and restarts 6

7 TM Example R R Tx1 Tx W 5 6 W R Tx 1 : R 13 Tx 2 : R 2 W W 5 6 Data versioning At TX_Begin,, save register values At write, save old memory values Conflict detection Read-set and write-set per transaction Conflict detection with set comparison 7

8 TM Benefits Logically sequential execution of transactions Optimistic concurrency control for parallel transaction execution No dead lock, priority inversion, and convoying TM system handles pathological cases Composability Error Recovery 8

9 TM System Design Many proposals in hardware and software Hardware acceleration for TM is crucial for performance HTM is 2 ~3 times faster than STM Correctness : strong isolation Hardware TM In the beginning register checkpoint At memory access Set read/write bits per cache line Buffer new values in cache or log old values Conflict detection With cache coherence protocol With transaction validation protocol 9

Regs 0 Conflict 0 Load Load 1 5 Memory Tx2 Core 2 ADDR : DATA : R :

10 Hardware TM Example TM hardware Tx1 Core 1 ADDR : DATA : R : W 1 XXX XXX 10 3 XXX XXX 0 5 XXX XXX 01 L1 cache BUS Regs 0 Conflict 0 Load Load 1 5 Memory Tx2 Core 2 ADDR : DATA : R : W 0 L1 cache Regs TM programs R R Tx W 5 R Tx2 6 W 10 R

11 Challenges and Opportunities How to build efficient TM system tuned for common case? How to build practical TM system to deal with uncommon case? Can we use TM to support system software? Can we use TM to improve other important system metrics? 11

12 Contributions Challenges to building TM systems Common case behavior of parallel programs Extract architectural parameters for efficient TM system design TM virtualization Overcome the limitation of TM hardware Opportunity for system beyond parallel programming Multithreading for dynamic binary translation Guarantee correctness of DBT Support for reliability, security, and fast memory snapshot Improve important system metrics other than performance 12

13 Outline Software parallelization : a major issue for performance Transactional memory Challenges to building TM systems Common case behavior of parallel programs TM virtualization Opportunities for systems beyond parallel programming Multithreading for dynamic binary translation Support for reliability, security, and fast memory snapshot Conclusion 13

14 Challenges to Building TM Systems

15 Goal Challenge 1 : Common Case Behavior of Parallel Programs Understand the common case behavior of TM programs Few TM programs available More TM programs now but for research purpose Few efficient TM systems as development tool chicken & egg problem 15

16 Inferring Transactions in Multithread Programs Analyze existing parallel programs Assumption : the inherent parallelism remains regardless of programming tools Mapping programming primitives to transactions Programming primitive Lock/Unlock Parallel_For Transaction primitive Begin/End Begin/End 16

17 Parallel Applications Different domains and different language Languages Java Pthread ANL OpenMP Applications MolDyn, MonteCarlo, RayTracer,, Crypt, LUFact, Series, SOR, SparseMatmult,, SPECjbb2000, PMD, HSQLDB Apache, Kingate,, Bp-vision, Localize, Ultra Tic Tac Toe, MPEG2, AOL Server Barnes, Mp3d, Ocean, Radix, FMM, Cholesky, Radiosity,, FFT, Volrend,, Water-N2, Water-Spatial APPLU, Equake,, Art, Swim, CG, BT, IS 17

18 Key Metrics of TM Programs Measure the key metrics of TM programs Use the metrics to make suggestions for TM designs Key metrics Transaction length Read-/write-set size Write-set to length ratio Frequency of nesting & I/O in transactions Architectural parameters TM primitive overhead Buffer size Transaction commit/abort overhead Support for nesting and system calls 18

19 Transaction Length Number of instructions executed in transaction Length in Instructions Application Avg 50th % 95th % Max Java average Pthreads average ANL average Observation : Up to 95% of transactions < 5000 instructions Suggestion : Light-weight transactional primitives Observation : Rare but long transactions Suggesion : Transaction over context-switching 19

20 Read-/Write /Write-Set Size Bytes of data read/written by transaction Read Set Size in Kbytes th ANL Java Pthreads 80th Write Set Size in Kbytes th ANL Java Pthreads 80th Percentile of Transactions Percentile of Transactions Observation : 98% transactions <16KB read-set and <6KB write set Suggestion : 32K L1 cache is sufficient Observation : Few very large transaction > 32K Suggestion : Need for buffer space virtualization 20

21 Outline Software parallelization : a major issue for performance Transactional memory Challenges to building TM systems Common case behavior of parallel programs TM virtualization Opportunities for systems beyond parallel programming Multithreading for dynamic binary translation Support for reliability, security, and fast memory snapshot Conclusion 21

22 Problem Challenge 2 : TM Virtualization Limited hardware resources tuned for common cases E.g. buffer size for 99% transactions How do we cover uncommon cases as well? Cache as buffer for transactional data What if cache capacity is exhausted? => space virtualization What if a transaction is interrupted? Time virtualization What if transactions are nested deeply? Depth virtualization 22

23 Goals XTM: extended TM Virtualize TM space, time, and depth at low HW cost Completely transparent to user SW Minimize interference with coexisting HW transactions Assumption Overflows, interrupts, and deep nesting are rare Approach Transactional data and metadata in virtual memory Using virtual memory support in OS Data versioning & conflict detection at page granularity Similar to page-based software DSM systems 23

24 XTM Overview Basic operation On HTM overflow, rollback and restart in SW mode At the first access, create a copy of original (master) page Change the address mapping to the copy (private page) Transactional data in private page, committed data in master page At commit, make the private page the new master page All orchestrated by the operating system (no HW) Conflict detection Use TLB shoot-downs to gain exclusive page access Hardware requirement Overflow exception 24

25 XTM Example Timeline Page table pointer Master page table Master page HTM Overflow Xtn Write Xtn Read Commit Per-Tx Page table For Level 0 XTM Metadata Private Copy R W 25

26 Hardware Acceleration XTM-g Gradual page-by by-page switching Reduce the switch overhead from hardware mode to software mode A portion of transactional data in private pages, the rest in the cache Hardware requirement : OV(overflow) ) bit in page table XTM-e Additional buffer for overflowed read/write bits Reduce false conflict at page granularity Hardware requirement : Eviction buffer 26

27 Time Virtualization Interrupt and context-switch procedure Interrupt Naïve Approach Can wait? No Young transction? No Switch to software mode Yes Yes Wait for short transaction to finish Abort young transaction Rare case 27

28 Performance Analysis Normalized Execution Time Versioning Validation Commit Violations Idle Useful 0 XTM XTM-g XTM-e VTM XTM XTM-g XTM-e VTM XTM XTM-g XTM-e VTM XTM XTM-g XTM-e VTM XTM XTM-g XTM-e VTM XTM XTM-g XTM-e VTM tomcatv [37.7%] volrend [0.01%] radix [0.26%] mic ro- P10 [39.2%] mic ro- P20 [60.3%] mic ro- P30 [60.8%] XTM causes no cost for applications without overflow XTM-g g presents a good cost/performance tradeoff point 20% faster to 50% slower than a fully-hardware solution 28

29 Outline Software parallelization : a major issue for performance Transactional memory Challenges to building TM systems Common case behavior of parallel programs TM virtualization Opportunities for systems beyond parallel programming Multithreading for dynamic binary translation Support for reliability, security, and fast memory snapshot Conclusion 29

30 Opportunities for Systems beyond Parallel Programming

31 DBT Opportunity 1 : Dynamic Binary Translation Binary code is translated in run-time PIN, Valgrind, DynamoRIO, StarDBT,, etc Original Binary DBT Tool DBT Framework Translated Binary DBT use cases Translation on new target architecture JIT optimizations in virtual machines Binary instrumentation Profiling, security, debugging, 31

32 Example: Dynamic Information Flow Tracking (DIFT) t = XX ; // untrusted data from network taint(t) = 1;. Variables swap t, u1; swap taint(t), taint(u1); Taint bits u2 = u1; taint(u2) = taint(u1); t u1 u2 Track untrusted data A taint bit per memory byte Security policy uses the taint bit. E.g. no syscall with untrusted data 32

33 Problem : DBT with Parallel Program Thread 1 swap t, u1; swap taint(t), taint(u1); Thread2 u2 = u1; taint(u2) = taint(u1); t u1 u2 Variables Taint bits XX 1 XX Security Breach!! Atomicity between original and instrumented instructions for correctness 33

34 How to Guarantee Atomicity? Easy but unsatisfactory solutions No multithreaded programs (StarDBT( StarDBT) Serialization (Valgrind( Valgrind) Hard solution : Locking Idea : Enclose original and instrumented instruction with lock Fine-grained locks locking overhead, convoying, limited scope of DBT optimizations Coarse-grained locks performance degradation Lock nesting between app & DBT locks potential deadlock Tool developers should be feature + multithreading experts 34

35 Idea Transactional Memory for Correctness of DBT Original and instrumented instructions in a transaction Thread 1 TX_Begin swap t, u1; swap taint(t), taint(u1); TX_End Thread2 TX_Begin u2 = u1; taint(u2) = taint(u1); TX_End Advantages Atomic execution High performance through optimistic concurrency Support for nested transactions 35

36 Granularity of Transaction Per instruction : short Instrumentation High overhead of executing TX_Begin and TX_End Limited scope for DBT optimizations Per basic block : long Amortizing the TX_Begin and TX_End overhead Easy to match TX_Begin and TX_End Per trace : longer Further amortization of the overhead Potentially high transaction conflict Profile-based sizing : dynamic Optimize transaction size based on transaction abort ratio 36

37 Baseline Performance Results Normalized Overhead (% ) 60% 50% 40% 30% 20% 10% 0% Barnes Equake Fmm Radiosity Radix Swim Tomcatv Water Waterspatial 8 CPUs Software TM and DIFT on PIN 41 % overhead on the average Transaction at the DBT trace granularity 37

38 Hardware Acceleration Normailized Overhead (%) 60% 50% 40% 30% 20% 10% 0% B arnes E quake F MMR adios ityr adix S wim Tomc atvwater Waterspatial SW Only Register Checkpoint Reg. + HW Signature Full HW Overhead reduction 28% with register checkpoint 12% with register checkpoint + hardware signature 6% with full hardware TM 38

39 Outline Software parallelization : a major issue for performance Transactional memory Challenges to building TM systems Common case behavior of parallel programs TM virtualization Opportunities for systems beyond parallel programming Multithreading for dynamic binary translation Support for reliability, security, and fast memory snapshot Conclusion 39

40 Opportunity 2 : Improving Other System Metrics TM hardware consists of Fine-grain data versioning HW Fine-grain access tracking HW Fast exception handlers Can use such HW for other purposes Reliability, Security, The benefits for SW Finer granularity (compared to VM-based approach) User-level event handling (compared to VM-based approach) No instrumentation overhead (compared to DBT-based approach) Simplified code (compared to DBT-based approach) 40

41 Outline for TM Hardware Application Reliability Global & local checkpoints (data versioning) Security Fine-grain read/write barriers (address tracking) Isolated execution (data versioning) Memory snapshot (data versioning) Concurrent garbage collector Dynamic memory profiler 41

42 Memory Snapshot Snapshot Read-only image Multiple regions Shared by multiple threads mutator Read-only Snapshot Memory collector Applications Service threads that analyze memory in parallel with app threads Garbage collection, memory profiling (heap & stack), mutator collector < Garbage Collection > 42

43 TM Hardware Snapshot Feature correspondence TM metadata track data written since snapshot TM versioning storage for progressive snapshot Including virtualization mechanism TM conflict detection catch errors Writes to read-only snapshot Differences & additions Data versioning for single thread Vs. multiple thread Table to record snapshot regions Resulting snapshot system Fast : O(# CPUs) Scan (create) and O(1) write/read Small memory footprint : O(# memory locations written) 43

44 GC Overhead Parallel GC: stop app threads & run GC threads 20% to 30% overhead for memory intensive apps Snapshot GC GC is essentially free Fast : Stop app, take snapshot, then run GC & app concurrently Simple : +100 lines over parallel GC by Boehm Fundamentally simpler than any other concurrent GC 44

45 Conclusion Challenges to building TM systems Common case behavior of parallel programs Extract architectural parameters for efficient TM system design TM virtualization Overcome the limitation of TM hardware Opportunity for system beyond parallel programming Multithreading for dynamic binary translation Fix correctness issue for DBT Support for reliability, security, and fast memory snapshot Improve important system metrics other than performance 45

46 Acknowledgement KyungHae,, wife Parents, brother, in-laws Prof. Kozyrakis,, advisor Prof. Olukotun,, associate advisor Prof. Garcia-Molina Prof. Saraswat TCC group mates and research colleagues Korean mafia 46

The Common Case Transactional Behavior of Multithreaded Programs

The Common Case Transactional Behavior of Multithreaded Programs JaeWoong Chung Hassan Chafi,, Chi Cao Minh, Austen McDonald, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Computer Systems Lab