FlexTM. Flexible Decoupled Transactional Memory Support. Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott Department of Computer Science
1 FlexTM: Flexible Decoupled Transactional Memory Support
Arrvindh Shriraman, Sandhya Dwarkadas, Michael L. Scott
Department of Computer Science
2 Transactions: Our Goal
- Lazy Txs (i.e., optimistic conflict resolution): more concurrency
- SW coordinates conflict management
  - when (i.e., eagerly or lazily)
  - how (i.e., stalling, who aborts)
- Limitless Txs
  - Large: cache victimization and paging
  - Long: thread switches
5 Flexible Transactional Memory
[Figure: execution-time bars split into Versioning (Isolation), Validation (Consistency check), Bookkeeping (Metadata ops.), and Application (Useful Work)]
- STM (e.g., RSTM): all-software approach
- RTM [ISCA 07]
  - new cache states help bounded txs
  - software handles large & long txs
- FlexTM [this paper]
  - Good performance: no per-location software metadata
  - Simple hardware: no bulk arbiters like lazy HTMs
  - Allows software policy
6 Decoupled Hardware Primitives (1/2)
Separate, interchangeable basic hardware ops. that can be coordinated by software
Why?
- Minimizes hardware state
  - small footprint, simplifies virtualization
  - reduces development time
- Software accessible
  - to build transactions & fine-tune policy decisions
  - to repurpose hardware for non-tx applications
7 Decoupled Hardware Primitives (2/2)
1. Data Isolation (delaying visibility of stores)
   - caches buffer speculative values, provide fast commit
   - SW allocates overflow region & HW performs access
2. Access Summary (tracking locations accessed)
   - maintains list of locations read & written
   - checked on coherence messages or local memory ops.
3. Conflict Summary (tracking data conflict events)
   - tracks conflict occurrence and type between processors
4. Alert-On-Update
   - monitors cache blocks and triggers handlers
8 Outline
- Preview
- Data Isolation (aka Lazy Versioning)
  - Lazy coherence
  - Overflow Table
- Conflict Management
- FlexTM Software
- Evaluation
- Summary
9 Lazy Coherence (1/2): Approach
- Lazy coherence
  - permit multiple readers & writers for a cache block
  - restore coherence for multiple lines simultaneously
- Current research (e.g., TCC, Bulk)
  - bulk arbiters, bulk GetXs, bulk ops. on directory
- Our approach: eager messages but lazy coherence
  - look out for sharer conflicts in standard coherence msgs.
  - continue caching data, but use T-MESI states
  - simple bit-clear ops. convert T-MESI to MESI
  - no bulk messages or address ops.
10 Lazy Coherence (2/2): Protocol
- Two new T-tagged states: TMI (T+M) and TI (T+I)
- TStores & TLoads denote speculative operations
  - ISA can include instructions, or SW can tell HW the regions
[Figure: T-MESI state diagram with the MESI states, TMI, and TI; edge labels: TLoad / ~Threat, TStore, TLoad / Threat, Commit, Abort]
- TMI buffers TStores
  - allows multiple writers and readers: no data response, but threaten
  - on commit, T+M => M
  - on abort, T+M => T+I => I
- TI caches threatened TLoads
  - caches a remotely TStored block
  - on commit/abort, T+I => I
+ cached locations are accessed directly
+ bounded txs perform in-place update
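The commit and abort rules on this slide can be sketched as a small state function. This is a minimal illustration, not the protocol itself: the names `State`, `tstore`, `tload`, `commit`, and `abort` are hypothetical, messages and data movement are not modeled, and the choice of S for an unthreatened TLoad that misses is an assumption.

```python
from enum import Enum

class State(Enum):
    M = "M"
    E = "E"
    S = "S"
    I = "I"
    TMI = "TMI"  # T+M: buffers speculative TStore data
    TI = "TI"    # T+I: caches a threatened TLoad

def tstore(state):
    """A TStore leaves the line in TMI, whatever state it started in."""
    return State.TMI

def tload(state, threatened):
    """A threatened TLoad is cached in TI; otherwise the line stays in MESI."""
    if state is State.TMI:
        return State.TMI              # reading our own speculative write
    if threatened:
        return State.TI
    # Assumption: an unthreatened miss is filled in S.
    return State.S if state is State.I else state

def commit(state):
    """On commit: T+M => M and T+I => I."""
    if state is State.TMI:
        return State.M
    if state is State.TI:
        return State.I
    return state

def abort(state):
    """On abort: T+M => T+I => I and T+I => I (speculative data is dropped)."""
    if state in (State.TMI, State.TI):
        return State.I
    return state
```

Because both outcomes only drop the T tag (and, on abort, the data), a set of lines can change state together without per-line messages, which is the point of the bit-clear approach described on the previous slide.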
11 Overflow Table
- Challenge: where to put evicted TMI lines?
- Solution: per-thread hash table (in virtual memory)
- Hardware controller
  - fills table with TMI lines evicted from cache
  - removes table entries when reloaded into cache
  - performs look-aside transparently on L1 miss, in parallel with L2
[Figure: Overflow-Table controller handling a TMI writeback / L1 miss for address 80; configuration registers (Sets, Ways, Base = 100) and overflow signature OSig = {80}; L2 holds the current values, the per-thread Overflow Table (tags, data) holds the new values]
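The controller's three duties (fill on eviction, look-aside on L1 miss, remove on reload) can be sketched as follows. This is a toy model under stated assumptions: `OverflowTable`, `evict_tmi`, and `lookaside` are hypothetical names, the overflow signature is an exact Python set rather than a Bloom filter, and a full set simply raises an error instead of involving software.

```python
class OverflowTable:
    """Per-thread table of evicted TMI lines, guarded by an overflow signature."""

    def __init__(self, sets=4, ways=2):
        self.sets, self.ways = sets, ways
        self.table = [dict() for _ in range(sets)]  # each set maps addr -> data
        self.osig = set()  # exact stand-in for the OSig filter

    def evict_tmi(self, addr, data):
        """Called when a TMI line leaves the L1: keep its speculative data."""
        entries = self.table[addr % self.sets]
        if addr not in entries and len(entries) >= self.ways:
            raise MemoryError("set full; resizing is outside this sketch")
        entries[addr] = data
        self.osig.add(addr)

    def lookaside(self, addr):
        """Called on an L1 miss: return and remove the entry, or None."""
        if addr not in self.osig:
            return None                      # fast path: the table is not probed
        return self.table[addr % self.sets].pop(addr, None)
```

For example, after `ot.evict_tmi(80, "new")`, a miss on address 80 gets `"new"` from `ot.lookaside(80)` and the entry is removed, while a miss on any address absent from the signature skips the table.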
12 Outline
- Preview
- Data Isolation (aka Lazy Versioning)
- Conflict Management (flexible)
  - Access summary signatures
  - Conflict table
  - Alert-On-Update
- FlexTM Software
- Evaluation
- Summary
13 Access Summary (1/2): Signatures
- Signatures [Bulk ISCA 06, LogTM-SE HPCA 07, SigTM ISCA 07]
  - Bloom filters to represent an unbounded set of cache blocks
  - approximate representation with false positives
[Figure: a cache block address feeds several hash functions (hash1, hash2, ...)]
- Processor has two signatures: Rsig (Wsig) summarizes locations TLoaded (TStored)
- Conflict detection
  - signatures snoop coherence messages
  - responder detects conflict and overloads response
  - requester picks response and resolves or notes conflict
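A Bloom-filter signature of the kind described here can be sketched in a few lines. This is an illustration only: the class and function names are hypothetical, SHA-256 stands in for the hardware hash functions, and the 2048-bit size with two hashes is an assumed configuration.

```python
import hashlib

class Signature:
    """Bloom filter over cache-block addresses: no false negatives, some false positives."""

    def __init__(self, bits=2048, hashes=2):
        self.bits, self.hashes = bits, hashes
        self.field = 0  # bit vector stored in a Python int

    def _positions(self, block_addr):
        # One bit position per hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{block_addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def insert(self, block_addr):
        for p in self._positions(block_addr):
            self.field |= 1 << p

    def member(self, block_addr):
        return all((self.field >> p) & 1 for p in self._positions(block_addr))

    def clear(self):
        self.field = 0

def remote_request_conflicts(rsig, wsig, block_addr, is_write):
    """A remote write conflicts with local reads or writes; a remote read only with local writes."""
    if wsig.member(block_addr):
        return True
    return is_write and rsig.member(block_addr)
```

A hit may be a false positive, so a reported conflict is conservative; a miss is always exact, which is what makes the filter safe for conflict detection.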
14 Access Summary (2/2): Virtualization [details in paper]
- Required to handle long-running txs & tx pauses
- Challenge: how to detect conflicts with suspended txs?
- Solution: read and write summary signatures at the directory (note: does not affect cache-hit critical path)
- Details
  - merge suspended tx's signature with summary sig.
  - all L1 cache misses test signatures
    - if miss, no further action necessary
    - if hit, trap to software routine that mimics conflict HW
15 Conflict Tables: Tracking Conflicts
- Current HTMs detect and resolve at the same time
  - eager HTM systems perform both on a conflict
  - lazy HTM systems perform both at commit time
- Our approach: decouple detection from resolution
  - HW bitmaps record conflict event & expose to SW
  - SW decides when and how to resolve conflicts
- Per-core conflict bitmaps (core P's tables), Ncore bits each
  - R-W: P's read, remote write
  - W-W: P's write, remote write
  - W-R: P's write, remote read
  - bit i answers "is there a conflict between P and core i?": Yes (1) / No (0)
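The per-core bitmaps can be sketched as three integers, one per pairing listed on the slide. This is a minimal model with hypothetical names (`ConflictTables`, `record`, `cores_set`); the field names `r_w`, `w_w`, and `w_r` are assumed labels for "P's read, remote write", "P's write, remote write", and "P's write, remote read".

```python
class ConflictTables:
    """Three Ncore-bit bitmaps for core P; bit i set means a conflict with core i."""

    def __init__(self, ncores):
        self.ncores = ncores
        self.r_w = 0  # P's read  vs. remote write
        self.w_w = 0  # P's write vs. remote write
        self.w_r = 0  # P's write vs. remote read

    def record(self, kind, core):
        """Hardware side: note a conflict of the given kind with a remote core."""
        if kind not in ("r_w", "w_w", "w_r"):
            raise ValueError("unknown conflict kind")
        if not 0 <= core < self.ncores:
            raise ValueError("core id out of range")
        setattr(self, kind, getattr(self, kind) | (1 << core))

    def has_conflict(self, core):
        """Software side: is there any recorded conflict between P and this core?"""
        return bool(((self.r_w | self.w_w | self.w_r) >> core) & 1)

    def cores_set(self, *kinds):
        """Software side: list the cores flagged in the chosen bitmaps."""
        mask = 0
        for kind in kinds:
            mask |= getattr(self, kind)
        return [i for i in range(self.ncores) if (mask >> i) & 1]
```

Recording is separate from any decision: software can read `cores_set(...)` immediately for an eager policy or only at commit for a lazy one.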
16-18 Conflict Tables: Operation
4-core machine. Initially C0 has Wsig: {} and Rsig: {}, C1 has Wsig: {A} and Rsig: {}, and the L2 directory lists A : M@C1.
- C0 issues TStore A
- Messages shown: TGETX, (2) Fwd_TGETX, (2) Data, (3) Threat, (4) ACK_INV
- Afterwards: C0 Wsig: {A}, C1 Wsig: {A}, directory A : M@C1,C0
- Either processor can resolve conflict prior to commit
- If eager, requester resolves conflict immediately
- Conflicter known, no central arbiter required
19 Alert-On-Update (AOU) [ISCA 07]
- Vector specific coherence or update events to the processor in the form of a lightweight event/interrupt
  - on invalidation (capacity eviction or coherence)
  - on access/update (local event)
[Figure: ALoad/ARelease operate on an alert bit (A) kept with the tag and data of a cache line; a remote store or eviction of a marked line invokes the handler]
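The invalidation case of alert-on-update can be sketched as a callback on marked lines. This is a toy model: `AlertCache` and its methods are hypothetical names, the handler is an ordinary Python callable, and the local access/update case is left out.

```python
class AlertCache:
    """Toy cache that tracks alert-marked lines and a registered handler."""

    def __init__(self, handler):
        self.handler = handler   # called as handler(addr, cause)
        self.marked = set()

    def aload(self, addr):
        self.marked.add(addr)        # ALoad: load the line and set its alert bit

    def arelease(self, addr):
        self.marked.discard(addr)    # ARelease: stop monitoring the line

    def remote_store(self, addr):
        self._invalidate(addr, "remote store")

    def evict(self, addr):
        self._invalidate(addr, "eviction")

    def _invalidate(self, addr, cause):
        # Only marked lines raise an alert, and only once per marking.
        if addr in self.marked:
            self.marked.discard(addr)
            self.handler(addr, cause)
```

With a handler that appends to a list, `aload(0x100)` followed by two remote stores to `0x100` produces exactly one alert, and a line released with `arelease` produces none.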
20 Outline
- Preview
- Data Isolation (aka Lazy Versioning)
- Conflict Management (flexible)
- FlexTM Software
  - FlexTM Transaction Example
- Evaluation
- Summary
21 FlexTM Transaction (1/2)
- Per-Tx descriptor
  - TSW: active / committed / aborted
  - State: running / suspended
  - CMPC: handler for conflict table events
  - AbortPC: handler for AOU events on TSW
- FlexTM deploys
  - Signatures for detecting and notifying conflicts
  - Conflict Tables for tracking and managing conflicts
  - T-MESI for in-cache buffering and OT for cache overflows
  - AOU for propagating abort events to remote txs
- FlexTM software
  - checkpoints registers at Begin_Tx
  - manages conflicts; aborts remote tx by changing TSW
  - controls commit protocol routine
22-30 Lazy Transactions: Example
T1 runs on C0 and T2 on C1. Each L1 starts with Wsig: {} and Rsig: {}.
1. T1: Begin_Tx abort_pc1. T2: Begin_Tx abort_pc2.
2. T1: ALD TSW0. T2: ALD TSW1. Each L1 holds its own TSW in state AE; directory: TSW0 : M@C0, TSW1 : M@C1.
3. T1: TSt A. C0 holds A in TMI, Wsig: {A}; directory: A : M@C0.
4. T1: TSt B. C0 holds B in TMI, Wsig: {A,B}; directory: B : M@C0.
5. T2: TSt A. C1 holds A in TMI, Wsig: {A}; a conflict bit (1) is set on both C0 and C1; directory: A : M@C0,C1.
6. T2: TSt B. C1 holds B in TMI, Wsig: {A,B}; directory: B : M@C0,C1.
7. T1 runs the conflict & commit protocol: for each i set in W-R or W-W, CAS(Status[i], ACT, ABORT). The protocol runs in software and is decentralized; its overhead is minimal and depends on the no. of conflicting Txs.
8. The CAS brings TSW1 into C0 in state M (directory: TSW1 : M@C0), which triggers the conflict handler in T2.
9. T1: CAS-Commit Status[id]. On C0, A and B go from TMI to M, and TSW0 goes from AE to M.
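The software commit step in this example (abort each conflicting transaction with a CAS on its status word, then CAS one's own status to committed) can be sketched as below. This is a simplified model under stated assumptions: `StatusWord` and `commit` are hypothetical names, a lock stands in for hardware compare-and-swap, the bitmap passed in is assumed to be the OR of the committer's W-R and W-W tables, and the retry that real hardware would force when new conflicts arrive during commit is omitted.

```python
import threading

ACTIVE, COMMITTED, ABORTED = "active", "committed", "aborted"

class StatusWord:
    """Transaction status word with a lock-based stand-in for hardware CAS."""

    def __init__(self):
        self.value = ACTIVE
        self._lock = threading.Lock()

    def cas(self, expected, new):
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def commit(my_id, status, conflict_bits):
    """Abort every transaction flagged in conflict_bits, then try to commit my_id.

    conflict_bits: assumed OR of the committer's W-R and W-W bitmaps.
    Returns True if my_id committed, False if it had already been aborted.
    """
    for i in range(len(status)):
        if (conflict_bits >> i) & 1 and i != my_id:
            status[i].cas(ACTIVE, ABORTED)   # no effect if i already finished
    return status[my_id].cas(ACTIVE, COMMITTED)
```

With two transactions that each flag the other, whichever calls `commit` first wins: its CAS aborts the peer, and the peer's later `commit` fails on its own status word. No arbiter is involved, and the work done is one CAS per conflicting transaction plus one for the committer.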
31 Outline
- Preview
- Data Isolation (aka Lazy Versioning)
- Conflict Management (flexible)
- FlexTM Software
- Evaluation
  - Speedup
  - Conflict resolution tradeoffs
  - Other results
- Summary
32 Evaluation set-up
- Full-system simulation, GEMS/SIMICS framework
  - 16-core CMP with shared L2
  - ORIGIN 2000-like coherence protocol (3-hop requests and silent evictions)
- Workloads
  - Data structures: Hash, RBTree, LFUCache, Graph
  - Applications: Scott's Delaunay, STAMP*, STMBench7
- Runtime systems
  - CGL, FlexTM (HTM interface), RTM-F, RSTM, & TL2
  - Polka conflict manager
* STAMP does not (yet?) interface with RTM-F and RSTM
33 FlexTM is Fast (1/2)
[Figure: normalized throughput at 16 threads (CGL, 1 thread = 1) for FlexTM, RTM-F, and RSTM on HashTable, RBTree, Delaunay, and STMBench7; legible bar labels: 1.9X, 1.8X]
- FlexTM gains over RTM-F are proportional to SW bookkeeping overheads
  - software metadata management is ~50% of tx latency
- FlexTM gains over RSTM are comparable to those of rigid-policy HTMs
34 FlexTM is Fast (2/2)
[Figure: normalized throughput at 16 threads (CGL, 1 thread = 1) for FlexTM and TL2 on Vacation-H, Vacation-L, Kmeans-L, Bayes, and Genome (H: high contention, L: low contention); bar labels: 3.8X, 4.1X, 1.4X, 1.9X, 1.5X]
- Kmeans-L and Genome performance gains are lower
  - TL2 per-access overheads are low (i.e., high instructions / mem_access)
- Performance gains in Vacation are higher
  - lower number of instructions per memory word accessed
35 Lazy mode aids progress
[Figure: normalized throughput (1 thread = 1) vs. no. of threads for Eager and Lazy on RBTree and Graph]
- Lazy provides more commits
  - exploits R-W sharing, allows reader & writers to commit in parallel
- Eager causes cascaded stalls and aborts
- Lazy narrows conflict window
36 Mixed-mode can be better
[Figure: STMBench7 normalized throughput (1 thread = 1) vs. no. of threads for Eager, Lazy, and EagerWW-LazyRW]
- Long writer (~1ms) mixed with short readers (tens of thousands of cycles)
- Pair-wise conflicts between writers, conflicts with multiple readers
- Eager doesn't permit R-W sharing and reduces reader throughput
- Lazy permits sharing, but wastes writer work on aborts
- Best policy: Eager-WW with Lazy-RW
37 Other Results
- Area analysis [in paper]
  - increase in core area small: OoO (0.6%), InO (3%)
  - minimal change to pipeline, most hardware on L1 miss
- Comparison with Central-Arbiter HTM [in paper]
  - broadcasts and central arbiters are overkill
  - decentralized SW commit is efficient & important
- Non-Tx applications: Watchpoints [in TR-925]
  - two memory monitoring primitives, AOU & Signatures
  - SW framework for detecting buffer overflows, memory leaks, etc.
  - X speedup over binary instrumentation
38-39 Summary
- Decouple TM hardware components to
  - reduce HW complexity
  - enable deployment for varied purposes
- FlexTM
  - HW manages TM operations, SW manages policy
  - decentralized conflict and commit protocol in SW
- Conflict management
  - laziness is an important design requirement
  - provides best value when left under software control
Questions?
Acknowledgments
- Multifacet Research group, Wisconsin
- STAMP group, Stanford
- Transaction Benchmark group, EPFL
- Shan Lu, Opera group, Illinois
42-46 FlexTM per-core Hardware
[Figure: per-core hardware additions, built up over five slides]
- Processor: context registers and control regs., plus handler PC and flag register
- Read & Write Access Summary: read signature and write signature
- Conflict Tables: R-W, W-R (Ncore bits)
- Alert-On-Update: A bit kept with tag and data in the L1 data cache
- Data Isolation: T bit in the L1 data cache; Overflow Table Controller used on L1 miss, with ASI, base address, overflow sig., hash param., overflow count, and C/A
47 FlexTM Area Complexity
                      Core2      Power6     Niagara2
Orig. core area       32 mm^2    53 mm^2    12 mm^2
L1 area               1.8 mm^2   2.6 mm^2   0.4 mm^2
Signatures (2Kbit)    (illegible) 0.12%     2.1%
Overflow control      0.5%       0.45%      0.3%
% L1D area inc.       0.35%      0.3%       3.9%
% core area inc.      0.61%      0.58%      2.5%
- Effect on the processor core is minimal: OoO cores (~0.6%), in-order (~4%)
- Negligible effect on L1 latency: small area effects, data array is the critical path
- Signature effects noticeable only on Niagara2: 8-way SMT needs 16 2Kbit signatures (4KB state)
48 [Figure: normalized throughput, Hash Table; series: CST, Serial, Parallel]
49 [Figure: normalized throughput, RandomGraph; series: CST, Serial, Parallel]
50 FlexWatcher: Memory Bug Detection
- FlexTM HW provides two HW primitives for watching memory
  - AOU precisely monitors cache-block-aligned regions but is limited by cache size
  - Signatures provide unlimited monitoring but are vulnerable to false positives
- Extended the ISA to support them as first-class entities
  - insert, member, read-index, activate, clear, etc.
- Developed a software bug detection tool
  - add required addresses to signatures
  - HW checks local & remote accesses against the signatures
  - triggers SW trampoline on signature hits
  - handler disambiguates; on a false positive, returns to execution
51 FlexWatcher Evaluation
- BugBench from Illinois, a set of real-life programs with known bugs
- Bugs detected
  - Buffer overflow. Solution: pad all heap-allocated buffers with 64 bytes, watch padded locations
  - Memory leak. Solution: monitor all heap-allocated objects and update the address's timestamp on access
  - Invariant violation. Solution: ALoad the cache line of the variable of interest X; in the AOU handler, assert program-specific invariants
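The buffer-overflow scheme (pad each heap buffer with 64 bytes, watch the padding, and let a software handler sort true hits from false positives) can be sketched as follows. This is a toy model, not the tool itself: `Watcher`, `malloc`, and `access` are hypothetical names, memory is a flat range handed out by a bump allocator, and a coarse one-bit-per-64-byte-block bitmap stands in for the hardware signature so that aliasing is easy to see.

```python
PAD = 64  # bytes of padding after each heap buffer

class Watcher:
    def __init__(self, sig_bits=64):
        self.sig_bits = sig_bits
        self.sig = 0          # coarse signature: one bit per 64-byte block (may alias)
        self.exact = set()    # software's precise record of watched byte addresses
        self.next_free = 0    # toy bump allocator
        self.reports = []     # addresses of detected overflows

    def _bit(self, addr):
        return (addr // PAD) % self.sig_bits

    def malloc(self, size):
        """Allocate size bytes followed by PAD watched bytes; return the base address."""
        base = self.next_free
        for addr in range(base + size, base + size + PAD):
            self.exact.add(addr)
            self.sig |= 1 << self._bit(addr)
        self.next_free = base + size + PAD
        return base

    def access(self, addr):
        """Signature check first; the software handler disambiguates on a hit."""
        if not (self.sig >> self._bit(addr)) & 1:
            return "ok"                  # signature miss: no trap
        if addr in self.exact:
            self.reports.append(addr)
            return "overflow"            # true hit inside the padding
        return "false positive"          # alias: resume execution
```

For a 100-byte buffer at address 0, byte 10 misses the signature, byte 100 is a real overflow into the padding, and byte 70 is a legal access that shares a 64-byte block with the padding, so it traps and is dismissed by the handler.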
52 FlexWatcher Performance
Compared against Discover, a popular SPARC binary instrumentation tool
Benchmark  Bug  FlexWatcher  Discover
BC         BO   1.5X         75X
GZIP       BO   1.15X        17X
GZIP 2     IV   1.05X        N/A
Man        BO   1.80X        65X
Squid      ML   2.50X        N/A
(BO: buffer overflow, IV: invariant violation, ML: memory leak)
- Execution time normalized to sequential thread performance
- FlexWatcher overheads were estimated on the simulator
- Discover overheads were estimated on a Sun T1000 server
Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software
More informationHardware Support For Serializable Transactions: A Study of Feasibility and Performance
Hardware Support For Serializable Transactions: A Study of Feasibility and Performance Utku Aydonat Tarek S. Abdelrahman Edward S. Rogers Sr. Department of Electrical and Computer Engineering University
More informationCPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner
CPS104 Computer Organization and Programming Lecture 16: Virtual Memory Robert Wagner cps 104 VM.1 RW Fall 2000 Outline of Today s Lecture Virtual Memory. Paged virtual memory. Virtual to Physical translation:
More informationSystem Challenges and Opportunities for Transactional Memory
System Challenges and Opportunities for Transactional Memory JaeWoong Chung Computer System Lab Stanford University My thesis is about Computer system design that help leveraging hardware parallelism Transactional
More informationCost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)
Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationCPS 104 Computer Organization and Programming Lecture 20: Virtual Memory
CPS 104 Computer Organization and Programming Lecture 20: Virtual Nov. 10, 1999 Dietolf (Dee) Ramm http://www.cs.duke.edu/~dr/cps104.html CPS 104 Lecture 20.1 Outline of Today s Lecture O Virtual. 6 Paged
More informationLecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory
Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section
More informationDeAliaser: Alias Speculation Using Atomic Region Support
DeAliaser: Alias Speculation Using Atomic Region Support Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.illinois.edu Memory Aliasing Prevents Good
More informationThe Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas
The Design Complexity of Program Undo Support in a General Purpose Processor Radu Teodorescu and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Processor with program
More informationMcRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime
McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime B. Saha, A-R. Adl- Tabatabai, R. Hudson, C.C. Minh, B. Hertzberg PPoPP 2006 Introductory TM Sales Pitch Two legs
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationSpeculative Synchronization
Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization
More informationAn Integrated Pseudo-Associativity and Relaxed-Order Approach to Hardware Transactional Memory
An Integrated Pseudo-Associativity and Relaxed-Order Approach to Hardware Transactional Memory ZHICHAO YAN, Huazhong University of Science and Technology HONG JIANG, University of Nebraska - Lincoln YUJUAN
More informationUnbounded Page-Based Transactional Memory
Unbounded Page-Based Transactional Memory Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck, Gilles Pokam, Osvaldo Colavin, and Brad Calder University of California
More informationNonblocking Transactions Without Indirection Using Alert-on-Update
Nonblocking Transactions Without Indirection Using Alert-on-Update Michael F. Spear, Arrvindh Shriraman, Luke Dalessandro, Sandhya Dwarkadas, and Michael L. Scott Department of Computer Science, University
More informationLecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation
Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, lazy implementation 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end
More informationCommit Algorithms for Scalable Hardware Transactional Memory. Abstract
Commit Algorithms for Scalable Hardware Transactional Memory Seth H. Pugsley, Rajeev Balasubramonian UUCS-07-016 School of Computing University of Utah Salt Lake City, UT 84112 USA August 9, 2007 Abstract
More informationPotential violations of Serializability: Example 1
CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationUsing Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional. memory
Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory Lee Baugh, Naveen Neelakantam, and Craig Zilles Department of Computer Science, University of Illinois
More informationtapping into parallel ism with transactional memory
Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott tapping into parallel ism with transactional memory Arrvindh Shriraman is a graduate student in computer science at the University of Rochester.
More informationComputer Architecture
18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University
More informationSELF-TUNING HTM. Paolo Romano
SELF-TUNING HTM Paolo Romano 2 Based on ICAC 14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX International Conference on Autonomic Computing
More informationCache Coherence Protocols for Chip Multiprocessors - I
Cache Coherence Protocols for Chip Multiprocessors - I John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 5 6 September 2016 Context Thus far chip multiprocessors
More informationLecture 12 Transactional Memory
CSCI-UA.0480-010 Special Topics: Multicore Programming Lecture 12 Transactional Memory Christopher Mitchell, Ph.D. cmitchell@cs.nyu.edu http://z80.me Database Background Databases have successfully exploited
More informationPortland State University ECE 588/688. Transactional Memory
Portland State University ECE 588/688 Transactional Memory Copyright by Alaa Alameldeen 2018 Issues with Lock Synchronization Priority Inversion A lower-priority thread is preempted while holding a lock
More informationExam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence
Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,
More informationDecoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor
Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor Hari Kannan, Michael Dalton, Christos Kozyrakis Computer Systems Laboratory Stanford University Motivation Dynamic analysis help
More informationLecture 6: TM Eager Implementations. Topics: Eager conflict detection (LogTM), TM pathologies
Lecture 6: TM Eager Implementations Topics: Eager conflict detection (LogTM), TM pathologies 1 Design Space Data Versioning Eager: based on an undo log Lazy: based on a write buffer Typically, versioning
More informationFall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012
18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi
More informationFARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures
FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures Tayo Oguntebi, Sungpack Hong, Jared Casper, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Outline Motivation The Stanford
More informationPERFORMANCE PATHOLOGIES IN HARDWARE TRANSACTIONAL MEMORY
... PERFORMANCE PATHOLOGIES IN HARDWARE TRANSACTIONAL MEMORY... TRANSACTIONAL MEMORY IS A PROMISING APPROACH TO EASE PARALLEL PROGRAMMING. HARDWARE TRANSACTIONAL MEMORY SYSTEM DESIGNS REFLECT CHOICES ALONG
More informationByteSTM: Java Software Transactional Memory at the Virtual Machine Level
ByteSTM: Java Software Transactional Memory at the Virtual Machine Level Mohamed Mohamedin Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationASelectiveLoggingMechanismforHardwareTransactionalMemorySystems
ASelectiveLoggingMechanismforHardwareTransactionalMemorySystems Marc Lupon Grigorios Magklis Antonio González mlupon@ac.upc.edu grigorios.magklis@intel.com antonio.gonzalez@intel.com Computer Architecture
More informationFlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes
FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes Rishi Agarwal and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationLecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation
Lecture 10: TM Implementations Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation 1 Eager Overview Topics: Logs Log optimization Conflict examples Handling deadlocks Sticky scenarios
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationExploiting Semantic Commutativity in Hardware Speculation
Appears in the Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 16 Exploiting Semantic Commutativity in Hardware Speculation Guowei Zhang Virginia Chiu Daniel
More informationABORTING CONFLICTING TRANSACTIONS IN AN STM
Committing ABORTING CONFLICTING TRANSACTIONS IN AN STM PPOPP 09 2/17/2009 Hany Ramadan, Indrajit Roy, Emmett Witchel University of Texas at Austin Maurice Herlihy Brown University TM AND ITS DISCONTENTS
More information