Hardware Transactional Memory Architecture and Emulation Dr. Peng Liu 刘鹏 liupeng@zju.edu.cn Media Processor Lab Dept. of Information Science and Electronic Engineering Zhejiang University Hangzhou, 310027, P.R. China
Outline Motivation Introduce basic TM concepts & interfaces TM implementation tradeoffs Discuss opportunities beyond parallelism Related work
Motivation: The Parallel Programming Crisis Multi-core chips, inflection point for SW development Scalable performance now requires parallel programming Parallel programming up until now Limited to people with access to large parallel systems Using low-level concurrency features in languages Thin veneer over underlying hardware Too cumbersome for mainstream software developers Difficult to write, debug, maintain and even get some speedup We need better concurrency abstractions Goal = easy to use + good performance 90% of the speedup with 10% of the effort
Parallel Programming is Hard Thread-level parallelism is great until we want to share data Defining & implementing synchronization Races, deadlock avoidance, memory model issues Fundamentally, it's hard to work on shared data at the same time, so we serialize access: mutual exclusion via locks Locks have problems performance/correctness, fine/coarse tradeoff deadlocks and failure recovery
Transactional Memory (TM) Memory transaction [Knight 86, Herlihy & Moss 93] Inspired by database transactions Execute large, programmer-defined regions atomically and in isolation atomic { x = x + y; } Atomicity (all or nothing) At commit, all memory writes take effect at once On abort, none of the writes appear to take effect Isolation No other code can observe writes before commit Serializability Transactions seem to commit in a single serial order The exact order is not guaranteed though
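The atomic { x = x + y; } semantics above can be made concrete with a small software model. This is an illustrative sketch only: the Transaction class, its method names, and the memory dictionary are invented for this deck, not a real TM API.

```python
# Minimal sketch of transactional atomicity: writes are buffered
# and either all take effect at commit, or none on abort.
# All names here are illustrative, not a real TM interface.

memory = {"x": 10, "y": 5}

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.writes = {}           # lazy write-buffer: nothing visible yet

    def read(self, addr):
        # reads see this transaction's own uncommitted writes first
        return self.writes.get(addr, self.mem[addr])

    def write(self, addr, value):
        self.writes[addr] = value  # buffered, isolated from other code

    def commit(self):
        self.mem.update(self.writes)   # all writes take effect at once

    def abort(self):
        self.writes.clear()            # none of the writes appear

# atomic { x = x + y; }
t = Transaction(memory)
t.write("x", t.read("x") + t.read("y"))
t.commit()
print(memory["x"])   # 15
```

Note how isolation falls out of the write-buffer: until commit(), no other code observing `memory` can see the new value of x.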
Advantages of TM Easy to use synchronization construct As easy to use as coarse-grain locks Programmer declares, system implements Performs as well as fine-grain locks Automatic read-read & fine-grain concurrency No tradeoff between performance & correctness Failure atomicity & recovery No lost locks when a thread fails Failure recovery = transaction abort + restart Composability Safe & scalable composition of software modules
Programming with TM Basic atomic blocks: atomic {} User-triggered abort: abort Conditional synchronization: retry Composing code sequences: orelse Integration with parallel models:?
TM Caveats and Open Issues TM vs. locks I/O and unrecoverable actions Interaction with non-transactional code
Atomic() ≠ Lock() + Unlock() The difference Atomic: high-level declaration of atomicity Does not specify implementation/blocking behavior Does not provide a consistency model Lock: low-level blocking primitive Does not provide atomicity or isolation on its own Keep in mind Locks can be used to implement atomic() Locks can be used for purposes beyond atomicity Cannot replace all lock regions with atomic regions Atomic eliminates many data races Atomic blocks can still suffer from atomicity violations Atomic action in an algorithm split into two atomic blocks
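The last bullet, one logical atomic action split into two atomic blocks, can be shown with a sketch. The bank-balance example and the explicit interleaving are hypothetical; each block is atomic by itself, yet the check-then-act invariant still breaks.

```python
# Sketch of an atomicity violation: one logical action (check-then-
# withdraw) split into two atomic blocks. Each block is atomic on its
# own, but another thread can interleave between them. The interleaving
# is simulated sequentially here for clarity; names are illustrative.

balance = 100

def can_withdraw(amount):      # first atomic block: the check
    return balance >= amount

def withdraw(amount):          # second atomic block: the update
    global balance
    balance -= amount

# Thread A checks, then thread B's whole withdrawal runs in between.
ok = can_withdraw(80)          # A: atomic { check } -> True
if can_withdraw(80):           # B: atomic { check } ...
    withdraw(80)               # B: atomic { withdraw }, balance = 20
if ok:
    withdraw(80)               # A: acts on a stale check -> balance = -60

print(balance)   # -60: the invariant (balance >= 0) is violated
```

TM does not fix this: the programmer must still put the check and the update in one atomic block, just as with locks.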
I/O and Other Irrevocable Actions Challenge: difficult to undo output & redo input I/O devices, I/O registers Alternative solutions (open problem) Buffer output & log input Finalize output & clear log at commit Does not work if atomic does input after output Guarantee that transaction will not abort Abort interfering transactions or sequentialize the system Does not work with abort(), input-after-output Transaction-based systems Multiple transactional devices Manager coordinates transactions across devices
Interactions with Non-Transactional Code Two basic alternatives Weak atomicity Transactions are serializable only against other transactions No guarantees about interactions with non-transactional code Strong atomicity Transactions are serializable against all memory accesses Non-transactional loads/stores are 1-instruction transactions The tradeoff Strong atomicity seems intuitive Predictable interactions for a wide range of coding patterns But, strong atomicity has high overheads for software TM
Why TM? TM = declarative synchronization User specifies requirement (atomicity & isolation) System implements it in the best possible way Motivation for TM Difficult for user to get explicit sync right Correctness vs. performance vs. complexity Explicit sync is difficult to scale Locking scheme for 4 CPUs is not the best for 64 Difficult to do explicit sync with composable SW Need a global locking strategy Other advantages: fault atomicity, ... TM applicability Apps with irregular or unstructured parallelism Difficult to prove independence in advance Difficult to partition data in advance TM does not generate new parallelism It just helps you tap into what is there TM target: 90% of benefit @ 10% of work
Implementation Requirements for TM To build TM, you need Data Versioning atomic { x = x + y; } Conflict Detection T0 atomic { x = x + y; } T1 atomic { x = x / 8; } Conflict Resolution T0 x = x + y; x = x / 8; x = x / 8; T1 Where do you put the new x until commit? How do you detect that reads/writes to x need to be serialized? How do you enforce serialization when required? Design space tradeoffs
TM Implementation Basics TM systems must provide atomicity and isolation Without sacrificing concurrency Basic implementation requirements Data versioning Conflict detection & resolution Implementation options Hardware transactional memory (HTM) Software transactional memory (STM) Hybrid transactional memory Hardware accelerated STMs and dual-mode systems
Data Versioning Manage uncommitted (new) and committed (old) versions of data for concurrent transactions Eager versioning (undo-log based) Update memory location directly Maintain undo info in a log + Faster commit, direct reads (SW) - Slower aborts, fault tolerance issues Lazy versioning (write-buffer based) Buffer data until commit in a write-buffer Update actual memory location on commit + Faster abort, no fault tolerance issues - Slower commits, indirect reads (SW)
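The two policies above can be contrasted in a short sketch. EagerTx and LazyTx are illustrative names invented here; real systems implement these policies in caches or STM runtimes, not Python dictionaries.

```python
# Sketch contrasting the two versioning policies: eager (undo log)
# vs. lazy (write buffer). Illustrative only, not a real TM design.

class EagerTx:
    """Undo-log based: update memory in place, log the old values."""
    def __init__(self, mem):
        self.mem, self.undo_log = mem, []

    def write(self, addr, value):
        self.undo_log.append((addr, self.mem[addr]))  # save old value
        self.mem[addr] = value                        # update in place

    def commit(self):
        self.undo_log.clear()                         # fast: drop the log

    def abort(self):
        for addr, old in reversed(self.undo_log):     # slower: roll back
            self.mem[addr] = old
        self.undo_log.clear()

class LazyTx:
    """Write-buffer based: buffer writes, apply them at commit."""
    def __init__(self, mem):
        self.mem, self.buffer = mem, {}

    def write(self, addr, value):
        self.buffer[addr] = value                     # memory untouched

    def commit(self):
        self.mem.update(self.buffer)                  # slower: copy out
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()                           # fast: drop buffer

mem = {"x": 10}
t = EagerTx(mem)
t.write("x", 99)   # eager: memory already shows 99 before commit
t.abort()          # undo log restores the old value
print(mem["x"])    # 10
```

The asymmetry on the slide is visible directly: the eager abort loops over the log, while the lazy abort is a single clear; the costs swap at commit.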
Conflict Detection Detect and handle conflicts between transactions Read-Write and (often) Write-Write conflicts Must track the transaction's read-set and write-set Read-set: addresses read within the transaction Write-set: addresses written within the transaction Pessimistic detection Check for conflicts during loads or stores SW: SW barriers using locks and/or version numbers HW: check through coherence actions Use contention manager to decide to stall or abort Various priority policies to handle common case fast Optimistic detection Detect conflicts when a transaction attempts to commit SW: validate write/read-set using locks or version numbers HW: validate write-set using coherence actions Get exclusive access for cache lines in write-set On a conflict, give priority to committing transaction Other transactions may abort later on On conflicts between committing transactions, use contention manager to decide priority
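Read-set/write-set tracking and optimistic (commit-time) conflict detection can be sketched as follows, reusing the x = x + y and x = x / 8 transactions from the earlier slide. All names are illustrative, not a real STM API.

```python
# Sketch of read-set/write-set tracking with optimistic (commit-time)
# conflict detection. Illustrative only.

class Tx:
    def __init__(self, mem):
        self.mem = mem
        self.read_set, self.write_set = set(), {}

    def read(self, addr):
        self.read_set.add(addr)                    # track the read
        return self.write_set.get(addr, self.mem[addr])

    def write(self, addr, value):
        self.write_set[addr] = value               # track the write

def conflicts(committing, other):
    # Read-Write or Write-Write overlap between the committing
    # transaction's write-set and the other's read/write sets.
    ws = set(committing.write_set)
    return bool(ws & other.read_set) or bool(ws & set(other.write_set))

mem = {"x": 1, "y": 2}
t0, t1 = Tx(mem), Tx(mem)
t0.write("x", t0.read("x") + t0.read("y"))   # T0: x = x + y
t1.write("x", t1.read("x") // 8)             # T1: x = x / 8
print(conflicts(t0, t1))   # True: T0's write to x hits T1's read of x
```

A pessimistic scheme would instead run this same overlap check inside read() and write(), stalling or aborting immediately, which is the early/late tradeoff of the next slide.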
Conflict Detection Tradeoffs Pessimistic conflict detection (aka encounter or eager) + Detect conflicts early Undo less work, turn some aborts to stalls - No forward progress guarantees, more aborts in some cases - Locking issues (SW), fine-grain communication (HW) Optimistic conflict detection (aka commit or lazy) + Forward progress guarantees + Potentially fewer conflicts, shorter locking (SW), bulk communication (HW) - Detects conflicts late, still has fairness problems
Conflict Detection Granularity Object granularity (SW/hybrid) + Reduced overhead (time/space) + Close to programmers' reasoning - False sharing on large objects (e.g. arrays) Word granularity + Minimize false sharing - Increased overhead (time/space) Cache line granularity + compromise between object & word + works for both HW/SW Mix & match -> best of both worlds word-level for arrays, object-level for other data, ...
TM Implementation Space Hardware TM systems Lazy + optimistic: Stanford TCC Lazy + pessimistic: MIT LTM, Intel VTM Eager + pessimistic: Wisconsin LogTM Software TM systems Lazy + optimistic (rd/wr): Sun TL2 Lazy + optimistic (rd)/pessimistic (wr): MS OSTM Eager + optimistic (rd)/pessimistic (wr): Intel STM Eager + pessimistic (rd/wr): Intel STM Optimal design is still an open question May be different for HW, SW, and hybrid
Hardware or Software TM? Can be implemented in HW or SW SW is slow Bookkeeping is expensive: 2-8x slowdown SW has correctness pitfalls Even for correctly synchronized code! Lack of strong atomicity Let's use hardware for TM
Types of Hardware Support Hardware-accelerated STM systems (HASTM, SigTM, USTM, FlexTM, ...) Start with STM system & identify key bottlenecks Provide (simple) HW primitives for acceleration Hardware-based TM systems (TCC, LTM, VTM, LogTM, ...) Versioning & conflict detection directly in HW Hybrid TM systems (Sun Rock, ...) Combine an HTM with an STM by switching modes when needed Based on transaction characteristics, available resources, ...
Hardware TM Data versioning in caches Cache the write-buffer or the undo-log Cache metadata to track read-set and write-set Can do with private, shared, and multi-level caches Conflict detection through cache coherence protocol Coherence lookups detect conflicts between transactions Works with snooping & directory coherence Notes Register checkpoint must be taken at transaction begin Virtualization of hardware resources HTM support similar for TLS and speculative lock-elision Some hardware can support all three models actually
HTM Advantages Transparent No need for SW barriers, function cloning, ... Fast common case behavior Zero-overhead tracking of read-set & write-set Zero-overhead versioning Fast commit & abort without data movement Continuous validation of read-set Strong isolation Conflicts detected on non-transactional loads/stores as well Can simplify multi-core hardware Replace existing coherence with transactional coherence
HTM Challenges and Opportunities 1. What's the best implementation in hardware? Many available options 2. What's the right HW/SW interface? HTM should support a flexible SW environment 3. What happens when HW resources are exhausted? Virtualization of hardware resources Time virtualization Interrupts, paging, and context switches within a transaction What happens to the state in caches Space virtualization Where is the write-buffer or log stored How are R&W bits stored and checked Most transactions are currently small Small read-sets & write-sets Short in terms of instructions
Project Aims The self-tuning transactional memory system Dynamically adapt its policies to best suit application behavior Configurable, parameterized application programming interface (API) to improve scalability and flexibility Develop a closed-loop debugger for HTM based on our FPGA prototype platform Validate the self-tuning memory hierarchy on a platform that can support both software-managed memories and a cache-coherent or transactional memory system
Processing Elements Concepts Memory Wall Processor frequency vs. DRAM memory latency Latency introduced by multiple levels of memory Attack on the Memory Wall 3-level Memory Model: Main storage, tightly coupled memory (TCM) and HTM cache, and Register file Streaming DMA architecture RISC processor RTOS support for real-time workloads
Hardware Transactional Memory Architecture
TM Version, Conflict, Contention Implement an atomic and isolated transactional region: Versioning: eager and lazy Conflict detection: optimistic and pessimistic Contention management To make a transactional code region appear atomic, all its modifications must be stored and kept isolated from other transactions until commit time. To ensure serializability between transactions, conflicts must be detected and resolved.
Designer-defined Interface Define the TM instructions in three models Basic model XSTART, a transaction begin mark XEND, a transaction end mark Extension model XSTART_OPEN, independent atomicity and isolation for nested transactions XSTART_CLOSED, independent rollback & restart for nested transactions XABR, abort a running transaction XVLD, validate a running transaction User mode UCLEAR, clear the read-set and write-set data USTORE, store the data to memory, the speculative cache state is not changed ULOAD, load the data from memory, the speculative cache state is not changed Problems: How to integrate with the instruction set architecture and processor pipeline How to write application programs using these primitives
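Assuming a thin software shim over the basic-model primitives, a transaction body would be retried until XEND commits cleanly. The runtime below is a stand-in for illustration only: the forced abort on the first attempt models an XABR-style conflict, and only the XSTART/XEND/XABR names come from the slide.

```python
# Illustrative software view of the basic-model primitives: a
# transaction body retried until it commits. Everything except the
# XSTART/XEND/XABR names is a hypothetical stand-in, not the real ISA.

class AbortTx(Exception):
    """Models an XABR-style abort delivered to the retry loop."""
    pass

class TxRuntime:
    def __init__(self):
        self.mem = {"x": 0}
        self.attempts = 0

    def xstart(self):               # XSTART: begin mark, checkpoint
        self.attempts += 1
        self.writes = {}            # fresh speculative state

    def xend(self):                 # XEND: validate and commit
        if self.attempts == 1:      # pretend attempt 1 hits a conflict
            raise AbortTx()
        self.mem.update(self.writes)

def run_transaction(rt, body):
    while True:
        rt.xstart()
        try:
            body(rt)
            rt.xend()
            return                  # committed
        except AbortTx:
            continue                # rollback & restart

rt = TxRuntime()
run_transaction(rt, lambda rt: rt.writes.update({"x": 42}))
print(rt.mem["x"], rt.attempts)   # 42 2
```

This retry loop is exactly the piece a compiler or library would have to generate around every atomic block, which is the second "problem" bullet above.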
Emulation Platform Framework Architecture research relies on software simulators, which are too slow to facilitate interesting experiments. An alternative to simulation is to develop FPGA-based platforms for parallel computing. For the HTM project, we have developed the Transactional system Emulation Accelerator (TEA) platform to validate the HTM design and to support programming-model and application development. We can also use FPGA-based technology for prototyping modern CMP systems.
TEA Architecture [Block diagram: four User FPGAs (E, S, W, N), each with two RISC/DSP cores plus I$, HTM, DMA, and TCM; switches and a central router connect them to the Control FPGA M, which contains the token arbiter, DDR2 DRAM controller, I/O, and a Linux RISC core.] Each User FPGA (East, South, West, North) contains two RISC/DSP cores enhanced with an HTM and DMA mechanism. The FPGA M connects all the processors to the shared memory and I/O devices. The router interfaces with the token arbiter, the DDR controller, and the RISC32E core that runs the Linux OS/RTOS.
Breakdown of TEA's Bandwidth FPGA-FPGA i) LVCMOS Link Control FPGA to User FPGA link: 100MHz x 80bit = 8.0Gb/s User FPGA to User FPGA link: 100MHz x 100bit = 10.0Gb/s ii) GTP Link Control FPGA to User FPGA link: 2 GTPs User FPGA to User FPGA link: 6 GTPs Memory Capacity 10GB DDR2/FPGA Bandwidth 64bit x 150MHz = 9.6Gb/s I/O Control FPGA 8 SFP User FPGA 2 SFP Supports both 10-Gigabit Ethernet and 10-Gigabit InfiniBand standards Bandwidth 2.5Gb/s In addition, one Gbit Ethernet port/FPGA for supplementary connectivity
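The link-bandwidth figures above follow from clock rate times link width; a quick sanity check (the helper function is illustrative, not part of the platform):

```python
# Sanity check of the link-bandwidth arithmetic:
# bandwidth (Gb/s) = clock (MHz) * width (bits) / 1000

def link_gbps(mhz, bits):
    return mhz * bits / 1000

print(link_gbps(100, 80))    # 8.0  Gb/s: Control FPGA to User FPGA
print(link_gbps(100, 100))   # 10.0 Gb/s: User FPGA to User FPGA
print(link_gbps(150, 64))    # 9.6  Gb/s: DDR2 memory interface
```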
TEA Platform Photo Cache coherence and TM Emulation On-chip Interconnection Network and Protocol Verification of MPSoC
Contributions Evaluated hardware TM systems The best system from efficiency/complexity and application standpoint Replaced coherence and consistency with only transactions Using only transactions for communication is advantageous and efficient Devised a hardware/software interface for TM Simple primitives provide TM with flexible and needed semantics
Problems Software simulator: user-level or full-system? Hardware emulator? Is TM a panacea? How to attack the memory wall?
Related Work Cell processor and Roadrunner http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/roadrunner-tutorial-session-1-web1.pdf RAMP (Research Accelerator for Multiple Processors) project, an FPGA-based hardware emulator for computer architecture. http://ramp.eecs.berkeley.edu/ Smart Memory (Stanford University) A. Firoozshahian, et al., A memory system design framework: creating smart memories, ISCA 2009. Sun's Rock is a highly speculative multicore processor with an isolating hardware checkpointing feature. M. Tremblay and S. Chaudhry, A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor, ISSCC 2008. TCC project http://tcc.stanford.edu/ LogTM K. E. Moore, et al., LogTM: log-based transactional memory, HPCA 2006. EazyHTM S. Tomić, et al., EazyHTM: eager-lazy hardware transactional memory, MICRO 2009. MetaTM Rossbach et al., "TxLinux and MetaTM: transactional memory and the operating system," Communications of the ACM, 2008. FlexTM A. Shriraman, et al., Flexible decoupled transactional memory support, ISCA 2008. TM research community TM bibliography: http://www.cs.wisc.edu/trans-memory
Selected References TM Overview Larus & Rajwar. Transactional Memory, Morgan & Claypool Publishers, 2007, 2011. Larus & Kozyrakis. Transactional Memory. Communications of the ACM, 2008. Harris et al. Transactional Memory: An Overview, IEEE Micro, 2007. Basics Herlihy & Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures, ISCA, 1993. Hammond, et al. Transactional Memory Coherence and Consistency, ISCA, 2004. Rajwar et al. Virtualizing Transactional Memory. ISCA, 2005. Moore et al. LogTM: Log-Based Transactional Memory, HPCA, 2006. Ceze et al. BulkSC: Bulk Enforcement of Sequential Consistency, ISCA, 2007. McDonald. Architectures for Transactional Memory, Dissertation, Stanford University, 2009. McDonald. Architectural Semantics for Practical Transactional Memory, ISCA, 2006. Moravan. Supporting Nested Transactional Memory in LogTM, ASPLOS, 2006. Wee et al. A Practical FPGA-based Framework for Novel CMP Research, FPGA, 2007. Njoroge et al. ATLAS: A Chip-Multiprocessor with Transactional Memory Support, DATE, 2007. Lupon et al. A Dynamically Adaptable Hardware Transactional Memory, Microarchitecture, 2010. Christos. Transactional Memory, Concepts, Implementations, & Opportunities, 2008. http://ppl.stanford.edu/~christos