Hardware Transactional Memory Architecture and Emulation

Dr. Peng Liu (刘鹏)
liupeng@zju.edu.cn
Media Processor Lab, Dept. of Information Science and Electronic Engineering
Zhejiang University, Hangzhou 310027, P.R. China

Outline
  Motivation
  Introduce basic TM concepts & interfaces
  TM implementation tradeoffs
  Discuss opportunities beyond parallelism
  Related work

Motivation: The Parallel Programming Crisis
  Multi-core chips are an inflection point for SW development
    Scalable performance now requires parallel programming
  Parallel programming up until now
    Limited to people with access to large parallel systems
    Using low-level concurrency features in languages
      Thin veneer over underlying hardware
  Too cumbersome for mainstream software developers
    Difficult to write, debug, maintain, and even get some speedup
  We need better concurrency abstractions
    Goal = easy to use + good performance
    90% of the speedup with 10% of the effort

Parallel Programming is Hard
  Thread-level parallelism is great until we want to share data
    Defining & implementing synchronization
    Races, deadlock avoidance, memory model issues
  Fundamentally, it's hard to work on shared data at the same time
    So we don't: mutual exclusion via locks
  Locks have problems
    Performance/correctness, fine/coarse tradeoff
    Deadlocks and failure recovery

Transactional Memory (TM)
  Memory transaction [Knight 86, Herlihy & Moss 93]
    Inspired by database transactions
    Execute large, programmer-defined regions atomically and in isolation
      atomic { x = x + y; }
  Atomicity (all or nothing)
    At commit, all memory writes take effect at once
    On abort, none of the writes appear to take effect
  Isolation
    No other code can observe writes before commit
  Serializability
    Transactions seem to commit in a single serial order
    The exact order is not guaranteed, though
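The all-or-nothing behavior described on this slide can be illustrated with a small single-threaded Python sketch. The `atomic` helper and the `Abort` exception are hypothetical names introduced here for illustration; they are not part of any TM system named in these slides. The sketch models only commit and abort on a dictionary that stands in for memory, and makes no attempt at real concurrency.

```python
from contextlib import contextmanager


class Abort(Exception):
    """Hypothetical signal: raised inside an atomic block to discard its writes."""


@contextmanager
def atomic(memory):
    # The block works on a private copy, so its writes stay invisible
    # to the shared dictionary until commit (isolation).
    working = dict(memory)
    try:
        yield working
    except Abort:
        return  # abort: shared memory is left untouched
    # Commit: publish every write of the block together.
    memory.clear()
    memory.update(working)


mem = {"x": 1, "y": 2}

with atomic(mem) as tx:
    tx["x"] = tx["x"] + tx["y"]   # atomic { x = x + y; }
committed_x = mem["x"]

with atomic(mem) as tx:
    tx["x"] = 100
    raise Abort()                 # none of the writes take effect
aborted_x = mem["x"]

print(committed_x, aborted_x)
```

After the first block commits, `x` holds the sum; the second block aborts, so `x` keeps the committed value.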

Advantages of TM
  Easy-to-use synchronization construct
    As easy to use as coarse-grain locks
    Programmer declares, system implements
  Performs as well as fine-grain locks
    Automatic read-read & fine-grain concurrency
    No tradeoff between performance & correctness
  Failure atomicity & recovery
    No lost locks when a thread fails
    Failure recovery = transaction abort + restart
  Composability
    Safe & scalable composition of software modules

Programming with TM
  Basic atomic blocks: atomic {}
  User-triggered abort: abort
  Conditional synchronization: retry
  Composing code sequences: orelse
  Integration with parallel models: ?

TM Caveats and Open Issues
  TM vs. locks
  I/O and unrecoverable actions
  Interaction with non-transactional code

Atomic() ≠ Lock() + Unlock()
  The difference
    Atomic: high-level declaration of atomicity
      Does not specify implementation/blocking behavior
      Does not provide a consistency model
    Lock: low-level blocking primitive
      Does not provide atomicity or isolation on its own
  Keep in mind
    Locks can be used to implement atomic()
    Locks can be used for purposes beyond atomicity
      Cannot replace all lock regions with atomic regions
    Atomic eliminates many data races
    Atomic blocks can still suffer from atomicity violations
      E.g., an action that is atomic in the algorithm is split into two atomic blocks
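The atomicity violation mentioned in the last point can be sketched as follows. This is a deterministic, single-threaded illustration: the `accounts` dictionary, the `observer` callback, and both transfer functions are hypothetical names made up for the example, and comments mark where the atomic blocks would begin and end.

```python
accounts = {"a": 100, "b": 0}
observed_totals = []


def observer():
    # Stands in for another transaction that reads both accounts.
    observed_totals.append(accounts["a"] + accounts["b"])


def transfer_split(amount, other_transaction):
    # atomic block 1
    accounts["a"] -= amount
    # Block 1 has committed; another transaction may run here.
    other_transaction()
    # atomic block 2
    accounts["b"] += amount


def transfer_single(amount, other_transaction):
    # one atomic block: both updates commit together
    accounts["a"] -= amount
    accounts["b"] += amount
    other_transaction()  # can only observe the state after commit


transfer_split(50, observer)
transfer_single(50, observer)
print(observed_totals)
```

In the split version the observer sees a total of 50 instead of 100, even though each block is individually atomic and free of data races. In the single-block version the total is always 100.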

I/O and Other Irrevocable Actions
  Challenge: difficult to undo output & redo input
    I/O devices, I/O registers
  Alternative solutions (open problem)
    Buffer output & log input
      Finalize output & clear log at commit
      Does not work if atomic does input after output
    Guarantee that the transaction will not abort
      Abort interfering transactions or sequentialize the system
      Does not work with abort(), input-after-output
    Transaction-based systems
      Multiple transactional devices
      Manager coordinates transactions across devices

Interactions with Non-Transactional Code
  Two basic alternatives
    Weak atomicity
      Transactions are serializable only against other transactions
      No guarantees about interactions with non-transactional code
    Strong atomicity
      Transactions are serializable against all memory accesses
      Non-transactional loads/stores are 1-instruction transactions
  The tradeoff
    Strong atomicity seems intuitive
      Predictable interactions for a wide range of coding patterns
    But strong atomicity has high overheads for software TM

Why TM?
  TM = declarative synchronization
    User specifies requirement (atomicity & isolation)
    System implements it in the best possible way
  Motivation for TM
    Difficult for user to get explicit sync right
      Correctness vs. performance vs. complexity
    Explicit sync is difficult to scale
      Locking scheme for 4 CPUs is not the best for 64
    Difficult to do explicit sync with composable SW
      Need a global locking strategy
    Other advantages: fault atomicity, ...
  TM applicability
    Apps with irregular or unstructured parallelism
      Difficult to prove independence in advance
      Difficult to partition data in advance
    TM does not generate new parallelism
      It just helps you tap into what is there
  TM target: 90% of benefit @ 10% of work

Implementation Requirements for TM
  To build TM, you need
    Data versioning
      atomic { x = x + y; }
      Where do you put the new x until commit?
    Conflict detection
      T0: atomic { x = x + y; }    T1: atomic { x = x / 8; }
      How do you detect that reads/writes to x need to be serialized?
    Conflict resolution
      T0: x = x + y;    T1: x = x / 8; x = x / 8;
      How do you enforce serialization when required?
  Design space tradeoffs
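To see why the two transactions on this slide must be serialized, the short Python sketch below compares both serial orders with an unserialized interleaving. The starting values (x = 8, y = 8) and the use of integer division are assumptions chosen for the example; they do not come from the slides.

```python
def t0(x, y):
    return x + y      # T0: atomic { x = x + y; }


def t1(x):
    return x // 8     # T1: atomic { x = x / 8; } using integer division


x0, y0 = 8, 8

# The two legal serial orders.
t0_then_t1 = t1(t0(x0, y0))   # (8 + 8) // 8
t1_then_t0 = t0(t1(x0), y0)   # 8 // 8 + 8

# Unserialized interleaving: both transactions read the old x,
# then T1's write lands last and overwrites T0's update.
t0_write = t0(x0, y0)
t1_write = t1(x0)
lost_update = t1_write

print(t0_then_t1, t1_then_t0, lost_update)
```

The interleaved result matches neither serial order. Conflict detection and resolution exist to rule out this outcome.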

TM Implementation Basics
  TM systems must provide atomicity and isolation
    Without sacrificing concurrency
  Basic implementation requirements
    Data versioning
    Conflict detection & resolution
  Implementation options
    Hardware transactional memory (HTM)
    Software transactional memory (STM)
    Hybrid transactional memory
      Hardware-accelerated STMs and dual-mode systems

Data Versioning
  Manage uncommitted (new) and committed (old) versions of data for concurrent transactions
  Eager versioning (undo-log based)
    Update memory location directly
    Maintain undo info in a log
    + Faster commit, direct reads (SW)
    - Slower aborts, fault tolerance issues
  Lazy versioning (write-buffer based)
    Buffer data until commit in a write-buffer
    Update actual memory location on commit
    + Faster abort, no fault tolerance issues
    - Slower commits, indirect reads (SW)
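The two versioning schemes can be contrasted in a minimal Python sketch. The class names `EagerTx` and `LazyTx` are hypothetical, a dictionary stands in for memory, and the code is single-threaded; it is meant to show where the work lands on commit and abort, not to model any specific system.

```python
class EagerTx:
    """Undo-log versioning: write in place, remember old values."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, old value) pairs, in write order

    def read(self, addr):
        return self.memory[addr]  # direct read

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # fast: memory already holds the new values

    def abort(self):
        # Slow path: restore old values in reverse order.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Write-buffer versioning: keep new values aside until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def read(self, addr):
        # Indirect read: the transaction must see its own buffered writes.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        # Slow path: copy buffered values to memory.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # fast: memory was never touched


mem = {"x": 1}

eager = EagerTx(mem)
eager.write("x", 5)
eager_visible = mem["x"]      # new value is already in memory
eager.abort()
after_eager_abort = mem["x"]  # old value restored from the log

lazy = LazyTx(mem)
lazy.write("x", 7)
lazy_visible = mem["x"]       # memory still holds the old value
lazy_own_read = lazy.read("x")
lazy.commit()
after_lazy_commit = mem["x"]
```

The `+` and `-` entries of the slide show up directly: `EagerTx.commit` and `LazyTx.abort` only clear a container, while `EagerTx.abort` and `LazyTx.commit` must move data.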

Conflict Detection
  Detect and handle conflicts between transactions
    Read-write and (often) write-write conflicts
    Must track the transaction's read-set and write-set
      Read-set: addresses read within the transaction
      Write-set: addresses written within the transaction
  Pessimistic detection
    Check for conflicts during loads or stores
      SW: SW barriers using locks and/or version numbers
      HW: check through coherence actions
    Use contention manager to decide to stall or abort
      Various priority policies to handle common case fast
  Optimistic detection
    Detect conflicts when a transaction attempts to commit
      SW: validate write/read-set using locks or version numbers
      HW: validate write-set using coherence actions
        Get exclusive access for cache lines in write-set
    On a conflict, give priority to committing transaction
      Other transactions may abort later on
      On conflicts between committing transactions, use contention manager to decide priority
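As an illustration of optimistic detection in software, the sketch below validates a read-set against per-address version numbers at commit time. It is a simplified, single-threaded model under stated assumptions: `VersionedMemory` and `OptimisticTx` are hypothetical names, commit itself is not protected by locks, and the validation is far simpler than what a production STM performs.

```python
class VersionedMemory:
    def __init__(self, values):
        self.values = dict(values)
        self.versions = {addr: 0 for addr in values}  # bumped on every commit


class OptimisticTx:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}   # addr -> version seen at first read
        self.write_set = {}  # addr -> buffered new value

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self.read_set.setdefault(addr, self.mem.versions[addr])
        return self.mem.values[addr]

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Validate: every address read must still be at the version we saw.
        for addr, seen in self.read_set.items():
            if self.mem.versions[addr] != seen:
                return False  # conflict detected at commit: abort
        # Publish buffered writes and bump their versions.
        for addr, value in self.write_set.items():
            self.mem.values[addr] = value
            self.mem.versions[addr] += 1
        return True


mem = VersionedMemory({"x": 8, "y": 8})

t0 = OptimisticTx(mem)
t1 = OptimisticTx(mem)

t0.write("x", t0.read("x") + t0.read("y"))  # T0: x = x + y
t1.write("x", t1.read("x") // 8)            # T1: x = x / 8

t1_committed = t1.commit()  # first committer wins
t0_committed = t0.commit()  # T0 read a stale x, so validation fails
final_x = mem.values["x"]
```

The committing transaction gets priority, as the slide states; T0 would have to re-execute against the new value of `x`.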

Conflict Detection Tradeoffs
  Pessimistic conflict detection (aka encounter or eager)
    + Detect conflicts early
        Undo less work, turn some aborts to stalls
    - No forward progress guarantees, more aborts in some cases
    - Locking issues (SW), fine-grain communication (HW)
  Optimistic conflict detection (aka commit or lazy)
    + Forward progress guarantees
    + Potentially fewer conflicts, shorter locking (SW), bulk communication (HW)
    - Detects conflicts late, still has fairness problems

Conflict Detection Granularity
  Object granularity (SW/hybrid)
    + Reduced overhead (time/space)
    + Close to programmer's reasoning
    - False sharing on large objects (e.g., arrays)
  Word granularity
    + Minimize false sharing
    - Increased overhead (time/space)
  Cache line granularity
    + Compromise between object & word
    + Works for both HW/SW
  Mix & match -> best of both worlds
    Word-level for arrays, object-level for other data, ...
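The effect of granularity on false sharing can be shown with a few lines of Python. The 4-byte word size, the 64-byte cache line size, and the two example addresses are assumptions picked for the illustration.

```python
WORD_BYTES = 4    # assumed word size
LINE_BYTES = 64   # assumed cache line size


def block_of(addr, granularity):
    # Index of the tracking unit (word or line) that contains addr.
    return addr // granularity


def conflicts(addr_a, addr_b, granularity):
    # Two accesses are treated as conflicting when they map to the same unit.
    return block_of(addr_a, granularity) == block_of(addr_b, granularity)


# Two different words of the same array that share one cache line.
write_addr = 0x1000
read_addr = 0x1004

word_conflict = conflicts(write_addr, read_addr, WORD_BYTES)
line_conflict = conflicts(write_addr, read_addr, LINE_BYTES)
print(word_conflict, line_conflict)
```

At word granularity the two accesses are independent; at cache line granularity they are reported as a conflict even though no data is actually shared.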

TM Implementation Space
  Hardware TM systems
    Lazy + optimistic: Stanford TCC
    Lazy + pessimistic: MIT LTM, Intel VTM
    Eager + pessimistic: Wisconsin LogTM
  Software TM systems
    Lazy + optimistic (rd/wr): Sun TL2
    Lazy + optimistic (rd)/pessimistic (wr): MS OSTM
    Eager + optimistic (rd)/pessimistic (wr): Intel STM
    Eager + pessimistic (rd/wr): Intel STM
  Optimal design is still an open question
    May be different for HW, SW, and hybrid

Hardware or Software TM?
  Can be implemented in HW or SW
  SW is slow
    Bookkeeping is expensive: 2-8x slowdown
  SW has correctness pitfalls
    Even for correctly synchronized code!
    Lack of strong atomicity
  Let's use hardware for TM

Types of Hardware Support
  Hardware-accelerated STM systems (HASTM, SigTM, USTM, FlexTM, ...)
    Start with an STM system & identify key bottlenecks
    Provide (simple) HW primitives for acceleration
  Hardware-based TM systems (TCC, LTM, VTM, LogTM, ...)
    Versioning & conflict detection directly in HW
  Hybrid TM systems (Sun Rock, ...)
    Combine an HTM with an STM by switching modes when needed
      Based on transaction characteristics, available resources, ...

Hardware TM
  Data versioning in caches
    Cache the write-buffer or the undo-log
    Cache metadata to track read-set and write-set
    Can do with private, shared, and multi-level caches
  Conflict detection through cache coherence protocol
    Coherence lookups detect conflicts between transactions
    Works with snooping & directory coherence
  Notes
    Register checkpoint must be taken at transaction begin
    Virtualization of hardware resources
    HTM support is similar for TLS and speculative lock elision
      Some hardware can actually support all three models
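The idea of tracking the read-set and write-set as cache metadata and checking them on coherence lookups can be sketched in Python. This is a loose software model under several assumptions: `TxCache` is a hypothetical class, the per-line R and W bits are kept as sets of line indices, the line size is taken to be 64 bytes, and a coherence request is reduced to an address plus a read/write flag.

```python
LINE_BYTES = 64  # assumed cache line size


class TxCache:
    """Toy model of one core's cache with per-line R/W transaction bits."""

    def __init__(self):
        self.r_bits = set()  # lines read inside the current transaction
        self.w_bits = set()  # lines written inside the current transaction

    def tx_load(self, addr):
        self.r_bits.add(addr // LINE_BYTES)

    def tx_store(self, addr):
        self.w_bits.add(addr // LINE_BYTES)

    def snoop(self, addr, is_write):
        """Return True if a remote coherence request conflicts with this transaction."""
        line = addr // LINE_BYTES
        if line in self.w_bits:
            return True                       # remote read or write vs. our write
        return is_write and line in self.r_bits  # remote write vs. our read

    def end_transaction(self):
        # Commit or abort: flash-clear the tracking bits.
        self.r_bits.clear()
        self.w_bits.clear()


cache = TxCache()
cache.tx_load(0x40)
cache.tx_store(0x80)

remote_read_of_read_line = cache.snoop(0x44, is_write=False)
remote_write_of_read_line = cache.snoop(0x44, is_write=True)
remote_read_of_written_line = cache.snoop(0x80, is_write=False)

cache.end_transaction()
after_end = cache.snoop(0x80, is_write=True)
```

Read-read sharing is not a conflict, while any remote access to a speculatively written line is. What the hardware does on a detected conflict (stall or abort) is a contention-management decision outside this sketch.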

HTM Advantages
  Transparent
    No need for SW barriers, function cloning, ...
  Fast common-case behavior
    Zero-overhead tracking of read-set & write-set
    Zero-overhead versioning
    Fast commit & abort without data movement
    Continuous validation of read-set
  Strong isolation
    Conflicts detected on non-transactional loads/stores as well
  Can simplify multi-core hardware
    Replace existing coherence with transactional coherence

HTM Challenges and Opportunities
  1. What's the best implementation in hardware?
     Many available options
  2. What's the right HW/SW interface?
     HTM support for a flexible SW environment
  3. What happens when HW resources are exhausted?
     Virtualization of hardware resources
       Time virtualization
         Interrupts, paging, and context switches within a transaction
         What happens to the state in caches?
       Space virtualization
         Where is the write-buffer or log stored?
         How are R & W bits stored and checked?
     Most transactions are currently small
       Small read-sets & write-sets
       Short in terms of instructions

Project Aims
  A self-tuning transactional memory system
    Dynamically adapts its policies to best suit the application behavior
  A configurable, parameterized application programming interface (API) to improve scalability and flexibility
  Develop a closed-loop debugger for HTM based on our FPGA prototype platform
  Validate the self-tuning memory hierarchy on a platform that can support both software-managed memories and a cache-coherent or transactional memory system

Processing Elements Concepts
  Memory wall
    Processor frequency vs. DRAM memory latency
    Latency introduced by multiple levels of memory
  Attack on the memory wall
    3-level memory model: main storage; tightly coupled memory (TCM) and HTM cache; register file
    Streaming DMA architecture
  RISC processor
    RTOS support for the real-time world

Hardware Transactional Memory Architecture

TM Versioning, Conflict Detection, Contention Management
  Implementing an atomic and isolated transactional region requires
    Versioning: eager or lazy
    Conflict detection: optimistic or pessimistic
    Contention management
  To make a transactional code region appear atomic, all its modifications must be stored and kept isolated from other transactions until commit time.
  To ensure serializability between transactions, conflicts must be detected and resolved.

Designer-defined Interface
  Define the TM instructions in three models
    Basic model
      XSTART: marks the beginning of a transaction
      XEND: marks the end of a transaction
    Extension model
      XSTART_OPEN: independent atomicity and isolation for nested transactions
      XSTART_CLOSED: independent rollback & restart for nested transactions
      XABR: abort a running transaction
      XVLD: validate a running transaction
    User mode
      UCLEAR: clear the read-set and write-set data
      USTORE: store data to memory without changing the speculative cache state
      ULOAD: load data from memory without changing the speculative cache state
  Problems
    How to fit these instructions into the instruction set architecture and processor pipeline
    How to write application programs using these primitives

Emulation Platform Framework
  Architecture research relies on software simulators, which are too slow to facilitate interesting experiments.
  An alternative to simulation is to develop FPGA-based emulation platforms for parallel computing.
  For the HTM project, we have developed the Transactional system Emulation Accelerator (TEA) platform to validate the HTM design and to support programming models and application development.
  FPGA-based technology can also be used for prototyping modern CMP systems.

TEA Architecture
  [Figure: TEA block diagram showing FPGA E, FPGA S, FPGA W, and FPGA N with RISC/DSP cores (each with I$, HTM, DMA, and TCM) and switches, and FPGA M with the router, token arbiter, DDR2 DRAM controller, I/O, and Linux RISC core.]
  Each user FPGA (East, South, West, North) contains two RISC/DSP cores enhanced with an HTM and a DMA mechanism.
  FPGA M connects all the processors to the shared memory and I/O devices.
  The router interfaces with the token arbiter, the DDR controller, and the RISC32E core that runs the Linux OS/RTOS.

Breakdown of TEA's Bandwidth
  FPGA-FPGA
    i) LVCMOS link
      Control FPGA to user FPGA link: 100 MHz x 80 bit = 8.0 Gb/s
      User FPGA to user FPGA link: 100 MHz x 100 bit = 10.0 Gb/s
    ii) GTP link
      Control FPGA to user FPGA link: 2 GTPs
      User FPGA to user FPGA link: 6 GTPs
  Memory
    Capacity: 10 GB DDR2 per FPGA
    Bandwidth: 64 bit x 150 MHz = 9.6 Gb/s
  I/O
    Control FPGA: 8 SFP
    User FPGA: 2 SFP
    Supports both 10-Gigabit Ethernet and 10-Gigabit InfiniBand standards
    Bandwidth: 2.5 Gb/s
    In addition, one Gigabit Ethernet port per FPGA as a supplement
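The link and memory bandwidth figures on this slide follow from clock rate times bus width. The small Python helper below reproduces that arithmetic; the function name `bandwidth_gbps` is made up for the example, and it assumes one transfer per clock cycle, as the slide's own products do.

```python
def bandwidth_gbps(clock_mhz, width_bits):
    # MHz x bits gives Mb/s; divide by 1000 for Gb/s
    # (assumes one transfer per clock cycle).
    return clock_mhz * width_bits / 1000


control_to_user = bandwidth_gbps(100, 80)   # LVCMOS, control FPGA to user FPGA
user_to_user = bandwidth_gbps(100, 100)     # LVCMOS, user FPGA to user FPGA
memory = bandwidth_gbps(150, 64)            # DDR2 interface, 64 bit x 150 MHz

print(control_to_user, user_to_user, memory)
```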

TEA Platform Photo
  [Photo of the TEA platform]
  Cache coherence and TM emulation
  On-chip interconnection network and protocol
  Verification of MPSoC

Contributions
  Evaluated hardware TM systems
    The best system from efficiency/complexity and application standpoint
  Replaced coherence and consistency with only transactions
    Using only transactions for communication is advantageous and efficient
  Devised a hardware/software interface for TM
    Simple primitives provide TM with flexible and needed semantics

Problems
  Software simulator: user-level or full-system?
  Hardware emulator?
  Is TM a panacea?
  How to attack the memory wall?

Related Work
  Cell processor and Roadrunner
    http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/roadrunner-tutorial-session-1-web1.pdf
  RAMP (Research Accelerator for Multiple Processors) project, an FPGA-based hardware emulator for computer architecture
    http://ramp.eecs.berkeley.edu/
  Smart Memory (Stanford University)
    A. Firoozshahian et al., A memory system design framework: creating smart memories, ISCA 2009.
  Sun's Rock is a highly speculative multicore processor with an isolating hardware checkpointing feature
    M. Tremblay and S. Chaudhry, A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor, ISSCC 2008.
  TCC project
    http://tcc.stanford.edu/
  LogTM
    K. E. Moore et al., LogTM: log-based transactional memory, HPCA 2006.
  EazyHTM
    S. Tomić et al., EazyHTM: eager-lazy hardware transactional memory, MICRO 2009.
  MetaTM
    Rossbach et al., "TxLinux and MetaTM: transactional memory and the operating system," Communications of the ACM, 2008.
  FlexTM
    S. Arrvindh et al., Flexible decoupled transactional memory support, ISCA 2008.
  TM research community
    TM bibliography: http://www.cs.wisc.edu/trans-memory

Selected References
  TM overview
    Larus & Rajwar. Transactional Memory, Morgan & Claypool Publishers, 2007, 2011.
    Larus & Kozyrakis. Transactional Memory, Communications of the ACM, 2008.
    Harris et al. Transactional Memory: An Overview, IEEE Micro, 2007.
  Basics
    Herlihy & Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures, ISCA, 1993.
    Hammond et al. Transactional Memory Coherence and Consistency, ISCA, 2004.
    Rajwar et al. Virtualizing Transactional Memory, ISCA, 2005.
    Moore et al. LogTM: Log-Based Transactional Memory, HPCA, 2006.
    Ceze et al. BulkSC: Bulk Enforcement of Sequential Consistency, ISCA, 2007.
    McDonald. Architectures for Transactional Memory, Dissertation, Stanford University, 2009.
    McDonald. Architectural Semantics for Practical Transactional Memory, ISCA, 2006.
    Moravan. Supporting Nested Transactional Memory in LogTM, ASPLOS, 2006.
    Wee et al. A Practical FPGA-based Framework for Novel CMP Research, FPGA, 2007.
    Njoroge et al. ATLAS: A Chip-Multiprocessor with Transactional Memory Support, DATE, 2007.
    Lupon et al. A Dynamically Adaptable Hardware Transactional Memory, MICRO, 2010.
    Kozyrakis, C. Transactional Memory: Concepts, Implementations, & Opportunities, 2008. http://ppl.standord.edu/~christos