An Update on Haskell H/STM 1

Similar documents
Improving STM Performance with Transactional Structs 1

A Hybrid TM for Haskell

Improving STM Performance with Transactional Structs

Reduced Hardware Lock Elision

Revisiting Software Transactional Memory in Haskell 1

Leveraging Hardware TM in Haskell

Improving STM Performance with Transactional Structs

Refined Transactional Lock Elision

COMP3151/9151 Foundations of Concurrency Lecture 8

Enhancing efficiency of Hybrid Transactional Memory via Dynamic Data Partitioning Schemes

Conflict Detection and Validation Strategies for Software Transactional Memory

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lock vs. Lock-free Memory Project proposal

Composable Shared Memory Transactions Lecture 20-2

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

unreadtvar: Extending Haskell Software Transactional Memory for Performance

Integrating Transactionally Boosted Data Structures with STM Frameworks: A Case Study on Set

CSE 230. Concurrency: STM. Slides due to: Kathleen Fisher, Simon Peyton Jones, Satnam Singh, Don Stewart

6.852: Distributed Algorithms Fall, Class 20

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Chí Cao Minh 28 May 2008

Improving the Practicality of Transactional Memory

Remote Invalidation: Optimizing the Critical Path of Memory Transactions

Simon Peyton Jones (Microsoft Research) Tokyo Haskell Users Group April 2010

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 28 November 2014

!!"!#"$%& Atomic Blocks! Atomic blocks! 3 primitives: atomically, retry, orelse!

Cost of Concurrency in Hybrid Transactional Memory

Reduced Hardware Transactions: A New Approach to Hybrid Transactional Memory

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

Hardware Transactional Memory. Daniel Schwartz-Narbonne

6 Transactional Memory. Robert Mullins

Efficient Hybrid Transactional Memory Scheme using Near-optimal Retry Computation and Sophisticated Memory Management in Multi-core Environment

arxiv: v1 [cs.dc] 22 May 2014

Performance Improvement via Always-Abort HTM

Invyswell: A HyTM for Haswell RTM. Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy

Monitors; Software Transactional Memory

Massimiliano Ghilardi

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory

Opportunities and pitfalls of multi-core scaling using Hardware Transaction Memory

Linearizability of Persistent Memory Objects

Optimizing Hybrid Transactional Memory: The Importance of Nonspeculative Operations

Compiler Correctness for Software Transactional Memory

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

A Relativistic Enhancement to Software Transactional Memory

New Programming Abstractions for Concurrency in GCC 4.7. Torvald Riegel Red Hat 12/04/05

Optimizing Hybrid Transactional Memory: The Importance of Nonspeculative Operations

Implementing and Evaluating Nested Parallel Transactions in STM. Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

arxiv: v2 [cs.dc] 2 Mar 2017

DBT Tool. DBT Framework

HydraVM: Mohamed M. Saad Mohamed Mohamedin, and Binoy Ravindran. Hot Topics in Parallelism (HotPar '12), Berkeley, CA

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Software Transactional Memory

Interchangeable Back Ends for STM Compilers

Advances in Programming Languages

COS 326 David Walker. Thanks to Kathleen Fisher and recursively to Simon Peyton Jones for much of the content of these slides.

Monitors; Software Transactional Memory

Amalgamated Lock-Elision

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

Transactional Memory

Hardware Transactional Memory on Haswell

New Programming Abstractions for Concurrency. Torvald Riegel Red Hat 12/04/05

Lecture: Transactional Memory. Topics: TM implementations

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

bool Account::withdraw(int val) { atomic { if(balance > val) { balance = balance val; return true; } else return false; } }

Understanding Hardware Transactional Memory

False Conflict Reduction in the Swiss Transactional Memory (SwissTM) System

Evaluating the Impact of Transactional Characteristics on the Performance of Transactional Memory Applications

Scheduling Transactions in Replicated Distributed Transactional Memory

Multicore programming in Haskell. Simon Marlow Microsoft Research

SELF-TUNING HTM. Paolo Romano

1 Publishable Summary

Inherent Limitations of Hybrid Transactional Memory

On Improving Transactional Memory: Optimistic Transactional Boosting, Remote Execution, and Hybrid Transactions

Software Transactional Memory

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

arxiv: v1 [cs.dc] 13 Aug 2013

MSpec: A Design Pattern for Concurrent Data Structures

Atomicity via Source-to-Source Translation

Performance Evaluation of Adaptivity in STM. Mathias Payer and Thomas R. Gross Department of Computer Science, ETH Zürich

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture: Consistency Models, TM

Programming with Transactional Memory

Cost of Concurrency in Hybrid Transactional Memory

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Tradeoffs in Transactional Memory Virtualization

THE concept of a transaction forms the foundation of

PARALLEL AND CONCURRENT PROGRAMMING IN HASKELL

Remote Transaction Commit: Centralizing Software Transactional Memory Commits

Lecture 21 Concurrency Control Part 1

Performance Improvement via Always-Abort HTM

Lecture 22 Concurrency Control Part 2

Hybrid NOrec: A Case Study in the Effectiveness of Best Effort Hardware Transactional Memory

The Multicore Transformation

Transactional Memory in C++ Hans-J. Boehm. Google and ISO C++ Concurrency Study Group chair ISO C++ Transactional Memory Study Group participant

Design Tradeoffs in Modern Software Transactional Memory Systems

Transcription:

An Update on Haskell H/STM 1 Ryan Yates and Michael L. Scott University of Rochester TRANSACT 10, 6-15-2015 1 This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and by support from the IBM Canada Centres for Advanced Studies. 1/16

Outline Haskell TM. Our implementations using HTM. Performance results. Future work. Slides: http://goo.gl/0zfjxj Paper: http://goo.gl/er29ef 2/16

Haskell TM At last TRANSACT we reported around 280 open source libraries in the Haskell ecosystem that depend on STM. Now there are over 400. Reasons for using Haskell STM: It is easy! Expressive API with retry and orelse. Trivial to build into libraries. 3/16

Haskell STM Example transa = do v <- dequeue(queue1) return v transb = do v <- dequeue(queue2) if somecondition(v) then return v else retry... mainloop = do a <- atomically(transa orelse transb orelse...) handlerequest(a) mainloop 4/16

Existing Haskell STM Implementations Glasgow Haskell Compiler (GHC), 7.8. Explicit transactional variables. Object based. Lazy value-based validation. 5/16

Existing Haskell STM Implementations Coarse-grain Lock (STM-Coarse) Serialize commits with a global lock. Similar to NOrec [Dalessandro et al., 2010, Dalessandro et al., 2011, Riegel et al., 2011]. Fine-grain Locks (STM-Fine) Lock for each TVar. Two-phase commit. Similar to OSTM [Fraser, 2004]. 6/16

Haskell with HTM Hybrid TM (Hybrid) Three levels for transactions (CF [Matveev and Shavit, 2013]). Full transactions in hardware. Software transaction, commit in hardware. Full software fallback. HLE Commit (HTM-Coarse, HTM-Fine) Like coarse-grain and fine-grain lock STMs, but using hardware transactions to elide locks around commit. 7/16

Red-black tree performance Last year 200 nodes, 84,000 tree operations per second on 4 threads. Now 50,000 nodes, 24,000,000 tree operations per second on 72 threads. 8/16

Red-black tree performance What changed? Constant space 2 metadata tracking for retry. Lowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate measurement. 2 Nearly. 9/16

Results (Intel c Xeon TM E5-2699 v3 two socket, 36-core) 2.5 107 Tree operations per second 2 1.5 1 0.5 0 1 8 18 36 54 72 Threads HashMap STM-Coarse HTM-Coarse Hybrid STM-Fine HTM-Fine 10/16

Results (bad PRNG) (Intel c Xeon TM E5-2699 v3 two socket, 36-core) 2.5 107 Tree operations per second 2 1.5 1 0.5 0 1 8 18 36 54 72 Threads HashMap STM-Coarse HTM-Coarse Hybrid STM-Fine HTM-Fine 11/16

Results (single socket) (Intel c Xeon TM E5-2699 v3 two socket, 36-core) Tree operations per second 1.5 1 0.5 0 10 7 1 2 4 6 8 10 12 14 16 18 Threads HashMap STM-Coarse HTM-Coarse Hybrid STM-Fine HTM-Fine 12/16

Results retry (Intel c Xeon TM E5-2699 v3 two socket, 36-core) 10 6 Queue transfers per second 6 4 2 0 0 20 40 60 Threads STM-Coarse HTM-Coarse Hybrid STM-Fine HTM-Fine 13/16

Future Implementation Work TStruct Flexible transactional variable granularity. Unboxed mutable variables. Node Node lock lock Node Node hash hash key key key key value value value value parent TVar parent color color left value left parent parent right hash right left left color color right right 14/16

Future Implementation Work (orelse) Supporting orelse in Hardware Transactions t1 = atomically (a orelse b) Atomically choose second branch when the first retrys. No direct support in hardware for a partial rollback. If the first transaction does not write to any TVars, there is nothing to roll back. Keep a TRec while running the first transaction. Or rewrite the first transaction to delay writes until after the choice to retry. 15/16

Summary We have a much better understanding of the performance issues. Performance on a concurrent set is competative at scale. Good performance for infrequent retry use cases. HTM is a useful and flexible tool that helps performance. We have a roadmap for future improvements. Slides: http://goo.gl/0zfjxj Paper: http://goo.gl/er29ef 16/16

17/16

Supporting retry Existing retry Implementation When retry is encountered, add the thread to the watch list of each TVar in the transaction s TRec. When a transaction commits, wake up all transactions in watch lists on TVars it writes. 18/16

Supporting retry Hardware Transactions Replace watch lists with bloom filters for read sets. Support read-only retry directly in HTM. Record write-set during HTM then perform wake-ups after HTM commit. 19/16

Wakeup Structure Committed writer transactions search blocked thread read-sets in a short transaction eliding a global wakeup lock. Committing HTM read-only retry transactions atomically insert themselves in the wakeup structure by writing the global wakeup lock inside the hardware transaction. Releases lock when successfully blocked. Aborts wakeup transaction (short and cheap). Serializes HTM retry transactions (rare anyway). 20/16

Future Implementation Work (orelse) Existing orelse Implementation Atomic choice between transactions biased toward the first. Nested TRecs allow for partial rollback. If the first transaction encounters retry, throw away the writes, but merge the reads and move to the second transaction. 21/16

Haskell STM Metadata Structure key Node prev TRec value parent left right color TVar value watch Watch Queue thread next prev Watch Queue thread next prev index tvar old new tvar old new... tvar old new 22/16

Haskell HTM Metadata Structure Node HTRec key read-set value write-set parent TVar left value right hash color Wakeup read-set thread... 23/16

Haskell Before TStruct Node key value parent left right color value hash TVar Node key value parent left right color 24/16

Haskell with TStruct Node lock hash key value color parent left right Node lock hash key value color parent left right 25/16

Haskell STM TQueue Implementation data TQueue a = TQueue (TVar [a]) (TVar [a]) dequeue :: TQueue a -> a -> STM () dequeue (TQueue _ write) v = modifytvar write (v:) enqueue :: TQueue a -> STM a enqueue (TQueue read write) = readtvar read >>= \case (v:vs) -> writetvar read vs >> return v [] -> reverse <$> readtvar write >>= \case [] -> retry (v:vs) -> do writetvar write [] writetvar read vs return v 26/16

Haskell STM Implementation Fairly standard commit protocol, but missing optimizations from more recent work. Commit Coarse grain: perform writes while holding the global lock. Fine grain: Acquire locks for writes while validating. Check that read-only variables are still valid while holding the write locks. Perform writes and release locks. 27/16

Haskell STM Broken code that we are not allowed to write! transferbad :: TVar Int -> TVar Int -> Int -> STM () transferbad accountx accounty value = do x <- readtvar accountx y <- readtvar accounty writetvar accountx (x + v) writetvar accounty (y - v) if x < 0 then launchmissles else return () 28/16

Haskell STM Broken code that we are not allowed to write! thread :: IO () thread = do transfer a b 200 transfer a c 300 29/16

C ABI vs Cmm ABI GHC s runtime support for STM is written in C. Code is generated in Cmm and calls into the runtime are essentially foreign calls with significant extra overhead. We avoid this by writing the HTM support in Cmm. Typeclass machinery could allow deeper code specialization. 30/16

Lazy Evaluation Lazy evaluation may lead to false conflicts due to the update step that writes back the fully evaluated value. One solution could be to delay performing updates (to shared values) until after a transaction commits. Races here are fine as any update must represent the same value. 31/16

References [Dalessandro et al., 2011] Dalessandro, L., Carouge, F., White, S., Lev, Y., Moir, M., Scott, M. L., and Spear, M. F. (2011). Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In Proc. of the 16th Intl. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 39 52, Newport Beach, CA. [Dalessandro et al., 2010] Dalessandro, L., Spear, M. F., and Scott, M. L. (2010). NOrec: Streamlining STM by abolishing ownership records. In Proc. of the 15th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP), pages 67 78, Bangalore, India. [Fraser, 2004] Fraser, K. (2004). Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory. [Matveev and Shavit, 2013] Matveev, A. and Shavit, N. (2013). Reduced hardware NOrec. In 5th Workshop on the Theory of Transactional Memory (WTTM), Jerusalem, Israel. [Riegel et al., 2011] Riegel, T., Marlier, P., Nowack, M., Felber, P., and Fetzer, C. (2011). In Proc. of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 53 64, San Jose, CA. 32/16