Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory


Rei Odaira, Takuya Nakaike
IBM Research - Tokyo

Thread-Level Speculation (TLS)
- Proposed by [Franklin et al., 92]; also called Speculative Multithreading (SpMT).
- Speculatively parallelizes a sequential program into a multithreaded program.
- What is parallelization? Finding data-independent tasks in a program.
- Why speculation? Because a compiler cannot detect every data dependence.
- Figure: sequential execution of three tasks vs. TLS execution with 3 threads.

Runtime Requirements for TLS
- With TLS, the compiler finds tasks that are probably data-independent, and the runtime guarantees data independence among tasks.
- (Minimum) runtime requirements for TLS:
  - Data dependence (= conflict) detection among tasks
  - Execution rollback at a conflict
  - Ordered commit of tasks
- Figure: TLS execution with 3 threads, showing a conflict, a rollback, and an ordered commit.

Hardware Transactional Memory (HTM) Coming into the Market
- Blue Gene/Q, zEC12, POWER8, 4th Generation Core Processor (Haswell)
- HTM supports:
  - Conflict detection among transactions
  - Execution rollback at a conflict
- HTM satisfies two of the three runtime requirements for TLS (task = transaction).

Our Goal
- How well can TLS improve performance on real HTM hardware?
- Used an Intel 4th Generation Core Processor (Intel TSX).
- Manually modified and measured SPEC CPU2006.

Our True Goal
- How poorly can TLS improve performance on real HTM hardware?
- The question matters because previously proposed TLS systems had advanced hardware support, e.g. ordered transactions, data forwarding, etc.
- Blue Gene/Q is the only real system with advanced hardware support for TLS (ordered transactions).

Our True Goal
- How poorly can TLS improve performance on real HTM hardware?
- What kind of hardware support should be implemented next in off-the-shelf HTM?

Transactional Memory
- At programming/compile time: enclose critical sections with transaction begin/end operations.
- At execution time:
  - Memory operations within a transaction are observed as one step by other threads.
  - Multiple transactions execute in parallel as long as their memory operations do not conflict.
- Example transaction: xbegin(); a->count++; xend();
- Figure: Thread X and Thread Y each run transactions of the form xbegin(); a->count++; xend(); and one transaction instead runs xbegin(); b->count++; xend();

HTM
- Instruction set (Intel TSX):
  - XBEGIN: begin a transaction
  - XEND: end a transaction
  - XABORT, etc.
- Micro-architecture:
  - Read and write sets are held in CPU caches.
  - Conflicts are detected using the CPU cache coherence protocol, at cache-line granularity.
  - Rollback discards the write set and restores the registers.
- Abort reasons:
  - Read set and write set conflict
  - Read set and write set overflow
  - External interruptions, etc.
- Code shape (from the slide diagram): XBEGIN abort_handler ... XEND ... abort_handler: ...
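The conflict rule above can be sketched as a toy software model. This is not how the hardware is implemented; it only illustrates that read and write sets are tracked per 64-byte line, so two accesses to different bytes of the same line count as the same location. The type `tx_footprint` and the functions `tx_read`, `tx_write`, and `tx_conflict` are hypothetical names introduced for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u /* cache line size on x86, as stated in the slides */
#define MAX_LINES  64u /* toy limit: one bit per tracked line */

/* Toy model of one transaction's footprint: bit i set = line i touched. */
typedef struct {
    uint64_t read_set;
    uint64_t write_set;
} tx_footprint;

/* Map a byte address to its cache-line index. */
unsigned line_of(uintptr_t addr) { return (unsigned)(addr / LINE_BYTES); }

/* Record a read; lines beyond MAX_LINES alias in this toy model. */
void tx_read(tx_footprint *tx, uintptr_t addr) {
    tx->read_set |= UINT64_C(1) << (line_of(addr) % MAX_LINES);
}

/* Record a write; lines beyond MAX_LINES alias in this toy model. */
void tx_write(tx_footprint *tx, uintptr_t addr) {
    tx->write_set |= UINT64_C(1) << (line_of(addr) % MAX_LINES);
}

/* Two transactions conflict when one writes a line the other read or wrote. */
bool tx_conflict(const tx_footprint *a, const tx_footprint *b) {
    return (a->write_set & (b->read_set | b->write_set)) != 0 ||
           (b->write_set & a->read_set) != 0;
}
```

In this model, writes to byte offsets 0 and 56 conflict because both fall in line 0, while writes to offsets 0 and 64 do not. Two transactions that only read the same line never conflict.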

TLS for Loops
- We focus on frequently executed loops.
- Task = iteration(s) = transaction
- Why not parallelize function calls? It is difficult to implement TLS for function calls on HTM. (Refer to the paper for the details.)
- Figure: sequential execution of Iterations 1 to 3 vs. TLS execution with 3 threads, one iteration per thread.

TLS on HTM
- Enclose each iteration with XBEGIN and XEND.
- Re-execute the iteration in case of an abort.
- Figure: Iterations 1 to 3 run in parallel; Iteration 3 hits a conflict and is re-executed.
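The re-execution scheme can be sketched in portable C with stand-ins for the hardware instructions. The functions `tx_begin` and `tx_end` below are hypothetical placeholders for XBEGIN and XEND, and the counter `forced_aborts` is an assumption of this sketch that injects aborts so the retry path can be exercised without TSX hardware. On real HTM an aborted attempt would have its partial writes discarded by the hardware; here an aborted attempt simply never runs the body.

```c
/* Hypothetical stand-ins for XBEGIN/XEND so the sketch runs without TSX. */
int forced_aborts = 0; /* number of upcoming attempts that will abort */
int tls_sum = 0;       /* state updated by the sample loop body */

/* Returns 1 if the transaction starts, 0 if this attempt aborts. */
int tx_begin(void) {
    if (forced_aborts > 0) {
        forced_aborts--;
        return 0;
    }
    return 1;
}

/* Commit point; a no-op in this simulation. */
void tx_end(void) { }

/* Sample loop body: one iteration's work. */
void sample_body(int iter) { tls_sum += iter; }

/* Run one iteration as a transaction, re-executing until it commits.
 * Returns the number of attempts that were needed. */
int run_iteration(void (*body)(int), int iter) {
    int attempts = 0;
    for (;;) {
        attempts++;
        if (tx_begin()) {
            body(iter);
            tx_end();
            return attempts;
        }
        /* Abort: nothing was committed, so loop and re-execute. */
    }
}
```

With `forced_aborts = 2`, a call to `run_iteration(sample_body, 5)` takes three attempts and applies the body's effect exactly once.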

Ordered Transactions
- Transactions must commit in the same order as in sequential execution.
- This is because data independence can be guaranteed only after all of the preceding iterations have committed.
- Figure: Iterations 1 to 3 on three threads, with a commit order inversion.


Ordered Transactions by Software
- Hardware support in proposed TLS systems: wait until the preceding iterations commit.
- Software implementation by checking the commit order:
  - Use a global variable to indicate the next iteration to commit.
  - Abort if the transaction cannot commit.
- Why not spin-wait? Refer to our paper.
- Figure: each iteration asks "Can commit?" before committing; Iteration 3 fails the check, is re-executed, and asks again.
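The commit-order check can be sketched as a sequential simulation. The global `next_to_commit` plays the role of the global variable described above; the function names `try_commit` and `count_order_aborts`, and the idea of replaying a fixed arrival order, are assumptions of this sketch rather than the authors' code. A real implementation would perform the check inside the transaction and issue XABORT on failure.

```c
#include <stdbool.h>

int next_to_commit = 0; /* global: index of the next iteration allowed to commit */

/* Returns true if the iteration may commit, false if it must abort
 * and be re-executed (a real version would XABORT here). */
bool try_commit(int iter) {
    if (iter != next_to_commit)
        return false;
    next_to_commit = iter + 1;
    return true;
}

/* Replay iterations reaching their commit points in the given arrival
 * order, retrying failed ones in later rounds, and count the aborts
 * caused by commit order inversion.
 * `arrival` must be a permutation of 0..n-1, with n <= 64. */
int count_order_aborts(const int *arrival, int n) {
    bool committed[64] = { false };
    int aborts = 0;
    int done = 0;
    next_to_commit = 0;
    while (done < n) {
        for (int k = 0; k < n; k++) {
            int iter = arrival[k];
            if (committed[iter])
                continue;
            if (try_commit(iter)) {
                committed[iter] = true;
                done++;
            } else {
                aborts++;
            }
        }
    }
    return aborts;
}
```

If the iterations arrive in order 0, 1, 2, nothing aborts. If they arrive in order 2, 1, 0, this model counts three order-inversion aborts before all three commit.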

Our Goal
- How poorly can TLS improve performance on real HTM hardware?
- What kind of hardware support should be implemented next in off-the-shelf HTM?
- Will hardware support for ordered transactions really help?

False Sharing due to Cache-Line Granularity Conflict Detection
- Example loop:
    double array[...];
    for (int i = ...; i < ...; i++) {
      array[i] = ...;
    }
- Cache line = 64 bytes on x86
- Figure: under TLS, writes to array[] by Threads 1, 2, and 3 fall in the same cache line.

Transaction Coarsening to Avoid False Sharing
- Figure: Iterations 1 to 8 form one transaction (writes by Thread 1), Iterations 9 to 16 the next (writes by Thread 2), and Iterations 17 to 24 the next (writes by Thread 3), so each thread writes its own region of array[].
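A minimal sketch of transaction coarsening as a doubly-nested loop follows. It assumes 8-byte doubles and an array that starts on a 64-byte boundary, so eight consecutive iterations write exactly one cache line. The functions `tx_begin` and `tx_end` are hypothetical no-op placeholders for XBEGIN and XEND, and `tx_started` only counts transactions for illustration.

```c
#include <stddef.h>

#define LINE_BYTES 64
#define ITERS_PER_TX (LINE_BYTES / sizeof(double)) /* 8 doubles per 64-byte line */

int tx_started = 0; /* counts transactions, for illustration only */

/* Hypothetical placeholders for XBEGIN/XEND. */
void tx_begin(void) { tx_started++; }
void tx_end(void) { }

/* Original loop: for (i = 0; i < n; i++) array[i] = value;
 * Coarsened form: the outer loop steps one cache line at a time and
 * each inner loop runs as a single transaction. */
void coarsened_fill(double *array, size_t n, double value) {
    for (size_t base = 0; base < n; base += ITERS_PER_TX) {
        size_t end = base + ITERS_PER_TX < n ? base + ITERS_PER_TX : n;
        tx_begin();
        for (size_t i = base; i < end; i++)
            array[i] = value;
        tx_end();
    }
}
```

For 24 elements this produces three transactions of eight iterations each, matching the grouping in the slide. In a TLS runtime the outer-loop chunks would be the units handed to different threads.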

Benchmarks and Methodology
- SPEC CPU2006:
  - 6 benchmarks showing more than 1.5-fold speedups with 4 threads in a previous TLS study [Packirisamy et al., 2009]
  - 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3
- Manually modified frequently executed loops:
  - Inserted XBEGIN, XEND, and commit order checks.
  - Transformed each target loop into a doubly-nested loop for transaction coarsening.
- Experimental environment:
  - Core i7-4770 processor (4 cores, 2-way SMT)
  - 4-GB memory
  - Linux 2.6.32-431 / GCC 4.9.0

Normalized Throughput Results
- Figure: throughput (1 = sequential, higher is better) vs. number of software threads for 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3, with an "SMT enabled" annotation.
- Up to 11% speedups with 2 or 4 threads.
- But TLS mostly degraded the throughput.

433.milc
- Parallel program. Loop coverage: 23%.
- Figure (left): throughput (1 = sequential) vs. number of software threads.
- Figure (right): abort ratio (%) vs. number of software threads, broken down into Total, Order inversion, Overflow, Conflict, and Other.
- Commit order inversion is the dominant abort reason.
- Hardware support for ordered transactions will help.

Abort Statistics (1/2)
- Figures: abort ratio (%) vs. number of software threads for 429.mcf, 433.milc, and 456.hmmer, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other.
- Conflicts were the dominant abort reason in all of the benchmarks except 433.milc.

Abort Statistics (2/2)
- Figures: abort ratio (%) vs. number of software threads for 464.h264ref, 470.lbm, and 482.sphinx3, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other.
- Conflicts were the dominant abort reason in all of the benchmarks except 433.milc.

Reasons for Conflicts and Possible Hardware Support

Benchmark     Conflict reason                                 Possible hardware support
429.mcf       RAW dependence                                  Data forwarding
433.milc      None                                            -
456.hmmer     RAW dependence                                  Data forwarding
464.h264ref   WAR dependence                                  Multi-version cache
              WAW dependence (false sharing by prefetching)   (Fix in prefetcher)
470.lbm       WAW dependence (false sharing)                  Word-level conflict detection
482.sphinx3   WAW dependence (false sharing by prefetching)   (Fix in prefetcher)

Examples of Read-After-Write Data Dependence
- 429.mcf:
    static int size;
    static DATA array[n];
    func() {
      for (...) {
        if (...) {
          size++;
          array[size]->field = ...;
        }
      }
    }
- 456.hmmer:
    for (k = 1; k <= M; k++) {
      dc[k] = dc[k-1] + ...;
    }
- Hardware support has already been proposed in the TLS literature: data forwarding.
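To make the loop-carried dependence concrete, the following sketch counts iterations that read a location written by an earlier iteration. It abstracts each iteration to one read index and one write index, which is a simplification assumed here, and the function name `count_raw` is hypothetical. For the 456.hmmer pattern, iteration k reads dc[k-1] and writes dc[k], so every iteration after the first depends on its predecessor.

```c
/* Count iterations that read a location written by an earlier
 * iteration (a read-after-write dependence across iterations).
 * read_idx[k] and write_idx[k] are the array indices that
 * iteration k reads and writes, respectively. */
int count_raw(const int *read_idx, const int *write_idx, int n) {
    int raw = 0;
    for (int k = 1; k < n; k++) {
        for (int j = 0; j < k; j++) {
            if (write_idx[j] == read_idx[k]) {
                raw++;
                break; /* count each dependent iteration once */
            }
        }
    }
    return raw;
}
```

With four iterations of the hmmer-style loop (reads {0, 1, 2, 3}, writes {1, 2, 3, 4}), three iterations carry a RAW dependence. A loop whose reads and writes touch disjoint indices yields zero.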

Example of Write-After-Read Data Dependence
- 464.h264ref:
    for (...) {
      line = func();
      ... = line[0];
    }

    static DATA line[n];
    DATA *func() {
      line[0] = ...;
      return line;
    }
- Difficult for a compiler to analyze: the WAR dependence spans different functions in different source files.
- Multi-version caches are needed.

Conflicts Precede Commit Order Inversion
- Commit order matters only when most of the transactions reach their commit points.
- With data dependences, most of the transactions cannot run to the end.
- Figure: Iterations 1 to 3, with a conflict occurring before any commit order inversion.

Conflicts due to Prefetching
- Even with transaction coarsening, conflicts still happened in 464.h264ref and 482.sphinx3.
- Prefetched adjacent cache lines caused the conflicts.
- Figure: Threads 1 and 2 write neighboring 64-byte cache lines; prefetching the adjacent line causes a conflict.
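One way to picture the effect is the sketch below. It assumes that the adjacent cache line prefetcher fetches the other half of the 128-byte aligned pair containing the accessed line. That pairing rule is an assumption of this sketch, not something the slides state, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINE_BYTES 64

/* Map a byte offset within an array to its cache-line index. */
size_t line_of(size_t byte_offset) { return byte_offset / LINE_BYTES; }

/* Assumption: the prefetcher pulls in the other line of the
 * 128-byte aligned pair, i.e. the line index with its low bit flipped. */
size_t buddy_line(size_t line) { return line ^ 1; }

/* True if touching line `mine` may prefetch line `theirs`, which
 * another thread is writing, and thereby trigger a conflict. */
bool prefetch_may_conflict(size_t mine, size_t theirs) {
    return buddy_line(mine) == theirs;
}
```

Under this assumption, a thread writing line 0 may pull in line 1 while another thread is writing it, so coarsening to one 64-byte line per transaction does not fully isolate neighboring threads.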

Conclusion
- How well can TLS improve performance on real HTM hardware?
  - Up to 11% speedups with 4 threads in SPEC CPU2006 on a 4th Generation Core Processor.
  - But throughput degraded in most cases.
- What kind of hardware support should be implemented next in off-the-shelf HTM?
  - Hardware support for ordered transactions will help in parallel programs.
  - However, many programs contain data dependences.
  - Not only ordered transactions but also other hardware facilities to avoid conflicts should be implemented.
  - (Intel should fix the adjacent cache line prefetcher!)