Samsara: Efficient Deterministic Replay in Multiprocessor Environments with Hardware Virtualization Extensions

Samsara: Efficient Deterministic Replay in Multiprocessor Environments with Hardware Virtualization Extensions
Shiru Ren, Le Tan, Chunqi Li, Zhen Xiao, and Weijia Song
June 24, 2016

Table of Contents
1 Introduction
2 Background & Motivation
3 Record & Replay Memory Interleaving with HAV
4 Samsara Overview
5 Evaluation
6 Conclusion

Introduction: Non-determinism
- Multiprocessor architectures are inherently non-deterministic
- The lack of reproducibility complicates software debugging, security analysis, and fault tolerance

Introduction: Deterministic Replay
- Gives computer users the ability to travel backward in time, recreating past states and events
- Checkpoint + record all non-deterministic events
- Recording phase: initial state -> instruction stream -> final state, with non-deterministic events (e.g., user/network inputs, interrupts) written to the replay log
- Replay phase: checkpoint -> execute the same instruction stream -> final state, with events injected at the logged points
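
The record/inject cycle on this slide can be sketched in a few lines of Python. This is only a toy model, not Samsara's implementation: the "machine" is a single integer, and the names `run`, `record`, and `replay` and the 20% event probability are assumptions made for illustration.

```python
import random

def run(steps, next_event):
    """Deterministic 'instruction stream': state depends only on injected events."""
    state = 0
    for pc in range(steps):
        event = next_event(pc)            # None means no event at this point
        if event is not None:
            state = state * 31 + event    # effect of the event on machine state
        state = (state + pc) % 1_000_003
    return state

def record(steps, seed=None):
    """Recording phase: log each non-deterministic event with its position."""
    rng = random.Random(seed)
    log = {}

    def live_source(pc):
        if rng.random() < 0.2:            # hypothetical interrupt or input
            log[pc] = rng.randrange(256)
            return log[pc]
        return None

    return run(steps, live_source), log

def replay(steps, log):
    """Replay phase: inject the logged events at the same points."""
    return run(steps, log.get)

final_state, log = record(1000)
print(replay(1000, log) == final_state)   # True
```

Because the instruction stream itself is deterministic, the log of events and their positions is enough to reach the same final state.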

Introduction: Deterministic Replay for Multiprocessors
- Deterministic replay for a single processor is relatively mature and well-developed
- Challenge on multiprocessor systems: memory access interleaving

Background & Motivation: Hardware-based schemes
- Use special hardware support for recording memory access interleaving
- Redesign the cache coherence protocol
- Example: the FDR system [ISCA 03]
- Issues
  - Increase the complexity of the circuits, impractical for use in real systems
  - Huge space overhead, which limits the duration of the recorded interval

Background & Motivation: Software-only schemes
- Modify the OS, compiler, runtime libraries, or VMM
- Virtualization-based approaches: CREW protocol (CREW: Concurrent-Read & Exclusive-Write)
- Figure (example of logged dependences): P0 executes 23: X = 1; 24: Y = 0; 25: X = 0, while P1 executes 5: a = X; 6: b = Y + 1; 7: c = 0; 8: c = X + Y. Logged pairs: P0:23 -> P1:5, P0:24 -> P1:6, P1:5 -> P0:25, P0:25 -> P1:8.
- Issues
  - Each memory access operation must be checked for logging before execution
  - Serious performance degradation (about 10x compared to native execution)
  - Huge log sizes (approximately 1 MB/processor/second)
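
To see why CREW gets expensive under sharing, the sketch below models a page as either readable by several CPUs or writable by exactly one, and counts the permission changes that would have to be trapped and logged. The function name `crew_access`, the page granularity, and the rule that a downgraded writer keeps read permission are assumptions for illustration, not details taken from the slides.

```python
def crew_access(state, log, cpu, page, is_write):
    """Apply one access under CREW; return True if it needed a permission fault."""
    readers, writer = state.setdefault(page, (set(), None))
    if is_write:
        if writer == cpu:
            return False                    # already the exclusive writer
        state[page] = (set(), cpu)          # revoke all others, grant exclusive write
    else:
        if writer == cpu or cpu in readers:
            return False                    # already allowed to read
        new_readers = set(readers) | {cpu}
        if writer is not None:
            new_readers.add(writer)         # downgraded writer keeps read access
        state[page] = (new_readers, None)
    log.append((cpu, page, "W" if is_write else "R"))
    return True

# P0 writes and P1 reads the same pages in alternation.
trace = [(0, "X", True), (1, "X", False), (0, "Y", True),
         (1, "Y", False), (0, "X", True), (1, "X", False)]
state, log = {}, []
fault_count = sum(crew_access(state, log, cpu, page, w) for cpu, page, w in trace)
print(fault_count, "faults for", len(trace), "accesses")   # 6 faults for 6 accesses
```

In this ping-pong pattern every access changes ownership, so every access both faults and produces a log entry.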

Background & Motivation: To summarize
- Software-only schemes are inefficient without proper hardware support
- No commodity processor offers dedicated hardware-based record and replay capability
- Our goal: to implement a software approach that takes full advantage of the latest hardware features in commodity processors to record and replay memory access interleaving efficiently, without introducing any hardware modifications
- Hardware-assisted virtualization (HAV) (e.g., Intel Virtualization Technology)

Record & Replay Memory Interleaving with HAV: Point-to-point logging vs. chunk-based strategy
- Point-to-point logging approach (CREW protocol)
  - Record dependences between pairs of instructions
  - Large number of memory access detections (VM exits)
  - Huge logs
  - Excessive overhead
- Chunk-based strategy
  - Restrict each processor's execution to a series of chunks
  - Record chunk size and commit order
  - Chunk execution must satisfy atomicity and serializability

Record & Replay Memory Interleaving with HAV: Chunk execution
- Serializability: conflict detection, chunk commit
- Atomicity: copy-on-write (COW), rollback
- Truncation reasons shown: chunk size limit, I/O instruction
- Figure (chunks C1 to C3 on P0 and P1): chunk start, COW, chunk complete, conflict detection, commit, squash & rollback, re-execution. Example sets: LD(A) ST(A) ST(A) ST(B) gives R-set {A}, W-set {A, B}; LD(D) ST(D) gives R-set {D}, W-set {D}; LD(A) LD(B) ST(B) gives R-set {A, B}, W-set {B}.
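
The interplay of COW buffering, conflict detection, and rollback can be sketched as below, using the three access sequences from the figure. This is a hedged illustration: the `Chunk` class, its methods, and the conflict rule (a committing chunk's W-set intersecting another in-flight chunk's R-set or W-set) are assumptions modeled on common chunk-based designs, and which sequence belongs to which chunk label is a guess.

```python
class Chunk:
    def __init__(self, memory):
        self.memory = memory              # shared memory: page -> value
        self.cow = {}                     # private copies of written pages
        self.rset, self.wset = set(), set()

    def load(self, page):
        self.rset.add(page)
        return self.cow.get(page, self.memory.get(page, 0))

    def store(self, page, value):
        self.wset.add(page)
        self.cow[page] = value            # buffered until commit

    def rollback(self):
        """Squash: discard buffered writes so the chunk can re-execute."""
        self.cow.clear()
        self.rset.clear()
        self.wset.clear()

    def commit(self, in_flight):
        """Write back, then squash every in-flight chunk that conflicts."""
        squashed = [c for c in in_flight if self.wset & (c.rset | c.wset)]
        self.memory.update(self.cow)
        for chunk in squashed:
            chunk.rollback()
        return squashed

memory = {"A": 0, "B": 0, "D": 0}
first, second, third = Chunk(memory), Chunk(memory), Chunk(memory)
first.load("A"); first.store("A", 1); first.store("A", 2); first.store("B", 1)
second.load("D"); second.store("D", 5)
third.load("A"); third.load("B"); third.store("B", 9)

squashed = first.commit([second, third])
print(squashed == [third], memory)   # True {'A': 2, 'B': 1, 'D': 0}
```

The chunk that only touched D survives, while the chunk that read A and B must be squashed and re-executed against the committed values.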

Record & Replay Memory Interleaving with HAV: Obtain R&W-set efficiently via HAV extensions
- VM-based approaches: numerous VM exits (hardware page protection)
- Accessed and dirty flags of EPT (Extended Page Tables)
- Our approach: a simple EPT traversal
- Figure: the same access sequence on P0 (R(b), W(b), R(b), W(b), R(d), R(e), W(e), R(a), R(c)) shown twice, once with VM exits and once with a single EPT traversal

Record & Replay Memory Interleaving with HAV: Partial traversal of EPT
- EPT uses a hierarchical, tree-based design
- If the accessed flag of an internal entry is 0, then the accessed flags of all entries in its subtrees are guaranteed to be 0
- Locality of reference (only a tiny part of the EPT needs to be traversed)
- Figure: EPT tree whose entries are labeled Accessed: 1 or Accessed: 0
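
The pruning rule above can be illustrated with a small tree walk. This is a sketch over an in-memory stand-in for the EPT, not code that touches real page tables; `Entry`, `collect_sets`, and the path-tuple page identifiers are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    accessed: bool = False
    dirty: bool = False                              # used at leaves only here
    children: dict = field(default_factory=dict)     # index -> Entry; empty for a leaf

def collect_sets(entry, path=()):
    """Return (read_set, write_set, entries_visited) for this subtree."""
    if not entry.accessed:
        return set(), set(), 1                       # prune: nothing below was touched
    if not entry.children:                           # leaf: one guest page
        return {path}, ({path} if entry.dirty else set()), 1
    rset, wset, visited = set(), set(), 1
    for index, child in entry.children.items():
        r, w, v = collect_sets(child, path + (index,))
        rset |= r
        wset |= w
        visited += v
    return rset, wset, visited

# One touched subtree (two pages read, one of them written) and one untouched.
root = Entry(accessed=True, children={
    0: Entry(accessed=True, children={
        0: Entry(accessed=True, dirty=True),
        1: Entry(accessed=True),
        2: Entry(),
    }),
    1: Entry(children={0: Entry(), 1: Entry(), 2: Entry()}),
})
rset, wset, visited = collect_sets(root)
print(sorted(rset), sorted(wset), visited)   # [(0, 0), (0, 1)] [(0, 0)] 6
```

Only 6 of the 9 entries are visited here; with realistic locality and a much wider tree the skipped fraction would be far larger.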

Record & Replay Memory Interleaving with HAV: Observations
- Chunk commit is time-consuming
  - Wait for lock
  - Write-back operation
- Figure labels (commit timeline on P0): obtain R&W-set, chunk complete, wait for lock, broadcast updates, lock, detect conflict, write-back updates, subsequent chunk

Record & Replay Memory Interleaving with HAV: Decentralized three-phase commit protocol
- Support parallel commit while ensuring serializability
- Three phases
  - Pre-commit phase
  - Commit phase
  - Synchronization phase
- Figure callout: "Move this out of the synchronized block"
- Figure labels (commit timeline on P0): lock, obtain R&W-set, detect conflict, write-back updates, chunk complete, wait for lock, insert into committing list, broadcast updates, update chunk info, check committing list, subsequent chunk

Record & Replay Memory Interleaving with HAV: Replay memory interleaving
- Guarantee that all chunks are properly rebuilt and executed in the original order
- Design goal: maintain the same parallelism as the recording phase
- 1. Truncate a chunk at the recorded timestamp
- 2. Ensure that all preceding chunks have been committed successfully before the current chunk starts
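
The two replay rules can be sketched with one thread per virtual CPU and a shared commit counter. This is a simplified model under stated assumptions: the log is a list of hypothetical `(cpu, chunk_size)` entries in commit order, chunk size is counted in toy "instructions" rather than a hardware timestamp, and each chunk runs while holding the lock, so the sketch shows the ordering constraint only and none of the parallelism the real system aims for.

```python
import threading

def const(dst, value):
    return lambda mem: mem.update({dst: value})

def add(dst, a, b):
    return lambda mem: mem.update({dst: mem.get(a, 0) + mem.get(b, 0)})

def replay(programs, commit_log):
    """Re-execute per-CPU instruction lists chunk by chunk in logged commit order."""
    memory = {}
    committed = 0
    cond = threading.Condition()

    def vcpu(cpu):
        nonlocal committed
        pc = 0
        for order, (owner, size) in enumerate(commit_log):
            if owner != cpu:
                continue
            with cond:
                # Rule 2: wait until all preceding chunks have committed.
                cond.wait_for(lambda: committed == order)
                # Rule 1: truncate the chunk at the recorded size.
                for op in programs[cpu][pc:pc + size]:
                    op(memory)
                pc += size
                committed += 1
                cond.notify_all()

    threads = [threading.Thread(target=vcpu, args=(cpu,)) for cpu in programs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory

programs = {
    0: [const("X", 1), const("Y", 2), const("X", 0)],
    1: [add("a", "X", "Y"), add("b", "a", "Y")],
}
commit_log = [(0, 2), (1, 1), (0, 1), (1, 1)]
print(replay(programs, commit_log))   # {'X': 0, 'Y': 2, 'a': 3, 'b': 5}
```

A different commit order, such as `[(1, 2), (0, 3)]`, yields different values for `a` and `b`, which is why the commit order has to be part of the log.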

Samsara Overview

Evaluation: Experimental Setup
- Hardware: 4-core Intel Core i7-4790 processor, 12 GB memory, 1 TB hard drive
- Host: Ubuntu 12.04 with Linux kernel version 3.11.0 and Qemu-1.2.2
- Guest: Ubuntu 14.04 with Linux kernel version 3.13.1
- Workloads
  - Computation-intensive applications: PARSEC, SPLASH-2
  - I/O-intensive applications: kernel-build, pbzip2

Evaluation: Log Size
- Samsara generates logs at an average rate of 0.0027 MB/core/s when recording two cores and 0.0031 MB/core/s when recording four cores
- Reduces the log file size by 98.6% compared to previous software-only schemes
- Figure: log size produced by Samsara during recording (compressed with gzip)
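
As a rough sense of scale, the reported per-core rates can be converted into log volume per hour of recording. The helper name `hourly_log_mb` is hypothetical, and the calculation assumes the average rates hold steadily over the whole hour.

```python
def hourly_log_mb(rate_mb_per_core_s, cores, seconds=3600):
    """Total log volume in MB for the given per-core rate, core count, and duration."""
    return rate_mb_per_core_s * cores * seconds

two_core = hourly_log_mb(0.0027, 2)    # about 19.4 MB per hour
four_core = hourly_log_mb(0.0031, 4)   # about 44.6 MB per hour
print(round(two_core, 2), round(four_core, 2))
```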

Evaluation: The proportion of each type of non-deterministic event
- The size of the chunk commit order log is practically negligible compared with other non-deterministic events
  - 9.36% with two cores and 19.31% with four cores on average
- Figure: the proportion of each type of non-deterministic event in a log file (without compression)

Evaluation: Recording Overhead Compared to Native Execution
- Performance is compared against native execution
- Recording these workloads costs 2.3X and 4.1X on two and four cores, respectively
- Previous software-only approaches incur about 10X with two cores
- Figure: recording overhead compared to native execution

Conclusion
- We made the first attempt to leverage HAV extensions to achieve an efficient software-based replay system on commodity multiprocessors.
- We abandon the inefficient CREW protocol and instead use a chunk-based strategy.
- We avoid all memory access detections and obtain each chunk's read-set and write-set by retrieving the accessed and dirty flags of the EPT.
- We propose a decentralized three-phase commit protocol, which significantly reduces the performance overhead by allowing chunk commits in parallel while still ensuring serializability.

Thanks
renshiru@pku.edu.cn