TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT

FOR FIFTY YEARS, WE HAVE RIDDEN MOORE'S LAW. Moore's Law and the scaling of clock frequency have been a printing press for the currency of performance.

TECHNOLOGY SCALING. [Chart: transistors (x1000), clock frequency (MHz), power (W), and core count, 1970-2010.] Each generation of Moore's Law doubles the number of transistors, but clock frequency has stopped increasing.

TECHNOLOGY SCALING. [Same chart.] To increase performance, we need to exploit parallelism.

DIFFERENT KINDS OF PARALLELISM - 1. Instruction level: a = b + c; d = e + f; g = d + b. Transaction level: T1 = {Read A, Read B, Compute C}; T2 = {Read A, Read D, Compute E}; T3 = {Read C, Read E, Compute F}.

DIFFERENT KINDS OF PARALLELISM - 2. Thread level: A x B = C, where a different thread computes each entry of the product matrix C. Task level: e.g., Search(image) dispatched to the cloud.

DIFFERENT KINDS OF PARALLELISM - 3. Thread level: A x B = C, where a different thread computes each entry of the product matrix C. User level: Search(image), Lookup(data), Query(record) issued by different users to the cloud.

DEPENDENCY DESTROYS PARALLELISM. for i = 1 to n: a[b[i]] = (a[b[i - 1]] + b[i]) / c[i]. The i-th entry can only be computed after the (i - 1)-th has been computed.
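A minimal runnable rendering of the slide's pseudocode, assuming arrays of size n + 1 to keep its 1-based indexing:

```cpp
#include <vector>

// Each iteration reads a[b[i - 1]], which iteration i - 1 may have just
// written, so the loop carries a RAW dependency and cannot be naively
// parallelized.
void dependent_loop(std::vector<double>& a, const std::vector<int>& b,
                    const std::vector<double>& c, int n) {
    for (int i = 1; i <= n; ++i)
        a[b[i]] = (a[b[i - 1]] + b[i]) / c[i];
}
```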

DIFFERENT KINDS OF DEPENDENCY. Read A then Read A: no dependency! Write A then Write A (WAW): semantics decide the order. Write A then Read A (RAW): the read needs the new value. Read A then Write A (WAR): we have flexibility here!
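The four cases, written as minimal C++ fragments over a shared variable A (illustrative only):

```cpp
int A = 0;

void rr()  { int r1 = A; int r2 = A; (void)r1; (void)r2; }  // Read-Read: no dependency
void waw() { A = 1; A = 2; }               // WAW: semantics decide which write is last
void raw() { A = 1; int r = A; (void)r; }  // RAW: the read must see the new value
void war() { int r = A; A = 1; (void)r; }  // WAR: the read may still return the old
                                           // value, so the ordering is flexible
```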

DEPENDENCE IS ACROSS TIME, BUT WHAT IS TIME? Time can be physical time. Time could correspond to logical timestamps assigned to instructions. Time could be a combination of the above. → Time is a definition of ordering.

WAR DEPENDENCE. Initially A = 10. Thread 0 reads A (local copy of A = 10); Thread 1 writes A = 13. The read happens later than the write in physical time, but is ordered before the write in logical time.

WHAT IS CORRECTNESS? We define the correctness of a parallel program by comparing its outputs to those of the program run sequentially.

SEQUENTIAL CONSISTENCY. Operations A, B, C, D admit many valid global memory orders (e.g., A B C D; A C B D; C D A B; C B A D). Can we exploit this freedom in correct execution to avoid dependency?

AVOIDING DEPENDENCY ACROSS THE STACK. Circuit: efficient atomic instructions. Multicore processor: Tardis coherence protocol. Multicore database: TicToc concurrency control (with Andy Pavlo and Daniel Sanchez). Distributed database: distributed TicToc. Distributed shared memory: transaction processing with fault tolerance.

SHARED MEMORY SYSTEMS. Cache coherence in a multi-core processor; concurrency control in an OLTP database.

DIRECTORY-BASED COHERENCE. [Diagram: processors sharing memory through a directory.] Data is replicated and cached locally for access: uncached data is copied to the local cache, and writes invalidate data copies.

CACHE COHERENCE SCALABILITY. [Chart: directory storage overhead grows from near 0% today to 200% at 1024 cores.] A write must invalidate every read copy, and the directory keeps an O(N) sharer list per cacheline. For example, with 64-byte (512-bit) cachelines, a 1024-core full-map directory needs a 1024-bit sharer list per line, i.e., 200% storage overhead.

LEASE-BASED COHERENCE. [Timeline: Core 0 and Core 1 issuing Ld(A)/St(A) against write timestamp (wts), read timestamp (rts), and program timestamp (pts), t = 0..7.] A read gets a lease on a cacheline; the lease is renewed after it expires. A store can only commit after all leases expire. Tardis uses logical leases: the timestamps are logical, not physical, time.

LOGICAL TIMESTAMP. With invalidation, versions are ordered in physical time. Tardis (no invalidation) orders the old version before the new version in logical time, a concept borrowed from databases.

TIMESTAMP MANAGEMENT. Program timestamp (pts): the timestamp of a core's last memory operation. Write timestamp (wts): the data was created at wts. Read timestamp (rts): the data is valid (leased) from wts to rts. [Diagram: a core with pts = 5 and the shared LLC holding lines A and B in state S with their wts/rts.]
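The rules below are a minimal C++ sketch of how these timestamps evolve on a load and a store, assuming a fixed lease of 10 logical ticks and a single copy per line; this illustrates the Tardis idea, not the full protocol (no coherence states, directory, or lease renewal):

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint64_t LEASE = 10;  // lease length in logical ticks (assumed)

struct CacheLine {
    uint64_t wts = 0;  // data created at wts
    uint64_t rts = 0;  // data valid (leased) from wts to rts
};

struct Core {
    uint64_t pts = 0;  // timestamp of this core's last memory operation
};

// A load must observe the line no earlier than its creation time, and it
// reserves a lease by extending rts.
void load(Core& c, CacheLine& line) {
    c.pts = std::max(c.pts, line.wts);
    line.rts = std::max(line.rts, c.pts + LEASE);
}

// A store commits in logical time after all leases on the line expire,
// which is why no shared copy ever needs to be invalidated.
void store(Core& c, CacheLine& line) {
    c.pts = std::max(c.pts, line.rts + 1);
    line.wts = line.rts = c.pts;
}
```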

TWO-CORE EXAMPLE. Core 0 executes: store A (step 1), load B (step 2), load B (step 5). Core 1 executes: store B (step 3), load A (step 4). Steps are numbered in physical-time order. Initially both cores have pts = 0, and the LLC holds A and B in state S with wts = 0, rts = 0 (traced in the driver below).
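A hypothetical driver tracing steps 1-4 of this schedule through the sketch above; step 5 hits Core 0's private copy, which the single-copy model does not represent, so it is noted in a comment:

```cpp
int main() {
    Core core0, core1;  // both start at pts = 0
    CacheLine A, B;     // both start with wts = 0, rts = 0
    store(core0, A);  // 1: Core 0 store A -> pts0 = 1,  A{wts = 1,  rts = 1}
    load(core0, B);   // 2: Core 0 load B  -> pts0 = 1,  B{wts = 0,  rts = 11}
    store(core1, B);  // 3: Core 1 store B -> pts1 = 12, B{wts = 12, rts = 12}
    load(core1, A);   // 4: Core 1 load A  -> pts1 = 12, A{wts = 1,  rts = 22}
    // 5: Core 0 load B hits its private copy {wts = 0, rts = 11}, which is
    //    still valid at pts0 = 1: it reads the old version, no traffic.
}
```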

STORE A @ CORE 0. Core 0 sends a ST(A) request; the LLC changes A from S to M (owner: Core 0). The write happens at pts = rts + 1 = 1, so Core 0's pts becomes 1 and A gets wts = rts = 1.

LOAD B @ CORE 0. Core 0 (pts = 1) sends an LD(B) request and gets a shared copy of B. The read reserves a lease by extending B's rts to pts + lease = 11 (lease = 10).

STORE B @ CORE 1. Core 1 sends a ST(B) request; the LLC returns exclusive ownership (B becomes M, owner: Core 1) without invalidating Core 0's copy. The write must happen after B's lease, at pts = rts + 1 = 12, so Core 1's pts jumps to 12 and B gets wts = rts = 12. No invalidation.

TWO VERSIONS COEXIST. Core 0 still caches B with (wts = 0, rts = 11); Core 1 holds B with (wts = 12, rts = 12). Core 1 has traveled ahead in time: the two versions are ordered in logical time.

LOAD A @ CORE 1. Core 1 (pts = 12) sends an LD(A) request. A write-back request to Core 0 downgrades its copy of A from M to S, and the read reserves A's rts to pts + lease = 22.

LOAD B @ CORE 0. Core 0 (pts = 1) hits on its local copy of B (wts = 0, rts = 11), which is still valid at pts = 1. Step 5 is later than step 4 in physical time but earlier in logical time: the global memory order follows logical time rather than physical-time order.

SUMMARY OF EXAMPLE. With a directory, both the RAW and the WAR dependencies are enforced in physical-time order. With Tardis, the RAW dependency follows physical time while the WAR dependency is resolved in logical time: ordering is physical + logical time.

PHYSIOLOGICAL TIME. In the example, Core 0 executes store A (1), load B (1), load B (1) and Core 1 executes store B (12), load A (12), where (n) is the logical timestamp. Physiological time combines physical and logical time: X <_PL Y := X <_L Y, or (X =_L Y and X <_P Y). Thm: Tardis obeys sequential consistency.
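The definition reads as a lexicographic comparison on (logical, physical) time; a tiny illustrative C++ rendering (field names assumed):

```cpp
#include <tuple>

struct Op {
    long logical;   // logical timestamp (e.g., wts/pts in Tardis)
    long physical;  // physical time of execution
};

// X <_PL Y := X <_L Y or (X =_L Y and X <_P Y):
// lexicographic order on (logical, physical).
bool physiological_before(const Op& x, const Op& y) {
    return std::tie(x.logical, x.physical) < std::tie(y.logical, y.physical);
}
```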

TARDIS PROS AND CONS. Scalability: no invalidation, multicast, or broadcast. Remaining issues and their mitigations: lease renewal (speculative reads), timestamp size (timestamp compression), and time standing still (livelock avoidance).

EVALUATION. Storage overhead per cacheline for N cores: a directory needs N bits per cacheline, while Tardis needs max(const, log N) bits. [Chart: directory overhead grows toward 200% at 1024 cores; Tardis stays nearly flat at 16, 64, 256, and 1024 cores.]

SPEEDUP. [Chart: normalized speedup (0.9-1.3) on the Graphite multi-core simulator with 64 cores, comparing DIRECTORY, TARDIS, and TARDIS AT 256 CORES.]

NETWORK TRAFFIC. [Chart: normalized network traffic (0.8-1.05), comparing DIRECTORY, TARDIS, and TARDIS AT 256 CORES.]

[Design-space chart: network traffic vs. storage overhead. Snoopy coherence, directory coherence, and optimized directories trade off traffic, storage, complexity, and performance degradation; Tardis sits low on both axes.]

CONCURRENCY CONTROL. [Diagram: interleaved transactions (BEGIN ... COMMIT) mapped to an equivalent serial order.] Serializability: results should correspond to some serial order of atomic execution.

CONCURRENCY CONTROL. [Diagram: an interleaving whose results match no serial order.] Can't have this: results should correspond to some serial order of atomic execution.

BOTTLENECK 1: TIMESTAMP ALLOCATION. With a centralized allocator, timestamp allocation is a scalability bottleneck. [Chart: throughput (million txn/s) vs. thread count for T/O and 2PL.] With synchronized clocks, clock skew causes unnecessary aborts.

BOTTLENECK 2: STATIC ASSIGNMENT. Timestamps are assigned before a transaction starts, and a suboptimal assignment leads to unnecessary aborts. With T1 @ ts=1 (READ(A)) and T2 @ ts=2 (WRITE(A)), both commit. With the assignment flipped (T1 @ ts=2, T2 @ ts=1), T1's read of A at ts=2 forces T2's write at ts=1 to abort.

KEY IDEA: DATA-DRIVEN TIMESTAMP MANAGEMENT. Traditional T/O: (1) acquire a timestamp (TS); (2) determine tuple visibility using TS. This requires timestamp allocation and static timestamp assignment. TicToc: (1) access tuples and remember their timestamp info; (2) compute a commit timestamp (CommitTS). No timestamp allocation, and dynamic timestamp assignment.

TICTOC TRANSACTION EXECUTION. Read phase (BEGIN): execute the transaction, reading and writing tuples locally. Validation phase: compute CommitTS and decide commit/abort. Write phase (COMMIT): update the database. Tuple format: data, wts (write timestamp), rts (read timestamp). The last write happened at wts, the last read at rts, and the data is valid between wts and rts.
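A minimal single-threaded C++ sketch of this tuple format and the read-phase bookkeeping; names are illustrative, and the actual implementation encodes the timestamps compactly in one word for atomic access:

```cpp
#include <cstdint>
#include <unordered_map>

struct Tuple {
    uint64_t data;
    uint64_t wts;  // last write happened at wts
    uint64_t rts;  // reads are known valid through rts
};

struct AccessEntry {
    uint64_t data;      // snapshot (reads) or pending value (writes)
    uint64_t wts, rts;  // timestamps observed at access time
};

struct Txn {
    std::unordered_map<Tuple*, AccessEntry> read_set, write_set;

    // Read phase: record a snapshot of (data, wts, rts).
    uint64_t load(Tuple& t) {
        read_set[&t] = {t.data, t.wts, t.rts};
        return t.data;
    }

    // Writes are buffered locally; the database is untouched until commit.
    void store(Tuple& t, uint64_t v) {
        write_set[&t] = {v, t.wts, t.rts};
    }
};
```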

TICTOC EXAMPLE. [Logical time axis 0-4; the database holds tuple A (v1) and tuple B (v1).] Transaction 1: load A (step 1), store B (step 3), commit? (step 5). Transaction 2: load A (step 2), load B (step 4), commit? (step 6). Each transaction keeps local read/write-set state alongside the database state.

LOAD A FROM T1. T1 loads a snapshot of tuple A into its read set: data, wts, and rts.

LOAD A FROM T2. T2 loads a snapshot of tuple A into its read set: data, wts, and rts.

STORE B FROM T1. T1 stores the new B in its local write set; the database is not yet modified.

LOAD B FROM T2. T2 loads a snapshot of tuple B into its read set: data, wts, and rts. T2 still sees v1, since T1's write is buffered.

COMMIT PHASE OF T1. Compute CommitTS: for every tuple in the write set, tuple.rts + 1 ≤ CommitTS; for every tuple in the read set, tuple.wts ≤ CommitTS ≤ tuple.rts.
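Continuing the Txn sketch introduced earlier, a single-threaded rendering of these rules; real TicToc locks write-set tuples and re-reads read-set timestamps atomically during validation, which this illustration omits:

```cpp
#include <algorithm>

bool validate_and_commit(Txn& txn) {
    uint64_t commit_ts = 0;

    // Writes must land after every outstanding lease: CommitTS >= rts + 1.
    for (auto& [t, e] : txn.write_set)
        commit_ts = std::max(commit_ts, t->rts + 1);

    // Reads must not predate the version they saw: CommitTS >= wts.
    for (auto& [t, e] : txn.read_set)
        commit_ts = std::max(commit_ts, e.wts);

    // Each read must still be valid at CommitTS; extend rts if possible.
    for (auto& [t, e] : txn.read_set) {
        if (e.rts >= commit_ts) continue;      // already valid at CommitTS
        if (t->wts != e.wts) return false;     // overwritten meanwhile: abort
        t->rts = std::max(t->rts, commit_ts);  // rts extension
    }

    // Write phase: install buffered writes at CommitTS.
    for (auto& [t, e] : txn.write_set) {
        t->data = e.data;
        t->wts = t->rts = commit_ts;
    }
    return true;
}
```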

COMMIT PHASE OF T1. T1 commits @ TS = 2. Since CommitTS exceeds A's rts, T1 performs an rts extension on tuple A.

COMMIT PHASE OF T1. T1 commits @ TS = 2 and copies tuple B from its write set to the database, creating version v2.

COMMIT PHASE OF T2. Compute CommitTS for T2: since T2 has no writes, this amounts to finding a consistent read time for its read set.

FINAL STATE. T2 commits @ TS = 0 even though T1 already committed @ TS = 2: Txn 1 precedes Txn 2 in physical time, but Txn 2 precedes Txn 1 in logical time. Thm: serializability holds if all of a transaction's operations are valid at its CommitTS.
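Running this example through the two sketches above reproduces both commit timestamps, assuming the tuples' initial rts is 1 (matching the timeline):

```cpp
int main() {
    Tuple A{/*data=*/1, /*wts=*/0, /*rts=*/1};  // v1
    Tuple B{/*data=*/1, /*wts=*/0, /*rts=*/1};  // v1
    Txn t1, t2;
    t1.load(A);               // 1
    t2.load(A);               // 2
    t1.store(B, 2);           // 3: v2 buffered in T1's write set
    t2.load(B);               // 4: still reads v1
    validate_and_commit(t1);  // 5: CommitTS = B.rts + 1 = 2; extends A.rts to 2
    validate_and_commit(t2);  // 6: read-only, CommitTS = 0 -- commits "in the past"
}
```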

EXPERIMENTAL SETUP. DBx1000, a main-memory DBMS with no logging and hash indexing (no B-tree). Concurrency control algorithms: MVCC: HEKATON (Microsoft); OCC: SILO (Harvard/MIT); 2PL: DL_DETECT, NO_WAIT. Workload: 10 GB YCSB benchmark.

EVALUATION. [Charts: throughput (million txn/s) vs. thread count (up to 80) for TICTOC, HEKATON, DL_DETECT, NO_WAIT, and SILO, under medium-contention and high-contention YCSB.]

TICTOC DISCUSSION. Thm: serializability holds if all operations are valid at CommitTS. Transactions may share the same CommitTS, and the growth rate of the logical timestamp indicates the workload's inherent parallelism. [Chart: commit timestamp vs. number of committed transactions for TS_ALLOC and for TicToc under high and medium contention.]

PHYSIOLOGICAL TIME ACROSS THE STACK. Circuit: efficient atomic instructions. Multicore processor: Tardis coherence protocol. Multicore database: TicToc concurrency control. Distributed database: distributed TicToc. Distributed shared memory: transaction processing with fault tolerance.

ATOMIC INSTRUCTION (LR/SC). The ABA problem: Core 0 executes LR(x) and sees x = A; Core 1 then stores x = B and stores x = A again; Core 0's SC(x) still sees x = A and succeeds. Detect ABA using a timestamp (wts): the intermediate stores advance the timestamp even though the value is restored.
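A hedged sketch of the idea in software terms: pairing the value with a version that plays the role of wts makes a compare-and-swap (standing in for SC) fail when intermediate stores occurred, even though the value was restored. All names here are illustrative:

```cpp
#include <atomic>
#include <cstdint>

// Pack {value, version} into one word so a CAS detects the A -> B -> A
// sequence via the version bump.
std::atomic<uint64_t> x{0};  // high 32 bits: value, low 32 bits: version

uint64_t pack(uint32_t value, uint32_t version) {
    return (uint64_t(value) << 32) | version;
}

// Store-conditional analogue: succeeds only if neither the value nor the
// version changed since `observed` was read with x.load().
bool store_conditional(uint64_t observed, uint32_t new_value) {
    uint64_t desired = pack(new_value, uint32_t(observed) + 1);
    return x.compare_exchange_strong(observed, desired);
}
```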

TARDIS CACHE COHERENCE. Simple: no invalidation. Scalable: O(log N) storage, no broadcast, no multicast, no clock synchronization. Supports relaxed consistency models.

T1000: PROPOSED 1000-CORE SHARED MEMORY PROCESSOR

TICTOC CONCURRENCY CONTROL. Data-driven timestamp management: no central timestamp allocation, and dynamic timestamp assignment.

DISTRIBUTED TICTOC. Data-driven timestamp management, an efficient two-phase commit protocol, and support for local caching of remote data.

FAULT-TOLERANT DISTRIBUTED SHARED MEMORY. A transactional programming model with distributed command logging and dynamic dependency tracking among transactions (WAR dependencies can be ignored).

TIME TRAVELING TO ELIMINATE WAR Xiangyao Yu, Srini Devadas CSAIL, MIT