TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS
Xiangyao Yu, Srini Devadas
CSAIL, MIT
FOR FIFTY YEARS, WE HAVE RIDDEN MOORE'S LAW
Moore's Law and the scaling of clock frequency have been a printing press for the currency of performance.
TECHNOLOGY SCALING
[Chart: transistors (x1000), clock frequency (MHz), power (W), and number of cores, 1970-2010]
Each generation of Moore's Law doubles the number of transistors, but clock frequency has stopped increasing. To increase performance, we need to exploit parallelism.
DIFFERENT KINDS OF PARALLELISM - 1
Instruction level: a = b + c; d = e + f; g = d + b
Transaction level: [Read A, Read B, Compute C] | [Read A, Read D, Compute E] | [Read C, Read E, Compute F]
DIFFERENT KINDS OF PARALLELISM - 2
Thread level: A x B = C, where a different thread computes each entry of the product matrix C
Task level: Search(image) running in the cloud
DIFFERENT KINDS OF PARALLELISM - 3
Thread level: A x B = C, where a different thread computes each entry of the product matrix C
User level: many users concurrently issue Search(image), Lookup(data), and Query(record) to the cloud
DEPENDENCY DESTROYS PARALLELISM
for i = 1 to n: a[b[i]] = (a[b[i - 1]] + b[i]) / c[i]
The i-th entry can only be computed after the (i - 1)-th has been computed.
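A minimal C++ rendering of the slide's pseudocode (array names as on the slide; arrays are assumed to hold n + 1 elements) makes the loop-carried read-after-write explicit:

```cpp
#include <vector>

// Sketch of the slide's loop. Iteration i reads a[b[i - 1]], which a
// previous iteration may have written, so iterations must run in order:
// the loop-carried RAW dependence destroys the parallelism.
void dependent_loop(std::vector<double>& a,
                    const std::vector<int>& b,
                    const std::vector<double>& c,
                    int n) {
    for (int i = 1; i <= n; ++i) {
        a[b[i]] = (a[b[i - 1]] + b[i]) / c[i];
    }
}
```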
DIFFERENT KINDS OF DEPENDENCY
Read A then Read A: no dependency!
Write A then Write A: WAW, semantics decide the order
Write A then Read A: RAW, the read needs the new value
Read A then Write A: WAR, we have flexibility here!
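The four cases in a few lines of straight-line code (variable names illustrative):

```cpp
// Illustrative straight-line code for the four dependency kinds.
void dependency_kinds() {
    int A = 10, r1 = 0, r2 = 0;

    r1 = A;   // Read A
    r2 = A;   // Read A:  no dependency, the two reads can run in any order
    A = 1;    // Write A
    A = 2;    // Write A: WAW, program semantics (final value 2) fix the order
    r1 = A;   // Read A:  RAW, this read needs the newly written value
    r2 = A;   // Read A
    A = 3;    // Write A: WAR, the read above only needs the old value, so
              //          there is flexibility in when this write happens
    (void)r1; (void)r2;  // silence unused-variable warnings
}
```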
DEPENDENCE IS ACROSS TIME, BUT WHAT IS TIME?
Time can be physical time. Time could correspond to logical timestamps assigned to instructions. Time could be a combination of the two. In other words, time is a definition of ordering.
WAR DEPENDENCE
Initially A = 10. Thread 0 reads A (local copy of A = 10); Thread 1 writes A (A = 13). The read happens later than the write in physical time but before the write in logical time.
WHAT IS CORRECTNESS?
We define correctness of a parallel program based on its outputs in relation to the program run sequentially.
SEQUENTIAL CONSISTENCY
Operations A, B, C, D can appear in many legal global memory orders: A B C D, A C B D, C D A B, C B A D, and so on. Can we exploit this freedom in correct execution to avoid dependency?
AVOIDING DEPENDENCY ACROSS THE STACK
Circuit: efficient atomic instructions
Multicore processor: Tardis coherence protocol
Multicore database: TicToc concurrency control (with Andy Pavlo and Daniel Sanchez)
Distributed database: distributed TicToc
Distributed shared memory: transaction processing with fault tolerance
SHARED MEMORY SYSTEMS
Multi-core processor: cache coherence
OLTP database: concurrency control
DIRECTORY-BASED COHERENCE
Data is replicated and cached locally for access: uncached data is copied into the local cache, and writes invalidate all remote copies.
CACHE COHERENCE SCALABILITY
A write must invalidate every sharer, and the directory keeps an O(N) sharer list per cacheline.
[Chart: sharer-list storage overhead vs. core count, growing from near zero today toward 200% at 1024 cores]
LEASE-BASED COHERENCE
A read gets a lease on a cacheline, and the lease is renewed after it expires; a store can only commit after all leases expire. Tardis makes the leases logical: the timeline is logical time, tracked with a write timestamp (wts), a read timestamp (rts), and a per-core program timestamp (pts).
LOGICAL TIMESTAMP
An invalidation protocol orders operations in physical time; Tardis (no invalidation) orders them in logical time, a concept borrowed from databases. Old and new versions coexist, ordered along the logical time axis.
TIMESTAMP MANAGEMENT
Program timestamp (pts): per-core timestamp of the last memory operation.
Write timestamp (wts): the data was created at wts.
Read timestamp (rts): the data is valid, i.e. leased, from wts to rts.
Every cacheline in a core's cache and in the shared LLC carries (state, wts, rts); for example, line A in state S with wts = 0, rts = 10.
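A minimal sketch of the two timestamp rules, assuming a single-level hierarchy and ignoring coherence states and messages (the structure and names are ours, not the protocol's exact implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Simplified per-cacheline timestamp state in Tardis.
struct Line {
    uint64_t wts = 0;  // logical time at which this version was written
    uint64_t rts = 0;  // version is valid (leased) from wts through rts
};

constexpr uint64_t kLease = 10;  // lease length in logical time

// A load observes the line no earlier than its write and reserves a
// lease by extending rts.
void tardis_load(Line& line, uint64_t& pts) {
    pts = std::max(pts, line.wts);                // read after the write
    line.rts = std::max(line.rts, pts + kLease);  // reserve the lease
}

// A store creates a new version at a logical time past every
// outstanding lease on the old version, so no invalidation is needed.
void tardis_store(Line& line, uint64_t& pts) {
    pts = std::max(pts, line.rts + 1);  // jump past all leases
    line.wts = pts;
    line.rts = pts;
}
```

With kLease = 10, these rules reproduce the timestamps in the two-core example that follows: store A commits at pts = 1, load B reserves rts = 11, store B jumps to pts = 12, and load A reserves rts = 22.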
TWO-CORE EXAMPLE
Core 0 (pts = 0) executes: (1) store A, (2) load B, (5) load B. Core 1 (pts = 0) executes: (3) store B, (4) load A. The numbers give the physical time order. Initially both A and B sit in the shared LLC in state S with wts = 0, rts = 0.
STORE A @ CORE 0
Core 0 sends a ST(A) request; in the directory, A moves from S to M with owner core 0. The write happens at pts = 1: core 0's pts advances from 0 to 1, and A is cached with wts = rts = 1.
LOAD B @ CORE 0
Core 0 sends an LD(B) request. B stays shared; its rts is reserved to pts + lease = 11, and core 0 caches B with wts = 0, rts = 11.
STORE B @ CORE 1
Core 1 sends a ST(B) request; B moves from S to M with owner core 1, and exclusive ownership is returned without invalidating core 0's copy. Core 1's pts jumps past the lease to rts + 1 = 12, and B is cached with wts = rts = 12. No invalidation.
TWO VERSIONS COEXIST
Core 0 still holds B with (wts = 0, rts = 11); core 1 holds B with (wts = 12, rts = 12). Core 1 has traveled ahead in time, and the two versions are ordered in logical time.
LOAD A @ CORE 1
Core 1 sends an LD(A) request; the directory sends a write-back request to core 0, which downgrades A from M to S. Core 1 reads A (wts = 1) and reserves its rts to pts + lease = 22.
LOAD B @ CORE 0
Core 0's second load B (operation 5) hits its cached copy (wts = 0, rts = 11) at pts = 1. Operation 5 runs after operation 4 in physical time but before it in logical time: the global memory order follows logical time, not the physical time order.
SUMMARY OF EXAMPLE
With a directory, the RAW and WAR dependences between the two cores are enforced in physical time order. With Tardis, the same dependences are resolved in physical + logical time order, so core 0's second load B can still read the old version.
PHYSIOLOGICAL TIME
Tardis commits core 0's operations (store A, load B, load B) at logical time 1 and core 1's (store B, load A) at logical time 12. The global memory order is defined by physiological time, which combines physical and logical time:
X <_PL Y := X <_L Y, or (X =_L Y and X <_P Y)
Theorem: Tardis obeys sequential consistency.
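Physiological order is simply a lexicographic comparison on (logical time, physical time); a sketch:

```cpp
#include <cstdint>
#include <tuple>

// An operation tagged with both notions of time.
struct Op {
    uint64_t logical;   // logical timestamp assigned by Tardis
    uint64_t physical;  // physical time of the operation
};

// X <_PL Y := X <_L Y, or (X =_L Y and X <_P Y):
// logical time decides the order; physical time breaks ties.
bool physiological_before(const Op& x, const Op& y) {
    return std::tie(x.logical, x.physical) < std::tie(y.logical, y.physical);
}
```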
TARDIS PROS AND CONS
Pro: scalability, with no invalidation, multicast, or broadcast.
Con: lease renewals, mitigated by speculative reads.
Con: timestamp size, mitigated by timestamp compression.
Con: time can stand still, mitigated by livelock avoidance.
EVALUATION: STORAGE
Storage overhead per cacheline for N cores: a directory needs N bits; Tardis needs max(const, log N) bits.
[Chart: per-cacheline storage overhead of Directory vs. Tardis at 16, 64, 256, and 1024 cores; Directory grows toward 200% while Tardis stays nearly flat]
SPEEDUP
[Chart: speedup on the Graphite multi-core simulator with 64 cores, normalized to DIRECTORY, for TARDIS and TARDIS AT 256 CORES; values fall between 0.9 and 1.3]
NETWORK TRAFFIC
[Chart: network traffic normalized to DIRECTORY, for TARDIS and TARDIS AT 256 CORES; values fall between 0.8 and 1.05]
COHERENCE DESIGN SPACE
Snoopy coherence suffers high network traffic; directory coherence suffers high storage overhead; optimized directories trade these away at the cost of high complexity and performance degradation. Tardis avoids all three.
CONCURRENCY CONTROL
Transactions run concurrently, with overlapping BEGIN ... COMMIT windows, but the results should correspond to some serial order of atomic execution: serializability. Interleavings whose results match no serial order are not allowed.
BOTTLENECK 1: TIMESTAMP ALLOCATION
With a centralized allocator, timestamp allocation itself becomes the scalability bottleneck for T/O. A synchronized clock removes the central allocator, but clock skew causes unnecessary aborts.
[Chart: throughput (million txn/s, up to 25) of T/O vs. 2PL as the thread count grows to 80]
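The bottleneck is the classic shared-counter pattern: every transaction on every core increments one global counter, so the cache line holding it ping-pongs between cores. A sketch of the pattern (names hypothetical):

```cpp
#include <atomic>
#include <cstdint>

// Centralized timestamp allocator. Every transaction does an atomic
// increment on this single counter; the contended cache line bounces
// between cores and caps throughput no matter how many threads run.
std::atomic<uint64_t> global_ts{0};

uint64_t allocate_timestamp() {
    return global_ts.fetch_add(1, std::memory_order_relaxed);
}
```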
BOTTLENECK 2: STATIC ASSIGNMENT
Timestamps are assigned before a transaction starts. If T1@ts=1 reads A and T2@ts=2 then writes A, both commit: the read is logically before the write. Flip the assignment (T1@ts=2, T2@ts=1) and the same interleaving aborts, because T2's write at ts=1 arrives after A has already been read at ts=2. Suboptimal assignment leads to unnecessary aborts.
KEY IDEA: DATA-DRIVEN TIMESTAMP MANAGEMENT
Traditional T/O: (1) acquire a timestamp (TS); (2) determine tuple visibility using TS. Requires timestamp allocation and static timestamp assignment.
TicToc: (1) access tuples and remember their timestamp info; (2) compute a commit timestamp (CommitTS). No timestamp allocation, and timestamp assignment is dynamic.
TICTOC TRANSACTION EXECUTION
BEGIN, then a read phase (read and write tuples, execute the transaction), a validation phase (compute CommitTS, decide commit or abort), a write phase (update the database), then COMMIT.
Tuple format: data, wts (write timestamp), rts (read timestamp). The last write to the tuple happened at wts, the last read at rts; the data is valid between wts and rts.
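A sketch of the tuple format and the read phase, assuming for simplicity that (data, wts, rts) can be read atomically (the real system uses a compact encoding; the structure and names here are ours):

```cpp
#include <cstdint>
#include <unordered_map>

// A TicToc tuple: the data plus the logical interval [wts, rts] over
// which this version is known to be valid.
struct Tuple {
    uint64_t key;
    int64_t  data;
    uint64_t wts;  // last write happened at logical time wts
    uint64_t rts;  // version has been read up to rts; valid in [wts, rts]
};

// Local snapshot taken during the read phase: the transaction remembers
// the tuple's timestamp info instead of acquiring a global timestamp.
struct ReadSetEntry {
    int64_t  data;
    uint64_t wts, rts;
};

void tictoc_read(const Tuple& t,
                 std::unordered_map<uint64_t, ReadSetEntry>& read_set) {
    read_set[t.key] = ReadSetEntry{t.data, t.wts, t.rts};
}
```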
TICTOC EXAMPLE
Logical time runs from 0 to 4; tuples A and B start at version v1. T1 executes (1) load A, (3) store B, (5) commit?; T2 executes (2) load A, (4) load B, (6) commit?. The numbers give the physical time order; each transaction keeps local state alongside the database state.
LOAD A FROM T1
T1 loads a snapshot of tuple A: the data together with its wts and rts.
LOAD A FROM T2
T2 likewise loads a snapshot of tuple A (data, wts, rts).
STORE B FROM T1
T1 buffers the store: B is written to T1's local write set, not to the database.
LOAD B FROM T2
T2 loads a snapshot of tuple B (data, wts, rts): still version v1, since T1's store is only in its write set.
COMMIT PHASE OF T1
Compute CommitTS from the footprint:
for each tuple in the write set: tuple.rts + 1 <= CommitTS
for each tuple in the read set: tuple.wts <= CommitTS <= tuple.rts
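A sketch of the CommitTS computation and read-set validation under these two constraints (locking and atomicity are elided, and the rts extension is simplified):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct TupleTs { uint64_t wts, rts; };  // live timestamps in the database

struct ReadEntry  { TupleTs* tuple; uint64_t wts, rts; };  // snapshot at read time
struct WriteEntry { TupleTs* tuple; };

// Compute the smallest CommitTS satisfying:
//   writes: tuple.rts + 1 <= CommitTS   (new version after all reads)
//   reads:  tuple.wts <= CommitTS <= tuple.rts, extending rts if needed
bool tictoc_validate(std::vector<ReadEntry>& read_set,
                     std::vector<WriteEntry>& write_set,
                     uint64_t& commit_ts) {
    commit_ts = 0;
    for (auto& w : write_set) commit_ts = std::max(commit_ts, w.tuple->rts + 1);
    for (auto& r : read_set)  commit_ts = std::max(commit_ts, r.wts);

    for (auto& r : read_set) {
        if (r.rts >= commit_ts) continue;         // read already valid at CommitTS
        if (r.tuple->wts != r.wts) return false;  // a newer version exists: abort
        r.tuple->rts = std::max(r.tuple->rts, commit_ts);  // extend the interval
    }
    return true;  // every operation is valid at CommitTS
}
```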
COMMIT PHASE OF T1 (CONT.)
T1 commits @ TS = 2, extending the rts of tuple A so that its read remains valid at the commit timestamp.
COMMIT PHASE OF T1 (CONT.)
T1 copies tuple B from its write set into the database, creating version v2 of B.
COMMIT PHASE OF T2
With T1 committed @ TS = 2, T2 computes its own CommitTS. T2 has no writes, so it only needs a consistent read time at which both of its snapshots are valid.
FINAL STATE
T2 commits @ TS = 0 and T1 @ TS = 2: Txn 1 < Txn 2 in physical time, but Txn 2 < Txn 1 in logical time.
Theorem: serializability = all operations valid at CommitTS.
EXPERIMENTAL SETUP
DBx1000, a main-memory DBMS: no logging, hash indexing (no B-tree).
Concurrency control algorithms: MVCC (HEKATON, Microsoft), OCC (SILO, Harvard/MIT), 2PL (DL_DETECT, NO_WAIT).
Workload: 10 GB YCSB benchmark.
EVALUATION
[Charts: throughput (million txn/s) vs. thread count up to 80, for TICTOC, HEKATON, DL_DETECT, NO_WAIT, and SILO, under medium-contention and high-contention YCSB]
TICTOC DISCUSSION
Theorem: serializability = all operations valid at CommitTS. Transactions may share the same CommitTS, and the rate at which the logical timestamp grows indicates the workload's inherent parallelism.
[Chart: commit timestamp vs. number of committed transactions for TS ALLOC, high contention, and medium contention]
PHYSIOLOGICAL TIME ACROSS THE STACK
Circuit: efficient atomic instructions
Multicore processor: Tardis coherence protocol
Multicore database: TicToc concurrency control
Distributed database: distributed TicToc
Distributed shared memory: transaction processing with fault tolerance
ATOMIC INSTRUCTION (LR/SC)
The ABA problem: core 0 executes LR(x) and sees x = A; core 1 stores B and then A back into x; core 0's SC(x) still sees x = A and wrongly succeeds. Detecting ABA with a timestamp (wts) instead of the value catches the intervening writes.
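A sketch of the idea (the data layout and names are hypothetical): tag the location with its write timestamp and let SC succeed only if the wts observed at LR is unchanged, so A to B back to A is detected even though the value matches.

```cpp
#include <cstdint>

// A memory word tagged with its write timestamp.
struct TimestampedWord {
    int      value;
    uint64_t wts;  // bumped on every store, so A -> B -> A changes wts twice
};

// Load-reserved: return the value and remember the wts of the version read.
uint64_t load_reserved(const TimestampedWord& w, int& out) {
    out = w.value;
    return w.wts;
}

// Store-conditional: succeed only if no store intervened, i.e. the wts is
// still the one reserved. Comparing values alone would wrongly succeed
// after an A -> B -> A sequence.
bool store_conditional(TimestampedWord& w, uint64_t reserved_wts, int new_value) {
    if (w.wts != reserved_wts) return false;  // any intervening write detected
    w.value = new_value;
    w.wts = reserved_wts + 1;
    return true;
}
```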
TARDIS CACHE COHERENCE
Simple: no invalidation. Scalable: O(log N) storage, no broadcast, no multicast, no clock synchronization. Supports relaxed consistency models.
T1000: PROPOSED 1000-CORE SHARED MEMORY PROCESSOR
TICTOC CONCURRENCY CONTROL
Data-driven timestamp management: no central timestamp allocation, dynamic timestamp assignment.
DISTRIBUTED TICTOC
Data-driven timestamp management, an efficient two-phase commit protocol, and support for local caching of remote data.
FAULT-TOLERANT DISTRIBUTED SHARED MEMORY
A transactional programming model with distributed command logging and dynamic dependency tracking among transactions (WAR dependences can be ignored, as the sketch below illustrates).
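A sketch of dependency tracking for command-log replay, recording only the orders that constrain replay (RAW and WAW) and skipping WAR edges, which logical time makes reorderable; all names are illustrative, not the system's actual interface:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// For each data item, remember the last transaction that wrote it. A
// later read of the item adds a RAW edge and a later write adds a WAW
// edge; a write after a read (WAR) adds no edge, because timestamps let
// the earlier read commit logically before the later write.
struct DependencyTracker {
    std::unordered_map<uint64_t, uint64_t> last_writer;  // item -> txn id
    std::vector<std::pair<uint64_t, uint64_t>> edges;    // (before, after)

    void on_read(uint64_t txn, uint64_t item) {
        auto it = last_writer.find(item);
        if (it != last_writer.end() && it->second != txn)
            edges.emplace_back(it->second, txn);         // RAW edge
    }

    void on_write(uint64_t txn, uint64_t item) {
        auto it = last_writer.find(item);
        if (it != last_writer.end() && it->second != txn)
            edges.emplace_back(it->second, txn);         // WAW edge
        last_writer[item] = txn;  // readers are not tracked: WAR is ignored
    }
};
```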
TIME TRAVELING TO ELIMINATE WAR
Xiangyao Yu, Srini Devadas
CSAIL, MIT