A Scalable Ordering Primitive for Multicore Machines. Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

Size: px

Start display at page:

Download "A Scalable Ordering Primitive for Multicore Machines. Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim"

Alexandra Fowler
5 years ago
Views:

1 A Scalable Ordering Primitive for Multicore Machines Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

2 Era of multicore machines 2

3 Scope of multicore machines Huge hardware thread parallelism How are operations executed correctly? Ordering Becomes scalability bottleneck... 3

4 Example: Read Log Update (RLU) Extension of RCU Modifes objects in a thread s local log Clock maintains correct snapshot (old vs new) Frees objects via epoch-based reclamation 4

5 Read Log Update (RLU) operation Global Clock (22) RLU header A A log/bufer to store copies (per-thread) B Log D B C D Read on start Q P Local Clock (22) E

6 RLU commit operation Global Clock (23) (22) 1. P updates clocks 2. P executes RCU-epoch Waits for Q to fnish B P Write Clock (23) ( ) D C Logical Clock maintains correctness/ordering A B C D E Maintained via atomic instructions FAA/CAS Q Local Clock (22) Q will read only old objects

7 Issue with logical clock RLU sufers from global clock contention Cache-line contention due to atomic instructions Possible to circumvent with our approach Phi 18 ARM 16 How can we achieve ordering with9 minimal timestamping overhead? 8 15 Ops/usec 6 Atomic 3 4 Ordo #cores #cores

8 Our proposed ordering primitive: Ordo Exposes a monotonically increasing clock Current hardware already provides rdtscp (X86), cntvct (ARM), stick (Sparc) Relies on a per-core invariant hardware clock Monotonically increases with constant skew regardless of dynamic frequency and voltage scaling 8

9 Challenges with Ordo Comparing two clocks Clocks are not synchronized Cores receive RESET signal at varying times Application: Modifying algorithms to use Ordo Able to compare between two timestamps 9

10 Embracing the invariant clocks Measure a global uncertainty window Ensure a new timestamp once a window is over Provides a notion of globally synchronized clock Measured ofset MUST have the invariant: Measured ofset is greater than the physical ofset Physical ofset: ofset due to RESET signal Measured ofset: physical ofset + one-way delay 1

11 Calculating global uncertainty window: ORDO_BOUNDARY Add one-way delay latency on each path C1 C1 C 2 2 C2 1) Calculate C1 timestamp 2) Notify C2 via memory 3) Get C2 timestamp T(C1): 4) Repeat steps 1-3 to get the minimum T(C2): 2 time 11

12 Calculating global uncertainty window: ORDO_BOUNDARY Add one-way delay latency on each path C1 C2 T(C2): 5 C2 C 1 3 Repeat prior steps in opposite direction Do not know which clock is ahead of the other T(C1): 8 time 12

13 Calculating global uncertainty window: ORDO_BOUNDARY Repeat steps for each pair of cores from C1 to Cn The maximum ofset is the ORDO_BOUNDARY C1 C 2 2 C2 C 1 3 C1 C2 C1 T(C2): 5 T(C1): C1 C2 3 C2 T(C2): 2 time T(C1): 8 13

14 Ordo application Applicable to any timestamp-based algorithm Expose Ordo API for these algorithms get_time(): Current hardware timestamp cmp_time(t1, t2): Compare two timestamps with uncertainty, if t1-t2 < ORDO_BOUNDARY new_time(t): Return tnew > (t + ORDO_BOUNDARY) Catch: Algorithms should handle uncertainty 14

15 Algorithms with Ordo handling uncertainty Physical to logical timestamping: Rely on cmp_time() to compare two timestamps Either defer or revert if comparison is uncertain Use new_time() to guarantee new time Physical timestamping: Use new_time() to access the global clock 15

16 Read Log Update (RLU Ordo Global ofset (3) A P B Log D B C D Read on start Q ) operation Local Clock (5) Q s core clock (5) P s local clock (22) E

17 RLU Global ofset (3) Ordo commit operation 1. P updates own clock 2. P executes RCU-epoch Waits for Q to fnish B A B P Write Clock (15) ( ) D C C D E Q Local Clock (5) Q will read only old objects

18 Algorithms modifed with Ordo RLU See er p a p our Transactional Locking (TL2) in STM Database concurrency control: OCC, MVCC Oplog used in Linux forking functionality 18

19 Evaluation Questions: Measured global ofset (ORDO_BOUNDARY) Maximum scalability of Ordo Ordo s impact on algorithms Machines confguration: 24 core, 8 socket Intel Xeon machine (Xeon) 256 core, Intel Xeon Phi (Phi) 96 core, 2 socket ARM machine (ARM) 32 core, 8 socket AMD machine (AMD) 19

20 Ofset between clocks Empirically measured ofset after reboots ORDO_BOUNDARY is the maximum ofset Machine Minimum (ns) Maximum (ns) Intel Xeon Intel Xeon phi 9 27 ARM 1 1,1 AMD

21 Timestamping with Ordo Ordo relies on hardware timestamping x faster than atomic increments Ops/usec/core 12 Xeon(Atomic) ARM(Atomic) 8 ARM(Ordo) Phi(Atomic) 8 Xeon(Ordo) 4 12 Ops/usec/core #core Phi(Ordo) AMD(Atomic) AMD(Ordo) #core 21

22 Scaling RLU with Ordo RLUOrdo is 2.1x faster on an average Still sufers from object copy and its locking RLU 2% Ops/usec Xeon Ops/usec 12 4 RLU(Ordo) 2% ARM Phi AMD #core 8 16 #core 22

23 Discussion and limitations Simplifes the design and understanding of algorithms Not a panacea Applicable when clock is contentious No skew consideration Thread ID-based timestamp comparison has its limitation 23

24 Conclusion Ordo is a scalable timestamping primitive Relies on invariant hardware clocks Exposes time-based API to the user Applied Ordo to fve concurrent algorithms Improves the scalability of algorithms by at most 39.7x across architectures 24

25 Backup Slides

26 Ofset between clocks Clocks are not synchronized 8th socket in Xeon and 2nd socket in ARM Results remain consistent even after reboots and measuring after a period of time Xeon # core Arm Ofset between clocks # core

27 Sensitivity of ORDO_BOUNDARY Varying ORDO_BOUNDARY from 1/8x 8x Cycles increases from K on Xeon machine Normalized throughput core 1-socket 8-sockets 27

28 Physical timestamping: Oplog Improves Exim performance by 1.9x at 24 cores 12k 1k Messages/sec Stock Oplog(Ordo) 8k 6k 4k 2k k #core

29 Scaling database concurrency control Improves OCC and MVCC by x for read-only (YCSB) OCCOrdo 1.24x faster than Tictoc and Silo (TPC-C) OCC OCC (Ordo) Xeon Txns/usec Txns/usec MVCC MVCC (Ordo) Phi ARM # core AMD 8 16 # core

30 Cannot use clock synchronization protocols No information on minimum bounds on message delivery between/among clocks Protocols introduce various errors Can lead to mis-synchronized clocks Larger or smaller than the actual physical ofset Lead to incorrect implementation of concurrent algorithms 3

Rethinking Serializable Multi-version Concurrency Control. Jose Faleiro and Daniel Abadi Yale University

Rethinking Serializable Multi-version Concurrency Control Jose Faleiro and Daniel Abadi Yale University Theory: Single- vs Multi-version Systems Single-version system T r Read X X 0 T w Write X Multi-version