ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

Size: px

Start display at page:

Download "ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu"

Grant McDowell
5 years ago
Views:

1 ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

2 int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded

3 int threadsafe_get_seqno() { acquire(lock); int ret=++seqno; release(lock); return ret; } // < 10 Million ops/s

4 10 MCS MUTEX TTAS CLH TAS 8 Throughput (Mops) Hardware threads

5 why so slow?

8 ~70 ns intra-socket latency ~14 Mops

9 QPI QPI Quad-Socket Intel System QPI M cachelines/s gigabyte/s QPI ~200ns O/W latency = 5 million ops per second

10 THREAD 1 THREAD 2 critical section interconnect latency wait for lock 400x wait for lock interconnect latency critical section wait for lock critical section

11 THREAD 1 THREAD 2 THREAD3 critical section wait for lock critical section wait for lock 600x wait for lock wait for lock critical section wait for lock critical section

12 client dedicated server thread client client client

13 CLIENT1 DEDICATED SERVER THREAD request wait for request 400x wait for critical section wait for request wait for critical section

14 CLIENT1 DEDICATED SERVER THREAD CLIENTn request wait for request request 400x still! wait for critical section critical section wait for wait for request wait for critical section wait for critical section

15 CLIENT1 DEDICATED SERVER THREAD CLIENTn wait for request 400x still! wait for critical section critical section critical section critical section critical section critical section wait for critical request section wait for wait for wait for wait wait for for wait for wait for wait for critical section critical section critical section critical section wait for wait for wait for wait for wait for

16 (fast, fly-weight delegation) ffwd design write all s, thread group N SERVER read & act on all thread group 0 requests READ client request one line per thread CLIENTS WRITE write request to server client request for each of N thread groups read & act on all thread group 1 requests write all s, thread group 0 WRITE shared server one line per group of 15 threads READ spin on server shared server (128 bytes) client request (64 bytes) return values[15] toggle bits toggle bit function pointer arg count argv[6] Server acts upon pending requests in batches 15 clients. Each group of 15 clients shares one 128-byte line pair. One dedicated 64-byte request line, per client-server pair Requests are sent synchronously

17 A request in more detail client request (64 bytes) toggle bit function pointer arg count argv[6] shared server (128 bytes) return values[15] toggle bits request is new if: request toggle bit!= toggle bit server calls function with (64-bit) arguments provided client polls line until toggle bit == bit

18 delegation server thread group 0 requests local buffer..111

19 delegation server thread group 0 requests local buffer..110

20 delegation server thread group 0 requests local buffer..100

21 delegation server thread group 0 requests local buffer..100

22 delegation server thread group 0 requests local buffer..100 global buffer modified < cache lines > shared

23 group 0 requests local buffer..100 global buffer..100 modified < cache lines > shared

24 group shared 0 requests < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared

25 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared

26 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared

27 shared < -requests- > modified local buffer..100 global buffer..100 modified < cache lines > shared

28 performance evaluation

29 evaluation systems 4 16-core Xeon E5-4660, Broadwell, 2.2 GHz 4 8-core Xeon E5-4620, Sandy Bridge-EP, 2.2 GHz 4 8-core Xeon E7-4820, Westmere-EX, 2.0 GHz 4 8-core AMD Opteron 6378, Abu Dhabi, 2.4 GHz

30 application benchmarks Same benchmarks as in Lozi et al. (RCL) [USENIX ATC 12] programs that spend large % of time in critical sections Except BerkeleyDB - ran out of time

31 raytrace-car (SPLASH-2) FFWD MCS MUTEX TAS FC RCL Duration (ms) mutex ffwd 0 MCS RCL # of threads RCL experienced correctness issues above 82 threads.

32 application benchmarks comparing best performance (any thread count) for all methods up to 2.5x improvement over pthreads, any thread count 10+ times speedup at max thread count

33 memcached-set 300 FFWD MCS MUTEX TAS RCL Duration (sec) pthread mutex # of threads RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.

34 microbenchmarks

35 ffwd is much faster on largely sequential data structures linked list (coarse locking), stack, queue fetch and add, for few shared variables for highly concurrent data structures, ffwd falls behind when the lock contention is low fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree

36 naïve 1024-node linked-list, coarse-grained locking FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET STM Throughput (Mops) Hardware threads

37 two-lock queue FFWD MCS MUTEX TTAS TICKET CLH HTICKET FC RCL CC DSM H MS SIM BLF Throughput (Mops) Hardware threads

38 stack FFWD MCS MUTEX TTAS TICKET CLH HTICKET FC RCL CC DSM H LF SIM BLF Throughput (Mops) Hardware threads

39 fetch-and-add, 1 variable FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL ATOMIC Throughput (Mops) hardware-provided atomic increment instruction! Hardware threads

40 ffwd is much faster on largely sequential data structures naïve linked list, stack, queue fetch and add, for few shared variables for highly concurrent data structures, fwd falls behind when there are many locks fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree

41 128-thread hash table FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET 175 Throughput (Mops) #buckets locking takes the lead when #locks ~ #threads

42 fetch-and-add, 128 threads FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL 300 Throughput (Mops) # of shared variables (locks)

43 fetch-and-add, 128 threads FFWD MCS MUTEX TTAS TICKET CLH TAS HTICKET FC RCL ATOMIC 300 Throughput (Mops) atomic increment > # of shared variables (locks)

44 ffwd is much faster on largely sequential data structures naïve linked list, stack, queue fetch and add, for few shared variables for highly concurrent data structures, once lock# is similar to thread#, fwd falls behind fetch and add, with many shared variables hashtable for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader lazy linked list binary search tree

45 128-thread lazy concurrent lists FFWD-LZ MCS-LZ MUTEX-LZ TTAS-LZ TICKET-LZ CLH-LZ TAS-LZ HTICKET-LZ HARRIS FC-LZ RCL-LZ Throughput (Mops) #elements

46 128-thread lazy (LZ) + skip (SK) lists FFWD-LZ FFWD-SK MCS-LZ MCS-SK MUTEX-LZ TTAS-LZ TICKET-LZ CLH-LZ TAS-LZ HTICKET-LZ HARRIS FC-LZ RCL-LZ Throughput (Mops) high concurrency, not very long list lower skip list complexity saves the day #elements single MCS lock skip list

47 binary search tree simple, unbalanced tree 50% queries, 50% updates all tree operations delegated for ffwd/rcl

48 128-thread binary search tree 50 FFWD RCL RCU RLU SWISSTM VRBTREE VRTREE Single threaded Throughput (Mops/s) ffwd is bounded by single-threaded performance K 8K 32K 128K Initial Size

49 128-thread tree + 4-way sharding ffwd 50 FFWD FFWD-S4 RCL RCU RLU SWISSTM VRBTREE VRTREE Single threaded Throughput (Mops) K 8K 32K 128K Tree size

50 what makes ffwd so fast? write all s, thread group N SERVER read & act on all thread group 0 requests READ client request one line per thread CLIENTS WRITE write request to server client request for each of N thread groups read & act on all thread group 1 requests write all s, thread group 0 WRITE shared server one line per group of 15 threads READ spin on server shared server (128 bytes) requests are virtually un-contended, toggle return values[15] bits contiguous in memory = happy server hardware pre-fetcher buffered s on server 15 s in one contiguous copy client request (64 bytes) 2 modified cache-lines instead of 15 toggle bit function pointer arg count argv[6] s are read-only on the client line never leaves the server L1 very light-weight processing on the server plenty of hand-tuning

51 Why isn t it even faster? Link bandwidth is 300 cache lines per link > 300+ Mops Latency suggests 2.5 Mops/client. 120 clients > 300 Mops Why are we only seeing 55 Mops? Processing limit? 55 Mops = 40 cycles per operation Insufficient concurrency: round-trip bandwidth-delay product is 120 cache lines server store / load buffers, reorder window size?

52 using ffwd Free C library available now (Rust is on the way) Some current limitations: delegated functions cannot, in turn, delegate functions delegated functions typically should not block (nor acquire locks) up to 6, 64-bit parameters currently assume one client per hardware thread

53 related work Remote Core Locking [Lozi, USENIX ATC 12] Barrelfish - delegation-based OS [Baumann, SOSP 09] Flat Combining [Hendler, SPAA 10] Log-based node replication [Calciu, ASPLOS 17]

54 in conclusion delegation is (much) faster than you thought it is easy to use, and has many attractive applications similar results on Intel Broadwell Intel Sandy Bridge Intel Westmere-EX, and AMD Abu Dhabi

55 questions? UIC has many open CS faculty positions this year, all areas libffwd, extended paper and more

56 comparing parallelism locking delegation critical section #locks #servers communication latency #locks #clients

Remote Core Locking. Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. to appear at USENIX ATC 12

Remote Core Locking. Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. to appear at USENIX ATC 12 Remote Core Locking Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications to appear at USENIX ATC 12 Jean-Pierre Lozi LIP6/INRIA Florian David LIP6/INRIA Gaël Thomas