An FPGA-based In-line Accelerator for Memcached

Size: px

Start display at page:

Download "An FPGA-based In-line Accelerator for Memcached"

Theodore Allison
6 years ago
Views:

1 An FPGA-based In-line Accelerator for Memcached MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU THE UNIVERSITY OF TEXAS AT AUSTIN 1

2 Challenges for Server Processors Workload changes Social networking Cloud applications Big data applications Power wall Dark silicon More and more cores insufficient Applications Architectures Devices 2

3 Memcached: A Highly-Used Server Application Web Server Database 2-Hit 1-Get Memcached An application level cache Database queries Server computations Used in social networking sites Facebook YouTube Twitter Reddit 3

4 Memcached: A Highly-Used Server Application Web Server 1-Get 5-Set On Memcached Miss Access database Update Memcached (Set) 3-Request 2-Miss 4-Reply Memcached Database 4

5 Simplified Memcached Control Flow Read-parse command Update Retrieval Remove Set CAS Get Gets Delete Flush Statistics Append / Prepend Add / Inc / Dec Find item Find item in hash Find old item Create new item and/or detach the old item Copy the data to new item Send data Remove the expired item Remove the item Attach the new item Clean up/acknowledge 5

6 Memcached on an FPGA? Rx Fifo FPGA Memcached control flow graph Update Retrieva l Rem ove s Network Tx Fifo t t Memory system Pros: Specialization benefit Cons: Complex to be implemented as hardware Memcached entirely on an FPGA Waste of hardware resources Deal with application modifications 6

7 Hybrid Architecture: Optimize the common case Rx Fifo FPGA Memcached control flow graph Update Retrieva l Rem ove s Network Tx Fifo t t Size problem App Modifications? Common case extraction May depend on state General purpose core $ Interconnect/Shared memory 7

8 CAL 2013 In-Line Acceleration 8

9 In-line Acceleration Fast Path: Hot Trace Network NIC/In-line accelerator Rx Fifo FPGA resources Tx Fifo Compiler Update t t Application Control Flow Graph Retrieva l Rem ove s Control / Status General purpose core $ Memory Last level cache 9

10 In-line Acceleration Slow Path: Original App Network NIC/In-line accelerator Rx Fifo FPGA resources Tx Fifo Application Control Flow Graph Ap Update pe nd /P re pe nd Retrieva l Rem ove s Control / Status General purpose core $ Memory Last level cache 10

11 Bail out Issue Problem: Fast path may modify some global data and later decides to bail out Transfer the computation to the slow path without causing any inconsistency Solution: Roll back global updates User-defined roll back routine Many server applications are roll back friendly Rollback code is already available in transactional applications (i.e. databases) Partial updates are isolated to provide atomicity Memcached bail out : 30 Lines of Code 11

12 In-line Accelerator Generation Process Input application Profiling Supported applications Small Hot trace Roll back friendly Worker thread model Synthesizes the code based on pre-defined Templates: Multiple threads Multiple engines (FPGA 12) Slicing Programming Template-based HLS tool Verilog model Write the roll back code Annotate memory types Packet IO Shared coherent Private Required programming << Hardware design efforts 12

13 Memcached Results 13

14 The Memcached In-line Accelerator Single engine/single thread - 5% of FPGA s LUTs FPGA Slice registers: 6570 Slice LUTs: 8739 Clock cycle: 7.30 ns Xilinx Virtex 6vhx250T 14

15 Throughput K Gets/sec Assumes accelerator has dedicated last level cache port Fast Path Performance/Power Watt, 590K RPS, 590 Reqs/J 8 Watts, 175K RPS, 21.8 Reqs/J 27x more energy efficient One FPGA engine vs. single Xeon core Offered load K Gets/sec 1x Xeon core 2x Xeon core 1x FPGA-based accelerator 2x FPGA-based accelerator Xeon frequency: 2.2 Ghz FPGA frequency: 100 Mhz 15

16 K Gets/sec Fast Path/Slow path Interaction Cost Cache coherency, Locking overhead Item size (bytes) Less overhead for larger items 0 K sets/sec 10 K sets/sec 15 K sets/sec 20 K sets/sec Full system stack - Gem5 simulation of Client and server Assumes accelerator has dedicated last level cache port Server: LLC-connected FPGA-based accelerator (single engine fast path) + one Alpha core (slow path) 16

17 Assumes each engine has dedicated last level cache port Projecting the Hybrid Architecture Performance Active Xeon cores Accelerator engines Processor + FPGA power (watts) Performance (requests/sec) Xeon processor M 15.2 K Xeon processor + In-line accelerator M 48 K Energy efficiency (requests/j) 17

18 Conclusion/Future work Conclusion Accelerated Memcached Performance/energy efficiency: Important Software-like programmability: Also important Future work In-line acceleration of other applications Tradeoffs in the selection of fast path Alternative bail out solutions Reduce cache port assumptions 18

19 Thank you This material is based upon work supported in part by the National Science Foundation under grants and

20 Recent work Reference Platform Main contribution Berezecki et al., IGCC 2011 Hetherington et al., ISPASS 2012 Chalamalasetti et al., FPGA 2012 Lim et al., ISCA 2013 Blott et al., HotClouds 2013 Lavasani et al., CAL 2013 Tilera GPU FPGA ASIC/FPGA + General purpose cores FPGA + General purpose cores FPGA + General purpose cores Many-small-cores are more efficient GPU can improve the performance Get requests on FPGA Get on FPGA, the rest on CPU Improving the throughput Automatic slicing and HW generation 20

21 Generating Memcached In-line Accelerator Input application Profiling Slicing Programming Template-based HLS Memcached LOCs User-level working thread: Libevent Profiled and sliced Hot trace: Almost all get operations on UDP Hot routines LOCs: 963 Hot instructions: <2000 Bail-out code is 30 lines of code Memory type annotation Verilog model C simulation model 21

22 IP Routers Are Similar Network Rx Fifo Tx Fifo Line cards Network processor Routing application control flow graph (abstract) Update re t Retrieva l Rem ove s t Control plane Interconnect network 22

23 Synchronization/Address Translation Network Rx Fifo Tx Fifo NIC/In-line accelerator FPGA resources Update t Application Control Flow Graph Retrieva l Rem ove s t Lock table TLB General purpose core $ DRAM Last level cache 23

LegUp: Accelerating Memcached on Cloud FPGAs

0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are