Memcached Design on High Performance RDMA Capable Interconnects

1 Memcached Design on High Performance RDMA Capable Interconnects
Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda
Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University, USA

2 Outline
- Introduction
- Overview of Memcached
- Modern High Performance Interconnects
- Unified Communication Runtime (UCR)
- Memcached Design using UCR
- Performance Evaluation
- Conclusion & Future Work

3 Introduction
- Tremendous increase in interest in interactive web-sites (social networking, e-commerce etc.)
- Dynamic data is stored in databases for future retrieval and analysis
- Database lookups are expensive
- Memcached: a distributed memory caching layer, implemented using traditional BSD sockets
- Socket interface provides portability, but entails additional processing and multiple message copies
- High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
- Low latency, high bandwidth, low CPU overhead
- Many machines in the Top500 list

5 Memcached Overview
[Figure: Internet, proxy servers (Memcached clients), Memcached servers and database servers, connected through system area networks]
- Memcached provides scalable distributed caching
- Spare memory in data-center servers can be aggregated to speed up lookups
- Basically a key-value distributed memory store
- Keys can be any character strings, typically MD5 sums or hashes
- Typically used to cache database queries, results of API calls or webpage rendering elements
- Scalable model, but typical usage is very network intensive: performance is directly related to that of the underlying networking technology
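
The caching pattern described on this slide can be sketched in a few lines. This is only an illustration: `TinyCache` is a hypothetical in-process stand-in for a Memcached server (it is not the libmemcached API), and `slow_database_lookup` is a made-up placeholder for an expensive query.

```python
import hashlib


class TinyCache:
    """Hypothetical in-process stand-in for a Memcached server."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # Returns None on a miss, like a cache lookup that finds nothing.
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cache_key(query):
    # Keys can be any string; an MD5 digest yields a short fixed-length key.
    return hashlib.md5(query.encode("utf-8")).hexdigest()


def slow_database_lookup(query, calls):
    # Placeholder for the expensive database query; records each call.
    calls.append(query)
    return "rows for: " + query


def cached_lookup(cache, query, calls):
    key = cache_key(query)
    value = cache.get(key)
    if value is None:
        # Miss: go to the database once, then keep the result in the cache.
        value = slow_database_lookup(query, calls)
        cache.set(key, value)
    return value
```

Calling `cached_lookup` twice with the same query reaches the database only once; in a real deployment every `get` and `set` is a network round trip, which is why the interconnect matters.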

7 Modern High Performance Interconnects
[Figure: five protocol stacks, each shown as application interface, protocol implementation, network adapter and network switch]
- 1/10 GigE: Sockets; kernel-space TCP/IP over the Ethernet driver; 1/10 GigE adapter; Ethernet switch
- IPoIB: Sockets; kernel-space TCP/IP over IPoIB; InfiniBand adapter; InfiniBand switch
- 10 GigE-TOE: Sockets; TCP/IP with hardware offload; 10 GigE adapter; 10 GigE switch
- SDP: Sockets; SDP over RDMA; InfiniBand adapter; InfiniBand switch
- IB Verbs: IB Verbs; RDMA in user space; InfiniBand adapter; InfiniBand switch

8 Problem Statement
- High-performance RDMA capable interconnects have emerged in the scientific computation domain
- Applications using Memcached are still relying on sockets
- Performance of Memcached is critical to most of its deployments
- Can Memcached be re-designed from the ground up to utilize RDMA capable networks?

9 A New Approach using Unified Communication Runtime (UCR)
[Figure: two stacks side by side]
- Current approach: Application -> Sockets -> 1/10 GigE network
- Our approach: Application -> UCR -> IB Verbs -> RDMA capable networks (IB, 10GE, iWARP, RoCE...)
- Sockets are not designed for high performance
- Stream semantics are often a mismatch for upper layers (Memcached, Hadoop)
- Multiple copies can be involved
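
To illustrate the stream-semantics mismatch: a byte stream has no message boundaries, so a message-oriented layer has to add its own framing and reassemble messages on receipt. The sketch below uses a generic 4-byte length prefix over a local socket pair; the helper names are made up for this example and this is not Memcached's wire protocol.

```python
import socket
import struct


def send_msg(sock, payload):
    # Prefix each message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    # A stream may deliver fewer bytes than requested, so loop until done.
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the stream early")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    # Read the length prefix first, then exactly that many payload bytes.
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo():
    a, b = socket.socketpair()
    try:
        # Two back-to-back sends arrive as one undifferentiated byte stream.
        send_msg(a, b"get key1")
        send_msg(a, b"set key2 value")
        return [recv_msg(b), recv_msg(b)]
    finally:
        a.close()
        b.close()
```

The extra length parsing and the intermediate joins in `recv_exact` are small examples of the additional processing and copying that a message-based interface can avoid.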

11 Unified Communication Runtime (UCR)
- Initially proposed to unify communication runtimes of different parallel programming models
- J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH (PGAS '10)
- Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)
- UCR provides interfaces for Active Messages as well as one-sided put/get operations
- Enhanced APIs to support Cloud computing applications
- Several enhancements in UCR: end-point based design, revamped active-message API, fault tolerance and synchronization with timeouts
- Communications are based on endpoints, analogous to sockets

12 Active Messaging in UCR
- Active messages are proven to be very powerful in many environments
- GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.
- We introduce Active messages into the data-center domain
- An Active Message consists of two parts: header and data
- When the message arrives at the target, the header handler is run
- Header handler identifies the destination buffer for the data
- Data is put into the destination buffer
- Completion handler is run afterwards (optional)
- Special flags to indicate local & remote completions (optional)
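
The handler sequence on this slide can be mimicked in a single process. The sketch below is not the UCR API: `Target`, `deliver` and the `tag`/`length` header fields are hypothetical names chosen for illustration, and the "network" is just a function call.

```python
class Target:
    """Hypothetical receiver side of an active message."""

    def __init__(self):
        self.buffers = {}
        self.completed = []

    def header_handler(self, header):
        # Identify (here: allocate) the destination buffer for the data.
        buf = bytearray(header["length"])
        self.buffers[header["tag"]] = buf
        return buf

    def completion_handler(self, header):
        # Runs after the data has landed; optional in the scheme described.
        self.completed.append(header["tag"])


def deliver(target, header, data, run_completion=True):
    # 1. The header arrives and the header handler picks the destination.
    dest = target.header_handler(header)
    # 2. The data is placed into that destination buffer.
    dest[:] = data
    # 3. The completion handler runs afterwards, if requested.
    if run_completion:
        target.completion_handler(header)
```

For example, `deliver(Target(), {"tag": "k1", "length": 5}, b"hello")` leaves the payload in the buffer registered under `"k1"` and records one completion.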

13 Active Messaging in UCR (contd.)
[Figure: origin/target timelines for two cases]
- General Active Message Functionality: origin sends the header; target runs the header handler; data is transferred by RDMA; LComplFlag, ComplFlag and RComplFlag are set; completion handler runs
- Optimized Short Active Message Functionality: origin sends header + data; target runs the header handler and copies the data; LComplFlag, ComplFlag and RComplFlag are set; completion handler runs

15 Memcached Design using UCR
[Figure: a sockets client and an RDMA client contact the master thread, which hands them to sockets worker threads or verbs worker threads; all workers access shared data (memory slabs, items)]
- Server and client perform a negotiation protocol
- Master thread assigns clients to the appropriate worker thread
- Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread
- All other Memcached data structures are shared among RDMA and sockets worker threads
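
A minimal sketch of the assignment step, assuming the negotiation has already told the master which transport a client uses. The `Master` class and the round-robin policy are assumptions made for illustration; the slide does not say how a worker is chosen.

```python
from itertools import cycle


class Master:
    """Hypothetical master thread logic: route clients to worker pools."""

    def __init__(self, sockets_workers, verbs_workers):
        # One round-robin iterator per transport (assumed policy).
        self._pools = {
            "sockets": cycle(sockets_workers),
            "rdma": cycle(verbs_workers),
        }
        self.bindings = {}

    def assign(self, client_id, transport):
        # A client stays bound to the worker it was first assigned to.
        if client_id not in self.bindings:
            self.bindings[client_id] = next(self._pools[transport])
        return self.bindings[client_id]
```

With two verbs workers, three RDMA clients land on `verbs-0`, `verbs-1`, `verbs-0`, and a repeated call for the first client returns the same worker, reflecting the binding described above.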

17 Experimental Setup
- Used two clusters
- Intel Clovertown
  - Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 25 GB hard disk
  - Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
- Intel Westmere
  - Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 16 GB hard disk
  - Network: 1GigE, IPoIB, and IB (QDR)
- Memcached software
  - Memcached Server:
  - Memcached Client: (libmemcached) .45

18 Memcached Get Latency (Small Message)
[Figure: time (us) vs. message size up to 4K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Get latency
  - 4 bytes: DDR: 6 us; QDR: 5 us
  - 4K bytes: DDR: 2 us; QDR: 12 us
- Almost a factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster
- Almost a factor of seven improvement over IPoIB for 4KB on the QDR cluster

19 Memcached Get Latency (Large Message)
[Figure: time (us) vs. message size from 4K to 512K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Get latency
  - 8K bytes: DDR: 17 us; QDR: 13 us
  - 512K bytes: DDR: 362 us; QDR: 94 us
- Almost a factor of three improvement over 10GE (TOE) for 512KB on the DDR cluster
- Almost a factor of four improvement over IPoIB for 512KB on the QDR cluster

20 Memcached Set Latency (Small Message)
[Figure: time (us) vs. message size up to 4K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Set latency
  - 4 bytes: DDR: 7 us; QDR: 5 us
  - 4K bytes: DDR: 15 us; QDR: 13 us
- Almost a factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster
- Almost a factor of six improvement over IPoIB for 4KB on the QDR cluster

21 Memcached Set Latency (Large Message)
[Figure: time (us) vs. message size from 4K to 512K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Set latency
  - 8K bytes: DDR: 18 us; QDR: 15 us
  - 512K bytes: DDR: 375 us; QDR: 185 us
- Almost a factor of two improvement over 10GE (TOE) for 512KB on the DDR cluster
- Almost a factor of three improvement over IPoIB for 512KB on the QDR cluster

22 Memcached Latency (10% Set, 90% Get)
[Figure: time (us) vs. message size up to 4K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Get latency
  - 4 bytes: DDR: 7 us; QDR: 5 us
  - 4K bytes: DDR: 15 us; QDR: 12 us
- Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster

23 Memcached Latency (50% Set, 50% Get)
[Figure: time (us) vs. message size up to 4K for SDP, IPoIB, OSU Design, 1G and 10G-TOE; panels: Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]
- Memcached Get latency
  - 4 bytes: DDR: 7 us; QDR: 5 us
  - 4K bytes: DDR: 15 us; QDR: 12 us
- Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster

24 Memcached Get TPS (4 bytes)
[Figure: thousands of transactions per second (TPS) with 8 and 16 clients; Intel Clovertown Cluster (IB: DDR): SDP, 10G-TOE, OSU Design; Intel Westmere Cluster (IB: QDR): SDP, IPoIB, OSU Design]
- Memcached Get transactions per second for 4 bytes
  - On IB DDR about 6K/s for 16 clients
  - On IB QDR 1.9M/s for 16 clients
- Almost a factor of six improvement over 10GE (TOE)
- Significant improvement with native IB QDR compared to SDP and IPoIB
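
As a rough sanity check, latency and throughput can be related with a simple closed-loop model. This is an assumption made here for illustration, not a description of how the benchmark was run: each client keeps exactly one request in flight, so aggregate throughput cannot exceed the number of clients divided by the per-request latency.

```python
def max_tps(clients, latency_us):
    """Upper bound on transactions per second under the assumed model:
    every client issues one blocking request at a time."""
    return clients * 1_000_000 / latency_us


# Values as reported on the slides for IB QDR: 16 clients, 5 us Get latency.
bound = max_tps(16, 5)
```

Under this model the bound for 16 clients at 5 us is 3.2 million transactions per second, so the reported 1.9M/s on IB QDR sits below it, as expected once server-side processing and contention are included.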

25 Memcached Get TPS (4KB)
[Figure: transactions per second (TPS) with 8 and 16 clients; Intel Clovertown Cluster (IB: DDR): SDP, 10G-TOE, OSU Design; Intel Westmere Cluster (IB: QDR): SDP, IPoIB, OSU Design]
- Memcached Get transactions per second for 4K bytes
  - On IB DDR about 33K/s for 16 clients
  - On IB QDR 842K/s for 16 clients
- Almost a factor of four improvement over 10GE (TOE)
- Significant improvement with native IB QDR

27 Conclusion & Future Work
- Described a novel design of Memcached for RDMA capable networks
- Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA and 10GE networks
- Observed significant performance improvement with the proposed design
  - Factor of four improvement in Memcached get latency (4K bytes)
  - Factor of six improvement in Memcached get transactions/s (4 bytes)
- We plan to improve UCR by taking into account the many features in the OpenFabrics API, using Unreliable Datagram transport, and designing iWARP and RoCE versions of UCR, thereby scaling Memcached
- We are working on enhancing the Hadoop/HBase designs for RDMA capable networks

28 Thank You!
{jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
