High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM
Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu*
*Florida State University; Oak Ridge National Laboratory

Outline
- Background
- SHMEMCache Overview
- Challenges
- Design
- Experiments
- Conclusion

Distributed In-memory Key-Value Store
- Caches popular key-value (KV) pairs in memory, in a distributed manner, for fast access (mainly reads).
  - E.g., Memcached as used by Facebook serves more than 1 billion requests per second.
- Client-server architecture.
  - Client and server communicate over TCP/IP.
  - Key-value operations: SET(key, value); GET(key);
[Diagram: clients issue KV operations to in-memory servers (fast), which sit in front of a backend database on HDD (slow).]

Memcached Structure
- Two-stage hashing for locating a KV pair
  - The 1st stage hashes to a server; the 2nd stage hashes to a hash entry in that server.
  - A hash entry contains pointers used to locate KV pairs.
  - The server chases chained pointers to find the desired KV pair.
- Storage and eviction policy
  - Memory storage is divided into KV blocks of various sizes.
  - A KV pair is stored in the smallest possible KV block.
  - Least Recently Used (LRU) for evicting KV pairs.
[Diagram: the client hashes to a server (1st stage) and to a hash entry (2nd stage); the server follows the pointer chain from the hash entry to the KV block; old KV pairs are evicted.]
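
To make the two-stage lookup concrete, here is a minimal sketch in C; the hash function choice and helper names are hypothetical illustrations, not taken from Memcached's code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical two-stage hashing: the first hash picks a server, the second
 * hash picks a hash entry (bucket) inside that server's table, whose chained
 * pointers are then followed to the KV block. */

static uint64_t fnv1a_hash(const char *key, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 0x100000001b3ULL;                 /* FNV-1a prime */
    }
    return h;
}

/* Stage 1: which server holds the key. */
static int pick_server(const char *key, int num_servers)
{
    return (int)(fnv1a_hash(key, strlen(key)) % (uint64_t)num_servers);
}

/* Stage 2: which hash entry within that server's table. */
static size_t pick_bucket(const char *key, size_t num_buckets)
{
    /* Extra multiplicative mix so the two stages are not perfectly correlated. */
    uint64_t h = fnv1a_hash(key, strlen(key)) * 0x9e3779b97f4a7c15ULL;
    return (size_t)(h % num_buckets);
}
```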

PGAS and OpenSHMEM
- Partitioned Global Address Space (PGAS) provides a large, global memory address space for parallel programming.
  - Examples: UPC, CAF, SHMEM.
- OpenSHMEM is a recent standardization effort of SHMEM.
  - Participating processes are called Processing Elements (PEs).
  - A symmetric memory space that is globally visible: symmetric variables have the same name and the same address on every PE.
  - Communication through one-sided operations:
    - Put: write to remote symmetric memory.
    - Get: read from remote symmetric memory.
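
For readers unfamiliar with the model, a minimal OpenSHMEM example of symmetric memory and one-sided Put/Get; the calls are standard OpenSHMEM, but the program itself is purely illustrative.

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric variable: same name and same address on every PE. */
long shared_buf[8];

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    long value = 42, got = 0;

    if (me == 0 && shmem_n_pes() > 1) {
        /* One-sided Put: PE 0 writes into PE 1's shared_buf without
         * any action by PE 1. */
        shmem_long_put(&shared_buf[0], &value, 1, 1);
    }

    shmem_barrier_all();   /* make the Put visible everywhere */

    if (me == 0 && shmem_n_pes() > 1) {
        /* One-sided Get: PE 0 reads the value back from PE 1's memory. */
        shmem_long_get(&got, &shared_buf[0], 1, 1);
        printf("PE 0 read %ld from PE 1\n", got);
    }

    shmem_finalize();
    return 0;
}
```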

Our Objective
- OpenSHMEM's feature set is highly suitable for building a distributed in-memory key-value store:
  - Similar memory-style addressing: the client sees the server's memory storage directly.
  - One-sided communication relieves the server of participation in communication.
  - Portability across both commodity servers and HPC systems.
- We propose SHMEMCache, a high-performance key-value store built on top of OpenSHMEM.

Outline
- Background
- SHMEMCache Overview
- Challenges
- Design
- Experiments
- Conclusion

Structure of SHMEMCache
- The server stores KV pairs in symmetric memory.
- The client performs two types of KV operations:
  - Direct KV operation: the client directly accesses KV pairs.
  - Active KV operation: the client informs the server to conduct KV operations by writing messages.
[Diagram: PE 0 (client) holds a pointer directory and issues direct or active SET/GET over OpenSHMEM one-sided communication to PE 1 (server), whose symmetric memory holds the hash table, KV blocks, and message chunks.]
- There are three major challenges in realizing this structure.
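
A minimal sketch of how the server's symmetric storage might be laid out; the sizes and field names are assumptions for illustration, not the actual SHMEMCache data structures.

```c
#include <shmem.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_ENTRIES   (1u << 20)   /* hash entries                     */
#define KV_BLOCK_SIZE 1024         /* bytes per KV block               */
#define NUM_BLOCKS    (1u << 18)   /* KV blocks in the store           */
#define MSG_CHUNKS    256          /* message slots for active KV ops  */
#define MSG_SIZE      256          /* bytes per message slot           */

/* Because shmem_malloc() is a collective call that returns symmetric
 * addresses, a client can compute the remote address of any hash entry,
 * KV block, or message chunk and access it with one-sided Put/Get. */
typedef struct {
    uint64_t *hash_table;   /* packed (tag | offset) slots             */
    char     *kv_blocks;    /* KV blocks holding the KV pairs          */
    char     *msg_chunks;   /* mailbox written by active KV operations */
} kv_store_t;

static kv_store_t store;

static void kv_store_init(void)
{
    store.hash_table = shmem_malloc(NUM_ENTRIES * sizeof(uint64_t));
    store.kv_blocks  = shmem_malloc((size_t)NUM_BLOCKS * KV_BLOCK_SIZE);
    store.msg_chunks = shmem_malloc((size_t)MSG_CHUNKS * MSG_SIZE);
}
```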

Challenges #1
1. Concurrent reads/writes cause read-write and write-write races.
Solution: read-friendly exclusive write.
[Diagram: the client and server communicate over the same KV block, illustrating read-write and write-write races.]

Read-Friendly Exclusive Write
- Key issue: a concurrent write can invalidate both reads and other writes, so writes must be conducted exclusively.
- Solving the write-write race: use locks so that writes are exclusive.
  - Locks are expensive, so they are not used for the much more frequent reads.
- Solving the read-write race: a new optimistic concurrency control.
  - Existing solutions are heavyweight: checksums, cacheline versioning.
  - We propose an efficient head-tail versioning mechanism that combines locking and versioning operations to minimize overhead.

Read-Friendly Exclusive Write (cont.)
- Structure of a KV block: head version (e.g., v0) | KV pair | unused space | tail version + tag + lock bit.
- Write operation: (1) an atomic CAS for locking and writing the tail version; (2) a Put for writing the KV content and unlocking; (3) a Put for writing the head version.
- Read operation: a single Get that reads the whole KV block.
[Diagram: timeline of concurrent readers and writers. Writer A locks and writes v1 (exclusive write) while Writer B fails to acquire the lock; Readers A and B observe either v0 or v1; after Writer A unlocks, a subsequent writer locks and writes v2.]
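
A rough sketch of how the three-step write and the single-Get read could map onto OpenSHMEM calls. The block layout, function names, and the retry-on-version-mismatch check are my reading of the head-tail versioning mechanism, not the actual SHMEMCache code; the CAS call assumes OpenSHMEM 1.4 atomics.

```c
#include <shmem.h>
#include <stdint.h>
#include <string.h>

#define KV_DATA_SIZE 1008   /* hypothetical payload size inside one block */
#define LOCK_BIT     1ULL

/* Hypothetical KV block: head version in front, tail version (with the lock
 * bit folded into its low bit) at the end, KV payload in between. */
typedef struct {
    uint64_t head_version;
    char     data[KV_DATA_SIZE];
    uint64_t tail_version;
} kv_block_t;

/* Direct SET: (1) CAS locks and bumps the tail version, (2) one Put writes
 * the KV content and the unlocked tail version, (3) a Put updates the head
 * version so that head == tail again. */
static int kv_direct_set(kv_block_t *remote, uint64_t old_ver,
                         const char *val, size_t len, int server_pe)
{
    uint64_t new_ver = old_ver + 2;                 /* keep the lock bit clear */
    uint64_t locked  = new_ver | LOCK_BIT;

    uint64_t prev = shmem_uint64_atomic_compare_swap(
        &remote->tail_version, old_ver, locked, server_pe);
    if (prev != old_ver)
        return -1;                                  /* another writer holds the lock */

    kv_block_t local;
    memset(&local, 0, sizeof(local));
    memcpy(local.data, val, len);
    local.tail_version = new_ver;                   /* lock bit clear = unlocked */
    /* data[] and tail_version are contiguous here, so one Put covers the KV
     * content and the unlock together, as in step (2). */
    shmem_putmem(remote->data, local.data,
                 KV_DATA_SIZE + sizeof(uint64_t), server_pe);

    shmem_putmem(&remote->head_version, &new_ver,
                 sizeof(uint64_t), server_pe);      /* step (3) */
    return 0;
}

/* Direct GET: one Get of the whole block, then validate head == tail and the
 * lock bit is clear; a mismatch means a concurrent write, so the caller retries. */
static int kv_direct_get(kv_block_t *remote, kv_block_t *out, int server_pe)
{
    shmem_getmem(out, remote, sizeof(*out), server_pe);
    if (out->head_version != out->tail_version || (out->tail_version & LOCK_BIT))
        return -1;   /* torn read, retry */
    return 0;
}
```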

Challenges #2
1. Concurrent reads/writes cause read-write and write-write races.
2. Remotely locating a KV pair incurs costly remote pointer chasing.
Solution: set-associative hash table and pointer directory.
[Diagram: the client follows a chain of pointers from the hash entry to the KV block across the network, illustrating remote pointer chasing.]

Mitigating Remote Pointer Chasing
- Set-associative hash table
  - Multiple pointers (sub-entries) are fetched with one hash lookup.
  - If the pointer is not found in a hash entry, only the server chases pointers.
- Pointer directory (on the client)
  - A new pointer is stored after the first read/write of a KV pair.
  - Recency and count fields indicate how recently and how frequently a KV pair has been accessed; they help decide which pointers to keep in the directory.
[Diagram: a hash value indexes a main entry in the hash table; each main entry holds several (tag, pointer) sub-entries into the KV store; the client-side pointer directory caches (tag, pointer, recency, count) tuples.]
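A possible layout of the set-associative hash entry and the client-side pointer directory; the field widths, associativity, and directory size are illustrative guesses, not the paper's exact format.

```c
#include <stdint.h>

#define WAYS_PER_ENTRY 4      /* sub-entries fetched with one hash lookup */
#define DIR_ENTRIES    256    /* size of the client-side pointer directory */

/* One sub-entry packs a short tag (to filter false matches) with an offset
 * into the server's KV-block region, so several candidates come back in a
 * single remote read of the main entry. */
typedef struct {
    uint16_t tag;           /* a few bits of the key's hash               */
    uint64_t kv_offset;     /* offset of the KV block in symmetric memory */
} sub_entry_t;

typedef struct {
    sub_entry_t ways[WAYS_PER_ENTRY];   /* set-associative main entry */
} hash_entry_t;

/* Client-side pointer directory: caches pointers learned from earlier
 * operations, plus recency/count hints used to pick victims for replacement. */
typedef struct {
    uint16_t tag;
    uint64_t kv_offset;
    uint32_t recency;       /* coarse-grained timestamp of last access */
    uint32_t count;         /* how often this KV pair has been touched */
} dir_entry_t;

static dir_entry_t pointer_directory[DIR_ENTRIES];
```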

Challenges #3
1. Concurrent reads/writes cause read-write and write-write races.
2. Remotely locating a KV pair incurs costly remote pointer chasing.
3. The server and client are unaware of each other's access and eviction actions.
Solution: coarse-grained cache management.
[Diagram: the server evicts KV pairs without the client being aware, while the client accesses KV blocks without the server being aware.]

Coarse-grained Cache Management
- Relaxed recency definition: from a point in time to a time range.
- Non-intrusive recency update
  - When the recency changes, the client updates it directly using an atomic CAS.
- Batch eviction and lazy invalidation
  - The server evicts the KV pairs with the oldest recency in batches and notifies the client of the new expiration bar.
  - A KV pair with a higher recency than the evicted batch has its recency updated; a new KV pair is inserted at the top.
  - The client lazily invalidates pointers whose recency falls below the expiration bar.
[Diagram: recency values from 1000 down to 0; the server batch-evicts entries below the expiration bar (= 100); the client performs non-intrusive recency updates, inserts new entries at the top, and lazily invalidates directory pointers.]
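
A hedged sketch of what the non-intrusive recency update and the lazy invalidation check could look like with OpenSHMEM atomics. The coarse epoch source, the expiration-bar comparison, and all names are my reading of the slide, not the actual implementation.

```c
#include <shmem.h>
#include <stdint.h>
#include <time.h>

#define EPOCH_SECONDS 10   /* hypothetical width of one coarse recency range */

/* Coarse-grained recency: the epoch only changes every EPOCH_SECONDS, so most
 * accesses find the cached recency still current and skip the remote update. */
static uint64_t current_epoch(void)
{
    return (uint64_t)time(NULL) / EPOCH_SECONDS;
}

/* Non-intrusive recency update: the client bumps the recency stored in the
 * server's symmetric memory with an atomic CAS, without involving the
 * server's CPU. */
static void update_recency(uint64_t *remote_recency, uint64_t cached_recency,
                           int server_pe)
{
    uint64_t now = current_epoch();
    if (now == cached_recency)
        return;   /* recency range has not changed; no remote traffic needed */

    /* Best effort: if another client already advanced it, that is fine too. */
    (void)shmem_uint64_atomic_compare_swap(remote_recency, cached_recency,
                                           now, server_pe);
}

/* Lazy invalidation: a directory pointer is only checked against the server's
 * latest expiration bar when the client is about to use it. */
static int pointer_is_valid(uint64_t pointer_recency, uint64_t expiration_bar)
{
    return pointer_recency > expiration_bar;
}
```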

Outline
- Background
- SHMEMCache Overview
- Challenges
- Design
- Experiments
- Conclusion

Experimental Setup
- Innovation: an in-house cluster with 21 dual-socket server nodes, each featuring 10 Intel Xeon cores and 64 GB of memory; all nodes are connected through an FDR InfiniBand interconnect with ConnectX-3 NICs.
- Titan supercomputer: a hybrid-architecture Cray XK7 system with 18,688 nodes, each equipped with a 16-core AMD Opteron CPU and 32 GB of DDR3 memory.
- Benchmarks: microbenchmarks and YCSB.
- OpenSHMEM versions
  - In-house cluster: Open MPI v1.10.3
  - Titan: Cray SHMEM v7.4.0

Latency
- Innovation cluster
  - SHMEMCache outperforms Memcached by 25x and 20x for SET and GET latency, respectively.
  - SHMEMCache outperforms HiBD by 128% and 16% for SET and GET latency, respectively.
[Fig. 1: SET latency; Fig. 2: GET latency. Time (µs, log scale) vs. value size (1 B to 512 KB) for Memcached, HiBD, and SHMEMCache.]

Latency (cont.)
- Titan supercomputer
  - Direct KV operations are consistently faster than active KV operations.
  - SHMEMCache achieves performance on Titan comparable to that on the Innovation cluster.
[Fig. 1: SET latency; Fig. 2: GET latency. Time (µs, log scale) vs. value size (1 B to 512 KB) for active and direct KV operations.]

Throughput
- Innovation cluster
  - SHMEMCache outperforms Memcached by 19x and 33x for 32-byte and 4-KB SETs, and by 14x and 30x for 32-byte and 4-KB GETs.
[Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s, log scale) vs. number of clients (1 to 16) for Memcached and SHMEMCache with 32-byte and 4-KB values.]

Throughput (cont.)
- Titan supercomputer
  - SHMEMCache scales well on Titan, up to 1,024 machines.
[Fig. 1: SET throughput; Fig. 2: GET throughput. Throughput (Kops/s, log scale) vs. number of clients (1 to 1,024) for 32-byte and 4-KB values.]

Hit Ratio of Pointer Directory
- Varying YCSB workloads and the number of entries in the pointer directory.
- A larger pointer directory increases the hit ratio, but the marginal benefit shrinks because the additional entries cover less popular KV pairs.
- There is no need to maximize the directory size at the cost of low space efficiency.
[Figure: pointer directory hit ratio for 128, 256, and 512 entries under four YCSB mixes (100% SET, 50% SET + 50% GET, 5% SET + 95% GET, 100% GET).]

Conclusion
- SHMEMCache: a high-performance distributed in-memory KV store built on top of OpenSHMEM.
- Novel designs tackle three major challenges:
  - Read-friendly exclusive write for resolving read-write and write-write races.
  - Set-associative hash table and pointer directory for mitigating remote pointer chasing.
  - Coarse-grained cache management for reducing the high cost of informing the server/client about access/eviction operations.
- Evaluation results show that SHMEMCache achieves high performance at a scale of 1,024 machines.

Acknowledgment

Thank You and Questions?