MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

Size: px
Start display at page:

Download "MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing"

Transcription

1 MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI

2 Goal: Improve Memcached 1. Reduce space overhead (bytes/key) 2. Improve performance (queries/sec) 2

3 Overview Previous Work: Sharding Avoid inter-thread synchronization e.g., dedicated cores [Berezecki11] Hotspot? Memory Efficiency? Our Approach: Algorithm Engineering Apply concurrent / space-efficient data structures MemC3 vs. Memcached 3x throughput 30% more small key-value items 3

4 Memcached Overview A DRAM-based key-value store GET(key) SET(key, value) LRU eviction for high hit rate Webserver Webserver Webserver Typical use: Speed up webservers Alleviate db load get(x) set(y,"123") get(z) Memcached server on miss Memcached server on miss Database 4

5 Watch the next talk! Typical Workloads Often used for small objects (Facebook [Atikoglu12] ) 90% keys < 31 bytes Some apps only use 2-byte values Tens of millions of queries per second for large memcached clusters (Facebook [Nishtala13] ) Small Objects, High Rate 5

6 Memcached: Core Data Structures Key-Value Index: Chaining hash table Hash table w/ chaining K V K V K V K V K V 6 Bin Fan April 13!!

7 Memcached: Core Data Structures Key-Value Index: Chaining hash table Hash table w/ chaining K V LRU Eviction: Doubly-linked lists Doubly-linked-list (for each slab) LRU header K V K V K V K V 7

8 Problems We Solve Single-node scalability Accessing hash table and updating LRU are serialized Space overhead 56-byte header per object Including 3 pointers and 1 refcount For a 100B object, overhead > 50% 8

9 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements described in paper 9

10 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements 10

11 Memcached Default Hash Table Chaining items hashed in same bucket: lookup K V K V K V Good: simple (Data Structure 101) Bad: low cache locality: (dependent pointer dereference) Bad: pointer costs space 11

12 Linear Probing Probing consecutive buckets for vacancy Good: simple lookup Good: cache friendly Bad: poor memory efficiency: ( if occupancy > 50%, lookup needs to search a long chain) 12

13 Cuckoo Hashing [Pagh04] Each key has two candidate buckets Assigned by hash 1 (key), hash 2 (key) Stored in one of its candidate buckets Lookup: read 2 buckets in parallel hash 1 (x) lookup x hash 2 (x) 0! 1! 2! 3! 4! 5! 6! 7! 13

14 Cuckoo Hashing [Pagh04] Each key has two candidate buckets Assigned by hash 1 (key), hash 2 (key) Stored in one of its candidate buckets Lookup: read 2 buckets in parallel Insert: Perform key displacement recursively Still O(1) on average [Pagh04] hash 1 (x) hash 1 (b) 0! 1! 2! 3! 4! insert x 5! hash 2 (c) hash 6! 2 (x) 7! a c b 14

15 Increase Set-Associativity a b Each bucket still fits in 1 cacheline b e f g h x 6 7 a x c d 2 cacheline-sized reads per lookup 50% space utilized 2 cacheline-sized reads per lookup 95% space utilized! 15

16 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements 16

17 During insertion: False Miss Problem always a floating item during insertion a reader may miss this floating item 0! 1! 2! 3! Floating Insert Item x 4! 5! b 6! a 7! 17

18 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! 2! 3! Insert x 4! 5! 6! 7! a b 18

19 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: Insert x 2! 3! 4! 5! 6! 7! a b 19

20 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: Insert x 2! 3! 4! 5! 6! 7! b a 20

21 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: 2! 3! 4! b a Insert x 5! 6! 7! 21

22 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: 2! 3! 4! b a 5! Only need to ensure each move is atomic w.r.t. reader 6! 7! x 22

23 How to Ensure Atomic Move e.g., move key b from bucket 4 to bucket 2 A simple implementation: Lock bucket 2 and 4 Move key Unlock bucket 2 and 4 Our approach: Optimistic locking 0! 1! 2! 3! 4! Optimized for read-heavy workloads Each key mapped to a version counter 5! Reader detects version change 6! (described in paper) 7! b 23

24 How to Coordinate Writers Simple (current) solution: Serialize inserts Works fine with read-heavy workload Ongoing work: allow multiple writers 24

25 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency 2ptr/key => 1bit/key, concurrent update Additional algo & tuning improvements 25

26 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency 2ptr/key => 1bit/key, concurrent update Additional algo & tuning improvements Avoid unnecessary full-key comparisons on hash collision 26

27 Problems We Solve Single-node scalability Accessing hash table and updating LRU are serialized GET requires no mutex Single-writer/multiple-reader Space overhead 56-byte header per object 3 pointers + 1 refcount => 1 pointer + 1 refbit 27

28 Hash Table Microbenchmark Million Lookups/sec Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache 6 local threads reading hash tables of ~1GB 60% 50% 40% 30% 20% 10% 0% Lookups all hit Base 21.54% 12.79% 68% 14.28% 1.74% 1.9% Chaining w/ Bkt Lock Optimistic Cuckoo Base Lookups all miss Chaining w/ Bkt Lock 47.89% Optimistic Cuckoo 235% 28

29 Hash Table Microbenchmark Million Lookups/sec Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache 6 local threads reading hash tables of ~1GB 60% 50% 40% 30% 20% 10% 0% Lookups all hit Base 21.54% 12.79% 68% 14.28% 1.74% 1.9% Chaining w/ Bkt Lock Optimistic Cuckoo Base Lookups all miss Chaining w/ Bkt Lock 47.89% Optimistic Cuckoo 235% 29

30 End-to-end Performance 16-Byte key, 32-Byte Value, 95% GET, 5% SET, zipf distributed 50 remote clients generate workloads 5" max tput 4.3 MOPS MQPS% 4" 3" 2" 1" max tput 1.5 MOPS Memcached MemC3 Sharding 0" 1" 2" 4" 6" 8" 10" 12" 14" 16" Number%of%Server%Threads%% max tput 0.6 MOPS 30

31 Conclusion Optimistic cuckoo hashing High space efficiency Optimized for read-heavy workloads Source Code available: github.com/efficient/libcuckoo MemC3 improves Memcached 3x throughput, 30% more (small) objects Optimistic Cuckoo Hashing, CLOCK, other system tuning 31

32 References [Atikoglu12] Workload analysis of a large-scale key- value store. [Berezecki11] Many-core key-value store [Nishtala13] Scaling Memcache at Facebook [Pagh04] Cuckoo hashing 32

33 Q & A Thanks! 33

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing To appear in Proceedings of the 1th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), Lombard, IL, April 213 MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter

More information

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing Bin Fan, David G. Andersen, Michael Kaminsky Carnegie Mellon University, Intel Labs CMU-PDL-12-116 November 2012 Parallel

More information

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

Cuckoo Filter: Practically Better Than Bloom

Cuckoo Filter: Practically Better Than Bloom Cuckoo Filter: Practically Better Than Bloom Bin Fan (CMU/Google) David Andersen (CMU) Michael Kaminsky (Intel Labs) Michael Mitzenmacher (Harvard) 1 What is Bloom Filter? A Compact Data Structure Storing

More information

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

Goals. Facebook s Scaling Problem. Scaling Strategy. Facebook Three Layer Architecture. Workload. Memcache as a Service.

Goals. Facebook s Scaling Problem. Scaling Strategy. Facebook Three Layer Architecture. Workload. Memcache as a Service. Goals Memcache as a Service Tom Anderson Rapid application development - Speed of adding new features is paramount Scale Billions of users Every user on FB all the time Performance Low latency for every

More information

Cuckoo Linear Algebra

Cuckoo Linear Algebra Cuckoo Linear Algebra Li Zhou, CMU Dave Andersen, CMU and Mu Li, CMU and Alexander Smola, CMU and select advertisement p(click user, query) = logist (hw, (user, query)i) select advertisement find weight

More information

FAWN. A Fast Array of Wimpy Nodes. David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan

FAWN. A Fast Array of Wimpy Nodes. David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan FAWN A Fast Array of Wimpy Nodes David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan Carnegie Mellon University *Intel Labs Pittsburgh Energy in computing

More information

Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores

Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores Mega KV: A Case for GPUs to Maximize the Throughput of In Memory Key Value Stores Kai Zhang Kaibo Wang 2 Yuan Yuan 2 Lei Guo 2 Rubao Lee 2 Xiaodong Zhang 2 University of Science and Technology of China

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

A Distributed Hash Table for Shared Memory

A Distributed Hash Table for Shared Memory A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash

More information

@ Massachusetts Institute of Technology All rights reserved.

@ Massachusetts Institute of Technology All rights reserved. ..... CPHASH: A Cache-Partitioned Hash Table with LRU Eviction by Zviad Metreveli MASSACHUSETTS INSTITUTE OF TECHAC'LOGY JUN 2 1 2011 LIBRARIES Submitted to the Department of Electrical Engineering and

More information

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage MICA: A Holistic Approach to Fast In-Memory Key-Value Storage Hyeontaek Lim, Carnegie Mellon University; Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST); David G. Andersen, Carnegie

More information

From architecture to algorithms: Lessons from the FAWN project

From architecture to algorithms: Lessons from the FAWN project From architecture to algorithms: Lessons from the FAWN project David Andersen, Vijay Vasudevan, Michael Kaminsky*, Michael A. Kozuch*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Iulian Moraru, Sang

More information

RocksDB Key-Value Store Optimized For Flash

RocksDB Key-Value Store Optimized For Flash RocksDB Key-Value Store Optimized For Flash Siying Dong Software Engineer, Database Engineering Team @ Facebook April 20, 2016 Agenda 1 What is RocksDB? 2 RocksDB Design 3 Other Features What is RocksDB?

More information

Introducing the Cray XMT. Petr Konecny May 4 th 2007

Introducing the Cray XMT. Petr Konecny May 4 th 2007 Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions

More information

NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching

NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching Gage Eads 1, Juan A. Colmenares 2 1 EECS Department, University of California, Berkeley, CA, USA 2 Computer Science Laboratory,

More information

A distributed in-memory key-value store system on heterogeneous CPU GPU cluster

A distributed in-memory key-value store system on heterogeneous CPU GPU cluster The VLDB Journal DOI 1.17/s778-17-479- REGULAR PAPER A distributed in-memory key-value store system on heterogeneous CPU GPU cluster Kai Zhang 1 Kaibo Wang 2 Yuan Yuan 3 Lei Guo 2 Rubao Li 3 Xiaodong Zhang

More information

Hash Table Design and Optimization for Software Virtual Switches

Hash Table Design and Optimization for Software Virtual Switches Hash Table Design and Optimization for Software Virtual Switches P R E S E N T E R : R E N WA N G Y I P E N G WA N G, S A M E H G O B R I E L, R E N WA N G, C H A R L I E TA I, C R I S T I A N D U M I

More information

CPHash: A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek

CPHash: A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-211-51 November 26, 211 : A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek

More information

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory Pengfei Zuo, Yu Hua, and Jie Wu, Huazhong University of Science and Technology https://www.usenix.org/conference/osdi18/presentation/zuo

More information

Memshare: a Dynamic Multi-tenant Key-value Cache

Memshare: a Dynamic Multi-tenant Key-value Cache Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database

More information

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory

Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory Pengfei Zuo, Yu Hua, Jie Wu Huazhong University of Science and Technology, China 3th USENIX Symposium on Operating Systems

More information

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University

S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda. The Ohio State University Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data- Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H.-W. Jin and D. K. Panda The Ohio State University

More information

Dynacache: Dynamic Cloud Caching

Dynacache: Dynamic Cloud Caching Dynacache: Dynamic Cloud Caching Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh and Sachin Ka? Stanford University MIT 1 Driving Web Applica4on Performance Web- scale applica4ons heavily reliant on memory

More information

MULTI-THREADED QUERIES

MULTI-THREADED QUERIES 15-721 Project 3 Final Presentation MULTI-THREADED QUERIES Wendong Li (wendongl) Lu Zhang (lzhang3) Rui Wang (ruiw1) Project Objective Intra-operator parallelism Use multiple threads in a single executor

More information

SILT: A MEMORY-EFFICIENT, HIGH-PERFORMANCE KEY- VALUE STORE PRESENTED BY PRIYA SRIDHAR

SILT: A MEMORY-EFFICIENT, HIGH-PERFORMANCE KEY- VALUE STORE PRESENTED BY PRIYA SRIDHAR SILT: A MEMORY-EFFICIENT, HIGH-PERFORMANCE KEY- VALUE STORE PRESENTED BY PRIYA SRIDHAR AGENDA INTRODUCTION Why SILT? MOTIVATION SILT KV STORAGE SYSTEM LOW READ AMPLIFICATION CONTROLLABLE WRITE AMPLIFICATION

More information

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant Big and Fast Anti-Caching in OLTP Systems Justin DeBrabant Online Transaction Processing transaction-oriented small footprint write-intensive 2 A bit of history 3 OLTP Through the Years relational model

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, Miguel Castro Microsoft Research Abstract We describe the design and implementation of FaRM, a new main memory distributed

More information

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion, Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations

More information

FAWN as a Service. 1 Introduction. Jintian Liang CS244B December 13, 2017

FAWN as a Service. 1 Introduction. Jintian Liang CS244B December 13, 2017 Liang 1 Jintian Liang CS244B December 13, 2017 1 Introduction FAWN as a Service FAWN, an acronym for Fast Array of Wimpy Nodes, is a distributed cluster of inexpensive nodes designed to give users a view

More information

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

More information

Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems

Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems 1 Key-Value Store Clients

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

An FPGA-based In-line Accelerator for Memcached

An FPGA-based In-line Accelerator for Memcached An FPGA-based In-line Accelerator for Memcached MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU THE UNIVERSITY OF TEXAS AT AUSTIN 1 Challenges for Server Processors Workload changes Social networking Cloud

More information

Scalable Concurrent Hash Tables via Relativistic Programming

Scalable Concurrent Hash Tables via Relativistic Programming Scalable Concurrent Hash Tables via Relativistic Programming Josh Triplett September 24, 2009 Speed of data < Speed of light Speed of light: 3e8 meters/second Processor speed: 3 GHz, 3e9 cycles/second

More information

NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching

NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching NbQ-CLOCK: A Non-blocking Queue-based CLOCK Algorithm for Web-Object Caching Gage Eads Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2013-174

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

Concurrent Data Structures Concurrent Algorithms 2016

Concurrent Data Structures Concurrent Algorithms 2016 Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1 Data Structures (DSs) Constructs for efficiently storing and retrieving

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

High Performance Transactions in Deuteronomy

High Performance Transactions in Deuteronomy High Performance Transactions in Deuteronomy Justin Levandoski, David Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang Microsoft Research Overview Deuteronomy: componentized DB stack Separates transaction,

More information

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU) RDMA RDMA is a network feature that

More information

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis

More information

No compromises: distributed transactions with consistency, availability, and performance

No compromises: distributed transactions with consistency, availability, and performance No compromises: distributed transactions with consistency, availability, and performance Aleksandar Dragojevi c, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam,

More information

Panu Silvasti Page 1

Panu Silvasti Page 1 Multicore support in databases Panu Silvasti Page 1 Outline Building blocks of a storage manager How do existing storage managers scale? Optimizing Shore database for multicore processors Page 2 Building

More information

System and Algorithmic Adaptation for Flash

System and Algorithmic Adaptation for Flash System and Algorithmic Adaptation for Flash The FAWN Perspective David G. Andersen, Vijay Vasudevan, Michael Kaminsky* Amar Phanishayee, Jason Franklin, Iulian Moraru, Lawrence Tan Carnegie Mellon University

More information

GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores

GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores Conglong Li Carnegie Mellon University conglonl@cs.cmu.edu Alan L. Cox Rice University alc@rice.edu Abstract Memory-based key-value stores,

More information

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded int threadsafe_get_seqno()

More information

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL Building High Performance Apps using NoSQL Swami Sivasubramanian General Manager, AWS NoSQL Building high performance apps There is a lot to building high performance apps Scalability Performance at high

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

An efficient Unbounded Lock-Free Queue for Multi-Core Systems

An efficient Unbounded Lock-Free Queue for Multi-Core Systems An efficient Unbounded Lock-Free Queue for Multi-Core Systems Authors: Marco Aldinucci 1, Marco Danelutto 2, Peter Kilpatrick 3, Massimiliano Meneghin 4 and Massimo Torquati 2 1 Computer Science Dept.

More information

Enhancing the Scalability of Memcached Alex Wiggins Jimmy Langston, Ph.D.

Enhancing the Scalability of Memcached Alex Wiggins Jimmy Langston, Ph.D. INTEL CORPORATION Enhancing the Scalability of Memcached Alex Wiggins wigginal@engr.orst.edu Jimmy Langston, Ph.D. jimmy.langston@intel.com 5/17/2012 Memcached is an open-source, multi-threaded, distributed,

More information

High-Performance Key-Value Store On OpenSHMEM

High-Performance Key-Value Store On OpenSHMEM High-Performance Key-Value Store On OpenSHMEM Huansong Fu Manjunath Gorentla Venkata Ahana Roy Choudhury Neena Imam Weikuan Yu Florida State University Oak Ridge National Laboratory {fu,roychou,yuw}@cs.fsu.edu

More information

FAWN: A Fast Array of Wimpy Nodes

FAWN: A Fast Array of Wimpy Nodes FAWN: A Fast Array of Wimpy Nodes David G. Andersen, Jason Franklin, Michael Kaminsky *, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan Carnegie Mellon University, * Intel Labs SOSP 09 CAS ICT Storage

More information

Understanding Hardware Transactional Memory

Understanding Hardware Transactional Memory Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence

More information

A Scalable Lock Manager for Multicores

A Scalable Lock Manager for Multicores A Scalable Lock Manager for Multicores Hyungsoo Jung Hyuck Han Alan Fekete NICTA Samsung Electronics University of Sydney Gernot Heiser NICTA Heon Y. Yeom Seoul National University @University of Sydney

More information

Main-Memory Databases 1 / 25

Main-Memory Databases 1 / 25 1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low

More information

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design A Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Pijush Chakraborty (153050015)

More information

Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems

Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems CSE 6350 File and Storage System Infrastructure in Data centers Supporting Internet-wide Services Part II: Software Infrastructure in Data Centers: Key-Value Data Management Systems 1 Key-Value Store Clients

More information

FaRM: Fast Remote Memory

FaRM: Fast Remote Memory FaRM: Fast Remote Memory Problem Context DRAM prices have decreased significantly Cost effective to build commodity servers w/hundreds of GBs E.g. - cluster with 100 machines can hold tens of TBs of main

More information

Abstraction, Reality Checks, and RCU

Abstraction, Reality Checks, and RCU Abstraction, Reality Checks, and RCU Paul E. McKenney IBM Beaverton University of Toronto Cider Seminar July 26, 2005 Copyright 2005 IBM Corporation 1 Overview Moore's Law and SMP Software Non-Blocking

More information

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andrew Pavlo, Michael

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Architecture of a Real-Time Operational DBMS

Architecture of a Real-Time Operational DBMS Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc.

More information

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What

More information

Data Structure Engineering for High Performance Software Packet Processing

Data Structure Engineering for High Performance Software Packet Processing Data Structure Engineering for High Performance Software Packet Processing Dong Zhou dongz@cs.cmu.edu January 18, 2019 THESIS PROPOSAL Computer Science Department Carnegie Mellon University Pittsburgh,

More information

Mnemosyne Lightweight Persistent Memory

Mnemosyne Lightweight Persistent Memory Mnemosyne Lightweight Persistent Memory Haris Volos Andres Jaan Tack, Michael M. Swift University of Wisconsin Madison Executive Summary Storage-Class Memory (SCM) enables memory-like storage Persistent

More information

Closing the Performance Gap Between Volatile and Persistent K-V Stores

Closing the Performance Gap Between Volatile and Persistent K-V Stores Closing the Performance Gap Between Volatile and Persistent K-V Stores Yihe Huang, Harvard University Matej Pavlovic, EPFL Virendra Marathe, Oracle Labs Margo Seltzer, Oracle Labs Tim Harris, Oracle Labs

More information

Capriccio : Scalable Threads for Internet Services

Capriccio : Scalable Threads for Internet Services Capriccio : Scalable Threads for Internet Services - Ron von Behren &et al - University of California, Berkeley. Presented By: Rajesh Subbiah Background Each incoming request is dispatched to a separate

More information

No Compromises. Distributed Transactions with Consistency, Availability, Performance

No Compromises. Distributed Transactions with Consistency, Availability, Performance No Compromises Distributed Transactions with Consistency, Availability, Performance Aleksandar Dragojevic, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, Miguel

More information

Albis: High-Performance File Format for Big Data Systems

Albis: High-Performance File Format for Big Data Systems Albis: High-Performance File Format for Big Data Systems Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, Bernard Metzler, IBM Research, Zurich 2018 USENIX Annual Technical Conference

More information

Optimizing Flash-based Key-value Cache Systems

Optimizing Flash-based Key-value Cache Systems Optimizing Flash-based Key-value Cache Systems Zhaoyan Shen, Feng Chen, Yichen Jia, Zili Shao Department of Computing, Hong Kong Polytechnic University Computer Science & Engineering, Louisiana State University

More information

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University)

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Background: Memory Caching Two orders of magnitude more reads than writes

More information

XtraDB 5.7: Key Performance Algorithms. Laurynas Biveinis Alexey Stroganov Percona

XtraDB 5.7: Key Performance Algorithms. Laurynas Biveinis Alexey Stroganov Percona XtraDB 5.7: Key Performance Algorithms Laurynas Biveinis Alexey Stroganov Percona firstname.lastname@percona.com XtraDB 5.7 Key Performance Algorithms Focus on the buffer pool, flushing, the doublewrite

More information

Implementation and Evaluation of Moderate Parallelism in the BIND9 DNS Server

Implementation and Evaluation of Moderate Parallelism in the BIND9 DNS Server Implementation and Evaluation of Moderate Parallelism in the BIND9 DNS Server JINMEI, Tatuya / Toshiba Paul Vixie / Internet Systems Consortium [Supported by SCOPE of the Ministry of Internal Affairs and

More information

CS 261 Fall Caching. Mike Lam, Professor. (get it??)

CS 261 Fall Caching. Mike Lam, Professor. (get it??) CS 261 Fall 2017 Mike Lam, Professor Caching (get it??) Topics Caching Cache policies and implementations Performance impact General strategies Caching A cache is a small, fast memory that acts as a buffer

More information

Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO. Oliver Michel

Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO. Oliver Michel Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO Oliver Michel Network monitoring is important Security issues Performance issues Equipment failure Analytics Platform

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 11: File System Implementation Prof. Alan Mislove (amislove@ccs.neu.edu) File-System Structure File structure Logical storage unit Collection

More information

InnoDB: Status, Architecture, and Latest Enhancements

InnoDB: Status, Architecture, and Latest Enhancements InnoDB: Status, Architecture, and Latest Enhancements O'Reilly MySQL Conference, April 14, 2011 Inaam Rana, Oracle John Russell, Oracle Bios Inaam Rana (InnoDB / MySQL / Oracle) Crash recovery speedup

More information

Using RDMA Efficiently for Key-Value Services

Using RDMA Efficiently for Key-Value Services Using RDMA Efficiently for Key-Value Services Anuj Kalia, Michael Kaminsky, David G. Andersen Carnegie Mellon University, Intel Labs CMU-PDL-14-16 June 214 Parallel Data Laboratory Carnegie Mellon University

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Anti-Caching: A New Approach to Database Management System Architecture. Guide: Helly Patel ( ) Dr. Sunnie Chung Kush Patel ( )

Anti-Caching: A New Approach to Database Management System Architecture. Guide: Helly Patel ( ) Dr. Sunnie Chung Kush Patel ( ) Anti-Caching: A New Approach to Database Management System Architecture Guide: Helly Patel (2655077) Dr. Sunnie Chung Kush Patel (2641883) Abstract Earlier DBMS blocks stored on disk, with a main memory

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

NetCache: Balancing Key-Value Stores with Fast In-Network Caching NetCache: Balancing Key-Value Stores with Fast In-Network Caching Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica NetCache is a rack-scale key-value

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

NetCache: Balancing Key-Value Stores with Fast In-Network Caching

NetCache: Balancing Key-Value Stores with Fast In-Network Caching NetCache: Balancing Key-Value Stores with Fast In-Network Caching Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica NetCache is a rack-scale key-value

More information

GPUfs: Integrating a File System with GPUs. Yishuai Li & Shreyas Skandan

GPUfs: Integrating a File System with GPUs. Yishuai Li & Shreyas Skandan GPUfs: Integrating a File System with GPUs Yishuai Li & Shreyas Skandan Von Neumann Architecture Mem CPU I/O Von Neumann Architecture Mem CPU I/O slow fast slower Direct Memory Access Mem CPU I/O slow

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC Bojie Li * Zhenyuan Ruan * Wencong Xiao Yuanwei Lu Yongqiang Xiong Andrew Putnam Enhong Chen Lintao Zhang Microsoft Research

More information

arxiv: v1 [cs.ds] 17 May 2016

arxiv: v1 [cs.ds] 17 May 2016 Fast Concurrent Cuckoo Kick-out Eviction Schemes for High-Density Tables William Kuszmaul Stanford University 50 Serra Mall Stanford, CA 9305 kuszmaul@stanford.edu arxiv:605.05236v [cs.ds] 7 May 206 ABSTRACT

More information

Hashing and Natural Parallism. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Hashing and Natural Parallism. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Hashing and Natural Parallism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Sequential Closed Hash Map 0 1 2 3 16 9 buckets 2 Items h(k) = k mod 4 Art of Multiprocessor

More information

Cache Policies. Philipp Koehn. 6 April 2018

Cache Policies. Philipp Koehn. 6 April 2018 Cache Policies Philipp Koehn 6 April 2018 Memory Tradeoff 1 Fastest memory is on same chip as CPU... but it is not very big (say, 32 KB in L1 cache) Slowest memory is DRAM on different chips... but can

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition Chapter 8: Memory- Management Strategies Operating System Concepts 9 th Edition Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Increasing Performance for PowerCenter Sessions that Use Partitions

Increasing Performance for PowerCenter Sessions that Use Partitions Increasing Performance for PowerCenter Sessions that Use Partitions 1993-2015 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,

More information

Memory: Page Table Structure. CSSE 332 Operating Systems Rose-Hulman Institute of Technology

Memory: Page Table Structure. CSSE 332 Operating Systems Rose-Hulman Institute of Technology Memory: Page Table Structure CSSE 332 Operating Systems Rose-Hulman Institute of Technology General address transla+on CPU virtual address data cache MMU Physical address Global memory Memory management

More information