Indexing in RAMCloud. Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout. Stanford University

Similar documents
Scalable Low-Latency Indexes for a Key-Value Store Ankita Kejriwal

At-most-once Algorithm for Linearizable RPC in Distributed Systems

RAMCloud: A Low-Latency Datacenter Storage System Ankita Kejriwal Stanford University

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

FaRM: Fast Remote Memory

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout

CPS 512 midterm exam #1, 10/7/2016

CS3600 SYSTEMS AND NETWORKS

High-Level Data Models on RAMCloud

File Systems. Kartik Gopalan. Chapter 4 From Tanenbaum s Modern Operating System

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

Making RAMCloud Writes Even Faster

RAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University

Heckaton. SQL Server's Memory Optimized OLTP Engine

SMORE: A Cold Data Object Store for SMR Drives

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

File Systems: Interface and Implementation

File Systems: Interface and Implementation

Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage

Distributed File Systems II

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

The Care and Feeding of a MySQL Database for Linux Adminstrators. Dave Stokes MySQL Community Manager

Chapter 6. File Systems

NPTEL Course Jan K. Gopinath Indian Institute of Science

No compromises: distributed transactions with consistency, availability, and performance

Scalable Concurrent Hash Tables via Relativistic Programming

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

Chapter 7: File-System

The Google File System

NUMA replicated pagecache for Linux

The Google File System

JANUARY 20, 2016, SAN JOSE, CA. Microsoft. Microsoft SQL Hekaton Towards Large Scale Use of PM for In-memory Databases

Chapter 11: File System Implementation. Objectives

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

The Google File System

Arachne. Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Log-structured Memory for DRAM-based Storage

CS 550 Operating Systems Spring File System

C13: Files and Directories: System s Perspective

Data Analytics on RAMCloud

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

A Distributed Hash Table for Shared Memory

NUMA-aware OpenMP Programming

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout

Chapter 4 Data Movement Process

About Speaker. Shrwan Krishna Shrestha. /

Google File System. By Dinesh Amatya

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

CA485 Ray Walshe Google File System

CLOUD-SCALE FILE SYSTEMS

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

MapReduce. U of Toronto, 2014

The Google File System

File System Implementations

Chapter 12: File System Implementation

File System: Interface and Implmentation

CSE 153 Design of Operating Systems

ASN Configuration Best Practices

Dalí: A Periodically Persistent Hash Map

File-System Structure. Allocation Methods. Free-Space Management. Directory Implementation. Efficiency and Performance. Recovery

HYDRAstor: a Scalable Secondary Storage

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

File Systems: Interface and Implementation

Scaling for Humongous amounts of data with MongoDB

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

MySQL Cluster Ed 2. Duration: 4 Days

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

BigData and Map Reduce VITMAC03

Chapter 12: File System Implementation

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers

Chapter 11: Implementing File-Systems

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

GR Reference Models. GR Reference Models. Without Session Replication

Week 12: File System Implementation

Introduction to Distributed Data Systems

A Scalable Lock Manager for Multicores

Huge market -- essentially all high performance databases work this way

Eternal Story on Temporary Objects

NPTEL Course Jan K. Gopinath Indian Institute of Science

The Google File System (GFS)

What the Hekaton? In-memory OLTP Overview. Kalen Delaney

pgpool: Features and Development Tatsuo Ishii pgpool Global Development Group SRA OSS, Inc. Japan

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

Chapter 11: Implementing File Systems

Percolator. Large-Scale Incremental Processing using Distributed Transactions and Notifications. D. Peng & F. Dabek

Exploiting Commutativity For Practical Fast Replication

Indexing: Overview & Hashing. CS 377: Database Systems

Opendedupe & Veritas NetBackup ARCHITECTURE OVERVIEW AND USE CASES

Restoring from Mac and PC backups. John Steele (PC) Mike Burton (MAC)

Ceph: A Scalable, High-Performance Distributed File System

EMC Backup and Recovery for Microsoft SQL Server

CS720 - Operating Systems

CSE 124: Networked Services Fall 2009 Lecture-19

Stateless Network Functions:

Advanced Database Systems

Bioinformatics Framework

Transcription:

Indexing in RAMCloud Ankita Kejriwal, Ashish Gupta, Arjun Gopalan, John Ousterhout Stanford University

RAMCloud 1.0 Introduction Higher-level data models Without sacrificing latency and scalability Secondary Indexes: lookups and range queries on attributes that are not the primary key Indexing in RAMCloud Slide 2

Key Design Issues API and RAMCloud object format Index placement / partitioning Index memory allocation Failure / Restoration Consistency Indexing in RAMCloud Slide 3

Key Design Issues API and RAMCloud object format Index placement / partitioning Index memory allocation Failure / Restoration Consistency Indexing in RAMCloud Slide 4

Index Placement Lookup on index Index Key age Table 1 Table 1 return data Indexing in RAMCloud Slide 5

Option 1: Co-locate with data: Index Partitioning Multi lookup on index Index Key age Table 1, Part 1 Index Key age Table 1, Part 1 Index Key age Table 1, Part 1 Index Key age Table 1, Part 1 Table 1, Part 1 Table 1, Part 1 Table 1, Part 1 Table 1, Part 1 return data Option 2: Partition based on index key: Lookup in correct index partition Multi read matched data Index Key age Table 1 age : <= 50 Index Key age Table 1 age : > 50 Table 1, Part 1 Table 1, Part 2 return data Indexing in RAMCloud Slide 6

Index lookup: Assume data + index on n servers Opt 1: multilookup to n servers + local reads Opt 2: lookup to index server + multiread to x servers x [0, n-1] For small n: expect x n-1 For large n: expect x << n Option 2 more scalable Index entry format: Index Partitioning <index key, primary key hash> Indexing in RAMCloud Slide 7

Key Design Issues API and RAMCloud object format Index placement / partitioning Index memory allocation Failure / Restoration Consistency Indexing in RAMCloud Slide 8

Failure / Restoration Tablet Server Doesn t affect indexes Normal RAMCloud data recovery Index server Backup / Recover No backup / Rebuild Indexing in RAMCloud Slide 9

Failure/Restoration: Write Latency Index 2 memory write backup write (ultimately disk) Index 1 10 us 20 us 30 us time backup / recover no backup / rebuild Indexing in RAMCloud Slide 10

Failure/Restoration: Restoration Time Recovery: Similar to RAMCloud data recovery: 1-2 s Rebuild: Cost analysis: Se>ng Index par@@on to be recovered 1 GB Size of index entries 50 B (42 for key + 8 for keyhash) Num of index entries 2 * 10^7 Max memory bandwidth 35 GB/s master Memory bw with overheads 20 GB/s Hash table size (10% of total mem) 25 GB (for 256 GB machine) Time to scan hash table 1.25 s Time to compare hash info from bucket negligible Num objects to check if all match 2.5 * 10^9 (for 100B objects) Cache miss @me 0.5 * 10^9 cache miss / s Total cache miss @me 5.12 s Network Bandwidth 1 GB/s Time to transfer over network 1 s Index Time per object to insert 1.5 us Recovery Total @me to insert 30 s January 21, 2014 Master Secondary Indexing in RAMCloud Total @me to insert with paralleliza@on 1 s Slide 11

Memory Benchmark Random reads from array of 2 * 10^8 objects of size 64 B on rcmonster Aggregate bandwidth in GB/s 0 5 10 15 20 25 30 35 Where x is: 1 4 8 16 1 4 8 16 24 32 Reading x objects in parallel rcmonster: 2 x Xeon E5 2670@2.6GHz Indexing in RAMCloud Slide 12

Key Design Issues API and RAMCloud object format Index placement / partitioning Index memory allocation Failure / Restoration Consistency Indexing in RAMCloud Slide 13

At any time, data is consistent with index entries corresponding to it, if: If data X exists, X is reachable from all key indexes. returned to client is consistent with key used to look it up. Provides linearizability Tradeoff with performance Also desirable: Consistency Dangling pointers are not accumulating. Memory footprint will not increase beyond what is necessary. Indexing in RAMCloud Slide 14

Simple solution: Lock indexes and tablets for the entire duration of index update affects scalability and performance Our solution: Key Idea: Writing object is the commit point Interesting situations: For multi-threaded write/read, non-locking, no failures For multi-threaded write/write, non-locking, no failures Failure of an Index Server Failure of Master Server Consistency Indexing in RAMCloud Slide 15

Consistency Multi-threaded write/read, non-locking, no failures: Object Update There exists time x, s.t.: at time < x, client can lookup old data; at time >= x, it can lookup the new data. Foo: Bob Brown fname Index lname Index Bob, 4444444 Brown, 4444444 Indexing in RAMCloud Slide 16

Consistency Multi-threaded write/read, non-locking, no failures: Object Update There exists time x, s.t.: at time < x, client can lookup old data; at time >= x, it can lookup the new data. Step 1 Foo: Bob Brown Foo: Bob Brown fname Index Bob, 4444444 Bob, 4444444 Sam, 4444444 lname Index Brown, 4444444 Brown, 4444444 Smith, 4444444 Indexing in RAMCloud Slide 17

Consistency Multi-threaded write/read, non-locking, no failures: Object Update There exists time x, s.t.: at time < x, client can lookup old data; at time >= x, it can lookup the new data. Step 1 Step 2 Foo: Bob Brown Foo: Bob Brown Foo: Sam Smith fname Index Bob, 4444444 Bob, 4444444 Sam, 4444444 Bob, 4444444 Sam, 4444444 lname Index Brown, 4444444 Brown, 4444444 Smith, 4444444 Brown, 4444444 Smith, 4444444 Indexing in RAMCloud Slide 18

Consistency Multi-threaded write/read, non-locking, no failures: Object Update There exists time x, s.t.: at time < x, client can lookup old data; at time >= x, it can lookup the new data. Step 1 Step 2 Step 3 Foo: Bob Brown Foo: Bob Brown Foo: Sam Smith Foo: Sam Smith fname Index Bob, 4444444 Bob, 4444444 Sam, 4444444 Bob, 4444444 Sam, 4444444 Sam, 4444444 lname Index Brown, 4444444 Brown, 4444444 Smith, 4444444 Brown, 4444444 Smith, 4444444 Smith, 4444444 Indexing in RAMCloud Slide 19

Secondary Indexes: lookups & range queries on attributes that are not the primary key Key design issues: Index partitioning Co-locate with data Partition based on index key Failure / Restoration Backup / recover No backup / rebuild Summary Consistency: Linearizability Key idea: Writing object is the commit point Indexing in RAMCloud Slide 20

Thank you!