A Practical Scalable Distributed B-Tree


A Practical Scalable Distributed B-Tree
CS 848 Paper Presentation
Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah (PVLDB '08)
March 8, 2010
Presenter: Evguenia (Elmi) Eflov

Presentation Outline
1 Background: Problem, Distributed B+tree, Sinfonia
2 Distributed B-tree Implementation: Assumptions, Design of the B-tree, B-tree Operations, Transactions, Extensions
3 Experimental Results: Workload, Results
4 Discussion Questions

Distributed (key,value) Storage
The paper presents three motivating examples:
The back-end of a multiplayer game - multiplayer games need to store and manage data for thousands of players while providing low-latency access and very high data consistency
Metadata storage for a cluster file system - metadata access is often the bottleneck in such systems, and metadata changes, for example file renaming or relocation, need to be atomic
Secondary indexes - many applications require more than one index on a set of data to provide fast access based on different conditions; as in the two previous examples, data changes need to be atomic

B+tree
A B-tree is a tree data structure that stores values sorted by key and supports updates and lookups in logarithmic time
A B+tree is a variant of the B-tree in which inner nodes store only keys and pointers, while leaf nodes store the key-value pairs
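The inner/leaf split can be sketched as follows (a minimal in-memory illustration only; it is not the paper's on-server node layout, and the class and function names are mine):

```python
# Minimal sketch of B+tree node roles (illustration only).

class InnerNode:
    """Stores separator keys and child pointers; guides the search."""
    def __init__(self, keys, children):
        self.keys = keys            # sorted separator keys
        self.children = children    # len(children) == len(keys) + 1

class LeafNode:
    """Stores the actual key-value pairs."""
    def __init__(self, pairs):
        self.pairs = dict(pairs)

def lookup(node, key):
    """Descend from the root to the leaf that may hold `key`."""
    while isinstance(node, InnerNode):
        # Follow the child whose key range contains `key`.
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return node.pairs.get(key)

# Example: a two-level tree with separator key "m".
leaf1 = LeafNode([("a", 1), ("c", 2)])
leaf2 = LeafNode([("m", 3), ("z", 4)])
root = InnerNode(["m"], [leaf1, leaf2])
assert lookup(root, "c") == 2
assert lookup(root, "z") == 4
```

In the distributed setting of the paper, these nodes live on different servers, which is what makes the traversal a multi-server operation.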

Distributed B+tree

Sinfonia
Sinfonia is a distributed data storage service that provides ACID properties for application data
Sinfonia's data manipulation primitive is the minitransaction
A minitransaction:
Consists of three (possibly empty) sets of operations: comparisons, reads, and writes
Performs the reads and writes only if all of the comparisons succeed
Is executed as part of a two-phase commit; some varieties can be performed in a single phase
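The compare/read/write structure can be sketched as below. This is a hypothetical single-machine model of the semantics only; real minitransactions execute against multiple memory-node servers inside the two-phase commit, and the method names are mine:

```python
# Sketch of minitransaction semantics (local model; illustration only).

class Minitransaction:
    def __init__(self):
        self.compares = []   # (addr, expected_value)
        self.reads = []      # addr
        self.writes = []     # (addr, new_value)

    def cmp(self, addr, expected):
        self.compares.append((addr, expected))

    def read(self, addr):
        self.reads.append(addr)

    def write(self, addr, value):
        self.writes.append((addr, value))

    def execute(self, memory):
        # Reads and writes happen only if all comparisons succeed.
        if any(memory.get(a) != e for a, e in self.compares):
            return None  # aborted
        results = {a: memory.get(a) for a in self.reads}
        for a, v in self.writes:
            memory[a] = v
        return results

mem = {"x": 1, "y": 2}
t = Minitransaction()
t.cmp("x", 1)          # proceed only if x is unchanged
t.read("y")
t.write("y", 3)
assert t.execute(mem) == {"y": 2}
assert mem["y"] == 3
```

The comparison set is what lets a client validate its cached state atomically with its updates, which the B-tree design below relies on.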

Assumptions
The B-tree operates in a data center environment, which guarantees high-bandwidth, low-latency connections between client and server machines
Individual machines can fail without stalling the system, but network partitions will stall it
The B-tree is not expected to grow or shrink rapidly

Design of the B-tree

Design of the B-tree
Each server stores some number of inner and leaf nodes of the B-tree
Each server stores a version table covering all the inner nodes of the B-tree
Each client caches all inner nodes of the B-tree and uses this cache while executing a transaction
During a transaction, the client composes the set of reads and writes required
At commit time, a Sinfonia minitransaction performs the required B-tree operations on the server data
The B-tree client library adds comparisons to the minitransaction to guarantee data consistency

B-tree Operations Efficiency
Three techniques make the B-tree efficient:
Clients use optimistic concurrency control, which works well unless the B-tree is rapidly shrinking or growing
Since version numbers of the inner nodes are replicated at every server, inner-node versions can be checked at any server, for example at the server storing the leaf node being accessed
Inner B-tree nodes are lazily replicated by clients, so nodes that a particular client does not access may be stale or absent from its cache
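The optimistic scheme boils down to commit-time validation of the inner-node versions the client's cache relied on. A simplified sketch, with all names and the flat-dict storage being my assumptions:

```python
# Sketch of commit-time version validation (simplified illustration
# of the optimistic concurrency control described above).

def commit(server_versions, nodes_traversed, writes, storage):
    """Apply `writes` only if every inner node the client traversed
    still has the version the client's cache saw."""
    for node_id, cached_version in nodes_traversed:
        if server_versions.get(node_id) != cached_version:
            return False  # stale cache: abort, refresh, retry
    storage.update(writes)
    return True

versions = {"root": 7, "inner3": 2}
data = {}
# The client read root@7 and inner3@2 from its cache during traversal.
assert commit(versions, [("root", 7), ("inner3", 2)], {"leaf9": "v"}, data)
assert data == {"leaf9": "v"}
# A concurrent split bumped inner3's version: the commit aborts.
versions["inner3"] = 3
assert not commit(versions, [("root", 7), ("inner3", 2)], {"leaf9": "w"}, data)
```

Because the version table is replicated on every server, this validation can piggyback on whichever server holds the leaf being written, avoiding an extra round trip.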

Standard B-tree Operations

Migration Operations
The distributed B-tree supports the following additional operations:
Migrate(x, s) - migrates B-tree node x to server s
The following operations for multi-node migration:

Why are transactions required?
To guarantee data consistency, each data manipulation on the B-tree has to be performed atomically, for example renaming a file in the cluster file system or transferring an item and its payment between characters in a multiplayer game
While Sinfonia's minitransaction is sufficient to perform the necessary B-tree node manipulations, it is tedious for the user of the B-tree to code directly in terms of minitransactions
The B-tree therefore provides a transaction interface that lets the user define all the necessary Read and Write operations within a transaction, while the library adds the comparisons necessary to guarantee data consistency

Transaction Interface
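The transcription omits the interface figure; a hypothetical sketch of what such a client-side wrapper could look like (all names are illustrative, not the paper's actual API). The wrapper records the user's reads and writes and, at commit, submits one minitransaction containing the buffered writes plus version comparisons on the inner nodes it consulted:

```python
# Hypothetical client-side transaction wrapper (illustration only).

class FakeTree:
    """Stand-in for the distributed B-tree client library."""
    def __init__(self):
        self.data = {}                 # leaf contents (flattened)
        self.versions = {"root": 1}    # inner-node version table

    def traverse(self, key):
        # Return the value plus the inner-node versions consulted.
        return self.data.get(key), {"root": self.versions["root"]}

    def commit(self, versions_seen, writes):
        # One minitransaction: compare versions, then apply writes.
        if any(self.versions.get(n) != v for n, v in versions_seen.items()):
            return False
        self.data.update(writes)
        return True

class BTreeTransaction:
    def __init__(self, tree):
        self.tree = tree
        self.versions_seen = {}   # inner-node versions the reads relied on
        self.write_set = {}       # buffered updates

    def lookup(self, key):
        if key in self.write_set:
            return self.write_set[key]
        value, versions = self.tree.traverse(key)
        self.versions_seen.update(versions)
        return value

    def update(self, key, value):
        self.write_set[key] = value

    def commit(self):
        return self.tree.commit(self.versions_seen, self.write_set)

tree = FakeTree()
tx = BTreeTransaction(tree)
tx.lookup("a")          # records root@1 in the version set
tx.update("a", 42)
assert tx.commit()      # inner nodes unchanged, so the commit succeeds
assert tree.data["a"] == 42
```

The point of the wrapper is exactly what the previous slide argues: the user writes plain lookups and updates, and the consistency-preserving comparisons are added behind the scenes.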

Extensions
The following extensions are suggested to enhance the existing implementation:
Enhanced migration tasks - migration tasks that help the system adapt to seasonal variations or balance load more aggressively
Dealing with hot-spots - a migration task that migrates popular keys to different servers, including keys that are currently stored in the same node
(continued)

Extensions
Varying the replication factor of inner nodes - replicating version numbers of the lower-level inner nodes less aggressively could decrease the cost of modifying those nodes
Finer-grained concurrency control to avoid false sharing - if concurrency control operated on keys (or small groups of keys) rather than on whole nodes, the number of conflicts could be decreased

Experimental Setup
10-byte keys, 8-byte values, 12-byte pointers (4 bytes specify the server, 8 bytes the offset within the server)
4 KB nodes, with leaf nodes storing 220 key-value pairs and inner nodes storing 180 key-pointer pairs
Same number of servers and clients
Each client runs 4 parallel threads; each thread issues a new request as soon as the current request completes
Key space consists of 10^9 elements, with keys chosen uniformly at random for each operation
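The stated fan-outs are consistent with the 4 KB node size; a quick arithmetic check (the remark about header space is my inference, since the node layout is not given in the transcription):

```python
# Sanity-checking the stated fan-outs against the 4 KB node size.
NODE_SIZE = 4096
KEY, VALUE, POINTER = 10, 8, 12   # bytes, as stated above

LEAF_ENTRIES = 220    # key-value pairs per leaf node
INNER_ENTRIES = 180   # key-pointer pairs per inner node

leaf_bytes = LEAF_ENTRIES * (KEY + VALUE)        # 220 * 18 = 3960
inner_bytes = INNER_ENTRIES * (KEY + POINTER)    # 180 * 22 = 3960

# Both fit in a 4 KB node, leaving 136 bytes for per-node metadata.
assert leaf_bytes == 3960 and leaf_bytes <= NODE_SIZE
assert inner_bytes == 3960 and inner_bytes <= NODE_SIZE
```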

Workloads
The following workloads are used in both the scalability and migration experiments:
Insert
Lookup
Update - values for existing keys are updated
Mixed - 60% lookups and 40% updates
Before the insert workload, the B-tree was pre-populated with 40,000 elements rather than starting empty. Were the experiments performed in the order they are presented in?

Results of the Scalability Experiments

Results of the Migration Experiments
For the migration experiments, a Move task was performed by a migration client while the rest of the setup for the corresponding experiment was executing
The migration rate was 55.3 ± 2.7 nodes/s (around 10,000 key-value pairs/s) on an idle system, and fell below 5 nodes/s when executed alongside other tasks

Some Discussion Questions
Are the experiments representative of the workloads in the motivating examples?
Would larger transactions scale differently?
Could co-locating lower-level inner nodes with their corresponding leaf nodes increase the throughput of the system?