
Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Google, Inc. Presented by M. Burak ÖZTÜRK 1

Introduction Data Model API Building Blocks SSTable Implementation Tablet Location Tablet Assignment Tablet Serving Compactions Refinements Performance Evaluation Lessons Related Work Conclusions 2

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. 3

Rows The row keys in a table are arbitrary strings, currently up to 64KB in size. Every read or write of data under a single row key is atomic. Each contiguous row range is called a tablet, which is the unit of distribution and load balancing. 4

Column Families Column keys are grouped into sets called column families, which form the basic unit of access control. A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. 5

Timestamps Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. 6


The API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata, such as access control rights. Bigtable supports single-row transactions, but no general transactions across row keys. Client-supplied scripts, written in Sawzall, can also be executed, with limited usage. 8
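The paper illustrates this API with a short C++ example that batches a set and a delete into one RowMutation on the Webtable and applies it atomically. The sketch below paraphrases that flow in Python against a toy in-memory table; ToyTable and its methods are invented for illustration and are not the real client library.

```python
# Rough sketch (not the real client library) of a batched single-row mutation:
# sets and deletes for one row are applied together. In Bigtable this batch is
# atomic; here that guarantee is only noted, not enforced.
class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column key ("family:qualifier"): value}

    def apply(self, row_key, sets=(), deletes=()):
        """Apply one batched mutation to a single row."""
        row = self.rows.setdefault(row_key, {})
        for column, value in sets:
            row[column] = value
        for column in deletes:
            row.pop(column, None)

    def read(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

t = ToyTable()
# Mirrors the paper's Webtable example: write a new anchor, delete an old one.
t.apply("com.cnn.www",
        sets=[("anchor:www.c-span.org", "CNN")],
        deletes=["anchor:www.abc.com"])
print(t.read("com.cnn.www", "anchor:www.c-span.org"))  # -> CNN
```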

Bigtable uses the distributed Google File System (GFS) to store log and data files. The Google SSTable file format is used internally to store Bigtable data. 9

An SSTable provides an ordered, immutable map from keys to values and supports lookups over a specified key range. An SSTable contains a sequence of blocks; a block index, loaded into memory, is used to locate blocks, and a lookup is a binary search in that index followed by a single block read. Optionally, an SSTable can be completely mapped into memory. 10
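The sketch below shows the lookup path this implies under a toy layout: an in-memory block index records the last key of each block, so a point read costs one binary search in the index plus at most one block read. ToySSTable and its format are illustrative assumptions, not the real SSTable file format.

```python
# Simplified SSTable lookup: binary search in an in-memory block index,
# then a scan inside the single candidate block.
import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=2):
        self.blocks = []       # each block: a short sorted list of (key, value)
        self.index_keys = []   # last key of each block, used for bisect
        for i in range(0, len(sorted_items), block_size):
            block = sorted_items[i:i + block_size]
            self.blocks.append(block)
            self.index_keys.append(block[-1][0])

    def get(self, key):
        b = bisect.bisect_left(self.index_keys, key)  # find candidate block
        if b == len(self.blocks):
            return None                               # key is past the last block
        for k, v in self.blocks[b]:                   # search within the block
            if k == key:
                return v
        return None

sst = ToySSTable([("a", "1"), ("c", "3"), ("m", "13"), ("z", "26")])
print(sst.get("m"))  # -> 13
```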

Bigtable relies on a highly available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected master and serves requests. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. 11

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. 12

Bigtable uses a three-level hierarchy, analogous to that of a B+-tree, to store tablet location information. 13

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but it is treated specially: it is never split, so the hierarchy never grows beyond three levels. The client library caches tablet locations. 14
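A quick back-of-the-envelope check, using the sizes quoted in the paper (roughly 1KB per METADATA row and a 128MB limit per METADATA tablet), shows why three levels are enough:

```python
# Capacity of the three-level location scheme with the paper's numbers.
rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)  # 128MB / 1KB = 2**17 rows
addressable_tablets = rows_per_metadata_tablet ** 2      # root -> METADATA -> user
print(rows_per_metadata_tablet)  # 131072
print(addressable_tablets)       # 17179869184, i.e. 2**34 user tablets
```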

Bigtable uses Chubby to keep track of tablet servers: each tablet server acquires an exclusive lock on a uniquely named file in a Chubby directory that the master monitors. The master assigns a tablet by sending a tablet load request to a tablet server. If a tablet server loses its exclusive lock, it can attempt to reacquire it as long as its file still exists; if the file no longer exists, the tablet server kills itself. When a tablet server terminates, it attempts to release its lock so that the master can reassign its tablets more quickly. 15

When a master is started by the cluster management system, it performs the following steps. The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. The master scans the servers directory in Chubby to find the live servers. The master communicates with every live tablet server to discover what tablets are already assigned to each server. Finally, the master scans the METADATA table to learn the set of tablets. 16


As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This is called a minor compaction. 18
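A minimal sketch of that trigger, with a deliberately tiny threshold and an in-memory list standing in for SSTables written to GFS (the sizes and flush format are illustrative only):

```python
# Writes go to a memtable; when it crosses a size threshold, it is frozen,
# converted to a sorted immutable run ("SSTable"), and a fresh memtable starts.
MEMTABLE_LIMIT_BYTES = 64        # tiny threshold, just for the demo

memtable, memtable_bytes = {}, 0
sstables = []                    # frozen, sorted runs that would live in GFS

def write(key, value):
    global memtable, memtable_bytes
    memtable[key] = value
    memtable_bytes += len(key) + len(value)
    if memtable_bytes >= MEMTABLE_LIMIT_BYTES:
        sstables.append(sorted(memtable.items()))   # minor compaction
        memtable, memtable_bytes = {}, 0

for i in range(10):
    write(f"row-{i:02d}", "value-" + "x" * 10)
print(len(sstables), "SSTables flushed;", len(memtable), "key(s) still in memtable")
```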

A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished. A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction; it produces an SSTable that contains no deletion entries or deleted data. 19
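The heart of a merging compaction is a k-way merge over sorted runs in which the newest value for each key wins. A hedged sketch, ignoring the deletion markers and timestamps that a real compaction must also handle:

```python
# Merge several sorted runs (memtable first, then SSTables from newest to
# oldest) into one new sorted run, keeping only the newest entry per key.
import heapq

def merging_compaction(runs):
    """runs: newest-first list of sorted (key, value) lists; returns one list."""
    tagged = [[(key, age, value) for key, value in run]
              for age, run in enumerate(runs)]       # age breaks ties: newer wins
    out = []
    for key, age, value in heapq.merge(*tagged):
        if not out or out[-1][0] != key:             # first occurrence is newest
            out.append((key, value))
    return out

memtable  = [("b", "b-new")]
sstable_1 = [("a", "a-v2"), ("b", "b-old")]
sstable_2 = [("a", "a-v1"), ("c", "c-v1")]
print(merging_compaction([memtable, sstable_1, sstable_2]))
# -> [('a', 'a-v2'), ('b', 'b-new'), ('c', 'c-v1')]
```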

Clients can group multiple column families together into a locality group; a separate SSTable is generated for each locality group in each tablet, which makes reads more efficient. For example, page metadata (such as language and checksums) can be in one locality group, so an application that wants to read the metadata does not need to read through all of the page contents. 20

Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small window. Applied to Web page contents, this two-pass compression scheme achieved a 10-to-1 reduction in space. 21

Each tablet server appends mutations for all of its tablets to a single commit log. To avoid duplicating log reads during recovery, the commit log entries are first sorted in order of the keys (table, row name, log sequence number); as a result, the mutations for any one tablet are contiguous and can be read efficiently with one disk seek followed by a sequential read. Also, to protect mutations from GFS latency spikes, each tablet server actually has two log-writing threads, each writing to its own log file, with only one actively in use at a time. 22
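As a tiny illustration of that recovery-time sort (the tables, rows, and mutations below are made up), ordering the shared log by (table, row name, log sequence number) makes each tablet's mutations one contiguous range:

```python
# Sort commit-log entries by (table, row name, log sequence number) so that
# each recovering tablet server can read its tablets' mutations sequentially.
log_entries = [
    ("webtable", "com.cnn.www", 7, "set anchor"),
    ("users",    "alice",       3, "set name"),
    ("webtable", "com.aaa.www", 9, "set contents"),
    ("users",    "alice",       1, "create row"),
]
for table, row, seq, mutation in sorted(log_entries, key=lambda e: (e[0], e[1], e[2])):
    print(table, row, seq, mutation)
```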

In the read benchmarks, each client reads the string stored under a row key; the write benchmarks are the same, except that the client writes a string under the row key. The random read benchmark mirrors the operation of the random write benchmark. 23

The scan benchmark is similar to the sequential read benchmark, but uses support provided by the Bigtable API for scanning over all values in a row range. The random reads (mem) benchmark is similar to the random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, so reads are served from memory rather than from GFS. 24

The table shows the rate of operations per tablet server. 25


Random reads are slower than all other operations by an order of magnitude or more. Random and sequential writes perform better than random reads because each tablet server appends all incoming writes to a single commit log and uses group commit to stream the writes efficiently to GFS. Sequential reads perform better than random reads because each SSTable block fetched from GFS is cached and serves the subsequent reads in the same block. Scans are even faster because the tablet server can return a large number of values in response to a single client RPC. 27


Large distributed systems are vulnerable to many types of failures: memory and network corruption, extended and asymmetric network partitions, overflow of GFS quotas, and so on. Another lesson is the value of simple designs: whether or not the code and the system are huge, simple design helps greatly in code maintenance and debugging. 29

The Boxwood project has components that overlap in some ways with Chubby, GFS, and Bigtable. Oracle's Real Application Cluster database and IBM's DB2 Parallel Edition provide a complete relational model with transactions. C-Store and Bigtable share many characteristics, but C-Store behaves like a relational database. 30

Bigtable is a distributed system for storing structured data at Google. As of August 2006, more than sixty projects were using Bigtable. Its design and interface are different from those of a typical relational database. 31

QUESTIONS? M.Burak Öztürk burakozturk.cs@gmail.com 32