
Bigtable: A Distributed Storage System for Structured Data. By Fay Chang et al., OSDI 2006. Presented by Xiang Gao, 2014-11-05.

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

Google's Motivation. Lots of data: web contents, satellite data, user data, email, etc.; different projects/applications; billions of users; many incoming requests. Storage for structured data: no commercial system is big enough, and low-level storage optimization helps performance significantly.

Bigtable. A distributed multi-level map; fault-tolerant and persistent. Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans. Self-managing: servers can be added/removed dynamically and adjust to load imbalance.

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

Data Model. A sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row, column, timestamp) -> cell contents.

Data Model: Rows. Row keys are arbitrary strings; access to data in a row is atomic; rows are ordered lexicographically.

Data Model: Columns. Column keys have a two-level name structure, family:qualifier; the column family is the unit of access control.

Data Model: Timestamps. Timestamps store different versions of data in a cell. Lookup options: return the most recent n versions, or return only versions that are new enough.
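To make the model concrete, here is a minimal sketch (not the real Bigtable client): a plain Python dict stands in for the distributed map, and most_recent / newer_than are hypothetical helpers for the two lookup options above. The row and column values follow the paper's com.cnn.www example.

```python
# Illustration only: a plain dict stands in for the distributed sorted map,
# keyed by (row, column, timestamp) with uninterpreted byte-string values.

cells = {
    ("com.cnn.www", "contents:", 2): b"<html>v2...",
    ("com.cnn.www", "contents:", 3): b"<html>v3...",
    ("com.cnn.www", "anchor:cnnsi.com", 9): b"CNN",
}

def most_recent(row, column, n=1):
    """Return the n most recent versions of one cell, newest first."""
    versions = sorted((ts, val) for (r, c, ts), val in cells.items()
                      if r == row and c == column)
    return [val for _, val in reversed(versions)][:n]

def newer_than(row, column, min_ts):
    """Return only versions that are 'new enough' (timestamp >= min_ts)."""
    return [val for (r, c, ts), val in cells.items()
            if r == row and c == column and ts >= min_ts]

print(most_recent("com.cnn.www", "contents:"))     # [b'<html>v3...']
print(newer_than("com.cnn.www", "contents:", 3))   # [b'<html>v3...']
```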

Data Model. The row range for a table is dynamically partitioned; each row range is called a tablet, which is the unit of distribution and load balancing.

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

APIs. Metadata operations: create/delete tables, column families, etc. Writes: Set() writes cells in a row; DeleteCells() deletes cells in a row; DeleteRow() deletes all cells in a row. Reads: a Scanner reads arbitrary cells in a Bigtable; each row read is atomic; clients can restrict the returned rows to a particular range, ask for data from just one row, all rows, etc., and ask for all columns, just certain column families, or specific columns.
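The real client API is a C++ library built around RowMutation, Apply and Scanner; the sketch below is only a Python-flavored illustration of the operations listed above, using a hypothetical in-memory Table class.

```python
# Hypothetical in-memory Table illustrating Set / DeleteCells / DeleteRow
# and a simple Scanner-style range read. Not the real Bigtable client API.

class Table:
    def __init__(self):
        self.cells = {}                          # (row, column) -> value

    def set(self, row, column, value):
        """Set(): write a cell in a row."""
        self.cells[(row, column)] = value

    def delete_cells(self, row, column):
        """DeleteCells(): delete specific cells in a row."""
        self.cells.pop((row, column), None)

    def delete_row(self, row):
        """DeleteRow(): delete all cells in a row."""
        for key in [k for k in self.cells if k[0] == row]:
            del self.cells[key]

    def scan(self, start_row="", end_row="\xff", columns=None):
        """Scanner: iterate over rows in a range, optionally restricted
        to certain columns or column families."""
        for (row, col), val in sorted(self.cells.items()):
            if start_row <= row < end_row and (columns is None or col in columns):
                yield row, col, val

t = Table()
t.set("com.cnn.www", "anchor:cnnsi.com", b"CNN")
t.set("com.cnn.www", "anchor:my.look.ca", b"CNN.com")
print(list(t.scan(columns={"anchor:cnnsi.com"})))
# [('com.cnn.www', 'anchor:cnnsi.com', b'CNN')]
```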

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

Building Blocks. Bigtable uses the distributed Google File System (GFS) to store log and data files. The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values. Each SSTable contains a sequence of blocks; a block index (stored at the end of the SSTable) is used to locate blocks, and the index is loaded into memory when the SSTable is opened.
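A toy SSTable sketch under simplifying assumptions (JSON-encoded entries, tiny blocks, an 8-byte footer pointing at the block index): it is not Google's actual file format, but it shows the immutable sorted layout, the block index written at the end of the file, and how a lookup loads the index into memory and then reads a single block.

```python
# Toy SSTable: an immutable, sorted key->value file made of blocks, with a
# block index stored at the end of the file. Format and sizes are
# illustrative only, not Google's actual SSTable format.
import bisect, json

BLOCK_SIZE = 64                 # bytes per block in this toy (64 KB in Bigtable)

def write_sstable(path, items):
    """items: list of (key, value) pairs, already sorted by key."""
    index = []                                  # (first_key, offset, length)
    with open(path, "wb") as f:
        block, first_key = [], None
        def flush():
            nonlocal block, first_key
            if not block:
                return
            data = "\n".join(json.dumps(e) for e in block).encode()
            index.append((first_key, f.tell(), len(data)))
            f.write(data)
            block, first_key = [], None
        for key, value in items:
            if first_key is None:
                first_key = key
            block.append([key, value])
            if sum(len(json.dumps(e)) for e in block) >= BLOCK_SIZE:
                flush()
        flush()
        index_offset = f.tell()
        f.write(json.dumps(index).encode())      # block index at the end of the file
        f.write(index_offset.to_bytes(8, "big")) # footer points at the index

def open_sstable(path):
    """Load the block index into memory when the SSTable is opened."""
    with open(path, "rb") as f:
        f.seek(-8, 2)
        index_offset = int.from_bytes(f.read(8), "big")
        f.seek(index_offset)
        return json.loads(f.read()[:-8])

def lookup(path, index, key):
    """Binary-search the in-memory index, then read a single block."""
    i = bisect.bisect_right([entry[0] for entry in index], key) - 1
    if i < 0:
        return None
    _, offset, length = index[i]
    with open(path, "rb") as f:
        f.seek(offset)
        lines = f.read(length).decode().splitlines()
    for line in lines:
        k, v = json.loads(line)
        if k == key:
            return v
    return None

write_sstable("toy.sst", [("a", 1), ("b", 2), ("m", 3), ("z", 4)])
idx = open_sstable("toy.sst")
print(lookup("toy.sst", idx, "m"))   # 3
```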

Tablet and SSTables. [Figure: an SSTable is made up of a sequence of 64 KB blocks.]

GFS write flow

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

Implementation. A single-master distributed system with three major components. A library that is linked into every client. One master server: assigns tablets to tablet servers, detects the addition and expiration of tablet servers, balances tablet-server load, performs garbage collection, and handles metadata operations. Many tablet servers: each handles read and write requests to its tablets and splits tablets that have grown too large.

Implementation. [Architecture diagram: Bigtable clients use the Bigtable client library to send metadata operations to the Bigtable master, which performs metadata ops and load balancing, and to send reads/writes directly to the Bigtable tablet servers, which serve the data. Underneath, the cluster scheduling system handles failover, monitoring, and scheduling; GFS holds tablet data and logs; Chubby holds metadata and handles master election.]

Master Operation. Upon start-up the master needs to discover the current tablet assignment. It obtains the unique master lock in Chubby (preventing concurrent master instantiations), scans the servers directory in Chubby for live servers, communicates with every live tablet server to discover all assigned tablets, and scans the METADATA table to learn the full set of tablets; unassigned tablets are marked for assignment.
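A hedged sketch of that start-up sequence; the chubby, server and metadata objects are hypothetical stand-ins, since Chubby's client API and Bigtable's internal RPCs are not part of the paper's public interface.

```python
# Sketch of the master start-up sequence above, with hypothetical objects.

def master_startup(chubby, metadata):
    # 1. Grab the unique master lock so only one master runs at a time.
    master_lock = chubby.acquire("/bigtable/master-lock")

    # 2. Find live tablet servers: each holds a lock on a file in the
    #    Chubby "servers" directory.
    live_servers = chubby.list("/bigtable/servers")

    # 3. Ask every live server which tablets it is already serving.
    assigned = set()
    for server in live_servers:
        assigned |= set(server.list_tablets())

    # 4. Scan the METADATA table for the full set of tablets; anything not
    #    reported by a live server is marked for (re)assignment.
    unassigned = set(metadata.scan_all_tablets()) - assigned
    return master_lock, unassigned
```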

Master Operation. Detect tablet server failures/resumption: the master periodically asks each tablet server for the status of its lock. If a tablet server has lost its lock, or the master cannot contact the tablet server, the master attempts to acquire an exclusive lock on the server's file in the servers directory; if the master acquires the lock, the tablets assigned to that tablet server are reassigned to others. If the master loses its Chubby session, it kills itself and a new election is triggered.
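A similar hedged sketch of the failure-detection loop, again with hypothetical master and chubby objects rather than the real Chubby session machinery.

```python
# Sketch of the master's failure-detection loop, with hypothetical objects.
import time

def monitor_tablet_servers(master, chubby, poll_interval=10):
    while True:
        for server in list(master.live_servers):
            try:
                holds_lock = server.report_lock_status()   # "do you still hold your lock?"
            except ConnectionError:
                holds_lock = False
            if not holds_lock:
                # Try to take the server's lock ourselves; if Chubby grants it,
                # the server really cannot serve, so delete its file and
                # reassign its tablets to other servers.
                if chubby.try_acquire(server.lock_file):
                    chubby.delete(server.lock_file)
                    master.reassign(server.tablets)
                    master.live_servers.remove(server)
        if not chubby.session_alive():
            # Losing the Chubby session means this master may no longer be
            # unique: kill itself and let a new election be triggered.
            master.shutdown()
            return
        time.sleep(poll_interval)
```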

Implementation. Each tablet is assigned to one tablet server. A tablet holds a contiguous range of rows, and clients can often choose row keys to achieve locality. Aim for 100 MB to 200 MB of data per tablet; tablets that have grown too large are automatically split. A tablet server is responsible for 10-1000 tablets, and clients communicate directly with tablet servers.
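A minimal sketch of the "split tablets that have grown too large" policy; the tablet object (size_bytes, sorted_row_keys, restrict) is hypothetical, and the real server picks split points from its SSTable indexes rather than a simple midpoint over row keys.

```python
# Illustrative split policy only; sizes and the midpoint rule are simplifications.

TARGET_MAX_BYTES = 200 * 1024 * 1024      # aim for at most ~200 MB per tablet

def maybe_split(tablet):
    if tablet.size_bytes() < TARGET_MAX_BYTES:
        return [tablet]
    rows = tablet.sorted_row_keys()
    mid_row = rows[len(rows) // 2]
    # Two contiguous row ranges: [start, mid_row) and [mid_row, end).
    return [tablet.restrict(end_row=mid_row), tablet.restrict(start_row=mid_row)]
```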

Given a row, how do clients find the location of the tablet whose row range covers the target row? METADATA: key = table id + end row; data = location.

Tablet Locating. A 3-level hierarchy, analogous to that of a B+ tree, stores tablet location information: a file stored in Chubby contains the location of the root tablet; the root tablet contains the locations of the METADATA tablets (the root tablet never splits); each METADATA tablet contains the locations of a set of user tablets. The client library aggressively caches tablet locations.
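A sketch of the three-level lookup with aggressive client-side caching; the root and METADATA tablets are plain in-memory lists of (end row, location) pairs here, and level 1 (the Chubby file that points at the root tablet) is elided.

```python
# Sketch of tablet location lookup. Sorted (end_row, location) lists stand
# in for the real root and METADATA tablets.
import bisect

def covering_entry(entries, key):
    """entries: sorted list of (end_row, value). Return the value of the
    first entry whose end_row >= key, i.e. the tablet covering this row."""
    i = bisect.bisect_left([end for end, _ in entries], key)
    return entries[i][1]

class Locator:
    def __init__(self, root_tablet, metadata_tablets):
        self.root = root_tablet          # (table, end_row) -> METADATA tablet
        self.meta = metadata_tablets     # METADATA tablet -> [(end_row, user tablet location)]
        self.cache = {}                  # client library aggressively caches locations

    def locate(self, table, row):
        key = (table, row)
        if key in self.cache:
            return self.cache[key]
        meta_tablet = covering_entry(self.root, key)              # root tablet
        location = covering_entry(self.meta[meta_tablet], key)    # METADATA tablet
        self.cache[key] = location
        return location

root = [(("users", "m"), "meta-tablet-1"), (("users", "\xff"), "meta-tablet-2")]
meta = {
    "meta-tablet-1": [(("users", "f"), "tabletserver-7"), (("users", "m"), "tabletserver-3")],
    "meta-tablet-2": [(("users", "\xff"), "tabletserver-9")],
}
print(Locator(root, meta).locate("users", "harry"))   # tabletserver-3
```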

Tablet Server. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory (call this the servers directory). A tablet server stops serving its tablets if it loses its exclusive lock; this may happen if a network connection failure causes the tablet server to lose its Chubby session.

Tablet Server. A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, the tablet server will never be able to serve again, so it kills itself. At some point it can restart and go back to the pool of unassigned tablet servers.
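A sketch of the tablet server's Chubby-lock life cycle from the last two slides; the chubby client object and the serve_once callback (which would handle read/write requests) are hypothetical stand-ins.

```python
# Sketch of the tablet server's lock life cycle, with hypothetical objects.
import sys, time

def tablet_server_loop(chubby, server_name, serve_once):
    lock_file = f"/bigtable/servers/{server_name}"    # file in the servers directory
    lock = chubby.create_and_lock(lock_file)          # exclusive lock on start-up
    while True:
        if lock.held():
            serve_once()                              # serve tablets only while the lock is held
        elif not chubby.exists(lock_file):
            # Our file is gone: this server can never serve again, so kill
            # ourselves (the process may later rejoin the pool of unassigned servers).
            sys.exit(1)
        else:
            lock = chubby.try_reacquire(lock_file)    # keep trying while the file exists
        time.sleep(1)
```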

Tablet Server Failover. [Figure: the master, via Chubby, detects a failed tablet server (a message sent to the tablet server by the master goes unanswered); other tablet servers are drafted to serve the abandoned tablets. Logically, the tablets move to the surviving servers; physically, the underlying SSTables are already replicated across GFS chunkservers, a backup copy of the tablet's data is made primary, and an extra replica is created automatically by GFS. Image source: http://itsitrc.blogspot.ca/2013/02/bigtable-used-by-google.html]

Tablet Server. The commit log stores the updates that are made to the data; recent updates are stored in the memtable (a sorted buffer in memory); older updates are stored in SSTable files. [Figure: a read operation consults the memtable in memory and the SSTable files; a write operation goes to the tablet log and the memtable; the tablet log and SSTable files are stored in GFS.]
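A minimal, self-contained sketch of this serving path: a write is appended to the commit log and applied to the memtable, and a read merges the memtable with the SSTables, newest first; plain dicts stand in for SSTable files.

```python
# Toy tablet serving path: commit log + memtable + SSTables (dicts).

class Tablet:
    def __init__(self, log_path):
        self.log = open(log_path, "a")   # commit log of redo records
        self.memtable = {}               # recent updates (sorted buffer in memory)
        self.sstables = []               # older updates, newest first

    def write(self, row, column, value):
        self.log.write(f"{row}\t{column}\t{value}\n")   # 1. append to the commit log
        self.log.flush()
        self.memtable[(row, column)] = value            # 2. apply to the memtable

    def read(self, row, column):
        # Merged view: memtable first, then SSTables from newest to oldest.
        if (row, column) in self.memtable:
            return self.memtable[(row, column)]
        for sstable in self.sstables:
            if (row, column) in sstable:
                return sstable[(row, column)]
        return None

t = Tablet("tablet.log")
t.write("com.cnn.www", "contents:", "<html>...")
print(t.read("com.cnn.www", "contents:"))   # <html>...
```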

Compactions. Minor compaction: triggered when the memtable reaches a threshold; reduces memory usage and reduces log traffic on restart. Merging compaction: performed periodically; reduces the number of SSTables; a good place to apply policies such as keeping only N versions. Major compaction: results in only one SSTable, with no deletion records, only live data.
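Continuing the toy tablet above, a sketch of the three compaction kinds; a DELETED marker stands in for Bigtable's deletion records so that the major compaction can drop them.

```python
# Toy compactions over the Tablet class above; dicts stand in for SSTables.

DELETED = object()

def delete(tablet, row, column):
    """In this toy, a delete is just a write of the DELETED marker."""
    tablet.write(row, column, DELETED)

def minor_compaction(tablet):
    """Freeze the memtable and write it out as the newest SSTable:
    shrinks memory usage and bounds commit-log replay on restart."""
    if tablet.memtable:
        tablet.sstables.insert(0, dict(tablet.memtable))
        tablet.memtable.clear()

def merging_compaction(tablet, how_many=3):
    """Merge the memtable and a few of the newest SSTables into one SSTable
    (a good place to apply a keep-only-N-versions policy)."""
    merged = {}
    for source in reversed(tablet.sstables[:how_many]):   # oldest of the selected first
        merged.update(source)
    merged.update(tablet.memtable)                        # newest values win
    tablet.sstables[:how_many] = [merged]
    tablet.memtable.clear()

def major_compaction(tablet):
    """Rewrite everything into a single SSTable holding only live data."""
    merging_compaction(tablet, how_many=len(tablet.sstables))
    tablet.sstables[0] = {k: v for k, v in tablet.sstables[0].items()
                          if v is not DELETED}
```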

Refinements. Locality groups: clients can group multiple column families together into a locality group. Compression: applied to each SSTable block separately; uses Bentley and McIlroy's scheme followed by a fast compression algorithm. Caching for read performance: the Scan Cache and the Block Cache. Bloom filters: reduce the number of disk accesses.
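As an example of the Bloom-filter refinement, here is a small self-contained filter that lets a read skip SSTables that definitely do not contain a given row/column pair; the bit-array size and hash count here are arbitrary.

```python
# Per-SSTable Bloom filter kept in memory to avoid unnecessary disk reads.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely not present": the SSTable need not be read.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(("com.cnn.www", "contents:"))
print(bf.might_contain(("com.cnn.www", "contents:")))   # True
print(bf.might_contain(("example.org", "contents:")))   # almost certainly False
```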

Outline: Motivation, Data Model, APIs, Building Blocks, Implementation, Refinement, Evaluation

Evaluation. N tablet servers (N = 1, 50, 250, 500); round-trip time less than 1 ms; R row keys, giving roughly 1 GB of data per tablet server; 1000-byte values (a single value per row key); 6 benchmark operations.

Evaluation

Performance - Scaling. As the number of tablet servers is increased by a factor of 500: the performance of random reads from memory increases by a factor of 300, and the performance of scans increases by a factor of 260. Not linear! Why? (The paper attributes this mainly to load imbalance in larger configurations and, for random reads, to transferring a 64 KB SSTable block over the network for each 1000-byte read, which saturates shared network links.)

Comments. The authors claim a very low failure rate, yet the Lessons section also notes vulnerability to many types of failures; this direction could be explored further. The API does not support standard SQL queries, which could be an obstacle for applications that need features such as secondary indexes.

Thanks!