
Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng

What is Bigtable? - A storage system for managing structured data - Used in 60+ Google products and services - Motivation: very large amounts of data at large scale - petabytes of data spread across thousands of commodity servers - Goals: - scalability - wide applicability - high availability - high performance

Outline - Data Model - API - Infrastructure - Implementation - Refinements - Performance Evaluation - Real Applications

Data Model - Sparse, distributed, persistent multidimensional sorted map - Indexed by: a. Row key b. Column key c. Timestamp - (row:string, column:string, time:int64) → string
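
To make the data model concrete, here is a minimal in-memory sketch (plain Python, not Bigtable code) of a table as a map keyed by (row, column, timestamp) with uninterpreted string values; ToyTable and its methods are invented names for illustration.

```python
# Minimal in-memory sketch of the data model: a map keyed by
# (row, column, timestamp) with uninterpreted string values.
# Illustrative only; ToyTable is an invented name.
class ToyTable:
    def __init__(self):
        self.cells = {}   # {(row, column): {timestamp: value}}

    def set(self, row, column, value, timestamp):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        # Versions come back newest-first, mirroring Bigtable's
        # decreasing-timestamp order.
        versions = self.cells.get((row, column), {})
        return sorted(versions.items(), reverse=True)

t = ToyTable()
t.set("com.cnn.www", "contents:", "<html>...", timestamp=3)
t.set("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
print(t.get("com.cnn.www", "contents:"))   # [(3, '<html>...')]
```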

Data Model: Rows - Row keys are arbitrary strings - Reads/writes done under a single row key are atomic - Data is ordered lexicographically by row key

Data Model: Tablets - The row range of a table is dynamically partitioned - Each row range = a tablet - Benefits: - Reads of short row ranges are efficient and require communication with fewer machines - Clients can choose row keys to get good locality - ex: the page maps.google.com/index.html is stored under the row key com.google.maps/index.html, so pages from the same domain end up in adjacent rows - (Figure: consecutive row keys A... through C... in Tablet 1, D... onward in Tablet 2)
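
As a small illustration of the row-key locality trick in the example above, the following hypothetical helper reverses a URL's hostname so that pages from the same domain sort next to each other; reversed_domain_key is an invented name, not part of any Bigtable API.

```python
# Hypothetical helper for the row-key trick above: storing a page under its
# reversed hostname keeps pages from the same domain in adjacent rows (and
# therefore, typically, in the same tablet).
from urllib.parse import urlsplit

def reversed_domain_key(url: str) -> str:
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

print(reversed_domain_key("http://maps.google.com/index.html"))
# -> com.google.maps/index.html
```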

Data Model: Column Families - Column keys are grouped into sets called column families - Column families are the unit of access control - Data in the same family is usually of the same type - A table has a relatively small number of column families - The number of columns, however, is unbounded - Column key syntax: family:qualifier (the qualifier may be empty, e.g. contents:)

Data Model: Timestamps - For versioning, i.e. a cell of a table can hold multiple versions of the same data - Assignment: - By Bigtable: real time in microseconds - By the client application - Versions are stored in decreasing timestamp order, so the most recent is read first - Version management by automatic garbage collection, configured per column family: - Keep only the last n versions - Keep only recent versions (within a time range)
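
A short sketch of the two garbage-collection policies just listed (keep the last n versions, or keep only versions newer than a cutoff), assuming versions are held in a simple {timestamp: value} dict; gc_versions is an illustrative helper, not Bigtable code.

```python
# Illustrative helper (not Bigtable code) for the two per-column-family
# garbage-collection policies: keep the last n versions, or keep only
# versions newer than a cutoff timestamp.
def gc_versions(versions, last_n=None, min_timestamp=None):
    """versions: {timestamp: value}; returns the surviving versions."""
    kept = sorted(versions.items(), reverse=True)   # newest first
    if last_n is not None:
        kept = kept[:last_n]
    if min_timestamp is not None:
        kept = [(ts, v) for ts, v in kept if ts >= min_timestamp]
    return dict(kept)

versions = {1: "a", 5: "b", 9: "c"}
print(gc_versions(versions, last_n=2))          # {9: 'c', 5: 'b'}
print(gc_versions(versions, min_timestamp=5))   # {9: 'c', 5: 'b'}
```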

(Figure: the example Webtable, with reversed-URL row keys such as com.cnn.www, com.google.www, com.lego.com, org.apache.hadoop, org.apache.hbase and org.golang; column families contents and anchor, with column keys such as anchor:cnnsi and anchor:my.look.ca; timestamped cell versions; and the rows split across Tablet 1 and Tablet 2.) A table consists of multiple tablets, and a cluster consists of multiple tables.

API - Metadata Functions - Create and delete tables and column families - Changing metadata - Client Operations - Writes - Set() to write - Delete() to delete - Reads - Over a particular row - Over multiple column families - Transactions - single row (one row key)
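
The write operations above can be pictured with a small Python paraphrase of the single-row mutation pattern (the paper's C++ API batches Set/Delete calls into a RowMutation that is applied atomically); the Python names below are invented stand-ins, not a real client library.

```python
# Hypothetical Python paraphrase of the single-row write pattern; RowMutation,
# set, delete and apply are invented names echoing the shape of the paper's
# C++ API, not a real client library.
class RowMutation:
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

def apply(mutation):
    # In Bigtable this would be an atomic read-modify-write on one row;
    # here we just print the batched operations.
    for op in mutation.ops:
        print(mutation.table, mutation.row_key, op)

m = RowMutation("webtable", "com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
apply(m)
```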

Infrastructure - GFS: for storing log and data files - SSTable: the file format for storing Bigtable data - Immutable, ordered map of key-value pairs - A sequence of 64 KB blocks plus a block index used to locate blocks - Chubby - Distributed lock service - Provides a namespace of directories and files - Each directory or file can be used as a lock - Variety of tasks: - Ensuring there is at most one active master - Storing schema information - Storing the bootstrap location of Bigtable data - Discovering tablet servers / finalizing tablet server deaths - Bigtable is highly dependent on Chubby!
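
A toy sketch of the SSTable layout mentioned above: a sequence of fixed-size blocks plus a block index mapping each block's last key to the block, so a lookup binary-searches the index and reads only one block. The ToySSTable class is illustrative and ignores the real on-disk format.

```python
# Toy SSTable sketch (not the real file format): sorted key-value pairs split
# into fixed-size "blocks" plus an index of each block's last key; a lookup
# binary-searches the index and then reads a single block.
import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[-1][0] for block in self.blocks]  # last key per block

    def get(self, key):
        i = bisect.bisect_left(self.index, key)   # find the candidate block
        if i == len(self.blocks):
            return None
        return dict(self.blocks[i]).get(key)      # read just that block

items = sorted((f"row{n:03d}", f"value{n}") for n in range(10))
sst = ToySSTable(items)
print(sst.get("row007"))   # value7
```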

Implementation: Introduction - Three components: - Client library - One master server - Tablet assignment to tablet server - Addition/Expiration of tablet server - Load balancing - Schema changes - Garbage collecting - Many tablet servers - Manages set of tablets - Handles read/write requests - Splits tablets

Implementation: Tablet Location - Bigtable uses a three-level hierarchy to store information about tablet locations - Level 1: a file stored in Chubby that contains the location of the root tablet - Level 2: the root tablet (the first tablet of the special METADATA table), which contains the locations of all other METADATA tablets - Level 3: the other METADATA tablets, which contain the locations of sets of user tablets
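
A hypothetical walk through the three-level lookup, with the Chubby file, root tablet and METADATA tablets modeled as tiny in-memory structures; the key scheme and server names (ts1, ts2, ts9) are invented for illustration.

```python
# Hypothetical three-level lookup; the data structures, key scheme and
# server names (ts1, ts2, ts9) are invented stand-ins for Chubby and the
# METADATA table.
chubby_file = {"root_tablet_location": "ts1"}   # level 1: a file in Chubby

tablets = {
    "ts1": [("metadata-row-m", "ts2")],   # level 2: root tablet entries
    "ts2": [("zzz", "ts9")],              # level 3: a METADATA tablet's entries
}

def lookup(server, row_key):
    # Return the location stored in the first entry whose end key covers row_key.
    for end_key, location in tablets[server]:
        if row_key <= end_key:
            return location
    return None

def locate_user_tablet(row_key):
    root = chubby_file["root_tablet_location"]          # level 1
    meta = lookup(root, "metadata-row-" + row_key[:1])  # level 2 (toy key scheme)
    return lookup(meta, row_key)                        # level 3

print(locate_user_tablet("com.cnn.www"))   # ts9, the user tablet's location
```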

Implementation: Tablet Assignment - Each tablet is assigned to one tablet server at a time - Bigtable uses Chubby to track tablet servers - Locking mechanism determines tablet server status - Master detects when tablet server assignments change and reassigns tablets accordingly - Performs series of checks to respond appropriately - When started, the master must discover current assignments before making changes - Changes are made to the set of existing tablets when: - A table is created/deleted - Two existing tablets are merged together - An existing tablet is split into two
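
A toy sketch of the reassignment step described above, modeling "holds its Chubby lock" as simple membership in a set of live servers; reassign_tablets and pick_server are invented names, and the real master/Chubby protocol is far more involved.

```python
# Toy sketch of reassignment: "holds its Chubby lock" is modeled as simple
# membership in a set of live servers. reassign_tablets and pick_server are
# invented names; the real protocol is much more involved.
def reassign_tablets(assignments, live_servers, pick_server):
    """assignments: {tablet: server}; reassign tablets whose server is gone."""
    for tablet, server in list(assignments.items()):
        if server not in live_servers:            # server lost its lock / died
            assignments[tablet] = pick_server()   # master picks a new server
    return assignments

live = {"ts1", "ts3"}
assignments = {"tabletA": "ts1", "tabletB": "ts2", "tabletC": "ts3"}
print(reassign_tablets(assignments, live, pick_server=lambda: "ts3"))
# {'tabletA': 'ts1', 'tabletB': 'ts3', 'tabletC': 'ts3'}
```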

Implementation: Tablet Serving - Tablet state is persisted in GFS - Updates are committed to a log that stores redo records - Recent updates are kept in memory in a memtable - Older updates are stored in a sequence of SSTables - Together these allow updates to be recovered - Recovering a tablet involves reading its metadata and reconstructing the memtable by replaying the redo records - Reads and writes are checked for well-formedness and authorization
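
A minimal sketch of the serving path just described: writes append a redo record to the commit log and then update the memtable; reads consult a merged view of the memtable and the SSTables, newest first. ToyTabletServer is illustrative only.

```python
# Toy sketch of the serving path (not real Bigtable code).
class ToyTabletServer:
    def __init__(self):
        self.commit_log = []        # redo records (would live in GFS)
        self.memtable = {}          # recent updates, held in memory
        self.sstables = []          # older updates, newest SSTable first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append the redo record
        self.memtable[key] = value             # 2. apply to the memtable

    def read(self, key):
        if key in self.memtable:               # merged view: memtable first,
            return self.memtable[key]
        for sst in self.sstables:              # then SSTables, newest first
            if key in sst:
                return sst[key]
        return None

ts = ToyTabletServer()
ts.write("com.cnn.www", "<html>...")
print(ts.read("com.cnn.www"))   # <html>...
```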

Implementation: Compactions - Minor Compaction - Freezes the current memtable into a new SSTable when it reaches a size threshold, and starts a fresh memtable - Two main goals: shrink memory usage and reduce the amount of commit log that must be read during recovery - Merging Compaction - Reads a few SSTables and the memtable and writes out a single new SSTable - Bounds the number of SSTables created by minor compactions - Major Compaction - A merging compaction that rewrites all SSTables into exactly one SSTable - Reclaims resources held by deleted data and ensures deleted data disappears from the system completely
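
The compaction kinds above can be sketched over the same memtable/SSTable picture; these standalone functions are illustrative (DELETED is an invented stand-in for a deletion entry), and a merging compaction would be the same operation applied to only a few SSTables.

```python
# Illustrative compaction sketch (not Bigtable code). DELETED is an invented
# stand-in for a deletion entry; a merging compaction would be the same as
# major_compaction but applied to only a few SSTables.
DELETED = object()

def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable (kept newest-first)."""
    return {}, [dict(memtable)] + sstables

def major_compaction(memtable, sstables):
    """Rewrite everything into exactly one SSTable, dropping deleted data."""
    merged = {}
    for sst in reversed(sstables):      # oldest first, so newer values win
        merged.update(sst)
    merged.update(memtable)             # memtable holds the newest updates
    merged = {k: v for k, v in merged.items() if v is not DELETED}
    return {}, [merged]

memtable = {"rowB": "new", "rowC": DELETED}
sstables = [{"rowA": "old", "rowC": "stale"}]
print(major_compaction(memtable, sstables))
# ({}, [{'rowA': 'old', 'rowB': 'new'}])
```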

Refinements: Locality Groups - Multiple column families that can be grouped together by clients - Individual SSTable created for each group in a tablet - Can be created to increase read efficiency - Tuning parameters allow for specific configuration of each locality group - Storage in memory - Size of SSTable blocks
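
A toy sketch of how a flush might split a row's columns by locality group so that each group gets its own SSTable; the group names mirror the paper's page-metadata vs. page-contents example, but the function itself is invented.

```python
# Toy sketch of splitting a row's columns by locality group at flush time,
# so each group gets its own SSTable; the group names mirror the paper's
# page-metadata vs. page-contents example, but the function is invented.
locality_groups = {
    "metadata_group": ["language:", "checksum:"],
    "contents_group": ["contents:"],
}

def split_by_locality_group(row_cells):
    """row_cells: {column_key: value} -> {group_name: {column_key: value}}"""
    out = {group: {} for group in locality_groups}
    for column, value in row_cells.items():
        family = column.split(":", 1)[0] + ":"
        for group, families in locality_groups.items():
            if family in families:
                out[group][column] = value
    return out

row = {"language:": "EN", "contents:": "<html>..."}
print(split_by_locality_group(row))
# {'metadata_group': {'language:': 'EN'}, 'contents_group': {'contents:': '<html>...'}}
```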

Refinements: Compression - Clients can choose whether the SSTables for a locality group are compressed and, if so, which format is used - Each block is compressed separately rather than the SSTable as a whole - Allows reads to be performed without decompressing the whole SSTable - Only the required block is decompressed - A two-pass compression scheme is often employed - Pass 1: Bentley and McIlroy's scheme - Pass 2: a fast compression algorithm - 100-200 MB/s encode, 400-1000 MB/s decode - Prioritizes speed over space reduction
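
A sketch of the per-block idea described above, using zlib purely as a stand-in compressor (the two-pass Bentley-McIlroy scheme itself is not reproduced): each block is compressed independently, so a read decompresses only the block it needs.

```python
# Per-block compression sketch; zlib is only a stand-in compressor, and the
# two-pass Bentley-McIlroy scheme itself is not reproduced here.
import zlib

def compress_blocks(data: bytes, block_size: int = 64 * 1024):
    """Compress each block independently."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_one_block(compressed_blocks, block_index):
    # Only the requested block is decompressed.
    return zlib.decompress(compressed_blocks[block_index])

blocks = compress_blocks(b"some repetitive content " * 10_000)
print(len(blocks), len(read_one_block(blocks, 0)))   # 4 65536
```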

Refinements: Caching for Read Performance - Two levels of caching are used to improve read performance - Scan Cache - Higher-level cache of the key-value pairs returned by the SSTable interface - Block Cache - Lower-level cache of SSTable blocks read from GFS - The Scan Cache helps applications that read the same data repeatedly; the Block Cache helps applications that read data close to data they recently read

Refinements: Bloom Filters - Filters that can determine whether an SSTable might contain data for a specified row/column pair - Created for the SSTables in a locality group - Reduce the number of disk accesses needed for reads - Especially useful when reading from tablets whose SSTables aren't in memory
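
A minimal hand-rolled Bloom filter sketch showing the check described above: a negative answer means the row/column pair is definitely not in the SSTable, so the disk access can be skipped; a positive answer may be a false positive.

```python
# Minimal hand-rolled Bloom filter (not Bigtable's implementation): a
# negative answer means the key is definitely absent, so the SSTable on
# disk does not need to be touched.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain("org.golang/contents:"))           # False (almost surely)
```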

Refinements: Commit-Log Implementation One commit log is used per tablet server, as opposed to one per tablet. Pros: - Avoids a large number of log files being written to GFS concurrently - Group commit is more effective with a single log Cons: - Recovery is more complicated, because mutations for different tablets are interleaved in the same commit log

Refinements: Commit-Log Implementation A naive recovery would scan the full commit log and apply only the entries needed for the tablets being recovered, but the log could then be read once per tablet. To avoid this, the commit log entries are sorted by key (table, row name, log sequence number), which makes each tablet's mutations contiguous; the sort is parallelized by splitting the log into smaller segments that are sorted on different tablet servers.
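
A toy illustration of the sort-by-key recovery trick: once the shared log is sorted by (tablet, row, sequence number), the mutations for any one recovering tablet form a contiguous run. The field layout here is invented.

```python
# Toy illustration of sorting the shared commit log by (tablet, row,
# sequence number) so each recovering tablet's mutations are contiguous;
# the field layout is invented.
log = [
    ("tabletB", "row9", 3, "put y"),
    ("tabletA", "row1", 1, "put x"),
    ("tabletB", "row2", 4, "del z"),
    ("tabletA", "row1", 2, "put x2"),
]

sorted_log = sorted(log)   # tuples sort by (tablet, row, sequence number)

def entries_for(tablet, entries):
    # After sorting, a tablet's entries form one contiguous run.
    return [e for e in entries if e[0] == tablet]

print(entries_for("tabletA", sorted_log))
# [('tabletA', 'row1', 1, 'put x'), ('tabletA', 'row1', 2, 'put x2')]
```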

Refinements: Speeding Up Tablet Recovery When a tablet is moved from one server to another, the source server first performs a minor compaction on it and then stops serving it. Before unloading the tablet, it performs a second, usually very fast, minor compaction to eliminate any uncompacted state that arrived during the first compaction, so the new server does not need to replay the commit log.

Refinements: Exploiting Immutability SSTables are immutable. This simplifies concurrency control, since reads of SSTables need no synchronization. It also speeds up tablet splitting: the child tablets can share the parent tablet's SSTables instead of copying them.

Performance Evaluation: Setup A Bigtable cluster was set up with a varying number of tablet servers. 1 GB of data was read or written per tablet server. The work was split evenly across multiple client processes.

Performance Evaluation: Benchmarks Sequential Read - Reads back the string stored under each row key, in order. Sequential Write - Row keys are divided among the clients, and a distinct random string is written under each row key. Random Read - Like sequential read, but the rows are read in random order. Random Write - Workload spread relatively evenly among the clients; rows are written in no particular order. Scan - Uses the Bigtable API to scan all values within a range of rows.

Performance Evaluation: Single Tablet-Server Performance Random Read - Always the slowest. Each read transfers a 64 KB SSTable block from GFS to the tablet server, out of which only a single 1000-byte value is used. Sequential Read - Faster than random read: the 64 KB block is stored in the block cache and serves the next 64 requests instead of just one. Random and Sequential Write - Efficient because each tablet server appends all writes to a single commit log; group commit streams these writes to GFS efficiently. Scan - Fastest, since a single client RPC can return many values.


Performance Evaluation: Scaling Aggregate throughput increased by over a factor of 100 as the tablet server count increased from 1 to 500. Per-server throughput drops as tablet servers are added, due to load imbalance and competition for CPU and network. Random read shows the worst scaling.

Real Applications Google Analytics - Gathers information on website traffic and other statistics. Google Earth - Bigtable is used to store imagery; each row represents a geographic segment.

Conclusion - Bigtable is used in many Google products today - Used for its scalability and high performance - Indexed by row key, column key, timestamp - Clusters are managed by a master server, which delegates tablets to individual tablet servers - Refinement techniques used to achieve these goals