
Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng

What is Bigtable? - A storage system for managing structured data - Used in 60+ Google products and services - Motivation: very large amounts of data at large scale - petabytes of data spread across thousands of commodity servers - Goals: - scalability - wide applicability - high availability - high performance

Outline - Data Model - API - Infrastructure - Implementation - Refinements - Performance Evaluation - Real Applications

Data Model - Sparse, distributed, persistent multidimensional sorted map - Indexed by: a. Row key b. Column key c. Timestamp - (row:string, column:string, time:int64) → string
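
To make the data model concrete, here is a minimal in-memory sketch (plain Python, not Bigtable code) of a table as a map keyed by (row, column, timestamp) with uninterpreted string values; ToyTable and its methods are invented names for illustration.

```python
# Minimal in-memory sketch of the data model: a map keyed by
# (row, column, timestamp) with uninterpreted string values.
# Illustrative only; ToyTable is an invented name.
class ToyTable:
    def __init__(self):
        self.cells = {}   # {(row, column): {timestamp: value}}

    def set(self, row, column, value, timestamp):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        # Versions come back newest-first, mirroring Bigtable's
        # decreasing-timestamp order.
        versions = self.cells.get((row, column), {})
        return sorted(versions.items(), reverse=True)

t = ToyTable()
t.set("com.cnn.www", "contents:", "<html>...", timestamp=3)
t.set("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
print(t.get("com.cnn.www", "contents:"))   # [(3, '<html>...')]
```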

Data Model: Rows - Row keys are arbitrary strings - Reads/writes done under a single row key are atomic - Data is ordered lexicographically by row key

Data Model: Tablets - The row range of a table is dynamically partitioned - Each row range = a tablet - Benefits: - Reads of short row ranges are efficient and require communication with fewer machines - Clients can choose row keys to get good locality - ex: the page maps.google.com/index.html is stored under the row key com.google.maps/index.html, so pages from the same domain end up in adjacent rows - (Figure: consecutive row keys A... through C... in Tablet 1, D... onward in Tablet 2)
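
As a small illustration of the row-key locality trick in the example above, the following hypothetical helper reverses a URL's hostname so that pages from the same domain sort next to each other; reversed_domain_key is an invented name, not part of any Bigtable API.

```python
# Hypothetical helper for the row-key trick above: storing a page under its
# reversed hostname keeps pages from the same domain in adjacent rows (and
# therefore, typically, in the same tablet).
from urllib.parse import urlsplit

def reversed_domain_key(url: str) -> str:
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

print(reversed_domain_key("http://maps.google.com/index.html"))
# -> com.google.maps/index.html
```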

Data Model: Column Families - Column keys are grouped into sets called column families - Column families are the unit of access control - Data in the same family is usually of the same type - A table has a relatively small number of column families - The number of columns, however, is unbounded - Column key syntax: family:qualifier (the qualifier may be empty, e.g. contents:)

Data Model: Timestamps - For versioning, i.e. a cell of a table can hold multiple versions of the same data - Assignment: - By Bigtable: real time in microseconds - By the client application - Versions are stored in decreasing timestamp order, so the most recent is read first - Version management by automatic garbage collection, configured per column family: - Keep only the last n versions - Keep only recent versions (within a time range)
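
A short sketch of the two garbage-collection policies just listed (keep the last n versions, or keep only versions newer than a cutoff), assuming versions are held in a simple {timestamp: value} dict; gc_versions is an illustrative helper, not Bigtable code.

```python
# Illustrative helper (not Bigtable code) for the two per-column-family
# garbage-collection policies: keep the last n versions, or keep only
# versions newer than a cutoff timestamp.
def gc_versions(versions, last_n=None, min_timestamp=None):
    """versions: {timestamp: value}; returns the surviving versions."""
    kept = sorted(versions.items(), reverse=True)   # newest first
    if last_n is not None:
        kept = kept[:last_n]
    if min_timestamp is not None:
        kept = [(ts, v) for ts, v in kept if ts >= min_timestamp]
    return dict(kept)

versions = {1: "a", 5: "b", 9: "c"}
print(gc_versions(versions, last_n=2))          # {9: 'c', 5: 'b'}
print(gc_versions(versions, min_timestamp=5))   # {9: 'c', 5: 'b'}
```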

(Figure: the example Webtable, with reversed-URL row keys such as com.cnn.www, com.google.www, com.lego.com, org.apache.hadoop, org.apache.hbase and org.golang; column families contents and anchor, with column keys such as anchor:cnnsi and anchor:my.look.ca; timestamped cell versions; and the rows split across Tablet 1 and Tablet 2.) A table consists of multiple tablets, and a cluster consists of multiple tables.

API - Metadata Functions - Create and delete tables and column families - Changing metadata - Client Operations - Writes - Set() to write - Delete() to delete - Reads - Over a particular row - Over multiple column families - Transactions - single row (one row key)
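
The write operations above can be pictured with a small Python paraphrase of the single-row mutation pattern (the paper's C++ API batches Set/Delete calls into a RowMutation that is applied atomically); the Python names below are invented stand-ins, not a real client library.

```python
# Hypothetical Python paraphrase of the single-row write pattern; RowMutation,
# set, delete and apply are invented names echoing the shape of the paper's
# C++ API, not a real client library.
class RowMutation:
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

def apply(mutation):
    # In Bigtable this would be an atomic read-modify-write on one row;
    # here we just print the batched operations.
    for op in mutation.ops:
        print(mutation.table, mutation.row_key, op)

m = RowMutation("webtable", "com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
apply(m)
```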

Infrastructure - GFS: for storing log and data files - SSTable: the file format for storing Bigtable data - Immutable, ordered map of key-value pairs - A sequence of 64 KB blocks plus a block index used to locate blocks - Chubby - Distributed lock service - Provides a namespace of directories and files - Each directory or file can be used as a lock - Variety of tasks: - Ensuring there is at most one active master - Storing schema information - Storing the bootstrap location of Bigtable data - Discovering tablet servers / finalizing tablet server deaths - Bigtable is highly dependent on Chubby!
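
A toy sketch of the SSTable layout mentioned above: a sequence of fixed-size blocks plus a block index mapping each block's last key to the block, so a lookup binary-searches the index and reads only one block. The ToySSTable class is illustrative and ignores the real on-disk format.

```python
# Toy SSTable sketch (not the real file format): sorted key-value pairs split
# into fixed-size "blocks" plus an index of each block's last key; a lookup
# binary-searches the index and then reads a single block.
import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[-1][0] for block in self.blocks]  # last key per block

    def get(self, key):
        i = bisect.bisect_left(self.index, key)   # find the candidate block
        if i == len(self.blocks):
            return None
        return dict(self.blocks[i]).get(key)      # read just that block

items = sorted((f"row{n:03d}", f"value{n}") for n in range(10))
sst = ToySSTable(items)
print(sst.get("row007"))   # value7
```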

Implementation: Introduction - Three components: - Client library - One master server - Tablet assignment to tablet server - Addition/Expiration of tablet server - Load balancing - Schema changes - Garbage collecting - Many tablet servers - Manages set of tablets - Handles read/write requests - Splits tablets

Implementation: Tablet Location - Bigtable uses a three-level hierarchy to store information about tablet locations - Level 1: a file stored in Chubby that contains the location of the root tablet - Level 2: the root tablet (the first tablet of the special METADATA table), which contains the locations of all other METADATA tablets - Level 3: the other METADATA tablets, which contain the locations of sets of user tablets
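
A hypothetical walk through the three-level lookup, with the Chubby file, root tablet and METADATA tablets modeled as tiny in-memory structures; the key scheme and server names (ts1, ts2, ts9) are invented for illustration.

```python
# Hypothetical three-level lookup; the data structures, key scheme and
# server names (ts1, ts2, ts9) are invented stand-ins for Chubby and the
# METADATA table.
chubby_file = {"root_tablet_location": "ts1"}   # level 1: a file in Chubby

tablets = {
    "ts1": [("metadata-row-m", "ts2")],   # level 2: root tablet entries
    "ts2": [("zzz", "ts9")],              # level 3: a METADATA tablet's entries
}

def lookup(server, row_key):
    # Return the location stored in the first entry whose end key covers row_key.
    for end_key, location in tablets[server]:
        if row_key <= end_key:
            return location
    return None

def locate_user_tablet(row_key):
    root = chubby_file["root_tablet_location"]          # level 1
    meta = lookup(root, "metadata-row-" + row_key[:1])  # level 2 (toy key scheme)
    return lookup(meta, row_key)                        # level 3

print(locate_user_tablet("com.cnn.www"))   # ts9, the user tablet's location
```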

Implementation: Tablet Assignment - Each tablet is assigned to one tablet server at a time - Bigtable uses Chubby to track tablet servers - Locking mechanism determines tablet server status - Master detects when tablet server assignments change and reassigns tablets accordingly - Performs series of checks to respond appropriately - When started, the master must discover current assignments before making changes - Changes are made to the set of existing tablets when: - A table is created/deleted - Two existing tablets are merged together - An existing tablet is split into two
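
A toy sketch of the reassignment step described above, modeling "holds its Chubby lock" as simple membership in a set of live servers; reassign_tablets and pick_server are invented names, and the real master/Chubby protocol is far more involved.

```python
# Toy sketch of reassignment: "holds its Chubby lock" is modeled as simple
# membership in a set of live servers. reassign_tablets and pick_server are
# invented names; the real protocol is much more involved.
def reassign_tablets(assignments, live_servers, pick_server):
    """assignments: {tablet: server}; reassign tablets whose server is gone."""
    for tablet, server in list(assignments.items()):
        if server not in live_servers:            # server lost its lock / died
            assignments[tablet] = pick_server()   # master picks a new server
    return assignments

live = {"ts1", "ts3"}
assignments = {"tabletA": "ts1", "tabletB": "ts2", "tabletC": "ts3"}
print(reassign_tablets(assignments, live, pick_server=lambda: "ts3"))
# {'tabletA': 'ts1', 'tabletB': 'ts3', 'tabletC': 'ts3'}
```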

Implementation: Tablet Serving - Tablet state is persisted in GFS - Updates are committed to a log that stores redo records - Recent updates are kept in memory in a memtable - Older updates are stored in a sequence of SSTables - Together these allow updates to be recovered - Recovering a tablet involves reading its metadata and reconstructing the memtable by replaying the redo records - Reads and writes are checked for well-formedness and authorization
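
A minimal sketch of the serving path just described: writes append a redo record to the commit log and then update the memtable; reads consult a merged view of the memtable and the SSTables, newest first. ToyTabletServer is illustrative only.

```python
# Toy sketch of the serving path (not real Bigtable code).
class ToyTabletServer:
    def __init__(self):
        self.commit_log = []        # redo records (would live in GFS)
        self.memtable = {}          # recent updates, held in memory
        self.sstables = []          # older updates, newest SSTable first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append the redo record
        self.memtable[key] = value             # 2. apply to the memtable

    def read(self, key):
        if key in self.memtable:               # merged view: memtable first,
            return self.memtable[key]
        for sst in self.sstables:              # then SSTables, newest first
            if key in sst:
                return sst[key]
        return None

ts = ToyTabletServer()
ts.write("com.cnn.www", "<html>...")
print(ts.read("com.cnn.www"))   # <html>...
```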

Implementation: Compactions - Minor Compaction - Freezes the current memtable into a new SSTable when it reaches a size threshold, and starts a fresh memtable - Two main goals: shrink memory usage and reduce the amount of commit log that must be read during recovery - Merging Compaction - Reads a few SSTables and the memtable and writes out a single new SSTable - Bounds the number of SSTables created by minor compactions - Major Compaction - A merging compaction that rewrites all SSTables into exactly one SSTable - Reclaims resources held by deleted data and ensures deleted data disappears from the system completely
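
The compaction kinds above can be sketched over the same memtable/SSTable picture; these standalone functions are illustrative (DELETED is an invented stand-in for a deletion entry), and a merging compaction would be the same operation applied to only a few SSTables.

```python
# Illustrative compaction sketch (not Bigtable code). DELETED is an invented
# stand-in for a deletion entry; a merging compaction would be the same as
# major_compaction but applied to only a few SSTables.
DELETED = object()

def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable (kept newest-first)."""
    return {}, [dict(memtable)] + sstables

def major_compaction(memtable, sstables):
    """Rewrite everything into exactly one SSTable, dropping deleted data."""
    merged = {}
    for sst in reversed(sstables):      # oldest first, so newer values win
        merged.update(sst)
    merged.update(memtable)             # memtable holds the newest updates
    merged = {k: v for k, v in merged.items() if v is not DELETED}
    return {}, [merged]

memtable = {"rowB": "new", "rowC": DELETED}
sstables = [{"rowA": "old", "rowC": "stale"}]
print(major_compaction(memtable, sstables))
# ({}, [{'rowA': 'old', 'rowB': 'new'}])
```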

Refinements: Locality Groups - Multiple column families that can be grouped together by clients - Individual SSTable created for each group in a tablet - Can be created to increase read efficiency - Tuning parameters allow for specific configuration of each locality group - Storage in memory - Size of SSTable blocks
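
A toy sketch of how a flush might split a row's columns by locality group so that each group gets its own SSTable; the group names mirror the paper's page-metadata vs. page-contents example, but the function itself is invented.

```python
# Toy sketch of splitting a row's columns by locality group at flush time,
# so each group gets its own SSTable; the group names mirror the paper's
# page-metadata vs. page-contents example, but the function is invented.
locality_groups = {
    "metadata_group": ["language:", "checksum:"],
    "contents_group": ["contents:"],
}

def split_by_locality_group(row_cells):
    """row_cells: {column_key: value} -> {group_name: {column_key: value}}"""
    out = {group: {} for group in locality_groups}
    for column, value in row_cells.items():
        family = column.split(":", 1)[0] + ":"
        for group, families in locality_groups.items():
            if family in families:
                out[group][column] = value
    return out

row = {"language:": "EN", "contents:": "<html>..."}
print(split_by_locality_group(row))
# {'metadata_group': {'language:': 'EN'}, 'contents_group': {'contents:': '<html>...'}}
```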

Refinements: Compression - Clients can choose whether the SSTables for a locality group are compressed and, if so, which format is used - Each block is compressed separately rather than the SSTable as a whole - Allows reads to be performed without decompressing the whole SSTable - Only the required block is decompressed - A two-pass compression scheme is often employed - Pass 1: Bentley and McIlroy's scheme - Pass 2: a fast compression algorithm - 100-200 MB/s encode, 400-1000 MB/s decode - Prioritizes speed over space reduction
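
A sketch of the per-block idea described above, using zlib purely as a stand-in compressor (the two-pass Bentley-McIlroy scheme itself is not reproduced): each block is compressed independently, so a read decompresses only the block it needs.

```python
# Per-block compression sketch; zlib is only a stand-in compressor, and the
# two-pass Bentley-McIlroy scheme itself is not reproduced here.
import zlib

def compress_blocks(data: bytes, block_size: int = 64 * 1024):
    """Compress each block independently."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_one_block(compressed_blocks, block_index):
    # Only the requested block is decompressed.
    return zlib.decompress(compressed_blocks[block_index])

blocks = compress_blocks(b"some repetitive content " * 10_000)
print(len(blocks), len(read_one_block(blocks, 0)))   # 4 65536
```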

Refinements: Caching for Read Performance - Two levels of caching are used to improve read performance - Scan Cache - Higher-level cache of the key-value pairs returned by the SSTable interface - Block Cache - Lower-level cache of SSTable blocks read from GFS - The Scan Cache helps applications that read the same data repeatedly; the Block Cache helps applications that read data close to data they recently read

Refinements: Bloom Filters - Filters that can determine whether an SSTable might contain data for a specified row/column pair - Created for the SSTables in a locality group - Reduce the number of disk accesses needed for reads - Especially useful when reading from tablets whose SSTables aren't in memory
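
A minimal hand-rolled Bloom filter sketch showing the check described above: a negative answer means the row/column pair is definitely not in the SSTable, so the disk access can be skipped; a positive answer may be a false positive.

```python
# Minimal hand-rolled Bloom filter (not Bigtable's implementation): a
# negative answer means the key is definitely absent, so the SSTable on
# disk does not need to be touched.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain("org.golang/contents:"))           # False (almost surely)
```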

Refinements: Commit-Log Implementation One commit log is used per tablet server, as opposed to one per tablet. Pros: - Avoids a large number of log files being written to GFS concurrently - Group commit is more effective with a single log Cons: - Recovery is more complicated, because mutations for different tablets are interleaved in the same commit log

Refinements: Commit-Log Implementation A naive recovery would scan the full commit log and apply only the entries needed for the tablets being recovered, but the log could then be read once per tablet. To avoid this, the commit log entries are sorted by key (table, row name, log sequence number), which makes each tablet's mutations contiguous; the sort is parallelized by splitting the log into smaller segments that are sorted on different tablet servers.
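
A toy illustration of the sort-by-key recovery trick: once the shared log is sorted by (tablet, row, sequence number), the mutations for any one recovering tablet form a contiguous run. The field layout here is invented.

```python
# Toy illustration of sorting the shared commit log by (tablet, row,
# sequence number) so each recovering tablet's mutations are contiguous;
# the field layout is invented.
log = [
    ("tabletB", "row9", 3, "put y"),
    ("tabletA", "row1", 1, "put x"),
    ("tabletB", "row2", 4, "del z"),
    ("tabletA", "row1", 2, "put x2"),
]

sorted_log = sorted(log)   # tuples sort by (tablet, row, sequence number)

def entries_for(tablet, entries):
    # After sorting, a tablet's entries form one contiguous run.
    return [e for e in entries if e[0] == tablet]

print(entries_for("tabletA", sorted_log))
# [('tabletA', 'row1', 1, 'put x'), ('tabletA', 'row1', 2, 'put x2')]
```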

Refinements: Speeding Up Tablet Recovery When a tablet is moved from one server to another, the source server first performs a minor compaction on it and then stops serving it. Before unloading the tablet, it performs a second, usually very fast, minor compaction to eliminate any uncompacted state that arrived during the first compaction, so the new server does not need to replay the commit log.

Refinements: Exploiting Immutability SSTables are immutable. This simplifies concurrency control, since reads of SSTables need no synchronization. It also speeds up tablet splitting: the child tablets can share the parent tablet's SSTables instead of copying them.

Performance Evaluation: Setup A Bigtable cluster was set up with a varying number of tablet servers. 1 GB of data was read or written per tablet server. The work was split evenly across multiple client processes.

Performance Evaluation: Benchmarks Sequential Read - Reads back the string stored under each row key, in order. Sequential Write - Row keys are divided among the clients, and a distinct random string is written under each row key. Random Read - Like sequential read, but the rows are read in random order. Random Write - Workload spread relatively evenly among the clients; rows are written in no particular order. Scan - Uses the Bigtable API to scan all values within a range of rows.

Performance Evaluation: Single Tablet-Server Performance Random Read - Always the slowest. Each read transfers a 64 KB SSTable block from GFS to the tablet server, out of which only a single 1000-byte value is used. Sequential Read - Faster than random read: the 64 KB block is stored in the block cache and serves the next 64 requests instead of just one. Random and Sequential Write - Efficient because each tablet server appends all writes to a single commit log; group commit streams these writes to GFS efficiently. Scan - Fastest, since a single client RPC can return many values.


Performance Evaluation: Scaling Aggregate throughput increased by over a factor of 100 as the tablet server count increased from 1 to 500. Per-server throughput drops as tablet servers are added, due to load imbalance and competition for CPU and network. Random read shows the worst scaling.

Real Applications Google Analytics - Gathers information on website traffic and other statistics. Google Earth - Bigtable is used to store imagery; each row represents a geographic segment.

Conclusion - Bigtable is used in many Google products today - Used for its scalability and high performance - Indexed by row key, column key, timestamp - Clusters are managed by a master server, which delegates tablets to individual tablet servers - Refinement techniques used to achieve these goals