References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed Storage System for Structured Data Fay Chang et al OSDI 2006 Online documentation: HBase Magda Balazinska - CSE 444, Spring 2013 1 Magda Balazinska - CSE 444, Spring 2013 2 What is Bigtable? Bigtable Data Model Distributed storage system Designed to Hold structured data Scale to thousands of servers Store up to several hundred TB (maybe even PB) Perform backend bulk processing Perform real-time data serving To scale, Bigtable has a limited set of features Sparse, multidimensional sorted map (row:string, column:string, time:int64)è string Notice how everything but time is a string Example from Fig 1: Columns are grouped into families Magda Balazinska - CSE 444, Spring 2013 3 Magda Balazinska - CSE 444, Spring 2013 4 Key Features Read/writes of data under single row key is atomic Only single-row transactions! Data is stored in lexicographical order Improves data access locality Horizontally partitioned into tablets Tablets are unit of distribution and load balancing Column families are unit of access control Data is versioned (old versions garbage collected) Ex: most recent three crawls of each page, with times Outline Magda Balazinska - CSE 444, Spring 2013 5 Magda Balazinska - CSE 444, Spring 2013 6 1
API Data definition Creating/deleting tables or column families Changing access control rights Data manipulation Writing or deleting values Looking up values from individual rows Iterate over subset of data in the table Bigtable can serve as input/output for MapReduce Magda Balazinska - CSE 444, Spring 2013 7 Outline Magda Balazinska - CSE 444, Spring 2013 8 Chubby Lock Service Google File System In a distributed system, agreement is a problem Different failure scenarios are possible Nodes can have inconsistent views of who is up and who is down Messages can arrive out-of-order at different nodes But need agreement to make decisions Chubby Provides black-box agreement service through lock abstraction Uses the well-known Paxos algorithm Magda Balazinska - CSE 444, Spring 2013 9 A file = A series of chunks Size of a chunk 64MB Append & read only Fault-tolerance Chunks are distributed Chunks are replicated File Master node Decides chunk placement Decides replica placement Tells clients where to find data Master 10 A Table in Bigtable: Basics A table consists of a set of tablets: Section 53 Each tablet stores a range of the table Each tablet comprises one or more s Table (example with two tablets) Tablet 1 Tablet 2 Each is stored in a GFS file Initially, a tablet has only one Updates add s Major compaction puts everything back into one 11 Details Persistent map from keys to values Ordered Immutable Keys and values are strings API Look up value associated with a key Iterate over all key/value pairs in given range Implementation Sequence of blocks One block index to locate other blocks Magda Balazinska - CSE 444, Spring 2013 12 2
Details is a sequence of blocks Last block is the index to locate other blocks Index is loaded into memory when is open Optionally, whole can be memory mapped Lookup in Read index block of Binary search on index block to find data block Read data block is a GFS File Data Data Data Data Data Data Index Block Magda Balazinska - CSE 444, Spring 2013 13 Magda Balazinska - CSE 444, Spring 2013 14 A Table in Bigtable: Basics A table consists of a set of tablets: Section 53 Each tablet stores a range of the table Each tablet comprises one or more s Table (example with two tablets) Tablet 1 Tablet 2 Each is stored in a GFS file Initially, a tablet has only one Updates add s Major compaction puts everything back into one 15 BigTable Components A library linked into every client One master server Assigns tablets to tablet servers Ensures load balance between tablet servers Detects when tablet servers come and go Handles schema changes Many tablet servers (can be added/removed) Each server manages a set of tablets (10 to 1K) Loads tablets into memory Handles read and write requests Splits tablets that have grown too large Magda Balazinska - CSE 444, Spring 2013 16 Finding Tablet Servers Read Operation on Table Hierarchy analogous t B+ tree On first request, client needs 3 network round trips to find tablet location Root tablet (1st METADATA tablet) Chubby file Subsequently, clients cache tablet locations Other METADATA tablets UserTable1 UserTableN Assuming simple case of 1 tablet = 1 Find location of appropriate tablet Find appropriate tablet in the table and its location Use tablet location hierarchy from previous slide Metadata for a tablet contains list of s Then read data from the Magda Balazinska - CSE 444, Spring 2013 17 Magda Balazinska - CSE 444, Spring 2013 18 3
Assigning Tablets to Tablet Servers Problem Need to balance load for serving read/write requests Want to avoid Chubby file and root tablet being hot-spots Solution Master Assigns tablets to tablet servers Manages tablet server churn and load imbalances Processes schema changes Tablet server Loads tablets into memory (ie, loads index blocks of s) Handles read/write to tablets that it has loaded Splits large tablets Clients cache tablets locations Magda Balazinska - CSE 444, Spring 2013 19 Writing to Tablets Remember: s are immutable When a write operation arrives at a tablet server: Write mutation to a separate commit log stored in GFS Wait until done Insert the mutation into in-memory buffer: memtable The memtable is sorted lexicographically To serve reads, the tablet server Merges s and memtable into a single view Magda Balazinska - CSE 444, Spring 2013 20 Tablet Representation Loading Tablets memtable Read Op To load a tablet, a tablet server does the following Memory GFS tablet log Finds location of tablet through its METADATA (Fig 4) Metadata for tablet includes list of s and set of redo points Read s index blocks into memory Recall an consists of a set of blocks + 1 index block Write Op Files Read the commit log since redo point and reconstructs the memtable (the METADATA includes the redo point) Magda Balazinska - CSE 444, Spring 2013 21 Magda Balazinska - CSE 444, Spring 2013 22 Compaction Optimizations To keep memtables below a threshold Minor compaction: convert memtable into Merging compaction: Read a few s and the memtable Write out a new Major compaction: Replace all s and memtable with a new Vertical partitioning: locality groups Compression of blocks Caching of data Additional indexing: bloom filters Avoid reading that does not have needed data Commit log optimizations Single commit log per tablet server Tablet migration optimization Magda Balazinska - CSE 444, Spring 2013 23 Magda Balazinska - CSE 444, Spring 2013 24 4
Outline Performance Magda Balazinska - CSE 444, Spring 2013 25 Magda Balazinska - CSE 444, Spring 2013 26 Summary Bigtable is a distributed system for storing structured data Provides high performance and high availability Scales incrementally Restricted functionality Widely used by many applications at Google Try HBase http://hbaseapacheorg/ Next Steps Magda Balazinska - CSE 444, Spring 2013 27 Magda Balazinska - CSE 444, Spring 2013 28 5