Bigtable: A Distributed Storage System for Structured Data

1 Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. ~Harshvardhan

2 Introduction. BigTable is a distributed storage system for managing structured data. It is designed to scale to a very large size: petabytes of data across 10^3 servers. Projects using BigTable: web indexing, Google Earth, personalized search, Google Analytics. These projects place different demands on BigTable. Data size: from URLs to web pages to satellite imagery. Latency: from backend bulk processing to real-time data serving. BigTable provides a flexible, high-performance solution for all of them. Goals: wide applicability, scalability, high performance, high availability.

3 Introduction. The paper describes the simple data model provided by BigTable. BigTable resembles a database, but it has a different interface and does not provide a full relational data model. Instead it offers a simple data model with dynamic control over data layout and format. Data is indexed using row and column names, which can be arbitrary strings. Clients control the locality of data through schema choices. Schema parameters let clients dynamically control whether data is served out of memory or from disk. Presentation outline: 1. Data Model 2. Client API 3. Underlying Google Infrastructure 4. BigTable Implementation 5. Refinements 6. Performance 7. Examples 8. Conclusions

4 Questions? End of 5-minute presentation

5 1. Data Model 2. Client API 3. Underlying Google Infrastructure 4. BigTable Implementation 5. Refinements 6. Performance 7. Examples 8. Conclusions

6 Data Model. BigTable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp. Each value in the map is an uninterpreted array of bytes: (row:string, column:string, time:int64) -> string
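
The map described on this slide can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class name `SparseTable` and its methods are hypothetical, and a plain dictionary stands in for the distributed, persistent storage that the real system provides.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, int]  # (row key, column key, timestamp)


class SparseTable:
    """Toy in-memory stand-in for the Bigtable map (hypothetical, not the real system)."""

    def __init__(self) -> None:
        # Only cells that were actually written take up space: the map is sparse.
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        return self._cells.get((row, column, timestamp))

    def scan(self) -> List[Tuple[Key, bytes]]:
        # Sorted view: rows and columns ascending, newest timestamp first within a cell.
        return sorted(
            self._cells.items(),
            key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]),
        )


table = SparseTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
table.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
ordered_keys = [key for key, _ in table.scan()]
```

Values are kept as `bytes` because the slide describes them as uninterpreted byte arrays; any structure inside a value is the client's concern.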

7 Data Model Example: WebTable. We want to keep a copy of a large collection of web pages and related information that could be used by many different projects. Use URLs as row keys and various aspects of web pages as column names. Store the contents of the web pages in the contents: column, under the timestamps at which they were fetched.

9 Data Model: Rows. Row keys are arbitrary strings: up to 64KB in size; ~ Bytes typically. Each read or write of data under a single row key is atomic, which makes concurrent updates to the same row easier to reason about. BigTable maintains data in lexicographic order by row key. Tablets: the row range for a table is dynamically partitioned, and each row range is called a tablet. The tablet is the unit of distribution and load balancing. Reads of short row ranges are efficient and require communication with only a few machines. Properly selected row keys give good locality for data access. E.g., in WebTable, pages in the same domain are contiguous because reversed hostnames are used as keys. Storing pages from the same domain together makes some host and domain analyses more efficient.
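
The reversed-hostname trick for row keys can be sketched in a few lines. The helper name `row_key_for` and the sample hostnames are assumptions made for illustration; the point is only that lexicographic ordering then places pages of one domain next to each other.

```python
def row_key_for(host: str) -> str:
    """Reverse the dot-separated components of a hostname (hypothetical helper)."""
    return ".".join(reversed(host.split(".")))


hosts = ["maps.google.com", "www.cnn.com", "www.google.com", "money.cnn.com"]

# Sorting the reversed keys groups hosts of the same domain into one contiguous run.
row_keys = sorted(row_key_for(h) for h in hosts)
```

With plain hostnames, `maps.google.com` and `www.google.com` would be separated by every host starting with letters between "m" and "w"; with reversed keys they share the prefix `com.google.` and sort together.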

10 Data Model: Column Families. Column keys are grouped into sets called column families, which form the basic unit of access control. A table is intended to have a small number of distinct column families, and they rarely change during operation. In contrast, a table may have an unbounded number of columns. Column keys are named using the syntax family:qualifier. Family names must be printable; qualifiers may be arbitrary strings. WebTable example: two useful column families. language: contains only one column key and stores the language of the web page. anchor: contains multiple column keys; the qualifier is the name of the referring site and the cell contents is the link text.

11 Data Model: Timestamps. Each cell can contain multiple versions of the same data, indexed by timestamp. Timestamps are 64-bit integers. Different versions are stored in decreasing timestamp order, so the most recent versions can be read first. Cell versions can be garbage-collected automatically: the client can specify either that only the last n versions be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days). WebTable example: the timestamps of crawled pages stored in the contents: column are the times at which those page versions were crawled. Garbage collection keeps only the most recent three versions.
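
The two garbage-collection settings from this slide can be illustrated as a pure function over the versions of one cell. This is a sketch only: the function name `collect_versions` and its parameters are hypothetical and do not correspond to actual BigTable schema options.

```python
from typing import Dict, Optional


def collect_versions(
    versions: Dict[int, bytes],
    keep_last_n: Optional[int] = None,
    min_timestamp: Optional[int] = None,
) -> Dict[int, bytes]:
    """Return the versions that survive garbage collection, newest first."""
    newest_first = sorted(versions, reverse=True)
    if min_timestamp is not None:
        # "Only new enough versions": drop everything older than the cutoff.
        newest_first = [ts for ts in newest_first if ts >= min_timestamp]
    if keep_last_n is not None:
        # "Only the last n versions": keep the n most recent timestamps.
        newest_first = newest_first[:keep_last_n]
    return {ts: versions[ts] for ts in newest_first}


crawls = {100: b"v1", 200: b"v2", 300: b"v3", 400: b"v4"}
latest_three = collect_versions(crawls, keep_last_n=3)
recent_only = collect_versions(crawls, min_timestamp=250)
```

The WebTable setting on the slide corresponds to `keep_last_n=3`.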

13 Client API. The API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata, access control, etc. Applications can write or delete values in BigTable, look up values from individual rows, and iterate over a subset of the data in a table.

14 Client API. The API supports single-row transactions, which can perform atomic read-modify-write sequences on data stored under a single row key. It supports execution of client-supplied scripts in the address spaces of the servers. Scripts are written in Sawzall, a language developed at Google for data processing; it allows various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators. BigTable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google, as both an input source and an output target.

15 Client API: Example
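
A write through such an API can be pictured as a batch of changes to one row that is applied atomically. The sketch below is a hypothetical Python stand-in, not the actual client library: the names `Table`, `RowMutation`, `apply`, and `read_row` are assumptions, and a process-local lock stands in for the single-row atomicity that BigTable provides.

```python
import threading
from typing import Dict, List, Optional, Tuple


class RowMutation:
    """Collects set and delete operations for a single row (hypothetical)."""

    def __init__(self, row_key: str) -> None:
        self.row_key = row_key
        self.ops: List[Tuple[str, str, Optional[bytes]]] = []

    def set(self, column: str, value: bytes) -> None:
        self.ops.append(("set", column, value))

    def delete(self, column: str) -> None:
        self.ops.append(("delete", column, None))


class Table:
    """Toy table whose row updates are all-or-nothing (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}
        self._lock = threading.Lock()

    def apply(self, mutation: RowMutation) -> None:
        # Build the new row on a copy, then swap it in under the lock,
        # so readers never observe a half-applied mutation.
        with self._lock:
            row = dict(self._rows.get(mutation.row_key, {}))
            for op, column, value in mutation.ops:
                if op == "set" and value is not None:
                    row[column] = value
                else:
                    row.pop(column, None)
            self._rows[mutation.row_key] = row

    def read_row(self, row_key: str) -> Dict[str, bytes]:
        with self._lock:
            return dict(self._rows.get(row_key, {}))


table = Table()
seed = RowMutation("com.cnn.www")
seed.set("anchor:www.abc.com", b"ABC")
table.apply(seed)

# Add one anchor and delete another in a single atomic row update.
r1 = RowMutation("com.cnn.www")
r1.set("anchor:www.c-span.org", b"CNN")
r1.delete("anchor:www.abc.com")
table.apply(r1)
row_after = table.read_row("com.cnn.www")
```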

17 Building Blocks. BigTable is built on several other pieces of Google infrastructure. It uses the Google File System (GFS) to store log and data files. It depends on a cluster management system. It uses the Google SSTable file format internally to store data. It relies on Chubby.

19 Building Blocks. BigTable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

21 Building Blocks. BigTable uses the Google SSTable file format internally to store data. An SSTable provides a persistent, ordered, immutable map from keys to values. It provides operations to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range.
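
The two SSTable operations named on this slide, point lookup and range iteration, fall out naturally from keeping keys sorted. The following is a minimal in-memory sketch under that assumption; the class name `SSTableSketch` is hypothetical and nothing here reflects the real on-disk file format.

```python
import bisect
from typing import Dict, Iterator, Optional, Tuple


class SSTableSketch:
    """Immutable, ordered key-to-value map held in memory (hypothetical stand-in)."""

    def __init__(self, items: Dict[str, bytes]) -> None:
        pairs = sorted(items.items())
        # Tuples make the contents immutable after construction.
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def lookup(self, key: str) -> Optional[bytes]:
        """Binary-search for a single key."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: str, end: str) -> Iterator[Tuple[str, bytes]]:
        """Yield pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for i in range(lo, hi):
            yield self._keys[i], self._values[i]


sst = SSTableSketch({
    "com.cnn.www": b"a",
    "com.google.maps": b"b",
    "com.google.www": b"c",
    "org.example": b"d",
})
google_rows = list(sst.scan("com.google", "com.google\xff"))
```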

23 Building Blocks. BigTable relies on Chubby, a highly available and persistent distributed lock service. Chubby is used for ensuring that there is at most one active master at any time, storing the bootstrap location of BigTable data, discovering tablet servers and finalizing tablet server deaths, and storing schema information and access control lists.

25 Implementation. Three major components: a library linked into every client, many tablet servers, and one master server. Tablets: a BigTable cluster stores a number of tables, and each table consists of a set of tablets. Each tablet contains all data associated with a row range. Initially, each table consists of one tablet. As a table grows, it is automatically split into multiple tablets. Each approx MB in size by default.

26 Implementation. The library linked into every client caches tablet locations. If the location of a tablet is unknown or incorrect, the library looks it up again. This requires 3 to 6 network round trips: 3 if the cache is empty, up to 6 if the cache is stale.

28 Implementation. Each tablet server manages a set of tablets (typ tablets/server). It handles read and write requests to the tablets that it has loaded, and it splits tablets that have grown too large. Tablet servers can be dynamically added to or removed from a cluster.

30 Implementation. The master server is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet server load, and garbage collection of files in GFS. It also handles schema changes such as table and column family creations. Client data does not move through the master: clients communicate directly with tablet servers, and most clients never communicate with the master. Therefore, the master is lightly loaded in practice.

31 Implementation: Details. Tablet Location, Tablet Assignment, Tablet Serving

33 Implementation: Details. Tablet Location: BigTable uses a three-level hierarchy, similar to a B+ tree, to store tablet location information. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all tablets of a special METADATA table. Each METADATA tablet contains the locations of a set of user tablets. Each METADATA row stores approx. 1KB of data in memory. With METADATA tablets limited to 128 MB, the scheme can address 2^34 tablets, or 2^61 bytes in 128 MB tablets.
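
The 2^61 figure follows from a short calculation, worked through below. The arithmetic assumes the 128 MB limit on METADATA tablets and roughly 1 KB per METADATA row stated on the slide, plus user tablets of 128 MB each; the constant names are chosen here for illustration.

```python
METADATA_TABLET_BYTES = 128 * 2**20  # 128 MB limit per METADATA tablet
METADATA_ROW_BYTES = 2**10           # roughly 1 KB per METADATA row
USER_TABLET_BYTES = 128 * 2**20      # assumed size of each user tablet

# One METADATA tablet holds one row per tablet it points to.
rows_per_metadata_tablet = METADATA_TABLET_BYTES // METADATA_ROW_BYTES

# The root tablet points to METADATA tablets; each of those points to user tablets.
addressable_tablets = rows_per_metadata_tablet * rows_per_metadata_tablet

addressable_bytes = addressable_tablets * USER_TABLET_BYTES
```

So each level of the hierarchy multiplies the reach by 2^17, and two such levels below the Chubby file give 2^34 tablets.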

34 Implementation: Details. Tablet Assignment: each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers. When a tablet is unassigned, the master assigns it to a tablet server with enough room. BigTable uses Chubby to keep track of tablet servers, which acquire and release locks there.

35 Implementation: Details. Tablet Serving: the persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores redo records. Recent commits are stored in memory in a sorted buffer (the memtable); older commits are stored in SSTables. Tablet recovery: the tablet server gets the tablet's metadata from the METADATA table. The metadata contains the list of SSTables comprising the tablet and a set of redo points. The server reads the SSTables into memory and reconstructs the memtable by applying the updates committed since the redo points. Write request: the server checks that the request is well-formed and that the sender is authorized, then writes valid mutations to the commit log and inserts them into the memtable.
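
The interplay of commit log, memtable, and SSTables can be sketched with plain Python containers. Everything here is a simplification under stated assumptions: `TabletSketch` is a hypothetical name, dictionaries stand in for SSTables, a list stands in for the commit log in GFS, and a single integer stands in for the set of redo points. The `read` method merges the memtable with the SSTables, newest data first, which is how a read can see both recent and older commits.

```python
from typing import Dict, List, Optional, Tuple


class TabletSketch:
    """Toy model of tablet recovery, writes, and reads (hypothetical)."""

    def __init__(
        self,
        sstables: List[Dict[str, bytes]],
        commit_log: List[Tuple[str, bytes]],
        redo_point: int,
    ) -> None:
        self.sstables = sstables      # oldest first; treated as immutable
        self.commit_log = commit_log  # redo records as (key, value) pairs
        self.memtable: Dict[str, bytes] = {}
        # Recovery: replay every update logged at or after the redo point.
        for key, value in commit_log[redo_point:]:
            self.memtable[key] = value

    def write(self, key: str, value: bytes) -> None:
        # Log the mutation first, then make it visible in the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value

    def read(self, key: str) -> Optional[bytes]:
        # The memtable holds the newest data; fall back to SSTables, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


log = [("a", b"1"), ("b", b"2"), ("a", b"3")]
sstables = [{"a": b"1", "b": b"2"}]  # the first two log records are already in an SSTable
tablet = TabletSketch(sstables, log, redo_point=2)
value_a = tablet.read("a")
value_b = tablet.read("b")
tablet.write("c", b"4")
```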

37 Refinements. The implementation required a number of refinements to achieve its goals; some are given below. Locality groups: clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet, which exploits locality for read/write efficiency. Compression: clients control whether the SSTables for a locality group are compressed and which compression format is used. The formats are optimized for speed. A 10-to-1 reduction in space can be achieved, compared to 4-to-1 for Gzip on HTML. Caching for read performance: tablet servers use two levels of caching. The Scan Cache is the higher level; it caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is the lower level; it caches SSTable blocks read from GFS, which is useful for sequential reads or random reads of different columns in the same locality group.
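
Locality groups can be pictured as a routing step that decides which store each cell lands in. The sketch below assumes a hypothetical schema in which small page metadata and bulky page contents go to different groups; the mapping `LOCALITY_GROUPS`, the family names, and the function `split_by_locality_group` are illustrative assumptions, and plain dictionaries stand in for the per-group SSTables.

```python
from collections import defaultdict
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (row key, column key)

# Hypothetical schema: column family -> locality group.
LOCALITY_GROUPS = {
    "language": "metadata",
    "checksum": "metadata",
    "contents": "content",
}


def split_by_locality_group(cells: Dict[Cell, bytes]) -> Dict[str, Dict[Cell, bytes]]:
    """Route each cell to the store of its column family's locality group."""
    groups: Dict[str, Dict[Cell, bytes]] = defaultdict(dict)
    for (row, column), value in cells.items():
        family = column.split(":", 1)[0]
        groups[LOCALITY_GROUPS[family]][(row, column)] = value
    return dict(groups)


cells = {
    ("com.cnn.www", "language:"): b"EN",
    ("com.cnn.www", "checksum:"): b"abc123",
    ("com.cnn.www", "contents:"): b"<html>...</html>",
}
per_group = split_by_locality_group(cells)
```

A reader that only needs the language of each page touches the `metadata` store and never has to skip over page contents.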

39 Performance Evaluation. Setup: a BigTable cluster with N tablet servers. Tablet servers use 1 GB of RAM and write to a GFS cell consisting of 1786 machines with 2x 400GB HDDs each. N client machines generate the BigTable load for the tests; the same number of clients as tablet servers is used so that the clients are never a bottleneck. Each machine has 2x dual-core Opteron 2GHz chips and a single gigabit Ethernet link. GBPS aggregate bandwidth available at the root of a 2-level tree network. All machines are in the same hosting facility, so RTT < 1 msec.

40 Performance Evaluation. Results: aggregate throughput increases dramatically, by over a factor of a hundred, when going from 1 to 500 servers. Scaling: performance does not increase linearly. There is a significant drop in per-server throughput when going from 1 to 50 servers. This is caused by load imbalance in multiple-server configurations: other processes contend for CPU and network; the load-balancing algorithm cannot do a perfect job; rebalancing is throttled to reduce the number of tablet movements; and the load shifts around as the test progresses.

42 Real Applications. Real-world deployment: as of August 2006, there were 388 non-test BigTable clusters with a total of about 24,500 tablet servers.

43 Real Applications: Google Analytics. Google Analytics is a service that helps webmasters analyze traffic patterns. It provides aggregate statistics: unique visitors per day, site tracking reports, page views, etc. It is enabled by embedding a small JavaScript program in the web page. The program is invoked when the page is visited and records information for Analytics: user id, information about the page, etc. Analytics summarizes this data for webmasters. It uses two tables. Raw Click Table (~200TB): maintains a row for each end-user session. The row name is a tuple containing the website name and a timestamp, which ensures that sessions for the same website are contiguous and sorted. The table compresses to 14% of its original size. Summary Table (~20TB): contains predefined summaries for each website. It is generated from the Raw Click Table by periodic MapReduce jobs, each of which extracts recent session data from the Raw Click Table. The table compresses to 29% of its original size.

45 Conclusions. BigTable is a distributed system for storing structured data at Google. Clusters have been in production use since April 2005. Design and implementation took 7 person-years. By August 2006, more than 60 projects were using BigTable. It offers high performance and high availability. Unusual interface: difficult to adapt to? New users are uncertain at first, particularly those with a relational database background. It works well in practice, though, as demonstrated by the many projects using it. Future work: support for secondary indices; BigTables replicated across data centers with multiple masters.

46 Questions?
