Google Data Management


1 Google Data Management Vera Goebel Department of Informatics, University of Oslo 2009

2 Google Technology
- Kaizen: continuous developments and improvements
- Grid computing: Google data centers and messages
- BitTorrent technology: read data from many computers simultaneously
- High performance from low-cost hardware: commodity or white box hardware in data centers
- Google Linux: work around bottlenecks of standard operating systems
- Parallelization: use good programming ideas from other languages
- Memory and disk usage for data replication

3 Googleplex: Google Computing Framework (a) Linux modifications (b) distributed architecture (c) technical architecture (d) Web-centric architecture

4 Google's Fusion: Hardware and Software Innovations

5 BackRub* Service that became Google Developed at Stanford University *Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1997,

6 PageRank* Algorithm - I
- Voting algorithm weighted for importance
- Indicators of a Web page's importance: number of pages that link to a particular page
- Other factors: number of people clicking on a Web page; frequency with which content on a Web page is changed
- Requires a lot of computing power
*Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1997,

7 PageRank - II Over 8 billion Web pages Search problem: find Web pages, manage links pointing to Web pages (link = pointer)
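The "voting weighted for importance" idea can be sketched as a short iteration. This is a minimal illustration of the formula given in the cited Brin and Page paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A, not Google's production code. The toy link graph, the function name `pagerank`, and the fixed iteration count are assumptions made for the example; d = 0.85 is the damping value the paper mentions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (toy input)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the (1 - d) base score.
        new_rank = {page: 1.0 - damping for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no out-links casts no votes in this sketch
            # A page splits its vote evenly among the pages it links to.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph page "c" ends up with the highest score because it collects votes from three pages, while "d", which nobody links to, keeps only the base score.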

8 Google Data Centers A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer's engineers for certain more specialized tasks.

9 Characteristics for Google Data Center
1. Google data centers (approx. two dozen): They come online automatically and, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.
2. Standard desktop PCs: The hardware in a Google data center can be bought at a local computer store.
3. Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.
4. Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80 servers.
5. A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.
6. Each server, rack and data center works in a way that is similar to what is called plug and play. Like a mouse plugged into the USB port on a laptop, Google's network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.

10 Google File System Early days Challenges: today - Scalability - Fault-tolerance - Auto recovery Frank Eliassen, Ifi/UiO

11 Google Platform Characteristics 100s to 1000s of PCs in cluster Many modes of failure for each PC: App bugs, OS bugs Human error Disk failure, memory failure, net failure, power supply failure Connector failure Monitoring, fault tolerance, auto-recovery essential Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

12 Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May 2009

13 Google File System: Design Criteria Detect, tolerate, recover from failures automatically Large files, >= 100 MB in size Large, streaming reads (>= 1 MB in size) Read once Large, sequential writes that append Write once Concurrent appends by multiple clients (e.g., producer-consumer queues) Want atomicity for appends without synchronization overhead among clients Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

14 GFS: Architecture One master server (state replicated on backups) Many chunk servers (100s to 1000s) Spread across racks; intra-rack b/w greater than inter-rack Chunk: 64 MB portion of file, identified by 64-bit, globally unique ID Many clients accessing same and different files stored on same cluster Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

15 Master Server Holds all metadata: Namespace (directory hierarchy) Access control information (per-file) Mapping from files to chunks Current locations of chunks (chunkservers) Delegates consistency management Garbage collects orphaned chunks Migrates chunks between chunkservers Holds all metadata in RAM; very fast operations on file system metadata Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May 2009

16 Chunkserver Stores 64 MB file chunks on local disk using standard Linux filesystem, each with version number and checksum Read/write requests specify chunk handle and byte range Chunks replicated on configurable number of chunkservers (default: 3) No caching of file data (beyond standard Linux buffer cache) Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

17 Client Issues control (metadata) requests to master server Issues data requests directly to chunkservers Caches metadata Does no caching of data No consistency difficulties among clients Streaming reads (read once) and append writes (write once) don't benefit much from caching at client Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

18 GFS: Architecture (2) Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

19 Client API Not a filesystem in traditional sense Not POSIX compliant Does not use kernel VFS interface Library that apps can link in for storage access API: open, delete, read, write (as expected) snapshot: quickly create copy of file append: at least once, possibly with gaps and/or inconsistencies among clients Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

20 Client Read Client sends master: read(file name, chunk index) Master's reply: chunk ID, chunk version number, locations of replicas Client sends closest chunkserver w/replica: read(chunk ID, byte range) Closest determined by IP address on simple rack-based network topology Chunkserver replies with data Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May
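Before a client can send read(file name, chunk index) to the master, it has to turn a byte range in the file into chunk indexes and byte ranges within each chunk. A minimal sketch of that arithmetic, assuming the fixed 64 MB chunk size from the slides; the function name `chunk_requests` and the tuple layout are hypothetical, not part of the GFS client library:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated on the architecture slide


def chunk_requests(offset, length):
    """Split a file byte range into (chunk_index, start, stop) ranges per chunk."""
    requests = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE            # which chunk holds this byte
        start_in_chunk = offset % CHUNK_SIZE    # position inside that chunk
        stop_in_chunk = min(CHUNK_SIZE, start_in_chunk + (end - offset))
        requests.append((index, start_in_chunk, stop_in_chunk))
        offset += stop_in_chunk - start_in_chunk
    return requests


# A read of 8 bytes that straddles the boundary between chunk 0 and chunk 1.
straddling = chunk_requests(CHUNK_SIZE - 4, 8)
```

Each resulting tuple corresponds to one metadata lookup at the master (which the client can cache) and one read(chunk ID, byte range) sent to a chunkserver.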

21 Client Write Some chunkserver is primary for each chunk Master grants lease to primary (typically for 60 sec.) Leases renewed using periodic heartbeat messages between master and chunkservers Client asks master for primary and secondary replicas for each chunk Client sends data to replicas in daisy chain Pipelined: each replica forwards as it receives Takes advantage of full-duplex Ethernet links Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

22 Client Write (3) All replicas acknowledge data write to client Client sends write request to primary Primary assigns serial number to write request, providing ordering Primary forwards write request with same serial number to secondaries Secondaries all reply to primary after completing write Primary replies to client Source: M. Siegenthaler, CS 6464, Cornell Computer Science, May

23 Client Write (2)
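The write path described on the Client Write slides separates data flow (the client pushes bytes to every replica) from control flow (the primary picks the order). The sketch below is a single-process toy model of that idea, assuming in-memory replicas and hypothetical class and method names (`Replica`, `Primary`, `push`, `write`); leases, heartbeats, the daisy chain, and failure handling are left out.

```python
class Replica:
    """Toy chunk replica: holds pushed data until told to apply it."""

    def __init__(self):
        self.staged = {}          # data_id -> bytes pushed by a client, not yet applied
        self.chunk = bytearray()  # the chunk contents
        self.applied = []         # serial numbers in the order they were applied

    def push(self, data_id, data):
        self.staged[data_id] = data

    def apply(self, serial, data_id):
        self.chunk += self.staged.pop(data_id)
        self.applied.append(serial)


class Primary(Replica):
    """The replica holding the lease: it assigns serial numbers to writes."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for secondary in self.secondaries:
            # Forward the request with the same serial number.
            secondary.apply(serial, data_id)
        return serial


secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
replicas = [primary] + secondaries

# Two clients push data; it reaches the replicas in different orders.
for replica in replicas:
    replica.push("A", b"aaa")
for replica in reversed(replicas):
    replica.push("B", b"bbb")

# The write requests reach the primary as B, then A; that order wins everywhere.
primary.write("B")
primary.write("A")
```

Because only the primary hands out serial numbers, every replica applies the two writes in the same order regardless of when the data itself arrived.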

24 BigTable A System for Distributed Structured Storage Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber Adapted by Jarle Søberg from Li Tianbao's transcript from Jeff Dean's slides

25 Motivation
- Lots of (semi-)structured data at Google
  - URLs: contents, crawl metadata, links, anchors, pagerank, ...
  - Per-user data: user preference settings, recent queries/search results, ...
  - Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is large
  - Billions of URLs, many versions/page (~20K/version)
  - Hundreds of millions of users, thousands of q/sec
  - 100TB+ of satellite image data

26 Why not just use commercial DB? Scale is too large for most commercial databases Even if it weren't, cost would be very high Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Much harder to do when running on top of a database layer Also fun and challenging to build large-scale systems :)

27 Goals Want asynchronous processes to be continuously updating different pieces of data Want access to most current data at any time Need to support: Very high read/write rates (millions of ops per second) Efficient scans over all or interesting subsets of data Efficient joins of large one-to-one and one-to-many datasets Often want to examine data changes over time E.g. Contents of a web page over multiple crawls

28 BigTable Distributed multi-level map With an interesting data model Fault-tolerant, persistent Scalable Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance

29 Status Design/initial implementation started beginning of 2004 Currently ~100 BigTable cells Production use or active development for many projects: Google Print My Search History Orkut Crawling/indexing pipeline Google Maps/Google Earth Blogger Largest bigtable cell manages ~200TB of data spread over several thousand machines (larger cells planned)

30 Background: Building Blocks Building blocks: Google File System (GFS): Raw storage Scheduler: schedules jobs onto machines Lock service: distributed lock manager Also can reliably hold tiny files (100s of bytes) w/ high availability MapReduce: simplified large-scale data processing BigTable uses of building blocks: GFS: stores persistent state Scheduler: schedules jobs involved in BigTable serving Lock service: master election, location bootstrapping MapReduce: often used to read/write BigTable data

31 Google File System (GFS)
[Figure: clients, replicated GFS masters, and chunkservers 1..N, each chunkserver holding replicas of some of the chunks C0-C5]
- Master manages metadata
- Data transfers happen directly between clients/chunkservers
- Files broken into chunks (typically 64 MB)
- Chunks triplicated across three machines for safety
- See SOSP '03 paper at

32 MapReduce: Easy-to-use Cycles
- Many Google problems: process lots of data to produce other data
  - Many kinds of inputs: document records, log files, sorted on-disk data structures, etc.
  - Want to easily use hundreds or thousands of CPUs
- MapReduce: framework that provides (for certain classes of problems):
  - Automatic & efficient parallelization/distribution
  - Fault-tolerance, I/O scheduling, status/monitoring
  - User writes Map and Reduce functions
- Heavily used: ~3000 jobs, 1000s of machine days each day
- See: MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
- BigTable can be input and/or output for MapReduce computations
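To make "user writes Map and Reduce functions" concrete, here is a minimal single-process sketch of the programming model using word counting. The driver `run_mapreduce` is a hypothetical stand-in for the framework: it only groups map output by key and calls the reducer, and does none of the parallelization, distribution, or fault tolerance listed above.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """User-written Map: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-written Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


docs = [("d1", "the web is large"), ("d2", "the index is large")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user code is just the two small functions; everything inside `run_mapreduce` is what the real framework would take over and spread across machines.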

33 Typical Cluster
[Figure: cluster-wide services: Cluster Scheduling Master, Lock Service, GFS Master. Machines 1..N each run Linux, a Scheduler Slave, and a GFS Chunkserver, plus some mix of User Tasks, a Single Task, BigTable Servers, and the BigTable Master.]

34 BigTable Overview
- Data Model
- Implementation Structure: tablets, compactions, locality groups, ...
- API
- Details: shared logs, compression, replication, ...
- Current/Future Work

35 Basic Data Model
- Distributed multi-dimensional sparse map: (row, column, timestamp) → cell contents
[Figure: a table with ROWS and COLUMNS; the cell in the "contents" column holds "<html>..." in several versions at TIMESTAMPS t1, t2, t3]
- Good match for most of our applications

36 Rows Name is an arbitrary string Access to data in a row is atomic Row creation is implicit upon storing data Rows ordered lexicographically Rows close together lexicographically usually on one or a small number of machines
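The data model on the two slides above, a sparse map from (row, column, timestamp) to cell contents with lexicographically ordered rows, can be sketched in a few lines. This is an in-memory toy under assumed names (`SparseTable`, `set`, `get`, `rows`), not the BigTable API; the row and column names are example values.

```python
class SparseTable:
    """Toy sparse map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}; absent cells cost nothing

    def set(self, row, column, timestamp, value):
        # Row creation is implicit: storing a cell is all it takes.
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, at=None):
        """Return the newest version, or the newest one with timestamp <= at."""
        versions = self.cells.get((row, column), {})
        visible = [t for t in versions if at is None or t <= at]
        return versions[max(visible)] if visible else None

    def rows(self):
        """Row names in lexicographic order."""
        return sorted({row for row, _ in self.cells})


table = SparseTable()
table.set("cnn.com", "contents:", 1, "<html>v1")
table.set("cnn.com", "contents:", 3, "<html>v3")
table.set("aaa.com", "language:", 2, "EN")
```

Reading `table.get("cnn.com", "contents:", at=2)` returns the version written at timestamp 1, which is the kind of "contents of a web page over multiple crawls" lookup the goals slide asks for.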

37 Tablets Large tables broken into tablets at row boundaries Tablet holds contiguous range of rows Clients can often choose row keys to achieve locality Aim for ~100MB to 200MB of data per tablet Serving machine responsible for ~100 tablets Fast recovery: 100 machines each pick up 1 tablet from failed machine Fine-grained load balancing Migrate tablets away from overloaded machine Master makes load-balancing decisions

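38 Tablets & Splitting
[Figure: table with columns language and contents; rows in sorted order (aaa.com, cnn.com with language EN and contents <html>..., cnn.com/sports.html, ..., Website.com, ..., Zuppa.com/menu.html), divided into TABLETS at row boundaries]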

39 Tablets & Splitting
[Figure: the same table with further rows (aaa.com, cnn.com, cnn.com/sports.html, ..., Website.com, ..., Yahoo.com/kids.html, Yahoo.com/kids.html?D, ..., Zuppa.com/menu.html), divided into TABLETS at row boundaries]

40 System Structure
[Figure: a Bigtable cell]
- Bigtable client with Bigtable client library: Open()
- Bigtable master: performs metadata ops, load balancing
- Bigtable tablet servers: serve data
- Cluster Scheduling Master: handles failover, monitoring
- GFS: holds tablet data, logs
- Lock service: holds metadata, handles master-election

41 Locating Tablets Since tablets move around from server to server, given a row, how do clients find the right machine? Need to find tablet whose row range covers the target row One approach: could use the BigTable master Central server almost certainly would be bottleneck in large system Instead: store special tables containing tablet location info in BigTable cell itself

42 Locating Tablets (cont.)
Our approach: 3-level hierarchical lookup scheme for tablets
- Location is ip:port of relevant server
- 1st level: bootstrapped from lock server, points to owner of META0
- 2nd level: uses META0 data to find owner of appropriate META1 tablet
- 3rd level: META1 table holds locations of tablets of all other tables
- META1 table itself can be split into multiple tablets
- Aggressive prefetching + caching: most ops go right to proper machine
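The hierarchical lookup can be sketched with sorted lists and binary search. The layout below is an assumption made for illustration: each metadata tablet is a sorted list of (last key covered, value) entries, keys have the form "table:row", and all names and addresses are made up. The first level (asking the lock service who owns META0) is reduced to a plain variable.

```python
import bisect

LAST = "\uffff"  # sentinel that sorts after every row name used here


def find_entry(entries, key):
    """entries: sorted list of (last_key_covered, value); return the value covering key."""
    last_keys = [last_key for last_key, _ in entries]
    return entries[bisect.bisect_left(last_keys, key)][1]


# Level 1 (stand-in for the lock service): where META0 lives.
meta0 = [("webtable:m", "meta1-a"), ("webtable:" + LAST, "meta1-b")]

# Level 2 targets: META1 tablets, each mapping row ranges to tablet server ip:port.
meta1_tablets = {
    "meta1-a": [("webtable:cnn.com", "10.0.0.1:7000"), ("webtable:m", "10.0.0.2:7000")],
    "meta1-b": [("webtable:" + LAST, "10.0.0.3:7000")],
}


def locate(table, row):
    key = f"{table}:{row}"
    meta1_name = find_entry(meta0, key)                 # META0 -> right META1 tablet
    return find_entry(meta1_tablets[meta1_name], key)   # META1 -> tablet server


server = locate("webtable", "cnn.com/sports.html")
```

A real client would cache the results of both lookups, which is why most operations can go straight to the right tablet server.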

43 Tablet Representation
[Figure: a tablet consists of a write buffer in memory (random-access), an append-only log on GFS, and several SSTables on GFS (mmap); arrows show the Read and Write paths]
- SSTable: immutable on-disk ordered map from string → string
- String keys: <row, column, timestamp> triples

44 Compactions Tablet state represented as set of immutable compacted SSTable files, plus tail of log (buffered in memory) Minor compaction: When in-memory state fills up, pick tablet with most data and write contents to SSTables stored in GFS Separate file for each locality group for each tablet Major compaction: Periodically compact all SSTables for tablet into new base SSTable on GFS Storage reclaimed from deletions at this point
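A toy model of the two compaction kinds may help. It assumes a dictionary as the in-memory write buffer, tuples of sorted pairs as immutable SSTables, and `None` as the deletion marker; the class `Tablet`, its methods, and the tiny `memtable_limit` are hypothetical. The commit log, GFS, and locality groups are omitted.

```python
class Tablet:
    """Toy tablet: in-memory write buffer plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # write buffer in memory
        self.sstables = []   # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value  # value None marks a deletion
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the write buffer into a new SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def major_compaction(self):
        """Merge all SSTables into one base SSTable and drop deleted entries."""
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values overwrite
            merged.update(sstable)
        live = tuple(sorted((k, v) for k, v in merged.items() if v is not None))
        self.sstables = [live] if live else []

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest run wins
            for k, v in sstable:
                if k == key:
                    return v
        return None


tablet = Tablet()
tablet.write("a", 1)
tablet.write("b", 2)      # buffer full: minor compaction writes SSTable 1
tablet.write("a", None)   # delete "a"
tablet.write("c", 3)      # buffer full: minor compaction writes SSTable 2
sstables_before = len(tablet.sstables)
tablet.major_compaction()
```

Until the major compaction runs, the deleted key "a" still occupies space in the older SSTable; only the merge into a new base SSTable reclaims it.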

45 Columns
[Figure: row cnn.com with columns contents:, anchor:cnnsi.com, and anchor:stanford.edu; the anchor cells hold the link texts "CNN homepage" and "CNN"]
- Columns have two-level name structure: family:optional_qualifier
- Column family: unit of access control; has associated type information
- Qualifier gives unbounded columns: additional level of indexing, if desired

46 Timestamps Used to store different versions of data in a cell New writes default to current time, but timestamps for writes can also be set explicitly by clients Lookup options: Return most recent K values Return all values in timestamp range (or all values) Column families can be marked w/ attributes: Only retain most recent K values in a cell Keep values until they are older than K seconds
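The two retention attributes for column families can be written as small filters over the versions of one cell. A minimal sketch, assuming a cell's versions are held as a dict from timestamp (in seconds) to value; the function names are hypothetical.

```python
def keep_most_recent(versions, k):
    """Only retain the k most recent values in a cell."""
    newest = sorted(versions, reverse=True)[:k]
    return {t: versions[t] for t in newest}


def keep_younger_than(versions, max_age_seconds, now):
    """Keep values until they are older than max_age_seconds."""
    return {t: v for t, v in versions.items() if now - t <= max_age_seconds}


cell_versions = {10: "crawl 1", 20: "crawl 2", 30: "crawl 3"}
latest_two = keep_most_recent(cell_versions, 2)
recent_only = keep_younger_than(cell_versions, max_age_seconds=15, now=40)
```

Either filter would be applied when old versions are garbage-collected, so clients that never delete anything still see bounded storage per cell.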

47 Locality Groups Column families can be assigned to a locality group Used to organize underlying storage representation for performance Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table) Data in a locality group can be explicitly memory-mapped

48 Locality Groups
[Figure: column families contents: (<html>...), language: (EN), and pagerank: (0.65) assigned to locality groups]

49 API Metadata operations Create/delete tables, column families, change metadata Writes (atomic) Set(): write cells in a row DeleteCells(): delete cells in a row DeleteRow(): delete all cells in a row Reads Scanner: read arbitrary cells in a bigtable Each row read is atomic Can restrict returned rows to a particular range Can ask for just data from 1 row, all rows, etc. Can ask for all columns, just certain column families, or specific columns

50 Shared Logs Designed for 1M tablets, 1000s of tablet servers 1M logs being simultaneously written performs badly Solution: shared logs Write log file per tablet server instead of per tablet Updates for many tablets co-mingled in same file Start new log chunks every so often (64MB) Problem: during recovery, server needs to read log data to apply mutations for a tablet Lots of wasted I/O if lots of machines need to read data for many tablets from same log chunk

51 Shared Log Recovery
Recovery:
- Servers inform master of log chunks they need to read
- Master aggregates and orchestrates sorting of needed chunks
  - Assigns log chunks to be sorted to different tablet servers
  - Servers sort chunks by tablet and write sorted data to local disk
- Other tablet servers ask master which servers have sorted chunks they need
- Tablet servers issue direct RPCs to peer tablet servers to read sorted data for their tablets
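The sorting step is what turns a co-mingled log chunk into something each recovering server can read with one sequential pass. A minimal sketch, assuming each log entry is a (tablet_id, sequence_number, mutation) tuple; that layout and the name `sort_log_chunk` are assumptions for the example.

```python
from itertools import groupby
from operator import itemgetter


def sort_log_chunk(entries):
    """Group a co-mingled log chunk by tablet, keeping each tablet's mutations in order."""
    ordered = sorted(entries, key=itemgetter(0, 1))  # by tablet, then sequence number
    return {
        tablet: [(seq, mutation) for _, seq, mutation in group]
        for tablet, group in groupby(ordered, key=itemgetter(0))
    }


# Mutations for two tablets interleaved in one shared log chunk.
shared_chunk = [
    ("tablet-2", 1, "set x"),
    ("tablet-1", 2, "set b"),
    ("tablet-1", 1, "set a"),
    ("tablet-2", 2, "set y"),
]
by_tablet = sort_log_chunk(shared_chunk)
```

A server recovering tablet-1 then fetches only `by_tablet["tablet-1"]` from the peer that sorted the chunk instead of scanning the whole chunk itself.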

52 Compression Many opportunities for compression Similar values in the same row/column at different timestamps Similar values in different columns Similar values across adjacent rows Within each SSTable for a locality group, encode compressed blocks Keep blocks small for random access (~64KB compressed data) Exploit fact that many values very similar Needs to be low CPU cost for encoding/decoding Two building blocks: BMDiff, Zippy

53 BMDiff
Bentley, McIlroy, DCC '99: Data Compression Using Long Common Strings
- Input: dictionary * source
- Output: sequence of
  - COPY: <x> bytes from offset <y>
  - LITERAL: <literal text>
- Store hash at every 32-byte aligned boundary in
  - Dictionary
  - Source processed so far
- For every new source byte
  - Compute incremental hash of last 32 bytes
  - Lookup in hash table
  - On hit, expand match forwards & backwards and emit COPY
- Encode: ~100MB/s, Decode: ~1000MB/s
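The COPY/LITERAL scheme can be sketched as follows. This is a simplified illustration of the long-common-strings idea, not Google's BMDiff: it uses a Python dict keyed by the block bytes instead of an incremental hash, has no separate dictionary input (only "source processed so far"), and extends matches forwards only. The function names and the tuple encoding of the operations are assumptions.

```python
def bm_compress(data, block=32):
    """Encode data as LITERAL and COPY ops using aligned-block lookups."""
    table = {}         # block contents -> offset of an earlier aligned block
    ops = []
    literal_start = 0  # start of the pending run of unmatched bytes
    indexed = 0        # next aligned block offset not yet indexed
    i = 0
    while i + block <= len(data):
        # Index aligned blocks that lie entirely in the already processed prefix.
        while indexed + block <= i:
            table.setdefault(data[indexed:indexed + block], indexed)
            indexed += block
        match = table.get(data[i:i + block])
        if match is None:
            i += 1
            continue
        # Extend the match forwards as far as the bytes keep agreeing.
        length = block
        while i + length < len(data) and data[match + length] == data[i + length]:
            length += 1
        if literal_start < i:
            ops.append(("LITERAL", data[literal_start:i]))
        ops.append(("COPY", match, length))
        i += length
        literal_start = i
    if literal_start < len(data):
        ops.append(("LITERAL", data[literal_start:]))
    return ops


def bm_decompress(ops):
    out = bytearray()
    for op in ops:
        if op[0] == "LITERAL":
            out += op[1]
        else:
            _, offset, length = op
            for j in range(length):  # byte-wise, so a copy may overlap its own output
                out.append(out[offset + j])
    return bytes(out)


sample = b"abcdefgh" * 8 + b"XYZ" + b"abcdefgh" * 4
sample_ops = bm_compress(sample, block=8)  # small block so the toy input matches
```

Indexing only aligned blocks keeps the table small (one entry per block rather than per byte), at the cost of missing repeats shorter than a block; in the scheme on the slides those short, local repeats are left to Zippy.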

54 Zippy
- LZW-like: store hash of last four bytes in 16K entry table
- For every input byte:
  - Compute hash of last four bytes
  - Lookup in table
  - Emit COPY or LITERAL
- Differences from BMDiff:
  - Much smaller compression window (local repetitions)
  - Hash table is not associative
  - Careful encoding of COPY/LITERAL tags and lengths
- Sloppy but fast:

Algorithm  % remaining  Encoding  Decoding
Gzip       13.4%        21MB/s    118MB/s
LZO        20.5%        135MB/s   410MB/s
Zippy      22.2%        172MB/s   409MB/s

55 BigTable Compression Keys: Sorted strings of (Row, Column, Timestamp): prefix compression Values: Group together values by type (e.g. column family name) BMDiff across all values in one family BMDiff output for values 1..N is dictionary for value N+1 Zippy as final pass over whole block Catches more localized repetitions Also catches cross-column-family repetition, compresses keys
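Prefix compression of sorted keys stores, for each key, only how many leading characters it shares with the previous key plus the remaining suffix. A minimal sketch under assumed names (`prefix_compress`, `prefix_decompress`) and an assumed (shared_length, suffix) encoding; the exact on-disk format is not given in the slides.

```python
import os


def prefix_compress(sorted_keys):
    """Encode each key as (chars shared with previous key, remaining suffix)."""
    entries, previous = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        entries.append((shared, key[shared:]))
        previous = key
    return entries


def prefix_decompress(entries):
    keys, previous = [], ""
    for shared, suffix in entries:
        previous = previous[:shared] + suffix
        keys.append(previous)
    return keys


row_keys = [
    "com.cnn.www/index.html",
    "com.cnn.www/sports.html",
    "com.zuppa.www/menu.html",
]
compressed_keys = prefix_compress(row_keys)
```

Because neighbouring keys in a sorted table tend to share long prefixes, most entries shrink to a small integer and a short suffix.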

56 Compression Effectiveness
- Experiment: store contents for 2.1B page crawl in BigTable instance
- Key: URL of pages, with host-name portion reversed, e.g. com.cnn.www/index.html:http
- Groups pages from same site together
  - Good for compression (neighboring rows tend to have similar contents)
  - Good for clients: efficient to scan over all pages on a web site
- One compression strategy: gzip each page: ~28% bytes remaining
- BigTable: BMDiff + Zippy
- Table columns: Type, Count (B), Space (TB), Compressed, % remaining; rows: Web page contents, Links, Anchors
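The reversed-host-name key from this slide can be derived from a URL in a few lines. This sketch reproduces the one example key shown (com.cnn.www/index.html:http); the function name `row_key` is hypothetical, and how ports or query strings are handled is not covered by the slides, so the sketch ignores them.

```python
from urllib.parse import urlsplit


def row_key(url):
    """Build a row key with the host name reversed, then the path, then the scheme."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return f"{reversed_host}{parts.path}:{parts.scheme}"


example_key = row_key("http://www.cnn.com/index.html")

# Sorting by row key places pages of the same site next to each other.
sorted_row_keys = sorted(
    row_key(url)
    for url in [
        "http://www.cnn.com/sports.html",
        "http://www.zuppa.com/menu.html",
        "http://www.cnn.com/index.html",
    ]
)
```

With plain URLs as keys, pages from www.cnn.com and money.cnn.com would sort far apart; with the host reversed they share the prefix com.cnn. and land in neighbouring rows.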

57 In Development/Future Plans
- More expressive data manipulation/access
  - Allow sending small scripts to perform read/modify/write transactions so that they execute on server?
- Multi-row (i.e., distributed) transaction support
- General performance work for very large cells
- BigTable as a service?
  - Interesting issues of resource fairness, performance isolation, prioritization, etc. across different clients
