Distributed File Systems II

To do:
- Very-large scale: Google FS, Hadoop FS, Bigtable
- Next time: Naming things
GFS: a radically new environment
- NFS, etc.: independence, small scale, a variety of workloads
- GFS: cooperation, large scale, very specific and well-understood workloads
GFS environment
Why did Google build its own file system?
- Unique file system requirements:
  - Huge volume of data
  - Huge read/write bandwidth
  - Reliability over tens of thousands of nodes with frequent failures (clusters built from commodity nodes)
  - Mostly operating on large data blocks
  - Needs efficient distributed operations
- Google's unique position: it has control over, and customizes, its applications, libraries, operating system, networks, even its computers!
GFS workload
- Files are huge by traditional standards (GB, TB, PB)
  - Large files are >= 100 MB; multi-GB files are common
- Most files are mutated by appending new data rather than overwriting existing data
  - E.g., what did you search for, which link did you follow, ...
- Once written, the files are only read, often sequentially
  - Mining for patterns
- Appending becomes the focus of performance optimization and atomicity guarantees
- A conventional, if not standard, interface; some specialized operations (snapshot, record append)
GFS design aims
- Maintain data and system availability
- Handle failures gracefully and transparently
- Low synchronization overhead between entities of GFS
- Exploit parallelism of numerous entities
- Ensure high sustained throughput; this matters more than low latency for individual reads/writes
GFS file layout
- Files divided into fixed-size chunks (64 MB), each with an immutable, globally unique ID (chunk handle)
- Each chunk is replicated on multiple chunkservers for reliability
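The fixed chunk size makes the mapping from a file byte range to chunks pure arithmetic; a minimal sketch (the function names are illustrative, not part of GFS):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

def chunk_index(byte_offset: int) -> int:
    """Translate a file byte offset into a chunk index."""
    return byte_offset // CHUNK_SIZE

def chunk_range(byte_offset: int, length: int):
    """Chunk indices spanned by an access of `length` bytes at `byte_offset`."""
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return range(first, last + 1)
```

A read that straddles a 64 MB boundary simply touches two consecutive chunks.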
GFS architecture
- One master server, many chunkservers (100s-1000s)
- Master maintains all FS metadata:
  - File namespace
  - File-to-chunk mappings
  - Chunk location info
  - Access control info
  - Chunk version #s
  - Metadata maintained persistently in a replicated operation log
- Master uses heartbeats to check on chunkservers, garbage-collects orphaned chunks, and migrates chunks between chunkservers
(Diagram: the app/client exchanges metadata requests/responses with the master and read/write requests/responses with chunkservers, which store chunks in a local Linux FS.)
GFS architecture: chunkserver
- Stores 64 MB file chunks on local disk using a standard Linux filesystem, each with a version # and checksum
- Has no understanding of the overall file system; only deals with chunks
- Read/write requests specify a chunk handle and byte range
- Chunks replicated on a configurable number of chunkservers (default: 3)
- No caching of file data (beyond the standard Linux buffer cache)
- Sends periodic heartbeats to the master
GFS architecture: client
- No file system interface at the operating-system level; a user-level API is provided
  - Does not support all the features of POSIX file system access, but looks familiar (e.g., open, close, read)
- Two special operations:
  - Snapshot: an efficient way of creating a copy of the current instance of a file or directory tree
  - Record append: allows a client to append data to a file as an atomic operation without having to lock the file; multiple processes can append to the same file concurrently without fear of overwriting one another's data
Read algorithm
- Access request translated by the GFS client
- Ask (RPC) the master for the chunk handle and replica locations (info cached at the client)
- Get data (RPC) from one of the replicas
Steps:
1. Application issues a read (file name, byte range) to the GFS client
2. Client translates it to (file name, chunk index) and sends it to the master
3. Master replies with (chunk handle, replica locations)
4. Client sends (chunk handle, byte range) to one chunkserver replica
5. Chunkserver returns the data
6. Client returns the data from the file to the application
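The steps above can be sketched end to end. This is a toy model, not the GFS API: the master's tables and a chunkserver's store are plain dicts standing in for RPCs, and all names and values are made up.

```python
CHUNK_SIZE = 64 * 1024 * 1024

# Master state (hypothetical layout): file -> list of chunk handles,
# chunk handle -> replica locations.
master_chunks = {"/logs/a": ["h0", "h1"]}
master_replicas = {"h0": ["cs1", "cs2", "cs3"], "h1": ["cs2", "cs4", "cs5"]}

# A chunkserver's local store: chunk handle -> chunk bytes.
chunkserver_data = {"h0": b"x" * 10, "h1": b"y" * 10}

def gfs_read(path, offset, length):
    # Step 2: translate (file, byte range) to a chunk index.
    idx = offset // CHUNK_SIZE
    # Step 3: master returns (chunk handle, replica locations); cacheable.
    handle = master_chunks[path][idx]
    replicas = master_replicas[handle]
    # Steps 4-5: fetch the byte range from one replica (here: a local dict).
    start = offset % CHUNK_SIZE
    return replicas[0], chunkserver_data[handle][start:start + length]
```

Because the (handle, locations) answer is cached, repeated reads of the same chunk skip the master entirely, which is what keeps it off the data path.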
Write algorithm
- Master grants a chunk lease to one replica, called the primary
  - Leases expire after 60 seconds; the primary can request extensions; the master can take the lease back
1. Application issues a write (file name, data) to the GFS client
2. Client sends (file name, chunk index) to the master
3. Master replies with (chunk handle, primary & secondary replica locations)
4. Client pushes the data to all replicas (primary and secondaries), which buffer it in memory
Write algorithm (cont.)
- Primary picks an order for mutations to the chunk and asks replicas to apply the same mutations in the same order
5. When all replicas ACK the data, client sends the write command to the primary
6. Primary assigns a serial order and applies the buffered mutations (D1, D2, D3) to its chunk
7. Primary forwards the write command and serial order to the secondaries, which apply the mutations in that order
8. Secondaries respond to the primary
9. Primary responds to the client
- Similar to passive replication, but optimized for large data
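The key invariant is that only the primary serializes concurrent mutations, so every replica of a chunk applies them in one agreed order. A minimal sketch under toy assumptions (in-memory replicas, no RPCs or failures; all names are illustrative):

```python
class Replica:
    def __init__(self):
        self.chunk = bytearray(16)  # one toy chunk

    def apply(self, offset, data):
        self.chunk[offset:offset + len(data)] = data

def primary_commit(primary, secondaries, buffered_writes):
    # Step 6: the primary alone picks one serial order for the
    # concurrently buffered client writes...
    serial = sorted(buffered_writes, key=lambda w: w["arrival"])
    for w in serial:
        primary.apply(w["offset"], w["data"])
        # Step 7: ...and every secondary applies the same order.
        for s in secondaries:
            s.apply(w["offset"], w["data"])

p, s1, s2 = Replica(), Replica(), Replica()
writes = [{"arrival": 2, "offset": 0, "data": b"BB"},
          {"arrival": 1, "offset": 0, "data": b"AA"}]
primary_commit(p, [s1, s2], writes)
# Every replica saw AA then BB, so all converge to the same bytes.
```

Without a single serializer, two replicas could apply the overlapping writes in different orders and end up with different chunk contents.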
GFS record append
- Google uses large files as queues between multiple producers and consumers
- Same control flow as for writes, except:
  - Client pushes data to replicas of the last chunk of the file
  - Client sends the request to the primary
- Common case: the request fits in the current last chunk:
  - Primary appends the data to its own replica
  - Primary tells secondaries to do the same at the same byte offset in theirs
  - Primary replies with success to the client
GFS record append (cont.)
- When the data won't fit in the last chunk:
  - Primary fills the current chunk with padding
  - Primary instructs the other replicas to do the same
  - Primary replies to the client: retry on the next chunk
- If a record append fails at any replica, the client retries the operation
  - So replicas of the same chunk may contain different data, even duplicates of all or part of the record data
- What guarantee does GFS provide on success? The data is written at least once as an atomic unit
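The padding case can be sketched with a toy chunked file (a list of bytearrays standing in for one replica's chunks; the class and its 32-byte chunk size are invented for illustration). When a record won't fit, the current chunk is padded and the append lands at the start of a fresh chunk, so every replica can place it at the same offset:

```python
CHUNK = 32  # toy chunk size (64 MB in real GFS)

class ChunkedFile:
    def __init__(self):
        self.chunks = [bytearray()]  # appends go to the last chunk

    def record_append(self, record: bytes) -> int:
        assert len(record) <= CHUNK
        last = self.chunks[-1]
        if len(last) + len(record) > CHUNK:
            # Record won't fit: pad out the current chunk, start a new one.
            last.extend(b"\x00" * (CHUNK - len(last)))
            self.chunks.append(bytearray())
            last = self.chunks[-1]
        offset = (len(self.chunks) - 1) * CHUNK + len(last)
        last.extend(record)
        return offset  # the offset reported back to the client

f = ChunkedFile()
f.record_append(b"a" * 30)        # nearly fills chunk 0
off = f.record_append(b"b" * 10)  # doesn't fit: chunk 0 padded, lands in chunk 1
```

A failed attempt that is retried simply calls `record_append` again, which is how duplicates of a record can appear while still guaranteeing at-least-once, atomic placement.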
GFS limitations
- Security? Trusted environment, trusted users
- Master is the biggest impediment to scaling
  - Performance bottleneck; holds all data structures in memory
  - Takes a long time to rebuild metadata
  - Most vulnerable point for reliability
  - Solution: systems with multiple master nodes, all sharing a set of chunkservers; not a uniform name space
- Large chunk size: can't afford to make it smaller, since this would create more work for the master
Fault tolerance
- Fast recovery: master and chunkservers designed to restart and restore state in seconds
  - No persistent log of chunk locations in the master
- Chunks replicated across multiple machines and racks
- Data structures are kept in memory; must be able to recover from system failure
  - Log of all changes made to metadata; checkpoints of state when the log grows too big
  - Log and latest checkpoint used to recover state
  - Log and checkpoints replicated on multiple machines
GFS summary
- Success: used actively by Google
  - Availability and recoverability on cheap hardware
  - High throughput by decoupling control and data
  - Supports massive data sets and concurrent appends
- Semantics not transparent to apps
  - Must verify file contents to avoid inconsistent regions and repeated appends (at-least-once semantics)
- Performance not good for all apps
  - Assumes a write-once, streaming-read workload (no client caching!)
- Replaced in 2010 by Colossus
  - Eliminates the master node as a single point of failure
  - Targets latency problems and more latency-sensitive applications
  - Reduces block size to between 1 and 8 MB
  - Few details public
Hadoop Distributed File System (HDFS)
- Apache Hadoop: a SW framework for distributed storage and processing of big data sets using the MapReduce programming model
  - Its key component: HDFS
  - Hadoop splits files into blocks, distributes them across nodes, and transfers packaged code to the nodes to process data in parallel (data locality)
  - Hadoop's MapReduce and HDFS were inspired by Google's MapReduce and GFS
- HDFS
  - Portable file system; not POSIX-compliant
  - Provides shell commands and a Java API similar to other file systems
  - Can be mounted using FUSE
HDFS
(Architecture diagram.)
GFS vs. HDFS
- GFS: Master / HDFS: NameNode
- GFS: chunkserver / HDFS: DataNode
- GFS: operation log / HDFS: journal, edit log
- GFS: chunk / HDFS: block
- GFS: random file writes possible / HDFS: only append is possible
- GFS: multiple-writer, multiple-reader model / HDFS: single-writer, multiple-reader model
- GFS: chunk made of 64 KB data pieces, each with a 32-bit checksum / HDFS: per block, two files created on a DataNode: a data file & a metadata file (checksums, timestamp)
- GFS: default chunk size 64 MB / HDFS: default block size 128 MB
Bigtable
- Distributed storage (not a FS) for structured data
- Designed to scale to petabytes of data stored across thousands of commodity servers
  - 450,000 machines (NYTimes estimate, June '06)
- Example users: Google Earth, Google Analytics, Google Finance, Personalized Search, ...
- Built on:
  - Scheduler (Google WorkQueue)
  - Google File System
  - Chubby lock service: a lock/file/name service with coarse-grained locks; can store a small amount of data in a lock
Data model: a big map
- <Row, Column, Timestamp> triple as the key; each value is an uninterpreted array of bytes
- Arbitrary columns on a row-by-row basis
  - Column names have the form family:qualifier; families are heavyweight, qualifiers lightweight
  - Column-oriented physical store; rows are sparse!
- Lookup, insert, delete API
- Each read or write of data under a single row key is atomic
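The "big map" view fits in a few lines. A minimal sketch of the model only (a dict keyed by (row, column, timestamp); the row and column names below are illustrative, in the style of the web-table example, not real data):

```python
# Map from (row, column, timestamp) to an uninterpreted byte string.
table = {}

def put(row, column, timestamp, value: bytes):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Return the most recent value stored under (row, column)."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 5, b"<html>v1")
put("com.cnn.www", "contents:", 6, b"<html>v2")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
```

Note the timestamp dimension: a cell holds multiple versions, and a plain read returns the newest one.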
Bigtable is not...
- ...a DHT, although it stores structured data
  - Not addressing the same problems as DHTs: churn, variable bandwidth, untrusted participants
  - Key-value pairs are useful but too limiting
- ...nor a database
  - No table-wide integrity constraints
  - No multi-row transactions
  - Uninterpreted values: no aggregation over data
  - Can specify: keep the last N versions, or the last N days
  - C++ functions, not SQL (no complex queries)
  - Clients indicate what data to cache in memory
Tables, tablets and SSTables
- Bigtable keeps data in lexicographic order by row key
- The row range for a table is dynamically partitioned; each row range is called a tablet
  - The unit of distribution and load balancing
  - Clients can exploit this by selecting their row keys for good locality, e.g., maps.google.com/index.html stored under key com.google.maps/index.html
- A tablet is built out of multiple, possibly shared, SSTables
(Diagram: a table split into tablets by row key, e.g. one tablet spanning rows "aardvark" to "apple"; each tablet points to several SSTables, each made of 64K blocks plus an index.)
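The row-key trick above is just hostname reversal; a small sketch (the helper name is made up):

```python
def locality_key(url: str) -> str:
    """Store a page under its reversed hostname so pages from the
    same domain are adjacent in lexicographic row order."""
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

keys = sorted(locality_key(u) for u in
              ["maps.google.com/index.html",
               "www.cnn.com/world",
               "mail.google.com/inbox"])
# The two google.com hosts now sort next to each other, so they tend
# to fall into the same tablet.
```

Since tablets are contiguous row ranges, adjacent keys usually mean one tablet server and one set of SSTables for a whole domain's pages.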
SSTable
- Immutable, sorted file of key-value pairs
- Chunks of data plus an index
  - Index is of block ranges, not values
  - Index loaded into memory when the SSTable is opened
  - Lookup is a single disk seek
  - Alternatively, a client can load the whole SSTable into memory
(Diagram: an SSTable is a sequence of 64K blocks followed by an index.)
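A minimal in-memory sketch of this layout: sorted pairs grouped into blocks, plus an index of each block's first key. A lookup bisects the index to find the single block that can hold the key (the one disk seek in the real structure). The class and its 3-entry block size are toy choices:

```python
import bisect

class SSTable:
    BLOCK = 3  # entries per block (64 KB of data in real Bigtable)

    def __init__(self, items: dict):
        pairs = sorted(items.items())  # immutable, sorted once at build time
        self.blocks = [pairs[i:i + self.BLOCK]
                       for i in range(0, len(pairs), self.BLOCK)]
        self.index = [b[0][0] for b in self.blocks]  # first key of each block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None  # key sorts before the first block
        for k, v in self.blocks[i]:  # scan one block only
            if k == key:
                return v
        return None

t = SSTable({f"k{i:02d}": i for i in range(10)})
```

Immutability is what makes this cheap: since the file never changes, the index never needs updating and readers need no locks.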
Servers
- Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
  - Each tablet lives at only one server
  - A tablet server splits tablets that get too big
- Master responsible for load balancing and fault tolerance
  - Uses Chubby to monitor the health of tablet servers, restarts failed servers
  - GFS replicates data; prefer to start a tablet server on the same machine where the data already is
Editing/reading a table
- Mutations are committed to a commit log (in GFS), then applied to an in-memory version (memtable)
  - For concurrency, each memtable row is copy-on-write
- Reads are applied to a merged view of the SSTables & the memtable
- Reads & writes continue during a tablet split or merge
(Diagram: a tablet's memtable holds recent sorted inserts/deletes layered on top of the sorted SSTables.)
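The merged view can be sketched as a newest-first list of layers: the memtable first, then progressively older SSTables, with a deletion marker in a newer layer hiding older values. Plain dicts stand in for the real structures; all names are illustrative:

```python
TOMBSTONE = object()  # deletion marker written by a delete mutation

def merged_read(key, layers):
    """Read through layers, newest first; layers[0] is the memtable."""
    for layer in layers:
        if key in layer:
            v = layer[key]
            return None if v is TOMBSTONE else v
    return None  # key absent everywhere

memtable = {"boat": b"new", "apple": TOMBSTONE}
sstable1 = {"apple": b"old", "cat": b"meow"}
sstable2 = {"apple": b"older"}
```

Deletes never touch the immutable SSTables; they only add a tombstone that shadows the older versions until a compaction discards them.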
Compactions
- Minor compaction: convert a full memtable into an SSTable and start a new memtable
  - Reduces memory usage
  - Reduces log traffic on restart
- Merging compaction
  - Reduces the number of SSTables
  - A good place to apply policy: "keep only N versions"
- Major compaction
  - A merging compaction that results in only one SSTable
  - No deletion records, only live data
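A toy merging compaction under simplified assumptions: each SSTable is a dict from (key, timestamp) to value, `None` stands in for a deletion record, and the policy is "keep only N versions". A major compaction additionally drops tombstones and everything they shadow:

```python
def compact(sstables, keep_versions=2, major=False):
    merged = {}
    for t in sstables:  # later tables win on an identical (key, ts)
        merged.update(t)
    out, seen = {}, {}
    # Walk entries newest-first per key (reverse lexicographic order).
    for (key, ts), val in sorted(merged.items(), reverse=True):
        s = seen.setdefault(key, {"n": 0, "deleted": False})
        if s["n"] >= keep_versions or (major and s["deleted"]):
            continue  # version GC'd, or shadowed by a newer tombstone
        s["n"] += 1
        if val is None:        # deletion record (tombstone)
            s["deleted"] = True
            if not major:      # a major compaction keeps no tombstones
                out[(key, ts)] = None
        else:
            out[(key, ts)] = val
    return out

old = {("a", 1): b"v1", ("a", 2): b"v2", ("b", 1): b"x"}
new = {("a", 3): b"v3", ("b", 2): None}  # "b" deleted at ts=2
result = compact([old, new], keep_versions=2, major=True)
```

After the major compaction, only live data survives: the two newest versions of "a"; "b" and its tombstone are gone entirely.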
Finding a tablet
- A three-level hierarchy stores tablet location information: a Chubby file points to the root tablet, the root tablet holds the locations of the METADATA tablets, and the METADATA tablets hold the locations of the user tablets
- The client library caches tablet locations
- The METADATA table also includes a log of all events concerning each tablet
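A toy version of the lookup: each level is a sorted list of (end_row, location) pairs, bisected by row key. Every name below (tablet IDs, server names, the "metadata-" keys) is invented for illustration, and rows are assumed to fall within the covered ranges:

```python
import bisect

def locate(level, row):
    """Find the location whose tablet range contains `row` at one level."""
    ends = [end for end, _ in level]
    return level[bisect.bisect_left(ends, row)][1]

# Level 2: the root tablet maps key ranges to METADATA tablets.
root = [("metadata-m", "meta_tablet_1"), ("metadata-z", "meta_tablet_2")]
# Level 3: METADATA tablets map key ranges to user tablets.
metadata = {
    "meta_tablet_1": [("com.google", "ts1:tablet4"),
                      ("metadata-m", "ts2:tablet7")],
    "meta_tablet_2": [("metadata-z", "ts3:tablet9")],
}

def find_tablet(row):
    meta = locate(root, row)            # hop 2: consult the root tablet
    return locate(metadata[meta], row)  # hop 3: consult a METADATA tablet
```

With the client cache mentioned above, the common case skips these hops entirely and goes straight to the tablet server.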
Summary
- GFS / HDFS
  - Data-center customized API and optimizations
  - Append-focused DFS
  - Separate control (filesystem) and data (chunks)
  - Replication and locality
  - Rough consistency → apps handle the rest
- Bigtable
  - Specialized storage rather than file systems
  - Value simple designs