CSE 124: Networked Services, Fall 2009
Lecture 19
Instructor: B. S. Manoj, Ph.D.
http://cseweb.ucsd.edu/classes/fa09/cse124
Some of these slides are adapted from various sources/individuals, including but not limited to images and text from IEEE/ACM digital libraries, Google File System documentation, and publicly available information gathered through the Google search engine. Use of these slides for purposes other than pedagogy in CSE 124 may require explicit permission from the respective sources.
12/1/2009 CSE 124 Networked Services Fall 2009 1
Announcements
- Programming Project 2 presentation/demo is scheduled for 12/03/2009
  - Each team gets 10 minutes for presentation, demo, and Q&A
  - The presentation should be fewer than 10 slides and should cover:
    - the purpose of the project
    - the functioning of the project
    - the challenges that you have faced
    - the problems that you have solved
    - the expected commercial potential of your project
  - The demo can either follow the presentation or run alongside it
- Finals schedule: Thursday of finals week, 3:00pm-5:59pm, CENTER 218
  - Topic details will be announced soon
Google File System
- Google File System: a scalable distributed file system
  - For large, distributed, data-intensive applications
  - Widely deployed within Google
- Scalability
  - 100s of terabytes
  - 1000s of disks
  - 1000s of machines
- Main benefits
  - Fault tolerance while running on commodity hardware
  - High aggregate performance
Why GFS?
- Component failures are common
  - Application bugs, OS bugs, human errors; failures of disks, memory, connectors, networking, or power
- File sizes are huge
  - Multi-GB files are common; even TBs are expected
  - I/O operation and block sizes need to be reconsidered
- Most files are appended to, most often
  - Most operations append new data; overwrites are rare
  - Random writes within files are mostly non-existent
  - Large repositories scanned by data-analysis programs
  - Data streams generated by continuously running programs
  - Archival data
- Co-designing the file system with the applications is far more effective
  - API design must consider the application
  - Atomic append lets multiple clients concurrently append data
  - Useful for clusters of 1000s of nodes
Design objectives of GFS
- Run on inexpensive commodity hardware
- Must store a modest number of large files
  - A few million files, each 100MB or even multi-GB
  - Small files must also be supported
- Must support two kinds of reads
  - Large streaming reads (1 MB or more)
  - Small random reads (a few KBs); clients may batch and sort multiple small reads
- Must support many large sequential writes
  - Similar in size to the reads above
  - Written files are seldom modified
  - Small writes must be supported (possibly with lower efficiency)
- Must support concurrent appends to the same file
  - Multiple clients must be able to append to one file
- Must provide high sustained throughput
GFS API
- Similar to the standard POSIX file API
  - Supports the usual create, delete, open, close, read, and write operations
- Additional interfaces
  - Snapshot: creates a copy of a file or directory tree very efficiently
  - Record append: allows multiple clients to append to the same file simultaneously; atomicity is guaranteed
GFS architecture
GFS Master
- Single-master design, chosen to simplify the original design
- Maintains
  - Namespaces for files as well as chunks
  - Access control information
  - Mapping from files to chunks
  - Current locations of chunks
- Does
  - Mapping of files to chunks
  - Chunk lease management
  - Garbage collection
  - Chunk migration between chunkservers
- Scalability of the single-master design
  - Several petabytes of data, and heavy processing load for metadata
  - GFS evolved to multiple GFS masters over a collection of chunkservers
  - Up to 8 masters could be mapped onto one collection of chunkservers
Master design
- The Master sends instructions to chunkservers
  - To delete a given chunk
  - To create a new chunk
- Periodic communication between the Master and chunkservers keeps state information current:
  - Is the chunkserver down?
  - Are there disk failures on the chunkserver?
  - Are any replicas corrupted?
  - Which chunk replicas does the chunkserver store?
Master bottleneck
- The Master is typically fast, since the metadata is small
  - Less than 64 bytes per file name
  - 64 bytes of metadata per 64MB chunk
- So one Master worked well in early designs, when file counts were smaller
- Workloads with many small files (e.g., Gmail files) changed this
  - The metadata became too large
  - The Master's memory limits how much metadata it can hold
  - The Master became a bottleneck
Chunks and chunkservers
- A chunk is analogous to a disk block, but very large
  - Stored as a file on a chunkserver
  - Size: 64 MB!
  - A chunk handle (~ a chunk file name) is used to locate it
  - Each chunk is replicated across multiple chunkservers (minimum three replicas)
- Chunkservers
  - Store chunks
  - Do not cache chunks
- Large chunk size: pros
  - Reduces the number of Client-Master interactions
  - Allows a persistent TCP connection between Client and Chunkserver
  - Reduces the size of the metadata in the Master
- Large chunk size: cons
  - Can create hotspots
  - Inefficient for storing smaller files (e.g., Gmail files)
Client-Chunkserver interactions
- A read request originates from an application
- The GFS client receives the request and translates it:
  - <file name, byte offset> -> <file name, chunk index>
  - Note that chunk sizes are fixed (64MB)
- The GFS client queries the Master with <file name, chunk index>
- The Master returns the <chunk handle> and the locations of the chunkservers holding replicas
- The GFS client requests the chunk from one of the chosen chunkservers
  - Usually the nearest chunkserver is chosen
- The chunkserver sends the requested data to the client
- The GFS client forwards the data to the application
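Because chunk sizes are fixed at 64MB, the client's translation from a byte offset to a chunk index is simple arithmetic. A minimal sketch (the function name is illustrative, not GFS's actual API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # chunks are a fixed 64 MB

def translate(filename, byte_offset):
    """Translate an application-level <file name, byte offset> into the
    <file name, chunk index> pair the GFS client sends to the Master,
    plus the offset within that chunk used when reading from a chunkserver."""
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return filename, chunk_index, offset_in_chunk
```

For example, a read at byte offset 200,000,000 falls in chunk index 2, since 2 x 64MB = 134,217,728 <= 200,000,000 < 3 x 64MB = 201,326,592.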
Example of Client-Master interaction
1. Application (search indexer) -> GFS client: file name crawl_index_99, offset 2048 bytes
2. GFS client -> GFS Master: file name crawl_index_99, chunk index 3
   The Master's mapping for crawl_index_99:
     Chunk_001 (R1, R5, R8)
     Chunk_002 (R8, R4, R6)
     Chunk_003 (R4, R3, R2)
3. GFS Master -> GFS client: Chunk_003, chunkservers R4, R3, R2
Example of Client-Chunkserver interaction
4. GFS client -> chunkserver: Chunk_003, 2048 bytes requested
5. Chunkserver -> GFS client: the 2048 bytes
6. GFS client -> application (search indexer): the 2048 bytes
(Replicas of Chunk_003 are held at chunkservers R4, R3, and R2; the client reads from one of them)
Example of a Write operation
1. Application (search indexer) -> GFS client: file name crawl_index_99, DATA
2. GFS client -> GFS Master: file name crawl_index_99, chunk index
3. GFS Master -> GFS client: chunk handle, primary and secondary replica information
Example Write (continued)
4. The GFS client pushes DATA to all replicas: the primary chunkserver and both secondary chunkservers stage the data in a buffer
Example Write (continued)
5. GFS client -> primary chunkserver: Write request
6. The primary assigns a serial order to the buffered data (D1, D2, D3) and applies it to its chunk
7. Primary -> secondary chunkservers: forwards the Write request; the secondaries apply the buffered data in the same order
Example Write (continued)
8. Secondary chunkservers -> primary: responses
9. Primary -> GFS client: response
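The write steps above can be sketched as code. This is a hypothetical, single-process simulation of the protocol (class and function names are illustrative), showing the key idea: data is pushed to every replica's buffer first, and only the primary decides the order in which buffered data is applied.

```python
class Chunkserver:
    """Toy replica: stages pushed data in a buffer, applies it on commit."""
    def __init__(self, name):
        self.name = name
        self.buffer = []   # data staged by step 4 (data flow)
        self.chunk = []    # durable chunk contents

    def push_data(self, data):
        self.buffer.append(data)          # step 4

    def commit(self, order):
        # Apply buffered data in the serial order chosen by the primary.
        for i in order:
            self.chunk.append(self.buffer[i])
        self.buffer.clear()
        return "ok"

def gfs_write(client_data, primary, secondaries):
    # Step 4: the client pushes data to every replica's buffer.
    for cs in [primary] + secondaries:
        for d in client_data:
            cs.push_data(d)
    # Step 5: the client sends the Write request to the primary.
    # Step 6: the primary picks a serial order and applies it.
    order = list(range(len(client_data)))
    primary.commit(order)
    # Step 7: the primary forwards the write (with the order) to secondaries.
    acks = [s.commit(order) for s in secondaries]   # step 8: responses
    # Step 9: the primary replies to the client once all secondaries ack.
    return "success" if all(a == "ok" for a in acks) else "retry"
```

Because every replica applies the same order, all replicas end up with identical chunk contents even when multiple clients write concurrently.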
Write control and data flow in GFS
- From the original GFS publication
- Data flow need not be one-to-many: data is pipelined along a chain of chunkservers
  - Improves bandwidth utilization
  - Takes the locations of primaries and secondaries into account
- Control flow and data flow are separated
Master operation novelties
- Locking
  - Read locks and write locks are separate
  - Enables efficient concurrent operations
- Replica placement
  - Large numbers of big chunks
  - Bandwidth limitations of racks: the combined bandwidth of all servers in a rack far exceeds the rack's uplink bandwidth
  - Policies
    - Maximize reliability and availability
    - Maximize network bandwidth utilization
  - Solution
    - Place chunk replicas across racks
    - Prefer nearby racks so that network bandwidth is better utilized
Master operation novelties (continued)
- Creation, re-replication, rebalancing
- Creation: where to place the initially empty (new) chunks
  - Place new chunks on chunkservers with below-average disk utilization
  - Avoid chunkservers where the number of recent creations is high, since heavy write traffic is likely to follow
  - Place new chunk replicas across different racks
- Re-replication: creating additional replicas when the existing number of replicas falls below the user's requirement
  - A chunkserver may be unavailable
  - A replica may be corrupted or in error
  - The replication goal may have been increased
  - Strategy: the Master picks the highest-priority chunk and replicates it on additional chunkservers
  - A chunk's priority is boosted based on its impact, e.g., a chunk that has lost more replicas
Master operation novelties (continued)
- Rebalancing
  - Examines the current replica distribution
  - Moves replicas to better disk space
  - Balances network load as well as file-system space
  - Decides which replicas to move
  - Gradually fills new chunkservers
- Garbage collection
  - Deletion of a chunk is logged instantly, but the actual chunk deletion is not done immediately
  - The deleted file is renamed to a hidden name with a timestamp
  - During regular scans of the file-system namespace, old hidden files are removed; until then the file can be undeleted
  - Orphan chunks (not reachable from any file) are deleted as well
  - The Master-chunkserver HeartBeat message is used for garbage collection:
    - The chunkserver reports the chunks it has
    - The Master replies with the chunks that are no longer in its metadata
    - The chunkserver deletes those unwanted chunks
  - Stale replicas are detected and deleted using chunk version numbers
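The HeartBeat-driven garbage collection exchange reduces to a set difference: the chunkserver reports the chunk handles it stores, and the Master answers with those no longer present in its metadata. A minimal sketch (function name is illustrative):

```python
def master_gc_response(live_handles, reported_handles):
    """Given the chunk handles still referenced in the Master's metadata
    and the handles a chunkserver reports in its HeartBeat, return the
    orphan handles the chunkserver is free to delete."""
    return set(reported_handles) - set(live_handles)

# A chunkserver storing c1, c2, c3 while the Master only knows c1 and c3
# would be told it may delete c2.
```

Doing deletion lazily this way folds garbage collection into regular background traffic and makes accidental deletions recoverable until the namespace scan removes the hidden file.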
High availability
- Fast recovery
- Chunk replication
- Master replication
  - Replicates the metadata, logs, and checkpoints
  - Shadow masters provide read-only access to the file system when the primary master is unavailable
- Data integrity
  - Checksum-based
  - Each chunk is broken into 64KB blocks
  - Each block has a 32-bit checksum
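The per-block checksumming can be sketched as follows. The slides do not specify GFS's actual checksum function, so CRC-32 is used here purely as a stand-in 32-bit checksum:

```python
import zlib

BLOCK = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def block_checksums(chunk_bytes):
    """Compute one 32-bit checksum per 64 KB block of a chunk
    (CRC-32 as an assumed stand-in for GFS's checksum)."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK]) & 0xFFFFFFFF
            for i in range(0, len(chunk_bytes), BLOCK)]

def verify(chunk_bytes, stored_checksums):
    """Recompute block checksums and compare against the stored ones;
    any mismatch indicates a corrupted block in this replica."""
    return block_checksums(chunk_bytes) == stored_checksums
```

Checking per 64KB block, rather than per 64MB chunk, lets a chunkserver localize corruption to a small region and serve or re-replicate the rest of the chunk.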
Performance setup
- 1 master, two master replicas, 16 chunkservers, and 16 clients
- Each machine: 1.4GHz Pentium III, 2 GB memory, 2x80GB 5400 RPM disks, 100Mbps full-duplex link
- The server machines (19) and the client machines (16) are connected through a 1Gbps link
Performance (read)
- Read rate for N clients, up to 16 readers
  - Each client randomly reads a 4MB region from a 320GB file
  - This is repeated until the entire file is read
- Aggregate rates and theoretical limits
  - The 1Gbps network link saturates at 125MBps
  - Each client's 100Mbps NIC saturates at 12.5MBps
  - One reader achieves 80% of the per-client limit
  - 16 readers achieve an aggregate of 94MBps, about 75% of the 125MBps limit
Performance (write)
- N clients write to N distinct files
  - Each client writes 1GB of data to a file, in 1MB writes
- Aggregate rate and theoretical limit
  - Limit: 67MBps, since each byte is written to 3 replicas and each chunkserver has a 12.5MBps NIC
  - Achieved aggregate: 35MBps for 16 clients, i.e., 2.2MBps/client
- The main culprit was the network software stack
  - Each write must go to multiple replicas
  - The network protocol stack does not pipeline these writes well
  - Real-world performance is better
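The 67MBps theoretical write limit follows directly from the setup numbers above; a quick back-of-the-envelope check:

```python
# Each byte is written to 3 replicas, and each of the 16 chunkservers
# has a 12.5 MBps (100 Mbps) input link, so the aggregate write limit
# is the total chunkserver input bandwidth divided by the replica count.
chunkservers = 16
nic_mbps = 12.5      # MBps per 100 Mbps full-duplex NIC
replicas = 3
limit = chunkservers * nic_mbps / replicas   # = 200 / 3
```

This rounds to about 67MBps, which is why writes saturate well below the 125MBps link limit seen in the read test.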
GFS performance: record append
- N clients append simultaneously to a single file
- Performance is limited by the network bandwidth of the chunkservers storing the last chunk of the file
- The rate drops from 6MBps/client for one client to 4.8MBps/client for 16 clients
Real-world measurements
- Cluster A: used for research and development by over a hundred engineers
  - A typical task is initiated by a user and runs for a few hours
  - A task reads MBs to TBs of data, transforms/analyzes it, and writes the results back
- Cluster B: used for production data processing
  - A typical task runs much longer than a Cluster A task
  - Continuously generates and processes multi-TB data sets
  - Human users are rarely involved
- Both clusters had been running for about a week when the measurements were taken
Performance with clusters
Performance of GFS
Reading: Google File System documents