CSE 124: Networked Services Lecture-16


Fall 2010, CSE 124: Networked Services, Lecture-16 (11/23/2010)
Instructor: B. S. Manoj, Ph.D
http://cseweb.ucsd.edu/classes/fa10/cse124

Updates
- PlanetLab experiments have begun; the first batch has been given access information. Read through the PlanetLab documentation at www.planetlab.org.
- Project-2 idea final presentation: those who have questions on the Super-proxy should request an office appointment.
- Presentation/demo deadline: the last lecture class (December 2nd, 2010).
- Submission of the report (one page or more), documentation, and final source code: finals week. The report should contain a brief description of the project and instructions for building and using the code.

Google File System

Google File System
- A scalable distributed file system for large, distributed, data-intensive applications
- Widely deployed within Google
- Scalability: 100s of terabytes, 1000s of disks, 1000s of machines
- Main benefits: fault tolerance while running on commodity hardware, and high aggregate performance

Why GFS?
- Component failures are common: application bugs, OS bugs, human errors, and failures of disks, memory, connectors, networking, or power
- File sizes are huge: multi-GB files are common and even TBs are expected, so I/O operations and block sizes have to be reconsidered
- Most files are appended to most often: most operations append new data, overwriting is rarer, and random writes within files are mostly non-existent. Typical workloads: large repositories scanned by data-analysis programs, data streams generated by continuously running programs, and archival data
- Co-designing the file system with the applications is far more optimal: API design must consider the applications, and atomic append lets multiple clients concurrently append data, which is useful for clusters of 1000s of nodes

Design objectives of GFS
- Run on inexpensive commodity hardware
- Store a modest number of large files: a few million files, each 100 MB or even multi-GB
- Small files must be supported
- Support two kinds of reads: large streaming reads (1 MB or more) and small random reads (a few KB), which may be batched and sorted
- Support many large sequential writes, similar in size to the reads; written files are seldom modified
- Small writes must be supported (possibly with less efficiency)
- Support concurrent appends to the same file by multiple clients
- Provide high sustained throughput

GFS API
- Similar to the standard POSIX file API: supports the usual create, delete, open, close, read, and write operations
- Additional interfaces:
  - Snapshot: creates a copy of a file or a directory tree very efficiently
  - Record append: allows multiple clients to append to the same file simultaneously, with atomicity guaranteed
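As a rough illustration only (the real GFS client library is internal to Google), the interface described above might be sketched as follows, with hypothetical names and signatures:

# Hypothetical sketch of the GFS interface described above. These class and method
# names are assumptions for illustration, not the actual client library.

class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "GFSFile": ...
    def snapshot(self, src: str, dst: str) -> None:
        """Efficiently copy a file or directory tree (copy-on-write of its chunks)."""

class GFSFile:
    def read(self, offset: int, length: int) -> bytes: ...
    def write(self, offset: int, data: bytes) -> None: ...
    def close(self) -> None: ...
    def record_append(self, data: bytes) -> int:
        """Atomically append `data` at an offset chosen by GFS and return that offset;
        many clients may call this concurrently on the same file."""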

GFS architecture (architecture diagram shown in lecture; not reproduced in this transcription)

GFS Master
- Single-master design (later: shadow masters and master replicas), chosen to simplify the original design
- Maintains: the namespace for files as well as chunks, access control information, the mapping from files to chunks, and the current locations of chunks
- Does: mapping of files to chunks, chunk lease management, garbage collection, and chunk migration between chunkservers
- Scalability of the single-master design: several petabytes of data and the processing load for metadata pushed its limits; GFS evolved to multiple GFS masters over a collection of chunkservers, with up to 8 masters mapped onto one chunkserver
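A minimal sketch of the metadata the master keeps in memory, per the list above; the field and type names are assumptions for illustration:

# Minimal sketch of the master's in-memory metadata, as listed above.
# Field and type names are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkInfo:
    handle: int                   # globally unique chunk handle
    version: int = 0              # used to detect stale replicas
    locations: List[str] = field(default_factory=list)   # chunkserver addresses; not persisted,
                                                          # rebuilt from chunkserver reports

@dataclass
class FileMeta:
    acl: str = ""                 # access control information
    chunks: List[int] = field(default_factory=list)       # chunk handles, in file order

@dataclass
class MasterState:
    namespace: Dict[str, FileMeta] = field(default_factory=dict)    # full path -> file metadata
    chunk_table: Dict[int, ChunkInfo] = field(default_factory=dict) # handle -> chunk info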

Master Design
- The master sends instructions to chunkservers: to delete a given chunk, or to create a new chunk
- Periodic communication between the master and each chunkserver keeps state information: Is the chunkserver down? Are there disk failures on the chunkserver? Are any replicas corrupted? Which chunk replicas does the chunkserver store?

Master bottleneck
- The master is typically fast, since the metadata is small: less than 64 bytes per file name (with prefix compression applied) and 64 bytes of metadata per 64 MB chunk
- So one master worked well in early designs
- When file sizes became smaller (e.g., Gmail files), a large number of files resulted; the metadata became too large for the master's limited memory, and the master became a bottleneck
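A back-of-the-envelope estimate using the figures above (under 64 bytes per file name, about 64 bytes per 64 MB chunk); the file counts and average sizes below are illustrative assumptions, but they show why many small files overwhelm the master's memory while a modest number of large files do not:

# Rough estimate of master memory from the per-file and per-chunk figures above.
# The workloads plugged in at the bottom are illustrative assumptions.

CHUNK_SIZE = 64 * 2**20      # 64 MB chunks
PER_CHUNK_META = 64          # bytes of metadata per chunk
PER_FILE_META = 64           # bytes per file name (prefix-compressed, upper bound)

def master_memory_bytes(num_files: int, avg_file_size: int) -> int:
    chunks_per_file = max(1, -(-avg_file_size // CHUNK_SIZE))   # ceiling division
    return num_files * (PER_FILE_META + chunks_per_file * PER_CHUNK_META)

# A modest number of large files: 1 million files averaging 1 GB each.
print(master_memory_bytes(1_000_000, 2**30) / 2**30)        # ~1 GiB: fits in memory
# Many small files (e.g., mail messages): 200 million files averaging 1 MB each.
print(master_memory_bytes(200_000_000, 2**20) / 2**30)      # ~24 GiB: master becomes a bottleneck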

Chunks and chunkservers
- A chunk is analogous to a block, but very large: 64 MB, stored as a file on a chunkserver
- A chunk handle (roughly, the chunk's file name) is used to locate it
- Chunks are replicated across multiple chunkservers, with a minimum of three replicas
- Chunkservers store chunks but do not cache them
- Large chunk size, pros: reduces the number of client-master interactions, allows a persistent TCP connection between client and chunkserver, and reduces the size of the metadata on the master
- Large chunk size, cons: can create hotspots, and is inefficient for storing smaller files (e.g., Gmail files)

Client-Chunkserver interactions (read)
- A read request originates at the application; the GFS client receives it and translates <file name, byte offset> into <file name, chunk index> (chunk sizes are fixed at 64 MB)
- The GFS client queries the master with the file name and chunk index
- The master identifies the chunk handle and the locations of the chunkservers holding that chunk
- The GFS client requests the chunk data from one of those chunkservers, usually the nearest one
- The chunkserver sends the requested data to the client, and the GFS client forwards it to the application
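A minimal sketch of this read path, assuming 64 MB chunks; the offset-to-chunk-index translation follows directly from the fixed chunk size, while lookup_chunk and read_from are hypothetical stand-ins for the master and chunkserver RPCs:

# Sketch of the read path described above. lookup_chunk and read_from are
# hypothetical stand-ins for the real master/chunkserver protocol.

CHUNK_SIZE = 64 * 2**20   # chunks are a fixed 64 MB

def to_chunk_coords(byte_offset: int):
    """Translate a byte offset within a file into (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    # For simplicity, assumes the read does not cross a chunk boundary.
    chunk_index, chunk_offset = to_chunk_coords(offset)
    # 1. Ask the master which chunk backs (filename, chunk_index) and where its replicas are.
    chunk_handle, replicas = master.lookup_chunk(filename, chunk_index)
    # 2. Read directly from one replica, usually the nearest chunkserver.
    return replicas[0].read_from(chunk_handle, chunk_offset, length)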

Example of client-master interaction
1. The application (a search indexer) asks the GFS client for file crawl_index_99 at offset 164 KB, size 2048 bytes.
2. The GFS client sends the master the file name crawl_index_99 and chunk index 3.
3. From its table for crawl_index_99 (Chunk_001 on R1, R5, R8; Chunk_002 on R8, R4, R6; Chunk_003 on R4, R3, R2), the master replies with Chunk_003 and its chunkservers R4, R3, R2.

Example of client-chunkserver interaction
4. The GFS client requests 2048 bytes of Chunk_003 from one of the chunkservers holding it.
5. The chunkserver returns the 2048 bytes to the GFS client.
6. The GFS client forwards the 2048 bytes to the application (the search indexer).

Example of a write operation
1. The application (a search indexer) hands the GFS client the file name crawl_index_99 and the DATA to write.
2. The GFS client sends the master the file name crawl_index_99 and the chunk index.
3. The master replies with the chunk handle and the primary and secondary replica information.

Example write (continued)
4. The GFS client pushes the DATA to all replicas; the primary chunkserver and the secondary chunkservers each buffer it.

Example write (continued)
5. The GFS client sends a Write request to the primary chunkserver.
6. The primary orders the buffered data (D1, D2, D3) and applies it to its chunk.
7. The primary forwards the Write request to the secondary chunkservers, which apply the buffered data to their chunks in the same order.

Example write (continued)
8. The secondary chunkservers send responses to the primary.
9. The primary sends a response back to the GFS client.

Write control and data flow in GFS (from the original GFS publication)
- The data flow may not be one-to-many: data is pushed from server to server for better bandwidth utilization
- The forwarding path depends on the location of the primary and the secondaries
- Data flow and control flow are separated
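A sketch of how the data and control flows separate under the chain-forwarding scheme the slide refers to; buffer, forward, and apply_buffered are hypothetical helpers, not the actual GFS RPCs:

# Sketch of the write flow above: data is pushed along a chain of chunkservers rather
# than one-to-many, then a small control message to the primary commits the mutation.
# buffer/forward/apply_buffered are hypothetical stand-ins for the real RPCs.

def push_data(chain: list, data: bytes) -> None:
    """Data flow: push linearly along the replica chain so each machine's full
    outbound bandwidth is used to relay to the next-nearest server."""
    chain[0].buffer(data)                      # client sends to the nearest chunkserver first
    for sender, receiver in zip(chain, chain[1:]):
        sender.forward(data, receiver)         # each server relays to the next one in the chain
        receiver.buffer(data)

def commit_write(primary, secondaries: list, chunk_handle: int) -> bool:
    """Control flow: a small write request goes to the primary, which picks the
    mutation order and forwards the request to the secondaries."""
    serial = primary.apply_buffered(chunk_handle)
    acks = [s.apply_buffered(chunk_handle, serial) for s in secondaries]
    return all(acks)                           # primary then replies to the client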

Master operations novelties: locking and replica placement
- Locking: read locks and write locks are separate, for efficiency when multiple activities proceed concurrently
- Replica placement must account for a large number of big chunks and the bandwidth limitations of racks: the combined bandwidth of all servers in a rack far exceeds the rack's uplink bandwidth
- Policies: maximize reliability and availability, and maximize network bandwidth utilization
- Solution: place chunk replicas across racks, preferring nearby racks so that network bandwidth can be better utilized
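As a rough illustration of the placement policy above, a rack-aware chooser might look like the following; the input format and the utilization-based ordering are assumptions, since the paper gives no code:

# Sketch of a rack-aware replica placement policy in the spirit of the slide above.
# The input format and scoring are assumptions for illustration.

def place_replicas(chunkservers, num_replicas=3):
    """chunkservers: list of (server_id, rack_id, disk_utilization).
    Pick servers on distinct racks, preferring lightly loaded ones."""
    by_util = sorted(chunkservers, key=lambda s: s[2])   # lightly loaded first
    chosen, used_racks = [], set()
    for server_id, rack_id, _ in by_util:
        if rack_id not in used_racks:
            chosen.append(server_id)
            used_racks.add(rack_id)
        if len(chosen) == num_replicas:
            break
    return chosen

print(place_replicas([("s1", "r1", 0.9), ("s2", "r1", 0.2),
                      ("s3", "r2", 0.4), ("s4", "r3", 0.5)]))   # ['s2', 's3', 's4']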

Master operations novelties: creation and re-replication
- Creation: where to place the initially empty (new) chunks
  - Place new chunks on chunkservers with low average utilization
  - Avoid chunkservers with a high number of recent creations, since heavy write traffic tends to follow creation
  - Spread new chunk replicas across different racks
- Re-replication refers to creating additional replicas when the existing number of replicas falls below the user's requirement, e.g. because a chunkserver is unavailable, a replica is corrupted or in error, or the replication goal has been increased
  - Strategy: the master picks the highest-priority chunk and replicates it on additional chunkservers; a chunk's priority is boosted based on its impact, e.g. one that has lost more replicas
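A small sketch of the re-replication priority described above, where the chunk furthest below its replication goal is repaired first; the data layout is assumed for illustration:

# Sketch of the re-replication priority above: the chunk missing the most replicas
# relative to its goal is cloned first. The tuple layout is assumed for illustration.

def next_chunk_to_rereplicate(chunks):
    """chunks: list of (chunk_handle, live_replicas, replication_goal).
    Returns the handle furthest below its goal, i.e. the highest-priority repair."""
    under_replicated = [c for c in chunks if c[1] < c[2]]
    if not under_replicated:
        return None
    return max(under_replicated, key=lambda c: c[2] - c[1])[0]

print(next_chunk_to_rereplicate([("c1", 2, 3), ("c2", 1, 3), ("c3", 3, 3)]))   # 'c2'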

Master operations novelties: rebalancing and garbage collection
- Rebalancing: the master examines the current replica distribution and moves replicas for better disk space and load balance, balancing the network as well as file system space; it decides which replicas to move and gradually fills up new chunkservers
- Garbage collection: deletion of a file is logged instantly, but the actual chunk deletion is not done immediately
  - The deleted file is renamed to a hidden name with a timestamp; during regular scans of the file system namespace, old hidden files are removed, and until then the file can be undeleted
  - Orphan chunks (not reachable from any file) are deleted as well
  - The master-chunkserver HeartBeat message is used for garbage collection: the chunkserver reports the chunks it has, the master responds with the chunks for which it holds no metadata, and the chunkserver deletes those unwanted chunks
  - Stale replicas are detected and deleted using chunk version numbers
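A minimal sketch of the HeartBeat-driven garbage collection exchange above, with assumed data structures:

# Sketch of the HeartBeat-based garbage collection described above.
# The set-based data structures are assumptions for illustration.

def master_orphan_reply(master_chunk_table: set, reported_chunks: set) -> set:
    """Master side: any chunk the chunkserver reports but the master has no
    metadata for is an orphan and may be deleted."""
    return reported_chunks - master_chunk_table

def chunkserver_heartbeat(local_chunks: set, master_chunk_table: set) -> set:
    orphans = master_orphan_reply(master_chunk_table, local_chunks)
    return local_chunks - orphans        # chunks the server keeps; orphans are freed

print(chunkserver_heartbeat({1, 2, 3}, {1, 3, 7}))   # {1, 3}; chunk 2 is garbage-collected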

High availability
- Fast recovery
- Chunk replication
- Master replication: the metadata, logs, and checkpoints are replicated; shadow masters replicate the master's read-only operations
- Data integrity: checksum-based; each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum
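A sketch of the per-block checksumming described above; the paper does not name the checksum function in this summary, so CRC32 is an assumed choice:

# Sketch of per-block data integrity as described above: each chunk is split into
# 64 KB blocks, each with a 32-bit checksum. CRC32 is an assumed checksum choice.
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks within a chunk

def block_checksums(chunk_data: bytes):
    """Compute a 32-bit checksum for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_block(chunk_data: bytes, block_index: int, stored) -> bool:
    """Check one block against its stored checksum before serving it to a reader."""
    block = chunk_data[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
    return zlib.crc32(block) == stored[block_index]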

Performance setup
- 1 master, two master replicas, 16 chunkservers, and 16 clients
- Each machine: 1.4 GHz Pentium III, 2 GB memory, two 80 GB 5400 RPM disks, 100 Mbps full-duplex link
- A 1 Gbps link connects the servers (19 machines) to the clients (16 machines)

Performance (Read)
- Read rate for N clients, up to 16 readers; each client randomly reads a 4 MB region from a 320 GB file set, repeated until the entire set is read
- Aggregate rates and theoretical limits: at 125 MB/s the 1 Gbps network link saturates, and at 12.5 MB/s per client the 100 Mbps client NIC saturates
- About 80% of the per-client limit is achieved for one reader, and about 75% of the aggregate limit (94 MB/s) for 16 readers
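The quoted limits follow directly from the link speeds; a quick check (treating 1 Gbps as 1000 Mb/s and 1 MB as 10^6 bytes, matching the paper's rounding):

# Quick check of the read limits quoted above.
link_limit = 1000e6 / 8 / 1e6    # 1 Gbps inter-switch link -> 125 MB/s aggregate limit
nic_limit = 100e6 / 8 / 1e6      # 100 Mbps client NIC      -> 12.5 MB/s per-client limit
print(link_limit, nic_limit)      # 125.0 12.5
print(0.80 * nic_limit)           # ~10 MB/s observed per client (80% of the per-client limit)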

Performance (Write)
- N clients write to N distinct files; each client writes 1 GB of data to a file in 1 MB writes
- Aggregate rate and theoretical limit: the limit is 67 MB/s, because each byte is written to multiple replicas, each chunkserver having a 12.5 MB/s input link
- Observed aggregate: 35 MB/s for 16 clients, i.e. about 2.2 MB/s per client
- The main culprit is the network software stack: each piece of data is written to multiple replicas, and the protocol stack does not pipeline these transfers well; real-world write performance is better
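The 67 MB/s limit follows from writing each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input link:

# Where the write limits quoted above come from.
chunkservers, replicas, per_server_in = 16, 3, 12.5
print(chunkservers * per_server_in / replicas)   # ~66.7 MB/s aggregate write limit
print(35 / 16)                                   # ~2.2 MB/s per client actually observed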

GFS performance: record append
- N clients append simultaneously to a single file
- Performance is limited by the network bandwidth of the chunkserver holding the last chunk of the file
- The rate drops from 6 MB/s per client for one client to 4.8 MB/s per client for 16 clients

Real-world measurements
- Cluster A: used for research and development by over a hundred engineers; a typical task is initiated by a user and runs for a few hours, reading MBs to TBs of data, transforming/analyzing the data, and writing the results back
- Cluster B: used for production data processing; a typical task runs much longer than a Cluster A task, continuously generating and processing multi-TB data sets, with human users rarely involved
- Both clusters had been running for about a week when the measurements were taken

Performance with clusters

Performance of GFS

Master Requests

Reading: Google File System documents