Fall 2010 CSE 124: Networked Services
Lecture 17
Instructor: B. S. Manoj, Ph.D.
http://cseweb.ucsd.edu/classes/fa10/cse124
11/30/2010
Updates
- PlanetLab experiments (midway): a few batches completed. Be ready, and read through the PlanetLab documentation at www.planet-lab.org.
- Project-2 final idea presentation: register for a schedule slot.
- Presentation/demo deadline: last lecture class (December 2nd, 2010).
- Submission of the report (one page or more), documentation, and final source code: Friday midnight, finals week. It should contain a brief description of the project and instructions for building and using the code.
Haystack: a file system for another type of giant-scale service
Google File System
- A scalable distributed file system for large, distributed, data-intensive applications
- Widely deployed within Google
- Scalability: 100s of terabytes, 1000s of disks, 1000s of machines
- Main benefits: fault tolerance while running on commodity hardware; high aggregate performance
GFS serves one kind of giant-scale service
- Component failures are common
- File sizes are huge: multi-GB files are common, and TB files are expected
- I/O operation and block sizes must be reconsidered
- Most files are appended to: most operations append new data, overwrites are rare, and random writes within files are mostly non-existent
- Co-designing the file system with the application is far more effective; the API design must consider the application
- Atomic append lets multiple clients concurrently append data; useful for clustering 1000s of nodes
- GFS may not be efficient for services such as Facebook
Facebook's situation
- The biggest photo-sharing website and the biggest social networking service
- Photos dominate its storage requirements
- Stores over 260 billion photos (early 2010), about 20 petabytes
- Each week: 1 billion new photos, about 60 terabytes
- Peak image serving rate: over 1 million images/sec
Facebook faced photo storage challenges
- Read/write characteristics of photo storage: written once, read often, never modified, rarely deleted
- Why traditional file systems don't work well: directories and file metadata are inefficient
- Reading metadata requires many disk I/Os; for billions of photos, the metadata is too large to be stored in memory
- Many I/O operations are required (at least three):
  1. Filename-to-inode-number translation
  2. Read the inode from disk
  3. Read the file from disk
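The three-step cost above can be sketched with a toy I/O counter. This is purely illustrative (not Facebook code): it just models one uncached disk read per path component for name-to-inode translation, plus one for the inode and one for the data.

```python
# Illustrative model (not real filesystem code): count the disk reads a
# traditional POSIX-style filesystem needs to serve one uncached photo.
def traditional_photo_read_ios(path_components: int) -> int:
    """Model disk reads for an uncached open + read.

    1. One read per directory level to translate the name to an inode number.
    2. One read to load the file's inode from disk.
    3. One read to fetch the file contents.
    """
    name_to_inode_reads = path_components   # directory lookups
    inode_read = 1
    data_read = 1
    return name_to_inode_reads + inode_read + data_read

# A single directory level already costs the slide's "at least three" I/Os:
assert traditional_photo_read_ios(1) == 3
# Deeper paths (or directories too large to cache) only make it worse:
assert traditional_photo_read_ios(3) == 5
```

Haystack's goal, developed in the following slides, is to collapse this to a single data read by keeping all metadata in memory.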
Desired features of a large photo storage system
- High throughput: a large number of requests; reads at over 1 million images/sec at peak, writes at many millions/day
- Low latency: photos must be served quickly, which demands minimal disk I/O operations
- Fault tolerance: machine/disk failures happen very often, and an entire data center may fail; replicate in geographically distinct centers
- Cost-effective: cheaper to scale to large systems; measured as effective cost per terabyte of usable storage
- Simple: minimal time to deploy; in most cases only a few months of testing are available
Traditional CDN-based photo storage
- 1000s of files/directory: ~10 disk I/Os per image
- 100s of files/directory: ~3 disk I/Os per image
  1. Read directory metadata into memory
  2. Load the inode into memory
  3. Read the file contents
Why are CDNs not always useful?
- For social networks, CDNs are not always useful
- CDNs serve hot photos well: a large number of reads of a few photos, mostly recently uploaded ones
- Social networking has a long tail of objects: older photos that are rarely cached but account for a significant amount of Facebook's traffic, causing high CDN and cache miss rates
- Caching the long tail with CDNs is too expensive: CDNs are costly and yield diminishing returns
Facebook's improved approach
- To reduce disk I/Os per image, Photo Store caches were attempted
- After each file read, the Photo Server caches the filename-to-file mapping
- Proved only slightly better: with the long-tail read distribution of photos, it helps only when the handles are already in memory
Facebook's overall strategy
- Hot objects: CDN
- Long-tail objects: Haystack
- Haystack objectives: reduce file system metadata, store the metadata entirely in memory, and require only one disk I/O per image
Haystack components
- Three major components: the Haystack Store, the Haystack Directory, and the Haystack Cache
- Store: persistent photo storage
- Carries physical volumes (100 physical volumes of 100GB each, for a total store capacity of 10TB)
- Manages filesystem metadata
- Multiple Stores' physical volumes are grouped into logical volumes
- When a photo is written to a logical volume, all of its physical volumes get written: redundancy for fault tolerance
Haystack components: Store
- The design is kept very simple
- Requests contain a <photo id, physical volume> pair; an error is returned if the object is not located
- Each Store maintains multiple physical volumes; each physical volume is one large file (100GB)
- Each file operation requires only one disk I/O, as the file metadata is stored in memory
- The metadata is small for a given physical volume: filename, offset, and size
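The Store's core idea, metadata in RAM plus one positioned read, can be sketched as follows. This is a minimal illustration under assumed names (`Volume`, `put`, `get` are not Haystack's real API); the point is that `get` touches the disk exactly once.

```python
import os
import tempfile

# Hedged sketch of the Store's key idea: keep per-photo metadata
# (offset, size) in memory so a read costs exactly one disk I/O.
class Volume:
    def __init__(self, path: str):
        self.index = {}   # (key, alt_key) -> (offset, size), held entirely in RAM
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

    def put(self, key: int, alt_key: int, data: bytes) -> None:
        offset = os.lseek(self.fd, 0, os.SEEK_END)
        os.write(self.fd, data)                        # append-only, like a needle
        self.index[(key, alt_key)] = (offset, len(data))

    def get(self, key: int, alt_key: int) -> bytes:
        offset, size = self.index[(key, alt_key)]      # metadata lookup: no disk I/O
        return os.pread(self.fd, size, offset)         # the single disk read

vol = Volume(tempfile.mktemp())
vol.put(42, 0, b"jpeg-bytes")
assert vol.get(42, 0) == b"jpeg-bytes"
```

Because a physical volume is one big file, the per-photo metadata shrinks to roughly (offset, size), which is what makes holding it all in memory feasible.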
Haystack components: Directory
- Maintains the logical-to-physical volume mapping
- Application metadata: photo-to-logical-volume mapping, information necessary for URL creation, and the capacity of logical volumes
- The Directory constructs the URL for a photo
- Load-balances writes across logical volumes and reads across physical volumes
- Decides whether a photo should be served by the CDN
- Decides which volumes are read-only and which are write-enabled
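A rough sketch of the Directory's two mappings and URL construction. The URL shape loosely follows the paper's http://(CDN)/(Cache)/(Machine id)/(Logical volume, Photo) pattern; the class, field names, and hostnames here are assumptions for illustration.

```python
import random

# Illustrative Directory sketch (names and hosts are made up for the example).
class Directory:
    def __init__(self):
        self.logical_to_physical = {}   # logical volume -> list of store machines
        self.photo_to_logical = {}      # photo id -> logical volume
        self.writable = set()           # write-enabled logical volumes

    def assign_write(self, photo_id: int) -> int:
        # Load-balance writes across write-enabled logical volumes.
        lv = random.choice(sorted(self.writable))
        self.photo_to_logical[photo_id] = lv
        return lv

    def url_for(self, photo_id: int, use_cdn: bool = True) -> str:
        lv = self.photo_to_logical[photo_id]
        # Balance reads across the replicas (physical volumes) of the logical volume.
        machine = random.choice(self.logical_to_physical[lv])
        prefix = "http://cdn.example.com" if use_cdn else "http://cache.example.com"
        return f"{prefix}/cache/{machine}/{lv},{photo_id}"

d = Directory()
d.logical_to_physical[7] = ["store-01", "store-02", "store-03"]
d.writable.add(7)
assert d.assign_write(1234) == 7
assert d.url_for(1234, use_cdn=False).endswith("/7,1234")
```

The `use_cdn` flag mirrors the Directory's per-photo decision of whether to route the request through the external CDN or straight to the internal Cache.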
Haystack components: Cache
- An internal CDN: caches requests for the most popular photos
- Insulates the Store from CDN failure, and from the read burst immediately after a new hot photo is written
- Organized as a distributed hash table: locates a photo by its unique id as the key
- Receives HTTP requests from CDNs or from users' browsers
- Caches a photo only if: the request comes directly from a user (caching after a CDN miss is not useful), and the request is for an object on a write-enabled physical volume (many reads follow a write, and performance is worst when reads and writes come together)
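The Cache's two admission rules reduce to a one-line predicate; the function name is mine, but the logic is exactly the two conditions on the slide.

```python
# Hedged sketch of the Cache's admission policy: cache a photo only if
# (a) the request came directly from a user's browser, not via a CDN, and
# (b) the photo lives on a write-enabled volume.
def should_cache(direct_from_user: bool, volume_write_enabled: bool) -> bool:
    # (a) If a CDN already missed, caching the same object downstream rarely helps.
    # (b) Newly written photos are the hot ones; read-only volumes hold cold data.
    return direct_from_user and volume_write_enabled

assert should_cache(True, True)          # fresh upload requested by a browser: cache
assert not should_cache(False, True)     # request arrived through a CDN: skip
assert not should_cache(True, False)     # old photo on a read-only volume: skip
```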
Haystack architecture: photo read (figure)
Photo upload process in Haystack (figure)
- Step 2: the web server requests a write-enabled logical volume
- Step 4: the server assigns a unique id to the photo and writes it to the physical volumes
Organization of a physical volume (figure)
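A physical volume is a superblock followed by a sequence of needles. A simplified needle layout can be sketched with `struct`; the field set (magic, cookie, key, alternate key, flags, size, data, checksum) loosely follows the paper, but the field widths and magic value here are assumptions for illustration.

```python
import struct
import zlib

# Simplified, illustrative needle layout (field widths are assumptions):
# magic (4s), cookie (Q), key (Q), alt_key (I), flags (B), data size (Q).
NEEDLE_HEADER = struct.Struct("<4sQQIBQ")
MAGIC = b"NDLE"

def pack_needle(cookie: int, key: int, alt_key: int, flags: int, data: bytes) -> bytes:
    header = NEEDLE_HEADER.pack(MAGIC, cookie, key, alt_key, flags, len(data))
    footer = struct.pack("<I", zlib.crc32(data))     # integrity check used on read
    return header + data + footer

def unpack_needle(buf: bytes):
    magic, cookie, key, alt_key, flags, size = NEEDLE_HEADER.unpack_from(buf)
    assert magic == MAGIC                            # detect misaligned scans
    data = buf[NEEDLE_HEADER.size:NEEDLE_HEADER.size + size]
    (crc,) = struct.unpack_from("<I", buf, NEEDLE_HEADER.size + size)
    assert crc == zlib.crc32(data)                   # detect corruption
    return cookie, key, alt_key, flags, data

blob = pack_needle(cookie=99, key=42, alt_key=1, flags=0, data=b"jpeg")
assert unpack_needle(blob) == (99, 42, 1, 0, b"jpeg")
```

The self-delimiting header (size field) and checksum are what let the Store scan a volume sequentially and validate each needle, which matters for the index-recovery discussion below.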
Photo read/write/delete
- Read: the Cache supplies the logical volume id, key, alternate key, and cookie to the Store. The cookie (a random number) prevents attacks that guess photo URLs. The Store looks up the in-memory metadata, reads the entire needle from disk, and checks its integrity and cookie.
- Write: web servers provide the logical volume id, key, alternate key, cookie, and data. Each machine synchronously appends needles to its physical volumes and updates the in-memory mappings.
- Delete: the Store sets a delete flag in the in-memory mapping and in the volume file. Read requests check the delete flag and return an error. Deleted space is reclaimed later.
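The cookie check and delete-flag semantics can be sketched as follows. All names are illustrative, not Haystack's real API; the point is that a delete only flips a flag (space is reclaimed later by compaction), and a read rejects both deleted needles and wrong cookies.

```python
DELETED = 0x1   # illustrative flag bit

# Hedged sketch of the Store's read/write/delete semantics (in-memory only;
# a real Store would also append/flag needles in the volume file on disk).
class Store:
    def __init__(self):
        self.mapping = {}   # key -> {"cookie": ..., "flags": ..., "data": ...}

    def write(self, key: int, cookie: int, data: bytes) -> None:
        self.mapping[key] = {"cookie": cookie, "flags": 0, "data": data}

    def read(self, key: int, cookie: int) -> bytes:
        needle = self.mapping.get(key)
        if needle is None or needle["flags"] & DELETED:
            raise KeyError("photo not found")       # deleted looks like missing
        if needle["cookie"] != cookie:
            raise PermissionError("bad cookie")     # defeats URL guessing
        return needle["data"]

    def delete(self, key: int) -> None:
        self.mapping[key]["flags"] |= DELETED       # flag only; reclaim space later

s = Store()
s.write(1, cookie=777, data=b"img")
assert s.read(1, cookie=777) == b"img"
s.delete(1)
try:
    s.read(1, cookie=777)
    raised = False
except KeyError:
    raised = True
assert raised
```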
File index in the Haystack Store
- The index is a very important optimization in Haystack
- Reconstructing the in-memory mapping (metadata) from a physical volume is expensive: it takes a long time after a reboot
- The Store maintains an index file for each physical volume: a checkpoint for locating needles on disk
- The index file is a superblock followed by a sequence of index records for the needles; the order of the needles must be the same
- Orphan needles: needles without corresponding index records. The Store sequentially examines each orphan and creates a matching index record.
- Deleted photos: the index may retain records for some deleted photos for a while
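The recovery logic above can be shown with a toy rebuild function: the index file is a checkpoint, so any needles appended after the last checkpointed offset (the orphans) must be scanned from the volume and re-indexed on startup. The representation here (key/offset pairs) is a simplification for illustration.

```python
# Toy sketch of index recovery after a restart (simplified representation).
def rebuild_mapping(index_records, volume_needles):
    """index_records: [(key, offset)] flushed to the index file before restart.
    volume_needles: [(key, offset)] actually present in the volume file, in
    append order (same order as the index, plus orphans at the end)."""
    mapping = dict(index_records)                     # fast path: bulk-load checkpoint
    last = max((off for _, off in index_records), default=-1)
    # Sequentially examine each orphan and create a matching record.
    for key, offset in volume_needles:
        if offset > last:
            mapping[key] = offset
    return mapping

index = [(1, 0), (2, 100)]
volume = [(1, 0), (2, 100), (3, 200)]   # needle 3 was appended after the checkpoint
assert rebuild_mapping(index, volume) == {1: 0, 2: 100, 3: 200}
```

Because index records are written asynchronously, only the small orphan suffix needs a sequential scan, which is what makes reboot recovery fast compared to scanning the whole 100GB volume.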
File index at a Haystack Store (figure)
CDF of accesses vs. age (time since upload) (figure)
Multi-write operations per day (figure)
Cache hit rates for Haystack Stores (figure)
Experimental setup
- Commodity storage blade: two hyper-threaded quad-core Intel Xeon CPUs, 48GB main memory, a hardware RAID controller (RAID-6) with 256-512MB NVRAM, and 12x1TB SATA drives
- Each blade has about 9TB of usable disk storage capacity on average
- Photos are not cached on Store machines
Performance evaluation
- Benchmark tools:
- Randomio: an open-source tool issuing 64KB reads/writes, used to measure raw device performance
- Haystress: a custom tool for benchmarking Haystack; it stress-tests Haystack under a variety of synthetic workloads, measures the impact of network traffic (over HTTP), and assesses maximum read/write performance by issuing random reads and writes
Throughput and latency performance, synthetic workloads (figure; percentages relative to the raw device)
- Throughput 85%, delay 117%; throughput 97%, delay 103%
- Workload F: 98% reads, 2% multi-writes; workload G: 96% reads, 4% multi-writes
- Read throughput: 76-79%; read delay: 124-129%
- Write throughput change: 200%
- Multi-write batch sizes 1, 4, 16: throughput 30% (4), 75% (16); delay 310% (4), 895% (16)
Facebook production system throughput: read-only Store machines (figure)
- Read-only Store performance: 1K-2.5K reads
- Daily traffic increase: 0.2-0.5% (peak traffic: Sun-Mon)
- X-axis: week (Sun-Sat)
Facebook production system throughput: write-enabled Store machines (figure)
- High read rates: 3K-6K reads
- The Cache helps reads; average photos per multi-write: 9.27
Facebook production system latency (figure)
- Write-enabled machines: read latency is affected by writes; read traffic increases week by week; multi-writes benefit from the NVRAM-backed RAID
- Read-only machines: higher latency (no caches), but latency diminishes over time
- CPU utilization is low (4-8%)
Summary
- Facebook's Haystack: a new filesystem for giant-scale services, focused on the long tail where CDNs are not very useful
- File I/O is limited to one operation per image: minimal metadata, which can be kept entirely in memory
- Components: Haystack Directory, Store, and Cache
- Store: physical volumes, each a collection of needles, with a needle index for quick lookup and fast recreation of the mapping after reboot
- Read throughput: 97% of (close to) the raw device throughput
- One disk I/O interaction per image
Reading
- Haystack paper from Facebook (available from the course website)