CSE 124: Networked Services Lecture-17

Fall 2010
CSE 124: Networked Services, Lecture 17
Instructor: B. S. Manoj, Ph.D.
http://cseweb.ucsd.edu/classes/fa10/cse124
November 30, 2010

Updates
- PlanetLab experiments (midway): a few batches have completed. Be ready, and read through the PlanetLab documentation at www.planet-lab.org.
- Project-2 final idea presentation: register for a schedule slot.
- Presentation/demo deadline: last lecture class (December 2nd, 2010).
- Submission of report (one page or more), documentation, and final source code: Friday midnight of finals week. It should contain a brief description of the project and instructions for building and using the code.

Haystack: a file system for another type of giant-scale service

Google File System
- A scalable distributed file system for large, distributed, data-intensive applications
- Widely deployed within Google
- Scalability: hundreds of terabytes across thousands of disks on thousands of machines
- Main benefits: fault tolerance while running on commodity hardware, and high aggregate performance

GFS serves one kind of giant-scale service
- Component failures are common
- File sizes are huge: multi-GB files are common, and even TBs are expected
- I/O operation and block sizes must be reconsidered
- Most files are append-mostly: most operations append new data, there is little overwriting, and random writes within files are practically non-existent
- Co-designing the file system with the application is far more effective: API design must consider the application
- Atomic append lets multiple clients concurrently append data; useful for clusters of thousands of nodes
- GFS may not be efficient for services such as Facebook

Facebook's situation
- Facebook: the biggest photo-sharing website and the biggest social networking service
- Photos dominate its storage requirements
- Stores over 260 billion photos (early 2010), about 20 petabytes
- Each week: 1 billion new photos, roughly 60 terabytes
- Peak image serving rate: about 1 million images/sec

Facebook faced photo storage challenges
- Read/write characteristics of photo storage: written once, read often, never modified, rarely deleted
- Why traditional file systems don't work well: directory and file metadata handling is inefficient
  - For billions of photos, the metadata is too large to keep in memory, so reading it requires many disk I/Os
  - At least three I/O operations per photo read: (1) translate the filename to an inode number, (2) read the inode from disk, (3) read the file data from disk (see the sketch below)
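
To make the cost concrete, here is a toy Python sketch of the traditional lookup path described above. The directory table, inode table, and disk_read helper are hypothetical stand-ins chosen for illustration, not any real file system's API.

```python
# Toy model of the traditional per-photo lookup path (not a real file system).
# Each disk_read() stands for one disk I/O; a single photo read costs three.

DIRECTORY = {"photo_123.jpg": 42}                   # filename -> inode number (on disk)
INODE_TABLE = {42: {"block": 9000, "size": 4096}}   # inode number -> location (on disk)
BLOCKS = {9000: b"<jpeg bytes>"}                    # data blocks (on disk)

io_count = 0

def disk_read(table, key):
    """Simulate one disk I/O against an on-disk structure."""
    global io_count
    io_count += 1
    return table[key]

def read_photo(filename):
    inode_no = disk_read(DIRECTORY, filename)   # I/O 1: directory lookup
    inode = disk_read(INODE_TABLE, inode_no)    # I/O 2: read the inode
    return disk_read(BLOCKS, inode["block"])    # I/O 3: read the file data

read_photo("photo_123.jpg")
print(io_count)  # 3 disk I/Os for a single small photo
```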

Desired features of a large photo storage system
- High throughput: a large number of requests; reads at about 1 million images/sec at peak and writes at many millions per day
- Low latency: photos must be served quickly, which demands minimal disk I/O operations
- Fault tolerance: machine and disk failures happen very often, and an entire data center may fail; replicate in geographically distinct data centers
- Cost-effective: cheaper to scale to large systems; measured as effective cost per terabyte of usable storage
- Simple: minimal time to deploy; often only a few months of testing are available

Traditional CDN-based photo storage
- Thousands of files per directory: about 10 disk I/Os per image
- Hundreds of files per directory: about 3 disk I/Os per image
  1. Read directory metadata into memory
  2. Load the inode into memory
  3. Read the file contents

Why CDNs are not always useful
- For social networks, CDNs are not always useful
- CDNs serve hot photos well: a large number of reads of a small set of photos, mostly recently uploaded ones
- Social networking has a long tail of objects: older photos that are rarely cached yet account for a significant amount of Facebook's traffic
- The result is a high CDN miss rate and a high cache miss rate
- Caching the long tail in CDNs is too expensive: CDNs are costly and the returns diminish

Facebook's improved approach
- To reduce disk I/O per image, Photo Store caches were attempted
- After each file read, the photo server caches the filename-to-file-handle mapping
- This proved only slightly better: with the long-tail read distribution of photos, it helps only when the handles are already in memory

Facebook's overall strategy
- Hot objects: served by the CDN
- Long-tail objects: served by Haystack
- Haystack objectives: reduce file system metadata, store the metadata entirely in memory, and require only one disk I/O per image

Haystack components
- Three major components: the Haystack Store, the Haystack Directory, and the Haystack Cache
- Store: persistent photo storage
  - Holds physical volumes (e.g., 100 physical volumes of 100 GB each, for a total store capacity of 10 TB per machine)
  - Manages the filesystem metadata
  - Physical volumes on different Store machines are grouped into logical volumes: when a photo is written to a logical volume, it is written to all of its physical volumes, giving redundancy for fault tolerance

Haystack components: Store
- The design is very simple
- Requests identify a photo by <photo id, physical volume>; an error is returned if the object cannot be located
- Each Store machine maintains multiple physical volumes; each physical volume is one large file (about 100 GB)
- Each file operation requires only one disk I/O, because the per-photo metadata is held in memory
- The in-memory metadata for a given physical volume is small: filename, offset, and size (see the sketch below)
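
A minimal Python sketch of the idea behind the one-I/O read path. The StoreVolume class, its field names, and the in-memory map are illustrative assumptions, not the actual Haystack implementation.

```python
import os
from collections import namedtuple

# Illustrative in-memory metadata: photo id -> (offset, size) within the volume file.
NeedleInfo = namedtuple("NeedleInfo", ["offset", "size"])

class StoreVolume:
    """Toy model of one physical volume: a single large append-only file
    plus an in-memory index, so a read needs exactly one disk I/O."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND)
        self.index = {}  # photo_id -> NeedleInfo, kept entirely in memory

    def write(self, photo_id, data):
        offset = os.lseek(self.fd, 0, os.SEEK_END)  # end of file (single writer assumed)
        os.write(self.fd, data)                     # one sequential append
        self.index[photo_id] = NeedleInfo(offset, len(data))

    def read(self, photo_id):
        info = self.index.get(photo_id)
        if info is None:
            raise KeyError("photo not in this volume")   # Store returns an error
        return os.pread(self.fd, info.size, info.offset) # the single disk I/O

# Usage
vol = StoreVolume("/tmp/haystack_vol_demo")
vol.write(12345, b"<jpeg bytes>")
print(vol.read(12345))
```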

Haystack components: Directory
- Maintains the logical-to-physical volume mapping
- Holds application metadata: the photo-to-logical-volume mapping, the information needed for URL construction, and the free capacity of logical volumes
- Constructs the URL for each photo
- Load-balances writes across logical volumes and reads across physical volumes
- Decides whether a photo is served by the CDN or by the Cache, and which logical volumes are read-only vs. write-enabled
- A sketch of this bookkeeping follows below
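
A minimal Python sketch of the Directory's bookkeeping, loosely following the URL shape in the Haystack paper (CDN/Cache/machine/logical volume, photo) but simplified; the class, field names, and hostnames are assumptions made for illustration.

```python
import random

class HaystackDirectory:
    """Toy Directory: logical->physical mapping, photo->logical mapping,
    write-enabled volume set, and URL construction. Illustrative only."""

    def __init__(self):
        self.logical_to_physical = {}  # logical volume id -> [(machine, physical volume id), ...]
        self.photo_to_logical = {}     # photo id -> logical volume id
        self.write_enabled = set()     # logical volumes that still accept writes

    def assign_write_volume(self, photo_id):
        """Pick a write-enabled logical volume for an upload (load-balancing writes)."""
        lv = random.choice(sorted(self.write_enabled))
        self.photo_to_logical[photo_id] = lv
        return lv, self.logical_to_physical[lv]   # the write goes to every replica

    def construct_url(self, photo_id, use_cdn):
        """Build a read URL; reads are balanced across a volume's replicas."""
        lv = self.photo_to_logical[photo_id]
        machine, _ = random.choice(self.logical_to_physical[lv])
        prefix = "http://cdn.example.com" if use_cdn else "http://cache.example.com"
        return f"{prefix}/{machine}/{lv}/{photo_id}"

# Usage
d = HaystackDirectory()
d.logical_to_physical[7] = [("store-01", 7), ("store-02", 7), ("store-03", 7)]
d.write_enabled.add(7)
print(d.assign_write_volume(12345))
print(d.construct_url(12345, use_cdn=False))
```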

Haystack components: Cache
- Acts as an internal CDN: caches requests for the most popular photos
- Insulates the Store from CDN failures and from the burst of reads that follows writing a new hot photo
- Organized as a distributed hash table: receives HTTP requests from CDNs or from users' browsers and locates a photo by its unique id, used as the key
- A photo is cached only if both conditions hold (sketched below):
  - the request comes directly from a user's browser, not from a CDN (caching behind a CDN adds little)
  - the request is for an object on a write-enabled volume, because many reads follow a write and performance is worst when reads and writes mix
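
A small Python sketch of the admission rule above; the HaystackCache class, the FakeStore stub, and the parameter names are made up for illustration (a real deployment shards entries as a distributed hash table).

```python
class HaystackCache:
    """Toy cache: photo id -> bytes, with the admission rule from the slide."""

    def __init__(self, store):
        self.store = store    # anything with read(photo_id)
        self.entries = {}     # in reality this is a distributed hash table keyed by photo id

    def get(self, photo_id, from_browser, volume_write_enabled):
        if photo_id in self.entries:
            return self.entries[photo_id]        # cache hit
        data = self.store.read(photo_id)         # miss: fetch from the Store
        # Admission rule: cache only direct browser requests for photos that
        # live on a write-enabled volume (many reads follow a write).
        if from_browser and volume_write_enabled:
            self.entries[photo_id] = data
        return data

class FakeStore:
    """Stand-in for a Store machine, just for the usage example."""
    def read(self, photo_id):
        return b"<jpeg bytes>"

cache = HaystackCache(FakeStore())
cache.get(12345, from_browser=True, volume_write_enabled=True)    # served and cached
cache.get(67890, from_browser=False, volume_write_enabled=False)  # served, not cached
```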

Haystack architecture: photo read path (diagram)

Photo upload process in Haystack (diagram)
- Step 2: the web server requests a write-enabled logical volume from the Directory
- Step 4: the server assigns a unique id to the photo and writes it to the logical volume's physical volumes (see the sketch below)
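
A hedged sketch of the upload flow implied by the diagram, reusing the toy HaystackDirectory and StoreVolume classes from the sketches above; the upload_photo function and the itertools-based id counter are assumptions for illustration, not Facebook's id scheme.

```python
import itertools

_next_photo_id = itertools.count(1)   # stand-in for Haystack's unique id assignment

def upload_photo(directory, stores, data):
    """Toy upload flow: pick a write-enabled logical volume via the Directory,
    assign a unique photo id, and append the photo to every physical replica."""
    photo_id = next(_next_photo_id)                                      # step 4: unique id
    logical_volume, replicas = directory.assign_write_volume(photo_id)   # step 2: pick volume
    for machine, physical_volume in replicas:                            # step 4: write replicas
        stores[machine].write(photo_id, data)                            # synchronous append
    return photo_id, logical_volume

# Usage (with the toy classes defined in the earlier sketches):
# stores = {"store-01": StoreVolume("/tmp/vol01"), "store-02": StoreVolume("/tmp/vol02")}
# d.logical_to_physical[7] = [("store-01", 7), ("store-02", 7)]; d.write_enabled.add(7)
# upload_photo(d, stores, b"<jpeg bytes>")
```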

Organization of a physical volume (diagram): a superblock followed by a sequence of needles, one per stored photo. A simplified needle layout is sketched below.
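
A rough Python sketch of what a needle record might look like on disk. The field list, sizes, magic value, and flag bit here are simplified assumptions for illustration, not the precise on-disk format from the paper.

```python
import struct
import zlib

# Simplified needle layout (illustrative): header magic, cookie, key, alternate key,
# flags, data size, then the data bytes and a checksum. The real format has more
# fields plus padding.
NEEDLE_HEADER = struct.Struct("<I Q Q I B I")   # magic, cookie, key, alt_key, flags, size
MAGIC = 0x48415953          # "HAYS", made up for this sketch
FLAG_DELETED = 0x1          # delete flag stored with the needle

def pack_needle(cookie, key, alt_key, data, flags=0):
    header = NEEDLE_HEADER.pack(MAGIC, cookie, key, alt_key, flags, len(data))
    checksum = struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)
    return header + data + checksum

def unpack_needle(buf):
    magic, cookie, key, alt_key, flags, size = NEEDLE_HEADER.unpack_from(buf, 0)
    assert magic == MAGIC, "corrupt needle"
    data = buf[NEEDLE_HEADER.size:NEEDLE_HEADER.size + size]
    (checksum,) = struct.unpack_from("<I", buf, NEEDLE_HEADER.size + size)
    assert checksum == (zlib.crc32(data) & 0xFFFFFFFF), "checksum mismatch"
    return cookie, key, alt_key, flags, data

# Usage
blob = pack_needle(cookie=0xDEADBEEF, key=12345, alt_key=1, data=b"<jpeg bytes>")
print(unpack_needle(blob)[1:3])   # (12345, 1)
```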

Photo read/write/delete
- Read: the Cache supplies the logical volume id, key, alternate key, and cookie to the Store. The cookie (a random number) defends against attacks that guess photo URLs. The Store looks up the in-memory metadata, reads the entire needle from disk, and checks its integrity and cookie (see the sketch below).
- Write: web servers provide the logical volume id, key, alternate key, cookie, and data. Each Store machine synchronously appends the needle to its physical volume and updates its in-memory mapping.
- Delete: the Store sets the delete flag both in the in-memory mapping and in the volume file. Read requests check the delete flag and return an error for deleted photos. Deleted space is reclaimed later (by compaction).
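
A small sketch of the read-side checks described above, continuing the toy needle layout from the previous sketch (it reuses unpack_needle and FLAG_DELETED); the NeedleMeta record and the exact placement of the cookie and delete checks are simplified assumptions.

```python
import os
from collections import namedtuple

# In-memory record per photo: where the needle lives, how big it is, deleted or not.
NeedleMeta = namedtuple("NeedleMeta", ["offset", "size", "deleted"])

def read_photo(volume_fd, index, key, cookie):
    """Toy read path: one in-memory lookup, one disk read of the whole needle,
    then integrity, delete-flag, and cookie checks."""
    meta = index.get(key)
    if meta is None or meta.deleted:
        raise LookupError("photo not found")                  # deleted photos read as misses
    buf = os.pread(volume_fd, meta.size, meta.offset)         # the single disk I/O
    needle_cookie, _, _, flags, data = unpack_needle(buf)     # checksum verified inside
    if (flags & FLAG_DELETED) or needle_cookie != cookie:
        raise PermissionError("deleted photo or bad cookie")  # cookie defeats URL guessing
    return data

def delete_photo(index, key):
    """Toy delete: mark the in-memory entry; the on-disk flag and the space
    reclamation (compaction) are applied separately."""
    if key in index:
        index[key] = index[key]._replace(deleted=True)
```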

File index in the Haystack Store
- The index is an important optimization: reconstructing the in-memory mapping by scanning the physical volume is expensive and takes a long time after a reboot
- The Store maintains an index file for each physical volume: a checkpoint for locating needles on disk
- The index file is a superblock followed by a sequence of index records, one per needle, in the same order as the needles in the volume (a rebuild sketch follows below)
- Orphan needles: needles without corresponding index records. The Store sequentially examines each orphan and creates a matching index record
- Deleted photos: the index may not reflect deletions, so some deleted photos can be retained for longer
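
A rough sketch of what rebuilding the in-memory map from an index file plus an orphan scan might look like. The index record format is entirely assumed, the superblock is skipped, and the orphan scan reuses NEEDLE_HEADER from the toy needle sketch above; none of this is the real index file format.

```python
import os
import struct

# Toy index record: key, alt_key, flags, offset, size (superblock parsing omitted).
INDEX_RECORD = struct.Struct("<Q I B Q I")

def load_index(index_path):
    """Read the checkpointed index file into the in-memory mapping."""
    mapping = {}
    last_offset = 0
    with open(index_path, "rb") as f:
        while True:
            rec = f.read(INDEX_RECORD.size)
            if len(rec) < INDEX_RECORD.size:
                break
            key, alt_key, flags, offset, size = INDEX_RECORD.unpack(rec)
            mapping[(key, alt_key)] = (offset, size, flags)
            last_offset = max(last_offset, offset + size)
    return mapping, last_offset

def recover_orphans(volume_path, mapping, last_offset):
    """Scan needles appended after the last index record (orphans) and add
    matching in-memory entries, as described on the slide."""
    with open(volume_path, "rb") as vol:
        vol.seek(last_offset)
        while True:
            pos = vol.tell()
            hdr = vol.read(NEEDLE_HEADER.size)
            if len(hdr) < NEEDLE_HEADER.size:
                break
            magic, cookie, key, alt_key, flags, size = NEEDLE_HEADER.unpack(hdr)
            vol.seek(size + 4, os.SEEK_CUR)                   # skip data and checksum
            total = NEEDLE_HEADER.size + size + 4
            mapping[(key, alt_key)] = (pos, total, flags)
    return mapping
```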

File index layout in the Haystack Store (diagram)

CDF of accesses vs. age (time since upload) (figure)

Multi-write operations per day (figure)

Cache hit rates for Haystack Stores (figure)

Experimental setup
- Commodity storage blade:
  - two hyperthreaded quad-core Intel Xeon CPUs
  - 48 GB main memory
  - hardware RAID controller (RAID-6) with 256-512 MB of NVRAM
  - 12 x 1 TB SATA drives
- Each blade provides about 9 TB of usable disk capacity
- Photos are not cached on Store machines

Performance evaluation
- Benchmark tools:
  - RandomIO: an open-source tool that issues 64 KB reads/writes, used to establish a baseline for raw storage performance
  - Haystress: a custom tool for benchmarking Haystack. It stress-tests Haystack under a variety of synthetic workloads over HTTP, so the network overhead is included, and assesses maximum read/write performance by issuing random reads and writes

Throughput and latency performance on synthetic workloads (figure; percentages are relative to the RandomIO baseline)
- Read workloads: throughput of 85% with latency 117%, and throughput of 97% with latency 103%
- Multiwrite workloads (1, 4, or 16 images per multiwrite): relative to single writes, throughput with 4 and 16 images rises by roughly 30% and 75%, while latency grows to roughly 310% and 895%
- Mixed workloads F (98% reads, 2% multiwrites) and G (96% reads, 4% multiwrites): read throughput 76-79%, read latency 124-129%, and write throughput around 200% of the single-write case

Facebook production system throughput: read-only Store machines (figure)
- Read rate in the range 1K-2.5K
- Daily traffic increase of 0.2-0.5%, with peak traffic on Sunday and Monday (x-axis: one week, Sun-Sat)

Facebook production system throughput: write-enabled Store machines (figure)
- High read rates (3K-6K) even on write-enabled machines; the Cache helps absorb reads
- Average photos per multiwrite: 9.27

Facebook production system latency (figure)
- Write-enabled machines: read latency is affected by writes and by read traffic that grows week over week; multiwrites benefit from the NVRAM-backed RAID controller
- Read-only machines: high latency (no caches in front of them), diminishing over the period shown
- CPU utilization is low (4-8%)

Summary
- Facebook's Haystack: a new filesystem for giant-scale services
- Focuses on the long tail, where CDNs are not very useful
- File I/O is limited to one operation per photo, thanks to minimal metadata that can be kept entirely in memory
- Components: Haystack Directory, Store, and Cache
- The Store holds physical volumes, each a collection of needles, with a needle index for quick lookup and fast recreation of the in-memory mapping after reboot
- Read throughput is about 97% of the raw device throughput, with one disk I/O per image

Reading: the Haystack paper from Facebook (available from the course website)