Finding a Needle in a Haystack: Facebook's Photo Storage
Jack Hartner


Paper Outline
Introduction
Background & Previous Design
Design & Implementation
Evaluation
Related Work
Conclusion

Facebook Photo Storage Needs
260 billion images stored
20 petabytes of data stored
1 billion photos (60 terabytes) added per week
1 million images served per second at peak times
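A quick back-of-envelope check of these figures (a sketch in Python; only the slide's numbers are inputs, everything else is derived):

```python
# Back-of-envelope math from the slide's figures (all approximate).
TOTAL_IMAGES = 260e9        # images stored
TOTAL_BYTES = 20e15         # 20 petabytes stored
WEEKLY_IMAGES = 1e9         # photos added per week
WEEKLY_BYTES = 60e12        # 60 terabytes added per week
PEAK_READS = 1e6            # images served per second at peak

print(f"avg stored image: {TOTAL_BYTES / TOTAL_IMAGES / 1e3:.0f} KB")    # ~77 KB
print(f"avg new photo:    {WEEKLY_BYTES / WEEKLY_IMAGES / 1e3:.0f} KB")  # ~60 KB

# At ~10 ms per random seek, one spindle sustains ~100 random reads/s,
# so serving 1M reads/s from disk alone would need ~10,000 spindles;
# this is why every extra disk operation per photo read matters.
print(f"spindles needed at 100 IOPS each: {PEAK_READS / 100:,.0f}")
```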

A Look at Use Cases

Metric             Typical Distributed File System    Facebook Photo Storage
File size          Varied                             Small, constant
Data is written    Once                               Once
Data is read       Often                              Often
Data is modified   Often                              Never
Data is deleted    Sometimes                          Rarely

The Old Way: A Traditional POSIX Filesystem
High cost of directory navigation on disk and high cost of accessing per-file metadata on disk: large metadata and directory traversal require disk operations, so they become the throughput bottleneck.
[Diagram: directory tree root -> dir1, dir2, dir3 -> usr1, usr2, usr3; each level navigated costs disk I/O]
Q: "In our experience, we find that the disadvantages of a traditional POSIX based filesystem are directories and per file metadata." Explain how this disadvantage becomes the limiting factor for the read throughput.
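To make the bottleneck concrete, here is a rough sketch (hypothetical path and costs, not figures from the paper) of the disk operations an uncached read costs on a traditional POSIX layout: one per directory along the path, one for the file's inode, and one for the data itself.

```python
def cold_read_disk_ops(path: str) -> int:
    """Rough count of disk operations for an uncached POSIX photo read."""
    components = [c for c in path.split("/") if c]
    dir_lookups = len(components) - 1  # inode/dirent fetch per directory level
    inode_read = 1                     # the photo file's own metadata
    data_read = 1                      # the photo contents
    return dir_lookups + inode_read + data_read

# A nested layout costs several seeks per photo; Haystack's target is 1.
print(cold_read_disk_ops("/photos/dir1/usr1/photo.jpg"))  # -> 5
```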

The New Way: Shrink Metadata and Eliminate Directories
Objective: one disk operation per read.
Minimal navigation cost and minimal metadata cost: all metadata is small enough to be cached in RAM, so the only disk access a read needs is for the photo data itself.
[Diagram: metadata index held in RAM pointing directly at photo data on disk]
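A minimal sketch of the idea (the names and index layout here are illustrative, not the paper's exact needle format): keep a per-photo record of (volume, offset, size) in RAM, so a read is one dictionary lookup plus a single positioned disk read.

```python
import os
from dataclasses import dataclass

@dataclass
class NeedleLocation:
    volume_fd: int  # open descriptor of the large volume file
    offset: int     # byte offset of the photo inside the volume
    size: int       # photo length in bytes

index: dict[int, NeedleLocation] = {}  # the whole index lives in RAM

def read_photo(photo_id: int) -> bytes:
    loc = index[photo_id]  # memory lookup, no disk I/O
    return os.pread(loc.volume_fd, loc.size, loc.offset)  # exactly one disk op
```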

Keeping All Metadata in Main Memory
Yes, it is possible to cache metadata for only the most popular files, which works fine for most systems. BUT Facebook sees a large number of requests for less popular or older content, known as the long tail. SO long-tail requests will probably miss both the CDN and a partial RAM cache, and there is no real gain in performance.
Q: "We accomplish this by keeping all metadata in main memory." Why did keeping metadata in memory become a challenge in Facebook's system? Is it possible to keep only the metadata of the most popular files in memory and still achieve the objective (at most one disk operation per read) by exploiting access locality?
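A toy simulation of the long-tail effect (made-up parameters, not Facebook traffic): even with metadata for the hottest 10% of photos cached, a heavy-tailed popularity distribution still sends a large share of requests past the cache.

```python
import random

random.seed(0)
N = 100_000            # photos
CACHED = N // 10       # metadata cached for the hottest 10% only

# Zipf-like popularity: photo i is requested with weight 1/(i+1).
weights = [1.0 / (i + 1) for i in range(N)]
requests = random.choices(range(N), weights=weights, k=200_000)

misses = sum(1 for r in requests if r >= CACHED)
print(f"miss rate with top-10% cached: {misses / len(requests):.0%}")  # ~20%
# Each miss pays a disk op for metadata *and* one for data, defeating the
# one-disk-operation goal; hence keeping the entire index in memory.
```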

Implementation of Haystack
Old way: Browser -> CDN -> Photo Store Server -> NAS, with Web Servers handing out CDN URLs
New way: Browser -> CDN -> Haystack Cache -> Haystack Store, with Web Servers consulting the Haystack Directory to construct photo URLs
[Diagram: the NAS-based serving path next to the Haystack serving path]
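A sketch of the Directory's role on the new path (the URL shape follows the paper's http://(CDN)/(Cache)/(Machine id)/(Logical volume, Photo) layout; the hostnames and field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class PhotoLocation:
    machine_id: int      # Store machine holding a replica
    logical_volume: int  # logical volume the photo lives on
    photo_id: int

def build_photo_url(loc: PhotoLocation, use_cdn: bool) -> str:
    # Each tier strips its own prefix and forwards the rest: the CDN
    # forwards misses to the Cache, and the Cache forwards to the Store.
    path = f"{loc.machine_id}/{loc.logical_volume}/{loc.photo_id}"
    if use_cdn:
        return f"http://cdn.example.com/cache.example.com/{path}"
    return f"http://cache.example.com/{path}"  # Directory bypasses the CDN
```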

Goal #1: High Throughput and Low Latency
Photos must be served quickly for a good user experience, and users should not be able to sense different performance for old and new photos.
Goal of one disk operation per read: possible because metadata is drastically reduced and kept entirely in memory.
Requests that exceed capacity can't be ignored and must be handled by a CDN, which is expensive and limited by diminishing returns.

Goal #2: Fault Tolerance
Large-scale systems have failures every day; as in GFS, failures are the norm.
Users must have 24/7 availability of photos regardless of failures, made possible through replication of data in geographically distinct locations.

Goal #3: Cost-Effectiveness
Higher performance and lower cost than the NFS/POSIX approach, quantified by two metrics:
Cost per terabyte of usable storage (~28% less)
Normalized read rate per terabyte of usable storage (~4x the read rate)
Taken together, that is roughly 4 / 0.72 ≈ 5.6x the read throughput per dollar.

Goal #4: Simplicity
Easy to implement (deployable in months instead of years) and easy to maintain.

Deploying Haystack Quickly
A Haystack Store volume is one large file that occupies an entire physical volume, so the structure can be implemented on top of an already existing filesystem to speed development. Haystack Stores use XFS, a robust filesystem commonly deployed on UNIX systems.
[Diagram: one large volume file sitting on the underlying filesystem, rooted at the physical volume]
Q: "That simplicity let us build and deploy a working system in a few months instead of a few years." Comment on this statement (why can Haystack be considered a simple adaptation of UNIX filesystems?)
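A minimal sketch of writing into such a volume file (a toy needle header, not the paper's exact on-disk format): photos are only ever appended, and the returned offset is what the in-memory index records.

```python
import os
import struct

def append_photo(volume_fd: int, photo_id: int, data: bytes) -> tuple[int, int]:
    """Append a photo needle to the volume file; return (data_offset, size)."""
    end = os.lseek(volume_fd, 0, os.SEEK_END)  # append only: writes stay sequential
    header = struct.pack("<QI", photo_id, len(data))  # toy header: id + length
    os.write(volume_fd, header + data)
    os.fsync(volume_fd)  # make the append durable before indexing it
    return end + len(header), len(data)  # photo bytes start after the header
```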

Why Not GFS?
Main use cases for Facebook storage: development data (GFS), log files (GFS), and photos.
GFS did not have the right RAM-to-disk ratio to store all photo metadata in memory, and long-tail access becomes a problem when only partial metadata is cached.
GFS is best suited for a small number of large files, not a huge number of very small files!
Q: "... we explored whether it would be useful to build a system similar to GFS." Comment on this statement. Why does serving photo requests in the long tail represent a problem on GFS?

Questions? Discussion?