Staggeringly Large File Systems. Presented by Haoyan Geng

Large-scale File Systems: How Large?
Google's file system in 2009 (Jeff Dean, LADIS '09):
- 200+ clusters
- Thousands of machines per cluster
- Pools of thousands of clients
- More than 4 petabytes of data
- 40 GB/s I/O load
Amazon's S3 in 2010 (Werner Vogels' blog):
- Over 100 billion objects
- Over 120,000 storage operations per second

Large can be different
- Range of the network: wide area / local area
- Organizing model: centralized / P2P / layered ...
- Environment: trusted / untrusted infrastructure
- Abundance of resources: bandwidth, storage space, ...
- The goals: availability, reliability, scalability, ...

Large can be different

                  GFS               OceanStore
Infrastructure    Datacenter        Wide-area
Organizing model  Centralized       Fully distributed
Target users      Google            Anyone
Environment       Trusted           Untrusted
Availability      High              High
Reliability       High              High
Recovery          Self-maintaining  Self-maintaining

Google File System: The Authors
- Sanjay Ghemawat: Google Fellow; worked on GFS, MapReduce, BigTable, ...; PhD from MIT (Barbara Liskov)
- Howard Gobioff: PhD from CMU (Garth Gibson)
- Shun-Tak Leung: PhD from UW

Motivation
A file system for Google's infrastructure:
- Clusters of commodity PCs
- Component failures are common: program bugs (system/apps), storage failures (memory/disk), network failures, power outages, other human errors, ...
- Trusted environment: malicious users? Not a big problem...
- Files are huge
- Most mutations are appends

Design Choices
- Fault tolerant... but no need for BFT
- Fast failure recovery
- Relaxed consistency guarantee
- Selective optimization (block size, append, ...)
  - "Small files must be supported, but we need not optimize for them."
  - "Small writes at arbitrary positions in a file are supported but do not have to be efficient."
- Co-design with applications

Architecture (diagram slide)

Architecture: Interface, Chunkserver, Master, Mutation

Interface
- Standard operations: Create, Delete, Open, Close, Read, Write
- Record append
- Snapshot

Chunkserver
- Runs on top of the Linux file system
- Basic unit: chunks of 64 MB each
  - A file in GFS consists of several chunks; each chunk is a Linux file (eliminates fragmentation problems; offset-to-chunk mapping sketched below)
- Replicated for reliability (3 replicas by default)
- The primary replica serializes mutations
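
A minimal sketch (Python) of how a client maps a byte offset in a file to a chunk, given the 64 MB chunk size; the `locate` helper is invented here for illustration, and the client would then ask the master for that chunk's handle and replica locations:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Translate a byte offset within a file into (chunk index, offset inside that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 falls in the third chunk (index 2).
print(locate(200_000_000))  # (2, 65782272)
```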

Master
- Controls system-wide activities: leases, replica placement, garbage collection, ...
- Stores all metadata in memory, about 64 bytes per chunk (sketched below):
  - File/chunk namespaces
  - File-to-chunk mapping
  - Replica locations
- The good part: simple, fast
- Not so good?
  - Throughput
  - Single point of failure
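
The master's in-memory metadata can be pictured roughly as below. This is an illustrative Python sketch, not Google's actual data structures; all names (ChunkInfo, MasterState, lookup, ...) are invented:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                      # globally unique chunk handle
    version: int = 0                 # used to detect stale replicas
    replicas: list[str] = field(default_factory=list)  # chunkserver addresses
    primary: str | None = None       # holder of the current lease, if any

@dataclass
class MasterState:
    namespace: dict[str, list[int]] = field(default_factory=dict)  # file path -> chunk handles
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)     # handle -> chunk info

    def lookup(self, path: str, chunk_index: int) -> ChunkInfo:
        """Resolve (file, chunk index) to handle + replica locations for a client."""
        handle = self.namespace[path][chunk_index]
        return self.chunks[handle]
```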

More on Master: Optimizations
- Throughput:
  - Minimize operations at the master
  - Fast recovery using logs & checkpoints
- Single point of failure:
  - Replicates itself
  - Shadow masters

Mutation (diagram slides)

A Few Issues: Consistency Guarantee, Data Integrity, Replica Placement, File Deletion

Consistency Guarantee
- Consistent: all clients will always see the same data
- Defined: consistent, and reflecting the mutation in its entirety
- Semantics of (concurrent) Write / Append:
  - Write: consistent but undefined
  - Append: defined, interspersed with inconsistent regions
  - Appends are applied atomically, at least once (see the reader sketch below)
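
At-least-once appends mean a reader can see duplicates and padding. A hedged sketch of the client-side pattern that the co-design with applications suggests; the record format with an ID and checksum is invented here, not prescribed by GFS:

```python
import hashlib

def make_record(record_id: str, payload: bytes) -> bytes:
    """Writer side: make records self-identifying with an ID and a checksum."""
    digest = hashlib.sha1(payload).hexdigest()
    return f"{record_id}|{digest}|".encode() + payload

def read_valid_records(raw_records: list[bytes]):
    """Reader side: yield each logical record once, skipping duplicates
    (from retried appends) and anything that fails verification
    (padding or inconsistent regions)."""
    seen = set()
    for rec in raw_records:
        try:
            record_id, digest, payload = rec.split(b"|", 2)
        except ValueError:
            continue                       # padding or a torn record
        if record_id in seen:
            continue                       # duplicate from a retried append
        if hashlib.sha1(payload).hexdigest().encode() != digest:
            continue                       # corrupt or inconsistent data
        seen.add(record_id)
        yield record_id.decode(), payload
```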

Data Integrity
- Checksums: 32 bits for each 64 KB block
- Each chunkserver verifies its own blocks independently; no chunk-wise comparison across replicas
  - Recall the semantics of Append: replicas of a chunk are not guaranteed to be byte-wise identical
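
A small sketch of per-block checksumming with the parameters on this slide; CRC32 is used here as a stand-in, since the slide does not name the checksum function:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # data is checksummed in 64 KB blocks

def block_checksums(chunk_data: bytes) -> list[int]:
    """Compute a 32-bit checksum for each 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_block(chunk_data: bytes, index: int, stored: int) -> bool:
    """Verify one block before returning it to a reader; a mismatch means corruption."""
    block = chunk_data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    return zlib.crc32(block) == stored
```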

Replica Placement
- Prefer chunkservers with below-average utilization (load balancing)
- Spread recent creations across servers (avoid hot spots)
- Place replicas across racks (reliability, bandwidth utilization; see the sketch below)
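
A hypothetical placement sketch combining the heuristics above; the server representation and field names are invented, and the real policy has more nuance (e.g., limits on recent creations per server):

```python
import random
from collections import defaultdict

def place_replicas(servers: list[dict], n: int = 3) -> list[str]:
    """Pick n chunkservers on distinct racks, preferring below-average utilization.
    servers: e.g. {"addr": "cs17:7000", "rack": "r3", "utilization": 0.42}
    Assumes at least n racks; returns fewer addresses otherwise."""
    avg = sum(s["utilization"] for s in servers) / len(servers)
    preferred = [s for s in servers if s["utilization"] <= avg] or servers
    by_rack = defaultdict(list)
    for s in preferred:
        by_rack[s["rack"]].append(s)
    racks = random.sample(list(by_rack), k=min(n, len(by_rack)))
    return [random.choice(by_rack[r])["addr"] for r in racks]
```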

File Deletion
- Rename to a hidden file upon deletion
- Permanently removed after 3 days
- Garbage collection: scan periodically to remove orphaned chunks (sketch below)
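
A minimal sketch of this lazy-deletion scheme; the hidden-name marker, dictionary layout, and helper names are all invented for illustration:

```python
import time

HIDDEN_PREFIX = ".deleted."   # invented marker for hidden (deleted) files
GRACE_PERIOD = 3 * 24 * 3600  # 3 days before permanent removal

def delete(namespace, deleted_at, path):
    """Lazy delete: rename the file to a hidden name and remember when."""
    hidden = HIDDEN_PREFIX + path
    namespace[hidden] = namespace.pop(path)   # namespace: path -> chunk handles
    deleted_at[hidden] = time.time()

def garbage_collect(namespace, deleted_at, chunk_table, now=None):
    """Periodic scan: drop hidden files past the grace period, then remove
    chunks no longer referenced by any file (orphans)."""
    now = now or time.time()
    for hidden in [p for p, t in deleted_at.items() if now - t > GRACE_PERIOD]:
        del namespace[hidden]
        del deleted_at[hidden]
    live = {h for handles in namespace.values() for h in handles}
    for handle in list(chunk_table):
        if handle not in live:
            del chunk_table[handle]
```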

Performance
- Micro-benchmark
- Google's working environment

Performance: Micro-benchmark
- 16 chunkservers, up to 16 clients
- Links: 100 Mbps host-to-switch, 1 Gbps between switches

Performance: Micro-benchmark (result graphs)

Performance: Clusters
- 300~500 operations per second on the master
- 300~500 MB/s read, 100 MB/s write

Discussion
- A big engineering success
- Anything new?
- Highly specialized design
- Comparison with others?
Questions?

Pond: the OceanStore Prototype
The Authors
- Sean Rhea: PhD student, now at Meraki
- Patrick Eaton: PhD student, now at EMC
- Dennis Geels: PhD student, now at Google
- Hakim Weatherspoon: PhD student, now... here!
- Ben Zhao: PhD student, now at UCSB
- John Kubiatowicz: the professor; PhD from MIT (Anant Agarwal)

OceanStore's Vision
- Universal availability
- Transparency
- High durability
- Decentralized storage
- Self-maintainable
- Strong consistency
- Strong security
- Untrusted infrastructure
"As a rough estimate, we imagine providing service to roughly 10^10 users, each with at least 10,000 files. OceanStore must therefore support over 10^14 files." -- The OceanStore paper (ASPLOS '00)

System Overview

Inside the System
- Data object: a file in Pond
- Tapestry: distributed storage
- Primary replica: the inner ring
- Archival storage: erasure codes
- Secondary replicas: caching

Data Object
- A B-tree of read-only blocks
- Every version is kept forever
- Copy on write (sketched below)
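
A toy copy-on-write sketch in the spirit of this slide: blocks are immutable and named by a hash of their content, so an update creates new blocks along the path to the root while old versions stay readable. The flat dictionary store and the two-level "tree" are stand-ins for Pond's B-tree, and the helper names are invented:

```python
import hashlib

def put(store: dict, data: bytes) -> str:
    """Store an immutable block under a content-derived name."""
    bguid = hashlib.sha1(data).hexdigest()
    store[bguid] = data
    return bguid

def make_root(store, leaf_ids):
    """A 'root' block is just the list of its children's names."""
    return put(store, "\n".join(leaf_ids).encode())

def update_leaf(store, root_id, index, new_data):
    """Return the root of a new version; the old root still resolves to the old data."""
    leaf_ids = store[root_id].decode().split("\n")
    leaf_ids[index] = put(store, new_data)
    return make_root(store, leaf_ids)

store = {}
v1 = make_root(store, [put(store, b"hello"), put(store, b"world")])
v2 = update_leaf(store, v1, 1, b"pond")
# Both versions remain readable: v1 -> "hello world", v2 -> "hello pond".
```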

Tapestry
DHTs & storage systems:
- Chord -> Cooperative File System (MIT)
- Pastry -> PAST (Microsoft & Rice)
- Tapestry -> OceanStore (Berkeley)
Distributed Object Location and Routing:
- Uses GUIDs to locate objects
- Self-organizing
- Tapestry's locality: locates the closest replica
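
A rough sketch of GUID-based location in this style of system; naming blocks by a secure hash and picking a "closest ID" node are illustrative simplifications (real Tapestry uses prefix-based routing, omitted here):

```python
import hashlib

def block_guid(content: bytes) -> int:
    """GUID of a self-verifying block, derived from a hash of its content (160-bit SHA-1 here)."""
    return int.from_bytes(hashlib.sha1(content).digest(), "big")

def responsible_node(guid: int, node_ids: list[int]) -> int:
    """Toy root-node choice: the node whose ID is numerically closest to the GUID."""
    return min(node_ids, key=lambda n: abs(n - guid))
```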

Primary Replica
- Serializes & applies changes
- A set of servers: the inner ring
  - Assigned by the responsible party
- Byzantine-fault-tolerant protocol (worked numbers below)
  - Tolerates at most f faulty nodes out of 3f+1
  - MACs within the inner ring, public-key cryptography outside
  - Proactive threshold signatures
- How to durably store the order of updates?
  - Antiquity: a log-based method (EuroSys '07)
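
Worked numbers for the 3f+1 bound quoted above; the helper is plain arithmetic, not part of Pond:

```python
def max_faults(n: int) -> int:
    """With n inner-ring servers, a BFT protocol of the 3f+1 family tolerates
    at most f = (n - 1) // 3 Byzantine (arbitrarily faulty) servers."""
    return (n - 1) // 3

for n in (4, 7, 10):
    print(f"n = {n:2d} servers -> tolerates f = {max_faults(n)} faulty")
# n = 4 -> f = 1, n = 7 -> f = 2, n = 10 -> f = 3
```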

Archival Storage
Erasure codes:
- Divide a block into m fragments
- Encode into n (n > m) fragments
- Recover the block from any m fragments
- Fragments are distributed uniformly and deterministically (n = 32, m = 16 in Pond)
Distributes data across machines
Efficiency: erasure code vs. replication (compared below)
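
A back-of-the-envelope comparison using Pond's parameters (n = 32, m = 16): the storage overhead is n/m and the code survives the loss of any n - m fragments:

```python
def erasure_overhead(n: int, m: int) -> float:
    return n / m                # bytes stored per byte of user data

def erasure_loss_tolerance(n: int, m: int) -> int:
    return n - m                # fragments that may be lost before data is unrecoverable

n, m = 32, 16
print(f"erasure code (n={n}, m={m}): {erasure_overhead(n, m)}x storage, "
      f"survives {erasure_loss_tolerance(n, m)} lost fragments")
print("3x replication:             3.0x storage, survives 2 lost copies")
# Higher durability per stored byte, at the cost of reconstruction work --
# which is why Pond adds whole-block caching (next slide).
```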

Secondary Replicas (Caching)
- Reconstruction from fragments is expensive
- An alternative: whole-block caching
- Greedy local caching: LRU in Pond (sketch below)
- Contact the primary for the latest version
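
A minimal LRU block cache of the kind a secondary replica could keep for whole blocks; the capacity, class name, and interface are illustrative:

```python
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: OrderedDict[str, bytes] = OrderedDict()

    def get(self, bguid: str) -> bytes | None:
        if bguid not in self.blocks:
            return None                      # miss: reconstruct from fragments or a peer
        self.blocks.move_to_end(bguid)       # mark as most recently used
        return self.blocks[bguid]

    def put(self, bguid: str, data: bytes) -> None:
        self.blocks[bguid] = data
        self.blocks.move_to_end(bguid)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
```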

Handling Updates
- The client sends its update to the primary tier and to random secondary replicas

Handling Updates
- Primary tier: Byzantine agreement
- Secondary replicas: epidemic propagation (gossiping)

Handling Updates
- Multicast from the primary

Implementation
- Based on SEDA (staged event-driven architecture)
- Implemented in Java
- Event-driven; each subsystem is a stage

Implementation
"The current code base of Pond contains approximately 50,000 semicolons and is the work of five core graduate student developers and as many undergraduate interns."

Performance: Archival Storage

Performance: Latency Breakdown

Performance: Andrew Benchmark
Five phases:
1. Create target directory (write)
2. Copy files (write)
3. Examine file status (read)
4. Examine file data (read)
5. Compile & link the files (read + write)

Performance: Andrew Benchmark (results table; times are in seconds)

Discussion
- OceanStore: a great vision
- Pond: a working subset of that vision
- Cool functionality, but some of it is expensive
- Performance upon failures?
Questions?